1. Introduction
Reading and learning from text are critical skills for learners to acquire new knowledge, which is essential for educational and career success. To comprehend a text, the reader constructs a mental model of it while reading. This mental model can be represented at three levels: (1) surface-level knowledge of the exact words in the text, (2) the textbase-level semantic representation of ideas, and (3) the situation model, which combines the textbase with the reader's prior knowledge. The ability to leverage strategies that support comprehension is a critical skill that readers need in the absence of the prior knowledge necessary to develop a coherent situation model. Proficient readers are more likely than less-skilled readers to spontaneously employ strategies while reading to help them comprehend difficult texts [1]. Fortunately, students can learn when and how to implement these reading comprehension strategies through direct instruction and deliberate practice. One such strategy, with considerable evidence supporting its use by students with limited prior knowledge or lower reading skills, is self-explanation.
Self-Explanation (SE) is the practice of explaining the meaning of portions of a text to one's self while reading. Engaging in self-explanation encourages students to generate bridging inferences, in which they connect sentences or idea units across text sections or texts. Similarly, students may generate elaborative self-explanations, in which they connect their prior knowledge to new information they read in the text. Generating bridging and elaborative self-explanations supports readers' inference making, which, in turn, supports the development of their mental representation of the text.
Developed by McNamara [2], Self-Explanation Reading Training (SERT) teaches readers strategies to enhance text comprehension. The training guides students through each strategy in increasing order of difficulty, starting with comprehension monitoring. Comprehension monitoring aims to help students understand when they need to implement the remaining strategies to support their comprehension. This work focused on the three remaining strategies: paraphrasing, bridging inference, and elaboration.
The paraphrasing strategy refers to reformulating a sequence of text in one's own words. SERT can help develop readers' text comprehension skills by prompting them to access their vocabulary and translate the ideas into more familiar language. Bridging involves linking multiple ideas across a text or across multiple texts (e.g., two different articles about the same topic). Generating bridging inferences requires the reader to find connections between the ideas and to structure them in a coherent way. Elaboration involves linking information in the text with the reader's knowledge base, which helps the reader integrate new information with existing knowledge. Collectively, these strategies support readers' construction of more coherent mental representations of the text, particularly for challenging texts that require substantial prior knowledge to understand.
Considerable evidence indicates that these strategies support readers' comprehension of complex texts. However, additional benefits can be realized when readers receive feedback about the accuracy or quality of their self-explanations [3]. One way readers can receive feedback is from instructors who review and score self-explanations based on a rubric [4]. This method is time-consuming and does not provide readers with feedback in real time. To alleviate this challenge, students can practice reading and self-explaining with an intelligent tutoring system, where they not only have the opportunity to engage in the deliberate practice of reading and self-explaining but also receive essential guiding feedback [5]. Thus, refining and improving software applications that can detect the presence of these strategies in readers' constructed responses can be helpful for both evaluation and training. Natural Language Processing (NLP) [6] and Machine Learning techniques can be used to develop such models, given a large enough dataset containing labeled examples of the presence and absence of these strategies in readers' self-explanations. Previous work [7] has shown that such automated models can be built to reliably assess self-explanation reading strategies. The recent release of more sophisticated and readily accessible large language models further supports the expansion of this prior work.
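To make this framing concrete, the sketch below casts strategy detection as supervised text classification over labeled self-explanations. It is a minimal illustration under loose assumptions: the inline examples, the binary paraphrasing labels, and the TF-IDF baseline are hypothetical stand-ins, not the models developed in this work.

```python
# Minimal sketch: strategy detection as supervised text classification.
# The inline dataset and binary paraphrasing labels are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

self_explanations = [
    "So basically the heart pushes blood through the arteries.",
    "This reminds me of how pumps move water in plumbing systems.",
    "The text is saying that the heart pumps blood around the body.",
    "That connects to what the first paragraph said about circulation.",
]
paraphrasing_present = [1, 0, 1, 0]  # 1 = paraphrasing detected, 0 = absent

# A simple bag-of-words classifier stands in for the transformer models
# discussed later; the pipeline shape (text -> features -> label) is the same.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(self_explanations, paraphrasing_present)
print(clf.predict(["In other words, the valve keeps blood from flowing back."]))
```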
4. Discussion
This study evaluated the performance of LLMs at scoring self-explanations across multiple comprehension strategies, in both out-of-the-box and fine-tuned setups. In the out-of-the-box scenario, the performance of the FLAN-T5 models was compared with that of the GPT3.5-turbo API. The FLAN-T5 models obtained better results on three comprehension strategy tasks, although their performance did not scale with model size or with the number of examples listed in the prompt. The GPT3.5-turbo model obtained better results on the overall quality task and showed a clearer improvement on the other tasks as more examples were added to the prompt.
When analyzing the correctness of the responses generated by the LLMs, it was also observed that GPT3.5-turbo and FLAN-T5 Large were the most likely to generate answers in the correct format. This capability improved for all models when more examples were provided in the prompt.
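For illustration, a condensed sketch of this out-of-the-box setup follows: a few-shot prompt is sent to an instruction-tuned model, and the reply counts as correctly formatted only if it parses as a bare score in the expected range. The prompt wording, score scale, and checkpoint choice are assumptions for the sketch, not the exact prompt used in the study.

```python
# Hedged sketch: few-shot scoring with FLAN-T5 plus an output-format check.
import re
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

prompt = (
    "Rate the paraphrasing in the self-explanation on a scale from 0 to 2.\n"
    "Text: Plants make food using sunlight. "
    "Self-explanation: Plants use light to produce food. Score: 2\n"  # few-shot example
    "Text: The heart pumps blood. "
    "Self-explanation: Blood is pushed around by the heart. Score:"
)
inputs = tokenizer(prompt, return_tensors="pt")
reply = tokenizer.decode(
    model.generate(**inputs, max_new_tokens=4)[0], skip_special_tokens=True
)

# Accept the answer only if it is a bare in-range score.
match = re.fullmatch(r"\s*([0-2])\s*", reply)
score = int(match.group(1)) if match else None  # None marks a malformed reply
print(score)
```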
When looking at the confusion matrix for the overall task, the two best-performing out-of-the-box models tended to misclassify multiple examples, not only into adjacent classes but into more distant classes as well (see Table 8). Numerous Class 0 examples were classified as Class 3 and vice versa, indicating that the models could not reliably identify content that had been copied and pasted. Even more high-quality examples (Class 3) were labeled as low-quality (Class 0). This could indicate that the models had not completely understood the task. They might be solving a proxy task, such as paraphrase assessment, which yields similar scores in some cases and diverging scores in others. For instance, a good self-explanation might contain relevant paraphrases; however, good self-explanations should target information beyond the source text. In addition, the predictions could also be influenced by the high class imbalance (i.e., Class 0 had almost nine times fewer examples than Class 2 for self-explanation quality).
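This kind of error analysis can be reproduced in a few lines; the sketch below builds the four-class confusion matrix and reads off the extreme confusions. The gold and predicted labels here are mock values, not the study's data.

```python
# Mock-data sketch of the confusion-matrix analysis for the overall task.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2, 2, 3, 3, 3]  # hypothetical gold quality scores
y_pred = [3, 0, 1, 2, 2, 1, 2, 0, 3, 3]  # hypothetical model predictions

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])
print(cm)                                      # rows = gold, columns = predicted
print("Class 0 scored as Class 3:", cm[0, 3])  # copy-paste rated as high quality
print("Class 3 scored as Class 0:", cm[3, 0])  # high quality rated as copy-paste
```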
In the fine-tuning scenario, only the FLAN-T5 models were targeted. Initially, the models were fine-tuned for one epoch using the LoRA method. After this fine-tuning, performance improved drastically and scaled better with model size and the number of examples provided. When the models were trained for three epochs, the differences between the FLAN-T5 XL and XXL models decreased.
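As a rough illustration of this setup, the sketch below attaches LoRA adapters to a FLAN-T5 checkpoint with the Hugging Face PEFT library. The rank, scaling factor, and target modules shown are common defaults assumed here, not necessarily the study's exact configuration.

```python
# Assumed-hyperparameter sketch of preparing FLAN-T5 for LoRA fine-tuning.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")
lora = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,  # sequence-to-sequence objective
    r=8,                              # adapter rank (assumption)
    lora_alpha=32,                    # adapter scaling factor (assumption)
    lora_dropout=0.05,
    target_modules=["q", "v"],        # T5 attention query/value projections
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()    # only the small adapter matrices are trained
```

Because only the adapter weights receive gradients, each epoch updates a small fraction of the parameters, which is what makes fine-tuning the XL and XXL variants tractable.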
The confusion matrix generated for the best-performing fine-tuned model on the overall task showed improved results compared to the out-of-the-box models (see Table 9). Most predictions coincided with the ground truth for almost all classes, except the underrepresented Class 0. Furthermore, even when errors occurred, they appeared near the correct options; only four instances were off by three classes (i.e., Class 0 examples evaluated as Class 3). Conversely, no high-quality sample (i.e., Class 3) was labeled as a poor-quality self-explanation (i.e., Class 0).
The FLAN fine-tuned models and the previous MTL approach can also be compared with regard to training time, as reported in Table 10. The MTL model required the least training time while using less-performant hardware. For the FLAN models, the training time listed was for one epoch, so the three-epoch fine-tuned model would take roughly three times as long to train. Our best-performing, three-epoch-trained FLAN XXL surpassed the previous MTL model's performance, but it required 540 min to train (a 27× increase) and more expensive hardware.
Author Contributions
Conceptualization, M.D. and D.S.M.; methodology, B.N., M.D. and D.S.M.; software, B.N.; validation, B.N., M.D. and D.S.M.; formal analysis, M.D.; investigation, D.S.M.; resources, R.B. and T.A.; data curation, R.B. and T.A.; writing—original draft preparation, B.N.; writing—review and editing, M.D., R.B., T.A. and D.S.M.; visualization, B.N.; supervision, M.D. and D.S.M.; project administration, M.D. and D.S.M.; funding acquisition, M.D. and D.S.M. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the Ministry of Research, Innovation, and Digitalization, project CloudPrecis, Contract Number 344/390020/06.09.2021, MySMIS code: 124812, within POC; the Ministry of European Investments and Projects, POCU 2014-2020 project, Contract Number 62461/03.06.2022, MySMIS code: 153735; the IES U.S. Department of Education (R305A130124, R305A190063); the Office of Naval Research (N00014-20-1-2623); and the National Science Foundation (NSF REC0241144; IIS-0735682). The opinions expressed are those of the authors and do not represent the views of the Institute, the U.S. Department of Education, ONR, or NSF.
Institutional Review Board Statement
The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board of Arizona State University.
Informed Consent Statement
Informed consent was obtained from all subjects involved in the study.
Data Availability Statement
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
API | Application Programming Interface
BERT | Bidirectional Encoder Representations from Transformers
CoT | Chain-of-Thought
LLM | Large Language Model
LoRA | Low-Rank Adaptation
NLP | Natural Language Processing
SE | Self-Explanation
SERT | Self-Explanation Reading Training
STEM | Science, Technology, Engineering, and Mathematics
References
- McNamara, D.S. Self-explanation and reading strategy training (SERT) improves low-knowledge students’ science course performance. Discourse Process. 2017, 54, 479–492. [Google Scholar] [CrossRef]
- McNamara, D.S. SERT: Self-explanation reading training. Discourse Process. 2004, 38, 1–30. [Google Scholar] [CrossRef]
- Anders Ericsson, K. Deliberate practice and acquisition of expert performance: A general overview. Acad. Emerg. Med. 2008, 15, 988–994. [Google Scholar] [CrossRef]
- McNamara, D.S.; Arner, T.; Butterfuss, R.; Fang, Y.; Watanabe, M.; Newton, N.; McCarthy, K.S.; Allen, L.K.; Roscoe, R.D. iSTART: Adaptive Comprehension Strategy Training and Stealth Literacy Assessment. Int. J. Hum.-Comput. Interact. 2023, 39, 2239–2252. [Google Scholar] [CrossRef]
- McNamara, D.S.; O’Reilly, T.; Rowe, M.; Boonthum, C.; Levinstein, I.B. iSTART: A web-based tutor that teaches self-explanation and metacognitive reading strategies. In Reading Comprehension Strategies: Theories, Interventions, and Technologies; Lawrence Erlbaum Associates Publishers: Mahwah, NJ, USA, 2007; pp. 397–421. [Google Scholar]
- Jurafsky, D.; Martin, J.H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition; Pearson Prentice Hall: Upper Saddle River, NJ, USA, 2009. [Google Scholar]
- Nicula, B.; Panaite, M.; Arner, T.; Balyan, R.; Dascalu, M.; McNamara, D.S. Automated Assessment of Comprehension Strategies from Self-explanations Using Transformers and Multi-task Learning. In Proceedings of the International Conference on Artificial Intelligence in Education, Tokyo, Japan, 3–7 July 2023; Springer: Cham, Switzerland, 2023; pp. 695–700. [Google Scholar] [CrossRef]
- OpenAI. Introducing ChatGPT. 2022. Available online: https://openai.com/blog/chatgpt (accessed on 5 October 2023).
- Pichai, S. An Important Next Step on Our AI Journey. 2023. Available online: https://blog.google/technology/ai/bard-google-ai-search-updates/ (accessed on 5 October 2023).
- Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. Emergent Abilities of Large Language Models. arXiv 2022, arXiv:2206.07682. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 5998–6008. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the NAACL, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
- OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
- Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, E.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling Instruction-Finetuned Language Models. arXiv 2022, arXiv:2210.11416. [Google Scholar] [CrossRef]
- Chiesurin, S.; Dimakopoulos, D.; Cabezudo, M.A.S.; Eshghi, A.; Papaioannou, I.; Rieser, V.; Konstas, I. The Dangers of trusting Stochastic Parrots: Faithfulness and Trust in Open-domain Conversational Question Answering. arXiv 2023, arXiv:2305.16519. [Google Scholar]
- Perez, F.; Ribeiro, I. Ignore Previous Prompt: Attack Techniques For Language Models. arXiv 2022, arXiv:2211.09527. [Google Scholar]
- Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. arXiv 2019, arXiv:1804.07461. [Google Scholar]
- Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. arXiv 2020, arXiv:1905.00537. [Google Scholar]
- Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling Language Modeling with Pathways. arXiv 2022, arXiv:2204.02311. [Google Scholar]
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding with Unsupervised Learning. OpenAI Blog 2018, 1, 8. [Google Scholar]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Christiano, P.F.; Leike, J.; Brown, T.; Martic, M.; Legg, S.; Amodei, D. Deep Reinforcement Learning from Human Preferences. In Proceedings of the Advances in Neural Information Processing Systems, 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
- McNamara, D.S.; Newton, N.; Christhilf, K.; McCarthy, K.S.; Magliano, J.P.; Allen, L.K. Anchoring your bridge: The importance of paraphrasing to inference making in self-explanations. Discourse Process. 2023, 60, 337–362. [Google Scholar] [CrossRef]
- Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pretrain, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. arXiv 2021, arXiv:2107.13586. [Google Scholar]
- OpenAI. Chat Completions API Documentation. 2023. Available online: https://platform.openai.com/docs/guides/gpt/chat-completions-api (accessed on 10 August 2023).
- Liu, X.; Zheng, Y.; Du, Z.; Ding, M.; Qian, Y.; Yang, Z.; Tang, J. GPT Understands, Too. arXiv 2021, arXiv:2103.10385. [Google Scholar] [CrossRef]
- Li, X.L.; Liang, P. Prefix-Tuning: Optimizing Continuous Prompts for Generation. arXiv 2021, arXiv:2101.00190. [Google Scholar]
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
- Smith, S.L.; Kindermans, P.; Le, Q.V. Don’t Decay the Learning Rate, Increase the Batch Size. arXiv 2017, arXiv:1711.00489. [Google Scholar]
- He, Y.; Zhang, X.; Sun, J. Channel Pruning for Accelerating Very Deep Neural Networks. arXiv 2017, arXiv:1707.06168. [Google Scholar]
- Amelio, A.; Bonifazi, G.; Cauteruccio, F.; Corradini, E.; Marchetti, M.; Ursino, D.; Virgili, L. Representation and compression of Residual Neural Networks through a multilayer network based approach. Expert Syst. Appl. 2023, 215, 119391. [Google Scholar] [CrossRef]
- Huang, K.; Ni, B.; Yang, X. Efficient quantization for neural networks with binary weights and low bitwidth activations. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 3854–3861. [Google Scholar]
- Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. J. Mach. Learn. Res. 2017, 18, 6869–6898. [Google Scholar]
- Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv 2015, arXiv:1510.00149. [Google Scholar]
- Ma, X.; Fang, G.; Wang, X. LLM-Pruner: On the Structural Pruning of Large Language Models. arXiv 2023, arXiv:2305.11627. [Google Scholar]