MIFM: Multimodal Information Fusion Model for Educational Exercises
Abstract
1. Introduction
- A multimodal fusion-based exercise characterization method is proposed that vectorizes exercises containing heterogeneous data and applies the generated multimodal vectors to downstream tasks.
- A dual-stream architecture is proposed for extracting features from the heterogeneous data, and cross-modal attention is employed to fuse the resulting heterogeneous features (a minimal sketch follows this list).
- Experiments on datasets collected in real educational settings, covering three distinct educational tasks, demonstrate that the model effectively improves performance on these tasks.
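To make the second contribution concrete, below is a minimal PyTorch sketch of cross-modal attention fusion. The class name, dimensions, and the residual-plus-LayerNorm arrangement are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative cross-modal attention: one modality queries the other."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # Queries come from the text stream, keys/values from the image stream,
        # so each text token gathers the image evidence most relevant to it.
        fused, _ = self.attn(text_feats, image_feats, image_feats)
        # A residual connection keeps the original text signal alongside the
        # attended image information.
        return self.norm(text_feats + fused)
```

Given batch-first tensors `text_feats` of shape (B, L_t, dim) and `image_feats` of shape (B, L_v, dim), the module returns fused features of shape (B, L_t, dim); swapping the argument order gives the symmetric image-to-text direction of a dual-stream design.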
2. Related Works
2.1. Unimodal Characterization Method
2.2. Multimodal Characterization Method
2.3. Exercise Characterization Method
3. Proposed Method
3.1. Overview of the MIFM
Algorithm 1. Pseudocode for MIFM.
Input: Exercise.
Output: Exercise characterization vector.
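Since only the input/output signature of Algorithm 1 survives here, the following is a hypothetical end-to-end skeleton showing how the dual-stream encoders (Sections 3.2–3.4) and the fusion layer (Section 3.5) could be wired together. All module choices, names, and dimensions are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MIFMSketch(nn.Module):
    """Hypothetical skeleton of Algorithm 1: encode each modality of an
    exercise, fuse via cross-modal attention, and pool to one vector."""

    def __init__(self, vocab_size: int, num_concepts: int, dim: int = 256):
        super().__init__()
        # Text stream (stand-in for the text-encoding layer, Sec. 3.2).
        self.text_emb = nn.Embedding(vocab_size, dim)
        self.text_enc = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        # Image stream (stand-in for the image-encoding layer, Sec. 3.3).
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(start_dim=2))  # -> (B, dim, 16)
        # Knowledge concept embedding (stand-in for Sec. 3.4).
        self.kc_emb = nn.Embedding(num_concepts, dim)
        # Cross-modal attention fusion (stand-in for Sec. 3.5).
        self.fusion = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, tokens, image, concepts):
        t, _ = self.text_enc(self.text_emb(tokens))   # (B, L_t, dim)
        v = self.img_enc(image).transpose(1, 2)       # (B, 16, dim)
        k = self.kc_emb(concepts)                     # (B, L_k, dim)
        ctx = torch.cat([v, k], dim=1)                # image + concept tokens
        fused, _ = self.fusion(t, ctx, ctx)           # text attends to the rest
        return fused.mean(dim=1)                      # (B, dim) exercise vector

# Toy usage with made-up sizes.
model = MIFMSketch(vocab_size=5000, num_concepts=100)
vec = model(torch.randint(0, 5000, (2, 30)),   # token ids
            torch.randn(2, 3, 224, 224),       # exercise image
            torch.randint(0, 100, (2, 3)))     # knowledge concept ids
```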
3.2. Text-Encoding Layer
3.3. Image-Encoding Layer
3.4. Knowledge Concept Embedding Layer
- Generate a word list by counting the number of occurrences of each word in the exercise text corpus, then sort the words in the list by occurrence frequency, from highest to lowest. Let $t_i$ denote the $i$th word, $f_i$ the number of occurrences of $t_i$, and $n$ the size of the word list, i.e., the number of distinct words in the exercise text corpus.
- Set the sliding window size to $w$ and traverse all words in the corpus, recording how often each word occurs within the fixed window of $w$ words on the left side of the target word. This yields the left co-occurrence matrix $M^{L}$, where the entry $M^{L}_{ij}$ in the $i$th row and $j$th column counts the occurrences of word $t_j$ in the left window of word $t_i$ (see the sketch after this list).
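A minimal sketch of these two steps, assuming whitespace-tokenized exercise texts; the function name, the dense nested-list representation of $M^{L}$, and the default window size are illustrative choices rather than the paper's implementation.

```python
from collections import Counter

def build_left_cooccurrence(corpus_tokens, w=5):
    """Build the frequency-sorted word list and the left co-occurrence
    matrix M^L described above from tokenized exercise texts."""
    # Step 1: word list sorted by corpus frequency, highest first.
    counts = Counter(tok for text in corpus_tokens for tok in text)
    word_list = [wd for wd, _ in counts.most_common()]  # t_1 ... t_n
    index = {wd: i for i, wd in enumerate(word_list)}
    n = len(word_list)

    # Step 2: M[i][j] counts how often word t_j appears in the window of
    # w tokens immediately to the left of an occurrence of word t_i.
    M = [[0] * n for _ in range(n)]
    for text in corpus_tokens:
        for pos, target in enumerate(text):
            i = index[target]
            for left_word in text[max(0, pos - w):pos]:
                M[i][index[left_word]] += 1
    return word_list, M

# Toy usage: two short "exercise texts".
texts = [["solve", "the", "quadratic", "equation"],
         ["factor", "the", "quadratic", "expression"]]
vocab, M = build_left_cooccurrence(texts, w=2)
```

A right co-occurrence matrix follows symmetrically by scanning the window of $w$ tokens to the right of each target word.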
3.5. Modal Fusion Layer
4. Experiment
4.1. Dataset
4.2. Experimental Evaluation Metrics
4.3. Experimental Environment and Parameters
4.4. Experiment and Analysis
4.5. Time Analysis
5. Ablation Studies
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Thakur, N. A large-scale dataset of Twitter chatter about online learning during the current COVID-19 Omicron wave. Data 2022, 7, 109.
2. Boca, G.D. Factors influencing students’ behavior and attitude towards online education during COVID-19. Sustainability 2021, 13, 7469.
3. Chatterjee, A.; Gupta, U.; Chinnakotla, M.K.; Srikanth, R.; Galley, M.; Agrawal, P. Understanding emotions in text using deep learning and big data. Comput. Hum. Behav. 2019, 93, 309–317.
4. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443.
5. Qin, Z.; Zhao, Y. Correlation analysis of mathematical knowledge points based on word co-occurrence and clustering. In Proceedings of the 2020 International Conference on Computers, Information Processing and Advanced Education (CIPAE), Ottawa, ON, Canada, 16–18 October 2020; pp. 47–52.
6. Liu, Y.; Yi, X.; Chen, R.; Zhai, Z.; Gu, J. Feature extraction based on information gain and sequential pattern for English question classification. IET Softw. 2018, 12, 520–526.
7. Huo, Y.; Wong, D.F.; Ni, L.M.; Chao, L.S.; Zhang, J. Knowledge modeling via contextualized representations for LSTM-based personalized exercise recommendation. Inf. Sci. 2020, 523, 266–278.
8. Wang, L.; Sun, Y.; Zhu, Z. Knowledge points extraction of junior high school English exercises based on SVM method. In Proceedings of the 2018 2nd International Conference on E-Education, E-Business and E-Technology, Beijing, China, 5–7 July 2018; pp. 43–47.
9. Shahmirzadi, O.; Lugowski, A.; Younge, K. Text similarity in vector space models: A comparative study. In Proceedings of the 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), Boca Raton, FL, USA, 16–19 December 2019; pp. 659–666.
10. Zhang, L.; Wang, J.; Zhang, Y.; Fu, A.; Xu, D. Research on the multi-source information deduplication method based on named entity recognition. In Proceedings of the 2022 8th International Conference on Big Data and Information Analytics (BigDIA), Guiyang, China, 11–12 August 2022; pp. 479–484.
11. Huang, Z.; Liu, Q.; Chen, E.; Zhao, H.; Gao, M.; Wei, S.; Su, Y.; Hu, G. Question difficulty prediction for READING problems in standard tests. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31, pp. 1352–1359.
12. Hermann, K.M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; Blunsom, P. Teaching machines to read and comprehend. Adv. Neural Inf. Process. Syst. 2015, 28, 1693–1701.
13. Liu, Q.; Huang, Z.; Huang, Z.; Liu, C.; Chen, E.; Su, Y.; Hu, G. Finding similar exercises in online education systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 1821–1830.
14. Zhao, D.; Liu, Y. A multimodal model for college English teaching using text and image feature extraction. Comput. Intell. Neurosci. 2022, 2022, 3601545.
15. Ochoa, X.; Chiluiza, K.; Méndez, G.; Luzardo, G.; Guamán, B.; Castells, J. Expertise estimation based on simple multimodal features. In Proceedings of the 15th ACM International Conference on Multimodal Interaction, Sydney, Australia, 9–13 December 2013; pp. 583–590.
16. Penuel, W.R. Research–practice partnerships as a strategy for promoting equitable science teaching and learning through leveraging everyday science. Sci. Educ. 2017, 101, 520–525.
17. Jalilifard, A.; Caridá, V.F.; Mansano, A.F.; Cristo, R.S.; da Fonseca, F.P.C. Semantic sensitive TF-IDF to determine word relevance in documents. In Advances in Computing and Network Communications: Proceedings of CoCoNet 2020; Springer: Berlin/Heidelberg, Germany, 2021; Volume 2, pp. 327–337.
18. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781.
19. Graves, A. Generating sequences with recurrent neural networks. arXiv 2013, arXiv:1308.0850.
20. Wang, R.; Li, Z.; Cao, J.; Chen, T.; Wang, L. Convolutional recurrent neural networks for text classification. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–6.
21. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010.
22. Li, Y.; Qian, Y.; Yu, Y.; Qin, X.; Zhang, C.; Liu, Y.; Yao, K.; Han, J.; Liu, J.; Ding, E. StrucTexT: Structured text understanding with multi-modal transformers. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, 20 October 2021; pp. 1912–1920.
23. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
24. Bank, D.; Koenigstein, N.; Giryes, R. Autoencoders. arXiv 2020, arXiv:2003.05991.
25. Noh, H.; Hong, S.; Han, B. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1520–1528.
26. Cheng, H.T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, Boston, MA, USA, 15 September 2016; pp. 7–10.
27. Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
28. Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 2005, 18, 602–610.
29. Huang, P.Y.; Chang, X.; Hauptmann, A. Multi-head attention with diversity for learning grounded multilingual multimodal representations. arXiv 2019, arXiv:1910.00058.
30. Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 1480–1489.
31. Su, Y.; Liu, Q.; Liu, Q.; Huang, Z.; Yin, Y.; Chen, E.; Ding, C.; Wei, S.; Hu, G. Exercise-enhanced sequential modeling for student performance prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32, pp. 2435–2443.
32. Devassy, B.M.; George, S. Dimensionality reduction and visualisation of hyperspectral ink data using t-SNE. Forensic Sci. Int. 2020, 311, 110194.
33. Sarzynska-Wawer, J.; Wawer, A.; Pawlak, A.; Szymanowska, J.; Stefaniak, I.; Jarkiewicz, M.; Okruszek, L. Detecting formal thought disorder by deep contextualized word representations. Psychiatry Res. 2021, 304, 114135.
34. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
35. Ma, L.; Lu, Z.; Li, H. Learning to answer questions from image using convolutional neural network. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30, pp. 3567–3573.
| Item | Environment |
|---|---|
| Memory | 32 GB |
| GPU | NVIDIA GeForce RTX 3060 |
| Python version | 3.9.1 |
| PyTorch version | 1.13.0 |
| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Original [30] | 0.5281 | 0.3973 | 0.7521 | 0.5219 |
| ELMo [33] | 0.6435 | 0.7508 | 0.7274 | 0.7389 |
| BERT [34] | 0.5983 | 0.7102 | 0.6376 | 0.6719 |
| m-CNN [35] | 0.6521 | 0.7420 | 0.6697 | 0.7039 |
| MIFM | 0.7235 | 0.8164 | 0.7583 | 0.8065 |
| Model | MAE | RMSE | PCC |
|---|---|---|---|
| Original [11] | 0.2308 | 0.2801 | 0.3231 |
| ELMo [33] | 0.2378 | 0.2776 | 0.4421 |
| BERT [34] | 0.2303 | 0.3105 | 0.3753 |
| m-CNN [35] | 0.2108 | 0.2721 | 0.3809 |
| MIFM | 0.2076 | 0.2632 | 0.4683 |
| Model | MAE | RMSE | AUC | ACC |
|---|---|---|---|---|
| Original [31] | 0.4362 | 0.4653 | 0.7417 | 0.5279 |
| ELMo [33] | 0.3635 | 0.4672 | 0.7731 | 0.5535 |
| BERT [34] | 0.4239 | 0.4534 | 0.7263 | 0.5107 |
| m-CNN [35] | 0.4152 | 0.4398 | 0.7535 | 0.5631 |
| MIFM | 0.3512 | 0.4521 | 0.7736 | 0.6257 |
| Variant | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| | 0.4582 | 0.7532 | 0.7213 | 0.7378 |
| | 0.1726 | 0.1912 | 0.2126 | 0.2001 |
| | 0.2013 | 0.2063 | 0.2142 | 0.2107 |
| | 0.6873 | 0.7605 | 0.7352 | 0.7476 |
| | 0.5158 | 0.7513 | 0.7079 | 0.7289 |
| | 0.1986 | 0.2146 | 0.2523 | 0.2319 |
| MIFM | 0.7235 | 0.8164 | 0.7583 | 0.8065 |
| Variant | MAE | RMSE | AUC | ACC |
|---|---|---|---|---|
| | 0.4536 | 0.4752 | 0.7528 | 0.5658 |
| | 0.4621 | 0.4821 | 0.6892 | 0.5485 |
| | 0.4660 | 0.4723 | 0.7013 | 0.5502 |
| | 0.4221 | 0.4603 | 0.7664 | 0.6121 |
| | 0.4375 | 0.4672 | 0.7592 | 0.5613 |
| | 0.4487 | 0.4716 | 0.7121 | 0.5572 |
| MIFM | 0.3512 | 0.4521 | 0.7736 | 0.6257 |
| Variant | MAE | RMSE | PCC |
|---|---|---|---|
| | 0.2216 | 0.2834 | 0.3341 |
| | 0.2351 | 0.2883 | 0.2042 |
| | 0.2406 | 0.2763 | 0.2215 |
| | 0.2116 | 0.2720 | 0.3631 |
| | 0.2195 | 0.2648 | 0.3586 |
| | 0.2374 | 0.2761 | 0.2875 |
| MIFM | 0.2076 | 0.2632 | 0.4683 |