Article

A Novel Method for Cross-Modal Collaborative Analysis and Evaluation in the Intelligence Era

Wenyan Wu, Qintai Hu, Guang Feng and Yaxuan He

1 School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China
2 Center of Campus Network and Modern Educational Technology, Guangdong University of Technology, Guangzhou 510006, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(1), 163; https://doi.org/10.3390/app13010163
Submission received: 19 October 2022 / Revised: 18 December 2022 / Accepted: 20 December 2022 / Published: 23 December 2022
(This article belongs to the Special Issue STEAM Education and the Innovative Pedagogies in the Intelligence Era)

Abstract

The development of intelligent information technology provides an effective way to conduct cross-modal learning analytics and promotes procedural, scientific educational evaluation. To accurately capture learners' emotional changes and evaluate the learning process, this paper takes the evaluation of learners' emotional status during the behavior process as an example and constructs an intelligent analysis model of that status, providing an effective technical solution for the construction of cross-modal learning analytics. The effectiveness and superiority of the proposed method are verified by experiments. By analyzing learners' cross-modal learning behavior, the method innovates the evaluation of classroom teaching in the intelligent age and improves the quality of modern teaching.

1. Introduction

1.1. Educational Evaluation in the Age of Intelligence

With the development of science and information technology, contemporary education has undergone a series of reforms and innovations focused on improving the quality of education rather than expanding its scale and speed. Establishing a high-quality educational system has become the aim of the country and society as a whole.
In recent years, technologies such as big data, the Internet of Things, artificial intelligence, short video, virtual reality (VR), augmented reality (AR), and the metaverse have gradually been applied to the field of education. The upgraded level of information technology accelerates reform and innovation in education and profoundly affects the connotation, methods, and paths of educational evaluation, shifting educational evaluation from a result-oriented, subjective direction to a process-oriented, concomitant, and scientific direction. There is an emphasis on the innovation of evaluation tools and the application of modern information technologies such as artificial intelligence and big data to explore and carry out the longitudinal evaluation of students' learning at all grades, as well as the horizontal evaluation of moral, intellectual, physical, aesthetic, and labor development.
In accordance with the national aims and requirements for education, which call for scientific, procedural, and intelligent standards of educational evaluation, and drawing on advanced information and artificial intelligence technologies, this paper adopts an innovative teaching and learning evaluation model that focuses on intelligent education evaluation methods. This is deemed essential for the development of education reform in terms of the integration of information technology, intelligent technology, and educational philosophy.

1.2. Proposition of Research Questions

Education in the age of intelligence emphasizes the cultivation of students' abilities in innovation, interdisciplinary problem-solving, teamwork, and other comprehensive qualities through hands-on practice. The biggest difference from traditional education is that it focuses more on the process of learning than on the final results. The emotional status generated by students during the learning process can, to some extent, reflect their learning interests and learning effects and thus ultimately shape the mechanism of learning. This status can be recorded and analyzed through cross-modal data such as facial expressions, actions, language, and heartbeat.
Cross-modal collaborative analysis and evaluation can provide an effective technical solution for the construction of a teaching evaluation system in the Age of Intelligence. Thus, the data analysis and conclusions from the evaluation system can shed some light on research questions such as how to improve the quality of intelligent education.
In the smart education environment, based on technologies such as the Internet of Things and artificial intelligence, as well as intelligent sensing devices such as eye trackers and electroencephalographs, it is possible to monitor and record complex teaching processes in real time and form cross-modal data such as audio, text, images, and videos. Based on cross-modal learning analysis technology, collaborative analysis of student learning behavior data is carried out to mine and restore the information about students' emotional status in learning that is implied by the cross-modal, accompanying behavioral information in the classroom. A scientific model further reveals the implicit relationship between teaching and learning process information and classroom education quality, and a new evaluation model of classroom teaching and learning is constructed. This evaluation method is more comprehensive and scientific: it eliminates the weaknesses of the traditional educational evaluation model, which emphasized tests and grades, optimizes teaching methods based on evaluation, and improves the quality of classroom teaching. The proposed method focuses on evaluating and developing students' innovative thinking ability, comprehensive quality, personality, and mental health.
Taking cross-modal learning behavior data as the carrier and using technologies such as artificial intelligence and big data, this research collects students' emotional status throughout the learning process, conducts a collaborative analysis of learning behavior, and thus constructs a new evaluation method of teaching that links learning behavior with learners' emotional status. This research is significant for both research and real-world practice: it helps achieve a comprehensive and accurate assessment of a student's ability, cognitive level, personality traits, and mental health; it explains and explores cross-modal data-driven educational phenomena and educational laws; it offers guidance on optimizing the teaching process; and it contributes to the goal of improving classroom teaching quality.

2. Literature Review

Research on cross-modal learning analysis dates back about a decade, and the academic community continues to focus on its development. In 2012, scholars raised the idea of applying cross-modal data such as text, video, and audio in cross-modal interaction research, opening a new direction for research on cross-modal learning interaction and placing cross-modal learning at the core of educational evaluation. In the same year, the International Conference on Multimodal Interaction (ICMI) established the frontier status of cross-modal learning analysis by organizing relevant scholars to participate in a seminar on "Cross-Modal Learning Analysis" for the first time in the form of a workshop [1], declaring that the aim of this research field is to combine cross-modal analysis technology with learning science research to help researchers better understand the mechanism of learning. Since 2014, with the gradual development of blended learning, learning analysis, and big data technologies, a large amount of online and offline learning data has become available for identifying patterns such as learning emotions by extracting and analyzing cross-modal feature data. In 2016, the International Conference on Learning Analytics and Knowledge (LAK) set up a data challenge workshop for cross-modal learning analysis, organized relevant scholars to participate in the practical work of cross-modal data analysis, and explored the development direction of learning analysis research supported by cross-modal data [2]. Artificial intelligence technology has been widely applied in cross-modal learning analysis since 2018. Applying artificial intelligence to education, Li Songqing [3] proposed using cross-modal machine learning to create and replicate human cognition, analyzing and designing artificial artifacts based on artificial intelligence technology through cross-modal learning, and ultimately supporting, helping, and expanding human cognitive abilities. The American Educational Research Association (AERA) annual meeting in 2019 proposed using cross-modal data such as text, charts, audio, and video to construct a research paradigm based on cross-modal narratives to seek evidence of fairness in educational research [4].
In terms of specific research content, Hu Qintai et al. [5] analyzed the interpretability problem of cross-modal learning behavior and used the HDRBM (Hybrid Deep Restricted Boltzmann Machine), a deep neural network model, to obtain the implicit physiological and emotional characteristics of students. This research used the Bayesian network and the junction tree algorithm to analyze the interpretability of the results, which has a certain effect on improving the interpretability of learning behavior analysis. From a cross-modal measurement perspective, Kyllonen et al. [6] designed and developed a computational model framework for assessing learning status. This framework is capable of capturing, analyzing, and measuring complex human behavior as well as analyzing noisy, unstructured, and cross-modal data. The analysis process utilizes cross-modal data, including audio, video, and activity log files, to construct an Analytic Hierarchy Process for modeling the temporal dynamics of human behavior. Riquelme et al. [7] used transponders to collect students’ voice data to analyze the continuity and motivation of group collaboration. Poria et al. [8] used an ensemble feature extraction method to develop a new cross-modal information extraction agent. In particular, the developed method exploited joint tri-modal features (text, audio, and video) to infer and aggregate semantic and emotional information on user-generated cross-modal data. In terms of analysis methods, Scherer [9] used a cross-modal sequence classifier to analyze the expressions of laughter in the process of multi-party dialogue in the natural environment. At the same time, multi-dimensional data channels are used to extract the frequency and spectral features from audio streams and motion-related behavioral features from video streams. On this basis, the Hidden Markov Model and the Echo State Network are used to identify the paralinguistic behavior in the communication process, and a classification model with a certain level of accuracy is obtained. Researchers from the Norwegian University of Science and Technology collected cross-modal data of learners in adaptive learning activities and used the fuzzy set qualitative comparative analysis (fsQCA) method to describe the relationship between learner participation patterns and learning performance [10]. Vicente et al. proposed a Wearable Internet of Things in Education (WIoTED) system based on IoT technology and real-time monitoring data from wearable devices and used machine learning techniques and cross-modal learning analysis methods to build a model that could “explain” student engagement. This study selects a decision tree and rule system based on a set of correlated variables, and the obtained rules could be easily interpreted by non-professionals [11].
Overall, research on cross-modal learning analysis has developed rapidly despite its comparatively short history. In recent years, it has received extensive attention from academic groups across different disciplines. Its evolution covers knowledge model and framework design, cross-modal feature extraction and sentiment analysis, cross-modal representation learning, and deep learning. Its research ideas and results have further enriched the fields of learning analysis and learning science. Major research teams in China are from normal universities such as Beijing Normal University, East China Normal University, and South China Normal University. Analyzing cross-modal learning from the perspective of learning science, this research mainly focuses on theoretical discussion (such as Mu Su [12] and Wang Weifu [13]) and framework construction (such as Zhou Jin [14], Zhang Qi [15], Mu Zhijia [16], and Li Qing [17]).
Previous studies indicate that cross-modal learning analysis is an effective method for analyzing learning behavior and obtaining learners' status. However, current methods rarely apply theory to practice from the perspectives of computer science and data science. Meanwhile, the accuracy, comprehensiveness, and interpretability of analytical models need to be improved. Based on deep learning and cross-modal learning, we established a multi-modal sentiment analysis model for learning behavior and improved its robustness through deep learning and pre-training, which reflects an important research trend.

3. Research Method

3.1. Problem Definition

This paper processes and analyzes multi-modal data (visual, text, audio, etc.) through deep learning algorithms such as neural networks and attention mechanisms. It constructs an intelligent analysis model of the emotional status of the learner's learning process, which can accurately capture the emotional changes produced by learners during learning for collaborative analysis and evaluation.

3.2. Model Architecture

By studying cross-modal data processing, cross-modal data fusion, pre-training model construction, and collaborative analysis of learner learning status, we propose an architecture to identify the mechanism of deep inquiry learning occurrence and improve the accuracy of students’ emotional status analysis during the learning process. We term our architecture ‘BLBA-MODEL’ (Bi-LSTM and Bi-Attention Mechanism Model). Its main framework is shown in Figure 1, which can be divided into three modules: data feature extraction, data fusion and parsing, and emotional status assessment of the learning process. These three modules are illustrated as follows.

3.2.1. Representation, Normalization, and Alignment of Cross-Modal Learning Behavioral Data

The data in the learning process include multiple modalities such as video, text, and audio. In multi-modal data, each modality provides specific information for the other modalities, and there are correlations between the modalities. In this paper, the text, audio, and visual streams in the video of the learner's learning process are used as the processing data, and the two techniques of batch normalization and layer normalization are used to remove dimensional effects and normalize the training data. Formula (1) is a specific representation of normalization. For an input $z_i^l$, after calculating its mean $\mu$ and variance $\sigma^2$, normalization is performed, where $\gamma$ and $\beta$ represent the learnable scaling and translation parameters, respectively, and $\epsilon$ is a small constant that prevents division by zero. For each neuron, before the data enter the activation function, the mean and variance of each batch are calculated along the channel so that the data maintain a normal distribution with a mean of 0 and a variance of 1, avoiding vanishing gradients.

$$\tilde{z}_i^l = \frac{z_i^l - \mu}{\sqrt{\sigma^2 + \epsilon}} \times \gamma + \beta \qquad (1)$$
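As a concrete illustration of Formula (1), the short PyTorch sketch below applies the same standardize-then-rescale operation with learnable γ and β; the function name, tensor shapes, and the ε constant are our own illustrative assumptions, not the authors' implementation.

```python
import torch

def normalize(z, gamma, beta, eps=1e-5, dim=0):
    """Standardize z along `dim`, then rescale with the learnable scaling
    (gamma) and translation (beta) parameters, as in Formula (1).
    dim=0 mimics batch normalization over the batch axis; dim=-1 mimics
    layer normalization over the feature axis."""
    mu = z.mean(dim=dim, keepdim=True)
    var = z.var(dim=dim, unbiased=False, keepdim=True)
    z_hat = (z - mu) / torch.sqrt(var + eps)  # zero mean, unit variance
    return gamma * z_hat + beta

# Example: a batch of 8 utterance feature vectors with 74 dimensions (hypothetical audio features)
z = torch.randn(8, 74)
gamma = torch.ones(74, requires_grad=True)
beta = torch.zeros(74, requires_grad=True)
z_tilde = normalize(z, gamma, beta, dim=0)
```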
The aligned hidden vectors are spliced as the input of the subsequent module, the LSTM-Attention (Long Short-Term Memory and Attention Mechanism) combination module. The technical route of characterization, normalization, and alignment of cross-modal streaming media data is shown in Figure 2.

3.2.2. Fusion and Analysis of Cross-Modal Learning Behavior Data

After normalizing and aligning the cross-modal data, the information interaction within and between modalities is identified by fusing the feature representations of various modal information. This paper proposes the LSTM-Attention combination module to realize this kind of information interaction and takes the interaction of three modalities (audio, text, and visual) in the video as an example to illustrate in detail. Since we only used text, audio, and visual as our inputs for subsequent experiments, we will no longer regard video as a separate input.

3.2.2.1. Unimodal Feature Information Interaction

Considering the video information as a set of several utterances, we use multiple independent Bi-LSTMs (Bi-Directional Long Short-Term Memory networks) to capture the context-related semantic information of each modality, and the output $h_{ij}$ of the Bi-LSTM layer is obtained as

$$h_{ij} = \left[ \overrightarrow{\mathrm{LSTM}}\left(x_{ij}, \overrightarrow{h}_{j-1}\right); \overleftarrow{\mathrm{LSTM}}\left(x_{ij}, \overleftarrow{h}_{j-1}\right) \right]$$

where $x_{ij}$ represents the input feature of utterance $j$ in video $i$, $h_{j-1}$ represents the hidden layer state of utterance $j-1$, and $h_{ij}$ represents the output of the Bi-LSTM layer.
The matrix of video $i$ after the Bi-LSTM layer is then expressed as

$$H_i^{sm} = \left[ h_{ij} \right]_{j \le L_i}, \quad H_i^{sm} \in \mathbb{R}^{q \times d}$$

where $d$ represents the feature dimension.
Through this layer of processing, three modal feature representations with context-related information in video $i$ can be obtained: the textual feature $H_i^{st}$, the audio feature $H_i^{sa}$, and the visual feature $H_i^{sv}$.
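The following PyTorch sketch illustrates the independent per-modality Bi-LSTM step described above; the input dimensions, hidden size, and all identifiers are placeholder assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

# One independent Bi-LSTM per modality; the input sizes are placeholder assumptions.
input_dims = {"text": 300, "audio": 74, "visual": 35}
d = 128  # hidden size per direction, so each output feature has dimension 2 * d

encoders = nn.ModuleDict({
    m: nn.LSTM(input_size=dim, hidden_size=d, batch_first=True, bidirectional=True)
    for m, dim in input_dims.items()
})

def encode_video(utterance_feats):
    """utterance_feats[m]: (1, q, input_dims[m]) for the q utterances of one video.
    Returns context-aware features of shape (1, q, 2 * d) per modality."""
    return {m: encoders[m](x)[0] for m, x in utterance_feats.items()}

q = 10  # number of utterances in the video
feats = {m: torch.randn(1, q, dim) for m, dim in input_dims.items()}
H = encode_video(feats)  # H["text"], H["audio"], H["visual"]
```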

3.2.2.2. Bimodal Feature Information Interaction

After obtaining the above unimodal feature representations, the three modalities are combined in pairs (text + audio, audio + visual, and text + visual) to obtain three modal pairs. Next, we propose a BAM model (Bimodal Attention Mechanism model) to capture the interaction information between modalities and contexts. The model structure is shown in Figure 3. The specific process and formulas are as follows:
In the first step, the feature representation matrices obtained in Section 3.2.2.1 are multiplied to obtain the cross-modal information matrices, where $u, w \in \{t, a, v\}$:

$$C_i^{uw} = H_i^{su} \left( H_i^{sw} \right)^{T}, \qquad C_i^{wu} = H_i^{sw} \left( H_i^{su} \right)^{T}$$

The SoftMax function (SoftMax logistic regression) is used to model multiclass classification problems in which the output should be a probability distribution over the possible classes. In the second step, SoftMax is applied to the cross-modal information matrices to obtain the attention scores:

$$\alpha_i^{uw} = \mathrm{softmax}\left( C_i^{uw} \right), \qquad \alpha_i^{wu} = \mathrm{softmax}\left( C_i^{wu} \right), \qquad \alpha_i^{uw}, \alpha_i^{wu} \in \mathbb{R}^{q \times q}$$

In the third step, the attention scores are multiplied by the feature matrices, and the modal feature representations with information interaction are obtained:

$$H_i^{d_1 u} = \left( \alpha_i^{uw} H_i^{su} \right) \cdot H_i^{su}, \qquad H_i^{d_1 w} = \left( \alpha_i^{wu} H_i^{sw} \right) \cdot H_i^{sw}, \qquad H_i^{d_1 u}, H_i^{d_1 w} \in \mathbb{R}^{q \times d}$$

After the above operations, the text feature representations $H_i^{d_1 t}$ and $H_i^{d_3 t}$, the audio feature representations $H_i^{d_1 a}$ and $H_i^{d_2 a}$, and the visual feature representations $H_i^{d_2 v}$ and $H_i^{d_3 v}$ can be obtained, where $d_1$, $d_2$, and $d_3$ index the text + audio, audio + visual, and text + visual pairs, respectively.
Figure 3. BAM-model.
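A minimal sketch of the bimodal attention computation for one modality pair is given below; interpreting the final product of the third step as element-wise, as well as all names and shapes, are our assumptions.

```python
import torch
import torch.nn.functional as F

def bimodal_attention(Hu, Hw):
    """Bimodal attention for one modality pair.
    Hu, Hw: (q, d) context-aware features of the two modalities for one video.
    Returns the interaction-aware representation of each modality in the pair.
    Treating the final product of the third step as element-wise is our reading."""
    C_uw = Hu @ Hw.T                 # cross-modal information matrices, (q, q)
    C_wu = Hw @ Hu.T
    A_uw = F.softmax(C_uw, dim=-1)   # attention scores
    A_wu = F.softmax(C_wu, dim=-1)
    Hd_u = (A_uw @ Hu) * Hu          # (q, d)
    Hd_w = (A_wu @ Hw) * Hw          # (q, d)
    return Hd_u, Hd_w

q, d = 10, 256
H_text, H_audio = torch.randn(q, d), torch.randn(q, d)
Hd_text, Hd_audio = bimodal_attention(H_text, H_audio)  # text + audio pair
```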

3.2.3. Assessment of Learning Emotional Status

The feature representation of each modality with contextual interaction information is obtained through the processing in the previous two sections. In this section, the attention mechanism is used to weight the importance of each modality's information and filter out redundant information; emotions are then classified according to the final fused information to obtain students' emotional status throughout the learning process. The specific process is as follows:
First, a fully connected layer splices the above modal features to obtain the feature representation that enters the attention mechanism for redundancy filtering:
$$R_i^t = \tanh\left( W_i^t \left[ H_i^{st}; H_i^{d_1 t}; H_i^{d_3 t} \right] + b_i^t \right)$$
$$R_i^a = \tanh\left( W_i^a \left[ H_i^{sa}; H_i^{d_1 a}; H_i^{d_2 a} \right] + b_i^a \right)$$
$$R_i^v = \tanh\left( W_i^v \left[ H_i^{sv}; H_i^{d_2 v}; H_i^{d_3 v} \right] + b_i^v \right)$$

where $W_i^m$ is the weight, $b_i^m$ is the bias, and $[\,\cdot\,;\,\cdot\,;\,\cdot\,]$ denotes concatenation.
Then, the feature representation $R_i^m$ of each modality is sent to the attention mechanism model, and the multi-modal fusion feature representation is obtained by weighted summation:

$$\alpha^m = \mathrm{softmax}\left( \tanh\left( W_{att}^m R_i^m + b_{att}^m \right) \right)$$
$$R_i^{m*} = \alpha^m R_i^m \left( \alpha^m \right)^{T}$$

where $W_{att}^m$ represents the weight, $b_{att}^m$ represents the bias, and $\alpha^m$ represents the normalized weight.
Finally, emotion prediction and classification are performed through the fully connected layer and the SoftMax function, and the final learned emotional and cognitive state is obtained:

$$y_i = \mathrm{softmax}\left( W_q \cdot \tanh\left( W_p \cdot R_i^{m*} + b_p \right) + b_q \right)$$

where $W_p$ and $b_p$ are the weight and bias of the SoftMax layer, and $W_q$ and $b_q$ are the weight and bias of the fully connected layer.
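The sketch below assembles the steps of this section (per-modality projection of the spliced features, attention-based weighting of modalities, and the final SoftMax classifier); the layer sizes, the exact form of the weighted summation, and every identifier are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionHead(nn.Module):
    """Sketch of the assessment module: per-modality projection of the spliced
    features, attention over modalities to filter redundancy, and a final
    SoftMax classifier. Layer sizes and the fusion form are assumptions."""
    def __init__(self, d=256, n_classes=2):
        super().__init__()
        self.project = nn.Linear(3 * d, d)  # projects the three spliced feature blocks per modality
        self.att = nn.Linear(d, 1)          # scores each modality's contribution
        self.fc_p = nn.Linear(d, d)
        self.fc_q = nn.Linear(d, n_classes)

    def forward(self, modal_feats):
        # modal_feats: list of (q, 3 * d) spliced features, one entry per modality
        R = torch.stack([torch.tanh(self.project(x)) for x in modal_feats])  # (M, q, d)
        alpha = F.softmax(torch.tanh(self.att(R)), dim=0)                    # weights over modalities
        R_star = (alpha * R).sum(dim=0)                                      # fused representation, (q, d)
        logits = self.fc_q(torch.tanh(self.fc_p(R_star)))
        return F.softmax(logits, dim=-1)  # per-utterance emotion probabilities

q, d = 10, 256
feats = [torch.randn(q, 3 * d) for _ in range(3)]  # text, audio, visual
probs = EmotionHead(d)(feats)
```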

4. Results and Discussion

This section systematically analyzes the effect of the proposed model on the multi-modal sentiment analysis task.

4.1. Experimental Datasets

Since a dataset of students' emotional behavior in the learning process involves issues such as student privacy, no such dataset was available for targeted model training and analysis. Therefore, this paper uses the public benchmark dataset CMU-MOSEI for the experiments. On the one hand, this verifies the sentiment classification performance of the model; on the other hand, it expands the scope of application of the model and facilitates subsequent targeted work.
The dataset was collected from 1000 speakers and contains a total of 3228 videos with 23,453 annotated sentences covering 250 different topics. Males and females make up about 57% and 43% of the speakers, respectively. We divide the dataset into a training set and a test set; the details are shown in Table 1. The dataset has six different annotations for sentiment, namely Happiness, Sadness, Anger, Disgust, Surprise, and Fear. For classification prediction, videos with sentiment values greater than or equal to 0 are labeled as positive, and videos with sentiment values less than 0 are labeled as negative.
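A small sketch of the labeling rule just described (sentiment value greater than or equal to 0 is positive, less than 0 is negative); the function name and data layout are assumptions and are not part of the CMU Multimodal Data SDK API.

```python
def binarize_sentiment(sentiment_values):
    """Map continuous CMU-MOSEI sentiment scores to the two classes used here:
    values >= 0 become positive (1), values < 0 become negative (0)."""
    return [1 if v >= 0 else 0 for v in sentiment_values]

labels = binarize_sentiment([2.3, -0.7, 0.0, -1.5])  # -> [1, 0, 1, 0]
```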

4.2. Ablation Experiment

This paper divides the feature input into different combinations to verify the importance of multi-modal information and performs sentiment classification prediction for each combination. ACC (accuracy) and the F1 score are used to evaluate algorithm performance:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \qquad F_1 = \frac{2\,TP}{2\,TP + FP + FN}$$
where TP is true positive, TN is true negative, FP is false positive, and FN is false negative. The experimental results are shown in Figure 4.
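For clarity, both indexes can be computed directly from the confusion-matrix counts, as in the sketch below; the counts are hypothetical and serve only to illustrate the calculation.

```python
def accuracy_and_f1(tp, tn, fp, fn):
    """Compute the two evaluation indexes from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return acc, f1

# Hypothetical counts, used only to illustrate the calculation
acc, f1 = accuracy_and_f1(tp=820, tn=790, fp=110, fn=120)
print(f"ACC = {acc:.2%}, F1 = {f1:.2%}")
```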

4.3. Comparative Experiment

This paper compares the proposed model with existing multi-modal sentiment analysis models, and we use the same dataset to experiment with these methods. The benchmark methods are as follows:
(1) SVM-MD: an SVM model using multi-modal features trained with early fusion.
(2) LF-LSTM: an LSTM network with a late fusion method.
(3) MFN: a neural network structure for multi-view sequential learning based on attention networks and gated memory, which models the interaction between modalities well.
(4) MARN: combines a multi-attention module with a recurrent neural network; interaction information among the three modalities is obtained through a multi-attention unit, and the recurrent network serves as a memory unit for storage.
(5) Graph-MFN: extends the MFN method by combining it with a dynamic fusion graph; that is, the multi-modal dynamic fusion graph is combined with the context memory fusion network.
(6) MMMU-BA: utilizes the correlation of contextual information of target video segments between different modalities to assist multi-modal information fusion.

4.4. Experimental Results and Analysis

The ablation results show that, among unimodal inputs, text-based sentiment analysis performs best. The sentiment analysis performance of all bimodal combinations is generally better than that of the best unimodal input. The trimodal sentiment analysis model performs best among all models: its accuracy is 1.36% and 1.61% higher than the best unimodal and bimodal models, respectively, and its F1 score is 2.33% and 1.98% higher than the best unimodal and bimodal models, respectively. Therefore, the effective combination of multiple modalities improves sentiment classification prediction and overall model performance. Moreover, during students' learning, their emotional status can be grasped accurately by capturing information from multiple modalities and adopting multi-modal emotional analysis; the resulting evaluations can be used to provide timely emotional guidance to students or to adjust the difficulty level and teaching mode of the course, helping students acquire knowledge better and achieve the learning aims more easily.
From the experimental data in Table 2, the proposed method achieves excellent results in both accuracy and the F1 score. Compared with early fusion, the performance improvement is the most obvious. Compared with more recent and advanced deep learning methods, the accuracy and F1 score are also 0.17% and 0.95% higher, respectively. Therefore, the model can extract meaningful information from multiple modalities for fusion and classify and predict accurately, making it well suited to analyzing students' emotional status during the learning process.

5. Conclusions

With the development of intelligent technology, analytics data with a single source and a simple structure have gradually become unsuitable for this rapidly developing era. Based on deep learning, this paper proposes a multi-modal emotional analysis method to obtain people's sentiment status and make accurate evaluations of their behavior processes. The above experiments demonstrate the effectiveness and excellent performance of the proposed method. In future work, we will apply emotional analysis in intelligent education to build a learning analysis model across different subjects and to identify the correlations between them while studying the learning behavior of each subject. We will also use datasets related to students' learning behavior processes to build a complete cross-modal collaborative analysis framework.

Author Contributions

Methodology, W.W.; Formal analysis, Y.H.; Data curation, G.F.; Writing—original draft, W.W.; Supervision, Q.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China, 2022, "Collaborative analysis and evaluation method of cross-modal knowledge elements of classroom streaming media" (Grant No. 62237001).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

A publicly available dataset was analyzed in this study. The dataset can be found here: http://multicomp.cs.cmu.edu/resources/cmu-mosei-dataset/ (accessed on 19 December 2022). The dataset is available for download through the CMU Multimodal Data SDK on GitHub: https://github.com/A2Zadeh/CMU-MultimodalDataSDK (accessed on 19 December 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Zhong, W.; Li, R.; Ma, X.; Wu, Y. The development trend of learning and analysis technology—Research and exploration under the environment of multi-modal data. China Distance Educ. 2018, 11, 41–49.
2. Li, X.; Zuo, M.; Wang, Z. Research Status and Future Prospects of Learning Analysis—Review of the 2016 International Conference on Learning Analysis and Knowledge. Open Educ. Res. 2017, 23, 46–55.
3. Li, S.; Zhao, Q.; Zhou, Z.; Zhang, Y. Cognitive neural mechanism of image and text processing in multimedia learning. Prog. Psychol. Sci. 2015, 23, 1361–1370.
4. Wu, B.; Peng, X.; Hu, Y. The Way to Eliminate the Turnip and Preserve the Quintessence in Educational Research: From Multi-modal Narration to Evidential Fairness—A Review of the American AERA 2019 Conference. J. Distance Educ. 2019, 37, 13–23.
5. Hu, Q.; Wu, W.; Feng, G.; Pang, T.; Qiu, K. Research on interpretability analysis of multi-modal learning behavior supported by in-depth learning. Res. Audio Vis. Educ. 2021, 42, 77–83.
6. Kyllonen, P.C.; Zhu, M.; Davier, A.A. Introduction: Innovative Assessment of Collaboration; Springer: Berlin/Heidelberg, Germany, 2017.
7. Riquelme, F.; Munoz, R.; Mac Lean, R.; Villarroel, R.; Barcelos, T.S.; de Albuquerque, V.H.C. Using multi-modal learning analytics to study collaboration on discussion groups. Univers. Access Inf. Soc. 2019, 18, 633–643.
8. Poria, S.; Cambria, E.; Hussain, A.; Huang, G.B. Towards an intelligent framework for multi-modal affective data analysis. Neural Netw. 2015, 63, 104–116.
9. Scherer, S. Multi-modal behavior analytics for interactive technologies. Künstl. Intell. 2016, 30, 91–92.
10. Papamitsiou, Z.; Pappas, I.O.; Sharma, K.; Giannakos, M.N. Utilizing multi-modal data through fsQCA to explain engagement in adaptive learning. IEEE Trans. Learn. Technol. 2020, 13, 689–703.
11. Camacho, V.L.; de la Guía, E.; Olivares, T.; Flores, M.J.; Orozco-Barbosa, L. Data capture and multi-modal learning analytics focused on engagement with a new wearable IoT approach. IEEE Trans. Learn. Technol. 2020, 13, 704–717.
12. Mu, S.; Cui, M.; Huang, X. Data Integration Method for Panoramic Perspective Multi-modal Learning Analysis. Res. Mod. Distance Educ. 2021, 33, 26–37.
13. Wang, W.; Mao, M. Multi-modal learning analysis: A new way to understand and evaluate real learning. Res. Audio Vis. Educ. 2021, 42, 25–32.
14. Zhou, J.; Ye, J.; Li, C. Emotional Computing in Multi-modal Learning: Motivation, Framework and Suggestions. Res. Audio Vis. Educ. 2021, 42, 26–32.
15. Zhang, Q.; Li, F.; Sun, J. Multi-modal Learning Analysis: Learning Analysis Towards the Era of Computational Education. China Audio Vis. Educ. 2020, 9, 7–14.
16. Mou, Z. Multi-modal learning analysis: A new growth point of learning analysis research. Audio Vis. Educ. 2020, 41, 27–32.
17. Li, Q.; Ren, Y.; Huang, T.; Liu, S.; Qu, J. Research on application of learning analysis based on sensor data. Res. Audio Vis. Educ. 2019, 40, 64–71.
Figure 1. Overall architecture diagram of a cross-modal collaborative analysis model.
Figure 2. Representation, normalization, and alignment of cross-modal learning behavioral data.
Figure 4. Ablation experiment results.
Table 1. Training set and test set of the CMU-MOSEI dataset.

CMU-MOSEI Dataset | Videos | Number of Utterances | Positive Sentiment | Negative Sentiment
Training set | 2250 | 16146 | 7869 | 8277
Test set | 679 | 4634 | 2541 | 2093
Table 2. Comparative experimental results of different models on the CMU-MOSEI dataset.

CMU-MOSEI | SVM-MD | LF-LSTM | MFN | MARN | Graph-MFN | MMMU-BA | BLBA-MODEL
ACC (%) | 68.34 | 80.61 | 77.13 | 76.8 | 76.9 | 80.27 | 80.78
F1 (%) | 67.92 | 79.46 | 76.35 | 77.01 | 77.14 | 79.23 | 80.41
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

