Article

Advancing Facial Expression Recognition in Online Learning Education Using a Homogeneous Ensemble Convolutional Neural Network Approach

Department of Computer Science, College of Computing, Khon Kaen University, Khon Kaen 40002, Thailand
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(3), 1156; https://doi.org/10.3390/app14031156
Submission received: 7 December 2023 / Revised: 9 January 2024 / Accepted: 23 January 2024 / Published: 30 January 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Facial expression recognition (FER) plays a crucial role in understanding human emotions and is becoming increasingly relevant in educational contexts, where personalized and empathetic interactions are essential. Existing approaches typically rely on a single deep learning method, which is not robust for complex datasets such as FER data, which are characteristically imbalanced and multi-class. In this research paper, an innovative approach to FER using a homogeneous ensemble convolutional neural network, called HoE-CNN, is presented for future online learning education. This paper aims to transfer model knowledge and perform FER classification using an ensemble of homogeneous convolutional neural network architectures. FER is a challenging research area because of its many real-world applications, such as adaptive user interfaces, games, education, and robot integration. HoE-CNN is used to improve the classification performance on an FER dataset encompassing seven classes (Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral). The experiments show that the proposed framework, which uses an ensemble of deep learning models, performs better than a single deep learning model. In summary, the proposed model increases the efficiency of FER classification, achieving an accuracy of 75.51% on FER2013, and addresses both imbalanced datasets and multi-class classification so that the model can be transferred to online learning applications.

1. Introduction

Facial expressions are a fundamental aspect of human communication that convey emotions, intentions, and reactions. In educational environments, the ability to accurately perceive and respond to students’ emotional states can significantly impact the students’ learning experience. Traditional facial expression recognition systems have limitations in handling complex expressions and varying lighting conditions. Research on these environments has received widespread attention in multiple disciplines, including computer science, psychology, architecture, and education [1]. Usually, the learning flow in these environments depends on the learner’s mental responses based on solving tests and answering exam questions, enabling the next level of the learning process to be reached. However, conventional methods do not consider the emotional behavior of the learner during the learning process. Emotional behavior is an important factor in the quality of the learning process and the success of the desired learning outcomes. A dashboard can support emotion detection during online learning [2]. The learner’s interactions with the learning environment can be classified into two categories. The first consists of mental responses to test questions, and the second consists of emotional responses through facial expressions. Therefore, integrating both types of responses is critical for developing more adaptive and intelligent computer-based learning environments [3]. Due to the importance of considering the learner’s emotions during the learning process, other research efforts have focused on modelling these emotions by interpreting facial expressions using machine learning algorithms [4,5,6,7,8,9]. Furthermore, researchers [10,11] have tried to create models for studying the human mind, including analyzing various behaviors such as what kind of person someone is or their likes and dislikes. 
For example, in terms of applications, the benefit of experimenting on an FER dataset that can be adopted in the education sector is the ability to assess student satisfaction with the quality of learning when students learn online using tools such as Zoom and Google Meet. Facial expression recognition (FER) is thus a significant area of research with applications in the education field: understanding students’ emotional states and engagement levels can greatly enhance the learning process.
Research has explored applying facial expression recognition with techniques based on machine learning and deep learning [12]. Deep learning performance depends largely on the image resolution [13,14,15], and preprocessing the data during feature extraction and feature selection is very important for FER classification [16]. However, even when a person is portraying a neutral expression, their internal feelings or hidden messages can be reflected on their face, and observers can perceive those feelings. The six most universal emotions are sadness, fear, happiness, disgust, surprise, and anger [17]. However, facial expressions and the emotional interpretations of those expressions are not the same thing: the emotional interpretation of an expression is inferred from a person’s perception of the internal emotional state.
Current research is still examining the complex FER2013 dataset [18], which poses challenges for single machine learning models and deep learning approaches. Existing efforts encounter accuracy bottlenecks on FER2013 for the following reasons:
(1)
Existing approaches concentrate on feature processing within deep learning to handle between five and seven class labels effectively.
(2)
Existing approaches rely on a single deep learning method, which cannot readily resolve imbalanced data issues; this is particularly evident for minority classes among the seven, such as the Disgust label. The numerous class labels of the FER2013 dataset make it a challenging benchmark FER dataset.
The proposed model uses ensemble methods, which are advantageous for enhancing the performance of machine learning models. These algorithms aim to form better final classification results by combining the outputs of several machine learning methods. Reviews of ensemble methods for FER datasets highlight methodological differences and report the performance of heterogeneous and homogeneous combinations, prioritizing results for comparison between methods. Building on deep learning algorithms for image processing, we propose an ensemble learning method called homogeneous deep learning. Homogeneous deep learning entails the use of similar kinds of CNNs, for which we chose ensembles of six or seven models from DCNN, EfficientNetB2, InceptionResNetV2, ResNet50, Xception, DenseNet, and VGG16. The contributions of this paper are as follows:
(1)
The homogeneous CNN ensemble combination producing the best model is found to outperform a single deep learning method, and the number of ensemble members applied to the FER2013 dataset is determined for translating the model to online learning education.
(2)
The homogeneous CNN ensemble techniques handle imbalanced data and yield better performance on minority and ambiguous classes across the seven labels: Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral.
Therefore, this paper proposes a novel approach to addressing these challenges in imbalanced datasets by applying homogeneous ensemble convolutional neural networks (HoE-CNNs), which are deep learning methods. Deep learning algorithms are promising for research on artificial neural networks because of their high performance in many tasks, especially image processing, video classification, and speech recognition. In this paper, deep learning methods are adopted on a single dataset, and their parameters are fine-tuned to fit the models to the domain dataset.
The structure of the paper is as follows: Section 2 covers the literature review of facial expression recognition in education, deep learning, and ensemble learning methods. Section 3 describes the proposed homogeneous ensemble convolutional neural network. Section 4 describes the dataset, experimental setup, experimental results, and discussion. Section 5 discusses the findings, and Section 6 concludes the paper and describes future work.

2. Literature Review

The education field has been dramatically transformed by the incorporation of online learning and machine learning. Research integrating online learning and machine learning techniques for monitoring students’ emotions has received significant attention in the educational technology field. The following is a review of notable research in this domain. In [19], the use of machine learning for detecting and monitoring the emotional states of online learners was explored, discussing the potential of affective computing techniques for understanding student engagement and providing personalized feedback. Another study investigated the use of machine learning models to predict student emotions and engagement in massive open online courses (MOOCs) [20], highlighting the importance of real-time emotional feedback in enhancing the learning experience. Students’ facial expressions, speech, and interaction patterns have also been analyzed to infer their emotional states and adapt the learning content accordingly [21]; such techniques can be used to design intelligent tutoring systems that provide tailored emotional support, which is the role of machine learning in recognizing students’ affective states during online learning. However, there are ethical considerations in using machine learning to monitor student emotions, including issues related to privacy, bias, and fairness in emotion recognition systems [2]. These studies collectively underline the growing importance of integrating machine learning techniques with online learning platforms to monitor student emotions. They emphasize the potential benefits of adapting educational content and support to individual emotional states, ultimately enhancing the overall learning experience, while also highlighting the need to consider ethics and privacy as this technology continues to advance.
Many types of machine learning techniques exist. As mentioned, machine learning fits an individual algorithm to the dataset. Single techniques are effective for binary classes, for example, yes or no. However, FER data are multi-class, with 7 main classes, so single algorithms are not feasible for online learning environments in education. Ensemble methods have emerged as powerful techniques in the classification field, offering improved accuracy and robustness by combining multiple individual classifiers. Over the years, extensive research has explored the effectiveness of ensemble methods and their applications in various domains, discussing their advancements and challenges. Moreover, ensemble methods have been applied to many datasets with the aim of determining the best classifier performance. Preprocessing the data before they are input into ensemble models is also very important. Face detection (FD) extracts facial landmarks in face regions using color gradients, followed by the extraction of geometric features of the actual facial emotions [22,23,24]. Ultrafast face detection solutions encompass 6 landmarks and multi-face support processed holistically, allowing the contribution of holistic processing to the face inversion effect to be determined; [13] used key points by holistically locating faces, noting the coordinates of each located face, and drawing a holistic key point around every face. The values of the facial image are normalized to accelerate training; this step is performed differently by different algorithms. Afterwards, feature extraction is followed by feature selection, which ranks the importance of the existing features in the dataset and discards the less important ones [16]. In this paper, we preprocess the data and create labels and features.
Various techniques have been proposed to improve performance before inputting data into the classification model. The details of the preprocessing operation and the algorithms using the applied FER dataset are summarized in Table 1.

3. Proposed Frameworks

Our research focuses on a homogeneous ensemble of CNN models, which are trained and fine-tuned to specialize in recognizing specific facial expressions across 7 emotions. The homogeneous CNN ensemble approach combines deep learning models from the same CNN family, such as DCNN, EfficientNetB2, and InceptionResNetV2, enabling more robust and accurate emotion recognition. The architecture, shown in Figure 1, details the preprocessing for collecting data from facial expression detection, the image processing, and the training process of the HoE-CNN, highlighting the techniques used to enhance performance.
The first necessary steps in facial detection methods are usually to preprocess the original input image, generate face candidate boxes on it, and feed the candidate boxes into the network for feature extraction, for example, mapping an image to a detection probability, bounding-box coordinates (Bx, By, Bh, Bw), and Face/No-Face classes. The two most widely used methods for generating face candidates are as follows. The first scans the input image at different resolutions to obtain candidate frames; this method can cover as much of the possible position area of the face target as possible. The second obtains face candidate boxes by scaling the input image by multiples of a fixed scale.
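The fixed-scale rescaling method above amounts to building an image pyramid. The following is a minimal numpy sketch, not the authors' implementation; the scale factor and minimum size are assumed values, and nearest-neighbour resampling stands in for whatever interpolation a real detector would use.

```python
import numpy as np

def image_pyramid(image, scale=0.75, min_size=24):
    """Yield progressively downscaled copies of a greyscale image
    (nearest-neighbour resampling) for multi-scale face-candidate search."""
    current = image.astype(np.float32)
    while min(current.shape) >= min_size:
        yield current
        h, w = current.shape
        nh, nw = int(h * scale), int(w * scale)
        if nh < 1 or nw < 1:
            break
        # Nearest-neighbour indices into the current pyramid level.
        rows = (np.arange(nh) / scale).astype(int)
        cols = (np.arange(nw) / scale).astype(int)
        current = current[rows][:, cols]

levels = list(image_pyramid(np.zeros((48, 48)), scale=0.5, min_size=12))
# 48 -> 24 -> 12: three pyramid levels to scan for candidates
```

Each level would then be scanned with a fixed-size window, so one window size covers faces of several apparent sizes.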
Second, face candidate frame generation can be based on selective search. This method primarily uses inherent attributes of the input object to determine the possible positions of face candidate frames in the input image, greatly reducing the number of candidate frames and the computing time of the network. In addition, multiresolution methods can be used to obtain regions at different scales, and the CNN model can then classify each region to yield a category and confidence. The image processing therefore includes cropping the facial emotion area, converting the image to greyscale, normalizing the values into [0, 1], and augmenting the data. In most cases, data augmentation yields a larger dataset, a more accurate prediction, and a more robust model; it increases the quantity of training data using the training data exclusively, which helps avoid overfitting. Data augmentation is performed during training and includes rescaling, horizontal flipping, random zoom, featurewise normalization, rotation, and horizontal and vertical shifts, as shown in Figure 2.
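A few of the augmentation steps above can be sketched in plain numpy. This is a hedged illustration of normalization, random horizontal flips, and random shifts only, assuming 48 × 48 greyscale inputs; the paper's actual pipeline (which also includes zoom and rotation, typically via a Keras-style generator) is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Minimal augmentation sketch: normalize to [0, 1], random horizontal
    flip, and random horizontal/vertical shifts of up to 10% of the size."""
    x = image.astype(np.float32) / 255.0          # normalize to [0, 1]
    if rng.random() < 0.5:                        # random horizontal flip
        x = x[:, ::-1]
    h, w = x.shape
    dy = int(rng.integers(-h // 10, h // 10 + 1))  # vertical shift
    dx = int(rng.integers(-w // 10, w // 10 + 1))  # horizontal shift
    x = np.roll(x, (dy, dx), axis=(0, 1))
    return x

sample = augment(rng.integers(0, 256, (48, 48)))
```

Applying such transforms on the fly during training means every epoch sees slightly different images, which is what discourages overfitting.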
Convolutional neural networks (CNNs) have revolutionized the computer vision field. These networks are extremely effective for a variety of image and video recognition tasks due to their ability to automatically learn and extract key features from videos [29] or images [30,31]. CNNs have spawned many other deep learning models, indicating the potential of deep convolutional neural networks. The CNN model contains a convolution layer, as shown in Figure 3. These layers perform the initial step of extracting various traits from the input images. Convolution is a mathematical operation applied between the input image and a filter of a particular size M × M: sliding the filter over the input image yields the dot product between the filter’s elements and the corresponding M × M patch of the input image. The output, known as the feature map, provides information about the image, including its corners and edges, and subsequent layers extract additional features from it. The second layer is the pooling layer, whose primary objective is to reduce the size of the convolved feature map to decrease computational costs. This is achieved by minimizing the connections between layers and operating on each feature map individually. Depending on the technique used, several types of pooling are available: max pooling keeps the largest element of each region of the feature map, average pooling takes the average of the elements in an image segment of a predetermined size, and sum pooling takes their total. Typically, the pooling layer acts as a link between the convolutional layers and the fully connected (FC) layer. Finally, weights and biases compose the FC layer, which connects the neurons of two layers. These layers are the final few in CNN designs and are frequently inserted before the output layer.
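The convolution and max-pooling operations described above can be made concrete with a minimal numpy sketch. This is an illustration of the arithmetic only, not the paper's trained layers; the vertical-edge filter and input are hypothetical examples.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution: slide an M x M filter over the image and take
    the dot product at each position, producing the feature map."""
    m = kernel.shape[0]
    h, w = image.shape
    out = np.empty((h - m + 1, w - m + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + m, j:j + m] * kernel)
    return out

def max_pool(fmap, size=2):
    """Max pooling: keep the largest element in each size x size block."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

edge = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])  # vertical-edge filter
fmap = conv2d(np.ones((6, 6)), edge)   # 4 x 4 feature map
pooled = max_pool(fmap)                # 2 x 2 map after pooling
```

On a constant image the edge filter responds with zero everywhere, which is exactly the behaviour one wants from an edge detector; pooling then halves each spatial dimension.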
Activation functions are among the most essential components of each CNN model in the homogeneous ensemble convolutional neural network. This paper evaluates ensembles of 6 and 7 CNN models; the optimization parameter settings for each CNN model are listed in Table 2.
The ensemble model decides the multi-class label from the outputs of the ensembled CNN models. The output p_i is the probability vector of CNN_i. Given a sample x, each member predicts the class with the highest probability, and the majority vote over the n-member ensemble is taken following Equation (1).
ClassLabel(x) = argmax_k Σ_{i=1}^{n} I(argmax_j p_{i,j}(x) = k)   (1)
where
n is the number of models in the ensemble,
x is the input sample, and
p_i is the probability vector output by CNN_i; I(·) is the indicator function, equal to 1 when member i votes for class k.
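The majority vote of Equation (1) can be sketched in a few lines. The probability vectors below are hypothetical values over the 7 emotion classes, and breaking ties toward the lowest class index is an assumption, not a rule stated in the paper.

```python
import numpy as np

def ensemble_predict(prob_vectors):
    """Majority vote over n CNN members: each member i votes for the argmax
    of its probability vector p_i; the most-voted class wins (ties fall to
    the lowest class index)."""
    votes = [int(np.argmax(p)) for p in prob_vectors]
    counts = np.bincount(votes)
    return int(np.argmax(counts))

# Three hypothetical members over 7 classes; two of them vote class 3 (Happy).
p1 = np.array([0.05, 0.05, 0.10, 0.50, 0.10, 0.10, 0.10])
p2 = np.array([0.10, 0.05, 0.05, 0.45, 0.15, 0.10, 0.10])
p3 = np.array([0.40, 0.05, 0.05, 0.20, 0.10, 0.10, 0.10])
label = ensemble_predict([p1, p2, p3])   # -> 3 (Happy)
```

Note that the vote counts only argmax decisions; a member that is confidently wrong carries no more weight than one that is marginally wrong, which is part of what makes the ensemble robust.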
To transfer the created model to the online learning environment, which focuses on the input process, a video is treated as a sequence of images called frames. Experiments have been conducted on various video sequences with multiple faces occurring at different sizes, faces disappearing from the sequence or becoming occluded, and faces changing poses. The input video sequence is preprocessed by detecting and tracking the face, and a probability value is computed at different scales and for two different views. The frame rate of the sequences was 30 frames per second. The captured frames are passed to the proposed model, which is saved, loaded, and applied to real-time video from the camera, and FER is achieved through emotion classification during online learning.
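Classifying every frame of a 30 fps stream is usually unnecessary; evenly sampling frames keeps the real-time pipeline cheap. The sketch below is an illustration with a hypothetical per-second sampling rate, not a parameter stated in the paper.

```python
def sample_frames(total_frames, fps=30, per_second=5):
    """Pick evenly spaced frame indices from an fps-rate stream so that
    roughly `per_second` frames per second reach the FER classifier."""
    step = max(fps // per_second, 1)
    return list(range(0, total_frames, step))

# One second of 30 fps video, classifying 5 of its frames.
indices = sample_frames(30, fps=30, per_second=5)
# -> [0, 6, 12, 18, 24]
```

The selected frames would each pass through face detection, cropping, and the ensemble classifier described above.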
To evaluate our model, we used the commonly used precision, recall, F1-score and accuracy metrics, based on the four combinations of predicted and actual values in multi-class classification: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). For a given class n, a true positive (TP) is a sample of class n predicted as class n; a false positive (FP) is a sample of another class predicted as class n; a false negative (FN) is a sample of class n predicted as another class; and a true negative (TN) is a sample of another class predicted as another class. These metrics were computed using the following equations:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy, the proportion of correctly classified samples, both positive and negative, out of all samples, should be as high as possible.
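The accuracy formula above, together with precision, recall, and F1, can be computed directly from the one-vs-rest counts. The counts in the usage example are hypothetical, chosen only to exercise the formulas.

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from one-vs-rest confusion counts."""
    precision = tp / (tp + fp)                    # of predicted positives, how many are right
    recall = tp / (tp + fn)                       # of actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)    # all correct over all samples
    return precision, recall, f1, accuracy

# Hypothetical counts for one emotion class.
p, r, f1, acc = metrics(tp=80, fp=20, fn=10, tn=90)
# precision 0.8, recall ~0.889, accuracy 0.85
```

For a multi-class problem like FER2013, these per-class values are typically averaged (macro or weighted) to summarize the whole confusion matrix.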

4. Experimental Results

All the experiments were run on a Linux operating system with an 11th Gen Intel Core i9-11900 @ 2.50 GHz CPU, 32 GB (2 × 16 GB) CORSAIR VENGEANCE LPX DDR4-3200 RAM, and an NVIDIA GeForce 1060 6 GB GPU. All experiments used the FER2013 dataset, a benchmark FER dataset consisting of 35,887 48 × 48 pixel 8-bit greyscale images of various people’s facial expressions, labelled by their facial emotions, as shown in Figure 4.
The data points were split into 3 sets: 28,709 training data points, 3589 validation data points, and 3589 test data points, as shown in Table 3. The facial images encompass both male and female subjects and 7 emotion classes (Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral), as shown in Figure 5. Due to varying levels of exposure, illumination, and occlusion, manual annotation is only approximately 65% accurate on this dataset, and the imbalanced dataset is challenging, as shown in Figure 5 and Figure 6.
The data are imbalanced across the 7 classes. The most common class of the FER dataset is Happy, followed by Neutral, Sad, Fear, Angry, Surprise, and Disgust. In particular, the Disgust class is more difficult than the others. The hyperparameters of the CNNs are optimized for FER2013, as shown in Table 4; moreover, the different hyperparameters of the CNN models are fine-tuned. The proposed method was evaluated on FER2013.
The accuracies (%) of the models on the FER dataset are compared in Table 4. The accuracy of the 7-ensemble CNN is compared with those of other models [26,27,28], which achieved accuracies of 71.20%, 72.40%, 73.39%, and 73.40% in similar environments on FER2013, each proposing a developed and fine-tuned CNN model under the same 7-class, imbalanced conditions.
Our proposed model, the HoE-7 ensemble, achieves an accuracy of 75.15% by combining Xception, DCNN, VGG16, EfficientNetB2, DenseNet, ResNet50, and InceptionResNetV2 via Equation (1). The HoE-6 ensemble, which combines DCNN, VGG16, EfficientNetB2, DenseNet, ResNet50, and InceptionResNetV2, achieves a slightly lower accuracy of 73.73%. The single CNN models Xception, DCNN, EfficientNetB2, VGG16, DenseNet, ResNet50, and InceptionResNetV2 achieve accuracies of 65.19%, 66.75%, 68.09%, 68.20%, 69.10%, 69.70%, and 71.60%, respectively. The experimental results demonstrate the robustness and strong performance of the 7-ensemble CNN model.
Figure 7 depicts the confusion matrix of the final model on the FER2013 test set, which classifies samples into 7 emotions. The confusion matrix is analyzed using true labels and predicted labels with values between 0 and 1, where 1 is the highest true positive rate for a true class and predicted class. The analysis of model performance shows that the model most effectively categorizes the “happy” and “surprise” emotions, because the numbers of training and validation points for these emotions were sufficient for our proposed model, yielding the best classification performance. Although still challenging to classify, the HoE-7 CNN yields robust and useful FER accuracies of 0.91 and 0.85 on these two emotions, respectively. It achieves accuracies of 0.80 on the Neutral class and 0.72 on the Disgust class, and on the Angry, Sad, and Fear classes it achieves accuracies of 0.69, 0.61, and 0.57, respectively. The challenge of FER lies in the multiple classes used to classify FER2013.

5. Discussion

The proposed methodology yields comprehensive experimental results that demonstrate the superiority of the HoE-CNN approach over traditional methods. The true positive performance of each class label and the distribution of predictions are robust thanks to the ensemble methodology, as shown in Figure 8. This evaluation uses benchmark datasets commonly used in FER research, showcasing the model’s ability to handle various expressions, poses, and lighting conditions. We also discuss the implications of our findings for educational applications. Ensemble methods often sacrifice interpretability for improved accuracy: understanding the decision-making process of an ensemble can be challenging due to the complexity introduced by multiple classifiers, and research efforts are needed to develop techniques that provide insight into ensemble decisions and make the models more interpretable. As dataset sizes continue to increase, scalability becomes a major concern for ensemble methods; training and combining many classifiers can be computationally expensive and time-consuming, and researchers are exploring parallel and distributed computing techniques to address these scalability issues. The selection and combination of diverse classifiers are crucial factors in the performance of ensemble methods. Ongoing research focuses on optimizing ensemble diversity through innovative ensemble generation algorithms, ensemble pruning techniques, or ensemble selection strategies, and developing efficient and effective algorithms to automatically determine the optimal diversity in ensembles remains an active research area. Ensemble methods are also susceptible to noisy or corrupted data, which can adversely affect ensemble performance, and research on robust ensemble techniques that are resilient to noise and outliers in data is ongoing.
Approaches such as robust aggregation techniques and noise detection and removal methods are being explored to enhance the resilience of ensemble methods against data imperfections.
In the context of online learning, this work integrates facial expression recognition (FER) technology into an innovative model that offers a transformative approach to education by fostering a more interactive and personalized learning environment. By analyzing facial expressions in real time, this combination provides educators with invaluable insights into students’ engagement levels, emotional responses, and comprehension, as shown in Figure 9.
The applied models enable instructors to adapt their teaching methods dynamically, offering personalized assistance and tailored resources to students based on their emotional cues. Additionally, FER facilitates the creation of a more empathetic and supportive learning atmosphere, allowing educators to provide timely assistance and guidance to students in need. Ultimately, this integration not only enhances the educational experience by promoting active engagement but also enables educators to refine their teaching strategies for better learning outcomes. However, face detection can fail when a student’s face is not visible, and it is constrained by the limitations of the student’s camera.

6. Conclusions

In modern e-learning environments, understanding students’ emotional states is crucial for personalized education. In this paper, we explore the application of a homogeneous ensemble CNN method to FER, combining multiple CNN models to improve emotion recognition accuracy. By enhancing our ability to detect students’ emotional responses, we aim to adapt educational content, provide real-time support, and create more engaging learning environments. The research findings have significant implications for the education field. The improved FER accuracy can be incorporated into educational technology to form more responsive and personalized learning environments. Teachers and educators can utilize this technology to gauge students’ emotional states and adapt their teaching methods, thus fostering more empathetic and effective educational experiences. In this paper, we design a model for face recognition and detail the proposed model using convolutional networks. Additionally, we present the proposed face recognition method as three processes: facial detection, then image processing, and finally the FER process with transfer to online learning education applications. We study FER on the FER2013 dataset, which encompasses 7 emotions for training, validation, and testing: the “Sad”, “Fear”, “Disgust”, “Neutral”, “Happy”, “Angry” and “Surprise” classes. The HoE-CNN approach advances the classification field, offering improved accuracy and robustness by harnessing the power of multiple classifiers. Advancements in ensemble techniques, such as performance improvements, imbalanced data handling, diversity measures, and combination strategies, contribute significantly to the effectiveness of the proposed models. However, challenges related to interpretability, scalability, diversity optimization, and noisy data persist.
Continued research efforts are crucial to address these challenges and further enhance the capabilities of ensemble methods in classification. By overcoming these challenges, ensemble methods can continue to significantly contribute to various education domains.

Author Contributions

Conceptualization, R.L. and W.S.; methodology, R.L.; software, R.L.; validation, R.L., W.S. and J.K.; formal analysis, R.L.; investigation, R.L.; resources, R.L.; data curation, R.L.; writing—original draft preparation, R.L. and W.S.; writing—review and editing, W.S. and J.K.; visualization, R.L.; supervision, W.S. and J.K.; project administration, W.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical and informed consent for data used.

Informed Consent Statement

The corresponding author has completed the CITI Program courses Social & Behavioral Research—Basic/Refresher (Curriculum Group), Social & Behavioral Research—Basic/Refresher (Course Learner Group), and 2—Refresher Course (Stage), verifiable at https://www.citiprogram.org/verify/?w43848f6b-6e46-4ca7-87d8-bd0ce92d5663-46678776, accessed on 17 January 2023.

Data Availability Statement

The data used in this paper are publicly available and can be downloaded at https://www.kaggle.com/datasets/msambare/fer2013 (accessed on 6 December 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lajoie, S.P.; Naismith, L.; Poitras, E.; Hong, Y.-J.; Cruz-Panesso, I.; Ranellucci, J.; Mamane, S.; Wiseman, J. Technology-rich tools to support self-regulated learning and performance in medicine. In International Handbook of Metacognition and Learning Technologies; Springer: New York, NY, USA, 2013; pp. 229–242. [Google Scholar]
  2. Ez-Zaouia, M.; Tabard, A.; Lavoué, E. Emodash: A dashboard supporting retrospective awareness of emotions in online learning. Int. J. Hum. Comput. Stud. 2020, 139, 102411. [Google Scholar] [CrossRef]
  3. Kärner, T.; Kögler, K. Emotional states during learning situations and students’ self-regulation: Process-oriented analysis of person-situation interactions in the vocational classroom. Empir. Res. Vocat. Educ. Train. 2016, 8, 12. [Google Scholar] [CrossRef]
  4. Ayvaz, U.; Gürüler, H.; Devrim, M.O. Use of facial emotion recognition in e-learning systems. Inf. Technol. Learn. Tools 2017, 60, 95–104. [Google Scholar] [CrossRef]
  5. Chickerur, S.; Joshi, K. 3D face model dataset: Automatic detection of facial expressions and emotions for educational environments. Br. J. Educ. Technol. 2015, 46, 1028–1037. [Google Scholar] [CrossRef]
  6. Khalfallah, J.; Slama, J.B.H. Facial Expression Recognition for Intelligent Tutoring Systems in Remote Laboratories Platform. Procedia Comput. Sci. 2015, 73, 274–281. [Google Scholar] [CrossRef]
  7. Krithika, L.B.; GG, L.P. Student Emotion Recognition System (SERS) for e-learning Improvement Based on Learner Concentration Metric. Procedia Comput. Sci. 2016, 85, 767–776. [Google Scholar] [CrossRef]
  8. Petrovica, S.; Anohina-Naumeca, A.; Ekenel, H.K. Emotion Recognition in Affective Tutoring Systems: Collection of Ground-truth Data. Procedia Comput. Sci. 2017, 104, 437–444. [Google Scholar] [CrossRef]
  9. Yang, D.; Alsadoon, A.; Prasad, P.; Singh, A.; Elchouemi, A. An Emotion Recognition Model Based on Facial Recognition in Virtual Learning Environment. Procedia Comput. Sci. 2018, 125, 2–10. [Google Scholar] [CrossRef]
  10. Pramerdorfer, C.; Kampel, M. Facial expression recognition using convolutional neural networks: State of the art. arXiv 2016, arXiv:1612.02903. [Google Scholar]
  11. Mou, L.; Ghamisi, P.; Zhu, X.X. Deep Recurrent Neural Networks for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3639–3655. [Google Scholar] [CrossRef]
  12. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  13. Khabarlak, K.; Koriashkina, L. Fast Facial Landmark Detection and Applications: A Survey. J. Comput. Sci. Technol. 2022, 22, e02. [Google Scholar] [CrossRef]
  14. Khan, H.; Haq, I.U.; Munsif, M.; Mustaqeem; Khan, S.U.; Lee, M.Y. Automated Wheat Diseases Classification Framework Using Advanced Machine Learning Technique. Agriculture 2022, 12, 1226. [Google Scholar] [CrossRef]
  15. Khan, H.; Hussain, T.; Khan, S.U.; Khan, Z.A.; Baik, S.W. Deep multi-scale pyramidal features network for supervised video summarization. Expert Syst. Appl. 2023, 237, 121288. [Google Scholar] [CrossRef]
  16. Nixon, M.; Aguado, A. Feature Extraction and Image Processing for Computer Vision; Academic Press: Cambridge, MA, USA, 2019. [Google Scholar]
  17. Guermazi, R.; Ben Abdallah, T.; Hammami, M. Facial micro-expression recognition based on accordion spatio-temporal representation and random forests. J. Vis. Commun. Image Represent. 2021, 79, 103183. [Google Scholar] [CrossRef]
  18. Goodfellow, I.J.; Erhan, D.; Carrier, P.L.; Courville, A.; Mirza, M.; Hamner, B.; Cukierski, W.; Tang, Y.; Thaler, D.; Lee, D.-H.; et al. Challenges in representation learning: A report on three machine learning contests. Neural Netw. 2015, 64, 59–63. [Google Scholar] [CrossRef]
  19. Picard, R.W.; Vyzas, E.; Healey, J. Toward Machine Emotional Intelligence: Analysis of Affective Physiological State. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 1175–1191. [Google Scholar] [CrossRef]
  20. Pardos, Z.A.; Baker, R.S.; San Pedro, M.; Gowda, S.M.; Gowda, S.M. Affective States and State Tests: Investigating How Affect and Engagement during the School Year Predict End-of-Year Learning Outcomes. J. Learn. Anal. 2014, 1, 107–128. [Google Scholar] [CrossRef]
  21. Behera, A.; Matthew, P.; Keidel, A.; Vangorp, P.; Fang, H.; Canning, S. Associating Facial Expressions and Upper-Body Gestures with Learning Tasks for Enhancing Intelligent Tutoring Systems. Int. J. Artif. Intell. Educ. 2020, 30, 236–270. [Google Scholar] [CrossRef]
  22. Hasan, M.K.; Ahsan, M.S.; Newaz, S.S.; Lee, G.M. Human face detection techniques: A comprehensive review and future research directions. Electronics 2021, 10, 2354. [Google Scholar] [CrossRef]
  23. Kumar, A.; Kaur, A.; Kumar, M. Face detection techniques: A review. Artif. Intell. Rev. 2019, 52, 927–948. [Google Scholar] [CrossRef]
  24. Rajan, S.; Chenniappan, P.; Devaraj, S.; Madian, N. Facial expression recognition techniques: A comprehensive survey. IET Image Process. 2019, 13, 1031–1040. [Google Scholar] [CrossRef]
  25. Khaireddin, Y.; Chen, Z. Facial Emotion Recognition: State of the Art Performance on FER2013. arXiv 2021, arXiv:2105.03588. [Google Scholar]
  26. Pham, L.; Vu, T.H.; Tran, T.A. Facial Expression Recognition Using Residual Masking Network. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 4513–4519. [Google Scholar] [CrossRef]
  27. Pecoraro, R.; Basile, V.; Bono, V. Local Multi-Head Channel Self-Attention for Facial Expression Recognition. Information 2022, 13, 419. [Google Scholar] [CrossRef]
  28. Connie, T.; Al-Shabi, M.; Cheah, W.P.; Goh, M. Facial Expression Recognition Using a Hybrid CNN–SIFT Aggregator. In Multi-Disciplinary Trends in Artificial Intelligence. MIWAI 2017; Phon-Amnuaisuk, S., Ang, S.P., Lee, S.Y., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2017; Volume 10607. [Google Scholar] [CrossRef]
  29. Laraib, U.; Shaukat, A.; Khan, R.A.; Mustansar, Z.; Akram, M.U.; Asgher, U. Recognition of Children’s Facial Expressions Using Deep Learned Features. Electronics 2023, 12, 2416. [Google Scholar] [CrossRef]
  30. Venkatesan, R.; Shirly, S.; Selvarathi, M.; Jebaseeli, T.J. Human Emotion Detection Using DeepFace and Artificial Intelligence. Eng. Proc. 2023, 59, 37. [Google Scholar] [CrossRef]
  31. Alsharekh, M.F. Facial Emotion Recognition in Verbal Communication Based on Deep Learning. Sensors 2022, 22, 6105. [Google Scholar] [CrossRef] [PubMed]
Figure 1. The framework of the proposed model and its transfer to an online learning environment.
Figure 2. Image processing for the proposed model.
Figure 3. Architecture of the homogeneous ensemble convolutional neural network.
Figure 4. Example of benchmark images contained in the FER2013 database (Available online: https://www.kaggle.com/datasets/msambare/fer2013 (accessed on 6 December 2023)).
Figure 5. The class label distribution of the training data.
Figure 6. The class label distribution of the validation data.
Figure 7. Confusion matrix of the best-performing proposed model (HoE 7-CNN).
Figure 8. Examples of the true-positive performance for each class label and the distribution of predictions, calculated using Equation (1).
Figure 9. Model transfer to online learning education using the HoE-CNN approach.
Table 1. Comparison of different preprocessing operations, class labels, algorithms, and types of ensembles.

| Authors | Preprocessing | Class Labels | Algorithms | Type of Ensemble |
|---|---|---|---|---|
| Khaireddin, Y. and Chen, Z. (2021) [25] | Augmentation | Angry, Disgust, Fear, Happy, Neutral, Sad, Surprise | VGGNet | None |
| Pham, L. et al. (2021) [26] | Face detection, residual masking block | Angry, Disgust, Fear, Happy, Neutral, Sad, Surprise | ResNet (like-ensemble) | Like-homogeneous |
| Pecoraro, R. et al. (2022) [27] | Face detection | Angry, Disgust, Fear, Happy, Neutral, Sad, Surprise | LHC-Net (ResNet34v2 backbone) | None |
| Connie, T. et al. (2017) [28] | Key-point extraction | Angry, Disgust, Fear, Happy, Neutral, Sad, Surprise | Hybrid CNN–SIFT aggregator | None |
| Proposed model | Face detection, augmentation, data normalization | Angry, Disgust, Fear, Happy, Neutral, Sad, Surprise | Ensemble CNN | Homogeneous |
Table 2. The structure of the CNN models and the optimized parameter settings for each model.

| CNN Models | Layers | Parameter Settings |
|---|---|---|
| DCNN | 9 | Optimizer = Adam, learning rate = 0.001, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1 × 10⁻⁷, epochs = 100, batch size = 64 |
| EfficientNetB2 | 342 | Optimizer = Adam, learning rate = 0.001, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1 × 10⁻⁷, epochs = 100, batch size = 64 |
| InceptionResNetV2 | 164 | Optimizer = Adam, learning rate = 0.005, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1 × 10⁻⁷, epochs = 100, batch size = 32 |
| ResNet50 | 50 | Optimizer = Adam, learning rate = 0.005, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1 × 10⁻⁷, epochs = 100, batch size = 64 |
| Xception | 71 | Optimizer = Adam, learning rate = 0.001, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1 × 10⁻⁷, epochs = 100, batch size = 64 |
| DenseNet | 20 | Optimizer = Adam, learning rate = 0.0001, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1 × 10⁻⁸, epochs = 100, batch size = 32 |
| VGG16 | 16 | Optimizer = SGD, learning rate = 0.01, momentum = 0.9, nesterov = True, epochs = 100, batch size = 32 |
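The Adam hyperparameters in Table 2 (learning rate, beta_1, beta_2, epsilon) can be made concrete with a minimal sketch of a single Adam update step. This is an illustrative implementation of the standard Adam rule under the table's most common settings, not the authors' training code:

```python
def adam_step(w, g, m, v, t, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-7):
    """One Adam update for a scalar weight w with gradient g.

    m and v are the running first- and second-moment estimates;
    t is the 1-based step counter used for bias correction.
    Defaults mirror the settings reported for most models in Table 2.
    """
    m = beta_1 * m + (1 - beta_1) * g          # first-moment (mean) estimate
    v = beta_2 * v + (1 - beta_2) * g * g      # second-moment (variance) estimate
    m_hat = m / (1 - beta_1 ** t)              # bias-corrected moments
    v_hat = v / (1 - beta_2 ** t)
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)  # parameter update
    return w, m, v

# First step from w = 0 with gradient 1.0: bias correction makes the
# initial step size approximately equal to the learning rate (0.001).
w, m, v = adam_step(0.0, 1.0, 0.0, 0.0, 1)
```

In a framework such as Keras, these same values would simply be passed to the optimizer's constructor; the sketch only shows what the parameters control.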
Table 3. The dataset splits.

| Datasets | Training Data | Validation Data | Test Data |
|---|---|---|---|
| Number of samples | 28,709 | 3589 | 3589 |
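As a quick arithmetic check (not stated in the paper itself), the split in Table 3 sums to the 35,887 images published with FER2013 and corresponds to roughly an 80/10/10 partition:

```python
# Verify the Table 3 split against FER2013's published total of 35,887 images.
train, val, test = 28709, 3589, 3589
total = train + val + test
fractions = [round(n / total, 2) for n in (train, val, test)]
# total == 35887, fractions == [0.8, 0.1, 0.1]
```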
Table 4. Comparison of model accuracies on the FER2013 dataset.

| Models | Accuracy (%) |
|---|---|
| Xception | 65.19 |
| DCNN | 66.75 |
| EfficientNetB2 | 68.09 |
| VGG16 | 68.20 |
| DenseNet | 69.10 |
| ResNet50 | 69.70 |
| InceptionResNetV2 | 71.60 |
| VGG (fine-tuned) [25] | 71.20 |
| ResNet (like-ensembles) [26] | 72.40 |
| LHC-Net [27] | 73.39 |
| Hybrid CNN–SIFT [28] | 73.40 |
| Proposed method (6-CNN ensemble) | 73.73 |
| Proposed method (7-CNN ensemble) | 75.15 |
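The exact aggregation rule used to combine the ensemble members is not given in this excerpt; one common scheme for a homogeneous ensemble of CNN classifiers is soft voting, i.e., averaging the members' per-class softmax outputs and taking the argmax. The sketch below illustrates that idea with hypothetical probabilities (class order follows FER2013):

```python
import numpy as np

# FER2013 class order used for illustration.
CLASSES = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]

def soft_vote(prob_list):
    """Average the (n_samples, 7) probability arrays produced by each
    ensemble member and return the most likely class index per sample."""
    avg = np.mean(np.stack(prob_list, axis=0), axis=0)
    return avg.argmax(axis=1)

# Toy example with two member models on one image: each member is
# individually uncertain, but the averaged distribution favors "Happy".
p1 = np.array([[0.10, 0.05, 0.10, 0.40, 0.15, 0.10, 0.10]])
p2 = np.array([[0.05, 0.05, 0.30, 0.35, 0.10, 0.05, 0.10]])
pred = soft_vote([p1, p2])
print(CLASSES[pred[0]])  # Happy
```

Averaging tends to suppress the idiosyncratic errors of individual members, which is consistent with the ensemble rows in Table 4 outperforming every single model.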

Lawpanom, R.; Songpan, W.; Kaewyotha, J. Advancing Facial Expression Recognition in Online Learning Education Using a Homogeneous Ensemble Convolutional Neural Network Approach. Appl. Sci. 2024, 14, 1156. https://doi.org/10.3390/app14031156
