1. Introduction
In the wake of the COVID-19 pandemic, online video communication has become widespread, and we have come to realize that facial expressions on the screen play a significant role in mutual interactions. In recent years, facial expression recognition has emerged as a critical technology for detecting potential dangers, thanks to rapid advances in artificial intelligence and computer vision. For instance, it is now used to detect driver fatigue in automotive safety systems [1]. In such a system, driver behavior is monitored through an in-car camera, which analyzes head posture, eye movements, and facial expressions to estimate the driver's fatigue level. In the medical field, emotion recognition has become a preventive measure for conditions such as depression and other mental health disorders: the system monitors individuals' emotional states and, if it detects prolonged negative emotions, encourages users to seek, or connects them to, professional mental health resources.
In the past, people primarily relied on experience, combined with their cognitive understanding of others, to judge emotions when receiving emotional signals. However, such judgments often carried biases. Today, people are turning to objective machines for expression recognition. These models automatically extract facial features and classify them into emotional categories. Yang et al. [2] combined VGG16 and LBP to extract facial features and classified six emotion categories with softmax. Agrawal et al. [3], aiming to extract useful facial features, designed two novel convolutional neural network architectures and achieved 69% accuracy.
Traditional facial expression recognition faces several challenges, including imbalanced data, privacy concerns, and the limitation of handling only single-person facial images. For example, datasets containing human faces are often incomplete and limited owing to privacy concerns. Such datasets also suffer from data sparsity because some emotion categories have few samples. Furthermore, existing facial emotion recognition systems are often constrained to processing single-face images. We therefore adopted data augmentation to enhance the facial images and alleviate the data sparsity problem. Subsequently, a CNN combined with the Haar cascade classifier was employed as the backbone of the method. While this combination may not yield the highest accuracy, it is capable of real-time detection of multiple individuals, which makes it valuable in applications such as automatic attendance systems in classrooms, negative-emotion alerts for mental health patients, and surgical assessment of facial deformities. We propose a facial expression recognition method that runs on low-cost hardware. In this method, facial expression recognition is treated as an emotion classification task: we augment the facial image data, use the Haar cascade classifier for face localization, and employ a convolutional neural network (CNN) for emotion classification.
The main contributions of this article are as follows:
Integrating the Haar cascade classifier and the CNN model to enhance feature extraction and ensure the real-time capability of the model.
Designing a facial expression recognition system that handles both single-person and multi-person images.
The rest of this article is organized as follows: Section 2 reviews previous research on facial expression recognition. Section 3 describes the proposed approach. Section 4 presents the experimental results and a discussion of these results. Finally, Section 5 concludes this article.
2. Related Works
Facial expression is an important element of human emotional expression. Therefore, facial expression recognition has consistently been a critical research theme in affective computing and computer vision. Yang et al. [2] proposed a weighted mixture deep neural network that combined VGG16 and LBP with weighted fusion to extract facial features and used softmax to classify six emotion categories. They experimented on the JAFFE, CK+, and Oulu-CASIA datasets, achieving excellent results in each case. Agrawal et al. [3] studied the impact of different combinations of filter sizes and quantities on classification accuracy and designed two architectures best suited for facial expression recognition, evaluating their performance on the FER-2013 dataset. Sarvakar et al. [4] proposed facial emotion recognition using convolutional neural networks (FERC). FERC is divided into two parts: background removal and facial feature extraction. A CNN model segments the facial region during background removal; in the facial feature extraction stage, FERC represents facial features as expression vectors, which are then fed into a CNN model that classifies the emotions. Khattak et al. [5] enhanced the accuracy of facial expression recognition by adding layers to the CNN model; their model also performed effective age and gender classification. Zeng et al. [6] proposed a facial expression recognition method that automatically distinguishes expressions with high accuracy. Facial geometric features and appearance features were introduced into the model, and a deep sparse autoencoder (DSAE) was adopted to learn a robust and discriminative facial expression recognition system. Zeng et al. [7] proposed a real-time facial expression recognition and learning feedback system that instantly analyzes students' facial expressions and assesses their learning emotions. Many other facial expression recognition systems also use CNN models as their backbone [8,9,10]; hence, it is evident that CNN models are highly beneficial for facial expression recognition tasks. Gao et al. [11] presented cross-domain facial expression recognition (CD-FER), which integrates features across domains and incorporates dynamic label weighting. This enables the model to exploit transferable cross-domain local features, improving performance in facial emotion recognition tasks. Tang et al. [12] introduced BGA-Net, a novel approach to facial expression recognition that integrates bidirectional gated recurrent units (BiGRUs), a convolutional neural network (CNN), and an attention mechanism. The primary objective of BGA-Net is to capture dependencies between sub-regions of the image, focusing on the most discriminative areas through the attention mechanism.
In deep learning, data scarcity caused by collection challenges or limited sample sizes often leads to insufficient training data, which can harm a model's training and learning performance. Data augmentation is therefore employed to overcome the limitations of limited training data. Lai [13] utilized additional image samples and data augmentation techniques to reduce errors in expression recognition. Liu [14] used GAN-based image enhancement and data augmentation to enhance image quality and improve model accuracy. Porcu et al. [15] evaluated the impact of various data augmentation techniques on facial expression recognition. Umer et al. [16] proposed a novel data augmentation scheme that combines a bilateral filter, a sharpening filter, image rotation, and more; it enhances the fine-tuning capability of a CNN model and mitigates overfitting. Psaroudakis et al. [17] proposed MixAugment, a data augmentation method based on Mixup, which improved the recognition rate on in-the-wild data in facial expression recognition tasks. Bobojanov et al. [18] introduced a facial emotion recognition model based on the vision transformer (ViT). Recognizing the common issue of data imbalance in widely used emotion datasets, they evaluated their model on RAF-DB and FER-2013 and employed data augmentation techniques to construct a balanced dataset.
A wavelet is a mathematical tool used in signal and image processing and is frequently applied in tasks such as facial recognition. Wavelets offer advantages in multi-scale analysis and feature extraction, and their feature compression properties make them particularly significant for real-time facial recognition. The Symlet, Coiflet, and Haar wavelets are commonly employed in facial recognition. The Symlet wavelet is characterized by its symmetry and flexibility; with adjustable parameters, it provides improved accuracy in facial identification. The Coiflet wavelet is a family of compactly supported wavelets whose finite support effectively reduces computational complexity while preserving essential features, so it performs well in both the time and frequency domains. The Haar wavelet is the simplest form of a wavelet; its straightforward computation gives it a significant advantage in real-time signal processing. While the Symlet and Coiflet wavelets can depict more detailed facial contour information, their computational demands make them less feasible in resource-constrained scenarios. Conversely, the Haar wavelet cannot capture intricate facial contour details, but its minimal resource requirements and precise edge detection contribute to excellent performance in locating faces. In our system, detecting facial positions matters more than capturing detailed facial contours. Therefore, we opted to incorporate the Haar wavelet into our model.
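To make the comparison concrete, the sketch below performs a single-level 2-D Haar decomposition with the PyWavelets package (an assumption for illustration; the paper does not state which library, if any, it used for wavelet analysis). Substituting "sym4" or "coif1" selects the Symlet or Coiflet families discussed above.

```python
import numpy as np
import pywt

# Dummy 48x48 grayscale face patch; real code would load an image.
face = np.random.rand(48, 48).astype(np.float32)

# Single-level 2-D discrete wavelet transform with the Haar wavelet.
# cA holds the coarse approximation; cH/cV/cD hold horizontal,
# vertical, and diagonal detail coefficients (edge information).
cA, (cH, cV, cD) = pywt.dwt2(face, "haar")

# "sym4" or "coif1" would trade speed for richer contour detail.
```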
Goyani et al. [19] discussed various applications of Haar features in different fields and categorized the proposed methods. Shen et al. [20] employed an enhanced active shape model (ASM) for facial expression recognition, utilizing Haar features to automate face detection in their applications and research. Liu et al. [21] proposed an AdaBoost-based face detection algorithm using Haar-like features, aiming to reduce extended training times and low detection efficiency. Mishra et al. [22] proposed a facial recognition system that combines Haar-like features, face detection, and OpenCV to compare facial features in an image against a facial dataset. Guo et al. [23] proposed a CNN-enhanced multi-level Haar wavelet features fusion network (CNN-MHWF2N) that combines the spatial features of a 2-D CNN with Haar wavelet decomposition features to obtain spectral and spatial information. Wu et al. [24] proposed a face image recognition method based on Haar-like features and Euclidean distance that improved the accuracy of facial recognition. Zhang et al. [25] proposed a weak classifier that improved Haar-like features and the AdaBoost algorithm; by changing the weight update method of AdaBoost, they achieved an effective accuracy improvement. Farkhod et al. [26] presented a graph-based approach to emotion recognition using a two-step process that combines the Haar cascade classifier and a MediaPipe face mesh model: the Haar cascade classifier is first used for face detection, and the MediaPipe face mesh model then performs precise landmark localization. This combination allows the extraction of facial features critical for emotion recognition, and their framework predicted emotions effectively even when individuals wore masks. Chinimilli et al. [27] introduced an attendance system that combines the Haar cascade and the Local Binary Pattern Histogram (LBPH) algorithm: the Haar cascade is employed for face detection, while LBPH handles face recognition. Shafique et al. [28] employed the Haar cascade classifier to detect social anxiety disorder (SAD), using gaze interaction/avoidance to determine whether individuals were afflicted by SAD. Takiddin et al. [29] proposed a Haar-cascade-based facial deformity measurement method: the cascade was trained on normal and deformed facial data to learn the differences among faces and to output a score indicating the degree of facial deformity.
3. Proposed Method
In this section, we introduce the details of the proposed method. Our system architecture is illustrated in Figure 1. The system is divided into training and test stages. In the training stage, we trained a Haar cascade classifier to segment facial areas in an image and trained a CNN model to detect human emotion. In the test stage, the Haar cascade classifier detects the presence of a human face in an image and segments the facial region for the CNN model, which then yields an emotion prediction based on the facial features. Furthermore, we provide a user interface (UI) through which users can upload the images they want to recognize.
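As a minimal sketch of the test stage under stated assumptions (OpenCV's pretrained frontal-face cascade; a trained Keras CNN saved under the hypothetical filename emotion_cnn.h5; 48x48 grayscale input, following FER-2013; FER-2013's usual class order), the flow could look like this:

```python
import cv2
import numpy as np
from tensorflow.keras.models import load_model

# FER-2013's conventional class order (an assumption for this sketch).
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

# Pretrained frontal-face Haar cascade shipped with OpenCV.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
model = load_model("emotion_cnn.h5")  # hypothetical path to the trained CNN

def recognize(image_path):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Detect every face in the image (multi-person capable).
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    results = []
    for (x, y, w, h) in faces:
        face = cv2.resize(gray[y:y + h, x:x + w], (48, 48))  # FER-2013 size
        face = face.astype("float32") / 255.0
        probs = model.predict(face.reshape(1, 48, 48, 1), verbose=0)[0]
        results.append(((x, y, w, h), EMOTIONS[int(np.argmax(probs))]))
    return results
```

Because the cascade returns one bounding box per detected face, the same loop serves both single-person and multi-person images.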
3.1. Data Augmentation
Data bias and data sparsity are persistent challenges in deep learning training; they can lead to inaccuracy, poor generalization, and model bias, so reducing their impact has become a crucial concern in the field. Their causes include data imbalance, sampling selection bias, and limited sample sizes. In this study, we used data augmentation to alleviate the impact of data sparsity on deep learning models.
Analysis of the FER-2013 dataset revealed that the "disgust" class is significantly scarcer than the other emotion classes. This scarcity made it difficult for the model to predict the "disgust" emotion accurately and reduced overall model performance; the confusion matrix is illustrated in Figure 2. To address this issue, we employed two data augmentation techniques, random flipping and brightness adjustment, to increase the number of "disgust" samples in FER-2013 and thereby reduce the adverse effects of data sparsity.
3.1.1. Image Flipping
We adopted random flipping in our data augmentation. Horizontal flipping was implemented with the function library provided by TensorFlow and is illustrated in Figure 3.
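The exact TensorFlow call is not stated in the paper; tf.image.random_flip_left_right is one standard choice and is shown here as an illustrative sketch:

```python
import tensorflow as tf

# Dummy batch of 48x48 grayscale images with values in [0, 1].
images = tf.random.uniform((8, 48, 48, 1))

# Each image in the batch is flipped left-right with probability 0.5.
flipped = tf.image.random_flip_left_right(images)
```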
3.1.2. Color Adjustment
We adopted the TensorFlow function library to implement color adjustment in our data augmentation. A random value is added to the image pixels to increase or decrease the brightness. An example is illustrated in Figure 4.
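Again as an illustrative sketch (the paper does not name the call), tf.image.random_brightness implements this random additive shift; max_delta=0.2 is an assumed value:

```python
import tensorflow as tf

# Dummy batch of 48x48 grayscale images with values in [0, 1].
images = tf.random.uniform((8, 48, 48, 1))

# Add a single random delta drawn from [-0.2, 0.2] to every pixel,
# then clip back to the valid range.
brighter = tf.clip_by_value(
    tf.image.random_brightness(images, max_delta=0.2), 0.0, 1.0)
```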
3.2. Haar Cascade Classification
The Haar cascade classifier is a machine learning approach for object detection, especially face detection. It combines speed with high accuracy and is often used in biometric identification and security systems. In our system, the Haar cascade classifier was employed to detect the presence of faces and extract facial regions from images; these extracted facial regions were then fed into the CNN model for training. The Haar cascade classifier combines Haar wavelet (Haar-like) features with a robust classifier based on the AdaBoost algorithm. In our implementation, we used the Haar cascade classifier from OpenCV (Open Source Computer Vision Library) as the foundational model. This model encompasses the Haar-like feature, integral image, and AdaBoost classifier modules. The Haar cascade classifier architecture is illustrated in Figure 5.
3.2.1. Haar-like Features and Integral Image
Haar-like features consist of multiple sets of black and white rectangular masks that slide over the image and calculate the pixel sum between different pixels to extract the grained visual feature. The mask of Haar-like features consists of the white rectangular and black rectangular. The white rectangular represents areas with bright or highly reflective areas of an object. Conversely, the black rectangular represents areas with dark or shadowed areas of an object. An example of Haar-like features is illustrated in
Figure 6 and
Figure 7. A feature vector can combine with multiple Haar-like features and be used to train a classifier.
However, computing these pixel sums directly is too slow. Therefore, the integral image was designed to accelerate the computation of pixel sums within rectangular regions. Equation (1) defines the feature value as the difference between the pixel sums of the white region $W$ and the black region $B$:

$$ f = \sum_{(x,y) \in W} \mathrm{Pixel}(x,y) - \sum_{(x,y) \in B} \mathrm{Pixel}(x,y) \qquad (1) $$

The integral image is a matrix of the same size as the image in which each entry holds the cumulative sum of pixel values in the rectangle from the top-left corner to that position:

$$ II(x,y) = \sum_{x' \le x} \sum_{y' \le y} \mathrm{Pixel}(x',y') \qquad (2) $$

Given the integral image, the sum of pixel values within any rectangle with top-left corner $(x_1, y_1)$ and bottom-right corner $(x_2, y_2)$ requires only four lookups:

$$ S = II(x_2, y_2) - II(x_1 - 1, y_2) - II(x_2, y_1 - 1) + II(x_1 - 1, y_1 - 1) \qquad (3) $$

eliminating redundant computations and saving time. Here, $\mathrm{Pixel}(x,y)$ represents the pixel value at position $(x,y)$ of the image. While Haar-like features are a simple feature extraction method, they are widely recognized for their performance and computational efficiency in applications such as face detection.
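A minimal NumPy sketch of Equations (2) and (3): the integral image is a double cumulative sum, after which any rectangle sum costs four lookups regardless of the rectangle's size.

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns (Equation (2))."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in the inclusive rectangle via four lookups (Equation (3))."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

img = np.arange(16, dtype=np.int64).reshape(4, 4)
ii = integral_image(img)
# The four-lookup sum matches the direct sum over the region.
assert rect_sum(ii, 1, 1, 2, 2) == img[1:3, 1:3].sum()
```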
3.2.2. AdaBoost (Adaptive Boosting)
AdaBoost was used to select useful Haar-like features and train the classifier. AdaBoost is a machine learning technique that creates a strong classifier by combining multiple weak classifiers. In our system, each weak classifier is a Haar-like feature mask paired with a threshold. AdaBoost then selects the useful weak classifiers and combines them into a strong classifier. Finally, we adopted the Haar cascade approach to integrate the strong classifiers into a multi-stage classifier. An advantage of this approach is that each stage performs threshold-based detection, maintaining high recall while reducing the false-positive rate. The approach is illustrated in Figure 8.
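The boosting step can be illustrated with scikit-learn (an assumption for exposition; OpenCV trains its cascades internally). Depth-1 decision stumps stand in for thresholded Haar-like feature responses, and AdaBoost reweights the training samples to combine them into one strong classifier; the data here is synthetic.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data: rows are hypothetical Haar-like feature responses,
# labels are 1 = face, 0 = non-face.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)

# Each depth-1 stump thresholds a single feature (a weak classifier);
# AdaBoost combines 50 of them into a strong classifier.
# (`estimator` is the scikit-learn >= 1.2 parameter name.)
strong = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1), n_estimators=50)
strong.fit(X, y)
print(strong.score(X, y))
```

A cascade would chain several such strong classifiers, letting early stages reject easy non-face windows cheaply.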
3.3. Convolutional Neural Network (CNN)
A convolutional neural network (CNN) serves as the backbone model in this study. The advantages of CNNs include automatic feature extraction, spatial hierarchy, and scale invariance. The convolutional layers extract facial features automatically, and the model is trained on these features. Facial features form a spatial hierarchy comprising local features (e.g., ears, eyes) and global features; a CNN extracts this hierarchy through multiple convolution and pooling layers, allowing the model to understand images better. Because input images vary in size, our system must also be scale invariant: stacked convolutional kernels with varying receptive field sizes process features at different scales, thereby achieving scale invariance. Because of these three attributes, we ultimately selected the CNN as the primary network for facial expression recognition.
Our CNN model is composed of four convolution layers, four pooling layers, a flatten layer, and two fully connected layers; the overall architecture is shown in Figure 9. In each convolution layer, the filters slide over the input image and emphasize object edges, realizing facial feature extraction; an example of the convolution layer's output is illustrated in Figure 10. We employed max pooling as the pooling method because it preserves high-frequency features more effectively than average pooling. It also reduces the sensitivity of the convolutional layers to edges, achieving feature preservation, dimension reduction, and a decrease in the number of learnable parameters. In the flatten layer, the facial features are mapped into a one-dimensional vector and fed into the fully connected layers as input; the flatten layer is illustrated in Figure 11. In the fully connected layers, the features are integrated and probabilities are computed for the seven emotion classes. Lastly, expression recognition is performed through softmax, and the cross-entropy loss is computed for subsequent model training and evaluation. The softmax and cross-entropy functions are defined by Equations (4) and (5), respectively:

$$ \sigma(s)_i = \frac{e^{s_i}}{\sum_{j=1}^{C} e^{s_j}} \qquad (4) $$

$$ L = -\sum_{i=1}^{C} y_i \log \sigma(s)_i \qquad (5) $$

Here, $C$ represents the number of neurons in the fully connected layer, $s_j$ represents the value of each neuron, $s_i$ denotes the value obtained for the $i$-th class, and $y_i$ is the ground-truth indicator for class $i$.
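A hedged Keras sketch of an architecture with the stated shape follows: four convolution + max-pooling blocks, a flatten layer, and two fully connected layers ending in a seven-way softmax. The filter counts, kernel sizes, and dense width are illustrative assumptions, not the paper's exact values.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(48, 48, 1), num_classes=7):
    model = models.Sequential([layers.Input(shape=input_shape)])
    # Four convolution + max-pooling blocks (filter counts are illustrative).
    for filters in (32, 64, 128, 256):
        model.add(layers.Conv2D(filters, (3, 3), padding="same",
                                activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    # Two fully connected layers; the last outputs softmax probabilities
    # over the seven emotion classes (Equation (4)).
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dense(num_classes, activation="softmax"))
    # Categorical cross-entropy corresponds to Equation (5).
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn()
model.summary()
```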
5. Conclusions
In this study, we proposed a multi-person facial expression recognition system. Our approach aimed to overcome the limitations of sparse data and the restriction to single-person facial images. Leveraging the properties of Haar cascade classifiers, we identified facial regions and focused on the face during subsequent model training; attending to facial regions rather than the background aided more precise feature extraction. Additionally, the multi-face detection capability of the Haar cascade classifier overcame the constraint of single-person facial images. We then employed a convolutional neural network (CNN) as the backbone network for facial feature extraction and efficient classification. Furthermore, our optimization experiments confirmed that the model achieved 69.47% accuracy at epoch 60 when using data augmentation with horizontal flipping. Compared with other methods, our approach outperformed most of them, demonstrating its effectiveness. In a multi-person expression recognition experiment, the proposed approach achieved an accuracy of 55.88%.
However, our system has limitations. To keep the computational cost low, we employed methods with low time complexity, such as the Haar cascade classifier, but we did not further reduce time complexity by altering the architecture of the CNN model. As a result, compared with the modified CNN of Minaee et al. [9], our approach lags slightly in both accuracy and time complexity. In future work, we want to incorporate the concept of attention so that the model focuses more on learning emotion-related features. We also plan to expand the training dataset for multi-person expression recognition and to develop algorithms tailored specifically to this task to achieve higher accuracy.