Abstract
The expressions on human faces reveal the emotions we experience internally. Emotion recognition based on facial expressions is a subfield of social signal processing, with applications in several areas, particularly human-computer interaction. This study presents a simple automated CCNN-SVM model as a viable approach to facial expression recognition (FER). The model combines a convolutional neural network for feature extraction, several image preprocessing techniques, and a Support Vector Machine (SVM) for classification. First, the input image is preprocessed using face detection, histogram equalization, gamma correction, and resizing. Second, the preprocessed image passes through a custom single deep convolutional neural network (CCNN) that extracts deep features. Finally, the SVM uses the generated features to perform the classification. The proposed model was trained and tested on four datasets: CK+, JAFFE, KDEF, and FER. These datasets consist of seven primary emotional categories, comprising anger, disgust, fear, happiness, sadness, and surprise, in addition to neutrality for CK+ and contempt for JAFFE. The proposed model demonstrates commendable performance in comparison with existing facial expression recognition techniques, achieving an accuracy of 99.3% on the CK+ dataset, 98.4% on the JAFFE dataset, 90.5% on the KDEF dataset, and 69.6% on the FER dataset.
1. Introduction
Facial expression is one of the most crucial aspects of human emotion recognition. People convey a sense of their feelings through their faces alone; a facial expression can be characterized as a change in facial appearance driven by an individual’s internal emotional state, objectives, or social interactions []. Applications of facial expression recognition in human–computer interaction include social robots, medical treatments, and monitoring driver fatigue.
Facial expression recognition systems fall into two basic categories. The first category is based on static images [,] and the second deals with dynamic image sequences [,]. In static-based algorithms, the feature vector contains only data related to the current input image, as no temporal information is employed. Sequence-based approaches, by contrast, exploit the temporal information contained in a sequence of images to identify the expression. Automated facial expression recognition systems typically process either a single static image or an image sequence as input. Their output usually corresponds to one of the six primary expressions (anger, sadness, surprise, happiness, disgust, and fear), with some systems also capable of detecting a neutral expression. In this study, the focus is on static image-based methods, addressing both the six fundamental expressions and the seven-expression set, which adds a neutral expression to the six primary emotions.
Automatic facial expression analysis, as defined by [], includes four stages: capturing the face, extracting facial data, representation, and recognizing facial emotions. Capturing the face can be broken down into two main stages: head pose estimation [] and face detection []. After capturing the face, it is essential to extract the facial changes induced by different facial expressions.
In the past, researchers have utilized various methods, such as the Naive Bayes classifier, ensemble learning, and Support Vector Machines, to address FER. These techniques relied on extracting information from images using local binary patterns, face-landmark features, Harris corner detection, and the Scale-Invariant Feature Transform (SIFT). Among these approaches, neural networks have gained the most popularity and are widely used for FER. Previous FER approaches typically employed CNNs with only a few layers, despite evidence suggesting that deeper models perform better in other image-processing domains []. This may be attributed to the specific challenges posed by FER. First, recognizing emotions necessitates moderately high-resolution images, which means handling high-dimensional data. Second, the subtle variations in facial expressions across different emotional states make the classification task difficult. Conversely, very deep CNNs entail a substantial number of hidden convolutional layers; training such a large number of layers is cumbersome and often fails to generalize effectively. Moreover, merely increasing the number of layers does not necessarily improve accuracy beyond a certain threshold, owing to issues such as the vanishing gradient problem [].
Deep learning reduces the burden of discovering new features for every task and has a significant advantage over conventional machine learning in terms of performance as the amount of data increases. The main contributions of this work are:
- Propose CCNN-SVM as an automated model for emotion recognition: a fusion of a deep neural network that includes many convolutional layers and an SVM classifier, aimed at improving the efficiency of facial emotion recognition.
- Investigate the impact of image preprocessing operations on facial expression recognition tasks.
2. Related Works
Facial expressions are very important in conveying emotional information between individuals, often more effectively than direct speech. They can also serve as evidence of whether someone is being truthful. The use of facial recognition technology has been on the rise [], as evidenced by its application in detecting fake faces in images and distinguishing real faces through the eyes. As technology continues to advance, various methods have emerged for detecting human facial expressions. The following subsections briefly describe the methodologies used in well-known FER methods.
2.1. Facial Expression Recognition Methodologies Based on Machine Learning
This section reviews FER methods that depend on machine learning. The Gabor filter is a popular and widely utilized technique for recognizing facial expressions. In [], Local Binary Patterns (LBP) are used as an image texture descriptor, with various adaptations for capturing the appearance of facial features. Along with LBP, Local Phase Quantization (LPQ) has also been used as an image texture descriptor. In [], the authors performed feature extraction from facial images by creating specialized features and subsequently introduced classification methods employing SVM classifiers for facial expression recognition, leveraging these extracted features. In [], an innovative method for detecting facial expressions inspired by human vision was introduced. In another study [], an edge-based descriptor, the Local Prominent Directional Pattern (LPDP), was utilized for feature extraction in facial expression recognition (FER); LPDP efficiently uses pixel-level information to construct descriptive features in the FER process.
In [], expressions are characterized as a combination of various action units extracted from RGB video sequences and classified using an SVM for deception detection. However, appearance-based features are sensitive to lighting conditions. In geometric approaches, facial regions are first used to identify landmark points; the geometric relationships between the landmark points are then used to describe the expression.
2.2. Facial Expression Recognition Methodologies Based on Deep Learning
In this section, we explain FER methods that depend on deep learning. Convolutional neural networks (CNNs) have recently had considerable success in a variety of computer vision applications, including recognizing facial expressions. CNN architectures differ in their convolutional and subsampling layers and in how they are trained. The number of generated feature maps and the size of the kernels are the primary parameters of the convolutional layers. The two primary subsampling layer types are average pooling and maximum pooling: average pooling computes the new value of a region as the average of its neighboring pixels, whereas maximum pooling selects the highest pixel value within the neighborhood and assigns it as the new value for that region. The learning process for CNNs centers on obtaining the best kernel weights.
In [], the authors incorporated advanced computer vision and machine learning techniques into a framework designed to computationally analyze the facial expression production of both children with Autism Spectrum Disorder (ASD) and typically developing children. The framework detects and tracks a set of landmarks, akin to virtual electromyography sensors, to monitor the movements of the facial muscles involved in producing facial expressions. In [], the authors incorporated dense connections across pooling layers to promote feature sharing in a relatively shallow CNN architecture. The authors in [] introduced an approach to facial expression recognition (FER) utilizing support vector machine (SVM) technology: the technique first applies a face detection algorithm that merges the Haar-like features method with the self-quotient image (SQI) filter; then three methodologies, namely the angular radial transform (ART), the discrete cosine transform (DCT), and the Gabor filter (GF), are used simultaneously to extract facial expression features. In another work, the authors built a multi-block deep convolutional neural network (DCNN) model aimed at detecting facial emotions across virtual, stylized, and human characters. Ref. [] introduced an adaptively weighted sharing network capable of conducting action unit identification and emotion recognition concurrently.
As shown in [], LBP features and an attention-mechanism network can be combined to improve the attention model and produce better facial expression recognition results. To resolve the conflict between the FER challenge and current deep learning trends, Ref. [] investigated the correlation between the magnitude of input semantic disturbance and the fluctuation in output probabilities. In [], two models employing ensemble learning were introduced using a multi-block DCNN setup: the first combines a bagging ensemble with SVM (DCNN-SVM), while the second uses an ensemble of three distinct classifiers with a voting technique (DCNN-VC). The architecture presented by [] employs a novel spatial-channel attention net (SCAN) to obtain local and global attention for each channel at every spatial location. By converting the input data into RGB and depth map images and subsequently utilizing a recursive procedure with randomized channel concatenation, Ref. [] demonstrated a sparsity-aware deep network capable of automatically recognizing 3D/4D facial expressions. Ref. [] introduced an improved CNN-based structure that integrates multiple branches consisting of radial basis function (RBF) units. In [], the authors introduced a method for accurately recognizing facial expressions using a combination of a hybrid feature descriptor and an enhanced classifier. Drawing inspiration from the effectiveness of the stationary wavelet transform in various computer vision applications, the technique first applies the stationary wavelet transform to preprocess the facial image; the pyramid of histograms of orientation gradient features is then computed from the low-frequency stationary wavelet transform coefficients to extract more prominent details from facial images and enhance recognition. In [], the authors proposed a hybrid model that merges convolutional neural networks (CNNs) with a support vector machine (SVM) classifier, utilizing dynamic facial expression data. The method uses dense facial motion flows and geometry landmark flows extracted from facial expression sequences as inputs for the CNN and the SVM classifier, respectively, and proposes CNN architectures tailored for facial expression recognition based on dense facial motion flows. Ref. [] addressed facial expression recognition using a deep learning technique known as the Visual-Attention-based Composite Dense Neural Network (VA-CDNN). The model extracts specific regions of interest, such as the mouth, eyes, and full face, from facial images; these regions are then individually processed through a pretrained Xception deep convolutional neural network (ConvNet) to obtain spatial representations. In [], the authors used the Boosted Histogram of Oriented Gradient descriptor to offer an effective method for FER in static images. Through the extraction of hybrid features from an ellipse and an undirected graph with a β-skeleton, the authors in [] enhanced a fusion CNN for facial emotion identification; the outputs are fused into a feature vector for classification within a deep neural network. Table 1 summarizes the related work.
Table 1.
Summary of related works.
3. The Proposed Model
Convolutional Neural Networks (CNNs) are effective at learning image characteristics; however, they do not always produce the best classification results. In contrast, SVMs generate effective decision surfaces by maximizing margins with soft-margin techniques, rather than learning complex invariances through fixed activation functions.
In this section, the proposed automated framework is described. The Custom Convolutional Neural Network (CCNN) learns discriminative characteristics, and the SVM selects the best class labels in the learned feature space, replacing the CNN’s output layer (i.e., the output of the CCNN’s fully connected layer acts as the input to the SVM).
Figure 1 shows the general framework of the suggested model that includes three steps. The first step is preprocessing, including image cropping, image equalization, gamma correction, and downsampling. In the second step, the CCNN model extracts the features. Finally, the classification step is based on the SVM classifier. The proposed model is fully explained in the next subsections.
Figure 1.
The framework for the proposed model of facial expression recognition.
3.1. The Preprocessing Step
The proposed model includes a preprocessing step, which is designed to eliminate non-expression specific features from a facial image. There are four stages in the preprocessing step: face detection and cropping, histogram equalization, gamma correction, and downsampling. The preprocessing step is fully explained in the next subsections.
3.1.1. Image Cropping Stage
Many background details in the original image are irrelevant to the emotion classification process. These details can reduce categorization accuracy because the classifier must additionally distinguish between background and foreground. After cropping, any parts of the image that do not contain expression-related information are eliminated. Moreover, the cropping region is designed to exclude facial features that do not contribute to the expression, such as the ears and portions of the forehead. The distance ratio between the eyes is therefore used to define the region of interest. As a result, this technique can handle different subjects and image sizes without human involvement.
A ratio represents the distance between the midpoint of the eyes and the center of the right eye (Figure 2). A factor of 2.8 applied to this distance defines the vertical cropping area (0.8 for the area above the eyes and 2.0 for the area below), and a factor of 2.5 applied to the same distance defines the horizontal cropping area. The values of these factors were established empirically after testing several candidates, as shown in Figure 2.
Figure 2.
Two different image cropping examples with
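For illustration, the sketch below computes the cropping rectangle from the two eye centers using the 2.8 (0.8 above / 2.0 below) and 2.5 factors given above. The function name, the eye-coordinate inputs, the use of the full distance between the eye centers, and the symmetric application of the horizontal factor are our assumptions rather than details stated in the text.

```python
def crop_face(gray, left_eye, right_eye):
    """Crop the expression region of a grayscale face image.

    left_eye, right_eye: (x, y) pixel coordinates of the eye centers.
    NOTE: d is taken here as the distance between the two eye centers and the
    horizontal factor is applied symmetrically; both choices are assumptions.
    """
    lx, ly = left_eye
    rx, ry = right_eye
    mid_x, mid_y = (lx + rx) / 2.0, (ly + ry) / 2.0
    d = ((rx - lx) ** 2 + (ry - ly) ** 2) ** 0.5

    top = int(round(mid_y - 0.8 * d))        # 0.8*d above the eye line
    bottom = int(round(mid_y + 2.0 * d))     # 2.0*d below the eye line (2.8*d total)
    left = int(round(mid_x - 1.25 * d))      # 2.5*d total horizontal width
    right = int(round(mid_x + 1.25 * d))

    h, w = gray.shape[:2]
    top, left = max(top, 0), max(left, 0)    # clamp to the image borders
    bottom, right = min(bottom, h), min(right, w)
    return gray[top:bottom, left:right]
```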
3.1.2. Histogram Equalization Stage
The second stage normalizes the image’s gray-scale values and improves the ability to distinguish between the brightness of the foreground and the background in a facial image, both of which can be achieved with histogram equalization (Hist-eq). Equation (1) shows the histogram-equalization mapping:

$$h(v) = \operatorname{round}\left(\frac{cdf(v) - cdf_{min}}{M \times N - cdf_{min}} \times (L - 1)\right) \quad (1)$$

Here, $h(v)$ represents the equalized value assigned to gray level $v$ for a set of resized face images consisting of $n$ images, $cdf$ represents the cumulative distribution function, and $cdf_{min}$ is its minimum non-zero value. The term $M \times N$ indicates the total number of pixels in the image, where $M$ represents the width, $N$ represents the height, and $L$ denotes the number of gray levels. An example of this stage is shown in Figure 3.
Figure 3.
Image equalization example.
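The sketch below implements the mapping of Equation (1) directly with NumPy; for 8-bit grayscale images, OpenCV's `cv2.equalizeHist` produces an equivalent result. The variable names are ours.

```python
import numpy as np

def equalize(gray, levels=256):
    """Histogram equalization of an 8-bit grayscale image via Equation (1)."""
    hist, _ = np.histogram(gray.flatten(), bins=levels, range=(0, levels))
    cdf = hist.cumsum()                       # cumulative distribution function
    cdf_min = cdf[cdf > 0].min()              # smallest non-zero cdf value
    total = gray.size                         # M * N pixels
    # map each gray level v to round((cdf(v) - cdf_min) / (M*N - cdf_min) * (L - 1))
    lut = np.round((cdf - cdf_min) / float(total - cdf_min) * (levels - 1))
    lut = np.clip(lut, 0, levels - 1).astype(np.uint8)
    return lut[gray]
```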
3.1.3. Gamma Correction Stage
The third stage involves gamma correction, a non-linear operation employed to regulate the overall brightness of an image by manipulating the gamma parameter. It is defined by Equation (2):

$$s = r^{\gamma} \quad (2)$$

where $r$ is the input intensity normalized to the range $[0, 1]$ and $s$ is the corresponding output intensity. Gamma can take any value in the range $0 \to \infty$. A value of 1 (the default) results in a linear mapping. When gamma is less than 1, the mapping favors higher (brighter) output values; conversely, for gamma values greater than 1, the mapping is biased toward lower (darker) output values. An example of this operation is shown in Figure 4.
Figure 4.
Gamma correction example.
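A minimal gamma-correction sketch following Equation (2): intensities are normalized to [0, 1], raised to the power gamma, and rescaled to the 8-bit range. The gamma value shown is illustrative only and is not the value used in the paper.

```python
import numpy as np

def gamma_correct(gray, gamma=0.8):
    """Apply Equation (2) to an 8-bit grayscale image."""
    norm = gray.astype(np.float32) / 255.0    # normalize intensities to [0, 1]
    out = np.power(norm, gamma)               # gamma < 1 brightens, gamma > 1 darkens
    return (out * 255.0).astype(np.uint8)
```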
3.1.4. Downsampling Stage
The final stage is downsampling, which enables the CNN to discover which areas are associated with each particular expression. Since most modern graphics cards have only a limited amount of memory, downsampling also enables the convolutions to be executed on the GPU. The final image size after this stage is 48 × 48 pixels. Figure 5 shows an example of this stage.
Figure 5.
Downsampling example.
Table 2 presents the test set accuracies of the custom CNN on the KDEF and JAFFE datasets for various input sizes. Notably, the highest accuracies were attained with an image size of 48 × 48. This preference arises from the model’s compatibility with this specific input size, as larger sizes require more extensive data and a larger model for improved performance. Additionally, the CK+ and FER datasets were used at their standard sizes.
Table 2.
Test set accuracies of a custom CNN on the CK+, JAFFE, and KDEF datasets for various input sizes.
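Putting the four stages together, the sketch below chains face detection, cropping, histogram equalization, gamma correction, and downsampling. OpenCV's Haar-cascade detector, its parameters, and the example gamma value are our assumptions (the paper does not name a specific detector); the 48 × 48 target size follows Table 2.

```python
import cv2

# Haar-cascade face detector bundled with OpenCV (an example choice, not
# necessarily the detector used in the paper).
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def preprocess(image_bgr, size=(48, 48), gamma=0.8):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                # no face detected
    x, y, w, h = faces[0]                          # keep the first detected face
    face = gray[y:y + h, x:x + w]                  # crop the facial region
    face = cv2.equalizeHist(face)                  # histogram equalization
    face = ((face / 255.0) ** gamma * 255).astype("uint8")  # gamma correction
    return cv2.resize(face, size)                  # downsample to the network input size
```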
3.2. The Feature Extraction Step
This section introduces the proposed convolutional neural network for feature extraction. The architecture of the Custom Convolutional Neural Network (CCNN) is shown in Figure 6. A grayscale image is provided as input to the network, and the output is the predicted expression. The CCNN architecture consists of six convolutional layers, each with a ReLU activation function, six batch normalization layers, five max pooling layers, and one fully connected layer. Finally, an SVM is used instead of Softmax for the classification. The convolutional neural network parameters are presented in Table 3, and the details of the proposed network are shown in Table 4.
Figure 6.
The proposed custom CNN model (CCNN) architecture.
Table 3.
CNN parameters.
Table 4.
The proposed custom CNN model’s specifics.
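The Keras sketch below mirrors the layer inventory described above (six conv + ReLU layers, six batch-normalization layers, five max-pooling layers, and one fully connected layer whose output feeds the SVM). The filter counts, kernel sizes, and feature dimension are placeholders, since the actual values are given in Tables 3 and 4 rather than in the text.

```python
from tensorflow.keras import layers, models

def build_ccnn(input_shape=(48, 48, 1), feature_dim=256):
    """Feature extractor; the returned features are classified by the SVM."""
    def conv_block(x, filters, pool=True):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        return layers.MaxPooling2D(2)(x) if pool else x

    inputs = layers.Input(shape=input_shape)
    x = conv_block(inputs, 32)       # filter counts below are assumptions
    x = conv_block(x, 64)
    x = conv_block(x, 128)
    x = conv_block(x, 128)
    x = conv_block(x, 256)
    x = conv_block(x, 256, pool=False)            # sixth conv block, no pooling
    x = layers.Flatten()(x)
    features = layers.Dense(feature_dim, activation="relu")(x)  # fully connected layer
    return models.Model(inputs, features)          # output feeds the SVM classifier
```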
3.3. The Classification Step
Support Vector Machine (SVM) is a supervised algorithm utilized for both regression and classification tasks. It operates as a classifier that seeks optimal hyperplanes for effective separation in accordance with specific generalization criteria. In the proposed model, the SVM replaces the softmax output layer to determine the classes and classifies the learned features.
SVM is built on statistical learning theory and the principle of structural risk minimization [], exhibits strong generalization capability, and can effectively handle small-sample problems. Using the Gaussian kernel function, the SVM maps vectors from a low-dimensional space to a high-dimensional space to address classification problems, as shown in Figure 7.
Figure 7.
SVM classifier.
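A minimal sketch of the classification step: CCNN features are standardized and classified by an SVM with a Gaussian (RBF) kernel. The random feature matrices and the C/gamma settings below are placeholders, not the values tuned in the paper.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder arrays standing in for CCNN feature vectors and expression labels.
rng = np.random.default_rng(0)
train_feats, train_labels = rng.random((200, 256)), rng.integers(0, 7, 200)
test_feats, test_labels = rng.random((50, 256)), rng.integers(0, 7, 50)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(train_feats, train_labels)                 # train on CCNN feature vectors
print("test accuracy:", svm.score(test_feats, test_labels))
```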
4. Experimental Results
The preprocessing and the CCNN of the proposed model are implemented in MATLAB R2021b and tested on a Dell PC with the following configuration: Intel(R) Core(TM) i7-9750H CPU @ 2.60 GHz, 16 GB RAM, and a 64-bit Windows 11 operating system. The performance evaluation of the proposed model is performed using the CK+, JAFFE, KDEF, and FER datasets, which are discussed in Section 4.1. The experiments were conducted using two distinct approaches to partition each dataset into training and test sets: in the first approach, a portion of the images was randomly selected for the training set, while the remainder was reserved for the test set; in the second approach, a five-fold cross-validation method was employed. Section 4.2 presents a discussion of the experimental results and a comparison with the most recent models.
4.1. Datasets
The facial expression recognition research made use of four renowned datasets commonly employed in this field: the Extended Cohn-Kanade (CK+) dataset, the Japanese Female Facial Expressions (JAFFE) dataset, the KDEF dataset, and the FER dataset. The datasets comprise seven categories: anger, disgust, fear, happiness, sadness, and surprise, in addition to neutrality for CK+ and contempt for JAFFE. Table 5 summarizes the dataset description, and Figure 8 shows a sample and the number of images in each class.
Table 5.
Datasets’ description.
Figure 8.
The seven expressions for the CK+ and JAFFE datasets.
4.2. Experiments
This section demonstrates the impact of each preprocessing step on the model’s accuracy. First, each experiment is presented and thoroughly discussed with reference to Table 6 and Table 7. Second, a comparison with state-of-the-art techniques is shown in Table 13.
Table 6.
Measurements of the accuracy of the proposed model using different preprocessing stages on the CK+ dataset.
Table 7.
Measurements of the accuracy of the proposed model using different preprocessing stages on the JAFFE dataset.
4.2.1. Accuracy with Different Preprocessing Stages
Five preprocessing configurations are used to assess CNN performance: (a) raw data without preprocessing, (b) image cropping, (c) image cropping and equalization, (d) image cropping and gamma correction, and (e) all preprocessing steps combined. Table 6 and Table 7 demonstrate the effect of each configuration on the model’s accuracy.
- (a)
- Without preprocessing. The initial experiment was conducted using the dataset without any modifications or image preprocessing. The proposed CNN is trained using a gradient descent method, which relies on the order of the provided samples to find a local minimum. Every experimental setup is repeated ten times, during both training and testing, with different image presentation orders to avoid fluctuations; a sketch of this protocol is given after this list. This allows the gradient descent algorithm to explore more pathways and adjust the recognition rate either downward or upward. The accuracy is reported as the highest value achieved among all ten runs, along with the average performance; reporting the average demonstrates that the approach still produces good accuracy rates even when the presentation order is not optimal. In this experiment, the highest accuracy obtained across all ten runs on the JAFFE dataset was , and the average accuracy for all ten runs was . The highest accuracy obtained across all ten runs on the CK+ dataset was , and the average accuracy for all ten runs was . Compared to the most recent models, the recognition rate achieved using only the CNN without any image preprocessing is significantly lower.
- (b)
- Image Cropping. To enhance the accuracy of the proposed model, the image is automatically cropped to eliminate areas that are unrelated to the facial region, as described in Section 3. In this experiment, the highest accuracy obtained across all ten runs on the JAFFE dataset was , and the average was . The highest accuracy obtained across all ten runs on the CK+ dataset was , and the average accuracy for all ten runs was . Because the input of the proposed network is a fixed 48 × 48 pixel image, downsampling is also performed here. By merely adding the cropping procedure, a considerable improvement in the recognition rate can be observed compared to the results displayed previously. The primary factor contributing to this improvement is the removal of a significant amount of irrelevant data that the classifier would otherwise have to process to identify the subject’s expression, which also makes better use of the image space in the network input.
- (c)
- Image Cropping and Equalization. The cropping and equalization results give the erroneous impression that equalization might reduce the model’s accuracy. The highest accuracy obtained across all ten runs on the JAFFE dataset was , and the average accuracy for all ten runs was . The highest accuracy obtained across all ten runs on the CK+ dataset was , and the average accuracy for all ten runs was .
- (d)
- Image Cropping and Gamma Correction. The cropping and gamma correction results also give a false impression that gamma correction might make the technique less accurate. The highest accuracy obtained across all ten runs on the JAFFE dataset was , and the average accuracy for all ten runs was . The highest accuracy obtained across all ten runs on the CK+ dataset was , and the average accuracy for all ten runs was .
- (e)
- Image Cropping, Equalization and Gamma Correction. The best results of the proposed model are achieved by applying all three image preprocessing steps: image cropping, equalization, and gamma correction. The highest accuracy obtained across all ten runs on the JAFFE dataset was , and the average accuracy for all ten runs was . The highest accuracy obtained across all ten runs on the CK+ dataset was , and the average accuracy for all ten runs was .
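As mentioned in (a), each configuration is trained and tested ten times with different presentation orders, and both the best and the average accuracy are reported. The sketch below outlines that protocol; `train_and_evaluate` is a hypothetical callback wrapping one CCNN-SVM training/testing cycle and returning its test accuracy.

```python
import numpy as np

def repeated_runs(train_and_evaluate, n_train, n_runs=10, seed=0):
    """Run the experiment n_runs times with shuffled presentation orders and
    report the best and the average test accuracy."""
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(n_runs):
        order = rng.permutation(n_train)          # new presentation order per run
        accuracies.append(train_and_evaluate(order))
    return max(accuracies), float(np.mean(accuracies))
```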
4.2.2. Accuracy with a Five-Fold Cross-Validation Method
This method divided the available images into five approximately equal sets. During each iteration of the cross-validation process, one set was designated as the test set, while the remaining four sets were used for training. This process was repeated five times, with each set being used as the test set once. The final outcome was obtained by averaging the results from these five individual runs, as shown in Table 8.
Table 8.
The evaluation performance of the proposed model for each fold on the CK+, JAFFE, KDEF, and FER datasets.
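A sketch of the five-fold protocol of Section 4.2.2 using scikit-learn; the stratified split and the `train_and_evaluate` callback are assumptions used for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def five_fold_accuracy(images, labels, train_and_evaluate):
    """Average test accuracy over five folds, each fold used once as the test set."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in skf.split(images, labels):
        scores.append(train_and_evaluate(images[train_idx], labels[train_idx],
                                         images[test_idx], labels[test_idx]))
    return float(np.mean(scores))                  # average over the five folds
```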
4.2.3. Comparisons with the State of the Art
The proposed CCNN model’s confusion matrices are presented in Table 9 for the CK+ dataset, Table 10 for the JAFFE dataset, Table 11 for the KDEF dataset, and Table 12 for the FER dataset. The proposed model’s facial expression recognition accuracy reaches 99.3%, 98.4%, 90.5%, and 69.6% on the CK+, JAFFE, KDEF, and FER datasets, respectively.
Table 9.
The classification’s confusion matrix for the CK+ dataset.
Table 10.
The classification’s confusion matrix for the JAFFE dataset.
Table 11.
The classification’s confusion matrix for the KDEF dataset.
Table 12.
The classification’s confusion matrix for the FER dataset.
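For completeness, confusion matrices and overall accuracies like those in Tables 9–12 can be computed from the predicted and true expression labels; the short example below uses placeholder predictions, and the class list follows the seven expressions used in the paper.

```python
from sklearn.metrics import confusion_matrix, accuracy_score

classes = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

# Placeholder ground-truth and predicted labels for a handful of test images.
y_true = ["happiness", "anger", "fear", "neutral", "surprise"]
y_pred = ["happiness", "anger", "sadness", "neutral", "surprise"]

print(confusion_matrix(y_true, y_pred, labels=classes))   # rows: true, cols: predicted
print("accuracy:", accuracy_score(y_true, y_pred))
```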
The performance of the suggested CCNN-SVM model on the CK+ and JAFFE datasets is compared to other recent deep learning models in Table 13.
Table 13.
Comparison of the recognition performance of the proposed model with the state of the art on the JAFFE and CK+ datasets.
5. Conclusions
This paper proposed the combined CCNN-SVM model for facial expression recognition. The model uses a custom single deep convolutional neural network (CCNN) to extract features and an SVM classifier for classification. The proposed model was evaluated under different settings and hyperparameters to tune it properly. Based on the experimental outcomes, it is clear that the preprocessing step achieved the largest improvement in CCNN performance. To assess the effectiveness of the proposed model, four publicly available datasets were used, i.e., CK+, JAFFE, KDEF, and FER. The experiments show that the accuracy of the proposed model increases considerably when all preprocessing methods are combined. Based on these outcomes, the proposed model obtains higher accuracy and offers a more straightforward solution compared to the most recent models that employ the same facial expression datasets.
Author Contributions
Conceptualization, D.M.A., M.A. and A.H.A.-A.; Formal analysis, M.A. and A.A.E.-S.; Investigation, M.R. and M.A.; Methodology, D.M.A., A.A.E.-S. and A.H.A.-A.; Software, D.M.A.; Supervision, M.R., A.A.E.-S. and A.H.A.-A.; Validation, M.R.; Writing—original draft, D.M.A.; Writing—review and editing, M.R. and A.H.A.-A. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Deanship of Scientific Research, Prince Sattam bin Abdulaziz University, Al-Kharj, Saudi Arabia, via Prince Sattam bin Abdulaziz University project number (PSAU/2024/R/1445).
Institutional Review Board Statement
Not applicable. No humans or animals were studied, and thus, the university Ethics Committee did not need to be consulted.
Informed Consent Statement
All interviewees agreed to the use of their anonymized responses and feedback for research purposes.
Data Availability Statement
The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.
Acknowledgments
The authors would like to thank the Deanship of Scientific Research, Prince Sattam bin Abdulaziz University, Al-Kharj, Saudi Arabia, for supporting this research via funding from Prince Sattam bin Abdulaziz University project number (PSAU/2024/R/1445).
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Jain, A.K.; Li, S.Z. Handbook of Face Recognition; Springer: New York, NY, USA, 2011; Volume 1. [Google Scholar]
- Liu, M.; Li, S.; Shan, S.; Chen, X. Au-inspired deep networks for facial expression feature learning. Neurocomputing 2015, 159, 126–136. [Google Scholar] [CrossRef]
- Ali, G.; Iqbal, M.A.; Choi, T.S. Boosted NNE collections for multicultural facial expression recognition. Pattern Recognit. 2016, 55, 14–27. [Google Scholar] [CrossRef]
- Zhang, W.; Zhang, Y.; Ma, L.; Guan, J.; Gong, S. Multimodal learning for facial expression recognition. Pattern Recognit. 2015, 48, 3191–3202. [Google Scholar] [CrossRef]
- Fan, X.; Tjahjadi, T. A spatial-temporal framework based on histogram of gradients and optical flow for facial expression recognition in video sequences. Pattern Recognit. 2015, 48, 3407–3416. [Google Scholar] [CrossRef]
- Demirkus, M.; Precup, D.; Clark, J.J.; Arbel, T. Multi-layer temporal graphical model for head pose estimation in real-world videos. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27–30 October 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 3392–3396. [Google Scholar]
- Liu, P.; Reale, M.; Yin, L. 3d head pose estimation based on scene flow and generic head model. In Proceedings of the 2012 IEEE International Conference on Multimedia and Expo, Melbourne, Australia, 9–13 July 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 794–799. [Google Scholar]
- Khan, A.; Sohail, A.; Zahoora, U.; Qureshi, A.S. A survey of the recent architectures of deep convolutional neural networks. Artif. Intell. Rev. 2020, 53, 5455–5516. [Google Scholar] [CrossRef]
- Bynagari, N.B. The difficulty of learning long-term dependencies with gradient flow in recurrent nets. Eng. Int. 2020, 8, 127–138. [Google Scholar] [CrossRef]
- Jain, D.K.; Shamsolmoali, P.; Sehdev, P. Extended deep neural network for facial emotion recognition. Pattern Recognit. Lett. 2019, 120, 69–74. [Google Scholar] [CrossRef]
- Sadeghi, H.; Raie, A.A. Suitable models for face geometry normalization in facial expression recognition. J. Electron. Imaging 2015, 24, 013005. [Google Scholar] [CrossRef]
- Farajzadeh, N.; Hashemzadeh, M. Exemplar-based facial expression recognition. Inf. Sci. 2018, 460, 318–330. [Google Scholar] [CrossRef]
- Sadeghi, H.; Raie, A.A. Human vision inspired feature extraction for facial expression recognition. Multimed. Tools Appl. 2019, 78, 30335–30353. [Google Scholar] [CrossRef]
- Makhmudkhujaev, F.; Abdullah-Al-Wadud, M.; Iqbal, M.T.B.; Ryu, B.; Chae, O. Facial expression recognition with local prominent directional pattern. Signal Process. Image Commun. 2019, 74, 1–12. [Google Scholar] [CrossRef]
- Avola, D.; Cinque, L.; Foresti, G.L.; Pannone, D. Automatic deception detection in rgb videos using facial action units. In Proceedings of the 13th International Conference on Distributed Smart Cameras, Trento, Italy, 9–11 September 2019; pp. 1–6. [Google Scholar]
- Leo, M.; Carcagnì, P.; Distante, C.; Mazzeo, P.L.; Spagnolo, P.; Levante, A.; Petrocchi, S.; Lecciso, F. Computational analysis of deep visual data for quantifying facial expression production. Appl. Sci. 2019, 9, 4542. [Google Scholar] [CrossRef]
- Dong, J.; Zheng, H.; Lian, L. Dynamic facial expression recognition based on convolutional neural networks with dense connections. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 3433–3438. [Google Scholar]
- Tsai, H.H.; Chang, Y.C. Facial expression recognition using a combination of multiple facial features and support vector machine. Soft Comput. 2018, 22, 4389–4405. [Google Scholar] [CrossRef]
- Wang, C.; Zeng, J.; Shan, S.; Chen, X. Multi-task learning of emotion recognition and facial action unit detection with adaptively weights sharing network. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 56–60. [Google Scholar]
- Li, J.; Jin, K.; Zhou, D.; Kubota, N.; Ju, Z. Attention mechanism-based CNN for facial expression recognition. Neurocomputing 2020, 411, 340–350. [Google Scholar] [CrossRef]
- Fu, Y.; Wu, X.; Li, X.; Pan, Z.; Luo, D. Semantic neighborhood-aware deep facial expression recognition. IEEE Trans. Image Process. 2020, 29, 6535–6548. [Google Scholar] [CrossRef] [PubMed]
- Chirra, V.R.R.; Uyyala, S.R.; Kolli, V.K.K. Virtual facial expression recognition using deep CNN with ensemble learning. J. Ambient. Intell. Humaniz. Comput. 2021, 12, 10581–10599. [Google Scholar] [CrossRef]
- Gera, D.; Balasubramanian, S. Landmark guidance independent spatio-channel attention and complementary context information based facial expression recognition. Pattern Recognit. Lett. 2021, 145, 58–66. [Google Scholar] [CrossRef]
- Behzad, M.; Vo, N.; Li, X.; Zhao, G. Towards reading beyond faces for sparsity-aware 3d/4d affect recognition. Neurocomputing 2021, 458, 297–307. [Google Scholar] [CrossRef]
- Hernández-Luquin, F.; Escalante, H.J. Multi-branch deep radial basis function networks for facial emotion recognition. Neural Comput. Appl. 2023, 35, 18131–18145. [Google Scholar] [CrossRef]
- Kar, N.B.; Nayak, D.R.; Babu, K.S.; Zhang, Y.D. A hybrid feature descriptor with Jaya optimised least squares SVM for facial expression recognition. IET Image Process. 2021, 15, 1471–1483. [Google Scholar] [CrossRef]
- Kim, J.C.; Kim, M.H.; Suh, H.E.; Naseem, M.T.; Lee, C.S. Hybrid approach for facial expression recognition using convolutional neural networks and SVM. Appl. Sci. 2022, 12, 5493. [Google Scholar] [CrossRef]
- Shaik, N.S.; Cherukuri, T.K. Visual attention based composite dense neural network for facial expression recognition. J. Ambient. Intell. Humaniz. Comput. 2023, 14, 16229–16242. [Google Scholar] [CrossRef]
- Saurav, S.; Saini, R.; Singh, S. Fast facial expression recognition using Boosted Histogram of Oriented Gradient (BHOG) features. Pattern Anal. Appl. 2022, 26, 381–402. [Google Scholar] [CrossRef]
- Jabbooree, A.I.; Khanli, L.M.; Salehpour, P.; Pourbahrami, S. A novel facial expression recognition algorithm using geometry β–skeleton in fusion based on deep CNN. Image Vis. Comput. 2023, 134, 104677. [Google Scholar] [CrossRef]
- Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
- Liew, C.F.; Yairi, T. Facial expression recognition and analysis: A comparison study of feature descriptors. IPSJ Trans. Comput. Vis. Appl. 2015, 7, 104–120. [Google Scholar] [CrossRef]
- Shima, Y.; Omori, Y. Image augmentation for classifying facial expression images by using deep neural network pre-trained with object image database. In Proceedings of the 3rd International Conference on Robotics, Control and Automation, Chengdu, China, 11–13 August 2018; pp. 140–146. [Google Scholar]
- Akhand, M.A.H.; Roy, S.; Siddique, N.; Kamal, M.A.S.; Shimamura, T. Facial Emotion Recognition Using Transfer Learning in the Deep CNN. Electronics 2021, 10, 1036. [Google Scholar] [CrossRef]
- Minaee, S.; Minaei, M.; Abdolrashidi, A. Deep-emotion: Facial expression recognition using attentional convolutional network. Sensors 2021, 21, 3046. [Google Scholar] [CrossRef] [PubMed]
- Niu, B.; Gao, Z.; Guo, B. Facial expression recognition with LBP and ORB features. Comput. Intell. Neurosci. 2021, 2021, 8828245. [Google Scholar] [CrossRef] [PubMed]
- Gowda, S.M.; Suresh, H. Facial Expression Analysis and Estimation Based on Facial Salient Points and Action Unit (AUs). IJEER 2022, 10, 7–17. [Google Scholar] [CrossRef]
- Borgalli, M.R.A.; Surve, S. Deep learning for facial emotion recognition using custom CNN architecture. In Journal of Physics: Conference Series; IOP Publishing: Bristol, UK, 2022; Volume 2236, p. 012004. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).