1. Introduction
Facial emotion recognition (FER) methods are mainly used to recognize expressions on the human face [1]. Many kinds of emotions occur, but some may not be apparent to the human eye [2]. Therefore, with the help of appropriate mechanisms, any available cues can aid in classifying them. In the FER field, there are several universal facial expressions: neutral, happiness, surprise, fear, anger, sadness, and disgust [3]. Emotion extraction from facial expressions is currently an active area of study in mental health, psychiatry, and psychology [4]. Automatic emotion recognition from facial expressions has numerous uses, such as human–computer interaction (HCI), modern augmented reality, healthcare, smart living, and human–robot interaction (HRI) [5]. Many researchers employ FER because of its wide range of uses [6]. Building emotion-specific features is difficult because of several factors, which arise from the nonlinear interaction among dissimilar evidence, multidimensional data, and the different modalities through which emotions are expressed in different scenarios [7].
There is an increasing demand for robust and accurate systems that can automatically recognize and classify human emotions from facial expressions. Emotions play a major role in human communication, affect decision-making processes, and have applications in diverse domains, from human–computer interaction to mental health assessment. Existing FER models often face challenges related to noise, feature extraction, and generalization. Recently, machine learning (ML) and deep networks have proven effective at overcoming such restrictions by identifying the complex nonlinear features linked in multimodal data [8]. The two most critical steps in emotion detection are feature extraction and classification [9]. Some of the foremost classifiers used for superior classification accuracy are artificial neural networks, ML, and deep-learning (DL) systems [10]. Established feature-engineering and ML methods attempt to extract complicated, nonlinear patterns from multivariate time-series data [11]. However, selecting effective features from many feature sets is very difficult, so a dimensionality-reduction method is needed. The feature-extraction and feature-selection processes are also time-consuming. For instance, as the feature dimensionality increases, the computational overhead of feature selection grows drastically [12]. DL techniques, namely the recurrent neural network (RNN), autoencoder (AE), and convolutional neural network (CNN), have advanced many areas of computing, especially computer vision, natural language processing (NLP), audio recognition, and machine translation [13]. Recently, DL methods have been employed to deliver high-level data abstraction and to develop flexible structures for emotion detection. In the DL method, deep neural networks (DNNs) are used to extract distinctive features from the high-level data representation [14].
This study introduces an automated facial emotion recognition using the pelican optimization algorithm with a deep convolutional neural network (AFER-POADCNN) model. The main aim of the AFER-POADCNN model is the automated recognition and classification of facial emotions. To accomplish this, the AFER-POADCNN method uses the median-filtering (MF) approach to remove the noise present in the input images. Furthermore, the capsule-network (CapsNet) approach is applied for feature extraction, and the hyperparameter tuning of the CapsNet model is carried out by the pelican optimization algorithm (POA). Finally, the detection and classification of different kinds of facial emotions are performed by a bidirectional long short-term memory (BiLSTM) network. The performance of the AFER-POADCNN technique is tested on benchmark FER databases. In short, the key contributions of the paper are summarized as follows.
An AFER-POADCNN technique comprising MF-based preprocessing, a CapsNet feature extractor, POA-based hyperparameter tuning, and BiLSTM classification has been developed for FER. To the best of our knowledge, the AFER-POADCNN technique has not previously been reported in the literature;
The CapsNet model has been employed for feature extraction, allowing for the capture of intricate and nuanced facial expressions;
The POA is presented to tune the hyperparameters of the capsule network, enhancing the model’s adaptability and generalization to different emotions and diverse populations;
The BiLSTM model applied for emotion classification ensures the robust detection and categorization of various facial emotions.
2. Literature Review
Sarvakar et al. [15] proposed a facial emotion recognition technique based on a convolutional neural network (FERC), which consists of two parts. In the first part, the image background is removed; in the second, the facial feature vector is extracted. For the classification process, the expressional vector (EV) is employed. The two-level CNN works in series, and the exponent and weight values of the last perceptron layer are adjusted with each iteration. A novel background-removal procedure, applied before EV generation, avoids several issues that could otherwise arise. Said and Barr [16] designed a face-sensitive CNN for human-emotion classification, which consists of two phases. In the first phase, faces are detected in high-resolution images and then cropped for further processing. Next, the CNN is used to estimate the facial expression, which relies on standard analytics, and it is applied to pyramid images to achieve scale invariance. In [17], a facial-expression-detection method was designed. In the first stage, the face is detected as the region of interest. Secondly, a DL-based CNN design is proposed. In the third stage, several new data-augmentation methods are applied.
Talaat [18] proposed a real-time emotion-identification method that employs three phases of emotion classification. The study designs an enhanced DL method to identify facial emotions using a CNN. The proposed emotion-recognition framework takes advantage of the IoT and fog computing to reduce delays in real-time classification, providing a quick response time as well as location awareness. Chowdary et al. [19] dealt with emotion detection using the transfer-learning (TL) method. The well-known networks MobileNet, VGG19, ResNet50, and Inception V3 are used in the research. The fully connected layers of the pretrained ConvNets are removed, and new fully connected layers that are better suited to the task are added. Finally, the newly added layers are trained to update the weights. In [20], an automated framework is used for facial recognition by employing an FD-CNN, which is built with four convolutional layers and two hidden layers to enhance accuracy. The extended Cohn–Kanade (CK+) dataset is employed, including facial images of different females and males with various expressions. To validate the proposed technique, K-fold cross-validation is executed.
Sikkandar and Thiyagarajan [21] presented an improved cat swarm optimization (ICSO) method. A deep CNN (DCNN) is employed for feature extraction, and the ICSO is designed to select the optimum features. Using the DCNN with the ICSO method enhances the retrieval performance; then, an ensemble classifier that combines a support vector machine (SVM) and a neural network (NN) classifies the facial expressions. Helaly et al. [22] developed a DCNN method based on an intelligent computer-vision system that is capable of identifying emotions on human faces. In the first stage, a DCNN designed using the TL method is introduced to build an accurate FER system. Secondly, the research suggests the ResNet18 method.
In [23], a new end-to-end facial-microexpression-recognition architecture termed Deep3DCANN was developed, which combines several modules for effective microexpression detection. The first module of the design is a deep 3D-CNN that learns useful spatiotemporal features from a sequence of facial images. Kansizoglou et al. [24] offered a new model for online emotion detection that uses audio as well as visual modalities and delivers a responsive prediction once the system is sufficiently confident. The authors developed two deep CNN models for extracting emotional features, one for each modality, and a DNN for their fusion. Li et al. [25] presented a self-supervised exclusive–inclusive interactive-learning (SEIIL) technique to facilitate discriminative multilabel FER in the wild, which effectively handles coupled multiple sentiments with limited unconstrained training data. Kansizoglou et al. [26] presented a new method that gradually maps and learns human personality by observing and tracking a person's emotional variations during interaction. The developed network extracts the facial landmarks of a subject, which are used to train a suitably designed deep recurrent neural network (DRNN) framework.
3. The Proposed Model
In this study, we have designed a novel AFER-POADCNN model. The primary objective of the AFER-POADCNN model lies in the automated recognition and classification of facial emotions. To accomplish this, the AFER-POADCNN method includes different phases of operations, namely, MF-based preprocessing, the CapsNet model for feature extraction, the BiLSTM model for classification, and POA-based hyperparameter tuning.
Figure 1 portrays the entire procedure of the AFER-POADCNN system.
3.1. Image Preprocessing
To remove the noise in the input images, median filtering (MF) is used. It is a basic preprocessing method used in image processing to reduce noise and improve data quality [26]. Unlike other smoothing techniques, MF preserves edges and fine details while effectively suppressing impulse noise, such as the salt-and-pepper noise often found in images. It works by replacing each pixel value with the median value of its local neighborhood, which makes it particularly suitable for applications where noise must be reduced without compromising significant image features. MF is widely used in domains such as computer vision (CV), remote sensing, and medical imaging to enhance data quality and robustness before further visualization or analysis.
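The neighborhood-median operation described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the 3 × 3 window, the edge-replication border policy, and the function name `median_filter` are assumptions, since the paper does not state the filter settings it used.

```python
from statistics import median


def median_filter(image, size=3):
    """Apply a size x size median filter to a 2D grayscale image (list of rows).

    Border pixels are handled by replicating the nearest edge pixel.
    This border policy is an assumption, not a setting taken from the paper.
    """
    if size % 2 == 0:
        raise ValueError("size must be odd")
    rows, cols = len(image), len(image[0])
    r = size // 2
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            window = []
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy = min(max(y + dy, 0), rows - 1)  # clamp row index
                    xx = min(max(x + dx, 0), cols - 1)  # clamp column index
                    window.append(image[yy][xx])
            out[y][x] = median(window)  # replace pixel by neighborhood median
    return out


# A flat patch corrupted by one "salt" (255) and one "pepper" (0) pixel.
noisy = [
    [10, 10, 10, 10],
    [10, 255, 10, 10],
    [10, 10, 0, 10],
    [10, 10, 10, 10],
]
clean = median_filter(noisy)
```

In this toy patch, both impulse pixels are replaced by the surrounding value because they never form the majority of any window. For real images, library routines such as `scipy.ndimage.median_filter` or OpenCV's `cv2.medianBlur` would normally be preferred over this loop.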
3.2. Feature Extraction
To derive the feature vectors, the CapsNet model is applied. Capsule networks (CapsNets) are a DL architecture developed to address some drawbacks of standard CNNs in tasks such as image recognition [27]. Developed by Geoffrey Hinton and his team, CapsNets use capsules as the main building blocks. These capsules are small groups of neurons that work together to identify different parts of objects or visual patterns within an image. Unlike CNNs, which depend on max-pooling for extracting features, CapsNets use a dynamic-routing mechanism to model the spatial relationships between parts and the whole objects they belong to. This allows CapsNets to handle complex hierarchical relationships among features, which is particularly beneficial where object pose and orientation matter, as in recognizing handwritten characters or identifying objects in cluttered scenes. Another feature of CapsNets is their ability to manage variable-length pose vectors for all parts, allowing them to capture rich information about the relative positions and orientations of object components. This makes CapsNets robust to various transformations, including scaling, rotation, and deformation, and a compelling choice for tasks such as image segmentation and object recognition in challenging real-world conditions. While CapsNets are still an evolving field of research, they hold significant promise for advancing the state of the art in computer vision and pattern recognition.
Figure 2 illustrates the architecture of the CapsNet.
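To make the dynamic-routing idea more concrete, the following sketch implements routing by agreement between two capsule layers in plain Python. It follows the commonly published formulation of dynamic routing (squash nonlinearity, softmax coupling coefficients, agreement update) and is not taken from this paper; the function names, the three routing iterations, and the toy prediction vectors are all assumptions.

```python
import math


def squash(s):
    """Keep the direction of s and map its length into [0, 1)."""
    sq_norm = sum(v * v for v in s)
    if sq_norm == 0.0:
        return [0.0] * len(s)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * v for v in s]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, iterations=3):
    """Route predictions u_hat[i][j] from lower capsule i to upper capsule j.

    Returns the output vector of every upper capsule.
    """
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits
    v = []
    for _ in range(iterations):
        c = [softmax(row) for row in b]  # coupling coefficients
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Raise the logit wherever a prediction agrees with the output.
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v


def length(vec):
    return math.sqrt(sum(x * x for x in vec))


# Two lower capsules agree strongly on upper capsule 0 and weakly on capsule 1.
u_hat = [
    [[1.0, 0.0], [0.0, 0.1]],
    [[1.0, 0.0], [0.1, 0.0]],
]
v = dynamic_routing(u_hat)
```

In this formulation, the length of each output vector acts as the confidence that the corresponding entity is present, so the capsule on which the lower-level predictions agree ends up with the longer output vector.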
3.3. Hyperparameter Tuning
In this work, the POA optimally chooses the hyperparameter values of the CapsNet model. The POA is a recent swarm intelligence (SI)-based optimization algorithm whose population consists of pelicans [28]. Each swarm member represents a candidate solution. Initially, the swarm members are randomly initialized within the bounds of the problem:

$$x_{i,j} = l_j + rand \cdot (u_j - l_j), \quad i = 1, 2, \ldots, N, \quad j = 1, 2, \ldots, m \tag{1}$$

In Equation (1), $u_j$ and $l_j$ are the upper and lower bounds of the $j$th problem variable, respectively. The value of the $j$th variable specified by the $i$th candidate solution is $x_{i,j}$; the number of swarm members is $N$; $rand$ is a random number in the interval $[0, 1]$; and the number of problem variables in the solution space is $m$.
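The members of the pelican swarm can be described by a population matrix. In this matrix, each row describes a candidate solution, and each column holds the proposed values for one variable of the solution space:

$$X = \begin{bmatrix} X_1 \\ \vdots \\ X_i \\ \vdots \\ X_N \end{bmatrix}_{N \times m} = \begin{bmatrix} x_{1,1} & \cdots & x_{1,j} & \cdots & x_{1,m} \\ \vdots & \ddots & \vdots & & \vdots \\ x_{i,1} & \cdots & x_{i,j} & \cdots & x_{i,m} \\ \vdots & & \vdots & \ddots & \vdots \\ x_{N,1} & \cdots & x_{N,j} & \cdots & x_{N,m} \end{bmatrix}_{N \times m} \tag{2}$$

In Equation (2), $X$ is the swarm matrix and $X_i$ is the $i$th pelican.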
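In this work, the cost function is evaluated for every candidate solution. The cost-function vector ($F$) collects the values obtained for the cost function, as follows:

$$F = \begin{bmatrix} F_1 \\ \vdots \\ F_i \\ \vdots \\ F_N \end{bmatrix}_{N \times 1} = \begin{bmatrix} F(X_1) \\ \vdots \\ F(X_i) \\ \vdots \\ F(X_N) \end{bmatrix}_{N \times 1} \tag{3}$$

where $F_i$ is the cost-function value of the $i$th candidate solution $X_i$.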
1. Exploration Stage (Moving Towards the Bait)
First, the pelicans identify the hunting region and then move towards it. Modelling this strategy causes the solution area to be scanned, which gives the POA its ability to explore different regions of the solution space. The prey position is generated randomly in the search region, which increases the exploration ability of the algorithm. This can be mathematically expressed as follows:

$$x_{i,j}^{P_1} = \begin{cases} x_{i,j} + rand \cdot (p_j - I \cdot x_{i,j}), & F_p < F_i \\ x_{i,j} + rand \cdot (x_{i,j} - p_j), & \text{otherwise} \end{cases} \tag{4}$$

In Equation (4), $x_{i,j}^{P_1}$ is the new position of the $i$th pelican in the $j$th dimension, $rand$ is a random number in $[0, 1]$, $I$ is a random integer equal to 1 or 2, $p_j$ is the prey location in the $j$th dimension, and $F_p$ and $F_i$ are the cost-function values of the prey and of the $i$th pelican, respectively. The parameter $I$ is chosen randomly for every member and every iteration. When $I$ equals 2, it increases the displacement of a member, which can lead that member to a new region of the problem space. The parameter $I$ therefore affects the exploration ability of the POA in scanning the search space accurately.
The new position of a pelican is accepted only if it improves the cost-function value. This type of update, called the effective update, prevents the algorithm from moving towards nonoptimal regions. The process is modelled by the following formula:

$$X_i = \begin{cases} X_i^{P_1}, & F_i^{P_1} < F_i \\ X_i, & \text{otherwise} \end{cases} \tag{5}$$

In Equation (5), $X_i^{P_1}$ is the new position of the $i$th pelican and $F_i^{P_1}$ is its cost-function value.
2. Exploitation Stage (Winging on the Water Surface)
In this phase, once the pelicans reach the water surface, they spread their wings over it to drive the fish upwards and then gather the bait in their throat pouch. The mathematical model of this behaviour of the pelicans during hunting is given below:

$$x_{i,j}^{P_2} = x_{i,j} + R \cdot \left(1 - \frac{t}{T}\right) \cdot (2 \cdot rand - 1) \cdot x_{i,j} \tag{6}$$

In Equation (6), $x_{i,j}^{P_2}$ is the new position of the $i$th individual in the $j$th dimension, $rand$ is a random number in $[0, 1]$, and $R \cdot (1 - t/T)$ is the neighborhood radius of $x_{i,j}$, where $t$ is the iteration number, $R$ equals 0.2, and $T$ is the maximum number of iterations. The coefficient $R \cdot (1 - t/T)$ represents the radius of the neighborhood around each swarm member that is searched locally so that the member converges towards a promising solution. Because this radius shrinks as the iterations proceed, the area near each swarm member is examined with shorter and more precise movements, and the POA is able to converge to solutions closer to the global optimum. In this stage, the effective update is again used to accept or reject the new pelican position, which is formulated as follows:

$$X_i = \begin{cases} X_i^{P_2}, & F_i^{P_2} < F_i \\ X_i, & \text{otherwise} \end{cases} \tag{7}$$

In Equation (7), $X_i^{P_2}$ is the new position of the $i$th pelican and $F_i^{P_2}$ is its cost-function value.
3. Repetition
Once all swarm individuals have been updated, the best solution found so far is updated according to the new positions of the swarm and their cost-function values. The next iteration then begins, and the stages of the POA, using the formulas above, are repeated until the run is complete. Finally, the best candidate solution obtained over the iterations is returned as a quasi-optimal solution.
The POA derives a fitness function (FF) to attain a high classification efficiency. It defines a positive value that represents the quality of a candidate solution. The minimization of the classifier error rate is taken as the FF.
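The exploration, exploitation, and repetition stages described in this section can be sketched as a short Python routine. This is a minimal sketch under stated assumptions rather than the authors' implementation: the prey is drawn once per iteration, candidate positions are clipped to the bounds, and the `sphere` function is a hypothetical stand-in for the real fitness function (the classifier error rate of a CapsNet trained with the candidate hyperparameters). The names `pelican_optimization`, `n_pelicans`, and `seed` are likewise illustrative.

```python
import random


def pelican_optimization(cost, lower, upper, n_pelicans=20, max_iter=100,
                         R=0.2, seed=0):
    """Minimize cost over the box [lower, upper] with a POA-style search."""
    rng = random.Random(seed)
    m = len(lower)

    def random_point():
        # Equation (1): uniform sample inside the bounds.
        return [lower[j] + rng.random() * (upper[j] - lower[j])
                for j in range(m)]

    def clip(x):
        # Assumption: out-of-range values are clipped to the bounds.
        return [min(max(x[j], lower[j]), upper[j]) for j in range(m)]

    X = [random_point() for _ in range(n_pelicans)]  # swarm matrix
    F = [cost(x) for x in X]                         # cost-function vector
    best_i = min(range(n_pelicans), key=F.__getitem__)
    best_x, best_f = X[best_i][:], F[best_i]

    for t in range(1, max_iter + 1):
        prey = random_point()  # assumption: one random prey per iteration
        f_prey = cost(prey)
        radius = R * (1 - t / max_iter)  # shrinking neighborhood radius
        for i in range(n_pelicans):
            # Stage 1 (exploration): move towards or away from the prey.
            I = rng.choice((1, 2))
            if f_prey < F[i]:
                cand = [X[i][j] + rng.random() * (prey[j] - I * X[i][j])
                        for j in range(m)]
            else:
                cand = [X[i][j] + rng.random() * (X[i][j] - prey[j])
                        for j in range(m)]
            cand = clip(cand)
            f_cand = cost(cand)
            if f_cand < F[i]:  # effective update: keep improvements only
                X[i], F[i] = cand, f_cand
            # Stage 2 (exploitation): local search around the member.
            cand = clip([X[i][j] + radius * (2 * rng.random() - 1) * X[i][j]
                         for j in range(m)])
            f_cand = cost(cand)
            if f_cand < F[i]:
                X[i], F[i] = cand, f_cand
            if F[i] < best_f:
                best_x, best_f = X[i][:], F[i]
    return best_x, best_f


def sphere(x):
    """Hypothetical stand-in for the classifier error rate."""
    return sum(v * v for v in x)


best_x, best_f = pelican_optimization(sphere, [-5.0, -5.0], [5.0, 5.0])
```

For hyperparameter tuning, `cost` would instead train and validate the CapsNet with the candidate settings and return its error rate, which makes each evaluation far more expensive than in this toy example.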
3.4. Detection Using the BiLSTM Model
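To detect the different types of emotions, the BiLSTM model is applied. An LSTM unit has three gates: the input, forget, and output gates [29]. The input gate defines how much of the input data is retained in the current state of the memory unit; the forget gate, which determines how much of the previous memory state is kept, is the basic design that lets the LSTM learn long-term dependencies; and the output gate decides which part of the memory-cell state is passed to the hidden output state. The three inputs of the LSTM neuron are the state of the memory unit at time $t-1$, $c_{t-1}$; the input of the unit at the current moment $t$, $x_t$; and the output of the hidden state at the previous time $t-1$, $h_{t-1}$. Its two outputs are the state of the memory cell at time $t$, $c_t$, and the output of the hidden layer (HL) at time $t$, $h_t$:

$$\begin{aligned} f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) \\ i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) \\ \tilde{c}_t &= \tanh\left(W_c [h_{t-1}, x_t] + b_c\right) \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\ o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) \\ h_t &= o_t \odot \tanh(c_t) \end{aligned}$$

In these equations, $f_t$, $i_t$, and $o_t$ are the activations of the forget, input, and output gates, and $\tilde{c}_t$ is the candidate memory state. The sigmoid function is $\sigma$, and element-wise multiplication is denoted by $\odot$; $W$ and $b$ are the corresponding weight parameters and biases.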
The prediction of the BiLSTM is based on time sequences and considers the data in both the forward and the backward direction. The BiLSTM model comprises two unidirectional LSTM layers: the HL in the forward time direction processes the past data series up to the present element, whereas the HL in the reverse time direction processes the sequence in reverse order and thereby reads the future data series of the input. Next, the values produced by the two LSTM modules are fed forward into the output layer.
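The gate computations and the two-direction scheme described in this section can be illustrated with a small plain-Python sketch. It uses the standard LSTM cell formulation with randomly initialized weights; the function names, the hidden size of 2, and the toy input sequence are assumptions made for illustration, and no training or output layer is included.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Return W v + b for a matrix W (list of rows), bias b, and vector v."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def lstm_step(p, x_t, h_prev, c_prev):
    """One LSTM step: returns the new hidden state h_t and memory state c_t."""
    z = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(p["Wf"], p["bf"], z)]    # forget gate
    i = [sigmoid(a) for a in affine(p["Wi"], p["bi"], z)]    # input gate
    g = [math.tanh(a) for a in affine(p["Wc"], p["bc"], z)]  # candidate state
    o = [sigmoid(a) for a in affine(p["Wo"], p["bo"], z)]    # output gate
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


def init_params(input_dim, hidden_dim, rng):
    """Random weights and zero biases for the four gate transforms."""
    params = {}
    for gate in "fico":
        params["W" + gate] = [
            [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim + input_dim)]
            for _ in range(hidden_dim)
        ]
        params["b" + gate] = [0.0] * hidden_dim
    return params


def run_lstm(params, sequence, hidden_dim):
    """Run a unidirectional LSTM and return the hidden state at every step."""
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    outputs = []
    for x_t in sequence:
        h, c = lstm_step(params, x_t, h, c)
        outputs.append(h)
    return outputs


def bilstm(fwd_params, bwd_params, sequence, hidden_dim):
    """Concatenate forward and backward hidden states at every time step."""
    forward = run_lstm(fwd_params, sequence, hidden_dim)
    backward = run_lstm(bwd_params, sequence[::-1], hidden_dim)[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]


rng = random.Random(0)
H = 2
seq = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
fwd, bwd = init_params(3, H, rng), init_params(3, H, rng)
out = bilstm(fwd, bwd, seq, H)
```

Each element of `out` has length `2 * H`: the first half summarizes the sequence up to that step, and the second half summarizes everything from that step onwards. A classifier would feed these concatenated states into an output layer.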
4. Results and Discussion
The proposed model is simulated using Python 3.8.5 on a PC with an i5-8600K CPU, GeForce 1050Ti 4 GB GPU, 16 GB RAM, 250 GB SSD, and 1 TB HDD. The parameter settings are as follows: learning rate: 0.01, dropout: 0.5, batch size: 5, epoch count: 50, and activation: ReLU. In this section, the FER performance of the AFER-POADCNN algorithm is examined on the FER database [30], comprising 920 images and 8 classes, as illustrated in Table 1.
Figure 3 shows the confusion matrices obtained by the AFER-POADCNN algorithm at 80:20 and 70:30 of the training (TR) phase/testing (TS) phase. The outcome indicates the effective recognition and classification of all eight classes.
In Table 2 and Figure 4, the FER results of the AFER-POADCNN technique at 80:20 of the TR phase/TS phase are presented. The simulation values highlight that the AFER-POADCNN method recognized the facial emotions accurately. With 80% of the TR phase, the AFER-POADCNN technique offers an average accuracy of 98.85%, sensitivity of 83.90%, specificity of 98.93%, F-score of 86.88%, and Matthews correlation coefficient (MCC) of 86.36%. Additionally, with 20% of the TS phase, the AFER-POADCNN approach achieves an average accuracy of 99.05%, sensitivity of 90.03%, specificity of 99.22%, F-score of 91.47%, and MCC of 91.28%.
In Table 3 and Figure 5, the FER outcome of the AFER-POADCNN approach at 70:30 of the TR phase/TS phase is presented. The outcome shows that the AFER-POADCNN algorithm detected the facial emotions accurately. With 70% of the TR phase, the AFER-POADCNN system gains an average accuracy of 98.91%, sensitivity of 83.55%, specificity of 98.92%, F-score of 87.37%, and MCC of 86.99%. Furthermore, with 30% of the TS phase, the AFER-POADCNN approach reaches an average accuracy of 98.82%, sensitivity of 89.65%, specificity of 99.02%, F-score of 91.43%, and MCC of 90.74%.
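The averaged measures reported above can be derived from a multiclass confusion matrix. The sketch below assumes that they are the usual one-vs-rest quantities (accuracy, sensitivity, specificity, F-score, and MCC) computed per class and then averaged over the classes; the paper does not spell out its averaging procedure, and the function name and the 3-class matrix are purely illustrative.

```python
import math


def one_vs_rest_metrics(cm):
    """Class-averaged accuracy, sensitivity, specificity, F-score, and MCC.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    per_class = []
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(k)) - tp
        tn = total - tp - fn - fp
        acc = (tp + tn) / total
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        prec = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
        denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        mcc = (tp * tn - fp * fn) / denom if denom else 0.0
        per_class.append((acc, sens, spec, f1, mcc))
    # Average each metric over the classes.
    return [sum(col) / k for col in zip(*per_class)]


# Illustrative 3-class confusion matrix (not data from the paper).
cm = [
    [8, 1, 1],
    [0, 9, 1],
    [1, 0, 9],
]
accuracy, sensitivity, specificity, f_score, mcc = one_vs_rest_metrics(cm)
```

Under this convention, the average accuracy counts true negatives for every class, which is one reason it can sit well above the average sensitivity when there are many classes.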
To assess the performance of the AFER-POADCNN algorithm at 80:20 of the TR phase/TS phase, the TR and TS accuracy curves are plotted, as illustrated in Figure 6. The TR and TS accuracy curves show the behaviour of the AFER-POADCNN algorithm over various epochs. The figure offers meaningful details regarding the learning and generalization capabilities of the AFER-POADCNN approach. As the epoch count increases, the TR and TS accuracy curves improve. It is also observed that the AFER-POADCNN algorithm attains higher testing accuracy, which indicates its capability to identify the patterns in the TR and TS data.
Figure 7 shows the overall TR and TS loss values of the AFER-POADCNN algorithm at 80:20 of the TR phase/TS phase over epochs. The TR loss decreases over the epochs. Primarily, the loss values are reduced as the model adjusts its weights to minimize the prediction error. The loss curves demonstrate the extent to which the model fits the training data. It is observed that the TR and TS losses decrease steadily, which indicates that the AFER-POADCNN approach effectively learns the patterns present in the TR and TS data. It is also noticed that the AFER-POADCNN algorithm fine-tunes its parameters to decrease the discrepancy between the predictions and the original training labels.
The precision–recall (PR) outcome of the AFER-POADCNN approach at 80:20 of the TR phase/TS phase is represented by plotting the precision against the recall, as shown in Figure 8. The outcomes confirm that the AFER-POADCNN algorithm reaches high PR performance for all classes, which shows that the model learns to distinguish the individual classes. The AFER-POADCNN algorithm recognizes positive instances well, with minimal false positives.
The ROC curves of the AFER-POADCNN algorithm at 80:20 of the TR phase/TS phase are illustrated in Figure 9; they show the capability of the model to discriminate between the class labels. The outcome offers valuable insights into the trade-off between the true-positive rate (TPR) and the false-positive rate (FPR) at various classification thresholds and distinct epoch counts. It confirms the accurate predictive performance of the AFER-POADCNN approach in classifying the various classes.
The comparison study of the AFER-POADCNN technique with recent systems is given in Table 4 [31].
Figure 10 presents a comparative accuracy and F-score examination of the AFER-POADCNN technique with recent models. The results imply that the AFER-POADCNN approach achieves better results. In terms of accuracy, the AFER-POADCNN technique offers a higher value of 99.05%, while the HGSO-DLFER, ResNet50, SVM, MobileNet, Inception-v3, and CNN-VGG19 models obtain lower accuracy values of 98.45%, 88.54%, 91.64%, 92.32%, 93.74%, and 94.035%, respectively. Additionally, in terms of F-score, the AFER-POADCNN system achieves an enhanced value of 91.47%, while the HGSO-DLFER, ResNet50, SVM, MobileNet, Inception-v3, and CNN-VGG19 approaches reach lower F-score values of 87.78%, 85.99%, 87.55%, 86.52%, 73.82%, and 81.75%, respectively.
Figure 11 presents a comparative sensitivity and specificity inspection of the AFER-POADCNN system with recent algorithms. The outcomes indicate that the AFER-POADCNN algorithm attains the best results. In terms of sensitivity, the AFER-POADCNN algorithm achieves a superior value of 90.03%, while the HGSO-DLFER, ResNet50, SVM, MobileNet, Inception-v3, and CNN-VGG19 algorithms achieve lower sensitivity values of 84.99%, 83.96%, 83.17%, 83.74%, 80.23%, and 82.95%, respectively. Furthermore, in terms of specificity, the AFER-POADCNN system offers an enhanced value of 99.22%, while the HGSO-DLFER, ResNet50, SVM, MobileNet, Inception-v3, and CNN-VGG19 systems obtain lower specificity values of 98.65%, 83.65%, 82.18%, 83.81%, 84.06%, and 81.59%, respectively. These results confirm the superior performance of the AFER-POADCNN algorithm.