Article

AFCN: An Attention-Based Fusion Consistency Network for Facial Emotion Recognition

1
School of Teacher Development, Shaanxi Normal University, Xi’an 710062, China
2
The Fifth Primary School of Xi’an Aerospace City, Xi’an 710100, China
3
School of Artificial Intelligence, Xidian University, Xi’an 710126, China
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(17), 3523; https://doi.org/10.3390/electronics14173523
Submission received: 23 July 2025 / Revised: 24 August 2025 / Accepted: 28 August 2025 / Published: 3 September 2025

Abstract

Due to the local similarities between different facial expressions and the subjective influences of annotators, large-scale facial expression datasets contain significant label noise, and such noisy labels are a key challenge in deep facial expression recognition (FER). To address this, this paper proposes a simple and effective attention-based fusion consistency network (AFCN), which suppresses the impact of uncertainty and prevents deep networks from overemphasizing local features. Specifically, the AFCN comprises four modules: a sample certainty analysis module, a label correction module, an attention fusion module, and a fusion consistency learning module. The sample certainty analysis module calculates the certainty of each input facial expression image; the label correction module re-labels samples with low certainty based on the model’s prediction results; the attention fusion module identifies all possible key regions of facial expressions and fuses them; and the fusion consistency learning module constrains the model to maintain consistency between the regions of interest for the actual labels and the fusion of all possible key regions of facial expressions. This guides the model to perceive and learn global facial expression features and prevents it from incorrectly classifying expressions based solely on local features associated with noisy labels. Experiments are conducted on multiple noisy datasets to validate the effectiveness of the proposed method. The experimental results show that the proposed method outperforms current state-of-the-art methods, achieving in particular a 3.03% accuracy improvement on the RAF-DB dataset with 30% noise.

1. Introduction

Facial expressions [1] are an essential form of nonverbal communication in humans, conveying rich information through facial images, including emotions, cognitive states, personality traits, and intentions for social interaction. In particular, facial expression analysis has significant applications in the field of education, especially in enhancing intelligent and personalized learning environments [2]. By analyzing the facial expressions of students during learning activities, real-time information on their emotional states, such as confusion, engagement, boredom, or satisfaction, can be gathered, which educators can use to adjust their teaching strategies, improving classroom interaction and student motivation. Facial expression recognition (FER) [3,4,5,6,7] plays a significant role in enabling machines to understand human behavior and interact with humans in a friendly manner, and it has therefore attracted a great deal of attention. Over the past decade, researchers have constructed numerous large-scale facial expression datasets, mainly including CK+ [8], CASIA [9], JAFFE [10], SFEW/AFEW [4], FERPlus [11], EmotioNet [12], RAF [13], AffectNet [14], etc.
However, due to similarities between different facial expressions and the subjective biases of human annotators, label noise [15,16,17] is prevalent in many FER datasets. The similarities between different facial expressions are related to the variability of individual emotional expression. Different people not only have distinct physical appearances but also vary in how they express their emotions; even the same person may adopt different expression styles when conveying the same feeling, depending on the context. As a case in point, the facial expressions of disgust, sadness, and surprise are all characterized by lowering the corners of the mouth; facial expressions of surprise, anger, and happiness may include opening the mouth; and facial expressions of anger, sadness, and disgust may all contain furrowing of the brows. These similarities can result in ambiguity between different categories of facial expressions, which complicates the process of determining the appropriate category for a particular expression. In essence, emotion ambiguity is inevitable in emotional exhibition, which is an unavoidable issue in fine-grained emotion recognition [18]. Moreover, different annotators may assign different categories to facial expressions that belong to the same category due to subjective biases.
The label noise introduced by the above factors has a significant impact on the performance of FER models. First, the model is highly susceptible to overfitting: it fits incorrectly labelled data, learns noise as features, and fails to capture the actual feature patterns, which leads to an increase in error rates. Second, the presence of noise has a deleterious effect on the calculation of the loss function, causing optimization to deviate from the optimal path and increasing computation time. To alleviate the negative effects of label noise in FER, many methods [19,20,21,22] have been proposed, including assuming that noisy labels follow additional distributions or constraints, estimating noise transition matrices, and designing new loss functions for adversarial learning. Although these methods have theoretical guarantees, they are insufficient to handle the complexity of real-world scenarios. As a result, some scholars have begun to focus on sample selection methods to suppress label noise. The core idea of sample selection is to select high-quality samples for training and re-label low-quality samples. For example, Wang et al. [23] proposed the SCN method, which learns an importance weight for each sample and relabels samples of lower importance. Zhang et al. [24] proposed the RUL method, which learns uncertainty across different facial expressions through comparison and relabels images with higher uncertainty.
Although most existing methods mitigate the adverse effects of label noise on FER to some extent through sample selection and label correction, they overlook a potential problem: during the process of learning facial expressions, the model may rely only on local features of an expression for classification, rather than making a comprehensive judgment based on the global features of the face. For example, when SCN [23] is applied to the RAF-DB dataset with 30% noise, it is observed that SCN focuses on a few local facial regions (rather than the global facial region) during expression classification, as shown in Figure 1, where the truth labels are marked in black and the noise labels in red. In Figure 1, the brighter a local facial region, the more attention the model pays to that region during training, which implies that the region is more critical for facial expression recognition. The model thus tends to learn only local features of expressions and to classify them based on these regions.
However, due to the high similarity between different facial expressions, a robust FER model should be able to extract global features (including the eyebrows, eyes, nose, mouth, etc.). Classifying facial expressions based on only one of these features often leads to incorrect predictions. Therefore, to prevent the model from making incorrect predictions, such as classifying ‘Angry’ as ‘Sad’ based solely on the corners of the mouth being pulled down, it is necessary to constrain the model to learn all key features of facial expressions, thereby reducing the interference caused by similar local regions between different categories of expressions.
Based on this, we propose a new FER method called the attention-based fusion consistency network, which explores crucial local regions during facial expression feature learning and constrains the model to maintain consistency between the regions corresponding to the truth labels and the fusion of all crucial regions, thus promoting the learning of global facial expression features. Specifically, the proposed method is mainly composed of four modules: the sample certainty analysis module, the label correction module, the attention fusion module, and the fusion consistency learning module. First, the sample certainty analysis module calculates the certainty of each input expression image. Second, the label correction module relabels samples with low certainty based on the model’s prediction results. Subsequently, the attention fusion module explores and fuses all possible crucial regions of facial expressions. Finally, the fusion consistency learning module constrains the model to maintain consistency between the local regions for the truth labels and the fusion of crucial local regions. This ultimately enables the model to perceive and learn the global features of expressions, thereby preventing it from incorrectly classifying expressions based solely on local features associated with noisy labels.

2. Related Works

2.1. Gradient-Weighted Class Activation Mapping

Deep learning models [25,26,27,28,29] are often considered ‘black boxes’. To address the issue of low interpretability in neural networks, Selvaraju et al. [30] proposed Gradient-weighted Class Activation Mapping (Grad-CAM) to visualize the decision-making basis of convolutional neural networks; it helps us understand how a model works by highlighting the areas of the input image that are most important to a specific prediction. Recently, Li et al. [31] proposed an enhanced multiview attention model with random interpolation resize, which enhances feature diversity through parallel channel attention and data augmentation via random interpolation resize.
The core idea behind Grad-CAM is to use gradient information to infer the feature-map regions that the model focuses on. It computes the gradient of the classification score with respect to a feature-map layer, uses it to form a weighted sum of the feature maps, and generates a heat map of the same size as the input image representing the model’s attention regions. Specifically, through class activation mapping, a class activation map can be generated for each input image, indicating the importance of each pixel in the model’s decision-making process; the larger the value of a pixel, the more critical that region is to the decision. The Grad-CAM map is computed as
$L_{\text{Grad-CAM}}^{c} = \mathrm{ReLU}\left( \sum_{k} \alpha_{k}^{c} A^{k} \right),$  (1)
where $L_{\text{Grad-CAM}}^{c} \in \mathbb{R}^{u \times v}$ denotes the class activation map obtained when the current image is classified into the $c$-th class, and typically has the same dimensions as $A^{k}$. $A^{k} \in \mathbb{R}^{u \times v}$ represents the $k$-th feature map of a convolutional layer (typically the last convolutional layer in the network), where $u$ and $v$ denote its width and height, respectively; $\alpha_{k}^{c}$ denotes the weight contributed by $A^{k}$ when the current object is classified into class $c$. Since the class activation map only considers features that positively influence classification, the activation function $\mathrm{ReLU}$ is used to remove negative values. The calculation of $\alpha_{k}^{c}$ is given as
$\alpha_{k}^{c} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial\, \mathrm{logit}^{c}}{\partial A_{ij}^{k}}.$  (2)
In this context, $\mathrm{logit}^{c}$ denotes the prediction score when the model classifies the current image into category $c$, $A_{ij}^{k}$ denotes the value at position $(i, j)$ in the feature map $A^{k}$, and $Z$ denotes the number of pixels in a single channel of the feature map.
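To make the computation concrete, the following is a minimal PyTorch sketch of Equations (1) and (2), not the authors' implementation: it assumes a classification model whose last convolutional layer is passed in as `target_layer`, and the names `grad_cam`, `model`, and `image` are illustrative.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """Minimal Grad-CAM sketch: weight the feature maps A^k by the gradient-derived
    weights alpha_k^c (Eq. (2)) and apply ReLU (Eq. (1))."""
    feats = {}

    def hook(_, __, output):
        feats["A"] = output                       # feature maps A^k, shape (1, K, u, v)

    handle = target_layer.register_forward_hook(hook)
    logits = model(image.unsqueeze(0))            # forward pass, shape (1, C)
    handle.remove()

    A = feats["A"]
    # d logit_c / d A^k, same shape as A
    grads = torch.autograd.grad(logits[0, class_idx], A)[0]
    # alpha_k^c: global-average-pool the gradients over the spatial dims (1/Z * sum_ij)
    alpha = grads.mean(dim=(2, 3), keepdim=True)              # (1, K, 1, 1)
    cam = F.relu((alpha * A).sum(dim=1, keepdim=True))        # (1, 1, u, v)
    # upsample to the input resolution for visualization
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return cam.squeeze()
```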
Grad-CAM has many improved versions, including Grad-CAM++ [32], Smooth Grad-CAM++ [33], and Ablation-CAM [34], which calculate weights by introducing higher-order gradients to allocate activation regions more accurately and to solve the problem of inaccurate localization in Grad-CAM under multiple targets or multiple activation regions. However, since facial expression features are relatively concentrated, we ultimately chose Grad-CAM for its lower computational complexity and faster model optimization.

2.2. Weighted Cross Entropy Loss Function

In image classification tasks, the number of samples in different categories can vary significantly, causing the model to favor categories with more samples during training and testing while neglecting those with fewer samples. In such cases, the traditional cross-entropy function [35] may fail to adequately account for the balance between different categories, resulting in poor model performance in unbalanced datasets. To address this issue, researchers have proposed a class-weighted cross-entropy loss function [36], which applies category-based weighting to the loss function, allowing the model to treat samples from different categories more equitably.
However, for multiple samples within a specific category, their weights remain the same, ignoring differences in other aspects among the samples. In reality, due to the variability of individual emotional expression and the complexity of environmental factors, most existing in-the-wild datasets contain varying degrees of noise. To enable differentiated learning of samples with different noise levels, training-sample weighting is a well-researched technique that can adjust the contribution of each sample to training CNN models. Based on this approach, some scholars have proposed a Logit-weighted cross-entropy loss function [37], which has achieved good results. This loss function not only addresses the imbalance between different samples but is also simpler than traditional sample-based weighted cross-entropy loss functions. Specifically, the mathematical form of the Logit-weighted cross-entropy loss function is shown in Equation (3):
$L_{LWCE} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\alpha_{i} W_{y_i}^{T} f_{i} + b_{y_i}}}{\sum_{j=1}^{C} e^{\alpha_{i} W_{j}^{T} f_{i} + b_{j}}},$  (3)
where $f_{i}$ denotes the feature of the $i$-th sample belonging to class $y_{i}$, $\alpha_{i}$ denotes the sample weight of the $i$-th sample, $W_{j}$ denotes the $j$-th row of the weight matrix $W$ of the fully connected classification layer, and $b$ denotes its bias vector; $b_{j}$ denotes the $j$-th value of $b$ and is typically fixed to 0. We primarily use the Logit-weighted cross-entropy loss to calculate the classification loss of the model.
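As a rough illustration, the loss of Equation (3) can be sketched in PyTorch as follows, assuming $b_{j}$ is fixed to 0 so that scaling the full logits by $\alpha_{i}$ is equivalent; the function and argument names are illustrative rather than taken from any released code.

```python
import torch.nn.functional as F

def logit_weighted_ce(features, weights_alpha, labels, fc):
    """Sketch of the Logit-weighted cross-entropy loss of Eq. (3): each sample's
    logits W^T f_i + b are scaled by its weight alpha_i before the softmax."""
    logits = fc(features)                          # (N, C) = W f_i + b
    scaled = weights_alpha.unsqueeze(1) * logits   # alpha_i * (W_j^T f_i + b_j)
    return F.cross_entropy(scaled, labels)
```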

3. The Proposed Method

To address the issue that existing FER methods tend to misclassify expressions based solely on local features associated with noisy labels, we propose a new FER method based on attention-fusion consistency, named the Attention-based Fusion Consistency Network (AFCN), which focuses on all potentially crucial local regions as well as the regions corresponding to the truth labels. The details of the proposed method are introduced below.

3.1. Overview of the Attention-Based Fusion Consistency Network

The proposed method (AFCN) is composed of four main modules: (i) a sample certainty analysis module, (ii) a label correction module, (iii) an attention fusion module, and (iv) a fusion consistency learning module. The framework of AFCN is shown in Figure 2.
For each batch of images, we first extract deep features using ResNet [38], and a sigmoid function is applied to obtain an importance weight for each image. These weights are then multiplied by the logits in a sample reweighting scheme to obtain the final output of the network. In the sample certainty analysis module, we also introduce a ranking regularization step to explicitly reduce the importance of uncertain samples and regularize the attention weights: we first rank the learned attention weights, divide them into a high-importance group and a low-importance group, and then impose a constraint between the average weights of the two groups using a margin-based loss known as the rank regularization loss (RR-Loss). To mitigate the impact of incorrect labels, we introduce a label correction module that adjusts the labels of low-certainty samples based on the model’s output probabilities. To address the issue that the model incorrectly classifies expressions based solely on local features associated with noisy labels, we propose an attention generation module that generates a Grad-CAM map for each expression category of an input facial image as the basis for attention. These Grad-CAM maps are then fed into the fusion consistency learning module to calculate the fusion consistency loss and the category mutual exclusivity loss. Finally, we sum the above losses with weights to obtain the joint optimization loss for network training.

3.2. Sample Certainty Analysis Module

To enable our model to learn facial expressions with different levels of noise in a differentiated manner, we introduce a sample certainty analysis module into the model’s learning process. This module is designed to calculate the certainty of each input facial image, increasing the model’s focus on certain expressions and reducing its focus on uncertain expressions during the training phase. Certain samples are expected to receive high importance weights, while uncertain ones receive low importance weights. Specifically, for the input images $\{x_1, x_2, \ldots, x_n\}$, we denote by $F = [f_1, f_2, \ldots, f_n]$ the deep features extracted by the backbone network. For each image, its certainty weight is calculated by
$\alpha_{i} = \mathrm{Sigmoid}\left( W^{T} f_{i} + b \right), \quad i = 1, 2, \ldots, n,$  (4)
where $\alpha_{i} \in (0, 1)$ represents the certainty weight of the $i$-th facial image $x_{i}$, $f_{i}$ represents the deep feature extracted by the backbone network for $x_{i}$, and $W$ and $b$ represent the weight and bias of the fully connected layer, respectively. $\mathrm{Sigmoid}$ refers to the sigmoid function, which restricts the certainty weight to the range of 0 to 1. Here, the certainty of each sample reflects the dependability of the prediction given by the model: a higher certainty value means higher dependability, and vice versa. In experiments, we also find that uncertain samples generally have low importance weights; therefore, an intuitive idea is to design a strategy to relabel these samples.
Equation (4) yields a certainty weight for each image; however, without additional constraints these weights are arbitrary numbers between 0 and 1 and carry little meaning. Therefore, we sort all input images and divide them into a high-certainty group and a low-certainty group according to a specified noise rate $\beta$. We then design a regularization loss that constrains the average certainty weight $M_{H}$ of the high-certainty group to be higher than the average certainty weight $M_{L}$ of the low-certainty group, keeping the centers of the two groups separated by a margin $d_{1}$. The regularization loss is given as
$L_{reg1} = \max\left( 0,\; d_{1} - \left( M_{H} - M_{L} \right) \right),$  (5)
where
$M_{H} = \frac{1}{H} \sum_{i=1}^{H} \alpha_{i}, \qquad M_{L} = \frac{1}{L} \sum_{i=1}^{L} \alpha_{i},$  (6)
$M_{H}$ denotes the average certainty weight of the $H$ samples in the high-certainty group, and $M_{L}$ denotes the average certainty weight of the $L$ samples in the low-certainty group. In Equation (5), $d_{1}$ is a fixed hyperparameter (default value 0.15) that constrains the margin between $M_{H}$ and $M_{L}$.
After obtaining the certainty weights for the input facial images, the logits are weighted by these certainty weights to calculate the classification loss $L_{wce}$, given as
$L_{wce} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\alpha_{i} W_{y_i}^{T} f_{i} + b_{y_i}}}{\sum_{j=1}^{C} e^{\alpha_{i} W_{j}^{T} f_{i} + b_{j}}},$  (7)
where $f_{i}$ denotes the deep feature extracted for the facial image $x_{i}$, $y_{i}$ is the given label of $x_{i}$, $\alpha_{i}$ denotes the sample weight of $x_{i}$, $W_{j}$ denotes the $j$-th row of the weight matrix $W$ used for classification, $b$ denotes the bias vector of the fully connected classification layer, and $b_{j}$ denotes the $j$-th value of $b$, with $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, C$.
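A minimal PyTorch sketch of the sample certainty analysis step, combining Equations (4)–(7), is given below; the class name and layer layout are illustrative, while the 7:3 split ratio (beta = 0.7) and the margin $d_{1}$ = 0.15 follow the settings in Section 4.1. This is an assumption-laden sketch, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SampleCertaintyModule(nn.Module):
    """Sketch of the sample certainty analysis module (Eqs. (4)-(7))."""

    def __init__(self, feat_dim=2048, num_classes=7, beta=0.7, d1=0.15):
        super().__init__()
        self.alpha_fc = nn.Linear(feat_dim, 1)        # W, b of Eq. (4)
        self.cls_fc = nn.Linear(feat_dim, num_classes)
        self.beta, self.d1 = beta, d1

    def forward(self, features, labels):
        alpha = torch.sigmoid(self.alpha_fc(features)).squeeze(1)   # Eq. (4), shape (N,)

        # Rank regularization (Eqs. (5)-(6)): split by certainty, constrain the means.
        n_high = int(self.beta * alpha.numel())
        sorted_alpha, _ = torch.sort(alpha, descending=True)
        m_high = sorted_alpha[:n_high].mean()
        m_low = sorted_alpha[n_high:].mean()
        l_reg1 = F.relu(self.d1 - (m_high - m_low))

        # Logit-weighted classification loss (Eq. (7));
        # bias assumed ~0, so scaling the whole logits approximates the formula.
        logits = self.cls_fc(features)
        l_wce = F.cross_entropy(alpha.unsqueeze(1) * logits, labels)
        return l_wce, l_reg1, alpha, logits
```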

3.3. Label Correction Module

In the sample certainty analysis module, the input images are divided into a high-certainty group and a low-certainty group. For the low-certainty images, the label correction module compares the maximum predicted probability $p_{\max}$ with the predicted probability $p_{org}$ of the given initial label. If $p_{\max}$ exceeds $p_{org}$ by more than a threshold $d_{2}$, the input image is considered a noisy sample, and the label correction strategy modifies its label as
$y = \begin{cases} l_{\max}, & p_{\max} - p_{org} > d_{2} \\ l_{org}, & \text{otherwise}. \end{cases}$  (8)
In Equation (8), $y$ represents the corrected label, $l_{\max}$ represents the category index corresponding to the maximum predicted probability, $d_{2}$ is the threshold (default value 0.2), and $l_{org}$ is the given initial label.
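A short sketch of this correction rule (Equation (8)) is shown below, assuming `probs` holds the softmax probabilities of a batch; in the full method it would only be applied to the low-certainty group, which is omitted here for brevity.

```python
import torch

def correct_labels(probs, labels, d2=0.2):
    """Sketch of Eq. (8): relabel a sample with the top-predicted class when p_max
    exceeds the probability of the given label by more than the threshold d2."""
    p_max, l_max = probs.max(dim=1)                         # (N,), (N,)
    p_org = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    return torch.where(p_max - p_org > d2, l_max, labels)
```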

3.4. Attention Generation Module

To prevent the model from incorrectly classifying facial expressions based solely on local features associated with noisy labels, we design an attention generation module that utilizes Grad-CAM to explore all crucial local regions that the model focuses on during the learning process.
Specifically, for an input facial image $x_{i}$ with given label $c$, the region that the network $f$ focuses on when recognizing $x_{i}$ as category $c$ can be obtained through the following steps. First, the image $x_{i}$ is fed into the backbone network $f$, producing the output of the last convolutional layer, $F \in \mathbb{R}^{K \times H \times W}$, where $K$, $H$, and $W$ denote the number of channels, the height, and the width of the feature map $F$, respectively. The feature map is then converted into a feature vector $v \in \mathbb{R}^{K}$ by global average pooling, and $v$ is passed to a fully connected layer to obtain the score of each category. Grad-CAM is used to estimate the attention map $m^{c}$ of category $c$ when the network $f$ recognizes the facial expression $x_{i}$, by performing a weighted sum over the channels as follows:
$m^{c} = \mathrm{ReLU}\left( \sum_{k=1}^{K} \alpha_{k}^{c} F^{k} \right),$  (9)
where $m^{c}$ denotes the class activation map corresponding to class $c$ when the image $x_{i}$ is classified into class $c$, which has the same spatial dimensions as $F^{k}$; $F^{k}$ represents the feature map of the $k$-th channel, and $\alpha_{k}^{c}$ denotes the weight contributed by $F^{k}$ when the image is classified into class $c$.
Since the class activation map only considers features that positively influence classification, we apply the ReLU activation function to remove negative values. The weight $\alpha_{k}^{c}$ is calculated by the following formulation:
$\alpha_{k}^{c} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial\, \mathrm{logit}^{c}}{\partial F_{ij}^{k}},$  (10)
where $\mathrm{logit}^{c}$ denotes the prediction score when the image is classified into class $c$, $F_{ij}^{k}$ represents the value at position $(i, j)$ of the feature map $F^{k}$, and $Z$ denotes the number of pixels in a single channel of the feature map. Similarly, by calculating the attention maps of the other categories, we obtain all the attention maps $M_{i} = [m^{1}, m^{2}, \ldots, m^{C}]$ of the image $x_{i}$, where $C$ is the number of categories. These attention maps reflect the focus of the model on different categories while classifying the facial image $x_{i}$.
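As a usage illustration (an assumption-based sketch rather than the authors' code), the per-category maps can be collected by looping the Grad-CAM computation sketched in Section 2.1 over all classes; `model`, `target_layer`, `image`, and `num_classes` are assumed to be defined as in that sketch.

```python
import torch

# Collect the attention maps M_i = [m^1, ..., m^C] for one image by reusing the
# grad_cam helper sketched in Section 2.1, one forward/backward pass per class.
attention_maps = torch.stack(
    [grad_cam(model, target_layer, image, c) for c in range(num_classes)]
)  # shape (C, H, W)
```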

3.5. Fusion Consistency Learning Module

The fusion consistency learning module builds upon the attention generation module by constraining the model to maintain consistency between the regions of interest for the given emotional labels and the potentially crucial regions of the expression. This helps the model perceive the global features of the expression and thereby improves its classification ability for FER. Specifically, based on the attention maps $M_{i} = [m^{1}, m^{2}, \ldots, m^{C}]$ of the different emotional categories, we fuse them to obtain the fused attention map $\mathrm{merged}_{i}$ by
$\mathrm{merged}_{i}(h, w) = \max\left( m^{1}(h, w),\, m^{2}(h, w),\, \ldots,\, m^{C}(h, w) \right),$  (11)
where $(h, w)$ denotes the position of a pixel in the facial image $x_{i}$, and $m^{j}(h, w)$ denotes the value of the attention map of the $j$-th emotional category at that position. By Equation (11), the value at position $(h, w)$ is the maximum of the values of the attention maps of all emotional categories at that position. Therefore, the fused map $\mathrm{merged}_{i}$ includes all potentially crucial regions related to the emotional categories.
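A one-line PyTorch sketch of this per-pixel maximum (Equation (11)) might look as follows, assuming the $C$ attention maps of one image are stacked into a single tensor.

```python
def fuse_attention_maps(maps):
    """Sketch of Eq. (11): per-pixel maximum over the C category attention maps,
    keeping every potentially crucial region. `maps` has shape (C, H, W)."""
    return maps.max(dim=0).values          # (H, W)
```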
Meanwhile, we design a fusion consistency loss that measures the consistency between the fused attention map and the region of interest corresponding to the given label, to prevent the model from incorrectly classifying facial expressions based solely on local features associated with noisy labels. Note that the fused attention map covers all potentially crucial regions of the facial expression. The fusion consistency loss is defined as
$L_{con} = \frac{1}{NHW} \sum_{i=1}^{N} \left\| m^{y_i} - \mathrm{merged}_{i} \right\|^{2},$  (12)
where $y_{i}$ denotes the label of the facial image $x_{i}$ after correction by the label correction module, $m^{y_i}$ denotes the region of interest for that label, and $\mathrm{merged}_{i}$ denotes the fusion of all potential regions of interest obtained when the model recognizes all categories for $x_{i}$.
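Since Equation (12) is simply a mean squared difference over the batch and the spatial positions, a minimal sketch (with illustrative names) is:

```python
def fusion_consistency_loss(label_maps, merged_maps):
    """Sketch of Eq. (12): mean squared difference between the attention map of the
    (corrected) label, m^{y_i}, and the fused map merged_i.
    Both tensors have shape (N, H, W)."""
    return ((label_maps - merged_maps) ** 2).mean()
```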
Although the consistency loss helps improve the robustness of the model, experiments have shown that using only this consistency loss often leads to model collapse: the model’s attention maps for all categories become similar and concentrate on a single point, because the model tends to make all attention maps identical in order to minimize the consistency loss. Notably, in the classification task, different emotion categories are inherently independent of one another, and some emotional categories are mutually exclusive, such as positive emotions (e.g., happy) and negative emotions (e.g., sad, disgust). Based on this, we design a category mutual exclusivity loss that constrains the attention maps of the categories (except the corrected category) to be mutually distinct from one another, thereby supporting the learning of the consistency loss. Specifically, the category mutual exclusivity loss is calculated by
$L_{reg2} = \frac{1}{NTHW} \sum_{i=1}^{N} \sum_{\substack{j=1,\, j \neq y_i}}^{C} \;\sum_{\substack{k=j+1,\, k \neq y_i}}^{C} \left( 1 - \left\| M_{i}^{j} - M_{i}^{k} \right\|^{2} \right),$  (13)
where $N$ denotes the batch size, $T = \frac{(C-1)(C-2)}{2}$ represents the number of pairwise combinations of emotion categories excluding the actual category of the expression, $C$ denotes the total number of emotion categories, $H$ and $W$ denote the height and width of $M_{i}^{j}$ and $M_{i}^{k}$, respectively, $y_{i}$ denotes the emotion category of the $i$-th expression image $x_{i}$ after correction by the label correction module, and $M_{i}^{j}$ and $M_{i}^{k}$ denote the attention maps of the model for the $i$-th expression image for categories $j$ and $k$, respectively.
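A possible sketch of this loss is given below; because the exact normalization of Equation (13) is reconstructed from context, this should be read as one plausible instantiation that penalizes the similarity of the non-label attention maps, not as the definitive implementation.

```python
import torch

def category_exclusivity_loss(maps, labels):
    """Sketch of the category mutual exclusivity loss (Eq. (13)): push the attention
    maps of the non-label categories apart from one another.
    `maps` has shape (N, C, H, W) and `labels` shape (N,)."""
    n, c, h, w = maps.shape
    t = (c - 1) * (c - 2) // 2                 # pairs of categories excluding y_i
    loss = maps.new_zeros(())
    for i in range(n):
        keep = [j for j in range(c) if j != int(labels[i])]
        for a in range(len(keep)):
            for b in range(a + 1, len(keep)):
                diff = maps[i, keep[a]] - maps[i, keep[b]]
                loss = loss + (1.0 - (diff ** 2).sum())
    return loss / (n * t * h * w)
```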

3.6. Joint Optimization

In short, the total loss of the proposed method is formulated as
$L_{total} = \lambda_{1} L_{wce} + \lambda_{2} L_{reg1} + \lambda_{3} L_{con} + \lambda_{4} L_{reg2},$  (14)
where $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$, and $\lambda_{4}$ weight the classification (weighted cross-entropy) loss, the regularization loss for sample certainty computation, the fusion consistency loss, and the category mutual exclusivity loss that assists fusion consistency learning, respectively. In this model, $\lambda_{1}$ and $\lambda_{2}$ are set to 1 by default, while the values of $\lambda_{3}$ and $\lambda_{4}$ are discussed in the subsequent parameter analysis. Additionally, it is essential to note that during the initial training phase the model has often not yet learned sufficiently effective feature representations; introducing label correction prematurely at this stage can interfere with the learning process and make it difficult to achieve optimal performance. Therefore, we monitor the model’s performance and ultimately begin optimizing the label correction starting from the tenth iteration.
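For clarity, the joint objective of Equation (14) can be sketched as follows, using the default weights and the best-performing values from the parameter analysis in Section 4.4 ($\lambda_{3} = 0.5$, $\lambda_{4} = 3$); this is illustrative only.

```python
# Illustrative combination of the four losses of Eq. (14); lambda1 = lambda2 = 1 by
# default, and lambda3 = 0.5, lambda4 = 3 follow the parameter analysis in Section 4.4.
def total_loss(l_wce, l_reg1, l_con, l_reg2,
               lambda1=1.0, lambda2=1.0, lambda3=0.5, lambda4=3.0):
    return (lambda1 * l_wce + lambda2 * l_reg1
            + lambda3 * l_con + lambda4 * l_reg2)
```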

4. Experiments and Analyses

To validate the performance of the proposed method, we conduct experiments on three public FER datasets with different noise ratios (10%, 20%, and 30%), covering three aspects: comparisons of the proposed method with state-of-the-art methods, ablation studies, and analyses of the hyperparameters under different values. The details are given below.

4.1. Experimental Setup

In the experiments, we apply three public FER datasets to validate the classification performance: RAF-DB [13], FERPlus [11], and AffectNet [14], which are commonly used in existing methods. For each dataset, all facial images are uniformly resized to 224 × 224 before being fed into the model, and random erasing and horizontal flipping are applied for data augmentation. ResNet50 is selected as the backbone network and pre-trained on the MS-Celeb-1M face recognition dataset. All experiments are implemented using PyTorch 2.7 and an RTX 3090 GPU with 24 GB of VRAM (NVIDIA Corporation, Santa Clara, CA, USA). The model is trained for 30 epochs with a batch size of 32 and an initial learning rate of 0.00005. An exponential learning rate (LR) scheduler with a gamma of 0.9 is used to reduce the learning rate in subsequent epochs. In each training batch, the images are divided into two parts: a high-certainty expression set and a low-certainty expression set. The ratio of the number of high-certainty to low-certainty expressions is set to 7:3, and the margin $d_{1}$ between the average certainty weights of the two sets is set to 0.15. Furthermore, since training in the early epochs is not stable enough, the label correction module is optimized from the 5th round, and the label correction threshold $d_{2}$ is set to 0.2.
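The training configuration above can be restated as the following PyTorch/torchvision sketch; the Adam optimizer is an assumption (the optimizer is not named in the text), and the dataset wrappers, the MS-Celeb-1M pre-trained weights, and the AFCN-specific heads are omitted.

```python
import torch
from torchvision import transforms, models

# Data augmentation and optimization settings from Section 4.1 (illustrative sketch).
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.RandomErasing(),          # random erasing operates on tensors
])

backbone = models.resnet50()             # MS-Celeb-1M pre-trained weights would be loaded here
optimizer = torch.optim.Adam(backbone.parameters(), lr=5e-5)   # optimizer type assumed
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(30):
    # ... iterate over batches of size 32 and optimize the joint loss of Eq. (14) ...
    scheduler.step()
```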

4.2. Comparisons of Classification Performance

To verify the robustness of the proposed method to noisy labels, we first conduct experiments with different noise ratios (10%, 20%, and 30%), where the labels of training samples are randomly changed to other labels according to the noise ratio. Three state-of-the-art methods are used for comparison: SCN [23], RUL [24], and EAC [21]. The experimental results are shown in Table 1. From Table 1, it is seen that our method (AFCN) outperforms the compared methods in all cases on the RAF-DB and FERPlus datasets. On the one hand, this is due to the weighted classification loss for deterministic expressions; on the other hand, it is due to the correction of noisy labels. The former enables the model to focus more on deterministic facial expressions, while the latter mitigates the adverse effects of noisy labels on facial expression recognition. Additionally, the improvement in model performance is also attributed to our fusion consistency loss. Before the introduction of the fusion consistency loss, the model tended to incorrectly classify expressions based solely on local features associated with noisy labels; after its introduction, the model can effectively perceive the global features of expressions and learn and classify expressions based on these global features, thereby improving the model’s accuracy in facial expression recognition. For the AffectNet dataset, our method outperforms all compared methods under 30% noise but slightly underperforms under 10% and 20% noise. This may be because the AffectNet dataset is relatively complex and, besides label noise, other issues may be involved, which warrant further investigation.
Furthermore, to illustrate the classification performance of our method on noisy datasets, Figure 3 shows the confusion matrices obtained by our method on the RAF-DB dataset with 10% (left), 20% (middle), and 30% (right) noisy labels. As can be seen, our method can accurately recognize various facial expressions under different noise levels, effectively demonstrating its robustness to label noise. Specifically, under the three noise levels mentioned above, the model achieves an accuracy of 95% for recognizing happy expressions, and no happy expression is misclassified as angry. However, the model still predicts some facial expressions labeled ‘fear’ in the RAF-DB dataset as ‘surprise’, although this does not amount to a complete classification error. The remaining confusion in recognizing certain expressions is likely due to the inherent ambiguity of some expressions. For example, classifying an image of a panicked facial expression as either surprise or fear seems reasonable, as it inherently contains multiple emotions.
In addition, we also show some visualization results of the proposed method on the RAF-DB dataset with 30% noise, in comparison with SCN. The results are shown in Figure 4, where SCN predicts the noisy labels and AFCN predicts the true labels. It is seen that, compared to SCN, the proposed method effectively captures the global features of facial expressions, preventing the model from making incorrect predictions based solely on local features related to noisy labels and thus improving the classification performance of the model.

4.3. Ablation Studies

To validate the effectiveness of each module in our method, we perform ablation experiments on the RAF-DB dataset with 30% noise, as shown in Table 2, where $L_{wce}$ corresponds to the logit-weighted cross-entropy loss, $L_{reg1}$ corresponds to the regularization loss of the sample certainty analysis module, $L_{con}$ corresponds to the fusion consistency loss, and $L_{reg2}$ corresponds to the category mutual exclusivity loss of the fusion consistency learning module.
From Table 2, it is obvious that when all modules are used for training (Case 6), the model achieves the best performance. For Case 1 (the classification loss is not weighted according to the certainty weights of facial expressions), the model cannot focus on certain facial expressions and is instead distracted by uncertain expressions, leading to a decline in performance. For Case 2 (the regularization loss is not used to constrain the learning process of expression certainty during the weighting of classification loss), the model learns certainty weights for each expression as arbitrary values between 0 and 1, rendering the weighted cross-entropy loss meaningless and naturally affecting the model’s performance. If the label correction module is not used to correct potential noise labels (Case 3), the model will be affected by noise labels, leading to biased learning and reduced model performance. When the fusion consistency loss is not used to constrain the model’s perception of global facial expression features (Case 4), the model may focus on local features associated with noise labels, leading to prediction errors. When the category mutual exclusion loss is not incorporated to assist in the learning process of fusion consistency loss (Case 5), the model cannot fully extract all potential features of facial expressions, leading to biased learning of fusion consistency loss and indirectly reducing the model’s performance. In short, the experimental results indicate that the model achieves the best performance when all modules are used, which indirectly reflects the effectiveness of each module in the proposed method.

4.4. Parameters Analyses

In this part, we conduct experiments on the RAF-DB and FERPlus datasets with 30% noise using different values of $\lambda_{3}$ (the weight of the fusion consistency loss) and $\lambda_{4}$ (the weight of the category mutual exclusivity loss) to analyze the impact of these hyperparameters on classification performance. In the proposed method, the fusion consistency loss constrains the model to maintain consistency between the facial regions on which it focuses for the truth labels and the fusion of potentially crucial regions during classification. This helps the model perceive the global features of facial expressions, thereby avoiding incorrect predictions based solely on local features associated with noisy labels. The category mutual exclusivity loss assists the learning of the fusion consistency loss, encouraging the model to perceive all potentially crucial regions of facial expressions.
Since the experiments revealed that, for a batch of input facial images, the category mutual exclusivity loss is much smaller than the fusion consistency loss, in this experiment we vary $\lambda_{4}$ from 1 to 10 and $\lambda_{3}$ from 0.1 to 1. The experimental results are shown in Table 3. From Table 3, it is seen that when $\lambda_{3}$ is set to 0.5 and $\lambda_{4}$ is set to 3, the model achieves the best performance on both the RAF-DB and FERPlus datasets with noisy labels. This indicates that the fusion consistency loss and the category mutual exclusivity loss reach a good balance during model training, jointly promoting the model’s feature learning of expressions and yielding favorable outcomes.

4.5. Potential Limitations and Practical Considerations

AFCN performs well on the high-noise AffectNet dataset, which contains ambiguous facial expressions, mixed emotions, low-resolution faces, frequent occlusions, extreme illumination conditions, and large pose variations. The promising results under these conditions suggest that the proposed model is resistant to typical sources of noise and distortion. Furthermore, AFCN adopts ResNet-50 as its backbone, enabling compatibility with low-computational-power environments; on an NVIDIA RTX 3090 GPU, the model achieves 400 frames per second, which makes it suitable for real-time educational applications that require immediate feedback. However, its performance in extreme real-world scenarios, particularly under severe occlusions combined with complex background interference or on ultra-low-power devices, warrants further investigation to ensure robust deployment in all target domains.

5. Conclusions

To address label noise in facial expression recognition tasks, we propose a facial expression recognition algorithm based on attention fusion consistency. The algorithm identifies all the key regions that the model focuses on during expression learning and classification and constrains the model to maintain consistency between the regions it focuses on for the true expression labels and the fusion of all possible key regions of the expression. This prevents the model from incorrectly classifying expressions based solely on local features associated with noisy labels. Additionally, to enable differentiated learning for expressions with varying degrees of noise, the algorithm calculates a certainty weight for each input expression image and weights the classification loss accordingly, allowing the model to focus more on certain expressions and mitigating the interference caused by uncertain ones. Furthermore, the algorithm also performs label correction on expressions with lower certainty, enabling the model to learn as much helpful information as possible from cleaner data. Various experiments were conducted on the noisy RAF-DB, FERPlus, and AffectNet datasets, and the results demonstrate that our AFCN achieves better classification performance than existing algorithms, confirming its robustness to label noise.

Author Contributions

Conceptualization, S.M.; Methodology, Q.W. and S.M.; Software, H.P.; Validation, Q.W. and H.P.; Writing—original draft, Q.W. and H.P.; Writing—review & editing, S.M.; Funding acquisition, S.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the State Key Program of the National Natural Science Foundation of China (Grant No. 62234010) and the National Natural Science Foundation of China (Grant No. 62576265).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author. All experimental datasets are from publicly shared datasets, http://www.whdeng.cn/RAF/model1.html; https://mohammadmahoor.com/pages/databases/; https://www.kaggle.com/datasets/deadskull7/fer2013 (accessed on 7 July 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Pantic, M.; Rothkrantz, L.J.M. Automatic analysis of facial expressions: The state of the art. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 22, 1424–1445. [Google Scholar] [CrossRef]
  2. Fang, B.; Li, X.; Han, G.; He, J. Facial expression recognition in educational research from the perspective of machine learning: A systematic review. IEEE Access 2023, 11, 112060–112074. [Google Scholar] [CrossRef]
  3. Ortega-Garcia, J.; Fierrez, J.; Alonso-Fernandez, F.; Galbally, J.; Freire, M.R.; Gonzalez-Rodriguez, J.; Garcia-Mateo, C.; Alba-Castro, J.L.; Gonzalez-Agulla, E.; Otero-Muras, E.; et al. The Multiscenario Multienvironment BioSecure Multimodal Database (BMDB). IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1097–1111. [Google Scholar] [CrossRef] [PubMed]
  4. Li, S.; Deng, W. Deep Facial Expression Recognition: A Survey. IEEE Trans. Affect. Comput. 2022, 13, 1195–1215. [Google Scholar] [CrossRef]
  5. Hino, H.; Murata, N. Information estimators for weighted observations. Neural Netw. 2013, 46, 260–275. [Google Scholar] [CrossRef]
  6. Zheng, C.; Mendieta, M.; Chen, C. Poster: A pyramid cross-fusion transformer network for facial expression recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 3146–3155. [Google Scholar]
  7. Shi, G.; Mao, S.; Gou, S.; Yan, D.; Jiao, L.; Xiong, L. Adaptively enhancing facial expression crucial regions via a local non-local joint network. Mach. Intell. Res. 2024, 21, 331–348. [Google Scholar] [CrossRef]
  8. Lucey, P.; Cohn, J.F.; Kanade, T.; Saragih, J.; Ambadar, Z.; Matthews, I. The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition—Workshops, San Francisco, CA, USA, 13–18 June 2010; pp. 94–101. [Google Scholar] [CrossRef]
  9. Dong, J.; Wang, W.; Tan, T. CASIA Image Tampering Detection Evaluation Database. In Proceedings of the 2013 IEEE China Summit and International Conference on Signal and Information Processing, Beijing, China, 6–10 July 2013; pp. 422–426. [Google Scholar] [CrossRef]
  10. Lyons, M.; Akamatsu, S.; Kamachi, M.; Gyoba, J. Coding facial expressions with Gabor wavelets. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, Nara, Japan, 14–16 April 1998; pp. 200–205. [Google Scholar] [CrossRef]
  11. Barsoum, E.; Zhang, C.; Canton Ferrer, C.; Zhang, Z. Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI), Tokyo, Japan, 12–16 November 2016. [Google Scholar]
  12. Benitez-Quiroz, C.F.; Srinivasan, R.; Martinez, A.M. EmotioNet: An Accurate, Real-Time Algorithm for the Automatic Annotation of a Million Facial Expressions in the Wild. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5562–5570. [Google Scholar] [CrossRef]
  13. Li, S.; Deng, W.; Du, J. Reliable Crowdsourcing and Deep Locality-Preserving Learning for Expression Recognition in the Wild. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2584–2593. [Google Scholar]
  14. Mollahosseini, A.; Hasani, B.; Mahoor, M.H. AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild. IEEE Trans. Affect. Comput. 2017, 10, 18–31. [Google Scholar] [CrossRef]
  15. Gusak, J.; Katrutsa, A.; Daulbaev, T.; Cichocki, A.; Oseledets, I.V. Meta-Solver for Neural Ordinary Differential Equations. arXiv 2021, arXiv:2103.08561. [Google Scholar] [CrossRef]
  16. Frenay, B.; Verleysen, M. Classification in the Presence of Label Noise: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2014, 25, 845–869. [Google Scholar] [CrossRef]
  17. Song, H.; Kim, M.; Park, D.; Shin, Y.; Lee, J.G. Learning From Noisy Labels with Deep Neural Networks: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 8135–8153. [Google Scholar] [CrossRef]
  18. Singh, G.; Brahma, D.; Rai, P.; Modi, A. Text-based fine-grained emotion prediction. IEEE Trans. Affect. Comput. 2023, 15, 405–416. [Google Scholar] [CrossRef]
  19. Liu, Y.; Zhang, X.; Kauttonen, J.; Zhao, G. Uncertain Label Correction via Auxiliary Action Unit Graphs for Facial Expression Recognition. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 777–783. [Google Scholar]
  20. Le, N.; Nguyen, K.; Tran, Q.; Tjiputra, E.; Le, B.; Nguyen, A. Uncertainty-Aware Label Distribution Learning for Facial Expression Recognition. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 6077–6086. [Google Scholar] [CrossRef]
  21. Zhang, Y.; Wang, C.; Ling, X.; Deng, W. Learn From All: Erasing Attention Consistency for Noisy Label Facial Expression Recognition. In Computer Vision—ECCV 2022; Springer: Cham, Switzerland, 2022. [Google Scholar]
  22. Mao, S.; Zhang, Y.; Yan, D.; Chen, P. Heterogeneous Dual-Branch Emotional Consistency Network for Facial Expression Recognition. IEEE Signal Process. Lett. 2025, 32, 566–570. [Google Scholar] [CrossRef]
  23. Wang, K.; Peng, X.; Yang, J.; Lu, S.; Qiao, Y. Suppressing Uncertainties for Large-Scale Facial Expression Recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6896–6905. [Google Scholar]
  24. Zhang, Y.; Wang, C.; Deng, W. Relative Uncertainty Learning for Facial Expression Recognition. In NIPS ’21, Proceedings of the 35th International Conference on Neural Information Processing Systems, Online, 6–14 December 2021; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 17616–17627. [Google Scholar]
  25. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  26. Guo, Y.; Liu, Y.; Oerlemans, A.; Lao, S.; Wu, S.; Lew, M.S. Deep learning for visual understanding: A review. Neurocomputing 2016, 187, 27–48. [Google Scholar] [CrossRef]
  27. Janiesch, C.; Zschech, P.; Heinrich, K. Machine learning and deep learning. Electron. Mark. 2021, 31, 685–695. [Google Scholar] [CrossRef]
  28. Niu, Z.; Zhong, G.; Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 2021, 452, 48–62. [Google Scholar] [CrossRef]
  29. Tang, H.; Yuan, C.; Li, Z.; Tang, J. Learning attention-guided pyramidal features for few-shot fine-grained recognition. Pattern Recognit. 2022, 130, 108792. [Google Scholar] [CrossRef]
  30. Selvaraju, R.R.; Das, A.; Vedantam, R.; Cogswell, M.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int. J. Comput. Vis. 2016, 128, 336–359. [Google Scholar] [CrossRef]
  31. Li, P.; Tao, H.; Zhou, H.; Zhou, P.; Deng, Y. Enhanced Multiview attention network with random interpolation resize for few-shot surface defect detection. Multimed. Syst. 2025, 31, 36. [Google Scholar] [CrossRef]
  32. Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 839–847. [Google Scholar]
  33. Omeiza, D.; Speakman, S.; Cintas, C.; Weldermariam, K. Smooth grad-cam++: An enhanced inference level visualization technique for deep convolutional neural network models. arXiv 2019, arXiv:1908.01224. [Google Scholar]
  34. Desai, S.; Ramaswamy, H.G. Ablation-cam: Visual explanations for deep convolutional network via gradient-free localization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, 1–5 March 2020; pp. 983–991. [Google Scholar]
  35. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
  36. Ho, Y.; Wookey, S. The Real-World-Weight Cross-Entropy Loss Function: Modeling the Costs of Mislabeling. IEEE Access 2020, 8, 4806–4813. [Google Scholar] [CrossRef]
  37. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Figure 1. Attention maps of SCN algorithm for some facial expressions in the noisy RAF-DB dataset.
Figure 2. The framework of the Attention-based Fusion Consistency Network (AFCN).
Figure 3. Classification confusion matrix obtained by AFCN with different noise proportions.
Figure 4. Visualization results obtained by AFCN on RAF-DB dataset with 30% noisy labels.
Table 1. Comparisons of classification performance of AFCN with state-of-the-art methods, where the best results are bold.

Methods        Noise   RAF-DB (%)   FER+ (%)   AffectNet (%)
Baseline       10%     81.01        83.29      57.24
SCN            10%     82.15        84.99      58.60
RUL            10%     86.17        86.93      60.54
EAC            10%     88.02        87.03      61.11
AFCN (Ours)    10%     82.15        84.99      58.60
Baseline       20%     77.98        82.34      55.89
SCN            20%     79.79        83.35      57.51
RUL            20%     84.32        85.05      59.01
EAC            20%     86.05        86.07      60.29
AFCN (Ours)    20%     87.78        88.50      59.58
Baseline       30%     75.50        79.77      52.16
SCN            30%     77.45        82.20      54.60
RUL            30%     82.06        83.90      56.93
EAC            30%     84.42        85.44      58.31
AFCN (Ours)    30%     87.45        87.83      58.71
Table 2. Performance of the proposed method with or without different modules, where the best results are bold, ✓ indicates that the model is trained with the corresponding loss or module, and × indicates that it is trained without it.

Cases   L_wce   L_reg1   Label Correction   L_con   L_reg2   RAF-DB
1       ×       ✓        ✓                  ✓       ✓        86.57%
2       ✓       ×        ✓                  ✓       ✓        86.39%
3       ✓       ✓        ×                  ✓       ✓        85.01%
4       ✓       ✓        ✓                  ×       ✓        86.70%
5       ✓       ✓        ✓                  ✓       ×        86.60%
6       ✓       ✓        ✓                  ✓       ✓        87.45%
Table 3. Comparisons of classification performance under different values of $\lambda_{3}$ and $\lambda_{4}$, where the best results are bold.

Dataset    λ3      λ4 = 1    λ4 = 3    λ4 = 5    λ4 = 7    λ4 = 10
RAF-DB     0.1     87.03%    86.93%    87.31%    86.83%    86.54%
           0.3     86.99%    87.35%    86.96%    86.44%    86.86%
           0.5     87.08%    87.45%    86.54%    87.29%    86.57%
           0.7     87.13%    87.39%    86.89%    86.76%    86.73%
           1       87.21%    87.27%    87.16%    86.97%    86.75%
FERPlus    0.1     86.78%    86.75%    86.73%    86.62%    86.59%
           0.3     86.89%    87.01%    87.15%    86.97%    86.83%
           0.5     87.10%    87.83%    87.71%    86.65%    86.57%
           0.7     87.21%    87.46%    86.97%    87.03%    86.88%
           1       87.15%    87.32%    87.36%    86.81%    86.69%

Share and Cite

MDPI and ACS Style

Wei, Q.; Pei, H.; Mao, S. AFCN: An Attention-Based Fusion Consistency Network for Facial Emotion Recognition. Electronics 2025, 14, 3523. https://doi.org/10.3390/electronics14173523
