Article

Method of Multi-Label Visual Emotion Recognition Fusing Fore-Background Features

by Yuehua Feng and Ruoyan Wei *
School of Management Science and Information Engineering, Hebei University of Economics and Business, Shijiazhuang 050061, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8564; https://doi.org/10.3390/app14188564
Submission received: 13 August 2024 / Revised: 7 September 2024 / Accepted: 19 September 2024 / Published: 23 September 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract:
This paper proposes a method for multi-label visual emotion recognition that fuses fore-background features to address three issues that visual multi-label emotion recognition often overlooks: the influence on emotion recognition of the background in which a person is placed and of the foreground, such as social interactions between different individuals; the simplification of the multi-label recognition task into multiple independent binary classification tasks; and the neglect of the global correlations between different emotion labels. First, a fore-background-aware emotion recognition model (FB-ER) is proposed, which is a three-branch multi-feature hybrid fusion network. It efficiently extracts body features through a core region unit (CR-Unit), represents background features as background keywords, and extracts depth map information to model social interactions between different individuals as foreground features. These three features are fused at both the feature and decision levels. Second, a multi-label emotion recognition classifier (ML-ERC) is proposed, which captures the relationships between different emotion labels by designing a label co-occurrence probability matrix and a cosine similarity matrix, and uses a graph convolutional network to learn the correlations between emotion labels, thereby generating a classifier that accounts for emotion correlations. Finally, the visual features are combined with the object classifier to enable the multi-label recognition of 26 different emotions. The proposed method was evaluated on the Emotic dataset, and the results show an improvement of 0.732% in the mAP and 0.007 in the Jaccard coefficient compared with the state-of-the-art method.

1. Introduction

Emotion recognition has become increasingly integrated into various aspects of daily life and social activities, such as monitoring student learning states [1], enhancing human–computer interaction [2], driver monitoring [3], and lie detection [4].
In visual-based emotion recognition research, facial expressions have traditionally been regarded as the most effective features [5]. The field has seen rapid advances in facial expression analysis, leading to significant progress. However, in real-world scenarios, individuals are often engaged in diverse social activities within varied environments, which makes it challenging to accurately capture facial information due to factors such as head tilts, occlusions, and uneven lighting. Beyond facial expressions, recent studies have begun to explore additional visual cues. For instance, Aviezer et al. [6] and Martinez [7] conducted psychological experiments in silent environments where participants’ faces were obscured. Their findings revealed that observers could accurately infer emotions through body posture and environmental context, suggesting that these factors are also valuable for emotion recognition. Additionally, there has been limited exploration of foreground information, such as social interactions between individuals, which can offer insights into behaviors through features like distance and proximity and thereby enhance emotion recognition accuracy.
Most current research on emotion recognition [8,9,10] focuses on the single-label recognition of a few common emotions, such as happiness, sadness, surprise, neutrality, and anger. However, human emotions are inherently complex and multidimensional, and single-label classification falls short of capturing the full spectrum of emotional expression. Emotion datasets such as Emotic [11] encompass 26 fine-grained emotion categories, and multi-label recognition of these categories can provide a more nuanced and comprehensive description of an individual’s emotional state. In multi-label emotion recognition, some studies [12] overlook the relationships between different emotions, often decomposing the task into independent binary classification tasks, which fails to account for the global correlations between all emotion labels and leads to reduced recognition accuracy. Traditional multi-label classification methods, such as ML-KNN [13] and ML-RBF [14], attempt to learn label correlations through kernel functions, loss functions, and association rules. However, these methods are often inefficient and primarily capture local correlations. The self-attention mechanism in Transformer architectures has also been employed to model label correlations [15,16], but it is better suited to long sequence data and demands substantial computational resources. Graph-based approaches can model label correlations as well, but the construction of the adjacency matrix is crucial. Consequently, developing a multi-label classifier that effectively captures the global correlations of emotions remains a significant challenge.
To address the above issues, this paper proposes a method of multi-label visual emotion recognition that fuses fore-background features. The contributions of this approach are highlighted in three key aspects as follows:
  • A fore-background-aware emotion recognition model (FB-ER) is proposed. The background in which a person is placed and the foreground, such as interactions between different individuals, provide beneficial visual cues for emotion recognition, as shown in Figure 1. FB-ER is a three-branch multi-feature hybrid fusion network. It introduces a core region unit (CR-Unit) to effectively extract body features, represents background features as background keywords that help the model understand emotion expressions in specific contexts through semantic data, and captures depth map information to better model interactions between different individuals as foreground features. These three types of features are fused at both the feature level and the decision level.
  • A multi-label emotion recognition classifier model (ML-ERC) is introduced, which utilizes graph convolutional networks (GCNs) to capture label correlations. The nodes in the graph are represented by word-embedding vectors associated with different emotion labels. The edge weights are determined by considering both the label co-occurrence probability matrix, as shown in Figure 2, and the cosine similarity matrix. Information is propagated between nodes by the GCN, which enables the classifier to learn the inter-relational information between emotional labels.
  • An end-to-end network architecture that combines FB-ER and ML-ERC was designed for the multi-label recognition of 26 distinct emotions. This architecture demonstrates generalizability across various environments and emotion categories.
The rest of this paper is organized as follows: Section 2 describes the current related work of visual emotion recognition and multi-label emotion classification. Section 3 gives a description of the proposed FB-ER and ML-ERC. Section 4 introduces the experimental designation and verifies the performance of the proposed method. Finally, Section 5 concludes this paper.

2. Related Work

2.1. Visual Emotion Recognition

Significant progress has been made in the study of facial emotions in recent years. These methods have achieved superior performance by developing effective facial feature extraction networks. Siqueira et al. [8] and Karani et al. [10] optimized their CNN models for computational efficiency, which resulted in models with both high computational efficiency and accuracy. Liao et al. [9] enhanced emotion recognition by incorporating facial optical flow information, while Arabian et al. [17] constructed a facial key point grid and utilized GCN for emotion classification. Kim et al. [18] adopted the designs of VGG, Inception-v1, ResNet, and Xception to propose the CVGG-19 architecture for classifying driver emotions, which offers the advantages of high accuracy and low cost. Oh et al. [19] investigated the impact of noise in image data on classification accuracy and proposed a deep learning model to enhance robustness against noise. Khan et al. [20] proposed a novel temporal shifting approach for a frame-wise transformer-based model using multi-head self/cross-attention (MSCA) to reduce the computational cost of emotion recognition. However, these methods focus solely on the facial region and do not consider other visual information beyond the face.
Several studies [11,21,22,23,24] explored the use of visual information beyond facial features. For instance, Mou et al. [21] employed a dual-stream network, where one branch extracted body features and the other extracted environmental features using a k-NN classifier to independently identify arousal and valence from these feature sets. However, due to the simplicity of the feature extraction network, it could not fully capture the necessary features. Additionally, the fusion of the two branches’ outputs occurred at the decision level, without accounting for the influence of environmental features on emotion recognition. Building on this work, Kosti et al. [11], Lee et al. [22], and Zhang et al. [23] concatenated different visual features for emotion classification. Nonetheless, the recognition accuracy varied greatly across different emotions, particularly those that are more difficult to classify. Ilyes et al. [24] proposed a multi-label focal loss function to address the issue of imbalanced emotion categories, which led to improved emotion recognition performance. However, the networks used in these studies exhibited limited feature extraction capabilities, underutilized background features, and relied on the simple concatenation of features without considering their complementarity. Furthermore, foreground information, such as social interactions between individuals, which can provide valuable insights into behavioral characteristics and enhance emotion recognition accuracy, remains underexplored in the current literature.

2.2. Multi-Label Emotion Classification

Multi-label classification methods can be categorized into three strategies: first order, second order, and high order [12]. The first-order strategy decomposes multi-label classification into multiple independent binary classification tasks while neglecting correlations between labels. The second-order strategy considers pairwise label relationships and distinguishes between related and unrelated labels. However, in real-world scenarios, label dependencies often exceed the limitations of second-order correlations. The high-order strategy addresses the associations between multiple labels and the influence of each label on all others, thus providing a more robust framework for modeling label correlations. Traditional high-order methods include ML-KNN [13], ML-RBF [14], BPMLL [25], and MLC-ARM [26]. ML-KNN [13] generates predictions by considering the relevance of local samples, but it fails to capture the global correlations across all labels. ML-RBF [14] employs a kernel function in feature space to account for label dependencies, while BPMLL [25] introduces a ranking loss function to minimize the dissimilarity between labels. However, as data correlations exceed the second-order level, the effectiveness of both ML-RBF and BPMLL becomes constrained. MLC-ARM [26] leverages association rules mined from sample correlations to perform multi-label classification, but its efficiency is hampered by the high computational costs involved.
In recent years, several approaches adapted modified Transformer architectures to learn correlations between different labels. For instance, methods like Q2L [15], ML-Decoder [16], and SAML [27] utilize enhanced Transformer models and the self-attention mechanism to capture inter-label correlations. However, these methods are better suited for capturing correlations in long sequence data and may not perform optimally when modeling label correlations in image data. Additionally, they demand significant computational resources and storage capacity.
In the field of object detection, several studies [28,29,30,31] constructed simple graph structures to represent the relationships between different objects within a single image, where an edge with a value of 0 denotes that two labels are unrelated, and an edge with a value of 1 indicates that they are related. Graph convolutional networks (GCNs) are then employed to learn local correlations across different images. However, these methods fail to account for the varying strengths of label correlations and overlook the global relationships between labels, and the application of GCNs for modeling correlations has seldom been explored in the domain of emotion recognition.

3. Approach

3.1. Fore-Background-Aware Emotion Recognition (FB-ER)

3.1.1. Relationship between Fore-Background and Emotions

Table 1 presents three different scenarios (derived from the Emotic dataset [11] and Google images): First, facial and postural cues are entirely obscured, which makes it impossible to use these features for emotion recognition. In such cases, the background context becomes crucial for inferring emotions. For example, in Scenario 1, individuals are shown sunbathing; although their faces are not visible, elements like clear water and a bright sky suggest a happy mood. Second, as illustrated in Scenario 2, faces and postures are partially covered. In these instances, emotions are often classified as neutral, which implies neither happiness nor sadness. However, by considering the background, alternative interpretations may arise. For instance, in a wedding setting, more fitting emotions could include affection, esteem, or happiness. Third, when facial expressions are fully visible, as shown in Scenario 3, the same expression can convey different emotions depending on the background context. For example, both expressions in Scenario 3 were initially classified as pain. Yet, when the background is considered, with one image set in a hospital and the other in a sports stadium, the latter might more accurately reflect emotions such as disquiet, engagement, or excitement.
Next, the relationships between the foreground and emotions are examined. In this context, the foreground refers to the interactions between individuals within the image. These interactions can be divided into two scenarios, as depicted in Table 2 (using images from the Emotic dataset [11] and Google): First, when individuals share a common identity or are familiar with one another, their emotional tendencies often converge, as illustrated in Scenario 1. Second, when individuals possess different identities or are unfamiliar with each other, their emotional tendencies may diverge, as shown in Scenario 2. This suggests that emotions can spread rapidly through social interactions, and an individual’s emotional state is significantly influenced by their interactions with others.

3.1.2. Emotion Recognition Fusing Fore-Background Features

FB-ER consists of three sub-network branches, as depicted in Figure 3. The first branch is dedicated to the human body, where it extracts essential cues, such as facial expressions and body postures. To achieve this, ResNet18 [32] is utilized as the backbone network for feature extraction. Transfer learning is employed to fine-tune the pre-trained model developed by Krizhevsky et al. [33]. Following the extraction of the body features, a CR-Unit is introduced to emphasize regions that contribute the most to emotion recognition. The structure of the CR-Unit is presented in Figure 4, with its mathematical formulation detailed in Equations (1)–(3):
$f(F_{body}, num_1) = \mathrm{BatchNorm}(\mathrm{Linear}_1(F_{body}, num_1)) \qquad (1)$
$Core\_weights = \mathrm{Adaptive}(\sigma(\mathrm{Linear}_2(num_1, num_2))) \qquad (2)$
$F_{core} = \mathrm{diag}(\mathrm{dot}(F_{body}, Core\_weights)) \qquad (3)$
In Equation (1), $F_{body}$ represents the body features, which are mapped to length $num_1$ through $\mathrm{Linear}_1$, followed by $\mathrm{BatchNorm}$. In Equation (2), the features are mapped from length $num_1$ to length $num_2$ by $\mathrm{Linear}_2$, and then $Core\_weights$, of the same length as $F_{body}$, is obtained using the $\mathrm{Sigmoid}$ function $\sigma$ and the $\mathrm{Adaptive}$ operation. In Equation (3), $\mathrm{dot}$ denotes the dot product operation and $\mathrm{diag}$ denotes the extraction of the diagonal elements, ultimately yielding $F_{core}$. $F_{core}$ focuses on the regions of $F_{body}$ that are beneficial for emotion classification, thus effectively enhancing the model’s ability to recognize the important parts of $F_{body}$.
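To make the CR-Unit concrete, a minimal PyTorch sketch of Equations (1)–(3) is given below. It is an illustrative reading rather than the authors’ implementation: the hidden sizes num1 and num2 are treated as hyperparameters, AdaptiveAvgPool1d stands in for the Adaptive resizing step, and diag(dot(·,·)) is realized as an element-wise re-weighting of the body features.

```python
import torch
import torch.nn as nn

class CRUnit(nn.Module):
    """Sketch of the core region unit (Equations (1)-(3)): learns weights that
    emphasize the regions of the body feature vector useful for emotion recognition."""
    def __init__(self, body_dim: int, num1: int, num2: int):
        super().__init__()
        self.linear1 = nn.Linear(body_dim, num1)          # Equation (1): Linear1
        self.bn = nn.BatchNorm1d(num1)                    # Equation (1): BatchNorm
        self.linear2 = nn.Linear(num1, num2)              # Equation (2): Linear2
        self.adaptive = nn.AdaptiveAvgPool1d(body_dim)    # Equation (2): resize weights to body_dim

    def forward(self, f_body: torch.Tensor) -> torch.Tensor:
        # f_body: (batch, body_dim)
        h = self.bn(self.linear1(f_body))                        # Equation (1)
        w = torch.sigmoid(self.linear2(h))                       # Equation (2): sigma
        core_weights = self.adaptive(w.unsqueeze(1)).squeeze(1)  # (batch, body_dim)
        return f_body * core_weights                             # Equation (3): diag(dot(., .))

# usage sketch
f_core = CRUnit(body_dim=512, num1=256, num2=128)(torch.randn(8, 512))
```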
The second sub-network branch was designed for background feature extraction, which provides useful information for understanding the emotions of individuals in the image. The background can be considered as a set of keywords associated with objects and ongoing activities within a scene. To evaluate the scene comprehension, the Places365 model [34] was utilized. The results presented in Table 3 exhibit two different scenes: outdoors and indoors. For the outdoor scene, the recognition keywords include “outdoor”, “cliff”, “climbing”, and “sunny”. For the indoor scene, the keywords include “indoor”, “office”, “working”, and “closed area”. ResNet18 was also employed for extracting background features. Transfer learning techniques were used to fine-tune the pre-trained weights of the Places365 model in this study.
The third sub-network branch was designed for foreground feature extraction, which primarily includes the interactions between different individuals. These interactions are valuable for emotion evaluation. This study adopted a method based on depth map extraction to simulate the interaction and proximity between different individuals. The steps are organized as follows:
Step 1—Data preprocessing: all images in the dataset are normalized to RGB three channels.
Step 2—Depth map extraction: MegaDepth [35] is a powerful tool for depth map extraction that was designed to generate depth information from single-view images using advanced stereo matching techniques. Its pre-trained models exhibit a high accuracy and reliability, which enables precise depth map extraction across diverse scenes, lighting conditions, and viewpoints. Therefore, we utilized the MegaDepth pre-trained model to obtain depth maps from the original images.
Step 3—Color rendering: color rendering is applied to the acquired depth map to enhance the visual effect, as illustrated in Figure 5.
Step 4—Depth matrix calculation: The depth matrix information D of the depth map is calculated using Equation (4) presented below:
$D = \begin{bmatrix} D_{1,1} & D_{1,2} & \cdots & D_{1,N} \\ D_{2,1} & D_{2,2} & \cdots & D_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ D_{M,1} & D_{M,2} & \cdots & D_{M,N} \end{bmatrix} \qquad (4)$
where $D$ is a matrix of dimensions $M \times N$, and $D_{i,j}$ denotes the depth value at row $i$ and column $j$ of the depth map. The third sub-network branch comprises three convolutional layers, three pooling layers, and three fully connected layers, which collectively extract distinctive features from the depth map.
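The following PyTorch sketch illustrates one possible layout of this branch (three convolutional layers, three pooling layers, and three fully connected layers). The channel widths, the 224 × 224 single-channel input, and the output dimension are illustrative assumptions, not the authors’ exact configuration.

```python
import torch
import torch.nn as nn

class DepthBranch(nn.Module):
    """Sketch of the foreground branch: 3 conv + 3 pooling + 3 fully connected layers
    operating on a single-channel depth map and producing the foreground feature F_fore."""
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 28 * 28, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, out_dim),
        )

    def forward(self, depth_map: torch.Tensor) -> torch.Tensor:
        # depth_map: (batch, 1, 224, 224) -> F_fore: (batch, out_dim)
        return self.classifier(self.features(depth_map))

f_fore = DepthBranch()(torch.randn(4, 1, 224, 224))
```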
By extracting the aforementioned features and performing shape adjustment, the following features are obtained: the core feature $F_{core}$, background feature $F_{back}$, foreground feature $F_{fore}$, and fusion feature $F_{mix}$:
$F_{core} \in \mathbb{R}^{batchsize \times d_1}, \quad F_{back} \in \mathbb{R}^{batchsize \times d_2}, \quad F_{fore} \in \mathbb{R}^{batchsize \times d_3} \qquad (5)$
$F_{mix} = F_{core} \,||\, F_{back} \,||\, F_{fore}, \quad F_{mix} \in \mathbb{R}^{batchsize \times (d_1 + d_2 + d_3)} \qquad (6)$
where $d_i$ represents the vector length and $||$ denotes the concatenation operation. Subsequently, $F_{core}$, $F_{back}$, and $F_{mix}$ are each combined with the multi-label object classifier to perform emotion recognition for the 26 different types of emotions. The optimal weights of the three emotion recognition results are obtained through training, with different weights assigned to represent the relative importance of $F_{core}$, $F_{back}$, and $F_{mix}$.
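The sketch below illustrates the two fusion levels, under the assumption that each branch’s emotion scores are computed separately and that the branch weights are kept positive and normalized with a softmax (the paper only states that the weights are trainable and sum to 1).

```python
import torch
import torch.nn as nn

def feature_level_fusion(f_core, f_back, f_fore):
    """Equation (6): F_mix = F_core || F_back || F_fore (concatenation along the feature axis)."""
    return torch.cat([f_core, f_back, f_fore], dim=1)

class DecisionLevelFusion(nn.Module):
    """Trainable weighted sum of the per-branch emotion scores (the eta coefficients)."""
    def __init__(self):
        super().__init__()
        self.eta = nn.Parameter(torch.full((3,), 1.0 / 3.0))  # initialized to equal importance

    def forward(self, score_core, score_back, score_mix):
        eta = torch.softmax(self.eta, dim=0)  # keep the weights positive and summing to 1
        return eta[0] * score_core + eta[1] * score_back + eta[2] * score_mix
```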

3.2. Multi-Label Emotion Recognition Classifier (ML-ERC)

3.2.1. The Design of the Label Co-Occurrence Probability Matrix

$G = (N, E)$ represents a global label correlation graph structure, where $N$ denotes all nodes and $E$ denotes all edges within the graph, as illustrated in Equations (7) and (8):
$N = \{n_1, n_2, n_3, \ldots, n_m\}, \quad n_k\big|_{k \in [1, m]} = R_k^{\omega}, \quad N = R^{m \times \omega} = \{R_1^{\omega}, R_2^{\omega}, R_3^{\omega}, \ldots, R_m^{\omega}\} \qquad (7)$
$E = \{e_{ij} \mid (i, j) \in \Omega\}, \quad e_{ij}\big|_{(i, j) \in \Omega} = \theta_{ij}\big|_{(i, j) \in \Omega} \qquad (8)$
In Equation (7), each graph node, denoted by $n_k$, corresponds to an emotion type, where $k \in [1, m]$ and $m$ represents the number of emotion categories. Each node is represented by a word-embedding vector $R_k^{\omega}$ of length $\omega$ obtained using Word2Vec and GloVe. Consequently, all the nodes in the graph structure can be represented as $R^{m \times \omega}$. In Equation (8), $e_{ij}$ represents the edge between node $i$ and node $j$, $\Omega$ represents the set of all edges in the graph structure, and $\theta_{ij}$ denotes the weight of the edge between node $i$ and node $j$. A larger value of $\theta_{ij}$ indicates a stronger association between nodes $i$ and $j$.
In this study, the correlation between different emotions was determined by analyzing the co-occurrence patterns of labels. A represents the label co-occurrence probability matrix, which can be constructed as follows:
Step 1: Compute the co-occurrence count $C_{i,j}$ ($i, j \in [1, m]$) of each pair of emotion labels $i$ and $j$ from the dataset statistics.
Step 2: Compute the conditional probability $P(i \mid j)$ for each pair of emotion labels according to Equation (9):
$P(i \mid j) = \dfrac{P(i \cap j)}{P(j)} = \dfrac{C_{i,j}}{C_j}, \quad i, j \in [1, m] \qquad (9)$
where $C_j$ represents the frequency of occurrence of emotion label $j$, and $C_{i,j}$ represents the frequency of co-occurrence of emotion labels $i$ and $j$.
Step 3: Construct the conditional probability matrix P:
$P = \begin{bmatrix} 1 & P(2 \mid 1) & \cdots & P(m \mid 1) \\ P(1 \mid 2) & 1 & \cdots & P(m \mid 2) \\ \vdots & \vdots & \ddots & \vdots \\ P(1 \mid m) & P(2 \mid m) & \cdots & 1 \end{bmatrix} \qquad (10)$
Step 4: According to Equation (11), a judgment is made as to whether there is a correlation between different emotion labels. $\tau$ represents the empirical threshold indicating the degree of correlation between $i$ and $j$; values lower than $\tau$ indicate a weak correlation and are ignored. After extensive experimental verification, $\tau$ was set to 0.2.
$\hat{P}_{ij} = \begin{cases} 0, & \text{if } P(j \mid i) < \tau \\ 1, & \text{if } P(j \mid i) \geq \tau \end{cases} \qquad (11)$
Step 5: The normalization operation in Equation (12) is used to balance the connection weights between different nodes. The parameter $\lambda$ was set to $1 \times 10^{-6}$ to prevent the denominator from becoming zero.
$\tilde{P}_{ij} = \begin{cases} \dfrac{\hat{P}_{ij}}{\lambda + \sum_{i=1}^{m} \hat{P}_{ij}}, & i \neq j \\ 1, & i = j \end{cases} \qquad (12)$
Step 6: To ensure more balanced and stable feature updates in the GCN and improve the model’s performance, $\tilde{P}_{ij}$ is further normalized using Equations (13)–(15), following the normalized Laplacian matrix principle, where $D$ is the degree matrix and $A$ is the resulting label co-occurrence probability matrix.
$D_{ij} = \begin{cases} \sum_{j=1}^{m} \tilde{P}_{ij}, & i = j \\ 0, & i \neq j \end{cases} \qquad (13)$
$\tilde{D}_{ij} = D_{ij}^{-\frac{1}{2}} \qquad (14)$
$A_{ij} = \begin{cases} (\tilde{D} \tilde{P} \tilde{D})_{ij}, & i \neq j \\ 1, & i = j \end{cases} \qquad (15)$
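A NumPy sketch of Steps 1–6 is given below for clarity. It assumes the training labels are available as a binary multi-hot matrix; the conditioning direction and the diagonal handling follow the reconstruction of Equations (9)–(15) above and should be read as an interpretation, not the authors’ exact code.

```python
import numpy as np

def build_cooccurrence_matrix(labels: np.ndarray, tau: float = 0.2, lam: float = 1e-6) -> np.ndarray:
    """labels: binary multi-hot matrix of shape (num_samples, m).
    Returns the normalized label co-occurrence probability matrix A of shape (m, m)."""
    C = labels.T @ labels                                        # Step 1: C[i, j] = co-occurrence count
    counts = np.diag(C).astype(float)                            # C[i, i] = occurrences of label i
    P = C / np.maximum(counts[:, None], 1.0)                     # Steps 2-3: row i holds P(j | i)
    P_hat = (P >= tau).astype(float)                             # Step 4: discard weak correlations
    P_tilde = P_hat / (lam + P_hat.sum(axis=0, keepdims=True))   # Step 5: balance connection weights
    np.fill_diagonal(P_tilde, 1.0)
    deg = P_tilde.sum(axis=1)                                    # Step 6: degree matrix D (diagonal)
    D_inv_sqrt = np.diag(deg ** -0.5)                            # D^(-1/2), normalized-Laplacian style
    A = D_inv_sqrt @ P_tilde @ D_inv_sqrt
    np.fill_diagonal(A, 1.0)
    return A
```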

3.2.2. The Design of the Cosine Similarity Matrix

Continuous convolution in GCNs reduces the similarity of node representations in the original feature space, as illustrated in Figure 6a,b. Figure 6a shows the Isomap [36] dimensionality reduction plot of the initial node semantic features $R^{m \times \omega}$, where semantically similar emotions are closely positioned, such as “Excitement, Happiness, Pleasure”, “Anger, Annoyance, Aversion”, and “Suffering, Pain”. Figure 6b shows the Isomap dimensionality reduction plot of the node semantic features $R^{m \times \omega}$ after two layers of convolution, where the similarity of the node semantic features is disrupted, with semantically similar emotions positioned far apart and dissimilar emotions positioned close together.
To prevent consecutive convolution operations from disrupting the similarity of node features, a cosine similarity matrix C is introduced to ensure that the semantic similarity of nodes remains unchanged after each convolution. The construction is organized as follows:
Step 1: Obtain the initial cosine similarity matrix $C_{ij}$ according to Equation (16):
$C_{ij} = \dfrac{R_i^{\omega} \cdot R_j^{\omega}}{\|R_i^{\omega}\| \, \|R_j^{\omega}\|} \qquad (16)$
where $\|\cdot\|$ denotes the $L_2$ norm.
Step 2: To ensure that the similarity values between different emotion categories are on the same scale, eliminate dimensional discrepancies, and reduce the impact of extreme values, $C_{ij}$ is normalized by the Z-score to obtain the cosine similarity matrix $C$. Figure 7b illustrates the resulting $C$ using $R^{m \times \omega}|_{GloVe}$ as an example.
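The construction of C can be sketched in a few lines of NumPy, assuming R is the (m × ω) matrix of label word embeddings and that the Z-score is taken over all matrix entries (the paper does not specify whether the normalization is global or per row).

```python
import numpy as np

def build_cosine_similarity_matrix(R: np.ndarray) -> np.ndarray:
    """R: label word-embedding matrix of shape (m, omega). Returns the Z-score
    normalized cosine similarity matrix C (Equation (16) followed by Step 2)."""
    norms = np.linalg.norm(R, axis=1, keepdims=True)   # L2 norm of each embedding
    C = (R @ R.T) / (norms @ norms.T)                  # Equation (16): pairwise cosine similarity
    return (C - C.mean()) / (C.std() + 1e-12)          # Step 2: Z-score normalization
```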

3.2.3. Label Correlation Emotion Classifier

Under the dual constraints of co-occurrence probability matrix A and cosine similarity matrix C, the GCN aggregates each label node to update its representation, thus effectively capturing inter-label association information. This approach results in an object classifier capable of learning the correlations between emotion labels.
The input to the GCN includes the word-embedding vectors of all nodes $R^{m \times \omega}$, the label co-occurrence probability matrix $A$, and the label cosine similarity matrix $C$. By stacking multiple GCN layers, the network is able to learn and model the complex relationships between nodes, as described in Equation (17):
$R^{m \times \omega}_{(l+1)} = \sigma\!\left(\dfrac{A R^{m \times \omega}_{(l)} W_a^{(l)} + C R^{m \times \omega}_{(l)} W_c^{(l)}}{2}\right) \qquad (17)$
where $l$ denotes the $l$-th layer of the GCN, and $W_a^{(l)}$ and $W_c^{(l)}$ are the weight matrices to be learned. Each layer takes the output of the previous layer as input and updates it according to Equation (17).
The final output of each GCN node is a classifier $h_i$ for the corresponding emotion label, as shown in Equation (18):
$H = \{h_i\}_{i=1}^{m} = R^{m \times \omega}_{(L)}, \quad h_i = R^{\omega}_{i,(L)} \qquad (18)$
The final emotion prediction score is denoted as $\hat{y}$, as shown in Equation (19):
$\hat{y} = \eta_1 F_{core} H^{T} + \eta_2 F_{back} H^{T} + \eta_3 F_{mix} H^{T}, \quad \eta_1 + \eta_2 + \eta_3 = 1 \qquad (19)$
where $F_{core}$, $F_{back}$, and $F_{mix}$ denote the core body features, background features, and fused features mentioned earlier, respectively, and $\eta_i$ are trainable coefficients that indicate the relative importance of $F_{core}$, $F_{back}$, and $F_{mix}$. Initially, the values were set to $\eta_1 = 0.333$, $\eta_2 = 0.333$, and $\eta_3 = 0.333$. After several epochs of training, the optimal predictive performance of the model was achieved when $\eta_1 = 0.3$, $\eta_2 = 0.2$, and $\eta_3 = 0.5$. $H^{T}$ denotes the transpose of $H$. Both feature-level fusion and decision-level fusion are thus employed for the final emotion identification.
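A two-layer PyTorch sketch of ML-ERC is shown below. It follows Equation (17) for the layer update and Equation (19) for the final score, under the assumption that the three feature streams have already been projected to the classifier dimension feat_dim so that each can be multiplied by H^T; the ReLU nonlinearity standing in for σ is also an assumption.

```python
import torch
import torch.nn as nn

class LabelCorrelationGCN(nn.Module):
    """Two-layer sketch of Equation (17): each layer averages a graph convolution
    constrained by A (co-occurrence) with one constrained by C (cosine similarity)."""
    def __init__(self, omega: int, hidden: int, feat_dim: int):
        super().__init__()
        self.wa1 = nn.Linear(omega, hidden, bias=False)
        self.wc1 = nn.Linear(omega, hidden, bias=False)
        self.wa2 = nn.Linear(hidden, feat_dim, bias=False)
        self.wc2 = nn.Linear(hidden, feat_dim, bias=False)

    def forward(self, R, A, C):
        # R: (m, omega) label word embeddings; A, C: (m, m) correlation matrices
        h = torch.relu((A @ self.wa1(R) + C @ self.wc1(R)) / 2)   # layer 1, Equation (17)
        H = (A @ self.wa2(h) + C @ self.wc2(h)) / 2               # layer 2 -> classifiers h_i
        return H                                                   # (m, feat_dim)

def predict(f_core, f_back, f_mix, H, eta):
    """Equation (19): decision-level combination of the three feature streams with weights eta."""
    return eta[0] * (f_core @ H.T) + eta[1] * (f_back @ H.T) + eta[2] * (f_mix @ H.T)
```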

4. Experimental Section

4.1. Dataset

The Emotic dataset [11] was employed in the experiments due to its comprehensive annotations across 26 different emotion categories. The images in this dataset feature complex backgrounds that encompass diverse environments, times, locations, camera viewpoints, and lighting conditions. Evaluating on this dataset therefore demonstrates the generalizability of the proposed method across different environments and emotion categories. These inherent characteristics give the dataset remarkable diversity, which makes the experimental task significantly more challenging.
The dataset was divided into four subsets: Ade20k, Emodb-small, Framesdb, and Mscoco. The sample distribution within the dataset was imbalanced, with varying quantities for each type of emotion. Positive emotions were more prevalent, while negative emotions were less represented. During the experiments, the dataset was partitioned into training, validation, and test sets, as shown in Table 4.

4.2. Loss Function and Evaluation Metrics

Due to the imbalanced sample distribution in the Emotic dataset, we utilized the multi-label focal loss [24] as the loss function in this study, which is expressed as Equation (20):
$MFL_{\alpha, \gamma}(y, \hat{y}) = -\sum_{i=1}^{26} \left[ \alpha (1 - \hat{y}_i)^{\gamma} y_i \log \hat{y}_i + (1 - \alpha) \hat{y}_i^{\gamma} (1 - y_i) \log(1 - \hat{y}_i) \right] \qquad (20)$
where $\hat{y}_i$ represents the predicted label of the $i$-th category, $y_i$ represents the true label of the $i$-th category, and $\alpha$ and $\gamma$ are two hyperparameters. The balance factor $\alpha$ balances the overall contributions of the positive and negative samples, while the focusing parameter $\gamma$ adjusts the weights of the easy and hard samples. Through multiple experiments, the optimal recognition performance of the model was achieved with $\alpha = 0.5$ and $\gamma = 0.3$.
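A minimal PyTorch sketch of Equation (20) is given below, assuming y_hat contains per-category probabilities (after a sigmoid), that the loss is summed over the 26 categories and averaged over the batch, and that a small constant is added inside the logarithms for numerical stability.

```python
import torch

def multi_label_focal_loss(y_hat: torch.Tensor, y: torch.Tensor,
                           alpha: float = 0.5, gamma: float = 0.3) -> torch.Tensor:
    """Multi-label focal loss sketch (Equation (20)). y_hat, y: (batch, 26)."""
    eps = 1e-8                                                                  # numerical stability (assumption)
    pos = alpha * (1 - y_hat) ** gamma * y * torch.log(y_hat + eps)             # positive-sample term
    neg = (1 - alpha) * y_hat ** gamma * (1 - y) * torch.log(1 - y_hat + eps)   # negative-sample term
    return -(pos + neg).sum(dim=1).mean()
```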
The evaluation metrics were the average precision (AP), mean average precision (mAP), Jaccard coefficient (JC), total parameters (Params), and floating point operations (FLOPs). mAP and JC are defined as follows in Equations (21) and (22):
$mAP = \dfrac{\sum_{i=1}^{m} AP_i}{m} \qquad (21)$
$JC = \dfrac{|y \cap \hat{y}|}{|y \cup \hat{y}|} \qquad (22)$
The AP approximates the area under the precision–recall curve and assesses the model’s average precision at all recall levels. In Equation (21), $i$ indexes the emotion categories and $m$ is the total number of emotion categories; the mAP is a global metric that evaluates the average performance of the model across all categories. In Equation (22), $y$ represents the true labels, $\hat{y}$ represents the predicted results, and the JC measures the degree of overlap between the predicted results and the true labels. The JC value ranges between 0 and 1, with a higher value indicating better performance.
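The two metrics can be sketched as follows, assuming multi-hot ground-truth labels, continuous prediction scores for the mAP, and binarized predictions for the JC; the per-sample averaging of the JC is an assumption.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Equation (21): mean of the per-category average precisions."""
    aps = [average_precision_score(y_true[:, i], y_score[:, i]) for i in range(y_true.shape[1])]
    return float(np.mean(aps))

def jaccard_coefficient(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Equation (22): overlap between predicted and true label sets, averaged over samples."""
    inter = np.logical_and(y_true, y_pred).sum(axis=1)
    union = np.logical_or(y_true, y_pred).sum(axis=1)
    return float(np.mean(inter / np.maximum(union, 1)))
```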

4.3. Comparative Experiments

The method proposed in this paper was compared with several existing approaches, namely, ML-KNN [13], ML-RBF [14], Kosti et al. [11], Lee et al. [22], Zhang et al. [23], ML-GCN [28], ESRs [8], Ilyes et al. [24], Q2L [15], LGLM [29], RCL-Net [9], ML-Decoder [16], SAML [27], FLNet [30], BHARAT [10], and LFPLM [17]. The emotion precisions of these methods are shown in Table A1 in Appendix A. The comparison results for the mAP and JC are shown in Table 5 and Figure 8. The results indicate that the method proposed in this paper achieved the best performance in terms of both the mAP and the JC, exceeding the state-of-the-art by 0.732% in the mAP and 0.007 in the JC, with no significant increase in the total number of model parameters or FLOPs.

4.4. Ablation Experiments

4.4.1. Ablation Experiments of FB-ER and ML-ERC

To validate the effectiveness of FB-ER and ML-ERC, three ablation experiments were conducted. In the first experiment, only FB-ER was retained, without the inclusion of ML-ERC. In the second experiment, only ML-ERC was used and FB-ER was omitted. In the third experiment, both FB-ER and ML-ERC were utilized. The results of these experiments are presented in Figure 9. The findings demonstrate that both components of the proposed method contributed significantly to its overall effectiveness.

4.4.2. Ablation Experiments of FB-ER

Four experiments were conducted using different combinations of branches while keeping all other experimental parameters constant. In the first experiment, only the body information was utilized. The second experiment combined the body information with the background information. The third experiment integrated the body information with the foreground information, and the fourth experiment fused the body information with both the foreground and background information. The resulting average precision is depicted in Figure 10, while the mAP and JC values are shown in Table 6. It can be concluded that the mAP improved by approximately 8.9% and the JC increased by 0.126 when the fore-background information was incorporated.
The ablation experiments with different fusion strategies were conducted to evaluate their effectiveness. The first experiment utilized feature-level fusion, the second employed decision-level fusion, and the third implemented mixed-level fusion. The results for the mAP and JC are presented in Table 6. It can be concluded that compared with feature-level fusion and decision-level fusion, mixed-level fusion improved the mAPs by approximately 2.3% and 4.9%, respectively, and the JCs by approximately 0.034 and 0.066, respectively.
To test the performance of the CR-Unit applied to body features, two ablation experiments were conducted. In the first experiment, the CR-Unit was not incorporated, whereas in the second experiment, the CR-Unit was incorporated. The results for the mAP and JC are presented in Table 6. It can be seen that incorporating the CR-Unit led to an improvement of 1.8% in the mAP and 0.03 in the JC.

4.4.3. Ablation Experiments of ML-ERC

Matrices A and C were applied in the multi-label emotion recognition classifier (ML-ERC) as introduced in Section 3.2.1 and Section 3.2.2. To evaluate their effects, four ablation experiments were conducted. In the first experiment, a traditional threshold decision classifier was used. In the second experiment, only the label co-occurrence probability matrix A was used. In the third experiment, only the cosine similarity matrix C was used. In the fourth experiment, a combination of the label co-occurrence probability matrix A and the cosine similarity matrix C was used. The category precision results are shown in Figure 11, and the mAP and JC results are presented in Table 7. The results demonstrate that the utilization of ML-ERC led to an mAP improvement of 3.5% and an increase in the JC by 0.082.
This study investigated the impacts of four distinct word-embedding vectors (GloVe [37], Word2Vec [38], FastText [39], and ELMo [40]) on the performance of the model. The experimental results are presented in Table 7. Based on these results, it can be observed that the mAP and the JC obtained using the four different word-embedding vectors were approximately 35.9% and 0.450, respectively.

4.5. Parameter Experiments

To explore the impact of different thresholds $\tau$ on model performance, a series of experiments was conducted; the results are shown in Table 8. The highest mAP and JC were observed when $\tau = 0.2$. When $\tau = 0.1$, the mAP and JC did not reach their optimal values, possibly because the threshold was set too low, which introduced weakly correlated data and caused interference. Additionally, when $\tau \geq 0.5$, both the mAP and JC decreased, possibly because the threshold was set too high, which excluded some relevant information.

4.6. Experimental Visualization Results

The visual results of the ablation experiments for different branch combinations are presented in Figure 12. The visualization in Figure 12 demonstrates that the FB-ER model significantly outperformed the baseline in emotion recognition. It accurately predicted a substantial number of emotional categories, with only a minimal number of incorrect predictions and unpredicted categories.
Figure 13 shows the visual results of the ablation experiments on the ML-ERC model. Based on the visualization in Figure 13, it could be concluded that the application of the ML-ERC model led to an increase in correct predictions, while the number of incorrect predictions and unpredicted emotional categories decreased.

5. Conclusions

This paper proposes a method of multi-label visual emotion recognition that fuses fore-background features to address two issues: multi-label visual emotion recognition often overlooks the influence of fore-background features and fails to exploit the correlations between different emotion labels. First, the fore-background-aware emotion recognition model (FB-ER) is proposed to extract human, background, and foreground features; it then generates combined features using the CR-Unit and hybrid-level fusion. Second, the multi-label emotion recognition classifier (ML-ERC) is proposed, which constructs an emotion label graph using the co-occurrence probability matrix and cosine similarity matrix to represent the edges and word-embedding vectors as the nodes; a graph convolutional network is employed to learn the correlations between emotions and obtain a classifier containing the label correlations. Finally, the visual features are combined with the classifier to achieve multi-label recognition of 26 emotions. To validate this method, detailed ablation and comparative experiments were conducted. The results show that, compared with the state-of-the-art methods, the mAP increased by 0.732% and the JC increased by 0.007.
However, there were still some problems that need to be improved in further studies:
(1) The proposed method exhibited relatively low precision in recognizing difficult-to-identify emotions, such as embarrassment, confusion, and sensitivity.
(2) The use of a static approach to construct the correlation matrix reduced the model’s generalization ability. Future work will consider employing dynamic methods to calculate emotional correlations.

Author Contributions

Conceptualization, R.W. and Y.F.; methodology, R.W.; software, Y.F.; validation, Y.F. and R.W.; formal analysis, R.W.; investigation, Y.F.; resources, Y.F.; data curation, Y.F.; writing—original draft preparation, Y.F.; writing—review and editing, Y.F. and R.W.; visualization, Y.F.; supervision, R.W.; project administration, R.W.; funding acquisition, R.W. All authors have read and agreed to the published version of the manuscript.

Funding

Graduate Innovation Funding Project of Hebei University of Economics and Business (XJCX202408). National Natural Science Foundation of China (62103009); Key Research and Development Program of Hebei Province (17216108); Natural Science Foundation of Hebei Province (F2018207038); Higher Education Teaching Reform Research and Practice Project of Hebei Province (2022GJJG178); Scientific Research Project of Education Department of Hebei Province (QN2020186); Key Research Project of Hebei University of Economics and Trade (ZD20230001).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

In the comparative experiments of Section 4.3, the proposed method was compared with 16 other methods. Table A1 shows the emotion precisions of some of these methods.
Table A1. Category precisions of different models. Values are the average precision (%) of each method.

| Emotions | Kosti [11] | Lee [22] | Zhang [23] | Ilyes [24] | ESRs [8] | RCL-Net [9] | BHARAT [10] | LFPLM [17] | Ours |
|---|---|---|---|---|---|---|---|---|---|
| Affection | 27.851 | 18.960 | 46.895 | 31.926 | 27.426 | 27.887 | 32.940 | 30.098 | 34.056 |
| Anger | 9.489 | 6.251 | 10.876 | 13.942 | 8.973 | 16.827 | 14.318 | 12.449 | 17.688 |
| Annoyance | 14.062 | 8.713 | 11.271 | 17.424 | 14.897 | 10.115 | 18.386 | 17.344 | 22.384 |
| Anticipation | 58.641 | 50.019 | 62.648 | 57.735 | 59.022 | 60.102 | 94.231 | 94.848 | 95.201 |
| Aversion | 7.477 | 4.975 | 5.935 | 8.192 | 7.305 | 3.236 | 12.913 | 15.767 | 19.033 |
| Confidence | 78.352 | 58.476 | 72.497 | 75.295 | 76.188 | 60.213 | 77.074 | 68.449 | 76.165 |
| Disapproval | 14.971 | 7.408 | 11.283 | 14.884 | 14.969 | 9.988 | 14.908 | 17.008 | 19.566 |
| Disconnection | 21.320 | 19.200 | 26.916 | 28.323 | 21.698 | 20.064 | 30.400 | 32.162 | 40.001 |
| Disquiet | 16.886 | 13.841 | 16.942 | 19.724 | 18.552 | 10.005 | 18.640 | 20.285 | 22.863 |
| Doubt/Confusion | 29.627 | 14.802 | 18.684 | 23.115 | 29.264 | 20.012 | 24.765 | 18.835 | 21.362 |
| Embarrassment | 3.182 | 3.445 | 2.010 | 2.847 | 5.616 | 5.553 | 9.827 | 10.769 | 11.255 |
| Engagement | 87.531 | 77.633 | 88.562 | 85.838 | 87.611 | 80.125 | 99.510 | 95.060 | 97.264 |
| Esteem | 17.730 | 13.066 | 13.338 | 16.725 | 17.829 | 17.893 | 26.357 | 28.092 | 35.431 |
| Excitement | 77.157 | 58.581 | 71.891 | 70.436 | 79.252 | 62.316 | 76.246 | 66.464 | 79.018 |
| Fatigue | 9.700 | 6.432 | 13.263 | 14.434 | 9.783 | 9.942 | 12.089 | 13.114 | 13.463 |
| Fear | 14.144 | 4.732 | 5.687 | 8.277 | 14.248 | 3.883 | 14.098 | 11.934 | 13.527 |
| Happiness | 58.261 | 56.927 | 73.265 | 76.626 | 59.335 | 56.833 | 73.658 | 75.718 | 72.648 |
| Pain | 8.942 | 7.960 | 3.527 | 9.385 | 8.666 | 8.001 | 14.887 | 12.365 | 19.556 |
| Peace | 21.588 | 18.071 | 32.852 | 24.318 | 21.968 | 22.013 | 30.914 | 26.157 | 32.432 |
| Pleasure | 45.462 | 35.300 | 57.466 | 46.893 | 46.380 | 39.976 | 50.756 | 41.873 | 52.363 |
| Sadness | 19.661 | 9.588 | 10.388 | 23.946 | 19.524 | 13.273 | 22.279 | 13.355 | 28.674 |
| Sensitivity | 9.280 | 4.414 | 4.976 | 6.286 | 9.331 | 3.439 | 9.344 | 9.609 | 9.721 |
| Suffering | 18.839 | 8.372 | 4.477 | 26.245 | 19.527 | 10.018 | 21.585 | 12.628 | 18.568 |
| Surprise | 18.810 | 10.265 | 9.024 | 10.110 | 18.224 | 10.258 | 19.050 | 17.025 | 26.387 |
| Sympathy | 14.711 | 10.781 | 17.536 | 13.984 | 13.422 | 10.523 | 30.965 | 30.423 | 34.255 |
| Yearning | 8.343 | 7.046 | 10.559 | 9.717 | 8.652 | 7.100 | 22.157 | 19.486 | 22.524 |
| mAP (%) | 27.384 | 20.587 | 27.030 | 28.332 | 27.602 | 23.061 | 33.549 | 31.205 | 35.977 |

References

  1. Long, T.D.; Tung, T.T.; Dung, T.T. A facial expression recognition model using lightweight dense-connectivity neural networks for monitoring online learning activities. Int. J. Mod. Educ. Comput. Sci. (IJMECS) 2022, 14, 53–64. [Google Scholar] [CrossRef]
  2. Xian, N.Z.; Ying, Y.; Yong, B. Application of human-computer interaction system based on machine learning algorithm in artistic visual communication. Soft Comput. 2023, 27, 10199–10211. [Google Scholar]
  3. Feng, L.J.; Guang, L.; Yan, Z.J.; Dun, L.L.; Bing, C.C.; Fei, Y.H. Research on fatigue driving monitoring model and key technologies based on multi-input deep learning. J. Phys. Conf. Ser. 2020, 1648, 022112. [Google Scholar]
  4. Jordan, S.; Brimbal, L.; Wallace, B.D.; Kassin, S.M.; Hartwig, M.; Chris, N.H. A test of the micro-expressions training tool: Does it improve lie detection? J. Investig. Psychol. Offender Profiling 2019, 16, 222–235. [Google Scholar]
  5. Yacine, Y. An efficient facial expression recognition system with appearance-based fused descriptors. Intell. Syst. Appl. 2023, 17, 200166. [Google Scholar] [CrossRef]
  6. Aviezer, H.; Trope, Y.; Todorov, A. Body cues, not facial expressions, discriminate between intense positive and negative emotions. Science 2012, 338, 1225–1229. [Google Scholar]
  7. Martinez, A.M. Context may reveal how you feel. Proc. Natl. Acad. Sci. USA 2019, 116, 7169–7171. [Google Scholar] [CrossRef]
  8. Siqueira, H.; Magg, S.; Wermter, S. Efficient Facial Feature Learning with Wide Ensemble-Based Convolutional Neural Networks. Proc. AAAI Conf. Artif. Intell. 2020, 34, 5800–5809. [Google Scholar]
  9. Jun, L.; Chang, L.Y.; Yun, M.T.; Ying, H.S.; Fang, L.X.; Tian, H.G. Facial expression recognition methods in the wild based on fusion feature of attention mechanism and LBP. Sensors 2023, 23, 4204. [Google Scholar] [CrossRef]
  10. Karani, R.; Jani, J.; Desai, S. FER-BHARAT: A lightweight deep learning network for efficient unimodal facial emotion recognition in Indian context. Discov. Artif. Intell. 2024, 4, 35. [Google Scholar] [CrossRef]
  11. Kosti, R.; Alvarez, J.M.; Recasens, A.; Lapedriza, A. Context based emotion recognition using EMOTIC dataset. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2755–2766. [Google Scholar]
  12. Ling, Z.M.; Hua, Z.Z. A Review on Multi-Label Learning Algorithms. IEEE Trans. Knowl. Data Eng. 2014, 26, 1819–1837. [Google Scholar]
  13. Ling, Z.M.; Hua, Z.Z. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognit. 2007, 40, 2038–2048. [Google Scholar]
  14. Ling, Z.M. Ml-rbf: RBF Neural Networks for Multi-Label Learning. Neural Process. Lett. 2009, 29, 61–74. [Google Scholar]
  15. Liu, S.; Zhang, L.; Yang, X.; Su, H.; Zhu, J. Query2Label: A Simple Transformer Way to Multi-Label Classification. arXiv 2021, arXiv:2107.10834. [Google Scholar]
  16. Ridnik, T.; Sharir, G.; Cohen, A.B.; Baruch, B.E.; Noy, A. ML-Decoder: Scalable and Versatile Classification Head. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 32–41. [Google Scholar]
  17. Arabian, H.; Alshirbaji, A.T.; Chase, G.J.; Moeller, K. Emotion Recognition beyond Pixels: Leveraging Facial Point Landmark Meshes. Appl. Sci. 2024, 14, 3358. [Google Scholar] [CrossRef]
  18. Kim, J.H.; Poulose, A.; Han, D.S. CVGG-19: Customized Visual Geometry Group Deep Learning Architecture for Facial Emotion Recognition. IEEE Access 2024, 12, 41557–41578. [Google Scholar]
  19. Oh, S.; Kim, D.-K. Noise-Robust Deep Learning Model for Emotion Classification using Facial Expressions. IEEE Access 2024. [Google Scholar] [CrossRef]
  20. Khan, M.; Saddik, A.E.; Deriche, M.; Gueaieb, W. STT-Net: Simplified Temporal Transformer for Emotion Recognition. IEEE Access 2024, 12, 86220–86231. [Google Scholar]
  21. Xuan, M.W.; Celiktutan, O.; Gunes, H. Group-level arousal and valence recognition in static images: Face, body and context. In Proceedings of the 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Ljubljana, Slovenia, 4–8 May 2015; pp. 1–6. [Google Scholar]
  22. Lee, J.; Kim, S.; Park, J.; Sohn, K. Contextaware emotion recognition networks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 10142–10151. [Google Scholar]
  23. Hui, Z.M.; Meng, L.Y.; Dong, M.H. Context-Aware Affective Graph Reasoning for Emotion Recognition. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 151–156. [Google Scholar]
  24. Ilyes, B.; Frederic, V.; Denis, H.; Fadi, D. Multi-label, multi-task CNN approach for context-based emotion recognition. Inf. Fusion 2020, 76, 422–428. [Google Scholar]
  25. Ling, Z.M.; Hua, Z.Z. Multi-label neural networks with applications to functional genomics and text categorization. IEEE Trans. Knowl. Data Eng. 2006, 18, 1338–1351. [Google Scholar]
  26. Yu, L.J.; Yi, J.X. Multi-Label Classification Algorithm Based on Association Rule Mining. J. Softw. 2017, 28, 2865–2878. [Google Scholar]
  27. Kun, W.P.; Wu, L. Image-Based Self-attentive Multi-label Weather Classification Network. In Proceedings of the International Conference on Image, Vision and Intelligent Systems 2022 (ICIVIS 2022), Jinan, China, 15–17 August 2022; Springer: Singapore, 2023; Volume 1019, pp. 497–504. [Google Scholar]
  28. Min, C.Z.; Shen, W.X.; Peng, W.; Wen, G.Y. Multi-Label Image Recognition with Graph Convolutional Networks. CVPR 2019, 5172–5181. [Google Scholar] [CrossRef]
  29. Zhao, X.Y.; Tao, W.Y.; Yu, L.; Ke, Z. Label graph learning for multi-label image recognition with cross-modal fusion. Multimed. Tools Appl. 2022, 81, 25363–25381. [Google Scholar]
  30. Di, S.D.; Lei, M.L.; Lian, D.Z.; Bin, L. An Attention-Driven Multi-label Image Classification with Semantic Embedding and Graph Convolutional Networks. Cogn. Comput. 2023, 15, 1308–1319. [Google Scholar]
  31. Tao, W.Y.; Zhao, X.Y.; Sheng, F.L.; Xing, H.G. Stmg: Swin transformer for multi-label image recognition with graph convolution network. Neural Comput. Appl. 2022, 34, 10051–10063. [Google Scholar]
  32. Ming, H.K.; Yu, Z.X.; Qing, R.S.; Jian, S. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  33. Krizhevsky, A.; Sutskever, I.; Hinton, E.G. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar]
  34. Bolei, Z.; Agata, L.; Antonio, T.; Antonio, T.; Aude, O. Places: An image database for deep scene understanding. J. Vis. 2017, 17, 296. [Google Scholar]
  35. Qi, L.Z.; Snavely, N. MegaDepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2041–2050. [Google Scholar]
  36. Tenenbaum, J.B.; Silva, V.; Langford, J. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 2000, 290, 2319–2323. [Google Scholar] [CrossRef]
  37. Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
  38. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
  39. Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching Word Vectors with Subword Information. Trans. Assoc. Comput. Linguist. 2016, 5, 135–146. [Google Scholar]
  40. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations. arXiv 2018, arXiv:1802.05365. [Google Scholar]
Figure 1. Fore-background features diagram.
Figure 2. Directed graph between the first five emotions (threshold: 0.2). Dashed lines represent weak correlations; solid lines represent strong correlations.
Figure 3. Overall architecture diagram.
Figure 4. CR-Unit diagram.
Figure 5. Examples of depth maps.
Figure 6. (a) Graph of dimension reduction using initial node features with Isomap. (b) Graph of dimension reduction using Isomap with node features after two layers of GCN convolution.
Figure 7. (a) Co-occurrence probability matrix of labels. (b) Cosine similarity matrix.
Figure 8. JCs of different methods [8,9,10,11,13,14,15,16,17,22,23,24,27,28,29,30].
Figure 9. Ablation results of FB-ER and ML-ERC.
Figure 10. APs of different branch combinations.
Figure 11. Ablation results of AP on ML-ERC.
Figure 12. Visual display of different branch ablation experiments.
Figure 13. Visual display of ML-ERC ablation experiment.
Table 1. Example table of correlations between background and emotions.

| Index | Description | Approximate Proportion in the Dataset | Original Figure | Emotion (without Background) | Background Heat Map | Background Semantics | Emotion (with Background Information) |
|---|---|---|---|---|---|---|---|
| No. 1 | Completely lacking facial information. | 16.7% | Applsci 14 08564 i001 | - | Applsci 14 08564 i002 | Beach, water park, sunny, boating, swimming | [“Anticipation”, “Engagement”, “Happiness”, “Pleasure”] |
| | | | Applsci 14 08564 i003 | - | Applsci 14 08564 i004 | Physics, chemistry lab, working, competing, stressful | [“Fatigue”, “Suffering”] |
| | | | Applsci 14 08564 i005 | - | Applsci 14 08564 i006 | Ski, mountain, snowy, cold, sunny, far-away horizon | [“Engagement”, “Excitement”] |
| No. 2 | Partial facial information. | 45% | Applsci 14 08564 i007 | Neutral | Applsci 14 08564 i008 | Ballroom, legislative, beauty salon, socializing, congregating | [“Affection”, “Esteem”, “Happiness”] |
| | | | Applsci 14 08564 i009 | Neutral | Applsci 14 08564 i010 | Bedroom, enclosed area, cloth, warm, dry | [“Peace”, “Happiness”] |
| No. 3 | When facial information is complete, the same facial expression can be observed in different backgrounds. | 38.3% | Applsci 14 08564 i011 | Pain | Applsci 14 08564 i012 | Hospital, operating, working, stressful, medical activity | [“Pain”, “Sadness”, “Suffering”] |
| | | | Applsci 14 08564 i013 | Pain | Applsci 14 08564 i014 | Martial gym, competing, sports, exercise, congregating | [“Disquiet”, “Engagement”, “Excitement”] |
Table 2. Example table of correlations between foreground and emotions.

| Index | Description | Approximate Proportion in the Dataset | Original Figure | Depth Map | Emotions of Character A | Emotions of Character B |
|---|---|---|---|---|---|---|
| No. 1 | Interaction occurs when individuals share a common identity or mutual familiarity. | 10.4% | Applsci 14 08564 i015 | Applsci 14 08564 i016 | [“Affection”, “Happiness”] | [“Affection”, “Happiness”] |
| | | | Applsci 14 08564 i017 | Applsci 14 08564 i018 | [“Confidence”, “Engagement”] | [“Confidence”, “Engagement”] |
| No. 2 | Having different identities or being unfamiliar with one another. | 19.6% | Applsci 14 08564 i019 | Applsci 14 08564 i020 | [“Engagement”] | [“Anticipation”, “Esteem”] |
| | | | Applsci 14 08564 i021 | Applsci 14 08564 i022 | [“Anticipation”, “Confidence”, “Excitement”] | [“Peace”, “Engagement”] |
Table 3. Background semantic recognition.

| Original Figure | Heat Map | Recognition Results |
|---|---|---|
| Applsci 14 08564 i023 | Applsci 14 08564 i024 | Outdoors, cliff, natural, sunny, climbing, rugged scene, far away horizon |
| Applsci 14 08564 i025 | Applsci 14 08564 i026 | Indoors, office, working, studying, enclosed area, no horizon |
Table 4. Dataset statistics.

| Data Distribution | | Quantity |
|---|---|---|
| Sub-dataset distribution | Ade20k | 432 |
| | Emodb-small | 1374 |
| | Framesdb | 4869 |
| | Mscoco | 16,510 |
| Sample distribution | Affection | 1063 |
| | Anger | 209 |
| | Annoyance | 368 |
| | Anticipation | 5335 |
| | Aversion | 168 |
| | Confidence | 4059 |
| | Disapproval | 659 |
| | Disconnection | 325 |
| | Disquietment | 1462 |
| | Doubt/Confusion | 479 |
| | Embarrassment | 152 |
| | Engagement | 12,814 |
| | Esteem | 851 |
| | Excitement | 4394 |
| | Fatigue | 538 |
| | Fear | 177 |
| | Happiness | 5630 |
| | Pain | 188 |
| | Peace | 1691 |
| | Pleasure | 2103 |
| | Sadness | 405 |
| | Sensitivity | 360 |
| | Suffering | 260 |
| | Surprise | 417 |
| | Sympathy | 718 |
| | Yearning | 668 |
| Data partition | Training | 23,265 |
| | Validation | 3314 |
| | Test | 7202 |
Table 5. The mAP and JC of different methods. The bold parts represent the optimal values of the model.

| Experiment | mAP (%) | JC | Params (M) | FLOPs (G) |
|---|---|---|---|---|
| ML-KNN [13] (2007) | 32.531 | 0.400 | 11.704 | 10.240 |
| ML-RBF [14] (2011) | 23.665 | 0.399 | 11.926 | 10.470 |
| Kosti et al. [11] (2019) | 27.384 | 0.349 | 44.820 | 7.803 |
| Lee et al. [22] (2019) | 20.587 | 0.269 | 16.010 | 9.207 |
| Zhang et al. [23] (2019) | 27.030 | 0.353 | 30.582 | 9.553 |
| ML-GCN [28] (2019) | 35.245 | 0.372 | 46.960 | 8.410 |
| ESRs [8] (2020) | 27.602 | 0.355 | 20.330 | 9.436 |
| Ilyes et al. [24] (2020) | 28.332 | 0.360 | 39.400 | 23.400 |
| Q2L [15] (2021) | 34.243 | 0.416 | 18.013 | 8.310 |
| LGLM [29] (2022) | 29.494 | 0.275 | 20.956 | 8.414 |
| RCL-Net [9] (2023) | 23.061 | 0.295 | 54.881 | 9.410 |
| ML-Decoder [16] (2023) | 35.204 | 0.386 | 35.024 | 8.511 |
| SAML [27] (2023) | 34.490 | 0.443 | 33.614 | 6.660 |
| FLNet [30] (2023) | 29.026 | 0.365 | 35.750 | 7.440 |
| BHARAT [10] (2024) | 33.549 | 0.418 | 37.769 | 3.216 |
| LFPLM [17] (2024) | 31.205 | 0.402 | 12.810 | 4.326 |
| Ours | 35.977 | 0.450 | 33.500 | 5.500 |
Table 6. Ablation results of mAP and JC on FB-ER model. The bold parts represent the optimal values of the model.

| Ablation Types | Experiment | mAP (%) | JC |
|---|---|---|---|
| Different combinations of branches | Body | 27.056 | 0.324 |
| | Body + background | 32.238 | 0.370 |
| | Body + foreground | 30.712 | 0.362 |
| | Body + fore-background | 35.977 | 0.450 |
| Different fusion strategies | Feature-level fusion | 33.577 | 0.416 |
| | Decision-level fusion | 31.066 | 0.364 |
| | Mixed-level fusion | 35.977 | 0.450 |
| With or without CR-Unit | Without CR-Unit | 34.150 | 0.420 |
| | With CR-Unit | 35.977 | 0.450 |
Table 7. Ablation results of mAP and JC on ML-ERC model. The bold parts represent the optimal values of the model.

| Ablation Types | Experiment | mAP (%) | JC |
|---|---|---|---|
| Different combinations of matrices | Traditional threshold decision classifier | 32.517 | 0.378 |
| | A | 35.367 | 0.425 |
| | C | 33.136 | 0.386 |
| | A + C | 35.977 | 0.450 |
| Different word-embedding vectors | GloVe | 35.976 | 0.450 |
| | Word2Vec | 35.957 | 0.450 |
| | FastText | 35.977 | 0.447 |
| | ELMo | 35.975 | 0.449 |
Table 8. The impact of different τ values on the mAP. The bold parts represent the optimal values of the model.

| τ | A | mAP (%) | JC Graph | JC |
|---|---|---|---|---|
| τ = 0.1 | Applsci 14 08564 i027 | 33.16 | Applsci 14 08564 i028 | 0.422 |
| τ = 0.2 | Applsci 14 08564 i029 | 35.977 | Applsci 14 08564 i030 | 0.450 |
| τ = 0.3 | Applsci 14 08564 i031 | 35.96 | Applsci 14 08564 i032 | 0.413 |
| τ = 0.4 | Applsci 14 08564 i033 | 33.85 | Applsci 14 08564 i034 | 0.415 |
| τ = 0.5 | Applsci 14 08564 i035 | 32.31 | Applsci 14 08564 i036 | 0.401 |
| τ = 0.6 | Applsci 14 08564 i037 | 32.39 | Applsci 14 08564 i038 | 0.384 |
| τ = 0.7, 0.8, 0.9 | Applsci 14 08564 i039 | 31.98 | Applsci 14 08564 i040 | 0.384 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
