Article

CATM: A Multi-Feature-Based Cross-Scale Attentional Convolutional EEG Emotion Recognition Model

Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Sensors 2024, 24(15), 4837; https://doi.org/10.3390/s24154837
Submission received: 24 June 2024 / Revised: 21 July 2024 / Accepted: 23 July 2024 / Published: 25 July 2024
(This article belongs to the Section Biomedical Sensors)

Abstract
Existing EEG emotion recognition methods often fail to make full use of the information contained in the time, frequency, and spatial domains of the EEG signal, which limits the accuracy of EEG emotion classification. To address this problem, this paper proposes a multi-feature, multi-frequency-band cross-scale attention convolutional model (CATM). The model is mainly composed of a cross-scale attention module, a frequency–space attention module, a feature transition module, a temporal feature extraction module, and a depth classification module. First, the cross-scale attention convolution module extracts spatial features at different scales from the preprocessed EEG signals; then, the frequency–space attention module assigns higher weights to important channels and spatial locations; next, the temporal feature extraction module extracts temporal features of the EEG signals; and, finally, the depth classification module categorizes the EEG signals into emotions. We evaluated the proposed method on the DEAP dataset, obtaining accuracies of 99.70% and 99.74% in the valence and arousal binary classification experiments, respectively; the accuracy in the valence–arousal four-class classification experiment was 97.27%. In addition, considering applications with fewer channels, we also conducted 5-channel experiments, in which the binary classification accuracies for valence and arousal were 97.96% and 98.11%, respectively, and the valence–arousal four-class classification accuracy was 92.86%. The experimental results show that the proposed method outperforms other recent methods and also achieves strong results in few-channel experiments.

1. Introduction

Emotion recognition has been applied in various fields, including mental health, medical diagnosis, intelligent driving, and distance education [1,2,3]. Early methods primarily relied on non-physiological signals such as voice intonation [4], facial expressions [5], and body movements [6]. However, these approaches are limited by subjectivity and susceptibility to manipulation. To address these limitations, methods based on physiological electrical signals have been developed. These signals include EEG, EMG [7], eye tracking [8], and ECG [9]. Among these, EEG provides rich information across time, frequency, and spatial domains, allowing for a more objective reflection of emotional states while minimizing the influence of external artifacts. Consequently, EEG is currently the preferred physiological signal for emotion recognition.
Some previous studies on EEG emotion recognition have not fully utilized the information in the time, frequency, and spatial domains, limiting emotional classification effectiveness. For instance, Han et al. [10] processed the original EEG signals using techniques like de-baselining and sliding-window slicing, employing convolution and LSTM for feature extraction, and achieved 96.28% accuracy in binary classification on the DEAP dataset. However, this approach relies on the original EEG signals and fails to deeply mine entropy and energy features, focusing only on temporal information while neglecting spatial data. Lin et al. [11] utilized one-dimensional convolution and graph convolution to classify EEG emotions, achieving an average accuracy of 91% on the DEAP dataset. This method captures emotional features within channels and considers spatial relationships, yet it overlooks the temporal information in EEG signals, presenting significant limitations. Chen et al. [12] extracted spatial connectivity information from EEG channels, introduced a domain adaptation method, and constructed four connectivity matrices based on time series, achieving 95.15% accuracy in binary valence classification on the DEAP dataset. However, this method does not account for the positional information between channels or the frequency domain and nonlinearity features in EEG signals, indicating room for improvement. Han et al. [13] proposed a multi-scale emotion recognition method (MS-ERM) that spatially mapped EEG signals and extracted temporal information using TimesNet. Validated on the DEAP dataset, it achieved average accuracies of 91.31% for arousal and 90.45% for valence classification. However, this method focuses solely on differential entropy features, neglecting energy, nonlinearity, and other important aspects, which limits classification effectiveness. Wu et al. [14] developed a 1D-DenseNet model to extract frequency band energy and sample entropy from EEG signals, combining them into a one-dimensional input. This approach achieved an average binary classification accuracy of 93.51% on the DEAP dataset. While it effectively utilizes frequency domain, energy, and entropy features, it overlooks the temporal and spatial information present in the EEG signals. Singh et al. [15] introduced a hybrid model combining one-dimensional convolutional neural networks (1DCNNs) and bidirectional long short-term memory (Bi-LSTM), achieving binary and four-class classification accuracies of 91.31% and 88.19%, respectively, on the DEAP dataset. This method also uses one-dimensional vectors as inputs, but it ignores spatial information.
Most EEG-based emotion recognition studies have utilized full electrode arrays covering the entire scalp. However, with advancements in technology, portable, miniaturized EEG devices with fewer electrodes are becoming more common [16]. Studies using fewer channels have not consistently achieved better classification results, indicating significant room for improvement [17]. For example, Mert et al. [18] applied empirical modal decomposition (EMD) and multivariate expansion (MEMD) to process EEG signals, achieving dichotomous classification accuracies of 75% for arousal and 72.87% for valence using 18 channels from the DEAP dataset. This method does not account for the temporal continuity or spatial information of EEG signals and lacks channel selection, failing to fully leverage the available data. Bazgir et al. [19] employed Discrete Wavelet Transform (DWT) to decompose EEG signals into four frequency bands and extract spectral features. Using data from 10 channels in the DEAP dataset, they achieved binary classification accuracies of 91.3% for arousal and 91.1% for valence. However, this approach also did not extract the temporal and spatial features from the EEG signals.
To address these issues, we propose a multi-feature, multi-band spatio-temporal fusion model based on CATM. CATM performs feature extraction through cross-scale convolution, applies weight assignment using an attention mechanism, and ultimately classifies with a deep classifier. We conducted extensive binary classification, four-class classification, and experiments with fewer channels on the DEAP [20] public dataset, comparing our results with other models. The experimental findings demonstrate that our model exhibits excellent accuracy performance. The main contributions of this paper are as follows:
  • In this study, we extracted four features at different frequencies: differential entropy, power spectral density, nonlinear energy, and fractal dimension. We spatially mapped these features according to the positions of the electrode channels, resulting in four-dimensional spatial features with enhanced discriminative and characterization capabilities. Finally, we performed feature fusion of multiple spatial features to leverage the advantages of each feature type.
  • We propose a new spatio-temporal feature attention network to address the limitations of existing EEG emotion recognition methods. The network comprises a cross-scale attention convolution module, a transition module, a Bi-LSTM module, and a classifier. This architecture effectively enhances feature extraction capabilities and fully leverages the spatio-temporal features in EEG signals.
  • To enable the proposed network model to fully utilize the information in the four-dimensional structure, we designed a frequency–space attention mechanism. This mechanism comprehensively considers the weights between convolutional channels and the positional relationships between electrode channels. It is embedded within the cross-scale convolutional module, allowing for adaptive weight assignment for both frequency and electrode channel positions in the EEG signal.
  • Extensive experiments on the DEAP dataset demonstrate that the model exhibits strong emotion recognition accuracy and robustness in binary, four-class, and few-channel scenarios, validating the effectiveness of the proposed method.

2. Related Work

This section reviews the current research status of EEG emotion recognition, focusing on multi-feature approaches, machine learning, deep learning, attention mechanisms, and channel selection techniques.

2.1. Multi-Feature-Based EEG Emotion Recognition

EEG signals reflect the electrophysiological activity of brain neurons in the cerebral cortex or on the scalp surface. They offer advantages such as non-invasiveness, high temporal resolution, practicality, and cost-effectiveness, making them widely used in emotion recognition and motor imagery [21]. Extracting representative features from raw EEG data is crucial for subsequent classification, recognition, and analysis tasks. Common feature extraction methods include time domain analysis, frequency domain analysis, time–frequency domain analysis, multivariate statistical analysis, and nonlinear dynamic analysis [22,23,24]. Researchers typically categorize EEG signals into five frequency bands: delta, theta, alpha, beta, and gamma. Delta waves (0–3 Hz) are slow and associated with deep sleep, primarily observed in infants; they are not standard components in the EEG data of awake adolescents and adults, so this study focuses on four bands [25]. Theta waves (3–8 Hz) occur during relaxation or sleep. Alpha waves (8–14 Hz) are linked to visual processing, cognitive load, and memory activities. Beta waves (14–31 Hz) are prominent during conscious states such as calculation, reading, and thinking. Gamma waves (31–45 Hz) are associated with higher brain functions and are important for learning and memory. Singh et al. [26] explored single-dimensional and multidimensional EEG signal processing and feature extraction techniques across time, frequency, decomposition, time–frequency, and spatial domains. Yuvaraj et al. [27] extracted statistical features, fractal features, Hjorth parameters, higher-order spectral features, and wavelet coefficients from the DEAP dataset, classifying them with a shallow classifier, achieving valence and arousal accuracies of 78.18% and 79.90%, respectively. Liu et al. [28] proposed a dynamic differential entropy feature extraction algorithm that combines differential entropy (DE) with empirical modal decomposition (EMD). The resulting dynamic differential entropy features were classified using a convolutional neural network, yielding promising results. Çelebi et al. [29] utilized empirical wavelet transform (EWT) to decompose signals and extract frequency, linear, and nonlinear features from EEG data. They constructed a three-dimensional image and employed a deep learning framework with the DEAP dataset, achieving classification accuracies of 90.57% for valence and 90.59% for arousal in an across-subjects experiment. These studies indicate that extracting diverse EEG features and employing appropriate feature fusion methods can enhance emotion recognition compared to using single features [30].

2.2. Machine Learning-Based EEG Emotion Recognition

In early machine learning-based EEG emotion recognition, manual feature extraction was common, with these features applied to shallow classifiers. Currently, the mainstream approach involves automatic feature extraction through various algorithms, including time-domain, frequency-domain, and time–frequency domain analyses. Extracted features are then applied to machine learning algorithms like Support Vector Machines (SVMs), k-Nearest Neighbor (KNN), and Naive Bayes (NB), achieving notable classification results [31]. Nawaz et al. [32] compared power, entropy, fractal metrics, statistical features, and wavelet features, using a feature selection algorithm, Principal Component Analysis (PCA), to attain valence and arousal accuracies of 77.62% and 78.96%, respectively. Bhardwaj et al. [33] classified seven different emotions using EEG signals; after preprocessing with filtering and independent component analysis (ICA), they extracted energy and power spectral density (PSD) features, achieving average classification accuracies of 74.13% with SVMs and 66.50% with Linear Discriminant Analysis (LDA). Gupta et al. [34] proposed a flexible analytic wavelet transform (FAWT) that decomposes EEG signals into different sub-band signals for feature extraction and smoothing. They classified the data using SVMs and Random Forests (RFs) on the SEED dataset, achieving an accuracy of 83.3%. Asghar et al. [35] introduced a deep neural network (DNN) method for EEG emotion recognition, extracting raw features from 2D spectrograms of each channel and employing dimensionality reduction with a pre-trained AlexNet model. They used SVM and k-NN classifiers on the SEED and DEAP datasets, achieving accuracies of 93.8% and 77.4%, respectively. These studies indicate that machine learning has made significant progress in emotion recognition based on EEG signals.

2.3. Deep Learning-Based EEG Emotion Recognition

With the advancement of deep learning, significant achievements have been made in various fields, including image processing and natural language processing. Recently, deep learning has also been widely applied to EEG emotion recognition. Li et al. [36] developed a convolutional recurrent neural network that combines a convolutional neural network with a recurrent neural network (LSTM), achieving binary classification accuracies of 72.06% for valence and 74.12% for arousal on the DEAP dataset. However, while the recurrent neural network captures temporal features, it does not effectively utilize spatial information. Chakravarthi et al. [37] introduced an automated CNN-LSTM model with ResNet-152, achieving an impressive emotion recognition accuracy of 98% for human behavior and post-traumatic stress disorder (PTSD). Yang et al. [38] addressed the baseline signal's impact on classification by removing it and combining channel position and frequency domain features to represent EEG signals as a three-dimensional structure, preserving spatial information. In their binary classification experiments on the DEAP dataset, they achieved accuracies of 90.24% for valence and 89.45% for arousal.
Liu et al. [39] proposed the three-dimensional convolutional attentional neural network (3DCANN), which fuses spatio-temporal features with dual attention learning weights and uses a softmax classifier for emotion classification, achieving 97.35% accuracy on the SEED dataset. An et al. [40] constructed a spatio-temporal convolutional attention network, BiTCAN, which creates a two-dimensional mapping matrix of EEG signals based on electrode positions. This model extracts salient brain features using a bi-hemispheric disparity module and captures spatio-temporal features through a three-dimensional convolutional module. Extensive validation on the DEAP and SEED datasets resulted in accuracies exceeding 97% on both. Comparing these studies, it is evident that deep learning methods outperform traditional machine learning approaches in EEG-based emotion recognition, demonstrating superior feature extraction capabilities from EEG signals.

2.4. Attentional Mechanisms in EEG Emotion Recognition

The attention mechanism, rooted in the study of the human visual system, is a vital cognitive function. Researchers have adapted this concept for deep learning, integrating attention with neural networks. This integration allows neural networks to effectively receive target information, filter out irrelevant data, and allocate resources rationally. The core idea of the attention mechanism is to assign dynamic attentional weights that adjust during the learning process, enhancing the performance of the network by weighting raw data appropriately. Xiao et al. [41] introduced a four-dimensional attentional neural network that employs spectral and spatial attention mechanisms to assign weights adaptively across different brain regions and frequency bands, achieving 96.1% accuracy on the SEED dataset. Similarly, Jia et al. [42] developed a spatial–spectral–temporal attention 3D dense network, which attained accuracies of 96.02% and 84.92% on the SEED and SEED-IV datasets, respectively. The attention mechanism is crucial for optimizing deep learning models, allowing neural networks to focus on significant information during data processing, ultimately enhancing model performance.

2.5. Channel Selection in EEG Emotion Recognition

Selecting EEG channels that are highly relevant to emotions can significantly reduce the number of channels needed, making EEG signal collection more practical in daily life. This has become a key area of research. For instance, Zhang et al. [43] employed a ReliefF-based channel selection method on the DEAP dataset, achieving an optimal classification accuracy of 59.13% with 19 channels. Özerdem et al. [44] utilized Discrete Wavelet Transform (DWT) for feature extraction, implementing dynamic channel selection to identify five channels most relevant to emotions, resulting in a highest binary classification accuracy of 77.14%. Additionally, Topic et al. [45] applied ReliefF and Neighborhood Component Analysis (NCA) for channel selection, creating a holographic feature map. They achieved a maximum binary classification accuracy of 88.58% using data from just 10 channels in the DEAP dataset. These studies highlight the importance of effective channel selection in enhancing emotion recognition accuracy.

3. Materials and Methods

Figure 1 depicts the general framework and flow of the proposed method in this study. The EEG-based emotion classification method is divided into three parts: the first part is preprocessing and feature extraction, the second part is feature mapping and feature fusion, and the third part is feature extraction and classification using CATM. CATM mainly consists of five parts: the cross-scale attention module (CSAM), frequency–space attention module (FSAM), feature transition module (FTM), temporal feature extraction module (Bi_LSTM), and deep classification module (DCM). Table 1 shows the acronyms, full names, and functions of each CATM module. The next sections describe the model in detail and evaluate the model.

3.1. Feature Extraction

The EEG signal is characterized by a low signal-to-noise ratio, a significant presence of low-frequency components, and a waveform that displays changing nonlinear characteristics. Building on these traits, this study aims to synthesize frequency domain, time domain, and nonlinear features from EEG data to enhance emotion recognition accuracy. First, for each subject, the raw EEG data are divided into N non-overlapping, equal-length segments, with labels assigned to each segment. Each segment is then decomposed into four frequency bands, theta (3–8 Hz), alpha (8–14 Hz), beta (14–31 Hz), and gamma (31–45 Hz), using a Butterworth filter. Subsequently, data from each band are extracted using four specific features: differential entropy (DE), power spectral density (PSD), nonlinear energy (NE), and fractal dimension (FD). The feature extraction and mapping process is illustrated in Figure 2.
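For illustration, a minimal Python sketch of this band decomposition step is given below; the filter order and the zero-phase filtering choice are assumptions rather than the exact settings used in this work.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 128  # DEAP sampling rate after downsampling (Hz)
BANDS = {"theta": (3, 8), "alpha": (8, 14), "beta": (14, 31), "gamma": (31, 45)}

def decompose_bands(segment: np.ndarray, order: int = 4) -> dict:
    """Split one EEG segment (n_channels x n_samples) into the four bands
    used in this study with Butterworth band-pass filters (sketch only)."""
    out = {}
    for name, (lo, hi) in BANDS.items():
        b, a = butter(order, [lo / (FS / 2), hi / (FS / 2)], btype="band")
        out[name] = filtfilt(b, a, segment, axis=-1)  # zero-phase filtering
    return out
```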
Differential entropy (DE) features are widely used by researchers for emotion recognition in EEG signals and have been shown to be the most stable features [46]. The DE of the EEG signal is an extension of the Shannon entropy on continuous variables, which allows us to distinguish the low- and high-frequency energies of the EEG signal and is calculated as follows:
$DE = -\int_a^b \frac{1}{\sqrt{2\pi\sigma_i^2}} e^{-\frac{(x-\mu)^2}{2\sigma_i^2}} \log\left( \frac{1}{\sqrt{2\pi\sigma_i^2}} e^{-\frac{(x-\mu)^2}{2\sigma_i^2}} \right) dx = \frac{1}{2} \log\left( 2\pi e \sigma_i^2 \right)$ (1)
where $e$ is Euler's number and $\sigma_i$ is the standard deviation of $x$. The DE is calculated over the interval $[a, b]$ for an EEG segment that approximately obeys the Gaussian distribution $N(\mu, \sigma_i^2)$; for such a segment, the DE equals the logarithm of the energy spectrum in the corresponding frequency band.
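Under this Gaussian assumption, the DE of a band-limited window reduces to a function of its variance, which makes the feature simple to compute; a minimal sketch:

```python
import numpy as np

def differential_entropy(x: np.ndarray) -> float:
    """DE of a band-limited EEG window under the Gaussian assumption (Eq. (1)):
    0.5 * log(2 * pi * e * variance). Sketch, not the authors' exact code."""
    return 0.5 * np.log(2 * np.pi * np.e * np.var(x))
```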
Power spectral density (PSD) features have been shown to be effective in EEG emotion recognition in previous studies [47]; in this study, the PSD describes the average power of a segment of an EEG signal. For an EEG segment of length $M$ denoted as $x(t)$, with $t$ taking values $0 \sim M-1$, the PSD is given by:
$P(\omega_k) = \sum_{t=-(M-1)}^{M-1} \gamma(t) e^{-j \omega_k t}$ (2)
where $P(\omega_k)$ is the power spectral density, $\omega_k$ is the angular frequency, $t$ is the time index, and $\gamma(t)$ is the autocorrelation function of $x(t)$, calculated as follows:
$\gamma(t) = E\left[ x(n) x^*(n+t) \right]$ (3)
where $E[\cdot]$ denotes the expectation operator and $x^*$ is the complex conjugate of $x$.
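One simple way to turn Equation (2) into a scalar feature per window is to average the periodogram, which is the discrete Fourier transform of the biased autocorrelation; the sketch below uses this route and is an assumption about the estimator, not necessarily the one used by the authors.

```python
import numpy as np

def psd_feature(x: np.ndarray) -> float:
    """Mean power spectral density of a band-limited EEG window (sketch).

    The periodogram |FFT(x)|^2 / M equals the DFT of the biased
    autocorrelation of Eq. (3); averaging over frequencies gives one
    scalar PSD feature per window."""
    m = len(x)
    spectrum = np.abs(np.fft.rfft(x)) ** 2 / m
    return float(spectrum.mean())
```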
Nonlinear energy (NE) was proposed by Toole et al. [48] and is widely used in areas such as epileptogenesis detection by EEGs [49]. Nonlinear energy takes into account the product of amplitude squared and frequency squared and is used to calculate the instantaneous energy of a signal, especially for identifying transient changes. The first-order difference of the signal and the Hilbert transform were utilized in this study to obtain an estimate of the nonlinear energy. The average NE of a segment of an EEG signal is calculated as follows:
$NE = \frac{1}{N} \sum_{n=1}^{N} \Gamma(x(n))$ (4)
where $\Gamma(x(n))$ is the estimated nonlinear energy of sample $n$, given by:
$\Gamma(x(n)) = (y(n))^2 + \left| y_{\mathrm{hilbert}}(n) \right|^2$ (5)
where $y(n)$ denotes the first-order difference of the EEG signal and $y_{\mathrm{hilbert}}(n)$ denotes the Hilbert transform of $y(n)$.
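A possible implementation of Equations (4) and (5) is sketched below, using the first-order difference of the window and its Hilbert transform from SciPy; the exact estimator in the original code may differ.

```python
import numpy as np
from scipy.signal import hilbert

def nonlinear_energy(x: np.ndarray) -> float:
    """Average nonlinear energy (NE) of an EEG window, following Eqs. (4)-(5)."""
    y = np.diff(x)                   # first-order difference of the signal
    y_hilbert = np.imag(hilbert(y))  # Hilbert transform of y (imaginary part of the analytic signal)
    gamma = y ** 2 + np.abs(y_hilbert) ** 2  # per-sample NE estimate
    return float(np.mean(gamma))
```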
Fractal dimension (FD) [50] is a nonlinear feature used to quantify EEG signals, which captures the self-similarity and complexity of the signal, and can characterize the temporal structure and dynamics of the signal. It is widely used in the fields of disease diagnosis from EEG signals [51] and sleep monitoring [52]. FD features have also been applied to research related to EEG emotion recognition [53], which has achieved some results in few-channel emotion recognition and demonstrated that classification accuracy can be improved by using FD features. FD features are generally calculated by measuring the self-similarity of the signal; due to the computational difficulties in estimating the fractal dimension of complex signals, this paper adopts the Petrosian fractal dimension approximation, which is calculated as follows:
$FD = \frac{\log_{10}(N_{EEG})}{\log_{10}(N_{EEG}) + \log_{10}\left( \frac{N_{EEG}}{N_{EEG} + 0.4 N_{\Delta}} \right)}$ (6)
where $N_{EEG}$ is the number of sampling points of the EEG signal and $N_{\Delta}$ is the number of sign changes in the first-order difference of the signal.
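A compact sketch of the Petrosian approximation in Equation (6) is shown below; counting sign changes in the first-order difference is an assumption consistent with the standard Petrosian definition.

```python
import numpy as np

def petrosian_fd(x: np.ndarray) -> float:
    """Petrosian fractal dimension of an EEG window (Eq. (6)). Sketch."""
    n_eeg = len(x)
    diff = np.diff(x)
    n_delta = int(np.sum(diff[:-1] * diff[1:] < 0))  # sign changes in the first-order difference
    return np.log10(n_eeg) / (np.log10(n_eeg) + np.log10(n_eeg / (n_eeg + 0.4 * n_delta)))
```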
In this study, the four features were extracted for each frequency band within 0.5 s time windows. Because the EEG baseline signal contains low-frequency noise, baseline correction was performed to avoid noise effects and baseline drift: the average features of the first 3 s of baseline signal were subtracted from the features of the remaining 60 s. Finally, all data were normalized to obtain the one-dimensional EEG features.

3.2. Feature Mapping and Feature Fusion

EEG signals also contain crucial spatial location information. To fully utilize the temporal, spatial, and frequency aspects of these signals, we constructed four-dimensional structural features to describe them. As illustrated in Figure 3, we mapped the features from the 32-channel EEG signals into a two-dimensional feature matrix based on the 32-electrode layout of the international 10–20 system. Positions without corresponding channels were filled with zeros, resulting in a two-dimensional feature map sized 8 × 9. The features from each frequency band of the EEG signals were mapped to these 2D feature maps. Subsequently, the 2D feature maps for the four frequency bands were stacked to create 8 × 9 × 4 3D features. This approach allows for a comprehensive representation of the EEG data, integrating multiple dimensions of information.
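The sketch below illustrates this spatial mapping for a handful of electrodes; the grid coordinates shown are hypothetical placeholders that follow the general 10–20 layout and are not the exact 32-channel mapping used in this study.

```python
import numpy as np

# Hypothetical 8 x 9 grid coordinates for a few DEAP electrodes (illustration only).
GRID_POS = {"Fp1": (0, 3), "Fp2": (0, 5), "F7": (2, 0), "F3": (2, 2),
            "Fz": (2, 4), "F4": (2, 6), "F8": (2, 8), "O1": (7, 3), "O2": (7, 5)}

def map_to_grid(channel_features: dict) -> np.ndarray:
    """Place per-channel scalar features into an 8 x 9 matrix; grid positions
    without an electrode remain zero (sketch)."""
    grid = np.zeros((8, 9), dtype=np.float32)
    for ch, value in channel_features.items():
        if ch in GRID_POS:
            row, col = GRID_POS[ch]
            grid[row, col] = value
    return grid
```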
The most commonly used feature fusion methods in deep learning-based algorithms are feature matrix addition, multiplication, and splicing [54]. According to a previous study [55], the use of feature matrix splicing is better than that of matrix addition and multiplication, so in this study, the feature splicing method is used to realize the specific formula as follows:
$X_{DE} = \left[ d_1, d_2, d_3, \ldots, d_N \right]$ (7)
$X_{PSD} = \left[ p_1, p_2, p_3, \ldots, p_N \right]$ (8)
$X_{NE} = \left[ n_1, n_2, n_3, \ldots, n_N \right]$ (9)
$X_{FD} = \left[ f_1, f_2, f_3, \ldots, f_N \right]$ (10)
$X_{con} = \left[ X_{DE}; X_{PSD}; X_{NE}; X_{FD} \right] = \left[ \left[ d_1; p_1; n_1; f_1 \right], \ldots, \left[ d_N; p_N; n_N; f_N \right] \right]$ (11)
In Equations (7)–(11), $d$ denotes the three-dimensional DE features, whose four-dimensional form is denoted by $X_{DE}$; $p$ denotes the three-dimensional PSD features, whose four-dimensional form is denoted by $X_{PSD}$; $n$ denotes the three-dimensional NE features, whose four-dimensional form is denoted by $X_{NE}$; and $f$ denotes the three-dimensional FD features, whose four-dimensional form is denoted by $X_{FD}$. These features are concatenated to obtain the fused feature set $X_{con}$. Since the EEG signal of each subject was divided into N equal-length segments according to time, a four-dimensional feature of size N × 16 × 8 × 9 was obtained. This feature fully preserves the temporal, spatial, and frequency information in the EEG signal, laying the foundation for subsequent feature extraction and classification.
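In PyTorch, this concatenation amounts to stacking the four feature tensors along the band/channel axis, as in the sketch below (random tensors stand in for the real features).

```python
import torch

# Each feature type: (N, 4, 8, 9) = (segments, bands, grid height, grid width).
N = 120
x_de, x_psd, x_ne, x_fd = (torch.randn(N, 4, 8, 9) for _ in range(4))

# Concatenating along dim=1 gives the N x 16 x 8 x 9 input of Eq. (11).
x_con = torch.cat([x_de, x_psd, x_ne, x_fd], dim=1)
print(x_con.shape)  # torch.Size([120, 16, 8, 9])
```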

3.3. Network Model Architecture and Classification

3.3.1. Cross-Scale Attention Module (CSAM)

CSAM captures features at different scales by applying convolutional kernels of different sizes in parallel and combining them with a frequency–space attention mechanism. Although CSAM is sparsely structured, it produces dense feature data. The module uses three sizes of convolutional kernels (1 × 1, 3 × 3, 5 × 5) and a maximum pooling kernel (3 × 3). Kernels of different sizes have different receptive fields, so convolutions at multiple scales can extract subtle EEG features while taking into account the relative positional relationships between individual electrodes. The maximum pooling layer compresses the spatial dimensions of the feature maps, reducing their width and height to extract more abstract, higher-level features. A BN layer and a ReLU activation function are added after the convolution and pooling operations, which helps prevent overfitting and increases the nonlinearity of the network. Finally, the CSAM module is obtained by appending the frequency–space attention mechanism after the ReLU activation. The structure of the CSAM module is shown in Figure 4.
The parameter settings of the CSAM module are shown in Table 2.
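A minimal PyTorch sketch of the cross-scale convolution inside CSAM is given below; the branch widths, strides, and the placement of the pooling branch are assumptions, and the exact settings are those of Table 2. The frequency–space attention (FSAM) would be applied to the concatenated output.

```python
import torch
import torch.nn as nn

class CrossScaleBlock(nn.Module):
    """Parallel 1x1, 3x3, and 5x5 convolutions plus a 3x3 max-pooling branch,
    each followed by BN and ReLU, concatenated along the channel axis (sketch)."""
    def __init__(self, in_ch: int, branch_ch: int = 32):
        super().__init__()
        def conv_branch(k: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, kernel_size=k, padding=k // 2),
                nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True))
        self.b1, self.b3, self.b5 = conv_branch(1), conv_branch(3), conv_branch(5)
        self.pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, branch_ch, kernel_size=1),
            nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)

# Example: a 16-channel 8 x 9 feature map yields 4 * 32 = 128 output channels.
out = CrossScaleBlock(16)(torch.randn(8, 16, 8, 9))
print(out.shape)  # torch.Size([8, 128, 8, 9])
```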

3.3.2. Frequency–Space Attention Module (FSAM)

It has been shown [56] that different frequency bands of EEG signals have different recognition effects in emotion recognition. The β and γ bands have better recognition performance, followed by the α band, while the θ band is the least effective. When mapping 2D feature maps, there are regions where electrode positions are missing and we use zeros to fill in these positions, but this introduces a lot of useless information. To enable the network model to better extract useful information from the feature maps, we introduced a frequency–space integrated attention mechanism to give higher weights to the more important frequency bands and spatial locations to better utilize the EEG information related to emotions. The structure of the frequency–space attention mechanism network is shown in Figure 5.
The input tensor in Figure 5 has the shape B × C × H × W, where B, C, H, and W denote the batch size, number of channels, height, and width of the tensor, respectively. For a given input feature map $X \in \mathbb{R}^{B \times C \times H \times W}$, the output feature map $Y \in \mathbb{R}^{B \times C \times H \times W}$ is obtained after passing through the frequency attention mechanism $F_f$ and the spatial attention mechanism $F_s$ in turn. This is expressed as the following equations:
$X' = F_f(X) \otimes X$ (12)
$Y = F_s(X') \otimes X'$ (13)
where $X'$ in Equations (12) and (13) denotes the feature map output by the frequency attention mechanism $F_f$, and $\otimes$ denotes element-wise multiplication.
The frequency attention mechanism mainly acts on the different channels of the feature map and can effectively highlight the important frequency bands in the EEG signal. First, a global adaptive average pooling operation over the spatial dimensions of $X \in \mathbb{R}^{B \times C \times H \times W}$ is performed to obtain $X_{avg} \in \mathbb{R}^{B \times C \times 1 \times 1}$, computed as follows:
$X_{avg} = \mathrm{Adaptive\_AvgPool}(X(h, w))$ (14)
$X_{avg}$ is then fed into a multilayer perceptron (MLP), which consists of two fully connected layers with ReLU and Sigmoid activation functions. The $F_{ReLU}(x)$ and $F_{Sigmoid}(x)$ activation functions are defined as
$F_{ReLU}(x) = \begin{cases} x & (x > 0) \\ 0 & (x \le 0) \end{cases}$ (15)
$F_{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$ (16)
Finally, the frequency attention map $F_f \in \mathbb{R}^{B \times C}$ is realized by passing the result through a linear layer and adding the bias term $b$. By applying the output frequency attention weights to the input signal, different weights can be assigned to the individual channels, and emotionally relevant frequency bands receive more attention. The different weights assigned by the frequency attention mechanism to the 464 channels in CSAM are shown in Figure 6.
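A minimal sketch of this channel/frequency attention branch is shown below; the MLP reduction ratio is an assumption.

```python
import torch
import torch.nn as nn

class FrequencyAttention(nn.Module):
    """Global average pooling followed by a two-layer MLP with ReLU and Sigmoid,
    producing one weight per channel (Eqs. (12), (14)-(16)). Sketch."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        w = self.mlp(self.pool(x).flatten(1))             # (B, C) channel weights
        return x * w.view(x.size(0), -1, 1, 1)            # reweight each channel
```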
The spatial attention mechanism mainly acts on the different spatial locations of the feature map. For the feature map $X'$ output by the frequency attention mechanism, $X_{max}$ is first obtained through a maximum pooling layer:
$X_{max} = \mathrm{MaxPool}(X'(h, w))$ (17)
After that, the spatial attention map $F_s \in \mathbb{R}^{B \times 1 \times H \times W}$ is obtained through a two-dimensional convolution followed by ReLU and Sigmoid activation functions, and the final output $Y$ of the frequency–space attention mechanism is obtained by multiplying $F_s$ with the feature map $X'$. Figure 7 shows heatmaps of the features in CSAM for three randomly selected training epochs. After passing through the spatial attention mechanism, different electrode positions are given different weights in the heatmap; it can be seen that the frontal region of the brain is weighted higher and the parietal region lower.
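The spatial branch can be sketched as follows; the convolution kernel size is an assumption, and the activation ordering is simplified.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Channel-wise max pooling, a 2D convolution, and a Sigmoid gate producing
    one weight per spatial location (Eqs. (13), (17)). Sketch."""
    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        s = x.max(dim=1, keepdim=True).values             # (B, 1, H, W)
        w = torch.sigmoid(self.conv(s))                    # spatial weight map
        return x * w                                       # reweight each location
```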

3.3.3. Feature Transition Module (FTM)

In this study, multiple features of the EEG signals were used and fused. Feature splicing was also used in the subsequent cross-scale convolution to fuse feature maps of different sizes, increasing the number of channels of the output feature maps; all of these operations double the dimensionality of the input feature maps and occupy more memory. To control memory consumption while maintaining network performance, a transition module is introduced. The transition module normalizes the input feature map with a batch normalization layer, which yields more emotionally expressive and stable low-dimensional features. The network structure of the feature transition module is shown in Figure 8, where the size of the maximum pooling kernel is set to 3.
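A possible form of the transition module is sketched below; the 1 × 1 convolution used here to halve the channel count is an assumption about how the memory reduction is achieved.

```python
import torch.nn as nn

class TransitionModule(nn.Module):
    """Batch normalization, a 1x1 convolution that halves the channel count,
    and 3x3 max pooling (sketch of the feature transition module)."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.Conv2d(in_ch, in_ch // 2, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1))

    def forward(self, x):
        return self.block(x)
```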

3.3.4. Temporal Feature Extraction Module (Bi-LSTM)

The bidirectional long short-term memory (Bi-LSTM) network processes temporal data efficiently by using two LSTM units at each time step: one processes the sequence from front to back and the other from back to front. The final output is obtained by combining the results of the LSTM computations in the two directions. The LSTM cell consists of an input gate, a forget gate, and an output gate, which can be defined as
$i_t = \sigma\left( W_i \left[ h_{t-1}, q_t \right] + b_i \right)$ (18)
$f_t = \sigma\left( W_f \left[ h_{t-1}, q_t \right] + b_f \right)$ (19)
$g_t = \tanh\left( W_c \left[ h_{t-1}, q_t \right] + b_c \right)$ (20)
$C_t = f_t \odot C_{t-1} + i_t \odot g_t$ (21)
$o_t = \sigma\left( W_o \left[ h_{t-1}, q_t \right] + b_o \right)$ (22)
$h_t = o_t \odot \tanh\left( C_t \right)$ (23)
In Equations (18)–(23), $\left[ h_{t-1}, q_t \right]$ denotes the concatenation of the previous hidden state and the current input, $W$ represents the corresponding weight matrix, $\sigma$ is the Sigmoid function, and $b$ is the bias term. The final output is obtained from the outputs of the forward and backward LSTMs, denoted as $y_t = \left[ \overrightarrow{h_t}; \overleftarrow{h_t} \right]$.
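In PyTorch, the bidirectional LSTM can be instantiated as below; the input and hidden sizes are placeholders, not the values used in the paper.

```python
import torch
import torch.nn as nn

# Bidirectional LSTM over the sequence of per-window feature vectors; the
# forward and backward hidden states are concatenated as y_t = [h_t->; h_t<-].
bilstm = nn.LSTM(input_size=256, hidden_size=128, batch_first=True,
                 bidirectional=True)
x = torch.randn(8, 10, 256)   # (batch, time steps, flattened spatial features)
y, _ = bilstm(x)              # y: (8, 10, 256), i.e., 2 * hidden_size per step
```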

3.3.5. Deep Classification Module (DCM)

Since deep classifiers tend to be more effective than shallow ones, a deep classifier is designed in this study to perform emotion classification on the extracted temporal feature output. The classifier consists of fully connected layers, ReLU activation functions, and Dropout regularization. The deep classifier network structure is shown in Figure 9, and the specific parameters are listed in Table 3.
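The classifier can be sketched as a stack of fully connected layers with ReLU and Dropout; the layer widths below are assumptions, with the exact settings given in Table 3.

```python
import torch.nn as nn

def make_classifier(in_features: int, n_classes: int, p: float = 0.5) -> nn.Sequential:
    """Deep classification module sketch: FC layers with ReLU and Dropout."""
    return nn.Sequential(
        nn.Linear(in_features, 128), nn.ReLU(inplace=True), nn.Dropout(p),
        nn.Linear(128, 64), nn.ReLU(inplace=True), nn.Dropout(p),
        nn.Linear(64, n_classes))
```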

4. Experiments

4.1. Dataset and Dataset Processing

The DEAP dataset is a physiological signal dataset generated using music videos as eliciting materials. During the experiment, each subject watched 40 one-minute music video excerpts and assessed their emotions based on subjective feelings of arousal, valence, liking, and dominance. A total of 32 subjects participated, and each experimental session recorded signals from 40 channels. The first 32 channels recorded EEG signals, while the last 8 channels included other physiological signals, such as ocular and EMG data. Each trial lasted 63 s, with the first 3 s capturing baseline signals and the remaining 60 s reflecting emotion-evoked signals. The dataset's sampling frequency was downsampled from 512 Hz to 128 Hz, resulting in a physiological signal matrix for each subject of size 40 × 40 × 8064 (40 experimental music clips, 40 physiological signal channels, and 8064 sampling points). The DEAP dataset offers rich data on EEG and other physiological signals, along with subjective emotion assessments from participants, providing valuable insights into the relationship between emotion and physiological responses.
Valence and arousal tags were selected in the DEAP dataset, and tags scoring above 5 were assigned a value of 1 and tags below 5 were assigned a value of 0. Dichotomous experiments can be categorized as high valence (HV) and low valence (LV) or high arousal (HA) and low arousal (LA). The four-classification experiment (V-A) can be categorized as high valence high arousal (HVHA), high valence low arousal (HVLA), low valence high arousal (LVHA), and low valence low arousal (LVLA). HVHA corresponds to a state of excitement or agitation, HVLA to a state of calmness or relaxation, LVHA to depression or anger, and LVLA to frustration or sadness.
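The label construction described above can be written as the following sketch; the numeric class indices assigned to the four quadrants are an arbitrary choice for illustration.

```python
import numpy as np

def make_labels(valence: np.ndarray, arousal: np.ndarray):
    """Binarize DEAP self-assessment ratings at 5 and build the four-class
    valence-arousal label (sketch)."""
    hv = (valence > 5).astype(int)        # 1 = high valence, 0 = low valence
    ha = (arousal > 5).astype(int)        # 1 = high arousal, 0 = low arousal
    four_class = (1 - hv) * 2 + (1 - ha)  # HVHA -> 0, HVLA -> 1, LVHA -> 2, LVLA -> 3
    return hv, ha, four_class
```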

4.2. Experiment Setup and Performance Evaluation Metrics

All experiments used the same software environment, dataset division, parameter settings, and evaluation metrics. The software environment was the Windows 11 operating system, the Python 3.9 programming language, and the PyTorch deep learning framework, and the hardware was an NVIDIA GeForce RTX 4060 GPU (NVIDIA: Santa Clara, CA, USA). The model was trained using Adam as the optimizer and cross-entropy with L2 regularization as the loss function, with the learning rate set to 0.001. To reduce overfitting during training, the Dropout parameter was set to 0.5. We employed ten-fold cross-validation: each participant's dataset was divided into ten parts, and in each experiment, one part was used as the test set while the remaining nine parts were used as the training set. This procedure was repeated ten times, and the average of the ten experiments was taken as the final result. Therefore, in each experiment, the training set and the test set were split in a ratio of 9:1. Finally, the average over the 32 subjects was taken as the result on the dataset. To objectively evaluate the performance of the model, this study uses the accuracy, precision, recall, and F1-score as evaluation metrics, calculated as follows:
$Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$ (24)
$Precision = \frac{TP}{TP + FP}$ (25)
$Recall = \frac{TP}{TP + FN}$ (26)
$F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}$ (27)
In Equations (24)–(27), TP denotes that the true label is a positive class and is predicted to be positive, TN denotes that the true label is a negative class and is predicted to be negative, FP denotes that the true label is a negative class but is predicted to be positive, and FN denotes that the true label is a positive class but is predicted to be negative.
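The per-subject ten-fold protocol and the metrics of Equations (24)–(27) can be sketched as below; `train_and_predict` is a hypothetical stand-in for training CATM on one fold and returning predictions for the held-out part.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate_subject(X, y, train_and_predict):
    """Ten-fold cross-validation for one subject; returns the averaged
    accuracy, precision, recall, and F1-score (sketch)."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        y_pred = train_and_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append([accuracy_score(y[test_idx], y_pred),
                       precision_score(y[test_idx], y_pred, average="macro"),
                       recall_score(y[test_idx], y_pred, average="macro"),
                       f1_score(y[test_idx], y_pred, average="macro")])
    return np.mean(scores, axis=0)
```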

5. Results and Discussion

To verify the effectiveness of the proposed method, this study conducts ablation experiments and comparative tests on the DEAP dataset and evaluates the model performance with evaluation metrics. Figure 10 shows the accuracy of the 32 subjects in the DEAP dataset in the valence dimension, arousal dimension, and valence–arousal dimension. The accuracies of the proposed method in this study are 99.70% and 99.74% for the valence dimension and arousal dimension, respectively, and 97.27% for V-A. Figure 11 shows the confusion matrices of the CATM experiment on the DEAP dataset.

5.1. Ablation Experiments

In this study, feature fusion ablation experiments, feature ablation experiments, and module ablation experiments were conducted.

5.1.1. Feature Fusion Ablation Experiments

The commonly used feature fusion methods in deep learning algorithms are feature addition, multiplication, and concatenation, and this experiment discusses the effect of different feature fusion strategies on the effect of emotion recognition. The results of the experiments are shown in Table 4.
In the table, $X_{add}$ means the four 3D features are added together, $X_{mult}$ means they are multiplied together, and $X_{con}$ means they are concatenated together.

5.1.2. Feature Ablation Experiments

The feature ablation experiments were performed on four features, DE, PSD, NE, and FD, respectively. The experimental results are shown in Table 5.
From the experimental results, it can be seen that using all four features yields the best classification accuracy compared to single, double, and triple feature combinations, because introducing multiple features helps improve the accuracy of emotion recognition. The experimental results also demonstrate the effectiveness of the PSD, FD, and NE features for emotion recognition.

5.1.3. Module Ablation Experiments

CATM mainly consists of a cross-scale attention module, a frequency–space attention module, a spatio-temporal feature extraction module, and a deep classification module. We ablate each module to verify its contribution in the classification task. For a better representation, we use Model 1 to represent the model that lacks the cross-scale attention module, Model 2 to represent the model that lacks the frequency–space attention module, Model 3 to represent the model that lacks the temporal feature extraction module, and Model 4 to represent the model that lacks the deep classification module. Detailed information is shown in Table 6.
As can be seen from Table 7, after ablating the cross-scale attention module, the valence and arousal classification accuracies decreased by 8.36% and 4.68%, respectively; ablating the frequency–space attention module resulted in decreases of 0.75% and 0.77% in the valence and arousal classification accuracies, respectively; after ablating the temporal feature extraction module, the valence and arousal classification accuracies decreased by 7.95% and 7.29%, respectively; and after ablating the deep classification module, the valence and arousal classification accuracies decreased by 0.78% and 0.87%, respectively. It can be seen that ablating each module reduces the feature extraction ability of the model, so that the temporal, spatial, and frequency information in the EEG signal is not fully utilized. Figure 12 shows the accuracy of each subject in the valence and arousal dimensions in the module ablation experiments. We found that the recognition accuracy of individual subjects decreased after ablating certain modules. The module ablation experiments demonstrate the contribution of the individual modules in the model.
In order to verify the performance of the cross-scale attention module in the model, we replaced the module with a single-scale convolution with convolution kernel sizes of 1, 3, and 5 for the experiments, and the results are shown in Table 8.
Table 8 shows that there is a decrease in model performance after replacing CSAM with single-scale convolution. A possible reason is that single-scale convolution has a relatively limited capability for feature extraction, making it unable to capture multi-scale features.

5.2. Experiments with Few Channels

To demonstrate the validity of this method for use with fewer channels, we selected electrodes with a high correlation to emotion based on a previous study [12]. We conducted experiments using data from these channels in the DEAP dataset. Experiments were performed with both 5 and 18 electrodes. The 5 selected electrodes were Fp1, Fp2, F7, F8, and O1. The 18 electrodes included Fp1, Fp2, F7, F3, Fz, F4, F8, T7, C3, CZ, C4, T8, P7, P3, P4, P8, O1, and O2. The electrode mapping is shown in Figure 13.
The experimental results for the fewer channels are shown in Table 9 and Table 10.
According to Table 9 and Table 10, the method in this study demonstrates good performance in experiments with fewer channels. In the 5-channel experiments, the accuracies for valence and arousal were 97.96% and 98.11%, respectively, while the V-A classification achieved an accuracy of 92.86%. For the 18-channel experiment, the accuracies were 99.59% for valence and 99.53% for arousal, with the V-A classification achieving 94.57%. The accuracy for each subject is illustrated in Figure 14 and Figure 15.
According to Figure 14 and Figure 15, there was a noticeable decrease in the experimental categorization accuracy of individual subjects following the reduction in EEG channels. Specifically, in the 5-channel experiments, both dichotomous and quaternary classifications yielded lower accuracy for individual subjects compared to the 18- and 32-channel configurations.
In the 5-channel experiment, we selected EEG channels from the frontal and occipital regions. The literature [57] indicates that the prefrontal, parietal, and occipital areas may contain the most relevant information for emotion recognition, consistent with previous studies [58,59]. Additionally, references [60,61] noted that synchronization between the frontal and occipital lobes is associated with both positive and fearful emotions. This study conducted experiments using all channels (32), 18 channels, and 5 channels. The results showed that using more channels improved the model’s classification performance, aligning with previous research findings [62]. Reducing the number of channels decreases the input EEG data to the model, significantly lowering both the parameter count and computational cost. However, having fewer data may hinder the model’s ability to perform classification tasks, potentially affecting its performance. Our experiments demonstrated that CATM still achieved good classification results even with only 5 channels.
In recent years, portable EEG acquisition devices have gained popularity, offering options for 15 channels, 5 channels, and even 1 channel. Limited channels can capture only low-density EEG signals, and since emotions are represented across multiple brain regions, extracting effective emotional patterns from such limited data poses significant challenges.

5.3. Comparative Experiments

We compare the proposed emotion classification model with recent emotion classification models, as shown in Table 11.
The following is a characterization of the various methods in Table 11:
(1) FSA-3D-CNN [63]: This method constructs a 3D matrix of the EEG containing spatio-temporal information and introduces an attention mechanism to use 3D-CNN for emotion classification tasks.
(2) TSFFN [64]: This method performs de-baselining of the EEG and extracts spatio-temporal features from EEG signals using a parallel transformer and a three-dimensional convolutional neural network (3D-CNN), and finally performs an emotion classification task.
(3) Multi-aCRNN [65]: This method proposes a multi-view feature fusion attentional convolutional recurrent neural network. The interference of label noise is reduced by label smoothing, and GRU and CNN are combined to accomplish the emotion classification task.
(4) RA2-3DCNN [66]: This method introduces segmentation–transformation–merge techniques, residuals, and attention mechanisms into shallow networks to improve the accuracy of the model. It is based on 2D and 3D convolutional neural networks for emotion recognition.
(5) MDCNAResnet [67]: This method extracts differential entropy features from EEG signals and constructs a three-dimensional feature matrix, uses deformable convolution to extract high-level abstract features, and combines MDCNAResnet with bidirectional gated recurrent units (BiGRUs) to accomplish emotion recognition.
(6) BiTCAN [40]: This method utilizes a bi-hemispheric difference module to extract salience features of brain cognition, fuses salience and spatio-temporal features in an attention module, and inputs them into a classifier for emotion recognition.
(7) RFPN-S2D-CNN [68]: This method uses preprocessed signals, differential entropy (DE), symmetric difference, and the symmetric quotient to construct four EEG signal feature matrices, and proposes a residual feature pyramid network (RFPN) to obtain inter-channel correlation, which is effective in improving the classification accuracy of emotion recognition.
(8) FCAN-XGBoost [55]: This method extracts DE and PSD features of the EEG and fuses the FCAN and XGBoost algorithms for emotion recognition, which reduces computational cost and improves classification accuracy.
(9) Multi-scale 3D-CRU [69]: This method reconstructs a 3D feature representation of the EEG containing delta (δ) frequencies, combined with a recurrent GRU network for emotion classification.
(10) MES-CTNet [70]: This method proposes a new capsule transformer network based on multi-domain features, which uses multiple features and multiple attention mechanisms and achieves high accuracy in emotion classification.
The proposed method in this study utilizes various features of the EEG and combines a cross-scale attentional convolution module with a temporal feature extraction module to extract temporal–frequency–spatial features in EEG signals, enhancing the accuracy of emotion classification. The data in Table 11 indicate that our method outperforms other approaches in both dichotomous and quaternary tasks.
We obtained some information about the CATM model through experiments and PyTorch built-in functions. The model has a total of 9,478,504 parameters. Under the previously mentioned hardware conditions, it takes 293.31 ms to process the EEG data of one subject. The size of the model's weight file is 36.07 MB. We hope that other researchers will also report such model-related information in their studies to facilitate comparisons in future research.
In summary, the multi-feature, multi-band emotion recognition method proposed in this study offers significant advantages over other recent approaches. The effectiveness of various features and modules is validated through multiple ablation experiments. In tests with fewer channels, the method achieves high emotion classification accuracy using only 5-channel and 18-channel data. Additionally, the recognition accuracy is more balanced across subjects, and the performance is more stable compared to single-feature methods, making it better suited for real-world applications.
However, this method has some shortcomings. While it performs well in within-subject experiments, it performs poorly in cross-subject experiments. A likely reason is that physiological representations and subjective feelings differ across individuals, which makes cross-subject emotion recognition difficult. Cross-subject experiments, which are more closely aligned with practical applications, will be emphasized in future research.

6. Conclusions

In this study, we propose a multi-feature, multi-band spatio-temporal fusion algorithm called CATM. First, we extract DE, PSD, NE, and FD features from the EEG signals across different channels. These features undergo de-baselining and are organized into a 3D feature matrix based on the relative positions of electrodes, fully utilizing the frequency domain, energy, nonlinearity, spatio-temporal complexity, and spatial information in the EEG signals. Next, a cross-scale attention module is employed to extract spatial features at different scales within the EEG signals. The extracted features receive varying weights at spatial locations and channels through a frequency–space attention mechanism, enhancing the classification performance of the model. To prevent overfitting and reduce computation, a transition module is introduced to improve model generalization. Finally, Bi-LSTM is utilized to extract the temporal features from the EEG signals, facilitating the fusion of spatio-temporal features.
The experimental results of the proposed method on the DEAP dataset demonstrate its effectiveness in extracting emotion-related features from EEG signals. The classification accuracies achieved were 99.70% for valence, 99.74% for arousal, and 97.27% for the combined valence–arousal dimension. In experiments using fewer channels, the 5-channel EEG signal yielded classification accuracies of 97.96% for valence and 98.11% for arousal, with a 92.86% accuracy for the combined dimension. In the 18-channel experiment, the accuracies were 99.59% for valence, 99.53% for arousal, and 94.57% for the combined dimension. In our future work, we aim to optimize the network structure to reduce parameters and computational costs while enhancing classification accuracy. Additionally, we plan to conduct experiments that combine multimodal data to mitigate the impact of individual differences on network models.

Author Contributions

Conceptualization, H.Y. and X.X.; methodology, H.Y.; software, H.Y.; validation, H.Y. and X.X.; formal analysis, H.Y. and X.X.; investigation, R.Q. and K.S.; resources, X.X. and J.Z.; data curation, H.Y.; writing—original draft preparation, H.Y.; writing—review and editing, H.Y. and J.Z.; visualization, H.Y.; supervision, X.X. and J.Z.; funding acquisition, X.X. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

The authors would like to thank the National Natural Science Foundation of China (82060329) and Yunnan Fundamental Research Projects (202201AT070108) for their support.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data for this study were obtained from publicly available datasets. The DEAP dataset is available at http://www.eecs.qmul.ac.uk/mmv/datasets/deap/index.html (accessed on 2 September 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Torres, E.P.; Torres, E.A.; Hernández-Álvarez, M.; Yoo, S.G. EEG-Based BCI Emotion Recognition: A Survey. Sensors 2020, 20, 5083. [Google Scholar] [CrossRef] [PubMed]
  2. Lin, W.; Li, C. Review of Studies on Emotion Recognition and Judgment Based on Physiological Signals. Appl. Sci. 2023, 13, 2573. [Google Scholar] [CrossRef]
  3. Kamble, K.; Sengupta, J. A comprehensive survey on emotion recognition based on electroencephalograph (EEG) signals. Multimed. Tools Appl. 2023, 82, 27269–27304. [Google Scholar] [CrossRef]
  4. Zhang, J.; Xing, L.; Tan, Z.; Wang, H.; Wang, K. Multi-head attention fusion networks for multi-modal speech emotion recognition. Comput. Ind. Eng. 2022, 168, 108078. [Google Scholar] [CrossRef]
  5. Rajan, S.; Chenniappan, P.; Devaraj, S.; Madian, N. Novel deep learning model for facial expression recognition based on maximum boosted CNN and LSTM. IET Image Process. 2020, 14, 1373–1381. [Google Scholar] [CrossRef]
  6. Ahmed, F.; Bari, A.S.M.H.; Gavrilova, M.L. Emotion Recognition from Body Movement. IEEE Access 2020, 8, 11761–11781. [Google Scholar] [CrossRef]
  7. Xu, M.; Cheng, J.; Li, C.; Liu, Y.; Chen, X. Spatio-temporal deep forest for emotion recognition based on facial electromyography signals. Comput. Biol. Med. 2023, 156, 106689. [Google Scholar] [CrossRef]
  8. Zhu, H.; Fu, C.; Shu, F.; Yu, H.; Chen, C.; Chen, W. The Effect of Coupled Electroencephalography Signals in Electrooculography Signals on Sleep Staging Based on Deep Learning Methods. Bioengineering 2023, 10, 573. [Google Scholar] [CrossRef] [PubMed]
  9. Lee, J.-A.; Kwak, K.-C. Personal Identification Using an Ensemble Approach of 1D-LSTM and 2D-CNN with Electrocardiogram Signals. Appl. Sci. 2022, 12, 2692. [Google Scholar] [CrossRef]
  10. Han, Z.; Chang, H.; Zhou, X.; Wang, J.; Wang, L.; Shao, Y. E2ENNet: An end-to-end neural network for emotional brain-computer interface. Front. Comput. Neurosci. 2022, 16, 942979. [Google Scholar] [CrossRef]
  11. Lin, X.; Chen, J.; Ma, W.; Tang, W.; Wang, Y. EEG emotion recognition using improved graph neural network with channel selection. Comput. Methods Programs Biomed. 2023, 231, 107380. [Google Scholar] [CrossRef]
  12. Chen, J.; Min, C.; Wang, C.; Tang, Z.; Liu, Y.; Hu, X. Electroencephalograph-Based Emotion Recognition Using Brain Connectivity Feature and Domain Adaptive Residual Convolution Model. Front. Neurosci. 2022, 16, 878146. [Google Scholar] [CrossRef] [PubMed]
  13. Han, L.; Zhang, X.; Yin, J. EEG emotion recognition based on the TimesNet fusion model. Appl. Soft Comput. 2024, 159, 111635. [Google Scholar] [CrossRef]
  14. Wu, Q.; Yuan, Y.; Cheng, Y.; Ye, T. A novel emotion recognition method based on 1D-DenseNet. J. Intell. Fuzzy Syst. 2023, 44, 5507–5518. [Google Scholar] [CrossRef]
  15. Singh, K.; Ahirwal, M.K.; Pandey, M. Quaternary classification of emotions based on electroencephalogram signals using hybrid deep learning model. J. Ambient Intell. Humaniz. Comput. 2023, 14, 2429–2441. [Google Scholar] [CrossRef]
  16. Liu, F.; Yang, P.; Shu, Y.; Liu, N.; Sheng, J.; Luo, J.; Wang, X.; Liu, Y.J. Emotion Recognition from Few-Channel EEG Signals by Integrating Deep Feature Aggregation and Transfer Learning. IEEE Trans. Affect. Comput. 2023, 1–17. [Google Scholar] [CrossRef]
  17. Jatupaiboon, N.; Pan-ngum, S.; Israsena, P. Emotion classification using minimal EEG channels and frequency bands. In Proceedings of the 2013 10th International Joint Conference on Computer Science and Software Engineering (JCSSE), Khon Kaen, Thailand, 29–31 May 2013; pp. 21–24. [Google Scholar]
  18. Mert, A.; Akan, A. Emotion recognition from EEG signals by using multivariate empirical mode decomposition. Pattern Anal. Appl. 2018, 21, 81–89. [Google Scholar] [CrossRef]
  19. Bazgir, O.; Mohammadi, Z.; Habibi, S. Emotion Recognition with Machine Learning Using EEG Signals. In Proceedings of the 2018 25th National and 3rd International Iranian Conference on Biomedical Engineering (ICBME), Qom, Iran, 29–30 November 2019. [Google Scholar]
  20. Koelstra, S.; Muhl, C.; Soleymani, M.; Lee, J.S.; Yazdani, A.; Ebrahimi, T.; Pun, T.; Nijholt, A.; Patras, I. DEAP: A Database for Emotion Analysis; Using Physiological Signals. IEEE Trans. Affect. Comput. 2012, 3, 18–31. [Google Scholar] [CrossRef]
  21. Zheng, W.L.; Lu, B.L. Investigating Critical Frequency Bands and Channels for EEG-Based Emotion Recognition with Deep Neural Networks. IEEE Trans. Auton. Ment. Dev. 2015, 7, 162–175. [Google Scholar] [CrossRef]
  22. Houssein, E.H.; Hammad, A.; Ali, A.A. Human emotion recognition from EEG-based brain–computer interface using machine learning: A comprehensive review. Neural Comput. Appl. 2022, 34, 12527–12557. [Google Scholar] [CrossRef]
  23. Rahman, M.M.; Sarkar, A.K.; Hossain, M.A.; Hossain, M.S.; Islam, M.R.; Hossain, M.B.; Quinn, J.M.W.; Moni, M.A. Recognition of human emotions using EEG signals: A review. Comput. Biol. Med. 2021, 136, 104696. [Google Scholar] [CrossRef] [PubMed]
  24. Jafari, M.; Shoeibi, A.; Khodatars, M.; Bagherzadeh, S.; Shalbaf, A.; García, D.L.; Gorriz, J.M.; Acharya, U.R. Emotion recognition in EEG signals using deep learning methods: A review. Comput. Biol. Med. 2023, 165, 107450. [Google Scholar] [CrossRef] [PubMed]
  25. Dadebayev, D.; Goh, W.W.; Tan, E.X. EEG-based emotion recognition: Review of commercial EEG devices and machine learning techniques. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 4385–4401. [Google Scholar] [CrossRef]
  26. Singh, A.K.; Krishnan, S. Trends in EEG signal feature extraction applications. Front. Artif. Intell. 2023, 5, 1072801. [Google Scholar] [CrossRef] [PubMed]
  27. Yuvaraj, R.; Thagavel, P.; Thomas, J.; Fogarty, J.; Ali, F. Comprehensive Analysis of Feature Extraction Methods for Emotion Recognition from Multichannel EEG Recordings. Sensors 2023, 23, 915. [Google Scholar] [CrossRef] [PubMed]
  28. Liu, S.; Wang, X.; Zhao, L.; Zhao, J.; Xin, Q.; Wang, S.H. Subject-Independent Emotion Recognition of EEG Signals Based on Dynamic Empirical Convolutional Neural Network. IEEE/ACM Trans. Comput. Biol. Bioinform. 2021, 18, 1710–1721. [Google Scholar] [CrossRef]
  29. Çelebi, M.; Öztürk, S.; Kaplan, K. An emotion recognition method based on EWT-3D–CNN–BiLSTM-GRU-AT model. Comput. Biol. Med. 2024, 169, 107954. [Google Scholar] [CrossRef] [PubMed]
  30. Zhang, J.; Zhang, X.; Chen, G.; Huang, L.; Sun, Y. EEG emotion recognition based on cross-frequency granger causality feature extraction and fusion in the left and right hemispheres. Front. Neurosci. 2022, 16, 974673. [Google Scholar] [CrossRef] [PubMed]
  31. Cai, J.; Xiao, R.; Cui, W.; Zhang, S.; Liu, G. Application of Electroencephalography-Based Machine Learning in Emotion Recognition: A Review. Front. Syst. Neurosci. 2021, 15, 729707. [Google Scholar] [CrossRef]
  32. Nawaz, R.; Cheah, K.H.; Nisar, H.; Yap, V.V. Comparison of different feature extraction methods for EEG-based emotion recognition. Biocybern. Biomed. Eng. 2020, 40, 910–926. [Google Scholar] [CrossRef]
  33. Bhardwaj, A.; Gupta, A.; Jain, P.; Rani, A.; Yadav, J. Classification of human emotions from EEG signals using SVM and LDA Classifiers. In Proceedings of the 2015 2nd International Conference on Signal Processing and Integrated Networks (SPIN), Noida, India, 19–20 February 2015; pp. 180–185. [Google Scholar]
  34. Gupta, V.; Chopda, M.D.; Pachori, R.B. Cross-Subject Emotion Recognition Using Flexible Analytic Wavelet Transform From EEG Signals. IEEE Sens. J. 2019, 19, 2266–2274. [Google Scholar] [CrossRef]
  35. Asghar, M.A.; Khan, M.J.; Fawad; Amin, Y.; Rizwan, M.; Rahman, M.; Badnava, S.; Mirjavadi, S.S. EEG-Based Multi-Modal Emotion Recognition using Bag of Deep Features: An Optimal Feature Selection Approach. Sensors 2019, 19, 5218. [Google Scholar] [CrossRef] [PubMed]
  36. Li, X.; Song, D.; Zhang, P.; Yu, G.; Hou, Y.; Hu, B. Emotion recognition from multi-channel EEG data through Convolutional Recurrent Neural Network. In Proceedings of the 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Shenzhen, China, 15–18 December 2016; pp. 352–359. [Google Scholar]
  37. Chakravarthi, B.; Ng, S.-C.; Ezilarasan, M.R.; Leung, M.-F. EEG-based emotion recognition using hybrid CNN and LSTM classification. Front. Comput. Neurosci. 2022, 16, 1019776. [Google Scholar] [CrossRef]
  38. Yang, Y.; Wu, Q.; Fu, Y.; Chen, X. Continuous Convolutional Neural Network with 3D Input for EEG-Based Emotion Recognition. In Proceedings of the Neural Information Processing: 25th International Conference, ICONIP 2018, Siem Reap, Cambodia, 13–16 December 2018; pp. 433–443. [Google Scholar]
  39. Liu, S.; Wang, X.; Zhao, L.; Li, B.; Hu, W.; Yu, J.; Zhang, Y.D. 3DCANN: A Spatio-Temporal Convolution Attention Neural Network for EEG Emotion Recognition. IEEE J. Biomed. Health Inform. 2022, 26, 5321–5331. [Google Scholar] [CrossRef] [PubMed]
  40. An, Y.; Hu, S.; Liu, S.; Li, B. BiTCAN: A emotion recognition network based on saliency in brain cognition. Math. Biosci. Eng. 2023, 20, 21537–21562. [Google Scholar] [CrossRef]
  41. Xiao, G.; Shi, M.; Ye, M.; Xu, B.; Chen, Z.; Ren, Q. 4D attention-based neural network for EEG emotion recognition. Cogn. Neurodynamics 2022, 16, 805–818. [Google Scholar] [CrossRef]
  42. Jia, Z.; Lin, Y.; Cai, X.; Chen, H.; Gou, H.; Wang, J. SST-EmotionNet: Spatial-Spectral-Temporal based Attention 3D Dense Network for EEG Emotion Recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12 October 2020; pp. 2909–2917. [Google Scholar]
  43. Zhang, J.; Chen, M.; Zhao, S.; Hu, S.; Shi, Z.; Cao, Y. ReliefF-Based EEG Sensor Selection Methods for Emotion Recognition. Sensors 2016, 16, 1558. [Google Scholar] [CrossRef]
  44. Özerdem, M.S.; Polat, H. Emotion recognition based on EEG features in movie clips with channel selection. Brain Inform. 2017, 4, 241–252. [Google Scholar] [CrossRef]
  45. Topic, A.; Russo, M.; Stella, M.; Saric, M. Emotion Recognition Using a Reduced Set of EEG Channels Based on Holographic Feature Maps. Sensors 2022, 22, 3248. [Google Scholar] [CrossRef]
  46. Duan, R.N.; Zhu, J.Y.; Lu, B.L. Differential entropy feature for EEG-based emotion classification. In Proceedings of the 2013 6th International IEEE/EMBS Conference on Neural Engineering (NER), San Diego, CA, USA, 6–8 November 2013; pp. 81–84. [Google Scholar]
  47. Chen, J.; Jiang, D.; Zhang, Y.; Zhang, P. Emotion recognition from spatiotemporal EEG representations with hybrid convolutional recurrent neural networks via wearable multi-channel headset. Comput. Commun. 2020, 154, 58–65. [Google Scholar] [CrossRef]
  48. Toole, J.M.O.; Temko, A.; Stevenson, N. Assessing instantaneous energy in the EEG: A non-negative, frequency-weighted energy operator. In Proceedings of the 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Chicago, IL, USA, 26–30 August 2014; pp. 3288–3291. [Google Scholar]
  49. Zhao, X.; Zhang, R.; Mei, Z.; Chen, C.; Chen, W. Identification of Epileptic Seizures by Characterizing Instantaneous Energy Behavior of EEG. IEEE Access 2019, 7, 70059–70076. [Google Scholar] [CrossRef]
  50. Vega, C.F.; Noel, J. Parameters analyzed of Higuchi’s fractal dimension for EEG brain signals. In Proceedings of the 2015 Signal Processing Symposium (SPSympo), Debe, Poland, 10–12 June 2015; pp. 1–5. [Google Scholar]
  51. Jacob, J.E.; Gopakumar, K. Automated Diagnosis of Encephalopathy Using Fractal Dimensions of EEG Sub-Bands. In Proceedings of the 2018 IEEE Recent Advances in Intelligent Computational Systems (RAICS), Thiruvananthapuram, India, 6–8 December 2018; pp. 94–97. [Google Scholar]
  52. Pavithra, M.; NiranjanaKrupa, B.; Sasidharan, A.; Kutty, B.M.; Lakkannavar, M. Fractal dimension for drowsiness detection in brainwaves. In Proceedings of the 2014 International Conference on Contemporary Computing and Informatics (IC3I), Mysore, India, 27–29 November 2014; pp. 757–761. [Google Scholar]
  53. Liu, Y.; Sourina, O. EEG-based subject-dependent emotion recognition algorithm using fractal dimension. In Proceedings of the 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC), San Diego, CA, USA, 5–8 October 2014; pp. 3166–3171. [Google Scholar]
  54. Gao, Q.; Wang, C.-h.; Wang, Z.; Song, X.-l.; Dong, E.-z.; Song, Y. EEG based emotion recognition using fusion feature extraction method. Multimed. Tools Appl. 2020, 79, 27057–27074. [Google Scholar] [CrossRef]
  55. Zong, J.; Xiong, X.; Zhou, J.; Ji, Y.; Zhou, D.; Zhang, Q. FCAN–XGBoost: A Novel Hybrid Model for EEG Emotion Recognition. Sensors 2023, 23, 5680. [Google Scholar] [CrossRef]
  56. Ma, J.; Tang, H.; Zheng, W.L.; Lu, B.L. Emotion Recognition using Multimodal Residual LSTM Network. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019. [Google Scholar]
  57. Zhong, P.; Wang, D.; Miao, C. EEG-Based Emotion Recognition Using Regularized Graph Neural Networks. IEEE Trans. Affect. Comput. 2022, 13, 1290–1301. [Google Scholar] [CrossRef]
  58. Schmidt, L.A.; Trainor, L.J. Frontal brain electrical activity (EEG) distinguishes valence and intensity of musical emotions. Cogn. Emot. 2001, 15, 487–500. [Google Scholar] [CrossRef]
  59. Lin, Y.P.; Wang, C.H.; Jung, T.P.; Wu, T.L.; Jeng, S.K.; Duann, J.R.; Chen, J.H. EEG-Based Emotion Recognition in Music Listening. IEEE Trans. Biomed. Eng. 2010, 57, 1798–1806. [Google Scholar] [CrossRef] [PubMed]
  60. Costa, T.; Rognoni, E.; Galati, D. EEG phase synchronization during emotional response to positive and negative film stimuli. Neurosci. Lett. 2006, 406, 159–164. [Google Scholar] [CrossRef]
  61. Mattavelli, G.; Rosanova, M.; Casali, A.G.; Papagno, C.; Lauro, L.J.R. Timing of emotion representation in right and left occipital region: Evidence from combined TMS-EEG. Brain Cogn. 2016, 106, 13–22. [Google Scholar] [CrossRef]
  62. Javaid, M.M.; Yousaf, M.A.; Sheikh, Q.Z.; Awais, M.M.; Saleem, S.; Khalid, M. Real-Time EEG-based Human Emotion Recognition. In Proceedings of the Neural Information Processing; Springer International Publishing: Cham, Switzerland, 2015; pp. 182–190. [Google Scholar]
  63. Zhang, J.; Zhang, X.Y.; Chen, G.J.; Yan, C. EEG emotion recognition based on the 3D-CNN and spatial-frequency attention mechanism. J. Xidian Univ. 2022, 49, 191–198. [Google Scholar] [CrossRef]
  64. Sun, J.; Wang, X.; Zhao, K.; Hao, S.; Wang, T. Multi-Channel EEG Emotion Recognition Based on Parallel Transformer and 3D-Convolutional Neural Network. Mathematics 2022, 10, 3131. [Google Scholar] [CrossRef]
  65. Xin, R.; Miao, F.; Cong, P.; Zhang, F.; Xin, Y.; Feng, X. Multiview Feature Fusion Attention Convolutional Recurrent Neural Networks for EEG-Based Emotion Recognition. J. Sens. 2023, 2023, 9281230. [Google Scholar] [CrossRef]
  66. Cui, D.; Xuan, H.; Liu, J.; Gu, G.; Li, X. Emotion Recognition on EEG Signal Using ResNeXt Attention 2D-3D Convolution Neural Networks. Neural Process. Lett. 2023, 55, 5943–5957. [Google Scholar] [CrossRef]
  67. Du, X.; Meng, Y.; Qiu, S.; Lv, Y.; Liu, Q. EEG Emotion Recognition by Fusion of Multi-Scale Features. Brain Sci. 2023, 13, 1293. [Google Scholar] [CrossRef] [PubMed]
  68. Hou, F.; Liu, J.; Bai, Z.; Yang, Z.; Liu, J.; Gao, Q.; Song, Y. EEG-Based Emotion Recognition for Hearing Impaired and Normal Individuals with Residual Feature Pyramids Network Based on Time–Frequency–Spatial Features. IEEE Trans. Instrum. Meas. 2023, 72, 2505011. [Google Scholar] [CrossRef]
  69. Dong, H.; Zhou, J.; Fan, C.; Zheng, W.; Tao, L.; Kwan, H.K. Multi-scale 3D-CRU for EEG emotion recognition. Biomed. Phys. Eng. Express 2024, 10, 045018. [Google Scholar] [CrossRef]
  70. Du, Y.; Ding, H.; Wu, M.; Chen, F.; Cai, Z. MES-CTNet: A Novel Capsule Transformer Network Base on a Multi-Domain Feature Map for Electroencephalogram-Based Emotion Recognition. Brain Sci. 2024, 14, 344. [Google Scholar] [CrossRef]
Figure 1. Overall framework and process of the proposed method.
Figure 2. Feature extraction and feature mapping.
Figure 3. Two-dimensional matrix mapping.
Figure 4. Structural diagram of CSAM.
Figure 5. Structural diagram of FSAM.
Figure 6. Weights for various channels in CSAM.
Figure 7. Heat map for each electrode position weight in CSAM.
Figure 8. Structural diagram of FTM.
Figure 9. Structural diagram of DCM.
Figure 10. Accuracy of all subjects in the DEAP dataset.
Figure 11. The confusion matrices of the CATM experiment on the DEAP dataset: (a) arousal dimension confusion matrix; (b) valence dimension confusion matrix; (c) valence–arousal dimension confusion matrix.
Figure 12. Accuracies of all subjects in different models: (a) valence dimension accuracy of all subjects in different models; (b) arousal dimension accuracy of all subjects in different models.
Figure 13. Electrode map with few channels.
Figure 14. Accuracy of the DEAP dataset across all subjects in the 5-channel experiment.
Figure 15. Accuracy of the DEAP dataset across all subjects in the 18-channel experiment.
Table 1. Acronyms, full names, and functions of each module of CATM.
Module | Full Name | Function
CSAM | Cross-scale attention module | Extracts features of different scales and assigns weights
FSAM | Frequency–space attention module | Gives higher weight to more important frequency bands and spatial locations
Bi_LSTM | Bidirectional long short-term memory | Extracts time features
DCM | Deep classification module | Classifies the features
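For readers who want a concrete reference point for the Bi_LSTM entry above, a bidirectional LSTM over the per-segment feature sequence can be written in a few lines of PyTorch. The sketch below is our own minimal illustration, not the authors' implementation; the input dimension of 464 and hidden size of 64 are assumptions chosen only to echo the shapes listed in Tables 2 and 3.

```python
import torch
import torch.nn as nn

class TemporalFeatureExtractor(nn.Module):
    """Minimal Bi-LSTM sketch: summarizes a sequence of per-segment EEG
    feature vectors into one temporal feature vector (sizes illustrative)."""
    def __init__(self, in_dim=464, hidden_dim=64):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=in_dim, hidden_size=hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, x):            # x: (batch, time_steps, in_dim)
        out, _ = self.bilstm(x)      # out: (batch, time_steps, 2 * hidden_dim)
        return out[:, -1, :]         # last time step as the temporal summary

feats = torch.randn(8, 10, 464)     # 8 samples, 10 time segments, 464-dim features
print(TemporalFeatureExtractor()(feats).shape)   # torch.Size([8, 128])
```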
Table 2. Detailed parameters of CSAM.
Layer | Layer Setting | Output
Conv2D (1 × 1) | In_features = 80; Out_features = 128; BatchNorm2d; Activation = ReLU | (128 × 128 × 8 × 9)
Conv2D (3 × 3) | In_features = 80; Out_features = 128; BatchNorm2d; Activation = ReLU | (128 × 128 × 8 × 9)
Conv2D (5 × 5) | In_features = 80; Out_features = 128; BatchNorm2d; Activation = ReLU | (128 × 128 × 8 × 9)
Max pooling (3 × 3) | In_features = 80; Out_features = 80; BatchNorm2d; Activation = ReLU | (128 × 80 × 8 × 9)
Concatenate | – | (128 × 464 × 8 × 9)
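As a rough companion to Table 2, the following PyTorch sketch reproduces only the multi-branch layout listed there (1 × 1, 3 × 3, and 5 × 5 convolutions plus a 3 × 3 max-pooling branch, concatenated to 464 channels). The cross-scale attention weighting of CSAM is intentionally omitted, and the padding choices are our assumptions to preserve the 8 × 9 spatial size.

```python
import torch
import torch.nn as nn

class CrossScaleBranches(nn.Module):
    """Sketch of the multi-branch layout in Table 2: three convolution scales
    plus a pooled branch, concatenated along the channel axis.
    (The attention weighting of CSAM is not reproduced here.)"""
    def __init__(self, in_ch=80, out_ch=128):
        super().__init__()
        def conv_branch(k):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2),
                nn.BatchNorm2d(out_ch),
                nn.ReLU())
        self.b1, self.b3, self.b5 = conv_branch(1), conv_branch(3), conv_branch(5)
        self.pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),  # keeps 8 x 9 size
            nn.BatchNorm2d(in_ch),
            nn.ReLU())

    def forward(self, x):             # x: (batch, 80, 8, 9)
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)

x = torch.randn(128, 80, 8, 9)        # batch of 80-channel 8 x 9 feature maps
print(CrossScaleBranches()(x).shape)  # torch.Size([128, 464, 8, 9])
```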
Table 3. Detailed parameters of DCM.
Layer | Layer Setting | Output
Linear1 | In_features = 128; Out_features = 64; Activation = ReLU | (128 × 64)
Dropout | p = 0.5 | –
Linear2 | In_features = 64; Out_features = 32; Activation = ReLU | (128 × 32)
Dropout | p = 0.5 | –
Linear3 | In_features = 32; Out_features = num_classes | (128 × num_classes)
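Table 3 translates almost directly into a small classifier head. The sketch below simply follows those layer settings; pairing the final logits with a cross-entropy loss is an assumption on our part, since the loss configuration is not listed in the table.

```python
import torch.nn as nn

def make_dcm(num_classes, in_dim=128, p=0.5):
    """Classifier head following the layer settings listed in Table 3."""
    return nn.Sequential(
        nn.Linear(in_dim, 64), nn.ReLU(), nn.Dropout(p),
        nn.Linear(64, 32),     nn.ReLU(), nn.Dropout(p),
        nn.Linear(32, num_classes))    # logits, e.g. for nn.CrossEntropyLoss

dcm = make_dcm(num_classes=4)          # 4 classes for valence-arousal quadrants
```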
Table 4. Experimental results of different feature fusion methods.
Feature | Accuracy % (Valence/Arousal) | Precision % (Valence/Arousal) | Recall % (Valence/Arousal) | F1-Score % (Valence/Arousal)
X_add | 98.18 / 98.59 | 98.23 / 98.62 | 98.18 / 98.59 | 98.18 / 98.59
X_mult | 89.84 / 88.79 | 90.08 / 89.16 | 89.84 / 88.79 | 89.78 / 88.40
X_con | 99.70 / 99.74 | 99.70 / 99.74 | 99.69 / 99.73 | 99.69 / 99.73
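The three fusion variants compared in Table 4 (X_add, X_mult, X_con) correspond to element-wise addition, element-wise multiplication, and channel-wise concatenation of feature tensors. A minimal sketch, assuming two same-shaped feature maps and concatenation along the channel dimension (the specific tensors and dimension are illustrative, not taken from the paper):

```python
import torch

# Two same-shaped feature tensors, e.g. (batch, channels, 8, 9); which features
# are fused and along which dimension is assumed here for illustration only.
a = torch.randn(128, 80, 8, 9)
b = torch.randn(128, 80, 8, 9)

x_add  = a + b                        # element-wise addition       (X_add)
x_mult = a * b                        # element-wise multiplication (X_mult)
x_con  = torch.cat([a, b], dim=1)     # channel-wise concatenation  (X_con)
print(x_add.shape, x_mult.shape, x_con.shape)
```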
Table 5. Experimental results with different features on the DEAP dataset.
Feature | Accuracy % (Valence/Arousal) | Precision % (Valence/Arousal) | Recall % (Valence/Arousal) | F1-Score % (Valence/Arousal)
DE | 97.39 / 97.75 | 98.18 / 98.28 | 97.25 / 97.83 | 97.6 / 97.99
PSD | 96.44 / 96.58 | 97.80 / 98.34 | 96.68 / 97.11 | 97.06 / 97.50
NE | 95.32 / 95.51 | 97.09 / 97.57 | 94.59 / 96.21 | 95.37 / 96.66
FD | 90.97 / 97.73 | 95.10 / 96.01 | 90.96 / 92.93 | 91.13 / 92.94
DE-PSD | 98.98 / 98.95 | 99.25 / 99.27 | 98.89 / 98.99 | 99.01 / 99.08
DE-NE | 98.81 / 98.72 | 99.30 / 98.85 | 98.89 / 98.52 | 99.03 / 98.63
DE-FD | 98.64 / 98.88 | 98.99 / 99.15 | 98.50 / 98.79 | 98.67 / 98.92
PSD-NE | 97.84 / 98.33 | 98.76 / 98.94 | 97.81 / 98.46 | 98.13 / 98.64
PSD-FD | 98.43 / 98.40 | 99.01 / 99.07 | 98.38 / 98.59 | 98.56 / 98.76
NE-FD | 98.29 / 98.19 | 98.59 / 98.95 | 97.85 / 97.72 | 98.12 / 97.88
DE-PSD-NE | 99.01 / 99.16 | 99.48 / 99.49 | 99.32 / 99.16 | 99.40 / 99.28
PSD-NE-FD | 98.81 / 99.13 | 99.22 / 99.43 | 98.42 / 99.16 | 98.51 / 99.25
DE-PSD-FD | 99.05 / 99.25 | 99.49 / 99.61 | 99.16 / 99.36 | 99.29 / 99.46
DE-NE-FD | 99.14 / 99.19 | 99.42 / 99.03 | 99.24 / 99.04 | 99.33 / 99.13
All Features | 99.70 / 99.74 | 99.70 / 99.74 | 99.69 / 99.73 | 99.69 / 99.73
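Of the features compared in Table 5, DE is typically computed per frequency band under a Gaussian assumption, DE = ½ ln(2πeσ²). A minimal sketch of that computation (the segment length and sampling rate in the example are illustrative, and band-pass filtering is assumed to have been applied beforehand):

```python
import numpy as np

def differential_entropy(segment):
    """DE of a band-filtered EEG segment under a Gaussian assumption:
    DE = 0.5 * ln(2 * pi * e * var(segment))."""
    return 0.5 * np.log(2.0 * np.pi * np.e * np.var(segment))

segment = np.random.randn(384)   # e.g. 3 s of one channel at 128 Hz (illustrative)
print(differential_entropy(segment))
```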
Table 6. Ablation experiment models.
Models | CSAM | FSAM | Bi-LSTM | DCM
Model 1 | × | √ | √ | √
Model 2 | √ | × | √ | √
Model 3 | √ | √ | × | √
Model 4 | √ | √ | √ | ×
In Table 6, “×” indicates that a model does not contain the module in the corresponding column, while “√” indicates that it does. For example, Model 1 is the variant without the CSAM module.
Table 7. Ablation results of different modules.
Models | Accuracy/% (Valence/Arousal) | Precision/% (Valence/Arousal) | Recall/% (Valence/Arousal) | F1-Score/% (Valence/Arousal)
Model 1 | 91.34 / 95.06 | 97.33 / 98.57 | 88.61 / 94.73 | 87.2 / 94.15
Model 2 | 98.95 / 98.97 | 99.36 / 99.27 | 98.95 / 98.98 | 98.95 / 98.97
Model 3 | 91.75 / 92.45 | 95.07 / 95.32 | 91.45 / 92.62 | 92.69 / 93.52
Model 4 | 98.92 / 98.87 | 98.93 / 98.88 | 98.92 / 98.87 | 98.92 / 98.87
CATM | 99.70 / 99.74 | 99.70 / 99.74 | 99.69 / 99.73 | 99.69 / 99.73
Table 8. Impact of replacing CSAM with single-scale convolution on model performance.
Kernel Size | Accuracy % (Valence/Arousal) | Precision % (Valence/Arousal) | Recall % (Valence/Arousal) | F1-Score % (Valence/Arousal)
1 | 88.25 / 93.73 | 93.47 / 96.30 | 88.25 / 93.73 | 84.44 / 91.77
3 | 95.12 / 97.22 | 97.05 / 98.24 | 95.12 / 97.22 | 93.66 / 96.42
5 | 97.63 / 98.52 | 98.50 / 98.75 | 97.63 / 98.52 | 96.98 / 98.32
Table 9. Results of 5 channels on the DEAP dataset.
Dimension | Accuracy/% | Precision/% | Recall/% | F1-Score/%
Valence | 97.96 | 98.01 | 97.96 | 97.95
Arousal | 98.11 | 98.17 | 98.11 | 98.10
V-A | 92.86 | 94.12 | 92.86 | 91.95
Table 10. Results of 18 channels on the DEAP dataset.
Dimension | Accuracy/% | Precision/% | Recall/% | F1-Score/%
Valence | 99.59 | 99.61 | 99.59 | 99.59
Arousal | 99.53 | 99.54 | 99.53 | 99.53
V-A | 94.57 | 95.98 | 94.57 | 93.50
Table 11. Recent performance comparison of different methods.
Model | Feature | Dataset | Valence | Arousal | V-A | Year
FSA-3D-CNN | DE | DEAP | 95.87% | 95.23% | 94.53% | 2022
TSFFN | Baseline removal | DEAP | 98.27% | 98.53% | - | 2022
Multi-aCRNN | DE | DEAP | 96.30% | 96.43% | - | 2023
RA2-3DCNN | Baseline removal | DEAP | 97.58% | 97.19% | - | 2022
MDCNAResnet | DE | DEAP | 98.63% | 98.89% | - | 2023
BiTCAN | Baseline removal | DEAP | 98.46% | 97.65% | - | 2023
RFPN–S2D–CNN | DE | DEAP | 96.89% | 96.82% | 93.56% | 2023
FCAN–XGBoost | DE, PSD | DEAP | - | - | 95.26% | 2023
Multi-scale 3D-CRU | DE | DEAP | 93.12% | 94.31% | - | 2024
MES-CTNet | DE, PSD, SE | DEAP | 98.31% | 98.28% | - | 2024
Ours | DE, PSD, NE, FD | DEAP | 99.70% | 99.74% | 97.27% | 2024
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
