Article

A Research on Emotion Recognition of the Elderly Based on Transformer and Physiological Signals

College of Mechanical and Electrical Engineering, Northeast Forestry University, Harbin 150040, China
*
Author to whom correspondence should be addressed.
Electronics 2024, 13(15), 3019; https://doi.org/10.3390/electronics13153019
Submission received: 8 June 2024 / Revised: 20 July 2024 / Accepted: 30 July 2024 / Published: 31 July 2024
(This article belongs to the Special Issue Applied AI in Emotion Recognition)

Abstract

To address the difficulty of recognizing emotions in the elderly and the inability of traditional machine-learning models to effectively capture the nonlinear relationships in physiological signal data, a Recurrence Map (RM) combined with a Vision Transformer (ViT) is proposed to recognize the emotions of the elderly based on Electroencephalogram (EEG), Electrodermal Activity (EDA), and Heart Rate Variability (HRV) signals. The Dung Beetle Optimizer (DBO) is used to optimize the Variational Mode Decomposition of the EEG, EDA, and HRV signals. The decomposed time-series signals are converted into two-dimensional images using the RM, and the converted images are then fed into the ViT for the study of emotion recognition in the elderly. The ViT weights pre-trained on the ImageNet-22k dataset are loaded into the model, which is retrained with the two-dimensional image data. The model is validated and compared using the test set. The results show that the recognition accuracy of the proposed method on EEG, EDA, and HRV signals is 99.35%, 86.96%, and 97.20%, respectively. This indicates that EEG signals best reflect the emotional state of the elderly, followed by HRV signals, while EDA signals perform worst. Compared with Support Vector Machine (SVM), Naive Bayes (NB), and K-Nearest Neighbors (KNN), the recognition accuracy of the proposed method is higher by at least 9.4%, 11.13%, and 12.61%, respectively. Compared with ResNet34, EfficientNet-B0, and VGG16, it is higher by at least 1.14%, 0.54%, and 3.34%, respectively. This demonstrates the superiority of the proposed method for emotion recognition in the elderly.

1. Introduction

Emotion recognition is a technique that identifies an individual’s emotional state by analyzing physiological signals or facial expressions. It has widespread applications in human–computer interactions, intelligent education, and smart healthcare. As the proportion of elderly individuals in China’s population continues to rise, addressing the physical and mental health of the elderly has become a pressing social issue [1]. However, research on emotion recognition specifically tailored for the elderly population remains scarce. While physical health issues in the elderly are relatively easy to detect, emotional problems often go unnoticed. Imbalanced emotional states in the elderly can lead to increased risks of various health conditions [2]. For instance, when elderly individuals experience intense negative emotions, their bodies secrete higher levels of adrenaline and cortisol compared to those in a positive emotional state. This imbalance accelerates physical decline and aging [3]. Additionally, heightened blood flow due to extreme emotional states increases the risk of cardiovascular diseases [4]. Due to the challenges posed by an aging society, accurately and rapidly identifying the emotional states of elderly individuals has become an urgent problem.
Traditional emotion recognition methods primarily rely on physiological signals and employ machine-learning algorithms such as Support Vector Machines (SVM) [5] and K-Nearest Neighbors (KNN) [6]. However, these algorithms struggle to effectively capture the high-dimensional and nonlinear relationships present in physiological signal data, resulting in suboptimal accuracy. Deep learning, as proposed by Hinton et al. [7], offers a powerful solution due to its flexible model structures and strong representation capabilities. Deep learning’s distributed representations enable better generalization, allowing it to capture shared intrinsic features in complex physiological signal data across different contexts. Common deep-learning methods include Convolutional Neural Networks (CNNs) [8], Auto-Encoders (AEs) [9], and Transformers [10]. Among these, Transformer-based visual models have found widespread use in tasks such as object detection [11], image segmentation [12,13], image generation [14], and image captioning [15]. For instance, Li et al. [16] employed a top-k sparse Transformer model combined with multivariate empirical mode decomposition and canonical correlation analysis to achieve 95.2% recognition accuracy on a gesture-based Electroencephalogram (EEG) dataset. Similarly, Liu et al. [17] enhanced traditional Transformer models by incorporating fragment reuse mechanisms and relative positional encoding of previous segments, achieving classification accuracy and kappa values of 94.27% and 87.34%, respectively, on the BCI Competition 2008-Graz Dataset A for motor imagery EEG classification.
In this study, we collected EEG, Electrodermal Activity (EDA) and Heart Rate Variability (HRV) signals from elderly individuals. We optimized the number of modes (k) and penalty coefficient (α) in Variational Mode Decomposition (VMD) using the Dung Beetle Optimizer (DBO) [18]. After optimization, we transformed the decomposed signals into image data using a Recurrence Map (RM). Subsequently, we fed the three types of transformed image data into a Vision Transformer (ViT) for feature extraction and model training. Finally, we compared the recognition performance of the three signal types using a test dataset and validated the superiority of our proposed method. Our research aims to provide insights into emotion recognition for elderly individuals.

2. Experimental Data Acquisition and Pre-Processing

2.1. Participants and Experimental Apparatus

Participants: A total of 14 elderly people over 60 years of age were recruited, including eight males and six females. All participants had normal vision, hearing, and overall physical and mental health. They had no history of psychiatric or neurological disorders. Informed consent was obtained from each participant, who voluntarily agreed to participate in the experiment. None of the participants had taken any medication in the week leading up to the study, and they refrained from vigorous physical activity for 4 h before the experiment. Before the start of the study, all participants were thoroughly briefed on the experimental procedures. The experimental setup is depicted in Figure 1.
Experimental Instruments: The BitBrain EEG Semi-Dry Electrode Wearable Brainwave Monitor was used for recording brain electrical activity. The ErgoLAB EDA Wireless Skin Conductance Sensor measured electrodermal activity (skin conductance). The ErgoLAB HRV Wireless Pulse Sensor monitored the pulse rate. The ErgoLAB Human-Machine Synchronization Platform V3.0 provided a synchronized environment for data collection and analysis. All the above instruments were provided by Beijing Jinfatech Co., Ltd., Beijing, China. The overall setup with participants wearing the instruments is depicted in Figure 2.

2.2. Stimulation Methods

The video-evoked method was chosen for this study. Various film clips were used to evoke emotional states in the participants. Drawing inspiration from the Database for Emotion Analysis Using Physiological Signals (DEAP) dataset and the SJTU Emotion EEG Dataset (SEED), six suitable video segments were selected. The emotions expected to be induced in the participants included, but were not limited to, happiness, sadness, fear, and disgust. When defining emotion labels, this study adopted a dimensional theory approach. Two commonly used dimensions for quantifying human emotions are valence, which reflects the positive or negative nature of an emotion, and arousal, which indicates the level of excitement during a specific state. Following the dimensional theory of emotion, the rating scale ranges from 1 to 9. Scores above 5 were categorized as the high-level group, while scores below 5 fell into the low-level group. Three trained experimenters watched the six movie clips and then scored them. The three experimenters' scores fell within the same range, indicating that the selected movie clips could successfully induce the target emotions. The six movie clips induced four emotional states: high arousal high valence, high arousal low valence, low arousal high valence, and low arousal low valence. These states were labeled as 0, 1, 2, and 3, respectively.

2.3. Experimental Procedure

Participants were fitted with the EEG device, electrodermal sensors, and wireless pulse sensors as required, and the experiment was then started. Participants were asked to relax calmly and enter the resting-state data-acquisition phase. After two minutes, they entered the experimental task state, in which they watched a series of movie clips and experienced the emotions depicted. After each movie clip, participants recorded, according to their genuine feelings, the time points of the scenes that affected them most deeply, and the corresponding EEG, EDA, and HRV data were saved. The order of the six movie clips was randomized. At the end of the experiment, each participant was asked about their state to determine whether the intended emotion had been successfully induced, ensuring the validity of the induced emotion. Emotions were successfully induced in all 14 participants.

2.4. Experimental Data Acquisition

Based on the video-evoked emotion method, physiological signals were acquired using EEG, piezoelectric sensors, and pulse sensors. The acquired physiological signals were then input into ErgoLAB to eliminate noise and redundant signals.

2.4.1. EEG Signal Acquisition

EEG is a signal that measures the electrical activity of the brain by placing electrodes on the scalp to record changes in the electrical potentials generated by neuronal activity. EEG measures the weak electrical signals produced by neuronal discharges and can be used to study and diagnose different brain functions and states, such as sleep, consciousness, emotions, cognitive processes, and brain diseases. The EEG acquisition device used was the BitBrain EEG semi-dry electrode wearable EEG device provided by Beijing Jinfa Technology Co. The BitBrain semi-dry electrode headset is a wearable device, as shown in Figure 3, which uses non-invasive semi-dry electrode technology capable of measuring the electrical activity of the brain. This device records and analyzes EEG signals, tracking brain activity and providing data on attention, stress, mood, and cognitive state. It can be used in various fields such as research, healthcare, sports training, and brain-machine interfaces. The EEG signals were collected from eight locations on the subject’s head: Fpz, F3, Fz, F4, P3, P4, O1, and O2. The impedance of the electrodes was kept below 5 kΩ, with a sampling frequency of 256 Hz.

2.4.2. EDA Signal Acquisition

EDA refers to a series of electrophysiological responses that occur in the palms of the hands, the soles of the feet, and the toes. EDA signals reflect the somatic and emotional states of the body, including agitation, anxiety, and stress. The EDA signal-acquisition device used was the ErgoLAB EDA wireless electrodermal sensor provided by Beijing Jinfa Technology Co. The ErgoLAB EDA wireless electrodermal sensor measures the skin’s electrical response. The two electrodes of the sensor were fixed to the pads of the index and middle fingers of the subject’s right hand, and the sensor was attached snugly to the right wrist to ensure it would not slip off due to the subject’s natural slight movements during the experiment. The sampling frequency was 64 Hz. The ErgoLAB EDA wireless electrodermal sensor was worn as shown in Figure 4.

2.4.3. HRV Signal Acquisition

HRV is a measure that describes how much the heart rate varies at different points in time. Heart rate is the number of times the heart beats per minute, while HRV is the variation of heart rate intervals at different points in time. Normally, the heart rate is uneven, i.e., there is some fluctuation. HRV assesses the stability of the autonomic nervous system and the health of the heart, particularly the balance between the sympathetic and parasympathetic nervous systems, primarily by analyzing the differences in the intervals between heartbeats. HRV signals can be obtained from measurements made by devices such as electrocardiograms or HRV meters. Although HRV is a relatively complex indicator, it has a wide range of applications in the fields of cardiovascular disease, exercise training, and stress management. The HRV signal-acquisition device used was the ErgoLAB HRV Wireless Pulse Sensor provided by Beijing Jinfa Technology Co. The ErgoLAB HRV Wireless Pulse Sensor measures heart rate and oxygen saturation. It uses the principle of photoelectric measurement to detect the pulse waveform in the bloodstream through a red LED and a photosensitive sensor, providing values for heart rate and oxygen saturation. The sensor, worn as shown in Figure 5, is compactly designed and wirelessly connected, making it suitable for various health-monitoring applications, such as fitness tracking, disease management, and medical research.

3. Algorithms and Models

3.1. Recurrence Map

RM is a common processing method in the field of nonlinear signal research and is widely used in bearing fault diagnosis and other fields. RM uses the idea of phase-space reconstruction to map time-series data onto a two-dimensional image. This method reveals the internal structure of the time-series data and provides a priori knowledge about predictability and similarity, offering significant advantages in analyzing time-series data.
For a given time series X = [x1, x2, …, xi, …, xn], phase-space reconstruction is performed:
X_i = \{ x_i, x_{i+\tau}, \ldots, x_{i+(m-1)\tau} \}
where i = 1, 2, …, n − (m − 1)τ, m is the embedding dimension, and τ denotes the lag length.
Calculate the distance between two points after phase-space reconstruction:
s_{ij} = \left\| X_i - X_j \right\|
where Xi and Xj are points in the reconstructed phase space, and ‖·‖ denotes the Euclidean distance.
Based on a selected threshold ε, the difference ε − sij is computed and passed to the Heaviside function to obtain the value of each element of the recurrence matrix, as shown in Equation (3):
R_{ij} = \mathrm{Heaviside}(\varepsilon - s_{ij}), \qquad \mathrm{Heaviside}(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases}
where Rij takes the value 1 or 0. A value of 1 indicates that the distance between two points in the reconstructed phase space is below the threshold, meaning the two points have a recursive relationship; a value of 0 indicates that the distance exceeds the threshold, meaning no recursive relationship exists. The original time-series data therefore generate color images after recurrence processing, which retain the feature information of the complete time series.
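The recurrence-matrix construction in Equations (1)–(3) can be expressed compactly. Below is a minimal NumPy sketch, assuming a univariate signal and illustrative default values for the embedding dimension m, the lag τ, and the threshold ε (none of these defaults come from the paper); the function and variable names are hypothetical.

```python
import numpy as np

def recurrence_matrix(x, m=3, tau=1, eps=None):
    """Map a 1-D time series to a binary recurrence matrix (Equations (1)-(3))."""
    x = np.asarray(x, dtype=float)
    n_vectors = len(x) - (m - 1) * tau
    # Phase-space reconstruction: row i is X_i = [x_i, x_{i+tau}, ..., x_{i+(m-1)tau}]
    X = np.stack([x[i:i + (m - 1) * tau + 1:tau] for i in range(n_vectors)])
    # Pairwise Euclidean distances s_ij = ||X_i - X_j||
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Illustrative default threshold: a fraction of the maximum distance
    if eps is None:
        eps = 0.1 * d.max()
    # Heaviside(eps - s_ij): 1 where two reconstructed states are closer than the threshold
    return (d <= eps).astype(np.uint8)

# Usage: one normalized signal sample becomes a 2-D array that can be rendered as an image
rm_image = recurrence_matrix(np.sin(np.linspace(0, 20, 200)), m=3, tau=2)
```

In practice the resulting matrix (or the unthresholded distance matrix) is rendered with a colormap to obtain the two-dimensional images fed to the ViT.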

3.2. ViT

The Transformer was first proposed in 2017 by Vaswani et al. at Google. It is distinguished from previous RNN and CNN networks by its lack of recurrent and convolutional structures, relying entirely on a self-attention mechanism. This approach has achieved state-of-the-art results in Natural Language Processing, Computer Vision, and multi-modal tasks. The Transformer network is mainly composed of an encoder and a decoder. The encoder, similar to a convolutional layer, extracts features from the input data, while the decoder converts the extracted features into the output. The ViT architecture is shown in Figure 6. In the ViT, only the encoder part is used, with positional encoding performed before the encoder. The decoder part is replaced by a fully connected layer.

3.2.1. Positional Encoding

For the standard Transformer, the input data must be a sequence of vectors (tokens), i.e., a two-dimensional matrix ([num_token, token_dim]). In the case of ViT-B/16, each token has a length of 768. Image data, which have size [H, W, C], must therefore be transformed into such a two-dimensional matrix by the Embedding layer. First, the input image [H × W] is divided into patches of size 16 × 16, yielding HW/256 patches. Each patch is then mapped to a token by a linear mapping, giving a token of length 768 and, in total, a two-dimensional matrix of size [HW/256, 768]. Finally, a trainable category (class) token of size [1, 768] is concatenated to obtain a two-dimensional matrix of size [HW/256 + 1, 768] as the input data. The positional-encoding process is shown in Figure 7.
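The steps above map directly onto a few tensor operations. The following PyTorch sketch illustrates them with ViT-B/16 sizes; the module and parameter names are illustrative, and the per-patch linear mapping is realized with a strided convolution, a common equivalent formulation.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into 16x16 patches, map each to a 768-dim token, and add the class token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2                      # HW / 256
        # Strided convolution == per-patch linear projection to embed_dim
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))      # trainable [1, 768] category token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))  # positional encoding

    def forward(self, x):                                                # x: [B, C, H, W]
        x = self.proj(x).flatten(2).transpose(1, 2)                      # [B, HW/256, 768]
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)                                   # [B, HW/256 + 1, 768]
        return x + self.pos_embed                                        # add positional encoding
```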

3.2.2. Encoders

The Transformer Encoder performs feature extraction by repeatedly stacking Encoder Blocks (12 times in ViT-B/16), as shown in Figure 8. The incoming data are first normalized by a Layer Norm layer and then processed by the Multi-Head Attention layer, which allows the model to attend to information from different positions and subspaces and to assign appropriate weights. Next, a DropPath layer discards some redundant information to prevent overfitting, and the output is added to the original input through a shortcut connection. The sum serves as the input to the next Layer Norm, whose output is passed through the MLP Block to enhance the model’s expressive ability. The MLP Block consists of two linear layers, a Dropout layer, and a GELU activation function. The result passes through a DropPath layer once more and is again added to its input through a shortcut connection. This is the overall architecture of an Encoder Block.
The Multi-Head Attention layer, a key component of the ViT architecture, operates as follows:
For a sequence of i input nodes x1, x2, …, xi, the inputs are first mapped to a1, a2, …, ai using f(x). Each ai is then multiplied by the three trainable transformation matrices Wq, Wk, and Wv to obtain the corresponding qi, ki, and vi, respectively, as shown in Equation (4):
q_i = a_i W^{q}, \quad k_i = a_i W^{k}, \quad v_i = a_i W^{v}
where qi is the query vector, which is matched against the corresponding ki, and vi represents the information extracted from ai. Matching qi with ki computes the correlation between the two: the greater the correlation, the larger the weight assigned to the corresponding vi. In the ViT, the correlation weights are computed as a scaled dot product, as shown in Equation (5):
\mathrm{weight}(q_t, k_i) = \frac{q_t^{T} k_i}{\sqrt{d_k}}
where dk denotes the vector length. The scaling appropriately adjusts the magnitude of the dot products so that the gradients do not become too small after the subsequent normalization, which would hinder network training. The weights weight(qt, ki) are normalized by the softmax function. In summary, the self-attention mechanism can be expressed as a matrix multiplication, as shown in Equation (6):
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V
Multi-Head Attention is similar to Self-Attention, as shown in Figure 9. The input ai is passed through Wq, Wk, and Wv to obtain the corresponding qi, ki, and vi, which are then split into h parts according to the number of heads h. Each head is computed in the same way as in Equation (6); the h heads are then concatenated, and the final output is obtained by matrix multiplication with the trainable matrix WO, as shown in Equation (7):
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^{O}, \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})
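Equations (4)–(7) correspond to a compact module. The following PyTorch sketch implements multi-head scaled dot-product self-attention with ViT-B/16 sizes (768-dimensional tokens, 12 heads); the class and variable names are illustrative, and dropout is omitted for brevity.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product attention with h heads (Equations (4)-(7))."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads                             # d_k per head
        self.qkv = nn.Linear(dim, dim * 3)                           # W^q, W^k, W^v fused into one projection
        self.proj = nn.Linear(dim, dim)                              # W^O

    def forward(self, x):                                            # x: [B, N, dim]
        B, N, dim = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]                             # each: [B, heads, N, head_dim]
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5      # Q K^T / sqrt(d_k), Equation (5)
        attn = attn.softmax(dim=-1)                                  # normalize the weights
        out = (attn @ v).transpose(1, 2).reshape(B, N, dim)          # weighted values, heads concatenated
        return self.proj(out)                                        # multiply by W^O, Equation (7)
```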

3.2.3. MLP Head

The shape of the output token remains unchanged after the Transformer Encoder. At this stage, the category token is extracted, and the final classification result is obtained using an MLP Head, consisting of a Linear layer, a tanh activation function, and another Linear layer.
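As a point of reference, the classification head described above amounts to only a few layers. The sketch below shows one way to write it, assuming 768-dimensional tokens and the four emotion classes used in this study; the hidden width of the first Linear layer is an assumption, as the paper does not specify it.

```python
import torch.nn as nn

# MLP Head applied to the extracted category token: Linear -> tanh -> Linear (4 emotion classes)
mlp_head = nn.Sequential(
    nn.Linear(768, 768),   # hidden width assumed equal to the token dimension
    nn.Tanh(),
    nn.Linear(768, 4),     # logits for the 4 arousal/valence classes
)
```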

4. Model Construction and Result Analysis

4.1. Acquisition Signal Analysis

The physiological signals captured from participants under different emotions were segmented based on the time nodes of emotional evocations recorded during various movie clips. Figure 10, Figure 11 and Figure 12 present schematic diagrams of different physiological signals from one subject under various emotions. As shown in these figures, the fluctuation effect of the EEG signals was most pronounced under different emotions, indicating that EEG signals were more responsive to the participant’s emotional changes. In contrast, the schematic diagrams of the EDA signals under high arousal low valence, low arousal high valence, and low arousal low valence conditions were more similar, making it difficult to intuitively judge the participant’s emotions.

4.2. Feature Extraction

To make the acquired physiological signals more reflective of their internal characteristics, this study uses VMD [19,20] to decompose the signals. In the VMD algorithm, the number of modes k and the penalty factor α are two crucial hyperparameters. A value of k that is too small results in the loss of original signal features, while a value that is too large leads to frequency aliasing. The penalty factor α determines the bandwidth of each Intrinsic Mode Function (IMF) after VMD decomposition. To this end, DBO is used to optimize the number of modes k and the penalty factor α of the VMD, where the search range of k is set to [3, 12], the search range of α is set to [500, 3000], and the fitness function is the entropy H(x) = -\sum_i p(x_i) \log_2 p(x_i). The fitness curves of the three different physiological signals optimized by DBO for VMD decomposition are shown in Figure 13. As seen in Figure 13, the fitness values of the three physiological signals converge to their minima after 18 iterations, indicating that the optimum has been reached. The IMF decomposition diagrams are shown in Figure 14; the optimized numbers of modes for the three physiological signals are 3, 4, and 8, respectively.
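To make the optimization step concrete, the sketch below evaluates the entropy-based fitness for candidate (k, α) pairs, assuming the vmdpy package for VMD. The paper's DBO search is replaced here by a simple random search for brevity, and the choice to take the minimum entropy over the IMFs is an assumption, since the paper does not state how the entropy of a decomposition is aggregated; all names are illustrative.

```python
import numpy as np
from vmdpy import VMD   # pip install vmdpy; VMD(f, alpha, tau, K, DC, init, tol) -> (u, u_hat, omega)

def shannon_entropy(sig):
    """H(x) = -sum_i p(x_i) log2 p(x_i), with p taken as the normalized signal magnitude."""
    p = np.abs(sig) / np.sum(np.abs(sig))
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def fitness(signal, k, alpha):
    """Fitness of one (k, alpha) candidate: minimum entropy over the decomposed IMFs (assumed aggregation)."""
    imfs, _, _ = VMD(signal, alpha, 0.0, k, 0, 1, 1e-7)
    return min(shannon_entropy(imf) for imf in imfs)

def search_vmd_params(signal, n_candidates=50, seed=0):
    """Stand-in for the DBO search: sample (k, alpha) in [3, 12] x [500, 3000] and keep the best."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_candidates):
        k = int(rng.integers(3, 13))
        alpha = float(rng.uniform(500, 3000))
        f = fitness(signal, k, alpha)
        if best is None or f < best[0]:
            best = (f, k, alpha)
    return best   # (fitness value, k, alpha)
```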
The decomposed data of three different physiological signals were saved separately. Each sample has a length of 1920, with a sampling step of 20. Inverse tangent normalization was applied to each sample, as shown in Equation (8). The normalized signal samples were encoded and processed with RM to generate 2D image data. For each film clip, 100 samples were taken for each physiological signal. The 2D image data were labeled and divided into training, validation, and test sets in a 6:2:2 ratio. The schematic diagrams of the three different physiological signals after RM transformation are shown in Figure 15.
z_i = \frac{2}{\pi} \arctan(\theta x_i)
where zi is the normalized data and θ is an adjustment parameter.
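The segmentation and normalization described above are straightforward to reproduce. The following NumPy sketch cuts a decomposed signal into samples of length 1920 with a step of 20 and applies Equation (8); the default θ = 1.0 is an assumed value, since the paper does not report it.

```python
import numpy as np

def sliding_windows(signal, length=1920, step=20):
    """Cut a 1-D signal into overlapping samples of length 1920 with step 20 (Section 4.2)."""
    signal = np.asarray(signal, dtype=float)
    return np.stack([signal[i:i + length]
                     for i in range(0, len(signal) - length + 1, step)])

def arctan_normalize(x, theta=1.0):
    """Equation (8): z_i = (2 / pi) * arctan(theta * x_i); theta is an assumed adjustment parameter."""
    return (2.0 / np.pi) * np.arctan(theta * np.asarray(x, dtype=float))

# Usage: each normalized window is then converted to a 2-D image with the recurrence map
samples = arctan_normalize(sliding_windows(np.random.randn(10000)))
```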

4.3. Model Construction

To accelerate network training, transfer learning was used to load pre-trained weights from the ImageNet-22k dataset. The number of iterations was set to 30, and the cross-entropy loss function, commonly used for multi-class classification, was selected. To improve the efficiency of memory alignment and matrix multiplication, the training batch size was set to a power of 2, specifically 8 in this study. A larger learning rate would have made the training process difficult to converge, while a smaller learning rate could have caused the network to fall into a local optimum; therefore, the initial learning rate was set to 0.001, with a cosine annealing strategy for learning-rate decay. This approach ensured a rapid decrease in the loss early in training and accurate optimization in the later stages. Stochastic Gradient Descent (SGD), a mainstream optimization method, was chosen as the optimizer. Specific model parameter settings are shown in Table 1.
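A minimal PyTorch sketch of this training configuration is given below. The timm model name, the SGD momentum value, and the placeholder data loader are assumptions for illustration; in the study, the inputs are the RM images of each physiological signal split 6:2:2 into training, validation, and test sets.

```python
import torch
import torch.nn as nn
import timm   # assumed dependency providing ViT-B/16 with pre-trained weights

# Illustrative model choice; the paper loads ImageNet-22k pre-trained ViT weights and retrains on RM images
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=4)

# Placeholder data; in practice, RM images and their emotion labels are loaded here (batch size 8)
dummy = torch.utils.data.TensorDataset(torch.randn(16, 3, 224, 224), torch.randint(0, 4, (16,)))
train_loader = torch.utils.data.DataLoader(dummy, batch_size=8, shuffle=True)

criterion = nn.CrossEntropyLoss()                                             # multi-class cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)       # momentum is an assumed value
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)   # cosine annealing over 30 epochs

for epoch in range(30):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```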

4.4. Analysis and Validation of Results

4.4.1. Ablation Experiments

To verify the effectiveness of DBO and VMD, the training and validation sets of three different physiological signals were used as input samples in this study and analyzed using ablation experiments. The network using only RM and ViT is denoted as Base; the network with VMD is denoted as VBase, where the number of modes k is set to 5 and the penalty factor α is set to 2000; and the network with DBO-optimized VMD is denoted as DVBase. The training loss and training accuracy for the three different physiological signals are shown in Figure 16 and Figure 17, respectively. As seen in Figure 16, the losses of the three different physiological signals show a decreasing trend as the number of iterations increases, and all converge within 30 epochs. During training, all three networks perform best when the input data are EEG signals. The loss of the training and validation sets under DVBase is the lowest compared to VBase and Base, with values of 0.042 and 0.003 at the 30th epoch, respectively. When the input data are EDA and HRV signals, the losses of both VBase and Base increase and fluctuate to varying degrees, which is particularly noticeable with EDA signals. The training and validation set losses under Base reached more than 0.3, likely due to older participants having looser skin and more interference in the original signals extracted, making the network training effect more dependent on feature signal extraction. As seen in Figure 17, regardless of which physiological signal data are used as input, the training effect of the DVBase network is better than that of VBase and Base. The accuracy under DVBase is close to 1 in the validation set when the input data are EEG signals at Epoch 3, and it remains high until the end of the training. The accuracy of both the training and validation sets under DVBase remains above 90%, even with EDA signals, which are shown to be ineffective by VBase and Base. In summary, the superiority of DBO and VMD is evident.

4.4.2. Validation of Test-Set Results

To verify the superiority of the ViT for emotion recognition in the elderly, the network weight parameters corresponding to the lowest validation-set loss within 30 epochs were saved for the EEG, EDA, and HRV signals, respectively. The test-set data were then imported to evaluate the training effect. Table 2 lists the test-set accuracies of the ViT for the three different physiological signals, and the corresponding confusion matrices are shown in Figure 18.
From Table 2, it could be seen that the test set accuracies for EEG, EDA, and HRV signals were 99.35%, 86.96%, and 97.20%, respectively. This indicated that the method of decomposing the three physiological signals after optimizing the VMD using DBO, transforming the signals using RM, and inputting them into the ViT was effective.
As shown in Table 2 and Figure 18, the accuracy of emotion recognition based on EEG signals was the highest. The reason is that EEG signals originate from the central nervous system, which is closely related to emotion. Compared with other physiological signals, more features can be extracted from EEG signals, including time-domain, frequency-domain, time-frequency-domain, and spatial-domain features, among others.

4.4.3. Comparison of RM and Other Image-Modal Effects

Different methods of image-modal transformation affected the accuracy of emotion recognition in older adults. To investigate the advantages and disadvantages of RM compared to other methods, this study compared Gramian Angular Difference Fields (GADF), Gramian Angular Summation Fields (GASF), and Markov Transition Fields (MTF) with RM. All three physiological signals were processed with a sample length of 1920 and a sampling step of 20. They were all trained with the same hyperparameters, and the test set comparison results are shown in Table 3.
As can be seen from Table 3, the image-modal transformation methods had a significant impact on recognition accuracy. The recognition accuracies of the signal data transformed by the RM were better than those of the other three image-modal transformation methods, achieving 99.35%, 86.96%, and 97.20% for the three physiological signals, respectively. In contrast, the recognition accuracies of the signal data transformed by the Markov Transition Fields were much lower than those of the other three methods, indicating that this method was not suitable for emotion recognition in the elderly.

4.4.4. Comparison of the Effectiveness of ViT with Machine-Learning Models

To verify that the model proposed in this study had higher prediction accuracy compared to machine-learning models [21], this study compared SVM, NB, and KNN with the method presented in this paper. The experimental results are shown in Table 4.
As can be seen from Table 4, compared with SVM, NB, and KNN, the proposed method in this paper improved the recognition accuracies on EEG, EDA, and HRV signals by at least 9.4%, 11.13%, and 12.61%, respectively. The reason for this improvement is that machine-learning models cannot adequately capture the nonlinearity in physiological signal data, whereas deep learning can achieve end-to-end mapping, which helps to address the nonlinearity problem.

4.4.5. Comparison of the Effect of ViT with Individual CNN

To further compare the effects of different deep-learning methods on emotion recognition in the elderly, this study compared ResNet34, EfficientNet-B0, and VGG16 with the method proposed in this paper. The experimental results are shown in Table 5. From Table 5, it can be seen that the proposed method was significantly better than ResNet34 and VGG16, and slightly better than EfficientNet-B0. Because the Transformer architecture is built on a self-attention mechanism and retains the relative position information between elements in the input sequence through positional encoding, it has an advantage over CNN architectures in handling temporal data, which is one of the reasons the ViT was selected for this study.

5. Conclusions

In this study, the EEG, EDA, and HRV physiological signals of the elderly were collected. The Variational Mode Decomposition was optimized using DBO. The decomposed signals were then converted into two-dimensional image data using the RM image-conversion method and input into the ViT to achieve fast and accurate recognition of the emotions of the elderly. The research conclusions are as follows:
(1)
DBO was used to optimize the Variational Mode Decomposition, effectively extracting the feature information. Compared with the baseline condition, this optimization improved the validation-set accuracy by more than 5%.
(2)
RM was used to transform the signal data into a two-dimensional image to extract the rich nonlinear information from the original signal. High accuracy was achieved by utilizing the Transformer’s sensitivity to image data. Compared to SVM, NB, and KNN methods, the recognition accuracy was improved by at least 9.4%.
(3)
In emotion recognition for the elderly, using the same data-processing method and model, the validation-set accuracy for EEG signals was close to 100%, and the test-set accuracy was 99.35%. This indicates that EEG signals can most accurately reflect the emotions of the elderly.
(4)
Compared with EEG signals, the recognition accuracies of EDA and HRV signals in the validation and test sets decreased to a certain extent. Particularly for EDA signals, the accuracy on the test set was less than 90%, making accurate classification difficult. This may be related to elderly people’s skin laxity and the difficulty in extracting feature signals, suggesting that EDA signals should not be used alone for emotion recognition. The accuracy of HRV signals on the test set was 97.20%, which is intermediate. The advantage of HRV signals over EEG signals is their ease of use during acquisition, making HRV signals a viable option when portability is a consideration.

Author Contributions

Methodology, G.F., M.W. and R.Z.; Software, H.W. and X.Z.; Validation, G.F.; Formal analysis, X.Z.; Data curation, H.W., M.W. and R.Z.; Writing—original draft, H.W.; Writing—review & editing, G.F. and M.W.; Funding acquisition, G.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

https://github.com/engege2000/data (accessed on 20 July 2024).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Baraković, S.; Akhtar, Z.; Baraković Husić, J. Editorial: Quality of Life Improvement: Smart Approaches for the Working and Aging Populations. Front. Public Health 2024, 12, 362019.
  2. Mulalint, T.; Seeherunwong, A.; Wanitkun, N.; Tongsai, S. Determinants of Continuing Mental Health Service Use Among Older Persons Diagnosed with Depressive Disorders in General Hospitals: Latent Class Analysis and GEE. BMC Health Serv. Res. 2022, 22, 899.
  3. Kempnich, C.L.; Andrews, S.C.; Fisher, F.; Wong, D.; Georgiou-Karistianis, N.; Stout, J.C. Emotion Recognition Correlates with Social-Neuropsychiatric Dysfunction in Huntington’s Disease. J. Int. Neuropsychol. Soc. 2018, 24, 417–423.
  4. Ricciardi, L.; Visco-Comandini, F.; Erro, R.; Morgante, F.; Volpe, D.; Kilner, J.; Edwards, M.J.; Bologna, M. Emotional Facedness in Parkinson’s Disease. J. Neural Transm. 2018, 125, 1819–1827.
  5. Pan, L.; Yin, Z.; She, S.; Song, A. Emotional State Recognition from Peripheral Physiological Signals Using Fused Nonlinear Features and Team-Collaboration Identification Strategy. Entropy 2020, 22, 511.
  6. Rajalakshmi, A.; Sridhar, S.S. Classification of Yoga, Meditation, Combined Yoga-Meditation EEG Signals Using L-SVM, KNN, and MLP Classifiers. Soft Comput. 2024, 28, 4607–4619.
  7. Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science 2006, 313, 504–507.
  8. Liu, Z.; Meng, L.; Zhang, X.; Fang, W.; Wu, D. Universal Adversarial Perturbations for CNN Classifiers in EEG-based BCIs. J. Neural Eng. 2022, 18, 0460a4.
  9. Song, K.; Zhou, L.; Wang, H. Deep Coupling Recurrent Auto-Encoder with Multi-Modal EEG and EOG for Vigilance Estimation. Entropy 2021, 23, 1316.
  10. Li, J.; Du, J.Q.; Zhu, Y.C.; Guo, Y.K. Survey of Transformer-Based Object Detection Algorithms. Comput. Eng. Appl. 2023, 59, 48–64.
  11. Lou, Z.; Luo, S. Vehicle Infrared Target Detection Based on YOLOX and Swin Transformer. Infrared Technol. 2022, 44, 1167–1175.
  12. Zhou, T.; Hou, S.; Lu, H.; Zhao, Y.; Dang, P.; Dong, Y. Exploring and Analyzing the Improvement Mechanism of U-Net and its Application in Medical Image Segmentation. J. Biomed. Eng. 2022, 39, 806–825.
  13. Yang, H.; Bai, Z. CoT-Transunet: Lightweight Context Transformer Medical Image Segmentation Network. Comput. Eng. Appl. 2023, 59, 218–225.
  14. Tan, X.; He, X.; Wang, Z. Text-to-Image Generation Technology Based on Transformer Cross Attention. Comput. Sci. 2022, 49, 107–115.
  15. Yli, Y.; Jin, X. Research on Image Captioning Method Based on Sparse Self-Attention Mechanism of Integrated Geometric Relationship. Appl. Res. Comput. 2022, 39, 1132–1136.
  16. Li, Z.; Zhou, Y.; Feng, W. Classification and Recognition of Gesture EEG Signals Based on the Transformer Model. Sci. Technol. Eng. 2023, 23, 2044–2050.
  17. Liu, Y.F.; Liu, H.F.; Wang, Y. A Study of Motor Imagery EEG Classification Method Based on the Improved Transformer Model. J. Metrol. 2023, 44, 1147–1153.
  18. Xue, J.; Shen, B. Dung Beetle Optimizer: A New Meta-Heuristic Algorithm for Global Optimization. J. Supercomput. 2022, 79, 7305–7336.
  19. Chen, Q.; Li, Y.Y.; Yuan, X.H. A Hybrid Method for Muscle Artifact Removal from EEG Signals. J. Neurosci. Methods 2021, 352, 109104.
  20. Peketi, S.; Dhok, S.B. Machine Learning Enabled P300 Classifier for Autism Spectrum Disorder Using Adaptive Signal Decomposition. Brain Sci. 2023, 13, 315.
  21. Zhang, R.; Zhu, Y. Predicting the Mechanical Properties of Heat-Treated Woods Using Optimization-Algorithm-Based BPNN. Forests 2023, 14, 935.
Figure 1. Diagram of the test site.
Figure 2. The overall wearing of the participant.
Figure 3. EEG-wearing renderings.
Figure 4. EDA wireless piezoelectric sensor wearing renderings.
Figure 5. Renderings of the HRV wireless pulse sensor.
Figure 6. ViT architecture diagram.
Figure 7. Diagram of the position-encoding process.
Figure 8. Diagram of the overall architecture of the Encoder Block.
Figure 9. Schematic diagram of the multi-head attention mechanism.
Figure 10. Schematic diagrams of the EEG signals under different emotions: (a) high arousal high valence; (b) high arousal low valence; (c) low arousal high valence; (d) low arousal low valence.
Figure 11. Schematic diagrams of the EDA signals under different emotions: (a) high arousal high valence; (b) high arousal low valence; (c) low arousal high valence; (d) low arousal low valence.
Figure 12. Schematic diagrams of the HRV signals under different emotions: (a) high arousal high valence; (b) high arousal low valence; (c) low arousal high valence; (d) low arousal low valence.
Figure 13. Fitness curves for three different physiological signals: (a) EEG signals; (b) EDA signals; (c) HRV signals.
Figure 14. IMF exploded view: (a) EEG signals; (b) EDA signals; (c) HRV signals.
Figure 15. Schematic diagrams of three different physiological signals transformed by RM: (a) EEG signals; (b) EDA signals; (c) HRV signals.
Figure 16. Training loss for three different physiological signals: (a) EEG signals; (b) EDA signals; (c) HRV signals.
Figure 17. Training accuracy for three different physiological signals: (a) EEG signals; (b) EDA signals; (c) HRV signals.
Figure 18. Three different physiological signal test confusion matrices: (a) EEG signals; (b) EDA signals; (c) HRV signals.
Table 1. Model parameter settings.
Parameter Name | Specific Setting
Number of iterations | 30
Loss function | Cross-entropy loss function
Training batch size | 8
Initial learning rate | 0.001
Learning-rate decay strategy | Cosine annealing
Optimization method | SGD
Table 2. Test-set accuracies of the ViT network for the three different physiological signals.
Signal Type | Accuracy/%
EEG signals | 99.35
EDA signals | 86.96
HRV signals | 97.20
Table 3. Comparison of RM and other image-modal effects.
Signal Type | RM | GASF | GADF | MTF (recognition accuracy/%)
EEG signals | 99.35 | 98.21 | 98.81 | 94.94
EDA signals | 86.96 | 86.31 | 85.42 | 80.95
HRV signals | 97.20 | 96.73 | 97.02 | 90.18
Table 4. Comparison of the effectiveness of ViT with machine-learning models.
Signal Type | SVM | NB | KNN | ViT (recognition accuracy/%)
EEG signals | 84.52 | 79.35 | 78.10 | 99.35
EDA signals | 77.56 | 75.83 | 74.35 | 86.96
HRV signals | 78.51 | 76.37 | 72.50 | 97.20
Table 5. Comparison of the effect of ViT with individual CNNs.
Signal Type | ResNet34 | EfficientNet-B0 | VGG16 | ViT (recognition accuracy/%)
EEG signals | 98.21 | 98.81 | 96.01 | 99.35
EDA signals | 84.52 | 85.12 | 82.14 | 86.96
HRV signals | 95.06 | 96.43 | 91.07 | 97.20


