Article

Emotion Classification Algorithm for Audiovisual Scenes Based on Low-Frequency Signals

1 School of Computer, Electronics and Information, Guangxi University, 100 Daxue Road, Nanning 530004, China
2 School of Computer, Electronics and Information, Guangxi Key Laboratory of Multimedia Communications and Network Technology, Guangxi University, Nanning 530004, China
3 College of Electronics and Information Engineering, Beibu Gulf University, Qinzhou 535000, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(12), 7122; https://doi.org/10.3390/app13127122
Submission received: 21 April 2023 / Revised: 10 June 2023 / Accepted: 10 June 2023 / Published: 14 June 2023
(This article belongs to the Section Acoustics and Vibrations)

Abstract

Since informatization and digitization entered daily life, the emotion classification of audio signals has been widely studied and discussed as a hot issue in many application fields. With the continuous development of artificial intelligence, emotion classification of speech and music audio signals is already widely used in production and daily life, and its range of applications keeps expanding. Current research on audiovisual scene emotion classification mainly relies on frame-by-frame processing of video images to discriminate emotion categories. However, such methods have high algorithmic complexity and computing cost, making it difficult to meet the engineering requirements of real-time online automatic classification. Therefore, this paper proposes an automatic detection algorithm for movie shock scenes that is suitable for engineering applications. It exploits the influence of low-frequency sound effects on the perception of known emotions and, based on a database of movie emotion scene clips in 5.1 sound format, extracts audio signal feature parameters and performs binary classification of shock versus other emotion types. Since LFS can enhance the sense of shock, a monaural algorithm for detecting emotionally impactful scenes using the subwoofer (SW) channel is proposed; a classification model trained on SW monaural features achieved a maximum accuracy of 87% on the test set with a convolutional neural network (CNN) model. To broaden the applicability of this algorithm, a monaural detection algorithm based on low-pass filtering (with a cutoff frequency of 120 Hz) is further proposed, which achieved a maximum accuracy of 91.5% on the test set with a CNN model.

1. Introduction

Research on low-frequency sound (LFS), ranging from 15 Hz to 120 Hz, has primarily focused on low-frequency noise (LFN) [1,2], particularly the negative impact of LFS on human emotion perception and behavior [3,4,5]; for example, the noisiness of LFS is the main factor that induces the perception of annoyance. Experimental comparisons have shown that, at the same sound pressure level, LFS is more strongly associated with annoyance than other types of sound [6]. LFS can also affect emotion perception in related ways, such as the feeling that one's thoughts are disturbed, an inability to concentrate, anger, frustration, and discomfort towards the surrounding environment, work, and visual information [7]. Research conducted by the Psychology Department at Goldsmiths, University of London, found that 80% of people felt dizzy after hearing LFS in the infrasound range, while 11% felt sadness and 9% felt fear [8]. In 2003, O'Keeffe and Angliss exposed 700 subjects to four pieces of music with embedded 17 Hz LFS to examine its effect on emotion; 22% of the subjects reported feeling anxious, uneasy, extremely sad, tense, disgusted, or fearful, or experienced chills down the spine [9]. While most existing research has focused on the negative impact of LFS on human emotion perception, little work has examined the potential of LFS to enhance emotion perception. With the development of multimedia audio technology, the evaluation of audio quality, including the quality of LFS components, has received increasing public attention. In audiovisual scenes, LFS is often the primary source of shock in films, and the inclusion of LFS components can strengthen the impact of these scenes, providing the audience with a heightened sense of immersion.
Electroencephalography (EEG) signals are important physiological signals generated by brain activity within the cranial cavity; they can be recorded from the surface of the scalp and contain rich physiological information. The psychological effects of external stimuli on individuals can be described and quantified using various emotion perception measures, and objective parameters can also be measured using instruments such as EEG and functional magnetic resonance imaging (fMRI). EEG instruments have been widely used in psychoacoustics, for example, to investigate the changes in energy of different brain functional areas in response to external auditory stimuli [10] and their correspondence with emotion perception. However, relatively few studies have directly applied LFS as a stimulus in EEG experiments. Studies by Weisz and Müller have shown that the energy of the alpha rhythm in EEG decreases in the presence of external auditory stimuli [10]. Cho's experiments have shown that the presence of LFS increases the energy of the beta rhythm, that excessive beta rhythm activity can induce feelings of anxiety, fear, and stress, and that higher-than-average brain rhythm frequencies can induce fantasies [11]. However, there is currently limited research on how known emotion perception is reflected in characteristic changes in EEG signals under external stimuli, particularly on LFS stimulation in audiovisual scenes and its effects on rhythmic energy changes in brain functional areas.
Emotion classification of audio signals has been widely discussed as a hot topic in numerous applications. Researchers have mainly focused on emotion classification of speech signals and music signals [12,13]. Nicholson et al. [14] used a one-class-in-one neural network for speech emotion recognition, training a subnetwork for each emotion and using the output of each subnetwork to make decisions and obtain the corresponding classification results. Grimm et al. [15] applied the three-dimensional emotion description model (activation–evaluation–power space) to spontaneous speech emotion recognition and treated it as a standard regression prediction problem, which also sparked enthusiasm for dimensional speech emotion recognition [16,17,18,19]. Since 2010, with the widespread adoption of machine learning, support vector machine (SVM)-based classification algorithms have become mainstream in speech emotion recognition. Subsequently, with the rapid development of deep learning and the improvement in computer processing speed, convolutional neural networks (CNNs) were used to learn salient emotion information from acoustic features of speech, and experimental results showed that this classification model had good robustness [20]. In ref. [21], an end-to-end approach that combined CNNs with long short-term memory (LSTM) networks was proposed to learn emotion information directly from raw speech signals. In [22], the development of speech emotion recognition over the years, along with future trends and challenges, was summarized. In recent years, the Google team introduced transformer models into the field of speech recognition; an end-to-end speech emotion recognition architecture with stacked transformer layers was proposed in [23], and a transformer-like model that reduces memory usage and time overhead and greatly improves computational efficiency was proposed in [24]. Emotion recognition in music signals derives from emotion recognition in speech signals, but the analysis of music signals is more difficult. Most researchers extract time-domain, frequency-domain, and cepstrum-domain eigenvalues related to emotion information from music signals and use them to train traditional machine learning and classical deep learning models for music emotion classification [25,26,27]. K-nearest neighbors (KNN) and SVM are commonly used models in machine learning-based music emotion recognition. The KNN regression algorithm is sensitive to the dataset samples and the value of K, which may result in overfitting or underfitting during model training. SVM can handle both linear and nonlinear problems and separates music data samples of different emotions by finding an optimal hyperplane in a high-dimensional space. Deep learning-based music emotion recognition aims to handle large-scale music signal data and reduce the cost of feature extraction. In [28], a hybrid architecture combining deep learning and broad learning (BL) for music classification tasks was proposed, which improved efficiency while maintaining high classifier performance.
Most current methods for emotion detection in multichannel movie scene clips focus on parsing the video signal, which is difficult and limited in recognition rate; the few methods that use audio signals for emotion classification treat them only as an auxiliary tool. For movie audiovisual scenes in 5.1 sound format, the audio signal plays a unique role, especially in shock-type scenes with high recognition impact. Moreover, audio signal processing is comparatively simple and fast relative to video signal processing, making it better suited to real-time processing in engineering applications. This paper proposes a shock scene recognition algorithm that trains basic machine learning and neural network classification models with a usable recognition rate by extracting the characteristic parameters of the audio signal in 5.1 sound format, in particular exploiting the regular influence of the LFS signal on emotion. Furthermore, a mixed-channel shock scene recognition algorithm that can be applied in engineering practice is also proposed, and its feasibility is judged by the accuracy of shock emotional scene detection in actual engineering tests.
In the remainder of this paper, Section 2 details the methods and steps of the subjective evaluation and EEG measurement experiments on the perception of known emotions under LFS stimuli, as well as the procedures of the LFS-based emotion classification algorithm for audiovisual scenes and the low-pass filtered mono-based emotion classification algorithm for such scenes. Section 3 processes and analyzes the data obtained from these experiments to determine the feasibility of the classification algorithms, and Section 4 provides a summary.

2. Materials and Methods

2.1. Subjective Evaluation Experiment and EEG Signal Observation Experiment under LFS Stimulation

2.1.1. Materials for Subjective Evaluation Experiments

In the experiment, static and dynamic images that could elicit three types of emotion (happiness, sadness, and shock) were selected as the video signals. Figure 1 shows the dataset used to induce the three emotions, with rows of "happy," "sad," and "shocking" pictures. Each emotion category consists of four static images and one dynamic image, giving a total of 15 different video signals. The experiment also employed nine LFS signals previously shown to modulate the intensity of happiness, sadness, and shock to varying degrees. Previous experimental studies have found that LFS composed of different frequency components can raise the level of arousal, making people feel influenced and dominated by the signal, causing a more pronounced and intense sense of annoyance, space, and immersion, and altering subjective emotional perception. A subjective evaluation experiment was then conducted on these evaluation dimensions (sense of annoyance, sense of space, sense of immersion), in which participants watched happy, sad, and shocking movie clips without added LFS; it was found that the change in annoyance correlated most strongly with the level of happiness, the change in the sense of space with the level of shock, and the change in the sense of immersion with the level of sadness in the movie. Finally, adding LFS to the movie clips revealed that signals S1–S3 most enhanced the original sense of annoyance, S4–S6 most enhanced the original sense of immersion, and S7–S9 most enhanced the original sense of space. We therefore concluded that S1–S3 modulate happiness, S4–S6 modulate sadness, and S7–S9 modulate shock.
Table 1 displays the frequency composition of these LFS signals, with f1 representing the fundamental frequency signal (a pure tone) and f2 to f7 representing harmonic signals (1/12-octave narrowband signals), with a 6 dB reduction in sound pressure level between successive harmonics. Signals S1–S3 are used to modulate happiness, while signals S4–S6 and S7–S9 are used for the sadness and shock conditions, respectively. A total of 34 university students aged 18–22 years were enrolled in this experiment, and the meaning of the subjective evaluation parameters was explained to them before the experiment. Video signals were played on a laptop, and the LFS was played through ATH-CLR100iS in-ear headphones, which have a frequency response of 10 Hz–25 kHz, an impedance of 16 ohms, a sensitivity of 100 dB, and a maximum power of 20 mW. The experiment was conducted in a controlled laboratory environment with no external noise interference.
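As an illustration only, the structure described above (a pure-tone fundamental plus 1/12-octave narrowband harmonics stepped down by 6 dB) can be sketched in code. The frequencies, duration, filter order, and normalization below are assumed placeholders, since Table 1 alone does not fully specify the S1–S9 stimuli.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def synth_lfs(f1, harmonics, sr=44100, dur=5.0, seed=0):
    """Rough LFS stimulus: a pure-tone fundamental f1 plus 1/12-octave
    narrowband harmonics, each component 6 dB below the previous one."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(sr * dur)) / sr
    sig = np.sin(2 * np.pi * f1 * t)                     # fundamental (0 dB reference)
    level = 1.0
    for f in harmonics:
        level /= 10 ** (6 / 20)                          # 6 dB reduction per harmonic
        lo, hi = f * 2 ** (-1 / 24), f * 2 ** (1 / 24)   # 1/12-octave band edges
        sos = butter(4, [lo, hi], btype="band", fs=sr, output="sos")
        band = sosfilt(sos, rng.standard_normal(t.size))
        band /= np.max(np.abs(band)) + 1e-12             # unit peak before scaling
        sig += level * band
    return sig / np.max(np.abs(sig))                     # normalize the mixture

# Example: a hypothetical 30 Hz fundamental with harmonics at 60, 90, and 120 Hz.
stimulus = synth_lfs(30, [60, 90, 120])
```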

2.1.2. Evaluation Methods for Subjective Evaluation Experiments

The subjective evaluation experiment utilized a rating-scale method, with levels 1–10 corresponding to low emotion intensity (1), a neutral state (5), and high emotion intensity (10). Participants provided evaluations based on the overall audiovisual experience and the intensity of the emotions evoked. The subjective evaluation experiment proceeded in the order of happy, sad, and shock emotions. First, video signals without LFS signals were played, following the sequence of static images 1–4 and then the dynamic image. Subsequently, the audiovisual signals with added LFS signals were played in the same order. The added LFS signals corresponded to the three emotion conditions in the order S1–S3, S4–S6, and S7–S9, respectively. The duration of both video and audiovisual signals was 5 s, with a 6 s interval between them. The interval between emotion conditions was 10 s, during which a gray image with no sound was presented. Participants were required to rate the corresponding emotion intensity after each signal ended.

2.1.3. Experimental Method of Brain Wave Observation

The composition of the subjects, the equipment used for audiovisual signal presentation, and the experimental setting were identical to those in the subjective evaluation experiment. The EMOTIV EEG device, illustrated in Figure 2, was used to record the brain wave signals. Figure 3 illustrates the placement of the EEG device on the subject's scalp, with electrode positions AF3, AF4, F3, and F4 based on the Brodmann functional brain regions known to process emotion stimuli [29]. The video signals and LFS signals used in the experiment remained the same as those in the subjective evaluation experiment and were presented in the same order. The duration of the video and audiovisual signals was adjusted to 10 s, with a 5 s interval between them and a 15 s interval between each type of emotion, during which gray pictures were displayed without accompanying sound. Subjects were instructed to maintain a relaxed state, minimize blinking and swallowing, and avoid head movement during signal presentation. Initially, a 15 s baseline reference state was recorded without any visual or auditory stimulation. Subsequently, brain wave signals were recorded during the presentation of video signals depicting happy, sad, and shock emotions without audio stimulation, and then with the addition of LFS stimulation, following the experimental conditions in order.

2.2. LFS-Based Sentiment Classification Algorithm for Audiovisual Scenes

2.2.1. Movie Scene Emotion Clip Information Statistics

The dataset was a movie scene emotion clip database consisting of 21 clips in 5.1 audio format. The 5.1 format comprises six channels: the center channel, front left channel, front right channel, rear left surround channel, rear right surround channel, and a 0.1 channel for LFS (SW). The SW channel is specifically designed for the subwoofer, with a frequency range of 20 Hz–120 Hz. Each movie scene clip was annotated with an emotion category label, and the quantity and duration ranges of the clips for each emotion category are shown in Table 2. Seven clips from the shock category and eight clips from other emotion categories were extracted from the database as the training and validation sets, with a 7:3 ratio between training and validation. The accuracy on the validation set was used to evaluate the performance of the model. The remaining three clips from the shock category and three clips from other emotion categories were used as the test set to evaluate the generalization ability of the classification model and its feasibility in practical engineering applications.

2.2.2. Classification Algorithm Flow

The subjective evaluation and EEG measurement experiments demonstrated that LFS significantly influences and modulates known happiness, sadness, and shock: it has a suppressive effect on happiness in audiovisual scenes and can enhance sadness and, to varying degrees, shock. The experiments also indicated that bass has a consistent enhancing effect on shock. The bass component is the main source of shock in the viewing environment, and actual 5.1 sound format films provide a rich database of shock scenes. Therefore, in this section, an emotion scene classification algorithm based on the mono-channel SW signal is proposed by processing and analyzing movie clips in 5.1 audio format. The feasibility of using eigenvalues extracted from the SW signal (i.e., the 0.1 channel in 5.1 audio) for binary classification of shock versus other emotion types is verified by training classification models, and the most suitable model is determined, along with the specific algorithm flow and selected eigenvalues. The experimental flowchart is shown in Figure 4, with the specific experimental steps as follows.
(1) Segmentation of movie clips. Eleven time lengths of 0.5 s and 1 s–10 s are used as window lengths to clip seven shock scenes and eight other emotion scenes extracted from the 5.1 audio format movie emotion scene database.
(2) Feature extraction. Based on time-domain, frequency-domain, and cepstral-domain audio features, eigenvalues are calculated for all six channels and, separately, for the SW channel of each segmented movie scene in the 5.1 audio format.
(3) Feature engineering. The preprocessed dataset is normalized using the min–max method, and the training set and validation set are divided in a 7:3 ratio for training eight classification models. Models with better accuracy performance on the validation set are selected, as well as the optimal time lengths for movie scene segmentation.
(4) Feature Selection. After obtaining the optimal time lengths for movie scene segmentation and the classification models, feature selection is performed to achieve dimensionality reduction. Variance of the original eigenvalues is calculated, and features that do not meet the threshold are removed and then normalized using min–max normalization. Kernel density estimation (KDE) plots are used to evaluate the distribution of eigenvalues in the training set and test set, and features with significant differences in data distribution are removed. Correlation heat maps between eigenvalues and labels are plotted, and features with high correlation coefficients with the target variable are selected. After feature selection, data balancing is performed by removing a portion of positive class samples, i.e., shock scenes.
(5) Model Training. Finally, classification models are trained using different segmentation window lengths and the selected eigenvalues. GridSearchCV is used to iterate through predefined hyperparameters to select the optimal values and further improve the performance of the classification models (a minimal sketch of steps (3) and (5) is given after this list).
(6) Model Testing. Six randomly selected movie scene test clips (three shock scenes and three other emotion scenes) are segmented according to the window lengths and eigenvalues are calculated based on the final selected features. The resulting dataset is normalized using min–max normalization and imported into the selected best classification model for testing to evaluate the generalization ability of the final model.
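The following minimal sketch illustrates steps (3) and (5), assuming the per-segment eigenvalues and labels have already been computed and saved; the file names, the SVM estimator, and the hyperparameter grid are illustrative placeholders rather than the exact configuration used in this work.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Hypothetical arrays: one row of eigenvalues per segmented movie clip,
# label 1 = shock, 0 = other emotion.
X, y = np.load("eigenvalues.npy"), np.load("labels.npy")

# Step (3): 7:3 split into training and validation sets, min-max normalization.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                            stratify=y, random_state=0)
scaler = MinMaxScaler().fit(X_tr)
X_tr, X_val = scaler.transform(X_tr), scaler.transform(X_val)

# Step (5): iterate over a predefined hyperparameter grid with GridSearchCV.
param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X_tr, y_tr)
print("best params:", search.best_params_)
print("validation accuracy:", search.score(X_val, y_val))
```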

2.2.3. Audio Signal Feature Extraction

Feature extraction from audio signals is an essential module in audio scene classification, and audio features are usually analyzed in the time domain, frequency domain, or cepstral domain. As shown in Table 3, each channel in the 5.1 audio format carries unique content and characteristics, including movie dialogue and background music. Therefore, the initially selected parameters comprise 50 time-domain, frequency-domain, and cepstral-domain eigenvalues that are relevant to speech emotion recognition and music emotion recognition. Because the target application of 5.1 sound format movie scene emotion classification is online real-time processing, the audio signal must be windowed into clips of a given length, and an emotion type must be determined for each clip. Based on the statistics of movie clip lengths and on the clip length that can best represent a given emotion type, a total of 11 window lengths, from 0.5 s to 10 s, were determined for segmenting the movie scene clips. The librosa, AudioSegment, and pyAudioAnalysis audio feature extraction toolkits were then used to calculate each audio eigenvalue for each clip, and the mean value, standard deviation (STD), and maximum–minimum difference of each feature parameter were computed. With the full set of six channels, each sample in the dataset contains 900 features; if only the SW channel features are retained, there are 150 features per sample.
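As a small illustration of the windowing step, the sketch below splits a single-channel clip into fixed-length segments; the file name and the 9 s window are placeholders for any of the eleven window lengths.

```python
import soundfile as sf

# Hypothetical single-channel export of one movie clip.
audio, sr = sf.read("sw_channel_clip.wav")

win_s = 9                                   # window length in seconds (0.5 s to 10 s)
win = int(win_s * sr)
segments = [audio[i:i + win] for i in range(0, len(audio) - win + 1, win)]
print(f"{len(segments)} segments of {win_s} s each")
```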
Mel-frequency cepstral coefficients (MFCCs) are an important feature parameter in audio signal processing and the focus of the experiments in this paper; they show good robustness in classifying the emotions of audio scenes [30]. An MFCC is a cepstral parameter extracted on the Mel frequency scale, which describes the nonlinear auditory characteristics of the human ear and retains good recognition performance even when the signal-to-noise ratio decreases [31]. The main steps of the feature extraction process are preprocessing of the audio signal, the fast Fourier transform (FFT), passing through a Mel filter bank, taking logarithms, and the discrete cosine transform (DCT) [32]. After these steps, the M-order cepstral parameters are obtained, from which the n-dimensional parameters are taken as the MFCC features of the audio signal. This paper used the librosa package to extract 24-dimensional MFCCs, with the sampling rate (SR) set to 22,050 Hz and n_mfcc set to 24.
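A minimal sketch of the MFCC portion of the feature extraction with librosa, using the settings stated above (SR = 22,050 Hz, n_mfcc = 24); the per-coefficient statistics mirror the mean, STD, and maximum–minimum difference computed for all eigenvalues, and the file path is a placeholder.

```python
import numpy as np
import librosa

# Hypothetical path to one segmented SW-channel clip.
y, sr = librosa.load("sw_segment.wav", sr=22050)

# 24-dimensional MFCC matrix, shape (24, n_frames).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=24)

# Statistics over frames: mean, standard deviation, and max-min difference.
mfcc_features = np.concatenate([
    mfcc.mean(axis=1),
    mfcc.std(axis=1),
    mfcc.max(axis=1) - mfcc.min(axis=1),
])
print(mfcc_features.shape)   # (72,) = 24 coefficients x 3 statistics
```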

2.2.4. Initial Screening of Classification Models

The experiment in this study adopts supervised learning algorithms, as each scene clip in the movie emotion scene database is labeled with a corresponding emotion tag. The preliminarily selected classification models consist of six traditional machine learning models and two classical neural network models. The machine learning models include the SVM classifier, the random forest classifier, and the Bernoulli, Gaussian, and multinomial naïve Bayes classifiers. The two neural network models are the backpropagation neural network (BPNN) classifier and the CNN classifier. The basic principles of the main models are as follows.
(1) SVM is a classical binary classification supervised learning algorithm known for its good robustness. The main objective of this algorithm is to find the optimal separating hyperplane that correctly classifies different types of data points [33,34], as illustrated in Figure 5.
(2) BPNN is a multilayer feedforward neural network trained using the error backpropagation algorithm. The learning process of the BPNN classification model consists of two stages: forward propagation of signals and backward propagation of errors. During forward propagation, signals pass through the input layer and hidden layer(s) and undergo nonlinear transformations to produce output signals at the output layer. During backward propagation, the error signals propagate from the output layer through the hidden layer(s) back to the input layer, and the weights and biases between the hidden layer and output layer, as well as those between the input layer and hidden layer(s), are continuously adjusted in the direction of gradient descent. The BPNN classification model repeats this learning process until it obtains the weights and biases corresponding to the minimum error, at which point training stops. Figure 6 illustrates the structure of a BPNN binary classification model composed of an input layer, hidden layer(s), and an output layer.
(3) CNN models can be used not only for image recognition but also for emotion recognition from audio signals [35]. The network structure of a CNN classification model consists of four parts: the input layer, convolutional layer, pooling layer, and fully connected layer. The model is trained iteratively using the backpropagation algorithm; Figure 7 illustrates the CNN network structure. The convolutional layer, with its local connectivity and weight sharing, simulates simple cells with local receptive fields and extracts primary visual features [36], which greatly reduces the number of connections in the network. The pooling layer, also known as the subsampling layer, simulates complex cells by filtering and combining the primary features into higher-level, abstract features, thereby reducing the number of parameters. The fully connected layer maps the learned "distributed feature representation" to the sample label space, enhancing the nonlinear capability of the network while limiting its size [37]. A minimal sketch of such a classifier is given below.
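The sketch below shows how a 1D CNN binary classifier over per-segment feature vectors (e.g., the 150 SW-channel eigenvalues) could be assembled in Keras; the layer sizes and hyperparameters are illustrative assumptions, not the architecture reported here.

```python
import tensorflow as tf

def build_cnn(n_features: int) -> tf.keras.Model:
    """Small 1D CNN for shock-vs-other binary classification."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features, 1)),
        tf.keras.layers.Conv1D(32, 5, activation="relu"),    # convolutional layer
        tf.keras.layers.MaxPooling1D(2),                      # pooling layer
        tf.keras.layers.Conv1D(64, 5, activation="relu"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(64, activation="relu"),         # fully connected layer
        tf.keras.layers.Dense(1, activation="sigmoid"),       # shock probability
    ])

model = build_cnn(150)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```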
The eigenvalues of the 6-channel information in the 5.1 sound format movie scene clips were used to train the models. Among the six traditional machine learning algorithms, the SVM performed best: it showed high validation accuracy when the movie scene clips were segmented with window lengths of 6–9 s, reaching 93% with a 6 s window. Among the neural network algorithms, the BPNN and CNN models showed similar overall performance, both achieving high validation accuracy with window lengths of 6–9 s. The BPNN model reached 93% validation accuracy with a 9 s window, while the CNN model reached 95% with an 8 s window.
When using the eigenvalues of SW channel information in 5.1 sound format movie scene clips to train the model, it was found that the SVM still performed the best among the six machine learning classification models. This model also achieved high accuracy on the validation set with a window length of 6–9 s for movie scene clip segmentation. Meanwhile, the two neural network models performed better than the SVM model overall. The accuracy of the SVM model reached a maximum of 84% with a window length of 9 s for movie scene clip segmentation, while the BPNN model achieved an accuracy of 86% with a window length of 6 or 9 s on the validation set. Similarly, the CNN model also achieved an accuracy of 86% on the validation set with a window length of 8 or 9 s for movie scene clip segmentation.
In conclusion, the SVM, BPNN, and CNN models all showed good recognition performance for the emotion content of movie scene clips in 5.1 sound format. However, all three models suffered from severe overfitting and generalized poorly to the test set. Therefore, in this experiment, the three models were retrained and their hyperparameters optimized, using the selected eigenvalues and window lengths of 6–9 s for movie scene clip segmentation, in order to obtain models with stronger generalization ability.

2.2.5. Selection of Eigenvalues

For a specific learning algorithm, not all features are beneficial for learning; irrelevant and redundant features are not. In practical applications, an excessive number of features can easily lead to the curse of dimensionality. Feature selection through feature parameter screening can not only greatly reduce the training time of the model but also significantly increase its interpretability. The screening procedure of this experiment consists of ANOVA screening, homoscedastic screening by means of kernel density estimation (KDE) plots, and correlation screening by means of correlation heat maps. The specific steps are detailed as follows.
(1) ANOVA screening. Variance analysis is performed on the selected eigenvalues. The divergence of the features is observed, and if the variance of a certain feature is close to 0, it indicates that the sample dataset has little variability on this feature and the classification effect of this feature on the samples is small. Therefore, this redundant feature is removed. In this experiment, eigenvalues with a variance less than 0.1 are removed, and after variance analysis, the eigenvalues are normalized.
(2) Homoscedastic screening by means of KDE plots. KDE plots are used to infer the distribution of population data based on a limited sample. The result of kernel density estimation is the estimated probability density function of the samples, and based on this result, the clustering areas of data can be obtained [38]. KDE is a nonparametric estimation that does not incorporate any prior knowledge, but fits the distribution based on the characteristics and properties of the dataset itself, and thus obtains a more optimized model than parametric estimation methods [39]. In this experiment, the differences in data distribution between eigenvalues in the training set and the test set are evaluated by drawing kernel density estimation plots of the dataset distribution. If the overlap of the kernel density estimation plots is high, it indicates that the data distribution differences are small, and the feature is retained [40]; otherwise, it is removed. Figure 8 presents eigenvalues with small data distribution differences, while Figure 9 presents eigenvalues with large data distribution differences.
(3) Correlation screening by means of correlation heat maps. By analyzing the correlation between two or more variables, the closeness of their relationship can be measured. Figure 10 shows the correlation heat map of SW mono eigenvalues with labels under 9 s window length segmentation. In this experiment, the Spearman rank correlation coefficient, a nonparametric measure of rank correlation, is used to analyze the data. It is computed by applying the Pearson correlation formula to the ranks x_i and y_i of the two variables:

$$\rho = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
The larger the correlation coefficient, the stronger the association between the feature variable and the target variable [41]. Based on the experimental results, the final eigenvalues were selected by retaining those whose correlation coefficient with the target variable was greater than 0.1, as sketched below.
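A compact sketch of the variance and correlation screening steps, assuming the eigenvalues are stored in a pandas DataFrame with a hypothetical "label" column; the KDE comparison between the training and test distributions is a visual step and is omitted, and the use of the absolute correlation value is an interpretation of the 0.1 criterion.

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical table: one row per segment, eigenvalue columns plus a "label" column.
df = pd.read_csv("sw_eigenvalues.csv")
y = df.pop("label")

# (1) ANOVA screening: drop features whose variance is below 0.1,
#     then min-max normalize the remaining ones.
df = df.loc[:, df.var() > 0.1]
df = (df - df.min()) / (df.max() - df.min())

# (3) Correlation screening: keep features whose Spearman rank correlation
#     with the label exceeds 0.1 (absolute value assumed here).
rho = df.apply(lambda col: spearmanr(col, y)[0])
selected = df.loc[:, rho.abs() > 0.1]
print(selected.columns.tolist())
```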
After the feature parameter selection process, the near-miss undersampling technique was applied to the dataset to balance the class distribution. Near-miss is an undersampling technique that balances the classes by removing majority class samples [42,43], as illustrated in Figure 11. Since the duration of the movie clips in the movie scene clip database is not uniform, the category distribution of the training data is skewed, and this data imbalance would significantly interfere with the learning of the algorithm. In Figure 11, the blue squares represent category 1 and the green circles represent category 2; the samples of category 1 clearly outnumber those of category 2, and the near-miss algorithm eliminates the redundant category 1 samples (red squares) to balance the two categories. Therefore, in this experiment, after completing the feature parameter selection, near-miss undersampling was applied to balance the class distribution by removing majority class data [44,45], which in this case corresponds to the positive class, i.e., the "shock" emotion clips.
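A self-contained toy example of the near-miss undersampling step with the imbalanced-learn package; the class sizes and feature values are synthetic and serve only to show the majority class being reduced.

```python
from collections import Counter

import numpy as np
from imblearn.under_sampling import NearMiss

# Synthetic imbalanced data: 20 "shock" (1) segments vs. 10 "other" (0) segments.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = np.array([1] * 20 + [0] * 10)

# Near-miss keeps the majority-class samples closest to the minority class
# and discards the rest, equalizing the class counts.
X_bal, y_bal = NearMiss(version=1).fit_resample(X, y)
print(Counter(y), "->", Counter(y_bal))
```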

2.2.6. Model Training and Testing

Fifteen movie scenes in 5.1 sound format were segmented with window lengths of 6–9 s, and the finalized feature parameter values were calculated and filtered for each sound channel of each movie scene segment. The three models were then trained using the 6-channel and the SW-channel feature information, respectively.
The six movie test clips were segmented with window lengths of 6–9 s, their features extracted and normalized, and then imported into the SVM, BPNN, and CNN models trained with the six-channel information features; the average accuracy over the six clips was used as the test set accuracy of the 5.1 sound format-based emotion scene classification algorithm. Likewise, the same six test clips were imported into the three models trained with the SW mono information features, and the average over the six clips was used as the test set accuracy of the SW mono-channel-based emotion classification algorithm.

2.3. Low-Pass Filtering Mono-Based Sentiment Classification Algorithm

To enable application in a wider range of practical scenarios, a mono-channel emotion classification algorithm based on low-pass filtering was proposed. The pydub AudioSegment module was used to downmix movie scene clips in 5.1 audio format to a mono channel, and a low-pass filter (with a cutoff frequency of 120 Hz) was applied to simulate the SW channel. For this experiment, 7 clips of scenes with a "shock" emotion and 8 clips of scenes with other emotion contexts were used as the training and validation sets, while 6 mono-channel movie test clips (3 shock scenes and 3 other emotion scenes) were used as the test set. The feature selection method followed the same steps described in Section 2.2.5, including analysis of variance, kernel density estimation (KDE) plots, and correlation heat maps, to extract features from the low-pass filtered mono channel.
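This step is performed with pydub's AudioSegment in the present work; the sketch below is an approximate equivalent using soundfile and scipy, assuming the clip is available as a 6-channel WAV file and substituting a 4th-order Butterworth filter for whatever low-pass implementation was actually used.

```python
import soundfile as sf
from scipy.signal import butter, sosfilt

# Hypothetical 6-channel (5.1) WAV export of one movie clip.
audio, sr = sf.read("movie_clip_5_1.wav")     # shape: (n_samples, 6)

# Downmix all channels to a single mono signal.
mono = audio.mean(axis=1)

# Low-pass filter with a 120 Hz cutoff to simulate the SW channel.
sos = butter(4, 120, btype="low", fs=sr, output="sos")
lp_mono = sosfilt(sos, mono)

sf.write("movie_clip_lp_mono.wav", lp_mono, sr)
```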

3. Numerical Results

3.1. Data Results and Analysis of Subjective Evaluation Experiments and EEG Measurement Experiments

3.1.1. Data Results and Analysis of Subjective Evaluation Experiments

The subjective evaluation scores of video stimuli without audio were used as reference values for each participant, and the scores of the same visual stimuli with different added LFS signals were taken as outcome values. The relative values of subjective emotion evaluation were obtained by subtracting the reference values from the outcome values, indicating the enhancement or reduction in perceived emotion intensity for each participant under the three emotion states caused by the different LFS signals. A positive relative value indicated that the addition of an LFS signal increased the perceived intensity of the known emotion; a negative relative value indicated a reduction in the perceived emotion intensity due to the corresponding LFS signal; and a relative value of zero indicated no effect of the LFS on the perceived emotion intensity. Figure 12 shows the relative values of the subjective ratings of all subjects, grouped by the five video signals for each of the three emotions.
The statistical results showed that when the video signal clearly induced happy emotions, the relative values of the subjective emotion ratings were negative after adding LFS, indicating that the subjects' happiness was overwhelmingly diminished. This implies that LFS signals generally have an inhibitory effect on happy emotions. Among the three LFS signals, the S3 signal exhibited the strongest inhibitory effect on happy emotions, while the S1 signal had the weakest effect. The inhibitory effect of LFS signals on happy emotions was relatively more pronounced for dynamic videos. When the video signal was known to induce sadness, the relative values of the subjective affect scores were almost always positive after the addition of LFS, suggesting that LFS can enhance the subjects' known level of sadness. The enhancing effect of the S4 signal was relatively weak, and in some cases no enhancement was observed, whereas the S6 signal had the strongest effect on sad emotions. Adding LFS to visual signals known to induce shock revealed mostly positive relative values for the subjective emotion scores, indicating that the intensity of shock evoked by the audiovisual signals was overwhelmingly higher than without LFS and suggesting that LFS signals can enhance shock to varying degrees. No significant differences were observed between still images and moving videos in terms of sadness or shock (Table 4).

3.1.2. Brain Wave Data Statistics and Result Analysis

First, the EEG data were processed by the EEGLAB toolbox in MATLAB and the Neuron-Spectrum software. The data were segmented based on the duration of audiovisual stimuli, and prominent artifacts caused by blinking, eye movement, and muscle activity were removed. After obtaining clean EEG signals, the power spectral density of theta (4 Hz–8 Hz), alpha (8 Hz–13 Hz), and beta (13 Hz–30 Hz) rhythms was calculated. A total of 20 participants’ EEG data with valid signals were obtained (the subjective evaluation experiment yielded data from the same 20 participants). Due to the interindividual variability in EEG energy, relative power spectral density values were calculated, which involved computing the ratio of the power spectral density of each participant’s EEG signals under different audiovisual stimulus conditions to the power spectral density of their reference states.
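A minimal sketch of the relative power spectral density computation described above, using Welch's method; the 128 Hz sampling rate and the random arrays are placeholders (EMOTIV headsets commonly sample at 128 Hz, but the device settings are not restated here).

```python
import numpy as np
from scipy.signal import welch

def band_power(x, sr, lo, hi):
    """Mean power of a signal within the [lo, hi) Hz band (Welch's method)."""
    f, pxx = welch(x, fs=sr, nperseg=2 * sr)
    return pxx[(f >= lo) & (f < hi)].mean()

sr = 128                                              # assumed EEG sampling rate
rng = np.random.default_rng(0)
stimulus = rng.standard_normal(10 * sr)               # 10 s audiovisual segment
baseline = rng.standard_normal(15 * sr)               # 15 s reference state

bands = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}
rpsd = {name: band_power(stimulus, sr, lo, hi) / band_power(baseline, sr, lo, hi)
        for name, (lo, hi) in bands.items()}
print(rpsd)
```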
The main focus of this work is to observe the trend of changes in EEG energy after adding LFS signals under the three known emotion states. Therefore, statistical analysis was conducted on the relative power spectral density values of the EEG signals after adding LFS signals compared to the values without audio signals (Table 5 shows the relative power spectral density ratios after adding LFS signals compared to the original silent condition). The results showed significant differences in the effects of LFS signals on EEG energy under the three known emotion states. Overall, the results were consistent across the four electrode locations, which may be related to their shared involvement in emotion-related brain functions. The addition of LFS signals increased the θ and α rhythm energy associated with known happy emotions while suppressing β rhythm energy. In contrast, for known sad emotions, LFS signals increased θ and β rhythm energy, with inconsistent changes in α rhythm energy. When the known emotion state of the participants was shock, the added LFS signals significantly suppressed θ rhythm energy and strongly stimulated β rhythm energy, with minimal effects on α rhythm energy.
The change in EEG rhythm energy induced by the addition of each LFS signal relative to the original video signal under the three known emotion conditions is shown in Figure 13. When comparing the EEG energy induced by the various LFS signals added to the original video signals under the three known emotion conditions (averaged across the four electrode locations), the θ and β rhythm energy induced by the three LFS signals added to videos known to trigger happy emotions was consistent with the results in Table 5, but there was no apparent pattern among these three signals. However, when the known emotions were sadness and shock, the effects of the LFS signals on EEG energy showed a consistent pattern. For S4, S5, and S6, the θ rhythm energy induced after adding them to video signals that clearly elicited sadness followed the order S6 > S5 > S4, while the increase in α and β rhythm values followed the opposite order, i.e., S6 < S5 < S4. This indicates that the addition of LFS signals overall stimulates the energy of emotion-related brain functional areas, but different LFS signals affect different rhythm energies differently. In the observed EEG results for known shock, the signals S7, S8, and S9 showed a relatively consistent relationship with the three rhythm energies. Although they all suppressed θ rhythm energy, the suppression intensity followed the order S7 > S8 > S9, and their activating effect on the α rhythm followed the same order. Similarly, they all had an activating effect on β rhythm energy, but the degree of activation followed the opposite order, i.e., S7 < S8 < S9 (Table 6).

3.1.3. Analysis of Subjective and Objective Experimental Results

A quantitative analysis of the overall trend of changes reveals a certain regularity between LFS signals and the subjective and objective experimental results of known emotions. LFS signals have an inhibitory effect on the intensity of known happy emotions, which is consistent with the changes in β rhythm EEG energy of the subjects after the addition of LFS signals, but opposite to the overall trend of changes in θ and α rhythms. The subjective and objective results of known sad emotions also showed the same trend of changes, with increased intensity of sad emotions and θ and β rhythm energy. Although LFS signals also have an enhancing effect on subjective shock, the trend of changes in EEG energy of the subjects is different from that of known sad emotions, and the enhancement of subjective emotions corresponds to the inhibition of θ rhythm and the increase in β rhythm energy.
To further quantify the correlation between the subjective relative values of the participants' evaluations and the relative power spectral density ratios of the EEG rhythms after the addition of LFS signals compared to those without LFS signals, Spearman correlation analysis was conducted with statistical significance set at p < 0.05. Table 7 shows the statistically significant correlations. Significant correlations between the changes in subjective emotion intensity and the objective EEG energy measurements were found for only four LFS signals. S1 and S2 are LFS signals used when the known emotion is happiness; their effect on the intensity of happy emotion after being added to known happy video signals is significantly positively correlated with the β rhythm at the F3 and AF4 electrode positions. S4 is an LFS signal added when the known emotion is sadness; the changes in subjective sadness intensity it causes are significantly positively correlated with the changes in the α rhythm at the AF4 electrode position. The quantified effect on subjective emotion of adding the S8 LFS signal to video signals known to induce shock is significantly negatively correlated with θ rhythm energy at the F3 electrode position.

3.2. Classification Algorithm Data Results and Analysis

3.2.1. Validation Set Results Based on 5.1 Sound Format and SW Mono-Channel Emotion Recognition

For movie scene segmentation with window lengths of 6–9 s, the validation accuracy results of the three models trained with either the six-channel audio features or the SW monaural audio features are shown in Table 8. Overall, the model trained with the full six-channel audio features achieved the highest validation accuracy of 91%, while the model trained with SW monaural audio features alone achieved a highest validation accuracy of 88%. Although the former is 3% higher than the latter, the latter improves computational speed by using only the separately extracted SW monaural features without significantly reducing shock detection accuracy, giving performance comparable to the former. This finding underlines the importance of SW monaural audio features for detecting shock scenes in 5.1-format movies.
Looking at the individual classification models, the SVM and CNN models achieved higher validation accuracy with all six-channel audio features than with SW monaural audio features, while the BPNN model showed the opposite trend: trained with SW monaural audio features, it achieved a validation accuracy 2% higher than when trained with all six-channel audio features. This result indicates that a classification model trained solely with SW monaural audio features can perform well in detecting shock scenes in 5.1-format movies, further validating the importance of LFS features for shock scene detection. The finalized features retained for the SW channel under each window length are shown in Table 9, where SW, number, mean, std, and diff denote the SW mono channel, the feature dimension, the mean, the standard deviation, and the maximum–minimum difference, respectively.

3.2.2. Test Set Results Based on 5.1 Sound Format and SW Mono-Channel Emotion Recognition

The accuracy results of the emotion scene classification algorithm based on the 5.1 audio format on the test set showed that the BPNN and CNN models had better generalization ability than the SVM model. When using a window length of 9 s for segmenting movie scenes, the CNN model achieved the highest test accuracy of 84%.
For the emotion classification algorithm based on the SW monaural audio, the test results likewise showed that the BPNN and CNN models had better generalization ability than the SVM model. The SVM model achieved its highest test accuracy of 78% with a 9 s window, the BPNN model achieved its highest accuracy of 82% with an 8 s window, and the CNN model achieved the highest accuracy among the three classification models, 87%, with a 9 s window.
When segmenting movie scenes with a window length of 9 s, six movie test scenes (three shock scenes and three scenes with other emotions) were input into the CNN model trained with either six-channel audio features or separately extracted SW monaural audio features. The predicted results of the six movie test scenes are shown in Table 10, where “1” represents correct prediction of a certain clip in the emotion scene and “0” represents incorrect prediction of a certain clip. There may be two reasons for the occurrence of continuous incorrect predictions in the predicted results. One could be insufficient generalization ability of the model, and the other could be the presence of a small portion of emotion audio signals in the selected movie test scenes that are different from the annotated emotion, leading to continuous misjudgments in emotion classification.

3.3. Test Set Results of Low-Pass Filter-Based Monophonic Sentiment Classification Algorithm

With a window length of 9 s, the extracted feature parameters of the low-pass filtered mono channel are shown in Table 11, where number, mean, std, and diff denote the feature dimension, the mean, the standard deviation, and the maximum–minimum difference, respectively. The movie scene clips were segmented with a 9 s window, and the low-pass filtered mono-channel features were used to train a CNN classification model. After optimizing the hyperparameters, the accuracy reached 84% on the training set and 83% on the validation set. The six test clips were then input into the trained CNN classification model, and the average accuracy over the test clips was used as the actual engineering test accuracy, which reached 91.5%. The predicted results for each movie test clip are shown in Table 12, where "1" represents a correct prediction of a clip in the emotion scene and "0" an incorrect prediction. With a 9 s window for segmenting the movie scene clips, the emotion classification algorithm based on the low-pass filtered mono-channel features achieved a test accuracy 4.5% higher than the algorithm based on the SW mono channel. This verifies the feasibility of the low-pass filtered mono-channel emotion classification algorithm for determining emotion scenes in movies in practical engineering and suggests that it may perform better than the SW mono-channel-based algorithm in practical applications. The reason for the better performance of the low-pass filtered mono-channel algorithm may be that the SW channel does not capture all the LFS content in the 5.1 audio format: the frequency range of the other five channels is 20 Hz–20 kHz, so they also contain some LFS components. After mixing all channels and low-pass filtering to obtain the mono channel, the resulting audio signal contains all the LFS information of the mix, and the extracted low-frequency audio features are more comprehensive, which benefits training the model to detect shock scenes.

4. Conclusions

Through subjective evaluation experiments and EEG measurements, it has been demonstrated that LFS has a significant impact on and regulatory effect over induced emotions of happiness, sadness, and shock. LFS signals generally have an inhibitory effect on happiness in audiovisual scenes but can enhance sadness and shock to varying degrees, with corresponding changes in rhythmic energy in brain functional areas. Based on these findings, this paper proposes an emotion scene classification algorithm based on the SW monaural channel, demonstrates its feasibility for detecting shock scenes in 5.1-sound-format movies, and confirms the importance of LFS features for detecting shock scenes. Furthermore, in order to expand the practical engineering application scenarios of LFS-based audiovisual scene emotion classification algorithms, this paper also proposes an emotion classification algorithm based on a low-pass filtered monaural channel, which yields an automatic movie shock scene detection model with higher generalization ability, a milestone toward the practical engineering application of emotion-type determination for multichannel movie scenes.

Author Contributions

P.J. conducted the experiments and was primarily responsible for writing the paper; X.X. wrote all the code; Z.S. and X.X. edited and proofread the manuscript; H.W. supervised this research. All authors have read and agreed to the published version of the manuscript.

Funding

Guangxi Science and Technology Base and Talent Special Project.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available upon request.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

Low-frequency sound (LFS)
Low-frequency noise (LFN)
Electroencephalography (EEG)
Functional magnetic resonance imaging (fMRI)
Support vector machine (SVM)
Convolutional neural network (CNN)
Long short-term memory (LSTM)
K-nearest neighbors (KNN)
Broad learning (BL)
Kernel density estimation (KDE)
Standard deviation (STD)
Backpropagation neural network (BPNN)
Mel-frequency cepstral coefficient (MFCC)
Zero crossing rate (ZCR)
Subwoofer (SW)
Analysis of variance (ANOVA)
Relative power spectral density (RPSD)

References

1. Silva, L.T.; Magalhães, A.; Silva, J.F.; Fonseca, F. Impacts of low-frequency noise from industrial sources in residential areas. Appl. Acoust. 2021, 182, 108203.
2. Leventhall, G. Low Frequency Noise. What we know, what we do not know, and what we would like to know. J. Low Freq. Noise Vib. Act. Control 2009, 28, 79–104.
3. Rossi, L.; Prato, A.; Lesina, L.; Schiavi, A. Effects of low-frequency noise on human cognitive performances in laboratory. Build. Acoust. 2018, 25, 17–33.
4. Javadi, A.; Pourabdian, S.; Forouharmajd, F. The Effect of Low Frequency Noise on Working Speed and Annoyance. Iran J. Public Health 2022, 51, 2634–2635.
5. Fuchs, G.; Verzini, A.; Ortiz Skarp, A. The effects of low frequency noise on man: Two experiments. In Proceedings of the International Congress on Noise Control Engineering, Liverpool, UK, 30 July–2 August 1996; pp. 2137–2140.
6. Pawlaczyk-Luszcaynska, M.; Dudarewicz, A.; Waszkowska, M. Annoyance of low frequency noise in control rooms. In Proceedings of the 2002 International Congress and Exposition on Noise Control Engineering, Dearborn, MI, USA, 19–21 August 2002; pp. 1604–1609.
7. Guski, R.; Felscher-Suhr, U.; Schuemer, R. The concept of noise annoyance: How international experts see it. J. Sound Vib. 1999, 223, 513–527.
8. French, C.C.; Haque, U.; Bunton-Stasyshyn, R.; Davis, R. The “Haunt” project: An attempt to build a “haunted” room by manipulating complex electromagnetic fields and infrasound. Cortex 2009, 45, 619–629.
9. O’Keeffe, C.; Angliss, S. The subjective effects of infrasound in a live concert setting. In Proceedings of the CIM04: Conference on Interdisciplinary Musicology, Graz, Austria, 15–18 April 2004; pp. 132–133.
10. Leske, S.; Tse, A.; Oosterhof, N.N.; Hartmann, T.; Müller, N.; Keil, J.; Weisz, N. The strength of alpha and beta oscillations parametrically scale with the strength of an illusory auditory percept. Neuroimage 2014, 88, 69–78.
11. Cho, W.; Hwang, S.-H.; Choi, H. An investigation of the influences of noise on EEG power bands and visual cognitive responses for human-oriented product design. J. Mech. Sci. Technol. 2011, 25, 821–826.
12. Mocanu, B.; Tapu, R.; Zaharia, T. Utterance level feature aggregation with deep metric learning for speech emotion recognition. Sensors 2021, 21, 4233.
13. Dai, W.; Han, D.; Dai, Y.; Xu, D. Emotion recognition and affective computing on vocal social media. Inf. Manag. 2015, 52, 777–788.
14. Van Bezooijen, R.; Otto, S.A.; Heenan, T.A. Recognition of vocal expressions of emotion: A three-nation study to identify universal characteristics. J. Cross-Cult. Psychol. 1983, 14, 387–406.
15. Nicholson, J.; Takahashi, K.; Nakatsu, R. Emotion recognition in speech using neural networks. Neural Comput. Appl. 2000, 9, 290–296.
16. Wu, D.; Parsons, T.D.; Mower, E.; Narayanan, S. Speech emotion estimation in 3D space. In Proceedings of the 2010 IEEE International Conference on Multimedia and Expo, Singapore, 19–23 July 2010; pp. 737–742.
17. Karadoğan, S.G.; Larsen, J. Combining semantic and acoustic features for valence and arousal recognition in speech. In Proceedings of the 2012 3rd International Workshop on Cognitive Information Processing (CIP), Baiona, Spain, 28–30 May 2012; pp. 1–6.
18. Grimm, M.; Kroschel, K.; Narayanan, S. Support Vector Regression for Automatic Recognition of Spontaneous Emotions in Speech. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, HI, USA, 16–20 April 2007.
19. Giannakopoulos, T.; Pikrakis, A.; Theodoridis, S. A dimensional approach to emotion recognition of speech from movies. In Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan, 19–24 April 2009; pp. 65–68.
20. Mao, Q.; Dong, M.; Huang, Z.; Zhan, Y. Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Trans. Multimed. 2014, 16, 2203–2213.
21. Tzirakis, P.; Zhang, J.; Schuller, B.W. End-to-end speech emotion recognition using a deep convolutional recurrent network. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018.
22. Schuller, B.W. Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends. Commun. ACM 2018, 61, 90–99.
23. Wang, X.; Wang, M.; Qi, W.; Su, W.; Wang, X.; Zhou, H. A novel end-to-end speech emotion recognition network with stacked transformer layers. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6289–6293.
24. Jing, D.; Manting, T.; Li, Z. Transformer-like model with linear attention for speech emotion recognition. J. Southeast Univ. 2021, 37, 164–170.
25. Ren, J.-M.; Wu, M.-J.; Jang, J.-S.R. Automatic music mood classification based on timbre and modulation features. IEEE Trans. Affect. Comput. 2015, 6, 236–246.
26. Fu, Z.; Lu, G.; Ting, K.M.; Zhang, D. A survey of audio-based music classification and annotation. IEEE Trans. Multimed. 2010, 13, 303–319.
27. Baniya, B.K.; Hong, C.S.; Lee, J. Nearest multi-prototype based music mood classification. In Proceedings of the IEEE/ACIS International Conference on Computer & Information Science, Las Vegas, NV, USA, 28 June–1 July 2015.
28. Tang, H.; Chen, N. Combining CNN and broad learning for music classification. IEICE Trans. Inf. Syst. 2020, 103, 695–701.
29. Brodmann, K. Vergleichende Lokalisationslehre der Großhirnrinde in Ihren Prinzipien Dargestellt auf Grund des Zellenbaues; von Johann Ambrosius Barth: Leipzig, Germany, 1909.
30. Mohan, M.; Dhanalakshmi, P.; Kumar, R.S. Speech Emotion Classification using Ensemble Models with MFCC. Procedia Comput. Sci. 2023, 218, 1857–1868.
31. Ruan, P.; Zheng, X.; Qiu, Y.; Hao, Z. A Binaural MFCC-CNN Sound Quality Model of High-Speed Train. Appl. Sci. 2022, 12, 12151.
32. Tu, Z.; Liu, B.; Zhao, W.; Yan, R.; Zou, Y. A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition. Appl. Sci. 2023, 13, 4124.
33. Noble, W.S. What is a support vector machine? Nat. Biotechnol. 2006, 24, 1565–1567.
34. Feradov, F.; Mporas, I.; Ganchev, T. Evaluation of Features in Detection of Dislike Responses to Audio–Visual Stimuli from EEG Signals. Computers 2020, 9, 33.
35. Trapanotto, M.; Nanni, L.; Brahnam, S.; Guo, X. Convolutional Neural Networks for the Identification of African Lions from Individual Vocalizations. J. Imaging 2022, 8, 96.
36. Alluhaidan, A.S.; Saidani, O.; Jahangir, R.; Nauman, M.A.; Neffati, O.S. Speech Emotion Recognition through Hybrid Features and Convolutional Neural Network. Appl. Sci. 2023, 13, 4750.
37. Liu, T.; Yan, D.; Wang, R.; Yan, N.; Chen, G. Identification of Fake Stereo Audio Using SVM and CNN. Information 2021, 12, 263.
38. Kamalov, F. Kernel density estimation based sampling for imbalanced class distribution. Inf. Sci. 2020, 512, 1192–1201.
39. Wang, S.; Wang, J.; Chung, F.L. Kernel Density Estimation, Kernel Methods, and Fast Learning in Large Data Sets. IEEE Trans. Cybern. 2014, 44, 1–20.
40. Martínez-Camblor, P.; de Uña-Álvarez, J. Non-parametric k-sample tests: Density functions vs distribution functions. Comput. Stat. Data Anal. 2009, 53, 3344–3357.
41. Jain, S.; Jadon, R. Audio based movies characterization using neural network. Int. J. Comput. Sci. Appl. 2008, 1, 87–90.
42. Liu, X.-Y.; Wu, J.; Zhou, Z.-H. Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. Part B 2008, 39, 539–550.
43. Bao, L.; Juan, C.; Li, J.; Zhang, Y. Boosted Near-miss Under-sampling on SVM ensembles for concept detection in large-scale imbalanced datasets. Neurocomputing 2016, 172, 198–206.
44. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
45. Han, H.; Wang, W.-Y.; Mao, B.-H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of the Advances in Intelligent Computing: International Conference on Intelligent Computing, ICIC 2005, Hefei, China, 23–26 August 2005; pp. 878–887.
Figure 1. Happy, sad, and shocked emotion-inducing image dataset.
Figure 2. EMOTIV wireless portable EEG device.
Figure 3. Wearing of the wireless portable EEG device.
Figure 4. Complete process of emotion detection for impressive scenes in 5.1 sound format movies.
Figure 5. Graphical representation of the separating hyperplane of an SVM classification model.
Figure 6. BPNN classification model.
Figure 7. CNN classification model.
Figure 8. The KDE plots of this feature parameter on the training and test sets overlap well, indicating that its distribution differs little between the two sets; the feature is therefore retained.
Figure 9. The KDE plots of this feature parameter on the training and test sets overlap poorly, indicating a large difference between its distributions on the two sets; the feature is therefore rejected.
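For illustration, the retention criterion shown in Figures 8 and 9 can be sketched as an overlap check between the two KDE curves. The sketch below uses SciPy's Gaussian KDE on stand-in DataFrames; the feature columns, sample sizes, and the 0.8 overlap threshold are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Stand-in data: replace with the real windowed feature tables.
train_df = pd.DataFrame({"mfcc_0_SW": rng.normal(0, 1, 300), "cent_diff_SW": rng.normal(0, 1, 300)})
test_df = pd.DataFrame({"mfcc_0_SW": rng.normal(0, 1, 100), "cent_diff_SW": rng.normal(2, 1, 100)})

def kde_overlap(a, b, n_points=512):
    """Approximate overlap area of two KDE curves on a shared grid (at most 1.0)."""
    grid = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), n_points)
    return np.trapz(np.minimum(gaussian_kde(a)(grid), gaussian_kde(b)(grid)), grid)

# Keep a feature only if its train/test distributions overlap strongly
# (0.8 is an illustrative threshold).
selected = [c for c in train_df.columns
            if kde_overlap(train_df[c].to_numpy(), test_df[c].to_numpy()) > 0.8]
print(selected)
```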
Figure 10. Heat map of correlation between SW channel eigenvalues and labels under 9 s window length segmentation.
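A correlation heat map such as Figure 10 can be produced from a feature table in a few lines. In this sketch the DataFrame, its column names, and the choice of Pearson correlation are all illustrative assumptions.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Stand-in frame: columns would be the 9 s SW-channel eigenvalues plus a 0/1 label.
features_df = pd.DataFrame(rng.normal(size=(200, 4)),
                           columns=["mfcc_0_SW", "mfcc_8_SW", "melspec_std_SW", "cent_diff_SW"])
features_df["label"] = rng.integers(0, 2, 200)

corr = features_df.corr(method="pearson")  # Pearson is an illustrative choice
sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1, square=True, annot=True, fmt=".2f")
plt.title("SW-channel features vs. label (9 s windows)")
plt.tight_layout()
plt.show()
```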
Figure 11. Near-miss undersampling process for handling data imbalance.
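The near-miss undersampling step of Figure 11 is available in the imbalanced-learn package. The sketch below balances stand-in data; the class sizes and the choice of the NearMiss-1 variant are illustrative assumptions rather than the exact configuration used by the authors.

```python
import numpy as np
from collections import Counter
from imblearn.under_sampling import NearMiss

rng = np.random.default_rng(0)
# Stand-in imbalanced data: more "other emotion" windows (0) than "shock" windows (1).
X = rng.normal(size=(300, 26))
y = np.array([0] * 220 + [1] * 80)

undersampler = NearMiss(version=1)          # NearMiss-1 is an illustrative choice
X_bal, y_bal = undersampler.fit_resample(X, y)
print("before:", Counter(y), "after:", Counter(y_bal))
```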
Figure 12. Relative values of the subjective ratings of the 5 video signals on the 3 emotions.
Figure 13. EEG rhythm energy changes evoked by the addition of each LFS in 3 known emotional conditions relative to the original video signal.
Table 1. Frequency composition of LFS.

| Signals | f1/Hz | f2/Hz | f3/Hz | f4/Hz | f5/Hz | f6/Hz | f7/Hz |
| S1 | 15 | - | - | - | - | - | 105 |
| S2 | 15 | 30 | - | - | 60 | 90 | - |
| S3 | 18 | 36 | - | 54 | - | - | - |
| S4 | 16 | - | - | 48 | - | - | - |
| S5 | 16 | - | - | - | 64 | - | - |
| S6 | 30 | - | - | 90 | - | - | - |
| S7 | 25 | 50 | - | 75 | 100 | - | - |
| S8 | 28 | - | - | 84 | 112 | - | - |
| S9 | 30 | 60 | - | 90 | - | - | - |
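For reference, a test signal such as S2 in Table 1 (components at 15, 30, 60, and 90 Hz) can be approximated as a sum of sine tones. The sample rate, duration, and equal component amplitudes in this sketch are assumptions; Table 1 specifies only the frequencies.

```python
import numpy as np

def synthesize_lfs(freqs_hz, duration_s=10.0, sr=48000):
    """Sum equal-amplitude sine tones at the given frequencies (illustrative only)."""
    t = np.arange(int(duration_s * sr)) / sr
    signal = sum(np.sin(2 * np.pi * f * t) for f in freqs_hz)
    return signal / np.max(np.abs(signal))  # normalize to [-1, 1]

s2 = synthesize_lfs([15, 30, 60, 90])       # frequency components of S2 from Table 1
```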
Table 2. Information on the movie scene clip database, including 10 clips for impressive scenes and 11 clips for other emotion scenes.

|  | Shock Clips | Other Emotion Clips |
| Number of Scene Clips | 10 | 11 |
| Minimum Duration (s) | 79 | 39 |
| Maximum Duration (s) | 239 | 195 |
| Average Duration (s) | 158 | 139 |
Table 3. Audio eigenvalues extracted from the time domain, frequency domain, and cepstral domain in the initial screening.

| Time Domain | Frequency Domain | Cepstral Domain |
| Zero Crossing Rate, Energy, Energy Entropy | Spectral_Centroid, Tonnetz, Spectral_Rolloff, Spectral_Spread, Spectral_RMS, Central, Spectral_Flux, Spectral_Flatness, Spectral_Bandwidth, Spectral_Contrast, Chroma_1–Chroma_12 | MFCC_0–MFCC_23, Melspectrogram |
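Most of the descriptors in Table 3 are available directly in librosa. The following sketch extracts a representative subset for one channel; the stand-in signal, sample rate, and frame/hop lengths are illustrative assumptions.

```python
import numpy as np
import librosa

sr = 48000
# Stand-in for one 9 s window of a single audio channel.
y = np.random.default_rng(0).normal(size=sr * 9).astype(np.float32)

frame, hop = 2048, 512  # illustrative analysis parameters
zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame, hop_length=hop)
rms = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop)
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, hop_length=hop)
flatness = librosa.feature.spectral_flatness(y=y, hop_length=hop)
contrast = librosa.feature.spectral_contrast(y=y, sr=sr, hop_length=hop)
chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop)      # Chroma_1..12
tonnetz = librosa.feature.tonnetz(y=y, sr=sr)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=24)                    # MFCC_0..23
melspec = librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop)
```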
Table 4. Subjective evaluation relative values of 3 emotion states.

| Emotion | Signals | Figure 1 | Figure 2 | Figure 3 | Figure 4 | Animated Figure |
| Happiness | S1 | −0.2 | −0.45 | 0.05 | −0.25 | −1 |
|  | S2 | −0.65 | −0.9 | −0.25 | −1.15 | −1.05 |
|  | S3 | −0.95 | −1.15 | −0.65 | −1.65 | −1.7 |
| Sadness | S4 | 0.15 | 0.05 | 0.1 | 0 | 0 |
|  | S5 | 0.2 | 0.35 | 0.1 | 0.2 | 0.3 |
|  | S6 | 1.05 | 0.6 | 0.9 | 0.35 | 0.85 |
| Shock | S7 | 0.35 | 0.3 | −0.1 | 0.55 | 0.05 |
|  | S8 | 0.35 | 0.4 | 0.15 | 0.25 | −0.15 |
|  | S9 | 0.3 | 0.75 | −0.1 | 0.6 | 0.25 |
Table 5. Comparison of the average relative power spectral density ratios between the LFS-added and original video signals at four lead positions.

| Emotion | Lead Positions | θ Rhythm | α Rhythm | β Rhythm |
| Happiness | AF3 | 1.17 | 1.04 | 0.94 |
|  | AF4 | 1.15 | 1.05 | 0.91 |
|  | F3 | 1.17 | 1.09 | 0.94 |
|  | F4 | 1.18 | 1.07 | 0.90 |
| Sadness | AF3 | 1.05 | 1.08 | 1.21 |
|  | AF4 | 1.11 | 1.03 | 1.18 |
|  | F3 | 1.10 | 0.94 | 1.21 |
|  | F4 | 1.06 | 1.03 | 1.13 |
| Shock | AF3 | 0.87 | 0.99 | 1.22 |
|  | AF4 | 0.84 | 1.02 | 1.25 |
|  | F3 | 0.85 | 0.97 | 1.19 |
|  | F4 | 0.85 | 0.96 | 1.20 |
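The ratios in Table 5 compare EEG band power with and without the added LFS. One way to compute such a ratio is sketched below with Welch's method; the sampling rate, epoch length, and band limits (θ 4–8 Hz, α 8–13 Hz, β 13–30 Hz) are commonly used assumptions, not values restated from the paper, and the epochs here are random stand-ins.

```python
import numpy as np
from scipy.signal import welch

fs = 128                                                     # assumed EEG sampling rate (Hz)
rng = np.random.default_rng(0)
eeg_original, eeg_with_lfs = rng.normal(size=(2, fs * 60))   # stand-in 60 s epochs for one lead

def band_power(x, fs, band):
    """Integrate the Welch PSD over a frequency band."""
    freqs, psd = welch(x, fs=fs, nperseg=2 * fs)
    mask = (freqs >= band[0]) & (freqs < band[1])
    return np.trapz(psd[mask], freqs[mask])

for name, band in {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}.items():
    ratio = band_power(eeg_with_lfs, fs, band) / band_power(eeg_original, fs, band)
    print(name, round(ratio, 2))
```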
Table 6. Relative power spectral density ratio between the LFS-added and original video signals: comparison among different LFSs.

| Emotion | Signals | θ Rhythm | α Rhythm | β Rhythm |
| Happiness | S1 | 1.12 | 1.10 | 0.95 |
|  | S2 | 1.14 | 1.00 | 0.93 |
|  | S3 | 1.24 | 1.09 | 0.89 |
| Sadness | S4 | 1.04 | 1.09 | 1.25 |
|  | S5 | 1.08 | 1.02 | 1.18 |
|  | S6 | 1.12 | 0.95 | 1.12 |
| Shock | S7 | 0.88 | 1.02 | 1.18 |
|  | S8 | 0.87 | 1.01 | 1.21 |
|  | S9 | 0.81 | 0.93 | 1.26 |
Table 7. Significant correlation results between subjective and objective parameters.

| Signals | Lead Positions | Rhythm | Correlation Coefficient |
| S1 | F3 | β | 0.975 |
| S2 | AF4 | β | 0.900 |
| S4 | AF4 | α | 0.975 |
| S8 | F3 | θ | −0.900 |
Table 8. Validation set accuracy of models trained using feature information from all six channels of the 5.1 format and from the SW channel alone, respectively.

| Length (s) | SVM, 6 Channels | SVM, SW Channel | BPNN, 6 Channels | BPNN, SW Channel | CNN, 6 Channels | CNN, SW Channel |
| 6 | 63% | 87% | 79% | 88% | 77% | 74% |
| 7 | 91% | 87% | 82% | 88% | 84% | 83% |
| 8 | 76% | 87% | 86% | 78% | 84% | 73% |
| 9 | 89% | 83% | 86% | 82% | 80% | 80% |
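Table 8 compares an SVM, a BPNN, and a CNN trained on the same window-level feature vectors. A minimal sketch of such a comparison is given below with scikit-learn and Keras; the data are random stand-ins and every hyperparameter (kernel, layer sizes, filters, epochs) is an illustrative assumption, not the configuration reported in the paper.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 26))                  # stand-in window-level feature vectors
y = rng.integers(0, 2, 400)                     # stand-in shock/other labels

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_va = scaler.transform(X_tr), scaler.transform(X_va)

svm = SVC(kernel="rbf").fit(X_tr, y_tr)                                            # SVM
bpnn = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000).fit(X_tr, y_tr)   # BPNN

cnn = tf.keras.Sequential([                                                        # small 1-D CNN
    tf.keras.layers.Reshape((X_tr.shape[1], 1), input_shape=(X_tr.shape[1],)),
    tf.keras.layers.Conv1D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
cnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
cnn.fit(X_tr, y_tr, epochs=30, batch_size=16, verbose=0)

print("SVM :", svm.score(X_va, y_va))
print("BPNN:", bpnn.score(X_va, y_va))
print("CNN :", cnn.evaluate(X_va, y_va, verbose=0)[1])
```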
Table 9. Features selected from the SW channel for each segmentation window length. The number denotes the order of the selected feature parameter, “SW” denotes the subwoofer channel, “std” the standard deviation, “mean” the average, and “diff” the difference between the maximum and minimum.

| Length | Eigenvalues Determined from SW Channel Selection |
| 6 s (1 × 25) | ‘mfcc_0_SW’, ‘mfcc_15_SW’, ‘mfcc_std_0_SW’, ‘mfcc_diff_10_SW’, ‘mfcc_diff_18_SW’, ‘melspec_diff_SW’, ‘mfcc_6_SW’, ‘mfcc_18_SW’, ‘mfcc_diff_1_SW’, ‘mfcc_diff_17_SW’, ‘cent_diff_SW’, ‘specroll_diff_SW’, ‘mfcc_8_SW’, ‘mfcc_23_SW’, ‘mfcc_diff_6_SW’, ‘mfcc_diff_19_SW’, ‘melspec_std_SW’, ‘tonnetz_std_SW’, ‘mfcc_10_SW’, ‘mfcc_16_SW’, ‘mfcc_diff_8_SW’, ‘mfcc_19_SW’, ‘mfcc_diff_21_SW’, ‘tonnetz_diff_SW’, ‘specroll_std_SW’ |
| 7 s (1 × 25) | ‘mfcc_0_SW’, ‘mfcc_10_SW’, ‘mfcc_23_SW’, ‘mfcc_diff_6_SW’, ‘mfcc_diff_19_SW’, ‘melspec_std_SW’, ‘mfcc_6_SW’, ‘mfcc_19_SW’, ‘mfcc_16_SW’, ‘mfcc_diff_8_SW’, ‘mfcc_diff_18_SW’, ‘melspec_diff_SW’, ‘mfcc_8_SW’, ‘mfcc_15_SW’, ‘mfcc_std_0_SW’, ‘mfcc_diff_10_SW’, ‘mfcc_std_20_SW’, ‘tonnetz_std_SW’, ‘mfcc_18_SW’, ‘mfcc_diff_1_SW’, ‘mfcc_diff_17_SW’, ‘mfcc_diff_21_SW’, ‘tonnetz_diff_SW’, ‘specroll_diff_SW’, ‘specroll_std_SW’ |
| 8 s (1 × 22) | ‘mfcc_0_SW’, ‘mfcc_10_SW’, ‘mfcc_23_SW’, ‘mfcc_13_SW’, ‘tonnetz_std_SW’, ‘tonnetz_diff_SW’, ‘mfcc_4_SW’, ‘mfcc_11_SW’, ‘mfcc_15_SW’, ‘specroll_diff_SW’, ‘melspec_diff_SW’, ‘specroll_std_SW’, ‘mfcc_6_SW’, ‘mfcc_12_SW’, ‘mfcc_17_SW’, ‘mfcc_diff_10_SW’, ‘melspec_std_SW’, ‘cent_diff_SW’, ‘mfcc_8_SW’, ‘mfcc_19_SW’, ‘mfcc_16_SW’, ‘mfcc_diff_8_SW’ |
| 9 s (1 × 26) | ‘mfcc_0_SW’, ‘mfcc_11_SW’, ‘mfcc_14_SW’, ‘mfcc_diff_1_SW’, ‘cent_std_SW’, ‘specroll_mean_SW’, ‘mfcc_2_SW’, ‘mfcc_10_SW’, ‘mfcc_16_SW’, ‘mfcc_diff_8_SW’, ‘melspec_diff_SW’, ‘tonnetz_diff_SW’, ‘mfcc_6_SW’, ‘mfcc_12_SW’, ‘mfcc_19_SW’, ‘mfcc_diff_11_SW’, ‘mfcc_diff_17_SW’, ‘melspec_std_SW’, ‘mfcc_8_SW’, ‘mfcc_15_SW’, ‘mfcc_diff_0_SW’, ‘mfcc_diff_10_SW’, ‘mfcc_diff_18_SW’, ‘mfcc_20_SW’, ‘mfcc_std_20_SW’, ‘mfcc_std_21_SW’ |
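The “_std” and “_diff” suffixes in Table 9 denote per-window summary statistics of the frame-level feature tracks. The sketch below illustrates one plausible mapping, where the un-suffixed name is taken to be the per-window mean; that interpretation, along with the stand-in signal and sample rate, is an assumption for illustration only.

```python
import numpy as np
import librosa

def window_statistics(y_window, sr):
    """Per-window summary statistics of frame-level SW-channel features."""
    feats = {}
    mfcc = librosa.feature.mfcc(y=y_window, sr=sr, n_mfcc=24)       # shape (24, n_frames)
    for k in range(mfcc.shape[0]):
        track = mfcc[k]
        feats[f"mfcc_{k}_SW"] = track.mean()                        # e.g. 'mfcc_8_SW' (assumed mean)
        feats[f"mfcc_std_{k}_SW"] = track.std()                     # e.g. 'mfcc_std_0_SW'
        feats[f"mfcc_diff_{k}_SW"] = track.max() - track.min()      # e.g. 'mfcc_diff_10_SW'
    cent = librosa.feature.spectral_centroid(y=y_window, sr=sr)[0]
    feats["cent_std_SW"] = cent.std()
    feats["cent_diff_SW"] = cent.max() - cent.min()
    return feats

sr = 48000
demo = window_statistics(np.random.default_rng(0).normal(size=sr * 9).astype(np.float32), sr)
```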
Table 10. Window-level prediction results for the six movie test clips obtained with the trained CNN models (window length of 9 s).

| Movie Scene | 5.1 Channels | SW Mono |
| Other emotion Scene 1 | Correct: 9; Error: 2; Accuracy: 81.8% | Correct: 9; Error: 2; Accuracy: 81.8% |
| Other emotion Scene 2 | Correct: 10; Error: 2; Accuracy: 83.3% | Correct: 11; Error: 1; Accuracy: 91.7% |
| Other emotion Scene 3 | Correct: 7; Error: 1; Accuracy: 87.5% | Correct: 7; Error: 1; Accuracy: 87.5% |
| Shock Scene 1 | Correct: 9; Error: 1; Accuracy: 90% | Correct: 9; Error: 1; Accuracy: 90% |
| Shock Scene 2 | Correct: 14; Error: 3; Accuracy: 82.4% | Correct: 14; Error: 3; Accuracy: 82.4% |
| Shock Scene 3 | Correct: 11; Error: 2; Accuracy: 84.6% | Correct: 12; Error: 1; Accuracy: 92.3% |
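The per-clip figures in Table 10 simply tally correct and erroneous window-level decisions. A minimal sketch of that aggregation is shown below; the example prediction list is hypothetical and merely reproduces the counts of Shock Scene 1.

```python
import numpy as np

def score_clip(window_predictions, true_label):
    """Tally correct and erroneous window-level decisions for one movie scene clip."""
    preds = np.asarray(window_predictions)
    correct = int(np.sum(preds == true_label))
    errors = int(preds.size - correct)
    return correct, errors, correct / preds.size

# e.g. a shock clip split into 10 windows, one of which is misclassified:
print(score_clip([1, 1, 1, 0, 1, 1, 1, 1, 1, 1], true_label=1))   # -> (9, 1, 0.9)
```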
Table 11. Features determined from low-pass filtered monaural channel information for training on 5.1 sound format movies. The number denotes the order of the selected feature parameter, “std” denotes the standard deviation, “mean” the average, and “diff” the difference between the maximum and minimum. As the features are extracted in mono, no channel identifier is appended.

| Length | Eigenvalues Determined from Low-Pass Filtered Monaural Channel Selection |
| 9 s (1 × 82) | ‘mfcc_0’–‘mfcc_23’, ‘mfcc_std_0’–‘mfcc_std_23’, ‘mfcc_diff_0’–‘mfcc_diff_23’, ‘cent_mean’, ‘cent_std’, ‘cent_diff’, ‘specroll_mean’, ‘specroll_std’, ‘specroll_diff’, ‘tonnetz_std’, ‘tonnetz_diff’, ‘melspec_std’, ‘melspec_diff’ |
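The low-pass monaural variant behind Table 11 mixes the audio down to one channel and keeps only content below the 120 Hz cutoff stated in the abstract. The sketch below uses a Butterworth filter; the filter order, the equal-weight downmix, and the stand-in audio are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def lowpass_mono(audio, sr, cutoff_hz=120.0, order=4):
    """Downmix a multichannel clip to mono and keep content below cutoff_hz."""
    # audio: (n_samples, n_channels) array, e.g. the six channels of a 5.1 clip.
    mono = audio.mean(axis=1)                              # equal-weight downmix (assumption)
    sos = butter(order, cutoff_hz, btype="low", fs=sr, output="sos")
    return sosfiltfilt(sos, mono)                          # zero-phase Butterworth low-pass

sr = 48000
six_channels = np.random.default_rng(0).normal(size=(sr * 9, 6))   # stand-in 9 s, 5.1 clip
low_mono = lowpass_mono(six_channels, sr)
```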
Table 12. Window-level prediction results of the emotion classification algorithm based on the low-pass filtered monaural channel, using the CNN model with a window length of 9 s.

| Movie Scene Clip | Low-Pass Filtered Monaural |
| Other emotion Scene 1 | Correct: 9; Error: 2; Accuracy: 81.8% |
| Other emotion Scene 2 | Correct: 10; Error: 2; Accuracy: 83.3% |
| Other emotion Scene 3 | Correct: 8; Error: 0; Accuracy: 100% |
| Shock Scene 1 | Correct: 10; Error: 0; Accuracy: 100% |
| Shock Scene 2 | Correct: 17; Error: 0; Accuracy: 100% |
| Shock Scene 3 | Correct: 11; Error: 2; Accuracy: 84.6% |
