Article

Combined Data Augmentation on EANN to Identify Indoor Anomalous Sound Event

Xiyu Song, Junhan Xiong, Mei Wang, Qingshan Mei and Xiaodong Lin
1 Key Laboratory of Cognitive Radio and Information Processing, Ministry of Education, Guilin University of Electronic Technology, Guilin 541004, China
2 College of Information Science and Engineering, Guilin University of Technology, Guilin 541006, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(4), 1327; https://doi.org/10.3390/app14041327
Submission received: 15 January 2024 / Revised: 29 January 2024 / Accepted: 1 February 2024 / Published: 6 February 2024

Abstract

Indoor abnormal sound event identification refers to the automatic detection and recognition of abnormal sounds in an indoor environment using computer auditory technology. However, the process of model training usually requires a large amount of high-quality data, which can be time-consuming and costly to collect. Utilizing limited data has become another preferred approach for such research, but it introduces overfitting issues for machine learning models on small datasets. To overcome this issue, we proposed and validated the framework of combining the offline augmentation of raw audio and online augmentation of spectral features, making the application of small datasets in indoor anomalous sound event identification more feasible. Along with this, an improved two-dimensional audio convolutional neural network (EANN) was also proposed to evaluate and compare the impacts of different data augmentation methods under the framework on the sensitivity of sound event identification. Moreover, we further investigated the performance of four combinations of data augmentation techniques. Our research shows that the proposed combined data augmentation method has an accuracy of 97.4% on the test dataset, which is 10.6% higher than the baseline method. This demonstrates the method’s potential in the identification of indoor abnormal sound events.

1. Introduction

Anomalous sound event detection (SED) is a relatively new topic in audio and speech processing that involves digital signal processing, sound identification, and machine learning. Compared to video-based detection, it offers lower cost and smaller bandwidth, storage, and computing requirements. In applications such as the home environment [1,2], biometric monitoring, environmental sound event detection, and indoor emergency monitoring [3,4,5], indoor abnormal sound identification has gained favor for being safe, low-cost, accurate, and effective [6].
In recent years, the Gaussian mixture model (GMM) and hidden Markov model (HMM) have proven less suitable for the recognition and classification of abnormal sound data because of their complex algorithm design and the difficulty of selecting appropriate features in unknown scenarios [7,8]. In contrast, deep learning models such as CNNs, RNNs, LSTMs, TDNNs, and ResNets [9,10,11,12,13] have gained popularity. Vafeiadis et al. [14] demonstrated that two-dimensional convolutional neural networks (2D CNNs) using Mel-spectrograms as input features exhibit excellent identification capabilities. Pandya et al. [15] showed that LSTM-CNN deep learning models outperform traditional recognition methods such as the support vector machine (SVM), k-nearest neighbors (KNN), and C4.5 decision trees, particularly in home environments.
Although deep learning has demonstrated its significant practical utility in various fields such as image processing, medical diagnosis, and speech recognition [16,17,18], the performance of deep learning models always relies heavily on data-driven conditions. In real-world applications, the inherent scarcity of abnormal sound data makes it difficult to meet this condition. In addition, issues like small datasets, inaccurate labels, and imbalanced event types are frequently encountered [19]. In summary, obtaining high-quality audio data to satisfy data-driven supervised or semi-supervised learning is particularly challenging, as it requires careful control of multiple factors to ensure data quality and prevent model overfitting. Therefore, data augmentation plays a crucial role in optimizing the dataset and reducing model overfitting [20].
Early data augmentation methods relied on singular transformations of the original audio, such as time stretching, pitch shifting, dynamic range compression, silence trimming, and background noise addition [21,22,23]. These methods often depended on the characteristics of the original data and required manual adjustment of transformation scales based on the context, which could increase computational costs during training. Then, augmentation methods inspired by traditional image processing techniques were adopted, where audio data were first converted into image data, and then image data augmentation was applied [24]. In recent years, the focus has shifted towards spectrogram data augmentation, where data augmentation is applied to the spectrogram features of audio. Techniques in this category include time warping, frequency masking, and time masking [25,26], which have yielded promising results. Building upon this foundation, enhanced spectrogram augmentation methods like Filteraugment [27] and SpecSub [28] have been introduced. These approaches transform spectrogram features to obtain a broader range of training information, further enhancing recognition accuracy. Additionally, the extension of sound samples can be achieved through the use of Generative Adversarial Network (GAN) models [29]. However, GAN networks are challenging to train, and current techniques still require a substantial amount of training sample data.
It is particularly noteworthy that combined data augmentation methods have gained widespread application. Abayomi-Alli et al. [24] achieved excellent recall using a composite data augmentation method that combined color transformations and noise addition applied to images. Jeong et al. [30] proposed a composite data augmentation method that processed both the original audio and its image representation, where the image-processing step used only salt-and-pepper noise for augmentation, yielding favorable results. Furthermore, building on [22,31], individual data augmentation techniques such as pitch shifting, background noise, time stretching, and silence trimming have been analyzed in detail for their impact and sensitivity, with the aim of combining the most effective techniques. These methods employed different combined data augmentation strategies and improved the accuracy of acoustic event recognition in their respective scenes. However, no dedicated data augmentation study exists for the indoor environment, and the effective combination of the audio's raw-waveform features with its spectrogram features in indoor environments has yet to be explored.
We have proposed a new data augmentation method and, on this basis, successfully completed the recognition of indoor abnormal sound events. First, indoor abnormal sound data were obtained from the public datasets ESC-50 and UrbanSound8k. Second, by combining techniques for processing the raw audio and spectrogram features, augmentation techniques such as pitch shifting, background noise, time stretching, and time-frequency masking were applied to the input sound events, resulting in augmented audio data. Finally, a 2D convolutional neural network model was used to recognize abnormal sound events. An embedding technique was introduced to transform sparse high-dimensional feature vectors into dense low-dimensional feature vectors, and the decoupled weight decay Adam algorithm (AdamW) was employed to optimize the model. Our results indicate a significant improvement over not using any data augmentation or employing only a single augmentation method.
Our contributions can be divided into two parts:
(i)
We studied the impact of different combinations of data augmentation methods on detection sensitivity in indoor abnormal sound environments, which can provide a quantitative reference for subsequent related research.
(ii)
We successfully demonstrated that a data augmentation framework combining offline processing of raw audio with online time-frequency masking of spectrograms (spectral features) can effectively improve the recognition ability of deep learning models for indoor abnormal sounds on limited datasets. Even with very limited data, it avoids model overfitting and is of practical value.
The structure of this paper is as follows: Section 2 presents the fundamental concepts and principles involved in this research, provides an overview of the data generation process, and offers a detailed explanation of the data augmentation methods and recognition models used in the experiments. Section 3 reports the results obtained from the various experiments. Section 4 is devoted to the analysis and comparison of the experimental results. Finally, Section 5 concludes the paper with the key findings.

2. Materials and Methods

Data augmentation is a crucial data manipulation process that makes minor modifications to the raw data without changing its underlying structure, thereby enhancing the quality and diversity of the data. When applied to audio data, data augmentation techniques help prevent model overfitting and improve generalization performance. The overall schematic diagram of the experiment is shown in Figure 1. To address the issues of dataset size and quality in indoor anomalous sound event recognition, we propose a combined data augmentation approach. The system input consists of training and testing datasets. We apply offline data augmentation to the raw audio of the training dataset and then apply online data augmentation to the MFCC spectrograms generated from it. These data are then fed into the neural network recognition model for training.

2.1. Dataset Generation

We primarily combined two publicly available datasets, ESC-50 and UrbanSound8k, to obtain indoor anomalous sound data. ESC-50 is a dataset that comprises 50 different types of sounds, with 40 samples for each sound, resulting in a total of 2000 samples. It includes common sounds such as animal calls, urban noises, and indoor sounds. UrbanSound8k is a dataset consisting of 10 categories of urban environmental sounds, with a total of 8732 samples. These samples encompass sounds like vehicle noises, human voices, gunshots, air conditioning sounds, and children playing.
To obtain indoor anomalous sound data samples, we selected data from these two datasets that were relevant to indoor anomalous sounds. We focused on eight categories of anomalies: coughing, cracking fire, crying baby, glass breaking, gunshots, sneezing, snoring, and screams. This dataset served as the input for our model. For the experiments involving Background Noise data augmentation, we acquired four categories of indoor-related background noise from freesound.org (accessed on 18 October 2023): air conditioning noise, rain noise, television noise, and fan noise. This not only expanded the dataset but also increased the diversity and representativeness of the indoor anomalous sound data.

2.2. Mel-Frequency Cepstral Coefficients

Mel-Frequency Cepstral Coefficients (MFCCs) are a widely adopted feature in the field of speech recognition, and their origins can be traced to the perceptual mechanisms of the human auditory system. The human auditory system has a unique structure that automatically distinguishes between the low-frequency and high-frequency components of audio signals, with the low-frequency components carrying the critical identification features. Based on this biological observation, researchers have adopted the Mel scale strategy: applying denser filters in the lower frequency range and sparser filters in the higher frequency range simulates the perceptual characteristics of the human ear. Compared with the Bark or ERB scales, the Mel scale accounts for the human ear's perception of both loudness and pitch; it better captures the global characteristics of sound and is more suitable for abnormal sound detection [32]. The relationship between Mel frequency and linear frequency can be approximated by the following equation:
$\mathrm{Mel}(f) = 2595 \times \log_{10}\left(1 + \frac{f}{700}\right)$ (1)
where f represents the frequency in Hertz (Hz). This mapping allows the essential information in the spectrogram to be converted into a Mel-frequency representation. Figure 2 illustrates the MFCC feature extraction process, which consists of preprocessing, the Fast Fourier Transform (FFT), the Mel filterbank, a logarithmic operation, and the Discrete Cosine Transform (DCT); the asterisk (*) in Figure 2 denotes the audio signal produced by the preceding step.
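As a quick numerical illustration of Equation (1), the following minimal Python sketch (our own helper names, not part of the paper's pipeline) converts frequencies between Hz and the Mel scale:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert frequency in Hz to the Mel scale, Equation (1)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, obtained by solving Equation (1) for f."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

if __name__ == "__main__":
    for f in (50, 700, 1000, 4000, 14000):
        print(f"{f:>6} Hz -> {hz_to_mel(f):8.2f} Mel")
```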
The preprocessing stage includes pre-emphasis, framing, and windowing. Pre-emphasis enhances the high-frequency content of the signal, making the spectral characteristics smoother. Framing divides the signal into short segments, each of which can be treated as a stationary process; a common practice is to overlap adjacent frames by half the frame length to ensure a smooth transition between frames. Windowing is applied to reduce signal truncation effects. After preprocessing, each frame undergoes an FFT, yielding the discrete spectrum $S(k)$ of the signal, as shown in Equation (2):
$S(k) = \sum_{n=0}^{N-1} s_w(n)\,\exp\left(-j\frac{2\pi kn}{N}\right), \quad 0 \le k < N$ (2)
where $s_w(n)$ represents the sound signal after windowing, and N denotes the number of points of the Fourier Transform. Subsequently, the energy spectrum is filtered using M sets of triangular bandpass filters to obtain the Mel spectrogram. Taking the logarithm of the Mel spectrogram and applying the inverse spectral transform converts a multiplicative signal into an additive one; this transformation emphasizes the low-frequency envelope characteristics and the high-frequency detail features. In the cepstral analysis, the DCT is applied to the M log-energies to obtain the L-order MFCC coefficients. The DCT is given by Equation (3), where $S(m)$ represents the logarithmic energy spectrum, M is the number of triangular filters, and L denotes the order of the MFCC coefficients (L = 12–16):
$C(n) = \sum_{m=0}^{M-1} S(m)\cos\left(\frac{n\pi(m + 0.5)}{M}\right), \quad n = 1, 2, \ldots, L$ (3)
The MFCC spectrograms of the types of abnormal sound events studied in this paper are illustrated in Figure 3, where the different colors represent the energy distribution.

2.3. Data Augmentation Methods

The data augmentation (DA) approach employed in this paper combines raw audio data augmentation with spectrogram data augmentation. It involves applying a combination of techniques such as Pitch Shifting, Background Noise, Time Stretching, and Time-Frequency Masking to the input sound events, resulting in enhanced audio data. For indoor abnormal acoustic events, we conducted a detailed investigation into the optimal performance of various data augmentation methods and further examined the performance of different combinations of data augmentation. The four types of augmentation methods used in this study were as follows.

2.3.1. Time Stretching (TS)

Time stretching adjusts the speed or duration of an audio signal without affecting its pitch. A deformation factor is selected to slow down or speed up the playback rate of an audio sample while the pitch remains unchanged. By changing the deformation factor s in Equation (4), the output $l'$ of the time stretching transformation takes on different lengths and speeds.
$l' = \frac{l}{s}, \quad s \in [0.8,\ 1.2]$ (4)
This study introduced six sets of scaling factors to stretch the original sound along its time axis at a fixed rate. The six sets of scaling factors used in this study are as follows: (0.80, 0.90, 1.00, 1.10, 1.20), (0.84, 0.92, 1.00, 1.08, 1.16), (0.88, 0.94, 1.00, 1.06, 1.12), (0.90, 0.95, 1.00, 1.05, 1.10), (0.92, 0.96, 1.00, 1.04, 1.08), and (0.96, 0.98, 1.00, 1.02, 1.04). Similar methods have also been explored in classification studies in the previous literature [33].
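As an illustrative sketch of how one such factor set could be applied (we assume the librosa library here; the paper does not name its audio toolkit, and the file name is a placeholder):

```python
import librosa

def time_stretch_set(y, factors=(0.88, 0.94, 1.00, 1.06, 1.12)):
    """Return one time-stretched copy of `y` per stretching factor.

    A factor s > 1 speeds the clip up (shorter output) and s < 1 slows it
    down; the pitch is left unchanged. The default set is the [0.88, 1.12]
    group from Section 2.3.1.
    """
    return [librosa.effects.time_stretch(y, rate=s) for s in factors]

# Example (hypothetical file): one stretched copy per factor.
# y, sr = librosa.load("cough_001.wav", sr=16000)
# augmented = time_stretch_set(y)
```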

2.3.2. Pitch Shifting (PS)

The pitch of an audio sample can be shifted up or down by several semitones without altering the playback speed and amplitude, as shown in Equation (5). These semitone shift factors, denoted as τ , are constrained within the range of [–10, 10] to ensure a uniform distribution of transformation effects.
$f_r' = f_r \times 2^{\tau/12}$ (5)
This study introduced five sets of shift factors for deformation. These five sets of shift factors are (−1.0, −0.5, 0.0, 0.5, 1.0), (−3.0, −1.5, 0.0, 1.5, 3.0), (−5, −2.5, 0.0, 2.5, 5), (−8, −4, 0.0, 4, 8), and (−10, −5, 0.0, 5, 10). Through this transformation, the frame rate of the original audio is altered while preserving the output characteristics of the original sound. Similar scale ranges have also been applied in the previous literature [34].
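A corresponding sketch for pitch shifting, under the same librosa assumption; each semitone offset τ scales the frequency by $2^{\tau/12}$ as in Equation (5):

```python
import librosa

def pitch_shift_set(y, sr, semitones=(-5.0, -2.5, 0.0, 2.5, 5.0)):
    """Return one pitch-shifted copy of `y` per semitone offset tau.

    Each copy is shifted by 2**(tau/12) in frequency while the playback
    speed and amplitude are preserved. The default set is the [-5, 5]
    group from Section 2.3.2.
    """
    return [librosa.effects.pitch_shift(y, sr=sr, n_steps=tau) for tau in semitones]

# Example (hypothetical file): one shifted copy per semitone offset.
# y, sr = librosa.load("glass_breaking_001.wav", sr=16000)
# augmented = pitch_shift_set(y, sr)
```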

2.3.3. Background Noise (BN)

While keeping the pitch and speed of the audio constant, the audio samples were mixed with background noise at different signal-to-noise ratios (SNR). Each instance was generated by blending the original audio with four environmental audio samples, which included indoor-related background noises such as air conditioning, rainfall, television, and fan sounds. The mixing process followed the formula below:
$y_{out}(i) = y + y_{noise}(i)$ (6)
where $y_{out}(i)$ represents the mixed audio output and i indexes the SNR level, which ranges from −10 dB to 20 dB. The SNR values used in this study are −10, −5, 0, 5, 10, 15, and 20 dB; a larger i corresponds to a higher SNR. Similar mixing methods have been adopted in the previous literature [22,34].
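One common way to realize Equation (6) is to rescale the noise so that the mixture reaches a target SNR before adding it to the signal; the scaling rule in the sketch below is standard practice rather than a detail stated in the paper:

```python
import numpy as np

def mix_at_snr(y, noise, snr_db):
    """Mix `noise` into `y` at the requested signal-to-noise ratio (dB)."""
    # Tile or trim the noise so both arrays have the same length.
    if len(noise) < len(y):
        noise = np.tile(noise, int(np.ceil(len(y) / len(noise))))
    noise = noise[: len(y)]

    signal_power = np.mean(y ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale so that 10*log10(signal_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return y + scale * noise

# Example: one copy per SNR level used in Section 2.3.3.
snr_levels = (-10, -5, 0, 5, 10, 15, 20)
# y, noise = ...  # raw clip and an indoor background noise clip (e.g., fan)
# augmented = [mix_at_snr(y, noise, snr) for snr in snr_levels]
```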

2.3.4. Online Augmentation

Online data augmentation applies random transformations to the spectrogram features generated from the training data. SpecAugment [25], originally applied to speech recognition, masks spectrogram features to achieve a data augmentation effect. It is not a simple preprocessing of the raw audio but rather a feature-based (visual-domain) method aimed at enriching the time-frequency characteristics of the audio; randomly masking certain regions helps prevent model overfitting and enhances the model's generalization ability. Frequency masking first draws a mask width f from the interval [0, F], where F is the frequency mask parameter. A starting frequency bin $f_0$ is then drawn from $[0, v - f]$ along the frequency axis, where v is the number of Mel frequency channels, and all values in the bins $[f_0, f_0 + f)$ are replaced with a constant mask value. Time masking works analogously: a mask width t is drawn from $[0, T]$, where T is the time mask parameter, a starting frame $t_0$ is drawn from $[0, \tau - t)$, where τ is the number of time frames, and the consecutive time steps $[t_0, t_0 + t)$ are masked.
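A minimal NumPy sketch of this masking procedure (the mask parameters F and T below are illustrative values, not the settings used in the experiments):

```python
import numpy as np

def spec_augment(spec, F=8, T=20, num_freq_masks=1, num_time_masks=1, mask_value=0.0):
    """Apply random frequency and time masking to a (v x tau) spectrogram.

    v is the number of Mel/MFCC channels, tau the number of time frames.
    """
    spec = spec.copy()
    v, tau = spec.shape
    rng = np.random.default_rng()

    for _ in range(num_freq_masks):
        f = rng.integers(0, F + 1)                 # mask width in frequency bins
        f0 = rng.integers(0, max(v - f, 0) + 1)    # starting bin
        spec[f0:f0 + f, :] = mask_value

    for _ in range(num_time_masks):
        t = rng.integers(0, T + 1)                 # mask width in time frames
        t0 = rng.integers(0, max(tau - t, 0) + 1)  # starting frame
        spec[:, t0:t0 + t] = mask_value

    return spec
```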
This paper focuses on sound event recognition, where, compared to speech recognition, the anomalous sound events are often shorter and cover a wider frequency range. Therefore, we combined this method with the raw-audio processing techniques of Time Stretching, Pitch Shifting, and Background Noise, proposing a combined data augmentation method that enhances the diversity of anomalous sound events in small datasets. The method is straightforward, computationally efficient, requires no additional data, and can be applied directly to audio. The MFCC spectrograms produced by these data augmentation methods are illustrated in Figure 4.

2.4. EANN Identification Model

Indoor anomalous sound events are aperiodic and random, and the background environment is uncertain [32]. The commonly used 1-D CNN models cannot meet the recognition requirements. In this paper, the recognition model is an improved 2-D Audio Neural Network (EANN): we added an Embedding layer before the fully connected layer of the ANN [9]. It further compresses the feature vector dimension while maintaining sensitivity to abnormal indoor sounds, and the additional Embedding layer is effective when training on a small amount of data. The core components of the EANN recognition model are convolutional blocks, pooling layers, an Embedding layer, and fully connected layers. To verify the effect of data augmentation on indoor anomalous sound detection, we used this model as the benchmark.
Each convolutional block of the EANN consists of two convolutional layers with a kernel size of 3 × 3. Batch normalization is applied between the convolutional layers, and the ReLU nonlinearity is used to accelerate and stabilize training. Average pooling of size 2 × 2 is applied to each convolutional layer for downsampling. Convolutional blocks primarily extract local features from the data, while pooling layers reduce its dimensionality and complexity. The Embedding layer [35] maps content from one space to another and is typically used in deep learning to transform high-dimensional sparse feature vectors into low-dimensional dense feature vectors. The low-dimensional dense feature space often offers better interpretability and generalization, capturing more details and features in the data and thereby improving the performance of the recognition model. In addition to the convolutional layers, pooling layers, and Embedding layer, the EANN model also includes a fully connected layer responsible for mapping feature vectors to class probabilities. The specific structure of the EANN recognition model is illustrated in Figure 5.
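The following PyTorch sketch illustrates this structure; the number of blocks, channel widths, embedding dimension, global pooling, and the placement of average pooling after each two-convolution block are our assumptions, since the exact configuration is specified only in Figure 5:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convolutions with batch normalization and ReLU, then 2x2 average pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.AvgPool2d(kernel_size=2),
        )

    def forward(self, x):
        return self.block(x)

class EANN(nn.Module):
    """Sketch of the EANN: conv blocks -> global pooling -> Embedding layer -> classifier."""
    def __init__(self, n_classes=8, embed_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            ConvBlock(1, 32), ConvBlock(32, 64), ConvBlock(64, 128),
        )
        self.embedding = nn.Linear(128, embed_dim)   # dense low-dimensional embedding
        self.dropout = nn.Dropout(0.5)
        self.classifier = nn.Linear(embed_dim, n_classes)

    def forward(self, x):            # x: (batch, 1, n_mfcc, time)
        x = self.features(x)
        x = x.mean(dim=(2, 3))       # global average pooling over time-frequency
        x = torch.relu(self.embedding(x))
        return self.classifier(self.dropout(x))

model = EANN()
logits = model(torch.randn(4, 1, 40, 128))   # 4 clips, 40 MFCCs, 128 frames
```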

2.5. AdamW Optimizer

AdamW is an optimization algorithm that builds upon Adam, a commonly used gradient descent optimization algorithm. Compared with conventional SGD and RMSProp, Adam adaptively adjusts the learning rate to improve training efficiency and accuracy. However, Adam has some issues, such as a suboptimal weight decay effect. To address this problem, the AdamW algorithm [36] decouples weight decay from the learning rate, making weight decay more accurate and effective. Specifically, AdamW regularizes the weights directly during the parameter update, achieving the desired weight decay effect. The advantages of AdamW include improved robustness and generalization capability and a reduced risk of overfitting. In this paper, the decoupled weight decay AdamW algorithm was used to optimize the model, thereby enhancing recognition accuracy and stability. The core computation of AdamW is as follows:
Calculation of the moving average of the gradient $m_t$:
$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$ (7)
Calculation of the moving average of the squared gradient $v_t$:
$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$ (8)
Update of the parameter vector $p_t$:
$p_t = p_{t-1} - \eta_t \left( \frac{\alpha \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda p_{t-1} \right)$ (9)
where the learning rate $\alpha = 0.001$, the exponential decay rates $\beta_1 = 0.9$ and $\beta_2 = 0.999$, the infinitesimal value $\epsilon = 10^{-8}$, the weight decay value $\lambda \in \mathbb{R}$, the time step is initialized to $t = 0$, the parameter vector $p_{t=0} \in \mathbb{R}^n$, the first-moment vector $m_{t=0} = 0$, the second-moment vector $v_{t=0} = 0$, the schedule multiplier $\eta_t \in \mathbb{R}$, and the bias-corrected estimates are $\hat{m}_t = m_t / (1 - \beta_1^t)$ and $\hat{v}_t = v_t / (1 - \beta_2^t)$.
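In practice this update rule corresponds to an off-the-shelf AdamW implementation; the sketch below uses PyTorch with the hyperparameters listed above and in Section 2.6 (the placeholder model and training loop are illustrative only):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(40 * 128, 8))  # placeholder for the EANN of Section 2.4

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,                # alpha
    betas=(0.9, 0.999),     # beta_1, beta_2
    eps=1e-8,               # epsilon
    weight_decay=1e-6,      # lambda, decoupled from the adaptive step (Section 2.6)
)
criterion = nn.CrossEntropyLoss()

def train_step(batch, labels):
    """One optimization step; optimizer.step() applies Equations (7)-(9)."""
    optimizer.zero_grad()
    loss = criterion(model(batch), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with random data: 4 clips of 40 MFCCs x 128 frames, 8 classes.
loss = train_step(torch.randn(4, 1, 40, 128), torch.randint(0, 8, (4,)))
```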

2.6. Implementation Details

We obtained eight types of abnormal sounds, namely coughing, cracking fire, crying baby, glass breaking, gunshot, sneezing, snoring, and screams, by combining the publicly available datasets ESC-50 and UrbanSound8k; these served as the input data for detecting abnormal sound samples. In the experiments, we split the data into training and test sets at a ratio of 4:1 and obtained the optimal model and parameters by five-fold cross-validation. Comprehensive information about the data augmentation is shown in Table 1. The data were divided into eight categories, with 37 samples per class for gunshot and screams in the base dataset and 40 samples per class for the other six sounds. Augmentation was applied directly to the eight sound categories to enlarge the dataset and improve data quality. For single data augmentation methods, namely pitch shifting, background noise addition, and time stretching, the augmented data were increased fivefold; for pairwise combinations of data augmentation methods, the augmented data were increased tenfold. Each raw-audio data augmentation scheme was also combined with online data augmentation for experimentation. All augmented samples were labeled with their augmentation type in order to determine which augmentation method maximizes recognition accuracy.
The proposed combined data augmentation method was evaluated against the existing data augmentation techniques of Time Stretching, Pitch Shifting, and Background Noise. The MFCC input features were configured with the following parameters: f_max = 14,000, f_min = 50, hop_length = 256, n_fft = 1024, n_mels = 128, n_mfcc = 40, sample_rate = 16,000, and win_length = 1024. The model was optimized with AdamW, with an initial learning rate of 0.001, a weight decay of 1 × 10−6, and a dropout rate of 0.5.
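As a sketch of how these MFCC parameters map onto a feature-extraction call (librosa is our assumption; the per-clip standardization step is also an assumption, not stated in the paper):

```python
import librosa
import numpy as np

MFCC_PARAMS = dict(
    sr=16000, n_mfcc=40, n_fft=1024, win_length=1024,
    hop_length=256, n_mels=128, fmin=50, fmax=14000,
)

def extract_mfcc(path):
    """Load a clip at 16 kHz and return its (n_mfcc x frames) MFCC matrix."""
    y, sr = librosa.load(path, sr=MFCC_PARAMS["sr"])
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr,
        n_mfcc=MFCC_PARAMS["n_mfcc"], n_fft=MFCC_PARAMS["n_fft"],
        win_length=MFCC_PARAMS["win_length"], hop_length=MFCC_PARAMS["hop_length"],
        n_mels=MFCC_PARAMS["n_mels"], fmin=MFCC_PARAMS["fmin"], fmax=MFCC_PARAMS["fmax"],
    )
    # Per-clip standardization (assumed preprocessing, not specified in the paper).
    return (mfcc - mfcc.mean()) / (mfcc.std() + 1e-8)
```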
We used several widely adopted evaluation metrics to investigate the performance of the classification task: Accuracy, Macro Precision, Macro Recall, Macro Specificity, Macro F1-Score, and the confusion matrix. Computing these metrics requires four basic quantities: True Positives (TPs), instances that are positive and correctly predicted as positive; False Negatives (FNs), instances that are positive but incorrectly predicted as negative; False Positives (FPs), instances that are negative but incorrectly predicted as positive; and True Negatives (TNs), instances that are negative and correctly predicted as negative. The performance metrics used in this study are summarized in Table 2.
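A compact sketch of how these macro-averaged metrics can be computed from a k × k confusion matrix (pure NumPy; not the authors' evaluation code):

```python
import numpy as np

def macro_metrics(cm):
    """Compute Accuracy and macro Precision/Recall/Specificity/F1 from a k x k confusion matrix.

    cm[i, j] counts samples of true class i predicted as class j (definitions as in Table 2).
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = cm.sum() - (tp + fp + fn)

    acc = tp.sum() / cm.sum()
    pre = np.mean(tp / (tp + fp + 1e-12))
    rec = np.mean(tp / (tp + fn + 1e-12))
    spc = np.mean(tn / (tn + fp + 1e-12))
    f1 = 2 * pre * rec / (pre + rec + 1e-12)
    return dict(Acc=acc, Pre=pre, Rec=rec, Spc=spc, F1=f1)
```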

3. Results

We evaluated and compared the impact of different data augmentation methods on the dataset and further investigated the performance of various combined data augmentation techniques. Prior to this, we compared the performance of the EANN model with the optimization effect after adding the AdamW optimizer. The analytical results are presented in the form of tables, bar graphs, and confusion matrices. The results of the experiments are described in Table 3, Table 4, Table 5, Table 6 and Table 7.

3.1. Model and Optimizer Performance

We found that the Embedding layer transforms high-dimensional sparse feature vectors into low-dimensional dense feature vectors, which capture more details and features and improve recognition. The choice of optimizer also affects the final training performance: traditional optimizers were insufficient to achieve better results because the weight decay and learning rate calculations influence each other, whereas AdamW makes each iteration more accurate by decoupling them. The experimental results are shown in Table 3.

3.2. Time Stretching Performance

The classification results for Time Stretching augmentation are shown in Table 4. The use of Time Stretching for data augmentation yields improved recognition performance compared to not using any data augmentation techniques. The recognition performance varied with the stretching range of Time Stretching, with accuracy increasing when the stretching factor changed from [0.80, 1.20] to [0.88, 1.12], and decreasing after [0.88, 1.12]. The highest accuracy can reach 0.933, and the macro F1 score can reach 0.935 within the [0.88, 1.12] stretching range. Additionally, online data augmentation using only spectrogram masking achieved an accuracy of 0.889, which is an improvement compared to not using any data augmentation techniques. When combining spectrogram masking with Time Stretching, there was a further improvement in recognition performance compared to using Time Stretching alone, with the highest accuracy reaching 0.944 within the [0.88, 1.12] stretching range.

3.3. Pitch Shifting Performance

The classification results for Pitch Shifting augmentation are shown in Table 5. Using Pitch Shifting for data augmentation led to improved recognition performance compared to not using any data augmentation techniques. However, the recognition performance did not vary significantly with the Pitch Shifting range. The highest accuracy reached 0.903 within the [−8, 8] pitch range, with a macro F1 score of 0.901. When combining spectrogram masking with Pitch Shifting, there was a noticeable improvement over using Pitch Shifting alone, with the highest accuracy reaching 0.923 within the [−5, 5] pitch range.

3.4. Background Noise Performance

The classification results for Background Noise augmentation are presented in Table 6. Using Background Noise for data augmentation led to improved recognition performance compared to not using any data augmentation techniques. However, the recognition performance was related to the SNR and decreased as the proportion of background noise energy increased. At high SNRs (around 10 dB), there was little variation in recognition accuracy, which could reach 0.92. When combining spectrogram masking with Background Noise augmentation, the improvement over Background Noise alone was not significant, and recognition performance may degrade in the presence of strong background noise.

3.5. Combined Augmentation Performance

The results for the combined data augmentation methods are presented in Table 7. Combined data augmentation clearly outperformed not using any data augmentation. The TS + BN and TS + PS combinations showed good recognition performance, with little difference between the two, while the BN + PS combination performed relatively worse. Among these combinations, TS [0.88, 1.12] + BN [10 dB] achieved the highest recognition performance, with a maximum accuracy of 0.964. Adding spectrogram masking to the TS + BN and TS + PS combinations yielded a further slight improvement, whereas adding it to PS + BN decreased recognition performance. The proposed TS [0.88, 1.12] + BN [10 dB] + Online Aug combination exhibited the most significant improvement, with an accuracy increase of 10.643% over not using any data augmentation, and the recall and specificity were also greatly improved. The accuracy results are visualized in Figure 6.

4. Discussion

The time stretching experiments demonstrated that stretching within the range [0.88, 1.12] produced the best results, while excessive or insufficient stretching weakened the salient features of the original audio signal and reduced accuracy. Spectrogram masking can further enhance the diversity of the audio signals, resulting in higher recognition performance for indoor abnormal sound events. The Pitch Shifting experiments showed that MFCC features are not sensitive to pitch changes, but combining Pitch Shifting with spectrogram masking can improve its sensitivity. The energy level of the background noise affects the feature representation of the original audio signal; the combination of spectrogram masking and background noise is unstable under strong background noise, where the noise masks the characteristics of the raw audio. This indicates that data with strong background noise are not suitable for abnormal sound detection and recognition. The proposed TS [0.88, 1.12] + BN [10 dB] + Online Aug combined data augmentation method can significantly improve the accuracy of indoor abnormal sound event recognition; its overall accuracy was 10.6% higher than that of the baseline model.
The confusion matrix provides a clear view of the confusion between different types of abnormal sound events. Figure 7 shows the confusion matrix produced by our method on the test dataset. The number of misclassifications was very low and was mainly concentrated in pairs such as glass breaking and gunshot, and coughing and sneezing. A gunshot is generally loud and sharp, depending on the type of gun, and large glass breaking is often loud and sharp as well, while both coughing and sneezing originate from the throat. This indicates that these pairs of events share similar time-frequency characteristics. Screams were recognized particularly well, which may be related to their characteristically higher frequency and sustained vocalization. These results demonstrate that the EANN combined with the proposed data augmentation technique achieved superior performance and outperformed the baseline classification methods.
Table 8 compares the performance of various data augmentation techniques; in all cases the results improved over the corresponding models without data augmentation. This highlights the effectiveness of applying data augmentation techniques to acoustic event recognition. Our method achieved a classification accuracy of 0.974 on the indoor anomalous sound dataset.

5. Conclusions

This paper addressed sound event recognition in scenarios with limited and noisy indoor abnormal sound data. Given the challenges posed by the size and quality of the existing indoor abnormal sound samples, we proposed a data augmentation method that combines raw audio augmentation with spectral feature enhancement. To simulate a realistic indoor environment, we selected eight common indoor abnormal sound events and introduced four types of indoor background sounds. Furthermore, to achieve better interpretability and generalization ability, we introduced the Embedding layer into the ANN model and optimized the model using the AdamW method. Our results indicate that the TS [0.88, 1.12] and BN [10 dB] and Online Aug combination data augmentation method performed the best, achieving a recognition accuracy of 97.4%. This represents an overall accuracy improvement of 10.6% over the baseline model.
We also analyzed the sensitivity of the various augmentation methods, finding that Time Stretching is most effective within the [0.88, 1.12] stretching range, Pitch Shifting performs better within the [−5, 5] pitch shift range, and Background Noise improves performance under high-SNR conditions. We can likewise conclude that this combined data augmentation method outperforms traditional audio augmentation methods and single spectral feature augmentation methods. Of course, real indoor environments contain more anomalous sounds than the above eight categories, and the related scenes are more complex and changeable, so this approach may not produce the best results for every indoor anomalous sound. Future work will focus on exploring multi-modal fusion for indoor anomalous sound recognition, making full use of data augmentation techniques, and ultimately improving the generalization and robustness of indoor anomalous sound detection.

Author Contributions

Conceptualization, M.W. and J.X.; software, X.S., J.X., Q.M. and X.L.; data curation, J.X.; writing—original draft preparation, J.X. and Q.M.; writing—review and editing, M.W. and X.S.; funding acquisition, M.W. and X.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Natural Science Foundation of China (62071135), and Projects from Key Laboratory of Cognitive Radio and Information Processing, Ministry of Education, Guilin University of Electronic Technology (CRKL220105 and CRKL200111), also Guangxi Key Laboratory of Wireless Wideband Communication and Signal Processing (GXKL06180109).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. The dataset can be found on https://github.com/karolpiczak/ESC-50 and https://urbansounddataset.weebly.com/urbansound8k.html (accessed on 18 October 2023).

Acknowledgments

We are very grateful to Hongbing Qiu for his financial support in the demonstration and implementation of this work plan.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mondal, S.; Barman, A.D. Human auditory model based real-time smart home acoustic event monitoring. Multimed. Tools Appl. 2022, 81, 887–906. [Google Scholar] [CrossRef]
  2. Salekin, A.; Ghaffarzadegan, S.; Feng, Z.; Stankovic, J. A Real-Time Audio Monitoring Framework with Limited Data for Constrained Devices. In Proceedings of the 2019 15th International Conference on Distributed Computing in Sensor Systems (DCOSS), Santorini, Greece, 29–31 May 2019; pp. 98–105. [Google Scholar]
  3. Xie, J.; Hu, K.; Zhu, M.; Yu, J.; Zhu, Q. Investigation of Different CNN-Based Models for Improved Bird Sound Classification. IEEE Access 2019, 7, 175353–175361. [Google Scholar] [CrossRef]
  4. Kim, H.-G.; Kim, J.Y. Environmental sound event detection in wireless acoustic sensor networks for home telemonitoring. China Commun. 2017, 14, 1–10. [Google Scholar] [CrossRef]
  5. Kim, H.-G.; Kim, G.Y. Deep Neural Network-Based Indoor Emergency Awareness Using Contextual Information from Sound, Human Activity, and Indoor Position on Mobile Device. IEEE Trans. Consum. Electron. 2020, 66, 271–278. [Google Scholar] [CrossRef]
  6. Shilaskar, S.; Bhatlawande, S.; Vaishale, A.; Duddalwar, P.; Ingale, A. An Expert System for Identification of Domestic Emergency based on Normal and Abnormal Sound. In Proceedings of the 2023 Somaiya International Conference on Technology and Information Management (SICTIM), Mumbai, India, 24–25 March 2023; pp. 100–105. [Google Scholar]
  7. Mayorga, P.; Ibarra, D.; Zeljkovic, V.; Druzgalski, C. Quartiles and Mel Frequency Cepstral Coefficients vectors in Hidden Markov-Gaussian Mixture Models classification of merged heart sounds and lung sounds signals. In Proceedings of the 2015 International Conference on High Performance Computing & Simulation (HPCS), Amsterdam, The Netherlands, 20–24 July 2015; pp. 298–304. [Google Scholar]
  8. Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77, 257–286. [Google Scholar] [CrossRef]
  9. Kong, Q.; Cao, Y.; Iqbal, T.; Wang, Y.; Wang, W.; Plumbley, M.D. PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2880–2894. [Google Scholar] [CrossRef]
  10. Sang, J.; Park, S.; Lee, J. Convolutional Recurrent Neural Networks for Urban Sound Classification Using Raw Waveforms. In Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO), New York, NY, USA, 3–7 September 2018; pp. 2444–2448. [Google Scholar]
  11. Lezhenin, I.; Bogach, N.; Pyshkin, E. Urban Sound Classification using Long Short-Term Memory Neural Network. In Proceedings of the 2019 Federated Conference on Computer Science and Information Systems, Leipzig, Germany, 1–4 September 2019; Volume 18, pp. 57–69. [Google Scholar]
  12. Kumawat, P.; Routray, A. Applying TDNN Architectures for Analyzing Duration Dependencies on Speech Emotion Recognition. In Proceedings of the Interspeech 2021, Brno, Czechia, 30 August–3 September 2021; pp. 3410–3414. [Google Scholar]
  13. Li, Y.; Cao, W.; Xie, W.; Huang, Q.; Pang, W.; He, Q. Low-Complexity Acoustic Scene Classification Using Data Augmentation and Lightweight ResNet. In Proceedings of the 2022 16th IEEE International Conference on Signal Processing (ICSP), Beijing, China, 21–24 October 2022; pp. 41–45. [Google Scholar]
  14. Vafeiadis, A.; Votis, K.; Giakoumis, D.; Tzovaras, D.; Chen, L.; Hamzaoui, R. Audio content analysis for unobtrusive event detection in smart home. Eng. Appl. Artif. Intell. 2020, 89, 103226. [Google Scholar] [CrossRef]
  15. Pandya, S.; Ghayvat, H. Ambient acoustic event assistive framework for identification, detection, and recognition of unknown acoustic events of a residence. Adv. Eng. Inform. 2021, 47, 101238. [Google Scholar] [CrossRef]
  16. Li, Y.; Li, H.; Fan, D.; Li, Z.; Ji, S. Improved Sea Ice Image Segmentation Using U2-Net and Dataset Augmentation. Appl. Sci. 2023, 13, 9402. [Google Scholar] [CrossRef]
  17. Mikami, K.; Nemoto, M.; Ishinoda, A.; Nagura, T.; Nakamura, M.; Matsumoto, M.; Nakashima, D. Improvement of Machine Learning-Based Prediction of Pedicle Screw Stability in Laser Resonance Frequency Analysis via Data Augmentation from Micro-CT Images. Appl. Sci. 2023, 13, 9037. [Google Scholar] [CrossRef]
  18. Anvarjon, T.; Mustaqeem, M.; Kwon, S. Deep-net: A lightweight CNN-based speech emotion recognition system using deep frequency features. Sensors 2020, 20, 1–16. [Google Scholar] [CrossRef]
  19. Wang, M.; Yao, Y.; Qiu, H.; Song, X. Adaptive Memory-Controlled Self-Attention for Polyphonic Sound Event Detection. Symmetry 2022, 14, 366. [Google Scholar] [CrossRef]
  20. Nam, G.-H.; Bu, S.-J.; Park, N.-M.; Seo, J.-Y.; Jo, H.-C.; Jeong, W.-T. Data Augmentation Using Empirical Mode Decomposition on Neural Networks to Classify Impact Noise in Vehicle. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 731–735. [Google Scholar]
  21. Salamon, J.; Bello, J.P. Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process. Lett. (SPL) 2017, 24, 279–283. [Google Scholar] [CrossRef]
  22. Abeysinghe, A.; Tohmuang, S.; Davy, J.L.; Fard, M. Data augmentation on convolutional neural networks to classify mechanical noise. Appl. Acoust. 2023, 203, 109209. [Google Scholar] [CrossRef]
  23. Li, X.; Zhang, W.; Ding, Q.; Sun, J.-Q. Intelligent rotating machinery fault diagnosis based on deep learning using data augmentation. J. Intell. Manuf. 2020, 31, 433–452. [Google Scholar] [CrossRef]
  24. Abayomi-Alli, O.O.; Abbasi, A.A. Detection of COVID-19 from Deep Breathing Sounds Using Sound Spectrum with Image Augmentation and Deep Learning Techniques. Electronics 2022, 11, 2520. [Google Scholar] [CrossRef]
  25. Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proceedings of the Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15–19 September 2019; pp. 2613–2617. [Google Scholar]
  26. Padovese, B.; Frazao, F.; Kirsebom, O.S.; Matwin, S. Data augmentation for the classification of North Atlantic right whales upcalls. J. Acoust. Soc. Am. 2021, 149, 2520–2530. [Google Scholar] [CrossRef] [PubMed]
  27. Nam, H.; Kim, S.-H.; Park, Y.-H. Filteraugment: An Acoustic Environmental Data Augmentation Method. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 4308–4312. [Google Scholar]
  28. Wu, D.; Zhang, B.; Yang, C.; Peng, Z.; Xia, W.; Chen, X.; Lei, X. U2++: Unified Two-pass Bidirectional End-to-end Model for Speech Recognition. arXiv 2021, arXiv:2106.05642. [Google Scholar] [CrossRef]
  29. Yao, Q.; Wang, Y.; Yang, Y. Underwater Acoustic Target Recognition Based on Data Augmentation and Residual CNN. Electronics 2023, 12, 1206. [Google Scholar] [CrossRef]
  30. Jeong, Y.; Kim, J.; Kim, D.; Kim, J. Methods for Improving Deep Learning-Based Cardiac Auscultation Accuracy: Data Augmentation and Data Generalization. Appl. Sci. 2021, 11, 4544. [Google Scholar] [CrossRef]
  31. Mushtaq, Z.; Su, S.-F.; Tran, Q.-V. Spectral images based environmental sound classification using CNN with meaningful data augmentation. Appl. Acoust. 2021, 172, 107581. [Google Scholar] [CrossRef]
  32. Mnasri, Z.; Rovetta, S.; Masulli, F. Anomalous sound event detection: A survey of machine learning based methods and applications. Multimed. Tools Appl. 2022, 81, 5537–5586. [Google Scholar] [CrossRef]
  33. Damskägg, E.-P.; Välimäki, V. Audio time stretching using fuzzy classification of spectral bins. Appl. Sci. 2017, 7, 1293. [Google Scholar] [CrossRef]
  34. Wei, S.; Zou, S.; Liao, F.; Lang, W. A comparison on data augmentation methods based on deep learning for audio classification. J. Phys. Conf. Ser. 2020, 1453, 012085. [Google Scholar] [CrossRef]
  35. Zhang, F.; Dvornek, N.; Yang, J.; Chapiro, J.; Duncan, J. Layer Embedding Analysis in Convolutional Neural Networks for Improved Probability Calibration and Classification. IEEE Trans. Med. Imaging 2020, 39, 3331–3342. [Google Scholar] [CrossRef]
  36. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar] [CrossRef]
Figure 1. Experimental framework for data augmentation methods of indoor anomalous acoustic events identification.
Figure 2. MFCC feature extraction process.
Figure 3. The MFCC spectrograms (L = 13) of several types of abnormal sound events: (a) coughing, (b) cracking fire, (c) crying baby, (d) glass breaking, (e) gunshot, (f) sneezing, (g) snoring, (h) screams.
Figure 4. MFCC of different DA methods: (a) No Aug (Augmentation), (b) Time Stretching, (c) Pitch Shifting, (d) Background Noise, (e) Spectral Masking Aug, and (f) Time Stretching and Spectral Aug.
Figure 5. EANN for sound events recognition.
Figure 6. The classification accuracy results for combined data augmentation.
Figure 7. The confusion matrix results from combined augmentation.
Table 1. Comparison of datasets before and after data augmentation.

Data Class        Baseline   Single DA   Combined DA
coughing              40        200          400
cracking_fire         40        200          400
crying_baby           40        200          400
glass_breaking        40        200          400
gun_shot              37        185          370
sneezing              40        200          400
snoring               40        200          400
screams               37        185          370
Total                314       1570         3140
Table 2. Summary of the evaluation metrics.

Accuracy: the proportion of correctly identified instances (across all abnormal acoustic event classes) among all evaluated instances. $Acc = \frac{TP + TN}{TP + FN + FP + TN}$
Macro Precision: the average over the k event classes of the ratio of true positives to all predicted positives. $Pre = \frac{1}{k}\sum_{i=1}^{k}\frac{TP_i}{TP_i + FP_i}$
Macro Recall: the average over the k event classes of the ratio of true positives to all actual positives. $Rec = \frac{1}{k}\sum_{i=1}^{k}\frac{TP_i}{TP_i + FN_i}$
Macro Specificity: the average over the k event classes of the ratio of true negatives to all actual negatives. $Spc = \frac{1}{k}\sum_{i=1}^{k}\frac{TN_i}{TN_i + FP_i}$
Macro F1-Score: the harmonic mean of Macro Precision and Macro Recall. $F1 = \frac{2 \cdot Pre \cdot Rec}{Pre + Rec}$
Table 3. The classification results (%) for different models and optimizers.

Training Method      Acc     Pre     Rec     Spc     F1
ANN and Adam         78.42   72.24   67.32   62.86   69.69
ANN and AdamW        81.26   73.17   69.78   64.28   71.43
EANN and Adam        83.68   76.52   71.16   66.97   73.74
EANN and AdamW       86.81   78.44   71.71   67.58   74.92
Table 4. The classification results (%) for time stretching augmentation.

Time Stretching Range           Acc     Pre     Rec     Spc     F1
no stretching (Baseline)        86.81   78.44   71.71   67.58   74.92
Online Aug only                 88.89   82.21   76.40   68.63   79.19
[0.80, 1.20]                    89.77   90.13   89.33   74.24   89.73
[0.80, 1.20] + Online Aug       90.91   91.69   90.29   77.36   90.99
[0.84, 1.16]                    91.04   92.08   90.83   78.43   91.45
[0.84, 1.16] + Online Aug       92.31   91.52   92.44   82.61   91.98
[0.88, 1.12]                    93.31   93.73   93.37   83.62   93.55
[0.88, 1.12] + Online Aug       94.44   94.51   94.48   86.23   94.49
[0.90, 1.10]                    87.78   88.66   86.85   82.52   87.75
[0.90, 1.10] + Online Aug       88.07   88.60   87.17   84.72   87.88
[0.92, 1.08]                    87.78   87.37   87.00   79.46   87.19
[0.92, 1.08] + Online Aug       89.77   90.32   89.04   82.21   89.68
[0.96, 1.04]                    86.08   86.27   86.25   73.92   86.26
[0.96, 1.04] + Online Aug       87.78   87.27   88.25   77.46   87.76
"Baseline" indicates that no data augmentation is used; rows marked "+ Online Aug" combine the given setting with spectrogram online data augmentation.
Table 5. The classification results (%) for pitch shifting augmentation.

Pitch Shifting Range          Acc     Pre     Rec     Spc     F1
no shifting (Baseline)        86.81   78.44   71.71   67.58   74.92
Online Aug only               88.89   82.21   76.40   68.63   79.19
[−10, 10]                     89.49   89.43   89.17   78.85   89.30
[−10, 10] + Online Aug        90.62   90.60   89.98   81.37   90.29
[−8, 8]                       90.34   90.44   89.67   83.32   90.05
[−8, 8] + Online Aug          92.04   92.15   91.69   84.92   91.92
[−5, 5]                       89.20   89.66   88.27   83.86   88.96
[−5, 5] + Online Aug          92.33   92.76   91.71   85.46   92.23
[−3, 3]                       88.07   88.13   87.17   81.65   87.65
[−3, 3] + Online Aug          90.91   90.87   90.58   84.28   90.73
[−1, 1]                       87.92   87.22   87.25   81.46   87.24
[−1, 1] + Online Aug          90.48   90.73   89.92   83.69   90.32
Rows marked "+ Online Aug" combine the given setting with spectrogram online data augmentation.
Table 6. The classification results (%) for background noise augmentation.

Background Noise SNR (dB)     Acc     Pre     Rec     Spc     F1
no noise (Baseline)           86.81   78.44   71.71   67.58   74.92
Online Aug only               88.89   82.21   76.40   68.63   79.19
−10                           87.19   86.91   86.60   72.63   86.75
−10 + Online Aug              86.26   85.63   85.35   71.42   85.49
−5                            88.13   88.50   87.94   74.82   88.22
−5 + Online Aug               90.32   90.12   90.09   78.37   90.15
0                             90.60   90.75   90.93   81.29   90.84
0 + Online Aug                91.46   91.60   91.44   83.84   91.52
5                             92.88   91.81   91.39   83.67   91.60
5 + Online Aug                92.02   92.19   91.77   84.82   91.98
10                            92.88   92.02   91.71   85.37   91.86
10 + Online Aug               93.44   92.38   92.33   86.25   92.36
15                            92.59   92.47   92.73   85.48   92.59
15 + Online Aug               93.14   92.42   92.88   85.93   92.67
20                            92.58   92.47   92.73   86.02   92.59
20 + Online Aug               92.46   92.62   92.58   85.74   92.60
Rows marked "+ Online Aug" combine the given SNR setting with spectrogram online data augmentation.
Table 7. The classification results (%) for combined augmentation.

Combination Type                              Acc     Pre     Rec     Spc     F1
No Aug                                        86.81   78.44   71.71   67.58   74.92
Online Aug                                    88.89   82.21   76.39   68.63   79.19
BN [10] and PS [−5, 5]                        93.21   93.29   93.24   78.36   93.27
BN [10] and PS [−5, 5] + Online Aug           91.28   91.38   92.33   77.86   91.86
TS [0.88, 1.12] and PS [−5, 5]                95.81   95.87   95.85   83.41   95.86
TS [0.88, 1.12] and PS [−5, 5] + Online Aug   96.01   96.10   96.03   86.42   96.06
TS [0.88, 1.12] and BN [10]                   96.43   96.45   96.44   88.52   96.45
TS [0.88, 1.12] and BN [10] + Online Aug      97.45   97.16   97.09   90.61   97.12
Table 8. Performance comparison of various data augmentation techniques.

Research Sound Types | Method Used | Accuracy before Augmentation | Accuracy after Augmentation
Environmental sound classification [21] | Pitch Shifting, Time Stretching, Dynamic Range Compression, Background Noise | 0.741 | 0.791
Detecting COVID-19 from deep breathing sounds [24] | Color Transformation and Noise Addition + DeepShufNet | 0.749 | 0.901
North Atlantic right whales upcalls [26] | SpecAugment, Mixup | 0.860 | 0.902
Mechanical noise identification [22] | Background Noise and Time Stretching | 0.931 | 0.971
Underwater acoustic target recognition [29] | DCGAN + ResNet18 | 0.925 | 0.964
Environmental sound classification [34] | Mixed Frequency Masking | 0.924 | 0.937
Indoor anomalous sound event identification | Time Stretching, Background Noise, Spectral Masking + EANN | 0.868 | 0.974
