Using Deep Learning to Classify Environmental Sounds in the Habitat of Western Black-Crested Gibbons

Hu, Ruiqi; Hu, Kunrong; Wang, Leiguang; Guan, Zhenhua; Zhou, Xiaotao; Wang, Ning; Ye, Longjia

doi:10.3390/d16080509

Open AccessArticle

Using Deep Learning to Classify Environmental Sounds in the Habitat of Western Black-Crested Gibbons

by

Ruiqi Hu

¹

,

Kunrong Hu

^1,2,*

,

Leiguang Wang

^1,3,

Zhenhua Guan

¹,

Xiaotao Zhou

¹,

Ning Wang

¹ and

Longjia Ye

¹

School of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming 650224, China

²

Key Laboratory of National Forestry and Grassland Administration on Forestry and Ecological Big Data, Southwest Forestry University, Kunming 650024, China

³

Institute of Big Data and Artificial Intelligence, Southwest Forestry University, Kunming 650024, China

^*

Author to whom correspondence should be addressed.

Diversity 2024, 16(8), 509; https://doi.org/10.3390/d16080509

Submission received: 6 June 2024 / Revised: 31 July 2024 / Accepted: 12 August 2024 / Published: 22 August 2024

Download

Browse Figures

Versions Notes

Abstract

The western black-crested gibbon (Nomascus concolor) is a rare and endangered primate that inhabits southern China and northern Vietnam, and has become a key conservation target due to its distinctive call and highly endangered status, making its identification and monitoring particularly urgent. Identifying calls of the western black-crested gibbon using passive acoustic monitoring data is a crucial method for studying and analyzing these gibbons; however, traditional call recognition models often overlook the temporal information in audio features and fail to adapt to channel-feature weights. To address these issues, we propose an innovative deep learning model, VBSNet, designed to recognize and classify a variety of biological calls, including those of endangered western black-crested gibbons and certain bird species. The model incorporates the image feature extraction capability of the VGG16 convolutional network, the sequence modeling capability of bi-directional LSTM, and the feature selection capability of the SE attention module, realizing the multimodal fusion of image, sequence and attention information. In the constructed dataset, the VBSNet model achieved the best performance in the evaluation metrics of accuracy, precision, recall, and F1-score, realizing an accuracy of 98.35%, demonstrating high accuracy and generalization ability. This study provides an effective deep learning method in the field of automated bioacoustic monitoring, which is of great theoretical and practical significance for supporting wildlife conservation and maintaining biodiversity.

Keywords:

western black-crested gibbon call recognition; convolutional neural networks; bioacoustics; attention mechanisms; passive acoustic monitoring

1. Introduction

The western black-crested gibbon is a rare and endangered primate that inhabits southern China and northern Vietnam, so timely and accurate monitoring of the population dynamics of this species is essential for the development of effective conservation measures. Due to its unique calls, acoustic monitoring techniques are usually used to record the calls of the black-crested gibbon for monitoring and protection, while passive acoustic monitoring techniques have the advantages of being non-invasive and persistent, and have been widely used in the field of wildlife conservation [1,2,3]. In the vicinity of the habitat of the western black-crested gibbon, the environmental sounds are complex and variable, including sounds related to various animals, wind, and rain. It is difficult for traditional audio data processing methods to effectively differentiate and recognize the calls of the western black-crested gibbon. Therefore, it is of great significance to develop a method that can automatically recognize and classify the calls of the western black-crested gibbon. The importance of accurate monitoring extends beyond individual species conservation. The presence and vocal activity of the western black-crested gibbon serve as indicators of the broader health of their ecosystem. Effective acoustic monitoring can provide insights into the biodiversity of the region, track changes in species populations, and detect early signs of environmental disturbances. This information is crucial for guiding conservation efforts, informing policy decisions, and fostering sustainable management practices. As habitat loss and human activities continue to threaten biodiversity, leveraging advanced monitoring techniques to protect endangered species and their habitats becomes increasingly vital. Deep learning plays an important role in passive acoustic monitoring, especially in combination with models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs).

In recent years, convolutional neural networks (CNNs) and recurrent neural networks (RNNs), both key components of deep learning, have achieved breakthroughs in speech recognition [4], image classification, and other fields [5]; CNNs are particularly suitable for speech recognition due to their powerful feature extraction capabilities [6]. RNNs, on the other hand, focus on sequence modeling and perform well in speech recognition. For example, Graves et al. [7] used an RNN for TIMIT phoneme recognition, achieving an error rate of 17.7% for their test set. Sutskever et al. [8] used a long short-term memory (LSTM) network for their language models, achieving a BLEU score of 36.5. Additionally, Ren et al. [9] designed an RNN-based system for acoustic classification, which demonstrated high information processing accuracy and short processing time. A bidirectional LSTM (BiLSTM) is a type of RNN that can pass information forward and backward through a time series, enabling it to learn both past and future contextual information. BiLSTM addresses the potential loss of long-term dependent information that traditional unidirectional RNNs might face. The current structure combining CNN-BiLSTM and attention mechanisms has shown excellent performance in speech recognition [10], time-series tasks [11,12], and undersea event recognition [13]. This advancement opens new opportunities for automated bioacoustic analysis, particularly in automated species identification tasks.

Researchers have proposed a variety of deep learning-based bioacoustic recognition methods that exhibit good performance. A common approach is to extract the features of animal calls from complex environmental sounds recorded in passive acoustic monitoring, and then use deep learning models to achieve accurate classification and recognition of their calls [14,15]. A common strategy is to convert the audio data into a visual time–frequency representation, such as the Mel Frequency Cepstrum Coefficient (MFCC) spectrogram, and then use a model such as a convolutional neural network (CNN) to perform classification and recognition [10,11]. This MFCC- and CNN-based approach has been widely applied to a variety of species recognition and detection tasks, including birds [16,17], gibbons [18,19], bats [20], cetaceans [21], and many other species, demonstrating good generalization ability.

Despite the initial results, there are still some shortcomings in the existing deep learning-based bioacoustic analysis methods. On the one hand, most studies treat the input of audio data as independent images, ignoring the sequential features of animal calls in the time dimension, which are crucial for recognizing complex call patterns [22,23,24]. On the other hand, existing methods usually treat features such as MFCC spectrograms as equal and lack the ability to adaptively adjust feature weights according to the amount of information. As a result, the extracted features may contain more redundant or irrelevant information, which affects the recognition performance [25,26]. In addition, for rare species such as the western black-crested gibbon, a large amount of data is required for training using models such as convolutional neural networks due to the lack of large-scale audio datasets, while recordings captured through passive acoustic monitoring techniques tend to be more sparse, limiting the effectiveness and applicability of the model training. The applicability and robustness of current methods in coping with this data scarcity still need to be further validated [27]. In order to solve the above problems, this paper proposes a new song recognition method based on a CNN, an RNN, and a channel-feature attention mechanism, and applies it to classify and recognize the calls of western black-crested gibbons and some birds under complex environmental sound conditions. Firstly, this paper constructs an audio dataset containing the calls of the western black-crested gibbon (Nomascus concolor) and various bird species, including the streak-breasted scimitar babbler (Pomatorhinus ruficollis), greater necklaced laughingthrush (Garrulax pectoralis), Hume’s warbler (Phylloscopus humei), red-billed leiothrix (Leiothrix lutea), golden-breasted fulvetta (Lioparus chrysotis), great barbet (Psilopogon virens), and terpnosia obscura. additionally, background noises such as cicadas, wind, and rain were included. Including these bird species was essential as they coexist with the western black-crested gibbon in their natural habitat. The presence of diverse bird calls and environmental sounds creates a complex acoustic environment, which is crucial for developing and testing robust call recognition models. These models must accurately identify gibbon calls amidst various background noises, reflecting real-world conditions and enhancing the practical applicability of the research. The original data contain only two hundred pieces of audio data for each class. By adding white noise, the dataset was augmented to improve the robustness of the model. Then, an improved deep neural network model was designed in this paper, which combines the advantages of the VGG16 convolutional network [28], bidirectional LSTM (BiLSTM) [29], and Squeeze-and-Excitation (SE) attention mechanism [30]. Among them, the convolutional part of VGG16 is responsible for extracting local time–frequency features from MFCC spectrograms, BiLSTM is able to model the serial dependence of animal calls in the time dimension, and the SE module can adaptively adjust the weights of different features to highlight the key information. With this design, the model can characterize animal sounds more comprehensively and accurately, which is expected to improve the recognition accuracy.

The main contributions of this paper are as follows: (1) by augmenting the audio dataset of western black-crested gibbon calls and various bird calls with white noise, we enhanced the dataset’s robustness for developing and testing our algorithm; (2) we propose an improved deep learning model that combines a convolutional neural network (CNN) and a recurrent neural network (RNN). This model can extract key features of western black-crested gibbon calls from MFCC spectrograms in two dimensions: temporal frequency and time series; (3) we added a channel-feature attention mechanism, which can adaptively adjust the weights of different channel features, so that the network pays more attention to the features that are helpful for the current task. From a broader perspective, this study explores the application of AI technology in the field of wildlife conservation, demonstrates the potential of solving real ecological problems, and inspires more innovative research results to contribute to the maintenance of biodiversity.

2. Materials

2.1. Data Sources

The audio data used in this study were obtained from a study published by Zhong [31] and Zhou et al. [32] on the monitoring and identification of western black-crested gibbon calls. The study utilized an innovative automated acoustic monitoring system developed by Zhong and colleagues [31] for recording and analyzing bird song activity. This system employs passive acoustic monitoring technology, which is significant for ecological conservation efforts. By continuously monitoring the acoustic environment, it provides valuable data for studying and protecting the habitat of western black-crested gibbons. Through this automated acoustic monitoring system, the project team obtained a large amount of high-quality field western black-crested gibbon song data, bird song data, and environmental background sounds, and, therefore, it laid a solid data foundation for further in-depth research on the singing behavior of western black-crested gibbons and birds, and their interactions with the environment. The system collected the following types of audio data: (1) Western black-crested gibbon song: This is a rare and endangered gibbon species, and its song audio data record their activities in the field environment. These recordings are crucial for studying the behavioral habits of this species and understanding its habitat requirements, which are essential for developing effective conservation strategies. By analyzing these calls, researchers can gain insights into the social structure, communication patterns, and population dynamics of the gibbons, all of which are vital for their protection and management. (2) Bird song: The dataset covers the song audio data of various bird species distributed in different geographic regions and habitats, showcasing rich biodiversity. Studying bird songs helps in monitoring avian biodiversity, understanding migration patterns, and detecting changes in bird populations, which can be indicators of broader environmental changes. This information is valuable for conserving bird species and their habitats and ensuring the health of ecosystems. (3) Background sounds of the natural environment: The dataset includes background audio data of wind, rain, and other natural environments, helping to simulate the sound environment in real situations. This enhances the diversity and practical application value of the dataset. Analyzing these sounds can provide insights into the environmental conditions of the habitats, which is important for assessing the impacts of climate change and human activities on wildlife. All raw audio data are stored in lossless WAV format with a sampling rate of 32 kHz, ensuring the retention of high-quality acoustic signals. These high-quality data are indispensable for conducting detailed and accurate acoustic analyses, which are critical for the effective monitoring and conservation of wildlife.

2.2. Data Augmentation

In order to cope with the problem of the small amount of audio data for western black-crested gibbon monitoring and to enhance the robustness of the model in dealing with complex acoustic environments, this study applied the data enhancement method of noise superimposition. Due to the size limitation of the initial dataset of only 200 sound samples per category, this paper adopted the approach of expanding the dataset by superimposing multiple levels of signal-to-noise ratio (SNR) noise on the original sound samples. Specifically, nine different SNR levels 1, 0.5, 0.2, 0.1, 0.3, −0.1, −0.2, −0.5, −1 were set as a way to simulate mild to severe noise environments and to improve the model’s adaptability in complex environments [33]. The signal-to-noise ratio is calculated as follows:

S N R = 10 \log_{10} (\frac{P_{s}}{P_{n}})

(1)

where SNR denotes the signal-to-noise ratio,

P_{s}

denotes the power of the pure speech signal, and

P_{n}

denotes the power of the noise signal. By controlling the level of SNR, different degrees of noise environments can be simulated, so that the model is exposed to more diverse data samples during the training process, and its generalization ability is improved to enhance the robustness of the model [34]. The audio samples under different SNR conditions are shown in Table 1.

The audio data of the western black-crested gibbon, various bird species (Pomatorhinus ruficollis, Erythrogenys gravivox, Phylloscopus valentini, Trochalopteron milnei, Trochalopteron chrysopterum, Psilopogon virens, and Terpnosia obscura), and environmental sounds (rain and wind) were preprocessed after noise overlay. An audio dataset containing 10 categories with a total of 20,000 audio data samples was constructed, which lays a solid data foundation for the subsequent classification of environmental sounds and related ecological research tasks.

In order to convert the audio data into a form suitable for deep learning model input, we computed Mel’s inverted spectral coefficient features for each audio file and visualized them as MFCC images. The calculation process of the Mel spectrum includes the following steps: Short-Time Fourier Transform (STFT) is performed on the original audio signal to obtain the frequency domain representation

X (m, k)

. The linear frequency is mapped to the Mel Scale (MS) to better fit the perceptual properties of sound to the human ear. The Mel Scale is mainly used for feature extraction and operational dimension reduction of speech data, and its relationship with frequency can be approximated by the following equation.

m = 2595 \log_{10} (1 + \frac{f}{700})

(2)

The square operation is performed on the output of each Mel filter to obtain the energy output from that filter. The logarithmic operation on the result yields the final Mel spectrum

S (m, n)

with the following equation:

S (m, n) = \log (\sum_{k = 0}^{N - 1} {|X (m, k)|}^{2} H_{m} (k))

(3)

where,

S (m, n)

denotes the value of the Mel spectrum, and

X (m, k)

is the STFT coefficients of the mth frame.

H_{m} (k)

is the transfer function of the mth Mel filter, and N is the number of points of FFT. To fulfill the input requirements of the deep learning model, we used the Matplotlib library to generate the Mel Spectrogram, removing the axes and boundaries to simplify the image. Subsequently, the image was resized to 224 × 224 pixels using the PIL library. Examples of various audio MFCC spectrograms are shown in Figure 1.

Figure 2 shows the western black-crested gibbon calls under different signal-to-noise ratio conditions, where Figure 2a shows the spectrogram of the original western black-crested gibbon calls after the extraction of the Meier inversion coefficient features, while Figure 2b–j show the spectrograms of the western black-crested gibbon calls with Mel-frequency cepstral coefficient (MFCC) features extracted under different signal-to-noise ratio conditions.

3. Methods

3.1. Overall Classification Model

In this paper, we propose an innovative deep learning model designed to meet the challenges of image multi-categorization recognition tasks. The model is based on the classical VGG16 convolutional neural network (CNN) with important extensions and optimizations, and the Squeeze-and-Excitation (SE) module and Bidirectional Long Short-Term Memory Network (BiLSTM) are introduced to enhance the model’s performance. First, the authors embedded the SE module behind each convolutional block of VGG16. The introduction of the SE module aims to dynamically adjust the response weights of each channel to enhance the model’s ability to capture key features of the image. In addition, to further improve the network performance, a Batch Normalization layer was added after each SE-enhanced convolutional block, which accelerated the model training process while effectively suppressing overfitting.

Next, the model utilizes a bidirectional LSTM module to serialize the convolutional features. This step aims to model the before and after temporal information of the feature sequences using the bidirectional LSTM to better understand the temporal relationships between image features and their dynamics. This is essential to capture the contextual information in the image sequence, providing the model with a deep understanding of the sequence semantics.

Finally, in order to prevent overfitting, a Dropout layer was introduced and the features were classified using a fully connected layer and Softmax as a classifier to output the classification labels of the images. The Dropout layer, as a regularization technique, effectively reduced the model complexity and enhanced the generalization ability. With these structural innovations and improvements, the model achieved significant performance gains on image multi-classification tasks. These improvements not only enable the model to recognize objects in images more accurately, but also to better understand the temporal relationships and semantic information among objects in images. The structure of the whole model is schematically shown in Figure 3, which provides the reader with an intuitive view of the model composition and workflow.

The network loaded image data using PyTorch’s ImageFolder dataset class, which automatically encodes labels based on folder names to generate a labeled dataset. Data preprocessing was performed through the transforms module after data loading, including scaling, cropping, normalization, and other operations on the image. The network structure includes the following: (1) A total of 16 convolutional layers, each of which is a 3 × 3 convolutional kernel distributed in 5 convolutional blocks, the specific connection is two convolutional layers (3 × 3 convolutional kernel) + ReLU activation function + SELayer module + Batch Normalization layer, and then connected to a one-by-one maximum pooling layer, and, finally, each convolutional layer is followed by ReLU activation function + SELayer module + Batch Normalization layer, then connected to a one-by-one maximum pooling layer. The layer is followed by a ReLU activation function, except for the last convolutional layer. (2) SENet module is added at the end of each convolutional block, which is mainly used to dynamically adjust the importance of the channels. (3) The BiLSTM part is accessed through a bi-directional LSTM layer after the convolutional part, which is used for sequence modeling. (4) Finally, a fully-connected layer is connected to map the output of the BiLSTM into the category label space.

Parameter and hyperparameter settings were as follows: num_classes—number of classes, here set to 10, applicable to 10 classes of image classification tasks; lstm_hidden_size—LSTM hidden layer size, set to 256; lstm_num_layers—number of LSTM layers, set to 2; batch_size—batch size, set to 16; num_epochs—number of training rounds, set to 100; lr—learning rate, set to 0.0001.

3.1.1. Based on the Improved VGG16

VGG16 [28] is a landmark convolutional neural network model that builds a deep network structure by cascading multiple convolutional and pooling layers to efficiently extract image features. The model consists of five convolutional blocks, each containing a series of convolutional and pooling layers inside. By design, VGG16 employs a smaller convolutional kernel size (3 × 3) and a smaller step size (1) to increase the network depth and nonlinearity and reduce the number of parameters. By gradually decreasing the feature map size and increasing the number of channels, the network is able to gradually understand more complex and abstract image features. Finally, classification is performed through a fully connected layer that maps the extracted features to category probabilities. VGG16 has achieved remarkable performance in tasks such as image classification with its simple yet efficient structure.

In this paper, key improvements were made to VGG16, a Batch Normalization layer was added to five convolutional blocks, and a Squeeze-and-Excitation (SE) module was introduced after the Relu module and the maximum pooling layer in each convolutional block. These improvements were aimed at improving the model’s ability to capture key features of the image and accelerating the training process of the model, while effectively suppressing overfitting.

The original VGG16 model lacks dynamic adjustment of the importance between features between convolutional blocks. Instead, the introduction of the SE module enabled our model to adaptively learn and adjust the response weights of each channel, thus capturing the key features in the image more efficiently. In addition, by adding the Batch Normalization layer, we were able to train more stably and alleviate the overfitting problem to some extent. Overall, the improved VGG16 performed better in the image multi-classification recognition task, improving the performance and generalization ability of the model, making it more suitable for different image datasets and application scenarios.

3.1.2. Bidirectional Long and Short-Term Memory Neural Network

The Bidirectional Long Short-Term Memory Neural Network (BiLSTM) is a recurrent neural network structure (RNN) designed to solve the problem that a traditional LSTM unidirectional structure may ignore the information in front and back of the input sequences. BiLSTM effectively captures the contextual relationships in the sequence data by considering both forward and backward information of the input sequences, which enables the model to more comprehensively comprehend the information in the sequences. The computational process of LSTM is as follows:

(1) Calculate the input gate:

i_{t} = σ (W_{i x} x_{t} + W_{i h} h_{t - 1} + b_{i})

(4)

(2) Calculate the forget gate:

f_{t} = σ (W_{f x} x_{t} + W_{f h} h_{t - 1} + b_{f})

(5)

(3) Calculate the output gate:

o_{t} = σ (W_{o x} x_{t} + W_{o h} h_{t - 1} + b_{o})

(6)

(4) Calculate the cell state:

C_{t}^{\sim} = \tanh (W_{c x} x_{t} + W_{c h} h_{t - 1} + b_{c})

(7)

(5) Updating the cellular state:

C_{t} = f_{t} * C_{t - 1} + i_{t} * C_{t}^{\sim}

(8)

(6) Compute the hidden state:

h_{t} = o_{t} * \tanh (C_{t})

(9)

where

σ

denotes the sigmoid function, ∗ denotes the element-by-element multiplication,

W_{i x}

,

W_{f x}

,

W_{o x}

, and

W_{c x}

are the input weight matrices,

W_{i h}

,

W_{f h}

,

W_{o h}

, and

W_{c h}

are the hidden state weight matrices,

b_{i}

,

b_{f}

,

b_{o}

, and

b_{c}

are the bias vectors, and the structure of LSTM is shown in Figure 4.

The introduction of the BiLSTM structure plays an important role in MFCC spectrogram recognition. BiLSTM is based on the traditional LSTM structure, adds an inverse LSTM structure, and splices its output with that of the forward LSTM to form a bidirectional sequence model. This structure allows the network to propagate information both forward and backward at each time step, thus improving the model’s ability to model sequential data. Proponents of BiLSTM include Schuster and Paliwal et al. who first proposed the BiLSTM structure for an early speech recognition task and achieved significant performance gains. LSTM and BiLSTM have been widely used in various time-series modeling tasks [35], such as speech recognition [36], natural language processing [29,37,38,39], and other fields, and have become an important sequence model structure.

In the MFCC spectrogram recognition task, animal calls are usually represented as continuous time-series information, and the introduction of the BiLSTM structure can better capture the temporal information in the acoustic data, thus improving the recognition performance. BiLSTM effectively captures the long-term dependencies in the acoustic data by simultaneously considering the past and future information of the sequential data, which enables the model to more accurately understand and categorize acoustic features. Therefore, the deep learning model combined with BiLSTM shows higher accuracy and robustness in the MFCC spectrogram recognition task. The structure of BiLSTM is shown in Figure 5.

3.1.3. SE Attention Mechanism

The attention mechanism [40] is applied in neural machine translation tasks, and its core idea is to dynamically learn the importance of different parts of the input sequence at each time step of the model and adjust the corresponding attention weights accordingly to achieve more flexible and accurate information extraction. The introduction of the SENet channel attention mechanism can significantly improve the recognition ability of the MFCC spectrograms. The SENet [30] network architecture, which won the ImageNet competition in 2017 in the image classification task, consists of core components including Squeeze-and-Excitation operations. In the Squeeze operation, the MFCC spectrogram, with an input size of H × W and the number of channels C, is globally average pooled to obtain the global information; in the Excitation operation, the one-dimensional feature map obtained from the Squeeze operation is used as an input using the fully connected layer and the parameter W to obtain the importance weights of each channel to characterize the correlation between the channels. Finally, the weights of the Excitation output are multiplied channel by channel with the original features to realize the recalibration of the features.

The SENet channel attention mechanism enables the network to adaptively learn the correlation between feature channels, thus effectively improving the recognition performance of MFCC spectrograms. By introducing the SENet channel attention mechanism, the network can focus more on the features that are useful for the recognition task, which improves the network’s ability to capture and utilize the important features in the acoustic data, and, thus, enhances the ability to accurately recognize acoustic signals. As a result, the deep learning model incorporating the SENet channel attention mechanism demonstrates higher efficiency and performance in bioacoustic analysis.

The SENet work process is mainly divided into the following processes: compression step (Squeeze step): in this step, the input MFCC feature map is compressed by a global pooling operation, and the spatial information of each channel is aggregated to obtain a global description vector for each channel. Specifically, for the input feature map X, a global average pooling operation is performed to obtain a vector z of length C, representing the global average activation value of each channel:

Z = A v g P o o l (X)

(10)

Excitation step: In this step, the fully connected layers are utilized to learn the weights of each channel and, thus, dynamically adjust the importance of each channel. Specifically, first, the global description vector z is input into two fully connected layers, the first of which maps z to a smaller intermediate dimension C/r (r is a hyperparameter called the “compression ratio”), which is then activated by an activation function (usually ReLU) to obtain an intermediate representation

\tilde{z}

:

\tilde{z} = Re l u (F C_{1} (z))

(11)

The intermediate representation

\tilde{z}

is then fed into the second fully connected layer, which maps it back to the original channel dimension C. It is activated by a sigmoid function, yielding a vector s representing the importance of each channel:

s = σ (F C_{2} (\tilde{z}))

(12)

Finally, the learned channel dominance vectors s are used to weight the input feature map to obtain the enhanced feature representation. Specifically, the learned channel dominance vector s for each channel is multiplied by the eigenvalue at each spatial location in the corresponding original feature map X to obtain the weighted feature

\tilde{X}

.

X_{c}^{\sim} = X_{c} \cdot s_{c}

(13)

where

X_{c}^{\sim}

denotes the weighted would-be graph of the cth channel,

X_{c}

is the c channel in the original would-be graph X, and

s_{c}

is the importance weight of the c channel, and the SENet structure diagram is shown in Figure 6.

3.1.4. Evaluation Indicators

In this study, the following evaluation metrics are used to evaluate the model performance of VBSNet: accuracy, precision, recall, and F1-score. These metrics can be calculated using True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN) in the confusion matrix. The formulas are as follows:

(1) Accuracy: a measure of the proportion of all samples that are correctly categorized:

A c c u r a c y = \frac{T P + T N}{T P + F P + F N + T N}

(14)

(2) Precision: a measure of the proportion of all cases predicted by the model to be positive that are actually positive:

\Pr e c i s i o n = \frac{T P}{T P + F P}

(15)

(3) Recall: measures the proportion of all actual positive cases that are correctly predicted by the model to be positive:

Re c a l l = \frac{T P}{T P + F N}

(16)

(4) F1 score (F1-score): it is a reconciled average of precision and recall, which combines the accuracy and recall of a classification model, and is usually used to evaluate the performance of classification models for unbalanced categories.

F 1 - s c o r e = \frac{2 \times \Pr e c i s i o n \times Re c a l l}{\Pr e c i s i o n + Re c a l l}

(17)

The above experimental setup and evaluation approach aimed to impartially and comprehensively evaluate the performance of the proposed VBSNet model on western black-crested gibbon calls, bird calls, and natural environment sound recognition tasks.

4. Results

4.1. Experimental Environment

The experiment was performed on a configuration of Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50 GHz, Ubuntu 16.04 LTS operating system, 32 GB RAM and NVIDIA GeForce RTX 3060 8 G GPU, CUDA version 11.7, Python 3.8, and Pytorch 2.0.1. The experimental parameter settings are shown in Table 2.

All experiments were trained using 100 cycles and 16 batches, and the dataset was divided into training, validation, and test sets in the ratio of 7:2:1. The training set was used for learning and optimizing the model parameters, the validation set was used for selecting and tuning the model hyperparameters, and the test set was used as an evaluation criterion for the final performance of the model. In order to prevent overfitting, Batch Normalization (BN) and a Dropout layer were used in the training process to ensure the generalization ability of the model.

4.2. Categorical Comparison Structure after Adding BiLSTM Module

In order to investigate the impact of integrating LSTM and BiLSTM into the VGG16 model, the performance of the LSTM-VGG16 and BiLSTM-VGG16 models was compared with the basic VGG16 model while ensuring the same experimental setup. As shown in Table 3, the accuracy of the BiLSTM-VGG16 model is 97.80%, which is better than LSTM-VGG16 (97.15%) and VGG16 (97.05%). This indicates that the bidirectional LSTM architecture captures the temporal dependencies more effectively than the unidirectional LSTM and improves the model performance.

4.3. Results of Ablation Experiments Based on SE Attention Mechanisms

To further investigate the contribution of SE modules and BiLSTM in the proposed VBSNet (VGG16-BiLSTM-SE) model, we conducted an ablation study. We evaluated the performance of the model with only SE module added (VGG16-SE) and only BiLSTM added (VGG16-BiLSTM) to the VGG16 architecture. As shown in Table 4, the accuracy of the VGG16-SE model is 97.40%, while the accuracy of the VGG16-BiLSTM model is 97.80%. The VBSNet model that integrates both the SE module and BiLSTM achieves the highest accuracy of 98.35%, showing the complementary effect of these two components.

This paper also investigated the effect of adding the SE attention mechanism into different convolutional block positions of the VGG16-BiLSTM model. The results show that adding the SE attention mechanism to all the convolutional blocks achieves the best performance with an accuracy of 98.35%. 1SE indicates that the SE attention mechanism is added to the first convolutional block of the VGG16 model, 2SE indicates that the SE attention mechanism is added to the first two convolutional blocks, 3SE indicates that the SE attention mechanism is added to the first three convolutional blocks, 4SE indicates that the SE attention mechanism is added to the first four convolutional blocks, and VBSNet adds the SE attention mechanism into all five convolutional blocks. Adding the SE module to the first three convolutional blocks also achieves an accuracy of up to 98.30%, while adding it to one, two, or four convolutional blocks results in slightly lower accuracies of 97.50%, 97.80%, and 97.65%, respectively. The specific results are shown in Table 4.

4.4. Comparison of Different Model Classifications

In order to thoroughly evaluate the overall performance of the VBSNet model proposed in this paper on the task of the automatic recognition of chirps, we conducted a series of comparative experiments, which not only included the classical convolutional neural network structures VGG16 and VGG19, which are known for their repetitive patterns of small-sized convolutional layers and enhance the model’s learning capability by deepening the network structure, but also included classical network models such as AlexNet, which served as an early breakthrough deep learning model and dramatically improved the performance and generalization ability of the network by introducing the ReLU activation function and Dropout technique. In addition, this paper also considers MobileNetV3 [41], a lightweight model designed for edge computing, which achieves parameter reduction and computational efficiency through deep separable convolution, and EfficientNet [39], a recent and systematically optimized model which, through Neural Architecture Search (NAS) techniques and a composite scaling mechanism, achieves parameter reduction and computational efficiency improvements on a number of standard datasets, setting new performance benchmarks. The comparison of these representative models allows us to comprehensively evaluate the advantages of VBSNet in sound feature extraction and temporal information modeling, ensuring its superiority and practical application value in the field of sound recognition. The results are shown in Table 5, where the VBSNet model achieves the highest accuracy of 98.35%, outperforming all other models. The EfficientNet model obtains the next highest accuracy of 98.00%, followed by VGG16 with 97.05%. The MobileNetV3 model achieves 96.65% accuracy, while AlexNet has the lowest accuracy among the compared models, with an accuracy of 96.15%.

4.5. Analysis of Classification Results

In biological research and ecological monitoring, animal sound data are a key source of information for understanding their behavior and environmental conditions. Especially for some endangered species, such as the western black-crested gibbon, sound data can help researchers monitor their population dynamics, reproductive behavior, and habitat changes. However, for many rare or inaccessible species, collecting large numbers of sound samples is often challenging, which limits the effectiveness of using automated analysis methods such as machine learning, which typically require large amounts of data to train reliable models.

To address the problem of a small dataset of western black-crested gibbon calls, this study proposed a data augmentation method that increases data diversity by adding different levels of noise to the original recordings. This method is based on the concept of signal-to-noise ratio (SNR), which is the ratio of signal strength to noise strength, and is used to simulate different listening conditions in the natural environment. The SNR can be adjusted for positive (1, 0.5, 0.2, 0.1, 0.3) and negative (−0.1, −0.2, −0.5, −1) values, covering a wide range of environments from mild background noise to extreme noise interference.

Through our analysis and experimental validation, we found that applying data augmentation to the western black-crested gibbon call dataset significantly improved the classification accuracy of six different deep learning networks. These networks included VGG16, VGG19, AlexNet, MobileNetV3, EfficientNet, and VBSNet. This method not only increased the volume and diversity of the dataset, but also provided an effective means of modeling complex sound environments in the real world. Specific results are shown in Table 6, which demonstrates significant improvements in classification accuracy for all six models after data augmentation.

By comparing the classification results before and after data augmentation, it is evident that the performance of all six network models is significantly improved. This demonstrates the importance of data augmentation in improving the accuracy of models in classifying rare organism sound data. This not only increases the model’s ability to handle data in complex, changing contexts, but also enhances the model’s ability to generalize, a valuable strategy for bioacoustic research and in the development of automated monitoring systems. These findings emphasize the potential of employing advanced machine learning techniques in bioacoustic research, opening up new directions for future research and applications. It is important to note that the background noise was introduced artificially for the purpose of data augmentation. This may not perfectly replicate natural conditions where overlapping noise is present during the initial recording. The behavior of animals and the acoustic properties of the environment could influence the actual recordings. Despite these limitations, this approach provides a valuable means to enhance dataset diversity and improve model robustness.

By adding different levels of noise to the original gibbon call recordings, we were able to generate a series of sound samples under varying noise conditions, thereby modeling the various auditory environments that gibbons may encounter in their natural habitats. This increases the size of the dataset and improves the model’s robustness to noise, making it more adaptable to real-world complexities. For example, in denser woodlands or harsh climatic conditions, background noise may significantly affect the transmission and reception of sound. A model that can still accurately recognize and classify gibbon calls under these conditions will greatly enhance its utility and reliability.

In addition, using different signal-to-noise levels increases the dataset’s diversity and provides an experimental basis for studying the behavior of sound data in different noise environments, which is important for acoustic ecology research. By analyzing the response of calls under different noise levels, researchers can gain a deeper understanding of how sound travels through natural environments and how animals acoustically adapt to changes in their living environments.

In conclusion, the data augmentation method based on signal-to-noise ratio adjustment provides an effective solution for analyzing acoustic data of rare species such as the western black-crowned gibbon. This helps to improve the quality and efficiency of bioacoustic monitoring, and provides important technical support for the conservation and study of these precious species.

After a systematic performance evaluation of the VBSNet model, this paper presents the learning curves and evaluation metrics of the model after 100 training iterations. Figure 7 shows the accuracy and loss curves of VBSNet on the training and test sets. In the early stage of training, the model has high loss values and low accuracy, which is in line with the general pattern of deep learning model training. As the number of iterations increases, the model begins to learn the patterns in the data, the loss value continues to decrease, and the accuracy rate increases steadily. The model begins to converge after 25 iterations, and as the number of iterations increases, the two curves are gradually fitted, indicating that the model has achieved significant performance improvement by learning from the data. Subsequently, the model performance is increasingly stabilized, and both loss and accuracy on the validation set level off, which reveals the learning and convergence of the model.

By incorporating MFCC spectrograms as input features, the VBSNet model demonstrated excellent accuracy and robustness on the sound event recognition task. The model exhibits the following specific performance metrics for different categories of sound events, as shown in the Table 7 below.

The VBSNet model demonstrated a very high performance in recognizing the calls of the endangered species, the western black-crested gibbon, maintaining 98.49% precision and 98.00% recall, as well as an F1-score of 98.25%, as shown in Table 7. This further highlights the significant value of the model in ecological monitoring and biodiversity species identification.

In this paper, we provided the confusion matrix generated by the VBSNet model on this dataset, where 0 represents the brown-necked thrush, 1 represents the hooked thrush, 2 represents the pachyderm warbler, 3 represents the ruddy-tailed thrush, 4 represents the golden-winged thrush, 5 represents the great mockingbird, 6 represents the sound of cicadas, 7 represents the sound of rain, 8 represents the sound of the western black-crested gibbon, and 9 represents the sound of wind, and the confusion matrix is shown in Figure 8 above. In summary, this confusion matrix demonstrates that the model has a very high classification accuracy, and the vast majority of samples are correctly classified into their respective true categories. The few misclassifications that exist may be due to the similarity of sound features between categories. Future work could further tune the model or investigate the characteristics of the misclassified samples to improve the model’s ability to distinguish between these sound events.

5. Discussion

This study showed that augmenting a dataset by introducing noise with different signal-to-noise ratio levels significantly improves the performance of six deep learning networks on a western black-crested gibbon song classification task. These results emphasize the potential of using deep learning techniques to process rare and complex sound data in biodiversity monitoring and bioacoustic studies. These results emphasize the potential of using deep learning techniques to process rare and complex sound data in biodiversity monitoring and bioacoustic studies.

In this study, a neural network model based on VGG16 incorporating BiLSTM and SE modules was proposed for recognizing some bird calls and the calls of the endangered western black-crested gibbon under complex environmental sound conditions. A carefully constructed comparison experiment as well as a well-designed ablation experiment were conducted and its performance was demonstrated compared to current state-of-the-art models such as VGG16, VGG19, AlexNet, MobileNetV3, and EfficientNet in sound feature extraction and classification tasks. The model demonstrated 98.35% accuracy and 98.33% F1-Score on the bird sound classification task as well as on the calls of the endangered western black-collared gibbon, which is not only ahead of VGG16 (97.05% accuracy and 96.98% F1-Score) and VGG19 (96.55% accuracy and 96.45% F1-Score), but also in line with the performance of recent state-of-the-art models such as AlexNet, MobileNetV3, and EfficientNet, and its performance is even more in line with the trend presented by many recent studies, with the help of multi-feature attention mechanisms, some of which have already achieved 93.2% accuracy for sound classification in complex noise contexts [25], while the application of convolutional recurrent networks based on the attention mechanism to deal with the classification of ambient sounds has achieved 93.7% accuracy [22].

The VBSNet model proposed in this paper contains three core elements: First, a deeper convolutional neural network, VGG16, is used for feature extraction; second, the Squeeze-and-Excitation (SE) module, the channel-attention mechanism, which, by adaptively regulating the channel responses in the network, greatly improves the model’s reinforcement of useful features as well as the suppression of useless features, which effectively improves the overall characterization ability. This attentional mechanism enables the model to capture discriminative features more precisely, thus improving the accuracy of sound recognition. Third, the BiLSTM network, which is able to process sequence data efficiently and capture long-term dependencies in the pre- and post-texts, continues to deepen the model’s understanding of the sound sequences. As mentioned in several academic studies in recent years, BiLSTM has a unique advantage in capturing temporal relationships in sequence data and is the preferred model for sequence processing tasks. In our ablation experiments, either the addition of the SE module or the BiLSTM network alone positively affects the model performance. When combining the two, the best performance output was achieved, a result that clearly demonstrates the complementary and synergistic benefits that the SE module and BiLSTM network can bring to the sound recognition task. In addition, to address the problem of sparse data, this study balanced the dataset by adding noise with different signal-to-noise ratios to address the imbalance in the total number of detections per species.

The VBSNet model proposed in this study has important theoretical contributions. It improves feature extraction and temporal data processing by combining the SE module and BiLSTM network, and verifies the effectiveness of the multi-feature attention mechanism and sequence processing model in sound recognition tasks. In terms of accuracy against sound recognition, the VBSNet model demonstrates extremely high accuracy in categories such as hooked thrush as well as others. This is consistent with recent literature that suggests that adaptive attention mechanisms can significantly improve the accuracy of sound category recognition [42,43,44], especially when the target sound category has distinctive features. Meanwhile, the model achieved 100% recognition accuracy in the cicada and rain sound categories, which emphasizes the unique advantage of the present model in distinguishing environmental sounds with excellent handling of stability and continuity patterns in sound data. Although VBSNet performed well in most sound categories, the model performed slightly less well in the recognition of categories such as the large proposed woodpecker. The possible reason for this is that the sound features of these categories are more subtle and confusing, which poses a higher challenge to the model’s differentiation ability.

Future research needs to optimize the model architecture for this problem or introduce new methods such as sound enhancement techniques to improve recognition accuracy. Research should validate the VBSNet model on a wider and more diverse set of sound datasets and perform fine-grained optimization. For example, the robustness and generalization ability of the model can be further improved by increasing the sample size, improving data preprocessing methods, and optimizing the model parameters. More feature extraction methods and model integration strategies can be explored in the future, such as the introduction of self-supervised learning or transfer learning techniques, to improve the model performance on unseen data. In addition, for practical applications in the field of ecological monitoring and environmental protection, developing a real-time sound recognition system integrated with IoT technology could facilitate real-time monitoring and feedback of the ecological environment. This would enhance our ability to detect and respond to environmental changes promptly, contributing to more effective conservation strategies and biodiversity protection.

The excellent performance exhibited by the VBSNet model in this study demonstrates the significant potential of utilizing extracted MFCC audio feature spectrograms and deep learning techniques in sound analysis. This approach is particularly valuable for addressing complex sound recognition and classification tasks in ecological monitoring. Future work should include validating and fine-tuning the model on a broader array of sound datasets, along with exploring additional feature extraction methods and model integration strategies. These efforts aim to advance the development of sound recognition systems with higher robustness and generalization capabilities, providing the necessary technical support for key areas such as ecological monitoring and environmental protection. By improving these systems, we can enhance our ability to monitor biodiversity, track species populations, and assess the health of ecosystems, ultimately contributing to the preservation of biodiversity and the sustainability of natural habitats.

6. Conclusions

In this paper, we proposed a new deep learning model, VBSNet, designed to accurately recognize and classify the calls near the habitat of the western black-crested gibbon. This model may provide valuable technical support for the monitoring and conservation of this critically endangered species. Aiming at the problems of ignoring the time-series information and not being able to adapt channel-feature weights in the problem of call recognition, the model effectively combines the VGG16 convolutional neural network, Squeeze-and-Excitation (SE) attention mechanism, and Bidirectional Long Short-Term Memory (BiLSTM) network together; it has excellent performance in chirp recognition, which proves the effectiveness of the network. The introduction of the SE attention mechanism and the BiLSTM network proved to be a key component in improving the performance of the model. The SE module adaptively recalibrates the feature responses at the channel level, emphasizing informative features and suppressing less relevant ones. On the other hand, the BiLSTM network efficiently captures both forward and backward contextual information, making it possible to better model temporal relationships in sound sequences. The ablation experiments demonstrate the complementary advantages of these two mechanisms in improving sound recognition performance. All the metrics of the present model were higher than those of the other models compared in all the comparative experiments and the ablation experiments, indicating that the model proposed in this paper has better generalization capabilities. Future work will focus on further validating and optimizing our model on larger and more diverse datasets, exploring advanced feature extraction and model fusion techniques, and extending them to a wider range of sound recognition tasks. Through these efforts, we aim to develop sound recognition systems with robustness and generalization capabilities, which can provide beneficial technical support for various fields such as ecological monitoring and environmental protection.

Author Contributions

Conceptualization, R.H. and K.H.; methodology, Z.G. and L.W.; software, X.Z. and N.W.; validation, R.H. and K.H.; formal analysis, R.H. and Z.G.; investigation, X.Z.; resources, K.H.; data curation, K.H.; writing—original draft preparation, X.Z.; writing—review and editing, K.H. and X.Z.; visualization, X.Z.; supervision, L.Y. and R.H.; project administration, K.H.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

We are grateful to the Chuxiong Management and Protection Branch of the Ailao Mountains National Nature Reserve in Yunnan Province and the builders of the passive acoustic monitoring system in the early stages of this project. We thank the Major Science and Technology Project of Yunnan Province (202202AD080010) for support. We thank the National Natural Science Foundation of China for grant Nos. 32160369 and 31860182. We are very grateful for the financial support from the open project (No. 2022-BDG-07) of the State Forestry and Grassland Bureau Key Laboratory of Forest Ecological Big Data, Southwest Forestry University.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Part 1: If you need this part of experimental data, you can send an email to hukunrong@swfu.edu.cn to obtain it.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Sugai, L.S.M.; Silva, T.S.F.; Ribeiro, J.W.; Llusia, D. Terrestrial Passive Acoustic Monitoring: Review and Perspectives. BioScience 2018, 69, 15–25. [Google Scholar] [CrossRef]
Winiarska, D.; Szymański, P.; Osiejuk, T.S. Detection ranges of forest bird vocalisations: Guidelines for passive acoustic monitoring. Sci. Rep. 2024, 14, 894. [Google Scholar] [CrossRef] [PubMed]
Brinkløv, S.M.M.; Macaulay, J.D.J.; Bergler, C.; Tougaard, J.; Beedholm, K.; Elmeros, M.; Madsen, P.T. Open-source workflow approaches to passive acoustic monitoring of bats. Methods Ecol. Evol. 2023, 14, 1747–1763. [Google Scholar] [CrossRef]
Hema, C.R.; Márquez, F.P.G. Emotional speech Recognition using CNN and Deep learning techniques. Appl. Acoust. 2023, 211, 109492. [Google Scholar] [CrossRef]
Chen, R.; Hei, L.; Lai, Y. Image Recognition and Safety Risk Assessment of Traffic Sign Based on Deep Convolution Neural Network. IEEE Access 2020, 8, 201799–201805. [Google Scholar] [CrossRef]
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2012, 60, 84–90. [Google Scholar] [CrossRef]
Graves, A.; Rahman Mohamed, A.; Hinton, G.E. Speech recognition with deep recurrent neural networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649. [Google Scholar]
Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. arXiv 2014, arXiv:1409.3215. [Google Scholar]
Ren, D.; Srivastava, G. A Novel Natural Language Processing Model in Mobile Communication Networks. Mob. Netw. Appl. 2022, 27, 2575–2584. [Google Scholar] [CrossRef]
Yilihamu, D.; Ablimit, M.; Hamdulla, A. Speech Language Identification Using CNN-BiLSTM with Attention Mechanism. In Proceedings of the 2022 3rd International Conference on Pattern Recognition and Machine Learning (PRML), Chengdu, China, 15–17 July 2022; pp. 308–314. [Google Scholar] [CrossRef]
Cao, G.; Tang, Y.; Sheng, J.; Cao, W. Emotion Recognition from Children Speech Signals Using Attention Based Time Series Deep Learning. In Proceedings of the 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), San Diego, CA, USA, 18–21 November 2019; pp. 1296–1300. [Google Scholar] [CrossRef]
Luo, A.; Zhong, L.; Wang, J.; Wang, Y.; Li, S.; Tai, W. Short-Term Stock Correlation Forecasting Based on CNN-BiLSTM Enhanced by Attention Mechanism. IEEE Access 2024, 12, 29617–29632. [Google Scholar] [CrossRef]
Xu, C.; Liang, R.; Wu, X.; Cao, C.; Chen, J.; Yang, C.; Zhou, Y.; Wen, T.; Lv, H.; Wei, C. A Hybrid Model Integrating CNN–BiLSTM and CBAM for Anchor Damage Events Recognition of Submarine Cables. IEEE Trans. Instrum. Meas. 2023, 72, 1–11. [Google Scholar] [CrossRef]
Jeantet, L.; Dufourq, E. Improving deep learning acoustic classifiers with contextual information for wildlife monitoring. Ecol. Inform. 2023, 77, 102256. [Google Scholar] [CrossRef]
Ruan, W.; Wu, K.; Chen, Q.; Zhang, C. Resnet-based bio-acoustics presence detection technology of hainan gibbon calls. Appl. Acoust. 2022, 198, 108939. [Google Scholar] [CrossRef]
Kahl, S.; Wood, C.M.; Eibl, M.; Klinck, H. BirdNET: A deep learning solution for avian diversity monitoring. Ecol. Inform. 2021, 61, 101236. [Google Scholar] [CrossRef]
Morales, G.; Vargas, V.; Espejo, D.; Poblete, V.; Tomasevic, J.A.; Otondo, F.; Navedo, J.G. Method for passive acoustic monitoring of bird communities using UMAP and a deep neural network. Ecol. Inform. 2022, 72, 101909. [Google Scholar] [CrossRef]
Dufourq, E.; Durbach, I.N.; Hansford, J.P.; Hoepfner, A.; Ma, H.; Bryant, J.V.; Stender, C.S.; Li, W.; Liu, Z.; Chen, Q.; et al. Automated detection of Hainan gibbon calls for passive acoustic monitoring. Remote Sens. Ecol. Conserv. 2021, 7, 475–487. [Google Scholar] [CrossRef]
Lakdari, M.W.; Ahmad, A.H.; Sethi, S.; Bohn, G.A.; Clink, D.J. Mel-frequency cepstral coefficients outperform embeddings from pre-trained convolutional neural networks under noisy conditions for discrimination tasks of individual gibbons. Ecol. Inform. 2024, 80, 102457. [Google Scholar] [CrossRef]
Aodha, O.M.; Gibb, R.; Barlow, K.E.; Browning, E.; Jones, K.E. Bat detective—Deep learning tools for bat acoustic signal detection. PLoS Comput. Biol. 2018, 14, e1005995. [Google Scholar]
Kirsebom, O.S.; Frazão, F.; Simard, Y.; Roy, N.; Matwin, S.; Giard, S. Performance of a Deep Neural Network at Detecting North Atlantic Right Whale Upcalls. J. Acoust. Soc. Am. 2020, 147, 2636. [Google Scholar] [CrossRef]
Zhang, Z.; Xu, S.; Zhang, S.; Qiao, T.; Cao, S. Attention based convolutional recurrent neural network for environmental sound classification. Neurocomputing 2021, 453, 896–903. [Google Scholar] [CrossRef]
Mu, W.; Yin, B.; Huang, X.; Xu, J.; Du, Z. Environmental sound classification using temporal-frequency attention based convolutional neural network. Sci. Rep. 2021, 11, 1–14. [Google Scholar] [CrossRef]
Kvsn, R.R.; Montgomery, J.; Garg, S.; Charleston, M. Bioacoustics Data Analysis—A Taxonomy, Survey and Open Challenges. IEEE Access 2020, 8, 57684–57708. [Google Scholar] [CrossRef]
Yang, C.; Gan, X.; Peng, A.; Yuan, X. ResNet Based on Multi-Feature Attention Mechanism for Sound Classification in Noisy Environments. Sustainability 2023, 15, 10762. [Google Scholar] [CrossRef]
Wang, T.; Liu, Z.; Ou, W.; Huo, H. Domain adaptation based on feature fusion and multi-attention mechanism. Comput. Electr. Eng. 2023, 108, 108726. [Google Scholar] [CrossRef]
Nanni, L.; Maguolo, G.; Brahnam, S.; Paci, M. An Ensemble of Convolutional Neural Networks for Audio Classification. arXiv 2020, arXiv:2007.07966. [Google Scholar] [CrossRef]
Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM networks. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, Montreal, QC, Canada, 31 July–4 August 2005; Volume 4, pp. 2047–2052. [Google Scholar] [CrossRef]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
Zhong, E.; Guan, Z.; Zhou, X.; Zhao, Y.; Hu, K. Application of passive acoustic monitoring techniques to the monitoring of the western black-crested gibbon. Biodiversity 2021, 29, 9. [Google Scholar]
Zhou, X.; Hu, K.; Guan, Z.; Yu, C.; Wang, S.; Fan, M.; Sun, Y.; Cao, Y.; Wang, Y.; Miao, G. Methods for processing and analyzing passive acoustic monitoring data: An example of song recognition in western black-crested gibbons. Ecol. Indic. 2023, 155, 110908. [Google Scholar] [CrossRef]
Pervaiz, A.; Hussain, F.; Israr, H.; Tahir, M.A.; Raja, F.R.; Baloch, N.K.; Ishmanov, F.; Zikria, Y.B. Incorporating Noise Robustness in Speech Command Recognition by Noise Augmentation of Training Data. Sensors 2020, 20, 2326. [Google Scholar] [CrossRef] [PubMed]
Saito, K.; Uhlich, S.; Fabbro, G.; Mitsufuji, Y. Training Speech Enhancement Systems with Noisy Speech Datasets. arXiv 2021, arXiv:2105.12315. [Google Scholar]
Siami-Namini, S.; Tavakoli, N.; Namin, A.S. The Performance of LSTM and BiLSTM in Forecasting Time Series. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; pp. 3285–3292. [Google Scholar] [CrossRef]
Zeyer, A.; Doetsch, P.; Voigtlaender, P.; Schlüter, R.; Ney, H. A comprehensive study of deep bidirectional LSTM RNNS for acoustic modeling in speech recognition. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LO, USA, 5–9 March 2017; pp. 2462–2466. [Google Scholar] [CrossRef]
VeeraSekharReddy, B.; Rao, K.S.; Koppula, N. An Attention Based Bi-LSTM DenseNet Model for Named Entity Recognition in English Texts. Wirel. Pers. Commun. 2023, 130, 1435–1448. [Google Scholar] [CrossRef]
Dhumal Deshmukh, R.; Kiwelekar, A. Deep Learning Techniques for Part of Speech Tagging by Natural Language Processing. In Proceedings of the 2020 2nd International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), Bangalore, India, 5–7 March 2020; pp. 76–81. [Google Scholar] [CrossRef]
Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv 2019, arXiv:1905.11946. [Google Scholar]
Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R.; et al. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar] [CrossRef]
Fang, Z.; Yin, B.; Du, Z.; Huang, X. Fast environmental sound classification based on resource adaptive convolutional neural network. Sci. Rep. 2022, 12, 6599. [Google Scholar] [CrossRef] [PubMed]
Li, J.; Wang, B.; Cui, X.; Li, S.; Liu, J. Underwater acoustic target recognition based on attention residual network. Entropy 2022, 24, 1657. [Google Scholar] [CrossRef] [PubMed]
Ren, Z.; Qian, K.; Dong, F.; Dai, Z.; Nejdl, W.; Yamamoto, Y.; Schuller, B.W. Deep attention-based neural networks for explainable heart sound classification. Mach. Learn. Appl. 2022, 9, 100322. [Google Scholar] [CrossRef]

Figure 1. Spectrograms of Mel’s inverted spectral coefficients extracted from different audio types. The x-axis represents time (s), and the y-axis represents frequency (Hz).

Figure 2. Comparison of spectrograms of Mel’s inverted spectral coefficients at different signal-to-noise ratios.

Figure 3. Overall environmental sound classification model. The network includes a total of 16 convolutional layers, 4 SE modules, 4 Batch Normalization layers, 5 maximal pooling layers, a BiLSTM layer, and a fully connected layer, for a total of 31 layers.

Figure 4. LSTM network structure diagram.

Figure 5. BiLSTM network structure diagram.

Figure 6. SENet network structure diagram.

Figure 7. Training and Validation Accuracy and Loss Curves of VBSNet.

Figure 8. Confusion matrix.

Table 1. Number of samples of each type of chirp under different signal-to-noise ratio (SNR) conditions.

Class	Original Sample Size	1	0.5	0.2	0.1	0.3	−0.1	−0.2	−0.5	−1
Pomatorhinus ruficollis	200	200	200	200	200	200	200	200	200	200
Erythrogenys gravivox	200	200	200	200	200	200	200	200	200	200
Phylloscopus valentini	200	200	200	200	200	200	200	200	200	200
Trochalopteron milnei	200	200	200	200	200	200	200	200	200	200
Trochalopteron chrysopterum	200	200	200	200	200	200	200	200	200	200
Psilopogon virens	200	200	200	200	200	200	200	200	200	200
Terpnosia obscura	200	200	200	200	200	200	200	200	200	200
rain	200	200	200	200	200	200	200	200	200	200
wind	200	200	200	200	200	200	200	200	200	200
Nomascus concolor	200	200	200	200	200	200	200	200	200	200

Note: the 1, 0.5, 0.2, 0.1, 0.3, −0.1, −0.2, −0.5, and −1 in the table indicate the inclusion of different signal-to-noise ratios (SNRs).

Table 2. Table of parameter settings for this study.

Parameter	Value
Epoch	100
Batch size	16
Learning rate	0.0001
Optimizer	Adam
Loss function	Cross-entropy
Dropout rate	0.5

Table 3. Comparison results of classification by adding different LSTM models.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)
VGG16	97.05%	96.97%	97.02%	96.98%
VGG16-LSTM	97.15%	97.19%	97.09%	97.11%
VGG16-BiLSTM	97.80%	97.80%	97.77%	97.77%

Table 4. Classification results of ablation experiments with the addition of different modules and attentions.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)
VGG16-SE	97.40%	97.45%	97.32%	97.37%
VGG16-BiLSTM	97.80%	97.80%	97.77%	97.77%
VGG16-BiLSTM-1SE	97.50%	97.48%	97.45%	97.45%
VGG16-BiLSTM-2SE	97.80%	97.79%	97.76%	97.76%
VGG16-BiLSTM-3SE	98.30%	98.29%	98.26%	98.27%
VGG16-BiLSTM-4SE	97.65%	97.69%	97.56%	97.61%
VBSNet	98.35%	98.34%	98.34%	98.33%

Table 5. Comparison of classification results of different models.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)
VGG16	97.05%	96.97%	97.02%	96.98%
VGG19	96.55%	96.62%	96.43%	96.45%
AlexNet	96.15%	96.18%	96.11%	96.12%
MobileNetV3	96.65%	96.63%	96.61%	96.61%
EfficientNet	98.00%	98.02%	97.96%	97.99%
VBSNet	98.35%	98.34%	98.34%	98.33%

Table 6. Comparison of different model classification results before and after audio data augmentation.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)
VGG16_Ori	89.05%	89.18%	89.01%	88.80%
VGG16_Aug	97.05%	96.97%	97.02%	96.98%
VGG19_Ori	88.10%	88.60%	86.99%	87.27%
VGG19_Aug	96.55%	96.62%	96.43%	96.45%
AlexNet_Ori	89.55%	90.60%	89.24%	89.08%
AlexNet_Aug	96.16%	96.18%	96.11%	96.12%
MobileNetV3_Ori	86.57%	86.13%	86.46%	85.96%
MobileNetV3_Aug	96.65%	96.63%	96.61%	96.61%
EfficientNet_Ori	88.06%	88.52%	87.03%	86.84%
EfficientNet_Aug	98.00%	98.02%	97.96%	97.99%
VBSNet_Ori	91.04%	90.63%	90.41%	90.32%
VBSNet_Aug	98.35%	98.34%	98.34%	98.33%

Table 7. Classification of different voice categories.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)
Pomatorhinus ruficollis	99.85%	100.00%	98.67%	99.33%
Erythrogenys gravivox	99.60%	96.95%	98.96%	97.95%
Phylloscopus valentini	99.40%	96.26%	98.10%	97.17%
Trochalopteron milnei	99.70%	99.52%	97.63%	98.56%
Trochalopteron chrysopterum	99.70%	97.52%	99.49%	98.50%
Psilopogon virens	99.35%	95.65%	97.24%	96.44%
Terpnosia obscura	99.80%	100.00%	97.79%	98.88%
Rain	99.95%	100.00%	99.49%	99.75%
Nomascus concolor	99.65%	98.49%	98.00%	98.25%
Wind	99.70%	99.00%	98.02%	98.51%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Hu, R.; Hu, K.; Wang, L.; Guan, Z.; Zhou, X.; Wang, N.; Ye, L. Using Deep Learning to Classify Environmental Sounds in the Habitat of Western Black-Crested Gibbons. Diversity 2024, 16, 509. https://doi.org/10.3390/d16080509

AMA Style

Hu R, Hu K, Wang L, Guan Z, Zhou X, Wang N, Ye L. Using Deep Learning to Classify Environmental Sounds in the Habitat of Western Black-Crested Gibbons. Diversity. 2024; 16(8):509. https://doi.org/10.3390/d16080509

Chicago/Turabian Style

Hu, Ruiqi, Kunrong Hu, Leiguang Wang, Zhenhua Guan, Xiaotao Zhou, Ning Wang, and Longjia Ye. 2024. "Using Deep Learning to Classify Environmental Sounds in the Habitat of Western Black-Crested Gibbons" Diversity 16, no. 8: 509. https://doi.org/10.3390/d16080509

APA Style

Hu, R., Hu, K., Wang, L., Guan, Z., Zhou, X., Wang, N., & Ye, L. (2024). Using Deep Learning to Classify Environmental Sounds in the Habitat of Western Black-Crested Gibbons. Diversity, 16(8), 509. https://doi.org/10.3390/d16080509

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Using Deep Learning to Classify Environmental Sounds in the Habitat of Western Black-Crested Gibbons

Abstract

1. Introduction

2. Materials

2.1. Data Sources

2.2. Data Augmentation

3. Methods

3.1. Overall Classification Model

3.1.1. Based on the Improved VGG16

3.1.2. Bidirectional Long and Short-Term Memory Neural Network

3.1.3. SE Attention Mechanism

3.1.4. Evaluation Indicators

4. Results

4.1. Experimental Environment

4.2. Categorical Comparison Structure after Adding BiLSTM Module

4.3. Results of Ablation Experiments Based on SE Attention Mechanisms

4.4. Comparison of Different Model Classifications

4.5. Analysis of Classification Results

5. Discussion

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI