Article

Recognition of Western Black-Crested Gibbon Call Signatures Based on SA_DenseNet-LSTM-Attention Network

1 School of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming 650224, China
2 Institute of Big Data and Artificial Intelligence, Southwest Forestry University, Kunming 650024, China
3 Key Laboratory of National Forestry and Grassland Administration on Forestry and Ecological Big Data, Southwest Forestry University, Kunming 650024, China
4 Yunnan Academy of Biodiversity, Southwest Forestry University, Kunming 650024, China
5 College of Humanities and Law, Southwest Forestry University, Kunming 650024, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Sustainability 2024, 16(17), 7536; https://doi.org/10.3390/su16177536
Submission received: 31 May 2024 / Revised: 19 August 2024 / Accepted: 27 August 2024 / Published: 30 August 2024

Abstract

As part of the ecosystem, the western black-crested gibbon (Nomascus concolor) is important for ecological sustainability. Calls are an important means of communication for gibbons, so accurately recognizing and categorizing gibbon calls is important for population monitoring and conservation. Because acoustic monitoring generates large volumes of sound data that would take considerable time to label manually, this paper proposes a western black-crested gibbon call recognition network based on SA_DenseNet-LSTM-Attention. First, to address the lack of datasets, this paper explores 10 different data augmentation methods and converts all sound data into Mel spectrograms for model input. Testing shows that the WaveGAN audio data augmentation method yields the largest improvement in classification accuracy across all models in this paper. Then, to address the low accuracy of call recognition, we propose fusing the features extracted by DenseNet with the temporal features extracted by LSTM using principal component analysis (PCA), and finally the proposed SA_DenseNet-LSTM-Attention western black-crested gibbon call recognition network is used for recognition training. To verify the effectiveness of the proposed feature fusion method, we classified 13 different types of sounds and compared several networks: the accuracy of the VGG16 model improved by 2.0%, the Xception model by 1.8%, the MobileNet model by 2.5%, and the DenseNet model by 2.3%. Compared with other classical call recognition networks, our proposed network obtained the highest accuracy of 98.2%, and its convergence is better than that of all compared models. Our experiments demonstrate that deep learning-based call recognition can provide effective technical support for monitoring western black-crested gibbon populations.

1. Introduction

Because they disperse seeds that keep forests healthy, western black-crested gibbons are essential to their habitats. Understanding these animals' calls and social structures improves our comprehension of their effects on the ecosystem and helps safeguard the ecological balance. Because gibbons are highly sensitive to environmental changes, alterations in the forest and climate change are reflected in their calls. By monitoring gibbon calls, scientists can identify potential threats to the ecosystem early and implement necessary conservation measures. Finally, by identifying the calls of the western black-crested gibbon, researchers have gained a better understanding of both the species and the environments in which it lives. This knowledge has also provided tools and vital information for developing more successful conservation strategies. These studies support the preservation of the ecosystem's overall health and equilibrium, in addition to aiding in the conservation of the gibbons themselves.
The continued monitoring and conservation of western black-crested gibbon groups is an important element of ecologically sustainable development. China holds the largest number of groups of the western black-crested gibbon, about 270 groups [1]. Gibbon calls are distinctive and travel long distances [2]; they are therefore often used to determine the location and number of wild gibbon groups so that appropriate protection measures can be taken [3]. In some higher-elevation areas, western black-crested gibbons are already monitored daily through sentinel statistical surveys, in which staff listen to, record, and analyze their calls [4,5]. This manual monitoring strategy is labor-intensive and time-consuming, so it cannot consistently meet the demands of online and timely monitoring [6]. Since the advent of passive acoustic monitoring (PAM), such systems have also been tested on the western black-crested gibbon [7]. In addition, social animals such as the western black-crested gibbon use a variety of sounds to convey messages and maintain social bonds within their communities. Recognizing their calls can therefore help scientists understand their social structure, communication methods, and social dynamics, which is crucial for the in-depth study and conservation of the species. By analyzing these calls, researchers can identify differences between individuals and groups, which helps track the numbers and distribution of specific groups for more precise ecological monitoring and conservation efforts. However, recognizing the calls of western black-crested gibbons presents several challenges, as follows:
  • Although passive acoustics can meet the requirements of western black-crested gibbon group monitoring, it would be labor-intensive to manually identify gibbon sounds from a large number of acoustic recordings;
  • Deep learning models can be used to effectively recognize the calls of the western black-crested gibbon. However, the western black-crested gibbon lives in remote areas, so it is difficult to obtain a large dataset of its calls;
  • The task of training a deep learning network to distinguish between various call durations is challenging;
  • It is challenging to effectively extract features from the input sound data and to fuse those features.
Deep learning has recently begun to take center stage in passive acoustic monitoring as a result of its ongoing development. It has been shown that the manual processing of acoustic data can be alleviated by using convolutional neural networks (CNNs) (LeCun et al. [8,9]) and that specific species in acoustic recordings can be identified by trained classifiers [10,11,12,13,14,15,16,17]. However, a good classifier for these particular species needs a large amount of reliable data, because an inadequate dataset reduces a deep learning model's ability to categorize data [18]. Additionally, gathering these data takes a lot of time, especially for remote and rare species.
The solution adopted in this paper to deal with the lack of a dataset is data augmentation [19]. Data augmentation creates new, additional training data by applying one or more deformations to a set of annotated training samples, for example by time-shifting the input audio or adding natural background noise [16,20,21]. When the dataset is small, the augmented data can effectively prevent overfitting during classification training. The effectiveness of data augmentation in enhancing model performance has been demonstrated in a number of studies. Salamon et al. [22] used four data augmentation methods, Time Stretching (TS), Pitch Shifting (PS), Dynamic Range Compression (DRC), and Background Noise (BG), to improve the accuracy of environmental sound classification; Davis et al. [23] used the four methods proposed by Salamon et al. [22] plus five Linear Predictive Cepstral Coefficient (LPCC)-based data expansion methods to further improve the accuracy of environmental sound classification. Reasonable data expansion can therefore effectively improve the accuracy and generalization performance of a model. However, standard techniques such as pitch scaling and time stretching alter the acoustic properties of most sounds, so current techniques fail to sufficiently account for the unique characteristics of each sound. The augmentation procedure can be chosen appropriately only when knowledge of the target sound is available; if augmentation is carried out blindly, the unique vocalizations of many animals may be lost or changed.
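As a concrete illustration of these classical transformations, the following minimal sketch (not taken from the original paper) applies time stretching, pitch shifting, and background-noise mixing to a single clip using librosa and NumPy; the file paths and parameter ranges are illustrative assumptions rather than the settings used in this study.

```python
import numpy as np
import librosa

def augment_clip(path, noise_path, sr=32000):
    """Return classically augmented versions of one audio clip.

    Paths and parameter ranges are illustrative; they are not the exact
    settings used in the paper.
    """
    y, _ = librosa.load(path, sr=sr)
    noise, _ = librosa.load(noise_path, sr=sr)

    out = {}
    # Time stretching: speed up or slow down without changing pitch.
    out["time_stretch"] = librosa.effects.time_stretch(
        y, rate=np.random.uniform(0.8, 1.2))
    # Pitch shifting: shift by up to +/-2 semitones without changing duration.
    out["pitch_shift"] = librosa.effects.pitch_shift(
        y, sr=sr, n_steps=np.random.uniform(-2.0, 2.0))
    # Background noise: mix in natural noise at a random low level.
    noise = np.resize(noise, y.shape)
    out["background_noise"] = y + np.random.uniform(0.05, 0.2) * noise
    return out
```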
Currently, the most widely used solution for the problem of missing data is the Generative Adversarial Network (GAN) [24]. GANs are frequently used to augment speech and image data because they can produce realistic virtual data without changing the characteristics of the original sound data [25,26]. Although GANs produce better results for data augmentation, they still have some flaws: first, the quality of the generated speech data is extremely low when training data are scarce; second, traditional GAN augmentation methods generate speech data directly from waveforms and spectrograms and therefore ignore the frequency and time features present in the sound; third, most of the sound data come from outdoor recordings, and the background noise they contain can prevent traditional GANs from learning effective animal sound features. To address these limitations of traditional GAN networks, in this paper we choose the state-of-the-art WaveGAN network [27], which can effectively compensate for these shortcomings.
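For reference, the sketch below outlines a WaveGAN-style generator in PyTorch: a latent vector is projected, reshaped, and upsampled through strided 1D transposed convolutions into a raw waveform. The layer sizes follow the commonly published WaveGAN configuration for roughly 16k-sample outputs and are our assumption, not the exact network trained in this paper.

```python
import torch
import torch.nn as nn

class WaveGANGenerator(nn.Module):
    """Minimal WaveGAN-style generator: latent vector -> raw waveform.

    Layer sizes follow the widely used 16384-sample configuration; they are
    assumptions, not the exact network trained in the paper.
    """
    def __init__(self, latent_dim=100, model_size=64):
        super().__init__()
        self.model_size = model_size
        self.fc = nn.Linear(latent_dim, 16 * 16 * model_size)

        def up(cin, cout):
            # A stride-4 transposed conv quadruples the temporal length.
            return nn.ConvTranspose1d(cin, cout, kernel_size=25,
                                      stride=4, padding=11, output_padding=1)

        self.net = nn.Sequential(
            nn.ReLU(),
            up(16 * model_size, 8 * model_size), nn.ReLU(),
            up(8 * model_size, 4 * model_size), nn.ReLU(),
            up(4 * model_size, 2 * model_size), nn.ReLU(),
            up(2 * model_size, model_size), nn.ReLU(),
            up(model_size, 1), nn.Tanh(),  # waveform in [-1, 1]
        )

    def forward(self, z):
        x = self.fc(z).view(z.size(0), 16 * self.model_size, 16)
        return self.net(x)  # (batch, 1, 16384)

# Example: sample a batch of synthetic call waveforms from random latents.
z = torch.randn(4, 100)
fake_audio = WaveGANGenerator()(z)  # torch.Size([4, 1, 16384])
```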
Still, there is room for improvement in current approaches to bioacoustic classification. Convolutional neural networks usually require inputs of uniform length, yet sound signals can differ greatly in duration. Sound clips must therefore be trimmed or processed separately before being passed to the network, because the time-frequency content of each clip varies in length; this approach is likely to introduce noise or lose crucial audio information. Recurrent neural networks, by contrast, are designed to process sequential input: their recurrent units continuously update an internal state in order to retrieve pertinent information from longer sound sequences. Convolutional neural networks (CNNs) are highly effective in classification tasks and prioritize the computation of spatial structure, which makes them especially suitable for processing spatial data such as images. Recurrent neural networks (RNNs), however, are better suited for time-series analysis because they can extract temporal properties from sequential data. Audio processing depends heavily on frequency information, yet RNNs frequently struggle to model local frequency features, a crucial aspect of audio classification applications. As a special type of RNN, long short-term memory (LSTM) [28] has demonstrated strong performance on long-sequence tasks; its gated memory cells mitigate the gradient vanishing and explosion problems that commonly arise when training on long sequences. Owing to these benefits, LSTM networks are frequently combined with CNNs for speech recognition [29,30,31]. Although most CNN-LSTM networks obtain better classification results, most CNN-LSTM-based models feed highly contextual fully connected features into the LSTM, which lack fine-grained representations of the important features and often produce unsatisfactory recognition results [32]. Additionally, the deeper features extracted from the various CNN layers are underutilized in the combination. More seriously, CNN-LSTM-based network models have relatively weak attentional ability, which may allow noise to interfere during training and may even degrade the model's classification performance [33]. Although different attention mechanisms have been studied to address this problem, there is still room for improvement in how attention modules are designed for different levels of CNN features.
The remainder of the paper is organized as follows: Section 2 reviews related work; Section 3 describes the data sources and data augmentation; Section 4 describes the methodology, network model, and attention modules used in this investigation; Section 5 presents the comparative classification accuracy of all the models in this paper on the validation set; Section 6 discusses the identification of western black-crested gibbon calls; and Section 7 presents the conclusions on western black-crested gibbon call recognition.

2. Related Work

Numerous techniques have been put forward in the field of animal call recognition, including network architectures such as CNN, RNN, and LSTM. These techniques can be broadly divided into two categories: deep learning-based techniques and conventional feature engineering-based techniques. Conventional techniques for call recognition often depend on acoustic feature extraction, such as the Mel spectrogram and MFCC, with the extracted features fed into a machine learning classifier for classification. For example, Zhou et al. [34] used extracted Mel spectrograms to recognize gibbon calls with a maximum accuracy of 99.8%; Rafael et al. [35] used spectrograms to recognize 915 bird species with 71% accuracy; and Roop et al. [36] used spectrograms to recognize bird calls with 96.1% accuracy. However, these methods have limited performance when dealing with complex audio signals and have difficulty capturing the temporal characteristics of gibbon calls.
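As a small illustration of this conventional pipeline, the sketch below computes a log-Mel spectrogram with librosa; the sample rate matches the recordings described later in this paper, while the remaining parameters are illustrative assumptions.

```python
import numpy as np
import librosa

def log_mel_spectrogram(path, sr=32000, n_mels=128):
    """Load an audio clip and return its log-scaled Mel spectrogram."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=512, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, n_frames)

# In a conventional pipeline the spectrogram would be flattened or pooled and
# fed to a classical classifier such as an SVM or random forest.
```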
In 1997, Hochreiter et al. proposed the long short-term memory (LSTM) network [28], which is now extensively used in speech processing. LSTM networks solve the problem that standard recurrent neural networks (RNNs) cannot capture long-term dependencies. Additionally, LSTM networks use gating mechanisms, including input, output, and forget gates, to better manage information flow and enhance network performance. Because it can maintain and update state information at each time step, LSTM is especially well suited for tasks involving multi-time-step predictions, which is why it is so popular in speech recognition. Geng et al. [37] designed an LSTM-based speech recognition network, mainly used to recognize speech produced in English teaching; Ahmed et al. [38] proposed a speech emotion recognition model combining a CNN and an LSTM network, where the inclusion of the LSTM effectively improves the overall network performance; and Abdelhamid et al. [39] presented a network architecture integrating CNN and LSTM networks, significantly enhancing the model's efficacy in emotional speech recognition.
With the advancement of deep learning, RNNs are increasingly employed in the speech recognition field [40,41]. Furthermore, incorporating a temporal attention mechanism significantly enhances model accuracy. Examples can be seen in previous studies [42,43,44], where temporal attention mechanisms have been integrated into a variety of network models to enhance the precision of speech emotion recognition. Because the temporal attention mechanism enables the model to dynamically assign weights to information at different time steps and to concentrate on task-relevant time steps when processing time-series data, the model can learn to adapt to changes at different time steps during training. It also enhances the network's capacity for generalization by enabling it to adaptively model sequences of varying lengths.
However, as the number of network layers continues to grow, the problem of vanishing gradients occurs. In order to solve this problem, many people have conducted a lot of research, for example, He et al. [45] proposed a deep residual structure to solve the problem of gradient vanishing in deep networks; Gustav et al. [46] proposed a self-similarity based neural network macro-architecture design strategy to solve the problem of gradient vanishing; and Rupesh et al. [47] proposed a highway network that allows information to flow unimpeded on a multilayer information highway.
Although existing studies have made some progress in gibbon call recognition, there are still some shortcomings. Our proposed SA_DenseNet-LSTM-Attention model achieves innovation by (1) adopting the DenseNet network as the backbone to enhance feature extraction; (2) incorporating a self-attention mechanism after each DenseNet convolutional layer to dynamically adjust feature weights; (3) combining the LSTM and temporal attention mechanism to capture the time dependence of audio signals and effectively extract sequence features for calls of different lengths; and (4) fusing features using principal component analysis (PCA) to further improve model performance. The experimental results show that our method significantly outperforms existing methods in terms of accuracy, precision, recall, and F1-score.

3. Materials

3.1. Data Sources

We chose two parts of data: the dataset from Zhou et al. [48] and the dataset used in the paper published by Zhou et al. [49]. These experimental data were originally obtained by Zhong et al. [7] and Zhou et al. [48] from a call monitoring system in the Chuxiong area of the Ailao Mountains National Nature Reserve in Chuxiong City, Yunnan Province, China (23°36′–24°44′ N, 100°54′–101°30′ E) (Figure 1f). The data were all acquired by a self-developed pickup array (Patent No. ZL 2018 2 2264510.6) (Figure 1c). The final audio files are stored in AAC format with a sampling rate of 32 kHz and a duration of approximately 30 min per file. The acquired sensor data are transmitted through the 5.8 GHz wireless network of the main link; to ensure reliable transmission, we use a LoRa network as an auxiliary link, which is responsible for locating faults in the main link and also takes part in the monitoring data transmission task (Figure 1d).
The dataset types included wind, rain, bird calls (Actinodura strigula, Fulvetta cinereiceps, Psilopogon virens, Pomatorhinus ruficollis, Parus monticolus, and Pterorhinus sannio), gibbon calls, and cicada calls. In this article, however, we delineate the gibbon calls in more detail. Based on Fan et al.'s [50] method of classifying western black-crested gibbon calls, this paper also classifies the calls of the western black-crested gibbon into four types: the simple repetitive syllable calls of the male gibbon, the weakly modulated syllable calls of the male gibbon, the strongly modulated syllable calls of the male gibbon, and the agonistic calls of the female gibbon, with the specific call structures shown in Figure 2.
Figure 3 shows the differences among the call types in frequency and amplitude. From the figure, the “aa notes” call type has the highest average frequency, around 45.5 Hz; the “great call” type has a lower average frequency, around 44.5 Hz; the “weakly modulated” type has an average frequency close to 44 Hz; and the “modulated” type has the lowest average frequency, close to 42.5 Hz. The highest average frequency of the “aa notes” type and the lowest of the “modulated” type may be related to specific behavioral patterns or environmental adaptations. The “aa notes” call type also had the highest amplitude, nearly −72 dB, suggesting that this type of call is louder and may be advantageous in specific social interactions. In contrast, the “modulated” call type had the lowest amplitude, about −78 dB, indicating a more subdued acoustic signature.

3.2. Data Augmentation

Data augmentation is a method for growing the dataset to prevent overfitting during training, whereas data balancing can effectively increase the accuracy of model training [51]. Numerous experimental findings demonstrate that in deep learning-based classification tasks, we can obtain better-trained classifiers by using more samples from the dataset. Additionally, the data augmentation can successfully increase the model’s capacity for generalization in the scene [52]. The four calls made by the western black-crested gibbon account for the majority of the dataset imbalance in this study. The size of the original dataset containing the four gibbon calls is depicted in Table 1.
According to the data in the table, male gibbons dominate across the four sound types; however, there are fewer samples of the male gibbons' weakly modulated syllable calls, the female gibbons' agonistic calls, and the male gibbons' simple repetitive syllable calls. The sample imbalance in the dataset may therefore affect the automatic recognition algorithm's ability to identify particular patterns. To ensure that the number of samples in each category is relatively balanced, this paper chooses two deep learning methods and several commonly used methods for the data augmentation of the four gibbon calls [22,23,27,52,53,54,55]. Through experimental comparison, our goal is to determine the most effective technique for augmenting the gibbon call data. The specific data augmentation methods are shown in Table 2.
For the ten audio data augmentation methods in Table 2, the relevant transformation factors are randomly generated with a random number generator to ensure the diversity of the data transformations when expanding the original data with the seven transformation-based methods. To demonstrate the effectiveness of augmenting the dataset by adding noise, we explored the classification accuracy of the model under different noise levels to find the optimal range. We tested noise levels of 0 dB, −5 dB, −10 dB, −15 dB, −20 dB, −25 dB, −30 dB, −35 dB, −40 dB, and −45 dB; the specific experimental results are shown in Table 3 and Figure 4. According to these results, the upper noise limit for the model proposed in this paper is −20 dB. When using noise enhancement, we eventually increased the dataset to 2000 entries per class, choosing the noise enhancement schemes of 0 dB, −5 dB, −10 dB, and −15 dB. The details are shown in Figure 4.
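The noise levels above are expressed as signal-to-noise ratios in dB. A minimal sketch of mixing a clip with background noise at a chosen SNR is shown below; the function name is ours and the exact mixing procedure used for Table 3 is not specified in the paper.

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Mix `noise` into `signal` so the result has the requested SNR in dB.

    Illustrative sketch; the paper does not specify its exact mixing procedure.
    """
    noise = np.resize(noise, signal.shape)
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(p_signal / p_scaled_noise) == snr_db.
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10.0)))
    return signal + scale * noise

# Example: generate variants at the retained noise levels.
# augmented = [mix_at_snr(call, forest_noise, snr) for snr in (0, -5, -10, -15)]
```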

4. Methods

4.1. Overall Identification Model

Figure 5 shows the overall system model of the proposed method. The model consists of six main parts: (1) Data augmentation: because the dataset of the four gibbon call types is imbalanced, we expanded the data for the four calls; (2) Audio preprocessing: the Mel spectrogram of the input audio is extracted; (3) CNN: we use DenseNet as the backbone network to learn the features of the spectrogram of each input audio clip; (4) RNN: temporal features are learned from the image sequences using LSTM networks; (5) Attention: weights are assigned to each LSTM output, and the outputs of the LSTM network in (4) are passed to the temporal attention module; (6) Output: the output of the attention module in (5) is sent to the Softmax classifier, which produces the final classification results.
Hence, to efficiently identify extended sequences of vocalizations produced by western black-crested gibbons, we propose a hybrid neural network, DenseNet+Self-Attention-LSTM-Attention (SA_DenseNet-LSTM-Attention). We chose DenseNet [56] as the backbone because the dense convolutional network (DenseNet) can effectively address problems such as vanishing and exploding gradients as the number of network layers increases. With limited Mel spectrogram data, the DenseNet network obtains better training results than a plain CNN. We attach a fully connected LSTM (FC-LSTM) to the fully connected layer of DenseNet in order to effectively represent and learn the time-frequency properties of the input audio. More importantly, the combination with the LSTM network aids the overall network in extracting spatio-temporal features [57]. A temporal attention module is then used to find key time-frequency feature information in successive gibbon calls, as shown in Figure 5. In addition, we fused the channel, spatial, and time-frequency features extracted from the entire network, which effectively extracts gibbon call features of different durations and also reduces the dimensionality and weight of the final output of the attention module. Finally, a substantial number of ablation experiments and comparative tests were undertaken to assess the performance of our proposed model against several more sophisticated classical network models.
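To make the overall architecture concrete, the sketch below wires a torchvision DenseNet feature extractor to an LSTM with a simple temporal attention layer and a Softmax classifier, mirroring the pipeline of Figure 5 at a high level. It omits the per-block self-attention insertions and the PCA fusion step, and all layer sizes are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class GibbonCallNet(nn.Module):
    """High-level sketch of the DenseNet -> LSTM -> temporal-attention pipeline.

    Simplified: the per-block self-attention and PCA feature fusion described
    in the paper are omitted, and layer sizes are illustrative.
    """
    def __init__(self, n_classes=13, hidden=256):
        super().__init__()
        backbone = models.densenet121(weights=None)
        self.cnn = backbone.features                        # spectrogram -> feature map
        self.lstm = nn.LSTM(1024, hidden, batch_first=True)  # sequence model over time
        self.att = nn.Linear(hidden, 1)                      # temporal attention scores
        self.cls = nn.Linear(hidden, n_classes)

    def forward(self, x):                  # x: (batch, 3, n_mels, n_frames)
        fmap = self.cnn(x)                 # (batch, 1024, h, w)
        seq = fmap.mean(dim=2).permute(0, 2, 1)    # pool frequency axis -> (batch, w, 1024)
        out, _ = self.lstm(seq)            # (batch, w, hidden)
        weights = torch.softmax(self.att(out), dim=1)   # (batch, w, 1)
        context = (weights * out).sum(dim=1)            # attention-weighted sum
        return self.cls(context)           # logits over the 13 sound classes

logits = GibbonCallNet()(torch.randn(2, 3, 128, 256))  # -> torch.Size([2, 13])
```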
The problem of vanishing gradients arises as the number of network layers continues to grow. Much research has been conducted to solve this problem [45,46,47,58], and Huang et al. [56] proposed a new convolutional neural network (DenseNet) in 2017. The network builds on ResNet's concept of layer-to-layer connectivity of feature maps. In terms of alleviating vanishing gradients, requiring less computation, and using fewer parameters, this network is far superior to traditional convolutional neural networks. Because gradient vanishing is reduced, DenseNet avoids many redundant features and accelerates convergence. Dense connectivity, as illustrated in Figure 6, establishes a continuous connection between the inputs of a subsequent layer and the feature maps of earlier layers, thereby augmenting the information flow between the layers.
The dense network consists of dense blocks and transition layers. Because a channel cascade is used to connect the feature maps of the various DenseNet layers, the network may become over-parameterized, reducing its computational efficiency. To avoid this problem, the DenseNet authors added a bottleneck layer to the network, as shown in Figure 5. The bottleneck layer comprises a Rectified Linear Unit (ReLU), Convolution (3 × 3), and Batch Normalization (BN). By adding this layer, the number of input feature maps is effectively reduced, thus improving the computational efficiency of the model. Furthermore, the transition layer reduces both the number and the dimensions of the feature maps. The transition layer, depicted in Figure 5, is a distinct component linked behind the dense block; it comprises BN, ReLU, Convolution (1 × 1), and AvgPooling (2 × 2) as its primary elements.
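A minimal sketch of the two building blocks described above is given below, following the standard DenseNet-B formulation (BN, ReLU, 1 × 1 and 3 × 3 convolutions in the bottleneck; BN, ReLU, 1 × 1 convolution, and 2 × 2 average pooling in the transition layer); the growth rate and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """DenseNet-B bottleneck: BN-ReLU-Conv1x1 followed by BN-ReLU-Conv3x3.

    The 1x1 convolution first reduces the number of feature maps; the output
    is concatenated with the input (dense connectivity).
    """
    def __init__(self, in_ch, growth_rate=32):
        super().__init__()
        inner = 4 * growth_rate
        self.layers = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, inner, kernel_size=1, bias=False),
            nn.BatchNorm2d(inner), nn.ReLU(inplace=True),
            nn.Conv2d(inner, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        return torch.cat([x, self.layers(x)], dim=1)  # dense connection

class Transition(nn.Module):
    """Transition layer: BN-ReLU-Conv1x1 followed by 2x2 average pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.layers = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.AvgPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.layers(x)
```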

4.1.1. LSTM

Conventional recurrent neural networks (RNNs) demonstrate exceptional proficiency in sequential data processing. However, as the size of an RNN increases, some critical information may become inaccessible due to insufficient connectivity to all the nodes. Although RNNs can handle problems strongly correlated with time series, such as speech processing, they lose the long-range dependence carried by long-term information. LSTM is a special kind of RNN that can effectively solve these problems, which simply means that LSTM outperforms standard RNNs when confronted with longer sequences [59,60]. The specific LSTM structure is shown in Figure 7a. Compared with the RNN, the LSTM adds input gates, forget gates, output gates, and a hidden state; its memory cells store information for longer periods of time and selectively retain the network's error backpropagation parameters [61]. The relevant calculations are shown below.
$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$
$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)$$
$$\tilde{C}_t = \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right)$$
$$C_t = f_t \times C_{t-1} + i_t \times \tilde{C}_t$$
$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)$$
$$h_t = o_t \times \tanh\left(C_t\right)$$
In the above equations, $x_t$ denotes the input vector of the LSTM unit, $h_{t-1}$ denotes the previous hidden state vector, which can be regarded as the output vector of the previous LSTM unit, $W_f$ and $b_f$ denote the weight matrix and bias vector of the forget gate that are optimized during model training, $\sigma$ denotes the activation function, $C_t$ denotes the current cell state of the neuron, $C_{t-1}$ denotes the cell state of the previous step, and $\tilde{C}_t$ denotes the candidate cell input activation vector.
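As a direct transcription of the gate equations above, the sketch below implements a single LSTM step in NumPy; the dictionary-based parameter layout and the concatenation of $[h_{t-1}, x_t]$ follow the notation above and are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the gate equations above.

    W and b hold the parameters of the forget (f), input (i), candidate (c),
    and output (o) gates; each W[k] acts on the concatenation [h_{t-1}, x_t].
    """
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])          # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])          # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])      # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde          # new cell state
    o_t = sigmoid(W["o"] @ z + b["o"])          # output gate
    h_t = o_t * np.tanh(c_t)                    # new hidden state
    return h_t, c_t
```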

4.1.2. Self-Attention Module

This study adds a self-attention module to the DenseNet network structure to screen the channel features of the feature maps output from the first two layers. This enhances the expression of the important features of specific channels and focuses the network on its region of interest. Because the dataset used in this paper comes from the Ailao Mountains acoustic monitoring system, noise interference is inevitable.
As shown in the upper part of Figure 6, in the self-attention module the channel dimension of the input feature map is first compressed by a 1 × 1 convolution to reduce redundant information and improve computational speed. Then, the output feature maps of the F(X) branch are transposed and matrix-multiplied with the output feature maps of the g(X) branch to obtain the similarity between feature maps. Next, the similarity is normalized with the Softmax function to obtain the attention matrix. Finally, the feature maps of the h(X) branch are multiplied with the attention matrix to obtain the final feature maps, which realizes the reassignment of feature weights. Subsequently, a 1 × 1 convolution is applied to the output to keep the number of channels consistent with that of the input feature map.
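The description above corresponds to a standard self-attention block over a feature map; a minimal PyTorch sketch is shown below, where the F, g, and h branches are 1 × 1 convolutions, their similarity matrix is normalized with Softmax, and a final 1 × 1 convolution restores the channel count. The channel-reduction ratio is an illustrative assumption.

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """Self-attention over a feature map, following the F/g/h-branch description.

    The channel reduction (//8) is an illustrative choice, not necessarily the
    ratio used by the authors.
    """
    def __init__(self, channels):
        super().__init__()
        inner = max(channels // 8, 1)
        self.f = nn.Conv2d(channels, inner, 1)       # "query" branch F(X)
        self.g = nn.Conv2d(channels, inner, 1)       # "key" branch g(X)
        self.h = nn.Conv2d(channels, channels, 1)    # "value" branch h(X)
        self.out = nn.Conv2d(channels, channels, 1)  # restore channel layout

    def forward(self, x):
        b, c, hgt, wdt = x.shape
        n = hgt * wdt
        q = self.f(x).view(b, -1, n)                          # (b, c', n)
        k = self.g(x).view(b, -1, n)                          # (b, c', n)
        v = self.h(x).view(b, c, n)                           # (b, c, n)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)   # (b, n, n) similarities
        y = (v @ attn.transpose(1, 2)).view(b, c, hgt, wdt)   # reweighted features
        return self.out(y)                                    # 1x1 conv keeps channels
```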

4.1.3. Time Attention Mechanism

Since the temporal attention mechanism can effectively extract temporal features of different lengths, this paper also incorporates a temporal attention mechanism in the LSTM part of the gibbon recognition network. The specific structure is shown in Figure 7b, where $x_1, \ldots, x_n$ are the inputs to the gated recurrent units of the LSTM network and $H_1, \ldots, H_n$ represent the outputs of these units. These outputs are then fed into the temporal attention mechanism, and the output of the hidden layer is computed as follows:
$$u_t = \tanh\left(W_w H_t + b\right),$$
where $\tanh$ is the activation function and $W_w$ denotes the weight matrix. The attention weight $a_t$ assigned to each LSTM output is
$$a_t = \mathrm{softmax}\left(u_t^{\top} u_w\right),$$
where $u_w$ is a trainable context weight vector. The final result of the attention mechanism layer, $v_t$, is expressed as
$$v_t = a_t H_t.$$
As a result, the network can now provide more focus to the important data that is present in the audio feature sequence. This data can be given more weight, which will increase the accuracy and speed of the identification process.
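A minimal PyTorch sketch of this temporal attention layer over the LSTM outputs $H_1, \ldots, H_n$ is given below; it follows the three equations above, with the learnable context vector and the hidden dimension as illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Temporal attention over LSTM outputs, following u_t, a_t, and v_t above."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)           # W_w and bias b
        self.context = nn.Parameter(torch.randn(hidden_dim))    # u_w (assumed learnable)

    def forward(self, H):                     # H: (batch, n_steps, hidden_dim)
        u = torch.tanh(self.proj(H))          # u_t = tanh(W_w H_t + b)
        scores = u @ self.context             # u_t^T u_w -> (batch, n_steps)
        a = torch.softmax(scores, dim=1).unsqueeze(-1)  # attention weights a_t
        return (a * H).sum(dim=1)             # aggregated v_t = a_t H_t over time
```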

4.2. Model Evaluation and Dataset Segmentation

The performance of the network model proposed in this paper was evaluated using accuracy (the percentage of correct predictions), precision (the proportion of samples predicted as positive that are actually positive), recall (the proportion of actually positive samples that are predicted as positive), and F1-score (which combines the precision and recall of the classification model). True positives (TP) denote the number of positive samples correctly classified as positive. True negatives (TN) denote the number of negative samples correctly identified as negative by the trained classification model. False positives (FP) denote the number of samples predicted as positive by the trained classification model although the actual data are negative. False negatives (FN) denote the number of samples predicted as negative although the actual data are positive. The specific formulas are shown below:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
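These four metrics can be computed directly with scikit-learn; the short sketch below uses macro averaging across the 13 classes, which is an assumption since the paper does not state the averaging mode.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 (macro-averaged over classes)."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}

# Example with dummy labels for a 3-class problem:
print(evaluate([0, 1, 2, 2, 1], [0, 1, 2, 1, 1]))
```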
Based on the recording dates, we divide the dataset into three parts: 60% of the dataset is used for training, 20% for testing, and the remaining 20% for validation. The training set consists of two parts, the unaugmented dataset and the dataset augmented by WaveGAN, as shown in Table 3; the test set also has two parts, one drawn from the unaugmented data in the training pool and the other from new recordings not involved in training (and not augmented); the validation set is independent of both the training and test sets and contains no data involved in model training and no augmented data.

5. Results

5.1. Experimental Environment Settings

All our network comparisons were carried out in the same experimental environment: Ubuntu 20.04, an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10 GHz, and an NVIDIA Tesla P100 GPU (24 GB memory). Python 3.9 and the PyTorch 1.12.1 deep learning framework were used for model training. The parameter settings for all networks are as follows: the learning rate is set to 0.001, the batch size is 32, and the optimizer is Adam. The dataset settings are identical across experiments, with 60% of the dataset used for training, 20% for testing, and 20% for validation. There are 13 classification categories in this article, with 2000 audio samples per category, for a total of 26,000 labeled samples with a size of about 8 GB. Detailed dataset information is shown in Table 4.
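For reproducibility, the sketch below shows a conventional PyTorch training loop using the stated hyperparameters (Adam, learning rate 0.001, batch size 32); the dataset object, epoch count, and device are placeholders rather than the authors' exact training script.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, train_set, epochs=100, device="cuda"):
    """Conventional training loop with the hyperparameters reported in the paper.

    `train_set` is any Dataset yielding (mel_spectrogram, label); the epoch
    count and device are placeholders.
    """
    loader = DataLoader(train_set, batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss()
    model.to(device).train()
    for epoch in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
    return model
```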

5.2. Classification Results before Data Augmentation

In this paper, we first compared the models without data expansion, using the same number of iterations and the same dataset. We compared the recognition accuracy for the western black-crested gibbon's calls using the DenseNet model proposed by Huang et al. [56], the VGG16 model proposed by Simonyan et al. [62], the Xception model proposed by Chollet et al. [63], and the MobileNet model proposed by Howard et al. [64]. In the end, the DenseNet+Self-Attention (SA_DenseNet) model obtained 88.1% accuracy, the VGG16 model 86.8%, the Xception model 85.6%, and the MobileNet model 58.9%. The recognition accuracy of the SA_DenseNet model used in this paper thus outperforms the other three models: by 29.2% over the MobileNet model, 1.3% over the VGG16 model, and 2.5% over the Xception model. The details are shown in Table 5 and Figure 8a.

5.3. Classification Results after Data Augmentation

In this study, we investigated 10 distinct data augmentation techniques and then classified and compared the datasets augmented by each technique. We chose the SA_DenseNet, VGG16, Xception, and MobileNet models for comparison; the comparison ultimately showed that the WaveGAN-augmented dataset gave the models the highest accuracy, followed by Fre-GAN. Traditional dataset augmentation methods (Speed Tuning, Translation of Sample Rate, and Volume Tuning) all improved model accuracy, but not by much, and some data augmentation methods even degraded the accuracy of some models (Time Stretching and Pitch Shifting Augmentation). The specific experimental comparison results are shown in Figure 9.
We then compared the accuracy, precision, recall, and F1-score of the four models after WaveGAN data augmentation to show the efficacy of the proposed augmentation. The experiments show that the accuracies of the four models improve after data augmentation: the DenseNet model improves by 7.8%, the VGG16 model by 7.1%, the Xception model by 8.2%, and the MobileNet model by 7.0%, as shown in Table 6 and Figure 8b. The model chosen in this paper, the DenseNet model, is better and more stable than the other models.

5.4. Classification Results after Adding LSTM-Attention

In the FC-LSTM network module, which was built upon the DenseNet backbone network, we also integrated the temporal attention mechanism in order to improve the detection of the four distinct call types produced by the western black-crested gibbon. To ensure a fair comparison, we compared each network using datasets that had been enhanced with the same data augmentation technique. At the same time, we implemented the FC-LSTM module with the attention mechanism in the remaining three networks before comparing their accuracy. The results show that the DenseNet+Self-Attention-LSTM-Attention (SA_DenseNet-LSTM-Attention) model's accuracy increases by 2.3%, while the VGG16-LSTM-Attention model's accuracy increases by 2.0%, the Xception-LSTM-Attention model's by 1.8%, and the MobileNet-LSTM-Attention model's by 0.5%. The experimental results demonstrate that our proposed SA_DenseNet-LSTM-Attention architecture achieves the highest accuracy of 98.2%. As shown in Figure 8c,d and Table 7, the classification method (SA_DenseNet-LSTM-Attention) introduced in this article outperforms the alternative models in both recognition rate and convergence speed.
The model proposed in this paper has a training time of 8 h and an inference time of 0.018 s/sample, which is significantly better than other benchmark models. The training time of the comparison models is 10 h (VGG16-LSTM-Attention), 12 h (Xception-LSTM-Attention), and 9 h (MobileNet-LSTM-Attention), respectively; and the inference time is 0.025 s/sample, 0.022 s/sample, and 0.020 s/sample. This is shown in Table 8.
In Figure 10a, we show a comparison of the training, testing, and validation accuracies of our proposed model (SA_DenseNet-LSTM-Attention) over time. It is evident that our model reaches its peak performance after the 100th epoch on both the test and validation data. Furthermore, our model's accuracy on the training, test, and validation sets rapidly approaches 1, suggesting that it has high generalization capability. In real-world training and validation, our model demonstrates robust expressive and learning capabilities after the input is augmented by the WaveGAN network and a powerful attention mechanism is integrated. To prove the effectiveness of our proposed model, we compared all the trained models; the specific accuracy comparison is shown in Figure 10b. From the figure, it can be seen that our model achieves higher accuracy on the validation set than the other three models, which also shows that the performance and generalization ability of our proposed model are higher than those of the other models.

5.5. Different Calling Recognition Results

To test the effectiveness of our trained model (SA_DenseNet-LSTM-Attention), we selected one month (April) of acoustic monitoring data from 2021, segmented by Zhou et al. [34], for testing, and we counted the calls of the western black-crested gibbon and several bird species with a high number of calls. Figure 11a shows the number of days monitored and the number of calls made by the western black-crested gibbon and six different bird species in April. During testing, we found that the model could effectively recognize most of the calls; most of the errors were concentrated in cases where multiple calls were mixed, in which case there is no guarantee that our model can effectively recognize all call types. As can be seen in Figure 11, most of the calls of these species are concentrated before 12 o'clock. Some birds (Pomatorhinus ruficollis) had a wider range of calling periods, extending from the morning through the afternoon.
In this study, we verified the recognition of 13 different sound types, including four gibbon sounds and nine other sound types. Through the training and testing of the SA_DenseNet-LSTM-Attention model, we obtained satisfactory classification results. Specifically, the classification accuracies of all sound types are above 98.0%, and the precision, recall and F1-scores are above 96.0%. These results validate the excellent classification performance of our proposed model on a wide range of sound types, further demonstrating its potential in practical applications. This is shown in Table 9.

6. Discussion

In this paper, we present a new recognition network that uses PCA-based feature fusion to fuse the features extracted by the LSTM network with a temporal attention mechanism and those extracted by the DenseNet network. While the LSTM network can extract temporal information from a time series, the CNN, despite its superior classification capabilities, focuses more on computing spatial structure. To validate the performance of our proposed model, we compared three different models, VGG16, MobileNet, and Xception, and our proposed model (SA_DenseNet-LSTM-Attention) obtained the highest accuracy (Figure 8 and Table 7). The classification performance of VGG16, MobileNet, and Xception also improved after adding the LSTM network with the temporal attention mechanism (Table 7); this result further illustrates that the LSTM network can effectively extract sequence features from the audio. Our experiments demonstrate that our proposed SA_DenseNet-LSTM-Attention model can effectively recognize gibbon calls, which greatly reduces the time required for manual recognition, and also proves that passive acoustic monitoring combined with deep learning will be an effective tool for monitoring the western black-crested gibbon population.
Currently, the acoustic monitoring system built by project team member Zhong et al. [7] in the Ailao Mountains allows 24-h monitoring of western black-crested gibbon groups. Long-term acoustic monitoring yields hundreds of thousands of hours of recordings annually, and manually labeling these recordings is often not feasible. Our results suggest that deep learning combined with passive acoustic monitoring could be a useful approach for subsequent species-specific identification. This can reduce costs, save time and labor, and streamline monitoring processes. Our method works well for call recognition in western black-crested gibbons and may be easily extended to other vocal species. Accurately identifying individual vocalizing species is important for ecological sustainability.
Although our model, with a 98.2% recognition rate, is not sufficient to achieve the full recognition of western black-crested gibbon calls, the model can recognize most of the gibbon calls in the recorded files, greatly improving the efficiency of manual annotation and manual recognition of these data. With or without data augmentation, our model outperforms the other three models in terms of accuracy and generalization performance (shown in Figure 8). This is mostly due to two factors: first, our network has a greater depth at which the gradient vanishing problem can be solved effectively; second, our network has fewer parameters yet performs better. The only negative aspect is how memory-intensive model training is. The precision of upcoming models will be enhanced by expanding the sample size of labeled data.
To achieve high recognition accuracy, our model requires a large labeled dataset; imbalances in the dataset can restrict the model’s learning ability. Therefore, expanding the dataset is a common deep learning training practice known as “data augmentation”. For example, Lasseck et al. implemented data augmentation by adding noise, filtering, or mixing audio clips [54]. This method has been proven to effectively improve the model’s classification accuracy. Our study evaluates the impact of various speech data augmentation methods on model classification performance, finding that most traditional speech data augmentation methods do not improve and may even decrease model performance. Compared to traditional methods, datasets obtained using deep learning approaches can significantly enhance each model’s classification performance (see Figure 9). WaveGAN employs Generative Adversarial Networks (GANs) to generate audio samples, which are more realistic and of higher quality, closer to the quality of the original audio. WaveGAN is capable of generating diverse audio samples that simulate different noises, speaking styles, and speech speeds, thus providing richer data augmentation and improving the model’s generalization ability. WaveGAN can be tailored to various speech recognition or audio processing tasks by adjusting the training approach, thereby offering broad application potential [71,72].
We have incorporated knowledge from several important studies that demonstrate the complexity and diversity of animal communication into our classification model for vocalizations. Following the guidance of Payne and McVay [73], our model accommodates the repetitive patterns found in calls, which are characteristic of structured vocal sequences. Additionally, Whaling et al. [74] provide a foundation for understanding the frequent repetitions in these sequences, which our model seeks to identify and classify. The distinctions between simple calls and more complex composites, as discussed by Behr and von Helversen [75], and Bohn et al. [76], have been critical in refining our approach to differentiate between monosyllabic calls and more structured composites. These studies collectively form the theoretical backbone of our research, supporting our methodology and ensuring that our classification scheme aligns with established understandings in the field of bioacoustics. Both the deep learning methods Fre-GAN and WaveGAN used in this paper can effectively improve the classification accuracy of the model, but compared to the Fre-GAN model, the WaveGAN model can achieve a better classification performance, and Fre-GAN needs more training datasets to achieve good performance. Previous studies have also demonstrated that WaveGAN can effectively improve model classification accuracy [77,78,79].
To enhance the classifier's performance in this study, we use data augmentation approaches. Nevertheless, data augmentation has certain limitations. First, the synthetic variants introduced by augmentation might not accurately represent the intricacy of gibbon calls heard in nature; as a result, the classifier's performance in real applications may not match its performance on the augmented dataset. In addition, the effectiveness of data augmentation relies on the choice of method and parameter settings, which, if not properly selected, may introduce noise or irrelevant information and thus affect the accuracy of the classifier. To evaluate the performance of the classifier in real-world environments, we tested it on one month of independent, non-augmented data (the validation set). The test results show that the classifier's performance on non-augmented data provides a more realistic evaluation of its effectiveness in real-world passive acoustic monitoring tasks. By analyzing the data over this period, we obtain a more accurate picture of the classifier's performance in practical applications, including its precision, recall, and overall robustness. The test results show that the model in this paper outperforms all other models on the validation set (shown in Table 6).
In sound classification tasks, deep convolutional neural networks outperform shallow convolutional neural networks, according to research findings [80]. Nevertheless, as the depth of the network further increases, it encounters the challenge of vanishing gradients, and this degradation does not occur because of overfitting: higher training errors occur if more layers are added to a network model of appropriate depth [81]. To solve this degradation problem, Huang et al. [56] proposed the densely connected convolutional network (DenseNet). The DenseNet network is the main recognition network in this manuscript because of its excellent performance. Furthermore, we add an LSTM network with a temporal attention mechanism on top of the DenseNet network. As a recurrent neural network (RNN) variant, the long short-term memory (LSTM) model performs better than RNNs in mitigating the gradient vanishing and explosion problems, and LSTM networks show very good performance in sound classification tasks [82,83]. Longer call sequences can be recognized more effectively by the LSTM network because it addresses the issue of long-term dependence that RNNs face when dealing with lengthy sequence data. In addition, the temporal attention mechanism allows the LSTM network to dynamically devote attention to information at different time steps in the sequence; as a result, the network can concentrate more on time steps that are significant and less on time steps that are secondary or unimportant to the current task. Since the sound data used in this article have variable durations, an LSTM module with an attention mechanism is included to help the network generalize to long unseen sequences and to flexibly handle input sequences of varying lengths. Incorporating a temporal attention mechanism also improves the retention of crucial information and raises attention to crucial time steps, and it allows the network to selectively focus on time steps that are pertinent to the task, reducing computational complexity.
Our model can classify new sound recordings with high accuracy. The predictions are almost perfect when the data are augmented with WaveGAN and when the environmental conditions are similar to those used to train the model. On our test dataset, the SA_DenseNet-LSTM-Attention model proposed in this paper detected all gibbon calls in the test data at the cost of ten false positives. The primary cause of the false positives was that certain bird calls were identified as weakly modulated calls. Additionally, the majority of errors on bird calls occurred when multiple sounds were mixed.

7. Conclusions

In this paper, a new deep learning hybrid model (SA_DenseNet-LSTM-Attention) is proposed for recognizing the calls of western black-crested gibbons. This paper tries 10 different data augmentation methods to address the lack of a dataset. Following an experimental comparison with the other methods, the WaveGAN speech data augmentation method achieves the highest accuracy for all models in this article. Furthermore, our tests demonstrate that each model's accuracy increases with data augmentation. To increase the recognition rate of the model for various sound types, we fused the spatial and channel features extracted from the DenseNet network with the temporal features extracted from the LSTM network using the PCA method, and the LSTM network, which is based on a temporal attention mechanism, was added to the DenseNet network. Finally, the experimental results demonstrate that our proposed SA_DenseNet-LSTM-Attention recognition model achieves an accuracy of up to 98.2% in recognizing 13 different sound types, surpassing all other network models in terms of both accuracy and generalization performance.

Author Contributions

Conceptualization, X.Z. and K.H.; methodology, Z.G. and L.W.; software, X.Z. and N.W.; validation, X.Z. and K.H.; formal analysis, X.Z. and Z.G.; investigation, X.Z.; resources, K.H.; data curation, K.H.; writing—original draft preparation, X.Z.; writing—review and editing, K.H. and X.Z.; visualization, X.Z.; supervision, C.Y., Q.L., L.Y. and R.H.; project administration, K.H.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

We are grateful to the Chuxiong Management and Protection Branch of the Ailao Mountains National Nature Reserve in Yunnan Province and the builders of the passive acoustic monitoring system in the early stages of this project. We thank the Major Science and Technology Project of Yunnan Province (202202AD080010) for support. We thank the National Natural Science Foundation of China for grant Nos. 32160369 and 31860182.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Guan, Z.; Yan, L.; Huang, B. Analysis of the current status of gibbon family population monitoring in China. Sichuan Anim. 2017, 36, 7. [Google Scholar]
  2. Fan, P.; Jiang, X.; Liu, C.; Luo, W. Sonogram structure and timing of duets of western black crested gibbon in Wuliang Mountain. Zool. Res. 2010, 31, 293–302. [Google Scholar]
  3. Brockelman, W.; Srikosamatara, S. Estimation of density of gibbon groups by use of loud songs. Am. J. Primatol. 1993, 29, 93–108. [Google Scholar] [CrossRef]
  4. Jiang, X.L.; Luo, Z.; Zhao, S.; Li, R.; Liu, C. Status and distribution pattern of black crested gibbon (Nomascus concolor jingdongensis) in Wuliang Mountains, Yunnan, China: Implication for conservation. Primates J. Primatol. 2006, 47, 264–271. [Google Scholar] [CrossRef] [PubMed]
  5. Dat, L.T.; Phong, L.M. 2010 Census of Western Black Crested Gibbon Nomascus Concolor in mu Cang Chai Species/Habitat Conservation Area (Yen Bai Province) and Adjacent Forests in Muong la District (Son la Province); Fauna & Flora International Vietnam Programme: Hanoi, Vietnam, 2010. [Google Scholar]
  6. Li, X.; Zhong, E.; Cui, C.; Zhou, J.; Li, X.; Guan, Z. Monitoring the calling behavior of the western Yunnan subspecies of the western black crested gibbon (Hylobatidae). J. Guangxi Norm. Univ. Nat. Sci. Ed. 2021, 39, 29–37. [Google Scholar]
  7. Zhong, E.; Guan, Z.; Zhou, X.; Zhao, Y.; Hu, K. Application of passive acoustic monitoring techniques to the monitoring of the western black-crested gibbon. Biodiversity 2021, 29, 9. [Google Scholar]
  8. LeCun, Y.; Boser, B.E.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.E.; Jackel, L.D. Handwritten Digit Recognition with a Back-Propagation Network. In Proceedings of the Neural Information Processing Systems (NIPS), Denver, CO, USA, 27–30 November 1989. [Google Scholar]
  9. Haykin, S.; Kosko, B. Gradient-Based Learning Applied to Document Recognition. In Intelligent Signal Processing; Wiley-IEEE Press: Hoboken, NJ, USA, 2001; pp. 306–351. [Google Scholar] [CrossRef]
  10. Fan, J.; Liu, X.; Wang, X.; Deyi, W.; Han, M. Multi-Background Island Bird Detection Based on Faster R-CNN. Cybern. Syst. 2021, 52, 26–35. [Google Scholar] [CrossRef]
  11. Grill, T.; Schlüter, J. Two convolutional neural networks for bird detection in audio signals. In Proceedings of the 2017 25th European Signal Processing Conference (EUSIPCO), Kos, Greece, 28 August–2 September 2017; pp. 1764–1768. [Google Scholar] [CrossRef]
  12. Stowell, D.; Wood, M.; Pamuła, H.; Stylianou, Y.; Glotin, H. Automatic acoustic detection of birds through deep learning: The first Bird Audio Detection challenge. Methods Ecol. Evol. 2018, 10, 368–380. [Google Scholar] [CrossRef]
  13. Dufourq, E.; Durbach, I.N.; Hansford, J.P.; Hoepfner, A.; Ma, H.; Bryant, J.V.; Stender, C.S.; Li, W.; Liu, Z.; Chen, Q.; et al. Automated detection of Hainan gibbon calls for passive acoustic monitoring. Remote. Sens. Ecol. Conserv. 2020, 7, 475–487. [Google Scholar] [CrossRef]
  14. Ruan, W.; Wu, K.; Chen, Q.; Zhang, C. ResNet-based bio-acoustics presence detection technology of Hainan gibbon calls. Appl. Acoust. 2022, 198, 108939. [Google Scholar] [CrossRef]
  15. Jiang, J.; Bu, L.; Duan, F.; Wang, X.; Liu, W.; Sun, Z.; Li, C. Whistle detection and classification for whales based on convolutional neural networks. Appl. Acoust. 2019, 150, 169–178. [Google Scholar] [CrossRef]
  16. Bergler, C.; Schröter, H.; Cheng, R.X.; Barth, V.; Weber, M.; Noeth, E.; Hofer, H.; Maier, A. ORCA-SPOT: An Automatic Killer Whale Sound Detection Toolkit Using Deep Learning. Sci. Rep. 2019, 9, 10997. [Google Scholar] [CrossRef] [PubMed]
  17. Bermant, P.C.; Bronstein, M.M.; Wood, R.J.; Gero, S.; Gruber, D.F. Deep Machine Learning Techniques for the Detection and Classification of Sperm Whale Bioacoustics. Sci. Rep. 2019, 9, 12588. [Google Scholar] [CrossRef]
  18. Moon, J.; Jung, S.; Park, S.; Hwang, E. Conditional Tabular GAN-Based Two-Stage Data Generation Scheme for Short-Term Load Forecasting. IEEE Access 2020, 8, 205327–205339. [Google Scholar] [CrossRef]
  19. Nanni, L.; Maguolo, G.; Paci, M. Data augmentation approaches for improving animal audio classification. arXiv 2019, arXiv:1912.07756. [Google Scholar] [CrossRef]
  20. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Volume 25. [Google Scholar]
  21. McFee, B.; Humphrey, E.J.; Bello, J.P. A Software Framework for Musical Data Augmentation. In Proceedings of the International Society for Music Information Retrieval Conference, Málaga, Spain, 26–30 October 2015. [Google Scholar]
  22. Salamon, J.; Bello, J.P. Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification. IEEE Signal Process. Lett. 2017, 24, 279–283. [Google Scholar] [CrossRef]
  23. Davis, N.; Suresh, K. Environmental Sound Classification Using Deep Convolutional Neural Networks and Data Augmentation. In Proceedings of the 2018 IEEE Recent Advances in Intelligent Computational Systems (RAICS), Trivandrum, Kerala, India, 6–8 December 2018; pp. 41–45. [Google Scholar] [CrossRef]
  24. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv 2014, arXiv:1406.2661. [Google Scholar] [CrossRef]
  25. Shorten, C.; Khoshgoftaar, T.M. A survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 1–48. [Google Scholar] [CrossRef]
  26. Pascual, S.; Bonafonte, A.; Serrà, J. SEGAN: Speech Enhancement Generative Adversarial Network. arXiv 2017, arXiv:1703.09452. [Google Scholar]
  27. Donahue, C.; McAuley, J.; Puckette, M. Adversarial Audio Synthesis. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  28. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  29. Petmezas, G.; Cheimariotis, G.A.; Stefanopoulos, L.; Rocha, B.M.M.; Paiva, R.P.; Katsaggelos, A.K.; Maglaveras, N. Automated Lung Sound Classification Using a Hybrid CNN-LSTM Network and Focal Loss Function. Sensors 2022, 22, 1232. [Google Scholar] [CrossRef]
  30. Atila, O.; Şengür, A. Attention guided 3D CNN-LSTM model for accurate speech based emotion recognition. Appl. Acoust. 2021, 182, 108260. [Google Scholar] [CrossRef]
  31. Alsayadi, H.A.; Abdelhamid, A.A.; Hegazy, I.; Fayed, Z.T. Non-diacritized Arabic speech recognition based on CNN-LSTM and attention-based models. J. Intell. Fuzzy Syst. 2021, 41, 6207–6219. [Google Scholar] [CrossRef]
  32. Zhang, Z.; Lv, Z.; Gan, C.; Zhu, Q. Human action recognition using convolutional LSTM and fully-connected LSTM with different attentions. Neurocomputing 2020, 410, 304–316. [Google Scholar] [CrossRef]
  33. Liu, J.; Wang, G.; Hu, P.; Duan, L.Y.; Kot, A.C. Global Context-Aware Attention LSTM Networks for 3D Action Recognition. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3671–3680. [Google Scholar] [CrossRef]
  34. Zhou, X.; Hu, K.; Guan, Z.; Yu, C.; Wang, S.; Fan, M.; Sun, Y.; Cao, Y.; Wang, Y.; Miao, G. Methods for processing and analyzing passive acoustic monitoring data: An example of song recognition in western black-crested gibbons. Ecol. Indic. 2023, 155, 110908. [Google Scholar] [CrossRef]
  35. Zottesso, R.H.D.; Costa, Y.M.G.; Bertolini, D.; Oliveira, L. Bird species identification using spectrogram and dissimilarity approach. Ecol. Inform. 2018, 48, 187–197. [Google Scholar] [CrossRef]
  36. Pahuja, R.; Kumar, A. Sound-spectrogram based automatic bird species recognition using MLP classifier. Appl. Acoust. 2021, 180, 108077. [Google Scholar] [CrossRef]
  37. Geng, Y. Design of English teaching speech recognition system based on LSTM network and feature extraction. Soft Comput. 2023, 1–11. [Google Scholar] [CrossRef]
  38. Ahmed, M.R.; Islam, S.; Islam, A.K.M.M.; Shatabda, S. An Ensemble 1D-CNN-LSTM-GRU Model with Data Augmentation for Speech Emotion Recognition. arXiv 2021, arXiv:2112.05666. [Google Scholar]
  39. Abdelhamid, A.A.; El-Kenawy, E.S.M.; Alotaibi, B.; Amer, G.M.; Abdelkader, M.Y.; Ibrahim, A.; Eid, M.M. Robust Speech Emotion Recognition Using CNN+LSTM Based on Stochastic Fractal Search Optimization Algorithm. IEEE Access 2022, 10, 49265–49284. [Google Scholar] [CrossRef]
  40. El-Moneim, S.; Nassar, M.; Dessouky, M.; Ismail, N.; El-Fishawy, A.; Abd El-Samie, F. Text-independent speaker recognition using LSTM-RNN and speech enhancement. Multimed. Tools Appl. 2020, 79, 24013–24028. [Google Scholar] [CrossRef]
  41. Yi, J.; Ni, H.; Wen, Z.; Liu, B.; Tao, J. CTC regularized model adaptation for improving LSTM RNN based multi-accent Mandarin speech recognition. In Proceedings of the 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), Tianjin, China, 17–20 October 2016; pp. 1–5. [Google Scholar] [CrossRef]
  42. Tang, Y.; Hu, Y.; He, L.; Huang, H. A bimodal network based on Audio-Text-Interactional-Attention with ArcFace loss for speech emotion recognition. Speech Commun. 2022, 143, 21–32. [Google Scholar] [CrossRef]
  43. Yu, Y.; Kim, Y. Attention-LSTM-Attention Model for Speech Emotion Recognition and Analysis of IEMOCAP Database. Electronics 2020, 9, 713. [Google Scholar] [CrossRef]
  44. Hu, Z.; Linghu, K.; Yu, H.; Liao, C. Speech Emotion Recognition Based on Attention MCNN Combined With Gender Information. IEEE Access 2023, 11, 50285–50294. [Google Scholar] [CrossRef]
  45. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  46. Larsson, G.; Maire, M.; Shakhnarovich, G. FractalNet: Ultra-Deep Neural Networks without Residuals. arXiv 2016, arXiv:1605.07648. [Google Scholar]
  47. Srivastava, R.K.; Greff, K.; Schmidhuber, J. Training Very Deep Networks. In Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
  48. Zhou, X.; Guan, Z.; Zhong, E.; Dong, Y.; Li, H.; Hu, K. Automated Monitoring of Western Black Crested Gibbon Population Based on Voice Characteristics. In Proceedings of the 2019 IEEE 5th International Conference on Computer and Communications (ICCC), Chengdu, China, 6–9 December 2019; pp. 1383–1387. [Google Scholar]
  49. Zhou, X.; Hu, K.; Guan, Z. Environmental sound classification of western black-crowned gibbon habitat based on spectral subtraction and VGG16. In Proceedings of the 2022 IEEE 5th Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), Chongqing, China, 16–18 December 2022; Volume 5, pp. 578–582. [Google Scholar] [CrossRef]
  50. Fan, P.; Jiang, X.; Liu, C.; Luo, W. The Acoustic Structure and Time Characteristics of Wuliangshan West black crested gibbon Duet. Zool. Res. 2010, 31, 10. [Google Scholar]
  51. Kong, Q.; Cao, Y.; Iqbal, T.; Wang, Y.; Wang, W.; Plumbley, M.D. PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2880–2894. [Google Scholar] [CrossRef]
  52. Stowell, D.; Petruskova, T.; Linhart, P. Automatic acoustic identification of individuals in multiple species: Improving identification across recording conditions. J. R. Soc. Interface 2019, 16, 20180940. [Google Scholar] [CrossRef]
  53. Bahmei, B.; Birmingham, E.; Arzanpour, S. CNN-RNN and Data Augmentation Using Deep Convolutional Generative Adversarial Network for Environmental Sound Classification. IEEE Signal Process. Lett. 2022, 29, 682–686. [Google Scholar] [CrossRef]
  54. Lasseck, M. Audio-based Bird Species Identification with Deep Convolutional Neural Networks. In Proceedings of the Conference and Labs of the Evaluation Forum, Avignon, France, 10–14 September 2018. [Google Scholar]
  55. Kim, J.H.; Lee, S.H.; Lee, J.H.; Lee, S.W. Fre-GAN: Adversarial Frequency-consistent Audio Synthesis. arXiv 2021, arXiv:2106.02297. [Google Scholar]
  56. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar] [CrossRef]
  57. Ng, J.Y.H.; Hausknecht, M.J.; Vijayanarasimhan, S.; Vinyals, O.; Monga, R.; Toderici, G. Beyond short snippets: Deep networks for video classification. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4694–4702. [Google Scholar]
  58. Huang, G.; Sun, Y.; Liu, Z.; Sedra, D.; Weinberger, K. Deep Networks with Stochastic Depth. arXiv 2016, arXiv:1603.09382. [Google Scholar]
  59. Simpson, T.; Dervilis, N.; Chatzi, E.N. Machine Learning Approach to Model Order Reduction of Nonlinear Systems via Autoencoder and LSTM Networks. arXiv 2021, arXiv:2109.11213. [Google Scholar] [CrossRef]
  60. Burgess, J.; O’Kane, P.; Sezer, S.; Carlin, D. LSTM RNN: Detecting exploit kits using redirection chain sequences. Cybersecurity 2021, 4, 1–15. [Google Scholar] [CrossRef]
  61. Zhao, S.; Dong, X. A study on speech recognition based on improved LSTM deep neural network. J. Zhengzhou Univ. Eng. Ed. 2018, 39, 5. [Google Scholar]
  62. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  63. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [Google Scholar] [CrossRef]
  64. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  65. Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. arXiv 2020, arXiv:2005.07143. [Google Scholar]
  66. Martinez, A.M.C.; Spille, C.; Rossbach, J.I.; Kollmeier, B.; Meyer, B.T. Prediction of speech intelligibility with DNN-based performance measures. arXiv 2021, arXiv:2203.09148. [Google Scholar]
  67. Gao, S.; Cheng, M.M.; Zhao, K.; Zhang, X.; Yang, M.H.; Torr, P.H.S. Res2Net: A New Multi-Scale Backbone Architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662. [Google Scholar] [CrossRef]
  68. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  69. Wang, H.; Zheng, S.; Chen, Y.; Cheng, L.; Chen, Q. CAM++: A Fast and Efficient Network for Speaker Verification Using Context-Aware Masking. In Proceedings of the Interspeech, Dublin, Ireland, 20–24 August 2023. [Google Scholar]
  70. Chen, Y.; Zheng, S.; Wang, H.; Cheng, L.; Chen, Q.; Qi, J. An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification. arXiv 2023, arXiv:2305.12838. [Google Scholar]
  71. Yang, M.; Wang, Z.; Chi, Z.; Feng, W. WaveGAN: Frequency-aware GAN for High-Fidelity Few-shot Image Generation. arXiv 2022, arXiv:2207.07288. [Google Scholar]
  72. Yamamoto, R.; Song, E.; Kim, J.M. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. arXiv 2020, arXiv:1910.11480. [Google Scholar]
  73. Payne, R.; McVay, S. Songs of Humpback Whales. Science 1971, 173, 585–597. [Google Scholar] [CrossRef]
  74. Whaling, C.S.; Solis, M.M.; Doupe, A.J.; Soha, J.A.; Marler, P.R. Acoustic and neural bases for innate recognition of song. Proc. Natl. Acad. Sci. USA 1997, 94, 12694–12698. [Google Scholar] [CrossRef]
  75. Behr, O.; von Helversen, O. Bat serenades—Complex courtship songs of the sac-winged bat (Saccopteryx bilineata). Behav. Ecol. Sociobiol. 2004, 56, 106–115. [Google Scholar] [CrossRef]
  76. Bohn, K.M.; Schmidt-French, B.A.; Schwartz, C.; Smotherman, M.S.; Pollak, G.D. Versatility and Stereotypy of Free-Tailed Bat Songs. PLoS ONE 2009, 4, e6746. [Google Scholar] [CrossRef]
  77. Madhu, A.; Kumaraswamy, S. Data Augmentation Using Generative Adversarial Network for Environmental Sound Classification. In Proceedings of the 2019 27th European Signal Processing Conference (EUSIPCO), A Coruña, Spain, 2–6 September 2019; pp. 1–5. [Google Scholar] [CrossRef]
  78. Yang, J.H.; Kim, N.K.; Kim, H.K. Se-Resnet with Gan-Based Data Augmentation Applied to Acoustic Scene Classification Technical Report. In Proceedings of the DCASE, Surrey, UK, 19–20 November 2018. [Google Scholar]
  79. Kim, E.; Moon, J.; Shim, J.C.; Hwang, E. DualDiscWaveGAN-Based Data Augmentation Scheme for Animal Sound Classification. Sensors 2023, 23, 2024. [Google Scholar] [CrossRef]
  80. Dai, W.; Dai, C.; Qu, S.; Li, J.; Das, S. Very deep convolutional neural networks for raw waveforms. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 421–425. [Google Scholar] [CrossRef]
  81. He, K.; Sun, J. Convolutional neural networks at constrained time cost. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5353–5360. [Google Scholar]
  82. Abdullah, K.H.; Bilal Er, M. Lung sound signal classification by using Cosine Similarity-based Multilevel Discrete Wavelet Transform Decomposition with CNN-LSTM Hybrid model. In Proceedings of the 2022 4th International Conference on Artificial Intelligence and Speech Technology (AIST), Delhi, India, 9–10 December 2022; pp. 1–4. [Google Scholar] [CrossRef]
  83. Pradeep, R.; Rao, K.S. Incorporation of Manner of Articulation Constraint in LSTM for Speech Recognition. Circuits Syst. Signal Process. 2019, 38, 3482–3500. [Google Scholar] [CrossRef]
Figure 1. Overall network monitoring routes and network data transmission maps. (a) Environmental factor acquisition sensors; (b) video capture camera; (c) audio capture pickup arrays; (d) wireless network data transmission, primary link: 5.8 GHz wireless network transmission, secondary link: LoRa wireless network transmission; (e) the local authority, where the data are transmitted back to the authority’s servers for storage via the wireless network; (f) the location of the passive acoustic monitoring system in the Ailao Mountains, Chuxiong City, Yunnan Province.
Figure 2. Spectrograms of four different calls of the western black-crested gibbon. (a) simple repetitive syllable calls of the male gibbon; (b) agonistic calls of the female gibbon; (c) weakly modulated syllable calls of the male gibbon; (d) strongly modulated syllable calls of the male gibbon.
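Spectrograms such as those shown in Figure 2 can be generated from the raw recordings with standard audio tooling. The following minimal sketch uses librosa; the file name, sampling rate, FFT size, hop length, and number of Mel bands are illustrative assumptions rather than the settings used in this paper.

```python
# Illustrative Mel-spectrogram conversion for a single call clip.
# The file name and all parameters (sr, n_fft, hop_length, n_mels) are assumptions.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("gibbon_call.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)          # convert power to dB scale

img = librosa.display.specshow(mel_db, sr=sr, hop_length=512, x_axis="time", y_axis="mel")
plt.colorbar(img, format="%+2.0f dB")
plt.title("Mel spectrogram of a gibbon call (illustrative)")
plt.tight_layout()
plt.savefig("gibbon_call_mel.png")
```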
Figure 3. Comparison of frequency and amplitude differences in calls of different gibbons.
Figure 4. Examples of spectrograms at different noise levels: 0 dB, −5 dB, −10 dB, and −15 dB.
Figure 5. Diagram of the overall network structure. From top to bottom, the pipeline comprises audio data augmentation, the DenseNet network and the attention-based LSTM network, and feature fusion using PCA.
Figure 6. DenseNet network architecture based on the self-attention mechanism. In the figure, Conv denotes a convolutional layer, BN denotes batch normalization, and ReLU denotes the ReLU activation function.
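For readers who want a concrete picture of the self-attention block, the following PyTorch module is a generic sketch of spatial self-attention applied to a convolutional feature map; the channel sizes and the residual weighting are illustrative assumptions rather than the exact block used in this paper.

```python
# Generic self-attention over a CNN feature map (B, C, H, W), in the spirit of
# the SA block in Figure 6. Channel sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSelfAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)            # (B, HW, C//8)
        k = self.key(x).flatten(2)                              # (B, C//8, HW)
        v = self.value(x).flatten(2)                            # (B, C, HW)
        attn = F.softmax(q @ k / (q.size(-1) ** 0.5), dim=-1)   # (B, HW, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)       # re-weighted map
        return self.gamma * out + x                             # residual connection

x = torch.randn(2, 64, 16, 16)                 # e.g., the output of a dense block
print(SpatialSelfAttention(64)(x).shape)       # torch.Size([2, 64, 16, 16])
```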
Figure 7. LSTM network architecture module and time attention mechanism module.
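The temporal-attention LSTM module of Figure 7 can likewise be sketched in a few lines of PyTorch: attention weights are computed over the LSTM hidden states and used to pool them into a single context vector. The dimensions below are illustrative assumptions, not the configuration used in this paper.

```python
# Sketch of an LSTM followed by temporal (additive) attention pooling,
# in the spirit of Figure 7. All dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMTemporalAttention(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)      # one attention score per time step

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(x)                        # (B, T, H) hidden states
        weights = F.softmax(self.score(h), dim=1)  # (B, T, 1), sums to 1 over time
        context = (weights * h).sum(dim=1)         # (B, H) attention-pooled summary
        return context

frames = torch.randn(4, 100, 128)                  # 4 sequences, 100 frames, 128-dim
print(LSTMTemporalAttention(128, 256)(frames).shape)  # torch.Size([4, 256])
```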
Figure 8. A comparison of the accuracy and loss of our model with alternative models. (a) Model accuracy comparison without data augmentation; (b) model accuracy comparison after data augmentation; (c) accuracy comparison after data augmentation with the attention-based LSTM module added; (d) comparison of the loss before and after data augmentation and the addition of the attention-based LSTM module.
Figure 9. Comparison of accuracy after augmentation by different audio data augmentation methods. The augmentation methods include Translation of Sample Rate (TSR), Same Class Augmentation (SC), Time Shifting Augmentation (TS1), Pitch Shift Augmentation (PS), Time Stretching Augmentation (TS2), Speed Tuning (ST), Volume Tuning (VT), Noise Augmentation (N), Fre-GAN, and WaveGAN. The models compared are VGG16, Xception, MobileNet, and SA_DenseNet. (a) Comparison of MobileNet classification accuracy after the different data augmentation methods. (b) Comparison of Xception classification accuracy after the different data augmentation methods. (c) Comparison of VGG16 classification accuracy after the different data augmentation methods. (d) Comparison of SA_DenseNet classification accuracy after the different data augmentation methods.
Figure 10. Model evaluation. First, we show how the accuracy of our proposed network compares on the training and validation sets; second, we apply all the trained models to real-world tests and compare their test accuracies. (a) Accuracy of the DenseNet+Self-Attention-LSTM-Attention-based model on the training and validation sets after data augmentation. (b) Comparison of the validation-set accuracy of the different models.
Figure 11. Statistics on the number and timing of calls of different species in April 2021. (1) The bars indicate the number of days each species was monitored in April; (2) the line graphs indicate the number of calls of each species during April; (3) the right-most graph shows the distribution of calling time periods for each species during April; (4) Gibbon denotes the western black-crested gibbon Nomascus concolor, WL denotes the White-browed Laughingthrush Pterorhinus sannio, GT denotes the Green-backed Tit Parus monticolus, SSB denotes the Streak-breasted Scimitar Babbler Pomatorhinus ruficollis, GB denotes the Great Barbet Psilopogon virens, MF denotes the Manipur Fulvetta Fulvetta manipurensis, BTM denotes the Bar-throated Minla Actinodura strigula, and Time denotes the time period. (a) Recognition results of the different species' calls; (b) distribution of call time periods for the different species.
Table 1. Sample sizes of the four original western black-crested gibbon call types. The table lists the call type, gender, sample size, average call length, and labeling tag for each call type.
Call Type | Gender | Sample Size | Average Length (s) | Labeling Tag
1. aa notes | males | 497 | 4.6 | ga
2. Weakly modulated figure | males | 958 | 5.8 | gw
3. Modulated figure | males | 1066 | 7.2 | gm
4. Great call | females | 213 | 8.4 | gf
Note: 1. aa notes (simple repetitive syllable calls of the male gibbon), 2. Weakly modulated figure (weakly modulated syllable calls of the male gibbon), 3. Modulated figure (strongly modulated syllable calls of the male gibbon), 4. Great call (agonistic calls of the female gibbon).
Table 2. The audio data augmentation methods used in this paper.
Augmentation Method | Method Category
Translation of Sample Rate (TSR) | Data Transformation
Same Class Augmentation (SC) | Data Transformation
Time Shifting Augmentation (TS1) | Data Transformation
Pitch Shift Augmentation (PS) | Data Transformation
Time Stretching Augmentation (TS2) | Data Transformation
Speed Tuning (ST) | Data Mixing
Volume Tuning (VT) | Data Mixing
Noise Augmentation (N) | Data Mixing
WaveGAN | Data Generation
Fre-GAN | Data Generation
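To make the data-transformation entries in Table 2 concrete, the following sketch illustrates time shifting, pitch shifting, time stretching, and volume tuning with librosa and NumPy; the parameter ranges and the input file name are assumptions rather than the settings used in the experiments.

```python
# Illustrative implementations of a few augmentations listed in Table 2.
# Parameter ranges and the input file are assumptions, not the paper's settings.
import librosa
import numpy as np

y, sr = librosa.load("gibbon_call.wav", sr=None)
rng = np.random.default_rng(0)

# Time shifting: circularly shift the waveform by up to 0.5 s.
shift = rng.integers(-sr // 2, sr // 2)
y_shifted = np.roll(y, shift)

# Pitch shifting: move the pitch up or down by up to 2 semitones.
y_pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2, 2))

# Time stretching: change duration without changing pitch.
y_stretched = librosa.effects.time_stretch(y, rate=rng.uniform(0.8, 1.2))

# Volume tuning: scale the amplitude by a random gain in [-6 dB, +6 dB].
gain_db = rng.uniform(-6, 6)
y_gained = y * (10.0 ** (gain_db / 20.0))
```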
Table 3. Comparison of model classification results at different noise levels.
Noise Level (dB) | Accuracy | Precision | Recall | F1-Score
0 | 0.95 | 0.94 | 0.93 | 0.94
−5 | 0.92 | 0.91 | 0.90 | 0.90
−10 | 0.88 | 0.87 | 0.85 | 0.86
−15 | 0.85 | 0.84 | 0.82 | 0.83
−20 | 0.80 | 0.78 | 0.76 | 0.77
−25 | 0.69 | 0.68 | 0.65 | 0.66
−30 | 0.65 | 0.63 | 0.60 | 0.61
−35 | 0.60 | 0.58 | 0.55 | 0.56
−40 | 0.55 | 0.53 | 0.50 | 0.51
−45 | 0.50 | 0.48 | 0.45 | 0.46
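The noise levels in Table 3 are signal-to-noise ratios expressed in dB. The sketch below shows one standard way of scaling background noise so that a clip is mixed at a chosen SNR; the file names are placeholders and the procedure is illustrative, not the authors' exact pipeline.

```python
# Mix background noise into a clip at a target SNR (in dB).
# File names are placeholders; the scaling follows the standard SNR definition.
import numpy as np
import librosa

def mix_at_snr(signal: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = np.resize(noise, signal.shape)        # loop/trim noise to match length
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Choose a scale so that p_signal / (scale**2 * p_noise) == 10**(snr_db / 10).
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise

clean, sr = librosa.load("gibbon_call.wav", sr=None)
noise, _ = librosa.load("forest_background.wav", sr=sr)
noisy_minus10db = mix_at_snr(clean, noise, snr_db=-10.0)
```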
Table 4. Detailed information on the different sound types. For each category, the training set contains 1200 samples, the test set 400 samples, and the validation set 400 samples.
Call Type | Genus | Family | Original Size | WaveGAN Augmentation | Labeling Tag | Size
wind | – | – | 2000 | 2000 | wind | 656 MB
rain | – | – | 2000 | 2000 | rain | 565 MB
Actinodura strigula | Actinodura | Leiothrichidae | 1298 | 2000 | BTM | 553 MB
Fulvetta cinereiceps | Fulvetta | Paradoxornithidae | 987 | 2000 | MF | 523 MB
Psilopogon virens | Psilopogon | Megalaimidae | 1500 | 2000 | GB | 635 MB
Pomatorhinus ruficollis | Pomatorhinus | Timaliidae | 2000 | 2000 | SSB | 589 MB
Parus monticolus | Parus | Paridae | 1356 | 2000 | GT | 643 MB
Pterorhinus sannio | Pterorhinus | Leiothrichidae | 1293 | 2000 | WL | 528 MB
aa notes | Nomascus | Hylobatidae | 497 | 2000 | ga | 564 MB
weakly modulated figure | Nomascus | Hylobatidae | 958 | 2000 | gw | 894 MB
modulated figure | Nomascus | Hylobatidae | 1066 | 2000 | gm | 783 MB
great call | Nomascus | Hylobatidae | 213 | 2000 | gf | 679 MB
cicada | Cryptotympana | Cicadidae | 2000 | 2000 | cicada | 586 MB
Table 5. Classification results of different models before data augmentation. The four models are compared in terms of accuracy, precision, recall, and F1-score.
Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
VGG16 | 86.8 | 86.2 | 85.6 | 85.8
Xception | 85.6 | 84.7 | 83.7 | 83.9
MobileNet | 72.1 | 71.7 | 70.4 | 70.7
SA_DenseNet | 88.1 | 87.8 | 86.4 | 86.9
Table 6. Classification results of different models after WaveGAN audio data augmentation. The four network models are compared in terms of accuracy, precision, recall, and F1-score.
Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
VGG16 | 93.9 | 93.5 | 92.7 | 92.9
Xception | 93.8 | 93.2 | 92.5 | 92.7
MobileNet | 78.9 | 78.2 | 77.4 | 77.7
SA_DenseNet | 95.9 | 95.6 | 94.8 | 95.1
Table 7. Classification results of different models after WaveGAN audio data augmentation and the addition of the attention-based LSTM module. Our four models are compared with several established audio recognition networks.
Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
VGG16-LSTM-Attention | 95.9 | 95.4 | 94.6 | 94.8
Xception-LSTM-Attention | 95.6 | 95.1 | 94.2 | 94.5
MobileNet-LSTM-Attention | 81.4 | 80.8 | 79.3 | 79.5
SA_DenseNet-LSTM-Attention | 98.2 | 96.7 | 96.1 | 96.5
ECAPA-TDNN [65] | 92.8 | 92.4 | 90.5 | 91.8
PANNs [51] | 96.1 | 95.5 | 93.7 | 94.1
TDNN [66] | 93.4 | 92.7 | 91.8 | 92.4
Res2Net [67] | 96.9 | 95.4 | 94.5 | 94.8
ResNetSE [68] | 97.1 | 95.8 | 94.9 | 95.3
CAMPPlus [69] | 96.8 | 96.1 | 95.1 | 95.7
ERes2Net [70] | 96.6 | 95.3 | 94.1 | 94.7
ERes2NetV2 [70] | 91.5 | 90.8 | 89.7 | 90.2
Table 8. Comparison of training time and inference time for different models.
Model | Training Time (Hours) | Inference Time (Seconds/Sample)
VGG16-LSTM-Attention | 10 | 0.025
Xception-LSTM-Attention | 12 | 0.022
MobileNet-LSTM-Attention | 9 | 0.020
SA_DenseNet-LSTM-Attention | 8 | 0.018
Table 9. Detailed recognition results for the different sound categories obtained with the model proposed in this paper.
Call Type | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
wind | 98.3 | 96.7 | 96.0 | 96.3
rain | 98.4 | 96.9 | 96.4 | 96.6
Actinodura strigula | 98.1 | 96.4 | 95.8 | 96.1
Fulvetta cinereiceps | 98.3 | 96.8 | 96.1 | 96.4
Psilopogon virens | 98.2 | 96.6 | 96.0 | 96.3
Pomatorhinus ruficollis | 98.0 | 96.3 | 95.7 | 96.0
Parus monticolus | 98.1 | 96.5 | 95.9 | 96.2
Pterorhinus sannio | 98.2 | 96.7 | 96.1 | 96.4
aa notes | 98.4 | 97.1 | 96.7 | 96.9
weakly modulated figure | 98.1 | 96.5 | 96.0 | 96.2
modulated figure | 98.5 | 97.0 | 96.8 | 96.9
great call | 98.3 | 96.8 | 96.2 | 96.5
cicada | 98.2 | 96.6 | 95.9 | 96.2
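Per-class precision, recall, and F1-scores of the kind reported in Table 9 can be computed directly from the predicted and true labels. The following sketch uses scikit-learn, with synthetic placeholder predictions standing in for the real test-set outputs; the label tags match Table 4.

```python
# Compute per-class precision/recall/F1 (as in Table 9) from predictions.
# y_true/y_pred below are synthetic placeholders for the real test-set labels.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

labels = ["wind", "rain", "BTM", "MF", "GB", "SSB", "GT", "WL",
          "ga", "gw", "gm", "gf", "cicada"]
rng = np.random.default_rng(0)
y_true = rng.choice(labels, size=400)
# Simulate a classifier that is right about 95% of the time.
y_pred = np.where(rng.random(400) < 0.95, y_true, rng.choice(labels, size=400))

print(classification_report(y_true, y_pred, labels=labels, digits=3))
print(confusion_matrix(y_true, y_pred, labels=labels))
```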
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
