Article

Investigation of Bird Sound Transformer Modeling and Recognition

Darui Yi and Xizhong Shen *
School of Electrical and Electronic Engineering, Shanghai Institute of Technology, Shanghai 201418, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(19), 3964; https://doi.org/10.3390/electronics13193964
Submission received: 29 August 2024 / Revised: 3 October 2024 / Accepted: 7 October 2024 / Published: 9 October 2024

Abstract

Birds play a pivotal role in ecosystem and biodiversity research, and accurate bird identification contributes to the monitoring of biodiversity, understanding of ecosystem functionality, and development of effective conservation strategies. Current methods for bird sound recognition often involve processing bird songs into various acoustic features or fusion features for identification, which can result in information loss and complicate the recognition process. At the same time, the recognition method based on raw bird audio has not received widespread attention. Therefore, this study proposes a bird sound recognition method that utilizes multiple one-dimensional convolutional neural networks to directly learn feature representations from raw audio data, simplifying the feature extraction process. We also apply positional embedding convolution and multiple Transformer modules to enhance feature processing and improve accuracy. Additionally, we introduce a trainable weight array to control the importance of each Transformer module for better generalization of the model. Experimental results demonstrate our model’s effectiveness, with an accuracy rate of 99.58% for the public dataset Birds_data, as well as 98.77% for the Birdsound1 dataset, and 99.03% for the UrbanSound8K environment sound dataset.

1. Introduction

The conservation of bird species plays a pivotal role in safeguarding biodiversity and serves as an indicator group for assessing the ecological integrity of human ecosystems. The abundance of bird populations directly mirrors the level of local biodiversity [1]. Currently, there exist approximately 11,000 bird species worldwide, encompassing habitats across all terrestrial areas. Birds play crucial roles in ecosystem services for humans by serving as pollinators, seed spreaders, ecosystem engineers, cleaners, and predators. They not only contribute to the enhancement and preservation of biodiversity but also provide valuable assistance to human activities, such as promoting agricultural sustainability through pest control [2]. Therefore, prioritizing the conservation of avifauna is imperative [3].
Traditional methods for bird recognition heavily rely on manual identification, which is subjective, restrictive, and characterized by low accuracy. With the advancements in machine learning techniques, there has been a gradual shift towards employing machine learning for automated bird song recognition [4]. Generally, two types of machine learning-based approaches are used for bird song recognition: template matching algorithms, such as the Dynamic Time Warping (DTW) algorithm [5], and feature-based recognition methods, including kernel extreme learning machines [6], hidden Markov Models [7], Gaussian mixture models, Support Vector Machines [8], and random forests [9]. However, these machine learning algorithms often exhibit limited accuracy when it comes to recognizing bird songs in complex and dynamic environments. Therefore, there is a need for more reliable and accurate techniques to enhance bird song recognition.
With the widespread adoption of deep learning technology in bird sound recognition tasks, deep learning models have demonstrated their ability to automatically extract features and exhibit strong robustness [10,11]. In 2016, researchers employed convolutional neural networks with dense layers and utilized spectrograms of bird songs as input, achieving a record-breaking accuracy rate of 55% in that year’s BirdCLEF competition. This experiment substantiated the superior recognition accuracy of deep learning methods compared to previous machine learning approaches [12]. Subsequently, Sankupellay and Konovalov employed a ResNet-50 model based on a deep convolutional neural network architecture, also utilizing spectrograms of bird songs as input, and achieved an unprecedented accuracy rate of 72% among 46 species of birds [13]. Recently, research has revealed that the extraction of a single feature may result in information loss, while the fusion of multiple features can complement and enhance recognition accuracy [14,15]. Yan et al. proposed a model based on 3DCNN-LSTM that integrates logarithmic mel spectrograms, mel-frequency cepstral coefficients (MFCCs), and Chroma features to generate novel fusion features, achieving an average accuracy of 97.9% in experimental evaluations [16]. Yao et al. employed wavelet transform (WT) to obtain spectrograms of bird songs and fitted them with Gaussian mixture models for extracting acoustic feature parameters, which were then fused with MFCCs for classification and recognition tasks involving nine bird species [17]. The experiment demonstrated that this fusion approach improved accuracy by 3.41% compared to using only MFCCs. Liu et al. utilized 2D and 3D convolutional neural networks, respectively, to extract features from original waveform images and logarithmic mel spectrograms, which were subsequently fused and input into dual-gated recurrent units (d-GRUs) for recognition, ultimately achieving an accuracy rate of 95.9% [18]. Murugaiya et al. [19] introduced a new probabilistic enhanced entropy feature combined with improved Gammatone frequency cepstral coefficient (GFCC) features, resulting in a significant increase in accuracy of 3.5% compared to using GFCCs alone. Hu et al. [20] integrated mel spectrograms with Sinc spectrograms containing sound quality features and employed ResNet18 for recognition tasks, attaining the highest accuracy rate of 98.34%.
In recent years, Transformers have been increasingly utilized in the field of bird sound recognition. Puget proposed a Transformer-based STFT-Transformer model, which takes the mel spectrogram of bird sounds as input and achieves performance comparable to convolutional neural networks [21]. Tang et al. employed a vision Transformer (ViT) that goes beyond feature extraction by encoding and normalizing MFCCs, resulting in a dataset with discernible visual features; they achieved an improvement of 10.64% in accuracy compared to state-of-the-art models at that time [22]. Xiao et al. integrated MFCC, Chroma, Spectral contrast, and Tonnetz features and trained an AMResNet network with a Transformer module, achieving an accuracy of 90.1% on their dataset [23]. Zhang et al. combined log mel spectrogram features extracted by deep networks with MFCC, Chroma, and Tonnetz features processed by Transformer modules [24], ultimately attaining an accuracy of 97.99% on a dataset consisting of 20 bird species.
However, we have observed that while there is a wide range of methods available for bird song recognition, the majority of these approaches rely on extracting acoustic features from bird songs. Moreover, the fusion of features primarily involves combining acoustic characteristics, with limited research conducted on directly learning features from the raw audio signals of bird songs. Sanchez et al. utilized Sinc-Net [25], which is based on Sinc-filtering techniques, to directly learn features from the raw audio waveforms of bird songs. Their results demonstrated an accuracy only slightly lower (by approximately 0.03%) than that achieved by VGG16, DenseNet121, and ResNet50 models on the same dataset. Notably, though, in terms of parameter quantity and training time requirements, Sinc-Net significantly outperformed these three models. Their study emphasizes both the simplicity and feasibility of employing deep learning methodologies to classify bird songs solely based on their original waveforms. We also note that Rauch et al. in 2023 proposed a Transformer-based end-to-end framework for bird song recognition that aims to bypass traditional spectrogram conversion and process raw audio signals directly [26]. Additionally, Gazneli et al.’s research has demonstrated the advantages offered by end-to-end models in audio classification tasks; however, these works present theoretical approaches without practical application specifically within the domain of bird song recognition [27].
In this study, our theoretical basis builds on the theory of Rauch et al. [26]; we utilize raw bird sound audio as the input for our model. To bypass the complex process of spectrogram conversion, we employ a stacked one-dimensional convolutional neural network to directly learn features from the raw audio. We utilize longer convolutional kernels to capture contextual information within the input sequence, while shorter ones capture local features within smaller ranges. Moreover, we combine the Transformer layer with one-dimensional position embedding convolution to effectively capture long-range dependencies in the input sequence. During forward propagation, trainable weight parameters are set for each layer of the Transformer to combine their outputs into the final model output. This approach enhances flexibility and adaptability by dynamically adjusting the importance of each layer and improves generalization across different data characteristics. Finally, a linear classifier is employed for classification, achieving accuracies of 99.58%, 98.77%, and 99.03% for the public datasets Birds_data, Birdsound1, and UrbanSound8K, respectively.

2. Materials and Methods

In this section, we provide a comprehensive introduction to the dataset and model structure employed, as well as elaborate on the feature extraction process for raw audio of bird vocalizations, computation of feature scales, and present an in-depth explanation of the Transformer layer and recognition procedure.

2.1. Dataset

In the field of bird song recognition research, it is imperative to possess publicly accessible and dependable datasets that can be widely utilized. In this study, we employed the Beijing Hundred Birds dataset (Birds_data) and selected a subset of bird songs from the dataset released by the Cornell Lab of Ornithology in 2023 for segmentation processing. Furthermore, we integrated the UrbanSound8K dataset, comprising urban environmental sound data, to evaluate our model’s generalization performance in other sound recognition tasks. All the audio samples in these three datasets are about 2 s in length and are converted to wav format to minimize the performance impact caused by sample format differences; a sample length of 2 s also meets the requirements of the model input size. Each dataset was divided into training and testing sets, with 80% allocated for training and 20% for testing. Detailed descriptions of these datasets are provided in Table 1.

2.1.1. Birds_data Dataset

The Birds_data dataset was collaboratively planned and collected by the Beijing Zhiyuan Artificial Intelligence Research Institute and the Bai Niao Data Center, and its credibility and accuracy are highly acclaimed [22]. This dataset comprises records of 20 common bird species in China. All bird vocalization segments have undergone meticulous noise reduction processing and have been trimmed to a standardized 2 s length, obviating the need for further data preprocessing. Among the 14,311 two-second bird vocalization segments from the 20 bird species, approximately 90% of the species possess over 600 samples, while only about 5% have fewer than 50 samples, ensuring the integrity of the bird vocalization segments. The sample distribution across species is shown in Figure 1. The dataset is available at https://data.baai.ac.cn/details/Birdsdata (accessed on 28 June 2024).

2.1.2. Birdsound1 Dataset

The Birdsound1 dataset is derived from the birdclef-2023 dataset, which was released by the Cornell Lab of Ornithology for a bird identification competition held in Kenya, an East African country, in 2023. It comprises recordings of vocalizations from 264 avian species in ogg format. For this study, we randomly selected 12 avian species from the original set and converted their audio files from ogg to wav format using Python’s (Version 3.9.12) os and pydub libraries. Subsequently, we segmented the wav audio into a total of 61,867 clips with a duration of 2 s each. No additional noise reduction was applied, so the clips retain environmental noise.
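For reference, the conversion and slicing step can be reproduced with a short script along the following lines. This is a minimal sketch rather than the authors' exact pipeline: the directory names are hypothetical, and resampling to 16 kHz mono is an assumption carried over from the model input described in Section 2.2.1.

```python
import os
from pydub import AudioSegment

SRC_DIR = "birdclef2023_ogg"   # hypothetical source directory of ogg recordings
DST_DIR = "birdsound1_wav"     # hypothetical output directory of 2 s wav clips
CLIP_MS = 2000                 # 2 s clips, matching the model input length

os.makedirs(DST_DIR, exist_ok=True)

for fname in os.listdir(SRC_DIR):
    if not fname.endswith(".ogg"):
        continue
    audio = AudioSegment.from_ogg(os.path.join(SRC_DIR, fname))
    audio = audio.set_frame_rate(16000).set_channels(1)  # 16 kHz mono (assumption)
    # Split the recording into consecutive, non-overlapping 2 s clips.
    for i, start in enumerate(range(0, len(audio) - CLIP_MS + 1, CLIP_MS)):
        clip = audio[start:start + CLIP_MS]
        out_name = f"{os.path.splitext(fname)[0]}_{i:04d}.wav"
        clip.export(os.path.join(DST_DIR, out_name), format="wav")
```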

2.1.3. UrbanSound8K Dataset

The UrbanSound8K dataset is a widely utilized public dataset for research on automatic urban sound classification. It comprises 8732 annotated sound clips encompassing 10 categories: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gunshot, jackhammer, police siren, and street music. All sound clips are sourced from www.freesound.org (accessed on 3 July 2024). This dataset is employed in this paper to assess the model’s generalization performance and is available at https://aistudio.baidu.com/datasetdetail/142387 (accessed on 3 July 2024).

2.2. Bird Sound Model

The model architecture is illustrated in Figure 2, comprising a convolutional feature extraction layer, a position embedding convolutional layer, a Transformer layer incorporating a multi-head attention mechanism, a pooling layer, and a linear classification layer.

2.2.1. Feature Pre-Extraction Layer

The convolutional layer employs a seven-layer one-dimensional convolutional neural network to directly extract features from the original audio of bird songs. In contrast to traditional acoustic features that necessitate the manual design and selection of feature extraction methods, such as filter design, audio framing, and computation of cepstral coefficients [28], the one-dimensional convolutional neural network enables direct learning of feature representations from raw audio signals, thereby enhancing convenience and automation in this approach. We utilize long convolution kernels to capture extensive contextual information within input sequences, followed by short convolution kernels for capturing local features within smaller ranges in the sequences. Table 2 presents the configuration details of the feature extraction layer.
The sampling rate in the experiment is set to 16 kHz. For 2 s audio, the corresponding input size is (1,32000), and the number of channels is set to 256. Consequently, the output of the convolutional layer is (1,256,99). The output length of each convolutional layer is given by the following equation:
$$ L = \frac{N - K + 2P}{S} + 1 \tag{1} $$
In this equation, N denotes the input sequence length, K denotes the convolutional kernel length, S denotes the stride, and P represents the padding, which is set to 0 in this study. From Equation (1), the output length of the final convolution layer is 99, and the final output tensor dimension is (1,99,256), which is equivalent to generating a 256-dimensional feature vector approximately every 20 ms.
After the input sequence passes through the feature extraction layer, we apply a position embedding convolution to further enhance feature information at different time steps. To maintain consistent sequence lengths before and after the position embedding convolution, we employ a one-dimensional convolution with a kernel length of 128 and padding set to half of the kernel size. The convolution is grouped into 16 groups, enabling parallel sliding of kernels on the input to capture information from different positions simultaneously and thereby improving computational efficiency. Following the position embedding convolution, we apply the Gelu activation function to introduce non-linearity and improve model performance for better adaptation to feature complexity and diversity. Residual connections connect the output of the one-dimensional convolutional layers with the output of the position embedding convolutional layer, facilitating optimization and alleviating the gradient vanishing issues caused by excessively deep networks, thereby enhancing training efficiency and model performance [29].
To mitigate the risk of overfitting, this study also incorporates dropout and layer normalization operations. The model input consists of original bird song audio, which forms long sequences. Layer normalization normalizes all time steps within each sample, as opposed to batch normalization, which considers the data distribution within mini-batches. Because layer normalization is independent of batch size, it adapts effectively to varying sequence lengths; this not only reduces computational complexity but also enhances the efficiency and applicability of the model for long-sequence data, as is common in domains such as natural language processing [30]. The feature extraction layer ultimately produces features of size (1,99,512), which are subsequently input into the Transformer layer for further processing.
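The following PyTorch sketch reproduces this front end using the layer configuration of Table 2. It is illustrative only: the Gelu activations between the convolutional layers, the linear 256-to-512 projection placed before the positional convolution, and the trimming of the extra time step produced by the even positional kernel are assumptions made so that the reported (1, 99, 512) output and the residual connection are dimensionally consistent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureExtractor(nn.Module):
    """Sketch of the seven-layer 1D-CNN front end (Table 2) plus the position-embedding
    convolution, residual connection, layer normalization, and dropout described above."""

    def __init__(self, dim=256, embed_dim=512, dropout=0.1):
        super().__init__()
        # (kernel, stride) pairs from Table 2: one long kernel first, then shorter ones.
        specs = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]
        layers, in_ch = [], 1
        for k, s in specs:
            layers += [nn.Conv1d(in_ch, dim, kernel_size=k, stride=s), nn.GELU()]  # GELU between convs: assumption
            in_ch = dim
        self.convs = nn.Sequential(*layers)
        self.proj = nn.Linear(dim, embed_dim)              # assumed 256 -> 512 projection
        self.pos_conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=128, padding=64, groups=16)
        self.norm = nn.LayerNorm(embed_dim)                # normalizes over the feature dimension per time step
        self.dropout = nn.Dropout(dropout)

    def forward(self, wav):                                # wav: (B, 1, 32000), raw 16 kHz audio
        x = self.convs(wav)                                # (B, 256, 99); lengths follow Equation (1) with flooring
        x = self.proj(x.transpose(1, 2))                   # (B, 99, 512)
        pos = self.pos_conv(x.transpose(1, 2))[:, :, :-1]  # even kernel adds one step; trim to keep length 99
        x = x + F.gelu(pos).transpose(1, 2)                # residual connection with the positional features
        return self.dropout(self.norm(x))                  # (B, 99, 512)

print(FeatureExtractor()(torch.randn(1, 1, 32000)).shape)  # torch.Size([1, 99, 512])
```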

2.2.2. Transformer Layer

The Transformer layer is composed of multiple Transformer modules connected in series, with each module taking the output of the previous one as its input. The Transformer module consists of several components, including a multi-head attention mechanism and a feed-forward layer, as illustrated in Figure 3. The tensor size input to the Transformer layer is (1,99,512), and multi-head attention is employed to learn diverse feature mappings. Each attention head can selectively attend to information from different positions within the sequence while capturing distinct relationships and patterns, thereby enhancing feature representation and improving model performance on subsequent recognition tasks [31].
The multi-head attention enables the simultaneous calculation of attention weights in each subspace, facilitating parallel processing. This parallelization significantly enhances the computational efficiency of the model, particularly when dealing with raw audio inputs as utilized in this study. The following sections demonstrate the pivotal role played by the multi-head attention mechanism through experiments. In this paper, we divide the multi-head attention mechanism into eight subspaces. Prior to performing the calculation, three linear layers individually transform the input tensor to obtain the Q, K, and V matrices. The Q matrix is additionally scaled after its linear transformation by a scaling factor S, which equals the reciprocal of the square root of the subspace dimension. The scaling factor is given by the following equation:
$$ S = \frac{1}{\sqrt{head\_dim}} \tag{2} $$
Here, head_dim represents the subspace dimension, which is one-eighth of the feature dimension of 512, i.e., 64. Substituting this into Equation (2) gives a scaling factor S of 0.125. The scaling factor adjusts the numerical range of the attention scores, which are obtained through the dot product of the Q and K matrices. When the dimensions of Q and K are large, the dot product values tend to be excessively high, leading to oversized inputs to the softmax function and subsequently diminished gradients. Therefore, we employ the scaling factor to regulate gradient propagation and ensure training stability.
After obtaining the Q and K matrices, transpose the K matrix and perform a dot product with the Q matrix to calculate attention weights. The equation is as follows:
$$ \text{Attention weights} = Q \times K^{T} \tag{3} $$
Then, the softmax-processed attention weights are multiplied with the V matrix to obtain the output of each head. Finally, the subspace outputs are reshaped and combined into a tensor of size (1,99,512). The flow of tensors through multi-head attention is illustrated in Figure 4.
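A minimal sketch of this computation is shown below. It follows Equations (2) and (3) with eight heads and a feature dimension of 512; packaging the projections inside a trainable module is omitted for brevity, so this illustrates the tensor flow rather than the exact implementation.

```python
import torch
import torch.nn as nn

def multi_head_attention(x, num_heads=8):
    """Sketch of the attention computation of Equations (2) and (3).
    x: (B, T, D) with D = 512, so head_dim = 64 and S = 1/sqrt(64) = 0.125."""
    B, T, D = x.shape
    head_dim = D // num_heads
    q_proj, k_proj, v_proj, out_proj = (nn.Linear(D, D) for _ in range(4))  # three input projections + output

    def split_heads(t):                                    # (B, T, D) -> (B, heads, T, head_dim)
        return t.view(B, T, num_heads, head_dim).transpose(1, 2)

    q = split_heads(q_proj(x)) * head_dim ** -0.5          # scale Q by S = 1/sqrt(head_dim)
    k, v = split_heads(k_proj(x)), split_heads(v_proj(x))
    weights = torch.softmax(q @ k.transpose(-2, -1), dim=-1)  # (B, heads, T, T) attention weights
    out = (weights @ v).transpose(1, 2).reshape(B, T, D)      # recombine the eight subspaces
    return out_proj(out)

print(multi_head_attention(torch.randn(1, 99, 512)).shape)    # torch.Size([1, 99, 512])
```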
Subsequently, we employ a fully connected layer to map the features to a higher dimension of 2048 for learning more complex representations. We use the Gelu activation function to introduce non-linearity and enhance the model’s generalization capability; research has demonstrated that Gelu exhibits a smoother curve than Relu, thereby mitigating the issue of gradient vanishing, and its superiority over Relu is consistently observed across multiple datasets [32]. Figure 5 illustrates the graphs of the Relu and Gelu functions. We then incorporate another fully connected layer to restore the feature dimension back to 512 to control model complexity, applying a dropout operation to prevent overfitting. Furthermore, a reshaping operation integrates the outputs of all layers into a final output feature of size (N,99,512), where N denotes the number of Transformer layers.
For each layer of the Transformer, we set a trainable weight parameter that allows the model to dynamically adjust the importance level of each Transformer layer based on different tasks. Finally, the Transformer layer outputs the features with a size of (1,99,512).
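The sketch below illustrates one way to realize this weighted combination with standard PyTorch encoder blocks. Normalizing the trainable weights with a softmax is an assumption; the paper only states that one trainable weight is assigned per Transformer layer.

```python
import torch
import torch.nn as nn

class WeightedTransformerStack(nn.Module):
    """Sketch of the Transformer layer: a series of encoder blocks whose outputs are
    combined with one trainable weight per block."""

    def __init__(self, num_layers=7, dim=512, heads=8, ffn_dim=2048, dropout=0.1):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=ffn_dim,
                                       dropout=dropout, activation="gelu", batch_first=True)
            for _ in range(num_layers))
        self.layer_weights = nn.Parameter(torch.ones(num_layers))  # trainable importance per block

    def forward(self, x):                        # x: (B, 99, 512) from the feature extractor
        outputs = []
        for block in self.blocks:
            x = block(x)                         # each block consumes the previous block's output
            outputs.append(x)
        stacked = torch.stack(outputs)           # (N, B, 99, 512)
        w = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)  # softmax normalization: assumption
        return (w * stacked).sum(dim=0)          # weighted combination -> (B, 99, 512)

print(WeightedTransformerStack()(torch.randn(1, 99, 512)).shape)  # torch.Size([1, 99, 512])
```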

3. Experiment and Results

3.1. Settings

Our implementation is based on Python, using the PyTorch (torch) library for model construction. The experiments are primarily based on the Birds_data dataset, while the Birdsound1 and UrbanSound8K datasets are employed to assess the model’s generalization performance. Each dataset is divided into an 80% training set and a 20% testing set. During training, a batch size of 32 is used for a total of 30 epochs. Prior to each epoch, the samples within each dataset are randomly shuffled. The initial learning rate is set to 0.0001, and the parameters are updated via backpropagation using the Adam optimizer and a cross-entropy loss function. The evaluation metrics are macro-averaged. Table 3 presents the environmental configuration.
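A minimal training-loop sketch with these hyperparameters is shown below. The placeholder model and random tensors merely stand in for the full network of Section 2.2 and real 2 s audio clips.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"
# Placeholder model and random data standing in for the full network and real clips.
model = nn.Sequential(nn.Flatten(), nn.Linear(32000, 20)).to(device)
train_set = TensorDataset(torch.randn(256, 1, 32000), torch.randint(0, 20, (256,)))
loader = DataLoader(train_set, batch_size=32, shuffle=True)   # samples reshuffled every epoch

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)     # initial learning rate 0.0001

for epoch in range(30):                                       # 30 epochs
    model.train()
    running_loss = 0.0
    for wav, label in loader:
        wav, label = wav.to(device), label.to(device)
        optimizer.zero_grad()
        loss = criterion(model(wav), label)                   # cross-entropy loss
        loss.backward()                                       # backpropagation
        optimizer.step()
        running_loss += loss.item() * wav.size(0)
    print(f"epoch {epoch + 1}: loss = {running_loss / len(train_set):.4f}")
```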

3.2. Model Evaluation

The performance of the model is evaluated in this study using various metrics, including precision, accuracy, recall, and F1 score. Precision is defined as the ratio of true positive samples to all predicted positive samples and can be calculated as follows:
$$ Precision = \frac{TP}{TP + FP} \tag{4} $$
Accuracy is defined as the proportion of correctly predicted samples among all samples and evaluates the overall classification correctness of the model:
$$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{5} $$
The recall rate is defined as the ratio of true positive predictions to all actual positive samples, which can be calculated by the following formula:
$$ Recall = \frac{TP}{TP + FN} \tag{6} $$
The F1 score evaluates the overall performance of a model by combining its precision and recall. The calculation formula is as follows:
$$ F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{7} $$
The range of these four indicators is from 0 to 1, with higher values indicating superior model performance. In these equations, TP denotes positive samples correctly predicted as positive, FP denotes negative samples incorrectly predicted as positive, TN denotes negative samples correctly predicted as negative, and FN denotes positive samples incorrectly predicted as negative.
We employ a cross-entropy loss function to assess the classification performance of the model. The calculation formula is as follows:
$$ Loss = -\sum_{i=1}^{n} P(X_i) \log q(X_i) \tag{8} $$
Here, P(X_i) denotes the true label distribution of sample i, while q(X_i) denotes the probability the model assigns to sample i for that label.
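The macro-averaged evaluation can be computed from a confusion matrix as in the following sketch; the toy 3-class matrix is purely illustrative.

```python
import numpy as np

def macro_metrics(conf):
    """Macro-averaged precision, recall, and F1 plus overall accuracy, following
    Equations (4)-(7). conf[i, j] counts samples of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp                 # predicted as the class but actually another class
    fn = conf.sum(axis=1) - tp                 # samples of the class predicted as something else
    precision = tp / np.maximum(tp + fp, 1e-12)
    recall = tp / np.maximum(tp + fn, 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    accuracy = tp.sum() / conf.sum()
    return precision.mean(), recall.mean(), f1.mean(), accuracy

# Toy 3-class confusion matrix, purely illustrative.
conf = np.array([[50, 2, 1],
                 [3, 45, 2],
                 [0, 1, 49]])
print(macro_metrics(conf))
```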

3.3. Experiment

3.3.1. Ablation Experiment

In this section, we conducted ablation experiments on the convolutional layer and Transformer layer, respectively, aiming to determine the optimal model structure based on a comprehensive analysis of parameter quantity and performance. All ablation experiments were performed using the Birds_data dataset. The accuracy results for varying numbers of Transformer layers from 0 to 10 when the number of convolution layers is seven are presented in Table 4.
Based on the experimental results, when there is no Transformer layer, the model achieves an accuracy of 98.04%. Incorporating one Transformer layer leads to a significant increase in accuracy by 0.77%. Subsequently, as more Transformer layers are added, the accuracy gradually improves with each additional layer until reaching its peak at 99.65% with seven layers. However, further stacking of Transformer layers results in a notable decline in accuracy due to challenges in fully integrating and effectively utilizing information as well as computational limitations. The slight decrease in accuracy observed at the fourth Transformer layer can be attributed to the combined effects of model initialization randomness and data sampling variability.
Subsequently, we conducted an ablation experiment to investigate the impact of varying the number of convolutional layers in the feature extraction layer on the model. The number of Transformer layers was fixed at seven, as it yielded the highest accuracy among all configurations tested. The experimental results are presented in Table 5.
The results presented in Table 5 demonstrate that, when many Transformer layers are present, the number of convolutional layers has a significantly greater impact on the final accuracy than the number of Transformer layers. Specifically, in Table 4 we observe a maximum influence of only 0.77% from adding a Transformer layer, whereas varying the number of convolutional layers changes accuracy by up to 2.44% in Table 5. Notably, reducing the number of convolutional layers amplifies their effect on model accuracy. A four-layer convolutional configuration is equivalent to removing the last three convolutional layers with smaller kernels described in Section 2.2.1. The substantial decrease in accuracy can be attributed to these smaller kernels playing a crucial role in capturing local and subtle features within the input sequence; each kernel specializes in learning more localized feature patterns, and it is precisely the absence of small kernels capable of capturing fine pattern structures that leads to the noticeable decline in accuracy.
Simultaneously, our experimental findings demonstrate that the Transformer layer effectively compensates for the drop in accuracy caused by the absence of small-sized convolutional kernels in the model. We conducted experiments to investigate the impact of Transformer layers on model performance when there are only convolution layers with large kernels, as presented in Table 6.
Based on the experimental data presented in Table 6, it is evident that the model’s accuracy is merely 83.52% when solely employing the convolution layers. However, upon incorporating a single Transformer layer, a significant improvement of 10.9% in accuracy is observed. Moreover, with the incremental addition of Transformer layers, accuracy gradually improves, peaking at 97.91% with three layers. Notably, according to Equation (1), utilizing only four convolutional neural network layers results in an output tensor size from the feature extraction layer of (1,799,256), leading to excessive computational resource consumption, and the model already reaches its maximum accuracy with only three Transformer layers.
The ablation experiment validated the efficacy of one-dimensional convolutional neural networks and Transformer layers in this model, establishing that a model architecture comprising seven layers of convolutional neural networks and seven layers of Transformers yields optimal performance for bird sound recognition tasks. Consequently, the models used in subsequent experiments will all incorporate seven layers of convolutional neural networks and seven layers of Transformers.

3.3.2. Model Performance

When training and testing the model on the Birds_data dataset, the accuracy of the model initially exhibits rapid improvement followed by convergence to a stable trend, as depicted in Figure 6.
On the test set comprising 2862 bird sound segments, the model achieves a peak test accuracy of 99.65% alongside a logarithmic loss value of 0.016. To comprehensively evaluate the performance of our model, we employ the additional evaluation metrics outlined in Section 3.2: on the Birds_data dataset, the macro-averaged recall is 99.62%, accuracy is 99.58%, precision is 99.65%, and F1 score is 99.63%. During the testing phase, we also record the predicted and actual labels for each bird species to generate a confusion matrix, from which the accuracy, recall, and F1 score of each bird species can be calculated, as shown in Figure 7.
In Figure 7, the birds represented by numbers 0 to 19 are the Greylag Goose, Mute Swan, Green-winged Teal, Green-winged Duck, Grey Partridge, Common Quail, Common Pheasant, Red-throated Diver, Grey Heron, Great Cormorant, Eurasian Buzzard, Eurasian Hobby, Western Capercaillie, Crested Tit, Black-winged Stilt, Eurasian Oystercatcher, Common Greenshank, Redshank, Wood Sandpiper, and House Sparrow.
We also present the precision, recall, and F1 scores for each avian species in Table 7 and plot the average precision curve for each bird species in Figure 8. It is evident that all 20 bird species achieved commendable recall, precision, and F1 scores. Even Grey Partridge and Western Capercaillie, which have limited sample sizes, performed well: Grey Partridge attained perfect recall, precision, and F1, while Western Capercaillie attained a recall of 99.3%, perfect precision, and an F1 score of 99.6%. Given that Western Capercaillie has only 290 samples and Grey Partridge merely 29, this further highlights our model’s ability to recognize bird species with scarce data.
At the same time, we use class activation maps to show how the model assigns weights to the feature matrix. A class activation map is a visualization technique for classification results, obtained by the dot product of the feature matrix and the classifier weight matrix. This paper gives the class activation maps of four bird species, as shown in Figure 9.
Figure 9 shows class activation maps for four types of bird sound: (a) Green-winged Teal, (b) Eurasian Oystercatcher, (c) Red-throated Diver, and (d) Wood Sandpiper. In these class activation maps, the horizontal axis represents the time step and the vertical axis the activation value, i.e., the contribution of the features at that time step to the predicted class. A deeper color indicates that the model considers that time step more significant for category prediction, which shows that our model attends to the informative parts of bird song during training.
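A minimal sketch of this computation is given below; the normalization to [0, 1] for plotting is an assumption, as the text only specifies the dot product between the feature matrix and the classifier weights.

```python
import torch

def class_activation_map(features, classifier_weight, class_idx):
    """Per-time-step class activation: the dot product between the feature vector at each
    time step and the classifier weight row of the chosen class.
    features: (T, D), e.g. the (99, 512) output of the Transformer layer for one clip;
    classifier_weight: (num_classes, D) weight matrix of the final linear classifier."""
    cam = features @ classifier_weight[class_idx]   # (T,) contribution of each time step
    cam = cam - cam.min()
    return cam / (cam.max() + 1e-12)                # normalize to [0, 1] for plotting (assumption)

# Toy usage with random tensors standing in for real model outputs.
cam = class_activation_map(torch.randn(99, 512), torch.randn(20, 512), class_idx=3)
print(cam.shape)  # torch.Size([99])
```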

3.3.3. Generalization Experiment

In this section, we utilized two datasets, Birdsound1 and UrbanSound8K, to test the performance of the model on a bird sound dataset with environmental noise and on another sound classification task. The Birdsound1 dataset was derived from birdclef-2023 and comprises 61,867 segments, including calls of 12 different bird species mixed with environmental noise, and the UrbanSound8K dataset consists of 8732 annotated sound clips across 10 categories. Furthermore, we generated confusion matrices for both datasets.
Notably, our model achieved exceptional results for the Birdsound1 dataset, with an accuracy of 98.77%, a recall of 98.56%, an F1 score of 98.65%, and a precision of 98.74%. The confusion matrix is shown in Figure 10.
Our model simultaneously achieved an accuracy of 99.03%, a recall of 98.55%, an F1 score of 98.67%, and a precision of 98.81% for the UrbanSound8K dataset. Figure 11 presents the confusion matrix, where the classes numbered zero to nine are air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music, respectively. These two datasets comprehensively validate the reliability of our model’s performance.
The generalization performance of the model proposed in this paper has been validated through experiments conducted on the Birdsound1 and UrbanSound8K datasets, demonstrating its applicability to both bird sound recognition and environmental sound classification tasks.

3.4. Comparison

In this section, we conducted a comparative analysis between our model, which uses raw bird audio, and various recognition models utilizing acoustic features on the Birds_data and UrbanSound8K datasets to validate the advantage of using raw bird audio for bird sound recognition tasks. The results for the Birds_data dataset are presented in Table 8. Comparing the different models, it is evident that our model exhibits significant potential. Compared to mainstream bird sound recognition models employing log mel spectrograms, our model achieves an accuracy improvement of about 3%. Among these models, AMResNet holds one of the highest published accuracies for the Birds_data dataset, at 92.6% [23]; our model surpasses it by 6.98%. A hierarchical birdsong feature extraction architecture combining static and dynamic modeling attained an accuracy of 95.19% on the Birds_data dataset; our model outperforms it by 3.39%. The current top-performing model for the Birds_data dataset combines three models and four acoustic features to achieve a maximum accuracy of 98% [24]; nevertheless, our model still outperforms it by 1.58%.
The training speeds of the other models in Table 8 are not reported in their papers; our model trains at 1.38 min per epoch, averaging 5.12 iterations per second.
On the UrbanSound8K dataset, we also compared our model with several of the highest-accuracy models reported. As shown in Table 9, our model surpasses the best of them by 2.01%.
This experiment demonstrates the potential and feasibility of our model in bird sound recognition through comparative experiments conducted on the Birds_data dataset and UrbanSound8K dataset. In comparison to models that rely on acoustic features, our model achieves a high recognition accuracy without the need for acoustic feature extraction.
Additionally, our model was evaluated using five-fold cross-validation, which divides the dataset into five equal parts; each part serves once as the test set while the remaining four are used for training, cycling through five rounds. Under this protocol, the model achieves an average accuracy of 96.52%, a precision of 95.95%, an F1 score of 95.71%, and a recall of 95.72%.
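The five-fold protocol can be sketched as follows; `file_paths` and `labels` are illustrative placeholders, and `train_and_evaluate` is a hypothetical stand-in for the training and evaluation routine of Section 3.1.

```python
import numpy as np
from sklearn.model_selection import KFold

file_paths = np.array([f"clip_{i}.wav" for i in range(14311)])   # illustrative file list
labels = np.random.randint(0, 20, size=len(file_paths))          # illustrative labels

kf = KFold(n_splits=5, shuffle=True, random_state=0)             # five equal parts
fold_accuracies = []
for fold, (train_idx, test_idx) in enumerate(kf.split(file_paths)):
    # acc = train_and_evaluate(file_paths[train_idx], labels[train_idx],
    #                          file_paths[test_idx], labels[test_idx])
    acc = 0.0                                                    # placeholder result
    fold_accuracies.append(acc)
    print(f"fold {fold + 1}: accuracy = {acc:.4f}")
print("mean accuracy:", np.mean(fold_accuracies))
```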

4. Discussion

The experiments demonstrate the superior performance and good accuracy of our model on three datasets. However, there are still limitations in this research. The diversity and quantity of samples in the three datasets we used were limited: the Birds_data dataset includes 20 bird species with a total of 14,311 audio recordings, the Birdsound1 dataset consists of 12 bird species with a total of 61,867 audio recordings, and the UrbanSound8K dataset contains only 10 types of environmental sounds with just 8732 audio recordings. In future studies, we will further test the performance of the model on datasets with more diverse sample categories and larger sample sizes.
Simultaneously, the model can still be optimized. Throughout the training process, a marginal fluctuation of approximately 1% in model accuracy may occur, primarily owing to the stochastic nature of both model initialization and sample selection. In forthcoming investigations, we will strive to identify optimization techniques aimed at enhancing stability. Hu et al. [20] proposed a Sinc spectrogram, obtained by passing the original audio through a one-dimensional convolution, a Sinc filter bank, and finally batch normalization and pooling, which preserves the timbral information of the audio. Kumar et al. [43] proposed nineteen data augmentation methods to enhance model performance, among which the time masking technique aligns particularly well with our model. These methods provide optimization ideas for our model.
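As an illustration of how time masking could be applied to our raw-audio input, the sketch below zeroes out random spans of the waveform during training; the mask length and count are illustrative choices, not values taken from Kumar et al. [43].

```python
import torch

def time_mask(wav, max_mask_len=3200, num_masks=2):
    """Zero out a few randomly placed spans (here up to 0.2 s each at 16 kHz) of a raw
    waveform during training, so the model cannot rely on any single short segment.
    wav: (B, 1, 32000) batch of 2 s clips."""
    wav = wav.clone()
    for _ in range(num_masks):
        mask_len = int(torch.randint(1, max_mask_len + 1, (1,)))
        start = int(torch.randint(0, wav.shape[-1] - mask_len + 1, (1,)))
        wav[..., start:start + mask_len] = 0.0
    return wav

augmented = time_mask(torch.randn(4, 1, 32000))
print(augmented.shape)  # torch.Size([4, 1, 32000])
```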
Currently, our model is limited to recognizing a single bird species per audio clip, whereas the natural environment contains a complex and diverse range of bird species. Bird call recordings often encompass multiple species, posing a significant challenge in identifying all bird species present within an audio segment. Swaminathan et al. have proposed a method to address this issue and have made some progress [30]. In their study, each audio recording is divided into overlapping chunks in order to determine multiple labels from it. By applying transfer learning, features of the audio recordings are automatically extracted and fed into a feed-forward network for classification. The probabilities associated with each audio segment are then aggregated through a clipping approach to represent multiple species of bird calls, and these probability scores are used to determine the presence of the predominant bird species in the recording for multi-label classification. However, their method requires setting a parameter K, the number of bird species, in advance before the audio is input to the model, whereas in real natural environments the number of bird species present in a given audio segment is often unknown and needs to be determined by the model. Additionally, the evaluation of their model was based on a very limited sample size of only 400 instances, and the creation of a dataset for multi-label bird species classification also presents a significant hurdle. Therefore, our future research will focus on enhancing and expanding upon their work.

5. Conclusions

This study proposes a bird sound recognition model based on multiple one-dimensional convolutional neural networks and Transformers, which learns informative feature representations directly from raw audio data of bird sounds. This approach simplifies the process of extracting acoustic features and achieves impressive accuracy by utilizing large convolutional kernels to capture contextual information in the input sequence and small convolutional kernels to capture local features, combining them with Transformers to effectively capture long-distance dependencies. Additionally, we introduce trainable parameters that control the importance of each Transformer layer, thereby enhancing the generalization ability of our proposed model; the model’s performance on the UrbanSound8K dataset confirms this.
Our model has demonstrated exceptional performance across three publicly available datasets. On the Birds_data dataset, we achieved an accuracy of 99.58%, a recall of 99.62%, an F1 score of 99.63%, and a precision of 99.65%. On the Birdsound1 dataset, we achieved an accuracy of 98.77%, a recall of 98.56%, an F1 score of 98.65%, and a precision of 98.74%. On the UrbanSound8K dataset, we achieved an accuracy of 99.03%, a recall of 98.55%, an F1 score of 98.67%, and a precision of 98.81%. The experimental results demonstrate the superiority of our model over other approaches.
In summary, this paper proposes a swift and accurate bird sound recognition method, simplifying the feature extraction process and enhancing recognition accuracy, thus making significant contributions to biodiversity conservation.

Author Contributions

Methodology, D.Y.; software, D.Y.; writing—original draft preparation, D.Y.; writing—review and editing, X.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kovařík, P.; Pechanec, V.; Machar, I.; Harmáček, J.; Grim, T. Are Birds Reliable Indicators of Most Valuable Natural Areas? Evaluation of Special Protection Areas in the Context of Habitat Protection. Ecol. Indic. 2021, 132, 108298. [Google Scholar] [CrossRef]
  2. Lees, A.C.; Haskell, L.; Allinson, T.; Bezeng, S.B.; Burfield, I.J.; Renjifo, L.M.; Rosenberg, K.V.; Viswanathan, A.; Butchart, S.H.M. State of the World’s Birds. Annu. Rev. Environ. Resour. 2022, 47, 231–260. [Google Scholar] [CrossRef]
  3. Marini, M.Â.; Garcia, F.I. Bird Conservation in Brazil. Conserv. Biol. 2005, 19, 665–671. [Google Scholar] [CrossRef]
  4. Lopes, M.T.; Koerich, A.L.; Silla, C.N.; Kaestner, C.A.A. Feature Set Comparison for Automatic Bird Species Identification. In Proceedings of the 2011 IEEE International Conference on Systems, Man, and Cybernetics, Anchorage, AK, USA, 9–12 October 2011; pp. 965–970. [Google Scholar] [CrossRef]
  5. Kaewtip, K.; Tan, L.N.; Alwan, A.; Taylor, C.E. A Robust Automatic Bird Phrase Classifier Using Dynamic Time-Warping with Prominent Region Identification. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 768–772. [Google Scholar] [CrossRef]
  6. Qian, K.; Zhang, Z.; Baird, A.; Schuller, B. Active Learning for Bird Sound Classification via a Kernel-Based Extreme Learning Machine. J. Acoust. Soc. Am. 2017, 142, 1796–1804. [Google Scholar] [CrossRef]
  7. Potamitis, I.; Ntalampiras, S.; Jahn, O.; Riede, K. Automatic Bird Sound Detection in Long Real-Field Recordings: Applications and Tools. Appl. Acoust. 2014, 80, 1–9. [Google Scholar] [CrossRef]
  8. Stastny, J.; Munk, M.; Juranek, L. Automatic Bird Species Recognition Based on Birds' Vocalization. EURASIP J. Audio Speech Music. Process. 2018, 2018, 19. [Google Scholar] [CrossRef]
  9. Stowell, D.; Plumbley, M.D. Automatic Large-Scale Classification of Bird Sounds Is Strongly Improved by Unsupervised Feature Learning. PeerJ 2014, 2, e488. [Google Scholar] [CrossRef]
  10. Shaheen, F.; Verma, B.; Asafuddoula, M. Impact of Automatic Feature Extraction in Deep Learning Architecture. In Proceedings of the 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA), Sydney, Australia, 6–9 December 2016; pp. 1–8. [Google Scholar] [CrossRef]
  11. Zhang, H.; McLoughlin, I.; Song, Y. Robust Sound Event Recognition Using Convolutional Neural Networks. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; pp. 559–563. [Google Scholar] [CrossRef]
  12. Sprengel, E.; Jaggi, M.; Kilcher, Y.; Hofmann, T. Audio Based Bird Species Identification Using Deep Learning Techniques. LifeCLEF 2016, 2016, 547–559. [Google Scholar]
  13. Sankupellay, S.; Konovalov, D. Bird Call Recognition Using Deep Convolutional Neural Network, ResNet-50. In Proceedings of the Acoustics Conference, Adelaide, Australia, 22–26 October 2018; Volume 7, pp. 1–8. [Google Scholar] [CrossRef]
  14. Bold, N.; Zhang, C.; Akashi, T. Cross-Domain Deep Feature Combination for Bird Species Classification with Audio-Visual Data. IEICE Trans. Inf. Syst. 2019, 102, 2033–2042. [Google Scholar] [CrossRef]
  15. Chang, P.C.; Chen, Y.S.; Lee, C.H. MS-SincResNet: Joint Learning of 1D and 2D Kernels Using Multi-Scale SincNet and ResNet for Music Genre Classification. In Proceedings of the 2021 International Conference on Multimedia Retrieval (ICMR ’21), New York, NY, USA, 21–24 June 2021; pp. 29–36. [Google Scholar] [CrossRef]
  16. Yan, N.; Chen, A.; Zhou, G.; Zhang, Z.; Liu, X.; Wang, J.; Liu, Z.; Chen, W. Birdsong Classification Based on Multi-Feature Fusion. Multimed. Tools Appl. 2021, 80, 36529–36547. [Google Scholar] [CrossRef]
  17. Yao, W.; Lv, D.; Zi, J.; Huang, X.; Zhang, Y.; Liu, J. Crane Song Recognition Based on the Features Fusion of GMM Based on Wavelet Spectrum and MFCC. In Proceedings of the 2021 7th International Conference on Computer and Communications (ICCC), Chengdu, China, 10–13 December 2021; pp. 501–508. [Google Scholar] [CrossRef]
  18. Liu, Z.; Chen, W.; Chen, A.; Zhou, G.; Yi, J. Birdsong Classification Based on Multi-Feature Channel Fusion. Multimed. Tools Appl. 2022, 81, 15469–15490. [Google Scholar] [CrossRef]
  19. Murugaiya, R.; Abas, P.E.; De Silva, L.C. Probability Enhanced Entropy (PEE) Novel Feature for Improved Bird Sound Classification. Mach. Intell. Res. 2022, 19, 52–62. [Google Scholar] [CrossRef]
  20. Hu, S.; Chu, Y.; Wen, Z.; Zhou, G.; Sun, Y.; Chen, A. Deep Learning Bird Song Recognition Based on MFF-ScSEnet. Ecol. Indic. 2023, 154, 110844. [Google Scholar] [CrossRef]
  21. Puget, J.F. STFT Transformers for Bird Song Recognition. In Proceedings of the Conference and Labs of the Evaluation Forum, Bucharest, Romania, 21–24 September 2021; pp. 1609–1616. [Google Scholar]
  22. Tang, Q.; Xu, L.; Zheng, B.; He, C. Transound: Hyper-Head Attention Transformer for Birds Sound Recognition. Ecol. Inform. 2023, 75, 102001. [Google Scholar] [CrossRef]
  23. Xiao, H.; Liu, D.; Chen, K.; Zhu, M. AMResNet: An Automatic Recognition Model of Bird Sounds in Real Environment. Appl. Acoust. 2022, 201, 109121. [Google Scholar] [CrossRef]
  24. Zhang, S.; Gao, Y.; Cai, J.; Yang, H.; Zhao, Q.; Pan, F. A Novel Bird Sound Recognition Method Based on Multifeature Fusion and a Transformer Encoder. Sensors 2023, 23, 8099. [Google Scholar] [CrossRef]
  25. Sanchez, F.J.B.; Hossain, M.R.; English, N.B.; Moore, S.T. Bioacoustic Classification of Avian Calls from Raw Sound Waveforms with an Open-Source Deep Learning Architecture. Sci. Rep. 2021, 11, 15733. [Google Scholar] [CrossRef]
  26. Rauch, L.; Schwinger, R.; Wirth, M.; Sick, B.; Tomforde, S.; Scholz, C. Active Bird2Vec: Towards End-to-End Bird Sound Monitoring with Transformers. arXiv 2023, arXiv:2308.07121. [Google Scholar] [CrossRef]
  27. Gazneli, A.; Zimerman, G.; Ridnik, T.; Sharir, G.; Noy, A. End-to-End Audio Strikes Back: Boosting Augmentations Towards an Efficient Audio Classification Network. arXiv 2022, arXiv:2204.11479. [Google Scholar] [CrossRef]
  28. Lopez-Meyer, P.; del Hoyo Ontiveros, J.A.; Lu, H.; Stemmer, G. Efficient End-to-End Audio Embeddings Generation for Audio Classification on Target Applications. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 601–605. [Google Scholar] [CrossRef]
  29. Dukler, Y.; Gu, Q.; Montufar, G. Optimization Theory for ReLU Neural Networks Trained with Normalization Layers. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 12–18 July 2020; pp. 2751–2760. [Google Scholar]
  30. Swaminathan, B.; Jagadeesh, M.; Vairavasundaram, S. Multi-Label Classification for Acoustic Bird Species Detection Using Transfer Learning Approach. Ecol. Inform. 2024, 80, 102471. [Google Scholar] [CrossRef]
  31. Targ, S.; Almeida, D.; Lyman, K. ResNet in ResNet: Generalizing Residual Architectures. arXiv 2016, arXiv:1603.08029. [Google Scholar] [CrossRef]
  32. Lee, M. Mathematical Analysis and Performance Evaluation of the GELU Activation Function in Deep Learning. J. Math. 2023, 2023, 4229924. [Google Scholar] [CrossRef]
  33. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  34. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 10–15 June 2019; pp. 6105–6114. [Google Scholar]
  35. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  36. Huang, G.; Liu, Z.; Maaten, L.V.D.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar] [CrossRef]
  37. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. Lightgbm: A highly efficient gradient boosting decision tree. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  38. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  39. Kahl, S.; Wood, C.M.; Eibl, M.; Klinck, H. BirdNET: A deep learning solution for avian diversity monitoring. Ecol. Inform. 2021, 61, 101236. [Google Scholar] [CrossRef]
  40. Wang, Y.; Chen, A.; Li, H.; Zhou, G.; Yi, J.; Zhang, Z. A Hierarchical Birdsong Feature Extraction Architecture Combining Static and Dynamic Modeling. Ecol. Indic. 2023, 150, 110258. [Google Scholar] [CrossRef]
  41. Gong, Y.; Chung, Y.A.; Glass, J. AST: Audio Spectrogram Transformer. In Proceedings of the Interspeech 2021, Virtual, 30 August–3 September 2021. [Google Scholar]
  42. Mushtaq, Z.; Su, S.F. Environmental Sound Classification Using a Regularized Deep Convolutional Neural Network with Data Augmentation. Appl. Acoust. 2020, 167, 107389. [Google Scholar] [CrossRef]
  43. Kumar, A.S.; Schlosser, T.; Kahl, S.; Kowerko, D. Improving Learning-Based Birdsong Classification by Utilizing Combined Audio Augmentation Strategies. Ecol. Inform. 2024, 82, 102699. [Google Scholar] [CrossRef]
Figure 1. Introduction to the Birds_data dataset.
Figure 2. Flowchart of birdsong recognition.
Figure 3. Structure of the Transformer layer.
Figure 4. The process of processing the input tensor by using multi-head attention.
Figure 5. Function images of Relu and Gelu.
Figure 6. Accuracy and loss curves during training and testing on Birds_data. (a) Accuracy and loss curves during training. (b) Accuracy and loss curves during testing.
Figure 7. Confusion matrix of Birds_data.
Figure 8. The mean average precision curve for Birds_data.
Figure 9. Class activation map for four types of birdsound from the Birds_data dataset. (a) Green-winged Teal. (b) Eurasian Oystercatcher. (c) Red-throated Diver. (d) Wood Sandpiper.
Figure 10. Confusion matrix of Birdsound1.
Figure 11. Confusion matrix of UrbanSound8K.
Table 1. Dataset.

Dataset | Category | Training (80%) | Testing (20%)
Birds_data | 20 | 11,449 | 2862
Birdsound1 | 12 | 49,493 | 12,373
UrbanSound8K | 10 | 6985 | 1746
Table 2. Feature extraction layer configuration.

Convolutional Layer | Kernel Size | Stride | Dim
CNN0 | 10 | 5 | 256
CNN1 | 3 | 2 | 256
CNN2 | 3 | 2 | 256
CNN3 | 3 | 2 | 256
CNN4 | 3 | 2 | 256
CNN5 | 2 | 2 | 256
CNN6 | 2 | 2 | 256
Pos_conv1d | 128 | 1 | 512
Table 3. Settings.

Designation | Parameters
CPU | 24 vCPU AMD EPYC 7642 48-Core Processor
GPU | NVIDIA RTX 3090 (24 GB)
Memory | 80 GB
System platform | Windows 10
Software environment | Python 3.9.12, Pytorch 1.12.0
Table 4. The influence of the number of Transformer layers on seven convolutional layers.

Transformer | Accuracy (%) | Params (M)
0 | 98.04 | 0.2
1 | 98.81 | 3.4
2 | 98.82 | 6.6
3 | 99.29 | 9.7
4 | 99.15 | 12.9
5 | 99.33 | 16.0
6 | 99.54 | 19.2
7 | 99.65 | 22.3
8 | 99.35 | 25.5
9 | 98.81 | 28.7
10 | 98.27 | 31.9
Table 5. The impact of the number of convolutional layers on a seven-layer Transformer.

Number of Convolutional Layers | Accuracy (%)
7 | 99.79
6 | 99.04
5 | 97.47
4 | 94.73
Table 6. The influence of the Transformer layer on a four-layer convolutional network.

Number of Transformer Layers | Accuracy (%)
0 | 83.52
1 | 94.42
2 | 97.19
3 | 97.91
4 | 97.63
5 | 96.97
6 | 96.46
7 | 94.73
Table 7. The performance of the model for each bird species in Birds_data.

Classes | Recall (%) | Precision (%) | F1 (%)
Greylag Goose | 99.3 | 100 | 99.6
Mute Swan | 99.3 | 99.3 | 99.3
Green-winged Teal | 99.4 | 100 | 99.7
Green-winged Duck | 99.2 | 99.2 | 99.2
Grey Partridge | 100 | 100 | 100
Common Quail | 100 | 99.3 | 99.6
Common Pheasant | 99.4 | 99.4 | 99.4
Red-throated Diver | 100 | 98.3 | 99.1
Grey Heron | 100 | 100 | 100
Great Cormorant | 100 | 100 | 100
Eurasian Buzzard | 98.6 | 99.3 | 98.9
Eurasian Hobby | 100 | 98.0 | 99.0
Western Capercaillie | 99.3 | 100 | 99.6
Crested Tit | 100 | 97.8 | 98.9
Black-winged Stilt | 100 | 99.3 | 99.6
Eurasian Oystercatcher | 100 | 99.5 | 99.7
Common Greenshank | 98.4 | 100 | 99.2
Redshank | 98.7 | 100 | 99.3
Wood Sandpiper | 100 | 99.4 | 99.7
House Sparrow | 99.2 | 100 | 99.6
Table 8. Comparison with the results of other models using Birds_data.

Model | Input | Accuracy (%) | Recall (%) | F1 Score (%) | Precision (%)
ResNet50 + SM [33] | Log-mel | 94.5 | 94.1 | 91 | 94.4
EfficientNetB3 + SM [34] | Log-mel | 95.2 | 91.5 | 92.6 | 92.8
VGG16 + SM [35] | Log-mel | 94.2 | 92.1 | 90.8 | 92.8
DenseNet121 + SM [36] | Log-mel | 93.5 | 93.5 | 90.7 | 91.7
EfficientNetB3 + ResNet50 + SM [24] | Log-mel | 96 | 93.6 | 94.1 | 94.8
EfficientNetB3 + ResNet50 + LG [24] | Log-mel | 97.1 | 96 | 93.9 | 93.8
EfficientNetB3 + ResNet50 + SM [24] | MFCC + Chroma + Tonnetz | 85.6 | 85.5 | 83.9 | 85.0
LightGBM [37] | MFCC + Chroma + Tonnetz | 88.5 | 86.3 | 87.6 | 88.6
Transformer encoder + LG [38] | Chroma + Tonnetz | 81.6 | 82.3 | 82.9 | 82.9
Transformer encoder + LG [38] | MFCC + Chroma + Tonnetz | 89.4 | 89.2 | 88.5 | 89.6
Transformer encoder + LG [38] | MFCC + Chroma + Tonnetz + Spectral contrast spectrogram | 88.5 | 89 | 87.9 | 89.3
BirdNET [39] | MFCC + Chroma + Tonnetz + Spectral contrast spectrogram | 86.7 | 87.9 | 86.3 | 86.5
AMResNet [23] | log-mel + Spectral contrast + Chroma + Tonnetz | 88.7 | 88.0 | 88.5 | 88.1
EfficientNetB3 + ResNet50 + Transformer encoder + LG [24] | log-mel + MFCC + Chroma + Tonnetz | 98 | 96.1 | 96.9 | 97.8
Model of this article | Raw audio | 99.58 | 99.62 | 99.63 | 99.65
Table 9. Comparison with the results of other models using UrbanSound8K.

Model | Accuracy (%)
CMS-H [40] | 97.02
Audio Spectrogram Transformer (AST) [41] | 94.79
EfficientNet-B7 [40] | 96.74
DCNN [42] | 95.3
Model presented in this study | 99.03