Article

Polyphonic Sound Event Detection Using Temporal-Frequency Attention and Feature Space Attention

Ye Jin, Mei Wang, Liyan Luo, Dinghao Zhao and Zhanqi Liu
1 Ministry of Education Key Laboratory of Cognitive Radio and Information Processing, Guilin 541006, China
2 School of Information and Communication, Guilin University of Electronic Technology, Guilin 541006, China
3 School of Information Science & Engineering, Guilin University of Technology, Guilin 541006, China
* Author to whom correspondence should be addressed.
Sensors 2022, 22(18), 6818; https://doi.org/10.3390/s22186818
Submission received: 8 August 2022 / Revised: 1 September 2022 / Accepted: 3 September 2022 / Published: 9 September 2022
(This article belongs to the Section Physical Sensors)

Abstract

The complexity of polyphonic sounds imposes numerous challenges on their classification. In particular, polyphonic sound events in real life exhibit discontinuity and unstable time-frequency variations. A single traditional acoustic feature cannot characterize the key feature information of a polyphonic sound event, and this deficiency results in poor model classification performance. In this paper, we propose a convolutional recurrent neural network model based on a temporal-frequency (TF) attention mechanism and a feature space (FS) attention mechanism (TFFS-CRNN). The TFFS-CRNN model aggregates Log-Mel spectrograms and MFCCs as inputs and contains the TF-attention module, the convolutional recurrent neural network (CRNN) module, the FS-attention module and the bidirectional gated recurrent unit (BGRU) module. In polyphonic sound event detection (SED), the TF-attention module captures the critical temporal-frequency features more effectively, while the FS-attention module assigns dynamically learnable weights to different feature dimensions. The TFFS-CRNN model thus improves the characterization of key feature information in polyphonic SED. With the two attention modules, the model can focus on semantically relevant time frames, key frequency bands and important feature spaces. Finally, the BGRU module learns contextual information. Experiments were conducted on the DCASE 2016 Task3 dataset and the DCASE 2017 Task3 dataset. The results show that the F1-score of the TFFS-CRNN model improved by 12.4% and 25.2% compared with the winning systems of the DCASE challenges, while the ER was reduced by 0.41 and 0.37, respectively. The proposed TFFS-CRNN model therefore achieves better classification performance and a lower ER in polyphonic SED.

1. Introduction

Polyphonic sound event detection (SED) has attracted increasing research attention in recent years and has motivated numerous challenges [1,2]; it is mainly used for acoustic event classification and the detection of event times. In real environments, multiple audio events may occur simultaneously, so polyphonic SED has important practical application value and theoretical significance [3]. In recent years, polyphonic SED has been applied to smart city traffic systems [4], equipment failure monitoring [5], smart home devices [6,7], telemedicine [8] and wildlife monitoring [9]. Environmental polyphonic sounds do not show regular temporal patterns, such as phonemes in speech or rhythm in music, so it is difficult to accurately capture the characteristic information of time frames. In addition, the complexity and variability of polyphonic sound events in real scenes make them even more difficult to detect.
To solve the above problems, various methods have been applied to polyphonic SED. In all polyphonic SED models, the extraction of acoustic features with strong characterization ability and effective classification algorithms are the keys to improving overall classification performance. The main features include linear predictive coding (LPC) [3], linear predictive cepstral coefficients (LPCC), the discrete wavelet transform (DWT), Mel frequency cepstral coefficients (MFCCs) [3] and Log-Mel spectrograms (Log-Mel). Traditional classifiers include support vector machines (SVMs) [10], Gaussian mixture models (GMMs) [11], hidden Markov models (HMMs) [12], multi-layer perceptrons (MLPs) [13], and so on. However, these traditional models are only applicable to single acoustic events and small datasets [14]; as dataset scale and audio complexity increase, they can no longer meet the classification requirements of the system. With the development of machine learning, neural-network classification models have far surpassed traditional classifiers, including feedforward neural networks (FNNs), recurrent neural networks (RNNs) [15], convolutional neural networks (CNNs) [16] and convolutional recurrent neural networks (CRNNs) [17,18,19,20]. In recent years, most participants in the DCASE challenge have used deep learning classification models for SED [1]. CRNN not only has the powerful ability of CNN to capture time-frequency features and process multi-dimensional feature information [21,22,23,24], but also has the advantages of RNN for sequence modeling. Therefore, CRNN is very suitable for polyphonic SED tasks [24]. In particular, an RNN with a BGRU module provides better access to contextual information [25], which allows the onset and offset times of each sound event to be predicted more accurately.
In recent years, machine learning models based on attention mechanisms have been widely adopted in image recognition [26], machine translation [27], text classification [28] and speech recognition [29,30]. In addition, the relevant literature [31,32] has shown that networks based on attention mechanisms can further improve the classification performance of SED, including channel attention [33], spatial attention [34,35], temporal attention [36] and frequency attention [37]. Hu et al. demonstrated the superiority of the channel attention mechanism [38]. Shen et al. presented a TF-attention mechanism for SED [37]. Li et al. focused on temporal attention by calculating weights over the spectrograms [39]. These studies of attention mechanisms focused only on differences across time steps and ignored the importance of different frequency bands and different feature dimensions. Although neural-network classification models have been widely applied in the acoustics field, the following challenges remain in the detection of environmental polyphonic sounds: (1) the temporal-frequency structure of polyphonic sounds is very complicated, and may be continuous (rain), abrupt (gunshots) or periodic (clock tick); (2) the discontinuity and uncertain duration of polyphonic sounds affect model classification performance; (3) the polyphonic SED model has larger datasets, more parameters and more feature space dimensions.
To address these challenges effectively, this paper proposes a model for polyphonic SED named TFFS-CRNN. The TFFS-CRNN model contains four modules: a dual-input TF-attention module, a CNN module, an FS-attention module and an RNN with a BGRU module. Our model innovatively combines the dual-input TF-attention mechanism and the FS-attention mechanism, filtering out unimportant frequency-band information and weighting the important feature dimensions to characterize the key feature information of the polyphonic sound event. The TFFS-CRNN model can thus capture key temporal-frequency features and spatial features of sound events. In the experiments, the performance of the TFFS-CRNN model was evaluated on the public DCASE 2016 Task3 and DCASE 2017 Task3 datasets and compared with the winning systems of the DCASE challenges, which demonstrates the superiority of the TFFS-CRNN model in polyphonic SED. Our contributions are as follows: (1) the dual-input TF-attention mechanism is introduced for polyphonic sound events with complex time-frequency structures; (2) in the CRNN module, multi-dimensional higher-order features are extracted to address the discontinuity of polyphonic sounds; (3) to address feature dimension redundancy, the FS-attention mechanism is introduced to weight the important dimensions.
The rest of the paper is arranged as follows: Section 2 describes the TFFS-CRNN model and its rationale in detail; Section 3 presents the datasets, evaluation metrics and experimental configurations; Section 4 reports and discusses the experimental results; conclusions are drawn in Section 5.

2. Methods

The overall structure of the proposed TFFS-CRNN model is shown in Figure 1 and includes five main components: the dual-input TF-attention module, the CNN module, the FS-attention module, the RNN module and the fully connected (FC) output layer. First, the MFCCs and Log-Mel spectrograms are extracted as inputs to the dual-input TF-attention module and integrated into the TF-attention spectrograms of the two features. The dual-input TF-attention features not only enhance the representation capability of the features for polyphonic sounds, but also highlight important time-frame information and key frequency-band information. The CNN is responsible for extracting multi-dimensional higher-order features from the TF-attention features of the MFCCs and Log-Mel spectrograms. The FS-attention module dynamically learns the importance of each dimension of the multi-dimensional high-order features and gives different attention to different dimensions, so it can extract important feature space information and ignore unimportant dimensions. The RNN module uses a 32-unit BGRU to gain contextual information and accurately predict the onset and offset times of polyphonic sounds. Finally, the output features of the BGRU are fed into the FC layer to obtain the classification results of the TFFS-CRNN model. In this section, the whole network model is described in detail.
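For concreteness, a minimal Keras sketch of the CRNN backbone described above is given below; the two attention modules are omitted, the layer sizes follow Table 2, and the remaining details (padding, dropout placement, optimizer settings) are assumptions consistent with Section 3.3 rather than the authors' exact implementation.

```python
# Minimal sketch of the CRNN backbone (TF- and FS-attention modules omitted).
# Layer sizes follow Table 2; padding and dropout placement are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

T, F, N_CLASSES = 256, 40, 6              # time frames, mel bands, event classes

inp = layers.Input(shape=(T, F))
x = layers.Reshape((T, F, 1))(inp)
for pool in (5, 2, 2):                    # three conv blocks; pooling on frequency only
    x = layers.Conv2D(128, (3, 3), padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(pool_size=(1, pool))(x)
    x = layers.Dropout(0.25)(x)
x = layers.Reshape((T, -1))(x)            # (T, channels * remaining frequency bins)
x = layers.Bidirectional(layers.GRU(32, return_sequences=True))(x)
x = layers.Bidirectional(layers.GRU(32, return_sequences=True))(x)
out = layers.TimeDistributed(layers.Dense(N_CLASSES, activation="sigmoid"))(x)

model = models.Model(inp, out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy")
```

With an input of shape (256, 40), the three pooling stages reduce the frequency axis to 2 bins, so the per-frame vector entering the BGRU has 128 × 2 = 256 values, matching the Reshape row of Table 2.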

2.1. Dual-Input Temporal-Frequency Attention

This sub-section focuses on the dual-input TF-attention algorithm model shown in Figure 2. The polyphonic SED model introduces a TF-attention mechanism that assigns larger weight values to the critical time frames and critical frequency bands, while the attention weights of less informative time frames and irrelevant bands are decreased. The Log-Mel features X_{Log-Mel} ∈ R^{F×T} and the MFCCs features X_{MFCCs} ∈ R^{F×T} are chosen as inputs to the dual-input TF-attention model. The harmonic spectrograms of the two features visually and clearly display the frequency-band information of polyphonic sounds, while the vertical structure of the percussive spectrograms highlights the difference between time frames and noise. Therefore, the TF-attention algorithm uses the harmonic-percussive source separation (HPSS) algorithm [32], through which the harmonic and percussive spectrograms of the Log-Mel and MFCCs features are obtained. The harmonic spectrograms are input to the frequency attention network (F-Attention-Net) and the percussive spectrograms are fed into the temporal attention network (T-Attention-Net). The specific structure is shown in Figure 2. First, the Log-Mel spectrogram X_{Log-Mel} ∈ R^{F×T} or the MFCCs X_{MFCCs} ∈ R^{F×T} is normalized. We then perform convolution operations on the harmonic and percussive spectrograms to extract high-order features, with convolution kernel sizes of (1 × 3) and (2 × 1), respectively. A couple of convolution operations reduce the time dimension of the harmonic spectrograms to 1, and the same operation is performed on the frequency dimension of the percussive spectrograms. After that, a (1 × 1) channel convolution kernel is used for compression. In this way, we obtain I^F ∈ R^{F×1} with size (F, 1) and I^T ∈ R^{1×T} with size (1, T). Finally, we normalize I^F and I^T with the SoftMax function, thus obtaining the frequency weight F^w ∈ R^{F×1} and the temporal weight T^w ∈ R^{1×T}.
F_{Logmel}^{w}(f) = \frac{\exp\left(I_{Logmel}^{F}(f,1)\right)}{\sum_{i=1}^{F} \exp\left(I_{Logmel}^{F}(i,1)\right)} \quad (1)

F_{MFCCs}^{w}(f) = \frac{\exp\left(I_{MFCCs}^{F}(f,1)\right)}{\sum_{i=1}^{F} \exp\left(I_{MFCCs}^{F}(i,1)\right)} \quad (2)

T_{Logmel}^{w}(t) = \frac{\exp\left(I_{Logmel}^{T}(1,t)\right)}{\sum_{j=1}^{T} \exp\left(I_{Logmel}^{T}(1,j)\right)} \quad (3)

T_{MFCCs}^{w}(t) = \frac{\exp\left(I_{MFCCs}^{T}(1,t)\right)}{\sum_{j=1}^{T} \exp\left(I_{MFCCs}^{T}(1,j)\right)} \quad (4)
where 1 ≤ t ≤ T and 1 ≤ f ≤ F in Equations (1)–(8). Next, the spectrogram X_{Log-Mel} ∈ R^{F×T} or X_{MFCCs} ∈ R^{F×T} is point-multiplied with the attention weight matrices along the frequency and temporal directions, respectively, to produce the frequency attention spectrogram A^F ∈ R^{F×T} and the temporal attention spectrogram A^T ∈ R^{F×T} with the following expressions.
A_{Logmel}^{F}(f) = X_{Logmel}(F,t) \circ F_{Logmel}^{w} \quad (5)

A_{MFCCs}^{F}(f) = X_{MFCCs}(F,t) \circ F_{MFCCs}^{w} \quad (6)

A_{Logmel}^{T}(t) = X_{Logmel}(f,T) \circ T_{Logmel}^{w} \quad (7)

A_{MFCCs}^{T}(t) = X_{MFCCs}(f,T) \circ T_{MFCCs}^{w} \quad (8)
As the time and frequency domains of spectrograms contain temporal and frequency feature information, respectively, spectrograms are quite different from images in the image classification domain. The TF-attention mechanism assigns different weights to time frames and frequency bands, so both temporal and frequency features can be enhanced, improving the ability of the features to characterize the key information of the polyphonic sound event. In general, the two attention mechanisms can be combined either in parallel or by concatenation; however, simple concatenation of the two attention mechanisms leads to poor performance. In this paper, three fusion strategies are designed: average combination, weighted combination and channel combination. In the experiments in Section 4.2, the weighted combination strategy performs best, so it was chosen to merge the temporal and frequency attention mechanisms into a unified model. During training, two learnable parameters α and β with α + β = 1 are set up. Finally, the TF-attention spectrogram A^{T&F} is computed as follows:
A_{Logmel}^{T\&F} = \alpha A_{Logmel}^{T} + \beta A_{Logmel}^{F} \quad (9)

A_{MFCCs}^{T\&F} = \alpha A_{MFCCs}^{T} + \beta A_{MFCCs}^{F} \quad (10)
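For illustration only, the following NumPy sketch traces Equations (1)–(10) for a single input feature; the trained convolutional attention networks (F-Attention-Net and T-Attention-Net) are replaced here by simple averages over time and frequency, which is an assumption made purely to keep the example short.

```python
# Sketch of the TF-attention computation (Eqs. (1)-(10)) for one spectrogram.
# The trained attention networks are approximated by plain averages here.
import numpy as np
import librosa

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def tf_attention(X, alpha=0.5, beta=0.5):
    """X: Log-Mel or MFCC spectrogram of shape (F, T); alpha + beta = 1."""
    H, P = librosa.decompose.hpss(X)        # harmonic / percussive separation (HPSS)
    F_w = softmax(H.mean(axis=1))           # frequency weights over F, cf. Eqs. (1)-(2)
    T_w = softmax(P.mean(axis=0))           # temporal weights over T, cf. Eqs. (3)-(4)
    A_F = X * F_w[:, None]                  # frequency attention spectrogram, Eqs. (5)-(6)
    A_T = X * T_w[None, :]                  # temporal attention spectrogram, Eqs. (7)-(8)
    return alpha * A_T + beta * A_F         # TF-attention spectrogram, Eqs. (9)-(10)
```

In the full model, the frequency and temporal weights are produced by the small convolutional networks of Figure 2, and α and β are learned jointly with the rest of the network.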
Log-Mel-TF-attention and MFCCs-TF-attention fusion strategies: The first strategy is called average combination. The obtained Log-Mel-T&F-Attention and MFCCs-T&F-Attention are fused in a 1:1 ratio to produce the final temporal-frequency attention spectrogram LM-T&F-Average. The detailed process is as follows:
A_{LM}^{T\&F\text{-}Average} = A_{Logmel}^{T\&F} + A_{MFCCs}^{T\&F} \quad (11)
The second strategy is called weighted combination. During training, two learnable parameters λ and μ with λ + μ = 1 are set up, and the L-TF-attention and M-TF-attention spectrograms are fused according to the ratio of these parameters. This yields the TF-attention spectrogram LM-T&F-Weight that fuses the Log-Mel and MFCCs features. The detailed process is as follows:
A_{LM}^{T\&F\text{-}Weight} = \lambda A_{Logmel}^{T\&F} + \mu A_{MFCCs}^{T\&F} \quad (12)
The third strategy is called channel combination. The two obtained attention spectrograms, L-TF-attention and M-TF-attention, are concatenated as a two-channel output. The process is as follows:
A_{LM}^{T\&F\text{-}Channel} = \mathrm{joint}\left(A_{Logmel}^{T\&F}; A_{MFCCs}^{T\&F}\right) \quad (13)
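A minimal sketch of the three strategies in Equations (11)–(13) is shown below, assuming the two TF-attention spectrograms share the same shape; in the actual model, λ and μ are trainable parameters rather than fixed constants.

```python
# Sketch of the three fusion strategies of Eqs. (11)-(13).
import numpy as np

def fuse_tf_attention(A_logmel, A_mfccs, strategy="weight", lam=0.5, mu=0.5):
    if strategy == "average":                    # Eq. (11): 1:1 combination
        return A_logmel + A_mfccs
    if strategy == "weight":                     # Eq. (12): weighted, lam + mu = 1
        return lam * A_logmel + mu * A_mfccs
    if strategy == "channel":                    # Eq. (13): stack as two channels
        return np.stack([A_logmel, A_mfccs], axis=-1)
    raise ValueError(f"unknown strategy: {strategy}")
```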

2.2. Feature Space Attention

This sub-section focuses on the FS-attention algorithm model shown in Figure 3. In a plain neural network model, all dimensions of the feature spectrogram are treated equally, which can drown out important feature dimensions that contain critical information. The TFFS-CRNN model therefore uses the FS-attention module to identify the important feature dimensions. Each dimension of the multi-dimensional high-order features is assigned a different learnable weight to obtain the space attention features F_{Attention} ∈ R^{T×K×F}.
As shown in Figure 3, the multi-dimensional high-order features F_{LM} ∈ R^{K×T×F} are obtained from the CNN module, where K is the number of channels, F the number of frequency bins and T the number of time frames. The high-order features F_{LM} ∈ R^{K×T×F} are then input to the FS-attention model, which contains a SoftMax activation layer and an FC feedforward layer to calculate the importance weight of each feature dimension of F_{LM}. The importance weight I is the output of the SoftMax layer and is assigned to the different feature dimensions. First, F_{LM} ∈ R^{K×T×F} is permuted into a 3-dimensional tensor F'_{LM}.
F'_{LM} = R_{K \times T \times F \to T \times K \times F}\left(F_{LM}\right) \quad (14)
where R_{K×T×F→T×K×F} denotes that F_{LM} with dimensions K × T × F becomes F'_{LM} with dimensions T × K × F. Subsequently, F'_{LM} is flattened into a 2-dimensional tensor F''_{LM} by fixing the time dimension T, as shown in the following:
F''_{LM} = R_{T \times K \times F \to T \times KF}\left(F'_{LM}\right) \quad (15)
Next, F''_{LM} is input to the feedforward layer. In the FC feedforward layer, the number of hidden units is set to KF. The dimension of the weight vector I is M = KF, which is expressed as follows:
I = \left\{ I_1, I_2, \ldots, I_d, \ldots, I_M \right\} \quad (16)
where I_m weights the m-th dimensional feature of F''_{LM}; the expression for I_m is:
I_m = \frac{\exp(O_m)}{\sum_{j=1}^{M} \exp(O_j)} \quad (17)
The dimension of F''_{LM} is M, and O_j denotes the j-th dimensional output of the feedforward layer, which feeds the SoftMax activation. The importance weight I is repeated T times, so that its dimension becomes T × K × F. The FS-attention weight I ∈ R^{T×K×F} can be expressed as:
I = R_{T \times KF \to T \times K \times F}\left(I\right) \quad (18)
where R_{T×KF→T×K×F} denotes that the feature space importance weight I is reshaped from T × KF to T × K × F. The output of the FS-attention module can be expressed as:
F_{Attention} = I \circ F'_{LM} \quad (19)
where "∘" represents the Hadamard product. The output F_{Attention} of the FS-attention module is then fed into the RNN with the BGRU module.
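The steps above can be summarised in a short NumPy sketch; the trainable FC feedforward layer is represented by an explicit weight matrix W and bias b, and averaging the per-frame feedforward outputs before the SoftMax is an assumption made here to obtain a single importance vector that is then repeated T times.

```python
# Sketch of the FS-attention module (Eqs. (14)-(19)).
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def fs_attention(F_LM, W, b):
    """F_LM: CNN features (K, T, F); W: (K*F, K*F) weights; b: (K*F,) bias."""
    K, T, F = F_LM.shape
    F_perm = np.transpose(F_LM, (1, 0, 2))     # Eq. (14): permute to (T, K, F)
    F_flat = F_perm.reshape(T, K * F)          # Eq. (15): flatten to (T, K*F)
    O = F_flat @ W + b                         # FC feedforward layer, K*F hidden units
    I = softmax(O.mean(axis=0))                # Eqs. (16)-(17): importance weights, M = K*F
    I = np.tile(I, (T, 1)).reshape(T, K, F)    # Eq. (18): repeat T times and reshape
    return I * F_perm                          # Eq. (19): Hadamard product
```

For example, with K = 128 channels and F = 2 remaining frequency bins, the importance vector has M = 256 entries, which matches the feedforward layer size listed in Table 2.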

2.3. BGRU

In the RNN module, BGRU is used to learn contextual information from the higher-order attention features, which allows the onset and offset times of each sound event to be predicted more accurately. As shown in Figure 4, the BGRU module consists of two gated recurrent units (GRUs) that process the sequence in opposite directions, so the correlation between preceding and following information can be fully exploited and information from both past and future directions can be introduced. The GRU merges the forget gate and the input gate into a single update gate (denoted z_t), so a GRU unit contains only the update gate and the reset gate. The update gate controls how much of the memory from the previous moment h_{t−1} is retained at the current moment, while the reset gate determines how much of the previous memory is combined with the new input to form the new memory content. A single-layer GRU and the output vector y are computed as:
z_t = \sigma_g\left(W_z x_t + U_z h_{t-1} + b_z\right) \quad (20)

r_t = \sigma_g\left(W_r x_t + U_r h_{t-1} + b_r\right) \quad (21)

\tilde{h}_t = \tanh\left(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h\right) \quad (22)

h_t = (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t \quad (23)

y = \sigma_g\left(W_0 h_t + b_0\right) \quad (24)
where z_t is the output of the update gate, r_t is the output of the reset gate, y ∈ [0, 1]^N and N is the number of sound event classes. The symbol ∘ denotes element-wise (Hadamard) multiplication. W_* and U_* in Equations (20)–(24) denote weight matrices, b_* represents a bias term, and σ_g is the sigmoid activation function. Each dimension of y_t gives the probability of a certain sound event being active at time t. If the hidden layer of the GRU has dimension d, then r_t, z_t, h_t, \tilde{h}_t ∈ R^d. Using the BGRU module further enhances the performance of the TFFS-CRNN model. The binary cross-entropy loss (BCE loss) of the BGRU is expressed as:
loss(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ \hat{y}_i \log(y_i) + (1 - \hat{y}_i) \log(1 - y_i) \right] \quad (25)
where N represents the number of sound events and \hat{y} ∈ {0, 1}^N is the binary indicator of the sound events.
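The following NumPy sketch implements one GRU step from Equations (20)–(23) together with the BCE loss of Equation (25); weight matrices are passed in as arguments and their initialisation is omitted.

```python
# One GRU step (Eqs. (20)-(23)) and the frame-level BCE loss (Eq. (25)).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev + bz)               # update gate, Eq. (20)
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev + br)               # reset gate, Eq. (21)
    h_cand = np.tanh(Wh @ x_t + Uh @ (r_t * h_prev) + bh)    # candidate state, Eq. (22)
    return (1.0 - z_t) * h_prev + z_t * h_cand               # new hidden state, Eq. (23)

def bce_loss(y_pred, y_true, eps=1e-7):
    """y_pred: predicted probabilities; y_true: binary event indicators."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
```

In the BGRU, a second GRU processes the reversed sequence and its hidden states are combined with those of the forward GRU at every time step before the FC output layer.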

3. Experiment

This section describes the experimental datasets, evaluation metrics and experimental configurations for polyphonic SED [40,41,42]. Experiments are conducted on publicly available datasets to validate the effectiveness of the model, and the results of the proposed method are compared with those of existing methods.

3.1. Environmental Sound Datasets

In this study, the DCASE 2016 Task3 and DCASE 2017 Task3 datasets [42,43] were used to evaluate the proposed TFFS-CRNN model, and the experimental results were compared with the winning systems of the corresponding challenges. The datasets are described below.
Both datasets contain environmental sounds of daily life, divided into indoor and outdoor scenes. The audio of the DCASE 2016 Task3 dataset is mono with a 44.1 kHz sampling rate. The DCASE 2017 Task3 dataset contains more street sounds and human voices in real-life recordings; each recording is 3–5 min long and the sampling frequency is also 44.1 kHz. As shown in Table 1, the DCASE 2016 Task3 dataset covers two daily environments, an outdoor residential area and an indoor home, while the DCASE 2017 Task3 dataset covers a street scene. Both datasets provide a development set accounting for 70% of the total material and an evaluation set accounting for the remaining 30%. In this study, four-fold cross-validation is used for training and testing.

3.2. Evaluation Metrics

In this paper, we use the widely recognized measurements proposed for polyphonic SED in [40] to compare the performance of different algorithms. The evaluation metrics are the segment-based F1-score (F1) and error rate (ER). F1 is the harmonic mean of precision (P) and recall (R) and takes values in the range 0–1. The following equations describe the calculation:
P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F1 = \frac{2PR}{P + R} \quad (26)
where TP, FP and FN denote true positives, false positives and false negatives, respectively. The ER measures the number of segment-level errors (substitutions, insertions and deletions) relative to the total number of reference events, and is calculated as follows:
ER = \frac{\sum_{t=1}^{T} S(t) + \sum_{t=1}^{T} I(t) + \sum_{t=1}^{T} D(t)}{\sum_{t=1}^{T} N(t)} \quad (27)
where T is the number of segments and t indexes the segments. Substitutions S(t) count events for which the model misclassifies sound event A as sound event B. Insertions I(t) count events that are detected in the model output although no event of any type occurs in the annotations at that moment. Deletions D(t) count sound events that were present but not detected. N(t) is the total number of sound events in the annotations for segment t.
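Given binary segment-level activity matrices for the reference annotations and the system output, the segment-based F1 and ER can be computed as sketched below; this is a simplified re-implementation following the definitions in [40] (DCASE evaluations typically rely on the sed_eval toolbox), not the official evaluation code.

```python
# Segment-based F1 and ER from binary activity matrices of shape
# (n_segments, n_classes); a simplified sketch of the metrics in [40].
import numpy as np

def segment_metrics(ref, est):
    ref, est = ref.astype(bool), est.astype(bool)
    tp = np.logical_and(ref, est).sum()
    fp = np.logical_and(~ref, est).sum()
    fn = np.logical_and(ref, ~est).sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Per-segment substitutions, insertions and deletions for the ER, Eq. (27)
    fn_seg = np.logical_and(ref, ~est).sum(axis=1)   # missed events per segment
    fp_seg = np.logical_and(~ref, est).sum(axis=1)   # extra events per segment
    s = np.minimum(fn_seg, fp_seg).sum()             # substitutions S(t)
    d = np.maximum(0, fn_seg - fp_seg).sum()         # deletions D(t)
    i = np.maximum(0, fp_seg - fn_seg).sum()         # insertions I(t)
    n = ref.sum()                                    # reference events N(t)
    er = (s + d + i) / n if n else 0.0
    return f1, er
```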

3.3. Experimental Configurations

In this research, the TFFS-CRNN model was implemented in Python. All audio in the datasets consists of 44.1 kHz mono wave files, and the dimension of the Log-Mel spectrograms and MFCCs is 40 × 256 (T = 256, F = 40), with a frame size of 40 ms and 50% frame overlap. As shown in Table 2, the hyperparameters of the TFFS-CRNN model were tuned with a random search strategy. The CRNN architecture was trained with the Adam optimizer, the ReLU activation function was used to introduce non-linearity, and the learning rate was 0.0001. Early stopping was used to mitigate overfitting. Batch normalization (BN) reduces the sensitivity of the network to the initialization weights, and a dropout rate of 0.25 is applied after each convolutional layer. The global threshold τ is set to 0.5 and is used to determine the active acoustic events: if the output probability is above the threshold, the segment is considered to contain the corresponding event class. Table 2 shows the specific parameter settings of the CRNN network from input to output.
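Under these settings, extracting the two input features could look roughly as follows; this is a sketch using librosa, and details such as the FFT size and the handling of recordings longer than 256 frames are not specified in the paper and are therefore assumptions.

```python
# Sketch of extracting 40-band Log-Mel and MFCC features with a 40 ms frame
# and 50% overlap; segmentation to exactly 256 frames is simplified to a crop.
import librosa

def extract_features(path, sr=44100, n_mels=40, n_frames=256):
    y, sr = librosa.load(path, sr=sr, mono=True)
    win = int(0.040 * sr)                 # 40 ms frame (1764 samples at 44.1 kHz)
    hop = win // 2                        # 50% overlap
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=win,
                                         hop_length=hop, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                        # shape (40, T)
    mfccs = librosa.feature.mfcc(S=log_mel, n_mfcc=n_mels)    # shape (40, T)
    return log_mel[:, :n_frames], mfccs[:, :n_frames]         # crop to T = 256
```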

4. Discussion

The performance of the proposed TFFS-CRNN model was evaluated with different features, different fusion strategies, different classifiers and different methods. Accordingly, we designed four comparison experiments on the DCASE 2016 Task3 and DCASE 2017 Task3 datasets.

4.1. Comparison of Different Features

Six commonly used features were chosen for the experiments: the Log-Mel spectrograms, MFCCs, the short-time Fourier transform (STFT) and three fused features. The sampling rate of the audio signals was 44.1 kHz, the frame size was 40 ms and the frame overlap was 50%. The STFT was computed with 1024 points and a size of 40 × 256. The three fused features are FMS, the fusion of MFCCs and STFT; FLS, the fusion of Log-Mel spectrograms and STFT; and FLM, the fusion of Log-Mel spectrograms and MFCCs. The same TFFS-CRNN algorithm was used for all features to compare their classification performance. Table 3 and Figure 5 show the comparison results of the different features on the TFFS-CRNN model.
Compared with the other features, FLM improved the F1 and ER values and has better feature characterization capability. On the DCASE 2016 Task3 dataset, the aggregated feature FLM reached a maximum F1 of 60.2% with an ER of 0.40; on the DCASE 2017 Task3 dataset, its F1 and ER were 66.9% and 0.42, respectively. These results indicate that fusing Log-Mel spectrograms and MFCCs enhances classification performance, whereas the individual features performed worse than the aggregated features.

4.2. Comparison of Different Fusion Strategies

In the previous section, FLM was the best-performing feature, so FLM is used as the input feature in the fusion strategy comparison. As shown in Table 4, three fusion strategies are compared: average combination, weighted combination and channel combination, whose principles are given in Equations (11)–(13). The experimental results show that the second (weighted) strategy performs best, as its learnable weight factors are more adaptive, so it was chosen to fuse Log-Mel-T&F-attention and MFCCs-T&F-attention. In the same way, we experimented with the same strategies for fusing Log-Mel-T-attention and Log-Mel-F-attention, and the weighted combination again performed best; a sketch of how such a learnable combination can be realised is given below. In this way, the performance of the TF-attention mechanism can be tuned optimally.
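One possible way to realise the learnable weighted combination of Equation (12) inside a Keras model is sketched below; constraining λ + μ = 1 through a SoftMax over two logits is an assumption of this example, as the paper does not state how the constraint is enforced.

```python
# Learnable weighted fusion of two attention spectrograms (cf. Eq. (12));
# lambda + mu = 1 is enforced here via a softmax over two trainable logits.
import tensorflow as tf

class WeightedFusion(tf.keras.layers.Layer):
    def build(self, input_shape):
        self.logits = self.add_weight(name="fusion_logits", shape=(2,),
                                      initializer="zeros", trainable=True)

    def call(self, inputs):
        a_logmel, a_mfccs = inputs
        w = tf.nn.softmax(self.logits)           # w[0] = lambda, w[1] = mu
        return w[0] * a_logmel + w[1] * a_mfccs

# Usage inside a model: fused = WeightedFusion()([a_logmel_tensor, a_mfccs_tensor])
```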

4.3. Comparison of Different Classifiers

Compared with the other classifiers, the proposed TFFS-CRNN model improved the F1 and reduced the ER. The comparison experiments in this section were conducted under the same conditions, using the same feature FLM; Table 2 gives the detailed parameters of the TFFS-CRNN network. The compared classifiers are DNN, CNN, RNN, CRNN, FS-CRNN, TF-CRNN and TFFS-CRNN, with the following primary parameter settings. SVM: one-vs-rest SVMs, sigmoid function. CNN: five convolutional layers with ReLU activation. RNN: two BGRU recurrent layers and a time-distributed fully connected layer. CRNN: three CNN layers and an RNN with BGRU. FS-CRNN: an FS-attention layer, three convolutional layers and two BGRU recurrent layers. TF-CRNN: a TF-attention layer, three convolutional layers and two BGRU recurrent layers. TFFS-CRNN: a TF-attention layer, three convolutional layers, an FS-attention layer and two BGRU recurrent layers. Table 5 shows the comparison results of the different classifiers using the feature FLM.
As shown in Table 5 and Figure 6, on the DCASE 2017 Task3 dataset the F1 and ER of the FS-CRNN network improved by 4.5% and 0.18, respectively, compared with the CRNN model; the F1 and ER of TF-CRNN improved by 15.4% and 0.29 compared with the CRNN; TFFS-CRNN improved the F1 and ER by 18.7% and 0.15 compared with FS-CRNN, by 7.8% and 0.04 compared with TF-CRNN, and by 23.2% and 0.33 compared with the CRNN.
On the DCASE 2016 Task3 dataset, the F1 and ER of FS-CRNN improved by 3.3% and 0.07, respectively, compared with the CRNN; the F1 and ER of TF-CRNN improved by 13.5% and 0.15 compared with the CRNN; TFFS-CRNN improved the F1 and ER by 18.6% and 0.09 compared with FS-CRNN, by 5.1% and 0.01 compared with TF-CRNN, and by 21.9% and 0.16 compared with the CRNN.
These experiments demonstrate the effectiveness of the attention mechanisms. Compared with the CRNN model, the TF-CRNN and FS-CRNN models improve performance to a certain extent. In particular, the TFFS-CRNN model combines the advantages of the TF-attention and FS-attention mechanisms, and its performance is greatly improved. We can conclude that extracting key temporal-frequency feature information and key dimension information can greatly improve the performance of the TFFS-CRNN model in polyphonic SED. The improvement may be attributed to the increased attention to key information and the enhanced feature representation ability. The other classifiers improved only slightly, because key temporal-frequency information and key feature dimensions are submerged.

4.4. Comparison of Different Methods

The proposed model was then compared with existing methods, which are specified below:
MFCCs+CRNN [43]: uses CRNN as the classifier and MFCCs as the input feature.
MFCCs+GMM [42]: uses GMM as the classifier and MFCCs as the input feature; this is the baseline model for DCASE 2016 Task3.
MFCCs+CNN [44]: a three-layer CNN with MFCCs as the input feature.
Log-Mel+CaspNet [45]: uses a capsule neural network (CaspNet) with Log-Mel spectrograms as the input feature.
Log-Mel+CRNN [46]: a CRNN with Log-Mel spectrograms as the input feature.
Log-Mel+RNN [15]: uses a bidirectional LSTM RNN as the classifier with Log-Mel spectrograms as the input feature.
Log-Mel+CNN [16]: consists of two convolutional layers and two FC layers with Log-Mel spectrograms as the input feature.
Log-Mel+CRNN [17]: a CRNN with Log-Mel spectrograms as the input feature.
Binaural Mel energy+CRNN(BGRU) [47]: uses a capsule network with pixel-based attention and BGRU for polyphonic SED, with binaural Mel energy as the input feature.
Log-Mel+CRNN [48]: a CRNN with GRU, with Log-Mel spectrograms as the input feature.
The experimental results in Table 6 and Figure 7 show that the proposed TFFS-CRNN model outperforms the other methods. On the two TUT sound event datasets, FLM+TFFS-CRNN achieved the best (F1, ER) performance, with F1 scores of 60.2% and 66.9% and ERs of 0.40 and 0.42, respectively. Compared with the winning systems of the DCASE challenges, the F1 improves by 12.4% and 25.2% and the ER is reduced by 0.41 and 0.37; compared with the baseline systems, the F1 improves by 25.9% and 24.1% and the ER is reduced by 0.44 and 0.52. Compared with the latest algorithmic models, the TFFS-CRNN model still shows superior performance.
The feature fusion strategy adds rich feature information, and the TF-attention and FS-attention mechanisms increase the weight of key feature information, thus improving the feature representation ability. The experiments fully demonstrate that combining TF-attention and FS-attention can greatly improve performance in polyphonic SED, as the network can focus on the key information in the features. Methods without attention mechanisms allow critical temporal-frequency information and key feature dimensions to be drowned out, and the characterization capability of a single feature is weaker than that of the fused feature.

5. Conclusions

In this study, the TFFS-CRNN model is proposed for polyphonic SED. By introducing the TF-attention and FS-attention mechanisms into the basic CRNN architecture, the TF-attention mechanism effectively captures the key temporal-frequency features, while the FS-attention mechanism effectively enhances the features of important dimensions. Experiments on the DCASE 2016 Task3 and DCASE 2017 Task3 datasets show that the F1-score is improved to 60.2% and 66.9% and the ER is reduced to 0.40 and 0.42, respectively. The TFFS-CRNN model therefore has better classification performance for polyphonic SED than previous models and, in particular, is far superior to the baseline systems and the winning systems of the DCASE challenges. In addition, this paper uses the BGRU module and FC layer to collect the previous and future states, thus obtaining contextual information. To further improve the generalization capability of the TFFS-CRNN model in polyphonic SED tasks, training on weakly labeled datasets needs to be enhanced, and semi-supervised or unsupervised models remain worthy of further investigation in SED.

Author Contributions

Conceptualization, M.W. and Y.J.; software, L.L., Y.J., Z.L. and D.Z.; writing—review and editing, M.W. and L.L.; funding acquisition, M.W. and L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No.62071135), the Project of Guangxi Technology Base and Talent Special Project (No.GuiKe AD20159018), the Project of Guangxi Natural Science Foundation (No.2020GXNSFAA159004), the Fund of Key Laboratory of Cognitive Radio and Information Processing, Ministry of Education (No.CRKL200104) and the Opening Project of Guangxi Key Laboratory of UAV Remote Sensing (No.WRJ2016KF01).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The publicly available DCASE 2016 Task3 and DCASE 2017 Task3 datasets were used in this paper. They can be found at https://dcase.community/challenge2017/index and https://dcase.community/challenge2016/index (accessed on 2 September 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Baumann, J.; Meyer, P.; Lohrenz, T.; Roy, A.; Papendieck, M.; Fingscheidt, T. A New DCASE 2017 Rare Sound Event Detection Benchmark under Equal Training Data: CRNN with Multi-Width Kernels. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–9 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 865–869. [Google Scholar]
  2. De Benito-Gorrón, D.; Ramos, D.; Toledano, D.T. A multi-resolution CRNN-based approach for semi-supervised sound event detection in DCASE 2020 challenge. IEEE Access 2021, 9, 89029–89042. [Google Scholar] [CrossRef]
  3. Luo, L.; Zhang, L.; Wang, M.; Liu, Z.; Liu, X.; He, R.; Jin, Y. A System for the Detection of Polyphonic Sound on a University Campus Based on CapsNet-RNN. IEEE Access 2021, 9, 147900–147913. [Google Scholar] [CrossRef]
  4. Foggia, P.; Petkov, N.; Saggese, A.; Strisciuglio, N.; Vento, M. Audio surveillance of roads: A system for detecting anomalous sounds. IEEE Trans. Intell. Transp. Syst. 2015, 17, 279–288. [Google Scholar] [CrossRef]
  5. Mnasri, Z.; Rovetta, S.; Masulli, F. Anomalous sound event detection: A survey of machine learning based methods and applications. Multimed. Tools Appl. 2022, 81, 5537–5586. [Google Scholar] [CrossRef]
  6. Aljshamee, M.; Mousa, A.H.; Omran, A.A.; Ahmed, S. Sound Signal Control on Home Appliances Using Android Smart-Phone; AIP Publishing LLC: Melville, NY, USA, 2020; Volume 2290, p. 040023. [Google Scholar]
  7. Serizel, R.; Turpault, N.; Shah, A.; Salamon, J. Sound Event Detection in Synthetic Domestic Environments. In Proceedings of the ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 86–90. [Google Scholar]
  8. Chaudhary, M.; Prakash, V.; Kumari, N. Identification vehicle movement detection in forest area using MFCC and KNN. In Proceedings of the 2018 International Conference on System Modeling & Advancement in Research Trends (SMART), Moradabad, India, 23–24 November 2018; pp. 158–164. [Google Scholar]
  9. Florentin, J.; Dutoit, T.; Verlinden, O. Identification of European woodpecker species in audio recordings from their drumming rolls. Ecol. Inform. 2016, 35, 61–70. [Google Scholar] [CrossRef]
  10. Guo, G.; Li, S.Z. Content-based audio classification and retrieval by support vector machines. IEEE Trans. Neural Netw. 2003, 14, 209–215. [Google Scholar]
  11. Heittola, T.; Mesaros, A.; Eronen, A.; Virtanen, T. Audio context recognition using audio event histograms. In Proceedings of the 2010 18th European Signal Processing Conference, Aalborg, Denmark, 23–27 August 2010; pp. 1272–1276. [Google Scholar]
  12. Degara, N.; Davies, M.E.; Pena, A.; Plumbley, M.D. Onset event decoding exploiting the rhythmic structure of polyphonic music. IEEE J. Sel. Top. Signal Process. 2011, 5, 1228–1239. [Google Scholar] [CrossRef]
  13. Sidiropoulos, P.; Mezaris, V.; Kompatsiaris, I.; Meinedo, H.; Bugalho, M.; Trancoso, I. On the use of audio events for improving video scene segmentation. In Analysis, Retrieval and Delivery of Multimedia Content; Springer: Berlin/Heidelberg, Germany, 2013; pp. 3–19. [Google Scholar]
  14. Liu, Y.; Tang, J.; Song, Y.; Dai, L. A capsule based approach for polyphonic sound event detection. In Proceedings of the 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Honolulu, HI, USA, 12–15 November 2018; pp. 1853–1857. [Google Scholar]
  15. Parascandolo, G.; Huttunen, H.; Virtanen, T. Recurrent neural networks for polyphonic sound event detection in real life recordings. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 6440–6444. [Google Scholar]
  16. Jeong, I.-Y.; Lee, S.; Han, Y.; Lee, K. Audio Event Detection Using Multiple-Input Convolutional Neural Network. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017, Munich, Germany, 16 November 2017; pp. 51–54. [Google Scholar]
  17. Adavanne, S.; Virtanen, T. A report on sound event detection with different binaural features. arXiv 2017, arXiv:1710.02997. [Google Scholar]
  18. Dinkel, H.; Yu, K. Duration robust weakly supervised sound event detection. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 311–315. [Google Scholar]
  19. Imoto, K.; Mishima, S.; Arai, Y.; Kondo, R. Impact of sound duration and inactive frames on sound event detection performance. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–7 June 2021; pp. 860–864. [Google Scholar]
  20. Lim, H.; Park, J.-S.; Han, Y. Rare Sound Event Detection Using 1D Convolutional Recurrent Neural Networks. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017, Munich, Germany, 16 November 2017; pp. 80–84. [Google Scholar]
  21. Salamon, J.; Bello, J.P. Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process. Lett. 2017, 24, 279–283. [Google Scholar] [CrossRef]
  22. Zhang, X.; Zou, Y.; Shi, W. Dilated convolution neural network with LeakyReLU for environmental sound classification. In Proceedings of the 2017 22nd International Conference on Digital Signal Processing (DSP), London, UK, 23–25 August 2017; pp. 1–5. [Google Scholar]
  23. Phan, H.; Hertel, L.; Maass, M.; Mertins, A. Robust audio event recognition with 1-max pooling convolutional neural networks. arXiv 2016, arXiv:1604.06338. [Google Scholar]
  24. Cakır, E.; Virtanen, T. Convolutional recurrent neural networks for rare sound event detection. Deep Neural Netw. Sound Event Detect. 2019, 12, 141–145. [Google Scholar]
  25. Luo, Y.; Chen, Z.; Yoshioka, T. Dual-path rnn: Efficient long sequence modeling for time-domain single-channel speech separation. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 46–50. [Google Scholar]
  26. Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. arXiv 2017, arXiv:1704.06904. [Google Scholar]
  27. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  28. Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical attention networks for document classification. In Proceedings of the NAACL-HLT 2016, San Diego, CA, USA, 12–17 June 2016; pp. 1480–1489. [Google Scholar]
  29. Chorowski, J.K.; Bahdanau, D.; Serdyuk, D.; Cho, K.; Bengio, Y. Attention-based models for speech recognition. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef]
  30. Chiu, C.-C.; Sainath, T.N.; Wu, Y.; Prabhavalkar, R.; Nguyen, P.; Chen, Z.; Kannan, A.; Weiss, R.J.; Rao, K.; Gonina, E. State-of-the-art speech recognition with sequence-to-sequence models. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4774–4778. [Google Scholar]
  31. Zhang, Z.; Xu, S.; Zhang, S.; Qiao, T.; Cao, S. Attention based convolutional recurrent neural network for environmental sound classification. Neurocomputing 2021, 453, 896–903. [Google Scholar] [CrossRef]
  32. Mu, W.; Yin, B.; Huang, X.; Xu, J.; Du, Z. Environmental sound classification using temporal-frequency attention based convolutional neural network. Sci. Rep. 2021, 11, 21552. [Google Scholar] [CrossRef]
  33. Li, D.; Xu, J.; Wang, J.; Fang, X.; Ji, Y. A multi-scale fusion convolutional neural network based on attention mechanism for the visualization analysis of EEG signals decoding. IEEE Trans. Neural Syst. Rehabil. Eng. 2020, 28, 2615–2626. [Google Scholar] [CrossRef]
  34. Tang, X.; Meng, F.; Zhang, X.; Cheung, Y.-M.; Ma, J.; Liu, F.; Jiao, L. Hyperspectral image classification based on 3-D octave convolution with spatial–spectral attention network. IEEE Trans. Geosci. Remote Sens. 2020, 59, 2430–2447. [Google Scholar] [CrossRef]
  35. Xia, X.; Pan, J.; Wang, Y. Audio Sound Determination Using Feature Space Attention Based Convolution Recurrent Neural Network. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 3382–3386. [Google Scholar]
  36. Tang, P.; Du, P.; Xia, J.; Zhang, P.; Zhang, W. Channel attention-based temporal convolutional network for satellite image time series classification. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  37. Shen, Y.-H.; He, K.-X.; Zhang, W.-Q. Learning how to listen: A temporal-frequential attention model for sound event detection. arXiv 2018, arXiv:1810.11939. [Google Scholar]
  38. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  39. Li, X.; Chebiyyam, V.; Kirchhoff, K. Multi-stream network with temporal attention for environmental sound classification. arXiv 2019, arXiv:1901.08608. [Google Scholar]
  40. Mesaros, A.; Heittola, T.; Virtanen, T. Metrics for polyphonic sound event detection. Appl. Sci. 2016, 6, 162. [Google Scholar] [CrossRef]
  41. Poliner, G.E.; Ellis, D.P. A discriminative model for polyphonic piano transcription. EURASIP J. Adv. Signal Process. 2006, 2007, 1–9. [Google Scholar] [CrossRef]
  42. Mesaros, A.; Heittola, T.; Virtanen, T. TUT database for acoustic scene classification and sound event detection. In Proceedings of the 2016 24th European Signal Processing Conference (EUSIPCO), Budapest, Hungary, 29 August–2 September 2016; pp. 1128–1132. [Google Scholar]
  43. Cakır, E.; Parascandolo, G.; Heittola, T.; Huttunen, H.; Virtanen, T. Convolutional recurrent neural networks for polyphonic sound event detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 1291–1303. [Google Scholar] [CrossRef]
  44. Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier Nonlinearities Improve Neural Network Acoustic Models; Atlanta, GA, USA, 2013; Volume 30, p. 3. Available online: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.693.1422&rep=rep1&type=pdf (accessed on 2 September 2022).
  45. Jin, W.; Liu, J.; Feng, M.; Ren, J. Polyphonic Sound Event Detection Using Capsule Neural Network on Multi-Type-Multi-Scale Time-Frequency Representation; IEEE: Piscataway, NJ, USA, 2022; pp. 146–150. [Google Scholar]
  46. Ding, W.; He, L. Adaptive multi-scale detection of acoustic events. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 28, 294–306. [Google Scholar] [CrossRef]
  47. Meng, J.; Wang, X.; Wang, J.; Teng, X.; Xu, Y. A capsule network with pixel-based attention and BGRU for sound event detection. Digit. Signal Process. 2022, 123, 103434. [Google Scholar] [CrossRef]
  48. Wang, M.; Yao, Y.; Qiu, H.; Song, X. Adaptive Memory-Controlled Self-Attention for Polyphonic Sound Event Detection. Symmetry 2022, 14, 366. [Google Scholar] [CrossRef]
Figure 1. TFFS-CRNN network structure model.
Figure 2. Temporal-frequency attention mechanism.
Figure 3. Feature space attention mechanism.
Figure 4. The BGRU model includes the forward GRU and the backward GRU. The input sequence is X = [x0, x1, …, xn]. The BGRU model is computed from x0 forward and then from xn backward. The final output is the combination of the forward output and the backward output.
Figure 5. The results of different features on TFFS-CRNN in the evaluation dataset. The vertical coordinate of each bar in different colors represents the ER and F1 of the model, respectively. (a) Experimental results on the DCASE 2016 Task3 dataset; (b) experimental results on the DCASE 2017 Task3 dataset.
Figure 6. The results of different classifiers in the evaluation dataset. (a) Experimental results on the DCASE 2016 Task3 dataset; (b) experimental results on the DCASE 2017 Task3 dataset.
Figure 7. The results of different methods in the evaluation dataset. (a) Experimental results on the DCASE 2016 Task3 dataset; (b) experimental results on the DCASE 2017 Task3 dataset.
Table 1. Event instances per class in the DCASE 2016 Task3 dataset and DCASE 2017 Task3 dataset.

| DCASE 2016 Task3: Residential Area | Instances | DCASE 2016 Task3: Home | Instances | DCASE 2017 Task3: Street | Instances |
|---|---|---|---|---|---|
| (object) banging | 23 | (object) rustling | 60 | brakes squeaking | 24 |
| bird singing | 271 | (object) snapping | 57 | car | 110 |
| car passing by | 108 | cupboard | 40 | children | 19 |
| children shouting | 31 | cutlery | 76 | large vehicle | 24 |
| people speaking | 52 | dishes | 151 | people speaking | 47 |
| people walking | 44 | drawer | 51 | people walking | 48 |
| wind blowing | 30 | glass jingling | 36 | | |
| | | object impact | 250 | | |
| | | people walking | 54 | | |
| | | washing dishes | 84 | | |
| | | water tap running | 47 | | |
Table 2. The structure of the TFFS-CRNN model. The bottom row is the input layer.

| Layer Type | Configurations |
|---|---|
| Output | output shape = (256, 6) |
| Recurrent | hidden unit number = 32 |
| Recurrent | hidden unit number = 32 |
| Merge | mode = 'mul' |
| Repeat and Reshape | output shape = (256, 128, 2) |
| Softmax activation | - |
| Feedforward | hidden unit number = 256 |
| Reshape | output shape = (256, 256) |
| Permute | output shape = (256, 128, 2) |
| Max pooling | sub-sampling rate = 2 |
| ReLU activation | - |
| Convolution | filter number, kernel size = 128, (3, 3) |
| Max pooling | sub-sampling rate = 2 |
| ReLU activation | - |
| Convolution | filter number, kernel size = 128, (3, 3) |
| Max pooling | sub-sampling rate = 5 |
| ReLU activation | - |
| Convolution | filter number, kernel size = 128, (3, 3) |
| Merge | mode = 'TF-Attention' |
| Multiply on the T/F direction | mode = 'T-Attention' and 'F-Attention' |
| Softmax activation | - |
| Convolution | filter number, kernel size = 1, (1, 1) |
| ReLU activation | - |
| Convolution | filter number, kernel size = 32, F (1, 3) × 254 / T (2, 1) × 39 |
| Input | input shape = (256, 40) |
Table 3. The results of different features on TFFS-CRNN.

| Feature | DCASE 2016 Task3 F1 | DCASE 2016 Task3 ER | DCASE 2017 Task3 F1 | DCASE 2017 Task3 ER |
|---|---|---|---|---|
| Log-Mel | 58.1% | 0.42 | 63.2% | 0.51 |
| MFCCs | 57.8% | 0.45 | 58.5% | 0.55 |
| STFT | 50.9% | 0.52 | 55.1% | 0.60 |
| FMS | 58.5% | 0.43 | 63.7% | 0.52 |
| FLS | 59.0% | 0.46 | 63.2% | 0.57 |
| FLM | 60.2% | 0.40 | 66.9% | 0.42 |
Table 4. The results of fusion strategies on TFFS-CRNN.

| Strategy | DCASE 2016 Task3 F1 | DCASE 2016 Task3 ER | DCASE 2017 Task3 F1 | DCASE 2017 Task3 ER |
|---|---|---|---|---|
| LM-T&F-Average | 58.6% | 0.42 | 61.4% | 0.46 |
| LM-T&F-Weight | 60.2% | 0.40 | 66.9% | 0.42 |
| LM-T&F-Channel | 55.3% | 0.47 | 58.9% | 0.55 |
| L-T&F-Average | 52.6% | 0.49 | 57.4% | 0.45 |
| L-T&F-Weight | 55.1% | 0.41 | 59.1% | 0.46 |
| L-T&F-Channel | 50.3% | 0.48 | 56.9% | 0.50 |
Table 5. The results of different classifiers.

| Classifier | DCASE 2016 Task3 F1 | DCASE 2016 Task3 ER | DCASE 2017 Task3 F1 | DCASE 2017 Task3 ER |
|---|---|---|---|---|
| DNN | 36.0% | 0.76 | 41.3% | 0.90 |
| CNN | 34.8% | 0.52 | 43.4% | 0.77 |
| RNN | 32.7% | 0.65 | 38.6% | 0.81 |
| CRNN | 38.3% | 0.56 | 43.7% | 0.75 |
| FS-CRNN | 41.6% | 0.49 | 48.2% | 0.57 |
| TF-CRNN | 55.1% | 0.41 | 59.1% | 0.46 |
| TFFS-CRNN | 60.2% | 0.40 | 66.9% | 0.42 |
Table 6. The results of different methods.

| DCASE 2016 Task3 Methods | F1 | ER | DCASE 2017 Task3 Methods | F1 | ER |
|---|---|---|---|---|---|
| MFCCs+GMM [42] * | 34.3% | 0.88 | Log-Mel+RNN [15] | 39.6% | 0.83 |
| Binaural Mel energy+RNN [45] ** | 47.8% | 0.81 | Log-Mel+CNN [16] | 40.8% | 0.81 |
| MFCCs+CNN [44] | 59.8% | 0.56 | Log-Mel+CRNN [17] ** | 41.7% | 0.79 |
| Log-Mel+CaspNet [45] | - | 0.62 | Binaural Mel energy+CRNN(BGRU) [47] | 57.6% | 0.62 |
| Log-Mel+CRNN [46] | 48.7% | 0.78 | Log-Mel+CRNN [48] | 49.6% | 0.68 |
| FLM+TFFS-CRNN | 60.2% | 0.40 | FLM+TFFS-CRNN | 66.9% | 0.42 |

The models marked with "*" are the baseline systems of the DCASE challenge. The models marked with "**" are the winning systems of the DCASE challenge.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

