Article

A Method Based on Dual Cross-Modal Attention and Parameter Sharing for Polyphonic Sound Event Localization and Detection

Sang-Hoon Lee, Jung-Wook Hwang, Min-Hwan Song and Hyung-Min Park
1 Department of Electronic Engineering, Sogang University, Seoul 04107, Korea
2 Autonomous IoT Research Center, Korea Electronics Technology Institute, Seongnam 13509, Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(10), 5075; https://doi.org/10.3390/app12105075
Submission received: 5 March 2022 / Revised: 26 April 2022 / Accepted: 9 May 2022 / Published: 18 May 2022
(This article belongs to the Section Acoustics and Vibrations)

Abstract

Sound event localization and detection (SELD) is a joint task that unifies sound event detection (SED) and direction-of-arrival estimation (DOAE). The task has become popular enough that it was introduced into the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) as Task3 in 2019. In this paper, we propose a method based on dual cross-modal attention (DCMA) and parameter sharing to simultaneously detect and localize sound events. In particular, in addition to an encoder with parameter sharing, a single DCMA-based decoder shared across multiple predictions efficiently learns the associations between SED and DOAE features by exchanging SED and DOAE information during attention. Furthermore, acoustic features that are not commonly used in the SELD task are adopted to improve performance, and data augmentation techniques, namely mixup to simulate polyphonic events and channel rotation for spatial augmentation, are applied. Experimental results demonstrate that our efficient model, which uses one common DCMA-based decoder block to predict multiple events in the track-wise output format, is effective for the SELD task with up to three overlapping events.

1. Introduction

Sound event localization and detection (SELD) is an audio processing task that unifies sound event detection (SED) and sound source localization by jointly recognizing target classes of sound events, including the temporal information of the sound activations, and estimating their directions of arrival (DOA) [1,2,3,4]. It can be applied in indoor or outdoor situations, for example to detect a crying baby, a breaking window, a car horn on the road, or a call for help in a disaster. In these situations, it is reasonable to combine detection and localization by estimating the temporal and spatial locations of the target events, since the sound of an event is transmitted to microphones from the corresponding source in a specific direction. Therefore, SELD has various applications including acoustic monitoring [5], surveillance [6], autonomous driving, and robotics [7], and the task has become such a popular topic that it was introduced into the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) as Task3 in 2019 [8]. Since sound sources should be localized, SELD typically requires multi-channel audio data acquired by a microphone array.
In DCASE2021, the TAU-NIGENS Spatial Sound Events (TNSSE) 2021 dataset was provided in two four-channel spatial audio formats, the raw signals of a tetrahedral microphone array (MIC) and first-order Ambisonics (FOA), which include emulated recordings of static or moving sound events obtained with spatial room impulse responses (SRIRs) captured in various rooms [3]. The isolated sound event recordings were obtained from the NIGENS general sound events database [9], and the spatialization of all sound events was based on filtering through spatial room impulse responses. Localized interfering events outside of the target classes and spatial ambient noise captured in these rooms were added to the recordings [3]. In this paper, we only use the FOA (four-channel, three-dimensional recordings) data format. The spatial responses of the four FOA channels to a source incident from a DOA given by azimuth angle ϕ and elevation angle θ are as follows [10]:
$$H_1(\phi,\theta,f) = 1,$$
$$H_2(\phi,\theta,f) = \cos(\phi)\cos(\theta),$$
$$H_3(\phi,\theta,f) = \sin(\phi)\cos(\theta),$$
$$H_4(\phi,\theta,f) = \sin(\theta).$$
The FOA format is obtained by converting the 32-channel microphone array signals by means of encoding filters based on anechoic measurements of the Eigenmike array response.
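As a minimal illustration of Equations (1)–(4), the following Python sketch computes the four ideal FOA channel gains for a source at a given azimuth and elevation; the function name and example angles are ours and are not part of the dataset tooling.

import numpy as np

def foa_response(azimuth_deg, elevation_deg):
    # Ideal FOA channel gains (W, X, Y, Z) for a plane wave arriving from the given
    # azimuth and elevation, following Equations (1)-(4); the ideal responses are
    # frequency-independent, so the frequency argument is omitted.
    phi = np.deg2rad(azimuth_deg)
    theta = np.deg2rad(elevation_deg)
    return np.array([1.0,
                     np.cos(phi) * np.cos(theta),
                     np.sin(phi) * np.cos(theta),
                     np.sin(theta)])

# Example: a source 30 degrees to the left and 10 degrees above the horizontal plane.
print(foa_response(30.0, 10.0))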
The task contains 12 SED target classes: alarm, crying baby, crash, barking dog, female scream, female speech, footsteps, knocking on the door, male scream, male speech, ringing phone, and piano. In addition, there are directional interferers of a running engine, a burning fire, and a general class of disparate sounds not belonging to any of the target classes; each interferer has its own temporal activity from a static or moving source. Up to three overlapping target events may occur, and these events may belong to the same class. The dataset consists of 800 one-minute spatial recordings that are split into a development set of 600 recordings and an evaluation set of 200 recordings. The development set is split into six disjoint folds of 100 recordings each. In total, 500 recordings (folds 1–5) were used for model training, and the remaining 100 (fold 6) were used as the test set to permit evaluation across different acoustic conditions. Note that in this challenge, the use of external data is not allowed, and a pre-trained network cannot be used to develop a model. The dataset can be augmented any number of times under the condition that no external data are used.

1.1. Existing Methods

Although SED and sound source localization (SSL) have been independently studied for a long time, research on SELD, which performs the two tasks simultaneously, has been actively conducted only recently, and it was first introduced as a DCASE Challenge task in 2019. Every year, the data handled by this task have been updated to be more difficult, and the problem to be solved has become more complicated. In 2020, some sound sources could move. In 2021, up to three overlapping events may occur, and directional interference that does not belong to the target events may be present. Since not all directional sounds are target events, it is important to match the class and direction of a sound event that occurs. During and after the competition every year, various models, feature extraction schemes, data augmentation techniques, and other methodologies have been introduced by many participants and researchers.
In 2019, Cao et al. trained the SED and DOA estimation (DOAE) models separately, and they obtained the DOAE output only when the SED output predicted an event [11]. In 2020, they presented the Event Independent Network (EIN), which outputs track-wise predictions and thus enables predicting events of the same class that occur at the same time. When the features for SED and DOAE were extracted through the convolutional blocks, features related to both tasks were well captured by applying parameter sharing between the two embeddings [12]. After the challenge, an improved version of the EIN (EINV2) was released, which utilizes multi-head self-attention instead of bidirectional gated recurrent units (biGRUs) after the convolutional layers [13].
In 2020, Shimada et al. expressed the labels of SED and SSL as one using the activity-coupled Cartesian DOA vector (ACCDOA) representation [14]. Thus, they did not divide the task into two different ones and considered it as a single system that requires just one loss calculation to optimize their model. They proposed the network called RD3Net, which uses a dense block with skip connections and dilated convolutions. The ACCDOA vectors are generated as model results, and training is conducted in a direction that minimizes the distance between the coordinates indicated by this vector and the target coordinates. Therefore, they treated it as a multi-output regression problem rather than a classification problem and used a mean square error (MSE) loss. In 2021, a model was designed in a way that ensembles the ACCDOA-based system and EINV2-based system, and it achieved better performance than all the other methods submitted to the DCASE2021 Task3 [15].
Adavanne et al. used the SELDNet to jointly estimate polyphonic events of various classes and their source directions [1]. The multi-channel phase and magnitude spectrograms were used as the feature. Multiple layers of two-dimensional (2D) convolutional neural network (CNN) and GRU were followed by two branches of fully connected (FC) layers in parallel for SED and DOAE predictions. The SED and DOAE were performed as a multi-label classification task and a multi-output regression task, respectively. In particular, the modified version of this network was used as the reference system for Task3 in DCASE2019, DCASE2020, and DCASE2021 [3,10,16].
Nguyen et al. proposed a feature named spatial cue-augmented log-spectrogram (SALSA), which stacks multi-channel log-spectrograms with the estimated direct-to-reverberant ratio (DRR) and the normalized principal eigenvectors of the spatial covariance matrix at each time–frequency bin on the spectrograms [2,17]. For the FOA format, the normalized eigenvector corresponds to the inter-channel intensity difference (IID), and this feature outperformed the method based on the intensity vector (IV). The ResNet22 network was fixed for the encoder, whereas three different networks were used for the decoder: long short-term memory (LSTM), GRU, and multi-head self-attention.
In 2020, Park et al. designed three kinds of loss functions to address the issue of data imbalance problems [18]. First, a temporal masking loss was used to mask the silence interval to give more weight to the interval when target events occur. Second, the soft loss was used to overcome the problem of data imbalance caused by the different number of target events belonging to each class. Third, a sinusoidal loss was used to deal with periodic label values for DOA, since angles in azimuth and elevation are periodic. In 2021, Park et al. used a self-attention-based model called many-to-many audio spectrogram transformer (M2M-AST) to achieve improved performance in the SELD [19].

1.2. Our Contributions

In this paper, we focus on the method of enhancing the information necessary for each other between the SED and DOAE features, since SELD is a task predicting the temporal and spatial locations of a target event as two different aspects of one signal. Our model is based on the encoder–decoder structure. For the encoder, we adopt the embedding scheme of the EINV2 using soft parameter sharing, while the decoder part of the transformer [20] is modified to predict either sound events or source directions. Different from the conventional transformer that analyzes sequence data by feeding back the previous output as the current decoder input, embedded SED or DOAE features are directly used as the input to analyze the current sound in our decoder.
In particular, we apply dual cross-modal attention (DCMA) in the decoder part, which is the first trial for multi-task learning including the SELD task as far as we know, although DCMA has been used in multi-modal tasks such as audio-visual speech recognition and audio-text emotion detection [21,22]. Related information between features for SED and DOAE may be helpful for the SELD task, which needs to predict the class and direction of a specific sound event simultaneously. The related information can be enhanced by the DCMA using two different embedded features for the queries and keys/values in transformer decoder layers. Therefore, we expect our DCMA-based model structure to efficiently learn the association between the two embedded features.
In addition, the decoder consists of one common block based on the DCMA despite the need for multiple predictions with the track-wise output format. Therefore, an efficient SELD model can be constructed even for the task with polyphonic sound events. We refer to our model as DCMA-SELD.
Unlike speech, for which feature extraction methods suitable for recognition are well established, various feature extraction methods are still being evaluated for general sound signal analysis, so we evaluate several of them and find a suitable combination for SELD. In addition, we try and evaluate the data augmentation techniques of mixup, which simulates polyphonic events, and channel rotation for spatial augmentation to improve the generalization performance of our model.
The core of the model architecture has been evaluated in DCASE2021 Task3 [4], and this paper further develops the model by the methods described above with a detailed description and demonstrates the appropriateness and effectiveness of the model through extensive experiments.
The rest of the paper is organized as follows. In Section 2, we describe our approach including detailed explanations for DCMA-SELD, input features, and data augmentation techniques. Section 3 presents the experimental setup, results, and discussions. Finally, we conclude the paper with some remarks in Section 4.

2. Proposed Approach

This section describes the architecture of our model DCMA-SELD, the input features, and the data augmentation techniques in detail. The DCMA-SELD is split into two streams: the SED and DOAE streams. The log-mel spectrogram, chromagram, and additional spectral information extracted from the four channel signals in the FOA format are used for both streams. In addition, IVs are used for the DOAE stream. Two types of data augmentation techniques were tried to increase the generalization performance of the model [2,15]: the mixup that sums up weighted audio clips to simulate polyphonic events [23] and the channel rotation obtained by exchanging audio channels [24]. These data augmentation techniques are applied to the whole training dataset to train our model.

2.1. Input Features

2.1.1. Log-Mel Spectrogram

A spectrogram is a representation of a frequency-specific value of a signal that changes with time. For analysis in the frequency domain, the time-domain signal is divided into sections by a certain number of data samples corresponding to the frame size. The short-time Fourier transform (STFT) is performed for the data in each section. In general, an appropriate segmentation length is 20–35 ms for the audio analysis. We set the frame and hop sizes for the STFT to 1024 and 600 samples, respectively.
Humans are more sensitive to changes in a low-frequency band than in a high-frequency band. Therefore, it is effective to extract the feature vector by focusing on the low-frequency band rather than the entire frequency band. The mel-filter bank consists of filters whose bandwidths become narrower as the frequency decreases, so that changes in a relatively low-frequency range are analyzed in detail. A mel-spectrogram is obtained by applying a mel-filter bank to the spectrogram. In addition, humans notice intensity differences more sensitively at lower power levels than at higher power levels. To magnify intensity differences at lower power levels, a log-mel spectrogram is obtained by taking the logarithm of the mel-spectrogram. We set the number of filters to 256 for the mel-spectrogram.
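A log-mel spectrogram with the settings above (32 kHz audio, 1024-sample frames, 600-sample hop, 256 mel filters) can be computed with librosa as sketched below; the placeholder signal and variable names are ours, not part of the dataset tooling.

import librosa
import numpy as np

sr = 32000
# Placeholder 4-channel FOA signal (10 s of noise); in practice this would be a TNSSE
# recording loaded with librosa.load(path, sr=32000, mono=False).
audio = np.random.default_rng(0).standard_normal((4, 10 * sr)).astype(np.float32)

log_mels = []
for channel in audio:
    # Mel spectrogram per channel: 1024-sample frames, 600-sample hop, 256 mel bands.
    mel = librosa.feature.melspectrogram(y=channel, sr=sr, n_fft=1024,
                                         hop_length=600, n_mels=256)
    # Logarithmic compression emphasizes level differences at low power.
    log_mels.append(librosa.power_to_db(mel))

log_mel = np.stack(log_mels)  # shape: (4 channels, 256 mel bands, frames)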

2.1.2. Additional Spectral Information

Various spectral features can be extracted through additional calculations on a spectrogram. In this paper, spectral centroid [25] (1st order), spectral bandwidth [25] (1st order), spectral contrast [26] (7th order), spectral flatness [27] (1st order), and spectral rolloff [28] (1st order) are concatenated to provide 11-dimensional spectral information at a frame. The spectral centroid is a frequency where the center of mass of the spectrum at a frame is located. This has a strong correlation with the brightness of sound [29]. The spectral bandwidth indicates the dispersion of the spectrum that is computed from the spectral centroid. The spectral contrast is the average of the difference between peak and valley energies in each sub-band after dividing the spectrum into many sub-bands [26]. A high-contrast value means a clear narrow-band signal whereas a wide-band noise signal provides a small contrast value. We use six sub-bands in the experiments. The spectral flatness is a measure that quantifies the degree to which the spectrum is similar to white noise [27]. The spectral rolloff provides a frequency at which the spectral energy contained at and below that frequency is a certain percentage of the total energy at a frame.
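The 11-dimensional descriptor can be assembled from librosa's spectral features as in the sketch below (illustrative only; the variable names are ours, and a random signal stands in for a real channel). With n_bands=6, spectral_contrast returns seven values per frame, which together with the four one-dimensional features gives 11 dimensions.

import librosa
import numpy as np

sr = 32000
y = np.random.default_rng(0).standard_normal(sr).astype(np.float32)  # 1 s placeholder signal

stft_args = dict(n_fft=1024, hop_length=600)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr, **stft_args)             # (1, frames)
bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr, **stft_args)           # (1, frames)
contrast = librosa.feature.spectral_contrast(y=y, sr=sr, n_bands=6, **stft_args)  # (7, frames)
flatness = librosa.feature.spectral_flatness(y=y, **stft_args)                    # (1, frames)
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, **stft_args)               # (1, frames)

# Concatenate into an 11-dimensional spectral descriptor per frame.
spectral_info = np.concatenate([centroid, bandwidth, contrast, flatness, rolloff], axis=0)
print(spectral_info.shape)  # (11, frames)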

2.1.3. Chromagram

The chroma vector represents the amount of energy of an audio segment in 12 pitch classes of {C, C#, D, D#, E, F, F#, G, G#, A, A#, B} represented by Western music notation, irrespective of the octave the segment is in [30]. By shifting the time window across the audio signal, a sequence of chroma vectors referred to as a chromagram is obtained to show how the pitch changes over time. In addition, a log chromagram is obtained by taking the logarithmic value on a chromagram. The local chroma features are still too sensitive, particularly when looking at variations in the articulation and local tempo deviations. To obtain robust features, chroma energy normalized statistics (CENS) features represent a kind of weighted statistics of the energy distribution over a window of 41 consecutive chroma vectors [31].
Since the chroma feature is particularly related to harmonic structure and melody, it is useful for analyzing music. Because target events such as piano, ringing phone, and alarm have musical elements, the chroma feature is expected to help distinguish them from non-musical events such as human sounds, footsteps, and door knocks.
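Both the short-time chroma feature and the smoothed CENS variant are available in librosa; the sketch below is illustrative only, with a random signal standing in for one FOA channel and the CENS smoothing window set to the 41 frames mentioned above.

import librosa
import numpy as np

sr = 32000
y = np.random.default_rng(0).standard_normal(10 * sr).astype(np.float32)  # placeholder signal

# 12-dimensional chroma vector per frame from STFT magnitudes.
chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=1024, hop_length=600)

# CENS: quantized chroma statistics smoothed over 41 consecutive vectors
# (librosa's default hop length is kept here for simplicity).
cens = librosa.feature.chroma_cens(y=y, sr=sr, win_len_smooth=41)

print(chroma.shape, cens.shape)  # both: (12, frames)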

2.1.4. Intensity Vector

Since the spectral features described above are used for SED, additional features are required for DOAE. Acoustic IVs extracted from the FOA format are used for DOAE because they convey the information about the acoustic energy direction of the sound wave [11]. In the FOA format, omnidirectional and three-directional components are indicated in the order of the first to fourth channels, respectively. An acoustic IV consists of the real parts of the values computed by multiplying the conjugate value of the first channel signal and the value of the remaining channel signals in the STFT domain expressed as [12]
$$\mathbf{I}(f,t) = \Re\!\left\{ W^{*}(f,t) \begin{bmatrix} X(f,t) \\ Y(f,t) \\ Z(f,t) \end{bmatrix} \right\},$$
where $W(f,t)$, $X(f,t)$, $Y(f,t)$, and $Z(f,t)$ are the spectral values of the first to fourth channel signals at frequency bin $f$ and frame $t$ in the STFT domain, respectively, and $\Re\{\cdot\}$ and $(\cdot)^{*}$ indicate the real part and the conjugate of a complex-valued matrix, respectively. The resulting three-dimensional vector is normalized by
$$\bar{\mathbf{I}}(f,t) = \frac{\mathbf{I}(f,t)}{\left\|\mathbf{I}(f,t)\right\|_{2}},$$
where $\|\cdot\|_{2}$ represents the $\ell_2$-norm.
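A direct NumPy sketch of the two equations above; stft_foa is assumed to be a complex STFT array of shape (4, frequencies, frames) holding the W, X, Y, and Z channel spectra (the function and variable names are ours).

import numpy as np

def normalized_intensity_vector(stft_foa, eps=1e-8):
    # stft_foa: complex array of shape (4, n_freq, n_frames) with W, X, Y, Z spectra.
    w = stft_foa[0]
    xyz = stft_foa[1:]                                # (3, n_freq, n_frames)
    iv = np.real(np.conj(w)[np.newaxis] * xyz)        # Re{ W*(f,t) [X, Y, Z]^T }
    norm = np.linalg.norm(iv, axis=0, keepdims=True)  # l2-norm over the three components
    return iv / (norm + eps)                          # eps avoids division by zero in silent bins

# Example with random data standing in for real FOA spectra:
rng = np.random.default_rng(0)
dummy = rng.standard_normal((4, 513, 100)) + 1j * rng.standard_normal((4, 513, 100))
print(normalized_intensity_vector(dummy).shape)       # (3, 513, 100)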
Since the generalized cross-correlation with phase transform (GCC-PHAT) extracted from a microphone array is the most basic feature for the time difference of arrival (TDOA) problem, it was widely used by participants in previous challenges. However, this feature becomes noisy when multiple sound sources overlap or a source moves. Since the TNSSE 2021 dataset contains events moving at an angular speed of 40 degrees per second in addition to overlapped sound sources, DOAE based on the GCC-PHAT may not be reliable. To effectively solve SELD in the multi-source scenario with interference, only the IV without the GCC-PHAT is used for DOAE in this paper.

2.2. Data Augmentation Techniques

2.2.1. Mixup

Since there are polyphonic events in the task, it is useful to make situations with overlapped events. A mixed datum is created by a weighted sum of two randomly selected data. In the original mixup used for image classification [23], the weight value multiplied by the data is also applied to the labels as in the formula below:
$$x = \lambda x_i + (1-\lambda)\,x_j,$$
$$y = \lambda y_i + (1-\lambda)\,y_j,$$
where $x_i$ and $y_i$ indicate the $i$-th data and label, respectively. The value of the weight $\lambda$ is randomly selected from a uniform distribution between 0.5 and 0.8 and applied for the mixup.
In the SED task, all the mixed events should be detected even though any one of those events is mixed with small power. Therefore, in this paper, λ is applied only to the data, and the corresponding labels are simply added.
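A sketch of the modified mixup described above, with λ drawn uniformly from [0.5, 0.8] and applied to the audio only, while the multi-hot SED labels are simply added (function and variable names are ours; clipping the summed labels to 1 is our assumption for multi-hot activity targets).

import numpy as np

def seld_mixup(x_i, y_i, x_j, y_j, rng=None):
    # x_*: multi-channel waveforms of identical shape; y_*: multi-hot event activity labels.
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.uniform(0.5, 0.8)
    x = lam * x_i + (1.0 - lam) * x_j   # weighted sum of the audio clips
    y = np.clip(y_i + y_j, 0.0, 1.0)    # labels are simply added (clipped for multi-hot targets)
    return x, y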

2.2.2. Augmentation by Rotation

Because data captured from a microphone array are usually limited, data augmentation for multi-channel data is very important for deep learning. The spatial augmentation method in [32] is a very effective method that can be used in the FOA format to increase the number of DOA labels without losing the physical relationships between steering vectors and observations. Since the spatial responses of the four FOA channels {W, X, Y, Z} to a source incident from a direction are given by Equations (1)–(4), data for a new direction can be reproduced by a reflection about a specific plane or a rotation of the angle by ±90° or 180°. Whereas azimuths span the whole range of ϕ ∈ [−180°, 180°), elevations span only the partial range of θ ∈ [−45°, 45°]. In order not to go out of the label range, only the reflection is applied to the elevation, while the reflection and rotations by ±90° or 180° can be applied to the azimuth. In total, 16 combinations of these transformations are applicable for augmentation, as summarized in Table 1, and two of them are sketched in the code below.
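As a minimal sketch under the response equations (1)–(4) (not the authors' exact implementation), two of the sixteen transformations can be written directly on the FOA channels, together with the corresponding label updates; foa is assumed to have shape (4, samples) in W, X, Y, Z order.

import numpy as np

def rotate_azimuth_plus_90(foa, azimuth_deg):
    # phi -> phi + 90 degrees: from Equations (2)-(3), the new X equals -Y and
    # the new Y equals X; W and Z are unchanged.
    w, x, y, z = foa
    rotated = np.stack([w, -y, x, z])
    new_azimuth = (azimuth_deg + 90.0 + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
    return rotated, new_azimuth

def reflect_elevation(foa, elevation_deg):
    # theta -> -theta: only Z flips sign (Equation (4)).
    w, x, y, z = foa
    return np.stack([w, x, y, -z]), -elevation_deg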

2.3. Model Architecture

2.3.1. Encoder

Input features are effectively embedded by applying convolutional blocks and parameter-sharing techniques in the encoder layers. In recent decades, deep learning, with convolutional neural networks (CNNs) at the forefront, has dramatically improved performance in several research fields (image classification, object detection, optical character recognition (OCR), etc.). CNN-based models such as AlexNet, GoogleNet, VGGNet, Xception, ResNet, ResNeXt, and DenseNet have been released [33]. In particular, the efficient representations extracted directly from raw input data by these verified models lead to impressive performance. In this study, to extract rich internal features, local correlation information is obtained from the acoustic features through convolutional layers.
Multi-task learning techniques have been successfully applied to various tasks such as natural language processing (NLP), automatic speech recognition (ASR), and computer vision [34]. Predicting the class and the direction of a sound event are different tasks. However, they are closely related to each other, since they originate from the same sound. No matter how noisy the environment is, if listeners focus on a specific target sound, they can catch the direction of the sound source. Likewise, if they concentrate on a particular direction, it becomes easier to tell what sound is coming from that direction. Therefore, SELD is an appropriate task for multi-task learning. In SELD, a no-parameter-sharing scheme, in which features for SED and DOAE are not shared at any embedding stage, cannot exploit the association between the SED and DOAE features. In a hard parameter-sharing scheme, all of the parameters are shared between the two subtasks. It can reduce overfitting to a specific subtask, but it can cause performance degradation because it cannot discard features that are necessary for only one subtask [35]. Soft parameter sharing has a separate set of parameters per subtask and only shares partial instances, which are a crucial part of the learning. It is expected that only the information useful for each subtask will be complemented [13].
Although we use additional spectral information and chromagram concatenated with the log-mel spectrogram as the spectral feature, the embedding scheme of the EINV2 corresponding to four convolutional blocks with batch normalization (BN), ReLU activation, average pooling, and soft parameter sharing [13] is adopted for the encoder part since it is an encoding scheme that sufficiently reflects the characteristics of SELD discussed above.

2.3.2. Decoder

The decoder part of the transformer is modified to efficiently decode the information in the embedded features for the prediction of sound events or source directions. In particular, the prediction can be improved by exploiting the related information from the other embedded features using the DCMA in the decoder. The overall schematic diagram of the decoder is represented in Figure 1.
Similar to the encoder, the decoder is also split into SED and DOAE streams, each of which is based on the decoder part of the transformer, which was first proposed by Vaswani et al. [20]. Although the transformer composed of attention mechanisms without convolution or recurrent layers has used the previous decoder output as the current decoder input to analyze sequence data [20], our decoder uses the embedded features for either SED or DOAE as the decoder input to analyze the current sound directly.
As shown in Figure 1, the decoder in each stream is composed of a stack of identical layers. Each layer has two multi-head attention blocks followed by an FC network. The first attention block is the masked multi-head self-attention block using the common input embedding as the query, key, and value, while the second attention block is the multi-head attention based on the DCMA with embedded features for SED and DOAE (corresponding to the outputs of the first attention block) used as the queries and keys/values. Residual connections are employed around each of the blocks and the FC network, followed by layer normalization.
Multi-head attention is a module that runs an attention mechanism several times in parallel [20]. The independent attention outputs are then concatenated and linearly transformed into the original dimension. Therefore, multiple attention heads focus on different aspects of the input features during the attention process, and the outputs are concatenated to enable effective attention. In addition, since each attention mechanism is performed in a small dimension, it is also computationally beneficial. After passing through the multi-head self-attention that is also employed in the EINV2 [13], we exchange the related information between the features embedded through the multi-head self-attention in the two streams by applying the DCMA [21] instead of the encoder–decoder multi-head attention in the transformer. Although the DCMA was introduced to utilize both an audio context vector using a video query and a video context vector using an audio query in audio-visual speech recognition [21], the related information between the embedded features for SED and DOAE can be enhanced in learning the association between the two features by employing the DCMA architecture with the two different features used as the queries and keys/values. Using the scaled dot-product attention, the attention outputs in the i-th head, A_SD and A_DS, can be expressed as
$$A_{\mathrm{SD}} = \mathrm{Attention}(E_S, E_D, E_D) = \mathrm{softmax}\!\left(\frac{E_S W_S^{Q_i} \left(W_S^{KV_i}\right)^{\mathrm{T}} E_D^{\mathrm{T}}}{\sqrt{d_k}}\right) E_D W_S^{KV_i},$$
$$A_{\mathrm{DS}} = \mathrm{Attention}(E_D, E_S, E_S) = \mathrm{softmax}\!\left(\frac{E_D W_D^{Q_i} \left(W_D^{KV_i}\right)^{\mathrm{T}} E_S^{\mathrm{T}}}{\sqrt{d_k}}\right) E_S W_D^{KV_i},$$
where $E_S \in \mathbb{R}^{N_T \times D_F}$ and $E_D \in \mathbb{R}^{N_T \times D_F}$ are the SED and DOAE features used as inputs, with $N_T$ and $D_F$ denoting the number of time steps and the feature dimension, respectively. $W_S^{Q_i}$ and $W_S^{KV_i}$ represent the learnable query and key/value matrices for the $i$-th head in the SED stream, respectively; in the same way, $W_D^{Q_i}$ and $W_D^{KV_i}$ denote the corresponding matrices in the DOAE stream. $d_k$ is the common number of columns of these learnable matrices.
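The two attention directions above map naturally onto torch.nn.MultiheadAttention, with the query taken from one stream and the key/value from the other. The sketch below is a simplified single-block illustration with our own module name; it uses the standard separate key and value projections rather than the shared key/value matrices in the equations, and it is not the authors' released code.

import torch
import torch.nn as nn

class DualCrossModalAttention(nn.Module):
    # Exchanges information between the SED and DOAE streams: each stream queries
    # the other stream's embedding, mirroring A_SD and A_DS above.
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn_sed = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_doa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_sed = nn.LayerNorm(d_model)
        self.norm_doa = nn.LayerNorm(d_model)

    def forward(self, e_sed, e_doa):
        # e_sed, e_doa: (batch, time, d_model) embeddings from the preceding self-attention blocks.
        a_sd, _ = self.attn_sed(query=e_sed, key=e_doa, value=e_doa)
        a_ds, _ = self.attn_doa(query=e_doa, key=e_sed, value=e_sed)
        # Residual connections followed by layer normalization, as in the decoder layer.
        return self.norm_sed(e_sed + a_sd), self.norm_doa(e_doa + a_ds)

# Example with random embeddings of shape (batch, time steps, feature dimension):
sed, doa = torch.randn(2, 40, 512), torch.randn(2, 40, 512)
out_sed, out_doa = DualCrossModalAttention()(sed, doa)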
Experiments were conducted by changing the number of decoder layers from one to six, and the best results were obtained when set to two. In addition to the soft parameter sharing in the encoder part, the DCMA process will play a major role in obtaining the association between SED and DOAE feature information prior to the prediction.

2.3.3. Overall Procedure

The overall DCMA-SELD architecture is illustrated in Figure 2, and the feature dimensions and hyper-parameters are summarized in Table 2.
For the SED stream, the input feature tensor has four channels. Each of the four channels has spectral features obtained from each of the four signals formatted in the FOA format, which consists of a 256-dimensional log-mel spectrum, a 12-dimensional chroma vector, and an 11-dimensional spectral feature per frame. For the DOAE stream, the input feature tensor has three more channels in addition to the four channels for the SED stream. Each of the three channels has a 256-dimensional IV concatenated with a 23-dimensional zero vector to match the dimension by zero padding per frame.
The SED and DOAE streams have the same encoder architecture, which consists of four convolutional blocks, and a layer of CNN, BN, and ReLU activation is repeated twice in each block. After an embedding passes a block followed by average pooling in each stream, the resulting embeddings in both the streams are combined by soft parameter sharing to provide embeddings input to the next block, which are expressed as
$$\bar{D}_c^{(i+1)} = \alpha_c^{(i)} D_c^{(i)} + \beta_c^{(i)} S_c^{(i)},$$
$$\bar{S}_c^{(i+1)} = \gamma_c^{(i)} D_c^{(i)} + \delta_c^{(i)} S_c^{(i)},$$
where $\bar{D}_c^{(i)}$ and $\bar{S}_c^{(i)}$ denote the input embeddings for the $i$-th convolutional block at the $c$-th channel in the DOAE and SED streams, respectively, while $D_c^{(i)}$ and $S_c^{(i)}$ denote the corresponding output embeddings from a block followed by average pooling. $\boldsymbol{\alpha}^{(i)} = [\alpha_1^{(i)}, \alpha_2^{(i)}, \ldots, \alpha_C^{(i)}]$, $\boldsymbol{\beta}^{(i)} = [\beta_1^{(i)}, \ldots, \beta_C^{(i)}]$, $\boldsymbol{\gamma}^{(i)} = [\gamma_1^{(i)}, \ldots, \gamma_C^{(i)}]$, and $\boldsymbol{\delta}^{(i)} = [\delta_1^{(i)}, \ldots, \delta_C^{(i)}]$ are learnable weights applied to the output embeddings for the $i$-th convolutional block in the soft parameter sharing, where $C$ is the number of channels.
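A minimal PyTorch sketch of the channel-wise soft parameter sharing in the two equations above (the module name and the initialization of the weights are our assumptions): each of the four weight vectors holds one learnable value per channel and is broadcast over the time and frequency dimensions.

import torch
import torch.nn as nn

class SoftParameterSharing(nn.Module):
    def __init__(self, n_channels):
        super().__init__()
        # One learnable weight per channel for alpha, beta, gamma, and delta.
        self.alpha = nn.Parameter(torch.ones(n_channels))
        self.beta = nn.Parameter(torch.zeros(n_channels))
        self.gamma = nn.Parameter(torch.zeros(n_channels))
        self.delta = nn.Parameter(torch.ones(n_channels))

    def forward(self, d, s):
        # d, s: (batch, channels, time, freq) outputs of the current DOAE and SED conv blocks.
        shape = (1, -1, 1, 1)
        d_next = self.alpha.view(shape) * d + self.beta.view(shape) * s   # next DOAE block input
        s_next = self.gamma.view(shape) * d + self.delta.view(shape) * s  # next SED block input
        return d_next, s_next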
In each stream, the encoder output is input to the decoder based on the decoder part of the transformer, as mentioned above. The decoder in each stream is composed of a stack of two identical layers, and each layer has a masked multi-head self attention block and a multi-head attention based on the DCMA followed by an FC network. In particular, the second attention block utilizes embedded features for SED and DOAE used as the queries and keys/values to enhance the related information between the features.
The decoder is followed by an FC layer to predict SELD outputs. Different from the class-wise output format, the track-wise output format can detect sound events of the identical class from different directions, which is called the case of homogeneous overlap [13]. Since the DCASE2021 Task3 may have up to three overlapping events, we adopt the track-wise output format [13] where three SED outputs and three DOAE outputs should be given by the FC layer. In either the SED or DOAE stream, three FC blocks in parallel simply pass a decoder output to predict three SED or DOAE outputs. Our system adopting the track-wise output format is trained by the permutation-invariant training for each frame [13].
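Frame-level permutation-invariant training over the three tracks can be sketched as below. This is only an illustration with placeholder subtask losses (binary cross-entropy for SED and mean square error for DOA) and an assumed equal weighting; it is not the authors' exact loss.

import itertools
import torch
import torch.nn.functional as F

def track_pit_loss(sed_pred, doa_pred, sed_ref, doa_ref):
    # sed_pred/sed_ref: (batch, tracks=3, time, n_classes) logits and multi-hot targets.
    # doa_pred/doa_ref: (batch, tracks=3, time, 3) Cartesian DOA predictions and targets.
    per_perm = []
    for perm in itertools.permutations(range(3)):
        idx = list(perm)
        sed_l = F.binary_cross_entropy_with_logits(
            sed_pred[:, idx], sed_ref, reduction="none").mean(dim=(1, 3))   # (batch, time)
        doa_l = F.mse_loss(doa_pred[:, idx], doa_ref, reduction="none").mean(dim=(1, 3))
        per_perm.append(sed_l + doa_l)
    # Take the best track permutation independently at every frame, then average.
    return torch.stack(per_perm, dim=0).min(dim=0).values.mean()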

3. Experiments and Results

3.1. Experimental Setup

3.1.1. Parameter Configuration

The sampling rate of the audio signals is 32 kHz. The frame and hop sizes for the STFT were set to 1024 and 600 samples, respectively. The spectral features composed of the log-mel spectrogram, chromagram, and additional spectral information were obtained with librosa [28,36]. An Adam optimizer [37] was used for the training. The StepLR scheduler [38], which decays the learning rate by a fixed factor every configured number of epochs, was applied to control the learning rate during training. For our experiments, we decayed the learning rate by a factor of 0.1 every 80 epochs, and the initial learning rate was set to 0.0005.
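The stated optimizer and schedule correspond to the following PyTorch configuration; the placeholder module, dummy loss, and total number of epochs are ours and only illustrate the hyper-parameters given above.

import torch
import torch.nn as nn

model = nn.Linear(512, 12)  # placeholder module standing in for DCMA-SELD
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
# Decay the learning rate by a factor of 0.1 every 80 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=80, gamma=0.1)

num_epochs = 240  # illustrative value; not taken from the paper
for epoch in range(num_epochs):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 512)).pow(2).mean()  # dummy loss for illustration
    loss.backward()
    optimizer.step()
    scheduler.step()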

3.1.2. Evaluation Metrics

The performance of the model was assessed by the evaluation metrics used in the DCASE2021 Challenge, which consider joint dependencies between the performance of SED and DOAE. In SELD, the F1-score $F_{T}$ and error rate $ER_{T}$ for SED are location-dependent, counting true positives only when the difference between the estimated DOA and the reference is under a threshold $T$; here, $T$ was set to 20°. For the metrics of DOAE, the localization error $LE_{\mathrm{CD}}$ and localization recall $LR_{\mathrm{CD}}$ are class-dependent, meaning that they are computed only across each class and then averaged. $LE_{\mathrm{CD}}$ shows the average angular distance between predictions and references of the same class, whereas $LR_{\mathrm{CD}}$ calculates the true positive rate of the localization predictions of a class out of all instances of that class [8].
These metrics can be integrated into a joint measurement $D_{\mathrm{SELD}}$ as given in [17]:
$$D_{\mathrm{SELD}} = \frac{1}{4}\left[ ER_{T} + (1 - F_{T}) + \frac{LE_{\mathrm{CD}}}{180^{\circ}} + (1 - LR_{\mathrm{CD}}) \right].$$
Higher $F_{T}$ and $LR_{\mathrm{CD}}$ and lower $ER_{T}$, $LE_{\mathrm{CD}}$, and $D_{\mathrm{SELD}}$ mean better performance.
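The aggregate measure above reduces to a few lines; the sketch below evaluates it for placeholder metric values (the numbers are not results from the paper). The error rate and the two rates are expected as fractions in [0, 1] and the localization error in degrees.

def seld_error(er, f_score, le_deg, lr):
    # ER and LE contribute directly (LE normalized by 180 degrees);
    # the F-score and localization recall enter as (1 - value).
    return 0.25 * (er + (1.0 - f_score) + le_deg / 180.0 + (1.0 - lr))

# Placeholder example: ER = 0.5, F = 0.6, LE = 18 degrees, LR = 0.7.
print(seld_error(0.5, 0.6, 18.0, 0.7))  # 0.325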

3.2. Experimental Results and Discussions

Since the metadata of the evaluation dataset have not been provided even after the challenge, we cannot check the performance on the evaluation dataset. Instead, we compared the performance of the models on the test set of the development dataset. For all our results in the following tables, we included the 95% confidence intervals for each metric, which were estimated using the jackknife procedure with the partial estimates calculated in a leave-one-out manner, excluding one audio file from the test set in turn, as presented in [8,39].
Table 3 summarizes an ablation study on the input features for our DCMA-SELD. Retaining the log-mel spectrogram and IVs as the basic features for SED and DOAE, performance was evaluated when some of the four types of input features were dropped. When all four kinds of features were applied, the best results were obtained in all the metrics except for $LR_{\mathrm{CD}}$, which was 0.1% lower than the best score. The performance was degraded without some of them, which demonstrates that all types of features contribute to performance improvement. This is because the log-mel spectrogram, additional spectral information, and chroma features are obtained from spectra with different frequency resolutions and thus contain information on different aspects of the input audio signal. In the following experiments, all four types of features were used.
Table 4 compares the performance when using different parameter-sharing methods in the encoder of our DCMA-SELD model. As in [13], our model also showed the best performance for all given official metrics with soft parameter sharing. On the other hand, the worst performance was recorded for all the metrics when the SED and DOAE embeddings were performed independently with no parameter sharing. Through this experiment, it can be confirmed that the two embedded features need not be completely identical but are related and complementary.
Table 5 shows the performance when the decoder of our model was replaced with different architectures. As a basic decoder to deal with audio data, two-layer LSTM networks were evaluated. In each stream, three LSTM blocks in parallel were used to predict SED or DOAE outputs in this task where up to three overlapping events may occur. In addition, the decoder was also replaced with the multi-head self-attention, which corresponded to the model in [13]. Since the model in [13] used one multi-head self-attention block for each track, three blocks were used in the task. On the other hand, even with three tracks, our model obtained three parallel prediction results from an FC layer after passing through one common decoder. Therefore, instead of using one multi-head self-attention block per track, we also evaluated the model performing three predictions after one common multi-head self-attention block.
The LSTM-based decoder showed the lowest performance in all metrics. When multi-head self-attention blocks were used as the decoder, using three blocks achieved higher performance in all metrics than using one common block. However, except for $LE_{\mathrm{CD}}$, which was greater by 0.5°, the decoder used in our DCMA-SELD outperformed the other decoder architectures despite using one common decoder block. This demonstrates that building an efficient model with one common decoder block that can produce sufficiently reliable outputs is more effective for improving generalization performance than using independent decoder blocks with too many weights in this task, which has to predict three tracks.
Table 6 compares the performance of the models trained with and without data augmentation for model training described in Section 2.2. The model to which the data augmentation was applied improved the performance in all metrics, which showed that the data augmentation was very useful in the task.
Table 7 compares the performance on the development dataset of our model and the highly ranked systems in the challenge (note that the model presented in this paper is an improved version of our previous model that placed fourth in the challenge, as mentioned in the introduction). In the experimental results, the baseline system in the challenge, corresponding to the modified version of the SELDnet in [40], showed significantly inferior performance in all metrics compared with the other methods, while the others showed relatively comparable performance. In particular, our model achieved the best performance for the metric $LR_{\mathrm{CD}}$ by recording a value of 79.0%. Excluding our model, Park et al. [19] showed the highest score of 74.2% in $LR_{\mathrm{CD}}$ among all the participants of the challenge, which was about 5% lower than ours. Moreover, both the no-parameter-sharing and hard-parameter-sharing variants of our model in Table 4 still showed scores higher than 75% in $LR_{\mathrm{CD}}$, which was better than the other participants.
We attribute the main reason why our model achieves such strong results in $LR_{\mathrm{CD}}$ compared to the others to the track-wise output format, whereas the others adopted the class-wise output format (the system in [15] is affected by class-wise predictions because it uses both output formats together). The class-wise output format can only predict up to one event per class at a time. That is, if two or three events of the same class occur at the same time, one or two sound events, respectively, must be abandoned. However, since the track-wise output format predicts one sound event and direction per track, it can cope even with the case where two or more events belonging to the same class occur at the same time. In the TNSSE 2021 dataset, the proportion of cases where multiple events belonging to the same class occur simultaneously reaches 10.45% [17], and the maximum possible score of $LR_{\mathrm{CD}}$ under an ideal SED condition is 92.5% for a system adopting the class-wise output format [19]. In the DCASE2021 challenge, all the highly ranked systems compared here accepted this handicap and used the class-wise output format to avoid the possibility of mispredicting non-existent overlapped events, which could cause performance degradation in other metrics.
Above all, in spite of this possibility of misprediction, our model ranked second in terms of both $D_{\mathrm{SELD}}$ and the challenge ranking criterion determined by the sum of the rankings of the four metrics, while retaining a high $LR_{\mathrm{CD}}$ with the track-wise output format that can cope even with overlapped sound events. Therefore, combining our model with another model adopting the class-wise output format, which can compensate for the possible mispredictions of ours, can be a promising ensemble method.

4. Conclusions

In this paper, we presented a model using DCMA and soft parameter sharing to simultaneously detect and localize sound events. The CNN-based encoder with parameter sharing exchanges intermediate features in the CNN layers for the SED and DOAE, and the DCMA-based decoder efficiently outputs three predictions to detect overlapping events in the DCASE2021 Task3. In particular, key and value vectors for either the SED or DOAE stream in the DCMA were given from the other stream to efficiently learn the association between SED and DOAE features. Acoustic features, such as additional spectral information and chroma features, that had not been commonly used in the SELD task were additionally adopted to improve the performance. Experimental results demonstrated that our efficient model using one common decoder block based on the DCMA was effective for the SELD task with up to three overlapping events while retaining a high $LR_{\mathrm{CD}}$ with the track-wise output format. As future work, we will study prediction rules to induce non-independent and reliable results between the FC blocks. For further improvement, we plan to develop an ensemble network including the presented model.

Author Contributions

Conceptualization, S.-H.L., J.-W.H., M.-H.S. and H.-M.P.; methodology, S.-H.L., J.-W.H., M.-H.S. and H.-M.P.; software, S.-H.L., J.-W.H. and H.-M.P.; validation, S.-H.L. and J.-W.H.; formal analysis, S.-H.L. and J.-W.H.; investigation, S.-H.L., J.-W.H., M.-H.S. and H.-M.P.; resources, M.-H.S. and H.-M.P.; data curation, S.-H.L. and H.-M.P.; writing—original draft preparation, S.-H.L., J.-W.H., M.-H.S. and H.-M.P.; writing—review and editing, S.-H.L., J.-W.H., M.-H.S. and H.-M.P.; visualization, S.-H.L.; supervision, H.-M.P.; project administration, H.-M.P.; funding acquisition, M.-H.S. and H.-M.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information and communications Technology Planning and Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00860, Development of acoustic-based micro sensor device for disaster and safety with multi-role support and disaster situation recognition technology and No. 2019-0-01376, Development of the multi-speaker conversational speech recognition technology).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
DOAE       direction-of-arrival estimation
DCASE      Detection and Classification of Acoustic Scenes and Events
DCMA       dual cross-modal attention
SED        sound event detection
SELD       sound event localization and detection
TNSSE      TAU-NIGENS Spatial Sound Events
MIC        microphone array
FOA        first-order Ambisonics
SRIRs      spatial room impulse responses
SSL        sound source localization
EIN        Event Independent Network
biGRUs     bidirectional gated recurrent units
ACCDOA     activity-coupled Cartesian DOA vector
MSE        mean square error
CNN        convolutional neural network
SALSA      spatial cue-augmented log-spectrogram
DRR        direct-to-reverberant ratio
IID        inter-channel intensity difference
IV         intensity vector
LSTM       long short-term memory
M2M-AST    many-to-many audio spectrogram transformer
STFT       short-time Fourier transform
CENS       chroma energy normalized statistics
GCC-PHAT   generalized cross-correlation with phase transform
TDOA       time difference of arrival
OCR        optical character recognition
NLP        natural language processing
ASR        automatic speech recognition
BN         batch normalization

References

  1. Adavanne, S.; Politis, A.; Nikunen, J.; Virtanen, T. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE J. Sel. Top. Signal Process. 2018, 13, 34–48. [Google Scholar] [CrossRef] [Green Version]
  2. Nguyen, T.N.T.; Watcharasupat, K.N.; Nguyen, N.K.; Jones, D.L.; Gan, W.S. SALSA: Spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection. arXiv 2021, arXiv:2110.00275. [Google Scholar]
  3. Politis, A.; Adavanne, S.; Krause, D.; Deleforge, A.; Srivastava, P.; Virtanen, T. A dataset of dynamic reverberant sound scenes with directional interferers for sound event localization and detection. arXiv 2021, arXiv:2106.06999. [Google Scholar]
  4. Lee, S.H.; Hwang, J.W.; Seo, S.B.; Park, H.M. Sound Event Localization and Detection Using Cross-Modal Attention and Parameter Sharing for DCASE2021 Challenge. Technical Report of DCASE Challenge. 2021. Available online: https://dcase.community/challenge2021/task-sound-event-localization-and-detection-results (accessed on 8 May 2022).
  5. Valenzise, G.; Gerosa, L.; Tagliasacchi, M.; Antonacci, F.; Sarti, A. Scream and gunshot detection and localization for audio-surveillance systems. In Proceedings of the IEEE Conference on Advanced Video and Signal Based Surveillance, London, UK, 5–7 September 2007; pp. 21–26. [Google Scholar]
  6. Foggia, P.; Petkov, N.; Saggese, A.; Strisciuglio, N.; Vento, M. Audio surveillance of roads: A system for detecting anomalous sounds. IEEE Trans. Intell. Transp. Syst. 2015, 17, 279–288. [Google Scholar] [CrossRef]
  7. Valin, J.M.; Michaud, F.; Hadjou, B.; Rouat, J. Localization of simultaneous moving sound sources for mobile robot using a frequency-domain steered beamformer approach. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), New Orleans, LA, USA, 26 April–1 May 2004; pp. 1033–1038. [Google Scholar]
  8. Politis, A.; Mesaros, A.; Adavanne, S.; Heittola, T.; Virtanen, T. Overview and evaluation of sound event localization and detection in DCASE 2019. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 29, 684–698. [Google Scholar] [CrossRef]
  9. Trowitzsch, I.; Taghia, J.; Kashef, Y.; Obermayer, K. The NIGENS general sound events database. arXiv 2019, arXiv:1902.08314. [Google Scholar]
  10. Politis, A.; Adavanne, S.; Virtanen, T. A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection. arXiv 2020, arXiv:2006.01919. [Google Scholar]
  11. Cao, Y.; Iqbal, T.; Kong, Q.; Galindo, M.; Wang, W.; Plumbley, M.D. Two-Stage Sound Event Localization and Detection Using Intensity Vector and Generalized Cross-Correlation. Technical Report of DCASE Challenge. 2019. Available online: https://www.researchgate.net/publication/337335170_TWO-STAGE_SOUND_EVENT_LOCALIZATION_AND_DETECTION_USING_INTENSITY_VECTOR_AND_GENERALIZED_CROSS-CORRELATION (accessed on 8 May 2022).
  12. Cao, Y.; Iqbal, T.; Kong, Q.; Zhong, Y.; Wang, W.; Plumbley, M.D. Event-Independent Network for Polyphonic Sound Event Localization and Detection. arXiv 2020, arXiv:2010.00140. [Google Scholar]
  13. Cao, Y.; Iqbal, T.; Kong, Q.; An, F.; Wang, W.; Plumbley, M.D. An improved event-independent network for polyphonic sound event localization and detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 885–889. [Google Scholar]
  14. Shimada, K.; Takahashi, N.; Takahashi, S.; Mitsufuji, Y. Sound Event Localization and Detection Using Activity-Coupled Cartesian DOA Vector and RD3Net. arXiv 2020, arXiv:2006.12014. [Google Scholar]
  15. Shimada, K.; Takahashi, N.; Koyama, Y.; Takahashi, S.; Tsunoo, E.; Takahashi, M.; Mitsufuji, Y. Ensemble of ACCDOA-and EINV2-Based Systems with D3Nets and Impulse Response Simulation for Sound Event Localization and Detection. arXiv 2021, arXiv:2106.10806. [Google Scholar]
  16. Adavanne, S.; Politis, A.; Virtanen, T. A Multi-Room Reverberant Dataset for Sound Event Localization and Detection. arXiv 2019, arXiv:1905.08546. [Google Scholar]
  17. Nguyen, T.N.T.; Watcharasupat, K.; Nguyen, N.K.; Jones, D.L.; Gan, W.S. DCASE 2021 Task 3: Spectrotemporally-Aligned Features for Polyphonic Sound Event Localization and Detection. arXiv 2021, arXiv:2106.15190. [Google Scholar]
  18. Park, S.; Suh, S.; Jeong, Y. Sound Event Localization and Detection with Various Loss Functions. Technical Report of DCASE Challenge. 2020. Available online: https://dcase.community/documents/challenge2020/technical_reports/DCASE2020_Park_89.pdf (accessed on 8 May 2022).
  19. Park, S.; Jeong, Y.; Lee, T. Self-Attention Mechanism for Sound Event Localization and Detection. Technical Report of DCASE Challenge. 2021. Available online: http://dcase.community/documents/challenge2021/technical_reports/DCASE2021_Park_106_t3.pdf (accessed on 8 May 2022).
  20. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  21. Lee, Y.H.; Jang, D.W.; Kim, J.B.; Park, R.H.; Park, H.M. Audio–visual speech recognition based on dual cross-modality attentions with the transformer model. Appl. Sci. 2020, 10, 7263. [Google Scholar] [CrossRef]
  22. Khare, A.; Parthasarathy, S.; Sundaram, S. Self-supervised learning with cross-modal transformers for emotion recognition. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; pp. 381–388. [Google Scholar]
  23. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar]
  24. Ronchini, F.; Arteaga, D.; Pérez-López, A. Sound event localization and detection based on CRNN using rectangular filters and channel rotation data augmentation. arXiv 2020, arXiv:2010.06422. [Google Scholar]
  25. Klapuri, A.; Davy, M. (Eds.) Signal Processing Methods for Music Transcription; Springer Science & Business Media: New York, NY, USA, 2007. [Google Scholar]
  26. Jiang, D.N.; Lu, L.; Zhang, H.J.; Tao, J.H.; Cai, L.H. Music type classification by spectral contrast feature. In Proceedings of the IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, 26–29 August 2002; Volume 1, pp. 113–116. [Google Scholar]
  27. Dubnov, S. Generalization of spectral flatness measure for non-Gaussian linear processes. IEEE Signal Process. Lett. 2004, 11, 698–701. [Google Scholar] [CrossRef]
  28. Feature Extraction. Available online: https://librosa.org/doc/main/feature.html#spectral-features (accessed on 8 May 2022).
  29. Grey, J.M.; Gordon, J.W. Perceptual effects of spectral modifications on musical timbres. J. Acoust. Soc. Am. 1978, 63, 1493–1500. [Google Scholar] [CrossRef]
  30. Müller, M.; Ewert, S. Chroma Toolbox: MATLAB implementations for extracting variants of chroma-based audio features. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), Miami, FL, USA, 24–28 October 2011. [Google Scholar]
  31. Müller, M.; Kurth, F.; Clausen, M. Audio matching via chroma-based statistical features. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), London, UK, 11–15 September 2005; Volume 2005, p. 6. [Google Scholar]
  32. Mazzon, L.; Yasuda, M.; Koizumi, Y.; Harada, N. Sound event localization and detection using FOA domain spatial augmentation. In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), New York, NY, USA, 25–26 October 2019. [Google Scholar]
  33. Dhillon, A.; Verma, G.K. Convolutional neural network: A review of models, methodologies and applications to object detection. Prog. Artif. Intell. 2020, 9, 85–112. [Google Scholar] [CrossRef]
  34. Ruder, S. An overview of multi-task learning in deep neural networks. arXiv 2017, arXiv:1706.05098. [Google Scholar]
  35. Standley, T.; Zamir, A.; Chen, D.; Guibas, L.; Malik, J.; Savarese, S. Which tasks should be learned together in multi-task learning? In Proceedings of the International Conference on Machine Learning, Virtual, 12–18 July 2020; pp. 9120–9132. [Google Scholar]
  36. McFee, B.; Raffel, C.; Liang, D.; Ellis, D.P.; McVicar, M.; Battenberg, E.; Nieto, O. librosa: Audio and music signal analysis in python. In Proceedings of the 14th Python in Science Conference, Austin, TX, USA, 6–12 July 2015; Volume 8, pp. 18–25. [Google Scholar]
  37. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  38. StepLR. Available online: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html (accessed on 8 May 2022).
  39. Mesaros, A.; Diment, A.; Elizalde, B.; Heittola, T.; Vincent, E.; Raj, B.; Virtanen, T. Sound event detection in the DCASE 2017 challenge. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 992–1006. [Google Scholar] [CrossRef] [Green Version]
  40. Shimada, K.; Koyama, Y.; Takahashi, N.; Takahashi, S.; Mitsufuji, Y. ACCDOA: Activity-coupled Cartesian direction of arrival representation for sound event localization and detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 915–919. [Google Scholar]
  41. Emmanuel, P.; Parrish, N.; Horton, M. Multi-Scale Network for Sound Event Localization and Detection. Technical Report of DCASE Challenge. 2021. Available online: http://dcase.community/documents/challenge2021/technical_reports/DCASE2021_Parrish_95_t3.pdf (accessed on 8 May 2022).
  42. Zhang, Y.; Wang, S.; Li, Z.; Guo, K.; Chen, S.; Pang, Y. Data Augmentation and Class-Based Ensembled CNN-Conformer Networks for Sound Event Localization and Detection. Technical Report of DCASE Challenge. 2021. Available online: http://dcase.community/documents/challenge2021/technical_reports/DCASE2021_Zhang_67_t3.pdf (accessed on 8 May 2022).
Figure 1. The schematic diagram of the decoder. Q, K, and V indicate the query, key, and value in the attention, respectively.
Figure 2. The overall DCMA-SELD architecture. Q, K, and V indicate the query, key, and value in the attention, respectively.
Table 1. Sixteen cases of spatial augmentation. The channel W carrying omnidirectional information does not change during the augmentation. Here "X ← −Y" means that the new X channel is the negated original Y channel.

New azimuth | New elevation θ | New elevation −θ
−ϕ − 90°  | X ← −Y, Y ← −X, Z ← Z | X ← −Y, Y ← −X, Z ← −Z
−ϕ        | X ← X, Y ← −Y, Z ← Z  | X ← X, Y ← −Y, Z ← −Z
−ϕ + 90°  | X ← Y, Y ← X, Z ← Z   | X ← Y, Y ← X, Z ← −Z
−ϕ + 180° | X ← −X, Y ← Y, Z ← Z  | X ← −X, Y ← Y, Z ← −Z
ϕ − 90°   | X ← Y, Y ← −X, Z ← Z  | X ← Y, Y ← −X, Z ← −Z
ϕ         | X ← X, Y ← Y, Z ← Z   | X ← X, Y ← Y, Z ← −Z
ϕ + 90°   | X ← −Y, Y ← X, Z ← Z  | X ← −Y, Y ← X, Z ← −Z
ϕ + 180°  | X ← −X, Y ← −Y, Z ← Z | X ← −X, Y ← −Y, Z ← −Z
Table 2. Feature dimensions and hyper-parameters of DCMA-SELD. αi, βi, γi, and δi denote the learnable weights for the soft parameter sharing on the input embeddings for the i-th convolutional block.

Input (feature extraction)
- DOAE stream (output: B × 7 × 160 × 279):
  4 × 279: log-mel (256-dim) + additional spectral info. (11-dim) + chroma (12-dim);
  3 × 279: IV (256-dim) + zero vector (23-dim)
- SED stream (output: B × 4 × 160 × 279):
  4 × 279: log-mel (256-dim) + additional spectral info. (11-dim) + chroma (12-dim)

Encoder
- Conv. block 1 (output: B × 64 × 80 × 139 per stream):
  Conv2d (3 × 3, 7 → 64 for DOAE / 4 → 64 for SED), BN (64), ReLU;
  Conv2d (3 × 3, 64 → 64), BN (64), ReLU;
  average pooling (2 × 2)
- Soft parameter sharing 1: learnable weights α1, β1, γ1, δ1 (64)
- Conv. block 2 (output: B × 128 × 40 × 69 per stream):
  Conv2d (3 × 3, 64 → 128), BN (128), ReLU;
  Conv2d (3 × 3, 128 → 128), BN (128), ReLU;
  average pooling (2 × 2)
- Soft parameter sharing 2: learnable weights α2, β2, γ2, δ2 (128)
- Conv. block 3 (output: B × 256 × 40 × 34 per stream):
  Conv2d (3 × 3, 128 → 256), BN (256), ReLU;
  Conv2d (3 × 3, 256 → 256), BN (256), ReLU;
  average pooling (1 × 2)
- Soft parameter sharing 3: learnable weights α3, β3, γ3, δ3 (256)
- Conv. block 4 (output: B × 512 × 40 × 17 per stream):
  Conv2d (3 × 3, 256 → 512), BN (512), ReLU;
  Conv2d (3 × 3, 512 → 512), BN (512), ReLU;
  average pooling (1 × 2)
- Average pooling over the frequency axis (output: B × 512 × 40 per stream)

Decoder
- Transformer decoder (× 2 layers, output: B × 512 × 40 per stream), in each stream:
  Masked MHSA (512, heads = 8), LayerNorm (512);
  DCMA (512 → 512, heads = 8), LayerNorm (512);
  FC (512 → 2048), FC (2048 → 512), LayerNorm (512)

Prediction layer
- DOAE stream (output: B × 3 × 40): FC1 (512 → 3), FC2 (512 → 3), FC3 (512 → 3)
- SED stream (output: B × 12 × 40): FC1 (512 → 12), FC2 (512 → 12), FC3 (512 → 12)
Table 3. Ablation study on input features for our DCMA-SELD on the TNSSE 2021 development dataset. All data represent mean values and associated 95% confidence intervals. The highest performance in each metric is indicated in bold.

Feature | ER_20° | F_20° (%) | LE_CD (°) | LR_CD (%) | D_SELD
mel sp. + IV | 0.44 ± 0.02 | 64.8 ± 1.9 | 14.4 ± 0.8 | 79.1 ± 1.9 | 0.269 ± 0.014
mel sp. + add. sp. info. + IV | 0.44 ± 0.02 | 64.2 ± 2.1 | 14.8 ± 0.7 | 78.8 ± 2.1 | 0.273 ± 0.016
mel sp. + chroma + IV | 0.42 ± 0.02 | 65.2 ± 2.1 | 14.0 ± 0.7 | 78.5 ± 1.9 | 0.267 ± 0.015
mel sp. + add. sp. info. + chroma + IV | 0.41 ± 0.02 | 67.1 ± 2.1 | 13.6 ± 0.7 | 79.0 ± 2.0 | 0.256 ± 0.016
Table 4. Performance for different parameter-sharing methods in the encoder of our DCMA-SELD on the TNSSE 2021 development dataset. All data represent mean values and associated 95% confidence intervals. The highest performance in each metric is indicated in bold.

Parameter Sharing | ER_20° | F_20° (%) | LE_CD (°) | LR_CD (%) | D_SELD
No parameter sharing | 0.46 ± 0.02 | 63.2 ± 1.9 | 14.3 ± 0.8 | 75.4 ± 2.1 | 0.287 ± 0.014
Hard parameter sharing | 0.44 ± 0.02 | 65.6 ± 2.0 | 13.6 ± 0.7 | 75.9 ± 2.0 | 0.278 ± 0.015
Soft parameter sharing | 0.41 ± 0.02 | 67.1 ± 2.1 | 13.6 ± 0.7 | 79.0 ± 2.0 | 0.256 ± 0.016
Table 5. Performance for different decoder architectures in our DCMA-SELD on the TNSSE 2021 development dataset. The numbers in the parentheses indicate the numbers of decoder blocks used in the models to perform three predictions. All data represent mean values and associated 95% confidence intervals. The highest performance in each metric is indicated in bold.

Decoder Architecture | Number of Parameters | ER_20° | F_20° (%) | LE_CD (°) | LR_CD (%) | D_SELD
LSTM (3) | 34,617,325 | 0.49 ± 0.02 | 59.2 ± 2.1 | 17.5 ± 0.8 | 76.2 ± 2.2 | 0.309 ± 0.016
Multi-head self-attention (3) | 34,635,757 | 0.42 ± 0.02 | 66.7 ± 2.1 | 13.1 ± 0.7 | 78.0 ± 2.1 | 0.260 ± 0.016
Multi-head self-attention (1) | 17,813,485 | 0.45 ± 0.02 | 63.0 ± 2.2 | 15.8 ± 0.9 | 77.5 ± 2.2 | 0.283 ± 0.017
DCMA-SELD (1) | 26,218,477 | 0.41 ± 0.02 | 67.1 ± 2.1 | 13.6 ± 0.7 | 79.0 ± 2.0 | 0.256 ± 0.016
Table 6. Performance of our DCMA-SELD models trained with and without data augmentation for model training on the TNSSE 2021 development dataset. All data represent mean values and associated 95% confidence intervals. The highest performance in each metric is indicated in bold.

Augmentation | ER_20° | F_20° (%) | LE_CD (°) | LR_CD (%) | D_SELD
Without augmentation | 0.50 ± 0.02 | 58.6 ± 1.9 | 17.9 ± 0.9 | 76.5 ± 1.8 | 0.312 ± 0.014
With augmentation | 0.41 ± 0.02 | 67.1 ± 2.1 | 13.6 ± 0.7 | 79.0 ± 2.0 | 0.256 ± 0.016
Table 7. Performance of our DCMA-SELD and the models submitted to the DCASE2021 challenge on the TNSSE 2021 development dataset. The baseline system in the challenge corresponds to the modified version of the SELDnet in [40]. The numbers after # denote the rank of the model in the challenge. "Rank sum" represents the sum of the rankings of the four metrics used as the challenge ranking evaluation method. The highest performance in each metric is indicated in bold.

System | Feature | Format | ER_20° | F_20° (%) | LE_CD (°) | LR_CD (%) | D_SELD | Rank Sum
Baseline | mel sp. + IV | FOA | 0.73 | 30.7 | 24.5 | 44.8 | 0.528 | 28
(#1) Shimada et al. [15] | magnitude sp. + PCEN sp. + IPD | FOA | 0.43 | 69.9 | 11.1 | 73.2 | 0.265 | 10
(#2) Nguyen et al. [17] | eigenvector-aug. log-sp. + mel sp. | FOA | 0.37 | 73.7 | 11.2 | 74.1 | 0.239 | 7
(#3) Emmanuel et al. [41] | mel sp. + IV | FOA | 0.65 | 57.1 | 17.5 | 62.8 | 0.387 | 24
(#5) Park et al. [19] | mel sp. + IV | Both | 0.44 | 69.6 | 13.7 | 74.2 | 0.270 | 13
(#6) Zhang et al. [42] | mel sp. + IV + GCC-PHAT | Both | 0.46 | 63.2 | 13.9 | 62.9 | 0.319 | 20
DCMA-SELD | mel sp. + add. sp. + chroma + IV | FOA | 0.41 ± 0.02 | 67.1 ± 2.1 | 13.6 ± 0.7 | 79.0 ± 2.0 | 0.256 ± 0.016 | 10