Article

Acoustic Event Detection in Vehicles: A Multi-Label Classification Approach

by Anaswara Antony 1,2, Wolfgang Theimer 2, Giovanni Grossetti 2 and Christoph M. Friedrich 1,*

1 Department of Computer Science, University of Applied Sciences and Arts (FH Dortmund), 44227 Dortmund, Germany
2 Volkswagen Infotainment GmbH, 44803 Bochum, Germany
* Author to whom correspondence should be addressed.
Sensors 2025, 25(8), 2591; https://doi.org/10.3390/s25082591
Submission received: 5 March 2025 / Revised: 2 April 2025 / Accepted: 14 April 2025 / Published: 19 April 2025
(This article belongs to the Section Vehicular Sensing)

Abstract

Autonomous driving technologies for environmental perception are mostly based on visual cues obtained from sensors like cameras, RADAR, or LiDAR. They capture the environment as if seen through “human eyes”. If this visual information is complemented with auditory information, thereby also providing “ears”, driverless cars can become more reliable and safer. In this paper, an Acoustic Event Detection model is presented that can detect various acoustic events in an automotive context along with their time of occurrence to create an audio scene description. The proposed detection methodology uses the pre-trained network Bidirectional Encoder representation from Audio Transformers (BEATs) and a single-layer neural network trained on a database of real audio recordings collected from different cars. The performance of the model is evaluated for different parameters and datasets. The segment-based results for a duration of 1 s show that the model performs well for 11 sound classes with a mean accuracy of 0.93 and F1-Score of 0.39 for a confidence threshold of 0.5. The threshold-independent metric mAP has a value of 0.77. The model also performs well for sound mixtures containing two overlapping events with mean accuracy, F1-Score, and mAP equal to 0.89, 0.42, and 0.658, respectively.

1. Introduction

Everyday sounds such as honking and sirens help a driver make split-second decisions that are crucial to safety. The automotive soundscape is rich with diverse sounds that not only keep the driver alert but also help them make sense of the surroundings. Similarly, autonomous vehicles can benefit from these acoustic cues for safer self-driving. Currently, autonomous driving technologies are predominantly based on data acquired from visual sensors, such as cameras, and long- and short-range sensors like RADAR and LiDAR [1]. Reports from car manufacturers also suggest that autonomous vehicles rely on these vision sensors to sense their surroundings, with only a few exceptions, such as Waymo from Google [2]. The Hearing Car project from Fraunhofer IDMT (https://www.idmt.fraunhofer.de/en/institute/projects-products/projects/the_hearing_car.html, accessed on 2 January 2024), Embedded siren detection from Bosch (https://www.bosch.com/stories/embedded-siren-detection/, accessed on 2 January 2024), and the I-SPOT project by the EU MSCA [3] are among the few recent approaches that equip future cars with a sense of hearing. Academic research on autonomous driving technologies follows a similar pattern. Thus, when it comes to sensing the surroundings, cars are able to “see” the environment but still lack the capability “to hear”. Research has shown that, compared to other sensory signals, sound is less sensitive to harsh weather conditions and occlusion of objects, which helps in better environmental perception [4]. This also provides additional safety and better navigation possibilities. If used to their full potential, audio sensors can thus complement existing sensor modalities by providing backup in the event of a subsystem failure and reducing uncertainty [2]. Acoustic Event Detection (AED) is a step forward in this direction.
AED, also known as Sound Event Detection (SED), detects different sound events in an input audio signal along with the temporal instances at which they occur [5]. Depending on the extent of the overlap of events, an AED system can be monophonic or polyphonic. A monophonic AED system does not detect overlapping events, while a polyphonic system can detect overlapping events [6]. Polyphonic AED is more challenging than monophonic AED, since distinguishing the features of individual sounds is difficult. Natural acoustic environments are polyphonic in nature, implying the existence of multiple sounds at a certain point in time with varying amplitude, timbre, pitch, etc. This paper focuses on developing a polyphonic AED system that can detect the various overlapping sound events within a car. Consider an auditory scene that can happen inside a car as shown in Figure 1. The goal of the proposed AED system is to detect these events along with their time of occurrence as described in Figure 2. Predicting the activity of multiple sound classes for short audio segments can produce such an audio scene description. This implies that the AED problem can be seen as a multi-label classification problem on short audio segments. If the system provides predictions for the duration of the whole audio, it is referred to as tagging; if the predictions describe the activity pattern of the audio classes over time, it is referred to as detection [7].

Related Work

During early research in AED, machine learning techniques based on SVMs, HMMs, GMMs, and NMF were used. Although non-neural-network-based methods do not require large amounts of strongly labeled data, they do not perform very well on polyphonic AED tasks [6]. With advancements in computing resources, DNNs with many linear layers that jointly learn to extract acoustic features and classify them started to emerge. Other architectures, such as FNNs, CNNs, and BLSTMs, were also explored to solve AED problems. These were followed by hybrid models combining more than one network architecture. One of the most popular hybrid models used in AED is the CRNN [6]. The architecture consists of CNN layers stacked with RNN layers that have specific functions [8]. Architectures such as BLSTM [9] and bidirectional GRU [10] constitute the RNN layers. One of the drawbacks of neural-network-based approaches is that they require a large amount of training data [6]. Approaches like deep transfer learning can mitigate the need to acquire large amounts of data [11]. With the advent of the Transformer architecture [12], many approaches using pre-trained Transformer models started to emerge [13,14]. The literature on AED systems also shows that they mostly focus on environmental and everyday sounds that are freely available in audio datasets such as AudioSet [15], ESC50 [16], and Freesound [17]. The number of sound samples that originate in a vehicle environment is limited in these datasets. This paper introduces an AED system that is trained on a dataset containing sounds in an automotive context.
Key contributions of this paper include the following:
  • An ontology that introduces a shared vocabulary to set a context for collecting audio data exclusive to the automotive environment. To the best of our knowledge, an ontology for sound data in an automotive context has never been proposed.
  • Data augmentation methodology to generate synthetic polyphonic sound mixtures with integrated user and command-line interfaces. The methodology can create polyphonic sound mixtures representing audio scenes that are close to reality. The method proves to be efficient for generating synthetic training data.
  • An Acoustic Event Detection model for vehicles that is trained using actual recordings from cars.
The following section introduces the proposed ontology and data augmentation methodology and describes how the AED system is developed using a multi-label classification approach. It gives an overview of the proposed AED system and the components involved. The performance of the classification model under various influencing factors is summarized in the results section.

2. Materials and Methods

The first step in any sound event classification/detection task is to establish a context for collecting audio data that reflect sounds in an automotive environment. Existing ontologies, such as the one defined for the AudioSet dataset [15], cover a wide range of sound events. However, a closer look shows that they are rather generic and do not cover some sounds that are exclusive to the automotive environment, for instance, the sound of wipers, indicators, or HVAC. Therefore, the ontology described in Figure 3 proposes a hierarchical categorization of the different sounds that originate in an automotive environment. The UrbanSound8K dataset, which derived a taxonomy for urban sounds, defined four top-level groups of sound sources: human, nature, mechanical, and music [18]. Similar top-level groups can also be applied in this case. Note that the ontology is a work in progress and can be reformulated later by adding or removing events.

2.1. Methodology

The performance of multi-label classification models depends heavily on the complexity of the model as well as on the amount of training data available. The freely available audio datasets contain only a limited number of sound samples for the defined ontology. Therefore, deep transfer learning approaches using pre-trained models that are trained on large audio datasets can be used for downstream classification tasks [11]. This approach does not require a large amount of data for the target task, which, in this case, is the classification of the sounds defined in the ontology. Many freely available pre-trained networks trained on large audio datasets exist. They are mostly based on CNN [19] and Transformer [12] architectures.
The performance of these networks on the AudioSet dataset, the complexity of the models in terms of the number of parameters, and the license information are summarized in Table 1. It can be inferred from the table that the transformer-based network BEATs [20] shows high performance on the AudioSet dataset, as well as the highest complexity compared to the others. In addition, its license does not restrict free usage. The complexity of the model cannot be ignored, as training such large models requires sufficient resources. However, the pre-trained model weights of BEATs are freely available and can be used as a frozen feature extractor. In this way, the model is not retrained and retains its original weights. Moreover, BEATs uses a self-supervised learning approach during pre-training, so its representation learning network does not show bias towards the labels of the source domain task. BEATs was also used as a frozen feature extractor in the baseline for the DCASE 2023 challenge on Sound Event Detection with Weak Labels and Synthetic Soundscapes (https://dcase.community/challenge2023/task-sound-event-detection-with-weak-labels-and-synthetic-soundscapes, accessed on 19 February 2024). All these factors make BEATs a good candidate. For resource-constrained environments, models such as YAMNet that offer good performance with much lower complexity can be used. BYOL also offers good performance in downstream classification tasks compared to YAMNet with few training parameters; however, its mAP value on AudioSet is not available, and it is therefore not considered.
Figure 4 summarizes the proposed approach for the AED system using the pre-trained network BEATs [20]. Single-channel, 32-bit audio with a sampling rate of 16 kHz is fed to the BEATs representation network, which extracts features of the input audio (refer to Section 2.5). These audio features are fed to the downstream classifier model, which represents the additional layers on which the target domain data are trained. Due to the high complexity of the BEATs model, its parameters are kept frozen. This means that during training, only the parameters of the classifier model are updated to perform multi-label classification. Thus, BEATs acts as a frozen feature extractor, while the classifier model is trained on these features to predict the corresponding labels.
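A minimal loading sketch is given below. It assumes the reference implementation released in the BEATs repository (the BEATs and BEATsConfig classes and the extract_features call follow that repository's usage example; the checkpoint file name is a placeholder) and freezes all encoder parameters so that only the downstream classifier is trained.

```python
import torch
from BEATs import BEATs, BEATsConfig  # reference implementation from the BEATs repository

# Load a released pre-trained checkpoint (file name is a placeholder).
checkpoint = torch.load("BEATs_checkpoint.pt", map_location="cpu")
cfg = BEATsConfig(checkpoint["cfg"])
beats = BEATs(cfg)
beats.load_state_dict(checkpoint["model"])

# Freeze the representation network so that only the classifier is trained.
beats.eval()
for p in beats.parameters():
    p.requires_grad = False

# One second of 16 kHz mono audio -> a sequence of 768-dimensional embeddings.
waveform = torch.randn(1, 16000)                 # (batch, samples), dummy input
padding_mask = torch.zeros(1, 16000).bool()      # no padding in this example
with torch.no_grad():
    embeddings = beats.extract_features(waveform, padding_mask=padding_mask)[0]
print(embeddings.shape)                          # roughly (1, 48, 768) for 1 s of audio
```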
Figure 5 describes the structure of the proposed AED system during the training phase. The different parts of the system are as follows:
  • Audio Database: Provides audio data and the corresponding labels with the onset and offset times to train and test the classifier
  • Audio Data Augmentation system: Responsible for generating augmented audio with overlapping sound events and its corresponding labels
  • Data preprocessor: Processes audio data and labels for the embedding extractor and classifier, respectively.
  • Embedding extractor: Transforms the input audio into embeddings that encapsulate the audio features
  • Classifier: Learns to predict the audio events from these embeddings using the reference labels provided by the data preprocessor

2.2. Audio Database

The Audio Database corresponds to the audio data for training, evaluating, and testing the classifier model. Audio events mentioned in the ontology are recorded using three A2B microphones [26] fitted on the roof of the car. Manufactured by Analog Devices, each A2B chip (node) is a digital microphone module consisting of three microphone channels. It uses MEMS technology for the microphone modules, with analog-to-digital conversion integrated into the AD242x chip. Using these microphones, recordings were collected from three different Volkswagen cars driving around the city of Bochum. The recorded audio (48 kHz, 9-channel, WAV file format) is manually annotated to generate label files (CSV file format), which contain information on the sound events along with their onset and offset times.
There are also a few datasets that contain some of the audio events mentioned in the ontology, which are freely available on the Internet. To increase the amount of audio data, data were also collected from these datasets. The audio database thus corresponds to audio data from the two sources:
  • Audio recordings using A2B microphones: This corresponds to 71 min of audio recordings of 14 different sound events (Baby crying, Dog barking, Door, Driving noise, HVAC, Honking, Indicator, Knocking at the car, Passenger noise, Speech, Stationary engine noise, Vehicle passing by, Window, Wiper), available in both FLAC and WAV formats.
  • Internet: Audio data corresponding to events in the ontology, downloaded from Pixabay (https://pixabay.com/sound-effects/, accessed on 24 February 2024). They are mainly available in MP3 and WAV file formats. More data can be collected from the Internet based on the results of the experiments using the database from the recordings.

Exploratory Data Analysis (EDA)

Figure 6 and Figure 7 describe the audio database containing recordings of 14 events totaling 71 min of audio data. Figure 6 plots the number of audio recordings for each label. It must be noted that these recordings contain only the annotated label and there is no overlapping of events. It can be inferred from the figure that the dataset is heavily imbalanced, with a large number of samples available for events such as Driving noise and Vehicle passing by, while only a few samples are available for events such as Baby crying and Wiper. If the imbalance ratio is defined as the ratio of the highest number of samples (Vehicle passing by: 35) to the lowest number of samples (Dog barking: 2), then the imbalance ratio of this dataset is 17.5.
Figure 7 plots the total duration of each recorded sound event. From Figure 7, it can be inferred that the event Driving noise has the longest duration. This can be attributed to the nature of the sound: events such as Door and Honking are transient, whereas events such as Stationary engine noise, HVAC, and Driving noise are continuous. The duration of the sound events also contributes to the imbalance in the dataset. Analyzing the duration of each annotated sound event across all recordings, the maximum duration belongs to the event Driving noise at 148 s, while Honking has the shortest duration at 538 ms.

2.3. Audio Data Augmentation System

In order to increase the amount of data for training and testing the model, an audio augmentation methodology is proposed that combines different augmentation techniques. The ADA component in Figure 5 corresponds to the system that uses this methodology to generate augmented audio by mixing individual sound events. It can be used to generate synthetic training data containing multiple overlapping sound events, leading to polyphonic sound mixtures. This approach also yields accurate annotations, as the start and end times are determined by the mixing algorithm [7]. It can also be used to simulate a variety of acoustic situations or auditory scenes. Therefore, the system helps to generate data for polyphonic AED. The proposed augmentation methodology is described below:
Let $\{s_i(n)\}$ be the set of sound signals to be mixed, where $i = 1, 2, \ldots, I$ is the sound signal index and $n = 0, 1, 2, \ldots, N$ is the time index. Let $\hat{s}_i(n)$ be the normalized sound signal, defined as

$$\hat{s}_i(n) = \frac{s_i(n)}{\max_n \lvert s_i(n) \rvert}$$

Thus, the amplitude values of the normalized signal lie in the interval [−1, 1]. This helps to achieve a uniform range of amplitude values for all signals. Now, the mixing equation can be defined as

$$y(n) = \sum_{i=1}^{I} A_i \cdot \hat{s}_i(n - k_i)$$
where $A_i$ represents the amplitude scaling factor, a random value drawn from the interval [0, 1], and $k_i$ is a random time shift selected from the user-specified range for each sound signal. Figure 8 gives an overview of the augmentation process. As can be seen from the figure, the user can select one or more sound events, apply amplitude scaling within a selected range, and mix them within a specified range of time shifts. The user can select any number of sound events and control the extent of overlap by changing the range of the time shift. In this way, ADA can be used to generate a large amount of synthetic data for training, testing, and evaluation.
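A minimal NumPy sketch of the mixing equation above is shown below. The function name mix_events and its parameters (amp_range, max_shift, length) are illustrative and do not correspond to ADA's actual interface; the returned onsets and offsets illustrate how the mixing algorithm determines the annotations.

```python
import numpy as np

def mix_events(signals, rng=None, amp_range=(0.0, 1.0), max_shift=16000, length=None):
    """Mix normalized sound events with random amplitude scaling and time shifts,
    following y(n) = sum_i A_i * s_hat_i(n - k_i)."""
    rng = rng or np.random.default_rng()
    length = length or max(len(s) for s in signals) + max_shift
    mixture = np.zeros(length, dtype=np.float32)
    annotations = []
    for s in signals:
        s_hat = s / np.max(np.abs(s))            # normalize amplitude to [-1, 1]
        a = rng.uniform(*amp_range)              # amplitude scaling factor A_i
        k = int(rng.integers(0, max_shift + 1))  # time shift k_i in samples
        end = min(k + len(s_hat), length)
        mixture[k:end] += a * s_hat[: end - k]
        annotations.append((k, end))             # onset/offset of this event in samples
    return mixture, annotations
```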

2.4. Data Preprocessor

The data preprocessor is the component that processes the audio data and the labels (annotations) from the Audio Database and the ADA system for training and testing the classifier. It is mainly responsible for the following:
  • Processing of labels
  • Processing of audio
  • Generating datasets for iterative-stratified validation
Multi-label classification on short temporal segments of audio can describe the activity of the selected sound classes in an audio signal. In order to define the temporal resolution for prediction, it is recommended to examine the statistics of the training data or general assumptions about the target events [7]. BEATs uses a window length of 25 ms with a shift of 10 ms, resulting in audio frames of 25 ms duration. The minimum duration of an audio event in the training data is 573 ms, while the median duration is 6.75 s. Since 25 ms is too short given the annotated duration of each audio event, and most annotated events last at least one second, one second is a more reasonable choice for the temporal resolution. Pre-trained networks such as YAMNet also use a prediction interval of 0.96 s, which approximates 1 s [21]. Figure 9 describes how audio and labels are processed by the data preprocessor so that, for a given input audio, the classifier predicts audio events every 1 s.
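A minimal sketch of the label processing step is given below, assuming annotations as (label, onset, offset) tuples in seconds; the function and variable names are illustrative and not taken from the implementation. Each 1 s segment receives a multi-hot target vector in which a class is marked active if its annotation overlaps the segment.

```python
import numpy as np

SEGMENT_S = 1.0  # temporal resolution for prediction

def segment_labels(annotations, audio_duration, class_names):
    """Convert (label, onset, offset) annotations into one multi-hot target
    vector per 1 s segment of the recording."""
    n_segments = int(np.ceil(audio_duration / SEGMENT_S))
    targets = np.zeros((n_segments, len(class_names)), dtype=np.float32)
    class_index = {name: i for i, name in enumerate(class_names)}
    for label, onset, offset in annotations:
        first = int(onset // SEGMENT_S)                 # first segment touched by the event
        last = int(np.ceil(offset / SEGMENT_S))         # one past the last segment touched
        targets[first:last, class_index[label]] = 1.0
    return targets
```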

Iterative Stratification

One of the drawbacks of cross-validation is that it splits the dataset into disjoint sets of equal size [27]. If the dataset is imbalanced, with a large variation in the number of samples available for each class, a stratified version of cross-validation is used, in which the class distribution is maintained across folds. Iterative stratification is an alternative for the stratification of multi-label data. The algorithm is particularly effective when the dataset has rare class samples [27]. As inferred in Section Exploratory Data Analysis (EDA), the audio database is heavily imbalanced, with an imbalance ratio of 17.5. Therefore, iterative stratification is used to split the dataset for multi-label classification. The data preprocessor is also responsible for generating these partitions of the dataset based on iterative stratification. It ensures that the same recordings are not split across the training and validation sets. The number of folds used for iterative stratification is 10.
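One way to realize such a 10-fold split is with the IterativeStratification class of the scikit-multilearn library, which is not named in the paper; the sketch below assumes multi-hot label vectors per sample and omits the additional constraint of keeping all segments of one recording in the same fold.

```python
import numpy as np
from skmultilearn.model_selection import IterativeStratification

# X: one row per sample (e.g., a file index); Y: multi-hot label matrix (dummy data here).
X = np.arange(200).reshape(-1, 1)
Y = np.random.randint(0, 2, size=(200, 14))

stratifier = IterativeStratification(n_splits=10, order=1)
for fold, (train_idx, val_idx) in enumerate(stratifier.split(X, Y)):
    # Label distributions are kept approximately equal across folds,
    # which matters for rare classes such as Dog barking.
    print(fold, len(train_idx), len(val_idx))
```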

2.5. Embedding Extractor

The Embedding extractor component generates embeddings that represent the features of a given input audio. In the proposed approach to the AED system, the embedding extractor is the BEATs model. The model accepts input audio in tensor format [25] and extracts embeddings that represent its features. These embeddings can serve as input features for a downstream task [7]; in this case, they are the input features for the classifier to be trained.
The backbone network of BEATs uses a Vision Transformer (ViT) [28] structure with a linear projection layer and a stack of Transformer encoder layers. The backbone network consists of 12 Transformer encoder layers with 768-dimensional hidden states and 8 attention heads. One of the most commonly used audio features in Transformer-based networks is the Mel filter bank representation [13,29,30]. Given an input audio waveform of $t$ seconds, BEATs extracts 128-dimensional Mel filter bank features by applying a sliding Povey window [31] of 25 ms every 10 ms. This results in a $128 \times 100t$ spectrogram that is normalized to a mean of 0 and a standard deviation of 0.5 [20]. The spectrogram is split into $N$ patches of size $16 \times 16$ with an overlap of 6 in both the time and frequency dimensions. This sequence of $N = 12\lceil(100t - 16)/10\rceil$ patches is fed to the linear projection layer, where each $16 \times 16$ patch is flattened to a 1D patch embedding of size 768. To capture the spatial structure of the 2D audio spectrogram, a trainable positional embedding of size 768 is also added [23]. The resulting sequence is fed to the Transformer encoder layers to obtain an embedding of dimension 768.
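The front end described above can be approximated with torchaudio's Kaldi-compatible filter bank, as sketched below; the normalization uses per-utterance statistics as a stand-in, whereas BEATs uses fixed dataset-level constants from pre-training.

```python
import torch
import torchaudio.compliance.kaldi as kaldi

waveform = torch.randn(1, 16000)  # 1 s of 16 kHz mono audio (dummy data)

# 128-dimensional Mel filter bank features, 25 ms Povey window, 10 ms shift,
# matching the front end described for BEATs.
fbank = kaldi.fbank(
    waveform,
    num_mel_bins=128,
    frame_length=25.0,
    frame_shift=10.0,
    sample_frequency=16000,
    window_type="povey",
)
print(fbank.shape)  # about 100 frames of 128 Mel bins for 1 s of audio

# Normalize so that the spectrogram has mean 0 and standard deviation 0.5
# (per-utterance statistics used here for illustration only).
fbank = (fbank - fbank.mean()) / (2 * fbank.std())
```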

2.6. Classifier

The BEATs model uses a task-specific linear classifier on top of the encoder to generate labels for downstream classification tasks [20]. Following a similar approach, a linear classifier is chosen for the classifier model of the AED system. The BEATs model shows good results for downstream tasks such as audio classification and speech classification [20]. The linear classifier can be a neural network or an algorithm such as an SVM.

2.6.1. Linear Classifier

The linear classifier receives the input embeddings and calculates the output probability of each class/label. Classes with a probability higher than a defined threshold are the predicted classes. Since BEATs generates embeddings of dimension 768, the input to the classifier network is a tensor of dimension 768, utilizing the full representation network, while the output vector has dimension 14, corresponding to the number of labels. Thus, the proposed classifier model is a single-layer neural network with 768 input neurons and 14 output neurons. A single layer is chosen because adding layers would increase complexity and could affect real-time inference capabilities.
When an audio segment with a duration of 1 s is fed to the BEATs model, it extracts 128-dimensional features every 10 ms, generating 100 frames of dimension 128 (100, 128). These features have to be converted to patches of size $16 \times 16$. After confirming with one of the authors of the BEATs paper, it is understood that, since 100 frames are not divisible by 16, the last 4 frames are discarded to obtain 96 frames. These are then converted to $16 \times 16$ patches, yielding 48 patches ($(128/16) \cdot (96/16)$). This is fed to the backbone network with 768-dimensional hidden states, producing an embedding tensor of dimension (48, 768). Since the output corresponds to 48 embeddings for 1 s of audio, the mean is calculated along the first dimension before applying the sigmoid activation function. The result is an output tensor of dimension (1, 14).
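A minimal PyTorch sketch of this classifier head is given below, assuming the frozen BEATs encoder already provides the (48, 768) embedding tensor per 1 s segment; the class name LinearAEDClassifier is illustrative. Mean pooling over the 48 patch embeddings reduces each segment to a single 768-dimensional vector before the linear projection.

```python
import torch
import torch.nn as nn

class LinearAEDClassifier(nn.Module):
    """Single-layer classifier on top of frozen BEATs embeddings:
    mean-pool the (48, 768) patch embeddings of a 1 s segment, then
    map to 14 class probabilities with a sigmoid."""

    def __init__(self, embedding_dim: int = 768, num_classes: int = 14):
        super().__init__()
        self.linear = nn.Linear(embedding_dim, num_classes)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, 48, 768) from the frozen BEATs encoder
        pooled = embeddings.mean(dim=1)              # (batch, 768)
        return torch.sigmoid(self.linear(pooled))    # (batch, 14) class probabilities

# Example: one batch of embeddings for 1 s segments.
classifier = LinearAEDClassifier()
probs = classifier(torch.randn(8, 48, 768))
print(probs.shape)  # torch.Size([8, 14])
```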

2.6.2. Model Hyperparameters

The following model hyperparameters and their initial values are considered; a minimal training-step sketch using these settings is given after the list.
  • Learning rate: The step size at which the optimizer moves towards the minimum of the loss function for each iteration. The initial learning rate is set to 0.00001.
  • Optimizer: The optimizer used is AdamW [32] which uses the adaptive gradient algorithm, Adam [33]. It modifies the optimization algorithm of Adam by decoupling weight decay from the optimization steps.
  • Number of epochs: The number of training iterations of the complete training dataset. The number of epochs is set to 50.
  • Batch size: A batch size of 32 is set, which corresponds to the number of samples loaded during each iteration of the training loop.
  • Loss function: BCE loss is used to calculate the loss between the predictions and the target variables. The other alternative is focal loss [34] to deal with an imbalanced dataset.
  • Activation function: Sigmoid activation function is used to calculate the output probabilities of each class.
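The training-step sketch referenced above is given below; it uses the single-layer head from Section 2.6.1 and the hyperparameter values listed here, and it is only a schematic of one mini-batch update, not the full training script.

```python
import torch
import torch.nn as nn

# Initial hyperparameter values as listed above.
LEARNING_RATE = 1e-5   # 0.00001
BATCH_SIZE = 32        # samples per mini-batch (drives the DataLoader, not shown)
NUM_EPOCHS = 50        # passes over the training set (outer loop, not shown)

classifier = nn.Linear(768, 14)        # single-layer classifier head (Section 2.6.1)
optimizer = torch.optim.AdamW(classifier.parameters(), lr=LEARNING_RATE)
criterion = nn.BCELoss()               # BCE on sigmoid-activated probabilities

def training_step(embeddings: torch.Tensor, targets: torch.Tensor) -> float:
    """One optimization step on a mini-batch of frozen BEATs embeddings
    with shape (BATCH_SIZE, 48, 768) and multi-hot targets (BATCH_SIZE, 14)."""
    optimizer.zero_grad()
    probs = torch.sigmoid(classifier(embeddings.mean(dim=1)))  # mean-pool, project, sigmoid
    loss = criterion(probs, targets.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```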

2.6.3. Evaluation of Classifier

For model development, simple computational metrics are used by comparing the system output with reference annotations. There are two approaches to evaluating AED systems [7]:
  • Segment-based evaluation: The system’s output is compared with the reference annotation at a fixed temporal resolution, e.g., 1 s.
  • Event-based evaluation: The system output is compared with the reference annotation in terms of how well the onset/offset times are detected for each sound event. The metric allows for some degree of temporal misalignment, called a collar, between the reference and the prediction. A 200 ms collar indicates that a misalignment of up to 200 ms is allowed between the predicted and reference onset and offset times.
These two evaluation approaches present two different views of system performance. Segment-based metrics evaluate the temporal activity of each sound event in a given audio, while event-based metrics determine the system’s performance in detecting individual instances of sound [7]. As the proposed AED system is intended to describe an auditory scene, segment-based evaluation suits the application better. Also, since annotations for audio segments of 1 s duration are considered for classification, segment-based metrics are a better choice. However, segment-based metrics have the drawback of overemphasizing the contribution of longer events that span multiple segments [35].
The following metrics are used for the evaluation of the performance of the model [7]; a minimal computation sketch follows the list.
  • Threshold-dependent metrics, Accuracy, F1-Score, Precision, Recall: They are micro-averaging metrics that take class imbalance into consideration by computing the metric on the aggregated values. They are computed for a threshold equal to 0.5.
  • Threshold-independent metrics, mAP and AUC: They are computed with macro averaging, which does not consider class imbalance, as the metric for each class is calculated separately and the average is taken.
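The sketch below shows how these segment-based metrics can be computed with scikit-learn (a library choice not stated in the paper), using random placeholder data in place of the real devset predictions.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

# y_true: multi-hot reference labels per 1 s segment; y_score: predicted probabilities.
y_true = np.random.randint(0, 2, size=(500, 11))
y_score = np.random.rand(500, 11)
y_pred = (y_score >= 0.5).astype(int)            # confidence threshold of 0.5

# Micro-averaged, threshold-dependent metrics computed on aggregated counts.
f1_micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
accuracy_micro = (y_pred == y_true).mean()       # accuracy over all segment/class decisions

# Macro-averaged, threshold-independent metrics (per-class values, then averaged).
map_macro = average_precision_score(y_true, y_score, average="macro")
auc_macro = roc_auc_score(y_true, y_score, average="macro")
```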

3. Results

The following factors are considered when evaluating the performance, training time, and testing time of the model. Keeping the other factors constant, only one factor is changed at a time for performance assessment.
  • Dataset
  • Number of epochs
  • Batch size
  • Loss Function

3.1. Experiments with Different Datasets

The dataset described in Section Exploratory Data Analysis (EDA) is an imbalanced dataset with an imbalance ratio of 17.5. It is referred to as dataset_1 henceforth. It also contains faint sound events or noises, such as driving noises. In addition, no data augmentation is performed on this dataset to train the model. Performance metrics are calculated on the devset during model testing, which contains test-time-augmented samples [36] in addition to the original samples. The test-time augmentation is performed by changing the pitch of each sample by 3, 4, 6, −3, −4, and −6 semitones. The evaluation metrics reported are for devsets with test-time-augmented data, unless otherwise specified.
In the initial experiment, the model trained on dataset_1 shows a mean accuracy of 0.95 on the devset datasets with a standard deviation of 0.14 across 10 folds. The mean mAP is only 0.39, with a standard deviation of 0.16 across 10 folds. The high accuracy and low mAP can be attributed to the imbalance in the dataset. Events such as Driving noise have more samples with longer durations and therefore contribute more to the aggregated true positives and true negatives during micro-averaging.
Since dataset_1 is heavily imbalanced, a second dataset, dataset_2, is collected to compare the performance of the model against that obtained with dataset_1. The imbalance is reduced by collecting more training samples from the Internet (source: Pixabay Sounds (https://pixabay.com/sound-effects/, accessed on 29 April 2024)) and removing the faint sound events Driving noise, Stationary engine noise, Passenger noise, and HVAC. An additional sound event from the defined ontology, Siren, is collected from the Internet. dataset_2 also contains augmented data in the training and devset sets for each fold: the pitch of each sample is changed by 3, 4, 6, −3, −4, and −6 semitones in both sets. Figure 10 describes the distribution of the number of samples, and Figure 11 describes the duration of each label in dataset_2. The imbalance ratio of dataset_2 is 5.8, compared to 17.5 for dataset_1. The total number of labels/classes is reduced from the initial 14 events to 11.
The model trained on dataset_2 shows a mean accuracy of 0.92 on the devset datasets with a standard deviation of 0.16 across 10 folds. The mean mAP is 0.57 with a standard deviation of 0.16 across 10 folds. The accuracy decreased by only 0.03, while the mAP improved by 0.18. Table 2 summarizes the devset accuracy and mAP across the 10 folds for both datasets. It can be inferred from the table that dataset_2 shows better performance even with augmented training and devset data. The results also indicate that the model does not perform well on imbalanced datasets with faint sounds or noise.

3.2. Experiments with Different Batch Sizes

The batch size is one of the hyperparameters that can influence the performance of the model. The performance of the model is estimated for batch sizes 128, 64, and 32 using dataset_2, keeping the other parameters unchanged. Table 3 summarizes the results showing the mean accuracy and mAP with the standard deviation.
The results show that a batch size of 32 yields the best performance, and the performance decreases as the batch size increases. Since the AdamW optimizer adapts the learning rate during training, the same initial learning rate is used for each batch size. A batch size of 128 shows the least deviation of values between folds.

3.3. Experiments with Different Loss Functions

Following BEATs, the BCE loss is used to calculate the loss between the predictions and the target variables. The BCE loss computes the logarithm of the probabilities of the target/reference values and the predictions. During training, a small subset of training samples, called a mini-batch, is fed to the network to compute the gradient of the loss function. Therefore, the BCE loss is calculated per mini-batch, resulting in one loss value per sample.
The BCE loss $\ell(x, y)$ for a batch of output predictions ($x_n$) and reference values ($y_n$) is calculated as follows (https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html, accessed on 8 April 2024):

$$L = \{l_1, \ldots, l_N\}, \quad l_n = -\alpha_n \left[\, y_n \cdot \log x_n + (1 - y_n) \cdot \log(1 - x_n) \,\right], \quad n = 1, 2, \ldots, N$$

where $N$ is the batch size, $\ell(x, y) = \mathrm{mean}(L)$, and $\alpha_n$ is a scaling factor applied to each element to balance the loss. If $x_n = 0$ or $(1 - x_n) = 0$, the corresponding logarithm tends to minus infinity. Following the solution proposed in the PyTorch documentation for BCELoss, the output of the log function is clamped to be greater than or equal to −100 so that a finite loss is obtained.
Focal loss (FL) is an extension of the BCE loss for dealing with imbalanced datasets [34]. It adds a modulating factor $(1 - x_n)^{\gamma}$ to the cross-entropy function. When $\gamma > 0$, the relative loss for samples with a predicted probability $x_n > 0.5$ is reduced, while the loss for samples with $x_n < 0.5$ remains high. In this way, more focus is placed on samples that are misclassified. The focal loss $FL(x_n)$ and the BCE loss $CE(x_n)$ for $y_n = 1$ and $\alpha_n = 1$ are calculated as follows:

$$FL(x_n) = -(1 - x_n)^{\gamma} \log(x_n), \qquad CE(x_n) = -\log(x_n)$$
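A minimal PyTorch implementation of this focal loss on sigmoid probabilities is sketched below; the function name and the clamping constant are illustrative, and the α and γ defaults follow the initial settings used in the experiments. It can be dropped in as a replacement for the BCELoss criterion in the training step of Section 2.6.2.

```python
import torch

def focal_loss(probs: torch.Tensor, targets: torch.Tensor,
               alpha: float = 1.0, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss on predicted probabilities, following Lin et al. [34]:
    the BCE term of each element is down-weighted by (1 - p_t)^gamma, where
    p_t is the predicted probability of the reference class."""
    eps = 1e-7                                        # avoid log(0), cf. the clamping in BCELoss
    probs = probs.clamp(eps, 1.0 - eps)
    p_t = torch.where(targets == 1, probs, 1.0 - probs)
    bce = -torch.log(p_t)                             # element-wise cross entropy
    return (alpha * (1.0 - p_t) ** gamma * bce).mean()
```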
The performance of the model is evaluated with both the BCE and focal loss functions. The parameter $\alpha$ is set to 1, while $\gamma$ is initially set to 2 [34]. The results are summarized in Table 4. They show an improvement in both accuracy and mAP on the devset across 10 folds for the focal loss compared to the BCE loss.

3.4. Model Evaluation on Unseen Data

The experiments discussed so far showed the best results on the devset dataset when the influencing factors are the following.
  • Dataset: dataset_2 shows better results than dataset_1
  • Model Hyperparameter, batch size: A batch size of 32 shows the best performance compared to sizes 64 and 128.
  • Model Hyperparameter, loss function: The focal loss function for imbalanced data yields better results than the BCE loss.
These values are obtained from the evaluation on the 10-fold devset. The model is then evaluated on an independent test set that is unseen by the model, that is, a dataset that was not used to tune these parameters. This dataset is henceforth referred to as the unseen dataset. A new data sample for each audio event is collected from the Internet to generate the unseen dataset. The threshold-dependent metrics F1-Score, Precision, Recall, and Accuracy with a threshold of 0.5 are considered for evaluation. In addition, the threshold-independent metrics mAP, average precision, and ROC are considered. The metrics for each class are obtained to understand the influence of each class on the mean values of F1-Score, accuracy, and mAP. This also helps to understand how well the model can detect each class.
Figure 12 describes the distribution of the number of samples, and Figure 13 describes the duration of each label. With the updated parameters, the model is trained using the entire dataset_2 (training set and devset), while the model is evaluated on the unseen dataset without test-time augmentation.
Table 5 lists the values of accuracy, F1-Score, precision, recall, and average precision for each class. Figure 14 plots the ROC curve of each class. The AUC number corresponds to the index of each class in the list.
The results provided in Table 5 suggest that the events Baby crying, Honking, and Knocking at the car show good performance according to the threshold-dependent metrics accuracy and F1-Score. The threshold-independent metric, average precision, shows that, in addition to these events, Siren and Wiper also perform well, with an average precision greater than 0.8. The ROC curves plotted for each class and their corresponding AUC values in Figure 14 show similar results. All classes have an AUC value greater than 0.8, indicating good performance of the classifier. However, the events Indicator and Speech have an average precision of less than 0.4.
The mean accuracy, F1-Score, and mAP values are 0.93, 0.39, and 0.77, respectively. The improved mAP value of the model on unseen data may be due to the fact that the data collected from the Internet contain clear and loud audio signals.

3.5. Evaluation Results on Augmented Data

The performance of the model is also evaluated after augmenting the unseen data by changing the pitch of each sample by 3, 4, 6, −3, −4, and −6 semitones, thereby increasing the number of test samples. This also helps to assess the model’s robustness to pitch variations. Table 6 lists the values of accuracy, F1-Score, precision, recall, and average precision for each class. Figure 15 plots the ROC curve of each class.
The mean accuracy, F1-Score, and mAP values are 0.93, 0.36, and 0.71, respectively. The performance of the model decreases when the data are augmented. However, the model still shows good performance.

3.6. Experiments with Data Containing Overlapping Events

The experiments discussed so far use audio data containing only a single sound event. To understand how well the selected approach distinguishes multiple sound events in a mixture of sounds with varying amplitudes that occur at different instances of time, a dataset generated using ADA is used. With the ADA augmentation methodology, synthetic data can be generated that contain multiple overlapping sound events with varying amplitudes. The ADA parameters are set to generate sound mixtures of at least two events that completely overlap. The following values are set for the ADA parameters; a sketch of how the label pairs are enumerated follows the list.
  • Labels: Combination of two labels. There are 11 labels available. A combination of all the labels results in C ( 11 , 2 ) = 55 augmented files. This will ensure the overlapping of each event with other events so that there is no bias to any event.
  • Frequency of labels: 1 for each label.
  • Range of time-shift: The range of (0, 0.1) ensures that the two sound events completely overlap as they begin in this small interval.
  • Scaling factor: The amplitude scaling factor is set to a range of (0.2, 0.5). A random value in this range will be applied for scaling.
  • Duration: The duration of the augmented audio is set to 20 s. This ensures that longer sound events are not clipped. However, while training and testing the model, only 1 s audio frames corresponding to the start time are fed to the model.
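The enumeration of the 55 label pairs referenced above can be sketched as follows; the label list is reconstructed from the 11 classes of dataset_2 described in Section 3.1, and the call into ADA itself is omitted.

```python
from itertools import combinations

LABELS = [
    "Baby crying", "Dog barking", "Door", "Honking", "Indicator",
    "Knocking at the car", "Siren", "Speech", "Vehicle passing by",
    "Window", "Wiper",
]

# All unordered pairs of the 11 labels: C(11, 2) = 55 augmented mixtures,
# so every event overlaps every other event exactly once.
label_pairs = list(combinations(LABELS, 2))
print(len(label_pairs))  # 55
```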
The training and devset datasets in fold 1 are fed to ADA to generate augmented files. The number of samples is the same for each label, since all combinations of two labels are generated. Figure 16 describes the duration of each label. Pitch augmentation by 3, 4, 6, −3, −4, and −6 semitones is applied to each sample in both the training and devset sets. The model is developed using only the datasets from fold 1.
Since the dataset is balanced, the model is trained and evaluated with both the BCE and focal loss functions. Table 7 summarizes the results, which show that the focal loss performs slightly better than the BCE loss. This may be because, although the dataset is balanced in terms of samples, the duration of each sound still affects the balance, since 1 s audio frames are fed to train the model.
The decrease in performance compared to the model trained on dataset_2 is expected, as the data are more complex, with multiple overlapping sound events. To understand the effect of transfer learning, the model is initialized with the weights of the model trained on dataset_2. After initialization, the model is trained with the same parameters using the focal loss function. The experiment shows a small improvement in performance: the accuracy remains more or less the same at 0.88, while the mAP increases to 0.62. Figure 17 summarizes the results of the three experiments. From the figure, it can be inferred that initializing the model with weights from a model trained on non-overlapping events can improve performance.

3.7. Model Evaluation on Unseen Data with Overlapping Events

The unseen dataset (Figure 12 and Figure 13) contains audio files of a single event. This dataset is fed to ADA to generate synthetic data of overlapping events. The same ADA parameters are used. The number of samples for each label will be the same. The duration of each audio event is shown in Figure 18.
Table 8 summarizes the results of the evaluation for a threshold of 0.5 for the unseen dataset without test-time augmentation. The threshold-dependent metrics for the classes Baby crying and Honking show good results with an F1-Score greater than 0.8. In addition to these classes, the average precision value for the class Siren is close to 1, indicating good performance. It must be noted that the average precision of the class Siren increased from 0.797 to 0.99 when there is an overlap of events. The result suggests that Siren can be detected when there is an overlap of two events. However, this can only be validated with further experiments. Figure 19 plots the ROC curve for each class with their corresponding AUC values. The AUC value is equal to 1.0 for these classes. All classes have a value greater than 0.8, showing good detection performance for all classes.
The mean accuracy, F1-Score and mAP values are 0.89, 0.42, and 0.658, respectively. The accuracy and mAP values decreased, while there was a slight improvement in the F1-Score. Thus, the selected approach indeed performs well when there is an overlapping of events with varying amplitudes. However, this can only be validated with further experiments using varying degrees of overlapping of events.

3.8. Evaluation Results on Augmented Data

The experiment is repeated with the test-time augmented data. Each sample in the dataset is augmented by changing the pitch by 3, 4, 6, −3, −4, −6 semitones. The results with the augmented data are summarized in Table 9. The results show a decrease in the performance of the model for each class. The mean accuracy, F1-Score and mAP values are 0.88, 0.34, and 0.64, respectively. However, the events Baby crying, Honking, Siren show good results. Figure 20 plots the ROC curve and AUC values, suggesting a similar result with an AUC value greater than 0.9 for these classes.

4. Discussion

The following sections discuss the overall performance of the model and its limitations.

4.1. Model Performance

The following can be concluded from the experiments carried out using segment-based metrics to evaluate the AED model developed.
  • The model performs with a mean accuracy, F1-Score, and mAP values of 0.93, 0.39, and 0.77, respectively, for 11 sound events on a test dataset containing only a single event. The threshold-dependent metrics are calculated for a threshold of 0.5.
  • The overall performance of the model decreases, with the mean accuracy, the F1-Score and the mAP values equal to 0.89, 0.42, and 0.658, respectively, when the data contains two overlapping events. The threshold-dependent metrics are calculated for a threshold of 0.5. However, the model still shows good performance.
  • The sound events Baby crying, Honking and Siren show good results in all experiments. The inference results of one of the unseen data samples show that the model is able to detect the overlapping instances of both Honking and Siren for a threshold of 0.4.
Currently, the model is evaluated on generic evaluation metrics that do not consider the extent of overlap between sound events. The Polyphonic Sound Detection Score (PSDS) is a metric used in DCASE challenges to evaluate state-of-the-art AED systems that does consider the extent of overlap between sound events. Although other threshold-independent metrics are used to evaluate the model performance, calculating the PSDS would help to compare the performance with state-of-the-art AED systems. In addition, the proposed AED system has been evaluated for a single architecture. Other linear classifiers, such as SVM, can be used in conjunction with BEATs to evaluate performance. Since BEATs has around 90M parameters, the performance of smaller networks such as BYOL and YAMNet can be explored. Additional experiments can be carried out to improve the performance of the selected model through hyperparameter tuning. Since training time is not optimized, hardware with higher computational capability can be used. If training time is optimized, the BEATs model can be fine-tuned to improve performance.

4.2. Limitations

  • Methods for model calibration can be used to ensure that the predicted probabilities match the real-world probabilities (https://www.tidyverse.org/blog/2022/11/model-calibration/, accessed on 11 May 2024). The predictions of the model depend on the defined threshold. During inference, a threshold of less than 0.5 is found to predict events more accurately. Tuning the threshold to an optimum value can improve the classifier predictions during inference. The model can also be trained to predict parts of audio containing no sound as Silence.
  • Since the model is trained using a small dataset with a limited number of samples for each class, some classes show poor performance. Sound events, such as Indicator and Speech, have an average precision of less than 0.4. This can also be attributed to the fact that these sound events are more of a continuous nature. The performance of other classes can still be improved through hyperparameter tuning.
  • The performance of the model for overlapping sound events is evaluated for only a set of combinations of sound events that overlap. Depending on the extent of overlap and the number of overlapping events, the performance of the model may vary.

5. Conclusions

This paper introduces an AED system that can detect various sound events within the vicinity of a vehicle. Although there is active research on sound event detection methodologies, it mainly focuses on environmental sounds. There is also a scarcity of datasets when it comes to sound data in an automotive context. The approach described in this paper tackles this challenge by collecting real recordings of sound events according to the proposed ontology. A novel augmentation methodology is applied to generate polyphonic sound mixtures from these recordings. Using the pre-trained model BEATs as a frozen feature extractor and a single-layer neural network for classification, the proposed system is able to predict audio events every 1 s in order to create an audio scene description. The results of the experiments suggest that the model performs relatively well on the unseen datasets. Given sufficient data, a similar approach can be applied to a broader range of audio applications in the automotive domain.
Currently, the proposed system detects sound events in audio files stored locally. Future work can therefore involve deploying the AED model on an embedded system, where microphones capture audio and the deployed model predicts the sound events. Since the prediction occurs every 1 s, a ring buffer of 1 s can be designed in which audio events in the incoming 1 s of audio are predicted, then in the next 1 s, and so on, continuously in real time. In this way, the model can be trained with large amounts of data if the system is installed in cars. To account for the complexity of the BEATs model, optimization techniques such as Low-Rank Adaptation (LoRA) [37] can be used when deploying the model on resource-constrained embedded systems. Alternatively, models such as YAMNet and BYOL with fewer parameters can also be explored. Integrating methods such as federated learning can protect the privacy of car users while collecting data from different cars.
Further experiments must be carried out to explore the model’s robustness to more realistic conditions. For further investigation, synthetic data containing more than two overlapping events can be generated with the ADA software. However, the real-world automotive environment is more complex, and further extensive studies must be performed to evaluate the model’s performance. Furthermore, self-supervised learning strategies can be introduced to replace the tedious process of collecting annotated data. Currently, the audio scene description is created from the predictions for every 1 s. This can be further extended with an LLM to create a natural language description of the audio scene. Models like SALMONN [38] and LTU [39] follow a similar approach.

Author Contributions

Conceptualization, A.A., W.T., G.G. and C.M.F.; methodology, A.A., W.T., G.G. and C.M.F.; software, A.A.; validation, A.A., W.T. and G.G.; formal analysis, A.A.; investigation, A.A.; resources, W.T. and G.G.; data curation, A.A.; writing—original draft preparation, A.A.; writing—review and editing, A.A., W.T., G.G. and C.M.F.; visualization, A.A.; supervision, W.T., G.G. and C.M.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data available on request due to restrictions. The recordings and the link to the public datasets are available by request to the corresponding author.

Acknowledgments

The authors thank Henning Taschke from the Audio Department of Volkswagen Infotainment GmbH for extending his support in setting up the recording infrastructure in the cars.

Conflicts of Interest

Authors Anaswara Antony, Wolfgang Theimer and Giovanni Grossetti were employed by the company Volkswagen Infotainment GmbH. The funders had no role in the design of the study; in the collection, analysis, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results. The research was conducted at Volkswagen Infotainment GmbH as part of a master’s thesis study.

Abbreviations

The following abbreviations are used in this manuscript:
A2B	Automotive Audio Bus
ADAS	Advanced Driver Assistance Systems
ADA	Audio Data Augmentation System
AED	Acoustic Event Detection
AST	Audio Spectrogram Transformer
AUC	Area Under the ROC Curve
BCE	Binary Cross Entropy
BEATs	Bidirectional Encoder representation from Audio Transformers
BLSTM	Bidirectional Long Short-Term Memory
BYOL	Bootstrap Your Own Latent
CNN	Convolutional Neural Network
CRNN	Convolutional Recurrent Neural Network
DCASE	Detection and Classification of Acoustic Scenes and Events
DNN	Deep Neural Network
FNN	Feed-Forward Neural Network
GMM	Gaussian Mixture Model
GRU	Gated Recurrent Unit
HMM	Hidden Markov Model
HVAC	Heating, Ventilation, and Air Conditioning
LiDAR	Light Detection and Ranging
LLM	Large Language Model
mAP	mean Average Precision
MFCC	Mel Frequency Cepstral Coefficients
NMF	Non-negative Matrix Factorization
PaSST	Patchout faSt Spectrogram Transformer
PSDS	Polyphonic Sound Detection Score
RADAR	Radio Detection and Ranging
RNN	Recurrent Neural Network
ROC	Receiver Operating Characteristic
SVM	Support Vector Machine
YAMNet	Yet Another Music Recognition Network

References

  1. Ignatious, H.A.; Sayed, H.E.; Khan, M. An overview of sensors in Autonomous Vehicles. Procedia Comput. Sci. 2022, 198, 736–741. [Google Scholar] [CrossRef]
  2. Marchegiani, L.; Fafoutis, X. How Well Can Driverless Vehicles Hear? An Introduction to Auditory Perception for Autonomous and Smart Vehicles. IEEE Intell. Transp. Syst. Mag. 2022, 14, 92–105. [Google Scholar] [CrossRef]
  3. Yin, J.; Damiano, S.; Verhelst, M.; Van Waterschoot, T.; Guntoro, A. Real-Time Acoustic Perception for Automotive Applications. In Proceedings of the 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE), Antwerp, Belgium, 17–19 April 2023; pp. 1–6. [Google Scholar] [CrossRef]
  4. Furletov, Y.; Willert, V.; Adamy, J. Auditory Scene Understanding for Autonomous Driving. In Proceedings of the 2021 IEEE Intelligent Vehicles Symposium (IV), Nagoya, Japan, 11–17 July 2021; pp. 697–702. [Google Scholar] [CrossRef]
  5. Mesaros, A.; Heittola, T.; Virtanen, T. TUT database for acoustic scene classification and sound event detection. In Proceedings of the 2016 24th European Signal Processing Conference (EUSIPCO), Budapest, Hungary, 29 August–2 September 2016; pp. 1128–1132. [Google Scholar] [CrossRef]
  6. Chan, T.K.; Chin, C.S. A Comprehensive Review of Polyphonic Sound Event Detection. IEEE Access 2020, 8, 103339–103373. [Google Scholar] [CrossRef]
  7. Mesaros, A.; Heittola, T.; Virtanen, T.; Plumbley, M.D. Sound Event Detection: A tutorial. IEEE Signal Process. Mag. 2021, 38, 67–83. [Google Scholar] [CrossRef]
  8. Cakir, E.; Parascandolo, G.; Heittola, T.; Huttunen, H.; Virtanen, T. Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 1291–1303. [Google Scholar] [CrossRef]
  9. Jung, S.; Park, J.; Lee, S. Polyphonic Sound Event Detection Using Convolutional Bidirectional Lstm and Synthetic Data-based Transfer Learning. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 885–889. [Google Scholar] [CrossRef]
  10. Adavanne, S.; Politis, A.; Nikunen, J.; Virtanen, T. Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks. IEEE J. Sel. Top. Signal Process. 2019, 13, 34–48. [Google Scholar] [CrossRef]
  11. Iman, M.; Arabnia, H.R.; Rasheed, K. A Review of Deep Transfer Learning and Recent Advancements. Technologies 2023, 11, 40. [Google Scholar] [CrossRef]
  12. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  13. Li, X.; Shao, N.; Li, X. Self-Supervised Audio Teacher-Student Transformer for Both Clip-Level and Frame-Level Tasks. IEEE/ACM Trans. Audio Speech Lang. Proc. 2024, 32, 1336–1351. [Google Scholar] [CrossRef]
  14. Shao, N.; Li, X.; Li, X. Fine-Tune the Pretrained ATST Model for Sound Event Detection. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 911–915. [Google Scholar] [CrossRef]
  15. Gemmeke, J.F.; Ellis, D.P.W.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 776–780. [Google Scholar] [CrossRef]
  16. Piczak, K.J. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia, 26–30 October 2015; pp. 1015–1018. [Google Scholar] [CrossRef]
  17. Font, F.; Roma, G.; Serra, X. Freesound technical demo. In Proceedings of the 21st ACM International Conference on Multimedia, Barcelona, Spain, 21–25 October 2013; pp. 411–412. [Google Scholar] [CrossRef]
  18. Salamon, J.; Jacoby, C.; Bello, J.P. A Dataset and Taxonomy for Urban Sound Research. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 1041–1044. [Google Scholar] [CrossRef]
  19. LeCun, Y.; Bengio, Y. Convolutional networks for images, speech, and time series. In The Handbook of Brain Theory and Neural Networks; MIT Press: Cambridge, MA, USA, 1998; pp. 255–258. [Google Scholar]
  20. Chen, S.; Wu, Y.; Wang, C.; Liu, S.; Tompkins, D.; Chen, Z.; Che, W.; Yu, X.; Wei, F. BEATs: Audio Pre-Training with Acoustic Tokenizers. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 5178–5193. [Google Scholar]
  21. Hershey, S.; Chaudhuri, S.; Ellis, D.P.W.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B.; et al. CNN architectures for large-scale audio classification. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 131–135. [Google Scholar] [CrossRef]
  22. Kong, Q.; Cao, Y.; Iqbal, T.; Wang, Y.; Wang, W.; Plumbley, M.D. PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2880–2894. [Google Scholar] [CrossRef]
  23. Gong, Y.; Chung, Y.A.; Glass, J. AST: Audio Spectrogram Transformer. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 571–575. [Google Scholar] [CrossRef]
  24. Koutini, K.; Schlüter, J.; Eghbal-zadeh, H.; Widmer, G. Efficient Training of Audio Transformers with Patchout. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 2753–2757. [Google Scholar] [CrossRef]
  25. Kolecki, J.C. An Introduction to Tensors for Students of Physics and Engineering; National Aeronautics and Space Administration, Glenn Research Center: Cleveland, OH, USA, 2002. [Google Scholar]
  26. Waurin, K.; Ruwisch, D.; Du, Y. How A2B Technology and Digital Microphones Enable Superior Performance in Emerging Automotive Applications. Analog Dialogue 2021, 55. Available online: https://www.analog.com/en/resources/analog-dialogue/articles/how-a2b-technology-and-digital-microphones-enable-superior-performance-in-automotive-applications.html (accessed on 4 March 2025).
  27. Sechidis, K.; Tsoumakas, G.; Vlahavas, I. On the Stratification of Multi-label Data. In Proceedings of the Machine Learning and Knowledge Discovery in Databases, Athens, Greece, 5–9 September 2011; Gunopulos, D., Hofmann, T., Malerba, D., Vazirgiannis, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 145–158. [Google Scholar]
  28. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Vienna, Austria, 3–7 May 2021. [Google Scholar]
  29. Li, K.; Song, Y.; Dai, L.R.; McLoughlin, I.; Fang, X.; Liu, L. AST-SED: An Effective Sound Event Detection Method Based on Audio Spectrogram Transformer. arXiv 2023, arXiv:2303.03689. [Google Scholar]
  30. Mesaros, A.; Heittola, T.; Benetos, E.; Foster, P.; Lagrange, M.; Virtanen, T.; Plumbley, M.D. Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 379–393. [Google Scholar] [CrossRef]
  31. Ravanelli, M.; Parcollet, T.; Bengio, Y. The Pytorch-kaldi Speech Recognition Toolkit. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6465–6469. [Google Scholar] [CrossRef]
  32. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  33. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  34. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327. [Google Scholar] [CrossRef] [PubMed]
  35. Ebbers, J.; Haeb-Umbach, R.; Serizel, R. Threshold Independent Evaluation of Sound Event Detection Scores. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 1021–1025. [Google Scholar] [CrossRef]
  36. Kimura, M. Understanding Test-Time Augmentation. In Proceedings of the Neural Information Processing, Sanur, Indonesia, 8–12 December 2021; Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N., Eds.; Springer: Cham, Switzerland, 2021; pp. 558–569. [Google Scholar]
  37. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the 10th International Conference on Learning Representations, ICLR 2022, Online, 25–29 April 2022. [Google Scholar]
38. Tang, C.; Yu, W.; Sun, G.; Chen, X.; Tan, T.; Li, W.; Lu, L.; Ma, Z.; Zhang, C. SALMONN: Towards Generic Hearing Abilities for Large Language Models. In Proceedings of the 12th International Conference on Learning Representations, ICLR 2024, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  39. Gong, Y.; Luo, H.; Liu, A.H.; Karlinsky, L.; Glass, J.R. Listen, Think, and Understand. In Proceedings of the 12th International Conference on Learning Representations, ICLR 2024, Vienna, Austria, 7–11 May 2024. [Google Scholar]
Figure 1. Example of an auditory scene that can be perceived inside a car. Inside, one can hear the sound of the ventilation, the engine, honking, passing cars, etc. These can be identified as separate sound events that may or may not overlap in time. The colored boxes in the figure represent some of these sound sources.
Figure 2. Acoustic Event Detection of the audio scene. The sounds of ventilation, cars passing by, and honking are detected along with their onset and offset times. The length of each rectangle is proportional to the duration of the corresponding sound and thus marks its start and end times. Rectangles are stacked one above the other to indicate overlapping time intervals.
Figure 3. Ontology for automotive sounds. The coloured rectangles represent high-level semantics: ‘Inside Car’ and ‘Outside Car’ indicate where the sounds originate with respect to the vehicle. The sounds are further classified on the basis of their transient/continuous nature and their sources. The solid white rectangles represent the ‘labels’ with which each sound is identified. Sounds that are best described by more than two groups of sound sources are shown in a rectangle with an earmark.
Figure 4. Proposed approach for the AED system.
Figure 5. Overview of the AED system. The audio data from the Audio Database are fed to the Audio Data Augmentation system for data augmentation. The data preprocessor converts the input audio, together with the augmented audio, into tensor format [25]; the embedding extractor then converts these tensors into embeddings. The classifier model is trained using these embeddings and the corresponding labels.
Figure 6. Total number of samples available for each audio event.
Figure 7. Total duration of each recorded audio event.
Figure 8. Augmentation overview. The labels ‘Honking’ and ‘Vehicle passing by’ are selected from the database. The normalized signals are scaled by random values in the range [0, 1], and the start of each signal is time-shifted by a value chosen randomly from a user-specified range. The final augmented audio y(n) contains the sounds of vehicles passing by and honking with the applied time shifts and amplitude scaling.
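The mixing procedure described in Figure 8 can be summarized by the following minimal sketch; the function name mix_events and the parameter max_shift_s are illustrative choices and are not taken from the authors' implementation.

```python
import numpy as np

def mix_events(x1, x2, sr=16000, max_shift_s=2.0, rng=None):
    """Peak-normalize two event signals, scale each by a random gain in [0, 1],
    shift their onsets by random amounts, and sum them into one clip y(n)."""
    rng = rng or np.random.default_rng()

    def normalize(x):
        peak = np.max(np.abs(x))
        return x / peak if peak > 0 else x

    x1 = normalize(np.asarray(x1, dtype=float))
    x2 = normalize(np.asarray(x2, dtype=float))
    g1, g2 = rng.uniform(0.0, 1.0, size=2)                       # random amplitude scaling
    s1, s2 = rng.integers(0, int(max_shift_s * sr) + 1, size=2)  # random onset shifts (samples)

    y = np.zeros(max(s1 + len(x1), s2 + len(x2)))
    y[s1:s1 + len(x1)] += g1 * x1   # e.g., 'Vehicle passing by'
    y[s2:s2 + len(x2)] += g2 * x2   # e.g., 'Honking'
    return y
```

The annotation of the mixed clip then carries both labels for the interval in which the two events overlap.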
Figure 9. Overview of data processing for the classifier. The input audio signal is filtered and resampled to 16 kHz, since BEATs accepts audio sampled at this frequency. A window of 1 s duration is applied to the input audio to obtain frames of 1 s length. These frames are converted to tensor format and fed into the BEATs model to generate embeddings for each frame. The annotations are likewise segmented into 1 s intervals. In the multi-label encoding stage, they are converted to a binary matrix that encodes the presence (1) or absence (0) of each class/label in each 1 s frame. The classifier is trained in a supervised manner to predict these reference labels from the embeddings of each 1 s frame.
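The framing and multi-label encoding stages of Figure 9 can be illustrated roughly as follows; the class list matches the labels in Tables 5–9, while the call to the pre-trained BEATs embedding extractor is only indicated as a placeholder comment, since its exact interface is not reproduced here.

```python
import numpy as np

CLASSES = ["Baby crying", "Dog barking", "Door", "Honking", "Indicator",
           "Knocking at the car", "Siren", "Speech", "Vehicle passing by",
           "Window", "Wiper"]
SR = 16000  # BEATs expects audio sampled at 16 kHz

def frame_audio(x, sr=SR, win_s=1.0):
    """Split a (resampled) signal into non-overlapping 1 s frames."""
    n = int(win_s * sr)
    n_frames = len(x) // n
    return x[:n_frames * n].reshape(n_frames, n)

def encode_labels(annotations, n_frames, win_s=1.0):
    """Convert (onset, offset, label) annotations into a binary matrix of shape
    (n_frames, n_classes): 1 if the event is active within that 1 s frame."""
    y = np.zeros((n_frames, len(CLASSES)), dtype=np.float32)
    for onset, offset, label in annotations:
        first = int(onset // win_s)
        last = min(int(np.ceil(offset / win_s)), n_frames)
        y[first:last, CLASSES.index(label)] = 1.0
    return y

# frames -> tensors -> per-frame embeddings via the pre-trained model, e.g.:
# embeddings = beats_model(torch.from_numpy(frames))  # hypothetical call, interface not shown here
```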
Figure 10. Total number of samples available for each audio event in dataset_2.
Figure 11. Total duration of each audio event in dataset_2.
Figure 12. Total number of samples available for each audio event in the unseen dataset.
Figure 13. Total duration of each audio event in the unseen dataset.
Figure 14. ROC curve and AUC values of each class on unseen data. The AUC number corresponds to the index of each class in the list.
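Per-class ROC curves and AUC values of the kind plotted in Figures 14, 15, 19 and 20 can be computed from the frame-level targets and predicted scores, for example with scikit-learn; this is a generic sketch rather than the authors' evaluation code.

```python
from sklearn.metrics import roc_curve, auc

def per_class_roc(y_true, y_score, class_names):
    """y_true, y_score: arrays of shape (n_frames, n_classes) with binary
    reference labels and predicted scores. Returns per-class (fpr, tpr, AUC)."""
    results = {}
    for k, name in enumerate(class_names):
        fpr, tpr, _ = roc_curve(y_true[:, k], y_score[:, k])
        results[name] = (fpr, tpr, auc(fpr, tpr))
    return results
```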
Figure 15. ROC curve and AUC values of each class on unseen data with augmentation. The AUC number corresponds to the index of each class in the list.
Figure 16. Total duration of each audio event in the dataset generated using ADA.
Figure 17. Comparison of mAP on devset data generated using ADA with different loss functions and transfer learning.
Figure 18. Total duration of each audio event in the unseen dataset generated using ADA.
Figure 19. ROC curve and AUC values of each class on unseen data with overlapping events generated using ADA. The AUC number corresponds to the index of each class in the list.
Figure 20. ROC curve and AUC values of each class on unseen data generated using ADA, with augmentation. The AUC number corresponds to the index of each class in the list.
Table 1. Comparison of different pre-trained audio networks on AudioSet dataset.
| Model | Performance (mAP) | Complexity (Million) | License |
| --- | --- | --- | --- |
| YAMNet [21] | 0.389 | 3.7 M | Apache License 2.0 |
| PANNs [22] | 0.439 | 81 M | MIT License |
| AST [23] | 0.459 | 86 M | BSD 3-Clause License |
| PaSST [24] | 0.471 | 86 M | Apache License |
| BEATs [20] | 0.486 | 90 M | MIT License |
Table 2. Summary of results of experiments with different datasets. Mean accuracy and mAP are reported with their respective standard deviations and 95% confidence intervals over 10 folds.
| Dataset | Accuracy | mAP |
| --- | --- | --- |
| dataset_1 | 0.95 ± 0.014, (0.940, 0.960) | 0.39 ± 0.16, (0.276, 0.504) |
| dataset_2 | 0.92 ± 0.016, (0.909, 0.931) | 0.57 ± 0.16, (0.456, 0.684) |
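The reported intervals are consistent with a Student-t confidence interval over the ten per-fold scores; assuming this construction, they can be reproduced with the following sketch.

```python
import numpy as np
from scipy import stats

def mean_ci(fold_scores, confidence=0.95):
    """Mean, sample standard deviation, and t-based confidence interval over k folds."""
    scores = np.asarray(fold_scores, dtype=float)
    k = len(scores)
    mean, std = scores.mean(), scores.std(ddof=1)
    half_width = stats.t.ppf(0.5 + confidence / 2, df=k - 1) * std / np.sqrt(k)
    return mean, std, (mean - half_width, mean + half_width)

# Example: 10 fold accuracies with mean 0.95 and std 0.014 give an interval of
# roughly (0.940, 0.960), matching the first row of Table 2.
```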
Table 3. Summary of results of experiments with different batch sizes. Mean accuracy and mAP are reported with their respective standard deviations and 95% confidence intervals over 10 folds.
| Batch Size | Accuracy | mAP |
| --- | --- | --- |
| 32 | 0.92 ± 0.02, (0.906, 0.934) | 0.55 ± 0.15, (0.443, 0.657) |
| 64 | 0.9 ± 0.006, (0.896, 0.904) | 0.4 ± 0.08, (0.343, 0.457) |
| 128 | 0.91 ± 0.008, (0.904, 0.916) | 0.37 ± 0.07, (0.320, 0.420) |
Table 4. Summary of results of experiments with different loss functions. Mean accuracy and mAP are reported with their respective standard deviations and 95% confidence intervals over 10 folds.
| Loss Function | Accuracy | mAP |
| --- | --- | --- |
| BCE loss | 0.92 ± 0.02, (0.906, 0.934) | 0.55 ± 0.15, (0.443, 0.657) |
| Focal loss | 0.93 ± 0.02, (0.916, 0.944) | 0.64 ± 0.14, (0.540, 0.740) |
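For reference, the focal loss of [34] adapted to multi-label targets can be sketched in PyTorch as below; the values gamma = 2.0 and alpha = 0.25 are the common defaults from [34] and are not necessarily those used in this work.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss for multi-label classification:
    FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t), applied per class and averaged."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)            # probability of the true label
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
    return (alpha_t * (1 - p_t).pow(gamma) * bce).mean()
```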
Table 5. Evaluation metrics for each class on unseen data.
| Index | Label | Accuracy | F1-Score | Recall | Precision | Avg_Precision |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Baby crying | 0.99 | 0.967 | 1.0 | 0.937 | 1.0 |
| 1 | Dog barking | 0.937 | 0.606 | 0.434 | 1.0 | 0.793 |
| 2 | Door | 0.947 | 0.645 | 0.526 | 0.833 | 0.843 |
| 3 | Honking | 0.99 | 0.833 | 0.714 | 1.0 | 1.0 |
| 4 | Indicator | 0.937 | 0.0 | 0.0 | 0.0 | 0.33 |
| 5 | Knocking at the car | 0.99 | 0.8 | 0.666 | 1.0 | 0.958 |
| 6 | Siren | 0.985 | 0.0 | 0.0 | 0.0 | 1.0 |
| 7 | Speech | 0.951 | 0.0 | 0.0 | 0.0 | 0.394 |
| 8 | Vehicle passing by | 0.72 | 0.093 | 0.049 | 1.0 | 0.622 |
| 9 | Window | 0.937 | 0.38 | 0.25 | 0.8 | 0.684 |
| 10 | Wiper | 0.886 | 0.0 | 0.0 | 0.0 | 0.866 |
Table 6. Evaluation metrics for each class on unseen data with augmentation.
| Index | Label | Accuracy | F1-Score | Recall | Precision | Avg_Precision |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Baby crying | 0.964 | 0.888 | 0.966 | 0.821 | 0.961 |
| 1 | Dog barking | 0.934 | 0.599 | 0.441 | 0.934 | 0.624 |
| 2 | Door | 0.938 | 0.524 | 0.368 | 0.907 | 0.851 |
| 3 | Honking | 0.984 | 0.71 | 0.551 | 1.0 | 1.0 |
| 4 | Indicator | 0.952 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | Knocking at the car | 0.99 | 0.8 | 0.666 | 1.0 | 0.994 |
| 6 | Siren | 0.985 | 0.0 | 0.0 | 0.0 | 0.797 |
| 7 | Speech | 0.951 | 0.0 | 0.0 | 0.0 | 0.261 |
| 8 | Vehicle passing by | 0.738 | 0.201 | 0.112 | 0.96 | 0.849 |
| 9 | Window | 0.935 | 0.287 | 0.169 | 0.95 | 0.576 |
| 10 | Wiper | 0.884 | 0.0 | 0.0 | 0.0 | 0.813 |
Table 7. Summary of results of experiments with different loss functions for the dataset generated using ADA (single fold).
| Loss Function | Accuracy | mAP |
| --- | --- | --- |
| BCE loss | 0.86 | 0.48 |
| Focal loss | 0.87 | 0.56 |
Table 8. Evaluation metrics for each class on unseen data with overlapping events generated using ADA.
| Index | Label | Accuracy | F1-Score | Recall | Precision | Avg_Precision |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Baby crying | 0.947 | 0.884 | 0.94 | 0.835 | 0.960 |
| 1 | Dog barking | 0.87 | 0.561 | 0.39 | 1.0 | 0.696 |
| 2 | Door | 0.869 | 0.525 | 0.356 | 1.0 | 0.721 |
| 3 | Honking | 0.984 | 0.88 | 0.785 | 1.0 | 0.987 |
| 4 | Indicator | 0.878 | 0.109 | 0.07 | 0.184 | 0.178 |
| 5 | Knocking at the car | 0.920 | 0.336 | 0.316 | 0.385 | 0.342 |
| 6 | Siren | 0.982 | 0.636 | 0.466 | 1.0 | 0.994 |
| 7 | Speech | 0.893 | 0.0 | 0.0 | 0.0 | 0.275 |
| 8 | Vehicle passing by | 0.795 | 0.0769 | 0.040 | 1.0 | 0.712 |
| 9 | Window | 0.913 | 0.456 | 0.354 | 0.641 | 0.535 |
| 10 | Wiper | 0.808 | 0.181 | 0.1 | 1.0 | 0.842 |
Table 9. Evaluation metrics for each class on unseen data generated using ADA, with augmentation.
| Index | Label | Accuracy | F1-Score | Recall | Precision | Avg_Precision |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Baby crying | 0.906 | 0.797 | 0.861 | 0.741 | 0.882 |
| 1 | Dog barking | 0.846 | 0.534 | 0.415 | 0.75 | 0.599 |
| 2 | Door | 0.818 | 0.212 | 0.120 | 0.899 | 0.591 |
| 3 | Honking | 0.976 | 0.809 | 0.679 | 1.0 | 0.997 |
| 4 | Indicator | 0.900 | 0.081 | 0.046 | 0.353 | 0.416 |
| 5 | Knocking at the car | 0.933 | 0.343 | 0.271 | 0.469 | 0.416 |
| 6 | Siren | 0.972 | 0.25 | 0.142 | 1.0 | 0.871 |
| 7 | Speech | 0.893 | 0.0 | 0.0 | 0.0 | 0.331 |
| 8 | Vehicle passing by | 0.804 | 0.148 | 0.080 | 1.0 | 0.877 |
| 9 | Window | 0.912 | 0.461 | 0.369 | 0.616 | 0.543 |
| 10 | Wiper | 0.80 | 0.11 | 0.062 | 1.0 | 0.784 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
