Author Contributions
Conceptualization, A.A., W.T., G.G. and C.M.F.; methodology, A.A., W.T., G.G. and C.M.F.; software, A.A.; validation, A.A., W.T. and G.G.; formal analysis, A.A.; investigation, A.A.; resources, W.T. and G.G.; data curation, A.A.; writing—original draft preparation, A.A.; writing—review and editing, A.A., W.T., G.G. and C.M.F.; visualization, A.A.; supervision, W.T., G.G. and C.M.F. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Example of an auditory scene that can be perceived inside a car. One can hear the sound of the ventilation, engine, honking, passing cars, etc. These can be identified as separate sound events that may or may not overlap one another. The colored boxes in the figure represent some of these sound sources.
Figure 2.
Acoustic Event Detection of the audio scene. The sounds of ventilation, cars passing, and honking, along with the onset and offset times of these events, are detected. The length of each rectangle is proportional to the duration of the corresponding sound and thus marks its start and end times. Rectangles placed one above the other indicate overlapping time intervals.
Figure 3.
Ontology for automotive sounds. The colored rectangles represent high-level semantics: ‘Inside Car’ and ‘Outside Car’ indicate where the sounds originate with respect to the vehicle. The sounds are further classified on the basis of their transient/continuous nature and their sources. The solid white rectangles represent the ‘labels’ with which each sound is identified. Some sounds are best described with more than two groups of sound sources; these are represented by a rectangle with an earmark.
Figure 4.
Proposed approach for the AED system.
Figure 5.
Overview of the AED system. The audio data from the audio database are fed to the audio data augmentation system for data augmentation. The data preprocessor converts the input audio, along with the augmented audio, into tensor format [25], from which the embedding extractor generates embeddings. The classifier model is trained using these embeddings and the corresponding labels.
Figure 6.
Total number of samples available for each audio event.
Figure 7.
Total duration of each recorded audio event.
Figure 8.
Augmentation overview. The labels ‘Honking’ and ‘Vehicle passing by’ are selected from the database. The normalized signals are scaled by random values in the range [0, 1], and the start of each signal is time-shifted by a value chosen randomly from a user-specified range. The final augmented audio y(n) contains the sounds of vehicles passing by and honking with the applied time shifts and amplitude scaling.
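The mixing procedure in the caption (peak normalization, random amplitude scaling in [0, 1], random onset shift, summation into y(n)) can be sketched as follows. This is a minimal illustration; the function name `augment_mix`, the sample-level shift handling, and the use of NumPy are our assumptions, not the authors' implementation.

```python
import numpy as np

def augment_mix(signals, max_shift, out_len, rng=None):
    """Mix event signals into one clip: peak-normalize each signal,
    scale it by a random factor in [0, 1], shift its start by a random
    number of samples, and sum everything into one waveform y(n)."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.zeros(out_len)
    for x in signals:
        x = x / (np.max(np.abs(x)) + 1e-12)      # peak normalization
        gain = rng.uniform(0.0, 1.0)             # random amplitude scaling
        shift = int(rng.integers(0, max_shift + 1))  # random onset shift (samples)
        n = min(len(x), out_len - shift)
        y[shift:shift + n] += gain * x[:n]       # overlap-add into the mix
    return y
```

Called, for example, with a honking clip and a vehicle-passing clip, it returns a single clip in which the two events overlap with random gains and onsets.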
Figure 9.
Overview of data processing for the classifier. The input audio is filtered and resampled to 16 kHz, since BEATs accepts audio sampled at this frequency. A 1 s window is applied to the input audio to obtain frames of 1 s length. These frames are converted to tensor format and fed into the BEATs model to generate embeddings for each frame. The annotations are likewise updated to a 1 s resolution: in the multi-label encoding stage, they are converted to a binary matrix encoding the presence (1) or absence (0) of each class/label in each 1 s frame. The classifier is trained in a supervised manner to predict these reference labels from the embeddings of each second of input audio.
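The multi-label encoding step can be sketched as below: each annotated event activates its class in every 1 s frame it overlaps. The function name `multilabel_matrix` and the (label, onset, offset) annotation format are illustrative assumptions, not the authors' code.

```python
import numpy as np

def multilabel_matrix(annotations, clip_dur, classes, frame_dur=1.0):
    """Binary matrix of shape (num_frames, num_classes): entry (t, c)
    is 1 if class c is active anywhere inside frame t, else 0.
    `annotations` is a list of (label, onset_s, offset_s) tuples."""
    n_frames = int(np.ceil(clip_dur / frame_dur))
    y = np.zeros((n_frames, len(classes)), dtype=np.int8)
    idx = {c: i for i, c in enumerate(classes)}
    for label, onset, offset in annotations:
        first = int(onset // frame_dur)                       # first frame touched
        last = min(int(np.ceil(offset / frame_dur)), n_frames)  # one past last frame
        y[first:last, idx[label]] = 1
    return y
```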
Figure 10.
Total number of samples available for each audio event in dataset_2.
Figure 11.
Total duration of each audio event in dataset_2.
Figure 12.
Total number of samples available for each audio event in the unseen dataset.
Figure 13.
Total duration of each audio event in the unseen dataset.
Figure 14.
ROC curves and AUC values of each class on unseen data. The number accompanying each AUC value is the index of the corresponding class.
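A per-class AUC such as those in Figures 14 and 15 can be computed from the classifier scores and binary reference labels. A minimal sketch using the pairwise Mann-Whitney form of the statistic (our own formulation, not necessarily the authors' tooling):

```python
import numpy as np

def auc_score(y_true, y_score):
    """AUC as P(score of a random positive > score of a random negative),
    with ties counted as 0.5 (pairwise O(n_pos * n_neg) Mann-Whitney form)."""
    y_true = np.asarray(y_true, dtype=bool)
    pos = np.asarray(y_score)[y_true]
    neg = np.asarray(y_score)[~y_true]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

In practice a library routine (e.g., scikit-learn's `roc_auc_score`) would be used; the pairwise form above is only for clarity.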
Figure 15.
ROC curves and AUC values of each class on unseen data with augmentation. The number accompanying each AUC value is the index of the corresponding class.
Figure 16.
Total duration of each audio event in the dataset generated using ADA.
Figure 17.
Comparison of mAP on devset data generated using ADA with different loss functions and transfer learning.
Figure 18.
Total duration of each audio event in the unseen dataset generated using ADA.
Figure 19.
ROC curves and AUC values of each class on unseen data with overlapping events generated using ADA. The number accompanying each AUC value is the index of the corresponding class.
Figure 20.
ROC curves and AUC values of each class on unseen data generated using ADA, with augmentation. The number accompanying each AUC value is the index of the corresponding class.
Table 1.
Comparison of different pre-trained audio networks on AudioSet dataset.
| Model | Performance (mAP) | Complexity (Parameters) | License |
|---|---|---|---|
| YAMNet [21] | 0.389 | 3.7M | Apache License 2.0 |
| PANNs [22] | 0.439 | 81M | MIT License |
| AST [23] | 0.459 | 86M | BSD 3-Clause License |
| PaSST [24] | 0.471 | 86M | Apache License |
| BEATs [20] | 0.486 | 90M | MIT License |
Table 2.
Summary of results of experiments with different datasets. Mean accuracy and mAP are reported with the respective standard deviations and 95% confidence intervals over 10 folds.
| Dataset | Accuracy | mAP |
|---|---|---|
| dataset_1 | 0.95 ± 0.014, (0.940, 0.960) | 0.39 ± 0.16, (0.276, 0.504) |
| dataset_2 | 0.92 ± 0.016, (0.909, 0.931) | 0.57 ± 0.16, (0.456, 0.684) |
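The intervals in Table 2 are consistent with a two-sided Student-t interval over the 10 folds (t ≈ 2.262 for 9 degrees of freedom). A minimal sketch, assuming that is the interval used:

```python
import math

def t_confidence_interval(mean, std, n_folds, t_crit=2.262):
    """95% CI for a cross-validation mean: mean +/- t * std / sqrt(n).
    t_crit = 2.262 is the two-sided 95% Student-t critical value for
    n_folds - 1 = 9 degrees of freedom."""
    half = t_crit * std / math.sqrt(n_folds)
    return mean - half, mean + half
```

For example, for dataset_1 accuracy (0.95 ± 0.014) this gives approximately (0.940, 0.960), matching the table.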
Table 3.
Summary of results of experiments with different batch sizes. Mean accuracy and mAP are reported with the respective standard deviations and 95% confidence intervals over 10 folds.
| Batch Size | Accuracy | mAP |
|---|---|---|
| 32 | 0.92 ± 0.02, (0.906, 0.934) | 0.55 ± 0.15, (0.443, 0.657) |
| 64 | 0.90 ± 0.006, (0.896, 0.904) | 0.40 ± 0.08, (0.343, 0.457) |
| 128 | 0.91 ± 0.008, (0.904, 0.916) | 0.37 ± 0.07, (0.320, 0.420) |
Table 4.
Summary of results of experiments with different loss functions. Mean accuracy and mAP are reported with the respective standard deviations and 95% confidence intervals over 10 folds.
| Loss Function | Accuracy | mAP |
|---|---|---|
| BCE loss | 0.92 ± 0.02, (0.906, 0.934) | 0.55 ± 0.15, (0.443, 0.657) |
| Focal loss | 0.93 ± 0.02, (0.916, 0.944) | 0.64 ± 0.14, (0.540, 0.740) |
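Focal loss down-weights easy, well-classified examples relative to BCE via a (1 − p_t)^γ modulating factor, which helps with the class imbalance seen here. A generic multi-label sketch follows; the defaults γ = 2 and α = 0.25 are the commonly used values, not necessarily those used in these experiments.

```python
import numpy as np

def binary_focal_loss(y_true, p, gamma=2.0, alpha=0.25):
    """Multi-label focal loss on predicted probabilities p.
    BCE is weighted by (1 - p_t)^gamma so easy examples contribute
    less; with gamma=0 and alpha=0.5 it reduces to 0.5 * standard BCE."""
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)            # prob. of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)  # class-balance weight
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))
```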
Table 5.
Evaluation metrics for each class on unseen data.
| Index | Label | Accuracy | F1-Score | Recall | Precision | Avg_Precision |
|---|---|---|---|---|---|---|
| 0 | Baby crying | 0.99 | 0.967 | 1.0 | 0.937 | 1.0 |
| 1 | Dog barking | 0.937 | 0.606 | 0.434 | 1.0 | 0.793 |
| 2 | Door | 0.947 | 0.645 | 0.526 | 0.833 | 0.843 |
| 3 | Honking | 0.99 | 0.833 | 0.714 | 1.0 | 1.0 |
| 4 | Indicator | 0.937 | 0.0 | 0.0 | 0.0 | 0.33 |
| 5 | Knocking at the car | 0.99 | 0.8 | 0.666 | 1.0 | 0.958 |
| 6 | Siren | 0.985 | 0.0 | 0.0 | 0.0 | 1.0 |
| 7 | Speech | 0.951 | 0.0 | 0.0 | 0.0 | 0.394 |
| 8 | Vehicle passing by | 0.72 | 0.093 | 0.049 | 1.0 | 0.622 |
| 9 | Window | 0.937 | 0.38 | 0.25 | 0.8 | 0.684 |
| 10 | Wiper | 0.886 | 0.0 | 0.0 | 0.0 | 0.866 |
Table 6.
Evaluation metrics for each class on unseen data with augmentation.
| Index | Label | Accuracy | F1-Score | Recall | Precision | Avg_Precision |
|---|---|---|---|---|---|---|
| 0 | Baby crying | 0.964 | 0.888 | 0.966 | 0.821 | 0.961 |
| 1 | Dog barking | 0.934 | 0.599 | 0.441 | 0.934 | 0.624 |
| 2 | Door | 0.938 | 0.524 | 0.368 | 0.907 | 0.851 |
| 3 | Honking | 0.984 | 0.71 | 0.551 | 1.0 | 1.0 |
| 4 | Indicator | 0.952 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | Knocking at the car | 0.99 | 0.8 | 0.666 | 1.0 | 0.994 |
| 6 | Siren | 0.985 | 0.0 | 0.0 | 0.0 | 0.797 |
| 7 | Speech | 0.951 | 0.0 | 0.0 | 0.0 | 0.261 |
| 8 | Vehicle passing by | 0.738 | 0.201 | 0.112 | 0.96 | 0.849 |
| 9 | Window | 0.935 | 0.287 | 0.169 | 0.95 | 0.576 |
| 10 | Wiper | 0.884 | 0.0 | 0.0 | 0.0 | 0.813 |
Table 7.
Summary of results of experiments with different loss functions on the dataset generated using ADA, for a single fold.
| Loss Function | Accuracy | mAP |
|---|---|---|
| BCE loss | 0.86 | 0.48 |
| Focal loss | 0.87 | 0.56 |
Table 8.
Evaluation metrics for each class on unseen data with overlapping events generated using ADA.
| Index | Label | Accuracy | F1-Score | Recall | Precision | Avg_Precision |
|---|---|---|---|---|---|---|
| 0 | Baby crying | 0.947 | 0.884 | 0.94 | 0.835 | 0.960 |
| 1 | Dog barking | 0.87 | 0.561 | 0.39 | 1.0 | 0.696 |
| 2 | Door | 0.869 | 0.525 | 0.356 | 1.0 | 0.721 |
| 3 | Honking | 0.984 | 0.88 | 0.785 | 1.0 | 0.987 |
| 4 | Indicator | 0.878 | 0.109 | 0.07 | 0.184 | 0.178 |
| 5 | Knocking at the car | 0.920 | 0.336 | 0.316 | 0.385 | 0.342 |
| 6 | Siren | 0.982 | 0.636 | 0.466 | 1.0 | 0.994 |
| 7 | Speech | 0.893 | 0.0 | 0.0 | 0.0 | 0.275 |
| 8 | Vehicle passing by | 0.795 | 0.0769 | 0.040 | 1.0 | 0.712 |
| 9 | Window | 0.913 | 0.456 | 0.354 | 0.641 | 0.535 |
| 10 | Wiper | 0.808 | 0.181 | 0.1 | 1.0 | 0.842 |
Table 9.
Evaluation metrics for each class on unseen data generated using ADA, with augmentation.
| Index | Label | Accuracy | F1-Score | Recall | Precision | Avg_Precision |
|---|---|---|---|---|---|---|
| 0 | Baby crying | 0.906 | 0.797 | 0.861 | 0.741 | 0.882 |
| 1 | Dog barking | 0.846 | 0.534 | 0.415 | 0.75 | 0.599 |
| 2 | Door | 0.818 | 0.212 | 0.120 | 0.899 | 0.591 |
| 3 | Honking | 0.976 | 0.809 | 0.679 | 1.0 | 0.997 |
| 4 | Indicator | 0.900 | 0.081 | 0.046 | 0.353 | 0.416 |
| 5 | Knocking at the car | 0.933 | 0.343 | 0.271 | 0.469 | 0.416 |
| 6 | Siren | 0.972 | 0.25 | 0.142 | 1.0 | 0.871 |
| 7 | Speech | 0.893 | 0.0 | 0.0 | 0.0 | 0.331 |
| 8 | Vehicle passing by | 0.804 | 0.148 | 0.080 | 1.0 | 0.877 |
| 9 | Window | 0.912 | 0.461 | 0.369 | 0.616 | 0.543 |
| 10 | Wiper | 0.80 | 0.11 | 0.062 | 1.0 | 0.784 |