4.1. Evaluation Settings
To validate our algorithm, we selected the DCASE2018 Task5 dataset, specifically the SINS dataset [6]. Because the SINS data were collected according to a planned recording setup, the performance gap between known and unknown microphones can be verified clearly, and we expected this to confirm the efficacy of the proposed technique. Furthermore, while SED detects multiple events within each audio segment, together with their onset and offset times, MDA detects only a single event per segment, making it simpler to analyze the performance of the proposed algorithm.
The SINS dataset [19] consists of 4-channel audio recorded in home environments, including living rooms, workrooms, kitchens, and dining rooms. Data were recorded with bundles of four microphones at seven locations, and the development dataset comprises 10 s segments recorded from the four microphones, totaling 200 h. The SINS dataset is labeled with nine activities: absence, cooking, dishwashing, eating, other, social activity, vacuum cleaning, watching TV, and working. The dataset was divided for 4-fold cross-validation [6]: the data were split into four equal subsets, and each subset was used once as a validation set while the remaining three subsets were used for training. In this setup, a fold refers to one such combination of a validation subset and the three training subsets.
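For illustration, a 4-fold split of this kind can be sketched with scikit-learn as below; note that the official DCASE folds are predefined by the dataset, so the random split and the placeholder arrays here are only assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder features and labels standing in for the SINS data
# (the official DCASE 2018 Task5 folds are predefined, not random).
X = np.random.rand(400, 40, 313)       # (samples, mel bins, frames)
y = np.random.randint(0, 9, size=400)  # nine activity labels

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # Each subset serves once as the validation set; the remaining
    # three subsets form the training set for that fold.
    X_train, y_train = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val")
```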
The model structure used for training was originally proposed in [13]. The overall structure and the output size of each layer are depicted in Table 2; the model consists of simple CNNs.
The model input was a 10 s audio segment at a sampling rate of 16 kHz, converted via STFT with 1024 FFT points and filtered with a mel filter bank of 40 mel filters. The features were normalized to values between 0 and 1. We set the learning rate to 0.0001 and the batch size to 128, and used Adam [26] as the optimizer and categorical cross-entropy as the loss function.
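A minimal sketch of this front end, assuming librosa; the hop length is not stated in the text, so the value below is an assumption.

```python
import librosa
import numpy as np

def extract_features(wav_path):
    # Load a 10 s segment at a 16 kHz sampling rate.
    audio, sr = librosa.load(wav_path, sr=16000, duration=10.0)
    # STFT with 1024 FFT points, then a 40-filter mel filter bank.
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=1024, n_mels=40,
        hop_length=512)  # hop length assumed; not specified in the text
    # Normalize the features to values between 0 and 1.
    mel_min, mel_max = mel.min(), mel.max()
    return (mel - mel_min) / (mel_max - mel_min + 1e-8)
```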
Figure 8 shows the intermediate results of the feature extraction process, in which the input signal is transformed from a time-domain waveform to a frequency-domain mel-spectrogram. Applying the mel filter bank to the spectrogram increases the resolution of the low-frequency components, the region to which human hearing is most sensitive, allowing the model to be trained with more detailed low-frequency information. We applied our algorithm to the mel-spectrogram to generate multiple augmented samples, which were then used as input to the proposed system.
To compare with the proposed method, datasets processed with different augmentation algorithms were used to train the same model structure [13]. The first was mix&shuffle, which was used by the first-place team in the competition [13] and proved to be a simple and effective technique. In addition, specaugment [14] and mixup [27], two recent methods commonly used in audio augmentation, were used as comparison groups.
To balance the dataset, we augmented data for labels with fewer than 4000 samples in each fold, except for mixup [27], which was applied during dataset loading. Augmentation was applied until each of these labels reached a total of 4800 samples.
Table 3 presents the number of samples per fold and label before data augmentation.
For mix&shuffle, two data samples with the same label were randomly selected; each sample was sliced into four chunks, the chunk positions were shuffled, and the chunks of the two samples were randomly mixed. For specaugment, the number of time masks was fixed at 1, with a time masking size ranging from 5 to 10 frames; the number of frequency masks was randomly selected between 1 and 2, with sizes ranging from 10 to 15 in the mel-bin direction. The mixup parameter was set to 0.2.
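The following sketch implements the three baselines from this description with the stated parameters; it is our reconstruction, not the authors' code, and the exact chunk-mixing rule is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_and_shuffle(mel_a, mel_b):
    """mix&shuffle: given two same-label mel-spectrograms, slice each into
    four chunks along time, randomly mix the chunks of the two samples,
    and shuffle the chunk positions (mixing rule assumed)."""
    chunks_a = np.array_split(mel_a, 4, axis=1)
    chunks_b = np.array_split(mel_b, 4, axis=1)
    chunks = [chunks_a[i] if rng.random() < 0.5 else chunks_b[i]
              for i in range(4)]
    rng.shuffle(chunks)
    return np.concatenate(chunks, axis=1)

def spec_augment(mel):
    """specaugment: one time mask of 5-10 frames and 1-2 frequency
    masks of 10-15 mel bins, as stated above."""
    mel = mel.copy()
    n_mels, n_frames = mel.shape
    t = int(rng.integers(5, 11))              # time mask size
    t0 = int(rng.integers(0, n_frames - t))
    mel[:, t0:t0 + t] = 0.0
    for _ in range(int(rng.integers(1, 3))):  # 1 or 2 frequency masks
        f = int(rng.integers(10, 16))
        f0 = int(rng.integers(0, n_mels - f))
        mel[f0:f0 + f, :] = 0.0
    return mel

def mixup(x1, y1, x2, y2, alpha=0.2):
    """mixup at loading time with the stated parameter alpha = 0.2."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```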
We used the F1-Score [28] as the metric; it is commonly employed in multi-class evaluations with imbalanced data and was also the evaluation metric in DCASE 2018 Task5 [6]. It is obtained by calculating the Precision and Recall of a class and taking their harmonic mean. Precision, also referred to as the Positive Predictive Value (PPV), is the ratio of true positives to predicted positives and can be calculated as follows:

Precision = TP / (TP + FP),
where TP represents true positives, which are samples correctly classified as positive, and FP represents false positives, which are samples incorrectly classified as positive when their true label is negative. Recall, also known as sensitivity or the True Positive Rate (TPR), is the ratio of correctly predicted positives to actual positives and is calculated as follows:

Recall = TP / (TP + FN),

where FN represents false negatives, which are positive samples incorrectly classified as negative.
The F1-Score is then obtained as the harmonic mean of Precision and Recall:

F1 = 2 · (Precision · Recall) / (Precision + Recall)
In multi-class classification tasks, the F1-Score is analyzed in further detail using micro, macro, and weighted averages [28]. The macro F1-Score is the arithmetic mean of the per-class F1-Scores, and the weighted F1-Score is the average of the per-class F1-Scores weighted by each class's proportion of the data. The micro F1-Score, in turn, is computed from the overall TP, FN, and FP counts pooled across classes. The weighted average thus assigns weights according to the number of events per class, while the macro average treats all classes equally by assigning them the same weight. Finally, the micro average reflects overall performance without any per-label weighting.
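As a short illustration of the three averages, assuming scikit-learn and placeholder labels:

```python
from sklearn.metrics import f1_score

# Placeholder ground-truth and predicted labels over the nine activities.
y_true = [0, 1, 2, 2, 3, 4, 5, 6, 7, 8]
y_pred = [0, 1, 2, 3, 3, 4, 5, 6, 8, 8]

micro = f1_score(y_true, y_pred, average="micro")        # pooled TP/FP/FN
macro = f1_score(y_true, y_pred, average="macro")        # equal class weights
weighted = f1_score(y_true, y_pred, average="weighted")  # weight by support
```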
4.2. Evaluation Results
We used the micro, macro, and weighted averages of the F1-Score as metrics for the performance comparison. Following the cross-validation setup of the SINS dataset [19], one model was trained per fold, and for evaluation the probability outputs of these four models were averaged. We evaluated the performance on the test portion of the development dataset and on an evaluation dataset, which included data collected with unknown microphones not present in the development dataset. This allowed us to compare the results for known and unknown microphone positions, each with different RTFs. For the experiment to determine suitable parameters for the proposed method, we evaluated the performance using the evaluation dataset only, dividing it into known and unknown microphone data. First, we compared the proposed method with existing data augmentation techniques, then examined whether the performance gap between known and unknown microphones decreased, as intended. Next, we compared performance changes across the parameters of the proposed method. Finally, we compared the performance of the proposed method with a technique that applies Gaussian noise, which is superficially similar to our method and easy to apply.
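A minimal sketch of the four-fold ensembling step; `models` stands for the four trained fold models and `predict` for their class-probability output, both placeholders.

```python
import numpy as np

def ensemble_predict(models, features):
    """Average the class-probability outputs of the four fold models."""
    probs = np.stack([m.predict(features) for m in models])
    return probs.mean(axis=0)  # shape: (n_samples, n_classes)
```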
4.2.1. Comparison of Performance with Conventional Data Augmentation Techniques
We compared the proposed technique with conventional data augmentation techniques: mix&shuffle [13], mixup [27], and specaugment [14]. When using mix&shuffle or specaugment together with the proposed method, half of the augmented data were generated with the proposed method and the remaining half with mix&shuffle or specaugment. When applying mixup together with the proposed method, mixup was applied during training on the augmented data generated by the proposed method.
As demonstrated in Table 4, the proposed algorithm improved performance across all metrics compared to the model without augmentation. Furthermore, the results from the development and evaluation datasets demonstrated that the proposed method achieved the best performance in both the micro and weighted F1-Scores.
When the proposed method was combined with conventional techniques (mixup [27] and specaugment [14]), performance on the evaluation dataset tended to improve. Notably, although specaugment exhibited the best macro-averaged performance on the development dataset, its performance dropped significantly on the evaluation dataset; when used together with the proposed method, this degradation was mitigated. Since the evaluation dataset includes recordings from microphones in unknown positions, the very condition the proposed method targets, combining the two techniques preserves specaugment's strong macro average on the development dataset while also yielding a large improvement in the macro average on the evaluation dataset.
4.2.2. Comparison of Performance with Unknown Microphones
In the previous section, we observed that our algorithm enhanced model performance similarly to conventional data augmentation techniques. In this section, we examine the performance of microphones in both known and unknown locations within the evaluation dataset to determine if the model has been robustly trained with respect to location, as intended.
From the results in Table 5, it is evident that for known microphones in the evaluation dataset, the highest performance in terms of the micro and weighted averages was achieved when specaugment [14] and the proposed method were used together. For unknown microphones, the highest micro and weighted averages were achieved when mixup [27] and the proposed method were combined.
Thus, for known microphones, using specaugment [14] together with the proposed method was effective, whereas for unknown microphones, combining mixup [27] with the proposed method yielded greater improvements. This result indicates that, depending on the situation, it can be more effective to use our algorithm in conjunction with conventional techniques.
Using mixup alone resulted in the highest improvement in terms of the macro average. Since macro averaging applies equal weight to each class in the F1-Score calculation, it is more sensitive to the performance of classes with fewer samples. As shown in Figure 1, classes with fewer samples and relatively lower baseline performance, such as other, dishwashing, and eating, may have experienced greater improvements with mixup. This could be due to mixup's tendency to enhance performance on ambiguous or overlapping classes; in particular, the dishwashing and eating classes, both of which contain dish sounds, appear to have benefited from this effect, although this interpretation remains speculative, being based only on the macro-average results.
While mixup showed improvements in specific cases, the proposed method outperforms it when considering overall performance across multiple metrics: it consistently achieves higher micro- and weighted-average F1-Scores regardless of whether mixup is used alone or in combination with other techniques, demonstrating greater robustness and effectiveness across a broader range of classes. Thus, although mixup offers localized benefits in certain scenarios, our approach is more advantageous for comprehensive performance improvements.
4.2.3. Performance Analysis According to Random Values of Two Rayleigh Ratio Parameters
To determine the optimal parameters, we randomly generated values of the CDF parameters for the ratio of the two Rayleigh distributions and assessed the resulting performance. The parameters that govern the PDF of the ratio of the two Rayleigh distributions are denoted as σ and λ, as in (11). We constrained the randomly generated parameter values to the range of 0.5 to 5 to prevent them from being excessively large or small.
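As a hedged sketch of this step, samples from the ratio distribution can be drawn by dividing two independent Rayleigh variates; applying one draw per mel bin, shared across time, is our assumption, mirroring a frequency-dependent transfer function.

```python
import numpy as np

rng = np.random.default_rng(0)

def rayleigh_ratio_factors(shape, sigma=1.0, lam=1.0):
    """Draw multiplicative factors distributed as the ratio of two
    independent Rayleigh variables with scales sigma and lambda."""
    return (rng.rayleigh(scale=sigma, size=shape) /
            rng.rayleigh(scale=lam, size=shape))

def augment_rtf(mel, sigma=1.0, lam=1.0):
    # One factor per mel bin, shared across time (per-bin application
    # is an assumption); sigma = lambda = 1 performed best in Figure 9.
    return mel * rayleigh_ratio_factors((mel.shape[0], 1), sigma, lam)

# Parameter search: sigma and lambda drawn uniformly from [0.5, 5],
# matching the constraint stated above.
sigma, lam = rng.uniform(0.5, 5.0, size=2)
```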
For each experiment, we increased the data to 4800 samples for classes with a count of 4000 or fewer in each fold of the development dataset, as described in Section 4.1. The results displayed in Figure 9 are those for the test data in the development dataset. The experiments showed that performance was highest when both σ and λ were set to 1.
In these results, performance was generally better when the Rayleigh parameters σ and λ had identical values. This may be because, in real room environments, the statistical properties of the RIRs are more likely to be similar than significantly different, so identical values may better simulate real-world conditions. Additionally, when σ was larger than λ, the scores tended to be lower overall, although the small number of experiments makes it difficult to generalize this observation. Consequently, it is preferable to set σ and λ to the same value or, if they are set differently, to avoid cases where σ is larger than λ.
4.2.4. Comparison of Performance with Data Augmentation Technique Using Gaussian Noise
We compared the proposed method with a method that generates data using Gaussian noise, a type of noise that can be present in a sound signal, in order to demonstrate that the effectiveness of the proposed method is not due merely to the presence of noise. We generated the Gaussian-noise data by multiplying the mel spectrogram by Gaussian random values, in the same way as in the data augmentation technique using RTFs. The experimental results in Table 6 indicate that Gaussian noise degraded the model's performance.
As demonstrated by the experimental results, Gaussian noise resulted in a lower F1-Score than the baseline model trained without any augmentation. In contrast, the proposed model increased the F1-Score relative to the baseline and minimized the performance difference between the development and evaluation datasets, indicating that the proposed method regularized the model to some extent. The proposed method therefore has an effect distinct from simply adding noise.
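For contrast, a sketch of the Gaussian variant described above, which multiplies the mel spectrogram by Gaussian random values in the same manner; the mean and standard deviation are assumptions, as the text does not state them.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_gaussian(mel, mu=1.0, std=0.5):
    """Multiply each mel bin by a Gaussian random factor; mu and std
    are assumed values, not specified in the text."""
    factors = rng.normal(loc=mu, scale=std, size=(mel.shape[0], 1))
    # Unlike the Rayleigh ratio, a Gaussian can produce negative or
    # near-zero factors, one possible source of the wider spectral
    # spread observed in Figure 10a.
    return mel * factors
```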
For further analysis, 100 augmented samples were generated from a known sound source (vacuum cleaner) using a Gaussian distribution and using the ratio of two Rayleigh distributions. These were then compared with the spectra of the same sound source recorded by both known and unknown microphones.
Figure 10a,b show the spectra of the augmented data for each distribution, and in Figure 10c, the orange line represents the spectrum from the unknown microphone, while the blue line represents the spectrum from the known microphone.
Figure 10 compares the spectra of sounds recorded from the same vacuum cleaner at different locations (the known source taken from the development dataset and the unknown source from the evaluation dataset) with those of the augmented data generated using the Gaussian distribution and the proposed method. For a statistical comparison, data augmentation using the Gaussian distribution (Figure 10a) and the proposed method (Figure 10b) was performed on the same known data, generating 100 samples for each method. The spectrum was calculated for each sample, and the results were sorted by magnitude for each frequency bin. Lines were then drawn at the 90th, 75th, 50th, 25th, and 10th percentiles of the magnitudes.
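A short sketch of this percentile computation; `spectra` is assumed to hold the magnitude spectrum of each of the 100 augmented samples.

```python
import numpy as np

def percentile_curves(spectra):
    """spectra: array of shape (n_samples, n_freq_bins), one magnitude
    spectrum per augmented sample. Returns the 10th, 25th, 50th, 75th,
    and 90th percentile curves per frequency bin, as plotted in Figure 10."""
    return np.percentile(spectra, [10, 25, 50, 75, 90], axis=0)
```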
In the case of the Gaussian distribution, the spectral lines are spread more widely than with the proposed method; in particular, at the 10th-percentile line, the shape of the spectrum differs significantly from the original data. With the proposed method, in contrast, the spectrum closely follows the shape of the original data even at the 10th-percentile line. As shown in Figure 10c, the spectra of the known and unknown sources exhibit only slight differences. This suggests that augmenting data with an arbitrary PDF, such as a Gaussian distribution, can produce large deviations from the actual RTFs and thus degrade the model's performance, whereas our proposed method tends to generate data that closely resembles real-world data, thereby enhancing the model's performance.