4.1. Evaluation Settings
To validate our algorithm, we selected the DCASE2018 Task5 dataset, specifically the SINS dataset [6]. Because the SINS data were collected according to a planned recording setup, the performance gap between known and unknown microphones can be verified clearly, and we expected this to confirm the efficacy of the proposed technique. Furthermore, while SED detects multiple events within each audio segment, together with their onset and offset times, MDA detects only a single event per segment, making it simpler to analyze the performance of the proposed algorithm.
The SINS dataset [19] consists of 4-channel audio recorded in home environments, including living rooms, workrooms, kitchens, and dining rooms. Data were recorded with bundles of four microphones at seven locations, and the development dataset comprises 10 s segments recorded from the four microphones, totaling 200 h. The SINS dataset is labeled with nine activities: absence, cooking, dishwashing, eating, other, social activity, vacuum cleaning, watching TV, and working. The dataset was divided for 4-fold cross-validation [6]: the data were split into four equal subsets, and each subset was used once as a validation set while the remaining three subsets were used for training. In this setup, a fold refers to one such combination of a validation subset and the three training subsets.
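For illustration, a 4-fold split of this kind can be sketched with scikit-learn as below; note that the official DCASE folds are predefined by the dataset, so the random split and the placeholder arrays here are only assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder features and labels standing in for the SINS data
# (the official DCASE 2018 Task5 folds are predefined, not random).
X = np.random.rand(400, 40, 313)       # (samples, mel bins, frames)
y = np.random.randint(0, 9, size=400)  # nine activity labels

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # Each subset serves once as the validation set; the remaining
    # three subsets form the training set for that fold.
    X_train, y_train = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val")
```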
The model structure used for training was originally proposed in [13]. The overall structure and the output size of each layer are depicted in Table 2; the model consists of simple CNNs.
The model input was a 10 s audio segment at a sampling rate of 16 kHz, converted via STFT with 1024 FFT points and filtered with a mel filter bank of 40 mel filters. The features were normalized to values between 0 and 1. We set the learning rate to 0.0001 and the batch size to 128, and used Adam [26] as the optimizer and categorical cross-entropy as the loss function.
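A minimal sketch of this front end, assuming librosa; the hop length is not stated in the text, so the value below is an assumption.

```python
import librosa
import numpy as np

def extract_features(wav_path):
    # Load a 10 s segment at a 16 kHz sampling rate.
    audio, sr = librosa.load(wav_path, sr=16000, duration=10.0)
    # STFT with 1024 FFT points, then a 40-filter mel filter bank.
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=1024, n_mels=40,
        hop_length=512)  # hop length assumed; not specified in the text
    # Normalize the features to values between 0 and 1.
    mel_min, mel_max = mel.min(), mel.max()
    return (mel - mel_min) / (mel_max - mel_min + 1e-8)
```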
Figure 8 shows the intermediate results of the feature extraction process, in which the input signal is transformed from a time-domain waveform to a frequency-domain mel-spectrogram. Applying the mel filter bank to the spectrogram increases the resolution of the low-frequency components, the region to which human hearing is most sensitive, allowing the model to be trained with more detailed low-frequency information. We applied our algorithm to the mel-spectrogram to generate multiple augmented samples, which were then used as input to the proposed system.
To compare with the proposed method, datasets processed with different augmentation algorithms were used to train the same model structure [13]. The first was mix&shuffle, which was used by the first-place team in the competition [13] and proved to be a simple and effective technique. In addition, specaugment [14] and mixup [27], two recent methods commonly used in audio augmentation, were used as comparison groups.
To balance the dataset, we augmented data for labels with fewer than 4000 samples in each fold, except for mixup [27], which was applied during dataset loading. Augmentation was applied until each of these labels reached a total of 4800 samples.
Table 3 presents the number of samples per fold and label before data augmentation.
For mix&shuffle, two data samples with the same label were randomly selected; each sample was sliced into four chunks, the chunk positions were shuffled, and the chunks of the two samples were randomly mixed. For specaugment, the number of time masks was fixed at 1, with a time masking size ranging from 5 to 10 frames; the number of frequency masks was randomly selected between 1 and 2, with sizes ranging from 10 to 15 in the mel-bin direction. The mixup parameter was set to 0.2.
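The following sketch implements the three baselines from this description with the stated parameters; it is our reconstruction, not the authors' code, and the exact chunk-mixing rule is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_and_shuffle(mel_a, mel_b):
    """mix&shuffle: given two same-label mel-spectrograms, slice each into
    four chunks along time, randomly mix the chunks of the two samples,
    and shuffle the chunk positions (mixing rule assumed)."""
    chunks_a = np.array_split(mel_a, 4, axis=1)
    chunks_b = np.array_split(mel_b, 4, axis=1)
    chunks = [chunks_a[i] if rng.random() < 0.5 else chunks_b[i]
              for i in range(4)]
    rng.shuffle(chunks)
    return np.concatenate(chunks, axis=1)

def spec_augment(mel):
    """specaugment: one time mask of 5-10 frames and 1-2 frequency
    masks of 10-15 mel bins, as stated above."""
    mel = mel.copy()
    n_mels, n_frames = mel.shape
    t = int(rng.integers(5, 11))              # time mask size
    t0 = int(rng.integers(0, n_frames - t))
    mel[:, t0:t0 + t] = 0.0
    for _ in range(int(rng.integers(1, 3))):  # 1 or 2 frequency masks
        f = int(rng.integers(10, 16))
        f0 = int(rng.integers(0, n_mels - f))
        mel[f0:f0 + f, :] = 0.0
    return mel

def mixup(x1, y1, x2, y2, alpha=0.2):
    """mixup at loading time with the stated parameter alpha = 0.2."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```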
We used the F1-Score [28] as the metric; it is commonly employed in multi-class evaluations with imbalanced data and was also the evaluation metric in DCASE 2018 Task5 [6]. It is obtained by calculating the Precision and Recall of a class and taking their harmonic mean. Precision, also referred to as the Positive Predictive Value (PPV), is the ratio of true positives to predicted positives and can be calculated as follows:

Precision = TP / (TP + FP),
where TP represents true positives, which are samples correctly classified as positive, and FP represents false positives, which are samples incorrectly classified as positive when their true label is negative. Recall, also known as sensitivity or the True Positive Rate (TPR), is the ratio of correctly predicted positives to actual positives and is calculated as follows:

Recall = TP / (TP + FN),

where FN represents false negatives, which are positive samples incorrectly classified as negative.
The F1-Score is then obtained as the harmonic mean of Precision and Recall:

F1 = 2 · (Precision · Recall) / (Precision + Recall)
In multi-class classification tasks, the F1-Score is analyzed in further detail using micro, macro, and weighted averages [28]. The macro F1-Score is the arithmetic mean of the per-class F1-Scores, and the weighted F1-Score is the average of the per-class F1-Scores weighted by each class's proportion of the data. The micro F1-Score, in turn, is computed from the overall TP, FN, and FP counts pooled across classes. The weighted average thus assigns weights according to the number of events per class, while the macro average treats all classes equally by assigning them the same weight. Finally, the micro average reflects overall performance without any per-label weighting.
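As a short illustration of the three averages, assuming scikit-learn and placeholder labels:

```python
from sklearn.metrics import f1_score

# Placeholder ground-truth and predicted labels over the nine activities.
y_true = [0, 1, 2, 2, 3, 4, 5, 6, 7, 8]
y_pred = [0, 1, 2, 3, 3, 4, 5, 6, 8, 8]

micro = f1_score(y_true, y_pred, average="micro")        # pooled TP/FP/FN
macro = f1_score(y_true, y_pred, average="macro")        # equal class weights
weighted = f1_score(y_true, y_pred, average="weighted")  # weight by support
```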
4.2. Evaluation Results
We used the micro, macro, and weighted averages of the F1-Score as metrics for the performance comparison. Following the cross-validation setup of the SINS dataset [19], one model was trained per fold, and for evaluation the probability outputs of these four models were averaged. We evaluated the performance on the test portion of the development dataset and on an evaluation dataset, which included data collected with unknown microphones not present in the development dataset. This allowed us to compare the results for known and unknown microphone positions, each with different RTFs. For the experiment to determine suitable parameters for the proposed method, we evaluated the performance using the evaluation dataset only, dividing it into known and unknown microphone data. First, we compared the proposed method with existing data augmentation techniques, then examined whether the performance gap between known and unknown microphones decreased, as intended. Next, we compared performance changes across the parameters of the proposed method. Finally, we compared the performance of the proposed method with a technique that applies Gaussian noise, which is superficially similar to our method and easy to apply.
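A minimal sketch of the four-fold ensembling step; `models` stands for the four trained fold models and `predict` for their class-probability output, both placeholders.

```python
import numpy as np

def ensemble_predict(models, features):
    """Average the class-probability outputs of the four fold models."""
    probs = np.stack([m.predict(features) for m in models])
    return probs.mean(axis=0)  # shape: (n_samples, n_classes)
```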
4.2.1. Comparison of Performance with Conventional Data Augmentation Techniques
We compared the proposed technique with conventional data augmentation techniques: mix&shuffle [13], mixup [27], and specaugment [14]. When using mix&shuffle or specaugment together with the proposed method, half of the augmented data were generated with the proposed method and the remaining half with mix&shuffle or specaugment. When applying mixup together with the proposed method, mixup was applied during training on the augmented data generated by the proposed method.
As demonstrated in Table 4, the proposed algorithm improved performance across all metrics compared to the model without augmentation. Furthermore, the results from the development and evaluation datasets demonstrated that the proposed method achieved the best performance in both the micro and weighted F1-Scores.
When the proposed method was combined with conventional techniques (mixup [27] and specaugment [14]), performance on the evaluation dataset tended to improve. Notably, although specaugment exhibited the best macro-averaged performance on the development dataset, its performance dropped significantly on the evaluation dataset; when used together with the proposed method, this degradation was mitigated. Since the evaluation dataset includes recordings from microphones in unknown positions, the very condition the proposed method targets, combining the two techniques preserves specaugment's strong macro average on the development dataset while also yielding a large improvement in the macro average on the evaluation dataset.
4.2.2. Comparison of Performance with Unknown Microphones
In the previous section, we observed that our algorithm enhanced model performance similarly to conventional data augmentation techniques. In this section, we examine the performance of microphones in both known and unknown locations within the evaluation dataset to determine if the model has been robustly trained with respect to location, as intended.
From the results in Table 5, it is evident that for known microphones in the evaluation dataset, the highest performance in terms of the micro and weighted averages was achieved when specaugment [14] and the proposed method were used together. For unknown microphones, the highest micro and weighted averages were achieved when mixup [27] and the proposed method were combined.
Thus, for known microphones, using specaugment [14] together with the proposed method was effective, whereas for unknown microphones, combining mixup [27] with the proposed method yielded greater improvements. This result indicates that, depending on the situation, it can be more effective to use our algorithm in conjunction with conventional techniques.
Using mixup alone resulted in the highest improvement in terms of the macro average. Since macro averaging applies equal weight to each class in the F1-Score calculation, it is more sensitive to the performance of classes with fewer samples. As shown in Figure 1, classes with fewer samples and relatively lower baseline performance, such as other, dishwashing, and eating, may have experienced greater improvements with mixup. This could be due to mixup's tendency to enhance performance on ambiguous or overlapping classes; in particular, the dishwashing and eating classes, both of which contain dish sounds, appear to have benefited from this effect, although this interpretation remains speculative, being based only on the macro-average results.
While mixup showed improvements in specific cases, the proposed method outperforms it when considering overall performance across multiple metrics: it consistently achieves higher micro- and weighted-average F1-Scores regardless of whether mixup is used alone or in combination with other techniques, demonstrating greater robustness and effectiveness across a broader range of classes. Thus, although mixup offers localized benefits in certain scenarios, our approach is more advantageous for comprehensive performance improvements.
4.2.3. Performance Analysis According to Random Values of Two Rayleigh Ratio Parameters
To determine the optimal parameters, we randomly generated values of the CDF parameters for the ratio of the two Rayleigh distributions and assessed the resulting performance. The parameters that govern the PDF of the ratio of the two Rayleigh distributions are denoted as σ and λ, as in (11). We constrained the randomly generated parameter values to the range of 0.5 to 5 to prevent them from being excessively large or small.
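As a hedged sketch of this step, samples from the ratio distribution can be drawn by dividing two independent Rayleigh variates; applying one draw per mel bin, shared across time, is our assumption, mirroring a frequency-dependent transfer function.

```python
import numpy as np

rng = np.random.default_rng(0)

def rayleigh_ratio_factors(shape, sigma=1.0, lam=1.0):
    """Draw multiplicative factors distributed as the ratio of two
    independent Rayleigh variables with scales sigma and lambda."""
    return (rng.rayleigh(scale=sigma, size=shape) /
            rng.rayleigh(scale=lam, size=shape))

def augment_rtf(mel, sigma=1.0, lam=1.0):
    # One factor per mel bin, shared across time (per-bin application
    # is an assumption); sigma = lambda = 1 performed best in Figure 9.
    return mel * rayleigh_ratio_factors((mel.shape[0], 1), sigma, lam)

# Parameter search: sigma and lambda drawn uniformly from [0.5, 5],
# matching the constraint stated above.
sigma, lam = rng.uniform(0.5, 5.0, size=2)
```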
For each experiment, we increased the data to 4800 samples for classes with a count of 4000 or fewer in each fold of the development dataset, as described in Section 4.1. The results displayed in Figure 9 are those for the test data in the development dataset. The experiments showed that performance was highest when both σ and λ were set to 1.
In these results, performance was generally better when the Rayleigh parameters σ and λ had identical values. This may be because, in real room environments, the statistical properties of the RIRs are more likely to be similar than significantly different, so identical values may better simulate real-world conditions. Additionally, when σ was larger than λ, the scores tended to be lower overall, although the small number of experiments makes it difficult to generalize this observation. Consequently, it is preferable to set σ and λ to the same value or, if they are set differently, to avoid cases where σ is larger than λ.
4.2.4. Comparison of Performance with Data Augmentation Technique Using Gaussian Noise
We compared the proposed method with a method that generates data using Gaussian noise, a type of noise that can be present in a sound signal, in order to demonstrate that the effectiveness of the proposed method is not due merely to the presence of noise. We generated the Gaussian-noise data by multiplying the mel spectrogram by Gaussian random values, in the same way as in the data augmentation technique using RTFs. The experimental results in Table 6 indicate that Gaussian noise degraded the model's performance.
As demonstrated by the experimental results, Gaussian noise resulted in a lower F1-Score than the baseline model trained without any augmentation. In contrast, the proposed model increased the F1-Score relative to the baseline and minimized the performance difference between the development and evaluation datasets, indicating that the proposed method regularized the model to some extent. The proposed method therefore has an effect distinct from simply adding noise.
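For contrast, a sketch of the Gaussian variant described above, which multiplies the mel spectrogram by Gaussian random values in the same manner; the mean and standard deviation are assumptions, as the text does not state them.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_gaussian(mel, mu=1.0, std=0.5):
    """Multiply each mel bin by a Gaussian random factor; mu and std
    are assumed values, not specified in the text."""
    factors = rng.normal(loc=mu, scale=std, size=(mel.shape[0], 1))
    # Unlike the Rayleigh ratio, a Gaussian can produce negative or
    # near-zero factors, one possible source of the wider spectral
    # spread observed in Figure 10a.
    return mel * factors
```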
For further analysis, 100 augmented samples were generated from a known sound source (vacuum cleaner) using a Gaussian distribution and using the ratio of two Rayleigh distributions. These were then compared with the spectra of the same sound source recorded by both known and unknown microphones.
Figure 10a,b show the spectra of the augmented data for each distribution, and in Figure 10c, the orange line represents the spectrum from the unknown microphone, while the blue line represents the spectrum from the known microphone.
Figure 10 compares the spectra of sounds recorded from the same vacuum cleaner at different locations (the known source taken from the development dataset and the unknown source from the evaluation dataset) with those of the augmented data generated using the Gaussian distribution and the proposed method. For a statistical comparison, data augmentation using the Gaussian distribution (Figure 10a) and the proposed method (Figure 10b) was performed on the same known data, generating 100 samples for each method. The spectrum was calculated for each sample, and the results were sorted by magnitude for each frequency bin. Lines were then drawn at the 90th, 75th, 50th, 25th, and 10th percentiles of the magnitudes.
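A short sketch of this percentile computation; `spectra` is assumed to hold the magnitude spectrum of each of the 100 augmented samples.

```python
import numpy as np

def percentile_curves(spectra):
    """spectra: array of shape (n_samples, n_freq_bins), one magnitude
    spectrum per augmented sample. Returns the 10th, 25th, 50th, 75th,
    and 90th percentile curves per frequency bin, as plotted in Figure 10."""
    return np.percentile(spectra, [10, 25, 50, 75, 90], axis=0)
```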
In the case of the Gaussian distribution, the spectral lines are spread more widely than with the proposed method; in particular, at the 10th-percentile line, the shape of the spectrum differs significantly from the original data. With the proposed method, in contrast, the spectrum closely follows the shape of the original data even at the 10th-percentile line. As shown in Figure 10c, the spectra of the known and unknown sources exhibit only slight differences. This suggests that augmenting data with an arbitrary PDF, such as a Gaussian distribution, can produce large deviations from the actual RTFs and thus degrade the model's performance, whereas our proposed method tends to generate data that closely resembles real-world data, thereby enhancing the model's performance.