In this section, we present a supervised learning method for inter-floor noise type/position classification using a CNN.
3.2. Network Architecture
Figure 4 shows the architecture of the inter-floor noise type/position classifier. The classifier maps an inter-floor noise to a type/position through (1) conversion of the inter-floor noise into a log-scaled Mel-spectrogram, (2) duplication across the input channels, (3) a CNN, (4) an adaptation layer, and (5) a softmax classifier.
In [25], state-of-the-art CNNs for image classification in ILSVRC are examined and applied to audio classification. The models employed in that paper are AlexNet [26], VGGNet [27], and ResNet [28]. AlexNet was the first CNN-based winner of ILSVRC. VGGNet achieved better image classification performance on the ImageNet dataset than AlexNet in ILSVRC 2014. ResNet won ILSVRC 2015 and outperformed the other two models. VGGNet and ResNet variants are distinguished by the number of layers (depth) of their networks; VGG16 and ResNet V1 50, which are considered the basic models of each family, were used in this study.
The inter-floor noises are converted to log-scaled Mel-spectrograms using LibROSA [29] to represent audio samples in 2 dimensions with size of $H \times W$, where $H$ and $W$ denote the height and the width, respectively. In the literature [20,21,25,30,31,32], the spectrogram, the log-scaled Mel-spectrogram, and Mel-frequency cepstral coefficients are used as input features in ASC. They represent a signal in the TF domain and are acceptable CNN inputs. In our previous study [13], the log-scaled Mel-spectrogram provided the best performance on the inter-floor noise dataset; thus, it was used in this study.
A log-scaled Mel-spectrogram $X$ is obtained through the following steps. First, a signal $s$ of sample length $l$ is extracted from an audio clip in the inter-floor noise dataset; $l$ is set to 132,300 samples (3 s in length). Second, $s$ is converted to the magnitude of its short-time Fourier transform, $|\mathrm{STFT}(s)|$, using a 2048-point fast Fourier transform with window size of $\lceil l/W \rceil$ and the same hop size, where $\lceil \cdot \rceil$ rounds to the next largest integer. Third, a Mel-spectrogram $M$ is obtained by applying a Mel-filter bank $F_{\mathrm{Mel}}$ to $|\mathrm{STFT}(s)|$; this converts the frequency scale of $|\mathrm{STFT}(s)|$ to the Mel scale. Finally, $M$ is converted to a log-scaled Mel-spectrogram $X = 10\log_{10}(M/m)$, where $m$ is the largest element in $M$.
The size $H \times W$ is given by the input size of the CNN; for example, VGG16 has an input size of $224 \times 224$. Since the three CNNs (AlexNet, VGG16, and ResNet V1 50) are designed for image classification, they have 3 input channels (BGR). To take advantage of the knowledge learned from a large dataset, the weights between the three input channels and the following layer are preserved instead of modifying the input channels to a single channel. Consequently, $X$ is supplied to all three channels. The potential of this method on the inter-floor noise dataset was shown in [13] via a performance comparison between a CNN with one input channel and VGG16 without knowledge transfer.
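As a minimal numpy sketch (array size assumed), supplying the single-channel spectrogram to all three input channels amounts to stacking it:

```python
import numpy as np

# X stands in for a log-scaled Mel-spectrogram (size assumed as 224 x 224)
X = np.arange(224 * 224, dtype=np.float32).reshape(224, 224)

# The pretrained CNNs expect 3 input channels (BGR), so the same
# spectrogram is supplied to every channel rather than rewiring the
# first convolution layer to accept a single channel.
X3 = np.stack([X, X, X], axis=-1)
print(X3.shape)  # (224, 224, 3)
```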
The three CNNs contain millions of learnable weights. Given the small size of the inter-floor noise dataset, it is difficult to train models with this many weights. In such conditions, a CNN can be trained using transfer learning, as suggested in [33]. Transfer learning initializes a CNN with weights that are pre-trained on a large dataset (the source) and fine-tunes them on a target dataset. The method assumes that the internal layers of a CNN can extract mid-level descriptors from the source that explain its distribution; the distribution of a target dataset can then be learned by sharing these mid-level descriptors [33,34]. In this study, a large image dataset is used as the source for training a CNN via transfer learning. Usually, image and sound representations are considered different. However, low-level notions of images [34] such as edges and shapes can also be found in $X$, and changes of lighting can be comparable to acoustic pressure changes in $X$. Hence, $X$ can be considered an image of size $H \times W$, and the descriptors of the source can be shared to learn the distribution of the target. This approach to ASC can be found in [35,36].
The weights of the three CNNs are initialized using weights pre-trained on ImageNet (the source). The pre-trained weights used in this study are from [37] for AlexNet and VGG16, and from [38] for ResNet V1 50. Because the three CNNs are designed for ILSVRC, each of their outputs has size $I = 1000$. The adaptation layer reduces the number of output dimensions of the lower layer to $C$, where $C$ is the number of inter-floor noise types/positions. The weights between the output of the CNNs and the adaptation layer are randomly drawn from a normal distribution with a fixed standard deviation of 0.01; this initialization is used in [26,27,39]. The bias $b$ is initialized to 1, as in [26].
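A hedged Keras sketch of this setup (Keras stands in for the TF-Slim models actually used in the paper; `weights=None` keeps the snippet runnable offline, whereas the study loads ImageNet-pretrained weights; `C = 9` matches the position categories):

```python
import tensorflow as tf

C = 9  # number of inter-floor noise categories (positions, in Section 3.5)

# VGG16 with its original 1000-way ImageNet head (I = 1000).
# The study initializes this from ImageNet; weights=None is a stand-in.
base = tf.keras.applications.VGG16(weights=None, include_top=True)

# Adaptation layer: I = 1000 -> C, weights ~ N(0, 0.01^2), bias = 1
adapt = tf.keras.layers.Dense(
    C,
    kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.01),
    bias_initializer=tf.keras.initializers.Constant(1.0),
    name="adaptation",
)(base.output)
model = tf.keras.Model(base.input, adapt)
print(model.output_shape)  # (None, 9)
```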
In a classification problem, the output values are normalized using a softmax function (i.e., a softmax classifier) to convert the output elements to pseudo-probabilities. For a given output vector $z = (z_1, \ldots, z_C)$, the softmax function is defined as
$$\mathrm{softmax}(z)_c = \frac{e^{z_c}}{\sum_{j=1}^{C} e^{z_j}}.$$
Let $z_c = \sum_{i=1}^{I} w_{ic}\,x_i + b_c$ be the $c$-th output node of the adaptation layer, where $w_{ic}$ is the weight between the $i$-th node of the former layer and the $c$-th node of the adaptation layer. The predicted probabilities of given inter-floor noises with $C$ type/position categories are $\hat{y} = \mathrm{softmax}(z)$. The loss function $L$ selected for optimization of the three CNNs is the cross-entropy loss with $L_2$-regularization,
$$L = -\sum_{c=1}^{C} y_c \log \hat{y}_c + \lambda \sum_{i,c} w_{ic}^{2},$$
where $\lambda$ is the regularization strength and $y$ is the one-hot encoded true label.
3.3. Evaluation
The performance of the three CNNs was evaluated using 5-fold cross validation, which divides the inter-floor noise dataset into 5 subsets of equal size. A model is optimized on a zero-centered training set composed of 4 of the subsets. The optimized model is validated on the remaining subset, which is zero-centered on the mean value of the training set. These steps are repeated so that every subset serves once as the validation set.
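The fold splitting and zero-centering can be sketched as follows (a minimal sketch on a toy stand-in for the dataset; shapes and labels are assumed):

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy stand-in for the inter-floor noise dataset (shapes assumed)
X = np.random.randn(100, 32, 32)
y = np.random.randint(0, 9, size=100)

for train_idx, val_idx in KFold(n_splits=5, shuffle=True,
                                random_state=0).split(X):
    mu = X[train_idx].mean()        # mean of the 4 training subsets only
    X_train = X[train_idx] - mu     # zero-centered training set
    X_val = X[val_idx] - mu         # validation set centered on the SAME mean
    # ... optimize the CNN on (X_train, y[train_idx]) and
    #     validate on (X_val, y[val_idx])
    print(len(train_idx), len(val_idx))
```

Centering the validation fold on the training mean, rather than its own, avoids leaking validation statistics into preprocessing.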
The performance of a CNN model on a dataset can be measured by finding the minimum of $L$ at the optimal hyperparameter pair $(\lambda^{*}, \eta^{*})$ [40]; a model with a smaller minimum loss provides better performance. The pair is composed of the optimal regularization strength $\lambda^{*}$ and the optimal learning rate $\eta^{*}$. The sizes of the hidden layers, the number of hidden units, and the activation functions are not considered because they are already determined by the choice of CNN.
$(\lambda^{*}, \eta^{*})$ was estimated via random search, as introduced in [40], with 30 epochs. Candidate pairs $(\lambda, \eta)$ are generated log-uniformly over a fixed search range. The weights of the CNN are optimized via mini-batch gradient descent (GD) to minimize $L$; the mini-batch size was set to 39. The CNNs and optimization were implemented with TensorFlow [38] and are available at [41].
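A minimal sketch of this random search (the exponent bounds and the surrogate loss are placeholders, not the paper's actual search range or training runs):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_log_uniform(lo_exp, hi_exp, n, rng):
    """Draw n values log-uniformly between 10**lo_exp and 10**hi_exp.
    The exponent bounds here are assumed placeholders."""
    return 10.0 ** rng.uniform(lo_exp, hi_exp, size=n)

# 30 candidate (regularization strength, learning rate) pairs
pairs = list(zip(sample_log_uniform(-6, -1, 30, rng),
                 sample_log_uniform(-6, -1, 30, rng)))

def short_run_loss(lam, eta):
    # Stand-in for a short training/validation run returning L;
    # in the paper this is an actual 30-epoch optimization of the CNN.
    return (np.log10(lam) + 4) ** 2 + (np.log10(eta) + 3) ** 2

lam_star, eta_star = min(pairs, key=lambda p: short_run_loss(*p))
```

Sampling in log space gives equal coverage to each decade, which matters because useful learning rates and regularization strengths span several orders of magnitude.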
In the remaining subsections, inter-floor noises are classified into type/position using the three CNNs, and their performance is measured and compared. Type and position are considered separately in Section 3.4 and Section 3.5, respectively.
3.5. Position Classification Results
The inter-floor noises in the dataset were labeled with the following position categories: 1F0m, 1F6m, 1F12m, 2F0m, 2F6m, 2F12m, 3F0m, 3F6m, and 3F12m. The weights of the three CNNs with the estimated optimal hyperparameters were optimized to minimize $L$ using GD for 50 epochs, which is sufficient to minimize $L$.
The accuracy of position classification with the three CNNs was evaluated using 5-fold cross validation; the accuracies are arranged in Table 3. The first column of the table shows the names of the three CNNs, and the first row shows the 9 categories. In position classification, VGG16 outperforms the other models, except for categories 2F and 3F0m. All models show comparatively poor performance for positions 1F0m and 1F6m; the models appear to confuse these two positions. This confusion in the position classification can be seen in Figure 5.
If confusions between positions on the same floor are ignored, the classification accuracy of the three CNNs increases. Table 4 shows the floor classification accuracy of the three CNNs; VGG16 achieves an accuracy of 99.5% for floor classification.
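This relabeling amounts to collapsing each position label to its floor prefix (a trivial sketch; the label strings are the dataset's position categories):

```python
POSITIONS = ["1F0m", "1F6m", "1F12m", "2F0m", "2F6m", "2F12m",
             "3F0m", "3F6m", "3F12m"]

def to_floor(label):
    # "1F6m" -> "1F": same-floor confusions collapse into one class
    return label[:2]

floors = sorted(set(to_floor(p) for p in POSITIONS))
print(floors)  # ['1F', '2F', '3F']
```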
In summary, our approaches to type/position classification based on the three adapted CNNs with knowledge transfer were compared. The CNNs demonstrated the feasibility of classifying the type and position of inter-floor noises in the building, and VGG16 showed the best performance on both tasks.