2.1. Environmental Sound Recognition
This study addresses environmental sound recognition (ESR), which is also known as environmental sound classification. The goal is not to detect the event trigger time precisely, but rather to understand what the acoustic scene is. In past years, numerous methods, such as the Gaussian mixture model (GMM) [9], hidden Markov model (HMM) [10,11], random forest [12], and support vector machine [13], have been used to solve the ESR problem. However, none of these methods can reach human-level performance. Since 2012, neural networks have shown great potential in computer vision [14], and researchers have increasingly begun to apply neural networks to the ESR field.
For a neural network, it is important to choose a suitable feature as the input. In 2014, Piczak [15] proposed a usable network structure using the log-mel spectrum and its delta as the input features, which was once considered state-of-the-art in the ESR field. The log-mel spectrum has remained a popular feature in the ESR field in recent years. In the Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) [16,17], most researchers still choose the log-mel spectrum as one of the network inputs for acoustic scene classification tasks.
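As context, the sketch below shows one typical way such log-mel features (with their deltas) can be computed using the librosa library. The file name, sample rate, FFT size, and number of mel bands are arbitrary illustrative choices, not values taken from the cited works.

```python
import librosa
import numpy as np

# Load a clip (path is a placeholder) and compute a log-mel spectrogram.
y, sr = librosa.load("example.wav", sr=44100)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=512, n_mels=60)
log_mel = librosa.power_to_db(mel)        # log-scaled mel energies
delta = librosa.feature.delta(log_mel)    # first-order differences ("delta")
features = np.stack([log_mel, delta])     # two-channel input (log-mel + delta)
print(features.shape)                     # (2, 60, number_of_frames)
```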
In 2015, Sainath et al. [6] used the raw audio waveform as the network input to train for speech recognition and obtained promising results. Raw signals therefore appear to be a viable input choice in the ESR field as well.
In 2016, Aytar et al. [18] proposed SoundNet, which is trained using both images and raw audio. The image branch assists training, but the scene is still recognized from the raw audio signal. The results of the network were impressive; although performance drops considerably, the network can still be trained using the raw audio signal only. In the same year, Dai et al. [19] proposed an 18-layer network that also works on raw audio signals, and its larger number of filters and deeper structure provided much better results with raw audio input. Comparing these two works [18,19], it is clear that the network architecture has a huge effect when raw signals are used as the input, so the depth and the number of filters are worth further discussion. In addition, both works use the global pooling strategy [20] to integrate the network output information, which is highly effective for dimension reduction. Global pooling has further benefits for structure integration, which are explained in our method development.
In 2017, Tokozume and Harada [21] proposed EnvNet, which transforms a raw 1D signal into a 2D spectrogram-like representation inside the network. This is an interesting idea, because techniques developed for training on spectrograms might also be adapted to this kind of representation. In the same year, Tokozume et al. [22] proposed an augmentation method that can be applied to the same kind of network, and the results could even reach human-level performance.
These related works reveal that the input features greatly influence the performance of a network. Although many features have been tried, a proper way to combine individual acoustic features is still lacking. Moreover, network architectures that take raw signals as the input also require further discussion. Therefore, based on the existing research, this study focuses on improving these aspects.
2.3. Feed-Forward Neural Network
The simplest feed-forward neural network is the single-layer perceptron, which can be built to perform a regression. Assume we would like to map an input $x$ to an output $y$; the two variables can be written as two vectors $\mathbf{x}$ and $\mathbf{y}$, so we can simply try the formula below:

$\mathbf{y} = W\mathbf{x} + \mathbf{b}$ (1)
In (1), $W$ is a weight matrix and $\mathbf{b}$ is a bias vector; therefore, the main task in solving the equation is to find suitable $W$ and $\mathbf{b}$. If we already have a sample $(\mathbf{x}_i, \mathbf{y}_i)$, we obviously want the output for input $\mathbf{x}_i$ to be as close to $\mathbf{y}_i$ as possible. Several methods can be used to retrieve the correct values of $W$ and $\mathbf{b}$, such as stochastic gradient descent (SGD) or Newton's method. No matter which method is applied, the equation will only give a good result when $\mathbf{x}$ and $\mathbf{y}$ are linearly related across all the samples. Inspired by the animal neuron system, an activation function $f$ was added to improve Equation (1), and the new equations are listed as (2) and (3):

$\mathbf{z} = W\mathbf{x} + \mathbf{b}$ (2)

$\mathbf{y} = f(\mathbf{z})$ (3)
The activation function provides a non-linear transform to filter out the weaker signals. For example, the classic sigmoid activation function is:

$f(x) = \dfrac{1}{1 + e^{-x}}$ (4)
After passing through Equation (4), every output value is normalized to a range between 0 and 1, providing a useful non-linear transform. Equation (3) thus gains the ability to perform regression on non-linear relationships.
From Equation (3), it can be clearly seen that each element of $\mathbf{y}$ is actually composed of every element of $\mathbf{x}$ with different weights. Here, $\mathbf{x}$ forms the input layer, and each node that creates an element of $\mathbf{y}$ is called a neuron in the network.
To enhance the network structure, a hidden layer can be added to improve performance. The number of neurons in the hidden layer needs to be decided by the user; it is usually set to a value larger than both the input and output dimensions. The hidden layer projects the input vector into another space, giving a greater chance of finding a linear transform from that higher dimension to the output layer. Stacking several non-linear transforms also gives the network greater power to fit complex regressions.
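As an illustration of the structure described above, the following is a minimal sketch of a single-hidden-layer feed-forward pass in NumPy; the layer sizes and the use of sigmoid activations are arbitrary choices for the example, not the network used in this study.

```python
import numpy as np

def sigmoid(z):
    # Equation (4): squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example dimensions: 20-d input, 64 hidden neurons, 10 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((64, 20)) * 0.1, np.zeros(64)
W2, b2 = rng.standard_normal((10, 64)) * 0.1, np.zeros(10)

def forward(x):
    # Equations (2) and (3) applied twice: linear transform, then activation
    h = sigmoid(W1 @ x + b1)   # hidden layer projects the input to 64-d
    y = W2 @ h + b2            # output layer maps back to the target dimension
    return y

y = forward(rng.standard_normal(20))
print(y.shape)  # (10,)
```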
To obtain the correct weights of a feed-forward neural network, the backpropagation (BP) method [24] is widely used. The gradient of the loss function is calculated and then propagated backward to each of the weights.
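As a toy illustration of this idea (our own example, not the training setup of this study), the sketch below fits the single-layer model of Equation (1) with a mean-square loss, computing the gradients analytically and applying plain gradient-descent updates.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))          # 100 samples, 3 features
true_W, true_b = np.array([[2.0, -1.0, 0.5]]), np.array([0.3])
Y = X @ true_W.T + true_b                  # targets generated by a linear rule

W, b, lr = np.zeros((1, 3)), np.zeros(1), 0.1
for _ in range(200):
    pred = X @ W.T + b                     # forward pass, Equation (1)
    err = pred - Y                         # derivative of 0.5 * squared error w.r.t. pred
    grad_W = err.T @ X / len(X)            # gradient propagated back to W
    grad_b = err.mean(axis=0)              # gradient propagated back to b
    W -= lr * grad_W                       # gradient-descent update
    b -= lr * grad_b

print(np.round(W, 2), np.round(b, 2))      # approaches the true parameters
```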
It seems that a network is best designed deeper (more layers) or wider (more neurons per layer), but both approaches bring issues that need to be dealt with. The number of weights grows rapidly with the width of the network, which greatly increases the computation time and also exposes the network to serious overfitting: the network easily fits the training data but still performs poorly at test time. Deeper networks instead have to deal with the vanishing gradient problem. When performing BP, the gradient travels from the end of the network and gets thinner and thinner as it approaches the front layers, and it can even vanish entirely. A number of methods have been proposed to alleviate the vanishing gradient problem so that deeper networks can be built.
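To make the vanishing gradient issue concrete, the following purely illustrative sketch multiplies the maximum sigmoid derivative, 0.25, across an increasing number of layers, showing how quickly the backpropagated signal shrinks.

```python
# Illustrative only: the sigmoid derivative is at most 0.25, so a gradient
# passing through many sigmoid layers is attenuated at least this fast.
for depth in (5, 10, 20, 50):
    print(depth, 0.25 ** depth)
# 5  0.0009765625
# 10 9.5367431640625e-07
# 20 9.094947017729282e-13
# 50 7.888609052210118e-31
```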
2.5. Convolutional Layers
Each convolutional layer is composed of several graphic filters called kernels, which work just as filters do in classical image processing. Through convolution, each kernel enhances part of the image's characteristics and turns the image into an individual feature map. The feature maps are all the same size and are bundled together to become a brand-new image; the convolutional layer thus determines what the new image will look like. Each map in the same image is called a channel, and the number of channels becomes the depth of the image. When working through a convolutional layer, a kernel actually processes all of the input channels at once. Another important aspect of the convolutional layer is parameter sharing. Looking back at how an MLP processes an image, each pixel of the same image is assigned its own separate weights. In a convolutional layer, however, the whole image shares the same kernel to create a feature map, which gives the CNN an important shift-invariant characteristic. Because the kernel moves over the entire image, features correlated with the kernel can be detected anywhere, which gives CNNs their superior performance in image recognition.
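The sketch below (illustrative, not this paper's architecture) shows parameter sharing directly: a single 3×3 kernel is slid over every position of a small grayscale image to produce one feature map, so the same nine weights are reused everywhere.

```python
import numpy as np

def conv2d_single(image, kernel):
    # Valid convolution (correlation) of one kernel over one channel:
    # the same kernel weights are shared at every spatial position.
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(8, 8)                       # toy 8x8 grayscale image
edge_kernel = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]], dtype=float)  # responds to vertical edges
feature_map = conv2d_single(image, edge_kernel)
print(feature_map.shape)  # (6, 6): one feature map per kernel
```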
2.6. Activation Layers
As mentioned in the previous section, the main purpose of the activation layer is to provide a non-linear transform, and these functions need to be differentiable. There are several types of activation functions, including sigmoid (4), tanh (5), and the rectified linear unit (ReLU) (6):

$f(x) = \tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ (5)

$f(x) = \max(0, x)$ (6)
Unfortunately, these activation functions all have some flaws. When using gradient descent methods, ReLU can suffer from exploding gradients, because ReLU does not bound the range of its output. Another problem caused by ReLU is the dead ReLU problem: when a ReLU unit receives a negative input, it outputs 0, which prevents the weights feeding that unit from being updated at that step, or, even worse, from ever being updated again until training is finished. On the other hand, sigmoid and tanh are affected by the vanishing gradient problem, because the gradient backpropagated through these functions is strongly attenuated (for the sigmoid, at most 1/4 of it remains per layer). Comparing these two groups of activation functions, the problems of ReLU can be mitigated by adding a normalization layer, and ReLU is also faster to compute. For these reasons, ReLU is now the most commonly used activation function.
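For reference, here is a small sketch of the three activation functions of Equations (4)-(6) and their derivatives, which are what backpropagation multiplies into the gradient; the derivative bounds mentioned above can be checked numerically this way.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))            # Equation (4), output in (0, 1)

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)                       # peaks at 0.25 when x = 0

def tanh(x):
    return np.tanh(x)                          # Equation (5), output in (-1, 1)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2               # peaks at 1.0 when x = 0

def relu(x):
    return np.maximum(0.0, x)                  # Equation (6), unbounded above

def d_relu(x):
    return (x > 0).astype(float)               # 0 for negative inputs: "dead" units

x = np.linspace(-3, 3, 7)
print(d_sigmoid(x).max(), d_tanh(x).max(), d_relu(x).max())  # ~0.25, ~1.0, 1.0
```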
2.7. Pooling Layers
Even though parameter sharing greatly reduces the number of parameters in a CNN, for a large picture it is still necessary to find a reasonable way to perform subsampling. Pooling layers are used for this job.
For a continuous signal (like an image), it is intuitive to perform downsampling by grouping a fixed number of adjacent values and then picking one output value from each group. The pick can be based on the average, maximum, or minimum of the group. Among these methods, maximum pooling has shown the best results and is the most commonly used.
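A minimal 2×2 max-pooling sketch (again illustrative): each non-overlapping 2×2 group of adjacent values is replaced by its maximum, halving both spatial dimensions.

```python
import numpy as np

def max_pool_2x2(feature_map):
    # Assumes height and width are even; each output value is the maximum
    # of one non-overlapping 2x2 neighbourhood of adjacent inputs.
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fm))
# [[ 5.  7.]
#  [13. 15.]]
```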
However, care must be taken, as not all feature maps can use pooling as the downsampling method. According to the previous description, the values in the same group need to be adjacent, which means they must actually have a spatial relationship, and each group also needs to carry the same spatial meaning. Therefore, pooling layers might not be suitable in some CNN use cases, such as game maps [26].
2.9. Loss Function
A neural network can be used for classification and regression, each of which needs a different loss function, and these functions all need to be differentiable:

$L_{MSE} = \dfrac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2$ (7)
Equation (7) is the mean square error (MSE) loss function, which is often used in regression tasks; it directly measures the difference between the output values $\hat{y}_i$ and the target values $y_i$. Another loss function, often used in classification, is cross entropy, which usually works with the softmax logistic function. In Equation (8), $\mathbf{z}$ is the output vector coming from the FC layer and $J$ is the number of final classes. Softmax converts this vector into a probability distribution over the classification result: after passing through the function, the elements of the output vector $\hat{\mathbf{y}}$ sum to 1, and $\hat{y}_j$ represents the probability of the input being classified as class $j$:

$\hat{y}_j = \mathrm{softmax}(\mathbf{z})_j = \dfrac{e^{z_j}}{\sum_{k=1}^{J} e^{z_k}}$ (8)
The purpose of cross entropy is to estimate the difference between two vectors by calculating the log-likelihood, and the result is shown as Equation (9):

$L_{CE} = -\sum_{k=1}^{J} y_k \log \hat{y}_k$ (9)

In most classification cases, the target is a one-hot vector, in which the target class $j$ has a value of one and every other element is zero, so only the $j$-th term remains. Therefore, the loss function can be simplified to (10):

$L_{CE} = -\log \hat{y}_j$ (10)
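A small numerical sketch of Equations (8)-(10) (the variable names are ours): softmax turns FC-layer scores into probabilities, and with a one-hot target the cross entropy reduces to the negative log probability of the target class.

```python
import numpy as np

def softmax(z):
    # Equation (8); subtracting the max is a standard trick for numerical stability
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(y_onehot, y_hat):
    # Equation (9); with a one-hot target this equals -log(y_hat[target]),
    # i.e. Equation (10)
    return -np.sum(y_onehot * np.log(y_hat))

z = np.array([2.0, 1.0, 0.1])          # scores from the FC layer, J = 3 classes
y_hat = softmax(z)
target = np.array([1.0, 0.0, 0.0])     # one-hot target, class j = 0
print(y_hat.sum())                     # sums to 1
print(cross_entropy(target, y_hat))    # same value as -np.log(y_hat[0])
```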
2.10. Model Initialization
In a network, there are numerous parameters whose starting values need to be decided, so it is normal to consider a principled way to perform the initialization. An ideally initialized network would have the following property: if we feed a series of random inputs into the network, the output should be fairly distributed across the classes, with no particular trend at the beginning. Obviously, initializing the parameters with arbitrary random values will not have this effect. Glorot and Bengio proposed normalized initialization [27] to keep the variance consistent from a layer's input to its output:

$W \sim U\left[-\dfrac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}},\ \dfrac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}\right]$ (11)

Here, $n_j$ in Equation (11) is the number of inputs of layer $j$. Equation (11) performs well for linear layers, but for nonlinear layers like ReLU, the equation needs to be adjusted.
He et al. proposed another method [28] to fix the formula, in which $n_{j+1}$ in (11) can simply be ignored. Our experiments use He's method to initialize the network.
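The sketch below is our own illustration of initialization in the spirit of He et al. [28] for ReLU layers: weights are drawn from a zero-mean normal distribution whose variance depends only on the number of inputs to the layer.

```python
import numpy as np

def he_init(n_in, n_out, rng):
    # He-style initialization for a ReLU layer: zero-mean normal with
    # variance 2 / n_in, keeping activation variance roughly constant
    # from layer to layer.
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_out, n_in))

rng = np.random.default_rng(0)
W = he_init(n_in=256, n_out=128, rng=rng)
print(W.std(), np.sqrt(2.0 / 256))  # empirical std is close to the target
```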
2.11. Batch Normalization
In the previous section, it was mentioned that ReLU needs a supporting mechanism to keep its input values in a reasonable range; the distribution of those values needs to be controlled. Ioffe and Szegedy proposed a method called batch normalization [29]. The main concept of this method is to force the addition of a linear transform before the nonlinear layer so that the mean and variance of the nonlinear layer's input stay in a certain range:

$\mu_B = \dfrac{1}{m}\sum_{i=1}^{m} x_i$ (12)

$\sigma_B^2 = \dfrac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2$ (13)

$\hat{x}_i = \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$ (14)

In Equations (12) and (13), $m$ is the total number of elements over the mini-batch and channels; the purpose is to find the current mean $\mu_B$ and current variance $\sigma_B^2$, and Equation (14) then adjusts the input so that its mean and variance become 0 and 1.
Other learnable transform parameters, $\gamma$ and $\beta$, can be added into the formula, and the final result is Equation (15):

$y_i = \gamma \hat{x}_i + \beta$ (15)

These two variables allow the normalized values to be adjusted slightly, which helps to alleviate the dead ReLU problem. Batch normalization is essential in a deep network.
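A minimal sketch of the batch normalization forward pass of Equations (12)-(15) for one feature channel (training-time statistics only; the running averages used at inference are omitted):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x holds all values of one channel across the mini-batch (m elements)
    mu = x.mean()                          # Equation (12): batch mean
    var = x.var()                          # Equation (13): batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # Equation (14): zero mean, unit variance
    return gamma * x_hat + beta            # Equation (15): learnable scale and shift

x = np.random.randn(32) * 4.0 + 10.0       # toy channel with mean 10, std 4
y = batch_norm_forward(x, gamma=1.0, beta=0.0)
print(round(y.mean(), 3), round(y.std(), 3))  # ~0.0 and ~1.0
```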