This section presents the basic architectures of the dense and convolutional neural networks, as well as the conducted experiments and the obtained results.
For the learning process of the used neural networks, categorical cross-entropy was chosen as the loss function, and the Adam algorithm was used as the optimization method. The learning rate was set to 0.0001 (instead of the default value of 0.001), and the remaining parameters kept their default values (β₁ = 0.9, β₂ = 0.999, and the numerical-stability constant ε). The network was trained for 50 epochs with a fixed batch size of 32.
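For concreteness, this training setup could be expressed roughly as follows. This is a minimal sketch assuming the TensorFlow/Keras API, which the text does not explicitly name; only the loss, optimizer, learning rate, epoch count, and batch size come from the description above.

```python
# Minimal sketch of the training configuration described in the text,
# assuming the TensorFlow/Keras API (the paper does not name the framework).
import tensorflow as tf

def compile_and_train(model, x_train, y_train, x_val, y_val):
    # Learning rate lowered from the default 0.001 to 0.0001;
    # beta_1, beta_2, and epsilon are left at their defaults.
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)
    model.compile(optimizer=optimizer,
                  loss="categorical_crossentropy",  # one-hot instrument labels
                  metrics=["accuracy"])
    # 50 epochs with a fixed batch size of 32, as in the experiments.
    return model.fit(x_train, y_train,
                     validation_data=(x_val, y_val),
                     epochs=50, batch_size=32)
```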
5.1. Dense Neural Networks
The first type of model constructed during the research was a dense neural network, a simplified diagram of which is shown in Figure 1. The MFCC input data, initially a 128 × 13 matrix, needs to be flattened into one dimension, which gives a vector of 1664 values (128 × 13). This input vector is processed by four consecutive dense layers, whose neuron weights are set during the learning process. The last layer uses the softmax activation function, which causes the values of all outputs to sum to one. In practice, each output neuron represents one of the eight instruments, and its value indicates the model's estimated probability that the given instrument appears in the recording. The final decision takes the form of the index of the neuron with the highest value, i.e., the highest probability.
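A possible Keras realization of this architecture is sketched below. The hidden-layer sizes are illustrative assumptions; only the 128 × 13 input, the four dense layers, and the eight-neuron softmax output are specified above.

```python
# A possible Keras sketch of the ANN1 architecture from Figure 1.
# The hidden-layer sizes are illustrative assumptions.
from tensorflow.keras import layers, models

def build_ann1(hidden_units=(512, 256, 64)):
    return models.Sequential([
        layers.Flatten(input_shape=(128, 13)),          # 128 x 13 -> 1664 values
        layers.Dense(hidden_units[0], activation="relu"),
        layers.Dense(hidden_units[1], activation="relu"),
        layers.Dense(hidden_units[2], activation="relu"),
        layers.Dense(8, activation="softmax"),          # one output per instrument
    ])
```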
This network architecture is marked as Model ANN1. The trained model correctly assigned 63% of the samples in the test set. This result is relatively satisfactory, but during training, the model very quickly achieved high accuracy and low error on the training data while failing to achieve similar results on the validation data. Such behavior is a typical symptom of overfitting.
One way to counteract overfitting is the so-called drop-out [41]. Therefore, drop-out layers were added in the second experiment, and the drop-out rate was selected through testing. Values that were too small did not prevent overfitting, and values that were too high prevented the network from learning any patterns. Whether overfitting had been eliminated was assessed by analyzing the changes in accuracy and in the value of the loss function during training and validation. The results are described in Table 2.
The optimal value turned out to be a drop-out rate of 0.1, for which the accuracy of the network on the test data was 65% and the F1-score was 63%. This network architecture is marked as Model ANN2 in Table 3.
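The change from ANN1 to ANN2 amounts to inserting drop-out layers. A sketch under the same assumptions as above; placing a drop-out layer after each hidden layer is an assumption, as the text does not state their positions.

```python
# ANN2: the same dense network with drop-out layers inserted.
# The rate of 0.1 is the optimal value reported in Table 2;
# the layer placement and hidden sizes are assumptions.
from tensorflow.keras import layers, models

def build_ann2(rate=0.1):
    return models.Sequential([
        layers.Flatten(input_shape=(128, 13)),
        layers.Dense(512, activation="relu"),
        layers.Dropout(rate),   # randomly zeroes 10% of activations in training
        layers.Dense(256, activation="relu"),
        layers.Dropout(rate),
        layers.Dense(64, activation="relu"),
        layers.Dropout(rate),
        layers.Dense(8, activation="softmax"),
    ])
```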
At this point, it is necessary to consider how the overall accuracy is distributed among the individual classes, i.e., instruments. The confusion matrix is included in Figure 2. Analyzing this matrix, we can see that clarinet samples are recognized incorrectly, mainly as piano and violin; electric guitar samples are recognized correctly in 90% of cases, and female voice samples in 84%; flute samples are recognized correctly in only 34% of cases, with 22% and 39% of samples misclassified, mainly as piano and violin; piano samples are recognized correctly in 99% of cases; tenor saxophone samples are misrecognized mainly as electric guitar but also as piano and violin; trumpet samples are misrecognized mainly as violin; and violin samples are recognized correctly in 88% of cases. In summary, the model rarely indicates clarinet, saxophone, and trumpet. The highest effectiveness was recorded for singing, electric guitar, piano, and violin. The model shows good sensitivity for these classes, but good precision can be noted only for female singing. The cause of this behavior was attributed to the so-called dying ReLU.
Dying ReLU appears when the sum of inputs for a large number of neurons is negative. Given its characteristics, the ReLU turns such neurons off by zeroing their outputs, which results in the loss of valuable information encoded in the negative values. The MFCC matrices obtained from the analyzed instruments may indeed contain negative values, so the dying ReLU problem may indeed affect the built model. One solution is to minimize or completely exclude negative values. This can be achieved, among other ways, by normalizing the data to the range 0–1 using the min–max method. Such normalization was performed separately for the MFCC coefficients of all rows, with the statistics computed from the training set only.
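A sketch of this normalization step is given below. The exact axes over which the row-wise minima and maxima are taken are an interpretation of the description, and the helper names are illustrative; the key point is that the statistics are fitted on the training set and then reused for the validation and test sets.

```python
# Row-wise min-max normalization of MFCC matrices, fitted on the training
# set only. The axis choice is an interpretation of "separately for the
# MFCC coefficients of all rows".
import numpy as np

def fit_minmax(train):
    # train: (n_samples, 128, 13); per-row minima/maxima across the set
    row_min = train.min(axis=(0, 2), keepdims=True)   # shape (1, 128, 1)
    row_max = train.max(axis=(0, 2), keepdims=True)
    return row_min, row_max

def apply_minmax(x, row_min, row_max, eps=1e-9):
    # Maps values into approximately [0, 1]; eps guards against flat rows.
    return (x - row_min) / (row_max - row_min + eps)
```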
The previously described Model ANN2 was retrained, this time on the normalized set, and is marked as Model ANN3 in Table 3. The accuracy obtained was 65%, the same as in the previous case, but this time it was spread over more instruments, indicating better model precision. The network maintained its tendency to recognize guitar, vocals, piano, and violin well and also improved its performance for saxophone, clarinet, and especially trumpet.
Another technique that is effective in solving the dying ReLU problem is the LeakyReLU [25] activation function. The test was performed again on a network with this activation function, marked as Model ANN4 in Table 3. The accuracy obtained was 68%, and the confusion matrix indicates improved performance, particularly for the violin.
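In Keras terms, the switch amounts to replacing each built-in relu activation with a LeakyReLU layer, roughly as sketched below; the negative-slope parameter is left at the library default, since the text does not report it, and the hidden sizes remain illustrative.

```python
# ANN4: dense layers with LeakyReLU instead of ReLU.
# LeakyReLU keeps a small gradient for negative inputs, so neurons
# with negative pre-activations are not switched off entirely.
from tensorflow.keras import layers, models

def build_ann4(rate=0.1):
    model = models.Sequential([layers.Flatten(input_shape=(128, 13))])
    for units in (512, 256, 64):        # illustrative hidden sizes
        model.add(layers.Dense(units))  # no built-in activation here
        model.add(layers.LeakyReLU())   # default negative slope (assumption)
        model.add(layers.Dropout(rate))
    model.add(layers.Dense(8, activation="softmax"))
    return model
```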
Table 3 summarizes the quality of models ANN1 to ANN4.
5.2. Convolutional Neural Networks
In the next phase of the experiments, research was carried out on the effectiveness of convolutional networks. Although used mainly in image processing, convolutional networks also work well in the audio domain. The purpose of the tests was to check whether a CNN would also be effective in the analyzed problem.
The convolutional networks used for testing were based on the architecture presented in Figure 3. The 128 × 13 MFCC matrix is the input of the evaluated convolutional neural networks. The matrix is processed by the consecutive layers of three convolutional blocks, and the data are then flattened into one dimension so that they can be fed to the dense layers. The last layer contains 8 neurons, matching the number of recognized instruments, and, as in the previous model, the softmax activation function is used.
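A possible Keras sketch of this architecture is shown below. The filter counts, kernel sizes, and pooling windows are illustrative assumptions; only the 128 × 13 input, the three convolutional blocks, the flattening step, and the eight-neuron softmax output follow from the text. Setting dropout_rate to 0 gives a CNN1-like variant without drop-out layers.

```python
# A possible Keras sketch of the Figure 3 architecture: three convolutional
# blocks, flattening, and dense layers. Filter counts, kernel sizes, and
# pooling windows are illustrative assumptions.
from tensorflow.keras import layers, models

def build_cnn(dropout_rate=0.1):
    model = models.Sequential([layers.Input(shape=(128, 13, 1))])
    for filters in (32, 64, 128):                    # three conv blocks
        model.add(layers.Conv2D(filters, (3, 3), padding="same",
                                activation="relu"))
        model.add(layers.BatchNormalization())       # discussed below
        model.add(layers.MaxPooling2D((2, 1)))       # pool the frame axis only,
                                                     # the 13-wide axis is narrow
        model.add(layers.Dropout(dropout_rate))
    model.add(layers.Flatten())                      # hand over to dense layers
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dense(8, activation="softmax")) # one neuron per instrument
    return model
```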
The first convolutional network that was built is marked as Model CNN1. Compared to the architecture in Figure 3, Model CNN1 does not have drop-out layers.
It is worth paying attention to batch normalization. This technique speeds up the computation process and improves the quality of the network. Knowing that data normalization brought the expected results in dense networks, it was decided to check whether the same would hold for CNN networks using this technique. Unlike the normalization performed in the previous cases, batch normalization does not transform the raw input data but normalizes the signals passed between the network layers. The model was trained using parameters with values similar to those used for the dense networks. The network in this configuration achieved an accuracy of 66%. As with the dense networks, the initial configuration of the CNN was characterized by overfitting.
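To make the distinction concrete: during training, a batch-normalization layer standardizes each feature using the statistics of the current batch and then rescales it with learned parameters, roughly as in the conceptual NumPy illustration below (this is not the actual Keras implementation, which additionally tracks running statistics for inference).

```python
# Conceptual forward pass of batch normalization during training:
# standardize per feature over the batch, then rescale with the learned
# parameters gamma and beta.
import numpy as np

def batch_norm_forward(x, gamma=1.0, beta=0.0, eps=1e-5):
    mean = x.mean(axis=0)                  # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```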
The overfitting phenomenon was again partially eliminated by implementing drop-out (Figure 3). The model with the drop-out rate set to 0.1 is marked as Model CNN2. However, it should be noted that this technique proved to be more effective for dense networks. The effectiveness of the network increased to 68%, and although precision for some instruments increased slightly, there was a significant drop in sensitivity for the clarinet.
The next set of tests was performed for different values of the drop-out rate. Table 4 confirms a relationship analogous to that occurring in dense networks: too high a drop-out rate causes deterioration of the results, although this time the drop in effectiveness is not as drastic. Moreover, the optimal value of this parameter for CNN networks is slightly larger than for dense networks (0.2 and 0.1, respectively). The network with a drop-out rate of 0.2 is therefore marked as Model CNN3, for which the accuracy is 69%.
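The sweep behind Table 4 can be pictured as retraining the same network for each candidate rate, as in the fragment below; it reuses the hypothetical build_cnn and compile_and_train sketches above, assumes the data arrays are already prepared, and the candidate list is illustrative.

```python
# Illustrative drop-out sweep: rebuild and retrain the same CNN for each
# candidate rate, keeping the best validation accuracy per rate.
results = {}
for rate in (0.1, 0.2, 0.3, 0.4, 0.5):
    model = build_cnn(dropout_rate=rate)   # sketch defined earlier
    history = compile_and_train(model, x_train, y_train, x_val, y_val)
    results[rate] = max(history.history["val_accuracy"])
```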
Guided by the experience gained from the previous tests, it was decided to use normalization to reduce overfitting and improve the results. There are already normalizing elements in the network structure, but they operate on individual batches of data, not on the entire set. The test was to check whether normalizing the entire input set in addition to batch normalization would improve the results, or whether some information would be lost by scaling too frequently.
This network architecture, with a drop-out rate equal to 0.1, is marked as Model CNN4 in Table 5. The network again made 68% accurate predictions, overfitting was slightly reduced, and based on the confusion matrix, it can be concluded that sensitivity for the clarinet and saxophone increased (by about 20%).
A network analogous to CNN4 was also tested, but this time with a drop-out rate of 0.2. This network architecture is marked as Model CNN5 in Table 5. The network achieved 64% accuracy.
The next step was to check whether, as in the previous tests, the LeakyReLU activation function would improve the results. The network was built according to the diagram in Figure 3, with the ReLU activation function replaced by LeakyReLU, and the training process was run on the normalized input data. This network architecture is marked as Model CNN6 in Table 5. The obtained accuracy was 63%, and the confusion matrix indicates that the distribution of accuracy among the individual instruments did not change.
Table 5 summarizes the performance of the various CNN network configurations.
Analyzing Table 5, it can be concluded that the CNN4 model has the highest F1-score. This network achieved an accuracy of 68% and an F1-score of 65%. The confusion matrix associated with the results of this model is similar to those of all the other models tested so far. The networks achieve the best results for five classes: piano, singing, guitar, trumpet, and violin. The remaining three, i.e., clarinet, saxophone, and flute, are correctly assigned much less often; the accuracy for these instruments does not exceed 50%. The clarinet and flute are usually confused with the piano: the model labels these instruments as piano more often than it labels them correctly. The saxophone, however, is the least frequently correctly classified instrument for most of the tested models; the vast majority of saxophone recordings are labeled as electric guitar.
Similar results were achieved by the CNN3 model. It has the highest accuracy of all the models at 69%, but its F1-score is 3% lower than that of the CNN4 model.
Overall, the results achieved are satisfactory. Most well-configured models produce results in the range of 0.6–0.7 for both accuracy and F1-score. Taking into account only the group of five instruments mentioned above, the overall efficiency of the models would be much higher. Therefore, in order to significantly improve the current results, the classification of the clarinet, flute, and saxophone should be improved.