We used the following parameters to calculate the CQT representation: 84 bins, 12 bins per octave and a minimum frequency of 32.70 Hz. For Mel spectrograms, we used 128 filters. To train the individual models, we used the following parameters: 100 epochs, a learning rate of 0.001, the Adam optimizer and cross entropy as the loss function. The data were randomly divided into training, validation and test subsets as , and , respectively, for all recordings.
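As a rough illustration of what these CQT settings imply, the sketch below lists the bin centre frequencies. It assumes the standard geometric bin spacing f_k = f_min · 2^(k / bins_per_octave); the function and constant names are chosen for this example only and are not taken from the original implementation.

```python
# Sketch: centre frequencies implied by the stated CQT parameters.
# Assumption: standard geometric spacing f_k = f_min * 2**(k / bins_per_octave).

N_BINS = 84            # number of CQT bins (from the text)
BINS_PER_OCTAVE = 12   # bins per octave (from the text)
F_MIN = 32.70          # minimum frequency in Hz (from the text)


def cqt_center_frequencies(n_bins=N_BINS, bins_per_octave=BINS_PER_OCTAVE, f_min=F_MIN):
    """Return the centre frequency (Hz) of every CQT bin (hypothetical helper)."""
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


freqs = cqt_center_frequencies()
n_octaves = N_BINS / BINS_PER_OCTAVE  # 84 bins at 12 per octave
print(f"octaves covered: {n_octaves:g}")
print(f"lowest bin: {freqs[0]:.2f} Hz, highest bin: {freqs[-1]:.2f} Hz")
```

Under this assumption, the 84 bins span seven octaves, from 32.70 Hz up to just under 4 kHz.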
3.3. Results
Figure 5 shows the distribution of classification results for the test set using all prepared models without injection mechanism, divided into the individual sizes of the target feature space
M. The graph on the left side (
Figure 5a) shows the results in the form of
obtained by the models trained on the EmoDB database, and the chart on the right side (
Figure 5b) presents results for the models trained on the RAVDESS database.
Figure 6 illustrates the distribution of classification results in the same manner, but for models with injection mechanisms. As in
Figure 5, the left graph (
Figure 6a) presents results for the EmoDB database and the right (
Figure 6b) for the RAVDESS database. As can be observed, for both databases, models trained on CQT representations achieved significantly better classification accuracy than models trained on Mel spectrograms in both variants, with and without the injection mechanism. Additionally, for both EmoDB and RAVDESS, these graphs show a noticeable improvement in classification with the feature injection mechanism for both the CQT and the Mel spectrogram. Moreover, for EmoDB and models trained on the CQT representation, the injection mechanism significantly reduced the range of results obtained by individual models within a given variant, which is especially visible for
and
. Furthermore, comparing the highest accuracy obtained by the baseline models with the best result obtained using the injection mechanism shows an improvement of 3.9% for CQT and 6.25% for the Mel spectrogram on EmoDB. For RAVDESS, the improvement was 1.06% for the Mel spectrogram and 4.12% for the CQT representation.
Table 2,
Table 3 and
Table 4 show averaged results for individual base representations and sizes of the target feature space
M, as well as the subsequent values of
. The results are presented as
for the base models and
for the injection mechanism models. In the case of the EmoDB database, the results are additionally presented in the form of
for base models and
for models with injection mechanisms. Additionally, for each measure, a value (
,
) indicates the improvement obtained after applying the feature injection mechanism for individual sizes
M. For EmoDB, the highest average score for the base models, 74.01% (WA), was obtained for the CQT representation and
, while for the variant using the injection mechanism, the highest average score was 75.31% (WA) for
, which gives an improvement of 1.3% (WA). For the same value
, the improvement was 4.69% (WA).
The highest average improvement, equal to 8.91% (WA), was observed for , where was 0.05, so the injected features represented only 5% of the final feature space. The lowest improvement of 2.5% (WA) was obtained for , where the injected features constituted about 40% of the feature space. On the other hand, for the Mel spectrogram, the highest average accuracy of the base models, 62.5% (WA), was obtained for , and that of the variant with the injection mechanism was 66.25% (WA) for , which gives an improvement of 3.75% (WA). However, for , for which the overall highest result was obtained for the Mel spectrogram when injecting features, the average improvement was 8.75% (WA), with the injected features constituting 10% of the total. The lowest average difference was observed for , which was 0.87% (WA) for .
For the RAVDESS database, the highest average accuracy for the base models, 66.04% (WA), was also achieved for the CQT representation and . For the variant with the injection mechanism, it was 69.06% (WA) for , which gives an improvement of 3.02% (WA). For the same value , the improvement was 4.27% (WA). The best average improvement, 4.48% (WA), was achieved for where was equal to 0.10. On the other hand, the lowest improvement, 1.77%, was observed for and . In the case of base models trained on Mel spectrograms, the highest average result was 56.77% (WA), which, compared to the best score of 57.4% (WA) for the variant with the injection mechanism, gives an improvement of 0.63% (WA).
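The improvements quoted above are absolute differences between two WA values, i.e., percentage points rather than relative gains. The short sketch below recomputes them from the best average scores given in the text; the dictionary keys and the helper name are labels chosen for this example.

```python
# Sketch: recompute the quoted improvements as absolute differences in WA.
# The accuracy values are copied from the text; keys and helper names are illustrative.

best_average_wa = {
    # (database, representation): (best base WA, best WA with injection), in %
    ("EmoDB", "CQT"): (74.01, 75.31),
    ("EmoDB", "Mel"): (62.50, 66.25),
    ("RAVDESS", "CQT"): (66.04, 69.06),
    ("RAVDESS", "Mel"): (56.77, 57.40),
}


def improvement(base, injected):
    """Absolute improvement in percentage points, rounded to two decimals."""
    return round(injected - base, 2)


improvements = {key: improvement(*scores) for key, scores in best_average_wa.items()}

for (database, representation), delta in improvements.items():
    print(f"{database:8s} {representation:3s} improvement: {delta:.2f} (WA)")
```

The differences match the values reported in the text: 1.3, 3.75, 3.02 and 0.63.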
The graphs in
Figure 7 show the relationship of classification quality (
) with the size of the target feature space.
Figure 7a and
Figure 7c present information for models without injection mechanism for EmoDB and RAVDESS databases, respectively.
Figure 7b,d illustrate results for models with the injection mechanism, also for EmoDB and RAVDESS. Moreover, a comparison of the graphs for EmoDB shows that the use of the injection mechanism not only had a positive impact on the average classification result but also reduced the average spread of results obtained for different values of the
M parameter. Additionally,
Figure 7b shows how the
coefficient changes depending on the
M value.
Moreover, changes in the coefficient do not significantly affect the average level of classification, which remains similar throughout the entire range of changes. This behaviour, together with the average improvement in results after using the injection mechanism, may indicate the importance of the injected features, which partially reduce the complexity of the neural network architecture while maintaining its level of prediction.
Figure 8 and
Figure 9 show the confusion matrices of the models trained on EmoDB, without and with the injection mechanism, that received the highest scores for CQT and Mel spectrograms, respectively. Similarly,
Figure 10 and
Figure 11 illustrate the confusion matrices for the RAVDESS database.
Table 5 and
Table 6 present the improvement in classification accuracy for individual emotional states for the models depicted in the confusion matrices. In the case of EmoDB, injection of additional features (
) for the CQT representation significantly improved classification accuracy for the
dis emotional state, by 30% (WA). A slight improvement was also observed for the
fea and
ang states, equal to 16.7% (WA) and 6.7% (WA), respectively. The classification accuracy for
neu and
bor decreased. However, the results for
hap and
sad did not change. A different result occurred in the case of the Mel spectrogram (
), where only
hap worsened, by 16.6% (WA). However, significant improvement was observed for
neu, equal to 21% (WA), and for
fea, equal to 38.9% (WA). No changes were observed for
ang,
sad,
dis and
bor. A different situation can be observed in the results for the RAVDESS database. The most significant improvement occurred for
dis, at 12.5% (WA), in the case of the Mel spectrogram, and for
hap and
dis, equal to 16.7% (WA), in the case of the CQT representation. On the other hand, classification accuracy decreased in the model trained on Mel spectrograms for
hap and
cal. For the CQT representations, the classification of
neu,
ang,
sad, and
sur deteriorated. In the case of EmoDB, for both representations, the classification noticeably improved for the
fea emotional state. On the other hand, in both cases, accuracy for
sad remained unchanged. For RAVDESS, the largest positive change occurred for
dis for both representations. In both cases, EmoDB and RAVDESS, for the Mel spectrogram, accuracy for
ang remained unchanged but accuracy for
hap decreased. For the models trained on the CQT representation, for both databases, classification of
neu also decreased.
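The per-emotion changes discussed above can be read directly off a pair of confusion matrices. The following is a minimal sketch under stated assumptions: rows are true labels, columns are predictions, and per-emotion accuracy is taken as the row-normalised diagonal. The two matrices are made-up toy counts for three classes, not results from this study, and the helper names are hypothetical.

```python
# Sketch: per-emotion accuracy from a confusion matrix and its change between two models.
# Assumptions: rows = true labels, columns = predicted labels.
# The matrices below are toy counts for illustration only, NOT results from the paper.


def per_class_accuracy(cm):
    """Percentage of samples of each true class that were classified correctly."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(cm)]


def overall_accuracy(cm):
    """Percentage of all samples that were classified correctly."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return 100.0 * correct / total


labels = ["ang", "hap", "sad"]
cm_base = [[8, 2, 0], [3, 6, 1], [0, 1, 9]]       # toy model without injection
cm_injected = [[9, 1, 0], [3, 6, 1], [0, 1, 9]]   # toy model with injection

base_acc = per_class_accuracy(cm_base)
injected_acc = per_class_accuracy(cm_injected)
changes = [inj - base for base, inj in zip(base_acc, injected_acc)]

for label, base, inj, change in zip(labels, base_acc, injected_acc, changes):
    print(f"{label}: {base:.1f}% -> {inj:.1f}% (change {change:+.1f})")
print(f"overall: {overall_accuracy(cm_base):.2f}% -> {overall_accuracy(cm_injected):.2f}%")
```

A positive change corresponds to an emotion whose classification improved after injection, zero to an unchanged one, and a negative value to a deterioration.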
Moreover,
Figure 12 and
Figure 13 show the distribution of classification results for individual emotional states for the models with the size of the final feature space
M, for which the highest average improvement was observed when using the injection mechanism. In the case of EmoDB, for CQT representation and Mel spectrogram, it was
and
, respectively. For RAVDESS, it was
for both representations. The ranges of some results narrowed significantly. For CQT, these were
hap and
fea, and
neu for the Mel spectrogram. In both cases, the range of results for the
bor state widened.
Additionally, in
Table 7 and
Table 8 average values from all models, without injection mechanism (
) and with injection mechanism
, are presented for individual emotional states. Furthermore, for every emotion, the change in classification accuracy after using the
injection mechanism is shown.
Table 7 contains results for the EmoDB database. As can be seen, in the case of the CQT representation, the highest average improvement was recorded for the emotions
hap (14.6%) and
neu (6.5%). Classification of the other emotional states also improved slightly, by 1.2% to 4%. In turn, for Mel spectrograms, the classification of the
neu emotional state improved significantly, by 8.7%, compared to other emotions, for which there was also an improvement, but to a lesser extent, in the range 4.1–5.9%. The exception was the
sad emotion, where the injection mechanism worsened the average classification level by 1%. The situation is different in the case of the RAVDESS database, which is shown in
Table 8. The results for the CQT representation show significant improvement for emotions like
hap (7.5%) and
sad (7.1%). However, a minimal deterioration of 0.1% in
dis classification accuracy was observed. For the remaining emotional states, a slight improvement occurred, from 0.5% to 3.5%. For models trained on Mel spectrograms, significant deterioration was observed for
cal (6.6%), while the highest improvement occurred for the
sad emotional state, equal to 6.2%. For the rest of the emotions, accuracy improved slightly, from 0.2% to 4.8%.