Article

Augmentation Embedded Deep Convolutional Neural Network for Predominant Instrument Recognition

Jian Zhang and Na Bai
1 School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China
2 Jiangsu Vocational Institute of Architectural Technology, Xuzhou 221008, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(18), 10189; https://doi.org/10.3390/app131810189
Submission received: 11 August 2023 / Revised: 4 September 2023 / Accepted: 8 September 2023 / Published: 11 September 2023

Abstract

Instrument recognition is a critical task in the field of music information retrieval, and deep neural networks have become the dominant models for this task due to their effectiveness. Recently, incorporating data augmentation methods into deep neural networks has been a popular approach to improve instrument recognition performance. However, existing data augmentation processes are always based on simple instrument spectrogram representations and are typically independent of the predominant instrument recognition process. This may result in a lack of coverage for certain required instrument types, leading to inconsistencies between the augmented data and the specific requirements of the recognition model. To build a more expressive instrument representation and address this inconsistency, this paper constructs a combined two-channel representation for further capturing the unique rhythm patterns of different types of instruments and proposes a new predominant instrument recognition strategy called the Augmentation Embedded Deep Convolutional Neural Network (AEDCN). AEDCN adds two fully connected layers into the backbone neural network and integrates data augmentation directly into the recognition process by introducing a proposed Adversarial Conditional Embedded Variational AutoEncoder (ACEVAE) between the added fully connected layers of the backbone network. This embedded module aims to generate augmented data based on designated labels, thereby ensuring its compatibility with the predominant instrument recognition model. The effectiveness of the combined representation and AEDCN is validated through comparative experiments with other commonly used deep neural networks and data augmentation-based predominant instrument recognition methods using a polyphonic music recognition dataset. The results demonstrate the superior performance of AEDCN in predominant instrument recognition tasks.

1. Introduction

Instrument recognition plays a crucial role in Music Information Retrieval, particularly in identifying the predominant instruments in a piece of music. In recent years, machine learning-based approaches have gained prominence in this field. Traditionally, acoustic features such as Zero-crossing rate, Spectral centroid, Mel-spectrogram, or MFCC have been extracted from music signals for instrument recognition. These features are then used with classifiers like Naive Bayes or SVM [1,2,3]. However, these traditional machine learning-based methods have limitations in terms of recognition capability. They often struggle to capture the intricate characteristics and variations in different musical instruments, resulting in suboptimal performance. Moreover, these methods heavily rely on handcrafted features, which may not fully encompass the complex nature of musical instrument sounds.
In recent years, deep learning techniques, particularly deep neural networks, have emerged as a dominant approach for instrument recognition. Deep neural networks can automatically learn and extract discriminative features from raw audio data, allowing for more robust and accurate instrument recognition. These models are capable of handling the complexities and variations inherent in musical instrument sounds, resulting in improved performance. Researchers have made significant advancements in instrument recognition by leveraging the power of deep learning. However, neural networks applied directly to music audio signals may struggle to effectively differentiate instruments in music with varying recording qualities. To address this issue and improve recognition performance, researchers convert one-dimensional music signals into two-dimensional spectrograms and then apply two-dimensional convolutional neural networks and transformers to these spectrograms for instrument recognition [4]. These spectrogram-based methods have shown more effective recognition results than traditional methods and other deep neural networks based on music audio signals alone. Moreover, some transfer learning-based methods and multi-instance multi-label-based methods are used for instrument recognition. However, instrument recognition methods still face challenges when dealing with low-quality recordings, especially in polyphonic music. To address the challenges posed by low recording quality, researchers first pre-train their models on supplemental music datasets and then train them on the target datasets [5,6]. Researchers also introduce data augmentation methods into instrument recognition models. One approach is the proposal of two-stage instrument recognition methods: in the first stage, data augmentation techniques such as Generative Adversarial Networks (GANs) are utilized to generate additional training data; in the second stage, deep neural networks such as Convolutional Neural Networks (CNNs), Gated Recurrent Units (GRUs), Convolutional Gated Recurrent Units (Conv GRUs), and Transformers are employed to map the augmented data to instrument labels [7,8,9]. Additionally, researchers have explored label augmentation, which involves constructing auxiliary classifiers based on additional labels introduced within a multi-task learning framework; by incorporating label augmentation, a more effective instrument recognition CNN can be built [10]. These data augmentation-based methods alleviate the challenges caused by low recording quality and contribute to improvements in music instrument recognition effectiveness.
Nevertheless, current instrument recognition models are typically designed on two-dimensional spectrograms, which may not capture the full expressive range of musical instruments, and the corresponding data augmentation processes are usually independent of the instrument recognition process. In particular, the types of instrument spectrograms generated through data augmentation may not necessarily align with the specific requirements of the instrument recognition models. This discrepancy can cause inconsistencies between the augmented data and the needs of the instrument recognition model, which can negatively impact its accuracy and performance. Moreover, generating high-resolution spectrograms incurs a high computational cost.
Therefore, further research is needed to develop more integrated approaches that consider the specific requirements of instrument recognition when performing data augmentation.
To address the aforementioned issues, this paper utilizes the concept of the “tempogram” for instruments to construct a combined two-channel representation that better captures the unique rhythm patterns exhibited by different types of instruments. Additionally, the paper introduces a novel instrument recognition strategy called the Augmentation Embedded Deep Convolutional Neural Network (AEDCN), which specifically tackles the limitations that arise from independent data augmentation processes. Following References [5,6], the AEDCN first adds a pre-training process based on introduced datasets as a transfer learning approach to improve its adaptability for the predominant instrument recognition task. The AEDCN then integrates the data augmentation stage as a part of the instrument recognition neural network: it uses two fully connected layers following the convolutional layers in the multi-task backbone instrument recognition network and incorporates a constructed Adversarial Conditional Embedded Variational AutoEncoder (ACEVAE) between these added fully connected layers. To keep the generation process simple, the ACEVAE-based data augmentation is applied to the fully connected features rather than the input data, which reduces the training difficulty of the generative model while maintaining efficiency. Additionally, an adversarial loss is introduced into the embedded data augmentation model, enabling the generation of more diverse samples based on specific labels. By integrating these techniques, the AEDCN combines data augmentation and label augmentation methods within a unified network architecture to address the challenges associated with insufficient and redundant augmented data. To evaluate the effectiveness of the constructed two-channel representation, the proposed AEDCN, and the ACEVAE, experiments are conducted on the IRMAS dataset.
The main contributions of this paper can be summarized as follows:
(1)
Proposal of an Augmentation Embedded Deep Convolutional Neural Network (AEDCN) for predominant instrument recognition. This approach integrates the data augmentation stage as part of the instrument recognition neural network, addressing the inconsistency between the augmented data and the data required by the recognition model.
(2)
Proposal of an ACEVAE incorporated within the AEDCN architecture. The ACEVAE introduces an adversarial loss into a Conditional Variational AutoEncoder and functions as a data augmentation mechanism positioned between the fully connected layer and the output layer of the multi-task backbone instrument recognition network.
(3)
Proposal of a combined two-channel representation aimed at capturing the distinctive rhythm patterns of various instrument types more effectively. This representation comprises a mel-spectrogram and a tempogram as its two channels, allowing it to better capture the characteristic rhythmic patterns associated with specific instruments. Moreover, a pre-training process based on introduced datasets is applied for the predominant instrument recognition task.
The rest of this paper is organized as follows. Section 2 reviews existing approaches for instrument recognition. Section 3 presents the proposed AEDCN model. Section 4 reports comparative experiments against other commonly used instrument recognition models. Finally, Section 5 concludes and discusses several issues for future work.

2. Related Works

2.1. Instrument Recognition

In the field of instrument recognition, there has been a growing interest in machine learning-based methods. Several approaches incorporating different features and algorithms have been proposed by researchers. Essid et al. [11] utilized mel-frequency cepstral coefficients (MFCCs) and a Principal Component Analysis (PCA) to model instrument features. They applied these techniques for instrument recognition tasks. Duan et al. [12] introduced a cepstrum representation method called unified discrete cepstrum (UDC), along with its mel-scale variant MUDC. These representations were combined with a radial basis function (RBF) kernel support vector machine (SVM) for instrument recognition. Eggink et al. [13] focused on using the spectral peak of the fundamental frequency of the harmonic sequence as a feature representation for identifying musical instruments, especially in the presence of background music.
With the advancements in deep learning, deep neural networks have also been widely applied to instrument recognition tasks. Kratimenos et al. [14] proposed an approach to detect the activity of musical instruments in polyphonic music. They trained a deep neural network using one-second audio clips and employed temporal max-pooling aggregation to calculate the final prediction score. To further enhance instrument recognition, researchers have explored the use of image processing techniques by converting one-dimensional music signals into two-dimensional spectrograms. Han et al. [15] adopted a CNN structure for instrument recognition, using the mel-frequency spectrogram as the input, and also incorporated Convolutional Gated Recurrent Units (Conv GRUs) for better performance. However, these methods often face challenges, particularly when dealing with low-quality recordings or polyphonic music. To address this issue, data augmentation methods have been introduced into instrument recognition models. Yu et al. [10] proposed constructing a network with an auxiliary classification based on onset groups and instrument families to generate valuable training data. Another study addressed predominant instrument recognition in polyphonic music using convolutional recurrent neural networks (CRNNs) [9]. Hung et al. (2019) introduced multi-task learning for instrument recognition, proposing a method to recognize both pitches and instruments [16]. For data augmentation, other works employed a Wave Generative Adversarial Network (WaveGAN) architecture to generate audio files [7,8,9]. These approaches demonstrate the utilization of various techniques, including feature extraction, deep learning, image processing, and data augmentation, to improve instrument recognition accuracy and handle challenges such as low-quality recordings and polyphonic music.
In addition, transfer learning-based methods are widely used in instrument recognition tasks. Transfer learning leverages models pre-trained on large-scale datasets and adapts them to the specific instrument recognition task. By utilizing the knowledge learned from one domain, these models can generalize well to the instrument recognition domain, even with limited labeled data [17].
Furthermore, multi-instance multi-label (MIML) methods have also been employed for instrument recognition [6]. MIML considers the fact that music pieces often contain multiple instruments playing simultaneously, making it a multi-label classification problem. In this approach, the music piece is treated as a bag of instances, with each instance representing a segment or frame of the audio. The model learns to classify each segment into different instrument labels, allowing for the recognition of multiple instruments within a single music piece.

2.2. Neural Networks for Data Augmentation

In addition to WaveGAN, there are various generative neural networks that can be employed for data augmentation in instrument recognition. Among these models, Generative Adversarial Networks (GANs) and their variants are widely used. However, GANs often suffer from training instability, which limits their integration into pattern recognition models. Diffusion models [18] and flow models [19] are effective large-scale generative neural networks commonly used for generating high-quality images, but they come with high computational complexity; incorporating them into instrument recognition models would therefore result in significant computational costs. In contrast, Variational Autoencoders (VAEs) [20] offer a stable training process and a flexible model structure, making them suitable as embedded sub-networks in instrument recognition models. VAEs can be constructed based on the specific requirements of the task. Additionally, Conditional Variational Autoencoders (CVAEs) [21], a variant of VAEs, can generate data associated with specified labels. To address the inconsistency issue in data augmentation-based instrument recognition models, a fully connected Adversarial Conditional Embedded Variational AutoEncoder (ACEVAE) is designed. This model introduces an adversarial learning method into a Conditional Variational Autoencoder to generate more of the necessary samples and leverages the stability and flexibility of CVAEs to generate data with specified labels. By incorporating the ACEVAE into instrument recognition models, it becomes possible to enhance performance and alleviate the inconsistency problem encountered in the two-step data augmentation approach.

3. The Proposed AEDCN Methods

In many predominant instrument recognition models that rely on data augmentation, a commonly used approach involves three stages. In the first stage, the instrument representation is constructed from the original music audio. In the second stage, data augmentation methods, such as WaveGAN, are applied to generate additional instrument representations for training purposes. In the third stage, deep neural networks are built to map the augmented data to instrument labels. While these data augmentation methods address issues related to low recording quality and improve the effectiveness of music instrument recognition, they may introduce inconsistencies when the data augmentation process is independent of the recognition model. Briefly, some augmented data in a particular instrument class may be redundant, while the required data for another instrument class remain insufficient. Moreover, a single data representation format limits the expressiveness of the features extracted by the deep neural networks. To mitigate these issues, the proposed Augmentation Embedded Deep Convolutional Neural Network (AEDCN) introduces the tempogram of instruments to construct a combined two-channel representation with the mel-spectrogram for further capturing the unique rhythm patterns of different types of instruments, and incorporates an Adversarial Conditional Embedded Variational Autoencoder (ACEVAE) between the added fully connected layers within a multi-task, label augmentation-based backbone instrument recognition network. The ACEVAE is designed around an introduced adversarial loss to generate specific data according to designated labels, enhancing the data augmentation process according to the recognition results. To further improve the recognition ability, we add a pre-training process based on introduced datasets for the predominant instrument recognition task.
The AEDCN also incorporates label-based data augmentation into its training process. Firstly, a multi-task deep instrument recognition neural network with two fully connected layers before the classification layer is designed as the backbone. This backbone network is first pre-trained on predominant instrument recognition datasets and then trained on the provided dataset for instrument recognition. Once the recognition training on the entire dataset is completed, the AEDCN addresses the issue of inconsistency based on the class labels and the training results. To tackle this problem, the AEDCN utilizes the features extracted from the first fully connected layer to train the ACEVAE model. The ACEVAE is trained to generate diverse features for specific classes based on designated labels and a designed adversarial loss. By generating fully connected features with the ACEVAE, the AEDCN increases the number of samples in each imbalanced class until it reaches the maximum among all classes. Next, the generated features are used to further train the sub-network from the first fully connected layer to the classification layer of the multi-task predominant instrument recognition backbone neural network. This additional training enhances the classification capability for the imbalanced instrument classes. Finally, the well-trained AEDCN model is applied to the testing dataset to verify its effectiveness in improving the overall recognition performance. This approach effectively addresses the issues of data imbalance and inconsistency, ensuring that all instrument classes receive sufficient training and leading to enhanced recognition performance.
The structure of the AEDCN, including the incorporation of the ACEVAE, is illustrated in Figure 1.
In the AEDCN, the backbone is a multi-task convolutional neural network, and the embedded data augmentation network is shown on a blue background. This section provides a detailed introduction to the AEDCN model, including the data preprocessing and the AEDCN model details.

3.1. Data Preprocessing

To construct two-channel representations of music audio, a series of preprocessing steps are applied. Here is an overview of the steps:
(1)
Gain Normalization: The gain of the audio data is normalized to −15 dB. This ensures that the audio is brought to a consistent loudness level, which can help in reducing the impact of volume variations in the subsequent processing.
(2)
Stereo to Mono Conversion: If the input audio is in stereo format, it is converted to mono by taking the mean of the left and right channels. This step merges the two channels into a single channel, simplifying the subsequent processing.
(3)
Down Sampling: The audio is downsampled from the original sampling rate of 44,100 Hz to 22,050 Hz. Downsampling reduces the sample rate while retaining important information, helping to reduce computational complexity without significant loss of relevant audio characteristics.
(4)
Normalization: All audio signals are normalized by dividing the time-domain signal with its maximum value. This normalization step scales the audio data to a range between −1 and 1, ensuring consistent amplitude levels across different audio samples.
(5)
Short-Time Fourier Transform (STFT): The time-domain signal is transformed into a time-frequency representation using the Short-Time Fourier Transform (STFT). This involves dividing the signal into short overlapping frames and applying the Fourier Transform to each frame. A window size of 1024 samples is used, with a hop size of 512 samples. STFT provides insights into the spectral content of the audio signal over time.
(6)
Mel-Scale Conversion: The linear frequency scale obtained from the STFT spectrogram is converted to the mel-scale. This conversion is based on human auditory perception and provides a more perceptually meaningful representation of the audio. A mel-frequency scale with 128 frequency bins is used, as suggested by previous studies on music annotation.
(7)
Magnitude Compression: The magnitude of the mel-frequency spectrogram is compressed using a natural logarithm. This compression helps to emphasize lower magnitudes and reduces the dynamic range of the spectrogram. It can enhance the representation of subtle audio features and ensure that important information is not overshadowed by dominant components.
(8)
Tempogram: Besides the mel-frequency spectrogram, we also use the tempogram [22] as a concatenated feature to construct a two-channel representation. The tempogram captures the rhythmic structure and repetitive patterns in an audio signal by computing the rhythm patterns present in it. In instrument recognition tasks, each instrument possesses its own distinctive rhythm patterns during performance, and the tempogram proves useful in distinguishing the disparities among audio clips of different instruments.
The final data representation is a concatenation of the mel-frequency spectrogram and the tempogram in two channels, and each channel has 128-dimensional features per frame.
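To make this pipeline concrete, the following is a minimal sketch using librosa with the parameter values stated in steps (1)-(8); the RMS-based -15 dB gain normalization and the choice of 128 tempogram bins (so that the two channels share one size) are assumptions for illustration rather than details taken from the paper.

```python
import numpy as np
import librosa

def build_representation(path, target_sr=22050, n_fft=1024, hop=512, n_bins=128):
    # Load, convert stereo to mono (channel mean), and downsample to 22,050 Hz.
    y, sr = librosa.load(path, sr=target_sr, mono=True)

    # Gain normalization to roughly -15 dB full scale (RMS-based, assumed).
    rms = np.sqrt(np.mean(y ** 2)) + 1e-10
    y = y * (10.0 ** (-15.0 / 20.0)) / rms

    # Peak normalization to the range [-1, 1].
    y = y / (np.max(np.abs(y)) + 1e-10)

    # STFT-based mel spectrogram (window 1024, hop 512, 128 mel bins),
    # compressed with a natural logarithm.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_bins)
    log_mel = np.log(mel + 1e-10)

    # Tempogram with the same number of bins so the two channels align.
    onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)
    tempo = librosa.feature.tempogram(onset_envelope=onset_env, sr=sr,
                                      hop_length=hop, win_length=n_bins)

    # Stack as a 2 x 128 x frames array (channels first).
    frames = min(log_mel.shape[1], tempo.shape[1])
    return np.stack([log_mel[:, :frames], tempo[:, :frames]], axis=0)
```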

3.2. AEDCN Model Details

The proposed AEDCN consists of two main components:
(1)
Multi-task deep instrument recognition backbone neural network:
The backbone deep neural network contains multiple convolutional layers, pooling layers, two fully connected layers, and an output layer for classification. To introduce label augmentation, an auxiliary classifier is integrated into the backbone neural network. This involves adding an extra classification layer to the network that predicts auxiliary labels. The purpose is to recognize augmented labels during training, which in turn enhances the model’s generalization capability. The input of the multi-task deep instrument recognition backbone neural network is the concatenated two-channel spectrogram; the structure of this network is shown in Table 1.
In this backbone, we use filters with a 3 × 3 receptive field and a fixed stride of 1, and spatial abstraction is conducted by max-pooling with a size of 3 × 3. Table 1 lists the detailed network structure. It should be pointed out that the input of each convolution layer is zero-padded with 1 × 1 to preserve the spatial resolution regardless of the input window size.
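As a rough illustration, one of these convolution blocks (two 3 × 3 convolutions with stride 1 and 1 × 1 zero-padding, followed by 3 × 3 max-pooling and dropout, with the channel widths 32, 64, 128, and 256 of Table 1) could be sketched in PyTorch as follows; the ReLU activations are an assumption, and the padding needed to reproduce the exact intermediate sizes listed in Table 1 may differ.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, dropout=0.25):
    # Two 3 x 3 convolutions (stride 1, zero-padding 1), 3 x 3 max-pooling, dropout.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=3),
        nn.Dropout(dropout),
    )

# Channel progression on the two-channel input: 2 -> 32 -> 64 -> 128 -> 256
# (the last stage in Table 1 ends with adaptive max-pooling instead of the fixed pooling).
blocks = nn.Sequential(conv_block(2, 32), conv_block(32, 64),
                       conv_block(64, 128), conv_block(128, 256))
```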
To effectively train the backbone neural network and maximize the inter-class distances between features as much as possible, we introduce a center loss as a regularization term in the loss function. The loss function of the backbone neural network is written as:
$Loss = Loss_{principal} + Loss_{auxiliary} + Loss_{ACEVAE} + \alpha \left( Loss_{center\_loss1} + Loss_{center\_loss2} \right)$
$Loss_{principal}$ and $Loss_{auxiliary}$ are the classification losses, expressed as cross-entropy functions. $Loss_{center\_loss1}$ is the center loss of the first fully connected layer, which has been used in the CNN of [10], and $Loss_{center\_loss2}$ is the center loss of the second fully connected layer. $\alpha$ is a hyper-parameter (0.01 in this paper). Using the center loss to encourage feature separation in the first fully connected layer is beneficial when training the ACEVAE. The center loss aims to minimize the intra-class variations and maximize the inter-class distances of the learned features. By leveraging the center loss, the features corresponding to different types of instruments in the first fully connected layer become more discriminative and better separated. This separation allows the ACEVAE to generate more diverse and distinct augmented samples for each instrument class. The ACEVAE takes these feature vectors as inputs to learn the underlying distribution of instrument features and generates augmented samples that align with the desired instrument classes. This approach enhances the quality and diversity of the generated samples, leading to improved instrument recognition performance.
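For illustration, a minimal PyTorch sketch of the combined loss above is given below, assuming single-label targets (as in the IRMAS training set), 1024-dimensional fully connected features, and a simplified center-loss module; the $Loss_{ACEVAE}$ term is passed in from the embedded generator described in the next part.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterLoss(nn.Module):
    """Simplified center loss: mean squared distance of features to their class centers."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        return ((features - self.centers[labels]) ** 2).sum(dim=1).mean()

center_loss1 = CenterLoss(num_classes=11, feat_dim=1024)  # first fully connected layer
center_loss2 = CenterLoss(num_classes=11, feat_dim=1024)  # second fully connected layer
alpha = 0.01  # hyper-parameter value stated in the text

def total_loss(principal_logits, auxiliary_logits, fc1, fc2,
               labels, aux_labels, loss_acevae):
    loss_principal = F.cross_entropy(principal_logits, labels)
    loss_auxiliary = F.cross_entropy(auxiliary_logits, aux_labels)
    return (loss_principal + loss_auxiliary + loss_acevae
            + alpha * (center_loss1(fc1, labels) + center_loss2(fc2, labels)))
```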
(2)
Embedded data augmentation method ACEVAE:
The second part of the AEDCN is the designed ACEVAE. The ACEVAE is a special type of Variational AutoEncoder that is capable of generating specific types of instrument features based on designated labels and an adversarial loss.
By leveraging the ACEVAE, the AEDCN can generate synthetic feature samples that specifically correspond to certain designated labels. This process enables the network to augment the training data by generating additional feature samples for particular classes, consequently improving the model’s capacity to recognize and classify these classes.
The combination of the backbone deep neural network with the ACEVAE enables the AEDCN to leverage label augmentation and data generation techniques to enhance its performance in the instrument recognition task. Compared to other models like diffusion models and flow models, the ACEVAE utilized in AEDCN has the advantage of lower computational complexity. This means that the model can generate additional augmented data without imposing a significant increase in computational load during both training and inference. This is particularly beneficial in situations where computational resources are limited. By efficiently generating augmented data without excessive computational requirements, the ACEVAE contributes to the overall efficiency and practicality of the AEDCN model. In addition, the training process of AEDCN with ACEVAE is more stable compared to models based on Generative Adversarial Networks (GANs). GAN-based models often face challenging training dynamics between the generator and discriminator networks, which can be difficult to stabilize. On the other hand, the ACEVAE used in AEDCN offers a more stable training process, making it easier to train and optimize the model effectively. This stability ensures that the AEDCN model can be trained reliably and consistently, leading to improved performance and better overall results.
To better understand the structure of the ACEVAE in AEDCN, Table 2 provides details of its architecture.
In the ACEVAE, the labels and the outputs of the fully connected layer in the backbone are each transformed into 1024-dimensional vectors by two fully connected layers, and these vectors are then reshaped into a 2 × 32 × 32 matrix. Specifically, there are three types of variables in the ACEVAE: input variables x (the output of the fully connected layer of the backbone), output variables y (instrument labels), and latent variables z (parameterized by the expectation mu and the variance var). The conditional generative process of the model is given as follows: for a given observation x, z is drawn from the prior distribution p(z|x), and the output y is generated from the distribution p(y|x, z). The prior of the latent variables z can be modulated by the input x; however, this constraint can be relaxed to make the latent variables statistically independent of the input variables: p(z|x) = p(z).
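A minimal sketch of this conditioning step is shown below, assuming an 11-class principal label space with one-hot label encoding; it maps the backbone feature and the label to 1024-dimensional vectors, reshapes each to 32 × 32, and stacks them as the 2 × 32 × 32 encoder input of Table 2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionEmbedding(nn.Module):
    def __init__(self, feat_dim=1024, num_classes=11):
        super().__init__()
        self.feat_fc = nn.Linear(feat_dim, 1024)      # maps the backbone FC features
        self.label_fc = nn.Linear(num_classes, 1024)  # maps the one-hot labels

    def forward(self, fc_features, labels):
        # fc_features: (B, 1024) backbone features; labels: (B,) class indices.
        one_hot = F.one_hot(labels, num_classes=self.label_fc.in_features).float()
        feat_map = self.feat_fc(fc_features).view(-1, 1, 32, 32)
        label_map = self.label_fc(one_hot).view(-1, 1, 32, 32)
        # Stack as the 2 x 32 x 32 conditional input of the encoder in Table 2.
        return torch.cat([feat_map, label_map], dim=1)
```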
The ACEVAE can be trained to maximize the conditional log-likelihood with an adversarial loss as a regularization term. The variational lower bound of the model is written as follows:
$\log p_\theta(y|x) \geq -\mathrm{KL}\left( q_\varphi(z|x,y) \,\|\, p_\theta(z) \right) + \mathbb{E}_{q_\varphi(z|x,y)}\left[ \log p_\theta(y|x,z) \right],$
and the empirical lower bound is written as:
$Loss(x, y; \theta, \varphi) = -\mathrm{KL}\left( q_\varphi(z|x,y) \,\|\, p_\theta(z) \right) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(y|x, z^{(l)}),$
where $z^{(l)} = g_\varphi(x, y, \varepsilon^{(l)})$, $\varepsilon^{(l)} \sim \mathcal{N}(0, 1)$, and $q_\varphi$ is the encoder defined in Table 2.
To generate diverse features, we introduce an adversarial loss as a regularization term to improve the generation ability. To construct an effective and simple adversarial loss term, we add the inverse KL divergence $\mathrm{KL}\left( p_\theta(z) \,\|\, q_\varphi(z|x,y) \right)$ to the empirical lower bound, which yields the following objective:
$Loss_{ACEVAE}(x, y; \theta, \varphi) = -\mathrm{KL}\left( q_\varphi(z|x,y) \,\|\, p_\theta(z) \right) - \mathrm{KL}\left( p_\theta(z) \,\|\, q_\varphi(z|x,y) \right) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(y|x, z^{(l)})$
In Formula (4), the KL divergence $\mathrm{KL}\left( q_\varphi(z|x,y) \,\|\, p_\theta(z) \right)$ and the inverse KL divergence $\mathrm{KL}\left( p_\theta(z) \,\|\, q_\varphi(z|x,y) \right)$ can be effectively calculated, and they are combined as the adversarial loss term, which can be treated as an un-weighted JS divergence.
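As an illustration, if the encoder outputs a diagonal Gaussian q(z|x, y) = N(mu, var) and the prior is the standard normal p(z) = N(0, I), both KL terms in Formula (4) have closed forms; the sketch below assumes this setting, a single reparameterized sample (L = 1), and a cross-entropy reconstruction term for the label y.

```python
import torch
import torch.nn.functional as F

def acevae_loss(mu, logvar, decoder_logits, y):
    var = logvar.exp()
    # Forward KL term: KL( q(z|x,y) || N(0, I) ), summed over latent dimensions.
    kl_qp = 0.5 * (var + mu ** 2 - 1.0 - logvar).sum(dim=1)
    # Inverse KL term: KL( N(0, I) || q(z|x,y) ).
    kl_pq = 0.5 * ((1.0 + mu ** 2) / var - 1.0 + logvar).sum(dim=1)
    # Reconstruction term log p(y|x,z) with one reparameterized sample.
    log_py = -F.cross_entropy(decoder_logits, y, reduction="none")
    # Negate the lower bound so that the returned value can be minimized.
    return (kl_qp + kl_pq - log_py).mean()
```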
ACEVAE is employed to introduce label-based data augmentation during the training process, and the whole training process is listed as follows:
(1)
Transfer-based pre-training: AEDCN introduces a pre-training process based on added datasets as a transfer-learning approach to improve the adaptability of AEDCN for the predominant instrument recognition task.
(2)
Recognition training: Initially, the entire training dataset is used to train the multi-task deep instrument recognition backbone neural network. This step aims to learn the recognition capability of the model using the available training samples.
(3)
Imbalanced class addressing: After the initial recognition training, AEDCN focuses on addressing the issue of imbalanced classes in the dataset. Imbalanced classes refer to classes with a limited number of training samples, which can lead to imbalanced data representation.
(4)
Training the ACEVAE: To generate additional samples for the imbalanced classes, the fully connected features extracted from these samples during the training process are utilized. These features serve as the input to train the ACEVAE model, which is capable of generating synthetic samples based on specific labels.
(5)
Generating fully connected features: By training the ACEVAE, the model can generate fully connected features for the imbalanced class. The generation process is controlled by providing designated labels corresponding to the imbalanced class.
(6)
Repeat until maximum: The process described in steps (4) and (5) continues iteratively until the number of samples in the imbalanced class reaches the maximum among all classes. This iterative approach ensures that sufficient synthetic samples are generated to balance the data distribution.
(7)
Retraining the recognition process: Once the maximum number of samples is reached for the imbalanced class, the generated features alongside the original sample data are used to retrain the sub-network of the multi-task deep instrument recognition backbone neural network. This means that both the real and synthetic samples are incorporated during the retraining stage.
(8)
Using a well-trained model for testing datasets: the well-trained AEDCN model is applied to the testing dataset to verify its effectiveness in improving the overall recognition performance.
By including the generated synthetic samples along with the original samples, AEDCN enhances the recognition capability specifically for the previously imbalanced class. This process helps to mitigate the impact of imbalanced training data, improving the model’s ability to recognize and classify the imbalanced class accurately.
Next, we evaluate the effectiveness of the proposed AEDCN in experiments.

4. Experiments

The experiments aim to demonstrate two key results. Firstly, they aim to show that the proposed AEDCN achieves instrument recognition results that are comparable to or better than other commonly used instrument recognition models. Secondly, they aim to validate the effectiveness of the constructed ACEVAE and the built two-channel spectrogram representation for instrument recognition. This section first introduces the used dataset and then analyzes the experimental results.

4.1. Data Set

In this paper, we use the IRMAS dataset. It should be noted that the training set of IRMAS is single-labelled and the testing set is multi-labelled. Music audio segments in the IRMAS dataset [23] are stereo with a 44.1 kHz sampling rate. These music segments are played in different styles, and the recordings span several decades, so they have different recording qualities. The purpose of IRMAS is to recognize instruments in audio segments, including cello, clarinet, flute, acoustic guitar, electric guitar, organ, piano, saxophone, trumpet, violin, and human voice. In the experiments, IRMAS is divided into a training set and a testing set. We use the training set in IRMAS to train the network and use 15% of the training set as the validation set. The attributes of the IRMAS dataset are shown in Table 3. For pre-training, we add a dataset to improve the adaptability of the AEDCN for predominant instrument recognition; as the main IRMAS dataset contains monophonic training data and polyphonic testing data, we use the monophonic Medley-solos-DB dataset [24] to improve the recognition ability in the training process and keep consistency with the task of predominant instrument recognition on the IRMAS dataset.
As Table 3 shows, the dataset is imbalanced, and we generate samples of designated instrument classes in the training process by using the embedded ACEVAE. Augmented samples are generated until each class reaches the maximum class size of the original dataset. Moreover, the auxiliary classifier contains three classes; the relation between the auxiliary classes and the principal classes is shown in Table 4.
As Table 4 shows, the auxiliary classifier handles coarser-grained classes than the principal classifier. To evaluate the performance of the proposed AEDCN, we calculate the micro average and macro average of precision, recall, and F1 score. These indicators are calculated as follows:
$F1_{micro} = \frac{2 P_{micro} R_{micro}}{P_{micro} + R_{micro}}, \quad F1_{macro} = \frac{2 P_{macro} R_{macro}}{P_{macro} + R_{macro}}$
$P_{micro} = \frac{\sum_{l=1}^{L} tp_l}{\sum_{l=1}^{L} (tp_l + fp_l)}, \quad R_{micro} = \frac{\sum_{l=1}^{L} tp_l}{\sum_{l=1}^{L} (tp_l + fn_l)}$
$P_{macro} = \frac{1}{L} \sum_{l=1}^{L} \frac{tp_l}{tp_l + fp_l}, \quad R_{macro} = \frac{1}{L} \sum_{l=1}^{L} \frac{tp_l}{tp_l + fn_l}$
where $L$ is the number of classes, $l$ is the label index, and $tp_l$, $fp_l$, and $fn_l$ are the true positives, false positives, and false negatives for label $l$.
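As a small sanity check, these micro- and macro-averaged metrics can be computed with scikit-learn on multi-label indicator arrays; the toy labels below are illustrative only.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy ground truth and predictions for 4 clips and 3 instrument labels.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]])

p_micro, r_micro, f1_micro, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0)
p_macro, r_macro, f1_macro, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"micro: P={p_micro:.2f} R={r_micro:.2f} F1={f1_micro:.2f}")
print(f"macro: P={p_macro:.2f} R={r_macro:.2f} F1={f1_macro:.2f}")
```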

4.2. Experimental Results Analysis

To evaluate the effectiveness of the proposed AEDCN model, we compare our instrument recognition results on the IRMAS dataset with other instrument recognition models and the results are shown in Table 5.
Table 5 presents a comparison of various instrument recognition models, including both traditional and neural network-based approaches. The method proposed by Bosch et al. utilizes manually extracted features and employs an SVM as the classifier. MTF-DNN applies a deep neural network to manually extracted features, while Audio DNN uses a convolutional neural network on music audio for instrument recognition, achieving more effective recognition results than traditional methods. ConvNet designs a CNN on the mel-spectrogram and aggregates predictions on the testing data through max-pooling. Building upon ConvNet, the Multi-task ConvNet introduces an auxiliary classifier to augment labels in the IRMAS dataset, resulting in improved recognition accuracy compared to ConvNet. For data augmentation, WaveGAN ConvNet utilizes WaveGAN to generate training data and performs instrument recognition using ConvNet. Similarly, Voting-Swin-T also generates training data using WaveGAN but adopts a Transformer-based neural network for instrument recognition. Staged trained ConvNet uses a staged training process on the Medley-solos-DB dataset to train the ConvNet to recognize instruments in the IRMAS dataset. VAE augmentation ConvNet is a ConvNet model that introduces a VAE for data augmentation on the Medley-solos-DB dataset.
According to the results presented in Table 5, the proposed AEDCN model outperforms other neural network-based models as well as data augmentation-based instrument recognition models.
The AEDCN model consists of three main components: a concatenated two-channel instrument representation, a pre-training process, and an embedded ACEVAE. In the next section, an ablation experiment is conducted to evaluate the effectiveness of these three parts. An ablation experiment involves systematically removing or disabling specific components of a model to assess their impact on performance. By conducting such an experiment, the study aims to analyze how the two-channel instrument representation, a pre-training process, and the embedded ACEVAE contribute to the overall performance of the AEDCN model for instrument recognition. Moreover, we aim to verify the effectiveness of the constructed adversarial loss in the ACEVAE.

4.3. Ablation Experiment

In this section, this ablation experiment will shed light on the importance and effectiveness of the two-channel instrument representation, a pre-training process, and the embedded ACEVAE, providing insights into the individual contributions of these components to the model’s performance.
The AEDCN using only the mel-spectrogram is named AEDCN-mel-spectrogram, and the AEDCN without the ACEVAE is named AEDCN-single. For comparison, we also apply the AEDCN to the 153-dimensional time-frequency representation used in the Multi-task ConvNet; this model is named AEDCN-153. The AEDCN without the pre-training process is denoted AEDCN_without_pre. The results are shown in Table 6.
As Table 6 shows, all three components of the AEDCN are effective, and the introduced two-channel instrument representation performs better than the 153-dimensional time-frequency representation. Furthermore, we verify in Table 7 that the constructed adversarial loss is effective for the instrument recognition task.
In Table 7, AEDCN-likelihood is an AEDCN that uses the likelihood function (KL divergence) for training the ACEVAE, and AEDCN-inverseKL is an AEDCN that uses the inverse KL divergence for training the ACEVAE. As the results in Table 7 show, the combined adversarial loss of KL divergence and inverse KL divergence is more effective than the traditional loss function for instrument recognition.
As this dataset is imbalanced, next, we evaluate AEDCN on each instrument class in Table 8.
According to the results presented in Table 8, the AEDCN demonstrates improvements over ConvNet-based and Transformer-based instrument recognition methods across various instrument categories. In particular, while the AEDCN shares a similar backbone neural network structure with the Multi-task ConvNet, the features generated by the ACEVAE in the AEDCN play a significant role in enhancing recognition performance for the Cla and Flu classes.
To understand the recognition results of the proposed model, we conduct a visual analysis in Figure 2.
We use the t-distributed stochastic neighbor embedding (t-SNE) [29] algorithm, which is commonly used for dimensionality reduction of high-dimensional data, to show that the extracted features can be easily clustered into 11 classes in a two-dimensional projection. This result indicates that the proposed model provides relatively clear classification boundaries, and, as Figure 2 shows, the clusters generated from the AEDCN are separated by larger margins.
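A minimal sketch of this visualization is given below, assuming features is an (N, 1024) array of fully connected features from the trained backbone and labels holds the 11 instrument class indices; the t-SNE hyper-parameters are illustrative choices, not values taken from the paper.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels):
    # Project the high-dimensional fully connected features onto two dimensions.
    embedded = TSNE(n_components=2, perplexity=30, init="pca",
                    random_state=0).fit_transform(features)
    plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab20", s=8)
    plt.title("t-SNE projection of fully connected features")
    plt.show()
```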

5. Conclusions

This paper proposes an augmentation-embedded deep convolutional neural network, AEDCN, for instrument recognition. Compared with other data augmentation-based instrument recognition methods, the AEDCN generates only features rather than directly generating high-dimensional spectrograms. Specifically, the AEDCN places an ACEVAE between the added fully connected layers of a multi-task, label augmentation-based backbone instrument recognition network. Guided by the recognition results, the ACEVAE can augment specific data based on specially designated labels. Experiments verify that the proposed AEDCN performs well compared with other commonly used data augmentation-based instrument recognition methods and that the ACEVAE is effective for improving the recognition results.

Author Contributions

Conceptualization, J.Z.; Methodology, J.Z.; Software, J.Z.; Investigation, N.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (No. 62206297).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this paper is openly available; it was proposed in reference [13].

Acknowledgments

The authors thank Yufeng Zhang for contributions to the improvement of the paper’s grammar, revisions, and final proofreading.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Deng, J.D.; Simmermacher, C.; Cranefield, S. A study on feature analysis for musical instrument classification. IEEE Trans. Syst. Man Cybern. Part B 2008, 38, 429–438. [Google Scholar] [CrossRef] [PubMed]
  2. Wang, Y.; Han, K.; Wang, D. Exploring monaural features for classification-based speech segregation. IEEE Trans. Audio Speech Lang. Process. 2012, 21, 270–279. [Google Scholar] [CrossRef]
  3. Giannoulis, D.; Klapuri, A. Musical instrument recognition in polyphonic audio using missing feature approach. IEEE Trans. Audio Speech Lang. Process. 2013, 21, 1805–1817. [Google Scholar] [CrossRef]
  4. Gómez, J.S.; Abeßer, J.; Cano, E. Jazz solo instrument classification with convolutional neural networks, source separation, and transfer learning. In Proceedings of the 19th ISMIR Conference, Paris, France, 23–27 September 2018; pp. 577–584. [Google Scholar]
  5. Szeliga, D.; Tarasiuk, P.; Stasiak, B.; Szczepaniak, P.S. Musical Instrument Recognition with a Convolutional Neural Network and Staged Training. Procedia Comput. Sci. 2022, 207, 2493–2502. [Google Scholar] [CrossRef]
  6. Gururani, S.; Sharma, M.; Lerch, A. An attention mechanism for musical instrument recognition. arXiv 2019, arXiv:1907.04294. [Google Scholar]
  7. Kilambi, B.R.; Parankusham, A.R.; Tadepalli, S.K. Instrument Recognition in Polyphonic Music Using Convolutional Recurrent Neural Networks. In Proceedings of the International Conference on Intelligent Computing, Information and Control Systems: ICICCS 2020, Madurai, India, 13–15 May 2020; Springer: Singapore, 2021; pp. 449–460. [Google Scholar]
  8. Reghunath, L.C.; Rajan, R. Transformer-based ensemble method for multiple predominant instruments recognition in polyphonic music. EURASIP J. Audio Speech Music. Process. 2022. [Google Scholar] [CrossRef]
  9. Lekshmi, C.R.; Rajeev, R. Multiple Predominant Instruments Recognition in Polyphonic Music Using Spectro/Modgd-gram Fusion. Circuits Syst. Signal Process. 2023, 42, 3464–3484. [Google Scholar] [CrossRef]
  10. Yu, D.; Duan, H.; Fang, J.; Zeng, B. Predominant instrument recognition based on deep neural network with auxiliary classification. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 852–861. [Google Scholar] [CrossRef]
  11. Joder, C.; Essid, S.; Richard, G. Temporal integration for audio classification with application to musical instrument classification. IEEE Trans. Audio Speech Lang. Process. 2009, 17, 174–186. [Google Scholar] [CrossRef]
  12. Duan, Z.; Pardo, B.; Daudet, L. A novel cepstral representation for timbre modeling of sound sources in polyphonic mixtures. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 7495–7499. [Google Scholar]
  13. Eggink, J.; Brown, G. Using instrument recognition for melody extraction from polyphonic audio. J. Acoust. Soc. Am. 2005, 118, 2032. [Google Scholar] [CrossRef]
  14. Kratimenos, A.; Avramidis, K.; Garoufis, C.; Zlatintsi, A.; Maragos, P. Augmentation methods on monophonic audio for instrument classification in polyphonic music. In Proceedings of the 2020 28th European Signal Processing Conference (EUSIPCO), Amsterdam, The Netherlands, 18–21 January 2021; pp. 156–160. [Google Scholar]
  15. Han, Y.; Kim, J.; Lee, K. Deep convolutional neural networks for predominant instrument recognition in polyphonic music. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 25, 208–221. [Google Scholar] [CrossRef]
  16. Hung, Y.N.; Chen, Y.A.; Yang, Y.H. Multitask learning for frame-level instrument recognition. In Proceedings of the ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 381–385. [Google Scholar]
  17. Cramer, A.L.; Wu, H.H.; Salamon, J.; Bello, J.P. Look, listen, and learn more: Design choices for deep audio embeddings. In Proceedings of the ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 3852–3856. [Google Scholar]
  18. Pavăl, S.D.; Craus, M. Reaction-diffusion model applied to enhancing U-Net accuracy for semantic image segmentation. Discret. Contin. Dyn. Syst.-S 2023, 16, 54–74. [Google Scholar] [CrossRef]
  19. Kingma, D.P.; Dhariwal, P. Glow: Generative Flow with Invertible 1 × 1 Convolutions. In Proceedings of the Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Montréal, QC, Canada, 3–8 December 2018. [Google Scholar]
  20. Rolfe, J. Discrete Variational Autoencoders. arXiv 2016, arXiv:1609.02200. [Google Scholar]
  21. Uzunova, H.; Schultz, S.; Handels, H.; Ehrhardt, J. Unsupervised pathology detection in medical images using conditional variational autoencoders. Int. J. Comput. Assist. Radiol. Surg. 2019, 14, 451–461. [Google Scholar] [CrossRef] [PubMed]
  22. Grosche, P.; Müller, M.; Kurth, F. Cyclic tempogram—A mid-level tempo representation for music signals. In Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, USA, 14–19 March 2010; pp. 5522–5525. [Google Scholar]
  23. Nam, J.; Herrera, J.; Slaney, M.; Smith, J.O., III. Learning sparse feature representations for music annotation and retrieval. In Proceedings of the 13th International Society for Music Information Retrieval Conference, Porto, Portugal, 8–12 October 2012; pp. 565–570. [Google Scholar]
  24. Lostanlen, V.; Cella, C.E.; Bittner, R.; Essid, S. Medley-solos-DB: A cross-collection dataset for musical instrument recognition. Zenodo 2018. [Google Scholar] [CrossRef]
  25. Essid, S.; Richard, G.; David, B. Musical instrument recognition by pairwise classification strategies. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 1401–1412. [Google Scholar] [CrossRef]
  26. Bosch, J.J.; Janer, J.; Fuhrmann, F.; Herrera, P. A comparison of sound segregation techniques for predominant instrument recognition in musical audio signals. In Proceedings of the 13th International Society for Music Information Retrieval Conference, ISMIR 2012, Porto, Portugal, 8–12 October 2012; pp. 559–564. [Google Scholar]
  27. Gururani, S.; Summers, C.; Lerch, A. Instrument activity detection in polyphonic music using deep neural networks. In Proceedings of the 19th International Society for Music Information Retrieval Conference, Paris, France, 23–27 September 2018; pp. 577–584. [Google Scholar]
  28. Plchot, O.; Burget, L.; Aronowitz, H.; Matejka, P. Audio enhancing with DNN autoencoder for speaker recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, 20–25 March 2016; pp. 5090–5094. [Google Scholar]
  29. van der Maaten, L.J.P.; Hinton, G.E. Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Figure 1. The structure of the AEDCN. In the AEDCN, the backbone is a multi-task convolutional neural network. The parts on a blue background are the main contributions of the proposed AEDCN model.
Figure 2. The generated clusters of the AEDCN and the Multi-task ConvNet for a mini-batch. Clusters from the AEDCN are separated by larger spacing.
Table 1. Structure of the multi-task deep instrument recognition backbone neural network.
Input Size | Description
2 × 43 × 128 | concatenated spectrogram
32 × 45 × 130 | 3 × 3 convolution
32 × 47 × 132 | 3 × 3 convolution
32 × 15 × 44 | 3 × 3 max-pooling
32 × 15 × 44 | dropout (0.25)
64 × 17 × 46 | 3 × 3 convolution
64 × 19 × 48 | 3 × 3 convolution
64 × 6 × 16 | 3 × 3 max-pooling
64 × 6 × 16 | dropout (0.25)
128 × 8 × 18 | 3 × 3 convolution
128 × 10 × 20 | 3 × 3 convolution
128 × 3 × 6 | 3 × 3 max-pooling
128 × 3 × 6 | dropout (0.25)
256 × 5 × 8 | 3 × 3 convolution
256 × 7 × 10 | 3 × 3 convolution
256 × 1 × 1 | adaptive max-pooling
1024 | flattened and fully connected
1024 | fully connected
1024 | dropout (0.50)
3 & 11 | Sigmoid
Table 2. Structure of the ACEVAE.
Encoder | Decoder
1024 / fully connected | 1024 / fully connected
2 × 32 × 32 / 3 × 3 conv + BN + Dropout | 1 × 32 × 32 / 3 × 3 conv + BN + Dropout
32 × 16 × 16 / 3 × 3 conv + BN + Dropout | 32 × 16 × 16 / 3 × 3 conv + BN + Dropout
64 × 8 × 8 / 3 × 3 conv + BN + Dropout | 64 × 8 × 8 / 3 × 3 conv + BN + Dropout
128 × 4 × 4 / 3 × 3 conv + BN + Dropout | 128 × 4 × 4 / 3 × 3 conv + BN + Dropout
256 × 2 × 2 / 3 × 3 conv + BN + Dropout | 256 × 2 × 2 / 3 × 3 conv + BN + Dropout
512 × 1 × 1 / 3 × 3 conv + BN + Dropout | 512 × 1 × 1 / 3 × 3 conv + BN + Dropout
256 mu & 256 var
Table 3. Instruments in IRMAS [13].
Instruments | Abbreviations | Training | Testing
Cello | cel | 388 | 111
Clarinet | cla | 505 | 62
Flute | flu | 451 | 163
Acoustic guitar | gac | 637 | 535
Electric guitar | gel | 760 | 942
Organ | org | 682 | 361
Piano | pia | 721 | 995
Saxophone | sax | 626 | 326
Trumpet | tru | 577 | 167
Violin | vio | 580 | 211
Voice | voi | 778 | 1024
Table 4. The relation of the auxiliary classes and principal classes in IRMAS [10].
Instruments | Principal Classes | Auxiliary Classes
Cello | cel | Soft onset
Clarinet | cla | Soft onset
Flute | flu | Soft onset
Acoustic guitar | gac | Hard onset
Electric guitar | gel | Hard onset
Organ | org | Hard onset
Piano | pia | Hard onset
Saxophone | sax | Soft onset
Trumpet | tru | Soft onset
Violin | vio | Soft onset
Voice | voi | Other
Table 5. Performance of AEDCN and other instrument recognition models.
Model | F1 Micro | F1 Macro
SVM [25] | 0.36 | 0.27
Bosch et al. [26] | 0.50 | 0.43
MTF-DNN (2018) [27] | 0.32 | 0.28
Audio DNN [28] | 0.55 | 0.51
ConvNet (2017) [15] | 0.62 | 0.52
Multi-task ConvNet (2020) [10] | 0.66 | 0.58
Kratimenos et al. (2021) [14] | 0.65 | 0.55
WaveGAN ConvNet (2021) [7] | 0.65 | 0.60
Voting-Swin-T (2022) [8] | 0.66 | 0.62
Staged trained ConvNet | 0.64 | 0.60
VAE augmentation ConvNet | 0.64 | 0.60
AEDCN | 0.68 | 0.62
Table 6. Performance of AEDCN in the ablation experiment.
Model | F1 Micro | F1 Macro
AEDCN-mel-spectrogram | 0.64 | 0.61
AEDCN-single | 0.66 | 0.58
AEDCN-153 | 0.66 | 0.62
AEDCN_without_pre | 0.67 | 0.62
AEDCN | 0.68 | 0.62
Table 7. Performance of adversarial loss in AEDCN.
Model | F1 Micro | F1 Macro
AEDCN-likelihood | 0.66 | 0.62
AEDCN-inverseKL | 0.63 | 0.61
AEDCN | 0.68 | 0.62
Table 8. Performance of AEDCN on each instrument class.
Model | Cel | Cla | Flu | Gac | Gel | Org | Pia | Sax | Tru | Vio | Voi | F1
SVM | 0.25 | 0.16 | 0.21 | 0.46 | 0.35 | 0.27 | 0.38 | 0.22 | 0.20 | 0.20 | 0.31 | 0.27
MTF-DNN | 0.15 | 0.26 | 0.27 | 0.43 | 0.36 | 0.28 | 0.36 | 0.28 | 0.18 | 0.22 | 0.32 | 0.28
ConvNet | 0.45 | 0.18 | 0.43 | 0.62 | 0.59 | 0.45 | 0.61 | 0.61 | 0.44 | 0.44 | 0.81 | 0.51
Multi-task ConvNet | 0.54 | 0.29 | 0.43 | 0.71 | 0.70 | 0.43 | 0.67 | 0.66 | 0.52 | 0.53 | 0.85 | 0.58
WaveGAN ConvNet | 0.55 | 0.36 | 0.55 | 0.63 | 0.67 | 0.55 | 0.62 | 0.58 | 0.65 | 0.68 | 0.73 | 0.60
Voting-Swin-T | 0.61 | 0.49 | 0.66 | 0.59 | 0.66 | 0.55 | 0.63 | 0.56 | 0.65 | 0.65 | 0.78 | 0.62
AEDCN | 0.56 | 0.47 | 0.54 | 0.71 | 0.71 | 0.56 | 0.68 | 0.62 | 0.57 | 0.55 | 0.88 | 0.62
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
