Article

A Hybrid Parallel Computing Architecture Based on CNN and Transformer for Music Genre Classification

1 Institute of Marine Science and Technology, Shandong University, Qingdao 266237, China
2 Shandong Zhengzhong Information Technology Co., Ltd., Jinan 250098, China
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(16), 3313; https://doi.org/10.3390/electronics13163313
Submission received: 12 July 2024 / Revised: 16 August 2024 / Accepted: 20 August 2024 / Published: 21 August 2024
(This article belongs to the Special Issue Recent Advances of Cloud, Edge, and Parallel Computing)

Abstract
Music genre classification (MGC) is the basis for the efficient organization, retrieval, and recommendation of music resources, so it has important research value. Convolutional neural networks (CNNs) have been widely used in MGC and achieved excellent results. However, CNNs cannot model global features well due to the influence of the local receptive field; these global features are crucial for classifying music signals with temporal properties. Transformers can capture long-range dependencies within an image thanks to adopting the self-attention mechanism. Nevertheless, there are still performance and computational cost gaps between Transformers and existing CNNs. In this paper, we propose a hybrid architecture (CNN-TE) based on CNN and Transformer encoder for MGC. Specifically, we convert the audio signals into mel spectrograms and feed them into a hybrid model for training. Our model employs a CNN to initially capture low-level and localized features from the spectrogram. Subsequently, these features are processed by a Transformer encoder, which models them globally to extract high-level and abstract semantic information. This refined information is then classified using a multi-layer perceptron. Our experiments demonstrate that this approach surpasses many existing CNN architectures when tested on the GTZAN and FMA datasets. Notably, it achieves these results with fewer parameters and a faster inference speed.


1. Introduction

The online music market has experienced significant growth, and there is increasing interest in music information retrieval (MIR) and recommendation systems within the entertainment industry [1]. Accurate music genre classification (MGC) plays a crucial role in these systems by enabling effective categorization and recommendation of music tracks. Manual labeling of music genres, though possible, is both labor-intensive and time-consuming. Thus, developing efficient algorithms for MGC is of substantial importance.
Deep learning models have revolutionized the field by allowing computers to automatically learn pattern features for music classification [2,3]. These models have gradually replaced traditional machine learning methods that depend on hand-crafted features. For instance, because music is inherently temporal, long short-term memory (LSTM) networks are commonly employed to capture long-term dependencies in music data [4,5]. Typically, features like mel frequency cepstral coefficients (MFCCs) and spectral centroids are extracted from the raw audio data and then input into LSTM networks for classification. While LSTM-based methods achieve satisfactory accuracy, their sequential structure limits parallelization on GPUs, reducing training efficiency.
Convolutional neural networks (CNNs) [6] are highly effective in image processing tasks. This has led to the adoption of CNNs in MGC by converting audio data into spectrograms, such as Fourier spectrograms and mel spectrograms, allowing CNNs to learn features directly from these visual representations [7,8,9]. CNN-based methods are not only as accurate as LSTM-based methods but also more efficient due to their parallelizable nature. However, CNNs are limited by their local receptive fields, as they process small regions of the input data at a time. This characteristic prevents them from capturing global features of the spectrograms. Moreover, due to their weight-sharing mechanism, CNNs treat different image regions equally during feature extraction, potentially missing out on key discriminative features.
Transformers [10], known for their success in natural language processing (NLP), rely entirely on self-attention mechanisms, allowing them to capture global dependencies and parallelize computations effectively. Recent efforts have adapted Transformers for computer vision tasks with notable success [11,12,13]. However, the quadratic scaling of memory and computation with the input’s spatial dimensions poses challenges, particularly in terms of training and inference efficiency [14].
Given these considerations, we propose a hybrid architecture named CNN-TE, combining CNNs with Transformer encoders, to leverage the strengths of both approaches. Our architecture includes a convolutional module with four 2D convolutional layers and max pooling layers, followed by two vertically stacked Transformer encoder layers. The design of CNN-TE targets capturing both detailed local and broad global features, thus enhancing classification performance for MGC. When applied, audio tracks are transformed into mel spectrograms, which are then processed by the hybrid model. The convolutional layers extract fine-grained, low-level characteristics, while the Transformer captures overarching high-level semantic data. These features are ultimately classified by a multi-layer perceptron.
The experimental results show that the proposed CNN-TE outperforms many existing CNN-based architectures (such as MobileNet, Inception, and EfficientNet) on the GTZAN dataset and FMA dataset, achieving superior classification accuracy while maintaining a low parameter count and fast inference speed. Additionally, the incorporation of global feature extraction capabilities via the Transformer encoder enhances the overall robustness and effectiveness of the model in various MIR applications, including song recognition and music recommendation systems. This improvement highlights the broader applicability and relevance of our approach within the MIR domain.

2. Related Work

Early methods for MGC based on traditional machine learning typically relied on manually crafted features for music data classification [15,16]. However, creating hand-crafted features requires expert knowledge and experience, leading to high costs. Furthermore, most traditional machine learning models are shallow networks with limited representation capabilities, making them insufficient for handling the vast amount of contemporary music data.
In recent years, deep learning techniques have seen success across various domains. These models allow computers to automatically learn the pattern features necessary for accurate music classification, thereby addressing the issue of challenging feature selection. Based on different model categories, existing deep learning-based MGC methods can be classified into the following categories: (1) RNN-based methods, (2) CNN-based methods, (3) CRNN-based methods, (4) Transformer-based methods, and (5) hybrid methods [17].
The RNN-based MGC methods usually extract MFCC features from the audio, which are then fed as input to an RNN or its variants (such as LSTM [4] and IndRNN [18]) for training. RNN-based methods achieve good classification accuracy. However, due to their sequential structure, RNNs cannot be trained efficiently with GPU parallelization.
CNNs are also widely used in MGC due to their excellent automatic feature extraction ability [7,19,20]. CNN-based MGC approaches typically transform audio files into spectrograms (such as Fourier or mel spectrograms) and then use CNNs to extract relevant features from these spectrograms. Essentially, this treats MGC as an image classification task. In Ref. [7], the authors applied the short-time Fourier transform (STFT) to music signals to generate spectrograms. These spectrograms were then processed by CNNs for automatic feature extraction, with a feedforward neural network (FNN) employed for the classification task. Ref. [19] enhanced the SampleCNN framework by incorporating residual connections and squeeze-and-excitation (SE) modules, resulting in improved classification performance. Nonetheless, CNNs are limited by their local receptive fields, which restricts their ability to capture global features.
The CRNN [17] is a hybrid structure that combines CNNs and RNNs, taking advantage of CNNs for local feature extraction and RNNs for temporal summarization of the extracted features. Ref. [21] obtained feature maps containing high-level abstract information from the mel spectrogram through convolutional layers and expanded these feature maps in time to obtain a convolutional feature sequence. This sequence was then fed into an RNN, and a fully connected layer finally acted as the classifier to output the corresponding label. This method exhibits strong performance with fewer parameters and less training time. Ref. [22] introduced an enhanced technique known as CRNN-TF, designed to capture the spatial dependencies of music signals across the time and frequency dimensions; this approach excels in extracting and summarizing music features. Despite the advantages of the CRNN, which integrates CNN and RNN components, it still faces challenges related to global representation and efficient parallel training. In addition, Zhao et al. [23] introduced a self-supervised pre-training technique using the Swin Transformer, which leverages a momentum-based contrastive learning framework and has demonstrated superior performance in music genre classification and tagging. Similarly, Jena et al. [24] proposed a hybrid deep learning model that incorporates multimodal and transfer learning techniques for music genre classification; this model showed enhanced accuracy and efficiency on the GTZAN dataset.

3. Preliminary Knowledge

This section provides an overview of CNNs and Vision Transformer architectures, followed by a discussion on the benefits of mel spectrograms and the process of generating them.

3.1. CNN

CNNs are a type of feedforward neural network that excels in processing large-scale images. They are characterized by their artificial neurons, which respond to inputs within a localized area of their receptive field. CNNs are advantageous compared to other feedforward networks due to their ability to operate with fewer parameters, making them efficient for deep learning tasks. However, CNNs also have limitations. Firstly, their receptive fields are limited to local regions rather than capturing global contexts. Secondly, the weight-sharing mechanism in CNNs treats all regions of an image uniformly during feature extraction. This uniform treatment prevents CNNs from differentiating between critical and less important areas, which can hinder the extraction of high-quality discriminative features.

3.2. Vision Transformer

Vision Transformer (ViT) [11] is a deep learning architecture designed for image classification. In contrast to conventional CNNs, which utilize a sequence of convolutional and pooling operations, ViT employs self-attention mechanisms [10] and fully connected layers within a Transformer framework.
In the ViT method, an image is divided into non-overlapping patches, each represented as a vector. These vectors are then fed through several Transformer encoder layers, where self-attention mechanisms identify relationships between various image regions. Following the Transformer encoder layers, a linear layer and pooling layer are used to combine information from all patches into a final feature vector suitable for classification tasks.
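For illustration, this patch-based pipeline can be sketched in PyTorch as follows; the patch size, embedding dimension, depth, and head count below are arbitrary illustrative values, and mean pooling of the patch tokens stands in for the combination step described above.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style classifier: patch embedding -> Transformer encoder -> pooled classification."""
    def __init__(self, image_size=224, patch_size=16, in_ch=3, dim=192, depth=4, heads=4, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Non-overlapping patches are extracted and linearly projected by one strided convolution.
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                           # x: (batch, in_ch, H, W)
        tokens = self.patch_embed(x)                # (batch, dim, H/ps, W/ps)
        tokens = tokens.flatten(2).transpose(1, 2)  # (batch, num_patches, dim)
        tokens = self.encoder(tokens + self.pos_embed)
        return self.head(tokens.mean(dim=1))        # pool all patch tokens, then classify
```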
The ViT architecture is highly adaptable and well suited for training on extensive image datasets. Its use of self-attention mechanisms to capture contextual relationships between image regions often leads to better performance than traditional CNNs. However, the self-attention mechanism in ViT has limitations: its memory and computational requirements grow quadratically with the spatial dimensions of the input, which can introduce significant overheads during both training and inference [14]. Additionally, ViT’s performance can be sensitive to optimization parameters, such as the choice of optimizer (e.g., AdamW versus SGD) and the associated hyperparameters, leading to varying training outcomes [25].

3.3. Mel Spectrogram

The mel spectrum is a representation of an audio signal in the frequency domain that models the way the human auditory system perceives frequency. It offers two main advantages: (1) the mel scale closely matches human auditory perception of frequency, which makes it a more meaningful representation of the audio signal; and (2) the mel spectrum is typically computed by taking the logarithm of the magnitude, which reduces the dynamic range of the representation and makes it easier to work with. The mel spectrum is a widely used feature representation in audio processing due to its ability to capture the perceptual characteristics of sound, and it has been particularly effective in various audio classification tasks. In our work, we leverage mel spectrograms as input features for the hybrid CNN-TE architecture to perform MGC.
Mel spectrograms have not only been applied to MGC but also found applications in broader audio classification tasks. For instance, they have been used for music/speech classification in radio broadcasting systems, where distinguishing between music and speech content is crucial for automated content analysis and monitoring. Recent studies, such as [1,26,27], have demonstrated the effectiveness of mel spectrograms in this context. Furthermore, mel spectrograms have been employed in medical applications, particularly in the detection and classification of obstructive sleep apnea (OSA). Researchers have utilized these features to analyze audio recordings of breathing patterns during sleep, aiding in the diagnosis and treatment of sleep disorders. Notable works in this area include [28], which highlights the robustness of mel spectrograms in capturing relevant audio features for clinical applications.
To create a mel spectrogram, the audio signal is first segmented into overlapping frames using a window function such as Hann or Hamming. The choice of window function and the overlap percentage (typically 50%) are crucial for determining the time-frequency resolution of the spectrogram. The process includes the following steps. (1) Segmentation: The audio signal is split into overlapping frames. Each frame is transformed into the frequency domain through STFT. (2) Mel scale conversion: The STFT magnitudes are converted to a mel scale representation, which reflects the human auditory system’s perception of frequency. (3) Log compression: The mel scale magnitudes are logarithmically compressed to highlight spectral differences and reduce dynamic range. (4) Spectrogram formation: The logarithmic magnitudes are plotted against time to form the spectrogram, which visually represents the distribution of spectral energy over time. The final spectrogram image provides a visual summary of the audio signal’s spectral energy over time.
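As a concrete illustration of these steps, the snippet below computes a log-compressed mel spectrogram with the librosa library; the FFT size, hop length, and number of mel bands shown are assumed example values rather than the settings used in this paper, and librosa applies a Hann window to the overlapping frames by default.

```python
import librosa
import numpy as np

def audio_to_log_mel(path, sr=22050, n_fft=2048, hop_length=1024, n_mels=128):
    """Load an audio file and return its log-compressed mel spectrogram."""
    y, sr = librosa.load(path, sr=sr, mono=True)          # 1. load and resample the signal
    mel = librosa.feature.melspectrogram(                  # 2. windowed STFT + mel filter bank
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)         # 3. logarithmic compression
    return log_mel                                         # 4. shape: (n_mels, n_frames)
```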

4. Proposed Hybrid Architecture

As analyzed in the previous section, ViT can capture the global information of images, but it is computationally expensive and unstable during training. CNNs have few parameters, but they cannot extract global features and cannot distinguish the salient regions of an image from less informative ones. It remains an open question whether fusing these two networks can improve the performance of MGC.
In response to the above, we develop a novel hybrid deep method, CNN-TE, that integrates CNNs with Transformer encoders, effectively combining the benefits of both approaches. The CNN-TE is designed to capture global information from mel spectrograms while keeping the parameter count low and ensuring fast inference. The architecture of CNN-TE is composed of three key components: the convolutional module, the Transformer encoder, and the classification head. Figure 1 illustrates the overall architecture of the proposed CNN-TE model.
The convolutional module consists of four standard 2D convolutional layers, each followed by a batch normalization (BN) operation, a rectified linear unit (ReLU) activation, and a max pooling layer. The four convolutional layers have 32, 64, 128, and 256 output channels, respectively.
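A minimal PyTorch sketch of this module is given below; the 3 × 3 kernels and 2 × 2 pooling windows are assumptions, since the text specifies only the channel widths and the Conv–BN–ReLU–max-pooling ordering.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # One block of the convolutional module: Conv2d -> BatchNorm -> ReLU -> MaxPool.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2),
    )

# Four blocks with 32, 64, 128, and 256 output channels; the input is a one-channel mel spectrogram.
conv_module = nn.Sequential(
    conv_block(1, 32),
    conv_block(32, 64),
    conv_block(64, 128),
    conv_block(128, 256),
)
```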
The Transformer encoder in our model comprises two layers, with the detailed architecture of each layer depicted in Figure 2. Each layer of the Transformer encoder includes multi-head attention, a multi-layer perceptron (MLP), and two normalization layers. To address the issue of gradient vanishing, residual connections are incorporated within each Transformer encoder layer.
The multi-head attention mechanism allows the model to focus on different parts of the sequence, thus effectively improving feature learning. Figure 3 illustrates the multi-head attention process. In this mechanism, queries, keys, and values are each transformed through h independently learned linear projections. These h sets of projected queries, keys, and values are then processed in parallel using scaled dot-product attention [10]. The resulting h attention outputs are concatenated and passed through another learned linear transformation to produce the final output. The multi-head attention process can be summarized as follows.
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O, \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$
Above, $W^O$, $W_i^Q$, $W_i^K$, and $W_i^V$ are all learnable parameter matrices. The computational process of the attention mechanism is represented as follows.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
Here, $Q$, $K$, and $V$ have dimension $d_k$, and $\sqrt{d_k}$ is the scaling factor.
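The two formulas above translate directly into code. The sketch below is a plain PyTorch rendering for illustration: a single combined projection is split into h heads, which is the standard equivalent of applying h separate projections, and the parameter matrices are passed in explicitly for clarity rather than reflecting the paper's implementation.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ v

def multi_head_attention(q, k, v, w_q, w_k, w_v, w_o, h):
    # Project Q, K, V, split the projections into h heads, attend in parallel,
    # concatenate the head outputs, and apply the final projection W^O.
    batch, seq, d_model = q.shape
    d_k = d_model // h
    def split(x, w):                                              # (batch, seq, d_model) -> (batch, h, seq, d_k)
        return (x @ w).view(batch, seq, h, d_k).transpose(1, 2)
    heads = scaled_dot_product_attention(split(q, w_q), split(k, w_k), split(v, w_v))
    concat = heads.transpose(1, 2).reshape(batch, seq, d_model)   # concatenate the h heads
    return concat @ w_o
```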
Because the Transformer encoder cannot distinguish between different positions, the additional position encoding is crucial. We use sine and cosine functions of different frequencies for position encoding [10]. Specifically, it is expressed as follows.
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{feature}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{feature}}}}\right)$$
where $pos$ is the position and $i$ is the dimension. The $d_{\mathrm{feature}}$ is set to 256, which is the number of channels of the output features of the convolutional module.
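Putting the pieces together, the sketch below implements the sinusoidal positional encoding and assembles a CNN-TE-style model with a four-block convolutional front end, a two-layer Transformer encoder with 16 heads and a feedforward width of 512, and d_feature = 256, following the text and figure captions. The kernel sizes, the mean pooling of tokens, and the single-layer classification head are simplifying assumptions, so this is an illustrative approximation of the architecture rather than the exact implementation.

```python
import torch
import torch.nn as nn

def sinusoidal_position_encoding(seq_len, d_feature=256):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_feature)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_feature))
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)    # (seq_len, 1)
    even_i = torch.arange(0, d_feature, 2, dtype=torch.float32)      # even dimension indices 2i
    angles = pos / torch.pow(10000.0, even_i / d_feature)            # (seq_len, d_feature / 2)
    pe = torch.zeros(seq_len, d_feature)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

class CNNTE(nn.Module):
    """Illustrative CNN-TE pipeline: conv module -> flatten -> + positional encoding -> encoder -> head."""
    def __init__(self, num_classes=10, d_feature=256):
        super().__init__()
        def block(cin, cout):                      # Conv -> BN -> ReLU -> MaxPool
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout),
                                 nn.ReLU(inplace=True), nn.MaxPool2d(2))
        self.conv = nn.Sequential(block(1, 32), block(32, 64), block(64, 128), block(128, 256))
        layer = nn.TransformerEncoderLayer(d_model=d_feature, nhead=16,
                                           dim_feedforward=512, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_feature, num_classes)

    def forward(self, x):                          # x: (batch, 1, 128, 128) mel spectrogram slice
        f = self.conv(x)                           # (batch, 256, 8, 8)
        tokens = f.flatten(2).transpose(1, 2)      # (batch, 64, 256): one token per spatial position
        tokens = tokens + sinusoidal_position_encoding(tokens.size(1)).to(tokens.device)
        tokens = self.encoder(tokens)
        return self.head(tokens.mean(dim=1))       # class logits
```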

5. Experiments and Analyses on GTZAN Dataset

In this section, experiments are conducted on the public GTZAN dataset to evaluate the classification performance of the proposed CNN-TE model, and several mainstream models are evaluated for comparison.

5.1. GTZAN Dataset Description

The GTZAN dataset [29] is widely used for MGC research in the field of machine hearing. The dataset contains 1000 audio tracks, each 30 s long, covering 10 genres with 100 tracks per genre. The audio tracks are 22,050 Hz mono 16-bit .wav files. To expand the dataset, we divide each mel spectrogram into ten 128 × 128 pixel segments; these slices serve as the actual experimental samples. We do not adopt a data augmentation strategy in any of the experiments. For music signals of different genres, the corresponding spectrogram slices are shown in Figure 4.
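A possible implementation of this slicing step is sketched below; it assumes a 128-band log-mel spectrogram and cuts ten non-overlapping 128-frame windows, which may differ in detail from the segmentation actually used.

```python
import numpy as np

def slice_spectrogram(log_mel, n_slices=10, slice_width=128):
    """Cut a (128, n_frames) log-mel spectrogram into fixed-size 128x128 slices."""
    slices = []
    for k in range(n_slices):
        start = k * slice_width
        if start + slice_width > log_mel.shape[1]:
            break                                   # drop an incomplete trailing slice
        slices.append(log_mel[:, start:start + slice_width])
    return np.stack(slices)                         # (n_slices, 128, 128)
```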

5.2. Evaluation Metrics

We adopted several well-known evaluation metrics to assess the performance of the classification models, which include overall accuracy (OA), precision, recall, F1 score, and the confusion matrix. Additionally, we provide details on the number of model parameters and the inference time per sample. Overall accuracy provides a clear indication of the model’s effectiveness across the entire dataset. The formula for calculating OA is provided below.
OA = T / ( T + F )
where T denotes the count of correctly classified samples, while F indicates the number of misclassifications.
The metrics of precision, recall, and F1 scores are computed using the following formulas.
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2PR}{P + R}$$
In this context, $TP$ stands for true positive, $TN$ for true negative, $FP$ for false positive, and $FN$ for false negative, while $P$ represents precision and $R$ denotes recall.
A confusion matrix is an evaluation tool that compares the predicted and actual values of a classification model on a test dataset. In this matrix, the actual categories are represented on the vertical axis, and the predicted categories are on the horizontal axis. Each cell in the matrix indicates the proportion of predicted instances relative to the actual instances.
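For reference, these metrics can be computed with scikit-learn as in the sketch below; macro averaging over the ten genres is an assumption, since the averaging scheme is not stated in the text.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

def evaluate(y_true, y_pred):
    # Overall accuracy OA = T / (T + F), plus macro-averaged precision, recall, and F1.
    return {
        "OA": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }
```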

5.3. Experimental Settings

The dataset is evaluated using a 10-fold cross-validation approach. In this setup, the dataset is randomly divided into 10 equal subsets. For each fold, one subset is used as the test set, while the remaining nine subsets are used for training. The final performance metrics are obtained by averaging the results from all 10 folds. The deep learning framework PyTorch 11.8 is used to construct the proposed deep model. The experiments use the cross-entropy loss function with label smoothing, in which the label smoothing coefficient is 0.1. We adopt the AdamW optimizer. The learning rate is 0.003; the number of training epochs is 100; the mini-batch size is 64, and training is accelerated on a GPU. We adopt an early stopping technique to prevent overfitting: a patience parameter defines the number of epochs to wait for an improvement in validation loss before stopping the training, and we use a patience of 10 epochs. All models are trained from scratch. All experiments are conducted on a desktop with an Intel Xeon Silver 4210R CPU and an NVIDIA GeForce RTX 2080 Ti GPU.
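The following sketch mirrors the stated settings (cross-entropy with a label smoothing coefficient of 0.1, the AdamW optimizer with a learning rate of 0.003, up to 100 epochs, and early stopping with a patience of 10 epochs); the model and the data loaders (with a mini-batch size of 64) are placeholders supplied by the caller, so this is an illustrative outline rather than the exact training script.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, device="cuda", epochs=100, patience=10):
    model.to(device)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)        # label smoothing coefficient 0.1
    optimizer = torch.optim.AdamW(model.parameters(), lr=0.003)
    best_val, wait = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:                               # mini-batch size 64 set in the DataLoader
            optimizer.zero_grad()
            loss = criterion(model(x.to(device)), y.to(device))
            loss.backward()
            optimizer.step()
        # Early stopping on the validation loss of the held-out fold.
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x.to(device)), y.to(device)).item()
                           for x, y in val_loader) / len(val_loader)
        if val_loss < best_val:
            best_val, wait = val_loss, 0
        else:
            wait += 1
            if wait >= patience:
                break
```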

5.4. Comparison of Results and Analysis on GTZAN Dataset

In this section, we performed comprehensive experiments on the GTZAN dataset using various model types. The results are summarized in Table 1. As shown in Table 1, normal CNN models generally outperform lightweight CNNs. For example, the OA, precision, recall, and F1 of DenseNet121 are higher than those of MobileNet V2 by 5.94, 5.41, 5.90, and 5.76 percentage points, respectively. This is because normal CNNs have more parameters and larger capacities than lightweight CNNs, giving them better fitting capabilities. The only advantage of lightweight CNNs is that their inference speed is higher than that of normal CNNs.
In the lightweight CNN series, the OA, precision, recall, and F1 of "MobileNet V3 large" [30], despite its larger parameter count, are lower than those of MobileNet V2. This is because "MobileNet V3 large" was obtained via neural architecture search (NAS) [31], which aggressively optimizes an architecture for a specific task, so the resulting network does not necessarily perform well on other tasks. Among the standard CNN models, ResNet18 outperforms ResNet34 in classification accuracy. This could be due to ResNet34's greater depth, which may lead to overfitting on smaller datasets.
Pure Transformer-based models lack the inductive biases of CNNs (e.g., translation equivariance and locality), so ViT and "MobileViT small" [32] show the lowest classification performance on the small-scale GTZAN dataset. Meanwhile, the Swin Transformer-based method [23] obtains a precision of 81.1% owing to its enhancements over the traditional Transformer structure. We also compared against the hybrid method of Jena et al. [24], which combines wavelet and spectrogram analysis with deep learning; its performance is lower than that of the purely deep learning-based methods. Compared with the other models, the proposed method achieves excellent classification results, with OA, precision, recall, and F1 of 87.41%, 87.93%, 87.58%, and 87.28%, respectively. The proposed hybrid model also has a very fast inference speed and fewer parameters thanks to the convolutional module, which pre-compresses the input to the Transformer encoder.
Figure 5 shows the confusion matrix obtained by the proposed CNN-TE method on the GTZAN dataset. High classification accuracies (≥87%) are obtained for 8 out of 10 classes. A few classes remain difficult to distinguish; for example, rock and country music are easily confused because rock music developed out of country music and the two genres share some similarities.

6. Experiments and Analyses on Free Music Archive Dataset

In this section, we test the performance of the proposed method on the Free Music Archive (FMA) dataset and compare it with other state-of-the-art classification methods.

6.1. FMA Dataset Description

The FMA dataset [33] is a comprehensive collection of music files gathered from the Free Music Archive platform. It is designed for music analysis tasks and provides a wide range of metadata and genre annotations. The dataset is available in different sizes, from the full dataset with over 100,000 tracks to smaller subsets that are easier to manage and analyze. In our experiments, we used the FMA-small dataset, which contains 8000 music clips equally divided among eight genres. Each clip is about 30 s long and recorded at a sampling rate of 44.1 kHz. This ensures high-quality audio, making it suitable for various audio analysis and machine learning tasks.
The FMA dataset’s diverse genre representation, coupled with the k-fold cross-validation method, provides a strong foundation for evaluating the effectiveness of our proposed CNN-Transformer hybrid architecture. This setup not only demonstrates the model’s capability to handle various music genres but also underscores its potential applicability to broader music analysis tasks.
To ensure a rigorous evaluation of our model, we employed a k-fold cross-validation approach on the FMA-small dataset. Specifically, we utilized a 10-fold cross-validation technique. This method involves partitioning the dataset into 10 equal parts, where in each fold, one part serves as the validation set while the remaining nine parts are used for training. This process is repeated 10 times, with each part serving as the validation set exactly once. The final performance metrics are computed as the average of the results obtained from all folds. Other relevant experimental settings were kept consistent with the GTZAN dataset experiments.
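This protocol can be expressed with scikit-learn's KFold as sketched below; the train_and_score callable is a hypothetical placeholder for the actual training and evaluation routine.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(samples, labels, train_and_score, n_splits=10, seed=0):
    """Run k-fold cross-validation and return the mean score across folds."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in kf.split(samples):
        # train_and_score: fit on the training fold, evaluate on the held-out fold.
        scores.append(train_and_score(samples[train_idx], labels[train_idx],
                                      samples[val_idx], labels[val_idx]))
    return float(np.mean(scores))
```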

6.2. Comparison of Results and Analysis on FMA Dataset

In this section, we present the results of our experiments on the FMA-small dataset to evaluate the effectiveness of our proposed method. We also compare our results with recent state-of-the-art approaches using the same dataset. The results are displayed in Table 2. The comparative analysis demonstrates that our hybrid architecture either matches or surpasses the performance of existing models, particularly in managing complex genres while maintaining low computational costs. Compared to advanced models based solely on CNN or Transformer architectures, such as EfficientNet V2 B0, DenseNet121, and MobileViT, the proposed CNN-TE method excels in several metrics. In summary, the experimental results on the FMA dataset validate the robustness and efficiency of our CNN-Transformer hybrid model. The consistent performance across different datasets, including GTZAN and FMA, highlights its potential for wider applications in music genre classification and other audio analysis tasks.

6.3. The Ablation Study and Analysis on CNN-TE

To further evaluate the effectiveness of the proposed CNN-TE, we conducted ablation studies on both the GTZAN and FMA datasets. ‘Without CNN module’ denotes that the CNN module is removed from CNN-TE; ‘Without Transformer encoder’ denotes that the Transformer encoder is removed. The ablation results are shown in Table 3. When either the CNN module or the Transformer encoder is removed, the performance decreases substantially, and the drop is larger when the CNN module is removed, which shows that the CNN module has the greater impact on the model. The complete CNN-TE structure obtains the best classification performance.

7. Conclusions

In this paper, we propose a lightweight and efficient hybrid architecture, CNN-TE, which fuses a CNN with a Transformer encoder. An MGC method is then built on the CNN-TE architecture and mel spectrograms. The proposed method can extract global features from mel spectrograms instead of only local features, which is very important for accurately recognizing music with a temporal nature. In practice, compact features are first obtained through the convolutional module; these features are then fed into the Transformer encoder to capture long-range dependencies. This design gives the proposed method excellent classification performance on the MGC task while maintaining a low parameter count and fast inference speed. Extensive experiments show that the CNN-TE-based MGC method achieves better results on the GTZAN and FMA datasets in the OA, precision, recall, and F1 metrics than mainstream networks such as MobileNet, Inception, and EfficientNet.

Author Contributions

Conceptualization and methodology, J.C.; validation, Z.Z. and X.M. (Xiaojing Ma); formal analysis, X.M. (Xiaohong Ma), S.L. and S.M.; investigation, Z.Z.; writing—original draft preparation, J.C.; writing—review and editing, J.C. and X.M. (Xiaojing Ma); funding acquisition, Z.Z. and S.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded partly by the special funds for central guiding local science and technology development: Industrialisation of internet of things terminal safety inspection platform YDZX2022078, partly by Jinan science and technology programme project: demonstration application of high performance big data security storage system 202221012, partly supported by Shandong Provincial Natural Science Foundation ZR2023QF067.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

Authors Jiyang Chen, Xiaohong Ma and Shikuan Li are employed by the company Shandong Zhengzhong Information Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Cheng, Y.H.; Chang, P.C.; Kuo, C.N. Convolutional Neural Networks Approach for Music Genre Classification. In Proceedings of the 2020 International Symposium on Computer, Consumer and Control (IS3C), Taichung City, Taiwan, 13–16 November 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 399–403. [Google Scholar]
  2. Liu, J.; Wang, C.; Zha, L. A middle-level learning feature interaction method with deep learning for multi-feature music genre classification. Electronics 2021, 10, 2206. [Google Scholar] [CrossRef]
  3. Wen, Z.; Chen, A.; Zhou, G.; Yi, J.; Peng, W. Parallel attention of representation global time–frequency correlation for music genre classification. Multimed. Tools Appl. 2024, 83, 10211–10231. [Google Scholar]
  4. Deepak, S.; Prasad, B. Music Classification based on Genre using LSTM. In Proceedings of the 2020 Second International Conference on Inventive Research in Computing Applications (ICIRCA), Coimbatore, India, 15–17 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 985–991. [Google Scholar]
  5. Zheng, Z. The Classification of Music and Art Genres under the Visual Threshold of Deep Learning. Comput. Intell. Neurosci. 2022, 2022, 4439738. [Google Scholar] [PubMed]
  6. Narkhede, N.; Mathur, S.; Bhaskar, A.; Kalla, M. Music genre classification and recognition using convolutional neural network. Multimed. Tools Appl. 2024, 1–16. [Google Scholar] [CrossRef]
  7. Pelchat, N.; Gelowitz, C.M. Neural network music genre classification. Can. J. Electr. Comput. Eng. 2020, 43, 170–173. [Google Scholar]
  8. Cheng, Y.H.; Kuo, C.N. Machine Learning for Music Genre Classification Using Visual Mel Spectrum. Mathematics 2022, 10, 4427. [Google Scholar] [CrossRef]
  9. Prabhakar, S.K.; Lee, S.W. Holistic Approaches to Music Genre Classification using Efficient Transfer and Deep Learning Techniques. Expert Syst. Appl. 2023, 211, 118636. [Google Scholar]
  10. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  11. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  12. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [Google Scholar]
  13. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12009–12019. [Google Scholar]
  14. Srinivas, A.; Lin, T.Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 16519–16529. [Google Scholar]
  15. Fu, Z.; Lu, G.; Ting, K.M.; Zhang, D. A survey of audio-based music classification and annotation. IEEE Trans. Multimed. 2010, 13, 303–319. [Google Scholar]
  16. Rosner, A.; Kostek, B. Automatic music genre classification based on musical instrument track separation. J. Intell. Inf. Syst. 2018, 50, 363–384. [Google Scholar]
  17. Shi, B.; Bai, X.; Yao, C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 2298–2304. [Google Scholar]
  18. Wu, W.; Han, F.; Song, G.; Wang, Z. Music genre classification using independent recurrent neural network. In Proceedings of the 2018 Chinese Automation Congress (CAC), Calgary, AB, Canada, 15–20 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 192–195. [Google Scholar]
  19. Kim, T.; Lee, J.; Nam, J. Comparison and analysis of samplecnn architectures for audio classification. IEEE J. Sel. Top. Signal Process. 2019, 13, 285–297. [Google Scholar]
  20. Hongdan, W.; SalmiJamali, S.; Zhengping, C.; Qiaojuan, S.; Le, R. An intelligent music genre analysis using feature extraction and classification using deep learning techniques. Comput. Electr. Eng. 2022, 100, 107978. [Google Scholar]
  21. Choi, K.; Fazekas, G.; Sandler, M.; Cho, K. Convolutional recurrent neural networks for music classification. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2392–2396. [Google Scholar]
  22. Wang, Z.; Muknahallipatna, S.; Fan, M.; Okray, A.; Lan, C. Music classification using an improved crnn with multi-directional spatial dependencies in both time and frequency dimensions. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–8. [Google Scholar]
  23. Zhao, H.; Zhang, C.; Zhu, B.; Ma, Z.; Zhang, K. S3t: Self-supervised pre-training with swin transformer for music classification. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 606–610. [Google Scholar]
  24. Jena, K.K.; Bhoi, S.K.; Mohapatra, S.; Bakshi, S. A hybrid deep learning approach for classification of music genres using wavelet and spectrogram analysis. Neural Comput. Appl. 2023, 35, 11223–11248. [Google Scholar]
  25. Xiao, T.; Singh, M.; Mintun, E.; Darrell, T.; Dollár, P.; Girshick, R. Early convolutions help transformers see better. arXiv 2021, arXiv:2106.14881. [Google Scholar]
  26. Zaman, K.; Sah, M.; Direkoglu, C.; Unoki, M. A survey of audio classification using deep learning. IEEE Access 2023, 11, 106620–106649. [Google Scholar]
  27. Gupta, C.; Li, H.; Goto, M. Deep learning approaches in topics of singing information processing. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 2422–2451. [Google Scholar]
  28. Serrano, S.; Patanè, L.; Serghini, O.; Scarpa, M. Detection and Classification of Obstructive Sleep Apnea Using Audio Spectrogram Analysis. Electronics 2024, 13, 2567. [Google Scholar] [CrossRef]
  29. Tzanetakis, G.; Cook, P. Musical Genre Classification of Audio Signals. IEEE Trans. Speech Audio Process. 2002, 10, 293–302. [Google Scholar]
  30. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  31. Zoph, B.; Le, Q.V. Neural architecture search with reinforcement learning. arXiv 2016, arXiv:1611.01578. [Google Scholar]
  32. Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
  33. Defferrard, M.; Benzi, K.; Vandergheynst, P.; Bresson, X. FMA: A dataset for music analysis. arXiv 2016, arXiv:1612.01840. [Google Scholar]
Figure 1. The overall architecture of the proposed CNN-TE model. Max pooling is used to downsize the convolutional feature map to reduce the computational cost of the Transformer encoder. To accommodate the input style of the Transformer encoder, we flatten the output feature maps of the convolutional module.
Figure 2. Transformer encoder layer. Normalization refers to layer normalization. The multi-layer perceptron (MLP) consists of two fully connected layers with two sets of 512 neurons. ⊕ represents the element-wise sum.
Figure 3. Multi-head attention mechanism. In our experiments, we set the number of multi-heads h to 16. Without loss of generality, only three heads are given here.
Figure 4. Sample mel spectrogram slices from the GTZAN dataset: (a) country; (b) rock; (c) blues; (d) disco; (e) hiphop; (f) jazz; (g) pop; (h) metal; (i) reggae; and (j) classical.
Figure 5. Confusion matrix of proposed CNN-TE method for the GTZAN dataset.
Table 1. Classification results of different models on the GTZAN dataset.

Model | OA (%) | Precision (%) | Recall (%) | F1 (%) | Parameters (M) | Time (ms)
MobileNet V2 † | 80.37 | 81.47 | 80.50 | 80.16 | 3.52 | 3.26
MobileNet V3 small † | 79.88 | 79.68 | 80.00 | 79.43 | 2.55 | 3.24
MobileNet V3 large † | 78.91 | 78.64 | 78.70 | 79.70 | 5.49 | 3.76
GhostNet 100 † | 78.12 | 78.02 | 77.90 | 77.26 | 3.91 | 3.13
EfficientNet B0 † | 78.42 | 77.94 | 78.20 | 77.34 | 4.02 | 3.39
EfficientNet V2 B0 † | 80.18 | 79.80 | 80.00 | 79.36 | 5.87 | 3.42
DenseNet121 ‡ | 86.31 | 86.88 | 86.40 | 85.92 | 6.96 | 4.88
ResNet18 ‡ | 85.51 | 85.69 | 85.40 | 85.14 | 11.18 | 3.27
ResNet34 ‡ | 84.55 | 84.71 | 84.30 | 83.80 | 21.29 | 3.48
Inception V3 ‡ | 83.30 | 83.69 | 82.90 | 82.45 | 21.81 | 3.99
Inception V4 ‡ | 82.03 | 82.48 | 81.90 | 81.73 | 41.16 | 5.59
Xception ‡ | 84.42 | 84.50 | 84.40 | 84.28 | 20.83 | 6.80
ViT ⋄ | 71.03 | 71.62 | 71.10 | 70.46 | 1.30 | 2.33
MobileViT small ⋄ | 73.02 | 74.04 | 73.00 | 72.97 | 5.58 | 4.76
Swin Transformer ⋄ | 80.06 | 81.1 | 80.94 | 80.55 | 8.17 | 4.88
Jena et al. [24] □ | 80.4 | 81 | 80.1 | 80.2 | 2.37 | 3.96
CNN-TE (ours) △ | 87.41 | 87.93 | 87.58 | 87.28 | 1.46 | 3.02

† Lightweight CNN-based methods; ‡ normal CNN-based methods; ⋄ Visual Transformer-based methods; □ hybrid methods; △ proposed method. Bold indicates the best result in the indicator.
Table 2. Classification results of different models on the FMA dataset.

Model | OA (%) | Precision (%) | Recall (%) | F1 (%)
MobileNet V2 † | 81.04 | 80.96 | 81.23 | 81.09
MobileNet V3 small † | 81.29 | 81.93 | 81.32 | 81.62
MobileNet V3 large † | 82.24 | 83.61 | 82.64 | 83.12
GhostNet 100 † | 80.3 | 80.29 | 80.15 | 80.22
EfficientNet B0 † | 80.79 | 80.64 | 80.43 | 80.54
EfficientNet V2 B0 † | 82.06 | 82.03 | 81.98 | 82
DenseNet121 ‡ | 88.23 | 88.61 | 88.17 | 88.39
ResNet18 ‡ | 86.87 | 87.08 | 86.45 | 86.76
ResNet34 ‡ | 87.67 | 87.94 | 87.54 | 87.74
Inception V3 ‡ | 84.36 | 84.79 | 84.19 | 84.49
Inception V4 ‡ | 85.42 | 85.74 | 85.23 | 85.48
Xception ‡ | 86.65 | 86.74 | 86.68 | 86.71
ViT ⋄ | 73.23 | 73.26 | 73.38 | 73.32
MobileViT small ⋄ | 75.11 | 75.03 | 74.85 | 74.94
CNN-TE (ours) △ | 89.09 | 89.71 | 89.18 | 89.45

† Lightweight CNN-based methods; ‡ normal CNN-based methods; ⋄ Visual Transformer-based methods; △ proposed method. Bold indicates the best result in the indicator.
Table 3. The ablation experiment results.

Method | GTZAN OA (%) | GTZAN Precision (%) | GTZAN Recall (%) | GTZAN F1 (%) | FMA-small OA (%) | FMA-small Precision (%) | FMA-small Recall (%) | FMA-small F1 (%)
Without CNN module | 78.21 | 79.35 | 78.06 | 79.11 | 81.07 | 81.65 | 80.9 | 81.03
Without Transformer encoder | 82.33 | 83.04 | 81.96 | 82.71 | 84.55 | 84.92 | 84.03 | 83.97
CNN-TE | 87.41 | 87.93 | 87.58 | 87.28 | 89.09 | 89.71 | 89.18 | 89.45

Bold indicates the best result in the indicator.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

