#### *4.2. Experimental Environment*

The ablation experiment includes *w/o cross-attn.* (BERT + separate embedding), which uses separate embeddings but replaces semi-cross attention with the original self-attention, and *w/o separate embed.* (BERT), in which the melody and rhythm share a common embedding layer and only self-attention is used (*w/o* means "without"). Furthermore, results for RNNs (and BiRNNs) without any pre-training are also listed to show the effect of pre-training. HITS@k [21] (k = 1, 3, 5, and 10), which measures the proportion of samples for which the correct answer appears among the top k candidates, was used as the evaluation metric. HITS@k is calculated as shown in Formula (2), where n denotes the number of samples and $\mathbb{I}(\cdot)$ is an indicator function that returns 1 if the rank of the correct answer is less than or equal to k, and 0 otherwise.

$$\mathrm{HITS@}k = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(\mathrm{rank}_i \le k) \tag{2}$$
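
For reference, the HITS@k computation can be sketched as follows; the function and variable names are illustrative and not taken from the authors' code.

```python
import numpy as np

def hits_at_k(scores: np.ndarray, targets: np.ndarray, k: int) -> float:
    """HITS@k: proportion of samples whose correct answer is ranked within the top k.

    scores:  (n_samples, n_candidates) predicted scores or probabilities.
    targets: (n_samples,) index of the correct candidate for each sample.
    """
    target_scores = scores[np.arange(len(targets)), targets]
    # Rank of the correct answer = 1 + number of candidates scored strictly higher.
    ranks = 1 + (scores > target_scores[:, None]).sum(axis=1)
    return float((ranks <= k).mean())

# Averaging over k = 1, 3, 5, and 10, as reported in Tables 3, 5, and 6:
# avg_hits = np.mean([hits_at_k(scores, targets, k) for k in (1, 3, 5, 10)])
```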

Table 1 presents the hyperparameters of the MRBERT (and the ablation models) in pre-training and fine-tuning. During pre-training, most hyperparameters were set to the same values as in RoBERTa-base [27], with slight differences in the *Number of Layers*, *Learning Rate Decay*, *Batch Size*, *Max Steps*, and *Warmup Steps*. The *Number of Layers* in the MRBERT was set to *6*×*2* because the model has two sets of transformer blocks, one for the melody and one for the rhythm, while keeping the number of parameters on the same level as in the ablation experiments. For the *Learning Rate Decay*, *power* (polynomial) decay was used rather than linear decay, which makes the change in the learning rate smoother and more conducive to convergence. The *Batch Size*, *Max Steps*, and *Warmup Steps* were adjusted according to the music corpus used.
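
To illustrate the difference between power and linear decay, a minimal learning-rate schedule sketch is given below; the base learning rate, step counts, and exponent are placeholder values, not the settings from Table 1.

```python
def learning_rate(step: int, base_lr: float, warmup_steps: int,
                  max_steps: int, power: float = 2.0) -> float:
    """Linear warmup followed by polynomial ("power") decay; power=1.0 reduces to linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * (1.0 - progress) ** power  # flattens out near the end, unlike linear decay

# Placeholder values for illustration only:
lrs = [learning_rate(s, base_lr=1e-4, warmup_steps=1000, max_steps=100000)
       for s in range(0, 100001, 10000)]
```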


**Table 1.** Hyperparameters for pre-training and fine-tuning of MRBERT (with ablation model).

<sup>1</sup> Hyperparameters for pre-training. <sup>2</sup> Hyperparameters for fine-tuning. <sup>3</sup> 6 transformer layers for the melody and 6 for the rhythm. <sup>4</sup> 4 is the number of special tokens: <BOS>, <EOS>, <UNK>, <PAD>.

In fine-tuning, the *Melody Vocab Size*, *Rhythm Vocab Size*, and *Chord Vocab Size* determine the dimension of the probability distribution produced by the output layer. The melody and rhythm have 72 and 21 candidates, respectively, each set containing the four special tokens (<BOS>, <EOS>, <UNK>, <PAD>). In the *w/o separate embed.* ablation, since the melody and rhythm share one embedding layer, the number of candidates is 89 (72 + 21, with the four special tokens counted only once). Furthermore, the number of chord candidates reaches 799.
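
As a minimal sketch, assuming a PyTorch implementation and a RoBERTa-base hidden size of 768 (not stated explicitly here), the vocabulary sizes simply set the width of the task-specific output layers:

```python
import torch
import torch.nn as nn

MELODY_VOCAB = 72   # pitch candidates, incl. <BOS>, <EOS>, <UNK>, <PAD>
RHYTHM_VOCAB = 21   # duration candidates, incl. the same 4 special tokens
SHARED_VOCAB = 89   # w/o separate embed.: 72 + 21 - 4 (special tokens counted once)
CHORD_VOCAB  = 799  # chord candidates used by the Seq2Seq output layer

hidden_size = 768                                    # assumed encoder width (RoBERTa-base)
melody_head = nn.Linear(hidden_size, MELODY_VOCAB)   # per-step melody distribution
rhythm_head = nn.Linear(hidden_size, RHYTHM_VOCAB)   # per-step rhythm distribution
chord_head  = nn.Linear(hidden_size, CHORD_VOCAB)    # per-step chord distribution

h = torch.randn(1, 16, hidden_size)                  # dummy hidden states (batch, time, dim)
melody_probs = melody_head(h).softmax(dim=-1)        # shape (1, 16, 72)
```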

#### *4.3. Results of Autoregressive Generation*

When evaluating autoregressive generation, the pre-trained MRBERT with the output layer of the autoregressive generation task predicts the next melody and rhythm tokens at each time step from the previously generated ones. Figure 7 displays the generated melody and rhythm.

**Figure 7.** Leadsheets of the generated melody sequence.
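
A minimal sketch of the decoding loop is shown below; the greedy (argmax) sampling and the `model(melody_ids, rhythm_ids)` interface are assumptions made for illustration, not the authors' exact implementation.

```python
import torch

@torch.no_grad()
def generate(model, melody_bos: int, rhythm_bos: int, steps: int = 32):
    """Greedy autoregressive generation: predict the next (pitch, duration) pair
    at each time step from all previously generated tokens."""
    melody, rhythm = [melody_bos], [rhythm_bos]
    for _ in range(steps):
        mel_logits, rhy_logits = model(torch.tensor([melody]), torch.tensor([rhythm]))
        melody.append(int(mel_logits[0, -1].argmax()))   # next pitch id
        rhythm.append(int(rhy_logits[0, -1].argmax()))   # next duration id
    return melody[1:], rhythm[1:]
```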

Table 2 presents the generated melody and rhythm, along with the probabilities of the predictions at each time step. The top rhythm prediction takes a much larger share of the probability mass, whereas the probabilities of the melody candidates are more evenly distributed; that is, the model is more confident in the rhythm prediction. This result is consistent with the analysis of the music data: music typically has obvious rhythm patterns, whereas the progression of the melody is complex and changeable.


**Table 2.** Details of autoregressive generation.

Table 3 presents the ablation results in terms of HITS@k for the autoregressive generation task. For the melody prediction, averaged over HITS@k (k = 1, 3, 5, and 10), the MRBERT achieved 51.70%, which is 2.77% higher than *w/o cross-attn.*, 3.65% higher than *w/o separate embed.*, and 7.94% higher than the RNN. For the rhythm prediction, it achieved an average of 81.79%, which is 0.37% higher than *w/o cross-attn.*, 0.78% higher than *w/o separate embed.*, and 2.56% higher than the RNN.

**Table 3.** Ablation experimental results of the autoregressive generation task.

| Model | HITS@1 Mel. (%) | HITS@1 Rhy. (%) | HITS@3 Mel. (%) | HITS@3 Rhy. (%) | HITS@5 Mel. (%) | HITS@5 Rhy. (%) | HITS@10 Mel. (%) | HITS@10 Rhy. (%) |
|---|---|---|---|---|---|---|---|---|
| MRBERT | 15.87 | 51.53 | 42.03 | 83.01 | 61.53 | 92.81 | 87.36 | 99.81 |
| w/o cross-attn. | 14.74 | 51.44 | 38.96 | 82.65 | 57.45 | 91.88 | 84.58 | 99.80 |
| w/o separate embed. | 14.27 | 51.16 | 38.14 | 82.17 | 55.90 | 90.91 | 83.88 | 99.79 |
| RNN | 12.51 | 48.24 | 33.60 | 79.28 | 50.28 | 89.67 | 78.63 | 99.72 |

The experimental results reveal that the MRBERT outperformed the ablation models on all metrics, especially for the melody prediction. Since *w/o cross-attn.* still uses separate embeddings, its performance is slightly higher than that of *w/o separate embed.* Furthermore, pre-training considerably improved the prediction of both melody and rhythm.

#### *4.4. Results of Conditional Generation*

In the conditional generation, melody and rhythm sequences with tokens dropped at random positions were used as the evaluation data. The pre-trained MRBERT with the output layers of the conditional generation task predicted the missing parts from the given melody and rhythm context. Figure 8 displays the model's predictions and the correct answers for missing parts at the head, middle, and tail of a piece of music. The leadsheet shows that missing notes in the middle of a bar (or measure) could be predicted easily, whereas misjudgments occurred at bar boundaries.

**Figure 8.** Leadsheets of conditional generated results and reference.
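
The evaluation procedure can be sketched as follows: random positions are masked in both the melody and rhythm sequences, and an assumed bidirectional `model` recovers them from the surrounding context; the mask token id, drop ratio, and model interface are placeholders.

```python
import random
import torch

@torch.no_grad()
def infill(model, melody_ids, rhythm_ids, drop_ratio=0.15, mask_id=0):
    """Mask random positions and predict them from the bidirectional context."""
    n_drop = max(1, int(drop_ratio * len(melody_ids)))
    positions = random.sample(range(len(melody_ids)), k=n_drop)
    mel, rhy = list(melody_ids), list(rhythm_ids)
    for p in positions:
        mel[p], rhy[p] = mask_id, mask_id            # drop melody and rhythm at the same position
    mel_logits, rhy_logits = model(torch.tensor([mel]), torch.tensor([rhy]))
    # Predicted (pitch, duration) pair for every masked position.
    return {p: (int(mel_logits[0, p].argmax()), int(rhy_logits[0, p].argmax()))
            for p in positions}
```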

Table 4 presents the details of the predictions in Figure 8. The model shows strong confidence in the rhythm prediction, with high accuracy, whereas the probabilities of the melody candidates do not differ considerably. Although the model predicted *G4* where the correct answer was *F4*, *F4* appeared immediately after as the second candidate. Furthermore, the rhythm *1/8* was predicted accurately at this position, but the probability of the first candidate did not have an absolute advantage: at bar boundaries, the rhythm prediction fluctuates, which is a normal phenomenon.

**Table 4.** Details of conditional generation.


<sup>1</sup> The underline "\_\_" indicates the covered (masked) pitch or rhythm. <sup>2</sup> The model predicted G4, but the correct answer is F4.

Table 5 presents the ablation results in terms of HITS@k for the conditional generation task. For the melody prediction, averaged over HITS@k (k = 1, 3, 5, and 10), the MRBERT achieved 54.86%, which is 1.49% higher than *w/o cross-attn.*, 5.22% higher than *w/o separate embed.*, and 9.95% higher than the BiRNN. For the rhythm prediction, it achieved an average of 81.85%, which is 0.55% higher than *w/o cross-attn.*, 2.09% higher than *w/o separate embed.*, and 3.16% higher than the BiRNN.

**Table 5.** Ablation experimental results of the conditional generation task.


The experimental results reveal that the MRBERT outperformed the other ablation models, and its rhythm prediction accuracy was the highest among all models. Compared to autoregressive generation, the accuracy is slightly higher because conditional generation considers information from both directions.

#### *4.5. Results of Seq2Seq Generation*

In the Seq2Seq generation, melodies paired with chords were used as the evaluation data. Figure 9 shows an example of the real chords and the chords predicted by the pre-trained MRBERT with the output layer of the Seq2Seq generation task. The predicted chords contained "F", "BbM", and "C7", all of which were included in the real chords.

**Figure 9.** Leadsheets of given melody sequence with generated chords and reference chords.
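
In sketch form, the Seq2Seq evaluation feeds the melody/rhythm sequence to the model and reads a chord symbol from the chord output layer at each position; the per-position chord alignment and the `model` interface below are assumptions for illustration.

```python
import torch

@torch.no_grad()
def predict_chords(model, melody_ids, rhythm_ids, chord_vocab):
    """Seq2Seq chord generation: map a melody/rhythm sequence to a chord per position."""
    chord_logits = model(torch.tensor([melody_ids]), torch.tensor([rhythm_ids]))  # (1, T, 799)
    chord_ids = chord_logits.argmax(dim=-1)[0].tolist()
    return [chord_vocab[i] for i in chord_ids]        # e.g., ["F", "F", "BbM", "C7", ...]
```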

Table 6 presents the ablation results in terms of HITS@k for the Seq2Seq generation task. Averaged over HITS@k (k = 1, 3, 5, and 10), the MRBERT achieved 49.56%, which is 0.61% higher than *w/o cross-attn.*, 1.83% higher than *w/o separate embed.*, and 5.14% higher than the BiRNN.


**Table 6.** Ablation experimental results of the Seq2Seq generation task.

The experimental results revealed that the MRBERT outperformed the other ablation models in the Seq2Seq generation task. Separate embedding also improved the performance even when predicting the chords rather than the melody and rhythm.

#### **5. Discussion**

This paper has conducted ablation experiments on three kinds of tasks, namely autoregressive generation, conditional generation, and Seq2Seq generation, and has evaluated them at multiple levels by setting different k in HITS@k. The experimental results demonstrate the following: First, pre-trained representation learning improves the performance of all three kinds of tasks; the performance of the RNN and BiRNN is significantly lower than that of the models using pre-training in every task. Second, it is effective to consider the melody and rhythm separately in representation learning; the ablation results show that the model using separate embedding performs better in HITS@k on each task than the model without it. Third, the assumption that there are weak dependencies between the melody and rhythm is reasonable; the performance of the MRBERT, which uses both separate embedding and semi-cross attention, is slightly higher than that of the model using only separate embedding.

This paper, like other music representation learning studies, is inspired by language modeling in natural language processing, so the method can only be applied to music in symbolic formats. In practice, a large amount of music exists in audio formats such as mp3 and wav, which requires the model to handle continuous spectrograms rather than discrete sequences. Some studies in computer vision have explored the application of representation learning to image processing [30–32], which is very enlightening for future work.

#### **6. Conclusions**

This paper proposed MRBERT, a pre-trained model for multitask music generation. During pre-training, the MRBERT learned representations of the melody and rhythm by dividing the embedding layers and transformer blocks into two groups and exchanging information between them through semi-cross attention. Compared to the original BERT, the MRBERT simultaneously considers the strong dependencies of the melodies and rhythms on themselves and the weak dependencies between them, which allows it to learn better representations. In the subsequent fine-tuning, the corresponding content was generated according to the tasks. Three music generation tasks, namely autoregressive, conditional, and Seq2Seq generation, were designed to help users compose music more conveniently. Unlike traditional music generation approaches designed for a single task, these three tasks cover melody and rhythm generation, modification, and completion, as well as chord generation. To verify the performance of the MRBERT, ablation experiments were conducted on each generation task. The experimental results revealed that pre-training improves task performance and that the MRBERT, using separate embedding and semi-cross attention, outperformed the traditional language pre-trained model BERT in terms of HITS@k.

The proposed method can be utilized in practical music generation applications, such as web-based music composers, to provide melody and rhythm generation, modification, and completion, as well as chord matching. However, generating high-quality music requires a training corpus of leadsheets in which the melodies, rhythms, and corresponding chords are clearly labeled. The problem is that this type of data is difficult to collect, which limits the expansion of the data volume. In the future, while the application of pre-training techniques in music will continue to be explored, it is equally important to extend the generation tasks to unlabeled symbolic music data and to audio data.

**Author Contributions:** Conceptualization, S.L. and Y.S.; methodology, S.L. and Y.S.; software, S.L. and Y.S.; validation, S.L. and Y.S.; writing—original draft preparation, S.L.; writing—review and editing, Y.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-2021R1F1A1063466).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Restrictions apply to the availability of these data. Data were obtained from https://github.com/00sapo/OpenEWLD (accessed on 1 October 2022).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
