#### *3.2. Pre-Training of MRBERT*

The MRBERT is a pre-trained model for learning representations of the melody and rhythm. As displayed in Figure 2, the melody (*m*<sub>1</sub>, *m*<sub>2</sub>, \_\_, ... , *m<sub>n</sub>*) and rhythm (*r*<sub>1</sub>, *r*<sub>2</sub>, \_\_, ... , *r<sub>n</sub>*) sequences are input to the embedding layers, where "\_\_" represents a randomly masked token. The tokens of the melody and rhythm sequences are embedded by their corresponding token embedding layers. The position embedding layer, which is shared by the melody and rhythm, adds the position feature to them. Through the embedding layers, the melody embedding *e<sup>M</sup>* and the rhythm embedding *e<sup>R</sup>* are obtained. Next, *e<sup>M</sup>* and *e<sup>R</sup>* are input to the corresponding transformers, which exchange information through semi-cross attention. Semi-cross attention is proposed to realize the information exchange between the melody and rhythm. As presented in formula (1), the cross query of *e<sup>M</sup>* is obtained from the dot product of the melody query *q<sup>M</sup>* with the rhythm query *q<sup>R</sup>* activated by softmax. The key *k<sup>M</sup>* and value *v<sup>M</sup>* are used in the same way as in self-attention. For the rhythm, the melody query *q<sup>M</sup>* is required to calculate the cross query of *e<sup>R</sup>*. Finally, the melody hidden states *h<sup>M</sup>* and rhythm hidden states *h<sup>R</sup>* output by the transformers are passed through the melody prediction layer and rhythm prediction layer to predict the masked melody and rhythm tokens.

$$\begin{aligned} \text{SemiCrossAtt}^{M} &= \text{softmax}\left( \frac{\left( q^{M} \cdot \text{softmax}\left( q^{R} \right) \right) {k^{M}}^{T}}{\sqrt{d_k}} \right) v^{M}\\ \text{SemiCrossAtt}^{R} &= \text{softmax}\left( \frac{\left( q^{R} \cdot \text{softmax}\left( q^{M} \right) \right) {k^{R}}^{T}}{\sqrt{d_k}} \right) v^{R} \end{aligned} \tag{1}$$
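
A minimal PyTorch sketch of the embedding layers and the semi-cross attention in formula (1) may make the data flow concrete. The module and parameter names (`MRBERTEmbedding`, `SemiCrossAttention`, `d_model`, etc.) are hypothetical, single-head attention is assumed, and the "·" in formula (1) is interpreted as an element-wise product so that the shapes line up; the authors' implementation may differ.

```python
# Sketch only: hypothetical module names; not the published MRBERT code.
import math
import torch
import torch.nn as nn


class MRBERTEmbedding(nn.Module):
    """Per-stream token embeddings plus a position embedding shared by both streams."""

    def __init__(self, melody_vocab: int, rhythm_vocab: int, d_model: int, max_len: int = 512):
        super().__init__()
        self.melody_tok = nn.Embedding(melody_vocab, d_model)
        self.rhythm_tok = nn.Embedding(rhythm_vocab, d_model)
        self.position = nn.Embedding(max_len, d_model)   # shared by melody and rhythm

    def forward(self, melody_ids: torch.Tensor, rhythm_ids: torch.Tensor):
        # melody_ids, rhythm_ids: (batch, seq_len) token ids.
        pos = torch.arange(melody_ids.size(1), device=melody_ids.device)
        e_m = self.melody_tok(melody_ids) + self.position(pos)
        e_r = self.rhythm_tok(rhythm_ids) + self.position(pos)
        return e_m, e_r


class SemiCrossAttention(nn.Module):
    """Attention whose query is modulated by the other stream's softmax-activated query."""

    def __init__(self, d_model: int):
        super().__init__()
        self.d_k = d_model
        # Separate projections for the melody (M) and rhythm (R) streams.
        self.q_m, self.k_m, self.v_m = (nn.Linear(d_model, d_model) for _ in range(3))
        self.q_r, self.k_r, self.v_r = (nn.Linear(d_model, d_model) for _ in range(3))

    def forward(self, e_m: torch.Tensor, e_r: torch.Tensor):
        # e_m, e_r: (batch, seq_len, d_model) melody / rhythm embeddings.
        q_m, k_m, v_m = self.q_m(e_m), self.k_m(e_m), self.v_m(e_m)
        q_r, k_r, v_r = self.q_r(e_r), self.k_r(e_r), self.v_r(e_r)

        # Cross queries: q^M modulated by softmax(q^R), and vice versa
        # (element-wise product assumed for the "." in formula (1)).
        cross_q_m = q_m * torch.softmax(q_r, dim=-1)
        cross_q_r = q_r * torch.softmax(q_m, dim=-1)

        # Standard scaled dot-product attention using the cross queries.
        attn_m = torch.softmax(cross_q_m @ k_m.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        attn_r = torch.softmax(cross_q_r @ k_r.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        return attn_m @ v_m, attn_r @ v_r
```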

**Figure 2.** Pipeline of pre-training of MRBERT.

The pre-training strategy of this paper follows the MLM (masked language modeling) proposed for BERT, in which 15% of the tokens in the sequence are randomly selected for masking: (1) 80% of the selected tokens are replaced by MASK; (2) 10% are replaced by randomly selected tokens; (3) the remaining 10% remain unchanged. Furthermore, to enhance the performance of the pre-training, this paper follows BERT-like models and other related studies by dropping the next sentence prediction pre-training task and using dynamic masking [27].
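
The masking procedure can be sketched as follows, assuming integer token ids and a hypothetical `mask_id`; because the positions are re-sampled on every call, applying the function once per epoch yields the dynamic masking mentioned above.

```python
# Sketch of 15% selection with the 80/10/10 replacement rule; names are illustrative.
import torch


def dynamic_mask(tokens: torch.Tensor, mask_id: int, vocab_size: int,
                 mask_prob: float = 0.15):
    """Return (masked_tokens, labels); labels are -100 where no loss is computed."""
    tokens = tokens.clone()
    labels = torch.full_like(tokens, -100)

    # Select 15% of positions for prediction and remember their true ids.
    selected = torch.rand(tokens.shape) < mask_prob
    labels[selected] = tokens[selected]

    # Of the selected positions: 80% -> MASK, 10% -> random token, 10% -> unchanged.
    rand = torch.rand(tokens.shape)
    to_mask = selected & (rand < 0.8)
    to_random = selected & (rand >= 0.8) & (rand < 0.9)

    tokens[to_mask] = mask_id
    tokens[to_random] = torch.randint(vocab_size, tokens.shape)[to_random]
    return tokens, labels
```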

#### *3.3. Fine-Tuning of Generation Tasks*

To address the diverse generation tasks, the MRBERT is fine-tuned with three downstream tasks, namely autoregressive, conditional, and Seq2Seq generation. Furthermore, after fine-tuning for each task, joint generation can be achieved by executing the three generation methods simultaneously.

#### 3.3.1. Autoregressive Generation Task

To accomplish the autoregressive generation task, its generation pattern should be known; it can be summarized as a unidirectional generation similar to a Markov chain [28], $P(t_i \mid t_1, t_2, t_3, \ldots, t_{i-1})$, where the probability of the token $t_i$ depends on $t_1$ to $t_{i-1}$. Autoregressive generation means that the tokens are predicted in order from left to right, and the current token is predicted based on the previous tokens. First, <BOS> (the beginning of the sequence, a special token in the vocabulary) is passed into the MRBERT. Next, the output layers, which are a pair of fully connected layers, predict the melody and rhythm based on the hidden states from the MRBERT. Finally, the predicted melody and rhythm are used to calculate the cross-entropy loss for backpropagation. When backpropagation ends, the input token sequences are incremented by one time step, and the model predicts the melody and rhythm of the next time step until <EOS> (the end of the sequence, a special token corresponding to <BOS>) is generated. The ground truth label data are easily obtained by shifting the input sequences to the right by one time step. Through fine-tuning, the pre-trained model and output layers continuously reduce the gap between the predictions and the label data. After fine-tuning, whenever a melody and rhythm are generated, they are appended to the end of the sequence to form a new input, as displayed in Figure 3.

**Figure 3.** Pipeline of autoregressive generation. The orange arrows indicate that the predicted melody and rhythm are continuously added to the end of the input.
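
A sketch of this generation loop is given below, assuming a hypothetical `mrbert` encoder that returns melody and rhythm hidden states, separate melody/rhythm output heads, and greedy decoding; the stopping criterion (stop once the melody stream emits <EOS>) is also an assumption.

```python
# Sketch of the autoregressive loop of Figure 3; object names are hypothetical.
import torch


@torch.no_grad()
def generate(mrbert, melody_head, rhythm_head, bos_id: int, eos_id: int, max_len: int = 128):
    melody, rhythm = [bos_id], [bos_id]
    for _ in range(max_len):
        m_in = torch.tensor([melody])            # (1, t) current melody prefix
        r_in = torch.tensor([rhythm])            # (1, t) current rhythm prefix
        h_m, h_r = mrbert(m_in, r_in)            # hidden states from MRBERT
        next_m = melody_head(h_m[:, -1]).argmax(-1).item()
        next_r = rhythm_head(h_r[:, -1]).argmax(-1).item()
        melody.append(next_m)                    # append predictions and repeat
        rhythm.append(next_r)
        if next_m == eos_id:                     # assumed stopping criterion
            break
    return melody[1:], rhythm[1:]                # drop the <BOS> token
```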

#### 3.3.2. Conditional Generation Task

Unlike in autoregressive generation, in conditional generation not only previous tokens but also future tokens are considered when predicting unknown tokens; the model should consider the bidirectional contextual information around the unknown tokens. To realize this task, a generation pattern such as that of a denoising autoencoder [29] is used, $P(t_j \mid t_1, t_2, \ldots, t_{j-1}, t_{j+1}, \ldots, t_i)$, where the unknown token $t_j$ should be predicted based on the known tokens. Fine-tuning for conditional generation is highly similar to pre-training. However, since multiple tokens are masked in pre-training, each masked token is predicted under the assumption that it is independent of the other masked tokens. To address this problem, shorter sequences are used, and only a single pair of melody and rhythm tokens is masked during fine-tuning. The cross-entropy loss is calculated from the predictions (melody or rhythm) and the ground truth labels and is then used for fine-tuning. After fine-tuning, the MRBERT and the output layer of the conditional generation fill in the missing parts according to the contextual information obtained from the given melody and rhythm, as displayed in Figure 4.

**Figure 4.** Pipeline of conditional generation. The underline represents the missing part of the music.
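
One fine-tuning step might look like the following sketch, assuming hypothetical `mrbert`, output-head, and optimizer objects; a single position `j` is masked in both streams, matching the single masked pair described above.

```python
# Sketch of one conditional-generation fine-tuning step; names are illustrative.
import torch
import torch.nn.functional as F


def conditional_step(mrbert, melody_head, rhythm_head, optimizer,
                     melody: torch.Tensor, rhythm: torch.Tensor, j: int, mask_id: int):
    # melody, rhythm: (batch, seq_len) token ids; j: index of the masked position.
    m_label, r_label = melody[:, j].clone(), rhythm[:, j].clone()
    melody, rhythm = melody.clone(), rhythm.clone()
    melody[:, j] = mask_id
    rhythm[:, j] = mask_id

    h_m, h_r = mrbert(melody, rhythm)            # bidirectional context around position j
    loss = (F.cross_entropy(melody_head(h_m[:, j]), m_label) +
            F.cross_entropy(rhythm_head(h_r[:, j]), r_label))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```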

#### 3.3.3. Seq2Seq Generation Task

Once the melody and rhythm have been created, chords should be added to make the music sound less monotonous. This generation pattern can be summarized as $P(t'_1, t'_2, \ldots, t'_i \mid t_1, t_2, \ldots, t_i)$, where $t$ represents the given tokens and $t'$ represents the tokens that should be predicted; the probability of $t'$ at positions 1 to $i$ is conditioned on the given $t$ at positions 1 to $i$. In fine-tuning, the melody and rhythm sequences are input into the MRBERT, and the chords at the corresponding positions are predicted by the output layer of the Seq2Seq generation. The cross-entropy loss calculated from the predicted chords and the ground truth label data is used for fine-tuning. After fine-tuning, the MRBERT can accept the melody and rhythm and subsequently generate chords through the output layer of the Seq2Seq generation, as displayed in Figure 5. A continuous run of the same chord symbol indicates that the chord is held until a different symbol appears.

**Figure 5.** Pipeline of Seq2Seq generation. Melody and rhythm can be of any length, and the length of the generated chords varies accordingly.
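
A sketch of the inference path is shown below; the chord head, the concatenation of the two hidden-state streams, and the id-to-chord mapping are assumptions made for illustration rather than details confirmed by the paper.

```python
# Sketch of chord generation from given melody and rhythm; names are illustrative.
import torch


@torch.no_grad()
def generate_chords(mrbert, chord_head, melody: torch.Tensor, rhythm: torch.Tensor, id2chord):
    # melody, rhythm: (1, seq_len) token ids of any length.
    h_m, h_r = mrbert(melody, rhythm)
    h = torch.cat([h_m, h_r], dim=-1)            # combining both streams is an assumption
    chord_ids = chord_head(h).argmax(-1)         # one chord symbol per input position
    # Repeated symbols mean the same chord is held until a different symbol appears.
    return [id2chord[i.item()] for i in chord_ids[0]]
```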
