Automatic Melody Harmonization via Reinforcement Learning by Exploring Structured Representations for Melody Sequences

Zeng, Te; Lau, Francis C. M.

doi:10.3390/electronics10202469


by Te Zeng * and Francis C. M. Lau

Department of Computer Science, The University of Hong Kong, Pokfulam, Hong Kong, China

* Author to whom correspondence should be addressed.

Electronics 2021, 10(20), 2469; https://doi.org/10.3390/electronics10202469

Submission received: 31 August 2021 / Revised: 3 October 2021 / Accepted: 4 October 2021 / Published: 11 October 2021

(This article belongs to the Special Issue Machine Learning Applied to Music/Audio Signal Processing)


Abstract:

We present a novel reinforcement learning architecture that learns a structured representation for use in symbolic melody harmonization. Probabilistic models are predominant in melody harmonization tasks, most of which only treat melody notes as independent observations and do not take note of substructures in the melodic sequence. To fill this gap, we add substructure discovery as a crucial step in automatic chord generation. The proposed method consists of a structured representation module that generates hierarchical structures for the symbolic melodies, a policy module that learns to break a melody into segments (whose boundaries concur with chord changes) and phrases (the subunits in segments) and a harmonization module that generates chord sequences for each segment. We formulate the structure discovery process as a sequential decision problem with a policy gradient RL method selecting the boundary of each segment or phrase to obtain an optimized structure. We conduct experiments on our preprocessed HookTheory Lead Sheet Dataset, which has 17,979 melody/chord pairs. The results demonstrate that our proposed method can learn task-specific representations and, thus, yield competitive results compared with state-of-the-art baselines.

Keywords:

melody harmonization; structured representation; substructures exploration; reinforcement learning; music analysis

1. Introduction

Automatic melody harmonization has received a great deal of attention since it was first introduced [1]. Automatic harmonization can help both untrained people as well as practicing musicians in their music creation tasks. Most existing methods, however, are limited to generating chord sequences from fixed predefined structures, such as a bar or half a bar. More meaningful and flexible structures (such as “phrases”) are rarely considered in melody harmonization. To create chord accompaniments that are more accurate (w.r.t. musical rules) and perhaps also richer in colors, we present a reinforcement learning-based method to automatically discover phrase-level structure and predict the onset of chords for a given piece of melodic music.

Automatic melody harmonization has a long history starting in the 1970s. In the early stages, harmonization systems relied heavily on hard rules. Similar to language processing, music has hierarchical structures that can be described as a parallel to a grammar. Hence, some works have created a set of grammar-like rules that can generate chords from a melody. For instance, a rule-based attempt was proposed by Ebcioğlu et al. [2]. With over 270 rules included, that system was designed for the harmonization of four-part chorales in the style of Bach. This was followed by two subsequent works [3,4]. In some of the latest automatic harmonization and generation works based on learning, these rules from the past were used to enhance the performance of statistical methods [5,6,7,8]. Learning-based methods, in general, can avoid the intensive labour needed for enumerating all the necessary rules.

Given the high availability and low costs of computation power today, many attempts have been made to employ probabilistic models and deep learning techniques in harmonization problems. Hidden Markov Models (HMMs) have been shown to be effective in melody harmonization [9,10,11,12]. Like spoken languages, music also exhibits long term dependencies in melody and chord sequences [12].

However, rather than modelling the dependencies between observations, HMM treats each observation independently. To address the gap, a line of research has attempted to employ deep learning techniques to model long-term dependencies over a sequence of melodic bars. The studies by [13,14] use Bidirectional Long Short-Term Memory (BiLSTM) to model dependencies between adjacent music events and generate chord sequences correspondingly. These methods outperformed probabilistic models and successfully demonstrated the importance of long-term dependency in music.

Although deep learning techniques have improved the performance of harmonization tasks, they are usually limited to a fixed number of generated chords per melody bar or measure. In [7,9,13,15], the proposed methods are limited to generating one chord per measure and, in [14,15], to one chord per half-bar. Researchers have very rarely considered substructures (e.g., phrases or segments in a melody) for melody harmonization tasks. Tsushima et al. proposed using tree structures to model chords and harmonic functions [11]. However, they did not explore the substructures in melody sequences. Tree structures can be very complex and usually require explicit phrase boundary annotations.

We believe that exploring substructures in melodies can help to learn a better representation for melody harmonization tasks. In this work, we define two terms utilized to learn a structured representation for melodies: (a) Segment, the boundary of which concurs with the chord change; that is, one segment can be used to generate one chord. (b) Phrase, which further divides a segment into subunits. The two terms are intuitively illustrated in Figure 1.

The substructures in a melody connect with each other to form a meaningful musical thought [16], which implies that analysing the structural components can help to better understand a whole melody and, thus, generate chords for it. Our method first identifies the boundaries of phrases and segments in a melody. With the obtained boundaries, we discover the note-level connections in a phrase, the phrase-level connections in a segment and segment-level connections in a melody sequence with a hierarchical LSTM. Finally, each learned segment representation is utilized to generate chords.

The definition of “phrase” is ambiguous in the field of music. In western classical music, a “phrase” usually refers to a musical sequence that consists of consecutive notes expressing a complete musical thought [19] and such a phrase is roughly 4–8 measures long [20]. In today’s popular music, the end of a “phrase” normally coincides with the taking of breath [21]; such “phrases” are usually longer than what we need in structure discovery for chord generation. For example, in the dataset used in [22], most of the human-annotated phrases have a length of around nine notes.

Grouping rules have also been developed to split a melody into phrases for music analysis. GPR in GTTM has been applied to obtain music phrases and furthermore to help with melody phrase embedding in [23]. However, GPR contains several subrules, each of which yields different possible boundaries [24], resulting in a challenge to consider all the subrules in phrase boundary identification. Moreover, the phrase-level structure that best suits the harmonization tasks is not necessarily a part of GPR or similar rules. Consequently, the prediction of phrase boundaries in this work is performed in an unsupervised manner.

The system proposed in [25,26] showed the possibility of leveraging reinforcement learning in duet generation and structured representation learning. We propose, herein, a novel melody harmonization model based on reinforcement learning that can achieve state-of-the-art results when applied to the Hooktheory Leadsheet Dataset (HLSD). (The Hooktheory Leadsheet Dataset was compiled by [14] from the Hooktheory website https://www.hooktheory.com/site and the dataset is available on https://github.com/wayne391/lead-sheet-dataset; accessed on 4 October 2021).

We adopted the Synchronous Advantage Actor Critic (A2C) algorithm [27] and treated structure segmentation as a sequential decision problem that identifies the boundaries of segments and phrases, which facilitates chord generation by learning structured representations for melody sequences. Such a method can be easily applied in various music information retrieval (MIR) tasks where hierarchical structures exist but task-specific structure annotations are not built.

This paper attempts to make the following contributions:

to employ reinforcement learning, for the first time, to discover substructures in melody for symbolic melody harmonization;
based on the discovery of substructures, to deal with the tasks of segmentation and harmonization by learning structured representations of the given symbolic melody;
through experiments using our processed dataset to show that our proposed method outperforms other baseline methods.

2. Model

The goal of this paper is to improve melody harmonization using optimized representations that consider phrase-level structures. As shown in Figure 2, the overall architecture consists of three components: Structured Representation Module (REP), Segmentation Module (SEG) and Segment Harmonization Module (HAR). With a three-level LSTM, REP learns the note-level, phrase-level and segment-level representation for each note. The three-level representations are then concatenated to form the state of each note, based on which SEG samples an action deciding whether the current note is the boundary of a phrase or segment.

After SEG decides the actions for all the notes in a melody, REP will translate the action sequence into a structured representation (note-level representations are grouped into phrase-level representations, which are further grouped into segment-level representations). Finally, with the learned representations, HAR generates the chords for each of the obtained segments.

Given a melody $M = (m_{1}, m_{2}, \dots, m_{T})$ that consists of T notes, REP provides the state of each note to SEG. SEG then outputs a sequence of decisions $A = (a_{1}, a_{2}, \dots, a_{T})$, $a_{t} \in \{0, 1, 2\}$, where $a_{t} = 0$ indicates note $m_{t}$ is inside a phrase, $a_{t} = 1$ indicates note $m_{t}$ is at the end of a phrase and $a_{t} = 2$ indicates note $m_{t}$ is at the end of a segment. Consequently, the melody is formed by a sequence of segments, denoted as

$G = SEG (M, A) = (g_{1}, \dots, g_{L}),$

(1)

where L is the number of segments in melody M. For each segment $g_{i}$, we have

$g_{i} = (p_{i, 1}, \dots, p_{i, N_{i}}),$

(2)

where $N_{i}$ is the number of phrases in segment $g_{i}$. For each phrase, we have

$p_{i, j} = (m_{i, j, 1}, \dots, m_{i, j, K_{j}}),$

(3)

where $K_{j}$ is the number of notes in phrase $p_{i, j}$ (the j-th phrase from the i-th segment), with $1 \leq i \leq L$ and $1 \leq j \leq N_{i}$.
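The grouping in Equations (1)–(3) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `split_melody`, the toy melody and the handling of trailing notes without a closing action are assumptions made for illustration only.

```python
from typing import List, Sequence


def split_melody(notes: Sequence, actions: Sequence[int]) -> List[List[list]]:
    """Group notes into segments of phrases according to an action sequence.

    actions[t] == 0: note t is inside a phrase
    actions[t] == 1: note t closes the current phrase
    actions[t] == 2: note t closes the current phrase and the current segment
    """
    if len(notes) != len(actions):
        raise ValueError("notes and actions must have the same length")
    segments, segment, phrase = [], [], []
    for note, action in zip(notes, actions):
        phrase.append(note)
        if action in (1, 2):      # phrase boundary (a segment end also ends a phrase)
            segment.append(phrase)
            phrase = []
        if action == 2:           # segment boundary
            segments.append(segment)
            segment = []
    # Assumption: trailing notes without a closing action form a final unit.
    if phrase:
        segment.append(phrase)
    if segment:
        segments.append(segment)
    return segments


# Hypothetical eight-note melody, written as pitch names only.
melody = ["G", "F", "E", "D", "C", "D", "E", "C"]
actions = [0, 1, 0, 2, 0, 0, 1, 2]
print(split_melody(melody, actions))
# [[['G', 'F'], ['E', 'D']], [['C', 'D', 'E'], ['C']]]
```

Here L = 2 segments are obtained, each containing two phrases.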

As a feedback signal, the sampled action $a_{t}$ of the RL agent for each note will be sent back to REP and translated into a structured representation using a hierarchical LSTM network. Further, the learned structured representations will be fed into HAR to predict a chord sequence $C_{1 : L}$ for melody segments $G_{1 : L}$. The ultimate goal of HAR is to approximate the function $h: G \to C$, $G = (g_{1}, \dots, g_{L})$, $C = (c_{1}, \dots, c_{L})$ such that $h (g_{i}) = c_{i}$, $1 \leq i \leq L$.

As the three modules interact with each other, we train them jointly. In the first stage, to pre-train REP and HAR, we use the chord boundaries provided by HLSD (as the segment boundary) and boundaries acquired by GPR from GTTM (as the phrase boundary). In the second stage, we fix the parameters of REP and HAR and train SEG in a semi-supervised manner. The chord boundaries in the HLSD are utilized as the ground-truth label to train SEG for segment boundary prediction. The prediction of the phrase boundary is trained in an unsupervised way as there is no ground truth for phrase-level structures in the dataset. In the last stage, the three modules are trained together to achieve the best harmonization results.

To better understand the process, we first introduce the symbolic melody encoding and how REP learns structured representations. After that, we explain how we adopt reinforcement learning to explore phrase-level and segment-level structures and split melody sequences into segments and phrases. Finally, we describe how HAR predicts the chords for the melody segments with the learned structures.

2.1. Symbolic Melody Encoding

In our melody harmonization task, each music piece consists of two components: (1) the human-composed melody $M = (m_{1}, m_{2}, \dots, m_{T})$ and (2) the machine-generated chord accompaniment $C = (c_{1}, c_{2}, \dots, c_{L})$. As note-based representation is closer to human perception in music composition and harmonization, we adopt a note-based representation. Similar to [28], we extract pitch and duration information from a melody token and encode them into a 13-dimensional vector $x$. Specifically, the 13-dimensional vector contains two kinds of information:

(1) Chroma information: each index in $x$ represents a pitch class from {Rest, C, C#, D, D#, E, F, F#, G, G#, A, A#, B}, in which we consider C# and Db, D# and Eb, F# and Gb, G# and Ab, A# and Bb to be enharmonically equivalent; therefore, the involved pitch classes are all spelled with sharp signs.

(2) Duration information: the vocabulary representing the duration context is denoted as $D = \{1, 2, 3, \dots, 24\}$, for which 1 is the duration value of the sixteenth note (the resolution of each melody measure in our processed HLSD; for details, see Section 3.1). For each melody note $m_{t}$, whose pitch class is r and duration value is d, the r-th element in its 13-dimensional feature vector $x_{t}$ would be d. Figure 3 gives an example. For the melody sequence in the dashed rectangle, the note-based representation would be [[0,0,0,0,0,0,0,0,4,0,0,0,0], [0,0,0,0,0,0,4,0,0,0,0,0,0], [0,0,0,0,0,4,0,0,0,0,0,0,0], [0,0,0,4,0,0,0,0,0,0,0,0,0]].
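A minimal Python sketch of this encoding is given below. The function names are hypothetical, and the pitch names G, F, E and D are read off the example vectors above (index 0 being Rest), not taken from Figure 3 itself.

```python
# Pitch-class vocabulary in the order given in the text; index 0 is Rest.
PITCH_CLASSES = ["Rest", "C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]


def encode_note(pitch_class: str, duration: int) -> list:
    """Return the 13-dimensional vector whose pitch-class entry holds the duration.

    The duration is counted in sixteenth notes and must lie in {1, ..., 24}.
    """
    if not 1 <= duration <= 24:
        raise ValueError("duration must be between 1 and 24 sixteenth notes")
    vector = [0] * len(PITCH_CLASSES)
    vector[PITCH_CLASSES.index(pitch_class)] = duration
    return vector


def encode_melody(notes) -> list:
    """Encode a sequence of (pitch_class, duration) pairs note by note."""
    return [encode_note(pitch, duration) for pitch, duration in notes]


# Four quarter notes (duration 4 = four sixteenth notes each).
for row in encode_melody([("G", 4), ("F", 4), ("E", 4), ("D", 4)]):
    print(row)
```

Under these assumptions, the four printed rows coincide with the note-based representation listed above.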

2.2. Structured Representation Module

Given a melody $M = (m_{1}, m_{2}, \dots, m_{T})$ containing T notes, each note is represented by a 13-dimensional feature vector $x_{t}$ containing chroma and duration information. In order to learn the structured representations for a given melody, we employ a hierarchical LSTM (HLSTM) consisting of three levels: a note-level LSTM to sequentially connect notes in a phrase, a phrase-level LSTM to capture phrase-level context dependencies and a segment-level LSTM to yield a comprehensive representation of the whole melody sequence (see Figure 4). Clearly, the update of cell states and hidden states in our HLSTM depends on the actions sampled by SEG.

At the note level, an individual LSTM model ${LSTM}^{n}$ is used to connect a sequence of melody notes to construct a phrase. The propagation of the note-level LSTM depends on the action at position $t - 1$. If the previous note is at the end of a phrase or a segment, that is, $a_{t - 1} = 1$ or $a_{t - 1} = 2$, ${LSTM}^{n}$ will start to connect notes for a new phrase with a zero-initialized state. More precisely,

$h_{t}^{n}, c_{t}^{n} = \begin{cases} {LSTM}^{n} (x_{t}, h_{t - 1}^{n}, c_{t - 1}^{n}), & a_{t - 1} = 0 \\ {LSTM}^{n} (x_{t}, 0, 0), & a_{t - 1} = 1, 2 \end{cases}$

(4)

where $x_{t}$ is the vector representation of melody note $m_{t}$; $h_{t}^{n} \in R^{d}$ and $c_{t}^{n} \in R^{d}$ are the current hidden state and current memory cell at position t of ${LSTM}^{n}$, respectively.

The propagation of phrase-level LSTM also relies on action $a_{t}$, which the agent takes at note t in a melody sequence. When $a_{t} = 0$, it indicates that note $m_{t}$ is still in a phrase and the phrase has not been completely constructed; thus, the hidden state and memory cell state are directly copied from the preceding position $t - 1$. When $a_{t} = 1$, the note $m_{t}$ is at the end of a phrase and the phrase is now completely formed. When $a_{t} = 2$, the note $m_{t}$ is at the end of a segment and is also the boundary of the last phrase in this segment. Hidden state $h_{t}^{p}$ and memory cell $c_{t}^{p}$ will be updated by ${LSTM}^{p}$:

$h_{t}^{p}, c_{t}^{p} = \begin{cases} h_{t - 1}^{p}, c_{t - 1}^{p}, & a_{t} = 0 \\ {LSTM}^{p} (h_{t}^{n}, h_{t - 1}^{p}, c_{t - 1}^{p}), & a_{t} = 1, 2 \end{cases}$

(5)

where $h_{t}^{p} \in R^{d}$ and $c_{t}^{p} \in R^{d}$ are the current hidden state and the current memory cell at position t of ${LSTM}^{p}$.

Similarly, actions also decide how the hidden state and cell state are updated in the segment-level LSTM. When $a_{t} = 0$ or $a_{t} = 1$, it indicates that note $m_{t}$ is still in a segment and the segment has not been completely constructed; thus, the hidden state $h_{t}^{s}$ and memory cell state $c_{t}^{s}$ will not be updated. When $a_{t} = 2$, the note $m_{t}$ is at the end of a segment and the segment is now completely formed. Hidden state $h_{t}^{s}$ and memory cell $c_{t}^{s}$ will be updated by ${LSTM}^{s}$:

$h_{t}^{s}, c_{t}^{s} = \begin{cases} h_{t - 1}^{s}, c_{t - 1}^{s}, & a_{t} = 0, 1 \\ {LSTM}^{s} (h_{t}^{p}, h_{t - 1}^{s}, c_{t - 1}^{s}), & a_{t} = 2 \end{cases}$

(6)

where $h_{t}^{s} \in R^{d}$ and $c_{t}^{s} \in R^{d}$ are the current hidden state and the current memory cell at position t of ${LSTM}^{s}$.
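The action-dependent updates in Equations (4)–(6) can be sketched in Python as follows. The sketch only illustrates the gating logic: the three cells are passed in as callables, and `toy_cell` is a hypothetical stand-in accumulator, not an LSTM. A real implementation would substitute trained LSTM cells.

```python
from typing import Callable, List, Tuple

Vec = List[float]
State = Tuple[Vec, Vec]                    # (hidden state h, memory cell c)
Cell = Callable[[Vec, Vec, Vec], State]    # (input, h_prev, c_prev) -> (h, c)


def hlstm_step(x_t: Vec, a_prev: int, a_t: int,
               note: State, phrase: State, segment: State,
               lstm_n: Cell, lstm_p: Cell, lstm_s: Cell, dim: int):
    """One time step of the three-level update, following Equations (4)-(6)."""
    zeros = [0.0] * dim
    # Eq. (4): restart the note level after a phrase or segment boundary.
    if a_prev == 0:
        note = lstm_n(x_t, note[0], note[1])
    else:
        note = lstm_n(x_t, zeros, zeros)
    # Eq. (5): update the phrase level only when a phrase has just been closed.
    if a_t in (1, 2):
        phrase = lstm_p(note[0], phrase[0], phrase[1])
    # Eq. (6): update the segment level only when a segment has just been closed.
    if a_t == 2:
        segment = lstm_s(phrase[0], segment[0], segment[1])
    return note, phrase, segment


def toy_cell(x: Vec, h_prev: Vec, c_prev: Vec) -> State:
    """Hypothetical stand-in cell: adds the input sum to every memory entry."""
    total = sum(x)
    c = [ci + total for ci in c_prev]
    return list(c), c


def run(inputs: List[Vec], actions: List[int], dim: int = 1):
    """Unroll the hierarchy over a melody, starting from zero states and a_0 = 0."""
    note = phrase = segment = ([0.0] * dim, [0.0] * dim)
    a_prev = 0
    for x_t, a_t in zip(inputs, actions):
        note, phrase, segment = hlstm_step(
            x_t, a_prev, a_t, note, phrase, segment,
            toy_cell, toy_cell, toy_cell, dim)
        a_prev = a_t
    return note, phrase, segment


note, phrase, segment = run([[1.0]] * 4, [0, 1, 0, 2])
print(note[0], phrase[0], segment[0])   # [2.0] [4.0] [4.0]
```

With the toy cell, the note-level state counts notes within the current phrase (two), while the phrase and segment levels accumulate only at the positions where the corresponding boundary action is taken.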

2.3. Segmentation Module

The problem of structure segmentation is formulated as a reinforcement learning problem in SEG where we have an agent interacting with an episodic environment $ϵ$. At each time step t, the agent will receive a state $s_{t}$ and take a corresponding action $a_{t} \in \{0, 1, 2\}$ as a response that decides whether the note $m_{t}$ is the boundary of a phrase or a segment. As the segmentation is performed in a semi-supervised way, we design two rewards to evaluate the decision sequence $A = (a_{1}, a_{2}, \dots, a_{T})$. One is an intermediate reward to compare the predicted segment boundaries with the chord boundaries in HLSD. The other is a long-term delayed reward that measures the performance of phrase boundary prediction grounded on harmonization results.

State

During structure segmentation, we adopt a stochastic policy $π$ to generate a conditional distribution $π_{θ} (a ∣ s)$ over actions conditioned on the current state where $θ$ denotes the parameters in SEG. The state $s_{t}$ at time step t encodes the current input and previous contexts for deciding whether the note at position t is a boundary of a phrase or a segment. To provide adequate information, state $s_{t}$ is composed of current note-level hidden state and memory state, previous phrase-level and segment-level hidden and memory state:

$s_{t} = h_{t - 1}^{s} \oplus c_{t - 1}^{s} \oplus h_{t - 1}^{p} \oplus c_{t - 1}^{p} \oplus h_{t}^{n} \oplus c_{t}^{n},$

(7)

where ⊕ denotes the vector concatenation operation.

Action

The policy $π$ samples action $a_{t} \in \{0, 1, 2\}$ by the conditional probability $π_{θ} (a_{t} ∣ s_{t})$ to represent whether a phrase or a segment is formed. $a_{t} = 0$ means the current note is inside a phrase or a segment. $a_{t} = 1$ indicates the current phrase is now constructed completely. $a_{t} = 2$ reveals the current segment is now entirely formed. Formally,

$π_{θ} (a_{t} ∣ s_{t}; θ) = softmax (W \cdot s_{t} + b),$

(8)

where $θ = \{W, b\}$ denotes the parameters used in SEG.
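Equations (7) and (8) amount to a concatenation followed by a linear layer and a softmax. A minimal pure-Python sketch is given below; the function names, the tiny dimension and the zero-initialized weights are assumptions for illustration, and a practical implementation would rely on a deep learning framework.

```python
import math
import random


def build_state(h_s_prev, c_s_prev, h_p_prev, c_p_prev, h_n, c_n):
    """Eq. (7): concatenate segment-, phrase- and note-level states."""
    return h_s_prev + c_s_prev + h_p_prev + c_p_prev + h_n + c_n


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(state, W, b):
    """Eq. (8): distribution over the actions {0, 1, 2} given the state."""
    logits = [sum(w * s for w, s in zip(row, state)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)


def sample_action(probs, rng):
    """Draw an action from the stochastic policy."""
    return rng.choices([0, 1, 2], weights=probs, k=1)[0]


d = 2                                       # hypothetical hidden size
state = build_state(*[[0.1] * d for _ in range(6)])
W = [[0.0] * len(state) for _ in range(3)]  # untrained, illustrative weights
b = [0.0, 0.0, 0.0]
probs = policy(state, W, b)
action = sample_action(probs, random.Random(0))
print(len(state), probs, action)
```

With hidden size d, the state has dimension 6d, so W has shape 3 × 6d. With all-zero weights, the policy is uniform over the three actions.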

Reward

For the intermediate reward, we use the chord boundary labels from HLSD to evaluate the predicted segment. If the segment boundary is correctly predicted, a positive reward will be received. Otherwise, a negative reward will be given. More precisely,

$r_{int} = \begin{cases} + 1, & a_{t} = 2 \text{ and } m_{t} \in B_{target} \\ + 1, & a_{t} \neq 2 \text{ and } m_{t} \notin B_{target} \\ - 1, & a_{t} = 2 \text{ and } m_{t} \notin B_{target} \\ - 1, & a_{t} \neq 2 \text{ and } m_{t} \in B_{target} \end{cases}$

(9)

where $B_{target}$ is the set of notes that are at the chord boundaries provided by the dataset. The intermediate reward ranges from $- 1$ to 1.
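The four cases of Equation (9) reduce to checking whether the prediction "segment boundary" agrees with the label. A minimal sketch, with a hypothetical function name and a hypothetical example, is:

```python
def intermediate_reward(action: int, is_chord_boundary: bool) -> int:
    """Eq. (9): +1 if the segment-boundary decision matches the label, else -1."""
    predicted_boundary = (action == 2)
    return 1 if predicted_boundary == is_chord_boundary else -1


# Rewards for a four-note example whose chord boundaries fall on notes 2 and 4.
actions = [0, 2, 1, 0]
labels = [False, True, False, True]
print([intermediate_reward(a, y) for a, y in zip(actions, labels)])
# [1, 1, 1, -1]
```

Note that a phrase boundary (action 1) is treated like action 0 here, since Equation (9) only distinguishes segment boundaries from everything else.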

As there is no ground truth for phrase boundaries, considering that the ultimate goal of our model is to generate an appropriate sequence of chords to accompany the given melody, we design the delayed reward function based on harmonization results. Compared with the ground truth, the segment boundaries are not necessarily predicted correctly. The number of chords generated for each melodic measure may vary and even be different from the ground truth. Hence, we propose a weighted accuracy $WA$ as the delayed reward. Formally,

$r_{delayed} = WA = \sum_{{\hat{y}}_{t} \in {\hat{Y}}_{T}} \frac{𝟙_{\{{\hat{y}}_{t} = y_{t}\}} \cdot d_{t}}{y_{t} \cdot d_{t}},$

(10)

where $𝟙_{\{δ\}}$ is the indicator function, which outputs 1 when condition $δ$ is true and 0 otherwise. ${\hat{Y}}_{T}$ and $Y_{T}$ are the sets of predicted chord symbols and ground-truth chord symbols distributed to each note $m_{t}$ in a melody $M_{1 : T}$. $d_{t}$ denotes the duration context of note $m_{t}$. To give an intuition, assume we have a melodic bar with four notes, the duration sequence $D_{4}$ of which is $(4, 4, 4, 4)$. If the segmentation result from SEG is $(0, 2, 0, 2)$, it means that HAR would generate one chord for the first two notes and another chord for the last two notes. Suppose the first generated chord is C, the second generated chord is G and the ground truth has only one chord, which is C for this bar; then we have ${\hat{Y}}_{4} = (C, C, G, G)$ and $Y_{4} = (C, C, C, C)$. In this case, the weighted accuracy would be $0.5$. We use the weighted harmonization accuracy as the criterion to encourage a better harmonization performance for each of the notes in a melody. By doing so, a structured representation that is beneficial to the harmonization task can be learned.
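The worked example can be reproduced with a short Python sketch. It rests on an assumption: the weighted accuracy is read here as the duration-weighted fraction of notes whose distributed chord matches the ground truth, which is the reading that yields 0.5 for the example above. The function names are hypothetical.

```python
def chords_per_note(actions, segment_chords):
    """Distribute one chord per segment to every note of that segment.

    Assumes the action sequence closes each segment with action 2.
    """
    labels, index = [], 0
    for action in actions:
        if index >= len(segment_chords):
            raise ValueError("fewer chords than segments")
        labels.append(segment_chords[index])
        if action == 2:
            index += 1
    return labels


def weighted_accuracy(predicted, target, durations):
    """Duration-weighted share of notes whose chord matches the ground truth."""
    total = sum(durations)
    correct = sum(d for p, y, d in zip(predicted, target, durations) if p == y)
    return correct / total


predicted = chords_per_note([0, 2, 0, 2], ["C", "G"])   # ['C', 'C', 'G', 'G']
target = ["C", "C", "C", "C"]
print(weighted_accuracy(predicted, target, [4, 4, 4, 4]))  # 0.5
```

Weighting by duration means that a wrong chord under a long note costs more than a wrong chord under a short one.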

Policy

The vanilla policy gradient algorithm, REINFORCE, proposed in [29] is a common method to train an RL agent but is also known for its high variance in the gradient estimate [30], which tends to induce poor convergence. To mitigate this problem, we adopt the Synchronous Advantage Actor Critic (A2C) [27] method in our SEG. To verify whether A2C consistently yields a better performance than REINFORCE in the melody harmonization task, we selected the REINFORCE algorithm as a baseline method in Section 3.4.

In A2C, a critic network is used to learn the state-value function $V (s)$, which estimates the average expected return. “Advantage” $A (s, a)$ is introduced to show the advantage of performing action $a_{t}$ under state $s_{t}$. This offers an efficient way to approximate $V (s)$ only, rather than both $Q (s, a)$ and $V (s)$.

$\begin{aligned} A (s_{t}, a_{t}) &= Q (s_{t}, a_{t}) - V (s_{t}) = E [R_{t} ∣ s_{t}, a_{t}] - V (s_{t}) \\ &\approx r_{t} + γ V (s_{t + 1} ∣ s_{t}, a_{t}) - V (s_{t}) . \end{aligned}$

(11)

We can maximize $A (s, a)$ to update the actor network $π_{θ} (a_{t} ∣ s_{t})$ in SEG with the following policy gradient:

$\nabla J (θ) = \nabla \log π_{θ} (a_{t} ∣ s_{t}) [r_{t} + γ V (s_{t + 1} ∣ s_{t}, a_{t}) - V (s_{t})] .$

(12)

As for the critic network, we use the squared loss to update its parameters:

$L_{critic} = {(r_{t} + γ V (s_{t + 1} ∣ s_{t}, a_{t}) - V (s_{t}))}^{2} .$

(13)

Accordingly, the gradient to update critic network can be represented as:

$\nabla J (w) = \nabla V (s_{t}) [r_{t} + γ V (s_{t + 1} ∣ s_{t}, a_{t}) - V (s_{t})] .$

(14)

The details of our A2C training process are shown in Algorithm 1 and Section 3.2.

Algorithm 1: Training Process of Segmentation Policy Module with A2C

Input: D as training data
for each mini-batch $B \in D$ do
    for each melody sequence $M_{1 : T} \in B$ do
        Initialize $a_{0}$ , $h_{t}^{n}$ , $c_{t}^{n}$ , $h_{t}^{p}$ , $c_{t}^{p}$ , $h_{t}^{s}$ and $c_{t}^{s}$ with zeros;
        for $t \leftarrow 1$ to T do
            Obtain state $s_{t}$ with $h_{t}^{n}$ , $c_{t}^{n}$ , $h_{t}^{p}$ , $c_{t}^{p}$ , $h_{t}^{s}$ and $c_{t}^{s}$ by Equation (7);
            Sample action $a_{t} \sim π_{θ} (a_{t} ∣ s_{t})$ by Equation (8);
            Update $h_{t}^{n}$ , $c_{t}^{n}$ , $h_{t}^{p}$ , $c_{t}^{p}$ , $h_{t}^{s}$ and $c_{t}^{s}$ by $a_{t}$ ;
        end for
        Compute delayed reward by Equation (10);
    end for
    Update $θ \leftarrow θ + α \nabla J (θ)$ using Equation (12) for actor network;
    Update $w \leftarrow w + β \nabla J (w)$ using Equation (14) for critic network;
end for
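The scalar quantities in Equations (11)–(13) can be sketched numerically as follows. This is only an illustration of the one-step advantage estimate: the function names are hypothetical and the discount factor γ = 0.9 and the other numbers are assumed values, not settings reported here. In a framework with automatic differentiation, the two losses below would typically be minimized, with the advantage treated as a constant in the actor loss.

```python
import math


def td_advantage(reward: float, v_s: float, v_next: float, gamma: float) -> float:
    """Eq. (11): one-step estimate r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return reward + gamma * v_next - v_s


def actor_loss(log_prob: float, advantage: float) -> float:
    """Negative of log pi(a_t | s_t) * advantage; minimizing it follows Eq. (12)."""
    return -log_prob * advantage


def critic_loss(advantage: float) -> float:
    """Eq. (13): squared one-step temporal-difference error."""
    return advantage ** 2


# Hypothetical numbers for a single time step.
adv = td_advantage(reward=1.0, v_s=0.5, v_next=0.2, gamma=0.9)
print(round(adv, 4))                                  # 0.68
print(round(actor_loss(math.log(0.25), adv), 4))
print(round(critic_loss(adv), 4))                     # 0.4624
```

A positive advantage increases the probability of the sampled action, and a negative one decreases it, while the critic is pulled towards the bootstrapped target $r_{t} + γ V (s_{t + 1})$.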

2.4. Melody Harmonization Module

The melody harmonization module produces a probability distribution over chord classes based on the content vector from the structured representation module. Formally, we have

$P (c ∣ g) = softmax (W_{s} x^{s} + b_{s}),$

(15)

where $x^{s}$ is the segment-level representation vector from ${LSTM}^{s}$. To train HAR, we use cross entropy as the loss function:

$L_{HAR} = \sum_{g \in G} \sum_{c = 1}^{84} - \hat{p} (c, g) \log P (c ∣ g),$

(16)

where $\hat{p} (c, g)$ is the one-hot distribution of segment sample g and 84 is the number of possible chords in our HAR (84 possible chords are explained in Section 3.1).
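As a short worked step, the one-hot target makes the inner sum in Equation (16) collapse to a single term per segment. Writing $c_{g}^{*}$ for the ground-truth chord class of segment g (a symbol introduced here only for illustration), the loss can be rewritten as:

```latex
L_{HAR}
  = \sum_{g \in G} \sum_{c = 1}^{84} - \hat{p}(c, g) \log P(c \mid g)
  = - \sum_{g \in G} \log P(c_{g}^{*} \mid g),
\qquad \text{since } \hat{p}(c, g) =
  \begin{cases} 1, & c = c_{g}^{*} \\ 0, & \text{otherwise.} \end{cases}
```

For instance, a segment whose correct chord receives probability 0.5 contributes $- \log 0.5 \approx 0.693$ to the loss (using the natural logarithm), and a segment predicted with probability 1 contributes 0.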

2.5. Training Details

As the operations of REP, SEG and HAR are interleaving, the three modules should be trained jointly. The entire training process consists of three steps: (1) As a warm-start to pre-train REP and HAR, we split melodies at every onset position of chords in the dataset to obtain the segment boundaries and utilize GPR from GTTM to acquire the phrase boundaries; (2) we fix the parameters of REP and HAR and train SEG, as Algorithm 1 shows; and (3) we train REP, SEG and HAR jointly: REP provides state representations to SEG, SEG splits the melody sequence into segments and phrases, REP updates the HLSTM conditioned on the sampled actions from SEG and HAR provides chords for the obtained segments.

To pre-train REP and HAR more efficiently, we utilize the chord boundaries from HLSD to be the segment boundaries and apply GPR from GTTM to acquire the noisy phrase boundaries. GPR has been verified as being more efficient in melodic segmentation via psychological experiments [17]. It contains several sub-rules, each of which is developed on different criteria and, therefore, yields different boundaries. In this work, we adopt one of the most commonly used sub-rules, GPR 2b, to obtain the noisy phrase boundaries [23,24,31]. In GPR 2b, edges of melody phrases appear when the difference of inter-onset-intervals ($Δ{IOI}$) is negative.

Given a sequence of melody notes $(m_{1}, m_{2}, \dots, m_{T})$, each of which has an onset time $o_{t}$, the inter-onset-interval between note $m_{t}$ and $m_{t + 1}$ can be calculated as ${IOI}_{t} = o_{t + 1} - o_{t}$. The difference between ${IOI}_{t}$ and ${IOI}_{t + 1}$ is defined as $Δ {IOI}_{t} = {IOI}_{t + 1} - {IOI}_{t}$. When $Δ {IOI}_{t}$ is negative, the note $m_{t}$ can be chosen as a phrase boundary. Figure 5 shows an example of how GPR can be used to identify phrase boundaries.
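The rule described above can be sketched in a few lines of Python. The function name and the onset values are hypothetical; onsets are counted in sixteenth notes and the returned indices are 0-based.

```python
def gpr2b_boundaries(onsets):
    """Return 0-based indices t of notes for which delta-IOI_t is negative.

    IOI_t = o_{t+1} - o_t and delta-IOI_t = IOI_{t+1} - IOI_t, so the last
    two notes of a sequence can never be selected.
    """
    iois = [later - earlier for earlier, later in zip(onsets, onsets[1:])]
    deltas = [later - earlier for earlier, later in zip(iois, iois[1:])]
    return [t for t, delta in enumerate(deltas) if delta < 0]


# Two quarter notes, one half note, then eighth notes.
onsets = [0, 4, 8, 16, 18, 20]
print(gpr2b_boundaries(onsets))   # [2]
```

In this example the IOIs are (4, 4, 8, 2, 2), so the only negative difference follows the half note at onset 8, which is therefore chosen as a phrase boundary.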

3. Experiments and Results

3.1. Dataset

In this work, we use the Hooktheory Lead Sheet Dataset (HLSD) [14] to evaluate our proposed method. The dataset is collected from a user-contributed platform Hooktheory. It consists of high-quality, human-composed melodies along with corresponding chord accompaniments. Compared with other similar datasets, such as CSV Leadsheet Database [13] (collected from Wikifonia.org before the website terminated), HLSD provides more rhythmic chord sequences. Moreover, CSV Leadsheet Database only provides one chord for a melody bar, whereas in HLSD, there can be more than one chord in a melody bar. We can utilize chord onset positions from HLSD as the boundaries of our defined segments and thus pre-train our REP and HAR.

We preprocess HLSD and split the melody/chord pairs every four bars without overlap, since the shortest remaining music piece contains only four bars after removing those without chord lines. That is, each melody sample in our preprocessed dataset contains four bars. In addition, we filter out melody samples that contain fewer than 12 or more than 32 notes. For bars that have fewer than three notes, phrase exploration is redundant. Filtering out samples longer than 32 notes aims to save training time. As Figure 6 shows, samples with a length of 32 (in red) are at a watershed in terms of frequency.

To normalize different characteristics of melodies and chords in different keys and maintain the data consistency, we utilized the C key version of the selected music samples, which is provided by HLSD (C major or c minor based on the original key signature). We divided the preprocessed dataset into a training set containing 14,313 sequences (from 8000 songs) and a test set containing 3666 sequences (from 2000 songs) (available at https://github.com/TeresaTsang/preprocessed-HLSD; accessed on 4 October 2021). In our experiments, the proposed method was able to generate 84 possible chords. The chords can be any type from {major, minor, diminished, seventh, major-seventh, minor-seventh and fully-diminished-seventh} with 12 possible root notes C, C#, D, D#, E, F, F#, G, G#, A, A# and B (also spelled with sharp signs as in Section 2.1). Some chords in HLSD are beyond our chord vocabulary; we transformed these into one of the 84 chords based on their chord constructions. For chords consisting of more than four notes, we kept their root notes and dropped excessive notes, i.e., 9ths, 11ths or additional extensions. Figure 7 shows the chord distribution of our preprocessed HLSD.
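The size of the chord vocabulary follows from 7 chord types × 12 root notes = 84 classes. A minimal sketch that enumerates them is shown below; the ordering of the class indices is an assumption for illustration and need not match the preprocessed dataset.

```python
# Twelve root notes, spelled with sharps as in Section 2.1.
ROOTS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Seven chord types named in the text.
QUALITIES = ["major", "minor", "diminished", "seventh",
             "major-seventh", "minor-seventh", "fully-diminished-seventh"]

# Hypothetical class ordering: iterate over roots first, then chord types.
CHORDS = [(root, quality) for root in ROOTS for quality in QUALITIES]
CHORD_TO_INDEX = {chord: index for index, chord in enumerate(CHORDS)}

print(len(CHORDS))                      # 84
print(CHORD_TO_INDEX[("C", "major")])   # 0
print(CHORD_TO_INDEX[("G", "seventh")])
```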

Our model performs both chord rhythm prediction and melody harmonization. As most of the existing methods are limited to generating a fixed number of chords for one bar or a half-bar, we first extract the chord rhythms from HLSD as the segmentation result and compare our method with the baselines to analyse the contribution of phrase-level structures. Later, we evaluate the performance of our reinforcement learning-based segmentation method.

3.2. Experiment Settings

The note-level, phrase-level and segment-level LSTMs all have a hidden dimension of 256. The Adam algorithm [32] is employed as the optimizer with learning rate $α = 1 \times 10^{-5}$. The batch size is 64 in the pre-training of the three modules, which is adjusted to be 5 when they are trained jointly. The detailed parameter settings of each layer are shown in Figure 8.

3.3. Harmonization Results

3.3.1. Baselines

In this section, we first solely evaluate our REP and HAR using the chord boundaries from HLSD and phrase boundaries obtained by GPR. We compare our proposed method with SVM, CNN, LSTM and BiLSTM+BGS, each of which will provide a chord per obtained segment from HLSD. To fairly evaluate the performance of different models, we used the same symbolic music encoding representation for all the models as introduced in Section 2.1.

SVM: Support Vector Machines (SVMs) [33] are a traditional machine learning method for classification problems. We applied the C-Support Vector Classification (SVC) algorithm in our harmonization task with its default configurations provided by the scikit-learn library [34]. As we have 84 possible classes when generating the target chords, the decision function type is set as “ovo” (one-versus-one), which is the strategy SVC always uses internally for multi-class problems.
CNN: Convolutional Neural Networks (CNNs) have been widely used in music generation tasks. We built the CNN architecture with two 2D convolution layers. The first one is constructed with a kernel size of $3 \times 3$ and followed by a pooling layer of size $2 \times 2$ . The second one has a kernel size of $4 \times 4$ , also followed by a pooling layer of size $2 \times 2$ .
LSTM: LSTM is specialized for processing sequential data. For the experimental setting, a time-distributed input layer is built before the LSTM layer. The input layer has 13 units, representing the sequence of note feature vectors. The one-layer baseline LSTM network has 128 units (to be compared with BiLSTM+BGS), whose output is then fed into a fully-connected output layer with 84 units, representing the generated chord sequence.
BiLSTM+BGS: A BiLSTM-based model with blocked Gibbs sampling was proposed in [15] for melody harmonization. In their work, the melody and partially masked chord sequences were fed into the model and the model was expected to learn the masked chord ground truth. We adopted the blocked Gibbs sampling strategy, which was used in [15] to mask the chord sequences. The sampling process uses an annealed masking probability as the proportion to randomly select chords to be masked. Formally,

$$α_{i} = α_{\min} + \frac{(α_{\max} - α_{\min}) \times i}{N}, \tag{17}$$

where $α_{i}$ is the proportion of variables that remain unchanged at iteration $i$, $N$ is the total number of iterations, set to 128 (the average length of chord sequences in each batch), and $α_{\min} = 0.05$, $α_{\max} = 1$. We employed the architecture proposed in their work with our proposed melody feature representation (a 13-dimensional vector). The masked chord ground truth and melody context are fed into their model and concatenated together. The concatenated context is then sent to a two-layer BiLSTM with a hidden size of 64 (the same as [15]), followed by a dropout layer and a fully-connected output layer with 84 units.
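As a small illustration of Equation (17), the following sketch evaluates the annealed proportion over the iterations. The function name `annealed_proportion` and its argument names are hypothetical; only the formula and the values $N = 128$, $α_{\min} = 0.05$ and $α_{\max} = 1$ come from the text.

```python
def annealed_proportion(i, n_iters=128, alpha_min=0.05, alpha_max=1.0):
    """Evaluate Equation (17): linear interpolation between alpha_min and alpha_max.

    i is the iteration index (0 <= i <= n_iters); the names are illustrative.
    """
    if not 0 <= i <= n_iters:
        raise ValueError("i must lie in [0, n_iters]")
    return alpha_min + (alpha_max - alpha_min) * i / n_iters


# Proportion at every iteration, from the first to the last one.
schedule = [annealed_proportion(i) for i in range(129)]
```

The schedule starts at $α_{\min}$ for $i = 0$ and reaches $α_{\max}$ at $i = N$, growing linearly in between.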

Other parameters involved in all the baselines, such as optimizer, dropout layer and batch size, are the same as the settings in our proposed method.

3.3.2. Metrics

For the harmonization task, HLSD provides ground-truth chord symbols, so we can evaluate harmonization performance by comparing the predictions with the ground-truth labels. Although the evaluation of music can be highly subjective, accuracy gives the most direct and intuitive measure, so we evaluate our method with it first. The accuracy is computed by dividing the number of correctly predicted chords by the total number of predictions. That is,

$$\mathrm{Acc.} = \frac{\text{num. of correct chords}}{\text{num. of predictions}}. \tag{18}$$

In this section, we solely evaluate the performance of REP and HAR; hence, we use all the chord boundaries and directly predict one chord per segment. In other words, we do not need to consider chord rhythm issues. Each predicted chord is placed exactly at the onset time annotated in the dataset, so the number of predictions equals the number of chords in the ground truth. A predicted chord counts as correct only if it is identical to the ground-truth chord.
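The accuracy in Equation (18) can be sketched as follows, assuming one predicted chord label per ground-truth segment. The function name `chord_accuracy` and the chord labels in the usage line are illustrative and are not taken from HLSD.

```python
def chord_accuracy(predicted, reference):
    """Fraction of segments whose predicted chord equals the ground-truth chord."""
    if len(predicted) != len(reference):
        raise ValueError("one prediction per ground-truth chord is expected")
    if not reference:
        raise ValueError("at least one chord is required")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


# Illustrative labels only: three of the four chords match.
example_acc = chord_accuracy(["C", "G", "Am", "F"], ["C", "G", "F", "F"])
```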

In addition to the accuracy, we also employ two metrics to evaluate the quality of harmony and the tonal distances between predicted chords and the ground truth: (1) Melody-Chord Harmonicity (MCH) [35], which measures the harmony between melodies and chords and (2) Tonal Pitch Step Distance (TPSD) [36] to estimate the tonal distances between generated and ground-truth chords.

3.3.3. Results

The evaluation of the harmonization task is shown in Table 1. In terms of accuracy, our proposed method outperformed all the baselines with Acc. reaching 37.42%. This indicates that substructure discovery in melody can help improve harmonization performance. As a reference, the BiLSTM and MTHarmonizer proposed in [14] in 2020 achieved accuracies of 35% and 38% on their compiled HLSD, whereas the accuracy of random guess was 2% (from Figure 7, Zero-R classifier’s accuracy was 43.49%).

This implies that our 37.42% is a respectable score, comparable to the state of the art. To put the comparison in context, the work in [14] preprocesses HLSD by keeping two chords per bar and dropping the superfluous chords, so that each chord covers a half-bar. Their models output one chord per half-bar, which is then compared with the preprocessed ground truth. However, in the actual HLSD, there are numerous pieces where three or more chords accompany a single bar and, in some pieces, one chord accompanying one note is not uncommon.

Our method and baselines by contrast are able to generate chords for melody sequences with different lengths, conforming to the chord boundaries in the raw HLSD. That means our model needs to deal with short sequences while the work in [14] did not. Understandably, our task is more challenging and more error-prone. As a result, we can say that 37.42% accuracy represents a competitive performance.

Regarding the harmonicity metrics, we also computed the metric values on the original dataset as a reference. Before analysing the results, we briefly describe the value ranges of the MCH and TPSD scores. MCH scores range from 0 to infinity, and lower MCH scores indicate higher harmonicity between melody and chord. TPSD scores range from 0 to 1, where 1 indicates perfect tonal similarity between two chords. Since the MCH and TPSD scores of the different methods are very close to each other, we apply a paired-sample t-test to measure the differences between our results and those of the baselines.
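A paired-sample t-test compares two methods through the per-run differences of their scores. The sketch below computes only the t statistic with the Python standard library; the five values per method are made-up placeholders, not results from our experiments, and the helper name `paired_t_statistic` is hypothetical. With five paired runs there are four degrees of freedom, for which the two-sided 5% critical value is approximately 2.776.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(xs, ys):
    """t statistic of a paired-sample t-test on two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("two lists of equal length (at least 2) are required")
    diffs = [x - y for x, y in zip(xs, ys)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


# Hypothetical per-run scores for two methods (NOT measured values).
ours_runs = [0.601, 0.600, 0.602, 0.599, 0.601]
baseline_runs = [0.587, 0.586, 0.588, 0.587, 0.586]

T_CRIT_DF4 = 2.776  # two-sided 5% critical value for 4 degrees of freedom
t_value = paired_t_statistic(ours_runs, baseline_runs)
significant = abs(t_value) > T_CRIT_DF4
```

Comparing the statistic with the critical value is equivalent to checking whether the two-sided p-value is below 0.05.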

As Figure 9 and Figure 10 reveal, the p-values of all the paired t-tests are lower than 0.05, which indicates significant differences between our MCH/TPSD scores and those of the other methods. Interestingly, the human-composed original pieces are worse than most of the other methods with respect to the MCH score, and our proposed method has the second-worst MCH score. This is because a relatively high MCH score can tolerate dissonances in some chord sequences, which is a common practice in music creation. From this perspective, the ability of our proposed method to generate dissonances for different colours is the closest to the original dataset.

On the other hand, the MCH score of our method is in an acceptable range, being only 0.1797 higher than the best MCH score, which was produced by LSTM. As for the TPSD score, our method yielded the best result, indicating that, compared with the other baselines, our model managed to generate chords that were closer to human-composed chords.

3.4. Structure Analysis

3.4.1. Baselines

As few works concern flexible structures in melody, we selected the Melisma Music Analyzer developed by Sleator and Temperley [37] as the baseline in structure analysis. Influenced by GTTM, Melisma formulates a set of rules for each function it provides. We mainly employed the grouping analysis and harmony analysis from Melisma. The grouper program introduced in Melisma aims to group melodic notes into phrases, while the harmony program is able to predict the chord sequence for a given melody. They are very similar to our SEG and HAR in functionality. To compare the performance of different RL algorithms, we selected REINFORCE as the second baseline in the structure analysis, while REP and HAR are still engaged. The experiment setting of REINFORCE is similar to that of A2C except for the different update rule:

$$θ \leftarrow θ + α G_{t} \nabla_{θ} \log (π_{θ} (a_{t} \mid s_{t}; θ)), \tag{19}$$

where $G_{t}$ is the cumulative return with discount rate $γ = 0.95$.
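The update in Equation (19) can be sketched in plain Python as below. This is a minimal illustration and not our training code: it assumes the standard definition of the discounted return, $G_{t} = \sum_{k \geq 0} γ^{k} r_{t+k}$, and it takes the gradients of the log-probabilities as given inputs. The names `discounted_returns` and `reinforce_step` are hypothetical.

```python
def discounted_returns(rewards, gamma=0.95):
    """Return G_t for every step t, accumulated backwards through the episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce_step(theta, grad_log_probs, returns, lr):
    """Apply theta <- theta + lr * G_t * grad log pi(a_t | s_t) for each step.

    theta is a list of floats; grad_log_probs[t] is the gradient of
    log pi(a_t | s_t) with respect to theta, supplied by the caller.
    """
    new_theta = list(theta)
    for g_t, grad in zip(returns, grad_log_probs):
        for j, g in enumerate(grad):
            new_theta[j] += lr * g_t * g
    return new_theta
```

In practice the gradients would come from automatic differentiation of the policy network; they are passed in explicitly here to keep the sketch self-contained.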

3.4.2. Metrics

For the structure segmentation task, we first evaluated the performance of segment identification by comparing with the chord boundaries from HLSD. We employed the widely used metrics Precision, Recall and F1-score to measure the segmentation result. Assume that the predicted segment boundaries are organized in a set $\hat{A} = \{\hat{a}_{1}, \hat{a}_{2}, \dots, \hat{a}_{T}\}$ and the set of ground-truth segment boundaries is denoted as $A = \{a_{1}, a_{2}, \dots, a_{T}\}$; then the Precision can be expressed as:

$$P = \frac{TP}{TP + FP}, \tag{20}$$

where $TP$ stands for the True Positive case, specifically $\hat{a}_{t} = 2$ and $a_{t} = 2$; $FP$ denotes the False Positive case, specifically $\hat{a}_{t} = 2$ and $a_{t} = 0$.

Similarly, Recall is formally written as:

$$R = \frac{TP}{TP + FN}, \tag{21}$$

where $FN$ denotes the False Negative case, specifically $\hat{a}_{t} \neq 2$ and $a_{t} = 2$.

Thus, the F1-score can be computed from Precision and Recall:

$$F1 = \frac{2 P R}{P + R}. \tag{22}$$
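The three segmentation metrics can be computed from two aligned label sequences as in the following sketch. It assumes that the label 2 marks a segment boundary, as in the definitions above, and, as a simplification, it treats every other ground-truth label as a negative. The function name `boundary_prf` and the toy sequences are hypothetical.

```python
SEGMENT = 2  # action label that marks a segment boundary


def boundary_prf(predicted, reference, positive=SEGMENT):
    """Precision, Recall and F1 for boundary labels given per-note actions."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    pairs = list(zip(predicted, reference))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy sequences: 12 boundaries hit, 1 missed and 1 spurious boundary.
toy_reference = [2] * 13 + [0] * 3
toy_predicted = [2] * 12 + [0] + [2] + [0] * 2
toy_p, toy_r, toy_f1 = boundary_prf(toy_predicted, toy_reference)
```

With 12 true positives, one false positive and one false negative, all three values equal 12/13.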

For the prediction of phrase boundaries, there is no ground truth that can be applied to statistically indicate how well the phrases are identified in a melody sequence. We therefore evaluate the final harmonization result to show the importance of phrase-level structures. When considering the structures, one issue that should be addressed is that SEG might predict segment boundaries that differ from the ground truth in HLSD. Hence, we employ the weighted accuracy introduced in Equation (10), here computed on chord roots:

$$WAR = \sum_{\hat{o}_{t} \in \hat{O}_{T}} \frac{\mathbb{1}_{\{\hat{o}_{t} = o_{t}\}} \cdot d_{t}}{o_{t} \cdot d_{t}}. \tag{23}$$

Since the harmony analyser in Melisma can only predict the chord root for a melody segment, $\hat{O}_{T}$ and $O_{T}$ here are the sets of predicted and ground-truth chord root notes mapped to each note $m_{t}$ in melody $M_{1:T}$ (different from $WA$ in Equation (10)).
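A duration-weighted accuracy of this kind can be sketched as follows, under the assumption that the score is the total duration of the correctly labelled notes divided by the total duration of all notes. The function name `weighted_accuracy` is hypothetical. In the usage example, the predicted labels and durations are those of the sample in Figure 11, while the reference labels are invented for illustration and are not the HLSD annotation.

```python
def weighted_accuracy(predicted, reference, durations):
    """Share of the total note duration whose predicted label matches the reference."""
    if not (len(predicted) == len(reference) == len(durations)):
        raise ValueError("all three sequences must have the same length")
    total = sum(durations)
    if total <= 0:
        raise ValueError("total duration must be positive")
    matched = sum(d for p, r, d in zip(predicted, reference, durations) if p == r)
    return matched / total


# Per-note predicted chords and durations of the Figure 11 sample.
pred = ["C", "C", "C", "C", "C", "C", "G", "G", "C", "C", "C", "C", "F", "F"]
durs = [6, 2, 4, 4, 6, 2, 4, 4, 6, 2, 4, 4, 12, 4]
# Hypothetical reference labels (NOT the dataset annotation): last two notes differ.
toy_ref = pred[:-2] + ["C", "C"]
toy_wa = weighted_accuracy(pred, toy_ref, durs)
```

The same function yields a root-based score such as $WAR$ if chord roots are passed instead of full chord symbols.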

3.4.3. Results

From the results shown in Table 2, we can see that the weighted accuracy is much lower than the accuracy we obtained in Table 1. This is understandable because the segment boundaries need to be learned in this section, whereas the results in Table 1 are obtained with the correct segment boundaries. Errors in the segmentation task are inevitable and also lead to a mismatch between the ground-truth chords and the generated chords. However, our method outperformed Melisma in both the segmentation and harmonization tasks. In SEG, the choice of RL algorithm also produced different results. A2C outperformed REINFORCE in all metrics except the Recall of the segmentation, which implies that the superiority of A2C also holds in our melody harmonization task.

As the prediction of phrase boundaries is trained in an unsupervised manner and there is no ground truth for it, to intuitively show the performance of the segmentation task, we illustrate the results of two samples after being split into phrases and segments and assigned chords, in Figure 11 and Figure 12. In each figure, the ground truth from HLSD is written in black, the results from our model are in red and the results from Melisma are in purple. Purple horizontal lines illustrate the chord boundaries identified by Melisma, the blue lines show the phrase boundaries predicted by A2C and the red lines show the segments recognized by A2C. Similarly, the yellow vertical lines illustrate the phrase boundaries predicted by REINFORCE and the orange vertical lines show the segments identified by REINFORCE.

In Figure 11, the sample is from a popular song, “Auld Lang Syne”, where we can clearly see that our model is able to break the melody into variable-length segments and phrases. Although Melisma can also perform segmentation for chord boundaries, our method can yield chord boundaries closer to the dataset. The phrase boundaries (in blue) obtained by our method with A2C can mostly coincide with the grouping results from GPR while the REINFORCE method can only identify one phrase.

For the generated chords, A2C helps to predict more chords correctly than REINFORCE and Melisma. The chords wrongly predicted by A2C and REINFORCE in the second bar are not too far from the melody, as a correct G is predicted right after C. The REINFORCE method also predicts wrongly in the third bar. A possible explanation for this is the lack of substructures identified in this measure. This observation further shows that substructures in melody can help with harmonization. As for Melisma, it predicts chords mainly based on the notes at the chord onset positions.

To intuitively show how the metrics can be used on a music sample, we calculate the $WA$, $WAR$, Precision, Recall and F1-score for the sample in Figure 11. For the chords generated by our method with A2C, we have $\hat{Y}_{14} = \{C, C, C, C, C, C, G, G, C, C, C, C, F, F\}$ for each note in the given melody, while the duration context is $D_{14} = \{6, 2, 4, 4, 6, 2, 4, 4, 6, 2, 4, 4, 12, 4\}$. Then, using Equations (10) and (23), we obtain $WA = 62.5\%$ and $WAR = 87.5\%$ for our method with A2C. Similarly, we get $WA = 50\%$ and $WAR = 62.5\%$ for our method with REINFORCE and $WAR = 25\%$ for the harmony results of Melisma. For the segmentation results of our model, the number of $TP$ is 12 and the numbers of $FP$ and $FN$ are both 1. Hence, by Equations (20)–(22), we get $P = 92.3\%$, $R = 92.3\%$ and $F1 = 92.3\%$ for both A2C and REINFORCE.

Similarly, we obtain $P = 76.92\%$, $R = 90.91\%$ and $F1 = 83.33\%$ for Melisma.

In Figure 12, we can see that our method with both A2C and REINFORCE correctly predicts all the segment boundaries (chord boundaries), whereas Melisma predicts only a few of them. The harmonization results obtained by our method are almost consistent with the ground truth except for the second bar, where the wrongly predicted C chord sounds inharmonious with the notes in the bar. A possible explanation is that a C major chord is learned from pieces that contain more complicated C-rooted chords. The C major chord along with the notes A and D (in the second bar) can constitute a C69 chord, which would then sound fine. In our data processing, however, we converted chords that were out of our vocabulary (a 69 chord in this case) into simpler ones, hence the C major chord. This reveals a limitation of our data-driven method in melody harmonization, where the conversion of chords can introduce deviations. This can be improved in the future by allowing more chord types and collecting more samples for each type. By contrast, the Melisma-generated chords in the first, third and last bars sound wrong even with only a root note. Although our method has some limitations, it still outperforms the traditional rule-based method.

Interestingly, the results of phrase segmentation in Figure 12 only partially coincide with the grouping rules in GTTM. In the first bar, the first note is identified as a single phrase, which is reasonable as the first note in a music sample is important in setting the tone for the whole piece. In the third bar, the first three notes are regarded as individual phrases; here, we can clearly see that the phrases are divided by note duration.

All the results from our experiments show that substructures can be used to better comprehend melody sequences and to help improve harmonization results when compared with the baselines.

4. Conclusions

We present a novel reinforcement learning method that explores segments and phrases to form a hierarchical melody representation, which can help improve melody segmentation and harmonization. We used the onsets of chords in the dataset as the segment boundaries and the phrases obtained by GPR as the phrase boundaries to train the REP and the HAR module as a warm-start. After that, we trained SEG to break the melody into phrases and segments with fixed parameters of REP. Finally, the three modules were trained jointly to learn a refined structured representation for the harmonization task. Experiments showed that structure discovery can help improve the performance of harmonization. This methodology can also be generalized to tackle other MIR tasks where hierarchical structures may lead to better results.

As the segmentation module was trained in a semi-supervised manner (the prediction of phrases is trained in an unsupervised way), we believe the performance of segmentation will be enhanced if it can be supervised by a dataset with well-labelled phrase boundaries for use in harmonization tasks. The similarity between model-segmented and human-annotated phrases could then serve as a criterion for evaluating phrase discovery performance. We adopted grouping rules for more efficient learning, but we did not examine the structures of chords. We believe that higher performance could be achieved by incorporating more rules about chord patterns and harmony analysis in our future work.

Author Contributions

Both authors contributed to this work, including the problem formalization, the development of ideas and the writing of the manuscript. T.Z. implemented the approaches, preprocessed the data and conducted the experiments. F.C.M.L. polished the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The publicly accessible Hooktheory Lead Sheet Dataset (HLSD) is used in our experiments. It is available at https://github.com/wayne391/lead-sheet-dataset, accessed on 4 October 2021. The train/test split used in this paper is available at https://github.com/TeresaTsang/preprocessed-HLSD, accessed on 4 October 2021.

Conflicts of Interest

The authors declare no conflict of interest.

References

Jackson, R. The computer as a “student” of harmony. In Proceedings of the Tenth Congress of the International Musicological Society, Ljubljana, Yugoslavia, 3–8 September 1967.
Ebcioglu, K. An Expert System for Chorale Harmonization. In Proceedings of the Fifth National Conference on Artificial Intelligence, Philadelphia, PA, USA, 11–15 August 1986; pp. 784–788.
Evans, B.; Fukayama, S.; Goto, M.; Munekata, N.; Ono, T. Autochoruscreator: Four-part chorus generator with musical feature control, using search spaces constructed from rules of music theory. In Proceedings of the 40th International Computer Music Conference, Athens, Greece, 14–20 September 2014.
Sarabia, M.; Lee, K.; Demiris, Y. Towards a synchronised Grammars framework for adaptive musical human-robot collaboration. In Proceedings of the 2015 24th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), Kobe, Japan, 31 August–4 September 2015; pp. 715–721.
Jaques, N.; Gu, S.; Bahdanau, D.; Hernández-Lobato, J.M.; Turner, R.E.; Eck, D. Sequence tutor: Conservative fine-tuning of sequence generation models with kl-control. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, NSW, Australia, 6–11 August 2017; pp. 1645–1654.
Hadjeres, G.; Pachet, F.; Nielsen, F. Deepbach: A steerable model for bach chorales generation. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1362–1371.
Brunner, G.; Wang, Y.; Wattenhofer, R.; Wiesendanger, J. JamBot: Music theory aware chord based generation of polyphonic music with LSTMs. In Proceedings of the 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI), Boston, MA, USA, 6–8 November 2017; pp. 519–526.
Shukla, S.; Banka, H. An automatic chord progression generator based on reinforcement learning. In Proceedings of the 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Bangalore, India, 19–22 September 2018; pp. 55–59.
Simon, I.; Morris, D.; Basu, S. MySong: Automatic accompaniment generation for vocal melodies. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Florence, Italy, 5–10 April 2008; pp. 725–734.
Tsushima, H.; Nakamura, E.; Itoyama, K.; Yoshii, K. Interactive Arrangement of Chords and Melodies Based on a Tree-Structured Generative Model. In Proceedings of the 19th International Society for Music Information Retrieval Conference, Paris, France, 23–27 September 2018.
Tsushima, H.; Nakamura, E.; Yoshii, K. Bayesian Melody Harmonization Based on a Tree-Structured Generative Model of Chord Sequences and Melodies. IEEE/ACM Trans. Audio, Speech, Lang. Process. 2020, 28, 1644–1655.
Raczyński, S.A.; Fukayama, S.; Vincent, E. Melody harmonization with interpolated probabilistic models. J. New Music. Res. 2013, 42, 223–235.
Lim, H.; Rhyu, S.; Lee, K. Chord generation from symbolic melody using BLSTM networks. arXiv 2017, arXiv:1712.01011.
Yeh, Y.C.; Hsiao, W.Y.; Fukayama, S.; Kitahara, T.; Genchel, B.; Liu, H.M.; Dong, H.W.; Chen, Y.; Leong, T.; Yang, Y. Automatic Melody Harmonization with Triad Chords: A Comparative Study. arXiv 2020, arXiv:2001.02360.
Sun, C.E.; Chen, Y.W.; Lee, H.S.; Chen, Y.H.; Wang, H.M. Melody Harmonization Using Orderless NADE, Chord Balancing and Blocked Gibbs Sampling. arXiv 2020, arXiv:2010.13468.
Ahlbäck, S. Melody beyond Notes: A Study of Melody Cognition. Ph.D. Thesis, Göteborgs Universitet, Göteborgs, Sweden, 2004.
Deliege, I. Grouping Conditions in Listening to Music: An Approach to Lerdahl & Jackendoff’s Grouping Preference Rules. Music. Perception Interdiscip. J. 1987, 4, 325–359.
Lerdahl, F.; Jackendoff, R. A Generative Theory of Tonal Music; The MIT Press: Cambridge, MA, USA, 1983.
Stein, L. Structure and Style: The Study and Analysis of Musical Forms; Summy-Birchard Company: Evanston, IL, USA, 1962.
Temperley, D. End-accented phrases: An analytical exploration. J. Music Theory 2003, 47, 125–154.
Moore, A. The so-called ‘flattened seventh’ in rock. Pop. Music. 1995, 14, 185–201.
Guan, Y.; Zhao, J.; Qiu, Y.; Zhang, Z.; Xia, G. Melodic Phrase Segmentation By Deep Neural Networks. 2018. Available online: https://arxiv.org/abs/1811.05688 (accessed on 4 October 2021).
Hirai, T.; Sawada, S. Melody2vec: Distributed representations of melodic phrases based on melody segmentation. J. Inf. Process. 2019, 27, 278–286.
Sawada, S.; Yoshii, K.; Hirata, K. Unsupervised Melody Segmentation Based on a Nested Pitman-Yor Language Model. In Proceedings of the 1st Workshop on NLP for Music and Audio (NLP4MusA), Online, 16–17 October 2020; pp. 59–63.
Jiang, N.; Jin, S.; Duan, Z.; Zhang, C. RL-Duet: Online Music Accompaniment Generation Using Deep Reinforcement Learning. Proc. AAAI Conf. Artif. Intell. 2020, 34, 710–718.
Zhang, T.; Huang, M.; Zhao, L. Learning structured representation for text classification via reinforcement learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1928–1937.
Jiang, N.; Jin, S.; Duan, Z.; Zhang, C. When Counterpoint Meets Chinese Folk Melodies. In Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H., Eds.; Curran Associates, Inc.: Barcelona, Spain, 2020; Volume 33, pp. 16258–16270.
Williams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 1992, 8, 229–256.
Greensmith, E.; Bartlett, P.L.; Baxter, J. Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning. J. Mach. Learn. Res. 2004, 5, 1471–1530.
Cenkerová, Z.; Hartmann, M.; Toiviainen, P. Crossing phrase boundaries in music. In Proceedings of the Sound and Music Computing Conferences, Limassol, Cyprus, 4–7 July 2018.
Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
Harte, C.; Sandler, M.; Gasser, M. Detecting harmonic change in musical audio. In Proceedings of the 1st ACM Workshop on Audio and Music Computing Multimedia, Santa Barbara, CA, USA, 27 October 2006; pp. 21–26.
Lerdahl, F. Tonal Pitch Space. Music. Perception: Interdiscip. J. 1988, 5, 315–349.
Sleator, D.; Temperley, D. The Melisma Music Analyzer. Available online: https://www.link.cs.cmu.edu/music-analysis/ (accessed on 4 October 2021).

Figure 1. Phrases and segments in “Auld Lang Syne”. The segment’s start and end align with the change of chords. The phrases are subunits of the segments and are acquired by splitting the segments with the GPR boundary rules (Grouping Preference Rules [17] in the Generative Theory of Tonal Music (GTTM) [18]), which is introduced in Section 2.5.

Figure 2. The overall process of our method.

Figure 3. The note-based representation for symbolic melody. $r_{k}$ contains the pitch class information of the k-th note in the dashed rectangle, while $d_{k}$ represents the duration value of the k-th note.

Figure 4. The interactions between REP, SEG and HAR.

Figure 5. An example of phrase identification by applying GPR in GTTM. The onset time is based on our duration vocabulary introduced in Section 2.1.

Figure 6. The frequency of number of notes in a four-bar sample. The horizontal axis represents the number of notes in a four-bar sample and the vertical axis illustrates the frequency of samples with a specific number of notes.

Figure 7. The chord distribution of our preprocessed HLSD. Each index corresponds to a chord symbol from {Cmaj, Cm, Cdim, C7, Cmaj7, Cmin7, Cdim7...Bmaj, Bm ...Bdim7}.

Figure 8. The experiment settings for REP, SEG and HAR and how they interact with each other.

Figure 9. The paired t-test significance analysis of MCH. We compared the group of our MCH values with those of the other four methods. Each group of MCH values was obtained by five runs of the corresponding method.

Figure 10. The paired t-test significance analysis of TPSD. We compared the group of our TPSD values with those of the other four methods. Each group of TPSD values was obtained by five runs of the corresponding method.

Figure 11. A sample from “Auld Lang Syne” processed by our model and Melisma in C major key.

Figure 12. A sample from “Eight Days A Week” by the Beatles processed by our model and Melisma in C major key.

Table 1. The results of all the models in our harmonization task. The results are the average over five runs.

Method	Acc.	MCH↓	TPSD↑
original dataset	100.00%	1.5262	1.0000
SVM	25.16%	1.3495	0.5865
CNN	26.64%	1.3465	0.5710
LSTM	28.02%	1.3344	0.5860
BiLSTM+BGS [15]	29.33%	1.3881	0.5870
REP+HAR(ours)	37.42%	1.5141	0.6006

Table 2. The results of all the models in structure segmentation task. The results are the average over five runs.

Method	WA↑	WAR↑	P↑	R↑	F1↑
Melisma	—	26.81%	35.93%	20.12%	25.80%
REP+HAR+REINFORCE	23.96%	30.85%	86.36%	54.18%	66.58%
REP+HAR+A2C	25.76%	34.63%	86.56%	54.13%	66.61%

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

