Article

Automatic Assessment of Piano Performances Using Timbre and Pitch Features

by Varinya Phanichraksaphong 1 and Wei-Ho Tsai 2,*
1 International Graduate Program of Electrical Engineering and Computer Science, National Taipei University of Technology, Taipei 10608, Taiwan
2 Department of Electronic Engineering, National Taipei University of Technology, Taipei 10608, Taiwan
* Author to whom correspondence should be addressed.
Electronics 2023, 12(8), 1791; https://doi.org/10.3390/electronics12081791
Submission received: 13 March 2023 / Revised: 6 April 2023 / Accepted: 9 April 2023 / Published: 10 April 2023
(This article belongs to the Special Issue Machine Learning in Music/Audio Signal Processing)

Abstract

To assist piano learners with the improvement of their skills, this study investigates techniques for automatically assessing piano performances based on timbre and pitch features. The assessment is formulated as a classification problem that labels piano performances as “Good”, “Fair”, or “Poor”. For the timbre-based approaches, we propose Timbre-based WaveNet, Timbre-based MLNet, Timbre-based CNN, and Timbre-based CNN Transformers. For the pitch-based approaches, we propose Pitch-based CNN and Pitch-based CNN Transformers. Our experiments indicate that both Pitch-based CNN and Pitch-based CNN Transformers are superior to the timbre-based approaches, attaining classification accuracies of 96.87% and 97.50%, respectively.

1. Introduction

In recent years, computer music technology has blended music with internet information technology, engineering, artificial intelligence, and other disciplines. This inspires composers to produce music in novel ways, for example by letting computers generate music independently through songwriting algorithms [1]. Furthermore, computer music technology can provide a ubiquitous and inexpensive tutoring service for music learners.
Motivated by this, the aim of this study is to design a system that automatically evaluates a piano performance in terms of various musical aspects, including dynamics, expressiveness, rhythm, articulation, timbre, pitch, and chords. Such a system may help piano learners determine their level of playing skill and thereby improve more quickly [2]. Assessing piano playing requires evaluating a variety of skills in the context of piano instruction, particularly for young children. Computer-based techniques can therefore address problems in music education and complement face-to-face instruction at various skill levels. In this research, we provide timbre- and pitch-based assessment methods, since a piano performance combines multidimensional musical abilities, including dynamics and volume, rhythm and technique, hand and body posture, and facial expression.
Playing the piano helps improve hand-eye and body coordination, as it requires precision and strength. While playing, the left and right hands must cooperate simultaneously, yet each hand must be able to play its own notes and rhythms without relying on the other; typically, the right hand plays the melody while the left hand plays the accompaniment, so that each hand appears to perform independently. This independence of the hands allows more flexibility and expression in piano performance. Moreover, both feet are employed to operate the pedals. The coordination of hands and feet supports the simultaneous development of the brain and physical equilibrium [3]. Consequently, the physical control required to play music must be consistent. Smart learning can also introduce new teaching approaches to teachers and students; for example, Ref. [4] evaluated piano playing with processing intelligence, artificial intelligence, and a variety of music learning software. Such tools can influence the behavior of music students. Our goal is to help piano students understand and strengthen their piano-playing skills in order to increase their overall knowledge and ability. In this work, we divided piano sound data into three conditions, “Good”, “Fair”, and “Poor”, and developed several approaches: Timbre-based WaveNet, Timbre-based MLNet, Timbre-based CNN, Timbre-based CNN Transformers, Pitch-based CNN, and Pitch-based CNN Transformers.
A CNN is a type of deep neural network that has been successfully applied to image recognition, object recognition, audio recognition, and other classification problems. It is typically composed of three types of layers: convolution, pooling, and fully connected layers [5]. The first two perform feature extraction, whereas the fully connected layer maps the extracted features into an output. The convolution layer is the key component of a CNN; it consists of a stack of convolution operations and activation functions that condense the input into a compact representation. The convolutional layer slides a filter over the input data and computes the dot product between the filter and the input at each position, generating a feature map that highlights the presence of specific features [6]. To reduce the number of network parameters and prevent overfitting, the pooling layer decreases the spatial dimensions of the feature maps produced by the convolutional layer.
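To make the layer composition concrete, the following minimal Keras sketch stacks a convolution layer, a pooling layer, and a fully connected output layer; the input shape and filter counts are illustrative placeholders, not the settings used in this study.

```python
import tensorflow as tf

# Minimal CNN: convolution extracts local features, pooling downsizes the
# feature maps, and a fully connected layer maps them to class scores.
# The input shape and filter counts are illustrative placeholders.
toy_cnn = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 1)),                              # e.g., a small spectrogram patch
    tf.keras.layers.Conv2D(16, kernel_size=3, activation="relu"),   # sliding filters -> feature maps
    tf.keras.layers.MaxPooling2D(pool_size=2),                      # reduce spatial dimensions
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(3, activation="softmax"),                 # map features to output classes
])
toy_cnn.summary()
```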
On the other hand, Transformers are commonly used in natural language processing tasks; they rely on self-attention mechanisms to process input sequences and generate output sequences. A Transformer consists of encoders and decoders, each containing a series of layers: encoders process input sequences, whereas decoders generate output sequences. Each Transformer layer is composed of two sub-layers: multi-head self-attention and a feed-forward network [7]. The multi-head self-attention layer enables the model to attend to different parts of the input sequence [8], whereas the feed-forward network applies non-linear transformations to the self-attention layer’s output.
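The two sub-layers of an encoder layer can be sketched in Keras as follows; the sequence length, head count, key dimension, and feed-forward width are illustrative placeholders rather than values taken from this paper.

```python
import tensorflow as tf

def encoder_layer(x, num_heads=4, key_dim=32, ff_dim=128):
    """One Transformer encoder layer: multi-head self-attention plus a
    position-wise feed-forward network, each followed by a residual
    connection and layer normalization. All sizes are placeholders."""
    attn = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(x, x)
    x = tf.keras.layers.LayerNormalization()(x + attn)           # residual + norm
    ff = tf.keras.layers.Dense(ff_dim, activation="relu")(x)     # feed-forward expansion
    ff = tf.keras.layers.Dense(x.shape[-1])(ff)                  # project back to model width
    return tf.keras.layers.LayerNormalization()(x + ff)          # residual + norm

inputs = tf.keras.Input(shape=(100, 64))    # (sequence length, feature dimension)
encoder = tf.keras.Model(inputs, encoder_layer(inputs))
encoder.summary()
```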
The remainder of this paper is organized as follows. Section 2 reviews the related literature. The methodology is presented in Section 3. Next, we discuss our experimental results in Section 4. In Section 5, we present our conclusions and outline directions for future work.

2. Related Work

In the field of intelligent music teaching, there are still few studies on the assessment of music performance. The work in [9] reforms intelligent piano teaching curricula and strengthens the teaching content through networking. Such computer-based approaches involve a piano note recognition algorithm that computes description data from spectrograms of a piano MIDI dataset, recompiled in multi-voice models to establish sound recognition. One such intelligent music system is an augmented reality (AR) system called “Oncall Piano Sensei”, a portable AR piano training system that integrates AR technology effectively [10]. A portable piano system allows users to study anywhere at an affordable price and with no restrictions on available space. Optical music recognition (OMR) was initially implemented so that the system could play a melody automatically. The system can also assess whether the user presses a piano key correctly to create a sound and gives visual feedback on whether the key was pressed correctly. In addition, the system projects the piano onboard together with a melody sheet and full-instruction visualization through a smart device, while displaying the melody sheet with motion guidance on smartphones. Besides detecting the user’s hand movement in real time and providing suggestions and feedback, it can help beginners learn basic piano skills more effectively [11].
More recently, several studies have investigated musical performance classification and evaluation. One employed a dynamic time-warping measure for classifying music performances and achieved an accuracy of up to 90% [12]. Stepping into new computational intelligence via deep learning, [13] built an intelligent piano teaching system based on neural networks. In our previous work [14], we categorized performances of staccato and legato patterns using various machine learning approaches; the CNN model was found to be superior for both staccato and legato, at 89% and 93% accuracy, respectively. A closely related study [15] investigated time-series information of Indian classical music processed via a multi-approach deep computational network.

3. Methodology

In this study, the automatic assessment of a piano performance is formulated as the problem of classifying a piano sound as “Good”, “Fair”, or “Poor”. The classification is based on two features of piano sounds: timbre and pitch. As shown in Figure 1, a piano performance is determined to be Good, Fair, or Poor by fusing the results of the timbre-based evaluation and the pitch-based evaluation.
Figure 2 shows spectrograms of example “Good”, “Fair”, and “Poor” piano performances of the line “Lay thee down now and rest, may thy slumber be blessed” in Brahms’ Lullaby. The x-axis and y-axis of a spectrogram represent time and frequency [16], respectively. The brightness of the spectrogram represents amplitude, where darker pixels indicate smaller values and brighter pixels indicate larger values. Several observations can be made. In the good performance, the resulting amplitudes are relatively stable for every played note. In contrast, in the fair performance, discontinuous amplitudes reveal mistakes on one or two notes during playing; as these mistakes are minor, we consider the performer’s skill to be at a moderate level. In the poor performance, the resulting amplitudes are unstable compared with the good performance example.
Figure 3 shows pitch contour examples of “Good”, “Fair”, and “Poor” performances. Each contour contains multiple notes with varying levels of loudness, and the frequency at any one point may be related to a later one. Here, the values are taken from Brahms’ Lullaby between bars 9 and 12. In all three performances, the baseline note between bars 9 and 12 of the same verse is A0, and the first note in the A4 range has the highest frequency, 440 Hz; the frequency of A4, or A440, is considered the standard musical pitch for this melody [17]. Each performance is measured in terms of the pitch range on the scale, the frequency, and the time interval in seconds.
The accuracy of the “Good”, “Fair”, and “Poor” performances follows the up-and-down movement of the melody’s pitch contour (blue line). A poor performance is characterized by stacked frequencies, also called clusters of frequencies, which indicate poorly played melodies. A fair performance has gaps in the normal frequency range, whereas a good performance shows smooth melodic movement and consistent frequency compared with a poor or fair performance.
Pitch and timbre are two essential characteristics of a musical performance that enable evaluation of its quality [18]. Timbre refers to the quality of a sound that distinguishes it from other sounds of the same pitch and loudness; it is determined by the signal’s harmonic content. Pitch is the frequency of the sound’s fundamental component, which repeats in the waveform [19]. For instance, a note produced by a guitar has the same pitch but a different timbre compared to the same note produced by a piano, because of the differences in their harmonic content.

3.1. Timbre-Based Evaluation

Our timbre-based evaluation is composed of four modules, described in Section 3.1.1, Section 3.1.2, Section 3.1.3 and Section 3.1.4. The components for evaluating a piano performance are shown in Figure 4.

3.1.1. Timbre-Based WaveNet Approach

Timbre-based WaveNet builds on WaveNet, an advanced deep generative neural network for unprocessed (raw) audio waveforms [20,21]. It can model a wide variety of audio datasets and can also synthesize other audio signals, including music; many impressive examples of autonomously generated piano compositions have been presented.
In our approach, the Timbre-based WaveNet, a one-dimensional CNN (1D-CNN), uses these timbre features in an end-to-end classification architecture [22].
The Timbre-based WaveNet architecture is shown in Figure 5. The input to the model is a monophonic conversion of the raw piano waveform, which is then converted into Mel-Frequency Cepstral Coefficients (MFCCs). The features are extracted by a series of one-dimensional convolutional layers (1D Conv-layers): the first 1D Conv-layers capture a global view of the raw data and extract local features, whereas the remaining 1D Conv-layers obtain a more in-depth view of the data to identify discriminative features useful for the classification task. Timbre-based WaveNet serves as a baseline model for signal samples (a.k.a. 1D-CNN). It is composed of two feature encoder blocks, each consisting of two convolutional layers and one max pooling layer. The first block uses a kernel size of 3 and a filter size of 32 in both convolutional layers; the second uses a kernel size of 5 and a filter size of 64 to expose the significant features inside the signal representation, and the max pooling layers use a kernel size of 2 in both blocks. The output features are then flattened and mapped to a fully connected (FC) layer of 256 units, followed by an output layer of 3 units whose logits are passed to a SoftMax activation.
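Assuming the 510-point MFCC input vector listed in Appendix A (Table A1), a minimal Keras sketch of this 1D-CNN baseline is shown below; padding and exact output shapes may differ slightly from the appendix.

```python
import tensorflow as tf

# Sketch of the Timbre-based WaveNet (1D-CNN) classifier described above:
# two encoder blocks (two Conv1D layers + max pooling each), then FC 256
# and a 3-class softmax output. The 510-point MFCC input follows Table A1.
timbre_wavenet = tf.keras.Sequential([
    tf.keras.Input(shape=(510, 1)),                               # MFCC feature vector
    tf.keras.layers.Conv1D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.Conv1D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
    tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),               # Good / Fair / Poor
])
```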

3.1.2. Timbre-Based MLNet Approach

Timbre-based MLNet (“Multi-level Network”) is a machine learning network that utilizes multi-level feature learning. In our approach, the Timbre-based MLNet model is a two-dimensional CNN classifier (2D-CNN) whose features are processed sequentially. MLNet-style models have been shown to be superior to conventional machine learning classifiers on a wide variety of audio datasets reported in benchmarks such as FSDnoisy18k, the Acted Emotional Speech Dynamic Database (AESDD), CHiME-Home, and Device and Produced Speech (DAPS) [23]. Sound classification can employ image-like features, and MLNet is similar to image classification networks such as AlexNet and GoogLeNet, which have also been employed for sound classification [24].
As shown in Figure 6, a piano sound is converted into three-dimensional Mel-frequency Cepstral Coefficients (3D MFCC), which serve as the input to the Timbre-based MLNet model.
The 3D MFCC consists of the spec_bw, spec_centroid, and chroma_stft attributes, leading to a CNN input of shape (63, 1149, 1), i.e., three dimensions (3D). In this task, we use Conv2D as the filter layer because Conv2D takes three-dimensional input.
The Python library Librosa is used to compute the 3D MFCC, with the frame length fixed to 25 ms and the frame overlap rate set to 50% [25]. Timbre-based MLNet, another baseline network, encodes the signal instances through two convolutional blocks with a kernel size of 4 and a filter size of 32, plus one max pooling layer, before flattening into a fully connected (FC) vector that is passed to FC1 with 512 units, FC2 with 64 units, and a final FC layer with 3 units corresponding to the number of classes. The resulting output vectors are mapped through a SoftMax activation for inference.
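The feature extraction and classifier just described might be sketched as follows; how the MFCC, spectral bandwidth, spectral centroid, and chroma attributes are stacked into the (63, 1149, 1) input, and the coefficient counts, are our assumptions, while the librosa parameters simply translate the 25 ms frame length and 50% overlap.

```python
import librosa
import numpy as np
import tensorflow as tf

def extract_3d_mfcc(path, sr=22050):
    """Frame-level features with a 25 ms window and 50% overlap.
    Stacking MFCC, spectral bandwidth, spectral centroid, and chroma into one
    image-like array is our assumption about how the 3D MFCC input is formed;
    the coefficient counts are placeholders."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    n_fft = int(0.025 * sr)      # 25 ms frame length
    hop = n_fft // 2             # 50% overlap
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40, n_fft=n_fft, hop_length=hop)
    bw = librosa.feature.spectral_bandwidth(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    cent = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    feats = np.concatenate([mfcc, bw, cent, chroma], axis=0)  # (feature rows, frames)
    return feats[..., np.newaxis]                             # add a channel dimension

# Sketch of the Timbre-based MLNet (2D-CNN): two Conv2D blocks with kernel 4
# and 32 filters, one max pooling, then FC 512 -> FC 64 -> 3-class softmax.
timbre_mlnet = tf.keras.Sequential([
    tf.keras.Input(shape=(63, 1149, 1)),
    tf.keras.layers.Conv2D(32, kernel_size=4, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(32, kernel_size=4, strides=2, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
```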

3.1.3. Timbre-Based CNN Approach

Recently, CNNs have been successfully applied to audio recognition problems such as music tagging, environmental/urban sound classification [26,27,28,29], and automatic speech recognition [30,31]; for example, two well-known image recognition networks, AlexNet and GoogLeNet, have been investigated for the classification of environmental sounds [30]. The convolutional neural network (CNN) employs a feed-forward architecture and is regarded as a distinctive form of deep artificial neural network [31]. To facilitate comprehension, we refer to our model as Timbre-based CNN in this study, with parameters optimized for the songs for kids. For multiclass classification problems, a CNN typically employs the SoftMax activation function at its output [32,33]. Prior to the SoftMax layer, the raw piano data were converted from a waveform representation to Mel-frequency Cepstral Coefficients (MFCCs). As demonstrated in Figure 7, the Timbre-based CNN performs this processing with a single convolution layer. The layer applies 3 × 2 filters with the Rectified Linear Unit (ReLU) activation function, defined in Equation (1), where x is an input value:
F(x) = \max(0, x) \qquad (1)
The architecture of the single-layer CNN shows that the feature maps are produced by one convolutional layer.
As previously mentioned, in the first convolutional stage the 160 × 1 input passes through a single convolutional layer with a kernel size of 3 × 2 and 64 filters, and the features are pooled by a max pooling layer with a 2 × 2 kernel; this kernel captures detailed local features. All features are then flattened and fed into a fully connected (FC) layer of 100 units, followed by a final fully connected layer whose size equals the number of classes. To turn the prediction into a probabilistic output, the logit vector is mapped through the SoftMax activation.
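Assuming the 160-point MFCC input given in Appendix A (Table A3) and a one-dimensional reading of the 3 × 2 kernel, a minimal sketch of this single-convolution-layer model is:

```python
import tensorflow as tf

# Sketch of the single-convolution-layer Timbre-based CNN: one Conv1D layer
# (64 filters, ReLU), max pooling, FC 100, and a 3-class softmax output.
timbre_cnn = tf.keras.Sequential([
    tf.keras.Input(shape=(160, 1)),                                              # MFCC feature vector
    tf.keras.layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),                              # Good / Fair / Poor
])
```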

3.1.4. Timbre-Based CNN Transformers Approach

Timbre-based CNN Transformers combines a deep neural network with convolutional and pooling layers (CNN), which can be applied in a variety of fields including image and sound classification, with a transformer encoder; the design is inspired by the vision transformer [34]. Transformers have achieved exceptional success in natural language processing, for example in speech-to-text and voice-to-text classification [35], and they can also be tuned to take images or audio as input for classification. The purpose of our system is to create a hybrid Timbre-based CNN Transformers network, which modifies the Timbre-based CNN, combines it with a transformer encoder network, and sets new hyperparameters for the songs for kids. The input is raw piano data, and feature extraction uses Chroma_stft, a spectrogram-based feature related to the Mel-frequency Cepstral Coefficient (MFCC), computed with the Librosa Python library, in line with previous research.
As shown in Figure 8, this approach utilizes the transformer encoder part of the network, which is composed of self-attention modules and feed-forward layers. This part widens the receptive field of the self-localized features through 4 multi-head attention layers with a head dimension of 256 and a feed-forward block that extracts top-down encoded features. Each block aggregates its input feature maps to reweight and rescale the input representation. The features are then passed to one-dimensional average pooling over the signal samples and a fully connected (FC) layer of 128 units with a dropout rate of 0.4, followed by a final fully connected layer of three units corresponding to the number of classes. Finally, the logit vector is passed through the SoftMax function and mapped to the label vector for inference.
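Following this description and the layer listing in Appendix A (Table A4), one encoder block of the Timbre-based CNN Transformers might be sketched as follows; whether the block is repeated four times and the exact pooling axis are our assumptions.

```python
import tensorflow as tf

def transformer_block(x, num_heads=4, key_dim=256, ff_filters=4, dropout=0.25):
    """One encoder block: multi-head self-attention followed by a small
    1D-convolutional feed-forward part, each with a residual connection."""
    a = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x)
    a = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(a, a)
    a = tf.keras.layers.Dropout(dropout)(a)
    x = x + a                                                      # residual connection
    f = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x)
    f = tf.keras.layers.Conv1D(ff_filters, kernel_size=1, activation="relu")(f)
    f = tf.keras.layers.Dropout(dropout)(f)
    f = tf.keras.layers.Conv1D(x.shape[-1], kernel_size=1, activation="relu")(f)
    return x + f                                                   # residual connection

inputs = tf.keras.Input(shape=(201, 1))            # Chroma_stft feature vector
x = transformer_block(inputs)
x = tf.keras.layers.GlobalAveragePooling1D(data_format="channels_first")(x)
x = tf.keras.layers.Dense(128, activation="relu")(x)
x = tf.keras.layers.Dropout(0.4)(x)
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)        # Good / Fair / Poor
timbre_cnn_transformer = tf.keras.Model(inputs, outputs)
```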

3.2. Pitch-Based Evaluation

As piano sounds are music, it is reasonable to assume that the 18 songs for kids each have a unique pitch pattern that can be used to differentiate them from one another. Since pitch corresponds to the fundamental frequency, a recording of piano sounds can be viewed as a series of fundamental frequencies. Let e_m, 1 ≤ m ≤ M, represent the inventory of possible piano notes. Our objective is to determine which of the M possible notes is most likely to be played at any given moment in a piano recording. We adopt the method described in [36] to tackle this issue. Initially, the piano sound is segmented into frames using a P-length sliding Hamming window with a 0.5P-length overlap between frames. Every frame is then subjected to a Fast Fourier Transform (FFT) of size J. Let x_{t,j} represent the signal’s energy at FFT index j in frame t, where 1 ≤ j ≤ J, and assume x_{t,j} has been normalized to the interval 0 to 1. The energy of the signal on the mth note in frame t can then be approximated by Equation (2) as
\hat{x}_{t,m} = \max_{j:\, U(j) = e_m} x_{t,j}, \qquad (2)
and
U(j) = \left\lfloor 12 \cdot \log_2\!\left( \frac{F(j)}{440} \right) + 69.5 \right\rfloor \qquad (3)
where ⌊·⌋ is the floor operator, F(j) is the frequency corresponding to FFT index j, and U(·) is a function that converts FFT indices into MIDI note numbers. Human perception of musical intervals is approximately logarithmic with respect to fundamental frequency, and the notes of the scale are perceived to repeat once per octave of 12 semitones, as indicated by Equation (3). MIDI also assigns the number 69 to the fundamental frequency of 440 Hz, the “standard pitch” or modern “concert pitch”. When note m is played in frame t, the resulting energy x̂_{t,m} should be the highest among x̂_{t,1}, x̂_{t,2}, …, x̂_{t,M}. Occasionally, however, the energy of a true note will be less than that of one of its harmonic notes. To avoid the interference of harmonics in the estimation of true notes, we employ Subharmonic Summation (SHS) [37], which calculates a “strength” value for each possible note by adding the signal’s energy on that note and on its harmonics. In particular, the strength of note m in frame t is computed using Equation (4):
y_{t,m} = \sum_{c=0}^{C} h^{c}\, \hat{x}_{t,m+12c}, \qquad (4)
where C is the number of harmonics taken into account and h is a positive number less than 1 that attenuates the contribution of higher harmonics. As a consequence of the summation in Equation (4), the true note typically receives the most energy from its harmonic notes. Thus, the true note in frame t can be determined by selecting the note number corresponding to the highest strength value. Recognizing that a note typically spans multiple frames, the decision can be improved by incorporating information from adjacent frames. Specifically, the note in frame t is determined by selecting the note number associated with the highest strength accumulated over the neighboring frames, i.e., Equation (5):
O_t = \arg\max_{1 \le m \le M} \sum_{b=-W}^{W} y_{t+b,m}, \qquad (5)
In addition, the resulting note sequence is refined by taking frame continuity into account. This is accomplished through median filtering, which replaces each note with the local median of the notes in its W neighboring frames in order to eliminate jitters between adjacent frames.
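A numpy sketch of this note-estimation pipeline, following Equations (2)–(5), is given below; the window length P, FFT size J, harmonic count C, decay factor h, and smoothing width W are placeholder values, since the paper does not list them.

```python
import numpy as np
from scipy.signal import medfilt

def estimate_notes(y, sr=22050, P=2048, J=4096, C=4, h=0.8, W=2,
                   m_low=21, m_high=108):
    """Frame-wise piano note estimation following Equations (2)-(5).
    P, J, C, h, and W are placeholder values; MIDI notes 21-108 cover
    the 88 piano keys."""
    hop = P // 2                                    # 0.5P overlap between frames
    window = np.hamming(P)
    n_frames = 1 + (len(y) - P) // hop
    freqs = np.arange(J // 2 + 1) * sr / J          # F(j) for each FFT index j
    # Equation (3): map FFT indices to MIDI note numbers.
    U = np.full(freqs.shape, -1, dtype=int)
    pos = freqs > 0
    U[pos] = np.floor(12 * np.log2(freqs[pos] / 440.0) + 69.5).astype(int)
    notes = np.arange(m_low, m_high + 1)
    strength = np.zeros((n_frames, len(notes)))
    for t in range(n_frames):
        frame = y[t * hop: t * hop + P] * window
        x = np.abs(np.fft.rfft(frame, n=J)) ** 2    # signal energy per FFT index
        x = x / (x.max() + 1e-12)                   # normalize to [0, 1]
        # Equation (2): note energy = max energy over the FFT indices of that note.
        x_hat = np.zeros(m_high + 12 * C + 1)
        for m in range(m_low, m_high + 12 * C + 1):
            idx = np.where(U == m)[0]
            if idx.size:
                x_hat[m] = x[idx].max()
        # Equation (4): Subharmonic Summation over C harmonics with decay h.
        for i, m in enumerate(notes):
            strength[t, i] = sum(h ** c * x_hat[m + 12 * c] for c in range(C + 1))
    # Equation (5): strongest note, accumulated over +/- W adjacent frames.
    acc = np.array([strength[max(0, t - W): t + W + 1].sum(axis=0)
                    for t in range(n_frames)])
    note_seq = notes[np.argmax(acc, axis=1)]
    # Median filtering over neighboring frames removes jitters between frames.
    return medfilt(note_seq, kernel_size=2 * W + 1).astype(int)
```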
Here, Pitch-based CNN and Pitch-based CNN Transformers are the components used to evaluate a piano performance from pitch. Figure 9 shows a block diagram of the pitch-based evaluation system.

3.2.1. Pitch-Based CNN Approach

Investigations of encoding and decoding have shown that the pitch of complicated real-world sounds can be extracted and transformed; by employing an encoder, we can extract and categorize sounds in order to evaluate them. A Pitch-based CNN approach [38] adapts the CNN to raw pitch data, using pitch rather than MFCC features as the input.
The Pitch-based CNN model is based on three sequential convolutional blocks, each consisting of a 1D convolution layer, a batch normalization layer, and a ReLU activation layer. The extracted features pass through 1D global average pooling, which turns the 3D features into 2D features over the batch, and are flattened into a final fully connected (FC) layer corresponding to the number of classes, i.e., three units with a SoftMax activation compared against the label vectors, as depicted in Figure 10.
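With the 160-point pitch input listed in Appendix A (Table A5), the Pitch-based CNN can be sketched roughly as:

```python
import tensorflow as tf

# Sketch of the Pitch-based CNN: three Conv1D + BatchNorm + ReLU blocks,
# 1D global average pooling, and a 3-class softmax output.
pitch_cnn = tf.keras.Sequential([tf.keras.Input(shape=(160, 1))])   # pitch (note number) sequence
for _ in range(3):
    pitch_cnn.add(tf.keras.layers.Conv1D(64, kernel_size=3))
    pitch_cnn.add(tf.keras.layers.BatchNormalization())
    pitch_cnn.add(tf.keras.layers.ReLU())
pitch_cnn.add(tf.keras.layers.GlobalAveragePooling1D())
pitch_cnn.add(tf.keras.layers.Dense(3, activation="softmax"))        # Good / Fair / Poor
```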

3.2.2. Pitch-Based CNN Transformers Approach

In the same encoding context, but with the features encoded through a transformer instead, the pitch-based approach builds on an essential characteristic with many applications: pitch is a key factor in speech and music signals [39]. Our pitch-based model works on a perceptual attribute associated with the fundamental frequency (or periodicity) of sound, and the encoding-only model (without decoding) is employed to evaluate the song playing. The proposed Pitch-based CNN Transformers model has the same structure as the Timbre-based CNN Transformers.
Figure 11 shows the Pitch-based CNN Transformers, which is similar to the Timbre-based CNN Transformers but changes the input to pitch-based features (pitch data). We employ the transformer encoder part of the network, consisting of self-attention layers and a feed-forward block. This part extracts top-down features by widening the localization fields of the self-attentive features through 4 multi-head attention layers with a head dimension of 256 and a feed-forward layer. Each block processes its input feature maps and forwards them to reweight and rescale the features. The features are then passed to one-dimensional average pooling and a fully connected (FC) layer of 128 units with a dropout rate of 0.4, followed by a final fully connected (FC) layer of width three corresponding to the number of classes; the SoftMax function turns the final logits into a probability vector, which is mapped to the label vector for evaluation.
The full architectures of the aforementioned approaches are illustrated in Appendix A.

4. Experiments

4.1. Experiment Configuration

The piano sounds were recorded using an Aputure V-Mic D2, a directional condenser shotgun microphone with adjustable sensitivity. The baseline setup used a sample rate of 22,050 Hz, a monophonic channel, and a 16-bit resolution. The dataset was divided into training and test data in a ratio of 80:20, and the networks were implemented with the TensorFlow deep learning framework. Our experiments were run on an RTX 3080 8 GB graphics card, an Intel i7-10870H CPU, 32 GB of DDR4-3200 RAM, and a 1 TB M.2 PCIe SSD.

4.2. Songs for Kids’ Performance Description

In this study, we set several goals and requirements. The dataset contains the eighteen songs for kids used in this experiment: Alphabet Song, Amazing Grace, Au Clair De La Lune, Beautiful Brown Eye, Brahms Lullaby, Can Can, Clementine, Deck the Hall, Hot Cross Buns, Jingle Bells, Lavender Blue, London Bridge, Mary Had a Little Lamb, Ode to Joy, Oh Susanna, Skip to My Lou, The Cuckoo, and This Old Man. These songs were taken from Easy Piano Songs for Kids: 40 Fun & Easy Piano Songs for Beginners by Thomas Johnson. Four piano teachers participated in the data recording; most of them were qualified and had passed the grade 8 advanced level of the Associated Board of the Royal Schools of Music (ABRSM) or Trinity’s Piano Certificate exams, certifications that are accepted globally. For the data collection, we also recruited nine Taiwanese piano students, all of whom had started piano lessons, together with piano teachers who teach at the Tianmu and Taipei City Hall branches of the Elephant Music School. The participants, ranging from kindergarten to middle school and having studied for about one and a half years, are divided into three age groups: kindergarten (3–5 years old), primary school (6–11 years old), and middle school (12–14 years old). We now provide additional details regarding the three quality levels, “Good”, “Fair”, and “Poor”, each of which is distinct. Additionally, we verified each participant’s ability to play at beginner, intermediate, or advanced level by listening to their recordings of the songs for kids and comparing them with the original pieces. The annotations of the four experts were combined through a voting process; they graded each performance as “Good”, “Fair”, or “Poor”. In a musical performance, a “Good” performance means playing the song correctly and rendering the rhythm, pitch (including intonation), dynamics, and articulation as written; students able to perform well do this accurately, with clear recognition and comprehension of the song being performed. The performance must consist of distinct, sharp sounds and straight notes, produced by pressing the keys with the correct hand weight and never missing a key press. Moreover, performers giving a “Good” performance should convey emotion by capturing the correct style, mood, and tone of the song, and should play with precise technique, including symbolic music notation and articulation (staccato, legato, ties, and so on) and expressive timing (all notes measured in halves, quarters, and eighths). Pianists must achieve these standards for a “Good” performance when performing on stage or in piano competitions. In a “Fair” performance, by contrast, the key presses may contain some errors, such as pressing one or two keys simultaneously while playing a scale. In addition, the playing style and technique of a “Fair” performance may show difficulty in communicating emotion and expression, resulting in an unexpressive performance with unclear voicing and some errors. A “Poor” performance is characterized by two or more inaccuracies, in which phrases of the song are played with incorrect notes or rhythm and some note errors are repeated.

4.3. Experiment Setup and System Implementation

In this section, to aid comprehension, we refer to the nine piano students and four piano teachers collectively as the “participants”, for a total of thirteen participants. Piano recordings of the 18 songs for kids from the 13 participants were used, giving 6318 samples in total, comprising 5058 training samples and 1260 testing samples, which were used to examine the effectiveness of the models in each group and to measure the findings precisely. Each of the 18 songs for kids contributed 351 samples from the participants; according to the 80:20 ratio, these were divided into 281 training samples and 70 testing samples per song.
With regard to the training data, six participants can be grouped as follows: data from the first and second teachers represent “Good” performance, data from the fourth and fifth students represent “Fair” performance, and data from the second and third students represent “Poor” performance, giving 1264 training samples in total. Seven performers generated the data for evaluation: data from the third and fourth teachers represent “Good” performance, data from the sixth and eighth students represent “Fair” performance, and data from the first, seventh, and ninth students represent “Poor” performance. Therefore, there were 5058 test samples in total. For the 18 songs for kids, MFCC features were computed for the timbre-based extraction group, and the frequencies of MIDI note numbers, rather than fundamental frequencies, were used for pitch-based extraction, as presented in Table 1.
For the system implementation, our hyperparameter tuning uses a Stochastic Gradient Descent (SGD) optimizer, a training batch size of 32, a learning rate of 2 × 10⁻², and a weight decay of 2 × 10⁻⁴. The models are trained for 50 epochs. To reduce overfitting, we adjusted the dropout rates used in batch training, randomly dropping neurons at rates within [0.0, 0.5] to prevent the model from relying too heavily on a single feature or subset of features. Our cost function is sparse categorical cross-entropy. The number of classes in the categorization process is set to three, i.e., “Good”, “Fair”, and “Poor”, as discussed in the previous section.
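These settings translate into roughly the following Keras calls; the model and data below are placeholders only (any of the models sketched in Section 3 can be substituted), and the weight_decay argument assumes a recent TensorFlow/Keras optimizer API.

```python
import numpy as np
import tensorflow as tf

# Placeholder model and data, only to make the training calls concrete; any of
# the models sketched in Section 3 can be substituted for `model`.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(160, 1)),
    tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dropout(0.3),                    # dropout rate chosen within [0.0, 0.5]
    tf.keras.layers.Dense(3, activation="softmax"),
])
x_train = np.random.rand(320, 160, 1).astype("float32")   # dummy feature vectors
y_train = np.random.randint(0, 3, size=320)               # dummy Good/Fair/Poor labels

# SGD optimizer, learning rate 2e-2, weight decay 2e-4 (requires TF >= 2.11),
# sparse categorical cross-entropy, batch size 32, 50 epochs.
optimizer = tf.keras.optimizers.SGD(learning_rate=2e-2, weight_decay=2e-4)
model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=50, validation_split=0.2)
```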
Furthermore, our sub-network implementation uses the ReLU activation function after the convolution layers to boost the activation maps, and the SoftMax function at the final layer to turn the output vectors into confidence predictions. The two network groups in our proposed techniques, described in Section 3, consist of multiple networks, each of which uses these functions internally.

4.4. Experiment Results

In this section, the experiments are divided into two major groups covering all 18 songs for kids, evaluated separately for each model: the first group consists of the timbre-based models (Timbre-based WaveNet, Timbre-based MLNet, Timbre-based CNN, and Timbre-based CNN Transformers), and the second group consists of the pitch-based models (Pitch-based CNN and Pitch-based CNN Transformers). The outcomes differ considerably, and each model is assessed by the distribution of correct predictions across the three categories: “Good”, “Fair”, and “Poor”. The accuracy of the piano evaluation system (in percent) is defined in Equation (6):
\mathrm{Accuracy}\ (\%) = \frac{\#\ \mathrm{Correctly\ Classified\ Songs}}{\#\ \mathrm{All\ Test\ Songs}} \times 100\% \qquad (6)
Performance is measured by precision, recall, F1 score, and accuracy rate, where TP denotes a true positive, TN a true negative, FP a false positive, and FN a false negative. Precision is the ratio of correctly predicted positive results (TP) to all predicted positive observations (TP + FP), while recall is the ratio of correctly predicted positive results to all actual positive observations (TP + FN). The F1 score is the weighted average of precision and recall, and therefore accounts for both false positives and false negatives. The accuracy rate is the ratio of correctly predicted observations (TP + TN) to all observations (TP + FP + FN + TN). The formulas for these measures are given in Equations (7)–(9), respectively.
\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (7)
\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (8)
F_1\ \mathrm{score} = \frac{2 \cdot \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}} \qquad (9)
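For reference, these measures can be computed from the true and predicted labels with scikit-learn; the label arrays below are placeholders, and the macro average is our assumption for the multiclass case.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder label arrays standing in for the true and predicted classes
# (0 = Poor, 1 = Fair, 2 = Good is an arbitrary encoding for this sketch).
y_true = np.array([2, 2, 1, 0, 1, 2, 0, 0, 1, 2])
y_pred = np.array([2, 1, 1, 0, 1, 2, 0, 1, 1, 2])

accuracy = accuracy_score(y_true, y_pred)                      # Equation (6)
precision, recall, f1, _ = precision_recall_fscore_support(    # Equations (7)-(9)
    y_true, y_pred, average="macro")
print(f"Accuracy {accuracy:.2%}, Precision {precision:.2%}, "
      f"Recall {recall:.2%}, F1 {f1:.2%}")
```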

4.4.1. Timbre-Based Results

As shown in Table 2, taking the model baseline benchmark within the timbre-based group, the results for the Alphabet Song are 64.78% for Timbre-based WaveNet and 76.08% for Timbre-based MLNet, whereas our methods reached 88.73% for Timbre-based CNN and 98.59% for Timbre-based CNN Transformers. In addition, the F1 score, precision, and recall increase by 10.9%, 8.4%, and 10.2%, respectively, compared with the second-place model, Timbre-based CNN. Although Timbre-based CNN Transformers does not achieve a perfect evaluation, the model surpasses the others and yields high-confidence predictions. This validation shows it to be superior on most of the Alphabet Song findings; thus, Timbre-based CNN Transformers has the best predictive performance for this song. On average, Timbre-based CNN Transformers achieves the best performance assessment among all the models listed in Table 3.

4.4.2. Pitch-Based Results

In the internal comparative benchmark of the pitch-based group, the results on songs such as Skip to My Lou show that our approach reaches 94.37% for Pitch-based CNN, whereas the best performance is achieved by Pitch-based CNN Transformers at 97.18%. In comparison, the F1 score, precision, and recall increase by 3.1%, 2.9%, and 2.8%, respectively, compared to Pitch-based CNN. This verification shows Pitch-based CNN Transformers to be superior on the majority of the Skip to My Lou results; therefore, it has the most accurate predictions for this song and achieves the best performance evaluation of all the models, as shown in Table 4 and Table 5.

4.4.3. The Best Comparative Results

The comparative findings of average performance, shown in Table 6, demonstrate that the baseline models, Timbre-based WaveNet and Timbre-based MLNet, together with our Timbre-based CNN and Timbre-based CNN Transformers, achieved lower performance in testing. The pitch-based group reached an average accuracy of up to 96.87% with Pitch-based CNN, which is much higher, but still slightly lower than the best performer, Pitch-based CNN Transformers, which achieved an overall recognition rate of up to 97.50%. This shows that Pitch-based CNN Transformers was the best performer.

4.4.4. The Performance Competition Results

As shown in Table 7, we compared the competency of each group by combining the timbre-based and pitch-based CNN models, and the timbre-based and pitch-based CNN Transformers models, to calculate the average performance. The timbre-based and pitch-based combination of CNN Transformers achieved a total average performance of up to 96.05%, whereas the timbre-based and pitch-based combination of CNN achieved only 89.79%, indicating that the CNN Transformers combination is the most effective. In Table 8, we compare against other benchmarks; the results show that the higher accuracy is attained by our approaches that use raw pitch features, with the Pitch-based CNN and Pitch-based CNN Transformers models achieving accuracies of 96.87% and 97.50%, respectively.

4.5. Analysis

4.5.1. Confusion Matrix

Apart from the comparison of overall accuracy, we also examined the classification behavior using the confusion matrix (a.k.a. matching matrix). In Figure 12, Figure 13, Figure 14 and Figure 15, for the Brahms’ Lullaby performance, we illustrate our approaches in the timbre-based group, from Timbre-based WaveNet to Timbre-based CNN Transformers, as well as the pitch-based group, comparing Pitch-based CNN with Pitch-based CNN Transformers. The correct classifications and misclassifications across the three sound classes, “Poor”, “Fair”, and “Good”, were considered. Within the timbre-based group, samples are highly misclassified by Timbre-based WaveNet, which reaches only 60.56%, as shown in Figure 12; in contrast, Figure 13 shows that most samples are classified correctly by Timbre-based CNN Transformers at 94.60%. In the pitch-based group, most samples are classified correctly by the Pitch-based CNN model at 96.87%, as shown in Figure 14, while Pitch-based CNN Transformers reaches 97.50%, as shown in Figure 15, performing better than the other model in the same group and categorizing most precisely across all evaluations.

4.5.2. Loss Rate

From the computed cost improvement, we can estimate and illustrate a benchmark visualization. As shown in Figure 16 and Figure 17, we observed how the models from the two groups improved in terms of loss rate across four songs, for the four timbre-based and two pitch-based model baselines, respectively. Timbre-based CNN Transformers achieved a major improvement, reducing the loss to 0.625 times that of Timbre-based WaveNet, even though for one song (London Bridge) Timbre-based CNN improved slightly more than Timbre-based CNN Transformers; nevertheless, the latter categorized more precisely in most of the evaluations. Regarding the pitch-based group loss analysis, our results again cover four songs and two model baselines. Pitch-based CNN Transformers mostly improved the loss effectively and significantly compared to Pitch-based CNN, although for one song (Clementine) the Pitch-based CNN evaluation was slightly better than that of Pitch-based CNN Transformers.

5. Conclusions and Future Work

This study aims to explain and compare the feasibility of accurately identifying piano sound quality in order to achieve better results. We concentrated on short songs of 8 to 16 bars, between 30 s and 1 min in length, which are difficult for students to learn to play accurately on their own.
The study was primarily concerned with comparing and describing the results of correctly identifying the piano sound accuracy of 18 songs for kids; four piano teachers and nine students participated, for a total of thirteen people included in our experiment for training and evaluation. The models were divided into two main groups: the timbre-based group (Timbre-based WaveNet, Timbre-based MLNet, Timbre-based CNN, and Timbre-based CNN Transformers) and the pitch-based group (Pitch-based CNN and Pitch-based CNN Transformers). We employed model analysis approaches to explain the model predictions in an effort to gain useful insights into the automatic evaluation of collaborative piano playing and performance. To develop the automated piano performance evaluation, we used the 18 songs for kids and pitted the models in each group against each other to determine the best one. In the end, the pitch-based group was superior to the timbre-based group: Pitch-based CNN Transformers performed best at 97.50%, followed by Pitch-based CNN at 96.87%.
In addition, we also found that, based on the evaluation of the accuracy of 18 songs for kids, Jingle Bells and Mary Had a Little Lamb were the two songs that kids tended to play correctly, perhaps because they are well-known compositions that have been passed down from generation to generation, or perhaps because they are simple songs with which kids are already familiar. If the child has heard or practiced the song before, they will be able to play it more proficiently than if the song were new to them and they were unfamiliar with the melody.
Nevertheless, there were limitations on the songs for kids during the data collection process, and in future work the experimental dataset can be expanded to include performances with additional instruments. Evaluation will not be limited solely to Taiwan but could be applied internationally. This study may encourage students to make effective use of their free time and achieve their study objectives. Ultimately, the purpose of the study is to enhance the music industry in the future by encouraging effective music teaching and testing by both teachers and students.

Author Contributions

Conceptualization, V.P. and W.-H.T.; methodology, V.P. and W.-H.T.; software, V.P.; formal analysis, V.P.; investigation, V.P.; resources, V.P.; data curation, V.P.; writing—original draft preparation, V.P.; writing—review and editing, V.P. and W.-H.T.; visualization, V.P.; supervision, W.-H.T.; project administration, V.P. and W.-H.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministry of Science and Technology, Taiwan, under grant MOST-106-2221-E-027-125-MY2.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data in this work was obtained with the participation of students and piano teachers at Elephant Music School from the Tianmu and Taipei City Hall branches, Taiwan.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. The configuration of timbre-based WaveNet.
Layer | Kernel Size | Stride | Number of Filters | Output Shape
Input (Raw data + MFCC) | - | - | - | (1, 510, 1)
1D Conv1 | (1, 3) | (1, 1) | 32 | (1, 510, 32)
1D Conv2 | (1, 3) | (1, 1) | 32 | (1, 508, 32)
MaxPool1 | (2, 2) | (1, 1) | 32 | (1, 254, 32)
1D Conv3 | (1, 5) | (1, 1) | 64 | (1, 250, 64)
1D Conv4 | (1, 5) | (1, 1) | 64 | (1, 246, 64)
MaxPool2 | (2, 2) | (1, 1) | 64 | (1, 123, 64)
Flatten | - | - | - | (1, 7872)
FC1 | - | - | 256 | (1, 256)
FC2 | - | - | 3 | (1, 3)
SoftMax | - | - | - | (1, 3)
Output (#3 Classes) | - | - | - | (1, 3)
Table A2. The configuration of timbre-based MLNet.
Layer | Kernel Size | Stride | Number of Filters | Output Shape
Input (Raw data + 3D MFCC) | - | - | - | (1, 63, 1149, 1)
2D Conv1 | (4, 4) | (2, 2) | 32 | (1, 63, 1149, 32)
2D Conv2 | (4, 4) | (2, 2) | 32 | (1, 30, 573, 32)
MaxPool1 | (2, 2) | (2, 2) | 32 | (1, 15, 286, 32)
Flatten | - | - | - | (1, 137,280)
FC1 | - | - | 512 | (1, 512)
FC2 | - | - | 64 | (1, 64)
FC3 | - | - | 3 | (1, 3)
SoftMax | - | - | - | (1, 3)
Output (#3 Classes) | - | - | - | (1, 3)
Table A3. The configuration of Timbre-based CNN.
Layer | Kernel Size | Stride | Number of Filters | Output Shape
Input (Raw data + MFCC) | - | - | - | (1, 160, 1)
1D Conv1 | (3, 3) | (2, 2) | 64 | (1, 160, 64)
MaxPool1 | (2, 2) | (2, 2) | 64 | (1, 80, 64)
Flatten | - | - | - | (1, 5120)
FC1 | - | - | 100 | (1, 100)
FC2 | - | - | 3 | (1, 3)
SoftMax | - | - | - | (1, 3)
Output (#3 Classes) | - | - | - | (1, 3)
Table A4. The configuration of Timbre-based CNN Transformers.
Layer | Kernel Size | Stride | Number of Filters | Output Shape
Input (Raw data + Chroma_stft) | - | - | - | (1, 201, 1)
Layer Normalization1 | - | - | - | (1, 201, 1)
Multi Head Attention = 4, size = 256 | - | - | - | (1, 201, 1)
Dropout1 (0.25) | - | - | - | (1, 201, 1)
Layer Normalization2 | - | - | - | (1, 201, 1)
1D Conv1, ReLU | (1, 1) | (1, 1) | 4 | (1, 201, 4)
Dropout2 (0.25) | - | - | - | (1, 201, 4)
1D Conv2, ReLU | (1, 1) | (1, 1) | 4 | (1, 201, 1)
Global Average Pooling | - | - | - | (1, 201)
FC1 | - | - | 128 | (1, 128)
Dropout3 (0.4) | - | - | - | (1, 128)
FC2 | - | - | 3 | (1, 3)
SoftMax | - | - | - | (1, 3)
Output (#3 Classes) | - | - | - | (1, 3)
Table A5. The configuration of Pitch-based CNN.
Layer | Kernel Size | Stride | Number of Filters | Output Shape
Pitch Input (Pitch raw data) | - | - | - | (1, 160, 1)
1D Conv1 | (1, 3) | (1, 1) | 64 | (1, 160, 64)
Batch Normalization1 | - | - | - | (1, 160, 64)
Rectifier Linear Unit (ReLU)1 | - | - | - | (1, 160, 64)
1D Conv2 | (1, 3) | (1, 1) | 64 | (1, 158, 64)
Batch Normalization2 | - | - | - | (1, 158, 64)
Rectifier Linear Unit (ReLU)2 | - | - | - | (1, 158, 64)
1D Conv3 | (1, 3) | (1, 1) | 64 | (1, 156, 64)
Batch Normalization3 | - | - | - | (1, 156, 64)
Rectifier Linear Unit (ReLU)3 | - | - | - | (1, 156, 64)
Global Average Pooling | (1, 2) | (2, 2) | - | (1, 78, 64)
FC1 | - | - | 3 | (1, 4992)
SoftMax | - | - | - | (1, 3)
Output (#3 Classes) | - | - | - | (1, 3)
Table A6. The configuration of Pitch-based CNN Transformers.
Layer | Kernel Size | Stride | Number of Filters | Output Shape
Pitch Input (Pitch raw data) | - | - | - | (1, 201, 1)
Layer Normalization1 | - | - | - | (1, 201, 1)
Multi Head Attention = 4, size = 256 | - | - | - | (1, 201, 1)
Dropout1 (0.25) | - | - | - | (1, 201, 1)
Layer Normalization2 | - | - | - | (1, 201, 1)
1D Conv1, ReLU | (1, 1) | (1, 1) | 4 | (1, 201, 4)
Dropout2 (0.25) | - | - | - | (1, 201, 4)
1D Conv2, ReLU | (1, 1) | (1, 1) | 4 | (1, 201, 1)
Global Average Pooling | - | - | - | (1, 201)
FC1 | - | - | 128 | (1, 128)
Dropout3 (0.4) | - | - | - | (1, 128)
FC2 | - | - | 3 | (1, 3)
SoftMax | - | - | - | (1, 3)
Output (#3 Classes) | - | - | - | (1, 3)

References

  1. Hosken, D. An Introduction to Music Technology, 2nd ed.; Taylor & Francis: New York, NY, USA, 2014; pp. 4–46. [Google Scholar] [CrossRef]
  2. Campayo-Muñoz, E.; Cabedo-Mas, A.; Hargreaves, D. Intrapersonal skills and music performance in elementary piano students in Spanish conservatories: Three case studies. Int. J. Music Educ. 2020, 38, 93–112. [Google Scholar] [CrossRef]
  3. Chandrasekaran, B.; Kraus, N. Music, noise-exclusion, and learning. Music Percept. 2010, 27, 297–306. [Google Scholar] [CrossRef]
  4. Li, W. Analysis of piano performance characteristics by deep learning and artificial intelligence and its application in piano teaching. Front. Psychol. 2022, 12, 5962. [Google Scholar] [CrossRef] [PubMed]
  5. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 53. [Google Scholar] [CrossRef] [PubMed]
  6. Wang, D.; Zhang, M.; Li, Z.; Li, J.; Fu, M.; Cui, Y.; Chen, X. Modulation format recognition and OSNR estimation using CNN-based deep learning. IEEE Photon. Technol. Lett. 2017, 29, 1667–1670. [Google Scholar] [CrossRef]
  7. Yang, C.; Zhang, X.; Song, Z. CNN Meets Transformer for Tracking. Sensors 2022, 22, 3210. [Google Scholar] [CrossRef]
  8. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar] [CrossRef]
  9. Shuo, C.; Xiao, C. The construction of internet+ piano intelligent network teaching system model. J. Intell. Fuzzy Syst. 2019, 37, 5819–5827. [Google Scholar] [CrossRef]
  10. Chiang, P.Y.; Sun, C.H. Oncall piano sensei: Portable ar piano training system. In Proceedings of the 3rd ACM Symposium on Spatial User Interaction (SUI), Los Angeles, CA, USA, 8–9 August 2015; p. 134. [Google Scholar] [CrossRef]
  11. Sun, C.H.; Chiang, P.Y. Mr. Piano: A portable piano tutoring system. In Proceedings of the 2018 IEEE XXV International Conference on Electronics, Electrical Engineering, and Computing (INTERCON), Lima, Peru, 8–10 August 2018; pp. 1–4. [Google Scholar] [CrossRef]
  12. Giraldo, S.; Ortega, A.; Perez, A.; Ramirez, R.; Waddell, G.; Williamon, A. Automatic assessment of violin performance using dynamic time warping classification. In Proceedings of the 2018 26th Signal Processing and Communications Applications Conference (SIU), Altinyunus, Turkey, 2–5 May 2018; pp. 1–3. [Google Scholar] [CrossRef] [Green Version]
  13. Liu, M.; Huang, J. Piano playing teaching system based on artificial intelligence–design and research. J. Intell. Fuzzy Syst. 2021, 40, 3525–3533. [Google Scholar] [CrossRef]
  14. Phanichraksaphong, V.; Tsai, W.H. Automatic evaluation of piano performances for STEAM education. Appl. Sci. 2021, 11, 11783. [Google Scholar] [CrossRef]
  15. Sharma, A.K.; Aggarwal, G.; Bhardwaj, S.; Chakrabarti, P.; Chakrabarti, T.; Abawajy, J.H.; Mahdin, H. Classification of Indian classical music with time-series matching deep learning approach. IEEE Access 2021, 9, 102041–102052. [Google Scholar] [CrossRef]
  16. Li, B. On identity authentication technology of distance education system based on voiceprint recognition. In Proceedings of the 30th Chinese Control Conference (CCC 2011), Yantai, China, 22–24 July 2011; pp. 5718–5721. [Google Scholar]
  17. Belman, A.K.; Paul, T.; Wang, L.; Iyengar, S.S.; Śniatała, P.; Jin, Z.; Roning, J. Authentication by mapping keystrokes to music: The melody of typing. In Proceedings of the 2020 International Conference on Artificial Intelligence and Signal Processing (AISP), Andhra Pradesh, India, 10–12 January 2020; pp. 1–6. [Google Scholar] [CrossRef]
  18. McAdams, S. The Psychology of Music, Musical Timbre Perception, 3rd ed.; Elsevier: Amsterdam, The Netherlands, 2013; pp. 35–67. [Google Scholar] [CrossRef]
  19. Jiam, N.T.; Deroche, M.L.; Jiradejvong, P.; Limb, C.J. A randomized controlled crossover study of the impact of online music training on pitch and timbre perception in cochlear implant users. J. Assoc. Res. Otolaryngol. 2019, 20, 247–262. [Google Scholar] [CrossRef] [PubMed]
  20. Verma, P.; Chafe, C. A generative model for raw audio using transformer architectures. In Proceedings of the 2021 24th International Conference on Digital Audio Effects (DAFx), Copenhagen, Denmark, 8–10 September 2021; pp. 230–237. [Google Scholar] [CrossRef]
  21. Oord, A.V.D.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499. [Google Scholar]
  22. Tran, V.T.; Tsai, W.H. Acoustic-based emergency vehicle detection using convolutional neural networks. IEEE Access 2021, 8, 75702–75713. [Google Scholar] [CrossRef]
  23. Fonseca, E.; Pons Puig, J.; Favory, X.; Font Corbera, F.; Bogdanov, D.; Ferraro, A.; Serra, X. Freesound datasets: A platform for the creation of open audio datasets. In Proceedings of the 18th Society for Music Information Retrieval (ISMIR), Suzhou, China, 23–27 October 2017; pp. 486–493. [Google Scholar] [CrossRef]
  24. Boddapati, V.; Petef, A.; Rasmusson, J.; Lundberg, L. Classifying environmental sounds using image recognition networks. Proc. Comput. Sci. 2017, 112, 2048–2056. [Google Scholar] [CrossRef]
  25. McFee, B.; Raffel, C.; Liang, D.; Ellis, D.P.; McVicar, M.; Battenberg, E.; Nieto, O. Librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference (SciPy 2015), Austin, TX, USA, 6–12 July 2015; pp. 18–25. [Google Scholar] [CrossRef] [Green Version]
  26. Chachada, S.; Kuo, C.C.J. Environmental sound recognition: A survey. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Kaohsiung, Taiwan, 29 October–1 November 2013. [Google Scholar] [CrossRef]
  27. Salamon, J.; Bello, J.P. Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process. Lett. 2017, 24, 279–283. [Google Scholar] [CrossRef]
  28. Piczak, K.J. Environmental sound classification with convolutional neural networks. In Proceedings of the 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), Boston, MA, USA, 17–20 September 2015; pp. 17–20. [Google Scholar] [CrossRef]
  29. Lee, J.; Kim, T.; Park, J.; Nam, J. Raw waveform-based audio classification using sample-level CNN architectures. arXiv 2017, arXiv:1712.00866. [Google Scholar]
  30. Thomas, S.; Ganapathy, S.; Saon, G.; Soltau, H. Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 2519–2523. [Google Scholar] [CrossRef]
  31. Abdel-Hamid, O.; Mohamed, A.R.; Jiang, H.; Deng, L.; Penn, G.; Yu, D. Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process 2014, 22, 1533–1545. [Google Scholar] [CrossRef] [Green Version]
  32. Siripibal, N.; Supratid, S.; Sudprasert, C. A comparative study of object recognition techniques: Softmax, linear and quadratic discriminant analysis based on convolutional neural network feature extraction. In Proceedings of the 2019 International Conference on Management Science and Industrial Engineering, Phuket, Thailand, 24–26 May 2019; pp. 209–214. [Google Scholar] [CrossRef]
  33. Zhao, Z.Q.; Zheng, P.; Xu, S.T.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232. [Google Scholar] [CrossRef] [Green Version]
  34. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCVW), Montreal, QC, Canada, 11–17 October 2021; pp. 6836–6846. [Google Scholar] [CrossRef]
  35. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. 2022, 54, 1–41. [Google Scholar] [CrossRef]
  36. Yu, H.M.; Tsai, W.H.; Wang, H.M. A query-by-singing system for retrieving karaoke music. IEEE Trans. Multimed. 2008, 10, 1626–1637. [Google Scholar] [CrossRef]
  37. Piszczalski, M.; Galler, B.A. Predicting musical pitch from component frequency ratios. J. Acoust. Soc. Am. 1979, 66, 710–720. [Google Scholar] [CrossRef]
  38. Su, H.; Zhang, H.; Zhang, X.; Gao, G. Convolutional neural network for robust pitch determination. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 579–583. [Google Scholar] [CrossRef]
  39. Guo, J.; Han, K.; Wu, H.; Tang, Y.; Chen, X.; Wang, Y.; Xu, C. Cmt: Convolutional neural networks meet vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12175–12185. [Google Scholar] [CrossRef]
  40. Zhang, W.; Lei, W.; Xu, X.; Xing, X. Improved music genre classification with convolutional neural networks. In Proceedings of the 17th Annual Conference of the International Speech Communication Association (INTERSPEECH 2016), San Francisco, CA, USA, 8–12 September 2016; pp. 3304–3308. [Google Scholar] [CrossRef] [Green Version]
  41. Sarkar, R.; Choudhury, S.; Dutta, S.; Roy, A.; Saha, S.K. Recognition of emotion in music based on deep convolutional neural network. Multimed. Tools Appl. 2020, 79, 765–783. [Google Scholar] [CrossRef]
  42. Singh, Y.; Biswas, A. Robustness of musical features on deep learning models for music genre classification. Expert Syst. Appl. 2022, 199, 116879. [Google Scholar] [CrossRef]
Figure 1. Overview of the system.
Figure 2. Spectrogram examples of piano performances.
Figure 3. Pitch contour example of a piano performance.
Figure 4. Block diagram of the timbre-based evaluation system.
Figure 5. Architecture of the Timbre-based WaveNet model.
Figure 6. Architecture of the Timbre-based MLNet model.
Figure 7. Architecture of the Timbre-based CNN model.
Figure 8. Architecture of the Timbre-based CNN Transformers model.
Figure 9. Block diagram of the pitch-based evaluation system.
Figure 10. Architecture of the Pitch-based CNN model.
Figure 11. Architecture of the Pitch-based CNN Transformers model.
Figure 12. Confusion matrix obtained with the Timbre-based WaveNet on the Brahms Lullaby performances.
Figure 13. Confusion matrix obtained with the Timbre-based CNN on the Brahms Lullaby performances.
Figure 14. Confusion matrix obtained with the Pitch-based CNN on the Brahms Lullaby performances.
Figure 15. Confusion matrix obtained with the Pitch-based CNN Transformers on the Brahms Lullaby performances.
Figure 16. Loss curves obtained with the timbre-based approaches.
Figure 17. Loss curves obtained with the pitch-based approaches.
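Note on Figures 12–15: the confusion matrices are computed over the three performance classes "Good", "Fair", and "Poor". As a minimal illustrative sketch only — not the authors' evaluation code, and with placeholder label arrays standing in for the real test labels and model predictions — such a matrix can be produced and plotted as follows:

```python
# Minimal sketch (assumes scikit-learn and matplotlib are installed).
# y_true / y_pred are hypothetical placeholders, not the paper's data.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

labels = ["Good", "Fair", "Poor"]
y_true = ["Good", "Good", "Fair", "Poor", "Fair", "Poor"]   # hypothetical ground truth
y_pred = ["Good", "Fair", "Fair", "Poor", "Fair", "Good"]   # hypothetical predictions

cm = confusion_matrix(y_true, y_pred, labels=labels)
ConfusionMatrixDisplay(cm, display_labels=labels).plot(cmap="Blues")
plt.title("Confusion matrix (illustrative)")
plt.show()
```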
Table 1. Our dataset.

| Songs for Kids Dataset | #Training | #Testing |
|---|---|---|
| Alphabet Song | 281 | 70 |
| Amazing Grace | 281 | 70 |
| Au Clair De La Lune | 281 | 70 |
| Beautiful Brown Eye | 281 | 70 |
| Brahms Lullaby | 281 | 70 |
| Can Can | 281 | 70 |
| Clementine | 281 | 70 |
| Deck The Hall | 281 | 70 |
| Hot Cross Buns | 281 | 70 |
| Jingle Bells | 281 | 70 |
| Lavender Blue | 281 | 70 |
| London Bridge | 281 | 70 |
| Mary Had a Little Lamb | 281 | 70 |
| Ode to Joy | 281 | 70 |
| Oh Susanna | 281 | 70 |
| Skip to My Lou | 281 | 70 |
| The Cuckoo | 281 | 70 |
| This Old Man | 281 | 70 |
| Total | 5058 | 1260 |
Table 2. Timbre-based results.

| Dataset | Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| Alphabet Song | Timbre-based WaveNet | 64.78% | 44.17% | 64.81% | 52.42% |
| | Timbre-based MLNet | 76.06% | 87.68% | 74.53% | 75.89% |
| | Timbre-based CNN Original | 88.73% | 90.49% | 87.72% | 88.79% |
| | Timbre-based CNN Transformer | 98.59% | 98.80% | 98.48% | 98.61% |
| Amazing Grace | Timbre-based WaveNet | 64.79% | 44.17% | 64.81% | 52.42% |
| | Timbre-based MLNet | 73.23% | 83.53% | 68.98% | 70.83% |
| | Timbre-based CNN Original | 80.28% | 90.47% | 72.82% | 75.92% |
| | Timbre-based CNN Transformer | 95.77% | 96.11% | 95.45% | 95.70% |
| Au Clair De La Lune | Timbre-based WaveNet | 63.38% | 84.24% | 56.94% | 55.96% |
| | Timbre-based MLNet | 67.60% | 73.42% | 70.21% | 66.38% |
| | Timbre-based CNN Original | 87.32% | 95.06% | 84.44% | 88.42% |
| | Timbre-based CNN Transformer | 97.18% | 97.31% | 97.25% | 97.24% |
| Beautiful Brown Eye | Timbre-based WaveNet | 54.93% | 46.41% | 47.22% | 40.77% |
| | Timbre-based MLNet | 66.20% | 72.63% | 67.19% | 65.43% |
| | Timbre-based CNN Original | 84.51% | 85.43% | 84.56% | 84.10% |
| | Timbre-based CNN Transformer | 92.96% | 94.79% | 92.42% | 93.06% |
| Brahms Lullaby | Timbre-based WaveNet | 60.56% | 80.22% | 56.48% | 55.83% |
| | Timbre-based MLNet | 74.65% | 76.62% | 71.77% | 72.57% |
| | Timbre-based CNN Original | 77.46% | 86.67% | 74.95% | 80.09% |
| | Timbre-based CNN Transformer | 91.55% | 92.93% | 90.90% | 91.39% |
| Can Can | Timbre-based WaveNet | 66.20% | 79.91% | 72.17% | 63.71% |
| | Timbre-based MLNet | 76.06% | 76.92% | 77.63% | 76.48% |
| | Timbre-based CNN Original | 78.87% | 88.06% | 71.44% | 75.41% |
| | Timbre-based CNN Transformer | 91.55% | 93.93% | 90.90% | 91.61% |
| Clementine | Timbre-based WaveNet | 59.15% | 50.00% | 51.38% | 45.64% |
| | Timbre-based MLNet | 73.24% | 74.96% | 73.24% | 71.84% |
| | Timbre-based CNN Original | 77.46% | 86.67% | 74.95% | 80.09% |
| | Timbre-based CNN Transformer | 95.77% | 96.00% | 96.29% | 95.91% |
| Deck The Hall | Timbre-based WaveNet | 61.97% | 78.07% | 59.13% | 56.56% |
| | Timbre-based MLNet | 69.01% | 68.47% | 69.49% | 67.57% |
| | Timbre-based CNN Original | 85.92% | 90.98% | 72.30% | 79.35% |
| | Timbre-based CNN Transformer | 98.59% | 98.80% | 98.48% | 98.61% |
| Hot Cross Buns | Timbre-based WaveNet | 54.93% | 49.18% | 47.22% | 41.08% |
| | Timbre-based MLNet | 77.46% | 88.14% | 77.77% | 76.12% |
| | Timbre-based CNN Original | 78.87% | 82.50% | 76.91% | 79.22% |
| | Timbre-based CNN Transformer | 95.77% | 96.03% | 95.45% | 95.50% |
| Jingle Bells | Timbre-based WaveNet | 66.20% | 81.85% | 59.27% | 54.73% |
| | Timbre-based MLNet | 71.83% | 53.06% | 63.88% | 56.67% |
| | Timbre-based CNN Original | 77.46% | 78.16% | 76.91% | 75.60% |
| | Timbre-based CNN Transformer | 91.55% | 92.07% | 92.03% | 91.65% |
| Lavender Blue | Timbre-based WaveNet | 66.20% | 51.57% | 58.33% | 52.14% |
| | Timbre-based MLNet | 69.01% | 71.79% | 68.32% | 68.46% |
| | Timbre-based CNN Original | 81.69% | 86.69% | 79.57% | 79.78% |
| | Timbre-based CNN Transformer | 94.37% | 94.66% | 94.78% | 94.53% |
| London Bridge | Timbre-based WaveNet | 54.93% | 82.51% | 50.46% | 47.55% |
| | Timbre-based MLNet | 77.46% | 87.91% | 75.47% | 77.77% |
| | Timbre-based CNN Original | 84.51% | 89.66% | 84.15% | 84.82% |
| | Timbre-based CNN Transformer | 95.77% | 96.00% | 96.29% | 95.91% |
| Mary Had a Little Lamb | Timbre-based WaveNet | 55.09% | 84.24% | 61.11% | 59.20% |
| | Timbre-based MLNet | 69.01% | 73.10% | 71.83% | 68.42% |
| | Timbre-based CNN Original | 81.69% | 87.22% | 75.29% | 76.44% |
| | Timbre-based CNN Transformer | 97.18% | 97.70% | 96.96% | 97.22% |
| Ode To Joy | Timbre-based WaveNet | 57.75% | 83.05% | 55.09% | 49.92% |
| | Timbre-based MLNet | 77.46% | 84.91% | 75.47% | 77.77% |
| | Timbre-based CNN Original | 88.73% | 89.13% | 89.43% | 88.52% |
| | Timbre-based CNN Transformer | 90.14% | 93.13% | 89.39% | 90.40% |
| Oh Susanna | Timbre-based WaveNet | 64.79% | 44.17% | 64.81% | 52.42% |
| | Timbre-based MLNet | 74.65% | 83.43% | 72.70% | 75.00% |
| | Timbre-based CNN Original | 84.51% | 89.24% | 82.19% | 85.40% |
| | Timbre-based CNN Transformer | 97.18% | 97.70% | 96.96% | 97.22% |
| Skip To My Lou | Timbre-based WaveNet | 61.97% | 46.66% | 63.21% | 50.56% |
| | Timbre-based MLNet | 74.65% | 81.92% | 73.86% | 74.84% |
| | Timbre-based CNN Original | 88.73% | 83.71% | 82.00% | 86.99% |
| | Timbre-based CNN Transformer | 90.14% | 90.14% | 89.39% | 89.55% |
| The Cuckoo | Timbre-based WaveNet | 57.75% | 79.53% | 54.98% | 51.51% |
| | Timbre-based MLNet | 77.46% | 84.01% | 74.53% | 76.14% |
| | Timbre-based CNN Original | 77.46% | 78.16% | 76.91% | 75.60% |
| | Timbre-based CNN Transformer | 94.37% | 94.51% | 94.21% | 94.34% |
| This Old Man | Timbre-based WaveNet | 63.38% | 45.33% | 57.47% | 45.64% |
| | Timbre-based MLNet | 74.65% | 83.43% | 72.70% | 75.00% |
| | Timbre-based CNN Original | 84.51% | 92.12% | 67.93% | 77.81% |
| | Timbre-based CNN Transformer | 94.37% | 95.69% | 93.93% | 94.48% |
Table 3. Timbre-based average performance.

| Model | Timbre-Based WaveNet | Timbre-Based MLNet | Timbre-Based CNN | Timbre-Based CNN Transformers |
|---|---|---|---|---|
| Average accuracy | 61.04% | 73.32% | 82.71% | 94.60% |
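The averages in Table 3 (and in Table 5 below) are the means of the 18 per-song values reported in Table 2 (respectively Table 4). For example, averaging the Timbre-based CNN Transformer accuracies of Table 2 reproduces the 94.60% figure; a short check, using only values already listed in Table 2, is shown below.

```python
# Reproduce the Table 3 average for the Timbre-based CNN Transformers
# from the 18 per-song accuracies listed in Table 2.
transformer_acc = [98.59, 95.77, 97.18, 92.96, 91.55, 91.55, 95.77, 98.59, 95.77,
                   91.55, 94.37, 95.77, 97.18, 90.14, 97.18, 90.14, 94.37, 94.37]
mean_acc = sum(transformer_acc) / len(transformer_acc)
print(f"{mean_acc:.2f}%")  # prints 94.60%
```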
Table 4. Pitch-based results.

| Dataset | Pitch-Based CNN Original: Accuracy | Precision | Recall | F1-Score | Pitch-Based CNN Transformer: Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|---|---|---|
| Alphabet Song | 92.96% | 94.79% | 92.42% | 92.90% | 100.00% | 100.00% | 100.00% | 100.00% |
| Amazing Grace | 97.18% | 97.22% | 97.53% | 97.27% | 98.59% | 98.55% | 98.48% | 98.48% |
| Au Clair De La Lune | 92.96% | 93.28% | 93.55% | 93.19% | 95.77% | 96.00% | 96.29% | 95.91% |
| Beautiful Brown Eye | 95.77% | 96.00% | 95.45% | 95.43% | 97.18% | 97.70% | 96.96% | 97.22% |
| Brahms Lullaby | 92.96% | 93.83% | 93.83% | 93.20% | 95.77% | 96.66% | 95.45% | 95.80% |
| Can Can | 100.00% | 100.00% | 100.00% | 100.00% | 95.77% | 96.00% | 96.29% | 95.91% |
| Clementine | 98.59% | 98.55% | 98.48% | 98.48% | 97.18% | 97.22% | 96.96% | 96.96% |
| Deck The Hall | 98.59% | 98.81% | 98.48% | 98.62% | 98.59% | 98.55% | 98.76% | 98.63% |
| Hot Cross Buns | 92.96% | 94.79% | 92.42% | 92.90% | 100.00% | 100.00% | 100.00% | 100.00% |
| Jingle Bells | 100.00% | 100.00% | 100.00% | 100.00% | 97.18% | 97.22% | 97.53% | 97.26% |
| Lavender Blue | 100.00% | 100.00% | 100.00% | 100.00% | 97.18% | 97.70% | 96.96% | 97.22% |
| London Bridge | 97.18% | 97.70% | 96.97% | 97.22% | 98.59% | 98.55% | 98.48% | 98.61% |
| Mary Had a Little Lamb | 98.59% | 98.55% | 98.77% | 98.63% | 98.59% | 98.55% | 98.48% | 98.48% |
| Ode To Joy | 98.59% | 98.55% | 98.77% | 98.63% | 97.18% | 97.70% | 96.96% | 97.22% |
| Oh Susanna | 97.18% | 97.70% | 96.97% | 97.22% | 97.18% | 97.22% | 97.53% | 97.26% |
| Skip To My Lou | 94.37% | 94.87% | 94.22% | 94.15% | 97.18% | 97.70% | 96.96% | 97.22% |
| The Cuckoo | 97.18% | 97.70% | 96.97% | 97.22% | 94.37% | 94.87% | 95.06% | 94.55% |
| This Old Man | 98.59% | 98.81% | 98.48% | 98.62% | 98.59% | 98.80% | 98.48% | 98.61% |
Table 5. Pitch-based average performance.

| Model | Pitch-Based CNN | Pitch-Based CNN Transformers |
|---|---|---|
| Average accuracy | 96.87% | 97.50% |
Table 6. The average of model performance.

| Model | Average Accuracy |
|---|---|
| Timbre-based WaveNet | 61.04% |
| Timbre-based MLNet | 73.32% |
| Timbre-based CNN | 82.71% |
| Timbre-based CNN Transformers | 94.60% |
| Pitch-based CNN | 96.87% |
| Pitch-based CNN Transformers | 97.50% |
Table 7. The combination of model performance.

| Model Combination | Timbre-Based CNN & Pitch-Based CNN | Timbre-Based CNN Transformers & Pitch-Based CNN Transformers |
|---|---|---|
| Accuracy | 89.79% | 96.05% |
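The table does not restate how the two classifiers are fused. Purely as a hedged illustration — not the authors' documented combination method — one common late-fusion scheme averages the per-class probabilities of the two models before taking the arg max over the three classes:

```python
# Hypothetical late-fusion sketch: average the per-class probabilities of a
# timbre-based and a pitch-based classifier. Shown only as one common
# possibility, not as the paper's actual combination strategy.
import numpy as np

def fuse_predictions(prob_timbre: np.ndarray, prob_pitch: np.ndarray) -> np.ndarray:
    """Average two (n_samples, 3) probability arrays over the classes
    Good / Fair / Poor and return the fused class indices."""
    fused = (prob_timbre + prob_pitch) / 2.0
    return np.argmax(fused, axis=1)

# Illustrative dummy probabilities for two test clips.
p_timbre = np.array([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]])
p_pitch = np.array([[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]])
print(fuse_predictions(p_timbre, p_pitch))  # prints [0 2]
```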
Table 8. Comparison with other benchmark methods.

| Work | Feature | Model/Method | Accuracy |
|---|---|---|---|
| Zhang, W. et al. [40], 2016 | STFT | CNN | 87.40% |
| Sarkar, R. et al. [41], 2020 | MFCC | CNN | 67.71% |
| Singh, Y. et al. [42], 2022 | MFCC | Xception | 89.02% |
| Ours | MFCC | Timbre-based CNN | 82.71% |
| Ours | Pitch raw data | Pitch-based CNN | 96.87% |
| Ours | Chroma_stft | Timbre-based CNN Transformers | 94.60% |
| Ours | Pitch raw data | Pitch-based CNN Transformers | 97.50% |
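Table 8 compares systems driven by different input features (STFT, MFCC, chroma_stft, and raw pitch data). As a minimal sketch of how such features can be extracted with librosa [25] — the file name, frame settings, and the pyin pitch range below are illustrative assumptions, not the settings used in this work — one might write:

```python
# Minimal feature-extraction sketch with librosa (cited as [25]).
# File name, n_mfcc, and the pyin pitch range are illustrative assumptions.
import librosa

y, sr = librosa.load("performance.wav", sr=22050)           # hypothetical recording

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)           # timbre feature (MFCC)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)             # timbre feature (chroma_stft)
f0, voiced_flag, voiced_prob = librosa.pyin(                 # pitch contour (raw pitch data)
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

print(mfcc.shape, chroma.shape, f0.shape)
```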
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
