Article

A Novel Approach for Improving Guitarists’ Performance Using Motion Capture and Note Frequency Recognition

1 Department of Computer Science, Faculty of Computers and Informatics, Suez Canal University, 4.5 Km the Ring Road, Ismailia 41522, Egypt
2 Department of Computer Science, Faculty of Computer Science, Misr International University, 28 KM Cairo–Ismailia Road, Cairo 44971, Egypt
3 Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al-Kharj 11942, Saudi Arabia
4 Higher Future Institute for Specialized Technological Studies, 29 KM Cairo–Ismailia Road, Cairo 3044, Egypt
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2023, 13(10), 6302; https://doi.org/10.3390/app13106302
Submission received: 8 April 2023 / Revised: 15 May 2023 / Accepted: 18 May 2023 / Published: 22 May 2023

Abstract:
New guitarists face multiple problems when first starting out, and these mainly stem from a flood of information that they are presented with. Students also typically struggle with proper pitch frequency recognition and accurate left-hand motion. A variety of relevant solutions have been suggested in the existing literature; however, the majority have ultimately settled on two approaches. The first is finger motion capture, wherein researchers focus on extracting finger positions through analyzing images and videos. The second is note frequency recognition, wherein researchers focus on analyzing notes and frequencies from audio recordings. This paper proposes a novel hybrid solution that includes both finger motion capture and note frequency recognition in order to conduct a full assessment and give feedback on a guitarist’s performance. To classify hand positions, several classification algorithms are tested. The random forest algorithm obtained superior results, with an accuracy of 99% for overall hand movement and an average of 97.5% for the classification of each finger. Meanwhile, two algorithms were tested for note recognition, where the harmonic product spectrum (HPS) approach obtained the highest accuracy of 95%.

1. Introduction

The history of the guitar can be traced back over 4000 years, beginning with the lute and the Greek kithara [1], some of the very first musical instruments ever created. The guitar eventually became one of the most admired and favored instruments across all cultures. There are a few reasons for this level of popularity. The guitar was more complex than most musical instruments at the time of its introduction, allowing it to produce more interesting pieces of music. The guitar is also considered a highly suitable instrument to accompany the human voice and singing [2]. It has become one of the most common musical instruments in the world [3] and can be found in various types, shapes, sizes, and variations, including acoustic and electric guitars. Unlike the hollow-bodied acoustic guitar, the electric guitar is solid-bodied; its string vibrations are converted into electrical signals and amplified, making it suitable for a wide range of music styles [4].
This paper focuses on both electric and acoustic guitars, as both require the same basics and finger positioning. Moreover, the signal detection algorithms can detect the note played regardless of whether it comes from an electric or an acoustic guitar. Playing the guitar requires the guitarist to use both hands. The left hand is the leader: the guitarist places the thumb behind the fretboard and the other four fingers on the fretboard, selects the note(s) to play, and accordingly presses down on the appropriate fret(s). The right hand is more of a follower: using the thumb and three fingers (generally excluding the pinky, as it is too small), the guitarist plucks the string(s) corresponding to those pressed on the fretboard by the left hand, allowing the guitar to produce the desired note [5].
At present, guitarists use many different playing styles, two of which are the most popular: lead and rhythm guitar. Lead guitar is a more solo-oriented style of playing and is therefore the more difficult of the two; rhythm guitar is the easier style and focuses primarily on playing chords [6]. These variations, styles, and the sheer amount of new information raise difficult questions for a new guitarist: which style to follow, which is easier, and which is better. Aspiring guitarists face these questions as the information keeps piling up, until many simply give up; research has shown that 90% of new guitarists quit playing within the first year [7]. Few have put it better than the CEO of Fender, Andy Mooney, who has stated that the problem is not attracting new aspiring guitarists but keeping them interested in continuing the learning process [8].
In addition, with the advent of the COVID-19 pandemic, it became difficult to attend a class with a group of other students and a teacher while maintaining social distancing. Naturally, most guitar hopefuls have turned to the Internet, where they can watch videos about guitar playing on platforms such as YouTube. Online learning has its advantages and disadvantages; while it is a great time-saver and the better option during a pandemic, it can be less motivating, and subtle mistakes may go uncorrected, leading to problems later on [9]. These issues have sparked the interest of researchers, many of whom have tried to come up with solutions. Among these solutions, two approaches dominate. The first is finger motion capture [10], in which researchers build systems that analyze images and videos of guitarists playing. The second is note frequency recognition [11], in which researchers build systems based on audio recordings of guitarists playing; such systems analyze the recordings to extract the notes played and then apply algorithms to identify the correct frequency. Despite the progress made with these approaches, there is still a need for a more comprehensive and accurate system that can overcome the challenges faced in finger motion capture and note frequency recognition. Researchers have attempted to tackle these challenges using various techniques, such as deep learning, neural networks, and signal processing. Section 2 highlights the different approaches and challenges described in this area, ranging from wearable devices and image processing to the analysis of sound frequencies and the incorporation of machine learning. To capture finger motion, researchers have used wrist-worn devices to record the spatial coordinates of the guitarist's hand. They have also found that left-hand motion is often performed incorrectly and that adequate pressure on the fretboard is required to generate the correct pitch frequency. In practice, guitar students may struggle with the proper pitch frequency, the correct left-hand movements, or both; as a result, the work in this paper focuses on both in order to improve guitarists' performance. Although Section 2 lists works that have analyzed a guitarist's performance, some focus only on finger motion capture, while others focus on analyzing sounds extracted from recorded songs. Managing finger motion and note frequency together in real time is a challenge and is one of the contributions of this research.
To classify a guitarist’s motion in space, researchers must deal with the guitarist’s hand motion as either signals or as images. Researchers who have treated finger motion as a set of signals—more specifically, electromyographic signals—have obtained a time-series dataset ready for classification [12], and then used a random forest approach to classify the hand motion. The resulting accuracy indicated an improvement over previously described approaches [13]. An SVM approach was also tested for finger motion classification; however, it required more tuning than the random forest. The use of EMG signals can provide high accuracy, but limits the guitarist’s movements. On the other hand, researchers have captured the guitarist’s movement as images and then used a CNN for classification [14], which did not limit the guitarist’s performance. However, the solution built using the CNN was not scalable, as it did not consider many exercises. In this paper, we propose a novel hybrid solution, which uses both finger motion capture (through MediaPipe) to extract finger landmark positions as time-series data, as well as note frequency recognition, in order to assess and improve a guitarist’s overall performance.
The reason that a hybrid approach was selected is that, when beginner guitarists start learning, they struggle with both playing the correct notes and with correct finger placement and technique. Guitar instructors can really help students with such problems, as they can both see and listen to the student’s playing, allowing for the correction of both notes and technique. The hybrid approach proposed in this paper aims to mimic the action of an instructor through an educational program that corrects both the note played and the finger positioning using a microphone and a camera.
The remainder of this paper is structured as follows: Section 2 reviews the most relevant research in the areas of hand movement recognition and note frequency analysis. The proposed hybrid approach is described in Section 3. The experimental results and performance analysis are detailed in Section 4, and Section 5 compares our results with related work. Finally, our conclusions, limitations, and directions for future work are discussed in Sections 6–8.

2. Related Work

There exist many approaches that may be used to evaluate a guitarist’s performance, including the analysis of hand movements and note frequencies. For example, researchers have attempted to utilize the Kinect camera for image segmentation and even wrist-worn gloves for spatial position capture. This section summarizes the most relevant approaches detailed in the existing literature.
In the early 2000s, researchers began to classify finger positions using various hardware setups to address the problem of tracking a guitarist's fingering [15]. In [16], image segmentation was conducted to recognize both chords and note frequencies. A few years later, with the introduction of the Microsoft Kinect sensor, researchers also incorporated the Kinect into several proposed solutions; for example, in [17], the Kinect was used to capture an expert's motion and compare it to a trainee's motion. Additionally, researchers have trialed the use of wrist-worn gloves to facilitate finger motion tracking, which [18] utilized in their proposed method to help players improve their strumming technique.
However, more recently, researchers have depended on neural network and deep learning techniques to develop further solutions [19]. In [20], a solution highly relevant to the system proposed here was presented. The researchers aimed to create a system that can track finger motion and recognize Thai sign language. They first built their dataset from approximately 500 videos of hand gestures, noting that this process takes a long time but requires no additional hardware beyond a mobile phone camera. They then used the MediaPipe framework to capture the absolute positions of each finger. During pre-processing, they realized that MediaPipe must capture the landmarks in each frame of every video; therefore, in order to obtain equivalent data, every video in the dataset must have the same length. They trained on the extracted positions using various recurrent neural networks (RNNs), including long short-term memory (LSTM) and gated recurrent unit (GRU) architectures, and tuned the hyperparameters with Hyperband in order to speed up the random search. Eventually, they built one desktop app and one mobile app: while the LSTM obtained the best accuracy and was used in the desktop app, the GRU-based model was used for the mobile app, as it was much smaller in size. We believe this work is among the most relevant to our research, as it describes how to extract hand and finger positions using the MediaPipe framework, which we adopted for our finger motion capture component. In [21], MediaPipe Hands was also used for dynamic gesture recognition, segmenting the movements of 12 hand signs and generating a time series for each movement, followed by classification using an LSTM; overall, they achieved an accuracy of 97.2%. MediaPipe is not only capable of detecting hand motion; for example, in [22], MediaPipe's pose detection module was used to detect a person falling. Most previous research has relied on wearable systems, which are not practical, whereas MediaPipe only requires a smartphone camera. In [23], a different application was addressed, using MediaPipe's FaceMesh module to detect the placement of face masks in public during the COVID-19 pandemic. Furthermore, ref. [14] found the Kinect unsuitable for their proposed solution and instead used a convolutional neural network (CNN) to detect finger positions on the fretboard; it should be noted, however, that this solution required a larger dataset to recognize all finger positions. Finally, ref. [24] also utilized a CNN along with support vector regression (SVR) and the discrete cosine transform (DCT) to evaluate a guitarist's playing based on expert videos.
Researchers also began incorporating sound analysis into their solutions in the 2010s. For example, in [11], original sounds and notes were compared with notes and frequencies extracted from recorded songs. In [25], a similar method was used to classify raga music and distinguish it from other musical forms. Moreover, ref. [26] used this approach in their proposed method to carefully analyze the frequency characteristics of Carnatic music.
Overall, the aforementioned studies used various methods to recognize the fingers placed on the guitar, ranging from finger coloring to deep neural networks (e.g., CNNs).

3. Proposed Approach

Our novel approach uses the footage of a guitarist, recorded either using a mobile phone or a webcam. The footage is then pre-processed and sent to our note detector and MediaPipe. MediaPipe [27] is a computer vision framework, which is used to extract the finger positions, save them, and send them to our ML classifier. MediaPipe offers a variety of object and landmark detection modules; in our case, we used the hands module. The proposed note classifier detects any note played on the guitar and checks whether it is the right note. Figure 1 provides an overview of the proposed system. The footage of the guitarist is sent to a server, where the music notes and video frames are then extracted.

3.1. Real-Time Detection

The system assesses the guitarist's performance in real time through signal processing (i.e., note labeling) and real-time finger position extraction.

3.1.1. Finger Position Detection

Among the various methods that may be used for finger detection, the Xbox One Kinect sensor is a popular option. However, as shown in Figure 2, the Kinect skeleton detection only detects the skeleton of the hands, not the positions of the fingers. Furthermore, the Kinect confused the position of the left hand with the guitar's neck. This method was also tested by Y. Kashiwagi and Y. Ochi [14], who observed the same issues. Due to these drawbacks, the MediaPipe framework was used instead [28]. This framework offers the BlazePalm detector and a hand landmark model. BlazePalm is responsible for detecting the palm and the hand skeleton itself, similarly to the Kinect's functionality. Meanwhile, the hand landmark model (see Figure 3) segments each finger and retrieves the positions of the finger joints (i.e., 21 positions) in x, y, z format, which are called landmarks. Each landmark has an ID referring to the position extracted; for example, IDs 5, 6, 7, and 8 refer to the joints of the index finger.
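To make the landmark extraction concrete, the following minimal Python sketch (illustrative only, not the authors' published code; the input file name is a placeholder) shows how MediaPipe Hands returns the 21 normalized landmarks for a frame and how the index finger's joints (IDs 5–8) can be read out:

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

frame = cv2.imread("frame.png")                  # placeholder input frame
with mp_hands.Hands(static_image_mode=True,
                    max_num_hands=2,
                    min_detection_confidence=0.5) as hands:
    # MediaPipe expects RGB input; OpenCV loads images as BGR.
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

if results.multi_hand_landmarks:
    for hand in results.multi_hand_landmarks:
        # Landmarks are normalized to [0, 1] relative to the frame width/height.
        for lm_id in (5, 6, 7, 8):               # index-finger joints
            lm = hand.landmark[lm_id]
            print(lm_id, lm.x, lm.y, lm.z)
```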

3.1.2. Note Labeling

A note is essentially a labeled frequency: the system first recognizes the frequency of the note played and then assigns a label based on the detected frequency. Figure 4 shows the frequency of each note of the guitar when playing open strings. In order to detect the frequency of a note, the note signal must first be processed. We used the discrete Fourier transform (DFT) in our approach to translate the audio signal generated by the guitarist from the time domain to the frequency domain; in particular, the DFT converts the audio signal from amplitude over time to frequencies and intensities using Equation (1). The frequency–intensity graph of a signal is called the magnitude spectrum. The DFT transforms an audio signal into a magnitude spectrum by expressing the signal as a sum of sinusoids at different frequencies. The DFT was applied whenever a note was detected from the guitarist, which allowed the note being played to be labeled, working in parallel with MediaPipe and the ML algorithms used to correct the finger motion, as described in Section 3.3.
X(k) = \sum_{n=0}^{N-1} x(n) \cdot e^{-i 2\pi k n / N},
where:
  • X(k) is the transformed signal, a complex number in the frequency domain at frequency index k;
  • x(n) is the n-th sample of the discrete input signal;
  • N is the total number of samples in the input signal;
  • e^{-i 2\pi k n / N} is the complex exponential given by Euler's formula, which encodes a frequency of k/N cycles per sample, where k is the frequency index, N is the total number of samples, and i denotes the imaginary unit.
The harmonic product spectrum (HPS) [29], defined in Equation (2), was used as a refinement of the DFT-based approach: it downsamples the magnitude spectrum several times and multiplies the resulting spectra together. Because the harmonics of the true pitch align across the downsampled spectra, the product is strongly peaked at the fundamental frequency while non-harmonic components are suppressed, resulting in a much more accurate detection of guitar frequencies.
Y(f) = \prod_{r=1}^{R} |X(f \cdot r)|,
where:
  • R is the number of downsampled magnitude spectra (harmonics) included in the product;
  • X(f · r) is the magnitude spectrum of the signal evaluated at the r-th harmonic of the candidate frequency f.
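To illustrate how these two equations are applied, the following Python sketch (an illustration under our own assumptions, not the authors' implementation; the synthetic test tone is a placeholder) computes a magnitude spectrum with the DFT and then applies the harmonic product spectrum to pick out the fundamental frequency:

```python
import numpy as np

def detect_pitch_hps(signal, sample_rate, num_harmonics=5):
    """Estimate the fundamental frequency of a note using the HPS of Equation (2)."""
    # Magnitude spectrum via the DFT of Equation (1).
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)

    # Multiply downsampled copies of the spectrum so that the harmonics of the
    # true pitch reinforce each other at the fundamental frequency.
    limit = len(spectrum) // num_harmonics
    hps = spectrum[:limit].copy()
    for r in range(2, num_harmonics + 1):
        hps *= spectrum[::r][:limit]

    return freqs[np.argmax(hps)]

# Quick check with a synthetic 220 Hz (A3) tone containing six harmonics.
sr = 22050
t = np.arange(sr) / sr
tone = sum(np.sin(2 * np.pi * 220 * k * t) / k for k in range(1, 7))
print(detect_pitch_hps(tone, sr))   # approximately 220 Hz
```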

3.2. Data Extraction

MediaPipe helped us construct a pipeline for our videos; Figure 5 shows the flow of the pipeline. Each video is first split into multiple frames. MediaPipe then detects the finger positions in x, y, z space for both the right and left hands and returns the normalized x, y, z positions relative to the frame's width and height. Finally, the fields stored in the .csv file consist of both the left- and right-hand positions for all 21 landmarks (see Figure 3) across the x, y, z space. Additional fields included in our .csv files were the screen width and height, as well as labels for each finger indicating whether it was placed correctly. We chose to extract the data into .csv format, as this made it easy to manipulate, query, and visualize our data.
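The following sketch outlines this pipeline (illustrative only; the video path, .csv name, and column layout are assumptions rather than the authors' exact schema). Each frame is passed through MediaPipe Hands, and the normalized landmark coordinates are written as one row of the time series; the per-finger correctness labels described in Section 3.2.1 were added afterwards by the expert annotator:

```python
import csv
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
NUM_LANDMARKS = 21

# One column per hand, landmark, and axis, plus frame index and frame size.
header = ["frame", "width", "height"]
for side in ("left", "right"):
    for i in range(NUM_LANDMARKS):
        header += [f"{side}_{i}_{axis}" for axis in ("x", "y", "z")]

capture = cv2.VideoCapture("exercise_video.mp4")          # placeholder video
with mp_hands.Hands(max_num_hands=2) as hands, \
        open("landmarks.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    frame_idx = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        height, width = frame.shape[:2]
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        coords = {"left": [0.0] * (NUM_LANDMARKS * 3),
                  "right": [0.0] * (NUM_LANDMARKS * 3)}
        if results.multi_hand_landmarks:
            for hand, handedness in zip(results.multi_hand_landmarks,
                                        results.multi_handedness):
                side = handedness.classification[0].label.lower()
                coords[side] = [v for lm in hand.landmark
                                for v in (lm.x, lm.y, lm.z)]
        writer.writerow([frame_idx, width, height]
                        + coords["left"] + coords["right"])
        frame_idx += 1
capture.release()
```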

3.2.1. Video Recording

In order to prepare the data, we recorded A minor scale exercises from the fifth position of the guitar’s fretboard, as shown in Figure 6. A total of 66 such videos were captured with an electric guitar, and 40 videos were captured with a classical guitar, yielding a total of 106 videos for the A minor scale from the fifth position. The videos were split into incorrectly played videos and correctly played videos, which helped us label the overall finger positions. Figure 7 shows the hand skeleton for a correct technique, where all fingers are aligned and each finger is placed on only one fret. Furthermore, Figure 8 demonstrates an incorrect technique, where the fingers are tense and not aligned and the fingertips are placed on the metal between each fret, which would result in a buzzing noise when playing and slow down the guitarist. Moreover, the recorded videos were labeled by an expert guitar instructor. We explicitly labeled whether or not each finger was placed correctly, which helped us develop predictive models for each finger.

3.2.2. Note Extraction

Any song is composed of onsets, which mark the start of each note. In order to detect notes, which are essentially individual frequencies, we used the Librosa [30] audio processing library, which helped us extract the onsets of each note played in different time frames. A function that detects and classifies the note played in each .wav file using the DFT and HPS was then run.
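A possible realization of this step is sketched below (assuming the detect_pitch_hps helper from the earlier HPS example; the audio file name is a placeholder). Librosa's onset detector splits the recording into note segments, and each segment is then labeled via its HPS-detected frequency:

```python
import librosa

audio, sr = librosa.load("exercise.wav", sr=None)      # placeholder recording
onsets = librosa.onset.onset_detect(y=audio, sr=sr, units="samples")

# Each onset marks the start of a note; slice the audio between consecutive
# onsets and label each slice by its HPS-detected frequency.
boundaries = list(onsets) + [len(audio)]
for start, end in zip(boundaries[:-1], boundaries[1:]):
    segment = audio[start:end]
    freq = detect_pitch_hps(segment, sr)   # HPS helper from Section 3.1.2 sketch
    print(librosa.hz_to_note(freq), round(float(freq), 1))
```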

3.3. Classification Model

In order to classify the technique used for each of the four fretting fingers of the left hand, four models had to be trained, with the training features being the x, y, and z coordinates of the TIP, PIP, and DIP joints (the fingertip and the two interphalangeal joints) of each finger. The models used to classify the technique included SVM, kNN, naive Bayes, random forest, and XG boosting, as further detailed in Section 4.
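The per-finger training setup could look like the following sketch (illustrative; the column names and label fields are assumptions about the extracted .csv, and the random forest shown is only one of the five algorithms compared in Section 4):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# MediaPipe landmark IDs for each finger's PIP, DIP, and TIP joints.
FINGER_LANDMARKS = {"index": (6, 7, 8), "middle": (10, 11, 12),
                    "ring": (14, 15, 16), "pinky": (18, 19, 20)}

data = pd.read_csv("landmarks_labeled.csv")      # hypothetical labeled dataset

models = {}
for finger, ids in FINGER_LANDMARKS.items():
    # x/y/z coordinates of the three joints of this finger (left fretting hand).
    features = [f"left_{i}_{axis}" for i in ids for axis in ("x", "y", "z")]
    X = data[features]
    y = data[f"{finger}_correct"]                # hypothetical 0/1 correctness label
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(finger,
          accuracy_score(y_test, pred),
          precision_score(y_test, pred),
          recall_score(y_test, pred))
    models[finger] = model
```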

4. Experiments and Results

In order to demonstrate the efficiency of the proposed approach, we conducted an experiment consisting of two phases. The first phase involved labeling the overall movement of the hands, while the second one involved labeling each finger’s position (i.e., the index, middle, ring, and pinky fingers). Furthermore, this section provides more information on our recording settings and our dataset and performance metrics for the SVM, kNN, naive Bayes, random forest, and XG boosting algorithms.

4.1. Experimental Setting and Extracted Data

We recorded the videos using the standard rear cameras of a Xiaomi Note 8 Pro and an iPhone 6. The phones were placed on a platform directly aligned with the guitarist's left hand. We used two different cameras to test whether different resolutions affected how the data were extracted and whether this would affect the guitarist's playing in real time in the future; however, there was no difference, as the finger positions are normalized in proportion to the resolution of each frame. Our dataset was recorded in two settings, using two different guitars, guitarists, and cameras. Table 1 and Table 2 summarize the results of the data collection methods described in Sections 3.2 and 3.3. Each row in the obtained .csv file included the x, y, z coordinates of the landmarks depicted in Figure 3. The videos did not require any pre-processing, as MediaPipe was able to detect and normalize the finger positions in multiple settings. The output .csv file consisted of a time series of each finger's position across numerous frames for each video. The sections below detail the evaluation of each machine learning algorithm when trained on this dataset, including the signal detection metrics.

4.2. Note Extraction Using Signal Processing

In order to extract the notes, we first captured the waveform of the played signal. Both the DFT and HPS were then applied to the waveform, generating two different power spectra from which the note frequencies were detected. Figure 9 and Figure 10 show the waveform and power spectrum of the C4 note, captured by recording a note played on the guitar; the power spectrum breaks down the individual frequencies that compose the note. Table 3 shows the results for each frequency detected using the HPS and DFT approaches. Each approach has two columns: the detected frequency and the closest frequency (i.e., the detected value rounded to classify the note). The standard frequency column serves as the benchmark for each note label. The measures used to determine the performance of these approaches were the mean squared error (MSE) and the accuracy:
MSE = \frac{1}{N} \sum_{i=1}^{N} (Y_i - \tilde{Y}_i)^2,
where N is the total number of notes on the guitar, Y_i is the standard benchmark frequency, and \tilde{Y}_i is the detected frequency; and
Accuracy = \frac{\text{correct samples}}{\text{total samples}} \times 100,
where the correct samples are the number of rows in which the Closest Frequency column matches the Standard Frequency column in Table 3.
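For clarity, these two metrics amount to the following small computation (the frequency lists shown are placeholders, not the full Table 3):

```python
import numpy as np

standard = np.array([130, 261, 523])   # placeholder benchmark frequencies (Hz)
detected = np.array([131, 262, 525])   # placeholder detected frequencies (Hz)
closest = np.array([130, 261, 523])    # detected values rounded to the nearest note

mse = np.mean((standard - detected) ** 2)       # deviation from the benchmark
accuracy = np.mean(closest == standard) * 100   # share of correctly labeled notes
print(mse, accuracy)
```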

Signal Processing Results

To calculate the accuracy of each approach, we took the percentage of rows in which the closest frequency and standard frequency columns match. Furthermore, we calculated the mean squared error between the standard frequency and detected frequency columns, which indicates how much the detected frequency deviated from the standard benchmark. HPS clearly outperformed the DFT in terms of both accuracy and MSE, obtaining an accuracy of 95% and an MSE of 1402; the DFT, in contrast, obtained an accuracy of 65% and an MSE of 39,198, as shown in Table 3.

4.3. Motion Capture Experiment

This experiment was split into two parts. In the first part, our models classified the hand position on the guitar’s fretboard in general. In the second part, we segmented each finger movement in a given time period. This experiment was conducted as it allows our system to be more specific when giving feedback to the user, pinpointing the mistakes made by each finger.

4.4. Hand Position Classification Results

To evaluate our ML models, we used three metrics:
Accuracy = \frac{TP + TN}{TP + FP + FN + TN},
Precision = \frac{TP}{TP + FP},
Recall = \frac{TP}{TP + FN},
where:
  • Accuracy indicates the proportion of correct classifications out of the total number of instances in the dataset;
  • Precision measures the accuracy of the positive predictions made;
  • Recall measures the proportion of actual positives that are correctly classified;
  • TP denotes true positives;
  • TN denotes true negatives;
  • FP denotes false positives;
  • FN denotes false negatives.
Table 4 summarizes the scores for each algorithm in each of these metrics. It can be seen that naive Bayes had the worst performance in terms of accuracy, while random forest was the best-performing algorithm. Naive Bayes presented the poorest performance as it assumed probabilistic independence between each finger’s position, which cannot be assumed in the considered case. Our simplest algorithm was kNN, as we used the Euclidean distance to classify whether each position was correct.

4.5. Finger Classification Results

After classifying the hand positions (in Section 4.4), kNN was found to perform poorly due to noise and inconsistency in our movement labeling: videos in which guitarists performed poorly still contained some correct finger positions, which affected the overall hand position labels. This problem was solved by labeling each finger's position over a range of frames for each recorded video. Classifiers were then trained using the same algorithms; Table 5 summarizes the accuracy scores for each labeled finger. Figure 11 and Figure 12 show examples of the trained models' classification for a frame with correct finger positions and for a frame with incorrect finger positions and incorrect technique (i.e., finger alignment in the A minor scale with only the index finger aligned correctly), respectively.

SVM Tuning

We ran a grid search on the SVM parameters, as its evaluation metrics were relatively low compared to the other tested algorithms. The sigmoid kernel's highest score was 0.74, obtained with the C parameter equal to 10. Our highest accuracy reached 0.981 using a C parameter of 1000 and a γ parameter equal to 1000; in this case, however, we suspected over-fitting, as the score on the training set was higher than that on the test set. We therefore settled on the most balanced selection of C = 10 and γ = 100, reaching an accuracy of 0.96. While the random forest algorithm still scored higher, the SVM results after tuning were drastically improved. Figure 13 and Figure 14 depict the grid search process for the index finger's model using the sigmoid and RBF kernels together and the RBF kernel alone, respectively. Moreover, the highlighted values reveal that the scores of the sigmoid kernel decrease rapidly, whereas the RBF kernel consistently outperforms it.
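A grid search of this kind can be reproduced with scikit-learn roughly as follows (a sketch; the placeholder data stands in for the index-finger features, and the exact grids and splits used here are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data standing in for the index-finger landmark features and labels.
X_train, y_train = make_classification(n_samples=500, n_features=9, random_state=42)

param_grid = {
    "kernel": ["rbf", "sigmoid"],
    "C": [1, 10, 100, 1000],
    "gamma": [0.01, 0.1, 1, 10, 100, 1000],
}

# 5-fold cross-validated grid search over kernel, C, and gamma.
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)   # the compromise reported above was C = 10, gamma = 100
print(search.best_score_)
```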

5. Comparative Analysis

This section summarizes the most relevant research on recognizing and classifying a guitarist's finger motion and playing patterns and compares it with our approach. The authors in [31] used deep learning chord classifiers, reaching around 90 percent accuracy for 3 chords and dropping to around 70 percent. On the other hand, ref. [14] used CNNs to classify left-hand positions on the fretboard for single-note picking, reaching accuracies of 95, 96, 94, and 96 percent for the index, middle, ring, and pinky fingers, respectively. Other studies, such as [32], focused on using wrist-worn devices to measure how much a guitarist moves their hands, which helped analyze whether the guitarist made unnecessary movements. However, we believe that using any hardware device or wrist-worn glove could impede the guitarist's movement. The difference between our study and the studies using CNNs is that those studies only classified finger positions, treating their datasets as images. Inspired by the motivation of these studies and the methodology of research using the MediaPipe framework [28], we extracted the finger positions as coordinates from our images, treating our dataset as a set of x, y, and z coordinates in space. From this point on, we were not only able to recognize the finger positions of the guitarist, but also to classify whether the guitarist played correctly. The experimental results in Section 4 show accuracies of 71, 97, 70, 98, and 90 percent for SVM (before tuning), kNN, naive Bayes, random forest, and XG boosting, respectively, when labeling a position as either correct or incorrect. To further reinforce the accuracy of our system, we used signal detection to identify the note played whenever the guitarist plays. This opens other possibilities for training our ML models so that they are not limited to one exercise but can also classify a guitarist's overall technique, as discussed further in Section 8.

6. Conclusions

In this paper, we developed a guitar teaching system that provides real-time feedback to guitarists playing the A minor scale from the fifth position, using both note frequency recognition and hand and finger classification. We conducted experiments by first training a single model with several different algorithms to check the overall correctness of the hand's position, and then took a more fine-grained approach by training a model for each of the four fingers that a guitarist uses to fret the guitar. Our classification experiments achieved accuracies of more than 90% with the random forest, XG boosting, and kNN algorithms. Our note frequency detection approach using the harmonic product spectrum was 95% accurate in detecting the notes played, clearly exceeding the accuracy obtained with the DFT. As future work, we aim to add more exercises to our dataset and re-train our models in order to determine how this affects the accuracy.

7. Limitations

The proposed system captures the note played by the guitarist and their finger motion, allowing for the correction of both. However, the system still only works for the A minor scale in the fifth position. The system also requires the user to use both their camera and microphone regardless of whether they are using their mobile or their laptop.

8. Future Work

The MediaPipe framework proved the most suitable option for finger motion capture, tuning the SVM led to a marked improvement in finger position classification, and the harmonic product spectrum provided accurate note frequency recognition. In the future, we would like to apply this approach to a larger variety of exercises and scale our system to different instruments. We could also make the training of our ML algorithms more general; for example, by assessing the overall technique of the guitarist rather than the technique within one specific exercise.

Author Contributions

Conceptualization, W.H.E. and J.E.; methodology, A.S.; software, K.K.; validation, J.E. and A.Y.; formal analysis, W.H.E.; investigation, A.A.; resources, K.M.; data curation, H.S.; writing—original draft preparation, H.S.; writing—review and editing, K.M.; visualization, A.Y.; supervision, W.H.E.; project administration, W.H.E.; funding acquisition, A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This study is supported by funding from Prince Sattam Bin Abdulaziz University (project number PSAU/2023/R/1444).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Acknowledgments

We thank Khaled Azab, a live guitar performer, expert guitarist, and owner of the Spruce Music Academy.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Guy, P. Brief History of the Guitar. 2007. Available online: gyuguitars.com (accessed on 7 November 2022).
  2. Somogyi, E. The Guitar as an Icon of Culture, Class Status, and Social Values. Available online: https://esomogyi.com/articles/guitars-virtue-and-nudity-the-guitar-as-an-icon-of-culture-class-status-and-social-values/ (accessed on 7 November 2022).
  3. Central, O. What Is the World’s Best Selling Musical Instrument? 2007. Available online: orchestracentral.com (accessed on 15 February 2023).
  4. Staff, M. Guitar 101: What Is an Electric Guitar? Plus Tips for Perfecting Your Electric Guitar Techniques. Master Class. 2021. Available online: https://www.masterclass.com/articles/guitar-101-what-is-an-electric-guitar-plus-tips-for-perfecting-your-electric-guitar-technique (accessed on 15 February 2023).
  5. Peczeck, E. Left and Right Hands. Classical Guitar Academy. 2015. Available online: https://www.classicalguitaracademy.co.uk/left-right-hands/ (accessed on 15 February 2023).
  6. Matthies, A. Is Lead Guitar Harder Than Rhythm Guitar? April 2020. Available online: https://guitargearfinder.com/faq/lead-vs-rhythm-guitar/ (accessed on 15 February 2023).
  7. Constine, J. Fender Goes Digital So you do not have to quit Guitar. Techcrunch. 2015. Available online: https://techcrunch.com/2015/09/10/software-is-eating-rocknroll/ (accessed on 15 February 2023).
  8. Bienstock, R. 90 Percent of New Guitarists Abandon the Instrument within a Year, According to Fender. Guitar World. 2019. Available online: https://www.guitarworld.com/news/90-percent-of-new-guitarists-abandon-the-instrument-within-a-year-according-to-fender (accessed on 15 February 2023).
  9. Cole, J. In-Person, Online, or DIY: What is the Best Way to Learn Guitar? TakeLessons Blog. 2020. Available online: https://takelessons.com/blog/best-way-to-learn-guitar (accessed on 15 February 2023).
  10. Perez-Carrillo, A. Finger-string interaction analysis in guitar playing with optical motion capture. Front. Comput. Sci. 2019, 1, 8. [Google Scholar] [CrossRef]
  11. Patel, J.K.; Gopi, E.S. Musical Notes Identification Using Digital Signal Processing; National Institute of Technology: Punjab, India, 2015. [Google Scholar]
  12. Xue, Y.; Ju, Z. SEMG based Intention Identification of Complex Hand Motion using Nonlinear Time Series Analysis. In Proceedings of the 2019 9th International Conference on Information Science and Technology (ICIST), Cairo, Egypt, 24–26 March 2019; pp. 357–361. [Google Scholar] [CrossRef]
  13. Yoshikawa, M.; Mikawa, M.; Tanaka, K. Real-Time Hand Motion Estimation Using EMG Signals with Support Vector Machines. In Proceedings of the 2006 SICE-ICASE International Joint Conference, Busan, Republic of Korea, 18–21 October 2006; pp. 593–598. [Google Scholar] [CrossRef]
  14. Kashiwagi, Y.; Ochi, Y. A Study of Left Fingering Detection Using CNN for Guitar Learning. In Proceedings of the 2018 International Conference on Intelligent Autonomous Systems (ICoIAS), Singapore, 1–3 March 2018; pp. 14–17. [Google Scholar] [CrossRef]
  15. Burns, A.M.; Wanderley, M.M. Visual methods for the retrieval of guitarist fingering. In Proceedings of the 2006 conference on New Interfaces for Musical Expression. Citeseer, Paris, France, 4–6 June 2006; pp. 196–199. [Google Scholar]
  16. Kerdvibulvech, C.; Saito, H. Real-time guitar chord recognition system using stereo cameras for supporting guitarists. Proc. ECTI Trans. Electr. Eng. Electron. Commun. 2007, 5, 147–157. [Google Scholar]
  17. Yang, C.K.; Tondowidjojo, R. Kinect v2 Based Real-Time Motion Comparison with Re-targeting and Color Code Feedback. In Proceedings of the 2019 IEEE 8th Global Conference on Consumer Electronics (GCCE), Osaka, Japan, 15–18 October 2019; pp. 1053–1054. [Google Scholar]
  18. Yoshida, K.; Matsushita, S. Visualizing Strumming Action of Electric Guitar with Wrist-worn Inertial Motion Sensors. In Proceedings of the 2020 IEEE 9th Global Conference on Consumer Electronics (GCCE), Kobe, Japan, 13–16 October 2020; pp. 739–742. [Google Scholar]
  19. Chaikaew, A.; Somkuan, K.; Yuyen, T. Thai Sign Language Recognition: An Application of Deep Neural Network. In Proceedings of the 2021 Joint International Conference on Digital Arts, Media and Technology with ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunication Engineering, Cha-am, Thailand, 3–6 March 2021; pp. 128–131. [Google Scholar] [CrossRef]
  20. Abeßer, J.; Schuller, G. Instrument-centered music transcription of solo bass guitar recordings. IEEE ACM Trans. Audio Speech Lang. Process. 2017, 25, 1741–1750. [Google Scholar] [CrossRef]
  21. Farhan, Y.; Ait Madi, A. Real-time Dynamic Sign Recognition using MediaPipe. In Proceedings of the 2022 IEEE 3rd International Conference on Electronics, Control, Optimization and Computer Science (ICECOCS), Kenitra, Morocco, 1–2 December 2022; pp. 1–7. [Google Scholar] [CrossRef]
  22. Bugarin, C.A.Q.; Lopez, J.M.M.; Pineda, S.G.M.; Sambrano, M.F.C.; Loresco, P.J.M. Machine Vision-Based Fall Detection System using MediaPipe Pose with IoT Monitoring and Alarm. In Proceedings of the 2022 IEEE 10th Region 10 Humanitarian Technology Conference (R10-HTC), Hyderabad, India, 16–18 September 2022; pp. 269–274. [Google Scholar] [CrossRef]
  23. Thaman, B.; Cao, T.; Caporusso, N. Face Mask Detection using MediaPipe Facemesh. In Proceedings of the 2022 45th Jubilee International Convention on Information, Communication and Electronic Technology (MIPRO), Opatija, Croatia, 23–27 May 2022; pp. 378–382. [Google Scholar] [CrossRef]
  24. Wang, Z.; Ohya, J. A 3D guitar fingering assessing system based on CNN-hand pose estimation and SVR-assessment. Electron. Imaging 2018, 2018, 204-1–204-5. [Google Scholar] [CrossRef]
  25. Kirthika, P.; Chattamvelli, R. A Review of Raga Based Music Classification and Music Information Retrieval. In Proceedings of the 2012 IEEE International Conference on Engineering Education: Innovative Practices and Future Trends (AICERA), Kottayam, India, 19–21 July 2012. [Google Scholar]
  26. Parshanth, T.R.; Venugopalan, R. Note Identification in Carnatic Music From Frequency Spectrum. In Proceedings of the 2011 International Conference on Communications and Signal Processing, Kerala, India, 10–12 February 2011. [Google Scholar]
  27. Lugaresi, C.; Tang, J.; Nash, H.; McClanahan, C.; Uboweja, E.; Hays, M.; Zhang, F.; Chang, C.L.; Yong, M.G.; Lee, J.; et al. Mediapipe: A framework for building perception pipelines. arXiv 2019, arXiv:1906.08172. [Google Scholar]
  28. Zhang, F.; Bazarevsky, V.; Vakunov, A.; Tkachenka, A.; Sung, G.; Chang, C.L.; Grundmann, M. Mediapipe hands: On-device real-time hand tracking. arXiv 2020, arXiv:2006.10214. [Google Scholar]
  29. Sripriya, N.; Nagarajan, T. Pitch estimation using harmonic product spectrum derived from DCT. In Proceedings of the 2013 IEEE International Conference of IEEE Region 10 (TENCON 2013), Xi’an, China, 22–25 October 2013; pp. 1–4. [Google Scholar] [CrossRef]
  30. Raguraman, P.; Mohan, R.; Vijayan, M. LibROSA Based Assessment Tool for Music Information Retrieval Systems. In Proceedings of the 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), San Jose, CA, USA, 28–30 March 2019; pp. 109–114. [Google Scholar] [CrossRef]
  31. Ooaku, T. Guitar Chord Recognition Based on Finger Patterns with Deep Learning. In Proceedings of the ICCIP ’18: Proceedings of the 4th International Conference on Communication and Information Processing, Qingdao, China, 2–4 November 2018; pp. 0785–0789. [Google Scholar] [CrossRef]
  32. Howard, B.; Howard, S. Lightglove: Wrist-worn virtual typing and pointing. In Proceedings of the Fifth International Symposium on Wearable Computers, Zurich, Switzerland, 8–9 October 2001; pp. 172–173. [Google Scholar] [CrossRef]
Figure 1. Overview of the proposed hybrid approach.
Figure 2. Kinect false detection.
Figure 3. MediaPipe hand landmarks.
Figure 4. Open-string frequencies.
Figure 5. Pipeline flow.
Figure 6. Recorded data sample.
Figure 7. Correct finger placement.
Figure 8. Incorrect finger placement.
Figure 9. C4 note waveform.
Figure 10. C4 note power spectrum.
Figure 11. Correct finger placement.
Figure 12. Incorrect finger placement.
Figure 13. Sigmoid and RBF kernel tuning.
Figure 14. RBF kernel tuning.
Table 1. Electric guitar recordings.
Size: 2.5 GB
Number of rows (frames): 25,544
Number of features: 68
Duration: 7 to 30 s
Number of videos: 67
Number of classes: 2
Number of incorrect-class videos: 33
Number of correct-class videos: 32
Guitar type: Cort CR50 electric guitar
Camera for recording: iPhone 6
Table 2. Classical guitar recordings.
Size: 1.5 GB
Number of rows (frames): 12,607
Number of features: 68
Duration: 2 to 30 s
Number of videos: 40
Number of classes: 2
Number of incorrect-class videos: 20
Number of correct-class videos: 20
Guitar type: Spanish classical guitar
Camera for recording: Xiaomi Note 8 Pro
Table 3. Note labeling results using signal processing approaches.
Note | DFT Detected Freq. | DFT Closest Freq. | HPS Detected Freq. | HPS Closest Freq. | Standard Freq.
C3 | 262 | 261 | 131 | 130 | 130
C4 | 262 | 261 | 262 | 261 | 261
C5 | 525 | 523 | 525 | 523 | 523
C♯3 | 282 | 277 | 140 | 138 | 138
C♯4 | 278 | 277 | 278 | 277 | 277
C♯5 | 557 | 554 | 556 | 554 | 554
D3 | 147 | 146 | 147 | 146 | 146
D4 | 294 | 293 | 294 | 293 | 293
D5 | 590 | 587 | 590 | 587 | 587
D♯3 | 156 | 155 | 156 | 155 | 155
D♯4 | 312 | 311 | 311 | 311 | 311
D♯5 | 625 | 622 | 626 | 622 | 622
E2 | 165 | 164 | 82 | 82 | 82
E3 | 330 | 329 | 165 | 164 | 164
E4 | 330 | 329 | 330 | 329 | 329
E5 | 1323 | 1318 | 661 | 659 | 659
F2 | 176 | 174 | 88 | 87 | 87
F3 | 175 | 174 | 175 | 174 | 174
F4 | 350 | 349 | 349 | 349 | 349
F5 | 2106 | 2093 | 350 | 349 | 698
F♯2 | 186 | 185 | 93 | 92 | 92
F♯3 | 185 | 185 | 185 | 185 | 185
F♯4 | 471 | 369 | 371 | 369 | 369
F♯5 | 743 | 739 | 743 | 739 | 739
G2 | 197 | 196 | 98 | 98 | 98
G3 | 394 | 392 | 196 | 196 | 196
G4 | 393 | 392 | 394 | 392 | 392
G5 | 572 | 587 | 788 | 784 | 784
G♯2 | 208 | 207 | 104 | 104 | 104
G♯3 | 393 | 392 | 207 | 208 | 208
G♯4 | 417 | 415 | 415 | 415 | 415
G♯5 | 835 | 830 | 834 | 830 | 830
A2 | 111 | 110 | 111 | 110 | 110
A3 | 220 | 220 | 220 | 220 | 220
A4 | 441 | 440 | 443 | 440 | 440
A5 | 1766 | 1760 | 885 | 880 | 880
A♯2 | 117 | 116 | 117 | 116 | 116
A♯3 | 235 | 255 | 234 | 233 | 233
A♯4 | 469 | 466 | 470 | 466 | 466
A♯5 | 885 | 880 | 886 | 880 | 932
B2 | 125 | 123 | 124 | 123 | 123
B3 | 247 | 246 | 248 | 246 | 246
B4 | 496 | 493 | 497 | 493 | 493
B5 | 995 | 987 | 987 | 987 | 987
Table 4. Hand position metric scores for various algorithms.
Metric | SVM | kNN | Naive Bayes | Random Forest | XG Boosting
Accuracy | 0.63 | 0.76 | 0.68 | 0.99 | 0.93
Precision | 0.67 | 0.84 | 0.66 | 0.99 | 0.97
Recall | 0.63 | 0.86 | 0.66 | 0.99 | 0.97
Table 5. Multi-finger metric scores for various algorithms.
Finger | SVM | kNN | Naive Bayes | Random Forest | XG Boosting
Accuracy
Index | 71% | 97% | 70% | 98% | 90%
Middle | 81% | 98% | 68% | 98% | 90%
Ring | 82% | 98% | 66% | 97% | 92%
Pinky | 81% | 98% | 70% | 97% | 89%
Precision
Index | 65% | 97% | 63% | 96% | 87%
Middle | 83% | 98% | 64% | 98% | 94%
Ring | 72% | 98% | 70% | 98% | 91%
Pinky | 83% | 97% | 79% | 97% | 88%
Recall
Index | 21% | 97% | 74% | 95% | 86%
Middle | 28% | 98% | 81% | 97% | 91%
Ring | 19% | 98% | 79% | 97% | 92%
Pinky | 70% | 97% | 86% | 96% | 88%

