1. Introduction
One of the fundamental processes in analyzing audio signals is finding the start and end points of notes, which are called the onset and the offset, respectively. Onset and offset are not exact points/times that are universally agreed as the start and end of a note but exist within an acceptable range [1,2,3,4].
Several applications need the results of onset/offset detection, such as tempo and pitch estimation, beat tracking, score following, automatic music transcription, and analysis of recorded music. Real-time music applications demand almost instantaneous results, i.e., real-time onset detection for systems such as the interactive music systems described by Müller-Rakow [5] and Malloch [6], or for music transcription as discussed by Kroher and Díaz-Báñez [7]. Therefore, it is vital to minimize the time delay between an onset or offset and its detection in real-time environments.
Over the years, many research contributions have been made to onset detection, but most work offline. If the onset detection function has been appropriately created, onset events will give rise to well-localized, recognizable features, e.g., a peak, in the detection function [8]. Several common approaches to detecting onsets, such as spectral difference, phase deviation, wavelet regularity modulus, negative log-likelihood, and high-frequency content, are well explained in the study by Bello et al. [8] and were compared by Collins [9]. Moreover, Dixon [10] proposed several enhancements to some of these methods.
In addition, Lacoste and Eck [11] proposed an offline music onset detection algorithm using single and combined versions of Artificial Neural Networks (ANNs) trained with different hyperparameters, and Eyben et al. [12] employed a Recurrent Neural Network (RNN) based on Mel spectrograms. Furthermore, after pre-processing with a time-variant filter, a method using Hidden Markov Models (HMMs) was proposed by Degara et al. [13] for offline onset detection. Schlüter and Böck [14] refined the model proposed by Eyben et al. [12] and trained Convolutional Neural Networks (CNNs) with mini-batch gradient descent (which splits the training dataset into small batches) to reduce model error; the input to their model was two log Mel spectrograms. Their approach outperformed other traditional methods and required less additional processing. However, the peak-picking approaches used for CNN- and RNN-based methods rely on future information (they are not probabilistic) to detect an event; thus, they cannot work for real-time music onset detection.
Some studies focus mainly on detecting onsets from singing signals. For instance, the singing onset detection method of Toh et al. [15] is based on audio features such as Mel-frequency cepstral coefficients, linear predictive cepstrum coefficients, pitch stability, zero-crossing rate, and signal periodicity. First, the extracted audio features are classified into onset and non-onset frames using Gaussian Mixture Models (GMMs). After GMM scoring, the feature evaluation proceeds with a dual detection function (feature-level and decision-level fusion) for higher accuracy in selecting the most suitable features. This method achieved 86.5% precision, 83.9% recall, and an F-measure of 85.2%. Recall is the proportion of real positive cases that are correctly predicted positive, while precision is the fraction of predicted positive cases that are real positives. In binary classification, the F-measure summarizes a test's accuracy as the harmonic mean of precision and recall. Its value lies between 0 and 1: the highest value indicates perfect precision and recall, while the lowest indicates that either the precision or the recall is zero [16]. However, despite the high F-measure, their result could still be biased because of the dataset they used: the training and test sets come from a tiny dataset comprising 18 singing recordings from four singers, with 1127 onsets.
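In symbols, with precision $P$ and recall $R$, the F-measure is

$F = \dfrac{2 \times P \times R}{P + R}$

and, with the values reported above, $F = 2(0.865 \times 0.839)/(0.865 + 0.839) \approx 0.852$, consistent with the 85.2% figure.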
In the study conducted by Gong and Serra [17], a deep learning model was trained for musical onset detection in solo singing, and the authors discussed how their algorithm could lead to improved live onset detection models. They used two datasets: one contains more than 25,000 onsets, mostly from complex mixtures or solo instrumental excerpts, with only three excerpts of a solo singing voice; the other is a subset of solo Jingju singing voice containing 100 recordings. They employed seven deep learning-based architectures.
In the Gong and Serra [17] study, the score-informed method was preferred when musical score information was available. Score-informed approaches evaluate the data with the assistance of musical scores. Based on the results, the score-informed HMM outperformed peak picking for all of the architectures used in their experiment [17]. The reported F-measure for the combination of the peak-picking method and a no-dense neural network architecture was 73.88%, with a p-value of 0.002. For the score-informed HMM method, a nine-layer CNN architecture worked best, giving an F-measure of 80.90% and a p-value of 0.001. Learning strategies for inter-dataset knowledge transfer were also studied, but the authors reported that when the musical patterns of the two datasets used to train their model differed, the onset prediction was not accurate.
Despite these studies, onset detection of a musical note remains a challenge, primarily for the singing voice. Chang and Lee [18] explain several reasons for this, including inconsistency of articulation, singer-dependent tonal quality, and gradual variation in onset envelopes over time. In other words, the time-varying spectral envelope and the inconsistency of the vocal tract may produce spurious maxima (i.e., peaks) in an onset detection function, lowering the precision of onset detection. Therefore, detecting onsets from the singing voice remains an active area of study because of waveform unpredictability and the occurrence of many noisy segments. Moreover, most methods are only suitable for recorded singing and are designed to work offline.
According to previously published results, most existing approaches do not work well for soft onsets, which are common in singing. A soft onset has a long attack duration or a vague envelope shape, which challenges the peak-picking procedure. The underlying reason is that the singing voice is classified as a pitched non-percussive (PNP) instrument, and PNP instruments still present a challenge for onset detection [9]. The nature of the singing voice adds further complexity due to its natural inconsistency with respect to pitch and time dynamics. Unlike instruments whose timbre is usually consistent throughout a note, the singing voice inherently produces more variation in formant structures (for articulation); it may even vary within the duration of a single note [19]. While most onset detection algorithms are based on detecting spectral changes, they can fail to distinguish such variations in a singing voice from true onsets because of singing features such as vibrato and soft onsets.
Relevant challenges for onset detection in solo singing voices were identified in a report from the Music Information Retrieval Evaluation eXchange 2012 (MIREX 2012). According to this report, the best-performing detection method achieved an F-measure of only 55.9% [4], which drops further for solo sustained strings, with an average F-measure of 52.8%. In addition, building training datasets for the dynamically changing patterns in a singing voice is still a challenge [17,20].
A missing element in most onset detection algorithms is consideration of actual singing-style features. The study by Mayor et al. [21] shows that one of the crucial features that should be taken into account in onset detection is the transition from one note to another where there is no intervening silence, i.e., legato singing [21]. During a transition, the singer takes a while to reach the target note. If this transition time is not incorporated, the onset detector cannot find the correct times for onset and offset events. These transitions are categorized as soft onsets.
This paper aims to introduce a new onset detection algorithm that incorporates more knowledge about singing features for more accurate onset estimation. Although the results of this study are based on an offline F0 estimation algorithm, the proposed algorithm can work in a real-time environment if the fundamental frequencies can be estimated correctly.
The following section explains the methodology. After that, in Section 3, the new algorithm is discussed in detail. Then, the evaluation results for the proposed algorithm are presented and discussed in Section 4. Finally, the last section concludes the paper and its findings.
3. The Proposed Algorithm
This algorithm is based on our observations from investigations involving many singing pitch contours. From the plotted pitch contours, it was noticed that there is a distinct trajectory change in the fundamental frequency when moving from one note to another. Therefore, the proposed algorithm focuses on evaluating the changes in a pitch contour to identify those meaningful changes that signify onsets, offsets, and transitions.
The pitch contour is selected because it is a more robust indicator of onset than other features. For example, Rabiner and Sambur [34] looked for significant changes in the sound energy contour to find the start and end of an isolated utterance. Their approach is based on short-time energy and the zero-crossing rate. However, although a noticeable change in the amplitude contour is easy to see when silence exists between notes, as considered by Rabiner and Sambur [34], it is difficult to rely on the amplitude contour when analyzing legato singing, as unpredictable variation occurs in the movement from one note to the next. In contrast, the fundamental frequency track is either erratic before the onset and then quickly becomes stable, or it moves smoothly from one value to the next in the case of legato singing, even when consecutive notes are at the same pitch. Thus, the proposed algorithm can be explained as seven main steps to find the onsets, offsets, and transitions, as shown in Figure 1. The steps are explained in the subsequent subsections.
3.1. Estimating F0s
Since the algorithm is based on the fundamental frequencies, the F0s must be estimated correctly. However, as mentioned in [35,36,37], current real-time pitch detection algorithms are unreliable when applied to singing phrases. Therefore, following the study by Faghih and Timoney [35], a more reliable offline algorithm, pYin [38], was employed to avoid the compounding effect that a real-time pitch detector's errors would introduce into this analysis. Thus, it was possible to evaluate the accuracy of the onset algorithm without any adverse effects caused by the pitch detection algorithm. A Python library, Librosa [22], was used for pYin.
The main difference between real-time and offline algorithms is the amount of data they need for their calculations. Real-time algorithms are based only on previous data points and/or a few later data points, meaning that only a short buffer delay is required. Offline algorithms, on the other hand, require a long buffer delay to have sufficient data to perform their calculations. Using the pYin algorithm does not mean the proposed algorithm needs a long buffer delay to obtain a large amount of data; the algorithm can work with a very short buffer delay, as explained below.
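As an illustration, the F0 contour can be obtained with Librosa's pyin function along the following lines (a minimal sketch; the file name and frequency bounds are illustrative, and unvoiced frames are mapped to zero to mark rests):

```python
import librosa
import numpy as np

# Load a solo singing recording (file name is illustrative).
y, sr = librosa.load("solo_singing.wav", sr=None, mono=True)

# Estimate F0s with pYin; the bounds roughly cover the vocal range
# discussed in Section 3.2.
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=77.0, fmax=1000.0, sr=sr)

# pyin returns NaN for unvoiced frames; treat them as rests (F0 = 0).
f0 = np.nan_to_num(f0)
```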
3.2. Stretching Pitch Contour
Since humans' vocal pitch range is wide, generally from 77 to 900 Hz [39], calculating significant changes in the pitch contour presents some difficulty. For example, for the same musical interval, the slope of the line between two notes at the bottom of the range is much smaller than the slope between the corresponding notes at the top of the range, because frequency spacing grows with pitch. Therefore, to counteract any adverse effect of this wide pitch frequency range on the slopes, the F0s are stretched to lie within almost the same pitch frequency range.
Figure 2 plots two estimated pitch contours (panels a and b) and the stretched versions of them (panels c and d, respectively). As depicted in Figure 2, although (a) and (b) are in different pitch frequency ranges, after stretching, the slopes between notes in both (c) and (d) are almost similar.
The following formulas, Equations (2) and (3), are used to implement the stretch:

$SF0_i = F0_i \times \dfrac{maxPossibleF0}{mF0_i}$ (2)

$SF0_j = F0_j \times \dfrac{maxPossibleF0}{mF0_i}, \quad j = 0, 1, \ldots, i$ (3)

where $SF0_i$ is the stretched value of the $i$-th F0, the variable $mF0_i$ holds the maximum F0 estimated up to index $i$, and the constant $maxPossibleF0$ holds the maximum possible F0. Since the maximum pitch frequencies of the singers in both datasets mentioned above are less than 1000 Hz, 1000 Hz is taken as $maxPossibleF0$ for this study. In Equation (2), if the current F0, $F0_i$, is greater than the $mF0$ variable, $mF0$ is updated and Equation (3) is run for all the F0s from index $0$ to index $i$.
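A minimal Python sketch of this stretching step, continuing the code above and assuming the formulation of Equations (2) and (3) (the variable names are ours):

```python
MAX_POSSIBLE_F0 = 1000.0  # Hz; the maxPossibleF0 used in this study

def stretch_pitch_contour(f0: np.ndarray) -> np.ndarray:
    """Stretch F0s so that contours from different ranges become comparable."""
    stretched = np.zeros_like(f0, dtype=float)
    m_f0 = 0.0  # maximum F0 estimated so far (mF0)
    for i, value in enumerate(f0):
        if value > m_f0:
            m_f0 = value
            # Equation (3): re-stretch all points up to i with the new maximum.
            stretched[: i + 1] = f0[: i + 1] * (MAX_POSSIBLE_F0 / m_f0)
        elif m_f0 > 0:
            # Equation (2): stretch the current point with the running maximum.
            stretched[i] = value * (MAX_POSSIBLE_F0 / m_f0)
    return stretched

stretched = stretch_pitch_contour(f0)
```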
3.3. Calculating the Stretched Pitch Contour Slopes
To find the significant changes in the F0s, the slopes between points in the pitch contour are needed. Figure 3 illustrates the process of calculating the slopes: the left-hand panel, (a), plots the estimated pitch contour; the middle panel, (b), shows the stretched version of the contour in panel (a), as discussed in Section 3.2; and the right-hand panel, (c), depicts the slopes between the F0s of the stretched pitch contour, computed by differentiating the contour. The vertical red lines in Figure 3 show the possible offset points, and the vertical green lines show the possible onset points.
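Continuing the sketch above, the slope track of panel (c) is simply the first difference of the stretched contour (assuming unit spacing between frames):

```python
# Differentiate the stretched contour; slopes[i] is the slope from
# point i to point i + 1.
slopes = np.diff(stretched)
```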
3.4. Calculating the Summation of Slopes in the Following Line
In singing, transitions can be observed as the singer moves from one note to another. An example is outlined between the two pairs of orange-colored lines in Figure 4.
In this step, the summation of the slopes of the following points is calculated at each point to find the transitions. In other words, as long as the direction of the line (upward, downward, or straight) in the stretched pitch contour remains the same, the slopes between every two consecutive points are added together. The algorithm is depicted in Figure 5, where $P$ is the current point.
The algorithm commences by computing the cumulative sum of the consecutive points in the slope representation; in other words, their amplitudes, the values on the y-axis in Figure 3c, are summed. From the evaluation of several manually annotated onsets, offsets, and transitions, it was observed that there is a sharp upward or downward movement between two consecutive notes in a pitch contour. Therefore, a heuristic function implemented using decision logic is applied to assess how much change happens after each new point. In addition, the number of consecutive points whose slope has the same sign as the current point's slope is found: that is, how many of the successive values are heading in the same direction. This count is computed by the function shown in Figure 5. Therefore, at this point, the algorithm detects when the slope changes sign.
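The run-summation step can be sketched as follows (an illustrative reading of Figure 5, not the authors' exact routine):

```python
def sum_following_slopes(slopes: np.ndarray, p: int) -> tuple[float, int]:
    """From point p, sum consecutive slopes that keep the sign of slopes[p].

    Returns the cumulative slope of the run and the number of points in it.
    """
    sign = np.sign(slopes[p])
    total, count = 0.0, 0
    for j in range(p, len(slopes)):
        if np.sign(slopes[j]) != sign:
            break  # the slope changed direction, so the run ends here
        total += slopes[j]
        count += 1
    return total, count
```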
3.5. Calculating the Mean of the Local Slopes
In this step, the mean of the local slopes needs to be calculated. This mean is always computed over a number of previous points up to and including the current point, as shown in Equation (4):

$\mu_i = \dfrac{1}{w} \sum_{j=i-w+1}^{i} s_j$ (4)

where $s_j$ is the slope at index $j$ and $w$ is the size of the window. The value of $w$ is important to produce a mean that reflects the fluctuations within a note. If $w$ is too big, it may include some older fluctuations that distort the local mean. In contrast, if $w$ is too small, there will not be enough fluctuations to calculate the correct local mean. The value of $w$ should be selected based on the singing technique, duration, and intervals. In this study, the selected values of $w$ were 230 ms for the Erkomaishvili dataset and 46 ms for the SVNote1 dataset. These selections were made by trial and error, adjusting the $w$ value to obtain the best result for one of the files of each dataset.
As shown in Figure 6, although the median duration of the notes in both datasets is almost the same, roughly 0.42 s, most of the notes in the Erkomaishvili dataset are longer than the median, whereas the durations of the notes in the SVNote1 dataset are distributed approximately uniformly below and above the average. Therefore, the variance of the notes' durations in the Erkomaishvili dataset is greater than in the SVNote1 dataset. In addition, the variance of the intervals between notes in the Erkomaishvili dataset is smaller than in the SVNote1 dataset. Thus, a different $w$ value was selected for each dataset.
3.6. Calculating the Standard Deviation of the Local Slopes
To define a trajectory change in the fundamental frequencies, the sample standard deviation of the local slopes is calculated as shown in Equation (5):

$\sigma_i = \sqrt{\dfrac{1}{w-1} \sum_{j=i-w+1}^{i} \left(s_j - \mu_i\right)^2}$ (5)

The same window size ($w$ value) as for calculating the mean is used for estimating the standard deviation.
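Equations (4) and (5) can be computed together over a sliding window; the sketch below also shows how a window duration in milliseconds might be converted to frames (the hop size is an assumption based on Librosa's defaults):

```python
def local_stats(values: np.ndarray, i: int, w: int) -> tuple[float, float]:
    """Mean and sample standard deviation of the w values ending at index i."""
    window = values[max(0, i - w + 1): i + 1]
    if len(window) < 2:  # not enough points yet for a sample deviation
        return float(window[0]), 0.0
    return float(np.mean(window)), float(np.std(window, ddof=1))

# Example: the 46 ms window used for SVNote1, assuming librosa's default
# hop of 512 samples at a 22,050 Hz sampling rate.
hop_s = 512 / 22050
w = max(2, round(0.046 / hop_s))  # about 2 frames
```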
3.7. Comparing the Current Slope with the Mean and Standard Deviation
In this step, all the required information is prepared to determine if a significant change has occurred in the fundamental frequency trajectory.
Each of the points in the pitch contour can have only one of the following statuses:
- A. Onset: this means the point is an onset.
- B. Offset: this means the point is an offset.
- C. StartTransition: this means a transition will follow, and this point is the start of the transition.
- D. EndTransition: this means it is the end of the transition.
- E. None: this means this point is neither the start nor the end of an event.
These statuses are illustrated in the diagram in Figure 4. The red and green lines show offset and onset events, respectively, while the orange lines denote a transition from a note to the following note, i.e., the points between an offset and its subsequent onset.
Figure 7 illustrates the algorithm for finding each point's status. This algorithm works based on the values calculated by the algorithm illustrated in Figure 5 and is run iteratively on each of the estimated pitch values.
First, a threshold for the local pitch contour's slope must be calculated. This is obtained by adding the mean of the local slopes at $P$ to the product of the standard deviation of the local slopes at $P$ and a coefficient $t$, i.e., $\mu_P + t \times \sigma_P$. The coefficient $t$ is a user-specified value that indicates which range of frequencies, based on their variation from the mean, should be considered as belonging to the same note. The value of $t$ does not define a fixed variation from the mean but is chosen based on the singer's techniques; for instance, when the singer uses vibrato, the variation is higher than when singing an unmodulated tone. This study selected $t = 5$ for the Erkomaishvili dataset and $t = 2$ for the SVNote1 dataset.
Second, if the slope at $P$ is bigger than the threshold, it means that a trajectory change has happened. This significant change must correspond to an onset, an offset, or a transition. If it is the first trajectory change after a silence (see Branch B in Figure 7), it is a movement to reach an Onset; otherwise (see Branch A in Figure 7), the current point is an Offset. Based on each of these situations, the Onset, Offset, StartTransition, and EndTransition statuses will be marked. The start and end of a transition come immediately after an Offset and immediately before an Onset, respectively; in other words, the start and end of a transition are one point apart from the Offset and Onset points.
When the algorithm finds a trajectory change at $P$, all the events between $P$ and the end of the run of same-direction slopes will be labeled; thus, the next point that needs to be checked is the point immediately after that run. Therefore, a jump of the run's length is made at the end of the algorithm to set the value of $P$ for the next iteration.
At the beginning, a flag indicating silence is set to true, and whenever a rest is reached (i.e., when the estimated F0 equals zero), this flag is set to true again.
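Putting the pieces together, the labeling loop of Figure 7 might look as follows. This is an illustrative reading that builds on the earlier sketches (local_stats and sum_following_slopes); comparing slope magnitudes against the threshold is our assumption:

```python
def label_events(f0, slopes, w, t):
    status = ["None"] * len(slopes)
    after_silence = True           # true at the start and after every rest
    p = 0
    while p < len(slopes):
        if f0[p] == 0:             # a rest: the next change reaches an onset
            after_silence = True
            p += 1
            continue
        mu, sigma = local_stats(np.abs(slopes), p, w)
        if abs(slopes[p]) > mu + t * sigma:        # trajectory change at p
            _, run = sum_following_slopes(slopes, p)
            end = min(p + run, len(slopes) - 1)    # guard the contour's end
            if after_silence:                      # Branch B in Figure 7
                status[end] = "Onset"
                after_silence = False
            else:                                  # Branch A in Figure 7
                status[p] = "Offset"
                if end - p > 1:
                    status[p + 1] = "StartTransition"
                    status[end - 1] = "EndTransition"
                status[end] = "Onset"
            p = end + 1            # jump past the labeled run
        else:
            p += 1
    return status
```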
4. Results and Discussion
This section provides the results and the details of the procedure for evaluating the proposed algorithm. It should be mentioned that the accuracy of the proposed real-time algorithm is compared against a set of real-time and offline algorithms. The delay of the proposed algorithm in calculating each event depends on its parameters, as mentioned in Section 3; delays of 230 and 46 ms are used for the Erkomaishvili and SVNote1 datasets, respectively.
Since the other onset detection algorithms mentioned in Section 2.2 estimate only onsets, not offsets and transitions, only onsets need to be extracted to evaluate and compare the proposed algorithm with them. Therefore, two types of onset times were considered: (1) only those points in the pitch contour labeled as an onset, illustrated by the green line in Figure 8; and (2) the middle point between the start time of the transition and the onset, illustrated by the pink lines in Figure 8, taken as the new onset point. The reason for considering the second type is to align with the approach used for the ground-truth datasets, which does not consider that transitions occur between notes; the annotators would therefore probably select a point between the red and green lines in Figure 8 as the onset. Considering the middle point should thus result in only a minor deviation from the ground truths.
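The second onset type is then a trivial computation (a sketch; times are in seconds):

```python
def middle_onset(t_start_transition: float, t_onset: float) -> float:
    """Pro Algorithm 2: the midpoint between transition start and onset."""
    return (t_start_transition + t_onset) / 2.0
```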
Generally, as shown in Figure 8, a range of points between the offset and the start of the following note could be selected as an onset. Therefore, the algorithms were compared with different window sizes of 10, 50, 100, 150, 200, and 250 ms for calculating the F-measure.
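The tolerance-window F-measure used for the comparison can be sketched as below, assuming a greedy one-to-one matching between estimated and ground-truth onset times (the matching policy is our assumption):

```python
def onset_f_measure(estimated, ground_truth, window_s):
    """F-measure where an estimate within window_s of a reference is a hit."""
    matched = set()
    tp = 0
    for est in estimated:
        for k, ref in enumerate(ground_truth):
            if k not in matched and abs(est - ref) <= window_s:
                matched.add(k)  # each reference onset can match only once
                tp += 1
                break
    precision = tp / len(estimated) if len(estimated) else 0.0
    recall = tp / len(ground_truth) if len(ground_truth) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```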
Table 1 and Table 2 display the F-measures computed for all the algorithms at the six window sizes. A larger window size yields a higher F-measure, since a larger difference between the ground truth and the estimated onset is accepted in that case. However, as seen in Table 1 and Table 2, beyond a window size of 150 ms, the rate of improvement in the F-measure values decreases. In addition, a window size of more than 250 ms would not be meaningful, since it would accept more than a 250 ms difference between the ground truth and the estimated onset, which is too long. These tables provide the similarity between the ground truth's onset times and the onset times estimated by each algorithm. As mentioned above, two onset point selections are considered for the proposed algorithm: the rows titled "Pro Algorithm 1" in Table 1 and Table 2 consider the green line in Figure 8 as the onset, while the rows titled "Pro Algorithm 2" select the middle point, i.e., the pink line in Figure 8.
All the algorithms show better results on the SVNote1 dataset than on the Erkomaishvili dataset. One possible reason is the better audio quality of the SVNote1 dataset. In addition, there is a spoken introduction at the beginning of each audio file that is not included in the annotations. Nevertheless, since all the algorithms work on the same audio files, they all face the same faulty sound, which does not affect the comparison.
As a result of the comparison, our proposed algorithm finds more correct onsets than the other algorithms when the window size is equal to or greater than 150 ms, as shown in the rows for Pro Algorithm 1 in Table 1 and Table 2. The bold numbers in these two tables highlight the best-performing algorithm.
Selecting the average of the onset and the start of the transition as the onset increases the accuracy of the proposed algorithm by 3.4% on average for the Erkomaishvili dataset. However, the opposite is the case for the SVNote1 dataset, for which the accuracy of onset identification decreased by 3.8%. The reason for these opposing results is that the annotator of the Erkomaishvili dataset placed onsets closer to the middle point, as depicted in Figure 8, whereas the SVNote1 dataset's annotators mostly placed onsets after the proposed algorithm's onset point, as shown in Figure 4. Both approaches can be interpreted as correct since, as mentioned above, the onset point in a pitch contour is not universally agreed upon but is deemed valid over a range of points.
To check the significance of the differences between the average F-measure values of the onset detection algorithms, ANOVA p-values were calculated over the F-measure values computed for every single file. The ANOVA p-values for both Table 1 and Table 2 were less than 0.0001, indicating a significant difference between the accuracies of the evaluated algorithms.
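A sketch of this test, using SciPy's one-way ANOVA over the per-file F-measures of each algorithm (the group values below are placeholders, not results from this study):

```python
from scipy.stats import f_oneway

# Per-file F-measures for each algorithm (placeholder values).
fmeasures_alg_a = [0.62, 0.58, 0.71, 0.66]
fmeasures_alg_b = [0.45, 0.40, 0.52, 0.48]
fmeasures_alg_c = [0.55, 0.51, 0.60, 0.57]

stat, p_value = f_oneway(fmeasures_alg_a, fmeasures_alg_b, fmeasures_alg_c)
print(f"ANOVA p-value: {p_value:.4f}")
```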
As another result, the average and the standard deviation of the duration of the transitions are shown in Table 3. This table also provides the minimum (average minus standard deviation) and the maximum (average plus standard deviation) typical durations for the transitions. The average transition duration is almost the same in both datasets. Overall, based on the results, the minimum and maximum durations of the transitions were approximately 16 and 98 ms, respectively. Therefore, since the proposed algorithm is based on the trajectory changes in a pitch contour, and the transitions reflect these significant changes, the minimum delay required to find the onset, offset, and transition is 16 ms and the maximum is 98 ms. However, most events should be found correctly within the average transition duration of around 57 ms. This delay would be acceptable for most real-time music information retrieval applications; for example, the real-time score-following system of Henkel and Widmer [40] requires a delay of around 56 ms.
Since the proposed algorithm is based on the changes in a pitch contour, it can estimate onsets more accurately when the intervals between notes are bigger and there are fewer soft onsets.
The accuracy of the proposed algorithm may be improved by considering more spectrogram channels, i.e., including other related frequency components from the spectrogram rather than only the fundamental frequencies. In this way, a more comprehensive formula that weights the measurements of each channel together could improve the overall measure, generating a new series of numbers from whose trajectory changes the onsets, offsets, and transitions would be found. In this approach, the adverse effect of incorrect F0 estimation may be reduced, especially in a real-time environment.
Moreover, the accuracy of the proposed algorithm can be improved by incorporating a function tracking significant changes in the magnitudes of each spectral channel that are also associated with the onset.
Another possible approach, instead of the stretching explained in Section 3.2, is to scale down all F0s to one specific octave and then use a logarithmic frequency axis. This approach may help regularize the slopes and make them comparable.
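A sketch of this alternative, folding voiced F0s into a single reference octave and taking logs (the reference frequency is arbitrary):

```python
f_ref = 110.0  # reference frequency (A2); any positive value works
voiced = f0[f0 > 0]                          # fold only voiced frames
octave = np.floor(np.log2(voiced / f_ref))
folded = voiced / (2.0 ** octave)            # now within [f_ref, 2 * f_ref)
log_contour = np.log2(folded)                # slopes become range-independent
```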
In addition, the algorithm is based on two parameters: the window size (as explained in Section 3.5) and the proportion of the standard deviation used to calculate the threshold (as discussed in Section 3.7). By evaluating the algorithm on other, larger datasets such as VocalSet [41], these parameters could be fixed to constant values that are generally applicable to all singers, or they could be determined by a formula and thereby adapt to the style of the input singing.
Furthermore, the algorithm’s efficiency and accuracy could be evaluated on notes performed by musical instruments to see if it is also applicable in that domain.
Lastly, making the algorithm more computationally efficient would allow it to work with a smaller buffer size and therefore run faster in real-time environments.