Article
Peer-Review Record

Annotated-VocalSet: A Singing Voice Dataset

Appl. Sci. 2022, 12(18), 9257; https://doi.org/10.3390/app12189257
by Behnam Faghih * and Joseph Timoney
Reviewer 1:
Reviewer 2:
Submission received: 5 August 2022 / Revised: 2 September 2022 / Accepted: 8 September 2022 / Published: 15 September 2022
(This article belongs to the Special Issue Algorithmic Music and Sound Computing)

Round 1

Reviewer 1 Report

This work contributes a set of annotations of a pre-existing dataset of recordings of singing. This may be useful for those researching certain aspects of singing, as is the overview in Table 1. The description of prior work and references are almost too detailed for the rest of the paper. The contribution here is to announce the existence of a new resource to those who need datasets of solo singing, and to describe it. There are no new research results to speak of. However, for any future papers doing work on the particular data set in question, the present article would serve as a time-saving background. A few observations are made on the effect of different ways of assessing the per-note fo and of determining the onset and offset times.

Specific questions and comments

Why are you reporting fo differences in Hz, and not in cents? The latter would be invariant with the musical pitch of the note. A 2 Hz difference is about 34 cents at 100 Hz, but only 3.4 cents at 1000 Hz.
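To make the reviewer's arithmetic concrete, a minimal sketch of the Hz-to-cents conversion is given below; the function name is illustrative and not taken from the paper or the review.

import numpy as np

def cents_difference(f_est, f_ref):
    """Interval between two frequencies in cents (1200 cents per octave)."""
    return 1200.0 * np.log2(f_est / f_ref)

# The same 2 Hz error corresponds to very different musical intervals:
print(cents_difference(102.0, 100.0))    # ~34.3 cents at 100 Hz
print(cents_difference(1002.0, 1000.0))  # ~3.5 cents at 1000 Hz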

Line 134: what were your criteria for deciding when the fo extraction was incorrect? You explain this later, but you could mention here that it was mostly automated.

Tables 2-4: the numbers of decimals given here are pointlessly large. If you must use p-values, threshold them, e.g. <0.05, <0.01. t-test is written with lower-case italic t. See also Wasserstein, R.L.; Schirm, A.L.; Lazar, N.A. Moving to a World Beyond “p < 0.05.” Am. Stat. 2019, 73, 1–19. https://doi.org/10.1080/00031305.2019.1583913.

Tables 5-7: the numbers of decimals given here are pointlessly large. It is good practice to round off to the limit of either uncertainty, or if the uncertainty is very small, to the limit of relevance. No one is helped by eight decimals, for this kind of data.
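As an illustration only of the reporting conventions suggested in the two comments above, the helpers below (assumed names, not from the paper) show one way to threshold p-values and to round a statistic to the precision implied by its uncertainty.

import math

def format_p(p):
    # Report thresholded p-values rather than long decimal strings
    for cutoff in (0.001, 0.01, 0.05):
        if p < cutoff:
            return f"p < {cutoff}"
    return f"p = {p:.2f}"

def round_to_uncertainty(value, uncertainty):
    # Keep only the decimals justified by the measurement uncertainty
    decimals = max(0, -int(math.floor(math.log10(uncertainty))))
    return round(value, decimals)

print(format_p(0.00000312))                    # "p < 0.001"
print(round_to_uncertainty(3.14159265, 0.02))  # 3.14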

Fig 1: the purpose of the figure is clear, but the font sizes should be increased, and the tick marks should be more sparse. A time scale ticked in units of hundreds of ms would be preferable to the arbitrary division used here.
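A generic matplotlib sketch of the formatting suggested here (larger fonts, sparser ticks, a time axis in steps of 100 ms); the dummy contour and all parameter values are assumptions for illustration, not the authors' figure code.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

# Dummy pitch contour standing in for one of the annotated files
times = np.linspace(0.0, 2.0, 400)
f0 = 220.0 + 8.0 * np.sin(2 * np.pi * 5.5 * times)   # crude vibrato

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(times, f0)
ax.set_xlabel("Time (s)", fontsize=14)
ax.set_ylabel("F0 (Hz)", fontsize=14)
ax.tick_params(labelsize=12)
ax.xaxis.set_major_locator(MultipleLocator(0.1))   # one tick every 100 ms
fig.tight_layout()
plt.show()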

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

This paper describes how the authors annotate an existing dataset. I do agree that it is important to produce good annotated datasets and do want to support their publication. It is, however, difficult to give this strong marks for novelty, since they are building on an existing published dataset using previously published methods. I do, however, think that the annotations the authors are seeking to share are a useful contribution.

Does this rate a journal article, however? That, I'm less sure of. Perhaps if there were significantly more detail about the methodology applied in the annotation, I could be convinced of that. Important detail about how they did the annotation process, however, is missing. The authors describe key steps extremely briefly, saying in a single sentence that they "removed erroneous pitch contours" or "a software tool corrected errors." This level of detail strikes me as insufficient if the only thing they are reporting in the article is their process for performing the annotation. Detailed comments follow.

============

The entirety of the section on estimating pitch contour is quoted below:

"Estimating fundamental frequencies

A state-of-the-art pitch detector algorithm, PYin [23], was employed to estimate the fundamental frequencies of each file. The implementation of the PYin in Librosa [31] was used as one of the well-known Python libraries. Although the PYin algorithm is considered a reliable pitch estimator [32,33], it still returns incorrect estimates for some F0s. Therefore, the Smart-Median pitch smoother algorithm [34] was employed to smooth the pitch contours estimated by PYin. However, after plotting all the pitch contours, it was realised that some of the pitch contours were incorrect; therefore, we removed them from the Annotated-VocalSet to have a reliable set of pitch contours."
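For readers unfamiliar with the quoted step, a minimal sketch of pYIN F0 estimation as implemented in librosa follows; the file name and fmin/fmax range are assumptions of this illustration, not the authors' actual settings, and the Smart-Median smoothing stage [34] is not reproduced here.

import librosa

# Load one VocalSet recording (illustrative path) at its native sample rate
y, sr = librosa.load("vocalset_example.wav", sr=None)

# pYIN fundamental-frequency estimation; the range roughly covers singing voice
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),
    fmax=librosa.note_to_hz("C6"),
    sr=sr,
)

# Frame times corresponding to the F0 estimates
times = librosa.times_like(f0, sr=sr)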

Given that all their annotation work is based on the step of estimating F0, it would help to have more than a single paragraph describing their process. Questions that occur to me after reading this paragraph include:

* PYin is 8 years old and has been superseded by Crepe (https://github.com/marl/crepe). Why didn't the authors use this more modern pitch tracker?

* What is the effect of the pitch smoother?

* Does Figure 1 (as seems likely) display the result of the smoothed pitch tracks?

* Did the authors review every single pitch track in the data set or did they spot check a few?

* What was their inclusion criterion for deciding a pitch track was good enough to be included?

* What proportion of pitch tracks were discarded?

* Did they discard entire files, or just the portions of a file that were pitch-tracked incorrectly?

=============

In Section 2.2. Detecting onsets, offsets, and transitions, there are some questions I have:  

They say that they use their prior work as a method of detecting onsets/offsets. It would be nice to have a paragraph that summarizes the overall approach of this method and mentions why it was chosen over other methods that (for example) use amplitude (as opposed to pitch contours) to determine onsets.

They then state "Some errors with the estimated events have been corrected with a software tool developed by this paper's authors." This is the full description. I do appreciate that the authors plan to eventually release their code, and I suppose an interested reader could attempt to read their source code (once released...it does not appear to be available yet) to determine what the software does. To me, though, the whole point of a journal article is to describe what was done. So could the authors describe what this tool does? What kinds of errors are detected and corrected by this tool? What is the general approach?

The authors state: "Therefore, the estimated notes and the scores were automatically combined by a piece of code created by this paper’s authors". What do they mean by "combined"? How was this done? Were they aligned using a dynamic time warping algorithm? Something else? Again, more explanation would be appreciated.
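Purely to illustrate the kind of alignment the reviewer is asking about, and not the authors' actual procedure, the sketch below uses dynamic time warping to pair a sequence of estimated note pitches with the score's note pitches; the toy MIDI values are invented for the example.

import numpy as np
import librosa

# Toy note-pitch sequences as MIDI numbers; one estimated note is split in two
estimated = np.array([60, 60, 62, 64, 65], dtype=float)
score = np.array([60, 62, 64, 65], dtype=float)

# DTW over the two 1-D pitch sequences; wp is the warping path (end to start)
D, wp = librosa.sequence.dtw(
    X=estimated[np.newaxis, :],
    Y=score[np.newaxis, :],
    metric="euclidean",
)

for i, j in wp[::-1]:
    print(f"estimated note {i} <-> score note {j}")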

In Section 2.5.....

"All the pitch contours and the events were plotted, similar to Figure 1, to double- check them manually." How were they double-checked? What would a person do to decide that something was right or wrong? How many people checked each one? A single person?

"In the process of combining the extracted notes with the scores as discussed in Section 0, if the number of extracted notes was not equal to the number of scores in the ground truth, the automatic tool listed the incorrect files to be investigated by the user."

I didn't see a Section 0. Also, they refer to Section 0 repeatedly throughout the document. I think something went wrong with an automated section-reference tool.

Also, around line 220 or so, I wonder if the authors meant "if the number of extracted notes was not equal to the number of NOTES in the ground truth". They used the word "scores" where it seems like "notes" is the intended word.

============================

Section 4. Comparing the four methods of selecting the positions of onset, offset, and transition

The authors state "This Section compares the four approaches to select the onset, offset, and transition discussed in Section 0. To compare them, the nominal pitch frequencies of notes are considered to be the ground truth."

Again, what section is Section 0? Also, they don't define what the nominal pitch frequencies are. Since these are the ground truth they use, it seems important to describe how they obtain the pitch frequency used as ground truth. Once this is done, it would help to have an explanation of how the difference between the nominal ground-truth pitch and the average pitch translates into a good measure for determining where onsets should be placed.
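As an illustration only of the comparison the reviewer is describing (assumed function name and averaging choice, not the paper's method), one could express each note's deviation from its nominal score pitch in cents:

import numpy as np

def note_deviation_cents(f0_segment, nominal_hz):
    """Deviation of a note's average F0 from its nominal (score) pitch, in cents."""
    voiced = f0_segment[~np.isnan(f0_segment)]   # ignore unvoiced frames
    return 1200.0 * np.log2(np.mean(voiced) / nominal_hz)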

Author Response

Please see the attachment.

Author Response File: Author Response.docx
