Article
Peer-Review Record

Optical Medieval Music Recognition—A Complete Pipeline for Historic Chants

Appl. Sci. 2024, 14(16), 7355; https://doi.org/10.3390/app14167355 (registering DOI)
by Alexander Hartelt 1,*, Tim Eipert 2 and Frank Puppe 1
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 4 June 2024 / Revised: 30 July 2024 / Accepted: 12 August 2024 / Published: 20 August 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Dear Authors,

 

Your paper is very well written and presents the topic of using MIR (Music Information Retrieval). Your work is important for preserving medieval chant music, especially Gregorian chants.

 

 

However, the article has some minor issues. First, the Corpus Monodicum project is mentioned only a few times. I think the Authors spent a lot of time making it. I suggest adding some sentences about it in the Conclusion section because it seems that the presented pipeline was used in the Corpus Monodicum.

 

 

The second small issue is the lack of modern transcription in Figure 3 (similar to Figure 6).

 

 

The parameter dSL is described only as the sentence (lines from 292 to 299). It would be better if the definition of dSL were written as a formula.

 

 

In line no. 313, the dSL parameter is written as đSL.

 

 

The issues in the References are the following:

 

The paper no. 26 is officially published. Reference: Wick, Christoph, Christian Reul, and Frank Puppe. "Calamari− A High-Performance Tensorflow-based Deep Learning Package for Optical Character Recognition." Digital Humanities Quarterly 14.2 (2020).

It is not necessary to refer to arXiv.

 

The paper no. 36 is officially published, too. Reference: Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer International Publishing, 2015.

It is not necessary to refer to arXiv.

 

 

In references 7, 20, and 29, the links to arXiv can be removed.

 

Author Response

Dear Reviewer,

We appreciate the time and effort that you have dedicated to providing your valuable feedback on our manuscript. We are grateful for your insightful comments on our paper. We have highlighted the changes within the manuscript in red (the file with highlights is uploaded as a supplementary file). Here is a point-by-point response to the reviewers’ comments and concerns.

Comment 1: However, the article has some minor issues. First, the Corpus Monodicum project is mentioned only a few times. I think the Authors spent a lot of time making it. I suggest adding some sentences about it in the Conclusion section because it seems that the presented pipeline was used in the Corpus Monodicum.
Response: Thank you for pointing this out. We agree with this comment. We have extended the conclusion by a few sentences in the context of the Corpus Monodicum project.
Location in manuscript: 731 – 743

Comment 2: The second small issue is the lack of modern transcription in Figure 3 (similar to Figure 6).

Response: Thank you for pointing this out. We agree with this comment. We extended the caption of Figure 1 to include a reference to a modern transcription mentioned later in the paper.
Location in manuscript: caption of Figure 1

Comment 3: In line no. 313, the dSL parameter is written as đSL.
Response: Thank you for pointing this out. We corrected the parameter.
Location in manuscript: 337

Comment 4: The parameter dSL is described only as the sentence (lines from 292 to 299). It would be better if the definition of dSL were written as a formula.

Response: Thank you for pointing this out. We agree with this comment. We added a formula for d_SL.
Location in manuscript: 321



Comment 5: The issues in the References are the following:

The paper no. 26 is officially published. Reference: Wick, Christoph, Christian Reul, and Frank Puppe. "Calamari− A High-Performance Tensorflow-based Deep Learning Package for Optical Character Recognition." Digital Humanities Quarterly 14.2 (2020).

It is not necessary to refer to arXiv.

The paper no. 36 is officially published, too. Reference: Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer International Publishing, 2015.

It is not necessary to refer to arXiv.

In references 7, 20, and 29, the links to arXiv can be removed.

Response: Thank you for pointing this out. We agree with this comment. We removed the arXiv links and replaced references which previously referred to arXiv.
Location in manuscript: References

 

Reviewer 2 Report

Comments and Suggestions for Authors

The submission studies the problem of transcribing chants from score images. The transcription requires multiple types of detection and coordination. The staves of the input image need detection, followed by the notes (neumes). Next, the text (in the medieval Latin hand) needs recognition. Then, the neumes and the text must be aligned vertically. The process has multiple sources of error. The paper presents a workflow for minimizing the number of errors, including CNNs for recognition. The authors evaluate the framework's performance using the amount of manual correction needed to generate a complete transcription, which is a required step regardless of the system used for transcription. The proposed framework has performed better than the existing ones. A challenging step in the transcription is text recognition because of the variability in the writing style. This can be overcome for the data set studied in the paper, to a degree, because the text style is uniform. Still, the similarity among the characters makes it quite difficult to correctly transcribe the text. The general approach that the authors used was comparing texts in an existing chant database, which has substantially reduced the error. The overall performance of the proposed framework is high, and the paper will appeal to scholars studying chant transcription. Another challenging step is the alignment of the neumes and the texts because the text region is wider than the neume region in some places. How to handle the alignment difficulty perhaps needs a bit more explanation in the final version of the paper.

The paper is well written. However, the tables contain typos, such as misspelled "existence" and "syllable." Therefore, the authors should check the manuscript one more time for spelling errors.

 

Author Response

Dear Reviewer/Editor,

We appreciate the time and effort that you have dedicated to providing your valuable feedback on our manuscript. We are grateful for your insightful comments on our paper. We have highlighted the changes within the manuscript in red (the file with highlights is uploaded as a supplementary file). Here is a point-by-point response to the reviewers’ comments and concerns.

Comment 1: How to handle the alignment difficulty perhaps needs a bit more explanation in the final version of the paper.
Response: Thank you for pointing this out. We agree with this comment. We have elaborated the Syllable assignment section.
Location in manuscript: 427-442

Comment 2: However, the tables contain typos, such as misspelled "existence" and "syllable." Therefore, the authors should check the manuscript one more time for spelling errors.

Response: Thank you for pointing this out.
Location in manuscript: Since it runs through the entire document, we do not give the exact line number of the changes here

Reviewer 3 Report

Comments and Suggestions for Authors

 

This is a paper with a lot of potential. It deals with very substantial contents and provides an interesting overview of possible solutions to problems related to digital transcription of older music manuscripts. The paper is ambitious and provides a strong methodology, but the readability and understandability of the text could be improved by providing clearer descriptions of the multiple technical terms. I propose some restructuring of the contents and much more explanation of technical matters that are not sufficiently clear in the first submitted version. I list below some general remarks and detailed comments.

 

 

General remarks

 

-      The English language use is OK and quite idiomatic

-      The contents deal with an ambitious claim, namely to provide a fully automated optical music recognition system

-      The paper provides a lot of interesting operational definitions and introduces a strong methodology.

-      There is a lot of technical stuff that is not sufficiently explained. More intuitive descriptions must be given to guide the reader through the text. There is also a danger of pedantry at some places in the text. Not all readers are familiar with the technical abbreviations, which makes reading very hard at times.

-      Some of the needed explanations for the many abbreviations are provided, but only later in the text and without announcing this. Some restructuring of the paper could help here in providing the needed explanations and descriptions of the technical terms at first appearance. This could avoid a lot of frustration for the reader.

-      Examples of technical terms that need more intuitive clarification are: end-to-end systems; FCN, GT, Mask-RCNN, etc. Most of these are explained at the end of the paper, but this is too late. Explanation must be also somewhat more than just writing the abbreviation in full.

-      Some important concepts need additional explanation: e.g., binarization, deskewing, dewarping.

-      Explain the distinction between a computational and classification network more in detail.

-      The use of neural networks is not sufficiently explained.

-      The conclusion is too short. What is the basic take home message of the paper?

-      The text as a whole needs more homogeneity in the style of writing. Some passages are quite readable and understandable, while others are very demanding.

 

Detailed comments

 

 

-      Page 1, title: why a reduction to historic chants? Some motivation for this selection could be welcome.

-      Page 2, line 52: explain somewhat more in detail what exactly is meant with “square notation”

-      Page 3, line 92: “less” instead of “fewer”?

-      Page 3, line 97: please explain very shortly what is meant with “end-to-end” solutions

-      Page 3, line 107: provide a short description of what is meant here with multi-stage systems or refer to a later place in the paper where this is explained

-      Page 3, lower half of the page: very technical stuff; introducing abbreviations without explaining them hampers the rhythm of reading, and readers may even stop reading. All abbreviations should be given in full at first appearance. These abbreviations are given at the end of the paper, but readers are not informed about this at the beginning of reading.

-      Page 3, line 122: what is meant by the k-nearest neighbor approach?

-      Page 3, line 130: the concept of convolutional neural network should be explained at least a little in intuitive terms.

-      Page 3, line 138: what is meant with intersection-over-union? Explain a little.

-      Page 3, line 143: what is meant with mAP?

-      Page 4, line 147: what is meant with a sequence to sequence task?

-      Page 4, Lines 157 ff: please give the abbreviations in full and explain shortly what they stand for.

-      Page 4, lines 176 ff: here the abbreviations are given in full. This should be done earlier, at first appearance of the terms.

-      Page 5, first row of table: syllable instead of syllable? Explain the abbreviations “Eval” and “Metr” in the table caption so that readers can interpret the table without need of going through the main text.

-      Page 6: please provide the needed information to inform readers of the meaning of Nevers and Latin. In the remainder of the text, this becomes clear, but is better to provide the information at first appearance of the terms.

-      Page 7: provide some more information about the abbreviations at the very right of the figure (MR, Ch, L, K, etc.) Please be clear.

-      Page 8, line 229: explain very shortly the meaning of diastemic and adiastemic.

-      Page 8, line 230: what is the meaning of Ground Truth? As this is a very important concept, its meaning should be given in intuitive terms.

-      Page 9, lines 247 ff: this content about the basics could perhaps be moved more to the beginning of the paper as it explains some technicalities of the previous text. Please consider some restructuring of the order of the parts of the paper.

-      Page 9, line 267: the term “drop capital” seems to be common knowledge but not all readers know what it exactly means. Please explain very shortly.

-      Page 10, figure caption: the caption is too long. Much of the information in this caption should have its place in the main text of the paper. Captions must always be rather short. Additional explanation has its place in the main text.

-      Page 11: check if all text fragments are sufficiently explained in the main text. Avoid that readers have a feeling of not being sufficiently informed.

-      Page 12, line 288: explain shortly the rationale behind the concept of binarization.

-      Page 12, line 292: here also some abbreviations are explained. This should have been done at first appearance.

-      Page 12, lines 363 ff: explain the what and why of convolutional neural networks.

-      Page 12, line 309: what is the meaning of connected component? Take care of danger of pedantry by introducing terms that are not explained.

-      Page 12, line 319: what is the meaning of “bounding box”?

-      Page 13, line 342: “looped” instead of “lopped”?

-      Page 13, line 345: what is the meaning of “argmax”?

-      Page 14, line 399: what is the meaning of “greedy strategy” and “bipartite matching”?

-      Page 15, line 424: what is a U-net?

-      Page 15, line 437: what are bridge block and upsample blocks?

-      Page 17, line 500: it is not clear how the results have been computed. Please provide some more information.

-      Page 24, lines 714 ff: refer to this information earlier in the text.

Author Response

Dear Reviewer/Editor,

We appreciate the time and effort that you have dedicated to providing your valuable feedback on our manuscript. We are grateful for your insightful comments on our paper. We have highlighted the changes within the manuscript in red (the file with highlights is uploaded as a supplementary file). Here is a point-by-point response to the reviewers’ comments and concerns.

Comment 2: Some of the needed explanations for the many abbreviations are provided, but only later in the text and without announcing this. Some restructuring of the paper could help here in providing the needed explanations and descriptions of the technical terms at first appearance. This could avoid a lot of frustration for the reader.

Response: Thank you for pointing this out. We agree with this comment. We moved the “Basics” section after the introduction

Comment 7: The conclusion is too short. What is the basic take home message of the paper?
Response: Thank you for pointing this out. We agree with this comment. We have extended the conclusion by a few sentences in the context of the Corpus Monodicum project.
Location in manuscript: 731 – 743

Comment 5: Explain the distinction between a computational and classification network more in detail.

Response: We are not aware of using the term “computational network”.

Comments 1, 3, 4, 6 and 8 (answered together):
Response: Thank you for pointing this out. We agree with these comments. We have explained the technical terms mentioned in these comments in more detail.

Comment 1: There is a lot of technical stuff that is not sufficiently explained. More intuitive descriptions must be given to guide the reader through the text. There is also a danger of pedantry at some places in the text. Not all readers are familiar with the technical abbreviations, which makes reading very hard at times.

Comment 3: Examples of technical terms that need more intuitive clarification are: end-to-end systems; FCN, GT, Mask-RCNN, etc. Most of these are explained at the end of the paper, but this is too late. Explanation must be also somewhat more than just writing the abbreviation in full.
Comment 4: Some important concepts need additional explanation: e.g., binarization, deskewing, dewarping.

Comment 6: The use of neural networks is not sufficiently explained.

Comment 8: The text as a whole needs more homogeneity in the style of writing. Some passages are quite readable and understandable, while others are very demanding.


Detailed Comments:

Comment 9: Page 1, title: why a reduction to historic chants? Some motivation for this selection could be welcome.
Response: Thank you for pointing this out. The motivation for this selection is already pointed out in the introduction
Location in manuscript: 26 - 37

Comment 10: Page 2, line 52: explain somewhat more in detail what exactly is meant with “square notation”

Response: Thank you for pointing this out. We added a reference to this term
Location in manuscript: 51-52

Comment 11: Page 3, line 92: “less” instead of “fewer”?
Response: Thank you for pointing this out. We agree with this comment. We changed fewer → less.
Location in manuscript: 118

Comment 12: Page 3, line 97: please explain very shortly what is meant with “end-to-end” solutions

Response: Thank you for pointing this out. We agree with this comment. We briefly explained “end-to-end” and contrasted it with multi-stage systems.
Location in manuscript: 125-135

Comment 13: Page 3, line 107: provide a short description of what is meant here with multi-stage systems or refer to a later place in the paper where this is explained
Response: Thank you for pointing this out. We agree with this comment. See previous comment.
Location in manuscript: 125-135

Comment 14: Page 3, lower half of the page: very technical stuff; introducing abbreviations without explaining them hampers the rhythm of reading, and readers may even stop reading. All abbreviations should be given in full at first appearance. These abbreviations are given at the end of the paper, but readers are not informed about this at the beginning of reading.

Response: Thank you for pointing this out. We agree with this comment.
Location in manuscript: Since it runs through the entire document, we do not give the exact line numbers of the changes here, but we tried to give all abbreviations in full at first appearance and explained abbreviations a little more in places where we thought it made sense

Comment 15: Page 3, line 122: what is meant by the k-nearest neighbor approach?
Response: Thank you for pointing this out. We agree with this comment. We explained it a little more.
Location in manuscript: 150ff

Comment 16: Page 3, line 130: the concept of convolutional neural network should be explained at least a little in intuitive terms.

Response: Thank you for pointing this out. We agree with this comment. We explained it a little more.
Location in manuscript: 160-164

Comment 17: Page 3, line 138: what is meant with intersection-over-union? Explain a little.

Response: Thank you for pointing this out. We agree with this comment and added a short paraphrase
Location in manuscript: 218
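
For readers unfamiliar with the term, intersection-over-union (IoU) measures the overlap of two regions as the area of their intersection divided by the area of their union. A minimal sketch for axis-aligned bounding boxes follows; the function name `iou` and the `(x1, y1, x2, y2)` box convention are assumptions made for illustration and are not taken from the manuscript.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Each box is (x1, y1, x2, y2) with x1 < x2 and y1 < y2
    (a hypothetical convention chosen for this sketch).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union = sum of both areas minus the part counted twice.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0
```

An IoU of 1 means the two boxes coincide and an IoU of 0 means they do not overlap at all; detection benchmarks typically count a prediction as correct when its IoU with a reference box exceeds a chosen threshold.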

Comment 18: Page 3, line 143: what is meant with mAP?

Response: Thank you for pointing this out. We added a reference for a better understanding.
Location in manuscript: 178

Comment 19: Page 4, line 147: what is meant with a sequence to sequence task?

Response: Thank you for pointing this out. We agree with this comment. We elaborated this term
Location in manuscript: 191 - 194

Comment 20: Page 4, Lines 157 ff: please give the abbreviations in full and explain shortly what they stand for.

Response: Thank you for pointing this out. We agree with this comment.
Location in manuscript: Since it runs through the entire document, we do not give the exact line numbers of the changes here, but we tried to give all abbreviations in full at first appearance and explained abbreviations a little more in places where we thought it made sense

Comment 21: Page 4, lines 176 ff: here the abbreviations are given in full. This should be done earlier, at first appearance of the terms.

Response: Thank you for pointing this out. We agree with this comment. See comment above

Comment 22: Page 5, first row of table: syllable instead of syllable? Explain the abbreviations “Eval” and “Metr” in the table caption so that readers can interpret the table without need of going through the main text.

Response: Thank you for pointing this out. We agree with this comment.
Location in manuscript: Table 1

Comment 23: Page 6: please provide the needed information to inform readers of the meaning of Nevers and Latin. In the remainder of the text, this becomes clear, but is better to provide the information at first appearance of the terms.

Response: Thank you for pointing this out. The Nevers dataset is already explained (133). We added a footnote for the Latin dataset
Location in manuscript: 243

Comment 24: Page 7: provide some more information about the abbreviations at the very right of the figure (MR, Ch, L, K, etc.) Please be clear.

Response: Thank you for pointing this out. We agree with this comment. We added more information about the abbreviations in the caption of Figure 3.
Location in manuscript: Figure 3

Comment 25: Page 8, line 229: explain very shortly the meaning of diastemic and adiastemic.

Response: Thank you for pointing this out. We agree with this comment. We explained the terms shortly.
Location in manuscript: 270-271

Comment 26: Page 8, line 230: what is the meaning of Ground Truth? As this is a very important concept, its meaning should be given in intuitive terms.

Response: Thank you for pointing this out. We agree with this comment. We explained the meaning of Ground Truth in our context shortly
Location in manuscript: 272-274

Comment 27: Page 9, lines 247 ff: this content about the basics could perhaps be moved more to the beginning of the paper as it explains some technicalities of the previous text. Please consider some restructuring of the order of the parts of the paper.

Response: Thank you for pointing this out. We agree with this comment. We moved this section after the introduction


Comment 28: Page 9, line 267: the term “drop capital” seems to be common knowledge but not all readers know what it exactly means. Please explain very shortly.

Response: Thank you for pointing this out. We agree with this comment. We explained the term shortly in the basics section.
Location in manuscript: 99ff

Comment 29: Page 10, figure caption: the caption is too long. Much of the information in this caption should have its place in the main text of the paper. Captions must always be rather short. Additional explanation has its place in the main text.

Response: Thank you for pointing this out. We agree with this comment. We incorporated the caption text into the main text and shortened the caption.
Location in manuscript: 78 - 101

Comment 30: Page 11: check if all text fragments are sufficiently explained in the main text. Avoid that readers have a feeling of not being sufficiently informed.

Response: Thank you for pointing this out. We agree with this comment. We improved the syllable assignment section
Location in manuscript: 427 -442

Comment 31: Page 12, line 288: explain shortly the rationale behind the concept of binarization.

Response: Thank you for pointing this out. We agree with this comment. We explained the concept of binarization, deskewing and grayscale conversion shortly
Location in manuscript: 305 -309
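
As an illustration of the concept only, binarization reduces a grayscale image to two values so that later steps can separate ink from parchment. The sketch below uses a fixed global threshold; the function name `binarize`, the default threshold of 128 and the nested-list image format are assumptions for this example, and the manuscript's actual binarization method may differ.

```python
def binarize(gray, threshold=128):
    """Binarize a grayscale image given as rows of 0-255 intensities.

    Pixels darker than `threshold` become 1 (ink), all others 0 (background).
    The fixed global threshold is an assumption made for this sketch.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]
```

For example, `binarize([[0, 200], [127, 128]])` marks the two dark pixels in the first column as ink and the two lighter pixels as background.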

Comment 32: Page 12, line 292: here also some abbreviations are explained. This should have been done at first appearance.

Response: Thank you for pointing this out. We agree with this comment.
Location in manuscript: Since it runs through the entire document, we do not give the exact line number of the changes here

Comment 33: Page 12, lines 363 ff: explain the what and why of convolutional neural networks.

Response: Thank you for pointing this out. See explanation in related work
Location in manuscript: 160ff

Comment 34: Page 12, line 309: what is the meaning of connected component? Take care of danger of pedantry by introducing terms that are not explained.

Response: Thank you for pointing this out. We agree with this comment. We explained the term shortly.
Location in manuscript: 332

Comment 35: Page 12, line 319: what is the meaning of “bounding box”?

Response: Thank you for pointing this out. We agree with this comment. We explained the term shortly
Location in manuscript: 343

Comment 36: Page 13, line 342: “looped” instead of “lopped”?

Response: Thank you for pointing this out. We agree with this comment. We fixed the typo.
Location in manuscript: 367

Comment 37: Page 13, line 345: what is the meaning of “argmax”?

Response: Thank you for pointing this out. We agree with this comment. We added a paraphrase.
Location in manuscript: 370ff
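
In general terms, argmax returns the position of the largest value rather than the value itself, for instance the index of the class with the highest score. A minimal sketch, with `argmax` and the example scores as hypothetical names and values:

```python
def argmax(values):
    """Return the index of the largest element (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])
```

With class scores `[0.1, 0.7, 0.2]`, `argmax` returns index 1, the position of the highest score.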

Comment 38: Page 14, line 399: what is the meaning of “greedy strategy” and “bipartite matching”?

Response: Thank you for pointing this out. We agree with this comment. We revised the syllable assignment section and elaborated the meaning
Location in manuscript: 427-442
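
The difference between the two terms can be sketched on a toy example. A greedy strategy repeatedly takes the cheapest remaining pair, whereas a bipartite matching considers all pairings and picks the one with the lowest total cost. The code below is a simplified illustration, not the authors' implementation: it assumes a one-to-one assignment between equally many neumes and syllables, uses horizontal distance as a hypothetical cost, and finds the optimum by brute force, which is only feasible for very small inputs.

```python
from itertools import permutations


def cost_matrix(neume_xs, syllable_xs):
    """Hypothetical cost: horizontal distance between each neume and syllable."""
    return [[abs(n - s) for s in syllable_xs] for n in neume_xs]


def greedy_match(costs):
    """Repeatedly take the cheapest pair whose neume and syllable are both free."""
    pairs = sorted(
        (c, i, j) for i, row in enumerate(costs) for j, c in enumerate(row)
    )
    used_i, used_j, result = set(), set(), []
    for _, i, j in pairs:
        if i not in used_i and j not in used_j:
            used_i.add(i)
            used_j.add(j)
            result.append((i, j))
    return sorted(result)


def optimal_match(costs):
    """Minimum-cost one-to-one matching by brute force (square matrix assumed)."""
    n = len(costs)
    best = min(
        permutations(range(n)),
        key=lambda p: sum(costs[i][p[i]] for i in range(n)),
    )
    return [(i, best[i]) for i in range(n)]


def total_cost(costs, pairs):
    """Sum of the costs of the chosen (neume, syllable) pairs."""
    return sum(costs[i][j] for i, j in pairs)
```

With neumes at x = 10 and 13 and syllables at x = 11 and 0, the greedy strategy first pairs the neume at 10 with the syllable at 11 and is then forced into an expensive second pair (total cost 14), while the optimal matching accepts a slightly worse first pair to reach a lower total (12).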

Comment 39: Page 15, line 424: what is a U-net?

Response: Thank you for pointing this out. We agree with this comment. We added a reference, mentioned it in the related work and added a sentence that clarifies it
Location in manuscript: 165ff, 456

Comment 40: Page 15, line 437: what are bridge block and upsample blocks?

Response: Thank you for pointing this out. We agree with this comment. We elaborated these terms.
Location in manuscript: 470-473

Comment 41: Page 17, line 500: it is not clear how the results have been computed. Please provide some more information.

Response: Thank you for pointing this out. We agree with this comment. We added two sentences referencing our pipeline and the metric we used.
Location in manuscript: 535-537

Comment 42: Page 24, lines 714 ff: refer to this information earlier in the text.

Response: Thank you for pointing this out. We agree with this comment.
Location in manuscript: Since it runs through the entire document, we do not give the exact line number of the changes here
