Article

Optical Medieval Music Recognition—A Complete Pipeline for Historic Chants

1 Department for Artificial Intelligence and Knowledge Systems, University of Wuerzburg, D-97074 Wuerzburg, Germany
2 Department for Music in Pre-Modern Europe, University of Wuerzburg, D-97074 Wuerzburg, Germany
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(16), 7355; https://doi.org/10.3390/app14167355
Submission received: 4 June 2024 / Revised: 30 July 2024 / Accepted: 12 August 2024 / Published: 20 August 2024

Abstract:
Manual transcription of music is a tedious task that can be greatly facilitated by optical music recognition (OMR) software. However, OMR software is error-prone, in particular for older handwritten documents. This paper introduces and evaluates a pipeline that automates the entire OMR workflow in the context of the Corpus Monodicum project, enabling the transcription of historical chants. In addition to typical OMR tasks such as staff line detection, layout detection, and symbol recognition, the rarely addressed tasks of text and syllable recognition and of assigning syllables to symbols are tackled. For quantitative and qualitative evaluation, we use documents written in square notation, developed in the 11th–12th century, but the methods apply to many other notations as well. The quantitative evaluation measures the number of interventions necessary for correction, which amounts to about 0.4% for layout recognition including the division of the text into chants, 2.4% for symbol recognition including pitch and reading order, and 2.3% for syllable alignment with correct text and symbols. The qualitative evaluation showed an efficiency gain over manual transcription with an elaborate tool by a factor of about 9. In a second use case with printed chants in similar notation from the “Graduale Synopticum”, the evaluation results for symbols are much better, except for syllable alignment, indicating the difficulty of this task.

1. Introduction

Optical music recognition (OMR) is a technology that enables the automatic conversion of sheet music into a digital format. It is particularly useful for preserving and digitizing historical musical documents, which are often fragile and subject to deterioration over time. Since the majority of musical pieces in Western culture have been composed in written form, their conservation and digitization through manual methods is a laborious, time-intensive, and often error-prone process. OMR is one of the key technologies for accelerating and simplifying this task.
To overcome these challenges, OMR systems typically combine image processing, machine learning, and music theory to accurately capture and decode the musical notation. When working with well-preserved documents, modern OMR systems can achieve high levels of accuracy. On historical documents, by contrast, these systems still struggle to achieve comparable results, for several reasons. Firstly, the quality of the source material can greatly affect the accuracy of the recognition process: historical documents may be faded, smudged, or torn, making it difficult to accurately capture and decipher the musical notation. Secondly, many historical musical documents are quite heterogeneous in layout and notation style, which requires the development of different solutions. Because of these challenges, only a limited amount of training data is available for historical documents, which in turn constrains the range of potential solutions.
Due to the complexity of optical music recognition (OMR), it is typically divided into multiple sub-steps in order to effectively address each component. The general approach to OMR involves the following stages: (1) Preprocessing and deskewing, which prepares the input image for further analysis, (2) staff line detection, which identifies the horizontal lines upon which musical symbols are placed, (3) layout detection, which identifies the spatial organization of music regions and other notations on the page, and (4) symbol detection, which locates and classifies individual musical symbols. For the transcription of historical chants, there are additional steps that are rarely addressed in the literature. These include (5) recognizing the text, (6) dividing it into syllables, and (7) assigning each syllable to its corresponding note or notes. Finally (8), using the previously extracted data, the document is reconstructed and the target encoding is generated.
The main focus of the paper is the integration of a fully automated OMR pipeline covering all stages that can be utilized for mass transcription purposes on historical documents in the context of the Corpus Monodicum project (https://corpus-monodicum.de, accessed on 10 August 2024). The pipeline focuses on historical documents, e.g., manuscripts written in square notation, which was developed in the 11th–12th century [1]. This notation includes several staves on a page, with each staff typically containing four lines and a clef marker at the start. An example image of such documents can be seen in Section 4.
The primary contributions of this paper are as follows:
  • Development of a comprehensive automatic pipeline tailored for digitizing handwritten historical monophonic music depicted in squared notation. The pipeline processes input from a book and produces digitized chants (in formats such as MusicXML [2], MEI [3], Monodi [4]) corresponding to the content of the book. Additionally, position-related data are extracted, including the positions of the staff lines, layout, symbols, and syllables/lyrics on the page to support manual post-correction and training.
  • Extended layout analysis capabilities, facilitating the detection and extraction of individual chants within a book, including chants that extend across multiple pages.
  • Development of an automatic correction and segmentation mechanism for text recognition, which aligns the automatically extracted lyrics of a chant with a pre-existing database of chants to improve the OCR result and segments the identified lyrics into text lines according to the layout of the page. The corrected text is finally aligned with the original OCR output to obtain positional information (the position of the syllables/text on the original document).
  • Development of an algorithm that automatically divides text into syllables using grammar rules and a syllable dictionary, together with a syllable assignment algorithm that subsequently aligns these syllables with the corresponding musical notes.
  • Evaluation of the output from the complete automated transcription process, including additional evaluations of individual steps involved in the pipeline’s execution.

2. Basics

This section introduces the building blocks of a typical page, especially the elements that are of interest and important for the complete OMR pipeline. These building blocks can be seen in Figure 1. A typical page usually consists of 7–9 music–lyric pairs, each comprising a music part and a text part. The music part contains staves, usually consisting of 4–5 connected staff lines, as well as auxiliary (invisible) lines between two staff lines, above the staff, and below it. In addition, various symbols appear in the music part:
  • Clef: A clef is mostly at the beginning of a musical region. However, there can also be several clefs in other places in the music regions. Possible subclasses are F-clef and C-clef.
  • Note components: These are the symbols that occur most often in a line. The position of note components within the staves, in combination with accidentals and preceding clefs, determines the note pitch. Note components may be connected to other note components with the following subclasses: start (start of a note sequence), graphically connected (part of a note sequence with a graphic connection to the previous note), gapped (part of a note sequence without a graphic connection to the previous note).
  • Accidental: By far the rarest symbols. The subclasses here are sharp, flat, and natural accidentals.
The text part contains the handwritten text to be transcribed, divided into syllables and aligned to the notes such that every syllable is assigned to one or several notes. Some, but not all, of the words are divided into syllables. In addition, several chants can be located on one page, or a chant can extend over several pages. The beginning of a chant is usually marked by a large drop capital (a letter at the beginning of a logical unit, in our case a chant, that is larger than the rest of the text), and the chant continues until the next drop capital, where the subsequent chant begins.

3. Related Work

In the literature, optical music recognition (OMR) is employed quite broadly, with definitions varying depending on the dataset or task at hand. Calvo-Zaragoza et al. [5] (2020) defined OMR as “a field of research that investigates how to computationally read music notation in documents”, primarily focusing on the musical aspect. However, this definition is insufficient for our purposes and must be expanded, as our dataset comprises church chants where not only the musical notation but also the lyrics, the alignment of syllables with symbols, and the extraction of different chants within one or multiple pages are relevant and require transcription. Therefore, we divide our pipeline into two main steps: first, the extraction of the musical notation as defined in [5], and second, the extraction of the text, the assignment of syllables to notes, and the extraction of individual chants.
Unfortunately, comparing such a pipeline is challenging due to the absence of comparable complete transcription processes in the literature that have been described and evaluated. Additionally, there is a scarcity of related work on OMR in general. Calvo-Zaragoza (2020) [5] noted approximately 500 published works in this field. However, when restricting to specific OMR steps, notations, modern versus historical data, or monophonic versus polyphonic music, the number of works available for comparison is significantly smaller.
Historical data in particular increases the complexity of comparing OMR algorithms, since the data at hand can be quite heterogeneous in layout and difficulty. The majority of existing OMR systems are designed for modern music notation, which is typically well preserved and homogeneous in layout and notation style. Existing end-to-end solutions for optical music recognition primarily emphasize symbol recognition and are tailored to modern documents featuring fixed clefs; they often overlook historical documents with varying staff lines or clefs that can significantly affect the melody. In our context, end-to-end refers to a system that produces the required outcome directly, with no intermediary results. Conversely, multi-stage systems involve multiple processing phases (steps) in a pipeline, whereby the intermediate outcomes of one step are used in other steps.
End-to-end systems for historical documents are rare and use assumptions like requiring image excerpts of individual staff systems [6], necessitating an additional layout processing step. Additionally, end-to-end segmentation-free pipelines have been developed that operate effectively with historical data, without necessitating any preprocessing steps (e.g., localization of individual staffline segments). However, at present, these single-stage systems are outperformed by multi-stage counterparts [7] and typically only extract the melody. Consequently, further steps would be necessary to extract lyrics and syllable note assignments. This can prove challenging, as current end-to-end pipelines are segmentation-free and therefore do not extract information such as position on the page, which may be necessary for lyric-note assignments.
The Single Interface for Music Score Searching and Analysis (SIMSSA) project [8,9] offers a workflow aimed at transcribing historical documents with chant music, employing a multi-stage approach similar to ours. However, the algorithms and approaches used differ significantly from ours. SIMSSA utilizes a segmentation network for the layout approach, which segments staff lines, layout, symbols, and text in a single step. This has the advantage of requiring only one network for the segmentation task. However, the disadvantage is that, compared to separate steps, more training data are needed and pixel-perfect annotation is required, which is very time-consuming. Additionally, these models often need to be retrained for new books, as they do not generalize well. In general, specialized algorithms typically yield better results and generalize better (e.g., staff line recognition generalizes well and usually does not need retraining when applied to other books). For symbol classification, SIMSSA uses a k-nearest neighbor approach. This method categorizes small patches of symbols into predefined classes based on their similarity to existing samples in the dataset. For text recognition, Calamari [10] is employed and aligned with texts from the Cantus database [11]. SIMSSA is more of a workflow than an automatic pipeline, as it connects various web tools to transcribe such documents. It is unclear to what extent the detection of chants in books or the syllable note assignment is supported or whether they need to be manually annotated.
In the following, the individual steps of the OMR pipeline are examined in more detail. Table 1 gives an overview of existing publications that cover parts of the OMR pipeline. Calvo-Zaragoza [12] segments staff line pixels by passing a square patch centered around each pixel to a Convolutional Neural Network (CNN). The CNN is trained on both patches containing centered staff line pixels and patches devoid of them; as a result, it learns to differentiate between staff line pixels and non-staff line pixels by learning distinctive features of patches with and without centered staff line pixels. With this method, the authors achieve an F1 score between 97% and 99%. Similarly, Wick et al. [13] used a fully convolutional neural network (FCN) with an encoder–decoder architecture like a U-Net [14] to generate a binarized image in which only staff line pixels remain. Afterward, a post-processing algorithm is used to extract the staff lines, detecting (F1_D) 99.7% of all staff lines; the length of the detected lines fitted the GT by 98.2% (F1_Lf). Quiros et al. [15] analyzed the performance of an object detection neural network, a Mask-RCNN [16], on complex handwritten documents for the layout detection task. Several experiments were conducted on different datasets, including datasets with musical manuscripts. Instances of three different region types (music, lyrics, and drop capital) had to be recognized, resulting in a mean Intersection-over-Union (IoU) score of 89.5%. Similarly, in [17] a segmentation approach was used to classify each pixel in an image as a layout type and afterward extract the regions using a post-processing algorithm. Castellanos [18] also worked from an object detection perspective and conducted several experiments to test common layout metrics and architectures. It was found that commonly used object detection metrics such as mean average precision (mAP) [19] are not necessarily suitable, as they do not correlate with the final result of an OMR system.
On the symbol level, there are comparatively more publications. Overall, there are three different approaches to performing symbol recognition: as an object detection (recognition) task, as a segmentation task, or as a sequence-to-sequence task (similar to OCR). Calvo-Zaragoza et al. [20] approached the problem as a segmentation task and employed Convolutional Neural Networks for the automation of historical music document processing. The dataset was limited in size and comprised square notation. A classification network was implemented to assign one of four classes to each pixel in the image; small patches extracted from the pages, centered at the point of interest, were used as input to the network. Similarly, the work of Wick et al. [13] uses an FCN instead of a classification network, which provides a complete segmentation of the image as output. Alternatively, the task can be seen as a sequence-to-sequence problem. In contrast to the other methodologies, sequence-to-sequence techniques directly predict the transcription (melody or encoded symbol sequence) of an individual staff image or an entire page, without calculating the pitch based on the position of the symbols on the staff. The advantage is that ground truth (GT) is easier to produce, since no position-related data (e.g., the position of the symbol on the page) must be encoded in the GT. Examples of sequence-to-sequence approaches are [6,21,22,23,24]. Ríos-Vila et al. [6], for example, trained and evaluated on the GrandStaff dataset, which consists of 53,882 single-system piano scores in common Western modern notation. In addition, different symbol encodings were used, and different decoder architectures, among them recurrent neural networks (RNN) and Transformers (CNNT), were tested. Alternatively, Pacha et al. [25,26] approached the issue as an object detection challenge, utilizing established deep convolutional neural networks such as Faster R-CNN [27] or YOLO [28] to generate bounding boxes and respective object classes for detected objects. However, it was noted that the networks encountered difficulties in accurately detecting smaller objects such as dots or remnants. At the text level, there is little published work that addresses the problem in the context of music documents. Mostly, Handwritten Text Recognition (HTR)/optical character recognition (OCR) systems using an LSTM/CNN network are used to obtain the text transcription. The authors of [29] use Calamari [10], while the authors of [30] use OCRopus [31] for the transcription of the text. In both publications, the results were sobering. The authors of [29] achieved a Character Error Rate (CER) of between 7% and 30% if trained on pages of the same book that was used for evaluation; if the training took place on pages of other books with similar pages, the CER is between 25% and 49%. Similarly, in [30] the automatic transcription was only used as a proof of concept for their pipeline. At the syllable level, there is a limited amount of existing research. De Reuse et al. [11] segment the text into syllables and utilize the positional information from the HTR network to compute bounding boxes. A syllable is considered correctly classified with an Intersection-over-Union value of 50%, i.e., if its bounding box overlaps with the ground truth (GT) box by at least 50%. Martinez-Sevilla et al. [32] utilize the GABC encoding with a model trained using the Connectionist Temporal Classification (CTC) loss.
On very clean, synthetically generated data, syllable recognition achieves a CER between 16.4% and 62%. When the images are augmented to appear more degraded, the CER increases to between 34% and 62.3%. Table 1 summarizes the results.
Table 1. Comparison of different optical music recognition publications. For each publication, the OMR pipeline steps covered by the paper are indicated together with the evaluation result (Eval) and the metric used (Metr.), as well as the notation type of the documents and the fundamental method used to obtain the results. An “x” marks steps that are not addressed, “(✓)” marks steps that are addressed but not evaluated, and “n.a.”/“–” indicates that no information is available.
Paper | Lines | Layout | Symbol | Text | Syllable | Notation Type | Method
[6] | x | x | 6–24 SER | x | x | modern | Seq2Seq
[20] | x | x | 94–77 F1_m | x | x | square | CNN
[21] | x | x | 17–27 SER | x | x | modern | Seq2Seq
[24] | x | x | 80 ACC_N | x | x | modern | Seq2Seq
[22] | x | (✓) n.a. | 4–12 SER | x | x | mensural | Seq2Seq
[23] | x | – | 7 SER | x | x | mensural | Seq2Seq
[9] | (✓) n.a. | (✓) n.a. | (✓) n.a. | x | x | square | –
[13] | 99.7 F1_D | (✓) n.a. | 89.8 hSAR | x | x | square | FCN
[29] | x | x | x | 6.7–50 CER | 90.7–99.7 F1 | square | LSTM
[33] | x | n.a. | x | x | x | square | LSTM
[34] | x | x | 94–97 hSAR | x | x | square | FCN
[12] | 97–99 F1 | x | x | x | x | modern | CNN
[18] | x | 71.6 mF1 | x | x | x | mensural | object det.
[17] | x | 92.5 P_acc | x | x | x | mensural | FCN
[15] | x | 89.5 mIoU | x | x | x | historical | Mask-RCNN
[26] | x | x | 66 mAP | x | x | mensural | RCNN
[25] | x | x | 61–88 mAP | x | x | modern | object det.
[32] | x | x | x | – | 16.4–71.3 CER | (square) | LSTM
[11] | x | x | x | – | 78.6–92.9 IoU | square | LSTM
[8] | – | – | – | – | – | square | LSTM
Audiveris | n.a. | n.a. | n.a. | n.a. | n.a. | modern | –
Ours | 99.94 F1 | 99.99 F1 | 96.7 hSAR | 4.2 CER | 97.3 F1 | square | FCN, LSTM

4. Dataset

An overview of the datasets used in this work, containing monophonic music in square notation including lyrics, is given in Table 2, and an example page for each dataset used for training is shown in Figure 2. The test datasets are shown in Figure 3. There is unfortunately very little GT data in this subject area, which is why several smaller datasets have been collected instead of one large one. For training and evaluation, a combination of different manuscripts is used, since the datasets have different annotation states. Table 2 specifies the annotation state of each dataset. The characteristics of the three datasets are as follows:
The Nevers dataset (Nevers: Paris, Bibliothèque nationale de France, Nouv. acq. lat. 1235; available under https://github.com/OMMR4all/datasets, accessed on 10 August 2024) is composed of images from the manuscript Graduel de Nevers, which was written in the 12th century. This dataset was already presented in [13,34]. The dataset consists of 49 pages and contains no marginalia, no images, and only a single column of lines. A total of six different music symbol classes are stored in the dataset: C-clef, F-clef, flat accidentals, and three different types of note components (NCs), namely neume start, gapped, and graphically connected (looped) NCs. The dataset has been split into three sections of differing levels of complexity and visual presentation, based on human judgement. The first section has scans with the best music notation quality and is the most visually distinct, while the second section is more challenging, with denser notation and less contrast between staff lines. The third section is similar to the first but has wavy staff lines and more complex musical components.
The Latin 14819 (Pa 14819: Paris, Bibliothèque nationale de France, Latin 14819. https://gallica.bnf.fr/ark:/12148/btv1b84229841, accessed on 10 August 2024) dataset [34] contains 182 pages from the manuscript Latin 14819 and has a layout and music notation similar to the Nevers Part 1 dataset. The “Latin 14819” dataset is the largest one used for training, and its music symbols and lines are clear and distinct, with good contrast and spacing to avoid confusion. In addition to music and text regions, drop capitals are also annotated in this dataset. Furthermore, it is one of the datasets where lyrics and syllables are annotated.
The Mulhouse (Mul 2) dataset (Mul 2: Mulhouse, Bibliothèque municipale, 0002, https://bvmm.irht.cnrs.fr/consult/consult.php?reproductionId=949, accessed on 10 August 2024) is the one on which the mass transcription is evaluated in this paper. The dataset is accessible at the French Digital Library. This is one of the more complex datasets, as the symbols have been drawn more densely. At the text level, there are strong similarities between various characters, such as m, n, r, e, and c. Additionally, all datasets show a visual separation of syllables within individual words, which significantly increases the difficulty of optical character recognition (OCR). For our quantitative evaluation, we used the full dataset, and for the qualitative experiments, we used two subsets of 20 pages each (mul2sub1 and mul2sub2; see Section 7.1). In addition, the dataset contains abbreviations that significantly hinder syllable localization (e.g., “dme” is used to represent the word “domine”; while this abbreviation may be convenient for saving space, it complicates syllable localization, since the syllables “do-mi-ne” must be assigned to the corresponding notes despite their absence in the abbreviated form).
In addition to the historical handwritten documents, we extended the experiments to include a subset of the Graduale Synopticum (GD) dataset (http://gregorianik.uni-regensburg.de/gr/info, accessed on 10 August 2024). The subset comprises 309 pages (chant groups), equivalent to 1892 chants.
Each chant group consolidates similar chants with similar text, together with their respective metadata from various sources, into a synoptic view emphasizing variations in the note symbols (see Figure 3). Typically, such a chant group comprises six to eight chants presented in a well-printed layout (refer to the green rectangle in the lower section of Figure 3). The pages contain both diastematic (pitch-indicating; lower part of the image) and adiastematic (not indicating pitch; upper portion) sources. However, ground truth (GT) is available only for the diastematic sources (only the lower, diastematic part was manually corrected to generate GT data, including corrections of the staff lines, lyrics, symbols, layout, and syllables), so only these were evaluated.
Since the chant texts within a chant group are very similar, only one “representative” text is printed together with the melody, while the texts of the other chants are printed separately below. The dataset required minor adjustments to the pipeline, e.g., projecting the representative text line onto all other lines to enable syllable assignment. Also, the automatic transcription of the lyrics is not needed, as the text can be extracted from the Graduale Synopticum website. Apart from that, the fundamental algorithms of the paper remained unchanged.
We divided the dataset into three subsets, as depicted in Table 3. The first subset, “without text variations”, comprises all chant groups where the representative text line is identical to all other text lines of the chant group. The second subset, “minor text variations”, additionally includes all chant groups with small differences (less than or equal to 5 character edits), while the third subset, “major text variations”, additionally encompasses all chant groups where significant discrepancies exist between the representative text line and the text lines of the other chants within one chant group.

5. Workflow

Figure 4 outlines our workflow of optical music recognition (OMR). The scanned documents serve as inputs to the workflow. The various stages of the pipeline use specific parts of the document as input. For instance, only certain sections of the original document are utilized for symbol recognition. While the algorithms operate at the page level, the assumption is made that an entire work is transcribed, with pages entered sequentially into the pipeline. This is essential because the pipeline also processes information spanning multiple pages, enabling the automatic extraction of individual chants that may extend across several pages. When dealing with single pages, the pipeline still functions but can only extract chants that entirely reside on the page. In this section, a brief overview of each pipeline step is provided.

5.1. Preprocessing

In the preprocessing stage, the input scans are prepared for subsequent processing steps. As different steps in the pipeline require varied input images, this stage generates several processed sub-images. This involves applying combinations of the following operations:
  • Deskewing: removal of rotation errors in the image
  • Grayscale conversion: conversion of a multichannel image (e.g., a color image) to a grayscale image
  • Binarization: conversion of a grayscale image to an image in which every pixel is 0 or 1 (white or black)
Various binarization algorithms are employed, including those from OCRopus [36] and ISauvola [37]. The choice between these algorithms depends on the specific characteristics of the book being transcribed. Additionally, the deskewing algorithm from OCRopus is optionally utilized to correct any rotational errors present in the pages.
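As an illustration, the following sketch approximates this preprocessing step with scikit-image: a Sauvola threshold stands in for the OCRopus/ISauvola binarization and a simple projection-profile search stands in for the OCRopus deskewing, so the function names and parameters here are assumptions rather than the pipeline's actual implementation.

```python
# Minimal preprocessing sketch (assumed helpers, not the pipeline's code):
# grayscale conversion, brute-force deskewing, and Sauvola-style binarization.
import numpy as np
from skimage import io, color, transform
from skimage.filters import threshold_sauvola

def preprocess(path, window_size=25, angles=np.arange(-3, 3.1, 0.25)):
    """Return the deskewed grayscale image and its binarization (True = ink)."""
    img = io.imread(path)
    gray = color.rgb2gray(img) if img.ndim == 3 else img / 255.0

    # Brute-force deskew: pick the rotation whose horizontal projection profile
    # has the highest variance (staff and text lines become sharpest).
    def score(angle):
        return np.var(transform.rotate(gray, angle, mode="edge").mean(axis=1))
    deskewed = transform.rotate(gray, max(angles, key=score), mode="edge")

    # Local Sauvola threshold copes better with uneven parchment background
    # than a global threshold; ink pixels fall below the local threshold.
    binary = deskewed < threshold_sauvola(deskewed, window_size=window_size)
    return deskewed, binary
```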
Furthermore, for each scan, the average distance between adjacent staff lines (d_SL) is computed. This parameter plays a crucial role in subsequent pipeline steps. Its calculation is based on the methodology outlined in [38,39], utilizing run-length encoding (RLE). Specifically, the algorithm is applied to each column of the binarized and deskewed version of the scan, returning the height of a single staff line h_SL (the most common length of black pixel runs) and the height of the space between two adjacent staff lines s_SL (the most common length of white pixel runs). The value of d_SL is then obtained by summing these two values:
d_SL = h_SL + s_SL
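A minimal sketch of this run-length estimate, assuming the binarized page is given as a 2D boolean array with True for ink pixels:

```python
# Run-length-based estimate of h_SL, s_SL, and their sum d_SL.
import numpy as np
from collections import Counter

def staff_space_height(binary):
    black_runs, white_runs = Counter(), Counter()
    for col in binary.T:                                # process each image column
        changes = np.flatnonzero(np.diff(col.astype(np.int8))) + 1
        bounds = np.concatenate(([0], changes, [len(col)]))
        for start, end in zip(bounds[:-1], bounds[1:]):
            run = end - start                           # length of one uniform run
            (black_runs if col[start] else white_runs)[run] += 1
    h_sl = black_runs.most_common(1)[0][0]              # most frequent black run
    s_sl = white_runs.most_common(1)[0][0]              # most frequent white run
    return h_sl + s_sl                                  # d_SL = h_SL + s_SL
```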

5.2. Staff Line Detection

The staff line algorithm takes the deskewed grayscale image of the scan as input. Additionally, the algorithm receives parameters such as d_SL and the number of staff lines per stave (typically four or five, depending on the dataset). A segmentation network inspired by an FCN [40] and a U-Net [14] is employed to detect staff line pixels within the image. To enhance performance, the input image is normalized using the d_SL parameter, ensuring uniform spacing between staves across all scans (the images are scaled so that d_SL equals 10 pixels on each page after scaling).
Following the FCN’s output, a post-processing step is executed on the probability map. This involves a connected component analysis algorithm, which extracts connected components (CCs) from the FCN’s output. A connected component is the largest possible subset of foreground pixels in which each pixel is adjacent to at least one other pixel of the subset. Horizontal clustering of the staff lines within each CC is performed using the method outlined in [13], ensuring that each cluster represents a distinct staff line. Finally, the extracted staff lines are grouped together and considered part of the same stave if the distance between them is less than 1.5 times d_SL.
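The final grouping step can be sketched as follows; the staff lines are assumed to be given by their mean vertical positions, and the extraction and clustering steps from [13] are assumed to have run already:

```python
# Merge staff lines into staves whenever neighbouring lines are closer than
# 1.5 * d_SL; staves with an unexpected line count can be flagged for review.
def group_lines_into_staves(line_y_positions, d_sl, n_lines_per_stave=4):
    staves, current = [], []
    for y in sorted(line_y_positions):
        if current and y - current[-1] > 1.5 * d_sl:
            staves.append(current)                # gap too large: start a new stave
            current = []
        current.append(y)
    if current:
        staves.append(current)
    suspicious = [s for s in staves if len(s) != n_lines_per_stave]
    return staves, suspicious
```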

5.3. Layout Analysis

The objective of the layout analysis is to partition the scanned document into music regions, text regions, drop capitals and chants. This analysis takes as input the deskewed color image of the scan and the encoded staves.
A straightforward algorithm is employed, leveraging the encoded staves to determine the music and text regions. Each stave’s bounding box (the smallest possible axis-parallel rectangle enclosing the object) is calculated and then padded with d_SL/2, marking it as a music region. Lyric regions are subsequently identified by marking the areas between two music regions. Notably, the lowest lyric region of each scan page requires separate calculation, as it does not lie between two music regions. To address this, the average height of a lyric region is computed and used to determine the last lyric region.
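A minimal sketch of this rule-based region computation, assuming staves are given as lists of (x, y) staff line points and regions are returned as (x0, y0, x1, y1) boxes:

```python
# Derive music regions from padded stave bounding boxes and lyric regions from
# the vertical gaps between them; the last lyric region uses the average height.
def layout_from_staves(staves, d_sl, page_width):
    music_regions = []
    for stave in staves:
        xs = [x for x, _ in stave]
        ys = [y for _, y in stave]
        pad = d_sl / 2
        music_regions.append((min(xs) - pad, min(ys) - pad,
                              max(xs) + pad, max(ys) + pad))
    music_regions.sort(key=lambda r: r[1])            # top to bottom

    lyric_regions = []
    for upper, lower in zip(music_regions, music_regions[1:]):
        lyric_regions.append((0, upper[3], page_width, lower[1]))
    if music_regions:
        heights = [r[3] - r[1] for r in lyric_regions] or [2 * d_sl]
        avg_h = sum(heights) / len(heights)           # fallback for the last region
        last = music_regions[-1]
        lyric_regions.append((0, last[3], page_width, last[3] + avg_h))
    return music_regions, lyric_regions
```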
Given the necessity to identify individual chant sections within a book, initial letters (drop capitals) must also be recognized, as they typically signify the beginning of a new chant. To accomplish this, a Mask-RCNN with a ResNet-50 encoder trained on the “Latin 14819” dataset [34] is employed. The raw color images serve as input for the Mask-RCNN, with a threshold of 0.5 applied to compute drop capital instances. Additionally, a post-processing step is applied to calculate the concave hull of each drop capital instance.
Next, drop capitals are categorized as small or large, with only large drop capitals indicating the start of a new chant. A drop capital is classified as large if it overlaps at least 90 percent of the height of the stave.
Subsequently, large drop capitals are utilized to delineate the lyric regions. If a drop capital is found within a lyric line, the line is split into two regions at the left edge of the drop capital’s bounding box, as the two parts belong to different chants; this simplifies OCR post-processing.
Moreover, the reading order of lyrics and drop capitals is calculated from top to bottom and left to right to generate chants. A chant encompasses all text and music lines between two large drop capitals and may span multiple pages.
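The chant extraction can be sketched as follows; the region records (with position, type, and a large/small flag for drop capitals) are illustrative assumptions:

```python
# Group lyric lines and drop capitals into chants: regions are read top to
# bottom and left to right, and every large drop capital starts a new chant,
# so a chant may span page boundaries.
def extract_chants(pages):
    """pages: list of pages in book order, each a list of dicts with
    'x', 'y', 'type' ('lyric_line' or 'drop_capital') and 'is_large'."""
    chants, current = [], []
    for page in pages:
        for region in sorted(page, key=lambda r: (r["y"], r["x"])):
            if region["type"] == "drop_capital" and region.get("is_large"):
                if current:
                    chants.append(current)        # close the previous chant
                current = [region]                # large capital starts a chant
            else:
                current.append(region)
    if current:
        chants.append(current)
    return chants
```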

5.4. Symbol Detection and Recognition

The objective of symbol recognition is to identify the position and type of symbols present on the page. Seven distinct classes are targeted for recognition: F-clef, C-clef, note start, gapped note, looped note, sharp accidental, and flat accidental (refer to Figure 1).
The detector takes as input the cropped music regions of the grayscale image. Similar to staff recognition, a segmentation network inspired by an FCN [40] and a U-Net [14] is utilized, producing a probability map or a labeled map by selecting the class for each pixel with the highest probability (argmax). Connected components of the labeled map are then calculated and assigned to the symbol that appears most frequently within each connected component. The center of these extracted connected components determines the position of the music symbol, while the symbol type is determined by the label most commonly found within the connected components.
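A sketch of this extraction step, assuming the network output is an (H, W, C) probability map with class 0 as background:

```python
# Extract symbols from the segmentation output: argmax per pixel, connected
# component analysis, centre of each component as position, majority class
# inside the component as symbol type.
import numpy as np
from scipy import ndimage

def extract_symbols(probs):
    labels = np.argmax(probs, axis=-1)                 # per-pixel class decision
    components, n = ndimage.label(labels > 0)          # connected components
    symbols = []
    for cc_id in range(1, n + 1):
        mask = components == cc_id
        ys, xs = np.nonzero(mask)
        classes, counts = np.unique(labels[mask], return_counts=True)
        symbols.append({
            "x": float(xs.mean()),                     # component centre = position
            "y": float(ys.mean()),
            "class": int(classes[np.argmax(counts)]),  # majority class = type
        })
    return sorted(symbols, key=lambda s: s["x"])       # left-to-right order
```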
Subsequently, the position in staff (PiS) of each symbol is calculated based on its distance to the nearest staff line and space (auxiliary line).
A post-processing algorithm, informed by background knowledge, is applied to rectify common errors such as missing clefs at the beginning, graphically connected/gapped errors, and unlikely symbols (e.g., symbols positioned closely together or within drop capitals or other layout structures), as detailed in [34]. Additionally, a small post-processing algorithm is applied that assigns the class ‘neume start’ to gapped notes if the distance to the previous note is sufficiently large (greater than d_SL).
Finally, the pitch is determined based on the positions of the symbols on the staff, the preceding clef, and any other alterations.
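For illustration, a simplified sketch of the pitch computation from the position in staff (PiS) relative to the preceding clef; the clef names, reference pitches, and step encoding are assumptions of the sketch, not the pipeline's exact convention:

```python
# Diatonic pitch from staff position: each PiS step (half line space) is one
# diatonic step away from the reference pitch fixed by the preceding clef.
DIATONIC = ["C", "D", "E", "F", "G", "A", "B"]

def pitch_of(note_pis, clef_type, clef_pis):
    base_note, base_octave = ("C", 4) if clef_type == "clef_c" else ("F", 3)
    steps = note_pis - clef_pis               # signed number of diatonic steps
    index = DIATONIC.index(base_note) + steps
    octave = base_octave + index // 7
    return f"{DIATONIC[index % 7]}{octave}"

# Example: a note two positions above a C-clef -> E4
assert pitch_of(2, "clef_c", 0) == "E4"
```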

5.5. Lyric Transcription

Upon obtaining the lyric regions, the optical character recognition (OCR) process can be initiated on the cropped lyric regions of the color image. This includes character recognition utilizing a Handwritten Text Recognition (HTR) engine, followed by a post-processing step. For the HTR step, a CNN/LSTM network architecture is employed. Subsequently, a greedy decoder is applied to the network’s output to extract textual encoding. Previous research, as indicated in [29], has shown poor performance with a fine-tuned model, achieving Character Error Rates (CERs) ranging from 6.7% to 29.7%. Performance deteriorates further when applying the models to similar but unseen datasets, reaching a CER of 27% to 49% [29].
The underlying reasons for this subpar performance are multifaceted. Significant factors include widely spaced syllables within words, leading to a loss of context within the network. Additionally, many letters in the font exhibit a high degree of confusion among themselves, particularly the letters “n”, “m”, “i”, “u”, and “t” which appear very similar.
To address these challenges, we have implemented additional text post-processing aimed at enhancing the results. This post-processing leverages the fact that chant lyrics are often repeated across different sources or at least in varying versions. Many of these texts have already been transcribed and are accessible in databases such as the Cantus database (https://cantus.uwaterloo.ca/, accessed on 10 August 2024), the Corpus Monodicum database (https://corpus-monodicum.de/, accessed on 10 August 2024), or the Antiphonalen/Gradualen synopticum database (http://gregorianik.uni-regensburg.de/, accessed on 10 August 2024).
The input to the post-processing algorithm consists of the transcribed text lines of the encoded chant. Subsequently, the text undergoes normalization, involving the following steps:
  • Summarization into a single text block.
  • Conversion of all characters to lower case.
  • Removal of white spaces.
However, the positions of the removed white spaces and newlines are retained. The normalized texts are then compared with a database of similarly normalized texts, and if a match is found, the HTR result is replaced with it. Subsequently, the white spaces are reinserted, and the text block is re-segmented into text lines using the original HTR. Finally, words are segmented into syllables, facilitated by a hyphenation dictionary for Latin words if recognized; otherwise, rules derived from grammar are applied. A schematic representation of this process is depicted in Figure 5.
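A condensed sketch of the normalization and database matching step; the similarity measure, threshold, and database format are illustrative assumptions:

```python
# Normalize the HTR output, compare it with normalized reference texts, and
# replace it with the best database match if the match is convincing enough.
import difflib

def normalize(text):
    # single block, lower case, no whitespace (positions are kept separately)
    return "".join(text.split()).lower()

def correct_with_database(htr_lines, database_texts, min_ratio=0.8):
    query = normalize(" ".join(htr_lines))
    best_text, best_ratio = None, 0.0
    for reference in database_texts:
        ratio = difflib.SequenceMatcher(None, query, normalize(reference)).ratio()
        if ratio > best_ratio:
            best_text, best_ratio = reference, ratio
    if best_ratio >= min_ratio:
        return best_text              # replace the HTR result with the chant text
    return " ".join(htr_lines)        # no convincing match: keep the HTR output
```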

5.6. Syllable Assignment

The final step in the pipeline involves syllable assignment. Similar to our previous work [29], in this step, we aim to establish the correct semantic relation of a syllable to a neume by utilizing positional information. However, the approach differs, as we utilize comparison lyrics from the database instead of directly using the recognized HTR output. The input of the algorithm consists of pairs of music–lyric regions, where symbols and syllables (and thereby the lyrics of a line) have already been extracted in previous stages. The goal of syllable assignment is to allocate each syllable to the correct neume/note. To achieve this, a combination of a greedy strategy and bipartite matching is employed:
To determine the position of the syllables in a text region, we apply the HTR engine again to extract the location of the characters within the image. The resulting HTR text is then aligned with the given lyric text of the line (through sequence alignment) to subsequently calculate the position of the syllables in the provided transcription. While the HTR quality may be poor, it is typically sufficient for determining the position of the syllables, since a syllable usually consists of multiple letters and omitted characters in the HTR have less impact. With the positions of the syllables and potential symbols obtained, we calculate the horizontal distance between each note–syllable pair and assign each syllable to the note with the smallest horizontal distance (a greedy approach). Multiple assignments may occur during this process, i.e., several syllables are assigned to the same note. We attempt to resolve such multiple assignments by calculating relevant notes (potential neighboring notes of the currently assigned note that have not yet been assigned a syllable). Subsequently, we typically have a few potential notes (neighboring notes plus the originally selected note) and an equal number of syllables. We then apply bipartite matching (“Hungarian” minimum-weight matching), minimizing the sum of the horizontal distances between syllables and notes.
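The combination of the greedy step and the bipartite matching can be sketched as follows, using SciPy's Hungarian solver; the neighborhood definition and data layout are illustrative assumptions:

```python
# Greedy nearest-note assignment of syllables, followed by minimum-weight
# bipartite matching to resolve notes that received more than one syllable.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_syllables(syllable_xs, note_xs):
    note_xs = np.asarray(note_xs, dtype=float)
    assignment = {i: int(np.argmin(np.abs(note_xs - x)))
                  for i, x in enumerate(syllable_xs)}       # greedy step
    taken = set(assignment.values())
    conflicts = {n for n in assignment.values()
                 if list(assignment.values()).count(n) > 1}
    for note in conflicts:                                   # resolve conflicts
        syllables = [i for i, n in assignment.items() if n == note]
        # candidates: the conflicting note plus free neighbouring notes
        candidates = [note] + [n for n in (note - 1, note + 1, note - 2, note + 2)
                               if 0 <= n < len(note_xs) and n not in taken]
        cost = np.abs(np.subtract.outer(
            [syllable_xs[i] for i in syllables], note_xs[candidates]))
        rows, cols = linear_sum_assignment(cost)             # Hungarian matching
        for r, c in zip(rows, cols):
            assignment[syllables[r]] = candidates[c]
            taken.add(candidates[c])
    return assignment                                        # syllable idx -> note idx
```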

5.7. Reconstruction and Generation of the Target Encoding

The concluding phase of the pipeline revolves around reconstructing and generating the target encoding. This phase takes the outcomes of the preceding steps, such as the extracted music symbols, the assigned syllables, the layout details, and the extracted chants, as input. The aim is to generate a digital representation encompassing both the music and the lyrics, usually formatted as MusicXML, MEI, or Monodi. By leveraging these data, the target encoding is generated. An example of the output of the pipeline and its conversion into a widely used format is illustrated in Figure 6.

6. Architectures

This section describes the architectures that have been used to enable automatic transcription.

6.1. Staff Line and Symbol Segmentation

As previously mentioned, the pipeline leverages an architecture inspired by the FCN [40] (and U-Net [14]) for staff line and symbol segmentation in images. Our architecture consists of two main components:
  • The encoder (contracting path), which progressively reduces the spatial dimensions of the input image and thereby encodes it into feature representations at multiple hierarchical levels.
  • The decoder (expansive path), which complements the encoder by projecting the context information features learned during the encoding stage onto the pixel space. This process involves upsampling the encoded features and merging them with corresponding features from the contracting path. The decoder’s objective is to reconstruct the original image with precise segmentation, utilizing the combined spatial and contextual information.
For the encoder part, we settled on an Eff-b3 (https://github.com/lukemelas/EfficientNet-PyTorch, accessed on 10 August 2024) architecture, as the best results had been achieved with this encoder in previous experiments [34]. The decoder part was added to this architecture. It consists of a bridge block (a block of layers connecting encoder and decoder) and upsample blocks (blocks of layers that increase the spatial resolution of the feature representation so as to match the original dimensions of the input image). The bridge block consists of convolutional layers with 256 kernels of size 3 × 3. The number of upsample blocks depends on the number of downsample blocks of the encoder. Each upsample block has two convolutional layers; the first upsample block consists of two layers with 256 kernels of size 3 × 3 each, and the number of kernels is halved with each additional upsample block. In addition, skip connections between the upsample blocks and the downsample blocks were added. Finally, a convolutional layer converts the output at each pixel position into 9 (8 classes + background) or 2 (staff line + background) target probabilities, depending on the task. For training, Adam was used as the optimizer with a learning rate of 1 × 10−4.
The input images are scaled to a height of 80 while retaining the image aspect ratio.
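A much simplified PyTorch sketch of such an encoder–decoder segmentation network; a small generic convolutional encoder replaces the EfficientNet-b3 backbone, and the layer widths are illustrative rather than the exact configuration described above:

```python
# Simplified U-Net-style segmentation sketch: encoder, bridge block, upsample
# blocks with skip connections, and a per-pixel classification head.
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
                         nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

class SegNet(nn.Module):
    def __init__(self, n_classes=2, widths=(32, 64, 128, 256)):
        super().__init__()
        self.enc = nn.ModuleList()
        cin = 1
        for w in widths:                                  # contracting path
            self.enc.append(conv_block(cin, w))
            cin = w
        self.pool = nn.MaxPool2d(2)
        self.bridge = conv_block(widths[-1], 256)         # bridge block
        self.up, self.dec = nn.ModuleList(), nn.ModuleList()
        c = 256
        for w in reversed(widths):                        # expansive path
            self.up.append(nn.ConvTranspose2d(c, w, 2, stride=2))
            self.dec.append(conv_block(2 * w, w))         # 2*w: skip concatenation
            c = w
        self.head = nn.Conv2d(c, n_classes, 1)            # per-pixel class scores

    def forward(self, x):
        skips = []
        for block in self.enc:
            x = block(x)
            skips.append(x)
            x = self.pool(x)
        x = self.bridge(x)
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))
        return self.head(x)                               # (N, n_classes, H, W)

# e.g., SegNet(n_classes=9)(torch.zeros(1, 1, 80, 320)) for symbol segmentation,
# or SegNet(n_classes=2) for staff line pixels vs. background.
```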

6.2. Text and Syllable Recognition

As the HTR engine, we use a CNN/LSTM architecture. It consists of seven CNN layers followed by two bidirectional LSTMs. The convolutional layers have 64, 128, 256, 256, 512, 512, and 512 kernels with a kernel size of 3 × 3 each. Between the first and second and between the second and third layer, a max pooling layer with a kernel size of 2 × 2 was added. Between the fourth and fifth layer, a max pooling layer with kernel size 2 × 2, stride 2 × 1, and padding 0 × 1 was added. Likewise, between the sixth and seventh layer, a max pooling layer with kernel size 4 × 2, stride 2 × 1, and padding 0 × 1 was inserted. The recurrent part consists of two bidirectional LSTMs with 128 hidden units. Before and between the LSTMs, dropout with a rate of 0.3 was applied. For training, AdaBelief was used as the optimizer with a learning rate of 1 × 10−3. The images are scaled and then padded to a width of 640 and a height of 48 while retaining the aspect ratio.
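A rough PyTorch sketch of this CNN/LSTM recognizer; the reshaping of the feature map into a per-column sequence and the CTC-style output layer are assumptions of the sketch:

```python
# CNN/LSTM HTR sketch: seven 3x3 convolutions (64..512 kernels), the stated
# pooling layers, two bidirectional LSTMs with 128 units, and a per-timestep
# character classifier suitable for a CTC loss. Input: (N, 1, 48, 640).
import torch
import torch.nn as nn

class HTRNet(nn.Module):
    def __init__(self, n_chars):
        super().__init__()
        def conv(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))
        self.cnn = nn.Sequential(
            conv(1, 64),    nn.MaxPool2d(2),                          # H: 48 -> 24
            conv(64, 128),  nn.MaxPool2d(2),                          # H: 24 -> 12
            conv(128, 256), conv(256, 256),
            nn.MaxPool2d((2, 2), stride=(2, 1), padding=(0, 1)),      # H: 12 -> 6
            conv(256, 512), conv(512, 512),
            nn.MaxPool2d((4, 2), stride=(2, 1), padding=(0, 1)),      # H: 6 -> 2
            conv(512, 512),
        )
        self.dropout = nn.Dropout(0.3)
        self.lstm = nn.LSTM(512 * 2, 128, num_layers=2, bidirectional=True,
                            batch_first=True, dropout=0.3)
        self.head = nn.Linear(2 * 128, n_chars + 1)       # +1 for the CTC blank

    def forward(self, x):                                  # x: (N, 1, 48, 640)
        f = self.cnn(x)                                    # (N, 512, 2, W')
        n, c, h, w = f.shape
        seq = f.permute(0, 3, 1, 2).reshape(n, w, c * h)   # one step per column
        out, _ = self.lstm(self.dropout(seq))
        return self.head(out)                              # (N, W', n_chars + 1)
```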

7. Experiments

In this section, the setting for the performed experiments is described, as well as their interpretation, and the experimental conclusions that can be drawn. First, an overview of the overall results is given. The metric here solely counts the number of actions needed to correct the automatic output of the entire pipeline.
It should be noted, however, that when evaluating the entire pipeline at once, some of the steps build on each other, i.e., errors that occur in earlier steps of the pipeline can influence the results of subsequent steps. For example, if music regions are not correctly recognized in the layout step, this also degrades the results of the symbol recognition step.

7.1. Experimental Setting

In total, there were 228 chants in the Mul 2 dataset. In 186 of the 228 cases, similar texts could be found using HTR. In the evaluation for the Mul2 dataset, we restrict ourselves to these 186 chants. The models utilized to generate the results were trained on Nevers P1, Nevers P2, Nevers P3, and Latin 14819 book. The performance of symbol recognition was evaluated with and without fine-tuning on a dataset of 20 pages from the Mul 2 book that were not included in the original Mul 2 dataset.
Additionally, a similar experiment was conducted on the Graduale Synopticum dataset. Here, the line detection and symbol detection models underwent fine-tuning with an additional 50 pages from the Graduale Synopticum dataset. The syllable assignment model, on the other hand, was trained from scratch using synthetically generated text images extracted from texts available on the Graduale Synopticum website.
The metric used for evaluation is the number of actions required to correct the automatic transcription divided by the total number of reference units (e.g., for the symbol detection step, the number of actions required to correct all symbols divided by the total number of symbols in the dataset). We follow the correction actions (symbol insert, symbol delete, syllable drag and drop, etc.) provided by OMMR4all [35]. For each correction step (layout, symbols, text, and syllable alignment), the respective percentage is indicated. This percentage pertains to different quantities: for the line step, it is the number of lines that need correction divided by the total number of lines; for the symbol step, the number of symbols requiring correction divided by the total number of symbols; for the text step, the number of characters requiring correction divided by the total number of characters; and for the syllable assignment step, the number of syllables requiring correction divided by the total number of syllables in the dataset.
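For illustration, the correction actions for a single step could be counted via the edit operations against the ground truth and normalized by the number of ground-truth units; the symbol-label encoding in the example is an assumption:

```python
# Count insert/delete/replace operations needed to turn the predicted sequence
# into the ground truth, normalized by the number of ground-truth units.
import difflib

def correction_rate(predicted, ground_truth):
    actions = 0
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(
            None, predicted, ground_truth).get_opcodes():
        if tag == "replace":
            actions += max(i2 - i1, j2 - j1)
        elif tag == "delete":
            actions += i2 - i1
        elif tag == "insert":
            actions += j2 - j1
    return actions / len(ground_truth)

# e.g., one missing note in a five-symbol ground-truth sequence -> 0.2
print(correction_rate(["clef_c", "note", "note", "flat"],
                      ["clef_c", "note", "note", "note", "flat"]))
```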
For the qualitative evaluation, we randomly selected 20 pages of the Mul2 dataset (Mul2 Sub1, Mul2 Sub2) to measure the duration required to manually rectify melodic inaccuracies (e.g., missing or extra notes, textual errors, and inaccurate syllable assignments) using OMMR4all.
Table 4 gives an overview of the statistics of all test datasets.

Quantitative Evaluation Results

Table 5 provides an overview of the results. The reported results were attained with the pipeline outlined in Section 5 and evaluated by counting the actions needed to rectify the automated transcriptions as outlined in Section 7.1. The layout detection (including staffline recognition) works very well, with no errors detected in the case of the staff lines. Occasionally, staff lines may be recognized as too long. However, this was not counted as an error since it did not pose a problem for subsequent steps; the main concern was that they were detected so that the regions could be cropped. When distinguishing between regions in medieval chants, we categorize them into musical and textual regions. These regions are consistently identified, which makes sense given that all the staff lines have already been recognized, allowing for the computation of music and textual regions based on these staff lines.
Errors in layout detection only occurred in the recognition of chants. However, this is minimal, just 0.42%, and is attributed to the failure to detect some drop capitals. In the case of the Graduale Synopticum, all chants were found. However, this is expected as each line represents a separate chant, making it a trivial task.
The symbol recognition is also very good. Symbol recognition on the Mul2 dataset achieved a low error rate of 1.48% for both missing and additional symbols without fine-tuning. This error rate was further reduced to 0.71% when fine-tuning was applied. In the Graduale Synopticum dataset, this value ranges between 0.22% and 0.39%. The recognition in the Graduale Synopticum dataset is significantly better, as expected, since the symbols are much more uniform. However, it is not perfect, as some symbols may protrude into other staves, be rendered in colors with low contrast to the background, or overlap with text. If the type of symbol (note, C-clef, F-clef, flat/sharp/natural accidental) also needs to be recognized, 0.82% (1.71% without fine-tuning) of all symbols in the Mul 2 dataset need to be corrected. There is little change in the Graduale Synopticum dataset, mainly because all chants are rendered with a C-clef, eliminating errors in clef distinction. Accidentals are rare, which minimally impacts this result. The pitch recognition has a higher impact on the Graduale Synopticum dataset, which can be explained by the fact that the automatic assignment of notes to staff lines is optimized for handwritten documents. Adjusting the parameters relevant for note assignment to staff lines would likely lead to improvements.
Considering the reading order in the evaluation reveals a larger increase in error rates. This is because determining the reading order can be challenging without background knowledge and context, e.g., in cases where two notes are stacked on top of each other (e.g., a PES). Especially in historical documents, where the position of the symbols may be slightly off (e.g., shifted to the left or right), the problem occurs more often. Here, an additional neume classification would be helpful for rectifying errors in the reading order.
Considering the graphical connections between notes reveals another significant increase in errors. While depicted graphical connections are easy to recognize and less likely to cause mistakes, distinguishing between gapped and neume start connections is not always straightforward. This is because the only difference is based on the distance to the previous note on the image level, which may not always be clear in historical documents. Additionally, the recognition of graphical connections in historical documents is often challenging due to the faintness and fading of the ink, as well as the close proximity of notes, which can obscure these connections and make them difficult to discern, even for human observers. This poses a significant obstacle to accurate classification. Despite that, the symbol recognition remains at a very high level.
Text recognition is particularly challenging in historical chants due to the similarity of characters, word separation based on syllable assignment, and other typical issues. Without post-processing, text recognition therefore only reaches a character error rate (CER) of 24.1%. Even with fine-tuning on 10 pages, the CER remains at 14.9%. However, a helpful approach is chant-specific post-processing, where chants from a database are matched with the OCR results to improve the transcription. This reduces the CER to 3.8%; the remaining errors arise from slight mismatches between chants or from errors introduced during line segmentation, which account for approximately 1.48% of the errors. It should also be noted that, due to the chant-based post-processing, text errors tend to be grouped together, and individual letter errors are rather rare. For example, a word may be missing, or an incorrect word may be inserted in the automatic transcription, rather than individual letters within words being incorrect. Such errors take less time to correct manually.
When symbols and text are previously corrected, the syllable assignment error rate on both datasets is around 2.3–2.7%. This is mainly because in both the Graduale Synopticum and Mul 2 datasets, syllables may not always be placed directly under the corresponding note but may be shifted to the left or right, leading to errors, especially in dense passages of notes. When symbols are not previously corrected and ground truth text is available, 9.4% of all syllables need to be corrected. Applying the algorithm directly to the inferred symbols and texts without any corrections increases the error rate to 11.2%. This is due to potential missing or additional symbols, as well as symbols having incorrect graphical connections, which affects the algorithm. In addition to incorrectly assigning the syllable to a wrong note, another potential error source is if the transcribed text of the syllable does not match the lyric of the matched ground truth syllable. Errors in syllable assignment are particularly prevalent at the beginning and end of chants due to misalignment between the chant being analyzed and the reference chant. This misalignment can introduce additional syllables or cause words to be truncated, such as the omission of “Alleluia” or “Gloria” at the end of chants.

7.2. Qualitative Evaluation

Furthermore, to assess the quality of the mass transcription, we conducted a qualitative analysis by measuring the duration required to rectify melodic inaccuracies (e.g., missing or extra notes, textual errors, and inaccurate syllable assignments). To accomplish this, 20 pages from the results of the automatic transcription were selected and afterward manually corrected using an updated variant of the overlay editor OMMR4all [35]. OMMR4all is an overlay editor that facilitates graphical-level corrections. It operates by utilizing precise positions of transcription components, including staves, symbols, and syllables, to improve error detection and correction.
To avoid confusion, it is important to clarify that all participants in the experiment possess a musical background and lack expertise in computer science. This distinction is crucial because the Monodi+ tool necessitates musical knowledge, such as identifying note pitches, a process that is automatically handled by the OMMR4all tool. Additionally, even in the OMMR4all tool, musical expertise is advantageous for resolving ambiguous cases. We assessed two approaches for correcting the automatic transcription, one of which divided the correction into two phases. The first phase involved addressing errors at the symbol level, including adding missing symbols or removing incorrect symbols. The second phase addressed textual errors, which included errors caused by the OCR or post-processing algorithm at both the text and syllable levels. Examples of such errors included incorrect word separation and misaligned syllables. Instances of corrections at both the symbol and text levels are depicted in Figure 7. These phases were corrected sequentially in 10-page splits. In the second approach the correction is no longer divided into two phases (initially correcting the symbols and subsequently correcting at the text level) but instead simultaneously addresses errors in both symbols and text (Author 3).
In addition, a second experiment was conducted where the first five pages were manually transcribed with the Monodi+ [4] tool to compare the transcription times. This tool is an optimized editor specially designed for the input of chants written in square notation (or similar notations) via the keyboard. Table 6 shows the results of this evaluation (compare Table 4 for the instance statistics of the datasets). With OMMR4all, transcribing a page takes on average between 2 and 4 min, while transcribing a page with Monodi+ takes between 17 and 18 min. In OMMR4all, the correction at the text level (text and syllables) takes longer in relation to the total time, whereas with Monodi+, it is the other way round. This is logical, because in OMMR4all the correction of the symbols is carried out on the graphic level (placing the symbols in the right place, with the note pitch calculated automatically), whereas in Monodi+ the note pitch has to be entered directly. Additionally, a trend can be observed that simultaneous correction of symbols, text, and syllables is beneficial, which makes sense as it involves only one correction run instead of two, potentially reducing the time needed to re-familiarize oneself with the melody. However, this approach requires that the tool used for correction supports such a process, and the transition between the different correction modes must be seamless; otherwise, time may still be lost during the transition.
All in all, this process speeds up the transcription time by a factor of 9 in the (simultaneous) one-pass strategy and by a factor of 5 to 6 in the two-pass strategy.

8. Discussion

This section examines selected steps of the pipeline in more detail by conducting and discussing additional experiments.

8.1. Text Recognition

As established in the evaluation above, text recognition without post-processing performs poorly, whereas the chant-based post-processing method introduced in this paper clearly improves the results. This section examines text recognition in more detail; an additional experiment extends the evaluation of the post-processing method and sheds light on further relevant aspects.
Two questions were addressed: (1) Is additional fine-tuning necessary for text recognition and post-processing? (2) How many errors are produced by splitting chants into lines based on inaccurate OCR output? For these experiments, 10 extra pages from the Mul 2 book (not included in the Mul 2 dataset) were used for fine-tuning, and the fine-tuned model was compared with the model without fine-tuning. Furthermore, the chant-based post-processing was optionally enabled or disabled during evaluation.
Table 7 presents the results of this evaluation. For each experiment, three values were calculated: Line CER, Chant CER, and WER. These metrics measure the number of actions (Insert, Delete, Replace) required to convert predicted transcriptions into the ground truth sequence. The three metrics differ only in their pre-processing of the sequence. In the case of Line CER, predicted lines are compared with ground truth lines; for Chant CER, the entire predicted lyrics of a chant are compared with the ground truth lyrics of that chant; and for WER, individual words are treated as tokens instead of individual characters.
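For illustration, the following minimal sketch (not the actual evaluation code of this paper) shows how such edit-distance-based rates can be computed; Line CER applies cer() per line, Chant CER applies it to the concatenated lyrics of a chant, and WER uses words instead of characters as tokens.

def edit_distance(gt, pred):
    # Levenshtein distance: minimal number of insert/delete/replace actions
    # needed to turn the predicted sequence into the ground-truth sequence.
    row = list(range(len(pred) + 1))
    for i, g in enumerate(gt, 1):
        prev, row[0] = row[0], i
        for j, p in enumerate(pred, 1):
            prev, row[j] = row[j], min(row[j] + 1,       # insert a missing token
                                       row[j - 1] + 1,   # delete a spurious token
                                       prev + (g != p))  # replace if tokens differ
    return row[-1]

def cer(gt_text, pred_text):
    # Character error rate: actions per ground-truth character.
    return edit_distance(list(gt_text), list(pred_text)) / max(len(gt_text), 1)

def wer(gt_text, pred_text):
    # Word error rate: same computation, but with words as tokens.
    gt_words, pred_words = gt_text.split(), pred_text.split()
    return edit_distance(gt_words, pred_words) / max(len(gt_words), 1)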
The results show that, as expected, fine-tuning leads to improvements. However, the difference between the fine-tuned and the non-fine-tuned model is not substantial: errors occur in both cases, and reasonable results can be achieved even without fine-tuning.
It is also worth noting that while the error rate decreases with post-processing, the errors also shift. Without post-processing, approximately 40–60% of all words were incorrect; with post-processing, this number falls to 9–12%. This is significant for manual correction, as it is easier to correct a few whole words than many scattered single characters.
Additionally, approximately 1% of the errors are due to suboptimal line segmentation. Nevertheless, chant-based post-processing is crucial for the quality of the automatic transcription.
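To make the idea behind the chant-based post-processing concrete, the following is a minimal sketch only, assuming a generic similarity measure; the actual pipeline additionally re-splits the matched text into lines and syllables using the character positions of the OCR output, and the function and parameter names here are hypothetical.

import difflib

def post_process_chant(ocr_text, chant_database, min_similarity=0.7):
    # Compare the concatenated OCR output of one chant with all known chant
    # texts and return the most similar database entry if it is close enough.
    best_match, best_score = None, 0.0
    for candidate in chant_database:
        score = difflib.SequenceMatcher(None, ocr_text, candidate).ratio()
        if score > best_score:
            best_match, best_score = candidate, score
    # Keep the raw OCR text if no sufficiently similar chant is known.
    return best_match if best_score >= min_similarity else ocr_text

In such a scheme, a deviation between the manuscript and the database entry shows up as a whole wrong, extra, or missing word, which matches the observation above that post-processing shifts errors from scattered characters to whole words.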

8.2. Syllable Detection

In this section, similar to the extended evaluation of text recognition, some aspects of syllable assignment are examined more closely. Various experimental setups were prepared to address different questions about syllable assignment: (1) How does poor OCR affect syllable assignment compared to using fine-tuned models for text recognition? (2) How well does syllable assignment work if the symbols are corrected beforehand (GT)? And (3) how good is syllable assignment if both lyrics (text) and symbols are corrected in advance? Additionally, these experiments were carried out with the chant-text database turned on and off to provide a more comprehensive analysis. As before, the number of actions needed to correct errors in the syllable assignment, normalized by the total number of syllables, was counted as a performance metric (M1). There are primarily two sources of error: either the syllable is assigned to the wrong note, or the syllable is assigned to the correct note but its text is incorrect (due to an erroneous OCR result or syllable split). Additionally, a second count was made in which errors at the text level were ignored, to check whether the correct notes were selected for the syllable assignment (M2).
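A simplified sketch of the two counts, assuming the predicted syllables have already been aligned one-to-one with the ground truth (the real evaluation counts correction actions and must also handle missing or extra syllables; the data structures here are hypothetical):

def syllable_metrics(gt_syllables, pred_syllables):
    # Each syllable is represented as a (text, assigned_note_id) pair.
    m1 = m2 = 0
    for (gt_text, gt_note), (pred_text, pred_note) in zip(gt_syllables, pred_syllables):
        wrong_note = gt_note != pred_note
        wrong_text = gt_text != pred_text
        if wrong_note or wrong_text:
            m1 += 1  # M1: wrong note or wrong syllable text
        if wrong_note:
            m2 += 1  # M2: wrong note only, text-level errors ignored
    total = max(len(gt_syllables), 1)
    return m1 / total, m2 / total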
The results of these experiments are shown in Table 8. If the algorithm is applied to the results of the previous steps without correcting them beforehand, between 11.2% and 12.8% of the syllables need correction when chant-based lyric post-processing is enabled. Several causes can be identified: the syllable text is incorrect, the syllable was assigned to the wrong note (e.g., because the note is missing), or the word was divided into incorrect syllables. If, on the other hand, one assumes that errors are corrected after each step of the pipeline, i.e., possible errors (e.g., missing notes) are corrected in advance, then only 2.3% of all syllables need correction. When errors at the text level are ignored, only 0.1% of all syllables still need to be corrected.

9. Conclusions

In this paper, we have shown that automatic transcription achieves low error rates on historical music documents. For this purpose, we presented and evaluated our pipeline, which is integrated into OMMR4all [35] and implemented using neural networks and background knowledge. The pipeline comprises the following steps: pre-processing, staff recognition, layout analysis, symbol recognition, text and syllable recognition, and finally the assignment of syllables to notes. The staff recognition and layout recognition steps generate very few errors that need to be corrected during transcription. More errors occur in symbol recognition: on average, between one and seven errors per 100 symbols. However, the most frequent errors occur at the text level. Here, post-processing with a chant text database was introduced into the pipeline, which drastically improves the raw transcription results of the CNN/LSTM network. It reduces the errors that need to be corrected from 24.1% to 4.7%, or to 3.8% if the model is also fine-tuned. It should also be noted that errors are often grouped together, which makes manual correction much easier, as they are largely limited to extra or missing words or errors due to incorrect line separation.
Nevertheless, there is still room for improvement. An obvious step would be to optimize the text recognition network with more training material, which could improve the localization of the characters and thus the line splitting and the matching against the chant database. In addition, expanding the chant database would make matches more likely. Using background knowledge when assigning syllables to notes would also improve this critical step.
The correction of the automatic transcription currently takes about 2 min per page, while manual transcription with an optimized tool takes about 18 min. Thus, the use of this pipeline accelerates the transcription process by a factor of up to 9.
The pipeline was developed as part of the Corpus Monodicum (https://corpus-monodicum.de/about, accessed on 10 August 2024) project, which comprises chants of Latin medieval music, with a large proportion of mass chants, tropes, sequences, songs, and liturgical dramas. Integration into the larger system of the Cantus Index, which connects over 15 autonomous projects similar to a metasearch engine for medieval chant, is underway. The goal is to automatically transcribe parts of this collection using the presented pipeline, thereby significantly reducing the time required for transcription. The transcription has so far only been tested on documents that use square notation, but with minor modifications (e.g., fine-tuning), the pipeline could be applied to other similar notations (e.g., "Hufnagel" notation). Many of the transcribed chants are already published and can be viewed at https://corpus-monodicum.de/ (accessed on 10 August 2024), including large parts of the Graduale Synopticum dataset and the Mul 2 dataset. In summary, this paper presents a comprehensive pipeline for the complex task of automatically transcribing medieval chants and demonstrates its effectiveness through the reported results.

Author Contributions

A.H. and F.P. conceived and performed the experiments. A.H. designed the algorithms. A.H. and F.P. analyzed the results. A.H. and T.E. annotated the GT data. A.H. wrote the paper with substantial contributions from F.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly funded by the Corpus Monodicum project, supported by the German Academy of Sciences, Mainz, Germany; grant number II.G.24.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Nevers: Paris, Bibliothèque nationale de France, Nouv. acq. lat. 1235. Available at: https://github.com/OMMR4all/datasets (accessed on 10 August 2024). Pa 14819: Paris, Bibliothèque nationale de France, lat. 14819. Mul 2: Mulhouse, Bibliothèque municipale, 0002. Available at: https://bvmm.irht.cnrs.fr/consult/consult.php?reproductionId=949 (accessed on 10 August 2024). Graduale Synopticum: http://gregorianik.uni-regensburg.de/gr/ (accessed on 10 August 2024).

Acknowledgments

We would like to thank Felicitas Stickler and Ina Schütte for helping to create the dataset and to resolve ambiguities in it. We also thank Andreas Haug for providing musical background knowledge.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
TP    True Positive
FP    False Positive
FN    False Negative
GT    Ground Truth
CC    Connected Component
HTR    Handwritten Text Recognition
OMR    Optical Music Recognition
CNN    Convolutional Neural Network
LSTM    Long Short-Term Memory
RNN    Recurrent Neural Network
FCN    Fully Convolutional Network
dSAR    diplomatic Symbol Accuracy Rate
dSER    diplomatic Symbol Error Rate
hSAR    harmonic Symbol Accuracy Rate
hSER    harmonic Symbol Error Rate
NC    Note Component
SGD    Stochastic Gradient Descent
PiS    Position in Staff
GCN    Graphical Connection between Notes
CER    Character Error Rate
ACC    Accuracy
Accid    Accidental
Syl    Syllable
RLE    Run Length Encoding

References

1. Parrish, C. The Notation of Medieval Music; Pendragon Press: Hillsdale, NY, USA, 1978; Volume 1.
2. Good, M.D. MusicXML: An Internet-Friendly Format for Sheet Music. In Proceedings of XML 2001, Boston, 9–14 December 2001. Available online: http://michaelgood.info/publications/music/musicxml-an-internet-friendly-format-for-sheet-music/ (accessed on 10 August 2024).
3. Hankinson, A.; Roland, P.; Fujinaga, I. The Music Encoding Initiative as a Document-Encoding Framework. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), Miami, FL, USA, 24–28 October 2011; pp. 293–298.
4. Eipert, T.; Herrman, F.; Wick, C.; Puppe, F.; Haug, A. Editor Support for Digital Editions of Medieval Monophonic Music. In Proceedings of the 2nd International Workshop on Reading Music Systems, Delft, The Netherlands, 2 November 2019; pp. 4–7.
5. Calvo-Zaragoza, J.; Hajič, J., Jr.; Pacha, A. Understanding Optical Music Recognition. ACM Comput. Surv. 2021, 53, 1–35.
6. Ríos Vila, A.; Rizo, D.; Iñesta, J.; Calvo-Zaragoza, J. End-to-end Optical Music Recognition for Pianoform Sheet Music. Int. J. Doc. Anal. Recognit. 2023, 26, 347–362.
7. Ríos Vila, A.; Iñesta, J.; Calvo-Zaragoza, J. End-To-End Full-Page Optical Music Recognition for Mensural Notation. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR 2022), Bengaluru, India, 4–8 December 2022.
8. Fujinaga, I.; Vigliensoni, G. Optical Music Recognition Workflow for Medieval Music Manuscripts. In Proceedings of the 5th International Workshop on Reading Music Systems, Milan, Italy, 4 November 2023; pp. 4–6.
9. Fujinaga, I.; Vigliensoni, G. The Art of Teaching Computers: The SIMSSA Optical Music Recognition Workflow System. In Proceedings of the 2019 27th European Signal Processing Conference (EUSIPCO), A Coruna, Spain, 2–6 September 2019; pp. 1–5.
10. Wick, C.; Reul, C.; Puppe, F. Calamari—A High-Performance Tensorflow-based Deep Learning Package for Optical Character Recognition. Digit. Humanit. Q. 2020, 14. Available online: https://www.digitalhumanities.org/dhq/vol/14/2/000451/000451.html (accessed on 10 August 2024).
11. de Reuse, T.; Fujinaga, I. Robust Transcript Alignment on Medieval Chant Manuscripts. In Proceedings of the 2nd International Workshop on Reading Music Systems, Delft, The Netherlands, 2 November 2019; pp. 21–26.
12. Calvo-Zaragoza, J.; Pertusa, A.; Oncina, J. Staff-line Detection and Removal using a Convolutional Neural Network. Mach. Vis. Appl. 2017, 28, 1–10.
13. Wick, C.; Hartelt, A.; Puppe, F. Staff, Symbol and Melody Detection of Medieval Manuscripts Written in Square Notation Using Deep Fully Convolutional Networks. Appl. Sci. 2019, 9, 2646.
14. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; pp. 234–241.
15. Quirós, L.; Vidal, E. Evaluation of a Region Proposal Architecture for Multi-task Document Layout Analysis. arXiv 2021, arXiv:2106.11797.
16. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
17. Quirós, L.; Toselli, A.H.; Vidal, E. Multi-task Layout Analysis of Handwritten Musical Scores. In Proceedings of the Pattern Recognition and Image Analysis, Madrid, Spain, 1–4 July 2019; pp. 123–134.
18. Castellanos, F.J.; Garrido-Munoz, C.; Ríos-Vila, A.; Calvo-Zaragoza, J. Region-based Layout Analysis of Music Score Images. Expert Syst. Appl. 2022, 209, 118211.
19. Everingham, M.; Van Gool, L.; Williams, C.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
20. Calvo-Zaragoza, J.; Castellanos, F.J.; Vigliensoni, G.; Fujinaga, I. Deep Neural Networks for Document Processing of Music Score Images. Appl. Sci. 2018, 8, 654.
21. Alfaro-Contreras, M.; Iñesta, J.M.; Calvo-Zaragoza, J. Optical Music Recognition for Homophonic Scores with Neural Networks and Synthetic Music Generation. Int. J. Multimed. Inf. Retr. 2023, 12, 12.
22. Castellanos, F.; Calvo-Zaragoza, J.; Iñesta, J. A Neural Approach for Full-Page Optical Music Recognition of Mensural Documents. In Proceedings of the 21st International Society for Music Information Retrieval Conference, Virtual Conference, 11–16 October 2020.
23. Calvo-Zaragoza, J.; Toselli, A.H.; Vidal, E. Handwritten Music Recognition for Mensural Notation with Convolutional Recurrent Neural Networks. Pattern Recognit. Lett. 2019, 128, 115–121.
24. van der Wel, E.; Ullrich, K. Optical Music Recognition with Convolutional Sequence-To-Sequence Models. In Proceedings of the 18th ISMIR Conference, Suzhou, China, 23–27 October 2017.
25. Pacha, A.; Choi, K.Y.; Coüasnon, B.; Ricquebourg, Y.; Zanibbi, R.; Eidenberger, H. Handwritten Music Object Detection: Open Issues and Baseline Results. In Proceedings of the 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), Vienna, Austria, 24–27 April 2018; pp. 163–168.
26. Pacha, A.; Calvo-Zaragoza, J. Optical Music Recognition in Mensural Notation with Region-Based Convolutional Neural Networks. In Proceedings of the 19th ISMIR Conference, Paris, France, 23–27 September 2018.
27. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
28. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
29. Wick, C.; Hartelt, A.; Puppe, F. Lyrics Recognition and Syllable Assignment of Medieval Music Manuscripts. In Proceedings of the 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), Dortmund, Germany, 8–10 September 2020; pp. 187–192.
30. Hankinson, A.; Porter, A.; Burgoyne, J.; Thompson, J.; Vigliensoni, G.; Liu, W.; Chiu, R.; Fujinaga, I. Digital Document Image Retrieval using Optical Music Recognition. In Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR 2012), Porto, Portugal, 8–12 October 2012; pp. 577–582.
31. Breuel, T. Recent Progress on the OCRopus OCR System. In Proceedings of the International Workshop on Multilingual OCR, MOCR '09, Barcelona, Spain, 25 July 2009.
32. Martinez-Sevilla, J.C.; Rios-Vila, A.; Castellanos, F.J.; Calvo-Zaragoza, J. Towards Music Notation and Lyrics Alignment: Gregorian Chants as Case Study. In Proceedings of the 5th International Workshop on Reading Music Systems, Milan, Italy, 4 November 2023; pp. 15–19.
33. Burgoyne, J.A.; Ouyang, Y.; Himmelman, T.; Devaney, J.; Pugin, L.; Fujinaga, I. Lyric Extraction and Recognition on Digital Images of Early Music Sources. In Proceedings of the 10th International Society for Music Information Retrieval Conference, Kobe, Japan, 26–30 October 2009; Volume 10, pp. 723–727.
34. Hartelt, A.; Puppe, F. Optical Medieval Music Recognition using Background Knowledge. Algorithms 2022, 15, 221.
35. Wick, C.; Hartelt, A.; Puppe, F. OMMR4all—Ein Semiautomatischer Online-Editor für Mittelalterliche Musiknotationen. In Proceedings of the DHd 2020 Spielräume: Digital Humanities zwischen Modellierung und Interpretation, 7. Tagung des Verbands "Digital Humanities im deutschsprachigen Raum" (DHd 2020), Paderborn, Germany, 2–6 March 2020.
36. Breuel, T.M. The OCRopus Open Source OCR System. In Proceedings of the Document Recognition and Retrieval XV, San Jose, CA, USA, 29–31 January 2008; Volume 6815, p. 68150F.
37. Sauvola, J.; Seppänen, T.; Haapakoski, S.; Pietikäinen, M. Adaptive Document Binarization. In Proceedings of the Fourth International Conference on Document Analysis and Recognition, Ulm, Germany, 18–20 August 1997; Volume 33, pp. 147–152.
38. Cardoso, J.; Rebelo, A. Robust Staffline Thickness and Distance Estimation in Binary and Gray-Level Music Scores. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 1856–1859.
39. Fujinaga, I. Staff Detection and Removal. In Visual Perception of Music Notation: On-Line and Off-Line Recognition; IGI Global: Hershey, PA, USA, 2004; pp. 1–39.
40. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
Figure 1. Elements of a document that have to be extracted. The following elements need to be extracted and examples are highlighted in the image: Music–lyric region pairs, note components (start, gapped, i.e., connected, looped, i.e., graphically connected), clefs (C-clef, F-clef), accidentals (sharp, flat, natural), drop capitals (orange marked), individual chants (blue marked), lyrics and the division into syllables (red vertical lines), and the assignment of syllables to note sequences (purple arrow). The modern transcription of the highlighted chant is shown in Section 5.7.
Figure 2. Example images of the train datasets. Top row (from left to right): Nevers Part 1 and 2. Bottom row (left to right): Nevers Part 3, Latin 14819.
Figure 3. Example image of the test datasets. Top row: Mulhouse (Mul 2) dataset. Bottom row: Graduale Synopticum dataset, where only the lower half marked with a green rectangle is transcribed. The abbreviations MR, Ch, L, K, etc. (see http://gregorianik.uni-regensburg.de/gr/info, accessed on 10 August 2024) denote various sources from which the transcription was initially obtained.
Figure 4. The schematic workflow of the proposed optical music recognition pipeline. The images were cropped from the OMMR4all [35] overlay editor. The purple areas mark drop capitals. Green areas are music regions. Red areas are lyric regions. Gray regions are also lyric regions, but they additionally mark the start of a new chant. Yellow and green squares within the music region mark symbols. The different colors of the symbol indicate whether they lie on a staff line or between two staff lines. The reading order of the symbols is represented by a thin line that connects the symbols to each other. Turquoise lines between symbols mark graphic connections. Vertical lines in the bottom box mark note sequences to which a syllable is assigned.
Figure 5. Schematic procedure of the chant-based post-processing to improve HTR results. The chant to be transcribed is shown on the left side between the two drop capitals highlighted in purple starting with a D for Domine (image extracted from the OMMR4all editor). On the right side, part of the primary OCR transcription and the result after applying the post-processing algorithm is displayed. The different colored elements (e.g., red, gray, purple and green regions) of the image are explained in Figure 4.
Figure 6. On the top left side, the automatically generated output of the pipeline is depicted in the online editor OMMR4all. On the top right side, the corresponding representation generated automatically in the Corpus Monodicum Viewer is displayed. At the bottom part of the image, a snippet of the MEI export is displayed for the first four syllables of the marked (red lines) chant. The other highlighted elements (e.g., yellow or green squares) are explained in Figure 4.
Figure 7. Example of actions needed in the OMMR4all editor to correct errors in the transcription marked by a blue arrow. Symbol drag: the marked symbol needs to be adjusted closer to the staff line so that it is recognized by the system as being on the staff (yellow square). Symbol insert: insert a missing symbol. Symbol delete: the custos at the end of the line must be removed as it is not part of the melody. Syllable assignment: the syllable was incorrectly assigned to a note and needs to be corrected by moving it to a new note via drag and drop.
Table 2. Overview of dataset properties (Nevers P1 means Part 1). It also indicates which part has been annotated in the datasets (✓: GT available, x: GT not available).
DatasetPagesStaffs/PageSymbols/PageClefsAccidsAnnotated
StaffsCapitalsSymbolsText (+Syl)
Nevers P114927915224xx
Nevers P2271238934537xx
Nevers P389209831xx
Latin 1481918291912108291
Mul 21027216846192
Table 3. Division and statistics of the Graduale Synopticum dataset. The parts build upon each other, meaning “incl. minor text variations” contains all pages from “without text variations” along with additional pages, and “incl. major text variations” includes all parts.
Subset | #Pages | #Chants | #Symbols | #Syllables
without text variations | 124 | 649 | 69,672 | 27,629
incl. minor text variations | 275 | 1622 | 264,186 | 71,443
incl. major text variations | 304 | 1892 | 305,746 | 81,223
Table 4. Statistics of the datasets used for evaluation: the number of pages, chants, staves, symbols, clefs, notes, accidentals, characters, and syllables is listed (“Graduale no”: subset of chants from the Graduale Synopticum with no text variations; “Graduale minor”: subset with no or minor text variations; “Graduale all”: all chants; “Mul2 Sub1” and “Mul2 Sub2”: two subsets of the “Mul2” dataset).
Dataset | #Pages | #Chants | #Staves | #Symbols | #Clefs | #Notes | #Accids | #Chars | #Syllables
Mul2 | 102 | 186 | 621 | 18,643 | 711 | 17,747 | 185 | 16,416 | 5233
Graduale no | 124 | 649 | 649 | 69,672 | 691 | 68,732 | 249 | 92,312 | 28,702
Graduale minor | 275 | 1622 | 1622 | 264,192 | 2470 | 260,613 | 1109 | 230,735 | 71,443
Graduale all | 304 | 1892 | 1892 | 305,746 | 2962 | 301,564 | 1220 | 262,333 | 81,223
Mul2 Sub1 | 20 | 58 | 560 | 4548 | 164 | 4363 | 21 | 3914 | 1271
Mul2 Sub2 | 20 | 53 | 139 | 3971 | 164 | 3776 | 31 | 4081 | 1292
Table 5. Necessary manual interventions in percent for correcting the automatic transcription of pages from the Mul2 dataset and the Graduale Synopticum dataset. The “/” in the Mul2 column indicates without/with fine-tuning. In the case of the Graduale Synopticum dataset, “no text variations” means that the texts of the chants within one page were identical, while “minor text variations” had up to 5 character differences per chant including “no text variations”.
Category | Mul2 | Graduale Synopticum (No) | Graduale Synopticum (Minor) | Graduale Synopticum (All)
Layout
   Lines | 0% | 0% | 0% | 0%
   Lines & Regions | 0% | 0% | 0% | 0%
   Lines & Regions & Chants | 0.42% | - | - | -
Symbols
   Existence | 1.48/0.71% | 0.22% | 0.38% | 0.39%
   Existence & Type | 1.71/0.82% | 0.23% | 0.39% | 0.39%
   Existence & Type & Pitch | 2.22/1.53% | 0.58% | 0.90% | 0.91%
   Existence & Type & Pitch & Reading Order | 4.40/2.4% | 1.26% | 1.69% | 1.70%
   Existence & Type & Pitch & Reading Order & Graphical Connections | 13.9/7.4% | 2.70% | 3.92% | 4.07%
Text
   Text only | 14.9% | - | - | -
   Text with syllables and line segmentation
      without GT text, but song repository | 3.8% | - | - | -
      with GT text | 1.48% | - | - | -
Syllable Alignment
   with inferred symbols and inferred text | 13.06/11.2% | - | - | -
   with inferred symbols and GT text | 9.0/7.12% | 3.63% | 4.36% | 6.61%
   with GT symbols and GT text | 2.3% | 2.70% | 2.70% | 2.70%
Table 6. Evaluation of the correction time in minutes.
Corrector | #Pages | Symbols Level | Text Level | Total Time | Time/Page | Tool | Dataset
Person 1 | 10 | 11.6 | 20 | 31.6 | 3.16 | OMMR4all | Mul2 Sub1 (10)
Person 1 | 20 | 22.8 | 39.23 | 62.03 | 3.1 | OMMR4all | Mul2 Sub1 (all)
Person 2 | 10 | 9.72 | 22 | 31.72 | 3.17 | OMMR4all | Mul2 Sub1 (10)
Person 2 | 20 | 19.83 | 43 | 62.83 | 3.14 | OMMR4all | Mul2 Sub1 (all)
Person 3 | 10 | - | - | 25.95 | 2.60 | OMMR4all | Mul2 Sub2 (10)
Person 3 | 20 | - | - | 39.43 | 1.975 | OMMR4all | Mul2 Sub2 (all)
Person 2 | 5 | 55.4 | 33.5 | 88.9 | 17.78 | Monodi+ | Mul2 Sub1 (5)
Table 7. Experiments of the text detection step. Using the post-processing pipeline, the Line CER drops from 24.1 to 4.7 and from 14.9 to 3.8 on the mixed and fine-tuned model, respectively.
Post-Processing | Experiment | Line CER | WER | Chant CER
No | Mixed | 24.1% | 60.3% | 23.3%
No | Fine-tuned | 14.9% | 43.9% | 14.2%
Yes | Mixed | 4.7% | 12.1% | 3.4%
Yes | Fine-tuned | 3.8% | 9.3% | 2.6%
Table 8. Experiments of the syllable assignment step. Several different experiment setups were evaluated. Post-processing = yes means that the database with chant texts was used to post-correct the raw OCR transcriptions. “Fine-tuned” means that the text models were fine-tuned, while “mixed” means that the text model was not fine-tuned. The values after the “/” indicate the error rate of the syllable assignment if the symbol recognition was also fine-tuned. In the optimal case, when text and symbols have been fully corrected in advance, only 2.3% of all syllables need correction. When only using the results of the pipeline without intermediate correction, 11.2%/12.8% of all syllables need correction. Ignoring textual errors not relevant for correct syllable assignments drastically improves the results (M2).
Post-Processing | Experiment | M1 | M2
No | Mixed | 36.7/35.3% | 16.1/14.0%
No | Fine-tuned | 26.1/24.8% | 12.3/10.1%
No | GT (Symbols) no Fine-tune | 31.9% | 8.2%
No | GT (Symbols) with Fine-tune | 21.4% | 5.2%
No | GT (Symbols & Text) | 2.3% | 0.1%
Yes | Mixed | 14.1/12.8% | 10.2/8.4%
Yes | Fine-tuned | 13.1/11.2% | 9.7/7.5%
Yes | GT (Symbols) no Fine-tune | 8.4% | 3.6%
Yes | GT (Symbols) with Fine-tune | 7.2% | 3.0%
Yes | GT (Symbols & Text) | 2.3% | 0.1%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
