Line-Level Layout Recognition of Historical Documents with Background Knowledge
Round 1
Reviewer 1 Report
The paper conveys the hard work that has gone into developing the methods. However, I have two major criticisms:
(1) The related work section is too short / too generic. Try to find publications that tackled similar datasets and layout analysis / segmentation (making use of background knowledge). For the methods you list, discuss their advantages/disadvantages in a little more detail.
(2) There is no comparison against existing baseline methods. At least results of open-source OCR systems (e.g. Tesseract) should be included. This might not be feasible for every aspect of your method, but certainly for parts of it.
Further comments:
- Spelling is generally good (except the first word! "digitalisation" should be "digitisation"; it is correct later on in the paper). Grammar could be improved; especially the abstract needs refinement. Some unusual language here and there ("disastrous" is a bit over the top). "Text lines", not "textlines". "Post-processing" is used in different variations (space, hyphen).
- Many thresholds, parameters etc. are introduced without explanation or reasoning (Were they copied from a reference? Were they tuned? ...), e.g. the machine learning hyperparameters or the various percentages used in the rulebook.
- You use page width for several decisions. How is that defined? With or without the margin and scanner background? I think whitespace is mentioned once in relation to this, but you should clearly define it earlier, when it is used the first time.
- Have you checked the impact of binarisation problems / artefacts? Do other binarisation algorithms work better or worse?
- The results and discussion section could be a bit more concise.
Author Response
Thank you for your review and for the helpful recommendations.
We have substantially improved the related work section and discuss the different approaches for layout analysis and their role in a document processing pipeline. In particular, we went into further detail summarizing the outputs of different layout recognition methods and compare them in a compact table. However, the different approaches evaluated different aspects with different metrics, so that it is difficult to identify a "best" approach.
To address the second major point of comparing our approach to other approaches, we used the popular open-source kraken engine and included a comparison of our baseline + text line polygonization system to its baseline detector. Further comparisons, e.g. with line-level segmentation approaches for handwritten text, are difficult because the tasks differ too much (e.g. our approach is not designed for the arbitrarily rotated baselines that occur in the cBAD dataset).
We also improved the methods section with details on how the parameters were chosen.
In addition, you noted that different binarization algorithms might produce better results. While this may certainly be the case, the robustness of the binarization is crucial for this application, as both our algorithm and the evaluation metrics rely on the binarized output. We tried several different algorithms (ocropus-nlbin, Sauvola, ISauvola), but ultimately decided on ISauvola, as a very performant implementation was available and the results were consistently good across many different datasets. In the case of artifacts, we found that these pose a significant challenge for both our algorithms and the evaluation, as the IoU is calculated on the binarized foreground pixels. Therefore, we do not think that we can evaluate the impact of the binarization algorithm with our method, as we would probably have to adjust the ground-truth data for the different binarization algorithms.
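For illustration only, a rough sketch of the kind of local-threshold binarization we compared is shown below, using the plain Sauvola filter from scikit-image (ISauvola itself is not part of scikit-image, and the window size, k value and file name are illustrative placeholders rather than the parameters used in the paper):

```python
# Minimal sketch: Sauvola binarization with scikit-image (illustrative parameters;
# the paper's pipeline uses a separate ISauvola implementation, not shown here).
from skimage.io import imread
from skimage.color import rgb2gray
from skimage.filters import threshold_sauvola


def binarize_sauvola(path, window_size=25, k=0.2):
    """Return a boolean foreground mask (True = ink) for a scanned page."""
    img = imread(path)
    gray = rgb2gray(img) if img.ndim == 3 else img
    thresh = threshold_sauvola(gray, window_size=window_size, k=k)
    # Dark pixels below the locally computed threshold are treated as foreground.
    return gray < thresh


if __name__ == "__main__":
    mask = binarize_sauvola("page.png")  # hypothetical input image
    print(f"foreground ratio: {mask.mean():.3f}")
```

Both our detection pipeline and the IoU-based evaluation operate on such a foreground mask, which is why artifacts introduced at this stage propagate into the reported scores.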
Furthermore, we improved the discussion section to communicate more clearly the advantages and remaining errors of our method.
Best regards,
Norbert Fischer, Alexander Hartelt, Frank Puppe
Reviewer 2 Report
This paper presents a highly detailed, hand-crafted procedure for line-level historical document layout detection. The work is a valuable contribution, but there are several shortcomings which require revision:
Coverage of related work is rather brief and omits numerous relevant works in historical document analysis as well as non-historical document layout prediction. An exhaustive list would be impractical, but inclusion and comparative discussion of the findings of one or more recent reviews of such methods could provide clearer context for the contributions of this work. In particular, there have been more machine-learning-based developments in document layout analysis - and their relative impact has also been larger - than the related work section appears to suggest. Many of these may be outside the domain of historical documents per se, but their relevance is, I believe, sufficient to merit discussion and comparison.
Also on the topic of related work, there appears to be relatively little coverage of the types of manual rule-creation approaches (including those leveraging “background knowledge”) that this paper emphasizes, nor any explicit discussion of the lack of such prior work. It is therefore not clear which of the techniques described in the paper are innovations original to the authors and which are commonplace but simply parameterized in this work.
The motivation for the approach taken could use much further discussion. The approach is heavily hand-tuned and rule-based, but nonetheless relies on machine learning (ML) for base methods. Why not then lean more heavily on ML? The method already uses off-the-shelf ML for drop-capital detection, and a custom trained U-Net as a baseline detector. Why not instead train the U-Net model to directly predict the bounding line polygons and simultaneously classify them? It appears the GT data for this already exists, as it was used for the evaluation stage. This seems like the clearest path forward. Certainly the amount of available data is limited, but various data augmentation techniques could be applied to maximize their value, and synthetic data could be generated as well in arbitrary quantities for pre-training and transfer learning. Note also that background information could just as readily be incorporated in an ML-based approach (e.g. training several model versions with various elements enabled or not, or else training a single model to predict all elements, but dynamically ignoring specific elements via post-processing). These are not trivial tasks, but nor is the extensive hand-crafting of rules described in this work. This work’s approach might indeed be an ultimately stronger choice, or else offer a non-ML-centric alternative with competitive and [more importantly] interpretable function, but in light of ever-present advances in ML, I believe the approach needs clearer justification.
Many of the procedures described could sorely use more visual examples, particularly the various steps of section 4.5 (Polygonization of detected Baselines). In the absence of such visual aids, much of the descriptive language is ambiguous or unclear. For example: “If no joint estimated topline is found, the top limit is the estimated topline of the current detected baseline, moved again by its original distance to the top.“ What is the “original distance”? The description of how a line polygon differs from a bounding rectangle and why this is important is also not clear. Fig 5 appears to come closest to providing such an example, but 1) the dividing contour in this example does not do a good job of following the text and so is not a convincing argument for the need of specialized line polygons, and 2) the figure is not referenced anywhere in the text. On the whole, without accompanying and representative graphics (or perhaps pseudocode), much of the detailed procedures appear superfluous as they are excessively detailed to describe the aim of the technique, yet do not properly aid in understanding the method.
The most significant shortcoming is a lack of performance comparison to existing methods. Not only is a novel method presented without comparison, but a wholly novel set of performance metrics is used, leaving no way to assess the quality of these results in the context of prior works. Further, a rationale for these performance metrics is offered (in favor of FgPA and mIoU), but I believe the most relevant metric is ignored: mean average precision (mAP), which is the standard metric for object detection problems - which these evaluations are directly analogous to.
Additional minor issues:
In baseline post-processing step 1 (line 240), is the difference in allowable angles among connected components at different distances meant to allow for line warping in the images (e.g. due to curved pages)? If not, why is this step necessary?
Regarding column detection: What about images embedded in the middle of a page or column of text (e.g. wrapped around a figure)? Is this rare enough in historical documents to ignore?
A key strength of the paper that I commend the authors for is the extensive analysis of errors, and especially the accompanying examples and occurrence statistics. On the whole, I believe this is a valuable work, but I must insist on the major revisions listed above.
Author Response
Thank you for your thorough review and helpful recommendations.
In response to your suggestions, we extended the related work section significantly and added a table comparing different recent layout recognition approaches, containing both ML-based and rule-based methods. Even though we did an in-depth search for other modern rule-based approaches to layout recognition, there has been very little coverage of this topic. As you stated, most approaches focus heavily on machine learning, especially leveraging modern architectures for instance segmentation. This was in fact one of the initial motivations for our research, as we were confident that a rule-based system can also achieve good results for many printings, especially if the pages follow a similar structure, which is common for printings from this period.
In addition, we have reworked parts of the description of our method, which we hope explains it more clearly and understandably. Unfortunately, we were not able to generate better / more descriptive figures for the individual stages of our algorithm due to time constraints, but we hope to improve on this for the final version of the paper.
To address the lack of comparison of our method to existing approaches, we trained models for the open-source kraken engine on our baseline datasets and compared the results for line detection with our method. A direct comparison to other layout recognition methods is not easily feasible because the tasks differ. At the current time, we do not have other datasets with similar layouts to train, e.g., a Mask-RCNN on individually annotated regions for comparison.
To evaluate the feasibility of other detectors for our line-based task, we trained a Mask-RCNN on our testing data (individual text lines) but, as expected, obtained worse results, which we therefore did not report.
Furthermore, you suggested using more common metrics, such as AP, for evaluating the performance of our method. Unfortunately, this is not possible for our approach, as our method does not provide confidence values for individual detected element instances, which are a requirement for computing the AP (e.g. as is done in the evaluation schemes for the COCO or PascalVOC datasets). Nevertheless, our evaluation scheme provides explicit precision and recall measures based on the IoU with the threshold set to 0.9. Therefore, a comparison of our proposed method to other methods that report precision and recall is feasible. To improve the presentation of our results, we also reworked the tables.
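As an illustration of this kind of threshold-based matching (this is not the paper's evaluation code; the masks are assumed to be boolean arrays over the binarized foreground pixels of each line, and the greedy matching strategy is a simplification introduced here), consider the following sketch:

```python
# Illustrative sketch: precision/recall from IoU matching at a fixed threshold.
# Not the paper's evaluation code; predictions and ground truths are lists of
# boolean numpy masks over the binarized foreground pixels of each text line.
import numpy as np


def iou(a: np.ndarray, b: np.ndarray) -> float:
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0


def precision_recall(predictions, ground_truths, threshold=0.9):
    """Greedily match each prediction to an unused ground-truth mask at IoU >= threshold."""
    unused = list(range(len(ground_truths)))
    true_positives = 0
    for pred in predictions:
        best_idx, best_iou = None, 0.0
        for idx in unused:
            score = iou(pred, ground_truths[idx])
            if score > best_iou:
                best_idx, best_iou = idx, score
        if best_idx is not None and best_iou >= threshold:
            unused.remove(best_idx)
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truths) if ground_truths else 0.0
    return precision, recall
```

Computing mAP would additionally require a confidence score per detection to rank the predictions, which is exactly what our rule-based pipeline does not produce.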
To address your further notes: the angle thresholds for the baseline concatenation are set to avoid concatenating nearby baselines that are close but not aligned horizontally, as such a case is not recognized by the distance-based DBSCAN algorithm. As you mentioned, to allow for warped or skewed images, the parameter is set to a higher value for close CCs.
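To make the interplay of distance clustering and angle filtering concrete, here is a small hedged sketch (not the paper's implementation; eps, the distance cut-off, and the angle limits are placeholder values) that clusters baseline-segment centroids with DBSCAN and then only allows concatenation when the connecting angle to the horizontal is small, with a more permissive limit for close segments:

```python
# Hedged sketch: distance-based clustering plus an angle check before concatenating
# baseline segments; all thresholds below are placeholders, not the paper's values.
import numpy as np
from sklearn.cluster import DBSCAN


def may_concatenate(c1, c2, close_dist=50.0, angle_close=20.0, angle_far=5.0):
    """Allow concatenation only if the connecting angle is small enough;
    a larger angle is tolerated for close segments (warped/skewed pages)."""
    dx, dy = c2[0] - c1[0], c2[1] - c1[1]
    distance = np.hypot(dx, dy)
    angle = abs(np.degrees(np.arctan2(dy, dx)))
    angle = min(angle, 180.0 - angle)  # direction-independent angle to the horizontal
    limit = angle_close if distance < close_dist else angle_far
    return angle <= limit


def cluster_and_filter(centroids, eps=100.0):
    """Cluster segment centroids by distance, then keep only angle-compatible pairs."""
    centroids = np.asarray(centroids, dtype=float)
    labels = DBSCAN(eps=eps, min_samples=1).fit_predict(centroids)
    pairs = []
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if labels[i] == labels[j] and may_concatenate(centroids[i], centroids[j]):
                pairs.append((i, j))
    return pairs
```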
Images embedded in the middle of a page should for the most part be fine for the baseline detection. For the layout recognition, this is a case which we have not tested. As you suggested, we found this to be a rare case in historical printings (at least from the time period that we are investigating), as images usually either span the full page or individual columns, or are left-aligned in the text region for single-column layouts. For more modern printings, this is definitely an interesting topic which requires further investigation.
Best regards,
Norbert Fischer, Alexander Hartelt, Frank Puppe
Round 2
Reviewer 1 Report
Thank you for making the changes.
Reviewer 2 Report
The much expanded related work section is very thorough and addresses my concerns about the place of this paper in the context of prior works. The comparative table is much appreciated. I suggest highlighting in some way the included methods that were developed for and/or evaluated on historical documents.
I must admit I am somewhat less satisfied with the updated methods description. To my eye, the section still heavily favors detail at the expense of clarity, but I acknowledge the necessary contents are in place. If the authors indeed find time to add a companion figure, I suggest adding a sample of the U-Net probability map output. Figure 5 could also benefit from enhanced contrast, or different choice of segmentation mask colors, as the resulting segmentation is quite hard to see.
The addition of comparative performance results, and the updated metrics discussion together lend significant credence to the outcome of this work.
PS: 1) the marked-up revised submission showed some, but not all, tracked changes, and 2) the table formatting could use improved presentability/aesthetics.