Understanding Unsupervised Deep Learning for Text Line Segmentation
Abstract
1. Introduction
2. Related Work
3. Data and Evaluation
4. Method
4.1. Handling Binary Document Images
4.2. Handling Colour Document Images
4.3. Training
Pair Generation
4.4. Pseudo-RGB Image Generation
- Since only the last three fully connected layers of the network receive input from both patches, we can conclude that most of the semantic reasoning for each patch is carried out separately by the convolutional branches. Because both CNN branches of the Siamese network share the same weights, a single branch suffices to extract the features of each patch. Using this branch, each patch is embedded into a 512-dimensional feature vector, as shown in Figure 3. A sliding window is used to obtain the feature map of a complete document image: the patches produced by the sliding window are fed to one branch of the Siamese network, so the document image is mapped to a feature map in which each cell is the feature vector of a single patch of the input image.
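The sliding-window feature extraction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the patch size and stride are placeholder values (the exact window dimensions are not stated here), and `embed` stands in for one CNN branch of the Siamese network.

```python
import numpy as np

def feature_map(image, embed, patch=64, stride=64):
    """Slide a window over `image` and embed each patch into a
    512-dimensional feature vector using `embed`, a stand-in for
    one weight-sharing branch of the Siamese network."""
    h, w = image.shape[:2]
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    fmap = np.zeros((rows, cols, 512), dtype=np.float32)
    for i in range(rows):
        for j in range(cols):
            y, x = i * stride, j * stride
            fmap[i, j] = embed(image[y:y + patch, x:x + patch])
    return fmap
```

Each cell `fmap[i, j]` then corresponds to one patch of the input document image, as in the description above.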
- By applying PCA, the feature vectors are projected onto their first three principal components.
- To interpret the first three principal components as pseudo-RGB values, they are normalized to the range zero to one, multiplied by 255, and viewed as RGB values, where the first, second, and third components correspond to the red, green, and blue channels, respectively. The features of the document image can then be visualized as a pseudo-RGB image in which the central windows of patches with similar patterns receive similar colours (Figure 2).
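The PCA-to-pseudo-RGB step can be sketched in a few lines of numpy. This is an illustrative sketch (PCA computed via SVD rather than any particular library routine the authors may have used):

```python
import numpy as np

def pseudo_rgb(fmap):
    """Project each feature vector in `fmap` (H x W x D) onto its
    first three principal components and rescale each component to
    0..255 so the result can be viewed as an RGB image."""
    h, w, d = fmap.shape
    X = fmap.reshape(-1, d).astype(np.float64)
    X -= X.mean(axis=0)                    # centre the features
    # PCA via SVD: rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    proj = X @ Vt[:3].T                    # first three components
    # normalise each channel to [0, 1], then scale to 0..255
    mins, maxs = proj.min(axis=0), proj.max(axis=0)
    proj = (proj - mins) / np.maximum(maxs - mins, 1e-12)
    return (proj * 255).astype(np.uint8).reshape(h, w, 3)
```

Patches with similar feature vectors project to nearby points in the three-component space and therefore receive similar colours in the visualization.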
4.5. Thresholding
4.5.1. Component Tree
Algorithm 1: Pseudocode of the component tree traversal.
Output ← ∅
Enqueue the root into a queue Q
while Q is not empty do
    Ci ← Q.dequeue()
    if F(Ci) represents a blob line then
        Output ← Output ∪ Ci
    else
        Enqueue all children of Ci with d = 1
    end if
end while
return Output
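Algorithm 1 is a breadth-first traversal that keeps a component when its filtered image represents a blob line and otherwise descends to its children. A runnable sketch, where `is_blob_line` stands in for the predicate F(·) and `children_at_depth(c, d)` is a hypothetical accessor for the children of `c` at grey-level distance `d` (both are assumptions, not the paper's API):

```python
from collections import deque

def traverse(root, is_blob_line, children_at_depth):
    """Breadth-first traversal of a component tree (Algorithm 1).
    Components accepted by `is_blob_line` are collected; rejected
    components are expanded one level (d = 1) and re-examined."""
    output = []
    q = deque([root])
    while q:
        c = q.popleft()
        if is_blob_line(c):
            output.append(c)                  # keep blob-line components
        else:
            q.extend(children_at_depth(c, 1)) # descend one level
    return output
```

For example, with a toy tree whose leaves `a` and `c` satisfy the predicate, the traversal returns exactly those two components.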
4.6. Pixel Label Extraction for Binary Documents
4.7. Handling Colour Documents
Baseline Seam
4.8. Patch Saliency Visualization
5. Experiments and Results
5.1. Baseline Experiment
5.1.1. Colour Document Images
5.1.2. Binary Document Images
5.2. Training Phase Experiments
Effect of Pair Similarity Assumptions
5.3. Pixel-Label Extraction Phase Experiments
5.3.1. Effect of Splitting Touching Components
5.3.2. Effect of Merging Broken Blob Lines
5.4. Results
5.4.1. Results on the VML-AHTE Dataset
5.4.2. Results on the ICFHR2010 Dataset
5.4.3. Results on the ICDAR2017 Dataset
6. Understanding the Unsupervised Segmentation
6.1. Sense of Network Depth
6.2. What Is Being Clustered?
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Sudholt, S.; Fink, G.A. PHOCNet: A deep convolutional neural network for word spotting in handwritten documents. In Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), Shenzhen, China, 23–26 October 2016; pp. 277–282.
- Grüning, T.; Leifert, G.; Strauß, T.; Michael, J.; Labahn, R. A two-stage method for text line detection in historical documents. Int. J. Doc. Anal. Recognit. (IJDAR) 2019, 22, 285–302.
- Alberti, M.; Vögtlin, L.; Pondenkandath, V.; Seuret, M.; Ingold, R.; Liwicki, M. Labeling, cutting, grouping: An efficient text line segmentation method for medieval manuscripts. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 1200–1206.
- Doersch, C.; Gupta, A.; Efros, A.A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1422–1430.
- Wang, X.; Gupta, A. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2794–2802.
- Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2536–2544.
- Manmatha, R.; Srimal, N. Scale space technique for word segmentation in handwritten documents. In Proceedings of the International Conference on Scale-Space Theories in Computer Vision, Corfu, Greece, 26–27 September 1999; pp. 22–33.
- Varga, T.; Bunke, H. Tree structure for word extraction from handwritten text lines. In Proceedings of the Eighth International Conference on Document Analysis and Recognition (ICDAR’05), Seoul, Korea, 31 August–1 September 2005; pp. 352–356.
- Graves, A.; Liwicki, M.; Fernández, S.; Bertolami, R.; Bunke, H.; Schmidhuber, J. A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 31, 855–868.
- Liwicki, M.; Graves, A.; Bunke, H. Neural networks for handwriting recognition. In Computational Intelligence Paradigms in Advanced Pattern Classification; Springer: Berlin/Heidelberg, Germany, 2012; pp. 5–24.
- Kurar Barakat, B.; Droby, A.; Saabni, R.; El-Sana, J. Unsupervised learning of text line segmentation by differentiating coarse patterns. In Proceedings of the International Conference on Document Analysis and Recognition, Lausanne, Switzerland, 5–10 September 2021; pp. 523–537.
- Moysset, B.; Kermorvant, C.; Wolf, C.; Louradour, J. Paragraph text segmentation into lines with recurrent neural networks. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; pp. 456–460.
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Kurar Barakat, B.; Droby, A.; Alaasam, R.; Madi, B.; Rabaev, I.; El-Sana, J. Text line extraction using fully convolutional network and energy minimization. In Proceedings of the 2020 2nd International Workshop on Pattern Recognition for Cultural Heritage (PatReCH), Milan, Italy, 11 January 2020; pp. 3651–3656.
- Vo, Q.N.; Kim, S.H.; Yang, H.J.; Lee, G.S. Text line segmentation using a fully convolutional network in handwritten document images. IET Image Process. 2017, 12, 438–446.
- Renton, G.; Soullard, Y.; Chatelain, C.; Adam, S.; Kermorvant, C.; Paquet, T. Fully convolutional network with dilated convolutions for handwritten text line segmentation. Int. J. Doc. Anal. Recognit. (IJDAR) 2018, 21, 177–186.
- Kurar Barakat, B.; Droby, A.; Kassis, M.; El-Sana, J. Text line segmentation for challenging handwritten document images using fully convolutional network. In Proceedings of the 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), Niagara Falls, NY, USA, 5–8 August 2018; pp. 374–379.
- Mechi, O.; Mehri, M.; Ingold, R.; Amara, N.E.B. Text line segmentation in historical document images using an adaptive U-Net architecture. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 369–374.
- Diem, M.; Kleber, F.; Fiel, S.; Grüning, T.; Gatos, B. cBAD: ICDAR2017 competition on baseline detection. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 1355–1360.
- Kurar Barakat, B.; Cohen, R.; El-Sana, J. VML-MOC: Segmenting a multiply oriented and curved handwritten text line dataset. In Proceedings of the 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), Sydney, Australia, 22–25 September 2019; Volume 6, pp. 13–18.
- Kurar Barakat, B.; Droby, A.; Alaasam, R.; Madi, B.; Rabaev, I.; Shammes, R.; El-Sana, J. Unsupervised deep learning for text line segmentation. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 3651–3656.
- Droby, A.; Kurar Barakat, B.; Alaasam, R.; Madi, B.; Rabaev, I.; El-Sana, J. Text Line Extraction in Historical Documents Using Mask R-CNN. Signals 2022, 3, 535–549.
- Simistira, F.; Bouillon, M.; Seuret, M.; Würsch, M.; Alberti, M.; Ingold, R.; Liwicki, M. ICDAR2017 competition on layout analysis for challenging medieval manuscripts. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 1361–1370.
- Gatos, B.; Stamatopoulos, N.; Louloudis, G. ICFHR 2010 handwriting segmentation contest. In Proceedings of the 2010 12th International Conference on Frontiers in Handwriting Recognition, Kolkata, India, 16–18 November 2010; pp. 737–742.
- Kurar Barakat, B.; El-Sana, J.; Rabaev, I. The Pinkas Dataset. In Proceedings of the 2019 15th International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 732–737.
- Naegel, B.; Wendling, L. A document binarization method based on connected operators. Pattern Recognit. Lett. 2010, 31, 1251–1259.
- Boykov, Y.; Veksler, O.; Zabih, R. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 1222–1239.
- Boykov, Y.Y.; Jolly, M.P. Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In Proceedings of the Eighth IEEE International Conference on Computer Vision, ICCV 2001, Vancouver, BC, Canada, 7–14 July 2001; Volume 1, pp. 105–112.
- Saabni, R.; Asi, A.; El-Sana, J. Text line extraction for historical document images. Pattern Recognit. Lett. 2014, 35, 23–33.
- Saabni, R.; El-Sana, J. Language-Independent Text Lines Extraction Using Seam Carving. In Proceedings of the 2011 International Conference on Document Analysis and Recognition, ICDAR 2011, Beijing, China, 18–21 September 2011; pp. 563–568.
- Saabni, R. Robust and Efficient Text Line Extraction by Local Minimal Sub-Seams. In Proceedings of the 2nd International Symposium on Computer Science and Intelligent Control, Stockholm, Sweden, 21–23 September 2018.
Dataset | Script | Modern/Historical | Image Type | Challenges
---|---|---|---|---
VML-AHTE | Arabic | Historical | Binary | Numerous diacritics; cramped text lines
ICDAR2017 | Latin | Historical | Binary | Ascenders and descenders; touching text lines
ICFHR2010 | Latin | Modern | Binary | Heterogeneous document resolutions, text line heights, and skews
Pinkas | Hebrew | Historical | Colour | Noisy images
Hyperparameter | Patch Size (p) | Window Size (w) | CNN |
---|---|---|---|
Value | 350 | 20 | AlexNet |
Method | R | P | FM |
---|---|---|---|
Saabni et al. [30] | 98.19 | 97.37 | 97.77 |
Saabni et al. [32] | 98.91 | 97.73 | 97.79 |
Proposed method | 98.11 | 97.92 | 98.02 |
Similarity Assumption | Raw (LIU) | Raw (PIU) | Split Components (LIU) | Split Components (PIU) | Split Components and Merge Blobs (LIU) | Split Components and Merge Blobs (PIU)
---|---|---|---|---|---|---
Identity | 72.28 | 71.40 | 88.76 | 83.00 | 95.62 | 85.30 |
Identity, neighbouring | 82.45 | 77.10 | 98.18 | 90.60 | 99.28 | 91.40 |
Identity, neighbouring, 180 rotation | 76.97 | 74.00 | 94.92 | 88.40 | 99.28 | 90.30 |
Identity, neighbouring, 180 rotation, horizontal flip | 70.31 | 69.50 | 90.94 | 83.40 | 96.32 | 85.20 |
Method | LIU | PIU |
---|---|---|
Supervised Mask R-CNN [15] | 93.08 | 86.97
FCN+EM [15] | 94.52 | 90.01 |
Unsupervised Kurar et al. [11] | 90.94 | 83.40 |
UTLS [22] | 98.55 | 88.95 |
Proposed method | 99.28 | 91.40 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Droby, A.; Kurar Barakat, B.; Saabni, R.; Alaasam, R.; Madi, B.; El-Sana, J. Understanding Unsupervised Deep Learning for Text Line Segmentation. Appl. Sci. 2022, 12, 9528. https://doi.org/10.3390/app12199528