Article

Understanding Unsupervised Deep Learning for Text Line Segmentation

1 Department of Computer Science, Ben-Gurion University of the Negev, Be’er Sheva 8410501, Israel
2 Triangle R&D Center, Tel-Aviv Yaffo Academic College, Tel Aviv-Yafo 6818211, Israel
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(19), 9528; https://doi.org/10.3390/app12199528
Submission received: 25 August 2022 / Revised: 10 September 2022 / Accepted: 16 September 2022 / Published: 22 September 2022
(This article belongs to the Special Issue Data Clustering: Algorithms and Applications)

Abstract

We propose an unsupervised feature learning approach for segmenting text lines of handwritten document images with no labelling effort. Humans can easily group local text line features into global coarse patterns. We leverage this coherent visual perception of text lines as a supervising signal by formulating the feature learning as a global pattern differentiation task. The machine is trained to detect that a document patch contains a similar global text line pattern to its identity or neighbours, and a different global text line pattern from its 90-degree-rotated identity or neighbours. Clustering the central windows of document image patches using their extracted features forms blob lines that strike through the text lines. The blob lines guide an energy minimization function for extracting text lines in a binary image and guide a seam carving function for detecting baselines in a colour image. By identifying which aspects of an input patch support the prediction and the clustering, we contribute toward understanding how the method uses its input patches. We evaluate the method on several variants of text line segmentation datasets to demonstrate its effectiveness, visualize what it has learned, and explain its clustering strategy from a human perspective.

1. Introduction

In recent years, deep learning networks have demonstrated state-of-the-art performance in document image analysis tasks [1,2,3]. The success of deep learning networks is mainly driven by huge amounts of manually labelled data. However, this limits scalability, since manual labelling is expensive and becomes scarce when expert knowledge is required. On the other hand, a large number of unlabelled handwritten document images are available. It is therefore important to explore deep learning methods that learn features by leveraging unlabelled images.
A typical unsupervised deep learning approach in computer vision is to train the network to discriminate inherent structures of raw images. Examples include predicting the position of a patch relative to its neighbours [4], supervising via visual tracking in videos [5], and generating the contents of an arbitrary image region conditioned on its context [6]. This paper proposes a surrogate task that provides supervising signals for handwritten text line segmentation using unlabelled data. Text line segmentation is an early step for various higher-level document analysis algorithms, such as word segmentation [7,8] and word recognition [9,10], and the performance of these algorithms depends on the quality of the text line segmentation. The proposed method works as follows: given a document image patch, we train a network to learn that it shares a similar text line pattern with its identity or neighbours, and a different text line pattern with its 90-degree-rotated identity or neighbours, as shown in Figure 1. The text line pattern discrimination task provides supervising signals for the network to reason about the global coarse trend of the text lines. We apply PCA to the feature vectors of the central windows of document image patches to extract the first three principal components. The three principal components are interpreted as pseudo-RGB colours, which mark the text line texture, the interline space texture, and the background texture. For binary input images, the pseudo-RGB image is thresholded to yield binary blob lines that strike through the text lines. For colour images, the pseudo-RGB image is transformed into an energy map that assists a seam carving function in extracting the baselines.
The proposed method trains the model to discriminate global text line patterns, but leverages the patches’ features for clustering their central windows. The question is: which features guide the clustering of central windows from patches with similar patterns? To provide a better understanding of the learned features, we visualize them and find that the prevalent salient local parts contribute the most to the similarity between a pair. Each document image patch contains a specific sequence of these features according to the global text line pattern it contains. This means that the features from the penultimate fully connected layer embed the document image patches into a compact space where distances correspond to global text line pattern similarity.
Part of the work presented in this paper has been published previously [11]. This paper extends it by including a dedicated method for colour document images, adding an analysis of the method’s behaviour, an extensive experimental evaluation, and a comparison with state-of-the-art methods.
The paper is organized as follows. Section 2 provides a summary of the related work. Section 3 describes the datasets used in our experiments. The method is presented in Section 4. In Section 5, a baseline experiment, ablation studies, and results are presented. Experiments that examine the method’s functionality are reported in Section 6. Finally, Section 7 provides concluding remarks and describes plans for future work.

2. Related Work

Deep learning of handwritten text line segmentation has a recent history, starting from the Recurrent Neural Network (RNN) work of Moysset et al. [12]. The RNN is trained on document image columns as a vertical sequence of two labels, text line and interline. The sequential characteristic of the RNN limits the input to horizontal text lines. Therefore, researchers have started focusing on layout-agnostic formulations of text line segmentation.
Essentially, two main solutions from computer vision were considered: semantic segmentation using a Fully Convolutional Network (FCN) [13] and instance segmentation using a Mask-RCNN [14]. Instance segmentation labels each foreground pixel with an object and an instance. This can easily be adapted to the text line segmentation problem by labelling each foreground pixel with an instance of the text line class [15]. However, this is not the case for semantic segmentation, which labels each pixel with an object so that all its instances receive the same label. Therefore, a viable formulation has to define which pixels belong to a text line and predict whether or not a pixel is a text line pixel. Defining a text line by its raw foreground pixels would render semantic segmentation ineffective, because the output would not discriminate one text line from another. This inherent limitation has forced semantic segmentation methods to define a text line as a single component. There are two variants of this component: (1) a blob line [15,16,17,18,19], a connected component that passes through the middle of the text line’s characters, and (2) a baseline [2], a line that passes through the bottom of the text line’s characters. However, both baselines and blob lines only capture the spatial location of the text lines, and do not fully label the text lines’ pixels. Both variants can be evaluated by text line detection metrics for baselines [17,19,20] or blob lines [18]. The extraction methods of baselines and blob lines can be extended to allow segmentation of the text lines [15,16]. Reference [16] is limited to the extraction of horizontal text lines only, whereas [15] can manage text lines with any orientation, script type, and font size. Such methods are evaluated using classical image segmentation metrics [20].
Deep learning methods are very successful at segmenting handwritten text lines with different script types, sizes, and orientations, but require a large dataset of labelled text lines in order to learn comprehensive features directly from the pixels of the image. For example, rarely occurring curved text lines are poorly detected, as they are not well represented in the training data. This problem was addressed via augmentation [2], learning-free detection [21], or hybrid methods [3]. Alternatively, Kurar Barakat et al. [22] propose an unsupervised deep learning algorithm trained to predict whether a document patch contains a text line or an interline. The label is based on an adjustable score tuned by a human. Our method eliminates this score and assumes that a document image patch contains a similar global text line pattern to its identity or neighbours and a different global text line pattern from its 90-degree-rotated identity or neighbours.

3. Data and Evaluation

Three different datasets are used in the binary document image experiments: the VML-AHTE, ICDAR2017, and ICFHR2010 datasets, which pose different challenges. The VML-AHTE dataset [23] contains documents from handwritten Arabic manuscripts with numerous diacritics and text lines close together, in most cases touching each other. The ICDAR2017 dataset [24] is a collection of three medieval manuscripts, CB55, CSG18, and CSG863, with a complex layout structure. On the ICDAR2017 dataset, we run our algorithm on pre-segmented main text regions given by the ground truth. The performance on the VML-AHTE and ICDAR2017 datasets is evaluated using text line segmentation metrics, Line Intersection over Union (LIU) and Pixel Intersection over Union (PIU), as in the ICDAR2017 competition on layout analysis [24]. ICFHR2010 [25] is a modern handwriting dataset with heterogeneous document resolutions, text line heights, and skews. The performance on the ICFHR2010 dataset is measured using the Detection Rate (DR), Recognition Accuracy (RA), and F-measure (FM) metrics [25]. Colour document image experiments are evaluated on the Pinkas dataset [26]. This dataset contains noisy Hebrew handwritten document images with rounded edges. The performance on colour document images is measured using the baseline detection evaluation metrics, Precision (P), Recall (R), and FM [20]. The characteristics of the datasets are reported in Table 1.

4. Method

The goal of the proposed method is to exploit the large quantity of unlabelled handwritten document images for segmenting text lines, in a manner as simple as general human visual perception. We propose discriminating global text line patterns in a pair of patches as a surrogate task for training a Convolutional Neural Network (CNN). While this task does not directly use human labels, it requires a semantic understanding of visual data as perceived by the human visual system. Saliency visualization shows that solving this task allows the CNN to learn useful features for comprehending the global text line pattern. We employ this method for text line extraction in binary document images and for text line detection in colour document images.

4.1. Handling Binary Document Images

The processing of binary document images has four main phases: training, pseudo-RGB image generation, thresholding, and pixel label extraction (Figure 2). In the first phase, a CNN is trained to determine the similarity between a pair of patches. After training, the CNN is used as an encoder that embeds the patches into feature vectors. The feature vectors can be viewed as semantic features extracted by the CNN describing the text lines; however, they are not directly interpretable by humans, since they are encoded in a learned feature space. In the second phase, the feature space of the patches extracted in the first phase is analysed using PCA, and a pseudo-RGB image is generated from the first three principal components of the features. By thresholding the pseudo-RGB image, we obtain binary blob lines that pass through the main body of the text lines. The blob line image is input to the extraction phase, where the text lines’ pixels are labelled using an energy minimization function guided by the blob lines obtained in the previous step.

4.2. Handling Colour Document Images

Obtaining the binary image of a given document is not always possible, as the document may contain stain marks, extreme page degradation, or a noisy background. Consequently, the need arises to extend the proposed method to support colour document images. The processing of colour document images includes three main phases: training, pseudo-RGB generation, and baseline detection (Figure 2). The training and pseudo-RGB generation phases are the same as those employed in the binary document method, except that the input image has three channels. The obtained pseudo-RGB image is further transformed into an energy map via a grey-level transform. Subsequently, the baseline seams that pass through components along the same text lines are detected by following the local minima on this energy map.

4.3. Training

The training phase produces a feature vector fc2 (Figure 3) of size 512 from the raw pixels of input images using a deep convolutional network. Input images are patches of size p × p cropped from the document images. We train the network to reason about the similarity of the coarse trend of text lines between a pair of patches. The supervising signal is obtained from the assumption that identical or neighbouring patches (adjacent patches on the top, bottom, left, and right) contain similar coarse text line patterns, and contain different coarse text line patterns when one of them is rotated 90 degrees. The effective execution of this task requires the network to recognize the salient parts of text lines. During blob detection, the feature vectors of document patches are used to discriminate between differences in text pattern contained within the input patches.
For the training phase, a pair of convolutional neural network (CNN) branches with identical weights is used, such that the feature vector is computed for both patches using the same function (Figure 3). However, the reasoning for each patch is performed separately through its CNN branch. The feature vectors are then concatenated and fed to the decision branch for predicting whether two image patches are similar or different in terms of the coarse text line patterns. All the fully connected layers are followed by a ReLU activation function, apart from fc5, which feeds into a sigmoid binary classifier.
The model is trained on each dataset from scratch using N p pairs:
$$N_p = \frac{H_a \times W_a}{p \times p} \times N_d$$
where $W_a$ and $H_a$ are the average document width and height, $p$ is the size of the patch, and $N_d$ denotes the number of documents in the dataset. We set the learning rate to $1 \times 10^{-5}$ and the batch size to eight, and employed Adam as the optimization algorithm. The model was trained until the validation accuracy ceased improving. The best performing model on the validation subset is saved for the next phase.
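The following sketch illustrates this training setup. PyTorch, the branch layout, and all names are illustrative assumptions rather than the exact architecture of Figure 3; only the learning rate, batch size, and optimizer are taken from the text.

```python
# A minimal sketch of the Siamese similarity training, assuming PyTorch.
import torch
import torch.nn as nn

class SiameseSimilarity(nn.Module):
    def __init__(self, feature_dim=512):
        super().__init__()
        # Shared CNN branch that embeds a p x p patch (1 channel for binary
        # documents, 3 for colour) into a 512-D feature vector.
        self.branch = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=11, stride=4), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(192, 256, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.AdaptiveAvgPool2d(6), nn.Flatten(),
            nn.Linear(256 * 6 * 6, 1024), nn.ReLU(),
            nn.Linear(1024, feature_dim), nn.ReLU(),
        )
        # Decision head: concatenated features -> similar / different.
        self.head = nn.Sequential(
            nn.Linear(2 * feature_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),  # logits; the sigmoid is applied inside the loss
        )

    def forward(self, a, b):
        fa, fb = self.branch(a), self.branch(b)
        return self.head(torch.cat([fa, fb], dim=1))

model = SiameseSimilarity()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # learning rate from the text
criterion = nn.BCEWithLogitsLoss()

def train_step(patch_a, patch_b, label):
    """patch_a, patch_b: (8, 1, p, p) float tensors; label: (8,) floats, 1 = similar."""
    optimizer.zero_grad()
    logits = model(patch_a, patch_b).squeeze(1)
    loss = criterion(logits, label)
    loss.backward()
    optimizer.step()
    return loss.item()
```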

Pair Generation

The proposed method assumes that a document image patch is similar to its identity or neighbours in terms of the text line pattern they contain, and different from its 90-degree-rotated identity or neighbours (Figure 1). In cases where the input document images contain almost horizontal text lines, the similarity and difference assumptions can be augmented by rotating one of the patches 180 degrees or by flipping it horizontally. The similarity does not correlate with proximity when there are fluctuating or skewed text lines; however, this is a rare occurrence in document images composed almost entirely of horizontal text lines.
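A minimal sketch of this pair-generation rule follows, assuming NumPy document images; restricting neighbours to the right and bottom sides and all names are simplifications for illustration.

```python
# A minimal sketch of generating (patch_a, patch_b, label) training pairs.
import numpy as np

def generate_pair(doc, p, rng, similar):
    """similar=True: identity or neighbour; similar=False: the same, rotated 90 degrees."""
    h, w = doc.shape[:2]
    y = int(rng.integers(0, h - 2 * p))   # leave room for a bottom/right neighbour
    x = int(rng.integers(0, w - 2 * p))
    patch_a = doc[y:y + p, x:x + p]

    if rng.random() < 0.5:                # identical copy
        patch_b = patch_a.copy()
    else:                                 # one of the neighbouring patches
        dy, dx = [(0, p), (p, 0)][int(rng.integers(2))]
        patch_b = doc[y + dy:y + dy + p, x + dx:x + dx + p]

    if not similar:                       # different pattern: rotate one patch 90 degrees
        patch_b = np.rot90(patch_b)

    return patch_a, patch_b, int(similar)

rng = np.random.default_rng(0)
# a, b, label = generate_pair(doc, 350, rng, similar=True)
```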

4.4. Pseudo-RGB Image Generation

Generating a pseudo-RGB image from an input document image is performed in three steps: (1) generating a feature map from the input document image using one branch of the Siamese network, (2) extracting the three principal components of the feature map’s vectors using PCA, (3) interpreting the three principal components as a pseudo-RGB image.
  • Since only the last three fully connected layers in the network receive input from both patches, we can conclude that most of the semantic reasoning for each patch is conducted separately by the convolutional network. Both CNN branches of the Siamese network share the same weights, therefore, a single branch is used to extract the features of each patch. Using the CNN branch, each patch is embedded into a feature vector of 512 dimensions, as shown in Figure 3.
    A sliding window of size $p \times p$ is used to obtain the feature map of a complete document image. The patches produced by the sliding window are fed to a branch of the Siamese network. Thus, an image of size $H_d \times W_d$ is mapped to a feature map of size $\frac{H_d}{w} \times \frac{W_d}{w} \times 512$, where each cell in the feature map is a vector that corresponds to a single patch on the input document image.
  • By applying PCA to the feature vectors, the 512-D feature vectors are projected onto their first three principal components.
  • To interpret the first three principal components as pseudo-RGB values, they are normalized to values between zero and one, multiplied by 255, and viewed as RGB values, where the first, second, and third components correspond to the red, green, and blue colours, respectively. Subsequently, the features of the document image can be visualized as a pseudo-RGB image in which the central windows of patches with similar patterns are assigned similar colours (Figure 2).
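The three steps above can be sketched as follows, assuming scikit-learn's PCA and a function `embed_patch` standing in for one trained CNN branch; using the central window size w as the sliding-window stride is an assumption for illustration.

```python
# A minimal sketch of the pseudo-RGB image generation phase.
import numpy as np
from sklearn.decomposition import PCA

def pseudo_rgb(doc, embed_patch, p=350, w=20):
    """embed_patch: maps a p x p patch to a 512-D feature vector (one CNN branch)."""
    h, wd = doc.shape[:2]
    ys = range(0, h - p + 1, w)
    xs = range(0, wd - p + 1, w)
    # Step 1: feature map with one 512-D vector per sliding-window position.
    feats = np.array([[embed_patch(doc[y:y + p, x:x + p]) for x in xs] for y in ys])
    flat = feats.reshape(-1, feats.shape[-1])

    # Step 2: project onto the first three principal components.
    pcs = PCA(n_components=3).fit_transform(flat)

    # Step 3: normalize each component to [0, 1], scale to [0, 255], view as RGB.
    pcs = (pcs - pcs.min(axis=0)) / (pcs.max(axis=0) - pcs.min(axis=0) + 1e-8)
    return (pcs * 255).astype(np.uint8).reshape(len(ys), len(xs), 3)
```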

4.5. Thresholding

Thresholding the pseudo-RGB image into binary blob lines is crucial for the pixel label extraction of binary documents. When the input document images contain text lines with nearly consistent heights and interline spaces, the pseudo-RGB image contains three colour clusters: one for text line regions, one for interline regions, and one for background regions. In such documents, thresholding is trivial: three thresholds, one per colour channel, are selected manually based on the colour representing the text lines. In contrast, an input document image with heterogeneous interline spaces leads to a pseudo-RGB image that contains several colour clusters for text line regions. We propose the use of the component tree algorithm [27] to automatically select all the colours that form the blob lines.

4.5.1. Component Tree

A component tree [27] organizes the connected components of level sets into a tree structure. Let $C_t$ be the set of connected components obtained by thresholding with threshold $t$. The nodes in a component tree correspond to the components in $C_t$ for varying values of the threshold $t$. The root of the tree is the member of $C_{t_{\min}}$, where $t_{\min}$ is chosen such that $|C_{t_{\min}}| = 1$. Level $l$ in the tree corresponds to $C_{t_{\min} + l \cdot d}$, where $d$ is a parameter that determines the step size of the tree. There is an edge between $C_i \in C_t$ and $C_j \in C_{t+d}$ if and only if $C_j \subseteq C_i$. The maximal threshold $t_{\max}$ used in the tree construction is simply the maximal value in the map, which is 255 in our case.
To threshold the pseudo-RGB image using the component tree algorithm, we first convert the pseudo-RGB image to a greyscale image using the NTSC formula. The component tree is then built by assigning every node to a connected component of the greyscale image. Each connected component is labelled as valid or invalid based on the maximum of its ligature scores. To measure the ligature scores, we fit least-squares linear splines to the points of each connected component. For each spline on a connected component, the ligature score is the average 1-norm distance between the linear fit and the connected component points in that spline. A connected component is considered valid if its maximum ligature score is less than 80% of the average character height. We traverse the component tree in a breadth-first search manner with $d = 1$ (Algorithm 1). At each node, if the connected component is valid, it is taken as a binary blob line, and the search along this branch is complete. Otherwise, the component is refined by recursively processing the children of the node. Figure 4 illustrates the thresholding of components until they are valid.
Algorithm 1: Pseudocode of the component tree traversal.
    1:  Output = ∅
    2:  Enqueue the root into a queue Q
    3:  while Q is not empty do
    4:    Ci ← Q.dequeue()
    5:    if F(Ci) represents a blob line then
    6:      Output = Output ∪ {Ci}
    7:    else
    8:      Enqueue all children of Ci with d = 1
    9:    end if
   10:  end while
   11:  return Output
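A Python sketch of Algorithm 1 combined with the ligature-score validity test of Section 4.5.1 might look as follows. The Node structure, the fixed number of spline segments, and the helper names are assumptions, and the construction of the component tree itself is omitted.

```python
# A minimal sketch of the component tree traversal with a ligature-score check.
from collections import deque
import numpy as np

def ligature_valid(component_pixels, avg_char_height, n_splines=4):
    """component_pixels: (N, 2) array of (row, col) points of one connected component."""
    rows, cols = component_pixels[:, 0], component_pixels[:, 1]
    scores = []
    # Fit a least-squares line to each horizontal segment ("spline") of the component.
    for seg in np.array_split(np.argsort(cols), n_splines):
        if len(seg) < 2:
            continue
        x, y = cols[seg], rows[seg]
        a, b = np.polyfit(x, y, deg=1)
        scores.append(np.mean(np.abs(y - (a * x + b))))   # average 1-norm residual
    # Valid if the worst segment deviates by less than 80% of the average character height.
    return max(scores, default=np.inf) < 0.8 * avg_char_height

def traverse(root, avg_char_height):
    """Breadth-first traversal of the component tree (d = 1); returns the blob lines."""
    output, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        if ligature_valid(node.pixels, avg_char_height):
            output.append(node)          # take as a blob line; stop refining this branch
        else:
            queue.extend(node.children)  # refine by thresholding one level deeper
    return output
```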

4.6. Pixel Label Extraction for Binary Documents

Text line extraction defines text lines precisely via pixel labels. An energy function [28] defined over connected components is used to extract the text lines from binary document images. Minimizing the energy function assigns each component the label of the closest blob line, while trying to assign nearby components the same label. This framework is free of any orientation assumption and can be used for text lines with disjoint strokes and close interline proximity.
Let us denote the set of blob lines by $\mathcal{L}$ and the set of connected components in the document image by $\mathcal{C}$. Minimizing the energy function finds the labelling function $f : \mathcal{C} \rightarrow \mathcal{L}$ that assigns each component $c \in \mathcal{C}$ a label $\ell_c \in \mathcal{L}$. The energy function $E(f)$ is expressed as follows:
$$E(f) = \sum_{c \in \mathcal{C}} D(c, \ell_c) + \sum_{\{c, c'\} \in N} d(c, c') \cdot \mathbb{1}(\ell_c \neq \ell_{c'})$$
where $D$ denotes the data cost, which is the cost of assigning the label $\ell_c$ to the component $c$, and $d$ denotes the smoothness cost. We define $D(c, \ell_c)$ as the Euclidean distance from the centroid of the component $c$ to the nearest point of the blob line $\ell_c$. The smoothness cost $d(c, c')$ is the cost of assigning neighbouring components $c$ and $c'$ different labels. Let $N$ be the set of the nearest component pairs. The smoothness cost is set as follows: for all neighbouring components $\{c, c'\} \in N$,
$$d(c, c') = e^{-\alpha \cdot dist(c, c')}$$
where $dist(c, c')$ denotes the Euclidean distance between the centroids of the components $c$ and $c'$, and $\alpha$ is defined as follows:
$$\alpha = \frac{1}{2 \times \langle dist(c, c') \rangle}$$
where $\langle \cdot \rangle$ denotes expectation over all pairs of neighbouring components [29] in a document image.
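As a minimal illustration of these cost terms, the following sketch computes the data and smoothness costs from component centroids and blob-line coordinates. The actual label assignment would be obtained with a graph-cut optimizer as in [28]; all names here are illustrative.

```python
# A minimal sketch of the data and smoothness costs of the energy function.
import numpy as np

def data_cost(component_centroid, blob_lines):
    """Euclidean distance from one component centroid to each blob line (nearest point)."""
    return np.array([np.min(np.linalg.norm(line - component_centroid, axis=1))
                     for line in blob_lines])

def smoothness_cost(centroids, neighbour_pairs):
    """d(c, c') = exp(-alpha * dist(c, c')) over the neighbouring component pairs N."""
    dists = np.array([np.linalg.norm(centroids[i] - centroids[j])
                      for i, j in neighbour_pairs])
    alpha = 1.0 / (2.0 * dists.mean())   # alpha = 1 / (2 * <dist>)
    return np.exp(-alpha * dists)

# Usage: assign each component the label minimizing its data cost, then refine the
# assignment with a graph-cut solver that also accounts for the smoothness cost.
```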

4.7. Handling Colour Documents

Colour document images do not contain connected components that define all the pixels belonging to a text line. Baseline detection instead defines the text lines loosely by polylines that pass through the bottom of the text lines’ characters. The baselines are detected by following the local minima on an energy map computed from the grey-level transform of the pseudo-RGB image.

Baseline Seam

The minimal seam tracking approach using dynamic programming was applied by Saabni et al. [30,31] to extract lines from binary and grey-level images. The seam carving approach is suitable for this kind of application, since it computes minimum-energy seams caused by the spaces between text lines. However, due to using the distance transform to produce energy maps, and the lack of constraints, the computed seams are likely to pass through gaps between multiple consecutive text lines. To address this problem, the authors in [32] extend their previous approach and use an efficient method to compute an adaptive local density histogram, horizontally emphasized using multi-scale anisotropic Laplacian of Gaussian filters, to produce a more reliable energy map for minimal or maximal seam computation. The algorithm tracks minimal-energy sub-seams and accumulates them to perform a full local minimal/maximal separation, with medial seams defining the text lines. To improve the extraction of such seams, the image is preprocessed using a double-sided adaptive local-density projection profile followed by a multi-scale anisotropic Laplacian of Gaussian filter bank. Seams that follow the centre of the lines are extracted first to constrain the algorithm when evolving the separating seams. In the presented approach, we replace the anisotropic Gaussian filtering step by directly using the greyscale transform of the pseudo-RGB image extracted from the original images. The pseudo-RGB image is converted to greyscale using the NTSC formula.
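The sketch below shows a generic dynamic-programming recurrence for tracking one minimal-energy horizontal seam over such an energy map; it is a simplified illustration, not the constrained multi-seam formulation of [32].

```python
# A minimal sketch of minimal-energy horizontal seam tracking by dynamic programming.
import numpy as np

def minimal_horizontal_seam(energy):
    """energy: (H, W) grey-level map; returns the seam's row index for every column."""
    h, w = energy.shape
    cost = energy.astype(float).copy()
    # Forward pass: accumulate the minimal cost of reaching each pixel from the left,
    # moving to the same row or one row up/down per column step.
    for x in range(1, w):
        prev = cost[:, x - 1]
        up = np.r_[np.inf, prev[:-1]]
        down = np.r_[prev[1:], np.inf]
        cost[:, x] += np.minimum(np.minimum(up, prev), down)
    # Backward pass: backtrack the seam from the cheapest pixel in the last column.
    seam = np.empty(w, dtype=int)
    seam[-1] = int(np.argmin(cost[:, -1]))
    for x in range(w - 2, -1, -1):
        y = seam[x + 1]
        lo, hi = max(0, y - 1), min(h, y + 2)
        seam[x] = lo + int(np.argmin(cost[lo:hi, x]))
    return seam
```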

4.8. Patch Saliency Visualization

The text line pattern similarity task guides the network to understand the spatial location and orientation of the text lines. We observe that the network often compares the most frequent and salient features to reason about the text line pattern similarity. To illustrate this intuition, we visualize the feature map from the last convolutional layer (fc2) of the CNN branch. From this visualization, we can conclude what the network examines when making a decision (Figure 5).
The visualized feature map is a matrix with a height and width of $m$ and a depth of 512. $m$ is determined by the number of pooling layers $l$ in the CNN branch and the size of the input patch $p$. As $l$ increases, the receptive field of these features increases, and hence they represent more complex features. For visualization purposes, the matrix is considered as $n = m \times m$ vectors, each of dimension $1 \times 512$. We obtain the first three principal components of these 512-dimensional vectors by applying PCA, and then visualize them as a pseudo-RGB image, as performed in Section 4.4.
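A minimal sketch of this visualization step is given below, assuming the m × m × 512 activation map of one CNN branch has already been extracted (e.g., via a forward hook); the names are illustrative.

```python
# A minimal sketch of the patch saliency visualization via PCA over conv activations.
import numpy as np
from sklearn.decomposition import PCA

def visualize_saliency(conv_features):
    """conv_features: (m, m, 512) activation map of the last convolutional layer."""
    m = conv_features.shape[0]
    vectors = conv_features.reshape(m * m, -1)            # n = m*m vectors of length 512
    pcs = PCA(n_components=3).fit_transform(vectors)      # first three principal components
    pcs = (pcs - pcs.min(axis=0)) / (pcs.max(axis=0) - pcs.min(axis=0) + 1e-8)
    return (pcs * 255).astype(np.uint8).reshape(m, m, 3)  # pseudo-RGB saliency map
```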

5. Experiments and Results

This section contains four subsections. First, we set the configuration of the baseline experiment for binary documents and colour documents. This is followed by training phase experiments that investigate the effect of pair similarity assumptions, and pixel label extraction phase experiments that investigate the effect of splitting touching characters and merging the broken blob lines. Finally, we present the results on the AHTE, ICDAR2017, and ICFHR datasets.

5.1. Baseline Experiment

The proposed method contains an independent set of hyperparameters for each of the training, pseudo-RGB generation, pixel-label extraction, and baseline detection phases. There is no single set of hyperparameters that fits all challenges, and the hyperparameters can always be fine-tuned on a given dataset to boost performance. We wish to propose the simplest possible baseline configuration and investigate the influence of selected hyperparameters to understand the method. All values of the baseline configuration were set in Section 4, except the patch size p, the CNN architecture and similarity assumption for the training phase, and the central window size w for the pseudo-RGB image generation phase. The p value should be large enough for the patch to contain sufficient contextual information, while small enough to fit within memory limitations. As the p value decreases, the detected blobs become over-segmented [11]. Accordingly, the baseline experiment sets p = 350. The CNN architecture is set to AlexNet, and the exact architecture of the Siamese network is reported in Figure 3. The baseline experiment assumes that a patch is similar to its identical copy and different from its 90-degree-rotated identical copy. The w value should be small enough to represent well-separated blobs. As the w value decreases, the computation time increases [11]. Accordingly, the baseline experiment sets w = 20. The method’s hyperparameters are tabulated in Table 2.

5.1.1. Colour Document Images

For colour document images, we perform experiments on the Pinkas dataset. Table 3 shows the results using the baseline configuration. The result of the proposed method is 98.02 FM. For comparison, we also executed the seam carving methods [30,32], which achieve 97.77 FM and 97.79 FM, respectively. The comparable performance shows that the pseudo-RGB image contains blobs that are as reliable as the blobs in the output of the anisotropic Gaussian filter. The convolutional network learns the embedding function with great accuracy; in the vast majority of cases, the validation accuracy reached over 99% (Figure 6). The network requires a lengthy period of time (46 epochs) to learn the embedding function, relative to the average training time (5 epochs) on the binary document datasets. Using a colour image increases the dimensionality and consequently the number of parameters, thus rendering the function more difficult to learn. On the other hand, a colour image contains more clues, e.g., ink colour, text line intensity, fading text edges, etc. This type of information is not present in the binary representation, regardless of how perfect the binarization is. Having such information at the network’s disposal leads to better results, assuming the training set is large enough. Since the proposed method is unsupervised, we train on both the training and the test data. The pseudo-RGB image generation phase takes 2.57 min, and the baseline detection phase takes almost one minute per page on average.

5.1.2. Binary Document Images

For binary document images, we perform experiments on the AHTE, ICDAR2017, and ICFHR2010 datasets. The results using the baseline configuration are presented in the first row and first column of Table 4. The result of the proposed method is 72.28 LIU and 71.40 PIU. In the following sections, we study the effect of various factors on the performance and obtain better results using the best combination of hyperparameters. Similar to the case with colour document images, the network was able to learn the embedding function very accurately; the validation accuracy almost always reached more than 99% (Figure 7). The average training time of the network (5 epochs) on the binary document datasets is relatively short compared with the training time (46 epochs) on the colour document dataset. The number of training epochs varies for each of the datasets because of early stopping. The execution times for the pseudo-RGB image generation and pixel-label extraction phases are reported in Figure 8. We can see that the complexity of the text line layout influences the pseudo-RGB generation time. Datasets such as AHTE, ICFHR2010, and CB55, with close together and touching text lines, take longer than the other datasets, where there is a clear separation between the text lines.

5.2. Training Phase Experiments

Effect of Pair Similarity Assumptions

The baseline experiment generates pairs under the assumption that a patch and its identity are a similar pair, whereas a patch and its 90-degree-rotated identity are a different pair, in terms of the text line pattern they contain. The network could potentially use a trivial frame alignment to classify the similarity of a pair, since the patches are extracted from the same spatial location. Experiments with spatial jittering and gapping show that removing the clues at the frame is not necessary for binary document images, likely because all possible low-level features eventually appear at the frames of randomly cropped patches, so the network has to learn all the low-level features anyway.
We experiment with different similarity assumptions for training the network. The second column in Table 4 presents a comparison among the similarity assumptions: identical, neighbouring, horizontal rotation, and horizontal flipping. The results show that assuming identical and neighbouring patches are similar is advantageous in terms of performance. A potential reason for the poorer performance of horizontal rotation and horizontal flipping is the inessential low-level features they introduce, which are not used during the pseudo-RGB image generation phase.
To further demonstrate the advantage in terms of training time, we report the accuracy logs in addition to the qualitative results. Figure 9 shows that augmenting the similarity assumption with horizontal rotation and horizontal flipping increases the training time. This is because the network additionally has to learn particular features that are present in rotated and flipped inputs but not in the input document. In addition to the quantitative results, the qualitative results show that the similarity assumption of identical and neighbouring patches provides sharper blob lines. It is worth noting that the validation accuracy is higher than the training accuracy. This can be explained by considering the size of each subset. As the training set is much larger than the validation set and is built using random patches, it contains a higher level of noise. Thus, the model can reach higher accuracy on the validation set much faster.

5.3. Pixel-Label Extraction Phase Experiments

5.3.1. Effect of Splitting Touching Components

The pixel-label extraction phase uses an energy minimization framework that is formulated at the connected component level. It is possible to formulate the energy function at the pixel level; however, it would be computationally costly. On the other hand, the component-level formulation is unable to split components that touch several blob lines. Therefore, the baseline configuration splits a touching component c by labelling each pixel inside the component with the blob line closest to it. We show the effect of splitting touching components in the second column of Table 4. For all similarity assumptions, splitting the touching characters results in better performance. The improvement is larger on the LIU metric than on the PIU metric. Correctly splitting a touching component might recover a complete text line that was over-segmented, whereas it increases the number of correctly segmented pixels only insignificantly relative to the total number of pixels (Figure 10).

5.3.2. Effect of Merging Broken Blob Lines

Occasionally, the thresholded pseudo-RGB image might contain broken blob lines that misguide the energy minimization function. To further improve the performance, we merge the broken blob lines (Figure 10). The most intuitive way is to merge two blob lines if (1) the direction of the vector connecting the right of the first component to the left of the second one falls between the direction of the two blob lines, (2) their vertical distance is less than the maximum character height. The direction of a blob line is defined as the vector connecting the left endpoint to the right endpoint. The third column of Table 4 shows the performance gains of merging blob lines with each of the similarity assumptions. As expected, we can see a significant improvement across all similarity assumptions. The best performance is achieved with the identity and neighbouring similarity assumption reaching 99.28 LIU and 91.40 PIU.
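A sketch of these two merge conditions, assuming each blob line is summarized by its left and right endpoints, could look as follows; the representation and all names are illustrative assumptions.

```python
# A minimal sketch of the criterion for merging two broken blob lines.
import numpy as np

def direction(p_left, p_right):
    """Angle of the vector from p_left to p_right; points are (x, y) tuples."""
    v = np.asarray(p_right, dtype=float) - np.asarray(p_left, dtype=float)
    return np.arctan2(v[1], v[0])

def should_merge(left_blob, right_blob, max_char_height):
    """left_blob/right_blob: dicts with 'left' and 'right' endpoint (x, y) tuples."""
    # (1) The direction of the connecting vector lies between the two blob line directions.
    d1 = direction(left_blob['left'], left_blob['right'])
    d2 = direction(right_blob['left'], right_blob['right'])
    d_conn = direction(left_blob['right'], right_blob['left'])
    cond1 = min(d1, d2) <= d_conn <= max(d1, d2)
    # (2) The vertical distance between the facing endpoints is below the max character height.
    cond2 = abs(left_blob['right'][1] - right_blob['left'][1]) < max_char_height
    return cond1 and cond2
```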

5.4. Results

In Section 5.3, we conducted an ablation study to quantify the contributions of the various components of the proposed method. In this section, two configuration choices are reported on the binary document image datasets. The first configuration [11] is the same as the baseline experiment, except that the similarity assumption includes identical, neighbouring, horizontally rotated, and horizontally flipped patches. The second configuration is the optimal combination of parameters inferred from the second row and third column of Table 4. The optimal combination is the same as the baseline experiment, except that the similarity assumption includes identical and neighbouring patches and the pixel-label extraction phase merges the broken blob lines.

5.4.1. Results on the VML-AHTE Dataset

Table 5 compares the results of the proposed method with the results of several supervised and unsupervised learning methods. Mask-RCNN [15] is a fully supervised instance segmentation algorithm that labels each pixel of the text lines. FCN+EM [15] is a semantic segmentation algorithm trained to predict the blob lines that strike through the text lines, under the supervision of human-annotated blob lines; the pixel labels of each text line are then extracted using an energy minimization function guided by the blob lines. UTLS [22] is another unsupervised learning method, based on statistical differences between patches from text line regions and from interline regions. However, these statistical differences are determined by a human-adjusted score. No matter how crowded the satellite elements are, our method is successful because the global pattern of the document is monotonic. The proposed method achieves the best performance, with 99.28 LIU and 91.40 PIU. The error in LIU is due to erroneous blob lines that do not strike through the text lines correctly. Part of the PIU error is propagated from the LIU error, and the remainder is due to the misassignment of crowded satellite components that are not touched by the blob lines. The energy function tends to assign components the label of the closest blob line, while striving to assign nearby components the same label. However, there might be satellite components that are far from the components of their text line and close to the blob line of another text line. Our method fails to assign these components correctly, as doing so would require recognizing the text.

5.4.2. Results on the ICFHR2010 Dataset

Table 6 reports the results of the proposed method and other unsupervised methods. The CUBS [25] method is a learning-free method and is similar to the proposed method in that it first detects blob lines that strike through the text lines. Its detection phase uses a steerable directional filter, which adapts to changes in the global pattern by using multiple directions. The proposed method, in contrast, fails under sudden changes in the global pattern (Figure 11). Such regions are assigned to neither the text line cluster nor the interline cluster, but to somewhere in between the two. This failure can be attributed to the fact that documents with fluctuating text lines or severely skewed and curved lines lie outside the network’s learned domain. As we can see in Figure 12, the salient features do not encode the text line information as well as what we observed in Section 4.8. It is important to note that the proposed algorithm successfully manages changes in the local pattern. The network can simply learn all the local patterns, no matter how diverse they are. For instance, a network trained on two different datasets learns all the local patterns despite the differences between the two datasets, and can be used to segment text lines of documents from both datasets (Figure 13).

5.4.3. Results on the ICDAR2017 Dataset

Table 7 shows the results on the ICDAR2017 dataset, which contains text lines that are relatively easier to segment. The most significant challenge is the components that touch across several text lines in the CB55 book. Hence, the UTLS method [22] performs worse on CB55 than on the others: the intense touching of components among the text lines confuses the statistical supervision of the network. The proposed blob line detection is successful regardless of how the touching components fill the interline space, because the global pattern of the document is monotonic. Our pixel-extraction method deals with touching components by labelling each pixel with the blob line closest to it (Figure 14). The documents contain an almost regular global pattern, which favours the proposed method. As we discuss in Section 6, the clustering phase depends on the existence of consistent interline spaces and text line heights. The proposed method outperforms [11] because it merges the broken blob lines. System-9+4.1 outperforms the proposed method on both CB55 and CSG863, and performs on par with the proposed method on CSG18. While the proposed method is not adapted to a specific script or dataset, System-9+4.1 was developed for the ICDAR2017 dataset [24]. The FCN+EM method [15] surpasses all other methods, but is supervised by human-annotated blob lines.

6. Understanding the Unsupervised Segmentation

This section presents a visual analysis to understand why the learned patch features lead the central windows to be clustered into blob lines. We leverage the patch saliency visualization (Section 4.8) to answer this question; it shows which part of the patch contributes the most to the decision of the network. We demonstrate that the result of the proposed method is not random and investigate the effect of different network architectures.

6.1. Sense of Network Depth

This section aims to show that the proposed method is applicable with different CNN architectures. We explore the effect of network depth on the qualitative results. Given that the baseline experiment uses AlexNet, we experiment with deeper networks by adding a single layer at a time, following the network architecture of VGG16. AlexNet contains five convolutional layers, three fully connected layers, and three pooling layers. VGG16 contains 13 convolutional layers, three fully connected layers, and five pooling layers. We experiment with shallower networks by removing a single layer at a time from the AlexNet architecture, until we are left with only three convolutional layers. We could not run a network with only two convolutional layers, as the number of parameters (more than 500,000,000) exhausted the GPU memory (the experiments were conducted on a machine with an Nvidia RTX 3060 GPU). It should be noted that the model parameter size increases as the number of pooling layers decreases.
Figure 15 shows the impact of network depth on the qualitative results. It shows the pseudo-RGB image of the same document alongside a visualization of the salient features of a patch from the document. As one would expect, deeper networks have wider receptive fields and shallower networks have narrower receptive fields. However, they can all be used to segment the text lines. Therefore, our method generalizes to shallower or deeper networks, as long as the network’s receptive field is narrow enough to represent the text line pattern. Figure 16 visualizes the saliency of the ResNet50 network, where we can see that the features from adjacent text lines overlap, indicating that the network sees wider regions with more complex features. Given the input patch size 350 × 350, the output tensor size from the last convolutional layer is 5 × 5, which results in a loss of detail but leads to complex features that are combinations of simple features. These complex features are not useful for clustering the central windows into text line regions and interline regions during the pseudo-RGB image generation phase, because they might correspond to any spatial location and do not form a pattern at the frequency of the text line pattern.

6.2. What Is Being Clustered?

So far, we have discussed the model details while treating the training phase as a black box. However, the text line segmentation in fact occurs during the pseudo-RGB image generation phase. This phase clusters the central windows according to the features of their patches. These features are the frequently occurring salient parts of the text lines in a patch. They form a signal specific to the global text line pattern of the patch. The frequency and the amplitude of this signal should be the same for all the patches, as they are cropped from similar documents. However, the phase of the signal differs according to the global text line pattern in the patch. For example, patches that contain a text line region at their top frame form a signal at a different phase than patches that contain an interline region at their top frame (Figure 17). Hence, the central windows are clustered into two clusters, one corresponding to the text line regions and the other corresponding to the interline regions (Figure 17).
We constructed an adversarial dataset to reveal how the pseudo-RGB image generation phase functions. This dataset contains digitally printed text lines with varying font sizes and interline spaces; we call it the heterogeneous printed dataset. We also constructed a homogeneous printed dataset that contains digitally printed text lines with homogeneous interline spaces. The network was trained on the heterogeneous printed dataset and tested on both the heterogeneous and the homogeneous datasets. We observe that the network learns the frequently occurring salient features, and can therefore cluster the central windows of a homogeneous document into text line regions and interline regions. The network can also cluster the central windows of a heterogeneous document, however, into an indeterminate number of clusters (Figure 18). Therefore, we cannot threshold the blob lines conveniently.
We propose the use of component tree binarization that searches all possible thresholds and stops the search at the pixels once a blob line appears on them. The algorithm decides whether a set of pixels are a blob line or not according to the ligature score defined in Section 4.5.1. The resultant blob lines do not always strike through the text lines, as can be seen in Figure 10, where the blob line image contains more lines than the document image. We eliminate the spurious blob lines by adding the label cost to the energy function given in Equation (2), producing:
$$E(f) = \sum_{c \in \mathcal{C}} D(c, \ell_c) + \sum_{\{c, c'\} \in N} d(c, c') \cdot \mathbb{1}(\ell_c \neq \ell_{c'}) + \sum_{\ell \in \mathcal{L}} h_\ell \cdot \delta_\ell(f)$$
where $\delta_\ell(f) = 1$ if at least one component is assigned the label $\ell$, and 0 otherwise. For every blob line $\ell \in \mathcal{L}$, $h_\ell$ is defined as $\exp(-\alpha \cdot r_\ell)$, where $r_\ell$ is the normalized number of foreground pixels overlapping with the blob line $\ell$.

7. Conclusions

In this work, we formulated the text line segmentation task as an unsupervised deep learning problem and eliminated the human-annotation effort. The formulation provides supervising signals for handwritten text line segmentation through a surrogate task: given a document image patch, we train a network to learn that it bears a similar text line pattern to its identity or neighbours, and a different text line pattern from its 90-degree-rotated identity or neighbours. The method is successful on documents with a globally monotonic text line pattern, even in the presence of crowded satellite elements, touching components, and slightly skewed text lines. The method is unsuccessful on documents with fluctuating, severely skewed, and curved text lines. It may be possible that, by including a similarity assumption that accounts for skewed and curved text lines, the method will become robust to such cases. We also provide an in-depth analysis to give insights into how the model works.

Author Contributions

Conceptualization, all authors; methodology, A.D. and B.K.B.; software, B.K.B. and A.D.; validation, A.D., B.K.B., R.S., R.A. and B.M.; formal analysis, B.K.B. and A.D.; investigation, A.D. and B.K.B.; data curation, all authors; writing—original draft preparation, A.D., B.K.B. and R.S.; writing—review and editing, all; visualization, A.D. and B.K.B.; supervision, R.S. and J.E.-S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

This research was partially supported by The Frankel Center for Computer Science at the Ben-Gurion University.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sudholt, S.; Fink, G.A. Phocnet: A deep convolutional neural network for word spotting in handwritten documents. In Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), Shenzhen, China, 23–26 October 2016; pp. 277–282. [Google Scholar]
  2. Grüning, T.; Leifert, G.; Strauß, T.; Michael, J.; Labahn, R. A two stage method for text line detection in historical documents. Int. J. Doc. Anal. Recognit. (IJDAR) 2019, 22, 285–302. [Google Scholar] [CrossRef]
  3. Alberti, M.; Vögtlin, L.; Pondenkandath, V.; Seuret, M.; Ingold, R.; Liwicki, M. Labeling, cutting, grouping: An efficient text line segmentation method for medieval manuscripts. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 1200–1206. [Google Scholar]
  4. Doersch, C.; Gupta, A.; Efros, A.A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1422–1430. [Google Scholar]
  5. Wang, X.; Gupta, A. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2794–2802. [Google Scholar]
  6. Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2536–2544. [Google Scholar]
  7. Manmatha, R.; Srimal, N. Scale space technique for word segmentation in handwritten documents. In Proceedings of the International Conference on Scale-Space Theories in Computer Vision, Corfu, Greece, 26–27 September 1999; pp. 22–33. [Google Scholar]
  8. Varga, T.; Bunke, H. Tree structure for word extraction from handwritten text lines. In Proceedings of the Eighth International Conference on Document Analysis and Recognition (ICDAR’05), Seoul, Korea, 31 August–1 September 2005; pp. 352–356. [Google Scholar]
  9. Graves, A.; Liwicki, M.; Fernández, S.; Bertolami, R.; Bunke, H.; Schmidhuber, J. A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 31, 855–868. [Google Scholar] [CrossRef] [PubMed]
  10. Liwicki, M.; Graves, A.; Bunke, H. Neural networks for handwriting recognition. In Computational Intelligence Paradigms in Advanced Pattern Classification; Springer: Berlin/Heidelberg, Germany, 2012; pp. 5–24. [Google Scholar]
  11. Kurar Barakat, B.; Droby, A.; Saabni, R.; El-Sana, J. Unsupervised learning of text line segmentation by differentiating coarse patterns. In Proceedings of the International Conference on Document Analysis and Recognition, Lausanne, Switzerland, 5–10 September 2021; pp. 523–537. [Google Scholar]
  12. Moysset, B.; Kermorvant, C.; Wolf, C.; Louradour, J. Paragraph text segmentation into lines with recurrent neural networks. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; pp. 456–460. [Google Scholar]
  13. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  14. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  15. Kurar Barakat, B.; Droby, A.; Alaasam, R.; Madi, B.; Rabaev, I.; El-Sana, J. Text line extraction using fully convolutional network and energy minimization. In Proceedings of the 2020 2nd International Workshop on Pattern Recognition for Cultural Heritage (PatReCH), Milan, Italy, 11 January 2020; pp. 3651–3656. [Google Scholar]
  16. Vo, Q.N.; Kim, S.H.; Yang, H.J.; Lee, G.S. Text line segmentation using a fully convolutional network in handwritten document images. IET Image Process. 2017, 12, 438–446. [Google Scholar] [CrossRef]
  17. Renton, G.; Soullard, Y.; Chatelain, C.; Adam, S.; Kermorvant, C.; Paquet, T. Fully convolutional network with dilated convolutions for handwritten text line segmentation. Int. J. Doc. Anal. Recognit. (IJDAR) 2018, 21, 177–186. [Google Scholar] [CrossRef]
  18. Kurar Barakat, B.; Droby, A.; Kassis, M.; El-Sana, J. Text line segmentation for challenging handwritten document images using fully convolutional network. In Proceedings of the 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), Niagara Falls, NY, USA, 5–8 August 2018; pp. 374–379. [Google Scholar]
  19. Mechi, O.; Mehri, M.; Ingold, R.; Amara, N.E.B. Text line segmentation in historical document images using an adaptive u-net architecture. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 369–374. [Google Scholar]
  20. Diem, M.; Kleber, F.; Fiel, S.; Grüning, T.; Gatos, B. cbad: ICDAR2017 competition on baseline detection. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 1355–1360. [Google Scholar]
  21. Kurar Barakat, B.; Cohen, R.; El-Sana, J. VML-MOC: Segmenting a multiply oriented and curved handwritten text line dataset. In Proceedings of the 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), Sydney, Australia, 22–25 September 2019; Volume 6, pp. 13–18. [Google Scholar]
  22. Kurar Barakat, B.; Droby, A.; Alasam, R.; Madi, B.; Rabaev, I.; Shammes, R.; El-Sana, J. Unsupervised deep learning for text line segmentation. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 3651–3656. [Google Scholar]
  23. Droby, A.; Kurar Barakat, B.; Alaasam, R.; Madi, B.; Rabaev, I.; El-Sana, J. Text Line Extraction in Historical Documents Using Mask R-CNN. Signals 2022, 3, 535–549. [Google Scholar] [CrossRef]
  24. Simistira, F.; Bouillon, M.; Seuret, M.; Würsch, M.; Alberti, M.; Ingold, R.; Liwicki, M. ICDAR2017 competition on layout analysis for challenging medieval manuscripts. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 1361–1370. [Google Scholar]
  25. Gatos, B.; Stamatopoulos, N.; Louloudis, G. ICFHR 2010 handwriting segmentation contest. In Proceedings of the 2010 12th International Conference on Frontiers in Handwriting Recognition, Kolkata, India, 16–18 November 2010; pp. 737–742. [Google Scholar]
  26. Barakat, B.K.; El-Sana, J.; Rabaev, I. The Pinkas Dataset. In Proceedings of the 2019 15th International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 732–737. [Google Scholar]
  27. Naegel, B.; Wendling, L. A document binarization method based on connected operators. Pattern Recognit. Lett. 2010, 31, 1251–1259. [Google Scholar] [CrossRef]
  28. Boykov, Y.; Veksler, O.; Zabih, R. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 1222–1239. [Google Scholar] [CrossRef]
  29. Boykov, Y.Y.; Jolly, M.P. Interactive graph cuts for optimal boundary & region segmentation of objects in ND images. In Proceedings of the Eighth IEEE International Conference on Computer Vision, ICCV 2001, Vancouver, BC, Canada, 7–14 July 2001; Volume 1, pp. 105–112. [Google Scholar]
  30. Saabni, R.; Asi, A.; El-Sana, J. Text line extraction for historical document images. Pattern Recognit. Lett. 2014, 35, 23–33. [Google Scholar] [CrossRef]
  31. Saabni, R.; El-Sana, J. Language-Independent Text Lines Extraction Using Seam Carving. In Proceedings of the 2011 International Conference on Document Analysis and Recognition, ICDAR 2011, Beijing, China, 18–21 September 2011; pp. 563–568. [Google Scholar]
  32. Saabni, R. Robust and Efficient Text: Line Extraction by Local Minimal Sub-Seams. In Proceedings of the 2nd International Symposium on Computer Science and Intelligent Control, Stockholm, Sweden, 21–23 September 2018. [Google Scholar]
Figure 1. In the proposed method, a model is trained to determine whether two patches share a similar text line pattern: a patch is similar to its identity or neighbours, and different from its 90-degree-rotated identity or neighbours. The network analyses the most frequent and salient features to reason about the text line pattern similarity.
Figure 2. An overview of the method for binary and colour document images. Given a handwritten document image, in the first phase a feature vector is formed for each image patch such that patches with similar global textual patterns lie close together in the feature space. The second phase extracts the first three principal components from the feature vectors of image patches using PCA. The three principal components are interpreted as pseudo-RGB colours, producing a pseudo-RGB image. For binary document images, the third phase thresholds the pseudo-RGB image into binary blob lines that strike through the text lines; the binary blob lines assist an energy minimization function in extracting the pixel labels. For colour document images, the third phase instead transforms the pseudo-RGB image into a grey-level energy map, which assists a seam carving function in detecting the baselines.
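The second phase lends itself to a few lines of code. Below is a minimal sketch, assuming each central window already has a learned feature vector arranged on a regular grid; the function name, grid shape, and normalisation scheme are illustrative assumptions, not the authors' released implementation.

```python
# Sketch: project per-window feature vectors onto their first three principal
# components and render them as a pseudo-RGB image.
import numpy as np
from sklearn.decomposition import PCA

def pseudo_rgb_image(features, grid_h, grid_w):
    """features: (grid_h * grid_w, d) array, one feature vector per central window."""
    pcs = PCA(n_components=3).fit_transform(features)   # (N, 3)
    pcs -= pcs.min(axis=0)                               # shift each component to >= 0
    pcs /= pcs.max(axis=0) + 1e-8                        # scale to [0, 1]
    return (pcs.reshape(grid_h, grid_w, 3) * 255).astype(np.uint8)
```

For binary inputs, a blob-line map could then be obtained by thresholding the channel that responds to the text line texture, for example with Otsu's method.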
Figure 3. Two-branch CNN architecture that predicts the similarity between two patches. Dotted lines denote shared weights, and fc denotes a fully connected layer.
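A minimal PyTorch sketch of such a two-branch, shared-weight architecture is given below; the layer sizes, the feature dimension, and the two-way output head are assumptions for illustration rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TwoBranchCNN(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.branch = nn.Sequential(                     # shared-weight encoder
            nn.Conv2d(1, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim), nn.ReLU(),
        )
        self.classifier = nn.Sequential(                 # fc head on concatenated features
            nn.Linear(2 * feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 2),                           # similar vs. dissimilar pattern
        )

    def forward(self, patch_a, patch_b):
        fa, fb = self.branch(patch_a), self.branch(patch_b)
        return self.classifier(torch.cat([fa, fb], dim=1))
```

After training, only one branch is needed to embed patches for the pseudo-RGB generation phase.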
Figure 4. Illustration of thresholding on a synthetic component. (a) A component composed of three grey levels with ascending values t1, t2, t3. (b–d) Child components obtained by thresholding the parent component with T = t1, t2, t3, respectively, and their approximating splines. (e) The breadth-first traversal of the component tree.
Figure 5. Visualization of features at the last convolutional layer of a single CNN branch. The network examines the most frequent and salient features to reason about text line pattern similarity. These features can be horizontal lines (AHTE), vertical lines (CB55), whole words (ICFHR, CSG18), or a combination of different features (CSG863, Pinkas).
Figure 6. Baseline experiment training and validation logs obtained on the colour document dataset, Pinkas. The validation accuracy trends higher than the training accuracy due to regularization.
Figure 7. Baseline experiment validation logs obtained on the binary document datasets. The network easily learns the difference in global text line pattern; the validation accuracy almost always exceeds 99% within an average of five epochs.
Figure 8. Baseline experiment average execution times per page for the pseudo-RGB image generation and pixel-label extraction phases. Datasets with touching and closely spaced text lines, such as AHTE and CB55, take longer in the pseudo-RGB image generation phase, whereas the pixel-label extraction time is approximately constant across datasets.
Figure 9. Effect of various similarity assumptions on the training time and qualitative results. The pseudo-RGB image and the training and validation accuracies are shown for each similarity assumption. We observe that the network takes longer to reach high accuracy as additional assumptions are added.
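To make the similarity assumptions concrete, the following sketch shows one way positive and negative patch pairs could be sampled; the patch grid, sampling probabilities, and helper name are hypothetical and not taken from the paper.

```python
import random
import numpy as np

def make_pair(patches, assumptions=("identity", "neighbouring")):
    """patches: 2D grid (list of lists) of equally sized document patches (numpy arrays)."""
    i = random.randrange(len(patches))
    j = random.randrange(len(patches[0]))
    anchor = patches[i][j]
    if random.random() < 0.5:                    # positive pair: similar line pattern
        kind = random.choice(assumptions)
        if kind == "identity":
            other = anchor.copy()
        elif kind == "neighbouring":
            other = patches[i][min(j + 1, len(patches[0]) - 1)]
        elif kind == "180 rotation":
            other = np.rot90(anchor, 2)
        else:                                    # horizontal flip
            other = np.fliplr(anchor)
        return anchor, other, 1
    # negative pair: a 90-degree rotation shows a different global line pattern
    return anchor, np.rot90(anchor, 1), 0
```

Extending the `assumptions` tuple corresponds to moving down the rows of Table 4.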
Figure 10. Effect of splitting touching components and merging broken blob lines. Post-processing offers significant advantages over the raw pixel-label extraction. Note that the blob line image contains spurious lines that do not correspond to any text line. Spurious lines are eliminated by Equation (5) mentioned in Section 6.2.
Figure 11. We have discussed that the proposed method is successful insofar as the input document contains a globally monotonic pattern. Therefore, as one would expect, the method is not successful on documents with fluctuating text lines or severely skewed and curved text lines. The right document contains severely skewed and curved lines, unlike the left one; consequently, the method fails to segment its lines correctly.
Figure 12. Visualizing the salient features of documents with severely skewed and curved text lines shows why the method fails to segment them. The network does not learn features that aid text line segmentation; instead, it learns the orientation of the text.
Figure 13. Visual results from a network trained on a combination of two datasets, VML-AHTE and ICDAR2017. The top two images show the pseudo-RGB images of pages from VML-AHTE and ICDAR2017 (book CB55). The bottom two images visualize the salient features of patches from the two datasets. The network memorizes all the different local patterns, as it clearly identifies the text and text lines in both examples.
Figure 14. CB55 contains touching components among several text lines. The left image shows the segmented text lines by pixel labels. The right image shows only the touching components that have been successfully split.
Figure 15. Visualization of the impact of network depth on the receptive field and the results. Each column shows the pseudo-RGB image of the same document (top) and the salient features of a patch from that document (bottom). The number of layers indicates the number of convolutional layers; the five-layer network is AlexNet and the 13-layer network is VGG16. Deeper networks have wider receptive fields, and shallower networks have narrower ones. All networks with three to 13 layers can be used to segment text lines: in all the pseudo-RGB images the text lines are clearly marked, although the quality of the markings declines as the number of layers increases. Furthermore, the salient feature visualizations show that the networks encode more complex features as the number of layers increases.
Figure 16. Visualization of the resulting salient features using ResNet50. Patch saliency shows that the features from text lines and interline regions overlap, indicating that the network's receptive field is wide. Hence, the central windows are not clustered into text line and interline regions during the pseudo-RGB image generation phase.
Figure 17. The central windows of the patches are clustered into two clusters. The left column shows the patches with central windows (yellow) that correspond to text line regions. The right column shows the patches with central windows (pink) that correspond to interline regions.
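For illustration, the two groups in Figure 17 could be reproduced by running a standard clustering algorithm on the per-window features, as in the sketch below; k-means is an assumed stand-in here, since the blob lines themselves are obtained by thresholding the pseudo-RGB image for binary inputs.

```python
from sklearn.cluster import KMeans

def cluster_windows(features, n_clusters=2):
    """features: (N, d) array of per-window feature vectors."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    return labels   # 0/1 per central window: text line vs. interline region
```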
Figure 18. CB55 contains touching components among several text lines. The left image shows the segmented text lines by pixel labels. The right image shows only the split touching components.
Table 1. Characteristics of the datasets used in our experiments.
Dataset | Script | Modern/Historical | Image Type | Challenges
VML-AHTE | Arabic | Historical | Binary | Numerous diacritics; cramped text lines
ICDAR2017 | Latin | Historical | Binary | Ascenders and descenders; touching text lines
ICFHR2010 | Latin | Modern | Binary | Heterogeneous document resolutions, text line heights, and skews
Pinkas | Hebrew | Historical | Colour | Noisy images
Table 2. Hyperparameters of the method used in the baseline experiments.
Hyperparameter | Patch Size (p) | Window Size (w) | CNN
Value | 350 | 20 | AlexNet
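As a rough illustration of how these hyperparameters interact, the sketch below densely extracts patches of size p = 350 whose w = 20 central windows tile the page; the stride and edge-padding policy are assumptions, and the AlexNet branch would then be applied to each extracted patch.

```python
import numpy as np

def extract_patches(image, p=350, w=20):
    """Yield (patch, window_centre) pairs from a 2D grey/binary page image; stride equals w."""
    half = p // 2
    padded = np.pad(image, half, mode="edge")   # so border windows also get full patches
    for y in range(0, image.shape[0], w):
        for x in range(0, image.shape[1], w):
            patch = padded[y:y + p, x:x + p]
            yield patch, (y + w // 2, x + w // 2)
```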
Table 3. Performance values on a historical colour document dataset, the Pinkas dataset.
Method | R | P | FM
Saabni et al. [30] | 98.19 | 97.37 | 97.77
Saabni et al. [32] | 98.91 | 97.73 | 97.79
Proposed method | 98.11 | 97.92 | 98.02
Table 4. LIU and PIU values on the AHTE dataset.
Similarity Assumption | Raw (LIU / PIU) | Split Components (LIU / PIU) | Split Components and Merge Blobs (LIU / PIU)
Identity | 72.28 / 71.40 | 88.76 / 83.00 | 95.62 / 85.30
Identity, neighbouring | 82.45 / 77.10 | 98.18 / 90.60 | 99.28 / 91.40
Identity, neighbouring, 180 rotation | 76.97 / 74.00 | 94.92 / 88.40 | 99.28 / 90.30
Identity, neighbouring, 180 rotation, horizontal flip | 70.31 / 69.50 | 90.94 / 83.40 | 96.32 / 85.20
Table 5. LIU and PIU values on the VML-AHTE dataset.
Method | LIU | PIU
Supervised Mask-RCNN [15] | 93.08 | 86.97
FCN+EM [15] | 94.52 | 90.01
Unsupervised Kurar et al. [11] | 90.94 | 83.40
UTLS [22] | 98.55 | 88.95
Proposed method | 99.28 | 91.40
Table 6. DR, RA and FM values on the ICFHR2010 dataset.
Method | DR | RA | FM
Unsupervised
UTLS [22] | 73.22 | 72.38 | 72.36
CUBS [25] | 97.54 | 97.72 | 97.63
Proposed method | 71.70 | 70.46 | 71.04
Table 7. LIU and PIU values on the ICDAR2017 dataset.
Method | CB55 (LIU / PIU) | CSG18 (LIU / PIU) | CSG863 (LIU / PIU)
Unsupervised
UTLS [22] | 80.35 / 77.30 | 94.30 / 95.50 | 90.58 / 89.40
[11] | 93.45 / 90.90 | 97.25 / 96.90 | 92.61 / 91.50
System-9+4.1 [24] | 98.04 / 96.67 | 96.91 / 96.93 | 98.62 / 97.54
Proposed method | 96.01 / 92.40 | 97.25 / 96.90 | 93.75 / 91.99
Supervised
FCN+EM [15] | 100.0 / 97.64 | 97.65 / 97.79 | 100.0 / 97.18