1. Introduction
Despite recent advances driven by deep learning, the reliable recognition of handwritten text, captured from either cameras or scanners, remains an open problem. Solving it matters because handwriting could become a significant means of communication with computers, given factors such as privacy, the widespread use of paper, and its importance in early human learning processes. Many attempts have been made over the years to develop precise handwritten text recognition (HTR) systems. Prior to 2013, many solutions were based on Hidden Markov Models (HMMs) as the prevailing architectures [1,2,3]. From 2013 onwards, however, deep learning models have been considered the standard methodology for offline text recognition.
The variability between and within writers has always been a limiting factor for offline HTR systems, as each person has their own handwriting style. Several other factors, such as the thickness of the strokes due to the type of pen used, may also affect the readability of a given text, and improving this readability may boost subsequent recognition results. To overcome these issues, a prior text normalization stage has become common practice in most HTR algorithms. This preprocessing step follows the classic methodology of many computer vision problems: images are normalized before the subsequent recognition models are applied [4,5].
In this context, the aim of the normalization step is to reduce the intrinsic variability within the handwritten text images before passing them to an HTR model. This preprocessing task may include several substeps, such as text binarization, image noise removal, slope correction, slant correction, and normalization of ascenders and descenders [6,7,8]. Classical solutions were based on heuristic algorithms [9,10]. Some authors omitted this normalization process [11,12] with mixed results, but those who applied normalization typically obtained results as good as, and usually better than, those of comparable systems without it, using similar recognition models [5,6,13,14,15,16]. In recent years, many authors have researched preprocessing tasks, but most of them focused on only one of the substeps mentioned above.
Generative Adversarial Networks (GANs) are popular deep learning models that are able to generate high-quality images [17]. In particular, Pix2Pix [18] models are conditional GANs that address what is called paired image-to-image translation. Some works in the literature use GANs for image normalization purposes [19]. In the area of automatic document analysis, GAN models have been used to delete background and printed text elements [20], and to denoise [5], deblur [21], or binarize [22] images, among other tasks. Some GAN models also normalize the text width with erosion and dilation transformations [23] or try to remove the noise present in document images [24]. However, none of these works address the complete normalization process. In this sense, this work presents a new approach, in which normalization is performed by a network that could eventually be trained in an end-to-end configuration combining normalization and recognition within a single scheme.
This paper presents a novel handwritten text normalization algorithm based on a Pix2Pix model trained from scratch. The contributions of this paper can be summarized as follows:
The proposed model reduces the intra-personal variability present in subjects' handwritten text. As far as we know, the complete handwritten text normalization problem, with all its substages, has not been addressed as a whole by any deep learning model. Moreover, the novelty of our proposal lies in the application of a Pix2Pix network for normalizing handwritten text while preserving the characters and the legibility of the word, rather than for generative purposes.
The adjustment made to the original Pix2Pix architecture to tackle this particular problem. This adaptation involves a decrease in the number of layers of the U-Net generator and, therefore, a reduction in the number of parameters needed to train the model.
A comparison between the proposed solution and an ad hoc heuristic procedure.
The rest of this paper is organized as follows:
Section 2 discusses related works,
Section 3 covers the dataset, the preprocessing steps, and the architecture used for normalization.
Section 4 describes the experiments and results, whereas
Section 5 concludes with the main findings and future directions.
2. Related Works
Different authors have proposed and applied several heuristic normalization strategies over the years in order to improve their HTR models.
Most initial works followed a pattern that typically included similar stages to accomplish the handwritten preprocessing task, namely text binarization, image noise removal, slope correction, slant correction, and normalization of ascenders and descenders. For such preprocessing, image-based techniques focused on finding automatic thresholds to binarize text and remove noise [25]; on using either the Hough transform [26] or horizontal projections [27] to correct the slope; on pixel projections [28], pixel counting [29], or other techniques to estimate and correct the cursive slant; or on baseline detection to normalize the size of ascenders and descenders [30].
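For illustration, the automatic-threshold idea behind these classical binarization stages can be sketched with Otsu's method, which picks the threshold that maximizes between-class variance. This is a minimal version of the well-known algorithm, not the specific procedure of any cited system; production pipelines add noise filtering and local adaptation.

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the Otsu threshold for a uint8 grayscale image by
    maximizing the between-class variance over all 256 cut points."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    mu_total = np.dot(np.arange(256), hist) / total
    best_t, best_var = 0, -1.0
    cum_w = 0.0   # pixel count of the "background" class so far
    cum_mu = 0.0  # weighted intensity sum of that class
    for t in range(256):
        cum_w += hist[t]
        cum_mu += t * hist[t]
        w0 = cum_w / total
        if w0 == 0.0 or w0 == 1.0:
            continue  # one class is empty; variance undefined
        mu0 = cum_mu / cum_w
        mu1 = (mu_total * total - cum_mu) / (total - cum_w)
        between_var = w0 * (1.0 - w0) * (mu0 - mu1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t
```

A clearly bimodal image (dark ink on a light background) yields a threshold between the two modes, which is then used to binarize the text.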
Some authors have developed complete, heuristic-based systems to normalize the images. Sueiras et al. [31] propose a combination of heuristic methods that tries to fix most of the common problems in HTR preprocessing: noise reduction, binarization, image inversion, slant correction, baseline detection, slope correction, normalization of ascenders and descenders, and resizing.
More recent works use deep learning techniques in order to achieve automatic preprocessing. Most of them are mainly focused on image binarization and document image enhancement. For example, Akbari et al. [32] and Calvo-Zaragoza et al. [22] applied Convolutional Neural Networks (CNNs), long short-term memory (LSTM) networks, and autoencoders to address these issues.
Other works that also tackle preprocessing are mainly focused on GAN models.
Sadri et al. [33] use a CycleGAN architecture to enhance document images. Another work by this research group uses GAN models for data augmentation [34].
Gonwirat and Surinta [24] handle image denoising via GAN models. They developed both a GAN devoted only to image enhancement and a combination of GAN and CNN models for the enhancement and recognition of handwritten characters.
Souibgui and Kessentini [21] focused their work on the enhancement of document images, this time through conditional GAN models (cGANs). This enhancement aimed to remove warping, blurring, watermarks, etc. Their work was extended to train a GAN to enhance the images jointly with a Convolutional Recurrent Neural Network (CRNN) devoted to handwritten text recognition [20]. In that work, the GAN architecture was composed of a U-Net generator and a Fully Convolutional Network discriminator.
Suh et al. [35] managed to remove most of the document image background, including text from the reverse page, stamps, lines, etc. The architecture selected for the first stage is a Pix2Pix GAN, which removes background and color information; specifically, it is composed of four U-Net neural networks acting as four color-independent generators and a PatchGAN [18] variation as discriminator. In the second stage, local and global binary transformations are applied to improve the first-stage images.
Kang et al. [23] pretrained and combined several U-Nets to deal with specific preprocessing tasks such as dilation, erosion, histogram equalization, Canny edge detection, and Otsu binarization. Their system is specially designed to handle small datasets.
Zhao et al. [36] use conditional GANs to binarize images: a first conditional GAN is trained to generate images, and a subsequent cascade of conditional GANs composes the image by taking multi-scale features into account.
Kumar et al. [37] describe an unsupervised model that does not need paired images for binarization. These authors create augmented texture images through a GAN model, pseudo-pair them with the original ones, and train a discriminator to differentiate them.
The problem of mimicking the writing style of authors has been investigated by Davis et al. [38], who proposed a model combining GANs and autoencoders to capture global style variations of the writers, such as size or slant.
4. Experiments and Results
The next step consists of assessing the accuracy and effectiveness of our Pix2Pix model. The workflow followed in each experiment of this section (including the train and test stages) is summarized in Figure 2.
The first metric selected to estimate the accuracy of the normalization process for this workflow is the average Distance Per Pixel (DPP hereafter).
Once trained (train region in Figure 2), the Pix2Pix model performs normalization on the unseen test set images (test region in Figure 2). Then, we evaluate the pixel-to-pixel L1 distance given by Equation (2) between the Pix2Pix output images and a certain set of ground truth counterparts. Let $i$ be an image of the test image set $I$ with range in $[0, 255]$, and let $i_n$ be the $n$-th pixel of $i$. We define $N_1$ and $N_2$ as two different normalizations that we want to compare after being applied on $i$. Then, for an individual image, the DPP is calculated as

$$\mathrm{DPP}(i) = \frac{1}{h \cdot w} \sum_{n=1}^{h \cdot w} \big| N_1(i)_n - N_2(i)_n \big|, \qquad (4)$$

where $h$ and $w$ correspond to the height and width of the images.
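As a sanity check, the DPP defined above can be sketched in a few lines, assuming two aligned grayscale images of equal size:

```python
import numpy as np

def distance_per_pixel(img_a: np.ndarray, img_b: np.ndarray) -> float:
    """Average per-pixel L1 distance (DPP) between two equally sized
    grayscale images: the summed absolute pixel differences divided
    by h * w."""
    if img_a.shape != img_b.shape:
        raise ValueError("images must share the same shape")
    h, w = img_a.shape
    diff = np.abs(img_a.astype(float) - img_b.astype(float))
    return float(diff.sum() / (h * w))
```

Identical images yield a DPP of 0; the value grows with the average intensity disagreement between the two normalizations.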
In order to reinforce the reliability of our analysis, we computed a second metric, the Structural Similarity Index [46] (SSIM hereafter), to measure and compare the effectiveness of the aforementioned normalization methods. This parameter estimates the similarity between two images by mimicking human visual perception, and it is given by the following:

$$\mathrm{SSIM}\big(N_1(i), N_2(i)\big) = \frac{(2\mu_1\mu_2 + C_1)(2\sigma_{12} + C_2)}{(\mu_1^2 + \mu_2^2 + C_1)(\sigma_1^2 + \sigma_2^2 + C_2)}, \qquad (5)$$

where $\mu_j$, $\sigma_j^2$, and $\sigma_{12}$ are the mean, variance, and covariance of the image $i$ normalized through $N_j$, respectively. For images in byte integer representation, $C_1 = (0.01 \cdot 255)^2$ and $C_2 = (0.03 \cdot 255)^2$, as specified by the authors of [46]. In summary, given two images, SSIM outputs values in a range between −1 and 1 depending on how close those images are in terms of likeness: structurally opposite images yield a value of −1, whereas the higher the similarity between them, the closer the outcome is to 1. Therefore, SSIM explores global similarities between images that have been normalized through two different processes, whereas DPP is constrained to a pixel-level analysis. Once obtained using Equations (4) and (5), DPP and SSIM values are averaged over the whole considered dataset.
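The SSIM formula above can be sketched as follows. Note that this is a single-window (global) variant computed over the whole image; the original SSIM formulation [46] averages the index over a sliding window, so windowed implementations will give different values.

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray) -> float:
    """Single-window (global) SSIM between two grayscale images in
    byte range, using the standard stabilizing constants."""
    x = x.astype(float)
    y = y.astype(float)
    c1 = (0.01 * 255) ** 2  # C1 for byte-range images
    c2 = (0.03 * 255) ** 2  # C2 for byte-range images
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```

An image compared with itself gives a value of 1, the upper bound of the index.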
4.1. Summary of Experiments
We conducted four distinct experiments using the three different raw datasets. These experiments aimed to determine whether our proposed method can replicate the heuristic transformations performed by Sueiras et al.’s normalization algorithm, which serves as our benchmark. For each experiment, we applied either our normalization algorithm or the heuristic one to different variants of the raw datasets.
The experiments on the various datasets were conducted using the presented Pix2Pix architecture, which was trained with the respective training portion of each dataset. However, due to the smaller size of the Osborne dataset and the limited diversity of the Bentham dataset (a single author), the training for these datasets started from a model pretrained on the IAM dataset.
Also, to measure the impact of the reconstruction term on the final normalization performance, we trained the network with three different values of the reconstruction weight λ in Equation (3). Thus, we tested three different variants of the network for each of the three datasets.
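For context, we assume Equation (3) follows the standard Pix2Pix objective [18]: a conditional adversarial term plus an L1 reconstruction term weighted by λ,

```latex
G^{*} = \arg\min_{G}\max_{D}\;
        \mathcal{L}_{cGAN}(G, D) + \lambda\,\mathcal{L}_{L1}(G),
\qquad
\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y}\big[\,\lVert y - G(x)\rVert_{1}\,\big],
```

where larger λ values push the generator's output closer to the ground truth normalization at the pixel level, at the cost of a weaker adversarial signal.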
The main aim of each of these experiments may be summarized in the following items:
Identifying a value of the reconstruction weight λ that achieves good results on the test partition of each dataset, using the reference heuristic algorithm outputs as ground truth.
Assessing whether our proposal restores images from the test partition of each dataset when subjected to distortions, similar to the results obtained by the reference heuristic algorithm.
Estimating how our proposal and the reference heuristic algorithm approximate manual ad hoc normalization performed on a subset of the IAM test set.
Conducting a preliminary comparison between the text recognition results obtained when using our Pix2Pix normalization approach as a preprocessing step and those achieved using the reference heuristic normalization.
4.1.1. Experiment 1—Heuristic Replication Measurement
The first experiment tests the extent to which our proposal is able to recreate the heuristic normalization, as this procedure acted as ground truth during model training. To this end, we ran the three variants of the proposed Pix2Pix model on the raw test sets of the three datasets. Once the output normalized images were obtained, we compared them with those normalized using the scheme presented by Sueiras et al. [14]. Table 1 provides the output metrics of the first experiment. As can be observed, there are no clear advantages among the different values of λ. In particular, the Pix2Pix variant highlighted in bold in Table 1 gave the best results in terms of both DPP and SSIM for all tested databases; for example, for the IAM database, its average DPP indicates that, on average, each output pixel differs only slightly from its heuristic ground truth counterpart.
4.1.2. Experiment 2—Distorted Datasets
We also provide an alternative version of the databases (IAM, Bentham, and Osborne) to test the accuracy of the normalization procedures when applied to a different set of images. These alternative sets were obtained by applying several mild deforming transformations to the original raw test images; thus, we will refer to these data as the distorted datasets henceforth. To obtain these modified versions, we performed a pipeline of transformations on the raw text images, where each step is applied with a certain probability. For each image, the pipeline involves the following steps: a chance of undergoing an elastic distortion as the one proposed by Yousef and Bishop [47]; a per-image probability of being rotated by a random angle; an 80% probability of adding a slant to the image, with a random angle; and, finally, a probability of undergoing either a dilation or an erosion. These probabilities were determined by previous experiments.
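A minimal sketch of such a probability-gated distortion pipeline is given below. The slant and dilation steps are implemented directly, whereas the elastic distortion and rotation steps are omitted for brevity; the probabilities and angle ranges used here are illustrative placeholders, not the actual values used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

def add_slant(img: np.ndarray, angle_deg: float) -> np.ndarray:
    """Shear each row horizontally to simulate cursive slant
    (dark ink on a white, 255-valued background)."""
    h, w = img.shape
    shift_per_row = np.tan(np.radians(angle_deg))
    out = np.full_like(img, 255)
    for r in range(h):
        s = int(round((h - 1 - r) * shift_per_row))
        if 0 <= s < w:
            out[r, s:] = img[r, : w - s]
        elif -w < s < 0:
            out[r, : w + s] = img[r, -s:]
        # otherwise the shift exceeds the width; the row stays blank
    return out

def dilate3(img: np.ndarray) -> np.ndarray:
    """3x3 grayscale dilation of dark ink: a minimum filter, which
    thickens strokes on light backgrounds."""
    p = np.pad(img, 1, mode="edge")
    views = [p[dr:dr + img.shape[0], dc:dc + img.shape[1]]
             for dr in range(3) for dc in range(3)]
    return np.min(views, axis=0)

def distort(img: np.ndarray) -> np.ndarray:
    """Probability-gated pipeline in the spirit of the text:
    a random slant (80% chance) and a stroke-thickening dilation
    (placeholder 50% chance)."""
    out = img
    if rng.random() < 0.8:
        out = add_slant(out, rng.uniform(-15, 15))  # placeholder range
    if rng.random() < 0.5:
        out = dilate3(out)
    return out
```

Each call produces a mildly deformed copy of the input word image while preserving its size, mirroring how the distorted test sets keep the original image resolution.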
These distorted images are then normalized by the three Pix2Pix reconstruction-loss variants described above and also by the reference heuristic method, in order to compare the effectiveness of the normalization approaches when facing “unseen” handwritten text images. This time, however, each normalization was compared to itself applied on the raw, non-distorted images. The results in terms of DPP and SSIM are presented in Table 2.
Even though the best outcomes were obtained by one of our model variants, the results were very similar across all the metrics. Therefore, we can conclude that both our proposal and the reference heuristic algorithm restore the test dataset images in a similar way, perhaps with some advantage for the proposed method.
4.1.3. Experiment 3—Comparison with Manual Ad Hoc Normalization
In this third experiment, we randomly selected a sample of 100 images from the raw IAM dataset. These images were manually normalized by visual inspection, accounting especially for slope and slant corrections and preserving the aspect ratio. In addition, to fit the input resolution of our model, we ensured that the main body of the text in these images was constrained to the central 24 pixels, leaving the other regions to ascenders and descenders. In doing so, we generated an alternative ground truth set against which we can compare all the previously explained normalization algorithms.
Figure 3 displays the raw version of one of these IAM test set images as well as its ad hoc normalized counterpart, whereas Table 3 encapsulates the values obtained for the computed metrics. In this case, we find that, based on the DPP metric, our Pix2Pix proposal slightly surpasses the results of the heuristic algorithm for every considered value of λ. The SSIM results, however, were very similar for all the normalizations performed.
4.1.4. Experiment 4—Examining Normalization Architecture in Deep Recognition
For our final experiment, we conducted a preliminary comparison between the results obtained with our normalization approach and those achieved using the heuristic normalization method proposed by Sueiras et al. together with their own recognition model [14]. Sueiras et al. reported the Word Error Rate (WER) and Character Error Rate (CER, based on the Levenshtein distance) obtained when applying their CNN+Seq2Seq LSTM-based attention model to the IAM test dataset with their heuristic normalization procedure, which we have adopted as the baseline for our work. After replacing their heuristic normalization with our Pix2Pix normalization method (see Figure 4), we observed a decrease in both WER and CER (see Table 4). This replacement occurred in two stages: first, we substituted the input layer of the CNN+Seq2Seq LSTM-based attention model with our pretrained Pix2Pix architecture; then, we trained the resulting model on the IAM training dataset, stopping if no further improvement in validation loss was observed for 100 epochs.
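For reference, the CER mentioned above can be computed from the Levenshtein (edit) distance; a minimal sketch is given below (WER is analogous, computed over word tokens instead of characters):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming
    over a rolling row of the distance matrix."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance normalized by the
    reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)
```

For example, `levenshtein("kitten", "sitting")` is 3, so the corresponding CER against the 6-character reference is 0.5.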
4.2. Discussion
Theoretically, the quality of the normalization stage should improve as DPP decreases and SSIM increases. Following this rationale, one of the trained variants provides some of the best results (see Figure 5). Nevertheless, a visual inspection of some of the network outcomes allows us to check whether this is true, giving a more qualitative and reliable intuition about the Pix2Pix normalization performance with respect to the “ideal” preprocessing of the ground truth images.
To illustrate the effectiveness of the model in a challenging case, we present the results of applying the different normalization methods to a word image example from the Osborne dataset that exhibits overlapping text strokes (see Figure 6). It can be seen that our approach yields better results than the heuristic method and additionally reduces some of the noise present in the image.
We found that our Pix2Pix provides a good normalization with respect to heuristic preprocessing as long as the resizing factor (understood as the difference in the size of text between raw and heuristic images, relative to the entire image) is low. In such cases, most of the other features present in raw input images (slope, slant, low contrast between background and text) are properly corrected. Nevertheless, whenever this resizing of the text with respect to the raw input image becomes noticeable, the normalization performance of our model drops significantly.
Nonetheless, we mostly observe that for our conducted experiments, the trained Pix2Pix architecture effectively replicates the outcomes achieved by the reference heuristic algorithm for handwritten text normalization. These favorable results have facilitated the successful integration of this normalization stage into an end-to-end recognition architecture, enabling a comprehensive optimization of all model parameters.
5. Conclusions and Future Work
We present a Pix2Pix conditional GAN model to normalize handwritten text images. A normalized version of the IAM dataset, obtained by applying typical preprocessing routines to the original images, has served as the training ground truth for the presented model. Also, fine-tuning has been performed for both the Bentham and the Osborne datasets. Four experiments were conducted on different variants or subsamples of the three aforementioned datasets. These experiments measured both the reliability of the proposed normalization and its capability to generalize beyond the training ground truth. To quantify this, we built our analysis of the results upon two metrics, the Distance Per Pixel and the Structural Similarity Index Measure.
Concerning our four conducted experiments, and even though most of the word images were properly normalized, we observed that those undergoing a significant text resize during the preprocessing stage were considerably more difficult for the Pix2Pix model to normalize. This suggests that resizing is a limiting factor for the capacity of Pix2Pix architectures in this normalization task. However, its removal from the global preprocessing procedure may affect the recognition performance of a subsequent recognition model. On the other hand, some adjustments to the proposed architecture and to the steps of the ground truth normalization could help overcome the effect of resizing.
Once trained, the Pix2Pix model proposed in this work has been integrated with a recognition architecture in an end-to-end configuration. In doing so, the recognition architecture provides the Pix2Pix model with feedback to improve the normalization phase, and this combined HTR model has improved the overall recognition performance of the equivalent non-normalized model. Furthermore, splitting the model into two components (normalization and recognition) enhances its interpretability. Moreover, this architecture could be used for other normalization tasks involving borders, such as the recognition of living organisms' contours. As future work, we will explore modifications to the Pix2Pix architecture to resolve the detected resizing problems, and we will also try to apply this architecture to other similar problems.