Article

An Efficient Transformer–CNN Network for Document Image Binarization

by Lina Zhang *, Kaiyuan Wang and Yi Wan
School of Information Science and Engineering, Lanzhou University, 222 S. Tianshui Rd., Lanzhou 730000, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(12), 2243; https://doi.org/10.3390/electronics13122243
Submission received: 19 April 2024 / Revised: 27 May 2024 / Accepted: 29 May 2024 / Published: 7 June 2024
(This article belongs to the Special Issue Deep Learning-Based Object Detection/Classification)

Abstract

Color image binarization plays a pivotal role in image preprocessing and significantly impacts subsequent tasks, particularly text recognition. This paper concentrates on document image binarization (DIB), which aims to separate an image into a foreground (text) and a background (non-text content). We analyze both conventional and deep-learning-based approaches and conclude that prevailing DIB methods leverage deep learning technology. Furthermore, we examine the receptive fields of networks before and after training to underscore the Transformer model’s advantages. We then introduce a lightweight model based on the U-Net structure and enhanced with the MobileViT module to better capture global features in document images. Given its ability to learn both local and global features, the proposed model demonstrates competitive performance on two standard datasets (DIBCO2012 and DIBCO2017) and good robustness on the DIBCO2019 dataset. Notably, the proposed method is a straightforward end-to-end model that requires no additional image pre- or post-processing and no ensemble of models. Moreover, its parameter count is less than one-eighth that of the model that achieves the best results on most DIBCO datasets. Finally, two sets of ablation experiments verify the effectiveness of the proposed binarization model.

1. Introduction

Document image binarization (DIB) is one of the crucial image preprocessing steps and serves as a basic operation for tasks such as text recognition [1,2,3,4,5] and feature extraction [6]. A well-preprocessed binarized image has a significant effect on the results of optical character recognition (OCR) [7]. The goal of DIB is to separate the image into a foreground (text) and a background (non-text content). The foreground pixel values are 0, and the background pixel values are 255, which is what we usually call “black words on white paper”.
Document images, such as those of ancient texts, often suffer from serious degradation, as shown in Figure 1 and Figure 2. These images are from the document image binarization competition datasets associated with the ICDAR and ICFHR conferences, which have been held almost every year since 2009 [8,9,10,11,12,13,14,15,16,17,18]. Document image binarization therefore holds substantial practical application value. The digital processing of degraded documents is a crucial method for addressing challenges in historical document preservation and cultural heritage conservation. Manual processing of large volumes of historical text materials requires significant time and labor and is susceptible to errors. Therefore, employing computers for the automatic processing of images of ancient text materials is imperative.
As depicted in Figure 1 and Figure 2, various textual issues are evident in the document images. For instance, Figure 1 illustrates text inconsistencies, such as variations in texture, font size, and color, alongside distortions in the alignment of text content lines, yellowing of paper, and ink contamination. Similarly, Figure 2 showcases damaged and incomplete textual elements, significant blurring, and the presence of strong background textures causing interference within the text area. In summary, the analysis and recognition of textual information within handwritten document images of ancient books or early printed materials pose formidable challenges. Therefore, research into document image binarization for text segmentation holds particular significance.
Many approaches have been proposed to realize document image binarization. The typical methods include thresholding methods, such as the Otsu algorithm [20], Niblack’s method [21], Sauvola’s method [22], and Wolf’s method [23], among others [24,25,26,27,28,29,30,31,32,33,34,35]. There are also other traditional algorithms based on edge detection [36,37,38,39,40], and others utilize fuzzy logic [41,42,43,44,45]. Additionally, the method in [46] was the winner of DIBCO2018.
In recent years, the achievements in document image binarization research have primarily been realized through deep learning [19,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72]. Neural networks have demonstrated a remarkable ability to separate the foreground text from the background of document images, as shown by a comparison between a deep learning algorithm [63] and the traditional method [46] (winner of DIBCO2018) for the binarization of handwritten document images in Figure 3.
From Figure 3, it is evident that Rezanezhad et al. [63] segmented the foreground text from even the most prominent background textures. Their model attained the top performance across three document image datasets (DIBCO2012, DIBCO2017, and DIBCO2018) at the 7th International Workshop on Historical Document Imaging and Processing. The model is based on the U-Net structure with a locally applied Transformer, which gives it significant segmentation capability. Based on further research and analysis of their model, we propose a new model for document image binarization with fewer parameters and better performance.
This paper introduces a novel model that combines the U-Net architecture with a Transformer and integrates MobileViT for document image binarization for the first time. The main contributions of this paper are as follows:
We discuss the characteristics of different types of deep learning methods for document image binarization and also illustrate the receptive fields of the CNN and Transformer models.
By incorporating the MobileViT block into the U-Net structure, we aim to broaden the receptive fields of the image, capturing both global and local characteristics more effectively. This marks the first application of the MobileViT block in document image binarization. The parameters of the proposed model are only one-fourth of those of a similar model [63] based on U-Net for document image binarization.
The proposed method is a straightforward end-to-end model that is trained only once and that does not employ pre- or post-processing steps or ensemble methods.
Our model exhibits competitive performance on two established document binarization competition (DIBCO2012 and DIBCO2017) datasets and has good robustness on the DIBCO2019 dataset, which encompasses both handwritten and machine-printed documents.
The paper is structured as follows. In Section 2, we review various document image binarization methods. Section 3 discusses the proposed model. We present the experimental results and comparisons with other approaches on several DIBCO datasets in Section 4. Section 5 showcases two sets of ablation experiments on the core modules of the proposed model. Finally, Section 6 concludes the paper.

2. Related Work

In this section, we provide an overview of the different algorithms and models for document image binarization and analyze the characteristics of each method. Many approaches have been proposed to address this problem. These techniques can be roughly categorized into traditional approaches, including thresholding methods [20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35], edge-detection-based methods [36,37,38,39,40], and fuzzy-logic-based and other techniques [41,42,43,44,45,46], and deep learning methods [19,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72].

2.1. Traditional Binarization Techniques

The Otsu algorithm, pioneered by Otsu [20], stands as one of the most renowned thresholding methods for document image binarization. It endeavors to ascertain an optimal threshold value through comprehensive grayscale image analysis. However, its single-threshold approach proves inadequate for handling low-contrast or unevenly illuminated images. Consequently, alternative methodologies have been proposed to mitigate the limitations inherent in employing a fixed threshold. Noteworthy among these are Niblack [21], Sauvola [22], Wolf [23], and others [24,25,26,27,28,29,30,31,32,33,34,35].
Recognizing the significance of stroke continuity, particularly in degraded handwritten document images, Holambe et al. [39] introduced a methodology that combines contrast mapping with Canny edge detection to identify text stroke edge pixels; thresholding techniques are then applied to delineate background and foreground regions. Similarly, other conventional algorithms [36,37,38,39,40] predominantly rely on edge detection for document image binarization. Numerous further traditional approaches [42,43,44,45,46] have been proposed in pursuit of optimal document image binarization.
Among these, the work of Xiong et al. [46] merits attention as the victor of the 2018 Document Image Binarization Competition. This method exemplifies a conventional approach that leverages background estimation and Laplacian energy segmentation techniques exclusively for document image binarization. Despite its accolades, limitations become evident, particularly when the image is afflicted by similar background textures and text, as illustrated in Figure 3c.

2.2. Deep Learning Binarization Approaches

As can be seen from the binarization result obtained by Rezanezhad et al. [63] in Figure 3d, their technique segments the text information in the original document image almost completely. Even so, their result still has some areas that are not well segmented; for example, background information far away from the text is incorrectly classified as foreground. It is nevertheless clear that the results of document image binarization based on deep learning are far better than those of traditional methods, so the document image binarization algorithms developed over the past three years have generally been based on deep learning technology [57,58,59,60,61,62,63,66,67,68,69,72,73,74]. In particular, the model of Biswas et al. [72] has obtained almost the best results on most of the DIBCO datasets. Their model is an encoder–decoder architecture for document binarization based on a Tokens-to-Token Vision Transformer, employing a progressive tokenization technique to capture local information from an image and achieve more effective binarization results.
Models for document image binarization based on deep learning technology are mainly divided into two research directions. One is models obtained based on convolutional neural networks (CNNs) [19,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63], and the other is based on generative adversarial networks (GANs) [66,67,68,69].
Calvo-Zaragoza and Gallego [19] used convolutional auto-encoders to learn an end-to-end map from an input image to a selectional output whose activations indicate whether each pixel belongs to the foreground or the background. This model is trained once and outperformed the other binarization strategies of its time. However, it is not as good as the model proposed by Rezanezhad et al. [63], which is a U-Net-based structure with an added attention mechanism (Transformer). A detailed comparison of these two models and their strategies is provided in the next section. Other CNN-based models [48,49,51,53,55,64,65,66,67,68,69,70,71,73,75] and models using U-Net structures [52,54,56,62,74] have also been developed in recent years.
Generative adversarial networks (GANs) [76] comprise generator and discriminator networks. Souibgui et al. [66] applied conditional generative adversarial networks (cGANs) to develop a document enhancement generative adversarial network (DE-GAN) that restores severely degraded document images, including performing the binarization task. They input a degraded image together with its ground truth (GT) to the discriminator, which forces the generator to produce an output that is indistinguishable from the GT. After training, the discriminator becomes unnecessary, and only the generator network is used to enhance degraded images. In contrast, Rajesh et al. [70] employed a dual-discriminator generative adversarial network (DD-GAN) [65] to perform binarization directly on a JPEG-compressed stream of document images. Other GAN-based models [64,67,70,71] have also been constructed for binarization.
In summary, the CNN-based models above [19,47,50,57,58,59,60,61] generally have the advantages of a CNN: simple structural design, fewer parameters, and high computational efficiency, which is especially relevant when computing resources are limited. Meanwhile, the models from [64,65,66,67,68,69,70,71] operate within the GAN framework and share the characteristics of GANs: as unsupervised generative models trained against a well-trained discriminator, they can produce clearer and more realistic sample results. Nevertheless, GANs also suffer from certain drawbacks, such as mode collapse and oscillation during training, which may lead to unstable quality or a lack of diversity in the generated document images. GAN training procedures are typically intricate and necessitate meticulous design and parameter tuning, often requiring considerable time to achieve the desired outcomes.

3. Methodology

3.1. Baseline Network for Document Image Binarization

Based on the above analysis and summary of document image binarization with CNNs and GANs, we observe that document image binarization is in essence a segmentation task on the target image (the separation of the foreground text from the background). The U-Net network was proposed by Ronneberger et al. [77] in 2015 for medical image segmentation tasks, where it achieved excellent results. Because of its strong segmentation performance, this structure has attracted the attention of binarization research, and several U-Net-based document image binarization methods exist [54,63,78,79]. U-Net has two key advantages: the simple left–right symmetric convolutional architecture and the skip connections. The first realizes encoding and decoding through pure convolution, downsampling, and upsampling; the image features are compressed so that only the key information is retained, and the image is then restored so that the output has the same size as the input. The second is the skip connection between the encoding and decoding paths, which fuses pixel-level features with semantic features; this fusion of features at different scales is very beneficial for recovering the image through upsampling. The U-Net network can therefore perform semantic segmentation at the pixel level, making it well suited to segmenting the text and background of document images. In addition, U-Net can learn well even from a small amount of labeled data through data augmentation and its special network architecture, which suits datasets with few labeled document images.
In short, pure convolutional networks can easily be adjusted and optimized for different task requirements thanks to their simple structural design, and they adapt to images of various sizes and to various types of image processing tasks; U-Net inherits these properties. However, a plain cascade of convolutional layers lacks the symmetric structure and skip connections of U-Net. U-Net can transfer information between different levels of the network, fuse feature maps of different resolutions, and effectively combine depth and context information in order to retain more detailed information and thus achieve better document image binarization.
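To make the encoder–decoder and skip-connection structure described above concrete, the following is a minimal PyTorch sketch of a U-Net-style network. The channel widths, depth, and single-channel sigmoid output are illustrative assumptions of ours; this toy network is not the binarization model proposed in this paper.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions, as in the original U-Net encoder/decoder stages.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, base=16):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)            # encoder: compress features
        self.enc2 = conv_block(base, base * 2)
        self.bottleneck = conv_block(base * 2, base * 4)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)     # input = upsampled + skip
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.head = nn.Conv2d(base, 1, 1)              # per-pixel foreground score

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return torch.sigmoid(self.head(d1))            # probability of "text" per pixel

# A 512x512 RGB patch in, a 512x512 foreground-probability map out.
out = TinyUNet()(torch.randn(1, 3, 512, 512))
print(out.shape)  # torch.Size([1, 1, 512, 512])
```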
Because of these advantages of convolution and the U-Net network, we experimentally explored the document image binarization model of [19] and a pure U-Net model. However, we found problems in several cases, as shown in Figure 4. The binarization results in Figure 4c,d show that background texture resembling photocopied content cannot be reasonably separated from the text information in the original document image. The original document image in Figure 4a is severely polluted, and the texture in the background even has strong contrast. Relying only on simple cascaded convolution or U-Net training cannot fully capture the overall features and achieve the “black and white” binarization effect we want. This problem stems mainly from the limited receptive field: simple cascaded convolution and the U-Net network find it difficult to learn features over a wider receptive field.

3.2. Improving Network Receptive Field Range

In the previous subsection, we mentioned the concept of the receptive field; let us now provide a succinct overview of this notion. The receptive field, a concept originally rooted in biological neural networks, denotes the input region perceived by a neuron. In a deep convolutional neural network, each element of a feature map is computed from a specific region of the input image, known as the receptive field of that element. A typical calculation formula can be found in [80]: $l_k = l_{k-1} + (f_k - 1)\prod_{i=1}^{k-1} s_i$, where $l_k$, $f_k$, and $s_k$ denote the receptive field, kernel size, and stride, respectively, of the $k$th layer. Another approach, proposed by Luo et al. [81], uses backpropagation to compute the “effective receptive field”, defined as the region of input pixels that have a non-negligible influence on the central output unit. By analyzing the distribution of the effective receptive field under different weights, they derived its mathematical characteristics in the general case with nonlinear activation functions, given by Formulas (1) and (2) below; the variance and expectation are given by Formulas (3) and (4), respectively.
$$\frac{\partial l}{\partial x^{0}_{i,j}} = \sum_{i',j'} \frac{\partial l}{\partial y_{i',j'}} \, \frac{\partial y_{i',j'}}{\partial x^{0}_{i,j}} \tag{1}$$
where $l$ is the loss function, the pixels of each layer are indexed by $(i,j)$ with the center at $(0,0)$, $x^{0}_{i,j}$ is the network input, and $y_{i,j} = x^{n}_{i,j}$ is the output of the $n$th (final) layer; the purpose of this formula is to measure the contribution of each input pixel $x$ to the output $y$.
$$g(i,j,p-1) = \sigma^{p}_{i,j} \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} w^{p}_{a,b}\, g(i+a,\, j+b,\, p) \tag{2}$$
where $g(i,j,p)$ is the gradient at $(i,j)$ of the $p$th layer, $\sigma^{p}_{i,j}$ is the gradient of the activation function at $(i,j)$ of the $p$th layer, and $w^{p}_{a,b}$ is the weight of the convolution kernel at $(a,b)$ of the $p$th layer.
$$\mathrm{Var}[g(i,j,p-1)] = \mathbb{E}\big[(\sigma^{p}_{i,j})^{2}\big] \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} \mathrm{Var}[w^{p}_{a,b}]\,\mathrm{Var}[g(i+a,\, j+b,\, p)], \qquad \mathbb{E}[g(i,j,p-1)] = 0 \tag{3}$$
$$\mathbb{E}\big[(\sigma^{p}_{i,j})^{2}\big] = \mathrm{Var}[\sigma^{p}_{i,j}] = 1/4 \tag{4}$$
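As a small illustration of the closed-form recursion from [80] quoted above, the Python function below accumulates the receptive field of a stack of layers from their kernel sizes and strides; the example layer configuration is hypothetical.

```python
def receptive_field(kernel_sizes, strides):
    """l_k = l_{k-1} + (f_k - 1) * prod(s_1 ... s_{k-1}), starting from l_0 = 1."""
    rf, jump = 1, 1  # jump = product of the strides of all previous layers
    for f, s in zip(kernel_sizes, strides):
        rf += (f - 1) * jump
        jump *= s
    return rf

# Hypothetical example: three 3x3 convolutions with strides 2, 1, 2.
print(receptive_field([3, 3, 3], [2, 1, 2]))  # 1 + 2*1 + 2*2 + 2*2 = 11
```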
We use the effective receptive field calculation method proposed by Luo et al. [81] and plot the receptive fields of a representative ResNet and Transformer before and after training in Figure 5. From the comparison of Figure 5c and Figure 5d, it can be clearly seen that after training, the receptive field of the central element of the Transformer covers almost the entire image, which is significantly larger than that of the ResNet (pure convolution); the larger the value, the stronger the influence. This shows that a Transformer architecture can effectively handle long-distance dependencies through its self-attention mechanism, allowing the model to better understand the associations between distant locations and thus better capture global dependencies. In addition, the flexibility and versatility of the Transformer architecture make it suitable for various computer vision tasks (Vision Transformers, or ViTs), and it is this powerful modeling capability that has allowed the architecture to flourish.
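To illustrate how such maps can be produced, the following PyTorch sketch follows the backpropagation procedure of Luo et al. [81]: a unit gradient is injected at the central output unit and the magnitude of the gradient reaching the input is recorded. The small linear convolutional stack is only a stand-in for illustration; it is not the ResNet or Transformer plotted in Figure 5.

```python
import torch
import torch.nn as nn

def effective_receptive_field(model, in_ch=1, size=65):
    """Magnitude of the gradient of the central output unit w.r.t. each input pixel."""
    x = torch.ones(1, in_ch, size, size, requires_grad=True)
    y = model(x)                                   # assumes the spatial size is preserved
    grad_out = torch.zeros_like(y)
    grad_out[0, 0, y.shape[-2] // 2, y.shape[-1] // 2] = 1.0  # unit gradient at the center
    y.backward(grad_out)
    return x.grad.abs().sum(dim=1)[0]              # (size, size) influence map

# Stand-in stack of four 3x3 convolutions (kept linear for simplicity);
# its theoretical receptive field is 9x9, so 81 input pixels influence the center.
cnn = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1),
    nn.Conv2d(8, 8, 3, padding=1),
    nn.Conv2d(8, 8, 3, padding=1),
    nn.Conv2d(8, 1, 3, padding=1),
)
erf = effective_receptive_field(cnn)
print(int((erf > 0).sum()))   # 81
```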
MobileViT, proposed by Mehta and Rastegari [82] in 2021, is a lightweight Vision Transformer architecture. The core components of the model are the MV2 block and the MobileViT block, which fuse the advantages of CNNs and the Transformer attention mechanism. The most important module, the MobileViT block, flexibly combines global self-attention with local feature extraction and offers four advantages for balancing performance and computational efficiency. The first is its lightweight structure, which makes it suitable for mobile devices with limited computing resources. The second is hierarchical feature extraction: the MobileViT block contains multiple sub-layers that can effectively extract features at different levels of abstraction. The third is cross-layer connection: it introduces lateral or skip connections, which improve the efficiency and accuracy of feature transfer. Finally, its multi-scale processing supports inputs of different scales (i.e., input images of different sizes). These advantages allow the MobileViT model to balance performance and computational efficiency well.
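The following is a simplified PyTorch sketch of a MobileViT-style block written from the description in [82]: local convolutional features are projected to a token dimension, non-overlapping patches are unfolded so that self-attention runs globally across patches, and the result is folded back and fused with the block input. The channel, patch, and depth settings here are illustrative assumptions, not the exact configuration used in our model.

```python
import torch
import torch.nn as nn

class MobileViTBlockSketch(nn.Module):
    """Simplified MobileViT block: local conv features + global self-attention
    over non-overlapping patches, then fusion with the block input (after [82])."""
    def __init__(self, channels=64, dim=96, patch=2, depth=2, heads=4):
        super().__init__()
        self.patch = patch
        # Local representation: 3x3 conv then 1x1 projection to the token dimension.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, dim, 1),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=2 * dim, batch_first=True)
        self.global_rep = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Conv2d(dim, channels, 1)
        # Fusion: concatenate with the block input and mix with a 3x3 conv.
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x):
        B, C, H, W = x.shape
        p = self.patch                       # H and W must be divisible by p
        y = self.local(x)                    # (B, dim, H, W)
        d = y.shape[1]
        # Unfold: one sequence per pixel position within a patch, each running
        # over all N = (H/p)*(W/p) patches, so attention is global across patches.
        y = y.reshape(B, d, H // p, p, W // p, p)
        y = y.permute(0, 3, 5, 2, 4, 1).reshape(B * p * p, (H // p) * (W // p), d)
        y = self.global_rep(y)
        # Fold back to the feature-map layout.
        y = y.reshape(B, p, p, H // p, W // p, d)
        y = y.permute(0, 5, 3, 1, 4, 2).reshape(B, d, H, W)
        y = self.proj(y)
        return self.fuse(torch.cat([x, y], dim=1))

# Smoke test on a 64-channel feature map.
feat = torch.randn(1, 64, 32, 32)
print(MobileViTBlockSketch()(feat).shape)  # torch.Size([1, 64, 32, 32])
```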

3.3. Proposed Model

Incorporating the benefits of the MobileViT block, we integrate this component into the U-Net architecture to formulate a document image binarization model. This study draws on the successful work of Rezanezhad et al. [63] on document image processing. Their model, presented at the 7th International Workshop on Historical Document Imaging and Processing in 2023, notably excelled on multiple datasets, including DIBCO2012 [11], DIBCO2017 [15], and DIBCO2018 [16]. A comparative analysis against traditional algorithms [20,22,46] and other deep-learning-based models [49,59,64,69,83] underscored the superior visual outcomes and performance metrics achieved by their model [63]. Notably, a key advantage lies in its minimal GPU resource requirements and rapid training: using only an NVIDIA 2080, a proficient document image binarization model can be trained within two days. Moreover, the model has a modest parameter count, is built upon the U-Net architecture, and ingeniously integrates the Transformer’s attention mechanism at its base.
Nevertheless, the Transformer structure within Rezanezhad et al.’s model [63] is employed only after several convolutional downsampling (encoding) stages, leading to an inadequate grasp of the original image’s overall structure, as illustrated in Figure 4. In contrast, our proposed model capitalizes on the lightweight nature of the MobileViT block to assimilate the information features of document images at each convolutional stage. This enables our model to effectively combine global and local image attributes. The schematic of the document image binarization model proposed in this study is depicted in Figure 6. Owing to the portability of the MobileViT block, the parameter count of our model is reduced by approximately 76% compared with Rezanezhad et al.’s model [63], rendering our model more suitable for deployment on lightweight hardware devices. The proposed model comprises 8,923,370 parameters, while Rezanezhad et al.’s model encompasses 36,980,426 parameters.

4. Experiments and Analysis

In this section, we elaborate on the specific procedural details of the experiment, encompassing the experimental description, qualitative analysis, and quantitative analysis of the experimental results. Subsequently, we sequentially delve into the intricate specifics of each segment.

4.1. Introduction of Experiments

This part introduces the preparation work before model training. In order to objectively evaluate the performance of our proposed model, we used the same dataset as Rezanezhad et al. [63] for experimental training. To be specific, we used the recent DIBCO datasets [8,9,10,11,12,13,14,15,16], the Bali Palm Leaf dataset [84], and the PHIBC [85] dataset of the Persian Heritage Document Image Binarization Competition.
For comparative experimental analysis, we selected DIBCO2012 [11], DIBCO2017 [15], and DIBCO 2018 [16] as validation datasets. The rest of the datasets served as the training datasets, which is consistent with the models of Rezanezhad et al. [63] and others [49,59,64,69,83].
The training image size, learning rate, and number of epochs for our document image binarization model were all tuned through extensive experiments. The training process is divided into two stages. In the first stage, the images of the training dataset are not cropped, so that the model learns the overall structural features; the complete images, after data augmentation, are used for training, and the number of training epochs is set to 30. The second stage focuses on learning local detail. In this stage, the original document images are split into patches, which increases the amount of training data to more than four times that of the first stage, so the number of training epochs is set to 20.
Our model is trained on a single NVIDIA GeForce RTX 3090. The training image size is 512 × 512, with a batch size of 10. For the DIBCO2012 dataset [11], the initial learning rate in the first stage is set to $1 \times 10^{-3}$. In the second stage, a learning-rate schedule is applied: the initial learning rate remains $1 \times 10^{-3}$ and is halved every five epochs. This learning rate scheduling strategy contributes to the model’s performance and convergence.
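The halving schedule described above corresponds to a simple step decay. The sketch below, using PyTorch’s StepLR and a hypothetical stand-in model, illustrates the second-stage configuration; the choice of the Adam optimizer and the omitted training pass are assumptions for illustration, not our exact training script.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

# A hypothetical stand-in model; the real binarization network would go here.
model = torch.nn.Conv2d(3, 1, 3, padding=1)
optimizer = Adam(model.parameters(), lr=1e-3)          # initial learning rate 1e-3
scheduler = StepLR(optimizer, step_size=5, gamma=0.5)  # halve the LR every 5 epochs

for epoch in range(20):                                # second stage: 20 epochs
    # ... one pass over the 512x512 training patches would go here ...
    scheduler.step()
    print(epoch, scheduler.get_last_lr())
```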
The loss function for training document image binarization models is often based on the F-measure (FM), as in [19,63]. The F-measure of an image is defined by Formula (5).
$$FM = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{5}$$
where $Precision = \frac{TP}{FP + TP}$, $Recall = \frac{TP}{FN + TP}$, and the three terms $TP$, $FP$, and $FN$ respectively denote true positives (foreground pixels accurately classified as text), false positives (background pixels misclassified as text foreground), and false negatives (text foreground pixels erroneously classified as background). Throughout our training process, we designate the foreground pixel value as 1 (“positive”) and the background pixel value as 0 (“negative”) in the true binary label (GT). By substituting the expressions for $Precision$ and $Recall$ into Equation (5), the simplified objective function is obtained as follows:
$$FM = \frac{2\,TP}{2\,TP + FP + FN} \tag{6}$$
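Since FM = 2TP/(2TP + FP + FN) coincides with the Dice coefficient, a differentiable training loss can be formed by replacing the hard counts with sums over predicted foreground probabilities. The sketch below illustrates such a soft FM loss; it shows the idea rather than the exact loss implementation used in our training code.

```python
import torch

def soft_fm_loss(pred, target, eps=1e-6):
    """1 - soft F-measure (Dice). pred: foreground probabilities in [0, 1];
    target: binary ground truth with text = 1 and background = 0."""
    tp = (pred * target).sum()
    fp = (pred * (1 - target)).sum()
    fn = ((1 - pred) * target).sum()
    fm = (2 * tp + eps) / (2 * tp + fp + fn + eps)
    return 1 - fm

# Example: a perfect prediction gives a loss of (approximately) zero.
gt = (torch.rand(1, 1, 64, 64) > 0.9).float()
print(soft_fm_loss(gt.clone(), gt))
```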
An inherent challenge in model training lies in the scarcity of adequate training data. This predicament arises from the limited availability of recent DIBCO datasets [8,9,10,11,12,13,14,15,16,17], comprising a mere 11 datasets with a total of 146 pairs of original document images and their corresponding true binarized label images (GT). Supplementing this pool are 50 pairs from the Bali Palm Leaf dataset [84] and 16 pairs from the PHIBC dataset [85], culminating in a total of 212 valid image pairs and labels. When the DIBCO2017 dataset [15] is the evaluation target, a mere 172 pairs of images and their GTs can be used for training, after also excluding the 20 pairs of the DIBCO2019 dataset [17]. Owing to the intricate nature of the document images in the DIBCO2019 dataset, characterized by diverse pollution types and textual characteristics, they are deemed unsuitable for inclusion in the training data.
Comparable models [49,59,63,64,69,83] for comparison are trained without leveraging any data from DIBCO2017 or DIBCO2019. However, the limited dataset size is insufficient for effective neural network model training. Hence, akin to Rezanezhad et al. [63], we employ data augmentation techniques [60,86] on the existing dataset. These techniques encompass image rotation, scaling, brightness adjustment, and image chunking.
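A rough sketch of this kind of augmentation pipeline (rotation, scaling, brightness adjustment, and chunking into patches) is given below using torchvision; the specific parameter ranges and the non-overlapping chunking strategy are illustrative assumptions rather than our exact settings.

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def augment(image, gt, patch=512):
    """image, gt: (C, H, W) tensors; the same geometric transform is applied to
    both, photometric changes to the image only, then both are cut into patches."""
    angle = random.choice([0, 90, 180, 270])                 # rotation
    image, gt = TF.rotate(image, angle), TF.rotate(gt, angle)
    scale = random.uniform(0.8, 1.2)                         # scaling
    h, w = image.shape[-2:]
    new_size = [int(h * scale), int(w * scale)]
    image = TF.resize(image, new_size)
    gt = TF.resize(gt, new_size, interpolation=InterpolationMode.NEAREST)
    image = TF.adjust_brightness(image, random.uniform(0.7, 1.3))
    # Chunking: aligned non-overlapping crops (borders dropped for brevity).
    patches, (H, W) = [], image.shape[-2:]
    for top in range(0, H - patch + 1, patch):
        for left in range(0, W - patch + 1, patch):
            patches.append((TF.crop(image, top, left, patch, patch),
                            TF.crop(gt, top, left, patch, patch)))
    return patches
```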
In summary, our proposed binarization model for document images demonstrates superior performance compared with the method of Rezanezhad et al. [63], as evidenced by our comparative analysis of experimental results on the DIBCO2012 [11] and DIBCO2017 [15] datasets. Specifically, our model exhibits enhanced performance across all four indicators of document image binarization. However, the performance of our model on the DIBCO2018 [16] dataset is not as satisfactory, prompting a detailed analysis to identify areas for improvement. Furthermore, to assess the robustness of our approach, we conducted verification and comparison experiments on the challenging DIBCO2019 [17] dataset, which is characterized by severe image degradation and features distinct from those of our training data. Encouragingly, our model outperforms both the model by Rezanezhad et al. [63] with a comparable parameter count and other conventional methods. In the following, we analyze and discuss our experimental findings from both qualitative and quantitative perspectives to provide deeper insight into the efficacy of the proposed methodology.

4.2. Qualitative Evaluation

The experiments conducted for this paper involve a meticulous comparison and analysis across three prominent datasets: DIBCO2012 [11], DIBCO2017 [15], and DIBCO2018 [16]. We compare the document image binarization model proposed in this paper with deep-learning-based models [49,59,63,64,69,72,83] and classic traditional methods [20,22,46] on these three datasets [11,15,16]. By presenting visual comparison maps of the binarization results and calculating the relevant evaluation indicators, a comparative analysis is carried out. Visual comparisons of the performance of the various methods on the distinct datasets are presented in Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12. These visualizations serve as a cornerstone for our comparative analysis and facilitate an intuitive understanding of methodological efficacy.
From Figure 7, we can see that Otsu’s method [20] cannot distinguish the contamination in the second line of the original image from the text content very well, mainly because the method is based on a global threshold. The traditional method by Sauvola [22], which is based on a local threshold, can largely separate the polluted background content from the text. Xiong et al.’s method [46] (the winning algorithm of DIBCO2018 [16]) also cannot perfectly segment the text information in the thick stroke areas, which illustrates the limitations of traditional binarization methods. The model of Calvo-Zaragoza and Gallego [19] is a simple cascaded convolutional network; its binarization result cannot properly distinguish features in the background far from the text area, so the result map contains a lot of noise. This shows that a pure convolution model mainly has a strong ability to learn local features. The result of Rezanezhad et al.’s method [63] misses only some of the thinner strokes and the trailing lines of the last two letters, while the rest of the text is almost completely separated from the background content.
From Figure 8, it is evident that the traditional method by Sauvola [22], which relies solely on a local threshold, yields unsatisfactory binarization results for document images afflicted by uneven illumination. In contrast, the method proposed by Xiong et al. [46] estimates the overall background of the image and uses a variety of processing techniques, producing better binarization results on this kind of unevenly illuminated document image. The deep-learning-based models, including that of Calvo-Zaragoza and Gallego [19] (albeit without a fully clean separation of the background content), Rezanezhad et al. [63], Biswas et al. [72], and our model, all achieve satisfactory binarization results for this document image.
Only the proposed method obtains fully satisfactory results for binarization of document images in Figure 9. Examining the binarization outcomes of traditional methods such as those by Otsu [20], Sauvola [22], and Xiong et al. [46] in Figure 9, it becomes apparent that these methods struggle to accurately segment shadows containing similar text information within the background. In contrast, for the binarization results of Rezanezhad et al. [63], the background content near the text can be segmented well, but the photocopy-contaminated area far away from the text area cannot be correctly segmented. This limitation arises due to the constrained feature learning capability of Rezanezhad et al.’s method [63] concerning global information inherent in document images. The distinct advantage of our proposed model lies in its utilization of the lightweight MobileViT block module, which effectively integrates global information into locally extracted feature representations through convolution.
In Figure 10, the binarization results of Rezanezhad et al. [63], Biswas et al. [72], and the proposed model are satisfactory. This is because the model by Rezanezhad et al. [63] uses the Transformer structure in the bottom local region of the U-Net network architecture, and the document image has shallow background photocopy pollution compared to the document image in Figure 9. As a result, Rezanezhad et al.’s method [63] does a good job of separating the background from text content. The model by Biswas et al. [72] also applied the Transformer network, which led to the basically satisfactory binarized results of the document image.
Similar to the results in Figure 10, the binarization results of Rezanezhad et al.’s method [63] and the proposed model in Figure 11 are both relatively satisfactory. Figure 11 also shows that the binarization of Otsu’s technique [20] is significantly better than that of Sauvola’s [22], because the photocopied text content in the background is lighter in color than the foreground text. The Transformer applied in the small regions of Rezanezhad et al.’s model [63] therefore does a good job of separating background content from real text content. In addition, when training for the DIBCO2018 [16] dataset, 10 more images (6.2%) are used than for the DIBCO2017 [15] dataset, which plays an important role in improving the performance of the model.
For all methods in Figure 12, significant disparities are observed visually between the binarization results and the GT. The three traditional methods [20,22,46] fail to correctly preserve the faint text content of the last two lines, and the pure convolutional method [19] has the same problem. Rezanezhad et al. [63] show strong performance, especially for the last two lines of text, which are rendered closest to the real label. This may be attributed to their model’s large number of parameters (more than four times that of the proposed model) and its design as an ensemble of multiple models. A specific analysis is provided in the model robustness study in the quantitative experiments below. It is therefore undeniable that the model of Rezanezhad et al. [63] exhibits strong performance.

4.3. Quantitative Evaluation

For the quantitative numerical evaluation of the results of document image binarization processing, four commonly used evaluation indicators are popular at present: F-measure (FM), pseudo F-measure (pFM) [85], peak signal-to-noise ratio (PSNR), and distance reciprocal distortion (DRD) [87]. These four evaluation metrics are used to show the quality of the obtained binarized image through different aspects, and they are significantly better than simple accuracy.

4.3.1. FM

Before defining FM and pFM, we first review four concepts: TP, FP, TN, and FN, which denote true positives, false positives, true negatives, and false negatives, respectively. The relationship between these four values is shown in Table 1.
The usual definition of accuracy is:
$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \tag{7}$$
The F-measure (FM) of the image is defined by Formula (5), with the simplified form of Formula (6) given in the previous section. From Formulas (5)–(7), we can see that FM is more reasonable than accuracy for evaluating document image binarization. Because text pixels make up a much smaller proportion of a document image than blank background pixels, the FM metric is more meaningful than simply measuring the pixel accuracy of the whole image; for example, in an image whose text occupies 5% of the pixels, classifying every pixel as background yields 95% accuracy but an FM of zero.

4.3.2. pFM

The pseudo F-measure (pFM) of an image is defined as:
$$pFM = \frac{2 \times pRecall \times Precision}{pRecall + Precision} \tag{8}$$
where $Precision$ is defined as in FM, and $pRecall$ is the percentage of the character structure (skeleton) of the ground-truth image that is preserved in the binarized image. From the definitions of FM and pFM ((5) and (8), respectively), it is clear that both indicators are positively correlated with the quality of the binarized document image.

4.3.3. PSNR

Peak signal-to-noise ratio (PSNR) is an indicator of image quality related to mean square error. It is an objective evaluation index that mainly expresses the difference between the results of image processing and the real image. In this paper, PSNR evaluates the variance between the binarized document image and the corresponding GT, calculated as follows:
$$PSNR = 10 \times \log_{10}\frac{(2^{n}-1)^{2}}{MSE} \tag{9}$$
where $MSE$ is the mean squared error between the binarized image and the real binary image. For the common 8-bit image representation, $n = 8$ and the peak value is $255 = 2^{8}-1$. The unit of PSNR is dB: the larger the value, the more similar the binarization result is to the real binarized document image; this is a widely used objective index of image quality.
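For completeness, a minimal NumPy illustration of this PSNR computation for 8-bit images is given below; it is a sketch rather than the official DIBCO evaluation tool.

```python
import numpy as np

def psnr(binarized, gt, n_bits=8):
    """PSNR between a binarized image and its ground truth, both uint8 arrays."""
    mse = np.mean((binarized.astype(np.float64) - gt.astype(np.float64)) ** 2)
    peak = (2 ** n_bits - 1) ** 2          # 255^2 for 8-bit images
    return float("inf") if mse == 0 else 10 * np.log10(peak / mse)

# Example: two images differing in 1% of their pixels.
gt = np.zeros((100, 100), dtype=np.uint8)
out = gt.copy()
out[:1, :] = 255                           # flip 100 of the 10,000 pixels
print(round(psnr(out, gt), 2))             # 20.0 dB
```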

4.3.4. DRD

The distance reciprocal distortion measure DRD [87] objectively expresses the distortion of visual perception of a binarized document image through the distance between pixels. The specific definition is as follows:
$$DRD = \frac{\sum_{k} DRD_{k}}{NUBM} \tag{10}$$
where $DRD_{k}$ is the distortion of the $k$th flipped pixel, and $NUBM$ is the number of non-uniform (not all black or white pixels) blocks in the real binarized image. The larger the value of DRD, the greater the visually perceived distortion of the binarization result; therefore, the smaller the value of DRD, the better the binarization result of the document image.
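The sketch below implements a common formulation of DRD following [87] as we understand it, using a 5 × 5 normalized reciprocal-distance weight matrix around each flipped pixel and counting NUBM over 8 × 8 ground-truth blocks; it is an illustration rather than the official evaluation code, and the handling of image borders is our own assumption.

```python
import numpy as np

def drd(binarized, gt, block=8):
    """Distance reciprocal distortion between binary {0, 1} arrays (sketch of [87])."""
    # 5x5 normalized reciprocal-distance weight matrix with zero weight at the center.
    yy, xx = np.mgrid[-2:3, -2:3]
    dist = np.hypot(yy, xx)
    w = np.zeros((5, 5))
    w[dist > 0] = 1.0 / dist[dist > 0]
    w /= w.sum()
    g = np.pad(gt.astype(np.float64), 2, mode="edge")
    total = 0.0
    for i, j in zip(*np.nonzero(binarized != gt)):          # flipped pixels
        neigh = g[i:i + 5, j:j + 5]                         # 5x5 GT neighbourhood
        total += np.sum(np.abs(neigh - binarized[i, j]) * w)
    # NUBM: number of non-uniform (not all-black or all-white) GT blocks.
    H, W = gt.shape
    nubm = sum(0 < gt[r:r + block, c:c + block].sum() < block * block
               for r in range(0, H - block + 1, block)
               for c in range(0, W - block + 1, block))
    return total / max(nubm, 1)

# Example: a GT page with a text stripe and a result with a few flipped pixels.
gt = np.zeros((64, 64), dtype=np.uint8); gt[20:30, :] = 1
out = gt.copy(); out[25, 10:15] = 0
print(round(drd(out, gt), 4))   # 0.3125 for this toy example
```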
Below, we list the average values of the four indicators for the binarization results on DIBCO2012 [11], DIBCO2017 [15], and DIBCO2018 [16] obtained by the methods of [20,22,46,49,59,63,64,69,83] and by our method. The results are shown in Table 2, Table 3 and Table 4.
It is shown in Table 2 that Biswas et al.’s method [72] achieves the best results on the DIBCO2012 dataset [11]. The proposed model receives the second-best evaluation. Biswas et al.’s method [72] outperforms our model primarily due to their model’s significantly larger number of parameters (74,415,788), which is over eight times the number of parameters in our model (8,923,370).
From Table 3, it is evident that our straightforward end-to-end model outperforms the traditional methods [20,22,46] and the deep learning models [49,59,63,64,69,83] trained on the same dataset in terms of the mean values of the four common document image binarization metrics. Particularly noteworthy is that our single model also exceeds the ensemble model of Rezanezhad et al. [63], further underscoring the strength of our model. However, as shown in Table 4, our model does not exceed the models of Jemni et al. [69], Rezanezhad et al. [63], or Biswas et al. [72] on the DIBCO2018 dataset, and we therefore conducted an experimental analysis of this case.
From Table 4, our model performs worse than the model from [72] and the ensemble model [63] on four metrics for the DIBCO2018 [16] dataset. To find out the direct reason, we have listed the four evaluation values of the binarization results obtained for the proposed model for all ten document images in the DIBCO2018 [16] dataset in Table 5.
We find from the specific values in Table 5 that the binarizations of images 2-2018 and 10-2018 have lower PSNR values and smaller FM values. Then, we list the comparison between the binarization results of these two images obtained by our model, the original document image, and the GT, as shown in Figure 13.
From Figure 13b, we can see that the difference between our model’s binarization result for image 2-2018 and the true binarization lies mainly in the black area at the top of the image. The text part is almost identical to the GT; it is only the top black area that our model fails to segment as background content. As can be seen, our model is weak at learning large black areas far from the text area, which leads to its poorer results on the DIBCO2018 [16] dataset. The main reason for this phenomenon is that the training dataset contains few original document images with large dark areas.
Next, we compare the robustness of the model with other methods [20,22,46,63,72]. The experiment is conducted on the DIBCO2019 [17] dataset, which exhibits severe image damage. The deep learning models used are those trained for the DIBCO2017 [15] evaluation, since these are obtained with the smallest number of training images (162 pairs of labeled document images and their true binarization results). We again use the means of the twenty results for the four binarization metrics PSNR, FM, pFM, and DRD on the DIBCO2019 [17] dataset for quantitative comparison. The means for the various methods are shown in Table 6.
In Table 6, we present a fair and objective comparison of robustness between our model and the model of Rezanezhad et al. [63]. Notably, we provide two instances of the Rezanezhad et al. model to ensure fairness. This distinction is essential because their model is an ensemble of several trained models with approximately 37.0M parameters, whereas ours has around 8.9M parameters, a significant difference in scale. Detailed experiments in [88] verify that appropriately adjusting a network’s depth can preserve the good performance of the original model. Consequently, we adopt the parameter reduction technique proposed in [88] to adjust the parameter configuration of Rezanezhad et al.’s model, resulting in a parameter count of approximately 9.0M, aligning it with the scale of our model. We then train the adjusted model under identical conditions to ensure a fair comparison. The four evaluation values for the model of [72] are taken directly from its paper and correspond to its results on the DIBCO2019 [17] dataset. Even though that model is more than eight times the size of the proposed model, its evaluation results are not as good as ours; the reason may be that the training data are insufficient for a model of that size.
From Table 6 it is evident that the Otsu [20] method has the worst numerical performance for the binarization metrics PSNR and DRD. Although Xiong et al. [46] provided the champion algorithm of the year on the DIBCO2018 [16] dataset, their FM and pFM numerical results on the DIBCO2019 [17] dataset are the worst, indicating subpar robustness of the method. The model-37.0M [63] is optimal in Table 6 for the three averaged indicators for the DIBCO2019 [17] dataset. This is mainly attributed to its extensive parameter count, rendering it the most robust. Our model ranks second for the mean values of PSNR and FM for all images of the DIBCO2019 [17] dataset, and only the pFM index is inferior to that of the model-9.0M [63]. As for the model of Biswas et al.-74.4M [72], its performance is not as good as that on other DIBCO datasets, mainly due to the influence of images with different styles in the DIBCO2019 dataset. It can be inferred that the model in this paper is superior to the model of Rezanezhad et al. [63] with a comparable parameter count. Regarding the binarization results for DIBCO2019 [17] obtained by our model, it is observed that for one image (8-2019-a), the binarization result hardly captures any real text information, resulting in notably low evaluation index values. Refer to Figure 14.
As shown in Figure 14, our binarization model performs poorly on images with light font colors (8-2019-a), resulting in a pFM value about 33 lower than that of the model-9.0M [63]. This directly leads to the lower pFM index of the proposed model on the DIBCO2019 [17] dataset.

5. Ablation Experiment

In this subsection, we analyze the effect of the core module’s role in the model and its corresponding channel settings on the network performance. All ablation experiments are validated on the DIBCO2017 [15] dataset.

5.1. Experiments on the Performance of the MobileViT Block

The first set of ablation experiments explores the number of MobileViT blocks in the proposed model. The experiment is completed by changing the number of these modules, replacing the four MobileViT blocks in the network with MV2 blocks one by one while keeping all other variables the same. We use zero, one, two, three, and four MobileViT blocks to train the proposed architecture, and the resulting five models are denoted as M0, M1, M2, M3 (the proposed model), and M4. The five models are used to binarize the DIBCO2017 [15] dataset, and their respective average evaluation metric values (FM, pFM, PSNR, and DRD) are calculated. The results are recorded in Table 7.
In Table 7, we observe a gradual increase in the average PSNR value from M0 to M1, M2, and M3, followed by a decrease from M3 to M4. This trend suggests that with fewer than four MobileViT blocks, the similarity between the binarization results obtained through network training and the true binarization improves as the number of MobileViT blocks increases. As we move from M0 to M1, M2, M3, and finally M4, the values of FM and pFM exhibit a progression from small to large, reaching a peak at M3 and then decreasing (DRD shows a similar trend). Ultimately, these four evaluation indicators achieve their optimal levels for M3 (the proposed model): at least locally. Therefore, our decision to employ three MobileViT blocks is justified.

5.2. Experiments on the Number of Channels in the MobileViT Blocks

The second ablation experiment aims to investigate the impact of varying the number of channels in the three MobileViT blocks on model performance. Specifically, we utilize 32, 64, and 96 channels for the first block, 40, 80, and 96 channels for the second block, and 48, 96, and 144 channels for the third block during model training. The channel numbers employed in this study are sequentially set as 64, 80, and 96. We control the number of channels in the first two modules while experimenting with different channel numbers in the third module. The average FM and PSNR values of the resulting binarization outcomes with varying channel numbers are depicted in line charts within Figure 15.
From Figure 15, it is evident that both the FM and PSNR values reach their peak when the number of channels is set to the intermediate value for each MobileViT block. Thus, the selection of the intermediate number of channels for our proposed model is reasonable.

6. Conclusions

This paper implements document image binarization by integrating the MobileViT block into the U-Net architecture. The proposed model has only one-eighth the number of parameters of the model by Biswas et al. [72], which achieves the best evaluations on most of the DIBCO datasets; compared with the similar model by Rezanezhad et al. [63], our model has 76% fewer parameters. Demonstrating superior performance on the DIBCO2017 dataset [15], our model notably improves the common evaluation metrics FM, pFM, PSNR, and DRD. To assess the model’s robustness, we conducted document image binarization on the DIBCO2019 dataset [17], comparing our model with Biswas et al.’s model [72], Rezanezhad et al.’s model [63] at two different parameter scales, and a classic traditional method. By comparing the mean values of the four evaluation metrics across the different methods, we confirm the robustness of our proposed model. Finally, to analyze the role of the core module in our proposed model and the impact of its channel settings on network performance, we conducted two sets of ablation experiments, validating the rationale behind our proposed model through comparative analysis of the experimental results.

Author Contributions

Conceptualization, L.Z. and Y.W.; methodology, L.Z. and K.W.; software, L.Z. and K.W.; validation, L.Z.; formal analysis, L.Z.; investigation, L.Z.; resources, L.Z.; data curation, L.Z.; writing—original draft preparation, L.Z.; writing—review and editing, L.Z. and K.W.; visualization, L.Z.; supervision, Y.W.; project administration, Y.W.; funding acquisition, L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Pan, Y.F.; Hou, X.; Liu, C.L. Text Localization in Natural Scene Images Based on Conditional Random Field. In Proceedings of the 2009 10th International Conference on Document Analysis and Recognition, Barcelona, Spain, 26–29 July 2009; pp. 6–10. [Google Scholar] [CrossRef]
  2. Gupta, M.R.; Jacobson, N.P.; Garcia, E.K. OCR binarization and image pre-processing for searching historical documents. Pattern Recognit. 2007, 40, 389–397. [Google Scholar] [CrossRef]
  3. Saabni, R.; Asi, A.; El-Sana, J. Text line extraction for historical document images. Pattern Recognit. Lett. 2014, 35, 23–33. [Google Scholar] [CrossRef]
  4. He, S.; Wiering, M.; Schomaker, L. Junction detection in handwritten documents and its application to writer identification. Pattern Recognit. 2015, 48, 4036–4048. [Google Scholar] [CrossRef]
  5. Giotis, A.P.; Sfikas, G.; Gatos, B.; Nikou, C. A survey of document image word spotting techniques. Pattern Recognit. 2017, 68, 310–332. [Google Scholar] [CrossRef]
  6. Kumar, G.; Bhatia, P.K. A Detailed Review of Feature Extraction in Image Processing Systems. In Proceedings of the 2014 Fourth International Conference on Advanced Computing & Communication Technologies, Rohtak, India, 8–9 February 2014; pp. 5–12. [Google Scholar] [CrossRef]
  7. Smith, R.W. An Overview of the Tesseract OCR Engine. In Proceedings of the Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Curitiba, Paraná, Brazil, 23–26 September 2007; Volume 2, pp. 629–633. [Google Scholar] [CrossRef]
  8. Gatos, B.; Ntirogiannis, K.; Pratikakis, I. ICDAR 2009 Document Image Binarization Contest (DIBCO 2009). In Proceedings of the 2009 10th International Conference on Document Analysis and Recognition, Barcelona, Spain, 26–29 July 2009; pp. 1375–1382. [Google Scholar] [CrossRef]
  9. Pratikakis, I.; Gatos, B.; Ntirogiannis, K. H-DIBCO 2010—Handwritten Document Image Binarization Competition. In Proceedings of the 2010 12th International Conference on Frontiers in Handwriting Recognition, Kolkata, India, 16–18 November 2010; pp. 727–732. [Google Scholar] [CrossRef]
  10. Pratikakis, I.; Gatos, B.; Ntirogiannis, K. ICDAR 2011 Document Image Binarization Contest (DIBCO 2011). In Proceedings of the 2011 International Conference on Document Analysis and Recognition, Beijing, China, 18–21 September 2011; pp. 1506–1510. [Google Scholar] [CrossRef]
  11. Pratikakis, I.; Gatos, B.; Ntirogiannis, K. ICFHR 2012 Competition on Handwritten Document Image Binarization (H-DIBCO 2012). In Proceedings of the 2012 International Conference on Frontiers in Handwriting Recognition, Bari, Italy, 18–20 September 2012; pp. 817–822. [Google Scholar] [CrossRef]
  12. Pratikakis, I.; Gatos, B.; Ntirogiannis, K. ICDAR 2013 Document Image Binarization Contest (DIBCO 2013). In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition, Washington, DC, USA, 25–28 August 2013; pp. 1471–1476. [Google Scholar] [CrossRef]
  13. Ntirogiannis, K.; Gatos, B.; Pratikakis, I. ICFHR2014 Competition on Handwritten Document Image Binarization (H-DIBCO 2014). In Proceedings of the 2014 14th International Conference on Frontiers in Handwriting Recognition, Crete Island, Greece, 1–4 September 2014; pp. 809–813. [Google Scholar] [CrossRef]
  14. Pratikakis, I.; Zagoris, K.; Barlas, G.; Gatos, B. ICFHR2016 Handwritten Document Image Binarization Contest (H-DIBCO 2016). In Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), Shenzhen, China, 23–26 October 2016; pp. 619–623. [Google Scholar] [CrossRef]
  15. Pratikakis, I.; Zagoris, K.; Barlas, G.; Gatos, B. ICDAR2017 Competition on Document Image Binarization (DIBCO 2017). In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 1395–1403. [Google Scholar] [CrossRef]
  16. Pratikakis, I.; Zagori, K.; Kaddas, P.; Gatos, B. ICFHR 2018 Competition on Handwritten Document Image Binarization (H-DIBCO 2018). In Proceedings of the 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), Niagara Falls, NY, USA, 5–8 August 2018; pp. 489–493. [Google Scholar] [CrossRef]
  17. Pratikakis, I.; Zagoris, K.; Karagiannis, X.; Tsochatzidis, L.; Mondal, T.; Marthot-Santaniello, I. ICDAR 2019 Competition on Document Image Binarization (DIBCO 2019). In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 1547–1556. [Google Scholar] [CrossRef]
  18. Seuret, M.; Nicolaou, A.; Stutzmann, D.; Maier, A.; Christlein, V. ICFHR 2020 Competition on Image Retrieval for Historical Handwritten Fragments. In Proceedings of the 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), Dortmund, Germany, 7–10 September 2020; pp. 216–221. [Google Scholar] [CrossRef]
  19. Calvo-Zaragoza, J.; Gallego, A.J. A selectional auto-encoder approach for document image binarization. Pattern Recognit. 2019, 86, 37–47. [Google Scholar] [CrossRef]
  20. Otsu, N. A Threshold Selection Method from Gray-Level Histograms. IEEE Trans. Syst. Man Cybern. 1979, 9, 62–66. [Google Scholar] [CrossRef]
  21. Niblack, W. An Introduction to Digital Image Processing; Strandberg Publishing Company: Turku, Finland, 1986. [Google Scholar]
  22. Sauvola, J.; Pietikäinen, M. Adaptive document image binarization. Pattern Recognit. 2000, 33, 225–236. [Google Scholar] [CrossRef]
  23. Wolf, C.; Jolion, J.M. Extraction and recognition of artificial text in multimedia documents. Form. Pattern Anal. Appl. 2004, 6, 309–326. [Google Scholar] [CrossRef]
  24. Bernsen, J. Dynamic Thresholding of Grey-Level Images. In Proceedings of the ICPR’86, Eighth International Conference on Pattern Recognition, Paris, France, 27–31 October 1986; pp. 1251–1255. [Google Scholar]
  25. Gatos, B.; Pratikakis, I.; Perantonis, S. Adaptive degraded document image binarization. Pattern Recognit. 2006, 39, 317–327. [Google Scholar] [CrossRef]
26. Khurshid, K.; Siddiqi, I.; Faure, C.; Vincent, N. Comparison of Niblack Inspired Binarization Methods for Ancient Documents. In Electronic Imaging; SPIE: Bellingham, WA, USA, 2009. [Google Scholar]
  27. Jiang, L.; Chen, K.; Yan, S.; Zhou, Y.; Guan, H. Adaptive Binarization for Degraded Document Images. In Proceedings of the 2009 International Conference on Information Engineering and Computer Science, Wuhan, China, 19–20 December 2009; pp. 1–4. [Google Scholar] [CrossRef]
  28. Bataineh, B.; Abdullah, S.N.H.S.; Omar, K. An adaptive local binarization method for document images based on a novel thresholding method and dynamic windows. Pattern Recognit. Lett. 2011, 32, 1805–1813. [Google Scholar] [CrossRef]
  29. Su, B.; Lu, S.; Tan, C.L. Robust Document Image Binarization Technique for Degraded Document Images. IEEE Trans. Image Process. 2013, 22, 1408–1417. [Google Scholar] [CrossRef]
  30. Hadjadj, Z.; Meziane, A.; Cherfa, Y.; Cheriet, M.; Setitra, I. ISauvola: Improved Sauvola’s Algorithm for Document Image Binarization. Image Anal. Recognit. 2016, 9730, 737–745. [Google Scholar] [CrossRef]
  31. Mustafa, W.A.; Kader, M.M.M.A. Binarization of Document Image Using Optimum Threshold Modification. J. Phys. Conf. Ser. 2018, 1019, 012022. [Google Scholar] [CrossRef]
  32. Zemouri, E.T.; Chibani, Y.; Brik, Y. Enhancement of Historical Document Images by Combining Global and Local Binarization Technique. Int. J. Inf. Eng. Electron. Bus. 2014, 4, 1. [Google Scholar] [CrossRef]
  33. Ntirogiannis, K.; Gatos, B.; Pratikakis, I. A combined approach for the binarization of handwritten document images. Pattern Recognit. Lett. 2014, 35, 3–15. [Google Scholar] [CrossRef]
  34. Chaudhary, P.; Ambedkar, B. An effective and robust technique for the binarization of degraded document images. Int. J. Res. Eng. Technol. 2014, 3, 140–145. [Google Scholar]
35. Saddami, K.; Arnia, F.; Away, Y.; Munadi, K. Kombinasi Metode Nilai Ambang Lokal dan Global untuk Restorasi Dokumen Jawi Kuno [Combining Local and Global Thresholding Methods for the Restoration of Ancient Jawi Documents]. J. Teknol. Inf. Dan Ilmu Komput. 2020, 7, 163–170. [Google Scholar] [CrossRef]
  36. Lu, S.; Su, B.; Tan, C. Document image binarization using background estimation and stroke edges. Int. J. Doc. Anal. Recognit. 2010, 13, 303–314. [Google Scholar] [CrossRef]
  37. Santhanaprabhu, G.; Karthick, B.; Srinivasan, P.; Vignesh, R.; Sureka, K. Extraction and Document Image Binarization Using Sobel Edge Detection. J. Eng. Res. Appl. 2014, 4, 15–21. [Google Scholar]
  38. Lelore, T.; Bouchara, F. FAIR: A Fast Algorithm for Document Image Restoration. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2039–2048. [Google Scholar] [CrossRef]
  39. Holambe, S.; Shinde, U.; Choudhari, B. Image Binarization for Degraded Document Images. Int. J. Comput. Appl. 2015, 128, 38–43. [Google Scholar]
  40. Jia, F.; Shi, C.; He, K.; Wang, C.; Xiao, B. Document Image Binarization Using Structural Symmetry of Strokes. In Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), Shenzhen, China, 23–26 October 2016; pp. 411–416. [Google Scholar]
  41. Lai, A.N.; Lee, G. Binarization by Local k-Means Clustering for Korean Text Extraction. In Proceedings of the 2008 IEEE International Symposium on Signal Processing and Information Technology, Sarajevo, Bosnia and Herzegovina, 16–19 December 2008; pp. 117–122. [Google Scholar]
  42. Tong, L.J.; Chen, K.; Zhang, Y.; Fu, X.L.; Duan, J.Y. Document Image Binarization Based on NFCM. In Proceedings of the 2009 2nd International Congress on Image and Signal Processing, Tianjin, China, 17–19 October 2009; pp. 1–5. [Google Scholar]
  43. Biswas, B.; Bhattacharya, U.; Chaudhuri, B.B. A Global-to-Local Approach to Binarization of Degraded Document Images. In Proceedings of the 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; pp. 3008–3013. [Google Scholar]
44. Soua, M.; Kachouri, R.; Akil, M. GPU parallel implementation of the new hybrid binarization based on Kmeans method (HBK). J. Real-Time Image Process. 2018, 14, 363–377. [Google Scholar] [CrossRef]
  45. Annabestani, M.; Saadatmand-Tarzjan, M. A new threshold selection method based on fuzzy expert systems for separating text from the background of document images. Iran. J. Sci. Technol. Trans. Electr. Eng. 2019, 43, 219–231. [Google Scholar] [CrossRef]
46. Xiong, W.; Zhou, L.; Yue, L.; Li, L.; Wang, S. An enhanced binarization framework for degraded historical document images. EURASIP J. Image Video Process. 2021, 2021, 13. [Google Scholar] [CrossRef]
  47. Pastor-Pellicer, J.; España-Boquera, S.; Zamora-Martínez, F.; Afzal, M.Z.; Castro-Bleda, M.J. Insights on the Use of Convolutional Neural Networks for Document Image Binarization. In Advances in Computational Intelligence; Rojas, I., Joya, G., Catala, A., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 115–126. [Google Scholar]
  48. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  49. Tensmeyer, C.; Martinez, T. Document image binarization with fully convolutional neural networks. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 99–104. [Google Scholar]
  50. Calvo-Zaragoza, J.; Vigliensoni, G.; Fujinaga, I. Pixel-Wise Binarization of Musical Documents with Convolutional Neural Networks. In Proceedings of the 2017 Fifteenth IAPR International Conference on Machine Vision Applications (MVA), Nagoya, Japan, 8–12 May 2017; pp. 362–365. [Google Scholar] [CrossRef]
  51. Vo, Q.N.; Kim, S.H.; Yang, H.J.; Lee, G. Binarization of degraded document images based on hierarchical deep supervised network. Pattern Recognit. 2018, 74, 568–586. [Google Scholar] [CrossRef]
52. Ma, K.; Shu, Z.; Bai, X.; Wang, J.; Samaras, D. DocUNet: Document Image Unwarping via a Stacked U-Net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4700–4709. [Google Scholar]
  53. He, S.; Schomaker, L. DeepOtsu: Document enhancement and binarization using iterative deep learning. Pattern Recognit. 2019, 91, 379–390. [Google Scholar] [CrossRef]
  54. Bezmaternykh, P.; Ilin, D.; Nikolaev, D. U-Net-bin: Hacking the document image binarization contest. Comput. Opt. 2019, 43, 825–832. [Google Scholar] [CrossRef]
  55. Ayyalasomayajula, K.R.; Malmberg, F.; Brun, A. PDNet: Semantic segmentation integrated with a primal-dual network for document binarization. Pattern Recognit. Lett. 2019, 121, 52–60. [Google Scholar]
  56. Huang, X.; Li, L.; Liu, R.; Xu, C.; Ye, M. Binarization of degraded document images with global-local U-Nets. Optik 2020, 203, 164025. [Google Scholar]
57. Xiong, W.; Jia, X.; Yang, D.; Ai, M.; Li, L.; Wang, S. DP-LinkNet: A convolutional network for historical document image binarization. KSII Trans. Internet Inf. Syst. 2021, 15, 1778–1797. [Google Scholar]
  58. Xiong, W.; Yue, L.; Zhou, L.; Wei, L.; Li, M. FD-Net: A Fully Dilated Convolutional Network for Historical Document Image Binarization. In Pattern Recognition and Computer Vision, Proceedings of the 4th Chinese Conference, PRCV 2021, Beijing, China, 29 October–1 November 2021; Proceedings, Part I 4; Springer: Berlin/Heidelberg, Germany, 2021; pp. 518–529. [Google Scholar]
  59. Kang, S.; Iwana, B.K.; Uchida, S. Complex image processing with less data—Document image binarization by integrating multiple pre-trained U-Net modules. Pattern Recognit. 2021, 109, 107577. [Google Scholar] [CrossRef]
  60. Dey, A.; Das, N.; Nasipuri, M. Variational Augmentation for Enhancing Historical Document Image Binarization. arXiv 2022, arXiv:2211.06581. [Google Scholar]
  61. Yang, Z.; Xiong, Y.; Wu, G. GDB: Gated convolutions-based Document Binarization. arXiv 2023, arXiv:2302.02073. [Google Scholar] [CrossRef]
  62. Zhao, P.; Wang, W.; Zhang, G.; Lu, Y. Alleviating pseudo-touching in attention U-Net-based binarization approach for the historical Tibetan document images. Neural Comput. Appl. 2023, 35, 13791–13802. [Google Scholar] [CrossRef]
63. Rezanezhad, V.; Baierer, K.; Neudecker, C. A Hybrid CNN-Transformer Model for Historical Document Image Binarization. In Proceedings of the HIP ’23: 7th International Workshop on Historical Document Imaging and Processing, San Jose, CA, USA, 25–26 August 2023; pp. 79–84. [Google Scholar]
  64. Zhao, J.; Shi, C.; Jia, F.; Wang, Y.; Xiao, B. Document image binarization with cascaded generators of conditional generative adversarial networks. Pattern Recognit. 2019, 96, 106968. [Google Scholar] [CrossRef]
  65. De, R.; Chakraborty, A.; Sarkar, R. Document Image Binarization Using Dual Discriminator Generative Adversarial Networks. IEEE Signal Process. Lett. 2020, 27, 1090–1094. [Google Scholar] [CrossRef]
  66. Souibgui, M.A.; Kessentini, Y. DE-GAN: A Conditional Generative Adversarial Network for Document Enhancement. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 1180–1191. [Google Scholar] [CrossRef]
  67. Kumar, A.; Ghose, S.; Chowdhury, P.N.; Roy, P.P.; Pal, U. UDBNET: Unsupervised Document Binarization Network via Adversarial Game. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 7817–7824. [Google Scholar]
  68. Suh, S.; Kim, J.; Lukowicz, P.; Lee, Y.O. Two-stage generative adversarial networks for binarization of color document images. Pattern Recognit. 2022, 130, 108810. [Google Scholar] [CrossRef]
  69. Jemni, S.K.; Souibgui, M.A.; Kessentini, Y.; Fornés, A. Enhance to read better: A multi-task adversarial network for handwritten document image enhancement. Pattern Recognit. 2022, 123, 108370. [Google Scholar] [CrossRef]
70. Rajesh, B.; Agrawal, M.K.; Bhuva, M.; Kishore, K.; Javed, M. Document Image Binarization in JPEG Compressed Domain using Dual Discriminator Generative Adversarial Networks. In Computer Vision and Machine Intelligence, Proceedings of the CVMI 2022, Allahabad, India, August 2022; Springer: Berlin/Heidelberg, Germany, 2023; pp. 761–774. [Google Scholar]
  71. Fathallah, A.; El Yacoubi, M.; Amara, N.B. EHDI: Enhancement of Historical Document Images via Generative Adversarial Network. In Proceedings of the 18th International Conference on Computer Vision Theory and Applications (VISAPP), Lisbon, Portugal, 19–21 February 2023; Volume 4, pp. 238–245. [Google Scholar]
  72. Biswas, R.; Roy, S.; Pal, U. A Layer-Wise Tokens-to-Token Transformer Network for Improved Historical Document Image Enhancement. arXiv 2023, arXiv:2312.03946. [Google Scholar] [CrossRef]
  73. Guo, Y.; Ji, C.; Zheng, X.; Wang, Q.; Luo, X. Multi-scale multi-attention network for moiré document image binarization. Signal Process. Image Commun. 2021, 90, 116046. [Google Scholar] [CrossRef]
  74. Pandey, S.; Bharti, J. Document Enhancement and Binarization Using Deep Learning Approach. In Proceedings of the Third International Conference on Intelligent Computing, Information and Control Systems: ICICCS 2021, Secunderabad, India, 6–8 May 2021; pp. 133–145. [Google Scholar]
  75. Peng, X.; Wang, C.; Cao, H. Document Binarization via Multi-Resolutional Attention Model with DRD Loss. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 45–50. [Google Scholar]
76. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  77. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  78. Nikitin, F.; Dokholyan, V.; Zharikov, I.; Strijov, V. U-Net Based Architectures for Document Text Detection and Binarization. In Advances in Visual Computing, Proceedings of the 14th International Symposium on Visual Computing, Lake Tahoe, NV, USA, 7–9 October 2019; Springer International Publishing: Cham, Switzerland, 2019. [Google Scholar] [CrossRef]
  79. Detsikas, N.; Mitianoudis, N.; Papamarkos, N. A Dilated MultiRes Visual Attention U-Net for historical document image binarization. Signal Process. Image Commun. 2024, 122, 117102. [Google Scholar] [CrossRef]
  80. Dumoulin, V.; Visin, F. A guide to convolution arithmetic for deep learning. arXiv 2018, arXiv:1603.07285. [Google Scholar]
  81. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
  82. Mehta, S.; Rastegari, M. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
  83. Souibgui, M.A.; Biswas, S.; Jemni, S.K.; Kessentini, Y.; Fornés, A.; Lladós, J.; Pal, U. Docentr: An End-to-End Document Image Enhancement Transformer. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 1699–1705. [Google Scholar]
  84. Burie, J.C.; Coustaty, M.; Hadi, S.; Kesiman, M.W.A.; Ogier, J.M.; Paulus, E.; Sok, K.; Sunarya, I.M.G.; Valy, D. ICFHR2016 Competition on the Analysis of Handwritten Text in Images of Balinese Palm Leaf Manuscripts. In Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), Shenzhen, China, 23–26 October 2016; pp. 596–601. [Google Scholar] [CrossRef]
  85. Ayatollahi, S.M.; Ziaei Nafchi, H. Persian Heritage Image Binarization Competition (PHIBC 2012). In Proceedings of the 2013 First Iranian Conference on Pattern Recognition and Image Analysis (PRIA), Birjand, Iran, 6–8 March 2013; pp. 1–4. [Google Scholar] [CrossRef]
  86. Nicolaou, A.; Christlein, V.; Riba, E.; Shi, J.; Vogeler, G.; Seuret, M. TorMentor: Deterministic Dynamic-Path, Data Augmentations with Fractals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2707–2711. [Google Scholar]
  87. Lu, H.; Kot, A.; Shi, Y. Distance-reciprocal distortion measure for binary document images. IEEE Signal Process. Lett. 2004, 11, 228–231. [Google Scholar] [CrossRef]
88. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019. [Google Scholar]
Figure 1. Two document images and their resulting binarized images. (a) Original document images 1-2018 and 5-2018 from the DIBCO2018 dataset [16]; (b) binarization results obtained by the method in [19].
Figure 2. A few document images: the left column is 3-2019-a, 15-2017, and 8-2016; the right column is 19-2019-b and 6-2019-a; they are from the DIBCO2019 [17], DIBCO2017 [15], and DIBCO2016 [14] datasets.
Figure 3. Comparison of a document (image 12-2017 of the DIBCO2017 [15] dataset) with its true binarization result and the results obtained by two binarization methods: (a) original color image, (b) true binarization result (GT), (c) binarization result of Xiong et al. [46], and (d) binarization result of Rezanezhad et al. [63].
Figure 4. A comparison of a document (image 19-2017 (DIBCO2017 [15])) and the results obtained by GT and two binarization methods: (a) original color image, (b) true binarization result (GT), (c) binarization result of Calvo and Gallego [19], and (d) binarization result of Rezanezhad et al. [63].
Figure 5. Comparison of the receptive field effects of ResNet and Transformer before and after training for the central element. The transition from light yellow to dark green signifies the increasing influence of the receptive field region on the central element. (a) Receptive field before training for ResNet. (b) Receptive field after ResNet training. (c) Receptive field before training for Transformer. (d) Receptive field after Transformer training.
Figure 6. The overall architecture of the proposed document image binarization model. The left segment primarily performs feature extraction through convolution, MV2, and MobileViT block operations, coupled with subsampling. The self-attention mechanism module is situated at the base of the model. Five skip connections interlink the left and right halves. Within the right segment, the image size is restored mainly through upsampling convolution operations, culminating in the final binary grayscale image via the softmax activation function. Annotations above or adjacent to the arrows describe the functions of the individual operations.
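To make the data flow described in the Figure 6 caption concrete, the following is a minimal PyTorch-style sketch of a U-Net-like encoder–decoder with a simplified MobileViT-style block. The channel widths, number of stages (only two here, rather than the five skip connections of the actual model), and the internals of the block are illustrative assumptions and do not reproduce the authors' exact architecture.

```python
# Minimal sketch (not the authors' code) of a U-Net-style binarization network
# with a simplified MobileViT-style block: local conv features -> global
# self-attention over flattened tokens -> conv fusion.
import torch
import torch.nn as nn

class MobileViTBlockSketch(nn.Module):
    def __init__(self, channels, heads=4):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 3, padding=1)      # local features
        self.attn = nn.TransformerEncoderLayer(d_model=channels, nhead=heads,
                                               batch_first=True)       # global attention
        self.fuse = nn.Conv2d(2 * channels, channels, 1)               # fuse input + global

    def forward(self, x):
        y = self.local(x)
        b, c, h, w = y.shape
        tokens = y.flatten(2).transpose(1, 2)          # (B, H*W, C) token sequence
        tokens = self.attn(tokens)
        y = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([x, y], dim=1))

def down(cin, cout):   # strided conv for subsampling
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.SiLU())

def up(cin, cout):     # transposed conv for upsampling
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 2, stride=2),
                         nn.BatchNorm2d(cout), nn.SiLU())

class BinarizationNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = down(3, 32)
        self.enc2 = nn.Sequential(down(32, 64), MobileViTBlockSketch(64))
        self.bott = MobileViTBlockSketch(64)           # self-attention at the base
        self.dec2 = up(64 + 64, 32)                    # skip connection from enc2
        self.dec1 = up(32 + 32, 16)                    # skip connection from enc1
        self.head = nn.Conv2d(16, 2, 1)                # 2 classes: text / background

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        b = self.bott(e2)
        d2 = self.dec2(torch.cat([b, e2], dim=1))
        d1 = self.dec1(torch.cat([d2, e1], dim=1))
        return torch.softmax(self.head(d1), dim=1)     # per-pixel class probabilities

if __name__ == "__main__":
    out = BinarizationNetSketch()(torch.randn(1, 3, 128, 128))
    print(out.shape)   # expected: torch.Size([1, 2, 128, 128])
```

As in the caption, downsampling happens on the encoder side, global self-attention sits at the base, and the decoder restores the image size with transposed convolutions before the softmax output.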
Figure 7. Comparison of image H01-2012, its GT, and the results obtained by different binarization methods. Some differences between the binarized results and GT are highlighted using red boxes. Binarization results of (a) Otsu [20], (b) Sauvola [22], (c) Xiong et al. [46], (d) Rezanezhad et al. [63], (e) Calvo and Gallego [19], and (f) the proposed method.
Figure 8. Comparison of image H11-2012, its GT, and the results obtained by different binarization methods. Some differences between the binarized results and GT are highlighted using red boxes. Binarization results of (a) Xiong et al. [46], (b) Sauvola [22], (c) Calvo and Gallego [19], (d) Rezanezhad et al. [63], (e) Biswas et al. [72], and (f) the proposed method.
Figure 9. Comparison plots of image 19-2017, its GT, and the results obtained by different binarization methods. Binarization results of (a) Otsu [20], (b) Sauvola [22], (c) Xiong et al. [46], (d) Rezanezhad et al. [63], (e) Calvo and Gallego [19], and (f) the proposed method.
Figure 10. Comparison plots of image 18-2017, its GT, and the results obtained by different binarization methods. Binarization results of (a) Xiong et al. [46], (b) Sauvola [22], (c) Calvo and Gallego [19], (d) Rezanezhad et al. [63], (e) Biswas et al. [72], and (f) the proposed method.
Figure 11. Comparison plots of image 4-2018, its GT, and the results obtained by different binarization methods. Binarization results of (a) Otsu [20], (b) Sauvola [22], (c) Xiong et al. [46], (d) Rezanezhad et al. [63], (e) Calvo and Gallego [19], and (f) the proposed method.
Figure 12. Image 5-2018, its GT, and a comparison of the results obtained by different binarization methods. Binarization results of (a) Otsu [20], (b) Sauvola [22], (c) Xiong et al. [46], (d) Rezanezhad et al. [63], (e) Calvo and Gallego [19], and (f) the proposed method.
Figure 13. Images 2-2018 and 10-2018 and their GTs compared with the binarized results of our method: (a) GT and (b) binarization result of our method for image 2-2018; (c) GT and (d) binarization result of our method for image 10-2018.
Figure 14. Comparison of image 8-2019-a, its GT, and the binarized results obtained by two methods. (a) Binarization result of model [63]-9.0M (pFM = 39.69); (b) binarization result of the proposed model (pFM = 6.76).
Figure 15. FM and PSNR line plots obtained using different numbers of channels in the MobileViT blocks.
Table 1. Predicted values and ground truth.

Predicted Value \ Ground Truth    Positive (1)    Negative (0)
Positive (1)                      TP              FP
Negative (0)                      FN              TN
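For context, the FM and PSNR columns in Tables 2–7 follow directly from the counts in Table 1. The snippet below is a hedged sketch using the standard DIBCO-style definitions (F-measure from precision and recall; PSNR with C = 1 for images normalized to {0, 1}); it is not the authors' evaluation code, and pFM and DRD, which additionally weight errors by the GT skeleton and by distance-reciprocal weights [87], are omitted.

```python
# Hedged sketch of the FM and PSNR metrics reported in Tables 2-7,
# computed from the TP/FP/FN counts defined in Table 1.
# Assumes binary arrays with 1 = foreground (text), 0 = background.
import numpy as np

def fm_and_psnr(pred: np.ndarray, gt: np.ndarray):
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()        # predicted text, truly text
    fp = np.logical_and(pred, ~gt).sum()       # predicted text, actually background
    fn = np.logical_and(~pred, gt).sum()       # missed text pixels
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fm = 100.0 * 2 * precision * recall / (precision + recall)   # F-measure in %
    mse = np.mean((pred.astype(float) - gt.astype(float)) ** 2)
    psnr = 10.0 * np.log10(1.0 / mse)          # C = 1 for {0, 1}-valued images
    return fm, psnr

# Toy usage: a perfect prediction except one spurious foreground pixel.
gt = np.zeros((4, 4), dtype=int); gt[1:3, 1:3] = 1
pred = gt.copy(); pred[0, 0] = 1
print(fm_and_psnr(pred, gt))   # FM ~ 88.9%, PSNR ~ 12.0 dB
```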
Table 2. Comparison of the results of different binarization methods and the proposed method on the DIBCO2012 [11] dataset. The best and second-best index values are plotted in red and blue, respectively.

Algorithm                 PSNR (dB)   FM (%)   pFM (%)   DRD
Otsu [20]                 15.03       80.18    82.65     26.45
Sauvola et al. [22]       16.71       82.89    87.95     6.59
Xiong et al. [46]         21.68       94.26    95.16     2.08
Kang et al. [59]          21.37       95.16    96.44     1.13
Tensmeyer et al. [49]     20.60       92.53    96.67     2.48
Zhao et al. [64]          21.91       94.96    96.15     1.55
Jemni et al. [69]         22.00       95.18    94.63     1.62
Souibgui et al. [83]      22.29       95.31    96.29     1.60
Model 1 [63]              23.16       96.02    97.31     1.14
Model 2 [63]              23.24       96.26    97.29     1.12
Model 3 [63]              23.24       96.25    97.51     1.12
Ensemble model [63]       23.27       96.25    97.58     1.11
Biswas et al. [72]        23.95       96.80    98.04     0.20
Proposed method           23.32       96.37    97.73     1.08
Table 3. Comparison of the results of different binarization methods and the proposed method on the DIBCO2017 [15] dataset. The best and second-best index values are plotted in red and blue, respectively.

Algorithm                 PSNR (dB)   FM (%)   pFM (%)   DRD
Otsu [20]                 13.83       77.73    77.89     15.54
Sauvola et al. [22]       14.25       77.11    84.1      8.85
Xiong et al. [46]         17.99       89.37    90.80     5.51
Kang et al. [59]          15.85       91.57    93.55     2.92
Winner Algorithm [15]     18.28       91.04    92.86     3.40
Zhao et al. [64]          17.83       90.73    92.58     3.58
Jemni et al. [69]         17.45       89.80    89.95     4.03
Souibgui et al. [83]      19.11       92.53    95.15     2.37
Model 1 [63]              18.99       92.50    95.05     2.49
Model 2 [63]              19.04       92.60    94.83     2.44
Ensemble Model [63]       19.04       93.01    95.42     2.29
Proposed method           19.29       93.23    95.90     2.22
Table 4. Comparison of the results of different binarization methods and the proposed method on the DIBCO2018 [16] dataset. The best and second-best index values are plotted in red and blue, respectively.

Algorithm                    PSNR (dB)   FM (%)   pFM (%)   DRD
Otsu [20]                    9.74        51.45    53.05     59.07
Sauvola et al. [22]          13.78       67.81    74.08     17.69
Xiong et al. [46] (Winner)   19.11       88.34    90.37     4.93
Kang et al. [59]             19.39       89.71    91.62     2.51
Zhao et al. [64]             18.37       87.73    90.60     4.58
Jemni et al. [69]            20.18       92.41    94.35     2.60
Souibgui et al. [83]         19.46       90.59    93.97     3.35
Model 1 [63]                 19.79       90.65    93.50     3.63
Model 2 [63]                 19.94       91.87    95.62     2.77
Model 3 [63]                 19.88       91.46    95.00     3.00
Ensemble Model [63]          20.29       92.47    95.99     2.50
Biswas et al. [72]           22.33       95.60    96.97     0.13
Proposed method              19.52       90.59    94.80     3.29
Table 5. Four evaluations of the results of the proposed method on each image of the DIBCO2018 [16] dataset.

Image Name   PSNR (dB)   FM (%)   pFM (%)   DRD
1-2018       20.23       89.96    97.52     3.18
2-2018       16.70       78.35    83.88     8.78
3-2018       18.40       95.38    99.05     1.5
4-2018       20.77       85.61    96.81     2.65
5-2018       19.02       90.17    93.96     4.25
6-2018       21.96       96.67    97.91     1.76
7-2018       22.86       93.21    94.87     2.25
8-2018       16.78       90.31    95.09     2.71
9-2018       23.39       97.01    97.44     1.23
10-2018      15.05       89.24    91.44     5.61
Average      19.52       90.59    94.80     3.39
Table 6. Comparison of the results of different binarization methods and the proposed method on the DIBCO2019 [17] dataset. The best and second-best index values are plotted in red and blue, respectively.

Algorithm                  PSNR (dB)   FM (%)   pFM (%)   DRD
Otsu [20]                  9.03        47.67    48.01     109.84
Sauvola et al. [22]        13.12       64.71    66.37     21.24
Xiong et al. [46]          11.84       46.61    47.06     24.13
Model-9.0M [63]            14.99       65.81    68.10     9.98
Model-37.0M [63]           15.64       72.07    73.46     7.72
Biswas et al.-74.4M [72]   14.49       65.70    67.82     0.29
Proposed Model-8.9M        15.25       65.92    66.20     8.97
Table 7. Comparison of results on the DIBCO2017 dataset [15] using zero, one, two, three (proposed), and four MobileViT block modules.

Name   PSNR (dB)   FM (%)   pFM (%)   DRD
M0     18.56       91.57    94.91     2.94
M1     18.84       92.45    95.37     2.55
M2     18.88       92.39    95.12     2.75
M3     19.29       93.23    95.99     2.22
M4     18.87       92.45    94.99     2.83