Article

Increasing Offline Handwritten Chinese Character Recognition Using Separated Pre-Training Models: A Computer Vision Approach

School of Computer Science, Sichuan University of Science and Engineering, Zigong 643000, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(15), 2893; https://doi.org/10.3390/electronics13152893
Submission received: 30 May 2024 / Revised: 14 July 2024 / Accepted: 19 July 2024 / Published: 23 July 2024
(This article belongs to the Special Issue Recent Advances in Image Processing and Computer Vision)

Abstract

Offline handwritten Chinese character recognition applies computer vision techniques to recognize individual handwritten Chinese characters, and it has also significantly advanced research on online handwriting recognition. Despite its widespread application across various fields, offline recognition faces numerous challenges. These include the diversity of glyphs resulting from different writers' styles and habits, the vast number of Chinese character labels, and the morphological similarity among characters. To address these challenges, an optimization method based on separated pre-training models is proposed. The method aims to enhance the accuracy and robustness of recognizing similar character images by exploring potential correlations among them. In the experiments, the HWDB and Chinese Calligraphy Styles by Calligraphers datasets were employed, with precision, recall, and Macro-F1 as the evaluation metrics. A convolutional auto-encoder model characterized by high recognition accuracy and robust performance was used as the base model. The experimental results demonstrate that the separated pre-training models improve the performance of the convolutional auto-encoder model, particularly in handling error-prone characters, yielding an approximate 6% increase in precision.

1. Introduction

Handwritten Chinese character recognition is a crucial aspect of image text recognition, increasingly demanded in fields such as medicine and education [1]. As a core component of image text recognition, offline handwritten Chinese character recognition (HCCR) aims to accurately extract character content from handwritten text images, convert it into structured information, and facilitate efficient data processing and analysis [2]. Integrating HCCR model research with real-time processing systems through transfer learning can enhance real-time handwritten text recognition and advance research in this field [3]. However, the rapid development of digital technology and the diversity of user writing habits have resulted in variations in character morphology and complex strokes in handwritten Chinese character images, presenting significant challenges to the recognition process [4].
In recent years, the diversification of recognition requirements across various application scenarios has driven the evolution of the structure and standards of Optical Character Recognition (OCR) models [5]. Deep learning techniques have achieved remarkable results in OCR applications, particularly methods based on auto-encoders, attention mechanisms [6], and sequence recognition techniques [7]. However, in large-scale handwritten Chinese character recognition tasks, classification errors frequently occur due to the high glyph similarity of handwritten characters. Feature-similar images and similar character images pose significant challenges in image classification tasks and are central to the ongoing research on model optimization.
To address the error-prone recognition of similar handwritten Chinese character images, a model specifically designed for recognizing handwritten Chinese characters is essential. The developed system is based on the traditional Convolutional Auto-Encoder (CAE) model. To improve the model's generalization ability, its parameters and loss function were tuned so that limited recognition accuracy would not interfere with subsequent research [8]. On this basis, this paper thoroughly studies the key features of users' handwritten Chinese characters and combines them with deep learning techniques to propose Separated Pre-Training (SPT) models that optimize the HCCR model. The main contributions of this study are as follows:
  • We normalized the image size using bilinear interpolation to ensure uniformity across the dataset. The key features of similar character images were extracted and simplified using UMAP for dimensionality reduction combined with deep learning techniques. We further refined the feature space using an improved K-means clustering algorithm that dynamically reduces the number of labels through an adaptive threshold. An in-depth analysis of the clustered labels using a probability density function yielded a dataset comprising distinct, yet similar, characters.
  • We proposed a design scheme for a separated pre-training model to address the recognition errors in datasets containing similar character images. The method redesigns the structure and loss function of the CAE model by employing larger convolutional kernels instead of smaller ones and constructing multiple convolutional layers into a convolutional block. It then repeatedly trains the model on datasets containing error-prone characters to enhance the recognition accuracy.
This paper is organized as follows: Section 2 reviews the existing research on Chinese character image recognition and describes several widely used datasets. Section 3 details the algorithm design of this study, focusing on the probability analysis for character data after feature extraction and introducing a comprehensive evaluation index for the model. Section 4 outlines the experiments, which include the model comparison experiments and experiments on handwritten Chinese character recognition using separated pre-training models. Section 5 provides the summary and conclusions of this study, discussing and generalizing the research ideas and findings.

2. Related Work

2.1. Handwritten Chinese Character Recognition Method

Wang T et al. [9] proposed a new Radical Aggregation Network (RAN) for offline handwritten Chinese character recognition. This approach can accurately recognize both learned and unseen handwritten characters in a small-sample learning and training environment. Bi N et al. [10] conducted an in-depth study on the application of GoogleNet to the HCL2000 and CASIA-HWDB datasets. They adapted and optimized this end-to-end method, addressing the excessive number of labels in the offline handwritten character recognition dataset. By introducing a batch normalization layer, they achieved an accuracy of over 99% in the optimized final version. To mitigate the “long-tail effect” in handwritten character recognition, Diao X et al. [11] used radical information of Chinese characters to decompose and reconstruct them based on orthographic principles. They developed a feature extractor that simultaneously recognizes candidate radicals and structural relationships from the input character image, employing knowledge graph reasoning to identify target characters. Simulation experiments across multiple datasets confirmed the enhanced accuracy of this feature extractor.
In 2019, Chen X et al. [12] introduced a self-encoder for image recognition, termed the Adaptive Embedding Gate (AEG) module. Integrating the AEG module into the self-encoder eliminates the recognition complexity associated with text distortion and simplifies the overall recognition difficulty. In 2020, Ghanim proposed a multilevel cascade system for offline Arabic handwriting recognition that significantly reduces the complexity of the handwriting recognition process. By integrating this system with effective classifiers such as the Convolutional Neural Network (CNN) or the Support Vector Machine (SVM), the recognition efficiency can be greatly enhanced [13]. Kobayashi Y proposed a system that utilizes Tesseract, known for its proficiency in recognizing Japanese text, and Mathpix, renowned for its ability to recognize mathematical formulas, to integrate two types of OCR into one system [14].
Huang G et al. designed a new Hippocampus-inspired Chinese Character Recognition Network (HCRN), capable of recognizing unseen Chinese characters by predicting unknown characters from some of the partitions in the samples [15]. E. Rainarli et al. wrote a review paper in 2021 summarizing nearly a decade of research on scene-based text recognition. The review discusses in detail various feature extraction techniques used by different researchers employing machine learning methods to address multiple text orientations, multilingual texts, curved texts, and arbitrary texts. It describes the different deep learning systems used for various purposes [16]. In 2022, Ghosh T authored a review paper on the research and development of online handwriting recognition methods over the past decade. The paper discusses the performance of various machine learning and deep learning-based methods in recognizing offline handwritten characters, words, and texts. It provides a detailed analysis of different feature extraction approaches, methods, and structures, which is extremely instructive for other researchers [17].

2.2. Related Datasets

There are many handwritten Chinese character datasets provided by organizations, individuals, and groups. CASIA-HWDB1.0, SCUT-EPT, the Chinese Calligraphy Styles by Calligraphers dataset, and CASIA-AHCDB are widely used in current research. Samples from these datasets are displayed in Figure 1. The CASIA-HWDB1.0 dataset is a widely used Chinese character dataset released by the Institute of Automation, Chinese Academy of Sciences [18]. The dataset strictly follows the GB 2312 standard and covers 4037 classes of common Chinese characters and symbols [19]. Each character class contains 111 images, totaling 448,107 single-character images. The CASIA-HWDB1.0 dataset plays an important role in Chinese character recognition research due to its richness and diversity. The Chinese Calligraphy Styles by Calligraphers dataset is a collection of Chinese character works focusing on the calligraphic styles of famous calligraphers in China, hereinafter referred to as "Chinese Calligraphy" [20]. The dataset contains the works of 20 calligraphers, and each subset contains an average of 5251 text images with a size of 64 × 64 pixels, totaling 132,080 single-character images.
The SCUT-EPT Dataset, constructed by South China University of Technology, is another important handwritten Chinese dataset. This dataset was extracted from the test papers of 2986 volunteers and contains a total of 50,000 text line images. These images cover 4033 common Chinese characters, 104 symbols, and 113 unusual characters, with a total character count of 1,267,161. By introducing text segmentation algorithms, this dataset can be further reorganized into a high-quality handwritten Chinese dataset, providing a rich resource for related research. The Chinese Ancient Handwritten Characters Database (CASIA-AHCDB) was constructed by annotating 11,937 pages of Chinese ancient handwritten documents for character recognition research. It consists of more than 2.2 million annotated handwritten character samples across 10,350 categories.
Figure 2 illustrates two sources of difficulty: the same character can be written in many different styles, and different characters can have highly similar shapes. Both factors can easily prevent a recognition model from reaching the expected accuracy on similar images.

3. Algorithm Design

3.1. Design and Analysis of Algorithms

In the field of text image recognition, models constructed using deep learning techniques have demonstrated high recognition accuracy. However, further improving accuracy purely by optimizing the model has gradually reached a bottleneck. Therefore, this paper provides an in-depth analysis at the image level. Approximate characters in particular exhibit extremely high similarity in glyph shape and character structure, so the information contained in their images is very similar. Nevertheless, deep learning can still accurately capture the subtle differences between images of closely related characters [21]. Such subtle, inconsistent information may not be significant in a large image dataset, but by finely comparing a small number of samples and repeatedly comparing the extracted key features, we can still identify the underlying patterns.
Based on image-level research, this paper proposes a separated pre-training model. The model aims to further enhance its ability to learn distinct key features in similar images by extracting and pre-training key dimensional features within the dataset. In this way, we aim to achieve the research goal of improving the recognition accuracy and provide new ideas and methods for research in the field of text image recognition. The flow of the separated pre-training models is shown in Figure 3.

3.2. Design of Separated Pre-Training Models

To address the small size of the text image samples and the blurriness of some images, the dataset is enhanced using bilinear interpolation, avoiding recognition errors caused by blurred images. The enhancement step consists of calculating the scaling ratios of the width and height based on the target size and the original image size [22]. The specific method is shown in Figure 4:
The bilinear interpolation used for image enhancement works as follows: the value of the target pixel is found from four known surrounding pixels. Equation (1) is the interpolation formula along the x-axis, and Equation (2) is the interpolation formula along the y-axis; both are more easily understood in conjunction with Figure 4a. $Q_{11}$, $Q_{12}$, $Q_{21}$, and $Q_{22}$ represent four known pixel points, and $x_1$, $x_2$, $y_1$, $y_2$ denote their coordinates on the x-axis and y-axis, respectively. The function $R$ is the result of the interpolation along the x-axis, and $f$ is the final pixel value of point $P$, obtained after interpolating along both the x-axis and the y-axis.
$$R(Q_{11}, Q_{21}) = \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21}) \tag{1}$$
$$f(x, y) = \frac{y_2 - y}{y_2 - y_1}\, R(Q_{11}, Q_{21}) + \frac{y - y_1}{y_2 - y_1}\, R(Q_{12}, Q_{22}) \tag{2}$$
The coordinates and corresponding pixel values at $Q_{11}(x_1, y_1)$, $Q_{12}(x_1, y_2)$, $Q_{21}(x_2, y_1)$, and $Q_{22}(x_2, y_2)$ are known, and these four pixels surround point $P$. The pixel value of point $P(x, y)$ is calculated using Equations (1) and (2): Equation (1) interpolates in the x-direction to determine the values of $R(Q_{11}, Q_{21})$ and $R(Q_{12}, Q_{22})$, and Equation (2) then interpolates in the y-direction to determine the value of $P$, $f(x, y)$. The image comparison before and after enhancement by interpolation is displayed in Figure 4b. In the experimental process, the classification model learns better from images enhanced by bilinear interpolation.
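As an illustration, the following sketch (our own, not the authors' released code) implements Equations (1) and (2) to resize a grayscale character image to 48 × 48 pixels; in practice, a library routine such as OpenCV's cv2.resize with INTER_LINEAR performs the same computation.

```python
import numpy as np

def bilinear_resize(img: np.ndarray, out_h: int = 48, out_w: int = 48) -> np.ndarray:
    """Resize a grayscale image using the bilinear interpolation of Eqs. (1)-(2)."""
    in_h, in_w = img.shape
    out = np.zeros((out_h, out_w), dtype=np.float32)
    # Scaling ratios between the original size and the target size.
    scale_y = (in_h - 1) / max(out_h - 1, 1)
    scale_x = (in_w - 1) / max(out_w - 1, 1)
    for i in range(out_h):
        for j in range(out_w):
            y, x = i * scale_y, j * scale_x
            y1, x1 = int(y), int(x)
            y2, x2 = min(y1 + 1, in_h - 1), min(x1 + 1, in_w - 1)
            dy, dx = y - y1, x - x1
            # Equation (1): interpolate along the x-axis at rows y1 and y2.
            r1 = (1 - dx) * img[y1, x1] + dx * img[y1, x2]
            r2 = (1 - dx) * img[y2, x1] + dx * img[y2, x2]
            # Equation (2): interpolate the two results along the y-axis.
            out[i, j] = (1 - dy) * r1 + dy * r2
    return out.astype(img.dtype)
```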
In the experimental process, preprocessing of image data is crucial. We opted for the bilinear interpolation method for image enhancement, selected after evaluating multiple image enhancement methods and considering the specific features of text images. Different datasets, both before and after the enhancement, were tested using the CAE model and the SPT CAE model. The experimental results indicate that the classification model performs better with the enhanced dataset, achieving the target accuracy with fewer iterations and in less time.
Secondly, the key features of the image dataset are extracted as the focal points for similar image recognition and classification. Uniform Manifold Approximation and Projection (UMAP) is a manifold learning technique for dimensionality reduction. Its core idea is to map high-dimensional data points to a low-dimensional space while preserving as much of the original structure, particularly local neighborhood relationships, as possible. In the application described in this paper, it effectively reduces the dimensionality of the images and extracts the key feature information without significantly altering the original image information.
Traditional methods typically transform images directly into one-dimensional vectors, leading to significant pixel disparities within the image matrix. To address this, we use a CNN for feature extraction together with UMAP dimensionality reduction, followed by k-means clustering. This approach improves the clustering outcome by grouping images based on features rather than raw pixel values. We set k = 10, a value determined statistically to best represent the most common features of the images. The final groups are established through the iterative computation of cluster centroids, resulting in ten distinct similar-character datasets. Subsequently, the low-dimensional image information is visualized and analyzed, and the differences between similar images are studied in depth from the perspective of clustering probability.
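The following sketch (ours, based on the open-source umap-learn and scikit-learn packages) illustrates this step before the probability analysis below; the authors' CNN feature extractor and the improved, adaptive-threshold k-means variant are not reproduced here, so plain k-means stands in for the latter.

```python
import numpy as np
import umap                      # umap-learn package
from sklearn.cluster import KMeans

def cluster_similar_characters(features: np.ndarray, n_clusters: int = 10):
    """Reduce CNN feature vectors with UMAP, then group them with k-means (k = 10)."""
    # UMAP maps the high-dimensional CNN features to a low-dimensional space
    # while preserving local neighborhood structure.
    reducer = umap.UMAP(n_components=2, random_state=0)
    embedding = reducer.fit_transform(features)          # shape: (n_samples, 2)

    # Plain k-means stands in for the paper's improved, adaptive-threshold variant.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    cluster_ids = kmeans.fit_predict(embedding)
    return embedding, cluster_ids, kmeans.cluster_centers_
```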
$$f(v) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, v^{(n/2)-1} e^{-v/2}, \quad v \ge 0 \tag{3}$$
After feature extraction and clustering of the image labels, counting the labels of each cluster shows that the distribution of the labeled data conforms to a gamma distribution with $n$ degrees of freedom, $\alpha = 0.5n$, and $\lambda = 0.5$. Equation (3) gives its density, and the confidence interval is $[\bar{X} - \sigma_{\bar{X}} z_{\alpha/2},\; \bar{X} + \sigma_{\bar{X}} z_{\alpha/2}]$, where $\bar{X}$ is the sample mean, $\mu$ is the population mean, $\sigma^2$ is the variance, and the confidence interval for $\mu_0$ has level $100(1 - \alpha)\%$. Following the extraction and clustering analysis of the image features by the separated pre-training models, we can effectively minimize the impact of redundant labels, allowing the recognition model to be further trained and refined on a newly formed dataset characterized by distinct similar-image features.
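For illustration, the density in Equation (3) and the normal-approximation confidence interval above can be evaluated with SciPy as in the following sketch (ours; the degrees of freedom and the confidence level are placeholders, not values reported in the paper).

```python
import numpy as np
from scipy import stats

def chi2_density(v: np.ndarray, n: int) -> np.ndarray:
    """Equation (3): chi-square density with n degrees of freedom,
    i.e., a gamma distribution with alpha = n/2 and lambda = 1/2."""
    return stats.chi2.pdf(v, df=n)

def confidence_interval(cluster_counts: np.ndarray, alpha: float = 0.05):
    """Interval [x_bar - sigma_xbar * z_(alpha/2), x_bar + sigma_xbar * z_(alpha/2)]."""
    x_bar = cluster_counts.mean()
    sigma_xbar = cluster_counts.std(ddof=1) / np.sqrt(len(cluster_counts))
    z = stats.norm.ppf(1 - alpha / 2)
    return x_bar - sigma_xbar * z, x_bar + sigma_xbar * z
```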

3.3. Design of the Convolutional Auto-Encoder Structure

As shown in Figure 5, this paper employs the concept of a convolutional auto-encoder to develop a CAE model that is better suited for handwritten Chinese character recognition. The model aims to reduce accuracy losses during the experimental process by minimizing the resource overhead typically associated with conventional convolutional neural networks. This is achieved by constructing convolutional layers for channel optimization, employing a greedy iterative strategy on the original network structure to eliminate less important features, and utilizing network quantization techniques to refine the parameters [23]. The convolutional auto-encoder developed in this paper is composed of two modules: an encoder and a decoder. The encoder learns and trains the key features of the image, while the decoder recognizes and compares the outcomes of the training and learning processes. The encoder includes three convolutional blocks, consisting of a convolutional layer with 64 1 × 1 kernels, a convolutional layer with 64 3 × 3 kernels, and a convolutional layer with 256 1 × 1 kernels, followed by a pooling layer. These are followed by four more convolutional blocks, each consisting of a convolutional layer with 128 1 × 1 kernels, a convolutional layer with 128 3 × 3 kernels, and a convolutional layer with 512 1 × 1 kernels, all followed by a pooling layer that extracts the key features.
The structure of the decoder mirrors that of the encoder. The key features are first up-sampled, followed by four convolutional blocks. Each block comprises a convolutional layer with 128 1 × 1 kernels, a convolutional layer with 128 3 × 3 kernels, and a convolutional layer with 512 1 × 1 kernels, followed by a pooling layer that refines the key features. The sequence continues with a second up-sampling step, leading to the final three convolutional blocks, which include a convolutional layer with 64 1 × 1 kernels, another with 64 3 × 3 kernels, and a third with 256 1 × 1 kernels.
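A minimal Keras sketch of this encoder-decoder layout is given below (ours, not the authors' code; the experiments list TensorFlow 2.7). The block repetition counts follow one reading of the description above, and the strides, padding, activations, and output layer are assumptions; the classification head used for recognition is omitted.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_block(x, c1, c3, c_out):
    """Bottleneck block: 1x1, 3x3, then 1x1 convolutions (ReLU activations assumed)."""
    x = layers.Conv2D(c1, 1, padding="same", activation="relu")(x)
    x = layers.Conv2D(c3, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(c_out, 1, padding="same", activation="relu")(x)

def build_cae(input_shape=(48, 48, 1)):
    inp = layers.Input(shape=input_shape)
    # Encoder: three 64/64/256 blocks, pooling, then four 128/128/512 blocks.
    x = inp
    for _ in range(3):
        x = conv_block(x, 64, 64, 256)
    x = layers.MaxPooling2D(2)(x)
    for _ in range(4):
        x = conv_block(x, 128, 128, 512)
    code = layers.MaxPooling2D(2)(x)
    # Decoder: mirror of the encoder with up-sampling steps.
    x = layers.UpSampling2D(2)(code)
    for _ in range(4):
        x = conv_block(x, 128, 128, 512)
    x = layers.UpSampling2D(2)(x)
    for _ in range(3):
        x = conv_block(x, 64, 64, 256)
    out = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(x)
    return models.Model(inp, out, name="cae")
```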

3.4. Loss Function of CAE

The loss function and precision metrics in neural networks are crucial for assessing how well a model learns from and predicts on the training samples. Common metrics such as mean squared error and mean absolute error generally perform well in scenarios involving multiple elements and complex cases. However, additional loss functions and precision metrics are often needed to address specific challenges.

3.4.1. Customized Loss Function

$$f(x_i, x_{i-1}) = \sigma\left(V_x \tanh\left(W_p x_{i-1} + W_x x_i + b_x\right)\right) \tag{4}$$
In Equation (4), $\sigma$ is a sigmoid (S-shaped) function and $\tanh$ serves as the activation function; the parameters $W_x$, $W_p$, and $V_x$ are trainable. The formula applies an activation function to the transition from $x_{i-1}$ to $x_i$, thereby introducing nonlinearity into the network. Without an activation function, every layer of a neural network is equivalent to a matrix multiplication, so a multi-layer network would perform only linear transformations and would be no more expressive than a single-layer perceptron. Activation functions are therefore essential for neural networks to learn and represent complex nonlinear functions.
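A minimal sketch (ours) of the gated term in Equation (4) follows; how this term is combined with the cross-entropy loss during training is not spelled out in the text, so only the computation itself is shown, with hypothetical layer names.

```python
import tensorflow as tf

class GatedTransition(tf.keras.layers.Layer):
    """Computes sigma(V_x * tanh(W_p x_prev + W_x x_cur + b_x)) as in Equation (4)."""
    def __init__(self, units: int):
        super().__init__()
        self.w_p = tf.keras.layers.Dense(units, use_bias=False)  # W_p
        self.w_x = tf.keras.layers.Dense(units)                  # W_x and b_x
        self.v_x = tf.keras.layers.Dense(1, use_bias=False)      # V_x

    def call(self, x_prev, x_cur):
        hidden = tf.tanh(self.w_p(x_prev) + self.w_x(x_cur))
        return tf.sigmoid(self.v_x(hidden))                      # sigma(...)
```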

3.4.2. Cross Entropy Loss Function

$$H(p, q) = -\sum_{i=1}^{n} p(x_i) \log q(x_i) \tag{5}$$
In Equation (5), $p(x_i)$ denotes the true probability of class $x_i$, $q(x_i)$ denotes the predicted probability, and $n$ is the total number of categories. The cross-entropy between the model output $q(x_i)$ and the true labels measures the difference between the predicted and true label distributions; the purpose of training is to make the predicted output $q(x_i)$ continually approach $p(x_i)$, which is consistent with the training objective of a conventional neural network.

3.4.3. Split Batch Loss Function

$$Loss_H = -\frac{1}{batch\_size} \sum_{j=1}^{batch\_size} \sum_{i=1}^{n} p(x_{ji}) \log q(x_{ji}) \tag{6}$$
Equation (6) computes the cross-entropy for the label classification task on a per-batch basis, where $batch\_size$ is the number of samples in each training batch. The cross-entropy of each of the $j$ samples in the batch is computed and then averaged to obtain the final loss.
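A minimal sketch (ours) of the batch-averaged cross-entropy in Equation (6), assuming one-hot true labels and softmax model outputs:

```python
import tensorflow as tf

def split_batch_loss(p_true: tf.Tensor, q_pred: tf.Tensor) -> tf.Tensor:
    """p_true, q_pred: tensors of shape (batch_size, n_classes); returns a scalar loss."""
    eps = 1e-7                                   # avoid log(0)
    # Cross-entropy of each sample in the batch (inner sum over classes).
    per_sample = -tf.reduce_sum(p_true * tf.math.log(q_pred + eps), axis=1)
    # Average over the batch (outer sum divided by batch_size).
    return tf.reduce_mean(per_sample)
```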

4. Experiment

The experiments in this paper are divided into two parts: the model comparison experiments and handwritten Chinese character recognition experiments using separated pre-training models. Details of the experimental environment, evaluation indexes, the handwritten character image recognition model, and SPT model experiments will be specifically introduced later in the paper.

4.1. Benchmark Setup

This paper utilizes the CASIA-HWDB1.0 and Chinese Calligraphy Styles by Calligraphers datasets, which are segmented and reorganized into multiple new datasets. As shown in Table 1, the Data1 dataset comprises 65% of the total from the two original datasets, amounting to approximately 406,190 images. Its training set is an even mix, consisting of 60% randomly selected images from both the HWDB and the Chinese Calligraphy datasets. Similarly, its test set combines 5% of the images from each dataset. The Data2 dataset represents 80% of the total, about 434,845 images, with its training set including 70% of images from each original dataset selected randomly, and its test set combining 10% from each. The Data3 dataset exclusively uses the HWDB dataset for training and the Chinese Calligraphy dataset for testing, totaling approximately 580,187 images. The purpose of splitting the datasets is to evaluate the model’s learning and recognition robustness across a newly formed dataset.
During the preprocessing stage of the HWDB dataset, the image specifications are normalized to preserve the text features in the text images. All images are processed into a 48 × 48 pixel format. For text symbol images with an aspect ratio exceeding 10 in either dimension (h/w > 10 or w/h > 10), the integrity of the character shape is maintained by adjusting the aspect ratio, stretching the image as necessary, and filling any resultant blank spaces to ensure size consistency. The bilinear interpolation enhancement begins by calculating the interpolation matrix unit at the key center point of the image data, then interpolates the target along the x-axis and the y-axis, and finally fills any blank values between the interpolated object and the interpolation matrix unit, improving the readability and clarity of the whole image. The experimental settings use the Adam optimizer, an input image size of 48 × 48 pixels, a learning rate of 0.0001, a batch size of 64, and 50 epochs. The experimental environment of this paper is shown in Table 2.
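A minimal sketch of this preprocessing step (ours, using OpenCV); the use of white padding and centering for characters with extreme aspect ratios is an assumption based on the description above, not the authors' exact procedure.

```python
import cv2
import numpy as np

def normalize_character(img: np.ndarray, size: int = 48) -> np.ndarray:
    """Pad very elongated symbols, then resize to size x size with bilinear interpolation."""
    h, w = img.shape[:2]
    if h / w > 10 or w / h > 10:
        # Keep the character shape intact by placing it on a square canvas
        # and filling the resulting blank space (assumed white background).
        side = max(h, w)
        canvas = np.full((side, side), 255, dtype=img.dtype)
        y0, x0 = (side - h) // 2, (side - w) // 2
        canvas[y0:y0 + h, x0:x0 + w] = img
        img = canvas
    return cv2.resize(img, (size, size), interpolation=cv2.INTER_LINEAR)
```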

4.2. Evaluation Indicators

In the experiments conducted for this paper, we employed commonly used evaluation metrics for multi-classification tasks, including accuracy, precision, recall, and Macro-F1.
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$
The accuracy rate is calculated as the ratio of the number of correct samples classified by the model to the total number of samples for a given dataset. In this context, $TP$ indicates the number of positive samples correctly identified, $FP$ represents the number of actual negative samples predicted to be positive, $TN$ denotes the number of negative samples correctly identified, and $FN$ refers to the number of actual positive samples predicted to be negative.
$$Precision = \frac{TP}{TP + FP} \tag{8}$$
The precision formula refers to the ratio of the number of samples correctly classified by the model (true positives) to the number of samples judged positive by the model (sum of true positives and false positives).
$$Recall = \frac{TP}{TP + FN} \tag{9}$$
The Recall is the ratio of the number of correctly categorized positive samples (true positives) to the number of true positive samples (sum of true positives and false negatives).
$$\left\{\begin{array}{l} F1_1 = \dfrac{2 \times Precision_1 \times Recall_1}{Precision_1 + Recall_1} \\ \quad\vdots \\ F1_n = \dfrac{2 \times Precision_n \times Recall_n}{Precision_n + Recall_n} \end{array}\right. \tag{10}$$
$$Macro\text{-}F1 = \frac{F1_1 + \cdots + F1_n}{n} \tag{11}$$
Equation (10) represents the set of F1 scores calculated for each class from its precision and recall, where $F1_n$ denotes the F1 score of the $n$-th class. Equation (11) averages the F1 scores of all $n$ classes to obtain the Macro-F1 score.
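These are the standard multi-class definitions; the following sketch (ours) shows how they can be computed with scikit-learn's macro-averaged implementations.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate(y_true, y_pred):
    """Accuracy, macro-averaged precision/recall, and Macro-F1 (Equations (7)-(11))."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        # Macro-F1 averages the per-class F1 scores over all n classes.
        "macro_f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }
```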

4.3. Results and Analysis of Model Comparison Experiments

The convolutional auto-encoder model for Chinese character recognition is based on a convolutional neural network combined with an auto-encoder structure. The model employs convolutional layers instead of the fully connected layers typically found in auto-encoders and down-samples the input features to represent them in smaller dimensions [24]. The encoder learns and trains the key features of the image, while the decoder recognizes and compares the results of the training and learning process. The overall model structure is simpler than that of other models while maintaining recognition accuracy, making it particularly effective for training on the entire HWDB dataset. During the learning process, the model employs learning rate decay, a dropout layer, and early stopping to accurately determine the model's true performance and reduce the risk of overfitting. Furthermore, the model's loss function combines the custom-designed loss function with the cross-entropy function, ensuring that the results align more closely with expectations. The comparison of model results is depicted in Figure 6.
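As an illustration of this training regime, the following sketch (ours, not the authors' code) sets up learning-rate decay and early stopping in Keras; the specific patience values and decay factor are assumptions, and the dropout layers would sit inside the model itself. The callbacks would be passed to model.fit alongside the combined loss from Section 3.4.

```python
import tensorflow as tf

callbacks = [
    # Learning-rate decay when the validation loss stops improving.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                         patience=3, min_lr=1e-6),
    # Early stopping to reduce the risk of overfitting.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
]
```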
The CAE model constructed in this paper is compared with the AlexNet, ResNet, CNN, and GoogleNet models on the Data1 dataset. After 50 iterations, the recognition process of each model is assessed based on the accuracy and the behavior of the loss function. The GoogleNet model exhibits the poorest recognition accuracy and loss metrics, while the remaining four models show comparable performance after a sufficient number of iterations. Notably, the CAE and CNN models achieve faster convergence, whereas the ResNet and AlexNet models converge more slowly. By approximately 30 iterations, all models begin to converge and effectively learn the key features and patterns of the data. The CAE and CNN models exhibit the highest initial recognition accuracy, low initial loss, and faster convergence in both accuracy and loss. Although the other models show similar performance levels, the GoogleNet model lags significantly, likely because its architecture is less suited to processing this type of text image data. Consequently, the latter part of the paper exclusively uses the CAE model, which is further enhanced with separated pre-training into the Separated Pre-Training Convolutional Auto-Encoder (SPT CAE).

4.4. Experimental Results and Analysis of SPT Model

As shown in Figure 7, in the separated pre-training, we use the feature extraction algorithm with the improved k-means algorithm to cluster similar images. In the figure, each color represents a collection of similar character images, and the star in the middle represents the cluster center. After all the images are enhanced by bilinear interpolation and reduced in dimensionality by UMAP, their contents are clustered using the improved k-means algorithm to obtain the best collections of similar character images. Regression analysis, coupled with the gamma distribution, is used to determine the parameters $\sigma$ and $\mu$ of the cluster-size curve and to assess whether $\mu \ge 0.04$ is satisfied at the given confidence level. Categories accounting for less than 4% of the total number of labels have low representativeness, so we exclude these few labels to reduce the risk of small-sample interference and overfitting in neural network training. This strategy is validated through a probability density analysis to ensure both the purity of the dataset and the stability of the model training.
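A simplified sketch of this label-filtering step (ours) is shown below; the gamma-based confidence check is reduced to a plain 4% proportion threshold for illustration.

```python
import numpy as np

def filter_rare_labels(labels: np.ndarray, threshold: float = 0.04) -> np.ndarray:
    """Return a boolean mask keeping samples whose label share is at least `threshold`."""
    values, counts = np.unique(labels, return_counts=True)
    share = counts / counts.sum()
    keep = set(values[share >= threshold])          # labels representative enough to keep
    return np.array([lab in keep for lab in labels])
```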
This model evaluates the proposed method using the accuracy evaluation function defined earlier. The accuracy and evaluation metrics for various training strategies within the reconstructed dataset are presented in Table 3.
In the mixed datasets Data1 and Data2, the SPT CAE model outperforms the traditional CAE model in terms of accuracy, recall, and Macro-F1. This improvement is particularly notable in Data2, where the model's ability to quickly capture accurate features using separated pre-training significantly surpasses that of the traditional CAE model. Additionally, the SPT CAE model demonstrates improved processing time, which is especially evident in Data3, suggesting a superior capacity to learn from new data. During the experiments, the datasets were randomly remixed for each trial, with the same split sizes but different contents each time. The reported results represent the average accuracy over multiple experimental trials.
In the three reconstructed datasets, models enhanced with SPT show an improved performance compared to the traditional CAE model. To demonstrate the effectiveness of the SPT models on the recognition method, we compare the traditional CAE model and the SPT CAE model by training them to recognize similar characters. The selected recognition results are presented in Table 4.
Based on these experimental results, it is evident that employing SPT models can effectively improve the recognition results of most error-prone characters. However, if the SPT models do not sufficiently learn the classification model of closely related characters, they may lead to recognition errors. Furthermore, the composition of the approximate character data set also affects the learning performance of the SPT models. Some text images, which visually appear dissimilar from others in the dataset, might still be recognized by the models due to the low but existing computational feature correlations. However, these correlations are not strong enough to ensure a reliable recognition. Future research could focus on optimizing the processing of these particular data to enhance the model accuracy. In addition, when dealing with writing irregularities where the image characterization features are less distinct, the system still fails to achieve the correct recognition results.

4.5. Ablation Experiment

The results of ablation studies comparing the proposed image enhancement, model loss function modifications, and SPT model are presented in Table 5. These experiments utilized the Data1 dataset. It is evident that the unmodified CAE model performed the poorest. Subsequently, various optimizations were applied to the experimental setup, each leading to improvements in classification results. Among these, the SPT model exhibited the most significant improvement, with its accuracy surpassing the enhancements achieved through the image enhancement and loss function modifications. The proposed SPT CAE model, tested under these conditions, achieved a recognition accuracy of 96.97% on the Data1 dataset.

5. Conclusions

In this paper, we propose a separated pre-training model to address the challenge of recognizing Chinese text character images. Initially, we reconstructed three datasets from the HWDB and Chinese Calligraphy datasets for the model comparison experiments. Subsequently, we proposed the SPT model, which employs UMAP feature extraction with an improved K-means method. This approach effectively mines the potential associations between similar images and generates a specific dataset by clustering these similar images to enhance the learning ability of the recognition model. For the image character recognition model, we chose a convolutional auto-encoder combined with a customized loss function as the benchmark model. Through experiments, we compared the performance of the traditional CAE model and the CAE model enhanced with the SPT model on three reconstructed datasets. The experimental results indicate that the SPT model outperforms the standard single text image recognition model on key evaluation metrics, including accuracy, recall, and Macro-F1. In particular, the SPT CAE model improves the precision of the CAE model by approximately 6% in classifying error-prone characters during similar image processing. In subsequent research, we plan to optimize the parameters of the SPT model and employ lightweight techniques, such as channel pruning, to develop a more efficient and compact version of the SPT models. We will also explore the transferability of the SPT model across various recognition models to verify its robustness and the universality of its optimization effect. Combining related research to find a better classification model to replace the CAE model will also be important for more accurate classification of error-prone character images.

Author Contributions

Methodology, B.Z. and Y.L.; software and validation, B.Z.; investigation, X.H.; writing—original draft preparation, B.Z.; writing—review and editing, X.H.; supervision, Y.L.; X.H., B.Z. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Talent Introduction Project of Sichuan University of Science and Engineering under Grant No.2020RC22, by the Sichuan University of Science and Engineering Graduate Innovation Fund Project under Grant Y2023121, by the 2022 network ideological and political education research project of Sichuan University of Science and Engineering under Grant SZ2022-21, and by the Sichuan Key Provincial Research Base of Intelligent Tourism under Grant ZHZJ22-01.

Data Availability Statement

We used a public domain dataset [25].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, C.; Yang, C.; Qin, H.-B.; Zhu, X.; Liu, C.-L.; Yin, X.-C. Towards open-set text recognition via label-to-prototype learning. Pattern Recognit. 2023, 134, 109109. [Google Scholar] [CrossRef]
  2. Xiao, Y.; Meng, D.; Lu, C.; Tang, C.-K. Template-instance loss for offline handwritten chinese character recognition. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, NSW, Australia, 20–25 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 315–322. [Google Scholar] [CrossRef]
  3. Elaraby, N.; Barakat, S.; Rezk, A. A generalized ensemble approach based on transfer learning for braille character recognition. Inf. Process. Manag. 2024, 61, 103545. [Google Scholar] [CrossRef]
  4. Cao, Z.; Lu, J.; Cui, S.; Zhang, C. Zero-shot handwritten Chinese character recognition with hierarchical decomposition embedding. Pattern Recognit. 2020, 107, 107488. [Google Scholar] [CrossRef]
  5. Wu, X.; Chen, Q.; Xiao, Y.; Li, W.; Liu, X.; Hu, B. LCSegNet: An efficient semantic segmentation network for large-scale complex Chinese character recognition. IEEE Trans. Multimed. 2020, 23, 3427–3440. [Google Scholar] [CrossRef]
  6. Wang, X.; Dong, L. Application of Attention Mechanism in Offline Chinese Handwritten Text Line Recognition. J. Chin. Comput. Syst. 2019, 40, 1876–1880. [Google Scholar]
  7. Gan, J.; Chen, Y.; Hu, B.; Leng, J.; Wang, W.; Gao, X. Characters as graphs: Interpretable handwritten Chinese character recognition via Pyramid Graph Transformer. Pattern Recognit. 2023, 137, 109317. [Google Scholar] [CrossRef]
  8. Miao, Y.; Liang, L.; Ji, Y.; Li, Z.; Li, G. Research on Chinese ancient characters image recognition method based on adaptive receptive field. Soft Comput. 2022, 26, 8273–8282. [Google Scholar] [CrossRef]
  9. Wang, T.; Xie, Z.; Li, Z.; Jin, L.; Chen, X. Radical aggregation network for few-shot offline handwritten Chinese character recognition. Pattern Recognit. Lett. 2019, 125, 821–827. [Google Scholar] [CrossRef]
  10. Bi, N.; Chen, J.; Tan, J. The handwritten Chinese character recognition uses convolutional neural networks with the GoogleNet. Int. J. Pattern Recognit. Artif. Intell. 2019, 33, 1940016. [Google Scholar] [CrossRef]
  11. Diao, X.; Shi, D.; Tang, H.; Shen, Q.; Li, Y.; Wu, L.; Xu, H. RZCR: Zero-shot character recognition via radical-based reasoning. arXiv 2022, arXiv:2207.05842. [Google Scholar] [CrossRef]
  12. Chen, X.; Wang, T.; Zhu, Y.; Jin, L.; Luo, C. Adaptive embedding gate for attention-based scene text recognition. Neurocomputing 2020, 381, 261–271. [Google Scholar] [CrossRef]
  13. Ghanim, T.M.; Khalil, M.I.; Abbas, H.M. Comparative study on deep convolution neural networks DCNN-based offline Arabic handwriting recognition. IEEE Access 2020, 8, 95465–95482. [Google Scholar] [CrossRef]
  14. Kobayashi, Y.; Mimuro, S.; Suzuki, S.-N.; Iijima, Y.; Okada, A. Basic research on a handwritten note image recognition system that combines two OCRs. Procedia Comput. Sci. 2021, 192, 2596–2605. [Google Scholar] [CrossRef]
  15. Huang, G.; Luo, X.; Wang, S.; Gu, T.; Su, K. Hippocampus-heuristic character recognition network for zero-shot learning in Chinese character recognition. Pattern Recognit. 2022, 130, 108818. [Google Scholar] [CrossRef]
  16. Rainarli, E. A decade: Review of scene text detection methods. Comput. Sci. Rev. 2021, 42, 100434. [Google Scholar] [CrossRef]
  17. Ghosh, T.; Sen, S.; Obaidullah, S.; Santosh, K.; Roy, K.; Pal, U. Advances in online handwritten recognition in the last decades. Comput. Sci. Rev. 2022, 46, 100515. [Google Scholar] [CrossRef]
  18. Xuan, P.; Gao, L.; Sheng, N.; Zhang, T.; Nakaguchi, T. Graph convolutional autoencoder and fully-connected autoencoder with attention mechanism based method for predicting drug-disease associations. IEEE J. Biomed. Health Inform. 2020, 25, 1793–1804. [Google Scholar] [CrossRef]
  19. GB2312-1980; Basic set of Chinese Coded Character Sets for Information Exchange. Standardization Administration of the People’s Republic of China: Beijing, China, 1980.
  20. Yang, L.; Wu, Z.; Xu, T.; Du, J.; Wu, E. Easy recognition of artistic Chinese calligraphic characters. Vis. Comput. 2023, 39, 3755–3766. [Google Scholar] [CrossRef]
  21. Perol, T.; Gharbi, M.; Denolle, M. Convolutional neural network for earthquake detection and location. Sci. Adv. 2018, 4, e1700578. [Google Scholar] [CrossRef]
  22. Deng, X.; Ma, Y.; Dong, M. A new adaptive filtering method for removing salt and pepper noise based on multilayered PCNN. Pattern Recognit. Lett. 2016, 79, 8–17. [Google Scholar] [CrossRef]
  23. Gan, J.; Wang, W.; Lu, K. Compressing the CNN architecture for in-air handwritten Chinese character recognition. Pattern Recognit. Lett. 2020, 129, 190–197. [Google Scholar] [CrossRef]
  24. Altwaijry, N.; Al-Turaiki, I. Arabic handwriting recognition system using convolutional neural network. Neural Comput. Appl. 2021, 33, 2249–2261. [Google Scholar] [CrossRef]
  25. Liu, C.-L.; Yin, F.; Wang, D.-H.; Wang, Q.-F. CASIA online and offline Chinese handwriting databases. In Proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR), Beijing, China, 18–21 September 2011; pp. 37–41. [Google Scholar] [CrossRef]
Figure 1. Examples of images from selected datasets: (a) CASIA-HWDB; (b) Chinese Calligraphy; (c) SCUT-EPT Database; (d) CASIA-AHCDB.
Figure 2. Examples of character writing: (a) example diagrams of different writing of the same text; (b) example charts of similar characters.
Figure 3. Flowchart of the detached pre-training models.
Figure 4. (a) schematic diagram of the bilinear interpolation method; (b) comparison of the effect of bilinear interpolation method.
Figure 5. Structure of the convolutional auto-encoder.
Figure 6. (a) comparison of the accuracy of different recognition models; (b) comparison of the loss ratios of different recognition models.
Figure 7. Cluster analysis plot in separated pre-training.
Table 1. Table of reconstructed data sets.

Datasets | Train (HWDB) | Train (Chinese Calligraphy) | Test (HWDB) | Test (Chinese Calligraphy) | Total Number of Images
Data1    | 0.6          | 0.6                         | 0.05        | 0.05                       | 406,190
Data2    | 0.7          | 0.7                         | 0.1         | 0.1                        | 434,845
Data3    | 1            | 0                           | 0           | 1                          | 580,187
Table 2. Experimental environment.

Name              | Parameter                  | Name               | Parameter
CPU               | Intel Core i5-13490        | Python version     | 3.9.7
GPU graphics card | Nvidia GeForce RTX 2080Ti  | TensorFlow version | 2.7.0
Running memory    | 16 G                       | CUDA version       | 11.2
Operating system  | Windows 10 Professional    | CuDNN version      | 8.1
Table 3. Evaluation of SPT model application in three datasets.

Name  | Models  | Accuracy | Recall | Macro-F1
Data1 | CAE     | 89.59%   | 0.8770 | 0.9438
Data1 | SPT CAE | 95.06%   | 0.9661 | 0.9695
Data2 | CAE     | 91.17%   | 0.7356 | 0.8086
Data2 | SPT CAE | 98.25%   | 0.8957 | 0.9019
Data3 | CAE     | 89.58%   | 0.8394 | 0.8882
Data3 | SPT CAE | 94.88%   | 0.9089 | 0.9299
Table 4. Example table of classification results for text.

Dataset of Similar Text Images | Number of Data Sets | CAE Predictions Correct | SPT CAE Predictions Correct | Accuracy Improvement
己巴巳艺琶岂毛色琵 | 708  | 644  | 686  | 5.93%
晴睛腊盯睹瞒脂聘暗 | 1416 | 1231 | 1250 | 1.34%
柏档挡拍柑拼佰伯扫 | 1005 | 908  | 974  | 6.56%
娜邦挪绑椰柳湃廊滩 | 1274 | 1042 | 1225 | 7.32%
戊戌成戒戚绒戎茂找 | 1260 | 1034 | 1175 | 11.19%
Table 5. Comparison table of ablation experiment results (✓ indicates the component is enabled).

Name  | Image Enhancement | Loss Function | SPT | Accuracy
Data1 |                   |               |     | 0.9204
      |                   |               |     | 0.9324
      |                   |               |     | 0.9224
      |                   |               |     | 0.9601
      |                   |               |     | 0.9581
      |                   |               |     | 0.9623
      |                   |               |     | 0.9637
      | ✓                 | ✓             | ✓   | 0.9697