A Historical Handwritten French Manuscripts Text Detection Method in Full Pages
Abstract
1. Introduction
- (1) A Swin Transformer replaces the C2f block at the end of the backbone network to address the loss of fine-grained information and insufficient feature learning in word-level text detection.
- (2) The Dysample upsampler is adopted to retain more of the target's detailed features, overcoming the information loss of conventional upsampling and enabling text detection for dense targets.
- (3) An LSK module is added to the detection head to dynamically adjust the receptive field of feature extraction, handling words with extreme aspect ratios, small out-of-focus text, and text with complex shapes.
- (4) Gaussian Wasserstein distance (GWD) is introduced to modify the CIoU regression loss so that the similarity between two bounding boxes is measured directly, yielding higher-quality boxes; this overcomes the ambiguity of CIoU's aspect-ratio term, its insensitivity to scale changes, and the weak correlation it captures between target coordinates (a minimal sketch of such a loss follows this list).
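Since the loss is only summarized above, the following is a minimal PyTorch sketch of a GWD-style regression term, assuming axis-aligned (cx, cy, w, h) boxes and the closed-form 2-Wasserstein distance between the corresponding Gaussians from Yang et al. [28]; the rescaling constant `tau` is an assumed hyperparameter, not a value taken from this article.

```python
import torch

def gwd_loss(pred: torch.Tensor, target: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """GWD-style loss for axis-aligned boxes given as (cx, cy, w, h).

    Each box is modeled as a 2-D Gaussian with mean (cx, cy) and covariance
    diag(w**2 / 4, h**2 / 4); for diagonal covariances the squared
    2-Wasserstein distance reduces to the closed form below.
    """
    # Squared distance between the Gaussian means (the box centers).
    mean_d2 = (pred[..., 0] - target[..., 0]) ** 2 + (pred[..., 1] - target[..., 1]) ** 2
    # Squared Frobenius distance between the covariance square roots.
    cov_d2 = ((pred[..., 2] - target[..., 2]) ** 2 + (pred[..., 3] - target[..., 3]) ** 2) / 4.0
    d = torch.sqrt(mean_d2 + cov_d2 + 1e-7)  # small eps for numerical stability
    # Non-linear rescaling bounds the loss in [0, 1), as proposed in the GWD paper;
    # tau = 1.0 is an assumed setting, not reported in this article.
    return 1.0 - 1.0 / (tau + d)
```

Because each box maps to a Gaussian N((cx, cy), diag(w²/4, h²/4)), the distance stays smooth and informative even when the predicted and ground-truth boxes do not overlap, which is precisely where IoU-style losses stop providing a useful gradient.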
2. Related Work
3. Materials and Methods
3.1. Data Sources
3.2. Image Preprocessing
3.3. Text Detection Model
3.3.1. Swin Transformer
3.3.2. Dysample
3.3.3. Detection Head with LSK Module
3.3.4. Loss Optimization
4. Experimental Results and Analysis
4.1. Experimental Setup
4.1.1. Evaluation Indicators
4.1.2. Implementation Details
4.2. Comparison with State-of-the-Art Methods
4.3. Ablation Experiments
4.4. Qualitative Analysis
4.5. Results Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Brisinello, M.; Grbić, R.; Stefanovič, D.; Pečkai-Kovač, R. Optical Character Recognition on images with colorful background. In Proceedings of the 2018 IEEE 8th International Conference on Consumer Electronics-Berlin (ICCE-Berlin), Berlin, Germany, 2–5 September 2018; pp. 1–6. [Google Scholar]
- Adyanthaya, S.K. Text Recognition from Images: A Study. Int. J. Eng. Res. 2020, 8, IJERTCONV8IS13029. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I; pp. 21–37. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.R.; Lu, S. ICDAR 2015 competition on robust reading. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; pp. 1156–1160. [Google Scholar]
- Yao, C.; Bai, X.; Sang, N.; Zhou, X.; Zhou, S.; Cao, Z. Scene text detection via holistic, multi-channel prediction. arXiv 2016, arXiv:1606.09002. [Google Scholar]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- Epshtein, B.; Ofek, E.; Wexler, Y. Detecting text in natural scenes with stroke width transform. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2963–2970. [Google Scholar]
- Neumann, L.; Matas, J. A method for text localization and recognition in real-world images. In Proceedings of the Computer Vision–ACCV 2010: 10th Asian Conference on Computer Vision, Queenstown, New Zealand, 8–12 November 2010; Revised Selected Papers, Part III; pp. 770–783. [Google Scholar]
- Tian, Z.; Huang, W.; He, T.; He, P.; Qiao, Y. Detecting text in natural image with connectionist text proposal network. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VIII; pp. 56–72. [Google Scholar]
- Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; Liang, J. EAST: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5551–5560. [Google Scholar]
- Wang, W.; Xie, E.; Li, X.; Hou, W.; Lu, T.; Yu, G.; Shao, S. Shape robust text detection with progressive scale expansion network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9336–9345. [Google Scholar]
- Wang, X.; Jiang, Y.; Luo, Z.; Liu, C.; Choi, H.; Kim, S. Arbitrary shape scene text detection with adaptive text region representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 6449–6458. [Google Scholar]
- Zhu, Y.; Chen, J.; Liang, L.; Kuang, Z.; Jin, L.; Zhang, W. Fourier contour embedding for arbitrary-shaped text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 3123–3131. [Google Scholar]
- He, M.; Liao, M.; Yang, Z.; Zhong, H.; Tang, J.; Cheng, W.; Yao, C.; Wang, Y.; Bai, X. MOST: A multi-oriented scene text detector with localization refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 8813–8822. [Google Scholar]
- Ye, M.; Zhang, J.; Zhao, S.; Liu, J.; Du, B.; Tao, D. DPText-DETR: Towards better scene text detection with dynamic points in transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 3241–3249. [Google Scholar]
- Carbonell, M.; Mas, J.; Villegas, M.; Fornés, A.; Lladós, J. End-to-end handwritten text detection and transcription in full pages. In Proceedings of the 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), Sydney, NSW, Australia, 22–25 September 2019; pp. 29–34. [Google Scholar]
- Kohli, H.; Agarwal, J.; Kumar, M. An improved method for text detection using Adam optimization algorithm. Glob. Transit. Proc. 2022, 3, 230–234. [Google Scholar] [CrossRef]
- Kusetogullari, H.; Yavariabdi, A.; Hall, J.; Lavesson, N. DigitNet: A deep handwritten digit detection and recognition method using a new historical handwritten digit dataset. Big Data Res. 2021, 23, 100182. [Google Scholar] [CrossRef]
- Wang, X.; Xie, L.; Dong, C.; Shan, Y. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 1905–1914. [Google Scholar]
- Zhang, K.; Liang, J.; Van Gool, L.; Timofte, R. Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 4791–4800. [Google Scholar]
- Chen, Z.; Zhang, Y.; Gu, J.; Kong, L.; Yang, X. Recursive generalization transformer for image super-resolution. arXiv 2023, arXiv:2303.06373. [Google Scholar]
- Chen, X.; Wang, X.; Zhou, J.; Qiao, Y.; Dong, C. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 22367–22377. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
- Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to upsample by learning to sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6027–6037. [Google Scholar]
- Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 16794–16805. [Google Scholar]
- Yang, X.; Yan, J.; Ming, Q.; Wang, W.; Zhang, X.; Tian, Q. Rethinking rotated object detection with gaussian wasserstein distance loss. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 11830–11841. [Google Scholar]
- Liao, M.; Wan, Z.; Yao, C.; Chen, K.; Bai, X. Real-time scene text detection with differentiable binarization. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11474–11481. [Google Scholar]
- Baek, Y.; Lee, B.; Han, D.; Yun, S.; Lee, H. Character region awareness for text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9365–9374. [Google Scholar]
Paper | Method | Datasets | Result | Scene
---|---|---|---|---
[9] | SWT | ICDAR | F-measure: 0.66 | natural scene
[10] | MSER | ICDAR, SVT | F-measure: 0.687 | natural scene
[11] | CTPN | ICDAR | F-measure: 0.61 | natural scene
[12] | EAST (An Efficient and Accurate Scene Text detection pipeline) | ICDAR 2015, COCO-Text, and MSRA-TD500 | F-measure: 0.782 | natural scene
[13] | PSENet | CTW1500, Total-Text, ICDAR 2015, and ICDAR 2017 MLT | F-measure: 0.822 | natural scene
[14] | Arbitrary-shape text detection with adaptive text region representation | CTW1500, Total-Text, ICDAR 2013, ICDAR 2015, and MSRA-TD500 | F-measure: 0.917 | natural scene
[15] | FCENet | CTW1500, Total-Text, and ICDAR 2015 | F-measure: 0.858 | natural scene
[16] | MOST | SynthText, ICDAR 2017 MLT (MLT17), MTWI, ICDAR 2015 (IC15), and MSRA-TD500 | F-measure: 0.864 | natural scene
[17] | DPText-DETR | Total-Text, CTW1500, and ICDAR19 ArT | F-measure: 0.890 | natural scene
[18] | End-to-end detection and transcription framework | IAM | mAP: 0.90 | handwritten document
[19] | J&M | MNIST | Accuracy: 0.99 | handwritten document
[20] | DigitNet | Extended MNIST, Extended USPS, and DIDA (newly created) | Correct detection rate: 76.84% | handwritten document
Configuration | Value
---|---
Operating system | Windows 10 Professional
CPU | Intel(R) Core(TM) i7-9700
GPU | NVIDIA GeForce RTX 3090
Programming language | Python 3.7.12
Development framework | PyTorch 1.8.0
Accelerator | CUDA 11.1
Image size | 640 × 640
SGD momentum | 0.937
Weight decay | 5 × 10⁻⁴
Learning rate | 0.01
Epochs | 500
Batch size | 1
single_cls | True
IoU threshold | 0.55
Mosaic | 1.0
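The hyperparameters above map directly onto a standard Ultralytics YOLOv8 training call. The snippet below is an illustrative sketch under that assumption; the dataset file name `manuscripts.yaml` is hypothetical, and this is not the authors' released training script.

```python
from ultralytics import YOLO  # assumes the standard Ultralytics package

model = YOLO("yolov8s.pt")  # YOLOv8s baseline weights
model.train(
    data="manuscripts.yaml",  # hypothetical dataset config; not given in the article
    imgsz=640,                # image size 640 x 640
    epochs=500,
    batch=1,
    single_cls=True,          # treat all words as a single "text" class
    optimizer="SGD",
    lr0=0.01,                 # initial learning rate
    momentum=0.937,           # SGD momentum
    weight_decay=5e-4,
    mosaic=1.0,               # mosaic augmentation probability
)
metrics = model.val(iou=0.55)  # the table's IoU threshold applies at NMS/evaluation time
```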
Model | Precision | Recall | mAP@0.5 | F1 | Inference Time
---|---|---|---|---|---
EAST [12] | 0.423 | 0.367 | N/A | 0.3935 | 290.1 ms
DBNet [29] | 0.206 | 0.123 | N/A | 0.1493 | 36.6 ms
CRAFT [30] | 0.665 | 0.512 | N/A | 0.5794 | 566.5 ms
Ours | 0.863↑ | 0.814↑ | 0.824 | 0.8377 | 9.2 ms
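The reported F1 scores are the usual harmonic mean of precision and recall; the check below reproduces the "Ours" entry from the table (small discrepancies for the baselines arise because their precision and recall values are themselves rounded).

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.863, 0.814), 4))  # 0.8378, matching the reported 0.8377 to rounding
```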
YOLOv8s | Swin Transformer | Dysample | LSK | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 | Inference Time
---|---|---|---|---|---|---|---|---
√ | | | | 0.782 | 0.748 | 0.757 | 0.359 | 7.2 ms
√ | √ | | | 0.833 | 0.790 | 0.829 | 0.447 | 8.0 ms
√ | √ | √ | | 0.841 | 0.790 | 0.846 | 0.455 | 8.5 ms
√ | √ | √ | √ | 0.863 | 0.814 | 0.824 | 0.415 | 9.9 ms
Model | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 | Inference Time
---|---|---|---|---|---
YOLOv8s-CIoU | 0.782 | 0.748 | 0.757 | 0.359 | 7.2 ms
YOLOv8s-GWD | 0.800 | 0.749 | 0.772 | 0.358 | 6.5 ms
Ours-CIoU | 0.851 | 0.810 | 0.849 | 0.453 | 9.9 ms
Ours-GWD | 0.863 | 0.814 | 0.824 | 0.415 | 9.2 ms