A Historical Handwritten French Manuscripts Text Detection Method in Full Pages
Abstract
1. Introduction
- (1) A Swin Transformer replaces the C2f block at the end of the backbone network to address the loss of fine-grained information and insufficient feature learning in word-level text detection.
- (2) The Dysample upsampler is adopted to retain more of the target's detailed features, overcoming the information loss of conventional upsampling and enabling text detection for dense targets.
- (3) An LSK module is added to the detection head to dynamically adjust the receptive field of feature extraction, handling words with extreme aspect ratios, small out-of-focus text, and text with complex shapes.
- (4) Gaussian Wasserstein distance (GWD) is introduced to modify the CIoU regression loss so that the similarity between two bounding boxes is measured directly, yielding higher-quality boxes; this overcomes the ambiguity of CIoU's aspect-ratio term, its insensitivity to scale changes, and the weak correlation it captures between target coordinates (a minimal sketch of such a loss follows this list).
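Since the loss is only summarized above, the following is a minimal PyTorch sketch of a GWD-style regression term, assuming axis-aligned (cx, cy, w, h) boxes and the closed-form 2-Wasserstein distance between the corresponding Gaussians from Yang et al. [28]; the rescaling constant `tau` is an assumed hyperparameter, not a value taken from this article.

```python
import torch

def gwd_loss(pred: torch.Tensor, target: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """GWD-style loss for axis-aligned boxes given as (cx, cy, w, h).

    Each box is modeled as a 2-D Gaussian with mean (cx, cy) and covariance
    diag(w**2 / 4, h**2 / 4); for diagonal covariances the squared
    2-Wasserstein distance reduces to the closed form below.
    """
    # Squared distance between the Gaussian means (the box centers).
    mean_d2 = (pred[..., 0] - target[..., 0]) ** 2 + (pred[..., 1] - target[..., 1]) ** 2
    # Squared Frobenius distance between the covariance square roots.
    cov_d2 = ((pred[..., 2] - target[..., 2]) ** 2 + (pred[..., 3] - target[..., 3]) ** 2) / 4.0
    d = torch.sqrt(mean_d2 + cov_d2 + 1e-7)  # small eps for numerical stability
    # Non-linear rescaling bounds the loss in [0, 1), as proposed in the GWD paper;
    # tau = 1.0 is an assumed setting, not reported in this article.
    return 1.0 - 1.0 / (tau + d)
```

Because each box maps to a Gaussian N((cx, cy), diag(w²/4, h²/4)), the distance stays smooth and informative even when the predicted and ground-truth boxes do not overlap, which is precisely where IoU-style losses stop providing a useful gradient.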
2. Related Work
3. Materials and Methods
3.1. Data Sources
3.2. Image Preprocessing
3.3. Text Detection Model
3.3.1. Swin Transformer
3.3.2. Dysample
3.3.3. Detection Head with LSK Module
3.3.4. Loss Optimization
4. Experimental Results and Analysis
4.1. Experimental Setup
4.1.1. Evaluation Indicators
4.1.2. Implementation Details
4.2. Comparison with State-of-the-Art Methods
4.3. Ablation Experiments
4.4. Qualitative Analysis
4.5. Results Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Brisinello, M.; Grbić, R.; Stefanovič, D.; Pečkai-Kovač, R. Optical Character Recognition on images with colorful background. In Proceedings of the 2018 IEEE 8th International Conference on Consumer Electronics-Berlin (ICCE-Berlin), Berlin, Germany, 2–5 September 2018; pp. 1–6. [Google Scholar]
- Adyanthaya, S.K. Text Recognition from Images: A Study. Int. J. Eng. Res. 2020, 8, IJERTCONV8IS13029. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I; pp. 21–37. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.R.; Lu, S. ICDAR 2015 competition on robust reading. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; pp. 1156–1160. [Google Scholar]
- Yao, C.; Bai, X.; Sang, N.; Zhou, X.; Zhou, S.; Cao, Z. Scene text detection via holistic, multi-channel prediction. arXiv 2016, arXiv:1606.09002. [Google Scholar]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- Epshtein, B.; Ofek, E.; Wexler, Y. Detecting text in natural scenes with stroke width transform. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2963–2970. [Google Scholar]
- Neumann, L.; Matas, J. A method for text localization and recognition in real-world images. In Proceedings of the Computer Vision–ACCV 2010: 10th Asian Conference on Computer Vision, Queenstown, New Zealand, 8–12 November 2010; Revised Selected Papers, Part III; pp. 770–783. [Google Scholar]
- Tian, Z.; Huang, W.; He, T.; He, P.; Qiao, Y. Detecting text in natural image with connectionist text proposal network. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VIII; pp. 56–72. [Google Scholar]
- Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; Liang, J. EAST: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5551–5560. [Google Scholar]
- Wang, W.; Xie, E.; Li, X.; Hou, W.; Lu, T.; Yu, G.; Shao, S. Shape robust text detection with progressive scale expansion network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9336–9345. [Google Scholar]
- Wang, X.; Jiang, Y.; Luo, Z.; Liu, C.; Choi, H.; Kim, S. Arbitrary shape scene text detection with adaptive text region representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 6449–6458. [Google Scholar]
- Zhu, Y.; Chen, J.; Liang, L.; Kuang, Z.; Jin, L.; Zhang, W. Fourier contour embedding for arbitrary-shaped text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 3123–3131. [Google Scholar]
- He, M.; Liao, M.; Yang, Z.; Zhong, H.; Tang, J.; Cheng, W.; Yao, C.; Wang, Y.; Bai, X. MOST: A multi-oriented scene text detector with localization refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 8813–8822. [Google Scholar]
- Ye, M.; Zhang, J.; Zhao, S.; Liu, J.; Du, B.; Tao, D. DPText-DETR: Towards better scene text detection with dynamic points in transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 3241–3249. [Google Scholar]
- Carbonell, M.; Mas, J.; Villegas, M.; Fornés, A.; Lladós, J. End-to-end handwritten text detection and transcription in full pages. In Proceedings of the 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), Sydney, NSW, Australia, 22–25 September 2019; pp. 29–34. [Google Scholar]
- Kohli, H.; Agarwal, J.; Kumar, M. An improved method for text detection using Adam optimization algorithm. Glob. Transit. Proc. 2022, 3, 230–234. [Google Scholar] [CrossRef]
- Kusetogullari, H.; Yavariabdi, A.; Hall, J.; Lavesson, N. DigitNet: A deep handwritten digit detection and recognition method using a new historical handwritten digit dataset. Big Data Res. 2021, 23, 100182. [Google Scholar] [CrossRef]
- Wang, X.; Xie, L.; Dong, C.; Shan, Y. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 1905–1914. [Google Scholar]
- Zhang, K.; Liang, J.; Van Gool, L.; Timofte, R. Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 4791–4800. [Google Scholar]
- Chen, Z.; Zhang, Y.; Gu, J.; Kong, L.; Yang, X. Recursive generalization transformer for image super-resolution. arXiv 2023, arXiv:2303.06373. [Google Scholar]
- Chen, X.; Wang, X.; Zhou, J.; Qiao, Y.; Dong, C. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 22367–22377. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
- Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to upsample by learning to sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6027–6037. [Google Scholar]
- Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 16794–16805. [Google Scholar]
- Yang, X.; Yan, J.; Ming, Q.; Wang, W.; Zhang, X.; Tian, Q. Rethinking rotated object detection with gaussian wasserstein distance loss. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 11830–11841. [Google Scholar]
- Liao, M.; Wan, Z.; Yao, C.; Chen, K.; Bai, X. Real-time scene text detection with differentiable binarization. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11474–11481. [Google Scholar]
- Baek, Y.; Lee, B.; Han, D.; Yun, S.; Lee, H. Character region awareness for text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9365–9374. [Google Scholar]
Paper | Method | Datasets | Result | Scene
---|---|---|---|---
[9] | SWT | ICDAR | F-measure: 0.66 | natural scene
[10] | MSER | ICDAR, SVT | F-measure: 0.687 | natural scene
[11] | CTPN | ICDAR | F-measure: 0.61 | natural scene
[12] | EAST (An Efficient and Accurate Scene Text detection pipeline) | ICDAR 2015, COCO-Text, and MSRA-TD500 | F-measure: 0.782 | natural scene
[13] | PSENet | CTW1500, Total-Text, ICDAR 2015, and ICDAR 2017 MLT | F-measure: 0.822 | natural scene
[14] | Arbitrary-shape text detection with adaptive text region representation | CTW1500, Total-Text, ICDAR 2013, ICDAR 2015, and MSRA-TD500 | F-measure: 0.917 | natural scene
[15] | FCENet | CTW1500, Total-Text, and ICDAR 2015 | F-measure: 0.858 | natural scene
[16] | MOST | SynthText, ICDAR 2017 MLT (MLT17), MTWI, ICDAR 2015 (IC15), and MSRA-TD500 | F-measure: 0.864 | natural scene
[17] | DPText-DETR | Total-Text, CTW1500, and ICDAR19 ArT | F-measure: 0.890 | natural scene
[18] | End-to-end detection and transcription framework | IAM | mAP: 0.90 | handwritten document
[19] | J&M | MNIST | Accuracy: 0.99 | handwritten document
[20] | DigitNet | Extended MNIST, Extended USPS, and DIDA (newly created) | Correct detection rate: 76.84% | handwritten document
Configuration | Value
---|---
Operating system | Windows 10 Professional
CPU | Intel(R) Core(TM) i7-9700
GPU | NVIDIA GeForce RTX 3090
Programming language | Python 3.7.12
Development framework | PyTorch 1.8.0
Accelerator | CUDA 11.1
Image size | 640 × 640
SGD momentum | 0.937
Weight decay | 5 × 10⁻⁴
Learning rate | 0.01
Epochs | 500
Batch size | 1
single_cls | True
IoU threshold | 0.55
Mosaic | 1.0
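The hyperparameters above map directly onto a standard Ultralytics YOLOv8 training call. The snippet below is an illustrative sketch under that assumption; the dataset file name `manuscripts.yaml` is hypothetical, and this is not the authors' released training script.

```python
from ultralytics import YOLO  # assumes the standard Ultralytics package

model = YOLO("yolov8s.pt")  # YOLOv8s baseline weights
model.train(
    data="manuscripts.yaml",  # hypothetical dataset config; not given in the article
    imgsz=640,                # image size 640 x 640
    epochs=500,
    batch=1,
    single_cls=True,          # treat all words as a single "text" class
    optimizer="SGD",
    lr0=0.01,                 # initial learning rate
    momentum=0.937,           # SGD momentum
    weight_decay=5e-4,
    mosaic=1.0,               # mosaic augmentation probability
)
metrics = model.val(iou=0.55)  # the table's IoU threshold applies at NMS/evaluation time
```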
Model | Precision | Recall | mAP@0.5 | F1 | Inference Time
---|---|---|---|---|---
EAST [12] | 0.423 | 0.367 | N/A | 0.3935 | 290.1 ms
DBNet [29] | 0.206 | 0.123 | N/A | 0.1493 | 36.6 ms
CRAFT [30] | 0.665 | 0.512 | N/A | 0.5794 | 566.5 ms
Ours | 0.863↑ | 0.814↑ | 0.824 | 0.8377 | 9.2 ms
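The reported F1 scores are the usual harmonic mean of precision and recall; the check below reproduces the "Ours" entry from the table (small discrepancies for the baselines arise because their precision and recall values are themselves rounded).

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.863, 0.814), 4))  # 0.8378, matching the reported 0.8377 to rounding
```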
YOLOv8s | Swin Transformer | Dysample | LSK | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 | Inference Time
---|---|---|---|---|---|---|---|---
√ | | | | 0.782 | 0.748 | 0.757 | 0.359 | 7.2 ms
√ | √ | | | 0.833 | 0.790 | 0.829 | 0.447 | 8.0 ms
√ | √ | √ | | 0.841 | 0.790 | 0.846 | 0.455 | 8.5 ms
√ | √ | √ | √ | 0.863 | 0.814 | 0.824 | 0.415 | 9.9 ms
Model | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 | Inference Time
---|---|---|---|---|---
YOLOv8s-CIoU | 0.782 | 0.748 | 0.757 | 0.359 | 7.2 ms
YOLOv8s-GWD | 0.800 | 0.749 | 0.772 | 0.358 | 6.5 ms
Ours-CIoU | 0.851 | 0.810 | 0.849 | 0.453 | 9.9 ms
Ours-GWD | 0.863 | 0.814 | 0.824 | 0.415 | 9.2 ms