1. Introduction
Scene text detection is an important research direction in the field of object detection; its goal is to detect text in real-world photographs. With the rise of deep learning, scene text detection has been widely applied to autonomous driving, unmanned supermarkets, image retrieval, real-time translation, and other fields.
Text is one of the most important ways we perceive our surroundings, and as one of humanity’s most influential creations, it has played an important role in modern society [1,2,3]. As a carrier of information exchange between people, text is highly abstract and appears widely in various images. Compared with other visual content, the text in an image is more logical and general: it provides accurate high-level language descriptions and rich semantic information, which helps in analyzing and understanding the content of the image.
Although text detection and recognition in images with complex backgrounds has achieved great success for majority languages such as English and Chinese, the detection and recognition of Uyghur text, even in images with simple backgrounds, has remained a difficult challenge in recent years. The challenge mainly comes from the peculiarities of Uyghur text, which differs from majority-language scripts in several aspects, as illustrated in Figure 1:
- At the image level, Uyghur text is composed of a main part and an additional part. The characters in the main part are always connected to each other, forming a single connected component. The additional part, consisting of one to three dots or other special structures, is secondary but should not be ignored.
- Words are the basic units of Uyghur text, and words vary in length.
- There are no uppercase or lowercase letters, but each letter takes different forms depending on its position within a word.
- Every line of text has a “baseline” running through its middle. Words on the baseline usually have the same height, and the spacing between them is similar.
To detect Uyghur text in images with simple backgrounds, the Maximally Stable Extremal Regions (MSERs) [4] algorithm is often chosen for its good performance. However, traditional MSER methods extract regions from grayscale images. When a color image is converted to grayscale, the contrast between objects and background usually weakens, so some low-contrast regions are missed. The channel-enhanced MSERs algorithm was therefore proposed: it extracts MSER regions separately on the R, G, and B channels of the image and then marks the regions from all three channels on the color image. Even so, the channel-enhanced MSERs algorithm alone does not detect Uyghur text completely.
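The channel-enhancement idea can be sketched independently of any particular MSER implementation: run a single-channel region detector on the R, G, and B planes separately and pool all the results, so that a region visible in only one channel is not lost to grayscale conversion. The sketch below is illustrative only; `toy_detect` is a trivial threshold-based stand-in for a real MSER extractor, and `channel_enhanced_regions` is a hypothetical name, not from the paper:

```python
import numpy as np

def channel_enhanced_regions(img_rgb, detect):
    """Run a single-channel region detector on the R, G and B planes
    separately and pool all detected regions, so regions visible in
    only one channel are not lost to a grayscale conversion."""
    regions = []
    for c in range(3):
        regions.extend(detect(img_rgb[:, :, c]))
    return regions

def toy_detect(channel):
    """Toy stand-in for an MSER extractor: returns the bounding box
    (x1, y1, x2, y2) of all bright pixels, or nothing if none exist."""
    ys, xs = np.nonzero(channel > 128)
    if xs.size == 0:
        return []
    return [(int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))]

# A red blob and a green blob: grayscale conversion could weaken either,
# but per-channel detection finds both.
img = np.zeros((10, 10, 3), dtype=np.uint8)
img[2:5, 3:6, 0] = 255   # red blob
img[6:8, 1:3, 1] = 200   # green blob
print(channel_enhanced_regions(img, toy_detect))
```

A real implementation would substitute an actual MSER extractor (e.g. OpenCV’s) for `toy_detect`; the per-channel pooling structure stays the same.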
In this paper, we propose an effective and practical Uyghur text detection method based on channel-enhanced MSERs [5] and a CNN classification model, designed around the characteristics of Uyghur text in images with simple backgrounds. In the component extraction stage, a new component candidate extraction algorithm is put forward, based on channel-enhanced MSERs [5] and tailored to the characteristics of Uyghur text. In the component analysis stage, a CNN classification model replaces the SVM classifier that was trained on the HOG features of the components. A merging algorithm is then proposed to build word-level text candidate regions. Omitted text regions are recalled according to the color similarity and spatial relationships of the word-level candidate regions, classified with the CNN, and finally the text regions are merged into text lines.
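The word-level merging step can be illustrated with a simple proximity-based bounding-box merge. This is a hedged sketch, not the paper’s actual algorithm: it greedily unions boxes whose horizontal and vertical gaps fall below a threshold, which mimics grouping the connected components of one word while keeping distant words separate (`merge_boxes` and the `gap` parameter are illustrative assumptions):

```python
def boxes_close(b1, b2, gap=5):
    """True if two (x1, y1, x2, y2) boxes overlap or lie within `gap` pixels."""
    x1, y1, x2, y2 = b1
    X1, Y1, X2, Y2 = b2
    return not (x2 + gap < X1 or X2 + gap < x1 or
                y2 + gap < Y1 or Y2 + gap < y1)

def merge_boxes(boxes, gap=5):
    """Greedily merge nearby boxes until no pair is close, yielding
    word-level candidate regions from character-level components."""
    boxes = [list(b) for b in boxes]
    changed = True
    while changed:
        changed = False
        out = []
        while boxes:
            b = boxes.pop()
            i = 0
            while i < len(boxes):
                if boxes_close(tuple(b), tuple(boxes[i]), gap):
                    o = boxes.pop(i)
                    b = [min(b[0], o[0]), min(b[1], o[1]),
                         max(b[2], o[2]), max(b[3], o[3])]
                    changed = True
                else:
                    i += 1
            out.append(b)
        boxes = out
    return sorted(tuple(b) for b in boxes)

# Two adjacent components merge into one word box; the far one stays separate.
print(merge_boxes([(0, 0, 10, 10), (12, 0, 20, 10), (100, 0, 110, 10)]))
```

The paper’s actual merging additionally uses color similarity; the sketch keeps only the spatial-proximity part for brevity.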
For text recognition, traditional methods usually process individual characters first and then integrate them into words by beam search, dynamic programming, etc. [6]. Recent methods cast text recognition as a sequence recognition problem, but most target majority languages such as English and Chinese [7,8]. To recognize Uyghur text in images with simple backgrounds, this paper improves the Convolutional Recurrent Neural Network (CRNN) and applies the improved model to sequence recognition of Uyghur text in images. The Uyghur text in the images is labeled in Latin characters; the decoding results of the transcription layer are first converted into Latin characters and then mapped back into Uyghur text sequences.
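The transcription step described above, decoding into Latin labels and then mapping Latin back to Uyghur, can be sketched as follows. The greedy CTC collapse (merge repeated labels, drop blanks) is the standard decoding used with CRNN-style transcription layers; the tiny Latin-to-Uyghur table here is purely illustrative and not the paper’s actual label set:

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse per-frame label ids: merge repeated ids, drop blanks."""
    out, prev = [], blank
    for t in frame_ids:
        if t != blank and t != prev:
            out.append(t)
        prev = t
    return out

# Illustrative label tables (hypothetical, not the paper's actual alphabet).
ID_TO_LATIN = {1: "a", 2: "b"}
LATIN_TO_UYGHUR = {"a": "ئا", "b": "ب"}

frames = [1, 1, 0, 2, 2, 0, 1]   # raw per-frame predictions from the RNN
latin = "".join(ID_TO_LATIN[i] for i in ctc_greedy_decode(frames))
uyghur = "".join(LATIN_TO_UYGHUR[c] for c in latin)
print(latin, uyghur)
```

Note that repeated frames of the same label collapse to one character, while a blank between two identical labels keeps them distinct, which is why the frame sequence above decodes to three characters rather than five.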
The main contributions of this paper are as follows:
In view of the characteristics of Uyghur text, a new and effective Uyghur text detection method based on channel-enhanced MSERs and a CNN is proposed.
We improved the traditional CRNN network and applied the improved model to sequence recognition of Uyghur text in images with simple backgrounds.
To satisfy the image sample data needs of the text detection task, a text detection dataset was established. To meet the training and testing requirements of the improved CRNN network, a tool for generating random text in natural scene images [9] was developed independently, and a text recognition dataset was established.
The remaining sections are organized as follows. Section 1 briefly describes the format of the Uyghur language and outlines the main contributions of this paper. Section 2 briefly reviews related work on Uyghur text detection and recognition in images. Section 3 describes the proposed methodology in detail. Section 4 presents the results of the conducted experiments. Section 5 concludes and discusses the paper.
Author Contributions
Conceptualization, M.I.; methodology, M.I.; software, M.I., A.M. and A.H.; validation, M.I., A.M. and A.H.; formal analysis, M.I.; investigation, M.I., A.M. and A.H.; resources, M.I.; data curation, M.I.; writing—original draft preparation, M.I.; writing—review and editing, M.I. and A.H.; visualization, M.I.; supervision, A.H.; project administration, A.H.; funding acquisition, M.I. and A.H. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the Natural Science Foundation of Xinjiang (No. 2020D01C045), the National Science Foundation of China (NSFC) under Grant No. 62166043, and the Youth Fund for the Scientific Research Program of the Autonomous Region (XJEDU2019Y007).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data used to support this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Long, S.; He, X.; Yao, C. Scene Text Detection and Recognition: The Deep Learning Era. arXiv 2018, arXiv:1811.04256.
- Ye, Q.; Doermann, D. Text Detection and Recognition in Imagery: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1480–1500.
- Chen, X.; Jin, L.; Zhu, Y.; Luo, C.; Wang, T. Text Recognition in the Wild: A Survey. ACM Comput. Surv. 2020, 54, 1–35.
- Matas, J.; Chum, O.; Urban, M.; Pajdla, T. Robust wide-baseline stereo from maximally stable extremal regions. Image Vis. Comput. 2004, 22, 761–767.
- Song, Y.; Chen, J.; Xie, H.; Chen, Z.; Gao, X.; Chen, X. Robust and parallel Uyghur text localization in complex background images. Mach. Vis. Appl. 2017, 28, 755–769.
- Bissacco, A.; Cummins, M.; Netzer, Y.; Neven, H. PhotoOCR: Reading text in uncontrolled conditions. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, 1–8 December 2013.
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. Adv. Neural Inf. Process. Syst. 2014, 27, 3104–3112.
- Bai, J.; Chen, Z.; Feng, B.; Xu, B. Chinese Image Text Recognition on grayscale pixels. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 1380–1384.
- Jaderberg, M.; Vedaldi, A.; Zisserman, A. Deep features for text spotting. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 512–528.
- Shi, B.; Yang, M.; Wang, X.; Lyu, P.; Yao, C.; Bai, X. ASTER: An Attentional Scene Text Recognizer with Flexible Rectification. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 2035–2048.
- Mou, Y.; Tan, L.; Yang, H.; Chen, J.; Liu, L.; Yan, R.; Huang, Y. PlugNet: Degradation Aware Scene Text Recognition Supervised by a Pluggable Super-Resolution Unit. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020.
- Wang, W.; Xie, E.; Li, X.; Liu, X.; Liang, D.; Zhibo, Y.; Lu, T.; Shen, C. PAN++: Towards Efficient and Accurate End-to-End Spotting of Arbitrarily-Shaped Text. IEEE Trans. Pattern Anal. Mach. Intell. 2021.
- Xue, C.; Lu, S.; Bai, S.; Zhang, W.; Wang, C. I2C2W: Image-to-Character-to-Word Transformers for Accurate Scene Text Recognition. arXiv 2021, arXiv:2105.08383.
- Huang, M.; Liu, Y.; Peng, Z.; Liu, C.; Lin, D.; Zhu, S.; Yuan, N.; Ding, K.; Jin, L. SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition. arXiv 2022, arXiv:2203.10209.
- Fang, S.; Xie, H.; Chen, Z.; Zhu, S.; Gu, X.; Gao, X. Detecting Uyghur text in complex background images with convolutional neural network. Multimed. Tools Appl. 2017, 76, 15083–15103.
- Alsharif, O.; Pineau, J. End-to-End Text Recognition with Hybrid HMM Maxout Models. arXiv 2013, arXiv:1310.1811.
- Jaderberg, M.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition. arXiv 2014, arXiv:1406.2227.
- Shi, B.; Bai, X.; Yao, C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2298–2304.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Curran Associates Inc.: Red Hook, NY, USA, 2012; pp. 1097–1105.
- Bai, J.; Chen, Z.; Feng, B.; Xu, B. Image character recognition using deep convolutional neural network learned from different languages. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27–30 October 2014; pp. 2560–2564.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).