Unconstrained Bilingual Scene Text Reading Using Octave as a Feature Extractor
Abstract
Featured Application
1. Introduction
- Following [22], we prepare large synthetically generated bilingual (English and Amharic) scene text datasets. Additionally, we collect real datasets containing text of different shapes written in the two scripts.
- Our proposed model extracts features by factorizing them according to their frequencies (low and high), which reduces both storage and computation costs. This also gives each layer a larger receptive field, so it captures more contextual information.
- The proposed system can detect and read text of arbitrary shape in an image, including oriented, horizontal, and curved text.
- We examine the performance of the time-restricted attention encoder-decoder module in predicting words from the extracted and segmented features.
- Using the prepared dataset and well-known datasets, we perform several experiments, and our model shows promising results.
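The storage saving claimed for frequency factorization can be illustrated with a small calculation. The sketch below is not the authors' implementation. It assumes, following the general octave convolution idea, that a fraction `alpha` of the channels forms the low-frequency group and is stored at half the spatial resolution. The function names, the 64-channel 32 × 32 feature map, and `alpha = 0.5` are hypothetical values chosen for illustration.

```python
def octave_split(channels, alpha):
    """Split a channel count into (high, low) frequency groups.

    alpha is the assumed fraction of channels assigned to the
    low-frequency group (a hypothetical parameter for this sketch).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    low = int(round(alpha * channels))
    return channels - low, low


def feature_map_cells(channels, height, width, alpha):
    """Count stored activations when low-frequency maps use half resolution."""
    high, low = octave_split(channels, alpha)
    return high * height * width + low * (height // 2) * (width // 2)


def avg_pool_2x2(fmap):
    """Downsample one 2-D map (list of rows) by averaging 2x2 blocks."""
    rows, cols = len(fmap) // 2, len(fmap[0]) // 2
    return [
        [
            (fmap[2 * r][2 * c] + fmap[2 * r][2 * c + 1]
             + fmap[2 * r + 1][2 * c] + fmap[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(cols)
        ]
        for r in range(rows)
    ]


if __name__ == "__main__":
    full = feature_map_cells(64, 32, 32, alpha=0.0)    # no factorization
    octave = feature_map_cells(64, 32, 32, alpha=0.5)  # half the channels are low-frequency
    print(full, octave, octave / full)
```

Under these assumptions the factorized map stores 40,960 activations instead of 65,536, a ratio of 0.625. Because each low-frequency cell summarizes a 2 × 2 block of the original map, a convolution kernel applied to that group covers a wider area of the input, which is one way to read the larger receptive field mentioned above.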
2. Related Work
2.1. Scene Text Detection
2.2. Scene Text Recognition
2.3. Scene Text Spotting
3. Methodology
3.1. Overall View of the Architecture
3.2. Feature Extraction Layer
3.3. Text Region Detection Layer
3.4. Segmentation and Recognition Layer
4. Ethiopic Script and Dataset Collection
4.1. Ethiopic Script
4.2. Dataset Collection
4.2.1. Synthetic Scene Text Dataset
4.2.2. Real Scene Text Dataset
5. Experiments and Discussions
5.1. Implementation Details
5.2. Experiment Results
6. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Tian, Z.; Huang, W.; He, T.; Qiao, Y. Detecting Text in Natural Image with Connectionist Text Proposal Network. Available online: http://textdet.com/ (accessed on 14 December 2019).
- Yao, C.; Bai, X.; Sang, N.; Zhou, X.; Zhou, S.; Cao, Z. Scene text detection via holistic, multi-channel prediction. arXiv 2016, arXiv:1606.09002. Available online: http://arxiv.org/abs/1606.09002 (accessed on 31 March 2019).
- Bušta, M.; Neumann, L.; Matas, J. FASText: Efficient unconstrained scene text detector. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7 December 2015; Volume 2015 Inter, pp. 1206–1214.
- Deng, D.; Liu, H.; Li, X.; Cai, D. PixelLink: Detecting scene text via instance segmentation. arXiv 2018, arXiv:1801.01315. Available online: http://arxiv.org/abs/1801.01315 (accessed on 10 February 2019).
- Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; Liang, J. EAST: An efficient and accurate scene text detector. arXiv 2017, arXiv:1704.03155.
- Jaderberg, M.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Reading text in the wild with convolutional neural networks. Int. J. Comput. Vis. 2016, 116, 1–20.
- He, P.; Huang, W.; Qiao, Y.; Loy, C.C.; Tang, X. Reading scene text in deep convolutional sequences. arXiv 2015, arXiv:1506.04395. Available online: https://arxiv.org/abs/1506.04395 (accessed on 2 April 2019).
- Liao, M.; Shi, B.; Bai, X. TextBoxes++: A single-shot oriented scene text detector. arXiv 2018, arXiv:1801.02765.
- Liao, M.; Lyu, P.; He, M.; Yao, C.; Wu, W.; Bai, X. Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In Proceedings of the 2019 Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; LNCS; Volume 11218, pp. 71–88.
- Li, H.; Wang, P.; Shen, C. Towards end-to-end text spotting with convolutional recurrent neural networks. arXiv 2017, arXiv:1707.03985. Available online: http://arxiv.org/abs/1707.03985 (accessed on 2 April 2019).
- Liao, M.; Shi, B.; Bai, X.; Wang, X.; Liu, W. TextBoxes: A Fast text detector with a single deep neural network. arXiv 2016, arXiv:1611.06779. Available online: http://arxiv.org/abs/1611.06779 (accessed on 2 April 2019).
- Jaderberg, M.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Synthetic data and artificial neural networks for natural scene text recognition. arXiv 2014, arXiv:1406.2227. Available online: http://arxiv.org/abs/1406.2227 (accessed on 2 March 2019).
- Tian, S.; Bhattacharya, U.; Lu, S.; Su, B.; Wang, Q.; Wei, X.; Lu, Y.; Tan, C.L. Multilingual scene character recognition with co-occurrence of histogram of oriented gradients. Pattern Recognit. 2016, 51, 125–134.
- Shi, B.; Bai, X.; Belongie, S. Detecting oriented text in natural images by linking segments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 2550–2558. Available online: http://openaccess.thecvf.com/content_cvpr_2017/html/Shi_Detecting_Oriented_Text_CVPR_2017_paper.html (accessed on 11 April 2019).
- Shi, B.; Wang, X.; Lyu, P.; Yao, C.; Bai, X. Robust scene text recognition with automatic rectification. arXiv 2016, arXiv:1603.03915. Available online: http://arxiv.org/abs/1603.03915 (accessed on 21 March 2019).
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
- Addis, D.; Liu, C.-M.; Ta, V.-D. Ethiopic Natural Scene Text Recognition Using Deep Learning Approaches; Springer: Cham, Switzerland, 2020; pp. 502–511.
- Simons, G.F.; Fennig, C.D. Ethnologue: Languages of the World, 20th ed.; SIL International: Dallas, TX, USA, 2017.
- Bušta, M.; Neumann, L.; Matas, J. Deep Textspotter: An End-to-End Trainable Scene Text Localization and Recognition Framework. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2204–2212.
- Lin, T.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Chen, Y.; Fan, H.; Xu, B.; Yan, Z.; Kalantidis, Y.; Rohrbach, M.; Yan, S.; Feng, J. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. arXiv 2019, arXiv:1904.05049. Available online: http://arxiv.org/abs/1904.05049 (accessed on 20 September 2019).
- Gupta, A.; Vedaldi, A.; Zisserman, A. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 17–22 June 2016; Volume 2016-Decem, pp. 2315–2324.
- Neumann, L.; Matas, J. Scene Text Localization and Recognition with oriented Stroke Detection. In Proceedings of the IEEE International Conference on Computer Vision 2013, Sydney, Australia, 1–8 December 2013; pp. 97–104.
- Wang, K.; Babenko, B.; Belongie, S. End-to-end scene text recognition. In Proceedings of the IEEE International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1457–1464.
- Lee, J.J.; Lee, P.H.; Lee, S.W.; Yuille, A.; Koch, C. AdaBoost for text detection in natural scene. In Proceedings of the IEEE International Conference on Document Analysis and Recognition, Beijing, China, 18–21 September 2011; pp. 429–434.
- Epshtein, B.; Ofek, E.; Wexler, Y. Detecting text in natural scenes with stroke width transform. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2963–2970.
- Matas, J.; Chum, O.; Urban, M.; Pajdla, T. Robust wide-baseline stereo from maximally stable extremal regions. Image Vis. Comput. 2004, 22, 761–767.
- Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimed. 2018, 20, 3111–3122.
- He, D.; Yang, X.; Liang, C.; Zhou, Z.; Ororbia, A.G.; Kifer, D.; Giles, C.L. Multi-scale FCN with cascaded instance aware segmentation for arbitrary oriented word spotting in the wild. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 474–483.
- Su, B.; Lu, S. Accurate Scene Text Recognition Based on Recurrent Neural Network. In Computer Vision—ACCV 2014, Lecture Notes in Computer Science; Cremers, D., Reid, I., Saito, H., Yang, M.H., Eds.; Springer: Cham, Switzerland, 2014; Volume 9003, pp. 35–48.
- Lee, C.-Y.; Osindero, S. Recursive recurrent nets with attention modeling for OCR in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 26–30 June 2016; pp. 2231–2239.
- Liu, W.; Chen, C.; Wong, K.-Y.; Su, Z.; Han, J. STAR-Net: A Spatial attention residue network for scene text recognition. BMVC 2016, 2, 7.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision 2016, Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37.
- Liu, X.; Liang, D.; Yan, S.; Chen, D.; Qiao, Y.; Yan, J. FOTS: Fast Oriented Text Spotting with a Unified Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5676–5685.
- Lundgren, A.; Castro, D.; Lima, E.; Bezerra, B. OctShuffleMLT: A compact octave based neural network for end-to-end multilingual text detection and recognition. In Proceedings of the 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), Sydney, Australia, 22–25 September 2019; pp. 37–42.
- Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, inception-ResNet and the impact of residual connections on learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 4278–4284.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; Volume 07-12-June, pp. 1–9.
- Povey, D.; Hadian, H.; Ghahremani, P.; Li, K.; Khudanpur, S. A time-restricted self-attention layer for ASR. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; Volume 2018-April, pp. 5874–5878.
- Zagoruyko, S.; Komodakis, N. Wide residual networks. arXiv 2016, arXiv:1605.07146.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems 2015, Montreal, PQ, Canada, 7–12 December 2015; pp. 91–99.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision 2017, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS—Improving Object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision 2017, Venice, Italy, 22–29 October 2017; pp. 5561–5569.
- Wu, L.; Li, T.; Wang, L.; Yan, Y. Improving hybrid CTC/attention architecture with time-restricted self-attention CTC for end-to-end speech recognition. Appl. Sci. 2019, 9, 4639.
- Karatzas, D.; Shafait, F.; Uchida, S.; Iwamura, M.; Bigorda, L.G.i.; Mestre, S.R.; Mas, J.; Mota, D.F.; Almazan, J.A.; de Las Heras, L.P. ICDAR 2013 robust reading competition. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition, Washington, DC, USA, 25–28 August 2013; pp. 1484–1493.
- Ch’Ng, C.K.; Chan, C.S. Total-Text: A Comprehensive Dataset for Scene Text Detection and Recognition. In Proceedings of the International Conference on Document Analysis and Recognition, Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 935–942.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Gómez, L.; Karatzas, D. TextProposals: A text-specific selective search algorithm for word spotting in the wild. Pattern Recognit. 2017, 70, 60–74.
- Bušta, M.; Patel, Y.; Matas, J. E2E-MLT—An unconstrained end-to-end method for multi-language scene text. arXiv 2018, arXiv:1801.09919. Available online: http://arxiv.org/abs/1801.09919 (accessed on 11 April 2019).
- Wang, W.; Xie, E.; Li, X.; Hou, W.; Lu, T.; Yu, G.; Shao, S. Shape robust text detection with progressive scale expansion network. arXiv 2019, arXiv:1903.12473. Available online: http://arxiv.org/abs/1903.12473 (accessed on 6 January 2019).
Method | Model | Detection | Recognition | Year |
---|---|---|---|---|
Liao et al. [11] | TextBoxes | SSD-based framework | CRNN | 2017 |
Bušta et al. [19] | Deep TextSpotter | Yolo v2 | CTC | 2017 |
Liu et al. [34] | FOTS | EAST with RoI Rotate | CTC | 2018 |
Liao et al. [8] | TextBoxes++ | SSD-based framework | CRNN | 2018 |
Liao et al. [9] | Mask TextSpotter | Mask R-CNN | Character segmentation + Spatial attention module | 2019 |
Dataset | Language | Total Images | Training | Testing | Type |
---|---|---|---|---|---|
Ours (Real) | Bilingual | 1200 | 600 | 600 | Irregular |
Ours (Synthetic) | Bilingual | 500,000 | 500,000 | - | Regular |
ICDAR2013 [44] | English | 462 | 229 | 233 | Regular |
ICDAR2015 [40] | English | 1500 | 1000 | 500 | Regular |
Synthetic [22] | English | 600,000 | - | - | Regular |
Total-Text [45] | English | 1555 | 1255 | 300 | Irregular |
Method | ICDAR2013 | ICDAR2015 | Total-Text |
---|---|---|---|
TextProposals+DicNet * [47] | 68.54% | 47.18% | - |
Deep TextSpotter * [19] | 77.0% | 47.0% | - |
FOTS * [34] | 84.77% | 65.33% | - |
TextBoxes * [8] | 84.65% | 51.9% | - |
E2E-MLT ** [48] | - | 71.4% | - |
Mask TextSpotter ** [9] | 86.5% | 62.4% | 65.3% |
Ours | 86.8% | 62.15% | 67.6% |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Tadesse, D.A.; Liu, C.-M.; Ta, V.-D. Unconstrained Bilingual Scene Text Reading Using Octave as a Feature Extractor. Appl. Sci. 2020, 10, 4474. https://doi.org/10.3390/app10134474