A Multi-Scale Natural Scene Text Detection Method Based on Attention Feature Extraction and Cascade Feature Fusion
Abstract
1. Introduction
2. Related Work
3. Method
3.1. Attention Feature Fusion
3.2. Depthwise Separable Convolution
3.3. Pyramid Pooling Module (PPM)
3.4. Cascaded Feature Fusion
3.5. Feature Fusion Module (FFM)
3.6. SAM
3.7. Differentiable Binarization
3.8. Loss Function
4. Experiment and Analysis
4.1. Datasets
4.2. Experimental Configuration
4.3. Evaluation Index
4.4. Ablation Experiment
4.5. Experimental Results
5. Discussion and Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Zhu, Y.; Chen, J.; Liang, L.; Kuang, Z.; Jin, L.; Zhang, W. Fourier contour embedding for arbitrary-shaped text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3123–3131. [Google Scholar]
- Dai, P.; Zhang, S.; Zhang, H.; Cao, X. Progressive contour regression for arbitrary-shape scene text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7393–7402. [Google Scholar]
- Taye, M.M. Theoretical understanding of convolutional neural network: Concepts, architectures, applications, future directions. Computation 2023, 11, 52. [Google Scholar] [CrossRef]
- Krichen, M. Convolutional neural networks: A survey. Computers 2023, 12, 151. [Google Scholar] [CrossRef]
- Su, Y.; Shao, Z.; Zhou, Y.; Meng, F.; Zhu, H.; Liu, B.; Yao, R. TextDCT: Arbitrary-shaped text detection via discrete cosine transform mask. IEEE Trans. Multimed. 2022, 25, 5030–5042. [Google Scholar] [CrossRef]
- Liao, M.; Wan, Z.; Yao, C.; Chen, K.; Bai, X. Real-time scene text detection with differentiable binarization. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11474–11481. [Google Scholar]
- Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.R.; Lu, S. ICDAR 2015 competition on robust reading. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Nancy, France, 23–26 August 2015; pp. 1156–1160. [Google Scholar]
- Yao, C.; Bai, X.; Liu, W.; Ma, Y.; Tu, Z. Detecting texts of arbitrary orientations in natural images. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1083–1090. [Google Scholar]
- Ch’ng, C.K.; Chan, C.S. Total-text: A comprehensive dataset for scene text detection and recognition. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; pp. 935–942. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. pp. 21–37. [Google Scholar]
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
- Tian, Z.; Huang, W.; He, T.; He, P.; Qiao, Y. Detecting text in natural image with connectionist text proposal network. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VIII 14. pp. 56–72. [Google Scholar]
- Shi, B.; Bai, X.; Belongie, S. Detecting oriented text in natural images by linking segments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2550–2558. [Google Scholar]
- Liao, M.; Shi, B.; Bai, X.; Wang, X.; Liu, W. TextBoxes: A fast text detector with a single deep neural network. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
- Liao, M.; Shi, B.; Bai, X. TextBoxes++: A single-shot oriented scene text detector. IEEE Trans. Image Process. 2018, 27, 3676–3690. [Google Scholar] [CrossRef]
- Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimed. 2018, 20, 3111–3122. [Google Scholar] [CrossRef]
- Jiang, Y.; Zhu, X.; Wang, X.; Yang, S.; Li, W.; Wang, H.; Fu, P.; Luo, Z. R2CNN: Rotational region CNN for arbitrarily-oriented scene text detection. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 3610–3615. [Google Scholar]
- Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; Liang, J. EAST: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5551–5560. [Google Scholar]
- Li, X.; Wang, W.; Hou, W.; Liu, R.-Z.; Lu, T.; Yang, J. Shape robust text detection with progressive scale expansion network. arXiv 2018, arXiv:1806.02559. [Google Scholar]
- Long, S.; Ruan, J.; Zhang, W.; He, X.; Wu, W.; Yao, C. TextSnake: A flexible representation for detecting text of arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 20–36. [Google Scholar]
- Deng, D.; Liu, H.; Li, X.; Cai, D. PixelLink: Detecting scene text via instance segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
- Wang, W.; Xie, E.; Song, X.; Zang, Y.; Wang, W.; Lu, T.; Yu, G.; Shen, C. Efficient and accurate arbitrary-shaped text detection with pixel aggregation network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8440–8449. [Google Scholar]
- Xu, Y.; Wang, Y.; Zhou, W.; Wang, Y.; Yang, Z.; Bai, X. TextField: Learning a deep direction field for irregular scene text detection. IEEE Trans. Image Process. 2019, 28, 5566–5579. [Google Scholar] [CrossRef]
- Huang, Z.; Zhong, Z.; Sun, L.; Huo, Q. Mask R-CNN with pyramid attention network for scene text detection. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 764–772. [Google Scholar]
- Baek, Y.; Lee, B.; Han, D.; Yun, S.; Lee, H. Character region awareness for text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9365–9374. [Google Scholar]
- Xie, E.; Zang, Y.; Shao, S.; Yu, G.; Yao, C.; Li, G. Scene text detection with supervised pyramid context network. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 9038–9045. [Google Scholar]
- Long, S.; Qin, S.; Panteleev, D.; Bissacco, A.; Fujii, Y.; Raptis, M. Towards end-to-end unified scene text detection and layout analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1049–1059. [Google Scholar]
- Saini, R.; Jha, N.K.; Das, B.; Mittal, S.; Mohan, C.K. ULSAM: Ultra-lightweight subspace attention module for compact convolutional neural networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1627–1636. [Google Scholar]
- Lu, S.; Liu, M.; Yin, L.; Yin, Z.; Liu, X.; Zheng, W. The multi-modal fusion in visual question answering: A review of attention mechanisms. PeerJ Comput. Sci. 2023, 9, e1400. [Google Scholar] [CrossRef]
- Guo, M.-H.; Lu, C.-Z.; Liu, Z.-N.; Cheng, M.-M.; Hu, S.-M. Visual attention network. Comput. Vis. Med. 2023, 9, 733–752. [Google Scholar] [CrossRef]
- Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional feature fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3560–3569. [Google Scholar]
- Hassan, E.; L., L.V. Scene text detection using attention with depthwise separable convolutions. Appl. Sci. 2022, 12, 6425. [Google Scholar] [CrossRef]
- Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3156–3164. [Google Scholar]
- Ibrayim, M.; Li, Y.; Hamdulla, A. Scene text detection based on two-branch feature extraction. Sensors 2022, 22, 6262. [Google Scholar] [CrossRef]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
- Wang, Q.; Ma, Y.; Zhao, K.; Tian, Y. A comprehensive survey of loss functions in machine learning. Ann. Data Sci. 2020, 9, 187–212. [Google Scholar] [CrossRef]
- Ho, Y.; Wookey, S. The real-world-weight cross-entropy loss function: Modeling the costs of mislabeling. IEEE Access 2019, 8, 4806–4813. [Google Scholar] [CrossRef]
- Shrivastava, A.; Gupta, A.; Girshick, R. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 761–769. [Google Scholar]
- Gupta, A.; Vedaldi, A.; Zisserman, A. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2315–2324. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Table 1. Ablation results on the ICDAR2015 dataset (√ indicates the module is enabled).

Backbone | DSAF | PFFM | SAM | P(%) | R(%) | F(%) |
---|---|---|---|---|---|---|
ResNet-18 | | | | 86.12 | 77.81 | 81.75 |
ResNet-18 | √ | | | 86.52 | 78.33 | 82.22 |
ResNet-18 | | √ | | 87.25 | 78.32 | 82.54 |
ResNet-18 | √ | √ | | 87.53 | 78.86 | 82.97 |
ResNet-18 | √ | √ | √ | 87.65 | 79.45 | 83.34 |
Table 2. Ablation results on the Total-Text dataset (√ indicates the module is enabled).

Backbone | DSAF | PFFM | SAM | P(%) | R(%) | F(%) |
---|---|---|---|---|---|---|
ResNet-18 | | | | 86.58 | 75.32 | 80.55 |
ResNet-18 | √ | | | 88.02 | 78.16 | 82.80 |
ResNet-18 | | √ | | 87.20 | 78.53 | 82.63 |
ResNet-18 | √ | √ | | 87.35 | 79.28 | 83.11 |
ResNet-18 | √ | √ | √ | 87.41 | 79.32 | 83.16 |
Table 3. Ablation results on the MSRA-TD500 dataset (√ indicates the module is enabled).

Backbone | DSAF | PFFM | SAM | P(%) | R(%) | F(%) |
---|---|---|---|---|---|---|
ResNet-18 | | | | 85.75 | 80.26 | 82.91 |
ResNet-18 | √ | | | 86.52 | 81.83 | 84.11 |
ResNet-18 | | √ | | 87.08 | 81.41 | 84.14 |
ResNet-18 | √ | √ | | 87.26 | 81.58 | 84.32 |
ResNet-18 | √ | √ | √ | 87.53 | 82.52 | 84.95 |
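The F(%) column in the ablation tables above is the F-measure, i.e., the harmonic mean of precision P and recall R. As a minimal sanity check (the `f_measure` helper is illustrative, not code from the paper), it reproduces the baseline and full-model rows of the ICDAR2015 ablation:

```python
def f_measure(p: float, r: float) -> float:
    """F-measure: harmonic mean of precision p and recall r (both in %)."""
    return 2 * p * r / (p + r)

# Baseline ResNet-18 row: P = 86.12, R = 77.81
print(f"{f_measure(86.12, 77.81):.2f}")  # 81.75, matching the table

# Full model (DSAF + PFFM + SAM): P = 87.65, R = 79.45
print(f"{f_measure(87.65, 79.45):.2f}")  # ~83.35; the table reports 83.34, consistent to rounding
```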
Table 4. Comparison with previous methods on the ICDAR2015 dataset.

Method | P(%) | R(%) | F(%) |
---|---|---|---|
CTPN (Tian et al., 2016 [13]) | 74.2 | 51.6 | 60.9 |
EAST (Zhou et al., 2017 [19]) | 83.6 | 73.9 | 78.2 |
SSTD (He et al., 2017) | 80.2 | 73.9 | 76.9 |
Corner (Lyu et al., 2018) | 94.1 | 70.7 | 80.7 |
RRD (Liao et al., 2018) | 85.6 | 79.0 | 82.2 |
PSE-1s (Wang et al., 2019) | 86.9 | 84.5 | 85.7 |
PAN (Wang et al., 2019 [23]) | 84.0 | 81.9 | 82.9 |
LOMO (Zhang et al., 2019) | 91.3 | 83.5 | 87.2 |
CRAFT (Baek et al., 2019 [26]) | 89.8 | 84.3 | 86.9 |
SAE (Tian et al., 2019) | 88.3 | 85.0 | 86.6 |
SPCNET (Xie et al., 2019 [27]) | 88.7 | 85.8 | 87.2 |
SRPN (He et al., 2020) | 92.0 | 79.7 | 85.4 |
FDTA (Cao et al., 2020) | 81.2 | 89.0 | 84.9 |
FCENet (Zhu et al., 2021) | 90.1 | 82.6 | 86.2 |
STKM (Wan et al., 2021) | 88.7 | 84.8 | 86.7 |
DB-ResNet-18(736) | 86.1 | 77.8 | 81.7 |
Ours-ResNet-18(736) | 87.7 | 79.5 | 83.4 |
DB-ResNet-50(736) | 87.4 | 81.4 | 84.3 |
Ours-ResNet-50(736) | 87.9 | 83.4 | 85.6 |
DB-ResNet-50(1152) | 88.5 | 83.8 | 86.1 |
Ours-ResNet-50(1152) | 89.6 | 84.2 | 86.8 |
Table 5. Comparison with previous methods on the Total-Text dataset.

Method | P(%) | R(%) | F(%) |
---|---|---|---|
TextSnake (Long et al., 2018 [21]) | 82.7 | 74.5 | 78.4 |
MTS (Lyu et al., 2018) | 82.5 | 75.6 | 78.6 |
ATRR (Wang et al., 2019) | 80.9 | 76.2 | 78.5 |
TextField (Xu et al., 2019) | 81.2 | 79.9 | 80.6 |
PAN (Wang et al., 2019 [23]) | 89.3 | 81.0 | 85.0 |
CRAFT (Baek et al., 2019 [26]) | 87.6 | 79.9 | 83.6 |
CSE (Liu et al., 2019) | 81.4 | 79.1 | 80.2 |
PSE-1s (Wang et al., 2019) | 84.0 | 78.0 | 80.9 |
STKM (Wan et al., 2021) | 86.3 | 78.3 | 82.2 |
DB-ResNet-18(800) | 86.7 | 77.5 | 81.5 |
Ours-ResNet-18(800) | 87.4 | 79.3 | 83.2 |
DB-ResNet-50(800) | 86.2 | 80.2 | 83.1 |
Ours-ResNet-50(800) | 88.5 | 82.4 | 85.3 |
Table 6. Comparison with previous methods on the MSRA-TD500 dataset.

Method | P(%) | R(%) | F(%) |
---|---|---|---|
SegLink (Shi et al., 2017 [14]) | 86.0 | 70.0 | 77.0 |
DeepReg (He et al., 2017) | 77.0 | 70.0 | 74.0 |
EAST (Zhou et al., 2017 [19]) | 87.28 | 67.43 | 76.08 |
RRPN (Ma et al., 2018) | 82.0 | 68.0 | 74.0 |
RRD (Liao et al., 2018) | 87.0 | 73.0 | 79.0 |
MCN (Liu et al., 2018) | 88.0 | 79.0 | 83.0 |
PixelLink (Deng et al., 2018) | 83.0 | 73.2 | 77.8 |
Corner (Lyu et al., 2018) | 87.6 | 76.2 | 81.5 |
TextSnake (Long et al., 2018 [21]) | 83.2 | 73.9 | 78.3 |
PAN (Wang et al., 2019 [23]) | 84.4 | 83.8 | 84.1 |
CRAFT (Baek et al., 2019 [26]) | 88.2 | 78.2 | 82.9 |
SAE (Tian et al., 2019) | 84.2 | 81.7 | 82.9 |
SRPN (He et al., 2020) | 84.9 | 77.0 | 80.7 |
DRRG (Zhang et al., 2020) | 88.1 | 82.3 | 85.1 |
FDTA (Cao et al., 2020) | 71.2 | 84.2 | 77.2 |
MOST (He et al., 2021) | 90.4 | 82.7 | 86.4 |
STKM (Wan et al., 2021) | 81.6 | 77.1 | 79.3 |
DB-ResNet-18(736) | 85.7 | 80.2 | 82.9 |
Ours-ResNet-18(736) | 87.5 | 82.5 | 84.9 |
DB-ResNet-50(736) | 88.8 | 82.4 | 85.5 |
Ours-ResNet-50(736) | 89.3 | 85.2 | 87.2 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, N.; Wang, Z.; Huang, Y.; Tian, J.; Li, X.; Xiao, Z. A Multi-Scale Natural Scene Text Detection Method Based on Attention Feature Extraction and Cascade Feature Fusion. Sensors 2024, 24, 3758. https://doi.org/10.3390/s24123758