Style-Enhanced Transformer for Image Captioning in Construction Scenes
Abstract
1. Introduction
2. Related Work
2.1. Transformer-Based Models
2.2. Model Application in Construction Scenes
3. Model
3.1. Feature Extraction
3.2. Encoder
3.2.1. Style Encoder
3.2.2. Semantic Encoder
3.3. Decoder
4. Loss Function
5. Dataset
5.1. Image Captioning Dataset
5.2. Data Augmentation
6. Experiments and Results
6.1. Parameter Settings
6.2. Pre-Training
6.3. Evaluation Metrics
6.4. Experimental Results
6.4.1. Experimental Results on MOCS
6.4.2. Experimental Results on MSCOCO
6.4.3. Ablation Study
6.4.4. Visualization Analysis
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Kulkarni, G.; Premraj, V.; Ordonez, V.; Dhar, S.; Li, S.; Choi, Y.; Berg, A.C.; Berg, T.L. BabyTalk: Understanding and generating simple image descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2891–2903.
2. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164.
3. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 2048–2057.
4. Liu, M.; Li, L.; Hu, H.; Guan, W.; Tian, J. Image caption generation with dual attention mechanism. Inf. Process. Manag. 2020, 57, 102178.
5. Zhang, X.; Sun, X.; Luo, Y.; Ji, J.; Zhou, Y.; Wu, Y.; Huang, F.; Ji, R. RSTNet: Captioning with adaptive attention on visual and non-visual words. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15465–15474.
6. Feng, Y.; Ma, L.; Liu, W.; Luo, J. Unsupervised image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4125–4134.
7. Laina, I.; Rupprecht, C.; Navab, N. Towards unsupervised image captioning with shared multimodal embeddings. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7414–7424.
8. Guo, L.; Liu, J.; Yao, P.; Li, J.; Lu, H. MSCap: Multi-style image captioning with unpaired stylized text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4204–4213.
9. Deng, C.; Ding, N.; Tan, M.; Wu, Q. Length-controllable image captioning. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XIII; Springer: Berlin/Heidelberg, Germany, 2020; pp. 712–729.
10. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
11. Kim, K.; Kim, H.; Kim, H. Image-based construction hazard avoidance system using augmented reality in wearable device. Autom. Constr. 2017, 83, 390–403.
12. Fang, W.; Ding, L.; Zhong, B.; Love, P.E.; Luo, H. Automated detection of workers and heavy equipment on construction sites: A convolutional neural network approach. Adv. Eng. Inform. 2018, 37, 139–149.
13. Kolar, Z.; Chen, H.; Luo, X. Transfer learning and deep convolutional neural networks for safety guardrail detection in 2D images. Autom. Constr. 2018, 89, 58–70.
14. Mneymneh, B.E.; Abbas, M.; Khoury, H. Vision-based framework for intelligent monitoring of hardhat wearing on construction sites. J. Comput. Civ. Eng. 2019, 33, 04018066.
15. Bang, S.; Kim, H. Context-based information generation for managing UAV-acquired data using image captioning. Autom. Constr. 2020, 112, 103116.
16. Liu, H.; Wang, G.; Huang, T.; He, P.; Skitmore, M.; Luo, X. Manifesting construction activity scenes via image captioning. Autom. Constr. 2020, 119, 103334.
17. Han, S.H.; Choi, H.J. Domain-specific image caption generator with semantic ontology. In Proceedings of the 2020 IEEE International Conference on Big Data and Smart Computing (BigComp), Busan, Republic of Korea, 19–22 February 2020; pp. 526–530.
18. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086.
19. Pan, Y.; Yao, T.; Li, Y.; Mei, T. X-Linear attention networks for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10971–10980.
20. Kim, D.J.; Oh, T.H.; Choi, J.; Kweon, I.S. Semi-supervised image captioning by adversarially propagating labeled data. arXiv 2023, arXiv:2301.11174.
21. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
22. Huang, L.; Wang, W.; Chen, J.; Wei, X.Y. Attention on attention for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4634–4643.
23. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
24. An, X.; Zhou, L.; Liu, Z.; Wang, C.; Li, P.; Li, Z. Dataset and benchmark for detecting moving objects in construction sites. Autom. Constr. 2021, 122, 103482.
25. Cornia, M.; Stefanini, M.; Baraldi, L.; Cucchiara, R. Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10578–10587.
26. Cao, S.; An, G.; Zheng, Z.; Wang, Z. Vision-enhanced and consensus-aware transformer for image captioning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 7005–7018.
27. Ma, Y.; Ji, J.; Sun, X.; Zhou, Y.; Ji, R. Towards local visual modeling for image captioning. Pattern Recognit. 2023, 138, 109420.
28. Wang, N.; Xie, J.; Wu, J.; Jia, M.; Li, L. Controllable image captioning via prompting. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 2617–2625.
29. Vo, D.M.; Luong, Q.A.; Sugimoto, A.; Nakayama, H. A-CAP: Anticipation captioning with commonsense knowledge. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 10824–10833.
30. Yang, Z.; Liu, Q.; Liu, G. Better understanding: Stylized image captioning with style attention and adversarial training. Symmetry 2020, 12, 1978.
31. Nong, Y.; Wang, J.; Chen, H.; Sun, W.; Geng, H.; Li, S. An image caption method of construction scene based on attention mechanism and encoding-decoding architecture. J. Zhejiang Univ. Eng. Sci. 2022, 56, 236.
32. Jiang, H.; Misra, I.; Rohrbach, M.; Learned-Miller, E.; Chen, X. In defense of grid features for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10267–10276.
33. Ji, J.; Luo, Y.; Sun, X.; Chen, F.; Luo, G.; Wu, Y.; Gao, Y.; Ji, R. Improving image captioning by leveraging intra- and inter-layer global representation in transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 1655–1663.
34. Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7008–7024.
35. Karpathy, A.; Li, F.F. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3128–3137.
36. Min, J.; McCoy, R.T.; Das, D.; Pitler, E.; Linzen, T. Syntactic data augmentation increases robustness to inference heuristics. arXiv 2020, arXiv:2004.11999.
37. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
38. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318.
39. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72.
40. Lin, C.Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81.
41. Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4566–4575.
42. Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. SPICE: Semantic propositional image caption evaluation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 382–398.
43. Fang, Z.; Wang, J.; Hu, X.; Liang, L.; Gan, Z.; Wang, L.; Yang, Y.; Liu, Z. Injecting semantic concepts into end-to-end image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18009–18019.
Comparison with baseline models on MOCS. The first six metric columns report results under cross-entropy loss, the last six after CIDEr-D optimization (B@1/B@4 = BLEU-1/BLEU-4, M = METEOR, R = ROUGE-L, C = CIDEr, S = SPICE).

| Models | B@1 | B@4 | M | R | C | S | B@1 | B@4 | M | R | C | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Up-Down [18] | 34.6 | 16.8 | 13.6 | 25.5 | 53.8 | 9.4 | 35.6 | 16.8 | 13.9 | 25.7 | 56.9 | 9.9 |
| AoANet [22] | 34.9 | 17.2 | 14.3 | 26.8 | 56.3 | 9.8 | 36.0 | 17.9 | 14.7 | 27.4 | 61.0 | 10.3 |
| X-LAN [19] | 35.6 | 17.7 | 14.5 | 27.2 | 57.4 | 10.1 | 36.7 | 18.3 | 14.9 | 27.8 | 62.1 | 10.8 |
| ViTCAP [43] | 36.3 | 17.9 | 14.5 | 27.4 | 58.1 | 11.1 | 37.5 | 18.8 | 14.8 | 28.3 | 63.4 | 11.6 |
Comparison with baseline models on MSCOCO, with the same column layout as the MOCS table above; "-" indicates a score not reported.

| Models | B@1 | B@4 | M | R | C | S | B@1 | B@4 | M | R | C | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Up-Down [18] | 77.2 | 36.2 | 27.0 | 56.4 | 113.5 | 20.3 | 79.8 | 36.3 | 27.7 | 56.9 | 120.1 | 21.4 |
| AoANet [22] | 77.4 | 37.2 | 28.4 | 57.5 | 119.8 | 21.3 | 80.2 | 38.9 | 29.2 | 58.8 | 129.8 | 22.4 |
| X-LAN [19] | 78.0 | 38.2 | 28.8 | 58.0 | 122.0 | 21.9 | 80.8 | 39.5 | 29.5 | 59.2 | 132.0 | 23.4 |
| ViTCAP [43] | - | 35.7 | 28.8 | 57.6 | 121.8 | 22.1 | - | 40.1 | 29.4 | 59.4 | 133.1 | 23.0 |
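Scores of the kind reported above follow the standard COCO caption evaluation protocol [38,39,40,41,42]. A minimal sketch of how such numbers are computed with the pycocoevalcap toolkit is given below; the image IDs and caption strings are illustrative assumptions, not examples from the datasets.

```python
# Minimal sketch: scoring candidate captions against references with
# pycocoevalcap (pip install pycocoevalcap). METEOR and SPICE additionally
# require a Java runtime, so only the pure-Python metrics are shown.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# Hypothetical pre-tokenized captions keyed by image ID: each value is a
# list of whitespace-tokenized strings (references vs. one candidate).
gts = {0: ["a worker in a hardhat operates an excavator",
           "an excavator is working on a construction site"]}
res = {0: ["a worker operates an excavator on a construction site"]}

bleu, _ = Bleu(4).compute_score(gts, res)   # [B@1, B@2, B@3, B@4]
rouge, _ = Rouge().compute_score(gts, res)  # ROUGE-L
cider, _ = Cider().compute_score(gts, res)  # CIDEr
print(f"B@1={bleu[0]:.3f} B@4={bleu[3]:.3f} R={rouge:.3f} C={cider:.3f}")
```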
Ablation study: ✓/✗ indicate whether the SF and SSL components are enabled.

| SF | SSL | B@1 | B@4 | M | R | C | S |
|---|---|---|---|---|---|---|---|
| ✗ | ✗ | 35.3 | 17.8 | 14.2 | 26.5 | 61.8 | 11.3 |
| ✗ | ✓ | 35.7 | 17.9 | 14.3 | 26.8 | 62.5 | 11.4 |
| ✓ | ✗ | 36.9 | 18.5 | 14.8 | 27.6 | 64.5 | 11.8 |
| ✓ | ✓ | 38.4 | 19.3 | 15.4 | 28.8 | 67.2 | 12.3 |
Effect of the number of layers.

| Layers | B@1 | B@4 | M | R | C | S |
|---|---|---|---|---|---|---|
| 1 | 36.4 | 18.1 | 14.7 | 27.3 | 63.6 | 11.6 |
| 2 | 37.1 | 18.8 | 15.0 | 28.0 | 65.1 | 12.0 |
| 3 | 38.3 | 19.3 | 15.4 | 28.8 | 67.2 | 12.3 |
| 4 | 38.4 | 19.3 | 15.2 | 28.8 | 67.2 | 12.4 |
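The ablation shows diminishing returns beyond three layers. As a hedged illustration of what varying this depth hyperparameter involves, here is a generic stacked-encoder sketch in PyTorch; the dimensions and the use of PyTorch's built-in TransformerEncoder are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

# Illustrative encoder stack; d_model=512 and nhead=8 are assumed values.
num_layers = 3  # best-performing depth in the ablation above
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

feats = torch.randn(2, 49, 512)  # e.g., a 7x7 grid of image features per image
refined = encoder(feats)         # (2, 49, 512) refined visual features
```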
Effect of the window size (WS) and shift size (SS) configuration.

| Configuration | B@1 | B@4 | M | R | C | S |
|---|---|---|---|---|---|---|
| WS = 12, SS = 0 | 37.2 | 18.7 | 15.1 | 27.8 | 65.3 | 11.9 |
| WS = 6, SS = 3 | 38.3 | 19.3 | 15.4 | 28.8 | 67.2 | 12.3 |
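The WS/SS setting mirrors the shifted-window scheme of the Swin Transformer [23]: attention is computed inside local windows of size WS, and alternating layers cyclically shift the feature map by SS so information can cross window borders. A minimal sketch of the partitioning step follows, with illustrative tensor shapes.

```python
import torch

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into non-overlapping ws x ws windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

ws, ss = 6, 3                         # the better configuration in the table
feat = torch.randn(1, 12, 12, 256)    # hypothetical 12x12 grid of 256-d features
shifted = torch.roll(feat, shifts=(-ss, -ss), dims=(1, 2))  # cyclic shift
windows = window_partition(shifted, ws)  # (4, 36, 256): inputs to window attention
```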
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Cite as: Song, K.; Chen, L.; Wang, H. Style-Enhanced Transformer for Image Captioning in Construction Scenes. Entropy 2024, 26, 224. https://doi.org/10.3390/e26030224