Enabling High-Level Worker-Centric Semantic Understanding of Onsite Images Using Visual Language Models with Attention Mechanism and Beam Search Strategy
Abstract
1. Introduction
2. Related Work
2.1. Visual-Semantic Information Extraction from Construction Site Images
2.2. Image Captioning
2.3. Overview of Existing Image Captioning Datasets
3. Methodology
3.1. Visual Language Model
3.2. Development of Visual-Language Dataset
1. Image collection
2. Image annotation
3. Preprocessing for datasets
4. Experiments and Results
4.1. Model Training
4.2. Evaluation Metrics
4.3. Analysis of Results
5. Discussion
5.1. Contributions to the Body of Knowledge
5.2. Practical Implications and Application Challenges
- (1) Practical implications: The proposed approach can describe a construction-site activity scene in full sentences. The model uses an attention mechanism to highlight the significant elements of the scene, so managers can quickly scan the generated text and obtain reliable information about the status of construction activities. For instance, by analyzing the frequency of scene words such as “helmet” and spatial relationships such as “under” and “next to”, safety managers can evaluate workers’ safety status (a minimal sketch of this kind of keyword scan follows this list).
- (2) Application challenges: The main challenge in applying the proposed approach is that its accuracy depends on the size and quality of the training dataset and on the effectiveness of the model architecture [67]. Because visual language models are still evolving and building large datasets in the construction industry remains difficult, reaching higher levels of accuracy may prove challenging.
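The keyword analysis described in the practical implications above can be automated with a short post-processing script over the generated captions. The sketch below is only an illustration: the keyword sets (`PPE_WORDS`, `SPATIAL_WORDS`) and the `summarize_captions` helper are assumptions for demonstration, not part of the proposed model.

```python
from collections import Counter

# Illustrative keyword lists (assumptions, not taken from the paper's vocabulary).
PPE_WORDS = {"helmet", "vest", "harness", "gloves"}
SPATIAL_WORDS = {"under", "next to", "on", "near", "behind"}

def summarize_captions(captions):
    """Count PPE mentions and spatial relations across generated captions."""
    ppe_counts = Counter()
    spatial_counts = Counter()
    for caption in captions:
        text = caption.lower()
        for word in PPE_WORDS:
            ppe_counts[word] += text.count(word)
        for phrase in SPATIAL_WORDS:
            spatial_counts[phrase] += text.count(phrase)
    return ppe_counts, spatial_counts

# Example usage with two hypothetical model outputs.
captions = [
    "a worker wearing a helmet is standing next to the scaffold",
    "two workers without helmets are casting concrete under the formwork",
]
ppe, spatial = summarize_captions(captions)
print(ppe.most_common(), spatial.most_common())
```

A safety manager could, for example, flag images whose captions mention workers but contain no PPE words at all.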
5.3. Limitations and Further Work
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Ham, Y.; Kamari, M. Automated content-based filtering for enhanced vision-based documentation in construction toward exploiting big visual data from drones. Autom. Constr. 2019, 105, 102831. [Google Scholar] [CrossRef]
- Xiong, R.; Song, Y.; Li, H.; Wang, Y. Onsite video mining for construction hazards identification with visual relationships. Adv. Eng. Inform. 2019, 42, 100966. [Google Scholar] [CrossRef]
- Slaton, T.; Hernandez, C.; Akhavian, R. Construction activity recognition with convolutional recurrent networks. Autom. Constr. 2020, 113, 103138. [Google Scholar] [CrossRef]
- Harichandran, A.; Raphael, B.; Mukherjee, A. Equipment activity recognition and early fault detection in automated construction through a hybrid machine learning framework. Comput.-Aided Civ. Infrastruct. Eng. 2023, 38, 253–268. [Google Scholar] [CrossRef]
- Zhang, C.; Zhao, Y.; Li, T.; Zhang, X.; Adnouni, M. Generic visual data mining-based framework for revealing abnormal operation patterns in building energy systems. Autom. Constr. 2021, 125, 103624. [Google Scholar] [CrossRef]
- Zhong, B.; Shen, L.; Pan, X.; Lei, L. Visual attention framework for identifying semantic information from construction monitoring video. Saf. Sci. 2023, 163, 106122. [Google Scholar] [CrossRef]
- Hu, N.; Fan, C.; Ming, Y.; Feng, F. MAENet: A novel multi-head association attention enhancement network for completing intra-modal interaction in image captioning. Neurocomputing 2023, 519, 69–81. [Google Scholar] [CrossRef]
- Wu, J.; Cai, N.; Chen, W.; Wang, H.; Wang, G. Automatic detection of hardhats worn by construction personnel: A deep learning approach and benchmark dataset. Autom. Constr. 2019, 106, 102894. [Google Scholar] [CrossRef]
- Nath, N.D.; Behzadan, A.H.; Paal, S.G. Deep learning for site safety: Real-time detection of personal protective equipment. Autom. Constr. 2020, 112, 103085. [Google Scholar] [CrossRef]
- Abdullahi, I.; Chukwuma, N.; Mostafa, N.; Amanda, K.; Ulises, T. Investigating the impact of physical fatigue on construction workers’ situational awareness. Saf. Sci. 2023, 163, 106103. [Google Scholar]
- Yu, Y.; Li, H.; Yang, X.; Kong, L.; Luo, X.; Wong, A.Y.L. An automatic and non-invasive physical fatigue assessment method for construction workers. Autom. Constr. 2019, 103, 1–12. [Google Scholar] [CrossRef]
- Chen, C.; Zhu, Z.; Hammad, A. Automated excavators activity recognition and productivity analysis from construction site surveillance videos. Autom. Constr. 2020, 110, 103045. [Google Scholar] [CrossRef]
- Zhu, C.; Zhu, J.; Bu, T.; Gao, X. Monitoring and Identification of Road Construction Safety Factors via UAV. Sensors 2022, 22, 8797. [Google Scholar] [CrossRef]
- Bang, S.; Kim, H. Context-based information generation for managing UAV-acquired data using image captioning. Autom. Constr. 2020, 112, 103116. [Google Scholar] [CrossRef]
- Olanrewaju, A.; AbdulAziz, A.; Preece, C.N.; Shobowale, K. Evaluation of measures to prevent the spread of COVID-19 on the construction sites. Clean. Eng. Technol. 2021, 5, 100277. [Google Scholar] [CrossRef] [PubMed]
- Essam, N.; Khodeir, L.; Fathy, F. Approaches for BIM-based multi-objective optimization in construction scheduling. Ain Shams Eng. J. 2023, 14, 102114. [Google Scholar] [CrossRef]
- Golovina, O.; Teizer, J.; Johansen, K.W.; König, M. Towards autonomous cloud-based close call data management for construction equipment safety. Autom. Constr. 2021, 132, 103962. [Google Scholar] [CrossRef]
- Liu, H.; Wang, G.; Huang, T.; He, P.; Skitmore, M.; Luo, X. Manifesting construction activity scenes via image captioning. Autom. Constr. 2020, 119, 103334. [Google Scholar] [CrossRef]
- Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015; Volume 37, pp. 2048–2057. [Google Scholar]
- Dash, S.K.; Acharya, S.; Pakray, P.; Das, R.; Gelbukh, A. Topic-Based Image Caption Generation. Arab. J. Sci. Eng. 2020, 45, 3025–3034. [Google Scholar] [CrossRef]
- Suresh, K.R.; Jarapala, A.; Sudeep, P.V. Image Captioning Encoder–Decoder Models Using CNN-RNN Architectures: A Comparative Study. Circuits Syst. Signal Process. 2022, 41, 5719–5742. [Google Scholar] [CrossRef]
- Alsakka, F.; El-Chami, I.; Yu, H.; Al-Hussein, M. Computer vision-based process time data acquisition for offsite construction. Autom. Constr. 2023, 149, 104803. [Google Scholar] [CrossRef]
- Lin, T.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014, Proceedings, Part V; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
- Wang, C.; Gu, X. Learning Double-Level Relationship Networks for image captioning. Inf. Process. Manag. 2023, 60, 103288. [Google Scholar] [CrossRef]
- Young, P.; Lai, A.; Hodosh, M.; Hockenmaier, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2014, 2, 67–78. [Google Scholar] [CrossRef]
- Li, B.; Cheng, K.; Yu, Z. Histogram of Oriented Gradient Based Gist Feature for Building Recognition. Comput. Intell. Neurosci. 2016, 2016, 6749325. [Google Scholar] [CrossRef] [PubMed]
- Eo, Y.D.; Pyeon, M.W.; Kim, S.W.; Kim, J.R.; Han, D.Y. Coregistration of terrestrial lidar points by adaptive scale-invariant feature transformation with constrained geometry. Autom. Constr. 2012, 25, 49–58. [Google Scholar] [CrossRef]
- Li, J.; Wang, Y.; Wang, Y. Visual tracking and learning using speeded up robust features. Pattern Recognit. Lett. 2012, 33, 2094–2101. [Google Scholar] [CrossRef]
- Zhou, Y.; Guo, H.; Ma, L.; Zhang, Z.; Skitmore, M. Image-based onsite object recognition for automatic crane lifting tasks. Autom. Constr. 2021, 123, 103527. [Google Scholar] [CrossRef]
- Li, Y.; Lu, Y.; Chen, J. A deep learning approach for real-time rebar counting on the construction site based on YOLOv3 detector. Autom. Constr. 2021, 124, 103602. [Google Scholar] [CrossRef]
- Kim, D.; Liu, M.; Lee, S.; Kamat, V.R. Remote proximity monitoring between mobile construction resources using camera-mounted UAVs. Autom. Constr. 2019, 99, 168–182. [Google Scholar] [CrossRef]
- Kardovskyi, Y.; Moon, S. Artificial intelligence quality inspection of steel bars installation by integrating mask R-CNN and stereo vision. Autom. Constr. 2021, 130, 103850. [Google Scholar] [CrossRef]
- Chen, S.; Demachi, K. Towards on-site hazards identification of improper use of personal protective equipment using deep learning-based geometric relationships and hierarchical scene graph. Autom. Constr. 2021, 125, 103619. [Google Scholar] [CrossRef]
- Hwang, J.; Lee, K.; Ei Zan, M.M.; Jang, M.; Shin, D.H. Improved Discriminative Object Localization Algorithm for Safety Management of Indoor Construction. Sensors 2023, 23, 3870. [Google Scholar] [CrossRef]
- Wang, X.; Zhu, Z. Vision-based hand signal recognition in construction: A feasibility study. Autom. Constr. 2021, 125, 103625. [Google Scholar] [CrossRef]
- Kim, K.; Cho, Y.K. Effective inertial sensor quantity and locations on a body for deep learning-based worker’s motion recognition. Autom. Constr. 2020, 113, 103126. [Google Scholar] [CrossRef]
- Cheng, M.; Khitam, A.F.K.; Tanto, H.H. Construction worker productivity evaluation using action recognition for foreign labor training and education: A case study of Taiwan. Autom. Constr. 2023, 150, 104809. [Google Scholar] [CrossRef]
- Antwi-Afari, M.F.; Qarout, Y.; Herzallah, R.; Anwer, S.; Umer, W.; Zhang, Y.; Manu, P. Deep learning-based networks for automated recognition and classification of awkward working postures in construction using wearable insole sensor data. Autom. Constr. 2022, 136, 104181. [Google Scholar] [CrossRef]
- Luo, X.; Li, H.; Yang, X.; Yu, Y.; Cao, D. Capturing and Understanding Workers’ Activities in Far-Field Surveillance Videos with Deep Action Recognition and Bayesian Nonparametric Learning. Comput.-Aided Civ. Infrastruct. Eng. 2019, 34, 333–351. [Google Scholar] [CrossRef]
- Luo, X.; Li, H.; Cao, D.; Yu, Y.; Yang, X.; Huang, T. Towards efficient and objective work sampling: Recognizing workers’ activities in site surveillance videos with two-stream convolutional networks. Autom. Constr. 2018, 94, 360–370. [Google Scholar] [CrossRef]
- Yang, J.; Shi, Z.; Wu, Z. Vision-based action recognition of construction workers using dense trajectories. Adv. Eng. Inform. 2016, 30, 327–336. [Google Scholar] [CrossRef]
- Fang, H.; Gupta, S.; Iandola, F.; Srivastava, R.K.; Deng, L.; Dollár, P.; Gao, J.; He, X.; Mitchell, M.; Platt, J.C.; et al. From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015, Boston, MA, USA, 7–12 June 2015; pp. 1473–1482. [Google Scholar]
- Yatskar, M.; Zettlemoyer, L.; Farhadi, A. Situation recognition: Visual semantic role labeling for image understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 5534–5542. [Google Scholar]
- Ushiku, Y.; Yamaguchi, M.; Mukuta, Y.; Harada, T. Common subspace for model and similarity: Phrase learning for caption generation from images. In Proceedings of the IEEE International Conference on Computer Vision 2015, Santiago, Chile, 7–13 December 2015; pp. 2668–2676. [Google Scholar]
- Xu, C.; Yang, M.; Ao, X.; Shen, Y.; Xu, R.; Tian, J. Retrieval-enhanced adversarial training with dynamic memory-augmented attention for image paragraph captioning. Knowl.-Based Syst. 2021, 214, 106730. [Google Scholar] [CrossRef]
- Du, Y.; Liu, Z.; Li, J.; Zhao, W.X. A survey of vision-language pre-trained models. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22), Vienna, Austria, 23–29 July 2022. [Google Scholar] [CrossRef]
- Zhang, L.; Wang, J.; Wang, Y.; Sun, H.; Zhao, X. Automatic construction site hazard identification integrating construction scene graphs with BERT based domain knowledge. Autom. Constr. 2022, 142, 104535. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
- De Curtò, J.; De Zarzà, I.; Calafate, C.T. Semantic Scene Understanding with Large Language Models on Unmanned Aerial Vehicles. Drones 2023, 7, 114. [Google Scholar] [CrossRef]
- Tsai, W.-L.; Le, P.-L.; Ho, W.-F.; Chi, N.-W.; Lin, J.J.; Tang, S.; Hsieh, S.-H. Construction Safety Inspection with Contrastive Language-Image Pre-Training (CLIP) Image Captioning and Attention. Autom. Constr. 2025, 169, 105863. [Google Scholar] [CrossRef]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
- Dinh, N.N.H.; Shin, H.; Ahn, Y.; Oo, B.L.; Lim, B.T.H. Attention-Based Image Captioning for Structural Health Assessment of Apartment Buildings. Autom. Constr. 2024, 167, 105677. [Google Scholar] [CrossRef]
- Tu, Y.; Zhou, C.; Guo, J.; Li, H.; Gao, S.; Yu, Z. Relation-aware attention for video captioning via graph learning. Pattern Recognit. 2023, 136, 109204. [Google Scholar] [CrossRef]
- Li, P.; Gai, S. Single image deraining using multi-scales context information and attention network. J. Vis. Commun. Image Represent. 2023, 90, 103695. [Google Scholar] [CrossRef]
- Dubey, S.; Olimov, F.; Rafique, M.A.; Kim, J.; Jeon, M. Label-attention transformer with geometrically coherent objects for image captioning. Inf. Sci. 2023, 623, 812–831. [Google Scholar] [CrossRef]
- Zhai, P.C.; Wang, J.J.; Zhang, L.T. Extracting Worker Unsafe Behaviors from Construction Images Using Image Captioning with Deep Learning-Based Attention Mechanism. J. Constr. Eng. Manag. 2023, 149, 04022164. [Google Scholar] [CrossRef]
- Huang, L.; Zhao, K.; Ma, M. When to Finish? Optimal Beam Search for Neural Text Generation (modulo Beam Size). In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 2134–2139. [Google Scholar]
- Freitag, M.; Al-Onaizan, Y. Beam Search Strategies for Neural Machine Translation. In Proceedings of the First Workshop on Neural Machine Translation, Vancouver, BC, Canada, 3–4 July 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 56–60. [Google Scholar]
- Wang, Y.; Xiao, B.; Bouferguene, A.; Al-Hussein, M.; Li, H. Vision-based method for semantic information extraction in construction by integrating deep learning object detection and image captioning. Adv. Eng. Inform. 2022, 53, 101699. [Google Scholar] [CrossRef]
- Duan, R.; Deng, H.; Tian, M.; Deng, Y.; Lin, J. SODA: A large-scale open site object detection dataset for deep learning in construction. Autom. Constr. 2022, 142, 104499. [Google Scholar] [CrossRef]
- Niu, Z.; Zhong, G.; Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 2021, 452, 48–62. [Google Scholar] [CrossRef]
- Li, J.; Monroe, W.; Jurafsky, D. A simple, fast diverse decoding algorithm for neural generation. arXiv 2016, arXiv:1611.08562. [Google Scholar]
- Bhatti, S.S.; Gao, X.; Chen, G. General framework, opportunities and challenges for crowdsourcing techniques: A comprehensive survey. J. Syst. Softw. 2020, 167, 110611. [Google Scholar] [CrossRef]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics 2002, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
- Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015, Boston, MA, USA, 7–12 June 2015; pp. 4566–4575. [Google Scholar]
- Lin, C. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
- Sun, Y.; Gu, Z. Using computer vision to recognize construction material: A Trustworthy Dataset Perspective. Resour. Conserv. Recycl. 2022, 183, 106362. [Google Scholar] [CrossRef]
| Scenes | Image Count | Caption Count |
|---|---|---|
| Reinforcement work | 380 | 1900 |
| Scaffold work | 785 | 3925 |
| Concrete casting work | 359 | 1795 |
| Formwork | 399 | 1995 |
| Handcart work | 129 | 645 |
| Machine work | 184 | 920 |
| Bricklaying and plastering work | 399 | 1995 |
| Ladder work | 77 | 385 |
| Excavation and earthmoving work | 279 | 1395 |
| Leveling ground work | 459 | 2295 |
| Transport work | 265 | 1325 |
| Mechanical hoisting work | 338 | 1690 |
| Personnel hoisting work | 2193 | 10,965 |
| Commanding work | 167 | 835 |
| Surveying work | 109 | 545 |
| Stand, rest, walk | 478 | 2390 |
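Each image in the table above is paired with five captions. A minimal sketch of one way such image-caption records could be organized for training is shown below; the JSON layout and field names (`file_name`, `scene`, `captions`) are assumptions for illustration, not the dataset's published format.

```python
import json

# Hypothetical layout: one record per image, five reference captions each.
records = [
    {
        "file_name": "img_000001.jpg",       # illustrative path
        "scene": "Scaffold work",            # one of the 16 scene categories
        "captions": [
            "a worker wearing a helmet is erecting a scaffold",
            # ... four more reference captions per image
        ],
    },
]

# Write the annotation file used by a data loader during training.
with open("captions_train.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```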
| Dataset | Image Count | Caption Count | Scenes | Year |
|---|---|---|---|---|
| SODA-ktsh | 14,000 | 70,000 | 16 | 2023 |
| Huan [13] | 7382 | 36,910 | 5 | 2020 |
| Bang [14] | 1431 | 8601 | 1 | 2020 |
| ACID-C [15] | 6000 | 18,000 | 1 | 2022 |
| Epoch | Batch Time | Loss | Top-5 Accuracy | BLEU-4 |
|---|---|---|---|---|
| 10 | 0.308 | 1.523 | 97.277% | 0.639 |
| 20 | 0.308 | 1.490 | 97.769% | 0.656 |
| 30 | 0.310 | 1.487 | 97.753% | 0.657 |
| 40 | 0.308 | 1.482 | 97.775% | 0.660 |
| 50 | 0.277 | 1.468 | 97.957% | 0.670 |
| 60 | 0.309 | 1.455 | 98.010% | 0.682 |
| 70 | 0.306 | 1.452 | 98.190% | 0.695 |
| 80 | 0.301 | 1.438 | 98.305% | 0.702 |
| 90 | 0.298 | 1.439 | 98.233% | 0.702 |
| 100 | 0.302 | 1.430 | 98.260% | 0.709 |
| 110 | 0.309 | 1.438 | 98.365% | 0.698 |
| 120 | 0.311 | 1.418 | 98.531% | 0.720 |
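The BLEU-4 values in the table are computed by comparing generated captions against their human-written references. As an illustration only, the snippet below scores a single hypothetical caption against two reference captions with NLTK's `sentence_bleu`; this is a sketch of the metric itself, not the authors' evaluation pipeline.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# References: human-written captions for one image (tokenized); examples are made up.
references = [
    "a worker wearing a helmet is tying rebar on the floor".split(),
    "a worker in a safety helmet is binding reinforcement bars".split(),
]
# Hypothesis: the caption generated by the model for the same image.
hypothesis = "a worker wearing a helmet is tying rebar".split()

# BLEU-4: equal weights over 1- to 4-gram precisions, smoothed for short captions.
score = sentence_bleu(
    references,
    hypothesis,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {score:.3f}")
```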
| Beam Size | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | ROUGE_L |
|---|---|---|---|---|---|---|
| 1 | 0.8087 | 0.7788 | 0.7569 | 0.7365 | 4.9398 | 0.8075 |
| 2 | 0.8117 | 0.7817 | 0.7593 | 0.7383 | 4.9398 | 0.8075 |
| 3 | 0.8155 | 0.7865 | 0.7650 | 0.7448 | 4.9719 | 0.8093 |
| 4 | 0.8156 | 0.7869 | 0.7654 | 0.7452 | 4.9908 | 0.8101 |
| 5 | 0.8161 | 0.7872 | 0.7661 | 0.7464 | 5.0255 | 0.8106 |
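The rows above differ only in the beam size used at decoding time, not in the trained weights. The sketch below shows a generic beam search over a step-wise caption decoder; `decoder_step`, the start/end tokens, and the scoring-by-summed-log-probability are assumptions standing in for the trained model's actual decoding code.

```python
import heapq

def beam_search(decoder_step, start_token, end_token, beam_size=3, max_len=20):
    """Generic beam search over a step-wise decoder.

    decoder_step(tokens) must return a list of (next_token, log_prob) pairs;
    here it is a hypothetical stand-in for the trained captioning model.
    """
    # Each beam entry: (cumulative log-probability, token sequence)
    beams = [(0.0, [start_token])]
    completed = []
    for _ in range(max_len):
        candidates = []
        for log_prob, tokens in beams:
            if tokens[-1] == end_token:
                completed.append((log_prob, tokens))
                continue
            for next_token, step_lp in decoder_step(tokens):
                candidates.append((log_prob + step_lp, tokens + [next_token]))
        if not candidates:
            break
        # Keep only the top `beam_size` partial captions at each step.
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    completed.extend(beams)
    return max(completed, key=lambda c: c[0])[1]
```

Larger beams keep more partial captions alive at every step before committing to a sentence, which is consistent with the small but steady metric gains from beam size 1 to 5 reported in the table.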