Deep Learning and Text Mining: Classifying and Extracting Key Information from Construction Accident Narratives
Abstract
1. Introduction
2. Materials and Methods
2.1. Data Set
2.2. Data Preprocessing
2.3. Automatic Classification of CNN
2.3.1. Word Embedding
2.3.2. CNN-Based Text Classification Model
2.4. Key Information Extraction and Word Cloud Visualization
3. Results
3.1. Configuration
3.2. Automatic Category Classification
3.2.1. Parameter Setting
3.2.2. Evaluation Indicators
3.2.3. Model Testing and Evaluation
- (1) The Word2vec model achieves an average accuracy of 84%, only slightly different from that of the One-hot encoding model. Construction accident texts often lack complete contextual information, which leads to sparse text features. Word2vec training, however, captures various types of relationships between words and provides more contextual information for similar words, so we expect the Word2vec model to handle such texts better.
- (2) In terms of F1 score, the One-hot encoding model performs on par with Word2vec in the “falling” category (0.91 for both). In every other category, Word2vec outperforms One-hot encoding.
- (3) Overall, the CNN model that uses Word2vec as word embedding classifies construction accident narrative texts better than the CNN model that uses One-hot encoding.
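The contrast drawn in the points above can be illustrated with a minimal sketch. One-hot vectors of distinct words are always orthogonal, so they carry no notion of similarity, whereas dense embeddings can place related words close together. The three-word vocabulary and the two-dimensional dense vectors below are hand-picked assumptions for illustration only; they are not trained Word2vec embeddings and do not come from the paper's corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Hypothetical vocabulary (illustrative, not taken from the paper's data).
vocab = ["scaffolding", "formwork", "switch"]
one_hot_vecs = {w: one_hot(i, len(vocab)) for i, w in enumerate(vocab)}

# Hand-picked dense vectors standing in for trained embeddings (assumption).
dense_vecs = {
    "scaffolding": [0.9, 0.1],
    "formwork": [0.8, 0.3],
    "switch": [0.1, 0.9],
}

# Distinct one-hot vectors never overlap, so their similarity is always 0.
sim_one_hot = cosine(one_hot_vecs["scaffolding"], one_hot_vecs["formwork"])

# Dense vectors can express that two words are more related than two others.
sim_dense_related = cosine(dense_vecs["scaffolding"], dense_vecs["formwork"])
sim_dense_unrelated = cosine(dense_vecs["scaffolding"], dense_vecs["switch"])
```

In this toy setting the one-hot similarity is exactly zero, while the dense vectors rank "scaffolding" closer to "formwork" than to "switch". A trained Word2vec model learns such geometry from co-occurrence in the corpus rather than from hand-picked values.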
3.3. Key Information Mining and Word Cloud Visualization
3.3.1. Key Information Mining
3.3.2. Word Cloud Visualization
4. Discussion
5. Conclusions
- (1) We proposed a CNN-based text classification model that automatically classifies texts describing five types of accidents: electrocution, falling, object strikes, collapse of objects, and mechanical injury. The overall accuracy of the model reaches 84%. The CNN model classifies construction accident narrative texts more accurately than the three shallow machine learning models used for comparison, which indicates that it is well suited to handling construction accident narratives.
- (2) Using the classified fall accident texts as an example, we applied the TF-IDF algorithm for text mining and extracted the 20 highest-weighted representative accident features. Analysis of these features reveals eight key accident locations and three operations that are particularly prone to accidents. Presenting this information as a word cloud gives managers a clear guide for on-site safety management and helps prevent similar accidents.
- (3) Our approach, which combines deep learning and text mining, quickly identified different accident types and extracted key information from Chinese accident texts. This research can help managers analyze and understand accident narratives, and it offers direction for post-accident emergency response and prevention measures. It also introduces new perspectives and methods into construction safety management.
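For reference, the TF-IDF weighting mentioned in point (2) is commonly defined as follows. This is the textbook form, given here as an assumption; the exact variant (smoothing, normalization) is not stated in this section. Here $n_{t,d}$ is the number of occurrences of term $t$ in document $d$, and $N$ is the number of documents.

```latex
\[
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{\lvert\{d : t \in d\}\rvert}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
\]
```

A term scores highly when it is frequent within a narrative but appears in few narratives overall, which is why words shared by nearly every report receive weights close to zero.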
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
Keyword | TF-IDF | Keyword | TF-IDF |
---|---|---|---|
Dismantling | 1.06 × 10⁻² | Elevator | 0.72 × 10⁻² |
Falling | 1.02 × 10⁻² | Formwork | 0.60 × 10⁻² |
Head | 1.01 × 10⁻² | Dropping | 0.58 × 10⁻² |
Death | 0.99 × 10⁻² | Pipes | 0.57 × 10⁻² |
Ground | 0.98 × 10⁻² | Hopper | 0.53 × 10⁻² |
Concrete | 0.91 × 10⁻² | Exterior walls | 0.46 × 10⁻² |
Scaffolding | 0.90 × 10⁻² | Dislodged | 0.45 × 10⁻² |
Pump truck | 0.88 × 10⁻² | Pouring | 0.44 × 10⁻² |
Tower crane | 0.78 × 10⁻² | Fasteners | 0.43 × 10⁻² |
Wire rope | 0.74 × 10⁻² | Cutting | 0.43 × 10⁻² |
Keyword | TF-IDF | Keyword | TF-IDF |
---|---|---|---|
Scaffolding | 1.45 × 10⁻² | Punching | 0.46 × 10⁻² |
Electrocution | 1.04 × 10⁻² | Switch | 0.45 × 10⁻² |
Crane | 0.75 × 10⁻² | Contact | 0.44 × 10⁻² |
Wires | 0.71 × 10⁻² | Fire protection | 0.44 × 10⁻² |
Concrete | 0.68 × 10⁻² | Demolition | 0.43 × 10⁻² |
High-voltage lines | 0.64 × 10⁻² | Restrooms | 0.43 × 10⁻² |
Ceiling | 0.57 × 10⁻² | Electric shock | 0.41 × 10⁻² |
Wiring | 0.55 × 10⁻² | Lighting | 0.40 × 10⁻² |
Power supply | 0.54 × 10⁻² | Submersible pumps | 0.39 × 10⁻² |
Ground | 0.51 × 10⁻² | Cleaning | 0.38 × 10⁻² |
Keyword | TF-IDF | Keyword | TF-IDF |
---|---|---|---|
Scaffolding | 1.62 × 10⁻² | Earthwork | 0.67 × 10⁻² |
Collapse | 1.56 × 10⁻² | Burial | 0.66 × 10⁻² |
Walls | 1.51 × 10⁻² | Clearance | 0.65 × 10⁻² |
Trench | 1.35 × 10⁻² | Excavation | 0.65 × 10⁻² |
Pouring | 1.32 × 10⁻² | Ground | 0.60 × 10⁻² |
Concrete | 1.18 × 10⁻² | Falling | 0.55 × 10⁻² |
Demolition | 1.17 × 10⁻² | Formwork | 0.54 × 10⁻² |
Pits | 1.15 × 10⁻² | Excavator | 0.53 × 10⁻² |
Pipes | 0.90 × 10⁻² | Fence | 0.53 × 10⁻² |
Collapse | 0.75 × 10⁻² | Slope | 0.50 × 10⁻² |
Keyword | TF-IDF | Keyword | TF-IDF |
---|---|---|---|
Mixing | 1.08 × 10⁻² | Head rope | 0.60 × 10⁻² |
Reinforcing | 1.06 × 10⁻² | Conveyor belt | 0.60 × 10⁻² |
Mixer | 0.91 × 10⁻² | Belt | 0.59 × 10⁻² |
Concrete | 0.87 × 10⁻² | Pumper | 0.57 × 10⁻² |
Drill | 0.83 × 10⁻² | Switch | 0.56 × 10⁻² |
Equipment | 0.83 × 10⁻² | Power | 0.55 × 10⁻² |
Death | 0.76 × 10⁻² | Head | 0.54 × 10⁻² |
Winches | 0.70 × 10⁻² | Body | 0.53 × 10⁻² |
Operation | 0.69 × 10⁻² | Shutdown | 0.53 × 10⁻² |
Drill pipe | 0.66 × 10⁻² | Scene | 0.53 × 10⁻² |
Appendix B
Type of Accidents | Quantity |
---|---|
Object strikes | 221 |
Electrocution | 201 |
Collapse of objects | 374 |
Falling | 771 |
Mechanical injury | 160 |
Regions | Quantity | Regions | Quantity |
---|---|---|---|
Beijing | 145 | Fujian | 47 |
Tianjin | 34 | Jiangxi | 29 |
Shanghai | 126 | Shandong | 73 |
Chongqing | 72 | Henan | 62 |
Shanxi | 25 | Hubei | 92 |
Jiangsu | 307 | Hunan | 67 |
Zhejiang | 107 | Guangdong | 137 |
Anhui | 153 | Sichuan | 125 |
Guizhou | 14 | Yunnan | 26 |
Guangxi | 73 | Shaanxi | 13 |
Pseudocode for CNN-Based Text Classification Models
1: Load CSV file
    df = pd.read_csv('accidents.csv')
2: Split dataset into train and test sets
    train_data, test_data = splitData(df)
3: Preprocess data
    trainLabel, testLabel = preprocessLabels(train_data, test_data)
    trainCut, testCut = preprocessText(train_data, test_data)
    saveWordData(trainCut + testCut)
4: Train Word2Vec model
    w2v_model = trainWord2Vec(loadWordData())
5: Tokenize and pad sequences
    tokenizer = Tokenizer()
    vocab = tokenizeAndPreprocess(loadWordData(), tokenizer)
    trainSeq, testSeq = convertAndPad(tokenizer, trainCut, testCut)
6: Encode labels
    trainCate, testCate = encodeOneHot(trainLabel, testLabel)
7: Build and train CNN model
    model = buildAndTrainCNNModel(vocab, w2v_model, trainSeq, trainCate)
8: Evaluate and visualize model
    evaluateAndVisualize(model, testSeq, testCate)
9: Generate reports
    generateReports(model, testSeq, testCate)
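Steps 5 and 6 of the pseudocode above (converting segmented text to padded integer sequences and one-hot encoding the labels) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the helper names `build_vocab`, `to_padded_sequences`, and `encode_one_hot`, the toy narratives, the sequence length of 5, and the choice of right-padding are all assumptions. Only the five class names come from the paper.

```python
def build_vocab(token_lists):
    """Map each distinct token to a positive integer id; 0 is reserved for padding."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab


def to_padded_sequences(token_lists, vocab, max_len):
    """Convert tokens to ids, truncate to max_len, and right-pad with 0.

    Unknown tokens are also mapped to 0 here, a simplification for this sketch.
    """
    padded = []
    for tokens in token_lists:
        ids = [vocab.get(tok, 0) for tok in tokens][:max_len]
        padded.append(ids + [0] * (max_len - len(ids)))
    return padded


def encode_one_hot(labels, classes):
    """Turn each label into a 0/1 vector with a single 1 at the class position."""
    index = {c: i for i, c in enumerate(classes)}
    return [[1 if i == index[label] else 0 for i in range(len(classes))]
            for label in labels]


# The five accident categories used in the paper.
CLASSES = ["Object strikes", "Electrocution", "Collapse of objects",
           "Falling", "Mechanical injury"]

# Hypothetical, already segmented narratives (illustrative only).
train_cut = [["worker", "fell", "scaffolding"], ["worker", "touched", "wire"]]
train_label = ["Falling", "Electrocution"]

vocab = build_vocab(train_cut)
train_seq = to_padded_sequences(train_cut, vocab, max_len=5)
train_cate = encode_one_hot(train_label, CLASSES)
```

Fixed-length integer sequences are what an embedding layer expects as input, and the one-hot label vectors match a five-way softmax output. Which side to pad on and how to treat out-of-vocabulary tokens are design choices that the pseudocode leaves open.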
Pseudocode for TF-IDF Algorithm to Extract Key Information
1: Load CSV file
    df = pd.read_csv('fall.txt')
2: Load stopwords
    stopwords = load_stopwords('stopwords.txt')
3: Preprocess descriptions
    df['description'] = df['description'].apply(lambda text: preprocess(text, stopwords))
4: Apply TF-IDF algorithm
    vectorizer = create_vectorizer(max_features=1000)
    tfidf = vectorizer.fit_transform(df['description'])
    feature_names = vectorizer.get_feature_names_out()
    tfidf_scores = calculate_tfidf_scores(tfidf)
5: Normalize and print top keywords
    normalized_tfidf = normalize_scores(tfidf_scores)
    print_normalized_keywords(normalized_tfidf)
6: Create and display word cloud
    wordcloud = create_wordcloud(normalized_tfidf)
    display_wordcloud(wordcloud)
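Steps 4 and 5 of the pseudocode above can be sketched without any third-party library. The function below is a minimal illustration under stated assumptions: it uses the unsmoothed textbook TF-IDF, averages each term's score over all documents, and normalizes the averages to sum to 1. The paper does not specify its aggregation or normalization, so `tfidf_keywords` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k):
    """Return the top_k (term, weight) pairs ranked by mean TF-IDF.

    Assumes every document is a non-empty token list and that at least one
    term does not occur in every document (otherwise all weights are zero).
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    # Sum of per-document TF-IDF scores for each term.
    totals = Counter()
    for tokens in docs:
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            totals[term] += tf * idf

    # Average over documents, then normalize so all weights sum to 1.
    mean_scores = {term: score / n_docs for term, score in totals.items()}
    norm = sum(mean_scores.values())
    normalized = {term: score / norm for term, score in mean_scores.items()}

    # Sort by descending weight; ties are broken alphabetically.
    return sorted(normalized.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical, already segmented and stopword-filtered narratives.
docs = [
    ["worker", "fell", "scaffolding", "ground"],
    ["worker", "fell", "roof", "ground"],
    ["worker", "fell", "elevator", "shaft"],
]
top = tfidf_keywords(docs, top_k=3)
```

In this toy corpus "worker" and "fell" occur in every document and receive a weight of zero, while terms unique to one narrative rank highest. The resulting term-to-weight mapping is the kind of input a word cloud generator typically takes to size each word.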
Label | CNN-Word2vec Precision | CNN-Word2vec Recall | CNN-Word2vec F1 Score | CNN-One-Hot Precision | CNN-One-Hot Recall | CNN-One-Hot F1 Score |
---|---|---|---|---|---|---|
Object strikes | 0.67 | 0.66 | 0.66 | 0.70 | 0.49 | 0.57 |
Electrocution | 0.77 | 0.83 | 0.80 | 0.68 | 0.60 | 0.64 |
Collapse of objects | 0.93 | 0.92 | 0.93 | 0.84 | 0.94 | 0.89 |
Falling | 0.93 | 0.90 | 0.91 | 0.87 | 0.95 | 0.91 |
Mechanical injury | 0.90 | 1.00 | 0.95 | 0.88 | 0.83 | 0.86 |
Average | 0.84 | 0.86 | 0.85 | 0.80 | 0.76 | 0.77 |
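The "Average" row in the table above appears to be an unweighted (macro) average over the five classes. The sketch below recomputes it from the per-class values as a sanity check; the macro-averaging interpretation is an assumption, and `macro_average` is a hypothetical helper name. Because the inputs are already rounded to two decimals, a recomputed value can differ from the reported one in the last digit.

```python
# Per-class (precision, recall, F1) values copied from the table above.
word2vec = {
    "Object strikes": (0.67, 0.66, 0.66),
    "Electrocution": (0.77, 0.83, 0.80),
    "Collapse of objects": (0.93, 0.92, 0.93),
    "Falling": (0.93, 0.90, 0.91),
    "Mechanical injury": (0.90, 1.00, 0.95),
}
one_hot = {
    "Object strikes": (0.70, 0.49, 0.57),
    "Electrocution": (0.68, 0.60, 0.64),
    "Collapse of objects": (0.84, 0.94, 0.89),
    "Falling": (0.87, 0.95, 0.91),
    "Mechanical injury": (0.88, 0.83, 0.86),
}


def macro_average(per_class):
    """Unweighted mean of precision, recall, and F1 across classes, rounded to 2 decimals."""
    n = len(per_class)
    return tuple(
        round(sum(values[i] for values in per_class.values()) / n, 2)
        for i in range(3)
    )


avg_word2vec = macro_average(word2vec)
avg_one_hot = macro_average(one_hot)
```

The Word2vec averages reproduce the reported 0.84, 0.86, and 0.85. For One-hot, recall and F1 match the reported 0.76 and 0.77, while precision comes out at about 0.79 against the reported 0.80, a gap consistent with rounding of the per-class inputs.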
Keyword | TF-IDF | Keyword | TF-IDF |
---|---|---|---|
Ground | 1.81 × 10⁻² | Head | 0.59 × 10⁻² |
Fall | 1.79 × 10⁻² | Painting | 0.51 × 10⁻² |
Scaffolding | 1.67 × 10⁻² | Hoist | 0.49 × 10⁻² |
Death | 1.39 × 10⁻² | Opening | 0.43 × 10⁻² |
Unintentionally | 1.16 × 10⁻² | Floor | 0.43 × 10⁻² |
Elevator | 1.13 × 10⁻² | Cleaning | 0.42 × 10⁻² |
Demolition | 0.92 × 10⁻² | Material | 0.42 × 10⁻² |
Roof | 0.85 × 10⁻² | Canopy | 0.42 × 10⁻² |
Template | 0.78 × 10⁻² | Cement | 0.42 × 10⁻² |
Exterior wall | 0.64 × 10⁻² | Wearing | 0.40 × 10⁻² |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Li, J.; Wu, C. Deep Learning and Text Mining: Classifying and Extracting Key Information from Construction Accident Narratives. Appl. Sci. 2023, 13, 10599. https://doi.org/10.3390/app131910599