Development of a Knowledge Base for Construction Risk Assessments Using BERT and Graph Models

Lee, Wonjong; Lee, Seulki

doi:10.3390/buildings14113359



by Wonjong Lee and Seulki Lee *

Department of Architecture Engineering, Kwangwoon University, Seoul 01897, Republic of Korea

* Author to whom correspondence should be addressed.

Buildings 2024, 14(11), 3359; https://doi.org/10.3390/buildings14113359

Submission received: 20 September 2024 / Revised: 19 October 2024 / Accepted: 22 October 2024 / Published: 23 October 2024

(This article belongs to the Section Construction Management, and Computers & Digitization)

Abstract

As a significant percentage of disasters and fatal accidents still occur in the construction sector, it is legally obligatory to conduct workplace risk assessments to avoid accidents and enhance safety. Identifying harmful and hazardous elements is crucial to discern the distinctive characteristics of potential accidents. However, conventional risk-assessment approaches, which rely on the skills and experience of safety managers, may overlook important factors, leading to inconsistencies in the procedures employed across different sites. Such unstructured safety knowledge reduces accessibility and utility, increases reliance on individual skills, and renders information management inefficient. Recently, the focus has shifted from efficient data storage to obtaining valuable knowledge tailored to specific use cases. Knowledge-graph-based systems integrate and manage the relationships between knowledge entities, thereby enhancing the development of knowledge bases. Research on automatically extracting and managing predefined knowledge from various forms of data through natural language processing (NLP) is ongoing. This study proposes a novel method that uses NLP and graph models to automatically extract predefined knowledge from unstructured construction data and build an entity-relationship-based risk-assessment knowledge base. We developed a named-entity recognition and keyword-extraction engine that defines the core knowledge related to construction safety and risk assessments. This engine can automatically extract predefined knowledge from unstructured data by learning from NLP data. The extracted risk-assessment knowledge was used to create a knowledge base, and its efficiency and effectiveness were validated through comparisons with existing methods. The results of this study are significant because they lay the foundation for an automatic knowledge-management system for construction safety and risk assessment, offering both practical and academic contributions to the field of construction safety.

Keywords:

BERT; graph model; risk assessment; knowledge base

1. Introduction

Job risk assessments are mandatory for preventing accidents and disasters that are prevalent in the construction industry. Among the various stages of risk assessment, identifying hazard factors is the most crucial because it involves preemptively identifying factors that can potentially cause accidents and implementing appropriate safety measures [1,2]. Currently, risk assessments are primarily conducted by business owners, safety and health officers, and supervisors. However, this approach has several issues. First, it is difficult to determine all risk factors based on the complex site conditions and prior accident cases, thereby increasing the likelihood of overlooking hazard factors. Additionally, each company or authority employs different methods to conduct risk assessments and identify hazard factors, leading to inconsistent evaluations [3,4,5,6,7].

Moreover, the construction-safety data used for risk identification and assessment are in unstandardized and unstructured file formats. Therefore, users are compelled to manually search for and extract the required information or knowledge from vast amounts of data, making the process highly dependent on the experience, judgment, and ability of the supervisory personnel. This dependence exacerbates the difficulty of considering all hazard factors based on complex site conditions and risk assessments. Additionally, unstructured construction-safety data suffer from reduced accessibility and usability, making their continuous accumulation and management highly inefficient [8,9].

Currently, the focus has shifted from merely storing these data to efficiently accumulating and providing the necessary knowledge and identifying all relationships. Various graph-data- and knowledge-based technologies and systems have been developed and are widely used in many fields. These systems can efficiently store data and the relationships between them, infer third-party data through relational reasoning, and continuously expand and integrate their management. Furthermore, continuous research is being conducted on the application of natural language processing (NLP), which enables computers to understand and process natural language and automatically extract and manage predefined knowledge from the data format, thereby enhancing the efficiency of the construction and expansion of knowledge bases [10,11,12,13,14,15,16].

Therefore, this study proposes an NLP-based technology that automatically extracts predefined risk-assessment knowledge from unstructured data of the construction industry and builds an entity-relationship-based knowledge base for their efficient management. The efficiency and effectiveness of the proposed method for creating risk-assessment knowledge bases were verified through comparisons with traditional knowledge extraction and management methods. The results of this study are expected to allow users to extract consistent risk-assessment knowledge without manually searching for information, irrespective of their experience and abilities, thus ensuring the scalability and sustainability of knowledge maintenance and enhancing safety at construction sites.

The remainder of this paper is organized as follows. Section 2 examines the concept and implementation of mandatory risk assessments aimed at preventing accidents in the construction industry. It reviews the methods used for providing and managing knowledge to ensure efficient risk assessments and identifies issues related to efficient accumulation and management of risk-assessment knowledge. To address these issues and efficiently accumulate and manage risk-assessment knowledge, it explores the concepts, research trends, and current utilization of knowledge bases and NLP techniques to assess their applicability.

2. Literature Review

2.1. Limitations of Existing Knowledge-Management Methods for Risk Assessment

Owing to the high percentage of disasters and fatal accidents in the construction industry, ‘job risk assessment’ is legally mandated under Article 36 of the Occupational Safety and Health Act in Korea to prevent accidents. The risk-assessment method proposed by the Ministry of Employment and Labor (MOEL) [2] involves identifying hazard factors, estimating and determining the likelihood and severity of injuries and illnesses, and establishing measures to reduce risks and prevent accidents. To conduct risk assessments efficiently, it is essential to comprehensively examine the severity (intensity) and frequency of potential hazards, occurrences of industrial accidents or near misses, and workers’ opinions.

Risk assessment encompasses all foreseeable injuries or illnesses related to the hazardous factors associated with workers’ duties, such as past accidents or tasks. Because it is an ongoing process that must adapt to the progress of tasks or construction methods, it must be repeated until the risks are reduced to acceptable levels. Risk assessments generally involve five stages (preparation, identification of hazardous factors, risk determination, establishment of risk-reduction measures, and sharing of the risk assessment), with the second being the most critical and challenging [2]. This stage involves preemptively identifying potential accident-causing factors, including hazard factors, which are the direct targets of accidents and risk assessments. Hazard factors refer to the inherent characteristics or attributes related to potential risks, encompassing all components such as machinery, equipment, materials, transportation processes, byproducts, methods, practices, and attitudes. If even one factor is omitted, appropriate measures cannot be taken [17,18]. However, factors that are expected to cause only minor injuries or illnesses may be excluded.

Various methods, including workplace inspections, collating worker suggestions, conducting surveys and interviews, and implementing safety checklists tailored to specific workplaces, have been used to identify hazardous factors [17,18,19]. Although risk assessments are primarily conducted by business owners or safety and health managers, they cannot be performed independently and require active support of workers, supervisors, and safety managers. Employers may also establish criteria for the methods, procedures, and content of risk assessments and provide specialized training to ensure that the entities can efficiently perform the assessments themselves.

However, relying on business owners, safety managers, and supervisors for identifying hazard factors has certain limitations. It is difficult to consider all factors related to complex site conditions and accident cases, and inexperienced supervisors are more likely to omit some factors. Additionally, risk assessments and hazard-factor identification may be inconsistent owing to the different entities and formats employed across sites or companies. Therefore, it is necessary to accumulate construction-safety-related information and implement appropriate risk assessments to ensure effective hazard-factor identification without omission [3,4,5,6,7].

The Ministry of Employment and Labor supports efficient risk assessments and hazard-factor identification by providing various construction safety knowledge resources such as Korea Occupational Safety and Health Agency’s (KOSHA) Korea Risk Assessment System (KRAS) [20], construction-accident casebooks by the Ministry of Land, Infrastructure and Transport (MOLIT) and Korea Authority of Land & Infrastructure Safety (KALIS) [21], Korea Land & Housing Corporation (LH)’s risk-assessment standard models, and Construction Safety Management Integrated Information (CSI)’s construction-accident casebooks [22]. However, these data are not uniformly categorized by tasks or factors and are presented in various unstructured formats (.xls, .pdf, etc.).

Most construction-industry data, such as contracts, design reports, construction plans, and guidelines, are in unstructured text-based formats. Simoff and Maher [23] noted that the management of construction information and knowledge involves various types of data. Compared with structured data, unstructured data limit access to and use of knowledge within documents and require users to manually search for and extract the desired information, thereby limiting the accessibility and usability of the knowledge. Consequently, dependence on the experience and subjective judgment of business owners and safety managers is high, leading to incomplete or inconsistent evaluations. Additionally, continuous accumulation and management of data, such as hazard factors and safety measures, for risk assessments are highly inefficient.

Furthermore, as both the volume and diversity of data are crucial for data management, managing and analyzing unstructured data from the construction industry are important tasks that can yield valuable information that may be overlooked by humans. Therefore, there is a growing need to establish a knowledge-management system that can automatically extract the relevant risk-assessment data from unstructured construction-industry data, continuously accumulate them, and efficiently integrate and manage them.

2.2. Concept of Network-Based Knowledge Base

A knowledge base is a database that stores expert knowledge accumulated through activities and experiences in a specific field or a large-scale database that stores real-world knowledge in the form of a knowledge graph, which is a type of graph-data model. A knowledge graph that constitutes a knowledge base has various definitions, making it difficult to assign a universal definition. However, it can be described as a representation of entities and the relationships between them in a node-and-edge-based graph format [24] or a network of real entities or concepts that shows the relationships between them. A common feature of knowledge bases is that they include the relational meanings and associations between entities. This allows for the effective definition of complex data structures and efficient storage and integrated management of vast knowledge by specifying the relationships between entities, enabling the acquisition of the desired information more efficiently. Particularly, through graph-based relational reasoning, knowledge bases enable discovery of third-party information of existing entities and relationships, and they can be continuously and intelligently expanded and managed by integrating reasoning engines.

Although previous studies have focused on developing efficient methods for storing data, the requirement has recently shifted to obtaining the most valuable knowledge from within the data based on the user’s purpose. To allow computers to understand and analyze sentences like humans and obtain valuable knowledge from data, knowledge-graph technologies that focus on the interconnections between entities are essential. Google first employed knowledge graphs in its search engine in 2012, marking the beginning of using knowledge graphs across various industries and research fields, including searching for, extracting, and providing new information or knowledge. Additionally, knowledge graphs have recently attracted significant interest from academia and various industries owing to their cognitive intelligence capabilities, which combine information technology to define and implement realistic patterns of human understanding and knowledge utilization and discover third-party knowledge that they have not identified through reasoning [25,26].

Various types of open-source knowledge bases, such as ConceptNet, YAGO, and WordNet, have been developed and widely utilized. ConceptNet is an open-source knowledge graph composed of sentence data extracted from various document types in 78 languages, including English, French, and Korean. By converting a vast amount of knowledge into an entity-relationship-based knowledge graph, valuable knowledge can be efficiently extracted and utilized. Furthermore, the continuous expansion and integration of knowledge allow its management across various fields, demonstrating the development and utilization of graph-database-based knowledge (Table 1).

The methods for constructing knowledge bases can be divided into hierarchical knowledge bases based on relational databases and network knowledge bases based on graph databases, which effectively define and process the relationships between entities. Hierarchical knowledge bases are primarily optimized for storing and retrieving real entities such as people, places, and objects, making them advantageous for comprehending the status of knowledge and for adding entities efficiently. By contrast, network knowledge bases focus on storing and retrieving the relationships between entities. They are advantageous for cases wherein human and digital representations are similar: they allow human intuition to be expressed easily through machines or computers, provide valuable insights into the data, and support collecting or reasoning about the relationships required for decision making. Each construction method has its strengths and weaknesses depending on the format of the knowledge-graph data. Therefore, the appropriate graph should be used based on the specific problem or purpose. Knowledge is generally applied as associative knowledge, in which multiple pieces of knowledge are combined through relationships depending on the purpose and situation. Recently, the focus has shifted from merely accumulating knowledge efficiently to extracting value from the accumulated knowledge, thereby shifting the way of thinking and the processes from organizing entities to organizing relationships. As stated in previous studies [33,34,35], traditional knowledge graphs are mostly hierarchical and do not effectively reflect the relational characteristics of knowledge. Therefore, it is desirable to construct knowledge graphs as networks. Moreover, defining associative knowledge as relationship-based hierarchies is complex, and constantly updating the database schema for new applications and situations is a significant burden [36].

Network knowledge bases are constructed using the graph schema language, which is composed of ‘nodes’, ‘edges’, and ‘properties’ that represent the concepts or entities of the data, the relationships between them, and the names, dates, functions, etc. of the nodes and edges, respectively.
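The node, edge, and property structure described above can be illustrated with a minimal in-memory sketch. This is not the authors' implementation: the class names, the relation names, and the example entries are hypothetical and serve only to show how relationships between entities can be stored and then traversed to reach indirectly related information.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    label: str                                   # concept or entity type, e.g. "Work"
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str                                  # node_id of the start node
    target: str                                  # node_id of the end node
    relation: str                                # relationship name
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Tiny property graph: nodes and directed edges, both carrying properties."""

    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node_id, label, **properties):
        self.nodes[node_id] = Node(node_id, label, properties)

    def add_edge(self, source, target, relation, **properties):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append(Edge(source, target, relation, properties))

    def neighbors(self, node_id, relation=None):
        """Return ids of nodes reachable over one outgoing edge."""
        return [e.target for e in self.edges
                if e.source == node_id and (relation is None or e.relation == relation)]

    def two_hop(self, node_id):
        """Return ids of nodes reachable over exactly two outgoing edges."""
        result = []
        for mid in self.neighbors(node_id):
            for far in self.neighbors(mid):
                if far not in result:
                    result.append(far)
        return result


# Hypothetical example entries (not taken from the paper's data).
graph = PropertyGraph()
graph.add_node("w1", "Work", name="scaffold assembly")
graph.add_node("h1", "UnsafeCondition", name="missing guardrail")
graph.add_node("a1", "Accident", name="fall")
graph.add_edge("w1", "h1", "BE_RELATED")
graph.add_edge("h1", "a1", "CAUSE_EFFECT", source_doc="hypothetical case 001")
```

Here `graph.two_hop("w1")` reaches the accident node through the intermediate hazard factor, which is the kind of indirect lookup that a relationship-centred store makes straightforward.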

2.3. Research Trends of Using BERT for Knowledge Management in Construction

NLP technology uses computers to process natural language, i.e., the language used by humans for communication, and can be described as the process through which machines understand human language. Recently, the emergence of pretrained language models such as Embeddings from Language Models (ELMo), Bidirectional Encoder Representations from Transformers (BERT), OpenAI’s Chat Generative Pre-trained Transformer (ChatGPT), and XLNet has led to rapid improvements in NLP-related technologies. NLP is primarily employed in the fields of information retrieval (IR), which involves finding necessary information from large information repositories, and information extraction (IE), which involves extracting only predefined information from data. It is applied in various tasks such as named entity recognition (NER), relation extraction (RE), question answering (QA), and document classification (DC).

A review of previous research on NLP-based automatic text extraction in the construction field shows that NLP technology can be used for document classification, IR, and automatic text extraction for various purposes. Furthermore, research is being conducted to automatically extract knowledge from unstructured data using NLP technology to build knowledge bases that are efficiently tailored to specific situations and purposes.

Lee [37] constructed a knowledge graph of Korean health checkup medical opinion reports to systematically structure the health status of individuals and offer them personalized solutions. Specifically, to automate the construction and expansion of the knowledge graph and improve extraction accuracy, they defined seven key entities and ten relationships from the medical opinion reports to build the training dataset for the NLP model. NER and RE were applied using the BERT model to automatically extract predefined entities and relationships suitable for constructing knowledge graphs.

Yoo et al. [38] proposed a chatbot system that utilizes a knowledge graph for real-time data collection, analysis, and automatic expansion. They used the BERT model for NLP to extract relationships between words within real-time data, such as news and social media, and proposed an automatic knowledge-graph expansion method that adds new words and relationships to the existing knowledge graph. Park [39] collected unstructured corporate information texts from the electronic disclosure site for listed companies in South Korea and used a knowledge graph to extract and visualize the elements and relationships required for constructing a BCG matrix chart. They used the BERT model, which has shown the best performance among the NLP models, to create a vocabulary dictionary for business consulting categories and reports and ensured the quality of the embedded data through sentence- and word-similarity assessments. Subsequently, to handle the interactions within rapidly moving economic and social networks, Kim et al. [40] used the BERT model to extract academic keywords and relationships between keywords in the field of economics via NLP and represented the extracted knowledge on a knowledge graph. The extraction accuracy was increased by training the existing NLP model with the entity and relationship information of keywords from economics papers, thereby allowing the model to predict and extract new relationship information.

Thus, research is being actively conducted in various fields to automatically extract predefined information from unstructured data and build knowledge systems. However, research on the use of NLP in the construction field is relatively lacking compared with other fields, such as the medical, legal, and information fields, indicating that there is still a considerable amount of unstructured data that can be utilized. This also suggests considerable potential for applying NLP to data analysis, wherein the diversity of data types is emphasized.

3. Materials and Methods

The research process to achieve the objectives of this study is as follows.

In Section 2, we discuss the limitations of existing knowledge-management methods for construction risk assessments, focusing on managing unstructured data. We then explore the concept of network-based knowledge bases in other industries, highlighting how they address inefficiencies in knowledge management and their potential application in the construction sector. Finally, we introduce BERT (developed by Google AI, Mountain View, CA, USA) as a leading NLP technique and review its application for automatic text extraction in construction-related research.

Section 4 discusses the development of a named entity recognition (NER) and extraction model using BERT (developed by Google AI, Mountain View, CA, USA), which is designed to automatically extract predefined risk-assessment knowledge from unstructured construction data. The extracted knowledge is then used to create a risk-assessment knowledge base with Neo4j (Neo4j, Inc., San Mateo, CA, USA).

Subsequently, Section 5 presents a case study to validate the proposed method. To check that the proposed model is not overfitted, its automatic classification performance for predefined risk-assessment knowledge is verified using unstructured and unseen construction-safety data. This knowledge is then used to create a risk-assessment knowledge base, along with suggested utilization methods. Additionally, the efficiency, usability, convenience, and satisfaction of the proposed model are verified through comparisons with traditional knowledge-extraction and -management methods, as well as the results from the case study.

Finally, Section 6 summarizes the main results of this study and discusses the expected effects, limitations, and directions for future research.

The step-by-step research methodology and the data utilized are shown in Figure 1.

4. Risk-Assessment Knowledge Base Using BERT and Graph Models

4.1. Conceptual Framework

According to the risk assessment guidelines and interviews with safety managers, the identification of hazard factors is conducted not only by following safety guidelines for specific tasks but also by utilizing past accident cases to identify possible accident types and causes. However, the accident casebooks and analysis reports currently available are mostly accumulated as unstructured data in document form, which presents limitations in terms of data search and utilization. To address this issue, this study proposes a method to efficiently accumulate and manage key knowledge related to construction accidents and risk assessments by building a risk-assessment knowledge base using BERT to automatically extract knowledge from unstructured data. A conceptual diagram of the proposed method is shown in Figure 2, which involves designing the risk-assessment knowledge base, developing a BERT-based keyword-extraction engine, and constructing the risk-assessment knowledge base.

Design the risk-assessment knowledge base

This is the first step in building the proposed risk-assessment knowledge base for construction projects. First, by referring to unstructured construction data such as accident casebooks and risk-assessment support systems, key factors related to construction accidents and risk assessments are defined as nodes, and the relationships between them and their properties are defined. The proposed risk-assessment knowledge base is then designed and constructed as a connected graph through graph-data modeling.

Develop the BERT-based keyword-extraction engine

We employed the NER technique that is typically used in NLP methods to identify and extract predefined words related to entities such as people, places, and times from documents. The BERT model, released by Google in 2018, is a representative pretrained language model composed of transformers that allow bidirectional learning. This study aimed to automatically extract predefined knowledge from unstructured construction data for risk assessment. A bidirectional learning model that can understand the flow and structure of sentences faster and more efficiently than a unidirectional learning-based NLP model was more suitable to achieve the goals of this study. A keyword-extraction engine based on NER was developed to build a risk-assessment knowledge base using the KLUE-RoBERTa model, which is the best-performing Korean classification BERT model [41].

In this study, a new BERT model for knowledge extraction was developed by fine-tuning the existing BERT model on a risk-assessment dataset with the aim of automatically extracting predefined risk-assessment knowledge from unstructured construction data. Fine-tuning adds only an output layer, trained on data labeled for the desired purpose, on top of the pretrained BERT parameters, thereby reusing the weights learned during pretraining. This process allows for the creation of a BERT model that achieves state-of-the-art performance based on the intended application [42]. For fine-tuning, unstructured data of construction accidents and disaster cases were collected from reputable sites comprising various forms of unstructured construction-safety data. Training and test sets were constructed to match the predefined risk-assessment knowledge. To enhance the model performance, the training data were augmented in two ways. First, back-translation with Google Translate was used to translate Korean sentences into English and then back into Korean, creating different sentences with the same meaning. Second, ChatGPT-4 was used to transform sentences while retaining their meaning. These augmented data were used to fine-tune the KLUE-RoBERTa model. Consequently, a BERT-based keyword-extraction engine capable of automatically extracting predefined knowledge for risk assessments, such as accident causes, hazard factors, and safety measures, from unstructured construction safety data was developed.

Create the risk-assessment knowledge base

To verify that the risk-assessment knowledge base, composed of risk-assessment knowledge extracted from unstructured data, allows users to search for consistent and clear knowledge and relationships related to construction-safety accidents, a network-type risk-assessment knowledge base was built on a graph database using Neo4j Desktop (v1.5.8), a knowledge-base construction program [43].
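As an illustration of how extracted knowledge might be loaded into a graph database such as Neo4j, the sketch below builds a parameterized Cypher `MERGE` statement for one extracted triple. The paper does not describe its loading procedure, so the function name `merge_triple`, the `name` property, and the label and relationship spellings are assumptions made for this example only.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check(identifier):
    # Labels and relationship types cannot be supplied as Cypher parameters,
    # so they are validated before being interpolated into the query text.
    if not _IDENTIFIER.match(identifier):
        raise ValueError(f"unsafe identifier: {identifier!r}")
    return identifier


def merge_triple(src_label, src_name, rel_type, dst_label, dst_name):
    """Build a Cypher statement that creates two nodes and one edge if absent."""
    query = (
        f"MERGE (a:{_check(src_label)} {{name: $src_name}}) "
        f"MERGE (b:{_check(dst_label)} {{name: $dst_name}}) "
        f"MERGE (a)-[:{_check(rel_type)}]->(b)"
    )
    params = {"src_name": src_name, "dst_name": dst_name}
    return query, params


# Hypothetical triple; the names are illustrative, not taken from the paper's data.
query, params = merge_triple(
    "UnsafeCondition", "missing guardrail", "CAUSE_EFFECT", "Accident", "fall"
)
```

Using `MERGE` rather than `CREATE` keeps repeated extractions of the same keyword from producing duplicate nodes. With the official Neo4j Python driver, a statement and parameter dictionary of this shape can typically be passed to `session.run(query, params)`; the example stops short of opening a connection so that it runs without a database.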

4.2. Designing the Risk-Assessment Knowledge Base

To design the risk-assessment knowledge base, we first defined the information constituting the knowledge base as nodes, properties that were specific examples describing the nodes, and edges representing the relationships between nodes. To define these elements, we compared and analyzed risk-assessment and construction-accident data provided by accredited institutions, such as the construction-accident casebooks provided by KOSHA and KALIS, risk-assessment model of the KRAS provided by KOSHA, risk-assessment standard model provided by LH, construction hazard-factor checklist provided by MOLIT, and risk-assessment guidelines provided by MOEL. The results are summarized in Table 2. The basic concept is that risk assessments are conducted based on tasks, and the risks that may arise during these tasks are unsafe conditions or unsafe behaviors. The objects that trigger risks may include materials, tools, and equipment used in the tasks.

The final nodes and properties defined in the risk-assessment knowledge base are listed in Table 3. They include ten key factors related to construction-safety accidents and risk assessment: ‘Work’ for Construction; ‘Machine’, ‘Tool’, and ‘Material’ for Hazard Objects; ‘Unsafe Condition’ and ‘Unsafe Act’ for Hazard Factors; ‘Accident’; and ‘Engineering’, ‘Enforcement’, and ‘Education’ for Safety Actions.

Edges represent the relationships between previously defined nodes (Work, Hazard Object, Hazard Factor, Accident, and Safety Action). In this study, we defined six edges, as listed in Table 4: ‘Precede’, ‘Use’, ‘Be Related’, ‘Cause–Effect’, ‘Effect–Cause’, and ‘Recommend’.

Finally, the nodes were organized in the form of connected graphs using edges. The proposed graph-based data model is illustrated in Figure 3.
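The ten nodes and six edges named above can be written down as a small schema that rejects edges connecting the wrong kinds of nodes. The node grouping follows the text; the allowed source and target of each edge type are illustrative guesses, since those details are given in Table 4 rather than in the running text, and the identifier spellings are likewise assumptions.

```python
# Node labels grouped by category, following the ten nodes named in the text.
NODE_CATEGORIES = {
    "Work": "Work",
    "Machine": "HazardObject",
    "Tool": "HazardObject",
    "Material": "HazardObject",
    "UnsafeCondition": "HazardFactor",
    "UnsafeAct": "HazardFactor",
    "Accident": "Accident",
    "Engineering": "SafetyAction",
    "Enforcement": "SafetyAction",
    "Education": "SafetyAction",
}

# ASSUMPTION: these endpoint pairs are guesses made for illustration only;
# the paper's Table 4 defines the actual source and target of each edge type.
EDGE_ENDPOINTS = {
    "Precede": ("Work", "Work"),
    "Use": ("Work", "HazardObject"),
    "BeRelated": ("HazardObject", "HazardFactor"),
    "CauseEffect": ("HazardFactor", "Accident"),
    "EffectCause": ("Accident", "HazardFactor"),
    "Recommend": ("HazardFactor", "SafetyAction"),
}


def is_valid_edge(src_label, edge_type, dst_label):
    """Return True if the edge type may connect the two node labels."""
    if edge_type not in EDGE_ENDPOINTS:
        return False
    expected_src, expected_dst = EDGE_ENDPOINTS[edge_type]
    return (NODE_CATEGORIES.get(src_label) == expected_src
            and NODE_CATEGORIES.get(dst_label) == expected_dst)
```

A check of this kind could be applied before inserting extracted keywords, so that a misclassified entity does not silently create a relationship the data model does not allow.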

4.3. Development of BERT-Based Keyword-Extraction Engine

Because the casebooks describe accidents in narrative form, users were required to manually search for the desired knowledge, which was inconvenient. To manage and accumulate data in a unified format, the BERT-based keyword-extraction engine was developed to extract keywords corresponding to the defined nodes using various examples. This experiment was conducted in Google Colab and Anaconda, a Python distribution. The KLUE-RoBERTa model, which exhibited better performance for the trained data and structure than other Korean-language BERT models, was used to develop the engine for recognizing and extracting risk-assessment knowledge, i.e., named entities.

Training Data

In this study, 7256 pieces of training data were constructed by tagging 3149 sentences with the predefined risk-assessment nodes. These sentences contain keywords related to hazard factors and safety measures and were drawn from a total of 1154 construction accident and disaster cases provided by the Korea Occupational Safety and Health Agency (KOSHA) from 2010 to 2018. For domain-specific named entities (NEs) such as those in the construction industry, performance degradation due to out-of-vocabulary (OOV) issues and overfitting may occur when there is insufficient pretraining or fine-tuning data, making it difficult to achieve satisfactory learning outcomes. This is a common problem in named entity recognition (NER) information-extraction models, and a large amount of data is required to minimize errors and maximize performance. Therefore, to improve the classification performance of the risk-assessment NER BERT model being developed in this study, a data-augmentation technique was applied. A Google Translate-based back-translation technique was used to convert Korean sentences into English and then back into Korean to generate different sentences with the same meaning. Additionally, a ChatGPT-4-based data augmentation technique was applied, where ChatGPT-4 was asked to create different sentences with the same meaning, maximizing diversity. As a result, 81,225 augmented data points were generated from the original manually constructed data, and a total of 88,481 pieces of training data were constructed (see Table 5).
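The back-translation step can be sketched as a round trip through a second language followed by a filter that discards paraphrases identical to a sentence already seen. The sketch below makes no call to any real translation service: `to_english` and `to_korean` are injected callables, and the dictionary-based stand-ins and example sentences are hypothetical. The final lines only re-check the totals reported above.

```python
def back_translate(sentence, to_english, to_korean):
    """Round-trip a sentence through English; both translators are injected callables."""
    return to_korean(to_english(sentence))


def augment(sentences, to_english, to_korean):
    """Return paraphrases that differ from every original and from each other."""
    seen = set(sentences)
    augmented = []
    for sentence in sentences:
        candidate = back_translate(sentence, to_english, to_korean).strip()
        if candidate and candidate not in seen:
            seen.add(candidate)
            augmented.append(candidate)
    return augmented


# Toy stand-ins for a translation service (hypothetical; no real API is called).
_KO_TO_EN = {"작업자가 비계에서 추락했다": "The worker fell from the scaffold"}
_EN_TO_KO = {"The worker fell from the scaffold": "근로자가 비계에서 떨어졌다"}


def to_english(text):
    return _KO_TO_EN.get(text, text)


def to_korean(text):
    return _EN_TO_KO.get(text, text)


originals = ["작업자가 비계에서 추락했다", "안전모 미착용"]
paraphrases = augment(originals, to_english, to_korean)

# Totals reported in the text: original entries plus augmented entries.
ORIGINAL_ENTRIES = 7256
AUGMENTED_ENTRIES = 81225
TOTAL_ENTRIES = ORIGINAL_ENTRIES + AUGMENTED_ENTRIES
```

The second toy sentence comes back unchanged and is therefore dropped. For NER data, each paraphrase would also need its entity tags re-assigned, because the keyword positions generally shift after translation; that step is outside this sketch.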

The augmented data were added to the original training data, resulting in 88,481 training-data entries, with the augmented data comprising 5941 sentences. All data augmentation was conducted in Anaconda. To fine-tune the KLUE-RoBERTa model on the constructed training set, the data were converted into the widely used beginning–inside–outside (BIO) format, which marks each token as the beginning of, inside, or outside a named entity so that the model can learn the entity boundaries.
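As an illustration of the BIO format, the following minimal sketch converts entity spans into per-token BIO tags. The function name `to_bio`, the span layout `(start, end_exclusive, label)`, and the English example sentence are assumptions made for readability; the actual training data are Korean and their annotation layout is not specified here.

```python
from typing import List, Tuple


def to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert token-index spans (start, end_exclusive, label) into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not (0 <= start < end <= len(tokens)):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Illustrative sentence only; node labels follow Table 3.
tokens = ["Worker", "fell", "while", "dismantling", "the", "tower", "crane"]
spans = [(1, 2, "ACC"), (3, 4, "WORK"), (5, 7, "MAC")]
for token, tag in zip(tokens, to_bio(tokens, spans)):
    print(f"{token}\t{tag}")
```

Tokens outside any entity receive `O`, so a multi-token entity such as "tower crane" is encoded as `B-MAC` followed by `I-MAC`.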

Fine-tuning the BERT model

The 3149 original sentences were split into 2519 for training and 630 for testing (8:2 ratio). All augmented data were used for training, resulting in 8460 training sentences. Fine-tuning the KLUE-RoBERTa model involved adding layers on top of the existing pretrained layers and training them on data labeled with the predefined risk-assessment knowledge, yielding the BERT-based keyword-extraction engine; this adapts the model to the intended purpose without significant structural modifications. After fine-tuning, the classification performance of the model was measured using the F1-score, the harmonic mean of precision and recall, which ranges from 0 to 1; a score closer to 1 indicates better NER classification performance.
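For reference, the F1-score mentioned above can be computed as in the short sketch below. The function names are illustrative, and the counts in the usage line are made-up numbers rather than results from this study.

```python
from typing import Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Made-up counts for illustration: 3 correct entities, 1 spurious, 1 missed.
print(precision_recall_f1(tp=3, fp=1, fn=1))
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1-score by maximizing precision or recall alone.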

As listed in Table 6, the proposed BERT model obtained a weighted-average F1-score of 0.57 for the classification of the risk-assessment knowledge defined in this study. The model exhibited good classification performance for entities such as ‘Accident’ (ACC) and ‘Work’ (WORK), for which considerable amounts of training data and clearly definable terms were available, with F1-scores of 0.80 and 0.74, respectively. However, the performance was lower for ‘Engineering’ (ENG) and ‘Education’ (EDU), for which insufficient training data were available (F1-scores of 0.19 and 0.28), and for ‘Unsafe Action’ (UNACT) and ‘Enforcement’ (ENF), which are not easily definable by specific terms (0.44 and 0.45). Considering that existing BERT models are sufficiently pretrained on basic named entities such as ‘Person’, ‘Location’, ‘Time’, and ‘Organization’, but not on domain-specific named entities belonging to the construction field, the classification performance of the proposed BERT model was judged to be satisfactory.
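The relationship between the per-node scores and the reported average can be checked with a short calculation. The sketch below takes the F1-scores and support values from Table 6 and computes a support-weighted mean; the unweighted (macro) mean is added only for comparison and is not a figure reported in this study.

```python
# (F1-score, support) per node, as listed in Table 6.
TABLE_6 = {
    "WORK": (0.74, 704), "MAC": (0.67, 307), "TOOL": (0.52, 464),
    "MAT": (0.50, 521), "UNCON": (0.47, 452), "UNACT": (0.44, 215),
    "ACC": (0.80, 987), "ENG": (0.19, 32), "EDU": (0.28, 27),
    "ENF": (0.45, 1917),
}


def weighted_f1(scores: dict) -> float:
    """Average of F1-scores weighted by the number of test entities per node."""
    total_support = sum(support for _, support in scores.values())
    return sum(f1 * support for f1, support in scores.values()) / total_support


def macro_f1(scores: dict) -> float:
    """Unweighted average of F1-scores (every node counts equally)."""
    return sum(f1 for f1, _ in scores.values()) / len(scores)


print(f"weighted F1: {weighted_f1(TABLE_6):.2f}")
print(f"macro F1:    {macro_f1(TABLE_6):.2f}")
```

The weighted mean is dominated by nodes with many test entities, such as ENF and ACC, whereas the macro mean gives sparsely represented nodes such as ENG and EDU the same weight as the others.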

4.4. Creating the Risk-Assessment Knowledge Base

Using the 7256 original data entries from the previously built risk-assessment training set, we created a risk-assessment knowledge base with 3129 nodes and 7878 edges, excluding duplicate nodes. Among the nodes, ‘Enforcement’ (ENF) was the most frequent with 1568 occurrences, followed by ‘Unsafe Condition’ (UNCON) with 585. Among the edges, the ‘Recommend’ relationship between a specific ‘Accident’ and ‘Safety Action’ was the most frequent with 2821 occurrences (see Table 7).

5. Case Study

5.1. Application of the Risk-Assessment Knowledge Base Development Process

The 2022 Construction Accident Information Report provided by the Comprehensive Information Network for Construction Safety Management indicates that falls, collapses, and strikes are the most common fatal accidents at construction sites. Therefore, we randomly selected ten accident cases, comprising four falls, three collapses, and three strikes, from the casebooks. To improve the quality and accuracy of the risk-assessment knowledge extracted using the developed BERT model, a preprocessing step was implemented that combines tokens tagged TAG-B (beginning) and TAG-I (inside) into complete words and excludes duplicate knowledge with the same meaning as well as commonly co-extracted particles.
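The merging of beginning and inside tags into complete words can be sketched as follows. This is a simplified illustration under stated assumptions: tokens are joined with a space (Korean subword tokens would need a different joining rule), exact duplicates are dropped, and the removal of co-extracted particles and of synonymous knowledge is not covered. The function names are hypothetical.

```python
from typing import List, Optional, Tuple


def merge_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge B-/I- tagged tokens into (entity text, label) pairs."""
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        prefix, _, tag_label = tag.partition("-")
        if prefix == "I" and tag_label == label and words:
            words.append(token)             # continue the current entity
            continue
        if words:                           # close the entity collected so far
            entities.append((" ".join(words), label))
        if prefix in ("B", "I"):            # a stray I- tag starts a new entity
            words, label = [token], tag_label
        else:                               # "O": outside any entity
            words, label = [], None
    if words:
        entities.append((" ".join(words), label))
    return entities


def deduplicate(entities: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop exact duplicates while preserving first-seen order."""
    return list(dict.fromkeys(entities))


tokens = ["Worker", "fell", "while", "dismantling", "the", "tower", "crane"]
tags = ["O", "B-ACC", "O", "B-WORK", "O", "B-MAC", "I-MAC"]
print(deduplicate(merge_bio(tokens, tags)))
```

Treating a stray `I-` tag as the start of a new entity is a design choice for robustness against imperfect model output; a stricter variant could discard such tokens instead.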

A series of processes for extracting and storing risk-assessment knowledge using BERT were implemented in a web-based application, as shown in Figure 4. The extracted results were saved in the CSV format.

The risk-assessment knowledge extracted by BERT from the ten selected accident cases and saved in the CSV format was collectively used to build an entity-relationship-based knowledge graph in Neo4j. The constructed knowledge base comprised 59 nodes (7 WORK, 4 MAC, 3 TOOL, 2 MAT, 6 UNCON, 7 UNACT, 3 ACC, 2 EDU, and 25 ENF) and 81 edges (10 USE, 17 BE RELATED, 13 each of CAUSE–EFFECT and EFFECT–CAUSE, and 28 RECOMMEND).
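One way to turn such CSV rows into graph data is to generate parameterized Cypher `MERGE` statements, as in the sketch below. The CSV column layout, the sample rows, and the underscore spelling of edge types (for example `BE_RELATED`) are assumptions for illustration; the actual file layout and import procedure used in the study are not specified. Because Cypher does not allow labels or relationship types to be passed as parameters, they are checked against a whitelist before being inserted into the query text.

```python
import csv
import io

NODE_LABELS = {"WORK", "MAC", "TOOL", "MAT", "UNCON", "UNACT", "ACC", "ENG", "EDU", "ENF"}
EDGE_TYPES = {"PRECEDE", "USE", "BE_RELATED", "CAUSE_EFFECT", "EFFECT_CAUSE", "RECOMMEND"}

# Hypothetical CSV layout; the rows are illustrative only.
SAMPLE_CSV = """source_label,source,edge,target_label,target
WORK,Pouring,USE,MAC,Pump Car
TOOL,Boom and Hose,BE_RELATED,UNCON,Boom Breakage
UNCON,Boom Breakage,CAUSE_EFFECT,ACC,Struck
"""


def row_to_cypher(row: dict):
    """Return (query, parameters) that merge both nodes and the edge between them."""
    if row["source_label"] not in NODE_LABELS or row["target_label"] not in NODE_LABELS:
        raise ValueError(f"unknown node label in row: {row}")
    if row["edge"] not in EDGE_TYPES:
        raise ValueError(f"unknown edge type in row: {row}")
    query = (
        f"MERGE (a:{row['source_label']} {{name: $source}}) "
        f"MERGE (b:{row['target_label']} {{name: $target}}) "
        f"MERGE (a)-[:{row['edge']}]->(b)"
    )
    return query, {"source": row["source"], "target": row["target"]}


for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    query, params = row_to_cypher(row)
    print(query, params)
```

Using `MERGE` rather than `CREATE` means that a node extracted from several accident cases is stored only once, which matches the exclusion of duplicate nodes described for the knowledge base.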

5.2. Validation

The proposed BERT-based knowledge-extraction method was validated in terms of efficiency, usability, and convenience and satisfaction. First, efficiency was validated through a comparison with the traditional method, in which users or supervisors extract knowledge manually. This involved assessing the extraction speed and whether the necessary risk-assessment information could be extracted without omissions. Additionally, we verified that the extracted risk-assessment knowledge can be used to efficiently construct a knowledge base, which was the primary objective of the study. Second, usability was verified by using Neo4j’s Cypher queries and the Neo4j Bloom functionality to validate the knowledge search and extraction functions of the risk-assessment knowledge base. This allowed us to verify whether each query result accurately returned the information and relationships necessary for risk assessment. Third, convenience and satisfaction were evaluated through a survey of three safety supervisors (with 5, 14, and 20 years of experience, respectively), who were selected to represent a range of experience levels. The participants performed searches using the constructed knowledge graph and then compared the results with the traditional method of referencing cases in PDF, Excel, or document format, rating on a 5-point Likert scale whether the knowledge search function of the knowledge base was more efficient and user-friendly.

Efficiency: The proposed BERT model achieved a weighted-average F1-score of 0.57 (precision 0.57, recall 0.58; Table 6), indicating that it can automatically extract predefined risk-assessment knowledge from unstructured construction data regardless of format. Additionally, the model extracted the desired knowledge within 10 s, even from long, unstructured sentences.

Usability: Knowledge search is a representative use-case for a knowledge base. The knowledge graph allows users to search for specific nodes first and then for the edges connected to them, or vice versa. For example, after searching for a specific node such as ‘Accident (ACC)’, edges such as ‘EFFECT–CAUSE’ and ‘RECOMMEND’ can be followed; conversely, after searching for a specific edge such as ‘BE RELATED’, the connected hazard-object and hazard-factor nodes can be retrieved, allowing users to easily extract specific risk-assessment knowledge. An example of knowledge search is shown in Figure 3, wherein a search around the ‘Pouring (WORK)’ operation shows that it is connected by ‘USE’ edges to ‘Pump Car (MAC)’, ‘Boom and Hose (TOOL)’, and ‘Concrete (MAT)’. Among these, ‘Boom’ can lead to ‘Struck (ACC)’ owing to ‘Boom Breakage (UNCON)’, and ‘Concrete’ can cause ‘Collapsed (ACC)’ owing to ‘Formwork Collapse from Concrete Pressure (UNCON)’, illustrating the relationships and knowledge related to pouring operations. Additionally, recommendations related to specific accident outcomes and causes can be easily obtained through knowledge search. As shown in Figure 5, safety measures recommended for the ‘Struck (ACC)’ accident include pre-education (EDU) and equipment inspection before entry and safety review of lifting methods for heavy objects (ENF). Furthermore, by expanding specific nodes or edges, additional knowledge regarding accident causes and hazardous objects can be searched and extracted.
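The node-then-edge search described above can be illustrated with a small in-memory graph. The sketch below is not the Neo4j implementation used in the study: the triples are transcribed from the pouring example for illustration, the edge names use an underscore spelling, and the helper names are hypothetical. A Cypher pattern that would express the same traversal, assuming nodes carry a `name` property, is given as a comment.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Illustrative triples (source, edge, target) based on the pouring example.
TRIPLES = [
    ("Pouring", "USE", "Pump Car"),
    ("Pouring", "USE", "Boom and Hose"),
    ("Pouring", "USE", "Concrete"),
    ("Boom and Hose", "BE_RELATED", "Boom Breakage"),
    ("Concrete", "BE_RELATED", "Formwork Collapse from Concrete Pressure"),
    ("Boom Breakage", "CAUSE_EFFECT", "Struck"),
    ("Formwork Collapse from Concrete Pressure", "CAUSE_EFFECT", "Collapsed"),
    ("Struck", "RECOMMEND", "Pre-education"),
    ("Struck", "RECOMMEND", "Safety review of lifting methods for heavy objects"),
]

Index = Dict[Tuple[str, str], List[str]]


def build_index(triples) -> Index:
    """Index targets by (source node, edge type) for direct lookup."""
    index: Index = defaultdict(list)
    for source, edge, target in triples:
        index[(source, edge)].append(target)
    return index


def neighbors(index: Index, node: str, edge: str) -> List[str]:
    """Nodes reachable from `node` over one edge of the given type."""
    return index.get((node, edge), [])


def accident_chains(index: Index, work: str) -> List[Tuple[str, str, str]]:
    """Follow WORK -USE-> object -BE_RELATED-> factor -CAUSE_EFFECT-> accident.

    Roughly equivalent Cypher (property name is an assumption):
    MATCH (w:WORK {name: $work})-[:USE]->(o)-[:BE_RELATED]->(f)-[:CAUSE_EFFECT]->(a:ACC)
    RETURN o.name, f.name, a.name
    """
    chains = []
    for obj in neighbors(index, work, "USE"):
        for factor in neighbors(index, obj, "BE_RELATED"):
            for accident in neighbors(index, factor, "CAUSE_EFFECT"):
                chains.append((obj, factor, accident))
    return chains


index = build_index(TRIPLES)
print(accident_chains(index, "Pouring"))
print(neighbors(index, "Struck", "RECOMMEND"))
```

Starting from a work type yields the hazard objects, hazard factors, and accidents connected to it, and starting from an accident yields the recommended safety actions, which mirrors the two search directions described in the text.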

Convenience and satisfaction: As previously discussed, compared to the traditional method of manually searching for and extracting risk-assessment knowledge from unstructured data, the search function of the knowledge base enables convenient and consistent extraction of the desired knowledge and relationships through a simple search. The proposed BERT model can be used to automatically extract predefined knowledge from large amounts of unstructured data and use it in bulk to build a risk-assessment knowledge base. Additionally, even undefined entities or relationships can be easily discovered and updated through inference over the graph data. This method is efficient for continuously expanding and managing a knowledge base using unstructured data generated in the construction industry. Current safety resources provided by institutions such as the Korea Land and Housing Corporation or KOSHA are available in the Excel format or as web pages or PDF files, so users must manually search for and extract the desired information, which is cumbersome. Risk-assessment knowledge bases, particularly those focused on the relationships between the various elements of complex network structures, enable efficient definition and visualization of the knowledge, making it easy and convenient for users to understand.

To evaluate the expected effects of the proposed knowledge base, a survey was conducted with three safety supervisors who performed risk assessments. The survey utilized a 5-point Likert scale and assessed their agreement with the following statements: the knowledge base enables consistent knowledge search by any technician, allows inexperienced technicians to search for necessary knowledge without omissions, and offers the convenience of visualizing relationships between knowledge elements. In addition to using the knowledge base, the respondents were asked to compare their experience with the traditional method of reviewing PDF, Excel, or webpage-based accident case documents. All items scored above four points, suggesting that the proposed knowledge base improved the knowledge search process for supporting risk assessments. Although the small sample size limits the generalizability of the findings, the diversity in the respondents’ experience provided valuable qualitative insights. Overall, the respondents rated the search function of the knowledge base above the traditional knowledge-management systems used for risk assessments and the identification of hazardous factors. However, further research with a larger and more diverse sample is needed to validate these results comprehensively.

6. Conclusions

This paper presented a novel method for constructing a risk-assessment knowledge base for construction projects using KLUE-RoBERTa and graph models to efficiently extract and effectively utilize accident-risk knowledge from unstructured data. The main findings of this study can be summarized as follows. First, an NER-based BERT model that can automatically extract predefined risk-assessment knowledge from unstructured construction data using NLP technology was developed. Training data were constructed from 1154 construction-disaster incidents that occurred between 2010 and 2018 and were included in the casebooks provided by KOSHA. Artificially augmented data were generated using Google Translate-based back-translation and ChatGPT-4-based text-augmentation techniques. The BERT model was fine-tuned on the resulting 88,481 training data entries (original and augmented) and achieved a weighted-average F1-score of 0.57. Second, an entity-relationship-based risk-assessment knowledge base was designed using the extracted data. It comprised 10 types of nodes connected by six types of edges. Third, the proposed BERT model was used to automatically extract risk-assessment knowledge from data not used for training, namely ten accident cases that occurred between 2021 and 2022 and were obtained from the accident casebooks provided by KALIS. A risk-assessment knowledge base comprising 59 nodes and 81 edges was constructed. Through this process, methods for utilizing the search and management functions of the knowledge base were proposed, and the efficiency and effectiveness of the proposed method were verified through comparisons with traditional knowledge-extraction and -management methods. Additionally, the effectiveness of the search function was verified through a survey, wherein it obtained an average expected effect score of 4.44 on a 5-point Likert scale.

The use of NLP and knowledge bases, which have not been widely employed in construction research, to realize knowledge management for risk assessments can expand the scope of future research in related fields. The proposed methodology departs from traditional construction-safety knowledge-management methods. The analysis of unstructured data from the construction field need not be limited to risk-assessment knowledge; it can also be expanded to knowledge related to contracts, design, and construction over the entire construction cycle. This would allow each piece of knowledge to be organically connected, enabling more accurate predictions and proactive safety management throughout the construction cycle. Although manuals or casebooks are available for risk assessments, inconsistencies in names, classifications, and formats can lead to repetitive searches for the same cases. Additionally, detailed accident descriptions in casebooks require users to manually search for the desired knowledge, which is inconvenient and cumbersome.

The proposed method aims to support inexperienced technicians in identifying hazard factors more effectively, without requiring them to perform risk assessments or determine safety measures on their own. While the use of this knowledge base can assist users with varying levels of expertise, the method is designed to complement, not replace, the role of experienced supervisors and the required consultation with workers and site-specific conditions.

In the proposed risk-assessment knowledge base, the elements typically included in existing construction-accident or risk-assessment data are defined by segmenting and unifying them into risk-assessment knowledge. This ensures the discernment, suitability, and relevance of the knowledge that can be extracted from unstructured data in the construction industry. Furthermore, traditional methods have limitations such as omission of hazard factors or deviations based on the capabilities, experience, or subjective judgment of the business owner or supervisors, as well as the characteristics and environment of each site. By utilizing the proposed knowledge base, relevant knowledge centered on keywords can be easily obtained without omissions, regardless of user knowledge or experience. Additionally, by automatically extracting knowledge for construction-project risk assessments and constructing an entity-relationship-based risk-assessment knowledge base, the proposed method can be used for efficient management of knowledge while ensuring scalability and sustainability.

The limitations and future research directions for this study are as follows.

First, the knowledge graph constructed in this study is centered on common hazard factors such as the materials, tools, and equipment used in the work process. However, project-specific variables, such as the location of the building, the characteristics of the project, and the limitations of available manpower and equipment, can also influence the risks. Therefore, future research needs to incorporate these factors.

Second, the proposed BERT model, trained on predefined key factors related to construction accidents and risk assessments, heavily relied on data from construction safety and accident casebooks. This reliance may have introduced biases or gaps, limiting the broader applicability of the model to new or unseen types of construction-safety data. Consequently, its generalizability and accuracy may decrease when applied to different datasets outside its original training scope. To mitigate this, future research should focus on continuously expanding the vocabulary lexicon and developing algorithms capable of performing named entity normalization across various data types. Additionally, incorporating more diverse construction safety data could reduce biases and improve generalizability.

Finally, owing to the characteristics of construction accidents, only a limited number of training samples are available, which constrains the classification performance of the model. This limitation likely affects key performance metrics, such as precision, recall, and F1-score, which could be improved with more comprehensive data. Developing a high-performance model requires collecting as much data as possible, developing related rules, and performing processing tasks such as tagging or transformation. However, this requires a significant amount of time and effort and is highly inefficient. To address this, efficient data-augmentation techniques such as oversampling, synthetic data generation, and advanced preprocessing methods should be explored.

Author Contributions

Conceptualization, S.L.; Methodology, S.L.; Software, W.L.; Validation, W.L. and S.L.; Formal analysis, W.L.; Data curation, W.L.; Writing—original draft preparation, W.L.; Writing—review and editing, S.L.; Project administration, S.L.; Funding acquisition, S.L. All authors have read and agreed to the published version of the manuscript.

Funding

The present research has been conducted with the Research Grant of Kwangwoon University in 2024.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Hinze, J. Construction Safety; Prentice-Hall: Upper Saddle River, NJ, USA, 2006.
2. Ministry of Employment and Labor (MOEL). 2023 New Risk Assessment Guide. Available online: https://www.moel.go.kr/policy/policydata/view.do?bbs_seq=20230501085 (accessed on 17 September 2024).
3. Coble, R.J.; Hinze, J.; Haupt, T. Construction Safety and Health Management; Prentice Hall: Upper Saddle River, NJ, USA, 2000.
4. Lingard, H.; Rowlinson, S. Occupational Health and Safety in Construction Project Management; Spon Press: London, UK, 2005.
5. Gambatese, J.A.; Behm, M.; Rajendran, S. Design’s role in construction accident causality and prevention: Perspectives from an expert panel. Saf. Sci. 2008, 46, 675–691.
6. Hallowell, M.R.; Gambatese, J.A. Activity-based safety risk quantification for concrete formwork construction. J. Constr. Eng. Manag. 2009, 135, 990–998.
7. Teizer, J.; Cheng, T.; Fang, Y. Location tracking and data visualization technology to advance construction ironworkers’ education and training in safety and productivity. Autom. Constr. 2013, 35, 53–68.
8. Zhou, Z.; Goh, Y.M.; Li, Q. Overview and analysis of safety management studies in the construction industry. Saf. Sci. 2015, 72, 337–350.
9. Zhao, D.; Lucas, J. Virtual reality simulation for construction safety promotion. Int. J. Inj. Control Saf. Promot. 2015, 22, 57–67.
10. Liu, K.; El-Gohary, N. Ontology-based Semi-supervised Conditional Random Fields for Automated Information Extraction from Bridge Inspection Reports. Autom. Constr. 2017, 81, 313–327.
11. Lee, J. Development of a Risk Extraction Model for Overseas Construction Contracts Through Natural Language Processing (NLP). Master’s Thesis, Ewha Womans University, Seoul, Republic of Korea, 2018.
12. Mohemad, R.; Hamdan, A.R.; Othman, Z.A.; Noor, N.M.M. Ontological-based Information Extraction of Construction Tender Documents. In Proceedings of the 7th Atlantic Web Intelligence Conference, Fribourg, Switzerland, 26–28 January 2011; Springer: Berlin/Heidelberg, Germany, 2011.
13. Moon, S.; Lee, G.; Chi, S. Automated System for Construction Specification Review Using Natural Language Processing. Adv. Eng. Inform. 2022, 51, 101495.
14. Tixier, A.J.-P.; Hallowell, M.R.; Rajagopalan, B.; Bowman, D. Automated Content Analysis for Construction Safety: A Natural Language Processing System to Extract Precursors and Outcomes from Unstructured Injury Reports. Autom. Constr. 2016, 62, 45–56.
15. Yeung, C.-L.; Cheung, C.-F.; Wang, W.-M.; Tsui, E. A Knowledge Extraction and Representation System for Narrative Analysis in the Construction Industry. Expert Syst. Appl. 2014, 41, 5710–5722.
16. Zhang, J.; El-Gohary, N.M. Information Transformation and Automated Reasoning for Automated Compliance Checking in Construction. In Proceedings of the Computing in Civil Engineering, Los Angeles, CA, USA, 23–25 June 2013.
17. Manuele, F.A. Advanced Safety Management Focusing on Z10 and Serious Injury Prevention; Wiley: Hoboken, NJ, USA, 2008.
18. Harms-Ringdahl, L. Safety Analysis: Principles and Practice in Occupational Safety; Routledge: London, UK, 2013.
19. Kleiner, B.M.; DeJoy, D.M. Safety management systems and methods in high-hazard industries: A review of the literature. Saf. Sci. 2015, 77, 20–27.
20. KRAS. Korea Risk Assessment System. Available online: https://kras.kosha.or.kr/ (accessed on 17 September 2024).
21. MOLIT; KARIS. Construction Accident Casebook; KARIS: Jinju, Republic of Korea, 2022.
22. CSI. Available online: https://www.csi.go.kr/intro.do (accessed on 17 September 2024).
23. Simoff, S.; Maher, M.L. Ontology-based Multimedia Data Mining for Design Information Retrieval. In Proceedings of the ACSE Computing Congress, Boston, MA, USA, 18 October 1998.
24. Wang, Q.; Mao, Z.; Wang, B.; Guo, L. Knowledge Graph Embedding: A Survey of Approaches and Applications. IEEE Trans. Knowl. Data Eng. 2017, 29, 2724–2743.
25. Hogan, A.; Blomqvist, E.; Cochez, M.; D’Amato, C.; de Melo, G.; Gutierrez, C.; Kirrane, S.; Labra Gayo, J.E.; Navigli, R.; Neumaier, S.; et al. Knowledge Graphs. ACM Comput. Surv. 2021, 54, 1–37.
26. Ji, S.; Pan, S.; Cambria, E.; Marttinen, P.; Yu, P.S. A Survey on Knowledge Graphs: Representation, Acquisition, and Applications. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 494–514.
27. Kim, J.; Roh, K. Knowledge Base Construction Based on Graph Database for Career Guidance Using Public Data. J. Korean Inst. Electron. Commun. Sci. 2019, 56, 41–48.
28. Park, M. Development of a Plant Safety Operation Procedure Generation System Based on Process Knowledge Extraction and Knowledge Base Construction. In Proceedings of the Korean Gas Association Conference, Jeju, Republic of Korea, 26–27 May 2022.
29. Kim, W.; Lee, J.; Ahn, S. Development of Integrated Knowledge Graph for Construction Safety Guidelines—Knowledge Graph Development and Knowledge Search Using Python and Neo4j. In Proceedings of the Korea Institute of Construction Engineering and Management, Seoul, Republic of Korea, 18 November 2022.
30. Ryu, M.; Cha, S. Knowledge Graph Construction Based on Ontology Learning. J. Korean Assoc. Comput. Educ. 2022, 25, 51–57.
31. Payal, C.; Huang, K.; Zitnik, M. Building a Knowledge Graph to Enable Precision Medicine. Sci. Data 2023, 10, 1–16.
32. Peng, F.-L.; Qiao, Y.-K.; Yang, C. Building a Knowledge Graph for Operational Hazard Management of Utility Tunnels. Expert Syst. Appl. 2023, 223, 119901.
33. Lin, F.-R.; Yu, J.-H. Visualized Cognitive Knowledge Map Integration for P2P Networks. Decis. Support Syst. 2009, 46, 774–785.
34. Rao, L.; Mansingh, G.; Osei-Bryson, K.-M. Building Ontology Based Knowledge Maps to Assist Business Process Re-engineering. Decis. Support Syst. 2012, 52, 577–589.
35. Yoo, K. Keyword-based Networked Knowledge Map Expressing Content Relevance Between Knowledge. J. Intell. Inf. Syst. 2018, 24, 119–134.
36. Suh, J.; Lee, H. Technology Trends and Application Cases of Graph DB. Tech. Rep. (ITFind), Institute of Information and Communications Technology Planning and Evaluation, 2020.02.05. Available online: https://www.itfind.or.kr/streamdocs/view/sd;streamdocsId=Ls1g7yoLO95IbXRWAJSFPd5Pv-qI52IQq0Yv62zc4pw (accessed on 19 September 2024).
37. Lee, S. Construction of a Knowledge Graph and Automated Pipeline for Medical Opinion on Health Checkups Based on Deep Learning. Master’s Thesis, Yonsei University, Seoul, Republic of Korea, 2022.
38. Yoo, S.Y.; Jeong, O.R. An Intelligent Chatbot Utilizing BERT Model and Knowledge Graph. J. Soc. e-Bus. Stud. 2019, 24, 87–98.
39. Park, B. Visualization of BCG Matrix Through BERT-Based Knowledge Graph Generation from Unstructured Text. Master’s Thesis, Hansung University Graduate School of Knowledge Service & Consulting, Seoul, Republic of Korea, 2021.
40. Kim, S.; Kang, Y.; Seok, J. BERT-based Relation Extraction Model and Knowledge Graph for Economic Knowledge Analysis. J. Next Gener. Comput. 2022, 18, 7–20.
41. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
42. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805.
43. Baton, J.; Van Bruggen, R. Learning Neo4j 3.x: Effective Data Modeling, Performance Tuning and Data Visualization Techniques in Neo4j; Packt Publishing Ltd.: Birmingham, UK, 2017.
44. MOLIT. Research Report on the Development of Construction Project Hazard Factor Profiles; MOLIT: Sejong, Republic of Korea, 2014.

Figure 1. Materials and Methods of Research.

Figure 2. Conceptual framework of the process for building the risk-assessment knowledge base.

Figure 3. Proposed graph-based data model.

Figure 4. Example of knowledge search from the proposed risk-assessment knowledge base.

Figure 5. Another example of knowledge search from the proposed risk-assessment knowledge base.

Table 1. Literature Review Results on Knowledge-Base Development.

Authors	Target	Included Information	Extraction Target	Extraction Method
Kim and Roh [27]	University students	Job knowledge and skills, performance, certifications, training elements, etc.	Public data	API
Park [28]	Plant operators	Operational risks and safety-related plant operation manuals	Technical documents of the process plant	Tesseract package algorithm of the OCR system
Kim et al. [29]	Safety supervisors	Construction safety guidelines	Public data	TF-IDF and SBERT
Ryu and Cha [30]	Unspecified	Plain text knowledge	Data from search engines such as Naver and Google	Simple extraction
Payal et al. [31]	Primary care physicians	Biomedical knowledge and health information	Existing knowledge bases (DisGeNet, OrphaNet, Drug Central SQL DB, etc.) and disease ontology	Beautiful Soup, GOATOOLS, and other search packages
Peng et al. [32]	Tunnel engineers	Tunnel-operation risk-management information, including operational risks, management items, defect types, risk-control measures	Existing knowledge bases and unstructured data	NER using field mapping and machine learning

Table 2. Key information for construction-accident risk assessment.

Category		Node	KRAS [20]	LH	MOLIT [44]	MOEL [2]	KALIS [22]
Work		WORK	◎	◎	◎	◎	◎
Hazard Object	Machine	MAC	◎	◎	◎	◎	◎
	Tool	TOOL	◎	◎	◎
	Material	MAT	◎	-	◎
Hazard Factor	Unsafe Condition	UNCON	◎	◎	◎
Hazard Factor	Unsafe Act	UNACT	◎	◎	◎
Accident		ACC	-	-	◎	◎	◎
Safety Action	Engineering	ENG	◎	◎	◎	◎	◎
	Education	EDU			◎
	Enforcement	ENF			◎

Table 3. Definitions of the nodes and their properties employed in the proposed risk-assessment knowledge base.

Node	Definition	Property
WORK	Tasks or work such as civil engineering and construction work	Installation, dismantling, assembly, transportation, welding, etc.
MAC	A machine is a dynamic device composed of multiple parts that performs certain relative motions to achieve a specific function or work (machine used in work)	Pile driver, tower crane, excavator, forklift, etc.
TOOL	A tool used for creating objects and performing tasks in various types of work (tools used in work)	Formwork, cutter, grinder, workbench, hammer, etc.
MAT	Materials without fixed geometric shapes are used in various types of work (materials used in work)	Concrete, rebar, gypsum board, insulation, etc.
UNCON	Unsafe conditions of machinery and equipment as direct causes of accidents	Leaving the workbench, damage to support parts, unsecured covers, etc.
UNACT	Unsafe actions of workers as direct causes of accidents	Not wearing safety protective equipment, tripping, moving to an external scaffold, etc.
ACC	Accident details where significant personal injury or property damage occurs during construction	Struck by object, collision, falling, being crushed, etc.
ENG	Improvement of equipment environment and working methods	Anti-collision devices, fire-detection devices, etc.
EDU	Implementation of safety education and training	Safety education, regular training, etc.
ENF	Institutional enforcement by strict rules	Safety measures during work, preventive measures, wearing personal protective equipment, etc.

Table 4. Definitions of the edges employed in the proposed risk-assessment knowledge base.

Edge	Definition
PRECEDE	n1 (WORK) precedes another n1 (WORK). e.g., ‘Transportation’ precedes ‘Installation’.
USE	n1 (WORK) uses n2 (MAC, TOOL, MAT). e.g., ‘Installation’ uses a ‘Tower crane’.
BE RELATED	n1 (WORK, MAC, TOOL, MAT) is related to n2 (UNCON, UNACT). e.g., ‘Installation work’ is related to ‘Tripping during movement’. ‘Pump car’ is related to ‘Boom breakage’.
CAUSE–EFFECT	n1 (UNCON, UNACT) causes n2 (ACC). e.g., ‘Not wearing personal protective equipment’ causes a ‘Fall’.
EFFECT–CAUSE	n1 (ACC) is caused by n2 (UNCON, UNACT). e.g., a ‘Fall’ is caused by ‘Not wearing personal protective equipment’. ‘Struck by object’ is caused by ‘Defects in safety and protective devices’.
RECOMMEND	n1 (ACC) is recommended to implement n2 (ENG, EDU, ENF). e.g., to prevent fall accidents, it is recommended to conduct ‘Safety measures during work’ and ‘Safety education’.

The text within parentheses represents the node names that can occupy that position.
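The node-type constraints in Table 4 can be expressed as a small lookup that checks whether a proposed edge connects permitted node types. This is a minimal sketch: the dictionary mirrors the table, while the underscore spelling of edge names and the function name `is_valid_edge` are assumptions for illustration.

```python
# Allowed (source node types, target node types) per edge, following Table 4.
EDGE_SCHEMA = {
    "PRECEDE": ({"WORK"}, {"WORK"}),
    "USE": ({"WORK"}, {"MAC", "TOOL", "MAT"}),
    "BE_RELATED": ({"WORK", "MAC", "TOOL", "MAT"}, {"UNCON", "UNACT"}),
    "CAUSE_EFFECT": ({"UNCON", "UNACT"}, {"ACC"}),
    "EFFECT_CAUSE": ({"ACC"}, {"UNCON", "UNACT"}),
    "RECOMMEND": ({"ACC"}, {"ENG", "EDU", "ENF"}),
}


def is_valid_edge(edge: str, source_label: str, target_label: str) -> bool:
    """Return True if the edge type may connect the given source and target node types."""
    if edge not in EDGE_SCHEMA:
        return False
    sources, targets = EDGE_SCHEMA[edge]
    return source_label in sources and target_label in targets


print(is_valid_edge("USE", "WORK", "MAC"))       # 'Installation' uses a 'Tower crane'
print(is_valid_edge("RECOMMEND", "ACC", "EDU"))  # accident -> safety education
print(is_valid_edge("USE", "MAC", "WORK"))       # direction reversed, not permitted
```

A check of this kind could be applied before edges are written to the graph so that extraction errors do not create relationships outside the defined model.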

Table 5. Risk-assessment training data used for fine-tuning.

Node	Original Data	Augmented Data	Total
WORK	1027	4762	5789
MAC	340	2783	3123
TOOL	467	3352	3819
MAT	578	3979	4557
UNCON	594	13,309	13,903
UNACT	335	5444	5779
ACC	1189	4797	5986
ENG	36	351	387
EDU	40	581	621
ENF	2650	41,867	44,517
Total	7256	81,225	88,481

Table 6. Performance of the BERT model for each node.

Node	Precision	Recall	F1-Score	Support
WORK	0.75	0.74	0.74	704
MAC	0.63	0.72	0.67	307
TOOL	0.55	0.49	0.52	464
MAT	0.51	0.50	0.50	521
UNCON	0.45	0.49	0.47	452
UNACT	0.39	0.49	0.44	215
ACC	0.84	0.77	0.80	987
ENG	0.40	0.13	0.19	32
EDU	0.27	0.30	0.28	27
ENF	0.42	0.48	0.45	1917
Weighted avg.	0.57	0.58	0.57	5623

Table 7. Characteristics of the risk-assessment knowledge base constructed using original data.

Node	Count	Edge	Count
WORK	90	USE	1246
MAC	96	BE RELATED	1901
TOOL	137	CAUSE–EFFECT	955
MAT	316	EFFECT–CAUSE	955
UNCON	585	RECOMMEND	2821
UNACT	215
ACC	56
ENG	30
EDU	36
ENF	1568
Total	3129	Total	7878

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
