1. Introduction
Mine ventilation systems are essential for maintaining a safe underground environment, boosting productivity, and safeguarding worker health, making them a critical component of mine operations [1]. Inefficient ventilation systems can contribute to hazardous situations, such as gas accumulation, oxygen deficiency, and poor air quality, all of which are major contributors to mining accidents [2]. For instance, the 2002 Shanxi coal mine explosion in China, which claimed the lives of over 200 miners, was caused by inadequate ventilation, which led to the accumulation of explosive gases. In contrast, the 2010 Copiapó mine disaster in Chile underscored the critical role of the ventilation system. When 33 miners were trapped for 69 days, the mine’s ventilation system ensured a continuous supply of breathable air, preventing oxygen deficiency and sustaining the miners underground, ultimately contributing to their successful rescue. Consequently, the efficient management and performance enhancement of mine ventilation systems have consistently been a focal point of interest for both industry and academia. Prior studies include reviews of ventilation monitoring and control technologies [3], the development of quantitative assessment methods for airflow stability [4], and dynamic ventilation modeling that incorporates thermal and gas pollutant simulations [5]. Other work has proposed ventilation management procedures to ensure regulatory compliance [6], ventilation-on-demand technologies [7], and the optimization of oxygen supply through computational fluid dynamics [8]. Furthermore, the study in [9] explored the integration of real-time data monitoring with predictive analytics to improve system reliability. These efforts all aim to enhance the safety and economic efficiency of mine ventilation systems. However, most of them have concentrated on optimizing ventilation parameters and system configurations, with relatively little attention paid to the comprehensive management and analysis of accident-related data within ventilation systems.
Traditionally, the safety management of mine ventilation systems relies on manual inspection records, spreadsheet-based documentation, and relational database systems. These conventional methods typically focus on monitoring key parameters, such as airflow volume, gas concentration, and equipment status [10,11]. However, they face significant limitations when handling heterogeneous, complex, and dynamically evolving safety data. Specifically, rigid data structures, limited scalability, and reliance on manual analysis make it challenging to integrate multi-source information and uncover implicit relationships among personnel, equipment, and environmental factors. These limitations can result in inefficiencies in accident analysis, knowledge discovery, and decision-making processes.
As a novel database architecture, knowledge graphs organize complex information in a structured manner and uncover intricate relationships within textual data [12,13,14]. They have demonstrated wide applicability across various industries, including, but not limited to, smart factories, smart mines, and the digital transformation of healthcare and finance [15]. Liu et al. [16] explored the application of knowledge graphs within risk pre-control management systems, demonstrating how knowledge graphs can improve risk identification and decision-making in high-risk industries like underground coal mining. Liu et al. [17] also investigated the use of knowledge graphs for enhancing situation awareness and decision-making in industrial control systems, highlighting the benefits of this technology in improving system security and operational efficiency. Zhang et al. [18] used knowledge graph technology to establish a knowledge graph system for coal mine equipment maintenance, addressing the need for a question-and-answer system in this domain. Jiang et al. [19] constructed a medical question-answering system based on knowledge graphs, which improves the accuracy and efficiency of responses to medical questions through entity linking, intent classification, and question–answer matching. Experimental validation demonstrated its effectiveness and practicality. Hu et al. [20] studied the storage and analysis of building fire cases using knowledge graph technology and concluded that this approach could significantly improve the display and utilization of fire information compared with traditional text-based storage, thus enhancing the effectiveness of fire accident analysis. Liu et al. [21] proposed a Chinese mineral question-and-answer system based on a mineral knowledge graph to retrieve mineral entities and relationships. The system achieved an accuracy rate of 91.2% on 2000 test questions. Pei et al. [22] constructed a deposit knowledge graph based on raw text data from different gold deposits in a gold belt, and they proposed methods for the visual construction of the deposit knowledge graph and for calculating the similarity between deposits based on their metallogenic geological characteristics.
Neo4j, a NoSQL graph database system, enables the construction of highly structured graph databases by defining nodes, relationships, and properties [23]. It is widely used for building and storing knowledge graphs. Saad et al. [24] developed a semantic graph database for Life Cycle Inventory using Neo4j (version 5.2.0), which provides a versatile, queryable, scalable, intuitive, and interchangeable data format. Tuck [25] integrated research data on non-small cell lung cancer and constructed a graph database to study genomic variations and their relationships with clinical outcomes. Compared with traditional safety management methods based on manual data analysis, spreadsheets, or relational databases, knowledge graph technology offers several key advantages [26,27]. First, it allows for the semantic representation of unstructured and multi-source data, overcoming the rigidity and schema limitations of relational databases [28,29]. Second, the graph-based structure of Neo4j enables the explicit modeling of complex relationships among personnel, equipment, environmental factors, and procedures, which are often implicit or overlooked in conventional systems [30]. Third, knowledge graphs support real-time updates and dynamic expansion, making them well suited to managing evolving safety information [31]. These characteristics make Neo4j-based knowledge graphs a robust solution for addressing the limitations of traditional approaches in mine ventilation systems. Furthermore, Neo4j was selected over other available graph database solutions (such as JanusGraph, OrientDB, or GraphDB) due to its balanced combination of querying efficiency, ease of use, stable performance, and robust visualization capabilities [18,32]. Its Cypher query language allows complex queries to be written in a flexible and readable form, supporting both deep correlation analysis and traceability analysis. These features align well with the specific scale, complexity, and operational requirements of mine ventilation system safety data, ensuring practical applicability and system scalability.
To enhance the integration of textual data related to mine ventilation systems, we applied knowledge graph technology and the Neo4j graph database to the safety management of these systems. By collecting and organizing relevant accident records, we constructed a knowledge graph for mine ventilation system safety, transforming unstructured data into structured formats. This provided essential structured data for the safe management of mine ventilation systems.
2. Construction of the Knowledge Graph for Mine Ventilation System Safety
2.1. Construction of the Knowledge Graph Framework
A knowledge graph, functioning as a structured semantic network, efficiently represents, organizes, and utilizes entities, concepts, and their interrelationships in the real world. The fundamental unit of a knowledge graph is the triple, which is typically structured as <Entity1, Relationship, Entity2> and is used to represent the relationship between two entities [33]. Entities are the basic building blocks of a triple, representing “objects” or “concepts” in the real world. Each entity typically has a unique identifier and a set of attributes. Relationships connect entities, define the type of association or interaction between them, and are generally directed. Attributes are data that describe the characteristics of entities or relationships, providing additional information.
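The triple structure described above can be sketched as a small data model. This is only an illustration: the class names, field names, and example values below are hypothetical and are not taken from the study's implementation; the relationship label "Relatype" is the one used in Section 2.3.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Entity:
    """A node: unique identifier, label (category), name, optional attributes."""
    uid: str
    label: str
    name: str
    attributes: Tuple[Tuple[str, str], ...] = ()


@dataclass(frozen=True)
class Triple:
    """A directed edge <head, relation, tail> between two entities."""
    head: Entity
    relation: str
    tail: Entity

    def as_tuple(self) -> Tuple[str, str, str]:
        return (self.head.name, self.relation, self.tail.name)


# Hypothetical example values, not records from the dataset.
accident = Entity("e1", "Accident Name", "Example gas explosion accident",
                  attributes=(("source", "illustrative"),))
accident_type = Entity("e2", "Accident Type", "Gas explosion")
triple = Triple(accident, "Relatype", accident_type)
print(triple.as_tuple())
```

Making both classes immutable lets triples be placed in sets, which is convenient for de-duplication before storage.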
The basic process of constructing a knowledge graph is as follows: First, an appropriate method is selected to build the knowledge graph framework based on the specific application scenario. Subsequently, relevant data are collected and processed, and knowledge elements such as entities and relationships are extracted using knowledge extraction techniques and represented as triples. Finally, the knowledge graph is constructed from the triple-formatted data. A variety of applications, such as semantic searches, question–answering systems, and recommendation systems, can be developed based on knowledge graphs. The knowledge graph can also be continuously optimized and updated as it is used in practical applications [34]. The specific process is shown in Figure 1.
The construction of a knowledge graph primarily involves two main approaches: top–down and bottom–up [35]. The top–down approach begins with the highest-level concepts and incrementally refines them to establish a well-structured taxonomic hierarchy. Entities are then added to the knowledge base according to these predefined patterns. This method is often suitable for building knowledge graphs in specific or vertical domains, as it effectively represents the hierarchical relationships between concepts. However, designing and maintaining a comprehensive ontology upfront requires extensive domain expertise, which can be time-consuming and inflexible, especially when dealing with dynamic or heterogeneous data sources.
In contrast, the bottom–up approach starts with the data themselves, gradually refining and expanding the graph to ensure accuracy and flexibility. Given that the raw data collected in this study consist of unstructured accident reports, characterized by diverse formats, incomplete information, and the lack of predefined data models, the top–down approach presents significant challenges in modeling. The bottom–up approach, however, offers several advantages in this context: (1) it accommodates the heterogeneity and variability of the accident text data without requiring a rigid predefined schema; (2) it enables the automatic discovery and validation of relationships and patterns inherent in the data, which are crucial for accurately capturing the complex interactions among personnel, equipment, environment, and procedures in mine ventilation systems; and (3) it reduces the reliance on extensive manual input from domain experts, thus lowering modeling and maintenance costs. Therefore, to ensure flexibility, scalability, and the ability to handle the inherent complexity of the data, the bottom–up approach is adopted for constructing the mine ventilation system safety knowledge graph.
2.2. Data Sources
The construction of the mine ventilation system safety knowledge graph is primarily grounded in historical accident data related to mine ventilation systems, specifically those occurring during mining operations as a result of unsafe conditions of the mine ventilation system equipment or unsafe behaviors of employees. Based on the causes of accidents, incidents related to mine ventilation systems can be categorized into three main types:
(1) Accidents involving gas, coal dust, or asphyxiation that occurred due to the failure to activate or the improper configuration of ventilation facilities, such as air ducts and auxiliary fans, which prevented the ventilation system from functioning properly.
(2) Accidents occurring during maintenance work on ventilation equipment, where unsafe behavior by workers led to incidents such as being struck by objects or suffering mechanical injuries.
(3) Accidents resulting from mechanical failure or aging of ventilation equipment, leading to mechanical injuries.
The accident data were primarily obtained from the Safety Management Network (https://www.safehoo.com/), a platform dedicated to collecting safety-related information, including accident cases from various industries, commonly used safety education materials, and the latest safety policies. The raw accident data were crawled using Octoparse 8 software. A full-text search using the keyword ‘ventilation’ retrieved a total of 404 accident reports as raw data. After manual review against the defined classification criteria, only reports in which the mine ventilation system failed or was a contributing factor were retained, resulting in 202 complete accident reports. It is important to note that the dataset may exhibit potential biases due to its reliance on publicly available online sources. Specifically, limitations in temporal coverage, geographic distribution, and data completeness may exist, as the dataset reflects records published on the website during the collection period. Despite these constraints, the dataset provides a representative basis for analyzing ventilation system safety issues, and future work will focus on expanding data sources to enhance coverage and reduce potential biases.
Most accident reports are comprehensive investigation reports characterized by a clear structure, including six main sections: an overview of the situation, basic information about the accident unit, a description of the accident’s occurrence and rescue process, an analysis of the accident’s causes and nature, recommendations for handling the personnel and units responsible for the accident, and suggestions for accident prevention and corrective measures.
2.3. Definition of Entity Labels and Relationship Labels
Based on the basic structure of the raw accident text data and the characteristics of accident analysis, the entity labels were divided into the following three parts, with a total of twenty entity label categories defined (Figure 2):
Basic Information: This section provides an overview of the situation and basic information about the accident unit from the raw accident text, clearly reflecting the general circumstances of the accident. The defined entity labels include the accident name, type, time, location, mine involved, casualties, and economic losses.
Specific Causes: This section focuses on the causes and nature of the accident, as described in the original text, emphasizing unsafe behaviors of individuals and unsafe conditions of objects. The defined entity labels include direct cause, indirect cause, spatial location, personnel, work environment, equipment, ventilation equipment, process phenomena, and work tasks.
Recommendations and Preventive Measures: This section pertains to the recommendations for handling responsible personnel and units, as well as accident prevention and rectification suggestions in the original text. The defined entity labels include penalized personnel, penalties, preventive measures, and regulations and standards.
Following the definition of entity labels, the relationship labels were also defined and classified into the following two types, resulting in a total of sixteen relationship label types (Figure 2):
Relationships Centered on the Accident Name (Part1): This type focuses on the accident name as the central entity, establishing relationships with other entity labels related to the basic information of the accident. For example, in the triple <Accident Name, Relatype, Accident Type>, “Accident Name” and “Accident Type” represent entity labels, while “Relatype” denotes the relationship type connecting these two entities. This structure follows the standard <Entity1, Relationship, Entity2> format used in knowledge graph representation.
Relationships Based on Specific Causes (Part2): This category involves relationships typically associated with multiple factors, such as personnel, equipment, environment, and specific locations. The following four types of relationship labels are defined: “cause”, “relate to”, “send out”, and “use”.
Cause: This label denotes a direct or indirect causal relationship between two entities. For example, the entities Direct Reason, Indirect Reason, Environment, and Exact Place may act as causal factors leading to a particular phenomenon. It captures how specific environmental conditions, work procedures, or locations contribute to the occurrence of incidents.
Relate to: This label captures associative relationships without implying direct causality. In the diagram, entities such as Person1, Rule, and Phenomenon are connected through the “relate to” relationship, indicating that, while they influence each other, they may not be the direct cause of an incident. It helps illustrate the broader operational context.
Send out: This label describes cases where an entity actively emits, triggers, or generates a phenomenon. For instance, Person1 and Equipment may “send out” a specific phenomenon, such as a hazardous condition or observable event, emphasizing their role in initiating certain outcomes.
Use: This label represents functional usage relationships. Specifically, Person1 may “use” entities such as Ventilation or Equipment, highlighting how human interaction with tools or facilities impacts safety and operations.
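The four Part2 relationship labels can be read as a small schema that constrains which entity labels may appear at each end of a triple. The sketch below encodes only the head and tail labels named in the descriptions above; Figure 2 may define additional pairs. Treating "relate to" as valid in either direction among Person1, Rule, and Phenomenon is an assumption, and the names `PART2_SCHEMA` and `is_valid` are hypothetical.

```python
from itertools import permutations

# (head label, tail label) pairs read from the prose; Figure 2 may define more.
PART2_SCHEMA = {
    "cause": {(head, "Phenomenon") for head in
              ("Direct Reason", "Indirect Reason", "Environment", "Exact Place")},
    "send out": {("Person1", "Phenomenon"), ("Equipment", "Phenomenon")},
    "use": {("Person1", "Ventilation"), ("Person1", "Equipment")},
    # Assumption: "relate to" may run in either direction among these labels.
    "relate to": set(permutations(("Person1", "Rule", "Phenomenon"), 2)),
}


def is_valid(head_label: str, relation: str, tail_label: str) -> bool:
    """Return True if the triple's labels match the schema for this relation."""
    return (head_label, tail_label) in PART2_SCHEMA.get(relation, set())


print(is_valid("Person1", "use", "Ventilation"))
print(is_valid("Equipment", "use", "Person1"))
```

A check of this kind could be used to flag annotation errors before the triples are stored.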
2.4. Data Preprocessing and Labeling
First, the 202 retained accident text records were pre-processed. A Python script removed unnecessary characters, spaces, and line breaks from the texts. Content irrelevant to knowledge extraction, such as section or paragraph headings and descriptions of emergency response procedures, was also deleted. The focus was placed on preserving three key sections: the detailed accident description, the accident occurrence process, and consequence management.
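A minimal sketch of such a cleaning step is given below. It is not the script used in the study: the function name `clean_report`, the heading pattern, and the sample text are hypothetical, and removing all whitespace assumes Chinese-language text in which spaces carry no meaning.

```python
import re

# Control and zero-width characters that often survive web crawling.
_CONTROL = re.compile(r"[\u0000-\u0008\u000b\u000c\u000e-\u001f\u200b\ufeff]")
_WHITESPACE = re.compile(r"\s+")


def clean_report(text, drop_lines=()):
    """Strip stray characters, drop unwanted lines, and join the rest.

    drop_lines is an iterable of compiled regular expressions; any line
    matching one of them (for example, a section heading) is discarded.
    """
    kept = []
    for line in text.splitlines():
        line = _CONTROL.sub("", line).strip()
        if not line or any(p.search(line) for p in drop_lines):
            continue
        kept.append(_WHITESPACE.sub("", line))  # assumption: spaces are noise
    return "".join(kept)


# Hypothetical heading pattern: a Chinese numeral followed by an enumeration comma.
heading = re.compile(r"^[一二三四五六七八九十]+、")
raw = "一、事故概况\n  2010年 3月，某矿 发生瓦斯 爆炸。\u200b\n\n"
print(clean_report(raw, drop_lines=[heading]))
```

Passing the line filters as patterns keeps the report-specific rules, such as which headings to drop, separate from the generic character cleaning.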
Subsequently, the pre-processed text data were annotated for entities and relationships using the “Label Studio Assistant” software (version 2.0.4), following the predefined 20 entity label categories and 16 relationship label categories. Upon completion of the annotation process, the labeled data were exported in .ann format and then converted into the BIO labeling format using Python.
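The conversion to BIO tags can be sketched as follows. The sketch assumes that the exported files follow the brat-style standoff layout, in which each entity line has the form `T1<TAB>Label start end<TAB>text`; this layout, the character-level tagging, and the label name `AccidentType` are assumptions rather than details confirmed by the study.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (label, start, end) with end exclusive


def parse_ann(ann_text: str) -> List[Span]:
    """Read entity lines (IDs starting with 'T') from standoff annotations."""
    spans = []
    for line in ann_text.splitlines():
        parts = line.split("\t")
        if len(parts) < 2 or not parts[0].startswith("T"):
            continue  # relation lines, notes, and blank lines are ignored here
        if ";" in parts[1]:
            continue  # discontinuous spans are skipped in this sketch
        label, start, end = parts[1].split(" ")
        spans.append((label, int(start), int(end)))
    return spans


def to_bio(text: str, spans: List[Span]) -> List[str]:
    """Assign one BIO tag per character; overlapping spans are skipped."""
    tags = ["O"] * len(text)
    for label, start, end in sorted(spans, key=lambda s: s[1]):
        if any(tag != "O" for tag in tags[start:end]):
            continue
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


text = "某矿发生瓦斯爆炸"
ann = "T1\tAccidentType 4 8\t瓦斯爆炸\nR1\tcause Arg1:T1 Arg2:T2\n"
for char, tag in zip(text, to_bio(text, parse_ann(ann))):
    print(char, tag)
```

Character-level tags are used because BERT models for Chinese typically tokenize text into single characters; a word-level corpus would need a tokenizer-aware alignment instead.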
2.5. Knowledge Extraction
Knowledge extraction is the process of automatically extracting structured information from unstructured or semi-structured data sources. It involves identifying entities and relationships from raw data, often using techniques such as natural language processing, machine learning, and data mining [36]. In entity extraction tasks, commonly used models include BERT, CRF, and BiLSTM.
2.5.1. Entity Extraction
To select a suitable model for entity extraction, this study compared the BERT, BERT + CRF, and BERT + BiLSTM + CRF models. The performance of these models was primarily evaluated using Precision, Recall, and F1-score metrics [37].
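For reference, the three metrics can be computed from BIO tag sequences as sketched below. The sketch scores at the entity level, counting a prediction as correct only when both its label and its boundaries match the gold annotation; whether the study scored entities or individual tokens is not stated, so this choice and the labels `EQ` and `PER` are assumptions.

```python
def extract_spans(tags):
    """Collect (label, start, end) spans from a BIO tag sequence."""
    spans = set()
    start = label = None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final span
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            spans.add((label, start, i))
            start = label = None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def precision_recall_f1(gold_tags, pred_tags):
    """Entity-level precision, recall, and F1 for one tagged sequence."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["B-EQ", "I-EQ", "O", "B-PER", "O"]
pred = ["B-EQ", "I-EQ", "O", "O", "B-PER"]
print(precision_recall_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

In the example, the equipment span is matched exactly while the person span is predicted one position too late, so one of two predictions and one of two gold entities are correct.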
The proposed model was implemented using the TensorFlow framework, within an experimental environment configured with an Intel Core i5-8250U CPU (1.60 GHz, 4 cores), Python 3.7, and TensorFlow 1.14.0. The dataset was randomly split into training, validation, and test sets at a ratio of 7:1.5:1.5. The parameter settings for the BERT + BiLSTM + CRF model developed in this study are presented in Table 1.
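A random split at this ratio can be sketched as follows. The unit of splitting (whole reports or individual sentences) and the random seed are not given in the text, so the sketch simply splits 202 report indices with an arbitrary seed; the function name `split_dataset` is hypothetical.

```python
import random


def split_dataset(items, ratios=(7, 1.5, 1.5), seed=42):
    """Shuffle items and cut them into train/validation/test partitions."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    total = sum(ratios)
    n_train = round(len(items) * ratios[0] / total)
    n_val = round(len(items) * ratios[1] / total)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder absorbs rounding
    return train, val, test


train, val, test = split_dataset(range(202))
print(len(train), len(val), len(test))  # 141 30 31
```

Assigning the remainder to the test set guarantees that every item is used exactly once even when the ratio does not divide the dataset evenly.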
The results of the entity extraction task are presented in Table 2, which shows that the BERT + BiLSTM + CRF model achieved the best performance across all evaluation metrics. Despite its higher computational cost and longer training time, this model benefits from combining the contextual representation power of BERT, the bidirectional sequence modeling capability of BiLSTM, and the label-sequence optimization of CRF, which collectively enhance performance in named entity recognition. Other pretrained models, which were not part of this comparison, may also be worth considering: RoBERTa and ALBERT offer improvements in performance and efficiency, while Transformer-XL and GPT-2 may be better suited for tasks involving long-range dependencies or sequence generation.
2.5.2. Relation Extraction
Relation extraction was performed using the BERT model, and this task tended to yield better results than entity extraction. The parameters of the BERT model for relation extraction are shown in Table 3.
The overall relation extraction metrics were calculated from the labels and predictions on the validation set, as shown in Table 4.
2.6. Knowledge Storage
The extracted entities and relationships were saved in the form of triples and stored in a CSV file.
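One possible layout for such a file is sketched below. The column names, the de-duplication step, and the example row are assumptions made for illustration; the study does not specify the CSV schema.

```python
import csv
import io

# Hypothetical column layout: one triple per row, with both entity labels.
FIELDS = ["head", "head_label", "relation", "tail", "tail_label"]


def write_triples(triples, file_handle):
    """Write unique triples as CSV rows and return how many were written."""
    writer = csv.DictWriter(file_handle, fieldnames=FIELDS)
    writer.writeheader()
    seen = set()
    for row in triples:
        key = tuple(row[field] for field in FIELDS)
        if key in seen:
            continue  # the same triple can be extracted from several sentences
        seen.add(key)
        writer.writerow(dict(zip(FIELDS, key)))
    return len(seen)


# Hypothetical example row, not a record from the dataset.
row = {"head": "Example gas explosion accident", "head_label": "AccidentName",
       "relation": "Relatype", "tail": "Gas explosion",
       "tail_label": "AccidentType"}
buffer = io.StringIO()
count = write_triples([row, row], buffer)
print(count)
print(buffer.getvalue())
```

For a file on disk, the handle would be opened with `open(path, "w", newline="", encoding="utf-8")`, as recommended for the `csv` module.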
The triples were imported into the Neo4j database using Cypher query statements, and a safety knowledge graph of the mine ventilation system was constructed [38,39]. In Neo4j, data are represented as a graph, which not only enhances the flexibility and efficiency of data representation but also effectively handles dynamic changes, multi-layered relationships, and highly interconnected data scenarios [40]. Furthermore, Neo4j utilizes Cypher, a query language specifically designed for graph data. Cypher efficiently executes complex graph traversals, path queries, and aggregation operations, thereby uncovering potential connections and revealing hidden patterns within the data [41]. Its concise syntax reduces the complexity of querying graph data, allowing users to express queries related to graph structures intuitively and conveniently.