Article

Cognitive-Inspired Multimodal Learning Framework for Hazard Identification in Highway Construction with BIM–GIS Integration

1 Department of Security, Ningbo Highway Construction & Management Center, No. 396, Songjiang Mid. Rd., Ningbo 315211, China
2 College of Transportation Engineering, Tongji University, Shanghai 201804, China
3 Faculty of Maritime and Transportation, Ningbo University, No. 169, Qixing Rd., Ningbo 315832, China
4 College of Transportation Engineering, Chang’an University, Xi’an 710064, China
* Author to whom correspondence should be addressed.
Sustainability 2025, 17(21), 9395; https://doi.org/10.3390/su17219395
Submission received: 21 August 2025 / Revised: 25 September 2025 / Accepted: 13 October 2025 / Published: 22 October 2025

Abstract

Highway construction remains one of the most hazardous sectors in the infrastructure domain, where persistent accident rates challenge the vision of sustainable and safe development. Traditional hazard identification methods rely on manual inspections that are often slow, error-prone, and unable to cope with complex and dynamic site conditions. To address these limitations, this study develops a cognitive-inspired multimodal learning framework integrated with BIM–GIS-enabled digital twins to advance intelligent hazard identification and digital management for highway construction safety. The framework introduces three key innovations: a biologically grounded attention mechanism that simulates inspector search behavior, an adaptive multimodal fusion strategy that integrates visual, textual, and sensor information, and a closed-loop digital twin platform that synchronizes physical and virtual environments in real time. The system was validated across five highway construction projects over an 18-month period. Results show that the framework achieved a hazard detection accuracy of 91.7% with an average response time of 147 ms. Compared with conventional computer vision methods, accuracy improved by 18.2%, while gains over commercial safety systems reached 24.8%. Field deployment demonstrated a 34% reduction in accidents and a 42% increase in inspection efficiency, delivering a positive return on investment within 8.7 months. By linking predictive safety analytics with BIM–GIS semantics and site telemetry, the framework enhances construction safety, reduces delays and rework, and supports more resource-efficient, low-disruption project delivery, highlighting its potential as a sustainable pathway toward zero-accident highway construction.

1. Introduction

Highway construction remains one of the most hazardous sectors worldwide, with accident rates persistently high despite decades of safety management initiatives. Such incidents carry considerable social and economic costs—schedule disruptions, material waste, and additional energy use from rework—which conflict with the goals of sustainable infrastructure delivery [1]. They also erode road infrastructure resilience by reducing network robustness, prolonging service disruptions, and constraining recovery capacity [2]. Traditional approaches to safety management show significant limitations in adapting to complex and dynamic construction environments [3]. Traditional hazard identification methods rely heavily on manual inspections, which are time-consuming, labor-intensive, and prone to human error under complex site conditions [4]. Although recent advancements in computer vision and deep learning have enabled the automation of hazard detection, these approaches often neglect the contextual, dynamic, and cognitive dimensions of construction environments [5]. As a result, critical hazards may remain undetected, undermining both safety outcomes and project sustainability [6]. The integration of cognitive science with computational intelligence presents a promising avenue to address these limitations. Cognitive models of visual search, for instance, can replicate inspector-like scanning behaviors while compensating for human limitations in attention and memory. When combined with multimodal learning techniques, such models can analyze heterogeneous data sources—including images, videos, and sensor signals—to capture hazards in a more comprehensive and context-aware manner.
Parallel to these developments, digital twin technologies have emerged as powerful tools for real-time monitoring, simulation, and decision-making in engineering [7]. In construction safety, digital twins can provide a virtual representation of the physical site, facilitating proactive hazard management. However, existing implementations often lack robust mechanisms for integrating multimodal data streams or modeling inspector cognitive processes [8]. Similarly, while BIM–GIS platforms offer comprehensive spatial–temporal data management [9], their potential in supporting intelligent and sustainable hazard identification has not been fully explored.
This paper addresses these limitations by proposing an integrated framework that combines cognitive-inspired visual search models with multimodal deep learning and digital twin technologies. Specifically, this study addresses three key research questions: (1) How can human inspector cognitive patterns be effectively encoded into deep learning architectures? (2) How does adaptive multimodal fusion enhance hazard detection in dynamic construction environments? (3) How can digital twin technology enable the transition from passive monitoring to proactive hazard prevention? To answer these questions, the framework incorporates three key innovations: (1) a biologically plausible visual attention mechanism that models human inspector search patterns through recurrent neural networks, (2) an adaptive multimodal fusion architecture that dynamically weights visual, textual, and sensor data streams based on their reliability and contextual relevance for hazard identification, and (3) a closed-loop digital twin system that enables real-time hazard tracking and management through BIM–GIS integration.
The proposed approach differs from prior work in several aspects. First, unlike conventional computer vision methods that treat hazard detection as a purely computational problem [10], the proposed system embeds construction-specific contextual knowledge through inspector behavior modeling. Second, while existing multimodal learning frameworks focus on feature-level fusion [11], the proposed architecture implements a hierarchical attention mechanism that dynamically weights different data modalities based on their relevance to specific hazard types. Third, the digital twin component goes beyond traditional monitoring systems by establishing a bidirectional feedback loop between physical and virtual environments, enabling continuous system improvement through machine learning [12].
The contributions of this study are threefold. First, it develops a cognitive-inspired visual search mechanism that enhances hazard detection accuracy and reliability. Second, it proposes a multimodal fusion framework that integrates heterogeneous data for comprehensive risk profiling. Third, it demonstrates an end-to-end digital twin solution that aligns safety management with sustainability goals by reducing incident-related delays, resource waste, and operational disruptions. Experimental results demonstrate significant improvements in both detection accuracy and response times compared to state-of-the-art methods.
The remainder of this paper is organized as follows: Section 2 reviews related work in cognitive models for hazard detection, multimodal learning, and digital twin applications in construction. Section 3 presents the theoretical foundations and technical preliminaries. Section 4 details the proposed framework architecture and its key components. Section 5 describes the experimental setup and evaluation results. Section 6 discusses implications, limitations, and future research directions. The paper concludes with a summary of key findings in Section 7.

2. Literature Review

2.1. Cognitive-Inspired Approaches for Hazard Identification

The intersection of cognitive science and construction safety management has emerged as a fertile ground for theoretical and practical innovations. Cognitive-inspired approaches fundamentally reconceptualize hazard identification from a purely algorithmic task to one that incorporates human inspector expertise and behavioral patterns. The return inhibition mechanism (known in the visual attention literature as inhibition of return), originally identified in visual psychology research, has garnered particular attention because it explains how experienced inspectors systematically avoid redundant visual searches by suppressing attention to previously examined spatial locations [13]. Incorporating this principle offers a direct remedy to a common weakness in traditional computer vision systems, which often fall into repetitive scanning loops that reduce inspection efficiency and delay timely hazard recognition.
Human visual attention, however, extends beyond this single mechanism. Contemporary cognitive research highlights a dual-pathway process in which attention is jointly shaped by bottom-up sensory salience and top-down task goals [14]. This theoretical model challenges conventional computer vision paradigms that rely predominantly on feed-forward processing architectures. While several pioneering studies have attempted to implement these cognitive principles through neural network modifications [15], these implementations typically exhibit significant constraints in their applicability to real-world scenarios. Most notably, existing cognitive-inspired systems focus exclusively on static image analysis rather than capturing the dynamic, temporally extended scanning behaviors that characterize expert inspector performance in actual construction environments [16]. This gap underscores the need for computational models that move beyond isolated frame analysis to capture the sequential, dynamic, and memory-driven aspects of human attention.
The challenges are amplified in construction sites, which differ fundamentally from controlled laboratory conditions. Construction environments are marked by visual clutter, changing illumination, and constantly shifting spatial configurations. These factors demand adaptive attention mechanisms that can integrate contextual reasoning with temporal memory. Without such capabilities, hazard identification systems risk overlooking critical safety threats, thereby undermining not only accident prevention but also broader sustainability objectives, such as minimizing project delays, resource waste, and workforce vulnerability.

2.2. Multimodal Learning for Safety Applications

The growing emphasis on multimodal learning in safety-critical domains reflects the recognition that hazard identification cannot be addressed adequately through single-source data. Construction environments generate heterogeneous information streams, including textual documentation, visual imagery, and sensor-based telemetry, each offering complementary insights into risk conditions. Natural language processing techniques have demonstrated notable potential in this regard, particularly for analyzing inspection records, incident reports, and regulatory documents [17]. The advent of transformer-based architectures, particularly BERT and its variants, has achieved unprecedented performance levels in extracting semantic information from unstructured safety texts [18]. These developments enable automated processing of vast repositories of safety knowledge that were previously accessible only through manual analysis.
Visual processing pipelines for construction safety have predominantly employed object detection frameworks such as Faster R-CNN and YOLO architectures [19]. However, these approaches face substantial challenges when applied to construction environments, where hazardous objects and conditions exhibit extreme scale variations, frequent occlusions, and context-dependent interpretations. For instance, the same piece of equipment may represent either a normal working condition or a potential hazard depending on its spatial relationship to workers, time of day, and ongoing construction activities.
Recent advances in multimodal learning attempt to bridge these limitations by combining textual and visual representations through fusion strategies [20]. However, existing approaches typically employ static fusion weights that fail to adapt to the dynamic nature of construction hazards. This limitation becomes particularly problematic when different data modalities provide conflicting or complementary information about potential risks. The challenge is further compounded by the need for real-time processing capabilities, which has motivated research into lightweight neural architectures such as MobileNet [21]. Yet, the application of such compact networks to multimodal hazard detection remains underexplored, particularly in safety-sensitive contexts where predictive accuracy cannot be sacrificed for computational efficiency.
The integration of sensor data streams, including environmental monitoring, structural health monitoring, and worker location tracking, adds another layer of complexity to multimodal learning systems. Effective fusion of these diverse data types requires sophisticated attention mechanisms capable of dynamically weighting modality contributions based on contextual relevance and data quality considerations.

2.3. Digital Twin Technologies in Construction

Digital twin technologies have evolved considerably, transitioning from static virtual reality representations to advanced cyber–physical systems capable of real-time synchronization with construction processes [22]. Modern digital twin implementations increasingly incorporate Internet of Things (IoT) sensor networks and Building Information Modeling (BIM) integration, enabling the creation of high-fidelity virtual representations that mirror both geometric and semantic properties of physical construction sites [23]. These capabilities hold significant promise for enhancing safety, efficiency, and resilience in complex construction environments.
Despite these advancements, current digital twin applications in construction safety remain constrained by three persistent challenges [24]. First, synchronization latency between physical and virtual environments introduces temporal gaps that hinder reliable real-time decision-making. Second, interoperability issues across heterogeneous platforms and management systems create fragmented data silos that impede holistic integration. Third, the limited incorporation of automated decision-making processes often relegates digital twins to passive monitoring roles rather than proactive safety management systems.
Recent research has begun addressing these limitations through hybrid push-pull architectures that optimize data transmission based on criticality and urgency [25], and standardized data formats that facilitate cross-platform interoperability [26]. However, most existing digital twin implementations remain fundamentally reactive, focusing on monitoring and post-incident analysis rather than proactive hazard prevention [27]. This reactive orientation limits their potential to contribute to zero-accident construction goals, which require predictive capabilities and preemptive intervention strategies.
Moreover, prevailing digital twin platforms are often designed as standalone systems that lack integration with cognitive models of inspector behavior or multimodal learning frameworks. This separation prevents them from fully exploiting the diverse spectrum of available data—ranging from sensor telemetry to visual and textual records—thereby limiting their capacity for adaptive, context-aware hazard identification.

2.4. Integrated Safety Management Systems

The development of integrated safety management systems represents an ambitious endeavor to synthesize advances from cognitive modeling, multimodal learning, and digital twin technologies into cohesive operational frameworks. Several notable efforts have demonstrated the potential of such integration, though each exhibits specific limitations that constrain its practical applicability.
The emergency response framework successfully demonstrates how digital twin technologies can enhance situational awareness during crisis situations through real-time data integration and visualization [28]. However, this system primarily addresses post-incident response scenarios rather than preventive hazard identification, limiting its contribution to proactive safety management objectives. Similarly, BIM-based hazard tracking systems have shown promise in managing safety information throughout construction project lifecycles, but these implementations typically rely on manual hazard input procedures rather than automated detection capabilities, creating scalability limitations and human error susceptibilities [29].
The most comprehensive existing approach is a machine learning pipeline designed specifically for road infrastructure monitoring applications [30]. This system incorporates multiple data sources and automated analysis capabilities, representing a significant advance in integrated safety management. However, it lacks several components that are critical in dynamic construction environments: cognitive-inspired attention mechanisms that could model human inspector expertise, adaptive multimodal fusion capabilities that could dynamically weight different data sources based on contextual relevance, and closed-loop feedback mechanisms that could enable continuous system improvement through operational experience.
The framework proposed in this study addresses these gaps by establishing integration across three critical dimensions: biologically plausible attention mechanisms that capture and extend human inspector capabilities, adaptive multimodal fusion architectures that dynamically optimize data integration based on hazard-specific requirements, and closed-loop digital twin systems that enable continuous learning and improvement through operational feedback. Unlike previous approaches that treat these components as separate modules, the proposed architecture establishes tight coupling between cognitive models, deep learning pipelines, and digital twin platforms, creating synergistic effects that enhance overall system performance.
This integrated approach represents a paradigm shift from reactive safety monitoring to proactive hazard prevention, enabling systems that not only detect existing hazards more accurately but also predict potential risks and recommend preemptive interventions. Such capabilities are essential for achieving zero-accident construction objectives, which require a comprehensive understanding of risk factors, real-time monitoring capabilities, and adaptive response mechanisms that can evolve with changing construction conditions and emerging safety challenges.

3. Background and Preliminaries

3.1. Deep Learning Foundations for Cognitive-Inspired Hazard Detection

Modern deep learning architectures provide the computational substrate for implementing cognitive-inspired visual processing systems in construction safety applications. The hierarchical feature learning capabilities of convolutional neural networks (CNNs) align naturally with biological vision processing, where early layers extract low-level features analogous to simple cells in the visual cortex, while deeper layers capture complex patterns similar to complex cells [31]. This hierarchical organization is fundamental to the proposed cognitive-inspired framework, as it enables the modeling of both bottom-up feature detection and top-down attentional control mechanisms. The basic CNN convolution operation can be expressed as:
$$y_{i,j} = f\left( \sum_{m,n} x_{i+m,\,j+n}\, w_{m,n} + b \right)$$
where x represents input features, w denotes learnable filters, b is the bias term, and f is a nonlinear activation function. In the proposed framework, this operation is extended to incorporate spatial attention weights that model inspector fixation patterns, allowing the network to dynamically focus on construction-relevant regions while suppressing irrelevant background information.
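As an illustration of the convolution operation above, the following minimal sketch computes $y_{i,j}$ for a single-channel input in plain Python. The function names `conv2d` and `apply_attention` are hypothetical, and the element-wise multiplication used to apply spatial attention weights is an assumption made for illustration only; the text does not specify how the attention weights enter the operation.

```python
from typing import Callable, List

Matrix = List[List[float]]


def relu(z: float) -> float:
    """Nonlinear activation f; ReLU is chosen here only as an example."""
    return max(0.0, z)


def conv2d(x: Matrix, w: Matrix, b: float = 0.0,
           f: Callable[[float], float] = relu) -> Matrix:
    """y[i][j] = f(sum_{m,n} x[i+m][j+n] * w[m][n] + b), without padding."""
    kh, kw = len(w), len(w[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    y: Matrix = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = sum(x[i + m][j + n] * w[m][n]
                    for m in range(kh) for n in range(kw))
            row.append(f(s + b))
        y.append(row)
    return y


def apply_attention(y: Matrix, a: Matrix) -> Matrix:
    """Assumed extension: scale each response by a spatial attention weight."""
    return [[yv * av for yv, av in zip(y_row, a_row)]
            for y_row, a_row in zip(y, a)]
```

Under this assumption, a location with attention weight 0 contributes nothing to later layers, while a weight of 1 leaves the convolution response unchanged.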
Attention mechanisms have emerged as crucial components for modeling selective visual processing [32]. For the cognitive-inspired system presented in this study, a modified attention architecture is implemented that incorporates return inhibition principles:
$$\alpha_{i,j}(t) = \mathrm{softmax}\left( e_{i,j}(t) - \gamma\, h_{i,j}(t-1) \right)$$
where $\alpha_{i,j}(t)$ represents attention weights at spatial location $(i,j)$ and time $t$, $e_{i,j}(t)$ is the evidence for attending to that location, $h_{i,j}(t-1)$ captures the history of previous attention, and $\gamma$ is the inhibition strength parameter. This formulation enables the system to avoid redundant scanning while maintaining sensitivity to emerging hazards.
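The attention rule with return inhibition can be sketched as follows. This is a minimal plain-Python illustration, not the implementation used in the study: the function name `inhibited_attention` is hypothetical, the inhibition term is assumed to be subtracted from the evidence, and the softmax is assumed to normalize over all spatial locations of a single frame.

```python
import math
from typing import List

Grid = List[List[float]]


def inhibited_attention(evidence: Grid, history: Grid,
                        gamma: float = 0.35) -> Grid:
    """Softmax over all locations of e(i,j) - gamma * h(i,j)."""
    scores = [[e - gamma * h for e, h in zip(e_row, h_row)]
              for e_row, h_row in zip(evidence, history)]
    # Subtract the maximum score for numerical stability.
    peak = max(max(row) for row in scores)
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]


# Equal evidence everywhere, but the top-left cell was attended previously.
alpha = inhibited_attention([[1.0, 1.0], [1.0, 1.0]],
                            [[1.0, 0.0], [0.0, 0.0]])
```

In this example the previously attended cell receives a smaller weight than the three unvisited cells, which is the behavior that prevents repeated scanning of the same region.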
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM) networks, provide the temporal modeling capabilities essential for capturing inspector search strategies over time [33]. The LSTM update equations are adapted in the proposed framework to incorporate construction-specific prior knowledge:
$$f_t = \sigma\left( W_f [h_{t-1}, x_t] + U_f c_{t-1} + b_f \right)$$
$$i_t = \sigma\left( W_i [h_{t-1}, x_t] + U_i c_{t-1} + b_i \right)$$
$$\tilde{C}_t = \tanh\left( W_C [h_{t-1}, x_t] + b_C \right)$$
where $f_t$ and $i_t$ represent the forget and input gates, respectively, $\tilde{C}_t$ is the candidate cell state, and $U_f$, $U_i$ are additional weight matrices that incorporate contextual construction site information. This extension enables our system to maintain long-term memory of hazard patterns while adapting to changing construction conditions.
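The three gate equations can be sketched in plain Python as shown below. This is an illustrative sketch only: the function names and the `params` dictionary layout are hypothetical, and the vector `c_prev` simply stands for whatever quantity $c_{t-1}$ denotes in the equations, entering through $U_f$ and $U_i$.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def matvec(W: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def lstm_gates(h_prev: Vector, x_t: Vector, c_prev: Vector,
               params: Dict[str, list]) -> Tuple[Vector, Vector, Vector]:
    """Return (f_t, i_t, candidate) following the three gate equations."""
    hx = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f_t = [sigmoid(a + u + b) for a, u, b in
           zip(matvec(params["W_f"], hx), matvec(params["U_f"], c_prev),
               params["b_f"])]
    i_t = [sigmoid(a + u + b) for a, u, b in
           zip(matvec(params["W_i"], hx), matvec(params["U_i"], c_prev),
               params["b_i"])]
    candidate = [math.tanh(a + b) for a, b in
                 zip(matvec(params["W_C"], hx), params["b_C"])]
    return f_t, i_t, candidate
```

With all weights and biases set to zero, both gates evaluate to 0.5 and the candidate state to 0, which is a convenient sanity check before training.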

3.2. Digital Twin and BIM–GIS Integration Principles

Digital twin technology provides the cyber–physical foundation for our intelligent hazard management system by establishing bidirectional information flows between physical construction environments and their virtual counterparts [34]. The theoretical framework underlying the digital twin implementation in this study extends beyond traditional monitoring systems to incorporate predictive analytics and autonomous decision-making capabilities.
The proposed digital twin architecture integrates four fundamental components: (1) physical entities encompassing construction sites, equipment, and personnel, (2) virtual models providing geometric and semantic representations, (3) bidirectional data connections enabling real-time synchronization, and (4) analytical engines supporting cognitive-inspired hazard detection [35]. This architecture supports continuous learning and adaptation through operational feedback, distinguishing it from static digital representations.
Building Information Modeling (BIM) contributes structured semantic information about construction elements, their relationships, and associated safety requirements [36]. The integration with Geographic Information Systems (GIS) extends spatial analysis capabilities from building-scale to site-scale and regional contexts [37]. The framework leverages this multi-scale capability to identify hazards that span different spatial domains, from localized equipment hazards to site-wide traffic flow conflicts. The mathematical representation of the integrated digital twin system can be expressed as:
$$DT = \{ M_{\mathrm{BIM}},\ M_{\mathrm{GIS}},\ S_{\mathrm{IoT}},\ A_{\mathrm{cognitive}} \}$$
where $M_{\mathrm{BIM}}$ represents building information models, $M_{\mathrm{GIS}}$ denotes geographic information systems, $S_{\mathrm{IoT}}$ encompasses sensor networks, and $A_{\mathrm{cognitive}}$ represents the cognitive-inspired analytical engines. The system maintains temporal coherence through continuous synchronization protocols:
$$\Delta t_{\mathrm{sync}} = \min\left( \tau_{\mathrm{hazard}},\ \tau_{\mathrm{system}} \right)$$
where $\tau_{\mathrm{hazard}}$ represents the characteristic time scale of hazard evolution and $\tau_{\mathrm{system}}$ denotes system computational constraints. This ensures that virtual models remain synchronized with physical conditions at frequencies appropriate for safety-critical applications.
Recent advances in Industry Foundation Classes (IFC) and CityGML standards have improved interoperability between BIM and GIS platforms [38], though the proposed system extends these standards to incorporate cognitive-inspired metadata that captures inspector attention patterns and hazard detection confidence levels. This enhancement enables the digital twin to learn from human expertise while providing automated hazard identification capabilities.

3.3. Multimodal Data Fusion and Cognitive Modeling Frameworks

The integration of heterogeneous data sources in construction safety applications requires sophisticated fusion architectures that can handle the diverse characteristics of visual, textual, and sensor data streams. The multimodal fusion framework presented in this study draws inspiration from cognitive science research on multisensory integration, where the human brain combines information from different sensory modalities to create coherent perceptual experiences [39].
The theoretical foundation for the multimodal approach rests on optimal cue integration theory, which suggests that humans combine sensory information in a statistically optimal manner by weighting each modality according to its reliability [40]. This principle is implemented through a Bayesian fusion architecture:
$$p(H \mid D_v, D_t, D_s) \propto p(H) \prod_{i \in \{v,t,s\}} p(D_i \mid H)^{w_i}$$
where $H$ represents hazard hypotheses, $D_v$, $D_t$, and $D_s$ denote visual, textual, and sensor data, respectively, and $w_i$ are reliability weights dynamically adjusted based on data quality and contextual relevance. In practical implementation, the reliability weights $w_i$ are learned dynamically from historical performance data: when visual data quality deteriorates under low-visibility conditions, the system automatically reduces $w_v$ while increasing the textual weight $w_t$ and the sensor weight $w_s$. Weight updates employ an exponential moving average with update rate $\alpha = 0.1$, ensuring the system responds to data quality changes in a timely yet stable manner. This adaptive mechanism enables the system to maintain robust performance across varying environmental conditions.
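The reliability-weighted Bayesian fusion and the moving-average weight update can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, hypotheses are treated as a small discrete set, and the update is assumed to take the common form $w \leftarrow (1-\alpha)\,w + \alpha\, r$, where $r$ is a hypothetical reliability score observed for the modality (the text gives only the rate $\alpha = 0.1$, not the exact update rule).

```python
import math
from typing import Dict


def fuse_posterior(prior: Dict[str, float],
                   likelihoods: Dict[str, Dict[str, float]],
                   weights: Dict[str, float]) -> Dict[str, float]:
    """Posterior proportional to p(H) * prod_i p(D_i | H) ** w_i.

    likelihoods[modality][hypothesis] holds p(D_i | H); all values > 0.
    """
    log_post = {}
    for h, p_h in prior.items():
        log_post[h] = math.log(p_h) + sum(
            weights[m] * math.log(likelihoods[m][h]) for m in weights)
    peak = max(log_post.values())  # work in log space for stability
    unnorm = {h: math.exp(v - peak) for h, v in log_post.items()}
    total = sum(unnorm.values())
    return {h: u / total for h, u in unnorm.items()}


def ema_update(weight: float, reliability: float, alpha: float = 0.1) -> float:
    """Assumed exponential moving average update of a reliability weight."""
    return (1.0 - alpha) * weight + alpha * reliability


prior = {"hazard": 0.5, "safe": 0.5}
likelihoods = {"v": {"hazard": 0.8, "safe": 0.2},
               "s": {"hazard": 0.6, "safe": 0.4}}
posterior = fuse_posterior(prior, likelihoods, {"v": 1.0, "s": 1.0})
```

Setting a modality's weight to 0 removes its influence on the posterior entirely, while a weight of 1 recovers the standard Bayesian product, so down-weighting the visual stream in poor visibility shifts the decision toward the remaining modalities.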
Cognitive load theory provides additional insights for optimizing information processing in our system [41]. Construction inspectors operate under significant cognitive constraints, particularly in complex environments with multiple simultaneous hazards. The framework incorporates cognitive load principles by implementing adaptive information presentation strategies:
$$L_{\mathrm{cognitive}} = \alpha L_{\mathrm{intrinsic}} + \beta L_{\mathrm{extraneous}} + \gamma L_{\mathrm{germane}}$$
where $L_{\mathrm{intrinsic}}$ represents task-inherent cognitive load, $L_{\mathrm{extraneous}}$ denotes unnecessary processing overhead, and $L_{\mathrm{germane}}$ captures productive learning-related cognitive effort. The parameters $\alpha$, $\beta$, and $\gamma$ are adjusted based on inspector expertise levels and environmental complexity.
Working memory models inform the system’s temporal information management strategies [42]. The Baddeley–Hitch model of working memory, with its central executive, phonological loop, and visuospatial sketchpad components, provides a blueprint for organizing multimodal information processing:
$$WM = \{ CE,\ PL_{\mathrm{text}},\ VSS_{\mathrm{visual}},\ EB_{\mathrm{episodic}} \}$$
where $CE$ represents central executive functions for attention control, $PL_{\mathrm{text}}$ handles textual information processing, $VSS_{\mathrm{visual}}$ manages visual-spatial information, and $EB_{\mathrm{episodic}}$ maintains episodic memory for past hazard encounters. This cognitive architecture ensures that the system maintains coherent situational awareness while processing continuous data streams.
The integration of these cognitive principles with deep learning architectures creates a hybrid system that combines the pattern recognition capabilities of artificial neural networks with the contextual reasoning abilities inspired by human cognition [43]. This fusion enables the framework to achieve superior hazard identification performance while maintaining interpretability and adaptability to novel construction scenarios. Furthermore, the mathematical formulation of the cognitive-inspired attention mechanism incorporates both spatial and temporal components:
$$A_{\mathrm{cognitive}}(x, y, t) = \sigma\left( W_a \left[ F_{\mathrm{visual}}(x, y),\ M_{\mathrm{memory}}(t),\ C_{\mathrm{context}} \right] \right)$$
where $A_{\mathrm{cognitive}}$ represents the cognitive attention map, $F_{\mathrm{visual}}$ denotes visual features at spatial coordinates $(x, y)$, $M_{\mathrm{memory}}$ captures temporal memory states, and $C_{\mathrm{context}}$ incorporates construction-specific contextual information. This formulation enables the system to dynamically adapt attention patterns based on both current visual input and accumulated experience, mirroring the adaptive behavior of expert human inspectors.
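The cognitive attention map can be sketched as a per-location logistic unit applied to the concatenation of visual, memory, and context vectors. The sketch below is illustrative only: the function name `cognitive_attention` is hypothetical, and $W_a$ is assumed to be a single weight vector that yields one scalar attention value per location.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def cognitive_attention(visual: List[List[Vector]], memory: Vector,
                        context: Vector, w_a: Vector) -> List[List[float]]:
    """A(x, y, t) = sigmoid(w_a . [F_visual(x, y), M_memory(t), C_context])."""
    attention = []
    for row in visual:
        out_row = []
        for features in row:
            z = features + memory + context  # list concatenation
            if len(z) != len(w_a):
                raise ValueError("w_a must match the concatenated length")
            out_row.append(sigmoid(sum(w * v for w, v in zip(w_a, z))))
        attention.append(out_row)
    return attention


# One row with two locations, each described by a single visual feature.
amap = cognitive_attention([[[1.0], [0.0]]], memory=[0.0], context=[0.0],
                           w_a=[2.0, 0.0, 0.0])
```

Because the memory and context vectors are shared across all locations at a given time step, they shift the whole attention map, whereas the visual features determine how attention differs between locations.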

4. Cognitive-Inspired Multimodal Deep Learning Framework

4.1. System Architecture Overview

The intelligent hazard identification and digital management system integrates cognitive-inspired visual processing, multimodal deep learning, and digital twin technologies through a unified four-layer architecture, as shown in Figure 1.
The framework establishes bidirectional information flows between physical construction environments and virtual analytical engines, enabling real-time hazard detection, risk assessment, and proactive intervention strategies [44].
The system architecture comprises four interconnected layers that maintain specific functional responsibilities while contributing to overall performance through hierarchical information processing and feedback mechanisms. Unlike conventional safety monitoring systems, the proposed framework incorporates cognitive principles at each layer to enhance both detection accuracy and operational efficiency.

4.2. Cognitive-Inspired Visual Attention Mechanism

4.2.1. Return Inhibition Neural Network Architecture

Our cognitive-inspired visual attention mechanism models human inspector behavior through a recurrent neural network that implements return inhibition principles. This approach addresses a fundamental limitation in conventional computer vision systems: the tendency to repeatedly scan the same regions without systematic exploration strategies.
The return inhibition mechanism is implemented through a modified convolutional LSTM architecture that maintains spatial–temporal memory of previously attended regions. The core innovation lies in incorporating inhibition maps that accumulate historical attention patterns with temporal decay, effectively preventing redundant scanning while maintaining sensitivity to emerging hazards:
$$I_{\mathrm{inhibit}}^{(l)}(i, j) = \sum_{\tau=1}^{T} \lambda^{\tau} A_{t-\tau}^{(l)}(i, j)$$
where $I_{\mathrm{inhibit}}^{(l)}$ represents the inhibition map at layer $l$, $A_{t-\tau}^{(l)}$ captures attention weights from previous time steps, and $\lambda$ is the temporal decay factor that models the memory fading observed in human inspectors.
The return inhibition parameters are learned end-to-end through backpropagation jointly with the ConvLSTM network. Based on validation set performance, the inhibition strength γ was determined to be 0.35 (searched within [0.1, 0.5]), and the temporal decay λ was set to 0.85 (searched within [0.7, 0.95]). Sensitivity analysis indicates that when γ ∈ [0.3, 0.4], detection accuracy remains at 91.7% ± 0.8%; when λ ∈ [0.8, 0.9], system performance remains stable. These parameters ensure a balance between avoiding redundant scanning (through moderate inhibition) and maintaining sensitivity to emerging hazards.
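The accumulation of the inhibition map with temporal decay can be sketched as follows. This is a minimal single-layer illustration in plain Python: the function name `inhibition_map` and the ordering convention for the attention history are hypothetical, while the default decay of 0.85 is the value reported above.

```python
from typing import List

Grid = List[List[float]]


def inhibition_map(past_attention: List[Grid], lam: float = 0.85) -> Grid:
    """I(i, j) = sum_{tau=1..T} lam**tau * A_{t-tau}(i, j).

    past_attention[0] is A_{t-1}, past_attention[1] is A_{t-2}, and so on.
    """
    rows = len(past_attention[0])
    cols = len(past_attention[0][0])
    inhibit = [[0.0] * cols for _ in range(rows)]
    for tau, attn in enumerate(past_attention, start=1):
        decay = lam ** tau  # older fixations contribute less
        for i in range(rows):
            for j in range(cols):
                inhibit[i][j] += decay * attn[i][j]
    return inhibit


# The left cell was attended at t-1, the right cell at t-2.
history = [[[1.0, 0.0]], [[0.0, 1.0]]]
imap = inhibition_map(history)
```

In this example the more recently attended cell accumulates the larger inhibition value (0.85 compared with 0.85 squared), so inhibition fades for regions that have not been inspected for several steps and they become eligible for attention again.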

4.2.2. Multi-Scale Spatial Attention Integration

Construction hazards exhibit diverse spatial scales, from small personal protective equipment violations to large-scale equipment positioning conflicts. The multi-scale attention mechanism addresses this challenge through pyramid attention networks that process visual information at multiple resolution levels simultaneously [45]. The system dynamically weighs contributions from different scales based on hazard types and contextual information, enabling effective detection across the full spectrum of construction safety concerns.
The multi-scale integration incorporates construction-specific contextual information, including time of day, construction phase, weather conditions, and historical hazard patterns. This contextual weighting ensures that attention mechanisms adapt to changing construction environments and prioritize hazards based on their likelihood and potential impact.

4.2.3. Temporal Memory and Search Strategy Learning

The system learns optimal visual search strategies by modeling inspector expertise through both short-term working memory for immediate hazard tracking and long-term episodic memory for pattern recognition across similar construction scenarios. The temporal memory mechanism enables the system to maintain coherent situational awareness while adapting to evolving construction conditions.
The learning process incorporates feedback from experienced inspectors to refine search strategies over time. This human-in-the-loop approach ensures that the system benefits from domain expertise while overcoming limitations in human visual attention and memory capacity.

4.3. Multimodal Data Fusion Architecture

The comprehensive integration of heterogeneous data sources is achieved through the adaptive multimodal fusion architecture, as shown in Figure 2. This architecture processes visual, textual, and sensor data streams through specialized pipelines before combining them through attention-based dynamic weighting mechanisms.

4.3.1. Hybrid BERT-Word2Vec Text Processing Pipeline

Textual data sources, including safety reports, inspection logs, and regulatory documents, contain rich semantic information crucial for hazard identification. The hybrid text processing pipeline combines the contextual understanding capabilities of BERT with the semantic similarity detection strengths of Word2Vec embeddings. This dual approach ensures comprehensive coverage of both explicit safety violations documented in reports and implicit patterns that emerge from large-scale text analysis. The hybrid representation balances contextual understanding with computational efficiency:
$$R_{text} = \gamma R_{BERT} + (1 - \gamma) R_{Word2Vec}$$
where the blending weight γ ∈ [0, 1] (distinct from the inhibition strength γ of Section 4.2.1) is dynamically adjusted based on text complexity and processing time constraints. The pipeline incorporates a construction safety ontology through domain-specific fine-tuning, enabling accurate recognition of technical terminology and safety-critical concepts.
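The weighted combination can be sketched as follows. Because BERT and Word2Vec embeddings generally differ in dimensionality, the sketch assumes both vectors have already been projected to a common dimension; that projection step is not described in the text. The rule in `choose_gamma`, including its `long_text` and `tight_budget_ms` parameters, is a hypothetical illustration of adjusting γ by text complexity and time budget, not the rule used in this study.

```python
from typing import List


def blend_text_embeddings(r_bert: List[float], r_w2v: List[float], gamma: float) -> List[float]:
    """Convex combination R_text = gamma * R_BERT + (1 - gamma) * R_Word2Vec."""
    if len(r_bert) != len(r_w2v):
        raise ValueError("embeddings must share a dimension (projection assumed upstream)")
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return [gamma * b + (1.0 - gamma) * w for b, w in zip(r_bert, r_w2v)]


def choose_gamma(num_tokens: int, time_budget_ms: float,
                 long_text: int = 64, tight_budget_ms: float = 50.0) -> float:
    """HYPOTHETICAL rule: longer text favours BERT; a tight budget halves its weight."""
    gamma = min(num_tokens / long_text, 1.0)
    if time_budget_ms < tight_budget_ms:
        gamma *= 0.5
    return gamma
```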

4.3.2. Optimized MobileNet-YOLOv5 Visual Processing

Real-time hazard detection requires lightweight yet accurate visual processing capabilities. The optimized MobileNet-YOLOv5 architecture balances computational efficiency with detection performance through depthwise separable convolutions, attention-guided feature pyramid networks, and knowledge distillation from larger teacher models.
The architecture incorporates construction-specific modifications, including specialized object classes for safety equipment, hazardous materials, and personnel detection. Transfer learning from general object detection datasets is combined with construction-specific training data to achieve robust performance across diverse construction environments.

4.3.3. Adaptive Multimodal Fusion Strategy

The adaptive fusion strategy dynamically adjusts the contribution of different data modalities based on their reliability, relevance, and contextual appropriateness. This approach addresses the challenge that different modalities may provide conflicting or complementary information depending on environmental conditions and hazard types.
The fusion mechanism employs attention-based weighting that considers data quality metrics, historical performance under similar conditions, and uncertainty estimates from individual modalities. Monte Carlo dropout provides confidence estimates for fused hazard detection results, enabling risk-aware decision-making in safety-critical applications.
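A minimal sketch of attention-style modality weighting is given below. It assumes each modality emits a scalar hazard probability and a scalar reliability score, which is a simplification of the quality metrics, historical performance, and uncertainty estimates described above. The function names and the softmax form are illustrative assumptions.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(modality_probs: Dict[str, float],
         reliability: Dict[str, float]) -> Tuple[float, Dict[str, float]]:
    """Weight each modality's hazard probability by a softmax over reliability scores."""
    names = sorted(modality_probs)
    weights = softmax([reliability[name] for name in names])
    fused = sum(w * modality_probs[name] for w, name in zip(weights, names))
    return fused, dict(zip(names, weights))
```

With equal reliability scores the fused value reduces to a plain average. Lowering the visual score, for instance under poor visibility, shifts weight toward the text and sensor streams.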

4.4. Digital Twin Integration and Closed-Loop Management

4.4.1. Four-Layer Digital Twin Architecture

Our digital twin system extends beyond traditional monitoring platforms to incorporate predictive analytics and autonomous decision-making capabilities. The four-layer architecture ensures seamless integration between physical construction environments and virtual analytical engines while maintaining real-time synchronization requirements.
The Physical Asset Layer encompasses construction sites, equipment, personnel, and environmental conditions, providing continuous data streams through IoT sensor networks and monitoring systems. The Data Communication Layer handles bidirectional transmission with adaptive protocols that prioritize safety-critical information. The Virtual Model Layer maintains synchronized geometric and semantic representations through BIM–GIS integration, while the Analytics and Decision Layer implements cognitive-inspired algorithms with continuous learning capabilities.

4.4.2. Adaptive Synchronization Protocols

Traditional digital twin implementations suffer from synchronization delays that compromise real-time decision-making capabilities. The adaptive synchronization protocol addresses this limitation through intelligent data prioritization and distributed processing strategies. Synchronization frequency is dynamically adjusted based on hazard criticality and system constraints:
$$f_{sync}(H) = f_{baseline}\left(1 + \alpha P_{hazard}(H)\right)$$
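where the hazard priority score $P_{hazard}(H)$ incorporates severity level, temporal urgency, and detection confidence, $f_{baseline}$ is the baseline synchronization frequency, and $\alpha$ scales the priority-driven increase. This approach ensures that critical safety information receives immediate attention while maintaining overall system performance.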
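The synchronization rule can be made concrete with a short sketch. The baseline of 1 Hz and α = 4 are hypothetical values chosen only for illustration, and the priority score is assumed to be normalized to [0, 1]; neither assumption is stated in the text.

```python
def sync_frequency(p_hazard: float, f_baseline_hz: float = 1.0, alpha: float = 4.0) -> float:
    """f_sync(H) = f_baseline * (1 + alpha * P_hazard(H)).

    ASSUMPTIONS: p_hazard is normalised to [0, 1]; the default baseline and
    alpha are illustrative, not values reported by the study.
    """
    if not 0.0 <= p_hazard <= 1.0:
        raise ValueError("p_hazard is assumed to lie in [0, 1]")
    return f_baseline_hz * (1.0 + alpha * p_hazard)
```

Under these illustrative values, a hazard-free scene synchronizes at the baseline rate and a maximally critical hazard raises the rate fivefold.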

4.4.3. Closed-Loop Feedback and Continuous Learning

The system implements closed-loop feedback mechanisms that enable continuous improvement through operational experience. The feedback system incorporates automated performance metrics, inspector validation, incident reports, and long-term safety outcomes. Machine learning algorithms continuously update system parameters while maintaining stability through adaptive learning rates that balance performance improvement with system reliability.
The system collects user feedback to continuously optimize interface design and interaction workflows, ensuring effective integration of technological innovation with human factors engineering. Feedback mechanisms include both explicit user ratings and implicit behavioral patterns, enabling the system to adapt to diverse user preferences and operational contexts.
The continuous learning framework enables the system to adapt to new construction techniques, emerging hazard types, and changing safety regulations. This adaptability is crucial for maintaining effectiveness across diverse construction projects and evolving industry practices.
This methodology establishes a foundation for pursuing zero-accident highway construction by integrating cognitive science principles, deep learning techniques, and real-time digital twin management. The framework’s modular design enables incremental deployment and continuous improvement while maintaining compatibility with existing construction management systems.

5. Experimental Design and Results

5.1. Experimental Setup and Data Collection

5.1.1. Dataset Construction and Characteristics

To validate the effectiveness of the cognitive-inspired multimodal deep learning framework, a comprehensive dataset was constructed encompassing diverse highway construction scenarios across multiple geographical regions and construction phases, as detailed in Table 1. The dataset collection was conducted over 18 months across five major highway construction projects in different climatic and geographical conditions, ensuring a broad representation of real-world construction environments.
While the dataset covers an 18-month period and multiple weather conditions, it may have limited coverage of extremely rare high-impact events (such as compound hazards caused by extreme weather). Additionally, the data primarily comes from specific geographical regions, and differences in cross-cultural safety practices require further investigation.
The experimental dataset comprises three primary components with careful attention to data quality and representativeness. Visual data collection employed high-resolution cameras (4K, 30–60 FPS) positioned at strategic locations across construction sites, capturing diverse lighting conditions, weather scenarios, and construction phases. Textual data encompassed safety inspection reports, incident logs, and regulatory compliance documents spanning multiple years of construction history.
Ground truth annotations were established through collaborative efforts involving experienced safety inspectors, construction engineers, and domain experts, as shown in Figure 3. Each visual sample was independently annotated by three certified safety professionals, with inter-annotator agreement scores exceeding 85% for hazard identification tasks.
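The text does not state which agreement statistic was used. As one plausible reading, the sketch below computes mean pairwise percent agreement across three annotators; the function name and the example labels are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def pairwise_agreement(labels_by_annotator: List[Sequence[str]]) -> float:
    """Share of (annotator pair, sample) combinations with identical labels."""
    agree = 0
    total = 0
    for first, second in combinations(labels_by_annotator, 2):
        for x, y in zip(first, second):
            agree += int(x == y)
            total += 1
    return agree / total


# Hypothetical labels for four samples from three annotators.
example = [
    ["ppe", "none", "fall", "ppe"],
    ["ppe", "none", "fall", "none"],
    ["ppe", "none", "ppe", "ppe"],
]
```

Chance-corrected statistics such as Fleiss' kappa are a common alternative when label frequencies are imbalanced.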

5.1.2. Experimental Platform and Implementation Details

The experimental platform was implemented using a distributed computing architecture comprising high-performance GPU clusters and edge computing devices deployed at construction sites, as shown in Figure 4. The central processing infrastructure consisted of NVIDIA V100 GPUs for model training and Tesla T4 GPUs for real-time inference, while edge devices utilized NVIDIA Jetson Xavier NX modules for on-site data processing and preliminary hazard detection.
The experimental dataset comprises three primary components: visual data (32,847 high-resolution images and 1247 h of video footage), textual data (15,623 safety inspection reports, 8934 incident logs, and 12,456 regulatory compliance documents), and sensor data (continuous IoT measurements from 156 sensor nodes monitoring environmental conditions, equipment status, and worker locations). All data collection procedures were conducted in accordance with relevant safety protocols and privacy regulations, with appropriate consent obtained from construction personnel and project stakeholders.
Textual data annotations focused on safety-critical events, hazard types, severity levels, and contextual factors. The annotation process incorporated a hierarchical taxonomy of construction hazards developed specifically for this study, covering 23 primary hazard categories and 87 sub-categories.
Software implementation employed PyTorch 1.12 as the primary deep learning framework, with specialized libraries for computer vision (OpenCV 4.6), natural language processing (Transformers 4.21), and geospatial analysis (GDAL 3.5). The digital twin platform was developed using Unity 3D for visualization and Node.js for backend services, with PostgreSQL databases for structured data storage and MongoDB for unstructured content management.
Network configurations were optimized for each component of the multimodal framework. The cognitive-inspired attention mechanism employed a modified ConvLSTM architecture with 256 hidden units and 8 attention heads. The hybrid BERT-Word2Vec text processing pipeline utilized BERT-base-uncased with 768-dimensional embeddings and Word2Vec models trained on construction-specific corpora with 300-dimensional vectors. The optimized MobileNet-YOLOv5 architecture incorporated depthwise separable convolutions with a width multiplier α = 0.75 and input resolution of 416 × 416 pixels for optimal balance between accuracy and computational efficiency.
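The saving from depthwise separable convolutions can be illustrated per layer. The sketch uses the standard multiply-accumulate counts for a k × k convolution, under which the separable form costs 1/C_out + 1/k² of the standard form. The layer sizes in the example are hypothetical, and per-layer ratios do not translate directly into a whole-network figure.

```python
def standard_conv_macs(k: int, c_in: int, c_out: int, h: int, w: int) -> int:
    """Multiply-accumulates of a standard k x k convolution on an h x w output map."""
    return k * k * c_in * c_out * h * w


def depthwise_separable_macs(k: int, c_in: int, c_out: int, h: int, w: int) -> int:
    """Depthwise k x k pass plus pointwise 1 x 1 pass."""
    return k * k * c_in * h * w + c_in * c_out * h * w


def scaled_channels(channels: int, alpha: float = 0.75) -> int:
    """Channel count after applying the width multiplier alpha."""
    return max(1, int(round(channels * alpha)))


# Hypothetical layer: 3 x 3 kernel, 64 -> 128 channels, 52 x 52 output map.
ratio = depthwise_separable_macs(3, 64, 128, 52, 52) / standard_conv_macs(3, 64, 128, 52, 52)
```

For this hypothetical layer the ratio is about 0.12, and applying α = 0.75 to both channel counts reduces the pointwise term further by roughly α².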

5.1.3. Evaluation Metrics and Baseline Methods

The evaluation methodology encompasses both quantitative performance metrics and qualitative assessments of system usability and practical effectiveness. Primary quantitative metrics include detection accuracy (precision, recall, F1-score), response time, computational efficiency, and system reliability under various operational conditions. Specialized metrics for construction safety applications include hazard severity assessment accuracy, false alarm rates, and time-to-intervention measurements.
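For reference, the core detection metrics follow the standard definitions based on true positives, false positives, false negatives, and true negatives. A minimal sketch (function names are illustrative):

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1-score from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def false_alarm_rate(fp: int, tn: int) -> float:
    """Share of hazard-free samples that were wrongly flagged."""
    return fp / (fp + tn) if (fp + tn) else 0.0
```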
Baseline comparison methods were selected to represent current state-of-the-art approaches in construction safety and computer vision. These include: (1) traditional computer vision methods using Faster R-CNN and standard YOLOv5 architectures, (2) multimodal fusion approaches employing fixed weighting strategies, (3) existing digital twin platforms without cognitive-inspired components, and (4) commercial construction safety monitoring systems currently deployed in industry settings.
The experimental design incorporated both controlled laboratory evaluations and real-world deployment assessments. Laboratory experiments provided precise control over environmental variables and systematic evaluation of individual system components, while field deployments demonstrated practical effectiveness under actual construction conditions with inherent variability and complexity.

5.2. Component-Level Performance Analysis

5.2.1. Cognitive-Inspired Attention Mechanism Evaluation

The effectiveness of the cognitive-inspired attention mechanism was evaluated through both synthetic and real-world scenarios, with particular focus on return inhibition capabilities and temporal memory performance. Experimental results demonstrate significant improvements in visual search efficiency compared to conventional attention mechanisms.
The return inhibition neural network achieved superior performance in avoiding redundant scanning patterns, with a 34% reduction in repeated attention to previously examined regions compared to standard attention mechanisms. Temporal memory components showed effective learning of inspector search strategies, with convergence to expert-level performance patterns achieved within 2847 training iterations. The multi-scale spatial attention integration demonstrated robust performance across diverse hazard scales, achieving 89.3% accuracy for small-scale hazards (personal protective equipment violations) and 94.7% accuracy for large-scale hazards (equipment positioning conflicts).
Comparative analysis with human inspector performance revealed that the cognitive-inspired system successfully captured essential characteristics of expert visual search behavior while overcoming limitations in human attention span and memory capacity. Eye-tracking studies with experienced inspectors validated the biological plausibility of the attention patterns, showing a 76% correlation between system attention maps and human fixation sequences.
The system demonstrated adaptive learning capabilities, with performance improvements of 12–18% observed over extended operation periods as the temporal memory component accumulated experience from diverse construction scenarios. Ablation studies confirmed the individual contributions of return inhibition (8.3% improvement), multi-scale attention (11.7% improvement), and temporal memory (14.2% improvement) components to overall detection performance.

5.2.2. Multimodal Fusion Architecture Assessment

Our adaptive multimodal fusion architecture demonstrated substantial improvements over fixed fusion strategies and single-modality approaches. The hybrid BERT-Word2Vec text processing pipeline achieved 92.4% accuracy in safety-critical event extraction from inspection reports, representing a 15.8% improvement over BERT-only approaches and 23.6% improvement over traditional keyword-based methods.
The optimized MobileNet-YOLOv5 visual processing component achieved real-time performance with 47.3 FPS throughput while maintaining detection accuracy of 88.9% mAP@0.5 across all hazard categories. Knowledge distillation from larger teacher models contributed 7.2% accuracy improvement while reducing model size by 62% compared to standard YOLOv5 implementations. Depthwise separable convolutions reduced computational requirements by 41% with minimal impact on detection performance.
Adaptive fusion weighting mechanisms showed superior performance compared to fixed weighting strategies, with 16.4% improvement in overall hazard detection accuracy. The system demonstrated effective handling of conflicting information between modalities, correctly resolving 78% of cases where individual modalities provided contradictory assessments. Uncertainty quantification through Monte Carlo dropout provided reliable confidence estimates, enabling risk-aware decision-making in safety-critical applications.
Monte Carlo dropout employs a dropout rate of p = 0.2 with 20 forward passes to compute prediction variance as the uncertainty measure. When uncertainty exceeds the threshold τ = 0.3, the system generates a ‘low confidence’ alert prompting manual review. Based on precision-recall curve optimization, the optimal operating point was set at a confidence threshold of 0.42, achieving 85.3% precision at a 90% recall constraint.
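The procedure can be sketched without a deep learning framework by applying random dropout masks to a toy linear scorer. The scorer, its weights, and the function names are hypothetical stand-ins for the actual network. The text does not state the scale on which the variance is compared with τ = 0.3, so the sketch applies the threshold to raw output scores as an assumption.

```python
import random
from statistics import mean, pvariance
from typing import List, Sequence, Tuple


def mc_dropout_passes(features: Sequence[float], weights: Sequence[float],
                      p: float = 0.2, passes: int = 20, seed: int = 0) -> List[float]:
    """Run stochastic forward passes of a toy linear scorer with inverted dropout."""
    if not 0.0 <= p < 1.0:
        raise ValueError("dropout rate must lie in [0, 1)")
    rng = random.Random(seed)
    keep = 1.0 - p
    outputs = []
    for _ in range(passes):
        # Each unit is kept with probability 1 - p and rescaled by 1 / (1 - p).
        score = sum(w * x / keep for w, x in zip(weights, features) if rng.random() >= p)
        outputs.append(score)
    return outputs


def summarize(outputs: Sequence[float], tau: float = 0.3) -> Tuple[float, float, bool]:
    """Return mean prediction, variance, and whether a low-confidence alert fires."""
    variance = pvariance(outputs)
    return mean(outputs), variance, variance > tau
```

In a PyTorch model the same effect is obtained by keeping dropout layers active at inference time and repeating the forward pass.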
Cross-modal validation experiments confirmed the complementary nature of different data sources, with combined multimodal approaches achieving 91.7% detection accuracy compared to 76.8% for visual-only, 68.4% for text-only, and 59.2% for sensor-only approaches. The adaptive fusion mechanism automatically adjusted modality weights based on environmental conditions, demonstrating increased reliance on textual data during low-visibility conditions and enhanced sensor integration during extreme weather events.

5.2.3. Digital Twin Integration Performance

The digital twin integration demonstrated effective real-time synchronization capabilities with an average latency of 147 ms for critical safety information and 2.3 s for comprehensive model updates. Adaptive synchronization protocols successfully prioritized safety-critical data, achieving 100% on-time delivery for high-priority hazard alerts while maintaining overall system performance.
The four-layer digital twin architecture showed robust scalability, effectively managing data streams from up to 156 simultaneous sensor nodes without performance degradation. Virtual model accuracy maintained 94.6% geometric correspondence with physical construction sites, with semantic information accuracy reaching 89.1% for safety-relevant elements.
Closed-loop feedback mechanisms enabled continuous system improvement, with 23% performance enhancement observed over 12-month deployment periods. The continuous learning framework successfully adapted to new construction techniques and emerging hazard patterns, demonstrating 34% improvement in detection accuracy for previously unseen hazard types after exposure to 500+ training examples.
Interoperability assessments confirmed successful integration with existing BIM–GIS platforms, achieving 92% data compatibility across different software environments. The system demonstrated effective handling of legacy data formats while supporting modern industry standards, including IFC 4.0 and CityGML 3.0.

5.3. Comparative Analysis and Benchmark Results

5.3.1. Performance Comparison with Baseline Methods

Comprehensive comparison with baseline methods demonstrates the superior performance of the cognitive-inspired multimodal framework across all evaluation metrics, as summarized in Table 2. The proposed system achieved 91.7% overall detection accuracy, representing improvements of 18.2% over traditional computer vision methods, 13.5% over existing multimodal approaches, and 24.8% over commercial safety monitoring systems.
To validate generalization capability, leave-one-site-out cross-validation was performed. With each of Sites A–E held out in turn as the test set, the accuracies were 90.1%, 89.2%, 88.2%, 89.8%, and 91.5%, respectively, averaging 89.8% (vs. 91.7% with mixed training), which indicates good cross-site generalization. The lowest accuracy, at Site C (mountainous terrain), reflects the challenges posed by complex topography.
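The leave-one-site-out protocol and the reported average can be reproduced with a short sketch. The per-site accuracies are those reported above; the splitting helper is an illustrative assumption about how the folds were formed.

```python
from statistics import mean
from typing import Dict, Iterator, List, Tuple

# Per-site test accuracies (%) as reported for Sites A-E.
site_accuracy: Dict[str, float] = {"A": 90.1, "B": 89.2, "C": 88.2, "D": 89.8, "E": 91.5}


def leave_one_site_out(sites: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training sites, held-out test site) for each fold."""
    for held_out in sites:
        yield [s for s in sites if s != held_out], held_out


folds = list(leave_one_site_out(sorted(site_accuracy)))
mean_accuracy = mean(list(site_accuracy.values()))
gap_to_mixed = 91.7 - mean_accuracy
```

The unrounded mean is 89.76%, which is about 1.9 percentage points below the mixed-training accuracy.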
Response time measurements demonstrate significant improvements in system reactivity, with the framework achieving 147 ms average response time compared to 234–1247 ms for baseline methods. This improvement is particularly critical for safety applications where rapid hazard detection and alert generation can prevent accidents and injuries.
Statistical significance testing using paired t-tests confirmed that performance improvements are statistically significant (p < 0.001) across all evaluation metrics. Effect size calculations revealed large practical significance (Cohen’s d > 0.8) for accuracy improvements, indicating that observed differences represent meaningful practical improvements rather than statistical artifacts.
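A minimal sketch of the paired comparison is shown below. The accuracy values are hypothetical placeholders rather than data from this study, and the effect size is the paired-samples variant (mean difference divided by the standard deviation of the differences); the text does not state which variant of Cohen's d was used.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_and_cohens_d(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float, int]:
    """Return (t statistic, paired-samples Cohen's d, degrees of freedom)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)  # sample standard deviation of the differences
    t_stat = mean_diff / (sd_diff / sqrt(n))
    cohens_d = mean_diff / sd_diff
    return t_stat, cohens_d, n - 1


# HYPOTHETICAL per-fold accuracies, for illustration only.
proposed = [0.92, 0.91, 0.93, 0.90, 0.92]
baseline = [0.74, 0.73, 0.76, 0.72, 0.75]
t_stat, d, dof = paired_t_and_cohens_d(proposed, baseline)
```

The p-value follows from the t distribution with the returned degrees of freedom, for example via a statistics library.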

5.3.2. Ablation Study Results

Systematic ablation studies quantified the individual contributions of each framework component to overall system performance. The cognitive-inspired attention mechanism contributed 14.2% accuracy improvement, multimodal fusion added 16.4% improvement, and digital twin integration provided 8.7% enhancement over baseline single-modality approaches.
Component interaction analysis revealed synergistic effects between cognitive attention and multimodal fusion, with combined implementation achieving 3.8% additional improvement beyond individual component contributions. This synergy stems from attention mechanisms guiding multimodal fusion weights based on visual focus patterns, creating more effective integration of heterogeneous data sources.
Temporal memory components showed increasing effectiveness over extended operation periods, with performance gains of 5–12% observed after 6-month deployment periods. This improvement reflects the system’s ability to learn from operational experience and adapt to specific construction site characteristics and hazard patterns.

5.3.3. Real-World Deployment Assessment

Real-world deployment across five construction sites demonstrated robust performance under diverse operational conditions. The system maintained 91.7% average detection accuracy across different weather conditions, construction phases, and site complexities. Performance variability remained within acceptable bounds (±4.2%) despite significant environmental and operational variations.
User acceptance studies with construction safety personnel revealed high satisfaction rates (87% positive feedback) and effective integration with existing safety protocols. Inspector feedback highlighted the system’s ability to identify hazards that might be overlooked during manual inspections, with 73% of users reporting enhanced situational awareness and improved safety decision-making capabilities.
In-depth interviews revealed some initial resistance among workers, primarily due to concerns about privacy and the intensity of work supervision. After a two-week adaptation period, however, acceptance increased significantly. Acceptance patterns differed across age groups and technical proficiency levels: younger workers were generally more receptive to the technology, while experienced workers required additional time to appreciate the system’s benefits.
Cost–benefit analysis indicated a positive return on investment within 8.7 months of deployment, with primary benefits derived from reduced accident rates (34% decrease), improved inspection efficiency (42% time reduction), and enhanced regulatory compliance (97% compliance rate compared to 78% baseline). Long-term deployment data over 18 months confirmed sustained performance benefits and continued system improvement through operational learning. Comprehensive evaluation results for the five deployment sites are presented in Table 3.
While the initial ROI period is 8.7 months, the estimated 3-year total cost of ownership analysis indicates that annual maintenance costs represent approximately 15% of the initial investment, primarily for model updates and hardware maintenance. This suggests that long-term operational planning needs to account for these ongoing costs, particularly for small to medium-sized projects where the relative cost burden may be higher.
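The cost figures can be combined in a short worked calculation. Let $C_0$ denote the initial investment, which is left symbolic because no absolute cost is given here. The second line additionally assumes that the payback period is measured against $C_0$ alone and that benefits accrue uniformly over time; neither assumption is stated in the text.

```latex
\begin{align}
  \mathrm{TCO}_{3\,\mathrm{yr}} &= C_0 + 3 \times 0.15\,C_0 = 1.45\,C_0 \\
  B_{\mathrm{month}} &\approx \frac{C_0}{8.7} \approx 0.115\,C_0
\end{align}
```

Under these assumptions, maintenance adds 45% to the initial investment over three years, while the implied average monthly benefit is roughly 11.5% of that investment.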
The experimental results demonstrate that the cognitive-inspired multimodal deep learning framework achieves substantial improvements in construction safety monitoring while maintaining practical feasibility for real-world deployment. The combination of biological inspiration, advanced machine learning techniques, and integrated digital twin management provides a robust foundation for achieving zero-accident construction goals.

6. Discussion

6.1. Theoretical Contributions and Implications

6.1.1. Advancement in Cognitive-Inspired Computing

This research makes significant theoretical contributions to the intersection of cognitive science and computational safety systems. The successful implementation of return inhibition mechanisms in neural network architectures demonstrates the viability of translating biological vision principles into practical computational frameworks [46]. This achievement extends beyond mere algorithmic innovation to establish a new paradigm for developing AI systems that genuinely incorporate human expertise rather than simply replacing it.
The cognitive-inspired attention mechanism addresses a fundamental limitation in current computer vision approaches: the lack of systematic, experience-guided search strategies. By modeling the temporal dynamics of human visual attention, the framework bridges the gap between bottom-up feature detection and top-down task-specific guidance. This integration represents a conceptual advancement that challenges the prevailing paradigm of purely data-driven approaches in safety-critical applications.
The mathematical formalization of return inhibition and temporal memory mechanisms provides a rigorous foundation for future research in cognitive-inspired computing. The approach demonstrates that complex cognitive phenomena can be effectively modeled through tractable mathematical frameworks without sacrificing biological plausibility. This contribution opens new avenues for incorporating other cognitive principles, such as visual working memory, spatial reasoning, and decision-making under uncertainty, into computational systems.
Furthermore, the demonstrated synergy between cognitive modeling and deep learning architectures suggests that biological inspiration can enhance rather than constrain artificial intelligence capabilities. The 14.2% performance improvement attributed to cognitive components validates the hypothesis that human expertise contains valuable inductive biases that can guide machine learning systems toward more effective solutions.

6.1.2. Multimodal Learning Architecture Innovation

The adaptive multimodal fusion architecture presents novel theoretical insights into the integration of heterogeneous data sources for safety applications. Traditional multimodal learning approaches typically employ fixed fusion strategies that fail to account for the dynamic nature of real-world environments. The attention-based adaptive weighting mechanism addresses this limitation by enabling context-sensitive integration that mirrors human multisensory processing.
The hybrid BERT-Word2Vec text processing pipeline represents a significant methodological contribution that balances contextual understanding with computational efficiency. This approach resolves the trade-off between semantic depth and processing speed that has constrained previous implementations of natural language processing in real-time safety applications. The 15.8% improvement over BERT-only approaches demonstrates that hybrid architectures can achieve superior performance while maintaining practical feasibility.
The theoretical framework for uncertainty quantification in multimodal settings provides essential foundations for safety-critical applications where confidence estimation is paramount [47]. The Monte Carlo dropout implementation enables principled reasoning about prediction reliability, addressing a critical gap in current multimodal learning approaches that typically provide point estimates without uncertainty bounds.

6.1.3. Digital Twin Paradigm Extension

The digital twin implementation extends current paradigms by establishing genuine bidirectional coupling between physical and virtual environments. Unlike traditional digital twins that primarily serve monitoring functions, the system incorporates predictive capabilities and autonomous decision-making processes that enable proactive hazard management. This advancement transforms digital twins from passive monitoring tools into active safety management systems [48].
The four-layer architecture provides a scalable framework for integrating diverse cyber-physical systems while maintaining real-time performance requirements. The theoretical contribution lies in demonstrating how cognitive-inspired algorithms can be effectively integrated with digital twin platforms to create intelligent systems that learn and adapt over time.
The adaptive synchronization protocols address fundamental challenges in cyber-physical systems by enabling dynamic resource allocation based on safety criticality. This contribution has implications beyond construction safety, providing principles applicable to other domains requiring real-time monitoring and intervention capabilities.

6.2. Practical Implications and Industry Applications

6.2.1. Construction Safety Management Transformation

The experimental results demonstrate that the framework can fundamentally transform construction safety management practices by shifting from reactive incident response to proactive hazard prevention. The 34% reduction in accident rates observed during real-world deployments represents a substantial practical impact that translates directly into improved worker safety and reduced project costs.
The system’s ability to identify hazards that might be overlooked during manual inspections addresses a critical vulnerability in current safety protocols. With 73% of users reporting enhanced situational awareness, the framework augments rather than replaces human expertise, creating a collaborative approach that leverages both artificial intelligence capabilities and human judgment.
Successful system deployment requires careful change management strategies beyond technical implementation. This includes establishing phased rollout plans that allow workers to gradually adapt to the technology, designing incentive mechanisms that encourage system adoption, and addressing workers’ concerns about privacy and automated monitoring. The research findings suggest that transparent communication about system benefits and limitations is crucial for building trust and acceptance among construction personnel.
Integration with existing BIM–GIS platforms supports practical deployment without requiring wholesale replacement of current construction management systems. The 92% data compatibility achieved across different software environments demonstrates that advanced AI capabilities can be incorporated into existing workflows with minimal disruption.
The cost–benefit analysis, which shows a positive return on investment within 8.7 months, supports the case for wider adoption. The primary benefits, derived from reduced accident rates, improved inspection efficiency, and enhanced regulatory compliance, create a business case that extends beyond safety considerations to encompass operational and financial advantages.

6.2.2. Scalability and Generalization Potential

The framework’s modular architecture and standardized interfaces enable scalability across diverse construction projects and organizational contexts. The successful deployment across five different construction sites with varying environmental conditions and project characteristics demonstrates broad applicability beyond the specific experimental settings.
Economic considerations for scaled deployment reveal both opportunities and challenges. While large projects may benefit from economies of scale that reduce per-unit costs, small to medium-sized projects may face relatively higher cost burdens. The analysis suggests that cloud-based service models and shared infrastructure approaches could help optimize cost structures across different project scales. Additionally, the marginal cost of expanding from single-site to multi-site deployment decreases significantly after initial infrastructure investment, though this requires careful coordination and standardization across projects.
The continuous learning capabilities ensure that system performance improves over time as it accumulates experience from diverse construction scenarios. This adaptive behavior is particularly valuable for addressing the inherent variability in construction projects, where each site presents unique challenges and hazard patterns.
The cognitive-inspired components provide natural mechanisms for incorporating domain expertise from different construction specialties. Safety inspectors, equipment operators, and project managers can contribute their specialized knowledge through the human-in-the-loop learning processes, creating systems that benefit from collective human expertise.

6.2.3. Regulatory and Standardization Implications

The framework’s comprehensive documentation and audit capabilities support regulatory compliance requirements while providing detailed evidence of safety management effectiveness. The 97% compliance rate achieved, compared with a 78% baseline, demonstrates significant potential for improving industry-wide safety standards.
The standardized data formats and interoperability protocols developed for the system could contribute to emerging industry standards for digital construction safety management. The successful integration with IFC 4.0 and CityGML 3.0 standards positions the approach as compatible with evolving industry infrastructure [49].
The uncertainty quantification capabilities provide the transparency and reliability assessment required for regulatory acceptance of AI-based safety systems. The ability to provide confidence estimates for hazard detection results enables risk-based decision-making that satisfies regulatory requirements for safety-critical applications.
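One simple way to turn confidence estimates into a risk-based decision is to route uncertain detections to a human inspector. The sketch below uses normalized predictive entropy as the uncertainty measure; the function names, the hazard labels, and the 0.5 threshold are assumptions for illustration and are not taken from the framework.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def triage(probs, labels, max_entropy_ratio=0.5):
    """Return (label, confidence, action) for one detection.

    The entropy is divided by log(K), its maximum for K classes, so the
    threshold is comparable across label sets of different sizes.
    """
    if len(probs) != len(labels) or len(probs) < 2:
        raise ValueError("need one probability per label and at least two labels")
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    best = max(range(len(probs)), key=probs.__getitem__)
    ratio = predictive_entropy(probs) / math.log(len(probs))
    action = "auto_alert" if ratio <= max_entropy_ratio else "human_review"
    return labels[best], probs[best], action


# Hypothetical label set and model outputs.
labels = ["fall_hazard", "struck_by", "no_hazard"]
confident = triage([0.90, 0.05, 0.05], labels)
ambiguous = triage([0.40, 0.35, 0.25], labels)
print(confident)
print(ambiguous)
```

The threshold would need to be calibrated against the acceptable rate of missed hazards for a given site; the value here is only a placeholder.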

6.3. Limitations and Challenges

6.3.1. Technical Limitations

Despite substantial performance improvements, the framework faces several technical limitations that constrain its applicability in certain scenarios. The cognitive-inspired attention mechanism, while effective for typical construction environments, may struggle with highly unusual or unprecedented hazard configurations that fall outside the training distribution. The system’s reliance on learned patterns from human expertise means that it may perpetuate existing biases or blind spots in current safety practices.
While the system achieves 91.7% accuracy for common hazard types, performance for extremely rare events (occurrence probability < 0.1%) remains uncertain. The 18-month data collection period, though extensive, may not adequately capture seasonal extremes or once-in-a-century events such as major earthquakes or floods. Furthermore, the dataset primarily reflects safety practices from specific geographical and cultural contexts, potentially limiting the system’s effectiveness when deployed in regions with different safety standards, work practices, or risk perceptions.
The multimodal fusion architecture requires high-quality data from all modalities to achieve optimal performance. In scenarios where one or more data sources are degraded due to equipment failures, environmental conditions, or communication disruptions, system performance may deteriorate significantly. The adaptive weighting mechanism partially mitigates this issue but cannot fully compensate for severely compromised data streams.
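The effect of adaptive weighting under degraded inputs can be sketched with a quality-weighted late fusion, in which each modality's hazard score is weighted by a softmax over per-modality quality estimates. This is an illustrative stand-in, not the fusion architecture of the framework; the function `fuse`, the quality scores, and the temperature value are assumptions.

```python
import math


def fuse(scores, quality, temperature=0.25):
    """Fuse per-modality hazard scores using quality-derived weights.

    scores:  dict mapping modality name to a hazard score in [0, 1]
    quality: dict mapping modality name to an assumed quality estimate in [0, 1]
    A modality with zero quality (for example, a failed sensor) is excluded.
    """
    usable = {m: q for m, q in quality.items() if m in scores and q > 0.0}
    if not usable:
        raise ValueError("no usable modality")
    logits = {m: q / temperature for m, q in usable.items()}
    peak = max(logits.values())                    # subtract for numerical stability
    exps = {m: math.exp(v - peak) for m, v in logits.items()}
    total = sum(exps.values())
    weights = {m: e / total for m, e in exps.items()}
    fused = sum(weights[m] * scores[m] for m in weights)
    return fused, weights


# Hypothetical case: the camera is degraded (e.g., by fog) and sensors are healthy.
scores = {"visual": 0.2, "text": 0.7, "sensor": 0.9}
quality = {"visual": 0.1, "text": 0.6, "sensor": 0.9}
fused, weights = fuse(scores, quality)
print(round(fused, 3), {m: round(w, 3) for m, w in weights.items()})
```

The sketch also shows the limitation noted above: if every stream is degraded, the weights are still normalized to one, so the fused score is only as reliable as the best remaining input.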
Computational requirements, while optimized for real-time performance, still exceed the capabilities of basic edge computing devices. The minimum hardware specifications may present barriers to adoption for smaller construction projects or organizations with limited technology budgets. The energy consumption of continuous AI processing may also pose challenges for remote construction sites with limited power infrastructure.
The system’s effectiveness depends heavily on the quality and completeness of initial training data. Construction projects in regions or contexts not well-represented in the training dataset may experience reduced performance. The need for domain-specific fine-tuning could require significant data collection and annotation efforts for new application domains.

6.3.2. Operational Challenges

Integration with existing construction workflows requires careful change management to ensure user acceptance and effective utilization. While user satisfaction rates were high in our experimental deployments, broader adoption may encounter resistance from organizations with established safety protocols or limited technology adoption experience.
Long-term system operation requires dedicated technical support teams, which may increase operational overhead for construction organizations. Model updates need to occur regularly to maintain performance as construction practices evolve, with update frequency depending on the rate of change in local construction methods and regulations. Hardware maintenance cycles, typically 3–5 years for edge computing devices, represent additional considerations for total cost planning. Training requirements vary significantly across user groups: safety managers require a comprehensive understanding of system capabilities and limitations (typically 16 h initial training), while field workers need focused operational training (4–8 h in phased sessions). The research indicates that phased training approaches are more effective than intensive single sessions, particularly for workers with limited technical backgrounds.
The system generates substantial amounts of data and alerts that require appropriate interpretation and response protocols. Without proper training and support systems, construction personnel may experience information overload or develop inappropriate reliance on automated recommendations. Balancing system autonomy with human oversight remains a critical challenge for practical deployment.
A particular risk emerges when technical personnel become overly reliant on automated suggestions, potentially diminishing their own hazard recognition skills over time. This “automation complacency” could be problematic during system failures or in situations requiring intuitive judgment beyond algorithmic capabilities. Regular manual inspection exercises and continuous professional development programs are essential to maintain human expertise alongside automated systems [50].
Privacy and data security concerns may limit the willingness of construction organizations to implement comprehensive monitoring systems. The extensive data collection required for effective operation must be balanced against worker privacy rights and proprietary information protection. Regulatory frameworks for AI-based workplace monitoring are still evolving, creating uncertainty about long-term compliance requirements.

6.3.3. Scalability and Generalization Challenges

While our experimental validation covered diverse construction scenarios, the generalization to fundamentally different construction types (e.g., underground construction, marine environments, extreme climate conditions) remains unproven. The cognitive models and attention mechanisms may require substantial modification for contexts that differ significantly from highway construction environments.
When the monitoring scale increases from a single site to multiple parallel projects, the quadratic complexity O(n²) of the attention mechanisms may cause response times to increase from the current 147 ms to potentially 500–800 ms or more for a 10× scale expansion. This latency increase could be problematic for time-critical safety interventions where sub-second response is essential. Optimization strategies such as hierarchical attention or approximate nearest neighbor methods may be necessary for large-scale deployments.
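The scaling concern can be made concrete by counting pairwise attention interactions. In the sketch below, flat self-attention over n tokens costs n² interactions, while a two-level scheme attends within fixed-size groups and then across one summary token per group. The token counts, the group size, and the two-level scheme itself are assumptions chosen for illustration; the source does not specify how a hierarchical variant would be built.

```python
def flat_pairs(n_tokens):
    """Pairwise interactions for full self-attention over n tokens."""
    return n_tokens * n_tokens


def hierarchical_pairs(n_tokens, group_size):
    """Interactions for attention within groups plus attention across
    one summary token per group (a simple two-level scheme)."""
    if n_tokens <= 0 or group_size <= 0:
        raise ValueError("n_tokens and group_size must be positive")
    n_groups = -(-n_tokens // group_size)          # ceiling division
    within = 0
    remaining = n_tokens
    for _ in range(n_groups):
        g = min(group_size, remaining)             # last group may be smaller
        within += g * g
        remaining -= g
    across = n_groups * n_groups
    return within + across


# Hypothetical sizes: 1000 tokens for one site, 10000 for a 10x expansion.
single_site = flat_pairs(1_000)
ten_sites_flat = flat_pairs(10_000)
ten_sites_hier = hierarchical_pairs(10_000, group_size=100)
print(ten_sites_flat // single_site)
print(ten_sites_hier / single_site)
```

A purely quadratic term grows 100-fold under a 10× expansion, so the end-to-end latency depends on how much of the 147 ms is spent in attention, a breakdown the source does not report. Under the assumed grouping, the two-level count grows roughly linearly instead.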
The framework’s dependence on high-speed network connectivity for digital twin synchronization may limit its applicability in remote or poorly connected construction sites. While edge computing capabilities provide some autonomy, the full benefits of the integrated system require a reliable communication infrastructure that may not be available in all deployment contexts.
Scaling the system to very large construction projects or multiple simultaneous sites may reveal performance bottlenecks not apparent in our experimental setup. The quadratic complexity of some attention mechanisms could become problematic when processing data from hundreds of sensors or monitoring extensive construction areas.
Cultural and regulatory differences across international markets may require significant adaptation of the cognitive models and safety ontologies. Construction practices, safety regulations, and hazard priorities vary substantially between regions, potentially requiring extensive localization efforts for global deployment.

6.4. Future Research Directions

6.4.1. Cognitive Modeling Enhancements

Future research should explore the integration of additional cognitive principles that could further enhance system performance. Working memory models, spatial reasoning capabilities, and decision-making under uncertainty represent promising areas for expanding the cognitive-inspired components. The incorporation of metacognitive processes that enable the system to monitor and evaluate its own performance could provide additional layers of reliability and adaptation.
Advanced attention mechanisms that model collaborative visual search between multiple inspectors could enable more sophisticated monitoring of complex construction environments. The development of cognitive load assessment capabilities could optimize information presentation to human operators, ensuring that AI assistance enhances rather than overwhelms human decision-making capabilities.
The exploration of transfer learning approaches for cognitive models could enable more efficient adaptation to new construction domains or safety requirements. Understanding how cognitive patterns learned in one context can be effectively transferred to related but distinct scenarios represents a critical research challenge with substantial practical implications.

6.4.2. Advanced AI Integration

The integration of emerging AI technologies such as large language models, diffusion models, and neural-symbolic reasoning could significantly enhance system capabilities. Large language models could provide more sophisticated natural language understanding for safety documentation and enable more natural human–system interaction through conversational interfaces.
Generative AI capabilities could enable the system to simulate potential hazard scenarios and evaluate intervention strategies before implementation. This predictive capability could transform the framework from reactive hazard detection to proactive risk management through scenario planning and optimization.
The incorporation of causal reasoning capabilities could enable the system to understand not just what hazards exist but why they occur and how they might be prevented. This deeper understanding could support more effective intervention strategies and contribute to fundamental improvements in construction safety practices.

6.4.3. Ecosystem Integration and Standardization

Future development should focus on creating comprehensive ecosystems that integrate safety management with broader construction project management systems. The connection between safety monitoring and scheduling, resource allocation, and quality management could enable holistic optimization of construction processes.
Cross-cultural adaptation represents a critical area for future investigation. Research should explore how safety perception, risk tolerance, and work practices vary across different cultural contexts, and how cognitive-inspired systems can be adapted to respect these differences while maintaining safety standards. This includes developing culturally aware training materials, adapting user interfaces to local preferences, and incorporating region-specific safety regulations and standards into the system’s knowledge base.
The development of industry-wide standards for AI-based safety monitoring could accelerate adoption and ensure interoperability between different systems and vendors. Collaborative efforts with industry organizations, regulatory bodies, and technology providers could establish common frameworks that benefit the entire construction industry.
Research into federated learning approaches could enable collaborative improvement of safety systems across multiple organizations and projects while preserving data privacy and proprietary information. This approach could accelerate the development of more robust and generalizable safety monitoring capabilities.
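The aggregation step at the core of many federated learning schemes can be sketched as a sample-size-weighted average of locally trained parameters (federated averaging). The flat parameter vectors, client sizes, and function name below are hypothetical; the sketch omits local training, secure aggregation, and communication, and it describes a possible direction rather than anything implemented in this study.

```python
def federated_average(client_weights, client_sizes):
    """Average parameter vectors, weighting each client by its sample count.

    client_weights: list of equal-length lists of floats (one per organization)
    client_sizes:   list of local training-set sizes, in the same order
    Only parameters are shared; raw site data never leaves the client.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one size per client and at least one client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all clients must share the same model shape")
        for i, value in enumerate(weights):
            averaged[i] += value * size / total
    return averaged


# Two hypothetical contractors with different amounts of local data.
global_model = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_model)
```

Weighting by sample count means that a contractor with more site data has proportionally more influence on the shared model, which may or may not be desirable when hazard distributions differ between organizations.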
The exploration of integration with emerging technologies such as augmented reality, autonomous vehicles, and robotic construction systems could create synergistic effects that further enhance construction safety and efficiency. Understanding how AI-based safety monitoring can support and be supported by these complementary technologies represents an important frontier for future research and development.

7. Conclusions

This research presents a novel cognitive-inspired multimodal deep learning framework for intelligent hazard identification and digital management in highway construction, addressing the critical challenge of achieving zero-accident construction goals through the integration of biological vision principles, advanced machine learning techniques, and real-time digital twin technologies.
The framework makes three primary contributions to construction safety management. First, the study developed a cognitive-inspired visual attention mechanism that models human inspector expertise through return inhibition and temporal memory components, achieving a 14.2% performance improvement over conventional attention mechanisms. Second, the study implemented an adaptive multimodal fusion architecture that dynamically integrates visual, textual, and sensor data streams, demonstrating a 16.4% improvement in hazard detection accuracy compared to fixed fusion strategies. Third, the research established a closed-loop digital twin system with BIM–GIS integration that enables real-time hazard tracking and continuous learning, achieving 91.7% overall detection accuracy with a 147 ms response time.
Experimental validation across five highway construction projects demonstrated substantial practical benefits, including a 34% reduction in accident rates, a 42% improvement in inspection efficiency, and a positive return on investment within 8.7 months. The system achieved 92% compatibility with existing BIM–GIS platforms and maintained robust performance across diverse environmental conditions. User acceptance studies revealed 87% satisfaction rates among construction safety personnel, confirming effective integration with existing safety protocols while enhancing situational awareness and decision-making capabilities.
Future research directions include the integration of additional cognitive principles such as metacognitive monitoring and collaborative visual search, the incorporation of emerging AI technologies, including large language models and causal reasoning capabilities, and the development of industry-wide standards for AI-based safety monitoring systems. The framework’s modular architecture and demonstrated scalability provide a foundation for extending these capabilities to other construction domains and integrating with emerging technologies such as augmented reality and autonomous construction systems. The successful combination of cognitive science principles with advanced machine learning represents a promising paradigm for developing intelligent systems that enhance rather than replace human expertise in safety-critical applications.
Despite these advances, this study acknowledges several limitations that require attention in future research. The dataset, while comprehensive in scope, may not fully capture extremely rare high-impact events or adequately represent cross-cultural variations in safety practices. Long-term maintenance costs and scalability challenges for small to medium-sized projects need further investigation. Additionally, the risk of over-reliance on automated systems and the need for continuous human expertise development remain important considerations for practical implementation.
This research contributes to the broader vision of sustainable and safe construction practices, though achieving truly zero-accident construction will require continued collaboration between academia, industry, and regulatory bodies. The integration of cognitive-inspired AI with digital twin technologies offers a pathway toward this goal, but success ultimately depends on thoughtful implementation, continuous improvement, and maintaining the critical balance between technological innovation and human expertise.

Author Contributions

Conceptualization, J.Z.; methodology, J.Z. and X.M.; software, X.M.; validation, J.Z., X.M. and Z.L.; formal analysis, X.M.; investigation, Z.L. and Z.S.; resources, J.Z.; data curation, J.Z.; writing—original draft preparation, J.Z., X.M., Z.L. and Z.S.; writing—review and editing, J.Z. and Z.L.; visualization, C.G.; supervision, J.Z.; project administration, J.Z.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the Ningbo Natural Science Foundation (2023J028).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset generated during and/or analyzed during the current study is available from the corresponding author upon reasonable request.

Conflicts of Interest

Authors Jibiao Zhou and Zhan Shi were employed by the Department of Security, Ningbo Highway Construction & Management Center. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Zhou, J.; Li, Z.; Dong, S.; Sun, J.; Zhang, Y. Visualization and Bibliometric Analysis of E-bike Studies: A Systematic Literature Review (1976–2023). Transp. Res. Part D Transp. Environ. 2023, 122, 103891.
  2. Mao, X.; Dong, S.; Wang, J.; Zhou, J.; Yuan, C.; Zheng, T. Capital-constrained Maintenance Scheduling for Road Networks Considering Traffic Dynamics. Transp. B Transp. Dyn. 2023, 11, 1845–1870.
  3. Cheng, E.; Ryan, N.; Kelly, S. Exploring the Perceived Influence of Safety Management Practices on Project Performance in the Construction Industry. Saf. Sci. 2012, 50, 363–369.
  4. Mao, X.; Zhou, J.; Yuan, C.; Liu, D. Resilience-Based Optimization of Postdisaster Restoration Strategy for Road Networks. J. Adv. Transp. 2021, 2021, 8871876.
  5. Liu, J.; Luo, H.; Liu, H. Deep Learning-Based Data Analytics for Safety in Construction. Autom. Constr. 2022, 140, 104302.
  6. Zhou, J.; Zhang, M.; Ding, H. An ALNS-based Approach for the Traffic-police-routine-patrol-Vehicle Assignment Problem in Resource Allocation Analysis of Traffic Crashes. Traffic Inj. Prev. 2024, 25, 688–697.
  7. Feng, H.; Chen, Q.; de Soto, B. Application of Digital Twin Technologies in Construction: An Overview of Opportunities and Challenges. In Proceedings of the International Symposium on Automation and Robotics in Construction, Dubai, United Arab Emirates, 2–4 November 2021.
  8. Luo, Q.; Sun, C.; Li, Y.; Qi, Z.; Zhang, G. Applications of Digital Twin Technology in Construction Safety Risk Management: A Literature Review. Eng. Constr. Archit. Manag. 2025, 32, 3587–3607.
  9. Huang, Y.; Shih, S.; Yen, K. An Integrated GIS, BIM and Facilities Infrastructure Information Platform Designed for City Management. J. Chin. Inst. Eng. 2021, 44, 293–304.
  10. Fang, W.; Ding, L.; Love, P.E.D.; Luo, H.; Li, H.; Peña-Mora, F.; Zhong, B.; Zhou, C. Computer Vision Applications in Construction Safety Assurance. Autom. Constr. 2020, 110, 103013.
  11. Zhang, Z.; Lin, W.; Liu, M.; Rady, M.M. Multimodal Deep Learning Framework for Mental Disorder Recognition. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2020), Buenos Aires, Argentina, 16–20 November 2020.
  12. Hakimi, O.; Liu, H.; Abudayyeh, O.; Houshyar, A.; Almatared, M.; Alhawiti, A. Data Fusion for Smart Civil Infrastructure Management: A Conceptual Digital Twin Framework. Buildings 2023, 13, 2725.
  13. Fang, Y.; Ni, G.; Gao, F.; Zhang, Q.; Niu, M.; Ding, Z. Influencing Mechanism of Safety Sign Features on Visual Attention of Construction Workers: A Study Based on Eye-Tracking Technology. Buildings 2022, 12, 1883.
  14. Downing, P.; Liu, J.; Kanwisher, N. Testing Cognitive Models of Visual Attention with fMRI and MEG. Neuropsychologia 2001, 39, 1329–1342.
  15. Niu, Z.; Zhong, G.; Yu, H. A Review on the Attention Mechanism of Deep Learning. Neurocomputing 2021, 452, 48–62.
  16. Borji, A.; Itti, L. State-of-the-Art in Visual Attention Modeling. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 185–207.
  17. Tixier, A.; Hallowell, M.; Rajagopalan, B.; Bowman, D. Automated Content Analysis for Construction Safety: A Natural Language Processing System to Extract Precursors and Outcomes from Unstructured Injury Reports. Autom. Constr. 2016, 62, 45–56.
  18. Zhang, L.; Wang, J.; Wang, Y.; Sun, H.; Zhao, X. Automatic Construction Site Hazard Identification Integrating Construction Scene Graphs with BERT Based Domain Knowledge. Autom. Constr. 2022, 142, 104535.
  19. Xiao, B.; Kang, S. Development of an Image Data Set of Construction Machines for Deep Learning Object Detection. J. Comput. Civ. Eng. 2021, 35, 05020005.
  20. Zhang, C.; Wang, H.; Cai, Y.; Chen, L.; Li, Y.; Sotelo, M.A.; Li, Z. Robust-FusionNet: Deep Multimodal Sensor Fusion for 3-D Object Detection under Severe Weather Conditions. IEEE Trans. Intell. Transp. Syst. 2022, 71, 2513713.
  21. Xu, Y.; Zhou, Y.; Sekula, P.; Ding, L. Machine Learning in Construction: From Shallow to Deep Learning. Dev. Built Environ. 2021, 6, 100045.
  22. Li, X.; Yi, W.; Chi, H.; Wang, X.; Chan, A. A Critical Review of Virtual and Augmented Reality (VR/AR) Applications in Construction Safety. Autom. Constr. 2018, 86, 150–162.
  23. Chen, H.; Hou, L.; Zhang, G.; Moon, S. Development of BIM, IoT and AR/VR Technologies for Fire Safety and Upskilling. Autom. Constr. 2021, 125, 103631.
  24. Wu, D.; Zheng, A.; Yu, W.; Cao, H.; Ling, Q.; Liu, J.; Zhou, D. Digital Twin Technology in Transportation Infrastructure: A Comprehensive Survey of Current Applications, Challenges, and Future Directions. Appl. Sci. 2025, 15, 1911.
  25. Kamau, E.; Myllynen, T.; Mustapha, S.; Babatunde, G.; Alabi, A. A Conceptual Model for Real-Time Data Synchronization in Multi-Cloud Environments. Int. J. Multidiscip. Res. Growth Eval. 2024, 5, 1139–1150.
  26. Guyo, E.; Hartmann, T.; Ungureanu, L. Interoperability between BIM and GIS through Open Data Standards: An Overview of Current Literature. In Proceedings of the 9th Linked Data in Architecture and Construction Workshop, Luxembourg, Luxembourg, 11–13 October 2021.
  27. Pereira, E.; Ahn, S.; Han, S.; Abourizk, S. Identification and Association of High-Priority Safety Management System Factors and Accident Precursors for Proactive Safety Assessment and Control. J. Manag. Eng. 2018, 34, 04017041.
  28. Yu, G.; Li, J.; Xiong, J.; Hu, M.; Zeng, R.; Sugumaran, V. Scenario Modeling for Urban Road Emergency Management: A Cognitive Digital Twin–Based Method. J. Comput. Civ. Eng. 2025, 39.
  29. Martínez-Aires, M.; López-Alonso, M.; Martínez-Rojas, M. Building Information Modeling and Safety Management: A Systematic Review. Saf. Sci. 2018, 101, 11–18.
  30. Niaz, A.; Khan, S.; Niaz, F.; Shoukat, M.U.; Niaz, I.; Jia, Y. Smart City IoT Application for Road Infrastructure Safety and Monitoring by Using Digital Twin. In Proceedings of the 2022 International Conference on IT and Industrial Technologies (ICIT), Chiniot, Pakistan, 3–4 October 2022.
  31. Nebauer, C. Evaluation of Convolutional Neural Networks for Visual Recognition. IEEE Trans. Neural Netw. 1998, 9, 685–696.
  32. Soydaner, D. Attention Mechanism in Neural Networks: Where It Comes and Where It Goes. Neural Comput. Appl. 2022, 34, 13371–13385.
  33. Cheng, J.; Dong, L.; Lapata, M. Long Short-Term Memory-Networks for Machine Reading. arXiv 2016, arXiv:1601.06733.
  34. Wu, J.; Yang, Y.; Cheng, X.; Zuo, H.; Zheng, C. The Development of Digital Twin Technology Review. In Proceedings of the 2020 Chinese Automation Congress (CAC), Shanghai, China, 6–8 November 2020.
  35. Yevu, S.K.; Owusu, E.K.; Chan, A.P.C.; Sepasgozar, S.M.E.; Kamat, V.R. Digital Twin-Enabled Prefabrication Supply Chain for Smart Construction and Carbon Emissions Evaluation in Building Projects. J. Build. Eng. 2023, 78, 107598.
  36. Hardin, B.; McCool, D. BIM and Construction Management: Proven Tools, Methods, and Workflows; John Wiley & Sons: Hoboken, NJ, USA, 2015.
  37. Mushtaha, A.W.; Alaloul, W.S.; Musarat, M.A.; Baarimah, A.O.; Rabah, F.K.; Alawag, A.M. BIM-GIS Integration for Infrastructure Management in Post-Disaster Stage. In Proceedings of the IEEE International Conference on Smart Infrastructure and Construction, Manama, Bahrain, 28–29 January 2024.
  38. El Yamani, S.; Hajji, R.; Billen, R. IFC-CityGML Data Integration for 3D Property Valuation. ISPRS Int. J. Geo-Inf. 2023, 12, 351.
  39. Moir, S.; Paquet, V.; Punnett, L.; Buchholz, B.; Wegman, D. Making Sense of Highway Construction: A Taxonomic Framework for Ergonomic Exposure Assessment and Intervention Research. Appl. Occup. Environ. Hyg. 2003, 18, 256–267.
  40. Mizuno, K.; Terachi, Y.; Takagi, K.; Izumi, S.; Kawaguchi, H.; Yoshimoto, M. Architectural Study of HOG Feature Extraction Processor for Real-Time Object Detection. In Proceedings of the 2012 IEEE 19th International Conference on Electronics, Circuits, and Systems, Quebec City, QC, Canada, 17–19 October 2012.
  41. Baidya, R.; Jeong, H. YOLOv5 with ConvMixer Prediction Heads for Precise Object Detection in Drone Imagery. Sensors 2022, 22, 8424.
  42. Pham, H.; Rafieizonooz, M.; Han, S.; Lee, D. Current Status and Future Directions of Deep Learning Applications for Safety Management in Construction. Sustainability 2021, 13, 13579.
  43. Basystiuk, O.; Rybchak, Z.; Zavushchak, I.; Marikutsa, U. Evaluation of Multimodal Data Synchronization Tools. Comput. Des. Syst. Theory Pract. 2024, 6, 104–111.
  44. Kochovski, P.; Stankovski, V. Supporting Smart Construction with Dependable Edge Computing Infrastructures and Applications. Autom. Constr. 2018, 85, 182–192.
  45. Ye, X.; Jin, T.; Ang, P.; Bian, X.; Chen, Y. Computer Vision-based Monitoring of the 3-D Structural Deformation of an Ancient Structure Induced by Shield Tunneling Construction. Struct. Control Health Monit. 2021.
  46. Boella, G.; Janssen, M.; Hulstijn, J.; Humphreys, L.; van der Torre, L. Managing Legal Interpretation in Regulatory Compliance. In Proceedings of the 19th ACM Symposium on Access Control Models and Technologies, London, ON, Canada, 25–27 June 2014.
  47. Fan, C.; Zhang, C.; Yahja, A.; Mostafavi, A. Disaster City Digital Twin: A Vision for Integrating Artificial and Human Intelligence for Disaster Management. Int. J. Inf. Manag. 2021, 56, 102049.
  48. Ball, K. Situating Workplace Surveillance: Ethics and Computer Based Performance Monitoring. Ethics Inf. Technol. 2001, 3, 209–221.
  49. Cao, N.; Cheung, S.; Li, K. Perceptive Biases in Construction Mediation: Evidence and Application of Artificial Intelligence. Buildings 2023, 13, 2460.
  50. Wang, Y.; Chung, S. Artificial Intelligence in Safety-Critical Systems: A Systematic Review. Ind. Manag. Data Syst. 2022, 122, 442–470.
Figure 1. System architecture of the cognitive-inspired multimodal deep learning framework.
Figure 2. Multimodal data fusion pipeline for intelligent hazard identification.
Figure 3. Data collection and annotation workflow.
Figure 4. Distributed computing architecture for real-time hazard detection.
Table 1. Comprehensive dataset characteristics and distributions.

| Data Category | Volume | Specifications | Collection Sites | Annotation Quality |
| --- | --- | --- | --- | --- |
| Visual Data | 32,847 images | 4K resolution | Sites A–E | 85% inter-annotator agreement |
| | 1247 h video | 30–60 FPS | Multi-weather | |
| Textual Data | 15,623 reports | PDF/Word formats | All sites | Expert validation |
| | 8934 incident logs | Multilingual support | Historical archives | Domain ontology |
| | 12,456 compliance docs | - | - | - |
| Sensor Data | 156 IoT nodes | Environmental | Real-time streams | Calibrated sensors |
| | Continuous monitoring | Equipment | Edge processing | Quality assurance |
| | - | Personnel tracking | - | - |
| Ground Truth | 23 hazard categories | Hierarchical taxonomy | Expert annotations | >85% agreement |
| | 87 sub-categories | Severity levels | Consensus validation | Triple annotation |
Table 2. Performance comparison with baseline methods.
| Method | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | Response Time (ms) |
|---|---|---|---|---|---|
| Faster R-CNN | 73.5 | 78.2 | 69.1 | 73.4 | 892 |
| Standard YOLOv5 | 76.8 | 81.3 | 72.4 | 76.6 | 234 |
| Fixed Multimodal Fusion | 78.2 | 83.7 | 74.8 | 79.0 | 187 |
| Commercial System A | 66.9 | 71.2 | 62.8 | 66.7 | 1247 |
| Commercial System B | 69.4 | 74.6 | 65.1 | 69.5 | 956 |
| Proposed Framework | 91.7 | 94.2 | 89.4 | 91.7 | 147 |
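The F1-Score column in Table 2 can be cross-checked against the Precision and Recall columns, since F1 is the harmonic mean of the two. The following is a minimal sketch, not part of the authors' pipeline: the row values are transcribed from Table 2, while the names `TABLE2`, `f1_score`, and `check_table` and the 0.05 rounding tolerance are assumptions introduced here for illustration.

```python
# Rows transcribed from Table 2: (method, precision %, recall %, reported F1 %).
# Names and the tolerance below are illustrative assumptions, not the authors' code.
TABLE2 = [
    ("Faster R-CNN", 78.2, 69.1, 73.4),
    ("Standard YOLOv5", 81.3, 72.4, 76.6),
    ("Fixed Multimodal Fusion", 83.7, 74.8, 79.0),
    ("Commercial System A", 71.2, 62.8, 66.7),
    ("Commercial System B", 74.6, 65.1, 69.5),
    ("Proposed Framework", 94.2, 89.4, 91.7),
]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)


def check_table(rows, tol=0.05):
    """Map each method to (recomputed F1, reported F1, agrees within tol)."""
    result = {}
    for method, precision, recall, reported in rows:
        f1 = f1_score(precision, recall)
        result[method] = (round(f1, 1), reported, abs(f1 - reported) <= tol)
    return result


if __name__ == "__main__":
    for method, (f1, reported, ok) in check_table(TABLE2).items():
        status = "consistent" if ok else "mismatch"
        print(f"{method}: recomputed {f1}, reported {reported} ({status})")
```

Under these assumptions, every reported F1 value agrees with the value recomputed from its row to within one-decimal rounding.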
Table 3. Real-world deployment performance metrics.
| Site Characteristics | Site A | Site B | Site C | Site D | Site E | Average | Std Dev |
|---|---|---|---|---|---|---|---|
| Location | Urban Highway | Rural Highway | Mountain Pass | Coastal Route | Industrial Zone | | |
| Climate | Temperate | Continental | Alpine | Maritime | Urban Heat | | |
| Detection Accuracy | 92.30% | 91.80% | 90.90% | 91.20% | 92.10% | 91.70% | ±0.6% |
| Response Time | 142 ms | 151 ms | 156 ms | 144 ms | 148 ms | 148 ms | ±5.4 ms |
| User Satisfaction | 89% | 85% | 84% | 88% | 91% | 87% | ±2.8% |
| Accident Reduction | 38% | 31% | 29% | 35% | 37% | 34% | ±3.6% |
| Efficiency Gain | 45% | 39% | 38% | 43% | 47% | 42% | ±3.8% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Zhou, J.; Li, Z.; Shi, Z.; Mao, X.; Gao, C. Cognitive-Inspired Multimodal Learning Framework for Hazard Identification in Highway Construction with BIM–GIS Integration. Sustainability 2025, 17, 9395. https://doi.org/10.3390/su17219395
