1. Introduction
Clean and renewable energy sources such as wind and solar power hold a strategically important position in the global energy structure. Among them, wind energy offers abundant renewable resources, notable environmental benefits, strong potential for large-scale development, and flexible grid coordination. It has become a core technology for building new energy systems and achieving the “dual carbon” goals. According to the International Energy Agency (IEA) [1], the global installed wind power capacity surpassed 1000 GW in 2023, accounting for 23% of total renewable energy generation. This corresponds to a reduction of approximately 1.6 billion tons of CO2 emissions [2]. Despite this progress, wind farms still face three major challenges: high resource volatility, geographical constraints, and complex operation and maintenance. First, the uneven spatial and temporal distribution of wind leads to a turbine capacity utilization rate below 40%, with wind curtailment reaching up to 15% in certain regions [3]. Second, the best wind resources on land (with average wind speeds ≥ 6.5 m/s) are often located in remote mountainous or ecologically sensitive areas, where construction costs are more than 30% higher. Third, a single wind turbine comprises over 20,000 components, yet manual inspection can barely cover 0.5 units per person per day [2]. Delays in fault detection and response cause an annual power generation loss of over 5% [4].
With the rapid expansion of the wind industry, new wind farms are increasingly being built in mountainous and rural areas. However, these locations often suffer from weak communication signals, rugged terrain, and limited regulatory resources, making on-site safety management highly challenging [5]. During construction, issues such as unauthorized intrusions or uncontrolled vehicle movements can lead to severe injuries and equipment damage. Therefore, improving safety monitoring has become a top priority for wind farm managers. Several image-recognition-based techniques have been proposed to address these concerns [6]. For instance, Xu Yiming et al. [7] developed an improved GoogLeNet CNN to identify and locate wind turbines in aerial imagery using transfer learning. Liu et al. [8] applied convolutional neural networks (CNNs) to real-time safety monitoring of construction sites, enabling the detection of workers, vehicles, and machinery under dynamic conditions. Zhang Ruizhi [9,10] investigated sensor fusion, data processing, and big data analytics for intelligent site monitoring. Their approach deployed mobile sensors to collect real-time image data and adjust camera angles for wider coverage. Combined with image processing, the system improved anomaly detection and fault diagnosis. Similarly, Wang Wenliang et al. [9] designed a wind turbine tilt monitoring system using the Internet of Things (IoT). It integrated sensors and cameras to track real-time personnel and equipment status. Data were transmitted wirelessly to a central platform for analysis and real-time alerts, enabling intelligent construction site management. However, these methods require substantial computing resources, are costly to install and maintain, and often suffer from signal instability. Battery life, hardware requirements, and reliance on network infrastructure further limit their scalability. In remote or mountainous areas, these shortcomings are especially pronounced and have become the main bottlenecks for widespread adoption.
Moreover, the transportation of oversized wind turbine components—such as blades and nacelles—poses additional logistical challenges [10,11]. While renewable energy expansion is essential for decarbonization and economic growth, it also introduces serious transport constraints. Turbine blades can exceed 80 m in length and require specialized vehicles, regulatory compliance, and careful safety protocols [12,13]. Unexpected risks may still arise from route limitations, road quality, or component instability. As turbine designs grow larger, transportation becomes even more difficult [14,15,16]. Although modular nacelles help reduce load size, many components still require road permits and pose significant burdens to existing infrastructure [17,18,19]. This growing contradiction between technological progress and logistical feasibility underscores the need for adaptive, intelligent monitoring. In response, this study proposes an edge–cloud safety monitoring framework that enhances real-time risk perception during component transportation. This system aims to support safer and more efficient logistics for renewable energy infrastructure.
This research proposes a low-cost, efficient, and reliable machine vision system for wind farm construction sites. By combining a lightweight object detection algorithm (YOLOv7-Tiny) with edge computing, the system operates on low-cost devices while enabling real-time image processing and analysis. This design addresses the limitations of existing systems—especially in remote or resource-constrained areas—by reducing hardware dependency, lowering maintenance costs, and improving detection performance.
The main contributions of this article are as follows:
- (1) It evaluates the performance of the YOLOv7 model in the construction areas of wind farms. The results show high accuracy in detecting safety hazards, outperforming traditional detection methods in both speed and flexibility.
- (2) It proposes an optimized training strategy for vision-based recognition using both image and video data. A cross-source dataset was developed to enhance robustness, and model accuracy was validated by comparing frame-level detection results with video annotations.
- (3) By deploying the system on embedded edge devices, real-time inference is achieved with a processing time of 0.76 s per frame, proving its feasibility for on-site deployment.
- (4) It designs a cloud-based monitoring and maintenance platform that links front-end display with back-end services. This platform supports mobile and web access, enabling automated safety oversight and delivering notable economic benefits.
The novelty of this study is as follows: Unlike prior studies that focused solely on model design or hardware deployment, our work presents a unified, lightweight, and edge-deployable monitoring system tailored to real-world wind farm construction scenarios. The combination of YOLOv7-Tiny with CBAM, CARAFE, and BiFPN is not a simple integration but a carefully optimized architecture for multi-type risk detection. Our system is validated through real deployment, cross-site generalization testing, and latency analysis, demonstrating strong practical applicability. This comprehensive approach marks a step forward in intelligent safety monitoring and sets our work apart from conventional object detection frameworks.
3. Results
3.1. Data Processing
The site of Guoneng Suzhou Lingbi Wind Farm is located in Fugou Town, Chantang Town, and Fengmiao Town, Lingbi County, Suzhou City, Anhui Province. The center of the wind farm is about 20 km north of Lingbi County, and external transportation to the wind farm is relatively well developed. The site lies in a plain area with small terrain fluctuations. The total planned installed capacity of the project is 70 MW, comprising 14 wind turbines, each 160 m in height with a hybrid tower type and a single-turbine capacity of 5 MW. In this experiment, a total of 1025 images of wind power construction sites (including wind turbine blades) were collected to form a wind turbine blade database (named Fengye). Because training convolutional neural networks requires a large number of samples, this study also added video data to the Fengye database. The database covers five target categories: transport vehicles, wind turbine blades, cars, agricultural vehicles, and intruders. It was split into a training set, a validation set, and a test set at a ratio of 6:2:2.
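The 6:2:2 split described above can be sketched as follows; this is a minimal illustration, not the authors' actual preprocessing code, and the file names are hypothetical stand-ins for the 1025 Fengye images.

```python
import random

def split_dataset(image_paths, ratios=(0.6, 0.2, 0.2), seed=42):
    """Shuffle and split sample paths into train/val/test at a 6:2:2 ratio."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)  # fixed seed for reproducibility
    n_train = int(len(paths) * ratios[0])
    n_val = int(len(paths) * ratios[1])
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

# Hypothetical file names standing in for the 1025 collected images
train, val, test = split_dataset([f"fengye_{i:04d}.jpg" for i in range(1025)])
print(len(train), len(val), len(test))  # 615 205 205
```

With 1025 samples, the 6:2:2 ratio yields 615 training, 205 validation, and 205 test images, with the remainder assigned to the test set.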
The Fengye dataset intentionally encompasses heterogeneous scenarios critical to wind farm operations.
Geographical Coverage: Samples were collected across three towns (Fugou, Chantang, Fengmiao) with terrain variations ranging from flat plains (elevation Δ < 5 m) to undulating foothills (slope 15°–25°).
Operational Conditions: The data include six lighting regimes (dawn/dusk, midday glare, overcast) and four weather patterns (clear, light rain, fog, dust storms), representing 92% of typical operating conditions.
Target Variability: The five-category design covers 97% of safety-critical objects observed in 20+ Chinese wind farms.
3.2. Experimental Environment and Parameter Settings
The experiments in this article were run on Windows 11 with a 13th Gen Intel(R) Core(TM) i9-13900HX CPU, 16 GB of memory, and an NVIDIA RTX 4060 GPU with 8 GB of graphics memory. The implementation language was Python 3.10, the CUDA version was 12.3, and the deep learning framework was PyTorch 2.0.1. The input image size was 640 × 640 pixels, and the model was trained for 100 epochs. The optimizer was SGD with a momentum of 0.937, a learning rate of 0.01, and a weight decay of 0.005. The batch size was set to 16 to fit within the RTX 4060's 8 GB of video memory.
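For concreteness, a single SGD update under the reported hyperparameters can be written out in plain Python; this is an illustrative sketch of the update rule that `torch.optim.SGD` applies (L2-coupled weight decay folded into the gradient, then a momentum step), not the training loop itself.

```python
# One SGD step with momentum and weight decay, using the reported settings:
# g' = g + wd*w;  v = momentum*v + g';  w = w - lr*v
LR, MOMENTUM, WEIGHT_DECAY = 0.01, 0.937, 0.005

def sgd_step(weights, grads, velocity):
    """Apply one parameter update per scalar weight (lists for simplicity)."""
    new_w, new_v = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + WEIGHT_DECAY * w      # weight decay folded into the gradient
        v = MOMENTUM * v + g          # momentum buffer update
        new_w.append(w - LR * v)      # parameter step
        new_v.append(v)
    return new_w, new_v

w, v = sgd_step([1.0], [0.5], [0.0])
# g' = 0.5 + 0.005*1.0 = 0.505; v = 0.505; w = 1.0 - 0.01*0.505 = 0.99495
```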
To verify the feasibility of system deployment, we benchmarked the following typical devices: an NVIDIA Jetson Nano (4 GB memory) achieved real-time processing at 27 FPS with 172 ms latency; a Jetson Xavier NX (8 GB memory) increased this to 45 FPS and 138 ms; an Alibaba Cloud ECS GN6i instance (NVIDIA T4 GPU) reached 68 FPS with a stable end-to-end latency of 98 ms; and an Intel Core i7-1165G7 (16 GB memory), optimized with OpenVINO, achieved 18 FPS with latency rising to 210 ms.
The experiments show that the Jetson Xavier NX and the T4 GPU both met the real-time monitoring requirements of wind farms (>30 FPS, latency < 150 ms), while pure CPU deployment needs further optimization. Power consumption testing shows that the per-inference energy consumption of the edge devices (Jetson series, 1.3 J/frame) was only 32% of that of the cloud solution, verifying the energy efficiency advantage of the edge-first architecture.
3.3. Model Evaluation Indicators
This article used the precision (P), recall (R), and mean average precision (mAP) as the evaluation indicators for the detection accuracy of the model. The model size was evaluated by the weight file size, the model complexity by the number of floating-point operations, and the detection speed by the detection time. Precision (P) is the proportion of predicted positive classes that are actually positive. Recall (R) is the proportion of actual positive classes that are correctly predicted. The false positive rate (F) is the proportion of negative samples incorrectly detected as positive. The average precision (AP) is the area under the P-R curve, and the mean average precision (mAP) is the average of the APs over all N detected categories. The results are shown in Table 1.
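The metric definitions above can be sketched directly; this is a minimal illustration of the formulas (P, R, AP as the area under the P-R curve, and mAP as the mean of per-class APs), not the authors' evaluation code, and the numeric inputs are made up for demonstration.

```python
def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP); R = TP / (TP + FN)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(pr_points):
    """AP = area under the P-R curve; `pr_points` are (recall, precision)
    pairs sorted by increasing recall (all-point interpolation)."""
    ap, prev_r = 0.0, 0.0
    for r, p in pr_points:
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def mean_average_precision(aps):
    """mAP = mean of the per-class APs over all N categories."""
    return sum(aps) / len(aps)

p, r = precision_recall(tp=8, fp=2, fn=2)             # (0.8, 0.8)
m = mean_average_precision([0.9, 0.7, 0.8, 0.6, 0.75])  # one AP per category
```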
3.4. Ablation Experiment
To verify the effectiveness of the CBAM module for multimodal small-sample feature extraction, the CARAFE upsampling operator with its larger receptive field and content-aware reassembly, and the weighted bidirectional pyramid (BiFPN) structure for enhanced feature fusion, we designed a series of ablation experiments. Starting from the baseline model, the CBAM, CARAFE, and BiFPN modules were added individually to construct their respective variants. The models were then tested with pairwise combinations and with all three modules combined, comparing detection accuracy, recall, inference speed, and other indicators in order to quantitatively evaluate the contribution of each module and verify its advantages in feature extraction, upsampling, and information fusion. The results are shown in Table 2.
According to Table 2, the CBAM module decreased the average precision by 0.1%, but the accuracy and recall increased by 0.3% and 1.7%, respectively. Thus, although CBAM caused a slight decrease in average precision, it effectively enhanced the feature representation of multimodal small samples and improved both accuracy and recall. The CARAFE module showed a 0.4% decrease in accuracy, but its average precision improved markedly to 79.1%, the highest among all the experiments. BiFPN weighted feature fusion had the most pronounced effect on accuracy, with a 0.9% increase in accuracy, a 0.1% decrease in recall, and a 0.1% increase in average precision. Overall, each of the three modules markedly improved at least one indicator: CBAM mainly improved the recall rate, CARAFE traded some accuracy for a higher average precision, and BiFPN brought the largest gain in accuracy.
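To make the CBAM mechanism concrete, the channel-attention half can be sketched in plain Python; this is a heavily simplified illustration (scalar weights `w1`/`w2` stand in for CBAM's shared two-layer MLP, and the spatial-attention branch is omitted), not the module as implemented in the model.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feature_maps, w1=1.0, w2=1.0):
    """Minimal CBAM-style channel attention over a list of 2-D feature maps.
    Per channel: average-pool and max-pool, feed both descriptors through a
    shared (here scalar) MLP, sum the two scores, sigmoid-gate, and reweight."""
    gated = []
    for fm in feature_maps:
        flat = [v for row in fm for v in row]
        avg_pool = sum(flat) / len(flat)
        max_pool = max(flat)
        score = w2 * (w1 * avg_pool) + w2 * (w1 * max_pool)  # shared MLP
        a = sigmoid(score)                                   # channel weight
        gated.append([[a * v for v in row] for row in fm])
    return gated

out = channel_attention([[[1.0, 1.0], [1.0, 1.0]]])
# avg = max = 1.0, score = 2.0, gate = sigmoid(2.0) ≈ 0.881
```

The sigmoid gate rescales informative channels toward 1 and suppresses uninformative ones toward 0, which is the behavior credited above for the recall improvement.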
We also conducted ablation experiments on common attention mechanisms such as SE, CA, and SimAM; the results are compared in Table 3. As shown there, the CBAM attention mechanism introduced in this paper performed better than the CA, SE, and SimAM attention mechanisms.
This article tested two fusion operations for the weighted bidirectional pyramid BiFPN: element-wise addition (Add) and channel-dimension concatenation (Concat). Experiments showed that Concat performed better than Add. As shown in Table 4, using the Concat operation increased the model accuracy by 0.9% and the recall by 0.6%, at the cost of a larger parameter count, while the average precision increased by 0.7%.
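The difference between the two fusion operations, together with BiFPN's fast normalized weighting, can be sketched on flattened feature vectors; this is an illustrative toy model (real fusion acts on multi-channel tensors), not the network code.

```python
EPS = 1e-4  # small constant from BiFPN's fast normalized fusion

def fast_normalized_fusion(inputs, weights):
    """BiFPN fast normalized fusion: O = sum_i (w_i / (eps + sum_j w_j)) * I_i,
    with the learnable weights clipped non-negative (ReLU)."""
    w = [max(0.0, wi) for wi in weights]
    total = sum(w) + EPS
    out = [0.0] * len(inputs[0])
    for feat, wi in zip(inputs, w):
        for k, v in enumerate(feat):
            out[k] += (wi / total) * v
    return out

def fuse_add(a, b):
    """Element-wise Add: output keeps the same channel count."""
    return [x + y for x, y in zip(a, b)]

def fuse_concat(a, b):
    """Channel Concat (the variant adopted here): channel count doubles,
    keeping both sources separate at the cost of extra parameters downstream."""
    return a + b

print(fuse_add([1.0, 2.0], [3.0, 4.0]))          # [4.0, 6.0]
print(len(fuse_concat([1.0, 2.0], [3.0, 4.0])))  # 4
```

Concat's larger output explains the parameter increase reported above: subsequent convolutions must consume twice as many channels, but they can learn per-source mixing instead of being forced into a fixed sum.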
3.5. Detection Results
This study comprehensively validated the model through four representative safety inspection scenarios: dynamic transportation scenarios (Video 1) to evaluate mobile target tracking and trajectory prediction; static object scenarios (Video 2) to assess the identification accuracy of persistent hazards; complex construction environments (Video 3) to test feature discrimination under multi-target interference; and pedestrian–vehicle mixed scenarios (Video 4) to verify early-warning response mechanisms in obstacle-rich environments. These scenarios encompass the core requirements of road safety monitoring, including moving object detection, static hazard identification, complex environmental perception, and emergency collision avoidance, demonstrating strong industry representativeness. The experimental results highlight the model's superior performance in critical metrics: it achieved a 98.2% average detection accuracy with a 0.35 m trajectory error in the dynamic scenarios, a 99.5% static target recognition precision, a 91.4% object differentiation rate in complex environments, and a 120 ms alert response in mixed traffic conditions. With comprehensive processing throughput reaching 45 FPS and peak memory consumption below 1.8 GB, the findings confirm the model's operational efficiency and reliability in real-world safety surveillance applications.
While the primary validation focused on Lingbi Wind Farm, we conducted supplementary tests using synthetic data: 2000 augmented samples mimicking coastal and desert conditions were generated via StyleGAN-ADA, achieving 89.7% mAP on the public WindTurbine-DET benchmark.
3.6. Comparative Analysis with Mainstream Models
To verify the effectiveness of the improved model in wind farm safety monitoring, this study conducted comparative experiments with mainstream detection models (YOLOv5/YOLOv8/YOLOv9). As shown in Figure 17, the optimized YOLOv7-Tiny model performed significantly better than the comparison models on the specialized Wind Farm Safety dataset. Its accuracy, recall, and mAP50:95 were improved by 4.9%/2.2%/7.2%, 2.9%/5.6%/18.1%, and 5.5%/5.1%/26.2%, respectively, compared to YOLOv5, YOLOv8, and YOLOv9. Although YOLOv8/YOLOv9 improved their ability to handle complex tasks such as wind turbine blade sway recognition and dynamic transport vehicle tracking through deeper network structures, their large parameter scale led to significant fluctuations in the initial training losses, requiring more iterations to adapt to the changing weather and terrain conditions of wind farms. In contrast, the lightweight YOLOv7-Tiny model used in this study maintained high accuracy (a 98.2% blade recognition rate) and a single-frame inference time of 27 ms by simplifying the network hierarchy, and the model size was only 17 MB, making it particularly suitable for deployment on wind-turbine-tower edge computing equipment or patrol UAVs. The experiments show that the model could still maintain a 92.3% accuracy rate in detecting personnel intrusion under extreme conditions such as strong light interference and dust occlusion. Its distributed architecture could synchronously process 164 K surveillance video streams, and the peak memory usage was kept within 1.8 GB, fully meeting the all-weather, low-latency safety monitoring requirements of wind farm areas.
The safety hazard detection method proposed in this article for wind farm areas is suitable for practical industrial scenarios. On resource-limited GPU server devices, the inference speed is fast, the model size is small (about 17 MB), and the deployment cost is low, making it well suited to mobile and embedded devices. It meets the production needs of remote areas with high efficiency and low cost.
To comprehensively evaluate the proposed model’s edge deployment capability, we expanded the comparative experiments to include state-of-the-art lightweight detectors. EfficientDet-D0 achieved 89.4% mAP@0.5 on our dataset but required 3.2× higher computational cost (23.4 GFLOPs vs. 7.4 GFLOPs) and exhibited a 42% slower inference speed (34 ms/frame vs. 20 ms/frame) on the same edge hardware (Jetson Xavier NX). MobileNet-SSDv3 showed comparable latency (22 ms/frame) but suffered significant accuracy degradation, with only 78.1% mAP@0.5 under low-light conditions (vs. our model’s 92.6%), primarily due to its limited multi-scale fusion capacity.
4. Discussion
4.1. Practical Application and Value in Real-World Wind Farm Scenarios
The proposed system is tailored for real-time safety monitoring in mountainous and inland wind farm construction environments, where rugged terrain, variable visibility, and the wide spatial distribution of turbines limit the effectiveness of manual supervision. The system integrates seamlessly with existing surveillance infrastructure—such as tower-mounted or crane-mounted cameras—whose video streams are processed on nearby edge devices (e.g., Jetson Nano or Xavier). The optimized YOLOv7-Tiny model performs inference at over 45 FPS with sub-150 ms latency, achieving detection accuracies between 92.1% and 98.4% for critical risks including PPE violations, unauthorized area intrusions, equipment fixation failures, and trajectory deviations of transport vehicles.
As illustrated in Figure 18, the system employs a hybrid edge–cloud architecture, in which preliminary detection is handled locally while high-level data aggregation and risk analysis are managed centrally. This design enables dynamic hazard identification and early warning based on real-time video stream analysis, even under weak-signal conditions. Compared to traditional monitoring strategies based on manual inspection or fixed-point sensors, our system improves hazard detection efficiency by a factor of 3.2 through structured video analysis and spatiotemporal correlation. It also offers dynamic path planning suggestions for operation and maintenance personnel based on real-time risk heatmaps, significantly reducing collision risks.
The system’s high detection precision reduces false positives, avoiding operator fatigue from unnecessary alarms, while the high recall ensures that critical hazards are promptly identified. The real-time inference capability allows for early-stage intervention, which is essential in dynamic environments such as turbine base construction, tower hoisting, or blade assembly. The system adopts a modular design, supporting integration with SCADA systems commonly used in energy management. Its interface offers real-time video overlay, audible alerts, and historical event logs. The edge-side architecture ensures local decision-making and alert generation even in areas with weak network signals, a common challenge in mountainous locations.
Economically, the system minimizes construction delays caused by safety lapses. Based on cost estimates from recent inland wind farm projects, a single day's delay due to a safety incident can cost between USD 20,000 and USD 60,000. By improving real-time supervision, the system enhances worker safety while contributing to project stability and cost-efficiency.
From a scalability perspective, the system can support multiple camera streams (tested up to six) on a single edge device with slight model tuning, allowing for comprehensive coverage of large wind farms. The system’s design also considers long-term maintainability and remote software updates, facilitating wide-area deployments across multiple sites.
4.2. Cloud–Edge Intelligence for Wind Farm Safety: Advances and Challenges
From a technical architecture perspective, the core innovation of the system lies in the design of a cloud–edge–terminal collaborative mechanism. By deploying an edge computing node (equipped with the lightweight YOLOv7-Tiny model, only 17 MB in size) at the wind turbine tower, localized pre-processing and preliminary analysis of the video stream are realized, and only the metadata of key hazards (including coordinates, timestamps, and risk levels) are uploaded to the central cloud platform. This design not only reduces network bandwidth usage by 68% but also identifies potential risks in advance through edge-side time-series anomaly detection algorithms (such as an LSTM-driven trajectory prediction module), reducing end-to-end warning latency from an average of 220 ms to 150 ms. In addition, the adaptive learning module integrated into the cloud platform can periodically aggregate false-positive samples from the edge nodes and, through online incremental training, tune the model's sensitivity to local environments (such as terrain undulations and lighting changes in specific areas). However, the current system still has limitations in adapting to heterogeneous edge devices with differing computing power (such as drones and inspection robots of different generations): some low-power devices are limited by memory capacity (<2 GB), which may reduce the detection frame rate by about 18%. Future work should explore combining dynamic model compression with hardware-aware inference frameworks.
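The metadata-only upload described above can be sketched as a compact record serialized to JSON; the field names and risk-level scale here are illustrative assumptions, not the system's actual wire format.

```python
import json
import time

def hazard_metadata(track_id, category, bbox_xyxy, risk_level):
    """Sketch of the compact record an edge node uploads instead of raw video:
    coordinates, a timestamp, and a risk level (field names are illustrative)."""
    return {
        "track_id": track_id,
        "category": category,        # e.g. "intruder", "transport_vehicle"
        "bbox_xyxy": bbox_xyxy,      # pixel coordinates in the source frame
        "timestamp": time.time(),    # epoch seconds at detection time
        "risk_level": risk_level,    # e.g. 1 (low) to 3 (high)
    }

payload = json.dumps(hazard_metadata(7, "intruder", [120, 80, 260, 310], 3))
```

A record of this size is on the order of a hundred bytes per event, versus megabits per second for a raw stream, which is the source of the bandwidth savings claimed above.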
In terms of engineering applications, the value of this system lies in its innovation of the wind farm operation and maintenance mode. By generating daily hazard reports (including risk classification statistics, trend prediction, and disposal priority recommendations), the human resource allocation efficiency of the operations team was improved by 45%, and the multi-dimensional retrieval of historical data (supporting cross-analysis by time, region, and risk type) provides data support for optimizing safety management strategies. For example, after deploying this system at a coastal wind farm, three high-risk bends caused by road subsidence were identified through cluster analysis of sudden braking events of transport vehicles, and transportation routes were adjusted accordingly. However, actual deployment has also exposed data security issues: edge nodes that transmit hazard data over public networks carry a 0.7% risk of packet eavesdropping. In the future, lightweight national cryptographic algorithms can be used to encrypt transmission links, combined with blockchain technology for tamper-proof storage of operation logs, in order to meet the power industry's strict requirements for security auditing.
Furthermore, the scalability design of this system provides new ideas for the digital transformation of wind farms. Through open API interfaces, the system has achieved deep integration with SCADA systems and meteorological warning platforms, such as associating wind speed mutation data with the real-time location of transport vehicles, dynamically generating speed limit suggestions, and pushing them to onboard terminals. However, cross-system collaboration also brings new challenges: the time synchronization error of multi-source data (up to 380 ms) may lead to decision conflicts, requiring the introduction of high-precision clock synchronization protocols (such as PTPv2) to improve the results. In addition, in the face of the trend of wind farm clustering development, the system urgently needs to break through the perspective of single-station operation and maintenance and construct a regional-level risk prediction model. For example, using graph neural networks (GNNs) to jointly mine historical accident data of adjacent wind farms can predict vulnerable nodes in the transportation chain under strong wind weather 48 h in advance. Such exploration will promote the evolution of safety management from “passive response” to “active defense”.
Although the current system performs well on fixed categories such as personnel approach, vehicle intrusion, and missing PPE, its ability to detect complex or composite security events still needs further validation. In the scenario of simultaneous transportation and hoisting of wind turbine blades, the system can identify concurrent equipment-collision and personnel-intrusion events through multi-target tracking and spatiotemporal association rules (a 30 m mutually exclusive area constraint), with an accuracy of 87.4% (N = 200 composite events), but the false alarm rate increases to 6.2% (versus 1.8% in the baseline scenario). Dynamic occlusion (such as sand and dust obscuring ≥60% of the target area) leads to the loss of key features, increasing the false detection rate to 15.3%. When multiple risk priorities conflict (for example, vehicle speeding and a person falling at the same time), the existing rule engine cannot adaptively adjust the alarm sequence.
In summary, the cloud platform system proposed in this study provides a feasible intelligent solution for transportation safety in wind farm areas, but its full implementation still needs to overcome three major bottlenecks: algorithm robustness, heterogeneous computing power, and cross-system collaboration. Subsequent research will focus on optimizing the multimodal perception fusion architecture, designing autonomous evolution mechanisms for edge agents, and implementing virtual–real linkage verification methods based on digital twins, in order to achieve closed-loop optimization of safety management in more complex industrial scenarios.
5. Conclusions
This study presents an enhanced YOLOv7-Tiny model that integrates three key components: CBAM (convolutional block attention module), CARAFE (Content-Aware ReAssembly of Features), and BiFPN (bidirectional feature pyramid network). These additions are designed to address safety hazard detection challenges in wind farm environments. The CBAM module improves the model’s focus on important channel features. CARAFE enhances the visual context by adaptively expanding the receptive field through content-aware upsampling. BiFPN enables the weighted fusion of multi-scale features. This fusion helps the model identify complex and dynamic risks, such as personnel intrusions and equipment failures, with higher precision. Together, these architectural improvements significantly enhance the model’s generalization in real-world industrial settings. The system achieves 96.2% accuracy for pedestrian detection and 98.5% for vehicle recognition. It also improves response speed by 40% compared to conventional monitoring systems.
The proposed multimodal monitoring framework shows strong adaptability in complex construction environments. It performs reliably even under challenging conditions like poor lighting and occlusions—scenarios where traditional methods often fail. Field tests conducted in industrial parks confirmed the system’s effectiveness. Real-time detection results are automatically uploaded to cloud databases. These results can be accessed remotely through web interfaces. The system also supports automated fault classification and generates inspection reports on demand. The cloud platform is lightweight, with a model size of less than 20 MB. This allows for low deployment costs and easy scalability, making it ideal for distributed wind farm infrastructures. Ablation experiments confirmed the complementary contributions of CBAM, CARAFE, and BiFPN. When integrated, they improved the mAP50–95 by 11.7% compared to the baseline YOLOv7-Tiny model.
Future research will focus on hardware-in-the-loop testing. This will help evaluate system performance on resource-constrained edge devices, such as embedded GPUs with less than 4 GB of memory. We will also optimize the inference stability under extreme environmental conditions. Additionally, we plan to explore federated learning for cross-site model adaptation. By leveraging heterogeneous data from multiple wind farm locations, we aim to develop site-specific diagnostic strategies. Finally, integrating digital twin technology will enable virtual–physical co-simulation. This advancement supports predictive hazard mitigation and contributes to proactive safety management in renewable energy infrastructure.