1. Introduction
Unmanned aerial vehicles (UAVs) were initially used mainly for military purposes, such as training, regional reconnaissance, and combat. Today, fields such as environmental resource monitoring [1,2,3], agricultural exploration [4,5,6], traffic management [7,8,9], and construction mapping [10,11,12] have all developed rapidly thanks to the introduction of UAVs. Bringing UAVs to construction sites and designing a complete construction supervision and inspection system can rapidly advance project management and significantly improve its efficiency.
In terms of construction management, workers die in various accidents on construction sites every year. The most common accident type, with the highest casualty rate, is falls or slips from scaffolds, roofs, and steel structures [13,14,15]. Collisions with tower crane loads, trucks, and heavy equipment cause the second-highest casualty rate [16,17,18]. Even though construction managers have implemented safety training for workers to prevent such accidents [19,20,21], the nature of the construction industry means that experienced workers tend to ignore safety rules or follow old practices, resulting in high fatality rates. Therefore, many researchers have proposed camera-based, sensor-based, computer vision, and deep learning approaches for monitoring construction sites.
Many studies have examined the use of sensors in construction management. The authors of [22] proposed a camera housing with a damped gimbaled mount and angle adjustment mechanism (CRANIUM) to transmit real-time images from a tower crane boom camera to the control room, allowing the driver to identify the load directly and improving tower crane safety. Later, Lee et al. [23] installed a small solar-powered camera on the tower crane’s trolley so the driver could observe the ground conditions and the load; they also proposed attaching sensor-based radio-frequency identification (RFID) cards to construction materials and transmitting information such as their location or type to a personal digital assistant (PDA) carried by the operator. Regarding the management of construction materials, Lee et al. [24] also presented a study in which RFID cards were attached to materials such as H-beams, glass windows, plasterboard, and tiles to provide workers with information about their location and properties. Beyond construction materials, Kelm et al. [25] proposed a mobile RFID portal for personal protective equipment (PPE) detection: a tag identifier was installed at the site entrance, and an RFID tag was embedded in each item of PPE so that workers passing the identifier were recorded and incorrectly worn PPE could be identified. Dong et al. [26] proposed installing pressure sensors on helmets to determine whether workers on site are wearing them. These efforts make it relatively easy to identify construction materials or determine whether PPE is being worn, but installing or embedding the sensors substantially increases project costs.
Computer vision techniques began to be used to manage the safety of construction sites in 2010. Azar et al. [27] proposed detecting excavators at construction sites by training support vector machine (SVM) classifiers on differently trained histogram of oriented gradients (HoG) features. Kim et al. [28] also used a HoG detector to detect concrete mixers. Park et al. [29] combined background subtraction, HoG shape features, and color histograms to detect workers, experimentally achieving good accuracy but high latency. In a follow-up study, Park et al. [30] used background subtraction and HoG features to determine whether workers at construction sites wear helmets. Memarzadeh et al. [31] proposed augmenting HoG features with hue–saturation color information to detect workers, excavators, and trucks with 98.83%, 82.10%, and 84.88% accuracy, respectively. In [32], given the slow speed of the conventional HoG detector, a model combining Haar-like features and HoG was proposed for truck detection to improve the detection rate and reduce the false alarm rate. Beyond HoG-based approaches, Mneymneh et al. [33] proposed detecting moving workers via background subtraction and then detecting helmets in the head region with a color-based classification algorithm. The authors of [34,35] proposed Bayesian and multi-network classifiers to automatically classify and localize construction site workers and heavy equipment such as excavators; their follow-up study proposed a model to distinguish heavy equipment on construction sites and to identify hazardous areas from its location. Kim et al. [36] used a Gaussian mixture model (GMM) to remove the background and a Kalman filter for tracking to assess congestion on construction sites.
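To make this classical pipeline concrete, the following is a minimal sketch of the GMM background-subtraction and Kalman-tracking idea described in [36], written with OpenCV; the parameter values, input file name, and single-target simplification are illustrative assumptions, not details of the cited study.

```python
# Sketch: GMM background subtraction (MOG2) + constant-velocity Kalman tracking.
import cv2
import numpy as np

# MOG2 is OpenCV's GMM-based background subtractor
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

# Kalman filter: state = (x, y, vx, vy), measurement = (x, y)
kf = cv2.KalmanFilter(4, 2)
kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                [0, 1, 0, 1],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], np.float32)
kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                 [0, 1, 0, 0]], np.float32)
kf.processNoiseCov = 1e-2 * np.eye(4, dtype=np.float32)
kf.measurementNoiseCov = 1e-1 * np.eye(2, dtype=np.float32)

cap = cv2.VideoCapture("site.mp4")  # hypothetical site video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg = subtractor.apply(frame)                       # foreground mask
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(fg, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    predicted = kf.predict()                           # predicted (x, y, vx, vy)
    if contours:
        # Use the largest moving blob's centroid as the measurement
        c = max(contours, key=cv2.contourArea)
        x, y, w, h = cv2.boundingRect(c)
        kf.correct(np.array([[x + w / 2], [y + h / 2]], np.float32))
cap.release()
```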
Nevertheless, all of these methods rely on techniques such as color segmentation, HoG feature detection, and SVM, which are sensitive to illumination changes, occlusions, color changes, and complex backgrounds, as shown in Table 1, so their accuracy is low in practice. In addition, the need to design a separate hand-crafted feature algorithm for each detection target leads to insufficient detection capability and high computational cost.
With the advent of convolutional neural networks (CNNs) and the increasing popularity of deep learning, there has been a significant shift in the way machine learning algorithms are designed and implemented, and the limitations of traditional image-processing methods have been overcome [37], as summarized in Table 2. Many construction management detection algorithms have been updated accordingly. For example, Kolar et al. [38] used VGG-16 and a multi-layer perceptron (MLP) to detect guardrails, with detection accuracies of 97% for single guardrails and 86% for multiple guardrails. Fang et al. [39] used a modified Faster R-CNN to automatically detect workers and heavy equipment at construction sites in real time, achieving 91% accuracy for workers and 95% for heavy equipment. Fang et al. [40] used the Faster R-CNN model to detect non-helmet use (NHU) at construction sites under varied environmental conditions such as weather, lighting, shading, and pose; experiments across multiple environments showed an accuracy of 95.7% and an average speed of 0.205 s. Fang et al. [41] presented a safety harness detection study combining the Faster R-CNN model with an RPN to detect workers and safety harnesses on site. To detect heavy equipment, Xiao et al. [42] proposed a semi-supervised learning approach based on teacher–student networks, combining Faster R-CNN and ResNet-50 in a single object detection method and obtaining 92.7% mAP with only half of the labeled dataset. Gugssa et al. [43] used the you only look once (YOLO) model and a CNN structure to detect PPE on construction sites. Wang et al. [44] detected helmets of multiple colors (black, orange, blue, and white) on construction sites and compared YOLO-series models for detecting workers and vests; YOLO v5 achieved the highest mAP of 86.55%.
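As a rough illustration of this family of detectors (not the modified models of [39,40,41]), a pretrained torchvision Faster R-CNN can be applied to a site image in a few lines; the image path and confidence threshold below are placeholder assumptions.

```python
# Sketch: off-the-shelf Faster R-CNN inference with torchvision.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = to_tensor(Image.open("site_frame.jpg").convert("RGB"))  # hypothetical image
with torch.no_grad():
    out = model([img])[0]  # dict with 'boxes', 'labels', 'scores'

keep = out["scores"] > 0.5  # illustrative confidence threshold
for box, label in zip(out["boxes"][keep], out["labels"][keep]):
    print(label.item(), box.tolist())
```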
Although these CNN-based detection methods perform well, three main limitations remain:
Firstly, most construction site inspections rely on monitoring cameras installed in fixed positions. Their fixed perspective and limited installation density result in numerous blind spots, and camera placement is further constrained by on-site environmental factors such as wired power and data transmission. All of these factors impede automated site monitoring, whereas UAVs overcome the constraint of wired power and are not restricted to fixed perspectives.
Secondly, although most prior work can detect its chosen construction targets, including workers, helmets, construction machinery, and materials, each method typically handles only one category; no available solution provides comprehensive detection of multiple target types on a construction site.
In addition, the variation in target size caused by target distance in the acquired images remains a significant challenge even for the most advanced works [9,15,26,28,30,33], making it difficult to apply most detection algorithms to practical construction supervision.
Given the aforementioned constraints and challenges, this study introduces an automated multi-category inspection system for construction sites using UAVs. The proposed system, illustrated in Figure 1, is designed to efficiently and effectively inspect multiple targets at construction sites and offers the following key contributions:
This paper presents an innovative solution for construction site inspections, utilizing the mobility and flexibility of UAVs for efficient and comprehensive perimeter inspections. The proposed scheme incorporates a deep learning model and has been successfully verified on actual construction sites.
Furthermore, this paper presents a novel target detection network for UAV remote sensing images that is fully automated and operates in a single-stage, end-to-end manner. The network leverages the Swin Transformer (ST) module as its backbone to enable highly efficient feature extraction, and a multi-scale feature fusion attention network further enhances detection performance for multiple target classes (an illustrative structural sketch follows the contributions below).
The results of the experiments demonstrate that the proposed method outperforms other classical models, with a detection accuracy of 82.48% on the open-source dataset and the ability to detect and localize up to 15 targets at construction sites. This work makes a significant contribution to the field of construction site inspections through the integration of cutting-edge technology such as UAVs and deep learning to improve the efficiency and accuracy of the inspection process.
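For intuition only, the following is a minimal sketch of an FPN-style top-down fusion over a four-level feature pyramid such as a Swin-T backbone produces (strides 4/8/16/32, channels 96–768 at a 224 × 224 input); the module, names, and tensor shapes are illustrative assumptions, not the paper’s exact multi-scale feature fusion attention network.

```python
# Sketch: top-down fusion of a multi-scale feature pyramid.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        # 1x1 lateral convs align every pyramid level to a common width
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)

    def forward(self, feats):  # feats ordered fine -> coarse
        feats = [lat(f) for lat, f in zip(self.laterals, feats)]
        for i in range(len(feats) - 1, 0, -1):
            # upsample the coarser map and add it to the next finer one
            feats[i - 1] = feats[i - 1] + F.interpolate(
                feats[i], size=feats[i - 1].shape[-2:], mode="nearest")
        return feats

# Stand-ins for Swin-T stage outputs on a 224x224 input
pyramid = [torch.randn(1, 96, 56, 56), torch.randn(1, 192, 28, 28),
           torch.randn(1, 384, 14, 14), torch.randn(1, 768, 7, 7)]
fused = TopDownFusion([96, 192, 384, 768])(pyramid)
```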
Figure 1. Flow chart of the proposed target detection system.
4. Discussion
The current state of the art in construction site target detection relies on fixed-position cameras, which often leave blind spots due to their limited viewpoints and installation density. Additionally, the constraints of the construction site environment make automated supervision challenging. To address these limitations, this study developed a multi-category detection system utilizing UAV low-altitude remote sensing.
Experimental results indicate that the proposed method, which uses an ST self-attention module as the backbone network, converges faster and achieves higher accuracy than other methods. In particular, Figure 12a shows that the proposed method’s loss function decreases faster and converges earlier than those of other backbone networks. Furthermore, Figure 12b shows that while most networks’ mAP increases steadily with training iterations, the proposed method reaches its maximum in only 50 epochs and exhibits a superior rising trend and accuracy. This performance is attributed to the use of the ST self-attention module in the backbone network.
In addition to the backbone network, the study explores the impact of different attention mechanisms on network performance through attention module ablation experiments. The results show that adding LCMA-Net to the network improves mAP by 3.52% over the baseline. The heat map visualizations further show markedly stronger activations on detected targets across feature layers of different sizes, indicating that the attention mechanism effectively boosts focus on targets of all sizes.
Furthermore, the proposed method is compared against four other representative attention mechanisms. The proposed LCMA-Net outperforms SENet and ECA-Net, with accuracy improvements of 1.98% and 1.95%, respectively, which can be attributed to its incorporation of both channel and spatial attention. Compared with CBAM, the proposed method’s multi-branch approach to local cross-channel interaction overcomes the tendency of single pooling to suppress feature diversity, yielding an accuracy improvement of 3.4% at a lower computational cost.
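For context, below is a minimal PyTorch rendition of the published ECA-Net-style channel attention block, the kind of lightweight local cross-channel interaction used by the comparison baselines; it is not the LCMA-Net module itself, and the kernel size is an illustrative default.

```python
# Sketch: ECA-style channel attention via 1D conv over the channel axis.
import torch
import torch.nn as nn

class ECA(nn.Module):
    def __init__(self, kernel_size=3):
        super().__init__()
        # 1D conv over channels = local cross-channel interaction
        self.conv = nn.Conv1d(1, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):                         # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                    # global average pool -> (B, C)
        w = self.conv(w.unsqueeze(1)).squeeze(1)  # (B, 1, C) -> (B, C)
        return x * torch.sigmoid(w)[..., None, None]  # reweight channels

y = ECA()(torch.randn(2, 64, 32, 32))             # output keeps input shape
```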
Finally, the proposed approach is compared with other widely used target detection models to verify its superiority. The precision × recall curves for each category, shown in Figure 14, indicate that the proposed method’s curves are smoother and enclose a larger area for every detected category, meaning it achieves the highest accuracy across target types and sizes. This results from the multi-scale feature fusion attention network’s semantic fusion of feature layers at different resolutions and the weighting applied by the LCMA-Net attention mechanism. The study also considers model complexity in relation to performance: as shown in Table 8, the proposed approach improves mAP by 9.76% and 7.45% over YOLOv5-L and YOLOX-L, respectively, and the increase in computational cost is a worthwhile investment given the substantial performance gain. Overall, the proposed model is less complex and more accurate than the other models, as seen from its position closer to the top-left corner in Figure 16.
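As a reminder of how the per-class area under the precision × recall curve becomes the AP/mAP figures quoted above, here is a minimal sketch of one common all-point-interpolation convention; the toy detections at the end are invented for illustration.

```python
# Sketch: average precision (area under the precision-recall curve).
import numpy as np

def average_precision(scores, is_tp, num_gt):
    order = np.argsort(-scores)                  # rank detections by confidence
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(~is_tp[order])
    recall = np.concatenate(([0.0], tp / num_gt))
    precision = np.concatenate(([1.0], tp / (tp + fp)))
    # enforce a monotonically decreasing precision envelope, then integrate
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    return np.trapz(precision, recall)

# Toy example: 4 scored detections for one class, 3 ground-truth boxes
ap = average_precision(np.array([0.9, 0.8, 0.6, 0.3]),
                       np.array([True, False, True, True]), num_gt=3)
```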
We also examined the visual inspection results at an actual construction site: the wrap-around inspection scheme provides comprehensive detection capability and remains highly robust as the UAV’s altitude increases. However, missed detections become more frequent at greater heights, mainly because background interference grows and target size shrinks, which poses an enormous challenge for detection. In addition, target sizes and shooting angles in the training dataset images are not uniform, which can lead to false detections, such as identifying a distant building as an e-box.
Overall, the detection system still has certain limitations, primarily stemming from the shortage of overhead-view samples in the training dataset. Thus, enriching the dataset and developing a UAV-applicable multi-target detection dataset for construction sites is identified as a direction for future research.