Article

Construction Site Multi-Category Target Detection System Based on UAV Low-Altitude Remote Sensing

Department of Civil Engineering, Kyungpook National University, Daegu 41566, Republic of Korea
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(6), 1560; https://doi.org/10.3390/rs15061560
Submission received: 10 February 2023 / Revised: 6 March 2023 / Accepted: 12 March 2023 / Published: 13 March 2023

Abstract

On-site management of construction sites has long been a significant challenge for the construction industry. With the development of UAVs, using them to monitor construction safety and progress can make construction more intelligent. This paper proposes a multi-category target detection system based on UAV low-altitude remote sensing, aiming to overcome two problems that arise when mainstream target detection algorithms are applied to construction supervision: reliance on fixed-position cameras and detection of only a single predefined target category. The experimental results show that the proposed method can accurately and efficiently detect 15 types of construction site targets. In terms of performance, the proposed method achieves the highest accuracy in each category compared to other networks, with a mean average precision (mAP) of 82.48%. Additionally, application to an actual construction site confirms that the proposed system has comprehensive detection capability and robustness.

1. Introduction

Earlier unmanned aerial vehicles (UAVs) were mainly used for military purposes, such as military training, regional reconnaissance, and combat. Currently, fields such as environmental resource monitoring [1,2,3], agricultural exploration [4,5,6], traffic management [7,8,9], and construction mapping [10,11,12] have all developed rapidly due to the introduction of UAVs. Introducing UAVs to construction sites and designing a complete construction supervision and inspection system can rapidly advance project management and significantly improve its efficiency.
In terms of construction management, workers die in various accidents on construction sites every year. The most common type of accident, with the highest casualty rate, is caused by falls or slips on scaffolds, roofs, and steel structures [13,14,15]. Collisions with tower crane loads, trucks, and heavy equipment cause the second-highest rate of casualties [16,17,18]. Even though construction managers have implemented safety training for construction workers to prevent such accidents [19,20,21], experienced workers tend to ignore safety rules or follow old practices due to the nature of the industry, resulting in high fatality rates. Therefore, many researchers have proposed camera-based, sensor-based, computer vision, and deep learning approaches as monitoring methods for construction site management.
Many studies have examined the use of sensors in construction management. Everett and Slocum [22] proposed a camera housing with a damped gimbaled mount and angle adjustment mechanism (CRANIUM) to transmit real-time images from a tower crane boom camera to the tower crane control room, allowing the driver to directly identify the load and improving tower crane safety. Later, Lee et al. [23] installed a small solar-powered camera on the tower crane's trolley so the driver could observe the ground conditions and the load; they also proposed attaching sensor-based radio-frequency identification (RFID) cards to construction materials and transmitting information such as the location or type of materials on the site to a personal digital assistant (PDA) carried by the operator. Regarding the management of construction materials, Lee et al. [24] also presented a study in which RFID cards were attached to construction materials such as H-beams, glass windows, plasterboard, and tiles to provide workers with information about the location and properties of the materials. In addition to construction materials, Kelm et al. [25] proposed a method for personal protective equipment (PPE) detection using a mobile RFID portal: a tag identifier was installed at the construction site entrance, and an RFID tag was inserted into each item of PPE so that the system could record workers passing the identifier and determine whether the PPE was correctly worn. Dong et al. [26] proposed a method to determine whether a worker at a construction site is wearing a helmet by installing pressure sensors in the helmet. These advanced efforts make it relatively easy to identify construction materials or determine whether PPE is being worn; still, installing or embedding the sensors substantially increases project costs.
Computer vision techniques began to be used to manage the safety of construction sites around 2010. Azar and McCabe [27] proposed detecting excavators at construction sites by training support vector machine (SVM) classifiers on histogram of oriented gradients (HoG) features. Kim et al. [28] also used a HoG detector as a concrete mixer detection method. Park and Brilakis [29] used a combination of background subtraction, a HoG shape descriptor, and a color histogram to detect workers, achieving good accuracy but a high delay rate in experiments. In a follow-up study, Park et al. [30] used background subtraction and HoG features to determine whether workers at construction sites wear helmets. Memarzadeh et al. [31] proposed augmenting HoG features with hue–saturation color to detect workers, excavators, and trucks with 98.83%, 82.10%, and 84.88% accuracy, respectively. In [32], considering the slow speed of the conventional HoG detector, a model combining Haar-like features (Haar) and HoG was proposed to detect trucks, improving the detection rate and reducing the false alarm rate. Beyond HoG-based approaches, Mneymneh et al. [33] proposed detecting moving workers through background subtraction and then identifying helmets in the head region with a color-based classification algorithm. The authors of [34,35] proposed methods using Bayesian and multi-network classifiers to automatically classify and localize construction site workers and heavy equipment such as excavators; their follow-up study proposed a model to distinguish heavy equipment on construction sites and to identify hazardous areas from their locations. Kim et al. [36] used a Gaussian mixture model (GMM) to remove the background and a Kalman filter as a tracking technique to assess congestion on construction sites.
Nevertheless, all of these methods rely on techniques such as color segmentation, HoG feature detection, and SVM, which are constrained by illumination changes, occlusions, color changes, and complex backgrounds, as shown in Table 1. As a result, their accuracy is low when applied in practice. In addition, the need to design a separate feature algorithm for each detection target leads to insufficient detection capability and a high computational cost.
With the advent of convolutional neural networks (CNNs) and the increasing popularity of deep learning methods, there has been a significant shift in the way machine learning algorithms are designed and implemented, and the limitations of traditional image-processing methods have largely been overcome [37], as summarized in Table 2. Many construction management and detection algorithms have been updated accordingly. For example, Kolar et al. [38] used VGG-16 and a multi-layer perceptron (MLP) to detect guardrails, and the experimental results showed detection accuracies of 97% for single guardrails and 86% for multiple guardrails. Fang et al. [39] used a modified Faster R-CNN to automatically detect workers and heavy equipment at construction sites in real time, with an accuracy of 91% for worker detection and 95% for heavy equipment detection. Fang et al. [40] used the Faster R-CNN model to detect non-hardhat use (NHU) at construction sites under different environmental conditions, such as weather, lighting, shading, and pose; the experiments across multiple environments showed an accuracy of 95.7% and an average speed of 0.205 s. Fang et al. [41] presented a safety harness detection study combining the Faster R-CNN model with an RPN to detect workers and safety harnesses on site. To detect heavy equipment, Xiao et al. [42] proposed a semi-supervised learning approach based on teacher–student networks that combines Faster R-CNN and ResNet-50 in the same object detection method, obtaining 92.7% mAP with only half of the labeled dataset. Gugssa et al. [43] used the you only look once (YOLO) model and a CNN structure to detect PPE on construction sites. Wang et al. [44] detected helmets of multiple colors (black, orange, blue, and white) on construction sites; the YOLO series models were compared for detecting workers and vests, and the results showed that YOLOv5 achieved the highest mAP of 86.55%.
Although these CNN-based detection methods perform well, three main limitations remain:
  • Firstly, the majority of construction site inspections rely on monitoring cameras installed in fixed positions. Their fixed perspective and limited installation density result in numerous blind spots, and camera installation is further limited by on-site environmental factors such as wired power and data transmission. All of these factors impede the implementation of automated construction monitoring, whereas drones can overcome the constraints of wired power and are not restricted to fixed perspectives.
  • Secondly, although most existing work can detect its predefined construction targets, including workers, helmets, construction machinery, and materials, each typically handles only a single category, and no solution offers comprehensive detection of multiple target types across the construction site.
  • In addition, the size variation caused by the varying distance of targets in the acquired images remains a significant challenge even for the most advanced approaches [9,15,26,28,30,33], making it difficult to apply most detection algorithms to practical construction supervision.
Given the aforementioned constraints and challenges, this study introduces an automated multi-category inspection system for construction sites using UAVs. The proposed system, illustrated in Figure 1, is designed to efficiently and effectively inspect multiple targets at construction sites and offers the following key contributions:
  • This paper presents an innovative solution for construction site inspections, utilizing the mobility and flexibility of UAVs for efficient and comprehensive perimeter inspections. By incorporating a deep learning model, the proposed scheme has been successfully verified on actual construction sites.
  • Furthermore, this paper presents a novel target detection network for UAV remote sensing images that is entirely automated and operates on a single-stage end-to-end basis. The proposed network leverages the Swin Transformer (ST) module, a cutting-edge deep learning technique, as its backbone to enable highly efficient feature extraction. We utilize a multi-scale feature fusion attention network to further enhance the network’s detection performance for multiple classes of targets.
  • The results of the experiments demonstrate that the proposed method outperforms other classical models, with a detection accuracy of 82.48% on the open-source dataset and the ability to detect and localize up to 15 targets at construction sites. This work makes a significant contribution to the field of construction site inspections through the integration of cutting-edge technology such as UAVs and deep learning to improve the efficiency and accuracy of the inspection process.
Figure 1. Flow chart of the proposed target detection system.

2. Materials and Methods

2.1. UAV Application for Construction Site Wrap-Around Inspection

As indicated in Figure 2a, the trial location is the building site of the Kyungpook University Dormitory Project in Daegu, Korea, with coordinates referenced to the WGS84 datum. The DJI Avata was used in the experiment to perform a wrap-around inspection in the field, as shown in Figure 2b. This UAV was selected because it is small, easy to operate, equipped with built-in propeller guards, and less likely to pose a risk to construction operations. The inspection path runs around the periphery of the construction site, as shown in Figure 2c, and an example of the acquired images is shown in Figure 2d.
The flight attitude of the UAV during the wrap-around inspection is shown in Figure 3, and the detailed aerial photography parameters are listed in Table 3. The UAV was flown at different altitudes relative to the construction surface to assess the reliability of the detection system. It is worth noting that the wrap-around inspection along the construction perimeter is intended to reduce interference with construction operations, but it undoubtedly also presents a significant challenge to the robustness of the inspection.

2.2. Multi-Size Target Detection Network for UAV Inspection System

Figure 4 shows the three major steps in the proposed construction site target detection network: a high-precision backbone feature extraction network based on the ST model [45], a multi-scale feature fusion attention network, and anchor-based decoupled heads.
The first step utilizes the ST as the backbone for efficient feature extraction. In the second step, an attention mechanism is introduced to enhance feature diversity and improve target detection accuracy by weighting the feature maps at each scale and fusing them. Finally, in step three, K-means clustering is employed to determine the optimal sizes of the prior boxes for the different output feature layers, facilitating faster network convergence and reducing tuning time.
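As a structural overview, the following is a minimal, hypothetical PyTorch skeleton of the three-step pipeline; the backbone, neck, and head modules are injected as placeholders, and none of the names reflect the authors' actual implementation.

```python
import torch
import torch.nn as nn

class ConstructionSiteDetector(nn.Module):
    """Hypothetical skeleton of the three-step pipeline (not the authors' code)."""

    def __init__(self, backbone: nn.Module, neck: nn.Module, heads: nn.ModuleList):
        super().__init__()
        self.backbone = backbone   # Step 1: ST-based hierarchical feature extractor
        self.neck = neck           # Step 2: multi-scale feature fusion attention network
        self.heads = heads         # Step 3: anchor-based decoupled heads, one per scale

    def forward(self, images: torch.Tensor):
        # Step 1: hierarchical features, e.g. (160,160,128) ... (20,20,1024)
        stages = self.backbone(images)
        # Step 2: fuse and attention-weight them into 80x80, 40x40, and 20x20 maps
        p3, p4, p5 = self.neck(stages)
        # Step 3: each decoupled head predicts class scores and box offsets
        # relative to its three prior (anchor) boxes
        return [head(f) for head, f in zip(self.heads, (p3, p4, p5))]
```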

2.2.1. Design of Backbone (Step 1)

The primary objective of construction site target detection using UAV photos is to achieve high accuracy. The choice of a high-precision backbone network is, therefore, crucial to achieving this goal, as the feature richness depends on the backbone network’s performance. In this regard, ST has been found to outperform other networks on most datasets. Hence, the hierarchical feature extraction structure of ST is employed to build a backbone network that can handle multi-scale segmentation while ensuring high efficiency.
Window multi-head self-attention (W-MSA), which builds on standard multi-head self-attention [46], and shifted windows multi-head self-attention (SW-MSA) [45] are the two core components of the backbone network used for construction site target detection from UAV photos. W-MSA divides feature maps into windows and reduces the number of parameters and the model complexity compared to traditional multi-head self-attention. However, W-MSA only performs self-attentive computation within each window, so no information is exchanged between windows.
In contrast, SW-MSA enables self-attentive interaction between windows, addressing this limitation of W-MSA, as shown in Figure 5. This tandem module allows information to be shared between adjacent windows, reducing computational costs while ensuring high efficiency and performance. This hierarchical, window-based feature extraction structure has been found to outperform other backbones on most datasets, making it a crucial component of the backbone network for the construction site target detection network applied to UAVs.
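To illustrate the window mechanism, the sketch below partitions a feature map into non-overlapping windows and applies the cyclic shift used before SW-MSA. The window size of 8 is an assumption chosen so that it divides the 160 × 160 stage-1 map evenly (Swin itself uses a window size of 7 with masking/padding), and the attention computation inside each window is omitted.

```python
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into (num_windows*B, ws, ws, C) windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

x = torch.randn(1, 160, 160, 128)   # stage-1 feature map from the text
ws = 8                              # assumed window size (divides 160 evenly)

# W-MSA: self-attention is computed independently inside each window.
windows = window_partition(x, ws)                     # (400, 8, 8, 128)

# SW-MSA: cyclically shift the map by half a window before partitioning, so the
# new windows straddle the old window boundaries and information can cross them.
shifted = torch.roll(x, shifts=(-ws // 2, -ws // 2), dims=(1, 2))
shifted_windows = window_partition(shifted, ws)       # (400, 8, 8, 128)
print(windows.shape, shifted_windows.shape)
```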
The input image is first processed by the patch partition module, as illustrated in Figure 6a, where each 4 × 4 group of adjacent pixels is transformed into a patch, which is then flattened into 16 pixels, each containing three R, G, and B values. This results in an image with a shape of (160, 160, 48) after the initial flattening of the input image (640, 640, 3). After passing through the linear embedding layer, the pixel channels are modified to 128. Subsequently, the self-attentive module is utilized to generate stage 1, which has a shape of (160, 160, 128).
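A minimal sketch of the patch partition and linear embedding just described; it reproduces only the shape flow (640, 640, 3) → (160, 160, 48) → (160, 160, 128) and is not the authors' implementation.

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 640, 640)                       # (B, C, H, W) input image

# Patch partition: every 4 x 4 block of pixels becomes one patch of 4*4*3 = 48 values.
patches = image.unfold(2, 4, 4).unfold(3, 4, 4)           # (1, 3, 160, 160, 4, 4)
patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(3)    # (1, 160, 160, 48)

# Linear embedding: project the 48 values per patch to 128 channels.
embed = nn.Linear(48, 128)
stage1_tokens = embed(patches)                            # (1, 160, 160, 128), i.e. stage 1
print(stage1_tokens.shape)
```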
As shown in Figure 6b, the patch merging process is a crucial component that plays a significant role in the overall performance of the model. The objective of this process is to combine multiple patches extracted from the input image into a single representation that is fed to the next layer of the model. The patch merging process is performed by a multi-head self-attention mechanism, where each head attends to different regions of the input patches and aggregates information from these regions to generate a merged representation. The self-attention mechanism is designed to be both parallelizable and computationally efficient, making it ideal for large-scale image classification tasks.
The merging process starts by extracting multiple overlapping patches from the input image. These patches are then linearly transformed to obtain their feature representations, which are used as inputs to the self-attention mechanism. The self-attention mechanism computes a set of attention weights that capture the relative importance of each patch representation in the merged representation.
To obtain the merged representation, the feature representations are multiplied by the attention weights and then summed. The merged representation is then linearly transformed to produce the final representation that is fed to the next layer of the model.
This downsampling process produces feature layers at four scales, namely stage 1, stage 2, stage 3, and stage 4, with shapes of (160, 160, 128), (80, 80, 256), (40, 40, 512), and (20, 20, 1024), respectively. This multi-scale network design is intended to accommodate variations in object size resulting from changes in distance caused by UAV motion.
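For the shape bookkeeping, the sketch below uses the standard Swin-style patch-merging operation from [45] (concatenating each 2 × 2 neighborhood and linearly projecting 4C channels to 2C) purely to show how the stage shapes above arise; the attention-based merging described in the text is omitted for brevity.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Standard Swin-style downsampling: merge each 2x2 patch group, 4C -> 2C channels."""

    def __init__(self, dim: int):
        super().__init__()
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, H, W, C)
        x0, x1 = x[:, 0::2, 0::2, :], x[:, 1::2, 0::2, :]
        x2, x3 = x[:, 0::2, 1::2, :], x[:, 1::2, 1::2, :]
        merged = torch.cat([x0, x1, x2, x3], dim=-1)        # (B, H/2, W/2, 4C)
        return self.reduction(merged)                       # (B, H/2, W/2, 2C)

x = torch.randn(1, 160, 160, 128)                           # stage 1
for dim in (128, 256, 512):                                  # produce stages 2-4
    x = PatchMerging(dim)(x)
    print(x.shape)   # (1, 80, 80, 256), (1, 40, 40, 512), (1, 20, 20, 1024)
```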

2.2.2. Design of Multi-Scale Feature Fusion Attention Network (Step 2)

The shallower layers of the backbone network, with their small receptive fields and high spatial resolution, are generally better suited to handling smaller targets. Conversely, the deeper feature layers, which have undergone more downsampling and therefore retain less spatial resolution, are better suited to detecting larger targets. As the network deepens, repeated downsampling yields richer channel information but loses spatial detail. Using UAVs to detect targets at construction sites inevitably involves targets of widely varying sizes. In this paper, we therefore propose a multi-scale feature fusion attention network, as shown in step 2 of Figure 4, whose main idea is to combine the feature layers from different stages of the backbone network and to weight them using the proposed attention mechanism. The attention mechanism is leveraged to optimize information filtering and emphasize relevant features, enabling the network to efficiently process and prioritize the critical features needed for effective multi-scale target detection.
Our prior work enhanced the spatial attention module of the lightweight residual convolutional attention network (LRCA-Net) [47] to create LRCA-Netv2 [48]. The primary objective of this modification was to enable a more effective synthesis of global features while avoiding the limitations imposed by local information. The operational process is depicted in Figure 7. The input features, represented by F, are first subjected to two types of pooling, followed by a 1D convolution with kernel size k. This approach results in a significant reduction in computational cost compared to a linear mapping.
However, the pure pooling operation gives channels with different information the same average value, suppressing the diversity among channels. To better preserve multi-scale feature information along the channel dimension, we propose a lightweight channel multi-branch attention network (LCMA-Net) by improving the channel attention module of LRCA-Netv2, as shown in Figure 8.
First, the channels of the input feature map are partitioned into N branches, so that the input channel dimension of each branch is C′. For each branch, the weights are obtained by global average pooling, followed by a 1D convolution with kernel size k and a softmax function, as in Equation (1):
W_n = \sigma\left(\mathrm{1D}_k\left(\mathrm{Concat}\left[\left(\frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_{c_{n,m}}(i,j)\right)_{m=0,1,\ldots,M-1}\right]\right)\right), \quad n \in \{0, 1, \ldots, N-1\},
where x_{c_{n,m}}(i, j) represents the pixel value at position (i, j) of the feature map x_{c_{n,m}} (the m-th channel of the n-th branch), M is the number of channels per branch (i.e., C′), σ represents the softmax function, and N represents the number of branches in the partition.
In this way, multiple weights corresponding to multi-branch channels can be obtained. With the advantage of the highly lightweight 1D convolution, the computational cost does not increase even as the number of branches increases. As shown in Equation (2), each branch can be independently attention-weighted by using the obtained weights to multiply the elements with their corresponding branch channels.
F' = \mathrm{Conv}\left(\left[C_0 \cdot W_0,\; C_1 \cdot W_1,\; \ldots,\; C_{N-1} \cdot W_{N-1}\right]\right),
where Conv denotes a 1 × 1 convolution and F′ denotes the output feature map.
The feature information F′ obtained after concatenation thus supports local cross-channel interaction, overcoming the suppression of feature diversity caused by a single pooling operation. The overall structure of LCMA-Net combined with the spatial attention module is shown in Figure 9. Since it does not change the dimensionality or size of the feature map, it can be inserted into each scale of the multi-scale feature fusion attention network to produce feature information that is richer in both channel and spatial dimensions.
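To make Equations (1) and (2) concrete, the following is a minimal PyTorch sketch of the multi-branch channel attention described above; the branch count, the kernel size k, the sharing of a single 1D convolution across branches, and the placement of the softmax are assumptions made for illustration rather than details taken from the authors' implementation.

```python
import torch
import torch.nn as nn

class LCMAChannelAttention(nn.Module):
    """Sketch of the multi-branch channel attention in Equations (1) and (2)."""

    def __init__(self, channels: int, branches: int = 4, k: int = 3):
        super().__init__()
        assert channels % branches == 0
        self.branches = branches
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)   # final 1x1 conv

    def forward(self, x: torch.Tensor) -> torch.Tensor:            # x: (B, C, H, W)
        b, c, h, w = x.shape
        cp = c // self.branches                                     # C' channels per branch
        out = []
        for n in range(self.branches):
            branch = x[:, n * cp:(n + 1) * cp]                      # (B, C', H, W)
            pooled = branch.mean(dim=(2, 3)).unsqueeze(1)           # Eq. (1): GAP -> (B, 1, C')
            weights = torch.softmax(self.conv1d(pooled), dim=-1)    # 1D conv + softmax
            out.append(branch * weights.transpose(1, 2).unsqueeze(-1))  # Eq. (2): reweight branch
        return self.fuse(torch.cat(out, dim=1))                     # concat + 1x1 conv

attn = LCMAChannelAttention(channels=256)
print(attn(torch.randn(2, 256, 40, 40)).shape)                      # torch.Size([2, 256, 40, 40])
```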
As shown in step 2 of Figure 4, the feature layers from different stages of the backbone network are combined and weighted using the proposed attention mechanism. The multi-scale feature fusion attention network finally integrates the shallow and deep feature layers into outputs of 80 × 80, 40 × 40, and 20 × 20, where Conv7 and Conv8 are generated by fusing the two adjacent feature layers to their left and right, as sketched below. Such a network structure further enriches feature diversity so that target features at all scales are preserved.
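The exact fusion operator for adjacent feature layers is not spelled out above, so the sketch below shows one plausible reading under simple assumptions: the deeper (coarser) layer is upsampled to the shallower layer's resolution, both are reduced to a common channel count with 1 × 1 convolutions, and the results are concatenated before attention weighting. Layer names and channel counts are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjacentFusion(nn.Module):
    """Illustrative fusion of two adjacent backbone stages into one detection scale."""

    def __init__(self, c_shallow: int, c_deep: int, c_out: int):
        super().__init__()
        self.reduce_shallow = nn.Conv2d(c_shallow, c_out // 2, kernel_size=1)
        self.reduce_deep = nn.Conv2d(c_deep, c_out // 2, kernel_size=1)

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        # Bring the deeper (coarser) map to the shallower map's resolution.
        deep_up = F.interpolate(deep, size=shallow.shape[-2:], mode="nearest")
        fused = torch.cat([self.reduce_shallow(shallow), self.reduce_deep(deep_up)], dim=1)
        return fused   # attention weighting (e.g. LCMA-Net) would be applied here

# Stage 2 (80x80x256) fused with stage 3 (40x40x512) into an 80x80 detection scale.
fusion = AdjacentFusion(c_shallow=256, c_deep=512, c_out=256)
out = fusion(torch.randn(1, 256, 80, 80), torch.randn(1, 512, 40, 40))
print(out.shape)   # torch.Size([1, 256, 80, 80])
```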

2.2.3. Design of Decoupled Heads for Classification and Positioning (Step 3)

Step 3 uses an anchor-based approach, which is a widely used method for object detection tasks. The method utilizes three different size output features to determine the final classification and boundary position of the target. For the three output feature layers, we set up nine prior boxes in total, with each layer containing three distinct prior boxes of varying sizes that can be adjusted during the prediction phase. These anchors serve as reference points, representing different aspect ratios and scales, to predict the locations and sizes of the target objects in an image.
Figure 10 demonstrates how the anchor boxes’ dimensions are calculated using the results of the K-means clustering analysis, while Table 4 presents the anchor boxes’ masks. These proposals are then evaluated based on the predicted class probabilities and their corresponding bounding box regression results. Finally, the non-maximum suppression algorithm is applied to select the best candidate boxes and remove the overlapping or low-confidence boxes.
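As an illustration of the anchor-size selection, the sketch below clusters normalized ground-truth box widths and heights with K-means (K = 9) and groups the centroids into three masks of three anchors each, from small to large, mirroring the layout of Table 4. The use of scikit-learn with a plain Euclidean distance (rather than an IoU-based distance) and the assignment of the smallest anchors to the highest-resolution output layer are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_anchors(wh: np.ndarray, input_size: int = 640, k: int = 9):
    """wh: (N, 2) array of normalized ground-truth box widths and heights in [0, 1]."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(wh)
    anchors = km.cluster_centers_ * input_size            # back to pixel units
    anchors = anchors[np.argsort(anchors.prod(axis=1))]   # sort by area, small to large
    # Three anchors per output feature layer: small (80x80), medium (40x40), large (20x20).
    return anchors[:3], anchors[3:6], anchors[6:]

# Example with random boxes standing in for the SODA labels.
wh = np.random.rand(1000, 2)
small, medium, large = cluster_anchors(wh)
print(np.round(small, 1), np.round(medium, 1), np.round(large, 1))
```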

2.3. Experiments

2.3.1. Description of the Dataset and Experimental Setup

The dataset from [49], named the site object detection dataset (SODA), contains approximately 19,847 images with 286,201 instances. The SODA dataset is highly applicable to UAV-based construction target detection. It covers various construction stages and was collected through different methods, including UAV, handheld camera, and construction site monitoring video. The images in SODA were taken from different viewpoints and under different ambient lighting conditions, covering 15 object classes grouped into worker, material, machine, and layout categories. The dataset is the first to achieve full coverage of these categories and contains the largest number of objects and categories among current open-source object detection datasets in the construction industry. The images were randomly divided into training and validation sets of 17,861 and 1986 images, respectively (a ratio of 9:1), and 15 object classes were selected, as shown in Figure 11.
Table 5 outlines the specifics of our experimental configuration.

2.3.2. Metrics and Parameters for Performance Evaluation

Equations (3) and (4) define two widely adopted evaluation metrics in object detection tasks, namely precision and recall. Precision measures the fraction of positive detections that are correct, while recall measures the fraction of actual positive instances that are correctly detected. A high precision indicates that the algorithm is able to correctly identify the objects, while a high recall means that it is able to detect a majority of the objects in the scene. The F1 score quantified in Equation (5), which is the harmonic mean of precision and recall, balances the trade-off between precision and recall, and provides a single metric to evaluate the overall performance of an algorithm. The average precision (AP) as stated in Equation (6) is another widely used evaluation metric in object detection. AP measures the average precision across all levels of recall. The precision–recall curve, which plots the precision values against recall, can be used to visualize the performance of an algorithm. The mAP as given in Equation (7) is an extension of the AP metric, which computes the average precision over multiple classes in the dataset. mAP provides a comprehensive evaluation of an algorithm’s performance across all classes and can be used to compare the performance of different algorithms.
\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}},
\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},
F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},
where True/False (T/F) indicates the accuracy of the prediction, with true denoting a correct prediction and false denoting an incorrect one. Positive/Negative (P/N) represents the outcome of the prediction, with positive indicating a positive prediction result and negative indicating a negative one.
\mathrm{AP} = \frac{1}{n} \sum_{r \in \left\{\frac{1}{n}, \frac{2}{n}, \ldots, \frac{n-1}{n}, 1\right\}} P_{\mathrm{interp}}(r),
\mathrm{mAP} = \frac{1}{n_{\mathrm{class}}} \sum \mathrm{AP},
where n denotes the number of recall points at which precision is sampled, n_class denotes the number of object classes, and P_interp(r) represents the interpolated precision at a recall of r.
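For concreteness, the sketch below computes precision, recall, and F1 from TP/FP/FN counts and an interpolated AP from a list of confidence-sorted detections. The 11-point interpolation and the cumulative TP counting are common conventions assumed here rather than the authors' exact evaluation protocol.

```python
import numpy as np

def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if tp + fp else 0.0                                   # Equation (3)
    recall = tp / (tp + fn) if tp + fn else 0.0                                       # Equation (4)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0  # Equation (5)
    return precision, recall, f1

def average_precision(is_tp: np.ndarray, num_gt: int, n_points: int = 11) -> float:
    """is_tp: detections sorted by confidence, 1 for TP and 0 for FP."""
    cum_tp = np.cumsum(is_tp)
    precision = cum_tp / np.arange(1, len(is_tp) + 1)
    recall = cum_tp / num_gt
    # Equation (6): average the interpolated precision at evenly spaced recall levels.
    ap = 0.0
    for r in np.linspace(0.0, 1.0, n_points):
        mask = recall >= r
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / n_points

print(precision_recall_f1(tp=80, fp=20, fn=10))            # (0.8, 0.888..., 0.842...)
print(round(average_precision(np.array([1, 1, 0, 1, 0, 1]), num_gt=8), 3))
# mAP (Equation (7)) is then the mean of the per-class AP values.
```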
We utilized a set of hyperparameters in the training process, as detailed in Table 6. The loss function used was CIoU [50], and we applied the augmentation techniques of mosaic and mix-up.
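Since the CIoU loss [50] is only named above, the sketch below follows its published formulation: the plain IoU is penalized by the normalized squared distance between box centers and by an aspect-ratio consistency term v with trade-off weight α. Boxes are assumed to be in (x1, y1, x2, y2) format; this is an illustrative implementation, not the training code used in this study.

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Standard CIoU loss for boxes given as (x1, y1, x2, y2); returns 1 - CIoU."""
    # Intersection and union -> IoU
    x1 = torch.max(pred[..., 0], target[..., 0]); y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2]); y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared center distance over the squared diagonal of the smallest enclosing box
    cx_p = (pred[..., 0] + pred[..., 2]) / 2; cy_p = (pred[..., 1] + pred[..., 3]) / 2
    cx_t = (target[..., 0] + target[..., 2]) / 2; cy_t = (target[..., 1] + target[..., 3]) / 2
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    rho2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term
    w_p = pred[..., 2] - pred[..., 0]; h_p = pred[..., 3] - pred[..., 1]
    w_t = target[..., 2] - target[..., 0]; h_t = target[..., 3] - target[..., 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - (iou - rho2 / c2 - alpha * v)

print(ciou_loss(torch.tensor([[0., 0., 10., 10.]]), torch.tensor([[1., 1., 11., 11.]])))
```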

3. Results

3.1. Comparison of Model Training with Different Backbone Networks

The initial experiment aimed to validate the rationality and efficacy of the proposed network’s backbone architecture. To accomplish this, an ablation study was performed. The training data sets, hyperparameters, training strategies, and experimental environments were held constant across all comparison experiments, with the exception of the module parameters. Six representative deep neural networks were selected for comparison, with the backbone networks being replaced in each case.
The VGG16 network [52] is a widely used classical CNN feature extractor. Lightweight networks with faster processing include MobileNetv1 [53], MobileNetv2 [54], and MobileNetv3 [55]. ResNet50 [56] uses a deeper architecture for higher accuracy. DenseNet121 [57], on the other hand, uses dense channel-wise connections to improve efficiency and reuse features. Figure 12a shows the convergence status of the loss function and Figure 12b the rising trend in mAP for each backbone network when trained for 300 epochs, where the average training time per epoch is 22 min. The increasing mAP curves in Figure 12b are obtained from the validation set, and the evaluation parameters were set conservatively to speed up validation while still visually reflecting how mAP changes during training.

3.2. Comparison of the Attention Modules

Using ablation experiments, we further compare the performance of different attention mechanisms in the network configuration. As shown in Table 7, with no attention module added (the baseline), the mAP is only 78.96%. To analyze the effectiveness of the proposed LCMA-Net attention mechanism, we evaluated its performance against other commonly used attention mechanisms, each integrated into the second step of the network. The visual heat maps of each output feature layer before and after configuring LCMA-Net in the network can be seen in Figure 13.

3.3. Comparison of Model Performance

Finally, our proposed method demonstrated superior performance compared to widely used object detection networks such as Faster R-CNN [61], YOLOv5, YOLOX [62], and EfficientDet [63]. The ablation experiments confirmed the rationality and effectiveness of the proposed backbone network. The results of the quantitative experiments are presented in Table 8, with precision × recall curves and AP results for each detected class shown in Figure 14 and Figure 15, respectively. A performance comparison between the models is also depicted in Figure 16.

3.4. Visualization of UAV Inspection Program Applied to Construction Sites

The proposed method was tested on construction sites using UAVs according to the scheme described above to determine its practical detection performance. Images were acquired at a resolution of 3840 × 2160. To remove redundant bounding boxes and enhance detection performance, non-maximum suppression (NMS) was employed with an Intersection over Union (IoU) threshold of 0.5, as illustrated below. The video processing speed achieved with this approach was approximately 13 frames per second. Overall, the visualization results in Figure 17 show that the wrap-around route inspection scheme provides comprehensive detection capability for multiple types of targets on construction sites.
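A minimal sketch of this NMS step with the 0.5 IoU threshold, using torchvision's built-in operator; the boxes and scores are illustrative values rather than actual detections.

```python
import torch
from torchvision.ops import nms

# Illustrative detections: boxes in (x1, y1, x2, y2) pixel coordinates with confidences.
boxes = torch.tensor([[100., 100., 220., 260.],
                      [105., 110., 225., 255.],    # heavy overlap with the first box
                      [400., 300., 520., 460.]])
scores = torch.tensor([0.92, 0.85, 0.78])

keep = nms(boxes, scores, iou_threshold=0.5)        # indices of boxes kept after NMS
print(keep)                                          # tensor([0, 2]) -- duplicate suppressed
```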
In addition, three UAV flight heights of 10 m, 15 m, and 20 m were set for the experiment, and the visualization comparison results are shown in Figure 18.

4. Discussion

The current state of the art in target detection at construction sites relies on fixed-position cameras, which often leave blind spots due to their limited viewpoints and installation density. Additionally, the constraints of the construction site environment make automated supervision a challenging task. To address these limitations, this study develops a multi-category detection system utilizing UAV low-altitude remote sensing.
The experimental results of this study indicate that the proposed method, using an ST self-attention module as the backbone network, demonstrates faster convergence and higher accuracy compared to other methods. In particular, it can be observed from Figure 12a that the proposed method exhibits a faster decrease in the loss function and reaches convergence earlier than the other backbone networks. Furthermore, Figure 12b shows that while most networks' mAP increases steadily with training iterations, the proposed method reaches its maximum in only 50 epochs and exhibits a superior rising trend and accuracy. This exceptional performance is attributed to the use of the ST self-attention module in the construction of the backbone network.
In addition to the backbone network, the study also explores the impact of different attention mechanisms on network performance through attention module ablation experiments. The results show that adding LCMA-Net to the network improves mAP by 3.52% over the baseline. The heat map visualizations further show that the highlighted responses on detected targets become noticeably stronger across feature layers of different sizes, indicating that the attention mechanism effectively increases the network's focus on targets of all sizes.
Furthermore, the proposed method is compared against four other representative attention mechanisms. The proposed LCMA-Net outperforms existing methods such as SENet and ECA-Net, with accuracy improvements of 1.98% and 1.95%, respectively. This improvement can be attributed to the incorporation of both channel and spatial attention mechanisms in LCMA-Net. In comparison with CBAM, the proposed method, which incorporates a multi-channel branching approach for local interactions across channels, overcomes the deficiency of single pooling in suppressing feature diversity and results in an accuracy improvement of 3.4%. Additionally, the proposed method also has a lower computational cost.
Finally, the proposed approach is compared with other widely used target detection models to verify its superiority. The precision × recall curves for each category, shown in Figure 14, indicate that the proposed method's curves are smoother and enclose a larger area for all detected categories, meaning that it achieves the highest accuracy for targets of all types and sizes. This results from the semantic fusion of different-resolution feature layers in the multi-scale feature fusion attention network and the weighting applied by the LCMA-Net attention mechanism. The study also considers model complexity in relation to performance: as shown in Table 8, the proposed approach improves mAP by 9.76% and 7.45% compared to YOLOv5-L and YOLOX-L, respectively. The increase in computational cost is deemed a worthwhile investment given the substantial performance improvement it yields. Overall, the proposed model is less complex and has the highest accuracy compared to the other models, as seen from the region closer to the top-left corner in Figure 16.
We also examined the visual inspection results at the actual construction site and found that the wrap-around inspection scheme provides comprehensive detection capability and remains highly robust as the UAV altitude increases. However, missed detections become more frequent at greater heights: as the UAV rises, background interference increases and target sizes shrink, which poses an enormous challenge for detection. In addition, the target sizes and shooting angles in the training dataset are not consistent with those of the aerial images, which can lead to false detections, such as a distant building being detected as an e-box.
Overall, the detection system still has certain limitations, primarily stemming from the shortage of overhead-view samples in the training dataset. Thus, enriching the dataset and developing a multi-object detection dataset for construction sites that is suitable for UAVs is identified as a direction for future research.

5. Conclusions

In this study, we present a multi-category target detection system for UAV low-altitude remote sensing. The aim is to address the challenges posed by the reliance on fixed cameras and single category identification in traditional target detection algorithms for construction supervision. The proposed system was evaluated through comprehensive experiments to assess its rationality and efficiency. Results indicate that the network is capable of detecting and precisely locating 15 different types of targets at construction sites with higher accuracy than other commonly used methods. The system’s performance was further confirmed through its successful application in real-world construction sites, making it a robust and practical solution for automating construction supervision. The proposed system represents a significant step forward in the application of UAVs in the construction industry. To further advance this field, future work should focus on developing UAV perspective construction datasets and exploring integration with tracking and warning systems.

Author Contributions

Conceptualization, H.L. and S.S.; methodology, H.L.; software, H.L.; writing—original draft preparation, H.L. and J.C.; writing—review and editing, H.L. and S.S.; visualization, H.L.; supervision, S.S.; project administration, S.S.; funding acquisition, S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2016R1D1A1B02011625).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Mishra, P.K.; Rai, A. Role of unmanned aerial systems for natural resource management. J. Indian Soc. Remote Sens. 2022, 49, 671–679. [Google Scholar] [CrossRef]
  2. Cardenas, S.M.; Cohen, M.C.; Ruiz, D.P.; Souza, A.V.; Gomez-Neita, J.S.; Pessenda, L.C.; Culligan, N. Death and Regeneration of an Amazonian Mangrove Forest by Anthropic and Natural Forces. Remote Sens. 2022, 14, 6197. [Google Scholar] [CrossRef]
  3. Zhu, Q.; Lei, Y.; Sun, X.; Guan, Q.; Zhong, Y.; Zhang, L.; Li, D. Knowledge-guided land pattern depiction for urban land use mapping: A case study of Chinese cities. Remote Sens. Environ. 2022, 272, 112916. [Google Scholar] [CrossRef]
  4. Singh, R.; Singh, R.; Gehlot, A.; Akram, S.V.; Priyadarshi, N.; Twala, B. Horticulture 4.0: Adoption of Industry 4.0 Technologies in Horticulture for Meeting Sustainable Farming. Appl. Sci. 2022, 12, 12557. [Google Scholar] [CrossRef]
  5. Keshet, D.; Brook, A.; Malkinson, D.; Izhaki, I.; Charter, M. The Use of Drones to Determine Rodent Location and Damage in Agricultural Crops. Drones 2022, 6, 396. [Google Scholar] [CrossRef]
  6. Yu, F.; Bai, J.; Jin, Z.; Zhang, H.; Guo, Z.; Chen, C. Research on Precise Fertilization Method of Rice Tillering Stage Based on UAV Hyperspectral Remote Sensing Prescription Map. Agronomy 2022, 12, 2893. [Google Scholar] [CrossRef]
  7. Saponi, M.; Borboni, A.; Adamini, R.; Faglia, R.; Amici, C. Embedded Payload Solutions in UAVs for Medium and Small Package Delivery. Machines 2022, 10, 737. [Google Scholar] [CrossRef]
  8. Yakushiji, K.; Fujita, H.; Murata, M.; Hiroi, N.; Hamabe, Y.; Yakushiji, F. Short-range transportation using unmanned aerial vehicles (UAVs) during disasters in Japan. Drones 2020, 4, 68. [Google Scholar] [CrossRef]
  9. Singh, C.H.; Mishra, V.; Jain, K.; Shukla, A.K. FRCNN-Based Reinforcement Learning for Real-Time Vehicle Detection, Tracking and Geolocation from UAS. Drones 2022, 6, 406. [Google Scholar] [CrossRef]
  10. Guan, S.; Zhu, Z.; Wang, G. A Review on UAV-Based Remote Sensing Technologies for Construction and Civil Applications. Drones 2022, 6, 117. [Google Scholar] [CrossRef]
  11. Hu, Q.; Wang, P.; Li, S.; Liu, W.; Li, Y.; Lu, W.; Yu, A. Research on Intelligent Crack Detection in a Deep-Cut Canal Slope in the Chinese South–North Water Transfer Project. Remote Sens. 2022, 14, 5384. [Google Scholar] [CrossRef]
  12. Lee, K.; Lee, W.H. Earthwork Volume Calculation, 3D Model Generation, and Comparative Evaluation Using Vertical and High-Oblique Images Acquired by Unmanned Aerial Vehicles. Aerospace 2022, 9, 606. [Google Scholar] [CrossRef]
  13. Beavers, J.E.; Moore, J.R.; Rinehart, R.; Schriver, W.R. Crane-related fatalities in the construction industry. J. Constr. Eng. Manag. 2016, 132, 901–910. [Google Scholar] [CrossRef]
  14. Teizer, J.; Allread, B.S.; Mantripragada, U. Automating the blind spot measurement of construction equipment. Autom. Constr. 2010, 19, 491–501. [Google Scholar] [CrossRef]
  15. Zhu, Z.; Park, M.W.; Koch, C.; Soltani, M.; Hammad, A.; Davari, K. Predicting movements of onsite workers and mobile equipment for enhancing construction site safety. Autom. Constr. 2016, 68, 95–101. [Google Scholar] [CrossRef] [Green Version]
  16. Nadhim, E.A.; Hon, C.; Xia, B.; Stewart, I.; Fang, D. Falls from height in the construction industry: A critical review of the scientific literature. Int. J. Environ. Res. Public Health 2016, 13, 638. [Google Scholar] [CrossRef] [Green Version]
  17. Ale, B.J.; Bellamy, L.J.; Baksteen, H.; Damen, M.; Goossens, L.H.; Hale, A.R.; Whiston, J.Y. Accidents in the construction industry in the Netherlands: An analysis of accident reports using Storybuilder. Reliab. Eng. Syst. Saf. 2008, 93, 1523–1533. [Google Scholar] [CrossRef]
  18. Azevedo, R.; Martins, C.; Teixeira, J.C.; Barroso, M. Obstacle clearance while performing manual material handling tasks in construction sites. Saf. Sci. 2014, 62, 205–213. [Google Scholar] [CrossRef]
  19. Shringi, A.; Arashpour, M.; Golafshani, E.M.; Rajabifard, A.; Dwyer, T.; Li, H. Efficiency of VR-Based Safety Training for Construction Equipment: Hazard Recognition in Heavy Machinery Operations. Buildings 2022, 12, 2084. [Google Scholar] [CrossRef]
  20. Teizer, J.; Cheng, T.; Fang, Y. Location tracking and data visualization technology to advance construction ironworkers’ education and training in safety and productivity. Autom. Constr. 2013, 35, 53–68. [Google Scholar] [CrossRef]
  21. Choe, S.; Seo, W.; Kang, Y. Inter-and intra-organizational safety management practice differences in the construction industry. Saf. Sci. 2020, 128, 104778. [Google Scholar] [CrossRef]
  22. Everett, J.G.; Slocum, A.H. CRANIUM: Device for improving crane productivity and safety. J. Constr. Eng. Manag. 1993, 119, 23–39. [Google Scholar] [CrossRef]
  23. Lee, U.K.; Kang, K.I.; Kim, G.H.; Cho, H.H. Improving tower crane productivity using wireless technology. Comput. Aided Civ. Infrastruct. Eng. 2006, 21, 594–604. [Google Scholar] [CrossRef]
  24. Lee, J.H.; Song, J.H.; Oh, K.S.; Gu, N. Information lifecycle management with RFID for material control on construction sites. Adv. Eng. Inform. 2013, 27, 108–119. [Google Scholar] [CrossRef]
  25. Kelm, A.; Laußat, L.; Meins-Becker, A.; Platz, D.; Khazaee, M.J.; Costin, A.M.; Teizer, J. Mobile passive Radio Frequency Identification (RFID) portal for automated and rapid control of Personal Protective Equipment (PPE) on construction sites. Autom. Constr. 2013, 36, 38–52. [Google Scholar] [CrossRef]
  26. Dong, S.; He, Q.; Li, H.; Yin, Q. Automated PPE misuse identification and assessment for safety performance enhancement. In Proceedings of the ICCREM 2015, Lulea, Sweden, 11–12 August 2015; pp. 204–214. [Google Scholar]
  27. Azar, E.R.; McCabe, B. Part based model and spatial–temporal reasoning to recognize hydraulic excavators in construction images and videos. Autom. Constr. 2012, 24, 194–202. [Google Scholar] [CrossRef]
  28. Kim, H.; Kim, H.; Hong, Y.W.; Byun, H. Detecting construction equipment using a region-based fully convolutional network and transfer learning. J. Comput. Civ. Eng. 2018, 32, 04017082. [Google Scholar] [CrossRef]
  29. Park, M.W.; Brilakis, I. Construction worker detection in video frames for initializing vision trackers. Autom. Constr. 2012, 28, 15–25. [Google Scholar] [CrossRef]
  30. Park, M.W.; Elsafty, N.; Zhu, Z. Hardhat-wearing detection for enhancing on-site safety of construction workers. J. Constr. Eng. Manag. 2015, 141, 04015024. [Google Scholar] [CrossRef]
  31. Memarzadeh, M.; Golparvar-Fard, M.; Niebles, J.C. Automated 2D detection of construction equipment and workers from site video streams using histograms of oriented gradients and colors. Autom. Constr. 2013, 32, 24–37. [Google Scholar] [CrossRef]
  32. Rezazadeh Azar, E.; McCabe, B. Automated visual recognition of dump trucks in construction videos. J. Comput. Civ. Eng. 2012, 26, 769–781. [Google Scholar] [CrossRef]
  33. Mneymneh, B.E.; Abbas, M.; Khoury, H. Vision-based framework for intelligent monitoring of hardhat wearing on construction sites. J. Comput. Civ. Eng. 2019, 33, 04018066. [Google Scholar] [CrossRef]
  34. Chi, S.; Caldas, C.H. Automated object identification using optical video cameras on construction sites. Comput. Aided Civ. Infrastruct. Eng. 2011, 26, 368–380. [Google Scholar] [CrossRef]
  35. Chi, S.; Caldas, C.H. Image-based safety assessment: Automated spatial safety risk identification of earthmoving and surface mining activities. J. Constr. Eng. Manag. 2012, 138, 341–351. [Google Scholar] [CrossRef] [Green Version]
  36. Kim, H.; Kim, K.; Kim, H. Vision-based object-centric safety assessment using fuzzy inference: Monitoring struck-by accidents with moving objects. J. Comput. Civ. Eng. 2016, 30, 04015075. [Google Scholar] [CrossRef]
  37. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. arXiv 2019, arXiv:1905.05055. [Google Scholar] [CrossRef]
  38. Kolar, Z.; Chen, H.; Luo, X. Transfer learning and deep convolutional neural networks for safety guardrail detection in 2D images. Autom. Constr. 2018, 89, 58–70. [Google Scholar] [CrossRef]
  39. Fang, W.; Ding, L.; Zhong, B.; Love, P.E.; Luo, H. Automated detection of workers and heavy equipment on construction sites: A convolutional neural network approach. Adv. Eng. Inform. 2018, 37, 139–149. [Google Scholar] [CrossRef]
  40. Fang, Q.; Li, H.; Luo, X.; Ding, L.; Luo, H.; Rose, T.M.; An, W. Detecting non-hardhat-use by a deep learning method from far-field surveillance videos. Autom. Constr. 2018, 85, 1–9. [Google Scholar] [CrossRef]
  41. Fang, W.; Ding, L.; Luo, H.; Love, P.E. Falls from heights: A computer vision-based approach for safety harness detection. Autom. Constr. 2018, 91, 53–61. [Google Scholar] [CrossRef]
  42. Xiao, B.; Zhang, Y.; Chen, Y.; Yin, X. A semi-supervised learning detection method for vision-based monitoring of construction sites by integrating teacher-student networks and data augmentation. Adv. Eng. Inform. 2021, 50, 101372. [Google Scholar] [CrossRef]
  43. Gugssa, M.; Gurbuz, A.; Wang, J.; Ma, J.; Bourgouin, J. PPE-Glove Detection for Construction Safety Enhancement Based on Transfer Learning. In Computing in Civil Engineering 2021; ASCE Library: Reston, VA, USA, 2021; pp. 58–65. [Google Scholar]
  44. Wang, Z.; Wu, Y.; Yang, L.; Thirunavukarasu, A.; Evison, C.; Zhao, Y. Fast personal protective equipment detection for real construction sites using deep learning approaches. Sensors 2021, 21, 3478. [Google Scholar] [CrossRef] [PubMed]
  45. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  46. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  47. Liang, H.; Seo, S. Lightweight deep learning for road environment recognition. Appl. Sci. 2022, 12, 3168. [Google Scholar] [CrossRef]
  48. Liang, H.; Seo, S. Automatic detection of construction workers’ helmet wear based on lightweight deep learning. Appl. Sci. 2022, 12, 10369. [Google Scholar] [CrossRef]
  49. Duan, R.; Deng, H.; Tian, M.; Deng, Y.; Lin, J. SODA: Site Object Detection dAtaset for Deep Learning in Construction. arXiv 2022, arXiv:2202.09554. [Google Scholar]
  50. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  51. Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
  52. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  53. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  54. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  55. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Adam, H. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  56. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  57. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  58. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  59. Wang, Q.; Wu, B.; Zhu, P.F.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539. [Google Scholar]
  60. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  61. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef] [Green Version]
  62. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YoloX: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  63. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
Figure 2. Experimental sites and data collection. (a) Location of the test site, (b) UAV, wearable glasses, and operating stick, (c) UAV wrap-around inspection path. (d) Example of UAV data.
Figure 3. Example of flight attitude of a UAV during the wrap-around inspection.
Figure 4. The proposed construction site target detection network applied to UAVs in this paper.
Figure 5. Tandem modules in the backbone network.
Figure 6. The two types of channel number adjustment operations in Step 1: (a) patch partition; (b) patch merging.
Figure 7. The pure pooling operation suppresses the diversity of channel features.
Figure 8. The channel attention module is obtained by improving the channel multi-branch structure.
Figure 9. The overall structure of LCMA-Net can be plug-and-play in the network since it does not change the dimensionality and size of the input feature maps.
Figure 10. The bounding boxes' normalized height and width clusters are obtained through K-means clustering.
Figure 11. Number and percentage of ground truth for each category of the SODA dataset.
Figure 12. (a) Loss function convergence status and (b) iterative upward trend in mAP for different backbone network models for training 300 epochs.
Figure 13. Visual heat map of Conv7, Conv8, and Conv9, where the more highlighted areas indicate higher weights given by the network. (a) No attention module configured and (b) with LCMA-Net configured.
Figure 14. Comparison of precision × recall curves of detection: (a) person, (b) vest, (c) helmet, (d) board, (e) wood, (f) rebar, (g) brick, (h) scaffold, (i) handcart, (j) cutter, (k) e-box, (l) hopper, (m) hook, (n) fence, (o) slogan.
Figure 15. The results of each model for each detection category, listed in order of AP value: (a) Faster-RCNN, (b) YOLOv5, (c) YOLOX, (d) EfficientDet, and (e) our approach.
Figure 16. Comparison of comprehensive performance of models: (a) parameters (M) vs. mAP, (b) G-FLOPs (G) vs. mF1, (c) parameters (M) vs. inference speed (s), (d) G-FLOPs (G) vs. inference speed (s).
Figure 17. Example visualization of the UAV's wrap-around route inspection on the actual construction site.
Figure 18. Comparison of visualization results for different flight heights: (a) 10 m, (b) 15 m, (c) 20 m.
Table 1. Summary of object detection papers in the traditional method.
Objective | Key Algorithm(s) | Performance | Reference
Moving loads in tower crane | CRANIUM system | The safety of cranes has improved | John and Alexander [22]
Moving loads in tower crane | RFID | The safety of cranes has improved | Lee et al. [23]
Construction material | RFID and PDA | Identify material properties up to 4.2 m | Lee et al. [24]
PPE | RFID | The detection accuracy was high, but frequent errors were seen at fast passing speeds | Kelm et al. [25]
Helmets | Real-time location system (RTLS), pressure sensors | Upon approaching dangerous areas without wearing helmets, warnings sounded within 3 s | Dong et al. [26]
Excavators | SVM, HoG | 95.2% accuracy | Azar et al. [27]
Concrete mixer truck | HoG | 96.33% mAP | Kim et al. [28]
Workers | Background subtraction, HoG shaper, color histogram | 99.0% precision and time-lapse rate of 0.67 s | Park and Brilakis [29]
Helmets | Background subtraction, HoG | 94.3% precision and 89.4% recall | Park et al. [30]
Workers, excavators, trucks | HoG, hue–saturation color | 98.83%, 82.10%, 84.88%, respectively | Memarzadeh et al. [31]
Trucks | Haar–HoG, Blob–HoG | 91% detection rate and 0.24 false alarms per frame | Azar et al. [32]
Workers, helmets | Standard deviation matrix (SDM), color-based classification algorithm | 98.5% precision and 90% recall | Mneymneh et al. [33]
Workers and excavators | Bayes classifiers, multi-network classifiers | 96% accuracy | Chi and Caldas [34]
Heavy equipment | Bayes classifiers, neural network | 84.82% accuracy | Chi and Caldas [35]
Workers and heavy equipment | GMM, Kalman filter, fuzzy inference | 95.58% accuracy | Kim et al. [36]
Table 2. Summary of object detection papers in the deep learning method.
Objective | Key Algorithm(s) | Performance | Reference
Guardrails | VGG-16, MLP | The accuracy rate is 97% for a single and 86% for a multiple | Kolar et al. [38]
Workers and heavy equipment | Faster R-CNN | 91% mAP for workers and 95% mAP for heavy equipment | Fang et al. [39]
Helmets | Faster R-CNN | 95.7% precision, 94.9% recall, and speed of 0.205 s | Fang et al. [40]
Safety harness | Faster R-CNN with RPN | 99% precision and 95% recall | Fang et al. [41]
Safety harness | Faster R-CNN and ResNet-50 | mAP 92.7% | Xiao et al. [42]
Gloves | YOLO v3, VGG-19, ResNet-50 | mAP 78.48% | Gugssa et al. [43]
Helmets | YOLO v5x | mAP 86.55% | Wang et al. [44]
Table 3. Camera and aerial photography parameters of the UAV.
UAV Parameters | Value | Experimental Parameters | Value
Takeoff weight | 0.4 kg | Inspection height (H) | 10, 15, 20 m
Dimensions (L × W × H) | 180 × 180 × 80 mm | Shooting angles (R) | 30°~70°
Max resolution | 4K/60 fps | Total duration of a single inspection | 15 min
Max video transmission bitrate | 50 Mbps | Inspection interval (T) | 2 h
Field of view (FOV) | 155° | Flight speed | 1 m/s
Propeller guard | Built-in | Wind speed | 0~3 m/s
Table 4. The parameters for setting the anchor mask are determined.
Layer | Mask Size (Width, Height)
Anchor 1 | (73, 64); (56, 136); (171, 183)
Anchor 2 | (15, 45); (31, 29); (27, 74)
Anchor 3 | (3, 5); (6, 13); (11, 23)
Table 5. Hardware and software specification.
Items | Description
H/W | CPU: i5-11400F; RAM: 16 GB; SSD: Samsung SSD; Graphics card: NVIDIA 3050
S/W | Operating system: Windows 11 Pro, 64-bit; Programming language: Python 3.7; Learning framework: PyTorch 1.9.0
Table 6. The hyperparameters set during the proposed method's training.
Input shape | Batch size | Total epochs | Loss function | Max_lr | Min_lr | Decay type | Mosaic | Mixup
640 × 640 | 8 | 300 | CIoU | 0.01 | 0.0001 | Cosine annealing [51] | True | True
Table 7. The results of the ablation experiments regarding the replacement of each attention module, where the highest mAP value is shown in bold.
Attention Module | Parameters (Millions) | mAP
Baseline | 87.14 | 78.96%
SENet [58] | 87.27 | 80.50%
ECA-Net [59] | 87.14 | 80.53%
CBAM [60] | 87.61 | 81.08%
LRCA-Netv2 [48] | 87.15 | 81.27%
LCMA-Net | 87.25 | 82.48%
Table 8. The results of the proposed method are compared with other models in the same environment and on the same dataset.
Method | Input Size | mAP (%) | mF1 (%) | Parameters (Millions) | G-FLOPs (G) | Inference Speed (s)
Faster-RCNN (ResNet) | 600 × 600 | 71.40 | 63.57 | 370.01 | 136.78 | 0.28015
Faster-RCNN (VGG) | 600 × 600 | 68.85 | 59.13 | 939.36 | 28.316 | 0.32341
YOLOv5-L | 640 × 640 | 72.72 | 65.51 | 114.31 | 46.65 | 0.06970
YOLOv5-X | 640 × 640 | 75.88 | 69.81 | 217.41 | 87.27 | 0.06999
YOLOX-L | 640 × 640 | 75.03 | 69.34 | 106.12 | 37.62 | 0.06990
YOLOX-X | 640 × 640 | 76.94 | 72.31 | 188.51 | 70.84 | 0.07321
EfficientDet-D4 | 1024 × 1024 | 62.56 | 51.53 | 110.26 | 20.70 | 0.07021
EfficientDet-D5 | 1280 × 1280 | 63.77 | 52.74 | 270.65 | 33.63 | 0.09142
EfficientDet-D6 | 1280 × 1280 | 66.91 | 55.01 | 546.31 | 51.84 | 0.14425
EfficientDet-D7 | 1536 × 1536 | 69.88 | 57.41 | 650.17 | 57.57 | 0.19436
Our approach | 640 × 640 | 82.48 | 79.93 | 87.25 | 249.41 | 0.07811
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liang, H.; Cho, J.; Seo, S. Construction Site Multi-Category Target Detection System Based on UAV Low-Altitude Remote Sensing. Remote Sens. 2023, 15, 1560. https://doi.org/10.3390/rs15061560
