Article

Construction Activity Recognition Method Based on Object Detection, Attention Orientation Estimation, and Person Re-Identification

1 School of Civil Engineering, University of Science and Technology Liaoning, Anshan 114051, China
2 School of Civil Engineering, Dalian University of Technology, Dalian 116024, China
3 Northeast Branch China Construction Eighth Engineering Division Corp., Ltd., Dalian 116019, China
4 College of Transportation Engineering, Dalian Maritime University, Dalian 116026, China
* Authors to whom correspondence should be addressed.
Buildings 2024, 14(6), 1644; https://doi.org/10.3390/buildings14061644
Submission received: 23 April 2024 / Revised: 29 May 2024 / Accepted: 31 May 2024 / Published: 3 June 2024
(This article belongs to the Special Issue Engineering Safety Monitoring and Management)

Abstract

Recognizing and classifying construction activities helps in monitoring and managing construction workers. Deep learning and computer vision technologies have addressed many limitations of traditional manual methods in complex construction environments. However, distinguishing different workers and establishing a clear recognition logic remain challenging. To address these issues, we propose a novel construction activity recognition method that integrates multiple deep learning algorithms. To complete this research, we created three datasets: 727 images for construction entities, 2546 for posture and orientation estimation, and 5455 for worker re-identification. First, a YOLO v5-based model is trained for worker posture and orientation detection. A person re-identification algorithm is then introduced to distinguish workers by tracking their coordinates, body and head orientations, and postures over time, and each worker's attention direction is then estimated. Additionally, a YOLO v5-based object detection model is developed to identify ten common construction entity objects. The worker's activity is determined by combining their attentional orientation, positional information, and interaction with detected construction entities. Ten video clips are selected for testing, and a total of 745 instances of workers are detected, achieving an accuracy rate of 88.5%. With further refinement, this method shows promise for broader application in construction activity recognition, enhancing site management efficiency.

1. Introduction

Inefficient and irresponsible working has been shown to result not only in wasted time and resources and economic losses to the project as a whole but also in a reduction in construction quality, which in turn increases the risk to the structural safety of buildings [1,2]. One of the most effective ways to address this issue is to accurately monitor and record the labor consumption of each worker and then compare the results with the project baseline [3]. Timely and effective analysis and tracking of each worker’s activities are therefore essential for assessing and monitoring site productivity.
Most traditional monitoring methods are manual: professionals stationed at the construction site observe construction activities, which leads to insufficient coverage, unreasonable personnel scheduling, long time consumption, and strong subjectivity in the monitoring results [4]. With the rapid development of intelligent construction technology [5], a number of advanced hardware devices have been applied to construction activity detection [6]. For example, acceleration sensors, which are widely used in structural health monitoring, have become more portable owing to their light weight, and related technologies have been applied to worker activity recognition [7,8,9,10]. However, such contact sensors have practical limitations: a single sensor is often insufficient to detect different types of motion accurately, while strapping multiple sensors to the arms and thighs may hinder the workers' activities. A more appropriate solution is therefore contactless monitoring with high accuracy, based mainly on video and image signals and using computer vision techniques that have long been applied in medical diagnostics [11,12], food detection [13,14], vehicle identification [15,16], and structural health monitoring [17,18,19,20,21,22]. The rapid development of deep learning and artificial intelligence technologies makes it possible to apply computer vision techniques to construction worker activity detection [6].
The purpose of this paper is to present a method for recognizing the construction activities of each individual worker; the technical scheme is shown in Figure 1. Multiple computer vision algorithms were utilized in the research. Initially, YOLO v5 was used to detect construction workers' postures and the orientations of their heads and bodies. Then, the person re-identification model TransReID was applied to distinguish and match different construction workers, obtaining the position and attention orientation of each worker at different moments. Next, YOLO v5 was used again to detect the relevant construction entity objects, and construction activities were recognized from the positional relationships between the workers and the construction objects combined with the attention orientations. The paper is structured as follows: Section 2 provides a literature review. Section 3 explains the method for recognizing construction activities. Section 4 establishes the vision models. Section 5 presents the results of the activity recognition experiments on the construction video clips. Finally, Section 6 and Section 7 present the discussion and conclusions, respectively.

2. Literature Review

2.1. Computer Vision Technology and Engineering Applications

Computer vision technology is a method that employs computers and mathematical algorithms to simulate and automate human vision processing. Its primary objective is to extract significant information from image and video data and analyze, process, and comprehend it.
Deep learning technology has seen explosive development in recent years, and its main applications in computer vision are convolutional neural networks and object detection techniques. Object detection was pioneered by the R-CNN algorithm [23], followed by Fast R-CNN [24] and Faster R-CNN [25], which successively addressed its shortcomings and increased detection speed. With the appearance of FCN [26] and Mask R-CNN [27], image segmentation technology has matured, and object detection algorithms, combined with lightweight networks [28,29,30,31,32,33,34], can now run directly on mobile devices without relying on cloud servers. The theoretical foundation of this paper builds on the development of these algorithms.
Computer vision techniques have been increasingly used in the engineering construction field to detect various types of objects at construction sites. For example, Fang et al. [35] proposed using Faster R-CNN to detect workers and various types of equipment at construction sites. Lee et al. [36] pointed out that in indoor construction sites, accidents caused by small tools often occur, so they built a database containing 12 types of small tools for indoor construction, and the YOLO v5 algorithm has good detection results on this database. Kim et al. [37] introduced optimal domain adaptive object detection with a self-training and adversarial-based approach. The experimental results showed that the proposed method for detecting workers and heavy equipment at construction sites has high accuracy and robustness. Based on the detection of construction objects, computer vision and object detection can also be used to identify the risk of human intrusion at construction sites. Mei et al. [38] realized the detection of human intrusion in static hazardous areas at construction sites using YOLO v5.
For individuals during construction, computer vision techniques can be used to estimate workers’ postures and orientations. Liu et al. [39] extracted 3D keypoint coordinates of construction workers to assess fall risk. Fan et al. [40] built a construction worker 3D posture estimation dataset containing 421,000 3D postures and annotations for construction activities. Liu et al. [41] and Halder et al. [42] further used posture recognition to assess work-related musculoskeletal disorder (WMSD) risk. Orientation estimation, on the other hand, is a less researched area. Cai et al. [43] used Faster R-CNN to detect the head and body orientation of construction workers and then applied a multi-task learning network to assess the direction of workers’ visual attention. Person re-identification has been a popular direction in computer vision in recent years [44], but it has seen little use in engineering and construction. Cheng et al. [45] combined object detection methods and person re-identification algorithms to detect personal protective equipment based on worker tracking. The above research advances the application of computer vision technology in engineering and construction and provides technical support for the approach proposed in this paper.

2.2. Computer Vision Application in Workers’ Construction Activities Recognition

Efficiently recognizing construction workers’ activities can help project managers control construction progress and allocate on-site labor resources more effectively. This is an important method for analyzing labor productivity [6].
In early construction activity recognition research [46,47,48], traditional image processing methods were more commonly used. However, with the development and improvement of CNNs, the training cost of object detection algorithms has been further reduced, resulting in improved detection speed and accuracy. As a result, they can now be applied to construction activity recognition. Luo et al. [49] utilized the Faster R-CNN object detection algorithm to identify construction workers and entity objects in construction site images. They then employed a relevance network, taking into account the spatial relationships between the objects, to recognize multiple construction activities. Fang et al. [50] conducted similar studies to detect non-certified work through SORT and face recognition, while Li et al. [51] used CenterNet to detect construction workers and construction objects and evaluate construction productivity while recognizing the activity of assembling reinforcement bars. Luo et al. [52] attempted to integrate object detection algorithms with other algorithms. They combined YOLO v3 with SORT and 3D CNN to achieve spatial localization of construction workers and recognition of construction activities. In another study, Luo et al. [53] applied YOLO v3, SORT, KCF, C3D, CRF, and other algorithms in multiple steps to recognize the construction activities of workers in groups. Li et al. [54] used Faster R-CNN to recognize construction activities in images and detect construction workers. They determined the construction activities of different workers through the spatial relationship between the individuals and the activities. Bhokare et al. [55] also proposed a YOLO v3-based method to recognize construction activities directly.
In recent years, studies have been conducted to recognize construction activities based on video clips. Luo et al. [56,57] proposed recognizing construction activities using multistream convolutional neural networks. Roberts et al. [58] integrated a pose estimation algorithm into video-based construction activity recognition. Cai et al. [59] proposed an attentional orientation estimation method to recognize groups of construction workers. This method was then used for activity classification through LSTM. Li et al. [60] identified three types of construction worker behaviors: throwing, operating, and crossing, using YOLO and ST-GCN. Torabi et al. [61] proposed a YOWO53 model that recognizes construction activities. This model can be used for construction productivity analysis, although the inference is not fast. Li and Li [62] utilized Openpose and GAN to obtain the skeleton joints of construction workers. They then employed ResNet to recognize the construction activities of workers under occlusion.

2.3. Summary

In the studies on construction activity recognition mentioned above, it was found that most of them faced challenges in effectively distinguishing between different construction workers. When recognizing construction activities, if all workers are simply categorized as “worker”, different individuals appearing in the same construction scene share the same category name, even over short time intervals. Establishing spatial and temporal correspondence between images and videos then becomes difficult, making it impossible to determine whether workers appearing in two images or videos are the same person. It also becomes impossible to accurately identify the performers of the recognized construction activities or to track the activities of different workers over a long period of time. To address this issue, the authors previously suggested creating distinct classifications for different individuals [54]. However, this approach is not feasible for recognizing construction activities on a large scale due to the extensive labeling required. Therefore, this paper proposes using the person re-identification method to solve this problem.
Additionally, some studies have directly analyzed video clips to recognize construction activities. Given the practical constraints of complex working conditions and limited computing power, this paper focuses on image-based research. A previous study proposed a method for recognizing construction workers and activities in construction images [54]. However, this method requires annotating a large number of images. For instance, in the case of the ‘transfer’ act, the dataset should consider the various postures of the worker performing this activity. Additionally, the dataset should include several different workers performing this activity. In the paper, 2729 boxes were set up in the dataset for the activity of ‘transfer’ alone. If this approach is applied to the activity recognition of construction scenarios with multiple processes, the workload of dataset creation will be significantly increased. Therefore, we believe that detecting individual and construction entity objects and then recognizing construction activities through positional relationships is a more reasonable solution. This approach is especially suitable for recognizing large-scale construction activities. Similar studies have been conducted by Luo et al. [63], Zhang et al. [63,64], and Fang et al. [50]. However, these studies do not distinguish between different individuals. This problem can be solved by using the person re-identification method mentioned in the previous paragraph.
Another problem is that most previous studies have not considered the posture and orientation of the worker. It is possible for an operator to be incorrectly recognized as performing a relevant construction activity in previous studies, even when their back is turned to a construction object or their attention is elsewhere. To improve the accuracy of construction activity recognition, it is necessary to introduce algorithms for estimating personnel posture and attention orientation in addition to the existing methods.
In summary, the objective of this paper is to propose a novel construction activity recognition method that addresses some of the problems in previous studies. First, considering the capabilities of current computing and storage devices, we retained an image-based recognition approach. To reflect the contextual relationships in construction activities, we take multiple images within a video as the research objects. On the one hand, we develop two object detection models, one for detecting construction entities and one for estimating the orientation of an individual worker’s attention, so that spatial location cues can be better utilized to establish accurate “person-object” relationships, which previous studies have not addressed. On the other hand, we introduce a person re-identification model into construction activity recognition for the first time to address the challenge of distinguishing between individual workers in previous studies. Construction activity recognition is achieved by exploiting the positional relationship between the worker and the construction object and by combining the estimation of the worker’s attentional orientation with the matching result of person re-identification.

3. Methodology

For the object detection algorithm, we selected the YOLO v5 algorithm with high detection accuracy and speed. Based on the YOLO v5 algorithm, the authors trained the workers’ orientation estimation model and the construction entity object detection model and detected the related objects in parallel. The former can estimate the three body postures (standing, bending, and squatting) and four orientations of the head and body (east, south, west, and north) of construction workers, and the latter can identify various types of construction entity objects at the construction site.
In order to attribute construction activities to individual workers, it is necessary to accurately identify the performers of the activities at each moment in time. Unlike our previous work, here we selected TransReID, a person re-identification algorithm based on Transformer technology, which has achieved satisfactory results on several person re-identification benchmark datasets. By applying TransReID, the extracted worker detection boxes can be retrieved and matched so that, in combination with the posture and orientation results, each individual’s detection boxes and attention-orientation estimates can be linked along the time series.
Finally, for each individual worker, the coordinates of his detection box as well as the estimation results of posture and attention orientation are combined with the detected construction entity objects, and through the spatial location relationship between them, multiple related activities are recognized.

3.1. Collection of Research Materials

The material and dataset images collected for this research paper are mainly from the construction site of a super high-rise project in the Donggang Business District, Dalian City, Liaoning Province [51]. The structural type of the project is a frame core structure, and the slipform construction method is applied, as shown in Figure 2a. With the increase in the number of floors constructed in the super high-rise structure, the fixed surveillance cameras installed at the site will not be able to record the working plane. Therefore, we installed several cameras at the perimeter fence of the project according to the scheme shown in Figure 2b, which can continuously record the construction video without interference from the number of floors. The selected cameras are all Dahua brand, with a focal length of 4 mm and a resolution of 1080p. All the cameras are connected to the switch via network cables using the RTSP protocol, and then the switch transmits the video signals to the PC. A Python program is run on the PC to extract images from the video stream of multiple signals. To ensure the parallel processing capability of the PC, we have equipped it with 32 G RAM and an Nvidia 2080 ti GPU. The entire video recording system is equipped with a special power supply module to ensure continuous video recording.
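As a minimal sketch of such an acquisition script (assuming OpenCV for RTSP decoding; the stream URLs, credentials, and output paths shown here are illustrative placeholders, not the actual site configuration):

```python
import os
import time
import cv2  # OpenCV handles RTSP decoding via FFmpeg

# Hypothetical stream addresses; the real camera URLs and credentials are site-specific.
STREAMS = {
    "D02": "rtsp://user:pass@192.168.1.102/stream1",
    "D03": "rtsp://user:pass@192.168.1.103/stream1",
}
INTERVAL_S = 3  # sampling interval in seconds, matching the 3 s detection interval used later

def grab_frames(camera_id: str, url: str, out_dir: str = "frames") -> None:
    """Save one frame every INTERVAL_S seconds from a single RTSP stream."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(url)
    last_saved = 0.0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:  # stream dropped; a production script would reconnect here
            break
        now = time.time()
        if now - last_saved >= INTERVAL_S:
            cv2.imwrite(os.path.join(out_dir, f"{camera_id}_{int(now)}.jpg"), frame)
            last_saved = now
    cap.release()

if __name__ == "__main__":
    for cam_id, cam_url in STREAMS.items():
        grab_frames(cam_id, cam_url)  # in practice each stream runs in its own process or thread
```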
The images in the material of this paper are mainly derived from cameras D02, D03, D05, and D06.

3.1.1. Dataset of Construction Entities

The construction entity object detection dataset contains a total of 727 images, including ten types of construction entity objects commonly found on site, which are: wooden stick (WSK) used for fixing formwork, tank (TNK) for welding, formwork (FMW), stirrups (STU), longitudinal rebar (LRB), scaffolding steel pipe (SSP), formwork processing table (FPT), brace for strengthening the structure (BRACE), column (CLM), and beam (BEAM). The names of each category in the dataset are abbreviated in English. Table 1 and Figure 3 show the details of the dataset. The labeling of consumables such as stirrups, longitudinal bars, wooden sticks, scaffolding steel pipes, etc. is not done for scattered individual pieces of material but for piles of material.

3.1.2. Dataset for Estimation of Posture and Orientation

The orientation of a worker’s body and head is closely linked to their attention. Estimating their orientation can help determine whether they are engaged with a construction object nearby. Posture estimation is a valuable indicator of their attention, as workers are more likely to be engaged in construction activities when bending or squatting rather than standing. Additionally, a worker can be assumed to be resting if he is in a squatting posture and no nearby construction object satisfies the positional conditions.
In recent years, scholars have studied personnel orientation estimation, including Baxter et al. [65], Liu and Ma [66], and Raza et al. [67]. These studies categorized an individual’s orientation into eight categories: East, Northeast, North, Northwest, West, Southwest, South, and Southeast. However, according to Cai et al. [43], accurately assessing the orientation of a human body is challenging due to the low resolution of most construction surveillance video images. To address this issue, Cai et al. categorized both the body orientation and the head orientation into four categories: East, South, West, and North. This categorization avoids serious ambiguities between neighboring classes (e.g., Northeast and East, Southwest and West). The study presented in this paper follows the aforementioned idea. Additionally, the dataset includes three body postures commonly found in construction scenarios: standing, bending, and squatting.
Table 2 and Figure 4 present the contents of the dataset, which includes images from a construction site described in Section 3.1.1 as well as web material from video clips of a hospital construction site in the United States. The dataset includes 16 categories, with the first 12 indicating the posture and body orientation of the personnel. The categories are labeled as st, b, and s for standing, bending, and squatting, respectively. The directions are labeled as e, s, w, and n for east, south, west, and north, respectively. For instance, wkr_st_e indicates that the worker is standing with an eastward body orientation. The final four categories of the dataset indicate head orientation. For instance, head_n indicates that the head orientation is north.
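To illustrate the naming convention, the following hypothetical helper (not part of the released code) splits a predicted class name into its posture and orientation components:

```python
POSTURES = {"st": "standing", "b": "bending", "s": "squatting"}
DIRECTIONS = {"e": "east", "s": "south", "w": "west", "n": "north"}

def parse_label(name: str) -> dict:
    """Decode dataset class names such as 'wkr_st_e' (worker, standing, facing east)
    or 'head_n' (head oriented north)."""
    parts = name.split("_")
    if parts[0] == "head":
        return {"type": "head", "orientation": DIRECTIONS[parts[1]]}
    _, posture, direction = parts
    return {"type": "worker", "posture": POSTURES[posture], "orientation": DIRECTIONS[direction]}

# Example: parse_label("wkr_b_w") -> {'type': 'worker', 'posture': 'bending', 'orientation': 'west'}
```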

3.1.3. Dataset for Worker Re-Identification

The purpose of utilizing the person re-identification method is to match the identified construction activities to the correct performers. In a previous study, the authors suggested creating a distinct classification for each worker based on their unique appearance and characteristics; this approach is better suited for small-scale construction scenarios with relatively stable personnel, low turnover rates, and extended work cycles [54]. Applying it in construction scenarios with frequent personnel movements and irregular work contents and locations is more challenging. Additionally, creating a separate category for each worker would increase the difficulty and workload of labeling and the redundancy of the dataset. Therefore, this paper applies the person re-identification (Person ReID) algorithm, which matches gallery images to different individuals based on appearance features and eliminates the need for separate categories for each worker.
The DukeMTMC dataset from Duke University is a widely used dataset in the ReID field. The dataset contains images of over 2700 individuals, captured by eight cameras [68]. To bring the trained ReID model closer to actual engineering construction, the authors modeled the format of the DukeMTMC and created a worker ReID dataset by capturing images of 100 workers from the cameras at the construction site mentioned earlier. The dataset contains a total of 5455 images. Figure 5 displays some of the images from the worker ReID dataset proposed in this paper, where 001 and 002 indicate the worker ID.

3.2. Computer Vision Algorithms

3.2.1. Object Detection

In the field of object detection, algorithms are typically categorized into two groups: one-stage and two-stage. This paper utilizes YOLO v5, a leading one-stage algorithm, to detect construction entity objects and worker postures and orientations because of its strong performance and speed. Under the methodology proposed in this paper, other classical object detection networks, such as Faster R-CNN, SSD, and CenterNet, would be equally applicable to the problem described here. After several iterations since the first version of the YOLO series, YOLO v5 achieves efficient detection of small and overlapping objects while maintaining fast processing speeds.
YOLO v5 introduces several improvements, such as Mosaic data augmentation, CSPDarknet-53 as the backbone, the FPN (Feature Pyramid Network) and PAN (Path Aggregation Network) structures, and the CIOU loss function. These improvements significantly enhance the model’s performance. Adaptive anchor box calculation, the Focus and CSP structures, and the SiLU activation function further optimize the model.
As shown in Figure 6, in practical applications, YOLO v5 scales the input image to a size of 640 × 640. It then extracts features through the CSPDarknet backbone and the SPP layer. The FPN and PAN structures fuse high- and low-level features to enhance recognition accuracy, and predictions at three scales allow accurate recognition of objects of different sizes. The later YOLO v8 algorithm employs new convolutional modules while retaining an overall architecture similar to YOLO v5. Considering the real-time requirements of construction scenarios, this study applied the smaller YOLO v5 model, using the YOLO v5m weights (depth multiple 0.67 and width multiple 0.75) for training. It is important to note that YOLO v5’s Mosaic data augmentation includes horizontal flips, which may result in an individual facing east being labeled as facing west. When training on the posture and orientation dataset, the authors therefore removed the horizontal flip from the Mosaic data augmentation by modifying the source code of YOLO v5.
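In recent releases of the ultralytics/yolov5 repository, a similar effect can usually be achieved through the hyperparameter file rather than a source-code edit, since the fliplr entry controls the left-right flip probability. A minimal sketch, assuming the standard repository layout (the data file name below is illustrative):

```python
import yaml

# Load the default YOLO v5 hyperparameters and disable left-right flipping so that an
# "east"-facing worker is never mirrored into a "west"-facing training sample.
with open("data/hyps/hyp.scratch-low.yaml") as f:
    hyp = yaml.safe_load(f)

hyp["fliplr"] = 0.0  # probability of horizontal flip (default 0.5)
hyp["flipud"] = 0.0  # keep vertical flipping off as well

with open("data/hyps/hyp.no-flip.yaml", "w") as f:
    yaml.safe_dump(hyp, f)

# Training can then be launched with, for example:
#   python train.py --data pose_orientation.yaml --weights yolov5m.pt \
#                   --hyp data/hyps/hyp.no-flip.yaml --img 640 --epochs 500 --batch-size 8
```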

3.2.2. Person Re-Identification

Person re-identification (ReID) is a crucial problem in the field of image retrieval. It aims to retrieve a specific target person from a cross-device pedestrian image set. ReID technology can effectively solve the problem of cross-device person retrieval and tracking in the field of intelligent surveillance and security. It overcomes the limitations of face recognition when clarity is insufficient [69].
The performance of ReID has significantly improved with the development of deep learning, particularly with the application of CNNs. Deep learning methods determine the target person by feature similarity, which involves metric learning, feature learning, and ranking optimization. The Transformer technique has recently been introduced to computer vision due to its success in natural language processing and has been developed into Vision Transformer (ViT) [70].
According to [71], the ViT technique exhibits greater similarity between deep and shallow representations and better preservation of spatial information compared to traditional CNNs. In the field of ReID, He et al. [72] proposed the TransReID algorithm, which applied ViT to the ReID problem for the first time and achieved excellent performance on several benchmark datasets. TransReID was designed with a jigsaw patch module (JPM) and a side information embedding (SIE) module to cope effectively with image and camera view-angle changes. It addresses the limitations of CNNs in learning long-range spatial structure and retaining fine-grained details. This paper adopts TransReID as the ReID algorithm because of these ViT-based advantages and its excellent performance in ReID tasks, providing effective technical support for solving the cross-device person retrieval problem.
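The retrieval step itself can be illustrated independently of the network: once a ReID model such as TransReID has produced a feature vector for each cropped detection box, matching reduces to ranking the gallery by feature similarity. A generic sketch (not TransReID’s own code) using cosine similarity:

```python
import numpy as np

def match_query_to_gallery(query_feats: np.ndarray,
                           gallery_feats: np.ndarray,
                           gallery_ids: np.ndarray):
    """Rank gallery crops for each query by cosine similarity of ReID embeddings.

    query_feats:   (Q, D) features of the query crops
    gallery_feats: (G, D) features of the gallery crops
    gallery_ids:   (G,)   worker identity of each gallery crop
    """
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sim = q @ g.T                         # (Q, G) cosine similarity matrix
    order = np.argsort(-sim, axis=1)      # gallery indices, most similar first
    top1_ids = gallery_ids[order[:, 0]]   # Rank-1 identity assigned to each query
    return sim, order, top1_ids
```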

3.3. Construction Activity Recognition Method

Frames are extracted from the construction video clips and passed to the trained object detection models every 3 s. The posture and orientation estimation model determines each worker’s body posture and body orientation from these frames. The posture, body orientation, and detection box coordinates are saved in the file name and in a generated data file. The ReID model is then used to match the workers across frames. Afterward, the positional relationship between each worker’s detection box and all head detection boxes detected at the current moment is calculated. If Equation (1) is satisfied, the orientation indicated by the head detection box is taken as the worker’s head orientation at that moment. This process is applied to the image frame at each moment, yielding the detection box coordinates, body orientation, head orientation, and body posture of each worker. Figure 7 shows a flowchart of the personnel posture and orientation estimation.
P_{wkr\_ori} = \dfrac{\mathrm{Area(worker)} \cap \mathrm{Area(head)}}{\mathrm{Area(head)}} > \gamma \qquad (1)
The worker’s head orientation is determined by the ratio of the intersection area of the body detection box and the head detection box to the head detection box, represented by Pwkr_ori. If this ratio is greater than 0.95, it is considered that the worker’s head orientation matches the orientation indicated by the head detection box.
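A direct implementation of Equation (1), assuming detection boxes in (x1, y1, x2, y2) pixel coordinates:

```python
def box_intersection_area(a, b):
    """Intersection area of two boxes given as (x1, y1, x2, y2)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return w * h

def head_belongs_to_worker(worker_box, head_box, gamma=0.95):
    """Equation (1): the head box is assigned to the worker if it lies almost
    entirely inside the worker's body box (intersection / head area > gamma)."""
    head_area = (head_box[2] - head_box[0]) * (head_box[3] - head_box[1])
    if head_area <= 0:
        return False
    return box_intersection_area(worker_box, head_box) / head_area > gamma
```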
The workers’ construction activities are recognized by determining the orientation of their attention based on the obtained body and head directions of the personnel and combining them with the detection box of the construction entity object. Figure 8 shows the method used to estimate the attention orientation of the operating individual.
The next step is to determine whether the worker is moving based on the obtained coordinates of the detection box. If Equation (2) is satisfied, the construction worker can be considered to be in a moving state.
\left| X_{c\_t} - X_{c\_t-1} \right| > \min(W_t, W_{t-1}) \quad \mathrm{OR} \quad \left| Y_{c\_t} - Y_{c\_t-1} \right| > \min(W_t, W_{t-1}) \qquad (2)
The X coordinate of the center point of the worker’s detection box at the current moment is represented by Xc_t, while Xc_t−1 represents the X coordinate at the previous moment. Similarly, Yc_t and Yc_t−1 represent the Y coordinates of the center point of the worker’s detection box at the current and previous moments, respectively. The widths of the worker’s detection box at the current and previous moments are represented by Wt and Wt−1, respectively. According to the formula, if the worker moves more than the width of their detection box between two consecutive moments, they can be considered to be in a state of motion. The minimum of the two box widths is used because the width of the detection box may differ significantly when the worker’s posture changes between the two moments; taking the smaller value prevents a posture-induced change in box width from masking a genuine movement.
If the worker is in a moving state, the next check is whether the worker’s detection box intersects the detection box of any detected construction entity object. If the overlap between the worker’s detection box and the detection box of a particular type of construction entity object is greater than zero at two consecutive moments, the worker is considered to be lifting that construction entity object at those moments. If no construction entity object meets this condition, the worker is considered to be walking.
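The movement and lifting checks can be expressed compactly, reusing box_intersection_area from the sketch above; the box format and the per-class object tracks are assumptions made for illustration:

```python
def is_moving(box_prev, box_curr):
    """Equation (2): the worker is moving if the box centre shifts, horizontally or
    vertically, by more than the smaller of the two box widths between consecutive moments."""
    xc_p, yc_p = (box_prev[0] + box_prev[2]) / 2, (box_prev[1] + box_prev[3]) / 2
    xc_c, yc_c = (box_curr[0] + box_curr[2]) / 2, (box_curr[1] + box_curr[3]) / 2
    w_min = min(box_prev[2] - box_prev[0], box_curr[2] - box_curr[0])
    return abs(xc_c - xc_p) > w_min or abs(yc_c - yc_p) > w_min

def moving_activity(worker_prev, worker_curr, object_tracks):
    """For a moving worker, report 'lifting <class>' if his box overlaps the same object
    class at both consecutive moments; otherwise he is walking.
    object_tracks: {class_name: (object_box_prev, object_box_curr)}."""
    for cls, (obj_prev, obj_curr) in object_tracks.items():
        if (box_intersection_area(worker_prev, obj_prev) > 0 and
                box_intersection_area(worker_curr, obj_curr) > 0):
            return f"lifting {cls}"
    return "walking"
```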
The third step is to determine the construction activities performed by a worker who remains in place. The cases below depend on the worker’s attention orientation, body posture, and the location of the construction entity object; a simplified sketch of this decision logic follows the list.
  • The worker’s attention orientation is east.
If there is a construction object that overlaps with the worker’s detection box and is located to its right, it can be inferred that the worker is currently performing an activity related to that construction object, as illustrated in Figure 9. In the video frame, a worker may appear far from a construction object while their detection boxes still intersect because of the parallax of the recording viewpoint, as shown in Figure 10. Therefore, Figure 9 adds a restriction based on the actual situation: when the worker’s attention orientation is to the right and the construction object is located on his right side, the construction object as a whole is usually not higher than the dotted line in the figure.
Figure 9. Detected boxes when the individual’s attention orientation is east.
Figure 10. No activities can be found, although the detected boxes intersect.
  • The worker’s attention orientation is west.
Similar to the situation when attention is directed to the east, Case (1) can be horizontally flipped to correspond to this scenario.
  • The worker’s attention orientation is north.
If the detection box of a construction object and a worker’s detection box satisfy the relationship shown in Figure 11, the worker is considered to have performed the relevant construction activity. Figure 11a shows the case where the construction entity object is in the same plane as the worker and located on the worker’s north side. In this situation, it is crucial to ensure that the construction object is in close proximity to the worker: the distance between the two should be less than the worker’s step distance, which is approximately the width of the body’s detection box. If the distance between the worker and the construction object is not limited, the situation shown in Figure 12 may occur. In that case, the worker is not performing the formwork-related construction activity: although the construction object is located on the worker’s north side, the distance between the two is too large. As depicted in Figure 11b, the worker is positioned above the construction object, which is typically the case for frame columns and braces. This scenario is not applicable to other construction entity objects.
  • The worker’s attention orientation is south.
When the worker’s attention is oriented towards the south, two situations usually occur during a construction activity, as shown in Figure 13. Figure 13a shows that the construction entity object is located on the south side of the operator and that its detection box width is greater than that of the personnel detection box. To ensure that personnel are in close proximity to the construction object, the distance between the two boxes must be less than the width of the operator’s detection box. Figure 13b illustrates the scenario where the operator’s width is greater than that of the construction object, resulting in an intersecting relationship between the two.
  • The worker’s posture is either bending or squatting.
If the worker’s posture is bending and the positional relationship between the construction object and the worker satisfies any of the above four cases, the worker is considered to be engaged in picking-up activities. If the worker is in a squatting posture, construction activities are determined in the same way as in the first four cases. If no construction object satisfies any of the first four cases, the worker is considered to be resting at that moment.
  • The worker is considered to be performing a construction activity related to the object when his detection box is completely covered by the detection box of the construction object.
  • This paper considers the construction activities related to each construction object listed in Table 3. Construction activities, aside from lifting, will be classified as XX-related activities (where XX refers to the detected construction entity object).
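The sketch below condenses the east-facing case and the resting rule into code; it is a simplification for illustration only, the height limit corresponding to the dashed line in Figure 9 and the box format are assumptions, and the west, north, and south cases follow analogously:

```python
def in_situ_activity(worker, objects):
    """Simplified decision for a worker who remains in place and faces east.

    worker:  {'box': (x1, y1, x2, y2), 'attention': 'east',
              'posture': 'standing' | 'bending' | 'squatting'}
    objects: list of (class_name, box) for detected construction entity objects.
    """
    wx1, wy1, wx2, wy2 = worker["box"]
    for cls, (ox1, oy1, ox2, oy2) in objects:
        overlaps = box_intersection_area(worker["box"], (ox1, oy1, ox2, oy2)) > 0
        to_the_right = ox2 > wx2   # object extends beyond the worker's right edge
        not_too_high = oy1 > wy1   # object not higher than the worker (dashed-line limit, assumed)
        if worker["attention"] == "east" and overlaps and to_the_right and not_too_high:
            if worker["posture"] == "bending":
                return f"picking up ({cls})"
            return f"{cls}-related activity"
    return "resting" if worker["posture"] == "squatting" else "not engaged in physical activity"
```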

4. Establishment of Deep Learning Models

4.1. YOLO v5-Based Construction Entity Object Detection Model

The dataset is split into 80% for training and fitting the model and 20% for evaluating its performance. The number of epochs is set to 500, the batch size is 8, and the initial learning rate is 0.01.
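An 80/20 split of this kind can be produced with a short script; the directory layout below follows the usual YOLO v5 images/labels convention and is an assumption rather than the authors’ actual file structure:

```python
import random
import shutil
from pathlib import Path

random.seed(0)  # fixed seed so the split is reproducible
root = Path("datasets/entities")
images = sorted((root / "images" / "all").glob("*.jpg"))
random.shuffle(images)
n_train = int(0.8 * len(images))  # 80% training, 20% validation

for split, subset in (("train", images[:n_train]), ("val", images[n_train:])):
    (root / "images" / split).mkdir(parents=True, exist_ok=True)
    (root / "labels" / split).mkdir(parents=True, exist_ok=True)
    for img in subset:
        label = img.with_suffix(".txt").name  # YOLO-format annotation with the same stem
        shutil.copy(img, root / "images" / split / img.name)
        shutil.copy(root / "labels" / "all" / label, root / "labels" / split / label)
```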
Table 4 presents the relevant performance metrics. The proposed YOLOv5 model achieves an mAP value of 0.961. Figure 14 shows the detection results of several validation set images. It is evident that the model accurately detects and localizes various types of construction entities. These results provide a strong foundation for future research.

4.2. YOLO v5-Based Model for Estimating Workers’ Pose and Orientation

The training parameters remain unchanged from the previous section. As introduced above, YOLO v5 applies Mosaic data augmentation to improve accuracy and stability while avoiding overfitting. However, the random horizontal flip in the Mosaic augmentation can significantly affect this model: when a horizontal flip is applied, a worker with an eastward orientation may be mirrored, so a sample labeled ‘east’ is trained with an actual orientation of ‘west’, resulting in extremely low accuracy for both orientations. To address this issue, the authors modified the source code of YOLO v5 to remove horizontal flipping from the Mosaic augmentation when training on the posture and orientation dataset. Table 5 presents the relevant performance metrics; the model achieved a competitive mAP of 0.898, which is comparable to similar studies [43]. Figure 15 demonstrates accurate recognition of the posture and orientation of individuals in most cases.

4.3. Workers’ Re-Identification Results

To practically test the ReID model’s performance, we selected a construction scene of the concrete pouring process from this project for analysis. We recorded twelve videos on-site from four viewpoints, which we fed into the YOLO v5 model to obtain 2420 images of seven construction workers’ detection boxes. We placed these images in the gallery and selected ten images for each worker, resulting in a total of seventy images in the query library.
The trained TransReID model takes the images from the query library and the gallery as input. For each query image, the model retrieves gallery images ranked by confidence. Person re-identification commonly uses mean average precision (mAP) and Rank-1 as evaluation metrics.
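For reference, both metrics can be computed from the query–gallery similarity matrix produced in the matching step; a simplified sketch that ignores the camera-ID filtering applied in some ReID benchmarks:

```python
import numpy as np

def rank1_and_map(sim, query_ids, gallery_ids):
    """Compute Rank-1 accuracy and mean average precision from a (Q, G) similarity matrix."""
    ranks = np.argsort(-sim, axis=1)                 # gallery indices sorted by similarity
    rank1 = float(np.mean(gallery_ids[ranks[:, 0]] == query_ids))
    aps = []
    for i, qid in enumerate(query_ids):
        matches = (gallery_ids[ranks[i]] == qid).astype(float)
        if matches.sum() == 0:
            continue                                 # no true match in gallery; skip this query
        precision_at_k = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        aps.append(float((precision_at_k * matches).sum() / matches.sum()))
    return rank1, float(np.mean(aps))
```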
The mAP and Rank-1 of the TransReID model, trained using the worker ReID dataset proposed in this study, were both 1.00. The results demonstrate that the ReID model can effectively differentiate between workers in the operational area. This lays a solid foundation for identifying the performers of construction activities.

5. Experiment on Construction Activity Recognition

5.1. Recognition Results

This section presents ten two-minute video clips selected for testing from the construction site of the core structure described earlier in this paper. The videos are used for construction activity recognition and are first processed by two trained YOLO v5 object detection models. The models detect objects at three-second intervals and output the detection results. Construction activity recognition is conducted using the obtained detection results, following the method described in the previous section. For clarity, we will refer to these ten videos as V01 to V10.
To illustrate the recognition results, three clips were selected: the V02 video clip contains two workers, and Figure 16 shows the detection results of V02 by applying two YOLO v5 models. It can be observed that one of the workers depicted in Figure 16a–c is in a relatively static state, with his attention oriented northward and situated on the south side of the SSP. This individual can be identified as engaging in scaffolding-related construction activities at the 15th and 18th seconds. The other worker was in a state of motion and intersected the SSP detection box at both the 15th and 18th seconds. This indicates that he was engaged in the construction activity of lifting the scaffolding at these moments.
Figure 17 illustrates the detection results of V06. The worker in the images is observed to be in a relatively static state at the 15th and 18th seconds, respectively. His attention is determined to be oriented northward by the method in Figure 8. However, this worker is determined to be physically inactive at this moment because his distance from the formwork exceeds the limit.
Figure 18 illustrates the detection results of V08 at two moments: the 87th and 90th seconds. At these moments, all workers were observed to be relatively stationary. One worker was observed to be oriented towards the east, with the column situated to his right. The detection boxes of the other two workers were included within the detection box of the column, indicating that all three workers were engaged in column-related construction activities at these moments.

5.2. Recognition Results Analysis

Table 6 presents the results of ten construction videos subjected to analysis using the methodology proposed in this paper. The methodology exhibits relatively high accuracy in the recognition of static construction activities. However, in scenarios involving a greater number of personnel and a more dispersed array of construction materials, errors may persist, and the precision of the object detection model may also influence the final recognition results.
Incorrect construction activity recognition is primarily attributable to three factors: First, the logical relationship between the spatial location of the worker and the construction entity object used to determine the construction activity is not sufficiently precise. As illustrated in Figure 19a, the worker on the left is squatting, facing east, and his detection box intersects with a piece of formwork scattered on the ground, which is located on his east side. He is recognized as performing construction activities related to the formwork using the method proposed in this paper. However, in reality, the worker is in a resting state.
Second, the activity recognition error resulting from the misestimation of the attention orientation is illustrated in Figure 19b: The attention orientation of the worker on the left should be west, but the object detection model determines his body orientation and head orientation to be north. Consequently, his final attention orientation is determined to be north, and he is recognized to have performed a formwork-related construction activity using the method proposed in this paper.
Third, the erroneous determination of the lifting activity is illustrated in Figure 19c. By applying the methodology proposed in this paper, it can be determined that the worker situated on the right side of the screen is in a state of motion at both moments. The formwork is distributed in a scattered manner in this scenario, resulting in the detection box of the worker intersecting with that of the formwork for two consecutive moments. Consequently, the worker is incorrectly identified as performing the activity of lifting the formwork.
Future work will involve optimizing the logical relationship between the spatial location of the worker’s attention orientation and the construction entity object, improving the accuracy of the attention orientation estimation, and perfecting the judgment logic of the lifting activity. This can be achieved by gradually advancing the aforementioned tasks through the expansion of the dataset content, further improvement of image resolution, the adoption of three-dimensional coordinates, and the implementation of improvements to the vision algorithms.
The overall accuracy of the method is 0.885, and it can recognize construction activities in large-scale scenes without the need to classify and label each worker. Furthermore, it is more accurate in determining whether a worker is “not engaged in physical activity”. Therefore, the results of this method can be considered satisfactory and comparable.

6. Discussion

The paper proposes a method for recognizing construction workers’ activities. The novelty of the proposed method lies in the combined application of object detection, attention orientation estimation, and person re-identification. Different visual models have distinct roles, finally achieving the recognition of construction activities.
The contributions of this paper are as follows:
(1) By applying the method proposed in this paper, it is possible to record the construction activities and work status of each worker in the construction video clips. The determination of the work status is more accurate compared to the method previously proposed by the authors: the status of an individual is determined from his position, posture, and attention orientation when no relevant associated construction entity object is detected.
(2) A construction worker re-identification dataset is established, introducing a re-identification method into construction activity recognition, which makes the identification of the performers of construction activities more accurate.
(3) Posture and orientation estimation are introduced into the construction activity recognition of individual workers. This effectively utilizes the detection box information of workers and construction entity objects output by the object detection algorithm to complete the construction activity recognition.
The authors believe that the proposed method in this study can be applied in project practice, which could improve traditional project management patterns. In the future, we first plan to enhance the generalization ability of the vision model in different scenarios, supplementing training data with the monitored projects if necessary. Subsequently, we will set up surveillance cameras around the construction area for long-term monitoring. Data storage will be achieved through switches and hard disk recorders, while video data will be transmitted to the GPU server via the RTSP protocol using the internet. Finally, the deployed vision algorithm framework will be used to recognize construction activities.
The proposed method also has limitations that need to be addressed in future research. Firstly, the logic for recognizing construction activities requires continuous optimization, as there are currently some unclear aspects to recognizing certain activities, such as identifying the activity of lifting construction entities. Secondly, the accuracy of orientation estimation needs further improvement, especially when the construction worker’s orientation is at a critical point between two directions, where the likelihood of estimation errors increases significantly. Therefore, future research should not only focus on optimizing algorithm performance but also on making the dataset more comprehensive. Compared to other groups, construction workers may have very similar clothing and appearance characteristics, posing a challenge for person re-identification methods. In the future, we will consider individuals with similar attire more extensively in the worker re-identification dataset. Additionally, enhancing the generalization ability across different scenarios and integrating the algorithm framework are key areas for improvement in future work.

7. Conclusions

This paper describes a computer vision-based method for the recognition of multiple construction activities. The method employs two computer vision algorithms: object detection and person re-identification. To complete this study, a dataset containing a total of 727 images in ten categories was produced based on ten types of construction entity objects. The study scenario was the main structural operating surface of a super high-rise building. A dataset comprising sixteen categories for estimating the body posture and orientation of workers was produced based on 2546 images collected from three construction sites. The sixteen categories encompass four orientations of the body and the head, as well as three postures of the body. These two datasets were employed to train two YOLO v5 object detection models, with the ratio of the training set to the validation set set at 8:2. The mAP for the YOLO v5 model based on construction entity objects reached 0.961, while the mAP for the YOLO v5 model based on the estimation of workers’ postures and orientations reached 0.898. In addition, we established a worker re-identification dataset containing 5455 images. Based on this dataset, we trained a ReID model, achieving both an mAP and a Rank-1 score of 1.00.
For the purpose of construction activity recognition, video clips were input into two trained object detection models at a fixed time interval. Among these, the worker posture and orientation estimation model outputs the body detection box and head detection box of the worker and utilizes the trained ReID model to complete the matching of the worker. This is then combined with the body and head orientations to complete the estimation of the attention orientation. Subsequent to this, the results of posture and attention orientation estimation are combined with the detected construction entity objects at the construction site to complete construction activity recognition based on the positional relationship between the detection boxes. Ten video clips are selected for testing, and a total of 745 instances of workers are detected, achieving an accuracy rate of 88.5%.
This study introduces new ideas and solutions for image-based construction activity recognition. The authors suggest that future research could incorporate more advanced technologies for construction activity recognition, such as 3D object detection, point cloud recognition, and large-scale model techniques. With further optimization and refinement of these methods, computer vision techniques have the potential to be applied to larger-scale construction activity recognition scenarios, thereby improving on-site management efficiency.

Author Contributions

Conceptualization, J.L. and X.Z.; methodology, J.L.; software, L.K.; validation, J.L., L.Z. and Z.Z.; formal analysis, J.L.; investigation, J.L. and L.K.; resources, L.K. and X.Z.; data curation, L.Z. and Z.Z.; writing—original draft preparation, J.L.; writing—review and editing, J.L. and X.Z.; visualization, J.L.; supervision, Z.Z.; project administration, L.Z.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Outstanding Young Scientist Program of the University of Science and Technology Liaoning (Grant Number 2023YQ03), the Basic Research Program for Universities of the Educational Department of Liaoning Province (Grant Number JYTQN2023241), and the University of Science and Technology Liaoning Talent Project Grants.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

Author Lingjie Kong was employed by the company Northeast Branch China Construction Eighth Engineering Division Corp., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Zhou, H.; Zhao, Y.; Shen, Q.; Yang, L.; Cai, H. Risk Assessment and Management via Multi-Source Information Fusion for Undersea Tunnel Construction. Autom. Constr. 2020, 111, 103050. [Google Scholar] [CrossRef]
  2. Tong, R.; Zhao, H.; Zhang, N.; Li, H.; Wang, X.; Yang, H. Modified Accident Causation Model for Highway Construction Accidents (ACM-HC). Eng. Constr. Archit. Manag. 2020, 28, 2592–2609. [Google Scholar] [CrossRef]
  3. Zhong, B.; Wu, H.; Ding, L.; Love, P.E.D.; Li, H.; Luo, H.; Jiao, L. Mapping Computer Vision Research in Construction: Developments, Knowledge Gaps and Implications for Research. Autom. Constr. 2019, 107, 102919. [Google Scholar] [CrossRef]
  4. Dawood, T.; Zhu, Z.; Zayed, T. Computer Vision–Based Model for Moisture Marks Detection and Recognition in Subway Networks. J. Comput. Civ. Eng. 2018, 32, 04017079. [Google Scholar] [CrossRef]
  5. Xu, L.; Xu, E.; Li, L. Industry 4.0: State of the Art and Future Trends. Int. J. Prod. Res. 2018, 56, 2941–2962. [Google Scholar] [CrossRef]
  6. Li, J.; Miao, Q.; Zou, Z.; Gao, H.; Zhang, L.; Li, Z.; Wang, N. A Review of Computer Vision-Based Monitoring Approaches for Construction Workers’ Work-Related Behaviors. IEEE Access 2024, 12, 7134–7155. [Google Scholar] [CrossRef]
  7. Ryu, J.; Seo, J.; Jebelli, H.; Lee, S. Automated Action Recognition Using an Accelerometer-Embedded Wristband-Type Activity Tracker. J. Constr. Eng. Manag. 2018, 145, 04018114. [Google Scholar] [CrossRef]
  8. Akhavian, R.; Behzadan, A. Wearable Sensor-Based Activity Recognition for Data-Driven Simulation of Construction Workers’ Activities. In Proceedings of the 2015 Winter Simulation Conference (WSC), Huntington Beach, CA, USA, 6–9 December 2015; pp. 3333–3344. [Google Scholar]
  9. Tao, W.; Lai, Z.-H.; Leu, M.C.; Yin, Z. Worker Activity Recognition in Smart Manufacturing Using IMU and sEMG Signals with Convolutional Neural Networks. Procedia Manuf. 2018, 26, 1159–1166. [Google Scholar] [CrossRef]
  10. Zhang, M.; Chen, S.; Zhao, X.; Yang, Z. Research on Construction Workers’ Activity Recognition Based on Smartphone. Sensors 2018, 18, 2667. [Google Scholar] [CrossRef]
  11. Esteva, A.; Kuprel, B.; Novoa, R.; Ko, J.; Swetter, S.; Blau, H.; Thrun, S. Dermatologist-Level Classification of Skin Cancer with Deep Neural Networks. Nature 2017, 542, 115–118. [Google Scholar] [CrossRef]
  12. Setio, A.; Ciompi, F.; Litjens, G.; Gerke, P.; Jacobs, C.; Riel, S.; Wille, M.; Naqibullah, M.; Sanchez, C.; Ginneken, B. Pulmonary Nodule Detection in CT Images: False Positive Reduction Using Multi-View Convolutional Networks. IEEE Trans. Med. Imaging 2016, 35, 1160–1169. [Google Scholar] [CrossRef] [PubMed]
  13. Liu, C.; Cao, Y.; Luo, Y.; Chen, G.; Vokkarane, V.; Ma, Y. DeepFood: Deep Learning-Based Food Image Recognition for Computer-Aided Dietary Assessment; Springer: Berlin/Heidelberg, Germany, 2016; p. 48. ISBN 978-3-319-39600-2. [Google Scholar]
  14. Liu, C.; Cao, Y.; Luo, Y.; Chen, G.; Vokkarane, V.; Ma, Y.; Chen, S.; Hou, P. A New Deep Learning-Based Food Recognition System for Dietary Assessment on An Edge Computing Service Infrastructure. IEEE Trans. Serv. Comput. 2018, 11, 249–261. [Google Scholar] [CrossRef]
  15. Ammour, N.; Alhichri, H.; Bazi, Y.; Benjdira, B.; Alajlan, N.; Zuair, M. Deep Learning Approach for Car Detection in UAV Imagery. Remote Sens. 2017, 9, 312. [Google Scholar] [CrossRef]
  16. Al-Qizwini, M.; Barjasteh, I.; Al-Qassab, H.; Radha, H. Deep learning algorithm for autonomous driving using GoogLeNet. In Proceedings of the 2017 IEEE Intelligent Vehicles Symposium (IV), Los Angeles, CA, USA, 11–14 June 2017; IEEE: New York, NY, USA, 2017; pp. 89–96. [Google Scholar] [CrossRef]
  17. Zhang, Y.; Yuen, K.-V. Crack Detection Using Fusion Features-Based Broad Learning System and Image Processing. Comput.-Aided Civ. Infrastruct. Eng. 2021, 36, 1568–1584. [Google Scholar] [CrossRef]
  18. Zhang, Y.; Dang, D.-Z.; Wang, Y.-W.; Ni, Y.-Q. Damage Identification for Railway Tracks Using Ultrasound Guided Wave and Hybrid Probabilistic Deep Learning. Constr. Build. Mater. 2024, 418, 135466. [Google Scholar] [CrossRef]
  19. Zhang, Y.; Ni, Y.-Q.; Jia, X.; Wang, Y.-W. Identification of Concrete Surface Damage Based on Probabilistic Deep Learning of Images. Autom. Constr. 2023, 156, 105141. [Google Scholar] [CrossRef]
  20. Zhao, X.; Zhang, Y.; Wang, N. Bolt Loosening Angle Detection Technology Using Deep Learning. Struct. Control Health Monit. 2018, 26, e2292. [Google Scholar] [CrossRef]
  21. Zheng, Z.; Zhao, X.; Zhao, P.; Qi, F.; Wang, N. CNN-Based Statistics and Location Estimation of Missing Components in Routine Inspection of Historic Buildings. J. Cult. Herit. 2019, 38, 221–230. [Google Scholar] [CrossRef]
  22. Zhang, Y.; Yuen, K.-V. Bolt Damage Identification Based on Orientation-Aware Center Point Estimation Network. Struct. Health Monit. 2021, 21, 147592172110042. [Google Scholar] [CrossRef]
  23. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; IEEE: New York, NY, USA, 2014; pp. 580–587. [Google Scholar] [CrossRef]
  24. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; IEEE: New York, NY, USA, 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
  25. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  26. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651. [Google Scholar] [CrossRef] [PubMed]
  27. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
  28. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:abs/1704.04861. [Google Scholar]
  29. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; IEEE: New York, NY, USA, 2018; pp. 4510–4520. [Google Scholar] [CrossRef]
  30. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 779–788. [Google Scholar] [CrossRef]
  31. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 6517–6525. [Google Scholar] [CrossRef]
  32. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:abs/1804.02767. [Google Scholar]
  33. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:abs/2004.10934. [Google Scholar]
  34. GitHub—Ultralytics/Yolov5: YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite. Available online: https://github.com/ultralytics/yolov5 (accessed on 21 April 2023).
  35. Fang, W.; Ding, L.; Zhong, B.; Love, P.E.D.; Luo, H. Automated Detection of Workers and Heavy Equipment on Construction Sites: A Convolutional Neural Network Approach. Adv. Eng. Inform. 2018, 37, 139–149. [Google Scholar] [CrossRef]
  36. Lee, K.; Jeon, C.; Shin, D.H. Small Tool Image Database and Object Detection Approach for Indoor Construction Site Safety. KSCE J. Civ. Eng. 2023, 27, 930–939. [Google Scholar] [CrossRef]
  37. Kim, H.-S.; Seong, J.; Jung, H.-J. Optimal Domain Adaptive Object Detection with Self-Training and Adversarial-Based Approach for Construction Site Monitoring. Autom. Constr. 2024, 158, 105244. [Google Scholar] [CrossRef]
  38. Mei, X.; Zhou, X.; Xu, F.; Zhang, Z. Human Intrusion Detection in Static Hazardous Areas at Construction Sites: Deep Learning–Based Method. J. Constr. Eng. Manag. 2023, 149, 04022142. [Google Scholar] [CrossRef]
  39. Liu, X.; Xu, F.; Zhang, Z.; Sun, K. Fall-Portent Detection for Construction Sites Based on Computer Vision and Machine Learning. Eng. Constr. Archit. Manag. 2023. ahead-of-print. [Google Scholar] [CrossRef]
  40. Fan, C.; Mei, Q.; Li, X. 3D Pose Estimation Dataset and Deep Learning-Based Ergonomic Risk Assessment in Construction. Autom. Constr. 2024, 164, 105452. [Google Scholar] [CrossRef]
  41. Liu, Y.; Ojha, A.; Jebelli, H. Vision-Based Ergonomic Risk Assessment of Back-Support Exoskeleton for Construction Workers in Material Handling Tasks. Comput. Civ. Eng. 2024, 331–339. [Google Scholar] [CrossRef]
  42. Halder, S.; Alimoradi, S.; Afsari, K.; Dickerson, D.E. A Computer Vision Approach to Assessing Work-Related Musculoskeletal Disorder (WMSD) Risk in Construction Workers. In Proceedings of the Construction Research Congress 2024, St. Louis, MO, USA, 8–10 March 2024; ASCE: Reston, VA, USA, 2024; pp. 678–687. [Google Scholar] [CrossRef]
  43. Cai, J.; Yang, L.; Zhang, Y.; Li, S.; Cai, H. Multitask Learning Method for Detecting the Visual Focus of Attention of Construction Workers. J. Constr. Eng. Manag. 2021, 147, 04021063. [Google Scholar] [CrossRef]
44. Yadav, A.; Vishwakarma, D. Deep Learning Algorithms for Person Re-Identification: State-of-the-Art and Research Challenges. Multimed. Tools Appl. 2023, 83, 22005–22054. [Google Scholar] [CrossRef]
  45. Cheng, J.P.; Wong, P.K.-Y.; Luo, H.; Wang, M.; Leung, P. Vision-Based Monitoring of Site Safety Compliance Based on Worker Re-Identification and Personal Protective Equipment Classification. Autom. Constr. 2022, 139, 104312. [Google Scholar] [CrossRef]
  46. Yang, J.; Shi, Z.; Wu, Z. Vision-Based Action Recognition of Construction Workers Using Dense Trajectories. Adv. Eng. Inform. 2016, 30, 327–336. [Google Scholar] [CrossRef]
  47. Liu, M.; Hong, D.; Han, S.; Lee, S. Silhouette-Based On-Site Human Action Recognition in Single-View Video. In Construction Research Congress 2016; ASCE: New York, NY, USA, 2016; p. 959. [Google Scholar]
  48. Yang, J. Enhancing Action Recognition of Construction Workers Using Data-Driven Scene Parsing. J. Civ. Eng. Manag. 2018, 24, 568–580. [Google Scholar] [CrossRef]
  49. Luo, X.; Li, H.; Cao, D.; Dai, F.; Seo, J.; Lee, S. Recognizing Diverse Construction Activities in Site Images via Relevance Networks of Construction Related Objects Detected by Convolutional Neural Networks. J. Comput. Civ. Eng. 2017, 32, 04018012. [Google Scholar] [CrossRef]
  50. Fang, Q.; Li, H.; Luo, X.; Ding, L.; Rose, T.; An, W.; Yu, Y. A Deep Learning-Based Method for Detecting Non-Certified Work on Construction Sites. Adv. Eng. Inform. 2018, 35, 56–68. [Google Scholar] [CrossRef]
  51. Li, J.; Zhao, X.; Zhou, G.; Zhang, M.; Li, D.; Zhou, Y. Evaluating the Work Productivity of Assembling Reinforcement through the Objects Detected by Deep Learning. Sensors 2021, 21, 5598. [Google Scholar] [CrossRef]
  52. Luo, X.; Li, H.; Wang, H.; Wu, Z.; Dai, F.; Cao, D. Vision-Based Detection and Visualization of Dynamic Workspaces. Autom. Constr. 2019, 104, 1–13. [Google Scholar] [CrossRef]
  53. Luo, X.; Li, H.; Yu, Y.; Zhou, C.; Cao, D. Combining Deep Features and Activity Context to Improve Recognition of Activities of Workers in Groups. Comput.-Aided Civ. Infrastruct. Eng. 2020, 35, 965–978. [Google Scholar] [CrossRef]
  54. Li, J.; Zhou, G.; Li, D.; Zhang, M.; Zhao, X. Recognizing Workers’ Construction Activities on a Reinforcement Processing Area through the Position Relationship of Objects Detected by Faster R-CNN. Eng. Constr. Archit. Manag. 2022. ahead-of-print. [Google Scholar] [CrossRef]
  55. Bhokare, S.; Goyal, L.; Ren, R.; Zhang, J. Smart Construction Scheduling Monitoring Using YOLOv3-Based Activity Detection and Classification. J. Inf. Technol. Constr. 2022, 27, 240–252. [Google Scholar] [CrossRef]
  56. Luo, X.; Li, H.; Cao, D.; Yu, Y.; Yang, X.; Huang, T. Towards Efficient and Objective Work Sampling: Recognizing Workers’ Activities in Site Surveillance Videos with Two-Stream Convolutional Networks. Autom. Constr. 2018, 94, 360–370. [Google Scholar] [CrossRef]
  57. Luo, H.; Xiong, C.; Fang, W.; Love, P.E.D.; Zhang, B.; Ouyang, X. Convolutional Neural Networks: Computer Vision-Based Workforce Activity Assessment in Construction. Autom. Constr. 2018, 94, 282–289. [Google Scholar] [CrossRef]
  58. Roberts, D.; Torres-Calderon, W.; Tang, S.; Golparvar-Fard, M. Vision-Based Construction Worker Activity Analysis Informed by Body Posture. J. Comput. Civ. Eng. 2020, 34, 04020017. [Google Scholar] [CrossRef]
  59. Cai, J.; Zhang, Y.; Cai, H. Two-Step Long Short-Term Memory Method for Identifying Construction Activities through Positional and Attentional Cues. Autom. Constr. 2019, 106, 102886. [Google Scholar] [CrossRef]
  60. Li, P.; Wu, F.; Xue, S.; Guo, L. Study on the Interaction Behaviors Identification of Construction Workers Based on ST-GCN and YOLO. Sensors 2023, 23, 6318. [Google Scholar] [CrossRef]
  61. Torabi, G.; Hammad, A.; Bouguila, N. Two-Dimensional and Three-Dimensional CNN-Based Simultaneous Detection and Activity Classification of Construction Workers. J. Comput. Civ. Eng. 2022, 36, 04022009. [Google Scholar] [CrossRef]
  62. Li, Z.; Li, D. Action Recognition of Construction Workers under Occlusion. J. Build. Eng. 2021, 45, 103352. [Google Scholar] [CrossRef]
  63. Zhang, M.; Cao, Z.; Yang, Z.; Zhao, X. Utilizing Computer Vision and Fuzzy Inference to Evaluate Level of Collision Safety for Workers and Equipment in a Dynamic Environment. J. Constr. Eng. Manag. 2020, 146, 04020051. [Google Scholar] [CrossRef]
  64. Zhang, M.; Zhu, M.; Zhao, X. Recognition of High-Risk Scenarios in Building Construction Based on Image Semantics. J. Comput. Civ. Eng. 2020, 34, 04020019. [Google Scholar] [CrossRef]
  65. Baxter, R.H.; Leach, M.J.V.; Mukherjee, S.S.; Robertson, N.M. An Adaptive Motion Model for Person Tracking with Instantaneous Head-Pose Features. IEEE Signal Process. Lett. 2015, 22, 578–582. [Google Scholar] [CrossRef]
66. Liu, H.; Ma, L. Online Person Orientation Estimation Based on Classifier Update. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; IEEE: New York, NY, USA, 2015; pp. 1568–1572. [Google Scholar] [CrossRef]
  67. Raza, M.; Chen, Z.; Rehman, S.-U.; Wang, P.; Bao, P. Appearance Based Pedestrians’ Head Pose and Body Orientation Estimation Using Deep Learning. Neurocomputing 2018, 272, 647–659. [Google Scholar] [CrossRef]
  68. Zheng, Z.; Zheng, L.; Yang, Y. Unlabeled Samples Generated by GAN Improve the Person Re-Identification Baseline in Vitro. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 3774–3782. [Google Scholar]
  69. Leng, Q.; Ye, M.; Tian, Q. A Survey of Open-World Person Re-Identification. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 1092–1108. [Google Scholar] [CrossRef]
70. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010. [Google Scholar]
  71. Raghu, M.; Unterthiner, T.; Kornblith, S.; Zhang, C.; Dosovitskiy, A. Do Vision Transformers See Like Convolutional Neural Networks? In Proceedings of the Neural Information Processing Systems (NeurIPS), Virtual, 6–14 December 2021. [Google Scholar]
  72. He, S.; Luo, H.; Wang, P.; Wang, F.; Li, H.; Jiang, W. TransReID: Transformer-based Object Re-Identification. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; IEEE: New York, NY, USA, 2021; pp. 14993–15002. [Google Scholar] [CrossRef]
Figure 1. Technical framework of this paper.
Figure 2. Detail of the construction project: (a) front elevation diagram of slipform construction; (b) camera layout details.
Figure 3. Details of the dataset for construction entities.
Figure 4. Representation of an individual’s posture and orientation.
Figure 5. Samples of the Worker ReID dataset.
Figure 6. Network structure of YOLO v5.
Figure 7. Flowchart of posture and orientation estimation.
Figure 8. Attention orientation estimation method.
Figure 11. Detected boxes when the individual’s attention direction is north: (a) Construction entity objects are located on the same altitude plane as the workers; (b) Construction entity objects are located on a different altitude plane from the workers.
Figure 12. No activities can be recognized, although the two detected boxes intersect.
Figure 13. Detected boxes when the individual’s attention orientation is south: (a) The width of the worker’s detection box is less than that of the construction entity; (b) The width of the worker’s detection box is greater than that of the construction entity.
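Figures 11–13 illustrate the geometric cues used when deciding whether a worker is interacting with a detected construction entity. The following Python fragment is a minimal illustrative sketch of this kind of rule, not the paper’s exact rule set: the box layout, the mapping of compass directions to image axes, and all function names are assumptions made only for this example. It combines detection-box overlap with a coarse attention-direction test.

```python
# Minimal illustrative sketch (NOT the exact rule set of this paper) of combining
# detection-box overlap with a worker's estimated attention direction to decide
# whether the worker is plausibly interacting with a construction entity.
# Assumptions: boxes are (x_min, y_min, x_max, y_max) in pixels, and the mapping of
# compass directions to image axes (east = +x, south = +y) is chosen only for this example.

def boxes_intersect(a, b):
    """True if two axis-aligned boxes overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def box_center(box):
    """Center point (x, y) of a box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def entity_in_attention_direction(worker_box, entity_box, attention):
    """Coarse test: does the entity center lie on the side the worker is facing?"""
    wx, wy = box_center(worker_box)
    ex, ey = box_center(entity_box)
    if attention == "east":
        return ex >= wx
    if attention == "west":
        return ex <= wx
    if attention == "north":  # assumed to map to decreasing image y
        return ey <= wy
    if attention == "south":  # assumed to map to increasing image y
        return ey >= wy
    return False

def plausible_interaction(worker_box, entity_box, attention):
    """Boxes must overlap AND the entity must lie in the worker's attention direction."""
    return (boxes_intersect(worker_box, entity_box)
            and entity_in_attention_direction(worker_box, entity_box, attention))

# Example: a worker facing east whose box overlaps an entity box lying to the east.
print(plausible_interaction((100, 200, 180, 400), (160, 250, 300, 380), "east"))  # True
```

In the method itself, additional conditions, such as the relative box widths in Figure 13 and whether the worker and entity lie on the same altitude plane in Figure 11, refine this basic check before an activity label is assigned.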
Figure 14. Detection results of construction objects based on YOLO v5.
Figure 15. Detection results of workers’ orientation and pose based on YOLO v5.
Figure 16. Experimental test results of V02 through YOLO v5 models: (a) moment at the 12th second; (b) moment at the 15th second; (c) moment at the 18th second.
Figure 17. Experimental test results of V06 through YOLO v5 models: (a) moment at the 15th second; (b) moment at the 18th second.
Figure 18. Experimental test results of V08 through YOLO v5 models: (a) moment at the 87th second; (b) moment at the 90th second.
Figure 19. Error cases when recognizing workers’ construction activities: (a) error case 1; (b) error case 2; (c) error case 3.
Table 1. Number of images and labels in the construction entity dataset.

Category | Number of Images Containing This Category | Number of Rectangle Boxes in This Dataset for This Category
WSK | 346 | 463
TNK | 442 | 1804
FMW | 376 | 1189
STU | 456 | 586
LRB | 215 | 362
SSP | 145 | 260
FPT | 331 | 332
BRC | 56 | 56
CLM | 337 | 590
BEAM | 160 | 160
Table 2. Number of images and labels in the individuals’ posture and direction dataset.

Category | Number of Images Containing This Category | Number of Rectangle Boxes in This Dataset for This Category
wkr_st_e | 858 | 1132
wkr_st_n | 1277 | 1675
wkr_st_w | 782 | 970
wkr_st_s | 858 | 1228
wkr_b_e | 170 | 176
wkr_b_n | 96 | 97
wkr_b_w | 143 | 147
wkr_b_s | 63 | 66
wkr_s_e | 173 | 195
wkr_s_n | 153 | 178
wkr_s_w | 177 | 192
wkr_s_s | 145 | 154
head_e | 1032 | 1404
head_n | 1452 | 2036
head_w | 984 | 1275
head_s | 967 | 1364
Total | 2546 | 12,289
Table 3. Construction activities related to construction entity objects.

Construction Entity | Construction Activities
WSK | Lifting, picking, measuring, processing, organizing, etc.
TNK | Lifting, connecting, maintaining, etc.
FMW | Lifting, picking, measuring, processing, organizing, etc.
STU | Lifting, picking, measuring, processing, organizing, etc.
LRB | Lifting, picking, measuring, processing, organizing, etc.
SSP | Lifting, picking, measuring, processing, organizing, etc.
FPT | Lifting, processing formwork, etc.
BRC | Welding, assembling, etc.
CLM | Installing rebars, installing steel columns, etc.
BEAM | Installing rebars, etc.
Table 4. AP and mAP of the construction entity detection model.

Category | Average Precision
WSK | 0.949
TNK | 0.973
FMW | 0.978
STU | 0.984
LRB | 0.961
SSP | 0.88
FPT | 0.902
BRC | 0.995
CLM | 0.996
BEAM | 0.995
mAP | 0.961
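Assuming that mAP here denotes the unweighted mean of the ten per-category average precisions (the usual convention at a single IoU threshold), the entries of Table 4 are mutually consistent:

mAP = (0.949 + 0.973 + 0.978 + 0.984 + 0.961 + 0.88 + 0.902 + 0.995 + 0.996 + 0.995) / 10 ≈ 0.961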
Table 5. AP and mAP of the YOLO v5 model for estimating workers’ pose and orientation.

Category | Average Precision
wkr_st_e | 0.901
wkr_st_n | 0.948
wkr_st_w | 0.922
wkr_st_s | 0.945
wkr_b_e | 0.913
wkr_b_n | 0.948
wkr_b_w | 0.854
wkr_b_s | 0.76
wkr_s_e | 0.895
wkr_s_n | 0.974
wkr_s_w | 0.941
wkr_s_s | 0.845
head_e | 0.856
head_n | 0.916
head_w | 0.881
head_s | 0.866
mAP | 0.898
Table 6. Test results (MRC—moments when the recognition result is correct; MRW—moments when the recognition result is wrong).

Video | Worker ID | Recognized Construction Activities | MRC | MRW
V01 | 1 | Wooden stick-related activity, lifting wooden sticks, no construction activity | 31 | 7
V01 | 2 | Walking, formwork-related activity, wooden stick-related activity, lifting formwork, no construction activity | 29 | 4
V02 | 1 | Scaffold-related activity, lifting scaffold steel pipes, walking, no construction activity | 20 | 1
V02 | 2 | Scaffold-related activity, lifting scaffold steel pipes, walking, no construction activity | 16 | 2
V02 | 3 | Wooden stick-related activity, scaffold-related activity, lifting scaffold steel pipes | 11 | 4
V03 | 1 | Formwork-related activity, rest | 30 | 10
V03 | 2 | Rest | 40 | 0
V03 | 3 | Rest | 40 | 0
V04 | 1 | Tank-related activity, walking, no construction activity | 26 | 3
V05 | 1 | No construction activity, walking, longitudinal rebar-related activity, formwork-related activity | 30 | 1
V05 | 2 | Walking | 7 | 0
V05 | 3 | Walking, lifting rebars | 6 | 1
V06 | 1 | Brace-related activity | 40 | 0
V06 | 2 | Brace-related activity, no construction activity | 39 | 1
V07 | 1 | No construction activity, lifting formworks | 18 | 6
V07 | 2 | Lifting wooden sticks, processing formworks, formwork-related activity, no construction activity | 16 | 3
V08 | 1 | Column-related activity, no construction activity | 36 | 4
V08 | 2 | Column-related activity | 40 | 0
V08 | 3 | Column-related activity | 30 | 10
V09 | 1 | No construction activity, beam-related activity, longitudinal rebar-related activity | 28 | 12
V09 | 2 | Beam-related activity, no construction activity | 36 | 1
V09 | 3 | Beam-related activity, no construction activity | 18 | 9
V10 | 1 | Column-related activity, no construction activity | 36 | 4
V10 | 2 | Column-related activity, no construction activity | 36 | 3
Total |  |  | 659 | 86
Accuracy |  |  | 0.885 |
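Assuming that accuracy is computed as the ratio of correctly recognized moments to all evaluated moments, the totals in Table 6 reproduce the reported value:

Accuracy = MRC_total / (MRC_total + MRW_total) = 659 / (659 + 86) = 659 / 745 ≈ 0.885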
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
