Article

Helmet-Wearing Tracking Detection Based on StrongSORT

School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou 510006, China
* Author to whom correspondence should be addressed.
Sensors 2023, 23(3), 1682; https://doi.org/10.3390/s23031682
Submission received: 24 December 2022 / Revised: 23 January 2023 / Accepted: 27 January 2023 / Published: 3 February 2023
(This article belongs to the Section Industrial Sensors)

Abstract

Object detection based on deep learning is one of the most important and fundamental tasks in computer vision, and high-performance detection algorithms have been widely applied in many practical fields. For managing whether workers wear safety helmets in construction scenarios, this paper proposes a framework that combines the YOLOv5 detection algorithm with a multi-object tracking algorithm to monitor and track helmet wearing in real-time video. StrongSORT, an improved version of DeepSORT, is selected as the tracker to reduce the loss of tracked objects caused by occlusion, blurred trajectories, and changes in motion scale. The safety helmet dataset is trained with YOLOv5s, and the best training result is used as the weight model in the StrongSORT tracking algorithm. The experimental results show that the mAP@0.5 of all classes in the YOLOv5s model reaches 95.1% on the validation dataset, the mAP@0.5:0.95 is 62.1%, and the precision for the helmet-wearing class is 95.7%. After the bounding box regression loss was changed from CIOU to Focal-EIOU, the mAP@0.5 increased to 95.4%, the mAP@0.5:0.95 increased to 62.9%, and the precision for the helmet-wearing class increased to 96.5%, improvements of 0.3%, 0.8% and 0.8%, respectively. StrongSORT can update object trajectories in video frames at a speed of 0.05 s per frame. The improved YOLOv5s combined with the StrongSORT tracking algorithm thus achieves good performance for helmet-wearing tracking detection.

1. Introduction and Related Work

The safety of construction sites has always been a prominent issue. Wearing a helmet can greatly reduce the impact of heavy blows to the head during construction and is a basic guarantee of workers' personal safety. Accidents caused by the absence of safety helmets during construction are still on the rise. Therefore, real-time monitoring of construction sites to ensure that workers wear safety helmets is an effective way to reduce the occurrence of safety accidents.
The task of monitoring in real time whether workers are wearing helmets consists of object detection and target tracking.
Object detection has long been a popular research direction in computer vision. In short, its main task is to classify and locate objects in specific scenes, and many excellent detection algorithms have been applied in various fields. Traditional object detection [1,2] relies on hand-crafted feature extractors. Compared with current deep-learning-based algorithms, these traditional methods are slower and less precise, their generalization on test datasets is limited, and their performance in practical projects does not meet the required standards. As research on CNNs (convolutional neural networks) [3] has advanced, object detection methods can be divided into two categories: two-stage and one-stage.
Two-stage algorithms are based on region proposals: region proposals are first extracted, and a classification and regression task is then performed on them. Girshick, R. et al. [4] first proposed R-CNN (Region CNN), which used selective search [5] to extract region proposals and the AlexNet [6] network as the backbone of the detector. R-CNN greatly advanced research in the field of object detection. Later, Girshick, R. and Ren, S. et al. [7,8] proposed Fast R-CNN and Faster R-CNN, respectively, which greatly improved detection precision. In [8], the RPN (Region Proposal Network) was proposed to generate region proposals instead of selective search, together with the idea of generating region proposals based on anchor boxes.
One-stage detection algorithms use end-to-end training to directly classify and locate objects after feature extraction, such as the YOLO series [9,10,11,12,13], FCOS [14] and RetinaNet [15]. YOLOv1 [10] simplified the object detection process into an end-to-end regression problem, directly predicting bounding boxes and completing the classification and localization of objects. In 2016, the SSD [16] detection algorithm was proposed, which uses VGG16 [17] as the network structure and multi-scale feature maps to detect objects, improving the detection of small objects. In 2017, YOLOv2 [11] replaced GoogLeNet [18], the original backbone of YOLOv1, with DarkNet-19. In 2018, YOLOv3 [12], which is widely used in industry, replaced the backbone with DarkNet-53 [19] and added FPN (Feature Pyramid Networks) [20] to achieve multi-scale training. In 2020, YOLOv4 [9] was proposed on the basis of the YOLO series: CSPDarknet-53 was introduced as the feature extraction network, SPP (spatial pyramid pooling) [21] was used to improve feature extraction, the Mish [22] activation function was adopted, CIOU [23] was used to compute the bounding box loss, Mosaic was used for data augmentation, and the PANet [24] network was used to fuse features of different scales so as to further combine feature information from different layers. Also in 2020, Ultralytics [13] open-sourced YOLOv5.
Target tracking follows a target by analyzing its trajectory characteristics [25]. Bewley et al. proposed the SORT algorithm [26] in 2016, using the two-stage detector Faster R-CNN as the detection component. In 2017, Wojke et al. improved SORT and proposed the DeepSORT [27] tracking algorithm, which combines the motion and appearance information of the target for tracking. In 2022, the StrongSORT [28] tracking algorithm was proposed, introducing two lightweight plug-and-play algorithms, the AFLink model and Gaussian-smoothed interpolation; the feature gallery and cascade matching in DeepSORT were replaced by the feature extractor of [29] and a vanilla global linear matching strategy.
The helmet-wearing tracking method based on YOLOv5 and DeepSORT proposed in [30] treats not wearing a helmet as the tracking task. However, when the scene is complex, targets without helmets are difficult to detect and their trajectories are easily lost, so here we instead track the person wearing a helmet. In [31], a tracking framework combining YOLOv3 with DeepSORT was proposed that improves the YOLOv3 network architecture, designs a deeper network, and introduces a new tracker to reduce ID switching. However, that network is computationally complex and not well suited to helmet detection scenarios, where scenes are complex and frequently switched and motion trajectories are complicated, blurred, and change scale frequently. In [32], YOLOv3-Tiny and YOLOv4 combined with DeepSORT, respectively, were proposed for vehicle and pedestrian tracking. However, the YOLOv3-Tiny detector does not perform as well as YOLOv5, the YOLOv4 model is too large, and the IDs of the DeepSORT tracker are more easily lost when the scene is complex.
The difficulties in monitoring helmet wearing lie in the complex trajectories of targets, target occlusion, and changes in target pose. These factors lead to incorrect or missed detections, or to ID loss and ID switching during tracking, which affect the real-time monitoring results. Considering these problems, this paper takes construction personnel wearing helmets as the monitoring object and processes more than 8000 images as the training dataset. YOLOv5 is selected as the detector, and the well-performing StrongSORT [28] tracking algorithm is used for target tracking, so as to meet the monitoring requirements for safety helmet wearing in construction settings.
This paper uses tracking-by-detection, combining a strong detector with a strong tracker, to track and monitor whether helmets are being worn in real-time scenarios.

2. YOLOv5

In a tracking-by-detection task, the most important step is to select an appropriate detector: the trained detector model directly affects the quality of target trajectory tracking, and its detection speed and precision directly affect real-time tracking. YOLOv5 provides four network models, named YOLOv5s, YOLOv5m, YOLOv5l and YOLOv5x. In this paper, the YOLOv5s model is selected as the detector in the tracking task.

2.1. YOLOv5s

The YOLOv5s network structure mainly consists of the Input, Backbone, Neck, and Prediction head, as shown in Figure 1. In the input part, the input size is set to 640 × 640 × 3 and the input images are augmented with the Mosaic method, whose main idea is to randomly crop four images from the dataset and splice them into one training image, thereby enriching the dataset. Adaptive anchor box calculation is introduced in the input stage to select the optimal initial anchor box sizes for different datasets, and adaptive image scaling is applied to reduce computation time and improve detection performance. In the Backbone, the BottleneckCSP structure separates features by splitting channels and concatenates them for the prediction output, which reduces repeated computation of feature information and enhances the ability of the CNN to learn more feature information. The Neck uses the FPN and PANet structures. In the head, convolutions over three feature layers of different scales from the Neck produce the final prediction output.
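To make the Mosaic idea concrete, the following minimal Python sketch splices four randomly chosen images into one training image; the function name mosaic4 and the gray padding value are illustrative, and the box-label handling and random center jitter of the actual Ultralytics implementation are omitted.

import random
import numpy as np
import cv2

def mosaic4(image_paths, out_size=640):
    """Build one mosaic training image from four randomly selected images."""
    s = out_size
    canvas = np.full((2 * s, 2 * s, 3), 114, dtype=np.uint8)  # gray background
    picks = random.sample(image_paths, 4)
    corners = [(0, 0), (0, s), (s, 0), (s, s)]                # top-left corner of each quadrant
    for (y0, x0), path in zip(corners, picks):
        img = cv2.imread(path)
        img = cv2.resize(img, (s, s))
        canvas[y0:y0 + s, x0:x0 + s] = img                    # paste into its quadrant
    # In practice the 2s x 2s mosaic is scaled back to the network input size.
    return cv2.resize(canvas, (s, s))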

2.2. Bounding Box Regression Loss

The traditional IOU (intersection over union) [33] measures the overlap between the predicted box and the ground truth box, that is, the ratio of their intersection to their union. In this kind of bounding box regression, when the predicted box and the ground truth box do not intersect, the IOU loss stays at a constant value, the regression loss of the predicted bounding box cannot be measured, and convergence is very slow. Rezatofighi, H. et al. [34] therefore proposed the generalized intersection over union (GIOU) to address this problem, but when one of the two boxes contains the other, GIOU degenerates into the traditional IOU loss and the same problems arise. On the basis of GIOU, and taking into account the non-overlapping area and the distance between the center points of the two boxes, Zheng, Z. et al. [23] proposed the DIOU (distance-IOU) loss, which handles the case where the predicted box and the ground truth box contain each other horizontally and vertically. However, when the center points of the predicted box and the ground truth box coincide, DIOU degenerates into the traditional IOU form. The complete-IOU (CIOU) loss [23] was therefore proposed, adding an aspect-ratio penalty to deal with the case of coincident center points. CIOU thus accounts for the overlap loss and the center-point offset, combined with the width-height ratio. In the YOLOv5s detection algorithm, CIOU is used for bounding box regression, and the loss function is given in Equation (1).
L_{CIOU} = 1 - IOU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v        (1)
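For reference, a minimal, unvectorized sketch of Equation (1) might look as follows; the actual YOLOv5 implementation is batched in PyTorch and differs in detail.

import numpy as np

def ciou_loss(box, box_gt, eps=1e-7):
    """CIOU loss for boxes given as (cx, cy, w, h)."""
    (x, y, w, h), (xg, yg, wg, hg) = box, box_gt
    # Intersection over union
    x1, y1, x2, y2 = x - w / 2, y - h / 2, x + w / 2, y + h / 2
    xg1, yg1, xg2, yg2 = xg - wg / 2, yg - hg / 2, xg + wg / 2, yg + hg / 2
    iw = max(0.0, min(x2, xg2) - max(x1, xg1))
    ih = max(0.0, min(y2, yg2) - max(y1, yg1))
    inter = iw * ih
    iou = inter / (w * h + wg * hg - inter + eps)
    # Squared center distance over squared diagonal of the smallest enclosing box
    rho2 = (x - xg) ** 2 + (y - yg) ** 2
    cw, ch = max(x2, xg2) - min(x1, xg1), max(y2, yg2) - min(y1, yg1)
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term
    v = (4 / np.pi ** 2) * (np.arctan(wg / (hg + eps)) - np.arctan(w / (h + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v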

Focal-EIOU

CIOU considers various problems in the regression of the predicted box and accelerates its convergence; however, because it only constrains the aspect ratio, the width and height of the predicted box cannot always both converge at the same time. The EIOU [35] loss addresses this problem, and its calculation is shown in Equation (2).
L_{EIOU} = L_{IOU} + L_{dis} + L_{asp} = 1 - IOU + \frac{\rho^2(b, b^{gt})}{(w^c)^2 + (h^c)^2} + \frac{\rho^2(w, w^{gt})}{(w^c)^2} + \frac{\rho^2(h, h^{gt})}{(h^c)^2}        (2)
In bounding box regression, the training samples play an important role in the convergence process. Therefore, on the basis of the EIOU loss, this paper uses Focal-EIOU [35] to replace the CIOU regression loss in the original YOLOv5 and adds the factor γ so that training samples carrying more information contribute more to the regression; its calculation is shown in Equation (3).
L_{Focal\text{-}EIOU} = IOU^{\gamma} \, L_{EIOU}        (3)
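A corresponding sketch of Equations (2) and (3) is given below; the focusing exponent gamma is left as an illustrative default, and (w^c, h^c) denote the width and height of the smallest enclosing box of the two boxes.

import numpy as np

def focal_eiou_loss(box, box_gt, gamma=0.5, eps=1e-7):
    """Focal-EIOU loss for boxes given as (cx, cy, w, h)."""
    (x, y, w, h), (xg, yg, wg, hg) = box, box_gt
    x1, y1, x2, y2 = x - w / 2, y - h / 2, x + w / 2, y + h / 2
    xg1, yg1, xg2, yg2 = xg - wg / 2, yg - hg / 2, xg + wg / 2, yg + hg / 2
    iw = max(0.0, min(x2, xg2) - max(x1, xg1))
    ih = max(0.0, min(y2, yg2) - max(y1, yg1))
    inter = iw * ih
    iou = inter / (w * h + wg * hg - inter + eps)
    cw = max(x2, xg2) - min(x1, xg1)       # width of the smallest enclosing box
    ch = max(y2, yg2) - min(y1, yg1)       # height of the smallest enclosing box
    l_dis = ((x - xg) ** 2 + (y - yg) ** 2) / (cw ** 2 + ch ** 2 + eps)
    l_asp = (w - wg) ** 2 / (cw ** 2 + eps) + (h - hg) ** 2 / (ch ** 2 + eps)
    l_eiou = 1 - iou + l_dis + l_asp       # Equation (2)
    return (iou ** gamma) * l_eiou         # Equation (3): re-weight by IOU**gamma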
After replacing the CIOU bounding box regression loss with the Focal-EIOU loss and retraining the YOLOv5s model, the mAP@0.5 of all classes on the validation dataset increased from 95.1% to 95.4%, the mAP@0.5:0.95 increased from 62.1% to 62.9%, and the detection precision for the helmet-wearing class increased to 96.5%, improvements of 0.3%, 0.8% and 0.8%, respectively.

3. StrongSORT

The SORT [26] multi-object tracking algorithm combines past and current video frames, solving the prediction of object motion trajectories and the association of detections across frames with a Kalman Filter [36] and the Hungarian algorithm, respectively. However, when an object is occluded, the state predicted by the Kalman Filter for the next frame fails to match the detector's result, the trajectory of that object ends, and a large number of target IDs are switched. DeepSORT [27] adds a pretrained CNN to SORT to save the appearance features of the last 100 frames of each trajectory, alleviating the ID switching caused by occlusion. At the same time, DeepSORT introduces cascade matching and new-trajectory confirmation to improve the matching of predicted trajectories with objects in the current frame. Du, Y. et al. [28] proposed StrongSORT, which introduces two plug-and-play lightweight algorithms, AFLink and GSI (Gaussian-smoothed interpolation). The AFLink model links short trajectories into complete trajectories using a fully connected model without appearance information, while GSI compensates for missing detections by modeling nonlinear motion, achieving more accurate localization through Gaussian regression without discarding the motion information of the detected object.
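To illustrate the idea behind GSI, the following sketch smooths one track coordinate with a Gaussian-process regressor from scikit-learn and reads off gap-free positions; the kernel and its parameters are illustrative stand-ins, not the exact interpolation used in StrongSORT.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def gsi_smooth(frames, values, all_frames):
    """frames/values: observed frame indices and one box coordinate (e.g. cx);
    all_frames: the full frame range over which to interpolate."""
    kernel = RBF(length_scale=10.0) + WhiteKernel(noise_level=1.0)
    gpr = GaussianProcessRegressor(kernel=kernel)
    gpr.fit(np.asarray(frames).reshape(-1, 1), np.asarray(values))
    return gpr.predict(np.asarray(all_frames).reshape(-1, 1))

# Example: detections missing at frames 3-5 are filled in with smoothed values.
smoothed_cx = gsi_smooth([0, 1, 2, 6, 7], [100, 104, 109, 130, 136], range(8))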

3.1. Kalman Filter

In [27], the motion information of an object is described by an eight-dimensional state ( u , v , γ , h , x ˙ , y ˙ , γ ˙ , h ˙ ), consisting of the bounding box center coordinates ( u , v ), the aspect ratio γ, the height h, and the velocities of these variables in the image coordinate system. The Kalman Filter is used to predict and update the object trajectory in the next frame: the state X_{t-1} at time t-1 is propagated according to Equation (4) to obtain the state X at time t, where F is the state transition matrix. The state uncertainty is represented by the covariance matrix P, and the covariance P_t at the next time step t is given by Equation (5), completing the prediction of the trajectory state for the next frame.
X = F X_{t-1}        (4)
P_t = F P_{t-1} F^{T} + Q        (5)
After the trajectory prediction for the next frame is completed, the Hungarian algorithm matches the predicted trajectories with the objects detected in the current frame, and the Kalman Filter updates the trajectories that are successfully matched.
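A minimal sketch of this predict step is shown below; the constant-velocity transition matrix and the process noise values are illustrative, not the tuned values used in DeepSORT.

import numpy as np

dim = 4                                    # u, v, aspect ratio, height
F = np.eye(2 * dim)
F[:dim, dim:] = np.eye(dim)                # position += velocity each frame

def kf_predict(x, P, Q):
    x_pred = F @ x                         # Equation (4): X = F X_{t-1}
    P_pred = F @ P @ F.T + Q               # Equation (5): P_t = F P_{t-1} F^T + Q
    return x_pred, P_pred

x = np.zeros(2 * dim)                      # initial state
P = np.eye(2 * dim)                        # initial state covariance
Q = 0.01 * np.eye(2 * dim)                 # process noise (illustrative)
x, P = kf_predict(x, P, Q)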

3.2. Cascade Matching

In [26], the Kalman Filter updates the predicted trajectory for the next frame and matches it with the detected objects. In [27], the trajectories predicted by the Kalman Filter are divided into a confirmed state and an unconfirmed state: a newly created trajectory is initialized as unconfirmed and is converted to the confirmed state only after the Hungarian algorithm matches it with detections a certain number of times, and only confirmed trajectories are matched with the boxes produced by the detector. In cascade matching, first, the set of confirmed trajectories predicted by the Kalman Filter is denoted as T and the set of detection boxes as D, and their cost matrix C is calculated by Equation (6), which combines a motion term and an appearance term:
c(i, j) = \lambda \, d^{(1)}(i, j) + (1 - \lambda) \, d^{(2)}(i, j)        (6)
The Mahalanobis distance d^{(1)}(i, j) describes the agreement between the trajectory predicted by the Kalman Filter and the motion information of the current detection box, as shown in Equation (7), where d_j denotes the j-th detection box:
d^{(1)}(i, j) = (d_j - y_i)^{T} S_i^{-1} (d_j - y_i)        (7)
Here (y_i, S_i) denotes the projection of the i-th trajectory into the detection space. Second, matches whose distances exceed the set thresholds are removed, as in Equation (8):
b_{i,j} = \prod_{m=1}^{2} b_{i,j}^{(m)}        (8)
Finally, according to how recently each predicted box was updated, the more recently updated predicted boxes are matched first by the Hungarian algorithm.
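The matching step can be sketched as follows, assuming each track carries its projected measurement mean, covariance and appearance feature and each detection its measurement vector and feature; the dictionary fields, the weight lam and the gating thresholds are illustrative assumptions, and the Hungarian assignment is provided by SciPy's linear_sum_assignment.

import numpy as np
from scipy.optimize import linear_sum_assignment

def mahalanobis(d_j, y_i, S_i):
    diff = d_j - y_i
    return float(diff.T @ np.linalg.inv(S_i) @ diff)          # Equation (7)

def match(tracks, detections, lam=0.02, gate_motion=9.4877, gate_app=0.3):
    """tracks: dicts with 'mean' (4,), 'cov' (4,4), 'feat' (d,);
    detections: dicts with 'xyah' (4,), 'feat' (d,). Fields are illustrative."""
    cost = np.zeros((len(tracks), len(detections)))
    for i, trk in enumerate(tracks):
        for j, det in enumerate(detections):
            d1 = mahalanobis(det["xyah"], trk["mean"], trk["cov"])
            d2 = 1.0 - float(trk["feat"] @ det["feat"])        # cosine distance
            c = lam * d1 + (1 - lam) * d2                      # Equation (6)
            if d1 > gate_motion or d2 > gate_app:              # gate, Equation (8)
                c = 1e5                                        # forbid this pair
            cost[i, j] = c
    rows, cols = linear_sum_assignment(cost)                   # Hungarian algorithm
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < 1e5]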

3.3. AFLink

Over-reliance on the object's appearance features makes a tracker sensitive to noise, while pursuing high performance by associating global trajectory information leads to heavy computation and a large number of hyperparameters. AFLink instead predicts the association of two trajectories directly from temporal information. In the AFLink model, two trajectories T_i and T_j are used, where T_* = {f_k, x_k, y_k}_{k=1}^{N} consists of the frame indices f_k and position information of the last 30 frames; the model structure is shown in Figure 2. T_i and T_j are fed into a temporal module and a fusion module [28]. The temporal module extracts per-frame feature information, the fusion module fuses the extracted features of different dimensions, and a classifier then predicts the association between the two trajectories. In this process, T_i and T_j are processed independently by the temporal module and the fusion module.
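A structural sketch of such a model in PyTorch is given below; the layer sizes, the simple convolutional temporal module, and the concatenation-based fusion are illustrative and do not reproduce the exact configuration of [28].

import torch
import torch.nn as nn

class AFLinkSketch(nn.Module):
    def __init__(self, n_frames=30, hidden=64):
        super().__init__()
        # Temporal module: 1-D convolutions over the frame dimension, shared by both trajectories.
        self.temporal = nn.Sequential(
            nn.Conv1d(3, hidden, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # Fusion + classifier over the two pooled trajectory features.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),          # associated / not associated
        )

    def forward(self, traj_i, traj_j):     # each: (batch, n_frames, 3) = (f, x, y) per frame
        zi = self.temporal(traj_i.transpose(1, 2)).squeeze(-1)
        zj = self.temporal(traj_j.transpose(1, 2)).squeeze(-1)
        return self.classifier(torch.cat([zi, zj], dim=1))

scores = AFLinkSketch()(torch.randn(2, 30, 3), torch.randn(2, 30, 3))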

3.4. Appearance Information

For appearance features, DeepSORT uses a CNN pretrained on a pedestrian re-identification dataset to extract and save pedestrian features for the tracking task. When tracking with [27], the features of the 100 most recent frames of each track are saved in a feature gallery. When associating detections with tracks, the distance between the gallery R_i of the i-th track and the feature f_j of the j-th detection is the minimum cosine distance, as shown in Equation (9).
d^{(2)}(i, j) = \min \{ 1 - f_j^{T} f_k^{(i)} \mid f_k^{(i)} \in R_i \}        (9)
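Assuming unit-normalized embeddings, Equation (9) can be computed as in the following sketch.

import numpy as np

def min_cosine_distance(gallery_feats, det_feat):
    """Smallest cosine distance between a detection feature and a track's gallery."""
    gallery = np.asarray(gallery_feats)            # (k, d), unit-normalized rows
    det = np.asarray(det_feat)                     # (d,),   unit-normalized
    return float(np.min(1.0 - gallery @ det))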
In StrongSORT, the CNN feature extractor is replaced by the BoT network [29], which extracts richer feature information about the detected objects in each video frame. At the same time, the feature gallery saved by the CNN is replaced by a feature update strategy: the appearance state e_i^t of the i-th track in the t-th frame is updated as an exponential moving average, as Equation (10) shows.
e_i^{t} = \alpha e_i^{t-1} + (1 - \alpha) f_i^{t}        (10)
Here f_i^t is the appearance embedding of the detection matched to the current trajectory.
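A sketch of this update, with an illustrative momentum value, is:

import numpy as np

def ema_update(track_feat, det_feat, alpha=0.9):
    """Equation (10): single EMA embedding per track instead of a feature gallery."""
    feat = alpha * np.asarray(track_feat) + (1 - alpha) * np.asarray(det_feat)
    return feat / (np.linalg.norm(feat) + 1e-12)   # keep the embedding unit-length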

4. Experiments and Analysis

4.1. Construction of Dataset

The dataset used for detection and tracking was the Safety Helmet Wearing Dataset, which contains more than 8000 images of workers wearing and not wearing helmets in various construction scenarios, including scenes with overlapping targets, dark scenes, person occlusion and other conditions. Some negative samples without helmets were also added to the dataset to increase the difficulty of detection. Since the dataset samples are consistent with real-time monitoring of actual construction scenes, this dataset is selected for the detection and tracking task and is processed into three categories: the helmet-wearing targets to be detected, samples of persons not wearing a helmet, and heads as negative-sample interference. Figure 3a–d show positive and negative samples, night samples, person occlusion, and negative samples, respectively.
We recorded a short video at our school construction site as a real-time detection video to test our model.

4.2. Experimental Environment

The specific training environment of our experiment is shown in Table 1.
In this experiment, the mean Average Precision (mAP) and Frames Per Second (FPS) are used as the evaluation criteria for the YOLOv5s model. The speed at which video frames are processed and the frequency with which target IDs are switched are used as metrics for the StrongSORT tracking model.

4.3. Experimental Results and Analysis

4.3.1. The Results of YOLOv5s

In the experiment, the number of epochs is set to 300 and the batch size to 8. The results of training on the helmet dataset are shown in Figure 4a,b. As shown in Figure 4a, the training precision for wearing a helmet reaches 95.7%, the precision for not wearing a helmet reaches 94.4%, and the mAP@0.5 of all classes on the validation dataset reaches 95.1%. Figure 4b shows that there are no false or missed detections.
The loss curves during training are shown in Figure 5; the class loss on the validation dataset begins to converge after about 200 epochs and remains at 0.0015 until the 300 epochs of iterations are completed.
As can be seen in Figure 6, the classification loss on the training set can be reduced to 0.00018.
At the same time, we use the best YOLOv5s weights to detect the real-time recorded construction site video. Figure 7 shows missed detection and full detection in different video frames. In Figure 7a, a small number of occluded objects are not detected; in Figure 7b, all objects are detected. Detection on video frames reaches an average inference speed of 16.2 ms, and the average processing time per frame is 0.015 s.

4.3.2. Tracking Results of StrongSORT

After training on the dataset with the YOLOv5s model, the target persons wearing helmets can generally be detected. The trained detector is combined with StrongSORT to achieve real-time target tracking. As shown in Figure 8, the number in the upper left corner of each box is the unique identification (ID) number of the target person. Figure 8a–c show the tracking of the same target person in different video frames by StrongSORT; there is no target ID switching and no false detection during real-time monitoring and tracking. Taking the target person with ID 17 as an example, although the target is occluded for a long time across different video frames, the ID of the target person never switches, which demonstrates a good tracking result. In Figure 8c, the target without a helmet is not marked as a positive sample, and no false detection occurs.
Tracking with StrongSORT achieves a processing speed of 26.5 ms per frame. At the same time, the average detection time of YOLOv5s is 0.017 s per frame, and the average time for StrongSORT to update the trajectories in a video frame is 0.05 s.
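The overall per-frame detect-then-track loop can be sketched as follows. The YOLOv5 torch.hub interface shown is public, whereas StrongSortTracker is a placeholder for whichever StrongSORT implementation is plugged in, and the class index, weight path best.pt and video file name are assumptions standing in for our trained weights and recorded video.

import cv2
import torch

model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")  # trained helmet detector
tracker = StrongSortTracker()          # placeholder for a StrongSORT implementation

cap = cv2.VideoCapture("site_video.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    dets = model(frame[..., ::-1]).xyxy[0].cpu().numpy()  # rows: [x1, y1, x2, y2, conf, cls]
    helmet_dets = dets[dets[:, 5] == 0]                   # assume class 0 = wearing a helmet
    tracks = tracker.update(helmet_dets, frame)           # each track carries a persistent ID
    for x1, y1, x2, y2, track_id in tracks:
        cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
        cv2.putText(frame, f"ID {int(track_id)}", (int(x1), int(y1) - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
cap.release()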

4.3.3. Detector Comparison Experiment

In this paper, the two-stage detection models Faster R-CNN + FPN and Cascade Masked R-CNN + FPN and the one-stage detection model YOLOv3 + SPP are compared with the improved YOLOv5 detector, and each model uses pretrained weights during training. Under the same conditions, the Safety Helmet Wearing Dataset described above is used for training, and the mAP@0.5, FPS, and the size of the saved weight model are used as the metrics for evaluating the detectors. The experimental results are shown in Table 2: the detection model proposed in this paper reaches 95.4% mAP@0.5 over all categories on the validation dataset, and the precision for wearing helmets reaches 96.5%, as illustrated in Figure 9. Its inference speed reaches 100 images per second and its weight file is only 14.4 MB, which is better than the other detection models. The original YOLOv5s weight file is 14.5 MB, so the weight size changes little after replacing the bounding box regression loss.

4.3.4. Tracker Comparison Experiment

In addition, the DeepSORT [27] and StrongSORT [28] algorithms are compared in this paper. Both are combined with the YOLOv5s + Focal-EIOU detector for comparative experiments, using ID switching under person occlusion and FPS as the evaluation indicators. As shown in Table 3, StrongSORT processes tracking and detection faster than DeepSORT, reaching 37 frames per second.
In the experiments with YOLOv5s combined with DeepSORT, the target ID frequently switches after the target is occluded, as shown in Figure 10: Figure 10a,c show the same target at different frame moments, whose ID switches from 858 to 939 because the target person is obscured, and in Figure 10b the ID is lost because the target is obscured. With YOLOv5s combined with StrongSORT, for the same target person, there is no ID switching or ID loss even when the target has been obscured, and the ID remains 223, as shown in Figure 10d,e.

5. Conclusions

In this paper, the object detection model YOLOv5s is combined with the tracking algorithm StrongSORT [28] to realize the tracking of helmet wearing. According to the comparative experiments, the YOLOv5s model is the most suitable choice in terms of detection speed and detection precision. In addition, in the tracking comparison experiment, StrongSORT [28] has a faster processing speed than DeepSORT [27], and the target ID is not lost or switched due to problems such as long-term occlusion and large changes in motion scale. The speed of detection and tracking also achieves good results. In future work, we will explore how to deploy this work in embedded terminal applications.

Author Contributions

Conceptualization, F.L. and Y.C.; Methodology, F.L. and Y.C.; Software, F.L., Y.C., M.H., M.L. and G.W.; Validation, F.L., Y.C., M.H., M.L. and G.W.; Formal analysis, Y.C.; Investigation, Y.C.; Resources, F.L.; Data curation, F.L.; Writing—original draft preparation, Y.C.; Writing—review and editing, F.L.; Visualization, M.H., M.L. and G.W.; Supervision, F.L.; Project administration, F.L.; All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data are available from the corresponding author on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, Kauai, HI, USA, 8–14 December 2001; Volume 1, p. 1.
  2. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893.
  3. O’Shea, K.; Nash, R. An introduction to convolutional neural networks. arXiv 2015, arXiv:1511.08458.
  4. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
  5. Uijlings, J.R.; Van De Sande, K.E.; Gevers, T.; Smeulders, A.W. Selective search for object recognition. Int. J. Comput. Vis. 2013, 104, 154–171.
  6. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
  7. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
  8. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
  9. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
  10. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  11. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
  12. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
  13. Jocher, G. YOLOv5 by Ultralytics (Version 7.0) [Computer software]. Available online: https://doi.org/10.5281/zenodo.3908559 (accessed on 20 December 2020).
  14. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636.
  15. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
  16. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37.
  17. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  18. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
  19. Redmon, J. Darknet: Open Source Neural Networks in C. Available online: http://pjreddie.com/darknet/ (accessed on 20 December 2013).
  20. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
  21. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
  22. Misra, D. Mish: A self regularized non-monotonic activation function. arXiv 2019, arXiv:1908.08681.
  23. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12993–13000.
  24. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
  25. Kalake, L.; Wan, W.; Hou, L. Analysis based on recent deep learning approaches applied in real-time multi-object tracking: A review. IEEE Access 2021, 9, 32650–32671.
  26. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468.
  27. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649.
  28. Du, Y.; Song, Y.; Yang, B.; Zhao, Y. Strongsort: Make deepsort great again. arXiv 2022, arXiv:2202.13514.
  29. Luo, H.; Jiang, W.; Gu, Y.; Liu, F.; Liao, X.; Lai, S.; Gu, J. A strong baseline and batch normalization neck for deep person re-identification. IEEE Trans. Multimed. 2019, 22, 2597–2609.
  30. Song, H.; Zhang, X.; Song, J.; Zhao, J. Detection and tracking of safety helmet based on DeepSort and YOLOv5. Multimedia Tools Appl. 2022, 1–14.
  31. Dang, T.L.; Nguyen, G.T.; Cao, T. Object tracking using improved deep SORT YOLOv3 architecture. ICIC Express Lett. 2020, 14, 961–969.
  32. Meimetis, D.; Daramouskas, I.; Perikos, I.; Hatzilygeroudis, I. Real-time multiple object tracking using deep learning methods. Neural Comput. Appl. 2021, 35, 89–118.
  33. Wu, S.; Li, X.; Wang, X. IoU-aware single-stage object detector for accurate localization. Image Vis. Comput. 2020, 97, 103911.
  34. Rezatofighi, H.; Tsoi, N.; Gwak, J.Y.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666.
  35. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157.
  36. Kalman, R.E. A new approach to linear filtering and prediction problems. J. Basic Eng. 1960, 82, 35–45.
Figure 1. The network model of YOLOv5s.
Figure 2. The network model of AFLink.
Figure 3. Four different test scenarios with helmets. (a) including positive and negative samples; (b) dark samples; (c) object occlusion; (d) negative samples.
Figure 4. Training results of Safety Helmet Dataset on YOLOv5s. (a) Precision and Recall; (b) Training results of some samples.
Figure 5. Validation dataset classification loss.
Figure 6. Training set classification loss.
Figure 7. YOLOv5s video detection results. (a) missing detection; (b) full detection.
Figure 8. Tracking results of the same object person. (a) 10th second; (b) 15th second; (c) 25th second.
Figure 9. Precision and recall with focal-EIOU.
Figure 10. ID Status of same person in different frames in the video. (a) Frame 612; (b) Frame 689; (c) Frame 710; (d) Frame 670; (e) Frame 710.
Table 1. Training Environment.
Items        Version
Graphics     NVIDIA GeForce RTX 3070
Framework    PyTorch
System       Ubuntu 18.04
CUDA         11.04
Table 2. Comparison experiment of detector.
Detector                       mAP@0.5/%   FPS   Weight (MB)
Faster R-CNN + FPN             85.1        24    330.4
Cascade Masked R-CNN + FPN     85.5        19    552.8
YOLOv3 + SPP                   87.9        55    338.8
YOLOv5s                        95.1        77    14.5
YOLOv5s + Focal-EIOU           95.4        100   14.4
Table 3. Comparison of FPS.
Model                  FPS
YOLOv5 + DeepSORT      34
YOLOv5 + StrongSORT    37
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Li, F.; Chen, Y.; Hu, M.; Luo, M.; Wang, G. Helmet-Wearing Tracking Detection Based on StrongSORT. Sensors 2023, 23, 1682. https://doi.org/10.3390/s23031682