Article

Improved DeepSORT Algorithm Based on Multi-Feature Fusion

School of Information and Automation Engineering, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Syst. Innov. 2022, 5(3), 55; https://doi.org/10.3390/asi5030055
Submission received: 22 April 2022 / Revised: 23 May 2022 / Accepted: 24 May 2022 / Published: 13 June 2022

Abstract

At present, detection-based pedestrian multi-target tracking algorithms are widely used in artificial intelligence, autonomous driving, virtual reality and other fields, and have achieved good tracking results. The traditional DeepSORT algorithm continuously tracks multiple pedestrian targets and keeps their IDs unchanged, but its applicability and tracking accuracy still need further improvement. To improve the tracking accuracy of DeepSORT, we propose a novel algorithm that revises the IOU distance metric used in the matching process and integrates a Feature Pyramid Network (FPN) with multi-layer pedestrian appearance features. The improved algorithm is verified on the public MOT-16 dataset, and its tracking accuracy is improved by 4.1%.

1. Introduction

With the great breakthroughs of deep learning in computer vision in recent years, deep learning has been applied to pedestrian multi-target tracking, and improving tracking accuracy has become the mainstream of multi-target tracking research [1]. The key steps of a multi-target tracking algorithm are extracting the apparent features of targets, calculating an appearance similarity or distance measure between newly detected targets and the targets already in trajectories, and predicting motion trajectories [2]. At present, the most mature application of deep learning in pedestrian multi-target tracking is the extraction of the apparent features of tracked pedestrians [3].
AlexNet, proposed by Alex Krizhevsky [4], was the first convolutional network applied to image feature extraction and won the ImageNet competition that year, ushering in the era of deep feature extraction. Alex Bewley et al. [5] proposed the SORT algorithm, which brought deep learning into multi-target tracking for the first time. This method used the deep-learning-based Faster Region-based Convolutional Neural Network (Faster R-CNN) [6] to detect pedestrians and obtain their positions in the current video frame. A Kalman filter combined the current detection with the prediction from the previous state to estimate the current state of each target. Assignment costs were then computed from the intersection-over-union (IOU) distance, data association was performed with the Hungarian algorithm, and the detections were assigned to the estimated trajectories. The method achieved good accuracy at high speed, reaching a tracking rate of 60 Hz. Nicolai Wojke et al. proposed DeepSORT [7], which adds a pedestrian appearance feature similarity measure on top of the SORT algorithm and combines it with a cascade matching module to reduce the number of ID switches when pedestrians are occluded, improving the robustness of the model. Wang et al. embedded the feature extraction module of the multi-target tracker into the YOLOv3 network, jointly learning the detector and the embedding model (JDE) [8]. During training, the weights are shared directly with the detection network, so the tracker can use the target bounding boxes output by the detector together with the corresponding appearance features; this improved real-time tracking performance. Zhang et al. proposed the FairMOT algorithm [9] based on the idea of JDE: because JDE directly reuses the feature extraction network of the YOLOv3 [10] algorithm, it inherits anchor boxes, and the detection network and the tracking branch rely on different pedestrian appearance features, which makes it less accurate for dense pedestrians. FairMOT instead uses the anchor-free design of CenterNet [11] and extracts pedestrian appearance from a downsampled feature map, which greatly improves tracking accuracy. In summary, improving the accuracy of pedestrian appearance feature extraction is of great significance for improving the accuracy of pedestrian multi-target tracking.
Shao et al. [12] combined FPN and ResNet to extract and fuse image features, which enhanced the semantic information and contours of occluded pedestrians and improved pedestrian detection performance. Pereira et al. [13] compared the impact of different data-association measures on the tracking performance of the DeepSORT algorithm, showing that better data-association measures can improve the performance of tracking algorithms.
Based on the above results, this paper improves the feature extraction network model in the DeepSORT algorithm and introduces a new data-association measure to improve the tracking accuracy of the algorithm.

2. Analysis of Related Algorithms

2.1. Pedestrian Multi-Objective Tracking Algorithm

The DeepSORT tracking method is a single-hypothesis target-tracking algorithm based on the Hungarian algorithm and the Kalman filter. The state of a target is represented by $(x, y, \gamma, h, v_x, v_y, v_\gamma, v_h)$, where $(x, y)$ are the target's center coordinates, $\gamma$ is the target's width-to-height ratio, $h$ is the target's height, and $v_x, v_y, v_\gamma, v_h$ are the corresponding velocities of $x, y, \gamma, h$. A Kalman filter is used to predict the location of each track in the detection space, and the Hungarian matching algorithm is adopted to match detections with tracks.
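As an illustration of this state representation, the sketch below shows a constant-velocity Kalman prediction step for the eight-dimensional state; it is a minimal NumPy sketch under assumed noise settings, not the authors' implementation.
```python
import numpy as np

# Hypothetical sketch: constant-velocity Kalman prediction for the
# 8-dimensional state (x, y, gamma, h, vx, vy, v_gamma, vh).
dt = 1.0                       # one video frame
F = np.eye(8)                  # state transition matrix
F[:4, 4:] = dt * np.eye(4)     # position components += velocity * dt

def predict(mean, cov, Q):
    """One Kalman prediction step: propagate mean and covariance."""
    mean = F @ mean
    cov = F @ cov @ F.T + Q    # Q: process noise covariance (assumed given)
    return mean, cov

# Example: a track centred at (100, 50), aspect ratio 0.5, height 80,
# moving 2 px right and 1 px down per frame.
mean = np.array([100, 50, 0.5, 80, 2, 1, 0, 0], dtype=float)
cov = np.eye(8)
mean, cov = predict(mean, cov, Q=0.01 * np.eye(8))
print(mean[:4])   # predicted (x, y, gamma, h) for the next frame
```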
When matching target trajectories with detections, a cascade-matching approach is used. In the matching process, appearance information and motion information are combined into a new measurement matrix to judge the matching degree between detections and trajectories. Appearance information is extracted by a simple convolutional neural network, which produces a feature vector $r_j$ for each pedestrian detection frame. The appearance features of the last 100 frames of each track are stored in $R_i$, where $r_i^{(k)}$ denotes the feature of the $k$-th stored frame of track $i$. The minimum distance between the features of the $i$-th pedestrian trajectory and the feature of the $j$-th pedestrian detection is calculated by Equation (1):
$$ d^{(1)}(i,j) = \min\left\{ 1 - r_j^{T} r_i^{(k)} \;\middle|\; r_i^{(k)} \in R_i \right\}, \quad k \in [1, 100] \tag{1} $$
A control threshold $t^{(1)}$ on the appearance distance is introduced to determine whether the two can be associated, calculated using Equation (2):
$$ b^{(1)}(i,j) = \begin{cases} 1, & d^{(1)}(i,j) \le t^{(1)} \\ 0, & d^{(1)}(i,j) > t^{(1)} \end{cases} \tag{2} $$
When the $i$-th track matches the $j$-th detection, the result is 1; otherwise, it is 0.
Motion information is expressed as the squared Mahalanobis distance [14] between the predicted location of the Kalman track and the detected location, calculated by Equation (3):
$$ d^{(2)}(i,j) = (d_j - y_i)^{T} S_i^{-1} (d_j - y_i) \tag{3} $$
where $y_i$ represents the projection of the predicted state of the $i$-th track into the detection space, $d_j$ represents the $j$-th detection target in the current detection space, and $S_i$ represents the covariance matrix of the $i$-th track projected into the detection space.
A detection control threshold $t^{(2)}$ is also set to determine whether the detection needs to be excluded, calculated by Equation (4):
$$ b^{(2)}(i,j) = \begin{cases} 1, & d^{(2)}(i,j) \le t^{(2)} \\ 0, & d^{(2)}(i,j) > t^{(2)} \end{cases} \tag{4} $$
The Mahalanobis distance based on motion information and the cosine distance based on pedestrian appearance features are combined in a weighted sum to form a new measurement, calculated by Equation (5):
$$ c_{i,j} = \lambda \, d^{(1)}(i,j) + (1 - \lambda) \, d^{(2)}(i,j) \tag{5} $$
For the association, Equation (6) defines a gating matrix that determines whether both measures lie within their selected regions:
$$ b_{i,j} = \prod_{m=1}^{2} b^{(m)}(i,j) \tag{6} $$
A match is accepted only when both the appearance distance measure and the position distance measure pass their gates. The DeepSORT algorithm uses cascade matching followed by IOU matching to associate new detections with the trajectories predicted by the Kalman filter, iterating until the matching process is complete.
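To make the matching measures concrete, the following NumPy sketch computes the appearance distance of Equation (1), the Mahalanobis term of Equation (3), the combined cost of Equation (5) and the gate of Equation (6) for a single track–detection pair. The thresholds, the weight λ and the assumption of L2-normalised features are illustrative choices on our part, not the paper's settings.
```python
import numpy as np

def appearance_distance(R_i, r_j):
    """Eq. (1): smallest cosine distance between detection feature r_j and
    the (up to 100) stored features R_i of track i.
    Features are assumed L2-normalised, so r_j @ r is the cosine similarity."""
    return min(1.0 - float(r_j @ r) for r in R_i)

def mahalanobis_sq(d_j, y_i, S_i):
    """Eq. (3): squared Mahalanobis distance between detection d_j and the
    track's predicted measurement y_i with covariance S_i."""
    diff = d_j - y_i
    return float(diff.T @ np.linalg.inv(S_i) @ diff)

def gated_cost(R_i, r_j, d_j, y_i, S_i, lam=0.5, t1=0.2, t2=9.4877):
    """Eqs. (2), (4)-(6): weighted cost c_ij and gate b_ij.
    lam, t1 and t2 are illustrative values, not the paper's settings."""
    d1 = appearance_distance(R_i, r_j)      # appearance term
    d2 = mahalanobis_sq(d_j, y_i, S_i)      # motion term
    b = int(d1 <= t1) * int(d2 <= t2)       # Eq. (6): both gates must pass
    c = lam * d1 + (1.0 - lam) * d2         # Eq. (5)
    return c, b
```
In the full algorithm, c and b are computed for every track–detection pair to fill the cost matrix, and pairs with b = 0 are excluded from the Hungarian assignment.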
The DeepSORT algorithm offers a significant performance improvement over the SORT algorithm for pedestrian target tracking. During tracking, the feature extraction network strongly affects the accuracy and richness of the extracted pedestrian appearance information and therefore has a large impact on the tracking result. Although the DeepSORT algorithm adds a pedestrian appearance feature extraction network, the network model still has room for improvement:
(1) The feature extraction network of the algorithm is only a shallow convolutional network. When the appearance similarity between different pedestrian targets is high, false matches and mismatches easily occur, and matching errors also occur when the target scale changes. This problem can be alleviated by increasing the depth of the feature extraction network.
(2) The traditional convolutional network only uses a simple downsampling process. The feature map output by the backbone network has a very large receptive field, which easily loses target information and results in inaccurate feature information for smaller pedestrian targets. Tracking accuracy can be improved by fusing feature information from different levels, which also copes with detection boxes of different sizes, a factor that strongly affects the completeness of the extracted feature information.

2.2. Feature Pyramid Network

In the process of pedestrian multi-target tracking, pedestrian feature extraction in complex environments is affected by many external factors. Pedestrians in the same frame tend to have different sizes, and using the same feature extractor for pedestrians of different sizes often causes smaller pedestrian targets to lose much of their detailed information, which is not conducive to good appearance matching during tracking.
A Feature Pyramid Network (FPN) [15] combines features from multiple levels to output the final feature map. It is named the feature pyramid network because of its pyramid-like structure and appearance. FPN is widely used in target detection, semantic segmentation, behavior recognition and other fields. Because of the feature fusion at different levels, the output feature map has multi-layer semantic information, which can extract a variety of semantic information when extracting multi-target pedestrian features. The network structure is shown in Figure 1.
From the FPN network structure in Figure 1, it can be seen that when feature extraction is performed in the backbone network, the output features $C_0$, $C_1$ and $C_2$ from different levels are finally fused through top-down fusion of the feature maps. When extracting pedestrian target features, feature scale has a great influence on the expression of pedestrian feature information, and network structures of different depths extract the appearance of pedestrians of different sizes with different accuracy. Fusing pedestrian appearance features extracted at different network depths can reduce the differences in feature information caused by scale changes during matching [16]. At the same time, a shallow feature extraction network is very sensitive to location information, while a deep feature extraction network is sensitive to appearance information [17]. Combining these two complementary advantages can substantially improve the probability of successful matching between pedestrian detections and tracks during tracking.

2.3. Residual Network Structure

In a feature extraction network, as network depth and structural complexity increase, new layers tend to learn the same mapping as the previous layers. Continuing to add layers then has no obvious effect on network accuracy; sometimes it even reduces accuracy while consuming more computing resources. In order to improve the training accuracy of the network and the diversity of the extracted feature information, the residual network structure (ResNet) can be used to address this identity-mapping problem in the network model [18]. Its network module is shown in Figure 2.
As shown in Figure 2, the main idea of the residual module is to map the input $x$ to $y = F(x) + x$, where $F(x) = y - x$ is the residual term. Compared with a traditional convolutional network, introducing the residual module links the input $x$ to the output through a skip connection, so the network only needs to learn the residual. This structure alleviates the information loss that occurs in the convolution computations of a traditional convolutional neural network, protects the integrity of the information, simplifies learning, and improves testing and training accuracy.
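A minimal PyTorch sketch of such a residual block is given below; the channel count and the two-convolution layout are generic assumptions for illustration, not the exact block used in the network of Section 4.
```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + x), where F(x) is two
    3x3 convolutions with batch normalisation."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x                           # identity skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + residual)       # y = F(x) + x
```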

3. Generalized Intersection over Union (GIOU)

In general, the IOU distance reflects well the relationship between the projection box (hereinafter referred to as the prediction box) and the detection box in the detection space. The IOU value always lies between 0 and 1 and does not change when the prediction and detection boxes are scaled together. However, when there is no overlap between the detection box and the prediction box, the IOU is constant at 0, so the IOU distance cannot be used to judge the match between the detection box and the prediction box, as shown in Figure 3.
As shown in Figure 3, when the red detection frame does not overlap the green track prediction frame, the IOU values of Figure 3a,b are both 0, so the difference in matching degree between the two cases cannot be judged, which easily causes false matching.
It can also happen that the IOU value between the trajectory prediction box and the detection box is the same while the overlap positions differ. As shown in Figure 4, it is then likewise impossible to judge the matching degree between the detection frame and the prediction frame from the IOU distance alone.
Figure 4a,b have the same IOU value, but the overlapping positions of the detection frame and the prediction frame differ between the two groups of figures. To overcome this problem, position information is added to the distance measurement to judge how the detection frame and the prediction frame intersect. The GIOU distance introduces the minimum bounding box of the detection frame and the prediction frame to capture the spatial relationship between the two frames. The calculation is shown in Equation (7):
$$ GIOU = IOU - \frac{C - A \cup B}{C} \tag{7} $$
where $C$ is the area of the minimum bounding box surrounding the detection frame and the prediction frame, $A$ is the area of the target trajectory prediction frame, $B$ is the area of the pedestrian detection frame, and $A \cup B$ denotes the area of their union.
The spatial representation of GIOU is shown in Figure 5.
From Figure 5, $C - A \cup B$ corresponds to the red area in the figure. When the positional relationship between the detection box and the prediction box changes, the red area changes accordingly. By using the ratio of the red area to the minimum bounding box, the positional information between the detection box and the prediction box is indirectly added to the distance matrix.
A simple rearrangement of Equation (7) gives:
$$ GIOU = IOU - 1 + \frac{A \cup B}{C} \tag{8} $$
It can be seen from Equation (8) that when the IOU is 0, that is, when there is no intersection between the detection frame and the prediction frame, the IOU term is fixed while the GIOU still varies: the larger $C$ is, that is, the farther the detection frame is from the prediction frame, the smaller the GIOU and the larger the corresponding GIOU distance, indicating a low matching degree, and vice versa.
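As a concrete illustration of Equations (7) and (8), the following sketch computes GIOU for two axis-aligned boxes given in (x1, y1, x2, y2) form; the function name and box format are our own conventions.
```python
def giou(box_a, box_b):
    """Compute GIOU for two boxes in (x1, y1, x2, y2) format."""
    # areas of the prediction box A and the detection box B
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])

    # intersection and union of A and B
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area_a + area_b - inter
    iou = inter / union

    # minimum enclosing box C
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)

    return iou - (area_c - union) / area_c     # Eq. (7)

# Two non-overlapping pairs: IOU is 0 for both, but GIOU keeps decreasing
# as the boxes move farther apart.
print(giou((0, 0, 10, 10), (12, 0, 22, 10)))   # close pair, about -0.09
print(giou((0, 0, 10, 10), (40, 0, 50, 10)))   # distant pair, -0.6
```
When GIOU is used as an association cost, 1 − GIOU can serve as the distance, so a larger separation between non-overlapping boxes yields a larger cost.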
When the IOU is 0, the IOU distance measurement cannot judge the matching relationship between the detection frame and the prediction frame. The matching between the detection frame and the prediction frame when the GIOU distance is used instead is shown in Figure 6.
As can be seen from Figure 6a,b, when the GIOU measurement is used, the purple minimum bounding box is introduced to enclose the detection frame and the prediction frame. This solves the problem of the IOU distance measurement being unable to judge the matching degree between the detection frame and the prediction frame when they do not overlap.
When the IOU values are the same and the overlap between the detection frame and the prediction frame is different, the GIOU measurement effect is shown in Figure 7.
It can be seen from Figure 7 that, under the two different overlap configurations, the IOU value between the detection frame and the prediction frame is 0.14 in both Figure 7a and Figure 7b, while the GIOU value is 0.14 in Figure 7a and −0.0793 in Figure 7b. The IOU distance measurement therefore cannot distinguish the matching degree between the detection frame and the prediction frame, whereas the GIOU distance measurement handles such cases well.

4. The Improved Feature Extraction Network

In order to improve the performance of the feature extraction network, ResNet50 is selected as the backbone network. As the network deepens, the overall representation of pedestrian appearance becomes richer and the semantic information more accurate. Fusing the features of different levels further enriches the pedestrian features output by the feature extraction network, which improves tracking accuracy and helps the algorithm cope with complex external environments. To reduce the number of parameters, the number of channels in the residual blocks of the network is reduced. The backbone network structure for feature fusion is shown in Figure 8.
From Figure 8, the residual modules of the backbone network are divided into four stages according to the number of output channels. The first stage is composed of three identical residual structures. In the second to fourth stages, the 3 × 3 convolution in the first residual module of each stage has a stride of two; all other convolutions have a stride of one. The dimension of the final output feature of the whole pedestrian feature extraction network is 512, which can express richer pedestrian feature information than the 128 dimensions of the original network. At the end of the backbone and at the ends of the third and second stages, the output feature maps $C_0$, $C_1$ and $C_2$ are extracted and passed to the FPN network. The $C_0$ output carries rich, high-level semantic information, which benefits feature extraction for large pedestrian targets but performs poorly on small ones. Although the feature information of the $C_1$ and $C_2$ outputs is not as semantically rich as that of $C_0$, these layers extract smaller pedestrian targets better and may also retain some spatial information.
A Rectified Linear Unit (ReLU) is used as the activation function in the whole network structure, and a cross-entropy loss function is used for training. At the same time, each residual module is batch-normalized, which makes the model converge faster and improves the training efficiency and stability of the model. Because the data distribution is more uniform, it is less prone to over-fitting.
The extracted $C_0$, $C_1$ and $C_2$ are passed into the FPN module, and the network implementation process is shown in Figure 9.
As shown in Figure 9, the feature maps $C_0$, $C_1$ and $C_2$ output by different levels of the backbone network are fused through the FPN network. $C_0$ first passes through a $1 \times 1$ convolution that reduces it from 512 to 256 dimensions and, after upsampling, is concatenated with $C_1$. The concatenated result is reduced to 128 dimensions by the same convolution operation, upsampled again, and concatenated with $C_2$. After this concatenation, the output is converted to 512 dimensions by $3 \times 3$ and $1 \times 1$ convolutions. Then, after a $5 \times 5$ mean-pooling layer with a stride of two, followed by a fully connected layer and batch normalization, the final pedestrian appearance feature vector is obtained and used to compute the appearance similarity between tracks and detections.
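To make the data flow concrete, the PyTorch sketch below follows a simplified reading of Figure 9; the channel widths assumed for $C_1$ (256) and $C_2$ (128), the nearest-neighbour upsampling, and the global average before the fully connected layer are our own assumptions rather than details stated in the paper.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNFusion(nn.Module):
    """Top-down fusion sketch after Figure 9. Assumed channel widths:
    C0 = 512, C1 = 256, C2 = 128 (the paper only states C0 = 512)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.reduce0 = nn.Conv2d(512, 256, 1)        # C0: 512 -> 256 channels
        self.reduce1 = nn.Conv2d(256 + 256, 128, 1)  # after concat with C1 -> 128
        self.out_conv = nn.Sequential(               # 3x3 then 1x1 conv -> 512
            nn.Conv2d(128 + 128, 512, 3, padding=1),
            nn.Conv2d(512, 512, 1),
        )
        self.pool = nn.AvgPool2d(5, stride=2)        # 5x5 mean pooling, stride 2
        self.fc = nn.Linear(512, feat_dim)
        self.bn = nn.BatchNorm1d(feat_dim)

    def forward(self, c0, c1, c2):
        x = F.interpolate(self.reduce0(c0), scale_factor=2)  # up to C1 resolution
        x = torch.cat([x, c1], dim=1)                        # concat with C1
        x = F.interpolate(self.reduce1(x), scale_factor=2)   # 128-d, up to C2 resolution
        x = torch.cat([x, c2], dim=1)                        # concat with C2
        x = self.pool(self.out_conv(x))
        x = x.mean(dim=(2, 3))        # simplification: global average before the FC layer
        return self.bn(self.fc(x))    # final 512-d appearance feature vector

# Dummy feature maps for a batch of two pedestrian crops (sizes are illustrative).
c0 = torch.randn(2, 512, 8, 4)
c1 = torch.randn(2, 256, 16, 8)
c2 = torch.randn(2, 128, 32, 16)
print(FPNFusion()(c0, c1, c2).shape)  # torch.Size([2, 512])
```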
In summary, ResNet50 is used as the backbone network to extract the appearance information of pedestrian targets. To address the poor extraction of deep features for small targets and their insensitivity to location information, this paper combines the backbone with an FPN network that fuses multi-layer features. After the feature extraction network has produced the pedestrian appearance features, the cost matrix is computed from the feature similarity between pedestrian trajectories and detections, combined with the motion information, to complete cascade matching. The GIOU distance measure is also introduced as the cost between detections and trajectory predictions in the pedestrian multi-target tracking algorithm. GIOU resolves cases that the IOU measure cannot handle, reduces mismatches between detections and track predictions, and reduces identity switches during tracking. The proposed improvements to the DeepSORT algorithm therefore significantly improve tracking accuracy.

5. Experimental Results and Analysis

Firstly, the improved feature extraction model was trained on the official, publicly available Market-1501 pedestrian re-identification dataset, which was collected with six cameras on a university campus and contains 1501 pedestrian identities. A total of 32,668 pedestrian images are included, of which 12,936 form the training set and 19,732 the test set; each pedestrian is captured from different angles by different cameras. Some of the data are shown in Figure 10. We use the full Market-1501 training set to train our feature extraction network and use the first image of each pedestrian in the Market-1501 test set as our test set. Training runs for 80 epochs with 64 samples per batch, as shown in Figure 11.
Figure 11a depicts the loss curve during feature extraction network training and prediction, and Figure 11b shows the corresponding Top-1 error rate, where the blue segments represent the training phase and the red segments represent the prediction phase. As can be seen from Figure 11a, the loss decreases very quickly during the first 20 epochs, then slows down, drops sharply again around the 40th epoch, and flattens out until it stabilizes around the 70th epoch; training ends at the 80th epoch. In the prediction stage, the downward trend of the loss is similar to that of the training phase, but the first 40 epochs show severe fluctuations because the model is still inaccurate early in training. In Figure 11b, the Top-1 error rate follows a pattern similar to the loss curve in Figure 11a, and training is likewise completed at the 80th epoch.
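For reference, a minimal training-loop sketch consistent with this setup (80 epochs, batch size 64, cross-entropy loss) is shown below. The dataset path, optimizer and learning rate are our own assumptions, and a torchvision ResNet50 is used only as a stand-in for the modified backbone described in Section 4.
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

# Hypothetical layout: Market-1501 crops arranged as one folder per identity.
transform = transforms.Compose([
    transforms.Resize((128, 64)),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("market1501/train", transform=transform)
loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

# Stand-in backbone; the paper uses its reduced-channel ResNet50 + FPN model.
model = models.resnet50(num_classes=len(train_set.classes))
criterion = nn.CrossEntropyLoss()                                        # loss used in the paper
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)   # assumed settings

for epoch in range(80):                    # 80 epochs, as in the paper
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```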
After training, the resulting weights are loaded into the DeepSORT tracking module to verify the performance of the improved pedestrian multi-target tracking algorithm on the MOT-16 dataset. "Our" denotes the improved algorithm that keeps the original IOU matching, and "Our-1" denotes the version that replaces IOU with GIOU. Among the evaluation indices, False Positives (FP) is the number of false detections, False Negatives (FN) is the number of missed detections, and Identity Switches (IDS) is the number of identity exchanges over all tracked targets. Multiple Object Tracking Accuracy (MOTA) is expressed by the following equation:
$$ MOTA = 1 - \frac{FP + FN + IDS}{GT} \tag{9} $$
In Equation (9) above, $GT$ represents the total number of ground-truth targets.
The Multiple Object Tracking Precision (MOTP) measures the positioning accuracy of the detector, which is expressed by the following equation:
$$ MOTP = \frac{\sum_{t,i} d_{t,i}}{\sum_{t} c_t} \tag{10} $$
In Equation (10), $c_t$ represents the total number of matches in frame $t$, and $d_{t,i}$ represents the distance between the hypothesized bounding box and the ground-truth bounding box for the $i$-th match.
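As a small worked illustration of Equations (9) and (10), the helper functions below compute MOTA and MOTP from raw counts; the numbers in the example call are hypothetical and do not correspond to Table 1.
```python
def mota(fp, fn, ids, gt):
    """Eq. (9): Multiple Object Tracking Accuracy."""
    return 1.0 - (fp + fn + ids) / gt

def motp(distances, matches_per_frame):
    """Eq. (10): total localisation error d_{t,i} over all matched pairs,
    divided by the total number of matches c_t."""
    return sum(distances) / sum(matches_per_frame)

# Hypothetical example: 100 false positives, 400 misses and 10 ID switches
# over 1000 ground-truth targets give MOTA = 1 - 510/1000 = 0.49.
print(mota(fp=100, fn=400, ids=10, gt=1000))
```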
The verification results are shown in Table 1.
From Table 1, it can be seen that the MOTA and MOTP of the SORT algorithm, which has no appearance feature extraction network, are much lower than those of the DeepSORT pedestrian tracking algorithm, and its number of IDS is much higher; the feature extraction network therefore has a great impact on tracking quality. Compared with the DeepSORT algorithm, the number of FP in Our algorithm is reduced by 1106, an 8.6% decrease. The number of FN is reduced by 8286, a decrease of 22.5%, which is the largest improvement among the evaluation parameters. The number of IDS is reduced by 132 (16.9%). The final tracking accuracy (MOTA) of Our algorithm is 2.3 higher than that of the DeepSORT algorithm, a relative improvement of 3.7%, while the tracking precision (MOTP) is 0.2 lower, a decrease of 0.25%.
Compared with Our algorithm, the number of FP in Our-1 is reduced by 403 (3.4%), the number of FN is reduced by 587 (2.1%), and the number of IDS is reduced by 32 (4.9%). The final tracking accuracy (MOTA) of the Our-1 algorithm is 0.2 higher than that of Our algorithm, a relative improvement of 0.31%, while the tracking precision (MOTP) is 0.5 lower, a decrease of 0.61%.
From the comparative analysis of the above experimental data, it is clear that using ResNet50 combined with the feature pyramid network to extract pedestrian appearance features improves the MOTA of the DeepSORT algorithm by 3.7%, and that additionally using GIOU instead of IOU as the matching measure between detection frames and trajectory prediction frames improves MOTA by 4.1% compared with DeepSORT.

6. Conclusions

In the proposed algorithm, ResNet50 was used as the feature extraction backbone to obtain deeper pedestrian appearance information. To adapt to detection frames of different sizes, feature maps from different levels of the backbone were output and fused with an FPN network, improving the semantic information of the pedestrian appearance features extracted during tracking. The GIOU distance was used as the matching metric between the detection box and the trajectory prediction box to resolve the inaccurate matching of the IOU distance, enhancing the robustness and tracking accuracy of the algorithm. Experiments show that the tracking accuracy of the improved algorithm is improved by 4.1%. Future research could consider more effective feature extraction network structures and distance matching metrics to further improve tracking accuracy.

Author Contributions

Investigation, H.L.; Resources, Y.P. and Q.B.; Supervision, L.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was sponsored by project No. QLUTGJHZ2018019.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gong, X.; Le, Z.; Wu, Y.; Wang, H. Real-Time Multiobject Tracking Based on Multiway Concurrency. Sensors 2021, 21, 685. [Google Scholar] [CrossRef] [PubMed]
  2. Tathe, S.V.; Narote, A.S.; Narote, S.P. Face Recognition and Tracking in Videos. Adv. Sci. Technol. Eng. Syst. J. 2017, 2, 1238–1244. [Google Scholar] [CrossRef] [Green Version]
  3. Luo, W.; Xing, J.; Milan, A.; Zhang, X.; Liu, W.; Kim, T.K. Multiple object tracking: A literature review. Artif. Intell. Int. J. 2021, 1293, 103448. [Google Scholar] [CrossRef]
  4. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar]
  5. Bewley, A.; Zongyuan, G.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the International Conference on Image Processing, Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
  6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  7. Wojke, N.; Bewley, A.; Paulus, D. Simple Online and Realtime Tracking with a Deep Association Metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
  8. Yu, F.; Li, W.; Li, Q.; Liu, Y.; Shi, X.; Yan, J. POI: Multiple object tracking with high performance detection and appearance feature. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 36–42. [Google Scholar]
  9. Chen, J.; Jin, M.; Wang, W.; Lu, Y. Traffic flow detection based on yolov3 and deepsort. J. Metrol. 2021, 42, 718–723. [Google Scholar]
  10. Jun, L.; Yaoru, W.; Guokang, F.; Zeng, Z. Real-time detection tracking and recognition algorithm based on multi-target faces. Multimedia Tools Appl. 2021, 80, 17223–17238. [Google Scholar]
  11. Gai, Y.; He, W.; Zhou, Z. Pedestrian Target Tracking Based on DeepSORT with YOLOv5. In Proceedings of the 2021 2nd International Conference on Computer Engineering and Intelligent Control, ICCEIC 2021, Chongqing, China, 12–14 November 2021; pp. 1–5. [Google Scholar]
  12. Shao, X.; Wang, Q.; Yang, W.; Chen, Y.; Xie, Y.; Shen, Y.; Wang, Z. Multi-Scale Feature Pyramid Network: A Heavily Occluded Pedestrian Detection Network Based on ResNet. Sensors 2021, 21, 1820. [Google Scholar] [CrossRef] [PubMed]
  13. Pereira, R.; Carvalho, G.; Garrote, L.; Nunes, U.J. SORT and Deep-SORT Based Multi-Object Tracking for Mobile Robotics: Evaluation with New Data Association Metrics. Appl. Sci. 2022, 12, 1319. [Google Scholar] [CrossRef]
  14. Maesschalck, R.D.; Jouan-Rimbaud, D.; Massart, D.L. Tutorial. The Mahalanobis distance. Chemometr. Intell. Lab. Syst. 2000, 50, 1–18. [Google Scholar] [CrossRef]
  15. Ben, W.; Shuhan, C.; Jian, W.; Hu, X. Residual feature pyramid networks for salient object detection. Vis. Comput. 2020, 36, 1897–1908. [Google Scholar]
  16. Kuang, Q. Face Image Feature Extraction based on Deep Learning Algorithm. J. Phys. 2021, 1852, 032040. [Google Scholar] [CrossRef]
  17. Yu, C.; Qiong, C.; Huang, Q.; Chen, G.; Fu, X. An Image Defog Network Based on Multi-scale Feature Extraction and Weighting. In Proceedings of the 2021 4th International Conference on Artificial Intelligence and Big Data (ICAIBD), Chengdu, China, 8–31 May 2021; pp. 423–427. [Google Scholar]
  18. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Figure 1. Feature pyramid network structure.
Figure 2. Structural diagram of residual module.
Figure 3. No coincident position relationship between detection frame and prediction frame. (a) The detection frame is close to the prediction frame. (b) The detection frame is far from the prediction frame.
Figure 4. Coincident position relationship between detection frame and prediction frame. (a) Horizontal overlap. (b) Cross overlap.
Figure 5. Spatial representation of GIOU.
Figure 6. GIOU distance measurement effect display. (a) The detection frame is close to the prediction frame. (b) The detection frame is far from the prediction frame.
Figure 7. Effect drawing of GIOU measurement in different overlapping modes. (a) Horizontal overlap. (b) Cross overlap.
Figure 8. Feature fusion backbone network structures.
Figure 9. Multi-layer feature fusion structure diagram.
Figure 10. Pedestrian recognition partial dataset.
Figure 11. Feature extraction network training results. (a) Training Loss. (b) Top-1 Error.
Table 1. Algorithmic performance comparison.

Algorithm    FP↓       FN↓       IDS↓    MOTA (%)↑    MOTP (%)↑
SORT         7318      32,615    1423    33.4         72.1
DeepSORT     12,852    36,747    781     61.4         81.7
Our          11,746    28,461    659     63.7         81.5
Our-1        11,343    27,874    627     63.9         81.0
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
