Article

Detection and Tracking of Low-Frame-Rate Water Surface Dynamic Multi-Target Based on the YOLOv7-DeepSORT Fusion Algorithm

Xingcheng Han, Shiwen Fu and Junxuan Han
1 School of Information and Communication Engineering, North University of China, Taiyuan 030051, China
2 North Automatic Control Technology Institute, Taiyuan 030051, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2024, 12(9), 1528; https://doi.org/10.3390/jmse12091528
Submission received: 3 August 2024 / Revised: 24 August 2024 / Accepted: 1 September 2024 / Published: 3 September 2024

Abstract

This study addresses a problem in tracking technology: when cruising ships or submarines sailing near the water surface are captured at low frame rates, or with frames missing from the video, the tracked targets shift substantially between frames, degrading tracking accuracy and efficiency. We therefore propose a water surface dynamic multi-target tracking algorithm based on the fusion of YOLOv7 and DeepSORT. The algorithm first introduces a super-resolution reconstruction network, which suppresses the interference of clouds and waves, improves the quality of the tracked target images, and sharpens the target features. A shuffle attention module is then introduced into YOLOv7 to strengthen the feature extraction ability of the recognition network. Finally, Euclidean distance matching replaces IOU distance matching in the cascade matching of the DeepSORT algorithm to improve tracking accuracy. Simulation results show that the proposed algorithm tracks well: the improved YOLOv7 model gains 9.4% in mAP50-95, and the improved DeepSORT network improves tracking accuracy by 13.1% compared with SORT.

1. Introduction

Water surface target navigation and tracking technology has gained increasing attention for its military applications, and it is of great strategic significance for monitoring specific sea areas [1], port traffic statistics, marine security management, unknown ship identification, and unknown ship behavior analysis. However, the technology is affected by wave interference, cloud cover [2], and changes in the speed and heading of cruising targets on the water surface. Low-quality video also complicates target tracking: low-resolution video blurs image features and causes identification to fail, while low-frame-rate or missing-frame video loses information during tracking. Therefore, improving the accuracy of identification and tracking is essential [3].
Existing algorithms for ship tracking include correlation filtering (CF) [4]; however, because CF considers only local feature information, it is easily disturbed by background clutter and copes poorly with scale changes and partial occlusion of the target. A previous work proposed the background difference method [5]. Although its accuracy is improved, it updates too slowly, so the background model becomes outdated and cannot adapt to environmental changes; its adaptability to sudden scene changes and dynamic backgrounds is also poor. More recently, other researchers have proposed the residual neural network method [6] to identify and locate the tracked ship target. This method resists interference well for high-resolution ship targets, but its recognition performance degrades on low-resolution video of the tracked ship.
To address these problems, we propose a YOLOv7-DeepSORT water surface dynamic multi-target tracking algorithm, because YOLOv7 offers better accuracy in ship recognition than YOLOv8 [7] and YOLOv9 [8] while consuming fewer resources than YOLOv10. On the tracking side, a KCF-based tracking algorithm [9] has been proposed; although it has significant speed advantages, its accuracy is lower in complex environments. The target tracking algorithm based on the Siamese network [10] performs well in both accuracy and speed but cannot effectively handle target disappearance and occlusion. The CamShift tracking algorithm [11] performs well in real-time operation and accuracy, but it loses the target when the target and background have similar color tones. Therefore, we improved the DeepSORT algorithm by introducing Euclidean distance matching [12] into the tracking network, so that the model tracks better on low-frame-rate videos and when the target differs greatly between adjacent frames because of excessive motion speed.
In summary, the proposed YOLOv7-DeepSORT joint algorithm ensures excellent accuracy and real-time performance while also mitigating target loss and occlusion.

2. Algorithm Principle

2.1. Algorithm Flow

The algorithm is composed of three modules: an image preprocessing module (image enhancement algorithm) [13], a ship recognition module, and a cruise ship tracking module [14].
Firstly, we used a multi-scale residual reconstruction network composed of multi-scale residual blocks (MSRBs) to reconstruct the cruise ship videos. Secondly, we trained the ship recognition network built on the improved YOLOv7 to obtain the required training-set weights. Finally, the improved DeepSORT tracking algorithm replaced IOU matching [15] with Euclidean distance matching [16], enabling the network to recognize ships accurately in low-frame-rate videos and to track the recognized ship targets and assign IDs. The algorithm flowchart is shown in Figure 1.
In summary, the algorithm processes both the dataset and the detection video with the multi-scale super-resolution reconstruction algorithm, passes the processed dataset into the YOLOv7 recognition network to train the weights file, imports the trained weights into the DeepSORT tracking network, and assigns IDs to the tracked targets.
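To make the flow concrete, the sketch below outlines the three-stage pipeline in Python, using OpenCV only for frame reading. The helper names enhance_frame, detect_ships, and update_tracks are hypothetical placeholders standing in for the MSRN reconstruction, the improved YOLOv7 detector, and the improved DeepSORT tracker; this is an illustrative skeleton, not the authors' implementation.

```python
# Minimal pipeline sketch (hypothetical helper names, illustrative only).
import cv2

def track_video(video_path, enhance_frame, detect_ships, update_tracks):
    """Run super-resolution, detection, and tracking frame by frame."""
    cap = cv2.VideoCapture(video_path)
    all_tracks = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        sr_frame = enhance_frame(frame)       # MSRN super-resolution reconstruction
        detections = detect_ships(sr_frame)   # improved YOLOv7: [x1, y1, x2, y2, conf, cls]
        tracks = update_tracks(detections)    # improved DeepSORT: persistent IDs per ship
        all_tracks.append(tracks)
    cap.release()
    return all_tracks
```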

2.2. Image Preprocessing

Long-distance video of cruising ships suffers from poor resolution, which reduces the recognition accuracy of the tracked ship targets, so image enhancement techniques were introduced to preprocess the images, raising the resolution and improving recognition accuracy. In this study, we introduce a multi-scale super-resolution reconstruction network (MSRN) to enhance the resolution of the target images. In this network, the video is first decomposed into frames, and each low-resolution (LR) frame is input directly to the first convolutional layer M_0 for initial feature extraction; every subsequent stage between M_{n-1} and M_n shares the same structure, the multi-scale residual block (MSRB), which comprises multi-scale feature fusion and local residual learning. The multi-scale feature fusion can be expressed as follows:
$$ S_1 = \sigma\left( \omega_{3 \times 3}^{1} * M_{n-1} + b^{1} \right) $$
$$ P_1 = \sigma\left( \omega_{5 \times 5}^{1} * M_{n-1} + b^{1} \right) $$
$$ S_2 = \sigma\left( \omega_{3 \times 3}^{2} * [S_1, P_1] + b^{2} \right) $$
$$ P_2 = \sigma\left( \omega_{5 \times 5}^{2} * [P_1, S_1] + b^{2} \right) $$
$$ S' = \sigma\left( \omega_{1 \times 1}^{3} * [S_2, P_2] + b^{3} \right) $$
where ω and b represent the weights and biases, respectively; the superscript of ω denotes the layer index, the subscript of ω denotes the size of the convolutional kernel used in that layer, the bracketed terms denote the concatenation of feature maps, and σ represents the ReLU function [17]. The purpose of local residual learning is to improve network efficiency; each MSRB uses residual learning, and the multi-scale residual block is expressed as follows:
$$ M_n = S' + M_{n-1} $$
where M_{n-1} and M_n denote the input and output of the MSRB, respectively, and S' + M_{n-1} denotes element-wise addition performed through a shortcut connection. The fused feature channels are compressed to a fixed number of channels by taking an output from each M_n and feeding it into a 1 × 1 convolutional layer, which constitutes a hierarchical feature fusion structure (HFFS). Finally, the image is enlarged by a pixel-shuffle upscaling layer, and the final reconstructed image is obtained after a last convolution. The structure of the MSRN is shown in Figure 2.
As shown in Figure 2, the input target video and dataset are fed into the M_0 convolutional layer and then passed through eight MSRB stages. The nine resulting feature maps are passed through a 1 × 1 convolutional layer that compresses the fused feature channels to a fixed number of channels, the image is enlarged by the pixel-shuffle upscaling layer, and the final reconstructed image is obtained after one more convolution.
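For illustration, the following PyTorch sketch shows one possible implementation of a single MSRB following the multi-scale feature fusion and local residual equations above. The channel width and module layout are assumptions made for the sketch, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class MSRB(nn.Module):
    """Multi-scale residual block: parallel 3x3/5x5 branches, cross-scale
    fusion, a 1x1 bottleneck, and a local residual (shortcut) connection."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv3_1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv5_1 = nn.Conv2d(channels, channels, 5, padding=2)
        self.conv3_2 = nn.Conv2d(channels * 2, channels * 2, 3, padding=1)
        self.conv5_2 = nn.Conv2d(channels * 2, channels * 2, 5, padding=2)
        self.conv1x1 = nn.Conv2d(channels * 4, channels, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, m_prev):
        s1 = self.relu(self.conv3_1(m_prev))                  # 3x3 branch
        p1 = self.relu(self.conv5_1(m_prev))                  # 5x5 branch
        s2 = self.relu(self.conv3_2(torch.cat([s1, p1], 1)))  # cross-scale fusion
        p2 = self.relu(self.conv5_2(torch.cat([p1, s1], 1)))
        s = self.relu(self.conv1x1(torch.cat([s2, p2], 1)))   # 1x1 bottleneck
        return s + m_prev                                     # local residual learning
```

In the full MSRN described above, the outputs of M_0 and the eight MSRBs would then be concatenated, compressed by a 1 × 1 convolution (the HFFS), and upscaled by the pixel-shuffle layer.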

2.3. Improvement of YOLOv7 Detector

Adding an attention mechanism to the YOLOv7 network can significantly optimize overall performance while maintaining high accuracy and good detection speed. Common attention mechanisms include the coordinate attention mechanism (CoordAtt), the squeeze-and-excitation mechanism (SE), and the convolutional block attention mechanism (CBAM), all of which perform well [18]. In this study, we introduce shuffle attention (SA), which effectively combines the spatial attention mechanism with the channel attention mechanism using shuffle units. The SA mechanism groups the input features into g units, and the SA structure splits each group into two branches, a spatial attention branch and a channel attention branch. The channel attention branch uses global average pooling (GAP) to embed global information and generate channel statistics, which can be calculated by shrinking X_{k1} over the spatial dimension H × W:
$$ s = F_{gp}(X_{k1}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{k1}(i, j) $$
where X_{k1} is the feature map of the channel branch, F_{gp}(X_{k1}) denotes GAP over the feature map, and H and W are the spatial height and width, respectively. A sigmoid activation function is then used for accurate and adaptive feature selection, and the final output of the channel branch is given by
$$ X'_{k1} = \sigma(F_c(s)) \cdot X_{k1} = \sigma(W_1 s + b_1) \cdot X_{k1} $$
where σ represents the sigmoid activation, and the parameters W_1 and b_1 scale and shift s, respectively. Spatial attention complements channel attention: it uses group normalization (GN) to obtain spatial statistics, and F_c(·) is then used to enhance the feature representation of X_{k2}. The final spatial attention output is
$$ X'_{k2} = \sigma\left( W_2 \cdot \mathrm{GN}(X_{k2}) + b_2 \right) \cdot X_{k2} $$
where σ represents the sigmoid activation, and the parameters W_2 and b_2 scale and shift GN(X_{k2}), respectively. The two branches X'_{k1} and X'_{k2} are then concatenated so that the number of channels matches the number of input channels, i.e.,
$$ X'_k = [X'_{k1}, X'_{k2}] \in \mathbb{R}^{C/G \times H \times W} $$
where C is the number of channels and G is the number of groups into which the original feature map X is divided. Finally, after all the sub-features are aggregated, a channel shuffle operation rearranges the groups so that information can flow between them. The structure of the shuffle attention mechanism is shown in Figure 3.
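As a concrete reference, the simplified PyTorch sketch below follows the steps described above: grouping, a channel branch with GAP and a learned scale/shift followed by a sigmoid, a spatial branch with group normalization and a learned scale/shift followed by a sigmoid, concatenation, and a channel shuffle. The parameter shapes, initialization, and default group number are assumptions for illustration; this is not the exact module used in the paper.

```python
import torch
import torch.nn as nn

class ShuffleAttention(nn.Module):
    """Simplified shuffle attention: each group is split into a channel branch
    (GAP + scale/shift + sigmoid) and a spatial branch (GroupNorm + scale/shift
    + sigmoid); the branches are concatenated and a channel shuffle mixes the
    groups so information can flow between them."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.groups = groups
        c = channels // (2 * groups)                      # channels per branch
        self.cw = nn.Parameter(torch.zeros(1, c, 1, 1))   # W1 (channel branch)
        self.cb = nn.Parameter(torch.ones(1, c, 1, 1))    # b1
        self.sw = nn.Parameter(torch.zeros(1, c, 1, 1))   # W2 (spatial branch)
        self.sb = nn.Parameter(torch.ones(1, c, 1, 1))    # b2
        self.gn = nn.GroupNorm(c, c)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.view(b * self.groups, -1, h, w)             # split into g groups
        x_c, x_s = x.chunk(2, dim=1)                      # channel / spatial branches
        s = x_c.mean(dim=(2, 3), keepdim=True)            # global average pooling
        x_c = x_c * self.sigmoid(self.cw * s + self.cb)   # channel attention
        x_s = x_s * self.sigmoid(self.sw * self.gn(x_s) + self.sb)  # spatial attention
        out = torch.cat([x_c, x_s], dim=1).view(b, c, h, w)
        # channel shuffle: rearrange groups to exchange information between them
        return out.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
```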
The SA mechanism is then introduced into the three branches of the neck of the YOLOv7 network; the structure of the resulting network is shown in Figure 4.

2.4. Improvement of DeepSORT Algorithm

The DeepSORT algorithm is an improvement of the SORT algorithm. The bounding box of the target is obtained through YOLOv7 detection and recognition, and a Kalman filter is then applied to this bounding box to predict the target's trajectory. The Hungarian algorithm matches the detected bounding boxes of the next frame to the predicted boxes using intersection-over-union (IOU) [19], i.e., the overlap ratio between the prediction box and the actual bounding box. When the frame rate of the target ship's video is too low, the prediction box and the bounding box no longer overlap, the overlap ratio becomes 0, and the tracking box of the ship target is lost in the next frame. Therefore, we introduced Euclidean distance matching instead of IOU matching to handle ship target videos with low frame rates. The matching algorithm treats an image patch of n pixels as a feature group of n element values, forming an n-dimensional space in which each image is represented by a 1 × n array; the distance between two points is then calculated with the Euclidean distance formula, and the pair with the smallest distance is the best match. The Euclidean formula [20] is as follows:
$$ A = (x_1, x_2, \ldots, x_n) $$
$$ B = (y_1, y_2, \ldots, y_n) $$
$$ |AB| = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2} $$
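A minimal sketch of this matching step is shown below: pairwise Euclidean distances between the Kalman-predicted positions and the detected positions form the cost matrix fed to the Hungarian assignment in place of the IOU-based cost. The distance threshold and the use of box centers as the feature vectors are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def euclidean_match(track_points, det_points, max_dist=150.0):
    """Match Kalman-predicted positions to detections by Euclidean distance.

    track_points, det_points: arrays of shape (T, n) and (D, n), e.g. box
    centers (n = 2) or n-dimensional feature vectors. Returns a list of
    (track_idx, det_idx) pairs whose distance is below max_dist.
    """
    if len(track_points) == 0 or len(det_points) == 0:
        return []
    diff = track_points[:, None, :] - det_points[None, :, :]
    cost = np.sqrt((diff ** 2).sum(axis=-1))     # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)     # Hungarian assignment
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_dist]

# usage sketch: centers predicted by the Kalman filter vs. YOLOv7 detections
tracks = np.array([[120.0, 80.0], [300.0, 210.0]])
dets = np.array([[310.0, 205.0], [460.0, 90.0], [125.0, 83.0]])
print(euclidean_match(tracks, dets))   # [(0, 2), (1, 0)]
```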
The process of the improved DeepSORT algorithm is as follows in Figure 5. The sequence number marked by the arrow in the figure represents the improvement of the fusion algorithm.

3. Experimental Results

3.1. Comparison of Image Enhancement Results

This study selected five ship cruise images, several of which contained blurred ships, waves, fog, and other disturbances, simulating unstable conditions and interference at sea. We compared the image enhancement effects of the histogram equalization network, the FFA defogging network, the joint sharpening + contrast network, and the super-resolution network, as well as the accuracy of the bounding boxes identified by the YOLOv7 recognition network. The comparison is shown in Figure 6.
Because the identified ships are close together, the category labels are hidden in the bounding box comparison graph so that the bounding boxes can be observed easily. From the figure, we can see that the histogram equalization method adjusts the color of the blue ocean background and redistributes the image's pixels to enhance contrast. In a foggy environment, the FFA defogging network can clearly reveal the ship behind the fog, but increasing the brightness of the ship image causes small ships to merge with the waves, so targets are lost; it also has no effect on distant blurred targets. The joint sharpening + contrast network can adjust sharpening, brightness, and contrast and enhance the features of nearby targets, but it also reduces the overall clarity of the images. Finally, the MSRN super-resolution reconstruction network enhances the resolution of the ship target images and improves the clarity of distant targets.
The characteristic heatmap of the five algorithms is as follows in Figure 7.
The image feature heatmaps in Figure 7 show that, when the targets in the original image are extracted and compared, the histogram equalization algorithm and the FFA algorithm lose the selected targets. The super-resolution reconstruction algorithm MSRN proposed in this paper performs better: it improves the resolution of the ship images and enhances the clarity of distant targets, while reducing the recognition difficulty caused by water splashes without adjusting brightness or contrast. Therefore, MSRN is more suitable for ship target recognition and tracking than the other image enhancement methods. A comparison of YOLOv7 without image enhancement and YOLOv7 with MSRN image enhancement is given in Table 1.
As shown in Table 1, after the MSRN image enhancement method was introduced, the recall R of YOLOv7 increased by 2.3%, the precision P increased by 2.2%, and the mean average precision at an IOU threshold of 0.5 (mAP50) increased by 0.1%, an improvement over the original network.

3.2. Comparison of Attention Mechanism Effect

To evaluate the performance of the network after adding the SA attention mechanism, we used four parameters, P, R, mAP50, and mAP50-95, to compare the SA attention mechanism with CoordAtt, SE, the SimAM attention mechanism, CBAM, and the YOLOv7 network without an attention mechanism. P (precision) is the proportion of predicted positives that are truly positive; R (recall) is the proportion of actual positives that are correctly detected; mAP50 is the mean average precision at an IOU threshold of 0.5; and mAP50-95 is the mean average precision averaged over IOU thresholds from 0.50 to 0.95 in steps of 0.05. The results after 200 iterations are shown in Table 2.
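As a small worked example of the P and R definitions above (the counts are made up for illustration only):

```python
# Illustrative precision / recall computation (made-up counts).
tp, fp, fn = 90, 10, 30            # true positives, false positives, false negatives
precision = tp / (tp + fp)         # P: fraction of predicted positives that are correct
recall = tp / (tp + fn)            # R: fraction of actual positives that are detected
print(f"P = {precision:.2f}, R = {recall:.2f}")   # P = 0.90, R = 0.75
```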
As the data in Table 2 show, the SA mechanism was better than other attention mechanisms in terms of comprehensive performance. After 200 iterations, we can see from the values of mAP50-95 that the SA network improved by 9.4% compared to the network with no attention, 16.8% compared to the network with CoordAttention, 5.7% compared to the SE attention network, 11.1% compared to the SimAM network, and 8.3% relative to CBAM network, which indicates that optimizing the YOLOv7 network with the SA attention mechanism can enhance the recognition network’s ability to extract features, thereby improving the accuracy of recognition.

3.3. Comparison of Euclidean Distance Matching Effect

To modify the DeepSORT matching mechanism, the original IOU matching was replaced with Euclidean distance matching to handle trajectory tracking in low-frame-rate or missing-frame ship cruising videos. We compared the original IOU matching with Euclidean distance matching and observed whether the ID label of the ship target in the tracking video was reassigned because of trajectory interruption. We then selected low-frame-rate ship cruising videos to evaluate the Euclidean distance matching described above. The comparison is shown in Figure 8.
Figure 8b shows the low-frame-rate ship tracking video after optimization. Two of the three ships in Figure 8a underwent ID reassignment because of trajectory interruption, with their IDs updated to 5 and 6, respectively. After optimization, as shown in Figure 8b, only one of the three target ships had its ID updated, because that single target was boxed twice. The improved DeepSORT network therefore brings a clear benefit for low-frame-rate ship target tracking. In addition, we compared our algorithm with the SORT, CenterTrack, and DeepSORT algorithms to verify the performance of the improved tracking algorithm, as shown in Table 3.
As shown in Table 3, compared to the DeepSORT algorithm before improvement, MOTA improved by 5.2%, HOTA by 3.3%, and IDF1 by 6.6%, so we can conclude that for ship videos with low frame rates or missing frames, the improved DeepSORT tracking network further improves tracking effectiveness.
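For reference, MOTA aggregates false negatives, false positives, and identity switches over all frames relative to the number of ground-truth objects; the sketch below shows the standard formula with made-up counts (HOTA and IDF1 require dedicated evaluation toolkits and are not reproduced here).

```python
def mota(fn, fp, idsw, gt):
    """Multi-object tracking accuracy: 1 - (FN + FP + IDSW) / GT,
    where GT is the total number of ground-truth objects over all frames."""
    return 1.0 - (fn + fp + idsw) / gt

# made-up counts for illustration only
print(f"MOTA = {mota(fn=1000, fp=700, idsw=50, gt=7000):.3f}")   # MOTA = 0.750
```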

3.4. Ablation Experiments

In this paper, the images in the target recognition stage are preprocessed, and the YOLOv7 backbone network is improved. To evaluate the degree of performance optimization, we performed ablation experiments comparing YOLOv7+MSRN+SA with YOLOv7, YOLOv7+MSRN, YOLOv7+CBAM, YOLOv7+SA, YOLOv7+CBAM+SA, and YOLOv7+MSRN+CBAM. After 200 iterations, performance comparisons of the seven networks on four parameters, P, R, box loss, and mAP50, are shown in Figure 9, Figure 10, Figure 11 and Figure 12.
Figure 9 shows the comparison curve for box_loss, the CIoU-based error between the prediction box and the ground-truth box used to monitor bounding-box regression. A smaller box loss means more accurate bounding-box prediction. Although the proposed model was not optimal early on, its loss value decreased steadily with increasing iterations, making its predictions more accurate.
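For reference, the sketch below follows the standard CIoU definition (IOU minus a normalized center-distance term minus an aspect-ratio consistency term), with box_loss typically taken as 1 − CIoU; this is the general formulation, not code taken from the paper.

```python
import math

def ciou(box1, box2):
    """CIoU between two boxes (x1, y1, x2, y2): IoU - rho^2/c^2 - alpha*v."""
    x1, y1, x2, y2 = box1
    X1, Y1, X2, Y2 = box2
    # intersection over union
    iw = max(0.0, min(x2, X2) - max(x1, X1))
    ih = max(0.0, min(y2, Y2) - max(y1, Y1))
    inter = iw * ih
    union = (x2 - x1) * (y2 - y1) + (X2 - X1) * (Y2 - Y1) - inter
    iou = inter / union if union > 0 else 0.0
    # squared center distance over squared diagonal of the enclosing box
    rho2 = ((x1 + x2 - X1 - X2) ** 2 + (y1 + y2 - Y1 - Y2) ** 2) / 4.0
    c2 = (max(x2, X2) - min(x1, X1)) ** 2 + (max(y2, Y2) - min(y1, Y1)) ** 2
    # aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (math.atan((X2 - X1) / (Y2 - Y1))
                              - math.atan((x2 - x1) / (y2 - y1))) ** 2
    alpha = v / (1 - iou + v + 1e-9)
    return iou - rho2 / c2 - alpha * v

# box_loss = 1 - CIoU for one example pair of boxes
print(round(1 - ciou((0, 0, 10, 10), (2, 2, 12, 12)), 3))
```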
Figure 10 shows the comparison curve for the precision rate, the ratio of the number of positive targets correctly predicted by the model to the total number of predicted positive targets. A high precision rate indicates fewer false positives. The figure shows that the proposed model improved significantly compared with the other network models.
Figure 11 shows the comparison curve for the recall rate, the ratio of the number of positive targets correctly predicted by the model to the total number of actual positive targets. A high recall rate indicates that the model covers the true positive targets well, with fewer false negatives. The figure shows that the proposed model improved the recall rate significantly compared with the other network models.
Figure 12 shows the comparison of the mAP50 curves. mAP50 refers to the mean average precision (mAP) evaluation index in object detection with the IOU threshold set to 0.5, and it measures the accuracy of the model in target detection. The figure shows that the model in this paper was not ideal in the early stage, but as the number of iterations increased, it outperformed the other models after about 125 rounds, demonstrating the improvement of the model.
The summary table of the ablation experiments of YOLOv7 is as follows in Table 4.
As can be seen from the table above, the precision and mAP50 of the baseline model YOLOv7 were 95.1% and 97.9%, respectively. By adding two optimization modules, the MSRN module and the SA module, precision and mAP50 were improved compared to the baseline model. Specifically, the precision was improved by 2.2% and 2.8%, and the mAP50 was improved by 0.1% and 0.5%, respectively. The final combination of the two optimization modules yielded the best results, and both precision and mAP50 reached their highest values, 4.7% and 1.6% higher than the baseline model, respectively. Due to the combination of the two modules, the model in this paper achieved the best performance.

3.5. Confidence–MOTA Contrast Analysis

The comparison results of confidence–MOTA are as follows in Figure 13.
Figure 13 shows the comparison curves of confidence versus multi-object tracking accuracy. The x-axis is the confidence level, over the interval 40–70%; the y-axis is MOTA, over the interval 0–100%. From the curves, we can see that:
(1) As the confidence level increased, MOTA also increased. When the confidence level reached 55–60%, the MOTA of the four algorithms reached its highest value; as the confidence level increased further, all algorithms showed a downward trend.
(2) Compared with the other algorithms, the MOTA of the YOLOv7+DeepSORT algorithm was always the highest; when the confidence level was around 59%, the MOTA peaked at up to 70%.

3.6. Execution Time and Accuracy Analysis

The comparison of execution time results follows in Figure 14.
Figure 14 shows the execution-time comparison curves, where (a) compares 500 frames and (b) compares the first 100 frames. The execution time of each frame varies randomly within a fixed range. YOLOv7 (blue curve) averages about 120 ms per frame; YOLOv8 (red curve) about 140 ms; YOLOv9 (purple curve) about 120 ms; and YOLOv10 (green curve) about 110 ms. Thus, YOLOv10 was the fastest, YOLOv7 and YOLOv9 were in the middle, and YOLOv8 was the slowest.
The comparison results of recognition accuracy are as follows in Figure 15.
As we can see from the observations in Figure 15, as the number of training sessions increased, the accuracy of target recognition of the four kinds of networks increased gradually. After 200 iterations of training, the accuracy of YOLOv8 was the highest, and that of YOLOv7 was the second highest. The recognition accuracy of YOLOv9 and YOLOv10 was slightly lower, but the recognition accuracy of these four networks was more than 90%.

3.7. Discussion

(1) Diversity of datasets: future studies could consider using more datasets under different environmental conditions to validate the algorithm's performance in a broader context, which would help improve its generalization.
(2) Robustness: to handle more complex ocean environments and target motion patterns, the algorithm's robustness needs to be improved in future work; in particular, robustness under severe weather conditions remains a challenge.
(3) Practical deployment: future research can consider applying the algorithm in actual navigation and military scenarios to verify its performance and feasibility in real scenes.
(4) Unsupervised or semi-supervised learning: although labeled datasets are small, unlabeled data are abundant; research on unsupervised or semi-supervised detection algorithms to replace YOLOv7 would help mitigate the problem of insufficient datasets.

4. Conclusions

This paper aimed to solve the problems in video tracking in which the target ID is reassigned because the tracking target is lost due to low frame rates or missing frames in ship images, and in which clouds and waves interfere with the recognition network. By preprocessing the images, enhancing the feature extraction ability of the recognition network, and optimizing the matching mechanism of the DeepSORT network, the tracking of ship targets in low-frame-rate video was significantly improved. A limitation of this work is that the amount of ship data used was relatively small and the ship video dataset was limited, so a wider variety of ship recognition and tracking tests could not be performed. The network can be improved and applied more widely by expanding the ship image and video datasets.

Author Contributions

Methodology, X.H.; Writing—original draft, S.F.; Writing—review & editing, J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by (1) National Natural Science Foundation of China, grant number [62203405]; (2) Fundamental Research Program of Shanxi Province, China, grant number [20210302124545]; (3) Key Research and Development Program of Shanxi Province, China, grant number [202202110401015].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://universe.roboflow.com/.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ye, C.; Lu, T.; Xiao, B.; Lu, H.; Yang, Q. Current status and outlook of target detection of ships in maritime surveillance video. Chin. J. Image Graph. 2022, 27, 2078–2093. [Google Scholar]
  2. Durve, M.; Orsini, S.; Tiribocchi, A.; Montessori, A.; Tucny, J.M.; Lauricella, M.; Camposeo, A.; Pisignano, D.; Succi, S. Benchmarking YOLOv5 and YOLOv7 models with DeepSORT for droplet tracking applications. Eur. Phys. J. E 2023, 46, 32. [Google Scholar] [CrossRef] [PubMed]
  3. Shan, Y.; Liu, S.; Zhang, Y.; Jing, M.; Xu, H. LMD-TShip: Vision Based Large-Scale Maritime Ship Tracking Benchmark for Autonomous Navigation Applications. IEEE Access 2021, 9, 74370–74384. [Google Scholar] [CrossRef]
  4. Zhang, P. Design and Implementation of Marine Ship Tracking System Based on Multi-Target Tracking Algorithm. J. Coast. Res. 2020, 110, 47–49. [Google Scholar] [CrossRef]
  5. Hu, X.; Zhang, Q. Nighttime trajectory extraction framework for traffic investigations at intersections based on improved SSD and DeepSort. Signal Image Video Process. 2023, 17, 2907–2914. [Google Scholar] [CrossRef]
  6. Chen, Y.; Wu, B. Multi-target tracking algorithm based on YOLO+DeepSORT. J. Phys. Conf. Ser. 2022, 2414, 012018. [Google Scholar] [CrossRef]
  7. Wang, J.; Zhao, H. Improved YOLOv8 Algorithm for Water Surface Object Detection. Sensors 2024, 24, 5059. [Google Scholar] [CrossRef] [PubMed]
  8. Xavier, J.A.; Valarmathy, S.; Gowrishankar, J.; Devi, B.N. Multi-Class Flower Counting Model with Zha-KNN Labelled Images Using Ma-Yolov9. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 416–426. [Google Scholar] [CrossRef]
  9. Jia, H.; Li, B.; Fei, Q.; Wang, Q. Fast-Moving Target Tracking Based on KCF with Motion Prediction. In Proceedings of the 2023 42nd Chinese Control Conference (CCC), Tianjin, China, 24–26 July 2023; p. 6. [Google Scholar] [CrossRef]
  10. Yang, X.; Huang, X.; Huang, Z.; Zhu, L.; Li, X. A dynamic target tracking model for uavs based on the fusion of twin networks and deep learning. J. Phys. Conf. Ser. 2024, 2807, 012030. [Google Scholar] [CrossRef]
  11. Vatsavai, K.S.L.; Mantena, V.S.K. Camshift Algorithm with GOA-Neural Network for Drone Object Tracking. Ing. Syst. d’Inf. 2023, 28, 491. [Google Scholar] [CrossRef]
  12. Yu, G.; San, J.; Li, J. Real-time target tracking and recognition technology for ships based on improved convolutional neural network. Ship Sci. Technol. 2022, 44, 152–155. [Google Scholar]
  13. Song, H.; Zhang, X.; Song, J.; Zhao, J. Detection and tracking of safety helmet based on DeepSort and YOLOv5. Multimed. Tools Appl. 2022, 82, 10781–10794. [Google Scholar] [CrossRef]
  14. Hong, Y.; Ye, R.; Feng, G. A fish counting method in freshwater environment based on improved DeepSORT. Mar. Fish. 2023, 1–12. [Google Scholar] [CrossRef]
  15. Ren, C. Research on Hydroacoustic Target Recognition Technology Based on Joint Neural Network. Master’s Thesis, North University of China, Taiyuan, China, 2022. [Google Scholar]
  16. Lin, X.; Yao, L.; Sun, W.; Liu, Y.; Chen, J.; Jian, T. Target tracking of moving ships by GF-4 satellite under broken cloud environment. Space Return Remote Sens. 2021, 42, 127–139. [Google Scholar]
  17. Mohammad, B.S.; Ali, N.; Mahmoud, S. Evaluating the spatial effects of environmental influencing factors on the frequency of urban crashes using the spatial Bayes method based on Euclidean distance and contiguity. Transp. Eng. 2023, 12, 100181. [Google Scholar]
  18. Xie, T.; Yao, X. Smart Logistics Warehouse Moving-Object Tracking Based on YOLOv5 and DeepSORT. Appl. Sci. 2023, 13, 9895. [Google Scholar] [CrossRef]
  19. Wu, B.; Liu, C.; Jiang, F.; Li, J.; Yang, Z. Dynamic identification and automatic counting of the number of passing fish species based on the improved DeepSORT algorithm. Front. Environ. Sci. 2023, 11, 1059217. [Google Scholar] [CrossRef]
  20. Wu, H.; Wu, F.; Shang, S.; Liu, Z.; Yang, Y.; Li, D.; Li, P. Shadow tracking for airborne terahertz video-SAR based on SORT algorithm. In Proceedings of the SPIE-CLP Conference on Advanced Photonics, San Diego, CA, USA, 22–23 August 2023. [Google Scholar]
Figure 1. Algorithm flowchart.
Figure 2. MSRN network structure.
Figure 3. Shuffle attention network structure.
Figure 4. Improved YOLOv7 network structure.
Figure 5. Improved DeepSORT network structure.
Figure 6. Comparison of the effects of different image enhancement methods. Note: (a) original; (b) histogram-equalizer; (c) FFA; (d) sharpening+contrast enhancement; (e) MSRN super-resolution.
Figure 7. Characteristic heatmap performance of the above five algorithms. Note: (a) original; (b) histogram-equalizer; (c) FFA; (d) sharpening + contrast enhancement; (e) MSRN super-resolution.
Figure 8. Comparison of tracking effect before and after optimization. Note: (a) tracking effect before optimization; (b) tracking effect after optimization.
Figure 9. box_loss parameter comparison curve.
Figure 10. Precision rate parameter comparison curve.
Figure 11. Recall rate parameter comparison curve.
Figure 12. Parametric curve with a mean precision threshold of 0.5.
Figure 13. Comparison curve of confidence–MOTA.
Figure 14. Comparison curve of execution time. (a) The comparison curve of 500 frames; (b) the comparison curve of 100 frames.
Figure 15. Comparison curve of recognition accuracy.
Table 1. Contrast before and after image enhancement.

Model          R      P      mAP50
YOLOv7         0.768  0.951  0.979
MSRN+YOLOv7    0.791  0.973  0.98
Table 2. Comparison of attention mechanisms.

Metric     No Attention  CoordAtt  SE     SimAM  CBAM   SA
P          0.973         0.981     0.974  0.915  0.986  0.998
R          0.791         0.812     0.815  0.743  0.925  0.972
mAP50      0.98          0.994     0.994  0.992  0.99   0.995
mAP50-95   0.831         0.757     0.868  0.814  0.842  0.925
Table 3. The algorithm in this article compared with other algorithms.

Algorithm                    MOTA/%  HOTA/%  IDF1/%  IDSW
YOLOv7+SORT                  58.3    46.9    54.3    1126
YOLOv7+CenterTrack           61.7    48.5    56.8    834
YOLOv7+DeepSORT              66.2    55.8    69.5    673
The algorithm in this paper  71.4    59.1    76.1    549
Table 4. Comparison of YOLOv7 ablation experiments.

Model                                           R      P      mAP50
YOLOv7                                          0.768  0.951  0.979
YOLOv7+MSRN                                     0.791  0.973  0.98
YOLOv7+CBAM                                     0.838  0.975  0.988
YOLOv7+SA                                       0.841  0.979  0.984
YOLOv7+CBAM+SA                                  0.844  0.972  0.985
YOLOv7+MSRN+CBAM                                0.925  0.986  0.99
The target recognition algorithm in this paper  0.972  0.998  0.995
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
