*3.4. Improving DeepSORT Algorithm*

#### 3.4.1. GIOU Loss Function

IOU (Intersection over Union), also known as the intersection-over-union ratio, measures how accurately an object in a given dataset is detected. DeepSORT (Deep Simple Online and Realtime Tracking) uses the IOU between the detection box and the tracking box as the cost matrix in its association algorithm. IOU is scale-invariant and takes values in [0, 1]; it is computed as shown in Equation (8):

$$\text{IOU} = \frac{S_{A \cap B}}{S_{A \cup B}} \tag{8}$$

where *A* is the predicted box, *B* is the real box, and *S* denotes area. If IOU is used as a measure of the overlap between boxes, the following problems arise:

(1) IOU is always 0 when there is no overlap between the prediction box and the real box, as shown in Figure 9, state 1, where the red prediction box and the blue real box have no intersection, and the value of IOU is 0.

**Figure 9.** Schematic diagram of different overlapping shapes of IOU. (**a**) status1: IOU = 0. (**b**) status2: IOU = 0.38. (**c**) status3: IOU = 0.38.

(2) When boxes intersect with equal IOU values, IOU cannot distinguish between the different overlap configurations. Many overlapping shapes can yield the same IOU value yet differ in quality. As shown in Figure 9, state 2 and state 3, the IOU between the prediction box and the real box equals 0.38 in both states, but state 2 is a vertical intersection while state 3 is a horizontal intersection.

To solve the above problems, this paper replaces IOU with GIOU (Generalized Intersection over Union) in the DeepSORT algorithm. GIOU loss considers not only the overlapping region but also the non-overlapping region, which distinguishes cases with the same IOU but different forms of overlap and provides a meaningful measure of the gap between non-overlapping boxes. GIOU takes values in [−1, 1] and is computed as follows:

$$\text{GIOU} = \text{IOU} - \frac{|C - (A \cup B)|}{|C|} \tag{9}$$

where *C* is the smallest enclosing rectangle of the prediction box and the target box, as shown in Figure 10a, and *C* − (*A* ∪ *B*) in Equation (9) is the difference set, shown as the blue region in Figure 10b.
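Equations (8) and (9) can be sketched for axis-aligned boxes as follows; this is an illustrative implementation, not the paper's code, with boxes given as (x1, y1, x2, y2) corner coordinates:

```python
def giou(box_a, box_b):
    """GIOU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection area
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    # Union area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing rectangle C
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    c = (cx2 - cx1) * (cy2 - cy1)
    # Equation (9): IOU minus the normalized area of C outside A∪B
    return iou - (c - union) / c

print(round(giou((0, 0, 2, 2), (4, 0, 6, 2)), 2))  # -0.33: disjoint boxes still compare
print(giou((0, 0, 2, 2), (0, 0, 2, 2)))            # 1.0 for a perfect match
```

Unlike IOU, the result for the disjoint pair is nonzero and grows more negative as the gap widens, which is exactly the property the association step needs.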

As shown in Figure 11, suppose A is the ship target at frame n, and B and C are ship targets at frame n + 1. The IOU of A and B is 4/28 ≈ 0.14, and the IOU of A and C is also 4/28 ≈ 0.14; since the two IOU values are equal, IOU alone cannot distinguish the two cases. However, the GIOU of A and B is 4/28 − (36 − 28)/36 ≈ −0.08, while the GIOU of A and C is 4/28 − (28 − 28)/28 ≈ 0.14, which shows that the correlation between A and C is greater than that between A and B. Therefore, in the ship tracking task, ship C is preferred as the target position of ship A in the next frame, which is also consistent with the fact that inland ships travel slowly and deform little across a video sequence.
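The arithmetic above can be checked directly from the stated areas (intersection, union, and enclosing rectangle taken from Figure 11); a minimal numeric sketch:

```python
def giou_from_areas(inter, union, enclose):
    """GIOU from precomputed areas: IOU minus the enclosing-box penalty."""
    return inter / union - (enclose - union) / enclose

print(round(giou_from_areas(4, 28, 36), 2))  # A vs B: -0.08
print(round(giou_from_areas(4, 28, 28), 2))  # A vs C: 0.14
```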

**Figure 10.** Schematic diagram of GIOU. (**a**) C is the minimum enclosing rectangle. (**b**) The area of C − (A ∪ B).

**Figure 11.** Schematic diagram of ship intersection between simulated frames. A is detection of n frame. B and C are detections of n + 1 frame.

#### 3.4.2. KM Association Algorithm

In multi-target tracking tasks, the main purpose of data association is to match multiple targets between frames, handling newly appearing targets, the disappearance of old targets, and the ID matching problem between the previous frame and the current frame. DeepSORT's default data association uses the Hungarian algorithm, whose core idea is to find a maximum matching of a bipartite graph by searching for augmenting paths, as shown in Table 1.

**Table 1.** Inter-frame target matching.


M1~M4 are the four tracked targets in the nth frame, and N1~N4 are the four newly detected targets in the (n + 1)th frame; the degree of association between targets is measured by the GIOU loss function of the previous section. Since M4 is not associated with any detected target in the new frame, M4 corresponds to a lost old target; N4 is a newly detected target with no associated track, so it is treated as a newly appearing target and assigned a new ID. The association algorithm discussed in this section then solves the matching problem between M1~M3 and N1~N3. If the unweighted Hungarian algorithm is used with a threshold of 0.5, the matching result is generally: M1 matches N1, M2 matches N2, and M3 matches N3. The Hungarian algorithm treats any pair whose association exceeds the threshold as matched, so it considers M2 matched with both N2 and N3 while ignoring the fact that M2 is more strongly correlated with N3. This matching method, which treats all qualifying edges as equivalent, leads to low tracking accuracy.
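The augmenting-path idea behind the Hungarian matching described above can be sketched as follows; the adjacency lists here are hypothetical, standing in for track–detection pairs whose GIOU-based association exceeds the threshold:

```python
def max_bipartite_matching(adj):
    """Maximum bipartite matching by augmenting paths (Kuhn's algorithm).
    adj[i] lists the detections compatible with track i, e.g. the pairs
    whose association score exceeds the threshold."""
    match = {}  # detection j -> track i currently matched to it

    def try_augment(i, visited):
        for j in adj[i]:
            if j not in visited:
                visited.add(j)
                # j is free, or its current track can be re-matched elsewhere
                if j not in match or try_augment(match[j], visited):
                    match[j] = i
                    return True
        return False

    matched = sum(try_augment(i, set()) for i in range(len(adj)))
    return matched, match

# Hypothetical compatibility lists for M1~M3 vs N1~N3 (0-indexed)
count, pairs = max_bipartite_matching([[0], [1, 2], [0, 1, 2]])
print(count, pairs)  # 3 matches found
```

Note that this unweighted matching only maximizes the *number* of matched pairs; which of several valid pairings it returns depends on edge order, which is exactly the weakness the KM algorithm addresses.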

The KM (Kuhn–Munkres) algorithm improves on the Hungarian algorithm by incorporating edge weights, achieving optimal-weight matching on top of the Hungarian algorithm. The steps for solving the target tracking problem with the KM algorithm [11] are as follows. The targets detected in the nth and (n + 1)th frames form the point sets M and N, respectively, and the GIOU between the detection box and the prediction box is used as the weight of the edge connecting each pair of points. Each vertex M<sub>i</sub> in M is initialized with weight W equal to the maximum edge weight among its incident edges, and each vertex N<sub>i</sub> in N is initialized with weight 0. If M<sub>i</sub> + N<sub>i</sub> = W<sub>ij</sub> is satisfied, then M<sub>i</sub> and N<sub>i</sub> are matched; if not, the conflicting vertices in M have their weights reduced by *d* and the conflicting vertices in N have their weights increased by *d*, where *d* is set to 0.1 here. The specific process is shown in Figure 12.

**Figure 12.** Schematic diagram of KM algorithm. (**a**) Initialize W, d. (**b**) Resolve conflict.

In Figure 12a, the KM algorithm initializes W for targets M1~M3 with each one's maximum-weight edge and initializes the d values of targets N1~N3 to 0. After initialization, both M1 and M3 are matched with N1; attempts to switch M1 and M3 to other edges fail to satisfy M<sub>i</sub> + N<sub>i</sub> = W<sub>ij</sub>, so a conflict arises. To resolve the conflict, the KM algorithm subtracts 0.1 from the W values of M1 and M3 and adds 0.1 to the d value of N1. At this point, M3 and N2 satisfy 0.8 + 0 = 0.8, and M1 and N1 satisfy 0.7 + 0.1 = 0.8. The matching result obtained by the KM algorithm is: M1 matches N1, M2 matches N3, and M3 matches N2. The KM algorithm (total weight 0.8 + 0.9 + 0.8 = 2.5) outperforms the Hungarian algorithm (total weight 0.8 + 0.6 + 0.5 = 1.9), yielding matches with greater total correlation.
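A minimal sketch of the KM labeling scheme described above. The weight matrix is illustrative, filled in to be consistent with the edge values quoted in the text (entries not stated in the paper are assumed small), and the label adjustment uses the minimal slack rather than the paper's fixed d = 0.1:

```python
def km_assign(W):
    """Kuhn–Munkres maximum-weight matching via vertex labeling.
    lx / ly are the labels of the left (track) and right (detection)
    vertices; a conflict is resolved by lowering lx and raising ly by
    the minimal slack delta (the paper fixes this step at d = 0.1)."""
    n = len(W)
    lx = [max(row) for row in W]   # left labels start at max incident edge
    ly = [0.0] * n                 # right labels start at 0
    match = [-1] * n               # match[j] = row matched to column j

    def find(i, vis_x, vis_y):
        vis_x.add(i)
        for j in range(n):
            # only tight edges (lx[i] + ly[j] == W[i][j]) are usable
            if j not in vis_y and abs(lx[i] + ly[j] - W[i][j]) < 1e-9:
                vis_y.add(j)
                if match[j] == -1 or find(match[j], vis_x, vis_y):
                    match[j] = i
                    return True
        return False

    for i in range(n):
        while True:
            vis_x, vis_y = set(), set()
            if find(i, vis_x, vis_y):
                break
            # conflict: adjust labels by the minimal slack
            delta = min(lx[x] + ly[y] - W[x][y]
                        for x in vis_x for y in range(n) if y not in vis_y)
            for x in vis_x:
                lx[x] -= delta
            for y in vis_y:
                ly[y] += delta
    return match

# Illustrative weights between tracks M1~M3 (rows) and detections
# N1~N3 (columns); only the values quoted in the text are real.
W = [[0.8, 0.1, 0.1],
     [0.1, 0.6, 0.9],
     [0.8, 0.8, 0.5]]
match = km_assign(W)
print(match)                                   # match[j] = track index for detection j
print(sum(W[match[j]][j] for j in range(3)))   # total matched weight
```

On this matrix the algorithm recovers exactly the assignment described in the text (M1–N1, M2–N3, M3–N2) with total weight 2.5, versus 1.9 for the threshold-only result.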

#### **4. Experimental Results and Analysis**

#### *4.1. Network Training Experiments*

Due to the complex conditions of inland waters, changeable weather, and the diversity of inland vessel types, the dataset requires a large number of data sources. Data were collected in three main ways in this section: (1) ship images were collected and screened through search engines such as Baidu, Google, and Bing [24]; (2) high-definition surveillance cameras were installed at fixed locations on both banks of the Wuhan section of the Yangtze River, and images were cropped from videos of ship navigation captured between 11 June 2019 and 17 November 2019; (3) ship images were captured with a digital camera in the Changjiang River Basin of Wuhan City, at locations such as the Erqi river bank, Tianxingzhou Ferry Port, and the Hankou river bank, at a frequency of one frame per second, from 21 July 2020 to 26 November 2020. Data were collected from many locations and over a large time span, so the collected ship dataset meets the requirements for large data volume and diverse sample types. The number of data sources is summarized in Table 2, the six ship categories are shown in Figure 13, and the statistical information of the ship image data is shown in Table 3.

**Table 2.** Statistics of the number of data sources.


**Figure 13.** Vessel classification. (**a**) ore carrier. (**b**) bulk cargo carrier. (**c**) general cargo ship. (**d**) container ship. (**e**) fishing boat. (**f**) passenger ship.

**Table 3.** Number and proportion of ship types.


The ratio of the training set, validation set, and test set was 16:3:1. In order to ensure the objectivity of the experimental results, the hyper-parameter settings were consistent for different models, and some of the hyper-parameter settings related to the experiments are shown in Table 4.

**Table 4.** Hyper-parameter settings.


#### *4.2. Chimney Inspection Experiments*

The hierarchical detection results are shown in Figure 14 below, where each column represents one set of data. The first and second rows show the raw data from the visible-light and infrared cameras, respectively. The third row shows the first-level detection, i.e., the result of the deep learning detector. The fourth row takes the first-level detection result and applies the two-step Otsu binarization algorithm to obtain the higher-temperature region, further filtering out non-chimney highlighted regions by comparing the upper and lower regions. The fifth row filters out noise points with an image erosion operation and enlarges the chimney candidate region with a dilation operation. The sixth row selects the chimney candidate region with the maximum area and draws its contour as the final chimney detection area. Narrowing the search range with the first-level detection reduces background interference and improves the accuracy of ship chimney detection.
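The two-step Otsu idea can be sketched on a raw infrared intensity array as follows; this is a minimal illustration with a histogram-based Otsu threshold and a synthetic frame (the morphological filtering and contour steps are omitted):

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Otsu's threshold: pick the histogram cut that maximizes the
    between-class variance of the two resulting classes."""
    hist, edges = np.histogram(values, bins=bins)
    p = hist / hist.sum()
    omega = np.cumsum(p)                     # class-0 probability
    mu = np.cumsum(p * np.arange(bins))      # class-0 cumulative mean
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu[-1] * omega - mu) ** 2 / (omega * (1 - omega))
    return edges[np.nanargmax(sigma_b) + 1]

def two_step_otsu(ir_image):
    """First pass separates warm pixels from the background; the second
    pass, restricted to the warm pixels, isolates the hottest region
    (the chimney candidates)."""
    t1 = otsu_threshold(ir_image.ravel())
    t2 = otsu_threshold(ir_image[ir_image > t1])
    return ir_image > t2

# Synthetic infrared frame: cool background, warm hull, hot chimney
img = np.zeros((20, 20))
img[5:15, 5:15] = 120    # ship hull
img[8:12, 8:12] = 240    # chimney
mask = two_step_otsu(img)
print(mask.sum())        # only the hottest (chimney) region survives
```

A single Otsu pass would keep the whole warm hull; the second pass, run only inside the warm region, is what isolates the chimney candidates.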

**Figure 14.** Chimney inspection. (**a**) Visible light input diagram. (**b**) Infrared camera input diagram. (**c**) First-level test results. (**d**) Optimization of detection range. (**e**) Secondary test results. (**f**) The results are displayed in the original image.

#### *4.3. Model Evaluation*

#### 4.3.1. Evaluation Index of Ship Detection Model

We conducted numerical experiments on YOLOV3 [10] and our ship detection method. To evaluate the performance of the two models, we used four evaluation metrics: Precision, Recall, mAP (mean average precision), and F1 (F1-Measure). The calculation formulas are given in Equation (10), where AP is the area under the smoothed P-R curve computed by integration, and mAP is the mean of the AP values over all categories.

$$\begin{cases} \text{precision} = \frac{TP}{TP + FP} \\ \text{recall} = \frac{TP}{TP + FN} \\ \text{ F1} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \end{cases} \tag{10}$$

where *TP* is a positive sample correctly predicted as positive, *FP* is a negative sample incorrectly predicted as positive, *TN* is a negative sample correctly predicted as negative, and *FN* is a positive sample incorrectly predicted as negative.
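Equation (10) in code, with hypothetical counts for illustration:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from detection counts, per Equation (10)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 90 true positives, 10 false positives, 20 false negatives
p, r, f1 = detection_metrics(tp=90, fp=10, fn=20)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.9 0.818 0.857
```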

After 100 rounds of iterative training using transfer learning, the training results are shown in Figure 15. The abscissa represents the number of training iterations, and the ordinates indicate precision, average recall, average precision, and F1, respectively. As can be seen from the result graphs, after the number of iterations reaches 40 rounds, the four basic metrics stabilize at about 92%.

**Figure 15.** Results after 100 iterations. (**a**) Accuracy results. (**b**) Recall results. (**c**) mAP results. (**d**) F1 results.

To illustrate the effectiveness of our model, we carried out experiments in the same software and hardware environment; the specific parameters are shown in Table 5. The calculation times of the YOLOV3 model and our model were recorded. To ensure a fair comparison of time cost, the calculation time was divided into two parts: training time and validation time. The time consumed by each epoch is reported in Table 4.


From the data in Table 4, we can see that the average calculation time per epoch of YOLOV3 is 9.46, while that of our model is lower, at 6.83. In terms of validation time, our model is also faster. Therefore, our model is well suited to real-time tasks such as the detection and tracking of ship chimneys in inland rivers.

Under the same hyper-parameters, dataset, and experimental environment, the improved network and the original YOLOV3 network were compared experimentally; the results for each category are shown in Figure 16. The horizontal coordinates are the confidence values, the vertical coordinates are the values of each metric at the given confidence level, and the overall values are shown in Table 6.


**Figure 16.** Graph of the results of various ship indicators on the test set.



From the experimental results, our model reaches a precision of 1 at a confidence threshold of 0.784, while the original model requires a threshold of 0.861 to reach 1. The confusion matrix also shows that the improved model produces significantly fewer false detections than the original model, and the detector in this paper outperforms the YOLOV3 model in terms of precision, recall, mAP, and F1, and especially in detection speed, which is significantly higher than that of the original model. The visualization of the detection results is shown in Figure 17.

#### 4.3.2. Ship Tracking Model Evaluation Index

In this paper, three metrics were chosen to evaluate the effectiveness of multiple object tracking: (1) ID switch, the number of times the target label changes along a tracking trajectory; the smaller the value, the better. (2) Multiple object tracking accuracy (MOTA), which mainly accounts for the matching errors of all objects during tracking, namely FP, FN, and ID switches. MOTA gives a very intuitive measure of how well the tracking algorithm detects objects and maintains trajectories, independent of the precision with which target positions are estimated. A larger MOTA value indicates better model performance. MOTA is calculated as:

$$\text{MOTA} = 1 - \frac{\sum \left( A_{FP} + A_{FN} + A_{ID} \right)}{\sum A_{GT}} \tag{11}$$

where *A<sub>FP</sub>* is the number of false positives, *A<sub>FN</sub>* is the number of false negatives, *A<sub>ID</sub>* is the number of ID switches, and *A<sub>GT</sub>* is the number of labeled targets. (3) FPS, the number of image frames processed per second by the model; the larger the value, the better. To verify the performance of this paper's method in chimney tracking, the test was conducted on video surveillance data of the Yangtze River Bridge, and the test results are shown in Figure 18, with the corresponding metrics reported in Table 7.
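Equation (11) in code, with hypothetical per-frame counts; the sums run over the whole sequence before the ratio is taken:

```python
def mota(fp, fn, id_sw, gt):
    """Multiple object tracking accuracy, Equation (11): per-frame error
    counts are summed over the sequence, then normalized by the total
    number of ground-truth targets."""
    return 1 - (sum(fp) + sum(fn) + sum(id_sw)) / sum(gt)

# Hypothetical per-frame counts for a 3-frame clip with 10 targets per frame
print(round(mota(fp=[1, 0, 1], fn=[0, 2, 0], id_sw=[0, 1, 0], gt=[10, 10, 10]), 3))  # 0.833
```

Because errors are summed before dividing, a single bad frame cannot dominate the score the way a per-frame average of ratios would.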

**Figure 17.** Visualization of test results.

**Figure 18.** Vessel tracking visualization.

Among them, the blue box is the box produced by the deep learning detector, and the yellow box is the final box after the Kalman filter update. From the experimental results, it can be seen that the ID jump frequency decreases when the target is occluded, and the accuracy rate increases by 0.04. The specific evaluation indexes of the experimental results are shown in Table 7.

**Table 7.** Results of tracking metrics in the test set.


#### **5. Conclusions**

In this paper, we propose a deep learning-based multi-sensor hierarchical detection and tracking method for inland river ship chimneys, which makes full use of the image characteristics of different sensors and applies a hierarchical strategy to problems encountered in practical engineering. The method uses visible-light images, which are rich in feature information, together with deep neural networks to detect inland river ships and filter out irrelevant background information, and then exploits the infrared camera's sensitivity to temperature to locate ship chimneys, ensuring high detection accuracy in inland waters with complex backgrounds. The reliability and practicality of the method are demonstrated by field experiments, contributing to automated monitoring of air pollution.

**Author Contributions:** Conceptualization, F.W. and Q.C.; methodology, F.W.; software, F.W.; validation, F.W.; formal analysis, Q.C.; investigation, F.W.; resources, Y.W. and C.X.; data curation, F.W. and F.Z.; writing—original draft preparation, F.W., F.Z. and Q.C.; writing—review and editing, Q.C.; visualization, Y.W.; supervision, Y.W.; project administration, Y.W.; funding acquisition, Y.W. and C.X. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Natural Science Foundation of Shandong Province under Grant ZR2020KE029; by the National Natural Science Foundation of China under Grant 52001241; by the 111 Project (B21008); by the Zhejiang Key Research Program under Grant 2021C01010.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data presented in this study are openly available in [SeaShips] at [doi:10.1109/TMM.2018.2865686], reference number [24].

**Conflicts of Interest:** The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

#### **References**

