*Article* **Multi-Sensor-Based Hierarchical Detection and Tracking Method for Inland Waterway Ship Chimneys**

**Fumin Wu <sup>1</sup>, Qianqian Chen <sup>1,\*</sup>, Yuanqiao Wen <sup>2,3</sup>, Changshi Xiao <sup>1,3,4</sup> and Feier Zeng <sup>1</sup>**


**Abstract:** To locate the detection area for ship exhaust behavior quickly and accurately, this paper proposes a deep learning-based multi-sensor hierarchical detection method for tracking inland waterway ship chimneys. First, the primary detection uses a convolutional neural network-based target detector to extract the ship area in the visible image; the secondary detection then applies the Otsu binarization algorithm and image morphology operations to the infrared image, constrained by the primary detection results, and obtains the chimney target by combining location and area features. Further, an improved DeepSORT algorithm is applied to track the ship chimney. The results show that the multi-sensor-based hierarchical detection and tracking method achieves real-time detection and tracking of ship chimneys and can provide a technical reference for the automatic detection of ship exhaust behavior.

**Keywords:** ship exhaust behavior; detection and tracking; multi-sensor; deep learning; morphological operation

## **1. Introduction**

The construction of the Yangtze River Economic Belt is one of the key strategies of national cross-regional coordinated development, and both the "Yangtze River Protection" and the "Yangtze River Green Ecological Corridor" are top priorities of that construction. The International Maritime Organization (IMO) has mandated a gradual reduction of nitrogen oxide and other gas emissions [1], and a regulation on sulfur emissions from ships sailing in global waters has been in effect since 1 January 2020 [2]. In addition, the design of ships' air intake ports and exhaust gas outlets is being modified in accordance with IMO requirements [3]. However, ship exhaust detection currently depends on high-sensitivity gas sensors, which makes evidence difficult to obtain. The study of Pankratova N.V. showed that ship exhaust emission data are correlated with ship chimneys [4]; therefore, detecting and tracking ship chimneys with computer vision technology is one of the most important tools for scientific and efficient supervision.

Ship chimney detection is the core research content of this paper, and ship detection is its prerequisite and a key technical point. Since the ship chimney is a small target with inconspicuous features, and the available chimney datasets are very small, it is very difficult to detect the chimney directly; in contrast, the ship is a relatively large target with obvious features compared with the chimney, and ship datasets are comparatively large.

**Citation:** Wu, F.; Chen, Q.; Wen, Y.; Xiao, C.; Zeng, F. Multi-Sensor-Based Hierarchical Detection and Tracking Method for Inland Waterway Ship Chimneys. *J. Mar. Sci. Eng.* **2022**, *10*, 809. https://doi.org/10.3390/jmse10060809

Academic Editor: Rafael Morales

Received: 18 March 2022; Accepted: 12 May 2022; Published: 13 June 2022

However, small target detection remains a difficult and open challenge in computer vision. In visible images, both traditional target detection methods based on manually designed feature operators and deep learning-based methods have yet to reach satisfactory accuracy on small targets. Beyond the small target size itself, the detection of inland waterway ship chimneys is further hampered by the limited feature information a chimney carries. In infrared images, the camera has a small field of view and the acquired image information is not rich. Although an infrared camera is more sensitive to high-temperature regions and a ship chimney is a high-temperature object, using an infrared camera alone to detect the chimney is not robust, because the ship itself or its cargo can be hot after exposure to the sun, and background buildings and water reflections in the inland waterway also interfere.

Based on the above problems, we observe that, although the ship chimney target is small in the visible band image, the ship target is relatively large and rich in information, so deep learning can first be used to detect the ship in the visible band image, narrowing down the search range for the chimney. Then, since the infrared band is more sensitive to high-temperature regions, detecting the chimney within this small area becomes much easier. Therefore, ship chimney detection can be achieved by combining the characteristics of different sensor images, which also facilitates the subsequent tracking.

The remainder of this paper is organized as follows. Related work is introduced in Section 2. In Section 3, we discuss the whole methodology of our algorithm. The experiments and model prediction performance are reported in Section 4. Finally, the work is concluded in Section 5.

#### **2. Related Work**

A large number of scholars have conducted research on ship detection based on computer vision techniques. According to the type of technology used, this research can be divided into traditional methods and deep learning-based methods. Most traditional methods are designed to detect or recognize a specific scene. Arshad et al. [5] first processed the ship background image using morphological operations, and then used the Sobel operator to perform edge detection of the ship to discriminate it from its background, but the method is not effective for complex textures, which contain more noise. Zhang X. designed a rotated Gaussian mask to model the ship, while contextual information was used to enhance the perception of the ship [6]. Wang Y. et al. [7] proposed a ship detection algorithm based on a background difference method, but the algorithm was aimed at ship detection against a static background, and did not identify, classify, or track targets. Tang Y. et al. [8] adopted multi-vision fusion technology to analyze and detect ship targets by monitoring through local entropy and connected domains; this required two scans of each image, which was inefficient, and the threshold had a great influence on the final result. Shi W. et al. [9] proposed morphology with multiple structural elements to extract the edge features of ships, which can fully retain various details of ships while filtering out background noise such as waves, but it is difficult to detect small targets.

In addition to the traditional vision-based methods mentioned above, deep learning-based methods are currently the mainstream for ship detection; notable deep learning target detectors include the R-CNN, YOLO, and SSD series. Cui Z.Y. used a pyramidal structure to connect the convolutional block attention module (CBAM) closely with each feature map from top to bottom of the pyramid network, extracting rich features containing resolution and semantic information for multi-scale ship detection [10]. Subsequently, Cui Z.Y. proposed a CenterNet-based method for ship detection in large SAR images that locates the centroid of the target by key point estimation, which can effectively avoid missed detections of small target ships [11]. In contrast, Chen X.Q. used a convolutional neural network in the YOLO model to extract multi-scale ship features from the input ship images; multiple bounding boxes (i.e., potential ship positions) were then generated based on target confidence, and, finally, interference from background bounding boxes was suppressed to obtain the ship positions in each image. Chen X.Q. further analyzed the spatio-temporal behavior of ships in continuous ocean images based on the ships' kinematic information [12]. Shao Z.F. used a CNN framework based on depth features, saliency maps, and a coastline prior, integrating discriminative ship features to detect ship class and location [13]. Yang X. proposed a dense feature pyramid network to detect ships in different scenarios, including the open ocean and ports, to address the problem caused by narrow ship width [14].

In recent years, deep learning methods have been successfully applied to ship detection in synthetic aperture radar (SAR) images. Wei S.J. proposed a high-resolution ship detection network based on high-resolution and low-resolution convolutional feature mapping for ship detection in high-resolution SAR images [15]. Similarly, Lin Z. et al. proposed a new Faster R-CNN-based network structure for high-resolution SAR images that further improves ship detection performance by using a squeeze-excitation mechanism [16,17]. Jin L. et al. used the SSD model, added a feature fusion module to the shallow feature layer to optimize the feature extraction capability for small objects, and then added the squeeze-and-excitation network (SE) module to each feature layer to introduce an attention mechanism, achieving small-scale ship detection in remote sensing images [18,19]. Wang Y. combined the single-shot multibox detector (SSD) with transfer learning to solve the ship detection problem in complex environments, such as oceans and islands [20]. Sun J., based on the SSD model, integrated dilated convolution with multi-scale feature fusion to improve small target detection accuracy [21]. Similarly, to improve small target detection accuracy, Chen P. embedded the feature pyramid model into the traditional RPN, and then mapped it to a new feature space for object recognition [22]. The detection of multi-scale SAR ships remains a great challenge due to the strong interference and wide variation of scales in the offshore background.

This paper proposes a deep learning-based multi-sensor hierarchical detection and tracking algorithm for ship chimneys. First, the first-level detection feeds the visible light image into a deep learning target detector to detect the ship target, which greatly reduces the detection range and suppresses background interference. Then, because the chimney target is too small to be identified directly, infrared imaging is adopted for the second-level detection, with the first-level detection result used as its input. The chimney candidate is extracted through a two-step Otsu binarization algorithm followed by image erosion and dilation operations. Finally, according to prior knowledge of the chimney orientation, the candidate area is bisected to further reduce the detection range, and the final chimney target is extracted by combining area features. The improved DeepSORT tracking algorithm is then used to track the ship chimney, which provides support for ship exhaust monitoring.

#### **3. Algorithm Design**

#### *3.1. Algorithmic Framework*

The framework of the multi-sensor hierarchical detection and tracking algorithm is shown in Figure 1, and is divided into four parts: data input, detection, tracking, and data output. The input data come from an infrared camera and a visible camera, and the detection stage is divided into primary detection and secondary detection. The primary detection uses an improved YOLOv3 (proposed by Joseph Redmon in 2018) as the ship detector, improved in two respects: the anchor (prior box) design and the feature pyramid output. The secondary detection crops the ship area from the infrared image according to the primary detection result, and then filters the background with Gaussian filtering and an adaptive threshold selection algorithm to obtain the candidate chimney area. Since prior knowledge tells us that the chimneys detected in this paper are located in the upper part of the ship area, the area bisection method is used to narrow the detection range again. Finally, the contour with the maximum area is taken as the final detection result.

**Figure 1.** General framework of ship chimney detection and tracking algorithm.

The tracking is performed using the improved DeepSORT [23] algorithm, which mainly consists of a target detection module and a data association module. In real-time target tracking, DeepSORT first extracts the deep features of the target, then uses Kalman filtering to make predictions, associates the sequence data, and performs target matching. The algorithm is improved mainly in the cost matrix calculation and the association step. The main steps of the improved DeepSORT tracking are as follows:

(1) Create Tracks according to the results detected in the first frame, and initialize the Kalman filter. Tracks are initially in the Unconfirmed state and are converted to the Confirmed state only after being tracked successfully three times in a row.

(2) Calculate the cost matrix between the tracked target in Tracks and the detected target in the current frame using the improved GIOU.

(3) The cost matrix in Step (2) is input to the improved KM data association algorithm, which yields three kinds of matching results. The first category, Matched Tracks, is the tracks matched to detection results, indicating that the current frame continues a target from the previous frame; the values in Tracks are then updated by Kalman filtering. The second category, Unmatched Detections, is the detections with no matching track, meaning that a target detected in the current frame is new and unrelated to previous detections, so a new track is added. The third category, Unmatched Tracks, is the tracks with no matching detection, meaning that a trajectory existing in the previous frame is lost in the current frame; if the track is in the unstable Unconfirmed state, it is deleted directly, while if it is in the stable Confirmed state, its missed-frame counter is increased by 1, and when the counter reaches the max\_age threshold of 30, the track is converted from the Confirmed state to the unstable Unconfirmed state.

(4) For Tracks in the Confirmed state, Tracks and Detections use cascade matching to calculate the cost matrix. Cascade matching uses the appearance feature vector to calculate cosine similarity, and uses the Mahalanobis distance to exclude targets that are far apart between frames; the appearance descriptor stores the feature vectors of the target over the previous 100 frames by default.

(5) Cascade matching also yields three types of results: for the Unmatched Tracks and Unmatched Detections states, the algorithm re-associates these two states together with the Unconfirmed Tracks using the GIOU association algorithm; for Matched Tracks, the variable information in Tracks is updated by Kalman filtering.

(6) The cost matrix in (5) is input into the KM algorithm, and the processing result is similar to step (3).

(7) Repeat Steps (4) to (6) until the end of the video.
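The GIOU-based association cost used in the steps above can be sketched as follows. Since the exact "improved GIOU" of this paper is not detailed here, this is a minimal sketch of the standard GIoU cost; the box layout `(x1, y1, x2, y2)` and the helper names are illustrative assumptions:

```python
import numpy as np

def giou(box_a, box_b):
    """Generalized IoU between two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box penalizes distant, non-overlapping pairs
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    return iou - (c_area - union) / c_area

def cost_matrix(tracks, detections):
    """Association cost = 1 - GIoU; lower means a better track/detection match."""
    return np.array([[1.0 - giou(t, d) for d in detections] for t in tracks])
```

The KM (Hungarian) step would then take this matrix and return the minimum-cost assignment, e.g. via `scipy.optimize.linear_sum_assignment`.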

#### *3.2. Improved YOLOv3-Based Ship Detection Network*

#### 3.2.1. Anchor Improvements

A large number of experiments have shown that the selection and design of the Anchor has a large impact on detection results. Analysis of our own ship dataset shows that ship targets are relatively large and their bounding boxes are horizontally oriented, with similar aspect ratios. Comparing these characteristics with the COCO dataset, the default Anchors of YOLOv3 do not meet our actual needs. Based on the above characteristics of the actual ship dataset, we designed Anchors specifically for the ship target, aiming to improve the speed and accuracy of ship detection.

In the YOLO detection algorithm, the input image is divided into an S × S grid, and each grid square is called a Grid Cell. Each Grid Cell is responsible for detecting the targets whose centers fall within it. Each Grid Cell has prediction boxes, called Anchors, and the number of Anchors per Grid Cell differs between versions. In YOLOv1, the image is divided into a 7 × 7 grid, and each Grid Cell is fixed with only two boxes of different aspect ratios; each Grid Cell can predict only one category, so detection accuracy is low in scenes with dense targets. In YOLOv2, the authors used clustering to group the ground-truth aspect ratios of the dataset into five classes by default, introducing five Anchors for each Grid Cell and improving the detection of dense objects. In YOLOv3, the authors reduced the number of Anchors per Grid Cell to three and introduced multi-scale feature map fusion to detect targets at three different scales, so the total number of Anchors associated with each location across scales increases to nine.
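The Grid Cell responsibility rule described above can be illustrated with a few lines of code (the function name is ours, not from the paper):

```python
def grid_cell_for_center(cx, cy, img_w, img_h, s):
    """Return the (row, col) of the S x S grid cell responsible for an
    object whose center lies at pixel (cx, cy) in an img_w x img_h image."""
    col = int(cx / img_w * s)
    row = int(cy / img_h * s)
    # Clamp centers lying exactly on the right/bottom image edge
    return min(row, s - 1), min(col, s - 1)
```

For example, an object centered in a 1920 × 1080 frame falls in the middle cell of a 13 × 13 grid.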

The targets detected in this paper are ships, which generally have an aspect ratio greater than 1, i.e., the detection rectangle is wider than it is tall, as shown in Figure 2: the upper left panel shows the distribution of ship classes, the upper right shows the distribution of the training set bounding boxes, and the lower left shows the distribution of the target center x and y, where the horizontal and vertical coordinates are the ratios of x and y to the actual image width and height; the same applies to the lower right panel. The original image is 1920 pixels wide and 1080 pixels high. The statistics show that most ship widths fall around 0.1~0.3 of the image width, i.e., 192~576 pixels, and heights around 0.02~0.1 of the image height, i.e., 22~108 pixels. To bring the designed Anchor aspect ratios closer to the actual ship detection application, we first clustered the bounding box dimensions of the dataset; among the many clustering methods available, we borrowed the idea of the YOLOv2 authors and used the k-means algorithm to cluster the data into two classes, obtaining Anchor dimensions of (384, 54) and (1152, 216) for each Grid Cell. Accordingly, the number of Anchors per Grid Cell was reduced from three to two, and, given the "small and large" size characteristics of ships in inland waters, we kept only two of the default Feature Map scales, so the total number of Anchors was reduced from the default nine to four.
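The anchor clustering step can be sketched as follows. Note that the YOLOv2 authors cluster with a 1 − IoU distance; this minimal sketch uses plain Euclidean k-means on the (width, height) pairs for brevity, and the function name is illustrative:

```python
import numpy as np

def kmeans_anchors(wh, k=2, iters=100, seed=0):
    """Cluster (width, height) pairs into k anchor shapes with plain
    Euclidean k-means (YOLOv2 itself uses 1 - IoU as the distance)."""
    rng = np.random.default_rng(seed)
    wh = np.asarray(wh, dtype=float)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # Assign each box to its nearest center
        d = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centers as cluster means (keep old center if empty)
        new = np.array([wh[labels == c].mean(axis=0) if np.any(labels == c)
                        else centers[c] for c in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers
```

Running this on the dataset's box dimensions yields two anchor shapes, one for small ships and one for large ones.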

**Figure 2.** Distribution of Anchor information of a ship dataset. (**a**) Number of ships by class. (**b**) Statistics of Anchor shape. (**c**) Statistics of Anchor x, y center coordinates. (**d**) Statistics of Anchor width and height.

#### 3.2.2. Improvement of Feature Pyramids

In YOLOv3, to detect objects of different sizes, the features from different layers of the feature extraction network are fused into new feature maps through Concat and upsampling operations. These feature maps have the same depth but different sizes. The resulting feature maps of different sizes, together with the YOLOv3 network structure, are shown in Figure 3.

The light yellow part of the figure shows the three scales of 13 × 13, 26 × 26, and 52 × 52. The size of each Cell is inversely proportional to the scale: at the large 52 × 52 scale each Cell is small, while at the small 13 × 13 scale each Cell is large. A small Cell covers less of the image and is therefore better suited to detecting small objects, while a large Cell aggregates more information and is better suited to larger objects, as shown in Figure 4. The large ship at the bottom is typically detected on the 13 × 13 feature map, while the small target ship in the middle is typically detected on the 26 × 26 feature map.

**Figure 4.** Output diagram of different sizes. (**a**) 13 × 13 grid cell. (**b**) 26 × 26 grid cell. (**c**) 52 × 52 grid cell.

Analysis of the self-collected ship dataset in this paper shows that, in terms of classes, the number of ship classes is much smaller than in open-source generic datasets; in terms of scenarios, the channel in inland waters is limited, and ships can only travel within it. With the shore-side camera as the reference point, the width of the channel greatly limits the size variation of ships, and most inland waterway ships are large targets, so we can delete the 52 × 52 feature maps used to detect small targets and keep only the 13 × 13 and 26 × 26 feature maps. This optimization reduces the parameters for network training and speeds up training. In addition, since the number of feature maps is reduced from three to two and the number of Anchors per feature map is also reduced from three to two, the original 3 × 3 = 9 candidate boxes per location are reduced to 2 × 2 = 4 when computing the detection boxes. This greatly reduces the amount of computation and improves the ship detection speed. The complete improved network structure designed in this paper is shown in Figure 5.

**Figure 5.** Improved YOLOv3 network.

As shown in Figure 5, the input image size is 416 × 416 pixels; after five down-sampling operations, a 13 × 13 feature layer is obtained, which detects large targets. Ship targets are relatively large, so a 52 × 52 feature layer would only increase the number of model parameters and reduce the detection speed. Therefore, only the two output layers of 13 × 13 and 26 × 26 are retained; reducing the number of feature layers reduces the number of parameters and operations and thus improves the network detection speed.
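The reduction in candidate boxes claimed above can be checked with a one-line count per scale (the helper name is illustrative):

```python
def num_predicted_boxes(grids, anchors_per_cell):
    """Total boxes predicted across all output scales:
    sum over scales of (grid * grid * anchors per cell)."""
    return sum(g * g * anchors_per_cell for g in grids)

# Original YOLOv3: three scales, three anchors per cell
default_boxes = num_predicted_boxes([13, 26, 52], 3)
# This paper's design: two scales, two anchors per cell
improved_boxes = num_predicted_boxes([13, 26], 2)
```

Dropping the 52 × 52 scale removes by far the largest share of candidate boxes, which is where most of the speed-up comes from.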

#### *3.3. Chimney Detection with Fused Infrared Images*

#### 3.3.1. Threshold Processing

The video saved by the infrared thermal camera used in this paper was later processed and saved locally as RGB three-channel images. To facilitate the subsequent thresholding, each RGB image needs to be converted to a grayscale image. The conversion from RGB to grayscale is given by Equation (1):

$$\text{Gray} = \text{R} \times 0.299 + \text{G} \times 0.587 + \text{B} \times 0.114 \tag{1}$$
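Equation (1) can be implemented directly; a minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def rgb_to_gray(img):
    """Convert an H x W x 3 RGB image to grayscale using the ITU-R BT.601
    luma weights of Equation (1)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b
```

Because the three weights sum to 1, a uniform white pixel maps to full grayscale intensity.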

After grayscale processing, a bimodal histogram can be obtained by counting the individual grayscale values. Because the images are bimodal, this subsection uses Otsu's algorithm, which finds the threshold that maximizes the between-class variance given by the relation:

$$\sigma^2 = \omega\_1 \cdot \left(\mu\_1 - \mu\_0\right)^2 + \omega\_2 \cdot \left(\mu\_2 - \mu\_0\right)^2 \tag{2}$$

where $\sigma^2$ is the between-class variance of foreground and background, $\omega\_1$ and $\omega\_2$ represent the proportions of background and foreground pixels in the whole image, $\mu\_1$ and $\mu\_2$ represent the average grayscale of the background and foreground, respectively, and $\mu\_0$ represents the average grayscale of the whole image. Expanding Equation (2) yields:

$$\begin{array}{ll} \sigma^2 &= \omega\_1 \cdot \mu\_1^2 + \omega\_2 \cdot \mu\_2^2 \\ &-2(\omega\_1 \cdot \mu\_1 + \omega\_2 \cdot \mu\_2) \cdot \mu\_0 + \mu\_0^2 \end{array} \tag{3}$$

According to the mathematical definition of expectation, $E(X) = \sum\_{k=1}^{\infty} x\_k \cdot p\_k$, we can deduce that:

$$
\mu\_0 = \omega\_1 \cdot \mu\_1 + \omega\_2 \cdot \mu\_2 \tag{4}
$$

Substituting (4) into (3) gives $\sigma^2 = \omega\_1 \cdot \mu\_1^2 + \omega\_2 \cdot \mu\_2^2 - \mu\_0^2$, which is further rewritten using Equation (4) and the relation $\omega\_2 = 1 - \omega\_1$:

$$\begin{split} \sigma^2 &= \omega\_1 \cdot \mu\_1^2 + \frac{\omega\_2^2 \cdot \mu\_2^2}{1 - \omega\_1} - \mu\_0^2 \\ &= \omega\_1 \cdot \mu\_1^2 + \frac{\left(\mu\_0 - \omega\_1 \cdot \mu\_1\right)^2}{1 - \omega\_1} - \mu\_0^2 \\ &= \frac{\omega\_1}{\left(1 - \omega\_1\right)} \cdot \left(\mu\_1 - \mu\_0\right)^2 \end{split} \tag{5}$$

Using Equation (5), we only need to accumulate the pixel statistics up to the current candidate threshold, which greatly improves the efficiency of the program.
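A minimal implementation of this search, scoring each candidate threshold with the single-pass form of Equation (5) (the function name is illustrative):

```python
import numpy as np

def otsu_threshold(gray):
    """Exhaustive search for the threshold maximizing the between-class
    variance in the form of Equation (5):
        sigma^2 = w1 / (1 - w1) * (mu1 - mu0)^2
    where w1 and mu1 are the weight and mean of the background class."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()                      # grayscale probabilities
    mu0 = float(np.dot(np.arange(256), p))     # global mean grayscale
    best_t, best_var = 0, 0.0
    w1, s1 = 0.0, 0.0                          # running weight / weighted sum
    for t in range(255):
        w1 += p[t]
        s1 += t * p[t]
        if w1 == 0.0 or w1 >= 1.0:
            continue                           # one class is empty
        mu1 = s1 / w1                          # background mean up to t
        var = w1 / (1.0 - w1) * (mu1 - mu0) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t
```

Only the running sums `w1` and `s1` change per candidate threshold, so the whole search is a single pass over the histogram.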

As can be seen in Figure 6, some background noise points are effectively removed by Gaussian filtering. Compared with a fixed threshold, the Otsu algorithm adaptively finds a threshold that reasonably separates the foreground from the background.

**Figure 6.** Comparison of infrared image binarization. (**a**) Input images. (**b**) Histogram. (**c**) Thresholding results.

#### 3.3.2. Coordinate Fusion

To collect experimental data, we independently developed an experimental system consisting of a visible camera, an infrared camera, and a gimbal, all rotating coaxially, so that the system can locate and track a target in real time. The visible camera has a resolution of 1920 × 1080; the thermal imaging camera is a custom model from Golder Infrared with a resolution of 640 × 512; and the gimbal rotation range is −120° to 120°. The multi-sensor coaxial rotation system is shown in Figure 7.

**Figure 7.** Multi-sensor coaxial rotation system.

Because the two cameras have different resolutions, the coordinates of the same object in the same frame are represented differently in each camera, as shown in Figure 8.

$$x\_2 = \frac{x\_1}{1920} \times 640 \tag{6}$$

$$y\_2 = \frac{y\_1}{1080} \times 512\tag{7}$$

Therefore, this paper needs to convert the coordinates to ensure the accuracy of the search area for the ship's chimney. Images of different sizes have different dimensions, but the position of each coordinate point relative to the upper left corner (origin) is the same after conversion to a rectangular coordinate system, so the coordinates can be converted using the scale relationship. Since the image coordinate system is two-dimensional, the *x* and *y* directions are converted separately. Suppose the same object P has coordinates (*x*1, *y*1) on the 1920 × 1080 visible image and (*x*2, *y*2) on the 640 × 512 infrared image; then *x*<sup>2</sup> and *y*<sup>2</sup> are computed by the scale relationships of Equations (6) and (7). Applying this coordinate conversion after aligning the visible image with the infrared band image ensures that the ship chimney's location is found accurately, without deviations caused by the coordinate conversion.
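Equations (6) and (7) amount to a simple scale conversion; a sketch with the resolutions of the system described above as defaults (the function name is illustrative):

```python
def visible_to_infrared(x1, y1, vis_size=(1920, 1080), ir_size=(640, 512)):
    """Map a pixel coordinate from the visible image to the infrared image
    using the scale relationships of Equations (6) and (7)."""
    x2 = x1 / vis_size[0] * ir_size[0]
    y2 = y1 / vis_size[1] * ir_size[1]
    return x2, y2
```

For example, the center of the visible frame maps to the center of the infrared frame, since both coordinates are scaled by the same ratio of resolutions.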
