*Article* **Tracking of Multiple Static and Dynamic Targets for 4D Automotive Millimeter-Wave Radar Point Cloud in Urban Environments**

**Bin Tan 1, Zhixiong Ma 1,\*, Xichan Zhu 1, Sen Li 1, Lianqing Zheng 1, Libo Huang 2 and Jie Bai 2**

	- **\*** Correspondence: mzx1978@tongji.edu.cn

**Abstract:** This paper presents a target tracking algorithm based on 4D millimeter-wave radar point cloud information for autonomous driving applications. The algorithm addresses the limitations of traditional 2 + 1D radar systems by using higher-resolution target point clouds, which enable more accurate motion state estimation and richer target contour information. The proposed algorithm proceeds in several steps. First, the ego vehicle's velocity is estimated from the radial velocity information of the millimeter-wave radar point cloud. Clustering proposals are then obtained with a density-based clustering method, and association regions of the targets are derived from these proposals. A binary Bayesian filter determines whether each target is dynamic or static based on its distribution characteristics. For dynamic targets, Kalman filtering estimates and updates the target state using trajectory and velocity information; for static targets, the rolling ball method estimates and updates the target's contour boundary. For unassociated measurements, contours are estimated and trajectories are initialized, while unassociated trajectories are selectively retained or deleted. The effectiveness of the proposed method is verified using real data. Overall, the proposed target tracking algorithm based on 4D millimeter-wave radar point cloud information has the potential to improve the accuracy and reliability of target tracking in autonomous driving applications, providing more comprehensive motion state and target contour information for better decision making.

**Keywords:** target tracking; 4D millimeter-wave radar; motion state estimation; autonomous driving

#### **1. Introduction**

For autonomous driving systems, accurately sensing the surrounding environment is crucial. Among the various vehicle sensing sensors, millimeter-wave radar is capable of obtaining position and speed information of targets, and can operate in complex weather conditions such as rain, fog, and bright sunlight exposure [1].

Conventional 2 + 1D (x, y, v) millimeter-wave radar is effective in measuring the radial distance, radial velocity, and horizontal angular information of a target. However, when compared to cameras and LIDAR, which are the other major sensors used in autonomous driving, traditional millimeter-wave radar has a lower angular resolution and cannot provide height angle information of the target. In autonomous driving scenarios, where vehicle or pedestrian targets are common, the small number of points and low angular resolution of individual targets in the scene can result in large errors in size and location estimation. To address this issue, high-resolution 4D (x, y, z, v) millimeter-wave radar has been developed, which can provide height angle information of targets with higher angular resolution. This enables more accurate edge information of targets and more precise estimation of a target's size and position.


Radar target tracking plays a critical role in millimeter-wave radar sensing. By providing a continuous position and velocity profile of a target, radar target tracking offers higher accuracy and reliability compared to a single measurement from the radar. Furthermore, it can effectively eliminate false detections.

Most conventional millimeter-wave radar tracking methods focus on point targets, which provide target ID, position, and velocity information. However, 4D millimeter-wave radar can measure multiple scattering centers per target, making direct application of point cloud tracking methods unsuitable. In addition, contour information, such as target size and orientation, is critical in autonomous driving environments. Therefore, accurate estimation of target ID, position, size, direction, and velocity information is necessary for 4D millimeter-wave radar target tracking. Despite considerable research on point target tracking using millimeter-wave radar, there is limited research on 4D millimeter-wave radar-based target tracking methods. Dynamic target tracking using 4D millimeter-wave radar presents several challenges, including variation in target size and multiple individual target measurement points. Furthermore, 4D millimeter-wave radar can measure static targets in the scene, while conventional millimeter-wave radar usually filters out static targets due to the absence of altitude angle information, which can result in false positives. Therefore, 4D millimeter-wave radar target tracking can also estimate the contour shape information of static targets. This paper focuses on developing tracking methods for multiple dynamic and static targets throughout a scene using 4D millimeter-wave radar.

The most commonly used multi-target tracking methods for millimeter-wave radar based on point targets include nearest neighbor data association (NN) [2,3], global nearest neighbor association (GNN) [4,5], multiple hypothesis tracking (MHT) [6,7], joint probabilistic data association (JPDA) [8,9], and the random finite set method (RFS) [10–12]. The nearest neighbor association algorithm selects the observation point that falls within the association gate and is closest to the tracking target as the association point. The global nearest neighbor algorithm minimizes the total distance or association cost. The joint probabilistic data association algorithm combines data association probabilities. The multi-hypothesis tracking algorithm calculates the probability and likelihood for each track. The RFS approach models objects and measurements as random finite sets.

In high-resolution millimeter-wave radar or 4D millimeter-wave automotive radar, a road target often spans multiple sensor resolution units, which poses challenges for tracking. In the extended target tracking problem for millimeter-wave radar, the position of the target measurement point on the object is represented as a probability distribution that changes with the sensor measurement angle, and the measurement point may appear or disappear. Therefore, tracking extended targets using millimeter-wave radar presents a significant challenge.

One approach to address the extended target tracking problem is to include a clustering process that reduces multiple measurements to a single measurement, which can then be tracked using a point target tracking method. In extended target tracking, clustering can be used to partition the point cloud. In automotive millimeter-wave radar target tracking, the size and shape of the clusters vary due to the different sizes and reflection properties of the targets. Therefore, density-based spatial clustering of applications with noise (DBSCAN) [13] is commonly used to cluster radar points. However, density-based clustering methods rely on fixed parameter values and may perform poorly with targets of different densities. As a result, several methods that allow for different clustering parameters have been proposed, such as ordering points to identify the clustering structure (OPTICS) [14], hierarchical DBSCAN (HDBSCAN) [15], and tracking-assisted multi-hypothesis clustering [16].

Other approaches to extended target tracking involve designing object measurement models. Some examples include the elliptic random matrix model [17], the random hypersurface model [18], and the Gaussian process model [19]. In millimeter-wave radar vehicle target tracking, various vehicle target models have been proposed, such as a direct scattering model [20], a variational radar model [21], a B-spline chained ellipses model [22], and the data-region association model [23].

Although several methods exist for extended target tracking using millimeter-wave radar, many of them rely on simulation data to develop extended target tracking theory. In practical scenarios, challenges such as the varied point cloud probability distributions of different extended targets and the diverse positional relationships between associated targets require further investigation of the tracking algorithms. Moreover, some algorithms focus only on tracking vehicle targets, so it is essential to explore ways to adapt tracking algorithms to different types of targets with varying sizes. Additionally, there have been limited studies on 4D millimeter-wave radar target tracking, and, therefore, the effectiveness of such methods on 4D millimeter-wave radar needs to be explored. This paper presents an effective 4D millimeter-wave radar target tracking method with the following contributions.


The structure of this paper is organized as follows. Section 2 describes the tracking problem. Section 3 presents the proposed solution to the tracking problem, which includes compensating for target velocity, clustering point clouds, determining target associations, identifying dynamic and static targets, updating contour shape states, and creating, retaining, and deleting trajectories. Section 4 presents the experimental setup and results. Finally, Section 5 summarizes the research.

#### **2. Materials and Methods**

The objective of this paper is to derive state estimates for both dynamic and static targets within the field of view of 4D millimeter-wave radar, using the point cloud measurement volume of the radar. This includes obtaining 3D edge information of dynamic targets and contour shape information of static targets.

#### *2.1. Measurement Modeling*

4D millimeter-wave radar point cloud measurement includes information on the position along the *x*, *y*, and *z*-axes as well as the radial velocity $v^r$ and intensity $I$. The radial velocity is obtained through direct measurement as the target point's relative radial velocity. Each measurement point can be expressed as:

$$z_j = \begin{bmatrix} x_j & y_j & z_j & v_j^r & I_j \end{bmatrix} \tag{1}$$

where $z_j$ represents the measurement, $j$ denotes the $j$-th point, and $v_j^r$ represents the relative radial velocity of the $j$-th point.

As shown in Figure 1, 4D millimeter-wave radar point clouds are utilized to measure targets at three distinct time steps, revealing that the detected target points are dynamic and can vary over time, possibly appearing or disappearing at different locations. This poses a significant challenge in accurately estimating the target's location and shape. To account for sensor noise and the inherent uncertainty in the measurement model, a probabilistic model is often employed to describe the measurement process.

**Figure 1.** Measurements of the same target at adjacent moments. (**a**) 3D view of the target point cloud at moment *t* − 2. (**b**) 3D view of the target point cloud at moment *t* − 1. (**c**) 3D view of the target point cloud at moment *t*. (**d**) Top view of the target point cloud at moment *t* − 2. (**e**) Top view of the target point cloud at moment *t* − 1. (**f**) Top view of the target point cloud at moment *t*.

For multiple measurements of the expansion target, this can be expressed as:

$$Z = \left\{ z_j \right\}_{j=1}^{n} \tag{2}$$

where $Z$ is the set of measurements, $z_j$ is a single measurement, $j$ is the measurement index, and $n$ is the total number of measurements.

The probability distribution of the measurements obtained from the target state can be expressed as:

$$p(Z_k|X_k) \tag{3}$$

where $Z_k$ is the measurement set at moment $k$ for a target with state $X_k$.

#### *2.2. Target State Modeling*

The aim of this paper is to estimate the states of both dynamic and static targets in the 4D millimeter-wave radar field of view using point cloud measurements. For the dynamic targets, their states can be described as follows:


Therefore, the state estimation of a 3D dynamic target in a road environment at time $k$ can be represented as $X_k^d$, which consists of the position state $(x, y, z)$, the motion state $(v_x, v_y)$, and the profile shape state $(l, w, h, \theta)$.

$$X_k^d = \begin{bmatrix} x_k & y_k & z_k & v_{xk} & v_{yk} & l_k & w_k & h_k & \theta_k \end{bmatrix} \tag{4}$$

The states of the static targets in this paper can be described as follows:


The state estimation of a 3D static target in a road environment can be expressed as:

$$X_k^s = \begin{bmatrix} \left\{ x_j \;\; y_j \right\}_{j=1}^{n} & v_{xk} & v_{yk} & h_k & z_k \end{bmatrix} \tag{5}$$

#### *2.3. Method*

The proposed solution in this paper is illustrated in Figures 2 and 3:

**Figure 2.** 4D millimeter-wave radar point cloud tracking framework.

**Figure 3.** Module of calculation of target measurement.

In Figures 2 and 3, time is represented by *t*, the detection value is represented by *D*, the trajectory is represented by *T*, the dynamic target is represented by *d*, and the static target is represented by *s*.

The 4D radar data is input to generate point cloud data of the scene. The point cloud is preprocessed to compensate for the velocity information and convert relative radial velocity to absolute radial velocity. The static scene from the previous frame is matched with the current frame to aid in associating static and dynamic targets. A clustering module is used to classify the points into different target proposals. Data association is performed using an optimal matching algorithm. For the clustered targets that are successfully associated, their dynamic and static attributes are updated using a binary Bayesian filtering algorithm. For dynamic targets, the target state is updated using a Kalman filtering method to obtain the 3D bounding box of the target. For static targets, the bounding box state is updated using the rolling ball method. For unassociated clustered targets, trajectory initialization is performed, historical trajectories that are not associated are retained or deleted, and trajectories in overlapping regions are merged.

#### 2.3.1. Point Cloud Preprocessing

Before feeding the millimeter-wave radar point cloud into the tracking framework, several preprocessing steps are performed. Firstly, the relative radial velocity information of the point cloud is compensated for absolute radial velocity, allowing for the extraction of dynamic and static targets in the scene and the updating of their states based on radial velocity information. Additionally, due to the motion of the radar, the world coordinate systems of the front and back point clouds are different, and multi-frame point clouds are matched to facilitate the association of dynamic and static targets. Further details on these steps can be found in previous work [25].

After obtaining the ego vehicle's speed $v_s$, the compensation amount $\hat{v}_c^r$ for the radial velocity of the target can be calculated. Then, the absolute radial velocity of each target point, $v_a^r$, can be calculated as follows:

$$v_a^r = v_d^r - \hat{v}_c^r \tag{6}$$

The radar point cloud conversion relationship can be expressed as:

$$H = \begin{bmatrix} R & t \end{bmatrix} \tag{7}$$

$$Y_{n-1}^{n} = H_{n-1} P_{n-1} \tag{8}$$

where $Y_{n-1}^{n}$ is the point set after the point cloud of the $(n-1)$-th frame is registered to the point cloud of the $n$-th frame, $H_{n-1}$ is the transformation composed of the rotation $R$ and translation $t$ between the two frames, and $P_{n-1}$ is the point cloud of the $(n-1)$-th frame.
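To make the compensation step concrete, the following sketch computes absolute radial velocities from the measured ones under the simplifying assumption that the ego vehicle moves along the sensor's x-axis, so the compensation term is the projection of the ego velocity onto each point's line of sight. The function name and this assumption are illustrative rather than taken from [25].

```python
import numpy as np

def absolute_radial_velocity(points, ego_speed):
    """Ego-motion compensation of radial velocity (Equation (6)).

    points: (N, 4) array of [x, y, z, v_r], with v_r the measured relative
    radial velocity.  Assumes (for illustration) that the ego velocity is
    directed along the sensor x-axis with magnitude ego_speed.
    """
    x, y, z, v_rel = points.T
    r = np.sqrt(x**2 + y**2 + z**2)      # range of each point
    v_comp = -ego_speed * (x / r)        # ego motion seen as a radial velocity
    return v_rel - v_comp                # absolute radial velocity v_a^r

# A stationary point straight ahead appears to approach at -10 m/s while the
# ego vehicle drives at 10 m/s; after compensation its velocity is ~0.
pts = np.array([[20.0, 0.0, 0.0, -10.0]])
print(absolute_radial_velocity(pts, ego_speed=10.0))
```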

2.3.2. Clustering and Data Association

• Radar Point Cloud Clustering

After preprocessing the point cloud data, the points are grouped into different targets using clustering techniques based on their position and velocity characteristics. One commonly used clustering algorithm for radar point clouds is density-based spatial clustering of applications with noise (DBSCAN) [13], which can automatically detect clustering structures of arbitrary shapes without requiring any prior knowledge. DBSCAN determines clusters by calculating the density around sample points, grouping points with higher density together to form a cluster, and determining the boundary between different clusters by the change in density. The DBSCAN algorithm takes the spatial coordinates ($x$, $y$, $z$) and the absolute radial velocity ($v_a^r$) of the data points as input. Specifically, the DBSCAN algorithm can be executed in the following steps:

• Calculation of the neighborhood set $N(p)$ of a data point $p$:

$$N(p) = \{ q \in Z : \text{dist}(p, q) \le \varepsilon \}\tag{9}$$

Here, *Z* is the dataset, *dist*(*p*, *q*) is the Euclidean distance between the data points *p* and *q*, and *ε* is the radius of the neighborhood.


By executing the above steps, the DBSCAN algorithm can complete the clustering process and assign the data points to different clusters and noise points.

After clustering the *k* targets, the features of the *j*-th target are represented as:

$$f_j = \left\{ \overline{x}_j \;\; \overline{y}_j \;\; \overline{z}_j \;\; \overline{v}_j^r \;\; \overline{I}_j \right\} \tag{10}$$

where $(\overline{x}_j,\ \overline{y}_j,\ \overline{z}_j,\ \overline{v}_j^r,\ \overline{I}_j)$ are calculated as the averages of the point cloud features within each target. The features of all clustering targets can be expressed as:

$$F = \left\{ f_j \right\}_{j=1}^{n} \tag{11}$$
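A minimal sketch of this clustering stage is given below, using scikit-learn's DBSCAN on the position and absolute-radial-velocity channels and then averaging per-cluster features as in Equation (10). The `eps`, `min_samples`, and velocity-weighting values are illustrative, not the parameters used in the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_radar_points(points, eps=1.5, min_samples=3, velocity_weight=0.5):
    """Cluster an (N, 4) array of [x, y, z, v_a] radar points with DBSCAN.

    The velocity channel is scaled so that position and velocity contribute
    comparably to the Euclidean distance of Equation (9).
    """
    features = points.copy()
    features[:, 3] *= velocity_weight
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)

def cluster_features(points, labels):
    """Per-cluster averages of the raw channels (Equation (10)); label -1 is noise."""
    return np.array([points[labels == cid].mean(axis=0)
                     for cid in sorted(set(labels) - {-1})])
```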

• Data Association

For the *j*-th trajectory, its features are denoted as:

$$g_j = \left\{ \tilde{x}_j \;\; \tilde{y}_j \;\; \tilde{z}_j \;\; \tilde{v}_j^r \;\; \tilde{I}_j \right\} \tag{12}$$

The features of all trajectories can be expressed as:

$$G = \left\{ g_j \right\}_{j=1}^{n} \tag{13}$$

The purpose of data association is to select which measurements are used to update the state estimate of the real target and to determine which measurements come from the target and which come from clutter. In this paper, it is necessary to associate all clustered targets $F$ with all trajectories $G$. One of the most widely used algorithms for target association is the Hungarian algorithm, a classical graph theoretic algorithm that finds a maximum matching of a bipartite graph. It can be used in a variety of target association algorithms for radar or images, and in target tracking it can be used to match point clouds in target clusters at different time steps to achieve target association. Assuming that the clustered targets contain $m$ targets and the radar historical trajectories contain $n$ targets, a cost matrix can be defined where $\mathrm{Cost}(i, j)$ denotes the cost between the $i$-th trajectory and the $j$-th clustered target. Depending on the needs of the target tracking, the cost function can be calculated from factors such as target clustering centroids, average velocity, and intensity characteristics. The Hungarian algorithm finds the optimal matching with the minimum cost by converting the bipartite graph into a weighted directed complete graph and finding augmenting paths in the graph.

The assignment matrix is calculated using the cost function, which is a combination of the position cost and the velocity/intensity cost. The cost function is defined as:

$$\mathrm{Cost}(i, j) = \alpha_1 \times \mathrm{PositionCost}(i, j) + \alpha_2 \times \mathrm{VelocityIntensityCost}(i, j) \tag{14}$$

where $\alpha_1$ is the weight of the position cost and $\alpha_2$ is the weight of the velocity/intensity cost. The position cost can be calculated from the distance between the target centroid and the trajectory prediction at the current time step, while the velocity/intensity cost can be calculated from the difference in velocity and intensity between the target and the trajectory prediction.

Once the cost function has been calculated, the Hungarian algorithm can be used to find the optimal matching solution with the minimum cost. The resulting assignment matrix $C$ is a binary matrix, where $C(i, j) = 1$ if target $i$ is matched to trajectory $j$, and $C(i, j) = 0$ otherwise.
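The association step can be sketched with SciPy's linear-sum-assignment solver as below. The cost of Equation (14) is built here from centroid distance and radial-velocity difference only, and the weights and gating threshold are illustrative choices rather than the paper's tuned values.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_feats, cluster_feats, a1=1.0, a2=0.5, gate=5.0):
    """Match trajectories G (rows) to clustered targets F (columns).

    track_feats, cluster_feats: arrays of [x, y, z, v] per trajectory/cluster.
    Returns matched index pairs plus the unmatched trajectories and clusters.
    """
    pos_cost = np.linalg.norm(
        track_feats[:, None, :3] - cluster_feats[None, :, :3], axis=-1)
    vel_cost = np.abs(track_feats[:, None, 3] - cluster_feats[None, :, 3])
    cost = a1 * pos_cost + a2 * vel_cost          # cost matrix of Equation (14)

    rows, cols = linear_sum_assignment(cost)      # minimum total cost matching
    matches = [(i, j) for i, j in zip(rows, cols) if cost[i, j] < gate]
    unmatched_tracks = set(range(len(track_feats))) - {i for i, _ in matches}
    unmatched_clusters = set(range(len(cluster_feats))) - {j for _, j in matches}
    return matches, unmatched_tracks, unmatched_clusters
```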

2.3.3. Target Status Update

• Target Dynamic Static Property Update

By integrating the absolute velocity information of a target with a binary Bayesian filter, its static and dynamic attributes can be updated. To estimate the target's dynamic probability at a given moment, the ratio of points with a speed greater than a given value to the total number of points in the target's point cloud is calculated. Bayes' theorem is used in the binary Bayesian filter to update the state of the target, which can be either static or dynamic, represented by a binary value of 0 or 1, respectively, at time t.

Applying Bayes' theorem:

$$p(x|z_{1:t}) = \frac{p(z_t|x, z_{1:t-1})\, p(x|z_{1:t-1})}{p(z_t|z_{1:t-1})} = \frac{p(z_t|x)\, p(x|z_{1:t-1})}{p(z_t|z_{1:t-1})} \tag{15}$$

Bayes' rule is applied to the measurement model $p(z_t|x)$:

$$p(z_t|x) = \frac{p(x|z_t)\, p(z_t)}{p(x)} \tag{16}$$

Then,

$$p(x|z_{1:t}) = \frac{p(x|z_t)\, p(z_t)\, p(x|z_{1:t-1})}{p(x)\, p(z_t|z_{1:t-1})} \tag{17}$$

For the opposite event ¬*x*,

$$p(\neg x|z_{1:t}) = \frac{p(\neg x|z_t)\, p(z_t)\, p(\neg x|z_{1:t-1})}{p(\neg x)\, p(z_t|z_{1:t-1})} \tag{18}$$

Then,

$$\frac{p(x|z_{1:t})}{p(\neg x|z_{1:t})} = \frac{p(x|z_t)\, p(x|z_{1:t-1})\, p(\neg x)}{p(\neg x|z_t)\, p(\neg x|z_{1:t-1})\, p(x)} = \frac{p(x|z_t)}{1 - p(x|z_t)} \cdot \frac{p(x|z_{1:t-1})}{1 - p(x|z_{1:t-1})} \cdot \frac{1 - p(x)}{p(x)} \tag{19}$$

The log odds belief at time *t* is:

$$l_t(x) = \log \frac{p(x|z_t)}{1 - p(x|z_t)} - \log \frac{p(x)}{1 - p(x)} + l_{t-1}(x) \tag{20}$$

And,

$$l_0(x) = \log \frac{p(x)}{1 - p(x)} \tag{21}$$

Then,

$$l_t(x) = l_{t-1}(x) + \log \frac{p(x|z_t)}{1 - p(x|z_t)} - l_0 \tag{22}$$

In the dynamic and static attribute update, $p(x|z_t)$ is calculated as the ratio of the number of points with a velocity greater than a given value $v_d$ to the total number of points in the target point cloud.
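A compact sketch of this log-odds update is shown below; the threshold `v_d`, the prior, and the clipping constant are illustrative values rather than the paper's settings.

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def update_dynamic_belief(l_prev, velocities, v_d=0.5, p_prior=0.5, eps=1e-3):
    """One binary Bayesian filter step (Equation (22)) for the dynamic attribute.

    p(x|z_t) is the fraction of points whose absolute radial speed exceeds v_d.
    """
    p_meas = np.clip(np.mean(np.abs(velocities) > v_d), eps, 1.0 - eps)
    l_t = l_prev + logit(p_meas) - logit(p_prior)   # log-odds update
    p_dynamic = 1.0 - 1.0 / (1.0 + np.exp(l_t))     # back to a probability
    return l_t, p_dynamic

# Example: 8 of 10 points are faster than v_d, so belief in "dynamic" rises.
l, p = update_dynamic_belief(0.0, np.array([0.9] * 8 + [0.1] * 2))
print(round(p, 2))  # 0.8
```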

• Dynamic Target State Update

The state estimation of a 3D dynamic target in a road environment at time $k$ is represented as $X_k^d$ in Equation (4), which consists of the position state $(x, y, z)$, the motion state $(v_x, v_y)$, and the profile shape state $(l, w, h, \theta)$.

To update the state of a target, it is necessary to perform additional calculations on the existing clustered targets to obtain measurements of its current state. These calculations may involve analyzing the shape and center position of the target, as well as estimating its velocity. Once these calculations are completed, the status of the target can be updated based on the latest information available, allowing for more accurate tracking and prediction of the target's movement.

When computing measurements of clustered targets for dynamic targets, it is necessary to obtain the rectangular box enclosing the target. The height of the rectangular box can be calculated from the maximum and minimum height of the point cloud, while the other parameters of the rectangular box can be obtained from the enclosing rectangular box in the x and y planes.

However, calculating the rotation angle of the rectangular box is the most challenging part of target shape estimation, especially in imaging millimeter-wave radar, where the number of point clouds is limited and the contours of the point clouds are not well-defined. To address this issue, this paper proposes a method for calculating the rotation angle based on the combination of point cloud position and velocity information and trajectory angle. This approach provides a more accurate and robust estimate of the rotation angle, leading to improved target tracking and prediction.

The rectangular box of the point cloud is fitted using the L shape fitting method [26]. When working with points on a 2D plane, the least squares method is a common approach to finding the best-fitting rectangle for these points.

$$\begin{aligned} \underset{P,\, Q,\, \theta,\, c_1,\, c_2}{\text{minimize}} \quad & \sum_{i \in P} \left( x_i \cos\theta + y_i \sin\theta - c_1 \right)^2 + \sum_{i \in Q} \left( -x_i \sin\theta + y_i \cos\theta - c_2 \right)^2 \\ \text{subject to} \quad & P \cup Q = \{ 1, 2, \dots, m \}, \quad c_1, c_2 \in \mathbb{R}, \quad 0^\circ \le \theta \le 90^\circ \end{aligned} \tag{23}$$

The above optimization problem can be solved approximately by a search-based algorithm that finds the best-fitting rectangle. The basic idea is to iterate through all possible directions of the rectangle. At each iteration, a rectangle is found that points in that direction and contains all scanned points. The distances from all points to the four edges of the rectangle are then obtained, based on which the points can be divided into two sets, $P$ and $Q$, and the corresponding squared errors are calculated as the objective function in the above equation. After iterating through all directions and obtaining all corresponding squared errors, the squared error can be examined as a function of the angle. Algorithm 1 is as follows.


- **Input**: data points $X = (x, y)$
- **Output**: criterion $Q_p$
1. **for** $\theta = 0$ **to** $\pi/2 - \delta$ **step** $\delta$ **do**
2. $\quad \hat{e}_1 = (\cos\theta, \sin\theta)$
3. $\quad \hat{e}_2 = (-\sin\theta, \cos\theta)$
4. $\quad C_1 = X \cdot \hat{e}_1^T$
5. $\quad C_2 = X \cdot \hat{e}_2^T$
6. $\quad q = \mathrm{CalculateCriterionX}(C_1, C_2)$
7. $\quad Q_p(\theta) = q$
8. **end for**

The calculation criterion, $\mathrm{CalculateCriterionX}(C_1, C_2)$, using the minimum rectangular area method described in this paper, is defined as follows:

$$c_1^{\max} = \max\{C_1\}, \quad c_1^{\min} = \min\{C_1\} \tag{24}$$

$$c_2^{\max} = \max\{C_2\}, \quad c_2^{\min} = \min\{C_2\} \tag{25}$$

$$\alpha = -\left(c_1^{\max} - c_1^{\min}\right)\left(c_2^{\max} - c_2^{\min}\right) \tag{26}$$
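The following sketch combines Algorithm 1 with the minimum-area criterion of Equations (24)–(26); the angular step size is an illustrative choice.

```python
import numpy as np

def rectangle_area_criterion(points_xy, delta=np.deg2rad(1.0)):
    """Search-based rectangle fitting: Q_p(theta) from Equations (24)-(26).

    points_xy: (N, 2) array of a cluster's x-y coordinates.
    Returns the candidate angles and the criterion values (negative areas).
    """
    thetas = np.arange(0.0, np.pi / 2, delta)
    q_p = np.empty_like(thetas)
    for k, theta in enumerate(thetas):
        e1 = np.array([np.cos(theta), np.sin(theta)])
        e2 = np.array([-np.sin(theta), np.cos(theta)])
        c1, c2 = points_xy @ e1, points_xy @ e2
        # Equation (26): larger value <=> smaller enclosing rectangle area
        q_p[k] = -(c1.max() - c1.min()) * (c2.max() - c2.min())
    return thetas, q_p

# Under the area criterion alone, the best angle is thetas[np.argmax(q_p)].
```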

After obtaining $Q_p(\theta)$, the probability $P_p(\theta)$ is calculated as:

$$P_p(\theta) = \frac{\max\left(Q_p(\theta)\right) - Q_p(\theta) + \min\left(Q_p(\theta)\right)}{\sum_{\theta} Q_p(\theta)} \tag{27}$$

For a target on a two-dimensional plane, if the velocities of the points on the target are assumed to be approximately equal, the orientation of the velocity can be estimated. Since the millimeter-wave radar measures different radial velocities at different points on the target, this estimated velocity orientation can be used as an approximation of the rectangle's rotation angle, as follows.

The radial velocity measured by millimeter-wave radar can be expressed as

$$v_d^r = v_{d,x}\,\frac{x}{R} + v_{d,y}\,\frac{y}{R} \tag{28}$$

$$v_d^r = v_{d,x}\left(\frac{x}{R} + \tan\theta\,\frac{y}{R}\right) \tag{29}$$

Similarly, a search-based algorithm can be used to find the rotation angle, where the criterion is calculated as the variance. Algorithm 2 is as follows.


After obtaining $Q_v(\theta)$, the probability $P_v(\theta)$ is calculated as:

$$P_v(\theta) = \frac{\max\left(Q_v(\theta)\right) - Q_v(\theta) + \min\left(Q_v(\theta)\right)}{\sum_{\theta} Q_v(\theta)} \tag{30}$$

The historical trajectory angle is denoted as $\theta_l$, and its probability is modeled as a Gaussian distribution centered at $\theta_l$ with variance $\delta_l$:

$$P_t(\theta) = N\left(\theta;\, \theta_l, \delta_l^2\right) \tag{31}$$

$$P_h(\theta) = \frac{P_t(\theta)}{\sum_{\theta} P_t(\theta)} \tag{32}$$

Angular probabilities estimated from the point cloud position and velocity information and trajectory angles are fused using a weighted average.

$$P(\theta) = a_1 P_p(\theta) + a_2 P_v(\theta) + a_3 P_h(\theta) \tag{33}$$

The value $\theta^*$ that maximizes $P(\theta)$ is chosen as the measured rotation angle, and the rectangular boundary $\{a_i x + b_i y = c_i \mid i = 1, 2, 3, 4\}$ is calculated as:

$$C_1^* = X \cdot \left(\cos\theta^*,\ \sin\theta^*\right)^T, \quad C_2^* = X \cdot \left(-\sin\theta^*,\ \cos\theta^*\right)^T \tag{34}$$

$$a_1 = \cos\theta^*, \quad b_1 = \sin\theta^*, \quad c_1 = \min\{C_1^*\} \tag{35}$$

$$a_2 = -\sin\theta^*, \quad b_2 = \cos\theta^*, \quad c_2 = \min\{C_2^*\} \tag{36}$$

$$a_3 = \cos\theta^*, \quad b_3 = \sin\theta^*, \quad c_3 = \max\{C_1^*\} \tag{37}$$

$$a_4 = -\sin\theta^*, \quad b_4 = \cos\theta^*, \quad c_4 = \max\{C_2^*\} \tag{38}$$

From the process described above, the following parameters of the clustered target can be calculated: the centroid coordinates in three-dimensional space (*x*, *y*, *z*), the length, width, and height of the rectangular box enclosing the target, and the rotation angle (*θ*) of the rectangular box.
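Once $\theta^*$ is selected, the box measurement can be read off the projections of Equation (34), as in the sketch below; the function name and return layout are illustrative.

```python
import numpy as np

def box_from_angle(points_xyz, theta_star):
    """Recover (x, y, z, l, w, h, theta) for the measurement of Equation (39).

    Projects the x-y points onto the edge directions of Equation (34); the
    height is taken from the z extrema as described above.
    """
    xy = points_xyz[:, :2]
    e1 = np.array([np.cos(theta_star), np.sin(theta_star)])
    e2 = np.array([-np.sin(theta_star), np.cos(theta_star)])
    c1, c2 = xy @ e1, xy @ e2
    length, width = c1.max() - c1.min(), c2.max() - c2.min()
    # rectangle centre mapped back into the x-y frame
    center = 0.5 * (c1.max() + c1.min()) * e1 + 0.5 * (c2.max() + c2.min()) * e2
    z_min, z_max = points_xyz[:, 2].min(), points_xyz[:, 2].max()
    return (center[0], center[1], 0.5 * (z_min + z_max),
            length, width, z_max - z_min, theta_star)
```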

The velocity information of the target can be calculated by Equation (32). Then, the measurement can be expressed as:

$$Z_{t,k} = \begin{bmatrix} x_k & y_k & z_k & v_{xk} & v_{yk} & l_k & w_k & h_k & \theta_k \end{bmatrix} \tag{39}$$

The state transfer model of the target motion can be modeled as:

$$X_t = F X_{t-1} + \xi_t \tag{40}$$

where $\xi_t$ is white Gaussian process noise with distribution $\eta(\xi; 0, R)$.

The sensor's observation model is described as:

$$z_t = H X_t + \zeta_t \tag{41}$$

where $\zeta_t$ is white Gaussian measurement noise with distribution $\eta(\zeta; 0, Q)$.

Based on Equations (40) and (41), since the state and measurement equations of the target can be expressed in linear forms, the state can be updated by the Kalman filter.
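As a sketch of this update, the snippet below uses a constant-velocity transition for the state of Equation (4) and a direct observation matrix for the measurement of Equation (39); the constant-velocity assumption and the neutral covariance parameter names are illustrative, since the paper does not prescribe a specific motion model.

```python
import numpy as np

def make_cv_model(dt):
    """Constant-velocity F and identity H for the 9D state of Equation (4)."""
    F = np.eye(9)
    F[0, 3] = dt     # x += vx * dt
    F[1, 4] = dt     # y += vy * dt
    H = np.eye(9)    # every state component appears in the measurement
    return F, H

def kalman_step(x, P, z, F, H, process_cov, meas_cov):
    """One predict-update cycle for Equations (40) and (41)."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + process_cov
    S = H @ P_pred @ H.T + meas_cov
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```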

• Static Target State Update

The state estimation of a 3D static target in a road environment can be expressed as Equation (5).

When calculating measurements for clustered target detection in static scenarios, obtaining the enclosing box of the target is necessary. The height of the enclosing box can be determined by computing the maximum and minimum heights of the point cloud, while the other parameters of the enclosing box can be obtained from the enclosing concave hull in the x and y planes.

The specific steps of the algorithm are as follows:


Through Formula (7) of the radar point cloud velocity compensation part, $v_{xk}$ and $v_{yk}$ of the static target can be calculated, and the vehicle speed can be updated through the Kalman filter.

#### 2.3.4. Track Management

In multi-object tracking, the number of targets is typically unknown and can vary as targets and clutter appear and disappear from the scene. Therefore, effective management of target trajectories is essential. For associated detections and trajectories, their states are preserved and updated over time. In cases where detections cannot be associated with any existing trajectory, new trajectories are generated and released as visible trajectories if their lifespan exceeds a predefined threshold *Tr*. For unassociated trajectories, their states are also preserved and updated. However, if their unassociated time exceeds a second threshold *Tu*, the trajectories are deleted to avoid unnecessary computational load.
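A simple sketch of this lifecycle logic is given below, with `confirm_age` standing in for the release threshold $T_r$ and `max_missed` for the deletion threshold $T_u$; the dictionary layout and threshold values are illustrative.

```python
def manage_tracks(tracks, unmatched_track_ids, new_detections,
                  confirm_age=3, max_missed=5):
    """Initialize, confirm, and delete trajectories after association."""
    # start tentative tracks from detections that could not be associated
    for det in new_detections:
        tracks.append({'state': det, 'age': 0, 'missed': 0, 'confirmed': False})

    survivors = []
    for idx, trk in enumerate(tracks):
        trk['age'] += 1
        trk['missed'] = trk['missed'] + 1 if idx in unmatched_track_ids else 0
        if trk['age'] >= confirm_age:
            trk['confirmed'] = True        # released as a visible trajectory
        if trk['missed'] <= max_missed:    # drop tracks unassociated too long
            survivors.append(trk)
    return survivors
```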

#### **3. Results**

#### *3.1. Experiment Setup*

To verify the proposed algorithm, data from a 4D radar in road conditions were acquired using a data acquisition platform. The platform includes a 4D radar, LIDAR, and camera sensors, as shown in Figure 4. The 4D radar is installed in the middle of the front ventilation grille, and the LIDAR collects 360° of environmental information. The camera and 4D radar capture information within their fields of view. The ground-truth boxes of the tracked targets were labeled using the LIDAR and camera sensors. The performance parameters of the 4D radar sensor are shown in Table 1. The TJ4DRadSet [27] dataset, collected with this platform, is used for the algorithm analysis.

**Figure 4.** Data acquisition platform, including 4D radar, lidar, and camera sensor.

**Table 1.** Performance parameters of millimeter-wave radar in experimental data acquisition.


#### *3.2. Results and Evaluation*

In order to investigate the impact of velocity errors on the angle estimation under different distances and angles, the graphs shown in Figures 5–10 were plotted.

**Figure 5.** The relationship between velocity estimation and angle under a distance of 5 m and an object rotation angle of 10 degrees, considering different radial velocity measurement errors.

**Figure 6.** The relationship between velocity estimation and angle under a distance of 20 m and an object rotation angle of 10 degrees, considering different radial velocity measurement errors.

**Figure 7.** The relationship between velocity estimation and angle under a distance of 5 m and an object rotation angle of 40 degrees, considering different radial velocity measurement errors.

**Figure 8.** The relationship between velocity estimation and angle under a distance of 20 m and an object rotation angle of 40 degrees, considering different radial velocity measurement errors.

**Figure 9.** The relationship between velocity estimation and angle under a distance of 20 m and an object rotation angle of 70 degrees, considering different radial velocity measurement errors.

**Figure 10.** The relationship between velocity estimation and angle under a distance of 40 m and an object rotation angle of 70 degrees, considering different radial velocity measurement errors.

From Figures 5–10, it can be observed that when the radial velocity error is small, the estimation of the rotation angle can be made using velocity measurements from multiple points, and a shorter distance is more favorable for estimating the rotation angle based on the velocity.

Due to the limited number of millimeter-wave radar points, the rotation angle estimation of the dynamic target is fused by different methods. As shown in Figures 11 and 12, the rotation angle of the dynamic target can be better estimated.

**Figure 11.** Method for estimating dynamic targets at different angles, and the relationship between probability and angle changes.

**Figure 12.** Rectangle formed by the estimated rotation angle and the true rotation angle.

Figures 13–15 show the state estimation of dynamic targets and static targets in a 4D millimeter-wave radar scenario. Different estimated dynamic targets, static targets, and true bounding boxes of dynamic targets have been labeled.

**Figure 13.** Results of 4D millimeter-wave radar point cloud and target tracking for a single vehicle, where the green box represents a dynamic target, the red box represents a static target, and the blue box represents the true box of a dynamic target.

**Figure 14.** Results of 4D millimeter-wave radar point cloud and target tracking for a single vehicle, including an incorrect dynamic detection, where the green box represents a dynamic target, the red box represents a static target, and the blue box represents the true box of a dynamic target.

**Figure 15.** Results of 4D millimeter-wave radar point cloud and target tracking for multiple objects, where the green box represents a dynamic target, the red box represents a static target, and the blue box represents the true box of a dynamic target.

Figure 16 shows the effects of different performance parameters in the target tracking scene.

**Figure 16.** Performance curves of different indicators for dynamic targets in a tracking scenario.

#### **4. Discussion**

The proposed 4D radar object tracking method based on radar point clouds can effectively estimate the position and state information of radar targets. This provides more accurate information for perception and planning in autonomous driving. By utilizing radar point clouds, the method improves the tracking and prediction of surrounding objects, enabling autonomous vehicles to make informed decisions in real time. Precise localization and tracking of radar targets enhance situational awareness, allowing autonomous vehicles to navigate complex environments with greater reliability and safety. Overall, this method significantly enhances the perception and planning capabilities of autonomous driving systems, contributing to the development of safer and more efficient autonomous vehicles.

#### **5. Conclusions**

In summary, this paper presents a 4D radar-based target tracking algorithm framework that utilizes 4D millimeter-wave radar point cloud information for autonomous driving awareness applications. The algorithm overcomes the limitations of conventional 2 + 1D radar systems and utilizes higher resolution target point cloud information to achieve more accurate motion state estimation and target profile information. The proposed algorithm includes several steps, such as ego vehicle speed estimation, density-based clustering, and binary Bayesian filtering to identify dynamic and static targets, as well as state updates of dynamic and static targets. Experiments are conducted using measurements from 4D millimeter-wave radar in a real-world in-vehicle environment, and the algorithm's performance is validated by actual measurement data. The algorithm can improve the accuracy and reliability of target tracking in autonomous driving applications. This method focuses on the tracking framework for 4D radar. However, further research is needed to investigate the details of certain aspects such as motion models, filters, and ego-vehicle pose estimation.

**Author Contributions:** Conceptualization, B.T., Z.M. and X.Z.; methodology, B.T., Z.M. and X.Z.; software, B.T.; validation, B.T., S.L. and L.Z.; formal analysis, L.Z.; investigation, S.L.; resources, L.H.; data curation, B.T. and L.Z.; writing—original draft preparation, Z.M.; writing—review and editing, B.T. and Z.M.; visualization, L.Z.; supervision, X.Z. and L.H.; project administration, X.Z. and J.B.; funding acquisition, X.Z. and J.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Key R&D Program of China (2022YFB2503404).

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

### *Article* **HTC+ for SAR Ship Instance Segmentation**

**Tianwen Zhang and Xiaoling Zhang \***

School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China; twzhang@std.uestc.edu.cn

**\*** Correspondence: xlzhang@uestc.edu.cn

**Abstract:** Existing instance segmentation models mostly pay less attention to the targeted characteristics of ships in synthetic aperture radar (SAR) images, which hinders further accuracy improvements and leads to poor segmentation performance in more complex SAR image scenes. To solve this problem, we propose a hybrid task cascade plus (HTC+) for better SAR ship instance segmentation. Aiming at the specific SAR ship task, seven techniques are proposed to ensure the excellent performance of HTC+ in more complex SAR image scenes, i.e., a multi-resolution feature extraction network (MRFEN), an enhanced feature pyramid network (EFPN), a semantic-guided anchor adaptive learning network (SGAALN), a context ROI extractor (CROIE), an enhanced mask interaction network (EMIN), a post-processing technique (PPT), and a hard sample mining training strategy (HSMTS). Results show that each of them offers an observable accuracy gain, and the instance segmentation performance in more complex SAR image scenes becomes better. On the two public datasets SSDD and HRSID, HTC+ surpasses nine other competitive models. It achieves 6.7% higher box AP and 5.0% higher mask AP than HTC on SSDD, and 4.9% and 3.9% higher, respectively, on HRSID.

**Keywords:** synthetic aperture radar; ship instance segmentation; HTC+; deep learning; convolutional neural network

#### **1. Introduction**

Ship surveillance has received widespread attention [1–4]. Synthetic aperture radar (SAR) is an active microwave sensor [5–7]. It works regardless of weather and light conditions, and is more suitable for ship monitoring than optical sensors [8]. Traditional methods [9,10] rely overly on hand-picked features, reducing model flexibility and migration. Now, more efforts are devoted to deep learning-based methods [11].

Most scholars use boxes to detect ships in the SAR community [12], but the instance segmentation at both box- and pixel-level has received less attention [13]. Moreover, Xu et al. [14] studied the dynamic detection of offshore wind turbines by spatial machine learning in SAR images, but offshore facilities and ships have different radar scattering characteristics. Some works [15–22] have studied SAR ship instance segmentation, but they mostly used models for generic objects directly, without considering the targeted characteristics of SAR ship objects, hindering further accuracy improvements and leading to poor segmentation performance in more complex SAR image scenes [23].

Thus, we propose HTC+ to explore better SAR ship instance segmentation. HTC is selected because it may be the best model [24]. For SAR ship mission, we enhance HTC using seven techniques for incremental performance to form HTC+. It is similar to the update from YOLOv3 [25] to YOLOv4 [26]. (1) A multi-resolution feature extraction network (MRFEN) is used to boost multi-scale feature description. (2) An enhanced feature pyramid network (EFPN) is designed to enhance better small ship search ability. (3) A semantic-guided anchor adaptive learning network (SGAALN) is proposed to optimize anchors. (4) A context ROI extractor (CROIE) is designed to boost background discrimination. (5) An enhanced mask interaction network (EMIN) is designed to boost multi-stage mask feature fusion. (6) A post-processing technique (PPT) using NMS [27] and Soft-NMS [28]
is used to reduce missed detections of densely moored ships. (7) A hard sample mining training strategy (HSMTS) [29] is used to deal with complex scenes and cases. Experiments are performed on two public datasets SSDD [13] and HRSID [15]. Results show that seven novelties contribute to improving accuracy; HTC+ offers the best accuracy over the other nine competitive models. The instance segmentation performance in more complex SAR image scenes becomes better. HTC+ offers 6.7% box AP and 5.0% mask AP increments on the vanilla HTC on SSDD; they are 4.9% and 3.9% on HRSID.

The contributions of our work are summarized as follows:


The rest of this paper is arranged as follows. Section 2 reviews some related works. Section 3 introduces the methodology. Experiments are described in Section 4. Results are shown in Section 5. Ablation studies are made in Section 6. More discussions are introduced in Section 7. Finally, a summary of this paper is made in Section 8.

#### **2. Related Works**

In this section, we will review some commonly-used generic instance segmentation models in the computer vision community in Section 2.1. Afterwards, some existing SAR ship instance segmentation models will be introduced in Section 2.2.

#### *2.1. Instance Segmentation*

Mask R-CNN [30] is the most classic instance segmentation model, which designed a mask prediction branch on the basis of Faster R-CNN [31]. To measure mask quality, Mask Scoring R-CNN [32] added a scoring network to provide confidences of mask prediction. Cascade Mask R-CNN [33] designed a multi-stage detection and mask head with increasing intersection over union (IOU) thresholds to improve hypotheses quality and ease overfitting. PANet [34] added a bottom-top path aggregation to boost FPN's representation. To make full use of multi-scale features, Rossi et al. [35] proposed a novel region of interest (ROI) extraction layer (GROIE) for instance segmentation. YOLACT [36] is a real-time one-stage instance segmentation model, but its accuracy is poorer than two-stage ones. Furthermore, HTC [24] combined Cascade R-CNN and Mask R-CNN to leverage relationships between detection and segmentation, offering the state-of-the-art performance [37,38]. Therefore, we select it as our experimental baseline.

#### *2.2. SAR Ship Instance Segmentation*

Recently, many scholars in the SAR community have started to study SAR ship instance segmentation. Since the first public dataset called HRSID was released by Wei et al. [15] in 2020, various methods have emerged. Su et al. [16] proposed a high-quality network named HQ-ISNet for remote sensing object instance segmentation. They measured the model performance on SAR images; the results indicated the effectiveness of the proposed model. Yet, their model was just a mechanical borrowing from the computer vision community, without thought of appropriateness, hampering further performance improvements. Zhao et al. [17] proposed SA R-CNN, which added attention mechanisms to boost accuracy, but the performance among complex scenes was still limited. Gao et al. [18] proposed an anchor-free model called CenterMask and a centroid-distance based loss to enhance benefits of ship feature learning, but such anchor-free models still cannot handle complex scenes and cases [29]. HTC was applied to SAR ship instance segmentation by Zhang et al. [19], but such direct use led to limited accuracy for SAR ships.

In 2022, Fan et al. [20] designed an efficient instance segmentation paradigm (EISP) for interpreting SAR and optical images, which adopted transformers to extract features. Yet, this paradigm did not consider the targeted characteristics of SAR ships, limiting its performance. Zhang et al. [21] designed a full-level context squeeze-and-excitation ROI extractor for SAR ship instance segmentation, but their method only considered extracting the optimized feature subset and ignored improvements to other parts of the network, leading to limited ship segmentation performance in more complex SAR image scenes. Ke et al. [22] proposed a global context boundary-aware network to improve the positioning performance of the bounding box so as to achieve better segmentation effects, but they did not consider the differences between segmentation tasks and detection tasks. Zhang et al. [23] improved Mask R-CNN further by using context information and a squeeze-and-excitation mechanism, but their network did not have adequate mask information interaction, leading to poor segmentation performance in more complex scenes.

In short, the above existing methods mostly used models for generic objects in the computer vision community directly. In other words, they did not consider the targeted characteristics of SAR ship objects, which hinders further accuracy improvements. Thus, we will research useful techniques in this paper to boost instance segmentation especially for SAR ships.

#### **3. Methodology**

Aiming at the specific SAR ship task, we explore ways to enhance each component's performance on the basis of the vanilla HTC [24] to achieve the progressive improvements to the overall performance, resulting in the evolution from HTC to HTC+. Our research thinking is similar to the evolution from YOLOv3 [25] to YOLOv4 [26] where YOLOv4 proposed five key techniques and adopted some useful tricks to enhance YOLOv3 further.

Figure 1 depicts HTC+ architecture. MRFEN is a backbone network to extract multiresolution ship features. EFPN is to improve multi-scale feature representation. SGAALN is to learn anchor location and shape used in the region proposal network (RPN) [31] that is responsible for producing proposals. Classifier (CLS) is used to identify foreground and background. Regressor (REG) predicts proposal positions. CROIE is to map proposals from RPN into MRFEN's feature maps to extract feature subsets [21] for the box-mask prediction head. EMIN predicts box and mask. PPT post-processes outputs. HSMTS works only in training selected hard samples to handle complex scenes and cases.

**Figure 1.** HTC+ architecture. (1) MRFEN denotes the multi-resolution feature extraction network; (2) EFPN denotes the enhanced feature pyramid network; (3) SGAALN denotes the semantic-guided anchor adaptive learning network; (4) CROIE denotes the context regions of interest extractor; (5) EMIN denotes the enhanced mask interaction network; (6) PPT denotes the post-processing technique; (7) HSMTS denotes the hard sample mining training strategy. Seven novelties are marked by red numbers 1–7. The first five belong to the network architecture's improvements. The remaining two constitute extra tricks to boost performance further. Moreover, here, RPN denotes the region proposal network, CLS denotes classification, and REG denotes regression.

Moreover, for ease of reading, we summarize the following materials in Table 1. Next, we will introduce the seven components for incremental accuracy in detail in Sections 3.1–3.7.


**Table 1.** Materials arrangement of the methodology.

#### *3.1. Multi-Resolution Feature Extraction Network (MRFEN)*

**Existing approach.** The raw HTC adopted the high-to-low resolution paradigm as the network deepens to extract features, as shown in Figure 2a, e.g., ResNet [39] and ResNeXt [40], i.e., the network depth is inversely proportional to the resolution. Still, this paradigm is not well-suited to SAR ship tasks, considering the two aspects below.

On the one hand, the four stages in Figure 2a extract multi-scale features equally, i.e., with the same number of conv blocks (e.g., 4 in Figure 2a). Yet, the ship size distribution of existing datasets is uneven, as in Figure 3a,b, i.e., small ships far outnumber large ones. The main reason for this phenomenon is that SAR is a "bird-eye" remote sensing earth observation tool, different from "person-eye" natural scene cameras. Thus, one should treat them differently; otherwise, a large performance imbalance between small ships and large ships will occur. We think that one should arrange heavy networks for small ship detection because small ships are more difficult to detect, having fewer feature pixels; in contrast, one should use light networks for large ship detection because large ships are easier to detect due to their clearer features.

**Figure 2.** Different backbone networks for feature extraction. (**a**) Existing approach: the backbone network of HTC. (**b**) Proposed approach: the backbone network of HTC+ (MRFEN).

**Figure 3.** Multi-scale SAR ships. (**a**) Ship size uneven distribution in SSDD. (**b**) Ship size uneven distribution in HRSID. (**c**) Small ships. (**d**) Large ships.

On the other hand, the network backend in Figure 2a lacks rich high-resolution representations, i.e., the spatial position information is lost to some degree. This is not conducive to handling position-sensitive vision problems [37], e.g., SAR ship instance segmentation. Thus, one should maintain high-resolution position representations across the whole conv process. Moreover, we think that the strong coupling between the depth and resolution directions potentially limits feature description capacity. Evidence [41,42] has indicated that decoupling them helps to improve the performance of pixel-sensitive tasks.

**Proposed approach.** Given the above, we designed MRFEN to extract more precise position representations and richer semantic representations. Moreover, although the multi-resolution approach offers limited accuracy gains for small ships, it can improve the instance segmentation performance of very large ships when the high-resolution mode is used. Figure 2b shows its architecture. MRFEN has three design concepts, i.e., (1) multi-resolution feature extraction (MRFF) in Section 3.1.1, (2) multi-scale attention-based feature fusion (MSAFF) in Section 3.1.2, and (3) atrous spatial pyramid pooling (ASPP) in Section 3.1.3.

#### 3.1.1. Multi-Resolution Feature Extraction (MRFF)

We retain high-resolution representations across the entire system, including the uppermost resolution-1 branch. This can boost instance segmentation of small ships due to more position information and heavier network parameters. The resolution-2 branch starts from stage-2, the resolution-3 branch starts from stage-3, and the resolution-4 branch starts from stage-4. This leverages lighter networks for larger ships so as to adapt to their easier detection. Consequently, the network depth and resolution are decoupled smoothly, which enables the network to optimize its parameters over a larger search space so as to further enhance fitting or learning capacity. Briefly, the above can be described by

$$
\begin{array}{c}
\mathcal{N}\_{11} \to \mathcal{N}\_{21} \to \mathcal{N}\_{31} \to \mathcal{N}\_{41} \\
\searrow \mathcal{N}\_{22} \to \mathcal{N}\_{32} \to \mathcal{N}\_{42} \\
\searrow \mathcal{N}\_{33} \to \mathcal{N}\_{43} \\
\searrow \mathcal{N}\_{44}
\end{array}
\tag{1}
$$

where $\mathcal{N}_{sr}$ denotes the sub-network of the $s$-th stage and the $r$-th resolution, $\to$ denotes the conv process, and $\searrow$ denotes the down-sampling process. Different from the image pyramid in [43], the low resolution in MRFEN comes from down-sampling the previous high resolution, rather than from down-sampling the input image. This is because feature maps from the front-end high-resolution sub-network are more representative.

#### 3.1.2. Multi-Scale Attention-Based Feature Fusion (MSAFF)

There are no direct interactions between different resolution branches after the network depth and resolution are decoupled. This hampers network information flow, possibly increasing the risk of overfitting from their separate local optimization. Moreover, training within their own closed feature space may also slow down training convergence, degrading performance. Therefore, it is essential to perform multi-scale feature fusion. (In this paper, resolution and scale share the same meaning.) To integrate features with different scales, down-sampling and up-sampling have been widely adopted [37,44,45]. Still, different from these authors, we suggest first using an attention module for feature refinement, and then executing the down-sampling and up-sampling. This enables more valuable features to be transmitted to the other branch so as to avoid possible negative interference. Taking $\mathcal{N}_{42}$ in Equation (1) as an example, we get its feature maps $\mathcal{F}_{42}$ by

$$\mathcal{F}_{42} = \mathcal{F}_{32} + \mathit{DownSamp}^{2\times}\left(f_{attention}(\mathcal{F}_{31})\right) + \mathit{UpSamp}^{2\times}\left(f_{attention}(\mathcal{F}_{33})\right) \tag{2}$$

where $\mathcal{F}_{sr}$ denotes the feature maps of $\mathcal{N}_{sr}$, $\mathit{DownSamp}^{n\times}$ denotes $n$-times down-sampling, $\mathit{UpSamp}^{n\times}$ denotes $n$-times up-sampling, and $f_{attention}$ denotes the refinement operation using an attention module. We implement $f_{attention}$ with a convolutional block attention module (CBAM) [46] with channel attention and spatial attention. One can also use other advanced attention modules [47] for better performance. In this work, we select CBAM because it is the most famous and has been used by many scholars in the SAR community [17].

The network architecture of CBAM is shown in the orange dashed box in Figure 2b. Let its input be *F* ∈ R*<sup>H</sup>*×*<sup>W</sup>*×*<sup>C</sup>*, where *H* and *W* are the height and width of the feature maps and *C* is the channel number. The channel attention generates a channel-dimension weight matrix *WCA* ∈ R<sup>1×1×*C*</sup> to measure the importance levels of the *C* channels; the spatial attention generates a space-dimension weight matrix *WSA* ∈ R*<sup>H</sup>*×*<sup>W</sup>*×<sup>1</sup> to measure the importance levels of the elements across the entire *H* × *W* space. Both range from 0 to 1 via a sigmoid activation. The result of the channel attention is denoted by *FCA* = *F*·*WCA*; the result of the spatial attention is denoted by *FSA* = *FCA*·*WSA*. See [46] for CBAM's details.
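As a reference, a minimal CBAM-style sketch is given below, following the design in [46]; the reduction ratio of 16 and the 7 × 7 spatial kernel are the defaults of [46], not values stated in this paper.

```python
# Minimal CBAM-style refinement block (sketch after [46]): channel attention weighs
# the C channels, spatial attention weighs the H x W positions; both use a sigmoid.
import torch
import torch.nn as nn


class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # channel attention: shared MLP over avg- and max-pooled descriptors
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # spatial attention: conv over the channel-wise avg and max maps
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # W_CA with shape (B, C, 1, 1)
        w_ca = torch.sigmoid(self.mlp(x.mean((2, 3), keepdim=True))
                             + self.mlp(x.amax((2, 3), keepdim=True)))
        x = x * w_ca                                  # F_CA = F * W_CA
        # W_SA with shape (B, 1, H, W)
        w_sa = torch.sigmoid(self.spatial(
            torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)))
        return x * w_sa                               # F_SA = F_CA * W_SA


print(CBAM(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```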

#### 3.1.3. Atrous Spatial Pyramid Pooling (ASPP)

Although the multi-scale attention-based feature fusion offers some other resolution responses from other branches, these kinds of responses are still limited among the total responses. Thence, we adopt the atrous spatial pyramid pooling (ASPP) [48] to deal with this problem. Its network architecture is depicted in the purple dashed box in Figure 2b. ASPP utilizes atrous convs [49,50] with different dilated rates to achieve multi-resolution feature responses in the single-resolution branch. It is described by

$$F\_{\rm ASPP} = f\_{1 \times 1} \left( \left[ f\_{3 \times 3}^2(F), f\_{3 \times 3}^3(F), f\_{3 \times 3}^4(F), f\_{3 \times 3}^5(F) \right] \right) \tag{3}$$

where *F* denotes the input, *F*ASPP denotes the output, *f<sup>r</sup>*<sub>3×3</sub> denotes a 3 × 3 conv with a dilated rate of *r*, and *f*1×<sup>1</sup> denotes a 1 × 1 conv for channel reduction, i.e., from the 4*C* channels of the four concatenated atrous conv outputs back to the raw *C* channels of *F*. In this way, different dilated rates enable different resolution responses, yielding contexts of different scopes. We set four dilated rates for the accuracy-speed trade-off. More might offer better performance but would sacrifice speed. Different from [48], the four dilated rates are set to 2, 3, 4 and 5 because of the small size of the low-resolution branch (*L*/32 × *L*/32). Notably, ASPP also allows our MRFEN to enlarge receptive fields so as to receive more ship surrounding context information. This is conducive to alleviating background interferences, e.g., blur edges, sidelobes, ship wakes, speckle noise (from the SAR imaging mechanism), tower cranes [8], and inshore facilities, as in Figure 4.
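A minimal sketch of the ASPP branch in Eq. (3) is given below; the channel count of 256 is an assumed example, while the dilated rates 2, 3, 4 and 5 follow the text.

```python
# Minimal ASPP sketch per Eq. (3): four 3x3 atrous convs with dilation rates 2, 3, 4, 5
# run in parallel, their outputs are concatenated (4C channels), and a 1x1 conv maps
# them back to C channels. The channel count is an assumed example.
import torch
import torch.nn as nn


class ASPP(nn.Module):
    def __init__(self, channels, rates=(2, 3, 4, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r)
            for r in rates)                                            # f^r_{3x3}
        self.project = nn.Conv2d(channels * len(rates), channels, 1)  # f_{1x1}

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))


x = torch.randn(1, 256, 16, 16)        # e.g., the low-resolution (L/32) branch
print(ASPP(256)(x).shape)              # torch.Size([1, 256, 16, 16])
```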

**Figure 4.** SAR ships and ships' surrounding contexts. Boxes with different colors and sizes denote the atrous convs with different dilated rates.

Finally, the outputs of ASPP at different resolution levels constitute the inputs of FPN (*C*2, *C*3, *C*<sup>4</sup> and *C*5). This also differs from the previous network in Figure 2a, which makes the last layers of all stages the inputs of FPN. Figure 2b performs better than Figure 2a because the former offers richer high-resolution position representations and richer low-resolution semantic representations at the same time at each stage.

#### *3.2. Enhanced Feature Pyramid Network (EFPN)*

**Existing approach.** The vanilla HTC followed the standard FPN paradigm [51] to ensure multi-scale performance as shown in Figure 5a. In Figure 5, *C*2, *C*3, *C*<sup>4</sup> and *C*<sup>5</sup> are the inputs of FPN which are from the backbone network as shown in Figure 2. However, this standard FPN offers limited performance for SAR ship instance segmentation from the following three aspects.

**Figure 5.** Feature pyramid networks. (**a**) Existing approach: FPN of HTC. (**b**) Proposed approach: EFPN of HTC+. (**c**) CARAFE implementation. (**d**) GCB implementation.

Firstly, on account of the huge resolution difference, e.g., 1 m resolution for TerraSAR-X and 20 m resolution for Sentinel-1, ships in SAR images present a huge scale difference, as shown in Figure 6a,b. This situation is called the cross-scale effect [52], e.g., the extremely small ships in Figure 6c vs. the extremely large ships in Figure 6d. The standard FPN was designed for natural object detection, e.g., the COCO [53] and PASCAL [54] datasets. These datasets have a relatively small scale difference. Thus, the raw FPN struggles to handle such cross-scale problems due to its limited number of FPN levels. From the clustering results of *K*-means, the raw four levels *P*2, *P*3, *P*<sup>4</sup> and *P*<sup>5</sup> are inferior to six levels in terms of the network's multi-scale feature learning ability. The mean IOU of the former is 0.5913, lower than that of the latter, 0.6490. Thus, we use more levels to deal with this special SAR ship cross-scale instance segmentation.

Secondly, small ships always constitute the majority in existing datasets owing to the characteristics of the "bird-eye" view of SAR. The raw bottom-level *P*<sup>2</sup> has limited searching ability for the small ships in Figure 6c, because small ships are diluted after multiple down-sampling operations (from *L* to *L*/4) due to their faint spatial features. As a result, a large number of small ships will be missed, reducing HTC's overall performance. Thus, we suggest generating a lower level *P*<sup>1</sup> to handle this problem, because lower levels offer richer spatial position information, which is beneficial for small ship instance segmentation.

Thirdly, although the original top-level *P*<sup>5</sup> may successfully detect the extremely large ships in Figure 6d using a rectangular bounding box, it still has limited pixel-segmentation performance. Different parts of the ship hull have different materials, resulting in differential radar electromagnetic scattering. This makes the pixel brightness distribution of a ship in a SAR image extremely uneven, which brings huge difficulties to classifiers for effective pixel-level discrimination. Therefore, we suggest generating a higher level *P*<sup>6</sup> to handle this problem, because high levels can offer more semantic information by shrinking large ships, thereby removing the ship hull's internal "black" pixels; then, in the mask recovery process, the nearest neighbor interpolation can fill those internal "black" pixels using correctly predicted ship "white" pixels, so as to achieve better segmentation performance, as shown in Figure 6d.

**Figure 6.** (**a**) Cluster results of four levels of HTC. (**b**) Cluster results of six levels of HTC+. Here, K-means is used for more intuitive presentation. (**c**) An SAR image with extremely small ships. (**d**) An SAR image with extremely large ships. Here, (**c**,**d**) have a huge scale difference due to different resolutions, i.e., cross-scale.

**Proposed approach.** Given the above, we propose an enhanced feature pyramid network (EFPN) to enhance multi-scale instance segmentation. Figure 5b shows its architecture. EFPN has four design concepts, i.e., (1) content-aware reassembly of features (CARAFE) in Section 3.2.1, (2) feature balance (FB) in Section 3.2.2, (3) feature refinement (FR) in Section 3.2.3, and (4) feature enhancement (FE) in Section 3.2.4.

#### 3.2.1. Content-Aware ReAssembly of Features (CARAFE)

We draw lessons from the advanced Content-Aware ReAssembly of Features (CARAFE) [55] to generate an extra higher top-level *P*<sup>6</sup> and an extra lower bottom-level *P*1. Note that the raw CARAFE did not offer a down-sampling operation; here, we expand it by adding an extra hyper-parameter *k*, where *k* = 2 denotes up-sampling and *k* = 1/2 denotes down-sampling. (i) *Generate P*1. Wang et al. [55] confirmed that CARAFE was superior to the nearest neighbor and bilinear interpolations, which both focus on sub-pixel neighborhoods and fail to capture the richer semantic information required by dense prediction tasks, and that it was also superior to the adaptive *deconv* [56], which uses a fixed kernel for all samples, resulting in limited receptive fields. CARAFE enables instance-specific content-aware handling with a large field of view, generating adaptive kernels on-the-fly. It can also aggregate global contextual information, enabling preferable performance for object detection [55]. Thence, it is adopted to generate *P*1. (ii) *Generate P*6. Shelhamer et al. [57] pointed out that simple max-pooling increases the risk of feature loss. Thus, to leverage CARAFE's advantage, we extend it for more effective feature down-sampling. The above is described by

$$\begin{aligned} P\_1 &= \text{CARAFE}^{\times 2}(P\_2) \\ P\_6 &= \text{CARAFE}^{\times \frac{1}{2}}(P\_5) \end{aligned} \tag{4}$$

Here, CARAFE×*<sup>k</sup>* denotes the *k* times down/up-sampling using CARAFE. It needs to be noted that one can follow the above operation to generate more levels, e.g., *P*0, *P*7, but the trade-off between speed and accuracy should be carefully considered.

Figure 5c shows the implementation of CARAFE. It contains a kernel prediction process and a feature reassembly process. (i) The former predicts an adaptive *k*-times down/up-sampling kernel *Kl* for the *l* location of the output feature maps from the corresponding original *l* location. The kernel size is *n* × *n*, which means the *n* × *n* neighbors of the location *l*. Here, *n* is set to 5 empirically, the same as in the raw report [55]. In other words, CARAFE considers the surrounding 5 × 5 = 25 pixels for down/up-sampling interpolation, and the contribution weights of these 25 pixels are obtained by adaptive learning. During the kernel prediction process, a 1 × 1 conv is used to compress channels and refine salient features, where the compression ratio *d* is set to 4, i.e., the raw 256 channels are compressed to 64. This not only reduces the calculation burden, but also preserves the benefits of the predicted kernels [58]. A 3 × 3 conv is used to encode contents; its channel number is *n*<sup>2</sup> × *k*<sup>2</sup>, where *k* denotes the down/up-sampling ratio (i.e., from *H* × *W* to *kH* × *kW*). The dimension transformation is completed using a pixel shuffle operation. Each reassembly kernel is normalized spatially by a softmax function so as to reflect the weight of each sub-content. Finally, the learned kernel *Kl* serves as the kernel for the subsequent feature reassembly process. (ii) The latter is a simple *n* × *n* conv, but its conv kernel parameters are determined by *Kl*. In the above way, the resulting down/up-sampled feature maps have the ability of content perception, yielding better feature representation. More details can be found in [55].
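The following is a minimal sketch of CARAFE-style content-aware up-sampling in the spirit of [55] (only the *k* = 2 case used for *P*1 is shown; the *k* = 1/2 down-sampling used for *P*6 follows the same kernel-prediction idea). It is illustrative and not the official implementation.

```python
# Minimal CARAFE-style up-sampling sketch (after [55]): kernel prediction (compressor,
# content encoder, pixel shuffle, softmax) followed by feature reassembly over n x n
# neighborhoods. k = 2, n = 5 and d = 4 follow the text; channel count is an example.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CarafeUp(nn.Module):
    def __init__(self, channels, k=2, n=5, d=4):
        super().__init__()
        self.k, self.n = k, n
        self.compress = nn.Conv2d(channels, channels // d, 1)                # channel compressor
        self.encode = nn.Conv2d(channels // d, n * n * k * k, 3, padding=1)  # kernel prediction
        self.shuffle = nn.PixelShuffle(k)                                    # (n^2*k^2,H,W)->(n^2,kH,kW)

    def forward(self, x):
        b, c, h, w = x.shape
        kernels = self.shuffle(self.encode(self.compress(x)))      # (B, n^2, kH, kW)
        kernels = torch.softmax(kernels, dim=1)                    # normalize each kernel
        # gather the n x n neighborhood of every source location
        neigh = F.unfold(x, self.n, padding=self.n // 2)           # (B, C*n^2, H*W)
        neigh = neigh.view(b, c, self.n * self.n, h, w)
        # each kH x kW output position reuses the neighborhood of its source pixel
        neigh = neigh.repeat_interleave(self.k, dim=3).repeat_interleave(self.k, dim=4)
        return (neigh * kernels.unsqueeze(1)).sum(dim=2)           # reassembly, (B, C, kH, kW)


print(CarafeUp(256)(torch.randn(1, 256, 16, 16)).shape)  # torch.Size([1, 256, 32, 32])
```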

#### 3.2.2. Feature Balance (FB)

Cross-scale ships predicted in different FPN levels suffer from a huge feature imbalance [59], especially as the number of levels increases. This imbalance also potentially leads to unstable training, owing to the huge number gap between small ships and large ships. Thence, we follow the practice from [59] to balance ship features with huge differences, i.e.,

$$\begin{aligned} P\_{\text{FB}} = \frac{1}{6} \{ &\text{UpSamp}^{8\times}(f\_{attention}(P\_6)) + \text{UpSamp}^{4\times}(f\_{attention}(P\_5)) + \text{UpSamp}^{2\times}(f\_{attention}(P\_4)) \\ &+ P\_3 + \text{DownSamp}^{2\times}(f\_{attention}(P\_2)) + \text{DownSamp}^{4\times}(f\_{attention}(P\_1)) \} \end{aligned} \tag{5}$$

Here, to fully leverage the advantage of the attention in Equation (2), before up/down-sampling, each level is also processed by CBAM to further increase representation power. Moreover, we rescale all levels into the *P*<sup>3</sup> level empirically, because it is located at the middle of the pyramid and thus provides both richer position information and semantic information. It can consider both the lower levels *P*1, *P*<sup>2</sup> and the higher levels *P*4, *P*5, *P*6. *P*<sup>4</sup> is also a middle level of the pyramid; still, it is not used as the rescaled level, because we hold the view that the network should contain more spatial position features for better small ship instance segmentation. Once all levels are rescaled to the same level, an average operation is performed for their balanced feature fusion. In this way, the resulting condensed multi-scale features contain balanced semantic features and position features from various resolutions. Finally, large ship features and small ones can complement each other to facilitate the information flow, alleviating feature imbalance and promoting smooth network training.
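A minimal sketch of the feature-balance step in Eq. (5) is shown below: every level is refined (here by a placeholder attention), rescaled to the *P*<sup>3</sup> resolution, and the six levels are averaged. Nearest-neighbor interpolation is used for both directions purely for brevity, and the channel count is an assumed example.

```python
# Minimal feature-balance sketch per Eq. (5): rescale all six levels to the P3 size
# (P3 itself is kept as-is, following the equation) and average them.
import torch
import torch.nn.functional as F


def feature_balance(levels, attention=lambda x: x):
    """levels: dict {1: P1, ..., 6: P6}; P1 has the highest resolution."""
    target = levels[3].shape[-2:]                      # rescale everything to the P3 size
    rescaled = []
    for idx, p in levels.items():
        p = attention(p) if idx != 3 else p            # P3 is kept as-is in Eq. (5)
        rescaled.append(F.interpolate(p, size=target, mode="nearest")
                        if p.shape[-2:] != target else p)
    return sum(rescaled) / len(rescaled)               # balanced multi-scale features


# toy pyramid: P1 is 4x larger than P3, P6 is 8x smaller
levels = {i: torch.randn(1, 256, 256 >> (i - 1), 256 >> (i - 1)) for i in range(1, 7)}
print(feature_balance(levels).shape)  # torch.Size([1, 256, 64, 64])
```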

#### 3.2.3. Feature Refinement (FR)

To recover a more robust FPN, we also adopt a global context block (GCB) [60] to further refine the condensed multi-scale features *F*FB. This practice is in fact motivated by Libra R-CNN [59]. However, the non-local block [61] used there can only capture spatial dependence, while the channel dependence is neglected. Therefore, we adopt the more advanced GCB to reach this aim, achieving global feature self-attention and channel squeeze-and-excitation (SE) [62] simultaneously. Figure 5d shows its architecture. GCB contains a context modeling module and a transform module. (i) The former first uses a 1 × 1 conv *Wk* and a *softmax* activation to generate the attention weights *A*; then, it conducts a global attention pooling to obtain the global context features, i.e., from *C* × *H* × *W* to *C* × 1 × 1. It is equivalent to the global average pooling in SE [62], but the average operation is replaced by an adaptive operation here. (ii) The latter is similar to SE, but before the rectified linear unit (ReLU) activation, the output of the 1 × 1 squeeze conv *Wv*<sup>1</sup> is first normalized to ensure better generalization, equivalent to the regularization of batch normalization (BN) [63]. Here, to refine more salient features of the six FPN levels, the squeeze ratio *r* is set to 6. The last 1 × 1 conv *Wv*<sup>2</sup> is used to transform the bottleneck features to capture channel-wise dependencies. Finally, an element-wise matrix addition is used for feature fusion. More details can be found in [60].
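For reference, a minimal GCB sketch following [60] is given below: context modeling via softmax attention pooling, then a bottleneck transform with a normalization before ReLU, fused back by element-wise addition. The squeeze ratio *r* = 6 follows the text; the channel count and the use of LayerNorm are assumptions of this sketch.

```python
# Minimal global context block (GCB) sketch after [60]: softmax attention pooling for
# context modeling, then a squeeze-excite-style bottleneck transform, fused by addition.
import torch
import torch.nn as nn


class GCB(nn.Module):
    def __init__(self, channels, r=6):
        super().__init__()
        self.wk = nn.Conv2d(channels, 1, 1)                      # attention weights A
        mid = max(channels // r, 1)
        self.transform = nn.Sequential(
            nn.Conv2d(channels, mid, 1),                         # W_v1 (squeeze)
            nn.LayerNorm([mid, 1, 1]),                           # normalization before ReLU
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1),                         # W_v2 (excite)
        )

    def forward(self, x):
        b, c, h, w = x.shape
        attn = torch.softmax(self.wk(x).view(b, 1, h * w), dim=-1)       # (B, 1, HW)
        context = torch.bmm(x.view(b, c, h * w), attn.transpose(1, 2))   # (B, C, 1)
        context = context.view(b, c, 1, 1)                               # global context
        return x + self.transform(context)                               # element-wise fusion


print(GCB(256)(torch.randn(1, 256, 32, 32)).shape)  # torch.Size([1, 256, 32, 32])
```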

#### 3.2.4. Feature Enhancement (FE)

To reduce the risk of feature loss from the boundary levels *P*<sup>1</sup> and *P*<sup>6</sup> due to their relatively large up/down-sampling ratios, we also add extra two skip connections for their feature enhancement while recovering them, i.e.,

$$\begin{aligned} B\_1 &= P\_1 + \text{UpSamp}^{4\times}(f\_{attention}(F\_{\text{FB}})) \\ B\_6 &= P\_6 + \text{DownSamp}^{8\times}(f\_{attention}(F\_{\text{FB}})) \end{aligned} \tag{6}$$

where *Bi* denotes the *i*-th level of the recovered FPN. Here, we obtain the other remaining levels using the reverse operation of Equation (5). Finally, the recovered FPN *Bi* will be able to predict cross-scale SAR ships in a more elegant and stable paradigm.

#### *3.3. Semantic-Guided Anchor Adaptive Learning Network (SGAALN)*

**Existing approach.** The raw HTC used the classic anchor generation scheme [31] as in Figure 7a. This scheme uniformly arranges dense anchors with fixed shapes to every location among an image. However, it is not suitable for SAR ship instance segmentation.

**Figure 7.** Different anchor generation schemes. (**a**) Existing approach: dense and shape-fixed anchors of HTC. (**b**) Proposed approach: Sparse and shape-adaptive anchors of HTC+. Blue boxes denote anchor boxes.

On the one hand, ships in SAR images exhibit a sparse distribution due to the characteristics of the SAR "bird-eye" remote sensing view; e.g., there are only 2.12 ships on average per image in the SSDD dataset. Uniformly and densely allocating anchors everywhere in a SAR image will potentially increase false alarms; additionally, numerous anchors are redundant, causing a heavier computational burden. Thus, one should first adaptively predict the possible positions where ships are more likely to occur, and then arrange anchors on these positions to handle the above problem. On the other hand, the raw anchors with fixed shapes (width, height and aspect ratio) are not conducive to multi-scale prediction and also slow down training. The hand-crafted preset anchors with fixed shapes are not in line with real ships with changeable shapes, reducing multi-scale performance. Although one can draw support from *K*-means clustering on a specific dataset to obtain better initial anchors [25], this practice becomes troublesome when applied to more datasets. Moreover, the initial anchors still have fixed shapes, which does not resolve the substantive issue. Thus, we should adaptively predict anchor shapes to handle the above problem.

**Proposed approach.** Given the above, we establish a semantic-guided anchor adaptive learning network (SGAALN) to achieve adaptive anchor location prediction and adaptive anchor shape prediction. The execution visual effect of SGAALN is shown in Figure 7b. Here, we leverage high-level semantic features to reach this aim, because they enable higher anchor quality than low-level ones [29]. SGAALN has three design concepts, i.e., (1) anchor location prediction (ALP) in Section 3.3.1, (2) anchor shape prediction (ASP) in Section 3.3.2, and (3) feature adaption (FA) in Section 3.3.3.

#### 3.3.1. Anchor Location Prediction (ALP)

We use a 1 × 1 conv *WL*, whose channel number is set to 1, for the adaptive anchor location prediction, as shown in Figure 8a. This conv works on the input semantic features denoted by *Q*. The resulting feature maps have a *H* × *W* × 1 dimension, where *H* × *W* is the whole 2D space. This feature map is then activated by a *sigmoid* function to obtain the probability of ship occurrence *Pship* ∈ [0, 1] at each (*x*, *y*) position across the whole *H* × *W* 2D space. When *Pship* is bigger than the preset threshold *t*, this (*x*, *y*) position will generate anchors; otherwise, this position is removed and no anchors are generated there. Here, *t* is set to 0.01 empirically according to the experimental results in Section 6.3.
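The following is a minimal sketch of this anchor location prediction step; the 256-channel semantic features and the 64 × 64 grid are assumed example sizes.

```python
# Minimal sketch of anchor location prediction: a 1x1 conv with one output channel on
# the semantic features Q, a sigmoid giving P_ship over the H x W grid, and a threshold
# t that keeps only the positions that will receive anchors.
import torch
import torch.nn as nn

w_l = nn.Conv2d(256, 1, kernel_size=1)          # W_L, channel number = 1 (256-ch Q assumed)
q = torch.randn(1, 256, 64, 64)                 # semantic features Q (example size)

p_ship = torch.sigmoid(w_l(q))                  # (1, 1, 64, 64), values in [0, 1]
t = 0.01                                        # threshold value from the text
keep = p_ship > t                               # boolean mask of anchor positions
ys, xs = keep[0, 0].nonzero(as_tuple=True)      # (y, x) grid locations that get anchors
print(len(ys), "anchor locations kept")
```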

**Figure 8.** SGAALN architecture. (**a**) Anchor location prediction. (**b**) Anchor shape prediction. (**c**) Feature adaption.

#### 3.3.2. Anchor Shape Prediction (ASP)

We arrange another 1 × 1 conv *WS*, whose channel number is set to 2, for the adaptive anchor shape prediction, as shown in Figure 8b. This conv also works on the input semantic features denoted by *Q*. The resulting feature maps have a *H* × *W* × 2 dimension, because we need to estimate two parameters, i.e., the anchor width *w* and height *h*. Note that the anchor shape prediction works on all positions of the whole *H* × *W* 2D space for ease of network implementation. Finally, according to the previously obtained anchor prediction locations, the redundant shape predictions are filtered out. As a result, the adaptive anchor parameters (*x*, *y*, *w*, *h*) are obtained, which are then input to the RPN for classification and regression.

#### 3.3.3. Feature Adaption (FA)

Due to the cross-scale effect of SAR ships, the adaptively predicted anchors also exhibit a huge shape difference. This brings about a huge difference in encoded content on the input semantic features when they are pooled for the subsequent box and mask prediction. Thus, large anchors should encode content in a large region, while small anchors should use a correspondingly small region. However, the raw semantic features *Q* do not meet this requirement because they are designed for fixed-shape anchors. Coincidentally, the existing deformable conv [64] offers an effective solution for this issue, where the previously learned anchor shape (*w*, *h*) exactly corresponds to the deformable conv kernel's bias (i.e., the offset field). Thence, we use a deformable conv *WD* to process the input semantic features *Q* for such feature adaption, i.e.,

$$q\_i' = \mathcal{W}\_D(q\_i, w\_i, h\_i) \tag{7}$$

where *wi* and *hi* are the width and height of the anchor at the *i*-th position of the *H* × *W* grid, *qi* ∈ *Q*, and *q'i* ∈ *Q'*, where *Q'* denotes the output of feature adaption. The optimized anchors enable better proposals to extract ROI feature subsets using ROIAlign.

#### *3.4. Context Regions of Interest Extractor (CROIE)*

**Existing approach.** The raw HTC followed the standard two-stage ROI extractor (ROIE) of Mask R-CNN [30] to extract feature subsets of ROIs for the subsequent mask prediction, as shown in Figure 9a. That is, the bounding box prediction is first conducted, and then the mask prediction is performed in the resulting *h* × *w* box. However, this practice still has limited SAR ship mask prediction performance. On the one hand, the mask prediction relies heavily on the box prediction. If the offered boxes are not accurate, the mask prediction must become poor. From Figure 9a, the features for mask prediction exist in the corresponding box with limited receptive fields, reducing the global field of vision. This limits segmentation performance improvements. On the other hand, as in Figure 4, due to the special SAR imaging mechanism, SAR ships have many complex surroundings outside the compact box, e.g., blur edges, sidelobes, ship wakes, speckle noise, tower cranes, and inshore facilities. These bring non-negligible effects on mask prediction. Thus, the compact box makes it impossible for the mask prediction to observe more ship backgrounds, e.g., ship-like pixel noise and ship wakes. Although the box can eliminate cross-sidelobes deviating too far from the ship center, a few sidelobe and noise pixels in the box can make it difficult to ensure segmentation learning benefits. Thus, it is necessary to expand the receptive field to explicitly find the clear boundary between a ship and its surrounding context.

**Figure 9.** Different ROI extractors (ROIEs). (**a**) Existing approach: ROIE of HTC. (**b**) Proposed approach: CROIE of HTC+.

**Proposed approach.** Given the above, we design a context ROI extractor (CROIE) to add multi-scope contexts to the box for better mask prediction, as shown in Figure 9b. We arrange two different scope contexts, denoted by ROI-C1 with a size of *c*1*w* × *c*1*h* and ROI-C2 with a size of *c*2*w* × *c*2*h*. Here, *c*<sup>1</sup> and *c*<sup>2</sup> (*c*<sup>2</sup> > *c*<sup>1</sup> > 1) are two amplification factors which are set to 1.5 and 2.0, respectively, according to the experiments in Section 6.4. This idea is motivated by Kang et al. [65]; differently, however, we consider multi-scope contexts. Moreover, we do not use more context ROIs, e.g., ROI-C3, considering the trade-off of speed and accuracy. Figure 10 shows CROIE's implementation. CROIE has three design concepts, i.e., (1) concatenation in Section 3.4.1, (2) channel shuffle in Section 3.4.2, and (3) dimension reduction squeeze-and-excitation (DRSE) in Section 3.4.3.

**Figure 10.** Implementation of CROIE.

#### 3.4.1. Concatenation

We concatenate three feature subsets after ROIAlign. We find that the feature concatenation performs better than the feature adding, because the former achieves feature reuse for better deep supervision. Then, we use a 3 × 3 group conv to refine them for subsequent operations where the group division factor *g* is set to 3. The above is described by

$$F\_{\rm CAT} = f\_{\rm 3 \times 3,3}([\rm ROIAlign(ROI), ROIAlign(ROI - C1), ROIAlign(ROI - C2)])\tag{8}$$

where *f*3×3,3 denotes the 3 × 3 group conv whose group division factor is 3, and *F*CAT denotes the output of concatenation.

#### 3.4.2. Channel Shuffle

To reduce the effect of feature collaboration consistency in the single ROI, we also shuffle the resulting features *F*CAT along the channel dimension to enable more powerful representation. The result is denoted by *F*CS.

#### 3.4.3. Dimension Reduction Squeeze-and-Excitation (DRSE)

To balance allocation contributions of different ROIs with different context scopes, we also design a dimension reduction squeeze-and-excitation (DRSE) block, an extended version of SE [62] (the raw SE did not achieve dimension reduction), to model channel correlation. It can suppress useless channels and highlight valuable ones meanwhile reducing channel dimension, which reduces the risk of the training oscillation due to excessive irrelevant backgrounds. Consequently, moderate contexts can be offered for mask prediction. Figure 11 shows DRSE's implementation. In the collateral branch, the global average pooling is used to achieve the global spatial information; a 1 × 1 conv with a sigmoid activation is used to squeeze channels to focus on important ones. The squeeze ratio *p* is set to 3 (256 × 3 → 256). In the main branch, the input channel's number is reduced directly using a 1 × 1 conv with a ReLU activation. The broadcast elementwise multiplication is used for compressed channel weighting. DRSE will model channel correlation of the input feature maps in a reduced dimension space. Then, it leverages the learned weights from the reduced dimension space to pay attention to the important features of the main branch. In this way, the potential information loss from the crude dimension reduction is avoided. The above is described by

$$F\_{\rm DRSE} = \text{ReLU}(conv\_{1 \times 1}(F\_{\rm CS})) \odot \sigma(conv\_{1 \times 1}(GAP(F\_{\rm CS}))) \tag{9}$$

where *F*CS denotes the input, i.e., the output of channel shuffle, *F*DRSE denotes the output, *σ* denotes the sigmoid function, and ⊙ denotes the broadcast element-wise multiplication.
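The fusion path of CROIE (Eqs. (8) and (9)) can be sketched as below; the 7 × 7 ROIAlign output size and the 256 channels per ROI subset are assumed example values, and the sketch is not the authors' code.

```python
# Minimal CROIE fusion sketch: concatenate the three pooled ROI feature subsets, refine
# with a 3x3 group conv (g = 3), shuffle channels, then apply DRSE as in Eq. (9).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DRSE(nn.Module):
    """Dimension-reduction squeeze-and-excitation: 768 -> 256 in both branches (p = 3)."""
    def __init__(self, in_ch=768, out_ch=256):
        super().__init__()
        self.main = nn.Conv2d(in_ch, out_ch, 1)        # main branch reduction
        self.gate = nn.Conv2d(in_ch, out_ch, 1)        # collateral branch squeeze

    def forward(self, f_cs):
        gap = F.adaptive_avg_pool2d(f_cs, 1)           # global average pooling
        weights = torch.sigmoid(self.gate(gap))        # (B, 256, 1, 1)
        return F.relu(self.main(f_cs)) * weights       # Eq. (9), broadcast multiply


def channel_shuffle(x, groups=3):
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)


roi, roi_c1, roi_c2 = (torch.randn(8, 256, 7, 7) for _ in range(3))  # pooled subsets
f_cat = nn.Conv2d(768, 768, 3, padding=1, groups=3)(
    torch.cat([roi, roi_c1, roi_c2], dim=1))           # Eq. (8), 3x3 group conv
f_cs = channel_shuffle(f_cat)                           # channel shuffle
print(DRSE()(f_cs).shape)                               # torch.Size([8, 256, 7, 7])
```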

**Figure 11.** Implementation of DRSE.

#### *3.5. Enhanced Mask Interaction Network (EMIN)*

**Existing approach.** The raw HTC designed the mask interaction network (MIN) shown in Figure 12a to establish a connection between different stages. Mask features of the previous stage *Mi*−<sup>1</sup> are refined by a 1 × 1 conv for the next stage *Mi*. A simple addition is used for feature fusion, i.e., *conv*1×1(*Mi*−1(*F*)) + *F*, where *F* denotes the feature maps of the backbone network. We observe that a 1 × 1 conv offers limited feature refinement, and a direct feature addition also offers limited fusion benefits.

**Figure 12.** Mask interaction network (MIN). (**a**) Existing approach: MIN of HTC. (**b**) Proposed approach: EMIN of HTC+.

**Proposed approach.** Thus, we design an enhanced mask interaction network (EMIN) whose architecture is shown in Figure 12b. EMIN has two design concepts, i.e., (1) global feature self-attention (GFSA) in Section 3.5.1, and (2) adaptive mask feature fusion (AMFF) in Section 3.5.2.

#### 3.5.1. Global Feature Self-Attention (GFSA)

We design a global feature self-attention (GFSA) block to replace the raw 1 × 1 conv, inspired by the advanced non-local neural networks [61]. GFSA can capture long-range dependencies of each mask pixel across the whole space, enabling a global receptive field, which is conducive to the efficient flow of information and to context modeling. Figure 13a shows its implementation. In Figure 13a, features at the *i*-position are embedded as *φ* using a 1 × 1 conv *Wφ*. Features at the *j*-position are embedded as *θ* using a 1 × 1 conv *Wθ*. *f* is obtained from adaptive learning between *φ* and *θ*, where the normalization process is equivalent to a *softmax* function. The representation of the input at the *j*-position, *g*, is learned using another 1 × 1 conv *Wg*. The response at the *i*-position, *yi*, is obtained by a matrix multiplication. Note that we embed all features into a *C*/4 channel space to reduce computational burden. To apply the response to the input readily, we use another 1 × 1 conv *Wz* to transform the dimension for the adding operation. Finally, we obtain the global feature self-attention output *F*GFSA, which is transmitted to the next stage.
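A minimal non-local-style GFSA sketch is given below, following [61]: *φ*, *θ* and *g* are 1 × 1 convs embedding into *C*/4 channels, the pairwise response is softmax-normalized, and *Wz* restores the dimension for the residual addition. The 14 × 14 mask feature size in the usage line is an assumed example.

```python
# Minimal non-local-style GFSA sketch after [61]: pairwise responses over all HW
# positions, softmax-normalized, applied to g, then projected back and added residually.
import torch
import torch.nn as nn


class GFSA(nn.Module):
    def __init__(self, channels):
        super().__init__()
        mid = channels // 4                              # embed into C/4 space
        self.phi = nn.Conv2d(channels, mid, 1)
        self.theta = nn.Conv2d(channels, mid, 1)
        self.g = nn.Conv2d(channels, mid, 1)
        self.wz = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        phi = self.phi(x).view(b, -1, h * w)             # (B, C/4, HW)
        theta = self.theta(x).view(b, -1, h * w)
        g = self.g(x).view(b, -1, h * w)
        f = torch.softmax(torch.bmm(phi.transpose(1, 2), theta), dim=-1)  # (B, HW, HW)
        y = torch.bmm(g, f.transpose(1, 2)).view(b, -1, h, w)             # responses y_i
        return x + self.wz(y)                            # residual addition


print(GFSA(256)(torch.randn(1, 256, 14, 14)).shape)  # torch.Size([1, 256, 14, 14])
```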

**Figure 13.** (**a**) Implementation of the global feature self-attention (GFSA). (**b**) Implementation of the adaptive mask feature fusion.

#### 3.5.2. Adaptive Mask Feature Fusion (AMFF)

We arrange an adaptive mask feature fusion (AMFF) scheme for reasonably allocating contributions, instead of the direct feature adding. AMFF is depicted in Figure 13b. Firstly, the previous mask feature *F*GFSA and the backbone network feature *F* are flattened into 1D column vectors respectively, i.e., from *H* × *W* × *C* to *HWC* × 1.

Then, they are concatenated directly to be inputted into a fully-connected (FC-1) layer. Finally, the terminal 2-element FC-2 layer with a *softmax* activation is used to achieve two adaptive weight parameters *α* and *β*. Here, due to the use of the *softmax* activation, *α* plus *β* equals 1. The above is described by

$$\mathcal{W} = \operatorname{softmax}\{\operatorname{FC}\_2(\operatorname{FC}\_1([\operatorname{flatten}(F\_{\text{GFSA}}), \operatorname{flatten}(F)]))\} \tag{10}$$

where *W* = [*α*, *β*] *<sup>T</sup>* denotes the weight vector. Finally, the mask interaction is implemented by

$$M\_i(F) = \alpha \cdot F\_{\text{GFSA}} + \beta \cdot F \tag{11}$$

where *F*GFSA = GFSA(*Mi*−1(*F*)). In summary, Equation (11) will be used to replace the original expression *Mi*(*F*)=*conv*1×1(*Mi*−1(*F*)) + *F* in Figure 12a to form the final Figure 12b.
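A minimal AMFF sketch per Eqs. (10) and (11) is shown below; the hidden size of the FC-1 layer is an assumed example value.

```python
# Minimal AMFF sketch per Eqs. (10)-(11): flatten the two feature maps, concatenate,
# pass through two FC layers, and use a softmax to obtain the fusion weights alpha and
# beta (alpha + beta = 1 by construction).
import torch
import torch.nn as nn


class AMFF(nn.Module):
    def __init__(self, feat_numel, hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(2 * feat_numel, hidden)     # FC-1 on the concatenated vector
        self.fc2 = nn.Linear(hidden, 2)                  # FC-2, one logit per weight

    def forward(self, f_gfsa, f):
        v = torch.cat([f_gfsa.flatten(1), f.flatten(1)], dim=1)            # (B, 2*HWC)
        alpha, beta = torch.softmax(self.fc2(self.fc1(v)), dim=1).unbind(dim=1)
        # broadcast the per-sample weights back over (C, H, W), Eq. (11)
        return alpha.view(-1, 1, 1, 1) * f_gfsa + beta.view(-1, 1, 1, 1) * f


f_gfsa, f = torch.randn(2, 256, 14, 14), torch.randn(2, 256, 14, 14)
print(AMFF(256 * 14 * 14)(f_gfsa, f).shape)  # torch.Size([2, 256, 14, 14])
```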

#### *3.6. Post-Processing Technique (PPT)*

**Existing approach.** The raw HTC offers NMS [27] and Soft-NMS [28] to remove duplicate detections, but neither considers the prior knowledge of ships, that is, in most cases, only ships with similar aspect ratios dock together side by side, as shown in Figure 14. Ignoring this prior knowledge causes boxes that should be retained to be removed when NMS is used, and boxes that should be deleted to be retained when Soft-NMS is used. We think that when the ship aspect ratios are similar, one should use Soft-NMS to avoid missed detections; however, when the ship aspect ratios differ greatly, one should use NMS to delete redundant boxes.

**Figure 14.** Densely moored ships. There are some hull overlaps between ships. In most cases, only ships with similar aspect ratios dock together. Green boxes denote the ground truths.

**Proposed approach.** Thus, we propose a post-processing technique (PPT) guided by the prior of ship aspect ratios to adaptively select NMS or Soft-NMS. Algorithm 1 shows the implementation of PPT. We set a similarity threshold on ship aspect ratios to judge two cases: (i) a huge aspect ratio difference, i.e., |*ri* − *rm*| greater than the threshold, where *rm* denotes the aspect ratio of the box with the highest score *sm* and *ri* denotes that of the remaining boxes to traverse; and (ii) a small aspect ratio difference, i.e., |*ri* − *rm*| within the threshold. The threshold is set to 0.20 empirically (see Section 6.6). For the former, NMS is executed to remove boxes safely (i.e., B←B−*bi*, S←S−*si*). For the latter, Soft-NMS is executed to retain boxes (i.e., B←B, S←S), but the current box is given a penalty score *si* ← *si*·exp(−*IOU*(M, *bi*)<sup>2</sup>/*σ*) to handle the densely moored ship detections in Figure 14.

**Algorithm 1:** PPT guided by the prior of ship aspect ratios.

**Input:** B = {*b*1, *b*2, ··· , *bN*}, R = {*r*1, *r*2, ··· , *rN*}, S = {*s*1, *s*2, ··· , *sN*}, *Nt*, and the similarity threshold of aspect ratios. B denotes the list of initial detection boxes. R denotes the list of initial detection aspect ratios. S denotes the list of initial detection scores. *Nt* denotes the IOU threshold.

**Begin**
1: D ← {}
2: **while** B ≠ ∅ **do**
3:   *m* ← argmax S
4:   M ← *bm*
5:   P ← *rm*
6:   D ← D ∪ M; B ← B − M; R ← R − P
7:   **for** (*bi*, *ri*) in *zip*(B, R) **do**
8:     **if** *IOU*(M, *bi*) ≥ *Nt* **then**
9:       **Case 1**: |*ri* − *rm*| greater than the threshold
10:       B ← B − *bi*; R ← R − *ri*; S ← S − *si*  # NMS
11:       **Case 2**: |*ri* − *rm*| within the threshold
12:       B ← B; R ← R; S ← S
13:       *si* ← *si*·exp(−*IOU*(M, *bi*)<sup>2</sup>/*σ*), ∀*bi* ∉ D  # Soft-NMS
14:     **end**
15:   **end**
16: **return** D, S
**end**
**Output:** D, S

The existing approach considers only one of Case 1 and Case 2; the proposed approach considers both Case 1 and Case 2.
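For illustration, a minimal NumPy-style sketch of Algorithm 1 follows (not the authors' code); the box format, IOU threshold and example detections are illustrative.

```python
# Minimal sketch of Algorithm 1: when the aspect-ratio difference is large the
# overlapping box is removed (NMS); when it is small the box is kept but its score is
# decayed with the Gaussian Soft-NMS penalty.
import numpy as np


def iou(a, b):
    """a, b: boxes as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter + 1e-9)


def ppt(boxes, ratios, scores, nt=0.5, r_thr=0.20, sigma=0.5):
    boxes, ratios, scores = map(list, (boxes, ratios, scores))
    keep_boxes, keep_scores = [], []
    while boxes:
        m = int(np.argmax(scores))
        mb, mr = boxes.pop(m), ratios.pop(m)
        keep_boxes.append(mb)
        keep_scores.append(scores.pop(m))
        survivors = []
        for b, r, s in zip(boxes, ratios, scores):
            if iou(mb, b) >= nt:
                if abs(r - mr) > r_thr:
                    continue                                   # Case 1: NMS, drop box
                s = s * np.exp(-iou(mb, b) ** 2 / sigma)       # Case 2: Soft-NMS decay
            survivors.append((b, r, s))
        boxes, ratios, scores = map(list, zip(*survivors)) if survivors else ([], [], [])
    return keep_boxes, keep_scores


dets = [(0, 0, 10, 20), (1, 1, 11, 21), (2, 0, 30, 10)]
ratios = [(b[2] - b[0]) / (b[3] - b[1]) for b in dets]         # aspect ratios (w / h)
print(ppt(dets, ratios, [0.9, 0.8, 0.7]))
```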

#### *3.7. Hard Sample Mining Training Strategy (HSMTS)*

**Existing approach.** The raw HTC did not offer useful training strategies to boost learning benefits, as shown in Figure 15a, so we propose a hard sample mining training strategy (HSMTS) to fill this gap for better accuracy. HSMTS is inspired by the extreme imbalance between positive and negative samples in SAR images: ships exhibit a sparse distribution in the whole SAR image, while backgrounds occupy many more pixels. This causes the number of negative samples to be much larger than the number of positive samples, so one should select the more typical ones among the large number of negative samples to train the network, so as to ensure the background discrimination ability of the model. Otherwise, the network will overfit to a large number of simple, low-value negative samples. The more typical negative samples are those which are difficult to distinguish (close to positive). Only by emphasizing learning on them can the network improve its discrimination ability; repeated and mechanical learning on simple samples is worthless.

**Proposed approach.** HSMTS is in fact motivated by [66], but we add the extra supervision of the mask prediction loss. Figure 15b shows its implementation. From Figure 15b, we monitor the terminal training loss of EMIN, where *Loss* = *LossCLS* + *LossREG* + *LossMASK*. Here, *LossCLS* denotes the box classification loss, *LossREG* denotes the box regression loss, and *LossMASK* denotes the mask prediction loss. The total training loss is first ranked. In training, the *K* negative samples with the top-*K* losses are collected into a hard negative sample pool. When the number of samples in the pool reaches a batch size, these hard negative samples are mapped into the feature maps of the backbone network via CROIE to extract feature subsets again for the next box and mask prediction. The above process is repeated until the end of training and does not destroy end-to-end training. The total number of required samples is 512. The positive-to-negative ratio is 1:3, in line with the raw report, so the number of positive samples is 128; as a result, *K* is equal to 384.
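The hard-negative selection at the core of HSMTS can be sketched as below: negative proposals are ranked by their current total loss and only the top-*K* hardest ones are kept (*K* = 384, matching the 1:3 positive-to-negative ratio in the text). This is a simplified sketch, not the training pipeline itself.

```python
# Minimal sketch of hard-negative selection: rank negative proposals by their current
# total loss (cls + reg + mask) and keep only the top-K hardest ones.
import torch


def select_hard_negatives(losses, is_negative, k=384):
    """losses: per-proposal total loss; is_negative: boolean mask of negative proposals."""
    neg_losses = losses.masked_fill(~is_negative, float("-inf"))   # ignore positives
    k = min(k, int(is_negative.sum()))
    return torch.topk(neg_losses, k).indices                       # indices of hard negatives


# toy usage: 2000 proposals, roughly 95% negatives
losses = torch.rand(2000)
is_negative = torch.rand(2000) > 0.05
hard_idx = select_hard_negatives(losses, is_negative)
print(hard_idx.shape)  # torch.Size([384])
```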

**Figure 15.** Different training strategies. (**a**) Existing approach: There are no hard negative sample mining mechanism of HTC. (**b**) Proposed approach: Hard sample mining training strategy (HSMTS) of HTC+.

#### **4. Experiments**

#### *4.1. Dataset*

SSDD [13] offers 1160 image samples with a 512 × 512 size from RadarSat-2, TerraSAR-X and Sentinel-1. The training-test ratio is 4:1 [13]. There are 2551 ships in SSDD. The smallest ship is 28 pixel<sup>2</sup> and the largest one is 62,878 pixel<sup>2</sup> (pixel<sup>2</sup> denotes the product of the width and height in pixels). SAR ships in SSDD are provided with various resolutions from 1 m to 10 m, and HH, HV, VV, and VH polarizations. In SSDD, the images whose file-name suffix is 1 or 9 (232 samples) are selected as the test set, and the others form the training set (928 samples).

HRSID [15] offers 5604 image samples with an 800 × 800 average size from TerraSAR-X and Sentinel-1. There are 16,965 ships in HRSID. SAR ships in HRSID are provided with resolutions from 0.1 m to 3 m, and HH, HV, and VV polarizations. The smallest ship is 3 pixel<sup>2</sup> and the largest one is 522,400 pixel<sup>2</sup>. HRSID is divided into a training set (3643 samples) and a test set at a ratio of 13:7, the same as [15].

#### *4.2. Training Details*

The backbone sub-network MRFEN of HTC+ and the other backbone networks used for performance comparison are pretrained on ImageNet [67]. Following Faster R-CNN [31], we first train the backbone network, the EFPN sub-network, and the region generation sub-network SGAALN jointly. Then, we fix them to train the CROIE sub-network and the EMIN sub-network. Moreover, the same as for HTC, we train and test HTC+ end-to-end. We train all networks (HTC+ and the other nine models) for 12 epochs on the SAR ship datasets using SGD [68]. The learning rate is set to 0.008 and is reduced by a factor of 10 at the 8th and 11th epochs. The batch size is set to 4. Other training details are consistent with HTC. Experiments were run on a personal computer with an RTX 3090 GPU, an i9-9900 CPU, and 32 GB memory based on mmdet [69] and PyTorch. In the test, the trained weights are loaded to evaluate performance.
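The optimization schedule described above can be sketched as follows; the momentum and weight decay values are assumed common defaults rather than values stated in the text, and the model is a stand-in.

```python
# Minimal sketch of the schedule: SGD with lr 0.008, reduced by a factor of 10 at
# epochs 8 and 11 over 12 training epochs. Momentum/weight decay are assumed defaults.
import torch

model = torch.nn.Conv2d(3, 8, 3)                     # stand-in for HTC+
optimizer = torch.optim.SGD(model.parameters(), lr=0.008,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[8, 11], gamma=0.1)

for epoch in range(12):
    # ... one training epoch over the SAR ship dataset with batch size 4 ...
    optimizer.step()                                 # placeholder for the inner loop
    scheduler.step()
```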

#### *4.3. Evaluation Criteria*

Similar to COCO [53], AP is the mean AP over IOU thresholds from 0.50 to 0.95 with a step of 0.05. AP50 is the accuracy at a 0.50 IOU threshold; AP75 is that at a 0.75 IOU threshold. APS is that of small ships (<32<sup>2</sup> pixels), APM is that of medium ships (between 32<sup>2</sup> and 96<sup>2</sup> pixels), and APL is that of large ships (>96<sup>2</sup> pixels). In this paper, AP serves as the core index for accuracy comparison.

#### **5. Results**

To reveal the state-of-the-art performance of our HTC+, nine competitive models are reproduced for comparison of accuracy. The comparative experiments are conducted on SSDD and HRSID datasets.

#### *5.1. Quantitative Results*

Tables 2 and 3 show the quantitative comparison results on SSDD and HRSID. HTC is the experimental baseline, reproduced essentially in line with the raw report. Its box and mask AP are comparable to the reports in [17,18].

From Tables 2 and 3, the following conclusions can be drawn:





**Table 3.** Quantitative results on HRSID. FPS: frames per second.

Table 4 shows the computational complexity of the different methods. Here, we adopt floating point operations (FLOPs) to measure the calculations, whose unit is giga multiply-add calculations (GMACs) [70]. From Table 4, the calculation amount of GCBANet is larger than that of the other models, so future optimization of the model's computational complexity is needed.

**Table 4.** Computational complexity of different methods. Here, we adopt floating point operations (FLOPs) to measure the calculations, whose unit is giga multiply-add calculations (GMACs) [70].


#### *5.2. Qualitative Results*

Figures 16 and 17 show the qualitative results on SSDD and HRSID. Due to limited pages, we only show the comparison results with HTC. From Figures 16 and 17, HTC+ can detect more real ships than HTC (e.g., the #5 sample in Figure 16). It can suppress false alarms (e.g., the #2 sample in Figure 16). It also offers better positioning performance (e.g., the #1 sample in Figure 16) and higher confidence scores (e.g., the #6 sample in Figure 16). These reveal the state-of-the-art performance of HTC+.

**Figure 16.** Qualitative results on SSDD. (**a**) The vanilla HTC. (**b**) Our HTC+. Green boxes denote the ground truths. Orange boxes denote the false alarms (i.e., false positives, FP). Red circles denote the missed detections (i.e., false negatives, FN).

**Figure 17.** Qualitative results on HRSID. (**a**) The vanilla HTC. (**b**) Our HTC+. Green boxes denote the ground truths. Orange boxes denote the false alarms (i.e., false positives, FP). Red circles denote the missed detections (i.e., false negatives, FN).

#### **6. Ablation Study**

In this section, we will conduct ablation studies for sensitivity analysis of each technique. All the following experiments are conducted on SSDD. These ablation studies can also confirm the effectiveness of each technique. For rigorous comparison, we install and remove the particular technique while freezing the other six.

#### *6.1. Ablation Study on MRFEN*

#### 6.1.1. Effectiveness of MRFEN

Table 5 shows the quantitative results with and without MRFEN. MRFEN offers a better accuracy than the common plain structure, because it expands the range of network spatial search to achieve multi-resolution analysis, enabling better multi-scale ship feature representation.

**Table 5.** Quantitative results with and without MRFEN.


<sup>1</sup> ResNet-101 in Figure 2a. <sup>2</sup> Our MRFEN in Figure 2b.

#### 6.1.2. Component Analysis in MRFEN

Table 6 shows the results of the component analysis in MRFEN. All components are conducive to improving accuracy, showing their effectiveness. The three components enhance the multi-scale feature description of the network, enabling better performance. MRFE reduces the speed greatly because it makes the network heavier. Although MRFE offers only a 0.2% box AP gain, its mask AP gain is more significant than the others', so it is still effective. One can also find that after the feature fusion is established, the box AP is further improved by 0.4%. Therefore, it is essential to exchange information across the multi-resolution parallel branches. Moreover, CBAM can further improve accuracy because it focuses on more important features.


**Table 6.** Quantitative results of component analysis in MRFEN.

<sup>1</sup> MRFE denotes the multi-resolution feature extraction. Feature fusion and CBAM are deleted. <sup>2</sup> FF denotes the feature fusion in the multi-scale attention-based feature fusion (MSAFF). <sup>3</sup> CBAM denotes the convolutional block attention module in the MSAFF. <sup>4</sup> ASPP denotes the atrous spatial pyramid pooling.

#### 6.1.3. Compared with Other Backbones

We compare the performance of different backbones in Table 7. MRFEN offers the optimum accuracy. HRNetV2-W40 [16] offers a comparable 72.2% box AP, but its mask AP is lower than that of MRFEN (63.5% < 64.3%). Furthermore, in our MRFEN, MRFE+MSAFF can be regarded as an improved version of HRNet [41], because CBAM is added during feature fusion. We also study the advantage of HRNet over other backbones under the condition that ASPP is used. The results are shown in Table 8. From Table 8, HRNet achieves the best box AP and mask AP; thus, the multi-resolution parallel structure is better than the plain structure.

**Table 7.** Quantitative results of different backbones. ASPP is not used in other backbones.


**Table 8.** Quantitative results with and without MRFEN. ASPP is used in other backbones.


<sup>1</sup> The backbone (MRFE+MSAFF) inside MRFEN is replaced. <sup>2</sup> In MRFEN, MRFE+MSAFF can be regarded as an improved version of HRNet because CBAM is added during feature fusion.

#### *6.2. Ablation Study on EFPN*

#### 6.2.1. Effectiveness of EFPN

Table 9 shows the results with and without EFPN. EFPN offers a ~3% box AP gain and a ~2% mask AP gain. The speed is sacrificed, but it offers better multi-scale performance, considering large and small ships simultaneously.


**Table 9.** Quantitative results with and without EFPN.

<sup>1</sup> FPN in Figure 5a. <sup>2</sup> Our EFPN in Figure 5b.

#### 6.2.2. Component Analysis in EFPN

Table 10 shows the results of component analysis in EFPN. Each component is conducive to improving accuracy more or less, showing their effectiveness. CARAFE reduces the speed more obviously, because it adds FPN levels with increased calculation costs.



**Table 10.** Quantitative results of component analysis in EFPN.

<sup>1</sup> CARAFE denotes the content-aware reassembly of features. <sup>2</sup> FB denotes the feature balance. <sup>3</sup> FR denotes the feature refinement. <sup>4</sup> FE denotes the feature enhancement.

#### 6.2.3. Compared with Other FPNs

We compare the performance of different FPNs in Table 11. EFPN offers the best accuracy except for the mask AP of Quad-FPN [58]. Still, EFPN is better than Quad-FPN, because its box AP is higher (72.3% > 71.8%).



#### *6.3. Ablation Study on SGAALN*

#### 6.3.1. Effectiveness of SGAALN

Table 12 shows the results with and without SGAALN. SGAALN boosts the box AP and the mask AP by ~1%. It can generate more optimized location- and shape-adaptive anchors to better match SAR ships. This can ease background interferences for better performance.


**Table 12.** Quantitative results with and without SGAALN.

<sup>1</sup> Fixed anchors and aspect ratios used in the vanilla HTC. <sup>2</sup> Our SGAALN in Figure 8.

#### 6.3.2. Component Analysis in SGAALN

Table 13 shows the results of component analysis in SGAALN. ALP boosts accuracy in any case, but ASP must be equipped with FA to give full play to its advantages, because FA aligns the raw feature maps to width-height of anchors to eliminate feature differences.


**Table 13.** Quantitative results of component analysis in SGAALN.

<sup>1</sup> ALP denotes the anchor location prediction. <sup>2</sup> ASP denotes the anchor shape prediction. <sup>3</sup> FA denotes the feature adaption.

#### 6.3.3. Different Probability Thresholds

We adjust the probability threshold to determine its optimal value in Table 14. One finds that when *t* = 0.10 the accuracy reaches its peak, so this value is selected, as also suggested by [29], because it removes many false positives while maintaining an unaffected recall rate.


**Table 14.** Quantitative results of different probability thresholds.

#### *6.4. Ablation Study on CROIE*

6.4.1. Effectiveness of CROIE

Table 15 shows the results with/without CROIE. CROIE improves the accuracy by ~0.5%, because it offers more context information to the network, conducive to enhancing the background discrimination ability.

**Table 15.** Quantitative results with and without CROIE.


<sup>1</sup> ROI is used as shown in Figure 9a. <sup>2</sup> ROI, ROI-C1 and ROI-C2 are used as shown in Figure 9b.

#### 6.4.2. Different Range Contexts

We survey the influence of different range contexts on performance, as shown in Table 16. We observe that moderate contexts are beneficial, but excessive ones lead to negative effects. When using CROIE, the parameters should therefore be adjusted to suit the actual application. For the best accuracy, we set the two amplification factors *c*<sup>1</sup> and *c*<sup>2</sup> to 1.5 and 2.0, respectively.

**Table 16.** Quantitative results of different range contexts.


#### *6.5. Ablation Study on EMIN*

#### 6.5.1. Effectiveness of EMIN

Table 17 shows the results with and without EMIN. EMIN offers better accuracy than the raw MIN. It transmits more important mask features to the next stage. It further balances the contributions between the backbone network's features and the previous stage's features. Consequently, the efficiency of information flow is improved, bringing better learning benefits.

**Table 17.** Quantitative results with and without EMIN.


<sup>1</sup> The raw MIN in Figure 12a. <sup>2</sup> Our EMIN in Figure 12b.

#### 6.5.2. Component Analysis in EMIN

Table 18 shows the results of component analysis in EMIN. Each component offers an observable accuracy gain, showing their effectiveness. They do not impose great impacts on speed, so they are cost-effective.

**Table 18.** Quantitative results of component analysis in EMIN.


<sup>1</sup> GFSA denotes the global feature self-attention. <sup>2</sup> AMFF denotes the adaptive mask feature fusion.

#### *6.6. Ablation Study on PPT*

#### 6.6.1. Effectiveness of PPT

Table 19 shows the results with or without PPT. PPT offers slightly better accuracy than NMS and Soft-NMS with little sacrifice of speed. It uses the prior of ship aspect ratios to decide whether to suppress boxes, combining the advantages of NMS and Soft-NMS, enabling better performance.

**Table 19.** Quantitative results with and without PPT.


6.6.2. Different Similarity Thresholds of Aspect Ratios

We survey the influence of different similarity thresholds of aspect ratios on the performance, as shown in Table 20. Due to SAR imaging errors and annotation deviations, ships moored in parallel cannot be assumed to have exactly equal aspect ratios. Therefore, this threshold needs to be set reasonably. In our work, we set it to 0.20 because this offers the best accuracy.


**Table 20.** Quantitative results of different similarity thresholds of aspect ratios.

#### *6.7. Ablation Study on HSMTS*

#### 6.7.1. Effectiveness of HSMTS

Table 21 shows the results with and without HSMTS. HSMTS further improves the accuracy; the network boosts its learning benefits by focusing on difficult samples, which strengthens the foreground-background discrimination ability. HSMTS is only used in training, so the speed is not affected.

**Table 21.** Quantitative results with and without HSMTS.


<sup>1</sup> The raw random sampling. <sup>2</sup> OHEM. Here, the mask prediction loss is not monitored. <sup>3</sup> Our HSMTS in Figure 15. Here, the mask prediction loss is monitored.

#### 6.7.2. Compared with OHEM

In Table 21, HSMTS (the third row) performs better than OHEM (the second row) because HSMTS adds the extra supervision of the mask prediction loss. This is conducive to mining more representative hard negative samples.

#### **7. Discussions**

#### *7.1. Multi-Scale Training and Test*

We also discuss multi-scale training and test on SSDD in Table 22. The single-scale input is [512 × 512]; the multi-scale input is [416 × 416, 512 × 512, 608 × 608], inspired by YOLOv3 [25]. Multi-scale training and test can further improve the accuracy, but the speed becomes lower for all models. Our single-scale HTC+ surpasses all the other multi-scale models. This advantage comes from the multi-resolution feature extraction. Our multi-scale HTC+ further improves performance, from 72.3% to 72.9% box AP and from 64.3% to 65.1% mask AP. It remains far superior to all the other competing models, which shows its better performance.

#### *7.2. Extension to Mask R-CNN*

To confirm the universal effectiveness of the proposed techniques on other instance segmentation models, we extend them to the mainstream Mask R-CNN [30]. Here, EMIN is not applicable, because the mask head of Mask R-CNN is not cascaded and it therefore has no mask information interaction branches. The results are shown in Table 23. From Table 23, the six novelties all offer continuous accuracy growth, from the initial 62.0% to the final 70.8% box AP, i.e., a huge 8.8% incremental improvement, and from the initial 57.8% to the final 62.5% mask AP, i.e., a huge 4.7% incremental improvement. The above reveals the universal validity of our proposed techniques.


**Table 22.** Quantitative results of multi-scale training and test on SSDD. The suboptimal method is marked by an underline.

<sup>1</sup> Single denotes the input size [512 × 512]. <sup>2</sup> Multi denotes the input size [416 × 416, 512 × 512, 608 × 608] inspired by YOLOv3 [25].



\* Mask R-CNN does not have mask information interaction branches, because its mask head is not cascaded. Thus, EMIN is not applicable to Mask R-CNN.

#### *7.3. Extension to Faster R-CNN*

We also extend the proposed techniques (except EMIN only used in segmentation models) to the pure detection model. We take the mainstream two-stage model Faster R-CNN [31] as an example. The results are shown in Table 24. From Table 24, six novelties all offer continuous accuracy growth, from the initial 62.1% to the final 69.1% box AP, i.e., a huge 7% incremental improvement, which shows their universal validity. Certainly, these benefits are achieved at a certain sacrifice of speed, which will be considered in the future.



\* EMIN is only used in segmentation models, but Faster R-CNN is a detection model. Therefore, EMIN cannot be applied to Faster R-CNN.

#### **8. Conclusions**

We propose HTC+ to boost SAR ship instance segmentation. Seven techniques (MRFEN, EFPN, SGAALN, CROIE, EMIN, PPT, and HSMTS) are used to ensure the state-of-the-art accuracy of HTC+. HTC+ is elaborately designed for SAR ship tasks in consideration of the targeted SAR characteristics. HTC+ is superior to the vanilla HTC by 6.7% box AP and 5.0% mask AP on SSDD, and by 4.9% and 3.9% on HRSID. It outperforms the other nine advanced models. Moreover, we also extend the proposed techniques to Mask R-CNN and Faster R-CNN to confirm their effectiveness beyond HTC; the results reveal that they offer continuous accuracy growth.

In the future, the speed optimization [71,72] will be considered; other tricks [73] will also be considered for better accuracy.

**Author Contributions:** Conceptualization, T.Z.; methodology, T.Z.; software, T.Z.; validation, T.Z.; formal analysis, T.Z.; investigation, T.Z.; resources, T.Z.; data curation, T.Z.; writing—original draft preparation, T.Z.; writing—review and editing, X.Z.; visualization, T.Z.; supervision, T.Z.; project administration, X.Z.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Natural Science Foundation of China (61571099).

**Data Availability Statement:** Not applicable. No new data were created or analyzed in this study. Data sharing is not applicable to this article.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

