#### *3.1. DNet*

Region proposal is one of the key components of an object detection network: Faster R-CNN [8] uses an RPN to generate better region proposals, and YOLOv1 [10] splits an image into grid cells as region proposals to improve detection efficiency. As shown in Figure 2, DNet divides the input image into 6 × 6 grid cells as region proposals, as in YOLOv1 [10]. Each region proposal consists of eight predictions: *x*, *y*, *w*, *h*, *c*, and *C* × 3. The *(x, y)* coordinates represent the center of the predicted box, and *w* and *h* represent its width and height. *c* represents the IOU (intersection-over-union) between the predicted box and the ground-truth box, and *C* × 3 represents the class probabilities of bow, cabin and stern.
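The per-cell prediction layout can be sketched as follows. This is an illustrative decoding of the 6 × 6 × 8 output described above; the function name `decode_cell` and the channel ordering are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

S = 6           # grid size: the image is divided into 6 x 6 cells
PRED_DIM = 8    # x, y, w, h, c, plus 3 class probabilities (bow, cabin, stern)

def decode_cell(pred_map, row, col):
    """Split one grid cell's 8-vector into box, confidence, and classes."""
    v = pred_map[row, col]
    box = v[0:4]       # (x, y, w, h) of the predicted box
    conf = v[4]        # c: IOU-style confidence
    classes = v[5:8]   # probabilities for bow, cabin, stern
    return box, conf, classes

# The full DNet output for one image under this layout:
pred_map = np.zeros((S, S, PRED_DIM))
box, conf, classes = decode_cell(pred_map, 0, 0)
```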

**Figure 2.** Region proposal of DNet.

Inspired by YOLOv1, DNet models detection as a regression problem. Since the objects and background are relatively simple and object features are relatively obvious, the network does not need to be as deep as VGGNet [10] or ResNet [22]; instead, we focus on reducing the number of network parameters. DNet resizes the image to 192 × 192 as the input, and we design five layers to extract features from the image; the last two layers predict the object probabilities and coordinates.

DNet predicts bounding boxes based on grid cells. Each grid cell produces one bounding-box predictor. We need one predictor to be "responsible" for each object, and we choose the one whose prediction has the highest current IOU (intersection-over-union) with the ground truth. To choose a proper predictor for each object at training time, we design the grid loss function *L<sub>grid</sub>* as follows:

$$L_{grid} = \sum_{i=0}^{S^2} (C_i - C_i^*)^2,\tag{1}$$

where *S*<sup>2</sup> is the number of grid cells, *C<sub>i</sub>* is the confidence value that the predicted box contains an object, and *C*<sup>∗</sup><sub>*i*</sub> is the IOU between the predicted bounding box and the ground truth. If there is no object in the prediction, then *C*<sup>∗</sup><sub>*i*</sub> = 0.
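A minimal sketch of Equation (1), assuming NumPy arrays: `C_pred` holds the predicted confidences *C<sub>i</sub>* for all *S*<sup>2</sup> cells, and `C_star` holds the IOU targets (0 for cells with no object). The function name is an assumption for illustration.

```python
import numpy as np

def grid_loss(C_pred, C_star):
    """Sum of squared confidence errors over all S^2 grid cells (Eq. 1)."""
    return float(np.sum((C_pred - C_star) ** 2))

# Four example cells: two contain objects (nonzero IOU targets), two do not.
C_pred = np.array([0.9, 0.1, 0.0, 0.8])
C_star = np.array([1.0, 0.0, 0.0, 0.5])
loss = grid_loss(C_pred, C_star)  # (0.1)^2 + (0.1)^2 + 0 + (0.3)^2 = 0.11
```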

The final layer predicts both class probabilities and bounding-box coordinates. We calculate the coordinate loss and classification loss only when the predictor is a proper one; the loss function is:

$$L_{box} = \sum_{c \in \text{proper}} \sqrt{(x_c - x_c^*)^2 + (y_c - y_c^*)^2 + (w_c - w_c^*)^2 + (h_c - h_c^*)^2} + (p_c - p_c^*)^2,\tag{2}$$

where *p<sub>c</sub>* is the predicted class and *p*<sup>∗</sup><sub>*c*</sub> is the true class. The loss *L<sub>box</sub>* is computed under the assumption that the predictor is a proper one; therefore, it may not be ideal to weight *L<sub>grid</sub>* equally with *L<sub>box</sub>*. We use *λ* to weight the losses, and the final loss function is designed as follows:

$$L = \lambda L_{grid} + (1 - \lambda)L_{box}.\tag{3}$$
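The weighted combination in Equation (3) can be sketched as a one-line function; the default value of `lam` below is purely illustrative, since the paper does not fix a value for *λ* here.

```python
def total_loss(l_grid, l_box, lam=0.5):
    """Combine the two partial losses: L = lambda * L_grid + (1 - lambda) * L_box."""
    return lam * l_grid + (1.0 - lam) * l_box

# Example: emphasizing the grid loss with lambda = 0.75.
L = total_loss(0.4, 0.2, lam=0.75)  # 0.75 * 0.4 + 0.25 * 0.2 = 0.35
```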

#### *3.2. CNet*

Ship recognition is challenging, but we can exploit the fact that there is only a limited number of ships in a port. The CNet model treats recognition as a classification problem and is connected to the end of DNet. We take the output boxes and classes as input and share the first three layers' feature maps with DNet. The boxes are resized to 14 × 14 by a ROIPool layer, as shown in Figure 3, which was proposed in [8]. Two extra convolutional layers follow the ROIPool layer, and, finally, two fully connected layers and a softmax layer predict the output probabilities.
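The resizing step can be illustrated with a simplified ROI max-pooling sketch in the spirit of [8]: a box region on the feature map is divided into a fixed 14 × 14 output grid, and each output bin takes the maximum of its sub-region, so boxes of any size become 14 × 14. This is a naive single-channel reimplementation for clarity, not the authors' layer; the box coordinates below are arbitrary.

```python
import numpy as np

def roi_pool(feature_map, x0, y0, x1, y1, out_size=14):
    """Max-pool the box [y0:y1, x0:x1] of a 2D feature map into out_size x out_size."""
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # Sub-region boundaries; each bin covers at least one element.
            r0 = i * h // out_size
            r1 = max((i + 1) * h // out_size, r0 + 1)
            c0 = j * w // out_size
            c1 = max((j + 1) * w // out_size, c0 + 1)
            out[i, j] = region[r0:r1, c0:c1].max()
    return out

fmap = np.random.rand(48, 48)               # a shared feature map (illustrative size)
pooled = roi_pool(fmap, 5, 5, 40, 40)       # a 35 x 35 box becomes 14 x 14
```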

**Figure 3.** ROIPool Layer.

Finally, CNet outputs three ship classification scores of bow, cabin and stern. We design the voting strategy as

$$\lambda_{i:i \in \text{probabilities}} = \lambda_b Score_{bow}^i + \lambda_c Score_{cabin}^i + \lambda_s Score_{stern}^i$$

*Score*<sup>*i*</sup> denotes the output score of probability *i*. Weighting the cabin score equally with the bow and stern scores may not be ideal; to resolve this, we use the *λ* weights to balance the scores.
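The voting strategy can be sketched as below. The weight values and ship identities are purely illustrative assumptions; the paper does not specify concrete *λ* values here.

```python
def vote(score_bow, score_cabin, score_stern, lam_b=0.3, lam_c=0.4, lam_s=0.3):
    """Weighted vote over the three part scores for one candidate identity."""
    return lam_b * score_bow + lam_c * score_cabin + lam_s * score_stern

# Compute the weighted score for each candidate and pick the best one.
scores = {
    "ship_A": vote(0.9, 0.8, 0.7),
    "ship_B": vote(0.2, 0.3, 0.1),
}
best = max(scores, key=scores.get)  # "ship_A"
```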

#### *3.3. Training and Running*

Before training, we have to label the ship data set. We box the bow, cabin and stern with (*c, i, x, y, w, h*), where *c* represents the key point of the ship, *i* represents the identification of the key point, (*x, y*) represents the upper-left coordinates of the box, and *w* and *h* represent the width and height of the box. To learn the shared features, we train DCNet in two steps, as shown in Figure 4. In the first step, we train DCNet using the ship data set; we set the initial learning rate to 0.01 and decrease it by one tenth every 10,000 iterations, and after 50,000 iterations the loss tends to stabilize. In the second step, we fix the shared convolutional layers and fine-tune only the unique layers of CNet. During CNet training, we feed the ship data set to the shared convolutional layers, crop the box feature maps, unify their size with the ROIPool layer, and, lastly, classify the feature maps with the unique layers of CNet. We set the initial learning rate to 0.1 and decrease it by one tenth every 5000 iterations; after 40,000 iterations, the loss tends to stabilize.
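The step schedule described above can be sketched as follows, assuming the "decrease by one tenth" means multiplying the learning rate by 0.1 at each step boundary (the paper's exact decay rule is not stated beyond this).

```python
def learning_rate(iteration, initial=0.01, step=10_000):
    """Step schedule: multiply the initial rate by 0.1 every `step` iterations."""
    return initial * (0.1 ** (iteration // step))

# DNet training: initial rate 0.01, decayed every 10,000 iterations.
lr0 = learning_rate(0)        # 0.01
lr1 = learning_rate(10_000)   # 0.001
lr5 = learning_rate(50_000)   # 1e-7
```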

**Figure 4.** DCNet Training.

When running the model, DNet first predicts the coordinates and classes of bow, cabin and stern; it then crops the key ship parts from the shared feature maps and feeds them to CNet to obtain the probability scores, as shown in Figure 5.

**Figure 5.** Labels of bow, cabin and stern.
