## **4. Architecture Design of ESDR-DL**

To reduce the network traffic caused by video streaming from surveillance cameras and to overcome the limitation of low transmission bandwidth, we design an embedded architecture for deep learning that connects to the surveillance cameras and performs image processing at the front end, as shown in Figure 6. In ESDR-DL, each video stream is connected to a nearby TX2 through a LAN. To ensure the real-time performance of video surveillance, each TX2 receives only one or two video streams. When the system is running, a Video Stream Receiver on the TX2 receives the video streams attached to the device, a Video Stream Decoder decodes them, and the decoded images are passed to an Image Processor for detection. In the Image Processor, the DCNet model detects the key parts of a ship (the bow, the cabin, and the stern), classifies the ship's identity from each key part, and outputs three prediction results. These predictions are then passed to a Voter, which decides the ship's identity.
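The per-device pipeline described above can be sketched as follows. This is an illustrative outline only: the class and method names, the confidence scores, and the score-summing vote are hypothetical placeholders, not the actual ESDR-DL implementation.

```python
# Illustrative sketch of the per-TX2 pipeline; all names are hypothetical.
from dataclasses import dataclass

@dataclass
class PartPrediction:
    part: str        # "bow", "cabin", or "stern"
    ship_id: str     # predicted ship identity
    score: float     # classification confidence

class ImageProcessor:
    """Detects the three key parts and classifies the ship identity per part."""
    def process(self, frame):
        # Placeholder for DCNet inference: detect bow/cabin/stern,
        # then classify each crop, yielding three predictions.
        return [PartPrediction("bow", "SHIP-001", 0.91),
                PartPrediction("cabin", "SHIP-001", 0.95),
                PartPrediction("stern", "SHIP-003", 0.62)]

class Voter:
    """Combines the three part-level predictions into one identity."""
    def decide(self, preds):
        tally = {}
        for p in preds:
            tally[p.ship_id] = tally.get(p.ship_id, 0.0) + p.score
        return max(tally, key=tally.get)

def run_pipeline(frames):
    """Receive -> decode -> detect/classify -> vote, one result per frame."""
    processor, voter = ImageProcessor(), Voter()
    return [voter.decide(processor.process(f)) for f in frames]
```

In this sketch, two of the three part predictions agree on `SHIP-001`, so the vote resolves to that identity even though the stern prediction disagrees.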

We use the NVIDIA Jetson TX2, an industry-leading embedded computing device. Table 2 lists its main properties relevant to this work.

**Figure 6.** System architecture of ESDR-DL.

**Table 2.** Main properties of the Jetson TX2.


## **5. Experimental Results**

We use recall (R) and precision (P) as the evaluation metrics, defined as

$$R = \frac{\text{TP}}{\text{TP} + \text{FN}}, \quad P = \frac{\text{TP}}{\text{TP} + \text{FP}}.$$

where TP denotes true positives, FN false negatives, and FP false positives.
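For concreteness, a minimal computation of these two metrics (the counts in the example are made up for illustration):

```python
def recall(tp, fn):
    """Recall: fraction of actual positives that were detected."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Precision: fraction of detections that were correct."""
    return tp / (tp + fp)

# Example: 90 true positives, 10 false negatives, 5 false positives.
r = recall(90, 10)     # 0.9
p = precision(90, 5)   # ~0.947
```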

#### *5.1. Algorithm Performance*

To evaluate the performance of DNet, we use a ship data set of 6000 images collected from Dongying port, Shandong, China: 4700 images are used for training and 1300 for testing. We test both Tiny YOLO and DNet, running on a TX2 and a GTX TITAN X. Table 3 shows the results: DNet achieves much higher energy efficiency with only slightly lower accuracy.


**Table 3.** Test results of Tiny YOLO and DNet.

YOLOv1 splits an image into a 7 × 7 grid of cells. Considering the large size of ship targets and the limited computing capacity of a TX2, we reduce the number of grid cells to shrink the model. Table 4 reports a test of different grid sizes, where efficiency is measured in FPS (frames per second); balancing accuracy against performance, DNet splits an image into a 6 × 6 grid.
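The saving from a smaller grid can be estimated from the YOLOv1 output-tensor size, S × S × (B·5 + C). As a rough sketch only: B = 2 boxes and C = 20 classes below are the YOLOv1 defaults, not necessarily DNet's actual values, and the dominant saving in practice comes from the final fully-connected layer, whose size also scales with S².

```python
def yolo_output_size(s, b=2, c=20):
    """Number of values in a YOLOv1-style output tensor: S*S*(B*5 + C).
    b and c here are YOLOv1 defaults, used only for illustration."""
    return s * s * (b * 5 + c)

# Shrinking the grid from 7x7 to 6x6 drops the cell count from 49 to 36,
# i.e. roughly a 27% smaller output tensor (and proportionally fewer
# parameters in the layer that produces it).
size_7x7 = yolo_output_size(7)                 # 1470
size_6x6 = yolo_output_size(6)                 # 1080
reduction = 1 - size_6x6 / size_7x7            # ~0.265
```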


**Table 4.** Grid-cell number test for DNet.

The *λ* for the loss L in Equation (3) can be adjusted for different scenarios and targets. We tune *λ* experimentally; the results are shown in Table 5. Based on these tests, we set *λ* = 0.7.


**Table 5.** *λ* test for Loss L.

We also tune the voting weights experimentally and test their impact on accuracy, as shown in Table 6. The results indicate that *λ<sub>c</sub>* > *λ<sub>b</sub>* > *λ<sub>s</sub>*; we set *λ<sub>b</sub>* = 0.3, *λ<sub>c</sub>* = 0.5, and *λ<sub>s</sub>* = 0.2.
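With these weights, the voting step can be sketched as a weighted majority over the three part-level identities. This is a minimal sketch under the assumption that each part network outputs a single identity and that the identity with the largest total weight wins; the actual tie-breaking and scoring details of ESDR-DL may differ.

```python
# lambda_b, lambda_c, lambda_s from the experiments above.
WEIGHTS = {"bow": 0.3, "cabin": 0.5, "stern": 0.2}

def weighted_vote(part_preds):
    """part_preds: dict mapping part name -> predicted ship identity.
    Returns the identity with the largest total weight."""
    scores = {}
    for part, ship_id in part_preds.items():
        scores[ship_id] = scores.get(ship_id, 0.0) + WEIGHTS[part]
    return max(scores, key=scores.get)

# Cabin and stern agree (0.5 + 0.2 = 0.7) and outvote the bow (0.3).
weighted_vote({"bow": "A", "cabin": "B", "stern": "B"})  # -> "B"
```

The ordering *λ<sub>c</sub>* > *λ<sub>b</sub>* > *λ<sub>s</sub>* means the cabin prediction alone can never be outvoted unless the other two parts agree against it.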



#### *5.2. System Performance*

ESDR-DL is deployed at Dongying port, China. The video cameras are Hikvision DS-2CD3T25D-I5, with a resolution of 1920 × 1080 and a frame rate of 30 fps. We use seven TX2s for 10 cameras, as shown in Table 7. Four cameras are installed on the two sides of the entrance at a height of 8 m; the others are installed inside the port.

**Table 7.** Deployment of cameras and TX2s.


During one month of operation, we collect 13,000 recognition records and check their accuracy manually. A total of 14,536 ships appear in the videos.

Table 8 lists the recall and precision rates of ship detection and recognition. S denotes the number of ships appearing in each camera, D-P the ship detection precision, D-R the detection recall, R-P the recognition precision, R-R the recognition recall, and T the processing efficiency of each camera.


**Table 8.** Performance of ESDR-DL.

Comparing Tables 6 and 8, the accuracy during actual operation is lower than that on the home-made data set, because new ships keep arriving at the port and ESDR-DL cannot recognize them. In addition, ESDR-DL performs better on the inside-port cameras: the entrance cameras capture some distant views of ships, whereas the inside-port cameras capture only close views, and DCNet focuses on large-target detection and recognition. As shown in Figure 7, the system also runs in bad weather (such as rain and smog) in practice. To evaluate its performance under such conditions, we run the system in rainy and smoggy weather as well as at dusk (5:00 p.m.–6:00 p.m.). The detection results are shown in Table 9.

The recognition results are shown in Table 10. The accuracy of the system drops sharply in rainy and smoggy weather, while it remains good at dusk. This is not a problem in practice, as very few ships sail in such weather conditions.


**Table 9.** Performance of ship detection in bad weather.

**Table 10.** Performance of ship recognition in bad weather.


**Figure 7.** Ships in bad weather: the top row shows ships in bad weather, and the bottom row shows the processing results.

## **6. Conclusions**

Considering the challenges of ship detection and recognition, this paper proposes an embedded deep learning system for ship detection and recognition named ESDR-DL. It first locates the bow, cabin, and stern of a ship using DNet, then recognizes them with a classification network named CNet, and finally determines the ship's identity by voting. We implement ESDR-DL on an embedded architecture that supports real-time video processing. We have deployed ESDR-DL at Dongying port, China, where it has been running stably for the past year, demonstrating the effectiveness of our solution. In the future, we will adopt a multi-model data fusion approach [23,24] to improve recognition accuracy.

**Author Contributions:** Conceptualization, H.Z. and W.Z.; methodology, H.Z.; validation, H.Z., H.S. and B.X.; formal analysis, W.S.; investigation, B.X.; resources, W.Z.; writing–original draft preparation, H.Z.; writing–review and editing, H.Z. and H.S.; visualization, W.Z. and B.X.; project administration, W.Z.

**Funding:** This research was funded by the Key Research Program of Shandong Province under Grant No. 2017GGX10140 and the National Natural Science Foundation of China under Grant No. 61309024.

**Conflicts of Interest:** The authors declare no conflicts of interest.
