*Article* **Sensitivity of Safe Trajectory in a Game Environment on Inaccuracy of Radar Data in Autonomous Navigation**

#### **Józef Lisowski**

Faculty of Marine Electrical Engineering, Gdynia Maritime University, 81-225 Gdynia, Poland; j.lisowski@we.umg.edu.pl; Tel.: +48-694-458-333

Received: 27 March 2019; Accepted: 15 April 2019; Published: 16 April 2019

**Abstract:** This article provides an analysis of the autonomous navigation of marine objects, such as ships, offshore vessels and unmanned vehicles, and an analysis of the accuracy of safe control in game conditions for the cooperation of objects during maneuvering decisions. A method for determining safe object strategies based on a cooperative multi-person positional modeling game is presented. The method was used to formulate a measure of the sensitivity of safe control in the form of a relative change in the final payment of the game; to determine the final deviation of the safe trajectory from the set trajectory of the autonomous vehicle movement; and to assess the accuracy of the information used to evaluate the state of the control process. The sensitivity of safe control was considered in terms of both the degree of inaccuracy of the radar information and changes in the kinematics and dynamics of the object itself. As a result of simulation studies of the positional game algorithm, which used an example of a real situation at sea in which the own object passes nine encountered objects, the sensitivity characteristics of safe trajectories under conditions of both good and restricted visibility at sea are presented.

**Keywords:** autonomous navigation; automatic radar plotting aid; safe objects control; game theory; computer simulation

#### **1. Introduction**

The subject of this article directly concerns sensors, which are the basic part of vessel detection and navigation in the process of ensuring the safe control of marine objects. Sensors such as radars, logs and gyro-compasses form the source of information for the Automatic Radar Plotting Aid (ARPA) system, which is a mandatory piece of equipment on every ship to prevent collisions. However, its functional scope is limited to the determination of a safe maneuver of the object with respect to the most dangerous encountered object, followed by its simulation on an accelerated time scale [1–3]. Modern navigation systems aim to use computer decision support systems that take many factors into account [4–9].

First, we take into account the subjectivity of the navigator in the assessment of the situation as well as the kinematics and dynamics of the objects encountered [10]. According to Lloyd's Register statistics, human errors caused by subjectivity account for about 60% of the causes of maritime accidents [11,12].

Secondly, we take into account the game nature of the anti-collision process, which results from the imperfection of maritime law and the complexity of the actual navigational situation at sea [13].

The influence of the accuracy of the information from sensors on the current state of the transport process and on the quality of safe control, which can be determined through an analysis of the control's sensitivity to information inaccuracy, thus becomes important [14–17]. Most of the scientific literature concerns the sensitivity analysis of deterministic systems [18–21]. Therefore, the purpose of this study is to conduct a sensitivity analysis of the game-based system for the safe control of moving objects.

The process of managing autonomous marine vehicles as complex dynamic control objects depends both on the accuracy of the measurements determining the current navigational situation, obtained from the devices of the automatic radar plotting aid (ARPA) anti-collision system, and on the mathematical model of the process used to synthesize the object control algorithm.

The ARPA system allows the automatic tracking of the encountered *j*th object by determining its motion parameters, including velocity *Vj* and course ψ*j*, and the elements of its approach to the own ship: *Djmin* (distance at the closest point of approach, DCPAj) and *Tjmin* (time to the closest point of approach, TCPAj) (Figure 1).

**Figure 1.** The navigational situation of the passage of the own object 0 moving at speed *V* and course ψ with the *j*th object encountered when moving at speed *Vj* and course ψ*j*. In this figure, *Dj* is distance, *Nj* is bearing and *Ds* is safe distance.
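As an illustration, DCPAj and TCPAj can be computed from the relative motion of the two objects. The following sketch (hypothetical helper names; a flat Cartesian approximation of the navigational plane) shows the standard closest-point-of-approach calculation:

```python
import math

def dcpa_tcpa(own_pos, own_course_deg, own_speed,
              tgt_pos, tgt_course_deg, tgt_speed):
    """Closest point of approach from radar-style inputs.

    Courses are in degrees clockwise from north, speeds in knots,
    positions in nautical miles (x east, y north).
    """
    def velocity(course_deg, speed):
        c = math.radians(course_deg)
        return (speed * math.sin(c), speed * math.cos(c))

    vx, vy = velocity(own_course_deg, own_speed)
    wx, wy = velocity(tgt_course_deg, tgt_speed)
    # Relative position and velocity of the target w.r.t. the own ship.
    rx, ry = tgt_pos[0] - own_pos[0], tgt_pos[1] - own_pos[1]
    ux, uy = wx - vx, wy - vy
    u2 = ux * ux + uy * uy
    if u2 == 0.0:                       # identical velocities: range is constant
        return math.hypot(rx, ry), 0.0
    tcpa = -(rx * ux + ry * uy) / u2    # time (hours) of closest approach
    dcpa = math.hypot(rx + ux * tcpa, ry + uy * tcpa)
    return dcpa, tcpa

# Target 5 nm due north heading south at 10 kn; own ship heading north at 10 kn.
d, t = dcpa_tcpa((0, 0), 0, 10, (0, 5), 180, 10)
```

For this head-on example the relative speed is 20 kn, so the objects meet (DCPA = 0) after 0.25 h.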

In theory and practice, there are many methods for determining a safe maneuver or the safe trajectory of the own object while passing other objects. The simplest method is to determine a change in the course or speed of the own object in relation to the most dangerous encountered object.

In one article [22], the "time to safe distance" upon the detection of dangerous objects was proposed as a potentially important parameter, which should be accompanied by a display of possible evasive maneuvers. The acceptable solutions for altering the course range should comply with the international regulations for preventing collisions at sea (COLREGs), as presented in [23].

The most important purpose of the control process is to determine a certain sequence of maneuvers in the form of a safe trajectory of the own object. The safe trajectory of the own object can be distinguished in deterministic terms without considering the maneuvering of other objects encountered. In the game approach, this is based on the use of a cooperative or non-cooperative game model of the control process [24,25].

The safety distance Ds of the passing objects, which was subjectively determined by the navigator in the current navigational situation, is important for safe navigation. This value depends on the current state of visibility at sea, which is classified by the international rules for preventing collisions of ships at sea (COLREGs) as either good or restricted visibility at sea.

Therefore, the aim of this study is to assess the sensitivity of the quality of safe game control under conditions of good and restricted visibility at sea.

#### **2. The Safe and Game Object Control in Autonomous Ship Navigation**

The complexity of the situation when many dynamic objects are passed at sea provides the possibility of using a game model of the control process, given the fairly general rules of the international regulations for preventing collisions at sea (COLREGs) regarding good and restricted visibility at sea and the large influence of the navigator's subjectivity on the final maneuvering decision. The process can be described in the form of static positional and matrix games or in the form of dynamic differential games. This article proposes the use of the positional game, which is the most appropriate for this type of control process.

The basis of the positional game involves assigning the maneuver strategy of the own object to the current positions of *p*(*tk*) objects in the current step *k*. As such, the process model considers all possible changes in the course and speed of the encountered objects during the control [26–42] (Figure 2).

**Figure 2.** Block diagram of the positional game model in the situation with passing objects.

The process state is determined by the coordinates of the object's own position and the positions of the objects encountered as follows:

$$\mathbf{x}\_0 = \begin{pmatrix} X\_0, \ Y\_0 \end{pmatrix}, \quad \mathbf{x}\_j = \begin{pmatrix} X\_j, \ Y\_j \end{pmatrix}, \quad j = 1, \ 2, \ \ldots, \ J. \tag{1}$$

The control algorithm generates the own object's movement strategy at the present time tk, based on information from the ARPA anti-collision system on the relative position of the objects that it meets:

$$\mathbf{p}(t\_k) = \begin{bmatrix} \mathbf{x}\_0(t\_k) \\ \mathbf{x}\_j(t\_k) \end{bmatrix}, \quad j = 1, \ 2, \ \ldots, \ J, \quad k = 1, \ 2, \ \ldots, \ K. \tag{2}$$

Thus, in the multi-stage positional game model, at each discrete time *tk*, the own object knows the positions of the objects encountered.

The following navigation restrictions are imposed on the components of the process state, which consist of the acceptable coordinates of the own position and the objects encountered:

$$\left\{\mathbf{x}\_{0}(t),\,\mathbf{x}\_{j}(t)\right\}\,\,\in\,\mathbf{P}.\tag{3}$$

The limits of the control values of the own and met objects are determined as:

$$u\_0 \in \mathbf{U}\_0, \quad u\_j \in \mathbf{U}\_j, \quad j = 1, \ 2, \ \ldots, \ J. \tag{4}$$

which considers the ship movement kinematics, recommendations of the COLREGs rules and the conditions required to maintain a safe passing distance:

$$D\_{j\min} = \min\_t D\_j(t) \geq D\_s. \tag{5}$$

The sets of acceptable maneuvering strategies of the own object *U*0*j*(*p*) and the met objects *Uj*0(*p*) are dependent, which means that the choice of control *uj* by the *j*th object changes the sets of acceptable strategies of other objects.

Figure 3 shows the geometrical structure of the sets of acceptable safe strategies for the own object and for one encountered *j*th object. First, a value of the safe passing distance *Ds* is assumed. Then, tangents to the circle of radius *Ds* are drawn, which cut off the areas of safe possible changes in the courses and speeds of the own and encountered objects within the sectors spanned by the velocity vectors *V* and *Vj*.

**Figure 3.** The method of determining the acceptable strategy sets of the own object *U*0*j* = *U*0*j*,PS ∪ *U*0*j*,SS and the *j*th encountered object *Uj*0 = *Uj*0,PS ∪ *Uj*0,SS for the port side (PS) and starboard side (SS). In this figure, *P*0*r* is the turning point of the rhumb line of the own object, *Pjr* is the turning point of the rhumb line of the encountered object, 1 is a safe maneuver for changing the course of the own object and 2 is a safe maneuver for changing the course of the encountered object.
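The geometry of Figure 3 implies that the boundary of the forbidden course sector follows from the tangent to the circle of radius *Ds*: for a target at distance *Dj*, the half-angle of that sector is arcsin(*Ds*/*Dj*). A minimal sketch (the function name is illustrative):

```python
import math

def min_clearing_angle(d_j, d_s):
    """Half-angle (degrees) of the forbidden sector of relative courses.

    A relative course pointing within +/- alpha of the bearing to the
    target would cross inside the D_s circle; alpha = arcsin(D_s / D_j).
    """
    if d_s >= d_j:
        raise ValueError("target already inside the safe-distance circle")
    return math.degrees(math.asin(d_s / d_j))

# Target at 6 nm with a 3 nm safe distance: arcsin(0.5) = 30 degrees.
alpha = min_clearing_angle(d_j=6.0, d_s=3.0)
```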

The method for determining the total sets of acceptable safe strategies of the own object, while passing with many objects encountered simultaneously, is shown in Figure 4, which utilizes the example of passing the own object with six objects encountered at a safe distance *Ds*.

**Figure 4.** Areas of acceptable safe strategies of the own object in relation to the six objects encountered: 3 is the optimal maneuver for changing the course of the own object *u*0\* = Δψ\**SS* when safely passing the six met objects.

The algorithm of the safe cooperative control of the own object *u*0\*(*tk*) at each stage *k* is implemented through the following three nested optimization tasks:


$$D^\*(\mathbf{x}\_0) = \min\_{u\_0 \in \mathbf{U}\_0} \, \min\_{u\_j \in \mathbf{U}\_j^0} \, \min\_{u\_0 \in \mathbf{U}\_0^j} D[\mathbf{x}\_0(t\_k)], \quad j = 1, \ 2, \ \ldots, \ J. \tag{6}$$

where *D* is the distance of the own object to the nearest turning point *P*0*r* on the reference route.

The criterion for choosing the best trajectory of the own object is to calculate the course and speed values that provide the smallest loss of path while safely passing the encountered objects at a distance no less than the value of *Ds* previously accepted by the navigator.

Through the three-fold use of the *linprog* linear programming function from the MATLAB Optimization Toolbox, a cooperative multi-stage Positional Game (PG) algorithm was developed to determine the safe trajectory of the own object.
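A heavily simplified, single-stage analogue of this step can be written with SciPy's `linprog` (standing in for the MATLAB function of the same name); the constraint value and maneuverability limit below are assumed for illustration, not taken from the article:

```python
from scipy.optimize import linprog

# Toy version of the strategy-selection step: choose the starboard course
# alteration dpsi (degrees) that minimises the path loss (proportional to
# dpsi) subject to the collision-avoidance bound dpsi >= alpha and an
# assumed manoeuvrability limit dpsi <= 60.
alpha = 30.0
res = linprog(c=[1.0], bounds=[(alpha, 60.0)], method="highs")
best_dpsi = res.x[0]   # the smallest admissible alteration
```

The full PG algorithm solves such programs repeatedly, once per encountered object and stage, over the dependent strategy sets of Equation (6).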

#### **3. Control Sensitivity Analysis**

Sensitivity analysis is used in the identification of the static and dynamic properties of control objects and in the synthesis of automatic control systems, particularly optimal, adaptive and game systems. A distinction is made between the sensitivity of the object model itself, or of the control process, to changes in its operating parameters, and the sensitivity of the optimal, adaptive or game control, both to changes in parameters and to the influence of disturbances and the impacts of other objects. Therefore, the sensitivity functions *sx* of the optimal control *u* of the game process described by the state variables *x* can be represented as the following partial derivatives of the control quality index *Q*:

$$s\_x = \frac{\partial Q[x(u)]}{\partial x}. \tag{7}$$

The game control quality index *Q* acts as the form of payment for the game, which consists of integral payments and the final payment:

$$Q = \int\_{t\_0}^{t\_K} [x(t)]^2 \, dt + r\_j(t\_K) + d(t\_K). \tag{8}$$

The integral game payment represents the loss of path by the own object when passing the encountered objects, while the final payment determines the final collision risk *rj*(*tK*) with respect to the *j*th encountered object and the final deviation *d*(*tK*) of the object's trajectory from the reference trajectory.

The sensitivity analysis of the game control is completed by examining the sensitivity of the final game payment *d*(*tK*):

$$s\_i = \frac{\partial d(t\_K)}{\partial x\_i}.\tag{9}$$
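In practice, the partial derivative in Equation (9) can be approximated numerically. A sketch using central differences on a toy payoff function (the quadratic payoff is illustrative only, not the article's model):

```python
def sensitivity(d_final, x, i, h=1e-6):
    """Central-difference estimate of s_i = partial d(t_K) / partial x_i,
    where d_final maps a state vector to the final trajectory deviation."""
    xp = list(x)
    xm = list(x)
    xp[i] += h
    xm[i] -= h
    return (d_final(xp) - d_final(xm)) / (2.0 * h)

# Toy payoff: deviation grows quadratically with the first state component.
d = lambda x: x[0] ** 2 + 0.5 * x[1]
s0 = sensitivity(d, [2.0, 1.0], 0)   # analytic value: 2 * x0 = 4
```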

Considering the practical application of the game control algorithm for the own object in a collision situation, the sensitivity analysis of safe control should be conducted with respect to the accuracy of the information obtained from the ARPA anti-collision radar system in the current situation and with respect to changes in the kinematic and dynamic parameters of the control process.

The permissible average errors that may be caused by the sensors of an anti-collision system may have the following values:


The algebraic sum of all errors affecting the image of the navigational situation cannot exceed ±5% for absolute values and ±3° for angular quantities.

#### *3.1. Sensitivity of Safe Ship Control to Inaccuracy of Information from Sensors of ARPA System*

*SP* denotes the set of information on the state of the control process in a navigational situation:

$$SP = \left\{ V, \ \psi, \ V\_j, \ \psi\_j, \ D\_j, \ N\_j \right\}. \tag{10}$$

*SPe* represents the set of information from the sensors of the ARPA system, which contains measurement and processing errors:

$$SP\_e = \left\{ V \pm \delta V, \ \psi \pm \delta \psi, \ V\_j \pm \delta V\_j, \ \psi\_j \pm \delta \psi\_j, \ D\_j \pm \delta D\_j, \ N\_j \pm \delta N\_j \right\}. \tag{11}$$

The relative sensitivity *sx* of the final game payment, taken as the final deviation *dK* of the safe trajectory of the ship from the reference trajectory, is expressed as follows:

$$s\_x = \left| \frac{d\_K(SP\_e) - d\_K(SP)}{d\_K(SP)} \right| \cdot 100\%. \tag{12}$$

$$s\_x = \left\{ s\_V, \ s\_\psi, \ s\_{V\_j}, \ s\_{\psi\_j}, \ s\_{D\_j}, \ s\_{N\_j} \right\}. \tag{13}$$
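The relative sensitivity of Equation (12) is a simple percentage change. A sketch using the good-visibility final deviation *d*(*tK*) = 0.71 nm reported in Section 4, together with a hypothetical perturbed value:

```python
def relative_sensitivity(d_perturbed, d_nominal):
    """Relative sensitivity: percentage change of the final
    deviation d_K when the input data are perturbed."""
    return abs(d_perturbed - d_nominal) / d_nominal * 100.0

# 0.71 nm is the good-visibility deviation from Section 4; the
# perturbed value 0.78 nm is hypothetical, for illustration only.
s_v = relative_sensitivity(d_perturbed=0.78, d_nominal=0.71)
```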

#### *3.2. Sensitivity of Safe Own Object Control to Autonomous Navigation Process Parameter Alterations*

*PP* is the set of parameters of the state of the control process, which is expressed as follows:

$$PP = \left\{ t\_m, \ D\_s, \ \Delta t\_k, \ \Delta V \right\}. \tag{14}$$

*PPe* represents the set of parameters containing measurement and processing errors:

$$PP\_e = \left\{ t\_m \pm \delta t\_m, \ D\_s \pm \delta D\_s, \ \Delta t\_k \pm \delta \Delta t\_k, \ \Delta V \pm \delta \Delta V \right\}. \tag{15}$$

The relative sensitivity *sp* of the final game payment, which represents the final deviation *dK* of the safe trajectory of the ship from the assumed trajectory, will be:

$$s\_p = \left| \frac{d\_K(PP\_e) - d\_K(PP)}{d\_K(PP)} \right| \cdot 100\%. \tag{16}$$

$$s\_p = \left\{ s\_{t\_m}, \ s\_{D\_s}, \ s\_{\Delta t\_k}, \ s\_{\Delta V} \right\}. \tag{17}$$

where *tm* is the advance time of the maneuver with respect to the dynamic properties of the own ship, Δ*tk* is the duration of one stage of the ship's trajectory, *Ds* is the safe distance and Δ*V* is the change in the ship's speed.

#### **4. Sensitivity Characteristics**

The computer simulation of the PG algorithm, which represents the computer software supporting the navigator's maneuvering decisions, was conducted using an example of a real navigational situation involving *J* = 9 encountered objects.

#### *4.1. Sensitivity Characteristics of Game Own Object Control in Good Visibility at Sea*

The safe trajectory of the own object and sensitivity characteristics, which were determined by the PG algorithm in the MATLAB/Simulink software, are presented in Figures 5 and 6.

**Figure 5.** The safe trajectory of the own object for the positional game PG\_gv algorithm in good visibility at sea where *Ds* = 0.5 nm in the situation of passing *J* = 9 encountered objects, *r*(*tK*) = 0 and *d*(*tK*) = 0.71 nm.


**Figure 6.** Sensitivity characteristics of the positional game control of the own object in good visibility at sea according to PG\_gv algorithm as a function of: (**a**) absolute values of the information from sensors, (**b**) angular values of the information from sensors and (**c**) values of the control process parameters.

#### *4.2. Sensitivity Characteristics of Game Own Object Control in Restricted Visibility at Sea*

The safe trajectory of the own object and sensitivity characteristics, which were determined by the PG algorithm in the MATLAB/Simulink software, are presented in Figures 7 and 8.

**Figure 7.** The safe trajectory of the own object for positional game PG\_rv algorithm in restricted visibility at sea where *Ds* = 1.5 nm in the situation of passing *J* = 9 encountered objects, *r*(*tK*) = 0 and *d*(*tK*) = 3.47 nm.

**Figure 8.** Sensitivity characteristics of the positional game control of the own object in restricted visibility at sea according to the PG\_rv algorithm as a function of: (**a**) absolute values of the information from sensors, (**b**) angular values of the information from sensors and (**c**) values of the control process parameters.

#### **5. Conclusions**

The use of simplified game models of a dynamic process for the synthesis of optimal control allowed us to determine the safe trajectories of the own object in situations that involve passing a large number of encountered objects through a certain sequence of course and speed maneuvers.

The developed algorithms also consider the COLREGs rules and maneuver advance time in addition to estimating the object's dynamic properties and assessing the final deviation of the actual trajectory from the reference value.

The following conclusions follow from the course of the sensitivity characteristics presented in Figures 5–8:


The considered control algorithms are the formal models of the navigator's thinking process that controls the objects' movement and maneuvering decisions. Therefore, they can be used in the construction of a new model of the ARPA system containing a computer that supports the decision-making of the navigator.

**Funding:** The project was financed under the program of the Minister of Science and Higher Education under the name "Regional Initiative of Excellence" from 2019 to 2022, project number 006/RID/2018/19 and the amount of financing was 11 870 000 PLN.

**Conflicts of Interest:** The author declares no conflict of interest regarding the publication of this paper. The funders had no role in the design of the study; in the collection, analyses or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

#### **References**


© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Vessel Detection and Tracking Method Based on Video Surveillance**

**Natalia Wawrzyniak 1,\*, Tomasz Hyla <sup>2</sup> and Adrian Popik <sup>3</sup>**


Received: 29 October 2019; Accepted: 26 November 2019; Published: 28 November 2019

**Abstract:** Ship detection and tracking is a basic task in any vessel traffic monitored area, whether marine or inland. It has a major impact on navigational safety, and thus different systems and technologies are used to determine the best possible methods of detecting and identifying sailing units. Video monitoring is present in almost all of them, but it is usually operated manually and is used as a backup system. This is because of the difficulties in implementing an efficient and universal automatic detection method that would work in quickly alternating environmental conditions for all kinds of sailing units, from kayaks to seagoing merchant vessels. This paper presents a method that allows the detection and tracking of ships using the video streams of existing monitoring systems for ports and rivers. The method and the results of experiments on three sets of data using cameras with different characteristics, settings, and scene locations are presented. The experiments were carried out in variable light and weather conditions, and a wide range of unit types were used as detection objectives. The results confirm the usability of the proposed solution; however, some minor issues were encountered in the presence of ship wakes or highly unfavourable weather conditions.

**Keywords:** vessel detection; video monitoring; inland waterway; real-time detection

#### **1. Introduction**

Video surveillance systems are typically used to monitor vessels' movement on coastal and inland waterways, especially those with heavy traffic, complicated organisation or direct proximity to ports. However, the most typical way to track ships is using radar or Automatic Identification System (AIS) infrastructure [1,2]. In most cases, waterside video surveillance works as a support for one of the existing vessel traffic monitoring systems, which use a wide collection of sensors and subsystems to acquire and distribute information on current traffic to system users. (1) Vessel Traffic Services (VTS) is a marine system for ports and harbors that collects data on current traffic and assists ships in decision-making in the area covered by the system. (2) River Information Services (RIS) are implemented in European waterways and serve as an information center for all of the systems' recipients (ships, port authorities, and ship-owners) [3]. The information on detected and identified vessels in RIS can be pushed to other interconnected administration systems (customs, police, etc.). (3) Integrated waterside security systems are developed in sensitive areas (ports, naval bases, power plants, etc.) [4]. These systems detect and identify intruders by fusing information from both underwater and above-water sensors, such as sonars, echosounders [5,6] and autonomous vehicles [7]. In all of these systems, video surveillance helps to visually confirm the identification of a vessel or to monitor non-conventional units (according to the International Convention for the Safety of Life at Sea (SOLAS)) that are not obliged to be equipped with AIS transponders. Video monitoring is also often used as a backup system and, being a passive sensor, is seen as more reliable. Currently, in most cases, a system operator must visually monitor vessels that are passing in front of the cameras, particularly when it is necessary to detect and identify a wide range of vessels, from large cargo vessels to motorboats. Implementing detection and tracking algorithms to analyse live video streams allows for the later automatic classification and identification of ships that enter or leave ports or other areas that need traffic support.

A video monitoring system can be used to track the status of all vessels that are present in a monitored zone, especially when the zone borders are established on rivers or canals. Additionally, each camera that has a view from one bank to the other can be used to update vessels' statuses or to count vessels passing a certain point on the waterway. The cameras can be placed in several different positions (Figure 1). The best view for a camera is usually from a bridge, as the passage under a bridge is often narrower than the waterway and no zoom is required to obtain a high-resolution image of a vessel. When a waterway is wide (hundreds of meters), a set of cameras is required to detect and identify passing vessels.

**Figure 1.** Vessel monitoring on a river based on video surveillance.

The automatic detection of vessels based on video stream analysis is difficult, mainly because the scene condition is constantly changing; e.g., the lighting conditions can be very dynamic due to sun reflections and the waves generated by wind and passing vessels. This causes difficulties in separating moving objects from the background. Generally, the object in a video stream can be detected using two basic solutions. The first solution is based on a pixel-based detection method that allows the detection of any moving object on a constant or slightly changing background. The second solution is object-based detection using a classifier; this second solution is usually used when it is possible to find a distinctive property of a class of objects—e.g., a specific type of prow.

Tracking vessels passing in front of the camera requires matching vessels that were detected in different video frames or using one of the standard tracking algorithms. The second approach is used when it is easy to specify the area of camera view in which objects enter and leave the scene. The tracking algorithms are generally faster than detection algorithms. Therefore, the choice of a tracking approach depends largely on performance requirements.
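One common way to match boxes between frames is by their overlap ratio (intersection over union). A minimal sketch of such greedy matching (the threshold and helper names are illustrative, not the implementation described in this paper):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def match_boxes(prev_boxes, curr_boxes, threshold=0.3):
    """Greedily match current-frame boxes to previous-frame boxes,
    returning (current_index, previous_index) pairs."""
    matches = []
    used = set()
    for i, c in enumerate(curr_boxes):
        best, best_iou = None, threshold
        for j, p in enumerate(prev_boxes):
            score = iou(c, p)
            if j not in used and score > best_iou:
                best, best_iou = j, score
        if best is not None:
            used.add(best)
            matches.append((i, best))
    return matches

# The first box moved by one pixel; the second disappeared.
m = match_boxes([(0, 0, 10, 10), (50, 50, 8, 8)],
                [(1, 1, 10, 10), (100, 100, 5, 5)])
```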

#### *1.1. Related Works*

Several researchers have proposed methods for object detection and tracking in video streams that do not assume a constant background but were not specifically dedicated to waterside systems. Their work has been thoroughly compared in the recent literature [8–10]. However, these solutions are usually too slow for use in real-time systems. A recent trend in object tracking and detection is to exploit machine learning and pattern recognition methods, but again, the computational complexity of these solutions is high [11,12]. There are far fewer methods developed specifically for the marine or coastal environment, with different approaches to detecting ships.

Ferreira et al. [13] proposed a solution for the vessel plate number identification of fishing vessels that enter and leave the harbor with the use of two cameras. The filming camera, with a low resolution, detects movements, and the photographic camera, with a high resolution, takes a photo when a movement is detected. In this solution, the type of the vessel is determined using object-based detection that looks for the prow of the vessel. Object-based detection is mainly based on a Histogram of Oriented Gradients (HOG) classifier [14].

By contrast, pixel-based detection was used by Hu et al. [15] in a video surveillance system designed to detect and track intruder vessels approaching a cage aquaculture. They used the median scheme to create a background image from the previous *N* frames and a two-stage procedure to remove wave ripples. In the first stage, they used brightness and chromatic distortion to select wave candidates, and in the second stage, they used brightness variation to make the final selection of waves.

In 2010, Szpak and Tapamo [16] also proposed a solution to the problem of vessel detection in the presence of waves. They used a background subtraction method and a real-time approximation of level-set-based curve evolution to distinguish moving vessels' outlines. Another approach to improve detection (tracking) quality, based on a fusion of Bayesian and Kalman filters and the adaptive tracking algorithm, was proposed by Kim et al. [17]. Also, Kaido et al. [18] in 2016 proposed a two-stage method for detecting and tracking vessels based on edge selection and the support vector machine in the detection stage, and on a particle filter based on a colour histogram in the tracking stage.

In 2014, Moreira et al. [19] reviewed state-of-the-art algorithms related to the detection and tracking of maritime vessels. They concluded, among other results, that detection and tracking algorithms do not produce efficient results when applied to a maritime environment without proper adjustments; in particular, the algorithms have problems in real situations, such as with small vessels that are hard to distinguish from the background due to low contrast.

There is also dynamic ongoing research into the methods used for the satellite optical imagery that is used to detect vessels. In 2010, Corbane et al. [20] proposed an operational vessel detection algorithm using high spatial resolution optical imagery. The algorithm is based on statistical methods and signal processing techniques (wavelet analysis, Radon transform). Another method based on shape and texture features was presented by Zhu et al. [21]. Later in 2013, Yang et al. [22] proposed another detection method based on sea surface analysis. More detection methods based on satellite imagery were published in [23–25].

In recent years, significant progress has been made in background/foreground segmentation algorithms that allow the detection of a moving object in the presence of a changing background. The background subtraction algorithms were evaluated in [26] and compared in [27]. Several background subtraction algorithms are implemented in the OpenCV library [28]. To begin with, the Gaussian Mixture-based Background–Foreground Segmentation (MOG) algorithm uses a mixture of three to five Gaussian distributions to model each background picture. The probable values of background pixels are the ones that are more static and present in most of the previous frames [29]. Next, Gaussian Mixture-based Background–Foreground Segmentation Algorithm version 2 (MOG2) [30] is available, which is an improved version of MOG. Other popular algorithms are the Godbehere–Matsukawa–Goldberg (GMG) method [31], which uses per-pixel Bayesian segmentation; CouNT (CNT), designed by Zeevi [32], which is designed for variable outdoor lighting conditions; *k* Nearest Neighbours (KNN), which implements K-nearest neighbours background subtraction, as shown in [33]; the algorithm created during the Google Summer of Code (GSOC) [28]; and Background Subtraction using the Local SVD Binary Pattern (LSBP) [34].
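As a much-simplified stand-in for these OpenCV algorithms, background subtraction can be sketched with a per-pixel median model, similar in spirit to the median scheme of Hu et al. [15] (the threshold value and function name are illustrative):

```python
import numpy as np

def foreground_mask(frames, new_frame, threshold=25):
    """Simplified background subtraction: the background model is the
    per-pixel median of recent frames; pixels of the new frame that
    differ from it by more than `threshold` are marked as foreground."""
    background = np.median(np.stack(frames), axis=0)
    diff = np.abs(new_frame.astype(np.int16) - background.astype(np.int16))
    return (diff > threshold).astype(np.uint8)

# A static 4x4 grey background with one bright "vessel" pixel appearing.
history = [np.full((4, 4), 100, dtype=np.uint8) for _ in range(5)]
frame = history[0].copy()
frame[1, 2] = 200
mask = foreground_mask(history, frame)
```

Production-grade algorithms such as MOG2 additionally adapt the model over time and handle shadows, which this sketch does not attempt.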

#### *1.2. Motivation and Contribution*

This paper is a part of an ongoing research in the Vessel Recognition (SHREC) [35] project, which concerns the automatic recognition and identification of non-conventional vessels in areas covered either by River Information Services or Vessel Traffic Service systems. The detection method is a first step in the automatic vessel identification and classification process.

The main contribution of this research is a new vessel detection and tracking method. The method (1) detects all kinds of moving vessels; (2) works in variable lighting conditions; (3) tracks vessels, i.e., it assigns a unique identifier to each vessel passing in front of the camera; and (4) is designed to be efficient, i.e., it can process Full High-Definition (FHD) and 4K video streams using an economically acceptable amount of server resources. Based on several observations related to vessel movement and size characteristics, several rules were created that make it possible to distinguish between vessels and other moving objects. These rules were incorporated into several steps of the proposed method. Therefore, the proposed vessel detection method uses fewer image-processing operations than existing ones. Additionally, a simple water area detection algorithm is used that detects water based on the number and length of edges in a bounding box containing a moving object. The method was implemented, and the described experimental results confirm that the proposed method is suitable for practical use.

The preliminary version of the method was published in [36]. In contrast to that version, the proposed method contains a tracking function, a status update algorithm that includes additional filtering, a simple water detection algorithm, and several other minor improvements. The main difference compared to other vessel detection methods is the use of more logical processing (object filtering, tracking) than image processing techniques. Such an approach allows us to improve processing speed.

#### *1.3. Paper Organisation*

The rest of this paper is organised as follows. Section 2 contains the description of a novel Vessel Detection Method. Section 3 presents our test environment, test application, and experimental data sets. The final section discusses the experiment results and provides conclusions concerning the practical implementation of the proposed method.

#### **2. Materials and Methods**

#### *2.1. Vessel Detection Method*

The proposed vessel detection method is designed using the following approach. The method assumes that for each camera view there is a predetermined detection zone that eliminates areas of the scene where ships either cannot appear (e.g., on land) or are too far away for the detection process to make sense (Figure 2). A background subtraction algorithm is applied to each frame of the video stream to obtain foreground objects, find their contours, and compute bounding boxes for these contours. These boxes are then used in the analysis throughout the whole method. First, bounding boxes outside the detection zone (Figure 2i), boxes too small to be a vessel (Figure 2e,f), and boxes with an improper width-to-height ratio (Figure 2d) are removed. Next, bounding boxes that contain water areas (e.g., waves, sun reflections) are removed (Figure 2h). Then, the method matches the boxes from the current frame to boxes from previous frames based on several properties, such as location or overlap ratio, and stores them in a buffer. Finally, once every five frames, the boxes stored in the buffer whose movement features resemble those of passing vessels are returned as vessels (Figure 2a,b), while the others are discarded (Figure 2c).
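The box-filtering rules above can be sketched in a few lines. The snippet below is an illustrative Python sketch (the paper's implementation is in C#); the rectangular zone representation and all threshold values are assumptions for illustration, not values taken from the paper.

```python
# Hypothetical sketch of the first filtering step: discard bounding boxes that
# lie outside the detection zone, are too small to be a vessel, or have an
# implausible width-to-height ratio. Thresholds are illustrative assumptions.

def filter_boxes(boxes, zone, min_area=400, max_ratio=8.0):
    """Keep boxes that could plausibly contain a vessel.

    boxes -- list of (x, y, w, h) bounding boxes from contour detection
    zone  -- detection zone as (x, y, w, h); a box must lie fully inside it
    """
    zx, zy, zw, zh = zone
    kept = []
    for (x, y, w, h) in boxes:
        inside = zx <= x and zy <= y and x + w <= zx + zw and y + h <= zy + zh
        big_enough = w * h >= min_area
        ratio_ok = max(w / h, h / w) <= max_ratio  # reject thin streaks/waves
        if inside and big_enough and ratio_ok:
            kept.append((x, y, w, h))
    return kept

boxes = [(100, 100, 80, 40),   # plausible vessel
         (5, 5, 10, 10),       # outside the zone
         (200, 200, 10, 10),   # too small
         (150, 150, 200, 10)]  # wave-like streak
print(filter_boxes(boxes, zone=(50, 50, 900, 600)))  # only the first survives
```

In a real deployment, the boxes would come from contours of the background subtraction mask and the zone would be drawn per camera view.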

The vessel detection method consists of the Moving Vessel Detection Algorithm (MVDA), Status Update Algorithm (SUA), and Temporary Buffer (TB). The MVDA is responsible for returning bounding boxes with probable vessel locations for a given input video stream. The results are returned every 1 s and are stored in TB (Figure 3). At the end of every round (every 5 s), the SUA algorithm is run. The SUA inspects the content of TB, filters out probable artefacts, and returns 0 or more series of pictures containing the detected moving vessel. The example MVDA output for one passing vessel is presented in Figure 4.

**Figure 2.** Schema of an exemplary video scene from a bridge camera with distinguished detection zone and different cases of bounding boxes being analysed in the method.

**Figure 3.** Vessel detection method. MVDA: Moving Vessel Detection Algorithm, SUA: Status Update Algorithm.

**Figure 4.** Example of MVDA output (a bridge camera—size corresponds to the size of the vessel in the video frames).

#### *2.2. Moving Vessel Detection Algorithm*

The MVDA takes as an input a frame from a video stream and returns one or more bounding boxes with possible vessel locations. The algorithm works in three phases. In the first phase, the background model is initiated. In the second phase, a frame from a video stream is analysed, which results in a set of bounding boxes with probable locations of vessels in the frame. In the third and final phase, the algorithm assigns an identification number to each bounding box.

The detailed Algorithm 1 is as follows:


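The printed listing of Algorithm 1 is not reproduced in this version of the text. As a rough, hypothetical Python sketch of its third phase, the snippet below assigns identifiers to bounding boxes by overlap with previously tracked boxes; the text only says that matching uses "several properties such as location or overlapping ratio", so the IoU criterion and threshold here are assumptions.

```python
# Hypothetical sketch of the identifier-assignment phase of the MVDA: a new
# box inherits the identifier of the best-overlapping tracked box, otherwise
# it gets a fresh identifier. The IoU threshold is an assumption.

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

class Tracker:
    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.next_id = 0
        self.tracks = {}  # identifier -> last known box

    def assign(self, boxes):
        """Return [(identifier, box), ...] for the boxes of one frame."""
        out = []
        for box in boxes:
            best = max(self.tracks.items(),
                       key=lambda kv: iou(kv[1], box), default=None)
            if best and iou(best[1], box) >= self.threshold:
                tid = best[0]          # continue an existing track
            else:
                tid = self.next_id     # open a new track
                self.next_id += 1
            self.tracks[tid] = box
            out.append((tid, box))
        return out
```

A box that drifts slowly between frames keeps its identifier, which is what lets the SUA later reason about whole series of detections.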

#### *2.3. Status Update Algorithm*

The SUA's goal is to eliminate any remaining artefacts. The algorithm is run every 5 s and uses data from the Temporary Buffer as its input. It outputs zero or more moving vessels, i.e., their pictures and movement directions.

The detailed Algorithm 2 is as follows:

#### **Algorithm 2** Status Update Algorithm


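The printed listing of Algorithm 2 is not reproduced in this version of the text. The hypothetical Python sketch below filters the Temporary Buffer once per round: a track is reported as a vessel only if it was detected often enough and drifted consistently across the frame, as passing vessels do, while near-static tracks (waves, reflections) are discarded. Both thresholds are illustrative assumptions.

```python
# Hypothetical sketch of the Status Update Algorithm's filtering rule.
# buffer maps a track identifier to the time-ordered detections of one round.

def status_update(buffer, min_hits=3, min_shift=20):
    """buffer: {track_id: [(frame_no, (x, y, w, h)), ...]} for one round.

    Returns {track_id: (direction, [boxes])} for tracks judged to be vessels.
    """
    vessels = {}
    for tid, entries in buffer.items():
        if len(entries) < min_hits:
            continue  # too few detections: likely an artefact
        xs = [box[0] for _, box in entries]
        if abs(xs[-1] - xs[0]) < min_shift:
            continue  # barely moved: waves and reflections hover in place
        direction = "right" if xs[-1] > xs[0] else "left"
        vessels[tid] = (direction, [box for _, box in entries])
    return vessels
```

The returned series of pictures would then be handed to the identification stage.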
#### *2.4. Water Detection Algorithm*

The Water Detection Algorithm is based on the observation that all moving vessels have some kind of edges, in contrast to water areas. The only situation in which this algorithm cannot detect water is the waves behind moving vessels, which have many edges. In such cases, the algorithm cannot differentiate them from vessel characteristics.

Algorithm 3 is as follows:


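The printed listing of Algorithm 3 is not reproduced in this version of the text. The Python sketch below captures the edge-count idea described above: a bounding box is treated as water when it contains few edge pixels and no sufficiently long edge run. The edge map would normally come from an edge detector; here it is a binary grid, and both thresholds are illustrative assumptions.

```python
# Hypothetical sketch of the water test: water regions have few and short
# edges, while vessel hulls produce long contiguous edges.

def looks_like_water(edge_map, density_thr=0.05, min_run=5):
    """edge_map: list of rows of 0/1 edge pixels inside the bounding box."""
    total = sum(len(row) for row in edge_map)
    edges = sum(sum(row) for row in edge_map)
    longest = 0
    for row in edge_map:
        run = 0
        for px in row:
            run = run + 1 if px else 0
            longest = max(longest, run)
    # Few edges overall and no long contiguous edge run: probably water.
    return edges / total < density_thr and longest < min_run
```

As the Discussion notes, vessel wakes break this heuristic: they contain many short edges, so raising the sensitivity threshold trades missed small boats against leftover water artefacts.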
#### **3. Results**

#### *3.1. Test Environment*

The method was implemented using C# (Microsoft Corporation, Redmond, WA, USA) and Emgu CV version 4.0.1 (EMGU Corporation, Markham, ON, Canada), a C# wrapper for OpenCV. The application (Figure 5) allows the user to select different parameters and see how they affect the detection quality. It is possible to observe the detection performed by two methods simultaneously to allow for better comparison. Some of the intermediate results are visualised to help the user better understand the impact of each element of the method on the final result. Additionally, it is possible to run a batch test for a given set of video files and store the detection results in files.

**Figure 5.** A screenshot from the test application.

Three data sets were used to test the quality of the proposed detection method:


Figure 6 presents sample (not all) camera views from sets A–C. View (a) is the view from the center of the bridge, view (b) is the view from the river bank, view (c) is the view from below a high suspended bridge, and view (d) is a canal bank.

**Figure 6.** A screenshot from different video samples from sets A, B, and C: (**a**) bridge view; (**b**) bank view–close perspective; (**c**) under bridge view; (**d**) bank view–distant perspective.

Two sets of settings were used in the tests:


The method was tested on a test computer (Intel Core i7-8700K (Intel Corporation, Santa Clara, CA, USA), 32 GB RAM, 1 TB SSD, NVIDIA Quadro P4000 (NVIDIA Corporation, Santa Clara, CA, USA)). Each test outputted a set of image files that were assigned to detection-event categories by two experts.

#### *3.2. Experiments*

The detection events were divided into the following categories:


The semi-correct results can be corrected later by the final vessel identification algorithm, which is not a part of the detection method. For example, when a series of 20 vessel images contains a water artefact at the end, this is not a problem, as the identification algorithm does not use boundary images.

The results for datasets A, B, and C and standard settings are presented in Figure 7. The method returned around 75% of correct detection events for sets A and B and 60% for set C. The semi-correct detection events accounted for 16% of the total events for set A, 24% for set B, and 27% for set C. The results from set A contain more Type II semi-correct events, and those from sets B and C exhibit more Type I semi-correct events. For all sets, there were no Type I and IIa incorrect detection events; i.e., all vessels were detected, and series containing only artefacts were not returned. The only incorrect results are Type IIb events, which are series of pictures containing only water artefacts.

**Figure 7.** Detection events for datasets A, B, and C and standard settings.

The results for high-threshold settings are presented in Figure 8. In contrast to the standard settings, Type I incorrect detection events (the vessel was not detected) are present for sets A and B. Also, water artefacts are not present in sets A and C. The method with high-threshold settings returned more correct results for set C (85%) and fewer for sets A (70%) and B (64%), mainly because these sets contain more small vessels that were filtered out.

**Figure 8.** Detection events for datasets A, B, and C and high-threshold settings.

#### **4. Discussion**

The incorrect detection events mainly arose from a few video samples with unfavourable lighting conditions. The video samples that come from a camera facing the sun have less colour saturation, and sun reflections are present. This causes the background subtraction algorithm to provide worse results. Another difficult case is a camera placed below a high suspended bridge (Figure 6c), which has a large shadow in the middle of the camera view. In practical deployments, such situations can mostly be avoided by carefully placing the surveillance cameras; for example, the cameras can be placed on the north side of a bridge (in the northern hemisphere) or in front of it.

The method resizes the input frames to 1280 × 720 resolution. This resolution is a trade-off between processing speed and accuracy. Set B, despite having a higher resolution, is more compressed and has visible compression artefacts. This means that, after downsizing, the frames have fewer sharp edges. The method uses a blurred frame for background–foreground separation, but a non-blurred frame for water artefact filtering. Because of this, water artefacts were better filtered in set B, as water detection is based mostly on counting edges in the image.

It is worth noting that the professional camera used to obtain datasets B and C produced lower-quality images than our GoPro camera. This is mainly because the GoPro samples were recorded locally on a memory card, whereas the video samples from the other two cameras were recorded from the video streams coming from those cameras with the highest possible quality settings.

One of the main problems in vessel detection is waves. However, waves on inland waters caused by wind are significantly smaller than waves at sea. In the evaluated datasets, the proposed method removed all such waves. The few water artefacts that were returned (Type IIb incorrect detection events) were caused by the wakes created by moving vessels. These water artefacts have more small sharp edges than small boats and therefore are not eliminated. This can be changed by setting a higher sensitivity threshold in the water detection algorithm. The output of the proposed method is further used in a vessel classification method based on deep neural networks (DNNs). One of the defined classes is water, so it can be eliminated later. The only reason we do not use a DNN in the MVDA is efficiency, as our tests showed that using a DNN in the detection method takes too much time.

The results are not uniformly distributed among samples from different camera views. Samples from cameras with the sunlight behind them that are placed at the centre of a bridge or perpendicular to a narrow waterway produce practically no incorrect detection results.

The main limitation of the study is that the method is designed to work in daytime without heavy atmospheric precipitation. Another known limitation is the problem of distinguishing vessels in a situation where one big vessel is in front of the camera and small ones enter the camera view with the big vessel in the background. In such cases, the method returns a frame with two vessels. In further steps of the final identification algorithm, it will be possible to detect hull inscriptions of both vessels, and in this way the vessels can be distinguished.

Future works include improving the method by adjusting it to work with video streams obtained from infrared cameras at night and detecting vessels passing in proximity. Future improvements also include adding the ability to work in low-light conditions; for example, when the camera is placed on a bridge in a city during the night, when there is residual light from street lamps.

**Author Contributions:** Conceptualization, N.W. and T.H.; methodology, T.H.; software, T.H.; validation, N.W. and T.H.; investigation, N.W., T.H. and A.P.; resources, A.P.; data curation, A.P.; writing—Original draft preparation, N.W. and T.H.; writing—Review and editing, N.W. and T.H.

**Funding:** This scientific research work was supported by the National Centre for Research and Development (NCBR) of Poland under grant No. LIDER/17/0098/L-8/16/NCBR/2017.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Classification of Non-Conventional Ships Using a Neural Bag-Of-Words Mechanism**

#### **Dawid Polap <sup>1</sup> and Marta Wlodarczyk-Sielicka 2,\***


Received: 14 February 2020; Accepted: 12 March 2020; Published: 13 March 2020

**Abstract:** The existing methods for monitoring vessels are mainly based on radar and automatic identification systems. Additional sensors that are used include video cameras. Such systems feature cameras that capture images and software that analyzes the selected video frames. Methods for the classification of non-conventional vessels are not widely known, and such methods, based on image samples, can be considered difficult. This paper is intended to show an alternative way to approach image classification problems: not by classifying the entire input data, but smaller parts of it. The described solution is based on splitting the image of a ship into smaller parts and classifying them into vectors that can be identified as features using a convolutional neural network (CNN). This idea is a representation of a bag-of-words mechanism, where the created feature vectors might be called words, and by using them a solution can assign images a specific class. As part of the experiment, the authors performed two tests. In the first, two classes were analyzed and the results obtained show great potential for application. In the second, the authors used much larger sets of images belonging to five vessel types. The proposed method improved the results of classic approaches by 5%. The paper shows an alternative approach for the classification of non-conventional vessels to increase accuracy.

**Keywords:** bag-of-words mechanism; machine learning; image analysis; ship classification; marine system; river monitoring system; feature extraction

#### **1. Introduction**

Ship classification is an important process in many practical applications. In coastal cities, ships enter from the mouth of a river or moor at ports. This type of activity is quite often reported and recorded. However, for measurement, statistical, or even analytical purposes, it is often necessary to record vessels that arrive but do not report anywhere. To this end, the simplest solution is to create a monitoring system and analyze the acquired images. This type of system architecture is based primarily on three main components: video recording, image processing, and classification of possible water vehicles.

While the solution itself seems simple, each component has its disadvantages, which also affect the others. First, the video recorder may be a simple camera, but often one needs to take good-quality photos for easier analysis. The second component is image processing. Image processing should consider the location of a possible ship on an image, or even perform some extraction of features. It is particularly important to remove unnecessary areas such as the background, houses, and even water. The third element is classifying these images, i.e., based on the obtained images, the algorithm should determine with some probability the type of ship.

In this paper, we considered the third aspect of such a system to model a solution enabling the most accurate classification of a given type of ship based on a photo entered into the system. In the analyzed system [1] an important element was the recording of information about passing vessels in water bodies. Unfortunately, this task is not easy due to the similarities between ships and the many factors that can be mistaken for a ship.

#### **2. Related Works**

In the last decade, the number of methods for classifying images has increased, largely owing to the development of convolutional neural networks (CNNs). These mathematical structures can be modeled for specific classification problems, as can be seen in [2]. The classification problem can also be improved by extracting important objects; for this purpose, segmentation can be used [3,4], where the authors proposed a convolutional network architecture based on three dimensions of the incoming image. Moreover, CNNs have been used for classifying objects from different points of view, which is very practical with drones [5], and even for sleep stage scoring based on electroencephalogram (EEG) signals [6]. CNNs have also been used for the recognition of vehicle driving behavior [7]. In addition to the classic architectures, there are others, such as U-net, which are used for segmentation and in inverse problems [8]. Many applications of CNNs can be found, primarily in monitoring systems and medicine.

In general, the database used for training such structures has the biggest impact on the classifier. A very common problem is the lack of enough samples, which results in low efficiency or even overfitting. Data augmentation, i.e., generating new samples from existing ones using image processing techniques, is a popular solution [9]. In [10], the authors discuss the effect of augmented data on a CNN trained on images from chest radiographs. Similarly, in [11], the authors use augmentation to increase the dataset by adding distortions such as changing the brightness, rotating, or adding mirroring effects. Moreover, in recent years transfer learning has been adopted, i.e., the use of pretrained network architectures to minimize training time. The main idea is to create architectures and train them on huge databases; since the coefficients of such trained classifiers are already specialized in searching for features and classifying, transfer learning consists of taking the finished model, modifying only selected layers, and retraining only selected weights to meet the needs of the new database. Leaving a layer unmodified is called freezing. One of the first architectures used for this was AlexNet [12]. Another was VGG16, modeled by a group from Oxford, who primarily reduced the size of the filter in the convolution layers [13]. Another popular model is Inception [14], which drastically reduced the number of architecture parameters.

The ship classification problem depends on using images. Commonly used are synthetic aperture radar (SAR) images, by which ships can be classified based on their shape [15]. Similar research was described in [16,17], where superstructure scattering features were analyzed in the process of classification. Similarly, in [18], the idea of ship classification was solved by analyzing sound signals and removing the background sound of the sea. Other input data are aerial images that present a top view of the scenery and the ship. In all of these solutions, CNN was used for faster feature extraction and classification. An interesting approach was presented in [19], where the authors described the impact of simulated data on the training process of neural classifiers in the problem of ship classification. Moreover, in [20], a neural approach for ship type analysis with sea traffic was presented as an automatic identification system. All these studies used neural classifiers for image processing and classification.

In this paper, we propose a solution for the classification of different non-conventional ships using images taken from the side, not from the top like SAR images. This problem is hard because images can be captured under different lighting, from different distances, or even from different sides of the object. The described solution is based on splitting the image of the ship into smaller parts (using keypoint algorithms with clustering) and classifying them with a CNN into vectors that can be identified as features. This idea is a representation of a bag-of-words mechanism, where the created feature vectors might be called words, and using these words, a solution can assign them a specific class. The main contribution of this article is the use of a bag-of-words mechanism to classify non-conventional ships, which in the future could be used in an innovative system for automatic recognition and identification in video surveillance areas. The proposed solution has not been applied anywhere before and is a new approach to the subject.

#### **3. Bag-Of-Words**

Bag-of-words is an abstract model used in the processing of text or graphics. It is a representation of data described in words, i.e., linguistic values. In the case of two-dimensional images, a word can describe a feature or fragment of an object. The idea of using a bag can help the classification process, because the input image is decomposed into smaller fragments that are classified according to certain linguistic values. These values can then help in the classification of larger objects. This is especially desirable when analyzing objects of the same kind that differ only in small features.

The proposed idea consists of extracting small fragments of the image with certain features. All points are divided, according to a certain metric, into smaller images containing a fragment of the object. Such images can represent anything, because the object can be on any background; for instance, a ship can be captured in a port or against a background of trees, in which case the smaller images can even show some trees. Thus, the use of a classical approach, that is, creating a bag-of-words using an algorithm such as k-nearest neighbors, is not very effective. The reason is the lack of connection between the features (in the smaller parts), because the objects can appear at different scales, be rotated by a certain angle, or even contain noise, such as bad weather or additional objects. That is why we propose a bag-of-words model based on more complex structures, such as neural networks.

#### *3.1. Feature Extraction*

The main idea of this study was to extract features using one of the classic algorithms for obtaining keypoints, such as scale-invariant feature transform (SIFT) [21], speeded-up robust features (SURF) [22], features from accelerated segment test (FAST) [23], or binary robust invariant scalable keypoints (BRISK) [24], and then create samples from the found features. It should be noted that if these algorithms processed the original image, the found points would probably cover the entire image; in the case of a simple image where a ship is at sea, all points might be placed on the object, the water, or the waves, but an image with additional background may contain many more possible points. To remedy this, in the first step the image must be processed, which means using graphic filters to minimize elements such as edges or points. We used only two filters: gamma correction and blur.

#### *3.2. Feature Extraction Based on Keypoints*

Using the described algorithms, we obtained a set of keypoints, which we can describe as $A = \{(x_0, y_0), (x_1, y_1), \ldots, (x_{n-1}, y_{n-1})\}$. To minimize the number of points (because unnecessary elements of the image can be indicated), all points were checked against their neighbors. If a point had a neighbor within a certain distance $\alpha$, it remained in the set; otherwise, the point was removed and the cardinality was reduced by one. The distance between two points $p_i = (x_i, y_i)$ and $p_j = (x_j, y_j)$ was checked using one of two classic metrics, the Euclidean or the river metric. The best known is the Euclidean, modeled as

$$d_E\big((x_i, y_i), (x_j, y_j)\big) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} \tag{1}$$

The river metric is the distance between points counted relative to a certain straight line between them. Both points are projected perpendicularly onto the line, which yields two additional points, $(x_o, y_o)$ and $(x_p, y_p)$. The distance in this metric is calculated as the sum of the distance from the first point to the straight line, the distance between the two projected points along the line, and the distance from the line to the second point. Formally, it can be stated as

$$d_R\big((x_i, y_i), (x_j, y_j)\big) = d_E\big((x_i, y_i), (x_o, y_o)\big) + d_E\big((x_o, y_o), (x_p, y_p)\big) + d_E\big((x_p, y_p), (x_j, y_j)\big) \tag{2}$$
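Both metrics can be sketched directly in Python. In the sketch below, the reference line of the river metric is given by a point `a` and a unit direction vector `u`; the paper does not specify how the line is represented, so this parametrisation is an assumption.

```python
import math

def d_e(p, q):
    """Euclidean distance between two points, Equation (1)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def d_r(p, q, a, u):
    """River metric, Equation (2): distance via perpendicular projections
    onto the reference line through point a with unit direction u."""
    def project(pt):
        # Scalar projection of (pt - a) onto u gives the foot of the
        # perpendicular from pt to the line.
        t = (pt[0] - a[0]) * u[0] + (pt[1] - a[1]) * u[1]
        return (a[0] + t * u[0], a[1] + t * u[1])
    po, pp = project(p), project(q)
    # Down to the line, along the line, back up to the second point.
    return d_e(p, po) + d_e(po, pp) + d_e(pp, q)
```

For example, with the x-axis as the reference line, two points at height 3 that are 4 apart horizontally have Euclidean distance 4 but river distance 3 + 4 + 3 = 10.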

Depending on the chosen metric, all points are checked in this way, and points without a sufficiently close neighbor are removed. The next step is to divide the points into subsets $B_q$, where $q$ is the number of objects. It is not possible to choose the value of $q$ without empirically checking and testing the data in the database. If this value were known, it would be worth using existing algorithms to divide the points (for example, the k-nearest neighbors algorithm). However, this value is unknown, so another approach must be taken. For this purpose, one of the previously described metrics can be used.

For all points in a given set *A*, the average distance value is calculated as

$$\xi(A) = \frac{1}{5 n^2} \sum_{i=0}^{n-1} \sum_{j=i}^{n-1} d_{\mathrm{metric}}\big((x_i, y_i), (x_j, y_j)\big) \tag{3}$$

Given the average distance, the points are divided with respect to this value. The first subset is created by adding the first point to it, i.e., $(x_0, y_0) \in B_0$. Then, for each point $(x_r, y_r) \in A$, we check whether the distance between this point and any point $(x_0, y_0) \in B_0$ already in the subset is less than the average distance of the set, i.e.,

$$d_{\mathrm{metric}}\big((x_r, y_r), (x_0, y_0)\big) < \xi(A) \tag{4}$$

If the above inequality is met for a point $(x_r, y_r)$, the point is added to subset $B_0$ and removed from $A$. When no further point can be added to the current subset, another subset, $B_1$, is created, the first remaining point of $A$ is added to it, and that point is removed from $A$. This procedure is repeated until the stop condition is met, namely the emptiness of the set, $A = \emptyset$.
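The subset-building procedure above can be sketched as follows. The $\frac{1}{5 n^2}$ normalisation is taken directly from Equation (3); everything else follows the described steps (grow a subset while some remaining point is within $\xi(A)$ of a point already in it, then open a new subset).

```python
import math

def xi(points):
    """Average pairwise distance of Equation (3) (Euclidean metric here)."""
    n = len(points)
    total = sum(math.dist(points[i], points[j])
                for i in range(n) for j in range(i, n))
    return total / (5 * n * n)

def partition(points):
    """Divide keypoints into subsets B_q by the threshold xi(A)."""
    remaining = list(points)
    threshold = xi(remaining)
    subsets = []
    while remaining:
        current = [remaining.pop(0)]   # seed a new subset with the first point
        changed = True
        while changed:                 # absorb every point close to the subset
            changed = False
            for p in list(remaining):
                if any(math.dist(p, q) < threshold for q in current):
                    current.append(p)
                    remaining.remove(p)
                    changed = True
        subsets.append(current)
    return subsets
```

On a toy set with two well-separated clusters, e.g. three points near the origin and two near (100, 100), the procedure recovers exactly those two groups.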

As a result, the subsets $B$ are generated, each representing one feature. For each subset, an image is created whose dimensions depend on the subset: we find the maximum and minimum values of both coordinates, denoted $x_{max}$, $y_{max}$, $x_{min}$, and $y_{min}$, so the image size is $(x_{max} - x_{min}) \times (y_{max} - y_{min})$. The images are then saved, and each one represents a part of the original image. The left part of Figure 1 shows this process of extracting smaller parts of the image, and Figure 1 as a whole gives a graphic visualization of the proposed model.

**Figure 1.** Graphic visualization of the proposed model.

#### **4. Classification with Bag-Of-Words**

Unfortunately, there was no unambiguous method to assign attributes to specific groups automatically. Therefore, we suggest creating the groups at the initial stage of modeling the solution with the help of an empirical division. In this way, the basic database of features was created, which later forms the bag-of-words.

#### *4.1. Convolutional Neural Network*

One of the most important branches of artificial intelligence methods is neural networks, which have been modeled for the needs of graphic image classification. Convolutional neural networks are models inspired by image processing in the cerebral cortex of cats. A CNN is a mathematical structure built of three types of layers, which are connected by synapses, each carrying a certain weight. The weights are set randomly when the structure is created; then, in the training process, they are modified to best match the training database.

One of the key layers of the network is the convolutional layer, which takes an input image with dimensions $w \times h \times d$, where $w$ and $h$ are the width and height of the image and $d$ is the depth, which depends on the model. For color images saved in the red–green–blue (RGB) model, the depth is 3 due to the number of components. Formally, each image is saved as a set of matrices, each of which describes the image values for a given component. The convolutional layer works on the principle of an image filter $f$ of size $k \times k$. This filter is a matrix with $k^2$ coefficients defined randomly and modified during the training process. The filter is moved over the image and changes the value of pixel $p$ in image $I$ at position $(i, j)$, which can be defined as

$$I[i,j] = \frac{1}{K} \sum\_{t=-\lfloor \frac{k}{2} \rfloor}^{\lfloor \frac{k}{2} \rfloor} \sum\_{r=-\lfloor \frac{k}{2} \rfloor}^{\lfloor \frac{k}{2} \rfloor} I[i+t, \ j+r] \cdot f[t,r] \tag{5}$$

where matrix $f$ is located over the image with its central point over the pixel at position $(i, j)$, and $K$ is the sum of all weights of filter $f$. The main purpose of this layer is feature extraction and the reduction of data redundancy in the image. Applying a filter to the image changes it; depending on the filter coefficients, some objects may be deleted or highlighted.
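Equation (5) translates almost line by line into code. The Python sketch below implements the normalised sliding-window sum; border pixels are simply left unchanged, which is one of several common border policies and an assumption here.

```python
# Direct sketch of Equation (5): a k x k filter slides over the image, and each
# output pixel is the weighted sum of its neighbourhood divided by K, the sum
# of the filter weights. Images are plain lists of rows for illustration.

def convolve(img, f):
    k = len(f)
    half = k // 2
    K = sum(sum(row) for row in f) or 1   # guard against division by zero
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]         # borders are kept as-is
    for i in range(half, h - half):
        for j in range(half, w - half):
            acc = 0
            for t in range(-half, half + 1):
                for r in range(-half, half + 1):
                    acc += img[i + t][j + r] * f[t + half][r + half]
            out[i][j] = acc / K
    return out
```

With a uniform 3 × 3 filter, this reduces to the mean of each neighbourhood, i.e., the blur filter used in the preprocessing step; a constant image therefore passes through unchanged.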

The second type of layer is called pooling, which has only one purpose: to reduce the size of matrices. Reducing depends on some function *g(*·*)*, which selects one pixel from each square *m* × *m*. The most commonly used function is *max(*·*)* or *min(*·*).*

These two layers can be used alternately many times. In the end, there is the last layer, the fully connected layer, which is understood as a classical neural network. Each pixel from the last layer (pooling or convolutional) is input as a numerical value. This layer is composed of columns of neurons connected by synapses, which are burdened with some weight. Each neuron gets a numerical value that is processed and sent to the next column. This operation can be described as

$$\mathbf{x}\_m^t = f(\sum\_{i=1}^n \mathbf{x}\_i^{t-1} \cdot \boldsymbol{w}\_i) \tag{6}$$

where $x_m^t$ is the output of neuron $m$ in layer $t$, and $w_i$ is the weight on the connection between $x_m^t$ and $x_i^{t-1}$ in layer $t-1$. The number of columns and neurons depends on the modeled architecture. The last column should contain $k$ neurons (when the classification task is a $k$-class problem). The final processing of an image in such a structure gives a probability distribution, which can be normalized by a function like softmax. These values are understood as the probabilities of belonging to each class.
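A single neuron of Equation (6) is a one-liner. The sigmoid activation in the sketch below is a common choice but an assumption here, as the paper does not state which activation function $f$ was used.

```python
import math

def neuron(inputs, weights, f=lambda s: 1 / (1 + math.exp(-s))):
    """Equation (6): weighted sum of the inputs passed through activation f."""
    return f(sum(x * w for x, w in zip(inputs, weights)))
```

For a zero weighted sum, the sigmoid output is exactly 0.5, which is a quick sanity check for the implementation.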

Unfortunately, all weights in this model are generated randomly at the beginning. To change these values, a training algorithm must be used, whose main idea is to minimize the loss function from iteration to iteration. One such algorithm, commonly used in convolutional networks, is adaptive moment estimation [25]. The modification of the weights is based on bias-corrected estimates of the mean $\hat{m}_t$ and variance $\hat{v}_t$ of the gradient:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \tag{7}$$

$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t} \tag{8}$$

where $m_t$ and $v_t$ are the mean and variance estimates in the $t$th iteration. The formulas for calculating them can be presented as

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \tag{9}$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \tag{10}$$

where $\beta_1$, $\beta_2$ are exponential decay rates and $g_t$ is the gradient in iteration *t*.

These two statistical coefficients are used to modify the weights as

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\widehat{v}_t} + \epsilon}\, \widehat{m}_t \tag{11}$$

where η is the learning coefficient and ε is a small constant that prevents division by zero.
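Equations (7)–(11) combine into a single update step per iteration. The sketch below uses only the standard library and the commonly cited Adam defaults (η = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸); the toy quadratic loss is our illustration, not a loss from the article.

```python
import math

def adam_step(theta, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: Eqs. (9)-(10) update the moment estimates,
    (7)-(8) correct their bias, and (11) moves the parameter."""
    m = beta1 * m + (1 - beta1) * g          # Eq. (9)
    v = beta2 * v + (1 - beta2) * g * g      # Eq. (10)
    m_hat = m / (1 - beta1 ** t)             # Eq. (7)
    v_hat = v / (1 - beta2 ** t)             # Eq. (8)
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)  # Eq. (11)
    return theta, m, v

# Minimize the toy loss L(theta) = theta^2, whose gradient is 2*theta.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 10001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta)   # close to the minimizer theta = 0
```

Because $\widehat{m}_t / \sqrt{\widehat{v}_t}$ is roughly the sign of the gradient, each step has magnitude close to η, which is what makes Adam robust to gradient scale.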

*4.2. Bag-Of-Words*

A trained classifier can be used as an element that sorts incoming images into selected elements of a bag. For each image, smaller images representing features are created. Each of these features is classified using the pretrained convolutional neural network. As a result, the network returns the probability of belonging to each word in the set (each single output of the network is interpreted as a word). Based on these probabilities, it is possible to assign the detected features to an object. The selection of features for an object works on the principle of determining conditional affiliation to another word in the bag. To avoid collapsing a whole object into its characteristics, it is worth dividing the bag into two sets (or even two bags): the first bag contains only features, and the second contains full objects. For a better understanding of this idea, let us assume that the image presents a motorboat. The bigger bag contains classes of ships, such as motorboat, yacht, etc. The smaller bag (inside the bigger one) describes one ship; for a motorboat, these words could be, for example, "a man", "waves", and "no sails".

Each of these objects is defined as a numerical vector of zeros and ones (a one marks belonging to the class). Each item in the vector is assigned to one feature from the bag-of-words, so the vector is built from the results returned by the classifier. It should be noted that the many smaller segments cut from a basic image produce many classification results; these are averaged over all decisions returned by the classifier.
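The averaging and rounding step can be sketched as follows. This assumes NumPy, and the per-segment probability rows are hypothetical CNN outputs invented for illustration (here for two features, e.g. mast and sail).

```python
import numpy as np

# Hypothetical per-segment CNN outputs for one image: each row is the
# probability of two features (e.g. [mast, sail]) for one extracted segment.
segment_probs = np.array([[0.9, 0.7],
                          [0.8, 0.9],
                          [0.7, 0.6]])

# Average over all segments, then round each entry to 0 or 1.
feature_vector = segment_probs.mean(axis=0).round().astype(int)
print(feature_vector)   # -> [1 1]: both features present
```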

A feature vector is matched to an object by comparing vectors. The simplest method is to round the values returned by the network to integers and compare them with the words in the bag. However, the resulting vector may differ in one position from all the patterns. To handle this, we suggest using the k-nearest neighbors algorithm, which assigns the vector to a given object. The full process is shown in Figure 1.

The *k*-nearest neighbors algorithm analyzes a sample and assigns it according to its neighboring samples [26,27]. Suppose that the value $x_i$ has an assigned class $\mu_i$. In the analyzed problem, $x_i$ corresponds to a binary word vector and the values $\mu_i$ represent the objects. The algorithm finds the nearest neighbors $x_n \in \{x_0, x_1, \ldots, x_{n-1}\}$ for a given value $x$ according to the following equation:

$$\min_i\left(d_{\text{metric}}(x_i, x)\right) = d_{\text{metric}}(x_n, x), \quad i = 0, 1, 2, \ldots, n-1. \tag{12}$$
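With only a handful of words in the bag, Eq. (12) reduces to a nearest-word lookup. A minimal sketch, using the Euclidean metric and a hypothetical two-word bag; the function name `nearest_word` is our own.

```python
import math

def nearest_word(vec, bag, metric=math.dist):
    """Eq. (12): return the bag entry whose pattern minimizes d_metric(x_i, x)."""
    return min(bag, key=lambda word: metric(bag[word], vec))

# Hypothetical bag with two binary words over the features [mast, sail].
bag = {"sailing ship": (1, 1), "others": (0, 0)}

# An averaged classifier output that matches no pattern exactly is still
# assigned to the closest word.
print(nearest_word((0.9, 0.4), bag))   # closer to (1, 1)
print(nearest_word((0.2, 0.1), bag))   # closer to (0, 0)
```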

#### **5. Experiments**

In our experiments, we tested two databases. The first had two classes, sailing ship and others, and was used to create the first set of features and to find the best combination of algorithms. The second database contained more classes and the largest number of samples, to show the potential application of this approach.

#### *5.1. Classification for Two Classes of Ships*

In these experiments, we tested the proposed solution to find the best combination of algorithms for our approach. For this purpose, the database used was very small. It contained two classes: sailing ship and others. A sailing ship should have sails, although they do not always have to be spread. This observation allows the creation of two features describing this object, i.e., masts and sails. A vector describing these two classes is then created:

$$\begin{cases} [1,\ 1] & \text{sailing ship} \\ [0,\ 0] & \text{others} \end{cases} \tag{13}$$

where the individual values are understood as the respective features, masts and sails. In these tests, the CNN architecture described in Table 1 was used.

**Table 1.** Convolutional neural network (CNN) architecture. ReLU, rectified linear unit.


In the experiments, we used a database containing 800 images (600 with sailing ships and 200 with other ships). In the training process, 75% of the samples, selected randomly from each class, were used; the remaining 25% (150 and 50 images, respectively) were used for validation.

For each sample, one of the keypoint algorithms was used, which allowed us to create a few smaller segments. We tested the algorithm for each segment; the results for two selected metrics, Euclidean and river, are presented in Table 2. In the table, each algorithm has two columns, "Object features" and "Background", indicating whether the extracted segment describes an important feature of a ship or not. Quite a common problem was finding background, i.e., an insignificant fragment of the ship and a large amount of sky or sea. The results shown are averaged over the entire database. It is easy to see that the Euclidean metric generates many more features than the river metric. In both cases, the share of segments depicting background exceeded 50%; however, it is not as large for the classic Euclidean metric.

**Table 2.** Average number of created objects using a key-search algorithm with the connection with Euclidean or river metrics. SIFT, scale-invariant feature transform; SURF, speeded up robust features; FAST, features from accelerated segment test; BRISK, binary robust invariant scalable keypoints.


In our tests, we used the SIFT, SURF, BRISK, and FAST algorithms to find keypoints. After that, all found segments were resized to one size and processed by the CNN. The results obtained for each image were averaged and classified using the k-nearest neighbors algorithm (in this experiment, k = 2) and are presented in Tables 3 and 4 (the results in the second column represent classification of the whole image). Some examples of keypoint clustering are presented in Figure 2.


**Table 3.** Statistical coefficients for classification measurements using the selected keypoint search algorithm with Euclidean metric and CNN.

**Table 4.** Statistical coefficients for classification measurements using the selected keypoint search algorithm with river metric and CNN.


**Figure 2.** Examples of clustered keypoints for motorboat images.

The highest efficiency was obtained with the Euclidean metric combined with the SURF algorithm. For this combination, the classification result was nearly 6% higher than that of the convolutional network alone, without the bag-of-words mechanism. However, it is worth noting a significant difference in the negative predictive value, which was almost twice as high when using the bag mechanism. This factor describes the probability of assigning a negative sample to the correct class, in this case, not a sailing ship. The situation is similar for the other hybrids, where this value is always higher than 50%. A similar pattern occurred for the F1 score, which is the harmonic mean of the precision and recall coefficients and allows the classification to be evaluated when its components have different values. In each case, the statistical coefficients indicated a more accurate process when the proposed mechanism was taken into account.

For a more detailed analysis, time measurements were also made for the image processing and training of a given architecture, as shown in Figure 3.

**Figure 3.** Average time for image processing and the training process.

The presented results are data averaged over 10 tests. In general, using the Euclidean metric saves approximately 10% of time compared to the river metric. The tests showed that the longest processing time occurred with the FAST algorithm and the shortest with BRISK; for SIFT and SURF, the measured times were similar and fell in between.

#### *5.2. Classification for Five Classes of Ships*

Based on the previous results, the best accuracy was achieved with a combination of the SURF algorithm and the CNN. We used this combination for the classification of five classes: cargo (2120 images), military (1167 images), tanker (1217 images), yacht (688 images), and motorboat (512 images). For the first three classes, images were downloaded from a publicly available dataset from the Deep Learning Hackathon organized by Analytics Vidhya. Each class was divided randomly into training and validation sets in a 75%:25% ratio. Using the training set, the SURF algorithm was applied to create smaller parts, and these samples were grouped into features, which can be described by the following vector:

$$[\text{mast},\ \text{sail},\ \text{people},\ \text{color},\ \text{simpleShape}], \tag{14}$$

where *people* means that some people can be found on deck, *color* means that the boat can have various colors (a military ship is mainly gray), and *simpleShape* means that the ship can be recognized as a simple geometric figure, such as a rectangle. These features were chosen according to the database used and their possible locations in the images.

Using these features, words describing ship type were defined as follows:

$$\begin{cases} \begin{array}{ll} \left[0; \ 0; \ 0; \ 1; \ 1\right] & \text{cargo} \\ \left[1; \ 0; \ 0; \ 0; \ 0\right] & \text{military} \\ \left[0; \ 0; \ 0; \ 1; \ 0\right] & \text{tanker} \\ \left[1; \ 1; \ 1; \ 1; \ 0\right] & \text{yacht} \\ \left[0; \ 0; \ 1; \ 1; \ 1\right] & \text{motorboat} \end{array} \end{cases} \tag{15}$$

The training database contained 4278 images, which resulted in almost 26,000 smaller segments. The data were split into features based on color clustering using the *k*-means algorithm [28] and corrected empirically (especially for shape). We trained classifiers with the architecture described in Table 1; in the end there were five of them, one for each feature class. Each classifier was trained for five different numbers of iterations, *t* ∈ {20, 40, ... , 100}, and the accuracy is presented in Table 5. The best accuracy was reached with 80 iterations; accuracy did not improve with more.
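The color-clustering split can be sketched with a minimal one-dimensional *k*-means, using only the standard library. The scalar grey-level values and the `kmeans` helper are illustrative assumptions; the article clusters real segment colors.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal 1-D k-means: cluster scalar color values into k groups
    by alternating nearest-center assignment and center recomputation."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: abs(p - centers[i]))].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Hypothetical mean grey-levels of segments: dark hulls vs. bright sails.
values = [30, 35, 32, 200, 210, 205]
print(kmeans(values, 2))   # two well-separated cluster centers
```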

**Table 5.** Average classification accuracy and number of iterations in the training process.


The obtained accuracy is not very promising for such a classification problem. The main cause is the selection of the features and the creation of the sets for them. In the experiments, the dataset was so large that whether a sample belonged to a set was determined by the algorithm. Moreover, a feature such as shape is not the best choice for ships.

Despite these drawbacks, we conducted an additional experiment to check the classification result for this database in terms of hybrids. We classically trained a CNN to classify full images and checked its effectiveness on the validation set. Then we combined the results of this classification with the proposed solution: when our approach assigns a sample to a given class from the bag, we add a constant value of 0.2 to the probability of that class returned by the classic classifier. This allows the probability distribution to be changed by up to 20%. The results are shown in Table 6, which lists the exact numbers of correctly classified images from the validation set and the accuracy.
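One possible reading of this combination rule is sketched below, assuming NumPy. The renormalization step, the `boosted_decision` helper, and the example probabilities are our own illustrative assumptions; the article only specifies the constant 0.2 boost.

```python
import numpy as np

CLASSES = ["cargo", "military", "tanker", "yacht", "motorboat"]

def boosted_decision(cnn_probs, bag_class, boost=0.2):
    """Add the constant 0.2 to the probability of the class chosen by the
    bag-of-words step, renormalize, and pick the most probable class."""
    p = np.asarray(cnn_probs, dtype=float).copy()
    p[CLASSES.index(bag_class)] += boost
    p /= p.sum()
    return CLASSES[int(p.argmax())], p

# Hypothetical case: the CNN slightly prefers "tanker", the bag says "cargo".
label, p = boosted_decision([0.30, 0.05, 0.35, 0.20, 0.10], "cargo")
print(label)   # the 0.2 boost tips the decision toward the bag's class
```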

**Table 6.** Comparison of classic CNN usage and extended usage with the proposed approach.


These data show that our proposition can be used as an additional component and increase the classification accuracy by nearly 5.5%. This result is better, but training the networks and classifying the samples take more time because of the additional operations. It is worth noting that the values increased mainly for the military, yacht, and motorboat classes. This is due to well-defined features, such as color for military ships, or people and colors for the other two classes. The main conclusion is that the most difficult task is the initial declaration of the bag-of-words describing the features. This solution can be used in practice, but some additional tasks arise during modeling, such as overseeing the creation of the small images representing features and assigning them to individual groups. The declaration of characteristics also involves allocating image segments to these classes and analyzing them before training the classifier.

We also tested other CNN architectures, including VGG16, Inception, and AlexNet, and compared the results with and without our approach, as shown in Figure 4.

**Figure 4.** Comparison of the classic approach to image classification and hybrids.

The obtained results show that the proposed approach increased the classification accuracy of all tested architectures. The average gain over all architectures was around 6%, which is a good result given the small datasets (neural networks are data-hungry algorithms). Apart from VGG16, where the increase was close to 3%, the other architectures achieved an increase of about 7%. This is a good result, which could be significant for the more extensive classes in the analyzed database.

#### **6. Conclusions**

Image classification is a problem for which new solutions are being developed all the time. In recent years, convolutional neural networks have enabled a huge leap forward. Unfortunately, this solution also has its problems, such as the need for a large number of samples in the database and the difficulty of architecture modeling. In this paper, we focused on analyzing images of selected ships. As part of the research, we proposed a classification mechanism based on sample segments determined by keypoint-search algorithms and classified afterwards.

As part of our experiments, we performed two tests. In the first, we analyzed two classes, and the obtained results showed great potential for practical applications. In the second, we used much larger sets of images of five types of ships. The proposed solution by itself showed many disadvantages, especially at the stage of determining the features and assigning samples to them to train the classifier. However, when we used this solution as an additional element of classification after the classic approach, including transfer learning, the average efficiency increased by approximately 5% in almost all cases compared to currently used convolutional network architectures.

An analysis of the database using a feature vector, which can be treated as a bag of words, shows potential practical application, especially if the features of the objects are well described. In future research, we plan to focus on how to automatically analyze images to extract features from them, as well as automatically assign classes as an unsupervised technique.

**Author Contributions:** Conceptualization, D.P. and M.W.-S.; methodology, D.P.; software, D.P.; validation, D.P.; data curation, D.P. and M.W.-S.; writing—original draft preparation, D.P.; writing—review and editing, D.P. and M.W.-S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Centre for Research and Development (NCBR) of Poland under grant no. LIDER/17/0098/L-8/16/NCBR/2017 and grant no. 2/S/KG/20, financed by a subsidy from the Ministry of Science and Higher Education for statutory activities.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
