*Article* **Navigation Situation Clustering Model of Human-Operated Ships for Maritime Autonomous Surface Ship Collision Avoidance Tests**

**Taewoong Hwang <sup>1</sup> and Ik-Hyun Youn <sup>2,\*</sup>**


**Abstract:** The collision avoidance system is one of the core systems of MASS (Maritime Autonomous Surface Ships). Collision avoidance systems have been validated using scenario-based experiments. However, the scenarios used for validation have either been designed based on COLREG (International Regulations for Preventing Collisions at Sea) or chosen arbitrarily. Therefore, the purpose of this study is to identify and systematize objective navigation situation scenarios for the validation of autonomous ship collision avoidance algorithms. A data-driven approach was applied: 12 months of Automatic Identification System data were collected in the west sea of Korea, ship trajectories were extracted, and the data were hierarchically clustered according to navigation situations. Consequently, we obtained the hierarchy of navigation situations and the frequency of each navigation situation for ships that sailed the west coast of Korea during one year. The results are expected to be applied to develop a collision avoidance test environment for MASS.

**Keywords:** navigation situation; human-operated ship; MASS; clustering; testbed scenario

## **1. Introduction**

Ship collisions are frequent accidents that account for more than 50% of all maritime accidents [1], causing large-scale casualties as well as property and environmental damage [2]. Collision accidents remain a concern in the development of MASS (Maritime Autonomous Surface Ships) [3]. Therefore, the collision avoidance and path-finding system is considered one of the core systems of MASS [4]. Numerous studies have been conducted on collision avoidance and path-finding/control systems [5–7]. These systems have been tested in navigation scenarios to validate their safety integrity.

Perera et al. presented an experimental evaluation of autonomous ship collision avoidance. The experiment was conducted with scaled model ships in a lake. Five COLREG-based encounter situations were set up for the experimental evaluation of autonomous navigation and collision avoidance [8].

Son et al. verified the collision avoidance performance of autonomous ships by applying a COLREG-based scenario in a real-ship experiment. For the scenario, three types of navigation situations (head-on, port-crossing, and starboard-crossing) were applied to a 1:1 situation [9].

Shen et al. presented an autonomous vessel collision avoidance algorithm in restricted water based on deep Q-learning. The scenario was designed based on the navigation situation defined in COLREG by applying the head-on, crossing, and overtaking scenarios in restricted and open areas [10].

Woo et al. proposed a collision avoidance algorithm for autonomous ships using deep reinforcement learning and tested the algorithm's performance by applying head-on and overtaking scenarios in a simulation environment [11].

**Citation:** Hwang, T.; Youn, I.-H. Navigation Situation Clustering Model of Human-Operated Ships for Maritime Autonomous Surface Ship Collision Avoidance Tests. *J. Mar. Sci. Eng.* **2021**, *9*, 1458. https://doi.org/10.3390/jmse9121458

Academic Editors: Haitong Xu, Lúcia Moreira and Carlos Guedes Soares

Received: 1 December 2021 Accepted: 17 December 2021 Published: 20 December 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Pedersen et al. verified the developed autonomous navigation system through simulation-based tests. An automated scenario generator tool was used to test it in the set scenarios. Two head-on collision scenarios and one multiple-target-ship encounter scenario were used [12].

Porres et al. presented an AI-based scenario search and production algorithm to validate the autonomous navigation system, but the developed scenario was limited to the 1:1 situation [13].

Huang et al. presented a new collision risk measure for the collision avoidance algorithm. The authors compared the new model's performance with the traditional risk measures in a simulated environment. The scenarios used were encounter situations with 1, 2, and 3 target ships. However, the basis for setting up the scenarios was not given in the article [14].

Chun et al. proposed a collision avoidance algorithm based on deep reinforcement learning. The developed algorithm was validated in two scenarios, but the design base for one of them was not specified [15].

Lazarowska used a navigation scenario with a change in the target ship's course to validate an autonomous ship's safe trajectory planning algorithm. Head-on, crossing, 1:2 encounter situation, and sudden change of opponent ship were the four scenarios used for validation, and no objective scenario design basis was used [16].

Gil proposed the concept of a critical safety area for the obstacle-avoidance algorithm. The verification experiments were conducted in a simulation environment. The author designed the scenario considering various obstacles which were different in size and shape. However, the basis for designing the scenario was arbitrary [17].

Szlapczynski et al. suggested a ship domain-based model of ship collision risk that utilizes a number of parameters. A simulation was conducted to validate the presented method. The scenarios used for the simulation were overtaking, head-on, and crossing, and two phases were included in each scenario. However, the rationale for the scenario design was missing [18].

Table 1 shows a summary of related works. The related works show that various methods are being used to improve the collision avoidance algorithm for autonomous ships. A scenario-based test is generally used for the validation of collision avoidance system development. However, the scenarios used for validation are mainly designed based on the navigation situations in the COLREG or the arbitrary design of the researcher. Thus, there may be a dearth of research into systematic scenario design testing of the MASS collision avoidance system.


**Table 1.** Summary of related works.



For this reason, systematic scenarios must be developed to verify the MASS collision avoidance algorithm. In addition, because human-operated ships dominate the environment in which the MASS will navigate in the future, objectively recognizing the navigation situations where autonomous vessels are likely to encounter human-operated ships is necessary.

Therefore, the purpose of this study is to differentiate the navigation situations of human-operated ships for MASS collision avoidance algorithm validation.

## **2. Methodology**

This section aims to describe the methodology that objectively analyzes the navigation situations by clustering the latent ship's trajectory from the Automatic Identification System (AIS) data collected over a long period in a wide sea area.

The methodology workflow is shown in Figure 1. The collected data were AIS data, and the data's characteristics were verified through data analysis. Then, in the preprocessing stage, criteria were applied to the AIS data to extract the trajectories of the own ship and target ships, and the necessary data cleaning and calculation were performed. Next, feature extraction was used to turn the extracted trajectory data into variables that represented the navigation situation. Finally, hierarchical clustering was applied to analyze the composition and ratio of the navigation situations.

**Figure 1.** Workflow of methodology.

#### *2.1. Data Collection*

#### 2.1.1. AIS Data

The collected data were AIS data. The Republic of Korea's Ministry of Oceans and Fisheries provided these data, which included both static and dynamic data [19].

#### 2.1.2. Spatial Information

AIS data were collected from all over the Korean sea, as shown in Figure 2. In this study, the Yellow Sea (west sea) of South Korea, where no obstacles such as islands exist, was selected as the target sea area, shown in red. Thus, navigation situations between ships that were not affected by the geographic environment could be collected. The distance from east to west and north to south was 90 nautical miles.

#### 2.1.3. Temporal Information

The AIS data were collected for 12 months, from 1 September 2019 to 31 August 2020. Therefore, it was possible to analyze the data and to consider monthly or seasonal changes.

#### *2.2. Data Examination*

Data examination was performed to understand AIS data characteristics and identify problems such as the missing values, outliers, and errors inherent in the data.

**Figure 2.** Selected target sea area.

#### 2.2.1. Time Stamp

The transmit rate of AIS data varies depending on the ship's status and the type of AIS transponder [20]. Therefore, the time interval of the collected data, as shown in Figure 3a, was inconsistent. Moreover, time intervals longer than the AIS transmission cycle were observed.

#### 2.2.2. COG (Course over the Ground)

In Figure 3b, the COG distribution was primarily concentrated at 010 and 190 degrees. This indicates that the ship traffic in this sea area mainly flowed in the north–south direction.

#### 2.2.3. Heading

As shown in Figure 3c, a large number of 511-degree readings were observed in the heading. Since the heading value 511 indicates that the value is not available, these readings needed to be replaced with appropriate values in the preprocessing stage so that the relative bearing between ships could be confirmed [21].

**Figure 3.** Observations of the data examination: (**a**) inconsistent time-series, (**b**) COG distribution, (**c**) heading distribution.

#### *2.3. Data Preprocessing*

The own ship and target ship trajectories were extracted from the AIS data by applying criteria in the data preprocessing stage. In addition, the trajectory extraction process necessitated time-series cleaning and distance calculation. Figure 4 describes the trajectory extraction process.

**Figure 4.** Trajectory extraction process.

#### 2.3.1. Trajectory Extraction

• Own ship extraction

The MMSI number of the vessel and the corresponding gross tonnage were extracted from the static data because the subject ship of the research project was a vessel of gross tonnage between 100,000 and 150,000. Then, the data rows in the dynamic data with the same MMSI number were sorted using the extracted MMSI number. The own ship data extracted through the application of gross tonnage included ships that were not sailing. Therefore, only ships with a speed of 5 knots or greater were extracted.

• Target ship extraction

Data of target ships with the same time range as the own ships were extracted. Using the time-series range for each own vessel, data corresponding to the same time range were extracted from the dynamic data.

• Distance

The distance between the own ship and the target ship was calculated, and a row of dynamic data was extracted when that distance was less than 3 miles. The calculated distance was the Euclidean distance using longitude and latitude. Table 2 summarizes the applied criteria.
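The criteria above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the record fields (`mmsi`, `gt`, `sog`, `lat`, `lon`), the helper names, and the sample positions are hypothetical. The scaling used to turn latitude/longitude differences into nautical miles is also an assumption, since the article only states that a Euclidean distance over longitude and latitude was used.

```python
import math

GT_RANGE = (100_000, 150_000)   # gross tonnage criterion for own ships
MIN_SOG_KTS = 5.0               # own ship must be under way
MAX_RANGE_NM = 3.0              # "3 miles", assumed here to be nautical miles


def is_own_ship(gt, sog):
    """Own-ship criteria: gross tonnage in range and SOG of 5 knots or more."""
    return GT_RANGE[0] <= gt <= GT_RANGE[1] and sog >= MIN_SOG_KTS


def distance_nm(lat1, lon1, lat2, lon2):
    """Euclidean distance on a locally flat chart (assumed conversion).

    One degree of latitude is taken as 60 nautical miles; longitude
    differences are scaled by the cosine of the mean latitude.
    """
    mean_lat = math.radians((lat1 + lat2) / 2.0)
    d_north = (lat2 - lat1) * 60.0
    d_east = (lon2 - lon1) * 60.0 * math.cos(mean_lat)
    return math.hypot(d_north, d_east)


def targets_in_range(own, others):
    """Keep rows of other ships (same time step) within range of the own ship."""
    return [
        row for row in others
        if distance_nm(own["lat"], own["lon"], row["lat"], row["lon"]) < MAX_RANGE_NM
    ]


# Hypothetical positions at a single synchronized time step.
own = {"mmsi": 1, "gt": 120_000, "sog": 12.0, "lat": 36.00, "lon": 125.00}
others = [
    {"mmsi": 2, "lat": 36.02, "lon": 125.01},   # roughly 1.3 nm away
    {"mmsi": 3, "lat": 36.20, "lon": 125.00},   # 12 nm away
]
near = targets_in_range(own, others) if is_own_ship(own["gt"], own["sog"]) else []
```

In this sketch only the first target ship survives the distance criterion; the time-range criterion is represented implicitly by comparing rows that share a time step.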



**Table 2.** Criteria list.

| **Criteria** | **Object Ship** | **Description** |
|---|---|---|
| Gross tonnage | Own ship | 100,000–150,000 GT |
| SOG of own ship | Own ship | Over 5 kts |
| Time range | Target ship | Same time range with own ship |
| Distance | Target ship | Less than 3 miles with own ship |

#### 2.3.2. Data Cleaning

During the trajectory extraction stage, time-series synchronization and interpolation of variables were performed. The correlation between heading and COG was examined in order to replace the heading values of 511 (not available) identified during data examination. Since the own ship criterion is a ship moving at 5 knots or more, the correlation was checked for ships with an SOG of 5 knots or more. Since COG and heading are continuous values, the Pearson correlation coefficient was used. As a result, a high correlation of 0.9864 was confirmed with a zero *p*-value, and the unavailable heading values were replaced with COG. Figure 5 describes the correlation between COG and heading.

**Figure 5.** Correlation coefficient plot between COG and heading.
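The heading substitution described in this section can be sketched as follows. The sample rows and function names are hypothetical, and the plain Pearson formula is applied without any special handling of the 0/360-degree wrap-around, which is a simplification made only for this illustration.

```python
import math

HEADING_NOT_AVAILABLE = 511  # AIS value meaning "heading not available"


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fill_heading(rows):
    """Replace unavailable headings with the COG of the same row."""
    return [
        {**r, "heading": r["cog"] if r["heading"] == HEADING_NOT_AVAILABLE else r["heading"]}
        for r in rows
    ]


# Hypothetical rows for ships making 5 knots or more.
rows = [
    {"sog": 12.1, "cog": 10.2, "heading": 11},
    {"sog": 11.8, "cog": 189.5, "heading": 190},
    {"sog": 9.4, "cog": 95.0, "heading": 511},
    {"sog": 14.0, "cog": 270.3, "heading": 268},
]

# Correlation is checked only on rows with a usable heading.
valid = [r for r in rows if r["sog"] >= 5 and r["heading"] != HEADING_NOT_AVAILABLE]
r_value = pearson([r["cog"] for r in valid], [r["heading"] for r in valid])

cleaned = fill_heading(rows)
```

The substitution is only reasonable when the correlation is high, which is why the check precedes the replacement in the text.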

#### *2.4. Feature Extraction*

Because machine learning uses a numerical learning and decision-making basis, converting the AIS data into feature data representing the navigation situation was necessary to apply the navigation situation to the clustering algorithm [22]. In addition, because the clustering algorithm's performance varies depending on input features, the following approach was used to extract features that could effectively express the navigation situation.

#### 2.4.1. Periodic Feature

Three experts with more than five years of onboard experience and knowledge of data science determined the domain of the features to be the input data to the clustering algorithm. Consequently, the domain of the features was set to a periodic feature that could comprehensively explain the entire situation from the beginning to the end of the navigation situation.

#### 2.4.2. Relative Bearing Change

The relative bearing change was extracted as a feature to explain the general navigation situation [23]. As shown in Figure 6, changes in the relative bearing of a vessel were converted into changes in quadrant in a Cartesian coordinate system. By passing abeam on the starboard side of the own ship, ship (a) moved from the starboard bow to the starboard quarter; expressed as a change in quadrant, this is Q1-Q1-Q4.

**Figure 6.** Concept of feature extraction: (**a**) relative bearing change (Q1-Q1-Q4), (**b**) relative bearing change (Q1-Q2-Q3), (**c**) relative bearing change (Q4-Q3-Q2).

The feature was extracted by calculating the change of the quadrants of a target ship over the entire trajectory, and the extracted feature list comprised 24 quadrant changes, as shown in Table 3.
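The quadrant conversion can be illustrated with a short sketch. The quadrant numbering follows the example in the text (starboard bow is Q1, starboard quarter is Q4). The function names, the treatment of exact boundary values, and the choice of sampling the start, middle, and end of a trajectory to form the three-part label are assumptions made for illustration.

```python
def relative_bearing(true_bearing_deg, own_heading_deg):
    """Bearing of the target measured clockwise from the own ship's bow."""
    return (true_bearing_deg - own_heading_deg) % 360.0


def quadrant(rel_bearing_deg):
    """Map a relative bearing to a quadrant, with the bow along the +y axis.

    Q1 starboard bow, Q2 port bow, Q3 port quarter, Q4 starboard quarter.
    The handling of exact boundary values is an assumption.
    """
    b = rel_bearing_deg % 360.0
    if b < 90.0:
        return 1
    if b < 180.0:
        return 4
    if b < 270.0:
        return 3
    return 2


def quadrant_change(rel_bearings):
    """Label a trajectory by its quadrant at the start, middle, and end."""
    picks = (rel_bearings[0], rel_bearings[len(rel_bearings) // 2], rel_bearings[-1])
    return "-".join(f"Q{quadrant(b)}" for b in picks)


# Hypothetical relative bearings of a target passing down the starboard side.
passing_starboard = [20.0, 35.0, 60.0, 110.0, 150.0]
label = quadrant_change(passing_starboard)
```

For the hypothetical series above the label is `Q1-Q1-Q4`, the pattern shown for ship (a) in Figure 6.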


**Table 3.** Feature list.



#### *2.5. Hierarchical Clustering*

#### Agglomerative Clustering Algorithm

An agglomerative hierarchical clustering algorithm was used in this study. This algorithm's parameter tuning must consider several factors. In this study, the clustering algorithm considered the similarity (distance) measuring method, the proximity (linkage) measuring method, and the optimal number of clusters [24].

## **3. Result**

#### *3.1. Input Data*

A data set that counts situations corresponding to features in each ship's trajectory was used as the clustering input data. Each row of input data was a row vector describing one trajectory as a combination of the number of ships corresponding to the features.

The illustrations in Figure 7 conceptualize the process from the navigation situation to input data. Trajectory (a) is composed of one starboard passing vessel, one crossing vessel, one overtaking vessel, and one vessel moving away ahead, and the corresponding features are 1, 3, 11, and 15. This combination of navigation situations is described as row vector (a). A zero in row vector (a) means that there was no vessel corresponding to the feature, and a one means that there was one vessel in the navigation situation corresponding to that feature. Trajectory (b) has two vessels crossing from the starboard side, one vessel crossing from the port side, and one overtaking vessel, and the corresponding features are 3, 7, and 11. This combination of navigation situations is described as row vector (b).
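As a small sketch of this conversion, the snippet below builds the two row vectors from lists of feature numbers. The helper name is hypothetical, and assigning the two starboard-crossing vessels of trajectory (b) to feature 3 and the port-crossing vessel to feature 7 is an inference from the description of trajectory (a), not something the text states explicitly.

```python
from collections import Counter

N_FEATURES = 24  # number of quadrant-change features listed in Table 3


def to_row_vector(feature_ids):
    """Count how many target ships fall under each feature (1-based ids)."""
    counts = Counter(feature_ids)
    return [counts.get(i, 0) for i in range(1, N_FEATURES + 1)]


# Trajectory (a): one target ship each for features 1, 3, 11, and 15.
row_a = to_row_vector([1, 3, 11, 15])

# Trajectory (b): two starboard-crossing ships (assumed feature 3),
# one port-crossing ship (assumed feature 7), one overtaking ship (feature 11).
row_b = to_row_vector([3, 3, 7, 11])
```

Each vector has 24 entries, and the entries sum to the number of target ships in the trajectory.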

**Figure 7.** Example of input data: (**a**) a combined navigation situation consisting of navigation situations corresponding to features 1, 3, 11, and 15, and an example of conversion to a row vector, (**b**) a combined navigation situation consisting of navigation situations corresponding to features 3, 7, and 11, and an example of conversion to a row vector.

The input data was an array comprising 1998 row vectors because the feature extraction was applied to 1998 trajectories.

#### *3.2. Clustering Algorithm Parameter Tuning*

#### 3.2.1. Distance Measurement

The Hamming distance was used as the distance measurement and the average linkage as the linkage method in the clustering algorithm. The Hamming distance, which is appropriate for distinguishing differences among combinations of vectors, was selected because the input data were a set of row vectors that count, in each column, the situations corresponding to the features in a trajectory [25]. Figure 8 shows the row vectors of the navigation situations given as examples above. When the Hamming distance is applied to these two row vectors, the elements corresponding to four features differ, so the Hamming distance is 4.
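The distance of 4 in this example can be reproduced with a short sketch. The helper names are hypothetical, and placing the count of two vessels at feature 3 in row vector (b) is an assumption drawn from the trajectory description. The distance here is the raw number of differing positions; some libraries, such as SciPy, instead report it as a fraction of the vector length.

```python
def hamming_distance(u, v):
    """Number of positions at which two equally long vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(1 for a, b in zip(u, v) if a != b)


def row(counts, n_features=24):
    """Build a count vector from a {feature id: number of ships} mapping."""
    return [counts.get(i, 0) for i in range(1, n_features + 1)]


row_a = row({1: 1, 3: 1, 11: 1, 15: 1})
row_b = row({3: 2, 7: 1, 11: 1})   # placement of the count 2 is an assumption

# Features 1, 3, 7, and 15 differ; feature 11 is identical in both vectors.
d_ab = hamming_distance(row_a, row_b)
```

Because only equality is tested, a column holding 1 in one vector and 2 in the other contributes exactly 1 to the distance, regardless of how large the difference in vessel count is.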

**Figure 8.** Hamming distance for calculation of similarity.

#### 3.2.2. Linkage Method

For the linkage method, the average linkage, which had the highest value, was selected by comparing the cophenetic correlation coefficient for each method, as shown in Table 4 [26].
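A comparison of this kind could be run with SciPy as sketched below. The random array is a hypothetical stand-in for the real input data, so the coefficients it produces are not those of Table 4; the use of SciPy itself is also an assumption, as the article does not name the software used.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

# Hypothetical stand-in for the 1998 x 24 array of count vectors.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 24))

# Condensed pairwise Hamming distances (fraction of differing positions).
d = pdist(X, metric="hamming")

coph = {}
for method in ("average", "single", "complete", "weighted"):
    Z = linkage(d, method=method)
    c, _ = cophenet(Z, d)   # correlation between tree heights and distances
    coph[method] = float(c)

# Linkage whose dendrogram best preserves the original pairwise distances.
best_method = max(coph, key=coph.get)
```

A coefficient closer to 1 means the dendrogram distorts the original pairwise distances less, which is the rationale for choosing the linkage with the highest value.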

**Table 4.** Cophenetic correlation coefficient.

| **Linkage** | **Cophenetic Correlation Coefficient** |
|---|---|
| Average | 0.8949 |
| Single | 0.8060 |
| Complete | 0.7273 |
| Weighted | 0.6703 |

#### 3.2.3. Number of Clusters

Figure 9 shows the silhouette values used to find the appropriate number of clusters. The higher the silhouette value, the better the distinction between clusters [27]. When the number of clusters was 347, the silhouette value was 1.0, so the appropriate number of clusters was 347.

**Figure 9.** Silhouette value corresponding to the number of clusters.
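The search for the number of clusters can be sketched on a toy data set. Everything below is hypothetical: five distinct count vectors, each repeated several times, stand in for the real trajectories, and SciPy and scikit-learn are assumed tools. The sketch illustrates why the silhouette value can reach 1.0: once every cluster contains only identical vectors, the within-cluster distance is zero. Note that scikit-learn scores a point that is alone in its cluster as 0, so absolute values may differ from those of other implementations.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score


def row(feature_ids, n_features=24):
    """Count vector for a list of 1-based feature ids."""
    v = np.zeros(n_features, dtype=int)
    for i in feature_ids:
        v[i - 1] += 1
    return v


# Hypothetical data: five distinct situations, each observed several times.
patterns = [[1], [1, 3], [3, 7, 11], [11, 15, 19, 23], [2, 4, 6, 8, 10, 12]]
repeats = [8, 6, 4, 3, 2]
X = np.array([row(p) for p, n in zip(patterns, repeats) for _ in range(n)])

d = pdist(X, metric="hamming")
Z = linkage(d, method="average")
D = squareform(d)   # square matrix for the precomputed-metric silhouette

scores = {}
for k in range(2, 6):
    labels = fcluster(Z, t=k, criterion="maxclust")
    if len(set(labels)) > 1:
        scores[k] = silhouette_score(D, labels, metric="precomputed")

# Number of clusters with the highest mean silhouette value.
best_k = max(scores, key=scores.get)
```

In this toy case the best value is obtained at five clusters, one per distinct vector, which mirrors the situation reported above where the silhouette value reaches 1.0.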

#### *3.3. Clustering Result*

#### 3.3.1. Similarity of Clusters

The dendrogram shown in Figure 10 can be used to determine the similarity of each leaf node that constituted the navigation situation. The lowest level of the dendrogram is called a leaf node or cluster; the cluster's vertical line is called a node. Each node is connected to one horizontal line, called a clade, and the height of this clade explains the difference between connected clusters [28]. The difference between clusters can be explained by the distance measurement used in the clustering algorithm, so the dendrogram in Figure 10 can be explained by the Hamming distance of the input row data. In this dendrogram, the height of the leaf node increases as it goes to the right, and the height decreases as it goes to the left. Therefore, navigation situations with high similarity were clustered on the left side, and navigation situations with low similarity were clustered on the right.

**Figure 10.** Dendrogram of hierarchical clustering.

of cluster 7, (**b**) trajectory example of cluster 9.

#### 3.3.2. Frequency of Clusters 3.3.2. Frequency of Clusters

49.1% of the input data points.

vessels is different.

Figure 10 shows how many navigation situations are clustered at each leaf node. The height represents the frequency and the x-axis corresponds to the order of a dendrogram. Figure 10 shows how many navigation situations are clustered at each leaf node. The height represents the frequency and the x-axis corresponds to the order of a dendrogram.


The two most frequent clusters show frequencies of 522 and 459, and they accounted for 49.1% of the input data points.


However, clusters connected to the same clade as a high-frequency cluster can themselves have a low frequency. For example, in Figure 11, the 9th cluster (b) is connected to the same clade as the 7th cluster (a), which occurred 459 times, yet its frequency is only 3. This is because the Hamming distance counts a difference of 1 when the number of vessels differs, even though the combination of vessel relationships is the same. Although the number of vessels is large in Figure 12, the situations were classified as similar to the 7th cluster because all vessels were passing on the starboard side; the frequency differs because the number of vessels is different.

**Figure 11.** Example of different frequencies in similar navigation situations: (**a**) trajectory example of cluster 7, (**b**) trajectory example of cluster 9.
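The effect described above can be sketched by counting identical rows as one leaf node. The two-field encoding below (relationship, number of target ships) and the observations are hypothetical, not the study's data. Rows that share the same relationship but differ in vessel count fall into separate leaves, each with its own frequency.

```python
from collections import Counter

# Hypothetical rows: (relationship of target ships, number of target ships).
observed = [
    ("pass starboard", 1),
    ("pass starboard", 1),
    ("pass starboard", 1),
    ("pass starboard", 4),
]

# Identical rows collapse into one leaf; its count is the leaf frequency.
leaf_frequency = Counter(observed)

for leaf, freq in leaf_frequency.most_common():
    print(leaf, freq)
```

The two leaves differ in a single element, so they would sit close together in a dendrogram even though one is common and the other is rare.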


**Figure 12.** Frequency of clusters.


#### 3.3.3. Ordinary and Extraordinary Situation


The red lines in Figure 13 mark the leaf nodes of the dendrogram where 1:1 situations were clustered. In general, clusters connected to the same clade as a 1:1 navigation situation are highly similar to it; these clusters represent navigation situations whose Hamming distance from the corresponding 1:1 navigation situation was less than 2.3, as shown in Figure 13.

**Figure 13.** Clustering of 1:1 vessels and similar navigation situations: (**a**) a cluster group with a Hamming distance of less than 2.3 from a 1:1 situation, (**b**) a cluster group with a Hamming distance larger than 2.3 from a 1:1 situation.

Figure 14 shows the conversion of the Hamming distance to the number of different input elements. This figure shows that the clusters can be divided into two large groups based on the dissimilarity value of 2.3. Therefore, these clusters can be distinguished as group (a) and group (b). Group (a) comprises ordinary navigation situations in which the dissimilarity relative to the 1:1 navigation situation was low, and its proportion in the total was 95.2%. Group (b) comprises extraordinary navigation situations that differ from other navigation situations, and its proportion in the total was 4.8%.

**Figure 14.** Hamming distance (number of different input elements).
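The split into groups (a) and (b) amounts to thresholding each cluster's dissimilarity from the 1:1 situation and weighting by frequency. The sketch below assumes each cluster is summarized as a (distance, frequency) pair; the function name `split_by_threshold` and the numbers are made up for illustration, and only the threshold of 2.3 comes from the text.

```python
THRESHOLD = 2.3  # dissimilarity cut reported in the text


def split_by_threshold(clusters, threshold=THRESHOLD):
    """Return the shares (%) of ordinary and extraordinary situations."""
    total = sum(freq for _, freq in clusters)
    ordinary = sum(freq for dist, freq in clusters if dist < threshold)
    ordinary_pct = 100.0 * ordinary / total
    return ordinary_pct, 100.0 - ordinary_pct


# (distance from the 1:1 situation, frequency), made-up numbers.
clusters = [(0.0, 60), (1.0, 25), (2.0, 10), (3.0, 4), (4.0, 1)]
ordinary_pct, extraordinary_pct = split_by_threshold(clusters)
print(ordinary_pct, extraordinary_pct)
```

Weighting by frequency matters here: a group can contain many distinct leaf nodes and still account for only a small share of the observed situations.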

#### 3.3.4. Top 20 Frequent Clusters

Table 5 summarizes the 20 most frequent navigation situations. The 27th cluster has 522 distinguished trajectories, accounting for 26.1% of the total, and the second most frequent cluster was the 7th cluster, accounting for 23%. The top 20 clusters in Table 5 account for 75.7% of the total navigation situation.

**Table 5.** Top 20 frequent clusters.

| Cluster Number | Navigation Situation Description | Frequency (%) | Accumulated Rate (%) |
|---|---|---|---|
| 27th | Pass port 1:1 | 522 (26.1) | 26.1 |
| 7th | Pass starboard 1:1 | 459 (23.0) | 49.1 |
| 1st | Pass port and starboard 1:2 (1 and 1) | 88 (4.4) | 53.5 |
| 8th | Pass port 1:2 | 59 (3.0) | 56.5 |
| 35th | Pass starboard 1:2 | 57 (2.9) | 59.3 |
| 94th | Away from starboard bow 1:1 | 38 (1.9) | 61.2 |
| 120th | Away from starboard quarter 1:1 | 37 (1.9) | 63.1 |
| 197th | Away from port bow 1:1 | 37 (1.9) | 64.9 |
| 138th | Away from port quarter 1:1 | 36 (1.8) | 66.7 |
| 5th | Pass port and starboard 1:3 (2 and 1) | 31 (1.6) | 68.3 |
| 41st | Pass port and starboard 1:3 (1 and 2) | 25 (1.3) | 69.5 |
| 155th | Crossing bow (starboard bow to port quarter) 1:1 | 21 (1.1) | 70.6 |
| 15th | Pass port 1:3 | 16 (0.8) | 71.4 |
| 54th | Pass starboard 1:3 | 15 (0.8) | 72.1 |
| 185th | Crossing bow (port bow to starboard quarter) 1:1 | 15 (0.8) | 72.9 |
| 50th | Pass port and starboard 1:4 (3 and 1) | 12 (0.6) | 73.5 |
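The accumulated rate column in Table 5 is a running share of the total. The sketch below shows that arithmetic with made-up frequencies and a made-up total, since the total number of navigation situations is not restated here; `accumulated_rates` is a hypothetical helper name.

```python
def accumulated_rates(frequencies, total):
    """Cumulative share of the total, in percent, after each cluster."""
    rates = []
    running = 0
    for freq in frequencies:
        running += freq
        rates.append(round(100.0 * running / total, 1))
    return rates


# Made-up frequencies sorted from most to least frequent, and a made-up total.
example_frequencies = [50, 30, 20]
example_total = 200
print(accumulated_rates(example_frequencies, example_total))
```

Sorting the clusters by descending frequency before accumulating gives the smallest set of situations needed to reach a given share of the traffic.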

#### **4. Discussion**

A total of 347 leaf nodes were obtained as a result of hierarchical clustering. The navigation situations could then be divided into ordinary and extraordinary navigation situations using a Hamming distance (dissimilarity) threshold of 2.3.

An ordinary situation is a case in which the Hamming distance is less than 2.3 from the 1:1 navigation situation between the own ship and a target ship, implying that there are three or fewer ship relationships in the navigation situation. Across all navigation situations, ordinary navigation situations occurred 95.2% of the time, whereas extraordinary navigation situations occurred 4.8% of the time.

Furthermore, the most common navigation situations were identified. The most frequent was a target ship passing from the bow toward the stern of the own ship without crossing ahead of or astern of it; 11 of the top 20 clusters were of this type, and together they accounted for 64.5% of the total. The next most common situation, at 7.5%, was a ship that did not close on the own ship in the bow or stern direction. The third most common, at 3.5%, was a ship that approached from port or starboard and crossed the bow or stern of the own ship.

However, there are still uncertainties regarding the application of this method. The encounters with other ships that can occur during a voyage were objectively clustered, but factors such as the passing distance to the encountered ships and the elapsed time until the situation ended were not considered in the suggested method.

#### **5. Conclusions**

The collision avoidance system is one of the core systems of MASS. Despite this importance, the scenarios used to test it have been designed subjectively and do not reflect the characteristics of the human-operated ships that a MASS will actually encounter, so objective scenarios need to be developed. Therefore, in this study, the navigation situations of human-operated ships were distinguished to support the systematic validation of MASS. The navigation situations of human-operated ships were objectively analyzed using AIS data and classified through a data-driven approach consisting of data preprocessing, feature extraction, and hierarchical clustering.

By comparing cluster similarity and frequency, the navigation situations that a ship can encounter during navigation were divided into ordinary and extraordinary navigation situations. Ordinary navigation situations account for 95.2% of total navigation situations, and extraordinary navigation situations occur at a rate of 4.8%. The top 20 situations, which account for 75% of the total navigation situations, could be classified along with their occurrence rates.

This research proposed a data-driven clustering model for human-operated ship navigation situations. As a result, the actual navigation conditions of human-operated ships were objectively distinguished. The ratio of ordinary to extraordinary situations, the two major categories of the proposed classification, and the ratios of the detailed navigation situations within them seem to provide a more objective basis for designing test scenarios for collision avoidance algorithms.

However, the sea area analyzed is a region where traffic is concentrated in the north-south direction, so the navigation situations comprised an overwhelming number of passing situations. In addition, the extracted features could only describe the relationships between ships using relative bearings, excluding other navigational aspects, and the movement of the own ship was not considered. These limitations will be addressed in future work.

**Author Contributions:** Conceptualization, T.H. and I.-H.Y.; methodology, T.H.; software, T.H.; formal analysis, T.H.; data curation, I.-H.Y.; writing—original draft preparation, T.H.; writing—review and editing, I.-H.Y.; visualization, T.H.; supervision, I.-H.Y.; project administration, I.-H.Y.; funding acquisition, I.-H.Y. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by the 'Development of Autonomous Ship Technology (20200615)' funded by the Ministry of Oceans and Fisheries (MOF, Korea).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data used to support the findings of this study are available from the corresponding author upon request.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

