#### *2.1. Random Forests*

RFs are supervised ML algorithms belonging to the decision tree family. They can be employed in both classification and regression problems, where they provide a piecewise approximation of the response function. Decision trees are based on binary decisions made according to the value of one predictor *xi* at each node. Therefore, the process is shaped as a tree: starting from the root, the algorithm moves decision by decision until it reaches a leaf, i.e., the response (Figure 2). A single decision tree is trained with a database describing the relationship between the predictors' values and a response [13]. RFs were introduced to improve the prediction accuracy of standard decision trees [14]. Through bootstrap aggregation, the problem is decomposed into a set of "weak" trees, each trained with a partition of the original database, instead of a single tree trained with the complete database. The response of the overall model is selected according to the vote of the multiple trees in classification problems and as the average of the individual responses in regression problems. In the present work, 30 weak learners were employed. Besides employing multiple learners, RFs also select a random subset of the predictors at each split in a single tree to decorrelate the trees in the ensemble [15]. These features make RFs more resilient to noise and missing data and better suited to high-dimensional problems than standard decision trees and other ensemble methods. Hence, this ML technique was considered a natural first choice for the studied problems, which involve a large number of predictors for higher *t*∗ and a progressive flooding process affected by several uncertainties [16,17].

**Figure 2.** Structure of a decision tree.
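As a concrete illustration of the ensemble described above, the setup can be sketched with scikit-learn. Only the choice of 30 weak learners comes from the text; the data, number of predictors, and remaining hyperparameters are illustrative assumptions, not those of the paper's database.

```python
# Sketch of an RF ensemble with 30 weak learners (scikit-learn used as an
# illustrative implementation; predictors and responses are synthetic).
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                        # 8 predictors per scenario
y_class = X[:, 0] + X[:, 1] > 0                      # binary response (classification)
y_regr = X[:, 0] ** 2 + 0.1 * rng.normal(size=500)   # continuous response (regression)

# Each of the 30 trees is trained on a bootstrap sample, and a random subset
# of predictors is considered at every split (the decorrelation noted above).
clf = RandomForestClassifier(n_estimators=30, max_features="sqrt", random_state=0)
reg = RandomForestRegressor(n_estimators=30, max_features=1.0 / 3, random_state=0)

clf.fit(X, y_class)   # overall response: majority vote of the trees
reg.fit(X, y_regr)    # overall response: average of the tree outputs
```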

#### *2.2. Accuracy Estimation*

The accuracy of trained learners can be estimated using a validation database independent of the training one. Considering a classification problem, the accuracy rate is usually defined as the capability of assigning a specific scenario from the validation database to the correct response class. Namely, given a time instant *t*∗, the accuracy of the related classifiers is defined as:

$$Acc(\%) = 100 \frac{N\_{\rm c}}{N} \tag{1}$$

where *Nc* is the number of correctly classified damage scenarios and *N* is the total number of scenarios included in the validation database.

Aiming to predict the outcomes of the damage scenarios, the so-called "ongoing" damage scenarios, i.e., the scenarios having *tf* > *t*∗, are the most interesting. Thus, an ongoing accuracy can also be defined as:

$$Acc^\*(\%) = 100 \frac{N\_c^\*}{N^\*} \tag{2}$$

where *Nc*∗ is the number of correctly classified ongoing damage scenarios and *N*∗ is the total number of ongoing scenarios included in the validation database.
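A minimal sketch of Equations (1) and (2): the labels, the *tf* values, and the threshold *t*∗ below are illustrative, not taken from the paper's database.

```python
# Overall accuracy, Eq. (1), and "ongoing" accuracy, Eq. (2).
import numpy as np

def accuracy(y_true, y_pred):
    """Eq. (1): percentage of correctly classified scenarios."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * np.mean(y_true == y_pred)

def ongoing_accuracy(y_true, y_pred, t_f, t_star):
    """Eq. (2): same ratio restricted to scenarios with t_f > t*."""
    mask = np.asarray(t_f) > t_star
    return accuracy(np.asarray(y_true)[mask], np.asarray(y_pred)[mask])

acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])                 # 3 of 4 correct
acc_star = ongoing_accuracy([1, 0, 1, 1], [1, 0, 0, 1],
                            t_f=[5, 20, 30, 40], t_star=10)  # 2 of 3 ongoing correct
```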

Regarding the regression problems, the accuracy can be checked by means of a proper statistical indicator. Here, the coefficient of determination *R*<sup>2</sup> was used:

$$R^2 = 1 - \frac{SSE}{SS\_{tot}}\tag{3a}$$

$$SSE = \sum\_{i=1}^{N} \left( y\_i - y\_i^\* \right)^2 \tag{3b}$$

$$SS\_{tot} = \sum\_{i=1}^{N} \left( y\_i - \overline{y} \right)^2 \tag{3c}$$

where *yi* are the known responses, *y* their mean value, and *yi*∗ the responses predicted by the model. Once again, an ongoing coefficient of determination *R*2∗ was also defined based only on the *N*∗ ongoing damage scenarios.
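Equations (3a)–(3c) can be sketched directly; the example values below are purely illustrative.

```python
# Coefficient of determination, Eq. (3a)-(3c).
import numpy as np

def r_squared(y_true, y_pred):
    """Eq. (3a): R^2 = 1 - SSE / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    sse = np.sum((y_true - y_pred) ** 2)             # Eq. (3b)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # Eq. (3c)
    return 1.0 - sse / ss_tot

# R^2* is the same quantity evaluated only on the ongoing scenarios (t_f > t*).
```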

#### **3. Database-Generation Methods**

As mentioned, to assess the damage consequences by applying ML, a training dataset is needed, composed of progressive flooding simulations. The simulations are driven by a damage case generation algorithm. In the present section, several options are proposed to generate the training database according to a different characterization of the damages. All the progressive flooding simulations were carried out using a quasi-static technique based on the solution of a linearized differential-algebraic equation system [18,19]. The method represents a good compromise between accuracy and computational effort [20]; hence, it was considered adequate for the generation of large databases of progressive flooding simulations.

Usually, the damage is modeled as a parallelepiped box intersecting the hull [21–23]. With such an assumption, the surface of the hull shell enclosed in the damage box is removed to define the damage. Considering a collision case, the box-shaped damage always crosses the waterline and can be completely defined by five parameters (Figure 3):

- the damage length *Ld*;
- the longitudinal position of the damage center *Xd*;
- the damage penetration *Bd*;
- the lower and upper vertical limits of the damage, *Zmin* and *Zmax*.
In the present work, the damage penetration was neglected since all the internal structures were considered intact.
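For illustration, the five box-damage parameters can be collected in a simple container; the attribute names mirror the symbols used in the text and are an assumption, not the paper's actual data structure.

```python
# Hypothetical container for the box-shaped damage parameters of Section 3.
from dataclasses import dataclass

@dataclass
class BoxDamage:
    L_d: float    # damage length
    X_d: float    # longitudinal position of the damage center
    B_d: float    # damage penetration (neglected in the present work)
    Z_min: float  # lower vertical limit of the damage
    Z_max: float  # upper vertical limit of the damage

    def crosses_waterline(self, T: float) -> bool:
        # A SOLAS collision damage must cross the waterline at draught T.
        return self.Z_min <= T <= self.Z_max
```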

**Figure 3.** Box-shaped damage parameters.

Here, two families of methods for the database generation of side damage cases in calm water were tested, one based on MC generation and the other based on a parametric generation aimed to cover all the possible damage scenarios involving multiple neighboring rooms. Applying MC sampling, the damage cases can be generated following the probability distribution of their parameters [24]. Here, three different options were explored. The first was based on the probability distributions embedded in SOLAS and the other two on two types of uniform distributions.

#### *3.1. Monte Carlo with SOLAS Probability Distributions*

The SOLAS probabilistic rule framework for ship damage stability is based on the statistical analysis of a database of side collision accidents [25]. In SOLAS, the probability distributions are used to define a so-called zonal approach, so they are not explicitly stated. However, recent studies explored the so-called nonzonal approach, which directly applies the probability distributions to the damage parameters [26]. With this approach, the following damage parameter probability distributions can be taken from SOLAS: *Ld*, *Xd*, and *Bd*, which lead to the definition of the p-factor, and *Zmax*, which is considered in the v-factor. The *Zmin* distribution is not defined, since SOLAS adopts a worst-case approach in the s-factor determination to account for horizontal subdivision below the waterline. However, the *Zmin* probability distribution can be taken from the statistical analysis of collision damage data available in the literature [22]. The adopted probability distributions for the SOLAS database generation are defined as follows:

Damage length was modeled with a bilinear probability density function, leading to the following cumulative distribution:

$$cdf(L\_d) = \begin{cases} 0 & \text{if} \quad J \le 0\\ \frac{b\_{11}}{2}J^2 + b\_{12}J & \text{if} \quad 0 < J \le J\_k\\ \frac{b\_{11} - b\_{21}}{2}J\_k^2 + (b\_{12} - b\_{22})J\_k + \frac{b\_{21}}{2}J^2 + b\_{22}J & \text{if} \quad J\_k < J \le J\_m\\ 1 & \text{if} \quad J > J\_m \end{cases} \tag{4}$$

where *J* = *Ld*/*LS*, and all the other parameters are defined as in SOLAS Ch.II-1 Part B-1 Regulation 7-1 [27].
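Since the SOLAS coefficients of Regulation 7-1 are not reproduced here, a draw from a distribution such as Eq. (4) can be sketched generically by numerical inverse-transform sampling: tabulate the cdf, then invert it by interpolation. The `cdf` passed below is a placeholder, not the actual SOLAS function.

```python
# Generic inverse-transform sampling for a monotone scalar cdf on [lo, hi],
# as needed to draw damage lengths from a cdf like Eq. (4).
import numpy as np

def sample_from_cdf(cdf, lo, hi, n, rng, grid=4096):
    """Draw n samples from a monotone cdf defined on [lo, hi]."""
    x = np.linspace(lo, hi, grid)
    F = np.clip(np.vectorize(cdf)(x), 0.0, 1.0)   # tabulated cdf values
    u = rng.uniform(size=n)
    return np.interp(u, F, x)                     # invert cdf by interpolation

# Placeholder cdf (uniform on [0, 1]) standing in for Eq. (4).
rng = np.random.default_rng(0)
samples = sample_from_cdf(lambda j: min(j, 1.0), 0.0, 1.0, 1000, rng)
```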

The longitudinal position of the damage center is uniformly distributed along the ship subdivision length *LS*:

$$cdf(X\_d) = \begin{cases} 0 & \text{if} \quad X\_d \le 0\\ \frac{X\_d}{L\_S} & \text{if} \quad 0 < X\_d < L\_S\\ 1 & \text{if} \quad X\_d \ge L\_S \end{cases} \tag{5}$$

The upper vertical limit *Zmax* was modeled with a bilinear cumulative distribution function:

$$cdf(Z\_{\max}) = \begin{cases} 0 & \text{if } \quad Z\_{\max} - T \le 0 \text{ m} \\ 0.8 \frac{Z\_{\max} - T}{7.8} & \text{if } \quad 0 \text{ m} \le Z\_{\max} - T \le 7.8 \text{ m} \\ 0.8 + 0.2 \frac{Z\_{\max} - T - 7.8}{4.7} & \text{if } \quad 7.8 \text{ m} < Z\_{\max} - T \le 12.5 \text{ m} \\ 1 & \text{if } \quad Z\_{\max} - T > 12.5 \text{ m} \end{cases} \tag{6}$$

The lower vertical limit *Zmin* was modeled with a linear probability density function, leading to the following cumulative distribution:

$$cdf(Z\_{\min}) = \begin{cases} 0 & \text{if} \quad Z\_{\min} \le 0\\ 1.4\frac{Z\_{\min}}{T} - 0.4\left(\frac{Z\_{\min}}{T}\right)^2 & \text{if} \quad 0 < Z\_{\min} < T\\ 1 & \text{if} \quad Z\_{\min} \ge T \end{cases} \tag{7}$$
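Both Eq. (6) and Eq. (7) can be inverted analytically, so the corresponding Monte Carlo draws admit closed-form inverse-transform sampling. This is a sketch under that observation, not the paper's actual sampler; the draught value is illustrative.

```python
# Analytic inverse-transform sampling for Eq. (6) and Eq. (7).
import numpy as np

def sample_z_max(u, T):
    """Invert the bilinear cdf of Eq. (6); u is uniform on [0, 1]."""
    u = np.asarray(u, float)
    return np.where(u <= 0.8,
                    T + 7.8 * u / 0.8,                # first linear branch
                    T + 7.8 + 4.7 * (u - 0.8) / 0.2)  # second linear branch

def sample_z_min(u, T):
    """Invert Eq. (7): u = 1.4 z - 0.4 z^2 with z = Z_min / T."""
    u = np.asarray(u, float)
    z = (1.4 - np.sqrt(1.96 - 1.6 * u)) / 0.8   # root of the quadratic in [0, 1]
    return T * z

rng = np.random.default_rng(0)
u = rng.uniform(size=5)
z_max = sample_z_max(u, T=6.0)   # values in [T, T + 12.5 m]
z_min = sample_z_min(u, T=6.0)   # values in [0, T]
```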

#### *3.2. Monte Carlo with a Uniform Distribution of the Damage Dimensions*

In this database-generation algorithm, the maximum damage dimensions were still taken from SOLAS. However, a uniform distribution was assumed for the damage length and height. The applied cumulative distribution functions are then defined as follows:

Damage length was assumed as uniformly distributed between zero and the maximum admissible nondimensional length according to SOLAS:

$$cdf(L\_d) = \begin{cases} 0 & \text{if} \quad L\_d \le 0\\ \frac{L\_d}{J\_m L\_S} & \text{if} \quad 0 < L\_d < J\_m L\_S\\ 1 & \text{if} \quad L\_d \ge J\_m L\_S \end{cases} \tag{8}$$

The longitudinal position of the damage center is already uniformly distributed in SOLAS. Hence, Equation (5) can still be applied.

The damage height *Hd* = *Zmax* − *Zmin* is uniformly distributed between zero and *T* + 12.5 m, i.e., the maximum value according to SOLAS:

$$cdf(H\_d) = \begin{cases} 0 & \text{if} \quad H\_d \le 0\\ \frac{H\_d}{T + 12.5 \text{ m}} & \text{if} \quad 0 < H\_d < T + 12.5 \text{ m}\\ 1 & \text{if} \quad H\_d \ge T + 12.5 \text{ m} \end{cases} \tag{9}$$

Once the damage height is defined, the vertical position of the damage center *Zd* is set so that the damage crosses the waterline, in compliance with SOLAS. Hence, *Zd* was assumed to be uniformly distributed in the interval:

$$\left[ \max\left(\frac{H\_d}{2},\ T - \frac{H\_d}{2}\right),\ \min\left(T + 12.5 \text{ m} - \frac{H\_d}{2},\ T + \frac{H\_d}{2}\right) \right] \tag{10}$$
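The draws of Section 3.2 can be sketched as follows: a uniform damage height per Eq. (9), then a vertical center uniform on the admissible interval, whose bounds keep the damage bottom above the keel, the top below *T* + 12.5 m, and the box across the waterline. The draught value is illustrative.

```python
# Sketch of the vertical damage draws in Section 3.2 (T = draught).
import numpy as np

def sample_damage_vertical(rng, T):
    H_d = rng.uniform(0.0, T + 12.5)              # Eq. (9): uniform height
    lo = max(H_d / 2, T - H_d / 2)                # bottom >= 0 and top >= T
    hi = min(T + 12.5 - H_d / 2, T + H_d / 2)     # top <= T + 12.5 and bottom <= T
    Z_d = rng.uniform(lo, hi)                     # uniform vertical center
    return H_d, Z_d - H_d / 2, Z_d + H_d / 2      # height, Z_min, Z_max

rng = np.random.default_rng(0)
H_d, Z_min, Z_max = sample_damage_vertical(rng, T=6.0)
```

Note that the interval is never empty: its bounds coincide only at the maximum admissible height *Hd* = *T* + 12.5 m.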

#### *3.3. Monte Carlo with a Uniform Distribution of the Damage Area Inverse*

In this damage-case-generation algorithm, a uniform distribution of the inverse of the damage area *Ad* = *Ld*(*Zmax* − *Zmin*) was applied. The main objective of this method is to generate a more uniform distribution of the *time-to-flood* than the SOLAS-based generation provides. The SOLAS probability distributions lead to many large-area damages and, consequently, to short *time-to-flood* values. Such damages are of little interest for decision support purposes, since the events evolve too fast to gain any advantage from a DSS response. Moreover, too few slowly developing damage scenarios might reduce the forecast accuracy of the learners due to the lack of training data.

The following cumulative distribution function was applied to draw the damage areas:

$$cdf\left(\frac{1}{A\_d}\right) = \begin{cases} 0 & \text{if} \quad \frac{1}{A\_d} \le \frac{1}{A\_{\max}}\\ \frac{\frac{1}{A\_d} - \frac{1}{A\_{\max}}}{\frac{1}{A\_{\min}} - \frac{1}{A\_{\max}}} & \text{if} \quad \frac{1}{A\_{\max}} < \frac{1}{A\_d} < \frac{1}{A\_{\min}}\\ 1 & \text{if} \quad \frac{1}{A\_d} \ge \frac{1}{A\_{\min}} \end{cases} \tag{11}$$

where *Amin* and *Amax* are the minimum and maximum damage areas, which can be defined separately for each ship. In a real application, the floodwater inflow due to very small damages can be controlled by the bilge system; hence, the minimum area can be assessed considering the bilge pumps' capacity. The maximum area can be defined as the maximum damage area according to SOLAS: *Amax* = (*T* + 12.5 m) *Jm LS*.
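Since 1/*Ad* is uniform between 1/*Amax* and 1/*Amin*, Eq. (11) admits a one-line inverse-transform draw. A sketch with placeholder area bounds (the actual *Amin* and *Amax* are ship-dependent):

```python
# Inverse-transform draw for Eq. (11): 1/A_d uniform on [1/A_max, 1/A_min].
import numpy as np

def sample_damage_area(rng, A_min, A_max, n=1):
    u = rng.uniform(size=n)
    inv_A = 1.0 / A_max + u * (1.0 / A_min - 1.0 / A_max)
    return 1.0 / inv_A   # areas in [A_min, A_max], skewed towards small values

rng = np.random.default_rng(0)
areas = sample_damage_area(rng, A_min=5.0, A_max=300.0, n=1000)
```

The resulting area distribution concentrates probability on small damages, which is the stated goal: a flatter *time-to-flood* distribution than the SOLAS-based generation.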

Given the damage area, the other parameters were defined. The longitudinal position of the damage center was assessed as in the SOLAS case, according to Equation (5). Two alternative procedures were then applied, each to half of the generated damage cases:


Damages having a height or length outside the ship boundaries were discarded and randomly generated again.
