*2.3. Random Forest Classifiers*

The RF algorithm was first described in detail by Breiman [40]. RF is an ensemble learning algorithm based on the aggregation of a large number of independent decision trees. When used for classification, each tree casts a class vote, and the final classification is determined by majority vote [5], resulting in enhanced classification accuracy and reduced overfitting. Each tree in the RF is grown using random feature selection on a new training set drawn with replacement from the original training set. This resampling method is known as bootstrap aggregation or *bagging* [40,41].

In RFs, bagging is combined with a randomised selection from the *p* input features when splitting an internal node. At each node, a random subset of *k* features is selected, from which the best split is determined [42]. For classification, the default value of *k* is typically the square root of *p*. At each split, the total reduction in the split criterion, usually measured by the Gini index [43], can be used as an importance measure for the corresponding splitting feature. The feature importance is obtained by accumulating this importance measure over all trees separately for each feature [5]. The size of an individual tree is typically controlled by predefined parameters, such as the terminal node size and the tree depth. Because each training set is drawn with replacement, for every tree in the RF ensemble there exists a set of observations that is not used for growing that tree. These so-called out-of-bag (OOB) observations can be used to estimate the prediction accuracy of the individual decision trees [43].

Generally speaking, the larger the number of trees, the better the prediction accuracy becomes. Beyond a critical number of trees, however, there is no significant performance gain from adding more, while the computational demand keeps increasing. Values reported in the literature include 128 [44], 200 [5], and 250 [45] trees.
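The training procedure described above can be sketched with scikit-learn's `RandomForestClassifier`, which exposes the *k* = √*p* feature subsampling, the Gini-based feature importances, and the OOB accuracy estimate directly; the synthetic dataset and all parameter values below are illustrative, not taken from the present study.

```python
# Sketch: growing an RF ensemble and inspecting OOB accuracy and Gini-based
# feature importances (synthetic data; all parameter choices illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# max_features="sqrt" draws k = sqrt(p) candidate features at each split;
# oob_score=True estimates prediction accuracy from the out-of-bag samples.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            oob_score=True, random_state=0)
rf.fit(X, y)

print(round(rf.oob_score_, 3))       # OOB accuracy estimate
print(len(rf.feature_importances_))  # one accumulated importance per feature
```

The OOB score provides a built-in generalisation estimate without a separate validation split, which is why it is often reported alongside the tree count.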

In order to assess the prediction quality of the trained RF algorithm, a series of classification metrics is used [46,47].

The most straightforward metric is the *accuracy* (*qa*), which is defined as the ratio between the number of correct predictions (*NT*) and the total number of samples (*N*), i.e.,

$$q\_a = \frac{N\_T}{N}.\tag{2}$$

If a sample that has been labelled as positive is also predicted as positive, the classification is counted as a *True Positive* (*NTP*). If it is predicted as negative, the classification is a *False Negative* (*NFN*). *True Negatives* (*NTN*) and *False Positives* (*NFP*) are defined analogously. These four numbers can be displayed as a 2 × 2 *confusion matrix C*. In the present study, we follow scikit-learn's implementation [48]; other sources may use the transposed version, e.g., [46].

$$\mathbf{C} = \begin{pmatrix} N\_{TP} & N\_{FN} \\ N\_{FP} & N\_{TN} \end{pmatrix} . \tag{3}$$

Using the above four definitions, the number of correct predictions is given by

$$N\_T = N\_{TP} + N\_{TN}.\tag{4}$$
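As a sketch, the quantities of Eqs. (2)–(4) can be reproduced with scikit-learn's `confusion_matrix`; the label vectors below are invented for illustration. Note that passing `labels=[1, 0]` places the positive class in the first row and column, matching the layout of Eq. (3), whereas scikit-learn's default ascending label order would put *NTN* in the top-left entry.

```python
# Sketch: confusion matrix (Eq. 3), correct predictions (Eq. 4) and
# accuracy (Eq. 2) for hypothetical binary labels.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # invented ground-truth labels
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]  # invented predictions

# labels=[1, 0] reproduces the (N_TP, N_FN; N_FP, N_TN) layout of Eq. (3).
C = confusion_matrix(y_true, y_pred, labels=[1, 0])
n_tp, n_fn = C[0]
n_fp, n_tn = C[1]

n_t = n_tp + n_tn        # Eq. (4)
q_a = n_t / len(y_true)  # Eq. (2)
print(n_tp, n_fn, n_fp, n_tn, q_a)  # 3 1 1 3 0.75
```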

The *precision* (*qp*), or *confidence*, is defined as the fraction of all positively predicted samples (*NPP*) that are actually labelled as positive (*NTP*), i.e.,

$$q\_p = \frac{N\_{TP}}{N\_{PP}} = \frac{N\_{TP}}{N\_{TP} + N\_{FP}}.\tag{5}$$

Conversely, the *recall* (*qr*), or *sensitivity*, gives the fraction of all positively labelled samples (*NPL*) that are correctly identified as positive, i.e.,

$$q\_r = \frac{N\_{TP}}{N\_{PL}} = \frac{N\_{TP}}{N\_{TP} + N\_{FN}}.\tag{6}$$
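A minimal numerical sketch of Eqs. (5) and (6), using hypothetical counts for the four confusion-matrix entries:

```python
# Sketch: precision (Eq. 5) and recall (Eq. 6) from the four counts of the
# binary confusion matrix; the values below are invented for illustration.
n_tp, n_fn, n_fp, n_tn = 3, 1, 1, 3

q_p = n_tp / (n_tp + n_fp)  # precision: correct among all predicted positives
q_r = n_tp / (n_tp + n_fn)  # recall: correct among all labelled positives
print(q_p, q_r)  # 0.75 0.75
```

Precision penalises false positives, recall penalises false negatives; reporting both exposes trade-offs that the accuracy of Eq. (2) alone can hide.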

In the case of multi-class classification, precision and recall values are calculated separately for each class, with 'positives' meaning samples belonging to the respective class. Each row in the confusion matrix represents a 'true' class, with the 'predicted' class labels as columns. In this case, the diagonal of the confusion matrix contains the number of correct predictions for each class, and false predictions are contained in the respective off-diagonal elements. Given a classification with *N* classes, the precision and recall can be calculated separately for each class (denoted by index *i*, *i* = 1 ... *N*) from the coefficients of the *N* × *N* confusion matrix as follows:

$$q\_p^{(i)} = \frac{\mathbf{C}\_{ii}}{\sum\_{j=1}^{N} \mathbf{C}\_{ji}} \tag{7}$$

and

$$q\_r^{(i)} = \frac{\mathbf{C}\_{ii}}{\sum\_{j=1}^{N} \mathbf{C}\_{ij}}.\tag{8}$$
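Eqs. (7) and (8) amount to dividing the diagonal of the confusion matrix by its column sums and row sums, respectively. A sketch with an invented 3 × 3 confusion matrix (rows are true classes, columns are predictions):

```python
# Sketch: per-class precision (Eq. 7) and recall (Eq. 8) from an N x N
# confusion matrix; the matrix entries below are invented for illustration.
import numpy as np

C = np.array([[5, 1, 0],
              [2, 6, 1],
              [0, 1, 4]])

q_p = np.diag(C) / C.sum(axis=0)  # column sums: all samples predicted as class i
q_r = np.diag(C) / C.sum(axis=1)  # row sums: all samples labelled as class i
print(np.round(q_p, 3))
print(np.round(q_r, 3))
```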
