*3.4. Classification Algorithms*

We selected five different ML algorithms: k-Nearest Neighbor (KNN), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF) and Gradient Boosting (GB). These are described in Sections 3.4.1–3.4.5 and implemented in Python using *Scikit-Learn*. We used the default parameter values of each classifier from version 0.23.2 of Scikit-Learn. It is a tool with a simple interface, built on scientific libraries such as *NumPy*, *SciPy*, and *matplotlib*. The library code is open source under the BSD license. Moreover, its documentation is very complete and includes many sample codes.

Pedregosa et al. [42] introduced *Scikit-Learn* and presented its features, comparing the efficiency of its algorithms to other similar libraries. The results show that it is often faster and has the advantage of supporting a large number of algorithms, which led to its wide adoption in the ML community. In particular, *Scikit-Learn* provides several classification and regression algorithms for supervised learning. Moreover, it implements model selection and evaluation functions that make it possible to perform cross-validations, hyper-parameter searches and comparisons with various metrics.
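
As an illustration, the following minimal sketch, assuming a synthetic dataset as a stand-in for the actual feature vectors, shows how the five classifiers can be instantiated with their default parameters and compared through the cross-validation utilities mentioned above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Synthetic data used here only as a placeholder for the extracted features.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

classifiers = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),
}

# Default parameters, 5-fold cross-validation, accuracy as the metric.
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```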

### 3.4.1. k-Nearest Neighbor (KNN)

KNN is a well-known algorithm with a very simple operating principle. Each data point is assigned, by majority vote, the class most represented among its k closest neighbors. This algorithm belongs to the lazy learning class because it defers the work as long as possible. During training, it simply stores and organizes the data. During a prediction, however, it browses the stored data to count the classes of the k nearest neighbors. Therefore, virtually all of the computation cost is incurred at prediction time [43].

This algorithm has two main parameters. The first one is the number of neighbors to consider. A large value provides more probabilistic information, but the locality of the estimation may be lost. Therefore, a compromise has to be made, and a value of 5 is typically used in the literature. The optimal number of neighbors depends strongly on the type of data. The second parameter is the metric used to compute the distance between two data points, which defines their closeness. The choice of this metric is complicated and the notion of distance depends on the data characteristics [43]. There are several distance formulae, but the most commonly used ones are Euclidean, Manhattan and Minkowski.
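
As an illustration, a minimal sketch of how these two hyper-parameters appear in the Scikit-Learn implementation; the values shown are the library defaults rather than tuned ones:

```python
from sklearn.neighbors import KNeighborsClassifier

# Scikit-Learn defaults: 5 neighbors and the Minkowski metric with p=2,
# which is equivalent to the Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)

# The Manhattan distance is obtained with metric="manhattan" (or p=1).
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric="manhattan")
```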

Throughout our experiments, we confirmed the following characteristics. The advantages of this algorithm are its simplicity, its efficacy and the ease of tuning it to find the best hyper-parameters. In addition, the larger the amount of training data, the better the performance, although it remains sensitive to noise. Data normalization can mitigate this problem. As a disadvantage, this algorithm suffers from the curse of dimensionality. Increasing the number of features tends to improve the results, but only up to a certain threshold; once it is reached, adding new features degrades the results, because irrelevant features negatively influence the distance calculation. Finally, KNN is expensive in memory and computation compared to other algorithms, as corroborated by the literature [27,30].
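A minimal sketch of the normalization step mentioned above, assuming a standard scaling of the features before KNN so that no single large-valued feature dominates the distance computation:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Scale each feature to zero mean and unit variance before the KNN step.
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
```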

### 3.4.2. Support Vector Machines (SVM)

SVM is also a well-known algorithm. It can be employed in supervised and unsupervised learning. It tries to find the hyperplane that maximizes the margin between classes. When a linear classification is not feasible, SVM can use a technique named the *kernel trick*, which maps the inputs into a higher-dimensional space [43]. It is an eager learning algorithm because it creates a classification model from the data during training. When a prediction is requested, it uses this model to determine the class.

SVM has several hyper-parameters affecting the classification results. The most relevant ones are:


SVM has the advantage of finding a unique, global solution, which follows from the fact that the optimization problem is convex [43]. Thanks to the *kernel trick*, it can produce good results even in a high-dimensional feature space. However, SVM requires greater processing power during training to find the best hyperplane, and also during predictions to evaluate the support vectors for each new data point, as corroborated by the literature [18,24,27,30].
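
As an illustrative sketch, assuming the standard Scikit-Learn SVC hyper-parameters (the regularization strength C, the kernel choice and the kernel coefficient gamma, shown here with their default values), the kernel trick simply corresponds to selecting a non-linear kernel:

```python
from sklearn.svm import SVC

# Scikit-Learn defaults: RBF kernel (a non-linear mapping, i.e. the kernel
# trick), regularization strength C=1.0 and gamma="scale".
svm_rbf = SVC(kernel="rbf", C=1.0, gamma="scale")

# A linear kernel keeps the separation in the original feature space.
svm_linear = SVC(kernel="linear", C=1.0)
```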

### 3.4.3. Decision Tree (DT)

Trees are well-known data structures used in many different problems. In ML, the objective is to build a DT based on the features of the data. Every node of the tree is split so as to best separate the data, until only leaves remain. It is therefore an eager learning algorithm, because it tries to build the best DT during the training phase [43].

Most of the hyper-parameters control when a node must be split and when the growth of the DT must stop. The most relevant ones are:


Advantages of DTs are their ease of understanding and interpretation for humans, as they can be visualized [43]. They also require little data preparation and have a low prediction cost, because their complexity is logarithmic. However, a tree can become very complex and fail to generalize the data, which produces overfitting. Likewise, an unbalanced dataset will create biased trees. Despite this shortcoming, DTs (J48 in particular) are commonly used in the literature [24].
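
As an illustrative sketch, a DT can be built and visualized as follows (a small, arbitrary maximum depth is used here instead of the default so that the plot stays readable):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Synthetic data used only as a placeholder for the real feature vectors.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Typical splitting/stopping controls: the maximum depth of the tree and the
# minimum number of samples required to split an internal node.
dt = DecisionTreeClassifier(max_depth=3, min_samples_split=2, random_state=0)
dt.fit(X, y)

# The fitted tree can be rendered directly, which is what makes DTs easy
# for humans to interpret.
plot_tree(dt, filled=True)
plt.show()
```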

### 3.4.4. Random Forest (RF)

RF is an improvement over DTs because it combines many of them, as its name *forest* suggests. Its principle is to create multiple trees and train each of them on a random subset of the data. During a prediction, every tree processes the data and the obtained results are then merged by a vote to determine the most likely class [43]. This method is called *bagging*. It alleviates the overfitting problem created by single DTs. RF belongs to the ensemble learning algorithms, whose concept is to combine several ML models to achieve better performance.

The available hyper-parameters are the same as for DTs, plus one that defines the number of trees in the forest. A value of 1 is equivalent to the DT algorithm. A higher value usually gives better results, but at a higher cost in computational power and memory, because each tree has to be stored.

One of the strongest advantages of RF is that it can automatically rank the most discriminative features. It also has the ability to provide confidence intervals, which indicate how certain the predicted class is for each data point. Its disadvantage is that the ease of interpretation of DTs is lost. This algorithm has also been used in the fall detection literature [24].
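
A minimal sketch, on synthetic data, of the feature ranking and class-probability outputs mentioned above (100 trees is the Scikit-Learn default from version 0.22 onwards):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic placeholder data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# n_estimators is the additional hyper-parameter: the number of trees.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importance score per feature, usable to rank the most discriminative ones.
print(rf.feature_importances_)

# Class probabilities give a measure of confidence for each prediction.
print(rf.predict_proba(X[:3]))
```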

### 3.4.5. Gradient Boosting (GB)

GB is very similar to RF because it also employs multiple trees, but in a different manner. The trees do not work in parallel as in RF but sequentially: the output of each tree is used as input of the following one, so that each tree learns iteratively from the errors made by its predecessor. This is called *boosting* [43]. Because GB is composed of DTs, most of the parameters are the same. However, it has additional ones, which are:


An advantage of this algorithm is that it can produce better results than RF, although it is potentially more prone to overfitting. It also reduces both the variance and the bias. However, the model is more complex to build, so the training phase is much longer than for the other algorithms. Despite this shortcoming, GB is commonly used in the literature [33,34].
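
As an illustrative sketch, assuming the usual additional Scikit-Learn hyper-parameters (the number of sequential boosting stages and the learning rate, shown here with their default values):

```python
from sklearn.ensemble import GradientBoostingClassifier

# Beyond the tree-related parameters, GB adds (among others) the number of
# boosting stages and the learning rate that shrinks the contribution of
# each successive tree.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
```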
