*3.3. Classification Algorithms*

Among the supervised methods, classification algorithms are some of the most well-known. Classification produces predictions in the form of a class label (e.g., which bacterial species is present); thus, the output is inherently categorical. Briefly, some of the most common classification algorithms are presented in the following.

*k-nearest neighbors* (*k-NN*): One of the simplest classification algorithms, *k*-NN is a distance-based classifier. The class of a new observation is predicted as the most common class among its *k* nearest neighbors in the feature space [77]. In the example shown in Figure 1, the feature space is two-dimensional (with variables *x*1 and *x*2) and the value of *k* is 4. In *k*-NN, the number of neighbors used for assignment, *k*, is a hyperparameter (i.e., a model parameter that is not optimized during the training process itself). As with most ML models, hyperparameter selection may strongly influence performance [78].
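As a minimal sketch of how a *k*-NN classifier might be fit in practice (assuming the scikit-learn library and a synthetic two-feature dataset, neither of which appears in the works reviewed here):

```python
# Minimal k-NN sketch on synthetic data (illustrative only; the dataset and
# k = 4 mirror the toy example of Figure 1 but are otherwise arbitrary choices).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Two-dimensional feature space (x1, x2) with two classes.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k is a hyperparameter chosen before training, not learned from the data.
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(X_train, y_train)
print("k-NN test accuracy:", knn.score(X_test, y_test))
```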

*Support vector machine* (*SVM*) is a non-probabilistic, binary, linear classifier [79]. SVM relies on the construction of hyperplane boundaries in the feature space to separate data of different classes. Although SVM itself only accounts for linear separation of classes (i.e., hyperplane boundaries must be "flat"), the data may be mapped to a higher-dimensional feature space using the "kernel trick" [80]. Some of the most common kernels are the polynomial and the radial basis function (Gaussian) kernels. When the hyperplane boundaries are projected back into the original feature space, they allow for non-linear boundaries, as shown in Figure 1. Additionally, there are methods allowing SVM to be used for multi-class prediction [81]. The placement of the hyperplane is determined by maximizing the margin, i.e., the distance between the hyperplane and the points of each class closest to the boundary (the support vectors). SVM's robustness against outliers is improved by a soft margin, which allows a certain number of misclassifications, presumably outliers, to improve the separation of the remaining observations [82]. While SVM is resilient against outliers and performs well in high-dimensional feature spaces, it is prone to over-fitting, especially when using non-linear kernels [83]. Over-fitting occurs when a model performs well on training data but poorly when generalized to unseen data.
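A comparable sketch for an SVM with a radial basis function kernel might look as follows (again assuming scikit-learn and synthetic data; the parameter values are illustrative only):

```python
# Illustrative SVM sketch with a non-linear (RBF) kernel and soft margin.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data that is not linearly separable in the original space.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kernel="rbf" applies the kernel trick for non-linear boundaries;
# C controls the soft margin (smaller C tolerates more misclassifications).
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train, y_train)
print("SVM test accuracy:", svm.score(X_test, y_test))
```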

**Figure 1.** Comparison of classification techniques using *k*-NN and SVM. In *k*-NN, the four nearest neighbors are shown contributing to the gray point's assignment; the gray point is therefore assigned to the blue star class. In the hypothetical SVM with a non-linear kernel, new data are classified according to the region in which the point lies. In both examples, the feature space consists of two dimensions. The classes could be, for example, bacterial species such as *E. coli*, *Salmonella* spp., *Pseudomonas* spp., *Staphylococcus* spp., *Enterococcus* spp., etc. In practical applications, the feature space has many more dimensions, where the decision boundaries for SVM are (*n*−1)-dimensional hyperplanes for an *n*-dimensional feature vector.

*Linear discriminant analysis* (*LDA*): In addition to dimension reduction, LDA can be used for classification. Related algorithms, such as quadratic discriminant analysis (QDA), allow for non-linear classification [84]. One limitation of LDA and its relatives is that they assume the data are normally distributed.
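A brief sketch contrasting LDA and QDA under the same assumptions (scikit-learn and synthetic data, used only for illustration) could be:

```python
# LDA gives linear class boundaries, QDA quadratic ones; both assume
# normally distributed features within each class.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, "mean CV accuracy:", scores.mean().round(3))
```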

*Decision tree* (*DT*) and *random forest* (*RF*): In tree-based models such as the decision tree (DT), the feature vector starts at the tree's "trunk," and at each branching point a decision is made based on the learned decision rules. The final classification is given by the terminal or "leaf" node in which the instance ends up. DTs can be used for classification and regression [85]. When the target variable is categorical, the model is referred to as a classification tree; when the target variable is numerical and continuous, it is referred to as a regression tree [86]. Random forest (RF) is so called because it can be considered a forest of decision trees (Figure 2) [87]. There are many RF architectures, but in all of them the classification from each decision tree contributes to the overall classification for an observation.
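A minimal sketch contrasting a single DT with an RF ensemble (again assuming scikit-learn and synthetic data; the tree depth and forest size are arbitrary illustrative choices) might be:

```python
# A single decision tree versus a random forest ensemble of 100 trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
# Each tree in the forest votes; the forest's prediction aggregates those votes.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("DT accuracy:", tree.score(X_test, y_test))
print("RF accuracy:", forest.score(X_test, y_test))
```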

*Artificial neural network* (*ANN*) draws inspiration from biological neural networks (i.e., neurons in the brain) and is composed of a collection of connected nodes called artificial neurons (see Figure 3). ANNs can be used for classification and regression. As mentioned earlier, ANNs can also be used for dimension reduction prior to supervised machine learning. There is a large variety of ANN structures, such as (1) the recurrent neural network (RNN) [88], (2) the extreme learning machine (ELM) [89], and (3) deep learning algorithms such as the convolutional neural network (CNN) [90], the deep belief network [91], and the back-propagation neural network (BPNN) [92]. "Deep" indicates several hidden layers. ANN architectures have many hyperparameters, such as the number of hidden layers, the connectedness, and the activation functions [93].
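As a simple illustration of these hyperparameters, a small feed-forward ANN might be configured as follows (a sketch assuming scikit-learn's MLPClassifier and synthetic data; the layer sizes and activation function are arbitrary choices, not recommendations):

```python
# Minimal feed-forward ANN sketch with two hidden layers.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X = StandardScaler().fit_transform(X)  # ANNs usually benefit from scaled inputs
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The number and size of hidden layers and the activation function
# are typical hyperparameters set before training.
ann = MLPClassifier(hidden_layer_sizes=(32, 16), activation="relu",
                    max_iter=1000, random_state=0)
ann.fit(X_train, y_train)
print("ANN test accuracy:", ann.score(X_test, y_test))
```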

**Figure 2.** Decision tree (DT) showing nodes at which binary decisions are made on features. Terminal node dictates model prediction. Actual DTs have many more nodes than shown here. Random forest (RF) shown as a series of distinct decision trees.

**Figure 3.** Artificial neural network (ANN) showing nodes of the input, hidden, and output layers.

One of the aspects that makes ANNs so powerful is that features do not need to be well-defined real numbers. This allows them to excel at working with data such as images, for which extracting numerical features would be difficult and inefficient. One limitation of ANNs is that they require a large amount of data for effective training. In some settings, a shortage of training data can be mitigated by generating synthetic examples with a generative adversarial network (GAN) trained using back-propagation [94].

Common classification model performance metrics are accuracy, precision, sensitivity (also known as recall), specificity, and *F*1. For binary classification with labels "positive" and "negative", they are defined as follows:

$$accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

$$precision = \frac{TP}{TP + FP} \tag{2}$$

$$sensitivity = \frac{TP}{TP + FN} \tag{3}$$

$$specificity = \frac{TN}{TN + FP} \tag{4}$$

$$F1 = \frac{2 \times precision \times sensitivity}{precision + sensitivity} \tag{5}$$

where *TP* is true positive, *TN* is true negative, *FP* is false positive, and *FN* is false negative.
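As a worked illustration, the metrics of Equations (1)–(5) can be computed directly from these four counts (the counts below are made-up values used only to demonstrate the formulas):

```python
# Hypothetical confusion-matrix counts for a binary classifier.
TP, TN, FP, FN = 42, 50, 5, 3

accuracy    = (TP + TN) / (TP + TN + FP + FN)   # Equation (1)
precision   = TP / (TP + FP)                    # Equation (2)
sensitivity = TP / (TP + FN)                    # Equation (3), also called recall
specificity = TN / (TN + FP)                    # Equation (4)
f1 = 2 * precision * sensitivity / (precision + sensitivity)  # Equation (5)

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, "
      f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}, F1={f1:.3f}")
```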
