2.2.2. Machine Learning Classification Algorithm
The six common machine learning algorithms are: decision tree [21], linear discriminant analysis (LDA) [22], naive Bayes [23], linear SVM [24], K-nearest neighbor (KNN) [25], and neural networks [26].
(1) Decision Tree
A decision tree is an algorithm used for classifying data: data characteristics are tested one by one to determine the category to which the data belong. It can be regarded as a tree-shaped prediction model with a hierarchical structure comprising nodes and directed edges. The tree contains three types of nodes: root, internal, and leaf nodes. A decision tree has only one root node, which holds the collection of all the training data. Each internal node in the tree poses a split question: a test on a certain attribute of the instance is specified, the samples arriving at the node are divided according to that attribute, and each branch leaving the node corresponds to a possible value of the attribute. Each leaf node is a data collection point with a classification label, indicating the category to which an instance belongs.
There are many decision tree algorithms, such as ID3, C4.5, and CART. All of them use a top-down greedy approach: each internal node selects the attribute with the best classification effect for splitting, and this process continues until the decision tree classifies all the training data correctly or all attributes have been used. A simplified version of the algorithm constructs the decision tree from the full set of training samples. The specific steps are as follows:
Step 1: Suppose T is the training sample set.
Step 2: Select an attribute from the attribute set that best describes the sample in T.
Step 3: Create a tree node whose value is the selected attribute, and create its child nodes. Each child branch represents a unique value (or interval) of the selected attribute, and the value of the branch is used to further subdivide the samples into subcategories.
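The steps above can be sketched as a minimal ID3-style tree builder. The weather-style toy data, attribute indices, and helper names below are illustrative assumptions, not part of the original text; the greedy information-gain split follows the top-down procedure just described.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_attribute(rows, labels, attributes):
    """Step 2: pick the attribute whose split yields the highest information gain."""
    base = entropy(labels)
    def gain(a):
        remainder = 0.0
        for v in set(r[a] for r in rows):
            subset = [l for r, l in zip(rows, labels) if r[a] == v]
            remainder += len(subset) / len(labels) * entropy(subset)
        return base - remainder
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    """Steps 1-3: recursively split the training set T top-down."""
    if len(set(labels)) == 1:           # pure node -> leaf with that label
        return labels[0]
    if not attributes:                  # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attributes)
    node = {"attr": a, "children": {}}
    for v in set(r[a] for r in rows):   # one child branch per attribute value
        idx = [i for i, r in enumerate(rows) if r[a] == v]
        node["children"][v] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [b for b in attributes if b != a])
    return node

def classify(tree, row):
    """Follow the branches matching the row's attribute values to a leaf."""
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attr"]]]
    return tree

# Hypothetical toy data: attribute 0 = outlook, attribute 1 = windy.
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
labels = ["play", "play", "play", "stay"]
tree = build_tree(rows, labels, [0, 1])
print(classify(tree, ("rain", "yes")))   # -> stay
```

The recursion stops either at a pure node or when the attribute set is exhausted, matching the stopping condition stated above.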
(2) Linear Discriminant Analysis
Linear discriminant analysis is a classic linear learning method, first proposed by Fisher in 1936 for the binary classification problem and hence also known as the Fisher linear discriminant. The idea of linear discrimination is simple: given a set of training samples, we project the examples onto a straight line so that the projection points of examples from the same class are as close as possible, and the projection points of examples from different classes are as far apart as possible. To classify a new sample, it is projected onto the same line, and its class is determined by the position of its projection point.
Previously, we focused on analyzing the application of the LDA algorithm in dimensionality reduction. The LDA algorithm can also be used for classification. LDA assumes that each class of the sample dataset conforms to a normal distribution. After LDA reduces the dimensions of the sample data, we can calculate the mean and variance of the projected data of each class $j$ through maximum likelihood estimation as follows:
$$\mu_j = \frac{1}{N_j}\sum_{z \in Z_j} z, \qquad \sigma_j^2 = \frac{1}{N_j}\sum_{z \in Z_j} (z - \mu_j)^2$$
Thereafter, the probability density function of samples from each class can be obtained as follows:
$$p(z \mid j) = \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\!\left(-\frac{(z - \mu_j)^2}{2\sigma_j^2}\right)$$
where $z$ is the sample after dimensionality reduction.
Therefore, the steps of LDA classification for an unlabeled input sample are:
(1) LDA is used to reduce the dimensionality of the input sample.
(2) According to the probability density function, the probability that the reduced-dimensionality sample belongs to each class is calculated.
(3) The category corresponding to the largest probability is identified as the predicted category.
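Steps (1)-(3) can be sketched end to end for two-class, 2-D data: compute the Fisher direction, project every sample to one dimension, fit a Gaussian per class by maximum likelihood, and classify by the larger density. The toy data and helper names are illustrative assumptions.

```python
import math

def mean(vs):
    n = len(vs)
    return [sum(v[i] for v in vs) / n for i in range(len(vs[0]))]

def fisher_direction(c0, c1):
    """w = Sw^{-1} (mu1 - mu0) for 2-D data, inverting the 2x2 Sw by hand."""
    m0, m1 = mean(c0), mean(c1)
    s = [[0.0, 0.0], [0.0, 0.0]]       # within-class scatter matrix Sw
    for cls, m in ((c0, m0), (c1, m1)):
        for x in cls:
            d = [x[0] - m[0], x[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    dm = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * dm[0] + inv[0][1] * dm[1],
            inv[1][0] * dm[0] + inv[1][1] * dm[1]]

def project(w, x):
    return w[0] * x[0] + w[1] * x[1]

def gaussian(z, mu, var):
    """Normal density p(z | class) with ML-estimated mean and variance."""
    return math.exp(-(z - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical 2-D training data for two classes.
c0 = [(1.0, 2.0), (1.5, 1.8), (0.8, 2.2), (1.2, 2.1)]
c1 = [(4.0, 4.5), (4.2, 4.1), (3.8, 4.8), (4.5, 4.4)]
w = fisher_direction(c0, c1)

params = []                            # (mu_j, sigma_j^2) per class
for cls in (c0, c1):
    zs = [project(w, x) for x in cls]  # step (1): reduce to one dimension
    mu = sum(zs) / len(zs)
    var = sum((z - mu) ** 2 for z in zs) / len(zs)
    params.append((mu, var))

def classify(x):
    z = project(w, x)                  # steps (2)-(3): densities, argmax
    dens = [gaussian(z, mu, var) for mu, var in params]
    return dens.index(max(dens))

print(classify((1.1, 2.0)), classify((4.1, 4.3)))
```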
(3) Naive Bayes
The naive Bayes classifier is a family of simple probability classifiers based on Bayes' theorem under the assumption of strong (naive) independence between features. The classifier assigns to each problem instance, represented by a vector of feature values, a class label drawn from a finite set. Naive Bayes is not a single training algorithm but a series of algorithms based on the same principle: every naive Bayes classifier assumes that each sample feature is independent of the other features.
Step 1: Let $X = (x_1, x_2, \ldots, x_D)$ represent a data object with D-dimensional attributes. The training set S contains K categories, expressed as $C = \{C_1, C_2, \ldots, C_K\}$.
Step 2: Given the data object X to be classified, its category is predicted as follows:
$$C(X) = \arg\max_{C_i} P(C_i \mid X), \quad i = 1, 2, \ldots, K$$
The above formula indicates that when the data object X to be classified is known, the probabilities of X belonging to $C_1, C_2, \ldots, C_K$ are calculated, and the category $C_i$ with the maximum probability is selected as the category of X.
Step 3: According to Bayes' theorem, $P(C_i \mid X)$ is calculated as follows:
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$
In the calculation process, $P(X)$ is a constant that is the same for every category. Therefore, to obtain the maximum value of $P(C_i \mid X)$, only the maximum value of $P(X \mid C_i)P(C_i)$ needs to be calculated. If the prior probability of the categories is unknown, it is usually assumed that the categories are equally probable, that is, $P(C_1) = P(C_2) = \cdots = P(C_K) = 1/K$.
Step 4: Assuming that the attributes of data object X are independent of each other, $P(X \mid C_i)$ is calculated as follows:
$$P(X \mid C_i) = \prod_{d=1}^{D} P(x_d \mid C_i)$$
Step 5: If attribute d is discrete or categorical, $P(x_d \mid C_i)$ is calculated as follows. Suppose there are n data objects belonging to category $C_i$ in the training set, of which m have the attribute value $x_d$ under attribute d. Then:
$$P(x_d \mid C_i) = \frac{m}{n}$$
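Steps 1-5 can be sketched for categorical attributes by counting: priors from class frequencies, conditionals as the m/n ratios above, and prediction as the argmax of the product. The toy weather data and function names are illustrative assumptions; no smoothing is applied, exactly as in the formula above.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(Ci) and the counts behind P(x_d | Ci) from training data."""
    prior = Counter(labels)                # class frequencies for P(Ci)
    n = len(labels)
    cond = defaultdict(Counter)            # (class, attribute d) -> value counts
    for row, c in zip(rows, labels):
        for d, v in enumerate(row):
            cond[(c, d)][v] += 1
    return prior, cond, n

def classify_nb(prior, cond, n, row):
    """Steps 2-5: pick argmax_Ci  P(Ci) * prod_d P(x_d | Ci)."""
    best, best_p = None, -1.0
    for c, nc in prior.items():
        p = nc / n                         # prior P(Ci)
        for d, v in enumerate(row):
            p *= cond[(c, d)][v] / nc      # P(x_d | Ci) = m / n
        if p > best_p:
            best, best_p = c, p
    return best

# Hypothetical toy data: attribute 0 = outlook, attribute 1 = temperature.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)
print(classify_nb(*model, ("rain", "mild")))   # -> yes
```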
(4) Linear SVM
An SVM is a type of generalized linear classifier that classifies binary data in a supervised learning manner. Its decision boundary is the maximum-margin hyperplane determined from the training samples.
The SVM model requires the distance from every point to the hyperplane to be at least a certain margin, so that all classification points lie on the correct side of the support vectors of their respective categories. Mathematically, this is expressed as:
$$\min_{w,\,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i\left(w^{T}x_i + b\right) \ge 1, \quad i = 1, 2, \ldots, N$$
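One simple way to approximate the maximum-margin solution is subgradient descent on the regularized hinge loss; this is a sketch under that assumption, not the exact constrained optimization, and the toy data, step-size schedule, and hyperparameters are illustrative choices.

```python
def train_linear_svm(xs, ys, lam=0.01, lr=0.1, epochs=500):
    """Minimize lam/2 * ||w||^2 + hinge loss by per-sample subgradient steps."""
    w = [0.0, 0.0]
    b = 0.0
    for epoch in range(epochs):
        step = lr / (1 + 0.01 * epoch)           # slowly decreasing step size
        for x, y in zip(xs, ys):
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            if margin < 1:                       # point violates y(w.x + b) >= 1
                w = [w[i] + step * (y * x[i] - lam * w[i]) for i in range(2)]
                b += step * y
            else:                                # only the regularizer acts
                w = [w[i] * (1 - step * lam) for i in range(2)]
    return w, b

def predict(w, b, x):
    """Side of the separating hyperplane, as a label in {-1, +1}."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

# Hypothetical linearly separable toy data with labels y in {-1, +1}.
xs = [(1.0, 1.0), (2.0, 1.5), (4.0, 4.0), (5.0, 4.5)]
ys = [-1, -1, 1, 1]
w, b = train_linear_svm(xs, ys)
print([predict(w, b, x) for x in xs])
```

With separable data and a small regularizer, the learned hyperplane classifies all training points with margin at least 1, matching the constraint in the formula above.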
(5) KNN
The proximity algorithm, better known as the KNN classification algorithm, is one of the simplest methods in data-mining classification. Each sample is represented by its K nearest neighbors, and each record in the dataset is classified according to the classes of those neighbors.
In general, the KNN classification algorithm includes the following four steps:
Step 1: Data are prepared and preprocessed.
Step 2: The distance from the test sample point (i.e., the point to be classified) to every other sample point is calculated.
Step 3: Each distance is sorted, and the K points with the smallest distance are selected.
Step 4: The categories to which the K points belong are compared. According to the principle that the minority obeys the majority, the test sample points are classified into the category with the highest proportion among the K points.
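Steps 2-4 reduce to a few lines: compute all distances, keep the K smallest, and take the majority class. The toy 2-D data and function name below are illustrative assumptions; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(train, labels, x, k=3):
    """Steps 2-4: distances to all samples, k nearest, majority vote."""
    dists = sorted((math.dist(p, x), c) for p, c in zip(train, labels))
    votes = Counter(c for _, c in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training points forming two well-separated clusters.
train = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_classify(train, labels, (1.1, 1.0)))   # -> a
```

Step 1 (preparation and preprocessing, e.g., feature scaling) is assumed done; with unscaled features, large-range attributes would dominate the distance.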
(6) Neural Networks
The basic processing elements of artificial neural networks are called artificial neurons, or simply neurons or nodes. In a simplified mathematical model of the neuron, the effects of the synapses are represented by connection weights that modulate the effect of the associated input signals, and the nonlinear characteristic exhibited by neurons is represented by a transfer function. The neuron output is then computed as the weighted sum of the input signals, transformed by the transfer function. The learning capability of an artificial neuron is achieved by adjusting the weights in accordance with the chosen learning algorithm.
The artificial neural network classification algorithm includes the following seven steps:
Step 1: The network structure is chosen.
Step 2: The weights are randomly initialized.
Step 3: The forward propagation (FP) algorithm is executed.
Step 4: The cost function J is computed.
Step 5: The backpropagation algorithm is executed.
Step 6: A gradient check is performed.
Step 7: The function J is minimized using an optimization algorithm together with backpropagation.
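Steps 1-5 and 7 can be sketched for a tiny fixed 2-4-1 sigmoid network trained on XOR with plain gradient descent; the architecture, seed, learning rate, and epoch count are illustrative assumptions, and the gradient check of Step 6 is omitted for brevity.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    """Transfer function applied to each neuron's weighted input sum."""
    return 1.0 / (1.0 + math.exp(-z))

# Steps 1-2: fixed 2-4-1 structure, randomly initialized weights.
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(4)]
b1 = [0.0] * 4
W2 = [random.uniform(-1, 1) for _ in range(4)]
b2 = 0.0

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
Y = [0, 1, 1, 0]                       # XOR targets

def forward(x):
    """Step 3: forward propagation through hidden and output layers."""
    h = [sigmoid(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(4)]
    o = sigmoid(sum(W2[j] * h[j] for j in range(4)) + b2)
    return h, o

def loss():
    """Step 4: mean squared-error cost J over the training set."""
    return sum((forward(x)[1] - y) ** 2 for x, y in zip(X, Y)) / len(X)

initial = loss()
lr = 0.5
for _ in range(20000):                 # Step 7: gradient descent on J
    for x, y in zip(X, Y):
        h, o = forward(x)
        do = (o - y) * o * (1 - o)     # Step 5: backpropagated output error
        for j in range(4):
            dh = do * W2[j] * h[j] * (1 - h[j])   # hidden-layer error
            W2[j] -= lr * do * h[j]
            W1[j][0] -= lr * dh * x[0]
            W1[j][1] -= lr * dh * x[1]
            b1[j] -= lr * dh
        b2 -= lr * do

print(initial, loss())                 # cost before vs. after training
```

After training, the cost J is far below its initial value, which is the whole point of Step 7; thresholding the output at 0.5 turns the network into a classifier.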