The overall system for Wi-Fi-based positioning and localization consists of fixed WAPs, mobile devices, and positioning services. The stages of the general method are shown in Figure 1. First, new features are generated from the dataset. The generated dataset is then split randomly into training and test sets at a ratio of 70% and 30%, respectively. In the next stage, the training data are presented to the Gaussian naive Bayes (NB), decision tree (DT), K-nearest neighbors (KNN), support vector machine (SVM), random forest (RF), and multi-layer perceptron (MLP) algorithms, all of which were obtained from the Scikit-learn library [25,26,27,28,29]. The algorithms are then compared according to the classification and regression performance metrics explained in the ML section, and the most suitable algorithms are selected for positioning and localization. In addition, the hierarchical fusing ML (HF-ML) algorithm is shown in Figure 2. In this algorithm, the dataset is first used to train the building classification. The building predictions are then combined with the input dataset, and this combined dataset is used to train the floor prediction. After that training, the floor predictions are combined with the previous dataset, which now contains the input dataset, the building predictions, and the floor predictions. The dataset resulting from these hierarchical prediction steps is used to train the longitude and latitude regression. The selected learning models are thus fused hierarchically using deductive reasoning. According to HF-ML, there are two additional types of input data, namely building information and floor information. Since two new variables were introduced, HF-ML was trained with the training data and its performance was evaluated with the test data. The resulting model can produce location and position data after the feature generation process is applied to new data. See the supplementary materials for the codes and the data.
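As a rough illustration of this pipeline, the following sketch chains the three stages with Scikit-learn estimators. The random forest models, the variable names, and the synthetic data are illustrative assumptions only; the study selects the actual models by comparing the algorithms listed above.

```python
# Sketch of the hierarchical fusing (HF-ML) idea: building -> floor ->
# longitude/latitude, each later stage also seeing the earlier predictions.
# Estimators, variable names, and data below are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def fit_hf_ml(X, y_building, y_floor, y_lon, y_lat):
    # Stage 1: building classification on the raw features.
    b_clf = RandomForestClassifier(random_state=0).fit(X, y_building)
    Xb = np.hstack([X, b_clf.predict(X).reshape(-1, 1)])
    # Stage 2: floor classification on features + building prediction.
    f_clf = RandomForestClassifier(random_state=0).fit(Xb, y_floor)
    Xbf = np.hstack([Xb, f_clf.predict(Xb).reshape(-1, 1)])
    # Stage 3: longitude/latitude regression on features + both predictions.
    lon_reg = RandomForestRegressor(random_state=0).fit(Xbf, y_lon)
    lat_reg = RandomForestRegressor(random_state=0).fit(Xbf, y_lat)
    return b_clf, f_clf, lon_reg, lat_reg

def predict_hf_ml(models, X):
    b_clf, f_clf, lon_reg, lat_reg = models
    b = b_clf.predict(X).reshape(-1, 1)
    f = f_clf.predict(np.hstack([X, b])).reshape(-1, 1)
    Xbf = np.hstack([X, b, f])
    return b.ravel(), f.ravel(), lon_reg.predict(Xbf), lat_reg.predict(Xbf)

# Tiny synthetic example (placeholder RSSI-like features and targets).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
models = fit_hf_ml(X, rng.integers(0, 3, 300), rng.integers(0, 5, 300),
                   rng.normal(size=300), rng.normal(size=300))
print(predict_hf_ml(models, X[:5]))
```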
2.3. Machine Learning
The Gaussian naive Bayes (GNB) classifier is a probabilistic classification algorithm based on Bayes' theorem [31]. Owing to its low computational load and high classification accuracy, it has been widely chosen in studies aimed at classification. The aim of the naive Bayes classifier is to assign the i-th observation to a class by computing the posterior probability of class c given the feature x_i. The calculation of the posterior probability is given in Equation (1), where P(c|x_i) is the posterior probability, P(c) is the class prior probability, P(x_i|c) is the likelihood, i.e., the probability of the i-th feature given the class, and P(x_i) is the prior probability of feature i. There are three main steps in the naive Bayes algorithm. In the first step, the algorithm calculates the frequency of each output category in the dataset. In the second step, the likelihood table is calculated. In the last step, the probability of each output class is calculated according to the Gaussian distribution [31,32].
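A minimal GNB example with Scikit-learn is sketched below; the synthetic data and variable names are placeholders, not the study's fingerprinting dataset.

```python
# Minimal Gaussian naive Bayes sketch with Scikit-learn; data and variable
# names are illustrative placeholders.
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)   # 70/30 split, as in the paper

gnb = GaussianNB()
gnb.fit(X_train, y_train)                  # per-class Gaussian likelihoods
print(gnb.predict_proba(X_test[:3]))       # posterior P(c | x_i) per class
print(gnb.score(X_test, y_test))           # accuracy on the test split
```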
The K-nearest neighbors (KNN) algorithm is a learning algorithm that operates on the values of the K nearest neighbors of a sample. KNN is a non-parametric method for classification and regression [33]. It was first applied to the classification of news articles [34]. When learning with the KNN algorithm, the distance of a sample to every other sample in the examined dataset is first calculated, using the Euclidean, Manhattan, or Hamming distance function. Then, the mean value of the K nearest neighbors is calculated for the sample. The K value is the main hyper-parameter of the KNN algorithm. If K is too low, the decision borders become jagged and overfitting occurs, whereas if K is too high, the separation borders become smoother and underfitting occurs. The disadvantage of the KNN algorithm is that the distance calculation increases the processing load as the amount of data increases.
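The following sketch illustrates the neighbor-averaging idea for regression; the data, the choice K = 5, and the Euclidean metric are assumptions for illustration only.

```python
# Illustrative KNN sketch (not the paper's exact configuration): a
# KNeighborsRegressor predicts a target from placeholder features.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))             # placeholder feature matrix
y = X[:, 0] * 2.0 + rng.normal(size=200)   # placeholder regression target

# K and the distance metric are the tunable choices discussed above.
knn = KNeighborsRegressor(n_neighbors=5, metric="euclidean")
knn.fit(X, y)
print(knn.predict(X[:3]))  # mean of the 5 nearest neighbors' targets
```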
The decision tree (DT) algorithm is frequently used in statistical learning and data mining. It works with simple if-then-else decision rules and is used for both classification and regression in supervised learning [35]. It has three basic steps. In the first step, the most meaningful feature is placed at the first (root) node. In the second step, the dataset is divided into subsets according to this node; the subsets should be created such that each contains data with the same value of a feature. In the third step, steps one and two are repeated until the last (leaf) nodes in all of the branches are created. The decision tree algorithm thus builds a classification or regression model in the form of a tree structure: it splits the dataset into smaller and smaller subsets while an associated decision tree is incrementally developed, and the result is a tree with decision nodes. Decision trees can handle both categorical and numerical data [36,37].
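A short decision tree sketch is given below; the Iris data and the depth limit are illustrative choices, not the study's configuration, but the printed rules show the if-then-else structure described above.

```python
# Small decision tree sketch; dataset and depth are illustrative placeholders.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
dt = DecisionTreeClassifier(max_depth=3, random_state=0)
dt.fit(X, y)

# The fitted tree is a set of if-then-else rules over the features.
print(export_text(dt))
```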
The RF method is designed as a forest consisting of many DTs [38]. Each decision tree in the forest is formed by drawing a sample from the original dataset with the bootstrap technique and selecting a random subset of all variables at each decision node. The RF algorithm consists of four fundamental steps. First, n features are randomly selected from the total of m features. Second, the node d is created using the best split point among the n features. Third, the algorithm checks whether the number of final (leaf) nodes has reached the target number; if it has, it moves on to the next step, and if not, it goes back to step one. Finally, by repeating steps one to three for the desired number of trees in the forest, the forest is built [38,39,40].
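The sketch below shows these ingredients (bootstrap sampling and random feature subsets per split) in Scikit-learn; all parameter values are illustrative assumptions.

```python
# Random forest sketch: an ensemble of bootstrap-trained decision trees,
# each split drawn from a random subset of features. Parameter values are
# illustrative, not the paper's tuned settings.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features="sqrt",   # random subset of features tried at each node
    bootstrap=True,        # each tree sees a bootstrap sample of the data
    random_state=0,
)
rf.fit(X, y)
print(rf.score(X, y))
```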
The adaptive boosting (AB) algorithm is an ML method also known as an ensemble method. The aim of this method is to create a stronger learning structure by using weak learners. AB can be used to improve the performance of any ML algorithm; however, it often uses a single-level decision tree as the weak learner, since its processing load is much lower than that of other basic learning algorithms. The AB algorithm consists of four basic steps. In the first step, a weak learner is trained on the dataset, with each of the N training samples assigned an initial weight of 1/N. In the second step, the error of the learner is calculated. In the third step, the weights of the samples that produce high error (i.e., the misclassified samples) are increased so that the next weak learner focuses on them. In the final step, the learners are combined with weights derived from their error values; if the desired metric limit is reached, the combined model is output, otherwise the algorithm returns to the second step [41,42,43]. As the number of learners in this algorithm increases, both the processing load and the learning performance increase.
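A brief AdaBoost sketch follows; Scikit-learn's AdaBoostClassifier uses a single-level decision tree (stump) as its default weak learner, and the number of boosting rounds here is an arbitrary illustrative choice.

```python
# AdaBoost sketch; the default weak learner in Scikit-learn is a
# single-level decision tree (stump). Settings are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
ab = AdaBoostClassifier(n_estimators=50, random_state=0)  # 50 boosting rounds
ab.fit(X, y)
print(ab.score(X, y))
```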
First introduced by Vapnik, SVM is a supervised learning algorithm based on statistical learning theory [44,45]. SVM was originally developed for binary classification problems; while it was at first used only for classification, it later came to be used in linear and non-linear regression as well [46]. SVM aims to find the hyper-plane that separates the classes from each other and is the most distant from both of them. In cases where a linear separation is not possible, the data are mapped to a higher-dimensional space and the classification is performed in that space. Whereas classification aims to separate the data from each other by means of the generated vectors, regression in a sense performs the opposite operation and aims to construct a hyper-plane by identifying support vectors such that it includes as many data points as possible [47]. SVM is mainly divided into two categories according to the linear separability of the dataset, and the non-linear SVM decision function is given in Equation (2). When the data cannot be separated linearly, they are mapped to a higher-dimensional space, and kernel functions are used to perform this mapping implicitly: a kernel function expressed as K(x_i, x_j) = Φ(x_i)·Φ(x_j) is used in place of the scalar product Φ(x_i)·Φ(x_j) appearing in Equation (2). In this way non-linear transformations can be made and the data can be separated in the higher dimension, so kernel functions play a critical role in the performance of SVM. The most widely used kernel functions are the linear, polynomial, radial basis function (RBF), and sigmoid kernels, which are given in Table 3 [47,48,49]. In this table, x_i and x_j denote the input feature vectors. In the polynomial function, the parameter d corresponds to the degree of the polynomial, while in the radial basis function, γ is the width parameter of the Gaussian. In this study, only RBF was used as the SVM kernel function.
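An RBF-kernel SVM sketch is given below; the values of gamma and C are illustrative placeholders rather than the hyper-parameters tuned in the study.

```python
# SVM sketch with the RBF kernel, K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2).
# Hyper-parameter values are illustrative, not the paper's tuned settings.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
svc = SVC(kernel="rbf", gamma=0.1, C=1.0)  # gamma controls the kernel width
svc.fit(X, y)
print(svc.score(X, y))
```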
K-fold cross validation is a method used to evaluate the performance of learning algorithms [50,51]. In an evaluation, a dataset is created for training and testing, and the performance of the model trained on the training data is tested on the test dataset. However, this evaluation may not be reliable when the selected training and test data do not share the same distribution, when outliers are unevenly distributed between them, and so on. The K-fold cross validation method was therefore developed. In this method, the training and test data are merged into a single dataset, which is divided into K equal parts, where K is chosen by the user. Learning and testing are then performed for each of the K sub-sets; in each run, one sub-set is used for testing and the others are used for training. As a result, a performance metric is obtained for each sub-set, and the average of these metrics is taken as the performance metric of the K-fold cross-validation. In this study, K was chosen as 10.
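A 10-fold cross-validation sketch with Scikit-learn follows; the estimator and synthetic data are placeholders for whichever model and dataset are being evaluated.

```python
# 10-fold cross-validation sketch; estimator and data are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=10)
print(scores.mean(), scores.std())  # the average score is reported
```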
The classification performance metrics are obtained from the confusion matrix in Figure 4, as outlined in Table 4. In a two-class confusion matrix, the true positive (TP) value is the number of predictions where the predicted value is 1 (true) and the actual value is also 1 (true). The true negative (TN) value is the number of predictions where the predicted value is 0 (false) and the actual value is also 0 (false). The false positive (FP) value is the number of predictions where the predicted value is 1 (true) but the actual value is 0 (false). The false negative (FN) value is the number of predictions where the predicted value is 0 (false) but the actual value is 1 (true). The mean square error (MSE), mean absolute error (MAE), and coefficient of determination (R²) metrics used in the regression performance evaluation are outlined in
Table 5. These metrics are computed from the differences between the predicted values (ŷ_i) and the observed values (y_i); the square root of the MSE corresponds to the sample standard deviation of those differences.
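The sketch below computes the classification counts and the regression metrics described here with Scikit-learn; the small arrays are arbitrary illustrative values, not results from the study.

```python
# Classification counts and regression metrics; arrays are placeholders.
import numpy as np
from sklearn.metrics import (confusion_matrix, mean_squared_error,
                             mean_absolute_error, r2_score)

# Classification: TP, TN, FP, FN from a two-class confusion matrix.
y_true_cls = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_cls = np.array([1, 0, 0, 1, 0, 1, 1, 0])
tn, fp, fn, tp = confusion_matrix(y_true_cls, y_pred_cls).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Regression: MSE, MAE, and R^2 between observed and predicted values.
y_true_reg = np.array([2.5, 0.0, 2.0, 8.0])
y_pred_reg = np.array([3.0, -0.5, 2.0, 7.0])
mse = mean_squared_error(y_true_reg, y_pred_reg)
mae = mean_absolute_error(y_true_reg, y_pred_reg)
r2 = r2_score(y_true_reg, y_pred_reg)
print(accuracy, mse, mae, r2)
```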