*Article* **Learning to Classify DWDM Optical Channels from Tiny and Imbalanced Data**

**Paweł Cichosz 1, Stanisław Kozdrowski 1,\* and Sławomir Sujecki 2,3**


**Abstract:** Applying machine learning algorithms for assessing the transmission quality in optical networks is associated with substantial challenges. Datasets that could provide training instances tend to be small and heavily imbalanced. This requires applying imbalance compensation techniques when using binary classification algorithms, but it also makes one-class classification, learning only from instances of the majority class, a noteworthy alternative. This work examines the utility of both these approaches using a real dataset from a Dense Wavelength Division Multiplexing network operator, gathered through the network control plane. The dataset is indeed of a very small size and contains very few examples of "bad" paths that do not deliver the required level of transmission quality. Two binary classification algorithms, random forest and extreme gradient boosting, are used in combination with two imbalance handling methods, instance weighting and synthetic minority class instance generation. Their predictive performance is compared with that of four one-class classification algorithms: One-class SVM, one-class naive Bayes classifier, isolation forest, and maximum entropy modeling. The one-class approach turns out to be clearly superior, particularly with respect to the level of classification precision, making it possible to obtain more practically useful models.

**Keywords:** machine learning; optical networks; imbalanced data; one-class classification

#### **1. Introduction**

Constantly growing traffic in backbone networks makes dynamic and programmable optical networks increasingly important. This particularly applies to Dense Wavelength Division Multiplexing (DWDM) networks, where efficient use of network resources is of paramount importance. Introducing automation, frequent network reconfiguration, re-optimization, and network reliability monitoring allows DWDM network operators to minimize capital expenditures (Capex) and operating expenditures (Opex) [1–6]. Currently, software-defined networking (SDN) is used to achieve all these objectives. SDN uses a logically centralized control plane in a DWDM network that is realized using purpose-built flexible hardware such as reconfigurable optical add/drop multiplexers (ROADMs), flexible line interfaces, etc. [7,8]. In modern DWDM optical networks following the SDN paradigm, network reconfiguration is becoming more frequent, making the evolving network more resilient and allowing it to track changes in bandwidth demand more closely. However, bandwidth demand can change very quickly (fluctuations can occur within minutes), while network reconfigurations typically take much longer. This is mainly due to operational processes that are too slow to allow real-time network re-optimization. It is therefore important that DWDM network reconfigurations are automated and as fast as possible, without significantly increasing operational costs.


Frequent network reconfiguration and re-optimization necessary to make the best use of available resources has been facilitated by the introduction of software-defined networking (SDN) and knowledge-defined networking (KDN) paradigms [7,9–11]. Central to SDN and KDN is automatic provisioning of optical channels (lightpaths), which relies on accurate estimation of their transmission quality. Machine learning (ML) is a promising solution to this problem. Therefore, a number of algorithms have been proposed that first create a database using numerical modelling tools and then implement ML to estimate the quality of optical links [6,7,12]. However, in the approach presented here, we apply ML to a database that has been extracted directly via the control plane from the DWDM network under analysis. This approach leads to an ML problem that is clearly different from the one addressed in [6,7,11,12], as there are significant challenges in using real optical network datasets, related to data representation, data size, and class imbalance, which is intrinsic to data gathered via the control plane from an operating DWDM network. The class imbalance follows from the fact that in an operating DWDM network there may be dozens or hundreds of operating connections, but there is little information available (if any) on connections that could not be realized due to excessive bit error rate. Therefore, a specially tailored ML approach is required to tackle this problem.

The main advantage of gathering data via the control plane is that it can be easily implemented by a DWDM network operator. As will be explained later, this approach imposes some constraints on the choice of appropriate ML methods due to the above-mentioned class imbalance, which is an intrinsic feature of data collected via the control plane. As already mentioned, this makes the ML problem considered in this contribution clearly different from those considered so far in most of the available literature [6,7,12,13]. Expanding upon our previous work [14,15], we compare the predictive performance of the most successful binary classification algorithms combined with different techniques for class imbalance compensation with that of one-class classification algorithms that learn from majority class instances only.

#### *1.1. Machine Learning Challenges*

Successful applications of machine learning to support optical network design require real training data. While experiments on synthetic data may provide encouraging demonstrations, they may not adequately represent the challenges associated with this application area and may therefore yield overoptimistic predictive performance estimates or fail to reveal potential obstacles. These challenges are mainly related to data size and quality.

DWDM network operators, particularly those operating small or medium networks, may be unable to provide a dataset with more than several dozen or at best several hundred paths. More importantly, the vast majority, if not all, of those path configurations would usually correspond to correct, working channel designs. This is because unsuccessful path configurations are often discarded rather than archived, at least before the provider becomes aware of their utility as training data for machine learning. Until this awareness increases, available real datasets remain tiny and extremely imbalanced.

The data that has been made available for this study comes from a DWDM network operator providing services in Poland and is an excellent example of these issues. It contains only about a hundred paths, including just three "bad" ones (i.e., paths that could not be allocated due to a low quality of transmission). While it is still possible to use such data to train predictive models using classification learning algorithms, special care is needed to increase their sensitivity to the minority class and to reliably evaluate their quality. The extreme dominance of the "good" class makes it easy to come up with apparently accurate models of little or no actual predictive utility. To avoid this, we compensate for the class imbalance using instance weighting and synthetic minority-class instance generation. However, the tiny size and extreme imbalance of the data may still be on the edge of the capabilities of standard binary classification, even with such compensation techniques. Therefore one-class classification, in which only "good" paths are used as training data, may be a viable and promising alternative. To compare the predictive performance of the binary and one-class classification approaches, their predictions are evaluated using ROC and precision-recall curves in combination with stratified cross-validation.

#### *1.2. Related Work*

To the authors' best knowledge this work is the first to apply one-class classification in the optical network domain and to compare its predictive performance to binary classification using different techniques of handling class imbalance. There is, however, some related prior work on applying other machine learning methods to optical networks as well as on using one-class classification as an alternative to binary classification for imbalanced data.

Considering work related to optical networks first, in [16] the authors show that a routing and spectrum allocation (RSA) scheme that monitors quality of transmission (QoT) in multiple slices significantly improves network performance. Rottondi et al. [12] extensively discuss and use ML techniques to optimise complex systems where analytical models fail. However, the network data in [12] was generated artificially, whereas in this contribution the data is collected by the control plane from an operating network.

Similar problems related to lightpath QoT estimation are addressed by Mata et al. [17], but they focus mainly on the SVM classifier. Barletta et al. [18], on the other hand, mainly use a random forest algorithm that predicts whether the BER (Bit Error Rate) of unestablished lightpaths meets the required threshold based on traffic volume, desired route, and modulation format. As in [12], the system is trained and tested on artificial data, which is different from the approach adopted in this contribution.

Japkowicz [19] compared different ways of handling class imbalance, including one-class classification, and found binary classification with imbalance compensation superior to one-class classification. However, the experiments performed in [19] used artificial data and a neural network classifier (with one-class classification performed using an autoassociative network).

Lee and Cho [20] advocated the use of one-class classification for imbalanced data and demonstrated that it can outperform binary classification if the imbalance ratio is high. They experimented with the standard and one-class versions of the SVM algorithm.

Bellinger et al. [21] discuss the potential utility of one-class classification in binary classification tasks with extreme class imbalance, as in our case. Their results suggest that binary classification with class imbalance compensation methods may be more useful than one-class classification when dealing with data from complex multi-modal distributions. However, their results are based on datasets where the number of minority class instances is larger than in our case.

#### *1.3. Article Organization*

The rest of the paper is organized as follows. In Section 2 the analyzed optical network data, the applied machine learning algorithms, and model evaluation methods are described. The results of the experimental study are presented in Section 3 and discussed in Section 4. Contributions of this work and future research directions are summarized in Section 5.

#### **2. Materials and Methods**

The data comes from a real DWDM optical network of a large telecom operator. The network uses 96 DWDM channels allocated in C-band and is physically located in Poland, with network nodes corresponding to Polish cities.

#### *2.1. Data*

The network is equipped exclusively with coherent transponders, which is typical of a newly built operator network. The coherent transponders belong to Ciena's 6500 family, with transmission rates of 100 G, 200 G and 400 G and four types of modulation: QPSK, 16QAM, 32QAM and 64QAM.

The data preparation process is depicted in Figure 1. In order to better understand the meaning of the various database attributes presented later in this subsection in the context of DWDM technology, an example DWDM network topology is shown in Figure 2. Figure 3 illustrates the concepts of network node, hop, hop length, path, and transponder. The dataset contains 107 optical paths, 3 of which correspond to unsuccessful designs ("bad") and the remaining 104 are operational ("good").

**Figure 1.** Data preparation process.

**Figure 2.** An example DWDM network topology.

(Figure legend: blue — the path connecting the transponders, present in the marked hops; grey — other paths present in the hops.)

**Figure 3.** Network subsection illustrating the meaning of the specific channel attributes occurring in the studied database.

#### 2.1.1. Path Description

Network paths are described by several properties that may be related to transmission quality and expected to be predictively useful. The hop\_lengths property gives the length of each hop that forms a path from the initial transponder to the destination transponder. This property is important because the signal to noise ratio depends on the length of the fibre connecting both transponders. Each hop usually carries additional occupied wavelengths, used by paths other than the considered one. All paths can interact through nonlinear phenomena such as four-wave mixing and thus affect the quality of transmission. Therefore, the num\_of\_paths\_in\_hops property, which gives the number of adjacent DWDM wavelengths in a given hop, is included. The hop\_losses property gives the value of the optical loss for a given hop. Again, hop losses affect the signal to noise ratio and hence the corresponding property was included. Another property, number\_of\_hops, provides information on how many hops are present in a path from the initial to the destination transponder. Since each hop corresponds to a signal passing through a DWDM node, the number of hops affects the signal to noise ratio due to optical regeneration taking place in a DWDM node. The last two properties are intrinsically related to the specific type of transponder used. The transponder\_modulation property stores information on the transponder modulation format, e.g., QPSK or 16QAM. This property is important because the modulation format is related to receiver sensitivity. Finally, the transponder\_bitrate property is essentially self-explanatory and gives the bit rate of a given transponder. The transponder bit rate also affects receiver sensitivity and hence it is included.
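To make these attributes concrete, the sketch below shows a hypothetical path record with the properties listed above; the values (and units in the comments) are illustrative only and do not come from the operator's dataset.

```python
# A hypothetical path record using the properties described above.
# All values are illustrative only; they are not taken from the real dataset.
example_path = {
    "hop_lengths": [78.4, 112.0, 65.3],       # km, one entry per hop
    "num_of_paths_in_hops": [12, 7, 9],       # adjacent DWDM channels per hop
    "hop_losses": [19.2, 24.5, 16.8],         # dB, optical loss per hop
    "number_of_hops": 3,
    "transponder_modulation": "16QAM",
    "transponder_bitrate": 200,               # Gbit/s
    "label": "good",                          # target class
}
```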

#### 2.1.2. Vector Representation

Path descriptions were transformed to a vector representation, as expected by classification algorithms for tabular data, by a simple aggregation-based feature engineering technique. Each of the available edge properties (hop\_lengths, num\_of\_paths\_in\_hops, hop\_losses) was aggregated by calculating the mean and standard deviation over all edges in the path. This gives 6 attributes derived from edge properties (2 attributes for each of the 3 edge properties), in addition to the 3 path attributes unrelated to individual edges (number\_of\_hops, transponder\_modulation, and transponder\_bitrate).
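A minimal sketch of this aggregation is shown below, assuming path records shaped like the hypothetical example above; the function and dictionary keys are ours, not the authors' actual code.

```python
import statistics

def path_to_vector(path):
    """Aggregate per-hop (edge) properties into a fixed-length feature vector."""
    features = {}
    for prop in ("hop_lengths", "num_of_paths_in_hops", "hop_losses"):
        values = path[prop]
        features[prop + "_mean"] = statistics.mean(values)
        # Population standard deviation; equals 0.0 for single-hop paths.
        features[prop + "_sd"] = statistics.pstdev(values)
    # Path-level attributes unrelated to individual edges.
    features["number_of_hops"] = path["number_of_hops"]
    features["transponder_modulation"] = path["transponder_modulation"]
    features["transponder_bitrate"] = path["transponder_bitrate"]
    return features  # 6 aggregated attributes + 3 path-level attributes
```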

Applying additional aggregation functions to edge properties, such as the minimum, the maximum, the median, the first quartile, the third quartile, or the linear correlation coefficient with the ordinal number of the edge in the path, as in our prior work [14], may create some additional predictively useful attributes. However, this would make the dimensionality of this representation relatively high in comparison to the size of the available dataset, considerably increasing the risk of overfitting.

#### *2.2. Binary Classification*

Any standard classification algorithm can be used to predict channel "good"/"bad" class labels or probabilities. In this work we limit our attention to the two algorithms that performed best in our previous study [14]: Random forest and extreme gradient boosting. They are among the most successful learning algorithms for tabular data, and it is very unlikely that their performance could be beaten by other algorithms using the same vector path representation.

#### 2.2.1. Random Forest

The random forest algorithm creates a model ensemble consisting of multiple decision trees [22]. They are grown on bootstrap samples from the training set by using a mostly standard decision tree growing algorithm [23,24]. However, since the expected improvement of the resulting model ensemble over a single model is contingent upon sufficient diversity of the individual models in the ensemble [25,26], the following modifications are applied to stimulate the diversity of decision trees that are supposed to constitute a random forest:
• at each node, only a small random subset of attributes is considered as split candidates,
• trees are grown to full depth, without pruning.

To use a random forest model for prediction, simple unweighted voting of the individual trees in the model is performed, and the vote distribution is used to obtain class probability predictions. With dozens or (more typically) hundreds of trees, this voting mechanism usually makes random forests highly accurate and resistant to overfitting. An additional important advantage of the algorithm is its ease of use, resulting from limited sensitivity to parameter settings, which makes it possible to obtain high quality models without excessive tuning.

#### 2.2.2. Extreme Gradient Boosting

Extreme gradient boosting or *xgboost* is another highly successful ensemble modeling algorithm. Like other boosting algorithms, it creates ensemble components sequentially, in such a way that each subsequent model best combines with the previously created ones [27–30].

The *xgboost* algorithm internally uses regression trees for model representation and optimizes an ensemble quality measure that includes a loss term and a regularization term [31]. Each subsequent tree is grown to minimize the sum of the loss and regularization terms of all trees grown so far. Split selection criteria, stop criteria, and leaf values are derived from this minimization via a second-order Taylor expansion of the loss function, using its gradient and Hessian decomposed into terms for particular training instances, which are then assigned to the corresponding nodes and leaves of the tree being grown.
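For reference, this quality measure and its second-order approximation can be written in the standard xgboost notation (where $T$ denotes the number of leaves of a tree and $w_j$ its leaf values):

$$\mathcal{L} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 .$$

When the $t$-th tree $f_t$ is added, the loss is approximated using the gradient $g_i$ and Hessian $h_i$ of $l$ at the current prediction for instance $i$:

$$\mathcal{L}^{(t)} \approx \sum_{i} \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \Big] + \Omega(f_t),$$

which yields the optimal leaf value $w_j^{*} = -G_j/(H_j + \lambda)$, where $G_j$ and $H_j$ are the sums of $g_i$ and $h_i$ over the training instances assigned to leaf $j$; the same quantities determine the split selection gain and stop criteria.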

Extreme gradient boosting applied to binary classification is typically used with logarithmic loss (the negated log-likelihood) and the summed up numeric predictions of individual regression trees are transformed by a logistic link function to obtain class probability predictions.

The extreme gradient boosting algorithm is capable of providing excellent prediction quality, sometimes outperforming random forest models. It can overfit, however, if the number of trees grown is too large.

#### 2.2.3. Handling Class Imbalance

Techniques for compensating class imbalance can be divided into the following three main categories:

• internal, algorithm-level compensation mechanisms, such as instance weighting or modified class priors,
• data resampling, by minority class oversampling and/or majority class undersampling,
• synthetic minority class instance generation.

Techniques of the first category are generally supposed to increase sensitivity to the minority class without modifying the training data. They are possible with many classification algorithms and often consist in specifying class weights or prior probabilities. The binary classification algorithms used by this work are both ensemble modeling algorithms, which tend to be quite robust with respect to class imbalance, but their model quality can be still improved by such compensation mechanisms.

In the case of the random forest algorithm there are actually two related techniques. One is drawing bootstrap samples in a stratified manner, with different selection probabilities for particular classes. In the extreme case, a bootstrap sample may contain all instances from the minority class and a sample of the same size from the dominating class. The other is to specify instance weights affecting split selection and stop criteria for tree growing. Since in our case classes are extremely imbalanced and there are very few instances of the minority class, the weighting technique is preferred to the stratified sampling technique, as the latter would have to severely undersample the dominating class, with a possibly negative effect on model performance. The same weighting technique is also used with the *xgboost* algorithm. In this case instance weights are used when calculating the logarithmic loss, so that the contribution of minority class instances to the loss function minimized by the algorithm is increased.
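A minimal sketch of this weighting setup, assuming the scikit-learn and xgboost Python packages (the article does not name the implementations used, so this choice is an assumption); only the weight-related parameters are shown.

```python
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

minority_weight = 20  # one of the explored values: 1, 2, 5, 10, 20, 50, 100

# Random forest: per-class weights influence split selection and stop criteria.
rf = RandomForestClassifier(class_weight={"good": 1, "bad": minority_weight})

# xgboost: scale_pos_weight scales the contribution of the positive ("bad")
# class to the logarithmic loss being minimized.
xgb = XGBClassifier(objective="binary:logistic",
                    scale_pos_weight=minority_weight)
```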

Data resampling may be performed by minority class oversampling (replicating randomly selected minority class instances), majority class undersampling (selecting a sample of majority class instances), or a combination of both, so that the resampled training set has either fully balanced classes or at least considerably more balanced ones than originally. Unfortunately, these techniques have very limited utility for datasets that are both small and extremely imbalanced, as in our case. Undersampling would remove most of the available training data, and oversampling would replicate the very few "bad" paths, increasing the risk of overfitting to these specific instances. They can therefore hardly be expected to offer any advantages over internal imbalance compensation by weighting and are not used in this work.

Potentially more useful techniques of generating synthetic minority class instances can be considered more refined forms of oversampling, in which minority class instances available in the training data are not directly replicated, but used to generate new synthetic instances. This is supposed to make the increased representation of the minority class in the modified training set more diverse and thus reduce the risk of overfitting. Two well-known specific techniques based on this idea are SMOTE [32] and ROSE [33], and they are both used in our experimental study. SMOTE finds nearest neighbors of each minority class instance and then generates new synthetic instances by interpolating between the instance and its neighbors. ROSE adopts a smoothed bootstrap sampling approach, with new instances generated in the neighborhood of original instances by drawing from a conditional kernel density estimate of attribute values given the class. Both minority and majority class instances are generated, and the class distribution in the generated dataset can be controlled to achieve a desired level of balance.
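For illustration, the sketch below applies SMOTE-style oversampling with the imbalanced-learn Python package on stand-in data (ROSE is typically available as an R package; neither implementation choice is stated in the article, so both the package and the data here are assumptions).

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Tiny illustrative stand-in for the real feature matrix and labels.
X_train = np.random.default_rng(0).normal(size=(20, 9))
y_train = np.array(["bad"] * 3 + ["good"] * 17)

d = 5  # minority class size multiplication coefficient
n_bad = int((y_train == "bad").sum())

# Generate synthetic "bad" instances so that the minority class becomes
# d times larger; k_neighbors must be below the number of real "bad" instances.
smote = SMOTE(sampling_strategy={"bad": d * n_bad}, k_neighbors=2)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
```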

#### *2.3. One-Class Classification*

One-class classification follows this learning scenario [34,35]:

• the training set contains instances of only one class,
• the learned model is used to predict whether new instances belong to that class or not.

In our case the single class represented in the training set corresponds to "good" paths. When the obtained model is applied to prediction, it identifies paths which are likely to also be "good" (i.e., of the same class as that represented in the training set) and those which are likely to be "bad" (i.e., not of the same class as that represented in the training set). It can be assumed that model predictions are provided in the form of decision function values (numeric scores) such that higher values are assigned to instances that are considered less likely to be of the same class as that represented in the training set, i.e., in our case, more likely to be "bad" paths.

One-class classification is most often applied to unsupervised or semi-supervised anomaly detection [36], where an unlabeled training set, assumed to contain only normal instances, sometimes with a small fraction of anomalous instances, is used to learn a model that can detect anomalous instances in new data. It can be also useful, however, for binary classification tasks with extreme class imbalance [21], particularly when the number of minority class instances is too small for standard binary classification algorithms, even combined with imbalance compensation techniques. This may be often the case with data for optical channel classification.

The best known and widely used one-class classification algorithm is one-class SVM. In our experimental study it is compared with three other algorithms: The one-class naive Bayes classifier, the isolation forest algorithm, and the maximum entropy modeling algorithm. The first of those is a straightforward modification of the standard naive Bayes classifier and probably the simplest potentially useful one-class learning algorithm. The second one, while designed specifically for anomaly detection applications, can also serve as a general-purpose one-class classification algorithm. The third one, while originally intended for creating models of species distribution in ecosystems, has been also found to be useful for one-class classification.

#### 2.3.1. One-Class SVM

The one-class SVM algorithm uses a linear decision boundary, like standard SVM, but adopts a different principle to determine its parameters. Rather than maximizing the classification margin, which is not applicable to one-class classification, it maximizes the distance from the origin while separating the majority of training instances therefrom [37]. The side of the decision boundary opposite from the origin corresponds to the class represented in the training set. Only a small set of outlying training instances are permitted to be left behind, and the upper bound on the share of such outlying instances in the training set is specified via the *ν* parameter.

The principle of separating most of the training set from the origin of the space is typically combined with a translation-invariant kernel function (such as the radial kernel), so that instances in the transformed representation lie on a sphere centered in the origin. The separating hyperplane then cuts off a segment of the sphere where most training instances are located.

One-class SVM predictions are signed distances from the decision boundary, positive on the "one-class" (normal) side and negative on the outlying side. The negated value of such signed distance can therefore serve as a numeric score for ranking new instances with respect to their likelihood of not belonging to the class represented in the training set.
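A minimal sketch using the scikit-learn OneClassSVM on stand-in data (an assumed implementation choice; the article does not name the package), trained on "good" paths only, with the negated decision function used as the "bad"-path score.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Illustrative stand-in data: the model is trained on "good" paths only.
rng = np.random.default_rng(0)
X_good = rng.normal(size=(104, 9))
X_new = rng.normal(size=(10, 9))

# nu bounds the fraction of training instances allowed to end up on the
# outlying side of the decision boundary; the radial kernel is the default.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05).fit(X_good)

# decision_function is positive on the "normal" side; negate it so that
# higher scores indicate paths more likely to be "bad".
bad_scores = -ocsvm.decision_function(X_new)
```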

#### 2.3.2. One-Class Naive Bayes Classifier

The one-class modification of the naive Bayes classifier is particularly straightforward. Since only one class is represented in the training set, its prior probability is assumed to be 1, conditional attribute-value probabilities within this class are estimated on the full training set, and the probability of an instance belonging to this class is proportional to the product of such attribute-value probabilities [38]. For numeric attributes, Gaussian density function values, with the mean and standard deviation estimated on the training data, are used instead of discrete attribute-value probabilities.

Discrete class predictions can be made by comparing the product of attribute-value probabilities for a given instance to a threshold, set to or around the minimum value of this product over the training set. Numeric scores (decision function values) can be defined as the difference between such a threshold and the probability being compared.
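A minimal custom sketch of such a one-class Gaussian naive Bayes scorer (written for illustration; log densities are used instead of raw probability products for numerical stability).

```python
import numpy as np
from scipy.stats import norm

class OneClassNB:
    """One-class naive Bayes with Gaussian densities for numeric attributes."""

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0) + 1e-9        # guard against zero variance
        # The minimum log density over the training set defines the threshold.
        self.threshold_ = self._log_density(X).min()
        return self

    def _log_density(self, X):
        # Sum of per-attribute log densities = log of the naive Bayes product.
        return norm.logpdf(X, self.mean_, self.std_).sum(axis=1)

    def score(self, X):
        # Higher score = less likely to belong to the training ("good") class.
        return self.threshold_ - self._log_density(X)
```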

#### 2.3.3. Isolation Forest

The isolation forest algorithm was proposed as an anomaly detection method [39], but it can also serve as a one-class classification algorithm regardless of whether instances not belonging to the class represented in the training set are interpreted as anomalous. Its model representation consists of multiple isolation trees grown with random split selection. These are not standard decision or regression trees, since no labels or values are assigned to leaves, and they just partition the input space. Splitting is stopped whenever a single training instance is left or a specified maximum depth is reached.

In the prediction phase each isolation tree is used to determine the path length between the root node and the leaf at which the instance arrives after traversing down the tree along splits. Instances that do not belong to the class represented in the training set can be expected to be easier to isolate (have shorter paths) than those which do belong to the class. The average path length over all trees in the forest can then serve as a decision function value for determining whether an instance is likely to belong to this class or not. The original algorithm transforms this average path length into a standardized anomaly score for generating alerts in anomaly detection applications, using the expected depth of unsuccessful BST searches. This is not necessary for one-class classification, since the negated average path length is sufficient to rank new instances with respect to their likelihood of not belonging to the class represented by the training set.

An extended version of the isolation forest used for this work employs multivariate rather than univariate splits [40]. This eliminates a bias that resulted in the original algorithm from using axis-parallel hyperplanes for data splitting.
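A sketch using the standard scikit-learn IsolationForest on stand-in data; note that this is only an approximation for illustration, since the study uses the extended variant with multivariate splits [40] and does not state which implementation was employed.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative stand-in data: training on "good" paths only.
rng = np.random.default_rng(0)
X_good = rng.normal(size=(104, 9))
X_new = rng.normal(size=(10, 9))

iso = IsolationForest(n_estimators=500, random_state=0).fit(X_good)

# score_samples returns higher values for instances that are harder to
# isolate; negate it so that higher scores indicate likely "bad" paths.
bad_scores = -iso.score_samples(X_new)
```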

#### 2.3.4. Maximum Entropy Modeling

The maximum entropy modeling or maxent algorithm was originally developed for ecological species geographical distribution prediction based on available presence data, i.e., locations where a given species has been found and their attributes, used to derive environmental features [41]. These features, besides raw continuous attribute values, include attributes derived by several transformations, as well as binary features obtained by comparing continuous attributes with threshold values and by one-hot encoding of discrete attributes, with an internal forward feature selection process employed based on nested model comparison [42].

The algorithm, following the maximum entropy principle [43], identifies a species occurrence probability distribution that has a maximum entropy (i.e., is most spread out) while preserving constraints on environmental features. These constraints require that the expected values of environmental features under the estimated species occurrence probability distribution should be close to their averages from the presence points. The obtained model can provide, for an arbitrary point, the prediction of the species occurrence probability.
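In the standard maxent formulation, the estimated occurrence distribution $\pi$ over the background space maximizes entropy subject to feature-expectation constraints:

$$\max_{\pi}\; H(\pi) = -\sum_{x} \pi(x)\ln \pi(x) \quad \text{subject to} \quad \big|\,\mathbb{E}_{\pi}[f_j] - \tilde{f}_j\,\big| \le \beta_j \;\; \text{for all features } f_j,$$

where $\tilde{f}_j$ is the empirical mean of feature $f_j$ over the presence points and $\beta_j$ is a regularization width. The solution takes the Gibbs (exponential) form $\pi_{\lambda}(x) \propto \exp\big(\sum_j \lambda_j f_j(x)\big)$, with the weights $\lambda_j$ fitted by regularized maximum likelihood.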

Despite its original intended purpose, maxent has been found to be useful as a general-purpose one-class classification algorithm [44,45]. Training instances take the role of "presence points", and input attributes are used to derive "environmental features", whereas background points can be generated by uniformly sampling the attribute ranges. Model prediction for an arbitrary instance can be interpreted as the probability that it belongs to the class represented in the training set.

#### *2.4. Model Evaluation*

Both the binary and one-class classification algorithms used in this work produce scoring predictive models – their predictions are numeric values ranking instances with respect to the likelihood of being a "bad" path (or, equivalently, of not belonging to the class represented in the training data). When applying standard binary classification quality measures to evaluate these predictions, we refer to the "good" class (represented in the training data for one-class classification) as *negative*, and to the "bad" class (not represented in the training data for one-class classification) as *positive*.

ROC and precision-recall (PR) curves are used to visualize the predictive performance of the obtained models. ROC curves make it possible to observe possible tradeoff points between the *true positive rate* (the share of positive instances correctly predicted to be positive) and the *false positive rate* (the share of negative instances incorrectly predicted to be positive) [46,47]. PR curves similarly present possible levels of tradeoff between the *precision* (the share of positive class predictions that are correct) and the *recall* (the same as the true positive rate). The overall predictive power is summarized using the area under the ROC curve (AUC) and the area under the precision-recall curve (PR AUC).

In our case the true positive rate and the recall are the share of "bad" paths that are correctly predicted to be "bad", the false positive rate is the share of "good" paths that are incorrectly predicted to be "bad", and the precision is the share of "bad" class predictions that are correct. The area under the ROC curve can be interpreted as the probability that a randomly selected "bad" path is scored higher by the model than a randomly selected "good" path, and the area under the PR curve can be interpreted as the average precision across all recall values.
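A sketch of how these summaries can be computed with scikit-learn (an assumed implementation choice), treating "bad" as the positive class; the labels and scores below are placeholders for the pooled cross-validated predictions.

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# y_true: 1 for "bad" (positive), 0 for "good"; scores: higher = more "bad".
y_true = [0, 0, 0, 1, 0, 1, 0, 0, 0, 1]
scores = [0.1, 0.3, 0.2, 0.9, 0.4, 0.7, 0.1, 0.2, 0.3, 0.8]

roc_auc = roc_auc_score(y_true, scores)           # area under the ROC curve
pr_auc = average_precision_score(y_true, scores)  # summary of the PR curve
```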

When using data with heavily imbalanced classes, where positive instances are extremely scarce, even numerous false positives do not substantially increase the false positive rate, since the number of false positives may still be small relative to the dominating negative class count. This is not the case for the precision, though, which is much more sensitive to false positives. Therefore precision-recall curves may be expected to be more informative and to better highlight differences in the predictive performance obtained using different algorithms. For a more complete picture, however, both ROC and PR curves are presented.

To make effective use of the small available dataset for both model creation and evaluation, as well as to keep the evaluation bias and variance at a minimum, the *n* × *k*-fold cross-validation procedure (*n* × *k*-CV) is applied [48]. The dataset is split into *k* equally sized subsets, each of which serves as a test set for evaluating the model created on the combined remaining subsets, and this process is repeated *n* times to further reduce the variance. To evaluate binary classification models, the random partitioning into *k* subsets is performed by stratified sampling, preserving roughly the same number of minority class instances in each subset. For the evaluation of one-class classification models a one-class version of the *n* × *k*-CV procedure is used, in which the few instances of the "bad" class are never used for training but always used for testing. The true class labels and predictions for all *n* × *k* iterations are then combined to determine ROC curves, PR curves, and the corresponding AUC values.
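A sketch of this evaluation scheme under the stated assumptions (k = 3, n repetitions, "bad" paths excluded from every training fold in the one-class variant); the model-fitting callable is a placeholder, and this is not the authors' actual evaluation code.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Binary classification: stratified n x k-fold CV (k = 3, n = 50 in the study).
binary_cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=50, random_state=0)

def one_class_cv_scores(X, y, fit_score, n=50, k=3, seed=0):
    """One-class n x k CV: "bad" paths are never trained on, always tested.

    fit_score(X_train, X_test) is a placeholder callable that fits a one-class
    model on "good" training paths and returns scores for the test paths.
    """
    y = np.asarray(y)
    good = np.where(y == "good")[0]
    bad = np.where(y == "bad")[0]
    rng = np.random.default_rng(seed)
    y_true, y_score = [], []
    for _ in range(n):
        folds = np.array_split(rng.permutation(good), k)
        for i in range(k):
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            test = np.concatenate([folds[i], bad])   # all "bad" paths tested
            scores = fit_score(X[train], X[test])    # trained on "good" only
            y_true.extend((y[test] == "bad").tolist())
            y_score.extend(np.asarray(scores).tolist())
    return np.array(y_true), np.array(y_score)
```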

While it appears a common practice to use the leave-one-out procedure rather than *k*-fold cross-validation when working with small datasets, the only potential advantage of the former would be avoiding the pessimistic bias resulting from the fact that in each iteration of the latter a fraction $\frac{1}{k}$ of the data is not used for model creation. However, the leave-one-out procedure has high variance (which cannot be reduced by multiple repetitions, since the procedure is fully deterministic), and excluding only a single instance from the training data may cause optimistic bias due to underrepresenting the differences between the training data and the test data. We therefore find it more justified to use the *n* × *k*-fold cross-validation procedure, where the variance is substantially reduced, and accept the fact that it may be pessimistically biased. This means that our reported results may underestimate the actually achievable predictive performance levels, which should be preferred to any risk of optimistic bias.

#### **3. Results**

In the experimental study presented in this section, the binary and one-class classification algorithms described in Sections 2.2 and 2.3 are applied to the small and imbalanced dataset described in Section 2.1. The objective of the study is to verify the level of optical channel classification quality that can be obtained using these two types of algorithms. For binary classification, the effects of class imbalance compensation using instance weights and synthetic minority class instance generation are also examined.

#### *3.1. Algorithm Implementations and Setup*

The following algorithm implementations are used in the experiments:


Since the *xgboost* algorithm does not directly support discrete attributes and one attribute in the dataset is discrete, it was preprocessed by converting discrete values to binary indicator columns.
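A minimal sketch of this conversion, assuming a pandas data frame representation of the dataset (an assumption; the article does not describe the implementation):

```python
import pandas as pd

# Tiny illustrative frame; only the discrete modulation attribute matters here.
X = pd.DataFrame({"transponder_modulation": ["QPSK", "16QAM", "QPSK"],
                  "number_of_hops": [3, 2, 4]})

# Convert the discrete attribute into binary indicator (one-hot) columns.
X = pd.get_dummies(X, columns=["transponder_modulation"])
```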

The tiny size of the dataset and, particularly, the number of "bad" path configurations make it hardly possible to perform algorithm hyper-parameter tuning. While the performance evaluation obtained by *n* × *k*-fold cross-validation could be used to adjust algorithm settings and improve the results, as demonstrated in our previous work [14], without the possibility to evaluate the expected predictive performance of the tuned configurations on new data this could lead to overoptimistic results. This is why the algorithms are used in the following mostly default configurations, with only a few parameters set manually where defaults are unavailable or clearly inadequate:


setup for a small dataset), and the maximum tree depth is the ceiling of the base-2 logarithm thereof,

• **maximum entropy modeling:** All available attribute transformations [42] are applied to derive environmental features (linear, monotone, deviation, forward hinge, reverse hinge, threshold, and binary one-hot encoding), a significance threshold used for internal feature selection is set to 0.001, and the generated background data size is 1000.

For imbalance compensation with instance weighting, the majority class weight is fixed at 1 and the minority class weight is set to values from the following sequence: 1, 2, 5, 10, 20, 50, 100 (where 1 corresponds to no weighting). When using synthetic instance generation, the number of generated minority class instances is set to *d* − 1 times the number of real minority class instances, where *d* takes values from the same sequence as above. This can be achieved exactly for SMOTE and only approximately for ROSE due to its probabilistic nature.

The *n* × *k*-fold cross-validation procedure is used with *k* = 3 (since there are only 3 minority class instances) and *n* = 50 (to keep the evaluation variance at a minimum).

#### *3.2. Classification Performance*

For each of the binary and one-class classification algorithm configurations described above, cross-validated ROC and PR curves, with the corresponding area under the curve values, are reported and briefly discussed below. A bootstrap test (with 2000 replicates drawn from the data) is used for verifying the statistical significance of the observed AUC differences.

#### 3.2.1. Binary Classification

Figure 4 presents the ROC and PR curves obtained for binary classification with instance weighting. The numbers in the parentheses after algorithm acronyms in the plot legends specify the minority instance weight value. For readability, only the results without weighting and with the best weight value are included. All the observed differences are statistically significant according to the bootstrap test except for those between RF(1) and XGB(5), and between RF(20) and XGB(5). One can observe that:


**Figure 4.** The ROC and PR curves for binary classification with instance weighting.

Figure 5 presents the ROC and PR curves obtained for binary classification with synthetic instance generation. The numbers in the parentheses after algorithm acronyms in the plot legends specify the minority class size multiplication coefficient. For readability, only the best results obtained when using SMOTE and ROSE are included, along with the results obtained without synthetic instance generation as a comparison baseline. All the observed differences are statistically significant according to the bootstrap test. One can observe that:


**Figure 5.** The ROC and PR curves for binary classification with synthetic instance generation.

#### 3.2.2. One-Class Classification

The ROC and precision-recall curves for one-class classification are presented in Figure 6. All the observed differences are statistically significant according to the bootstrap test except for the one between IF and OCNB. One can observe that:


**Figure 6.** The ROC and PR curves for one-class classification.

#### **4. Discussion**

As discussed in Section 2.4, ROC curves do not provide a sufficient picture of model performance under severe class imbalance, because even with many false positives the false positive rate remains small due to the dominating overall negative class count. This is why they suggest that all the investigated algorithms achieve excellent prediction quality and their models exhibit only minor performance differences. Precision-recall curves indeed show a more useful view of the predictive quality of models obtained by particular algorithms and better highlight the differences between them.

For binary classification algorithms, the simple instance weighting technique appears more useful than the more refined and computationally expensive synthetic instance generation techniques. This may be surprising at first, but actually neither SMOTE nor ROSE is well suited to working with datasets that are not only heavily imbalanced but also very small. With just three minority class instances (two of which remain for model creation within a single cross-validation fold) there is probably not enough real data to provide a reliable basis for synthetic data generation.

One-class classification algorithms, although using less input information (training data of the majority class only), all produce clearly better models than the best of those obtained using binary classification. The isolation forest algorithm turns out to deliver a superior overall predictive power and considerably more preferable operating points, with near-perfect detection of true positives ("bad" paths) without excessively many false positives. While all the algorithms deliver high quality models, the one-class naive Bayes and isolation forest algorithms clearly outperform the one-class SVM and maxent algorithms. It is particularly noteworthy that they can provide high precision in a wide range of recall values.

This study suggests that standard methods of handling class imbalance may be insufficient when the dataset is of a very small size. Indeed, it is not only the small share, but also the small absolute number of "bad" paths that prevents binary classification algorithms from creating more successful models. While the skewed class distribution can be compensated for by weighting, just a few training instances provide a very poor basis for detecting generalizable patterns and for generating synthetic instances. Using only "good" paths for model creation leads to better results. The best obtained one-class models, providing a precision level of about 0.7, are much more practically useful than the best binary classification models with precision just above 0.3.

#### **5. Conclusions**

This work has provided additional evidence that applying machine learning to optical channel classification is a promising research direction, but one associated with important challenges. To achieve models applicable in real-world conditions one has to use real-world datasets, but these suffer from severe imperfections, the most important of which are a small size and a heavy class imbalance. We have demonstrated that state-of-the-art binary classification algorithms may not achieve a very high level of prediction quality even when coupled with appropriate imbalance compensation techniques. The utility of the latter may be limited by the fact that it is not only the relative share of the minority class instances in the data that is small, but also their absolute count. The reported results confirm that one-class classification is a viable alternative, and models learned using majority class data only achieve better classification precision than those obtained using binary classification learning from all data.

Our findings provide encouragement to continue this research direction by extending the input representation with additional attributes, applying more one-class classification algorithms, and tuning their parameters to further improve the predictive performance. Gathering additional data would not only make the results of these enhanced future studies more reliable, but also make it possible to examine further ideas, such as model transfer between different networks or combining models trained on data from different networks. Expert knowledge on the physics of optical networks may permit defining alternative or additional path attributes, creating a more adequate input space representation for machine learning. Such knowledge could also be used to design a domain-specific data augmentation method that might be expected to perform better than general-purpose techniques of synthetic minority-class instance generation.

**Author Contributions:** Conceptualization, P.C., S.K. and S.S.; methodology, P.C., S.K. and S.S.; software, P.C.; validation, S.K., S.S. and P.C.; formal analysis, P.C., S.K. and S.S.; writing—original draft preparation, P.C., S.K. and S.S.; writing—review and editing, S.K., S.S. and P.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** Not applicable.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The authors would like to thank the anonymous reviewers for their valuable comments.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


*Article* **Real-World Data Difficulty Estimation with the Use of Entropy**

**Przemysław Juszczuk 1,\*, Jan Kozak 2, Grzegorz Dziczkowski 2, Szymon Głowania 2, Tomasz Jach <sup>2</sup> and Barbara Probierz <sup>2</sup>**


**Abstract:** In the era of the Internet of Things and big data, we are faced with the management of a flood of information. The complexity and amount of data presented to the decision-maker are enormous, and existing methods often fail to derive nonredundant information quickly. Thus, the selection of the most satisfactory set of solutions is often a struggle. This article investigates the possibilities of using the entropy measure as an indicator of data difficulty. To do so, we focus on real-world data covering various fields related to markets (the real estate market and financial markets), sports data, fake news data, and more. The problem is twofold: First, since we deal with unprocessed, inconsistent data, it is necessary to perform additional preprocessing. Second, we use the entropy-based measure to capture the nonredundant, noncorrelated core information from the data. Research is conducted using well-known algorithms from the classification domain to investigate the quality of solutions derived based on the initial preprocessing and the information indicated by the entropy measure. Eventually, the best 25% of attributes (in the sense of the entropy measure) are selected to perform the whole classification procedure once again, and the results are compared.

**Keywords:** entropy measure; real-world data; preprocessing; decision table; classification

#### **1. Introduction**

In present times, we are facing the problem of a large amount of data flowing from different sources. In the era of the Internet of Things (IoT) and big data, the challenge is to effectively use and present the acquired data without generating redundant information. Due to the size of data available for decision-makers, it is nearly impossible to manually make any complex decisions. This difficulty is experienced even in machine learning algorithms, which must manage too many attributes, variables, and additional constraints, resulting in the whole process being lengthy and complicated [1]. As such, it is essential to simplify data in the cases where the decisions should be made very quickly, and a need exists to use a decision support system to maintain the decision-maker's sovereignty.

The main drawback of existing datasets is their uniform structure. For data related to a single domain, the distribution of attribute values, the size of the data, and the overall difficulty of classifying a given dataset are expected to be at a similar level. However, in the case of more general approaches, we often face inconsistency in the data, including the need to use additional knowledge from domain experts. In general, data available in repositories are mostly preprocessed and directed at a particular problem (such as classification or regression). At the same time, the initially collected data may still be very complex.

The above problem has led to the construction of many complex algorithms and methods intended to decrease the complexity of the data used in the decision process.


Among these methods, we can emphasize approaches for reducing the number of variables included in the algorithm [2,3]. The idea of initially preprocessing the data by feature selection, removing redundant data, or replacing existing attributes with more general ones is not a new concept, and it has been studied deeply in many articles where initial data reduction was needed. Examples of such feature selection methods can be found, for example, in extensions of the Principal Component Analysis method; one of the newest review articles on this subject is [4]. A more general approach to feature selection involving swarm methods is presented in [5,6], while one of the newest review articles related to swarm methods is [7]. The second large set of algorithms used for feature selection is related to tree-based methods, in which attributes can be selected based on their importance in the process of building the tree (classifier). An example comparison of such algorithms can be found in [8].

In many cases, data dependencies are not linear. Thus, more complex methods of variable elimination should be applied. For example, in the case of periodically important variables, or in situations where linear dependencies between elements are not obvious, different methods must be used to emphasize the crucial variables in the system. To avoid redundancy in the data, the selected variables should exhibit little or no mutual correlation. This requirement was described in [9] in terms of the illusion of validity: people place confidence in results that are based on redundant data. Thus, in decision support systems and during attribute selection, the role of decision-makers can also be marginalized.

A method that effectively identifies the crucial variables present in complex data can be essential for the efficiency of the whole system. However, in the case where the data structure and its complexity make the data difficult or even impossible to process, the decision-maker faces a two-step problem: First, there is a need to adapt the data to fit the algorithm's input format. This can be achieved by additional preprocessing methods, leading to a data format acceptable as the algorithm's input. However, the whole process may be lengthy and complex. It often covers concepts such as filling in missing data, discretization, and scalarization. Dealing with missing data cannot be solved with simple methods, and the literature covers various approaches to this problem [10–12].

Thus, today we observe many algorithms dedicated to a particular domain, which, in contrast to general approaches, can deal with their problems more efficiently. However, one should note that such general methods can still be beneficial, even as a starting point for emerging domains related to complex or big data. Our idea was to collect raw data from different fields and prepare it in a uniform, easy-to-analyze format based on decision tables. At the same time, we tried to use tools that are as general as possible, which unfortunately can lead to a decrease in classification quality. However, it maintains the generalized approach for all datasets.

Furthermore, we selected entropy as a concept that allows us to describe the disorder of the data. By disorder we understand here a measure of complexity: more complex data (in which fewer dependencies between objects and attributes are visible) is characterized by higher entropy values. Therefore, we assumed that an increase in entropy can be equated with an increase in data difficulty. This assumption is verified by performing the actual classification on various datasets. Eventually, the results of classification on the full set of attributes and on the subset generated on the basis of entropy can be compared. It is expected that high entropy should lead to less effective classification.

The entropy measure is considered from the point of view of all attributes. Thus, it is possible to identify the attributes with small disorder values (smaller entropy values). A subset of attributes with small entropy could be used to perform the classification while the data is limited.

In our data, a clear distinction exists between conditional attributes and the decision class. Data from various fields cover different numbers of objects as well as different numbers of attributes. However, the common goal is to perform a classification task on the presented data. The second step of our research focused entirely on estimating the impact of the entropy-based measure on the classification task. First, we tried to determine whether entropy can be effectively used to indicate data difficulty. Eventually, we investigated the results of the classification of the data. We expected that, initially, all conditional attributes analyzed in the dataset could be treated uniformly (i.e., have similar entropy values). Thus, the main questions were: is there a correlation between the entropy values and the quality of classification, and can the entropy-based measure be used to select the best-fitted attributes for the classification problem? To summarize, our research steps were as follows:


To generalize our observations as much as possible, we tried to select data from various fields and describe the whole preprocessing framework with the use of domain knowledge presented by experts from different fields. Moreover, this preprocessing schema allowed us to use a general data format, which can be effectively used in entropy calculation and, finally, in classification problems.

The paper is organized as follows: In the next section, we present the related studies. In Section 3, we discuss the theoretical background related to the subject, including a description of entropy, decision tables, and efficiency measures used in classification tasks. Section 4 contains a description of the real-world data covering different domains. Section 5 presents the results of our experiments based on entropy calculation as well as the classification problem. Eventually, we conclude the study in Section 6.

#### **2. Related Works**

By classical entropy, we understand a measure of uncertainty associated with some data. The idea was introduced by Shannon in 1948 [13] and further extended, for example, by Renyi and Tsallis [14,15], where Renyi entropy is a generalization of Shannon entropy for specific parameter values.

The classical entropy measure is used as a crucial element in many different algorithms and methods. Among the most prominent examples is the well-known classification algorithm C4.5, developed by Quinlan [16] as an extension of the ID3 algorithm [17]. In both algorithms, entropy is used as a measure to generate a classifier (a decision tree). In C4.5, entropy is used at every step of the algorithm to calculate the information gain for every attribute available in the dataset. A similar idea is used in the greedy heuristic ID3, where, once again, the attribute used as a split criterion for the data is the one with the highest information gain. Such an approach has been successfully used in machine learning [18] and signal processing [19].
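For completeness, the Shannon entropy of a dataset $S$ with class proportions $p_1, \ldots, p_m$ and the information gain of an attribute $A$, as used in ID3 and C4.5, are:

$$H(S) = -\sum_{i=1}^{m} p_i \log_2 p_i, \qquad \mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v),$$

where $S_v$ is the subset of $S$ with value $v$ of attribute $A$; C4.5 additionally normalizes the gain by the split information of $A$ to obtain the gain ratio.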

Entropy is often used as an element of broader methods rather than a standalone measure. It has a role in novel metaheuristics such as an extension of classical particle swarm optimization [20]. In [21], it was used as an alternative approach to the concept of fuzzy sets to measure the uncertainty of the task in a task assignment problem. Entropy was used as an extension of the binary classification problem solved by particle swarm optimization [22]. In many articles, entropy has often been used as a replacement for classical measures such as variance [23].

Entropy mixed with the concept of fuzzy sets was included in an outlier detection approach [24]. In [25], entropy was included as a part of the feature selection mechanism based on fuzzy sets. Finally, a more complex approach, including the fuzzy multicriteria approach based on the TOPSIS method, was presented in [26].

Entropy was used in many different approaches to measure randomness in a clinical trial [27]. In [28], entropy was introduced to measure the uncertainty of ordered sets. In general, it can be used as a measure in fields such as finance [29,30], chemistry [31], physics [32], and more. However, no works used entropy as a general measure for different domains simultaneously. A separate direction of research is devoted to various extensions of classical entropy. In [33], the idea of measuring entropy on different scales (multiscale entropy) was presented. In the case of time-series data, the concept of approximate entropy is often used [34]. In [35], approximate entropy was extended into what is called sample entropy. This idea was further extended in [36]. Both methods were used in different applications to address various dynamic aspects of systems.

Another prevalent extension of the classical measure is permutation entropy, effectively used as a nonlinear measure in different fields such as cyber-security [37] and fault diagnosis in systems [38]. Some preliminary comparisons between the classical entropy measure and Pearson correlation were introduced in [39]. In this example, the authors focused on spatio-temporal data derived from an Internet of Things system.

The idea of using entropy as a complexity measure is well-known, and it has been recently studied by many researchers. Among interesting examples, we mention [40], where information entropy was used to measure the genetic diversity in colonies. Another example covers the general idea of measuring the complexity of time series [41].

Entropy as a measure of diversity was presented in [42], where the authors used Shannon entropy to measure the urban growth dynamics for a case study related to real-world data from the city of Sheffield in the U.K. More complex examples related to health and perception can be found in [43,44]. In the first case, the authors used entropy-based concepts for knowledge discovery in heart rate variability, whereas in the second example, approximate entropy was used for EEG data. Finally, among the newest works from the medical domain, Coates et al. [45] used entropy in the Parkinson's disease recognition process.

#### **3. Methodology**

For a set of objects *X*, every element can be described by a vector of *n* conditional attributes *x\_attr* = {*x\_attr\_1*, *x\_attr\_2*, ..., *x\_attr\_n*}, where *n* is the number of conditional attributes. The decision class is denoted as *x\_class*. Thus, every object is described by a pair (*x\_attr*, *x\_class*). For every conditional attribute, we have an attribute-value pair, and every attribute can have a numeric or symbolic value. In the case of attributes with continuous values, a discretization procedure, limiting the number of values of a single attribute, is often performed.

In classification problems, the decision class *x\_class* of a single object takes one of the values from the set of decision class values.

In this article, we perform the preprocessing of real-world data, which allows transforming the initial raw data into a decision table defined as follows:

$$DS = (X, x\_{attr}, x\_{class}). \tag{1}$$

All analyzed data differ in terms of the size of the set *X* and the number of attributes in the vector of conditional attributes *x\_attr*. We did not assume simplifications related to the cardinality of the decision class. Thus, for some sets, this attribute is continuous, and an additional discretization procedure is needed. Ultimately, for all datasets, the set of values of the decision class *x\_class* is discrete.

#### *3.1. Entropy as a Measure of Classification Uncertainty*

In line with our aim, we wanted to explore the possibility of using entropy as an indicator of data difficulty. Therefore, we treated entropy as a measure of classification uncertainty. In addition, we explored how data can be simplified by using only attributes selected according to their entropy value. To this end, we also examined the information attribute to assess the usefulness of entropy for data simplification.

Assuming that several different symbols describe information, entropy, in its basic form, can be calculated as follows:

$$E(DS) = -\sum\_{i=1}^{|\mathcal{C}|} p\_i \cdot \log(p\_i),\tag{2}$$

where |*C*| is the number of different decision classes, and *p_i* is the probability of occurrence of the *i*-th decision class. With such a definition, entropy can be understood as a measure of data complexity. With an increasing number of decision classes available in the data, the overall complexity increases. In the most trivial case of a single decision class, *p_i* is equal to 1, and *log*(*p_i*) is zero (as is the entropy). Thus, any increase in the number of decision classes leads to higher entropy.
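As an illustration, Equation (2) can be computed directly from the decision class column of a decision table. The following is a minimal Python sketch under our own naming conventions; the column name *decision* and the base-2 logarithm are assumptions, not taken from the study.

```python
import numpy as np
import pandas as pd

def entropy(decision: pd.Series, base: float = 2.0) -> float:
    """Shannon entropy of the decision class (Equation (2))."""
    p = decision.value_counts(normalize=True).to_numpy()
    # 0 * log(0) is taken as 0, so zero-probability classes are ignored
    p = p[p > 0]
    return float(-(p * (np.log(p) / np.log(base))).sum())

# Hypothetical toy decision table with a binary decision class
ds = pd.DataFrame({"word_1": [0, 1, 1, 0],
                   "decision": ["true", "fake", "true", "true"]})
print(entropy(ds["decision"]))  # ~0.811 for a 3:1 class split
```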

The value of the information attribute (Equation (3)) is computed for each conditional attribute to determine how that attribute can change the entropy of the decision table *DS*. The resulting value determines the entropy that can be obtained by considering that attribute.

The information attribute is thus based on the calculation of entropy with respect to the decision classes (Equation (2)), but the calculation is performed on the cases grouped by the values of the attribute being analyzed.

Formally, the information attribute is written as Equation (3); note that this computation is required for each attribute. Here, *k* denotes the analyzed attribute, *m* is the number of possible values of the *k*-th attribute, and |*DS_i*| is the number of instances having the *i*-th value of that attribute (analogously, *DS_i* is the subset of the decision table *DS* restricted to instances with the *i*-th value on attribute *k*).

$$info\\_att(k, DS) = \sum\_{i=1}^{m} \frac{|DS\_i|}{|DS|} \cdot E(DS\_i) \tag{3}$$

In our considerations, *info\_att* is crucial for simplifying the dataset. For each decision table *DS* with *n* conditional attributes, the value is determined using Equation (4). This value is used in further analysis.

$$all\\_info\\_att(DS) = \sum\_{k=1}^{n} info\\_att(k, DS) \tag{4}$$
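Equations (3) and (4), together with the attribute selection used later in the experiments (retaining the 25% of attributes with the lowest information attribute values), can be sketched as follows. This illustrative Python code reuses the entropy helper from the previous sketch; the function and column names are our assumptions.

```python
import pandas as pd

def info_att(ds: pd.DataFrame, attribute: str, decision: str = "decision") -> float:
    """Information attribute value of one conditional attribute (Equation (3))."""
    total = len(ds)
    value = 0.0
    for _, subset in ds.groupby(attribute):
        value += len(subset) / total * entropy(subset[decision])
    return value

def all_info_att(ds: pd.DataFrame, decision: str = "decision") -> float:
    """Sum of information attribute values over all conditional attributes (Equation (4))."""
    return sum(info_att(ds, a, decision) for a in ds.columns if a != decision)

def select_attributes(ds: pd.DataFrame, fraction: float = 0.25,
                      decision: str = "decision") -> list:
    """Keep the given fraction of attributes with the lowest information attribute values."""
    scores = {a: info_att(ds, a, decision) for a in ds.columns if a != decision}
    ranked = sorted(scores, key=scores.get)
    k = max(1, round(fraction * len(ranked)))
    return ranked[:k]
```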

#### *3.2. Classification Measures*

In our research, we wanted to examine the classification quality using state-of-the-art machine learning algorithms. We chose decision trees (CART algorithm) and ensemble methods: Random Forest, Bagging, and AdaBoost. To assess the quality of classification, in addition to the classical measures of classification quality (accuracy), we also used precision (called positive predictive value (PPV)) and recall (called true positive rate (TPR)). Notably, these are binary classification measures, i.e., for a dataset with only two decision classes. In real datasets, there are often more decision classes. Several methods can be used to generalize precision and recall. We wanted to provide as much information as possible in our solutions, so we computed precision and recall for each decision class.

Therefore, for PPV, the analyzed decision class is treated as positive and all others as negative, and analogously for TPR. So, in the definition of the measures of the quality of classification (accuracy in Equation (5), precision in Equation (6), and recall in Equation (7)), we denote:

**TP:** all correctly classified cases of the analyzed class;

**TN:** all cases outside the analyzed class that were not assigned to this class;

**FP:** all cases outside the analyzed class that were assigned to this class;

**FN:** all misclassified cases of the analyzed class.

$$accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$

$$PPV = \frac{TP}{TP + FP}, \tag{6}$$

$$TPR = \frac{TP}{TP + FN}.\tag{7}$$
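For datasets with more than two decision classes, the per-class treatment described above corresponds to the one-vs-rest convention available, for example, in scikit-learn. The sketch below illustrates Equations (5)–(7) on a toy example; it is not the exact evaluation code used in the study.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true and predicted decision class values
y_true = ["yes", "no", "no", "yes", "no", "no"]
y_pred = ["yes", "no", "yes", "no", "no", "no"]

print(accuracy_score(y_true, y_pred))  # Equation (5)

# One PPV / TPR value per decision class (one-vs-rest), Equations (6) and (7)
print(precision_score(y_true, y_pred, average=None, labels=["no", "yes"], zero_division=0))
print(recall_score(y_true, y_pred, average=None, labels=["no", "yes"], zero_division=0))
```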

#### **4. Data Preparation and Preprocessing**

In this section, we provide details of the real-world data used in further experiments. The data were collected from external sources and cover various fields. We adapted the raw data into a decision table format, described in detail in the previous section, to perform the tests based on the classification problem. All necessary steps for data processing are described in this section.

Although the individual datasets required different processing, some general preprocessing steps were common to all of them. These steps are listed below with a short description.


Please note that the last step was applied to both the conditional attributes and the decision attribute (if needed). These were general steps applied to all data; additional steps (for example, related to natural language processing) were performed for selected datasets and are described in detail in the corresponding subsections.

#### *4.1. Fake News Data*

Universal access to the Internet has enabled users to rapidly create and acquire knowledge, but it has also become a threat through the easy spread of false information in the form of fake news. Fake news aims to present users with a view that is not in line with reality or to lead them to wrong decisions or actions based on false information.

The problem of disinformation is best visible on social networking services and news sites, where fake news is spreading widely in the form of sharing, passing on to friends, or creating documents based on unreliable sources [46]. Therefore, it is essential to quickly classify the documents posted and adequately mark the articles as true or fake news. The subject matter of the documents from the fake news dataset is related to many different fields; in particular, it concerns political, media, and financial content, as well as current events [47,48].

Kannan et al. [49] claimed that preprocessing real text data for analysis using machine learning algorithms is always the longest stage and often amounts to around 80% of the total processing time. Therefore, to transform the fake news dataset into a decision table, we propose applying the statistical approach of natural language processing (NLP).

In the first step of NLP, the tokenization process is carried out, dividing a given text into its smallest units (e.g., sequences of words, bytes, syllables, or characters) called tokens. The result is an n-gram model that is used to identify and analyze attributes used in natural language modeling and processing [50]. In our research, we used n-grams to extract individual words from document titles, from which we additionally rejected words appearing on the stop word list. An example of a stop word list is presented in Figure 1.

*a*, *an*, *about*, *are*, *be*, *is*, *was*, *will*, *as*, *how*, *by*, *for*, *of*, *from*, *in*, *on*, *at*, *or*, *and*, *the*, *that*, *these*, *this*, *too*, *what*, *when*, *where*, *who*

**Figure 1.** A sample list of rejected words, the so-called Stop Words.

The next step in NLP is to perform the normalization process using two methods: stemming and lemmatization. The stemming method separates the stem of a word from its endings, so that, eventually, similar words are replaced by the same base word [51]. The method of lemmatization consists of reducing a word to its basic form [52]. The purpose of the normalization process is to reduce the variability in the set of terms.

The final step in the NLP covered in this research is creating a word vector model as a document representation. Our vector model is presented as a matrix (Figure 2), where documents ( *dok*\_1–*dok*\_*n*) are presented in the form of feature vectors representing particular attributes (*at*\_1–*at*\_*n*). In the model, we use a binary representation, where each value from the {0,1} set determines whether the word appears in a given document. In addition, the number of attributes is limited to the most common words in the title of the document. On this basis, the fake news dataset was transformed into a decision table consisting of the attributes of the most common words and a decision attribute (*decision*) containing two classes (true or fake).



**Figure 2.** A sample matrix of word occurrences (selected as conditional attributes) in documents.
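The whole title-processing pipeline (tokenization, stop word removal, and the binary word occurrence matrix of Figure 2) can be approximated with standard tooling. The sketch below uses scikit-learn's CountVectorizer; the example titles, the built-in English stop word list, and the limit of 20 most frequent words are illustrative assumptions (the study additionally applies stemming and lemmatization, which are omitted here).

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical document titles and their decision classes
titles = ["President signs new trade deal", "Aliens secretly run the trade deal"]
labels = ["true", "fake"]

# Binary occurrence matrix limited to the most frequent title words
vectorizer = CountVectorizer(binary=True, stop_words="english", max_features=20)
matrix = vectorizer.fit_transform(titles)

decision_table = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())
decision_table["decision"] = labels
print(decision_table)
```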

The decision table structure consists of columns with conditional attributes and one decision attribute, whereas rows correspond to all documents from the set. The conditional attributes are the words most often appearing in the text. The presence of specific words (in the decision table) is strictly dependent on the analyzed dataset. For this reason, the number of attributes is limited. Table 1 shows an example of the frequency of words (selected as conditional attributes) in the titles of true and fake news.


**Table 1.** The example frequency of words (selected as conditional attributes).

Real text datasets are challenging to analyze due to the large number of attributes [53] that constitute single words for the fake news dataset. The distribution of attribute values due to decision classes (fake and true) is presented in Figure 3.

For each attribute, there is one histogram (Figure 3) consisting of two columns, which corresponds to the number of values for each attribute. The first column shows the number of objects (article content) in which the selected word does not appear (as an attribute value), while the second column shows the number of objects in which the selected word appears at least once. These numbers are shown in the chart. Additionally, each column shows the assignment of a word to the appropriate class: blue is the true class, and red is the fake class.

From this distribution of attribute values across the decision classes (fake and true), it can be seen that some words (such as *word*\_3, *word*\_7, *word*\_17, *word*\_19) do not appear at all in the fake class; the right column is entirely blue. In the case of the first column, however, the division between both classes is roughly equal for almost all attributes.

**Figure 3.** The distribution of attribute values due to decision classes for fake news data.

#### *4.2. User Websites Navigation Data*

Electronic commerce (e-commerce) has become popular as the Internet has grown, with many websites offering online sales, and e-commerce activity is undergoing a significant revolution. The major challenges in research are the collection, identification, and adoption of data supplied by Internet services to provide actionable marketing intelligence.

The main difficulty in web usage mining is the procurement of the desired database, as the only information we can collect from users visiting a website is through tracing the pages they have accessed.

Data collected from log files must be processed before data mining techniques (based on machine learning algorithms) can be used. Then, the personalization process is performed in the six main steps generally used in the field:


The main idea in analyzing user behavior during navigation was to restrict the analysis to sessions with at least 10 actions. Each action corresponds to one page view by the user. We chose the 10-action limit because it was impossible to perform pertinent clustering with fewer than 10 actions per session; the clusters were not significant enough, and the differences between clusters were negligible.

Before selecting the navigation conditional attributes, the hierarchy of the website was derived. An example division of the site is as follows: First, we separated thematic websites to create universes. Websites from each universe were about the same topic. Then, we divided the entire site into seven different universes:


The *store* universe was divided into three levels of hierarchy: section, subsection, and subsubsection. Generally, the final product page corresponds to a subsubsection.

From this hierarchy, we selected conditional attributes that describe the user navigation of our commercial partner's website. The attributes are presented in Table 2.


**Table 2.** Session attributes.

The presented attributes are described as follows:


for example, the user switches universe and then returns to the previous one, the value of this attribute is equal to 2;


For the decision attribute, we chose the binary attribute *purchase*. The decision classes were "yes" and "no". All the attributes were normalized.

The distribution of attribute values across the decision classes of *purchase* is presented in Figure 4. Two colors correspond to the decision classes: blue indicates sessions not completed with a purchase, and red indicates the sessions in which a purchase was made.

**Figure 4.** The distribution of attribute values due to decision classes for user websites navigation data.

As can be seen in Figure 4, some attributes do not discriminate the decision class. For example, the decision class distribution is identical for attributes such as *Day*/*Month*/*Year*, *Hour\_of\_end*, and *Source\_of\_navigation*. On the other hand, attributes such as *Discount\_code*, *Total\_time*, or *New\_user* clearly indicate the *purchase* class. According to the presented data distribution, a user session ending with a purchase is characterized by the values of *avg.\_no\_of\_pages\_viewed* and *average\_amount\_of\_time\_spent\_on\_navigation*; the customer is not on the website for the first time and has a discount code, and the customer does not spend a lot of time in the store section but frequently changes subpages in this category.

#### *4.3. Real Estate Market Data*

The real estate market has grown rapidly in recent years [54]. As such, both the volume of data and the number of processed details have increased. Investors are looking for attractive properties from which profit can easily be earned. As customer habits change, so do the features of a particular property that are essential for buyers.

The change in investor and end-consumer behavior has led to the inclusion of additional details in advertisements of properties. Each advertisement is currently filled with much additional information, some of it structured and some of it only provided in descriptive text. The real estate market data used in this paper originated from actual advertisements presented on multiple Polish market web pages. The details of the adverts are often hidden inside the text describing a particular property. However, many details are often presented in a structured form, allowing less sophisticated automatic scrapers to gather the data. For some of the conditional attributes, it is still necessary to perform more advanced processing. For instance, the *floor\_number* is usually provided as a number in the vast majority of cases. However, there are some occasions where it is stated verbally as "ground floor" or "higher than the 10th floor". Most of the advertising portals do not provide good enough validation of this data, which is why, during the data acquisition, we had to construct more detailed methods to handle the special types of values and data. A similar process had to be performed for geo-encoding the spatial data. In almost every advertisement, the exact address of the property was not given; only the street name and the city were described. Sometimes the street names had spelling errors, were not correctly placed on a map, or used an old street name from before the mandatory change of street names in Poland that recently occurred [55].

Notably, the process of acquiring data from web pages is complicated. The dataset used in the current study consists of the following conditional attributes:


The last attribute, being the decision one, denotes the price per square meter. As this value can fluctuate widely, we transformed it using a simple discretization:

$$bucket = \left\lceil \frac{price\\_per\\_sq\\_unit}{1000} \right\rceil \tag{8}$$
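A minimal sketch of the discretization from Equation (8); the attribute name *price_per_sq_unit* is our assumption.

```python
import math

def price_bucket(price_per_sq_unit: float) -> int:
    """Discretize the price per square unit into 1000-wide buckets (Equation (8))."""
    return math.ceil(price_per_sq_unit / 1000)

print(price_bucket(6450))  # -> 7, i.e., the PLN 6000-7000 bucket
```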

Because of the nature of scraped data and the frequent need to repair or transform the data (e.g., converting units of measurement between imperial and metric), these data are rather difficult to analyze. Furthermore, the large number of attributes, all interesting for the end user, makes this processing even more complicated.

The distribution of attribute values in accordance with the decision classes was created, as shown in Figure 5. Please note that, due to the many values in the decision class, there is no visible color distinction for each class.

**Figure 5.** The distribution of attribute values due to decision classes for real estate market data.

Even though the data have been preprocessed extensively, some of the original erroneous values were left intact. This is the case for the *area* attribute, where one of the flats' areas is set to 349,000 square meters. This is clearly visible in the distribution plot, which is heavily skewed. The same happens with *build*\_*date* (one building has its date set to 892007; there are also typing errors such as 19000 or 20014, where an individual probably inserted an additional 0). Because the number of records with such mistakes is relatively small (less than 0.02%), the authors included these outliers in the dataset to determine their influence on the overall entropy and classification results.

It is clearly seen that most of the properties are situated below the fourth floor, which is expected, as it is far easier to build such buildings in Poland than skyscrapers due to legal reasons. Owners tend to over-estimate the quality of the interior; therefore, the vast majority of apartments have the "ready to be moved" *condition*\_*state*. Most of the analyzed apartments also have modern PVC windows.

#### *4.4. Sport Data*

Sport is a valuable part of many people's lives, understood both as physical activity and as following individual teams or athletes. Football is the most popular sport, with the European leagues being some of the most famous in the world. Therefore, the top leagues from Germany, Italy, and Spain were selected for our analysis.

Numerous studies based on both expert analysis and machine learning techniques for predicting sports results can be found in the literature [56–59]. The most popular and accessible are predictions of match results in the form of win/loss/draw; however, both analyses and predictions may concern other elements such as the number of goals scored, the exact score, or the number of yellow cards [56,60].

The dataset was created from the tabular data available on a website [61]. For complete information, the data were extracted using the scraping method from two tables. The first one contains data about the league table. The second one consists of information about individual matches. The tables were then combined to obtain a full decision table that was divided into sets for each country. The conditional attributes included in the decision tables are presented below:


The same attributes are available for the second team as for the first team. The conditional attributes for the second team were marked by "T2". The last of the attributes is the decision class (*match*\_*result*), which can have three values: 1 indicates a win for team 1, 2 indicates a win for team 2, and X is a draw. Team 1 is the team playing the game on its home field; team 2 is the team playing away.

Figure 6 shows examples of distributions for the data of the German Bundesliga. A significant part of the data is characterized by right-hand asymmetry, which is naturally related to the domain specificity of the data. Representative examples include *Winnings*\_*T*1, *Draws*\_*T*1, *Goals*\_*scored*\_*T*1, *Goals*\_*conceded*\_*T*1, and *Points*\_*T*1. A team starts with a value of 0 for the number of games won/lost, goals scored/conceded, and the number of points. As the season progresses, teams increase the values of these attributes, or the values remain unchanged. This behavior contributes to the right asymmetry in the data. The distribution of *Goal*\_*difference*\_*T*1 is much closer to the normal distribution. In the decision class distribution, it can be seen that the most common value corresponds to a home team win (red), followed by a visiting team win (cyan) and a draw (blue). The counts of the last two classes are much closer to each other. The same observations hold for the "Team 2" data and for the other countries' leagues.

**Figure 6.** The distribution of attribute values due to decision classes for sport data.

#### *4.5. Financial Data*

Within financial data, two main groups can be highlighted. The first is related to the well-known Markowitz model (and its extensions) and the portfolio selection problem, which is beyond the scope of this study. The second group is related to price and indicator data from various markets. In this group, the most popular data are obtained from the foreign exchange (forex) market and concern currency pairs.

A single market indicator (or a group of indicators used jointly) is used in trading systems to generate buy signals. All indicator data were calculated according to market indicator formulas, which can be divided into two separate groups. The first covers trend-following indicators, which include the moving average (*MA*) market indicator. The *MA* for time *t* and *s* periods, denoted *MA_s*(*t*), is calculated as:

$$MA\_s(t) = \frac{\sum\_{i=t-s}^{t-1} price\_i}{s},\tag{9}$$

where *price_i* is the value of the corresponding instrument at time *i*. In the above context, the period is the number of values considered when calculating the indicator. The second group of indicators covers the oscillators, whose primary purpose is to indicate rising or falling potential for the given currency pair. The indicator value is calculated using the currency value and can include the closing, opening, minimum, or maximum currency pair value from previous sessions (or any combination of the above). As an example, the Relative Strength Index oscillator (*RSI*) is calculated based on the last *s* periods at time *t* as follows:

$$RSI\_s(t) = 100 - \frac{100}{1 + \frac{avg\_{gain}}{avg\_{loss}}},\tag{10}$$

where *avg_gain* is the sum of gains over the past *s* periods and *avg_loss* is the sum of losses over the past *s* periods.

All mentioned indicators are calculated based on the currency pair value, which was included in the data. The decision (*BUY* or *SELL*) is based on the indicator value at time *t* and its relation to the indicator value at time *t* − 1. Therefore, the general rule for opening a trade based on an indicator can be defined as follows:

$$\text{cond}\_{\text{Buy}} = \text{true} \text{ if } (\text{ind}\_s(t-1) < c) \land (\text{ind}\_s(t) > c), \tag{11}$$

where *ind_s*(*t*) is the value of indicator *ind* at the present reading *t* considering the last *s* readings, *t* − 1 denotes the previous reading, and *c* is the indicator level (different for each indicator) that should be crossed to observe the signal.
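The indicator and signal definitions of Equations (9)–(11) can be sketched as follows. This is an illustrative Python implementation; the window size *s*, the threshold *c*, and the function names are placeholders, and the RSI is written in the conventional gain/loss ratio form.

```python
import numpy as np

def moving_average(prices: np.ndarray, s: int, t: int) -> float:
    """MA_s(t): mean of the s prices preceding time t (Equation (9))."""
    return prices[t - s:t].mean()

def rsi(prices: np.ndarray, s: int, t: int) -> float:
    """RSI_s(t) based on gains and losses over the last s periods (Equation (10))."""
    diffs = np.diff(prices[t - s:t + 1])
    gain = diffs[diffs > 0].sum()
    loss = -diffs[diffs < 0].sum()
    if loss == 0:
        return 100.0
    return 100.0 - 100.0 / (1.0 + gain / loss)

def cond_buy(indicator: np.ndarray, t: int, c: float) -> bool:
    """Buy signal when the indicator crosses level c from below (Equation (11))."""
    return indicator[t - 1] < c and indicator[t] > c
```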

As shown, the crucial aspect related to generating the signal by the indicator is the value difference between two successive readings. Thus, we decided to include this information in our data in some limited way (in the case of the *MA* indicator). For the remaining indicators, a discretization procedure was performed because, in the classification process performed in the experimental section, only a limited number of indicator values was accepted. The summary for each indicator is presented in Table 3.

**Table 3.** Discretization procedure for the market indicators. * In the rare cases where the indicator value exceeds the border value (cases with the word "above" or "below"), the indicator value is set to the border value.


Each reading in the data also included the decision, taken as one of the following values: *STRONG BUY*, *BUY*, *WAIT*, *SELL*, or *STRONG SELL*. The decision for each set was based on calculating the difference between the present instrument value and the value observed after *p* readings. This schema is presented in Figure 7. In this study, we used *p* equal to 5.

**Figure 7.** Decision calculation method for the financial data.
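The labeling scheme of Figure 7 assigns a decision class to each reading from the price change observed *p* readings ahead. The sketch below is a hypothetical reconstruction: the two thresholds separating WAIT, BUY/SELL, and STRONG BUY/STRONG SELL are not specified in the text and are purely illustrative.

```python
def label_reading(prices, t: int, p: int = 5,
                  weak: float = 0.001, strong: float = 0.005) -> str:
    """Assign a decision class from the price change after p readings (cf. Figure 7).

    The weak/strong thresholds are illustrative assumptions, not taken from the study.
    """
    diff = (prices[t + p] - prices[t]) / prices[t]
    if diff > strong:
        return "STRONG BUY"
    if diff > weak:
        return "BUY"
    if diff < -strong:
        return "STRONG SELL"
    if diff < -weak:
        return "SELL"
    return "WAIT"
```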

The distribution of attribute values in accordance with the decision classes is shown in Figure 8. We selected example data for the AUDUSD instrument; however, a similar distribution of attribute values was observed for the remaining datasets. The blue color in the chart denotes the number of objects for which the STRONG BUY class was observed. The cyan color is related to the STRONG SELL class. These two classes cover the majority of all objects in the data. The red color shows the objects belonging to the SELL class. The two remaining classes are BUY and WAIT, respectively.

In general, we can divide the whole attribute set into three categories. The first is related to the instrument price (*Close* in the chart) and the two indicators (the moving averages) based on the price. For this category, we observe attributes for which several values have a reasonably high number of objects assigned. The second category is related to the same indicators, where the difference between two successive readings was calculated. It gives a distribution close to the normal distribution, where minor differences (close to 0) have a high number of objects assigned. Finally, the last category is related to the oscillator indicators, such as *Bulls* or *OSMA*, for which an approximation of the normal distribution is again observed. For these attributes, the relative change between successive readings was also included. The main problem with these data is that the slight differences (the middle part of attributes number 4 to 11) are frequently observed, while most of the information comes from the relatively significant differences (the tails of the distribution). Thus, the most promising attribute values are the least observed in the data.

**Figure 8.** The distribution of attribute values due to decision classes for financial data.

#### **5. Numerical Experiments**

In this section, we describe the experiments we performed on different real-world datasets. For every set, the experiments consisted of four steps:


• sensitivity analysis on the parameter related to the percent of attributes included in the limited set of attributes.

We selected a group of well-known state-of-the-art algorithms for the classification: decision tree, Random Forest, Bagging, and AdaBoost. Two measures were used to estimate the quality of classification: the positive predictive value (PPV) and the true positive rate (TPR). Additionally, the accuracy of the classification was measured.

#### *5.1. Fake News Data*

The fake news detection research was conducted on the ISOT Fake News Dataset provided by the University of Victoria, Canada [63]. This collection includes 44,898 documents, of which 21,417 are real news cases and 23,481 are fake news. Each document in the set is described with the following attributes:


Additionally, to determine the decision class, the main file was divided into two separate files:


In our fake news detection experiments, the dataset was limited to the *title*, and the *decision (true or fake news)* attributes only. This restriction allowed us to quickly mark the document based on the title without analyzing its content. In our previous research [64], we showed that the fake news detection model analyzing the titles produces accurate results and reduces the runtime of classification algorithms compared to the analysis of the entire content of the document.

In the first step of the experiments, we calculated the entropy of the decision class (see Equation (2)) and the information for each conditional attribute, which were the most common words in the documents. Notably, the values of the decision table (frequency of the occurrence of certain words) are strictly dependent on the documents that comprise the set on which the algorithm was trained. For this reason, the number of attributes was limited to 20. The results of this experiment are presented in Table 4. As can be seen, almost all information values for individual attributes are close to the maximum entropy value (1.0) and are in the range of 0.958–0.998. However, the last row in Table 4 shows the entropy value for the entire dataset.

In general, it is difficult to determine the set of attributes with the greatest impact on the classification results. Only attribute *word\_17* stands out from the other attributes, because its information value is visibly lower, amounting to 0.83. This means that, based on a single attribute (in this case, one word from the document title), it cannot be determined whether the document is true or fake. Moreover, the conditional attributes differ for a different set of documents, which entails the possibility of entirely different entropy values.

In the next step of the experiments, the values of the classification evaluation measures were calculated using selected machine learning algorithms, which were derived for each of two decision classes (true or fake news). Table 5 shows the results for the classification of fake news data by decision class for all twenty attributes.

In the case of the decision class *FAKE*, PPV values were in the range of 91.38–98.88%, with the best result obtained using the decision tree, and TPR values were in the range of 46.05–58.67%, with the best result obtained with Bagging. In the case of the decision class *TRUE*, PPV values were in the range of 62.70–67.46% (Bagging was superior), and TPR values were in the range of 94.65–99.43%, with the best results obtained by the decision tree.

We also checked the influence of a limited number of attributes on the classification results. For this purpose, the 25% of attributes with the lowest information attribute values were selected (in this case, the top five attributes). The results obtained are presented in Table 6, where the values are similar to those in Table 5. This shows that, with a significantly limited number of attributes (in this case, up to five single words per document), the classification results for the algorithms used are practically the same as for the full set of conditional attributes.

The classification accuracy values for the entire set were calculated with respect to the number of attributes (5 or 20 attributes), and the results are presented in Table 7. As can be seen, for three of the algorithms, the accuracy was in the range of 74.17–75.49%, while for the decision tree, the accuracy was slightly lower at 71.51%.

When detecting fake news by title only, the classification accuracy measure determined how many documents were correctly classified. However, when using the PPV and TPR measures, it was possible to assess how many documents in a given class were correctly recalled and with what confidence (precision).


**Table 4.** Information attribute values for fake news data.


**Table 5.** Classification results for fake news data by decision class for full set of attributes [in %] (all bold numbers correspond to the best values obtained).

**Table 6.** Classification results for fake news data by decision class for limited set of attributes (5 attributes selected) [in %] (all bold numbers correspond to the best values obtained).


**Table 7.** Accuracy results for the classification over fake news data [in %].


#### *5.2. User Websites Navigation Data*

A vital part of preprocessing is converting the raw data into a set of navigation attributes. During our research, we obtained data from our commercial partner for one entire year. These data were more than 85 GB in size. For our learning base, we used a sample of one month of data. We chose April to avoid any marketing actions. The database for one month contains more than one million sessions with more than 10 actions performed. On account of the scale of the database, processing is time-consuming. After applying the limitation, we obtained 211,639 user sessions.

For entropy and classification analyses, we eliminated significantly correlated attributes such as *total*\_*amount*. In the end, we obtained 31 attributes and one binary decision attribute, *purchase*.

The dataset for user behavior analysis consists of 211,639 unique rows. Each entry represents a unique user navigation session. We computed the entropy of the decision class and the information values of the individual conditional attributes. The results are shown in Table 8, along with the cardinality of the value set for each conditional attribute.

The entropy values for most attributes were near 0.5. For several attributes, the entropy value was lower than 0.5. For a few attributes, the entropy was less than 0.2. An explanation may be the distribution of values for these attributes, which was strongly unbalanced. In most cases, the value of such an attribute was equal to zero, only occasionally taking different values. Examples of these attributes are *discount*\_*code* and *new*\_*user*. When analyzing other attributes, the values of entropy were similar, indicating that most attributes carried an equivalent level of information. Intuitively, it seems that some attributes should be more discriminatory, but the analysis of the results did not confirm this. There were no highly biased attributes in the analyzed dataset.

Table 9 provides the classification results for the same dataset divided by each decision class value. The efficiency measures indicated relatively accurate results: PPV, TPR, and accuracy values were in the range of 0.89–1. However, both PPV and TPR were better for the decision class equal to "no". The results for the decision tree, Random Forest, and AdaBoost were similar. The results obtained using the Bagging algorithm were visibly worse than for the other algorithms. The PPV value for the class "yes" was around 0.5. Again, the reason seems to be the uneven distribution of the values of the target class.


**Table 8.** Information attribute values for user websites navigation data.

Finally, we limited the attributes used in classification. The limitation was based on the analysis of the entropy value of each attribute. We selected the 25% most significant conditional attributes and performed the classification with this limited number of attributes. The classification results for user websites navigation data by decision class values for the 25% of attributes with the lowest information attribute values are presented in Table 10.

The accuracy results for the user websites navigation data are compared in Table 11. The number of attributes participating in the classification process was 31. After limiting the set of attributes to seven, the classifier efficiency increased, which may be counterintuitive. Depending on the classifier used, the improvement in efficiency ranges from 0% (DT) to 10% (Bagging). This analysis shows the importance of limiting attributes at the data preprocessing stage and of classification parameterization.

**Table 9.** Classification results for user websites navigation data by decision class values for full set of attributes [in %].


**Table 10.** Classification results for user websites navigation data by decision class values for limited set of attributes (7 attributes selected) [in %].


**Table 11.** Accuracy results for user websites navigation data [in %].


#### *5.3. Real Estate Market Data*

The goal of the real estate market data experiment presented in this paper was to find which attributes are essential for AI model creation based on the presented decision table. To achieve this, the values of the information attributes were computed.

The dataset consisted of 14,344 unique rows. There were 13 conditional attributes (described earlier) and one decision (price bucket). In the first experiment, we computed the entropy of a decision class and the information of individual conditional attributes. The results are shown in Table 12, along with the cardinality of the value set for each attribute.

Because the data were obtained from actual advertisements, the cardinality of the decision classes roughly follows a normal distribution (Figure 9). The most frequent price fell into the PLN 6000–7000 per square meter bucket. The far-right side of the histogram shows the luxury properties that are part of the dataset. Note that the property's region heavily influences the real estate market: a property located in the capital is far more expensive than the same property in a less wealthy part of the country. The overall decision entropy is relatively high, as the classification problem is rather difficult. Most of the attributes maintain a similar entropy value, the single exception being the property area. Given the cardinality of this attribute and the fact that the price of a property is usually heavily correlated with the location, this is to be expected. However, a surprising finding is that its information value is still relatively high, which means that the price fluctuation between properties with a similar area is also significant. We found no noticeable changes in information for attributes such as *market*\_*type* or *ownership*\_*type*, indicating that such features are of secondary importance for the selling price.

All attributes except *area* obtained an information value close to the maximal entropy for the whole dataset. That means that no single conditional attribute was enough to predict the price bucket of a given property. Even the *area* conditional attribute, with a visibly lower information value of 2.07, was insufficient to correctly predict the price range. This agrees with intuition: a large but poorly located and unfurnished ruin might be cheaper than a downtown loft.

Table 13 provides the classification results for the same dataset divided by each decision class value. The Bagging algorithm produced the best results by far in nearly every decision class, in terms of both PPV and TPR. When using the limited set of attributes, the results shown in Table 14 were obtained. Overall accuracy results were also superior for the Bagging algorithm (Table 15). Further research is required to determine whether precise fine-tuning of hyper-parameters would increase the quality of the results produced by the other algorithms.


**Table 12.** Information attribute values for real estate market data.

**Figure 9.** Histogram of cardinality of the decision set.


**Table 13.** Classification results for real estate market data by decision class for full set of attributes [in %] (all bold numbers correspond to the best values obtained).

**Table 14.** Classification results for real estate market data by decision class for a limited set of attributes (3 attributes selected) [in %] (all bold numbers correspond to the best values obtained).



**Table 15.** Accuracy results for the classification over real estate data [in %].

#### *5.4. Sport Data*

Three datasets with 3362 unique rows for Spain, 2674 for Germany, and 3359 for Italy were analyzed. There was a total of 26 conditional attributes with *match*\_*result* as a decision class. In the first stage, we calculated the entropy of a decision class and the information attribute values. The results are shown in Table 16, along with the cardinality of the value set for each attribute.

In all three analyzed datasets, the information attribute value was relatively small. It was the lowest for *Goal difference T1* and *Goal difference T2*, oscillating between 1.38 and 1.40. The highest information attribute value was recorded for *Season*. The next conditional attributes with high values were *Round* and *Matches T1 (T2)*. For the remaining measures, the values of attributes were similar. Table 16 presents the entropy of the datasets, all of which are similar (1.52–1.55).

Of the selected methods, random forest had the highest accuracy, followed by the AdaBoost algorithm. The decision tree performed the worst in the classification. None of the algorithms provided a significant advantage in terms of efficiency measures. A summary of the results is presented in Table 17.

Tests were also conducted using fewer attributes (from 24 to 6; 25% of the set, based on the information attribute values). The results obtained are presented in Tables 18 and 19. As can be observed, similar results were obtained with the limited list of attributes. In some cases, the results obtained with the limited set of attributes were better. The best algorithms in this case were AdaBoost and Random Forest, whereas Bagging performed poorly.


**Table 16.** Information attribute values for sport data.




**Table 17.** Classification results for sport data by decision class for full set of attributes [in %] (all bold numbers correspond to the best values obtained).

**Table 18.** Accuracy results for the classification over sport data [in %].


The results (Table 17) show a problem with the prediction of class X (draw), which is best exemplified by the complete lack of prediction results by the Random Forest algorithm for data from Germany and Spain; for the remaining cases, this class had poor results. The unbalanced values in the decision class may be the reason for this finding. Note that a draw between teams seldom occurs.

The classification accuracy for the three sets and all selected algorithms oscillated between 51.55% and 55.85%, which is higher than the random approach (33.33% for three decision classes). The Random Forest algorithm achieved the highest classification accuracy on the Italy dataset, and the lowest accuracy was achieved by Bagging on the Spain dataset. The exact results are presented in Table 18.


**Table 19.** Classification results for sport data by decision class values for limited set of attributes (6 attributes selected) [in %] (all bold numbers correspond to the best values obtained).

#### *5.5. Financial Data Results*

We used daily forex data in this study, which means that every new value was obtained at the beginning of the daily market session. We selected four different currency pairs as separate datasets: AUDUSD, EURUSD, GBPUSD, and NZDUSD, each containing 2865 readings. In addition, we used six different oscillator indicators: the Bulls indicator (*Bulls*), Commodity Channel Index (*CCI*), DeMarker indicator (*DM*), Oscillator of Moving Average (*OSMA*), Relative Strength Index (*RSI*), and the *stochastic*\_*oscillator*. Additionally, the moving average (MA) indicator, calculated for the 14 (*MA*14) and 50 (*MA*50) past readings, was included. In the results, we used both the MA indicator value and the absolute difference between two of its successive readings. This provided an overall number of 10 attributes.

In Table 20, we present the entropy of the decision class along with the information attribute values for the four different datasets. First, there are no visible differences between the entropy values for the different datasets. However, a significant difference exists in the case of the trend-following indicators (the first four attributes, related to the MA indicator). This is most evident for *MA*14 and *MA*50. However, these attributes were not preprocessed and were used as is. Small entropy values suggest strong predictive power of these indicators; however, their practical usability is lower due to the large number of different attribute values (in comparison to oscillator indicators such as *RSI*).

In the case of the oscillators, the information attribute values remained at a similar level, and it would not be easy to identify the best indicators (in the sense of information). However, it is easy to find many articles confirming that the predictive capabilities of these indicators are similar.

Table 21 presents the classification results based on the PPV and TPR measures for the complete set of attributes available in the dataset. The decision class values were highly unbalanced, and in some cases, values such as BUY or SELL did not occur even once. In other cases (such as the GBPUSD dataset), the results were of poor quality because the STRONG BUY or STRONG SELL decision was observed for most cases. In general, however, the AdaBoost algorithm was slightly better than the Bagging algorithm for these rare buy or sell values. For the remaining cases, all four algorithms achieved similar results, oscillating between 30% and 40%. Lower results for some cases (such as STRONG BUY for the EURUSD dataset) could be related to the market situation and the overall dominance of the bearish trend.


**Table 20.** Information attribute values for the financial data (all bold numbers correspond to the best values obtained).

**Table 21.** Classification results for the financial data by decision class for full set of attributes [in %].




Next, we performed the classification once again on the limited set of attributes. The results are presented in Table 22. For both measures (PPV and TPR), the quality of classification slightly worsened. However, the results improved for some rare cases (for example, EURUSD and GBPUSD and the TPR measure). This was achieved despite considerably reducing the number of conditional attributes included in the classification process.

**Table 22.** Classification results for the financial data by the decision class values for 25% of attributes with the lowest information attribute (in %).




Finally, we analyzed the classical accuracy measure for two cases: with the full set of conditional attributes and with the limited set. These results are presented in Table 23. Surprisingly, the results do not indicate that the full set of attributes yields the highest accuracy values. The results are ambiguous; in some cases (AUDUSD or EURUSD with the Bagging algorithm), the accuracy was higher using the limited number of attributes.

These observations were also confirmed for the remaining sets. Thus, it can be assumed that some core sets of attributes can allow obtaining a relatively accurate classification. However, dependencies between these attributes are more sophisticated than simple linear correlations.


**Table 23.** Accuracy results for the classification over the financial data [in %].

#### *5.6. Attributes Selection and the Sensitivity Analysis*

To test and evaluate our entropy-based attribute selection, we used the well-known correlation-based feature selection (CFS) method implemented in the WEKA system [65]. As a result, a subset of attributes including the essential elements was selected; a comparison of the number of attributes obtained by our method and by the WEKA system can be found in Table 24. As can be noted, in most cases, the number of attributes selected by our approach is smaller than the number of attributes selected by the CFS method. Only for the User Websites Navigation Data did the CFS method select fewer attributes: five instead of seven (out of 31 possible). In the case of the financial data, the number of attributes was the same for both methods. In contrast, for the remaining datasets, our proposed method allowed us to use a smaller number of attributes; in the extreme case, the Real Estate Market Data, the CFS method indicated nine attributes (out of 13) instead of the three selected by our method.


**Table 24.** Number of attributes after selection.

The smaller number of attributes resulting from our method does not degrade the overall quality of classification. The classification results after selection are presented in Table 25 (dataset names are written as acronyms). The table shows the difference in classification based on the attribute set computed with the CFS method and with our proposed approach. As can be observed, despite the smaller number of attributes indicated by the proposed method, the classification quality is similar; the difference mostly does not exceed 0.3%. Only for the Random Forest method applied to the Real Estate Market Data is an overall improvement close to 1% observed; this is the case where the number of attributes selected by the CFS method was nine (instead of three in our proposed method). A similar improvement of around 1% occurs for the Sport Data. For the Financial Data, the highest differences (favoring our proposed method) were observed: in the case of the Random Forest and Bagging algorithms, the CFS attribute selection worsened the results by over 2%. For the Financial Data, in both cases, the classification was performed based on two attributes.


**Table 25.** Accuracy results for the classification over the data after selection [in %].

For the proposed method, we used a threshold of 25% of attributes included in the classification. This threshold was chosen to evaluate whether a small subset of attributes allows maintaining relatively high classification quality. The attributes were selected as the most important from the point of view of the entropy measure. The threshold was set experimentally, based on several different indicators. Going below 25% could limit the subset of attributes to two or even a single attribute for the analyzed data. At the same time, for many attribute subsets, a visible decrease in classification quality could be observed. An example chart for the Sport Data (Germany) is presented in Figure 10, where the quality of classification (the Y-axis) is plotted against the number of attributes (the X-axis). The vertical line marks the 25% of attributes used in this article.

**Figure 10.** Classification accuracy depending on the number of attributes.
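The sensitivity analysis behind Figure 10 can be reproduced by sweeping the number of retained attributes and re-running a classifier. The sketch below is an illustrative outline using scikit-learn's Random Forest with cross-validation; it reuses the info_att helper sketched in Section 3 and assumes the conditional attributes are already numeric.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def sensitivity_curve(ds, decision="decision", cv=5):
    """Classification accuracy as a function of the number of retained attributes."""
    attributes = [a for a in ds.columns if a != decision]
    # Rank attributes from the lowest to the highest information attribute value
    ranked = sorted(attributes, key=lambda a: info_att(ds, a, decision))
    curve = []
    for k in range(1, len(ranked) + 1):
        x, y = ds[ranked[:k]], ds[decision]
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        curve.append((k, cross_val_score(model, x, y, cv=cv).mean()))
    return curve
```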

#### **6. Conclusions and Future Works**

In this study, we investigated the possibilities of using the entropy measure to select the best set of conditional attributes to be used in a classification problem. The general idea of entropy, related works, and the problem background were introduced in the first part of the article. We also selected real-world data covering different fields. These data were retrieved and described with the help of domain experts. Finally, preprocessing was applied to all datasets, which were transformed into decision tables.

The datasets differed in their complexity, number of objects, number of conditional attributes, and the number of decision classes. Our goal was to calculate the entropy of decision classes and the information attribute values. Furthermore, we performed the classification with a set of well-known state-of-the-art algorithms. To estimate the quality of classification, we used the recall, precision, and accuracy measures. After the initial results, we selected the 25% best attributes (attributes with the best information attribute values) and performed the classification on the limited number of attributes.

In most cases, the algorithms obtained similar results. However, there were some examples, such as the real estate dataset, in which Random Forest produced better results using only the limited attribute set. The Bagging algorithm showed slightly lower classification accuracy. By its nature, the Random Forest algorithm provides similar but not identical results on each run. The hyperparameters of Random Forest are the most amenable to fine-tuning, but optimizing the parameters of each algorithm for each dataset was beyond the scope of this study. Notably, the value of real estate cannot be classified using only the significance of attributes; emotions and non-technical factors must also be considered. For instance, we were unable to quantify the "cool" factor of a given property.

For the remaining datasets, the results were not uniform. It was difficult to identify the attributes with the best information attribute values; the differences in these values among the attributes within a single dataset were often negligible. Nevertheless, we were eventually able to select a subset of attributes with which the classification procedure was performed once again. Surprisingly, the limited set of attributes often allowed obtaining similar classification results. Unfortunately, it was impossible to capture the complex, nonlinear relations among the conditional attributes within a single dataset.

For classification, we used classical algorithms considered state-of-the-art. However, a multicriteria efficiency measure based on different entropy types could provide much more useful information, especially for complex datasets without a uniform structure (such as Big Data). At the same time, we only investigated entropy in its basic form. An interesting extension would be to introduce different entropy measures or even to derive estimates based on other entropy types.

In this article, we demonstrated some advantages over classical methods; however, the results are not uniform. Therefore, our future work will focus on extending the number of analyzed datasets and emphasizing quantitative results rather than describing every single piece of data used in the experiments.

**Author Contributions:** Conceptualization, P.J. and J.K.; methodology, G.D., S.G., T.J., P.J., J.K. and B.P.; software, J.K.; validation, G.D., S.G., T.J. and B.P.; formal analysis, P.J. and J.K.; investigation, P.J.; resources, G.D., S.G., T.J., P.J., J.K. and B.P.; writing—original draft preparation, G.D., S.G., T.J., P.J., J.K. and B.P.; writing—review and editing, G.D., S.G., T.J., P.J., J.K. and B.P.; visualization, G.D., S.G., T.J., P.J., J.K. and B.P.; supervision, P.J. and J.K.; project administration, P.J. and J.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Minimum Query Set for Decision Tree Construction**

**Wojciech Wieczorek 1, Jan Kozak 2,\*, Łukasz Strąk 3 and Arkadiusz Nowakowski 3**


**\*** Correspondence: jan.kozak@ue.katowice.pl

**Abstract:** A new two-stage method for the construction of a decision tree is developed. The first stage is based on the definition of a minimum query set, which is the smallest set of attribute-value pairs for which any two objects can be distinguished. To obtain this set, an appropriate linear programming model is proposed. The queries from this set are building blocks of the second stage in which we try to find an optimal decision tree using a genetic algorithm. In a series of experiments, we show that for some databases, our approach should be considered as an alternative method to classical ones (CART, C4.5) and other heuristic approaches in terms of classification quality.

**Keywords:** query set; decision tree; classification

#### **1. Introduction**

One of the main problems in machine learning is finding associations in empirical data in order to optimize certain quality measures. These associations may take different forms, such as Bayesian classifiers, artificial neural networks, rule sets, nearest-neighbor or decision tree classifiers [1]. Classical decision tree learning is performed using statistical methods. However, due to the large space of possible solutions and the graph representation of decision trees, stochastic methods can also be used.

Decision trees have been the subject of scientific research for many years [2]. The most recognized algorithms in that class are ID3 [3], C4.5 [4], and CART [5]. There are also works on the evolutionary approach to generating trees. The most popular ideas connected with this research direction are described in the article of Barros et al. [6]. Other approaches, for instance, the ant colony system, also have been studied [7]. To evaluate the performance of our approach, the following methods are selected for comparison: C4.5, CART (classification and regression trees), EVO-Tree (evolutionary algorithm for decision tree induction) [8], and ACDT (ant colony decision trees) [9]. We test the predictive performance of our method using publicly available UCI data sets.

The present proposal concerns the construction of decision trees that maximize the quality of classification measures, such as accuracy, precision, recall, and F1-score, on a given data set. To this end, we introduce the notion of minimum query sets and provide a tree construction algorithm based on that concept. The purpose of the present proposal is fourfold:


**Citation:** Wieczorek, W.; Kozak, J.; Strąk, Ł.; Nowakowski, A. Minimum Query Set for Decision Tree Construction. *Entropy* **2021**, *23*, 1682. https://doi.org/10.3390/e23121682

Academic Editor: Geert Verdoolaege

Received: 14 November 2021 Accepted: 10 December 2021 Published: 14 December 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

of MIP solvers makes it possible to tackle the tree induction problem for large-size instances and to compare our approach with existing ones.

4. Sharing our program to enable future comparisons with other methods. The Crystal language implementation of our method is publicly available via GitHub (https://github.com/w-wieczorek/mining, accessed on 8 December 2021).

This paper is organized into six sections. In Section 2, we present the necessary definitions and facts originating from data structures and classification. Section 3 briefly introduces the related algorithms, while Section 4 describes our tree-construction algorithm based on solving an LP (linear programming) model and the genetic algorithm. Section 5 shows the experimental results of our approach with suitable statistical tests. Concluding comments and future plans are given in Section 6.

#### **2. Preliminaries**

In this section, we describe some definitions and facts about binary trees, decision trees, and the classification problem that are required for good understanding of our proposal. For further details about the topic, the reader is referred to the book by Japkowicz and Shah [10].

#### *2.1. Observations and the Classification Problem*

In supervised classification, we are given a training set called samples. This set consists of *n observations* (also called *objects*):

$$X = \{\mathbf{x}\_1, \mathbf{x}\_2, \dots, \mathbf{x}\_n\}. \tag{1}$$

For each 1 ≤ *i* ≤ *n*, an observation *xi* is described by *m attributes* (also called *features*):

$$d(\mathbf{x}\_i) \in A\_1 \times A\_2 \times \cdots \times A\_m, \tag{2}$$

where *Aj* (1 ≤ *j* ≤ *m*) denotes the domain of the *j*-th attribute and *d* : *X* → *A*1 × ··· × *Am* is a function. The values of the attributes can be quantitative (e.g., a salary) or categorical (e.g., sex: "female" or "male"). Furthermore, each observation belongs to one of *k* ≥ 2 different *decision classes* defined by a function *c* : *X* → *C*:

$$c(\mathbf{x}\_{i}) \in C = \{c\_{1}, c\_{2}, \dots, c\_{k}\}.\tag{3}$$

We assume that there are no two objects with the same description and different decision classes, that is, for any 1 ≤ *q*, *r* ≤ *n*, *q* ≠ *r*,

$$d(\mathbf{x}\_q) = d(\mathbf{x}\_r) \Rightarrow c(\mathbf{x}\_q) = c(\mathbf{x}\_r). \tag{4}$$

Based on the definitions given above, the *classification problem* can be defined as follows: assign an unseen object *x* to a class, knowing that there are *k* different decision classes *C* = {*c*1, *c*2, ... , *ck*}, each object belongs to one of them, and that *d*(*x*)=(*a*1, *a*2, ... , *am*). When *k* = 2, we are faced with the problem called *binary classification*. A learning algorithm L is first trained on a set of pre-classified samples *S*. In practice, a set *S* consists of independently obtained samples, according to a fixed—but unknown—probability distribution. The goal of an algorithm L is to produce a "classifier" which can be used to predict the value of the class variable for a new instance and to evaluate the classification performed on some test set *V*. Thus, we can say that in the learning process, a hypothesis *h* is proposed and its classification quality can be measured by means of accuracy, precision, recall, etc.

#### *2.2. Decision Trees*

We define a *binary tree* recursively as a tuple (*S*, *L*, *R*), where *L* and *R* are binary trees or the empty set, and *S* is a singleton set containing the value of the *root*. If *L* and *R* are empty sets, *S* is called a *leaf node* (or *leaf*); otherwise, *S* is called a *non-leaf node*. If (*U*, *L*1, *R*1) is a binary tree and *L*<sup>1</sup> = (*VL*, *L*2, *R*2) or *R*<sup>1</sup> = (*VR*, *L*2, *R*2), then we say that there is an *edge* from *U* to *VL* (or from *U* to *VR*). Furthermore, *VL* and *VR* are called, respectively, left and right sons of *U*.

Let *Q* = {*Q*1, *Q*2, ... , *Qt*} be a collection of binary tests (called *queries*) *Qi* : *X* → {0, 1}, where *X* is a set of objects for which we define functions *d* and *c* as described in (2)–(4). A *decision tree*, *TX*, is a binary tree in which each non-leaf node is labeled by a test from *Q* and has non-empty left and right subtrees; each leaf is labeled by a decision class; the edge from a non-leaf node to its left son is labeled 0 and the one to its right son is labeled 1. If *Qi*1, *Oi*1, *Qi*2, *Oi*2, ... , *Qih*, *Oih* is the sequence of node and edge labels on the path from the root to a leaf labeled by *c*∗ ∈ *C*, then *c*(*x*) = *c*∗ for all objects *x* ∈ *X* for which *Qij*(*x*) = *Oij* for all *j* (1 ≤ *j* ≤ *h*). We also require that the leaves of a decision tree cover the whole set *X* in this manner, i.e., for all *x* ∈ *X*, there is at least one path from the root to a leaf corresponding to *x*.

The tree in Figure 1 is said to have a depth of 3. The *depth* (or *height*) of a tree is defined as the number of queries that have to be resolved down the longest path through the tree.

**Figure 1.** An exemplary decision tree.

Naturally, every decision tree *T* can play the role of a classifier as long as the queries can be resolved for other objects, i.e., those outside the training set. Given a new object, say *y*, one may apply the queries from the tree, starting from the root and ending in a leaf that indicates the predicted class *p* to which the object should belong. Every query in the tree directs us to the left or right son, toward a leaf. We denote such a prediction as *T*(*y*) = *p*.

#### *2.3. Quality of Classification*

To assess the quality of classification, we use the classical measures of classification quality: accuracy (5), precision (6), recall (7), and F1-score (8). Notably, these are binary classification measures, i.e., for a data set with only two decision classes. However, there are often more decision classes in data sets, so we use the so-called macro method to determine the values of these measures. Thus, in the definitions, we denote the following: *TPi* to identify all correctly classified cases of the *ci* class; *TNi* to identify all cases outside the *ci* class that are not assigned to this class; *FPi* to identify all cases outside the *ci* class that are assigned to this class; *FNi* to identify all misclassified cases of the *ci* class; and *k* as the number of decision classes.

$$acc = \frac{1}{k} \sum\_{i=1}^{k} \frac{TP\_i + TN\_i}{TP\_i + TN\_i + FP\_i + FN\_i} \tag{5}$$

$$prec = \frac{1}{k} \sum\_{i=1}^{k} \frac{TP\_i}{TP\_i + FP\_i} \tag{6}$$

$$recc = \frac{1}{k} \sum\_{i=1}^{k} \frac{TP\_i}{TP\_i + FN\_i} \tag{7}$$

$$f1 = \frac{1}{k} \sum\_{i=1}^{k} \frac{2 \cdot TP\_i}{2 \cdot TP\_i + FP\_i + FN\_i} \tag{8}$$
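
For illustration, measures (5)–(8) can be computed from a per-class confusion matrix as in the following sketch; the function name and the example matrix are illustrative only.

```python
# Hedged sketch of the macro-averaged measures (5)-(8), computed from a
# confusion matrix; function and variable names are illustrative.
import numpy as np

def macro_scores(conf: np.ndarray) -> dict:
    """conf[i, j] = number of objects of true class i predicted as class j."""
    k = conf.shape[0]
    total = conf.sum()
    acc = prec = rec = f1 = 0.0
    for i in range(k):
        tp = conf[i, i]
        fn = conf[i, :].sum() - tp
        fp = conf[:, i].sum() - tp
        tn = total - tp - fn - fp
        acc += (tp + tn) / total
        prec += tp / (tp + fp) if tp + fp else 0.0
        rec += tp / (tp + fn) if tp + fn else 0.0
        f1 += 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {"acc": acc / k, "prec": prec / k, "rec": rec / k, "f1": f1 / k}

# Example with three decision classes (toy numbers only).
cm = np.array([[10, 2, 0], [1, 8, 1], [0, 3, 9]])
print(macro_scores(cm))
```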

#### **3. Related Works**

This section describes the tree construction methods taken for our comparison. These are the well-known deterministic algorithms C4.5 and CART, and the stochastic, population-based algorithms EVO-Tree and ACDT.

#### *3.1. C4.5*

Developed by Ross Quinlan in 1993 [4], the C4.5 algorithm became one of the most popular decision tree-based algorithms [11], implemented as a standard in data mining tools such as Weka (https://www.cs.waikato.ac.nz/~ml/weka/, accessed on 8 December 2021). Conceptually, the heuristic is a more advanced version of the ID3 algorithm proposed by the same author in 1986 [3]. The tree-building process recursively chooses the attribute with the highest information gain ratio: the higher an attribute's information gain, the closer to the root it is placed. Each selected feature splits a node's set of samples into subsets enriched in one class or the other [12]. To avoid over-fitting, pruning is used to remove parts of the tree that minimally affect the estimated classification error. In contrast to ID3, improvements are included to handle missing values and continuous data [12].
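
As an illustration of the split criterion described above, the following sketch computes the gain ratio for a discrete attribute; the helper names are ours and are not taken from any C4.5 implementation.

```python
# Illustrative sketch of the gain-ratio criterion, assuming discrete attributes;
# rows are tuples of attribute values with the class label in the last position.
import math
from collections import Counter

def shannon_entropy(values) -> float:
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(rows, attr_idx, class_idx=-1) -> float:
    """Information gain of splitting on `attr_idx`, normalized by split information."""
    n = len(rows)
    groups = {}
    for r in rows:
        groups.setdefault(r[attr_idx], []).append(r[class_idx])
    gain = shannon_entropy([r[class_idx] for r in rows]) - sum(
        len(g) / n * shannon_entropy(g) for g in groups.values())
    split_info = shannon_entropy([r[attr_idx] for r in rows])
    return gain / split_info if split_info > 0 else 0.0
```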

#### *3.2. CART*

The classification and regression trees algorithm was co-authored by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone in 1984 [5] and is one of the most widely used decision tree algorithms [11]. CART is a binary (each node has two branches), recursive, and non-parametric algorithm. It can be used for regression and classification problems. The tree-building process uses the Gini impurity measure to determine the attribute order in the tree [12]. The measure can be interpreted as the probability of incorrectly classifying a randomly chosen observation from the sample data if the attribute under consideration is selected as the new decision tree node. The pruning mechanism is complex and produces a sequence of nested pruned trees, all of which are candidate optimal trees. The best one is identified by evaluating the predictive performance of every tree in the pruning sequence by cross-validation.
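
The Gini impurity used by CART can be sketched as follows; the function name and the toy label lists are illustrative only.

```python
# A small sketch of the Gini impurity that CART uses to rank candidate splits.
from collections import Counter

def gini_impurity(labels) -> float:
    """Probability of misclassifying a random object from `labels` if it were
    labelled according to the class distribution observed in `labels`."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini_impurity(["yes", "yes", "no", "no"]))    # 0.5, maximally mixed node
print(gini_impurity(["yes", "yes", "yes", "yes"]))  # 0.0, pure node
```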

#### *3.3. EVO-Tree*

The EVO-Tree algorithm [8] is an evolutionary algorithm that generates binary decision trees for classification. It minimizes a multi-objective fitness function that balances the number of correctly classified instances against the size of the generated decision tree. The algorithm starts with a randomly initialized population of trees and uses two standard genetic operators: crossover and mutation. The crossover creates offspring by replacing a randomly selected sub-tree in the first parent with a sub-tree from the second parent. The parents are selected in a series of tournaments: in each tournament, a certain number of individuals from the population is randomly picked, and the best individual in terms of the fitness function value is chosen as the tournament winner and put into the pool of parents. The mutation randomly changes both the attribute and the split value of a node. Finally, the algorithm stops if the maximum

number of generations is reached or the fitness of the best individual does not improve after a fixed number of iterations.

#### *3.4. ACDT*

The ant colony decision tree (ACDT) algorithm [7] is an application of ant colony optimization algorithms [13] in the process of constructing decision trees. The good results typically achieved by the ant colony optimization algorithms when dealing with combinatorial optimization problems suggest the possibility of using that approach for the efficient construction of decision trees [14,15]. In the ACDT algorithm, each agent ant chooses an appropriate attribute for splitting the objects in each node of the constructed decision tree according to the heuristic function and pheromone values. The heuristic function is based on the twoing criterion (known from the CART algorithm) [5,16], which helps agent ants divide the objects into two groups. In this way, the attribute which best separates the objects is treated as the best condition for the analyzed node. Pheromone values represent the best way (connection) from the superior to the subordinate nodes—all possible combinations in the analyzed subtrees. For each node, the following values are calculated according to the objects classified, using the twoing criterion of the superior node.

#### **4. Proposed Method**

Our learning algorithm L receives as its input the samples *S*, which are split into two subsets, the training set *X* and the test set *Y* (in the experiments, we chose the proportions 4/7 for *X* and 3/7 for *Y*). The hypothesis space *H*L = {*Ti X*}*i*∈*I* is searched in order to find a decision tree that best approximates the unknown true function. To this end, each tree is validated against *Y*, and as a result we output a tree *T*∗ *X* that minimizes err = |{*y* ∈ *Y* : *T*∗ *X*(*y*) ≠ *c*(*y*)}|. Unfortunately, in practice, we are not able to cover the whole hypothesis space. The selected hypothesis *T*∗ *X* can then be used to predict the class of unseen examples in the validation set, taken for the evaluation of L. More precisely, L has two stages. In the first stage, a minimum query set *Q* is determined by means of zero-one linear programming. In the second stage, the best ordering of *Q*, from the point of view of decision tree construction, is settled by means of the genetic algorithm. Let *x* ∈ *X*, *d*(*x*) = (*a*1, *a*2, ... , *am*), and *v* ∈ *Aj* (1 ≤ *j* ≤ *m*). In our approach, a *query* can be a function defined by *Qi*(*x*) = 1 if *aj* = *v* and *Qi*(*x*) = 0 if *aj* ≠ *v*. Thus, non-leaf nodes contain "questions" such as *Aj* = *v*?.

We require *Q* to be a minimum-size query set satisfying the following condition: for each pair of distinct elements *u*, *w* ∈ *X* with *c*(*u*) ≠ *c*(*w*), there is some query *q* ∈ *Q* such that *q*(*u*) ≠ *q*(*w*). We verified experimentally that this minimality is crucial for achieving good quality decision trees.

#### *4.1. Linear Program for the Minimum Query Set Problem*

Let us show how a collection of queries, *Q*, is determined via an integer program for the training set *X* = {*x*1, *x*2, ... , *xn*}. The integer variables are *zjv* ∈ {0, 1}, 1 ≤ *j* ≤ *m*, *v* ∈ *Aj*, assuming that there are *m* attributes, *A*1, *A*2, ... , *Am*. The value of *zjv* is 1 if some query in *Q* is defined with *Aj* and *v* ∈ *Aj*, i.e., *Aj* = *v*? is taken as a non-leaf node label representing the query; otherwise *zjv* = 0, i.e., there is no query based on *Aj* and *v*. Let us now see how to describe the constraints of the relationship between a set *Q* and a set *X*, with features and classes defined by functions *d* (as in (2)) and *c* (as in (3)), in terms of linear inequalities. For every pair of distinct elements *u*, *w* ∈ *X* with *c*(*u*) ≠ *c*(*w*), we should have at least one query that distinguishes between the two. The following inequality is the standard way of requiring in a linear program that some elements (i.e., queries modeled as 0–1 variables) have to be included in the solution:

$$\sum\_{\substack{1 \le j \le m\\a\_j \ne b\_j}} \left( z\_{j a\_j} + z\_{j b\_j} \right) \ge 1,\tag{9}$$

where (*a*1, *a*2, ... , *am*) = *d*(*u*) and (*b*1, *b*2, ... , *bm*) = *d*(*w*). Obviously, we are to find the minimum value of the linear expression

$$\sum\_{\substack{1 \le j \le m \\ v \in A\_j}} z\_{jv}. \tag{10}$$

Please note that the above problem is computationally hard (that is why we use an ILP solver, specifically the Gurobi optimizer), since Garey and Johnson's [17] NP-complete problem SP6 can easily be transformed to the decision version of the minimum query set problem.
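
For illustration, the zero-one program (9)–(10) can be written down with a generic modelling library such as PuLP; in our experiments we used the Gurobi optimizer, so the solver call, the variable names, and the toy data below are purely illustrative.

```python
# Hedged sketch of the zero-one programme (9)-(10) using PuLP with its bundled
# CBC solver; the toy objects and all identifiers are illustrative only.
import pulp
from itertools import combinations

X = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild")]   # d(x) for each object
c = ["no", "yes", "yes"]                                      # c(x) for each object

m = len(X[0])
domains = [sorted({x[j] for x in X}) for j in range(m)]

prob = pulp.LpProblem("minimum_query_set", pulp.LpMinimize)
z = {(j, v): pulp.LpVariable(f"z_{j}_{v}", cat="Binary")
     for j in range(m) for v in domains[j]}

# Objective (10): use as few attribute-value queries as possible.
prob += pulp.lpSum(z.values())

# Constraints (9): every pair with different classes must be distinguished.
for (u, cu), (w, cw) in combinations(zip(X, c), 2):
    if cu != cw:
        prob += pulp.lpSum(z[j, u[j]] + z[j, w[j]]
                           for j in range(m) if u[j] != w[j]) >= 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
Q = [(j, v) for (j, v), var in z.items() if var.value() == 1]
print("minimum query set:", Q)   # queries of the form "A_j = v ?"
```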

#### *4.2. The Construction of a Decision Tree with the Help of the Genetic Algorithm*

After obtaining a minimum query set *Q* = {*Q*1, *Q*2, ... , *Qt*}, we are ready to create a decision tree *TX* by Algorithm 1.


**Theorem 1.** *Let X be a set of n* ≥ 1 *observations and let Q* = {*Q*1, ... , *Qt*} *be a set of queries such that for every pair of distinct elements u*, *w* ∈ *X with c*(*u*) ≠ *c*(*w*)*, there is some i (*1 ≤ *i* ≤ *t) for which Qi*(*u*) ≠ *Qi*(*w*)*. Then* BUILDTREE(*X*, *Q*) *constructs a decision tree for X.*

**Proof.** Let *TX* be a tree returned by BUILDTREE(*X*, *Q*). The conclusion of the theorem can be written as follows: *TX*(*x*) = *c*(*x*) for an arbitrary *x* ∈ *X*. We prove it by induction on *n*. Basis: We use *n* = 1 as the basis. The tree consisting of one leaf is returned, with the decision *c*(*x*), so *TX*(*x*) = *c*(*x*), where *x* is the only element of *X*.

Induction: Suppose that the statement of the theorem holds for all *k* < *n*, where *k* = |*X*|. We want to show that for an arbitrary *x* ∈ *X*, where |*X*| = *n*, *TX*(*x*) = *c*(*x*) holds. Let us consider two cases: (i) all *x* ∈ *X* have the same decision *c*(*x*), and (ii) there is some *y* ∈ *X* such that *c*(*x*) ≠ *c*(*y*). In the former case, we can easily verify that *TX*(*x*) = *c*(*x*). In the latter case, there is some *i* (1 ≤ *i* ≤ *t*) for which *Qi* splits *X* into two non-empty sets, *XL* and *XR*. An element *x* is put into one of them. If it is *XL* (i.e., *x* ∈ *XL*), by the inductive hypothesis, we can claim that *TXL*(*x*) = *c*(*x*), where *TXL* is the left subtree of a non-leaf node containing *Qi*. Thus, *TX*(*x*) = *c*(*x*). For *x* ∈ *XR*, we can repeat our reasoning.

Therefore, by strong induction, BUILDTREE(*X*, *Q*) constructs a decision tree for any set *X* of *n* ≥ 1 observations.
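
A minimal sketch of a BUILDTREE-style construction, following the case analysis used in the proof (a leaf when all classes agree, otherwise a split on the first query that separates the objects), is given below; it is an illustration only and not a verbatim transcription of Algorithm 1.

```python
# Illustrative BUILDTREE-style recursion; names and the Node structure are ours.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: object                      # query index for a non-leaf, class for a leaf
    left: Optional["Node"] = None      # branch for query answer 0
    right: Optional["Node"] = None     # branch for query answer 1

def build_tree(X, c, Q) -> Node:
    classes = {c(x) for x in X}
    if len(classes) == 1:                       # case (i): a single decision class
        return Node(label=classes.pop())
    for i, q in enumerate(Q):                   # case (ii): first query that splits X
        XL = [x for x in X if q(x) == 0]
        XR = [x for x in X if q(x) == 1]
        if XL and XR:
            return Node(label=i,
                        left=build_tree(XL, c, Q),
                        right=build_tree(XR, c, Q))
    raise ValueError("Q does not distinguish all pairs with different classes")
```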

Please notice that the shape of a tree *TX* depends on the ordering of queries in an array *Q*. As a consequence, the order decides the quality of classification done by a tree returned by function BUILDTREE. That is why we apply the genetic algorithm (Algorithm 2) as a heuristic method to search such a large solution space [18]. Each individual is the permutation of the set {1, 2, . . . , *t*}, which determines the order of *Q* = {*Q*1, *Q*2,..., *Qt*}.

The population size depends on the complexity of the problem, but usually contains several hundreds or thousands of possible solutions. We follow the advice of Chen et al. [19] and take POP\_SIZE = 2*t* ln *t* (they suggested |*P*| = *O*(ln *n*), where *n* is the problem size, while our *n* is *t*!). The initial population is generated randomly, allowing the entire range of possible permutations.

During each successive iteration, a portion of the existing population (T\_SIZE = 3 is chosen during preliminary experiments) is selected to breed a new individual. Solutions are selected through a fitness-based process, where fitter solutions (as measured by a fitness function) are chosen to be parents.

The fitness function is defined over the genetic representation and measures the quality of the represented solution. We use Algorithm 1 to decode a permutation. The number of misclassified objects for a test set *Y* is the fitness value.

For each new solution to be produced, a pair of "parent" solutions is selected for breeding from the pool selected previously. By producing a "child" solution using the crossover and mutation operations, a new solution is created which typically shares many of the characteristics of its "parents". We use partially mapped crossover (PMX for short) because it is the most recommended method for sequential ordering problems [18,20]. In the mutation operation, two randomly selected elements of a permutation are swapped with a probability PROB\_MUTATION = 0.01. This process is repeated until one of two termination conditions is reached: (i) a solution is found that satisfies minimum criteria, or (ii) a fixed number (MAX\_ITER = 500*t*) of iterations is reached. As a result, the best permutation encountered during all iterations is returned.
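
The genetic search over query orderings can be sketched as follows, with POP\_SIZE, T\_SIZE, PROB\_MUTATION, and MAX\_ITER set as described above; the steady-state replacement of a random individual and the `fitness` callback (counting misclassified test objects for the tree decoded by Algorithm 1) are our illustrative assumptions, not a verbatim transcription of Algorithm 2.

```python
# Condensed sketch of the permutation GA: tournament selection, PMX crossover,
# swap mutation; parameter values follow the text, the rest is illustrative.
import math
import random

def pmx(p1, p2, rng):
    """Partially mapped crossover (PMX) for permutations."""
    n = len(p1)
    a, b = sorted(rng.sample(range(n + 1), 2))
    child = [None] * n
    child[a:b] = p1[a:b]
    for i in range(a, b):
        gene = p2[i]
        if gene in child[a:b]:
            continue
        j = i
        while a <= j < b:                  # follow the mapping until a free slot appears
            j = p2.index(p1[j])
        child[j] = gene
    return [p2[i] if g is None else g for i, g in enumerate(child)]

def genetic_ordering(t, fitness, rng=None, t_size=3, prob_mutation=0.01, max_iter=None):
    """Search for the ordering of t queries that minimizes `fitness` (misclassifications)."""
    rng = rng or random.Random(0)
    pop_size = max(2, round(2 * t * math.log(t)))      # POP_SIZE = 2 t ln t
    max_iter = 500 * t if max_iter is None else max_iter
    population = [rng.sample(range(t), t) for _ in range(pop_size)]
    best = min(population, key=fitness)

    def tournament():                                   # T_SIZE individuals compete
        return min(rng.sample(population, t_size), key=fitness)

    for _ in range(max_iter):
        child = pmx(tournament(), tournament(), rng)
        if rng.random() < prob_mutation:                # swap mutation
            i, j = rng.sample(range(t), 2)
            child[i], child[j] = child[j], child[i]
        population[rng.randrange(pop_size)] = child     # steady-state replacement (assumption)
        if fitness(child) < fitness(best):
            best = child
    return best
```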

The final Algorithm 3 is depicted below. Note that heuristic search procedures that aspire to find globally optimal solutions to hard optimization problems usually require some diversification to overcome the local optimality. One way to achieve diversification is to restart the procedure many times [21]. We follow this advice and call the genetic algorithm 30 times, returning the best solution found over all starts.


Because our algorithm relies on solving the minimum query set problem (finding the minimum set of attribute-value pairs that distinguishes every two objects), which is NP-hard, its overall complexity is exponential with respect to the size of the input data. To tackle the problem, we use an integer linear programming solver. As modern ILP solvers are very ingenious, the computing time is not a major problem for practical data sets. Algorithms for solving ILP problems and their NP-completeness are described in [22].

#### **5. Experiments**

This section describes the comparison between the reference methods introduced in Section 3 and our proposed Algorithm 3 devised in the previous section.

#### *5.1. Benchmark Data Sets*

To verify our approach, we select 11 publicly available data sets with different numbers of objects, attributes, and decision classes. The data sets are downloaded from the UCI repository (https://archive.ics.uci.edu/, accessed on 8 December 2021) and are not modified, except for possible ID removal. They are presented in Table 1, where the abbreviation used further in the paper is given in brackets, followed by the number of objects in the data set, the number of attributes, and the number of decision classes.


**Table 1.** Characteristics of data sets.

#### *5.2. Performance Comparison*

In this section, we describe some experiments comparing the performance of our approach implemented in Crystal (https://github.com/w-wieczorek/mining, accessed on 8 December 2021) with ACDT implemented in C++ (https://github.com/jankozak/acdt\_cpp, accessed on 8 December 2021), Weka's C4.5 implemented in Java, Scikit-learn's CART, and EVO-Tree implemented in Python (https://github.com/lazarow/dtree-experiments, accessed on 8 December 2021).

For the purpose of the experimental study, all data sets described in Section 5.1 are divided into three sets: training set (40%), test set (30%), and validation set (30%). For the classical algorithms (CART, C4.5) and EVO-Tree, the training and test sets are combined and used to learn the algorithm, while for the other algorithms, the training and test sets are used separately (according to the rule of the algorithm). In each case, the results are verified through the validation set. In this section, all given values are the results of classification performed on the validation set. So a train-and-test approach is used, but it is ensured that the data breakdowns are exactly the same in each case.
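
The 40/30/30 breakdown described above can be reproduced, for example, with two consecutive splits; the snippet below is an illustrative sketch and not the code used in our experiments.

```python
# An illustrative 40/30/30 split into training, test, and validation parts;
# `X`, `y`, and the fixed seed are placeholders.
from sklearn.model_selection import train_test_split

def split_40_30_30(X, y, seed=0):
    # Peel off the 30% validation set first, then split the remaining 70%
    # into training (4/7 of the rest = 40%) and test (3/7 of the rest = 30%).
    X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.30, random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X_rest, y_rest, test_size=3 / 7, random_state=seed)
    return (X_train, y_train), (X_test, y_test), (X_val, y_val)
```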

Additionally, for the algorithms that do not work deterministically (the proposed MQS and the compared EVO and ACDT) each experiment is repeated 30 times and the values presented in Tables 2 and 3 are the averages. The stability of the results obtained by these algorithms is also tested, which is presented in the form of box plots in Figures 2–4.


**Table 2.** The quality of classification depending on the approach (bold text is the best value).


**Table 3.** Decision tree characteristics depending on the approach.

#### *5.3. Results of Experiments*

The proposed algorithm is compared with two classical approaches and two heuristic algorithms (another genetic algorithm and the ant colony optimization algorithm). Our goal was to experimentally verify whether the MQS algorithm allows finding different (often better) solutions than the compared algorithms. The achieved results show that our assumption is confirmed.

The MQS algorithm, in terms of the analyzed metrics (see Section 2.3), allows for a significant improvement in the results for 3 out of 11 data sets. Thus, in the case of the monks-1 data set, the improvements in classification quality of almost 5% (with respect to CART), almost 7% (with respect to ACDT), about 16% (with respect to C4.5), and as much as about 20% with respect to another genetic algorithm (EVO-Tree) are obtained. There is an even greater improvement for the 2015 Somerville Happiness Survey data set and slightly less for tic-tac-toe.

**Figure 2.** Box plot—accuracy of classification for the MQS algorithm.

**Figure 3.** Box plot—accuracy of classification for the EVO-Tree algorithm.

**Figure 4.** Box plot—accuracy of classification for the ACDT algorithm.

For the remaining data sets, the MQS algorithm obtains similar or slightly worse results; the difference in classification quality is large in only one case, for the soybean-large data set. However, in two more cases, it is noticeable: dermatology and zoo. In each of these cases, the other GA-based algorithm also shows poorer classification quality. As can be seen, the problem concerns sets with a large number of attributes (34 for dermatology, 16 for soybean-large, and 16 for zoo); as the solution space grows (for classification, it depends on the number of attributes and their values), the MQS algorithm has a harder time finding a suitable solution.

Our aim is to propose a new algorithm that allows finding new optima in the solution space (in terms of classification quality) and thus, in some cases, improves the quality of classification compared to other algorithms. Therefore, we do not attempt to improve the size of the tree, the height of the tree, or the algorithm's running time, which is hard to compare between genetic and deterministic algorithms. Nevertheless, we compare these decision tree-related parameters, and the results are shown in Table 3.

As can be seen, the MQS algorithm is similar in decision tree learning time to the other genetic-algorithm-based method (EVO-Tree). However, in terms of decision tree size and height, the proposed algorithm mostly constructs the largest trees. This is probably related to the way the solution space is searched and to the presence of local optima. The size of the decision tree does not correlate with its classification quality (relative to the other algorithms): a significantly larger tree, e.g., for the balance-scale data set, does not improve the results, while for tic-tac-toe, the results improve as the decision tree grows.

The stability of the obtained results is also subject to our analysis, because stability allows us to assume that the classifier will always be of similar quality. While the results of the classical algorithms are deterministic, in the case of MQS, EVO-Tree, and ACDT, a different classifier may emerge each time. Box plots of classification accuracy are prepared for each data set in the case of the MQS (Figure 2), EVO-Tree (Figure 3), and ACDT (Figure 4) algorithms. To prepare the graphs, the corresponding quartiles (the minimum value, lowest on the OY axis; the 1st quartile; the median; the 3rd quartile; and the maximum value, highest on the OY axis) from all 30 repetitions of decision tree learning are determined.

The MQS algorithm is the most stable: in Figure 2, we can see that small differences (compared to the other algorithms) appear only for the Somerville Happiness Survey 2015 and soybean-large data sets. For the other data sets, the results are very repeatable. For the other algorithms, the repeatability of the results is much lower: for EVO-Tree, we can see in Figure 3 that in seven cases the results vary considerably; for the dermatology, soybean-large, and tic-tac-toe data sets, the classification accuracy changes by as much as several dozen percentage points between repetitions. In the case of the ACDT algorithm, the results are more reproducible (Figure 4); significant differences appear in two to three cases, while for the monks-1 set, the difference can be as large as several dozen percentage points.

#### *5.4. Statistical Analysis*

The experimental results of the MQS approach are compared using a non-parametric statistical hypothesis test, i.e., the Friedman test [23,24] for *α* = 0.05. Parameters of the Friedman test are shown in Table 4. The same table presents the average rank values for the compared algorithms for learning decision trees (in terms of classification quality). Results in terms of each of the classification quality measures analyzed are used for statistical testing.

The MQS algorithm obtains a rank of 3.1591, so it is significantly better than the EVO-Tree algorithm (the 5% critical difference is 0.6192); MQS is worse than the other algorithms, but this is by no means a critical difference. Therefore, we confirm that it is possible to use the MQS algorithm in the decision tree learning process, so it should always be considered and tested because it can output a significantly better classifier than the other algorithms. This is especially valid when we are given a data set with a small number of attributes. At the same time, we confirm that the proposed algorithm is significantly better than another genetic algorithm used for decision tree learning.
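
For illustration, the Friedman test can be computed with SciPy as in the following sketch; the accuracy values are made-up placeholders (one entry per data set for each algorithm) and do not correspond to Table 2.

```python
# A minimal sketch of the Friedman test used above; all numbers are placeholders.
from scipy.stats import friedmanchisquare

# One accuracy value per data set for each algorithm -- illustrative only.
mqs      = [0.95, 0.81, 0.77, 0.88, 0.91]
cart     = [0.90, 0.83, 0.79, 0.84, 0.92]
c45      = [0.80, 0.82, 0.78, 0.85, 0.90]
evo_tree = [0.76, 0.70, 0.69, 0.75, 0.80]
acdt     = [0.89, 0.80, 0.76, 0.86, 0.91]

stat, p_value = friedmanchisquare(mqs, cart, c45, evo_tree, acdt)
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4f}")
# If p < 0.05, at least one algorithm differs; mean ranks and the critical
# difference (as in Tables 4 and 5) are then used for pairwise comparison.
```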

**Table 4.** The Friedman test results and mean ranks.


As the EVO-Tree algorithm is found to be critically inferior to all other approaches analyzed, we perform a second round of statistical analysis. The results of the Friedman test and the mean ranks after rejecting the critically inferior method are recorded in Table 5. As can be seen, in this case, none of the methods is critically better or worse than all the others. The big difference remains only when contrasting MQS with CART.

Due to the lack of significant differences, and given that the MQS algorithm can obtain significantly higher results (when it gets a rank of 1, it is better by several to a dozen or so percentage points, whereas for the other methods the advantage is often negligible; see Table 2), the proposed method can be considered for use in selected classification problems.

**Table 5.** Friedman test results and mean ranks after rejection of the critically worse method.


#### *5.5. Discussion*

To evaluate the proposed algorithm, we made comparisons with classical approaches and other non-deterministic algorithms. Since this is a new algorithm proposal, we wanted to make a fair comparison. We used four different measures of classification quality, compared the size and height of the decision trees and the learning time of the classifier, and finally performed statistical tests.

As decision trees learned with non-deterministic methods often search a much larger solution space, this must affect their running time. It can also result in larger, more extensive decision trees. When proposing the MQS algorithm, we knew that classifier learning would require more time. Therefore, its application, like that of other stochastic methods, should be considered for classifiers that are built once in a while, not for online classifiers. Our study confirmed that the MQS algorithm takes longer to learn than the classical (statistical) methods; however, it is comparable to the non-deterministic methods (especially the other genetic algorithm).

In this case, the classification time is more important, and it depends primarily on the height of the decision tree. Our analyses indicate, for example, that the MQS algorithm is better than the CART algorithm in 10 out of 11 cases, while remaining worse than the other algorithms in 7–9 cases. In terms of decision tree size (which affects the memory needed to store the finished classifier), the situation is similar: the MQS and CART algorithms learn larger decision trees than the others. However, it should be emphasized that no pruning of decision trees is performed for the proposed MQS algorithm; at this stage, we wanted to keep the complete decision trees.

However, our aim was to find new alternative classifiers with which a better classification could be achieved. Therefore, the most important analysis concerned the evaluation of classification quality. In this case, we were able to see that for some data sets, the MQS algorithm allows building a classifier better than all the other algorithms.

This is particularly important because often the differences (in classification quality assessment) between different algorithms are a few percentage points. However, for the monks-1, Somerville Happiness Survey 2015 and tic-tac-toe data sets, the MQS algorithm allows a very large improvement in each of the classification quality assessment measures.

We analyzed the exact structure of these data sets. Our observations show that the application of the proposed algorithm can be particularly beneficial for data sets with two decision classes and attributes with a small number of possible values (3–5 values per attribute). However, the decision classes can contain different numbers of objects. This does not mean that the MQS algorithm obtains bad results on other sets; the suggestion described above merely indicates the situations in which a classifier learned by MQS obtains much better classification quality.

Finally, we analyzed the stability of the results obtained. We did this to determine whether the classifiers learned by the MQS algorithm are always of similar quality. For this purpose, we performed 30 independent runs of the algorithm and obtained 30 independent classifiers. We performed the same tests with other stochastic algorithms (EVO-Tree and ACDT). The obtained results clearly indicate that the proposed algorithm is the most stable one, so it can be assumed that the classifier will always obtain similar results.

To confirm our observations, a statistical test was performed twice: first for all approaches (and all classification quality values), and a second time after rejecting the EVO-Tree algorithm (which obtained results critically different from the other algorithms). In the second round, no algorithm was critically different from all the others.

#### **6. Conclusions**

This paper deals with the construction of decision trees from a finite set of observations (objects). To address the problem, we introduced the notion of a minimum query set and made use of a genetic algorithm for a suitable ordering of the found queries. As a result of the implemented algorithm, we obtained decision trees that perfectly match the training data set and have good classification quality on the test set. The conducted experiments and statistical inference showed that the newly proposed two-stage algorithm should be considered as an alternative to classical methods (CART, C4.5) and other heuristic approaches in terms of accuracy, precision, recall, and F1-score for all 11 UCI data sets.

Our method also has a few disadvantages. The most significant ones are that (i) the first stage of our approach relies on solving a computationally intractable problem, and (ii) in some cases, the obtained decision trees have too many nodes. In the near future, we plan to adapt our approach to handle continuous attributes. To make it possible to reproduce our results or apply our method to new data, we share the source code of all algorithms via the GitHub platform.

**Author Contributions:** Conceptualization, W.W.; methodology, J.K., Ł.S. and A.N.; validation, J.K., Ł.S. and A.N.; formal analysis, W.W. and Ł.S.; investigation, J.K., Ł.S. and A.N.; resources, W.W., J.K., Ł.S. and A.N.; writing—original draft preparation, W.W., J.K., Ł.S. and A.N.; writing—review and editing, W.W., J.K., Ł.S. and A.N.; visualization, Ł.S.; supervision, W.W.; project administration, J.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

