Article

Importance of Characteristic Features and Their Form for Data Exploration

1 Department of Computer Graphics, Vision and Digital Systems, Silesian University of Technology, Akademicka 2A, 44-100 Gliwice, Poland
2 Institute of Computer Science, University of Silesia in Katowice, Bȩdzińska 39, 41-200 Sosnowiec, Poland
* Author to whom correspondence should be addressed.
Entropy 2024, 26(5), 404; https://doi.org/10.3390/e26050404
Submission received: 28 March 2024 / Revised: 27 April 2024 / Accepted: 3 May 2024 / Published: 6 May 2024

Abstract:
The nature of the input features is one of the key factors indicating what kind of tools, methods, or approaches can be used in a knowledge discovery process. Depending on the characteristics of the available attributes, some techniques could lead to unsatisfactory performance or may not proceed at all without additional preprocessing steps. The types of variables and their domains affect performance, and any change to their form can influence it as well, or even enable some learners. On the other hand, the relevance of features for a task constitutes another element with a noticeable impact on data exploration. The importance of attributes can be estimated through mechanisms belonging to the feature selection and reduction area, such as rankings. In the described research framework, the data form was conditioned on relevance by the proposed procedure of gradual discretisation controlled by a ranking of attributes. Supervised and unsupervised discretisation methods were applied to datasets from the stylometric domain for the task of binary authorship attribution. For the selected classifiers, extensive tests were performed, and they indicated many cases of enhanced prediction for partially discretised datasets.
MSC:
68T10; 68T30; 68T37; 68U35

1. Introduction

The knowledge discovery in databases (KDD) process refers to discovering useful knowledge or patterns from databases [1]. It encompasses a range of techniques and methodologies from various fields, such as machine learning, data mining, statistical analysis, and use of database systems [2]. The KDD process involves several stages, including data selection, preprocessing, transformation, mining, evaluation, and interpretation of results. It often begins with identifying relevant data sources, followed by cleaning and preprocessing the data to handle noise, missing values, and inconsistencies. Subsequently, various data mining techniques are applied to discover patterns, associations, or clusters within the input data [3]. The discovered patterns are then evaluated to assess their significance and reliability. Finally, the results are presented in a meaningful way to facilitate decision-making and knowledge utilisation and interpretation.
Feature selection [4] plays one of the key roles in the KDD process, more specifically in the data preparation stage. The objective is to identify significant attributes from the entire set of available variables, while at the same time preserving the descriptive and representative qualities of the original set of the input features [5]. One of the means to assess the quality of selected features is the construction of a ranking of attributes [6]. It enables ordering of variables from the most to the least important, based on some adopted criterion. Feature ranking is also referred to as feature weighting. It encompasses the evaluation of individual attributes through the allocation of weights determined by their relevance [7].
Another important step in the KDD process is data preprocessing [8], as the outcome of this stage is subject to exploration, and, therefore, it forms the basis for deriving the results of the data mining process. Proper data preparation and selection of relevant attributes can influence the algorithms used, their range, and operation, which, in turn, translates into the classification results and the discovery of patterns that exist in the input data.
Discretisation represents an aspect of data preprocessing. It involves the conversion of numerical attributes into discrete or categorical ones with a finite number of intervals [9]. It is classified as a data reduction method because it condenses a continuous spectrum of attribute values into a smaller set of discrete values. This simplification process helps to make the data more comprehensible, interpretable, and usable, while potentially eliminating noise. However, it can also result in some loss of information due to the disregard of some properties or relationships present in the continuous nature of attributes [10]. There are many discretisation methods and approaches, and the choice of a particular algorithm has an impact on the obtained discrete form of the features [11].
Supervised discretisation methods take into account class information to find proper intervals among the ranges of attribute values, as opposed to unsupervised algorithms, where the discretiser considers only the range of values being translated, and the number of intervals is given as an input parameter [12]. In the standard procedures used, transformations are applied to all continuous variables at once, before knowledge discovery, and the same mechanism is employed to translate all attribute domains [13].
Taking into account the above characteristics of the knowledge discovery and discretisation processes, in the framework of the research presented in this paper, the authors propose a methodology related to the data preprocessing stage, more specifically, data discretisation. The modification involves conditioning transformations of attribute domains on their relevance. The procedure of data discretisation is performed gradually, separately for each feature, and the selection of features is driven by constructed rankings [14]. As a consequence, not only the original continuous input space is explored, but all the partially discretised variants as well. The processing stops when all variables are translated into their discrete domains.
The proposed methodology was verified through extensive experiments, including two well-known methods for ranking construction, examined in both ascending and descending orders, for selected representatives of both supervised and unsupervised discretisation approaches, and three state-of-the-art classification algorithms. The procedure of gradual discretisation controlled by use of ranking was applied to the datasets from the stylometry domain with authorship attribution as a supervised machine learning task [15,16].
The research presented in the paper makes the following main contributions:
  • Illustration of a research framework dedicated to a gradual discretisation procedure directed by selected rankings of features;
  • Exploitation of multiple discretisation algorithms, with supervised and unsupervised interval construction;
  • Comparison between domain transformations following rankings in ascending and descending directions;
  • Analysis of trends in performance of state-of-the-art classifiers with varied operational backgrounds from the point of view of data representation and interpretation;
  • Observation of the impact of considering information on the relevance of attributes during their discretisation on the performance of the selected classifiers;
  • Application of the proposed methodology in the stylometric domain for authorship attribution tasks.
The paper is organised into six sections. Section 2 constitutes a presentation of the research background, with comments on important issues, methods, and tools employed. Section 3 provides a description of the framework of the research procedure in which attributes are discretised gradually, while following a ranking of attributes, with all relevant considerations. The range of experiments performed and their limitations and parameters are included in Section 4, while Section 5 is dedicated to analysis of the obtained results. Concluding remarks and possible future research directions are provided in Section 6.

2. Background

In this section, some aspects related to the nature of data are considered. In relation to the data preparation stage during the KDD process, approaches to data discretisation and methods of attribute ranking construction are described. Finally, the popular classifiers used in the experiments are outlined.

2.1. Nature of Input Space

The nature of data refers to various aspects of the input space. It involves identifying characteristic features and their importance, including the type of data and their complexity, the structures in which they are stored, and the selection of tools, methods, and algorithms aimed at extracting knowledge from the data and discovering useful new patterns. Understanding the nature of data is crucial when designing systems for their processing, selecting analytical tools, and implementing appropriate data management and protection strategies. One of the important concerns in this context is the possible existence of data irregularities, which can be considered from various points of view [17]. This may involve issues related to data discretisation, the uneven distribution of decision classes within a dataset [18], or possible stratification visible in the data [19].
Knowledge mining algorithms often require data in a discrete form, necessitating the process of transformation for these data [20]. In cases where the data are continuous, their normalisation or standardisation could be needed to ensure that different attributes have a comparable scale. If discretisation is performed, it can lead to a loss of information regarding the relationships and dependencies present in the data. Transforming the data into a discrete form reduces memory usage and computational power requirements and ensures that the data are easier to understand and interpret [21]. However, selecting an appropriate data discretisation method is not a trivial task [22]. Standard approaches apply the transformations to all available features at once, and supervised methods are most often considered superior to unsupervised ones. In the investigations described, representatives of both approaches were involved, but discretisation was executed gradually on attributes, selected based on constructed rankings, which reflected how their importance was estimated.
The problem of imbalanced decision classes [23], where some classes are represented in the dataset to a significantly higher degree than others, can cause difficulties in classification, as algorithms may favour the dominant class. This, in turn, translates into a low accuracy of identification for objects belonging to minority classes and can lead to erroneous conclusions about the effectiveness of the model. Despite high overall accuracy, the model may struggle with accurately recognising important but less frequently occurring cases. In the conducted experiments, to ensure unbiased observations, the considered decision classes were balanced and had equal representation in all datasets.
Stratification, as a technique used in statistics and research, involves dividing a population or a dataset into smaller groups based on one or more attributes. The goal is to ensure the representativeness of the sample for the entire population, which allows for more accurate statistical estimations and translates into classification results. By reducing variance within groups, stratification can lead to a better understanding of population diversity and enables an analysis specific to individual groups. The application of this technique requires resources for data collection and comparative analysis with methods that overlook stratification. When stratification is a known characteristic of the input space, it needs to be taken into account in the performance evaluation step for algorithms involved in knowledge mining. In the performed experiments, stratification was applied and used during the division of datasets into the train and test sets for classification purposes.
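As a minimal, hypothetical sketch of class-stratified splitting (the actual train and test sets in this study were built from separate novels, as described in Section 4.1), scikit-learn's train_test_split can preserve class proportions:

from sklearn.model_selection import train_test_split
import numpy as np

# Hypothetical balanced dataset: 100 samples, 12 attributes.
X = np.random.rand(100, 12)
y = np.array([0, 1] * 50)

# stratify=y keeps the class proportions identical in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print(np.bincount(y_train), np.bincount(y_test))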
All these characteristics of the input data significantly influence any KDD process. They affect every stage, from preprocessing to final analysis, and interpretation as well. Therefore, their understanding is critical for effective pattern recognition. They must be considered in the context of any observations or conclusions drawn from the executed tests.

2.2. Data Transformations

Within the realm of supervised machine learning, numerous algorithms rely on data in discrete forms. In this context, discretisation assumes a critical role during the stages of input data preparation. Its primary function involves converting numerical characteristic features into discrete or nominal ones with a finite number of intervals representing the attribute domains [24]. Discretisation serves as a means of reducing features, aiming to diminish the multitude of values associated with a continuous variable, by segmenting its range into bins. It can also provide some insight into how important the attributes are [25], which can lead to the reduction of entire domains by transforming them into a single categorical representation.
Typically, discretisation follows a series of steps [26]: (i) arranging the continuous values of an attribute to be discretised either in ascending or descending order; (ii) determining and assessing cut-points to divide a range of continuous values or combine neighbouring intervals; (iii) dividing or merging intervals of the attribute’s values based on the chosen discretisation method and criteria; (iv) verifying the stopping criterion using a measure to regulate the entire discretisation process.
Discretisation techniques can be categorised based on various criteria. Among the most commonly recognised classifications are supervised versus unsupervised, local versus global, static versus dynamic, and top-down versus bottom-up approaches [9]. The properties associated with any specific processing are reflected in how intervals are constructed and the cut-points between them selected, and how many categorical values are defined for the variables.
In contrast to supervised methods, unsupervised algorithms disregard instance labels when transforming attribute values [11]. Local approaches focus on a subset of the discretised object space, while global methods consider the entire instance space. Dynamic discretisation involves examining interdependencies among variables, while static methods treat each attribute independently. In top-down processing, a large range is divided into smaller intervals, whereas in the bottom-up approach, small original intervals are merged into larger ones.
Discretisation algorithms from the group of supervised methods are widely considered to yield the most advantageous representation of the data in a discrete domain. Popular standard approaches from this category are the Fayyad and Irani [27] and Kononenko [28] methods. They rely on the class entropy [29] within the intervals under consideration to evaluate cut-points and utilise the Minimum Description Length (MDL) principle [30,31] as a stopping criterion. The process of determining cut-points works in a top-down direction. It begins with a single interval that encompasses all values of the attribute to be discretised. Partitioning continues recursively until a stopping criterion is satisfied.
For the Fayyad and Irani method, firstly, the class entropy $Ent(S)$ is calculated:
$$Ent(S) = -\sum_{i=1}^{k} P(C_i, S) \log\bigl(P(C_i, S)\bigr),$$
where $S$ is a set of $N$ instances with $k$ decision classes $C_1, \ldots, C_k$, and $P(C_i, S)$ is the proportion of class $C_i$ instances in $S$.
For the case of binary discretisation of a continuous attribute $A$, the optimal cut-point $T_{opt}$ is selected by testing and evaluating all possible candidate cut-points $T$. The entropy for a cut-point $T$, which splits the set $S$ into two subsets $S_1$ and $S_2$, where $S_1 \subseteq S$ contains the instances with attribute values $\leq T$ and $S_2 = S \setminus S_1$, is calculated as follows:
$$Ent(A, T; S) = \frac{|S_1|}{|S|} Ent(S_1) + \frac{|S_2|}{|S|} Ent(S_2).$$
For the optimal cut-point $T_{opt}$, the class information entropy $Ent(A, T_{opt}; S)$ is minimal.
The stopping criterion referring to the MDL principle is connected with the information gain:
$$Gain(A, T; S) = Ent(S) - Ent(A, T; S).$$
The discretisation process is applied recursively as long as the inequality (4) is satisfied,
$$Gain(A, T; S) > \frac{\log_2(N - 1)}{N} + \frac{\Delta(A, T; S)}{N},$$
where
$$\Delta(A, T; S) = \log_2(3^k - 2) - \bigl[ k \cdot Ent(S) - k_1 \cdot Ent(S_1) - k_2 \cdot Ent(S_2) \bigr],$$
and $k_1$, $k_2$ denote the numbers of decision classes represented in $S_1$ and $S_2$, respectively.
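A minimal sketch of this accept/reject test for a single candidate cut-point is given below; it assumes base-2 logarithms throughout and non-empty subsets on both sides of the cut, and it is not the full recursive discretiser (all helper names are illustrative):

import numpy as np

def entropy(labels):
    # Class entropy Ent(S) of a vector of class labels, Equation (1).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mdlp_accepts(values, labels, cut):
    # True if splitting attribute `values` at `cut` passes the MDL test (4).
    left, right = labels[values <= cut], labels[values > cut]
    n = len(labels)
    ent_s = entropy(labels)
    ent_cut = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    gain = ent_s - ent_cut                                    # information gain
    k, k1, k2 = (len(np.unique(x)) for x in (labels, left, right))
    delta = np.log2(3**k - 2) - (k * ent_s - k1 * entropy(left) - k2 * entropy(right))
    return gain > (np.log2(n - 1) + delta) / n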
In the case of the Kononenko method, the discretisation process is applied recursively as long as the inequality (6) is satisfied,
$$\log \binom{N}{N_{C_1}, \ldots, N_{C_k}} + \log \binom{N + k - 1}{k - 1} > \sum_j \log \binom{N_{A_j}}{N_{C_1 A_j}, \ldots, N_{C_k A_j}} + \sum_j \log \binom{N_{A_j} + k - 1}{k - 1} + \log N_T,$$
where
  • $N$—the number of training instances,
  • $N_{C_i}$—the number of training instances from the class $C_i$,
  • $N_{A_x}$—the number of instances with the $x$-th value of the given attribute,
  • $N_{C_i A_y}$—the number of instances from class $C_i$ with the $y$-th value of the given attribute,
  • $N_T$—the number of possible cut-points.
The two most commonly used representatives of the unsupervised approaches are equal-width binning and equal-frequency binning. For both, the number of intervals k to be constructed is defined by a user [9]. For the two algorithms, the values of a continuous attribute are sorted and the minimum and maximum values of the discretised feature are identified. In the case of the equal-width method, the range of attribute values is divided into k equal-width discrete intervals. In the case of equal-frequency binning, the range is divided into k intervals such that each bin contains the same number of sorted values.
Both techniques are relatively straightforward but can be influenced by the number of bins specified by the user. A drawback is that when values of a continuous attribute are unevenly distributed, the discretisation process may result in the loss of some information [22]. In the case of the equal-frequency method, numerous instances of a continuous value might lead to that value being allocated into different bins. Therefore, during the determination of cut-points, it is crucial to ensure that duplicate values are assigned to one bin only. In the case of the equal-width algorithm, intervals can be defined for regions of space where no datapoints exist.
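A short sketch of how the two unsupervised schemes determine their cut-points, using NumPy and a user-defined number of bins k (the helper names are assumptions, not taken from any particular software):

import numpy as np

def equal_width_edges(values, k):
    # k equal-width intervals spanning [min, max] of the attribute.
    return np.linspace(values.min(), values.max(), k + 1)

def equal_frequency_edges(values, k):
    # Cut-points chosen so that each bin holds roughly the same number of sorted values.
    return np.quantile(np.sort(values), np.linspace(0, 1, k + 1))

x = np.array([1.0, 1.1, 1.2, 5.0, 5.1, 9.7, 9.8, 9.9])
print(equal_width_edges(x, 3))       # evenly spaced edges; some bins may stay empty
print(equal_frequency_edges(x, 3))   # edges follow the data distribution
# np.digitize(x, edges[1:-1]) then maps each value to its bin index.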
In standard discretisation approaches, all attributes receive the same treatment and are processed at the same time, typically in the data preprocessing step, before data exploration. In the paper, a different way of proceeding is illustrated through the proposed methodology, where discretisation is performed gradually, with transformation of one attribute at a time. In the investigations carried out, selected representatives of both supervised and unsupervised discretisation methods were used, and the procedure involved taking into account the importance of the features to be discretised.

2.3. Importance of Attributes

Feature selection can be executed through two distinct methods: by choosing a subset of attributes or by establishing a ranking of variables based on their significance [32,33]. In both cases, the main goal is to reduce the dimensionality of the data by eliminating irrelevant or redundant attributes. It helps in building more efficient and faster models and facilitates the interpretation of the data. Decreasing the number of features can also reduce the computational and memory requirements.
Creating a ranking of attributes involves evaluating and ordering the features available in a dataset according to their significance or impact on a specific analytical objective, such as model prediction. Various methods can be utilised for ranking construction [34]: statistical tests, entropy-based methods, principal component analysis, or machine learning algorithms, e.g., random forests offer built-in feature importance assessment mechanisms [35]. Some approaches apply a scoring function whose values can be treated as assigned weights, while others, for example, sequential search [36], just return an ordering of attributes. The evaluation of features allows for the identification of the most significant ones and their ordering from the most to the least important, or in reverse order.
Relief and OneR are popular ranking mechanisms, which were studied in the research framework presented. Their implementation is available in the WEKA software [37]. They belong to the category of algorithms that treat all available attributes as relevant and always assign a non-zero score. Both algorithms can handle categorical as well as numerical types of features.
The Relief algorithm falls under methods that rely on the instances present in the training data [38]. When it is applied, each variable accumulates a score that indicates its effectiveness in distinguishing between different classes. At the beginning of the algorithm, all the features are assigned weights with a value of zero. In the iterative process, the nearest instance (neighbour) of the same class (nearest hit H) and the nearest instance of a different class (nearest miss M) are identified. Based on the calculated differences between the feature values of the current instance and its nearest hit and nearest miss instances, the weights of the attributes are updated. Higher scores are assigned to those attributes that demonstrate larger differences for nearest hits and smaller differences for nearest misses. The pseudo-code is listed as Algorithm 1.
Algorithm 1 Pseudo-code for Relief
Input:    set of learning instances X,
            set A of all N attributes,
            set of classes Cl,
            probabilities of classes P(Cl),
            number of iterations m,
            number k of considered nearest instances from each class;
Output: vector of weights w for all attributes;
begin
for i = 1 to N do
            w(i) = 0
end for
for i = 1 to m do
            choose randomly an instance x ∈ X
            find k nearest hits H_j
            for each class Cl ≠ class(x) do
              find k nearest misses M_j(Cl)
            end for
            for l = 1 to N do
               $w(l) = w(l) - \sum_{j=1}^{k} \frac{\mathrm{diff}(l, x, H_j)}{m \times k} + \sum_{Cl \neq \mathrm{class}(x)} \frac{P(Cl)}{1 - P(\mathrm{class}(x))} \sum_{j=1}^{k} \frac{\mathrm{diff}(l, x, M_j(Cl))}{m \times k}$
            end for
end for
end {algorithm}
The difference function for categorical attributes returns one if the values are distinct, and zero if they are the same. For numerical attributes, it provides the normalised difference. After iterating through the dataset, the weights assigned to the features represent their importance. Attributes with higher weights are considered more relevant for classification as they contribute more effectively to discrimination between classes [39].
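The weight update from Algorithm 1 can be sketched in Python as follows; the snippet assumes numerical attributes, precomputed attribute value ranges, and already identified nearest hits and misses (all names are illustrative, not taken from any particular library):

def diff(a, x, y, value_range):
    # Normalised difference between instances x and y on attribute index a.
    return abs(x[a] - y[a]) / value_range[a]

def relief_update(w, x, x_class, hits, misses, class_priors, m, k, value_range):
    # One iteration of the Relief weight update for a randomly drawn instance x.
    #   hits:   list of k nearest same-class instances
    #   misses: dict mapping every other class to its k nearest instances
    for a in range(len(w)):
        w[a] -= sum(diff(a, x, h, value_range) for h in hits) / (m * k)
        for cl, m_cl in misses.items():
            factor = class_priors[cl] / (1.0 - class_priors[x_class])
            w[a] += factor * sum(diff(a, x, mm, value_range) for mm in m_cl) / (m * k)
    return w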
OneR (One Rule) is a simple and effective algorithm used to evaluate the importance of features in a dataset during classification tasks [40]. It examines attributes individually and ranks them based on their ability to discriminate between different classes in the dataset. For each unique value of the chosen feature, the algorithm generates one rule, collectively forming the basis of the classification model. The feature selected by OneR is the one that results in the lowest error rate when predicting the class labels. This algorithm is a straightforward approach to feature ranking; it can provide valuable insight regarding which features are the most informative for classification [41]. However, it may not always capture complex relationships between features, and its effectiveness can vary depending on the dataset and the nature of the problem. Algorithm 2 presents the pseudo-code of the OneR algorithm.
Algorithm 2 Pseudo-code for OneR classifier
Input:    set A of all attributes,
            set of learning instances X;
Output: 1-rule 1-r_B;
begin
CandidateRules←Ø
for each attribute a ∈ A do
            for each value v_a of attribute a do
              count how often each class appears
              find the most frequent class Cl_F
              construct a rule IF a = v_a THEN Cl_F
            end for
            calculate classification accuracy for all rules
            choose the best rule r_B
            CandidateRules ← CandidateRules ∪ {r_B}
end for
choose as 1-r_B the best one from CandidateRules
end {algorithm}
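A compact Python sketch of the attribute evaluation loop in Algorithm 2, assuming the attributes are already categorical (names are illustrative):

from collections import Counter

def one_r(X, y):
    # X: list of instances (lists of categorical attribute values), y: class labels.
    # Returns (best attribute index, its rules, number of misclassified instances).
    best = None
    for a in range(len(X[0])):
        # Most frequent class for every observed value of attribute a.
        counts = {}
        for xi, yi in zip(X, y):
            counts.setdefault(xi[a], Counter())[yi] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(1 for xi, yi in zip(X, y) if rules[xi[a]] != yi)
        if best is None or errors < best[2]:
            best = (a, rules, errors)
    return best

Ordering all attributes by the error counts obtained in this loop yields the OneR ranking of features used in the experiments.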

2.4. Exploration of Input Space

Classification stands as a fundamental activity within the realms of knowledge discovery and pattern recognition. It can be viewed as a function that assigns a class label to the instances characterised by a set of attributes. In this work, three state-of-the-art classifiers were employed, including Naive Bayes, J48 and k-Nearest Neighbours.
The Naive Bayes (NB) classifier is a probabilistic machine learning algorithm used for classification tasks. It is based on Bayes’ theorem, which describes the probability of an event, based on prior knowledge of conditions that might be related to the event [42]. The “naive” part of its name comes from the assumption of independence among features. NB assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. This simplifies the computation and makes it more efficient, although it is often an oversimplification of real-world scenarios. Despite their simplicity and the “naive” assumption, Naive Bayes classifiers often perform well, especially with high-dimensional data, and they are computationally efficient, making them a popular choice for many classification problems [43].
J48 is a decision tree algorithm used in the field of machine learning and data mining. It is an implementation of the C4.5 algorithm created by R. Quinlan [44]. J48 is often used in data classification and prediction. The name "J48" refers to the Java implementation of C4.5 (revision 8) provided in the WEKA workbench. The algorithm constructs a data model in the form of a decision tree. In the structure of this tree, every internal node represents an attribute, and each terminal node, referred to as a leaf, corresponds to a class label. By traversing the tree from the root to the leaves, a decision can be made for a given object under consideration. This approach makes it possible to understand the rationale behind a specific decision and offers a straightforward and intuitive way to represent complex decision-making processes. Thus, decision trees are not only recognised as effective classifiers but also serve as a widely adopted form of knowledge representation [45].
The k-Nearest Neighbours (k-NN) classifier is a simple and intuitive machine learning algorithm that operates on continuous as well as discrete data [46]. The parameter k represents the number of neighbours to consider. They are identified on the basis of a distance metric that depends on the nature of the data—very often the Euclidean distance is used. In the framework of a classification task, the k-NN algorithm classifies the given new object by finding the majority class among its k nearest neighbours in the feature space. In regression tasks, instead of class labels, the algorithm predicts a continuous value by averaging the values of the k nearest neighbours. k-NN is a non-parametric algorithm which is categorised as a lazy learning method—it does not make any assumptions about the underlying data distribution and it does not learn a model during the training phase. Instead, it stores all the training data and makes predictions only when required during the testing phase. The latter property means that in the case of a large data size, the computational cost of determining the distance between objects increases, which significantly affects the performance of the algorithm.
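The experiments rely on standard implementations of these three inducers (J48 is the decision tree provided in the WEKA workbench); purely as an illustration, comparable scikit-learn stand-ins could be instantiated as sketched below, with DecisionTreeClassifier (CART) serving only as a rough proxy for C4.5/J48:

from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

classifiers = {
    "NB":   GaussianNB(),                            # probabilistic, independence assumption
    "tree": DecisionTreeClassifier(random_state=0),  # CART proxy for the J48/C4.5 tree
    "k-NN": KNeighborsClassifier(n_neighbors=5),     # Euclidean distance by default
}

# for name, clf in classifiers.items():
#     clf.fit(X_train, y_train)
#     print(name, clf.score(X_test, y_test))         # classification accuracy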

3. Framework for Discretisation Controlled by Attribute Importance

The current investigation sought to provide observations on the experiments that were as unbiased as possible. Therefore, the algorithm for selective discretisation of the characteristic features, controlled and directed by their rankings, required certain limitations, assumptions, and decisions on the processing paths to take. This section provides comments on all the relevant considerations and explanations for all the steps.

3.1. Input Data and Attributes

All attributes are expected to be of the same fundamental nature, with continuous domains and comparable ranges of values. Only then are their representations before and after discretisation transformations similar. Variables should be selected based only on domain knowledge, without applying algorithms dedicated to feature selection; otherwise, such double processing would make distinguishing their individual impacts impossible. To avoid the influence of imbalance or the existence of multiple classes on recognition, the classification task to be solved is binary, with balanced classes, where both classes are considered to be of the same importance.

3.2. Rankings

Some ranking mechanisms evaluate and then select a subset of attributes, while a zero rank is assigned to the remaining ones. In effect, this means rejecting such non-ranked variables as entirely irrelevant. This could result from applying the notion of entropy in the evaluation of features, which could lead to finding that they do not support class recognition. The methods to be used in the proposed framework must belong to the other category of weighting approaches, which treat all the variables as relevant to some non-zero degree by always assigning a rank different from zero. It is not necessary for any specific score to be given to the features, as the focus is only on the ordering obtained. The evaluation of importance must take place in the continuous domain. A ranking is assumed to involve a standard ordering of attributes, with the most important features at the top and the least important variables placed at the bottom.

3.3. Discretisation Approaches

A static discretisation process is required, where transformations are executed independently of the learner used. All attributes should be separately translated into a discrete domain, and the algorithm should be univariate, that is, not taking into account any interdependencies among the variables. For transformations, supervised as well as unsupervised methods can be used. However, typically, supervised discretisation reflects to some extent how variables support recognition, which can be perceived as their importance, making it similar to ranking procedures. Furthermore, in a top-down algorithm, which starts by constructing a single interval to represent the entire range of continuous values translated to a discrete domain, if all the candidate cut-points are evaluated and rejected, then this single bin remains the sole discrete value for an attribute. In such a case, the attribute is practically removed from considerations in the discrete domain, as nothing can be learnt from its constant value in all samples.

3.4. Inducers

To observe the impact of the discretisation of attributes on the performance of a classifier, the inducer is required to be capable of efficient operation on both continuous and nominal values of variables, without any inherent transformations of features corresponding to discretisation. Since learners are sensitive to the form of attributes to varying degrees, the more varied the mathematical backgrounds and modes of operation of the employed classification systems, the wider the scope of possible observations.

3.5. Starting and Stopping Point

The processing starts with exploration of the datasets in the continuous domain. The performance is evaluated by labelling previously unknown samples in the test sets with reference to knowledge discovered in the train sets, expressed through patterns detected in the real-valued variables. The performance observed in this step constitutes one of the reference points for comparisons in further processing.
The stopping point for the procedure is reached once the set of variables is exhausted, when all are processed, and the entire datasets become discrete. The performance of inducers for the discretised data is the second reference point. Transformations can also be stopped sooner, when some noticeable worsening or increase in performance is observed. However, it can result in missing global maxima and too narrow a focus on some local trends in monotonicity.

3.6. Intermediate Steps and Directions of Processing

The pseudo-code for the ranking-driven discretisation procedure is shown in Algorithm 3. At each processing step, a single attribute is discretised. The variables chosen for the transformation are indicated by their position in a ranking. The ranking can be followed either in descending order, starting with the top positions taken by the most important features and then of gradually decreasing relevance, or in ascending order, beginning with the least important variables, placed at the bottom of the ranking, and then climbing up the ranks. Therefore, putting aside the starting point (all attributes continuous) and the stopping point (all variables discretised), for the rest of the processing, the datasets are partially continuous and partially discrete, and the number of middle steps to take equals the number of available features minus one.
Algorithm 3 Pseudo-code for ranking-driven discretisation
Input:    ranking of attributes RankingA,
            dataset in the continuous domain Data-R,
            direction Direction to pursue ranking RankingA,
            number of attributes N;
begin
TMP-Data ← Data-R
mine knowledge from TMP-Data
evaluate performance for TMP-Data
if Direction = Descending then k = 1 else k = N
while (k > 0) AND (k < N + 1) do
            select attribute from the ranking attr = RankingA[k]
            discretise attr in TMP-Data
            mine knowledge from TMP-Data
            evaluate performance for TMP-Data
            if Direction = Descending then k = k + 1 else k = k − 1
end while
end {algorithm}
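A direct Python rendering of Algorithm 3 is sketched below; `ranking` is assumed to be a list of attribute indices ordered from most to least important, while `discretise` and `evaluate` are placeholder callables for the chosen discretisation method and for mining knowledge plus measuring performance:

def ranking_driven_discretisation(data, ranking, direction, discretise, evaluate):
    tmp = data.copy()
    scores = [evaluate(tmp)]                 # reference point: all attributes continuous
    order = ranking if direction == "descending" else list(reversed(ranking))
    for attr in order:                       # one attribute transformed per step
        tmp = discretise(tmp, attr)
        scores.append(evaluate(tmp))         # partially, and finally fully, discretised data
    return scores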

4. Experiments

The experiments that were carried out began with preparation of the input datasets, with all attributes in the continuous domain. For the available features, the rankings were calculated next. Then, the procedure of selective discretisation controlled by a ranking was applied to the data. All data variants, continuous, partially discrete, and completely discrete, were explored by the selected classifiers. Performance was studied in the context of data form, ranking, and inducer.
The research included L = 2 rankings, examined in both ascending and descending order. They were applied to N = 12 attributes and exploited in the gradual discretisation procedure with M = 20 discretisation approaches tested (two supervised discretisation methods, and two unsupervised discretisation methods with nine variants each). Therefore, per dataset, $1 + M \cdot (2L(N - 1) + 1)$ versions of the data (making a total of 901) were explored by three selected classifiers, and their performance was evaluated with the test sets, discretised accordingly. The parameters of these extensive experiments are commented on in this section, while the results obtained are shown and discussed in the next one.
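Each discretisation approach contributes $2L(N-1)$ partially discretised variants plus one fully discretised variant, and the original continuous dataset is counted once, which is consistent with the stated total:
$$1 + M \bigl( 2L(N - 1) + 1 \bigr) = 1 + 20 \, (2 \cdot 2 \cdot 11 + 1) = 1 + 20 \cdot 45 = 901.$$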

4.1. Data Preparation

To minimise the number of factors that could influence and bias the results of the experiments, a binary authorship attribution was selected as a classification task under study. The problem of attribution of authorship belongs to the stylometric domain [47]. It can be treated as a classification task by training an inducer on samples of known authorship to detect the linguistic characteristics of writing styles. It leads to the construction of stylistic profiles for authors [48]. Then, such profiles are matched to text samples of questionable origin to either confirm authorship, or to deny it [49].
For the stylometric analysis, two pairs of well-known writers were taken: the literary works of Edith Wharton and Mary Johnston formed the basis for the female writer dataset (F-writers), and the novels by Henry James and Thomas Hardy were used for the construction of the male writer dataset (M-writers). To increase the numbers of available text samples, these long texts were partitioned into much smaller parts, keeping comparable lengths [50]. For all these text chunks, the values were calculated for the arbitrarily selected group of lexical descriptors [51] in the form of frequency of occurrence [52] for twelve common two-letter function words as follows: as, at, by, if, in, no, of, on, or, so, to, up. Since these attributes are regular words, when they are referred to in descriptions of the experiments, formatting in italics is employed (e.g., the frequency of occurrence of of).
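As a purely hypothetical sketch (the exact tokenisation and normalisation behind the reported frequencies are not detailed here), the twelve descriptors for one text chunk could be computed as relative frequencies of the function words among all word tokens:

import re

FUNCTION_WORDS = ["as", "at", "by", "if", "in", "no", "of", "on", "or", "so", "to", "up"]

def function_word_frequencies(text):
    # Relative frequency of each function word among all lower-cased word tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    total = len(tokens) or 1
    return {w: tokens.count(w) / total for w in FUNCTION_WORDS}

sample = "So it was, as if by chance, that no one spoke of it at all."
print(function_word_frequencies(sample))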
Due to the division of the novels into smaller parts, the samples obtained were grouped in the input space by these longer works, which led to a stratified space. In such a situation, to arrive at reliable observations, the original works need to be separated into sets to be used for training and testing. Samples based on one and the same novel are more similar, and using them for both stages would result in leaking information and overoptimistic predictions. The conditions for such unreliable evaluation would be given by the popular standard cross-validation technique employed in estimations of performance [19].
Cross-validation relies on multiple execution of train and test procedures, over which an average is calculated, and samples for all steps are selected randomly. In the stratified space, where groupings of points form a specific and known pattern, a completely random selection would increase the probability of biased recognition. To avoid this problem, non-standard cross-validation can be used, not exchanging single samples but rather their groups between the training and test sets. However, such processing results in very high additional computing costs. As a compromise, a different approach can be employed, with a single train and multiple test sets, all constructed based on separate long texts. Then, with respect to performance, the predictions averaged over all test sets are reported. The latter approach was implemented in the research. Apart from the train set, each dataset included two test sets. All the original sets were continuous with balanced data.

4.2. Rankings Employed

In the investigations, two ranking mechanisms were applied to the available features with the continuous domains. Relief and OneR are popular algorithms, implemented in the WEKA workbench [37]. Both rankings assign a non-zero rank to all the available attributes, treating all of them as relevant to some extent. For the two rankings, the resulting order of attributes for the male and female writer datasets is shown in Table 1.
Although the studied datasets share stylometric features, their role for each dataset is considered locally; therefore, they were mostly placed differently in the rankings. For the F-writers the orderings obtained by Relief and OneR were relatively close, in particular in the upper half, for the more important attributes. For the M-writers, there were more differences noted. Despite the similarities, the two rankings were not identical, and both were exploited in the next stage of the experiments as indicators of attributes to discretise one-by-one in a sequential process.
When a ranking was processed upward (in an ascending direction), it meant starting at the bottom of the ranking, with the least important variables and transforming them before moving on to the more relevant variables. Going down the ranking (in a descending direction) was understood as translating in the first discretisation steps the most important features, with the highest ranks, placed at the top of the ranking, and only then proceeding to less relevant variables.

4.3. Discretisation Algorithms

The discretisation methods used were from both supervised and unsupervised categories [24]. Unsupervised equal-width binning (duw) was employed, varying the number of bins to be constructed from two to ten, which returned nine variants of the datasets (duw02–duw10). In the same way, nine discrete variants of the data were obtained by applying unsupervised equal-frequency binning (duf), with the number of intervals ranging from two to ten (duf02–duf10).
The supervised discretisation methods (ds), that is, the Fayyad and Irani (dsF) and Kononenko (dsK) approaches, returned single variants of the data. These two algorithms rely on the MDL principle and calculation of entropy when the construction of intervals is performed. It led to some variables for which one interval was found to represent the entire range of values in a discrete domain. These attributes were effectively excluded from the data mining that followed discretisation. For F-writers, for both the supervised discretisation algorithms, there were six such features (if, in, no, or, so, up). For M-writers and dsK, six variables also had only single bins (as, no, on, so, to, up). For dsF, this set was expanded to seven elements (by adding of).
When a dataset consists of some constituent sets with the same input features, several different approaches to their discretisation can be attempted [53], all with some advantages and disadvantages, in particular, when irregularities in the data are observed [17]. In the experiments, all the sets were discretised independently, based on the characteristics of each individual set.

4.4. Performance Evaluation for Classifiers

In the investigations, three types of classification systems were used: Naive Bayes, J48, and k-NN. The differences in their operation mode and mathematical foundations enabled widening the scope for observations. As the goal of the research was to discover the relations between the attributes' importance and their form, either continuous or discrete, and how these reflect on the performance of inducers, some measure for the estimation of quality was needed. Classification accuracy was chosen as the suitable indicator, showing how good the inducers were overall at attributing the considered authors to the text samples.
The samples to be labelled were entirely unknown to the classifiers and came from long texts that were not used for training. This way of proceeding prevented possible bias in recognition. The samples were grouped into two test sets; predictions obtained for them were averaged, and then, finally, reported as the percentage of texts for which correct authors were found in relation to the total number of samples.
For all inducers, their powers were studied under three conditions. The starting point was in a continuous domain, with all variables real valued. The end point was in a discrete domain, with all features discretised, for all variants of discrete datasets. And, in addition, a space was studied where some features were still continuous while others were discretised.

5. Results and Their Discussion

The results of the experiments can be studied from several perspectives. The performance of classifiers in the original input space needs to be contrasted with the results obtained for all variants of discretised space, with partial or complete transformations of the attributes. The first part of this section includes comments regarding reference points; the second part shows the performance trends observed inside the procedure of gradual discretisation. The third part is dedicated to an examination of the ranges of classification accuracy obtained, using selected calculated statistics.

5.1. Reference Points

To find out if the overall change in representation for attributes from continuous to discrete was advantageous to recognition by a classifier, the performance at the starting point, with all variables real-valued, needs to be compared with the performance at the finish line, with all attributes discretised, which then both constitute reference points. For all three inducers used in the described research, the classification accuracy, evaluated by labelling samples from the test sets, is shown in Table 2. Each column of the table shows the performance of a specific inducer for a dataset, with each row listing the result obtained for a certain data variant, either continuous or discrete. The highest accuracy discovered is marked in bold.
From these results, it can be observed that for all three inducers and both datasets, translation from the continuous into a discrete domain was not always advantageous. However, only for the Naive Bayes classifier, for M-writers, was the maximum obtained in the original input space before transformations. For all variants of the discretisation procedures, when translation was executed for all features, predictions in this case were brought down. On the other hand, for NB and the female writer dataset, the maximum accuracy was found for unsupervised discretisation with the equal-frequency binning approach, with seven bins constructed for all variables.
For the other two classifiers, J48 and k-NN, the maximum classification accuracy was detected in discrete spaces. For the J48 and the F-writers, it was for supervised discretisation by the Kononenko algorithm, and for M-writers, again for duf binning with 10 intervals defined. On the other hand, for k-NN, the maximum was found for M-writers for the Fayyad and Irani discretisation processing, and for F-writers, once again for unsupervised discretisation by equal-frequency binning, for six bins.
Since supervised discretisation is popularly considered superior to unsupervised transformations, it is worth observing that this opinion was not confirmed in these experiments. There were cases where supervised processing led to better results, but also conditions occurred when unsupervised algorithms returned variants of the data that caused improved predictions. However, a maximum was never found for processing with unsupervised equal-width binning, regardless of the number of intervals defined.

5.2. Performance Trends

The discretisation process referring to the importance of attributes shown by a ranking was studied for the three inducers in the context of a particular discretisation method, ranking, and direction of transformations. The results obtained are displayed in Figures 1–12. Each figure shows the performance reported by a classifier for one ranking, for both ascending (left half of a figure) and descending (right half of a figure) directions, for all four types of discretisation algorithms, for either the female or the male writer dataset.
In the charts showing the performance of classifiers, for the unsupervised discretisation methods, the categories on the x-axis show the number of bins constructed for the transformed variables, and for supervised discretisation, the method is given. The data series specify the number of attributes after discretisation, where 0 means that all variables were continuous (the original input space before any transformations), and 12 denotes the situation where the entire set of available features was translated (the final form, with all discrete attributes). The charts on the left display processing in an ascending direction of a ranking; that is, for discretisation, the least important variable was chosen first, then another more relevant, and so on, until the top of a ranking was reached. On the right are included the charts for transformations going in descending order, starting with the most important features, and then those less and less relevant.
For the Naive Bayes classifier and the Relief ranking (see the charts in Figure 1), for the female writer dataset, for both ascending and descending order, discretisation by unsupervised equal-frequency binning was noticeably advantageous, yet in the vast majority of cases only when some subset of the attributes was transformed instead of all of them. Only when proceeding upward and constructing seven or nine bins, or downward with two or seven intervals, did the maximum performance occur for all 12 discrete variables. For equal-width binning, the ascending order still resulted in benefits from partial discretisation, but less so, as for eight or ten bins the predictions for discrete data were lower than the reference point in the continuous domain. For descending order, this reference point was higher than any classification accuracy obtained for partially or completely discretised sets for five out of nine data variants. Only when 5, 8, 9, or 10 intervals were constructed was there some small improvement, when only either two or three of the least important variables were transformed. When both the supervised discretisation methods were applied, for processing up the Relief ranking, enhanced predictions were detected for just three translated variables. Then, a very steep decrease followed. When the order of features was descending, only degradation of the classifier power was observed.
Recognition of the male writers with the NB classifier (see Figure 2) for variables transformed along the Relief ranking led to the conclusion that for ascending order, for all four discretisation methods, there was some improvement when a subset of variables was processed. For the unsupervised methods, only for duf03 and duw08 was the reference point in the continuous domain better than the results observed in partial discretisation. The descending order resulted in rather disappointing predictions for both supervised approaches, and for most duw versions of the data. Only for duw06 and duw07 could some benefits of discretisation be found. Equal-frequency binning returned slightly more advantageous cases, as for six out of a total of nine variants, the classifier working in the continuous domain was outperformed by the NB operating on partially discretised data.
The operation of the Naive Bayes classifier on the F-writer dataset discretised while following the OneR ranking is shown in the charts included in Figure 3. Discretisation using supervised methods for both ascending and descending orders of transformations resulted only in gradually decreased predictions, dropping to very low levels when all attributes were translated. For unsupervised equal-frequency binning, in the ascending direction, when either 6, 7 or 9 bins were formed, the highest performance was reached only in the last processing step, when all variables were discrete. For the descending order, both reference points were the best for some variant of the data: for two bins constructed, discretisation brought only a decrease in predictions, and for seven intervals defined, when all features were transformed, the accuracy was the highest. For all the remaining cases, the maximum was found for partial discretisation. Unsupervised equal-width binning returned data variants for which the results were poorer, but still some improvement was noted. For the upward direction with eight or ten bins, and for the downward direction with three or six bins, the classifier working in the continuous domain was not outperformed; only degradation in power was observed. However, for the other numbers of intervals constructed for the variables, in both directions of processing along the OneR ranking, increased predictions were reported.
For the male writer dataset shown in Figure 4, in transformations following the OneR ranking, there were more instances where the NB inducer worked best in the continuous domain, in particular, for processing downward. From all the discretisation methods and their variants, only for equal-width binning with four or seven bins, or equal-frequency binning with 4, 5, 7, 9 or 10 intervals, was some enhanced accuracy observed, when only the first few most important attributes were discretised. On the other hand, for transformations starting with the least relevant variables, in all cases except one (for duw06), partial discretisation was advantageous to the classifier performance, for both supervised and unsupervised methods.
Figure 5 displays charts illustrating the performance for the J48 classifier when discretisation was executed while following the Relief ranking for the F-writer dataset. For some conditions, the transformation resulted only in either the same or worsened predictions. This happened for unsupervised equal-width binning with six bins for ascending order, and for descending, when for duf either 3, 5 or 9 intervals were constructed, and for duw with two or seven bins. The complete path of discretisation was beneficial in the case of duw08 while processing upward and, for dsK, for both directions. For other transformation conditions, translation of a subset of attributes resulted in the greatest advantage.
For the male writer dataset (see Figure 6), only in one instance were the predictions of the J48 inducer for the continuous variables the best, when unsupervised equal-width binning with two bins was applied starting with the highest ranking features. On the other end, the maximum accuracy for all discretised variables was detected for both supervised discretisation approaches and both directions of transformations, and also for duf03, duf07, duf09, and duf10 for going up the Relief ranking, and for going down for duf05, duf09, and for duw03. This leaves twelve other circumstances for ascending and fourteen for descending the ranking with some form of discretisation where partial transformation led to the greatest improvement.
For all discretisation variants driven by the OneR ranking, the performance trends of the J48 classifier are shown in Figure 7 for the cases where the female writers were recognised. Here, only for processing downward with unsupervised equal-frequency and equal-width binning with three intervals constructed did discretisation cause worsened accuracy. For this dataset, the maximum performance was rarely observed for the translation of all attributes into a discrete domain: only for the supervised Kononenko approach in both directions, and for duw08 when going upward. With the other transformation parameters, enhanced predictions were reported, yet typically for higher numbers of processed features when starting with the least ranked variables, or smaller numbers of attributes when the downward direction was considered.
When the samples were attributed to the male authors, which is shown in Figure 8, only unsupervised equal-width binning with two bins for descending OneR resulted in a decreased power of the J48 inducer. Discretisation of all variables was the most advantageous case when the equal-frequency binning approach with 2, 9 or 10 intervals was applied to the attributes processed up the ranking, and for five or ten bins when proceeding downward. For the Fayyad and Irani and the Kononenko supervised algorithms, in ascending and descending order, the maximum performance was reported in the last few steps or just the last step of the procedure. The remaining conditions, from all 20 discretisation paths, 13 for going up the ranking and 15 for going down, led to improved predictions when some groups of variables, but not all of them, were transformed.
How the k-NN classifier fared in the gradual discretisation procedure based on the Relief ranking for the female writers can be seen in Figure 9. Changing the domain from continuous to discrete for a subset or all variables was always beneficial, and the performance was improved at some point. An entirely discrete domain worked best with processing by the equal-frequency approach with 3, 4 or 6 bins while proceeding down the ranking, and for the equal-width algorithm with two intervals constructed for going down. For other numbers of intervals defined, only for subsets of variables in unsupervised algorithms for both directions of ordering were increased numbers of samples correctly attributed to authors. For supervised discretisation, the highest accuracy was detected after transformation of the three least relevant attributes, and for translating only the most important variable.
On the other hand, for the male writers, the trends in performance visible in Figure 10 were noticeably different. Discretisation brought many cases of worsened predictions. When starting the transformations with the least relevant attributes, this occurred for the duf algorithm applied with 2, 3, 6 or 8 bins, and for the duw approach for almost all numbers of intervals formed, with the exception of two and seven bins. For discretisation of the higher ranking variables first, the reference point in the continuous domain was also better than the results obtained for duf02, duf04, duf05, duf08, and duw02. The second reference point, with all attributes transformed, was the maximum for duf07 while processing up the Relief ranking, and for both supervised discretisation methods for proceeding down. This implies that for ascending direction in 8, and for descending order in 12, out of a total of 20 discretisation paths, the maximum was detected for partial transformations.
Figure 11 includes charts that allow for analysis of the performance of the k-NN classifier when discretisation of the female writer dataset was based on the OneR ranking. For this ranking, the benefits of partial or complete discretisation can also be observed, as the obtained results were always better than in the continuous domain. Complete discretisation was the most advantageous in just a few instances: for the duf procedure with 3, 4, 6, and 8 bins for transformations starting with the least relevant features, and for the duw algorithm with two intervals when the highest-ranking variables were translated first. Both supervised methods led to the maximum being detected after only either the least or the most important attribute was discretised. For the other discretisation conditions, the number of transformed features that led to the maximum predictions ranged from 1 to 7 when following the ascending ordering of variables, and from 1 to 10 for the reverse.
In contrast, for the male writers (see Figure 12), discretisation conditioned on the OneR ranking brought observations similar to those for the Relief ranking referred to before, that is, rare cases of improvement over the performance in the continuous domain. For going up the ranking, partial discretisation gave the best results only for both supervised methods, and for duf02, duf04, and duf09. For proceeding down, the same happened for duf07, duf09, and duf10, as well as for almost all the duw variants when four or more intervals were constructed. For the descending order, the supervised discretisation of all variables was the most beneficial to author recognition.
Overall, these extensive experiments revealed a relatively high number of cases in which transforming the domains of some selected variables, rather than all of them, from continuous to discrete resulted in improved accuracy of the employed classifiers. This happened for all three inducers used in the research, for all discretisation approaches, both processing directions, and for both datasets. Finding which conditions were the most advantageous requires further study, but the obtained results validate the proposed research framework and are sufficiently promising to provide motivation for deeper investigation.
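As a point of reference for the duf and duw variants discussed above, the sketch below illustrates how equal-width and equal-frequency cut points can be derived for a single attribute. It is only a minimal illustration under assumed names (equal_width_cuts, equal_frequency_cuts) and synthetic data; the reported experiments relied on the corresponding WEKA filters rather than on this code.

```python
import numpy as np

def equal_width_cuts(values, n_bins):
    # Split the attribute range [min, max] into n_bins intervals of equal length.
    lo, hi = float(np.min(values)), float(np.max(values))
    return list(np.linspace(lo, hi, n_bins + 1)[1:-1])

def equal_frequency_cuts(values, n_bins):
    # Place cut points at quantiles, so that each bin holds roughly the same
    # number of observations.
    probs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return list(np.quantile(values, probs))

# Synthetic frequencies of a function word in 20 text samples (illustrative only).
freq = np.random.default_rng(0).lognormal(size=20)
print(equal_width_cuts(freq, 4))      # duw-style cut points for 4 bins
print(equal_frequency_cuts(freq, 4))  # duf-style cut points for 4 bins
```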

5.3. Summary of Results

To evaluate the usefulness of the selective discretisation procedure, some standard statistics were calculated, as shown in Table 3, Table 4 and Table 5. They included the average classification accuracy and standard deviation, as well as the minimum and maximum performance observed. These statistics were established over the steps of the procedure, starting with a single variable out of the N available attributes being discretised and ending when N−1 features became discrete. The performance in the original continuous domain and the performance in the final discretisation step, where all variables were in a discrete domain, were excluded from the calculations. For each discretisation method and each ranking, both directions, ascending (starting with less important variables) and descending (starting at the top of a ranking with the most relevant features), were considered for each classifier. For the unsupervised methods, the results include the detailed values obtained for each variant of a method (depending on the number of bins constructed for the variables), the overall averages calculated for the approach, and the overall extrema as well.
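A minimal sketch of how these statistics could be computed is given below, assuming that, for one classifier, ranking, direction, and discretisation variant, accuracies have already been collected for every step of the procedure, indexed from 0 (the continuous reference point) to N (all attributes discretised). The function name and the example values are illustrative and do not come from the reported experiments; the sample standard deviation is assumed.

```python
import statistics

def partial_discretisation_stats(accuracies):
    """accuracies[k] is the classification accuracy [%] with k attributes
    discretised, for k = 0..N; only the partial steps 1..N-1 are summarised."""
    partial = accuracies[1:-1]  # exclude the continuous and the fully discrete cases
    return {
        "avg": statistics.mean(partial),
        "st.dev": statistics.stdev(partial),  # sample standard deviation assumed
        "min": min(partial),
        "max": max(partial),
    }

# Illustrative accuracies for N = 12 attributes (13 steps in total).
acc = [90.0, 91.2, 92.5, 91.8, 93.1, 92.0, 91.5, 92.8, 93.4, 92.2, 91.9, 92.6, 90.5]
print(partial_discretisation_stats(acc))
```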
For all of these statistics but one, the highest values were preferred. For the standard deviation, the smallest values were desirable, as they show how stable the process was, with the obtained predictions close to each other. For classification accuracy, whether for the calculated averages or the detected extrema, a higher percentage of correctly attributed samples was considered advantageous, as always in classification tasks. These preferred minimal and maximal values were marked in the tables in the context of each ranking and direction of processing, for each dataset and classifier.
In all three tables, it was observed that the standard deviation took values over a relatively wide range. There were always some cases with double-digit integer parts; these were obtained for the supervised discretisation algorithms, either one or both of them, for both ranking directions and both rankings. For the unsupervised discretisation approaches, the values were smaller, with single-digit integer parts, and quite often just fractional. The minimum value was always found for one of the unsupervised transformations of the variables.
For the Naive Bayes classifier, the statistics presented in Table 3 show that, for the most part, for both rankings studied, discretisation brought improved performance for the female writer dataset. For the male writer dataset, there were many cases of degradation when just the average was studied. For unsupervised equal-frequency binning and F-writers, in the vast majority of cases, the average performance was better than that reported in the input continuous domain. However, for M-writers, the predictions decreased, although the change was mostly small. The best results were rarely observed for unsupervised equal-width binning, and, surprisingly, also rarely for the supervised discretisation methods. The maximum level of predictions for the female writer dataset was the same for both rankings, and for the ascending order, the same as the highest performance detected when all variables were transformed at once. For the male writer dataset, the maximum classification accuracy was always higher than the performance in the continuous domain and than that for all discretised variables.
Comparison of the two orderings of variables in this case leads to the conclusion that the Relief ranking processed in the ascending direction for the female writers brought better results than the OneR, while for the male writers the opposite was true. Comparison of the directions indicates that going upward was more advantageous than following a ranking downward.
The J48 inducer (see Table 4) was generally not as good at prediction as Naive Bayes, but it benefited more from discretisation. For the ascending direction of the Relief and OneR rankings, the average performance was almost always better than in the continuous domain for both the female and male writer datasets. For the descending direction of both rankings, these values showed some decrease with respect to the reference point. The maximum predictions found for the male writers were higher than the best result obtained for all discretised variables. For the female writers, the same situation occurred for the Relief ranking processed in both directions, while for the ascending order, only Relief led to higher precision.
Of the three classifiers studied, k-NN (see Table 5) returned the worst predictions for the female writers. When only some subsets of attributes were discretised, for the ascending direction of processing a ranking, the average performance was lower than when all variables were transformed. For the descending direction, there were some cases of improvement. However, the maximum classification accuracy detected was always higher, for both rankings and in both directions. For the male writers, the average performance was close to, yet below, the reference point in the continuous domain, and noticeably below the predictions for all discrete attributes. The maximum found was improved only when processing a ranking upward.
When the results from all three tables were compared against each other, it turned out that for the female writer dataset, the ascending order of the rankings was beneficial for classification by Naive Bayes and J48, while for k-NN the opposite was true. For the male writer dataset, the trends were not so consistent and depended on the classifier and the discretisation approach applied to the data; however, they were the same for both rankings employed in the research. For F-writers, it was also observed that the Relief ranking more often led to higher values of the obtained statistics than the OneR ranking, while for M-writers, more variation was detected and a greater dependence on the parameters of the discretisation process was observed.
For the process of gradual discretisation controlled by the importance of attributes indicated by a ranking, the detailed analysis of classifier performance indicates that standard transformation approaches cannot guarantee the form of features that is the most beneficial to predictions. Discretising only some subsets of variables, instead of all of them at once, can lead to improved accuracy. Many such cases were observed in the investigations, for all types of inducers employed and all variants of discrete data. These findings show the merits of the proposed methodology, although they also indicate that finding the most advantageous scenario, that is, a particular discretisation method and a subset of variables to transform, is not a trivial task. To limit computational costs, the processing does not have to include all steps, from discretisation of a single attribute to translation of all of them. It can be stopped sooner, once some increase in performance is detected. However, such an approach brings the risk of detecting only a local (and not the global) maximum.
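The overall procedure, together with the optional early stop mentioned above, can be outlined as follows. This is only a sketch under our own naming: discretise and train_and_score stand for the discretisation filter and the classifier evaluation supplied by the surrounding framework (WEKA in the reported experiments), and stopping at the first gain illustrates the local-maximum risk noted in the text.

```python
def gradual_discretisation(data, ranking, discretise, train_and_score,
                           descending=False, stop_at_first_gain=False):
    """Discretise attributes one by one in ranking order and score a classifier.

    ranking         -- attributes ordered from the most to the least relevant
    discretise      -- callable(data, attributes) returning data with those attributes discrete
    train_and_score -- callable(data) returning classification accuracy [%]
    """
    order = list(ranking) if descending else list(reversed(ranking))
    best_accuracy = train_and_score(data)     # continuous reference point
    best_subset = []
    for step in range(1, len(order) + 1):
        subset = order[:step]                 # attributes discretised so far
        accuracy = train_and_score(discretise(data, subset))
        if accuracy > best_accuracy:
            best_accuracy, best_subset = accuracy, subset
            if stop_at_first_gain:            # cheaper, but may stop at a local maximum
                break
    return best_accuracy, best_subset
```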

6. Conclusions

Discretisation as a stage of data preparation plays an important role in knowledge discovery processes, with a noticeable impact on their efficiency. In standard proceedings, some arbitrarily selected discretisation approach, which finds categorical representations for the continuous domains of the input features, is applied to all attributes present in the input data, regardless of their characteristics, and all transformations are performed at once. In this paper, a research methodology was presented that is dedicated to a gradual data discretisation procedure driven by a ranking of attributes. The aim was to examine how the form of the features, dependent on their importance, affects their characteristics and impacts the performance of the chosen classifiers.
Rankings belong to feature selection mechanisms. They provide information on the importance of individual features, which allows the dimensionality of the data to be reduced. In the research, the constructed orderings of variables were exploited to direct the sequential transformations of attributes. The features were selected one by one, taking into account two possible directions of processing, i.e., descending and ascending. The former started with the most relevant attributes, after which gradually less and less important variables were discretised, while the latter began at the bottom of a ranking with the least relevant features before moving on to the more important ones. For the transformations, representatives of the two most popular approaches were used: supervised and unsupervised.
The research methodology was extensively verified on two collections of datasets from the stylometry domain, with 901 dataset variants per collection. These investigations included four discretisation algorithms with various parameters, two ranking methods with both directions of ordering of features, and three state-of-the-art classifiers. The analysis of the obtained results allowed the identification of many cases where the proposed data transformation procedure resulted in improved predictive accuracy for the classifiers when only some subset of attributes was discretised. This demonstrates the merits of the methodology and highlights the value of investigating it more deeply.
In future research, other discretisation algorithms and classifiers will be examined, with the goal of determining guidelines for selecting the most advantageous combinations of attribute transformation methods. Also, other mechanisms for ranking construction will be investigated and compared.

Author Contributions

Conceptualization, U.S.; methodology, U.S., B.Z., G.B.; software, G.B.; validation, U.S., B.Z., G.B.; formal analysis, U.S., B.Z.; investigation, U.S.; data curation, U.S., G.B.; writing—original draft preparation, U.S., B.Z.; writing—review and editing, U.S., B.Z.; visualization, U.S., B.Z.; supervision, U.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data available upon request.

Acknowledgments

Texts exploited in experiments are available for on-line reading and download thanks to Project Gutenberg (www.gutenberg.org). For data processing, WEKA workbench [37] was used. The research described was performed within the statutory project of the Department of Computer Graphics, Vision and Digital Systems (Rau-6, 2024) at the Silesian University of Technology (SUT), Gliwice, Poland, and at the Institute of Computer Science, the University of Silesia in Katowice, Sosnowiec, Poland.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques; Morgan Kaufmann: San Francisco, CA, USA, 2011. [Google Scholar]
  2. Cios, K.J.; Pedrycz, W.; Świniarski, R.W.; Kurgan, L. Data Mining. A Knowledge Discovery Approach; Springer: New York, NY, USA, 2007. [Google Scholar]
  3. Witten, I.; Frank, E.; Hall, M. Data Mining. Practical Machine Learning Tools and Techniques, 3rd ed.; Morgan Kaufmann: San Francisco, CA, USA, 2011. [Google Scholar]
  4. Guyon, I.; Elisseeff, A. An Introduction to Variable and Feature Selection. J. Mach. Learn. Res. 2003, 3, 1157–1182. [Google Scholar]
  5. Guyon, I.; Gunn, S.; Nikravesh, M.; Zadeh, L. (Eds.) Feature Extraction: Foundations and Applications; Studies in Fuzziness and Soft Computing; Physica-Verlag, Springer: Heidelberg, Germany, 2006; Volume 207. [Google Scholar]
  6. Liu, H.; Motoda, H. Computational Methods of Feature Selection; CRC Press: Boca Raton, FL, USA, 2007. [Google Scholar]
  7. Stańczyk, U. Pruning Decision Rules by Reduct-Based Weighting and Ranking of Features. Entropy 2022, 24, 1602. [Google Scholar] [CrossRef] [PubMed]
  8. Fernández, A.; García, S.; Galar, M.; Prati, R.C.; Krawczyk, B.; Herrera, F. Data Level Preprocessing Methods. In Learning from Imbalanced Data Sets; Springer International Publishing: Cham, Switzerland, 2018; pp. 79–121. [Google Scholar]
  9. Kotsiantis, S.; Kanellopoulos, D. Discretization Techniques: A recent survey. Int. Trans. Comput. Sci. Eng. 2006, 1, 47–58. [Google Scholar]
  10. Kliegr, T.; Izquierdo, E. QCBA: Improving rule classifiers learned from quantitative data by recovering information lost by discretisation. Appl. Intell. 2023, 53, 20797–20827. [Google Scholar] [CrossRef]
  11. Yang, Y.; Webb, G.I.; Wu, X. Discretization Methods. In Data Mining and Knowledge Discovery Handbook; Maimon, O., Rokach, L., Eds.; Springer: Boston, MA, USA, 2005; pp. 113–130. [Google Scholar]
  12. Dougherty, J.; Kohavi, R.; Sahami, M. Supervised and Unsupervised Discretization of Continuous Features. In Machine Learning: Proceedings of the 12th International Conference; Morgan Kaufmann: San Francisco, CA, USA, 1995; pp. 194–202. [Google Scholar]
  13. Dash, R.; Paramguru, R.L.; Dash, R. Comparative analysis of supervised and unsupervised discretization techniques. Int. J. Adv. Sci. Technol. 2011, 2, 29–37. [Google Scholar]
  14. Blum, A.L.; Langley, P. Selection of relevant features and examples in machine learning. Artif. Intell. 1997, 97, 245–271. [Google Scholar] [CrossRef]
  15. Koppel, M.; Schler, J.; Argamon, S. Computational Methods in Authorship Attribution. J. Am. Soc. Inf. Sci. Technol. 2009, 60, 9–26. [Google Scholar] [CrossRef]
  16. Zhao, Y.; Zobel, J. Searching with Style: Authorship Attribution in Classic Literature. In Proceedings of the Thirtieth Australasian Conference on Computer Science—Volume 62, ACSC ’07, Darlinghurst, Australia, 30 January 2007; pp. 59–68. [Google Scholar]
  17. Stańczyk, U.; Zielosko, B. Data irregularities in discretisation of test sets used for evaluation of classification systems: A case study on authorship attribution. Bull. Pol. Acad. Sci. Tech. Sci. 2021, 69, 1–12. [Google Scholar] [CrossRef]
  18. Das, S.; Datta, S.; Chaudhuri, B.B. Handling data irregularities in classification: Foundations, trends, and future challenges. Pattern Recognit. 2018, 81, 674–693. [Google Scholar] [CrossRef]
  19. Baron, G.; Stańczyk, U. Standard vs. non-standard cross-validation: Evaluation of performance in a space with structured distribution of datapoints. In Knowledge-Based and Intelligent Information & Engineering Systems, Proceedings of the 25th International Conference KES-2021, Szczecin, Poland, 8–10 September 2021; Procedia Computer Science; Wątróbski, J., Salabun, W., Toro, C., Zanni-Merk, C., Howlett, R.J., Jain, L.C., Eds.; Elsevier: Amsterdam, The Netherlands, 2021; Volume 192, pp. 1245–1254. [Google Scholar]
  20. Toulabinejad, E.; Mirsafaei, M.; Basiri, A. Supervised discretization of continuous-valued attributes for classification using RACER algorithm. Expert Syst. Appl. 2024, 244, 121203. [Google Scholar] [CrossRef]
  21. Huan, L.; Farhad, H.; Lim, T.; Manoranjan, D. Discretization: An Enabling Technique. Data Min. Knowl. Discov. 2002, 6, 393–423. [Google Scholar]
  22. Peng, L.; Qing, W.; Gu, Y. Study on Comparison of Discretization Methods. In Proceedings of the 2009 International Conference on Artificial Intelligence and Computational Intelligence, Shanghai, China, 7–8 November 2009; Volume 4, pp. 380–384. [Google Scholar]
  23. Islam, M.A.; Uddin, M.A.; Aryal, S.; Stea, G. An ensemble learning approach for anomaly detection in credit card data with imbalanced and overlapped classes. J. Inf. Secur. Appl. 2023, 78, 103618. [Google Scholar] [CrossRef]
  24. García, S.; Luengo, J.; Sáez, J.A.; López, V.; Herrera, F. A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Trans. Knowl. Data Eng. 2013, 25, 734–750. [Google Scholar] [CrossRef]
  25. de Sá, C.R.; Soares, C.; Knobbe, A. Entropy-based discretization methods for ranking data. Inf. Sci. 2016, 329, 921–936. [Google Scholar] [CrossRef]
  26. Stańczyk, U.; Zielosko, B.; Baron, G. Discretisation of conditions in decision rules induced for continuous data. PLoS ONE 2020, 15, e0231788. [Google Scholar] [CrossRef] [PubMed]
  27. Fayyad, U.M.; Irani, K.B. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, Chambéry, France, 28 August–3 September 1993; Morgan Kaufmann Publishers: San Francisco, CA, USA, 1993; Volume 2, pp. 1022–1027. [Google Scholar]
  28. Kononenko, I.; Kukar, M. Data Preprocessing. In Machine Learning and Data Mining; Kononenko, I., Kukar, M., Eds.; Woodhead Publishing: Cambridge, UK, 2007; Chapter 7; pp. 181–211. [Google Scholar]
  29. Grzymala-Busse, J.W. Discretization Based on Entropy and Multiple Scanning. Entropy 2013, 15, 1486–1502. [Google Scholar] [CrossRef]
  30. Rissanen, J. Modeling by shortest data description. Automatica 1978, 14, 465–471. [Google Scholar] [CrossRef]
  31. Ross Quinlan, J.; Rivest, R.L. Inferring decision trees using the minimum description length principle. Inf. Comput. 1989, 80, 227–248. [Google Scholar] [CrossRef]
  32. Hall, M.A. Correlation-Based Feature Subset Selection for Machine Learning. Ph.D. Thesis, Department of Computer Science, University of Waikato, Hamilton, New Zealand, 1998. [Google Scholar]
  33. Kohavi, R.; John, G.H. Wrappers for feature subset selection. Artif. Intell. 1997, 97, 273–324. [Google Scholar] [CrossRef]
  34. Mansoori, E. Using statistical measures for feature ranking. Int. J. Pattern Recognit. Artif. Intell. 2013, 27, 1350003. [Google Scholar] [CrossRef]
  35. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  36. Saha, P.; Patikar, S.; Neogy, S. A Correlation–Sequential Forward Selection Based Feature Selection Method for Healthcare Data Analysis. In Proceedings of the 2020 IEEE International Conference on Computing, Power and Communication Technologies (GUCON), Greater Noida, India, 2–4 October 2020; pp. 69–72. [Google Scholar]
  37. Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I. The WEKA Data Mining Software: An Update. SIGKDD Explor. 2009, 11, 10–18. [Google Scholar] [CrossRef]
  38. Kononenko, I. Estimating attributes: Analysis and extensions of RELIEF. In Proceedings of the Machine Learning: ECML-94; LNCS; Bergadano, F., De Raedt, L., Eds.; Springer-Verlag: Berlin, Germany, 1994; Volume 784, pp. 171–182. [Google Scholar]
  39. Sun, Y.; Wu, D. A RELIEF Based Feature Extraction Algorithm. In Proceedings of the SIAM International Conference on Data Mining, Atlanta, GA, USA, 24–26 April 2008; pp. 188–195. [Google Scholar]
  40. Holte, R. Very simple classification rules perform well on most commonly used datasets. Mach. Learn. 1993, 11, 63–91. [Google Scholar] [CrossRef]
  41. Ali, S.; Smith, K.A. On learning algorithm selection for classification. Appl. Soft Comput. 2006, 6, 119–138. [Google Scholar] [CrossRef]
  42. Friedman, N.; Geiger, D.; Goldszmidt, M. Bayesian Network Classifiers. Mach. Learn. 1997, 29, 131–163. [Google Scholar] [CrossRef]
  43. Domingos, P.; Pazzani, M. On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Mach. Learn. 1997, 29, 103–130. [Google Scholar] [CrossRef]
  44. Quinlan, J.R. C4.5: Programs for Machine Learning; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1993. [Google Scholar]
  45. Moshkov, M.; Zielosko, B.; Tetteh, E.T. Selected Data Mining Tools for Data Analysis in Distributed Environment. Entropy 2022, 24, 1401. [Google Scholar] [CrossRef] [PubMed]
  46. Zhang, X.; Xiao, H.; Gao, R.; Zhang, H.; Wang, Y. K-nearest neighbors rule combining prototype selection and local feature weighting for classification. Knowl.-Based Syst. 2022, 243, 108451. [Google Scholar] [CrossRef]
  47. Zhao, Y.; Zobel, J. Effective and Scalable Authorship Attribution Using Function Words. In Proceedings of the Information Retrieval Technology; Lee, G.G., Yamada, A., Meng, H., Myaeng, S.H., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; pp. 174–189. [Google Scholar]
  48. Rybicki, J.; Eder, M.; Hoover, D. Computational stylistics and text analysis. In Doing Digital Humanities: Practice, Training, Research, 1st ed.; Crompton, C., Lane, R., Siemens, R., Eds.; Routledge: London, UK, 2016; pp. 123–144. [Google Scholar]
  49. Stamatatos, E. A Survey of Modern Authorship Attribution Methods. J. Am. Soc. Inf. Sci. Technol. 2009, 60, 538–556. [Google Scholar] [CrossRef]
  50. Škorić, M.; Stanković, R.; Ikonić Nešić, M.; Byszuk, J.; Eder, M. Parallel Stylometric Document Embeddings with Deep Learning Based Language Models in Literary Authorship Attribution. Mathematics 2022, 10, 838. [Google Scholar] [CrossRef]
  51. Eder, M.; Górski, R.L. Stylistic Fingerprints, POS-tags, and Inflected Languages: A Case Study in Polish. J. Quant. Linguist. 2022, 30, 86–103. [Google Scholar] [CrossRef]
  52. Rybicki, J. Vive la différence: Tracing the (authorial) gender signal by multivariate analysis of word frequencies. Digit. Scholarsh. Humanit. 2016, 31, 746–761. [Google Scholar] [CrossRef]
  53. Baron, G.; Harężlak, K. On Approaches to Discretization of Datasets Used for Evaluation of Decision Systems. In Intelligent Decision Technologies 2016: Proceedings of the 8th KES International Conference on Intelligent Decision Technologies (KES-IDT 2016)—Part II; Czarnowski, I., Caballero, M.A., Howlett, J.R., Jain, C.L., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 149–159. [Google Scholar]
Figure 1. Performance [%] for the Naive Bayes classifier observed in the discretisation of the female writer dataset while following the Relief ranking. For unsupervised equal-frequency (duf) and equal-width (duw) binning, the categories reflect the number of constructed bins, and for supervised approaches, the method is given. The series specify the number of discretised attributes.
Figure 2. Performance [%] for the Naive Bayes classifier observed in the discretisation of the male writer dataset while following the Relief ranking. For unsupervised equal-frequency (duf) and equal-width (duw) binning, the categories reflect the number of constructed bins, and for supervised approaches, the method is given. The series specify the number of discretised attributes.
Figure 3. Performance [%] for the Naive Bayes classifier observed in the discretisation of the female writer dataset while following the OneR ranking. For unsupervised equal-frequency (duf) and equal-width (duw) binning, the categories reflect the number of constructed bins, and for supervised approaches, the method is given. The series specify the number of discretised attributes.
Figure 4. Performance [%] for the Naive Bayes classifier observed in the discretisation of the male writer dataset while following the OneR ranking. For unsupervised equal-frequency (duf) and equal-width (duw) binning, the categories reflect the number of constructed bins, and for supervised approaches, the method is given. The series specify the number of discretised attributes.
Figure 5. Performance [%] for the J48 classifier observed in the discretisation of the female writer dataset while following the Relief ranking. For unsupervised equal-frequency (duf) and equal-width (duw) binning, the categories reflect the number of constructed bins, and for supervised approaches, the method is given. The series specify the number of discretised attributes.
Figure 6. Performance [%] for the J48 classifier observed in the discretisation of the male writer dataset while following the Relief ranking. For unsupervised equal-frequency (duf) and equal-width (duw) binning, the categories reflect the number of constructed bins, and for supervised approaches, the method is given. The series specify the number of discretised attributes.
Figure 7. Performance [%] for the J48 classifier observed in the discretisation of the female writer dataset while following the OneR ranking. For unsupervised equal-frequency (duf) and equal-width (duw) binning, the categories reflect the number of constructed bins, and for supervised approaches, the method is given. The series specify the number of discretised attributes.
Figure 8. Performance [%] for the J48 classifier observed in the discretisation of the male writer dataset while following the OneR ranking. For unsupervised equal-frequency (duf) and equal-width (duw) binning, the categories reflect the number of constructed bins, and for supervised approaches, the method is given. The series specify the number of discretised attributes.
Figure 9. Performance [%] for the k-NN classifier observed in the discretisation of the female writer dataset while following the Relief ranking. For unsupervised equal-frequency (duf) and equal-width (duw) binning, the categories reflect the number of constructed bins, and for supervised approaches, the method is given. The series specify the number of discretised attributes.
Figure 10. Performance [%] for the k-NN classifier observed in the discretisation of the male writer dataset while following the Relief ranking. For unsupervised equal-frequency (duf) and equal-width (duw) binning, the categories reflect the number of constructed bins, and for supervised approaches, the method is given. The series specify the number of discretised attributes.
Figure 11. Performance [%] for the k-NN classifier observed in the discretisation of the female writer dataset while following the OneR ranking. For unsupervised equal-frequency (duf) and equal-width (duw) binning, the categories reflect the number of constructed bins, and for supervised approaches, the method is given. The series specify the number of discretised attributes.
Figure 12. Performance [%] for the k-NN classifier observed in the discretisation of the male writer dataset while following the OneR ranking. For unsupervised equal-frequency (duf) and equal-width (duw) binning, the categories reflect the number of constructed bins, and for supervised approaches, the method is given. The series specify the number of discretised attributes.
Table 1. Rankings of attributes.
Rank    F-Writers            M-Writers
        Relief    OneR       Relief    OneR
1       on        to         by        by
2       to        on         if        or
3       of        of         so        in
4       as        as         or        if
5       by        by         in        at
6       if        if         as        so
7       or        in         at        as
8       up        up         on        on
9       at        so         no        no
10      in        or         of        to
11      so        at         up        of
12      no        no         to        up
Table 2. Performance [%] of classifiers for all attributes in the continuous domain and all discrete domains.
Domain   F-Writers               M-Writers               Domain   F-Writers               M-Writers
         NB     J48    k-NN     NB     J48    k-NN                NB     J48    k-NN     NB     J48    k-NN
Cont.    93.33  89.79  85.56    84.03  75.63  77.29
dsF      50.00  87.78  62.22    69.44  80.76  83.54     dsK      62.22  93.40  62.22    68.89  80.76  82.43
duf02    91.18  91.18  83.89    75.35  73.82  68.47     duw02    87.01  85.35  88.06    73.82  72.71  70.14
duf03    92.85  85.63  86.94    75.69  79.72  71.25     duw03    89.38  83.68  85.83    78.68  77.29  72.36
duf04    93.40  86.46  88.61    81.60  69.31  75.21     duw04    91.67  86.94  82.78    80.63  79.24  71.32
duf05    92.85  86.18  85.07    81.11  78.96  73.54     duw05    90.56  80.42  84.10    79.38  75.76  69.51
duf06    94.10  88.19  89.17    80.35  74.79  73.47     duw06    92.22  85.69  85.21    82.22  75.21  77.22
duf07    95.76  81.74  85.63    80.42  78.75  77.85     duw07    91.11  89.65  86.46    81.04  71.94  72.29
duf08    92.29  88.19  88.47    80.49  77.92  75.28     duw08    91.67  90.97  85.76    77.64  75.42  73.61
duf09    94.58  88.26  86.81    80.42  79.79  77.57     duw09    90.56  90.35  88.68    81.53  76.46  74.58
duf10    93.47  87.15  86.88    79.86  80.90  75.35     duw10    91.67  87.01  86.32    80.00  76.25  73.61
Table 3. Statistics of performance [%]: average classification accuracy, standard deviation, and minimum and maximum classification accuracy of the NB classifiers, for the procedure of the gradual discretisation controlled by rankings, starting with 1 out of N discretised attributes, and ending with N−1 discretised variables.
          F-Writers                                                M-Writers
          Ascending                 Descending                     Ascending                 Descending
Domain    Avg. ± St.dev.  Min  Max  Avg. ± St.dev.  Min  Max       Avg. ± St.dev.  Min  Max  Avg. ± St.dev.  Min  Max
Relief ranking
dsF80.38 ± 12.9456.1895.1460.18 ± 14.5250.0088.6183.74 ± 1.7981.5387.0876.60 ± 4.1372.2982.85
dsK83.67 ± 07.5074.1795.1478.57 ± 11.7061.1190.5683.15 ± 1.2981.3284.7975.96 ± 4.7170.8382.85
duf94.07 ± 00.1992.7195.7693.00 ± 00.4989.3895.2182.73 ± 1.8077.4385.2180.95 ± 1.8473.1385.76
duf0294.58 ± 00.5893.8995.7690.16 ± 00.6089.3891.1181.98 ± 2.3377.4384.5177.14 ± 3.6473.1383.54
duf0394.35 ± 00.5093.8995.2193.20 ± 00.9891.6794.5882.64 ± 1.4979.9384.0379.07 ± 3.2874.0383.40
duf0494.34 ± 00.6393.8995.7693.16 ± 00.8192.2994.5183.13 ± 2.1279.9385.2181.76 ± 1.4779.7984.65
duf0593.67 ± 00.4092.7193.8992.34 ± 01.0090.4993.9682.68 ± 1.9379.2484.6582.96 ± 1.6980.5685.21
duf0693.95 ± 00.3293.3394.5193.54 ± 00.4792.8594.5182.80 ± 2.0079.3885.2180.90 ± 1.2279.7983.40
duf0794.12 ± 00.4993.3395.1494.53 ± 00.4493.3395.2182.69 ± 1.8579.9385.1481.52 ± 1.6179.2484.65
duf0893.73 ± 00.3892.7193.8993.11 ± 00.9491.7495.0782.60 ± 1.9379.3185.2181.87 ± 1.6980.3584.58
duf0993.90 ± 00.2793.3394.5194.00 ± 00.5793.3395.1482.73 ± 1.9279.8685.2182.14 ± 1.4380.4284.58
duf1094.00 ± 00.3893.8995.1493.01 ± 00.8292.2994.5183.33 ± 1.4480.4985.2181.18 ± 1.8879.2485.76
duw93.15 ± 00.6190.6995.7691.12 ± 00.9785.2193.8983.30 ± 1.1179.8686.4681.01 ± 0.9276.8884.17
duw0293.88 ± 00.9292.2295.7688.17 ± 02.2685.2192.2283.67 ± 0.7982.7185.1478.87 ± 1.1977.2980.76
duw0392.99 ± 01.0591.0494.5189.64 ± 01.2288.1392.0883.42 ± 2.0379.9386.4679.65 ± 1.8976.8882.85
duw0493.24 ± 00.6691.6794.0391.46 ± 00.8490.5693.3382.66 ± 1.2680.0084.5181.75 ± 1.0980.0083.96
duw0592.84 ± 00.9490.6993.8991.18 ± 01.6089.3893.8983.24 ± 2.0579.8685.2881.11 ± 1.0079.8683.47
duw0693.25 ± 00.5491.7493.9692.17 ± 00.8791.0493.3383.48 ± 1.0981.0484.5881.22 ± 1.3279.4484.10
duw0793.05 ± 00.8691.1194.0392.01 ± 00.5291.5392.7883.40 ± 1.0380.4984.5182.59 ± 0.9081.1184.17
duw0893.08 ± 00.4692.2293.3392.08 ± 00.8791.0493.8982.38 ± 1.2679.9383.9680.64 ± 1.4778.7582.85
duw0992.89 ± 00.5991.6793.8991.28 ± 01.2889.9393.2683.43 ± 1.1081.5384.5881.72 ± 0.8380.4983.47
duw1093.09 ± 00.6791.1193.3392.07 ± 01.0691.1193.8983.99 ± 1.2181.6785.7681.55 ± 1.0079.3882.85
OneR ranking
dsF78.54 ± 12.2952.2293.3360.61 ± 15.3450.0092.7884.32 ± 1.5881.4687.0876.06 ± 3.8771.6783.33
dsK81.07 ± 08.0461.8893.3382.92 ± 08.3261.1192.7883.59 ± 0.9481.9484.7275.56 ± 4.4968.3383.33
duf93.88 ± 00.1992.2295.7693.14 ± 00.5189.3895.2183.09 ± 1.6079.2485.2880.81 ± 1.7373.0685.76
duf0294.42 ± 00.7393.3395.7690.33 ± 01.1289.3893.3382.47 ± 1.6479.6584.5177.28 ± 3.8473.0683.54
duf0394.02 ± 00.4493.4795.1493.54 ± 00.7392.2294.5882.78 ± 1.6779.9384.6578.00 ± 3.2474.0383.40
duf0494.02 ± 00.4593.4095.1493.43 ± 00.7992.2994.5183.54 ± 1.6880.4285.2881.76 ± 1.3779.7984.65
duf0593.52 ± 00.5892.2293.8992.23 ± 01.2690.4993.9683.25 ± 1.9279.2484.7282.80 ± 1.4681.0484.58
duf0693.90 ± 00.0493.8994.0393.50 ± 00.7392.2994.5183.43 ± 1.4480.4285.2181.22 ± 1.3679.7983.47
duf0793.91 ± 00.2793.3394.5194.80 ± 00.4493.8995.2182.68 ± 1.8079.9385.1481.35 ± 1.6479.2484.65
duf0893.63 ± 00.5992.2293.8993.15 ± 01.1991.6795.0783.14 ± 1.3580.4985.2181.70 ± 1.1780.3583.96
duf0993.74 ± 00.3492.8593.8994.01 ± 00.5593.4795.1483.06 ± 1.9579.8685.2181.76 ± 1.2280.4284.58
duf1093.79 ± 00.3192.8593.8993.29 ± 00.7092.2994.5183.42 ± 1.5780.4985.2181.44 ± 1.8379.2485.76
duw93.11 ± 00.8189.4495.7691.42 ± 01.0486.3994.5183.05 ± 1.1979.8686.4681.11 ± 1.0374.3884.17
duw0293.51 ± 01.5289.4495.7688.83 ± 02.2186.3993.8983.32 ± 1.3480.9085.1478.48 ± 1.9074.3880.63
duw0392.94 ± 01.1190.6393.8990.25 ± 01.3088.1392.7883.91 ± 1.7179.9386.4680.16 ± 1.7577.5082.85
duw0493.39 ± 00.9091.0494.0391.20 ± 01.1489.9393.8982.18 ± 1.5780.3584.5182.17 ± 1.1881.1884.10
duw0592.82 ± 00.9990.4993.8991.69 ± 01.5489.3893.8983.15 ± 1.8179.8685.2881.25 ± 1.6078.7583.47
duw0693.24 ± 00.5891.6093.9692.27 ± 00.8391.0493.3383.37 ± 0.8881.0484.0381.37 ± 0.8480.0082.36
duw0793.10 ± 00.7491.6794.0392.49 ± 00.6391.6793.8983.30 ± 1.2680.4984.6582.27 ± 1.0481.0484.17
duw0893.08 ± 00.4692.2293.3392.17 ± 00.9291.0493.8981.79 ± 1.8679.8684.5880.56 ± 1.7577.6483.47
duw0992.97 ± 00.8591.0493.8991.54 ± 01.4889.9394.5183.11 ± 1.2681.5384.5881.81 ± 0.8680.5683.47
duw1092.93 ± 00.9091.1193.3392.32 ± 00.9691.1193.8983.30 ± 1.1881.6084.5881.94 ± 1.0480.0083.47
Table 4. Statistics of performance [%]: average classification accuracy, standard deviation, and minimum and maximum classification accuracy of the J48 classifiers, for the procedure of the gradual discretisation controlled by the rankings, starting with 1 out of N discretised attributes, and ending with N−1 discretised variables.
          F-Writers                                                M-Writers
          Ascending                 Descending                     Ascending                 Descending
Domain    Avg. ± St.dev.  Min  Max  Avg. ± St.dev.  Min  Max       Avg. ± St.dev.  Min  Max  Avg. ± St.dev.  Min  Max
Relief ranking
dsF91.45 ± 1.5789.7994.6574.36 ± 19.0850.0090.2877.53 ± 2.3672.0179.5178.59 ± 1.6476.3280.76
dsK91.16 ± 1.1789.7992.7892.41 ± 01.9387.5093.4076.76 ± 2.6971.8880.1478.59 ± 1.6476.3280.76
duf89.95 ± 0.9282.6493.3387.54 ± 00.7382.8592.2276.37 ± 0.7170.8380.9074.77 ± 0.8166.6782.08
duf0290.55 ± 1.7785.6992.7888.43 ± 02.4284.7991.8173.64 ± 1.8472.0177.8571.87 ± 3.9266.7477.92
duf0390.24 ± 1.4886.8193.3385.73 ± 00.5885.2887.4376.52 ± 1.3174.4478.8977.54 ± 2.4872.1581.46
duf0490.38 ± 0.8289.2491.5389.36 ± 02.5686.4692.2277.07 ± 1.2975.6379.5872.97 ± 4.5666.6779.51
duf0589.35 ± 2.3683.4792.7884.68 ± 01.0183.7587.0178.43 ± 1.5475.6380.1471.27 ± 0.8269.8672.64
duf0688.77 ± 2.7582.6490.4290.75 ± 01.1588.1991.8875.13 ± 1.9970.8377.9275.00 ± 1.5772.2277.43
duf0790.34 ± 0.7689.7992.1584.36 ± 02.4282.8590.4975.64 ± 0.7474.5876.8878.05 ± 2.1473.2680.42
duf0890.32 ± 0.6089.7991.5388.42 ± 00.6787.6490.0077.54 ± 2.5773.8980.9076.84 ± 1.9072.6478.47
duf0990.13 ± 0.4189.7990.9788.09 ± 00.5786.5388.3377.51 ± 1.0175.6378.7574.43 ± 2.7969.6579.79
duf1089.48 ± 0.9186.8890.3588.04 ± 01.2787.1591.8175.81 ± 1.1372.9277.3674.94 ± 4.1469.3882.08
duw89.83 ± 1.2283.6892.8587.63 ± 00.8380.3594.5176.73 ± 1.1368.1381.3274.42 ± 0.9863.1979.86
duw0290.73 ± 1.1988.6192.2284.22 ± 03.0980.3589.3876.67 ± 1.5773.6878.9669.15 ± 3.7063.1974.51
duw0389.79 ± 2.3083.6892.7887.51 ± 01.8983.6889.9376.87 ± 2.0272.4380.1473.81 ± 2.8168.8277.29
duw0490.61 ± 1.2088.5492.2286.81 ± 01.8183.2690.9777.90 ± 2.0974.1081.3276.50 ± 2.4571.1179.86
duw0589.83 ± 2.2785.4992.2285.07 ± 02.4580.4289.9376.84 ± 2.0074.7980.6974.96 ± 1.3772.7877.43
duw0689.03 ± 1.1986.2589.7986.12 ± 01.3485.0089.9374.68 ± 1.7572.5076.8874.05 ± 1.7171.6076.39
duw0789.98 ± 1.5585.9092.1589.46 ± 00.4988.0689.6576.87 ± 1.6574.5879.1074.44 ± 1.6770.0076.04
duw0889.44 ± 1.8684.5190.4290.81 ± 01.7386.9492.2276.69 ± 3.6668.1379.5175.43 ± 1.6472.5079.44
duw0989.20 ± 1.5685.0090.4291.16 ± 01.4590.3594.5177.15 ± 1.5675.3579.5874.48 ± 2.1871.1877.43
duw1089.82 ± 1.7985.1492.8587.56 ± 01.4487.0191.7476.86 ± 1.6374.9378.9676.98 ± 1.2674.0378.61
OneR ranking
dsF89.46 ± 6.8369.1792.7863.62 ± 18.7450.0090.2877.77 ± 1.7174.3879.5178.52 ± 2.0474.1080.76
dsK90.65 ± 3.0182.2992.7891.98 ± 03.2282.7893.4077.04 ± 1.5274.2479.5178.52 ± 2.0474.1080.76
duf89.91 ± 1.2782.6493.3387.66 ± 01.0182.8592.2276.57 ± 0.9171.8882.0874.32 ± 0.6066.6781.46
duf0290.99 ± 0.8589.7992.7889.00 ± 02.6184.7991.8174.72 ± 2.1471.8878.4071.71 ± 4.3466.7479.10
duf0390.13 ± 1.7785.6393.3385.81 ± 00.5885.2887.4377.47 ± 2.2574.7982.0877.23 ± 1.7374.5181.46
duf0490.16 ± 1.6885.6391.5389.36 ± 02.3386.4691.8178.33 ± 1.9076.6081.9471.95 ± 3.3466.6777.71
duf0589.89 ± 1.5586.9492.7885.01 ± 02.1883.7590.9778.02 ± 1.0976.6780.1472.22 ± 2.0270.0076.18
duf0689.07 ± 2.6182.6490.4289.73 ± 01.8586.2591.8875.62 ± 1.5473.5477.9274.20 ± 1.4572.2276.53
duf0789.88 ± 2.4082.9292.1584.63 ± 02.8282.8592.2275.60 ± 0.8874.5876.8877.30 ± 2.6773.2680.42
duf0889.97 ± 1.1287.0891.5388.83 ± 01.2688.1992.1576.46 ± 1.5573.8978.5476.58 ± 2.1172.6479.10
duf0989.85 ± 1.0686.8890.9788.34 ± 00.8786.5390.4277.66 ± 0.9476.1178.4773.54 ± 2.5769.6577.50
duf1089.20 ± 1.3486.2590.3588.22 ± 01.5087.1591.8175.28 ± 1.2172.9277.3674.14 ± 3.7569.3879.31
duw90.09 ± 0.8882.2292.8587.63 ± 01.3180.3594.5177.18 ± 1.3568.1381.3274.37 ± 0.6563.8979.79
duw0290.59 ± 1.7287.0192.2284.68 ± 03.5080.3590.3576.63 ± 3.2570.6979.7268.74 ± 2.4863.8972.22
duw0390.73 ± 1.1389.2492.7886.73 ± 02.3583.0689.1778.39 ± 1.6376.2580.8374.57 ± 2.5268.8277.36
duw0490.20 ± 2.2684.5892.2287.45 ± 01.6885.1490.9777.58 ± 2.4173.2681.3276.20 ± 2.6671.1179.79
duw0590.64 ± 2.0485.4992.2284.57 ± 03.8080.4292.2278.11 ± 1.9674.2480.6974.72 ± 1.5672.7877.99
duw0689.24 ± 0.7887.9290.3585.84 ± 01.6684.5190.3575.20 ± 1.7972.0177.8574.59 ± 0.8273.4076.25
duw0790.39 ± 0.8988.5492.1589.67 ± 00.3089.1090.4276.13 ± 2.1973.6179.1073.56 ± 2.3370.2876.04
duw0888.80 ± 2.7582.2290.4291.07 ± 01.2488.0692.2277.67 ± 3.2768.1379.5175.66 ± 1.9372.8579.44
duw0989.75 ± 0.8288.0690.4291.05 ± 01.5589.7994.5177.36 ± 1.8874.7979.5875.39 ± 1.9773.0678.26
duw1090.47 ± 1.0889.7992.8587.59 ± 01.4687.0191.7477.56 ± 1.3974.9379.5175.94 ± 1.6473.0678.61
Table 5. Statistics of performance [%]: average classification accuracy, standard deviation, and minimum and maximum classification accuracy of the k-NN classifiers, for the procedure of the gradual discretisation controlled by the rankings, starting with 1 out of N discretised attributes, and ending with N−1 discretised variables.
          F-Writers                                                M-Writers
          Ascending                 Descending                     Ascending                 Descending
Domain    Avg. ± St.dev.  Min  Max  Avg. ± St.dev.  Min  Max       Avg. ± St.dev.  Min  Max  Avg. ± St.dev.  Min  Max
Relief ranking
dsF72.85 ± 11.3957.7891.0464.47 ± 09.6155.5685.8377.32 ± 4.1070.6985.7675.26 ± 2.8670.1478.61
dsK74.06 ± 10.1166.6791.0471.43 ± 09.2357.2285.8377.83 ± 4.0469.5885.2175.41 ± 3.0470.1479.17
duf83.80 ± 02.5373.8988.7588.79 ± 01.1181.9493.4772.24 ± 1.7461.8178.6874.12 ± 0.9464.7280.00
duf0281.48 ± 05.8573.8988.6886.36 ± 02.2081.9489.7267.80 ± 5.4861.9476.0471.70 ± 1.6969.3174.58
duf0381.24 ± 02.8076.3286.8889.49 ± 01.7686.8893.4767.69 ± 4.1761.8173.6173.24 ± 3.1468.9680.00
duf0483.74 ± 02.7379.1787.3690.08 ± 01.2088.6892.7873.28 ± 2.6469.7978.6874.65 ± 1.8070.9076.46
duf0584.68 ± 02.9679.7288.6187.84 ± 02.3584.6592.0172.30 ± 2.8569.2477.7171.45 ± 3.8464.7277.08
duf0685.14 ± 01.2582.5786.8890.43 ± 00.8488.7591.6074.13 ± 2.4670.4977.2274.76 ± 2.3170.3577.64
duf0782.95 ± 03.1678.6886.8887.90 ± 01.5785.8390.9773.48 ± 2.3770.1477.7876.75 ± 1.6374.5879.51
duf0885.28 ± 02.9280.3588.7588.76 ± 01.3887.3691.3973.28 ± 1.9270.1475.4973.52 ± 1.7370.9776.32
duf0984.96 ± 02.2880.9087.4389.04 ± 01.4786.8190.9074.73 ± 1.3573.6878.4075.74 ± 1.4873.6878.19
duf1084.75 ± 02.3880.9087.5089.20 ± 01.5686.3991.5373.45 ± 1.6571.0475.8375.29 ± 2.0971.7479.31
duw86.09 ± 01.7573.8290.9786.47 ± 01.2080.3590.8372.80 ± 1.0665.9080.0775.25 ± 1.8163.0682.50
duw0285.94 ± 03.1079.7989.9383.06 ± 01.4580.3585.0773.40 ± 4.3866.6780.0770.16 ± 4.4963.0677.08
duw0382.17 ± 03.7973.8286.9486.86 ± 02.9382.6490.8371.00 ± 1.7268.1974.7274.07 ± 2.5770.4978.47
duw0486.45 ± 02.1982.7889.1085.46 ± 02.4982.7189.7973.62 ± 2.6968.4776.5376.00 ± 2.1971.6778.33
duw0586.40 ± 02.3682.3689.1785.93 ± 02.3883.4089.7968.87 ± 1.7565.9071.4674.21 ± 2.8670.2878.47
duw0686.07 ± 01.7082.3688.0686.75 ± 02.1883.9689.5874.03 ± 1.6770.6376.0477.47 ± 1.2775.3580.69
duw0787.18 ± 01.9985.0790.9787.51 ± 00.7586.3988.6174.74 ± 2.4370.5677.7174.29 ± 2.0071.3277.50
duw0886.22 ± 01.8782.2288.5487.02 ± 01.1884.5188.5473.35 ± 2.0569.5876.7478.02 ± 2.3375.0082.50
duw0987.76 ± 01.3085.1489.7989.06 ± 01.0586.8190.4273.51 ± 1.5071.5376.4674.63 ± 2.3870.8379.58
duw1086.62 ± 02.1183.4789.1786.55 ± 00.9884.5888.4772.67 ± 1.3070.7674.3878.36 ± 1.9574.7981.53
OneR ranking
dsF68.31 ± 07.4257.7888.1366.05 ± 10.0358.3390.2876.89 ± 3.1774.0385.7675.76 ± 3.7467.0181.25
dsK69.48 ± 06.2866.1188.1373.74 ± 07.9560.5690.2877.92 ± 3.6873.4785.2175.86 ± 3.8067.0181.25
duf83.98 ± 02.6473.8989.2489.22 ± 00.8383.2693.4772.57 ± 1.7061.0479.3873.65 ± 0.9064.7278.82
duf0280.65 ± 05.2173.8987.3686.48 ± 02.0883.2690.2868.53 ± 6.0061.0478.9672.75 ± 3.4666.1877.50
duf0382.22 ± 03.2976.3286.8889.07 ± 02.1586.3993.4768.63 ± 2.7263.4772.1570.54 ± 2.5265.5675.07
duf0484.60 ± 03.2079.1788.1990.10 ± 01.2687.9992.0173.51 ± 2.2170.2877.6475.73 ± 1.2472.7877.01
duf0584.63 ± 03.3879.7289.2488.55 ± 01.7285.1492.0171.45 ± 1.4869.2474.4471.43 ± 3.5064.7275.90
duf0685.93 ± 01.6182.5788.6190.43 ± 00.8788.5491.6074.37 ± 2.2371.3977.2273.91 ± 2.0070.9777.15
duf0782.75 ± 02.8778.6886.8888.63 ± 01.4586.1890.9774.27 ± 2.5771.3279.3875.96 ± 2.2271.1878.82
duf0885.25 ± 02.8580.3587.9989.70 ± 01.1787.3691.3973.74 ± 2.0470.1475.9773.14 ± 1.6570.6975.76
duf0985.32 ± 02.4380.9087.9989.79 ± 01.3787.2291.4675.13 ± 1.7073.1377.7874.76 ± 2.3070.5677.71
duf1084.50 ± 02.2980.9088.5490.24 ± 00.9888.6191.6073.50 ± 1.9171.0477.1574.65 ± 2.0771.7478.19
duw85.73 ± 01.6873.8290.9786.98 ± 01.0080.3590.2172.53 ± 1.5664.6577.2974.61 ± 1.9266.7480.35
duw0285.12 ± 03.5377.4388.6884.17 ± 01.9180.3587.9272.60 ± 4.1364.6577.0172.76 ± 3.7667.7177.22
duw0381.81 ± 03.6473.8286.2587.08 ± 02.3182.6490.2172.52 ± 1.6070.2175.1472.96 ± 2.5167.7876.88
duw0486.07 ± 02.3581.1188.6186.12 ± 02.1982.8589.7973.04 ± 1.6570.3575.5673.98 ± 2.5269.1777.78
duw0585.42 ± 02.0482.3689.1786.71 ± 02.2384.1089.7969.29 ± 2.0965.9074.1772.58 ± 3.1266.7477.92
duw0686.19 ± 01.5782.3687.4387.65 ± 01.5085.2889.5874.10 ± 1.9970.6377.2975.80 ± 2.5669.5177.64
duw0787.34 ± 02.1285.0090.9788.18 ± 00.7987.0189.9373.36 ± 1.8170.5675.8374.19 ± 2.1771.0478.33
duw0886.22 ± 02.0482.2288.6887.60 ± 00.8286.2588.5472.92 ± 2.8769.5877.2277.05 ± 1.7474.3879.72
duw0987.62 ± 01.6185.1489.7988.58 ± 00.9186.8889.7272.78 ± 1.7371.1175.9074.53 ± 3.1468.6179.58
duw1085.74 ± 01.6483.4788.0686.70 ± 01.1584.5888.4772.16 ± 1.4070.0074.3877.62 ± 1.9873.5480.35