1. Introduction
Discrete data compression is an interesting problem, especially when the compressed data is required to maintain the characteristics of the original data [1]. Most state-of-the-art classification methods require a large amount of memory and time, making them unfeasible options for some practical, real-world applications [2].
In many datasets, the number of attributes (also called the dimension) is large, and many algorithms do not work well with high-dimensional datasets because they require all the information to be stored in memory prior to processing. Nowadays, it is a challenge to process high-dimensional datasets such as the censuses carried out in different countries [3]. A census is a particularly relevant process and a vital source of information for a country [4]. The predominant characteristic of this type of information is that most of the data is categorical.
Assigning a class to an observation in a dataset is a basic task in data analysis and pattern recognition; it consists of labeling an observation from a set of known variables [5].
Supervised learning is a branch of machine learning (ML) that tries to model the behavior of a system. Supervised models are created from observations consisting of a set of input and output data; a supervised model describes the function that associates inputs with outputs [6].
In many cases, k-nearest neighbors (kNN) is a simple and effective classification method [7]. However, it presents two major problems when it comes to implementation: (1) it is a lazy learning method and (2) it depends on the selection of the value of k [8]. Another limitation of this method is its high memory consumption, which limits its application [9].
In this work, a new method to classify information with the kNN algorithm on a compressed dataset is proposed. The method compresses observations into packets of a certain number of bits; in each packet, a certain number of attributes are stored (compressed) through bit-level operations. This avoids having to reduce the size of the dataset [9,10] to overcome the memory problem.
An interesting feature of the proposed method is that the information can be decompressed observation by observation in real time (on the fly), without the need to decompress the whole dataset and load it into memory.
As an application of the compression mechanism, this work proposes an implementation of the kNN algorithm that works on compressed data; we call this method the “Compressed kNN algorithm”.
The rest of this document is organized as follows: Section 2 gives a brief introduction to data classification techniques focusing on the kNN algorithm, Section 3 describes the datasets used in this work, Section 4 presents a variation of the algorithm for working with compressed data, Section 5 presents some results obtained with the execution of the proposed algorithm, and finally, Section 6 presents some conclusions and future work.
2. Background
This section describes, in general, the process of data classification focusing on the kNN method (the algorithm is also presented). In addition, some compression/encoding techniques are described, as well as the metrics used for categorical, numerical or mixed information.
2.1. KNN
The kNN algorithm belongs to the family of methods known as instance-based methods. These methods are based on the principle that observations (instances) within a dataset are usually located close to other observations that have similar attributes [11].
Given an observation whose class we want to predict, this method selects the closest observations from the dataset in such a way as to minimize the distance [12]. There are two types of kNN algorithms [10]:
Structure-less NN
Structure-based NN
Algorithm 1 defines the basic scheme of the kNN classification method (structure-less NN) on a dataset with m observations.
There are variations of this algorithm that aim to reduce the size of the input dataset [13]. For example, we can cite stochastic neighbor compression (SNN) [14], which compresses the input dataset in order to obtain a sample of the data, or ProtoNN (compressed and accurate kNN for resource-scarce devices) [15], which generates a small set of prototypes to represent the input dataset.
Algorithm 1: The k-nearest neighbors (KNN) algorithm.
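As a complement to Algorithm 1, the following is a minimal Java sketch of a structure-less (brute-force) kNN classifier: distances to all m observations are computed, the observations are sorted by distance, and a majority vote is taken among the k nearest. The class and method names are illustrative and not taken from the paper.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

public final class BasicKnn {
    // Classify one query observation against a dataset of m rows.
    static int classify(double[][] data, int[] labels, double[] query, int k,
                        BiFunction<double[], double[], Double> distance) {
        Integer[] idx = new Integer[data.length];
        for (int i = 0; i < data.length; i++) idx[i] = i;
        // Linear (structure-less) search: sort all observations by distance to the query.
        Arrays.sort(idx, (a, b) -> Double.compare(
                distance.apply(data[a], query), distance.apply(data[b], query)));
        // Majority vote among the k nearest neighbors.
        Map<Integer, Integer> votes = new HashMap<>();
        for (int i = 0; i < k && i < idx.length; i++) votes.merge(labels[idx[i]], 1, Integer::sum);
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue()).get().getKey();
    }
}
```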
From the algorithm definition, it is necessary to define the concept of distance between observations.
2.1.1. Categorical Data
In data mining there are several types of data, including numerical, categorical, text, images, audio, etc. Current methods that work with this kind of information generally convert these data types into numerical or categorical data [16]. Categorical or nominal data (also known as qualitative variables) have been the subject of study in several contexts [17]. Data mining on categorical data is an important research topic, and several methods have been developed to work with this type of information, such as decision trees, rough sets and others [16].
2.1.2. Metrics
The similarity or distance between two objects (observations in our case) plays an important role in many tasks of classification or grouping of data [
18].
In this study, the concept of similarity between two observations composed of categorical variables is considered.
The recommended distance function is the heterogeneous Euclidean-overlap metric (HEOM) [19], which is defined as [20]:

$\mathrm{HEOM}(x, y) = \sqrt{\sum_{i=1}^{n} d_i(x_i, y_i)^2}$    (1)

where $d_i(x_i, y_i)$ is the distance between the two observations in the i-th attribute.

In case the i-th attribute has categorical values, it is recommended to use the Hamming distance [21], which is defined as:

$d_i(x_i, y_i) = \begin{cases} 0 & \text{if } x_i = y_i \\ 1 & \text{otherwise} \end{cases}$    (2)

In case the i-th attribute has numerical values, it is recommended to use the range-normalized difference distance, which is defined as:

$d_i(x_i, y_i) = \dfrac{|x_i - y_i|}{\max_i - \min_i}$    (3)

where $\max_i$ and $\min_i$ are the maximum and minimum values observed for the i-th attribute.
The above allows working with hybrid datasets, i.e., datasets composed of categorical and numerical variables.
In case the dataset only contains categorical variables, the Hamming distance is applied (see Equation (
2)). If the dataset only contains numerical variables, it is possible to apply traditional distances such as Euclidean, Manhattan or Minkowski [
21].
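As an illustration of Equations (1)–(3), the following is a minimal Java sketch of the HEOM distance for a mixed observation. The helper arrays (isCategorical, max, min) are assumptions introduced for the example; they are not part of the original formulation.

```java
// Minimal HEOM sketch: Hamming (overlap) for categorical attributes,
// range-normalized difference for numerical ones.
public final class Heom {
    public static double distance(double[] x, double[] y,
                                  boolean[] isCategorical,
                                  double[] max, double[] min) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d;
            if (isCategorical[i]) {
                d = (x[i] == y[i]) ? 0.0 : 1.0;                        // Equation (2)
            } else {
                double range = max[i] - min[i];
                d = range == 0 ? 0.0 : Math.abs(x[i] - y[i]) / range;  // Equation (3)
            }
            sum += d * d;                                              // Equation (1)
        }
        return Math.sqrt(sum);
    }
}
```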
2.2. Data Compression
In database theory, a record R is a set of attributes $\{A_1, A_2, \ldots, A_n\}$. An instance r is a set of values associated with each attribute $A_i$, that is, $r = (v_1, v_2, \ldots, v_n)$ (one value for each attribute). Each attribute $A_i$ has an associated set of possible values, $\mathrm{dom}(A_i)$, called the domain. Table 1 describes the above.
The main objective of this work is the classification of categorical data. Categorical data are stored in one-dimensional vectors (by rows or by columns) or in matrices depending on the information type.
From now on, we will denote by $V_i$ the set of values of the attribute $A_i$, that is: $V_i = \{v_{1i}, v_{2i}, \ldots, v_{mi}\}$. We will denote by $O_j$ the set of values of the j-th observation, that is: $O_j = \{v_{j1}, v_{j2}, \ldots, v_{jn}\}$.
Among the methods of compression/encoding of categorical data we can find [
22]:
Run-length encoding
Offset-list encoding
GNU ZIP (GZIP)
Bit level compression
Each of these methods has advantages and disadvantages. In this work, the last option, bit-level compression, is used, applied to the compression of the observations $O_j$.
2.3. Bit Level Compression
In this section we review the mechanism for compressing categorical data. The compression method corresponds to a variation of the bit-level compression used by the REDATAM software (http://www.redatam.org). This method does not use all the available bits, because the block size may not be a multiple of the number of bits needed to represent the categories. Each data block stores one or more values depending on the maximum size in bits required to store the values [23].
Categorical data is represented as signed integer values of 32, 16 or 8 bits (4, 2 or 1 bytes). This implies that storing a single value requires 32 bits (or its equivalent in 2 or 1 bytes). As an example, we will consider the case in which the information is represented as a set of four-byte integer values (some frameworks use this representation [24]). Figure 1 represents the bit distribution of an integer value composed of 4 bytes.
Figure 2 shows a set of four values $v_1, v_2, v_3, v_4$; the gray area corresponds to space that is not used. Out of a total of $4 \times 32 = 128$ bits, only 16 are used, which represents 12.5% of the total storage. We can make a similar representation using different block sizes.
Bit-level compression can be implemented using 64, 32, 16 or 8 bits per block; these sizes correspond to the computer's internal representation of integer values. Figure 2 shows the storage using 32 bits per block, where the gray areas (unused bits) can be used to store additional values. In one 32-bit block, we can store the four values $v_1, v_2, v_3, v_4$, reducing the memory consumption to 12.4% of the original size. Figure 3 shows one 32-bit block that stores all four values.
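The following minimal Java sketch shows how four 4-bit values might be packed into, and read back from, a single 32-bit block with shift and mask operations. The layout (value i stored at bit offset 4i) is illustrative and not necessarily the exact REDATAM layout.

```java
// Pack four 4-bit values (range 0-15) into one 32-bit block and read them back.
public final class BitBlock {
    static final int BITS = 4;                 // bits per value in this example
    static final int MASK = (1 << BITS) - 1;   // 0b1111

    static int pack(int[] values) {            // values.length <= 32 / BITS
        int block = 0;
        for (int i = 0; i < values.length; i++) {
            block |= (values[i] & MASK) << (i * BITS);
        }
        return block;
    }

    static int unpack(int block, int position) {
        return (block >>> (position * BITS)) & MASK;
    }

    public static void main(String[] args) {
        int block = pack(new int[] {3, 7, 1, 9});
        System.out.println(unpack(block, 2));   // prints 1
    }
}
```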
Table 2 shows the number of bits used to represent a group of categorical values. The first column shows the number of categories to represent (two, between three and four, between five and eight, etc.), the second column shows the number of bits needed to represent those categories, the third and fifth columns show the total number of elements that can be stored within a 32- or 64-bit block and, finally, the fourth and sixth columns show the number of bits that are lost using 32 or 64 bits per block.
The implemented compression uses a fixed number of bits for all the data in order to reduce the number of operations; this may cause a group of bits to be lost (in the case that the number of bits is not a divisor of 64, 32, 16 or 8). For example, if we need to compress eight categories using a 64-bit block, each value needs three bits, so 21 values fit in the block and one bit is lost in each block. Therefore, in order to get a better compression ratio and lose as few bits as possible, we should choose the biggest block size.
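As a small illustration (not taken from the paper), the arithmetic behind Table 2 can be reproduced as follows:

```java
// Bits needed per value, values per block, and bits lost per block.
public final class BlockMath {
    static int bitsNeeded(int categories) {
        // smallest N such that 2^N >= categories
        return 32 - Integer.numberOfLeadingZeros(categories - 1);
    }
    static int elementsPerBlock(int blockBits, int bitsPerValue) {
        return blockBits / bitsPerValue;               // floor(B / N)
    }
    static int lostBits(int blockBits, int bitsPerValue) {
        return blockBits % bitsPerValue;               // B - E * N
    }
    public static void main(String[] args) {
        int n = bitsNeeded(8);                         // 3 bits for 8 categories
        System.out.println(elementsPerBlock(64, n));   // 21 values per 64-bit block
        System.out.println(lostBits(64, n));           // 1 bit lost per block
    }
}
```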
3. Datasets
This section describes the datasets used in this work as well as some of their characteristics.
Datasets are generally stored as flat files. Rows represent objects (also called records, observations, points) and columns represent attributes (features) [
25]. Two test datasets were used in the experiments:
Census income data set (CIDS), which represents a mixed dataset containing both categorical and numeric variables.
Wisconsin breast cancer (original) (WBC-original) which represents a dataset containing only categorical variables.
Table 3 shows a brief description of the two datasets.
Dataset selection was made to show how the algorithm works with a dataset containing only categorical variables and with a mixed dataset (categorical and numeric variables).
Each of the datasets used is described below.
3.1. Census Income Data Set
Table 4 describes the type and range of each variable.
The dataset contains 32,561 observations of which 2399 observations contain missing values (NA). In order to simplify the process, all observations containing at least one NA value were deleted, so that the resulting dataset contains 30,162 observations.
As can be seen in the table above, there are variables that are originally considered continuous. However, they can be treated as categorical because they take integer values within a small range. These variables are considered categorical in the tests performed.
3.2. Wisconsin Breast Cancer (Original)
Table 5 describes the type and range of each variable.
The dataset contains 699 observations of which 16 observations contain missing values (NA). In order to simplify the process, all observations containing at least one NA value were deleted, so that the resulting dataset contains 683 observations.
As seen in Table 5, the sample code number variable corresponds to the identifier of the observation, so it can be discarded. The rest of the attributes are all categorical in the range of one to 10, with the exception of the class variable, which has two possible values: 2 or 4.
4. Compressed kNN
This section proposes a variation of the kNN algorithm to work with compressed categorical data. The proposed compression method uses a variation of the method described in [22], in which each observation $O_j$ is compressed with a bit schema. Similar to the scheme used by the REDATAM 7 V3 software [23], each block stores the values corresponding to a group of attributes of $O_j$.
To apply the proposed scheme, it is necessary to determine the maximum number of bits needed to represent all the possible values of an observation. This means that, prior to the process, the number of bits needed to represent each attribute ($N_i$) is determined, and then the maximum of these values is taken.
If we define $N_i$ as the number of bits necessary to represent the attribute (variable) $A_i$, the number of bits necessary to represent an observation is given by:

$N = \max_{i = 1, \ldots, n} N_i$    (4)
Also note that we use the same value, defined as $N$, for all variables in order to avoid additional operations in the decompression/decoding phase.
The proposed method uses bit-wise operations (AND, OR, etc.) which are an important part of modern programming languages. These operations are widely used because they allow arithmetic operations to be replaced by more efficient operations [
27].
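As a small illustrative example (not from the paper), a division or modulo by a power of two can be replaced by a shift or a mask, which is the kind of substitution the compression code relies on:

```java
public final class BitwiseDemo {
    public static void main(String[] args) {
        int value = 147;
        System.out.println(value >> 3);  // same as value / 8 for non-negative ints
        System.out.println(value & 7);   // same as value % 8 for non-negative ints
    }
}
```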
Figure 4 describes the phases of the proposed method that can be summarized as:
Preprocess: In this phase, starting from the original dataset, the possible categorical and numerical variables are determined, as well as the ranges of each categorical variable.
Feature compression: In this phase, the compression algorithm is applied to the original dataset.
kNN Classification: In this phase the classification is applied on the compressed dataset.
4.1. Preprocess
This phase determines the number of bits needed to represent each categorical variable and then the value of N given by Equation (4). Let's take the WBC dataset as an example: in a first step, the sample code number variable (v1, see Table 5) was eliminated. Records containing missing values were then deleted, resulting in a dataset of 683 observations. Two datasets were generated, one for training (80% of the data) and one for testing (20% of the data). These two datasets constitute the input data for compression and then for classification (see Table 6).
The minimum number of bits needed to represent a range of categories ($N_i$) and the number of elements per block ($E$) can be obtained using:

$N_i = \lceil \log_2(c_i) \rceil$    (5)

$E = \lfloor B / N \rfloor$    (6)

where $c_i$ is the number of categories of the attribute $A_i$, $B$ represents the size of the block and can be 8, 16, 32 or 64, $\lceil x \rceil$ represents the smallest integer greater than or equal to its argument, defined by $\lceil x \rceil = \min\{k \in \mathbb{Z} \mid k \ge x\}$, and $\lfloor x \rfloor$ represents the biggest integer smaller than or equal to its argument, defined by $\lfloor x \rfloor = \max\{k \in \mathbb{Z} \mid k \le x\}$ [22].
In the rest of this document, as an example, we use a block size equal to 32, that is, $B = 32$.
All variables have a range from one to 10 (10 categories), so the number of bits needed for compression corresponds to $N = \lceil \log_2 10 \rceil = 4$ (4 bits) and the number of elements per block corresponds to $E = \lfloor 32/4 \rfloor = 8$ (with a 32-bit block). This indicates that eight values can be stored in a block, which corresponds to the values of variables two through nine; the remaining variable (variable number 10) must be stored in an additional block.
A similar process was carried out with the census income data set. All categorical variables have a range from one to 128, so the number of bits needed for compression corresponds to $N = \lceil \log_2 128 \rceil = 7$ (7 bits). After the elimination of the NA values, the final dataset has 30,162 observations. The final dataset was split in two: 80% for training and 20% for testing.
4.2. Feature Compression
In this phase we proceed to compress each observation using a number of bits per value equal to $N$ (see Equation (4)); with this, in a 32-bit block it is possible to store the values of $E = \lfloor 32/N \rfloor$ variables (attributes). The process starts with the original dataset and generates a compressed/encoded dataset (see Figure 5).
If n is the total number of attributes (see Table 1), the number of blocks needed to represent all the attributes corresponds to [22]:

$N_B = \lceil n / E \rceil$    (7)

If we use a block size $B$ equal to 32, Equation (7) can be written as:

$N_B = \lceil n / \lfloor 32 / N \rfloor \rceil$

In the case of the WBC dataset, $N_B = \lceil 9/8 \rceil = 2$. This means that the original dataset can be represented by two one-dimensional vectors, the first containing the information of eight variables and the second containing the information of the remaining variable.
The blocks w1 and w2 are defined in such a way that w1 contains the compressed/encoded information of the variables v2 through v9 and w2 contains the compressed/encoded information of the remaining variable v10. The new generated dataset is called the WBC encoded dataset (see Table 7).
In the case of the CID dataset, the categorical variables are packed into $N_B = 3$ blocks. This means that the original dataset can be represented by six one-dimensional vectors: the first three contain the information of the compressed categorical variables and the rest of the vectors (the last three) contain the information of the remaining uncompressed (numerical) variables.
The blocks w1, w2 and w3 are defined in such a way that each of them contains the compressed/encoded information of a group of categorical variables (up to $E = \lfloor 32/7 \rfloor = 4$ values per block). The new generated dataset is called the CID encoded dataset (see Table 8).
In [22] the compression algorithm was applied to the columns of the dataset; in this work, we apply the algorithm to the rows (observations) of the dataset.
Figure 6 describes the compression of w1 in the WBC dataset.
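As a hedged illustration of this phase, the following Java sketch packs one observation (a row of small category codes) into $\lceil n/E \rceil$ 32-bit blocks using a fixed number of bits per value. The names (compressRow, bitsPerValue) are illustrative and not taken from the paper's implementation.

```java
public final class RowCompressor {
    // Pack one observation (n category codes) into ceil(n / elementsPerBlock) blocks.
    static int[] compressRow(int[] row, int bitsPerValue) {
        int elementsPerBlock = 32 / bitsPerValue;                            // E = floor(B / N)
        int blocks = (row.length + elementsPerBlock - 1) / elementsPerBlock; // ceil(n / E)
        int[] packed = new int[blocks];
        for (int i = 0; i < row.length; i++) {
            int block = i / elementsPerBlock;
            int slot  = i % elementsPerBlock;
            packed[block] |= (row[i] & ((1 << bitsPerValue) - 1)) << (slot * bitsPerValue);
        }
        return packed;
    }

    public static void main(String[] args) {
        // WBC-like row: 9 categorical values in 1..10, 4 bits each -> 2 blocks (w1, w2).
        int[] row = {5, 1, 1, 1, 2, 1, 3, 1, 1};
        int[] packed = compressRow(row, 4);
        System.out.println(packed.length);  // prints 2
    }
}
```

For a WBC-like row of nine values in the range one to 10 (4 bits each), the sketch produces the two blocks that correspond to w1 and w2 described above.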
4.3. kNN Classification
The classification process uses as its input dataset the compressed dataset obtained by applying the compression to the original dataset. Figure 7 shows the implemented classification process. The process uses custom metrics to work with the compressed data; in this case, the HEOM and Hamming metrics are used. At the classification stage, it is necessary to decompress the information and calculate the distances between observations.
The distance calculation uses the algorithm described in [28] for the norm calculation, modified to decompress two vectors at a time (the observations between which the distance is calculated). Because the execution of the kNN algorithm compares one observation with all the other observations, a local cache is used to avoid repeatedly decompressing the first vector (the test observation). Algorithms 2 and 3 describe the calculation of the Euclidean and Hamming distances with on-the-fly decompression.
Algorithm 2: Euclidean compressed_distance: distance between two compressed vectors.
Algorithm 3: Hamming compressed_distance: distance between two compressed vectors.
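The listings of Algorithms 2 and 3 are not reproduced above, so the following minimal Java sketch only illustrates the underlying idea for the Hamming case: each block of the second vector is decoded on the fly, while the first (test) vector is assumed to have been decompressed once and cached by the caller. All names are illustrative.

```java
public final class CompressedHamming {
    // Decompress a packed row back into category codes (on the fly, one row at a time).
    static int[] decompressRow(int[] packed, int n, int bitsPerValue) {
        int elementsPerBlock = 32 / bitsPerValue;
        int mask = (1 << bitsPerValue) - 1;
        int[] row = new int[n];
        for (int i = 0; i < n; i++) {
            row[i] = (packed[i / elementsPerBlock]
                      >>> ((i % elementsPerBlock) * bitsPerValue)) & mask;
        }
        return row;
    }

    // Hamming distance between a cached (already decompressed) row and a packed row.
    static int distance(int[] cachedX, int[] packedY, int bitsPerValue) {
        int[] y = decompressRow(packedY, cachedX.length, bitsPerValue);
        int d = 0;
        for (int i = 0; i < cachedX.length; i++) if (cachedX[i] != y[i]) d++;
        return d;
    }
}
```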
Algorithm 4 describes the calculation of Hamming distances without decompression using bit-wise operations.
Algorithm 4: Hamming compressed_distance without decompression: distance between two compressed vectors.
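Algorithm 4's listing is likewise not reproduced; one possible (illustrative, not necessarily the paper's) way to compute a Hamming distance directly on the packed blocks is to XOR the corresponding blocks and count the non-zero fields:

```java
public final class PackedHamming {
    // Hamming distance computed directly on packed 32-bit blocks.
    static int distance(int[] packedX, int[] packedY, int n, int bitsPerValue) {
        int elementsPerBlock = 32 / bitsPerValue;
        int mask = (1 << bitsPerValue) - 1;
        int d = 0;
        for (int i = 0; i < n; i++) {
            int block = i / elementsPerBlock;
            int shift = (i % elementsPerBlock) * bitsPerValue;
            // A non-zero field in the XOR of the two blocks means the values differ.
            if ((((packedX[block] ^ packedY[block]) >>> shift) & mask) != 0) d++;
        }
        return d;
    }
}
```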
Similarly, it is possible to define algorithms to calculate other metrics.
After defining the metrics, we proceed to define the kNN algorithm on compressed data. Algorithm 5 describes the Compressed kNN algorithm.
Algorithm 5: Compressed kNN.
The first step is to compress the original dataset; then we iterate over the compressed dataset and calculate the distances between the observation to be classified and the dataset elements. Finally, the set of distances is sorted in ascending order and the most repeated class among the k first elements is selected.
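Putting the pieces together, a minimal sketch of the loop just described might look as follows, reusing the illustrative CompressedHamming helpers sketched earlier; this is a sketch under those assumptions, not the paper's actual implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public final class CompressedKnn {
    // compressedData: one packed row per training observation; query: packed test row.
    static int classify(int[][] compressedData, int[] labels, int[] query,
                        int n, int bitsPerValue, int k) {
        // Decompress the query once and keep it in a local cache.
        int[] cachedQuery = CompressedHamming.decompressRow(query, n, bitsPerValue);
        Integer[] idx = new Integer[compressedData.length];
        for (int i = 0; i < compressedData.length; i++) idx[i] = i;
        // Sort training observations by their distance to the query (ascending).
        Arrays.sort(idx, (a, b) -> Integer.compare(
                CompressedHamming.distance(cachedQuery, compressedData[a], bitsPerValue),
                CompressedHamming.distance(cachedQuery, compressedData[b], bitsPerValue)));
        // Majority vote among the k nearest compressed observations.
        Map<Integer, Integer> votes = new HashMap<>();
        for (int i = 0; i < k && i < idx.length; i++) votes.merge(labels[idx[i]], 1, Integer::sum);
        return votes.entrySet().stream().max(Map.Entry.comparingByValue()).get().getKey();
    }
}
```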
The algorithm can be generalized to work with other kNN implementations; this paper presents results obtained from two variations of the neighbor search, linear search and cover tree search (see Section 5.5).
As we can see, the principal advantage of the proposed algorithm is the reduction of the dataset size. The result of the compression phase is a dataset that can be fully loaded into memory, thus reducing the memory limitations described in Section 1 without sampling or dimension reduction. If we combine compression with sampling or dimension reduction, we can obtain a better algorithm than the one described in Algorithm 5 (this is proposed as future work in Section 6).
Limitations
There are some limitations related to the distance calculation. Prior to calculating the distance, the entire observation must be decompressed. If we use a dataset with only categorical variables, we can use Algorithm 4 to calculate the Hamming distance without decompression.
If we try to classify a group of data (a test dataset), then for each test observation we need to decompress the training observations; so, if we have t observations in the test dataset, we need to decompress each training observation t times. If the test dataset is also compressed, to avoid the limitation shown above, Algorithms 2 and 3 use a local cache to store the uncompressed test observation. We can extend the cache idea to the training dataset and use a small local cache of uncompressed training data, as sketched below.
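As a hedged illustration of the cache idea mentioned above (not part of the paper's implementation), a small bounded cache of decompressed training rows could be built on a LinkedHashMap with access-order eviction:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Bounded cache mapping a training-row index to its decompressed values.
public final class RowCache extends LinkedHashMap<Integer, int[]> {
    private final int capacity;

    public RowCache(int capacity) {
        super(16, 0.75f, true);          // access-order: least recently used first
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Integer, int[]> eldest) {
        return size() > capacity;        // evict when the cache exceeds its capacity
    }
}
```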
Another limitation is related to the range that the categorical variables may have. If the range is large, we can only compress a few values inside a block, so the calculations needed to decompress them may drive up the processing time.
5. Experimental Results
This section presents some results obtained by applying the kNN algorithm to the original and to the compressed dataset. The tests were performed using two metrics: Hamming and HEOM (for uncompressed and compressed values).
All tests were performed using the statistical machine intelligence and learning engine (SMILE) framework (https://haifengl.github.io/smile/), modified to work with compressed data according to the algorithm described in Section 2.3 (see the source code in the Supplementary Material). The algorithm used corresponds to kNN with a value of k between five and 20 ($5 \le k \le 20$), which by default uses the Euclidean metric EuclideanDistance that operates with real (double) numbers.
The WBC dataset was compressed using a number of bits equal to four (see Equation (
4)). The CID dataset was compressed with a mixed schema using a number of bits equal to seven (see Equation (
4)).
Table 9 shows the execution times of the classification and cross-validation (k-fold) for different values of k with linear search. In the case of the WBC dataset, the Hamming metric is used.
As can be seen, the accuracy of the classification is the same in both cases (uncompressed dataset and compressed dataset), as expected. The classification time is slightly shorter when working with the compressed dataset (see Table 9).
Table 10 shows the execution times of the classification and cross-validation (k-fold) for different values of k with linear search. In the case of the CID dataset, the HEOM metric is used. As can be seen, the accuracy of the classification is the same in both cases (uncompressed dataset and compressed dataset). The classification time is slightly shorter when working with the compressed dataset (see Table 10).
5.1. Test Platform
The platform on which the tests were carried out corresponds to:
Processor: Intel(R) Core(TM) i5-5200 CPU
Processor speed: 2.20 GHz
RAM Memory: 16.0 GB
Operating System: Windows 10 Pro 64 bits
Compiler: JDK 1.8.0.144 64-Bit Server VM
Machine Learning Framework: SMILE 1.5.2
5.2. Memory Consumption
The memory consumption represents the amount of memory required to represent the complete dataset.
Table 11 shows a summary of the datasets used in the experiments carried out.
Table 12 shows the amount of memory used to represent both datasets (WBC and CID) using a traditional vector representation with int, short and byte as the element type. The % Memory column is calculated using the int representation as 100%. Because the final weight variable takes large values (see Table 4), the CID dataset is only represented as a vector of int values.
Table 13 shows the amount of memory used to represent the WBC dataset using different block sizes. The % Memory column is calculated using the int vector size as 100% (see Table 12). Table 14 shows the amount of memory used to represent the CID dataset using different block sizes. The % Memory column is calculated using the int vector size as 100% (see Table 12). Because the final weight variable takes large values (see Table 4), the CID dataset is only compressed with 32- and 64-bit blocks.
The WBC dataset contains nine valid variables and 683 observations. Since four bits are used for compression, a 32-bit block compresses eight attributes, so two 32-bit blocks are needed to represent the nine categorical attributes.
With this data, the memory needed to represent the complete dataset corresponds to the product of the number of observations of the dataset by the number of attributes by the size of the representation of an integer value (sizeof(int)):

$S_{\mathrm{original}} = m \times n \times \mathrm{sizeof(int)} = 683 \times 9 \times 4 = 24{,}588 \text{ bytes}$    (8)

while the value of the representation in the compressed format corresponds to the product of the number of observations of the compressed dataset by the number of blocks resulting from the compression (see Equation (7)) by the size of the representation of an integer value (sizeof(int)):

$S_{\mathrm{compressed}} = m \times N_B \times \mathrm{sizeof(int)} = 683 \times 2 \times 4 = 5464 \text{ bytes}$    (9)
Taking as 100% the value of the uncompressed representation (Equation (
8)),
Figure 8 and
Figure 9 show the memory consumption for the datasets using different vector representations and block sizes (for compression).
As can be seen, the memory consumption for the WBC dataset is lower using any compressed representation (8, 16, 32 or 64 bits). The best compression ratio is obtained using an 8-bit block size; this represents 13.9% of the integer vector representation and 55.6% of the byte vector representation.
In the case of the CID dataset, the compressed representation is also smaller than the traditional representation. Using a 32-bit block, the compressed format represents 42.9% of the int vector representation.
The compressed representation uses 6 bits to represent the data, so we can compress/encode 5 values in each 32-bit block; this leads to a new (compressed) dataset with 5 columns instead of the original 21. As can be seen, the total memory used to represent the original dataset is 1.2166 GBytes (using a vector that contains integer values), and the compressed representation is equivalent to 32.81% of the original dataset. The compression using a 32-bit block is slower than any vector representation.
5.3. Accuracy
Figure 11 and Figure 12 show the accuracy comparison for the classification using the default Euclidean metric and the HEOM metric (for uncompressed and compressed values).
In all cases, the compressed version of the datasets was used (with a 32-bit block). In the case of the WBC dataset, the HEOM metric corresponds to the Hamming metric since all variables are categorical.
As can be seen in the figures, the HEOM metric provides a better result. The result for the execution with compression is the same as for the execution without compression, since the same metric is used; what varies is the processing speed.
5.4. Processing Speed
Figure 13 and Figure 14 show the comparison of the cross-validation speed. The kNN algorithm was executed with both the uncompressed and the compressed datasets (using a 32-bit block) for k-fold cross-validation and different values of k. As can be seen in the figures, the use of compression shows an increase in the processing time.
5.5. kNN Variations
This section presents some results obtained by applying the kNN algorithm to the compressed dataset with a cover tree search algorithm. Table 17 shows the execution times of the classification for different values of k using a linear search and a cover tree search; in the case of the WBC dataset, the Hamming metric was used. Table 18 shows the execution times of the classification for different values of k using a linear search and a cover tree search; in the case of the CID dataset, the HEOM metric was used.
6. Conclusions
In this paper we reviewed one of the most widely used classification algorithms, kNN. Along with some traditional methods for the compression/encoding of categorical data, a model called “compressed kNN” was proposed to work directly on compressed categorical data. After performing some tests on known datasets (WBC and CID), we can conclude that the proposed method considerably reduces the amount of memory used by the kNN algorithm while maintaining the same classification error as the original method.
The inclusion of the compression stage prior to the execution of the algorithm guarantees a decrease in the percentage of memory needed to represent the whole dataset (see Figure 8 and Figure 9). The inclusion of the distance calculation for compressed categorical data (see Algorithm 2) slightly increases the classification time, because it is necessary to decompress the observations prior to the distance calculation in order to determine the nearest neighbors. This paper also provides an algorithm for the Hamming distance without decompression.
Future work includes (1) the implementation of variations of the kNN algorithm of the structure-less NN and structure-based NN types, (2) extending the compression concepts to clustering algorithms, (3) the implementation of hybrid algorithms that combine sampling with compression, and (4) reordering attributes before compression.