K-means is one of the most popular and widely used clustering algorithms; however, it is limited to numerical data. The k-prototypes algorithm is well known for handling both numerical and categorical data. However, there have been no studies on accelerating it. In this paper, we propose a new, fast k-prototypes algorithm that gives the same answers as the original k-prototypes algorithm. The proposed algorithm avoids distance computations by using partial distance computation: it finds the minimum distance without computing the distance over all attributes between an object and a cluster center, which allows it to reduce time complexity. Partial distance computation relies on the fact that the maximum difference between two values of a categorical attribute is 1. If data objects have m categorical attributes, the maximum difference over the categorical attributes between an object and a cluster center is m. Our algorithm first computes the distance using the numerical attributes only. If the difference between the smallest and the second smallest of these numerical distances is greater than m, we can find the closest cluster center without computing the distance over the categorical attributes. The experimental results show that the computational performance of the proposed k-prototypes algorithm is superior to that of the original k-prototypes algorithm on our dataset.

The k-means algorithm is one of the simplest unsupervised clustering algorithms and is therefore very widely used [

The purpose of k-means is to find clusters that minimize the sum of squared distances between each cluster center and all objects in that cluster. Even when the number of clusters is small, finding an optimal solution to the k-means problem is NP-hard [

The k-means algorithm spends much of its processing time computing the distances between each of the k cluster centers and the n objects. Because objects usually remain in the same clusters after a certain number of iterations, much of this repeated distance computation is unnecessary. A number of studies have therefore worked on accelerating k-means by avoiding unnecessary distance computations between an object and cluster centers [

The k-means algorithm is efficient for clustering large datasets, but it only works on numerical data. Real-world data, however, is usually a mixture of numerical and categorical features, which limits the applicability of k-means to cluster analysis. To overcome this problem, several algorithms have been developed to cluster large datasets that contain both numerical and categorical values; one well-known example is the k-prototypes algorithm by Huang [

Recently, big data has become a major issue for academia and various industries, and there is growing interest in technology for processing big data quickly. Research on fast processing of big data in clustering has been limited to numerical data, although big data contains categorical as well as numerical data. Because numerical and categorical data are processed differently, existing acceleration techniques are difficult to apply to clustering algorithms that deal with categorical data.

In this paper, we propose a fast k-prototypes algorithm for mixed data (FKPT). The FKPT reduces distance calculation using partial distance computation. The contributions of this study are summarized as follows.

Reduction: computational cost is reduced without an additional data structure and memory spaces.

Simplicity: it is simple to implement because it does not require a complex data structure.

Convergence: it can be applied to other fast k-means algorithms to compute the distance between each cluster center and an object for numerical attributes.

Speed: it is faster than the conventional k-prototypes.

This study presents a new method of accelerating the k-prototypes algorithm that uses partial distance computation to avoid unnecessary distance computations between an object and cluster centers. We believe the algorithm proposed in this paper will become the algorithm of choice for fast k-prototypes clustering.

The organization of the rest of this paper is as follows. In

In this section, we briefly describe various methods of accelerating k-means and the traditional k-prototypes algorithm.

K-means is one of the most popular clustering algorithms due to its simplicity and its scalability to large datasets. The k-means algorithm partitions n data objects into k clusters while minimizing the Euclidean distance between each data object and the center of the cluster it belongs to [

It first chooses k initial cluster centers in some manner. The final result of the algorithm is sensitive to this initial selection, and many efficient initialization methods have been proposed to obtain better final centers.

K-means then repeats two steps, assigning each object to its nearest center and updating each of the k centers as the mean of the object vectors assigned to it, until the k centers no longer change.
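The assignment and update steps can be sketched as follows. This is a minimal illustration rather than the implementation used in this paper: the function names, the `max_iter` cap, and the choice to leave the center of an empty cluster unchanged are all assumptions made for the example.

```python
from statistics import fmean


def squared_distance(x, c):
    """Squared Euclidean distance between two equal-length numeric vectors."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def kmeans(objects, centers, max_iter=100):
    """Repeat the assignment and update steps until the centers stop changing."""
    centers = [list(c) for c in centers]
    assignment = []
    for _ in range(max_iter):
        # Assignment step: each object goes to its nearest center.
        assignment = [
            min(range(len(centers)), key=lambda j: squared_distance(x, centers[j]))
            for x in objects
        ]
        # Update step: each center becomes the mean of its assigned objects.
        new_centers = []
        for j, old in enumerate(centers):
            members = [x for x, a in zip(objects, assignment) if a == j]
            if members:
                new_centers.append([fmean(col) for col in zip(*members)])
            else:
                new_centers.append(old)  # assumption: empty clusters keep their center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assignment
```

Every iteration computes all n times k distances in the assignment step, which is the cost that the acceleration methods discussed in this section try to avoid.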

The k-means algorithm spends most of its time computing distances between objects and the current cluster centers. However, many of these distance computations are unnecessary, because objects usually remain in the same clusters after a few iterations [

K-means is inefficient because, in each iteration, every object must identify its closest center. In one iteration, all

Pelleg and Moore (1999) and Kanungo et al. (2002) adapted a k–d tree to store datasets for accelerating k-means [

One way to accelerate such algorithms is to search using partial distance [ The squared distance between a point x and a center c can be calculated by summing the squared differences in each dimension. In a distance calculation between point x and another center c’, if the running sum exceeds the squared distance between x and c, the distance between x and c’ cannot be the minimum distance, so the calculation stops before all attributes have been processed. Partial distance search is usually most effective in high dimensions.
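A minimal sketch of partial distance search over numerical vectors is shown below. The function name `nearest_center_partial` is hypothetical, and the sketch abandons a center as soon as its running sum reaches the best distance found so far, so ties are resolved in favor of the earlier center.

```python
def nearest_center_partial(x, centers):
    """Return (index, squared distance) of the nearest center.

    The inner loop stops as soon as the running sum of squared differences
    can no longer beat the best distance found so far.
    """
    best_index, best_dist = -1, float("inf")
    for j, c in enumerate(centers):
        running = 0.0
        for xi, ci in zip(x, c):
            running += (xi - ci) ** 2
            if running >= best_dist:
                break  # this center cannot be strictly closer than the current best
        else:
            # No early exit happened, so this center is strictly closer.
            best_index, best_dist = j, running
    return best_index, best_dist
```

The result is the same as that of an exhaustive search; only the number of per-dimension operations changes, which is why the saving grows with the number of dimensions.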

The idea proposed in this study was inspired by this partial distance search [

The k-prototypes algorithm integrates the k-means and k-modes algorithms to deal with mixed data types [

In Equation (2),

The existing k-prototypes algorithm allocates objects to the cluster with the smallest distance by calculating the distance between each cluster center and a new object to be allocated to the cluster. Distance is calculated by comparing all attributes of an object with all attributes of each cluster center using the brute force method.
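The brute-force assignment described above can be sketched as follows. The sketch assumes that the numerical part of the distance is a sum of squared differences, that the categorical part counts mismatching attributes (0 or 1 per attribute), and that the two parts are added without a weighting factor. All function names and the representation of a center as a `(numerical, categorical)` pair are hypothetical.

```python
def numerical_distance(x_num, c_num):
    """Sum of squared differences over the numerical attributes (assumed form)."""
    return sum((a - b) ** 2 for a, b in zip(x_num, c_num))


def categorical_distance(x_cat, c_cat):
    """Number of categorical attributes whose values differ (0 or 1 each)."""
    return sum(1 for a, b in zip(x_cat, c_cat) if a != b)


def mixed_distance(x_num, x_cat, c_num, c_cat):
    """Total distance: numerical part plus unweighted categorical part."""
    return numerical_distance(x_num, c_num) + categorical_distance(x_cat, c_cat)


def assign_brute_force(x_num, x_cat, centers):
    """Compare every attribute against every center; return the closest index.

    `centers` is a list of (c_num, c_cat) pairs.
    """
    distances = [
        mixed_distance(x_num, x_cat, c_num, c_cat) for c_num, c_cat in centers
    ]
    return min(range(len(centers)), key=distances.__getitem__)
```

For an object with m categorical attributes, `categorical_distance` can never exceed m, which is the property the proposed method exploits.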

The purpose of distance computation is to find the cluster center closest to an object. However, the traditional k-prototypes algorithm performs unnecessary distance calculations. According to Equation (4), the maximum value that can be obtained when comparing a single categorical attribute is 1. In

This paper studies a method to find the closest center to an object without comparison for all attributes in a distance computation. We prove that the closest center to an object can be found without comparison for all attributes. For a proof, we define a computable max difference value as follows.

According to Equation (4), the distance for a single categorical attribute between a cluster center and an object is either 0 or 1. Therefore, by Definition 1, the computable max difference value for a categorical attribute is 1, regardless of the attribute's value. If an object in the dataset has m categorical attributes, then the computable max difference value between a cluster center and the object is m. The computable max difference value of a numerical attribute is the difference between the maximum and minimum values of that attribute. Thus, to obtain it, the full dataset must be scanned to find the maximum and minimum values.
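The bound of m on the categorical part is what makes it safe to skip the categorical comparison. The following is a sketch of the argument in assumed notation: d_r and d_c denote the numerical and categorical parts of the distance, the total distance is taken to be their unweighted sum, C_a is the center closest to X on the numerical attributes, and C_b is any other center.

```latex
\begin{align*}
  &\text{Assume } d(X, C) = d_r(X, C) + d_c(X, C), \qquad 0 \le d_c(X, C) \le m. \\
  &\text{If } d_r(X, C_b) - d_r(X, C_a) > m, \text{ then} \\
  &d(X, C_a) = d_r(X, C_a) + d_c(X, C_a)
     \le d_r(X, C_a) + m
     < d_r(X, C_b)
     \le d_r(X, C_b) + d_c(X, C_b)
     = d(X, C_b).
\end{align*}
```

Because the second smallest numerical distance is the smallest value d_r(X, C_b) can take over all other centers, it is enough to check the gap between the smallest and second smallest numerical distances.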

The proposed k-prototypes algorithm finds a minimum distance without distance computations of all attributes between an object and a cluster center using the computable max difference value of the object. The k-prototypes algorithm updates a cluster center after an object is assigned to the cluster of the closest center by the distance measure. By Equation (1), the distance

By Equation (2),

According to Definition 1, the categorical distance between an object and a cluster center with m categorical attributes can be

We illustrate with an example how the minimum distance between an object and each cluster center can be determined by computing the numerical attributes only.

Example 1. We assume that

In this section, we describe our proposed algorithm.

The proposed k-prototypes algorithm in this paper is similar to traditional k-prototypes. The difference between the proposed k-prototypes and traditional k-prototypes is that the distance between an object and cluster centers on the numerical attributes,

In Line 4, the distance on the numerical attributes is calculated first. The closest and second closest distance values are obtained while calculating this distance. The discriminant is then evaluated using these two values and the number of categorical attributes, m. If the result of the discriminant is true, the distance on the categorical attributes is calculated, and the final distance is obtained by adding it to the distance on the numerical attributes. If the result of the discriminant is false, the final distance is determined by the numerical attributes only. By including

Definition 1 is implemented as a function that determines whether the categorical attributes need to be compared. It returns true if, in the distances measured by comparing only the numerical attributes of an object and the cluster centers, the difference between the second smallest and the smallest distance is less than m. If true is returned, Algorithm 1 calls a function that computes the distance on the categorical attributes to obtain the final distance. If false is returned, the distance measured on the numerical attributes alone is used as the final result, without comparing the categorical attributes. The larger the difference between the two distances, the greater the number of categorical attributes that need not be compared.
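A minimal sketch of this discriminant and of the decision it drives is given below. The names `needs_categorical` and `closest_center` are hypothetical, the numerical distances are assumed to be precomputed, and the categorical distance is assumed to be an unweighted mismatch count. The sketch also treats a gap exactly equal to m as requiring the categorical comparison, which is a conservative choice made here so that ties are resolved exactly as in a full comparison.

```python
def needs_categorical(first_min, second_min, m):
    """Return True when the categorical attributes must be compared.

    Skipping is safe only when the numerical gap exceeds m, the largest
    value the categorical distance can take.
    """
    return second_min - first_min <= m


def closest_center(num_dists, x_cat, center_cats):
    """Return (index of closest center, whether categorical data was used).

    num_dists[j] is the numerical distance to center j (at least two centers
    are assumed); center_cats[j] holds the categorical values of center j.
    """
    m = len(x_cat)
    ordered = sorted(num_dists)
    first_min, second_min = ordered[0], ordered[1]
    if not needs_categorical(first_min, second_min, m):
        # The numerical gap alone decides the winner.
        return num_dists.index(first_min), False
    totals = [
        d + sum(1 for a, b in zip(x_cat, cats) if a != b)
        for d, cats in zip(num_dists, center_cats)
    ]
    return totals.index(min(totals)), True
```

For example, with numerical distances 2, 30, and 50 and three categorical attributes, the gap of 28 exceeds 3 and the first center is chosen without touching the categorical values.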

In Algorithm 2, DIST-COMPUTE-NUM() calculates the distance between an object and each cluster center on the numerical attributes and returns the distances for all clusters. In this algorithm, first_min and second_min are calculated to determine whether the categorical attributes need to be included in the distance computation.
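One way to obtain first_min and second_min in the same pass that computes the numerical distances is sketched below. The Python function `dist_compute_num` is only an illustration of the idea behind DIST-COMPUTE-NUM(), not a transcription of Algorithm 2, and it assumes a sum of squared differences as the numerical distance.

```python
import math


def dist_compute_num(x_num, centers_num):
    """Return (distances, first_min, second_min) over numerical attributes only."""
    distances = []
    first_min = second_min = math.inf
    for c_num in centers_num:
        d = sum((a - b) ** 2 for a, b in zip(x_num, c_num))
        distances.append(d)
        if d < first_min:
            # New smallest value; the previous smallest becomes second smallest.
            first_min, second_min = d, first_min
        elif d < second_min:
            second_min = d
    return distances, first_min, second_min
```

Tracking the two minima inside the loop avoids a separate sort over the k distances.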

In Algorithm 3, a distance between an object and cluster centers is calculated for categorical attributes of each cluster.

In Algorithm 4, a new center vector is assigned to each cluster. A center vector consists of two parts: numerical and categorical attributes. The numerical part of the center vector is the average value of each numerical attribute, and the categorical part is the most frequent value of each categorical attribute.
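The center update can be sketched as a mean over each numerical attribute and a mode over each categorical attribute. The function name `update_center` and the row-wise list representation are assumptions, and the sketch expects a non-empty cluster. When two categorical values are equally frequent, `Counter.most_common` returns the one encountered first.

```python
from collections import Counter
from statistics import fmean


def update_center(members_num, members_cat):
    """Return (mean vector, mode vector) for the objects of one cluster.

    members_num and members_cat hold one row per object in the cluster.
    """
    center_num = [fmean(col) for col in zip(*members_num)]
    center_cat = [Counter(col).most_common(1)[0][0] for col in zip(*members_cat)]
    return center_num, center_cat
```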

The time complexity of traditional k-prototypes is

All experiments were conducted on an Intel(R) Pentium(R) 3558U 1.70 GHz with 4 GB RAM. All programs were written in Java. We generated several independent, uniformly distributed mixed-type datasets. Numerical attribute values range from 0 to 100, and categorical attribute values range from A to Z.
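A dataset of this kind could be generated as in the following sketch. It is written in Python for brevity and is not the Java code used in the experiments. The function name, the fixed seed, and the use of continuous rather than integer numerical values are assumptions.

```python
import random
import string


def generate_dataset(n_objects, n_numerical, n_categorical, seed=0):
    """Generate uniformly distributed mixed-type objects.

    Each object is a (numerical values, categorical values) pair, with
    numerical values drawn from [0, 100] and categorical values from A to Z.
    """
    rng = random.Random(seed)  # assumption: fixed seed for repeatability
    data = []
    for _ in range(n_objects):
        num = [rng.uniform(0.0, 100.0) for _ in range(n_numerical)]
        cat = [rng.choice(string.ascii_uppercase) for _ in range(n_categorical)]
        data.append((num, cat))
    return data
```

A call such as `generate_dataset(500_000, 2, 16)` would match the smallest configuration reported below in terms of object count and attribute counts.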

We set |X| (number of objects) = {500000, 800000, 1000000}, numerical attributes = 2, categorical attributes = 16 and

To analyze the difference in CPU time for each dataset, we examined the number of computations: 114,844, 85,755, and 5,753,212 computations were eliminated in the 500k, 800k, and 1000k datasets, respectively. These reductions mean that fewer categorical attributes are compared in the distance calculation between a point and a center, which appears to reduce CPU time. The final clustering results of the proposed algorithm are the same as those of the original k-prototypes algorithm. These results lead us to conclude that the proposed algorithm performs better than the original k-prototypes algorithm.

In this paper, we have proposed a fast k-prototypes algorithm for clustering mixed datasets. Experimental results show that our algorithm is faster than the original algorithm. Previous fast k-means algorithms focused on reducing the candidate objects for which distances to cluster centers are computed. Our k-prototypes algorithm instead reduces unnecessary distance computation by using partial distance computation, which avoids computing the distance over all attributes between an object and a cluster center and allows it to reduce time complexity. The experiments show that the proposed k-prototypes algorithm improves computational performance over the original k-prototypes algorithm on our dataset.

However, our k-prototypes algorithm does not guarantee that computational performance will improve in all cases. If, for every object in a given dataset, the difference between the first and second minimum distances to the cluster centers on the numerical attributes is less than m, then the performance of our k-prototypes algorithm is the same as that of the original k-prototypes. Our algorithm is influenced by the variance of the numerical data values: the larger the variance, the higher the probability that the difference between the first and second minimum distances between an object and the cluster centers is large.

The k-prototypes algorithm proposed in this paper reduces the computational cost without using additional data structures or memory, and it is faster than the original k-prototypes algorithm. Existing k-means acceleration algorithms aim either to reduce the number of dimensions compared when calculating the distance between a center and an object, or to reduce the number of objects compared with each cluster center. K-means, which deals only with numerical data, is the most widely used clustering algorithm, and various acceleration algorithms have been developed to improve its speed on large data. However, real-world data is mostly a mixture of numerical and categorical data. In this paper, we propose a method to speed up the k-prototypes algorithm for clustering mixed data. The proposed method reduces the number of attributes compared with each cluster center rather than the number of objects, so it is not exclusive of the existing methods, and the existing methods and the proposed method can be integrated with each other.

In this study, we considered only the effect of cardinality. However, the performance of the algorithm may also be affected by dimensionality, the size of k, and the data distribution. In future work, we first plan to measure the performance of the proposed algorithm with respect to dimensionality, the size of k, and data distributions. Second, more objects could be pruned by partitioning the data points into a grid. We will extend our fast k-prototypes algorithm to further improve its performance.

Byoungwook Kim: Research for the related works, doing the experiments, writing the paper, acquisition of data, analysis of data, interpretation of the related works, and design of the complete model.

The author declares no conflict of interest.

A process of assigning an object _{i}

Finding the closest cluster center without computing categorical attributes.

Effect of cardinality. FKPT (fast k-prototypes) is the result of our proposed k-prototypes algorithm, and TKPT (traditional k-prototypes) is the result of the original k-prototypes algorithm.