1. Introduction
In 1909, Jan Czekanowski proposed the first clustering method [1]. Methods of this kind have become fundamental to many branches of statistics and the social sciences. With clustering, we seek to classify a set of objects into relatively homogeneous groups, usually referred to as clusters. That is, for a given dataset, the goal of a cluster analysis is to define a set of clusters and to assign the observations to them so that, under some distance or similarity measure, observations within a cluster are close to each other, while observations in different clusters are far apart. There is an ongoing discussion about the best clustering method, as one can gather from the large number of review articles (see, for example, [2,3,4,5,6]). Many authors have proposed different clustering algorithms, and most techniques and algorithms deal with quantitative data. However, categorical data are common, particularly in the social sciences [7,8,9,10,11,12]. As such, applying clustering methods to categorical data is important, and methods have been proposed to deal with these types of data. An extension of the k-means approach to clustering, k-modes clustering [13], is prominent among these. In this paper, we present a novel algorithm to group qualitative data: an extension of Condorcet clustering [14]. We demonstrate that, with a fixed number of clusters, a unique partition of the data can be achieved by maximizing Condorcet's criterion [14]. We developed a heuristic algorithm that proved to be very useful. Moreover, an adjustment rate index was used to evaluate the quality of the partitions of k-modes and k-Condorcet on real datasets. The rest of the paper is organized as follows: in Section 2, we present related work. In Section 3, we introduce relevant concepts and definitions. In Section 4, we present some theoretical results. The clustering algorithm is presented in Section 5. Using real data, in Section 6, we compare k-Condorcet clustering to k-modes clustering. Finally, our concluding remarks are given in Section 7.
2. Related Work
Clusters may be regarded as crisp or fuzzy. In fuzzy clustering, an observation may belong to more than one cluster with given probabilities, whereas in crisp clustering, an observation belongs to one and only one cluster. Most clustering algorithms, but not all, may be classified into two categories: partitioning and hierarchical algorithms.
k-means is prominent among the partitional methods and is one of the most popular techniques for clustering quantitative data [15,16,17,18]. Given a set of $n$ multivariate observations $\{x_1, x_2, \ldots, x_n\}$, where $x_i$ is a $d$-dimensional vector, the $k$-means algorithm partitions the data into $k$ clusters, $S = \{S_1, S_2, \ldots, S_k\}$, such that the sum of squares within each cluster is minimized. That is, $k$-means seeks to minimize

\sum_{j=1}^{k} \sum_{x \in S_j} \lVert x - \mu_j \rVert^2,

where $\mu_j$ is the mean of the points in $S_j$. This algorithm is fast and easy to implement [16,18]. Once the number of clusters is defined, the method chooses, at random, $k$ points in the attribute space as initial values. After that, observations are assigned to the closest cluster and the centroids are updated. Because the algorithm does not guarantee convergence to the global optimum, and since it is usually fast, it is common to run it multiple times with different starting conditions. This method may, however, be badly affected by outliers.
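As an aside, base R's stats::kmeans follows exactly this scheme; the snippet below is a minimal sketch on simulated data (the data and the choice of three clusters are ours, for illustration only):

set.seed(42)
# Simulated quantitative data: three Gaussian clouds in two dimensions
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2),
           matrix(rnorm(100, mean = 8), ncol = 2))

# nstart = 25 restarts the algorithm from 25 random initializations and
# keeps the best run, mitigating convergence to a poor local optimum
fit <- kmeans(x, centers = 3, nstart = 25)
fit$tot.withinss   # the within-cluster sum of squares being minimized
table(fit$cluster) # cluster sizes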
Several methods have been proposed to deal with qualitative or mixed data. The k-modes and k-prototype methods, proposed by Huang [13], are prominent among these; both are extensions of k-means (see Table 1). k-modes, in particular, is the k-means method with the Euclidean distance replaced by a simple matching dissimilarity measure, and with the centers of the clusters represented by their modes instead of their means. To introduce k-modes, let $X$ and $Z$ be two objects described by $n$ categorical attributes. A simple dissimilarity measure between these objects is the total number of mismatches of the corresponding attribute values of the two objects. That is,

d(X, Z) = \sum_{j=1}^{n} \delta(x_j, z_j),

where

\delta(x_j, z_j) = \begin{cases} 0 & \text{if } x_j = z_j, \\ 1 & \text{if } x_j \neq z_j. \end{cases}

Let $S = \{X_1, X_2, \ldots, X_m\}$ be a set of $m$ objects described by $n$ categorical attributes. Then a mode of $S$ is a vector $Q = (q_1, q_2, \ldots, q_n)$ that minimizes

D(S, Q) = \sum_{i=1}^{m} d(X_i, Q),

where $Q$ is not necessarily an object of $S$. Finally, the $k$-modes algorithm partitions the set of $m$ objects described by $n$ categorical attributes into $k$ clusters, $S_1, \ldots, S_k$, by minimizing the following expression:

\sum_{l=1}^{k} \sum_{X \in S_l} d(X, Q_l),

where $Q_l$ is the mode of cluster $S_l$. For a survey of $k$-modes, see [19]. For a different approach to clustering categorical data, see [20].
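To make these definitions concrete, here is a minimal R sketch of the simple matching dissimilarity and of the column-wise mode (our own illustration, not Huang's reference implementation; the toy objects are hypothetical):

# Simple matching dissimilarity: number of attributes on which x and z differ
simple_matching <- function(x, z) sum(x != z)

# Mode of a set of objects: the most frequent category of each attribute,
# computed column by column (the resulting vector need not be an observed object)
cluster_mode <- function(M) {
  apply(M, 2, function(col) names(which.max(table(col))))
}

X <- c("a", "b", "c")
Z <- c("a", "b", "d")
simple_matching(X, Z)  # 1 mismatch

M <- rbind(c("a", "b", "c"),
           c("a", "b", "d"),
           c("e", "b", "d"))
cluster_mode(M)        # "a" "b" "d": minimizes the total dissimilarity to the rows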
Although the k-modes method has the advantage of being scalable to very large datasets, the final solution may be influenced by the initialization criterion of using random initial modes as centers. A number of suggestions have been made to overcome the shortcomings of k-modes. For example, Lakshmi [21] proposes a different algorithm to overcome the initialization problem of k-modes. Moreover, Dorman [22] adapts the Hartigan algorithm for k-means and develops several approaches to selecting the initial centroids to improve the efficiency of k-modes. Two other approaches to initializing the k-modes algorithm are given in [23,24]. A fuzzy version of the k-modes algorithm is proposed by Huang [25] to improve the performance of k-modes. Other fuzzy versions of the k-modes method are given in [26,27,28]. Ng [29] modifies the simple matching dissimilarity measure to obtain clusters with high intra-similarity and describes extensions of k-modes that efficiently cluster large categorical datasets. A different dissimilarity measure is provided by Cao [30].
For different approaches to clustering categorical data, see [20,31,32,33].
Besides the k-modes algorithm, Huang [25] also proposes k-prototypes, an algorithm that integrates the k-means and k-modes algorithms to cluster objects of mixed types. The dissimilarity between two mixed-type objects $X$ and $Z$, each described by $p$ quantitative attributes and $n - p$ qualitative attributes, may be measured by

d(X, Z) = \sum_{j=1}^{p} (x_j - z_j)^2 + \gamma \sum_{j=p+1}^{n} \delta(x_j, z_j).

Of course, the first term corresponds to the squared Euclidean distance, which is applied to the quantitative attributes, and the second term is the simple matching dissimilarity measure, which is applied to the qualitative attributes. The weight $\gamma$ is used to avoid favoring either type of attribute. Thus, k-prototypes seeks to minimize the following cost function:

E = \sum_{l=1}^{k} \sum_{i=1}^{m} w_{il} \, d(X_i, Q_l),

where $W = (w_{il})$ is an $m \times k$ partition matrix with elements $w_{il} \in \{0, 1\}$, $Q_l$ is the prototype of cluster $l$, and the $X_i$ are a set of objects in the same object domain.
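A minimal R sketch of this mixed dissimilarity (the weight gamma and the two toy objects are ours, for illustration):

# Mixed dissimilarity: squared Euclidean distance on the numeric part
# plus gamma times the simple matching count on the categorical part
mixed_dissim <- function(x_num, x_cat, z_num, z_cat, gamma = 1) {
  sum((x_num - z_num)^2) + gamma * sum(x_cat != z_cat)
}

# Two hypothetical mixed-type objects
mixed_dissim(x_num = c(1.2, 0.5), x_cat = c("red", "small"),
             z_num = c(0.8, 1.5), z_cat = c("blue", "small"),
             gamma = 0.5)
# (1.2 - 0.8)^2 + (0.5 - 1.5)^2 + 0.5 * 1 = 1.66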
Marcotorchino [14] and Michaud [34,35,36] were the first to propose a clustering method for categorical data using a dissimilarity measure. These authors developed the relational analysis theory and introduced the relation aggregation problem in order to solve Condorcet's paradox in voting systems, relating it to the similarity problem. This approach consists of making pairwise comparisons and applying the simple majority decision rule. Indeed, aggregating equivalence relations using the simple majority decision rule guarantees optimal solutions under some constraints, without fixing the number of groups a priori. In our work, we use the approach introduced by Michaud and Marcotorchino, but set the number of groups a priori.
3. Materials and Methods
Let $V = \{V_1, \ldots, V_n\}$ be a set of $n$ variables and $S = \{O_1, \ldots, O_m\}$ a set of $m$ objects. Let $C$ be a Condorcet matrix, with elements $c_{ij}$ corresponding to the number of variables for which $O_i$ is similar to $O_j$, denoted by $O_i \sim O_j$; that is, $C$ is an $m \times m$ matrix such that $c_{ij} = \mathrm{card}\{k : O_i \sim_k O_j\}$. For two given objects $O_i$ and $O_j$, by $O_i \sim_k O_j$ we mean that $O_i$ and $O_j$ have the same value for the variable $V_k$, with $k = 1, \ldots, n$, while $O_i \sim O_j$ means that $O_i$ and $O_j$ are similar.
In the relational analysis methodology, Marcotorchino [14] suggests the maximization of Condorcet's criterion, under some restrictions, given by

C(Y) = \sum_{i=1}^{m} \sum_{j=1}^{m} \left( c_{ij} \, y_{ij} + \bar{c}_{ij} \, (1 - y_{ij}) \right),   (6)

with $y_{ij} \in \{0, 1\}$, $\bar{c}_{ij} = n - c_{ij}$, and $Y = (y_{ij})$ a matrix that maximizes the function $C(Y)$ given in Equation (6). This matrix takes values 0 and 1.
Then, the model associated with the absolute global majority is defined by:

\max_{Y} \; \sum_{i=1}^{m} \sum_{j=1}^{m} \left( c_{ij} \, y_{ij} + \bar{c}_{ij} \, (1 - y_{ij}) \right)
\quad \text{subject to} \quad y_{ij} \in \{0, 1\}, \quad y_{ij} = y_{ji}, \quad y_{ij} + y_{jk} - y_{ik} \leq 1,   (7)

where $Y$ is the matrix of similarities. The first constraint represents binarity, the second represents symmetry, and the third represents transitivity.
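To make the criterion concrete, the following R sketch evaluates Equation (6) for a relation matrix Y built from a vector of cluster labels (a minimal illustration of ours; a label vector automatically encodes an equivalence relation, so binarity, symmetry, and transitivity hold by construction):

condorcet_criterion <- function(C, Y, n_vars) {
  # c_ij counted for pairs placed together (y_ij = 1),
  # cbar_ij = n - c_ij counted for pairs placed apart (y_ij = 0)
  Cbar <- n_vars - C
  off <- row(C) != col(C)   # exclude the diagonal terms
  sum(C[off] * Y[off] + Cbar[off] * (1 - Y[off]))
}

labels_to_Y <- function(labels) {
  # y_ij = 1 exactly when objects i and j share a cluster label
  outer(labels, labels, "==") * 1
}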
The following example explains how to obtain the matrix Y that maximizes Condorcet's criterion under the restrictions given above.
Let E be a dataset composed of three items $O_1, O_2, O_3$, with three qualitative variables $V_1, V_2, V_3$ being measured. The dataset is presented in Table 2. Using Table 2, we identify the Condorcet matrix $C$. Then, we list the possible solutions $Y$ that satisfy the constraints. Next, we compute the function $C(Y)$ for each candidate matrix; the matrix attaining the largest value maximizes Condorcet's criterion. Finally, we obtain the number of clusters, equal to 2, together with the corresponding two clusters.
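Since Table 2 is not reproduced here, the sketch below carries out the same exhaustive search on a hypothetical three-object Condorcet matrix (the entries are ours), reusing condorcet_criterion and labels_to_Y from above; it scores all five partitions of the three items and keeps the best:

# All five partitions of three objects, encoded as label vectors
partitions3 <- list(c(1, 1, 1), c(1, 1, 2), c(1, 2, 1), c(1, 2, 2), c(1, 2, 3))

# Hypothetical Condorcet matrix for three objects and n = 3 variables
C3 <- matrix(c(3, 2, 0,
               2, 3, 1,
               0, 1, 3), nrow = 3, byrow = TRUE)

scores <- sapply(partitions3, function(lab)
  condorcet_criterion(C3, labels_to_Y(lab), n_vars = 3))
partitions3[[which.max(scores)]]  # here c(1, 1, 2): clusters {O1, O2} and {O3}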
Although the proposed method of clustering does not require fixing the number of classes beforehand, there are instances where this is not convenient, as is the case in a psychometric analysis. In this paper, we take this point of view and assume that the number of clusters is given.
Therefore, in this paper, we fix the number of groups, denoted by $k$, and focus on finding the solution of Equation (7), giving its algorithm and comparing its results to k-modes for some fixed value of $k$.
Next, let us denote by $P = \{P_1, \ldots, P_k\}$ a partition of the set of objects $S$. In this partition, the number of clusters is $k$.
Recall that the matrix $C$ represents the similarities between pairs of the objects that we want to cluster. Similarly, we introduce a matrix of dissimilarities between pairs of the same objects, which we denote by $\bar{C}$. Next, consider the $n$ categorical variables $V_1, \ldots, V_n$, and let $V_k(O_i)$ be the modality of $V_k$ assigned to object $O_i$. Each variable $V_k$ is then associated with a matrix $C^k$. As a consequence, we obtain

C = \sum_{k=1}^{n} C^k,   (8)

where the elements of matrix $C^k$ are given by

c_{ij}^k = \begin{cases} 1 & \text{if } V_k(O_i) = V_k(O_j), \\ 0 & \text{otherwise.} \end{cases}

By abuse of notation, we write $c_{ij} = c(O_i, O_j)$ and $\bar{c}_{ij} = \bar{c}(O_i, O_j)$. Using Equation (8), the general terms of the collective relational matrix $C$ are given by $c_{ij} = \sum_{k=1}^{n} c_{ij}^k$. Furthermore, we define the general terms of the collective relational matrix $\bar{C}$ as $\bar{c}_{ij} = n - c_{ij}$. Note that $\bar{c}_{ij}$ represents the number of variables for which $O_i$ and $O_j$ are not similar.
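These constructions translate directly into R; the sketch below builds C as the sum of the per-variable matrices C^k and derives Cbar (the toy data frame is ours, for illustration):

condorcet_matrix <- function(D) {
  # c_ij = number of variables on which objects i and j take the same value
  m <- nrow(D)
  C <- matrix(0, m, m)
  for (k in seq_len(ncol(D))) {
    C <- C + outer(D[[k]], D[[k]], "==") * 1  # add C^k for variable k
  }
  C
}

D <- data.frame(V1 = c("a", "a", "b"),
                V2 = c("x", "y", "y"),
                V3 = c("u", "u", "u"))
C <- condorcet_matrix(D)
Cbar <- ncol(D) - C  # number of variables on which the objects differ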
5. Algorithm
Several algorithms have been proposed over the past three decades to solve problem (7) by linear programming techniques when the population under study is relatively small. Unfortunately, the classical linear programming formulation requires a large number of constraints. For this reason, heuristic methods have been adopted in order to process large amounts of data. Although these heuristic algorithms are fast, they do not always ensure an optimal solution.
Next, for all pairs of objects $O_i, O_j \in S$, we maximize the expression

\sum_{i=1}^{m} \sum_{j=1}^{m} (c_{ij} - \bar{c}_{ij}) \, y_{ij}.

Note that the term $\sum_{i,j} \bar{c}_{ij}$ is omitted because it is constant. The above expression represents the series of links between objects $O_i$ and $O_j$, denoted by $F(O_i, O_j) = c_{ij} - \bar{c}_{ij}$, and we write $F = C - \bar{C}$. Moreover, we denote the general link by $F(S, S) = 2C - nE$, where $E$ is a matrix with elements equal to 1.
The k-Condorcet clustering algorithm is summarized in the steps of Algorithm 1. Given are a database $D$ of $m$ points described by $n$ categorical variables and a partition $P = \{P_1, \ldots, P_k\}$ of $S$ such that $P_i \cap P_j = \emptyset$ for $i \neq j$ and $\bigcup_i P_i = S$. Similar to the algorithm given in [40], we compute the following steps.
Algorithm 1. Heuristic Algorithm k-Condorcet.

Input:   k: number of partitions
         m: number of observations
         S = {O_1, ..., O_m}: observations
         n: number of variables
         T: feature matrix
Output:  P = {P_1, ..., P_k}: a partition of S into k clusters

1    C <- GenCondorcet(T): the generation of the Condorcet matrix C
2    L <- C ∘ (E − I): where E is a matrix of ones, so L is C with its diagonal set to zero
3    P_i <- {O_i}, i = 1, ..., m: each object starts in its own group
4    g <- m: current number of groups; A <- {1, ..., m}: active columns
5    K <- (0, ..., 0)
6    for j in A do
7        for i <- 1 to m do
8            if L[i, j] > K[j] then
9                K[j] <- L[i, j]
10           endif
11       endfor
12       K[j]: where K[j] is the largest value of the j-th
         column of matrix L
13   endfor
14   a <- K[q]: q is the position of the first occurrence
     of the largest value of vector K
15   p <- row of the first occurrence of a in column q; P_q <- P_q ∪ P_p
16   EliminateGroup(P_p): elimination of the cluster P_p
17   g <- g − 1
18   L[p, q] <- 0: the value a is removed from column q
19   b <- max_i L[i, q]: largest remaining value of column q
20   A <- A \ {p}
21   K[q] <- b
22   if (a = b) then
23       P* <- GenCombi(L, P): GenCombi gives the combination
         which maximizes the link F
24       c <- card(P*): card is the cardinality function
25       if the combination merges two distinct groups then
26           P <- P*: update the partition
27           g <- g − 1
28           K[q] <- 0, goto 5
29       else
30           goto 5
31       endif
32   endif
33   if (a > b) then
34       goto 5
35   endif
36   if (g = k) then
37       return P
38   endif
- 1.
First, we find the largest value in each column of Condorcet's matrix C, which corresponds to the number of characteristics that a pair of observations shares. We then take the overall largest value, denoted by a, together with its position, and put the corresponding pair of observations in the same group P_k, with k representing the kth column (this step is also illustrated in the R sketch after this list).
- 2.
Next, we remove the value a from the matrix C and define b as the largest remaining value of the kth column.
- 3.
We distinguish two cases:
- 3.1
If a > b, we repeat the first step.
- 3.2
If a = b, then Condorcet's criterion is applied: we group the elements so as to maximize the Condorcet criterion.
- 4.
We repeat the process.
- 5.
The process stops when the k groups are identified.
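As an illustration of the first of these steps, the R fragment below locates, in a Condorcet matrix with zeroed diagonal, the first occurrence of the overall column maximum; this is a simplified sketch of ours, not the full heuristic:

first_merge <- function(L) {
  K <- apply(L, 2, max)                  # vector K of column maxima
  q <- which.max(K)                      # first column attaining the overall maximum
  p <- which.max(L[, q])                 # first row attaining it within that column
  list(a = L[p, q], position = c(p, q))  # value a and the pair (O_p, O_q) to merge
}

# Condorcet matrix of the example in Section 5.1 below (diagonal replaced by zero)
C6 <- matrix(c(0, 2, 1, 3, 1, 1,
               2, 0, 0, 2, 0, 0,
               1, 0, 0, 1, 2, 3,
               3, 2, 1, 0, 1, 1,
               1, 0, 2, 1, 0, 2,
               1, 0, 3, 1, 2, 0), nrow = 6, byrow = TRUE)
first_merge(C6)  # a = 3 at position (4, 1): O_1 and O_4 go into the same group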
5.1. Illustrative Example Using the Heuristic Algorithm
We now consider a dataset D composed of six items $O_1, \ldots, O_6$, with three qualitative variables $V_1, V_2, V_3$ being measured. The dataset is presented in Table 3.
Then, the Condorcet matrix C is given in Table 4.
In general, the diagonal of Condorcet's matrix represents the number of variables measured on this group of observations; it also represents the maximum possible similarity that can occur between two observations $O_i$ and $O_j$ with $i \neq j$. For our heuristic algorithm, we replace the diagonal entries by zero.
The goal of this example is to create a partition $P = \{P_1, P_2\}$, with $P_1 \cap P_2 = \emptyset$, fixing the number of groups at $k = 2$. Before creating this partition, we suppose that each element represents a group, and we write $P_i = \{O_i\}$ with $i = 1, \ldots, 6$. Let K be a vector whose elements are the maxima of the columns of Condorcet's matrix. Then, we have $K = (3, 2, 3, 3, 2, 3)$. So, we identify the first maximum of the vector K, called a, with $a = 3$, and its position $(4, 1)$, which represents the fourth position of the first column.
The following step is to put the elements $O_1$ and $O_4$ together in the same group $P_1$. In the first column, we eliminate the fourth value, and we write:
|     | O_1 | O_2 | O_3 | O_4 | O_5 | O_6 |
| O_1 | -   | 2   | 1   | 3   | 1   | 1   |
| O_2 | 2   | -   | 0   | 2   | 0   | 0   |
| O_3 | 1   | 0   | -   | 1   | 2   | 3   |
| O_4 | -   | 2   | 1   | -   | 1   | 1   |
| O_5 | 1   | 0   | 2   | 1   | -   | 2   |
| O_6 | 1   | 0   | 3   | 1   | 2   | -   |
Computing the maximum of the first column, we obtain $b = 2$. Comparing the values of the two parameters a and b, we find that $a > b$. Thus, we have $P_1 = \{O_1, O_4\}$. Then, the vector K is recalculated without considering columns 1 and 4, obtaining $K = (2, 3, 2, 3)$. So, we identify the first maximum of the vector K with $a = 3$, and its position $(6, 3)$, which represents the last position of the third column; accordingly, the elements $O_3$ and $O_6$ are put together in the group $P_2$. In the third column, we eliminate the sixth value, and we write:
|     | O_1 | O_2 | O_3 | O_4 | O_5 | O_6 |
| O_1 | -   | 2   | 1   | 3   | 1   | 1   |
| O_2 | 2   | -   | 0   | 2   | 0   | 0   |
| O_3 | 1   | 0   | -   | 1   | 2   | 3   |
| O_4 | -   | 2   | 1   | -   | 1   | 1   |
| O_5 | 1   | 0   | 2   | 1   | -   | 2   |
| O_6 | 1   | 0   | -   | 1   | 2   | -   |
Computing the maximum of the third column, we obtain $b = 2$. In this case, $a > b$, so we have $P_2 = \{O_3, O_6\}$. Again, the vector K is recalculated without considering columns 1, 3, 4, and 6, and we have $K = (2, 2)$. The first maximum of vector K is found at the second column, and we have $a = 2$, with position in the Condorcet matrix given by $(1, 2)$. This position leads us to add the element $O_2$ to the group $P_1$. We eliminate the first value of the second column, and write:
|     | O_1 | O_2 | O_3 | O_4 | O_5 | O_6 |
| O_1 | -   | -   | 1   | 3   | 1   | 1   |
| O_2 | 2   | -   | 0   | 2   | 0   | 0   |
| O_3 | 1   | 0   | -   | 1   | 2   | 3   |
| O_4 | -   | 2   | 1   | -   | 1   | 1   |
| O_5 | 1   | 0   | 2   | 1   | -   | 2   |
| O_6 | 1   | 0   | -   | 1   | 2   | -   |
Next, the maximum of the second column is equal to 2, and we have $b = 2$ with position $(4, 2)$. The two parameters a and b are equal. In this case, it is not necessary to carry out combinatorics between $O_2$ and $O_4$, because the element at position $(4, 2)$ already belongs to $P_1$. Now, we define a new vector K without the second column, given by $K = (2)$. The first maximum of vector K is attained at two positions, meaning that the element $O_5$ can be in the first group or in the second group. In this case, we must check which of the two partitions maximizes the Condorcet criterion function. After simple calculations, we deduce that $O_5$ belongs to $P_2$. Finally, we obtain the two groups $P_1 = \{O_1, O_2, O_4\}$ and $P_2 = \{O_3, O_5, O_6\}$.
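The final assignment of $O_5$ can be verified numerically: reusing condorcet_criterion and labels_to_Y from Section 3 together with the matrix C6 defined above, the sketch below scores both candidate partitions.

lab_A <- c(1, 1, 2, 1, 2, 2)  # P_1 = {O_1, O_2, O_4}, P_2 = {O_3, O_5, O_6}
lab_B <- c(1, 1, 2, 1, 1, 2)  # same partition, but with O_5 moved to P_1

condorcet_criterion(C6, labels_to_Y(lab_A), n_vars = 3)  # 70
condorcet_criterion(C6, labels_to_Y(lab_B), n_vars = 3)  # 56
# The first partition attains the larger criterion value, so O_5 joins P_2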
5.2. Advantage of Heuristic Algorithm
The main goals of this section are threefold. Firstly, we compare the partition quality, given in Equation (21), of the exact and heuristic algorithms on the feline dataset. Secondly, we use the inertia index, given in Equation (13), to compare the exact and heuristic algorithms. Finally, the execution times of the two methods are compared. For each comparison, we use the first felines of the feline dataset, for the sample sizes reported in Table 5.
Table 5 shows that the use of the exact algorithm becomes impractical as the sample size increases. Furthermore, the inertia and quality indexes of the heuristic and exact algorithms are almost equal.
Finally, observing the last column of Table 5, the execution time of the exact algorithm increases considerably with the sample size, reaching 586.95 s, while the heuristic algorithm remains fast across all sample sizes. Furthermore, for the largest sample sizes, the quality and inertia indexes cannot be computed for the exact algorithm; however, we know that the exact algorithm provides an optimal solution, so its quality index is at least as large as that of the heuristic algorithm. Consequently, we confirm that the exact algorithm is computationally very expensive compared to the heuristic algorithm. Note that, for the exact algorithm, the use of large datasets generates two important problems: the first is related to the execution time, while the second concerns the temporary storage space required by the programs being used (e.g., the R project).
6. Comparison between k-Condorcet and k-Modes
Firstly, in this section, we describe the experiments and their results. We ran our algorithm on the feline dataset obtained from [14] and presented in Table A1 and Table A2 in Appendix A. We tested the performance of k-Condorcet clustering against the k-modes algorithm. Our algorithms were implemented in the R language. The k-Condorcet algorithm was implemented according to the description given above, and for k-modes, we used an existing R implementation. The quality of the partitions was compared using the fit rate [14] given by Equation (21).
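As an indication of how such a run can be set up, the sketch below uses the kmodes() function from the klaR package, one common R implementation of k-modes (the tiny data frame merely stands in for the feline data of Table A1 and Table A2 and is entirely hypothetical):

library(klaR)  # provides kmodes()

# Hypothetical stand-in for the categorical feline attributes
felines <- data.frame(size = c("small", "small", "large", "large"),
                      coat = c("spotted", "plain", "spotted", "plain"),
                      mane = c("no", "no", "yes", "no"))
set.seed(1)
km <- kmodes(felines, modes = 2)  # k-modes with two clusters
km$cluster  # cluster membership of each animal
km$modes    # the modal vector of each cluster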
In previous studies [14,37], the similarity aggregation method gave an optimal solution of four groups. This solution was closest to the classification recognized by zoologists by species and genus (Figure 1).
On the other hand, in the partition into four groups obtained by applying the k-modes algorithm, it is observed that certain species belong to more than one group and that the partition does not agree with the classification recognized by zoologists (Figure 2).
The accuracy measure, given in Equation (22), of the solutions shown in Figure 1 and Figure 2 is 1 and 0.83, respectively.
Next, a comparison was made between the k-Condorcet method and the k-modes method for different values of $k$, in order to find the method that best fits the feline data, using the within-class inertia index and the adjustment rate given in Equations (13) and (21), respectively.
Figure 3 contrasts the quality of the groupings by means of the adjustment rate. It is observed that the k-Condorcet method yields a better partition quality than the k-modes method for the different values of $k$.
Figure 4 contrasts the quality of the clustering fit through inertia on the same dataset. From this figure, it is concluded that the intra-class inertia is almost the same for both methods for the different values of $k$.
We now use the 1990 US Census dataset to compare the heuristic algorithm with the k-modes algorithm. This dataset contains a 1% sample of the public use microdata sample person records drawn from the full 1990 census sample. For further reference, see https://archive.ics.uci.edu/ml/datasets/US+Census+Data+%281990%29 (accessed on 20 December 2021). The comparisons between the two methods were made with 50, 100, 150, and 200 observations.
Table 6 shows that the inertia index is almost the same for both algorithms. However, we observe that the heuristic algorithm outperforms the k-modes algorithm in terms of the quality index.
7. Conclusions
In clustering categorical data, many researchers have succeeded in developing unsupervised classification methods that do not fix the number of classes a priori. Fixing the number of clusters beforehand may be a major drawback.
However, it is sometimes convenient to identify the number of groups beforehand, as, for instance, in psychometrics. Several methods have been proposed for a known number of clusters. We believe, however, that these methods do not always provide optimal solutions. For this reason, we proposed a new method with a fixed number of groups. This new approach is an extension of the Condorcet method. Although the exact algorithm for this new approach gives an optimal solution, it consumes too much time. Hence, the heuristic algorithm was introduced. Table 5 shows that the proposed algorithm produces almost the same values of the quality and inertia indexes as the exact algorithm.
Next, comparing our approach with k-modes on the feline data, we found that the accuracy index gave a better result for the heuristic algorithm. In this case, the comparison was made with the precision index because we know a priori that the number of clusters, $k$, is 4. This comparison was also made with the US Census 1990 data, using both the partition quality and the intra-class inertia indexes. The results in Table 6 show that both methods have almost the same inertia. However, the heuristic algorithm shows an improvement over k-modes in terms of partition quality. Consequently, the following may be concluded:
We proposed a heuristic algorithm as an alternative to the exact one. It gives the same solution as the exact algorithm, or an approximation to it.
From the simulations presented in
Table 5, we can conclude that the heuristic algorithm is faster than the exact algorithm.
The heuristic algorithm produces results similar (or even superior) to k-modes.
We conclude that k-Condorcet is a valid competitor with respect to the k-modes clustering technique.