1. Introduction
Clustering is a fundamental technique in unsupervised machine learning and data mining that aims to divide a dataset into several clusters without any prior information such as labels. Many clustering algorithms have been proposed over the past few decades [1,2,3,4,5] and applied in many real-world applications, such as face detection [6], recommendation systems [7], information retrieval [8], and so on.
Cluster analysis is a widely used technique in the electronics domain. Li et al. [9] proposed a simplified electrochemical impedance spectroscopy model based on k-means to describe the battery characteristics of electric vehicles. Jinlei et al. [10] developed an active equalization strategy for series-connected battery packs based on clustering analysis with a genetic algorithm. Lei et al. [11] employed hierarchical clustering to cluster four kinds of VOCs detected by a fluorescent cross-responsive sensor array and obtained accurate results. Hwang and Kim [12] incorporated the Gaussian mixture model into a variational autoencoder framework to cluster wafer maps in semiconductor manufacturing. Recently, researchers have paid more attention to fairness in clustering [13].
Traditional clustering algorithms often take the internal structure of the data into account but ignore the bias in the data that is associated with protected attributes (e.g., race and gender). In the process of clustering, this bias can be transmitted, or even amplified, and affect the clustering result. In bioelectronics, clustering methods can be used to analyze single-cell RNA-Seq data. The cell sequencing technique can be considered the protected attribute, since cells sequenced by different techniques exhibit different expression levels [14]. Traditional clustering methods may perform well on cell sequences obtained by one sequencing technique while struggling with data obtained by others. Take feature extraction for face emotion recognition in electronic multimedia as another example: clustering methods can be used to group features that give the shape and location of the important biometric parts of the face, such as the eyes, nose, and mouth, based on similarity [15]. In this context, race can be considered the protected attribute, and there may be inherent physical differences in facial features between races. Traditional clustering-based feature extraction may perform well for one race but less well for others. Therefore, the fairness of the clustering should be considered in most domains where clustering is employed [16]. Such bias can be mitigated by adding fairness constraints that hide protected attributes from the clustering results. Generally, fairness in clustering requires that objects with different values of the protected attribute have a uniform distribution in each cluster [17,18,19].
In order to achieve fair clustering, much effort has been made to incorporate fairness constraints into clustering [20,21,22,23,24,25]. These methods can be divided into (1) pre-processing, (2) in-processing, and (3) post-processing approaches according to when fairness is handled. Chierichetti et al. [20], Backurs et al. [21], and Ahmadian et al. [22] partitioned the original dataset into small clusters where fairness is guaranteed and then applied clustering algorithms (e.g., k-center, k-median, and hierarchical clustering) on them to achieve fair results; hence, these works are all pre-processing-based methods. Both Kleindessner et al. [23] and Liu and Vicente [24] are in-processing-based methods: the former formulated fairness constraints and incorporated them into the spectral clustering objective function, while the latter added steps that adjust the fairness of clusters to mini-batch k-means. Bera et al. [25] first employed a vanilla clustering algorithm and then changed the assignment of objects to clusters to improve fairness, so it is a post-processing-based method.
In general, enforcing fairness on clustering will lead to a loss of clustering quality. This is because the pursuit of fairness may result in the allocation of closely related objects to different clusters to mitigate bias. Such assignments can decrease the performance of the clustering. For example, post-processing-based methods often utilize the initial unfair clustering results obtained by vanilla clustering algorithms, which are expected to have high clustering quality. The fair clustering result generated by the fairness operation is usually of lower clustering quality than the original unfair one.
In this paper, we take a different path that modifies an initial fair clustering result rather than an unfair one. The method we propose consists of three stages: an initialization stage that obtains a fair clustering output, a relaxation stage that allows a certain degree of unfairness to achieve better clustering quality, and an improvement stage that improves clustering quality. Additionally, different from the post-processing-based approach that improves fairness while retaining clustering quality, our idea of an improvement stage is to improve clustering quality while maintaining fairness. The main contributions in this paper can be summarized as follows:
An initialization method that can handle the case of multiple groups is presented to generate an initial fair clustering result, and, compared to random initialization, the result is of relatively good clustering quality.
A local search method is employed to improve clustering quality. Specifically, the local search method can find a local optimum of the initial clustering result.
Experimental results on both synthetic and real-world datasets demonstrate that our proposed method outperforms some of the other fair clustering methods in terms of fairness and clustering metrics.
3. Proposed Method
This section presents the proposed fair k-means algorithm, FFC, and describes its initialization, relaxation, and improvement stages.
3.1. Initialization
Fair clustering aims to ensure that the proportions of protected groups in each cluster are close to their proportions in the whole dataset [23,26,27,37]. Consequently, data points from different groups can be assigned the same cluster label. However, vanilla k-means clustering, which takes no account of protected attributes, usually divides the dataset so that similar points are grouped into the same cluster, and it therefore produces an unfair solution. In our study, we make the assumption that if data points belonging to the same group are similar, they should be placed in the same cluster in a fair clustering solution. Therefore, we propose an initialization strategy based on the k-means algorithm that preserves clustering quality and guarantees approximate fairness.
Suppose that dataset $X$ can be divided into $m$ disjoint groups $X_1, X_2, \ldots, X_m$ with $|X_1| \geq |X_2| \geq \cdots \geq |X_m|$. We first run k-means on every group to obtain $m$ clustering results $C^1, C^2, \ldots, C^m$, where $C^i = \{C^i_1, C^i_2, \ldots, C^i_k\}$. $C^1$ is selected as the primary clustering result because $X_1$ has the most points compared with the other groups. Then, we iteratively choose the clustering result corresponding to the largest group among the remaining groups as the secondary clustering result, refine the clusters obtained by combining the primary and secondary clustering results, and consider it the new primary clustering result.
Figure 1 shows a complete initialization scheme on a dataset with 3 groups. The combination step and refinement step are described below.
3.1.1. Combination Step
Suppose we combine a primary clustering result $C^p$ with a secondary clustering result $C^s$, where $C^p$ contains more data points than $C^s$. In the context of clustering, cluster centers can be considered representatives of the clusters, as they are calculated as the mean of the data points within each cluster. To establish a one-to-one matching between the clusters in $C^p$ and $C^s$, we initially match the clusters based on their cluster centers. Specifically, let $C^s_{\sigma(j)}$ denote the cluster that has a match with cluster $C^p_j$. We determine this matching by minimizing the following equation:

$\min_{\sigma} \sum_{j=1}^{k} d\big(c^p_j, c^s_{\sigma(j)}\big)$,  (6)

where $d(\cdot,\cdot)$ represents the distance between two cluster centers, $c^p_j$ and $c^s_{\sigma(j)}$ are the centers of $C^p_j$ and $C^s_{\sigma(j)}$, and $\sigma$ ranges over all permutations of $\{1, \ldots, k\}$. After the matching operation, $k$ cluster pairs are obtained. Subsequently, each cluster pair is merged to generate a new primary clustering result. The procedure for combining the primary and secondary clustering results is summarized in Algorithm 1:
Algorithm 1 Combine Algorithm
- Input: Primary clustering result $C^p$, secondary clustering result $C^s$.
- Output: New primary clustering result.
- 1: Compute the cluster centers of $C^p$ and $C^s$;
- 2: Compute all permutations of the clusters in $C^s$;
- 3: For each permutation, calculate the value of Equation (6) when performing a one-to-one matching between the clusters in the permutation and the clusters in $C^p$;
- 4: Select the one-to-one matching with the minimum value of Equation (6);
- 5: Merge the matched cluster pairs of the one-to-one matching obtained in Step 4 to generate a new primary clustering result.
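For illustration, the matching at the heart of Algorithm 1 can be sketched in a few lines of Python. The function name match_clusters and its array interface are our own; the brute-force enumeration of permutations mirrors Steps 2–4 of Algorithm 1, so it is only practical for small k.

```python
import numpy as np
from itertools import permutations

def match_clusters(centers_p, centers_s):
    """One-to-one matching of secondary clusters to primary clusters by
    minimizing the total distance between matched centers (Equation (6)).
    Brute force over all k! permutations, as in Algorithm 1."""
    k = len(centers_p)
    best = min(permutations(range(k)),
               key=lambda p: sum(np.linalg.norm(centers_p[j] - centers_s[p[j]])
                                 for j in range(k)))
    # relabel[s] = index of the primary cluster that secondary cluster s joins
    relabel = np.empty(k, dtype=int)
    for j, s in enumerate(best):
        relabel[s] = j
    return relabel
```

Merging (Step 5) then amounts to relabeling the secondary result as relabel[labels_s] and concatenating the two point sets.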
3.1.2. Refinement Step
Suppose that $C^i$ is produced by running k-means on $X_i$, so the data points in each cluster $C^i_j$ are considered to belong to the same group $X_i$. In the case of two groups, we measure the fairness of a cluster by calculating the ratio of the number of data points belonging to $X_1$ to the number of data points belonging to $X_2$ within the cluster. After the combination step, the fairness of the merged cluster $C_j$ is formally defined as:

$\mathrm{fair}(C_j) = \frac{|C_j \cap X_1|}{|C_j \cap X_2|}$.  (7)

The closer the value of Equation (7) is to $|X_1|/|X_2|$, the fairer the cluster is. However, it can be observed that almost all merged clusters are unfair. The reason is that we do not impose extra constraints, such as cluster size, during the k-means process and the combination step to ensure that merged clusters maintain the distribution proportions of the protected groups in the dataset. To cope with this challenge, a refinement step is designed.
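Read as code, Equation (7) is simply the following (a minimal sketch; the group encoding, 0 for $X_1$ and 1 for $X_2$, and the function name are our assumptions):

```python
import numpy as np

def fairness_ratio(labels, group, j):
    """Equation (7): ratio of X1 points to X2 points inside cluster j
    (group encoded as 0 for X1, 1 for X2)."""
    a = np.sum((labels == j) & (group == 0))
    b = np.sum((labels == j) & (group == 1))
    return a / b if b > 0 else np.inf
```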
The refinement step iteratively handles the cluster with the poorest fairness to make it nearly completely fair. Let $C_t$ be the cluster with the lowest value of Equation (7). An intuitive approach to improve the fairness of $C_t$ is to move some data points belonging to $X_1$ into it. Specifically, we take these data points from the $X_1$ part of another merged cluster $C_r$, since those data points all belong to $X_1$. We determine $C_r$ as the nearest unfair neighbor of $C_t$ and move the required number of data points that are closest to the center of $C_t$. These two considerations lead to a smaller increase in clustering cost on the one hand and make $C_t$ fair on the other hand. The refinement step is summarized in Algorithm 2:
Algorithm 2 Refine Algorithm
- Input: The primary clustering result obtained from the combination step.
- Output: New primary clustering result.
- 1: Mark each cluster as unfair;
- 2: while the number of unfair clusters is greater than 1 do
- 3: Select $C_t$ as the cluster with the lowest value of Equation (7);
- 4: Select $C_r$ as the nearest unfair neighbor of $C_t$;
- 5: Move the required number of $X_1$ data points closest to the center of $C_t$ from $C_r$ to $C_t$;
- 6: Mark $C_t$ as fair;
- 7: end while
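A possible Python sketch of Algorithm 2 for the two-group case follows. The helper name refine, the group encoding (0 for $X_1$, 1 for $X_2$), and the rounding used to pick the number of transferred points are our assumptions, not the paper's exact implementation.

```python
import numpy as np

def refine(X, labels, group, k):
    """Sketch of Algorithm 2 for two groups: repeatedly pick the cluster
    with the lowest group-0/group-1 ratio (Equation (7)) and pull group-0
    points in from its nearest unfair neighbor until it reaches the
    dataset-level ratio |X1|/|X2|."""
    target = np.sum(group == 0) / np.sum(group == 1)   # |X1| / |X2|
    unfair = set(range(k))

    def ratio(j):
        a = np.sum((labels == j) & (group == 0))
        b = max(np.sum((labels == j) & (group == 1)), 1)
        return a / b

    while len(unfair) > 1:
        t = min(unfair, key=ratio)                     # poorest fairness
        c_t = X[labels == t].mean(axis=0)
        r = min((j for j in unfair if j != t),         # nearest unfair neighbor
                key=lambda j: np.linalg.norm(X[labels == j].mean(axis=0) - c_t))
        # How many group-0 points cluster t needs to reach the target ratio.
        a_t = np.sum((labels == t) & (group == 0))
        b_t = np.sum((labels == t) & (group == 1))
        need = max(int(round(target * b_t)) - a_t, 0)
        # Move the group-0 points of r that are closest to t's center.
        cand = np.where((labels == r) & (group == 0))[0]
        cand = cand[np.argsort(np.linalg.norm(X[cand] - c_t, axis=1))]
        labels[cand[:need]] = t
        unfair.remove(t)                               # mark t as fair
    return labels
```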
Consequently, the initialization stage is described in Algorithm 3.
Algorithm 3 Initialization
- Input: Dataset $X$, cluster number $k$.
- Output: Clustering result.
- 1: Apply k-means to the $m$ groups to obtain $m$ clustering results $C^1, \ldots, C^m$, where $C^i = \{C^i_1, \ldots, C^i_k\}$;
- 2: Primary clustering result $C^p \leftarrow C^1$;
- 3: for $i = 2$ to $m$ do
- 4: Secondary clustering result $C^s \leftarrow C^i$;
- 5: Apply Algorithm 1 to combine $C^p$ and $C^s$ and obtain a new primary clustering result;
- 6: Apply Algorithm 2 to refine the new primary clustering result;
- 7: Take the refined result as the primary clustering result $C^p$;
- 8: end for
- 9: Return the final clustering result $C \leftarrow C^p$;
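Assuming the match_clusters and refine helpers sketched above, a hypothetical driver for the whole initialization stage could look like this, with scikit-learn's KMeans standing in for the per-group k-means:

```python
import numpy as np
from sklearn.cluster import KMeans  # stand-in for the per-group k-means

def initialize(X, group, k):
    """Sketch of Algorithm 3: run k-means inside each protected group, then
    fold the groups into the primary result from largest to smallest, using
    match_clusters() and refine() from the sketches above."""
    labels = np.full(len(X), -1)
    order = sorted(np.unique(group), key=lambda g: -np.sum(group == g))
    primary = group == order[0]                      # largest group first
    labels[primary] = KMeans(n_clusters=k, n_init=10).fit_predict(X[primary])
    for g in order[1:]:
        sec = group == g
        lab_s = KMeans(n_clusters=k, n_init=10).fit_predict(X[sec])
        cp = np.array([X[primary][labels[primary] == j].mean(axis=0)
                       for j in range(k)])
        cs = np.array([X[sec][lab_s == j].mean(axis=0) for j in range(k)])
        labels[sec] = match_clusters(cp, cs)[lab_s]  # combination step
        both = primary | sec
        # Refinement: treat "already merged" vs. "new group" as two groups.
        labels[both] = refine(X[both], labels[both], sec[both].astype(int), k)
        primary = both
    return labels
```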
3.2. Relaxation
The clustering result is almost exactly fair after the initialization stage. However, in certain cases, there is a desire to obtain a relatively relaxed fair clustering outcome. Specifically, when using a fairness measure to assess the fairness of clustering results, it is expected that the value of the fairness measure exceeds a predetermined threshold, rather than strictly achieving exact fairness.
Our proposed method relaxes the fairness constraint under a fairness measure called balance. Balance was first introduced by Chierichetti et al. [20] and generalized by Bera et al. [25]. It is measured by the ratio of different groups in each cluster. Suppose that $r_{i,s} = |C_i \cap X_s| / |C_i|$ denotes the representation of group $X_s$ in cluster $C_i$, and $r_s = |X_s| / |X|$ denotes the representation of group $X_s$ in the dataset. The balance of cluster $C_i$ is defined as:

$\mathrm{balance}(C_i) = \min_{s} \min\big(\frac{r_{i,s}}{r_s}, \frac{r_s}{r_{i,s}}\big)$.  (8)

The balance of a clustering result $C$ is defined as:

$\mathrm{balance}(C) = \min_{i} \mathrm{balance}(C_i)$.  (9)

The value of the balance lies in the range from 0 to 1, with a higher value indicating a fairer clustering result. If the balance value of a clustering result is 1, the clustering result is exactly fair.
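To make Equations (8) and (9) concrete, here is a minimal sketch (the function name and the label/group array encoding are ours):

```python
import numpy as np

def balance(labels, group, k):
    """Balance of a clustering (Equations (8) and (9)): the minimum over
    clusters of the worst two-sided ratio between the in-cluster group
    share and the dataset-level group share."""
    n = len(labels)
    groups = np.unique(group)
    r = {s: np.sum(group == s) / n for s in groups}   # dataset shares
    b = 1.0
    for i in range(k):
        members = group[labels == i]
        if len(members) == 0:
            return 0.0
        for s in groups:
            r_is = np.sum(members == s) / len(members)  # in-cluster share
            if r_is == 0:
                return 0.0
            b = min(b, r_is / r[s], r[s] / r_is)
    return b
```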
To relax the fairness constraint, we introduce a parameter $\tau \in [0, 1]$. The balance value of a clustering result is bounded below by $\tau$; that is, the balance value of each cluster should not be less than $\tau$. We employ a greedy search strategy based on moving an element from one cluster to another, similar to Hartigan's algorithm. Given a point $x \in C_i$ that is moved to $C_j$, the movement will change the balance values of $C_i$ and $C_j$. If the balance values of $C_i$ and $C_j$ are still greater than or equal to $\tau$, the movement is considered valid. We evaluate all valid movements and accept the one with the maximum decrease in the objective function value. We iteratively move points until there is no valid movement or until Equation (4) is less than zero for all valid movements.
By employing the greedy search strategy, we can gradually optimize the objective function while maintaining a level of fairness. The relaxation stage is summarized in Algorithm 4:
Algorithm 4 Relaxation
- Input: Dataset $X$, clustering result $C$, cluster number $k$.
- Output: Clustering result.
- 1: repeat
- 2: for each data point $x \in X$ do
- 3: if removing $x$ keeps the balance of its cluster at least $\tau$ then
- 4: for each cluster $C_j$ that does not contain $x$ do
- 5: if inserting $x$ keeps the balance of $C_j$ at least $\tau$ then
- 6: Try to move $x$ to $C_j$ and calculate Equation (4);
- 7: end if
- 8: end for
- 9: end if
- 10: Move $x$ to the cluster with the maximum decrease found in Step 6;
- 11: Update the centers of the clusters that $x$ moves in and out of by Equation (5);
- 12: end for
- 13: until there is no valid movement or Equation (4) is less than zero for all valid movements
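One pass of the relaxation stage might be sketched as follows. Since Equation (4) is not reproduced in this section, we substitute the standard Hartigan-style cost change for a single-point move, and we reuse the balance helper sketched above; all names are illustrative.

```python
import numpy as np

def relax_pass(X, labels, k, group, tau):
    """One pass of the relaxation stage: for every point, consider all valid
    moves (the clustering keeps balance >= tau after the move) and apply the
    move with the largest cost decrease.  The Hartigan-style cost change
    below stands in for the paper's Equation (4)."""
    moved = False
    for idx in range(len(X)):
        i = labels[idx]
        sizes = np.bincount(labels, minlength=k)
        if sizes[i] <= 1:
            continue                      # do not empty a cluster
        c = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        best_j, best_delta = None, 0.0
        for j in range(k):
            if j == i:
                continue
            labels[idx] = j               # tentative move
            valid = balance(labels, group, k) >= tau
            labels[idx] = i               # undo
            if not valid:
                continue
            delta = (sizes[j] / (sizes[j] + 1) * np.sum((X[idx] - c[j]) ** 2)
                     - sizes[i] / (sizes[i] - 1) * np.sum((X[idx] - c[i]) ** 2))
            if delta < best_delta:        # strict cost decrease
                best_delta, best_j = delta, j
        if best_j is not None:
            labels[idx] = best_j
            moved = True
    return moved
```

The stage then simply repeats relax_pass until it returns False, matching the stopping rule of Algorithm 4.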
3.3. Improvement
In order to minimize the objective function value while maintaining fairness, a local search method based on interchanging two elements that belong to the same group but different clusters is employed. Formally, given a set of $k$ clusters $C = \{C_1, \ldots, C_k\}$ of $X$, the interchange of points $x \in C_i$ and $y \in C_j$ produces a new solution $C'$. The change in the objective function value can be calculated by the following equation:

$\Delta(x, y) = \|x - c_j\|^2 + \|y - c_i\|^2 - \|x - c_i\|^2 - \|y - c_j\|^2$,  (11)

where $x \in C_i$, $y \in C_j$, and $c_i$ and $c_j$ are the centers of $C_i$ and $C_j$, respectively.
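Read as code, Equation (11), with the centers held fixed during evaluation, is simply (function name ours):

```python
import numpy as np

def swap_delta(x, y, c_i, c_j):
    """Equation (11): change in the k-means objective when x (in cluster i)
    and y (in cluster j) are interchanged, with the centers held fixed."""
    d2 = lambda a, b: np.sum((a - b) ** 2)   # squared Euclidean distance
    return d2(x, c_j) + d2(y, c_i) - d2(x, c_i) - d2(y, c_j)
```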
Our local search strategy evaluates the interchange of every pair of data points that belong to the same group but different clusters. It accepts the first interchange that results in an improvement of the objective function. The improvement stage is described in Algorithm 5:
Algorithm 5 Improvement
- Input: Dataset $X$, clustering result $C$, cluster number $k$.
- Output: Clustering result.
- 1: repeat
- 2: $swapped \leftarrow$ false;
- 3: for each $x_i \in X$ do
- 4: for each $x_j \in X$ do
- 5: if $x_i$ and $x_j$ belong to different clusters and $x_i$ and $x_j$ belong to the same group then
- 6: Calculate Equation (11);
- 7: if the value of Equation (11) is less than 0 then
- 8: Interchange points $x_i$ and $x_j$;
- 9: $swapped \leftarrow$ true;
- 10: end if
- 11: end if
- 12: end for
- 13: if $swapped$ then
- 14: break
- 15: end if
- 16: end for
- 17: until no interchange is performed.
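Building on swap_delta, the first-improvement search of Algorithm 5 can be sketched as below; the loop structure (accept the first improving swap, then rescan with updated centers) follows the algorithm, while the helper names are ours.

```python
import numpy as np

def improve(X, labels, group, k):
    """First-improvement local search over same-group, cross-cluster pairs
    (Algorithm 5), reusing swap_delta() from the sketch above."""
    n = len(X)
    while True:
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        swapped = False
        for i in range(n):
            for j in range(i + 1, n):
                if labels[i] != labels[j] and group[i] == group[j]:
                    if swap_delta(X[i], X[j],
                                  centers[labels[i]], centers[labels[j]]) < 0:
                        labels[i], labels[j] = labels[j], labels[i]
                        swapped = True
                        break              # accept the first improving swap
            if swapped:
                break                      # rescan with updated centers
        if not swapped:
            return labels
```

Since each accepted swap strictly decreases the fixed-center objective, and recomputing the centers can only decrease it further, the loop terminates at a local optimum.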
Finally, the overall algorithm of the proposed method is summarized in Algorithm 6:
Algorithm 6 FFC
- Input: Dataset $X$, cluster number $k$.
- Output: Clustering result.
- 1: Apply Algorithm 3 to generate an initial fair clustering result;
- 2: Apply Algorithm 4 to obtain a relatively relaxed fair clustering result;
- 3: Apply Algorithm 5 to improve the clustering quality while maintaining fairness; the final clustering result is then achieved.
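A minimal driver chaining the stage sketches above mirrors Algorithm 6; all helper names come from the earlier sketches and are our assumptions.

```python
def ffc(X, group, k, tau):
    """End-to-end FFC sketch (Algorithm 6); tau is the balance threshold
    used by the relaxation stage."""
    labels = initialize(X, group, k)             # Algorithm 3: fair initialization
    while relax_pass(X, labels, k, group, tau):  # Algorithm 4: relaxation
        pass
    return improve(X, labels, group, k)          # Algorithm 5: improvement
```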
3.4. Computational Complexity
The overall computational complexity of FFC can be evaluated as the sum of the computational complexities of its stages. In the initialization stage, the computational complexity of k-means is $O(I_1 n k d)$, where $I_1$ is the number of iterations of k-means, $n$ is the number of data points, and $d$ is the dimensionality. The computational complexities of the combination step and the refinement step are $O(k! \cdot k d)$ and $O(n k)$, respectively, and both steps are executed $m-1$ times. Then, the computational complexity of the initialization stage is $O(I_1 n k d + m(k! \cdot k d + n k))$. The computational complexity of the relaxation stage is $O(I_2 n k d)$, where $I_2$ is the number of iterations in the relaxation stage. In the improvement stage, the computational complexity of the local search method is $O(I_3 n^2 d)$, where $I_3$ is the number of iterations in the improvement stage. To sum up, the computational complexity of our method is $O((I_1 + I_2) n k d + m(k! \cdot k d + n k) + I_3 n^2 d)$.
The computational complexity of vanilla k-means is $O(I n k d)$, where $I$ is the number of iterations. It is lower than that of FFC since vanilla k-means does not account for fairness constraints. The computational complexity of fair spectral clustering [23] is $O(n^3)$, dominated by the eigendecomposition. The computational complexity of the fair k-means proposed by Xu et al. [38] additionally depends on $h$, which denotes the cost of the minimum cost flow they employ. The fair hierarchical agglomerative clustering introduced by Chhabra et al. [39] likewise inherits the cost of the agglomerative merging process.
5. Conclusions
In this paper, we incorporate fairness constraints into k-means clustering and propose FFC. The scheme of our method is to handle fairness before improving clustering quality, which differs from the existing post-processing-based work. We design an initialization stage that can handle datasets with a multi-valued protected attribute to produce an initial fair solution. Moreover, a relaxation stage and an improvement stage are presented to improve clustering quality while relaxing and maintaining fairness, respectively. Experimental results on synthetic and real-world datasets show that our proposed method outperforms some other fair clustering methods in terms of fairness metrics, while the loss of clustering quality remains acceptable and the clustering quality is in some cases even better.
We achieved fairness in k-means clustering at the expense of some clustering quality. Even though this loss is acceptable, finding a way to minimize the cost of fairness remains future work. Moreover, since a local search method is employed, FFC cannot guarantee that the clustering output is a global optimum. In the future, we will investigate strategies that can search for a near-global or even the global optimum.