Article

Fairness First Clustering: A Multi-Stage Approach for Mitigating Bias

by Renbo Pan and Caiming Zhong *
1 Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo 315211, China
2 College of Science and Technology, Ningbo University, Ningbo 315211, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(13), 2969; https://doi.org/10.3390/electronics12132969
Submission received: 28 May 2023 / Revised: 27 June 2023 / Accepted: 3 July 2023 / Published: 5 July 2023
(This article belongs to the Section Computer Science & Engineering)

Abstract

Fair clustering aims to partition a dataset while mitigating bias in the original dataset. Developing fair clustering algorithms has gained increasing attention from the machine learning community. In this paper, we propose a fair k-means algorithm, fair first clustering (FFC), which consists of an initialization stage, a relaxation stage, and an improvement stage. In the initialization stage, k-means is employed to cluster each group. Then, a combination step and a refinement step are applied to preserve clustering quality and to guarantee approximate fairness. In the relaxation stage, a commonly used fairness metric, balance, is utilized to assess fairness, and a threshold is set to allow for fairness relaxation while improving the clustering quality. In the improvement stage, a local search method is used to improve the clustering quality without changing the fairness. Comparisons of fairness and clustering quality are carried out between our method and other state-of-the-art fair clustering methods on 10 datasets, which include both synthetic and real-world datasets. The results show that, compared to the method with the second highest balance value, FFC achieves the same SSE value on one dataset and lower SSE values on six datasets.

1. Introduction

Clustering is a fundamental technique in unsupervised machine learning and data mining that aims to divide a dataset into several clusters without any supervision. Many clustering algorithms have been proposed over the past few decades [1,2,3,4,5] and applied in many real-world applications, such as face detection [6], recommendation systems [7], and information retrieval [8].
Cluster analysis is a widely used technique in the electronics domain. Li et al. [9] proposed a simplified electrochemical impedance spectroscopy model based on k-means to describe the battery characteristics of electric vehicles. Jinlei et al. [10] developed an active equalization strategy for series-connected battery packs based on clustering analysis with a genetic algorithm. Lei et al. [11] employed hierarchical clustering to cluster four kinds of VOCs detected by a fluorescent cross-responsive sensor array and obtained accurate results. Hwang and Kim [12] incorporated a Gaussian mixture model into a variational autoencoder framework to cluster wafer maps in semiconductor manufacturing. Recently, researchers have paid more attention to fairness in clustering [13].
Traditional clustering algorithms often take the internal structure of the data into account but ignore the bias in the data that is associated with protected attributes (e.g., race and gender). In the process of clustering, this bias can be transmitted, or even amplified, and affect the clustering result. In bioelectronics, clustering methods can be used to analyze single-cell RNA-Seq data. The cell sequencing technique can be considered the protected attribute, since cells sequenced by different techniques exhibit different expression levels [14]. Traditional clustering methods may perform well on cells obtained by a specific sequencing technique while struggling with data obtained by others. As another example, consider feature extraction for facial emotion recognition in electronic multimedia: clustering methods can be used to group features that describe the shape and location of the important biometric parts of the face, such as the eyes, nose, and mouth, based on similarity [15]. In this context, race can be considered the protected attribute, and there may be inherent physical differences in facial features between races. Traditional clustering-based feature extraction may perform well for one race but less well for others. Therefore, the issue of fairness should be considered in most domains where clustering is employed [16]. Such bias can be mitigated by adding fairness constraints that hide protected attributes from the clustering results. Generally, fairness in clustering requires that objects with different values of protected attributes have a uniform distribution in each cluster [17,18,19].
In order to achieve fair clustering, much effort has been made to incorporate fairness constraints into clustering [20,21,22,23,24,25]. These methods can be divided into (1) pre-processing, (2) in-processing, and (3) post-processing approaches according to when fairness is handled. Chierichetti et al. [20], Backurs et al. [21], and Ahmadian et al. [22] partitioned the original dataset into small clusters where fairness is guaranteed and then applied clustering algorithms (e.g., k-center, k-median, and hierarchical clustering) to them to achieve fair results. Hence, these works are all pre-processing-based methods. The methods of Kleindessner et al. [23] and Liu and Vicente [24] are in-processing-based: the former formulated fairness constraints and incorporated them into the spectral clustering objective function; the latter added steps that adjust the fairness of clusters to mini-batch k-means. Bera et al. [25] first employed a vanilla clustering algorithm and then changed the assignment of objects to the clusters to improve fairness, so it is a post-processing-based method.
In general, enforcing fairness on clustering will lead to a loss of clustering quality. This is because the pursuit of fairness may result in the allocation of closely related objects to different clusters to mitigate bias. Such assignments can decrease the performance of the clustering. For example, post-processing-based methods often utilize the initial unfair clustering results obtained by vanilla clustering algorithms, which are expected to have high clustering quality. The fair clustering result generated by the fairness operation is usually of lower clustering quality than the original unfair one.
In this paper, we take a different path that modifies an initial fair clustering result rather than an unfair one. The method we propose consists of three stages: an initialization stage that obtains a fair clustering output, a relaxation stage that allows a certain degree of unfairness to achieve better clustering quality, and an improvement stage that improves clustering quality. Additionally, different from the post-processing-based approach that improves fairness while retaining clustering quality, our idea of an improvement stage is to improve clustering quality while maintaining fairness. The main contributions in this paper can be summarized as follows:
  • An initialization method that can handle the case of multiple groups is presented to generate an initial fair clustering result, and, compared to random initialization, the result is of relatively good clustering quality.
  • A local search method is employed to improve clustering quality. Specifically, the local search method can find a local optimum of the initial clustering result.
  • Experimental results on both synthetic and real-world datasets demonstrate that our proposed method outperforms some of the other fair clustering methods in terms of fairness and clustering metrics.

2. Related Work

2.1. Fair Clustering

Chierichetti et al. proposed the first work on fair clustering [20]. They added fairness constraints to k-center and k-median clustering in the case of two protected groups. They presented the fairness notion of balance, which requires that the two protected groups have approximately equal representation in each cluster. To achieve fair clustering, they introduced fairlet decomposition, which constructs small, balanced subsets (i.e., fairlets). Classical clustering algorithms were then applied to the fairlets to obtain a fair clustering result. To scale to large datasets, Backurs et al. proposed a tree-metric-based nearly linear time algorithm for fairlet decomposition [21]. Rösner and Schmidt allowed multiple protected groups and obtained a constant-factor approximation algorithm for k-center clustering [26]. Bera et al. introduced the fairness notion of bounded representation, which adds upper and lower bounds on the protected groups in clusters [25]. They first solved the vanilla clustering problem and then improved fairness by fairly assigning objects to cluster centers. Ziko et al. proposed a general variational framework for fair clustering by integrating a fairness penalty term based on the Kullback–Leibler (KL) divergence into classic clustering methods [27]. Esmaeili et al. treated an upper bound on the clustering objective as a constraint and minimized a measure of unfairness [28].
Beyond center-based clustering, fairness has been achieved in other clustering algorithms. Kleindessner et al. incorporated fairness constraints into spectral clustering [23]. Ahmadian et al. extended fairness to hierarchical clustering under different hierarchical clustering objectives [22]. Li et al. explored fairness in deep clustering for large-scale and high-dimensional visual learning and developed a deep fair clustering model to learn the feature representations that are favorable for clustering and hiding protected attributes [29]. Chhabra et al. proposed a novel black-box attack against fair clustering algorithms where the attacker can perturb a small percentage of protected group memberships and a novel fair clustering approach that utilizes consensus clustering along with fairness constraints to output robust and fair clusters [30].
The works in [31,32] take the result of a vanilla clustering algorithm as input and find a clustering close to the input that satisfies the fairness constraints. Both formulate the problem as an integer linear program and solve it with an ILP solver and an LP solver, respectively.

2.2. K-Means Clustering

K-means clustering is a commonly used clustering method due to its simplicity. The goal of k-means clustering is to partition a dataset into k clusters, where each object is assigned to the cluster with the nearest cluster center. Formally, let $X = \{x_1, x_2, \ldots, x_n\}$ be a dataset with points $x_i = (x_{i1}, \ldots, x_{id})^T \in \mathbb{R}^d$. K-means clustering aims to partition $X$ into $k$ clusters $C = \{C_1, C_2, \ldots, C_k\}$ that minimize the sum of the squared error (SSE) [33]. The SSE is the sum of the squared Euclidean distances of each point to its closest center:
$SSE = \sum_{j=1}^{k} \sum_{x_i \in C_j} \| x_i - c_j \|^2$,  (1)
where $c_j \in \mathbb{R}^d$ is the center of cluster $C_j$.
The most popular implementation of k-means clustering is Lloyd’s algorithm [34,35]. It starts by randomly choosing k points as centers. After that, it alternates between the two steps:
Assignment Step: Assign each point to the cluster with the nearest center:
$C_j^{(t)} = \{ x_i : \| x_i - c_j^{(t)} \| \le \| x_i - c_p^{(t)} \| \; \forall p, 1 \le p \le k \}$.  (2)
Update Step: Recalculate the center of each cluster:
$c_j^{(t+1)} = \frac{1}{|C_j^{(t)}|} \sum_{x_i \in C_j^{(t)}} x_i$.  (3)
Lloyd’s algorithm converges when no point changes its assignment or the maximum number of iterations is reached. It always converges to a local optimum but not necessarily to the global optimum.
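The following NumPy sketch illustrates Lloyd's iteration described above (random initial centers, alternating assignment and update steps, Equations (2) and (3)). It is a minimal illustration rather than the implementation used in this paper; the function name and the convergence test on the centers are our own choices.
```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Minimal sketch of Lloyd's algorithm (Equations (2) and (3))."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its cluster.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # no center moved: converged
            break
        centers = new_centers
    return labels, centers
```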
Hartigan’s algorithm is another heuristic algorithm for k-means clustering, but it is somewhat different from Lloyd’s algorithm [36]. It pays attention to an important fact: if a point is moved between clusters, the centers of both clusters change. It starts from a random partition of $X$ into $k$ clusters and calculates the center of each cluster. For a point $x_i \in C_i$, Hartigan’s algorithm calculates the change in Equation (1) if $x_i$ is moved from $C_i$ to $C_j$:
$\Delta SSE_{x_i \to C_j} = \sum_{x \in C_i} \| x - c_i \|^2 + \sum_{x \in C_j} \| x - c_j \|^2 - \sum_{x \in C_i'} \| x - c_i' \|^2 - \sum_{x \in C_j'} \| x - c_j' \|^2$,  (4)
$c_i' = c_i + \frac{c_i - x_i}{|C_i| - 1} \quad \text{and} \quad c_j' = c_j + \frac{x_i - c_j}{|C_j| + 1}$,  (5)
where $C_i' = C_i \setminus \{x_i\}$, $C_j' = C_j \cup \{x_i\}$, and $c_i'$ and $c_j'$ are the centers of $C_i'$ and $C_j'$, respectively. Then, $x_i$ is assigned to the cluster with the maximum decrease in the objective function value. Hartigan’s algorithm repeatedly picks a point and reassigns it until Equation (4) is less than zero for all points.
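As a concrete illustration of Equation (4), the sketch below computes, by brute force, the decrease in SSE obtained by moving a single point between two clusters (recomputing both cluster means rather than using the incremental updates of Equation (5)). It follows the convention used above that a positive value means the move reduces the SSE; the helper names are ours.
```python
import numpy as np

def sse(C):
    """Within-cluster sum of squared distances to the cluster mean."""
    return ((C - C.mean(axis=0)) ** 2).sum() if len(C) else 0.0

def sse_decrease(x_index, C_i, C_j):
    """Equation (4): decrease in SSE when row x_index of C_i is moved to C_j."""
    x = C_i[x_index]
    C_i_new = np.delete(C_i, x_index, axis=0)   # C_i' = C_i without x
    C_j_new = np.vstack([C_j, x])               # C_j' = C_j plus x
    return sse(C_i) + sse(C_j) - sse(C_i_new) - sse(C_j_new)
```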

3. Proposed Method

This section presents the fair k-means algorithm and describes the initialization method and the improvement for fairness.

3.1. Initialization

Fair clustering aims to ensure that the proportions of the protected groups in each cluster are close to their proportions in the whole dataset [23,26,27,37]. Consequently, data points from different groups can be assigned the same cluster label. However, vanilla k-means clustering, which does not consider protected attributes, usually divides the dataset so that similar points are grouped into the same cluster, and thus produces an unfair solution. In our study, we make the assumption that if data points belonging to the same group are similar, they should be placed in the same cluster in a fair clustering solution. Therefore, we propose an initialization strategy based on the k-means algorithm to ensure clustering quality and guarantee approximate fairness.
Suppose that dataset $X$ can be divided into $m$ disjoint groups $X = \{G_1, G_2, \ldots, G_m\}$ with $|G_1| > |G_2| > \cdots > |G_m|$. We first run k-means on every group to obtain $m$ clustering results $C^1, \ldots, C^i, \ldots, C^m$, where $C^i = \{C_1^i, C_2^i, \ldots, C_k^i\}$. $C^1$ is selected as the primary clustering result because $G_1$ contains the most points. Then, we iteratively choose the clustering result corresponding to the largest remaining group as the secondary clustering result, refine the clusters obtained by combining the primary and secondary clustering results, and treat the refined result as the new primary clustering result. Figure 1 shows the complete initialization scheme on a dataset with 3 groups. The combination step and refinement step are described below.

3.1.1. Combination Step

Suppose we combine a primary clustering result $C^p = \{C_1^p, C_2^p, \ldots, C_k^p\}$ with a secondary clustering result $C^s = \{C_1^s, C_2^s, \ldots, C_k^s\}$, where $C^p$ contains more data points than $C^s$. In the context of clustering, cluster centers can be considered representatives of the clusters, as they are calculated as the mean of the data points within each cluster. To establish a one-to-one matching between the clusters in $C^p$ and $C^s$, we match the clusters based on their cluster centers. Specifically, let $M(C_i^p) \in C^s$ denote the cluster matched with cluster $C_i^p \in C^p$. We determine this matching by minimizing
$\Phi(M) = \sum_{C_i^p \in C^p} d(C_i^p, M(C_i^p))$,  (6)
where $d(\cdot, \cdot)$ represents the distance between two cluster centers. After the matching operation, $k$ cluster pairs are obtained. Each cluster pair is then merged to generate a new primary clustering result $C^p = \{C_1^p \cup M(C_1^p), C_2^p \cup M(C_2^p), \ldots, C_k^p \cup M(C_k^p)\}$. The algorithm for combining the primary and secondary clustering results is summarized in Algorithm 1:
Algorithm 1 Combine Algorithm
Input: 
Primary clustering result $C^p = \{C_1^p, C_2^p, \ldots, C_k^p\}$, secondary clustering result $C^s = \{C_1^s, C_2^s, \ldots, C_k^s\}$.
Output: 
New primary clustering result.
 1:
Compute the cluster centers of $C^p$ and $C^s$;
 2:
Compute all permutations of the clusters in $C^s$;
 3:
For each permutation, calculate the value of Equation (6) for the one-to-one matching between the clusters in the permutation and the clusters in $C^p$;
 4:
Select the one-to-one matching with the minimum value of Equation (6);
 5:
Merge the matched cluster pairs of the one-to-one matching obtained in Step 4 to generate a new primary clustering result.
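A brute-force sketch of Algorithm 1 is shown below: it enumerates all $k!$ one-to-one matchings between the two sets of cluster centers, keeps the matching that minimizes Equation (6) with the Euclidean distance, and merges the matched pairs. The function and variable names are illustrative; for larger $k$, the same matching could be found in polynomial time with the Hungarian algorithm (e.g., scipy.optimize.linear_sum_assignment).
```python
import numpy as np
from itertools import permutations

def combine(primary, secondary):
    """Algorithm 1 sketch: primary and secondary are lists of k point arrays."""
    k = len(primary)
    cp = [C.mean(axis=0) for C in primary]     # centers of the primary clusters
    cs = [C.mean(axis=0) for C in secondary]   # centers of the secondary clusters
    best_perm, best_cost = None, np.inf
    for perm in permutations(range(k)):        # all one-to-one matchings
        cost = sum(np.linalg.norm(cp[i] - cs[perm[i]]) for i in range(k))
        if cost < best_cost:                   # keep the matching minimizing Eq. (6)
            best_perm, best_cost = perm, cost
    # Merge each matched pair into a cluster of the new primary result.
    return [np.vstack([primary[i], secondary[best_perm[i]]]) for i in range(k)]
```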

3.1.2. Refinement Step

Suppose that $C^s$ is produced by running k-means on $G_s$, and the data points in $C^p$ are considered to belong to the same group $G_p$. In the case of two groups, we measure the fairness of a cluster by the ratio of the number of data points belonging to $G_s$ to the number of data points belonging to $G_p$ within the cluster. After the combination step, the fairness of the merged cluster $C_i^p \cup M(C_i^p)$ is formally defined as:
$b(C_i^p \cup M(C_i^p)) = \frac{|(C_i^p \cup M(C_i^p)) \cap G_s|}{|(C_i^p \cup M(C_i^p)) \cap G_p|} = \frac{|M(C_i^p)|}{|C_i^p|}$.  (7)
The closer the value of Equation (7) is to $|G_s| / |G_p|$, the fairer the cluster. However, it can be observed that almost all merged clusters are unfair. The reason is that we do not impose extra constraints, such as cluster size, on the k-means step or the combination step to ensure that the merged clusters maintain the distribution proportions of the protected groups in the dataset. To cope with this challenge, a refinement step is designed.
The refinement step iteratively handles the cluster with the poorest fairness to make it nearly exactly fair. Let $C_l^p \cup M(C_l^p)$ be the merged cluster with the lowest value of Equation (7). An intuitive approach to improving its fairness is to move some data points belonging to $G_s$ into it. Specifically, we move these data points from a cluster $C_t^s \in C^s$, since the data points in $C_t^s$ all belong to $G_s$. We choose $C_t^s$ as the nearest unfair neighbor of $M(C_l^p)$ and move the $|G_s| \times |C_l^p| / |G_p| - |M(C_l^p)|$ data points that are closest to the center of $M(C_l^p)$. These two choices lead to a smaller increase in the clustering cost on the one hand and make the cluster fair on the other hand. The refinement step is summarized in Algorithm 2:
Algorithm 2 Refine Algorithm
Input: 
The primary clustering result C p obtained from the combination step.
Output: 
New primary clustering result.
  1:
Mark each cluster as unfair;
  2:
while the number of unfair clusters is greater than 1 do
  3:
    Select the merged cluster $C_l^p \cup M(C_l^p)$ with the lowest value of Equation (7);
  4:
    Select $C_t^s$ as the nearest unfair neighbor of $M(C_l^p)$;
  5:
    Move the $|G_s| \times |C_l^p| / |G_p| - |M(C_l^p)|$ data points closest to the center of $M(C_l^p)$ from $C_t^s$ to the merged cluster;
  6:
    Mark the merged cluster as fair;
  7:
end while
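Two ingredients of Step 5 of Algorithm 2 can be sketched as follows: the number of $G_s$ points that must be added to the least-fair merged cluster so that its ratio in Equation (7) approaches $|G_s|/|G_p|$, and the selection of the donor points closest to the center of $M(C_l^p)$. The rounding and the non-negativity clamp are assumptions of this sketch.
```python
import numpy as np

def num_points_to_move(G_s_size, G_p_size, cluster_p_size, matched_s_size):
    """Count used in Step 5 of Algorithm 2: |G_s| * |C_l^p| / |G_p| - |M(C_l^p)|."""
    return max(0, round(G_s_size * cluster_p_size / G_p_size) - matched_s_size)

def closest_points(donor, center, q):
    """Indices of the q points of the donor cluster closest to the target center."""
    d = np.linalg.norm(donor - center, axis=1)
    return np.argsort(d)[:q]
```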
Consequently, the initialization stage is described in Algorithm 3.
Algorithm 3 Initialization
Input: 
Dataset $X = \{G_1, G_2, \ldots, G_m\}$, cluster number $k$.
Output: 
Clustering result.
  1:
Apply k-means to the $m$ groups to obtain $m$ clustering results $C^1, \ldots, C^i, \ldots, C^m$, where $C^i = \{C_1^i, C_2^i, \ldots, C_k^i\}$;
  2:
Set the primary clustering result $C^p = C^1$;
  3:
for  i = 2 , 3 , , m  do
  4:
Set the secondary clustering result $C^s = C^i$;
  5:
Apply Algorithm 1 to combine $C^p$ and $C^s$ and obtain a new primary clustering result;
  6:
Apply Algorithm 2 to refine the new primary clustering result;
  7:
Take the refined result as the primary clustering result $C^p$;
  8:
end for
  9:
Return the final clustering result $C^p$.

3.2. Relaxation

The clustering result is almost exactly fair after the initialization stage. However, in certain cases, there is a desire to obtain a relatively relaxed fair clustering outcome. Specifically, when using a fairness measure to assess the fairness of clustering results, it is expected that the value of the fairness measure exceeds a predetermined threshold, rather than strictly achieving exact fairness.
Our proposed method relaxes the fairness constraint under a fairness measure called balance. Balance was first introduced by Chierichetti et al. [20] and generalized by Bera et al. [25]. It is measured by the ratio of the different groups in each cluster. Suppose that $r(j,t) = |G_t \cap C_j| / |C_j|$ denotes the representation of group $G_t$ in cluster $C_j$, and $r(t) = |G_t| / n$ denotes the representation of group $G_t$ in the dataset. The balance of cluster $C_j$ is defined as:
$Balance(C_j) = \min_{t \le m} \left( \frac{r(j,t)}{r(t)}, \frac{r(t)}{r(j,t)} \right)$.  (8)
The balance of a clustering result $C$ is defined as:
$Balance = \min_{j \le k} Balance(C_j)$.  (9)
The value of the balance lies in the range from 0 to 1, with a higher value indicating a fairer clustering result. If the balance value of a clustering result is 1, it indicates that the clustering result is exactly fair.
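A direct implementation of Equations (8) and (9) is sketched below; labels are cluster assignments and groups are protected-group memberships, both given per point. The array-based interface is an assumption of this sketch, not the paper's released code.
```python
import numpy as np

def balance(labels, groups):
    """Balance of a clustering (Equations (8) and (9)); 1 means exactly fair."""
    labels, groups = np.asarray(labels), np.asarray(groups)
    r_t = {t: np.mean(groups == t) for t in np.unique(groups)}  # r(t)
    best = 1.0
    for j in np.unique(labels):
        in_j = groups[labels == j]
        for t, rt in r_t.items():
            rjt = np.mean(in_j == t)                            # r(j, t)
            if rjt == 0:
                return 0.0      # a group missing from a cluster gives balance 0
            best = min(best, rjt / rt, rt / rjt)
    return best
```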
To relax the fairness constraint, we introduce a parameter $\mu \in [0, 1]$. The balance value of a clustering result is bounded from below by $\mu$; that is, the balance value of each cluster should not be less than $\mu$. We employ a greedy search strategy based on moving an element from one cluster to another, similar to Hartigan's algorithm. Given a point $x_i \in C_i$ that is moved to $C_j$, the movement changes the balance values of $C_i$ and $C_j$. If the balance values of $C_i$ and $C_j$ remain greater than or equal to $\mu$, the movement is considered valid. We evaluate all valid movements and accept the one with the maximum decrease in the objective function value. We iteratively move points until there is no valid movement or until Equation (4) is less than zero for all valid movements.
By employing this greedy search strategy, we can gradually optimize the objective function while maintaining the required level of fairness. The relaxation stage is summarized in Algorithm 4:
Algorithm 4 Relaxation
Input: 
Dataset $X = \{x_1, x_2, \ldots, x_n\}$, clustering result $C = \{C_1, C_2, \ldots, C_k\}$, cluster number $k$.
Output: 
Clustering result.
  1:
repeat
  2:
    for each data point $x_i \in C_i$ do
  3:
        if $Balance(C_i \setminus \{x_i\}) \ge \mu$ then
  4:
          for each cluster $C_j$ do
  5:
               if $Balance(C_j \cup \{x_i\}) \ge \mu$ then
  6:
                Try to move $x_i$ to $C_j$ and calculate Equation (4);
  7:
               end if
  8:
          end for
  9:
          end if
  10:
          Move $x_i$ to the cluster with the maximum decrease found in Step 6;
   11:
          Update the centers of the two clusters that $x_i$ moves out of and into by Equation (5);
  12:
     end for
  13:
until there is no valid movement or Equation (4) is less than zero for all valid movements
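The validity test in Steps 3 and 5 of Algorithm 4 can be sketched as follows: a movement of point i to cluster dest is valid only if both affected clusters still have a per-cluster balance (Equation (8)) of at least μ after the move. The function names are ours, and labels and groups are assumed to be NumPy integer arrays.
```python
import numpy as np

def cluster_balance(labels, groups, j):
    """Balance of a single cluster j (Equation (8))."""
    in_j = groups[labels == j]
    if len(in_j) == 0:
        return 0.0
    best = 1.0
    for t in np.unique(groups):
        rt, rjt = np.mean(groups == t), np.mean(in_j == t)
        if rjt == 0:
            return 0.0
        best = min(best, rjt / rt, rt / rjt)
    return best

def movement_is_valid(labels, groups, i, dest, mu):
    """True if moving point i to cluster dest keeps both clusters at balance >= mu."""
    src = labels[i]
    if src == dest:
        return False
    trial = labels.copy()
    trial[i] = dest
    return cluster_balance(trial, groups, src) >= mu and \
           cluster_balance(trial, groups, dest) >= mu
```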

3.3. Improvement

In order to minimize the objective function value while maintaining fairness, a local search method based on interchanging two elements that belong to the same group but to different clusters is employed. Formally, given a set of $k$ clusters of $X$,
$C = \{C_1, \ldots, C_a, \ldots, C_b, \ldots, C_k\}$,  (10)
the interchange of points $x_t \in C_a \cap G_e$ and $x_l \in C_b \cap G_e$ produces a new solution $C' = \{C_1, \ldots, C_a \setminus \{x_t\} \cup \{x_l\}, \ldots, C_b \setminus \{x_l\} \cup \{x_t\}, \ldots, C_k\}$. The change in the objective function value can be calculated by the following equation:
$\Delta_{C \to C'} = \Delta_{C_a \to C_a'} + \Delta_{C_b \to C_b'} = \sum_{x \in C_a'} \| x - c_a' \|^2 - \sum_{x \in C_a} \| x - c_a \|^2 + \sum_{x \in C_b'} \| x - c_b' \|^2 - \sum_{x \in C_b} \| x - c_b \|^2$,  (11)
$c_a' = c_a + \frac{x_l - x_t}{|C_a|} \quad \text{and} \quad c_b' = c_b + \frac{x_t - x_l}{|C_b|}$,  (12)
where $C_a' = C_a \setminus \{x_t\} \cup \{x_l\}$, $C_b' = C_b \setminus \{x_l\} \cup \{x_t\}$, and $c_a'$ and $c_b'$ are the centers of $C_a'$ and $C_b'$, respectively.
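The sketch below evaluates Equation (11) by brute force: it recomputes the SSE of the two affected clusters before and after swapping points t and l, so a negative return value means the interchange improves the objective. In practice, the incremental center updates of Equation (12) avoid the full recomputation; this version is only meant to make the definition concrete, and the names are ours.
```python
import numpy as np

def swap_delta(X, labels, t, l):
    """Equation (11): change in SSE when points t and l are interchanged.
    labels is a NumPy array of cluster indices."""
    a, b = labels[t], labels[l]
    def sse(idx):
        pts = X[idx]
        return ((pts - pts.mean(axis=0)) ** 2).sum()
    in_a, in_b = np.where(labels == a)[0], np.where(labels == b)[0]
    old = sse(in_a) + sse(in_b)
    new_a = np.append(in_a[in_a != t], l)   # C_a' = C_a without x_t, plus x_l
    new_b = np.append(in_b[in_b != l], t)   # C_b' = C_b without x_l, plus x_t
    return sse(new_a) + sse(new_b) - old
```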
Our local search strategy evaluates the interchange of every pair of data points that belong to the same group and to different clusters. It accepts the first interchange that results in an improvement of the objective function. The improvement stage is described in Algorithm 5:
Algorithm 5 Improvement
Input: 
Dataset $X = \{x_1, x_2, \ldots, x_n\}$, clustering result $C = \{C_1, C_2, \ldots, C_k\}$, cluster number $k$.
Output: 
Clustering result.
  1:
repeat
  2:
    Found = False;
  3:
   for $t = 1, 2, \ldots, n-1$ do
  4:
        for $l = t+1, \ldots, n$ do
  5:
            if $x_t$ and $x_l$ belong to different clusters and to the same group, then
  6:
               Calculate Equation (11);
  7:
               if the value of Equation (11) is less than 0, then
  8:
                Interchange points $x_t$ and $x_l$;
  9:
                 Found = True;
  10:
              end if
  11:
          end if
  12:
      end for
  13:
       if Found = True, then
  14:
          break
  15:
       end if
  16:
    end for
  17:
until Found = False.
Finally, the overall procedure of the proposed method is summarized in Algorithm 6:
Algorithm 6 FFC
Input: 
Dataset $X = \{x_1, x_2, \ldots, x_n\}$, cluster number $k$.
Output: 
Clustering result.
  1:
Apply Algorithm 3 to generate an initial fair clustering result;
  2:
Apply Algorithm 4 to obtain a relatively relaxed fair clustering result;
  3:
Apply Algorithm 5 to improve the clustering quality while maintaining fairness, and the final clustering result is achieved.

3.4. Computational Complexity

The overall computational complexity of FFC can be evaluated as the sum of the computational complexities of its stages. In the initialization stage, the computational complexity of k-means is $O(n I_1 k d)$, where $I_1$ is the number of iterations of k-means. The computational complexities of the combination step and the refinement step are $O((k+1)! \, k d)$ and $O(k d + n d)$, respectively. The computational complexity of the initialization stage is therefore $O(k d (n I_1 + (k+1)!))$. The computational complexity of the relaxation stage is $O(I_2 n k d)$, where $I_2$ is the number of iterations of the relaxation stage. In the improvement stage, the computational complexity of the local search method is $O(I_3 n (d + n))$, where $I_3$ is the number of iterations of the improvement stage. In total, the computational complexity of our method is $O(k d (n I_1 + n I_2 + (k+1)!) + I_3 n (d + n))$.
The computational complexity of vanilla k-means is $O(n I k d)$, where $I$ is the number of iterations. It is lower than that of FFC since vanilla k-means does not account for fairness constraints. The computational complexity of fair spectral clustering [23] is $O((n - m + 1)^3)$. The computational complexity of the fair k-means proposed by Xu et al. [38] is $O(k (n^3 + n^2) \log(nh) + k^2 m^2 (m-1)(d + k))$, where $h$ denotes the cost in the minimum cost flow they employ. Additionally, fair hierarchical agglomerative clustering, introduced by Chhabra et al. [39], has a computational complexity of $O(m n^3)$.

4. Experiments

In this section, we empirically evaluate the proposed algorithm on both synthetic and real-world datasets in terms of two clustering quality measurements and two fairness metrics.

4.1. Datasets and Experimental Setups

Datasets. The performance of our proposed method is evaluated on five synthetic and five real-world datasets. Table 1 provides a descriptive summary of the synthetic and real-world datasets. The synthetic datasets, including Elliptical, DS-577, 2d-4c-no0, 2d-4c-no1, and 2d-4c-no4, are commonly used in clustering studies. The Elliptical dataset is obtained from [40]. The DS-577 dataset is obtained from [41]. The datasets 2d-4c-no0, 2d-4c-no1, and 2d-4c-no4 are taken from Dr. Julia Handl's work (https://personalpages.manchester.ac.uk/staff/Julia.Handl/data.tar.gz, accessed on 27 June 2023). Due to the absence of any explicit protected attributes, we take the ground-truth label as the sensitive protected attribute for the DS-577, 2d-4c-no0, 2d-4c-no1, and 2d-4c-no4 datasets. For the Elliptical dataset, we divide it into two classes and take the ground-truth label of the two classes as the sensitive protected attribute. The synthetic datasets are shown in Figure 2; the shape of each object identifies the group it belongs to [14,29]. The real-world datasets are publicly available datasets from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/, accessed on 27 June 2023) and have been used in previous fairness studies [25]. These datasets (Adult, Bank, Census1990, CreditCard, and Diabetic) cover diverse domains and exhibit various characteristics, including different feature dimensions and group structures.
Experimental Setups. Our proposed method is implemented in Python 3.9, and the source code is available on GitHub (https://github.com/rebpan/FFC, accessed on 27 June 2023). Experiments are performed on a laptop running Windows 10 64-bit with a 2.90 GHz AMD Ryzen 7 4800H CPU and 32 GB of RAM. Each dataset is standardized by rescaling each attribute to zero mean and unit variance. Then, L2 normalization is applied to each standardized dataset.
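The preprocessing described above can be sketched as follows (z-score standardization of every attribute followed by row-wise L2 normalization); this is our reading of the setup, not an excerpt from the released code.
```python
import numpy as np

def preprocess(X):
    """Standardize each attribute, then L2-normalize each data point."""
    X = np.asarray(X, dtype=float)
    mean, std = X.mean(axis=0), X.std(axis=0)
    std = np.where(std == 0, 1.0, std)                # guard against constant attributes
    X = (X - mean) / std                              # zero mean, unit variance
    norms = np.linalg.norm(X, axis=1, keepdims=True)  # L2 norm of each row
    return X / np.where(norms == 0, 1.0, norms)
```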
The baseline methods include vanilla k-means clustering (Lloyd), fair unnormalized spectral clustering (FSCUN) [23], fair normalized spectral clustering (FSCN) [23], fair algorithms for clustering (FAC) [25], and variational fair clustering (VFC) [27]. Publicly available implementations (FSCUN and FSCN: https://github.com/matthklein/fair_spectral_clustering; FAC: https://github.com/nicolasjulioflores/fair_algorithms_for_clustering; VFC: https://github.com/imtiazziko/Variational-Fair-Clustering, accessed on 27 June 2023) of these fair algorithms have been used.
FSCUN and FSCN treat the fairness constraints as linear constraints and incorporate them into the RatioCut objective. The fair version of spectral clustering is solved by computing the eigenvectors corresponding to the $k$ smallest eigenvalues of a new Laplacian matrix embedded in the fair subspace. Fair spectral clustering requires a weighted adjacency matrix $W$ as input; in our experiments, we construct $W$ as:
$w_{ij} = e^{-\frac{\| x_i - x_j \|^2}{2 \sigma}}, \quad \sigma = \max_{i,j} \| x_i - x_j \|^2$.
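A sketch of this construction is given below; setting the diagonal to zero (no self-loops) is an assumption of the sketch, and the dense pairwise computation is only suitable for dataset sizes like those used here.
```python
import numpy as np

def adjacency(X):
    """Gaussian-kernel weights w_ij = exp(-||x_i - x_j||^2 / (2*sigma)),
    with sigma set to the maximum squared pairwise distance."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # squared distances
    sigma = sq.max()
    W = np.exp(-sq / (2.0 * sigma))
    np.fill_diagonal(W, 0.0)   # no self-loops (an assumption of this sketch)
    return W
```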
FAC employs a rounding approach based on linear programming to solve a fair-assignment problem, which improves the fairness after vanilla k-means clustering. FAC requires two parameters: the lower bound β and the upper bound α . The values of the two parameters are calculated as follows:
$\beta_l = \frac{|G_l| (1 - \delta)}{n}, \quad \alpha_l = \frac{|G_l|}{n (1 - \delta)}$,
where the value of $\delta$ is set to 0.2, as reported in [25].
VFC describes the fairness objective as the KL divergence between the demographic proportions in the dataset and the distributions of each cluster. The VFC objective is a trade-off between a clustering objective (e.g., k-medians, k-means, and NCut) and the fairness objective. The authors derived a convex–concave formulation for the overall objective and optimized it. We choose the best trade-off between the k-means clustering objective and fairness error.

4.2. Evaluation Metrics

In our experiments, both clustering quality and fairness are concerned. The metrics of clustering quality include:
  • SSE: The SSE measures how much objects deviate from the centers of the clusters they are assigned to. The smaller the clustering cost, the better the clustering result.
  • Dunn index (DI): The DI quantifies the ratio of the smallest distance between points in different clusters to the largest distance between points within a cluster [42]. It is defined as follows:
    $DI = \frac{\min_{1 \le i \le k} \min_{1 \le j \le k, j \ne i} \min_{a \in C_i, b \in C_j} \| a - b \|}{\max_{1 \le l \le k} \max_{a, b \in C_l} \| a - b \|}$.
    The DI takes a value between zero and infinity and should be maximized.
The metrics of clustering fairness include:
  • Balance: The balance measures the disparity in the distribution of a protected attribute across clusters and a dataset. The higher the value, the fairer the clustering result.
  • Average Wasserstein Distance between Probability Vectors (AWD): The AWD measures the unfairness of the clustering result [16]; a minimal computation sketch is given after this list. Let $P_i \in \mathbb{R}^{m \times 1}$ denote the distribution of groups in $C_i$, and let $P_X \in \mathbb{R}^{m \times 1}$ denote the distribution of groups in $X$. The AWD of a clustering is defined as:
    $AWD = \frac{\sum_{i=1}^{k} |C_i| \times WD(P_i, P_X)}{n}$,
    where $WD(\cdot, \cdot)$ is the Wasserstein distance. The smaller the value of the AWD, the fairer the clustering result.
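The AWD can be sketched with scipy.stats.wasserstein_distance as below; treating the group indices 0..m-1 as the ordered support of the two distributions is an assumption of this sketch, since the ground metric over groups is not spelled out above.
```python
import numpy as np
from scipy.stats import wasserstein_distance

def awd(labels, groups):
    """Average Wasserstein Distance between each cluster's group distribution
    and the dataset's group distribution; smaller means fairer."""
    labels, groups = np.asarray(labels), np.asarray(groups)
    n = len(labels)
    group_ids = np.unique(groups)
    support = np.arange(len(group_ids))                        # groups as 0..m-1
    p_x = np.array([np.mean(groups == t) for t in group_ids])  # P_X
    total = 0.0
    for j in np.unique(labels):
        in_j = groups[labels == j]
        p_j = np.array([np.mean(in_j == t) for t in group_ids])  # P_i
        total += len(in_j) * wasserstein_distance(support, support, p_j, p_x)
    return total / n
```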

4.3. Fair Clustering Results Comparison

We relax the fairness constraint so that the balance value of our proposed method is slightly higher than or equal to the second highest. The clustering results of different algorithms in terms of clustering quality and fairness evaluation are shown in Table 2 and Table 3, where the results of our proposed method and the method with the second highest balance value are highlighted as the underlined items. From Table 2, one can find that the proposed FFC outperforms all other methods in terms of the balance metric. FFC achieves the lowest value for the AWD on four datasets. It is noticeable that FSCUN has a lower value for the AWD than FFC on datasets 2d-4c-no0 and 2d-4c-no1. On these two datasets, the balance value of FSCUN is 0 because it groups almost all data points into one cluster, as shown in Figure 3, which we did not expect.
Table 3 reports the comparative results for the clustering quality. The inclusion of fairness consideration in clustering algorithms can negatively impact their clustering quality. Generally, if one algorithm is more focused on fairness (i.e., has a higher value for the balance or a lower value for the AWD) compared to another, it is likely to result in lower clustering quality (i.e., has a lower value for the DI or a higher value for the SSE). Our proposed method performs better than other fair clustering algorithms in terms of fairness metrics, so it can be expected to perform poorly on clustering quality metrics.
It is worth noting the comparison between our proposed method and the methods with the second highest balance value in terms of the SSE. Among the 10 datasets, compared to the methods with the second highest balance value, FFC achieves the same SSE value on one dataset and lower SSE values on six datasets. On Elliptical, FFC and FSCUN share the highest balance, and they also have the same SSE value. On DS-577, 2d-4c-no4, and Adult, FFC has higher SSE values than the methods with the second highest balance value. On 2d-4c-no0, 2d-4c-no1, Bank, Census1990, CreditCard, and Diabetic, FFC obtains lower SSE values than the methods with the second highest balance value. Therefore, FFC has a better ability to search for solutions that have a high balance value and a low SSE value. Figure 3 depicts the clustering results on the synthetic datasets. FFC assigns similar points belonging to the same group to the same cluster, which makes the clustering results meaningful.

4.4. Clustering Cost with Respect to k

We evaluate the clustering objectives and balance of Lloyd, FAC, and FFC with different numbers of clusters. Here, we relax the fairness constraint by ensuring that the balance value of FFC is slightly higher than or equal to the balance value of FAC for different numbers of clusters. Figure 4 shows the performance trends of these methods on Census with respect to the number of clusters. As the number of clusters increases, we have several observations. (1) The SSE values of these methods decrease. (2) The gap between the SSE values of Lloyd and FFC increases. (3) The balance value of Lloyd is reduced to zero. (4) Although the balance values for FFC and FAC are nearly equal, the SSE value for FFC is always lower than that of FAC, and the gap between the two increases. These observations are consistent with those in previous works [25,27,38].

5. Conclusions

In this paper, we incorporate fairness constraints into k-means clustering and propose FFC. The scheme of our method is to handle fairness before improving clustering quality, which is different from the existing post-processing-based work. We design an initialization stage that can handle datasets with a multi-valued protected attribute to produce an initial fair solution. Moreover, a relaxation stage and an improvement stage are presented to improve clustering quality while relaxing and maintaining fairness, respectively. Experimental results on synthetic and real-world datasets show that our proposed method outperforms some other fair clustering methods in terms of fairness metrics, while the loss of clustering quality remains acceptable; on several datasets, the clustering quality is even better.
We achieved fairness in k-means clustering but sacrificed some clustering quality. Even though the loss of clustering quality is acceptable, finding a way to minimize the cost of fairness is our future work. Since a local search method is employed, FFC does not guarantee that its output is the global optimum. In the future, we will look for a strategy that can search for a near-global or even the global optimum.

Author Contributions

Conceptualization, R.P. and C.Z.; methodology, R.P. and C.Z.; software, R.P.; writing—original draft preparation, R.P.; writing—review and editing, C.Z.; funding acquisition, C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grant number 62172242 and the Science and Technology Innovation 2025 Major Project of Ningbo under grant number 20211ZDYF020159.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this work are publicly available. The code is available online in our Github repository at https://github.com/rebpan/FFC (accessed on 27 June 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Xu, R.; Wunsch, D.C., II. Survey of clustering algorithms. IEEE Trans. Neural Netw. 2005, 16, 645–678. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  2. Shi, J.; Malik, J. Normalized Cuts and Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 888–905. [Google Scholar] [CrossRef] [Green Version]
  3. Ng, A.Y.; Jordan, M.I.; Weiss, Y. On Spectral Clustering: Analysis and an algorithm. In Proceedings of the Advances in Neural Information Processing Systems 14, Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, Vancouver, BC, Canada, 3–8 December 2001; Dietterich, T.G., Becker, S., Ghahramani, Z., Eds.; MIT Press: Cambridge, MA, USA, 2001; pp. 849–856. [Google Scholar]
  4. Schubert, E.; Sander, J.; Ester, M.; Kriegel, H.; Xu, X. DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN. ACM Trans. Database Syst. 2017, 42, 19:1–19:21. [Google Scholar] [CrossRef]
  5. Zhong, C.; Yue, X.; Zhang, Z.; Lei, J. A clustering ensemble: Two-level-refined co-association matrix with path-based transformation. Pattern Recognit. 2015, 48, 2699–2709. [Google Scholar] [CrossRef]
  6. Li, Q.; Xie, Z.; Wang, L. Robust Subspace Clustering with Block Diagonal Representation for Noisy Image Datasets. Electronics 2023, 12, 1249. [Google Scholar] [CrossRef]
  7. Wang, E.; Lee, H.; Do, K.; Lee, M.; Chung, S. Recommendation of Music Based on DASS-21 (Depression, Anxiety, Stress Scales) Using Fuzzy Clustering. Electronics 2023, 12, 168. [Google Scholar] [CrossRef]
  8. Yin, L.; Li, M.; Chen, H.; Deng, W. An Improved Hierarchical Clustering Algorithm Based on the Idea of Population Reproduction and Fusion. Electronics 2022, 11, 2735. [Google Scholar] [CrossRef]
  9. Li, X.; Song, K.; Wei, G.; Lu, R.; Zhu, C. A Novel Grouping Method for Lithium Iron Phosphate Batteries Based on a Fractional Joint Kalman Filter and a New Modified K-Means Clustering Algorithm. Energies 2015, 8, 7703–7728. [Google Scholar] [CrossRef] [Green Version]
  10. Jinlei, S.; Wei, L.; Chuanyu, T.; Tianru, W.; Tao, J.; Yong, T. A Novel Active Equalization Method for Series-Connected Battery Packs Based on Clustering Analysis With Genetic Algorithm. IEEE Trans. Power Electron. 2021, 36, 7853–7865. [Google Scholar] [CrossRef]
  11. Lei, J.C.; Hou, C.J.; Huo, D.Q.; Luo, X.G.; Bao, M.Z.; Li, X.; Yang, M.; Fa, H.B. A novel device based on a fluorescent cross-responsive sensor array for detecting lung cancer related volatile organic compounds. Rev. Sci. Instruments 2015, 86, 025106. [Google Scholar] [CrossRef]
  12. Hwang, J.; Kim, H. Variational Deep Clustering of Wafer Map Patterns. IEEE Trans. Semicond. Manuf. 2020, 33, 466–475. [Google Scholar] [CrossRef]
  13. Chhabra, A.; Masalkovaite, K.; Mohapatra, P. An Overview of Fairness in Clustering. IEEE Access 2021, 9, 130698–130720. [Google Scholar] [CrossRef]
  14. Zeng, P.; Li, Y.; Hu, P.; Peng, D.; Lv, J.; Peng, X. Deep Fair Clustering via Maximizing and Minimizing Mutual Information: Theory, Algorithm and Metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 23986–23995. [Google Scholar]
  15. Naga, P.; Marri, S.D.; Borreo, R. Facial emotion recognition methods, datasets and technologies: A literature survey. Mater. Today Proc. 2023, 80, 2824–2828. [Google Scholar] [CrossRef]
  16. Wang, B.; Davidson, I. Towards Fair Deep Clustering With Multi-State Protected Variables. arXiv 2019, arXiv:1901.10053. [Google Scholar]
  17. Dai, Z.; Makarychev, Y.; Vakilian, A. Fair Representation Clustering with Several Protected Classes. In Proceedings of the FAccT’22: 2022 ACM Conference on Fairness, Accountability, and Transparency, Seoul, Republic of Korea, 21–24 June 2022; pp. 814–823. [Google Scholar] [CrossRef]
  18. Abraham, S.S.; P, D.; Sundaram, S.S. Fairness in Clustering with Multiple Sensitive Attributes. In Proceedings of the 23rd International Conference on Extending Database Technology, EDBT 2020, Copenhagen, Denmark, 30 March–2 April 2020; OpenProceedings.org. 2020; pp. 287–298. [Google Scholar] [CrossRef]
  19. Böhm, M.; Fazzone, A.; Leonardi, S.; Schwiegelshohn, C. Fair Clustering with Multiple Colors. arXiv 2020, arXiv:2002.07892. [Google Scholar]
  20. Chierichetti, F.; Kumar, R.; Lattanzi, S.; Vassilvitskii, S. Fair Clustering Through Fairlets. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5029–5037. [Google Scholar]
  21. Backurs, A.; Indyk, P.; Onak, K.; Schieber, B.; Vakilian, A.; Wagner, T. Scalable Fair Clustering. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, Long Beach, CA, USA, 9–15 June 2019; Proc. Mach. Learn. Res. 2019; Volume 97, pp. 405–413. [Google Scholar]
  22. Ahmadian, S.; Epasto, A.; Knittel, M.; Kumar, R.; Mahdian, M.; Moseley, B.; Pham, P.; Vassilvitskii, S.; Wang, Y. Fair Hierarchical Clustering. In Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, Virtual, 6–12 December 2020. [Google Scholar]
  23. Kleindessner, M.; Samadi, S.; Awasthi, P.; Morgenstern, J. Guarantees for Spectral Clustering with Fairness Constraints. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, Long Beach, CA, USA, 9–15 June 2019; Proc. Mach. Learn. Res. 2019; Volume 97, pp. 3458–3467. [Google Scholar]
  24. Liu, S.; Vicente, L.N. A Stochastic Alternating Balance k-Means Algorithm for Fair Clustering. In Proceedings of the Learning and Intelligent Optimization—16th International Conference, LION 2022, Milos Island, Greece, 5–10 June 2022; Revised Selected Papers. Simos, D.E., Rasskazova, V., Archetti, F., Kotsireas, I.S., Pardalos, P.M., Eds.; Springer: Berlin/Heidelberg, Germany, 2022; Volume 13621, pp. 77–92. [Google Scholar] [CrossRef]
  25. Bera, S.K.; Chakrabarty, D.; Flores, N.; Negahbani, M. Fair Algorithms for Clustering. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019; pp. 4955–4966. [Google Scholar]
  26. Rösner, C.; Schmidt, M. Privacy Preserving Clustering with Constraints. In Proceedings of the 45th International Colloquium on Automata, Languages, and Programming, ICALP 2018, Prague, Czech Republic, 9–13 July 2018; Chatzigiannakis, I., Kaklamanis, C., Marx, D., Sannella, D., Eds.; Schloss Dagstuhl—Leibniz-Zentrum für Informatik: Wadern, Germany, 2018; Volume 107, pp. 96:1–96:14. [Google Scholar] [CrossRef]
  27. Ziko, I.M.; Yuan, J.; Granger, E.; Ayed, I.B. Variational Fair Clustering. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual, 2–9 February 2021; pp. 11202–11209. [Google Scholar]
  28. Esmaeili, S.A.; Brubach, B.; Srinivasan, A.; Dickerson, J. Fair Clustering Under a Bounded Cost. In Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, Virtual, 6–14 December 2021; pp. 14345–14357. [Google Scholar]
  29. Li, P.; Zhao, H.; Liu, H. Deep Fair Clustering for Visual Learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; Computer Vision Foundation/IEEE. pp. 9067–9076. [Google Scholar] [CrossRef]
  30. Chhabra, A.; Li, P.; Mohapatra, P.; Liu, H. Robust Fair Clustering: A Novel Fairness Attack and Defense Framework. arXiv 2022, arXiv:2210.01953. [Google Scholar]
  31. Nghiem, N.; Vrain, C.; Dao, T.; Davidson, I. Constrained Clustering via Post-processing. In Proceedings of the Discovery Science—23rd International Conference, DS 2020, Thessaloniki, Greece, 19–21 October 2020; Appice, A., Tsoumakas, G., Manolopoulos, Y., Matwin, S., Eds.; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12323, pp. 53–67. [Google Scholar] [CrossRef]
  32. Davidson, I.; Ravi, S.S. Making Existing Clusterings Fairer: Algorithms, Complexity Results and Insights. In Proceedings of the The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, 7–12 February 2020; pp. 3733–3740. [Google Scholar]
  33. Aloise, D.; Deshpande, A.; Hansen, P.; Popat, P. NP-hardness of Euclidean sum-of-squares clustering. Mach. Learn. 2009, 75, 245–248. [Google Scholar] [CrossRef] [Green Version]
  34. Lloyd, S.P. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–136. [Google Scholar] [CrossRef] [Green Version]
  35. Ostrovsky, R.; Rabani, Y.; Schulman, L.J.; Swamy, C. The Effectiveness of Lloyd-Type Methods for the k-Means Problem. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), Berkeley, CA, USA, 21–24 October 2006; Proceedings. IEEE Computer Society. pp. 165–176. [Google Scholar] [CrossRef] [Green Version]
  36. Hartigan, J.A.; Wong, M.A. Algorithm AS 136: A k-means clustering algorithm. J. R. Stat. Soc. Ser. C 1979, 28, 100–108. [Google Scholar] [CrossRef]
  37. Bercea, I.O.; Groß, M.; Khuller, S.; Kumar, A.; Rösner, C.; Schmidt, D.R.; Schmidt, M. On the Cost of Essentially Fair Clusterings. In Proceedings of the Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2019, Massachusetts Institute of Technology, Cambridge, MA, USA, 20–22 September 2019; Achlioptas, D., Végh, L.A., Eds.; Schloss Dagstuhl—Leibniz-Zentrum für Informatik: Wadern, Germany, 2019; Volume 145, pp. 18:1–18:22. [Google Scholar] [CrossRef]
  38. Xu, W.; Hu, J.; Du, S.; Yang, Y. K-Means Clustering with Fairness Constraints. In Proceedings of the 16th International Conference on Intelligent Systems and Knowledge Engineering, ISKE 2021, Chengdu, China, 26–28 November 2021; Chen, S., Hu, J., Li, T., Martínez, L., Liu, J., Eds.; IEEE: Piscataway, NJ, USA, 2021; pp. 215–222. [Google Scholar] [CrossRef]
  39. Chhabra, A.; Mohapatra, P. Fair Algorithms for Hierarchical Agglomerative Clustering. In Proceedings of the 21st IEEE International Conference on Machine Learning and Applications, ICMLA 2022, Nassau, Bahamas, 12–14 December 2022; Wani, M.A., Kantardzic, M.M., Palade, V., Neagu, D., Yang, L., Chan, K.Y., Eds.; IEEE: Piscataway, NJ, USA, 2022; pp. 206–211. [Google Scholar] [CrossRef]
  40. Bandyopadhyay, S.; Maulik, U. Nonparametric genetic clustering: Comparison of validity indices. IEEE Trans. Syst. Man Cybern. Syst. 2001, 31, 120–125. [Google Scholar] [CrossRef] [Green Version]
  41. Su, M.C.; Chou, C.H.; Hsieh, C.C. Fuzzy C-means algorithm with a point symmetry distance. Int. J. Fuzzy Syst. 2005, 7, 175–181. [Google Scholar]
  42. Maulik, U.; Bandyopadhyay, S. Performance Evaluation of Some Clustering Algorithms and Validity Indices. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 1650–1654. [Google Scholar] [CrossRef] [Green Version]
Figure 1. The scheme of the initialization. In each subfigure, shapes represent groups and colors represent clusters. (a) A given dataset. Points with the same color belong to the same group. (b) Each group is partitioned by k-means. (c) Combine $C^1$ and $C^2$. (d) Refine the fairness of the clusters in $C^1$ of (c). (e) Combine $C^1$ and $C^3$. (f) Refine the fairness of the clusters of (e).
Figure 2. Five synthetic datasets. (a) Elliptical. (b) DS-577. (c) 2d-4c-no0. (d) 2d-4c-no1. (e) 2d-4c-no4.
Figure 3. Clustering results on the synthetic datasets. The clustering result is represented by different colors and shapes.
Figure 4. The effect of k on SSE and balance for Lloyd, FAC, and FFC. The results are shown for the Census dataset.
Table 1. Dataset summaries.
Dataset       Instances   Features   Groups   Proportions
Elliptical    500         2          2        [0.5000, 0.5000]
DS-577        577         2          3        [0.3397, 0.3484, 0.3120]
2d-4c-no0     1572        2          4        [0.3359, 0.2214, 0.1730, 0.2697]
2d-4c-no1     1623        2          4        [0.3315, 0.1183, 0.3321, 0.2181]
2d-4c-no4     863         2          4        [0.4878, 0.0997, 0.3407, 0.0718]
Adult         3000        5          2        [0.6693, 0.3307]
Bank          4000        6          3        [0.1123, 0.2812, 0.6065]
Census1990    3000        25         2        [0.4847, 0.5153]
CreditCard    3000        14         2        [0.3963, 0.6037]
Diabetic      10,000      2          2        [0.4624, 0.5376]
Table 2. Clustering fairness.
Balance ↑
Dataset              Lloyd    FSCUN    FSCN     FAC      VFC      FFC
Elliptical (k = 2)   0        0.8878   0.8835   0.8000   0.8835   0.8878
DS-577 (k = 3)       0        0        0        0.7988   0.5427   0.8005
2d-4c-no0 (k = 4)    0        0        0        0.7719   0.0201   0.7722
2d-4c-no1 (k = 4)    0        0        0        0.7780   0        0.7785
2d-4c-no4 (k = 4)    0        0        0        0.7369   0.3045   0.7369
Adult (k = 5)        0.6345   0.6755   0.6618   0.7978   0.9405   0.9405
Bank (k = 6)         0.2929   0.3368   0.5306   0.7958   0.8076   0.8079
Census1990 (k = 5)   0.0039   0.6964   0.7418   0.7984   0.9119   0.9127
CreditCard (k = 5)   0.8910   0        0.9259   0.9011   0.9037   0.9266
Diabetic (k = 10)    0.8376   0        0.8239   0.8239   0.8728   0.8731

AWD ↓
Dataset              Lloyd    FSCUN    FSCN     FAC      VFC      FFC
Elliptical (k = 2)   0.4940   0.0460   0.0480   0.1000   0.0480   0.0460
DS-577 (k = 3)       0.4183   0.0995   0.1496   0.0462   0.1618   0.0381
2d-4c-no0 (k = 4)    0.3127   0.0049   0.0649   0.0274   0.2226   0.0291
2d-4c-no1 (k = 4)    0.2705   0.0012   0.0701   0.0427   0.2341   0.0409
2d-4c-no4 (k = 4)    0.1723   0.0759   0.0799   0.0459   0.1117   0.0296
Adult (k = 5)        0.0723   0.0402   0.0405   0.0429   0.0106   0.0143
Bank (k = 6)         0.0691   0.0357   0.0358   0.0334   0.0251   0.0360
Census1990 (k = 5)   0.2429   0.0511   0.0470   0.0554   0.0174   0.0218
CreditCard (k = 5)   0.0193   0.0074   0.0128   0.0177   0.0173   0.0138
Diabetic (k = 10)    0.0357   0.0235   0.0285   0.0289   0.0243   0.0330
Table 3. Clustering quality.
DI ↑
Dataset              Lloyd    FSCUN    FSCN     FAC      VFC      FFC
Elliptical (k = 2)   0.0644   0.0118   0.0716   0.0010   0.0716   0.0118
DS-577 (k = 3)       0.0079   0.0005   0.0068   0.0005   0.0005   0.0005
2d-4c-no0 (k = 4)    0.0076   0.0032   0.0002   0.0002   0.0002   0
2d-4c-no1 (k = 4)    0.0017   0.0112   0.0001   0        0        0
2d-4c-no4 (k = 4)    0.0057   0.0002   0.0001   0.0001   0.0001   0.0001
Adult (k = 5)        0.0085   0.0023   0.0150   0.0148   0.0106   0.0091
Bank (k = 6)         0.0190   0        0.0018   0        0.0013   0
Census1990 (k = 5)   0.0663   0.0580   0.0258   0.0859   0.0527   0.0458
CreditCard (k = 5)   0.0288   0.0472   0.0184   0.0288   0.0288   0.0264
Diabetic (k = 10)    0.0406   0        0        0.0374   0        0

SSE ↓
Dataset              Lloyd           FSCUN           FSCN            FAC             VFC             FFC
Elliptical (k = 2)   206.2982        344.4216        343.9640        446.1201        343.9640        344.4216
DS-577 (k = 3)       71.0134         449.0292        361.4299        501.0075        377.3101        507.5545
2d-4c-no0 (k = 4)    114.4834        1.5283 × 10³    1.3558 × 10³    1.4101 × 10³    419.5988        1.3851 × 10³
2d-4c-no1 (k = 4)    82.3122         1.6150 × 10³    1.2718 × 10³    1.4734 × 10³    287.2313        1.4632 × 10³
2d-4c-no4 (k = 4)    104.0023        705.9991        666.3117        676.0752        420.1287        680.7263
Adult (k = 5)        1.3426 × 10³    1.3529 × 10³    1.3470 × 10³    1.3443 × 10³    1.3611 × 10³    1.3672 × 10³
Bank (k = 6)         1.23134 × 10³   1.7858 × 10³    1.2543 × 10³    1.2533 × 10³    1.2774 × 10³    1.2714 × 10³
Census1990 (k = 5)   1.7604 × 10³    1.8207 × 10³    1.8219 × 10³    1.8405 × 10³    1.8807 × 10³    1.8216 × 10³
CreditCard (k = 5)   1.0805 × 10³    1.8202 × 10³    1.1119 × 10³    1.0805 × 10³    1.0805 × 10³    1.0806 × 10³
Diabetic (k = 10)    243.2634        3.2616 × 10³    235.2583        231.9906        254.8696        241.3612