Article

SoFL: Clustered Federated Learning Based on Dual Clustering for Heterogeneous Data

School of Computer Science and Technology, Changchun University of Science and Technology, Changchun 130022, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(18), 3682; https://doi.org/10.3390/electronics13183682
Submission received: 25 July 2024 / Revised: 12 September 2024 / Accepted: 14 September 2024 / Published: 16 September 2024
(This article belongs to the Special Issue Advances in Cloud Computing and IoT Systems)

Abstract

Federated Learning (FL) is an emerging privacy-preserving technology that enables training a global model beneficial to all participants without sharing their data. However, differences in data distributions among participants may undermine the stability and accuracy of the global model. To address this challenge, recent research proposes client clustering based on data distribution similarity, generating independent models for each cluster in order to enhance FL performance. Nevertheless, due to the uncertainty of participant identities, FL struggles to rapidly and accurately determine the clusters. Most existing algorithms distinguish clients by iterative clustering, which not only increases the computing cost of the server but also affects the convergence speed of the federated model. To address these shortcomings, in this paper, we propose a novel clustering-based FL method, SoFL. SoFL introduces SOM networks, improves the quality of cluster data, and eliminates redundant categories through secondary clustering, encouraging more similar clients to train together. Through this mechanism, SoFL completes the clustering task in one round of communication and speeds up the convergence of federated model training. Simulation results demonstrate that SoFL determines the clusters accurately and swiftly. In different non-IID settings, SoFL’s model accuracy improvements ranged from 9% to 18% compared to FedAvg and FedProx.

1. Introduction

With the continuous development and proliferation of Internet of Things (IoT) technology, a large number of network edge devices such as smartphones and mobile terminals are widely used in various scenarios. The data generated by IoT is loosely stored in these edge devices, posing new challenges for the application of Artificial Intelligence (AI) on IoT data. AI applications, represented by machine learning, require large quantities of high-quality, centrally stored sample datasets. However, due to evolving regulations and user concerns about privacy, acquiring highly sensitive data from IoT devices has become increasingly difficult. Meanwhile, organizations such as enterprises, governments, and research institutions face the dilemma between data protection and improving the quality of modeling datasets. These issues have significantly impacted traditional centralized machine-learning training and AI applications. To address the challenge of data silos in model training while protecting user privacy and data security, a novel machine-learning training paradigm, Federated Learning (FL) [1], has been proposed. The core idea of FL is that participants do not share data; instead, they conduct local model training on their respective datasets. The global federated model $M_{Global}$ is then generated by sharing model parameters or intermediate results. During the entire process, the server does not have any access to the client’s local data. In this way, federated learning effectively protects client data privacy.
In federated learning, participants independently collect and locally store their data, which can vary in size and distribution. When there are significant differences between the datasets of different participants, the convergence speed and performance of the federated model will rapidly decline due to the influence of each client model. For instance, the federated model might overlook valuable model parameters present in smaller datasets, leading to a situation where the minority is underrepresented by the majority. This problem of non-Independently Identically Distributed (non-IID) datasets due to clients not sharing data is called data heterogeneity. Data heterogeneity represents one of the primary challenges in federated learning [2]. Several non-IID scenarios that are widespread in real-world applications are summarized below.
  • Feature distribution skew: The marginal distribution $P_i(x)$ may vary from client to client, even if $P_i(y|x)$ is the same.
  • Label distribution skew: The marginal distribution $P_i(y)$ may vary across clients, even if $P_i(x|y)$ is the same.
  • Same label, different features: Even if $P(y)$ is shared, the conditional distribution $P_i(x|y)$ may vary from client to client.
  • Same features, different label: The conditional distribution $P_i(y|x)$ may vary from client to client, even if $P(x)$ is the same.
  • Quantity skew or unbalancedness: Different clients may save different amounts of data.
Some researchers have focused on training a single global model that performs well across all local distributions [3,4,5]. However, due to the impact of non-IID, a converged global model may not necessarily exhibit equally good performance on clients.
To address the impact of non-IID data on federated-learning methods, researchers have proposed the concept of clustering-based federated learning. Specifically, this method assumes that there are distributional similarities among the data on client devices; according to this similarity, clients are grouped into the same client group for training, which avoids interference from clients with large differences and lets the federated model focus on the cluster’s learning task. This strategy typically leverages the following client data distribution characteristics: (a) significant differences in data distribution between clusters, leading to degraded performance of the global model in joint training; (b) minimal differences in data distribution within clusters, potentially approximating IID data. Therefore, it is assumed that clients with similar data distributions converge towards similar optimization directions, which is a crucial premise for successfully segregating clients in clustering-based federated learning. However, existing clustering methods often rely on prior knowledge of the number of clusters [6,7] or involve iterative clustering separations [8], which are challenging to implement in federated learning.
In this paper, we propose a novel federated-learning algorithm, Self-organizing Federated Learning (SoFL). SoFL groups clients by double clustering and performs federated learning based on multiple clusters. SoFL does not require a predetermined number of clusters; instead, it dynamically determines them based on the structure of a Self-Organizing Map (SOM) network. Furthermore, SoFL improves the clarity of cluster identification and enhances the efficiency of federated learning by reclassifying clients using SOM node parameters. The main contributions of this paper are as follows:
(1)
We designed a dual clustering method based on SOM networks that does not require a priori knowledge, such as the number of clusters, and can determine the number of clusters more accurately and adaptively during training, thereby improving the quality of clusters.
(2)
We decompose clustered federated learning into two stages and give the corresponding objective functions, which makes it possible to combine the framework with other FL algorithms and increases the flexibility of the clustered federated-learning framework.
(3)
We propose a new multi-center aggregated federated-learning algorithm, SoFL, to address the challenges posed by non-IID data in federated learning. SoFL identifies clients with the same or similar data distribution through client model parameters and trains them in clusters to avoid interference from dissimilar clients. By introducing SOM networks, SoFL implements a dual clustering mechanism, from clients to SOM nodes and from SOM nodes to node clusters, so that the clustering process is completed in one round of communication. In this way, SoFL not only improves the accuracy of federated models but also converges faster than CFL and other algorithms. Because the clustering process is completed in a single round of communication, SoFL increases neither the clients’ computational burden nor the communication overhead.
(4)
We conduct experiments to evaluate the performance of SoFL. We took the public datasets MNIST and CIFAR-10 and extended them to simulate FL application scenarios. The experimental results show that the accuracy of SoFL is better than that of several single-model algorithms, and its convergence speed is higher than that of a typical clustered federated-learning algorithm.

2. Related Work

With increasing concerns over privacy protection and data silos, federated learning—a machine-learning framework designed to safeguard privacy—has garnered significant interest among researchers. Numerous studies have explored aspects such as communication efficiency, privacy preservation, fairness, system challenges, and resistance to attacks. Among these, data heterogeneity has consistently been identified as a primary challenge hindering the development of federated learning. McMahan et al. proposed the FedAvg algorithm [1], which has gained widespread application and recognition as the first federated-learning algorithm. The FedAvg algorithm can be divided into three steps. Firstly, in the t-th round of iteration, the server randomly selects K clients as participants and sends the model parameters $W_t$. Secondly, client $C_k$ ($k \in K$) performs local model training using its own local data and sends the trained parameters back to the server. Finally, the server aggregates the received model parameters and sends the aggregated model $W_{t+1}$ back to the participants as the global model for the (t + 1)-th round. Xiang Li et al. [9] focused on the convergence issues of FedAvg when handling non-IID data, theoretically proving its effectiveness. T. Li et al. [3] proposed the FedProx algorithm based on FedAvg. FedProx adds a proximal term to the client’s loss function to reduce the deviation between local and global model updates, thereby alleviating data heterogeneity issues. Karimireddy et al. [4] introduced server control variables $c$ and client control variables $c_i$, correcting client drift by adding $c - c_i$ as a correction term in local gradient updates. Jianyu Wang et al. [5] normalized and scaled local updates from each participant before updating the global model to reduce the target inconsistency caused by data heterogeneity while maintaining fast error convergence. L. Gao et al. [10] reduced the impact of local drift on global objectives by introducing drift variables to adjust the weight differences between local and global models. Yue Zhao et al. [11] improved the convergence rate and accuracy of federated learning by sharing a subset of data among clients.
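To make the three steps concrete, the following minimal sketch shows one FedAvg communication round in Python with NumPy; it is an illustration rather than the original implementation, and the local_train callable is a hypothetical stand-in for any client-side optimizer.

import numpy as np

def fedavg_round(global_w, clients, local_train, k=10, seed=0):
    """One FedAvg communication round (illustrative sketch, not the paper's code).

    global_w    : 1-D parameter vector W_t
    clients     : list of (local_data, n_samples) pairs
    local_train : hypothetical callable (w, local_data) -> locally trained w
    """
    rng = np.random.default_rng(seed)
    selected = rng.choice(len(clients), size=min(k, len(clients)), replace=False)
    updates, sizes = [], []
    for idx in selected:
        local_data, n_i = clients[idx]
        updates.append(local_train(global_w.copy(), local_data))  # client-side training
        sizes.append(n_i)
    weights = np.asarray(sizes, dtype=float)
    weights /= weights.sum()                       # p_i proportional to local data size
    # Server aggregation: W_{t+1} = sum_i p_i * w_i
    return sum(p * w for p, w in zip(weights, updates))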
However, due to the complexity of data distribution, the limitations of traditional federated learning focused on training a single model have become increasingly evident. In recent years, research has gradually shifted towards personalized federated learning. A. Fallah et al. [12], based on Model-Agnostic Meta-Learning (MAML), proposed the Per-FedAvg algorithm. Per-FedAvg aims to train an initial global model shared by all clients. Current or new clients can perform several steps of gradient descent on this model with their local dataset, quickly obtaining a model adapted to their local data. V. Smith et al. [13] combined multitask learning with federated learning, treating each client in federated learning as a task in multitask learning, and proposed the MOCHA algorithm. M.G. Arivazhagan et al. [14] proposed a method of dividing the training model into base and personalized layers, using FedAvg or its variants to train the base layers, and performing personalized layer training locally to enhance model performance.
A natural idea is to group clients, and then each group jointly trains a global model applicable to that group. Sattler et al. [8] proposed the Clustered Federated-Learning (CFL) algorithm. The CFL algorithm employs optimal biclassification to achieve the clustering task by separating client clusters multiple times. However, when many distinct data distributions exist, bipartition-based CFL requires multiple rounds of splitting to completely separate all clients, which can make the algorithm inefficient. R. Mishra et al. [15] clustered clients by collecting information about their processing speed, data transmission rate, and available memory, which alleviated training delays caused by communication issues. However, they did not consider how the distribution characteristics of the local data on clients might affect the clustering process. Z. Chen et al. [16] proposed lightweight Privacy-Preserving Cross-Cluster Federated Learning (PrivCrFL) with heterogeneous data to protect the privacy security of the model upload. However, they preset the number of client groups during training, which may be difficult to obtain in federated learning. A. Ghosh et al. [6] constructed K global models at the server and broadcast these models to all clients. Clients compute loss values using local data, select the model with the lowest loss, and return it to the server, which assigns the clients to the K clusters based on the received model information. However, this approach increases the communication burden between the server and the clients by a factor of K due to the sending of K models.

3. Preliminaries

3.1. Notations Description

The notations covered in this article are shown in Table 1.

3.2. Clustered Federated Learning

The loss function of a typical federated-learning model is shown in Equation (1):
\min_{w \in \mathbb{R}^d} f(w) = \sum_{i=1}^{N} p_i f_i(w)          (1)
where N denotes the number of clients and $p_i$ is the corresponding weight of the client, satisfying $\sum_{i=1}^{N} p_i = 1$, $p_i \geq 0$. The function $f_i(w) = L(D_i; y_i; w)$ is the loss function of the i-th client, where $(D_i; y_i)$ is the training data and the corresponding labels stored in that client, and w is the federated global model. From Equation (1), it is evident that traditional federated-learning methods like FedAvg aim to train a global model that minimizes the aggregate loss across all clients. However, this singular global model may not always be optimal. For instance, when two clients have divergent or even opposing local objectives, the performance of the joint model often suffers due to conflicting predictions, resulting in poorer performance than either client’s local model alone. As a result, neither client may derive substantial benefits from federated learning. To address these irreconcilable conflicts, clustered federated learning decomposes the global model optimization problem into multiple local optimal model problems. The introduction of multi-center federated models mitigates conflicts between clients with different data distributions and provides more targeted cluster models for similar clients. This approach aims to improve overall model performance by accommodating heterogeneous client objectives and enhancing the relevance of the global model for specific client clusters.
Clustered federated learning assumes that N clients participating in federated learning can be divided into K disjoint clusters, $c_1, \ldots, c_K$ ($K \leq N$), with each cluster’s client data distribution being different from other clusters. Clients within the same cluster have similar data distributions. In other words, clients within a cluster share similar optimization objectives, while clients across clusters have different optimization objectives. The distribution of client clusters satisfies Equation (2).
C = \{c_1, \ldots, c_K\}, \quad \forall i, j \leq K \ \text{and} \ i \neq j, \ c_i \cap c_j = \phi          (2)
Figure 1 shows an example application of clustered federated learning. Smartphones, smartwatches, cameras, etc., contain different types of user pictures, which are not shared with the server. These edge devices obviously take on different learning tasks, and grouping them helps to improve the performance of the federated model. The server identifies clients according to the uploaded model parameters and clusters them, and then forms the corresponding cluster center model based on these clusters. Finally, the server broadcasts corresponding models to clients according to the clustering results. The optimization objective of clustered federated learning is shown in Equation (3).
\min_{W = \{w_1, \ldots, w_K\},\ C = \{C_1, \ldots, C_K\}} \sum_{k=1}^{K} \sum_{i \in C_k} L(w_k, D_i)          (3)
where $C = \{C_1, \ldots, C_K\}$ are the divided client clusters and $W = \{w_1, \ldots, w_K\}$ are the corresponding cluster center models.

3.3. Self-Organizing Map

SOM [17] is an unsupervised clustering algorithm based on neural networks, which has stronger nonlinear modeling ability and self-adaptability. Through competitive-learning strategies, SOM can map complex high-dimensional input data to a simplified low-dimensional space while preserving the original topological structure of the input data space. SOM consists of two layers: the input layer and the output layer (competitive layer). The input layer contains n nodes, which correspond to the dimensions of the input data. The output layer typically has a one-dimensional or two-dimensional structure. There is a full connection between the two layers, and the number of neurons in the output layer is usually equal to the number of clusters. Figure 2 illustrates an example of a two-dimensional SOM network. During the training process, SOM identifies the most similar network node to each input sample, known as the Best Matching Unit (BMU), and defines a neighborhood set N around the BMU. At each learning step, all units within N are updated, while units outside N remain unchanged. Nodes within N will be updated to varying degrees based on their distance from the BMU, with nodes closer to the BMU receiving larger updates. The range of N shrinks with time and only includes BMU after a certain time, which is the transition from N (1) to N (3) in Figure 2.
Compared with typical clustering algorithms such as K-means [18], SOM not only updates the BMU but also updates the neighborhood node according to the distance from the neighborhood node to the BMU. With this smooth update mechanism, local optimality can be avoided. In addition, SOM networks have the ability to extract the characteristics of input patterns, and their node parameters to some extent represent the underlying patterns found from the dataset. This helps the server explore complex client identities in federated learning.

4. Self-Organizing Federated Learning

SoFL needs to address three key issues: (a) determining the number of clusters in federated learning, i.e., the value of K in Equation (3); (b) correctly assigning clients to their respective clusters, i.e., generating $C = \{C_1, \ldots, C_K\}$ as in Equation (3); (c) choosing an algorithm for fusing the central models within clusters. Therefore, to address the above issues, we decompose Equation (3) into two distinct optimization stages: the clustering stage, whose optimization objective is shown in Equation (4), and the model fusion stage, whose optimization objective is shown in Equation (5).
\min_{C = \{C_1, \ldots, C_K\},\ \theta = \{\theta_1, \ldots, \theta_N\}} \sum_{k=1}^{K} \sum_{i, j \in C_k} dist(\theta_i, \theta_j)          (4)
\min f(W_k) = \sum_{i \in C_k} p_i f_i(W_k)          (5)
By analyzing the optimization objectives of the two stages, we find that the clustering stage focuses more on the similarity between clients, aiming to group clients with similar characteristics into the same clusters. The model fusion stage, on the other hand, concentrates on the performance of the joint model across all clients within each cluster, aligning with the traditional goal of federated-learning algorithms to train a single model effectively. Therefore, appropriately partitioning clients into multiple clusters is a critical challenge that influences the success of clustered federated learning. This involves finding a sensible partitioning scheme that minimizes the sum of distances between clients within each cluster.

4.1. Framework Overview

Determining the optimal number of clusters is an important issue when solving the non-IID data problem in federated learning. In cases where the number of clusters for the given clients is known, algorithms like IFCA [6] and FeSEM [7] demonstrate good performance. However, determining the optimal number of clusters in advance is often impractical in federated learning. When the number of clusters, K, is unknown, specific metrics such as the elbow method and the silhouette coefficient, among others, are used to evaluate K. In the clustering process, as the number of clusters increases, the sample division becomes more refined and the degree of aggregation of each category becomes higher. We usually use the sum of the distances between cluster members and their center, that is, the Within-Cluster Sum of Squares (WCSS), to measure the tightness of a cluster. When K is less than the real number of clusters, increasing K will increase the degree of aggregation of each cluster, so the decrease in WCSS will be large. When K reaches the real number of clusters, the return on aggregation obtained by further increasing K rapidly diminishes, so the decline rate of WCSS drops sharply. In this process, there will be an obvious turning point in the WCSS curve, which is the principle the elbow method uses to identify K.
WCSS = \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - \mu_k \|^2          (6)
where $C_k$ represents the k-th cluster, in which the centroid of the cluster is $\mu_k$, and $x_i$ is the data in the cluster. Then, the degree of aggregation of the k-th cluster can be expressed by $\sum_{i \in C_k} \| x_i - \mu_k \|^2$. However, the elbow method often fails when the WCSS curve is smooth and lacks a distinct inflection point. In contrast, SOM neural networks can identify clustering structures without the need for predefining the number of clusters. Although SOM itself does not directly offer automatic determination of the optimal cluster number, inferring suitable cluster counts based on SOM’s topology and node distributions enhances clustering efficiency and accuracy. Hujun Yin et al. demonstrated that the feature space in SOM learning tends towards a stochastic process of multiple Gaussian distributions, ultimately converging to the probabilistic centers of input subsets [19].
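As a concrete illustration of the elbow rule described above, the following Python sketch computes the WCSS curve with scikit-learn's KMeans (whose inertia_ attribute is exactly the WCSS of Equation (6)) and picks K at the sharpest bend; the second-difference heuristic is our assumption, not a procedure specified in the paper.

import numpy as np
from sklearn.cluster import KMeans

def wcss_curve(X, k_max=10):
    """WCSS of Equation (6) for K = 1..k_max; KMeans.inertia_ equals the WCSS."""
    return [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, k_max + 1)]

def elbow_k(wcss):
    """Estimate K at the sharpest bend, i.e. the largest second difference of the curve."""
    curvature = np.diff(wcss, 2)          # discrete second derivative of the WCSS curve
    return int(np.argmax(curvature)) + 2  # +2 maps the index back to a K value (K starts at 1)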
Theorem 1. 
The feature space of the SOM algorithm is approximated as a Gaussian-distributed stochastic process that converges, in the mean-square sense, to the center of mass of the final input subset.
w_c(n) \xrightarrow{n \to \infty} m_c = \frac{1}{P(X_c)} \int_{X_c} x f(x)\, dx, \quad c \in Y          (7)
where $w_c(n)$ is the connection weight, $\{m_c\}$ is the final feature space, which is the set of clustering centers of the final input subsets $\{X_c\}$, $Y$ is the set of SOM output neurons, and n is the time step. Each neuron receives inputs from a set in the process of training, termed $X_c(n)$, which is a time-varying subset of the input set $X$. Suppose the probability density function of the input set $X$ is $f(x)$; then the probability of an input sample, $x(n)$, belonging to a subset, $X_c(n)$, is given by $P(X_c, n) = \int_{x \in X_c(n)} f(x)\, dx$. As time tends to infinity, $\{X_c(n), P(X_c, n), c \in Y\}$ will tend to $\{X_c, P(X_c), c \in Y\}$, respectively. So, within each input subset $X_c$, the probability density function is $f_c(x) = f(x)/P(X_c)$.
During the training process of the SOM network, nodes tend to move towards the high-density region of the input data subset. The SOM output neurons incorporate the data features that are relevant for mapping to the current node during a specific time period and can better represent the mapped subset. Thus, SOM provides not only clustered subsets but also winning neurons that can better represent the subset features during the training process. Compared with the complexity of directly clustering clients, the dimensionality of clustered data can be effectively reduced by the method of clustering SOM winning neurons.
However, it should be noted that the number of nodes in the SOM network is usually more than the actual number of clusters, and data in the same category in the classification result may correspond to similar but different winning neurons. If the categories are classified according to one winning neuron corresponding to one class of data, the categories classified by the SOM network may be more than the actual data categories, which in turn generates some redundant categories.
Therefore, the SoFL algorithm proposed in this paper designs a dual clustering method based on SOM. In the SoFL algorithm, the first layer of clustering exploits the SOM network to explore the feature space of the client model. By adjusting the winning neurons and their neighborhoods according to the client parameters, the SOM forms a quantitative mapping with minimum mean square distortion, and the corresponding neurons will be located at the center-of-mass position of the client subset, which initially determines the number of clusters and the clustering situation. By analyzing the WCSS curves of the winning neurons, SoFL can obtain the number of client clusters more accurately. After obtaining the preliminary clustering results, the SoFL algorithm takes the SOM winning neurons as input and performs the second layer of clustering based on the number of client clusters obtained. Regarding the second-layer clustering algorithm, it can be chosen flexibly based on actual situations. This dual clustering scheme has significant advantages in the following aspects: (a) Client clustering through SOM networks is smoother and less likely to fall into the local optimum dilemma. (b) By analyzing the WCSS curves of the winning neurons, the accuracy of the K value of the number of clusters is ensured while avoiding pre-setting, making the clustering process more adaptive and intelligent. (c) Secondary clustering compensates for the lack of boundary processing in SOM networks and optimizes the clustering effect.
Figure 3 illustrates the framework of the SoFL algorithm, which consists of the following main steps:
  • Step 1: The server sends the global model $W_t$ to the clients so that individual clients can start local training based on the global model.
  • Step 2: The clients receive the global model and train on the local dataset. This step enables each client to adapt the model to its unique data characteristics.
  • Step 3: The clients upload the trained model back to the server. The server collects updates from all clients in preparation for global model updates.
  • Step 4: The server aggregates the received models from each client and generates the global model $W_{t+1}$.
Repeat steps 1 through 4, during which the server trains a global generic model. Due to the different optimization goals between clients, the global model usually converges to a suboptimal solution. However, the performance of this suboptimal solution on the local clients is unlikely to be stable. Therefore, we summarize the clustering conditions as follows: (a) the update of the global model tends to be stable, that is, $0 < \| \sum_{i=1}^{N} \Delta w_i \| < \delta_1$; (b) the gradient updates of the clients still fluctuate strongly, that is, $\max_{i \in N} \| \Delta w_i \| > \delta_2 > 0$. We provide general recommendations for selecting the hyperparameters $\delta_1$ and $\delta_2$ based on Sattler, F. et al. [8], i.e., $\delta_1 \approx \max_i \| \Delta w_i \| / 10$ and $\delta_2 \in [\delta_1, 10\delta_1]$; a minimal check of these conditions is sketched at the end of this subsection. Another feasible approach is for the server to specify a clustering round by which the global model is assumed to meet the convergence condition.
  • Step 5: The server converts the client parameters into one-dimensional vectors for input to the SOM network for training. During the training of the SOM network, the node parameters are affected by the local density of the client parameters, which are usually located at the center of mass of the client clusters. Therefore, the algorithm’s use of the SOM node parameters as the center of each class after initial clustering helps to more accurately represent the characteristics of individual SOM client clusters.
  • Step 6: The server performs secondary clustering using the parameters of the winning nodes as input. In fact, after the SOM training is completed, the winning nodes already represent a class of client clusters. Therefore, the clustering of the winning nodes is also a clustering process of the clients.
  • Step 7: Based on the clustering results, the server aggregates all client models within each cluster separately to generate a plurality of central models suitable for the corresponding client clusters.
  • Step 8: The server sends all cluster models to the clients within the corresponding cluster. This process enables each client to obtain a model that matches its characteristics, thereby improving the performance of the overall algorithm.
Through the above steps, the SoFL algorithm decomposes a globally common sub-optimal solution model into several local optimal solutions, which can effectively optimize and adjust the model while protecting the privacy of clients. The server assigns the appropriate model to the client based on its identity, which allows the client to receive a personalized solution rather than a generic solution for its local dataset. This will further encourage more clients to participate in training.
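The clustering trigger described above (a stable global update together with strongly fluctuating client updates) can be checked with a few lines of NumPy; this is a minimal sketch assuming flattened client updates Δw_i, with δ1 and δ2 chosen as recommended above.

import numpy as np

def ready_to_cluster(client_updates, delta_1, delta_2):
    """Return True when the two clustering conditions of Section 4.1 hold.

    client_updates : list of 1-D arrays, one flattened gradient update per client
    """
    global_norm = np.linalg.norm(np.sum(client_updates, axis=0))     # ||sum of updates||
    max_client_norm = max(np.linalg.norm(u) for u in client_updates)
    # (a) the summed global update is nearly stationary; (b) single clients still move a lot
    return (0 < global_norm < delta_1) and (max_client_norm > delta_2 > 0)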

4.2. Algorithm Details

  • Model Pre-training
In the initial phase of the algorithm, the server first generates an initial model, $W_0$, and broadcasts it to all clients. All clients jointly maintain a global model. The important basis for client clustering is the gradient update direction of the client model, but not every round of communication yields update directions that can be separated. During the iterative training process in steps 1–4 of Figure 3, clients need to train the model based on local data. At this stage, the model moves towards the global optimal solution, and all client gradient updates may converge to the same direction. The federated model tends to stabilize when the accuracy of the clients no longer improves after several rounds of communication. At this point, the main factor limiting the performance of the federated model is the inconsistent optimization goals among clients. Due to the variability of data partitioning, the directions of client gradient updates in different clusters diverge. The SoFL algorithm enters the clustering phase accordingly. As a result, the differences in gradient updates between client clusters become more pronounced through model pre-training, allowing the clustering task to accurately classify clients.
  • Client Clustering
When the global model performance is no longer improving or the clustering threshold is satisfied, the model pre-training task is completed and the server will divide the clients into clusters. First, the server reshapes each layer of the received N client gradient updates, $w = \{w_1, \ldots, w_N\}$, into one-dimensional vectors and splices them in turn, finally obtaining a set of one-dimensional gradient vectors $\gamma = \{\gamma_1, \ldots, \gamma_N\}$, where $\gamma_i \in \mathbb{R}^D$ and D is the number of model parameters. Then, the server takes $\gamma$ as an input to the SOM network, randomly selects $\gamma_i$, $i \in [1, N]$, and calculates the similarity between $\gamma_i$ and the weights of all output layer nodes. The node with the smallest distance from $\gamma_i$ is considered as the BMU.
\nu_\omega = \arg\min_{\nu_s \in Z_{som}} dist(\gamma_i, \nu_s)          (8)
There are many measures of similarity, such as the Manhattan distance, cosine distance, and Euclidean distance. The choice of BMU is likely to change when different metrics are chosen. The training models usually contain a large number of parameters, which makes the vector set $\gamma$ high-dimensional. Distance-based similarity measures therefore have the potential to fail. In contrast, cosine distance measures the similarity between two non-zero vectors in terms of direction and is more suitable for the calculation of high-dimensional data. The similarity between any output layer node weight, $\nu_s$, and gradient update, $\gamma_i$, can be expressed as:
\alpha_{i,s} = \cos(\nu_s, \gamma_i) = \frac{\langle \nu_s, \gamma_i \rangle}{\| \nu_s \| \, \| \gamma_i \|}          (9)
The range of values for α is [−1, 1]. The closer the α is to 1, the more similar the two are, and the closer the α is to −1, the more different the two are.
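A small sketch of these two operations (flattening a client update into the vector γ_i and selecting the BMU by cosine similarity, Equations (8) and (9)) is given below; it is an illustrative NumPy version, not the authors' code.

import numpy as np

def flatten_update(layer_params):
    """Reshape each layer's update to 1-D and concatenate them into the vector γ_i."""
    return np.concatenate([np.ravel(p) for p in layer_params])

def best_matching_unit(gamma_i, node_weights):
    """Index of the SOM output node most similar to γ_i under cosine similarity.

    node_weights : array of shape (num_nodes, D), one weight vector ν_s per node
    """
    sims = node_weights @ gamma_i / (
        np.linalg.norm(node_weights, axis=1) * np.linalg.norm(gamma_i) + 1e-12)
    return int(np.argmax(sims))   # largest α_{i,s}  <=>  smallest cosine distance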
In the initial stage of SOM network clustering, the connections between SOM output layer neurons and client data are not very close. Therefore, to ensure the orderliness of the mapping space, SOM needs to select a larger range for neuron updating neighborhoods to generate a rough global order. As the number of iterations increases, SOM gradually narrows the updating neighborhood, thereby completing neuron updates in a more targeted fashion. The update at the t-th iteration is represented by Equation (10).
\nu_j(t+1) = \begin{cases} \nu_j(t) + \eta(t)\, g_{\omega,j} (\gamma_i - \nu_j(t)), & \nu_j \in \text{BMU neighbourhood} \\ \nu_j(t), & \nu_j \notin \text{BMU neighbourhood} \end{cases}          (10)
where $g_{\omega,j}$ represents the update constraint of the j-th neuron in the neighbourhood of the BMU, as shown in Equation (11), and $p$ represents the coordinate position of the neuron in the network.
g_{\omega,j} = \exp\left( -\frac{\| p_\omega - p_j \|^2}{2\sigma(t)^2} \right)          (11)
where $\eta(t)$ is the learning rate that shrinks over time, as shown in Equation (12), and $\sigma(t)$ is a monotonically decreasing function that determines the update range of the BMU neighborhood, as shown in Equation (13).
\eta(t) = \frac{\eta_0}{1 + t/(T_{max}/2)}          (12)
\sigma(t) = \frac{\sigma_0}{1 + t/(T_{max}/2)}          (13)
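Putting Equations (8)-(13) together, one SOM learning step can be sketched as follows; the neighbourhood cutoff (grid distance no larger than σ(t)) is our assumption, since the paper only states that nodes outside the BMU neighbourhood are left unchanged.

import numpy as np

def som_step(nodes, grid_pos, gamma_i, t, t_max, eta0=0.1, sigma0=1.5):
    """One SOM learning step for a sample γ_i (illustrative sketch of Equations (8)-(13)).

    nodes    : (num_nodes, D) array of output-node weight vectors ν_j
    grid_pos : (num_nodes, 2) array of node coordinates p_j on the output grid
    """
    # Cosine-similarity BMU, Equations (8)-(9)
    sims = nodes @ gamma_i / (np.linalg.norm(nodes, axis=1) * np.linalg.norm(gamma_i) + 1e-12)
    bmu = int(np.argmax(sims))
    # Decaying learning rate and neighbourhood radius, Equations (12)-(13)
    eta = eta0 / (1 + t / (t_max / 2))
    sigma = sigma0 / (1 + t / (t_max / 2))
    # Gaussian update constraint over grid distances to the BMU, Equation (11)
    dist2 = np.sum((grid_pos - grid_pos[bmu]) ** 2, axis=1)
    g = np.exp(-dist2 / (2 * sigma ** 2))
    # Neighbourhood update, Equation (10); the radius cutoff is an assumed choice
    nbhd = np.sqrt(dist2) <= sigma
    nodes[nbhd] += eta * g[nbhd][:, None] * (gamma_i - nodes[nbhd])
    return nodes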
After multiple rounds of iteration, the SOM network completes the initial clustering of clients. All clients mapped to the same network node can be considered part of the same cluster. Building upon this, the SoFL algorithm employs secondary clustering to eliminate redundant categories and achieve the precise clustering division of clients.
SOM winning neurons, $M = \{\xi_1, \xi_2, \ldots, \xi_z\}$, can be considered as units in the network that undertake the task of feature extraction. By adjusting the weights of the winning neurons, the network can learn the important features of the input client parameters and reflect the structure of these features in the mapped space. Thus, the winning neurons can, to some extent, represent the relevant clients in the next clustering stage. This greatly reduces the computational effort of secondary clustering, as the data to be processed are reduced from the original number of clients to the number of winning neurons, and also ensures the stability of the estimated number of clusters. In other words, the winning neurons are the new inputs to the secondary clustering stage and also determine which clients can be grouped into the same categories.
In the second clustering phase, the server initializes $\tilde{K}$ cluster centers, $\{\mu_i \in \mathbb{R}^D, i = 1, \ldots, \tilde{K}\}$. It sequentially computes the distances from the winning node parameters, $\xi_j$, to each cluster center and performs secondary clustering, as follows:
C_i = C_i \cup \left\{ \xi_j \,\middle|\, i = \arg\min_{i \leq K,\ \xi_j \in \nu_{som}} \| \mu_i - \xi_j \|^2 \right\}          (14)
When all the data have been categorized, the server updates the clustering center based on all the sample data in $C_i$. The formula is shown in (15).
\mu_i = \frac{1}{|C_i|} \sum_{\xi \in C_i} \xi          (15)
If the maximum number of iterations is reached or, after multiple rounds of iteration, all the clustering centers remain unchanged, the server divides the output clusters according to the categorization results; at this point, the elements contained in each cluster are SOM winning nodes. In the initial clustering, each winning node already corresponds to a set of similar clients, so the cluster classification, $C$, is finally mapped back to all clients.
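The secondary clustering stage operates only on the winning-node parameters ξ_j, so a plain K-means loop is enough; the following sketch implements Equations (14) and (15) in NumPy, with random center initialization as an assumed detail.

import numpy as np

def secondary_clustering(winning_nodes, k, n_iter=100, seed=0):
    """K-means over SOM winning-node parameters ξ_j (Equations (14)-(15))."""
    rng = np.random.default_rng(seed)
    centers = winning_nodes[rng.choice(len(winning_nodes), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step, Equation (14): each node joins its nearest center (squared Euclidean)
        labels = np.argmin(((winning_nodes[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        # Update step, Equation (15): each center becomes the mean of its members
        new_centers = np.array([winning_nodes[labels == i].mean(axis=0)
                                if np.any(labels == i) else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers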
  • The Federated-Learning Algorithm based on Dual Clustering
After the clients are grouped by dual clustering, federated training within each cluster can be approximated as operating on an independently and identically distributed dataset, and thus any traditional federated-learning algorithm, such as FedAvg or FedProx, can be chosen for the cluster. The server sends the central models, $\{W_1, \ldots, W_K\}$, of the formed clusters to the corresponding clients, i.e., clients in the same cluster share the same central model. The overall process is shown in Algorithm 1.
Algorithm 1: Self-organizing federated-learning algorithm
1 Input: local training batches, B; the number of local iterations, E; local learning rate, η; initial client cluster, C = {c_1, c_2, ..., c_N}; number of communications, T; scheduled clustering round, T_cluster; initial global model parameters, W_0^0; the amount of data, n.
2 Output: {W_1, ..., W_K}
3 server:
4      for t = 0, 1, ..., T do:
5            for C_i in C:
6                  randomly select m clients from the cluster C_i
7                  the server sends the cluster model parameters W_i^t to the m selected clients
8                  for k = 1, 2, ..., m do:
9                        Δw_k^{t+1}, w_k^{t+1} ← ClientUpdate(k, W_i^t)
10              end
11              if t == T_cluster:
12                  update the SOM network parameters according to Equation (10)
13                  obtain the initial clustering result C_SOM = {{c_1, ..., c_s}_1, ..., {c_m, ..., c_N}_z} and the parameters of the winning nodes M = {ξ_1, ξ_2, ..., ξ_z}
14                  calculate the number of clusters K = elbow(M)
15                  cluster the SOM winning nodes according to Equations (14) and (15):
16                        C_node = {{ξ_1, ...}_1, ..., {..., ξ_z}_K}
17                  combine the two clustering results C_SOM and C_node to form the final result C = {C_1, ..., C_K}
18              end
19              Aggregation(C)
20      end
21 ClientUpdate(k, W^t):
22            for e = 1 to E do:
23                  for B ∈ local data do:
24                        SGD(B, η, W^t)
25            Δw_k^{t+1} = w_k^{t+1} − W^t
26      return Δw_k^{t+1}, w_k^{t+1}
27 Aggregation(C):
28            for C_i in C:
29                  M ← |C_i|
30                  W_i ← Σ_{m=1}^{M} (n_m / n) w_m
31      return {W_1, ..., W_K}
In each communication round, the clients in a cluster receive the cluster model parameters and train locally (lines 5–9). At this stage, all clients can be considered to be in the same cluster, i.e., $C = \{c_1, c_2, \ldots, c_N\}$. When training reaches the scheduled round, the server uses the client parameters as input data to train the SOM network and obtains the winning node parameter matrix, $M = \{\xi_1, \xi_2, \ldots, \xi_z\}$, and the client cluster division after SOM network clustering, $C_{SOM} = \{\{c_1, \ldots, c_s\}_1, \ldots, \{c_m, \ldots, c_N\}_z\}$ (lines 11–13). Subsequently, the optimal number of clusters, K, is determined based on the inflection point of the WCSS curve (line 14), and the winning nodes are clustered again with the K-means algorithm, yielding the nodes’ clustering result $C_{node} = \{\{\xi_1, \ldots\}_1, \ldots, \{\ldots, \xi_z\}_K\}$ (lines 15–16). At this point, each winning node represents a class of clients, so the secondary clustering is also a de-redundant division of the client classes (line 17). Compared to traditional federated learning, the communication burden of SoFL is mainly concentrated in the clustering round. However, it should be emphasized that this burden occurs only once throughout the training process.
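The final mapping from node clusters back to clients and the per-cluster aggregation of Aggregation(C) can be sketched as follows; the dictionary layout and variable names are illustrative assumptions, not the paper's data structures.

def merge_clusterings(client_to_node, node_labels):
    """Compose C_SOM (client -> winning node) with C_node (node -> cluster) into client clusters."""
    clusters = {}
    for client_id, node_id in client_to_node.items():
        clusters.setdefault(node_labels[node_id], []).append(client_id)
    return clusters                       # {cluster id: [client ids]}

def aggregate_per_cluster(clusters, client_params, client_sizes):
    """Weighted FedAvg inside each cluster, as in Aggregation(C) of Algorithm 1."""
    centre_models = {}
    for cid, members in clusters.items():
        n_total = sum(client_sizes[m] for m in members)
        centre_models[cid] = sum((client_sizes[m] / n_total) * client_params[m]
                                 for m in members)
    return centre_models                  # one centre model W_k per cluster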

5. Experiments

5.1. Experimental Setup

5.1.1. Datasets and Segmentation

To evaluate the algorithm, we conducted experiments on two well-known public datasets. These datasets were chosen because real-world datasets on federated learning are scarce, and it is a better way to simulate federated-learning scenarios by extending freely available datasets.
  • MNIST [20]. MNIST is a handwritten grey scale image recognition task consisting of the numbers 0 to 9. The dataset consists of 60,000 training images and 10,000 test images with an image size of 28 × 28. In the MNIST experiment of this paper, 10,000 pictures are used as the training set.
  • CIFAR-10 [21]. CIFAR-10 is a 3-channel RGB image dataset containing 10 classes of objects with an image size of 32 × 32. There are a total of 50,000 training images and 10,000 test images in the dataset. In the CIFAR-10 experiment of this paper, 20,000 pictures are used as the training set.
To induce certain partitioning characteristics among clients, we extended these datasets as follows.
  • Feature Distribution Skew. First, we rotated the datasets to simulate skewed feature distributions across clients. For instance, with K = 4, the datasets were rotated by 0°, 90°, 180°, and 270° to form four data partitions. Then, we randomly divided each partition among 5 clients, finally forming a federated-learning scenario with 4 data partitions and a total of 20 clients. We randomly extracted data from one client in each of the four clusters, and the results are shown in Figure 4.
  • Label Distribution Skew. Secondly, by exchanging labels of client local data, we simulated the case of shifted label distribution in federated learning. For example, when K = 4, we exchanged the data label of each cluster according to $label = (label + i) \% 10$, $i \in \{0, 2, 4, 6\}$. In this way, the same class of data had different labels in different clusters.
Meanwhile, in order to simulate the imbalance in the amount of data owned by the federated-learning clients, we adopted the Dirichlet distribution [22] to control the distribution of client data in both of the above cases. Figure 5 illustrates the amount of data owned by the clients when the Dirichlet hyperparameter φ = 0.5.
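For reference, the three dataset manipulations can be sketched as follows; the array layout (N, H, W[, C]) and the helper names are our assumptions, and the Dirichlet draw only illustrates how per-client sample counts can be made unbalanced.

import numpy as np

def make_rotated_partitions(images, labels, k=4, clients_per_cluster=5, seed=0):
    """Feature-skew split: rotate by 0/90/180/270 degrees and spread each rotation over clients."""
    rng = np.random.default_rng(seed)
    shards = np.array_split(rng.permutation(len(images)), k * clients_per_cluster)
    clients = []
    for c, shard in enumerate(shards):
        rot = c // clients_per_cluster                       # cluster index 0..k-1
        clients.append((np.rot90(images[shard], rot, axes=(1, 2)), labels[shard]))
    return clients

def swap_labels(labels, cluster_idx):
    """Label-skew split: label <- (label + i) % 10 with i in {0, 2, 4, 6}."""
    return (labels + 2 * cluster_idx) % 10

def dirichlet_sizes(n_samples, n_clients, phi=0.5, seed=0):
    """Quantity skew: per-client sample counts drawn from a Dirichlet(phi) prior."""
    props = np.random.default_rng(seed).dirichlet([phi] * n_clients)
    return np.maximum((props * n_samples).astype(int), 1)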
Figure 4. Partial displays of MNIST (a) and CIFAR-10 (b) images when K = 4, where each row of data is sourced from clients in different clusters.
Figure 5. In the MNIST experiment, with φ = 0.5 , the local dataset situations of 20 clients are depicted, where different colors represent images of digits 0–9.

5.1.2. Client and Model Settings

In all subsequent experiments, we split each client’s local data into a training set and a test set at a ratio of 0.8:0.2. For the MNIST experiments, we used a Multilayer Perceptron (MLP) with one hidden layer and a local iteration number E = 3. For the more complex CIFAR-10 experiment, we set up a CNN model with two convolutional layers and three fully connected layers, with a local iteration number E = 5. The local optimizer was mini-batch SGD with batch size B = 100.
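The layer widths below are assumptions (the paper only specifies one hidden layer for the MLP and two convolutional plus three fully connected layers for the CNN), so this PyTorch sketch should be read as one plausible instantiation of the stated architectures.

import torch.nn as nn

class MnistMLP(nn.Module):
    """MLP with a single hidden layer; the hidden width (200) is an assumed value."""
    def __init__(self, hidden=200):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 10))
    def forward(self, x):
        return self.net(x)

class Cifar10CNN(nn.Module):
    """Two convolutional and three fully connected layers; channel/width choices are assumptions."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 5), nn.ReLU(), nn.MaxPool2d(2))
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 5 * 5, 384), nn.ReLU(),
            nn.Linear(384, 192), nn.ReLU(), nn.Linear(192, 10))
    def forward(self, x):
        return self.classifier(self.features(x))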

5.1.3. Baselines and Metrics

  • FedAvg: Federated learning based on weighted average.
  • FedProx: An optimized version of the FedAvg algorithm, which adds a proximal correction term to the local loss function; the hyperparameter λ = 0.1 in this experiment (a sketch of the proximal term follows this list).
  • CFL: A clustered federated-learning algorithm for clustering via optimal bisection.
  • IFCA: IFCA is an iterative clustering algorithm that determines the identity of a client by calculating its losses with the models.
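As referenced above, the FedProx baseline only changes the client objective; a minimal PyTorch-style sketch of the proximal term, with λ = 0.1 as in our setup, is shown below (the task loss itself is whatever criterion the client already uses, and global_params is assumed to be a list of tensors copied from the global model).

def fedprox_local_loss(task_loss, local_model, global_params, lam=0.1):
    """FedProx client objective: task loss + (λ/2) * ||w - w_global||^2 (illustrative sketch)."""
    prox = sum(((p - g.detach()) ** 2).sum()
               for p, g in zip(local_model.parameters(), global_params))
    return task_loss + 0.5 * lam * prox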
In our experimental setup, the clients all have a test dataset locally. Therefore, we will evaluate the model accuracy on the clients and obtain the overall accuracy by averaging the test accuracy across all clients.

5.2. Experimental Results

5.2.1. Client Similarity

The core rationale for solving the non-IID data problem based on clustering methods and thus improving the performance of federated learning lies in the fact that the model parameters trained by clients with the same or similar data distributions are also similar. To this end, we first experimentally verify the similarity of model parameters across clients when there are different data distributions in federated learning. In this section of experiments, we take the feature distribution offset, i.e., rotating the image, to simulate the case of multiple client clusters. The similarity between the client update gradients when the joint training stabilizes is shown in Figure 6.
It can be observed from Figure 6a that there is no clear color differentiation in the client-side similarity matrix when multiple data distributions are not present. In contrast, in Figure 6b, the cosine similarity matrix shows two regions with significant color differences, i.e., clients 0–9 and clients 10–19. A similar situation also occurs in Figure 6c, where the regions corresponding to clients 0–4, 5–9, 10–14, and 15–19 are darker than the other regions. This indicates that the similarity of parameters within client clusters is significantly higher than the similarity of parameters between clusters. It can be seen that the update direction of the client parameters reflects the local data distribution to some extent. Therefore, after receiving the model parameters sent back by the clients, the server can identify the cluster identities of the clients based on these parameters and jointly train the clients with the same or similar optimization objectives, thus improving the performance of the federated-learning model.
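The similarity matrices in Figure 6 can be reproduced with a few lines of NumPy; the sketch below assumes each client update has already been flattened into a 1-D vector.

import numpy as np

def cosine_similarity_matrix(updates):
    """Pairwise cosine similarity between flattened client updates, as visualized in Figure 6."""
    U = np.stack(updates)                                    # shape (num_clients, D)
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    return U @ U.T                                           # entry (i, j) = cos(w_i, w_j)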

5.2.2. Clustering Structure of SOM Network

The initial size of the SOM network should be determined based on the actual input. In general, when the number of client clusters is relatively clear, a smaller map size can be used, in which case the SOM degenerates into the K-means algorithm. However, when the number of client clusters is unclear, we recommend using a larger size to obtain sufficient resolution. In this section of the experiment, a Gaussian function is chosen as the neighborhood function of the SOM network, cosine distance is utilized as the distance metric, σ is set to 1.5, the learning rate η = 0.1, and the number of network iterations is set to 300. In addition, the experiment uses the U-Matrix (unified distance matrix) [23] to visualize the trained SOM network. In the U-Matrix, a lighter color represents a smaller distance between a node and its neighboring nodes, and a darker color represents a larger distance. Thus, the U-Matrix can visualize the distance of each neuron from its neighboring nodes.
We conducted experiments with K = 2 and K = 4, respectively, and the results are shown in Figure 7. The dots in the figure represent the clients, the different colors represent the different client clusters they belong to, and the drop locations represent the neurons they are mapped to. In Figure 7a, the clients of the red cluster are distributed in the upper half of the U-Matrix, and the clients of the yellow cluster are distributed in the lower half. In Figure 7b, the four client clusters occupy the four corner regions. This indicates that the SOM neurons do move towards the high-density regions in the input subset space during the training process and present a certain topology in the space. However, although the SOM network recognizes the clustering structure, the clients are still distributed across multiple neurons. To facilitate the coordinated training of more similar clients, we need further cluster partitioning to eliminate redundant categories.
Notably, we generated different initial networks using different seed values and trained them. The results show that, because of the randomness of the initialization, the clients were mapped to different network nodes in the map, but the final SOM networks still provided similar classification results. This may be due to the fact that the winning neurons and their neighborhood neurons are updated synchronously during the training process, resulting in well-converged SOM networks that end up with similar coverage. Therefore, the results obtained from clustering using SOM networks are less affected by the initialization. At the same time, the SOM network does not require a predefined number of clusters, and some correlations between input clients can be observed, which is very favorable for federated learning, where the identities of the participants are unknown.

5.2.3. Representation of SOM Neurons

To verify whether the SOM neuron parameters are more useful than the original client parameters for determining the number of potential client clusters, we computed the WCSS curves using the client parameters and the SOM winning-neuron parameters as inputs, with the number of clusters K = 4; the results are shown in Figure 8.
In Figure 8a, the WCSS curve calculated by the client model parameters declines more smoothly, the inflection point is not obvious, and the finalized number of clusters is far from the actual number of clusters. In Figure 8b, the WCSS curve computed with the SOM winning neuron parameters as inputs shows a clear turn at a predicted cluster number of 4, correctly calculating the optimal number of clusters. The reason for this is that the direct application of client model parameters may contain a large amount of redundant information, leading to an unclear clustering structure. Comparatively, SOM winning neurons reduce the amount of evaluation data while preserving the clustering features, resulting in a more prominent clustering structure. In addition, the computational cost required for clustering evaluation using SOM winning neuron parameters is significantly less than that of client-side parameters.
Next, we clustered the clients with three methods, K-means, SOM, and SOM + K-means, respectively, and the cluster labels for each client are shown in Table 2. Even though we set the correct number of clusters, the K-means algorithm still produces incorrect judgements between clients 1–5 and 11–14. SOM produces redundant categories after training, although it does not misclassify. The SOM + K-means method, on the other hand, combines the advantages of both and correctly classifies the clients.

5.2.4. Convergence Efficiency Analysis

Figure 9 shows the accuracy convergence curves of SoFL compared to the other algorithms for the case of K = 4. As shown in Figure 9a, on the MNIST dataset, the two algorithms FedAvg and FedProx, which are dedicated to reducing the discrepancy between local and global models, converge at an average accuracy of approximately 75%. The CFL algorithm completes its first clustering at the 20th round of communication, and its model accuracy reaches 85% between the 20th and 40th rounds. Then, after the 40th round of communication, the CFL algorithm completes its second clustering, and its mean model accuracy reaches approximately 90%. In contrast, after clustering in the 20th round of communication, the SoFL algorithm’s mean model accuracy has already reached 90%.
Similar results are found on the CIFAR-10 dataset. In Figure 9b, the average accuracies of FedAvg and FedProx are 39% and 36%, respectively. The accuracy of CFL is approximately 42% at rounds 40 to 80, and the accuracy improves to 45% after round 80. The accuracy of SoFL has reached 47% after round 40. Neither of the two single-model-based algorithms can obtain the optimal model in the presence of different client preferences. Accordingly, the clustering-based federated-learning algorithm can significantly improve the accuracy of the model.
Notably, compared to CFL, SoFL effectively reduces the communication cost by speeding up the convergence of the model while guaranteeing accuracy. This is mainly because CFL is based on an optimal binary classification algorithm, which does not fully complete clustering in the first clustering round; clients with inconsistent optimization objectives therefore remain within each cluster, interfering with the training of the federated model. In federated learning, if there are multiple data partitions, the number of training rounds required for CFL to converge increases with the number of data partition categories, whereas the SoFL algorithm clusters the clients in a single clustering round using two clustering algorithms, preventing clients in different clusters from influencing each other.

5.2.5. Accuracy Analysis

In this section, we perform accuracy experiments on the two processed datasets. The results of the accuracy comparison between SoFL and the baseline algorithms when the number of client clusters K = 4 are shown in Table 3. FedAvg and FedProx perform poorly in both bias cases because both attempt to fit all data from different distributions, which is not reasonable in practice. In addition, there is no significant performance improvement in FedProx compared to FedAvg, which may be because FedProx’s strategy of adding a proximal term at the local client essentially limits the local model from drifting too far from the global model, but this does not alleviate the conflicts between clients.
SoFL significantly improves the accuracy of the model compared to the two traditional federated-learning methods. In the four cases, the accuracy of SoFL is improved by approximately 13%, 9%, 18%, and 10%, respectively. Compared to the CFL algorithm, the accuracy of SoFL is improved by 3.83%, 2.4%, 1.92%, and 1.1%, respectively. The joint training of CFL after its first clustering seems to still have an adverse effect on the model. The accuracy of IFCA is lower than that of CFL and SoFL, which indicates that the method of determining client identity by selecting the minimum loss is not stable.

5.2.6. Impact of Data Heterogeneity

In the next experiments, we analyze the impact of data heterogeneity on the SoFL algorithm by varying the degree of heterogeneity of the client dataset by adjusting both the number of potential clusters and the value of the Dirichlet hyperparameter α , respectively.
First, we construct multiple client clusters by exchanging local data labels. As the number of clusters increases, the degree of heterogeneity of the overall data distribution increases accordingly. Table 4 shows the accuracy of each algorithm with different clusters. As the number of client clusters is changed from 2 to 8, the accuracies of FedAvg and FedProx on the MNIST and CIFAR-10 experiments drop by 33% and 20%, respectively, and the conflicting optimization objectives between clusters seriously impair the performance of the federated model. Among the three clustering federated-learning algorithms, SoFL has consistently maintained better accuracy.
Then, with the number of client clusters fixed at 4, we simulate the imbalance in the amount of data owned by the clients by varying the hyperparameter α, where a larger α corresponds to a smaller difference in the amount of local data owned by the clients. The experimental results are shown in Table 5; when α gradually increases, the accuracy of FedAvg and FedProx decreases slightly. When the difference in the amount of data between clients is reduced, i.e., when the amount of data is more balanced, the performance of the traditional federated-learning methods is affected to some extent. In other words, federated learning is more likely to produce adversarial updates when there are multiple distributions with comparable amounts of data on the client side. This adversarial updating refers to the conflicting directions of model updates from different clients, which makes it more difficult for the global model to converge, or makes it converge more slowly, thus decreasing the performance of the overall model. SoFL is still able to maintain a high accuracy rate in this balanced data-volume scenario and is able to avoid the damage of adversarial updating to federated learning within fewer communication rounds.
It should be noted that, although CFL achieves similar model performance to SoFL in the experiments, there is a significant difference in convergence efficiency. CFL requires multiple rounds of clustering to help the server distinguish all client clusters, while SoFL completes the clustering task in one round of communication. This implies that SoFL is more accurate than CFL in most subsequent rounds and converges faster. Therefore, the more complex the data distribution environment in a federated-learning scenario, the more significant the advantage SoFL shows over CFL. Moreover, IFCA can incur several times the communication consumption of the other algorithms, which limits its applicability.

6. Conclusions

In this paper, we propose a novel clustered federated-learning algorithm, SoFL, to address the challenge that non-IID data poses to federated learning. We conducted experiments in several multi-cluster simulation environments. The results show that the accuracy of SoFL in the MNIST experiments is 13–18% higher than that of FedAvg, and the convergence experiments show that SoFL converges faster than a general clustering-based algorithm. In practical applications, SoFL suits federated scenarios in which the participating clients can be divided into several clusters with different optimization goals. In such cases, SoFL forms multiple federated models by coordinating the training of clients that share an optimization goal, mitigating the impact of non-IID data and improving model performance. SoFL can be deployed in many fields, such as medicine and finance. Because SoFL typically requires the participation of all clients during clustering, computing over the complete model parameters on the server may increase the computational cost to some extent. In future work, we will investigate how to compress the communicated parameters without compromising model similarity.

Author Contributions

Conceptualization, J.Z.; methodology, J.Z. and Z.Q.; software, Z.Q.; validation, Z.Q.; formal analysis, J.Z. and Z.Q.; investigation, Z.Q.; resources, J.Z. and Z.Q.; data curation, J.Z. and Z.Q.; writing—original draft preparation, J.Z. and Z.Q.; writing—review and editing, J.Z. and Z.Q.; visualization, J.Z. and Z.Q.; supervision, J.Z. and Z.Q.; project administration, J.Z.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science & Technology Development Program of Jilin Province, China, under the project "Research on Machine-Learning Methods Based on Multi-party Participation" (20210101483JC).

Data Availability Statement

The datasets that support the results of this study are publicly available datasets, and the use of these datasets in this work adheres to the licenses of these datasets. The MNIST dataset is available at https://yann.lecun.com/exdb/mnist/ (accessed on 24 July 2024). The CIFAR-10 dataset is available at https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz (accessed on 24 July 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282.
  2. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and Open Problems in Federated Learning. Found. Trends Mach. Learn. 2021, 14, 1–210.
  3. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated Optimization in Heterogeneous Networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450.
  4. Karimireddy, S.P.; Kale, S.; Mohri, M.; Reddi, S.; Stich, S.; Suresh, A.T. SCAFFOLD: Stochastic Controlled Averaging for Federated Learning. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020.
  5. Wang, J.; Liu, Q.; Liang, H.; Joshi, G.; Poor, H.V. Tackling the Objective Inconsistency Problem in Heterogeneous Federated Optimization. Adv. Neural Inf. Process. Syst. 2020, 33, 7611–7623.
  6. Ghosh, A.; Chung, J.; Yin, D.; Ramchandran, K. An Efficient Framework for Clustered Federated Learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020; p. 1643.
  7. Long, G.; Xie, M.; Shen, T.; Zhou, T.; Wang, X.; Jiang, J. Multi-Center Federated Learning: Clients Clustering for Better Personalization. World Wide Web 2022, 26, 481–500.
  8. Sattler, F.; Müller, K.R.; Samek, W. Clustered Federated Learning: Model-Agnostic Distributed Multitask Optimization under Privacy Constraints. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 3710–3722.
  9. Li, X.; Huang, K.; Yang, W.; Wang, S.; Zhang, Z. On the Convergence of FedAvg on Non-IID Data. arXiv 2019, arXiv:1907.02189.
  10. Gao, L.; Fu, H.; Li, L.; Chen, Y.; Xu, M.; Xu, C.Z. FedDC: Federated Learning with Non-IID Data via Local Drift Decoupling and Correction. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022.
  11. Zhao, Y.; Li, M.; Lai, L.; Suda, N.; Civin, D.; Chandra, V. Federated Learning with Non-IID Data. arXiv 2018, arXiv:1806.00582.
  12. Fallah, A.; Mokhtari, A.; Ozdaglar, A. Personalized Federated Learning with Theoretical Guarantees: A Model-Agnostic Meta-Learning Approach. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020; p. 300.
  13. Smith, V.; Chiang, C.K.; Sanjabi, M.; Talwalkar, A.S. Federated Multi-Task Learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 4427–4437.
  14. Arivazhagan, M.G.; Aggarwal, V.; Singh, A.K.; Choudhary, S. Federated Learning with Personalization Layers. arXiv 2019, arXiv:1912.00818.
  15. Mishra, R.; Gupta, H.P.; Banga, G.; Das, S.K. Fed-RAC: Resource-Aware Clustering for Tackling Heterogeneity of Participants in Federated Learning. IEEE Trans. Parallel Distrib. Syst. 2024, 35, 1207–1220.
  16. Chen, Z.; Yu, S.; Chen, F.; Wang, F.; Liu, X.; Deng, R.H. Lightweight Privacy-Preserving Cross-Cluster Federated Learning with Heterogeneous Data. IEEE Trans. Inf. Forensics Secur. 2024, 19, 7404–7419.
  17. Kohonen, T. The Self-Organizing Map. Proc. IEEE 1990, 78, 1464–1480.
  18. MacQueen, J. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 1 January 1967; Volume 1, pp. 281–297.
  19. Yin, H.; Allinson, N.M. On the Distribution and Convergence of Feature Space in Self-Organizing Maps. Neural Comput. 1995, 7, 1178–1187.
  20. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324.
  21. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. Master's Thesis, Department of Computer Science, University of Toronto, Toronto, ON, Canada, 2009.
  22. Hsu, T.M.H.; Qi, H.; Brown, M. Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification. arXiv 2019, arXiv:1909.06335.
  23. Kraaijveld, M.A. A Non-Linear Projection Method Based on Kohonen's Topology Preserving Maps. In Proceedings of the 11th IAPR International Conference on Pattern Recognition, Volume II, Conference B: Pattern Recognition Methodology and Systems, The Hague, The Netherlands, 30 August–3 September 1992.
Figure 1. Example of clustered federated learning.
Figure 2. Schematic diagram of the 2D SOM network. The orange node in the figure is the best matching unit (BMU), the winning neighborhood is shown within the dashed range, and the yellowish nodes in this neighborhood are updated according to their distance from the BMU.
Figure 3. Overview of the SoFL algorithm.
Figure 6. The cosine similarity matrix of client parameters in the MNIST experiments, where row i and column j represent the similarity between client i and client j, with darker colors representing a closer relationship between the two.
Figure 7. U-Matrix for K = 2 (a) and K = 4 (b).
Figure 8. Comparison of the results of the "elbow method" with the client parameters as input (a) and the SOM winning parameters as input (b) for K = 4, where the dashed positions mark the calculated optimal number of clusters.
Figure 9. Accuracy convergence curves of SoFL and the baseline algorithms on (a) MNIST and (b) CIFAR-10 for K = 4, with clustering rounds marked by dashed lines.
Table 1. Notations and related descriptions.

Notation          Description
f(w)              the loss function
p_i               the weight of the i-th client
L(w_k, D_i)       the loss function of the i-th client on the k-th cluster model
dist(θ_i, θ_j)    the distance between the i-th client and the j-th client
W_k               the k-th cluster center model
w                 the client gradient
Δw                the client gradient update
γ                 the one-dimensional (flattened) form of the gradient update
ν                 the SOM node parameters
η                 the learning rate
g_{ωj}            the update constraint of the j-th neuron in the neighborhood of ν_ω
μ                 the cluster center
ξ                 the SOM activated node parameters
Table 2. Client cluster labelling under three clustering methods.

            K-Means (K = 4)   SOM   SOM + K-Means
Client1     0                 0     0
Client2     0                 1     0
Client3     0                 2     0
Client4     0                 3     0
Client5     0                 2     0
Client6     1                 4     1
Client7     1                 4     1
Client8     1                 5     1
Client9     1                 4     1
Client10    1                 5     1
Client11    0                 6     2
Client12    0                 6     2
Client13    0                 6     2
Client14    0                 7     2
Client15    2                 7     2
Client16    3                 8     3
Client17    3                 8     3
Client18    3                 8     3
Client19    3                 9     3
Client20    3                 8     3
Table 3. Comparison of accuracy (%) when K = 4.

            Feature Distribution Skew      Label Distribution Skew
            MNIST       CIFAR-10           MNIST       CIFAR-10
FedAvg      78.18       39.5               71.89       38.3
FedProx     78.02       38.9               72.83       38.8
CFL         87.25       45.8               87.50       47.2
IFCA        83.43       44.2               82.95       45.6
SoFL        91.08       48.2               89.45       48.3
Table 4. Comparison of the accuracy of the algorithms at different numbers of client clusters (%).

            MNIST                          CIFAR-10
            K = 2    K = 4    K = 8        K = 2    K = 4    K = 8
FedAvg      48.50    25.17    15.84        33.22    19.68    13.15
FedProx     48.36    26.74    14.12        33.70    20.50    13.64
CFL         93.45    91.74    90.40        49.36    46.94    43.72
IFCA        90.11    90.56    89.35        48.10    46.15    41.76
SoFL        93.53    92.30    90.63        49.20    47.34    43.92
Table 5. Comparison of accuracy at different α (%).

            MNIST                              CIFAR-10
            α = 0.5   α = 1    α = 10          α = 0.5   α = 1    α = 10
FedAvg      26.75     25.39    24.78           19.65     19.20    18.67
FedProx     25.68     25.22    23.62           20.52     19.94    18.22
CFL         91.89     91.79    89.47           46.03     47.02    45.93
IFCA        91.17     90.03    90.43           42.18     42.34    41.55
SoFL        91.23     92.25    91.86           46.22     47.70    47.94
