1. Introduction
Federated learning (FL) enables the collaborative training of a shared model by multiple decentralized clients, eliminating the demand for direct data exchange between clients. The utilization of this paradigm guarantees the localization of the training data and the protection of clients’ privacy [
1]. Consequently, FL has gained significant traction in addressing numerous practical challenges across diverse fields, including the medical arena [
2]. Nevertheless, the training data disseminated among numerous participating clients typically exhibit non-IID characteristics [
3,
4,
5], which is a vital issue in the field of FL, as highlighted in reference [
1]. The impact of label distributions in clients’ training data on the overall performance of classification tasks has been observed [
6]. Non-IID data affect FL from two distinct perspectives, and two primary factors contribute to the divergence of local models: the data distributions vary significantly across clients, and the local data are imbalanced. Non-IID distributed data can result in a phenomenon known as “weight divergence” during a model’s training process, which in turn has a detrimental effect on the performance of the global model [
7,
8].
One approach to address the aforementioned issue is mitigating the impact of data category imbalance on FL model by employing data augmentation techniques in situations where there exists a substantial disparity in data categories across various client datasets [
9]. Nevertheless, the predominant obstacle in practical implementation is the inefficient utilization of communication resources resulting from the disproportionate allocation and dissemination of client data. The FedAvg algorithm, introduced by McMahan et al. [
10], is widely acknowledged in this context. It efficiently combines the model updates from several clients by applying a weighted averaging technique to the model parameters. That work examines the variation in client data while assuming that the global data follow the IID assumption, and subsequent progress in enhancing the algorithm's performance has been limited. Based on this premise, researchers initiated investigations into enhanced FL methodologies, including FedProx [
11], SCAFFOLD [
12], and MOON [
13], with the aim of enhancing the performance of FedAvg in non-IID data scenarios through refining local training procedures. Nevertheless, in this situation, the enhancements achieved by FedProx left room for improvement, whereas the SCAFFOLD and MOON approaches imposed considerable supplementary communication overhead. Another line of enhanced FL algorithms for non-IID data optimizes the aggregation weights in order to improve the performance of FedAvg, placing greater emphasis on the computation of similarity between the local and the global model [
14,
15]. However, this approach incurs significant storage and time costs and does not effectively capture variations in client data distributions. Hence, our research is centered on the non-IID data scenario. Our objective is to assess the similarity of the clients' data distributions and modify the aggregation weights accordingly, thereby mitigating the communication bottleneck.
While FL can provide a certain level of privacy protection by sharing model parameters like gradients, it is vital to consider the non-IID scenario. In such cases, attackers can deduce model parameter information, hence posing a potential danger of privacy breach [
16]. To augment the privacy protection capacity of the model during transmission, current approaches integrate FL with additional privacy protection technologies. These technologies include Differential Privacy (DP) [
17], Homomorphic Encryption [
18], and Secure Multi-Party Computation [
19]. DP, in particular, possesses both a rigorous mathematical foundation and the ability to quantify the level of data privacy protection through the concept of a privacy budget. Currently, DP has emerged as a highly effective method for safeguarding data privacy in the context of FL. Two primary application approaches of DP are commonly utilized: centralized DP and localized DP. Centralized DP concentrates data processing and storage on the central server; however, this strategy is susceptible to single-point failures and potential privacy breaches [
20]. The concept of localized data processing entails the distribution of data processing and protection tasks over multiple local devices, hence enhancing privacy safeguards. In [
21], the authors offer a client-level privacy protection strategy. However, they do not provide sufficient evidence to establish that the scheme fully satisfies the notion of DP. The study in [
22] offers a theoretical demonstration; however, it fails to consider the trade-off between privacy parameters and model utility. Hence, for the non-IID scenario, the pressing issue is to reduce communication costs while simultaneously guaranteeing the privacy of FL. To this end, this study presents a privacy-preserving FL technique that leverages the similarity of client data distributions.
The primary contributions of our study are as follows:
- (1)
To address the issue of suboptimal FL models resulting from non-IID data, we propose a scheme that utilizes the Hellinger distance to quantify the disparity between the local data distributions of clients and the ideal balanced distribution, thereby alleviating model divergence;
- (2)
To address the issue of excessive communication usage in FL while dealing with non-IID data, we propose an aggregation technique that incorporates similarity weighting. This method leverages the similarity results obtained from analyzing the data distribution of each client, allowing for fast transfer of local model information to the Parameter Server (PS);
- (3)
To address the privacy disclosure issue in FL, we employ DP as a solution. During the training process, Gaussian noise is incorporated into the client’s output in order to enhance privacy and security measures.
The remainder of our paper is organized as follows. Following this introduction, the relevant preliminary is presented in
Section 2, and the proposed system model for the privacy-preserving FL algorithm in the context of non-IID data is presented in
Section 3. The findings pertaining to privacy theory are presented in
Section 4, whereas the results concerning convergence theory can be found in
Section 5. The experimental evaluation is reported in
Section 6, and the discussion is given in
Section 7. Finally, the concluding remarks are presented in
Section 8.
2. Preliminary
This section outlines the fundamental framework of FL, elucidates the notion of the DP mechanism, and examines the influence of non-IID data on model optimization.
2.1. Federated Learning
FL refers to a collaborative training procedure that involves the interaction between local clients and PS [
19]. Supposing a standard FL system with $N$ local clients and a PS, each client $k$ has its private training dataset $\mathcal{D}_k = \{(x_i^k, y_i^k)\}_{i=1}^{n_k}$, and the dataset size is $n_k$; here, $x_i^k$ represents data point $i$ of client $k$, and $y_i^k$ indicates the label of data point $i$ of client $k$. The client communicates with the PS to train the global model cooperatively without transmitting the original data. Therefore, the optimization problem of FL can be described as
$$\min_{w} F(w) = \sum_{k=1}^{N} p_k F_k(w),$$
where $F(w)$ denotes the global objective function, $w$ stands for the model parameter vector, $p_k$ refers to the aggregate weight of client $k$ (with $\sum_{k=1}^{N} p_k = 1$), and $F_k(w)$ denotes the local objective function of client $k$. Specifically, assuming that the training data of client $k$ is $\mathcal{D}_k$, the local objective function $F_k(w)$ can be defined as
$$F_k(w) = \frac{1}{n_k} \sum_{i=1}^{n_k} \ell\big(w; x_i^k, y_i^k\big),$$
where $\ell(\cdot)$ refers to the loss function specified by the client. Cross-entropy is often used as a loss function in image recognition tasks.
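For concreteness, the following is a minimal sketch of the local objective $F_k$ for a linear softmax model with cross-entropy loss; the model choice, array shapes, and synthetic data are illustrative assumptions, since the paper fixes only the loss, not the architecture.

```python
import numpy as np

def local_objective(w, X, y):
    """F_k(w): average cross-entropy of a linear softmax model over one
    client's data (illustrative model; the paper only fixes the loss)."""
    logits = X @ w                                      # shape (n_k, C)
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 784))        # one client's 600 MNIST-like samples
y = rng.integers(0, 10, size=600)      # labels in {0, ..., 9}
print(local_objective(np.zeros((784, 10)), X, y))   # ln(10) ~ 2.303 at w = 0
```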
The FedAvg algorithm [
21] is a commonly employed approach in federated optimization. The PS computes the average of the local model parameters submitted by individual clients and thereafter distributes the aggregated outcome to each client. In the conventional FedAvg algorithm, the client first retrieves the latest global model parameters from the PS and initializes the local model. When clients are chosen at random for training, the selected clients update the model on their local data by individually executing
$E$ epochs of stochastic gradient descent and afterwards reporting the results to the PS. Ultimately, the PS collects the locally updated models and aggregates them through an averaging process.
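To make the procedure concrete, here is a minimal sketch of one FedAvg communication round under the standard weighting $p_k = n_k / \sum_j n_j$; the `grad_fn` callables and the flat NumPy parameter vector are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

def fedavg_round(global_w, clients, lr=0.1, local_epochs=5):
    """One FedAvg communication round (illustrative sketch).

    global_w : np.ndarray, current global model parameters
    clients  : list of (grad_fn, n_k) pairs; grad_fn(w) returns a stochastic
               gradient of client k's local objective F_k at w
    """
    total = sum(n_k for _, n_k in clients)
    updates = []
    for grad_fn, n_k in clients:
        w = global_w.copy()                 # client starts from the global model
        for _ in range(local_epochs):       # E local SGD steps
            w -= lr * grad_fn(w)
        updates.append((w, n_k))
    # weighted average: p_k = n_k / total, the standard FedAvg weighting
    return sum((n_k / total) * w for w, n_k in updates)
```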
2.2. Differential Privacy
Dwork et al. proposed DP [
23] to solve the privacy protection problem in databases. As a proven privacy protection technology, DP can ensure that the impact of a single sample on the whole is always lower than a certain threshold when outputting information, which makes it impossible for attackers to analyze the situation of a single sample from the change in output.
Definition 1. (($\epsilon, \delta$)-DP [23]): Consider any two neighboring datasets $D$ and $D'$, which differ in only one data sample. A randomized mechanism $\mathcal{M}$ with domain $\mathcal{X}$ and range $\mathcal{R}$ guarantees $(\epsilon, \delta)$-DP if, for any subset of outputs $S \subseteq \mathcal{R}$, it holds that
$$\Pr[\mathcal{M}(D) \in S] \le e^{\epsilon} \Pr[\mathcal{M}(D') \in S] + \delta.$$
This ensures that the output of an $(\epsilon, \delta)$-DP mechanism is nearly indistinguishable, regardless of differences in a single record. The parameter $\epsilon$ is known as the privacy budget; a smaller $\epsilon$ indicates a stronger privacy protection level, and $\delta$ represents the probability of breaking $\epsilon$-DP.
Typically, $\epsilon$-DP and $(\epsilon, \delta)$-DP assurances can be achieved through the Laplace mechanism and the Gaussian mechanism, respectively, but this paper focuses on guaranteeing $(\epsilon, \delta)$-DP by adding random noise that conforms to the Gaussian distribution to the output function $f$. To meet the requirements of DP, this mechanism controls the noise standard deviation $\sigma$ so that it satisfies the following condition:
$$\sigma \ge \frac{\sqrt{2 \ln(1.25/\delta)}\, \Delta_2 f}{\epsilon}.$$
Here, the notation $\Delta_2 f$ stands for the $\ell_2$-norm sensitivity of the function $f$.
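As a quick illustration of this calibration rule, the helper below computes the smallest $\sigma$ satisfying the sufficient condition above; the function name is ours.

```python
import math

def gaussian_noise_std(l2_sensitivity, epsilon, delta):
    """Classical Gaussian-mechanism calibration: smallest sigma satisfying
    sigma >= sqrt(2 ln(1.25/delta)) * Delta_2 f / epsilon."""
    return math.sqrt(2 * math.log(1.25 / delta)) * l2_sensitivity / epsilon

# e.g., sensitivity 1.0 and (epsilon, delta) = (1.0, 1e-5) give sigma ~ 4.84
print(gaussian_noise_std(1.0, 1.0, 1e-5))
```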
Sampling processes are frequently employed in machine learning algorithms. The privacy amplification property of DP [24] demonstrates that a DP mechanism applied to a randomly chosen subset of the dataset provides stronger privacy guarantees than the same mechanism applied to the entire dataset.
Lemma 1. (Privacy amplification by subsampling [25]): If $\mathcal{M}$ is $(\epsilon, \delta)$-DP, then the mechanism obtained by applying $\mathcal{M}$ to a random subsample drawn with probability $q$ obeys $(\epsilon', \delta')$-DP, with $\epsilon' = \ln(1 + q(e^{\epsilon} - 1))$ and $\delta' = q\delta$.
The privacy amplification theorem demonstrates that, by sub-sampling, it is possible to effectively decrease the noise variance needed to attain the desired level of privacy protection, as specified by DP. In a broader sense, the lemma suggests that it is vital to exploit the randomness in sub-sampling because, if $\mathcal{M}$ is $(\epsilon, \delta)$-DP, then the sub-sampled mechanism with probability $q$ obeys approximately $(q\epsilon, q\delta)$-DP for a sufficiently small $\epsilon$.
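A small numeric check of Lemma 1 (illustrative): for small $\epsilon$, the amplified budget $\ln(1 + q(e^{\epsilon} - 1))$ is close to $q\epsilon$.

```python
import math

def amplified_epsilon(epsilon, q):
    """Privacy amplification by subsampling (Lemma 1): an (epsilon, delta)-DP
    mechanism run on a q-subsample is (ln(1 + q(e^eps - 1)), q*delta)-DP."""
    return math.log(1 + q * (math.exp(epsilon) - 1))

# ~0.0065, of the same order as q * epsilon = 0.005
print(amplified_epsilon(0.5, 0.01))
```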
2.3. Impact of Non-IID Data
In every global iteration of FL, each client aims to reduce its loss function based on its local data. The existence of non-IID attributes in the local dataset can result in significant discrepancies between the local and the global model. In certain instances, it has been shown that the gradients of local models may point in a direction contrary to that of the global model, leading to a phenomenon known as local model drift [
12,
26]. Put differently, the updated local model is biased towards the local optimum and deviates from the global optimum. If the parameters of these drifted local models are uploaded to the PS for aggregation, the precision of the global model will suffer, and significant network capacity will be consumed, decreasing communication efficiency.
Figure 1 illustrates the FedAvg problem in both IID and non-IID scenarios. In IID scenarios, the global optimum lies close to each local optimum; in other words, the globally averaged model converges towards the global optimum. In non-IID scenarios, the discrepancy between the global and local optima results in a considerable distance between the averaged global model and the global optimal state. Hence, it is imperative to investigate methodologies for developing proficient FL in non-IID scenarios.
3. System Model
In this section, we introduce the privacy-preserving FL algorithm (HW-DPFL), which is designed on the basis of the concept of probability distribution similarity of data labels. Subsequently, the method’s specific process is described.
Firstly, it is vital to note that, in the FedAvg algorithm, the PS is responsible for aggregating and averaging the local model parameters. Thus, the effectiveness of FedAvg is greatly influenced by the weighting method employed. Typically, the weight assigned to each local dataset is determined by calculating the ratio of that dataset to the entire dataset. Nevertheless, in non-IID cases, this approach can have an impact on the rate of convergence and potentially compromise privacy. Hence, it is imperative to choose a more suitable approach for determining the weight. To address the problems at hand, this section presents a privacy-preserving FL approach called HW-DPFL, which leverages the similarity of probability distributions of data labels. The flow of the algorithm is depicted in
Figure 2. During the process of model aggregation, the algorithm computes the Hellinger distance of the label distribution for each client’s dataset. It then extracts the local model information from this calculation and aggregates it using an updated weighting approach. The proposed approach mitigates the challenges associated with training non-IID data and enhances the efficiency of model training.
In each iteration $t$, the label distribution of the client $k$ dataset can be represented by the label vector $P_k$:
$$P_k = \left( \frac{l_k^1}{n_k}, \frac{l_k^2}{n_k}, \ldots, \frac{l_k^C}{n_k} \right),$$
where $C$ denotes the total number of label types, and $l_k^c$ indicates the number of $c$-type labels possessed by client $k$.
The Hellinger distance is computed between the label distribution $P_k$ of the client $k$ local dataset and the standard balanced data label distribution $S = (1/C, \ldots, 1/C)$:
$$H(P_k, S) = \frac{1}{\sqrt{2}} \sqrt{\sum_{c=1}^{C} \left( \sqrt{P_k^c} - \sqrt{S^c} \right)^2}.$$
Hellinger distance is a metric employed in the field of probability and statistics to quantify the degree of similarity between two probability distributions [
27]. In the context of non-IID data, the Hellinger distance can be employed to assess the similarity between two label distributions, hence enabling algorithmic enhancements. Accordingly, the measure of similarity between each client's local dataset and the designated standard balanced dataset can be determined by computing the Hellinger distance.
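The following snippet computes the Hellinger distance between a client's empirical label distribution and the balanced reference $S$; the example counts are hypothetical.

```python
import numpy as np

def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions:
    H(p, q) = (1/sqrt(2)) * ||sqrt(p) - sqrt(q)||_2; ranges from 0 to 1."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)

# Client label histogram vs. the balanced reference S (uniform over C classes)
counts = np.array([50, 30, 10, 5, 5], dtype=float)   # hypothetical 5-class client
p_k = counts / counts.sum()
s = np.full(5, 1 / 5)
print(hellinger_distance(p_k, s))   # 0 would mean perfectly balanced data
```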
The parameters used to update the aggregation weights vary with the number of iterations.
Furthermore, the PS is assumed to be honest-but-curious: while adhering to the FL protocol, it remains interested in the clients' data information. Simultaneously, the system is susceptible to external attacks during the transmission of model parameters. To address this issue, we add noise drawn from a Gaussian distribution, thereby ensuring DP. Algorithm 1 summarizes the privacy-preserving FL method proposed in this research, which is founded on the similarity of data label distributions (HW-DPFL).
Algorithm 1: HW-DPFL
Input: $N$: number of clients; $B$: local batch size; $E$: number of local training epochs; $q$: proportion of clients participating in training; $\eta$: learning rate; $S$: standard balanced data label distribution.
Output: model parameter $w$.
The PS does:
  Initialize global model parameters $w^0$
  for each round $t = 1, 2, \ldots$ do
    $M \leftarrow \max(\lceil q \cdot N \rceil, 1)$  // Determine the number of clients for this round of communication
    Randomly select $M$ clients to participate in training
    for each client $k$ in parallel do
      $p_k \leftarrow$ GetWeight($k$)
      $w_k^{t+1} \leftarrow$ ClientUpdate($k$, $w^t$)
    $w^{t+1} \leftarrow$ HW-DPFL($\{w_k^{t+1}, p_k\}$)
def GetWeight($k$):  // Get the aggregation weight of client $k$
  Obtain the local dataset label distribution $P_k$
  Compute the similarity of the client's data labels via the Hellinger distance $H(P_k, S)$
  return the aggregation weight $p_k$
def HW-DPFL($\{w_k^{t+1}, p_k\}$):  // Weighted aggregation
  return $w^{t+1} = \sum_k p_k w_k^{t+1}$
def ClientUpdate($k$, $w$):  // Model update
  Split the local dataset into batches of size $B$
  for each local epoch from 1 to $E$ do
    for each batch $b$ do
      $w \leftarrow w - \eta \nabla \ell(w; b)$  // Train each batch of data
  return $w$ to the server
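To tie the pieces together, here is a runnable sketch of one HW-DPFL round. Since the paper's exact iteration-dependent weighting rule is not reproduced here, the sketch substitutes a simple normalized similarity weighting ($1 -$ Hellinger distance); treat it as an illustration of the structure, not the paper's precise rule.

```python
import numpy as np

def hellinger(p, q):
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)

def hw_dpfl_round(global_w, clients, s, lr=0.05, epochs=5, sigma=0.5,
                  rng=np.random.default_rng(0)):
    """One HW-DPFL round (illustrative): clients is a list of
    (grad_fn, label_dist) pairs; s is the balanced reference distribution."""
    noisy_updates, dists = [], []
    for grad_fn, label_dist in clients:
        w = global_w.copy()
        for _ in range(epochs):                    # local SGD steps
            w -= lr * grad_fn(w)
        w += rng.normal(0.0, sigma, size=w.shape)  # Gaussian noise before upload (DP)
        noisy_updates.append(w)
        dists.append(hellinger(label_dist, s))
    sims = 1.0 - np.array(dists)                   # similarity = 1 - Hellinger distance
    weights = sims / sims.sum()                    # normalized aggregation weights
    return sum(p * w for p, w in zip(weights, noisy_updates))
```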
During the process of model iteration, the method introduces noise into the model parameter information, thereby perturbing the data in a manner that significantly hinders an attacker's ability to extract meaningful information from it. The determination of the noise parameters and the privacy budget in DP is contingent upon the specific requirements for privacy protection. The composition theorem enables clients to effectively compute the privacy loss incurred throughout each iteration of the training process. In order to enhance clarity, the
$t$th round of the HW-DPFL algorithm might be denoted as follows:
$$w_k^{t+1} = \mathrm{SGD}_k\big(w^t\big) + \mathcal{N}\big(0, \sigma^2 I_d\big), \qquad w^{t+1} = \sum_{k} p_k\, w_k^{t+1},$$
where $w^t$ is the global model parameter of round $t$, $w_k^{t+1}$ denotes the (noise-perturbed) local model parameter of client $k$ in round $t$, $\mathrm{SGD}_k(\cdot)$ means the local stochastic gradient descent process of client $k$, and $d$ is the dimension of the model parameters.
4. Privacy Analysis
In this section, we focus on the analysis of the privacy guarantees offered by the HW-DPFL algorithm. We begin by analyzing the sensitivity of the local parameter update function in relation to the $\ell_2$-norm. Following this, we assess the level of privacy guaranteed in each iteration. Finally, we calculate the total privacy budget after the conclusion of all $T$ iterations.
4.1. $\ell_2$-Norm Sensitivity
To achieve DP, we incorporate the Gaussian mechanism with $\ell_2$-norm sensitivity by introducing noise. Thus, we first elucidate the sensitivity of the local parameter update function.
Assumption 1. Suppose $\xi_k^t$ is a uniform random sample from the local data of client $k$ in iteration $t$. The squared norm of the stochastic gradients is uniformly bounded; that is, for all $k$ and $t$, $\|\nabla F_k(w_k^t, \xi_k^t)\|^2 \le G^2$.
Paper [
21]
has successfully used Assumption 1 in DP-based proofs; in practice, the bound can be enforced through gradient clipping [
28].
Lemma 2. If Assumption 1 holds, then the $\ell_2$-norm sensitivity of the local update parameters for user $k$ in iteration $t$ is bounded by $\Delta_k^t \le \frac{2\eta E G}{B}$.
4.2. Privacy Guarantee in Round $t$
Subsequently, a sub-sampling privacy amplification lemma is employed to reduce the noise variance while ensuring that each client's privacy constraint holds in every iteration.
Theorem 1. With sampling without replacement in mini-batches, given that the noise level satisfies $\sigma_k \ge \frac{\sqrt{2\ln(1.25/\delta)}\,\Delta_k^t}{\epsilon}$ and the added noise is obtained by sampling from a Gaussian distribution $\mathcal{N}(0, \sigma_k^2 I_d)$, the sub-sampled local update of client $k$ guarantees $\big(\ln(1 + q(e^{\epsilon} - 1)),\, q\delta\big)$-DP, where the sampling probability is $q = B/n_k$.
Proof of Theorem 1. By the Gaussian mechanism, the stated noise level guarantees $(\epsilon, \delta)$-DP for an update computed on the full local dataset. According to privacy amplification by sub-sampling, the same noise level in fact achieves $\big(\ln(1 + q(e^{\epsilon} - 1)), q\delta\big)$-DP once mini-batches are sub-sampled with probability $q$. Since
$$\ln\big(1 + q(e^{\epsilon} - 1)\big) \le q\epsilon,$$
we can then obtain that the Gaussian noise level achieves at least a $(q\epsilon, q\delta)$-DP guarantee. Specifically, in iteration $t$, in order to satisfy the $(\epsilon, \delta)$ guarantee of client $k$, the Gaussian noise level can be decreased to
$$\sigma_k = \frac{q\,\sqrt{2\ln(1.25\,q/\delta)}\,\Delta_k^t}{\epsilon}.$$
The proof is finished. □
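The following sketch contrasts the noise level calibrated without amplification against the decreased level derived above; the formulas mirror the (reconstructed) calibration and are illustrative.

```python
import math

def subsampled_noise_std(l2_sensitivity, epsilon, delta, q):
    """Noise level after exploiting subsampling amplification (illustrative):
    calibrate the Gaussian mechanism to the relaxed pre-amplification budget
    (epsilon/q, delta/q), which q-subsampling then tightens back to
    roughly (epsilon, delta)."""
    return q * math.sqrt(2 * math.log(1.25 * q / delta)) * l2_sensitivity / epsilon

full = math.sqrt(2 * math.log(1.25 / 1e-5)) * 1.0 / 1.0   # no amplification: ~4.84
print(full, subsampled_noise_std(1.0, 1.0, 1e-5, q=0.1))  # amplified: ~0.43
```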
4.3. The Total Privacy Loss
In this paper, we employ the moment accountant approach to quantify the cumulative privacy loss across $T$ rounds. This approach offers a tighter bound on the total privacy loss than prior composition-based analyses.
Theorem 2. Assume that the noise obeys a Gaussian distribution $\mathcal{N}(0, \sigma^2 I_d)$; then, after $T$ training rounds, the HW-DPFL algorithm guarantees $(\epsilon, \delta)$-DP provided that
$$\sigma \ge \frac{c\, q \sqrt{T \ln(1/\delta)}\, \Delta}{\epsilon}$$
for some constant $c$, where $\Delta$ is the $\ell_2$-norm sensitivity from Lemma 2.
Proof of Theorem 2. According to [28], we define the log of the moment-generating function evaluated at $\lambda$ for client $k$ in iteration $t$ as
$$\alpha_k^t(\lambda) = \ln \max\Big( \mathbb{E}_{z \sim \mu}\big[(\mu(z)/\mu_0(z))^{\lambda}\big],\; \mathbb{E}_{z \sim \mu_0}\big[(\mu_0(z)/\mu(z))^{\lambda}\big] \Big).$$
Suppose that $\mu_0$ and $\mu_1$ stand for the probability density functions of $\mathcal{N}(0, \sigma^2)$ and $\mathcal{N}(\Delta, \sigma^2)$, respectively. Let $\mu$ denote the mixture of the two Gaussian distributions, $\mu = (1 - q)\mu_0 + q\mu_1$. Therefore, we have
$$\alpha_k^t(\lambda) \le \frac{q^2 \lambda(\lambda + 1)}{(1 - q)\,\sigma^2} + O\big(q^3 \lambda^3 / \sigma^3\big).$$
According to the composability of the moment accountant method and Lemma 3 in [27], we have
$$\alpha(\lambda) \le \sum_{t=1}^{T} \alpha_k^t(\lambda) \le \frac{T q^2 \lambda(\lambda + 1)}{(1 - q)\,\sigma^2} + T\, O\big(q^3 \lambda^3 / \sigma^3\big).$$
Next, following Theorem 2.2 in [28], the HW-DPFL algorithm satisfies $(\epsilon, \delta)$-DP for
$$\delta = \min_{\lambda} \exp\big(\alpha(\lambda) - \lambda \epsilon\big).$$
Since the exponent above is a quadratic function of $\lambda$, we assume that the higher-order terms are negligible. Then,
$$\ln(1/\delta) \le \lambda^* \epsilon - \frac{T q^2 \lambda^* (\lambda^* + 1)}{(1 - q)\,\sigma^2},$$
where $\lambda^*$ is the minimum point of the quadratic function. To make the HW-DPFL algorithm satisfy $(\epsilon, \delta)$-DP, let
$$\sigma = \frac{c\, q \sqrt{T \ln(1/\delta)}\, \Delta}{\epsilon}.$$
The proof is finished. □
The coexistence of client sub-sampling and mini-batch sampling contributes to accelerating the convergence of Stochastic Gradient Descent (SGD) [
29]. Moreover, as stated in Theorem 2, when both the sampling probability $q$ and the number of rounds $T$ are large, a higher level of noise must be injected in order to guarantee differential privacy; this increased noise may in turn hinder the convergence of the algorithm. This suggests a trade-off between the speed at which the algorithm reaches convergence and the degree of privacy protection. This trade-off is analyzed further in the following section.
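For intuition, the snippet below evaluates a simplified moments-accountant bound of the form used in the proof, dropping the higher-order terms; the parameter values are hypothetical.

```python
import math

def moments_epsilon(q, sigma, T, delta, max_lambda=64):
    """Illustrative moments-accountant bound:
    alpha(lambda) <= T * q^2 * lambda * (lambda + 1) / ((1 - q) * sigma^2),
    then epsilon = min over lambda of (alpha(lambda) + ln(1/delta)) / lambda."""
    best = float("inf")
    for lam in range(1, max_lambda + 1):
        alpha = T * q * q * lam * (lam + 1) / ((1 - q) * sigma * sigma)
        best = min(best, (alpha + math.log(1 / delta)) / lam)
    return best

# e.g., q = 0.01, sigma = 4, T = 1000 rounds, delta = 1e-5 -> epsilon ~ 0.55
print(moments_epsilon(0.01, 4.0, 1000, 1e-5))
```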
5. Convergence Analysis
This section primarily focuses on the analysis of the convergence of the HW-DPFL algorithm described herein. Let us commence by establishing certain assumptions.
Assumption 2. For all $k$, each $F_k$ is $L$-smooth; i.e., for all $x$ and $y$, $F_k(y) \le F_k(x) + \langle \nabla F_k(x), y - x \rangle + \frac{L}{2}\|y - x\|^2$.
Assumption 3. For all $k$, each $F_k$ is $\mu$-strongly convex; i.e., for all $x$ and $y$, $F_k(y) \ge F_k(x) + \langle \nabla F_k(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2$.
Assumption 4. For all $k$, the stochastic gradients of each client satisfy $\mathbb{E}\big[\|\nabla F_k(w, \xi_k) - \nabla F_k(w)\|^2\big] \le \varsigma_k^2$ and $\mathbb{E}\big[\|\nabla F_k(w, \xi_k)\|^2\big] \le G^2$.
Let $F^*$ and $F_k^*$ denote the optimal values of the total objective function and of the objective function of client $k$, respectively. According to [
30], we assume that the degree of data heterogeneity can be expressed as $\Gamma = F^* - \sum_{k=1}^{N} p_k F_k^*$. It can be observed that, when the client data are IID, $\Gamma \to 0$. The more heterogeneous the data, the greater the value of $\Gamma$.
Suppose $w_k^t$ is the model parameter of client $k$ in round $t$, and $E$ is the total number of local epochs. The set $\mathcal{I}_E = \{E, 2E, \ldots\}$ represents the times at which the clients communicate with the PS. A subset of clients is randomly selected to participate in the training according to the sampling scheme: if $t + 1 \in \mathcal{I}_E$, the PS aggregates the local model parameters to obtain the global model and sends the latest model parameters to each client; if $t + 1 \notin \mathcal{I}_E$, each client updates its local model parameters with its local data. Because clients participating in the training have to perform multiple iterations locally, we use an intermediate variable $v_k^{t+1}$ to represent the result of one step of SGD, and the updated results can be expressed as
$$v_k^{t+1} = w_k^t - \eta_t \nabla F_k\big(w_k^t, \xi_k^t\big), \qquad
w_k^{t+1} = \begin{cases} v_k^{t+1}, & t + 1 \notin \mathcal{I}_E, \\ \sum_{k} p_k \big( v_k^{t+1} + \mathcal{N}(0, \sigma^2 I_d) \big), & t + 1 \in \mathcal{I}_E. \end{cases}$$
We further define the averaged sequences $\bar{w}^t = \sum_k p_k w_k^t$ and $\bar{v}^t = \sum_k p_k v_k^t$.
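The update scheme above can be simulated directly; the sketch below uses uniform aggregation weights and omits client sub-sampling for brevity, so it illustrates the $v_k^{t+1}$ / $\mathcal{I}_E$ structure rather than the full algorithm.

```python
import numpy as np

def run_scheme(grad_fns, w0, eta, E, T, sigma, rng=np.random.default_rng(0)):
    """Simulate the update scheme used in the analysis: clients take local SGD
    steps (intermediate variable v); every E steps (t+1 in I_E) the PS adds
    Gaussian noise and averages (uniform weights here for brevity)."""
    ws = [w0.copy() for _ in grad_fns]       # per-client parameters w_k^t
    for t in range(T):
        vs = [w - eta * g(w) for w, g in zip(ws, grad_fns)]   # one-step SGD: v_k^{t+1}
        if (t + 1) % E == 0:                 # t+1 in I_E: communication round
            noisy = [v + rng.normal(0, sigma, v.shape) for v in vs]
            w_bar = np.mean(noisy, axis=0)   # aggregation (sub-sampling omitted)
            ws = [w_bar.copy() for _ in grad_fns]
        else:
            ws = vs                          # t+1 not in I_E: keep updating locally
    return ws[0]
```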
In order to enhance the comprehensibility of the proof, we shall introduce the subsequent lemma.
Lemma 3. (Result for each round $t$): In iteration $t$, suppose that Assumptions 1 to 4 hold. Then,
$$\mathbb{E}\big\|\bar{v}^{t+1} - w^*\big\|^2 \le (1 - \eta_t \mu)\, \mathbb{E}\big\|\bar{w}^t - w^*\big\|^2 + \eta_t^2 Q,$$
where
$$Q = \sum_{k=1}^{N} p_k^2 \varsigma_k^2 + 6L\Gamma + 8(E - 1)^2 G^2,$$
and where $w^*$ stands for the global optimal solution.
Theorem 3. Suppose Assumptions 1 to 4 hold; then, with $\kappa = L/\mu$, $\gamma = \max\{8\kappa, E\}$, and a diminishing step size $\eta_t = \frac{2}{\mu(\gamma + t)}$, the convergence rate of the HW-DPFL algorithm satisfies
$$\mathbb{E}\big[F(\bar{w}^T)\big] - F^* \le \frac{\kappa}{\gamma + T} \left( \frac{2(Q + C)}{\mu} + \frac{\mu \gamma}{2}\, \mathbb{E}\big\|\bar{w}^0 - w^*\big\|^2 \right),$$
where $C$ captures the variance introduced by client sampling and the injected DP noise.
Proof of Theorem 3. If $t + 1 \notin \mathcal{I}_E$, it can be observed that $\bar{w}^{t+1} = \bar{v}^{t+1}$; if $t + 1 \in \mathcal{I}_E$, the two are not equal. Assuming that there is no communication loss among the selected clients in each round, we require that the model parameters obtained after sub-sampling and aggregation be unbiased; thus, in the HW-DPFL algorithm, when $t + 1 \in \mathcal{I}_E$, we have
$$\mathbb{E}_{\mathcal{M}^t}\big[\bar{w}^{t+1}\big] = \bar{v}^{t+1}.$$
Here, $\mathbb{E}_{\mathcal{M}^t}[\cdot]$ is used to express the expectation over the randomly selected subset of clients.
Lemma 4. (Bounding the variance of $\bar{w}^{t+1}$ [
29]): If the PS samples uniformly without replacement, then the variance of $\bar{w}^{t+1}$ is bounded by
$$\mathbb{E}_{\mathcal{M}^t}\big\|\bar{w}^{t+1} - \bar{v}^{t+1}\big\|^2 \le \frac{N - M}{N - 1}\, \frac{4}{M}\, \eta_t^2 E^2 G^2.$$
Then, we use $\eta_t = \frac{2}{\mu(\gamma + t)}$ and define the term $\Delta_t = \mathbb{E}\|\bar{w}^t - w^*\|^2$.
Case 1. If $t + 1 \notin \mathcal{I}_E$, then $\bar{w}^{t+1} = \bar{v}^{t+1}$ because no aggregation takes place. According to Lemma 3, we have
$$\Delta_{t+1} \le (1 - \eta_t \mu)\, \Delta_t + \eta_t^2 Q.$$
Case 2. If $t + 1 \in \mathcal{I}_E$, according to Lemmas 3 and 4, it follows that
$$\Delta_{t+1} \le (1 - \eta_t \mu)\, \Delta_t + \eta_t^2 (Q + C).$$
Unrolling the recursion, we can obtain
$$\Delta_t \le \frac{v}{\gamma + t}, \qquad v = \max\Big\{ \frac{4(Q + C)}{\mu^2},\; \gamma \Delta_0 \Big\}.$$
Since $F$ is $L$-smooth, we have
$$\mathbb{E}\big[F(\bar{w}^t)\big] - F^* \le \frac{L}{2}\, \Delta_t \le \frac{L}{2}\, \frac{v}{\gamma + t}.$$
The proof is finished. □
By Theorem 3, the convergence upper bound of the HW-DPFL algorithm is affected by several factors, namely, the number of transmission rounds $T$, the mini-batch size $B$, the noise level $\sigma$, and the number of local update steps $E$. It is important to recognize that an increase in $T$ has the potential to enhance the algorithm's convergence, and increasing the local mini-batch size $B$ can likewise improve the convergence rate. Nevertheless, the convergence rate may be impeded by large values of $E$ and $\sigma$: increasing the noise level improves privacy protection but may decrease the rate of convergence.
6. Experiment
In this section, we assess the efficacy of the HW-DPFL. The experiments primarily employ Convolutional Neural Networks for the purpose of classifying the MNIST dataset.
MNIST dataset: The dataset was publicly provided by the National Institute of Standards and Technology. It is a grayscale image dataset consisting of 70,000 handwritten digit images, each associated with a numerical label ranging from 0 to 9. The resolution of each image is fixed at 28 × 28 pixels; although relatively low, this resolution has been widely accepted and effectively applied in practice. Some image examples from the MNIST dataset are shown in
Figure 3.
A total of 60,000 images were designated as the training dataset, while the remaining 10,000 images were allocated for testing the model. During model training, the overall number of clients is specified and the 60,000 images are distributed evenly among them, so that each client receives an equal share of 600 images. The proportion of clients whose data are independent and identically distributed is set to 0.8.
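A possible partition routine matching this setup (100 clients, 600 images each, 80% IID) is sketched below; the label-skew construction for the remaining clients is our illustrative choice, as the paper does not spell out the skew mechanism.

```python
import numpy as np

def partition_mnist(labels, n_clients=100, shard=600, iid_frac=0.8,
                    rng=np.random.default_rng(0)):
    """Split sample indices among clients: iid_frac of clients receive a
    uniform random shard; the rest receive label-skewed shards (slices of
    label-sorted data), a common way to emulate non-IID label distributions."""
    idx = rng.permutation(len(labels))
    n_iid = int(iid_frac * n_clients)
    clients = [idx[i * shard:(i + 1) * shard] for i in range(n_iid)]  # IID clients
    rest = idx[n_iid * shard:]
    rest = rest[np.argsort(labels[rest])]                             # sort remainder by label
    clients += [rest[i * shard:(i + 1) * shard]
                for i in range(n_clients - n_iid)]                    # skewed clients
    return clients

# e.g., with 60,000 synthetic labels in {0..9}:
labels = np.random.default_rng(1).integers(0, 10, 60_000)
parts = partition_mnist(labels)
print(len(parts), len(parts[0]))   # 100 clients, 600 samples each
```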
Parameter setting: We set the maximal local gradient norm to 1. It should be noted that the loss function is defined as cross-entropy, which corresponds to a strongly convex optimization problem under the assumptions above:
- (1)
Impact of local mini-batch size $B$: The impact of varying local mini-batch sizes on the training loss of the HW-DPFL algorithm is depicted in
Figure 4. We evaluate several values of the local mini-batch size $B$. Based on the experimental results, it is not difficult to see that an optimal value exists in both settings. In the IID case, an increase in $B$ results in accelerated convergence and a greater reduction in training loss; nevertheless, the outcome is detrimental once the size exceeds a certain threshold. The decrease in training loss is more pronounced when handling non-IID data, and the disparity in convergence between distinct $B$ values is more noticeable;
- (2)
Impact of the number of local update steps $E$: We also analyze the performance of the HW-DPFL algorithm with different numbers of local update steps $E$. The outcomes are depicted in
Figure 5. For a fixed $B$, there is an optimal $E$ value which makes the HW-DPFL perform best in both scenarios. Moreover, increasing the value of $E$ can result in expedited algorithm convergence. Nevertheless, the rate of convergence decelerates significantly when $E$ is too large. In addition, for non-IID data, an excessively large $E$ results in a higher degree of variability in the training loss: a larger $E$ can produce more significant variations in weights among clients, hence impeding the convergence of the HW-DPFL;
- (3)
Impact of the noise level $\sigma$: The experimental results of the HW-DPFL with different noise levels $\sigma$ are presented in
Figure 6. The results indicate a steady increase in training loss as the noise level increases; this can be attributed to the detrimental impact of high noise levels on the model's convergence performance. In both IID and non-IID cases, the training loss of the HW-DPFL exhibits an initial steep decline, and in non-IID scenarios the training loss experiences a greater reduction. Furthermore, the HW-DPFL has the potential to enhance the resilience of the training model against the noise injected for DP;
The above experiments examine the impact of various factors on the efficacy of the HW-DPFL algorithm, which demonstrates strong performance across several data characteristics. When the data follow the IID assumption, appropriately raising the local mini-batch size $B$ and the number of local update steps $E$ enhances the convergence speed and decreases the training loss, although surpassing a particular threshold produces the reverse effect. When dealing with non-IID data, increasing the value of $B$ can effectively decrease the training loss; however, as the value of $E$ increases, there is a corresponding increase in the variability of the training loss. Furthermore, it is vital to consider the trade-off between utility and privacy in both IID and non-IID scenarios, since an excessive noise level significantly impacts the convergence performance of the model. In non-IID settings, the HW-DPFL algorithm exhibits reduced training losses and an enhanced capacity for improving the robustness of the model. Hence, the performance of the HW-DPFL algorithm can be tuned through the local mini-batch size $B$, the number of local update steps $E$, and the noise level $\sigma$: $B$ should be chosen in accordance with the IID characteristics of the data; $E$ should be kept within a moderate range to prevent fluctuations and avoid slowing convergence; and $\sigma$ should strike a balance between utility and privacy, guaranteeing privacy while preserving the desired level of utility. These adjustments enhance the training efficacy of the HW-DPFL algorithm and bolster the resilience of the model;
- (4)
Algorithm performance comparison: In IID and non-IID scenarios, HW-DPFL exhibits a greater level of accuracy compared to both the DP-FedAvg [
8] and DP-FL [
19]. Simultaneously, HW-DPFL demonstrates comparable accuracy to the DP-FL algorithm in the non-IID case in
Table 1, thereby confirming the practicality and efficacy of the HW-DPFL algorithm in non-IID data scenarios.
7. Discussion
This section examines three key aspects: data dissemination, privacy protection, and training time. In this study, we evaluate the efficacy of three techniques in the context of non-IID data, using heterogeneous and homogeneous models.
The primary focus of HW-DPFL lies in training non-IID data inside both homogeneous and heterogeneous models while emphasizing the implementation of robust privacy protection measures. The process of fine-tuning primarily takes place during the training stage and relies on weight aggregation. The Hellinger distance metric is also utilized to quantify the similarity between two probability distributions [
26]. The performance of a system is influenced by the configuration of its models and the distribution of its data, both at the local level throughout numerous iterations and at the global level during aggregations. The substantial variance in updates leads to a departure of the global model from the genuine optimization outcomes.
Models are commonly perceived as repositories for the knowledge derived from diverse datasets. The complexity of a model is influenced by various factors, including its structural design, its dimensions, the distribution of the data, and the size of the dataset. Increasing the number of hidden units or parameters can increase generalization error. When various techniques are used to train models of differing complexity under identical conditions, such as a CNN, the accuracy of the deeper model surpasses that of a shallow model; however, the training time is extended.
During the preparation of this paper, it has come to our attention that a study conducted by [
3] in the IEEE Internet of Things Journal in January 2023 explores an issue closely related to our research. It also investigates the application of FL to non-IID datasets using DP techniques, yielding promising outcomes; however, it does not cover the problems addressed in our study. Four distinct points of divergence exist between our paper and the work mentioned above: (1) The optimization of the gradient in FL was enhanced by [
3] by utilizing historical gradient information. In contrast, our approach focuses on optimizing the gradient by adjusting the server-side aggregation strategy of parameters; (2) The reference [
3] employs the K-means algorithm to cluster the label distribution of user data, aiming to address the issue of non-IID data. In our work, we utilize the Hellinger distance as a metric to quantify the difference between the IID and non-IID distributions; (3) Regarding the DP mechanism, [
3] employs Laplace noise and a simple combination theorem to calculate privacy loss. In contrast, we introduce Gaussian noise and utilize the moment accountant method to calculate privacy loss; (4) The reference [
3] solely presents empirical experiments to demonstrate their results, while we provide theoretical proof of privacy and convergence in our work.
The HW-DPFL is configured with three distinct levels of noise, which correspond to different DP privacy protection levels. These hyper-parameters determine the privacy protection level for both the data and the models.
8. Conclusions
This paper has studied an FL framework for non-IID data and proposed a novel approach, called HW-DPFL, based on the weighted aggregation of data distributions, which aims to improve FL's efficiency and protect FL's privacy in non-IID data scenarios. Based on the Hellinger distance, the algorithm quantifies how balanced the clients' local private data labels are and re-adjusts the aggregation weights of FL on the PS, so that the algorithm converges faster while ensuring that the client information is fully exploited during training. To deal with the problem of information leakage, we add Gaussian noise to the shared parameters before uploading them to the PS; the algorithm thereby achieves local differential privacy with adjustable noise in FL architectures. Theoretical guarantees on the privacy protection capabilities and the convergence of HW-DPFL were derived, and the algorithm was subsequently assessed on the MNIST dataset. The experimental findings demonstrated the improvements of HW-DPFL on non-IID data across several dimensions and suggest that HW-DPFL offers practical usefulness and robust convergence in the face of non-IID data. Moreover, DP is incorporated into the upgraded FL framework to ensure the scheme's privacy.
Further research can examine the theorems and the efficacy of HW-DPFL in more depth. Additionally, it is important to address other non-IID settings, such as feature-based non-IID scenarios, and the potential of DP-shuffle can be explored by manipulating various levels of noise. Furthermore, because it is a locally sampled federated scheme, HW-DPFL has the potential for seamless integration into many upcoming federated learning frameworks as a fundamental building block.