Article

Federated Distillation Methodology for Label-Based Group Structures

Digital Health Care R&D Department, Korea Institute of Industrial Technology, Cheonan 31056, Republic of Korea
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(1), 277; https://doi.org/10.3390/app14010277
Submission received: 10 October 2023 / Revised: 31 October 2023 / Accepted: 10 November 2023 / Published: 28 December 2023

Abstract

In federated learning (FL), clients train models locally without sharing raw data, thereby preserving data privacy. Federated distillation, in particular, transfers knowledge to clients regardless of the model architecture. However, when groups of clients with different label distributions exist, sharing the same knowledge among all clients becomes impractical. To address this issue, this paper presents an approach that clusters clients based on how their locally trained models label a public dataset: each client's model predicts labels for the public data, and clients are grouped according to these per-label prediction counts. Evaluations on MNIST and CIFAR show that our method effectively identifies group membership, increasing accuracy by up to 75% over existing methods when the label distributions differ significantly between groups. In addition, we observed significant performance improvements for smaller client groups, bringing us closer to fair FL.

1. Introduction

Federated learning (FL) enables multiple clients to contribute to training a global machine learning model without sharing their data with a central server [1]. Clients perform computations on their local data and send only the model updates to a central server, which aggregates them to improve the global model. The global model is then redistributed to the clients for subsequent training rounds. This framework ensures data safety by storing the data only on client devices, thereby minimizing the risk of breaches [2]. In the context of healthcare, this approach is particularly valuable because it enables collaborative research and model training across multiple medical institutions while complying with strict privacy regulations and minimizing the risk of exposing sensitive patient data [3].
Distillation is a machine learning technique that trains a simpler model, called a student, to mimic the actions of a more complex model, called a teacher, which typically improves efficiency without sacrificing accuracy. Federated distillation extends this approach to a decentralized setting, allowing many devices to train a student model collaboratively while keeping their data localized [4]. Recently, federated distillation has attracted considerable attention. Federated distillation captures and communicates the learning experience via logits, which are the pre-activation function outputs of individually trained models. This approach significantly reduces the communication overhead compared to traditional FL [4]. It also provides a balance between flexibility and security. Clients can use models suitable for their computational capabilities [5]. At the same time, the risks of information exposure are significantly reduced by transmitting only distilled knowledge via logits rather than raw data, thereby increasing the data privacy level [6].
Traditional federated distillation methods rely on a uniform global logit, which reduces accuracy when the data have a distinct group structure. Label distribution skew [1], in which data labels are not equally distributed across clients, is a common problem in federated learning, yet traditional federated distillation trains all clients with the same global logit, which is likely to be sub-optimal for each client's data distribution. To illustrate this, consider FL between hospitals specializing in different types of medical treatment. A hospital specializing in cancer will have a dataset containing only different types of cancer, whereas a hospital specializing in infectious diseases will have images labeled "infection". In this situation, it is not appropriate to train all clients with a single global logit; it is better to divide the clients into groups based on the labels they hold and to train each group with a logit that reflects its label distribution.
Although clustering techniques exist in FL, to the best of our knowledge, no method has integrated clustering with federated distillation. Most clustering algorithms in FL use model parameters [7,8] for clustering. However, federated distillation exchanges only the outputs of the client models, not their parameters, so this approach is not applicable. A clustering method suited to federated distillation must therefore rely on the model outputs. To focus on label distribution skew, we use the labels predicted by the client models: we propose a method that clusters client models based on the number of times they predict each label. Figure 1 illustrates our algorithm, which uses the cluster information for effective distillation. In practice, the number of groups is often unknown; our algorithm addresses this by using hierarchical clustering, which eliminates the need for prior knowledge of the number of clusters.
In FL, fairness requires that sensitive groups, such as those defined by gender or race, do not experience disparate outcomes such as different accuracy [1]. Unfortunately, minority social groups are often underrepresented in training data, so their accuracy is often degraded. When the sizes of the client groups vary, existing methods significantly undermine the performance of minority client groups. In contrast, our method performs well regardless of group size by assigning each group a logit that fits its data distribution, bringing us closer to fair FL.
In an empirical analysis on the MNIST and CIFAR datasets, we demonstrate that clustering based on predictions exceeds 90% accuracy. We also achieve higher accuracy for each client model than traditional federated distillation methods in settings with an apparent group structure: performance increases by up to 75%, and the greater the difference in data distribution between groups, the greater the advantage of our algorithm. We further show that our algorithm remains effective when data are sparse.
The main contributions of this paper are:
  • We propose the first federated distillation approach that clusters clients using the predictions of their models on public data.
  • We show that our approach results in successful clustering even when the boundaries between each client group are unclear.
  • We demonstrate the effectiveness of our approach under challenging conditions such as insufficient data and ambiguous group boundaries. It also improves the performance of minority groups, bringing us closer to a fairer FL.
Our paper is organized as follows.
  • In Section 2, we review the relevant literature on clustering techniques in federated learning and federated distillation methods.
  • In Section 3, we formally define the problem and present the proposed clustered federated distillation algorithm using a label-based group structure.
  • In Section 4, we evaluate the proposed method:
    - In Section 4.1, we present the setup used in our experiments.
    - In Section 4.2, we compare the clustering accuracy of our algorithm with existing FL methods that use clustering in a label-based group structure.
    - In Section 4.3, we compare the performance of our method with existing federated distillation methods in a label-based group structure.
  • In Section 5, we summarize the main results and limitations and suggest directions for future work.

2. Related Works

2.1. Clustering in Federated Learning

In FL, several clustering criteria are available for categorizing and grouping clients based on various attributes. One example is Data Source-Based Clustering [9], which organizes clients according to the origin of their data, such as X-rays or ultrasound, in medical settings. Geographical and Diagnosis-Based Clustering [10] groups clients based on their locations or shared diagnoses. Loss Function Values-Based Clustering [8,11] focuses on similar model behaviors as deduced from the loss function values. Clusters based on inherent data distribution seek to enhance model generalization by considering the intrinsic characteristics of the data. Model Parameter-Based Clustering [12,13] gathers clients with analogous model parameters, reflecting parallel learning stages. Gradient Information-Based Clustering [7,14] forms clusters by examining shared gradient information. Prototype-Based Clustering [15] simplifies the global model by forming clusters around generalized prototypes that represent distinct data patterns.
Table 1 shows commonly used clustering methods in federated learning. Clustered FL (CFL) and FL+HC use the similarity of model parameters for clustering. However, the computation required to measure parameter similarity between clients is proportional to the model size, and the number of pairwise comparisons grows quadratically with the number of clients. In contrast, our method uses the number of labels predicted by each client, which is not affected by model complexity or the number of clients. FedSEM and IFCA find the cluster identity by sending multiple model parameters to each client and selecting the model with the lowest loss. This is inefficient because each client must repeat the learning process with multiple models. In contrast, our method only adds clustering and the computation of cluster logits at the end of the existing federated distillation process, so the overhead is small.

2.2. Federated Distillation

Federated distillation enables the adaptation of models to suit the computational capacity of a client [5] and minimizes information leakage during the sharing of high-dimensional model parameters [6]. FedMD enables heterogeneous FL among clients with different model structures by distilling knowledge from the server to the clients [5]. By contrast, Federated Group Knowledge Transfer (FedGKT) involves bidirectional distillation between clients and servers [16]. This method transfers the computational load from clients to servers, but raises privacy concerns. FedDF [17] uses unlabeled data to implement distillation while aggregating client-to-server logits across different model architectures. Distillation techniques have also been employed in One-Shot FL methods [18,19,20], which compress information from multiple client models into singular models.

3. Materials and Methods

3.1. Problem Definition

There are $m$ clients. Each client $k$ has a dataset $D_k := \{(x_i^k, y_i)\}_{i=1}^{N_k}$, where $N_k$ denotes the number of instances held by client $k$. There are $L$ client groups, each containing instances from a limited set of $C_l$ classes, where $1 < C_l < C$ and $C$ is the total number of classes. Clients also have access to an unlabeled public dataset $D_p := \{x_i^p\}_{i=1}^{N_p}$. Each client employs a model $f_k$ with a potentially different architecture.
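For concreteness, a label-based group structure of this kind can be simulated as in the following sketch. The construction and the helper name make_label_groups are our own illustrative assumptions; the paper defines the setting but does not prescribe this particular sampling procedure.

```python
import numpy as np

def make_label_groups(labels: np.ndarray, num_groups: int, classes_per_group: int,
                      clients_per_group: int, samples_per_class: int, seed: int = 0):
    """Assign each client a subset of classes according to its group (illustrative)."""
    rng = np.random.default_rng(seed)
    num_classes = int(labels.max()) + 1
    client_indices = []
    for _ in range(num_groups):
        # Each group draws its own limited set of C_l classes.
        group_classes = rng.choice(num_classes, size=classes_per_group, replace=False)
        for _ in range(clients_per_group):
            idx = np.concatenate([
                rng.choice(np.where(labels == c)[0], size=samples_per_class, replace=False)
                for c in group_classes
            ])
            client_indices.append(idx)  # indices of this client's private data D_k
    return client_indices
```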

3.2. Federated Distillation

In federated distillation, each client trains a local model and communicates its knowledge to the central server. We use a one-shot method [18] in which the client sends the trained results to the server once and receives the aggregated data in return. This approach minimizes the communication overhead and accelerates the learning process. The distillation process uses the standard KL divergence loss represented in Equation (1).
$$\mathrm{KL}(p, q) = \sum_{c=1}^{C} p(c) \log \frac{p(c)}{q(c)} \qquad (1)$$
where $p(c)$ and $q(c)$ denote the predicted probabilities of class $c$ obtained from the client and group models, respectively. Mathematically, $p(c) = \sigma(f_k(x))$ and $q(c) = \sigma(\tilde{f}_l(x))$, where $f_k(x)$ is the logit from the client's model and $\tilde{f}_l(x)$ is the averaged logit from the clients belonging to group $l$.
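A minimal PyTorch sketch of this loss is given below; the function name distillation_loss and the batch layout are our assumptions, since the paper does not prescribe an implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(client_logits: torch.Tensor, group_logits: torch.Tensor) -> torch.Tensor:
    """KL(p, q) of Equation (1): p from the client model, q from the averaged group logit."""
    p = F.softmax(client_logits, dim=1)           # p(c)
    log_p = F.log_softmax(client_logits, dim=1)   # log p(c)
    log_q = F.log_softmax(group_logits, dim=1)    # log q(c)
    # sum_c p(c) * (log p(c) - log q(c)), averaged over the public-data batch
    return (p * (log_p - log_q)).sum(dim=1).mean()
```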

3.3. Clustered Federated Distillation

Our objective is to identify the group of each client and train a specialized model for each group using both $D_p$ and $D_k$. We assume no prior knowledge of the groups, including the number of clusters $L$.
The only information available for clustering is each client model's prediction $f_k(x^p)$ on the public data. If we used these predictions directly, each of the $N_p$ public data points would contribute a $C$-dimensional prediction vector, where $C$ is the number of classes, giving data of shape $N_p \times C$ to cluster. If we instead use the number of labels predicted by each client model, the data to be clustered have shape $C$ only. Since $N_p$ is usually much larger than $C$, this reduction in dimensionality allows for more efficient clustering.
Thus, the server collects the logits predicted by each client on $D_p$ and computes, for each client $k$, a count vector over the predicted labels. This count vector is then normalized, as described in Equations (2) and (3).
$$\mathrm{Count}_k = \big(\mathrm{Count}_k^c\big)_{c=1}^{C}, \quad \text{where } \mathrm{Count}_k^c = \sum_{i=1}^{N_p} \mathbb{I}\big[\operatorname{argmax} f_k(x_i^p) = c\big] \qquad (2)$$
In Equation (2), $\mathrm{Count}_k$ is a vector whose element $\mathrm{Count}_k^c$ denotes the number of instances in $D_p$ classified into class $c$ by client $k$'s model $f_k$. The function $\mathbb{I}$ is an indicator that returns 1 if the condition is true and 0 otherwise.
$$\mathrm{NormCount}_k = \left(\frac{\mathrm{Count}_k^c - \min(\mathrm{Count}_k)}{\max(\mathrm{Count}_k) - \min(\mathrm{Count}_k)}\right)_{c=1}^{C} \qquad (3)$$
We employ agglomerative clustering to identify the client groups. This hierarchical clustering method starts with each data point as a separate group and iteratively merges the closest groups. The distance_threshold is the key parameter setting the maximum distance at which groups are merged. Equation (3) normalizes the count vectors to the [0, 1] range so that the distance_threshold can be applied consistently.
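A minimal sketch of this clustering step is shown below, assuming the server holds each client's logits on the public data as NumPy arrays and uses scikit-learn's AgglomerativeClustering; helper names such as count_vector and cluster_clients are illustrative.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def count_vector(client_logits: np.ndarray, num_classes: int) -> np.ndarray:
    """Equation (2): count how many public samples the client assigns to each class."""
    preds = client_logits.argmax(axis=1)                      # predicted label per sample
    return np.bincount(preds, minlength=num_classes).astype(float)

def normalize(count: np.ndarray) -> np.ndarray:
    """Equation (3): min-max normalize the count vector to [0, 1]."""
    span = count.max() - count.min()
    return (count - count.min()) / (span if span > 0 else 1.0)

def cluster_clients(all_logits: list, num_classes: int,
                    distance_threshold: float = 2.0) -> np.ndarray:
    """Cluster clients by their normalized label-count vectors; L need not be known."""
    X = np.stack([normalize(count_vector(l, num_classes)) for l in all_logits])
    clustering = AgglomerativeClustering(n_clusters=None,
                                         distance_threshold=distance_threshold)
    return clustering.fit_predict(X)                          # cluster id per client
```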
Algorithm 1 outlines our Clustered Federated Distillation Learning method. Each client trains its model on its private dataset and predicts classes on the public dataset $D_p$. These predictions are sent to a central server, which clusters the clients based on them. The server calculates the average logit $\tilde{f}_l(x^p)$ for each cluster and sends it back to the corresponding clients. The clients then distill their models using the KL divergence loss in Equation (1), effectively addressing non-IID data distributions and enhancing the overall model performance.
Algorithm 1 Clustered Federated Distillation framework
  • Input: Public dataset $D_p$; private datasets $D_k$; client models $f_k$, $k = 1, \ldots, m$; $L$ groups with $l_c$ clients in group $l$.
  • Output: Trained models $f_k$.
  • Train: Each client trains $f_k$ on $D_k$.
  • Predict: Each client predicts classes $f_k(x_i^p)$ on $D_p$ and transmits the results to the central server.
  • Cluster: The server clusters the clients using their predictions, following Equations (2) and (3).
  • Aggregate: The server averages the logits within each cluster: $\tilde{f}_l(x^p) = \frac{1}{l_c} \sum_{k \in \mathrm{Group}\, l} f_k(x^p)$.
  • Distribute: Each client receives its own group's logit $\tilde{f}_l(x^p)$.
  • Distill: Each client model learns by distilling knowledge from $\tilde{f}_l(x^p)$ using Equation (1).
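The server-side Cluster, Aggregate, and Distribute steps could then look as follows; this sketch assumes every client has already uploaded its logits on the public dataset and reuses the cluster_clients helper sketched above.

```python
import numpy as np

def server_round(client_logits: dict, num_classes: int,
                 distance_threshold: float = 2.0) -> dict:
    """Cluster clients, average logits per cluster, and return each client's group logit."""
    ids = sorted(client_logits)                                # client identifiers
    labels = cluster_clients([client_logits[k] for k in ids], num_classes,
                             distance_threshold)               # cluster id per client
    group_logit = {}
    for g in np.unique(labels):
        members = [client_logits[ids[i]] for i, lab in enumerate(labels) if lab == g]
        group_logit[g] = np.mean(members, axis=0)              # averaged logit of group g
    # Distribute: each client receives the averaged logit of its own cluster.
    return {ids[i]: group_logit[labels[i]] for i in range(len(ids))}
```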

4. Results

4.1. Setting

To experiment with different group structures, we varied the number of classes per group from 2 to 5 and the number of groups over 2, 4, 6, 8, and 10. Unless otherwise noted, we used the MNIST [21] dataset, one of the most widely used datasets in FL research [7,12,22]. Its small 28 × 28 images make it a popular choice for simple proofs of concept, and it has 10 classes in total, which implies that a single class often belongs to multiple groups. For the neural network architecture, we employed a simple CNN with two convolutional layers, using ReLU as the activation function for the hidden layers and Softmax for the output layer. The learning rate was set to $1 \times 10^{-4}$ and the batch size was 128. Unless otherwise stated, five clients were uniformly assigned to each group, each client held 50 data points per class, and we assumed 400 unlabeled public data points per class. We ran 25 local training epochs per client and 40 distillation epochs to learn from the aggregated logit.
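As an illustration of this setup, a client model along these lines could be defined as below. The exact layer widths and kernel sizes are our assumptions, since the paper only specifies two convolutional layers with ReLU hidden activations and a Softmax output.

```python
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Two-convolution CNN for 28 x 28 grayscale MNIST inputs; widths are illustrative."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):
        # Softmax is applied outside the network (e.g., inside the distillation loss).
        return self.classifier(self.features(x).flatten(1))
```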
In Section 4.2, we compare the clustering performance of our algorithm with that of an existing FL method that uses clustering. In Section 4.3, we compare the performance of each client after the entire training process with existing federated distillation algorithms.

4.2. Clustering Experiment

Baseline: We compared the clustering performance with CFL [7], one of the methods that uses weight variation for clustering. It uses the cosine similarity of weight updates between clients as a clustering criterion.
Metric: We measured the clustering performance using the adjusted Rand index (ARI) [23], which quantifies the similarity between the true and predicted cluster assignments. The silhouette score [24] serves as another metric, evaluating the degree of separation between clusters; we used it to assess how well each clustering criterion (count vectors versus weight-update similarity) reflects the group structure. Both metrics range from −1 to 1, and higher values indicate better-defined clusters.
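Both metrics are available in scikit-learn; a minimal sketch with illustrative values follows.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Illustrative inputs: true group identity and predicted cluster label per client,
# plus the normalized count vectors X used as the clustering criterion.
true_groups = np.array([0, 0, 0, 1, 1, 1])
pred_labels = np.array([1, 1, 1, 0, 0, 0])
X = np.random.rand(6, 10)                            # 6 clients, 10-dimensional count vectors

ari = adjusted_rand_score(true_groups, pred_labels)  # 1.0 here: clusters match the groups
sil = silhouette_score(X, pred_labels)               # separation under the chosen criterion
print(f"ARI={ari:.2f}, silhouette={sil:.2f}")
```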
Hyperparameter: Table 2 shows the average ARI for various distance thresholds, the key hyperparameter of agglomerative clustering. The distance threshold determines how far apart clusters can be and still be merged into the same group. We used two and five classes per group and four groups, and the results are averaged for each distance threshold value. For the subsequent experiments, we chose a distance threshold of 2, because thresholds between 1.5 and 2.5 consistently yield ARI values above 0.95, indicating near-optimal clustering.
Silhouette Score: Table 3 shows the silhouette scores of each clustering criterion under different group structures, indicating how well the data are clustered. The count vector achieved silhouette scores above 0.5 in all cases, whereas the model-update similarity scores were consistently close to zero, indicating little to no group structure. For both criteria, fewer groups led to clearer group structures.
ARI: Table 4 shows the performance of our algorithm compared to CFL in terms of clustering accuracy. Our algorithm consistently achieved an ARI greater than 0.9 across different settings, indicating high clustering accuracy. By contrast, CFL recorded an ARI close to 0 in all test cases, demonstrating its persistent ineffectiveness in label-based clustering.
Existence of Minor Classes: Thus far, our experiments have focused on scenarios in which clients in each group hold only a subset of the classes, so the boundaries between groups were clear and clustering was relatively easy. In this setup, we instead assume that three classes per group appear in large numbers, whereas the remaining classes appear in smaller numbers; we refer to these infrequent classes as minor classes. We examined the effect of increasing the proportion of minor classes in each group from 5% to 50%, with each client holding 500 data points in total. For example, with 5% minor classes, each client has 25 data points spread over the seven minor classes and the remaining 475 in the three major classes. As shown in Table 5 and Figure 2, the silhouette score decreased as the proportion of minor classes increased and the group structure became less obvious. The ARI, on the other hand, remained at 1.0 until the minor-class proportion reached 30% and then declined, dropping sharply by 50%.

4.3. Performance Evaluation

Baseline: In this section, we compare the performance of our method with three baselines. FedDF (Federated ensemble Distillation for robust model Fusion) [17] has a similar distillation process to our method, except that it assigns the same logit to all clients. On the other hand, DS-FL (Distillation-Based Semi-Supervised Federated Learning) [25] uses entropy reduction averaging for model output aggregation. It is designed to deliberately reduce the entropy of the global logit before distributing it to mobile devices. FedAVG (Federated Averaging) [22] is a fundamental federated learning algorithm that averages model parameters across clients to form a global model, in contrast to the distillation approaches of FedDF and DS-FL.
Balanced Group Structure: Table 6 shows that our algorithm consistently outperforms FedDF when the number of clients in each group is equal. When each group contains fewer classes, there are fewer overlapping classes, resulting in a more distinct group structure. Consequently, our method performed 15% better than FedDF with five classes per group, and 75% better when the group structure was clearest, with only two classes per group.
Unbalanced Group Structure: Table 7 and Figure 3 show the performance when the number of clients varies between groups. With global distillation, the accuracy of groups with fewer clients tends to decrease significantly and sometimes approaches zero. Realistically, such minority groups would have to train on their own data without applying FL and would therefore gain nothing from it. By contrast, our method ensures that clients in every cluster perform comparably well, leading to a significant increase in accuracy for minority groups, so that all clients share in the benefits of FL.
Insufficient Data: In FL, the amount of data is often insufficient [1]. A client with insufficient data will struggle to train its model effectively, while a lack of public data hinders the transfer of knowledge to the server. We therefore conducted experiments in environments where client data and public data are scarce, using 50, 100, 300, and 500 data points per client and 100, 300, 500, and 1000 public data points. Figure 4 shows that the performance of the other algorithms decreases as the public dataset shrinks, whereas our method maintains its performance. Our algorithm also has the highest accuracy and the lowest variability, represented by the short vertical lines at each data point.
CIFAR Dataset: All experiments so far used the MNIST dataset. Here, we use the more complex CIFAR-10 [26] dataset to evaluate how well our method works on more challenging tasks. CIFAR-10 is one of the most widely used datasets and consists of 60,000 color images of size 32 × 32 with 10 labels. We run our experiments with three groups, each with between two and five classes. Figure 5 shows that our algorithm improves performance by more than 2× on average over the baseline, demonstrating that it works well on a more complex dataset.
Existence of Minor Classes: In real-world scenarios, it is often difficult to make clear distinctions between different groups. To address this, we introduce the concept of a “minor class”, a less prevalent class within each group, to blur traditional group boundaries. Figure 6 shows that as the proportion of these minor classes increases, the performance gap between our proposed method and traditional approaches narrows. This experiment utilizes the MNIST dataset.
Our experiments demonstrate several strengths of the proposed clustered federated distillation algorithm when clear label-based group structures exist. It achieves high clustering accuracy in delineating distinct groups based on their predicted distributions. The client model accuracy also significantly outperforms traditional federated distillation methods, especially for minority groups. The algorithm performs robustly even with limited client and public data.
However, the results also reveal some limitations of the current approach. The performance gains diminish as the group boundaries become more ambiguous, with a higher proportion of overlapping “minor” classes. The experiments have so far only focused on image classification tasks with a limited number of labels. Tests on more complex datasets and tasks are needed to fully evaluate the applicability of the method. In addition, more extensive tuning is needed to determine optimal hyperparameter settings, such as distance thresholds. Addressing these limitations is a worthwhile avenue for improving the clustered federated distillation framework proposed in this work.

5. Discussion

In this study, we address the scenario of different data distributions between different client groups in federated distillation. We introduce a methodology that uses hierarchical clustering to categorize clients according to the number of labels predicted by each model for public data. This approach overcomes the limitations of traditional federated distillation techniques that assume a uniform data distribution when a label-based group structure exists. Our method can be used when different groups (e.g., demographic groups) have significantly different data distributions to ensure that all groups receive equally good results.
The experiments show that the model correctly classifies groups with different labels, and its accuracy exceeds that of traditional methods when there is a clear label-based cluster structure. In particular, the accuracy of minority client groups, which is problematic in traditional federated distillation, is significantly improved; this may pave the way for fair FL. Furthermore, our method does not require knowledge of the number of clusters, making it applicable in a wider range of environments. However, as the group structure becomes less clear, the performance gap between our method and existing algorithms narrows. We will continue to improve our method so that it performs better with ambiguous group structures.
It would be an interesting research topic to combine our method with different data types, such as text, more complex images, or time series data. Our method could also be combined with data-free distillation where no public data exists. Our algorithm will also be very effective in the presence of malicious clients that send false predictions to the server. By creating a group of malicious clients, we can ensure that other clients are not affected by them.

Author Contributions

Conceptualization, G.Y.; Methodology, G.Y.; Software, G.Y.; Validation, G.Y.; Formal Analysis, G.Y.; Investigation, G.Y.; Resources, G.Y.; Data Curation, G.Y.; Writing—Original Draft Preparation, G.Y.; Writing—Review and Editing, G.Y. and H.T.; Visualization, G.Y.; Supervision, H.T.; Project Administration, H.T.; Funding Acquisition, H.T. All authors have read and agreed to the published version of the manuscript.

Funding

This study was carried out with the support of the "R&D Program for Forest Science Technology (Project No. 2021383A00-2323-0101)" provided by the Korea Forest Service (Korea Forestry Promotion Institute). This work was also supported by the Korea Institute of Industrial Technology under the project "Development of holonic manufacturing system for future industrial environment" [KITECH EO-230006].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available because the associated code is still under development.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R. Advances and Open Problems in Federated Learning; Foundations and Trends® in Machine Learning: Hanover, MA, USA, 2021; Volume 14, pp. 1–210. [Google Scholar]
  2. Zhang, C.; Xie, Y.; Bai, H.; Yu, B.; Li, W.; Gao, Y. A survey on federated learning. Knowl.-Based Syst. 2021, 216, 106775. [Google Scholar] [CrossRef]
  3. Rieke, N.; Hancox, J.; Li, W.; Milletari, F.; Roth, H.R.; Albarqouni, S.; Bakas, S.; Galtier, M.N.; Landman, B.A.; Maier-Hein, K.; et al. The future of digital health with federated learning. NPJ Digit. Med. 2020, 3, 119. [Google Scholar] [CrossRef] [PubMed]
  4. Jeong, E.; Oh, S.; Kim, H.; Park, J.; Bennis, M.; Kim, S.L. Communication-efficient on-device machine learning: Federated distillation and augmentation under non-iid private data. arXiv 2018, arXiv:1811.11479. [Google Scholar]
  5. Li, D.; Wang, J. Fedmd: Heterogenous federated learning via model distillation. arXiv 2019, arXiv:1910.03581. [Google Scholar]
  6. Chang, H.; Shejwalkar, V.; Shokri, R.; Houmansadr, A. Cronus: Robust and heterogeneous collaborative learning with black-box knowledge transfer. arXiv 2019, arXiv:1912.11279. [Google Scholar]
  7. Sattler, F.; Müller, K.R.; Samek, W. Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 3710–3722. [Google Scholar] [CrossRef]
  8. Ghosh, A.; Chung, J.; Yin, D.; Ramchandran, K. An efficient framework for clustered federated learning. Adv. Neural Inf. Process. Syst. 2020, 33, 19586–19597. [Google Scholar] [CrossRef]
  9. Qayyum, A.; Ahmad, K.; Ahsan, M.A.; Al-Fuqaha, A.; Qadir, J. Collaborative federated learning for healthcare: Multi-modal COVID-19 diagnosis at the edge. IEEE Open J. Comput. Soc. 2022, 3, 172–184. [Google Scholar] [CrossRef]
  10. Huang, L.; Shea, A.L.; Qian, H.; Masurkar, A.; Deng, H.; Liu, D. Patient clustering improves efficiency of federated machine learning to predict mortality and hospital stay time using distributed electronic medical records. J. Biomed. Inform. 2019, 99, 103291. [Google Scholar] [CrossRef]
  11. Mansour, Y.; Mohri, M.; Ro, J.; Suresh, A.T. Three approaches for personalization with applications to federated learning. arXiv 2020, arXiv:2002.10619. [Google Scholar]
  12. Briggs, C.; Fan, Z.; Andras, P. Federated learning with hierarchical clustering of local updates to improve training on non-IID data. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN) IEEE, Glasgow, UK, 19–24 July 2020; pp. 1–9. [Google Scholar]
  13. Long, G.; Xie, M.; Shen, T.; Zhou, T.; Wang, X.; Jiang, J. Multi-center federated learning: Clients clustering for better personalization. World Wide Web 2023, 26, 481–500. [Google Scholar] [CrossRef]
  14. Duan, M.; Liu, D.; Ji, X.; Liu, R.; Liang, L.; Chen, X.; Tan, Y. FedGroup: Efficient clustered federated learning via decomposed data-driven measure. arXiv 2020, arXiv:2010.06870. [Google Scholar]
  15. Huang, W.; Ye, M.; Shi, Z.; Li, H.; Du, B. Rethinking federated learning with domain shift: A prototype view. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) IEEE, Vancouver, BC, Canada, 8–22 June 2023; pp. 16312–16322. [Google Scholar]
  16. He, C.; Annavaram, M.; Avestimehr, S. Group knowledge transfer: Federated learning of large cnns at the edge. Adv. Neural Inf. Process. Syst. 2020, 33, 14068–14080. [Google Scholar]
  17. Lin, T.; Kong, L.; Stich, S.U.; Jaggi, M. Ensemble distillation for robust model fusion in federated learning. Adv. Neural Inf. Process. Syst. 2020, 33, 2351–2363. [Google Scholar]
  18. Guha, N.; Talwalkar, A.; Smith, V. One-shot federated learning. arXiv 2019, arXiv:1902.11175. [Google Scholar]
  19. Li, Q.; He, B.; Song, D. Practical one-shot federated learning for cross-silo setting. arXiv 2020, arXiv:2010.01017. [Google Scholar]
  20. Zhou, Y.; Pu, G.; Ma, X.; Li, X.; Wu, D. Distilled one-shot federated learning. arXiv 2020, arXiv:2009.07999. [Google Scholar]
  21. LeCun, Y. The MNIST Database of Handwritten Digits. 1998. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 21 March 2023).
  22. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the Artificial Intelligence and Statistics, PMLR, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282. [Google Scholar]
  23. Rand, W.M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 1971, 66, 846–850. [Google Scholar] [CrossRef]
  24. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef]
  25. Itahara, S.; Nishio, T.; Koda, Y.; Morikura, M.; Yamamoto, K. Distillation-based semi-supervised federated learning for communication-efficient collaborative training with non-iid private data. IEEE Trans. Mob. Comput. 2021, 22, 191–205. [Google Scholar] [CrossRef]
  26. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. 2009. Available online: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed on 21 March 2023).
Figure 1. Federated distillation process using group structure. 1. Individual clients conduct local training using their own datasets. 2. Each client performs inference on a shared public dataset and forwards the results to a central server. 3. The server clusters the received outputs and obtains the soft label by averaging the outputs of the same group. 4. Clients receive the averaged soft labels corresponding to their groups. 5. Clients then perform distillation to align their model outputs with the received averaged soft labels.
Figure 2. Visualization of the count vector distribution with minor class that has a small percentage of each group. (a) When the minor class rate is 5%. (b) When the minor class rate is 10%. (c) When the minor class rate is 40%.
Figure 3. (a) Represents the average accuracy per client group when FedDF is applied, while (b) represents the average accuracy per client group when our method is applied. In both cases, the proportion for each client group is 6:2:2.
Figure 4. Performance of each algorithm as the amount of public dataset changes.
Figure 5. Graph representing accuracy in CIFAR with respect to the class per group.
Figure 6. Graph shows how performance varies depending on the "minor class" ratio. Each data point represents the average performance across dataset sizes of 100, 300, 500, and 1000. The MNIST dataset was used.
Table 1. Comparison of federated learning methods that utilize clustering. Note that FL+HC means Federated Learning with Hierarchical Clustering and IFCA means Iterative Federated Clustering Algorithm.

Name | Clustering Criterion | Dataset
Clustered FL (CFL) [7] | Similarity of ΔW | MNIST, CIFAR-10
FL+HC [12] | Similarity of ΔW | MNIST, FEMNIST
Federated SEM (FedSEM) [13] | Model that minimizes loss | MNIST, FEMNIST, CelebA
IFCA [8] | Model that minimizes loss | MNIST, CIFAR-10
Table 2. Average adjusted Rand index (ARI) for different distance thresholds in agglomerative clustering.

Distance Threshold | 0.25 | 0.5 | 1.0 | 1.5 | 2.0 | 2.5 | 3.0 | 3.5 | 4.0
ARI | 0.31 | 0.67 | 0.93 | 0.98 | 1.00 | 1.00 | 0.78 | 0.75 | 0.43
Table 3. Silhouette scores for each clustering criterion. "Ours" represents the silhouette score for the distribution of the count vectors output by each client model used by our method. "CFL" represents the silhouette score for the similarity of the weight-update distribution used by clustered federated learning. Columns give the number of classes per group, with CFL/Ours values.

Group | 2 classes (CFL/Ours) | 3 classes (CFL/Ours) | 4 classes (CFL/Ours) | 5 classes (CFL/Ours)
2 | 0.06/0.82 | 0.03/0.85 | 0.02/0.81 | 0.03/0.85
4 | 0.03/0.88 | 0.02/0.83 | 0.01/0.61 | 0.01/0.78
6 | 0.02/0.78 | 0.01/0.77 | 0.01/0.57 | 0.01/0.75
8 | 0.02/0.79 | 0.01/0.69 | 0.01/0.60 | 0.01/0.74
10 | 0.01/0.76 | 0.02/0.57 | 0.00/0.54 | 0.00/0.72
Table 4. Comparison of ARI scores under various group structures for our method and clustered federated learning (CFL). Columns give the number of classes per group, with CFL/Ours values.

Group | 2 classes (CFL/Ours) | 3 classes (CFL/Ours) | 4 classes (CFL/Ours) | 5 classes (CFL/Ours)
2 | −0.03/1.00 | 0.01/1.00 | −0.08/1.00 | 0.13/1.00
4 | 0.08/1.00 | −0.01/1.00 | 0.06/0.90 | −0.03/1.00
6 | 0.03/0.96 | −0.03/1.00 | −0.01/0.96 | 0.03/1.00
8 | −0.03/1.00 | 0.04/1.00 | −0.01/0.93 | −0.01/1.00
10 | 0.01/0.91 | 0.01/0.93 | −0.00/0.97 | −0.01/1.00
Table 5. Clustering performance with "minor classes" (the less frequent classes in each group). Each column header gives the proportion of the seven minor classes in each group.

Minor Class Proportion | 5% | 10% | 20% | 30% | 40% | 50%
Silhouette | 0.87 | 0.69 | 0.59 | 0.49 | 0.37 | 0.33
ARI | 1.0 | 1.0 | 1.0 | 1.0 | 0.9 | 0.49
Table 6. Average accuracy comparison between the FedDF method and our proposed group-based distillation method across different group structures. Columns give the number of classes per group, with FedDF/Ours values (%).

Group | 2 classes (FedDF/Ours) | 3 classes (FedDF/Ours) | 4 classes (FedDF/Ours) | 5 classes (FedDF/Ours)
2 | 70.0/92.3 | 74.0/90.9 | 80.2/90.7 | 83.2/93.7
4 | 55.0/98.0 | 49.8/90.9 | 70.8/92.9 | 80.7/92.3
6 | 30.9/93.8 | 72.7/93.6 | 74.6/88.6 | 69.6/92.8
8 | 58.9/94.8 | 58.2/93.3 | 67.6/87.6 | 82.8/91.6
10 | 53.0/90.5 | 73.5/94.0 | 81.0/91.7 | 83.2/92.1
Avg | 53.6/93.9 | 65.6/92.5 | 74.8/90.3 | 79.9/92.5
Table 7. Performance metrics across groups with different ratios. The 'Group Ratio' column shows the percentage of clients in each group. 'Group Acc' represents the average accuracy achieved by each group at the end of the training process, while 'Total Acc' represents the average accuracy across all clients. The order of the groups in 'Group Acc' matches 'Group Ratio'. All values are percentages.

Group Ratio | Group Acc (FedDF) | Group Acc (Ours) | Total Acc (FedDF) | Total Acc (Ours)
70, 30 | 97, 0 | 96, 92 | 68 | 95
80, 20 | 97, 0 | 97, 86 | 77 | 95
50, 30, 20 | 96, 1, 0 | 97, 89, 97 | 49 | 94
60, 20, 20 | 96, 0, 0 | 96, 91, 93 | 58 | 94
40, 30, 20, 10 | 96, 7, 4, 64 | 96, 89, 94, 97 | 48 | 94
50, 20, 20, 10 | 96, 0, 0, 66 | 96, 93, 91, 96 | 54 | 94