Article

Federated Distillation Methodology for Label-Based Group Structures

Digital Health Care R&D Department, Korea Institute of Industrial Technology, Cheonan 31056, Republic of Korea
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(1), 277; https://doi.org/10.3390/app14010277
Submission received: 10 October 2023 / Revised: 31 October 2023 / Accepted: 10 November 2023 / Published: 28 December 2023

Abstract

In federated learning (FL), clients train models locally without sharing raw data, thereby preserving data privacy. Federated distillation, in particular, transfers knowledge to clients regardless of the model architecture. However, when groups of clients with different label distributions exist, sharing the same knowledge among all clients becomes impractical. To address this issue, this paper presents an approach that clusters clients based on how their locally trained models label a public dataset: each client's model predicts labels for the public data, and clients are grouped according to these per-label prediction counts. Evaluations on MNIST and CIFAR show that our method effectively identifies group membership, increasing accuracy by up to 75% over existing methods when the label distributions differ significantly between groups. In addition, we observed significant performance improvements for smaller client groups, bringing us closer to fair FL.

1. Introduction

Federated learning (FL) enables multiple clients to contribute to training a global machine learning model without sharing their data with a central server [1]. Clients perform computations on their local data and send only the model updates to a central server, which aggregates them to improve the global model. The global model is then redistributed to the clients for subsequent training rounds. This framework ensures data safety by storing the data only on client devices, thereby minimizing the risk of breaches [2]. In the context of healthcare, this approach is particularly valuable because it enables collaborative research and model training across multiple medical institutions while complying with strict privacy regulations and minimizing the risk of exposing sensitive patient data [3].
Distillation is a machine learning technique that trains a simpler model, called a student, to mimic the actions of a more complex model, called a teacher, which typically improves efficiency without sacrificing accuracy. Federated distillation extends this approach to a decentralized setting, allowing many devices to train a student model collaboratively while keeping their data localized [4]. Recently, federated distillation has attracted considerable attention. Federated distillation captures and communicates the learning experience via logits, which are the pre-activation function outputs of individually trained models. This approach significantly reduces the communication overhead compared to traditional FL [4]. It also provides a balance between flexibility and security. Clients can use models suitable for their computational capabilities [5]. At the same time, the risks of information exposure are significantly reduced by transmitting only distilled knowledge via logits rather than raw data, thereby increasing the data privacy level [6].
Traditional federated distillation methods rely on a uniform global logit, which reduces accuracy when the data have a distinct group structure. Label distribution skew [1], in which data labels are not equally distributed across clients, is a common problem in federated learning, yet traditional federated distillation trains all clients with the same global logit, which is likely to be sub-optimal for each client's data distribution. To illustrate this, consider FL between hospitals specializing in different types of medical treatment. A hospital specializing in cancer will have a dataset containing only different types of cancer, whereas a hospital specializing in infectious diseases will have images labeled "infection". In this situation, it is not appropriate to train all clients with a single global logit; it is better to divide the clients into groups based on the labels they hold and to train each group with a logit that reflects its label distribution.
Although clustering techniques exist in FL, to the best of our knowledge, no method has integrated clustering with federated distillation. Most clustering algorithms in FL use model parameters [7,8] for clustering. However, federated distillation exchanges only the outputs of the client models, not their parameters, so this approach is not applicable. A clustering method suited to federated distillation must therefore rely on the model outputs. To focus on label distribution skew, we use the labels predicted by the client models: we propose a method that clusters client models based on the number of times they predict each label. Figure 1 illustrates our algorithm, which uses the cluster information for effective distillation. In practice, the number of groups is often unknown; our algorithm addresses this by using hierarchical clustering, which eliminates the need for prior knowledge of the number of clusters.
In FL, fairness requires that sensitive groups, such as those defined by gender or race, do not experience disparate outcomes such as different accuracy [1]. Unfortunately, minority social groups are often underrepresented in training data, so their accuracy is often degraded. When the sizes of the client groups vary, existing methods significantly undermine the performance of minority client groups. In contrast, our method performs well regardless of group size by assigning each group a logit that fits its data distribution, bringing us closer to fair FL.
In an empirical analysis on the MNIST and CIFAR datasets, we demonstrate that clustering based on predictions exceeds 90% accuracy. We also achieve higher accuracy for each client model than traditional federated distillation methods in settings with an apparent group structure: performance increases by up to 75%, and the greater the difference in data distribution between groups, the greater the advantage of our algorithm. We further show that our algorithm remains effective when data are sparse.
The main contributions of this paper are:
  • We propose the first federated distillation approach that clusters clients using the predictions of their models on public data.
  • We show that our approach results in successful clustering even when the boundaries between each client group are unclear.
  • We demonstrate the effectiveness of our approach under challenging conditions such as insufficient data and ambiguous group boundaries. It also improves the performance of minority groups, bringing us closer to a fairer FL.
Our paper is organized as follows.
  • In Section 2, we review the relevant literature on clustering techniques in federated learning and federated distillation methods.
  • In Section 3, we formally define the problem and present the proposed clustered federated distillation algorithm using a label-based group structure.
  • In Section 4, we evaluate the proposed method:
    - In Section 4.1, we present the setup used in our experiments.
    - In Section 4.2, we compare the clustering accuracy of our algorithm with existing FL methods that use clustering in a label-based group structure.
    - In Section 4.3, we compare the performance of our method with existing federated distillation methods in a label-based group structure.
  • In Section 5, we summarize the main results and limitations and suggest directions for future work.

2. Related Works

2.1. Clustering in Federated Learning

In FL, several clustering criteria are available for categorizing and grouping clients based on various attributes. One example is Data Source-Based Clustering [9], which organizes clients according to the origin of their data, such as X-rays or ultrasound, in medical settings. Geographical and Diagnosis-Based Clustering [10] groups clients based on their locations or shared diagnoses. Loss Function Values-Based Clustering [8,11] focuses on similar model behaviors as deduced from the loss function values. Clusters based on inherent data distribution seek to enhance model generalization by considering the intrinsic characteristics of the data. Model Parameter-Based Clustering [12,13] gathers clients with analogous model parameters, reflecting parallel learning stages. Gradient Information-Based Clustering [7,14] forms clusters by examining shared gradient information. Prototype-Based Clustering [15] simplifies the global model by forming clusters around generalized prototypes that represent distinct data patterns.
Table 1 shows commonly used clustering methods in federated learning. Clustered FL (CFL) and FL+HC use the similarity of model parameters for clustering. However, the computation required to measure parameter similarity between clients is proportional to the model size, and the number of pairwise comparisons grows quadratically with the number of clients. In contrast, our method uses the number of labels predicted by each client, which is not affected by model complexity or the number of clients. FedSEM and IFCA find the cluster identity by sending multiple model parameters to each client and selecting the model with the lowest loss. This is inefficient because each client must repeat the learning process with multiple models. In contrast, our method only adds clustering and the computation of cluster logits at the end of the existing federated distillation process, so the overhead is small.

2.2. Federated Distillation

Federated distillation enables the adaptation of models to suit the computational capacity of a client [5] and minimizes information leakage during the sharing of high-dimensional model parameters [6]. FedMD enables heterogeneous FL among clients with different model structures by distilling knowledge from the server to the clients [5]. By contrast, Federated Group Knowledge Transfer (FedGKT) involves bidirectional distillation between clients and servers [16]. This method transfers the computational load from clients to servers, but raises privacy concerns. FedDF [17] uses unlabeled data to implement distillation while aggregating client-to-server logits across different model architectures. Distillation techniques have also been employed in One-Shot FL methods [18,19,20], which compress information from multiple client models into singular models.

3. Materials and Methods

3.1. Problem Definition

There are $m$ clients. Each client $k$ has a dataset $D_k := \{(x_i^k, y_i)\}_{i=1}^{N_k}$, where $N_k$ denotes the number of instances held by client $k$. There are $L$ client groups, each containing instances from a limited set of $C_l$ classes, where $1 < C_l < C$ and $C$ is the total number of classes. Clients also have access to an unlabeled public dataset $D_p := \{x_i^p\}_{i=1}^{N_p}$. Each client employs a model $f_k$ with a potentially different architecture.
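For concreteness, a label-based group structure of this kind can be simulated as in the following sketch. The construction and the helper name make_label_groups are our own illustrative assumptions; the paper defines the setting but does not prescribe this particular sampling procedure.

```python
import numpy as np

def make_label_groups(labels: np.ndarray, num_groups: int, classes_per_group: int,
                      clients_per_group: int, samples_per_class: int, seed: int = 0):
    """Assign each client a subset of classes according to its group (illustrative)."""
    rng = np.random.default_rng(seed)
    num_classes = int(labels.max()) + 1
    client_indices = []
    for _ in range(num_groups):
        # Each group draws its own limited set of C_l classes.
        group_classes = rng.choice(num_classes, size=classes_per_group, replace=False)
        for _ in range(clients_per_group):
            idx = np.concatenate([
                rng.choice(np.where(labels == c)[0], size=samples_per_class, replace=False)
                for c in group_classes
            ])
            client_indices.append(idx)  # indices of this client's private data D_k
    return client_indices
```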

3.2. Federated Distillation

In federated distillation, each client trains a local model and communicates its knowledge to the central server. We use a one-shot method [18] in which the client sends the trained results to the server once and receives the aggregated data in return. This approach minimizes the communication overhead and accelerates the learning process. The distillation process uses the standard KL divergence loss represented in Equation (1).
$$\mathrm{KL}(p, q) = \sum_{c=1}^{C} p(c) \log \frac{p(c)}{q(c)} \qquad (1)$$
where $p(c)$ and $q(c)$ denote the predicted probabilities of class $c$ obtained from the client and group models, respectively. Mathematically, $p(c) = \sigma(f_k(x))$ and $q(c) = \sigma(\tilde{f}_l(x))$, where $f_k(x)$ is the logit from the client's model and $\tilde{f}_l(x)$ is the averaged logit from the clients belonging to group $l$.
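A minimal PyTorch sketch of this loss is given below; the function name distillation_loss and the batch layout are our assumptions, since the paper does not prescribe an implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(client_logits: torch.Tensor, group_logits: torch.Tensor) -> torch.Tensor:
    """KL(p, q) of Equation (1): p from the client model, q from the averaged group logit."""
    p = F.softmax(client_logits, dim=1)           # p(c)
    log_p = F.log_softmax(client_logits, dim=1)   # log p(c)
    log_q = F.log_softmax(group_logits, dim=1)    # log q(c)
    # sum_c p(c) * (log p(c) - log q(c)), averaged over the public-data batch
    return (p * (log_p - log_q)).sum(dim=1).mean()
```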

3.3. Clustered Federated Distillation

Our objective is to identify the group of each client and train a specialized model for each group using both $D_p$ and $D_k$. We assume no prior knowledge of the groups, including the number of clusters $L$.
The only information available for clustering is each client model's prediction $f_k(x^p)$ on the public data. If we used these predictions directly, each of the $N_p$ public data points would contribute a $C$-dimensional prediction vector, where $C$ is the number of classes, giving data of shape $N_p \times C$ to cluster. If we instead use the number of labels predicted by each client model, the data to be clustered have shape $C$ only. Since $N_p$ is usually much larger than $C$, this reduction in dimensionality allows for more efficient clustering.
Thus, the server collects the logits predicted by each client on $D_p$ and computes, for each client $k$, a count vector over the predicted labels. This count vector is then normalized, as described in Equations (2) and (3).
$$\mathrm{Count}_k = \big(\mathrm{Count}_k^c\big)_{c=1}^{C}, \quad \text{where } \mathrm{Count}_k^c = \sum_{i=1}^{N_p} \mathbb{I}\big[\operatorname{argmax} f_k(x_i^p) = c\big] \qquad (2)$$
In Equation (2), $\mathrm{Count}_k$ is a vector whose element $\mathrm{Count}_k^c$ denotes the number of instances in $D_p$ classified into class $c$ by client $k$'s model $f_k$. The function $\mathbb{I}$ is an indicator that returns 1 if the condition is true and 0 otherwise.
$$\mathrm{NormCount}_k = \left(\frac{\mathrm{Count}_k^c - \min(\mathrm{Count}_k)}{\max(\mathrm{Count}_k) - \min(\mathrm{Count}_k)}\right)_{c=1}^{C} \qquad (3)$$
We employ agglomerative clustering to identify the client groups. This hierarchical clustering method starts with each data point as a separate group and iteratively merges the closest groups. The distance_threshold is the key parameter setting the maximum distance at which groups are merged. Equation (3) normalizes the count vectors to the [0, 1] range so that the distance_threshold can be applied consistently.
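A minimal sketch of this clustering step is shown below, assuming the server holds each client's logits on the public data as NumPy arrays and uses scikit-learn's AgglomerativeClustering; helper names such as count_vector and cluster_clients are illustrative.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def count_vector(client_logits: np.ndarray, num_classes: int) -> np.ndarray:
    """Equation (2): count how many public samples the client assigns to each class."""
    preds = client_logits.argmax(axis=1)                      # predicted label per sample
    return np.bincount(preds, minlength=num_classes).astype(float)

def normalize(count: np.ndarray) -> np.ndarray:
    """Equation (3): min-max normalize the count vector to [0, 1]."""
    span = count.max() - count.min()
    return (count - count.min()) / (span if span > 0 else 1.0)

def cluster_clients(all_logits: list, num_classes: int,
                    distance_threshold: float = 2.0) -> np.ndarray:
    """Cluster clients by their normalized label-count vectors; L need not be known."""
    X = np.stack([normalize(count_vector(l, num_classes)) for l in all_logits])
    clustering = AgglomerativeClustering(n_clusters=None,
                                         distance_threshold=distance_threshold)
    return clustering.fit_predict(X)                          # cluster id per client
```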
Algorithm 1 outlines our Clustered Federated Distillation Learning method. Each client trains its model on its private dataset and predicts classes on the public dataset $D_p$. These predictions are sent to a central server, which clusters the clients based on them. The server calculates the average logit $\tilde{f}_l(x^p)$ for each cluster and sends it back to the corresponding clients. The clients then distill their models using the KL divergence loss in Equation (1), effectively addressing non-IID data distributions and enhancing the overall model performance.
Algorithm 1 Clustered Federated Distillation framework
  • Input: Public dataset $D_p$; private datasets $D_k$; client models $f_k$, $k = 1, \ldots, m$; $L$ groups with $l_c$ clients in group $l$.
  • Output: Trained models $f_k$.
  • Train: Each client trains $f_k$ on $D_k$.
  • Predict: Each client predicts classes $f_k(x_i^p)$ on $D_p$ and transmits the results to the central server.
  • Cluster: The server clusters the clients using their predictions, following Equations (2) and (3).
  • Aggregate: The server averages the logits within each cluster: $\tilde{f}_l(x^p) = \frac{1}{l_c} \sum_{k \in \mathrm{Group}\, l} f_k(x^p)$.
  • Distribute: Each client receives its own group's logit $\tilde{f}_l(x^p)$.
  • Distill: Each client model learns by distilling knowledge from $\tilde{f}_l(x^p)$ using Equation (1).
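The server-side Cluster, Aggregate, and Distribute steps could then look as follows; this sketch assumes every client has already uploaded its logits on the public dataset and reuses the cluster_clients helper sketched above.

```python
import numpy as np

def server_round(client_logits: dict, num_classes: int,
                 distance_threshold: float = 2.0) -> dict:
    """Cluster clients, average logits per cluster, and return each client's group logit."""
    ids = sorted(client_logits)                                # client identifiers
    labels = cluster_clients([client_logits[k] for k in ids], num_classes,
                             distance_threshold)               # cluster id per client
    group_logit = {}
    for g in np.unique(labels):
        members = [client_logits[ids[i]] for i, lab in enumerate(labels) if lab == g]
        group_logit[g] = np.mean(members, axis=0)              # averaged logit of group g
    # Distribute: each client receives the averaged logit of its own cluster.
    return {ids[i]: group_logit[labels[i]] for i in range(len(ids))}
```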

4. Results

4.1. Setting

To experiment with different group structures, we varied the number of classes per group from 2 to 5 and the number of groups over 2, 4, 6, 8, and 10. Unless otherwise noted, we used the MNIST [21] dataset, one of the most widely used datasets in FL research [7,12,22]. Its small 28 × 28 images make it a popular choice for simple proofs of concept, and it has 10 classes in total, which implies that a single class often belongs to multiple groups. For the neural network architecture, we employed a simple CNN with two convolutional layers, using ReLU as the activation function for the hidden layers and Softmax for the output layer. The learning rate was set to $1 \times 10^{-4}$ and the batch size was 128. Unless otherwise stated, five clients were uniformly assigned to each group, each client held 50 data points per class, and we assumed 400 unlabeled public data points per class. We ran 25 local training epochs per client and 40 distillation epochs to learn from the aggregated logit.
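As an illustration of this setup, a client model along these lines could be defined as below. The exact layer widths and kernel sizes are our assumptions, since the paper only specifies two convolutional layers with ReLU hidden activations and a Softmax output.

```python
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Two-convolution CNN for 28 x 28 grayscale MNIST inputs; widths are illustrative."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):
        # Softmax is applied outside the network (e.g., inside the distillation loss).
        return self.classifier(self.features(x).flatten(1))
```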
In Section 4.2, we compare the clustering performance of our algorithm with that of an existing FL method that uses clustering. In Section 4.3, we compare the performance of each client after the entire training process with existing federated distillation algorithms.

4.2. Clustering Experiment

Baseline: We compared the clustering performance with CFL [7], one of the methods that uses weight variation for clustering. It uses the cosine similarity of weight updates between clients as a clustering criterion.
Metric: We measured the clustering performance using the adjusted Rand index (ARI) [23], which quantifies the similarity between the true and predicted cluster assignments. The silhouette score [24] serves as another metric, evaluating the degree of separation between clusters; we used it to assess how well each clustering criterion (count vectors versus weight-update similarity) reflects the group structure. Both metrics range from −1 to 1, and higher values indicate better-defined clusters.
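Both metrics are available in scikit-learn; a minimal sketch with illustrative values follows.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Illustrative inputs: true group identity and predicted cluster label per client,
# plus the normalized count vectors X used as the clustering criterion.
true_groups = np.array([0, 0, 0, 1, 1, 1])
pred_labels = np.array([1, 1, 1, 0, 0, 0])
X = np.random.rand(6, 10)                            # 6 clients, 10-dimensional count vectors

ari = adjusted_rand_score(true_groups, pred_labels)  # 1.0 here: clusters match the groups
sil = silhouette_score(X, pred_labels)               # separation under the chosen criterion
print(f"ARI={ari:.2f}, silhouette={sil:.2f}")
```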
Hyperparameter: Table 2 shows the average ARI for various distance thresholds, the key hyperparameter of agglomerative clustering. The distance threshold determines how far apart clusters can be and still be merged into the same group. We used two and five classes per group and four groups, and the results are averaged for each distance threshold value. For the subsequent experiments, we chose a distance threshold of 2, because thresholds between 1.5 and 2.5 consistently yield ARI values above 0.95, indicating near-optimal clustering.
Silhouette Score: Table 3 shows the silhouette scores of each clustering criterion under different group structures, indicating how well the data are clustered. The count vector achieved silhouette scores above 0.5 in all cases, whereas the model-update similarity scores were consistently close to zero, indicating little to no group structure. For both criteria, fewer groups led to clearer group structures.
ARI: Table 4 shows the performance of our algorithm compared to CFL in terms of clustering accuracy. Our algorithm consistently achieved an ARI greater than 0.9 across different settings, indicating high clustering accuracy. By contrast, CFL recorded an ARI close to 0 in all test cases, demonstrating its persistent ineffectiveness in label-based clustering.
Existence of Minor Classes: Thus far, our experiments have focused on scenarios in which clients in each group hold only a subset of the classes, so the boundaries between groups were clear and clustering was relatively easy. In this setup, we instead assume that three classes per group appear in large numbers, whereas the remaining classes appear in smaller numbers; we refer to these infrequent classes as minor classes. We examined the effect of increasing the proportion of minor classes in each group from 5% to 50%, with each client holding 500 data points in total. For example, with 5% minor classes, each client has 25 data points spread over the seven minor classes and the remaining 475 in the three major classes. As shown in Table 5 and Figure 2, the silhouette score decreased as the proportion of minor classes increased and the group structure became less obvious. The ARI, on the other hand, remained at 1.0 until the minor-class proportion reached 30% and then declined, dropping sharply by 50%.

4.3. Performance Evaluation

Baseline: In this section, we compare the performance of our method with three baselines. FedDF (Federated ensemble Distillation for robust model Fusion) [17] has a similar distillation process to our method, except that it assigns the same logit to all clients. On the other hand, DS-FL (Distillation-Based Semi-Supervised Federated Learning) [25] uses entropy reduction averaging for model output aggregation. It is designed to deliberately reduce the entropy of the global logit before distributing it to mobile devices. FedAVG (Federated Averaging) [22] is a fundamental federated learning algorithm that averages model parameters across clients to form a global model, in contrast to the distillation approaches of FedDF and DS-FL.
Balanced Group Structure: Table 6 shows that our algorithm consistently outperforms FedDF when the number of clients in each group is equal. When each group contains fewer classes, there are fewer overlapping classes, resulting in a more distinct group structure. Consequently, our method performed 15% better than FedDF with five classes per group, and 75% better when the group structure was clearest, with only two classes per group.
Unbalanced Group Structure: Table 7 and Figure 3 show the performance when the number of clients varies between groups. With global distillation, the accuracy of groups with fewer clients tends to decrease significantly and sometimes approaches zero. Realistically, such minority groups would have to train on their own data without applying FL and would therefore gain nothing from it. By contrast, our method ensures that clients in every cluster perform comparably well, leading to a significant increase in accuracy for minority groups, so that all clients share in the benefits of FL.
Insufficient Data: In FL, the amount of data is often insufficient [1]. A client with insufficient data will struggle to train its model effectively, while a lack of public data hinders the transfer of knowledge to the server. We therefore conducted experiments in environments where client data and public data are scarce, using 50, 100, 300, and 500 data points per client and 100, 300, 500, and 1000 public data points. Figure 4 shows that the performance of the other algorithms decreases as the public dataset shrinks, whereas our method maintains its performance. Our algorithm also has the highest accuracy and the lowest variability, represented by the short vertical lines at each data point.
CIFAR Dataset: All experiments so far used the MNIST dataset. Here, we use the more complex CIFAR-10 [26] dataset to evaluate how well our method works on more challenging tasks. CIFAR-10 is one of the most widely used datasets and consists of 60,000 color images of size 32 × 32 with 10 labels. We run our experiments with three groups, each with between two and five classes. Figure 5 shows that our algorithm improves performance by more than 2× on average over the baseline, demonstrating that it works well on a more complex dataset.
Existence of Minor Classes: In real-world scenarios, it is often difficult to make clear distinctions between different groups. To address this, we introduce the concept of a “minor class”, a less prevalent class within each group, to blur traditional group boundaries. Figure 6 shows that as the proportion of these minor classes increases, the performance gap between our proposed method and traditional approaches narrows. This experiment utilizes the MNIST dataset.
Our experiments demonstrate several strengths of the proposed clustered federated distillation algorithm when clear label-based group structures exist. It achieves high clustering accuracy in delineating distinct groups based on their predicted distributions. The client model accuracy also significantly outperforms traditional federated distillation methods, especially for minority groups. The algorithm performs robustly even with limited client and public data.
However, the results also reveal some limitations of the current approach. The performance gains diminish as the group boundaries become more ambiguous, with a higher proportion of overlapping “minor” classes. The experiments have so far only focused on image classification tasks with a limited number of labels. Tests on more complex datasets and tasks are needed to fully evaluate the applicability of the method. In addition, more extensive tuning is needed to determine optimal hyperparameter settings, such as distance thresholds. Addressing these limitations is a worthwhile avenue for improving the clustered federated distillation framework proposed in this work.

5. Discussion

In this study, we address the scenario of different data distributions between different client groups in federated distillation. We introduce a methodology that uses hierarchical clustering to categorize clients according to the number of labels predicted by each model for public data. This approach overcomes the limitations of traditional federated distillation techniques that assume a uniform data distribution when a label-based group structure exists. Our method can be used when different groups (e.g., demographic groups) have significantly different data distributions to ensure that all groups receive equally good results.
The experiments show that the model correctly classifies groups with different labels, and its accuracy exceeds that of traditional methods when there is a clear label-based cluster structure. In particular, the accuracy of minority client groups, which is problematic in traditional federated distillation, is significantly improved; this may pave the way for fair FL. Furthermore, our method does not require knowledge of the number of clusters, making it applicable in a wider range of environments. However, as the group structure becomes less clear, the performance gap between our method and existing algorithms narrows. We will continue to improve our method so that it performs better with ambiguous group structures.
It would be an interesting research topic to combine our method with different data types, such as text, more complex images, or time series data. Our method could also be combined with data-free distillation where no public data exists. Our algorithm will also be very effective in the presence of malicious clients that send false predictions to the server. By creating a group of malicious clients, we can ensure that other clients are not affected by them.

Author Contributions

Conceptualization, G.Y.; Methodology, G.Y.; Software, G.Y.; Validation, G.Y.; Formal Analysis, G.Y.; Investigation, G.Y.; Resources, G.Y.; Data Curation, G.Y.; Writing—Original Draft Preparation, G.Y.; Writing—Review and Editing, G.Y. and H.T.; Visualization, G.Y.; Supervision, H.T.; Project Administration, H.T.; Funding Acquisition, H.T. All authors have read and agreed to the published version of the manuscript.

Funding

This study was carried out with the support of the "R&D Program for Forest Science Technology (Project No. 2021383A00-2323-0101)" provided by the Korea Forest Service (Korea Forestry Promotion Institute). This work was also supported by the Korea Institute of Industrial Technology under the project "Development of holonic manufacturing system for future industrial environment" [KITECH EO-230006].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available because the associated code is still under development.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R. Advances and Open Problems in Federated Learning; Foundations and Trends® in Machine Learning: Hanover, MA, USA, 2021; Volume 14, pp. 1–210. [Google Scholar]
  2. Zhang, C.; Xie, Y.; Bai, H.; Yu, B.; Li, W.; Gao, Y. A survey on federated learning. Knowl.-Based Syst. 2021, 216, 106775. [Google Scholar] [CrossRef]
  3. Rieke, N.; Hancox, J.; Li, W.; Milletari, F.; Roth, H.R.; Albarqouni, S.; Bakas, S.; Galtier, M.N.; Landman, B.A.; Maier-Hein, K.; et al. The future of digital health with federated learning. NPJ Digit. Med. 2020, 3, 119. [Google Scholar] [CrossRef] [PubMed]
  4. Jeong, E.; Oh, S.; Kim, H.; Park, J.; Bennis, M.; Kim, S.L. Communication-efficient on-device machine learning: Federated distillation and augmentation under non-iid private data. arXiv 2018, arXiv:1811.11479. [Google Scholar]
  5. Li, D.; Wang, J. Fedmd: Heterogenous federated learning via model distillation. arXiv 2019, arXiv:1910.03581. [Google Scholar]
  6. Chang, H.; Shejwalkar, V.; Shokri, R.; Houmansadr, A. Cronus: Robust and heterogeneous collaborative learning with black-box knowledge transfer. arXiv 2019, arXiv:1912.11279. [Google Scholar]
  7. Sattler, F.; Müller, K.R.; Samek, W. Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 3710–3722. [Google Scholar] [CrossRef]
  8. Ghosh, A.; Chung, J.; Yin, D.; Ramchandran, K. An efficient framework for clustered federated learning. Adv. Neural Inf. Process. Syst. 2020, 33, 19586–19597. [Google Scholar] [CrossRef]
  9. Qayyum, A.; Ahmad, K.; Ahsan, M.A.; Al-Fuqaha, A.; Qadir, J. Collaborative federated learning for healthcare: Multi-modal COVID-19 diagnosis at the edge. IEEE Open J. Comput. Soc. 2022, 3, 172–184. [Google Scholar] [CrossRef]
  10. Huang, L.; Shea, A.L.; Qian, H.; Masurkar, A.; Deng, H.; Liu, D. Patient clustering improves efficiency of federated machine learning to predict mortality and hospital stay time using distributed electronic medical records. J. Biomed. Inform. 2019, 99, 103291. [Google Scholar] [CrossRef]
  11. Mansour, Y.; Mohri, M.; Ro, J.; Suresh, A.T. Three approaches for personalization with applications to federated learning. arXiv 2020, arXiv:2002.10619. [Google Scholar]
  12. Briggs, C.; Fan, Z.; Andras, P. Federated learning with hierarchical clustering of local updates to improve training on non-IID data. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN) IEEE, Glasgow, UK, 19–24 July 2020; pp. 1–9. [Google Scholar]
  13. Long, G.; Xie, M.; Shen, T.; Zhou, T.; Wang, X.; Jiang, J. Multi-center federated learning: Clients clustering for better personalization. World Wide Web 2023, 26, 481–500. [Google Scholar] [CrossRef]
  14. Duan, M.; Liu, D.; Ji, X.; Liu, R.; Liang, L.; Chen, X.; Tan, Y. FedGroup: Efficient clustered federated learning via decomposed data-driven measure. arXiv 2020, arXiv:2010.06870. [Google Scholar]
  15. Huang, W.; Ye, M.; Shi, Z.; Li, H.; Du, B. Rethinking federated learning with domain shift: A prototype view. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) IEEE, Vancouver, BC, Canada, 8–22 June 2023; pp. 16312–16322. [Google Scholar]
  16. He, C.; Annavaram, M.; Avestimehr, S. Group knowledge transfer: Federated learning of large cnns at the edge. Adv. Neural Inf. Process. Syst. 2020, 33, 14068–14080. [Google Scholar]
  17. Lin, T.; Kong, L.; Stich, S.U.; Jaggi, M. Ensemble distillation for robust model fusion in federated learning. Adv. Neural Inf. Process. Syst. 2020, 33, 2351–2363. [Google Scholar]
  18. Guha, N.; Talwalkar, A.; Smith, V. One-shot federated learning. arXiv 2019, arXiv:1902.11175. [Google Scholar]
  19. Li, Q.; He, B.; Song, D. Practical one-shot federated learning for cross-silo setting. arXiv 2020, arXiv:2010.01017. [Google Scholar]
  20. Zhou, Y.; Pu, G.; Ma, X.; Li, X.; Wu, D. Distilled one-shot federated learning. arXiv 2020, arXiv:2009.07999. [Google Scholar]
  21. LeCun, Y. The MNIST Database of Handwritten Digits. 1998. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 21 March 2023).
  22. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the Artificial Intelligence and Statistics, PMLR, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282. [Google Scholar]
  23. Rand, W.M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 1971, 66, 846–850. [Google Scholar] [CrossRef]
  24. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef]
  25. Itahara, S.; Nishio, T.; Koda, Y.; Morikura, M.; Yamamoto, K. Distillation-based semi-supervised federated learning for communication-efficient collaborative training with non-iid private data. IEEE Trans. Mob. Comput. 2021, 22, 191–205. [Google Scholar] [CrossRef]
  26. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. 2009. Available online: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed on 21 March 2023).
Figure 1. Federated distillation process using group structure. 1. Individual clients conduct local training using their own datasets. 2. Each client performs inference on a shared public dataset and forwards the results to a central server. 3. The server clusters the received outputs and obtains the soft label by averaging the outputs of the same group. 4. Clients receive the averaged soft labels corresponding to their groups. 5. Clients then perform distillation to align their model outputs with the received averaged soft labels.
Figure 2. Visualization of the count vector distribution with minor class that has a small percentage of each group. (a) When the minor class rate is 5%. (b) When the minor class rate is 10%. (c) When the minor class rate is 40%.
Figure 3. (a) Represents the average accuracy per client group when FedDF is applied, while (b) represents the average accuracy per client group when our method is applied. In both cases, the proportion for each client group is 6:2:2.
Figure 4. Performance of each algorithm as the amount of public dataset changes.
Figure 5. Graph representing accuracy in CIFAR with respect to the class per group.
Figure 6. Graph shows how performance varies depending on the "minor class" ratio. Each data point represents the average performance across dataset sizes of 100, 300, 500, and 1000. The MNIST dataset was used.
Table 1. Comparison of federated learning methods that utilize clustering. Note that FL+HC means Federated Learning with Hierarchical Clustering and IFCA means Iterative Federated Clustering Algorithm.

Name | Clustering Criterion | Dataset
Clustered FL (CFL) [7] | Similarity of ΔW | MNIST, CIFAR-10
FL+HC [12] | Similarity of ΔW | MNIST, FEMNIST
Federated SEM (FedSEM) [13] | Model that minimizes loss | MNIST, FEMNIST, CelebA
IFCA [8] | Model that minimizes loss | MNIST, CIFAR-10
Table 2. Average adjusted Rand index (ARI) for different distance thresholds in agglomerative clustering.

Distance Threshold | 0.25 | 0.5 | 1.0 | 1.5 | 2.0 | 2.5 | 3.0 | 3.5 | 4.0
ARI | 0.31 | 0.67 | 0.93 | 0.98 | 1.00 | 1.00 | 0.78 | 0.75 | 0.43
Table 3. Silhouette scores for each clustering criterion. "Ours" represents the silhouette score for the distribution of the count vectors output by each client model used by our method. "CFL" represents the silhouette score for the similarity of the weight-update distribution used by clustered federated learning. Columns give the number of classes per group, with CFL/Ours values.

Group | 2 classes (CFL/Ours) | 3 classes (CFL/Ours) | 4 classes (CFL/Ours) | 5 classes (CFL/Ours)
2 | 0.06/0.82 | 0.03/0.85 | 0.02/0.81 | 0.03/0.85
4 | 0.03/0.88 | 0.02/0.83 | 0.01/0.61 | 0.01/0.78
6 | 0.02/0.78 | 0.01/0.77 | 0.01/0.57 | 0.01/0.75
8 | 0.02/0.79 | 0.01/0.69 | 0.01/0.60 | 0.01/0.74
10 | 0.01/0.76 | 0.02/0.57 | 0.00/0.54 | 0.00/0.72
Table 4. Comparison of ARI scores under various group structures for our method and clustered federated learning (CFL). Columns give the number of classes per group, with CFL/Ours values.

Group | 2 classes (CFL/Ours) | 3 classes (CFL/Ours) | 4 classes (CFL/Ours) | 5 classes (CFL/Ours)
2 | −0.03/1.00 | 0.01/1.00 | −0.08/1.00 | 0.13/1.00
4 | 0.08/1.00 | −0.01/1.00 | 0.06/0.90 | −0.03/1.00
6 | 0.03/0.96 | −0.03/1.00 | −0.01/0.96 | 0.03/1.00
8 | −0.03/1.00 | 0.04/1.00 | −0.01/0.93 | −0.01/1.00
10 | 0.01/0.91 | 0.01/0.93 | −0.00/0.97 | −0.01/1.00
Table 5. Clustering performance with "minor classes" (the less frequent classes in each group). Each column header gives the proportion of the seven minor classes in each group.

Minor Class Proportion | 5% | 10% | 20% | 30% | 40% | 50%
Silhouette | 0.87 | 0.69 | 0.59 | 0.49 | 0.37 | 0.33
ARI | 1.0 | 1.0 | 1.0 | 1.0 | 0.9 | 0.49
Table 6. Average accuracy comparison between the FedDF method and our proposed group-based distillation method across different group structures. Columns give the number of classes per group, with FedDF/Ours values (%).

Group | 2 classes (FedDF/Ours) | 3 classes (FedDF/Ours) | 4 classes (FedDF/Ours) | 5 classes (FedDF/Ours)
2 | 70.0/92.3 | 74.0/90.9 | 80.2/90.7 | 83.2/93.7
4 | 55.0/98.0 | 49.8/90.9 | 70.8/92.9 | 80.7/92.3
6 | 30.9/93.8 | 72.7/93.6 | 74.6/88.6 | 69.6/92.8
8 | 58.9/94.8 | 58.2/93.3 | 67.6/87.6 | 82.8/91.6
10 | 53.0/90.5 | 73.5/94.0 | 81.0/91.7 | 83.2/92.1
Avg | 53.6/93.9 | 65.6/92.5 | 74.8/90.3 | 79.9/92.5
Table 7. Performance metrics across groups with different ratios. The 'Group Ratio' column shows the percentage of clients in each group. 'Group Acc' represents the average accuracy achieved by each group at the end of the training process, while 'Total Acc' represents the average accuracy across all clients. The order of the groups in 'Group Acc' matches 'Group Ratio'. All values are percentages.

Group Ratio | Group Acc (FedDF) | Group Acc (Ours) | Total Acc (FedDF) | Total Acc (Ours)
70, 30 | 97, 0 | 96, 92 | 68 | 95
80, 20 | 97, 0 | 97, 86 | 77 | 95
50, 30, 20 | 96, 1, 0 | 97, 89, 97 | 49 | 94
60, 20, 20 | 96, 0, 0 | 96, 91, 93 | 58 | 94
40, 30, 20, 10 | 96, 7, 4, 64 | 96, 89, 94, 97 | 48 | 94
50, 20, 20, 10 | 96, 0, 0, 66 | 96, 93, 91, 96 | 54 | 94