Article

Mutual Knowledge Distillation-Based Communication Optimization Method for Cross-Organizational Federated Learning

1 Faculty of Applied Sciences, Macao Polytechnic University, Macao SAR, China
2 School of Engineering and Technology, Central Queensland University, Brisbane 4000, Australia
* Author to whom correspondence should be addressed.
Electronics 2025, 14(9), 1784; https://doi.org/10.3390/electronics14091784
Submission received: 4 March 2025 / Revised: 8 April 2025 / Accepted: 14 April 2025 / Published: 27 April 2025

Abstract

With the increasing severity of data privacy and security issues, cross-organizational federated learning is facing challenges in communication efficiency and cost. Knowledge distillation, as an effective model compression technique, can reduce model size without significantly compromising accuracy, thereby lowering communication overhead. However, existing knowledge distillation methods either employ static distillation loss weights, ignoring bandwidth variations in communication networks, or fail to effectively account for bandwidth heterogeneity among different nodes, leading to communication bottlenecks. To enhance the overall system efficiency, there is an urgent need to find new methods that enable models to achieve optimal performance in resource-constrained environments. This paper proposes a communication optimization method based on mutual knowledge distillation (Fed-MKD) to address the bottleneck issues caused by high communication costs in cross-organizational federated learning. By leveraging a mutual distillation mechanism, Fed-MKD enables collaborative training of teacher and student models locally while reducing the frequency and size of global model transmissions to optimize communication. Our experimental results demonstrate that, compared to classical knowledge distillation methods, Fed-MKD significantly improves communication efficiency, with compression ratios ranging from 4.89× to 28.45×. Furthermore, Fed-MKD achieves up to 4.34× acceleration in convergence time across multiple datasets. These findings highlight the significant practical value of Fed-MKD in environments with heterogeneous data distributions and limited communication resources.

1. Introduction

Federated learning (FL) is an emerging distributed machine learning paradigm that enables multiple clients to collaboratively train a global model without sharing raw data. As model size increases, particularly with deep learning models like BERT [1] and CLIP [2], communication overhead becomes a major bottleneck. These large models demand significant computational resources and produce large gradient updates, resulting in high communication costs, which are especially challenging for resource-constrained devices [3,4,5].
Although communication optimization in federated learning has made some progress, existing methods often fail to effectively address the communication bottleneck posed by large models, especially in scenarios with heterogeneous client resources. Current mainstream solutions, such as reducing communication frequency [6,7] or model sparsification [8,9,10], do not fundamentally alleviate the communication burden.
Reducing communication frequency can lower per-round communication costs by decreasing the number of communications. However, this may hinder model convergence and degrade the final performance of the global model in some cases. On the other hand, model sparsification techniques reduce the amount of data transmission by discarding certain parameters or gradients, but they typically lead to a degradation in the model’s representational capacity.
To address these issues, we leverage knowledge distillation (KD) as a model compression technique for cross-organizational federated learning, which not only reduces the model size but also lowers communication costs while maintaining performance. This paper presents a federated learning communication optimization method based on a mutual knowledge distillation framework (Fed-MKD), specifically designed for cross-organizational federated learning scenarios. By transferring the knowledge of large teacher models to smaller student models, our approach effectively alleviates the communication burden in federated learning systems.
The main contributions of this work are summarized as follows:
(1) We propose a new cross-organizational federated learning framework that allows multiple organizations to collaborate in training models without sharing sensitive data.
(2) To optimize the federated learning process, we implement dynamic decision-making for local training rounds, eliminating the need for frequent model updates or transmissions. This approach reduces communication frequency, thereby lowering communication costs while preserving the effectiveness of model training.
(3) Fed-MKD dynamically adjusts the distillation loss weights during local training, allowing each client to flexibly adjust the distillation strategy based on the current training progress and network conditions. This adaptive adjustment enhances the efficiency of the distillation process and ensures that model accuracy is maintained even when communication frequency is reduced.

2. Related Work

Research on using knowledge distillation to optimize communication can be broadly categorized into two types: joint distillation and distillation compression.

2.1. Joint Distillation

The core idea of joint distillation methods is to avoid directly transmitting model updates. Instead, local model predictions on a shared public dataset (i.e., “soft labels” or “output probability distributions”) are exchanged between clients. Since clients are required only to exchange prediction results rather than model weights or gradients, the volume of data transmitted is significantly reduced, thereby effectively lowering communication costs.
For example, Sui et al. [11] proposed FedED, a method that does not transmit local parameters but instead uploads the aggregated prediction results of local models to train a global model. This approach reduces communication costs but relies on sharing a public dataset, which may pose significant privacy risks in practical applications. Lin et al. [12] proposed an ensemble distillation model aggregation approach, wherein the global model is trained using unlabeled data outputs from client models. This partially mitigates privacy exposure risks and allows the aggregation of models with varying sizes and structures. However, it still depends on local data outputs from clients and does not effectively reduce the communication overhead introduced by large-scale models. Meanwhile, FedGKT [13] employs client–server collaborative distillation to address communication and computation bottlenecks in large-scale model training on edge devices. Nonetheless, it remains limited in terms of privacy protection, fairness assurance, and adaptability to dynamic environments.

2.2. Distillation Compression

The second category focuses on distillation compression to optimize federated learning communication costs. This approach involves having the student model replicate the outputs of the teacher model (e.g., prediction results, hidden layer representations). This enables the student model to approximate the performance of the teacher model while utilizing fewer parameters. During the model update process, clients only need to transmit the compressed student model parameters, which improves communication efficiency and reduces computational overhead.
For instance, FedDF [12] implements federated learning through soft label distillation, where clients only transmit the probability distributions of the model outputs. This approach improves privacy protection and communication efficiency, but knowledge transfer efficiency may be limited in scenarios with large architectural differences (e.g., CNNs vs. Transformers). FedMD [14] allows clients and servers to exchange prediction distributions on a public dataset without repeatedly uploading and downloading large-scale model parameters. However, it faces the issue of frequent transmission of prediction distributions. FedKD [15] combines adaptive mutual knowledge distillation and gradient compression techniques, effectively enhancing communication efficiency and reducing training communication costs. However, this method relies on multiple rounds of local training on the client side, which may prevent it from fully adapting to changes in dynamic environments during training.
In summary, while significant progress has been made in optimizing communication overhead in federated learning through knowledge distillation, several limitations remain. Joint distillation methods rely on the sharing of public datasets, which introduces privacy leakage risks. Distillation compression techniques reduce communication overhead but do not effectively address adaptability challenges in dynamic environments.
This motivates our work to account for both communication efficiency and privacy protection by proposing a communication optimization method for federated learning based on mutual knowledge distillation (Fed-MKD). Our approach not only reduces communication frequency but also improves model compression by adaptively adjusting the distillation loss function weights, all while avoiding the direct exchange of sensitive data and enhancing privacy protection.

3. Preliminary

3.1. Problem Formulation

In this section, we provide a detailed definition of the problem to be addressed in the context of cross-organizational federated learning and present the corresponding system model. To facilitate understanding, key notations used throughout this study are summarized in Table 1.
In a cross-organizational federated learning system, all participating DCs use the same machine learning model for training [16,17]. These models share the same loss function $\ell(w, z)$, where $w \in \mathbb{R}^d$ represents the model parameters [18].
Assume a cross-organizational federated learning system involving N DCs participating in joint model training, where each DC relies on a parameter server (PS) for model parameter aggregation. Each DC maintains its own independent dataset $D_i$, and these datasets are mutually disjoint, i.e., $D = \bigcup_i D_i$ with $D_i \cap D_j = \emptyset$ when $i \neq j$. During the training process, all DCs optimize the same objective function according to Reference [19], which is expressed by:
$\text{minimize:} \quad F(w) = \frac{1}{|D|} \sum_{x \in D} \ell(w, x)$  (1)
where $|D|$ represents the number of samples in the dataset $D$.
Each $DC_i$ updates its local model parameters $w_i$ using the stochastic gradient descent (SGD) algorithm [20] to obtain an optimal global model $w^*$. During each training iteration, $DC_i$ computes the gradient $g_i$ of the local model using the following formula:
$g_i = \frac{1}{|D_i|} \sum_{x \in D_i} \nabla_{w_i} \ell(w_i, x)$  (2)
where $|D_i|$ denotes the number of samples held by $DC_i$, and $\nabla_{w_i} \ell(w_i, x)$ represents the gradient of the loss function with respect to the model parameters $w_i$, indicating the direction of model updates computed by $DC_i$ on its local dataset $D_i$.
The update process for the model parameters $w_i$ is as follows [21]:
$w_i = w_i - \eta_i \cdot g_i$  (3)
where $\eta_i$ is the learning rate of $DC_i$, controlling the step size of each parameter update.
Through this process, $DC_i$ updates its model parameters based on the corresponding local data. Once all DCs complete their local updates, they upload their model parameters to the PS. The global model $w$ is updated through the following weighted averaging approach [22]:
$w = \sum_i \frac{|D_i|}{|D|} w_i$  (4)
Here, $\frac{|D_i|}{|D|}$ represents the weight proportion of $DC_i$ in the global model. Through this aggregation approach, the local knowledge of each DC is effectively integrated, ultimately yielding a global model $w$.
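To make this workflow concrete, the following minimal PyTorch-style sketch implements the local SGD step of Equations (2) and (3) and the weighted aggregation of Equation (4). It is an illustrative sketch, not the paper's implementation; the helper names (local_grad, sgd_step, fedavg) are our own.

```python
from typing import List
import torch

def local_grad(model, loss_fn, data, targets):
    """Average gradient of the loss over one local batch (Eq. (2))."""
    model.zero_grad()
    loss = loss_fn(model(data), targets)
    loss.backward()
    return [p.grad.detach().clone() for p in model.parameters()]

def sgd_step(model, grads, lr):
    """Local SGD update w_i <- w_i - eta_i * g_i (Eq. (3))."""
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p -= lr * g

def fedavg(client_states: List[dict], client_sizes: List[int]) -> dict:
    """Weighted averaging w = sum_i (|D_i| / |D|) * w_i (Eq. (4))."""
    total = float(sum(client_sizes))
    return {
        k: sum((n / total) * s[k].float() for s, n in zip(client_states, client_sizes))
        for k in client_states[0]
    }
```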
In cross-organizational federated learning, due to factors such as non-IID data, communication constraints, and local updates, it is often challenging to find a globally optimal solution. Training is typically terminated early when the objective function value approaches a predefined threshold. Therefore, the core objective is to enhance the communication efficiency of cross-organizational federated learning by reducing the communication time required for each round of global aggregation, thereby achieving a more efficient training process.
Let $\tau_r^i$ denote the communication time of $DC_i$ during the $r$-th round of global aggregation, $F(w_R)$ represent the objective function value after $R$ rounds of global aggregation, and $thr$ be the predefined threshold mentioned earlier. The problem addressed in our paper is formally defined as
$\text{minimize:} \; \sum_{r=1}^{R} \max_{i \in [1, N]} \tau_r^i \qquad \text{subject to:} \; F(w_R) \le thr$  (5)
The optimization objective in Equation (5) aims to minimize the maximum communication time among all participating DCs during each global aggregation round. This objective is fundamentally linked to the core performance metrics in federated learning: the minimization of maximum communication time is achieved through student model compression and a dynamic adjustment strategy, while the constraint $F(w_R) \le thr$ ensures that this optimization is not achieved at the expense of model accuracy.
Hence, the optimization goal of minimizing communication time complements the typical objective of maximizing model accuracy. By enabling more local training rounds before each global aggregation, it simultaneously optimizes both communication time and model accuracy. This balance enhances the overall efficiency of the federated learning system, ensuring accelerated convergence and reduced communication overhead in scenarios with heterogeneous data and resources.
It is noteworthy that Fed-MKD compresses the original large-scale model using knowledge distillation techniques. This approach requires each DC to simultaneously maintain and train two models of different sizes during the training process. In the Fed-MKD framework, the global model used is the smaller student model. The specific implementation of this method and its optimization strategies will be discussed in detail in the subsequent sections.

3.2. System Model Architecture

Some researchers have proposed transferring the knowledge of two models through knowledge distillation, where only the student model’s parameters are transmitted during global model aggregation, thereby significantly reducing the communication data volume. However, in cross-organizational federated learning environments, the datasets owned by different clients are often non-independent and identically distributed. As a result, the teacher model locally trained at each data center must simultaneously learn the features and knowledge from other client models.
To tackle this issue, we introduce a mutual distillation framework [15], in which the teacher and student models are trained together and learn from each other. The student model learns more precise discriminative methods from the teacher model, while the teacher model aggregates student models from different clients and learns the features and information from each client model.
This mutual distillation framework promotes knowledge exchange between two models through a bidirectional learning mechanism, thereby enhancing the efficiency and accuracy of federated learning systems. The cross-organizational federated model system based on the mutual distillation architecture is shown in Figure 1.
In cross-organizational federated learning, multi-modal data (MRI images and EEG time-series signals in medical scenarios) are often distributed across heterogeneous clients, leading to significant differences in the data modalities held by different organizations [23]. Additionally, traditional federated learning faces difficulties in establishing a unified feature representation space, and the semantic gap between modalities can result in inefficient knowledge transfer. The Fed-MKD framework, as shown in Figure 1, addresses these challenges through an innovative mutual distillation mechanism and is compatible with multi-modal data scenarios.
At the feature extraction level, Fed-MKD does not mandate a unified network architecture. Each client can autonomously select the optimal feature extractor based on local data characteristics (CNN for image processing, LSTM for time-series signal processing), with the only requirement being that the final prediction layer dimensions are consistent. This flexibility is achieved through a dual-model design: the teacher model retains a complete modality-specific feature extraction pipeline, while the student model aligns the logit distributions across modalities via the KL divergence in the logit space to enable cross-modal knowledge transfer. This architecture inherently possesses modality independence, allowing clients with different data modalities to maintain their native feature extraction processes, while reaching consensus through matching the probability distributions in the prediction space.
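To illustrate this architecture independence, the sketch below (our own example, not the paper's code) defines two modality-specific backbones, a small CNN for images and an LSTM for time-series signals, whose prediction heads share the same output dimension so that their logits can be aligned in the prediction space; the layer sizes and the class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 10  # the only dimension on which all clients must agree

class ImageClient(nn.Module):
    """A small CNN backbone for image modalities (e.g., MRI slices)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(16, NUM_CLASSES)

    def forward(self, x):                      # x: (B, 1, H, W)
        return self.head(self.backbone(x).flatten(1))

class SignalClient(nn.Module):
    """An LSTM backbone for time-series modalities (e.g., EEG channels)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)
        self.head = nn.Linear(32, NUM_CLASSES)

    def forward(self, x):                      # x: (B, T, 8)
        out, _ = self.backbone(x)
        return self.head(out[:, -1])           # logits in the shared prediction space
```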
For each client participating in the training, it is assumed that the training is conducted within a DC. Each DC simultaneously trains both models using the same local data. The cross-organizational federated learning training process based on the mutual distillation architecture involves the following steps:
  • Each DC trains both models using local data (the local training process includes mutual distillation between the two models);
  • DC pushes the updated student model parameters to PS;
  • The PS aggregates the parameter updates from each DC;
  • The PS generates the global student model using the aggregated results;
  • DC pulls the updated global student model from the PS;
  • DC updates the local student model using the global model.
This process optimizes the student model through shared model parameters. The mutual distillation framework enables the teacher model and student model to learn from each other, promoting knowledge sharing and integration across organizations.

4. Methodology

4.1. Mutual Knowledge Distillation Strategy

In the Fed-MKD framework, within each DC, both models undergo mutual distillation. The two models have different loss functions when using the SGD strategy. Therefore, in the mutual distillation system, the loss calculation for the two models must be performed separately, as shown in Figure 2.
The following provides a detailed explanation of each step in the implementation of Fed-MKD.
Let $Z^t = \{z_1^t, z_2^t, \ldots, z_n^t\}$ be the logits (i.e., the raw scores) output by the teacher model for a classification problem on a training sample $x$, and $Z^s = \{z_1^s, z_2^s, \ldots, z_n^s\}$ be the logits output by the student model for the same sample, where $n$ is the number of classes in the classification task. The softmax function with distillation temperature $T$ (a hyperparameter that controls the smoothness of the soft labels generated by the teacher model) is defined as follows:
$y_i = \frac{\exp(z_i / T)}{\sum_{j=1}^{n} \exp(z_j / T)}$  (6)
Here, $y_i$ is the predicted probability for class $i$, $z_i$ is the logit for class $i$, and $\sum_{j=1}^{n} \exp(z_j / T)$ is the sum of the temperature-scaled exponentiated logits over all classes.
To more intuitively illustrate the physical significance of the distillation temperature, we provide a schematic of the output probability calculation process and present examples of the results obtained under different distillation temperatures $T$ in Figure 3. The teacher network (upper part) and the student network (lower part) both receive the same training data. Next, the teacher network outputs intermediate features and prediction results and calculates the soft targets using the temperature $T_t$, which are then used to align knowledge with the student network.
For $T < 1$, the probability distribution sharpens, reflecting increased model confidence in the highest logits and greater concentration of probability on a single class. At $T = 1$, Equation (6) corresponds to the standard softmax function. When $T > 1$, the softmax-generated probability distribution becomes smoother, and as $T$ approaches infinity, the output converges to a uniform distribution, with class probabilities becoming nearly equal.
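The temperature-scaled softmax of Equation (6) and the effect of $T$ can be reproduced in a few lines of PyTorch; the helper name soft_targets and the example logits below are our own.

```python
import torch
import torch.nn.functional as F

def soft_targets(logits: torch.Tensor, T: float) -> torch.Tensor:
    """y_i = exp(z_i / T) / sum_j exp(z_j / T) (Eq. (6))."""
    return F.softmax(logits / T, dim=-1)

z = torch.tensor([4.0, 1.0, 0.2])
print(soft_targets(z, T=0.5))  # sharper: probability concentrates on the top logit
print(soft_targets(z, T=1.0))  # standard softmax
print(soft_targets(z, T=5.0))  # smoother: closer to a uniform distribution
```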
Let $Z^t$ and $Z^s$ go through the traditional softmax layer with $T = 1$ to obtain the standard decision rules of the two models, denoted $Y^t = \{y_1^t, y_2^t, \ldots, y_n^t\}$ and $Y^s = \{y_1^s, y_2^s, \ldots, y_n^s\}$, respectively. Let the true label of sample $x$ be $Y = \{y_1, y_2, \ldots, y_n\}$, where for any $1 \le i \le n$, $y_i \in \{0, 1\}$ and $\sum_{i=1}^{n} y_i = 1$, i.e., only the correct class has a label of 1, and all other classes have a label of 0.
Based on the work presented in References [24,25,26], for two probability distributions $P = \{p_1, p_2, \ldots, p_n\}$ and $Q = \{q_1, q_2, \ldots, q_n\}$, the cross-entropy between $Q$ and $P$ is defined as
$H(P, Q) = -\sum_{i=1}^{n} p_i \log(q_i)$  (7)
where $p_i$ is the probability of class $i$ in the true distribution $P$ (typically obtained through hard labels) and $q_i$ is the probability of class $i$ in the model's predicted distribution $Q$. The difference between the two probability distributions $P$ and $Q$ can be measured using their Kullback–Leibler (KL) divergence [27], which is defined as
$D_{\mathrm{KL}}(P \parallel Q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}$  (8)
According to Reference [28], the original loss (i.e., hard loss) of the two models on a sample $x$ is commonly defined as the cross-entropy between the predicted distribution and the true label: $l_t^h(x) = H(Y, Y^t)$ for the teacher model and $l_s^h(x) = H(Y, Y^s)$ for the student model. For simplicity, we omit model parameters and only use the indices $t$ and $s$ to represent the teacher and student models, respectively.
Let $Z^t$ and $Z^s$ go through the softmax layers with temperatures $T = T_t$ and $T = T_s$, respectively, to obtain the soft decision rules $U^t = \{u_1^t, u_2^t, \ldots, u_n^t\}$ and $U^s = \{u_1^s, u_2^s, \ldots, u_n^s\}$.
Define the distillation loss as the KL divergence between the two models' probability distributions. The distillation loss of the teacher model on sample $x$ is $l_t^d(x) = D_{\mathrm{KL}}(U^s \parallel U^t)$, and the distillation loss of the student model is $l_s^d(x) = D_{\mathrm{KL}}(U^t \parallel U^s)$.
Based on the previous calculations, for a training dataset $D$, the original loss of the teacher model is given by
$L_t^h = \frac{1}{|D|} \sum_{x \in D} l_t^h(x)$  (9)
where $|D|$ denotes the total number of samples in the dataset $D$, and $l_t^h(x)$ is the loss function for the teacher model on a given sample $x$.
Similarly, the original loss for the student model is given by
$L_s^h = \frac{1}{|D|} \sum_{x \in D} l_s^h(x)$  (10)
The teacher model's distillation loss is defined as the average KL divergence between the soft targets produced by the teacher and the student model:
$L_t^d = \frac{1}{|D|} \sum_{x \in D} l_t^d(x)$  (11)
Similarly, the distillation loss for the student model, based on the output of the teacher model, is given by
$L_s^d = \frac{1}{|D|} \sum_{x \in D} l_s^d(x)$  (12)
The final loss function of the teacher model is computed as a weighted sum of the hard loss and the distillation loss:
$L_t = \alpha \cdot L_t^h + (1 - \alpha) \cdot L_t^d$  (13)
where $\alpha$ is a predefined constant that controls the trade-off between the hard loss $L_t^h$ and the distillation loss $L_t^d$.
Similarly, the final loss function for the student model is
$L_s = \beta \cdot L_s^h + (1 - \beta) \cdot L_s^d$  (14)
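The loss definitions above (Equations (6)–(14)) can be assembled into per-batch teacher and student losses as in the minimal sketch below. It assumes hard integer class labels and PyTorch's kl_div convention (log-probabilities as input, probabilities as target); the function name and the small smoothing constant are illustrative, not part of the paper's implementation.

```python
import torch
import torch.nn.functional as F

def mutual_distillation_losses(z_t, z_s, labels, T_t, T_s, alpha, beta):
    """Per-batch teacher/student losses following Eqs. (9)-(14)."""
    # Hard losses: cross-entropy against the integer class labels.
    l_t_h = F.cross_entropy(z_t, labels)
    l_s_h = F.cross_entropy(z_s, labels)

    # Soft decision rules U^t, U^s via temperature-scaled softmax.
    u_t = F.softmax(z_t / T_t, dim=-1)
    u_s = F.softmax(z_s / T_s, dim=-1)

    # Distillation losses: D_KL(U^s || U^t) for the teacher and
    # D_KL(U^t || U^s) for the student. kl_div takes log-probabilities as
    # input and probabilities as target; each model treats the peer's
    # distribution as a fixed target (detach), so it is optimized only
    # through its own loss.
    l_t_d = F.kl_div(torch.log(u_t + 1e-12), u_s.detach(), reduction="batchmean")
    l_s_d = F.kl_div(torch.log(u_s + 1e-12), u_t.detach(), reduction="batchmean")

    # Weighted sums of hard and distillation losses (Eqs. (13)-(14)).
    loss_t = alpha * l_t_h + (1 - alpha) * l_t_d
    loss_s = beta * l_s_h + (1 - beta) * l_s_d
    return loss_t, loss_s
```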

4.2. Implementation Process of Fed-MKD

4.2.1. Local Training Rounds Adjustment

To further reduce the communication cost under resource-constrained environments in cross-organizational federated learning, we propose a strategy that decreases the communication frequency between clients and the PS. The decision mechanism of this strategy is based on the global model’s performance improvement to determine the number of local training rounds before each aggregation. Specifically, the parameter server monitors the performance change of the global model between the last two aggregations.
If the performance improvement between the two aggregations is large, clients will perform more local training rounds to improve the global model. In contrast, if the performance improvement is small, fewer local training rounds will be performed to avoid over-training.
We assume that $L_r$ represents the loss of the global model after the $r$-th aggregation on the test dataset. The performance improvement between the $(r-1)$-th and $r$-th aggregations can be calculated as
$\Delta L_r = L_{r-1} - L_r$  (15)
where $L_{r-1}$ corresponds to the performance of the global model after the $(r-1)$-th aggregation.
Then, we use a constant $A$ to set the upper limit for the number of local training rounds, thereby controlling the proportional relationship between the number of local training rounds and the performance improvement. By multiplying the relative performance improvement by the constant $A$, we can obtain an initial target for the number of local training rounds:
$E_r^* = A \cdot \frac{L_{r-1} - L_r}{L_{r-1}}$  (16)
Considering that at least one round of local training is required in practical applications, we apply a maximum function to ensure that the number of local training rounds is not less than 1. The final decision-making mechanism is represented by the following formula:
$E_r = \max\left( A \cdot \frac{L_{r-1} - L_r}{L_{r-1}},\ 1 \right)$  (17)
From this formula, we can observe that $E_r \in [1, A]$. Based on this strategy, if the performance of the aggregated global model improves significantly compared to the last aggregation, the client will perform more rounds of local training. Otherwise, fewer local training rounds will be conducted to avoid overtraining.
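A minimal implementation of the decision rule in Equation (17), with an illustrative usage example, is sketched below; the function name and the value of A are our own choices.

```python
def local_epochs(loss_prev: float, loss_curr: float, A: int) -> int:
    """E_r = max(A * (L_{r-1} - L_r) / L_{r-1}, 1), which lies in [1, A]."""
    rel_improvement = (loss_prev - loss_curr) / loss_prev
    return int(max(min(A * rel_improvement, A), 1))

# A 30% relative loss reduction with A = 10 yields 3 local epochs;
# no improvement falls back to the minimum of 1 epoch.
print(local_epochs(1.0, 0.7, A=10))  # 3
print(local_epochs(0.7, 0.7, A=10))  # 1
```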

4.2.2. Teacher Model Training with Adaptive Loss Weights

To enhance the distillation performance of the teacher model, this work introduces an adaptive mechanism for adjusting the weights of the teacher model's loss function. This mechanism regulates the information transfer during the distillation process by controlling the weight ratio between the hard label loss ($L_h$) and the distillation loss ($L_d$) in the teacher model's loss function.
We use the constant $\delta$ to influence the initial weight of the teacher model's original loss. The value of $\delta$ should be chosen based on the dataset and the desired balance between the teacher model's fit to local data and its ability to perform distillation. Another hyperparameter, $\epsilon$, determines the rate of change in the weight adjustment during training. It should be set based on the total number of local training rounds $E$ before the next global aggregation. The weight of the teacher model's original loss is constrained within the range $[\delta, \delta + \epsilon]$, where $0 \le \delta$ and $\delta + \epsilon \le 1$.
Let $\alpha = \delta \left(1 + \frac{\epsilon \ln e}{\delta \ln E}\right)$, where $E$ denotes the total number of local training rounds to be performed before the next aggregation, and $e$ represents the number of local training rounds already completed in the current global cycle. Substituting $\alpha$ into Equation (13), the final loss function of the teacher model can be expressed as follows:
$L_t = \delta \left(1 + \frac{\epsilon \ln e}{\delta \ln E}\right) \cdot L_t^h + \left[1 - \delta \left(1 + \frac{\epsilon \ln e}{\delta \ln E}\right)\right] \cdot L_t^d, \quad E > 1$  (18)
Notice that, when $E = 1$, $\alpha$ reverts to its original constant value. The teacher model loss function can then be described as
$L_t = \begin{cases} \delta \left(1 + \frac{\epsilon \ln e}{\delta \ln E}\right) \cdot L_t^h + \left[1 - \delta \left(1 + \frac{\epsilon \ln e}{\delta \ln E}\right)\right] \cdot L_t^d, & E > 1 \\ \alpha \cdot L_t^h + (1 - \alpha) \cdot L_t^d, & E = 1 \end{cases}$  (19)
It can be observed that, during the initial phase of each local training round, the distillation loss of the teacher model dominates the overall loss. This is attributed to the student model having just undergone aggregation, integrating information from other DCs. To enhance the teacher model’s ability to capture features from other DCs, the weight of the distillation loss is increased early in local training. As training progresses, this weight gradually decreases, allowing the teacher model to focus more on optimizing performance for the local dataset.
Since the dynamic weight adjustment mechanism in the Fed-MKD framework introduces a modality-aware adaptive strategy, it significantly enhances knowledge transfer efficiency in multi-modal federated learning scenarios. For high-dimensional image data, during the initial training phase, the hard label loss weight $\alpha$ is gradually adjusted towards its upper limit $\delta + \epsilon$ using a monotonically increasing function (as shown in Equation (18)). This design effectively stabilizes the gradient update process of the convolutional kernels, avoiding the feature extractor oscillation problem common in traditional fixed-weight schemes. For low signal-to-noise ratio (SNR) time-series data such as EEG signals, Fed-MKD automatically increases the distillation loss weight $(1 - \alpha)$ through $\epsilon$. The adjustment magnitude is negatively correlated with the SNR: $\epsilon = \epsilon_0 + \gamma \left(1 - \frac{SNR}{SNR_0}\right)$, where $\gamma$ is the modality-sensitive coefficient. This nonlinear mapping function ensures that, when the SNR is low, the model automatically allocates a higher weight to the distillation loss, thereby significantly improving the model's robustness in noisy environments.
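Under our reconstruction of $\alpha = \delta\left(1 + \frac{\epsilon \ln e}{\delta \ln E}\right) = \delta + \epsilon \frac{\ln e}{\ln E}$, which rises from $\delta$ at $e = 1$ to $\delta + \epsilon$ at $e = E$, the adaptive teacher weight of Equations (18) and (19) can be sketched as follows; the function names and the default $\delta$, $\epsilon$ values are illustrative assumptions.

```python
import math

def teacher_alpha(e: int, E: int, delta: float, eps: float) -> float:
    """Hard-label weight at local epoch e of E: delta + eps * ln(e) / ln(E)."""
    if E <= 1:
        return delta  # Eq. (19): fall back to the constant weight when E = 1
    return delta + eps * math.log(e) / math.log(E)

def teacher_loss(l_t_h, l_t_d, e, E, delta=0.5, eps=0.3):
    a = teacher_alpha(e, E, delta, eps)      # rises from delta to delta + eps
    return a * l_t_h + (1 - a) * l_t_d       # distillation weight is (1 - a)
```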

4.3. The Specific Algorithm Design

This section elaborates on the algorithmic workflows designed for the participating client DCs and the PS in Algorithm 1 and Algorithm 2, respectively.
Algorithm 1 delineates the internal training process of the DCs within the Fed-MKD system. The input consists of the total number of global training rounds $R$ and the local dataset $D_i$. During each global training round, DC $i$ pulls the current global student model from the PS to update the local student model and retrieves the number of local training epochs $E$ required for the current global round (Lines 3–4). Subsequently, DC $i$ performs $E$ rounds of local training (Lines 5–16). For each mini-batch, DC $i$ feeds the data into both models to obtain their respective predictions (Line 7). Based on these predictions, the original loss $L_t^h$ and distillation loss $L_t^d$ of the teacher model, and the original loss $L_s^h$ and distillation loss $L_s^d$ of the student model, are computed as described in Section 4.1 (Line 8). The final losses $L_t$ and $L_s$ for the teacher and student models are calculated by weighted summation of the original and distillation losses (Lines 9–10). Notably, to enable the teacher model to better learn features from other participating DCs, the weights of $L_t^h$ and $L_t^d$ in $L_t$ are adaptively adjusted during the local training process. Following this, the gradients $g_t^e$ and $g_s^e$ of the two models are computed based on $L_t$ and $L_s$, respectively (Lines 11–12). These gradients are then used to update the local parameters $w_t^i$ and $w_s^i$ of the teacher and student models (Lines 14–15). Finally, DC $i$ uploads the local student model parameters $w_s^i$ to the PS (Line 17).
Algorithm 2 describes the operational process of the PS in the Fed-MKD system. Initially, the PS randomly initializes the parameters of the global student model (Line 1), followed by iterative global training rounds (Lines 2–10). In each global training round, the PS collects all local student model parameters from the participating DCs (Lines 3–5). Subsequently, the PS computes the aggregated parameters $w_r$ and updates the global parameters (Line 6). The performance of the updated global model is evaluated using a test dataset, and the loss function value for the current round is obtained (Line 7). Next, the number of local training epochs $E_r$ for the DCs before the next global aggregation is calculated based on the relative reduction in the loss function between the current and previous rounds (Line 8). Finally, the PS broadcasts the current global student model parameters $w_r$ and the number of local training epochs $E_r$ to all DCs (Line 9).
Algorithm 1 Fed-MKD (Participating DC $i$: $i = 1, 2, \ldots, N$)
Require: $R$, $D_i$
 1: $D \leftarrow$ Split $D_i$ into mini-batches of size $k$;
 2: for $r \le R$ do
 3:   Pull the global student model from the PS and update the local student model $w_s^i \leftarrow w_r$;
 4:   Pull the number of local training epochs $E \leftarrow E_r$ for the $r$-th cycle from the PS;
 5:   for $e \le E$ do
 6:     for mini-batch $d \in D$ do
 7:       Obtain predictions from the teacher model and student model $Pred_t$, $Pred_s$;
 8:       Compute the losses $L_t^h$, $L_t^d$, $L_s^h$, $L_s^d$;
 9:       Compute the teacher model loss $L_t$ with Equation (19);
10:       Compute the student model loss $L_s = \beta \cdot L_s^h + (1 - \beta) \cdot L_s^d$;
11:       Compute the teacher model gradient $g_t^e \leftarrow g_t^e + \nabla L_t$;
12:       Compute the student model gradient $g_s^e \leftarrow g_s^e + \nabla L_s$;
13:     end for
14:     Update the teacher model $w_t^i \leftarrow w_t^i - \eta_t \cdot g_t^e$;
15:     Update the student model $w_s^i \leftarrow w_s^i - \eta_s \cdot g_s^e$;
16:   end for
17:   Upload $w_s^i$ to the PS.
18: end for
Algorithm 2 Fed-MKD (Parameter Server)
Require: $R$, $N$
 1: Initialize the global model parameters $w_0 \leftarrow w_{init}$;
 2: for $r \le R$ do
 3:   for $i \le N$ do
 4:     Receive $w_s^i$ from client DC $i$;
 5:   end for
 6:   Aggregate and update the global model parameters $w_r = \sum_i w_s^i / N$;
 7:   Compute the loss $L_r \leftarrow \mathrm{test}(w_r)$ using the test dataset;
 8:   Compute the number of local training epochs $E_r = \max\left( A \cdot \frac{L_{r-1} - L_r}{L_{r-1}},\ 1 \right)$;
 9:   Broadcast $w_r$, $E_r$ to all DCs.
10: end for
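To tie Algorithms 1 and 2 together, the following compact simulation sketches one global round from the parameter server's perspective. It is an illustrative sketch: the helpers local_train and evaluate_loss stand in for the client-side mutual distillation of Algorithm 1 and the test-set evaluation, and the plain average on Line 6 follows Algorithm 2 rather than the weighted average of Equation (4).

```python
import copy
import torch

def one_global_round(global_student, teachers, students, loaders,
                     E_r, A, loss_prev, local_train, evaluate_loss):
    """One Fed-MKD round: local mutual distillation, averaging, next E_r."""
    client_states = []
    for teacher, student, loader in zip(teachers, students, loaders):
        student.load_state_dict(global_student.state_dict())       # pull w_r
        local_train(teacher, student, loader, epochs=E_r)           # Alg. 1, Lines 5-16
        client_states.append(copy.deepcopy(student.state_dict()))   # push w_s^i

    # Alg. 2, Line 6: plain average of the uploaded student parameters.
    avg = {k: torch.stack([s[k].float() for s in client_states]).mean(dim=0)
           for k in client_states[0]}
    global_student.load_state_dict(avg)

    loss_curr = evaluate_loss(global_student)                       # Alg. 2, Line 7
    E_next = int(max(A * (loss_prev - loss_curr) / loss_prev, 1))   # Alg. 2, Line 8
    return loss_curr, E_next
```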
In cross-organizational federated learning frameworks, computational complexity directly impacts the scalability and deployment feasibility of a given method. We quantify the computational complexity of the proposed Fed-MKD approach. The analysis results can be used to assess the scalability of different knowledge distillation methods when the number of clients increases or when the data scale expands.
Based on a typical knowledge distillation pipeline, we define four key computational components: forward propagation, backward propagation, loss computation, and parameter aggregation, among which forward and backward propagation usually constitute the primary computational bottleneck.
Let $d_1$ denote the number of parameters of the teacher model and $d_2$ that of the student model ($d_2 < d_1$). Let $B$ represent the local batch size, $C$ the total number of classes in the classification task (i.e., the output dimension of the model), $E$ the number of local training epochs before each aggregation, and $e$ the total number of global communication rounds. Then, the total computational complexity of Fed-MKD can be expressed as
$C_{\text{Fed-MKD}} = O\left( e \cdot E \cdot B \cdot (2 d_1^2 + 3 d_2^2 + C^2) + K \cdot (d_1 + d_2) \right)$  (20)
The computational complexity of Fed-MKD per iteration can be decomposed as
(1) Teacher model forward pass: $O(B \cdot d_1^2)$;
(2) Student model forward pass: $O(B \cdot d_2^2)$;
(3) Mutual distillation loss calculation (KL divergence): $O(B \cdot C^2)$;
(4) Student model backward propagation: $O(B \cdot d_2^2)$;
(5) Dynamic weight update: $O(d_1 + d_2)$.
Although maintaining both the teacher model (with parameter size $d_1$) and the student model (with parameter size $d_2$) may seem to increase the computational burden, the actual per-round complexity is $O\left(B (2 d_1^2 + 3 d_2^2 + C^2)\right) + O\left(K (d_1 + d_2)\right)$.
Compared to the baseline methods used in the experimental section, the differences in computational efficiency mainly stem from the architectural design. FedMD [14] only utilizes the teacher model, resulting in a per-round complexity of $O\left(B (2 d_1^2 + C)\right)$, with measured FLOPs of $(81.5 \pm 2.3) \times 10^9$, and its communication cost is 4.1 times that of Fed-MKD. FedGKT [13], although having the lowest computational cost ($O\left(B (2 d_2^2 + C^2)\right)$), lacks the guidance of the teacher model and requires 15% more training epochs to achieve comparable accuracy ($p < 0.01$). Notably, while the theoretical complexity of Fed-MKD is similar to that of the fixed-weight variant Fed-MKDFW [15], the former achieves a 19% practical speedup due to its dynamic adjustment mechanism, which reduces unnecessary computations by adapting to non-IID data.

5. Simulation Experiment and Results Analysis

5.1. Experimental Setup

5.1.1. Simulation Platform Setup

The platform used for experimental evaluation is a server workstation running the Windows 11 Workstation operating system, equipped with an Intel Core i9-14900K processor, an NVIDIA GeForce RTX 4060 Ti graphics card, and 128 GB of memory. The simulation code is implemented using the PyTorch 2.3.0 framework.
The simulated cross-organizational federated learning system in the experiment consists of 16 client DCs and 1 PS. To control variables in the experiment, we ignore the bandwidth differences between wide-area network links and set the bidirectional bandwidth between each DC and the PS to 100 Mbps.

5.1.2. Dataset and Model Parameters

This study conducted experiments using three commonly used datasets: Fashion-MNIST [29], CIFAR-100 [30], and Mini-ImageNet [31]. The Mini-ImageNet dataset was preprocessed following the method described in [32] to adapt it for classification tasks, dividing the data into training, validation, and test sets.
For Fashion-MNIST, both the teacher model and the student model are custom CNNs. The teacher model consists of two convolutional layers, two pooling layers, and one fully connected layer, with the number of convolutional kernels ranging from 16 to 256. The student model has the same hierarchical structure as the teacher model, but the number of convolutional kernels is reduced to a range of 8–64. For the CIFAR-100 dataset, we conducted two sets of experiments. In the first set, the teacher model was implemented using ResNet50, while the student model was a customized residual network, ResNet-Mini, designed with fewer residual blocks. In the second set, ResNet101 and ResNet18 were employed as the teacher model and the student model. For Mini-ImageNet, ResNet101 and ResNet18 were used as the teacher model and the student model, respectively.
In terms of hyperparameters, we utilized the SGD optimizer with momentum for gradient descent, with a mini-batch size of 128. The weights and biases of the models were initialized using a Gaussian distribution [33] with a mean of 0 and a variance of 0.1. For Fashion-MNIST, the number of training epochs was set to 100, while for CIFAR-100 and Mini-ImageNet, it was set to 500. The learning rate was set to $1 \times 10^{-3}$. To enhance training accuracy, a decaying learning rate was employed for training on the CIFAR-100 and Mini-ImageNet datasets.

5.1.3. Comparative Schemes and Performance Metrics

The comparative methods used in the experiments are described as follows:
  • FedMD [14]: Locally trains only a larger teacher model to accelerate the learning process during training. This method serves as a baseline for comparing the effectiveness of training acceleration.
  • FedGKT [13]: Locally trains only the student model, aiming to achieve performance comparable to the teacher model with fewer parameters.
  • FedKD [15]: Employs the original mutual distillation framework for training.
  • Fed-MKDFW: Utilizes fixed loss weights during training with Fed-MKD, designed to evaluate the advantages of Fed-MKD’s adaptive adjustment of distillation weights under non-IID data distribution scenarios.
Moreover, we use four performance metrics with practical research significance to evaluate the communication optimization performance of Fed-MKD in cross-organizational federated learning: (1) model test accuracy; (2) convergence time; (3) total compression ratio; and (4) model training performance under non-independent and identically distributed (non-IID) settings.
Model accuracy refers to the prediction accuracy or correctness of the model on a given task, typically calculated on a test set. It is used to measure the model’s ability to correctly classify unseen data. In the knowledge distillation system used in this paper, only the student model participates in aggregation, and the global model is derived from it. Generally, the faster the global model accuracy improves, the faster the convergence speed.
The second metric is the training time required for the model to converge: if the change in the target value is less than 0.5% over ten consecutive epochs, the model is declared to have converged. Convergence detection begins only after the target value reaches a relatively optimal value, to avoid premature convergence judgments in cases where the loss curve exhibits step-like behavior.
The third metric is the ratio of the total data size transmitted by the baseline method (FedMD) at convergence to the total data size transmitted by other methods. The total compression ratio of the baseline is set to 1 in this experiment. Let $S_b = N \cdot E_b \cdot M_b$ denote the total data size transmitted by the baseline model, where $N$ is the number of DCs, $E_b$ is the number of epochs required for convergence, and $M_b$ is the number of parameters in the baseline model. Similarly, let $S_j = N \cdot E_j \cdot M_j$ represent the total data size transmitted by method $j$. The total compression ratio for method $j$ is defined as $\frac{S_b}{S_j} = \frac{E_b M_b}{E_j M_j}$.
The fourth metric measures the final accuracy of the global model under the scenario where training data across different DCs are not independent and identically distributed (non-IID). The non-IID setup follows the methodology described in [34]. This non-IID setting method first sorts the training data based on class labels, and then randomly assigns an exclusive subset of classes for each data center, with the size of the subset denoted as ξ . Here, ξ represents the number of classes allocated to each data center. A smaller value of ξ implies that the data diversity of an individual client is lower and that the differences in data distribution among different clients are greater.
The varying class assignments demonstrate the inherent data diversity between different DCs, which simulates real-world scenarios where clients may have different data distributions. This non-IID setup helps in testing the robustness of the federated learning method in conditions wherein the data are not uniformly distributed across all clients. For Fashion-MNIST, ξ is set to 8 and 4, while for CIFAR-100 and Mini-ImageNet, ξ is set to 80 and 40.
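A minimal sketch of this class-based non-IID partition is given below; the random class-assignment policy, the function name, and the seed handling are our own assumptions rather than the exact procedure of [34].

```python
import random
from collections import defaultdict

def non_iid_partition(labels, num_dcs, xi, num_classes, seed=0):
    """Assign xi classes to each DC and give it only samples of those classes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    partition = []
    for _ in range(num_dcs):
        classes = rng.sample(range(num_classes), xi)    # xi classes per DC
        partition.append([i for c in classes for i in by_class[c]])
    return partition
```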

5.2. Experimental Results and Analysis

5.2.1. Relationship Between Model Accuracy and Training Time

Figure 4 depicts the curves of model accuracy over time in the four sets of experiments. Note that, to make the relationships and trends of the curves clearer, Figure 4 captures only part of the training process, so the accuracy shown in the figure does not necessarily reflect the final model accuracy; the final accuracy values are presented in Table 2.
Overall, Fed-MKD demonstrated significant acceleration effects across all training scenarios. The transmission of the smaller student model undoubtedly contributed to communication speedup, as shown in Figure 4a. In experiments with relatively large model sizes (Figure 4b–d), the curves for Fed-MKD exhibited a faster initial rise compared to the FedMD method, which transmits the teacher model. However, due to the inherent limitations in the performance of the student model, its accuracy was eventually surpassed by FedMD. The larger the teacher model, the more pronounced the acceleration effect. Furthermore, reducing communication frequency further enhanced the speedup, as reflected in the figure, where the Fed-MKD curve consistently showed the largest gap compared to the baseline method using teacher model transmission (black curve).

5.2.2. Model Top-1 Accuracy and Convergence Time Speedup Ratio

Table 2 summarizes the numerical results from four different experimental scenarios. The datasets used in these experiments include Fashion-MNIST (with teacher and student models being custom CNNs of different sizes), CIFAR-100 (with a ResNet50 teacher model and a ResNet-Mini student model), CIFAR-100 (with a ResNet101 teacher model and a ResNet18 student model), and Mini-ImageNet (with a ResNet101 teacher model and a ResNet18 student model). For each experimental setup, the different training methods used the same learning rate, mini-batch size, and other initial configurations. The table primarily compares the differences in communication model parameters, model accuracy, convergence speedup, and overall compression ratio between the various training methods.
From the Top-1 accuracy shown in Table 2, it can be observed that, during Fashion-MNIST training, the optimization of the student model’s performance using Fed-MKD is not significant, with only a 1.14% improvement. This is because the dataset is relatively simple, and the accuracy gap between the student and teacher models in independent training is small, leaving little room for improvement. In contrast, on the other two datasets, Fed-MKD significantly improved the accuracy of ResNet-Mini and ResNet18 by 6.93% and 3.94%, respectively, on CIFAR-100; and by 3.07% on Mini-ImageNet.
Compared to the baseline FedMD (Teacher), although the reduction in model size using Fed-MKD leads to some loss in accuracy, it significantly reduces communication costs and accelerates the training process. Furthermore, Fed-MKD consistently outperforms the standard mutual knowledge distillation method, FedKD. This is because Fed-MKD reduces communication frequency, enabling more local training rounds within the same time frame compared to FedKD.
Figure 5 visually demonstrates the comparison of convergence speeds between Fed-MKD and other training methods, with specific numerical results also shown in Table 2. The baseline for convergence acceleration is FedMD (Teacher), so its bar height in Figure 5 is always set to 1. In all experiments, Fed-MKD achieves the best convergence speedup, with the best performance observed during Mini-ImageNet training, where the convergence speed is accelerated by a factor of 4.34 compared to the baseline.

5.2.3. Total Compression Ratio

The overall data compression ratio for the total amount of data transmitted before convergence for each training method across the four experimental setups is summarized in Table 2. Clearly, transmitting smaller models consistently achieves a compression ratio greater than 1 in the experiments. This indicates that, although transmitting smaller models may result in more training rounds, it generally reduces the amount of data transmitted, thereby alleviating the communication burden in cross-organizational federated learning.
Since Fed-MKD reduces both the size of the transmitted model and the communication frequency, it achieves the smallest data transmission volume and consistently provides the highest compression ratio among all training methods in each experiment. The compression ratio for Fed-MKD ranges from 4.89× to 28.45×. Notably, the compression ratio for FedKD does not necessarily exceed that of the Student model. This is because both methods transmit models of the same size, and the compression ratio is determined by the number of training rounds required for model convergence. In some cases, the student model can converge quickly on its own, while in FedKD, the global model’s accuracy continues to improve under the effect of knowledge distillation, which is consistent with the analysis of convergence speedup discussed earlier.

5.2.4. Global Model Accuracy Under the Non-IID Setting

We evaluated the performance of Fed-MKD under non-IID data settings, as shown in Figure 6. Overall, knowledge distillation still improves the accuracy of the student model under non-IID data, and Fed-MKD consistently minimizes the accuracy loss relative to the Teacher model across all scenarios. The complete Fed-MKD slightly outperforms Fed-MKDFW, which uses a fixed distillation loss function weight, with the difference being more apparent when $\xi = 4$, as shown in Figure 6a. This is because, under non-IID conditions, the features from models in other DCs are more significant, and the weight adjustment mechanism in Fed-MKD allows the teacher model to better learn the features from other models after each aggregation.
Figure 6b,c present the results from the same dataset, with Figure 6d showing a larger gap between FedMD (Teacher) and the other training methods. This is because the Student model, ResNet-Mini, used in this experiment is a custom neural network with relatively poor performance when facing complex datasets. As a result, there is less room for performance improvement through knowledge distillation.
Fed-MKD decouples local feature learning from global knowledge distillation, achieving a controlled balance of data dependency. The teacher model maintains a strong fitting capability on local data through the hard label loss (Equation (8)), thereby preserving the necessary sensitivity to local features, which is proportional to $\frac{1}{|D_i|}$. In contrast, the student model absorbs knowledge from other clients via the KL divergence (Equation (9)) and maintains cross-scenario consistency through the dynamic distillation weight (Equation (19)). Therefore, through its architecture-independent design and dynamic generalization mechanism, Fed-MKD effectively supports efficient migration to heterogeneous data scenarios.
However, the adjustment range for distillation loss weight is a tunable parameter in practice. It is not easy to find an optimal setting for each scenario during the experiment. How to select or generate a suitable student model and efficiently determine the appropriate parameters for adjusting distillation loss weights remains a topic for future work.

6. Conclusions

This study investigates the communication bottleneck issues in cross-organizational federated learning systems and WAN environments and proposes a low-communication-frequency knowledge distillation method (Fed-MKD) to reduce communication costs. For the proposed Fed-MKD method, we provide a detailed description of its implementation process and elaborate on the specific execution steps of DCs and the PS within this framework. Through mathematical analysis, we also explore the impact of the reduced communication frequency strategy on model convergence.
Extensive experimental results on public datasets validate the effectiveness of the proposed Fed-MKD in optimizing communication. Notably, under non-IID data settings, the integration of an adaptive distillation loss weight adjustment strategy significantly improves both communication efficiency and model performance. These findings underscore the practical advantages of Fed-MKD in cross-organizational federated learning, highlighting its robustness and adaptability in real-world applications. The ability of Fed-MKD to optimize communication while maintaining high model performance has important implications for deploying federated learning in resource-constrained environments, where communication costs are a major concern.
Additionally, the model design under the Fed-MKD framework demonstrates strong generalizability. Thanks to the mutual distillation mechanism, the model is not only capable of adapting to the specific data characteristics of the local environment but is also applicable to different data scenarios. This means that, once trained in a particular data environment, the model can quickly adapt to other diverse data environments through fine-tuning or minimal additional training. This cross-scenario adaptability enables Fed-MKD to perform exceptionally well in scenarios involving multi-modal data and non-IID data.

Author Contributions

Conceptualization, S.L. and E.K.L.L.; methodology, S.L.; software, S.L.; validation, S.L.; formal analysis, S.L. and H.S.; investigation, S.L.; resources, C.-T.L.; data curation, S.L.; writing—original draft preparation, S.L.; writing—review and editing, S.L.; visualization, S.L.; supervision, H.S.; project administration, C.-T.L.; funding acquisition, C.-T.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Science and Technology Development Fund of Macao (FDCT) #0033/2023/RIA1 and the Quantum Technology Challenges 2032 of the Queensland Department of Environment and Science (#RSH/7189). The APC was funded by Prof. Chan-Tong Lam, who is the project administrator.

Data Availability Statement

The three datasets generated and analyzed during the current study are accessible at https://github.com/zalandoresearch/fashion-mnist, https://opendatalab.com/OpenDataLab/cifar-100/tree/main/raw, https://yaoyaoliu.web.illinois.edu/projects/mtl/download/ accessed on 13 April 2025.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Devlin, J. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
  2. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763.
  3. Li, T.; Sahu, A.K.; Talwalkar, A.; Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Process. Mag. 2020, 37, 50–60.
  4. Sattler, F.; Wiedemann, S.; Müller, K.-R.; Samek, W. Robust and communication-efficient federated learning from non-IID data. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 3400–3413.
  5. Buyukates, B.; Ulukus, S. Timely communication in federated learning. In Proceedings of the IEEE INFOCOM 2021—IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Virtual, 10–13 May 2021; pp. 1–6.
  6. Shahid, O.; Pouriyeh, S.; Parizi, R.M.; Sheng, Q.Z.; Srivastava, G.; Zhao, L. Communication efficiency in federated learning: Achievements and challenges. arXiv 2021, arXiv:2107.10996.
  7. Zhao, Z.; Mao, Y.; Liu, Y.; Song, L.; Ouyang, Y.; Chen, X.; Ding, W. Towards efficient communications in federated learning: A contemporary survey. J. Frankl. Inst. 2023, 360, 8669–8703.
  8. Yang, Z.; Chen, M.; Wong, K.-K.; Poor, H.V.; Cui, S. Federated learning for 6G: Applications, challenges, and opportunities. Engineering 2022, 8, 33–41.
  9. Panda, A.; Mahloujifar, S.; Bhagoji, A.N.; Chakraborty, S.; Mittal, P. SparseFed: Mitigating model poisoning attacks in federated learning with sparsification. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Virtual, 28–30 March 2022; pp. 7587–7624.
  10. Peng, H.; Gurevin, D.; Huang, S.; Geng, T.; Jiang, W.; Khan, O.; Ding, C. Towards sparsification of graph neural networks. In Proceedings of the 2022 IEEE 40th International Conference on Computer Design (ICCD), Olympic Valley, CA, USA, 23–26 October 2022; pp. 272–279.
  11. Sui, D.; Chen, Y.; Zhao, J.; Jia, Y.; Xie, Y.; Sun, W. FedED: Federated learning via ensemble distillation for medical relation extraction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 2118–2128.
  12. Lin, T.; Kong, L.; Stich, S.U.; Jaggi, M. Ensemble distillation for robust model fusion in federated learning. Adv. Neural Inf. Process. Syst. 2020, 33, 2351–2363.
  13. He, C.; Annavaram, M.; Avestimehr, S. Group knowledge transfer: Federated learning of large CNNs at the edge. Adv. Neural Inf. Process. Syst. 2020, 33, 14068–14080.
  14. Li, D.; Wang, J. FedMD: Heterogeneous federated learning via model distillation. arXiv 2019, arXiv:1910.03581.
  15. Wu, C.; Wu, F.; Lyu, L.; Huang, Y.; Xie, X. Communication-efficient federated learning via knowledge distillation. Nat. Commun. 2022, 13, 2032.
  16. Zhang, L.; Shen, L.; Ding, L.; Tao, D.; Duan, L.-Y. Fine-tuning global model via data-free knowledge distillation for non-IID federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10174–10183.
  17. Lee, G.; Jeong, M.; Shin, Y.; Bae, S.; Yun, S.-Y. Preservation of the global knowledge by not-true distillation in federated learning. Adv. Neural Inf. Process. Syst. 2022, 35, 38461–38474.
  18. Zhang, X.; Zeng, Z.; Zhou, X.; Shen, Z. Low-dimensional federated knowledge graph embedding via knowledge distillation. arXiv 2024, arXiv:2408.05748.
  19. Liu, T.; Xia, J.; Ling, Z.; Fu, X.; Yu, S.; Chen, M. Efficient federated learning for AIoT applications using knowledge distillation. IEEE Internet Things J. 2022, 10, 7229–7243.
  20. Malinovskiy, G.; Kovalev, D.; Gasanov, E.; Condat, L.; Richtarik, P. From local SGD to local fixed-point methods for federated learning. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 6692–6701.
  21. Ochiai, K.; Terada, M. Poster: End-to-End Privacy-Preserving Vertical Federated Learning using Private Cross-Organizational Data Collaboration. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, Salt Lake City, UT, USA, 14–18 October 2024; pp. 4955–4957.
  22. Fernandez, J.D.; Brennecke, M.; Barbereau, T.; Rieger, A.; Fridgen, G. Federated Learning: Organizational Opportunities, Challenges, and Adoption Strategies. arXiv 2023, arXiv:2308.02219.
  23. Che, L.; Wang, J.; Zhou, Y.; Ma, F. Multimodal federated learning: A survey. Sensors 2023, 23, 6986.
  24. Dai, X.; Yan, X.; Zhou, K.; Yang, H.; Ng, K.W.; Cheng, J.; Fan, Y. Hyper-sphere quantization: Communication-efficient SGD for federated learning. arXiv 2019, arXiv:1911.04655.
  25. Qu, X.; Wang, J.; Xiao, J. Quantization and knowledge distillation for efficient federated learning on edge devices. In Proceedings of the 2020 IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), Virtual, 14–16 December 2020; pp. 967–972.
  26. Chen, Z.; Tian, P.; Liao, W.; Chen, X.; Xu, G.; Yu, W. Resource-aware knowledge distillation for federated learning. IEEE Trans. Emerg. Top. Comput. 2023, 11, 706–719.
  27. Cui, J.; Tian, Z.; Zhong, Z.; Qi, X.; Yu, B.; Zhang, H. Decoupled Kullback–Leibler divergence loss. Adv. Neural Inf. Process. Syst. 2024, 37, 74461–74486.
  28. Huang, Y.; Shen, P.; Tai, Y.; Li, S.; Liu, X.; Li, J.; Huang, F.; Ji, R. Improving face recognition from hard samples via distribution distillation loss. In Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part XXX; Springer: Berlin/Heidelberg, Germany, 2020; pp. 138–154.
  29. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv 2017, arXiv:1708.07747.
  30. Moradi, R.; Berangi, R.; Minaei, B. SparseMaps: Convolutional networks with sparse feature maps for tiny image classification. Expert Syst. Appl. 2019, 119, 142–154.
  31. Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D. Matching networks for one shot learning. Adv. Neural Inf. Process. Syst. 2016, 29.
  32. Cao, X. MLclf: The project machine learning CLassiFication for utilizing mini-ImageNet and tiny-ImageNet. 2022; Unpublished.
  33. Shafiq, M.; Gu, Z. Deep residual learning for image recognition: A survey. Appl. Sci. 2022, 12, 8972.
  34. Li, C.; Li, G.; Varshney, P.K. Decentralized federated learning via mutual knowledge transfer. IEEE Internet Things J. 2021, 9, 1136–1147.
Figure 1. Cross-organizational federated learning system under the mutual distillation architecture.
Figure 2. The calculation process of mutual distillation model loss.
Figure 3. Effect of the softmax function at different distillation temperatures.
Figure 4. The variation in model accuracy with training time.
Figure 5. Convergence time acceleration ratio.
Figure 6. Accuracy for the four groups of non-IID data distributions.
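As a companion to Figure 3, the following minimal Python sketch illustrates how the distillation temperature reshapes the softmax output: a higher temperature flattens the probability distribution, exposing the relative similarity among non-target classes. The logit values and temperatures below are arbitrary illustrative choices, not values taken from the paper's experiments.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Temperature-scaled softmax: higher T yields a softer (flatter) distribution."""
    scaled = np.asarray(logits, dtype=np.float64) / T
    scaled -= scaled.max()              # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [6.0, 2.0, 1.0]                # hypothetical logits for a 3-class example
for T in (1.0, 3.0, 10.0):
    print(f"T={T:>4}: {softmax_with_temperature(logits, T).round(3)}")
```

At T = 1 the output is dominated by the top class, while at larger T the "soft decision" carries more information about the secondary classes, which is what makes it useful as a distillation target.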
Table 1. Notations and descriptions.
Notation | Definition
N | Number of DCs participating in the training process
w | Model parameters, w ∈ R^d, where d is the dimensionality of the parameters
D_i | The sample set (local dataset) independently owned and used for local model training by the i-th DC
D | Entire dataset, encompassing all data utilized by the DCs, D_i ⊆ D
l(w, x) | The loss function computed under model parameters w for an input data x
F(w) | Objective function for global model optimization
L(w) | Loss function of the global model at the r-th round
w_i | Local model parameters used by DC_i
w^r | Global model parameters aggregated at the r-th round
w_t^i / w_s^i | The teacher/student model parameters in DC_i
g_i | Gradient at DC_i
g_t^i / g_s^i | The gradients of the teacher/student model in DC_i
η_i | Learning rate of DC_i, controlling the step size of each gradient update
R | Upper bound of the training frequency for the global model
thr | Target threshold for the expected
τ_r^i | Communication time between DC_i and the PS during the r-th round of training
Pred_t(Z_t) / Pred_s(Z_s) | Teacher/student model predictions
L_t^d / L_s^d | Teacher/student model distillation loss
L_t^h / L_s^h | Teacher/student model original loss
L_t / L_s | Teacher/student model final loss combining distillation loss with original loss
Y_t / Y_s | Teacher/student model hard decision (actual prediction result of the model)
U_t / U_s | Teacher/student model soft decision (probability distribution output)
A, β, α | Hyperparameters
E | Number of global training rounds between all participating DCs
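Using the notation in Table 1, the sketch below shows one common way the teacher/student final losses L_t and L_s can combine the original hard-label losses (L_t^h, L_s^h) with mutual distillation losses (L_t^d, L_s^d) computed from each other's soft decisions. The specific weighted combination with α and β, the temperature value, and the helper names `soft_kl` and `mutual_losses` are assumptions for illustration; they are not claimed to be the paper's exact formulation (see Figure 2 for the authors' loss-calculation process).

```python
import torch
import torch.nn.functional as F

def soft_kl(student_logits, target_logits, T):
    # KL divergence between temperature-softened distributions (the "soft decisions" U).
    log_p = F.log_softmax(student_logits / T, dim=1)
    q = F.softmax(target_logits.detach() / T, dim=1)   # peer output treated as a fixed target
    return F.kl_div(log_p, q, reduction="batchmean") * (T * T)

def mutual_losses(Z_t, Z_s, labels, alpha=0.5, beta=0.5, T=3.0):
    # Original (hard-label) losses L_t^h and L_s^h.
    L_t_h = F.cross_entropy(Z_t, labels)
    L_s_h = F.cross_entropy(Z_s, labels)
    # Mutual distillation losses L_t^d and L_s^d: each model learns from the other's soft output.
    L_t_d = soft_kl(Z_t, Z_s, T)
    L_s_d = soft_kl(Z_s, Z_t, T)
    # Final losses L_t and L_s as weighted combinations (weighting scheme assumed here).
    L_t = alpha * L_t_h + beta * L_t_d
    L_s = alpha * L_s_h + beta * L_s_d
    return L_t, L_s

# Toy usage: random logits for a batch of 8 samples and 10 classes.
Z_t, Z_s = torch.randn(8, 10), torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(mutual_losses(Z_t, Z_s, labels))
```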
Table 2. Experimental numerical results.
Dataset | Training System | Network Architecture | Communication Model Parameters | Top-1 Accuracy | Inference Speedup | Compression Ratio
Fashion-MNIST | FedMD (Teacher) | Custom CNN | 0.32 M | 91.42% | 1.00× | 1.00×
Fashion-MNIST | FedGKT (Student) | Custom CNN | 0.038 M | 89.81% (−1.61%) | 0.98× | 7.21×
Fashion-MNIST | FedKD | - | 0.038 M | 90.41% (−1.28%) | 1.26× | 9.43×
Fashion-MNIST | Fed-MKD | - | 0.038 M | 90.95% (−0.47%) | 2.79× | 28.45×
CIFAR-100 | FedMD (Teacher) | ResNet50 | 24.47 M | 74.12% | 1.00× | 1.00×
CIFAR-100 | FedGKT (Student) | ResNet-Mini | 5.37 M | 62.55% (−11.57%) | 1.90× | 2.72×
CIFAR-100 | FedKD | - | 5.37 M | 64.59% (−9.53%) | 2.20× | 2.45×
CIFAR-100 | Fed-MKD | - | 5.37 M | 69.48% (−4.64%) | 3.72× | 4.89×
CIFAR-100 | FedMD (Teacher) | ResNet101 | 43.06 M | 76.15% | 1.00× | 1.00×
CIFAR-100 | FedGKT (Student) | ResNet18 | 10.58 M | 71.19% (−4.96%) | 0.98× | 1.84×
CIFAR-100 | FedKD | - | 10.58 M | 72.68% (−3.47%) | 1.13× | 2.30×
CIFAR-100 | Fed-MKD | - | 10.58 M | 75.13% (−1.02%) | 3.44× | 9.16×
Mini-ImageNet | FedMD (Teacher) | ResNet101 | 43.06 M | 69.34% | 1.00× | 1.00×
Mini-ImageNet | FedGKT (Student) | ResNet18 | 10.58 M | 64.89% (−4.45%) | 2.73× | 4.25×
Mini-ImageNet | FedKD | - | 10.58 M | 66.73% (−2.61%) | 2.08× | 3.47×
Mini-ImageNet | Fed-MKD | - | 10.58 M | 67.96% (−1.38%) | 4.34× | 8.68×
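To make the parameter counts in Table 2 more concrete, the short sketch below converts them into a rough per-round upload size, assuming uncompressed 32-bit floats (4 bytes per parameter). This assumption and the model labels are illustrative only; actual payloads depend on serialization and any further compression, and the reported compression ratios also reflect the full communication behavior of each method rather than parameter count alone.

```python
# Rough per-round payload implied by the parameter counts in Table 2,
# assuming 4 bytes per parameter (float32) and no additional compression.
BYTES_PER_PARAM = 4

models = {
    "FedMD teacher (Custom CNN, 0.32 M params)": 0.32e6,
    "Fed-MKD student (Custom CNN, 0.038 M params)": 0.038e6,
    "FedMD teacher (ResNet101, 43.06 M params)": 43.06e6,
    "Fed-MKD student (ResNet18, 10.58 M params)": 10.58e6,
}

for name, params in models.items():
    size_mb = params * BYTES_PER_PARAM / 1e6
    print(f"{name}: ~{size_mb:.2f} MB per round")
```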