FedDPA: Dynamic Prototypical Alignment for Federated Learning with Non-IID Data

Bensiah, Oussama Akram; Benaboud, Rohallah

doi:10.3390/electronics14163286


by Oussama Akram Bensiah ¹,²,* and Rohallah Benaboud ¹,²

¹ Research Laboratory on Computer Science’s Complex Systems (RELA(CS)2), University of Oum El Bouaghi, Oum El Bouaghi 04000, Algeria

² Department of Mathematics and Computer Sciences, Faculty of Exact Sciences, Natural Sciences and Life, University of Oum El Bouaghi, Oum El Bouaghi 04000, Algeria

* Author to whom correspondence should be addressed.

Electronics 2025, 14(16), 3286; https://doi.org/10.3390/electronics14163286

Submission received: 9 July 2025 / Revised: 9 August 2025 / Accepted: 15 August 2025 / Published: 19 August 2025


Abstract

Federated learning (FL) has emerged as a powerful framework for decentralized model training, preserving data privacy by keeping datasets localized on distributed devices. However, data heterogeneity, characterized by significant variations in size, statistical distribution, and composition across client datasets, presents a persistent challenge that impairs model performance, compromises generalization, and delays convergence. To address these issues, we propose FedDPA, a novel framework that utilizes dynamic prototypical alignment. FedDPA operates in three stages. First, it computes class-specific prototypes for each client to capture local data distributions, integrating them into an adaptive regularization mechanism. Next, a hierarchical aggregation strategy clusters and combines prototypes from similar clients, which reduces communication overhead and stabilizes model updates. Finally, a contrastive alignment process refines the global model by enforcing intra-class compactness and inter-class separation in the feature space. These mechanisms work in concert to mitigate client drift and enhance global model performance. We conducted extensive evaluations on standard classification benchmarks (EMNIST, FEMNIST, CIFAR-10, CIFAR-100, and Tiny-ImageNet 200) under various non-independent and identically distributed (non-IID) scenarios. The results demonstrate the superiority of FedDPA over state-of-the-art methods, including FedAvg, FedNH, and FedROD. Our findings highlight FedDPA’s enhanced effectiveness, stability, and adaptability, establishing it as a scalable and efficient solution to the critical problem of data heterogeneity in federated learning.

Keywords:

adaptive regularization; contrastive alignment; data heterogeneity; federated learning; prototype alignment

1. Introduction

Federated learning (FL) has become a key paradigm in privacy-preserving machine learning, enabling collaborative model training across numerous devices without centralizing raw data. This decentralized approach is particularly valuable in data-sensitive fields like healthcare and finance, where privacy regulations mandate local data retention [1]. Despite its advantages, FL is often hindered by data heterogeneity, a scenario where client datasets are not independent and identically distributed (non-IID). This variation in data distribution causes significant disparities in local model updates, which can lead to “client drift”, slower convergence, and a decline in the global model’s overall performance [2]. These challenges underscore the need for innovative methods that can manage the complexities of non-IID data while upholding the privacy principles of FL. Existing solutions have approached this problem from several angles. The foundational FedAvg [1] algorithm aggregates local models through simple averaging, a method that performs well with uniform data but struggles in non-IID settings. To counter this, FedProx [3] introduced a proximal term to regularize local updates and reduce client drift, though at the cost of increased computational overhead. More recent prototype-based methods, such as FedNH [4], have shown promise by using class prototypes to align local and global models. However, many current techniques assume a static client environment and do not adapt well to dynamic data distributions, which are common in real-world applications. Furthermore, the effectiveness of prototype-based solutions often depends on high-quality initial prototypes, and these solutions may not adequately structure the feature space for complex tasks. To overcome these limitations, we introduce Dynamic Prototypical Alignment for Federated Learning (FedDPA), a novel framework designed to enhance performance and stability in heterogeneous environments. 
FedDPA makes three primary contributions: (1) an adaptive regularization mechanism that adjusts penalties based on the degree of data heterogeneity at each client, effectively controlling client drift; (2) a hierarchical aggregation strategy that groups clients into clusters to align prototypes, which improves scalability and lowers communication costs; and (3) a contrastive alignment step that refines the global feature space by optimizing for class separability, which directly boosts model performance. We validate FedDPA through comprehensive experiments on five benchmark datasets under challenging non-IID conditions, demonstrating its superior accuracy and efficiency compared with established baselines. This work offers a scalable, privacy-preserving solution for real-world decentralized learning and advances the field by providing a robust framework for managing data heterogeneity.

2. Background

Federated learning has revolutionized decentralized machine learning by enabling collaborative model training across multiple devices without necessitating the sharing of raw data. This paradigm is particularly advantageous in domains such as healthcare and finance, where data privacy is paramount. However, the efficacy of federated learning is frequently undermined by data heterogeneity, where client datasets are not independent and identically distributed (non-IID) and vary in size, class composition, and feature representation. This heterogeneity precipitates significant challenges, including client drift, delayed convergence, and suboptimal global model performance. The objective of federated learning is to train a global model by minimizing a global loss function $L (θ)$, which is an aggregation of the local loss functions from K distributed clients. Each client k has a local dataset $D_{k} = \{(x_{i}, y_{i})\}_{i = 1}^{n_{k}}$, where $x_{i}$ represents the input features and $y_{i}$ is the corresponding label. The global loss is a weighted average of the local losses:

L (θ) = \sum_{k = 1}^{K} \frac{n_{k}}{N} L_{k} (θ)

(1)

where $θ$ represents the global model parameters, $L_{k}$ is the local loss for client k, $n_{k}$ is the number of samples on client k, and N is the total number of samples across all clients. Optimizing this function is challenging with non-IID data, as conflicting updates from heterogeneous clients can impede convergence. To address this, various strategies have been developed. Federated averaging (FedAvg), introduced in [1], aggregates local updates via simple averaging but struggles with non-IID data distributions because it implicitly assumes uniform client data. Extensions such as FedProx [2] add a proximal term that regularizes local updates and improves robustness, at the cost of increased computational complexity. FedDyn [5] employs dynamic regularization to align local and global objectives, while model-specific methods such as FedMA [6] align neural network layers to mitigate structural disparities. Prototype-based approaches such as FedProto [7] leverage class prototypes, representative feature vectors that summarize local data distributions, to improve alignment. Despite these advances, many methods lack adaptability to evolving data distributions or impose significant computational burdens that limit their scalability. Building on this foundation, we propose Dynamic Prototypical Alignment for Federated Learning (FedDPA), as illustrated in Figure 1, which combines adaptive regularization, hierarchical aggregation, and contrastive alignment to dynamically align local and global prototypes, offering a robust and scalable solution to data heterogeneity in federated learning.
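As a small illustration of the weighted average in Equation (1), the following sketch computes the global loss from per-client losses and sample counts. The function name and the example numbers are hypothetical and are not taken from the paper.

```python
def global_loss(local_losses, sample_counts):
    """Weighted average of per-client losses, as in Equation (1)."""
    if len(local_losses) != len(sample_counts):
        raise ValueError("one loss and one sample count per client are required")
    total = sum(sample_counts)  # N: total number of samples across all clients
    return sum(n_k / total * loss_k
               for loss_k, n_k in zip(local_losses, sample_counts))


# Hypothetical example: client 1 holds 300 samples, client 2 holds 100.
example = global_loss([0.5, 1.0], [300, 100])
print(example)
```

In this example the client with more samples dominates the average, so the result (0.625) lies closer to its loss of 0.5 than to the smaller client's loss of 1.0.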

3. Related Work

Federated learning (FL) has emerged as a transformative paradigm in machine learning, enabling decentralized devices to collaboratively train models while preserving data privacy, an essential requirement in domains such as healthcare and finance. Yet data heterogeneity, in which client datasets differ in size, class distribution, and feature representation, remains a persistent challenge that has motivated a broad range of mitigation strategies. Foundational work centers on aggregation under distribution shifts. The seminal federated averaging (FedAvg) algorithm aggregates local model updates through simple gradient averaging [1]. Although effective in homogeneous settings, FedAvg degrades under non-IID conditions, leading to client drift and reduced generalization. FedProx addresses these limitations by introducing a proximal regularization term that stabilizes local updates and improves robustness, albeit with added computational overhead [2]. A significant body of research modifies the training objective or aggregation rule to accommodate heterogeneity. FedDyn applies dynamic regularization to align local and global objectives and thereby improves convergence in heterogeneous environments [5]. FedNova normalizes client updates to handle data imbalance, though its benefits diminish under extremely skewed distributions [8]. FedCurv leverages curvature information by adding a penalty term to the loss, mitigating catastrophic forgetting in non-IID scenarios [9]. Bayesian-inspired designs offer another perspective; Bayesian Model Ensemble (FedBE) samples high-quality global models and combines them to obtain more robust aggregation [10]. Knowledge distillation approaches such as federated distillation (FedDF) merge diverse client models and lessen reliance on uniform architectures, but they typically require access to a public dataset [11]. Personalization methods explicitly tailor the global model to client-specific characteristics. 
FedPer adopts a base personalization layer framework in which shared base layers are trained collaboratively using FedAvg and personalization layers are optimized locally via stochastic gradient descent [12]. FedRep learns a shared low-dimensional representation across clients while allowing personalized local heads, achieving linear convergence in certain settings [13]. Ditto balances fairness and robustness by learning device-specific models regularized toward a global reference [14]. FedBABU facilitates rapid personalization with minimal fine-tuning by updating only the feature extractor during local training and keeping the classifier fixed [15]. Prototype-based approaches communicate compact feature representations rather than full gradients. Federated Prototype Learning across Heterogeneous Clients (FedProto) exchanges class prototypes to improve generalization under non-IID data [7]. FedPLCC enhances feature alignment via weighted prototypes and convergent clustering [16]. ProtoFL advances unsupervised one-class federated learning through prototypical representation distillation and normalizing flows to better handle non-IID distributions [17]. FedZaCt addresses class imbalance by combining Z average aggregation with cross-client prototype synchronization [18]. Several specialized methods target specific heterogeneity patterns. FedGMKD integrates cluster knowledge fusion with Gaussian mixture models and discrepancy-aware aggregation to address complex non-IID conditions [19]. FedCross proposes a multi-model cross-aggregation framework in which multiple middleware models are pairwise aggregated during training to reduce gradient divergence and enhance convergence under heterogeneity [20]. CS-FL introduces cross-zone model selection coupled with a blockchain-based credibility mechanism to enable reliable model fusion even in the presence of unbalanced or adversarial clients [21]. 
DQFed incorporates a data quality–aware aggregation strategy that weights client updates by consistency, completeness, and noise robustness, thereby improving global accuracy under severe non-IID data [22]. Despite these advances, limitations persist. Many methods implicitly assume stable client data distributions, which restricts their effectiveness in dynamic environments where distributions evolve over time. Prototype-based techniques can be sensitive to prototype quality and often require careful initialization. Moreover, achieving a favorable balance between computational efficiency and scalability remains challenging for large, heterogeneous deployments.

To address these gaps, we propose Dynamic Prototypical Alignment for Federated Learning (FedDPA), which integrates adaptive regularization, hierarchical aggregation, and contrastive alignment. This design enables continuous adaptation to evolving client data while reducing communication overhead and improving scalability. At the same time, contrastive alignment enhances class separability and strengthens overall performance. Unlike traditional prototype-based methods that rely on static assumptions about data distributions, FedDPA adapts in situ to changing client conditions, offering a practical and scalable solution for privacy-preserving, decentralized learning.

4. The FedDPA Methodology

The Dynamic Prototypical Alignment for Federated Learning (FedDPA) framework is designed to strengthen the coherence between local models trained on individual client devices and the global model aggregated at a central server, thereby tackling the core challenge of data heterogeneity in FL. At the core of FedDPA lies the concept of prototypical alignment, in which prototypes (feature vectors that represent the central tendency of local data distributions) serve as intermediaries to align local and global representations. This alignment is dynamic, adapting continuously throughout the training process to accommodate evolving data distributions and maintain consistency across models. At each communication round, prototypes are recalibrated to reflect the most current local features, allowing the global model to maintain a balanced and adaptive view of all client data. The alignment process uses prototypical distance metrics to quantify similarities between local and global prototypes, minimizing these distances during optimization. This approach mitigates client drift while preserving client-specific variations. FedDPA integrates four key components (local prototype computation, adaptive regularization, hierarchical aggregation, and contrastive alignment) to achieve robust performance in non-IID environments. Together, these mechanisms balance personalization with global consistency, enhance scalability, and improve the separability of the feature space, thereby overcoming several limitations of traditional FL methods.

4.1. Local Prototype Computation

Local prototypes are representative feature vectors computed by each client to summarize the key patterns of its local data distribution. Each client first extracts features from its dataset, groups these features according to classes or clusters, and then computes a centroid (prototype) for each group, as shown in Equation (2). These prototypes serve as condensed representations of client-specific data and form the basis for aligning local and global models. For a client k, the prototype $P_{i}^{k}$ corresponding to class i is calculated as the mean of the feature vectors $h_{i, j}^{k}$ for all samples j belonging to that class:

P_{i}^{k} = \frac{1}{n_{i}^{k}} \sum_{j = 1}^{n_{i}^{k}} h_{i, j}^{k}

(2)

where $n_{i}^{k}$ is the number of samples from class i on client k and $h_{i, j}^{k}$ is the feature vector of the jth sample of that class.
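The class-wise mean in Equation (2) can be sketched as follows. This is a minimal illustration that assumes features have already been extracted and are given as plain Python lists; the function name `local_prototypes` and the sample data are hypothetical.

```python
from collections import defaultdict


def local_prototypes(features, labels):
    """Return {class label: mean feature vector} for one client (Equation (2))."""
    sums = {}
    counts = defaultdict(int)
    for h, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(h)
        sums[y] = [s + v for s, v in zip(sums[y], h)]
        counts[y] += 1  # n_i^k: number of samples of class y on this client
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


# Hypothetical 2-D features for three samples from two classes.
protos = local_prototypes([[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]], [0, 0, 1])
print(protos)
```

Classes that do not occur on a client simply have no entry in the returned dictionary, which is one possible way to handle the missing classes that are common under non-IID partitions.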

4.2. Adaptive Regularization

Adaptive regularization is a mechanism that adjusts the degree of alignment between local and global models according to the heterogeneity of the data. During training, a regularization term is incorporated into the loss function to encourage local prototypes to align with their corresponding global prototypes. The formulation of this regularization loss is expressed as follows:

L_{reg} = λ_{k} \sum_{i} {∥P_{i}^{k} - P_{i}^{g}∥}^{2}

(3)

In Equation (3), $P_{i}^{g}$ denotes the global prototype for class i, and $λ_{k}$ is an adaptive weight that adjusts according to the heterogeneity of client k’s data. The regularization weight is dynamically modified to provide greater flexibility for clients with highly unique data distributions while enforcing stricter alignment for clients whose data is more representative of the global distribution. Adaptive regularization balances personalization and global consistency by dynamically tuning the extent to which local updates are influenced by global objectives. The dynamic adjustment is defined by

λ_{k} = α + β \cdot {Heterogeneity}_{k}

(4)

In Equation (4), $α$ denotes the base regularization strength, which applies uniformly to all clients, while $β$ is the scaling factor that modulates the influence of data heterogeneity. The term ${Heterogeneity}_{k}$ quantifies the divergence of client k’s data distribution from the global data, measured as the average squared distance between the local prototypes $P_{i}^{k}$ and the global prototypes $P_{i}^{g}$:

{Heterogeneity}_{k} = \frac{1}{C} \sum_{i = 1}^{C} {∥P_{i}^{k} - P_{i}^{g}∥}^{2}

(5)

where C is the number of classes, $P_{i}^{k}$ is the prototype of class i for client k, and $P_{i}^{g}$ is the corresponding global prototype. By setting $α > 0$, the system ensures that no client is completely decoupled from the global model, thereby maintaining overall model consistency. A low value of $β$ reduces the influence of heterogeneity, resulting in more uniform regularization strengths across clients and enforcing stronger global alignment.
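Equations (3)–(5) can be combined in a few lines. The sketch below is illustrative only: the function names are hypothetical, prototypes are stored as dictionaries keyed by class, and it assumes the average in Equation (5) runs over the classes actually present on the client (the text does not say how absent classes are handled).

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def heterogeneity(local_protos, global_protos):
    """Mean squared local-to-global prototype distance (Equation (5)).

    Assumption: the average runs over the classes present on the client.
    """
    dists = [squared_distance(p, global_protos[c]) for c, p in local_protos.items()]
    return sum(dists) / len(dists)


def adaptive_weight(alpha, beta, het):
    """lambda_k = alpha + beta * Heterogeneity_k (Equation (4))."""
    return alpha + beta * het


def regularization_loss(local_protos, global_protos, alpha, beta):
    """L_reg = lambda_k * sum_i ||P_i^k - P_i^g||^2 (Equation (3))."""
    lam = adaptive_weight(alpha, beta, heterogeneity(local_protos, global_protos))
    return lam * sum(squared_distance(p, global_protos[c])
                     for c, p in local_protos.items())


# Hypothetical prototypes for a client with two classes.
local = {0: [1.0, 0.0], 1: [0.0, 2.0]}
global_ = {0: [0.0, 0.0], 1: [0.0, 0.0]}
print(heterogeneity(local, global_), regularization_loss(local, global_, 0.1, 0.5))
```

Here the squared distances are 1 and 4, so the heterogeneity is 2.5, the weight is 0.1 + 0.5 × 2.5 = 1.35, and the regularization loss is 1.35 × 5 = 6.75. With $β > 0$, Equation (4) makes the weight grow with the measured heterogeneity.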

4.3. Hierarchical Aggregation

Hierarchical aggregation is a multilevel process that combines updates (prototypes and weights) from clients to form the global model efficiently. This approach incorporates both intragroup and global-level aggregation to enhance scalability, fairness, and robustness, particularly in non-IID data scenarios. Intragroup prototype aggregation involves clustering clients with similar data distributions and combining their prototypes at an intermediate level. The resulting aggregated prototypes are then transmitted to the global server for final aggregation. For a specific group of clients G, the intermediate prototype $P_{i}^{G}$ is computed as shown in Equation (6):

P_{i}^{G} = \frac{1}{| G |} \sum_{k \in G} P_{i}^{k}

(6)

where $| G |$ is the number of clients in group G and $P_{i}^{k}$ is the prototype for class i at client k. Global prototype aggregation consolidates the group-level prototypes into a single prototype per class, which is then used to refine the global model, as defined by Equation (7):

P_{i}^{g} = \frac{1}{M} \sum_{G = 1}^{M} P_{i}^{G}

(7)

where M is the total number of groups and $P_{i}^{G}$ is the prototype for class i in group G. Intragroup weight aggregation employs an intermediate aggregator to perform subset-level weight aggregation before transmitting the results to the global server. This procedure is formalized in Equation (8):

W_{G} = \frac{1}{| G |} \sum_{k \in G} w_{k}

(8)

where $| G |$ is the number of clients in group G and $w_{k}$ represents the model weights of client k. The global server then aggregates the weights received from the subgroup aggregators. This aggregation process is described in Equation (9):

W_{g} = \frac{1}{M} \sum_{G = 1}^{M} W_{G}

(9)

where M is the total number of groups and $W_{G}$ denotes the aggregated model weights for group G.
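The two-level averaging of Equations (6)–(9) can be sketched with one helper applied twice. This is a simplified illustration with hypothetical names; it assumes the grouping of clients is already given and treats prototypes and weights alike as flat vectors.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def hierarchical_mean(groups):
    """Average within each group, then average the group results.

    Mirrors Equations (6)-(7) for prototypes and (8)-(9) for weights.
    """
    return mean_vector([mean_vector(group) for group in groups])


# Hypothetical 1-D "prototypes": two clients in group A, one client in group B.
groups = [[[0.0], [2.0]], [[10.0]]]
two_level = hierarchical_mean(groups)
flat = mean_vector([v for group in groups for v in group])
print(two_level, flat)
```

The example yields 5.5 for the two-level mean and 4.0 for a flat mean over all clients. As written, Equations (7) and (9) weight every group equally, so the hierarchical result coincides with a flat average only when all groups have the same number of clients.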

4.4. Contrastive Alignment

Contrastive alignment serves as a methodological approach to enhance the performance of the global model by implementing an alignment strategy that strengthens representation learning and optimizes overall model efficacy, particularly in the context of non-IID data distributions. This mechanism employs a contrastive loss function to align local prototypes with their corresponding global prototypes, while simultaneously ensuring clear differentiation between prototypes from distinct classes. The loss function consists of two key components. The first, intra-class alignment, aims to minimize the divergence between the local prototypes and their associated global prototypes, as defined in Equation (10):

L_{align} = \frac{1}{N} \sum_{k = 1}^{N} \sum_{i = 1}^{C} {∥P_{i}^{k} - P_{i}^{g}∥}^{2}

(10)

where N is the total number of clients, C is the total number of classes, $P_{i}^{k}$ is the local prototype for class i at client k, and $P_{i}^{g}$ is the global prototype for class i. The second component, inter-class separation, promotes adequate distinction among the global prototypes of different classes and is expressed in Equation (11):

L_{separation} = \frac{1}{C (C - 1)} \sum_{i = 1}^{C} \sum_{j \neq i}^{C} \max (m - ∥P_{i}^{g} - P_{j}^{g}∥, 0)

(11)

where C is the total number of classes, $P_{i}^{g}$ is the global prototype for class i, $P_{j}^{g}$ is the global prototype for class j, and m is the margin that enforces a minimum separation between prototypes. Setting $m > 0$ ensures that prototypes corresponding to different classes remain sufficiently distant, thereby improving the model’s ability to distinguish between classes. Incorporating 0 within the max function ensures that a penalty is applied only when the distance between prototypes falls below the specified margin m. The global server leverages $L_{align}$ and $L_{separation}$ to fine-tune the global prototypes and systematically guide the update mechanism of the global model, as defined in Equation (12):

L_{contrast} = L_{align} + λ L_{separation}

(12)

where $λ$ is a weighting factor that balances alignment and separation. In heterogeneous federated learning settings, we set $λ \leq 1$ to encourage stronger alignment of prototypes across clients.
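The contrastive objective of Equations (10)–(12) can be sketched as below. The sketch treats the separation term as a non-negative hinge penalty that is active only when two global prototypes are closer than the margin, as the accompanying text describes. Function names and numbers are hypothetical, and every client is assumed to hold prototypes only for classes that also have a global prototype.

```python
import math


def align_loss(client_protos, global_protos):
    """Equation (10): mean over clients of summed squared prototype distances."""
    total = 0.0
    for protos in client_protos:
        for c, p in protos.items():
            total += math.dist(p, global_protos[c]) ** 2
    return total / len(client_protos)


def separation_loss(global_protos, margin):
    """Hinge penalty averaged over ordered pairs of distinct classes (Equation (11))."""
    classes = list(global_protos)
    n = len(classes)  # requires at least two classes
    penalty = sum(max(margin - math.dist(global_protos[i], global_protos[j]), 0.0)
                  for i in classes for j in classes if i != j)
    return penalty / (n * (n - 1))


def contrast_loss(client_protos, global_protos, margin, lam):
    """Equation (12): alignment plus weighted separation."""
    return (align_loss(client_protos, global_protos)
            + lam * separation_loss(global_protos, margin))


# Hypothetical prototypes: two clients, two classes, 2-D features.
global_protos = {0: [0.0, 0.0], 1: [3.0, 4.0]}
clients = [{0: [1.0, 0.0], 1: [3.0, 4.0]}, {0: [0.0, 0.0], 1: [3.0, 2.0]}]
print(contrast_loss(clients, global_protos, margin=6.0, lam=0.5))
```

In this example the global prototypes are 5 apart, so a margin of 6 gives a separation penalty of 1, while the alignment term is (1 + 4) / 2 = 2.5. With $λ = 0.5$ the combined loss is 3.0. With a margin of 5 or less the separation term would vanish.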

4.5. Theoretical Analysis for FedDPA

Dynamic Prototypical Alignment for Federated Learning (FedDPA) introduces prototype-based adaptive alignment to mitigate the difficulties posed by non-IID data in federated learning (FL). To establish a theoretical foundation for FedDPA, this section outlines the problem formulation and then presents a formal analysis of its convergence properties and computational complexity under non-IID data distributions. The analysis quantifies the impact of FedDPA’s adaptive regularization, contrastive alignment, and hierarchical aggregation mechanisms and compares them with state-of-the-art methods such as FedAvg [1].

4.5.1. Problem Formulation

In standard federated learning (FL), the goal is to optimize the global loss function $L (θ)$ given in Equation (1) (Section 2). FedDPA modifies this optimization problem by introducing prototype-based and contrastive alignment: in each communication round, every client k computes the local class prototypes $P_{i}^{k}$ for each class i, and the server aggregates them into the global prototypes $P_{i}^{g}$. Adding the regularization term of Equation (3) to each local loss encourages local prototypes to align with global prototypes. FedDPA therefore optimizes the regularized global objective specified in Equation (13):

L (θ) = \sum_{k = 1}^{K} \frac{n_{k}}{N} (L_{k} (θ) + λ_{k} \sum_{i} {∥P_{i}^{k} - P_{i}^{g}∥}^{2})

(13)

where $λ_{k}$ is an adaptive weight that adjusts based on the heterogeneity of client k’s data.

4.5.2. Convergence Analysis

FedDPA convergence hinges on dynamically aligning local and global prototypes to counter client drift in non-IID federated learning settings. Local prototypes capture client-specific data characteristics, while global prototypes aggregate knowledge across clients. By aligning these prototypes, FedDPA ensures that local updates remain consistent with the global model, enhancing generalization. This dynamic process adapts to evolving data distributions, promoting robust convergence. Below, we provide an intuitive explanation followed by formal arguments.

Theorem 1

(FedDPA Convergence under Non-IID Data). Assume the following:

1. L-Smoothness: The global loss $L$ is L-smooth such that $∥ \nabla L (θ_{1}) - \nabla L (θ_{2}) ∥ \leq L ∥ θ_{1} - θ_{2} ∥$.
2. Bounded Gradient Dissimilarity: For any client k, $∥ \nabla L_{k} (θ) - \nabla L (θ) ∥ \leq δ$, where δ quantifies data heterogeneity.
3. Prototype Stability: The prototype alignment error $∥ P_{i}^{k} - P_{i}^{G} ∥$ decays as $O (1 / t)$ with communication round t.

For a learning rate $η_{t} = \frac{η_{0}}{\sqrt{t}}$, FedDPA achieves

\min_{t \in [T]} E [∥ \nabla L (θ_{t}) ∥^{2}] \leq O (\frac{1}{\sqrt{T}}) + O (\frac{δ^{2} + σ^{2}}{T})

where $σ^{2}$ bounds the variance in client updates and the expectation $E$ accounts for client sampling and mini-batch stochasticity.

Proof (sketch).

1. Per-Round Descent: Use L-smoothness and the update rule:

$\begin{aligned} L (θ_{t + 1}) & \leq L (θ_{t}) + \langle \nabla L (θ_{t}), θ_{t + 1} - θ_{t} \rangle + \frac{L}{2} {∥ θ_{t + 1} - θ_{t} ∥}^{2} \\ & \leq L (θ_{t}) - η_{t} {∥ \nabla L (θ_{t}) ∥}^{2} + η_{t}^{2} L σ^{2} + η_{t} δ^{2} . \end{aligned}$

2. Telescoping Sum: Sum over $t = 1$ to T and take expectations:

$\sum_{t = 1}^{T} η_{t} E [∥ \nabla L (θ_{t}) ∥^{2}] \leq L (θ_{1}) - L^{*} + L σ^{2} \sum_{t = 1}^{T} η_{t}^{2} + δ^{2} \sum_{t = 1}^{T} η_{t} .$

3. Learning Rate Selection: Substitute $η_{t} = \frac{η_{0}}{\sqrt{t}}$ and bound the sums:

$\sum_{t = 1}^{T} η_{t} \sim \sqrt{T}, \sum_{t = 1}^{T} η_{t}^{2} \sim \log T .$

4. Final Bound: Divide by $\sum_{t = 1}^{T} η_{t}$ and use convexity:

$\min_{t} E [∥ \nabla L (θ_{t}) ∥^{2}] \leq \frac{O (\log T) + δ^{2} \sqrt{T}}{\sqrt{T}} = O (\frac{1}{\sqrt{T}}) + O (\frac{δ^{2}}{T}) .$

The $σ^{2}$ term arises similarly from client update variance. Prototype alignment reduces $δ^{2}$ via adaptive regularization, as described by Equations (3) and (4). □
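The growth rates used in the learning-rate selection step can be checked numerically. The sketch below only evaluates the two sums for $η_{t} = η_{0} / \sqrt{t}$. It does not reproduce the full convergence argument, and the function name is hypothetical.

```python
import math


def step_size_sums(T, eta0=1.0):
    """Return (sum of eta_t, sum of eta_t**2) for eta_t = eta0 / sqrt(t), t = 1..T."""
    s1 = sum(eta0 / math.sqrt(t) for t in range(1, T + 1))
    s2 = sum(eta0 ** 2 / t for t in range(1, T + 1))
    return s1, s2


for T in (100, 10_000):
    s1, s2 = step_size_sums(T)
    # Ratios against sqrt(T) and log(T) show the claimed growth rates.
    print(T, s1 / math.sqrt(T), s2 / math.log(T))
```

By comparison with the corresponding integrals, the first sum lies between $2 \sqrt{T + 1} - 2$ and $2 \sqrt{T} - 1$, and the second between $\log (T + 1)$ and $1 + \log T$. As T grows, the first printed ratio therefore tends toward 2 and the second toward 1.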

Lemma 1

(Prototype Alignment Reduces Client Drift). Under prototype stability (Assumption 3), FedDPA ensures that

δ^{2} \leq \frac{C}{t} \sum_{k = 1}^{K} λ_{k} {∥ P_{i}^{k} - P_{i}^{G} ∥}^{2},

where C is a constant and $λ_{k}$ is the adaptive weight from Equation (4).

Proof.

This follows from the Lipschitz continuity of gradients and the decay of $∥ P_{i}^{k} - P_{i}^{G} ∥$ (see Equations (10) and (12) for contrastive alignment). □

Corollary 1

(IID Special Case). If the data is IID ($δ_{max} = 0$), then FedDPA achieves

\min_{t} E [∥ \nabla L (θ_{t}) ∥^{2}] \leq O (\frac{\log T}{\sqrt{T}}) .

Corollary 2

(Communication Efficiency). With M groups, FedDPA reduces the per-round communication cost by a factor of $O (K / M)$.

Our theoretical analysis establishes that FedDPA achieves $O (\log T / \sqrt{T})$ convergence under non-IID data (Theorem 1), with prototype alignment explicitly reducing client drift (Lemma 1). While the analysis assumes convex loss and fixed participation, the empirical results (Section 5) demonstrate FedDPA’s effectiveness in non-convex settings (e.g., CNNs on CIFAR-10/100) and partial client sampling (10% per round).

4.5.3. Computational Efficiency of Hierarchical Aggregation

FedDPA’s hierarchical aggregation strategy significantly reduces computational and communication overhead compared with standard federated averaging (FedAvg). Below, we formalize its efficiency gains and link them to the algorithm’s design.

Theorem 2

(Hierarchical Aggregation Complexity). Under FedDPA’s hierarchical aggregation (Algorithms 1 and 2), with

- K total clients partitioned into M groups ($| G | = K / M$ clients per group);
- d-dimensional prototypes for C classes;
- $d_{w}$-dimensional model parameters;
- $m \leq M$ groups active per round,

the following complexity bounds hold:

- Communication Cost: $O (m (d C + d_{w}))$ per round (vs. $O (K d_{w})$ for FedAvg [1]);
- Server Computation: $O (m (d C + d_{w}))$ per round;
- Client Computation: $O (n_{k} d C)$ per client (for $n_{k}$ local samples).

The reduction factor is $O (K / M)$, scaling linearly with the group size.

Proof.

Communication Analysis:

- FedAvg: All K clients transmit $d_{w}$-dimensional weights $\to O (K d_{w})$;
- FedDPA:
  - Clients send $(d C + d_{w})$-dimensional prototypes plus weights to group leaders;
  - Leaders aggregate and forward only m updates to the server;
  - Total: $O (m (d C + d_{w}))$.

Computation Analysis:

- Server:
  - FedAvg: $O (K d_{w})$ for weight averaging;
  - FedDPA: $O (m (d C + d_{w}))$ for two-level aggregation, as given by Equations (6)–(9).
- Clients:
  - Prototype computation: $O (n_{k} d C)$ (Equation (2));
  - Local training: Same as FedAvg ($O (n_{k} d_{w})$).

□

Lemma 2

(Scalability Advantage). For $M = \sqrt{K}$ groups, the following are true:

- Communication reduces from $O (K)$ to $O (\sqrt{K})$;
- Server computation becomes $O (\sqrt{K} (d C + d_{w}))$.

Corollary 3

(Resource-Constrained Settings). When $d C ≪ d_{w}$ (e.g., small prototype dimensions), the following are true:

- FedDPA’s overhead is dominated by $O (m d_{w})$;
- It still achieves a $K / M$-fold improvement over FedAvg.

The empirical results in Section 5.4 validate these bounds:

- With $K = 100$ and $M = 10$, FedDPA uses ∼10× fewer uplinks than FedAvg;
- Prototype alignment adds $O (d C)$ overhead but enables faster convergence (fewer rounds);
- The $K / M$ speedup matches the reduction from 100 to 17–21 rounds for CIFAR-10.
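To make the communication bounds of Theorem 2 concrete, the following sketch counts the values that reach the server per round, following the theorem's accounting (only what the group leaders forward is counted, not the client-to-leader traffic). The function names and all sizes are hypothetical.

```python
def fedavg_uplink(num_clients, d_w):
    """Values received by the server per round when every client uploads its weights."""
    return num_clients * d_w


def feddpa_uplink(active_groups, d, num_classes, d_w):
    """Values received by the server per round from the active group leaders."""
    return active_groups * (d * num_classes + d_w)


# Hypothetical sizes: 100 clients, 10 active groups, 64-d prototypes,
# 10 classes, and a model with one million parameters.
K, m, d, C, d_w = 100, 10, 64, 10, 1_000_000
ratio = fedavg_uplink(K, d_w) / feddpa_uplink(m, d, C, d_w)
print(fedavg_uplink(K, d_w), feddpa_uplink(m, d, C, d_w), ratio)
```

With these sizes the server receives 100,000,000 values under FedAvg and 10,006,400 under the hierarchical scheme, a ratio just under 10. This is in line with the $K / M$ factor when $d C ≪ d_{w}$, since the prototype term adds only 640 values per group.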

4.6. FedDPA Algorithm

The Dynamic Prototypical Alignment for Federated Learning (FedDPA) algorithm introduces a novel framework designed to address the pervasive challenge of data heterogeneity in decentralized learning environments. At the client level, each participating device computes local prototypes, which capture the central tendencies of class-specific feature representations, and optimizes its local model through adaptive regularization to align with the global model. These local updates, comprising both model parameters and prototypes, are then transmitted to the server. On the server side, a hierarchical aggregation mechanism clusters local prototypes and model updates, thereby reducing communication overhead and enhancing scalability. The server further refines the global model through contrastive alignment, a process that minimizes intra-class variation while maximizing inter-class separability within the feature space. This dual optimization ensures robust generalization and mitigates the adverse effects of client drift. The updated global model and prototypes are subsequently distributed to the clients selected for the next training round, facilitating consistent and synchronized learning across the network.

The FedDPA algorithm comprises two distinct components, as previously described: the client-side procedure, presented in Algorithm 1, and the global server-side procedure, presented in Algorithm 2. The detailed steps for each component are delineated as follows.

Algorithm 1 FedDPA: client side.

Input: Dataset partitions $D_{i}$ , initial global model weights $w_{k}$

1: for each communication round $t = 1, 2, \dots$ do
2:   for each client k do
3:     for each class i do
4:       Compute the local prototype $P_{i}^{k}$ .
5:       Local Model Update (Optimize the local model).
6:     end for
7:   end for
8: end for
9: return $P_{i}^{k}$ , $w_{k}$
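Step 4 of Algorithm 1 can be illustrated with a minimal Python sketch. It assumes that a class prototype is the arithmetic mean of the feature vectors of that class on the client, which is one common reading of "central tendency"; the function name and the list-based feature representation are hypothetical and were chosen only to keep the example free of dependencies.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def class_prototypes(features: Sequence[Vector],
                     labels: Sequence[int]) -> Dict[int, Vector]:
    """Return, for each class present on the client, the mean feature vector."""
    sums: Dict[int, Vector] = {}
    counts: Dict[int, int] = {}
    for vector, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vector)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vector)]
        counts[label] += 1
    # Assumption: prototype = per-class mean of the extracted features.
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


# Toy client with two samples of class 0 and one sample of class 1.
protos = class_prototypes([[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]], [0, 0, 1])
print(protos)  # {0: [2.0, 3.0], 1: [0.0, 0.0]}
```

Classes that are absent on a client simply produce no prototype, which matches the highly skewed partitions in which a client holds only a few classes.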

Algorithm 2 FedDPA: global server side.

1: Initialize global prototypes $P_{i}^{g} = 0$ for all classes
2: Set number of communication rounds T
3: Set margin m for inter-class separation {set $m > 0$ , as specified in Section 4.4}
4: Set separation loss weighting factor $λ$ {set $λ \leq 1$ , as specified in Section 4.4}
5: Set local training regularization parameter $α$ {set $α > 0$ , as specified in Section 4.2}
6: Set data heterogeneity factor $β$ {set $β > 0$ , as specified in Section 4.2}
7: for each round $t = 1, 2, \dots, T$ do
8:   Receive $w_{k}$ and $P_{i}^{k}$ from participating clients
9:   Perform Group Prototype Aggregation $P_{i}^{G}$
10:   Perform Global Prototype Aggregation $P_{i}^{g}$
11:   Perform Group Model Aggregation $W_{G}$
12:   Perform Global Model Aggregation $w_{g}$
13:   Compute Intra-Class Alignment Loss
14:   Compute Inter-Class Separation Loss
15:   Compute Combined Loss
16:   Update Global Model $w_{g}$
17:   Send $w_{g}$ and $P_{i}^{g}$ to all clients
18: end for
19: Send final global model $w_{g}$ to all clients
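The two-level aggregation in steps 9 to 12 of Algorithm 2 can be sketched as follows. This is an illustration under stated assumptions rather than the exact FedDPA procedure: the client-to-group assignment is taken as given (the clustering step is not reproduced), and both levels use averages weighted by local sample counts. The names `weighted_mean` and `hierarchical_aggregate` are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def weighted_mean(vectors: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Weighted average of equally sized vectors."""
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(vectors[0])
    return [sum(w * v[j] for v, w in zip(vectors, weights)) / total
            for j in range(dim)]


def hierarchical_aggregate(client_vectors: Dict[int, Vector],
                           client_sizes: Dict[int, int],
                           groups: Sequence[Sequence[int]]) -> Vector:
    """Aggregate within each group first, then across groups."""
    group_vectors, group_sizes = [], []
    for members in groups:
        sizes = [client_sizes[k] for k in members]
        # Level 1: one aggregate per group (assumed sample-count weighting).
        group_vectors.append(
            weighted_mean([client_vectors[k] for k in members], sizes))
        group_sizes.append(sum(sizes))
    # Level 2: combine the group aggregates into the global vector.
    return weighted_mean(group_vectors, group_sizes)


vectors = {0: [1.0, 1.0], 1: [3.0, 3.0], 2: [5.0, 5.0]}
sizes = {0: 1, 1: 1, 2: 2}
print(hierarchical_aggregate(vectors, sizes, groups=[[0, 1], [2]]))  # [3.5, 3.5]
```

The same routine applies to a class prototype or to a flattened weight vector. With sample-count weights at both levels, the result equals the flat weighted average over all clients, so under this assumption the grouping changes how many messages reach the server but not the aggregate itself.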

5. Experiments and Results

5.1. Set-Up

To assess the effectiveness of Dynamic Prototypical Alignment for Federated Learning (FedDPA), we conducted experiments using five popular benchmark datasets: Extended MNIST (EMNIST), which collects handwritten letters from thousands of writers [23]; Federated Extended MNIST (FEMNIST), derived from EMNIST and partitioned by the writer of each digit or character [24]; CIFAR-10, which contains 60,000 images divided into 10 classes; CIFAR-100, which is divided into 100 classes; and TinyImageNet, a smaller version of the ImageNet dataset containing 200 classes [25]. These datasets were chosen for their heterogeneous and complex data distributions, which make them well suited to evaluating robustness in federated learning settings. We employed a convolutional neural network (CNN) as the base model for the EMNIST, FEMNIST, CIFAR-10, and CIFAR-100 datasets, following [26], and ResNet18 [27] for TinyImageNet. Data heterogeneity was simulated using a $\mathrm{Dirichlet}(β)$ distribution with concentration parameter $β$ , following the methodology of [11]. Two settings were explored: $β = 0.3$ , representing a highly non-IID scenario with significant class imbalance (clients typically possess one or two dominant classes and few or no samples from others), and $β = 1.0$ , yielding a more balanced distribution with reduced heterogeneity. The FL environment comprised 100 clients, with 10% randomly selected to participate in each of the 100 communication rounds. Each participating client performed five local training epochs per round. The margin for inter-class separation was set to $m = 1$ so that the class prototypes were sufficiently spaced apart, thereby improving the model’s discriminative ability. The weighting factor for the separation loss was set to $λ = 0.1$ to encourage better alignment of prototypes across clients. The regularization parameter $α$ and the heterogeneity influence factor $β$ of Section 4.2 (distinct from the Dirichlet concentration parameter, although it shares the symbol) were set to 0.01 and 0.1, respectively. To ensure statistical robustness, experiments were repeated across three independent runs for each dataset under both $\mathrm{Dirichlet}(β)$ distributions ( $β = 0.3$ and $β = 1.0$ ), providing reliable and reproducible results.
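The partitioning and client sampling described above can be illustrated with a small, dependency-free Python sketch. It implements one common variant of Dirichlet label-skew partitioning, in which the samples of each class are split across clients according to a draw from a symmetric Dirichlet distribution; whether this matches the exact procedure of [11] is an assumption. The function name and the toy label vector are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def dirichlet_partition(labels: Sequence[int], num_clients: int,
                        beta: float, seed: int = 0) -> List[List[int]]:
    """Split sample indices across clients with label skew controlled by beta."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    clients: List[List[int]] = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # A symmetric Dirichlet draw is a normalized vector of Gamma(beta, 1) draws.
        draws = [rng.gammavariate(beta, 1.0) for _ in range(num_clients)]
        total = sum(draws)
        shares = ([d / total for d in draws] if total > 0
                  else [1.0 / num_clients] * num_clients)
        start, cumulative = 0, 0.0
        for client, share in enumerate(shares):
            cumulative += share
            if client == num_clients - 1:
                end = len(indices)  # the last client takes the remainder
            else:
                end = min(len(indices),
                          max(start, round(cumulative * len(indices))))
            clients[client].extend(indices[start:end])
            start = end
    return clients


labels = [i % 10 for i in range(1000)]  # toy labels: 10 balanced classes
parts = dirichlet_partition(labels, num_clients=100, beta=0.3, seed=1)
# 10% of the 100 clients are sampled for a communication round.
participants = random.Random(1).sample(range(100), 10)
```

Smaller values of `beta` concentrate each class on fewer clients, while larger values spread each class more evenly across clients.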

5.2. Evaluation Metrics

We used three main metrics to evaluate the effectiveness of FedDPA. Global accuracy was assessed as the mean accuracy across three runs on a centralized test dataset, with the standard deviation reported to show variability and support reliability. Class separation was analyzed using t-SNE visualizations [28], which provided insight into the alignment and distinctness of class prototypes in the feature space over the communication rounds. We also measured convergence speed as the number of rounds required to achieve stable model performance, reflecting the efficiency of the approach in heterogeneous settings. Together, these metrics offer a comprehensive evaluation of FedDPA’s ability to address data heterogeneity in FL.
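The convergence metric depends on what counts as "stable" performance. The sketch below shows one possible operationalization, given purely as an assumption since no criterion is stated here: the first round at which accuracy stays within a fixed tolerance over a window of consecutive rounds. The function name, the tolerance, the window length, and the accuracy trace are all hypothetical.

```python
from typing import Optional, Sequence


def rounds_to_stability(accuracies: Sequence[float],
                        tolerance: float = 0.5,
                        window: int = 5) -> Optional[int]:
    """Return the 1-based round that starts the first stable window.

    A window is stable when the spread (max - min) of the accuracies in
    `window` consecutive rounds is at most `tolerance` percentage points.
    Returns None if no such window exists.
    """
    for start in range(len(accuracies) - window + 1):
        chunk = accuracies[start:start + window]
        if max(chunk) - min(chunk) <= tolerance:
            return start + 1
    return None


# Hypothetical accuracy trace (percent per communication round).
trace = [10.0, 30.0, 50.0, 60.0, 60.2, 60.1, 60.3, 60.2, 60.4]
print(rounds_to_stability(trace))  # 4
```

Other criteria, such as reaching a fixed fraction of the final accuracy, would give different round counts, so reported values are only comparable under a shared definition.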

5.3. Baselines

The performance of FedDPA was benchmarked against several state-of-the-art federated learning (FL) methods. These comparative methods included FedAvg [2], FedPer [12], Ditto [14], FedRep [13], FedBABU [15], FedNH [4], and FedROD [26]. These baselines represent a spectrum of strategies, from foundational aggregation (FedAvg) to personalization (FedPer and FedRep) and prototype-based approaches (FedNH), enabling a thorough comparative analysis of FedDPA’s effectiveness.

5.4. Results and Analysis

The results presented in Figures 2–6 and Table 1 provide a comprehensive evaluation of the global model’s accuracy across three independent experiments. These experiments were conducted on the five datasets described in Section 5.1 under two data distribution settings: a more imbalanced distribution ( $\mathrm{Dirichlet}(β = 0.3)$ ) and a more balanced distribution ( $\mathrm{Dirichlet}(β = 1.0)$ ). A consistent trend emerged: the model achieved higher accuracy when trained on the more balanced distribution ( $β = 1.0$ ) than on the imbalanced one ( $β = 0.3$ ). This finding is important for understanding the model’s robustness and its sensitivity to data heterogeneity. For the EMNIST dataset, accuracy under the imbalanced setting ranged from 97.43% to 97.62% and improved under the balanced setting to between 97.98% and 98.13%. Similarly, on FEMNIST, the model exceeded 84.00% accuracy in the imbalanced setting, and this increased to over 84.22% in the balanced setting. On CIFAR-10, accuracy exceeded 74.25% under the imbalanced distribution and improved substantially to over 81.95% under the balanced distribution. For CIFAR-100, accuracy ranged between 46.95% and 47.95% under the imbalanced setting and between 51.05% and 52.55% under the balanced setting. On the Tiny-ImageNet-200 dataset, the model achieved an accuracy of approximately 52.00% under the imbalanced distribution; as with the other datasets, performance improved under the balanced setting, where accuracy varied between 54.40% and 55.05%.

Table 2 reports the number of communication rounds required to achieve stable model performance across three independent experiments for the five datasets under the two Dirichlet settings ( $β = 0.3$ and $β = 1.0$ ). For the EMNIST and CIFAR-10 datasets, the more balanced distribution ( $β = 1.0$ ) generally facilitated faster convergence. EMNIST required 14 to 17 rounds for $β = 1.0$ , compared with a mean of 15.67 rounds for $β = 0.3$ . Similarly, CIFAR-10 converged in about 13 rounds with $β = 1.0$ , considerably fewer than with $β = 0.3$ . This observation supports the hypothesis that a more homogeneous data distribution can accelerate training. Conversely, the Tiny-ImageNet-200 dataset showed the inverse trend. Under the more balanced distribution ( $β = 1.0$ ), it required up to 17 rounds to converge, whereas the more heterogeneous distribution ( $β = 0.3$ ) converged more quickly, in about 15 rounds. The effect of data distribution on convergence was therefore not uniform across datasets. For the FEMNIST dataset, the number of rounds to convergence was approximately the same for both settings, ranging from 22 to 24 rounds, which suggests that for certain datasets the data distribution (within the tested range) has minimal impact on convergence speed. Most notably, the CIFAR-100 dataset required a significantly higher number of communication rounds across all experiments, ranging from 35 to 41 rounds. This pattern held regardless of the $β$ value, suggesting that the intrinsic complexity of the dataset is a more dominant factor in convergence behavior than the data distribution. The higher standard deviation observed for CIFAR-100 under $β = 1.0$ (3.21) compared with $β = 0.3$ (1.53) also indicates more variability in convergence time in the more balanced setting, a pattern not seen for the other datasets.
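For reference, the mean and standard deviation over three runs can be computed as in the sketch below. The sample (n − 1) estimator is an assumption, and the input triple is hypothetical rather than taken from Table 2: it was chosen only because it lies in the reported 35–41 range, and its sample standard deviation rounds to 3.21.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation (n - 1 denominator) of repeated runs."""
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical round counts for three independent runs.
mean_rounds, std_rounds = summarize_runs([35, 40, 41])
print(f"{mean_rounds:.2f} ± {std_rounds:.2f}")  # 38.67 ± 3.21
```

With only three runs, a single outlying run moves the standard deviation considerably, so differences such as 3.21 versus 1.53 should be read with that sample size in mind.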

Figures 7–11 present t-SNE visualizations of the alignment and separation of class prototypes in the feature space for the five datasets at communication rounds 1, 25, 75, and 100 under the two Dirichlet settings, $β = 0.3$ and $β = 1.0$ . Across all datasets, the plots for Round 1 show a scattered distribution of class prototypes, indicating a lack of initial alignment and clear class boundaries. As training progressed, a clear trend of prototype alignment and inter-class separation emerged. In particular, the visualizations for EMNIST and CIFAR-10 showed rapid convergence, with prototypes forming distinct, compact clusters by Round 25 and becoming even more tightly grouped by Round 100. This suggests that for datasets with a manageable number of classes and relatively simple feature spaces, the model effectively learns to pull similar prototypes together while pushing dissimilar ones apart. For more complex datasets such as CIFAR-100, FEMNIST, and Tiny-ImageNet-200, the process was more gradual. While some initial clustering was visible by Round 25, the prototypes continued to align and separate throughout training, culminating in a more organized yet denser distribution by Round 100. This demonstrates the model’s ability to handle high-dimensional, fine-grained classification problems by progressively refining the feature space. The comparison between $β = 0.3$ and $β = 1.0$ further highlights the role of the Dirichlet concentration parameter in shaping the feature space, with a higher $β$ value often leading to a more pronounced separation and a less compact arrangement of class prototypes. Overall, these visualizations indicate that FedDPA effectively structured the feature space, improving the discriminability of class prototypes over the course of the training rounds.

The results, detailed in Table 3, are reported as the mean and standard deviation over three independent runs under the two $\mathrm{Dirichlet}(β)$ settings, $β = 0.3$ and $β = 1.0$ . On the EMNIST dataset, FedDPA achieved the best performance, with an accuracy of 97.52 ± 0.09 at $β = 0.3$ and 98.05 ± 0.06 at $β = 1.0$ . The second-best performer at $β = 0.3$ was FedROD, with an accuracy of 97.30 ± 0.12, while at $β = 1.0$ FedRep and FedROD tied, achieving 97.50 ± 0.49 and 97.50 ± 0.30, respectively. Notably, FedDPA’s standard deviation was consistently low, indicating a high degree of stability. For the FEMNIST dataset, the top-performing method differed between the two $β$ values. At $β = 0.3$ , FedROD led with 86.30 ± 0.50, while FedDPA followed with the second-best accuracy of 84.91 ± 0.50. Conversely, at $β = 1.0$ , FedDPA was the best method, achieving an accuracy of 84.10 ± 0.66, with FedROD in second place at 83.90 ± 0.66. On the CIFAR-10 dataset, FedDPA outperformed all other methods by a considerable margin, achieving an accuracy of 74.30 ± 0.33 at $β = 0.3$ and 82.05 ± 0.07 at $β = 1.0$ . This represents a substantial improvement over the next best method, FedROD, which recorded accuracies of 72.31 ± 0.16 and 75.50 ± 0.15 for the respective $β$ values. The performance gap became even more pronounced on the more complex CIFAR-100 dataset. FedDPA achieved an accuracy of 47.05 ± 0.07 at $β = 0.3$ and 51.72 ± 0.17 at $β = 1.0$ . The second-best performer, FedNH, lagged significantly behind with accuracies of 41.34 ± 0.25 and 43.19 ± 0.24 for the same respective $β$ values. This highlights FedDPA’s robustness and scalability to more challenging classification tasks. FedDPA’s advantage was most evident on the highly complex Tiny-ImageNet-200 dataset, where its accuracy was 51.55 ± 0.58 at $β = 0.3$ and 54.75 ± 0.26 at $β = 1.0$ . This performance was markedly superior to that of the second-best method, FedNH, which achieved accuracies of 36.71 ± 0.36 and 38.68 ± 0.30, respectively. The large performance difference underscores the effectiveness of FedDPA in handling large-scale and complex image classification problems within a federated learning framework.

6. Ablation Study

To meticulously analyze the individual contribution of each core component of our proposed FedDPA framework, we conducted a thorough ablation study. By systematically deactivating each key mechanism—adaptive regularization, hierarchical aggregation, and contrastive alignment—we can isolate its impact on overall model performance. This analysis is crucial for understanding the sources of FedDPA’s strength in non-IID environments.

6.1. Experimental Set-Up

We chose three diverse datasets to conduct the ablation experiments and ensure a comprehensive evaluation: FEMNIST, CIFAR-10, and CIFAR-100. For each dataset, we used the highly heterogeneous setting ( $β = 0.3$ ) to create a challenging environment that best highlighted the benefits of each component. We compared the full FedDPA model against three ablated configurations:

FedDPA without Adaptive Regularization: In this version, we replaced the dynamic regularization weight $λ_{k}$ with a fixed, non-adaptive value. This set-up aimed to quantify the advantage of dynamically adjusting the regularization penalty according to each client’s level of data heterogeneity.
FedDPA without Hierarchical Aggregation: We disabled the two-level aggregation structure for both prototypes and model weights. All client updates were sent directly to the global server for a single-step averaging process. This variant measured the performance impact of the hierarchical aggregation separate from its communication efficiency benefits.
FedDPA without Contrastive Alignment: In this final configuration, we removed the contrastive loss function $L_{contrast}$ from the server-side optimization. The global prototypes were aggregated via simple averaging without the explicit goal of minimizing intra-class variance and maximizing inter-class separation. This test revealed the significance of actively structuring the global feature space.
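To make concrete what the third configuration removes, the following Python sketch shows one plausible form of a prototype-level contrastive objective. It is an assumption for illustration and not the paper’s exact $L_{contrast}$: the intra-class term is the mean squared distance between local and global prototypes of the same class, and the inter-class term is a squared hinge penalty on pairs of global prototypes that lie closer than the margin. The default values `margin=1.0` and `lam=0.1` follow the settings reported in Section 5.1; all names are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def contrastive_alignment_loss(local_protos: Dict[int, Vector],
                               global_protos: Dict[int, Vector],
                               margin: float = 1.0,
                               lam: float = 0.1) -> float:
    """Assumed form: intra-class alignment + lam * inter-class separation."""
    # Intra-class term: pull each local prototype toward the global
    # prototype of the same class (mean squared Euclidean distance).
    shared = [c for c in local_protos if c in global_protos]
    intra = (sum(math.dist(local_protos[c], global_protos[c]) ** 2
                 for c in shared) / len(shared)) if shared else 0.0

    # Inter-class term: penalize pairs of global prototypes of different
    # classes that are closer than `margin` (squared hinge).
    classes = sorted(global_protos)
    pairs = [(a, b) for i, a in enumerate(classes) for b in classes[i + 1:]]
    inter = (sum(max(0.0, margin - math.dist(global_protos[a], global_protos[b])) ** 2
                 for a, b in pairs) / len(pairs)) if pairs else 0.0

    return intra + lam * inter


near = {0: [0.0, 0.0], 1: [0.5, 0.0]}   # prototypes closer than the margin
far = {0: [0.0, 0.0], 1: [3.0, 4.0]}    # prototypes well separated
print(contrastive_alignment_loss(far, far))    # 0.0
print(contrastive_alignment_loss(near, near))  # small positive penalty
```

In the ablated configuration this term is absent, so nothing explicitly pushes the prototypes of different classes apart.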

All other experimental settings, including the base model architecture, number of clients, communication rounds, and local training epochs, were kept identical to those in our main experiments (Section 5.1).

6.2. Results and Analysis

The results of the ablation study, presented in Table 4, reveal the indispensable role of each component in the FedDPA framework’s success:

Adaptive Regularization: Removing this component caused a significant drop in accuracy across all datasets: a 2.81% decrease for FEMNIST, a 3.15% decrease for CIFAR-10, and a 2.85% decrease for CIFAR-100. This confirms that a one-size-fits-all regularization penalty is suboptimal; dynamically adapting to client-specific data distributions is crucial for mitigating client drift effectively.
Hierarchical Aggregation: Deactivating this component led to a consistent, albeit smaller, decrease in accuracy. Grouping clients with similar data distributions for an intermediate aggregation step provides a more stable and refined update to the global server, improving final model performance beyond communication efficiency benefits.
Contrastive Alignment: Disabling this component resulted in the largest performance degradation across all datasets, with the accuracy dropping by 4.13% for FEMNIST, 4.47% for CIFAR-10, and 5.17% for CIFAR-100. Actively structuring the feature space by enforcing class separation is the single most important contributor to FedDPA’s robustness; without it, the model cannot effectively learn discriminative features to overcome severe data heterogeneity.

In summary, the synergy between these three core components makes FedDPA a highly effective and robust solution for federated learning in real-world non-IID settings.

7. Discussion

The observations from Table 2 highlight the interplay between dataset complexity and data heterogeneity in federated learning environments. The influence of data distribution on convergence speed was not uniform; instead, it depended strongly on the inherent difficulty of the learning task. For relatively simple datasets such as EMNIST and CIFAR-10, reducing data heterogeneity (i.e., using $β = 1.0$ ) accelerated convergence. This suggests that when the task is less complex, making the data distribution more uniform across clients allows for more consistent model updates, leading to faster stabilization. However, this effect diminished as dataset complexity increased. For the intricate CIFAR-100 dataset, which has a large number of classes and high intra-class variability, the number of communication rounds required for convergence remained consistently high (35–41 rounds), regardless of the $β$ value. This indicates that for complex tasks, the inherent difficulty of the dataset is the primary bottleneck, effectively overshadowing the impact of data heterogeneity. The convergence behavior on Tiny-ImageNet-200 further reinforces this, showing a complex, nonlinear relationship in which the more heterogeneous distribution ( $β = 0.3$ ) surprisingly led to faster convergence. These findings carry significant implications for the design of federated learning systems. Rather than relying on a one-size-fits-all approach to managing data heterogeneity, optimization strategies should prioritize adaptive mechanisms that account for dataset-specific attributes. This could involve tailored aggregation techniques or dynamic scheduling of communication rounds to address the challenges posed by complex datasets, thereby mitigating convergence impediments.

The results presented in Table 3 provide strong evidence of FedDPA’s performance in federated learning (FL) under varying degrees of data heterogeneity, controlled by the Dirichlet concentration parameter $β$ . FedDPA achieved the highest accuracy on all five benchmark datasets in every setting except FEMNIST at $β = 0.3$ , where FedROD led, demonstrating a robust ability to generalize and maintain high accuracy in non-IID settings. This advantage was particularly notable in more complex tasks, where the performance gap between methods became more pronounced. FedDPA’s results combine accuracy with stability: the method achieved the highest global model accuracy in nearly every setting, often with the lowest standard deviation. For instance, on the CIFAR-10 dataset at $β = 1.0$ , FedDPA recorded an accuracy of 82.05 ± 0.07. This small standard deviation indicates high consistency and reliability, a critical feature for real-world applications, where data distributions can be unpredictable. This finding suggests that FedDPA’s adaptive approach is effective at mitigating the detrimental effects of data heterogeneity, leading to more stable and trustworthy models. While FedROD also performed well, often ranking second, it generally fell short of FedDPA’s accuracy. Its architecture, which separates the model into a common feature extractor and a personalized head, offers significant improvements over basic FL methods such as FedAvg. However, the results show that FedDPA’s approach is more effective at adapting to highly heterogeneous data, as seen in its larger performance gap over FedROD on more complex datasets such as CIFAR-100. The performance trends across the datasets further underscore the challenges of handling complexity and heterogeneity. On CIFAR-100 and Tiny-ImageNet-200, the performance gap between FedDPA and the other methods was most pronounced. This suggests that the FedDPA architecture is well suited to handling the subtle class imbalances and complex feature ambiguities inherent in these high-dimensional datasets. In these scenarios, the ability to learn a common representation while personalizing for local data is crucial, a task that FedDPA appears to handle more effectively than its counterparts. In contrast, FedNH showed a different trade-off. While it achieved stronger results than many other methods, particularly on CIFAR-10, its higher variance (e.g., ±2.51 at $β = 0.3$ ) suggests a potential trade-off between generalizability and stability. This indicates that FedNH may prioritize learning a more generalized model at the expense of consistency, which could be beneficial in scenarios with moderate data skew but detrimental in highly heterogeneous environments, where model stability is paramount. The consistent advantage of FedDPA across nearly all tested conditions establishes it as a reliable and effective solution for federated learning in a wide range of heterogeneous settings.

8. Conclusions

In this study, we proposed Dynamic Prototypical Alignment for Federated Learning (FedDPA), a novel framework designed to address the challenges of data heterogeneity in federated learning (FL). FedDPA employs prototype-based regularization, intra-class alignment, and inter-class separation to enhance global model consistency across non-identically and independently distributed (non-IID) client data. The dynamic alignment between local and global prototypes is achieved through a contrastive loss function, which explicitly minimizes the distance between local and global prototypes of the same class while maximizing the distance between different classes. This mechanism effectively improves model generalization. Extensive experiments on the EMNIST, FEMNIST, CIFAR-10, CIFAR-100, and Tiny-ImageNet-200 datasets demonstrated that FedDPA consistently outperformed existing FL methods. For instance, on the challenging CIFAR-100 dataset with a high degree of non-IID data, FedDPA achieved a global model accuracy of 47.05%, representing an 8.93% improvement over the nearest approach, FedNH, and a 39.07% improvement over FedROD. These results underscore the efficacy of prototype-based learning in tackling data heterogeneity and improving generalization in decentralized learning systems. Despite its success, several avenues for future research remain. Applying FedDPA to large-scale, real-world datasets such as those in medical or financial domains could more thoroughly validate its utility in highly heterogeneous environments. Additionally, integrating communication-efficient optimization techniques could enhance FedDPA’s scalability in resource-constrained settings. Another promising direction involves developing a method for selective client participation. Finally, incorporating privacy-preserving mechanisms could strengthen FedDPA’s robustness, aligning with federated learning’s core privacy objectives.
In summary, FedDPA offers a scalable and effective solution to the challenges of data heterogeneity in federated learning, laying a foundation for more robust and generalizable federated learning systems suited to diverse, real-world applications.

Author Contributions

Conceptualization, O.A.B. and R.B.; supervision, R.B.; writing—original draft, O.A.B.; writing—review and editing, R.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data is contained within the article.

Acknowledgments

The authors appreciate support from the Research Laboratory on Computer Science’s Complex Systems RELA(CS)2 and would like to thank the Directorate General for Scientific Research and Technological Development (DGRSDT, MESRS).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the Artificial Intelligence and Statistics. PMLR, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282.
2. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450.
3. Li, X.; Huang, K.; Yang, W.; Wang, S.; Zhang, Z. On the convergence of fedavg on non-iid data. arXiv 2019, arXiv:1907.02189.
4. Dai, Y.; Chen, Z.; Li, J.; Heinecke, S.; Sun, L.; Xu, R. Tackling data heterogeneity in federated learning with class prototypes. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 7314–7322.
5. Acar, D.A.E.; Zhao, Y.; Navarro, R.M.; Mattina, M.; Whatmough, P.N.; Saligrama, V. Federated learning based on dynamic regularization. arXiv 2021, arXiv:2111.04263.
6. Wang, H.; Yurochkin, M.; Sun, Y.; Papailiopoulos, D.; Khazaeni, Y. Federated learning with matched averaging. arXiv 2020, arXiv:2002.06440.
7. Tan, Y.; Long, G.; Liu, L.; Zhou, T.; Lu, Q.; Jiang, J.; Zhang, C. Fedproto: Federated prototype learning across heterogeneous clients. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 8432–8440.
8. Wang, J.; Liu, Q.; Liang, H.; Joshi, G.; Poor, H.V. Tackling the objective inconsistency problem in heterogeneous federated optimization. Adv. Neural Inf. Process. Syst. 2020, 33, 7611–7623.
9. Shoham, N.; Avidor, T.; Keren, A.; Israel, N.; Benditkis, D.; Mor-Yosef, L.; Zeitak, I. Overcoming forgetting in federated learning on non-iid data. arXiv 2019, arXiv:1910.07796.
10. Chen, H.Y.; Chao, W.L. Fedbe: Making bayesian model ensemble applicable to federated learning. arXiv 2020, arXiv:2009.01974.
11. Lin, T.; Kong, L.; Stich, S.U.; Jaggi, M. Ensemble distillation for robust model fusion in federated learning. Adv. Neural Inf. Process. Syst. 2020, 33, 2351–2363.
12. Arivazhagan, M.G.; Aggarwal, V.; Singh, A.K.; Choudhary, S. Federated learning with personalization layers. arXiv 2019, arXiv:1912.00818.
13. Collins, L.; Hassani, H.; Mokhtari, A.; Shakkottai, S. Exploiting shared representations for personalized federated learning. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual, 18–24 July 2021; pp. 2089–2099.
14. Li, T.; Hu, S.; Beirami, A.; Smith, V. Ditto: Fair and robust federated learning through personalization. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual, 18–24 July 2021; pp. 6357–6368.
15. Oh, J.; Kim, S.; Yun, S.Y. Fedbabu: Towards enhanced representation for federated image classification. arXiv 2021, arXiv:2106.06042.
16. Kuang, L.; Guo, K.; Liang, J.; Zhang, J. An Enhanced Federated Prototype Learning Method under Domain Shift. arXiv 2024, arXiv:2409.18578.
17. Kim, H.; Kwak, Y.; Jung, M.; Shin, J.; Kim, Y.; Kim, C. Protofl: Unsupervised federated learning via prototypical distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6470–6479.
18. Yang, T.; Xu, J.; Zhu, M.; An, S.; Gong, M.; Zhu, H. FedZaCt: Federated learning with Z average and cross-teaching on image segmentation. Electronics 2022, 11, 3262.
19. Zhang, J.; Shan, C.; Han, J. FedGMKD: An Efficient Prototype Federated Learning Framework through Knowledge Distillation and Discrepancy-Aware Aggregation. In Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Volume 37, pp. 118326–118356.
20. Hu, M.; Zhou, P.; Yue, Z.; Ling, Z.; Huang, Y.; Li, A.; Liu, Y.; Lian, X.; Chen, M. FedCross: Towards accurate federated learning via multi-model cross-aggregation. In Proceedings of the 2024 IEEE 40th International Conference on Data Engineering (ICDE), Utrecht, The Netherlands, 13–16 May 2024; pp. 2137–2150.
21. Zhang, C.; Sun, H.; Shen, Z.; Wang, D. CS-FL: Cross-Zone Secure Federated Learning with Blockchain and a Credibility Mechanism. Appl. Sci. 2024, 15, 26.
22. Bernardi, M.L.; Cimitile, M.; Usman, M. DQFed: A Federated Learning Strategy for Non-IID Data based on a Quality-Driven Perspective. In Proceedings of the 2024 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), Yokohama, Japan, 30 June–5 July 2024; pp. 1–8.
23. Cohen, G.; Afshar, S.; Tapson, J.; Van Schaik, A. EMNIST: Extending MNIST to handwritten letters. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 2921–2926.
24. Caldas, S.; Duddu, S.M.K.; Wu, P.; Li, T.; Konečný, J.; McMahan, H.B.; Smith, V.; Talwalkar, A. Leaf: A benchmark for federated settings. arXiv 2018, arXiv:1812.01097.
25. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. Technical Report; University of Toronto: Toronto, ON, Canada, 2009.
26. Chen, H.Y.; Chao, W.L. On bridging generic and personalized federated learning for image classification. arXiv 2021, arXiv:2107.00778.
27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
28. van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.

Figure 1. Pictorial view of proposed Dynamic Prototypical Alignment for Federated Learning (FedDPA) approach.

Figure 2. The global model’s accuracy across the three experiments on the EMNIST dataset.

Figure 3. The global model’s accuracy across the three experiments on the FEMNIST dataset.

Figure 4. The global model’s accuracy across the three experiments on the CIFAR-10 dataset.

Figure 5. The global model’s accuracy across the three experiments on the CIFAR-100 dataset.

Figure 6. The global model’s accuracy across the three experiments on the Tiny-ImageNet-200 dataset.

Figure 7. t-SNE plots of the alignment and separation of class prototypes in the feature space at rounds 1, 25, 75, and 100 on the EMNIST dataset, for one randomly chosen run of the three independent experiments, under the $\mathrm{Dirichlet}(β)$ distributions with $β = 0.3$ and $β = 1.0$ .

Figure 8. The t-SNE plots for the alignment and separation of class prototypes in the feature space for a random run from the three independent experiments for 1, 25, 75, and 100 rounds on the FEMNIST dataset, with $β = 0.3$ and $β = 1.0$ for the Dirichlet($β$) distributions.

Figure 9. The t-SNE plots for the alignment and separation of class prototypes in the feature space for a random run from the three independent experiments for 1, 25, 75, and 100 rounds on the CIFAR-10 dataset, with $β = 0.3$ and $β = 1.0$ for the Dirichlet($β$) distributions.

Figure 10. The t-SNE plots for the alignment and separation of class prototypes in the feature space for a random run from the three independent experiments for 1, 25, 75, and 100 rounds on the CIFAR-100 dataset, with $β = 0.3$ and $β = 1.0$ for the Dirichlet($β$) distributions.

Figure 11. The t-SNE plots for the alignment and separation of class prototypes in the feature space for a random run from the three independent experiments for 1, 25, 75, and 100 rounds on the CIFAR-100 dataset, with $β = 0.3$ and $β = 1.0$ for the Dirichlet($β$) distributions.

Table 1. Global model accuracy (GM Acc) from three independent experiments on EMNIST, FEMNIST, CIFAR-10, CIFAR-100, and Tiny-ImageNet-200 using Dirichlet($β$) with $β = 0.3$ and $β = 1.0$.

Dataset	Experiment	GM Acc (%), $β = 0.3$	Std over 3 runs, $β = 0.3$	GM Acc (%), $β = 1.0$	Std over 3 runs, $β = 1.0$
EMNIST	Experiment 01	97.62		98.13	
	Experiment 02	97.43	0.09	98.05	0.06
	Experiment 03	97.51		97.98	
FEMNIST	Experiment 01	84.72		83.60	
	Experiment 02	85.60	0.50	84.22	0.66
	Experiment 03	84.42		84.52	
CIFAR-10	Experiment 01	74.72		82.13	
	Experiment 02	74.25	0.33	81.95	0.07
	Experiment 03	73.92		82.06	
CIFAR-100	Experiment 01	47.13		51.95	
	Experiment 02	46.95	0.07	52.55	0.17
	Experiment 03	47.06		51.05	
Tiny-ImageNet-200	Experiment 01	52.10		54.40	
	Experiment 02	51.82	0.58	54.82	0.26
	Experiment 03	50.75		55.05	
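
Table 1 gives one Std value per dataset and $β$, and Table 3 summarizes the same runs as mean ± standard deviation. The following is a minimal sketch of that summary step for a few rows of Table 1. The table does not say whether the population or the sample form of the standard deviation is used, so the sketch computes both; the names `runs` and `summarize` are illustrative and not taken from the paper.

```python
from statistics import mean, pstdev, stdev

# Per-run global model accuracies (%) copied from Table 1.
runs = {
    ("EMNIST", 0.3): [97.62, 97.43, 97.51],
    ("CIFAR-10", 0.3): [74.72, 74.25, 73.92],
    ("CIFAR-10", 1.0): [82.13, 81.95, 82.06],
    ("Tiny-ImageNet-200", 0.3): [52.10, 51.82, 50.75],
}


def summarize(values):
    """Return (mean, population std, sample std) for a list of accuracies."""
    return mean(values), pstdev(values), stdev(values)


# Assumption: each row is summarized independently over its three runs.
summary = {key: summarize(values) for key, values in runs.items()}

for (dataset, beta), (avg, pop_sd, samp_sd) in summary.items():
    print(f"{dataset:18s} beta={beta}: mean={avg:.2f} "
          f"pop_std={pop_sd:.2f} sample_std={samp_sd:.2f}")
```

For the two CIFAR-10 rows, the population form rounds to the reported 0.33 and 0.07. This is an observation from the numbers alone, not a convention the table states, and other rows may follow a different rounding or definition.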

Table 2. Number of rounds required to reach stable model performance across three independent experiments on EMNIST, FEMNIST, CIFAR-10, CIFAR-100, and Tiny-ImageNet-200 under Dirichlet distributions with $β = 0.3$ and $β = 1.0$.

Dataset	Experiment	$β = 0.3$	$β = 1.0$
EMNIST	Experiment 01	16	15
	Experiment 02	14	15
	Experiment 03	17	14
FEMNIST	Experiment 01	23	22
	Experiment 02	22	24
	Experiment 03	23	22
CIFAR-10	Experiment 01	17	12
	Experiment 02	18	15
	Experiment 03	21	13
CIFAR-100	Experiment 01	35	35
	Experiment 02	36	36
	Experiment 03	38	41
Tiny-ImageNet-200	Experiment 01	15	17
	Experiment 02	17	20
	Experiment 03	15	18

Table 3. Comparison of global model accuracy (GM Acc, %), reported as mean ± standard deviation over three independent runs on EMNIST, FEMNIST, CIFAR-10, CIFAR-100, and Tiny-ImageNet-200. The best results are in bold, and the second-best results are underlined.

Dataset	Method	$β = 0.3$	$β = 1.0$
EMNIST	FedAvg	96.90 ± 0.26	97.00 ± 0.62
	FedPer	93.30 ± 0.40	97.20 ± 0.55
	Ditto	97.00 ± 0.27	97.20 ± 0.15
	FedRep	95.00 ± 0.18	97.50 ± 0.49
	FedBABU	-	-
	FedNH	-	-
	FedROD	97.30 ± 0.12	97.50 ± 0.30
	FedDPA	97.52 ± 0.09	98.05 ± 0.06
FEMNIST	FedAvg	83.10 ± 0.13	83.40 ± 0.54
	FedPer	79.90 ± 0.67	74.50 ± 0.58
	Ditto	81.50 ± 0.36	83.30 ± 0.30
	FedRep	79.50 ± 0.70	80.60 ± 0.42
	FedBABU	-	-
	FedNH	-	-
	FedROD	86.30 ± 0.50	83.90 ± 0.66
	FedDPA	84.91 ± 0.50	84.10 ± 0.66
CIFAR-10	FedAvg	66.40 ± 3.13	73.07 ± 1.60
	FedPer	61.58 ± 0.43	63.33 ± 0.53
	Ditto	66.40 ± 3.13	73.07 ± 1.60
	FedRep	40.13 ± 0.17	47.92 ± 0.38
	FedBABU	62.78 ± 3.09	70.34 ± 1.72
	FedNH	69.01 ± 2.51	75.34 ± 0.86
	FedROD	72.31 ± 0.16	75.50 ± 0.15
	FedDPA	74.30 ± 0.33	82.05 ± 0.07
CIFAR-100	FedAvg	35.14 ± 0.48	36.07 ± 0.41
	FedPer	15.04 ± 0.06	14.69 ± 0.03
	Ditto	35.14 ± 0.48	36.07 ± 1.41
	FedRep	5.42 ± 0.03	6.37 ± 0.04
	FedBABU	32.41 ± 0.40	22.21 ± 0.15
	FedNH	41.34 ± 0.25	43.19 ± 0.24
	FedROD	33.83 ± 0.25	35.20 ± 0.19
	FedDPA	47.05 ± 0.07	51.72 ± 0.17
Tiny-ImageNet-200	FedAvg	34.63 ± 0.26	37.65 ± 0.37
	FedPer	15.28 ± 0.14	13.71 ± 0.07
	Ditto	34.63 ± 0.26	37.65 ± 0.37
	FedRep	3.27 ± 0.02	3.91 ± 0.03
	FedBABU	26.36 ± 0.32	30.25 ± 0.32
	FedNH	36.71 ± 0.36	38.68 ± 0.30
	FedROD	36.46 ± 0.28	37.71 ± 0.31
	FedDPA	51.55 ± 0.58	54.75 ± 0.26
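
One way to read Table 3 is as the gap between FedDPA and the strongest competing method in each column. The sketch below computes that gap for two of the datasets, using only the mean accuracies copied from the table (standard deviations are ignored). The dictionary layout and the helper `margin_over_best_baseline` are illustrative assumptions, not part of the paper.

```python
# Mean global model accuracy (%) from Table 3, stored as (beta=0.3, beta=1.0).
table3 = {
    "FEMNIST": {
        "FedAvg": (83.10, 83.40), "FedPer": (79.90, 74.50),
        "Ditto": (81.50, 83.30), "FedRep": (79.50, 80.60),
        "FedROD": (86.30, 83.90), "FedDPA": (84.91, 84.10),
    },
    "CIFAR-100": {
        "FedAvg": (35.14, 36.07), "FedPer": (15.04, 14.69),
        "Ditto": (35.14, 36.07), "FedRep": (5.42, 6.37),
        "FedBABU": (32.41, 22.21), "FedNH": (41.34, 43.19),
        "FedROD": (33.83, 35.20), "FedDPA": (47.05, 51.72),
    },
}


def margin_over_best_baseline(scores, column, ours="FedDPA"):
    """Return (best baseline name, ours minus best baseline) in percentage points."""
    baselines = {m: acc[column] for m, acc in scores.items() if m != ours}
    best = max(baselines, key=baselines.get)
    return best, round(scores[ours][column] - baselines[best], 2)


for dataset, scores in table3.items():
    for column, beta in enumerate((0.3, 1.0)):
        best, margin = margin_over_best_baseline(scores, column)
        print(f"{dataset} beta={beta}: best baseline {best}, margin {margin:+.2f}")
```

With these numbers the margin is negative for FEMNIST at $β = 0.3$, because FedROD's 86.30 exceeds FedDPA's 84.91, and positive in the other three columns shown.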

Table 4. Ablation study of the FedDPA model on three datasets: FEMNIST, CIFAR-10, and CIFAR-100.

Configuration	FEMNIST	CIFAR-10	CIFAR-100
Full FedDPA Model	84.91	74.30	47.05
Without Adaptive Regularization	82.10	71.15	44.20
Without Hierarchical Aggregation	83.55	72.50	45.95
Without Contrastive Alignment	80.78	69.83	41.88
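
The contribution of each component in Table 4 can be expressed as the accuracy lost when that component is removed. The sketch below does this arithmetic on the values copied from the table; the helper `accuracy_drops` and the ranking by summed drop are illustrative choices, not the authors' analysis. The full-model values coincide with the $β = 0.3$ column of Table 3.

```python
# Accuracy (%) of the full model and of each ablated variant, from Table 4.
full = {"FEMNIST": 84.91, "CIFAR-10": 74.30, "CIFAR-100": 47.05}
ablations = {
    "Adaptive Regularization": {"FEMNIST": 82.10, "CIFAR-10": 71.15, "CIFAR-100": 44.20},
    "Hierarchical Aggregation": {"FEMNIST": 83.55, "CIFAR-10": 72.50, "CIFAR-100": 45.95},
    "Contrastive Alignment": {"FEMNIST": 80.78, "CIFAR-10": 69.83, "CIFAR-100": 41.88},
}


def accuracy_drops(full_scores, ablated_scores):
    """Return the drop in percentage points per dataset when a component is removed."""
    return {d: round(full_scores[d] - ablated_scores[d], 2) for d in full_scores}


drops = {name: accuracy_drops(full, scores) for name, scores in ablations.items()}

# Assumption: rank components by the drop summed over the three datasets.
ranking = sorted(drops, key=lambda name: sum(drops[name].values()), reverse=True)

for name in ranking:
    print(f"Without {name}: {drops[name]}")
```

Under this ranking, removing contrastive alignment costs the most on all three datasets (4.13, 4.47, and 5.17 points), followed by adaptive regularization and then hierarchical aggregation.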

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
