1. Introduction
The explosive growth of data generated by connected devices has accompanied the rapid development of the Internet of Things (IoT), which provides ubiquitous sensing, computing, and communication capabilities to connect things to the Internet [1]. To analyze the data from IoT devices in depth, artificial intelligence (AI) algorithms have been adopted to enable intelligent IoT applications such as smart transportation and smart cities [2]. Traditionally, AI algorithms that require the centralized collection and processing of data are deployed on a cloud/edge server or data center for data mining. However, offloading massive amounts of IoT data to remote servers and processing the data there induces significant delays. Furthermore, relying on a third-party server also raises data privacy concerns [3]. In this context, integrating privacy preservation and distributed AI into the IoT has become an important topic.
Recently, Federated Learning (FL) has emerged as a privacy-preserving distributed learning framework that enables intelligent IoT applications by allowing distributed IoT devices to collaboratively train machine learning models [4]. In FL, multiple devices train a joint global model via Stochastic Gradient Descent (SGD) and share local model parameters instead of raw data [5]. FL has seen recent success in several applications. For example, the Genetic Clustered Federated Learning (CFL) approach proposed in [6] has been applied to detect COVID-19 patients in a privacy-preserving manner. The FL-assisted cooperative perception task in [7] has been applied to vehicular networks, where vehicles fuse sensor data to obtain an extended view of the driving environment.
However, existing federated learning methods suffer from inefficient model training when applied in resource-constrained IoT environments, for several reasons. (i) Unbalanced Data: Training data at IoT devices differ in size and distribution because of different sensing environments [8]. Thus, uplink and downlink communications in FL are highly sensitive to non-independent and identically distributed (non-IID) data. (ii) Limited Bandwidth: IoT devices often have limited communication bandwidth due to their constrained network capabilities. Transmitting large model updates in traditional federated learning setups can be inefficient and slow in such environments. (iii) High Latency: IoT devices might have high communication latency, making frequent model updates impractical. This can delay the aggregation of updates and hinder the training process. (iv) Unreliable Connectivity: IoT devices might experience intermittent or unreliable connectivity, which can disrupt the communication process and lead to incomplete or delayed updates. Because of intermittent connections, the convergence of FL for IoT cannot always be guaranteed [9,10,11]. The situation of federated learning on mobile devices (e.g., sensors and Unmanned Aerial Vehicles (UAVs)) is even worse, as they communicate via wireless channels and suffer from lower bandwidth, higher latency, and intermittent connections [12]. Thus, inefficient training performance becomes an important bottleneck for scaling up FL.
It is therefore necessary to address the training-inefficiency problem of federated learning. Several efforts have been made to speed up federated learning convergence via local updates and parameter compression [13,14]. The local update approach aims to reduce frequent transmissions of model parameters by making full use of the computing capability of devices. Local update algorithms characterize a trade-off between computation and communication via a local update coefficient that determines the ratio of local updates to bandwidth [15]. The parameter compression approach aims to reduce the amount of data to be transmitted through compression techniques such as quantization and sparsification. Parameter compression algorithms characterize a trade-off between communication and model precision via a parameter compression coefficient (compression budget), which determines the ratio between bandwidth and accuracy [16]. These methods have been studied individually to improve the efficiency of federated learning. However, jointly considering these two trade-offs and adaptively balancing their impacts on the convergence of federated learning, from both mathematical estimation and theoretical analysis perspectives, has remained unresolved. This significant problem motivated our research. This paper proposes a novel efficient adaptive federated optimization (FedEAFO) algorithm to speed up the convergence of federated learning for IoT, which minimizes the learning error by jointly considering the two methods of local update and parameter compression. FedEAFO enables federated learning to adaptively adjust the two corresponding variables and balance trade-offs among computation, communication, and accuracy.
The proposed efficient adaptive federated optimization algorithm stands apart from traditional optimization approaches in several significant ways. These distinctions underscore the algorithm’s advancements and its tailored suitability for the complexities of IoT environments. (i) Joint Communication and Computation Optimization: One of the key differentiators of the FedEAFO algorithm is its holistic approach toward optimization. Unlike traditional methods that often focus on either communication or computation aspects, FedEAFO takes a pioneering step by seamlessly integrating both model compression techniques and multiple local training strategies. This integrated optimization effectively addresses the intricate interplay between computation and communication efficiency within the federated learning framework. (ii) Adaptive Optimization: FedEAFO introduces an adaptive dimension that sets it apart from conventional techniques. This adaptability empowers the algorithm to dynamically and intelligently adjust its optimization parameters, facilitating a fine-tuned balance between various trade-offs. This adaptiveness is in stark contrast to traditional methods that often employ fixed or pre-determined parameters, making FedEAFO particularly responsive to the dynamic nature of IoT environments. (iii) Enhanced Convergence Speed: Another distinctive feature lies in FedEAFO’s aim to expedite the convergence speed of federated learning. Unlike some conventional methods that might struggle to adapt to the specific challenges posed by IoT settings, FedEAFO strategically integrates model compression and local training. This synergy fosters quicker convergence by optimizing learning updates and conserving communication resources simultaneously. (iv) Tailoring to IoT Constraints: Traditional optimization methods might not fully account for the unique constraints and intricacies of IoT environments. 
FedEAFO, on the other hand, is meticulously crafted to align with the limitations of IoT devices, such as constrained bandwidth, high latency, and energy considerations. Its ability to harmonize these factors with optimization objectives marks a substantial departure from generic optimization paradigms.
This research introduces a novel approach that addresses the unique training-inefficiency challenge faced by federated learning when applied to IoT environments, which arises from limitations including the limited bandwidth of devices, unreliable network connectivity, device heterogeneity, and unbalanced data. This research proposes model compression techniques tailored to IoT devices, which aim to reduce the size of the model updates that must be communicated between IoT devices and the central server. By compressing the model effectively, this research enables efficient utilization of the limited communication bandwidth of IoT devices. In addition, recognizing the high latency and limited energy resources of IoT devices, this research incorporates multiple local training, which enables IoT devices to perform multiple rounds of training on their local data before transmitting updates to the central server. This minimizes the need for frequent communication, addressing the challenge posed by high latency and unreliable connectivity. Furthermore, this research takes into account the inherent heterogeneity of IoT devices in terms of hardware capabilities, network conditions, and data distributions. By jointly considering model compression and multiple local training, the proposed algorithm adapts to diverse IoT systems and ensures compatibility with devices of varying capabilities. Lastly, IoT devices often exhibit a wide range of computational capabilities and data distributions. The proposed efficient adaptive optimization method addresses these heterogeneities by incorporating adaptive multiple local training: the proposed algorithm can adjust the local update frequencies of devices to mitigate the impact of device heterogeneity and data imbalance. This method allows each IoT device to perform several rounds of training using its available computational resources.
Devices with higher computational capacities can perform more local training rounds, contributing more effectively to the federated learning process. In addition, devices with smaller datasets might be allowed to perform more local updates before communicating with the central server, which enables devices with sparse data to catch up with those that have more data, reducing the disparities caused by data imbalance. The key contributions of FedEAFO are summarized as follows:
This paper investigates the federated learning problem with a practical formulation that minimizes the error of the global model in terms of the local update coefficient and the compression budget, which characterize trade-offs between communication and computation/model precision, respectively.
This paper proposes a novel efficient adaptive federated optimization algorithm built on a derived error upper bound involving the local update coefficient and the compression coefficient, which adaptively adjusts these two variables to improve the efficiency of federated learning.
In addition to a theoretical analysis of the proposed algorithm, we demonstrate the strong empirical performance of FedEAFO on two datasets compared with other state-of-the-art methods; FedEAFO achieves higher accuracy in less time.
The proposed efficient adaptive federated optimization algorithm is a promising approach that addresses the challenges of training efficiency, heterogeneity, non-IID data, and privacy concerns in a distributed machine learning environment. It offers several practical applications across various industries: (i) Autonomous Vehicles: Autonomous vehicles generate a wealth of data during operation. With efficient adaptive federated learning, vehicles can collectively learn from each other’s experiences to improve safety and navigation while respecting user privacy. (ii) Financial Services: Banks and financial institutions often face data privacy regulations, making data sharing difficult. With efficient adaptive federated learning, banks can collectively analyze customer behavior and detect fraudulent activities without directly sharing sensitive financial information. It enables improved risk assessment, personalized financial recommendations, and fraud detection while preserving customer privacy. (iii) Smart Grids: In the energy sector, smart grids consist of numerous energy-consuming and energy-producing devices. Federated learning can help optimize energy consumption, predict energy demand, and manage grid stability while maintaining data privacy and decentralized decision making. (iv) Manufacturing: In smart factories, different machines and sensors generate data with varying characteristics. Efficient adaptive federated learning enables predictive maintenance, process optimization, and quality control without the need for centralized data collection, improving manufacturing efficiency and reducing downtime.
3. Preliminaries and Problem Formulation
The system overview of the proposed algorithm is presented in Figure 1. Before introducing the details of the proposed algorithm, this paper first presents a mathematical analysis of how the coefficients of local update and parameter compression affect federated learning. Additionally, we theoretically formulate our problem and derive the error upper bound of federated learning, jointly considering the two variables of local update and parameter compression.
3.1. Federated Learning
We consider a federated learning system in resource-constrained IoT environments, consisting of a total of $N$ IoT devices. Each device performs model training on its local dataset. The devices aim to jointly solve the following optimization problem:

$$\min_{w} F(w) = \sum_{n=1}^{N} p_n F_n(w), \qquad (1)$$

where $w$ represents the global model weights, $p_n$ corresponds to the weight of the $n$-th device, and $F_n(\cdot)$ corresponds to the local objective of the $n$-th device. Equation (1) can be optimized by the iterative exchange of model parameters between devices and the central server. In particular, at the $t$-th round, the local model of the $n$-th device, $w_{t,i}^{n}$, can be updated as

$$w_{t,i+1}^{n} = w_{t,i}^{n} - \eta\, \nabla F_n\big(w_{t,i}^{n}; \xi_{t,i}^{n}\big), \qquad (2)$$

where $\xi_{t,i}^{n}$ corresponds to the data samples, $\eta$ represents the learning rate, and $i$ indexes the $i$-th local update.
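As a concrete illustration of the local update rule in Equation (2), the following sketch runs plain SGD on a toy quadratic local objective. The objective, data, and learning rate are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def local_sgd_step(w, grad_fn, batch, lr):
    """One local SGD step of Equation (2): w <- w - lr * grad(w; batch)."""
    return w - lr * grad_fn(w, batch)

# Toy local objective F_n(w) = mean over the batch of 0.5 * ||w - x||^2,
# whose gradient is w minus the batch mean; the minimizer is the mean itself.
def toy_grad(w, batch):
    return np.mean([w - x for x in batch], axis=0)

w = np.zeros(2)
batch = [np.array([1.0, 2.0]), np.array([3.0, 0.0])]
for _ in range(200):
    w = local_sgd_step(w, toy_grad, batch, lr=0.1)
# w approaches the batch mean [2.0, 1.0]
```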
In practical applications of the IoT, different IoT devices will be used and different kinds of data will be collected. FL is well-suited for handling different types of data. In IoT scenarios, various devices with diverse sensors and functionalities generate data, and these data can be highly heterogeneous. FL provides a distributed learning framework that allows these devices to collaboratively learn a shared model without sharing their raw data with a central server. Here’s how FL handles different types of data in IoT applications: (i) Adaptive Hyperparameter Tuning: Some FL algorithms incorporate adaptive optimization techniques that allow the model to adjust its hyperparameters based on the data characteristics of each device. This adaptivity helps in efficiently utilizing devices with diverse data distributions. (ii) Decentralized Data Processing: In FL, each IoT device retains control over its data locally. The devices preprocess and transform their data to ensure consistency and compatibility with the chosen model architecture. This decentralization allows the devices to handle their specific data types and formats independently. (iii) Model Personalization: FL supports model personalization, where the global model can be adapted to different devices’ data characteristics. When devices have unique data distributions, they can personalize the shared model using their local data. Model aggregation mechanisms, like Federated Averaging, ensure that the personalized models are combined to improve the overall performance.
As soon as the $n$-th device performs $\tau$ local updates based on Equation (2), the aggregated update of its local model, $G_t^{n}$, can be obtained by

$$G_t^{n} = \sum_{i=0}^{\tau - 1} \nabla F_n\big(w_{t,i}^{n}; \xi_{t,i}^{n}\big), \qquad (3)$$

where $\tau$ represents the number of consecutive local updates, which can be adjusted to balance the trade-off between computation and communication in FL. The devices then compress their locally aggregated updates $G_t^{n}$ to $\hat{G}_t^{n}$ with the sparsity budget denoted as $s$, which can be adjusted to balance the trade-off between communication and model precision in FL. The server aggregates all the compressed locally aggregated updates from the devices to obtain a compressed global update given by

$$\hat{G}_t = \sum_{n=1}^{N} p_n\, \hat{G}_t^{n}. \qquad (4)$$

The latest global model $w_{t+1}$ can be updated via the stochastic gradient descent algorithm as

$$w_{t+1} = w_t - \eta\, \hat{G}_t. \qquad (5)$$

The central server forwards the updated global model to the devices involved in federated learning for the next training round, and the local models of the devices are updated with the received latest global weights. This is the complete process of federated learning with joint consideration of both local update and parameter compression.
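The full round described above — local updates, compression of the aggregated updates, and server-side aggregation — can be sketched as follows. The top-k compressor and the quadratic per-device objectives are simplified, illustrative stand-ins for the schemes analyzed later, not the paper's exact operators.

```python
import numpy as np

def topk_sparsify(update, s):
    """Keep the s largest-magnitude entries; a simple stand-in for the
    sparsity-budget compressor analyzed in Section 3.3."""
    out = np.zeros_like(update)
    keep = np.argsort(np.abs(update))[-s:]
    out[keep] = update[keep]
    return out

def fl_round(w_global, device_batches, tau, lr, s):
    """One round: tau local SGD steps per device (Equation (2)), compression
    of each aggregated update, and server-side averaging (Equations (4)-(5))."""
    compressed = []
    for batch in device_batches:
        w = w_global.copy()
        for _ in range(tau):
            grad = np.mean([w - x for x in batch], axis=0)  # toy quadratic objective
            w = w - lr * grad
        compressed.append(topk_sparsify(w - w_global, s))   # compressed local update
    return w_global + np.mean(compressed, axis=0)

w = np.zeros(3)
batches = [[np.array([1.0, 0.0, 0.0])], [np.array([0.0, 2.0, 0.0])]]
for _ in range(50):
    w = fl_round(w, batches, tau=5, lr=0.2, s=2)
# w approaches the average of the two device optima, [0.5, 1.0, 0.0]
```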
For the analysis of the latency of the training process of our system, we consider a circular small-cell network in which devices are uniformly distributed within the coverage of central base stations. Let $d_n$ represent the distance between the base station and the $n$-th device, and let $h_n$ denote the path loss. Let $P_u$ and $P_b$ denote the transmit power of the device and the base station, respectively. Assume that the uplink channel adopts the orthogonal frequency division multiple access (OFDMA) technology. The devices compete for a resource block to upload local models under the rule of the Slotted ALOHA protocol in the uplink channel, and the base station occupies all the bandwidth to forward global models in downlink time slots. Let $B$ denote the total bandwidth of our system and $B_u$ denote the resource block bandwidth. Thus, the signal-to-noise ratios of the uplink and downlink channels can be denoted by $\gamma_u = \frac{P_u h_n}{B_u N_0}$ and $\gamma_d = \frac{P_b h_n}{B N_0}$, where $N_0$ is the spectral power density of the noise.

Based on the above definitions, the communication latency for the local model upload, $t_u^{n}$, and the global model download, $t_d^{n}$, of the $n$-th device can be given by

$$t_u^{n} = \frac{q}{B_u \log_2\!\big(1 + \gamma_u\big)}, \qquad t_d^{n} = \frac{q}{B \log_2\!\big(1 + \gamma_d\big)}, \qquad (6)$$

where $q$ denotes the total number of quantization bits for transmitting the model.

The computation latency for training and updating the local model of the $n$-th device can be calculated as

$$t_{cmp}^{n} = \frac{\tau\,\big(C_f + C_u\big)}{f_n}, \qquad (7)$$

where $\tau$ is the number of local updates, $C_f$ and $C_u$ respectively denote the number of floating-point operations required for training a data sample and updating the local model, and $f_n$ denotes the CPU capability of the $n$-th device.

Thus, the end-to-end delay of the FL training process at the $t$-th round can be given by

$$t_{e2e}^{t} = \max_{n}\big( t_u^{n} + t_d^{n} + t_{cmp}^{n} \big). \qquad (8)$$
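A back-of-the-envelope sketch of this latency model may help; the Shannon-rate form of the transmission time and all numeric values below are illustrative assumptions.

```python
import math

def comm_latency(bits, bandwidth_hz, snr_linear):
    """Equation (6)-style latency: send `bits` at rate B * log2(1 + SNR)."""
    return bits / (bandwidth_hz * math.log2(1.0 + snr_linear))

def comp_latency(tau, flops_train, flops_update, cpu_flops):
    """Equation (7)-style latency for tau local updates on one device."""
    return tau * (flops_train + flops_update) / cpu_flops

def round_delay(bits, b_up, snr_up, b_down, snr_down,
                tau, flops_train, flops_update, cpu_flops):
    """End-to-end per-round delay for one device: upload + download + computation."""
    return (comm_latency(bits, b_up, snr_up)
            + comm_latency(bits, b_down, snr_down)
            + comp_latency(tau, flops_train, flops_update, cpu_flops))

# Hypothetical device: 8 Mbit model, 1 MHz uplink block at 15 dB SNR,
# 10 MHz downlink at 20 dB, 5 local updates of 1 GFLOP on a 1 GFLOP/s CPU.
delay = round_delay(8e6, 1e6, 10**1.5, 10e6, 10**2.0,
                    tau=5, flops_train=9e8, flops_update=1e8, cpu_flops=1e9)
```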
3.2. Problem Formulation
Our goal is to jointly consider the two variables, the local update coefficient $\tau$ and the compression budget $s$, and adaptively adjust them to speed up federated learning. Theoretically, the problem is to find the optimal solution of $\tau$ and $s$ at different training rounds that minimizes the error of federated learning within a given time. The problem can be mathematically formulated as

$$\min_{\tau,\, s}\; F(w_T) \quad \text{s.t.} \quad \sum_{t} t_{e2e}^{t} \le T, \qquad (9)$$

where $F(\cdot)$ corresponds to the global learning objective defined in Equation (1), $t_{e2e}^{t}$ is the end-to-end delay of round $t$ from Equation (8), and $T$ represents the given time constraint. To solve the problem in Equation (9), the key is to derive an expression for the error of federated learning that describes the interdependence of $\tau$ and $s$. However, such an expression is almost impossible to obtain in closed form. Additionally, it is also hard to find an error upper bound that jointly considers $\tau$ and $s$ in FL.
3.3. Learning Error Upper Bound
To derive the error upper bound of federated learning that jointly considers the two variables of local update and parameter compression, this subsection first derives the error upper bound in terms of the local update coefficient $\tau$ alone. We then derive the error upper bound with joint consideration of the local update, described by $\tau$, and parameter compression, described by the compression budget $s$.
We first make the following Assumptions 1–3, inspired by [19], to present the analysis of the error upper bound without parameter compression.

Assumption 1. The global loss function $F$ is differentiable and $L$-smooth: $\|\nabla F(w) - \nabla F(w')\| \le L \|w - w'\|$, and there is a lower bound $F_{inf} \le F(w)$.

Assumption 2. The variance of the mini-batch stochastic gradient is bounded: $\mathbb{E}\|g(w) - \nabla F(w)\|^2 \le \beta \|\nabla F(w)\|^2 + \sigma^2$, where $\sigma^2$ corresponds to the variance between the stochastic gradient $g(w)$ and the full gradient $\nabla F(w)$, and $\beta$ and $\sigma^2$ are constants inversely proportional to the mini-batch size.

Assumption 3. The SGD is an unbiased estimator of full-batch gradient descent (FGD): $\mathbb{E}[g(w)] = \nabla F(w)$.
Theorem 1. Let Assumptions 1–3 hold, and choose a learning rate $\eta$ that satisfies the required step-size condition. Then the learning error after $n$ rounds within the given time $T$ in Equation (9) is bounded as in Equation (10).

Theorem 1 describes the trade-off between computation and communication for minimizing the learning error. Equation (10) shows that the local update coefficient $\tau$ appears in both the numerator and the denominator of the error upper bound, so the bound deteriorates when $\tau$ is either too small or too large, and it is necessary to strike a balance. Additionally, the trade-off needs to be adjusted dynamically over the rounds of federated learning, because the loss value, which appears in Equation (10), varies during training.
Apart from the local update, the parameter compression approach, which introduces compression into the locally aggregated weights, complicates the analysis of the error upper bound in two respects: (i) the communication time will be affected by the parameter compression coefficient $s$ [26,27]; (ii) the variance term in Equation (10) will depend on the parameter compression coefficient $s$ [28,29].
As mentioned before, the compressed parameters are approximated by basic components in parameter compression. Because of sparsity, some components are more significant than others for approximating the raw parameters. Thus, the problem is to select basic components in an unbiased manner so as to minimize the variance. The locally aggregated weights from the $n$-th device can be rewritten as

$$G_t^{n} = \sum_{k=1}^{K} \lambda_k a_k, \qquad (11)$$

where $K$ is the number of basic components, $a_k$ corresponds to the $k$-th basic component, and $\lambda_k$ is the corresponding weight. Our analysis is based on the fact that a matrix can be expressed as a combination of basic matrices, which is the atomic decomposition used for sparse representation in compressed sensing. Thus, our analysis can be extended to nearly all unbiased compression schemes. For example, TernGrad [21] and QSGD [22] are special cases of Equation (11), and sparsification algorithms such as ATOMO [25] also follow Equation (11). This formulation poses the problem of how to select the $\lambda_k$. To meet the requirement of unbiased selection, we propose the following estimator, inspired by [25]:

$$\hat{\lambda}_k = \frac{\lambda_k\, Z_k}{p_k}, \qquad Z_k \sim \mathrm{Bernoulli}(p_k), \qquad (12)$$

where $p_k$ corresponds to the probability characterizing the Bernoulli distribution and $Z_k$ obeys the Bernoulli distribution. We establish two important properties of this estimator via Lemmas 1 and 2.
Lemma 1. The variance of the estimator given by Equation (12) can be denoted as $\mathrm{Var}(\hat{\lambda}_k) = \lambda_k^2\, \frac{1 - p_k}{p_k}$.

Proof. The variance of the estimator in Equation (12) is defined as follows:

$$\mathrm{Var}(\hat{\lambda}_k) = \mathbb{E}[\hat{\lambda}_k^2] - \big(\mathbb{E}[\hat{\lambda}_k]\big)^2, \qquad (13)$$

where $\mathbb{E}[\hat{\lambda}_k^2]$ can be denoted as

$$\mathbb{E}[\hat{\lambda}_k^2] = \frac{\lambda_k^2}{p_k^2}\, \mathbb{E}[Z_k^2], \qquad (14)$$

and $\mathbb{E}[Z_k^2]$ can be obtained by

$$\mathbb{E}[Z_k^2] = p_k, \qquad (15)$$

and, in the same way, $\mathbb{E}[Z_k] = p_k$.

Thus, based on Equations (13)–(15) and $\mathbb{E}[\hat{\lambda}_k] = \lambda_k$, the variance of the estimator in Equation (12) can be obtained as follows: $\mathrm{Var}(\hat{\lambda}_k) = \frac{\lambda_k^2}{p_k} - \lambda_k^2 = \lambda_k^2\, \frac{1 - p_k}{p_k}$. □

Lemma 2. The estimator in Equation (12) is unbiased, which means $\mathbb{E}[\hat{\lambda}_k] = \lambda_k$.

Proof. This follows directly from the definition of expectation: $\mathbb{E}[\hat{\lambda}_k] = \frac{\lambda_k}{p_k}\,\mathbb{E}[Z_k] = \lambda_k$. □
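Lemmas 1 and 2 can be checked numerically. The sketch below implements the Bernoulli estimator of Equation (12) with illustrative coefficients and keep-probabilities; the empirical mean recovers the original coefficients (unbiasedness), and the per-coordinate variance matches $\lambda_k^2 (1 - p_k)/p_k$.

```python
import numpy as np

def bernoulli_estimator(lams, probs, rng):
    """Equation (12): keep lambda_k with probability p_k, rescaled by 1/p_k
    so that the expectation equals lambda_k."""
    z = rng.random(lams.shape) < probs   # Z_k ~ Bernoulli(p_k)
    return np.where(z, lams / probs, 0.0)

rng = np.random.default_rng(0)
lams = np.array([2.0, -1.0, 0.5])        # illustrative coefficients
probs = np.array([0.9, 0.5, 0.2])        # illustrative keep-probabilities
draws = np.stack([bernoulli_estimator(lams, probs, rng)
                  for _ in range(200_000)])
mean = draws.mean(axis=0)  # ~ lams, matching Lemma 2 (unbiasedness)
var = draws.var(axis=0)    # ~ lams**2 * (1 - probs) / probs, matching Lemma 1
```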
To minimize the variance, we formulate an optimization problem. The reason for minimizing the variance is that the compressed parameters are closer to the original parameters when the variance decreases. The problem of minimizing the variance can be given as

$$\min_{p_1, \dots, p_K}\; \sum_{k=1}^{K} \lambda_k^2\, \frac{1 - p_k}{p_k} \quad \text{s.t.} \quad \sum_{k=1}^{K} p_k = s,\;\; 0 < p_k \le 1. \qquad (16)$$

Before solving the optimization problem, we first provide the following Definition 1 of $s$-balancedness.

Definition 1. $\lambda_k$ is $s$-unbalanced if $s\,|\lambda_k| > \sum_{j=1}^{K} |\lambda_j|$. $\lambda$ is $s$-balanced if no element of $\lambda$ is $s$-unbalanced.

Thus, the theorem for the solution of the optimization problem can be given as follows.

Theorem 2. When $\lambda$ is $s$-balanced, the solution to the aforementioned problem can be obtained by

$$p_k = \frac{s\,|\lambda_k|}{\sum_{j=1}^{K} |\lambda_j|}. \qquad (17)$$

Proof. Theorem 2 can be proved by the Lagrangian multiplier method [30]. □
Lemma 3. The difference between the uncompressed parameters $G_t^{n}$ and the compressed parameters $\hat{G}_t^{n}$ can be given by

$$\mathbb{E}\big\|\hat{G}_t^{n} - G_t^{n}\big\|^2 = \frac{\|\lambda\|_1^2}{s} - \|\lambda\|_2^2, \qquad (18)$$

where $\|\lambda\|_1 = \sum_{k=1}^{K} |\lambda_k|$ and $\|\lambda\|_2^2 = \sum_{k=1}^{K} \lambda_k^2$.

Proof. Based on Lemma 1 and Theorem 2, $\mathbb{E}\|\hat{G}_t^{n} - G_t^{n}\|^2$ can be denoted by

$$\mathbb{E}\big\|\hat{G}_t^{n} - G_t^{n}\big\|^2 = \sum_{k=1}^{K} \lambda_k^2\, \frac{1 - p_k}{p_k} = \sum_{k=1}^{K} \lambda_k^2 \left( \frac{\|\lambda\|_1}{s\,|\lambda_k|} - 1 \right) = \frac{\|\lambda\|_1^2}{s} - \|\lambda\|_2^2, \qquad (19)$$

where $\|\lambda\|_1 = \sum_{k=1}^{K} |\lambda_k|$ and $\|\lambda\|_2^2 = \sum_{k=1}^{K} \lambda_k^2$. □
Having derived the probabilities and the variance of the estimator, we can assess the effects of compression on the variance and on the communication time. Assume that the model parameters to be uploaded are denoted as $G_t^{n}$. Rather than sending the original parameters, the devices can upload the compressed approximated parameters $\hat{G}_t^{n}$ in a sparse representation, which uploads only $s$ basic components in expectation. Thus, the communication time can be rewritten as $t_u' = \frac{s}{K}\, t_u$, where $t_u$ corresponds to the uncompressed communication time of each device. As for the variance, the theorem for the variance of the compressed locally aggregated parameters follows from Theorem 2 and is given below.
Theorem 3. The variance of the compressed locally aggregated parameters is bounded as

$$\mathbb{E}\big\|\hat{g}(w) - \nabla F(w)\big\|^2 \le \beta' \|\nabla F(w)\|^2 + \sigma'^2, \qquad (21)$$

where $\hat{g}(w)$ is the compressed stochastic gradient, $\beta' = \left(\frac{K}{s} - 1\right)(1 + \beta) + \beta$, $\sigma'^2 = \frac{K}{s}\sigma^2$, and $\beta$ and $\sigma^2$ are constants inversely proportional to the size of the mini-batch.

Proof. Since the compression is unbiased, the variance $\mathbb{E}\|\hat{g}(w) - \nabla F(w)\|^2$ can be decomposed as

$$\mathbb{E}\big\|\hat{g}(w) - \nabla F(w)\big\|^2 = \mathbb{E}\big\|\hat{g}(w) - g(w)\big\|^2 + \mathbb{E}\big\|g(w) - \nabla F(w)\big\|^2. \qquad (22)$$

For the first term in Equation (22), $\mathbb{E}\|\hat{g}(w) - g(w)\|^2$ can be obtained by

$$\mathbb{E}\big\|\hat{g}(w) - g(w)\big\|^2 = \frac{\|\lambda\|_1^2}{s} - \|\lambda\|_2^2 \le \left(\frac{K}{s} - 1\right)\mathbb{E}\big\|g(w)\big\|^2, \qquad (23)$$

where the equality is from Lemmas 1 and 3, and the inequality uses $\|\lambda\|_1^2 \le K\|\lambda\|_2^2$. For the second term in Equation (22), $\mathbb{E}\|g(w) - \nabla F(w)\|^2 \le \beta\|\nabla F(w)\|^2 + \sigma^2$, following [19].

Thus, based on Equations (22) and (23) and $\mathbb{E}\|g(w)\|^2 \le (1 + \beta)\|\nabla F(w)\|^2 + \sigma^2$, the variance can be described by

$$\mathbb{E}\big\|\hat{g}(w) - \nabla F(w)\big\|^2 \le \beta' \|\nabla F(w)\|^2 + \sigma'^2, \qquad (24)$$

where $\beta' = \left(\frac{K}{s} - 1\right)(1 + \beta) + \beta$ and $\sigma'^2 = \frac{K}{s}\sigma^2$. □
With the newly derived communication time and variance, Assumption 2 in Theorem 1 can be rewritten via Equation (24). To derive the error upper bound with parameter compression, we adapt Theorem 1 with the new variance and communication time, thereby accounting for the impact of parameter compression on communication time and on variance in the convergence of federated learning. The updated theorem is as follows:

Theorem 4. The error upper bound with parameter compression is given by Equation (25), where $E(\tau, s)$ corresponds to the error upper bound jointly considering the local update coefficient $\tau$ and the compression budget $s$. Proof. See the Appendix in [19], with the communication time replaced by the compressed communication time and the variance replaced by the bound of Theorem 3. □
The error upper bound in Theorem 4 indicates the dynamics of the trade-offs between communication and computation/precision, which are determined by the local update coefficient $\tau$ and the compression budget $s$. The first term of the bound decreases as $\tau$ increases, because $\tau$ appears in its denominator, and decreases as $s$ decreases, because $s$ appears in its numerator; judged by the first term alone, $\tau$ should be enlarged and $s$ reduced. However, the third term requires $\tau$ to remain small because $\tau$ appears in its numerator, and the second and third terms require $s$ to remain large because $s$ appears in their denominators. The above analysis indicates that neither a very small nor a very large $\tau$ is optimal, as the former results in unnecessary communication overheads, while the latter suffers from prolonged convergence due to large discrepancies among local models caused by less communication. Likewise, neither a very small nor a very large $s$ is optimal, because the former sends imprecise model parameters, prolonging convergence, whereas the latter results in heavy communication. Thus, this paper aims at finding the optimal balance that adjusts the trade-offs between communication and computation/precision.
4. The Proposed FedEAFO Algorithm
The above theoretical analyses illustrate that the error upper bound is governed by the local update coefficient $\tau$ and the compression budget $s$. This paper proposes an efficient adaptive federated optimization (FedEAFO) algorithm, which minimizes the learning error by jointly considering the two variables of local update and parameter compression. FedEAFO adaptively adjusts $\tau$ and $s$ and balances the trade-offs between communication and precision/computation. Figure 1 presents an overview of the FedEAFO scheme. Mathematically, FedEAFO finds the optimal balance minimizing the error upper bound $E(\tau, s)$ in Equation (25), which can be denoted as

$$\{\tau^*, s^*\} = \arg\min_{\tau,\, s}\; E(\tau, s). \qquad (26)$$
This paper presents Theorem 5 to prove the convexity of the error upper bound $E(\tau, s)$, followed by Theorem 6, which finds optimal solutions to the problem in Equation (26).

Theorem 5. Let $L$, $T$, and $E(\tau, s)$ be defined as above, and let Assumptions 4–7 hold; then $E(\tau, s)$ is convex.

Proof. For $E(\tau, s)$ to be convex, its Hessian matrix must be positive semidefinite. The Hessian matrix of $E(\tau, s)$ is derived as in Equation (27). Positive diagonal elements together with a positive determinant are sufficient for the $2 \times 2$ Hessian to be positive semidefinite, which proves the convexity of $E(\tau, s)$. The diagonal elements and the determinant are positive under Assumptions 4–7. □
Theorem 6. The optimal solutions to minimize the error upper bound can be obtained by setting its partial derivatives to zero. With Assumptions 8–10, the optimal local update coefficient $\tau^*$ and compression budget $s^*$ can be approximated by Equations (29) and (30).

Proof. This can be proven by adopting Assumptions 8–10 and setting the partial derivatives of the error upper bound to zero. □
In Equations (29) and (30), the values of $\tau^*$ and $s^*$ are interdependent; they can be decoupled by substituting Equation (29) into Equation (30), yielding Equation (31). With the initial values of the local update coefficient, the compression budget, and the loss, Equation (31) can be rewritten as Equation (32). Equation (32) illustrates that, as the loss value decreases during the training process of federated learning, the local update coefficient $\tau$ needs to decrease and the compression budget $s$ should increase.
Algorithm 1 describes the details of the training process of the FedEAFO algorithm. The full flow of Algorithm 1 can be described by the following steps: (i) The server broadcasts the calculated local update coefficient $\tau$, compression coefficient $s$, and the latest weights to the selected devices. The local update coefficient determines the number of parameter computations in local training, whereas the parameter compression coefficient determines the rate of parameter compression. (ii) The devices perform multiple rounds of local training based on the received local update coefficient $\tau$ and upload the compressed locally aggregated model parameters at the compression ratio determined by the compression coefficient $s$. (iii) The server aggregates all received compressed model parameters to update the global model. (iv) The server jointly optimizes and adjusts the two variables, the local update coefficient and the parameter compression coefficient, according to the latest value of the loss via Theorem 6. The latest local update coefficient, parameter compression coefficient, and weights are broadcast by the server to the selected devices for the next iteration.
Algorithm 1: Efficient Adaptive Federated Optimization
1: Server executes:
2: initialize the global model parameters
3: for round $t = 1$ to $T$ do
4:   The server jointly optimizes $\tau$ and $s$ via Equation (32)
5:   The server broadcasts $\tau$, $s$, and the global model to the selected devices
6:   for device $n = 1$ to $N$ do
7:     Devices receive $\tau$, $s$, and the global model
8:     for local training step $i = 1$ to $\tau$ do
9:       Devices compute the stochastic gradient
10:      Devices update the local model via Equation (2)
11:    end for
12:    Devices compute the locally aggregated update via Equation (3)
13:    Devices compress the update via Equations (12) and (17)
14:    Devices upload the compressed update to the server
15:   end for
16:   The server aggregates the compressed updates via Equation (4)
17:   The server updates the global model via Equation (5)
18:   The latest global model is broadcast to the devices
19: end for
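The server loop of Algorithm 1 can be sketched end to end. The adaptive rule below, which shrinks the local update coefficient and grows the sparsity budget as the loss falls, only mimics the qualitative behavior of Equation (32); the closed-form update, the toy objective, and all constants are illustrative assumptions rather than the paper's actual rule.

```python
import numpy as np

def adapt_coefficients(loss, loss0, tau0, s0, k_max):
    """Qualitative stand-in for Equation (32): as the loss falls, reduce
    the local update coefficient and raise the sparsity budget."""
    ratio = max(loss / loss0, 1e-12)
    tau = max(1, int(round(tau0 * np.sqrt(ratio))))
    s = min(k_max, max(1, int(round(s0 / np.sqrt(ratio)))))
    return tau, s

def topk(update, s):
    """Keep the s largest-magnitude entries of the update."""
    out = np.zeros_like(update)
    keep = np.argsort(np.abs(update))[-s:]
    out[keep] = update[keep]
    return out

# Toy setup: each device holds one target; the global loss is the mean
# squared distance to the targets, minimized at their mean [0.5, 1.0, 2.0].
targets = [np.array([1.0, 0.0, 2.0]), np.array([0.0, 2.0, 2.0])]
loss_fn = lambda w: float(np.mean([np.sum((w - t) ** 2) for t in targets]))

w = np.zeros(3)
tau0, s0, loss0 = 8, 2, loss_fn(w)
for t in range(60):
    tau, s = adapt_coefficients(loss_fn(w), loss0, tau0, s0, k_max=3)  # line 4
    updates = []
    for tgt in targets:                  # lines 6-15: local training
        wl = w.copy()
        for _ in range(tau):
            wl -= 0.1 * (wl - tgt)       # local SGD step, Equation (2)
        updates.append(topk(wl - w, s))  # compression, line 13
    w = w + np.mean(updates, axis=0)     # lines 16-17: aggregation and update
```

Under this schedule, the global model drifts toward the average of the device optima while later rounds use fewer local steps and less aggressive compression.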