Computation and Communication Efficient Adaptive Federated Optimization of Federated Learning for Internet of Things

Chen, Zunming; Cui, Hongyan; Wu, Ensen; Yu, Xi

doi:10.3390/electronics12163451

Open AccessArticle

Computation and Communication Efficient Adaptive Federated Optimization of Federated Learning for Internet of Things

¹

State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China

²

School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China

^*

Author to whom correspondence should be addressed.

Electronics 2023, 12(16), 3451; https://doi.org/10.3390/electronics12163451

Submission received: 21 July 2023 / Revised: 11 August 2023 / Accepted: 13 August 2023 / Published: 15 August 2023

(This article belongs to the Special Issue Recent Advances in Blockchain Technology and Distributed AI Applications)

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

The proliferation of the Internet of Things (IoT) and widespread use of devices with sensing, computing, and communication capabilities have motivated intelligent applications empowered by artificial intelligence. Classical artificial intelligence algorithms require centralized data collection and processing, which are challenging in realistic intelligent IoT applications due to growing data privacy concerns and distributed datasets. Federated Learning (FL) has emerged as a privacy-preserving distributed learning framework, which enables IoT devices to train global models through sharing model parameters. However, inefficiency due to frequent parameter transmissions significantly reduces FL performance. Existing acceleration algorithms consist of two main types including local update and parameter compression, which considers the trade-offs between communication and computation/precision, respectively. Jointly considering these two trade-offs and adaptively balancing their impacts on convergence have remained unresolved. To solve the problem, this paper proposes a novel efficient adaptive federated optimization (FedEAFO) algorithm to improve the efficiency of FL, which minimizes the learning error via jointly considering two variables including local update and parameter compression. The FedEAFO enables FL to adaptively adjust two variables and balance trade-offs among computation, communication, and precision. The experiment results illustrate that compared with state-of-the-art algorithms, the FedEAFO can achieve higher accuracies faster.

Keywords:

federated learning; distributed machine learning; communication efficiency; privacy protection

1. Introduction

The explosive growth in the amount of data from devices has witnessed the rapid development of the Internet of Things (IoT), which provides ubiquitous sensing, computing, and communication capabilities to connect things to the Internet [1]. To provide deep analysis for data from IoT devices, artificial intelligence (AI) algorithms have been adopted to enable intelligent IoT applications such as smart transportation and smart city [2]. Traditionally, AI algorithms that need the centralized collection and processing of data are deployed on a centralized cloud/edge server or data center for data mining. However, the offloading of massive amounts of IoT data to remote servers and the processing of data in remote servers induces significant delays. Furthermore, the third-party server also raises data privacy concerns [3]. In this context, integrating privacy-preservation and distributed AI into IoT becomes an important topic.

Recently, Federated Learning (FL) has emerged as a privacy-preserving distributed learning framework that enables intelligent IoT applications by allowing distributed IoT devices to collaboratively train machine learning models [4]. FL enables multiple devices to train a joint global model via Stochastic Gradient Descent (SGD) and shares local model parameters instead of raw data [5]. FL has seen recent successes in several applications. For example, Genetic Clustered Federated Learning (CFL) proposed in [6] has been applied to detect COVID-19 patients in a privacy-preserving manner. An FL-assisted cooperative perception task in [7] has been applied to vehicular networks where vehicles fuse sensor data in order to obtain an extended vision of the driving environment.

However, the existing federated learning methods suffer from inefficiency in model training performance when applied in resource-constrained IoT environments due to several reasons. (i) Unbalanced Data: Training data at IoT devices are different in size and distribution because of different sensing environments [8]. Thus, communications in the uplink and downlink in FL are highly sensitive subject to non-independent and identically distributed (IID) data. (ii) Limited Bandwidth: IoT devices often have limited communication bandwidth due to their constrained network capabilities. Transmitting large model updates in traditional federated learning setups can be inefficient and slow in such environments. (iii) High Latency: IoT devices might have high communication latency, making frequent model updates impractical. This can lead to delays in aggregating updates and hinder the training process. (iv) Unreliable Connectivity: IoT devices might experience intermittent or unreliable connectivity. This can disrupt the communication process and lead to incomplete or delayed updates. The convergence of FL for IoT cannot be guaranteed all the time due to the intermittent connections [9,10,11]. The situation of federated learning on mobile devices (e.g., sensors and Unmanned Aerial Vehicles (UAVs)) gets even worse, as they communicate via wireless channels and suffer from lower bandwidth, higher latency, and intermittent connections [12]. Thus, the inefficiency training performance problem becomes an important bottleneck for scaling up FL.

It is necessary to solve the inefficiency training performance problem of federated learning. Several works have been undertaken to speed up federated learning convergence via local update and parameter compression [13,14]. The approach of a local update aims to reduce the frequent transmissions of model parameters via making full use of the computing capability of devices. The local update algorithms characterize a trade-off between computation and communication via a proposed concept of a local update coefficient that determines the ratio of the local update to bandwidth [15]. The approach of parameter compression aims at reducing the amount of data to be transmitted through data compression techniques such as quantization and sparsification. The parameter compression algorithms characterize a trade-off between communication and model precision via the parameter compression coefficient (compression budget), which determines the ratio between bandwidth and accuracy [16]. These methods have been individually studied to improve the efficiency of federated learning. However, jointly considering these two trade-offs and adaptively balancing their impacts on the convergence of federated learning from both mathematical estimation and theoretical analysis perspectives have remained unresolved. This significant problem motivated our research. This paper proposes a novel efficient adaptive federated optimization (FedEAFO) algorithm to speed up the convergence of federated learning for IoT, which minimizes the learning error via jointly considering two methods including local update and parameter compression. The FedEAFO enables federated learning to adaptively adjust two variables and balance trade-offs among computation, communication, and accuracy.

The proposed efficient adaptive federated optimization algorithm stands apart from traditional optimization approaches in several significant ways. These distinctions underscore the algorithm’s advancements and its tailored suitability for the complexities of IoT environments. (i) Joint Communication and Computation Optimization: One of the key differentiators of the FedEAFO algorithm is its holistic approach toward optimization. Unlike traditional methods that often focus on either communication or computation aspects, FedEAFO takes a pioneering step by seamlessly integrating both model compression techniques and multiple local training strategies. This integrated optimization effectively addresses the intricate interplay between computation and communication efficiency within the federated learning framework. (ii) Adaptive Optimization: FedEAFO introduces an adaptive dimension that sets it apart from conventional techniques. This adaptability empowers the algorithm to dynamically and intelligently adjust its optimization parameters, facilitating a fine-tuned balance between various trade-offs. This adaptiveness is in stark contrast to traditional methods that often employ fixed or pre-determined parameters, making FedEAFO particularly responsive to the dynamic nature of IoT environments. (iii) Enhanced Convergence Speed: Another distinctive feature lies in FedEAFO’s aim to expedite the convergence speed of federated learning. Unlike some conventional methods that might struggle to adapt to the specific challenges posed by IoT settings, FedEAFO strategically integrates model compression and local training. This synergy fosters quicker convergence by optimizing learning updates and conserving communication resources simultaneously. (iv) Tailoring to IoT Constraints: Traditional optimization methods might not fully account for the unique constraints and intricacies of IoT environments. FedEAFO, on the other hand, is meticulously crafted to align with the limitations of IoT devices, such as constrained bandwidth, high latency, and energy considerations. Its ability to harmonize these factors with optimization objectives marks a substantial departure from generic optimization paradigms.

This research introduces a novel approach that addresses the unique inefficiency training performance challenge faced by federated learning when applied to IoT environments, which is raised by limitations including limited bandwidth of devices, unreliable network connectivity, device heterogeneity, and unbalanced data. This research proposes innovative model compression techniques tailored to IoT devices, which aim to reduce the size of the model updates that need to be communicated between IoT devices and the central server. By compressing the model effectively, this research enables efficient utilization of the limited communication bandwidth of IoT devices. In addition, recognizing the high latency and limited energy resources of IoT devices, this research suggests the incorporation of multiple local training methods, which enable IoT devices to perform multiple rounds of training on their local data before transmitting updates to the central server. This can minimize the need for frequent communication, addressing the challenge posed by high latency and unreliable connectivity. Furthermore, this research takes into account the inherent heterogeneity of IoT devices in terms of hardware capabilities, network conditions, and data distributions. By jointly considering model compression and multiple local training, the proposed algorithm adapts to the diverse IoT system and ensures compatibility with devices of varying capabilities. Lastly, IoT devices often exhibit a wide range of computational capabilities and data distribution. The proposed efficient adaptive optimization method addresses these heterogeneities by incorporating adaptive multiple local training. The proposed algorithm can adjust the local update frequencies of devices to mitigate the impact of device heterogeneity and data imbalance. This method allows each IoT device to perform several rounds of training using its available computational resources. Devices with higher computational capacities can perform more local training rounds, contributing more effectively to the federated learning process. In addition, devices with smaller datasets might be allowed to perform more local updates before communicating with the central server, which enables devices with sparse data to catch up with those that have more data, reducing the disparities caused by data imbalance. The key contributions of the FedEAFO are summarized as the following:

This paper investigates the federated learning problem with a practical formulation of minimizing the error of the global model in terms of local update coefficient and compression budget, which characterizes trade-offs between communication and computation/model precision, respectively.
This paper proposes a novel efficient adaptive federated optimization algorithm using a derived error upper bound considering two variables including a local update and compression coefficient, which adaptively adjusts these two variables to improve the efficiency of federated learning.
Besides theoretical analysis of the proposed algorithm, we demonstrate strong empirical performance on two datasets of FedEAFO compared with other state-of-the-art methods, which achieve higher accuracies faster.

The proposed efficient adaptive federated optimization algorithm is a promising approach that addresses the challenges of training efficiency, heterogeneity, non-IID data, and privacy concerns in a distributed machine learning environment. It offers several practical applications across various industries: (i) Autonomous Vehicles: Autonomous vehicles generate a wealth of data during operation. With efficient adaptive federated learning, vehicles can collectively learn from each other’s experiences to improve safety and navigation while respecting user privacy. (ii) Financial Services: Banks and financial institutions often face data privacy regulations, making data sharing difficult. With efficient adaptive federated learning, banks can collectively analyze customer behavior and detect fraudulent activities without directly sharing sensitive financial information. It enables improved risk assessment, personalized financial recommendations, and fraud detection while preserving customer privacy. (iii) Smart Grids: In the energy sector, smart grids consist of numerous energy-consuming and energy-producing devices. Federated learning can help optimize energy consumption, predict energy demand, and manage grid stability while maintaining data privacy and decentralized decision making. (iv) Manufacturing: In smart factories, different machines and sensors generate data with varying characteristics. Efficient adaptive federated learning enables predictive maintenance, process optimization, and quality control without the need for centralized data collection, improving manufacturing efficiency and reducing downtime.

2. Related Work

2.1. Local Update in Federated Learning

The approaches to improve the efficiency of federated learning and overcome communication bottlenecks can be categorized into local update and parameter compression. The approach of the local update aims at taking full advantage of the computing capability of devices to reduce the frequent transmissions of model parameters. For example, Nenghai et al. proposed the Asynchronous Stochastic Gradient Descent (ASGD) algorithm, which derives the bound of convergence of distributed gradient descent and only allows one step of the local update before the aggregation of the global model [17]. Virginia et al. proposed Fedprox, which puts forward the concept of a local model update coefficient determining the ratio of computation to communication, and it performs multiple local updates with a fixed local update coefficient [18]. Joshi et al. proposed ADACOMM, where an adaptive communication strategy is adopted to adjust the local update coefficient and dynamically adjust the trade-off between computation and communication to solve the problem of heterogeneous computing and communication capabilities of devices [19]. Kevin et.al proposed an AFD that adopts an adaptive strategy to determine the best local update coefficient under a given resource budget in order to speed up federated learning in non-IID settings [20].

2.2. Parameter Compression in Federated Learning

The approach of parameter compression adopts data compression algorithms such as sparsification and quantization to significantly reduce the amount of data to be transmitted. The parameter compression algorithm characterizes the trade-off between communication and model precision by a compression coefficient determining the ratio of communication to precision. The algorithms of parameter compression reduce communication overheads via uploading the quantized version or sparse representation of the model parameters. For instance, Wen et al. proposed Terngrad, which quantizes each parameter to 2 bits [21]. Alistarh et al. proposed the Quantification Stochastic Gradient Descent (QSGD) algorithm, which uses 2 bits and 4 bits to quantize various layers of model networks [22]. A 1-bit SGD that even quantizes the parameter to 1 bit was studied in [23]. The sparsification technique aims to sparsify parameters to send the significant parameters rather than all parameters. The sparsification techniques can be categorized into two categories according to the domain the sparsity is sought for, which include the raw domain and the transformed domain. For instance, Dally et al. proposed a parameter sparsification strategy to set unimportant elements to zero via a threshold, which defines values of unimportant elements as being between top 0.05% and bottom 0.05% [24]. Wright et al. proposed ATOMO, which aims at reducing the communication overheads via transforming parameters to a domain of Singular Value Decomposition (SVD) to exploit the low-dimensional structure to obtain more sparsity [25].

The summary of related works is listed in Table 1. The aforementioned methods have been individually studied to overcome communication bottlenecks and improve the efficiency of federated learning. It is expected that integrating the two approaches would be more effective in speeding up federated learning. However, integrating the two approaches and jointly considering the trade-offs between communication and computation/precision to adaptively balance and adjust their impacts on the convergence of federated learning have remained unresolved. Thus, there is an urgent necessity to design an efficient adaptive federated optimization algorithm via jointly balancing the trade-offs among communication, computation, and precision to supplement existing approaches.

3. Preliminaries and Problem Formulation

The system overview of the proposed algorithm is presented in Figure 1. Before introducing details of the proposed algorithm, this paper first presents a mathematical analysis on how coefficients of local update and parameter compression affect federated learning in order to further describe the impact of local update and parameter compression. Additionally, we theoretically formulate our problem and derive the error upper bound of federated learning, which jointly considers two variables including local update and parameter compression.

3.1. Federated Learning

We consider a federated learning system in resource-constrained IoT environments, which consists of a total of

N

IoT devices. Each device performs model training on its local dataset. The devices aim to jointly solve the following optimization problem.

w = \min_{w} {F (w) \Leftrightarrow \sum_{n = 1}^{N} p^{n} F^{n} (w)}

(1)

where

w

represents global model weights,

p^{n}

corresponds to weight of the n-th device, and

F^{n}

corresponds to local objective of the n-th device. Equation (1) can be optimized by iterative exchange of model parameters between devices and the central server. Particularly, at the t-th round, the local model of n-th device

w_{t, i}^{n}

can be denoted as

w_{t, i + 1}^{n} \leftarrow w_{t, i}^{n} - η_{t, i} \nabla F^{n} (w_{t, i}^{n}, φ_{t, i}^{n})

(2)

where

φ_{t, i}^{n}

corresponds to the data samples,

η_{t, i}

represents the learning rate, and

i

represents the i-th local updates.

In practical applications of the IoT, different IoT devices will be used and different kinds of data will be collected. FL is well-suited for handling different types of data. In IoT scenarios, various devices with diverse sensors and functionalities generate data, and these data can be highly heterogeneous. FL provides a distributed learning framework that allows these devices to collaboratively learn a shared model without sharing their raw data with a central server. Here’s how FL handles different types of data in IoT applications: (i) Adaptive Hyperparameter Tuning: Some FL algorithms incorporate adaptive optimization techniques that allow the model to adjust its hyperparameters based on the data characteristics of each device. This adaptivity helps in efficiently utilizing devices with diverse data distributions. (ii) Decentralized Data Processing: In FL, each IoT device retains control over its data locally. The devices preprocess and transform their data to ensure consistency and compatibility with the chosen model architecture. This decentralization allows the devices to handle their specific data types and formats independently. (iii) Model Personalization: FL supports model personalization, where the global model can be adapted to different devices’ data characteristics. When devices have unique data distributions, they can personalize the shared model using their local data. Model aggregation mechanisms, like Federated Averaging, ensure that the personalized models are combined to improve the overall performance.

As soon as the n-th device performs local updates based on Equation (2), the aggregated updates of local models

ℓ (w_{t}^{n})

can be obtained by

ℓ (w_{t}^{n}) \leftarrow \sum_{i = 1}^{I_{t}} ℓ (w_{t, i}^{n}, φ_{t, i}^{n})

(3)

where

ℓ (w_{t, i}^{n}, φ_{t, i}^{n}) = \nabla F^{n} (w_{t, i}^{n}, φ_{t, i}^{n})

.

I_{t}

represents the number of consecutive local updates, which can be adjusted to balance the trade-off between computation and communication of FL.

The devices then compress their locally aggregated updates

ℓ (w_{t}^{n})

to

\hat{ℓ} (w_{t}^{n})

with the sparsity budget denoted as

ε_{t}

, which can be adjusted to balance trade-offs between the communication and model precision of FL. The server aggregates all compressed locally aggregated updates from devices to obtain a compressed global model given by

\hat{ℓ} (w_{t}) = \frac{1}{N} \sum_{n = 1}^{N} \hat{ℓ} (w_{t}^{n})

(4)

The latest global model

w_{t + 1}

can be updated via the stochastic gradient descent algorithm as

w_{t + 1} \leftarrow w_{t} - η \hat{ℓ} (w_{t})

(5)

The central server forwards the updated global model

w_{t + 1}

to devices involved in federated learning for the next training round. The local model of devices can be updated via the received latest global weights. This is the complete process of federated learning with joint consideration of both local update and parameter compression.

For the analysis of the latency of the training process of our system, we consider a circular small-cell network in which devices are uniformly distributed within the coverage of central base stations. Let

d^{n}

represent the between the base station and the device

n

, and

L_{p a t h} (d)

denote the path loss. Let

P^{n}

and

P

denote the transmit power of the device

n

and the base station, respectively. Assume that the uplink channel adopts the orthogonal frequency division multiple access (OFDMA) technology. The devices compete for a resource block to upload local models under the rule of the Slotted ALOHA protocol in the uplink channel. And the base station occupies all the bandwidth to forward global models in downlink time slots. Let

B

denote the total bandwidth of our system and

B_{U}

denote the resource block bandwidth. Thus, the signal-to-noise ratio of the uplink and downlink channel can be denoted by

ρ_{U}^{n} = P^{n} - L_{path} (d^{n}) - N_{0} B_{U}

and

ρ_{D}^{n} = P - L_{path} (d^{n}) - N_{0} B

, where

N_{0}

is the spectral power density of the noise.

Based on the above definition, the communication latency

T_{U}^{n}

for the local model upload and the global model download

T_{D}^{n}

of the n-th device can be given by

T_{U}^{n} = \frac{H}{B_{U} \log_{2} (1 + ρ_{U}^{n})}, T_{D}^{n} = \frac{H}{B \log_{2} (1 + ρ_{D}^{n})}

(6)

where

H

denotes the total quantization bits for transmitting the model.

The computation latency for training and updating the local model of the n-th device can be calculated as

T_{C, t}^{n} = \frac{E C D^{n}}{f^{n}}, T_{L, t}^{n} = \frac{Q}{f^{n}}

(7)

where

E

is the number of local updates,

C

and

Q

respectively denote the number of floating-point operations required for training a data sample and updating the local model, and

f^{n}

denotes the CPU capability of the n-th device.

Thus, the end-to-end delay of the FL training process at the t-th round can be given by

T_{t} = \underset{T_{c o m m}}{\underset{︸}{\max {T_{U, t}^{n} + T_{D, t}^{n}}}} + \underset{T_{c o m p}}{\underset{︸}{\max {T_{C, t}^{n} + T_{L, t}^{n}}}}

(8)

3.2. Problem Formulation

Our goal is to jointly consider two variables including

I_{t}

and

ε_{t}

and adaptively adjust them to speed up federated learning. Theoretically, the problem is to find the optimal solution of

I_{t}

and

ε_{t}

at different training rounds, which minimizes the error of federated learning in a given time. The problem can be mathematically formulated as

\begin{array}{l} \min_{{I_{t}}, {ε_{t}}} E_{{φ_{t, i}^{n}}} [\min F (w_{t})] \\ s . t . \sum_{t = 1}^{U_{t}} (T_{c o m p} + T_{c o m m}) = T \end{array}

(9)

where

F (w_{t})

corresponds to the global learning objective defined in Equation (1).

T

represents the given time constraint. To solve the problem in Equation (9), the key is to derive expression of the error of federated learning describing the interdependence of

I_{t}

and

ε_{t}

. However, such expression is almost impossible to obtain. Additionally, it is also hard to find the error upper bound, which jointly considers

I_{t}

and

ε_{t}

in FL.

3.3. Learning Error Upper Bound

To derive the error upper bound of federated learning, which jointly considers two variables including local update and parameter compression, this subsection first derives error upper bound in terms of

I_{t}

. Additionally, we derive error upper bound with consideration of local update described by

I_{t}

and parameter compression described by

ε_{t}

.

We first make the following Assumptions 1–3 to present the analysis of error upper bound without parameter compression inspired by [19].

The global loss function is differentiable, and L-smooth: $| \nabla F (V) - \nabla F (W) | | \leq L | | V - W |$ and there is a lower bound $F_{\inf}$ .
The global weights variance in mini-batch is bounded by: $E_{{φ_{t, i}^{t}}} | | ℓ (w_{t}) - \nabla F (w_{t}) | |^{2}$ $\leq λ | | \nabla F (w_{n}) | |^{2} + δ$ . $δ$ corresponds to variance between $ℓ (w_{n})$ . $λ$ and $δ$ are constants inversely proportional to mini-batch size.
The SGD is the unbiased estimator of FGD: $E_{{φ_{t, i}^{n}}} [ℓ (w_{t})] = \nabla F (w_{t})$ .

Theorem 1.

Let Assumptions 1–3 hold. Choose the learning rate that satisfies

η L + η^{2} L^{2} I_{t} (I_{t} - 1) \leq 1

. Thus, the learning error after n rounds within given time T in Equation (9) is bounded by:

\frac{2 [F (w_{t}) - F_{\inf}]}{η T} (T_{c o m p} + \frac{T_{c o m m}}{I_{t}}) + \frac{η L δ}{N} + η^{2} L^{2} δ (I_{t} - 1)

(10)

Proof.

See Appendix in [19]. □

The above Theorem 1 describes the trade-off between computation and communication to minimize learning error. Equation (10) illustrates that the local update coefficient

I_{t}

is in the numerator and the denominator of expression of error upper bound, which means error upper bound will decrease, where either the values of

I_{t}

is too small or too large. Thus, it is necessary to achieve balance. Additionally, the trade-off needs to be dynamically adjusted over various rounds of federated learning due to the dynamically varying loss function

F (w_{k})

, which is in Equation (10).

Apart from the local update, the approach of parameter compression, which introduces compression into locally aggregated weights, will complicate the analysis of error upper bound in two aspects: (i)

T_{c o m m}

will be affected by the parameter compression coefficient

ε_{t}

[26,27]. (ii) The variance

δ

in Equation (10) will depend on the parameter compression coefficient

ε_{t}

[28,29].

As mentioned before, the compressed parameters are approximated by basic components in parameter compression. Because of sparsity, several components are more significant than others in the aspects of approximating the raw parameters. Thus, the problem is to select basic components unbiasedly to minimize the variance

δ

. The locally aggregated weights from the n-th device can be rewritten as:

ℓ (w_{t}^{n}) = \sum_{k = 1}^{K} d^{k} (w_{t}^{n}) α^{k} (w_{t}^{n})

(11)

where

K

is the number of basic components,

α^{k} (w_{n}^{t})

corresponds to the k-th basic component, and

d^{k} (w_{t}^{n})

is the corresponding weight. Our analysis is based on the fact that a matrix can be denoted as a combination of basic matrices, which is the atomic decomposition for sparse representation in compressed sensing. Thus, our analysis can be extended to nearly all unbiased compressions. For example, TernGrad [21] and QSGD [22] are special cases of Equation (8). Additionally, sparsification algorithms including ATOMO [25] also follow Equation (8). The formulation presents a problem on the selection of

d^{k} (w_{t}^{n})

. To meet the requirement of unbiased selection, we propose an estimator as follows, inspired by [25]:

\hat{ℓ} (w_{t}^{n}) = \sum_{k = 1}^{K} \frac{d^{k} (w_{t}^{n}) e^{k} (w_{t}^{n})}{p^{k} (w_{t}^{n})} α^{k} (w_{t}^{n})

(12)

where

p^{k} (w_{t}^{n})

corresponds to the probability characterizing the Bernoulli distribution,

p^{k} (w_{t}^{n}) \in (0, 1]

and

e^{k} (w_{t}^{n})

obeys the Bernoulli distribution. We provide two significant properties for the estimator via Lemmas 1 and 2.

Lemma 1.

The variance of the estimator given by Equation (12) can be denoted as:

E_{{e^{k} (w_{t}^{n})}} [{‖ \hat{ℓ} (w_{t}^{n}) - ℓ (w_{t}^{n}) ‖}^{2}] = \sum_{k = 1}^{K} d^{k} {(w_{t}^{n})}^{2} (\frac{1}{p^{k} (w_{t}^{n})} - 1)

.

Proof.

The variance of the estimator in Equation (9) is defined as follows:

\begin{array}{l} E_{{e^{k} (w_{t}^{n})}} [{‖ \hat{ℓ} (w_{t}^{n}) - ℓ (w_{t}^{n}) ‖}^{2}] \\ = E_{{e^{k} (w_{t}^{n})}} [(\sum_{k = 1}^{K} d^{k} (w_{t}^{n}) a^{k} (w_{t}^{n}) {(\frac{e^{k} (w_{t}^{n}) - p^{k} (w_{t}^{n})}{p^{k} (w_{t}^{n})}))}^{T} \\ \times (\sum_{k = 1}^{K} d^{k} (w_{t}^{n}) a^{k} (w_{t}^{n}) (\frac{e^{k} (w_{t}^{n}) - p^{k} (w_{t}^{n})}{p^{k} (w_{t}^{n})}))] \\ = \sum_{k = 1}^{K} d^{k} {(w_{t}^{n})}^{2} | | a^{k} (w_{t}^{n}) | |^{2} \times E_{{e^{k} (w_{t}^{n})}} [{(\frac{e^{k} (w_{t}^{n}) - p^{k} (w_{t}^{n})}{p^{k} (w_{t}^{n})})}^{2}] \\ + \sum_{x, y; x \neq y}^{k} d^{x} (w_{t}^{n}) d^{y} (w_{t}^{n}) 〈 a^{x} (w_{t}^{n}), a^{y} (w_{t}^{n}) 〉 \\ \times E_{{e^{k} (w_{t}^{n})}} [(\frac{e^{x} (w_{t}^{n}) - p^{x} (w_{t}^{n})}{p^{x} (w_{t}^{n})}) \times (\frac{e^{y} (w_{t}^{n}) - p^{y} (w_{t}^{n})}{p^{y} (w_{t}^{n})})] . \end{array}

(13)

where

E_{{e^{k} (w_{t}^{n})}} [{(\frac{e^{k} (w_{t}^{n}) - p^{k} (w_{t}^{n})}{p^{k} (w_{t}^{n})})}^{2}]

can be denoted as:

\begin{array}{l} E_{{e^{k} (w_{t}^{n})}} [{(\frac{e^{k} (w_{t}^{n}) - p^{k} (w_{t}^{n})}{p^{k} (w_{t}^{n})})}^{2}] \\ = (1 - p^{k} (w_{t}^{n})) \times {(\frac{0 - p^{k} (w_{t}^{n})}{p^{k} (w_{t}^{n})})}^{2} + p^{k} (w_{t}^{n}) \times {(\frac{1 - p^{k} (w_{t}^{n})}{p^{k} (w_{t}^{n})})}^{2} \\ = (\frac{1}{p^{k} (w_{t}^{n})} - 1) \end{array}

(14)

and

E_{{e^{k} (w_{t}^{n})}} [(\frac{e^{x} (w_{t}^{n}) - p^{x} (w_{t}^{n})}{p^{x} (w_{t}^{n})})]

can be obtained by:

\begin{array}{l} E_{{e^{k} (w_{t}^{n})}} [(\frac{e^{x} (w_{t}^{n}) - p^{x} (w_{t}^{n})}{p^{x} (w_{t}^{n})})] \\ = (1 - p^{x} (w_{t}^{n})) (\frac{0 - p^{x} (w_{t}^{n})}{p^{x} (w_{t}^{n})}) + p^{x} (w_{t}^{n}) (\frac{1 - p^{x} (w_{t}^{n})}{p^{x} (w_{t}^{n})}) \\ = 0 \end{array}

(15)

and in the same way,

E_{{e^{k} (w_{t}^{n})}} [(\frac{e^{y} (w_{t}^{n}) - p^{y} (w_{t}^{n})}{p^{y} (w_{t}^{n})})] = 0

.

Thus, based on Equations (13)–(15) and

| | a^{k} (w_{t}^{n}) | |^{2} = 1

, the variance of the estimator in Equation (12) can be obtained as follows:

E_{{e^{k} (w_{t}^{n})}} [{‖ \hat{ℓ} (w_{t}^{n}) - ℓ (w_{t}^{n}) ‖}^{2}] = \sum_{k = 1}^{K} d^{k} {(w_{t}^{n})}^{2} (\frac{1}{p^{k} (w_{t}^{n})} - 1)

(16)

□

Lemma 2.

The estimator in Equation (12) is unbiased, which means:

E_{{e^{k} (w_{t}^{n})}} [\hat{ℓ} (w_{t}^{n})] = ℓ (w_{t}^{n})

.

Proof.

This can be proved simply by the definition of the expectation. □

In order to minimize the variance, we formulate the optimization problem. The reason why we try to minimize the variance is that the compressed parameters are closer to the original parameters when the variance decreases. The problem of minimizing the variance can be given as:

\min \sum_{k = 1}^{K} \frac{d^{k} {(w_{t}^{n})}^{2}}{p^{k} (w_{t}^{n})} s . t . 0 < p^{k} (w_{t}^{n}) \leq 1 a n d \sum_{k = 1}^{K} p^{k} (w_{t}^{n}) = ε_{t}

(17)

Before solving the optimization problem, we first provide the following Definition 1 of

ε_{t}

-balancedness:

Definition 1.

ℓ (w_{t}^{n}) = \sum_{k = 1}^{K} d^{k} (w_{t}^{n}) α^{k} (w_{t}^{n})

is

ε_{t}

-unbalanced if

d^{k} (w_{t}^{n}) ε_{t} > | | d w_{t}^{n} | |_{1}

.

ε_{t}

-balanced corresponds to the case that no element of

ℓ (w_{t}^{n})

is

ε_{t}

-unbalanced.

Thus, the theorem for the solution of the optimization problem can be given by:

Theorem 2.

When

ℓ (w_{t}^{n})

is

ε_{t}

-balanced, the solution for the aforementioned problem can be obtained by:

p^{k} (w_{t}^{n}) = \frac{d^{k} (w_{t}^{n}) ε_{t}}{{‖ d (w_{t}^{n}) ‖}_{1}}

(18)

Proof.

Theorem 2 can be proved by the Lagrangian multiplier [30]. □

Lemma 3.

The difference between uncompressed parameters

ℓ (w_{t}^{n})

and compressed parameters

\hat{ℓ} (w_{t}^{n})

and can be given by:

E_{{e^{k} (w_{t}^{n})}} [{‖ \hat{ℓ} (w_{t}^{n}) - ℓ (w_{t}^{n}) ‖}^{2}] = \frac{δ_{1, n}^{t}}{ε_{t}} + δ_{2, n}^{t}

(19)

where

δ_{1, n}^{t} = \sum_{k = 1}^{K} (d^{k} (w_{t}^{n}) {‖ d (w_{t}^{n}) ‖}_{1})

and

δ_{2, n}^{t} = - \sum_{k = 1}^{K} d^{k} {(w_{t}^{n})}^{2}

.

Proof.

Based on Lemma 1 and Theorem 2,

E_{{e^{k} (w_{t}^{n})}} [{‖ \hat{ℓ} (w_{t}^{n}) - ℓ (w_{t}^{n}) ‖}^{2}]

can be denoted by:

\begin{array}{l} E_{{e^{k} (w_{t}^{n})}} [{‖ \hat{ℓ} (w_{t}^{n}) - ℓ (w_{t}^{n}) ‖}^{2}] \\ = \sum_{k = 1}^{K} d^{k} {(w_{t}^{n})}^{2} (\frac{1}{p^{k} (w_{t}^{n})} - 1) \\ = \frac{1}{ε_{t}} \sum_{k = 1}^{K} (d^{k} (w_{t}^{n}) {‖ d (w_{t}^{n}) ‖}_{1}) - \sum_{k = 1}^{K} d^{k} {(w_{t}^{n})}^{2} = \frac{δ_{1, n}^{t}}{ε_{t}} + δ_{2, n}^{t} \end{array}

(20)

where

δ_{1, n}^{t} = \sum_{k = 1}^{K} (d^{k} (w_{t}^{n}) {‖ d (w_{t}^{n}) ‖}_{1})

and

δ_{2, n}^{t} = - \sum_{k = 1}^{K} d^{k} {(w_{t}^{n})}^{2}

. □

After deriving the probability of the variance of the estimator, we are able to access the effects of compression on variance

δ

and communication time

T_{c o m m}

. Assuming that the model parameters to be uploaded are denoted as

ϒ

. Rather than sending original parameters, the devices can upload the compressed approximated parameters denoted as

\overset{\land}{ϒ}

with sparsity representation, which only uploads the

ε_{t}

number of basic components. Thus, the communication time

T_{c o m m}

can be rewritten as

T_{c o m m} = γ ε_{t}

, where

γ

corresponds to the communication time of each device. As to variance

δ

, the theorem for the variance of compressed locally aggregated parameters is based on Theorem 2, which can be given by:

Theorem 3.

The bound for the variance of compressed locally aggregated parameters is as follows:

E_{{φ_{t, i}^{n}}, {e^{k} (w_{t}^{n})}} [{‖ \hat{ℓ} (w_{t}) - \nabla F (w_{t}) ‖}^{2}] \leq λ {‖ \nabla F (w_{t}) ‖}^{2} + \frac{δ_{1}}{ε_{t}} + δ_{2}

(21)

where

λ

,

δ_{1}

, and

δ_{2}

are constants inversely proportional to the size of the mini-batch.

Proof.

The

E_{{φ_{t, i}^{n}}, {e^{k} (w_{t}^{n})}} [{‖ \hat{ℓ} (w_{t}) - \nabla F (w_{t}) ‖}^{2}]

can be described as:

\begin{array}{l} E_{{φ_{t, i}^{n}}, {e^{k} (w_{t}^{n})}} [{‖ \hat{ℓ} (w_{t}) - \nabla F (w_{t}) ‖}^{2}] \\ = E_{{φ_{t, i}^{n}}, {e^{k} (w_{t}^{n})}} [| | \hat{ℓ} (w_{t}) - ℓ (w_{t}) + ℓ (w_{t}) - \nabla F (w_{t}) | |^{2}] \\ \leq E_{{e^{k} (w_{t}^{n})}} [| | \hat{ℓ} (w_{t}) - ℓ (w_{t}) | |^{2}] + E_{{φ_{t, i}^{n}}} [{‖ ℓ (w_{t}) - \nabla F (w_{t}) ‖}^{2}] \end{array}

(22)

For the first term in Equation (22), the

E_{{e^{k} (w_{t}^{n})}} [| | \hat{ℓ} (w_{t}) - ℓ (w_{t}) | |^{2}]

can be obtained by:

\begin{array}{l} E_{{e^{k} (w_{t}^{n})}} [| | \hat{ℓ} (w_{t}) - ℓ (w_{t}) | |^{2}] \\ \leq \frac{1}{N} \sum_{n = 1}^{N} \sum_{k = 1}^{K} d^{k} {(w_{t}^{n})}^{2} (\frac{1}{p^{k} (w_{t}^{n})} - 1) = \frac{1}{N} (\sum_{n = 1}^{N} \frac{δ_{1, n}}{ε_{t}} + δ_{2, n}) \end{array}

(23)

where

E_{{e^{k} (w_{t}^{n})}} [{‖ \hat{ℓ} (w_{t}^{n}) - ℓ (w_{t}^{n}) ‖}^{2}] = \sum_{k = 1}^{K} d^{k} {(w_{t}^{n})}^{2} (\frac{1}{p^{k} (w_{t}^{n})} - 1)

is from Lemma 1, and

\sum_{k = 1}^{K} d^{k} {(w_{t}^{n})}^{2} (\frac{1}{p^{k} (w_{t}^{n})} - 1) = \frac{δ_{1, n}}{ε_{t}} + δ_{2, n}

is from Lemma 3.

For the second term in Equation (22),

E_{{φ_{t, i}^{n}}} [{‖ ℓ (w_{t}) - \nabla F (w_{t}) ‖}^{2}]

can be obtained by

λ {‖ \nabla F (w_{t}) ‖}^{2} + δ

following by [19].

Thus, based on Equations (22)–(23),

E_{{φ_{t, i}^{n}}, {e^{k} (w_{t}^{n})}} [{‖ \hat{ℓ} (w_{t}) - \nabla F (w_{t}) ‖}^{2}]

can be described by:

\begin{array}{l} E_{{e^{k} (w_{t}^{n})}} [| | \hat{ℓ} (w_{t}) - ℓ (w_{t}) | |^{2}] \\ \leq \frac{1}{N} \sum_{n = 1}^{N} \sum_{k = 1}^{K} d^{k} {(w_{t}^{n})}^{2} (\frac{1}{p^{k} (w_{t}^{n})} - 1) = \frac{1}{N} (\sum_{n = 1}^{N} \frac{δ_{1, n}}{ε_{t}} + δ_{2, n}) \end{array}

(24)

where

δ_{1} = \max_{k} (\frac{1}{N} \sum_{n = 1}^{N} δ_{1, n}^{k})

and

δ_{2} = \max_{k} (\frac{1}{N} \sum_{n = 1}^{N} δ_{2, j}^{k}) + δ

. □

With the derived new communication time and variance, Assumption 2 in Theorem 1 can be rewritten via Equation (25). To derive the error upper bound with parameter compression, we adapt Theorem 1 via new variance and communication time, which considers the impact of parameter compression on communication time and variance on the convergence of federated learning. The updated Theorem is as follows:

Theorem 4.

The bound error upper bound with parameter compression is as follows:

υ_{t} (I_{t}, ε_{t}) = \frac{2 [F (w_{t}) - F_{\inf}]}{η T} (T_{c o m p} + \frac{γ ε_{t}}{I_{t}}) + \frac{η L (\frac{δ_{1}}{ε_{t}} + δ_{2})}{N} + η^{2} L^{2} (\frac{δ_{1}}{ε_{t}} + δ_{2}) (I_{t} - 1)

(25)

where

υ_{t} (I_{t}, ε_{t})

corresponds to the error upper bound considering local update and parameter compression.

Proof.

See Appendix in [19], except for the new communication time

T_{c o m p}

replaced by

γ ε_{t}

and variance

δ

replaced by

\frac{δ_{1}}{ε_{t}} + δ_{2}

. □

υ_{t} (I_{t}, ε_{t})

indicates the dynamics of trade-offs between communication and computation/precision, which are determined by

I_{t}

and

ε_{t}

. The first term in

υ_{t} (I_{t}, ε_{t})

expression shows that the

υ_{t} (I_{t}, ε_{t})

decreases as

I_{t}

increases because

I_{t}

is in the denominator, and the

υ_{t} (I_{t}, ε_{t})

decreases as

ε_{t}

decreases due to

ε_{t}

is in the numerator. Thus, the

I_{t}

and

ε_{t}

need to be reduced based on the analysis of the first term in Equation (22). However, the third term of Equation (22) requires

I_{t}

to remain small because

I_{t}

is in the numerator. The second and the third terms of Equation (22) require to remain large because

ε_{t}

is in the denominator. The above analysis indicates that either

I_{t} = 1

or

I_{t} > > 1

is not an optimal choice as the former results in unnecessary communication overheads, while the latter suffers from a prolonged convergence due to large discrepancies among local models caused by less communication. Both

ε_{t} = 1

and

ε_{t} > > 1

are not optimal choices because the former sends imprecise model parameters causing the prolonged convergence, whereas the latter results in frequent communication. Thus, this paper aims at finding the optimal balance to adjust trade-offs between communication and computation/precision.

4. The Proposed FedEAFO Algorithm

The above theoretical analyses illustrate that the error upper bound is ruled by

I_{t}

and

ε_{t}

. This paper proposes an efficient adaptive federated optimization (FedEAFO) algorithm, which minimizes the learning error via jointly considering two variables including local update and parameter compression. The FedEAFO adaptively adjusts and balances trade-offs between communication and precision/computation. Figure 1 presents an overview of FedEAFO scheme. Mathematically, FedEAFO finds the optimal balance to minimize the error upper bound in Equation (25), which can be denoted as:

I_{t}^{*}, ε_{t}^{*} = \arg \min_{I_{t}, ε_{t}} υ_{t} (I_{t}, ε_{t})

(26)

This paper presents Theorem 5 to provide the theoretical analysis for the proof of convexity of

υ_{t} (I_{t}, ε_{t})

, followed by Theorem 6, which finds optimal solutions to solve the problem of Equation (26).

Theorem 5.

Let L, T, and

η

be defined therein. Choose Assumptions (4)

I_{t} \geq 2

, (5)

η^{5} \approx 0

, (6)

(L^{4} T δ_{1} / 2 α (F (w_{t}) - F_{\inf}) ε_{t}^{4}) < \infty

, (7)

2 η^{2} L T δ_{1} I_{t} \geq α N ε_{t}^{2} (F (w_{t}) - F_{\inf})

, thus the

υ_{t} (I_{t}, ε_{t})

is convex.

Proof.

Hessian matrix of

υ_{t} (I_{t}, ε_{t})

must be positive semidefinite is the condition of

υ_{t} (I_{t}, ε_{t})

to be convex. The Hessian matrix of

υ_{t} (I_{t}, ε_{t})

is derived as the following:

H (υ_{t} (I_{t}, ε_{t})) = [\begin{matrix} 2 X \frac{α ε_{t}}{I_{t}^{3}} & - X \frac{α}{I_{t}^{2}} - P \frac{δ_{1}}{ε_{t}^{2}} \\ - X \frac{α}{I_{t}^{2}} - P \frac{δ_{1}}{ε_{t}^{2}} & 2 Z \frac{δ_{1}}{ε_{t}^{3}} + 2 P (I_{t} - 1) \frac{δ_{1}}{ε_{t}^{3}} \end{matrix}]

(27)

where

X = \frac{2 [F (w_{0}) - F_{\inf}]}{η T}, Z = \frac{η L}{N}, and P = η^{2} L^{2}

. The positive diagonal elements and the determinant is the condition of the Hessian matrix to be positive semidefinite, which can provide proof of the convexity of

υ_{t} (I_{t}, ε_{t})

. Obviously, the diagonal elements and the determinant are positive based on Assumptions 4–7. □

Theorem 6.

The optimal solutions to minimize error upper bound

υ_{t} (I_{t}, ε_{t})

can be given by:

I_{t} = \sqrt{\frac{2 α [F (w_{t}) - F_{\inf}] ε_{t}^{2}}{η^{3} L^{2} (δ_{1} + δ_{2} ε_{t})}}, ε_{t} = \sqrt{\frac{δ_{1} η^{2} L T (1 - η L (I_{t} - 1)) I_{t}}{2 α [F (w_{t}) - F_{\inf}]}}

(28)

With Assumption 8 (

δ_{1} < < δ_{2} ε_{t}

), Assumption 9 (

η L (I_{t} - 1) < < 1

) and Assumption 10 (

F_{\inf} = 0

), the

I_{t}

and

ε_{t}

can be approximated by:

\frac{I_{t + 1}}{I_{t}} \approx \sqrt{\frac{F (w_{t + 1})}{F (w_{t})}} \sqrt{\frac{ε_{t + 1}}{ε_{t}}}

(29)

\frac{ε_{t + 1}}{ε_{t}} \approx \sqrt{\frac{F (w_{t})}{F (w_{t + 1})}} \sqrt{\frac{I_{t + 1}}{I_{t}}}

(30)

Proof.

This can be proven via adopting Assumptions 8–10 and setting the partial derivatives of

υ_{t} (I_{t}, ε_{t})

as 0. □

In Equations (29) and (30), the values of

I_{t}

and

ε_{t}

are interdependent, which can be decoupled by substituting Equation (29) in Equation (30) as follows:

\frac{I_{t + 1}}{I_{t}} = \sqrt[3]{\frac{F (w_{t + 1})}{F (w_{t})}}, \frac{ε_{t + 1}}{ε_{t}} = \sqrt[3]{\frac{F (w_{t})}{F (w_{t + 1})}}

(31)

with the initial values

I_{0}

,

ε_{0}

,

F (w_{0})

, the Equation (32) can be rewritten as:

I_{t} = \sqrt[3]{\frac{F (w_{t})}{F (w_{0})}} I_{0}, ε_{t} = \sqrt[3]{\frac{F (w_{0})}{F (w_{t})}} ε_{0}

(32)

Equation (32) illustrates that as the loss value

F (w_{t})

decreases during the training process of federated learning,

I_{t}

needs to decrease and

ε_{t}

should increase.

Algorithm 1 describes the details of the training process when performing the FedEAFO algorithm. The full flow of Algorithm 1 can be described as the following steps: (i) The server broadcasts the calculated local update coefficient

I_{t}

, compression coefficient

ε_{t}

, and latest weights to the selected devices. The coefficient of local update determines the number of parameter computations in local training, whereas the coefficient of parameter compression determines the rate of parameter compression. (ii) The devices perform multiple local training based on the received local update coefficient

I_{t}

and upload the compressed locally aggregated model parameters at the certain ratio of compression determined by the compression coefficient

ε_{t}

. (iii) The server aggregates all received compressed model parameters to update the global model. (iv) The server jointly optimizes and adjusts the two variables including the local update coefficient and parameter compression coefficient according to the latest value of loss via Theorem 6. The latest local update coefficient, parameter compression coefficient, and latest weights are broadcast by the server to the selected devices for the next iteration.

Algorithm 1: Efficient Adaptive Federated Optimization

1: Server Executes:
2: the initialization of the global model parameter:

w_{0}

3: for round

t = 1

to

I

do
4: The server jointly optimizes

I_{t}

and

ε_{t}

via Equation (32)
5: The server broadcasts

I_{t}

and

ε_{t}

to selected devices
6: for device

n = 1

to

N

do
7: Devices receive

I_{t}

and

ε_{t}

8: for local training

i = 1

to I_{t}

do
9: Devices compute

ℓ (w_{t, i}^{n}, φ_{t, i}^{n})

10: Devices update

w_{t, i}^{n}

via Equation (2)
11: end for
12: Devices compute

ℓ (w_{t, i}^{n})

via Equation (3)
13: Devices compress

ℓ (w_{t, i}^{n})

via Equations (12) and (17)
14: Devices upload compressed

\hat{ℓ} (w_{t})

to the server
15: end for
16: The server aggregates compressed updates via Equation (4)
17: The server updates the global model via Equation (5)
18: The latest global model

w_{t + 1}

are broadcast to the devices
19: end for

5. Experimental Evaluation

This section performs experiments to evaluate the performance of FedEAFO in speeding up federated learning and compares it with state-of-the-art algorithms on Fashion-MNIST and CIFAR-10 image classification tasks. In our experiment, this paper implements the algorithm via a centos7 server with Nvidia TITAN RTX GPU.

5.1. Experiment Setup

Training Datasets: Following [31,32], this paper constructs both IID and Non-IID datasets from Fashion-MNIST and CIFAR-10. The Fashon-MNIST dataset consists of 10 categories of images of fashion items including 70,000 28 × 28 images. The CIFAR-10 dataset is comprised of 60,000 examples with 32 × 32 color pixels in 10 classes such as airplanes, birds, and racing cars.

Implementation settings: For settings of training, this paper initializes the iteration time as 200, local epoch as 10, batch size as 32, and learning rate as 0.01. The proposed design is examined on image classification tasks on Fashion-MNIST and CIFAR-10 datasets. For model architectures, this paper adopts the Convolutional Neural Network (CNN) with two convolutional layers with 32 and 64 channels, respectively, and the fully connected and softmax output layer on Fashion-MNIST, and a light-weight ResNet18 on CIFAR-10.

Comparison baselines: This paper compares the proposed FedEAFO with two categories of the state-of-the-art methods: (i) Local update that aims at taking full advantage of the computing capability of devices to reduce the frequent transmissions of model parameters. This paper considers ADACOMM [19], which dynamically adjusts local update coefficients with adaptive strategies as baselines. (ii) Parameter compression that reduces the amount of data to be transmitted via data compression techniques. We adopt ATOMO [25], which exploits the sparsity of parameters to compress the parameters for comparisons.

Evaluation metrics: This paper considers different dimensions to analyze the experiment results accurately: (1) Round completion time (efficiency metric). This paper defines the round completion time as the time needed to be spent in one round of federated learning. The reason why we adopt round completion time instead of iteration/communication round is that unlike traditional machine learning where the time of different iterations are nearly the same; thus, the iteration round reflects the convergence speed and the iteration time of different rounds are different. The key factors for the convergence speed of FL are the round completion time and iteration round. The round completion time varies over the round due to three time-varying delay components: (i) The download delays of global model parameters from the devices with different downlink speeds. (ii) The transmission delays of compressed local model parameters from the devices with different compression ratios and uplink speeds. (iii) The Computation delays of local model update from the devices with different local update coefficients and hardware constraints. (2) Global Learning accuracy (utility metric). This paper considers global learning accuracy as the average accuracy of the global model to evaluate the effectiveness of the FedEAFO in accelerating the FL. (3) Traffic Consumption (communication efficiency metric). Communication efficiency quantifies the amount of data exchanged between IoT devices and the central server during the federated learning process. In the context of IoT, where communication resources are often limited, this metric becomes crucial. A more communication-efficient algorithm reduces the amount of data transmitted, leading to lower bandwidth consumption and energy expenditure for devices. It is essential to assess how the adaptive optimization technique impacts communication efficiency while maintaining or improving other performance aspects.

5.2. Results and Analysis

This paper compares the proposed FedEAFO with two categories of state-of-the-art algorithms including ADACOMM [19] of the local updates category and ATOMO [25] of the parameter compression category. The comparative experiment results in Figure 2 present the values of accuracy, loss, local update coefficient

I_{t}

, and parameter compression coefficient

ε_{t}

over round completion time for 32 devices on the Fashion-MNIST dataset. Figure 2a illustrates the superiority of the proposed FedEAFO in terms of efficiency and high accuracy compared with ATOMO and ADACOMM, which achieves higher accuracies faster over round completion time. Figure 2b illustrates the training loss of the algorithms. Figure 2c shows the values of

I_{t}

over training time, where the local update of ATOMO is assigned as one because it does not perform multiple local updates. Figure 2d shows the values of

ε_{t}

except for ADACOMM because it does not adopt parameter compression, whereas ATAMO adopts parameter compression with a fixed value of

ε_{t}

. The reason why the proposed EAFO is more efficient is that the FedEAFO dynamically adjusts the values of

I_{t}

and

ε_{t}

during federated learning. For

I_{t}

, the proposed FedEAFO begins with high values to enable more local updates to be performed, but ends with lower values and high accuracy. For

ε_{t}

, the proposed FedEAFO begins with low values due to coarse parameters and it can still contribute to improving the accuracy of the model effectively with low accuracy. The proposed FedEAFO ends with high values providing finer-grained parameters and high accuracy in Figure 2d. The communication/computation trade-off and communication/precision trade-off can be viewed as two exploration–exploitation trade-offs. Both adjustments of the local update coefficient and the parameter compression coefficient can be used to resolve the corresponding exploration–exploitation trade-off. The proposed FedEAFO starts with high exploration via large

I_{t}

and small

ε_{t}

, and FedEAFO ends with pure exploitation via small

I_{t}

and large

ε_{t}

, which means the steady state is reached. The data rates of both uplink and downlink are set to 100 Kbps. The dataset is IID. Note that

I_{t} \in {1, \dots, 40}

and

ε_{t} \in [4, 8]

.

The second simulation presents the comparison of the performance of our proposed FedEAFO algorithm with the fixed local update coefficient schemes and the fixed parameter compression coefficient schemes. Figure 3a illustrates the training accuracy over time under different values of the local updates coefficient

I_{t}

set to 10, 20, and 30. Figure 3b illustrates the training accuracy over time under different values of parameter compression coefficient

ε_{t}

set to 4, 6, and 8. As shown in Figure 3, our proposed FedEAFO algorithm converges faster and smoother than the fixed local update and parameter compression schemes, which verifies the effectiveness of the proposed algorithm.

This paper evaluates performance with different uplink/downlink data rates of the proposed FedEAFO compared with ADACOMM and ATOMO. The results in Figure 4a–c present the accuracy for 10/10 Mbps, 10/100 Kbps, and 100/10 Kbps uplink/downlink data rates, respectively. This paper considers high uplink/downlink data rates as 10 Mbps in Figure 4a, where parameter compression is less important than computation with relatively high data rates. In this case, computation has become a bigger bottleneck than communication for scaling up federated learning. Figure 4a presents that the ADACOMM outperforms the ATOMO, which adopts parameter compression in 10/10 Mbps data rates due to multiple local updates, and the proposed FedEAFO outperforms the ADACOMM because of benefits from the parameter compression. In Figure 4b considering the 10/100 Kbps data rates where the uplink communication has become the bottleneck, the ATOMO wins the race with ADACOMM because it compresses the parameters in the uplink. Additionally, the proposed FedEAFO outperforms the ATOMO due to the benefit from multiple local updates. On the contrary, in Figure 4c, considering the 100/10 Kbps data rates, the downlink communication has become the bottleneck. Thus, the performance of ATOMO is similar to the performance of ADACOMM despite parameter compression in the uplink.

Figure 5 shows the network traffic consumption of different algorithms when they achieve different target accuracy. We can find that the network traffic consumption of all algorithms for all datasets increases with the increasing accuracy. However, FedEAFO can always consume the minimum network traffic, which demonstrates the communication efficiency of FedEAFO. In addition, the algorithms with model compression can save much more network traffic than the algorithms without model compression.

This paper further evaluates the performance of the FedEAFO and the baselines on non-IID data. Concretely, we create non-IID data of the Fashon-MNIST dataset with different partition schemes. Each device has p of a unique class in 10 classes and other devices shared the remaining samples uniformly. Figure 6 shows the impacts of non-IID levels on the test accuracy of different algorithms, where the value of p is set to 0.1, 0.2, 0.4, 0.6, and 0.8. The result indicates that all algorithms suffer from a loss of accuracy with the increase in the non-IID level, while FedEAFO is more robust than other baselines.

6. Conclusions

FL has received considerable attention in IoT because of its capability of enabling devices to train machine learning models collaboratively via sharing local model parameters rather than privacy-sensitive data. However, FL-IoT implementation suffers from inefficiency and limited convergence performance in resource-constrained IoT environments. We propose a novel efficient adaptive federated optimization algorithm to improve the efficiency of FL training. The proposed FedEAFO minimizes the learning error by the joint consideration of two variables consisting of the coefficient of local update and parameter compression. Additionally, the proposed FedEAFO enables federated learning to adaptively and dynamically adjust the trade-offs among communication, computation, and model precision. The evaluation results demonstrate the superiority of the FedEAFO in terms of efficiency and accuracy. In future work, we will study interactions of FedEAFO with message-passing protocols that enable distributed optimization in unstable environments.

Author Contributions

Conceptualization, Z.C. and E.W.; methodology, Z.C. and E.W.; software, Z.C. and X.Y.; validation, Z.C. and H.C.; formal analysis, Z.C.; investigation, Z.C. and X.Y.; resources, H.C.; data curation, Z.C. and X.Y.; writing—original draft preparation, Z.C.; writing—review and editing, E.W. and H.C.; visualization, Z.C. and X.Y.; supervision, H.C.; project administration, Z.C. and H.C.; funding acquisition, H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant No. 62171049 and the National Key Research and Development Program of China under Grant No. 2020YFB1807805.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

Haddad Pajouh, H.; Dehghantanha, A.; Parizi, R.M.; Aledhari, M.; Karimipour, H. A survey on internet of things security: Requirements, challenges, and solutions. IEEE Internet Things J. 2021, 14, 100129. [Google Scholar]
Liu, Y.; Wang, J.; Li, J.; Niu, S.; Song, H. Machine Learning for the Detection and Identification of Internet of Things Devices: A Survey. IEEE Internet Things J. 2021, 9, 298–320. [Google Scholar]
Ghosh, A.M.; Grolinger, K. Edge-cloud computing for Internet of Things data analytics: Embedding intelligence in the edge with deep learning. IEEE Trans. Ind. Inform. 2021, 17, 2191–2200. [Google Scholar]
Khan, L.U.; Saad, W.; Han, Z.; Hossain, E.; Hong, C.S. Federated Learning for Internet of Things: Recent Advances, Taxonomy, and Open Challenges. IEEE Commun. Surv. Tutor. 2021, 23, 1759–1799. [Google Scholar]
Zhang, T.; Gao, L.; He, C.; Zhang, M.; Krishnamachari, B.; Avestimehr, A.S. Federated Learning for the Internet of Things: Applications, Challenges, and Opportunities. IEEE Internet Things Mag. 2022, 5, 24–29. [Google Scholar]
Kandati; Reddy, D.; Gadekallu, T.R. Genetic Clustered Federated Learning for COVID-19 Detection. Electronics 2022, 11, 2714. [Google Scholar]
Posner, J.; Tseng, L.; Aloqaily, M.; Jararweh, Y. Federated Learning in Vehicular Networks: Opportunities and Solutions. IEEE Netw. 2021, 35, 152–159. [Google Scholar]
Mills, J.; Hu, J.; Min, G. Client-side optimization strategies for communication-efficient federated learning. IEEE Commun. Mag. 2022, 60, 60–66. [Google Scholar]
Nguyen, C.; Ding, M.; Pathirana, P.N.; Seneviratne, A.; Li, J.; Vincent Poor, H. Federated Learning for Internet of Things: A Comprehensive Survey. IEEE Commun. Surv. Tutor. 2021, 23, 1622–1658. [Google Scholar]
Sun, H.; Li, S.; Yu, F.R.; Qi, Q.; Wang, J.; Liao, J. Toward Communication-Efficient Federated Learning in the Internet of Things with Edge Computing. IEEE Internet Things J. 2020, 7, 11053–11067. [Google Scholar]
Ullo, S.L.; Sinha, G.R. Advances in IoT and smart sensors for remote sensing and agriculture applications. Remote Sens. 2021, 13, 2585. [Google Scholar]
Qin, Z.; Li, G.Y.; Ye, H. Federated learning and wireless communications. IEEE Wirel. Commun. 2021, 28, 134–140. [Google Scholar]
Nguyen, H.T.; Sehwag, V.; Hosseinalipour, S.; Brinton, C.G.; Chiang, M.; Vincent Poor, H. Fast-Convergent Federated Learning. IEEE J. Sel. Areas Commun. 2021, 39, 201–218. [Google Scholar]
Shah, S.M.; Lau, V.K.N. Model Compression for Communication Efficient Federated Learning. IEEE Trans. Neural Netw. Learn. Syst. 2021, 3, 1–15. [Google Scholar]
Reisizadeh, A.; Mokhtari, A.; Hassani, H.; Jadbabaie, A.; Pedarsani, R. Fedpaq: A Communication-Efficient Federated Learning Method with Periodic Averaging and Quantization. In Proceedings of the International Conference on Artificial Intelligence and Statistics (PMLR), Virtual Conference, 26–28 August 2020; pp. 2021–2031. [Google Scholar]
Mitra, A.; Jaafar, R.; Pappas, G.J.; Hassani, H. Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients. Adv. Neural Inf. Process. Syst. 2021, 34, 14606–14619. [Google Scholar]
Zheng, S.; Meng, Q.; Wang, T.; Chen, W.; Yu, N.; Ma, Z.M.; Liu, T.Y. Asynchronous Stochastic Gradient Descent with Delay Compensation. In Proceedings of the International Conference on Artificial Intelligence and Statistics (PMLR), Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 4120–4129. [Google Scholar]
Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450. [Google Scholar]
Wang, J.; Joshi, G. Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD. Proc. Mach. Learn. Syst. 2019, 1, 212–229. [Google Scholar]
Wang, S.; Tuor, T.; Salonidis, T.; Leung, K.K.; Makaya, C.; He, T.; Chan, K. Adaptive Federated Learning in Resource Constrained Edge Computing Systems. IEEE J. Sel. Areas Commun. 2019, 37, 1205–1221. [Google Scholar]
Wen, W.; Xu, C.; Yan, F.; Wu, C.; Wang, Y.; Chen, Y.; Li, H. Terngrad: Ternary gradients to reduce communication in distributed deep learning. Adv. Neural Inf. Process. Syst. 2017, 4, 30–43. [Google Scholar]
Alistarh, D.; Grubic, D.; Li, J.; Tomioka, R.; Vojnovic, M. QSGD: Communication-efficient SGD via gradient quantization and encoding. Adv. Neural Inf. Process. Syst. 2017, 30, 47–60. [Google Scholar]
Seide, F.; Fu, H.; Droppo, J.; Li, G.; Yu, D. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Proceedings of the Fifteenth Annual Conference of the International Speech Communication Association, Singapore, 14–18 September 2014; pp. 20–27. [Google Scholar]
Lin, Y.; Han, S.; Mao, H.; Wang, Y.; Dally, B. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018; pp. 40–48. [Google Scholar]
Wang, H.; Sievert, S.; Liu, S.; Charles, Z.; Papailiopoulos, D.; Wright, S. Atomo: Communication-efficient learning via atomic sparsification. Adv. Neural Inf. Process. Syst. 2018, 31, 10–24. [Google Scholar]
Xu, J.; Du, W.; Jin, Y.; He, W.; Cheng, R. Ternary Compression for Communication-Efficient Federated Learning. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 1162–1176. [Google Scholar] [CrossRef] [PubMed]
Haijian, S.; Ma, X.; Hu, R.Q. Adaptive federated learning with gradient compression in uplink NOMA. IEEE Trans. Veh. Technol. 2020, 69, 16325–16329. [Google Scholar]
Zha, Z.; Yuan, X.; Wen, B.; Zhang, J.; Zhou, J.; Zhu, C. Image Restoration Using Joint Patch-Group-Based Sparse Representation. IEEE Trans. Image Process. 2020, 29, 7735–7750. [Google Scholar]
Sattler, F.; Müller, K.R.; Samek, W. Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 3710–3722. [Google Scholar]
Ananduta, W.; Nedich, A.; Ocampo-Martinez, C. Distributed Augmented Lagrangian Method for Link-Based Resource Sharing Problems of Multi-Agent Systems. IEEE Trans. Autom. Control. 2021, 8, 10–23. [Google Scholar]
Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-mnist: A novel image dataset for benchmarking machine learning algorithms. arXiv 2017, arXiv:1708.07747. [Google Scholar]
Zheng, Y.; Lai, S.; Liu, Y.; Yuan, X.; Yi, X.; Wang, C. Aggregation service for federated learning: An efficient, secure, and more resilient realization. IEEE Trans. Dependable Secur. Comput. 2022, 20, 988–1001. [Google Scholar]

Figure 1. The system overview of FedEAFO scheme.

Figure 2. The values of training accuracy, training loss,

I_{t}

, and

ε_{t}

over round completion time.

Figure 2. The values of training accuracy, training loss,

I_{t}

, and

ε_{t}

over round completion time.

Figure 3. The values of training accuracy with different values of local updates coefficient

I_{t}

and parameter compression coefficient

ε_{t}

over round completion time.

Figure 3. The values of training accuracy with different values of local updates coefficient

I_{t}

and parameter compression coefficient

ε_{t}

over round completion time.

Figure 4. The values of accuracy with different uplink/downlink data rates over time.

Figure 5. The network traffic consumption of algorithms when achieving the target accuracy.

Figure 6. The values of accuracy of the algorithms on non-IID data.

Table 1. Summary of the approaches of efficient federated learning.

Ref.	Challenge	Technique	Key Idea
[18]	Communication frequency reduction	Local update	Enable devices to perform multiple local updates with a fixed local update coefficient
[19]	Communication frequency reduction	Local update	An adaptive strategy is adopted to adjust local update coefficient dynamically
[20]	Communication frequency reduction	Local update	Similar to [19], but with convergence guarantees for non-IID setting
[21]	Model updates size reduction	Compression (quantization)	Compress the local model updates to a finite number of bits to reduce the amount of data transmitted between server and devices
[24]	Model updates size reduction	Compression (sparsification)	A model updates sparsification strategy to set unimportant elements to zero
[25]	Model updates size reduction	Compression (sparsification)	Transform model updates into Singular Value Decomposition (SVD) domain and sparsify model updates into low-dimensional structure

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Chen, Z.; Cui, H.; Wu, E.; Yu, X. Computation and Communication Efficient Adaptive Federated Optimization of Federated Learning for Internet of Things. Electronics 2023, 12, 3451. https://doi.org/10.3390/electronics12163451

AMA Style

Chen Z, Cui H, Wu E, Yu X. Computation and Communication Efficient Adaptive Federated Optimization of Federated Learning for Internet of Things. Electronics. 2023; 12(16):3451. https://doi.org/10.3390/electronics12163451

Chicago/Turabian Style

Chen, Zunming, Hongyan Cui, Ensen Wu, and Xi Yu. 2023. "Computation and Communication Efficient Adaptive Federated Optimization of Federated Learning for Internet of Things" Electronics 12, no. 16: 3451. https://doi.org/10.3390/electronics12163451

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Computation and Communication Efficient Adaptive Federated Optimization of Federated Learning for Internet of Things

Abstract

1. Introduction

2. Related Work

2.1. Local Update in Federated Learning

2.2. Parameter Compression in Federated Learning

3. Preliminaries and Problem Formulation

3.1. Federated Learning

3.2. Problem Formulation

3.3. Learning Error Upper Bound

4. The Proposed FedEAFO Algorithm

5. Experimental Evaluation

5.1. Experiment Setup

5.2. Results and Analysis

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI