Communication-Efficient Wireless Traffic Prediction with Federated Learning

Gao, Fuwei; Zhang, Chuanting; Qiao, Jingping; Li, Kaiqiang; Cao, Yi

doi:10.3390/math12162539


by Fuwei Gao ¹, Chuanting Zhang ²,*, Jingping Qiao ¹, Kaiqiang Li ¹ and Yi Cao ³

¹ School of Information Science and Engineering, Shandong Normal University, Jinan 250358, China

² School of Software, Shandong University, Jinan 250100, China

³ School of Information Science and Engineering, University of Jinan, Jinan 250022, China

* Author to whom correspondence should be addressed.

Mathematics 2024, 12(16), 2539; https://doi.org/10.3390/math12162539

Submission received: 17 July 2024 / Revised: 11 August 2024 / Accepted: 15 August 2024 / Published: 17 August 2024

(This article belongs to the Section E1: Mathematics and Computer Science)


Abstract:

Wireless traffic prediction is essential to developing intelligent communication networks that facilitate efficient resource allocation. Along this line, decentralized wireless traffic prediction under the paradigm of federated learning is becoming increasingly significant. Compared to traditional centralized learning, federated learning satisfies network operators’ requirements for sensitive data protection and reduces the consumption of network resources. In this paper, we propose a novel communication-efficient federated learning framework, named FedCE, by developing a gradient compression scheme and an adaptive aggregation strategy for wireless traffic prediction. FedCE achieves gradient compression through top-K sparsification and can largely relieve the communication burden between local clients and the central server, making it communication-efficient. An adaptive aggregation strategy is designed by quantifying the different contributions of local models to the global model, making FedCE aware of spatial dependencies among various local clients. We validate the effectiveness of FedCE on two real-world datasets. The results demonstrate that FedCE can improve prediction accuracy by approximately 27% while using only 20% of the communication required by the baseline method.

Keywords:

wireless traffic prediction; spatial–temporal data analysis; federated learning; deep neural networks; gradient compression

MSC:

68T05

1. Introduction

As fifth-generation (5G) technology evolves, researchers propose the definition and key features of 6G technology, with an emphasis on building intelligent and digital systems. The development of 6G technology is expected to be the primary arena for achieving global digitization and intelligence, and it will face many challenges [1]. Meanwhile, with the rapid advancement of artificial intelligence (AI) techniques, there has been a growing number of AI applications in the field of communication, such as traffic monitoring and intelligent operation and maintenance [2,3].

6G technology has the potential to provide sufficient data support for AI, while the efficiency of AI is also critical for the development of digital and intelligent systems. Therefore, the integration of communication and AI is crucial for realizing the construction of intelligent communication networks. This trend is particularly relevant for wireless traffic prediction [4], which can significantly benefit from the integration of communication and AI. Wireless traffic prediction plays a crucial role in the development of intelligent communication networks [5]. By collecting and analyzing historical traffic data, wireless traffic forecasting can predict future traffic patterns and assist in solving network resource allocation problems and designing base station sleeping strategies, ultimately enhancing the user’s network experience and reducing energy waste.

Traditional wireless traffic prediction methods, however, typically rely on centralized training, in which all data from base stations are uploaded to a central server for training and prediction. This approach has two main problems: data privacy leakage and high network overhead.

To address these issues, one new direction is to use federated learning for wireless traffic prediction. In the current literature, federated learning has proved to be efficient in privacy protection and in processing datasets distributed across different parties [6,7]. Unlike traditional methods, federated learning is a distributed learning approach that emphasizes local training. It requires only that gradient updates be uploaded to a central server, where a new global model is obtained through aggregation. By uploading only the model gradient updates, federated learning reduces communication bandwidth usage and shortens the upload time. Furthermore, because only gradient updates leave the local client and the raw traffic data stay on the device, federated learning improves data security compared with centralized learning. Moreover, the recent increase in the computational power of edge devices has met the hardware requirements for running federated training. These reasons explain why federated learning has advantages in wireless traffic prediction that centralized training methods cannot match.

Though federated learning offers unique advantages in the field of machine learning, it also poses multiple challenges. The heterogeneity of local data and the variability of local clients can cause gradient information to deviate from the actual training trend of the local model, thereby affecting the performance of the global model. Moreover, federated learning requires multiple rounds of local training, which can result in the frequent transmission of gradient information and an increased network overhead. Furthermore, since federated learning only receives gradient information, performing spatial–temporal dependencies analysis on the original data is impossible, which negatively impacts the model’s prediction accuracy.

In this paper, we propose a communication-efficient federated learning framework called FedCE for wireless traffic prediction. Compared with traditional federated learning schemes, FedCE is communication-efficient, as a gradient compression scheme is designed during its local learning. In addition, to overcome the data heterogeneity challenge, a local control variable is introduced into FedCE to correct the gradient direction and ensure its optimality. Additionally, FedCE has a strong ability to capture the spatial–temporal dependencies among different local clients. This is because our proposed adaptive aggregation strategy can quantify the different contributions of local client models when performing global model learning. Our contributions can be summarized as follows.

We develop a communication-efficient federated learning framework for the wireless traffic prediction problem. A gradient compression scheme is designed and implemented through top-K sparsification. The communication between local clients and the central server can be considerably reduced.
We design a gradient correction scheme that adds a local control variable to correct the gradient information and keep its update direction close to the optimal one. This scheme addresses the data heterogeneity challenge in wireless traffic prediction.
We propose an adaptive aggregation scheme at the server side based on the gradient correlation. The spatial–temporal dependencies of different local clients can be modeled, and prediction performance can be largely improved.

The rest of this paper is arranged as follows. Section 2 summarizes related work on wireless traffic prediction and federated learning. Section 3 presents the system model and the problem formulation. Section 4 formally introduces our proposed framework, including gradient compression, gradient correction, and an adaptive aggregation scheme. We report the experimental results in Section 5 and present the conclusions in Section 6.

2. Related Work

This section aims to summarize the research conducted in federated learning and wireless traffic prediction, which are the main topics addressed in this paper.

2.1. Federated Learning

Federated learning is a particular type of distributed machine learning method that is typically applied in scenarios with a large number of mobile devices and private data, such as smart mobile terminals, Internet of Things devices, and financial and healthcare applications. The concept of federated learning was first proposed by Google [8]; it allows multiple local devices that store unevenly distributed data to collaboratively train a high-quality centralized model while protecting data privacy.

The classical algorithm of federated learning is Federated Averaging (FedAvg), whose workflow is as follows. (1) The server first sends the initialized model to the local devices; (2) then, the local devices perform local training based on their local datasets, generate a local model, and upload the gradient information or model to the central server; (3) the central server performs model aggregation to yield a global model, which is broadcast to the local clients for the next round of training. The above steps are repeated until the global model achieves sufficient accuracy or a stopping condition is met. FedAvg performs well on independent and identically distributed (IID) data, but its performance is suboptimal on non-IID data.

Federated learning has four main research directions, i.e., non-IID data distribution, unbalanced datasets, massive distribution, and unreliable participating clients [9]. For situations where each client has a unique data distribution, personalized Federated HyperNetworks (pFedHN) was proposed to provide each client with a personalized model using a central hypernetwork model [10]. By generating diverse models, pFedHN addresses the issue of personalized data among different clients. In addition, a general framework named Ditto was proposed [11] to overcome the non-IID challenge. It adds a regularization term to the local model generation step to balance the global and local models and thereby achieve personalized federated learning. The FedProx optimization framework addressed the issue of statistical heterogeneity in federated learning [12]. In terms of heterogeneous federated learning, most work involves modifications to the local stochastic gradient descent. For instance, the authors of [13] propose a local gradient estimator that reduces bias and variance, addressing the issue of heterogeneity in federated learning. In addition, the federated learning algorithm proposed in [14] leverages variance reduction methods, utilizing local and global control variables on both the local device and the central server to track local updates and address the issue of client drift.

2.2. Wireless Traffic Prediction

Wireless traffic prediction aims to predict time series data accurately. Its main task is to analyze the characteristics of historical data to predict future traffic volume. Autoregressive integrated moving average (ARIMA) [15] and exponential smoothing [16] are traditional statistical methods for time series prediction. However, with the development of deep learning technology, machine learning models are increasingly being used. For example, to handle long-term dependencies in time series, ref. [17] uses long short-term memory (LSTM) structures based on recurrent neural networks (RNNs). The authors of [18] propose a capsule time convolutional network to handle emergency situations such as holidays and natural disasters.

However, with the growing tension between protecting sensitive data and improving the accuracy of wireless traffic prediction, federated learning-based wireless traffic prediction has received more attention [19]. Compared with traditional centralized wireless traffic prediction methods, federated learning is more suitable for dealing with decentralized, privacy-sensitive data. For example, a federated framework called FedLoc was proposed, which works with smaller local datasets without sacrificing user privacy and approximates the global machine learning model in a collaborative manner [20].

Prior wireless traffic prediction research has continuously optimized network structures, considering only the temporal correlation of traffic while ignoring its spatial correlation. In one study [21], researchers argued that spatial correlation exists when constructing intelligent cellular networks. Later, Zhang et al. proposed a federated learning method based on dual attention (FedDA) [22]. More specifically, the authors grouped the base stations (BSs) into several clusters based on their location information, trained cluster-specific models within each group of nearby BSs, and finally aggregated the global model. They successfully verified the importance of spatial–temporal correlation in wireless traffic prediction [22].

Another important metric of wireless traffic prediction is the communication cost. Compared to traditional centralized wireless traffic prediction methods, federated learning-based methods perform well in reducing the communication overhead. There are generally two ways to further improve the communication efficiency of federated learning. The first is to reduce the number of communication rounds, and the second is to reduce the amount of communication in a single round. To reduce the number of communication rounds, momentum federated learning [23] implements momentum gradient descent in local updates. This algorithm introduces momentum into gradient descent and uses the convergence direction to guide it, effectively accelerating convergence. In [24], the authors further improved the momentum method by adapting the learning rate to achieve better convergence results. To reduce the amount of communication in a single round, the deep gradient compression algorithm [25] was proposed, which sets a gradient threshold. During the communication process between the client and the server, only gradient information greater than the threshold is transmitted, and gradient information smaller than the threshold is discarded. The method in [25] can reduce the communication overhead through gradient compression without slowing down the convergence rate.

3. Problem Formulation

In this section, we first introduce the system model we considered, and then we give the problem formulation.

The system model we considered is shown in Figure 1. In this model, there are a total of K base stations, and each base station k preserves a local wireless traffic dataset $d^{k} = \{d_{1}^{k}, d_{2}^{k}, \dots, d_{T}^{k}\}$ with T time-indexed data points. Note that $d^{k}$ could be privacy-sensitive if those base stations come from different network operators. We have a global model $w$, which is trained collaboratively by the K base stations and coordinated by a central server.

Before training, each base station first needs to prepare a training dataset $D^{k} = \{(x_{i}^{k}, y_{i}^{k})\}$ using the sliding window approach, where $x_{i}^{k}$ and $y_{i}^{k}$ represent the i-th input features and target label, respectively. In this paper, we consider two kinds of input features that model two different temporal dependencies, i.e., closeness dependence and period dependence. More specifically, closeness dependence means the t-th wireless traffic volume of a base station is closely related to its previous several traffic volumes. Similarly, period dependence means the t-th wireless traffic volume of a base station also depends on the traffic volume at the same time of previous days. For example, if we want to predict the traffic value at 4 P.M. on Friday, then the traffic value is closely related to the traffic value at 4 P.M. on Thursday, Wednesday, etc. Based on the above explanations, $x_{i}^{k}$ and $y_{i}^{k}$ can be denoted as follows if the sliding window size is c, the period dependence granularity is p, and only the performance of one-step-ahead prediction is considered.

x_{i}^{k} = \{\underbrace{d_{i}^{k}, \dots, d_{i + c p}^{k}}_{\text{closeness dependence}}, \underbrace{d_{i + c p - c}^{k}, \dots, d_{i + c p - 1}^{k}}_{\text{period dependence}}\},

(1)

y_{i}^{k} = \{d_{i + c p}^{k}\} .

(2)
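The sliding-window construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' preprocessing code: the closeness features are taken to be the c values immediately before the target, the period features are taken to be the values exactly p, 2p, ..., q·p steps before the target, and the parameter `q` (number of period lags) and the function name `build_samples` are hypothetical. The exact index layout may differ from Equation (1).

```python
from typing import List, Tuple


def build_samples(series: List[float], c: int, p: int, q: int) -> List[Tuple[List[float], float]]:
    """Build (features, target) pairs for one-step-ahead prediction.

    Assumed layout (illustrative, may differ from Eq. (1)):
      closeness: the c values immediately before the target,
      period:    the values exactly q*p, ..., 2p, p steps before the target.
    """
    samples = []
    start = max(c, q * p)  # first target index that has a full history
    for t in range(start, len(series)):
        closeness = series[t - c:t]
        period = [series[t - j * p] for j in range(q, 0, -1)]
        samples.append((closeness + period, series[t]))
    return samples


# Toy series whose value equals its time index, so features are easy to check.
toy_series = [float(t) for t in range(10)]
toy_samples = build_samples(toy_series, c=2, p=3, q=2)
first_x, first_y = toy_samples[0]
```

For the toy series, the first target is the value at index 6; its closeness features are the values at indices 4 and 5, and its period features are the values at indices 0 and 3.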

Once the dataset $\{x^{k}, y^{k}\}$ is prepared, our problem can be written as follows.

\arg\min_{w} \{L (w) = \sum_{k} L_{k} (w)\},

(3)

where $w$ and $L (\cdot)$ denote the global model and the objective function we are trying to solve, respectively. Note that $L (\cdot)$ is a summation over K local functions $L_{k} (\cdot)$, and $L_{k} (\cdot)$ can normally be expressed as

L_{k} (w) = \sum_{i} {(f (x_{i}^{k}, w) - y_{i}^{k})}^{2},

(4)

where $f (x_{i}^{k}, w)$ denotes the prediction of the i-th sample on the k-th client using model f.
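To make Equations (3) and (4) concrete, the following minimal sketch evaluates the global objective as a sum of per-client squared-error losses. The linear model `linear_model` is a hypothetical stand-in for f (the paper uses an MLP), and the toy data are made up for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]


def linear_model(x: Sequence[float], w: Sequence[float]) -> float:
    """Hypothetical stand-in for f(x, w): a plain dot product."""
    return sum(xi * wi for xi, wi in zip(x, w))


def local_loss(data: List[Sample], w: Sequence[float],
               f: Callable[[Sequence[float], Sequence[float]], float] = linear_model) -> float:
    """Eq. (4): sum of squared errors over one client's samples."""
    return sum((f(x, w) - y) ** 2 for x, y in data)


def global_loss(clients: List[List[Sample]], w: Sequence[float]) -> float:
    """Eq. (3): the global objective is the sum of the local losses."""
    return sum(local_loss(data, w) for data in clients)


clients = [
    [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)],  # client 1: two samples
    [([1.0, 1.0], 2.0)],                     # client 2: one sample
]
w = [1.0, 1.0]
total = global_loss(clients, w)
```

In a federated setting the server never evaluates `local_loss` itself, since the samples stay on the clients; the sketch only shows how the objective decomposes.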

The global model, $w$, can be obtained by performing the following four steps iteratively, as shown in Figure 1.

Step 1: The central server broadcasts the global model $w$ to local base stations. It should be noted that the number of base stations selected to participate in model training is a subset of all base stations. The selection principles can be customized by mobile network operators according to specific requirements.
Step 2: Each selected base station performs local stochastic gradient descent training with its own dataset. In this step, we introduce error compensation into stochastic gradient descent to overcome the client drift phenomenon confronted with traditional federated learning.
Step 3: Each selected base station first performs gradient compression to alleviate network burdens and save network bandwidth, and then it transfers the compressed gradient information to the central server.
Step 4: The central server performs global model aggregation after receiving all the gradient information from local base stations. In this step, we introduce a technique named gradient re-grouping to quantify the different contributions of base stations to the global model and capture spatial dependence among base stations.

The above four steps are executed iteratively until the stop condition is reached. After that, the global model is obtained. In the next section, we will introduce our proposed framework that implements Step 2, Step 3, and Step 4 in detail.

4. Proposed Framework

In this section, we will provide a detailed introduction to the FedCE method, as shown in the system diagram in Figure 1. FedCE mainly consists of three components, i.e., local training with error compensation (Step 2), gradient compression with top-K sparsification (Step 3), and global model aggregation with gradient re-grouping (Step 4).

4.1. Local Training

After receiving the global model $w$ from the central server, each selected local client will start its local training to update the global model. Specifically, the update rule can be written as

w^{k} \leftarrow w^{k} - η_{l} \nabla f (w^{k}; B_{j}^{k}),

(5)

where $η_{l}$ is the local learning rate and $B_{j}^{k}$ denotes the j-th batch of data on client k. If local clients send $w^{k}$ to the central server for global model averaging, the so-called ’client drift’ phenomenon will occur, resulting in performance degradation, since the average of the local gradient update directions diverges from the real global gradient update direction. Several strategies have been proposed to solve this problem, such as data-sharing, regularization, and controlled averaging. In this paper, we choose controlled averaging for our problem, since the data-sharing strategy breaks data privacy-preserving principles, and the regularization term of the regularization strategy is hard to tune.

The controlled averaging strategy works as follows. The central server and each local client preserve a global control variable $c$ and a local control variable $c^{k}$, respectively. The control variable $c^{k}$ is adopted to correct the local gradient information so that it follows the global gradient information. An illustration is shown in Figure 2, from which we can clearly observe that the corrected gradient update direction is closer to the optimal gradient direction than the original (uncorrected) gradient direction. To achieve this, we update the local training procedure, i.e., Equation (5), to

w^{k} \leftarrow w^{k} - η_{l} \{\nabla f (w^{k}; B_{j}^{k}) + β c_{cort}^{k}\},

(6)

where $c_{cort}^{k} = c - c^{k}$ denotes the gradient correction term and $β$ is a user-specified constant that adjusts the contribution of $c_{cort}^{k}$ to the local gradient. Introducing $β$ into the local gradient update makes our gradient correction scheme much more flexible than traditional ones, as it balances the trade-off between the local gradient and the gradient correction.
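A single corrected local step from Equation (6) can be sketched as follows. This is a minimal illustration with plain Python lists standing in for model parameters; the function name `corrected_sgd_step` and the toy numbers are assumptions, and the batch gradient is passed in rather than computed.

```python
from typing import List


def corrected_sgd_step(w_k: List[float], grad: List[float], c_global: List[float],
                       c_local: List[float], lr: float, beta: float) -> List[float]:
    """One local update following Eq. (6).

    The correction term is c_cort = c - c^k, scaled by beta before it is
    added to the batch gradient.
    """
    c_cort = [c - ck for c, ck in zip(c_global, c_local)]
    return [w - lr * (g + beta * cc) for w, g, cc in zip(w_k, grad, c_cort)]


w_k = [1.0, 2.0]
grad = [0.5, -0.5]        # stand-in for the batch gradient
c_global = [0.25, 0.0]    # global control variable c
c_local = [0.0, 0.25]     # local control variable c^k
plain = corrected_sgd_step(w_k, grad, c_global, c_local, lr=1.0, beta=0.0)
corrected = corrected_sgd_step(w_k, grad, c_global, c_local, lr=1.0, beta=1.0)
```

Setting `beta=0.0` recovers the uncorrected update of Equation (5), which makes the effect of the correction term easy to compare.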

4.2. Gradient Compression

In federated learning, communication overhead is a critical issue that affects system efficiency and scalability. It is influenced by the number of gradient transmissions between local clients and the central server and the size of each gradient transmission. We introduce gradient compression into wireless traffic prediction to reduce the communication overhead of federated training.

Previous research has demonstrated that, in distributed stochastic gradient descent, only a small portion of the exchanged gradient information carries most of the value. Consequently, the top-K gradient sparsification algorithm is employed to filter the local gradients, uploading only the entries with the largest changes while retaining the remaining gradients on the client for participation in the subsequent gradient update.

We first calculate the accumulated local gradient information by using

g^{k} \leftarrow (w - w^{k}) / η_{l},

(7)

where $w$ is the global model before the local update and $w^{k}$ is the local model.

Then, we perform top-K gradient sparsification on $g^{k}$ to select the K largest gradient values through

φ (g^{k}) = TopK (g^{k}) .

(8)

The compressed gradient information, $φ (g^{k})$, will be transferred to the central server after Equation (8) is finished on client k. In addition, the local control variable will also be updated by

c^{k} \leftarrow c^{k} - c + g^{k} .

(9)
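Equations (7) to (9) can be illustrated with a short sketch. It assumes that "largest" in top-K means largest in absolute value, which is the usual convention for gradient sparsification but is not stated explicitly here; the function names are hypothetical, and the client-side retention of the dropped entries is left out for brevity.

```python
from typing import List


def accumulated_gradient(w_global: List[float], w_local: List[float], lr: float) -> List[float]:
    """Eq. (7): g^k = (w - w^k) / eta_l."""
    return [(wg - wl) / lr for wg, wl in zip(w_global, w_local)]


def top_k_sparsify(g: List[float], k: int) -> List[float]:
    """Eq. (8): keep the k largest-magnitude entries and zero out the rest."""
    order = sorted(range(len(g)), key=lambda i: abs(g[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(g)]


def update_local_control(c_local: List[float], c_global: List[float], g: List[float]) -> List[float]:
    """Eq. (9): c^k <- c^k - c + g^k."""
    return [ck - c + gi for ck, c, gi in zip(c_local, c_global, g)]


w_global = [1.0, 1.0, 1.0, 1.0]
w_local = [0.5, 1.25, 1.0, 3.0]
g_k = accumulated_gradient(w_global, w_local, lr=0.5)
sparse_g = top_k_sparsify(g_k, k=2)
new_c_local = update_local_control([0.0] * 4, [0.5] * 4, g_k)
```

Only the nonzero entries of `sparse_g` (and their indices) would need to be sent to the server, which is where the communication saving comes from.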

4.3. Model Aggregation

In this section, we will introduce the model aggregation method, which is an important component of federated learning. The main work includes receiving local information and aggregating new global models.

The global model is normally obtained by averaging the transferred local models or performing one step of gradient descent based on the received gradient information. The operation of model aggregation for FedAvg can be denoted as

w \leftarrow w - \frac{η_{g}}{| S |} \sum_{k \in S} φ (g^{k}) .

(10)

One merit of the above averaging operation is that it is computationally convenient and easy to understand. However, it fails to model the spatial dependencies among different base stations, since it treats the contributions of local gradients equally. It is apparent that different base stations have different levels of dependencies.
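The equal-weight averaging of Equation (10) is easy to sketch, which also makes its limitation visible: every client's sparsified gradient enters the sum with the same weight. The function name `fedavg_step` and the toy gradients are illustrative assumptions.

```python
from typing import List


def fedavg_step(w: List[float], sparse_grads: List[List[float]], lr_global: float) -> List[float]:
    """Eq. (10): w <- w - eta_g / |S| * sum_k phi(g^k)."""
    n = len(sparse_grads)
    summed = [sum(column) for column in zip(*sparse_grads)]
    return [wi - lr_global / n * si for wi, si in zip(w, summed)]


w = [1.0, 1.0, 1.0]
sparse_grads = [
    [2.0, 0.0, 0.0],    # phi(g^1)
    [0.0, 0.0, -4.0],   # phi(g^2)
]
w_next = fedavg_step(w, sparse_grads, lr_global=1.0)
```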

To capture the spatial dependencies among the base stations, we find that gradient correlation is an effective method. The local gradient information is mathematically computed from local data, and the gradient can reflect the property of raw data. Thus, if local gradients have nonrandom correlations, it indicates a good representation of spatial dependencies among base stations. Luckily, we find that the gradient information of local clients has nonrandom correlations, which can be reflected by Figure 3.

Moreover, determining the weight of the local model solely based on the number of local samples overlooks the spatial correlation among local clients. In [21], the authors observed that the traffic in adjacent cells exhibits a high correlation, while the sets of cells in different regions have distinct characteristics. For instance, the BS data within the commercial center area of a city exhibit similarity but differ significantly from the data of the stations in suburban or industrial areas.

To overcome these limitations, this paper proposes a model aggregation strategy based on traffic patterns and spatial correlation. This strategy utilizes local gradient information to capture the similarity between wireless traffic patterns and combines it with spatial distance to achieve weighted aggregation. To provide a detailed description of this technical step, we will illustrate the model aggregation process during the i-th global epoch as an example:

(1) The central server receives gradients $φ (g^{k})$ and local control variables $c^{k}$ from the local clients.

(2) We calculate a Pearson correlation coefficient matrix $ρ$ using local gradients. The Pearson correlation matrix represents the correlation between local gradients and serves as a proxy for the correlation of the underlying wireless traffic data. We take m local gradient vectors as inputs and treat each gradient vector as a point. Next, we calculate the correlation coefficient between the points to obtain a similarity matrix. Figure 3 shows the Pearson correlation coefficient matrix of ten gradient vectors ($g_{1}$ to $g_{10}$) uploaded to the central server in one global epoch. The value corresponding to $g_{4}$ on the x-axis and $g_{8}$ on the y-axis is 0.96, indicating that gradients $g_{4}$ and $g_{8}$ are positively correlated, with a similarity of 0.96. That is, positive values indicate positive correlation, negative values indicate negative correlation, and the higher the value, the higher the similarity. For any given two BSs k and s, the Pearson correlation coefficient $ρ^{k, s}$ can be expressed as

ρ^{k, s} = \frac{\mathrm{cov} (g^{k}, g^{s})}{σ_{g^{k}} σ_{g^{s}}} = \frac{E (g^{k} g^{s}) - E (g^{k}) E (g^{s})}{\sqrt{E ({(g^{k})}^{2}) - {(E (g^{k}))}^{2}} \sqrt{E ({(g^{s})}^{2}) - {(E (g^{s}))}^{2}}} .

(11)
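Equation (11) translates directly into code if the expectations are read as means over the entries of each gradient vector, which is an assumption of this sketch. The function names and the toy gradients are illustrative.

```python
import math
from typing import List


def pearson(a: List[float], b: List[float]) -> float:
    """Eq. (11), with E(.) taken as the mean over vector entries.

    A constant vector has zero variance and would cause a division by zero;
    this sketch does not handle that case.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b
    var_a = sum(x * x for x in a) / n - mean_a ** 2
    var_b = sum(y * y for y in b) / n - mean_b ** 2
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(grads: List[List[float]]) -> List[List[float]]:
    """Pairwise Pearson coefficients between all uploaded gradient vectors."""
    return [[pearson(gk, gs) for gs in grads] for gk in grads]


grads = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # same pattern as the first, scaled
    [4.0, 3.0, 2.0, 1.0],   # reversed pattern
]
rho = correlation_matrix(grads)
```

In this toy case the first two vectors are perfectly positively correlated and the first and third are perfectly negatively correlated, so the matrix contains values of 1 and -1.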

(3) Subsequently, we combine the reciprocal of the distance $dist$ between the local BSs with the corresponding correlation coefficient to calculate the final weight $λ^{k, s}$. The reciprocal of the distance is employed to indicate that closer proximity corresponds to higher spatial similarity. The combination operation between gradient correlation and spatial proximity is

λ^{k, s} = ρ^{k, s} + \frac{1}{dist^{k, s}} .

(12)
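A small sketch of Equation (12) follows. The text does not say how the case k = s (zero distance to itself) is treated, so the sketch simply reuses the correlation value on the diagonal; that choice, the function name `weight_matrix`, and the distance values are assumptions made for illustration.

```python
from typing import List


def weight_matrix(rho: List[List[float]], dist: List[List[float]]) -> List[List[float]]:
    """Eq. (12) for every pair of distinct BSs: lambda = rho + 1 / dist.

    Assumption: for k == s the distance is zero, so the diagonal keeps
    rho[k][k] without a distance term.
    """
    n = len(rho)
    lam = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for s in range(n):
            if k == s:
                lam[k][s] = rho[k][s]
            else:
                lam[k][s] = rho[k][s] + 1.0 / dist[k][s]
    return lam


rho = [[1.0, 0.5], [0.5, 1.0]]
dist = [[0.0, 4.0], [4.0, 0.0]]  # hypothetical pairwise distances, arbitrary units
lam = weight_matrix(rho, dist)
```

Because the reciprocal distance depends on the unit of `dist`, the relative influence of the two terms would change with the chosen unit; the sketch leaves any such scaling out.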

(4) Finally, we regroup the gradient information for BS k by introducing spatial dependency between k and other BSs. The regroup process can be denoted as

g_{spat}^{k} \leftarrow \sum_{s \in S} λ^{k, s} φ (g^{s}) .

(13)

Based on the regrouped local gradients, we can update the global model and control variable by

w^{i + 1} \leftarrow w^{i} - \frac{η_{g}}{| S |} \sum_{k \in S} g_{spat}^{k},

(14)

c \leftarrow c + \frac{| S |}{K} \sum_{k \in S} φ (g^{k} - c)

(15)

where $w^{i + 1}$ denotes the new global model. The whole algorithm of our proposed FedCE is summarized in Algorithm 1.
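The regrouping and global update of Equations (13) and (14) can be sketched as below. The weights and gradients are toy values, the function names are hypothetical, and the control-variable update of Equation (15) is omitted to keep the example short.

```python
from typing import List


def regroup(lam: List[List[float]], sparse_grads: List[List[float]]) -> List[List[float]]:
    """Eq. (13): g_spat^k = sum_s lambda^{k,s} * phi(g^s)."""
    n, dim = len(sparse_grads), len(sparse_grads[0])
    return [
        [sum(lam[k][s] * sparse_grads[s][j] for s in range(n)) for j in range(dim)]
        for k in range(n)
    ]


def update_global_model(w: List[float], g_spat: List[List[float]], lr_global: float) -> List[float]:
    """Eq. (14): w <- w - eta_g / |S| * sum_k g_spat^k."""
    n = len(g_spat)
    return [wi - lr_global / n * sum(g[j] for g in g_spat) for j, wi in enumerate(w)]


lam = [[1.0, 0.5], [0.5, 1.0]]            # toy weights lambda^{k,s}
sparse_grads = [[2.0, 0.0], [0.0, 4.0]]   # phi(g^1), phi(g^2)
g_spat = regroup(lam, sparse_grads)
w_next = update_global_model([1.0, 1.0], g_spat, lr_global=1.0)
```

Compared with plain averaging, each regrouped gradient mixes in the gradients of correlated or nearby BSs before the global step is taken.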

Algorithm 1 FedCE

1: Initialize global model $w$, global control variable $c$, learning rates $η_{l}$ and $η_{g}$
2: for $r = 0, 1, \dots$ do
3:   Sample BSs $S \subseteq \{1, \dots, K\}$
4:   Broadcast $w$ and $c$ to all BSs in $S$
5:   for $k \in S$ do ▹ Run in parallel on local BSs
6:     Set $w^{k} \leftarrow w$, $c^{k} \leftarrow c$
7:     for $j = 0, 1, \dots$ do
8:       Sample a batch dataset $B_{j}^{k}$
9:       Compute batch gradient $\nabla f (w^{k}; B_{j}^{k})$
10:       Correct gradient
11:       $w^{k} \leftarrow w^{k} - η_{l} (\nabla f (w^{k}; B_{j}^{k}) + β c^{k})$
12:     end for
13:     Compute local gradient $g^{k} \leftarrow (w - w^{k}) / η_{l}$
14:     Update control variable $c^{k} \leftarrow c^{k} - c + g^{k}$
15:     Upload $φ (g^{k})$ to the server
16:   end for
17:   Gradient regrouping $g_{spat}^{k} \leftarrow \sum_{s \in S} λ^{k, s} φ (g^{s})$
18:   Update $w \leftarrow w - \frac{η_{g}}{| S |} \sum_{k \in S} g_{spat}^{k}$
19:   Update $c \leftarrow c + \frac{| S |}{K} \sum_{k \in S} φ (g^{k} - c)$
20: end for

5. Experimental Results

In this section, we first introduce the dataset and baseline methods. Then, we present the experimental settings and validate the effectiveness of FedCE through extensive experiments.

5.1. Dataset

In this study, we used call detail records (CDRs) obtained from Milan and Trentino as the primary datasets. The data collection period was from 1 November 2013 to 1 January 2014, and the communication volumes were recorded every ten minutes. In order for a record to be considered valid, the user’s connection time had to be at least 15 min, or data usage had to reach 5 MB.

To conduct the statistical analysis, we used the BS as the unit of analysis and considered the coverage area of each BS as a cell. Milan’s dataset comprised 10,000 cells distributed in a $100 \times 100$ square grid, with the BSs primarily collecting information on sent and received SMS messages, calls, and internet data. We mainly used internet data as the experimental data source. The Trentino dataset contained 6575 cells. To simplify the statistical analysis, we set the data interval to one hour.

To evaluate the accuracy of the prediction model, we used the mean absolute error (MAE) [26], mean squared error (MSE) [27], and R2 score as evaluation metrics. The MAE measures the average magnitude of the errors in a set of predictions, the MSE measures the average squared difference between predicted and actual values, and the R2 score measures the proportion of the variance in the actual values that is explained by the predictions. These metrics quantify the prediction model’s accuracy and provide a basis for comparison with other models. We also compared the communication cost (Comm.) among different algorithms. The lower the communication cost, the better the performance.
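For reference, the three accuracy metrics can be computed as in the following sketch. The R2 score is implemented with its standard definition (one minus the ratio of residual to total sum of squares), which is assumed here because the text does not spell it out; the toy values are made up.

```python
from typing import List


def mae(y_true: List[float], y_pred: List[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true: List[float], y_pred: List[float]) -> float:
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: List[float], y_pred: List[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 4.0, 3.0]
scores = (mae(y_true, y_pred), mse(y_true, y_pred), r2_score(y_true, y_pred))
```

Lower MAE and MSE are better, whereas a higher R2 score is better, with 1 indicating a perfect fit.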

5.2. Baseline Methods

We compared FedCE with the following baselines. (1) Centralized method: All training data are stored centrally on a single data center or server. The entire model training process takes place on this centralized server. Centralized training is typically suitable for scenarios with small-scale and centralized data storage, but it may also raise concerns regarding privacy and security. (2) Standalone method: Each BS has its own data and computational capabilities. Each base station independently completes modeling, training, and prediction tasks. (3) FedAvg: FedAvg is the most classical FL algorithm, which aggregates the global model by weighting according to the sample count in the local datasets. (4) FedDA: FedDA is a classic federated learning-based framework for wireless traffic prediction. Compared to FedAvg, it can simultaneously capture the spatio-temporal characteristics of wireless traffic, thereby improving the performance of the prediction model.

5.3. Experimental Setup and Results Analysis

This section describes the experimental setup and analyzes the results of applying FL to wireless traffic prediction. We compared the proposed FedCE method with the four baseline methods. The experiments used 100 randomly selected BSs for training and testing. The dataset was evenly divided into eight parts based on time index, with the first seven parts used as the local training set and the last part as the test set. This partition helps to better measure the model's generalization ability. To reduce the influence of other factors in FL, all methods used the same lightweight MLP network architecture [22], which reduces computation and storage costs while maintaining high prediction accuracy. The local learning rate was set to 0.001, and Adam was used as the optimizer. The global epoch was set to 100, the local epoch was set to 1, and the local batch size was set to 20. Based on this, the proposed FedCE method was compared with the four baseline methods from three different perspectives, centralized, fully distributed, and federated, to verify its effectiveness and practicality.
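The time-based partition and the training hyperparameters described above can be sketched as follows. This is an illustrative reconstruction only: the helper `temporal_split`, the `CONFIG` dictionary and its keys, and the toy series are assumptions made for this example.

```python
import numpy as np

# Hyperparameters as stated in the text; the dictionary keys are hypothetical.
CONFIG = {
    "num_bs": 100,          # randomly selected base stations
    "learning_rate": 0.001, # local learning rate (Adam optimizer)
    "global_epochs": 100,
    "local_epochs": 1,
    "batch_size": 20,
}

def temporal_split(series, n_parts=8):
    """Split a time-ordered series into n_parts chunks by time index.

    The first n_parts - 1 chunks form the training set and the last
    chunk forms the test set, so the test data always lie in the future.
    """
    parts = np.array_split(np.asarray(series), n_parts)
    train = np.concatenate(parts[:-1])
    test = parts[-1]
    return train, test

# Toy traffic series with 16 time steps (hypothetical values).
train, test = temporal_split(np.arange(16))
print(len(train), len(test))
```

Splitting by time index rather than at random keeps the test set strictly later than the training set, which avoids leaking future traffic into training.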

Table 1 presents the experimental results for FedCE and the baseline algorithms. Here, FedCE-0.1/0.2/0.4 denotes three local compression rates, namely, s = 0.1, s = 0.2, and s = 0.4. Notably, as the compression rate s increases, the prediction accuracy of FedCE improves, at the cost of higher communication overhead. In practical scenarios, s can be adjusted flexibly to match specific communication requirements. In the following sections, unless explicitly stated otherwise, discussions default to the case of s = 0.1, with FedCE representing FedCE-0.1.
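The effect of the compression rate s can be illustrated with a minimal top-K sparsification sketch. It assumes that s is the fraction of gradient entries kept, which is consistent with the Comm. column of Table 1, and that entries are ranked by magnitude; the function names and the toy gradient are hypothetical, and the index overhead of a real sparse encoding is ignored.

```python
import numpy as np

def topk_sparsify(grad, s):
    """Keep the fraction s of entries with the largest magnitude.

    Returns the indices and values of the kept entries; everything
    else is treated as zero and is not transmitted.
    """
    flat = np.asarray(grad, dtype=float).ravel()
    k = max(1, int(round(s * flat.size)))
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the k largest |g|
    return idx, flat[idx]

def densify(idx, values, size):
    """Rebuild a dense gradient vector on the server side."""
    dense = np.zeros(size)
    dense[idx] = values
    return dense

# Toy gradient with 10 entries (hypothetical values).
grad = np.array([0.1, -5.0, 0.3, 2.0, -0.2, 0.05, 1.0, -0.4, 0.0, 0.6])
idx, vals = topk_sparsify(grad, s=0.2)
print(sorted(idx.tolist()), densify(idx, vals, grad.size))
```

With s = 0.2, only two of the ten entries are sent, so the transmitted gradient volume is roughly 20% of the dense case.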

On the Milano dataset, centralized training performs best, followed by the federated training methods (FedDA, FedAvg, FedCE-0.4), while standalone training performs worst. However, centralized training falls short of federated training in terms of communication overhead and privacy protection. Among the federated training methods, FedDA, FedAvg, and FedCE-0.4 show very similar prediction performance, with the MSE differing by less than 4% and the MAE and R2 differing by less than 1%. With regard to communication overhead, however, FedCE-0.4 shows a reduction of 60% compared to the baseline algorithms. Note that communication overhead is calculated from the amount of gradient information transmitted from local base stations to the central server over 100 global rounds. For comparison, the communication overhead of the two federated baselines, FedDA and FedAvg, is set to 100%.
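The relative differences quoted for the Milano dataset can be checked directly from the Table 1 entries. The sketch below uses FedAvg as the reference; that choice, and the helper name `relative_change`, are assumptions made for this illustration.

```python
def relative_change(value, reference):
    """Signed relative change of value with respect to reference, in percent."""
    return (value - reference) / reference * 100.0

# Milano entries copied from Table 1.
fedavg = {"MSE": 0.0908, "MAE": 0.2115, "R2": 0.8585}
fedce_04 = {"MSE": 0.0943, "MAE": 0.2132, "R2": 0.8529}

changes = {m: relative_change(fedce_04[m], fedavg[m]) for m in fedavg}
for metric, pct in changes.items():
    print(f"{metric}: {pct:+.2f}%")

# Communication overhead relative to the 100% federated baselines.
comm_reduction = 100.0 - 40.0
print(f"Communication reduction: {comm_reduction:.0f}%")
```

Under these assumptions, the MSE gap stays below 4% and the MAE and R2 gaps stay below 1% in absolute value, while the transmitted gradient volume drops by 60%.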

On the Trento dataset, FedCE achieves the best prediction performance, followed by the baseline algorithms in descending order: centralized training, FedDA, standalone training, and FedAvg. Compared to the top-performing baseline algorithm (centralized), FedCE shows improvements of 16.93%, 1.86%, and 2.86% in MSE, MAE, and R2, respectively (that is, lower MSE and MAE and a higher R2). Compared to the leading federated training method (FedDA), FedCE shows improvements of 24.57%, 6.13%, and 4.70% in MSE, MAE, and R2, respectively.

In summary, FedCE demonstrates lower prediction performance compared to centralized training methods on the Milano dataset, while showing similar performance to federated baseline algorithms. Conversely, on the Trento dataset, FedCE achieves the highest prediction accuracy. Additionally, FedCE performs the best in terms of communication overhead on both datasets.

5.4. Comparison between Actual Values and Predicted Values

In this section, we compare the predicted values of the algorithms with the actual observed values (ground truth) to further corroborate their predictive performance. Figure 4 shows this comparison for the FL approaches (FedDA, FedAvg, and FedCE) on the Milano and Trento datasets. The horizontal axis represents time, and the vertical axis represents traffic volume. The actual observed values are shown as a solid blue line, while the predicted values of FedCE, FedDA, and FedAvg are shown as a green dashed line, an orange dash-dot line, and a red dash-dot line, respectively. These values come from a randomly selected region (BS).

Observations from Figure 4a reveal that the forecasted values for FedCE closely align with those of the two federated baseline algorithms (FedAvg and FedDA), indicating a comparable predictive capability. Conversely, Figure 4b highlights that during peak traffic intervals, FedCE’s forecasted values more closely approximate the ground truth than those of the baseline algorithms. This implies a superior performance of FedCE in accurately forecasting traffic volumes with significant fluctuations at peak times.

To summarize, the comparative analysis between actual observed and forecasted values not only aligns with the findings presented in Table 1, but also enhances our understanding of the predictive performance of these algorithms. Specifically, while FedCE demonstrates comparable accuracy to the baseline algorithms on the Milano dataset, its predictive precision on the Trento dataset significantly surpasses that of its counterparts. This analysis substantiates the validity of the experimental outcomes, thereby affirming the efficacy of FedCE in traffic volume prediction.

6. Conclusions

This study introduces FedCE, a communication-efficient algorithm for wireless traffic prediction using federated learning (FL). We address the data heterogeneity across devices with a gradient correction mechanism, aligning local updates with the global model to reduce discrepancies. We use a top-K-based gradient sparsification technique to minimize communication overhead, significantly reducing the gradient data sent to the central server. For global model aggregation, we implement an adaptive strategy based on gradient similarity, using the Pearson correlation coefficient and base station distances to weight and combine local gradients. Experiments on two real-world datasets show FedCE's superiority in communication efficiency and predictive accuracy. FedCE requires only 10% of the communication demands of baseline algorithms and significantly outperforms existing FL benchmarks on the Trento dataset. These results highlight FedCE's potential to enhance wireless traffic prediction using FL.

Author Contributions

Conceptualization and methodology, C.Z. and J.Q.; algorithm implementation and draft preparation, F.G. and K.L.; writing—review and editing, C.Z., Y.C. and J.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Shandong Province Excellent Youth Science Fund Project (Overseas) under Grant No. 2024HWYQ-028, and by the Fundamental Research Funds of Shandong University under Grant No. 11480061330000.

Data Availability Statement

The original data presented in the study are openly available in the Harvard Dataverse at https://doi.org/10.7910/DVN/EGZHFV.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Tomkos, I.; Klonidis, D.; Pikasis, E.; Theodoridis, S. Toward the 6G network era: Opportunities and challenges. IT Prof. 2020, 22, 34–38.
2. Ahammed, T.B.; Patgiri, R.; Nayak, S. A vision on the artificial intelligence for 6G communication. ICT Express 2022, 19, 197–210.
3. Gunkel, D.J. Communication and artificial intelligence: Opportunities and challenges for the 21st century. Communication +1 2012, 1, 1–25.
4. Zhang, C.; Dang, S.; Shihada, B.; Alouini, M.S. On Telecommunication Service Imbalance and Infrastructure Resource Deployment. IEEE Wirel. Commun. Lett. 2021, 10, 2125–2129.
5. Chen, A.; Law, J.; Aibin, M. A survey on traffic prediction techniques using artificial intelligence for communication networks. Telecom 2021, 2, 518–535.
6. Yu, B.; Mao, W.; Lv, Y.; Zhang, C.; Xie, Y. A survey on federated learning in data mining. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2022, 12, e1443.
7. Yin, X.; Zhu, Y.; Hu, J. A comprehensive survey of privacy-preserving federated learning: A taxonomy, review, and future directions. ACM Comput. Surv. (CSUR) 2021, 54, 1–36.
8. Konečný, J.; McMahan, H.B.; Ramage, D.; Richtárik, P. Federated optimization: Distributed machine learning for on-device intelligence. arXiv 2016, arXiv:1610.02527.
9. Li, T.; Sahu, A.K.; Talwalkar, A.; Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Process. Mag. 2020, 37, 50–60.
10. Shamsian, A.; Navon, A.; Fetaya, E.; Chechik, G. Personalized federated learning using hypernetworks. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual. 18–24 July 2021; pp. 9489–9502.
11. Li, T.; Hu, S.; Beirami, A.; Smith, V. Ditto: Fair and robust federated learning through personalization. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual. 18–24 July 2021; pp. 6357–6368.
12. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450.
13. Murata, T.; Suzuki, T. Bias-variance reduced local sgd for less heterogeneous federated learning. arXiv 2021, arXiv:2102.03198.
14. Karimireddy, S.P.; Kale, S.; Mohri, M.; Reddi, S.; Stich, S.; Suresh, A.T. Scaffold: Stochastic controlled averaging for federated learning. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual. 13–18 July 2020; pp. 5132–5143.
15. Wei, W.W. Time Series Analysis; Pearson College Div: Victoria, BC, Canada, 2013.
16. Gardner, E.S., Jr. Exponential smoothing: The state of the art. J. Forecast. 1985, 4, 1–28.
17. Qiu, C.; Zhang, Y.; Feng, Z.; Zhang, P.; Cui, S. Spatio-temporal wireless traffic prediction with recurrent neural network. IEEE Wirel. Commun. Lett. 2018, 7, 554–557.
18. Li, D.; Lin, C.; Gao, W.; Chen, Z.; Wang, Z.; Liu, G. Capsules TCN network for urban computing and intelligence in urban traffic prediction. Wirel. Commun. Mob. Comput. 2020, 2020, 6896579.
19. Niknam, S.; Dhillon, H.S.; Reed, J.H. Federated learning for wireless communications: Motivation, opportunities, and challenges. IEEE Commun. Mag. 2020, 58, 46–51.
20. Yin, F.; Lin, Z.; Kong, Q.; Xu, Y.; Li, D.; Theodoridis, S.; Cui, S.R. FedLoc: Federated learning framework for data-driven cooperative localization and location data processing. IEEE Open J. Signal Process. 2020, 1, 187–215.
21. Zhang, C.; Zhang, H.; Yuan, D.; Zhang, M. Citywide cellular traffic prediction based on densely connected convolutional neural networks. IEEE Commun. Lett. 2018, 22, 1656–1659.
22. Zhang, C.; Dang, S.; Shihada, B.; Alouini, M.S. Dual attention-based federated learning for wireless traffic prediction. In Proceedings of the IEEE INFOCOM 2021-IEEE Conference on Computer Communications, Vancouver, BC, Canada, 10–13 May 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–10.
23. Liu, W.; Chen, L.; Chen, Y.; Zhang, W. Accelerating federated learning via momentum gradient descent. IEEE Trans. Parallel Distrib. Syst. 2020, 31, 1754–1766.
24. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
25. Lin, Y.; Han, S.; Mao, H.; Wang, Y.; Dally, W.J. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv 2017, arXiv:1712.01887.
26. Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82.
27. Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE). Geosci. Model Dev. Discuss. 2014, 7, 1525–1534.

Figure 1. System model of communication-efficient federated learning for wireless traffic prediction.

Figure 2. Illustration of gradient correction.

Figure 3. Pearson correlation matrix.

Figure 4. Comparison of true and predicted values of randomly selected regions (BSs) on the Milano and Trento datasets.

Table 1. Prediction performance comparisons among different methods on the two real-world datasets, i.e., Trento and Milano.

Methods	Trento MSE	Trento MAE	Trento R2	Trento Comm.	Milano MSE	Milano MAE	Milano R2	Milano Comm.
Standalone	2.1411	0.7751	0.7631	-	0.0978	0.2202	0.7692	-
Centralized	1.8761	0.7300	0.8550	-	0.0859	0.2011	0.8660	-
FedAvg	4.3719	1.1072	0.6621	100%	0.0908	0.2115	0.8585	100%
FedDA	2.0716	0.7632	0.8399	100%	0.0940	0.2128	0.8535	100%
FedCE-0.1	1.5590	0.7164	0.8795	10%	0.1037	0.2274	0.8383	10%
FedCE-0.2	1.5108	0.6968	0.8832	20%	0.0978	0.2183	0.8476	20%
FedCE-0.4	1.4844	0.6867	0.8853	40%	0.0943	0.2132	0.8529	40%

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
