**Algorithm 1:** Reward calculation algorithm

**Input:** new state *s<sub>t+1</sub>* ← environment(*s<sub>t</sub>*, *a<sub>t</sub>*)
**Output:** *r<sub>t</sub>*

1. Initialize the reward *r<sub>t</sub>* = 0.
2. **if** constraints *c2* and *c3* of (9) are satisfied **then**
   1. **for** each IoT device *n* = 1, 2, ..., *N* **do**
      1. Calculate *e<sub>n</sub>*(*α<sup>n</sup>* = 0) according to Equation (8).
      2. Calculate *t<sub>n</sub>*(*α<sup>n</sup>* = 1) according to Equation (7).
      3. **if** *t<sub>n</sub>*(*α<sup>n</sup>* = 1) ≤ *T* **then** calculate *e<sub>n</sub>*(*α<sup>n</sup>* = 1) according to Equation (8); **if** *e<sub>n</sub>*(*α<sup>n</sup>* = 0) > *e<sub>n</sub>*(*α<sup>n</sup>* = 1) **then** *e<sub>n</sub>* = *e<sub>n</sub>*(*α<sup>n</sup>* = 1), **else** *e<sub>n</sub>* = *e<sub>n</sub>*(*α<sup>n</sup>* = 0).
      4. **else** *e<sub>n</sub>* = *e<sub>n</sub>*(*α<sup>n</sup>* = 0).
   2. Calculate the reward for the energy consumption of all IoT devices according to Equation (11).
3. **else** calculate the reward of the resource constraint according to Equation (10).
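To make the control flow concrete, the following is a minimal Python sketch of Algorithm 1. The helper functions `delay`, `energy`, `constraint_reward`, and `energy_reward` are illustrative stand-ins for Equations (7), (8), (10), and (11), respectively; they are not the paper's actual expressions.

```python
# Illustrative stand-ins for Equations (7), (8), (10), and (11); the real
# expressions depend on system parameters not reproduced here.
def delay(n, offload):       return 0.05 if offload else 0.20   # t_n, Eq. (7)
def energy(n, offload):      return 0.30 if offload else 1.00   # e_n, Eq. (8)
def constraint_reward():     return -1.0                        # penalty reward, Eq. (10)
def energy_reward(energies): return -sum(energies)              # energy reward, Eq. (11)

def compute_reward(N, T, constraints_satisfied):
    """Sketch of Algorithm 1: per IoT device, pick the execution mode
    (local: alpha_n = 0, offload: alpha_n = 1) with the lower energy,
    subject to the delay bound T, then map the total energy to a reward."""
    if not constraints_satisfied:                      # c2, c3 of constraint (9) violated
        return constraint_reward()
    energies = []
    for n in range(N):
        e_local = energy(n, offload=False)             # e_n(alpha_n = 0)
        if delay(n, offload=True) <= T:                # t_n(alpha_n = 1) <= T
            e_off = energy(n, offload=True)            # e_n(alpha_n = 1)
            energies.append(min(e_local, e_off))
        else:
            energies.append(e_local)                   # offloading misses the deadline
    return energy_reward(energies)

print(compute_reward(N=5, T=0.1, constraints_satisfied=True))
```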

#### 4.2.4. DDPG-Based Solution

The DDPG algorithm is a reinforcement learning method that combines AC and DQN. The specific network structure is shown in Figure 2, and the training process of the network is carried out according to the numbered steps in Figure 2. The input of the actor network is the state, and the output is the deterministic action value. The input of the critic network is the state and the action, and the output is the *Q* value. The actor network consists of the evaluation network *μ* with parameters *θ<sup>μ</sup>* and the target network *μ*′ with parameters *θ<sup>μ</sup>*′. The critic network consists of the evaluation network *Q* with parameters *θ<sup>Q</sup>* and the target network *Q*′ with parameters *θ<sup>Q</sup>*′. Since the experience replay method is adopted, the data (*s<sub>t</sub>*, *a<sub>t</sub>*, *s<sub>t+1</sub>*, *r<sub>t</sub>*) are stored in the replay buffer in the format (*s*, *a*, *s*′, *r*). The parameters of the critic network are updated by minimizing the loss,

$$Loss = \frac{1}{X} \sum\_{j=1}^{X} \left( y\_j - Q(s\_j, a\_j | \theta^Q) \right)^2 \tag{12}$$

$$y\_j = r\_j + \gamma \cdot Q'(s'\_j, \mu'(s'\_j | \theta^{\mu'}) | \theta^{Q'}) \tag{13}$$

where *X* denotes the size of the mini-batch, and *γ* denotes the discount factor. The actor network is updated according to the feedback of the critic network as follows:

$$\nabla\_{\theta^{\mu}} J \approx \frac{1}{X} \sum\_{j=1}^{X} \left( \nabla\_{a} Q(s\_j, a | \theta^{Q}) \big|\_{a = \mu(s\_j)} \cdot \nabla\_{\theta^{\mu}} \mu(s\_j | \theta^{\mu}) \right) \tag{14}$$

The DDPG framework has the characteristics of centralized training and decentralized execution. After training is completed, the state is input into the actor network to obtain the offloading decision and resource allocation scheme.
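For reference, the following is a minimal PyTorch sketch of one DDPG training step corresponding to Equations (12)–(14). The actor/critic modules, their target copies, the optimizers, and the sampled mini-batch are assumed to exist, and the soft-update rate `tau` is an illustrative choice not specified in this article.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, actor_target, critic, critic_target,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    """One DDPG training step on a mini-batch sampled from the replay buffer.

    batch = (s, a, s_next, r): tensors with leading dimension X (the mini-batch size).
    gamma is the discount factor of Eq. (13); tau is an assumed soft-update rate.
    """
    s, a, s_next, r = batch

    # Target value of Eq. (13): y_j = r_j + gamma * Q'(s'_j, mu'(s'_j)).
    with torch.no_grad():
        y = r + gamma * critic_target(s_next, actor_target(s_next))

    # Critic update: minimize the mean squared error of Eq. (12).
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: the sampled policy gradient of Eq. (14), implemented by
    # minimizing the negative critic value of the actor's own actions.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks (rate tau is an assumption).
    for net, target in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```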

**Figure 2.** Convergence property of different algorithms.

#### 4.2.5. Computational Complexity Analysis

Floating Point Operations (FLOPs) can be used to measure the computational complexity of an algorithm or model. The proposed algorithm is a reinforcement learning algorithm based on the DDPG framework, which consists of an actor network and a critic network. In this article, the actor network is composed of three fully connected layers, and the critic network is composed of four fully connected layers. The FLOPs of a fully connected layer is 2 × *I* × *Q*, where *I* denotes the number of input neurons and *Q* denotes the number of output neurons. Therefore, the FLOPs of the actor network is $\sum\_{m=1}^{3} 2 \times I\_m \times Q\_m$, and the FLOPs of the critic network is $\sum\_{m=1}^{4} 2 \times I\_m \times Q\_m$. Since DDPG has the characteristics of centralized training and decentralized execution, whether the proposed framework can be implemented in a real-time manner depends on the execution time of the actor network. For example, a single-core computer (2 GHz) has a computing capacity of about 2 billion FLOPs per second, which is more than enough to process the computation of the actor network under the network settings in this article. The specific network parameters are given in Section 5.1. Therefore, the proposed framework can be implemented in a real-time manner.
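As a quick sanity check, the short snippet below evaluates these sums for the layer sizes given in Section 5.1 (actor: 40 × 500, 500 × 128, 128 × 20; critic: 60 × 1024, 1024 × 512, 512 × 300, 300 × 1), using the 2 × I × Q approximation.

```python
def fc_flops(layers):
    """FLOPs of a stack of fully connected layers: 2 * I * Q per layer."""
    return sum(2 * i * q for i, q in layers)

# Layer sizes taken from Section 5.1.
actor_layers = [(40, 500), (500, 128), (128, 20)]
critic_layers = [(60, 1024), (1024, 512), (512, 300), (300, 1)]

actor_flops = fc_flops(actor_layers)    # 173,120 FLOPs per forward pass
critic_flops = fc_flops(critic_layers)  # 1,479,256 FLOPs per forward pass

# At roughly 2e9 FLOPs per second, one actor forward pass takes well under 0.1 ms.
print(f"actor:  {actor_flops:,} FLOPs")
print(f"critic: {critic_flops:,} FLOPs")
```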

#### *4.3. Large Timescale Policy Based on Federated Learning*

In this subsection, federated learning is introduced into the reinforcement learning framework. To protect privacy and security, users are reluctant to send their data to the central cloud server; however, in neural network training, more data generally brings better performance. Federated learning addresses both concerns. It is essentially a distributed machine learning technique whose goal is to realize joint modeling and improve the performance of Artificial Intelligence (AI) models while ensuring data privacy, security, and legal compliance. Since different blocks of the smart city run the same application types but serve different users, horizontal federated learning is adopted in this article.

Horizontal federated learning can be regarded as sample-based distributed training, in which the data are distributed across different machines. Each machine downloads the model from the central server, trains it with local data, and then returns the trained parameters to the central server for aggregation. In this process, each machine holds the same complete model and can work independently. The aggregation of the network parameters is given by

$$\Theta = \frac{1}{\sum\_{k=1}^{K} D\_k} \sum\_{k=1}^{K} D\_k \Theta\_k \tag{15}$$

where *Dk* denotes the number of training samples on the *k*-th MEC server, Θ*<sup>k</sup>* and Θ denote the parameter sets of the *k*-th MEC server and the central cloud, respectively. Specifically, the two-timescale training process is summarized in Algorithm 2.
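A minimal sketch of the aggregation rule in Equation (15) is given below, assuming each MEC server reports its parameters as a PyTorch state dict together with its sample count *D<sub>k</sub>*; the function and variable names are illustrative rather than part of the proposed framework.

```python
import torch

def federated_average(state_dicts, sample_counts):
    """Weighted aggregation of K local parameter sets, Eq. (15):
    Theta = (1 / sum_k D_k) * sum_k D_k * Theta_k."""
    total = float(sum(sample_counts))
    aggregated = {}
    for key in state_dicts[0]:
        aggregated[key] = sum(
            (d_k / total) * sd[key].float()
            for sd, d_k in zip(state_dicts, sample_counts)
        )
    return aggregated

# Usage sketch: the central cloud aggregates the parameters reported by the
# K MEC servers, and each server then loads the global parameters back.
# global_params = federated_average([m.state_dict() for m in local_models],
#                                   sample_counts=[d_1, d_2, d_3])
# for m in local_models:
#     m.load_state_dict(global_params)
```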


### **5. Performance Evaluation**

### *5.1. Parameter Setting*

In this section, we evaluate the performance of our proposed algorithm for the smart city. The experimental platform is a DELL PowerEdge server (DELL-R940XA, 4\*GOLD-5117, RTX2080Ti). The simulation software is PyCharm (Professional Edition), and the corresponding environment configuration is Python 3.7.6, CUDA 11.4, and PyTorch 1.5.0. The actor network is composed of three fully connected layers (40 × 500, 500 × 128, 128 × 20), and the critic network is composed of four fully connected layers (60 × 1024, 1024 × 512, 512 × 300, 300 × 1). The activation function is ReLU, and the output layer of the actor network uses the tanh function to constrain the output values. The simulation parameters of the system are presented in Table 2. The compared algorithms are as follows.
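For concreteness, the sketch below instantiates the actor and critic networks with the layer sizes and activations listed above. The assumption that the critic's 60-dimensional input is the concatenation of the 40-dimensional state and the 20-dimensional action is ours and is not stated explicitly in this article.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Three fully connected layers (40 x 500, 500 x 128, 128 x 20) with ReLU
    activations and a tanh output layer, as described in Section 5.1."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(40, 500), nn.ReLU(),
            nn.Linear(500, 128), nn.ReLU(),
            nn.Linear(128, 20), nn.Tanh(),   # tanh constrains the action values
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Four fully connected layers (60 x 1024, 1024 x 512, 512 x 300, 300 x 1).
    The 60-dimensional input is assumed to be state (40) concatenated with action (20)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(60, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 300), nn.ReLU(),
            nn.Linear(300, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```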

