#### *3.1. Task Model*

In the smart city scenario, there are many different types of applications (such as smart security, smart traffic, smart parking, and smart lamps). These applications have lower real-time requirements than AR applications. Therefore, we set the same delay threshold for all of these applications, denoted by *T*. To describe the parameters of each application task, we define the tuple *φ<sub>n</sub>* = (*ω<sub>n</sub>*, *ϕ<sub>n</sub>*), where *ω<sub>n</sub>* and *ϕ<sub>n</sub>* denote the data size (bits) and the computing workload (CPU cycles) of the task generated by IoT device *n*, respectively. The two are related by *ϕ<sub>n</sub>* = *η<sub>n</sub>* • *ω<sub>n</sub>*, where *η<sub>n</sub>* denotes the computing workload per bit. In this article, the offloading decision is denoted by *α<sub>n</sub>* ∈ {0, 1}. If *α<sub>n</sub>* = 0, the task requested by IoT device *n* is not offloaded and is processed on IoT device *n* itself. If *α<sub>n</sub>* = 1, the task is offloaded to the MEC server. The important notations used in the rest of this article are summarized in Table 1.



**Figure 1.** System Model.

#### *3.2. Communication Model*

In this article, we consider a system that uses OFDMA as the multiple access technology, in which the system bandwidth *B* is divided into *D* equal orthogonal sub-bands. Because OFDMA allocates subcarriers exclusively, interference from other IoT devices is ignored [25,32–34]. A sub-band can be allocated to only one IoT device, but an IoT device can be allocated multiple sub-bands. Since the amount of data returned to the IoT device after processing is very small, the time consumed by downlink transmission is not considered. Let *B<sub>n</sub>* denote the number of sub-bands allocated to IoT device *n*, and let *p<sub>n</sub>* denote the transmission power of IoT device *n*. Let *h<sub>n</sub>* denote the uplink channel gain between the base station and IoT device *n* over a white Gaussian noise channel, which incorporates a distance-based path loss model and independent Rayleigh fading. Then, the uplink transmission rate *r<sub>n</sub><sup>up</sup>* can be calculated by

$$r\_n^{\text{up}} = B\_n \frac{B}{D} \log\_2(1 + \frac{p\_n h\_n}{\delta^2}) \tag{1}$$

where *δ*<sup>2</sup> denotes the noise power. Therefore, the transmission time *t<sub>n</sub><sup>up</sup>* and the energy consumption *e<sub>n</sub><sup>up</sup>* of uplink transmission can be calculated by

$$t\_n^{\text{up}} = \frac{\omega\_n}{r\_n^{\text{up}}} \tag{2}$$

$$e\_n^{\text{up}} = t\_n^{\text{up}} \cdot p\_n \tag{3}$$
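As a minimal numerical sketch of Equations (1)–(3), the following Python snippet evaluates the uplink rate, transmission time, and transmission energy for a single IoT device. The class name `Task`, the function names, and all numeric values (bandwidth, power, channel gain, noise power) are illustrative assumptions and are not taken from this article.

```python
import math
from dataclasses import dataclass


@dataclass
class Task:
    """Task tuple of IoT device n (names are illustrative)."""
    data_bits: float       # omega_n, data size in bits
    cycles_per_bit: float  # eta_n, computing workload per bit

    @property
    def cycles(self) -> float:
        # phi_n = eta_n * omega_n
        return self.cycles_per_bit * self.data_bits


def uplink_rate(num_subbands, bandwidth_hz, total_subbands,
                tx_power_w, channel_gain, noise_power_w):
    """Eq. (1): r_up = B_n * (B / D) * log2(1 + p_n * h_n / delta^2)."""
    snr = tx_power_w * channel_gain / noise_power_w
    return num_subbands * (bandwidth_hz / total_subbands) * math.log2(1.0 + snr)


def uplink_time_and_energy(task, rate_bps, tx_power_w):
    """Eqs. (2)-(3): t_up = omega_n / r_up and e_up = t_up * p_n."""
    t_up = task.data_bits / rate_bps
    return t_up, t_up * tx_power_w


# Assumed example values: 10 MHz split into 10 sub-bands, 2 of them allocated.
task = Task(data_bits=1e6, cycles_per_bit=100.0)
rate = uplink_rate(num_subbands=2, bandwidth_hz=10e6, total_subbands=10,
                   tx_power_w=0.1, channel_gain=3e-9, noise_power_w=1e-10)
t_up, e_up = uplink_time_and_energy(task, rate, tx_power_w=0.1)
print(f"rate={rate:.3e} bit/s, t_up={t_up:.3f} s, e_up={e_up:.4f} J")
```

Doubling the number of allocated sub-bands doubles the rate and therefore halves both the uplink time and the uplink energy, which is why the bandwidth allocation appears as a decision variable in the optimization problem.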

#### *3.3. Computation Model*

In this article, the task generated by an IoT device can be offloaded to the MEC server to reduce the energy consumption of the IoT device when the network is in a good state. If the network state is bad, the task can only be executed on the IoT device. The two cases are described in detail below.

#### 3.3.1. Processing at MEC Server

Let *f<sub>n</sub>* denote the computing resources allocated by the MEC server to the task generated by IoT device *n*. Then, the execution time *t<sub>n</sub><sup>MEC</sup>* can be calculated by

$$t\_n^{\text{MEC}} = \frac{\varphi\_n}{f\_n} \tag{4}$$

#### 3.3.2. Processing at IoT Device

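According to the optimization objective, if the task is processed on the IoT device, the energy consumption is smallest when the delay equals the delay threshold: the energy grows with the square of the CPU frequency, so the device should run at the lowest frequency that still finishes the task within *T*. Therefore, the processing time *t<sub>n</sub><sup>IoT</sup>* and the energy consumption *e<sub>n</sub><sup>IoT</sup>* can be calculated by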

$$t\_n^{\text{IoT}} = T \tag{5}$$

$$e\_n^{\text{IoT}} = \kappa \cdot (\frac{\varphi\_n}{T})^2 \cdot \varphi\_n \tag{6}$$

where *κ* is the energy coefficient, which depends on the chip architecture [35–37]. In this article, according to the work in [38], we set *κ* = 10<sup>−25</sup>.
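The computation model can be checked numerically with a short Python sketch. It evaluates the MEC execution time of Equation (4) and the local processing time and energy of Equations (5) and (6), and then compares the local energy at the delay threshold with the energy of finishing in half that time. The function names and the workload of 10<sup>8</sup> cycles are assumptions made for illustration; only *κ* = 10<sup>−25</sup> comes from the text.

```python
KAPPA = 1e-25  # energy coefficient kappa, value stated in the text


def mec_execution_time(cycles, allocated_cycles_per_s):
    """Eq. (4): t_MEC = phi_n / f_n."""
    return cycles / allocated_cycles_per_s


def local_time_and_energy(cycles, delay_threshold_s, kappa=KAPPA):
    """Eqs. (5)-(6): run at frequency phi_n / T so the task ends exactly at T."""
    frequency = cycles / delay_threshold_s
    energy = kappa * frequency ** 2 * cycles
    return delay_threshold_s, energy


cycles = 1e8  # assumed workload phi_n in CPU cycles
_, energy_at_threshold = local_time_and_energy(cycles, delay_threshold_s=1.0)
_, energy_if_faster = local_time_and_energy(cycles, delay_threshold_s=0.5)
print(energy_at_threshold, energy_if_faster)
```

Because the energy scales with the inverse square of the completion time, finishing twice as fast costs four times the energy, which is consistent with setting the local processing time equal to *T*.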

#### **4. Two-Timescale Joint Optimization of Task Offloading and Resource Allocation**

In this section, the joint optimization of task offloading and resource allocation is formulated as a Markov Decision Process (MDP), and a deep reinforcement learning algorithm based on the DDPG framework is proposed to solve it. To protect user privacy and improve the training performance of the deep neural network, federated learning is introduced into the training model, which yields a two-timescale federated reinforcement learning algorithm. On the small timescale, each MEC server trains its own scheme for task offloading and resource allocation. On the large timescale, the trained model parameters are aggregated on the central cloud server. The two timescales are executed alternately to obtain better training performance. In this article, since the central cloud server and the MEC servers are connected by a wired network, the time consumed by uploading parameters is not considered. The problem formulation and the solution are described in detail as follows.
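The alternation between the two timescales can be illustrated with a small Python sketch. It is only a schematic under stated assumptions: model parameters are represented as flat lists of floats, the aggregation rule is taken to be an equal-weight average (the aggregation rule is not specified in this section), and `local_step` is a hypothetical placeholder for one round of DDPG training on an MEC server.

```python
def average_parameters(models):
    """Element-wise mean of parameter vectors (equal weights are an assumption)."""
    count = len(models)
    return [sum(values) / count for values in zip(*models)]


def two_timescale_training(models, local_step, rounds, local_steps_per_round):
    """Alternate local training (small timescale) and aggregation (large timescale)."""
    for _ in range(rounds):
        # Small timescale: every MEC server updates its own model locally.
        for _ in range(local_steps_per_round):
            models = [local_step(index, params)
                      for index, params in enumerate(models)]
        # Large timescale: the cloud server aggregates and redistributes.
        global_params = average_parameters(models)
        models = [list(global_params) for _ in models]
    return models


# Toy stand-in for local training: move halfway toward a per-server target.
targets = [[0.0], [2.0]]


def toy_local_step(index, params):
    return [p + 0.5 * (t - p) for p, t in zip(params, targets[index])]


print(two_timescale_training([[0.0], [0.0]], toy_local_step,
                             rounds=1, local_steps_per_round=1))
```

Only model parameters cross the boundary between MEC servers and the cloud in this sketch; the local training data never leaves the server, which is the privacy argument made above.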

#### *4.1. Problem Formulation*

According to the above computation and communication models, the total time consumption and the energy consumption can be calculated by

$$t\_n = \alpha\_n \bullet \left( t\_n^{\text{up}} + t\_n^{\text{MEC}} \right) + (1 - \alpha\_n) \bullet t\_n^{\text{IoT}} \tag{7}$$

$$e\_n = \alpha\_n \cdot e\_n^{\rm up} + (1 - \alpha\_n) \cdot e\_n^{\rm IoT} \tag{8}$$

The problem of minimizing the energy consumption of all IoT devices, subject to the latency requirement and the limited resources, is formulated as follows:

$$\begin{array}{ll}\min\limits\_{B\_{n},f\_{n},\alpha\_{n}}\sum\_{n=1}^{N}e\_{n}\\\text{s.t.}\\(c1)\quad t\_{n} \leq T\\(c2)\quad \sum\_{n=1}^{N}B\_{n} \leq B\\(c3)\quad \sum\_{n=1}^{N}f\_{n} \leq F^{\text{MEC}}\\(c4)\quad \alpha\_{n} \in \{0,1\}\end{array} \tag{9}$$

where *F*<sup>MEC</sup> denotes the total computing resources of the MEC server. Constraint (*c*1) indicates that the execution time of the task of IoT device *n* cannot exceed the delay threshold, which ensures the QoE. We assume that a satisfactory user experience is obtained as long as the processing time of the IoT task is within the delay threshold. For example, in a community access control system, if the delay threshold of the face recognition system is 0.1 s, the user is satisfied as long as face recognition completes within 0.1 s. Since users perceive the same QoE whether face recognition completes in 0.1 s or in 0.01 s, there is no need to pursue a lower processing time. Constraint (*c*2) indicates that the number of allocated sub-bands cannot exceed the total bandwidth of the base station. Constraint (*c*3) indicates that the computing resources allocated to all IoT devices by the MEC server cannot exceed the total computing resources of the MEC server. Constraint (*c*4) indicates that the task of IoT device *n* is processed either on the IoT device (*α<sub>n</sub>* = 0) or on the MEC server (*α<sub>n</sub>* = 1).
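To make the formulation concrete, the following Python sketch evaluates Equations (7) and (8) for one device and checks constraints (*c*1)–(*c*4) for a candidate solution. The function and parameter names are illustrative assumptions; `band_limit` stands for the right-hand side of (*c*2) and `mec_capacity` for *F*<sup>MEC</sup>.

```python
def total_time(alpha, t_up, t_mec, t_iot):
    """Eq. (7): offloaded tasks pay uplink plus MEC time, local tasks pay t_IoT."""
    return alpha * (t_up + t_mec) + (1 - alpha) * t_iot


def total_energy(alpha, e_up, e_iot):
    """Eq. (8): the device spends either uplink energy or local computing energy."""
    return alpha * e_up + (1 - alpha) * e_iot


def is_feasible(times, subbands, cpu_shares, alphas,
                delay_threshold, band_limit, mec_capacity):
    """Check constraints (c1)-(c4) of problem (9) for a candidate solution."""
    return (all(t <= delay_threshold for t in times)       # (c1)
            and sum(subbands) <= band_limit                # (c2)
            and sum(cpu_shares) <= mec_capacity            # (c3)
            and all(a in (0, 1) for a in alphas))          # (c4)


# Assumed example: device 0 offloads, device 1 computes locally.
times = [total_time(1, 0.02, 0.03, 0.1), total_time(0, 0.0, 0.0, 0.1)]
energies = [total_energy(1, 0.002, 0.1), total_energy(0, 0.0, 0.1)]
print(sum(energies),
      is_feasible(times, [2, 0], [1e9, 0.0], [1, 0],
                  delay_threshold=0.1, band_limit=10, mec_capacity=4e9))
```

The objective value of problem (9) is the sum of the per-device energies, and a solution is only admissible when the feasibility check passes.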

#### *4.2. Small Timescale Policy Based on Deep Reinforcement Learning*

In this subsection, the joint optimization problem is modeled as an MDP, and a deep reinforcement learning algorithm based on the DDPG framework is proposed to solve it. Since the standard MDP is the common model of reinforcement learning, the elements of the MDP are introduced in detail below.

#### 4.2.1. State Space

State is the description of the environment, which changes after the agent takes an action. In this article, the MEC server is modeled as an agent that optimizes the energy consumption of all IoT devices. Let *s<sub>t</sub>* = (*s<sub>t</sub><sup>1</sup>*, *s<sub>t</sub><sup>2</sup>*, ..., *s<sub>t</sub><sup>U</sup>*) denote the state of the MDP at time *t*. The state includes four parts: (1) the task size, the computing workload, and the channel state of all IoT devices; (2) the computing resources of the MEC server; (3) the bandwidth of the base station; (4) the resource allocation scheme at the current time. All values in the state lie in the range [0, 1].

#### 4.2.2. Action Space

Action is the description of the agent's behavior, which is the result of the optimization scheme. Let *a<sub>t</sub>* = (*a<sub>t</sub><sup>1</sup>*, *a<sub>t</sub><sup>2</sup>*, ..., *a<sub>t</sub><sup>L</sup>*) denote the action of the MDP at time *t*, which consists of the changes to the computing and communication resources. The elements of the action correspond one-to-one to Part 4 of the state. All values in the action lie in the range [−1, 1].
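One possible reading of the relation between the action and Part 4 of the state is an additive update, sketched below in Python. The update rule and the scaling factor `step_size` are assumptions, since this section does not state how an action is applied. The result is deliberately not clipped to [0, 1] here, on the assumption that out-of-range values are handled by the reward rather than by the environment.

```python
def apply_action(allocation, action, step_size=0.1):
    """Shift each allocation entry by step_size times the matching action entry.

    allocation: Part 4 of the state, one value per resource share.
    action:     one value in [-1, 1] per allocation entry.
    step_size:  hypothetical scaling factor, not taken from the article.
    """
    if len(allocation) != len(action):
        raise ValueError("action must match Part 4 of the state one-to-one")
    if any(not -1.0 <= a <= 1.0 for a in action):
        raise ValueError("action values must lie in [-1, 1]")
    return [s + step_size * a for s, a in zip(allocation, action)]


print(apply_action([0.5, 0.25], [1.0, -1.0], step_size=0.25))
```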

#### 4.2.3. Reward

Reward is the feedback from the environment to the agent after the agent takes an action. Let *r<sub>t</sub>* denote the reward of the MDP at time *t*. The objective of this article is to minimize the energy consumption of all IoT devices subject to the system resource constraints and the delay threshold. Therefore, the reward is defined in two progressive steps. The first step enforces the system resource constraints, as follows:

$$\begin{aligned} r &= \chi\_1 \bullet \sum\_{n=1}^{U} \left( (s^n - 1) \bullet c(s^n - 1) - s^n \bullet c(-s^n) \right) \\ &+ \chi\_2 \bullet \left( \sum\_{n=1}^{N} B\_n - B \right) \bullet c \left( \sum\_{n=1}^{N} B\_n - B \right) \\ &+ \chi\_3 \bullet \left( \sum\_{n=1}^{N} f\_n - F^{\text{MEC}} \right) \bullet c \left( \sum\_{n=1}^{N} f\_n - F^{\text{MEC}} \right) + b\_1 \end{aligned} \tag{10}$$

The second step is to minimize the energy consumption of all IoT devices, as follows:

$$r = \chi\_4 \bullet \exp\left(-\sum\_{n=1}^{N} e\_n / N\right) \tag{11}$$

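where *χ*<sub>1</sub>, *χ*<sub>2</sub>, *χ*<sub>3</sub>, *χ*<sub>4</sub>, and *b*<sub>1</sub> are constants, chosen so that the reward guides the agent first toward allocations that satisfy the resource constraints and then toward lower energy consumption. Specifically, the reward setting algorithm is illustrated in Algorithm 1.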
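The two reward expressions can be sketched in Python as follows. This is a hedged illustration rather than a reproduction of Algorithm 1: the function *c*(·) in Equation (10) is not defined in this section, so it is assumed here to be a unit step function (1 for positive arguments, 0 otherwise), and the rule for switching between the two steps is left out. All constant values are hypothetical.

```python
import math


def step(x):
    """Assumed meaning of c(x) in Eq. (10): 1 when x > 0, otherwise 0."""
    return 1.0 if x > 0 else 0.0


def constraint_reward(state, subbands, cpu_shares, band_limit, mec_capacity,
                      chi1, chi2, chi3, b1):
    """Eq. (10): weight the amount by which range and resource limits are exceeded."""
    # Distance of each state entry above 1 or below 0 (zero when inside [0, 1]).
    range_term = sum((s - 1) * step(s - 1) - s * step(-s) for s in state)
    band_excess = sum(subbands) - band_limit
    cpu_excess = sum(cpu_shares) - mec_capacity
    return (chi1 * range_term
            + chi2 * band_excess * step(band_excess)
            + chi3 * cpu_excess * step(cpu_excess)
            + b1)


def energy_reward(energies, chi4):
    """Eq. (11): reward decays exponentially with the mean device energy."""
    return chi4 * math.exp(-sum(energies) / len(energies))


# Hypothetical constants: negative weights turn violations into penalties.
print(constraint_reward([0.5, 1.5, -0.5], [1, 2], [1.0], 4, 2.0,
                        chi1=-1.0, chi2=-1.0, chi3=-1.0, b1=0.0))
print(energy_reward([0.1, 0.3], chi4=1.0))
```

Under the step-function assumption, every term of Equation (10) vanishes when the state stays in [0, 1] and both resource budgets are respected, so the first-step reward reduces to the constant *b*<sub>1</sub> for feasible allocations.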
