Article

Multi-Agent Deep Reinforcement Learning-Based Partial Task Offloading and Resource Allocation in Edge Computing Environment

1 School of Computer Technology and Engineering, Changchun Institute of Technology, Changchun 130012, China
2 College of Computer Science and Engineering, Changchun University of Technology, Changchun 130012, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2022, 11(15), 2394; https://doi.org/10.3390/electronics11152394
Submission received: 29 June 2022 / Revised: 27 July 2022 / Accepted: 27 July 2022 / Published: 31 July 2022
(This article belongs to the Section Networks)

Abstract

In the dense data communication environment of 5G wireless networks, the number of request computation tasks generated by intelligent wireless mobile nodes increases dramatically, and the computation ability of these nodes cannot meet the requirements of low latency and high reliability. Mobile edge computing (MEC) can use its powerful servers to tackle the computation tasks offloaded by a wireless node (WN); because the MEC server is physically closer to the WN, the requirements of low latency and high reliability can be met. In this paper, we implement an MEC framework with multiple WNs and multiple MEC servers that considers the randomness and divisibility of the request tasks arriving from WNs, the time-varying channel state between each WN and MEC server, and the different priorities of tasks. For the proposed MEC system, we present a decentralized multi-agent deep reinforcement learning-based partial task offloading and resource allocation algorithm (DeMADRL) to minimize the long-term weighted cost, which includes the delay cost and the bandwidth cost. DeMADRL is a model-free scheme based on Double Deep Q-Learning (DDQN) and obtains the optimal computation offloading and bandwidth allocation policy by training neural networks. Comprehensive simulation results show that the proposed DeMADRL scheme converges well and outperforms three baseline algorithms.

1. Introduction

With the rapid development of 5G, the computation power of a wireless node (WN) or intelligent mobile terminal cannot meet the requirements of computation and power consumption [1]. In a dense data communication environment, it is difficult to guarantee the real-time reliability of requested computation tasks, especially computation-intensive or latency-sensitive tasks [2,3]. A WN can offload its request tasks to a cloud computing or mobile cloud computing server so that the tasks are completed by the cloud server's powerful computation capability [4]. Lu et al. proposed a greedy algorithm to schedule the energy consumption and computation resources of the cloud server and to trade off the offloading energy consumption against the task completion deadline [5]. Guo et al. formulated a mobile cloud computing-based financial data management system to optimize complex financial data [6]. The tricky problem with cloud computing and mobile cloud computing is the backhaul delay caused by the long physical distance between the WN and the cloud server, together with the cloud-side resource scheduling burden caused by complex tasks. For latency-sensitive request tasks such as autonomous driving and remote online medical treatment, the user experience is poor and the completion rate of most request tasks cannot be satisfied.
Mobile edge computing (MEC) differs from traditional cloud computing and mobile cloud computing in that the computation resources and applications located in the traditional cloud are sunk to servers at the edge of the mobile communication network, closer to the physical location of the WN [7,8]. To this end, the cloud can deploy computation resources and applications to the MEC servers it covers in a distributed manner according to actual needs [9,10]. Therefore, in the MEC environment, WNs can offload the computation-intensive or energy-intensive tasks they generate to the MEC server and hand them over to the edge server for processing [11]. Provided the server has sufficient capacity, request computation tasks can be completed within their stringent deadlines, thereby effectively reducing the energy consumption of WNs, improving the rate of tasks completed within their deadlines, and reducing task processing delay [12]. Li et al. introduced a task offloading method based on a heuristic algorithm for minimizing energy consumption in the MEC environment [13]. Tian et al. proposed a reinforcement learning-based computation offloading and resource allocation decision-making scheme to minimize the computation resource cost, the energy consumption cost, and the penalty cost of unfinished tasks [14]. Chen et al. designed an unmanned vehicle-assisted MEC framework that considers vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V) offloading modes; furthermore, a greedy scheme was proposed to obtain the optimal offloading strategy [15]. However, the aforementioned schemes follow a binary offloading pattern, that is, the request tasks are either offloaded in full to the MEC server for processing or executed entirely by the local processor. Once a task is very large or divisible, a binary offloading scheme is inefficient for completing it.
Currently, there are some related works on partial computation offloading in the MEC environment. Kuang et al. jointly optimized partial offloading and resource allocation in MEC frameworks based on Lagrangian dual decomposition and a greedy policy [16]. Li et al. transformed the delay and energy consumption minimization problem into an alternating decomposition and iterative power allocation optimization and obtained a near-optimal partial offloading and resource allocation decision-making policy [17]. Saleem et al. merged device-to-device (D2D) communication into the MEC environment and proposed a partial computation offloading method based on joint partial offloading and resource allocation (JPORA) by decomposing the original optimization problem; the proposed scheme minimizes the long-term delay cost under time and energy consumption constraints [18]. It is worth noting that most current works on partial offloading are based on heuristic search algorithms or on decomposing nonconvex optimization problems into multiple subproblems. However, when the channel state between the WN and the MEC server is time-varying, or the arriving tasks vary over time in size and processing priority, the generality and robustness of these algorithms are debatable.
Reinforcement learning (RL) enables an agent to learn what to do, i.e., how to map states to actions, so as to maximize the expected cumulative reward over episodes [19]. Typical RL algorithms include the off-policy Q-Learning and the on-policy SARSA or SARSA($\lambda$) [20]. Deep reinforcement learning (DRL) merges the perception capability of deep learning into the decision-making capability of RL. DRL uses neural networks to approximate the optimization function, approaching the optimal result by updating the network parameters [21]. Typical DRL algorithms include Deep Q-Learning (DQN) [22], Double Deep Q-Learning (DDQN) [23], Dueling Deep Q-Learning (Dueling DQN) [24], Actor–Critic (AC) [25,26], Policy Gradient (PG) [27], Proximal Policy Optimization (PPO) [28], and Deep Deterministic Policy Gradient (DDPG) [29]. In recent years, there have been several related works on RL- and DRL-based task offloading and resource allocation. Zhang et al. presented a DRL-based joint computation offloading and flying trajectory scheme for an unmanned aerial vehicle (UAV) to minimize the long-term energy consumption of a UAV-assisted MEC system [30]. Yang et al. constructed a caching-enabled MEC model to save the computation resources required by popular tasks; furthermore, a joint offloading and caching policy based on DDPG and a collaborative caching algorithm was proposed to minimize the total latency and energy consumption costs [31]. Yang et al. implemented a two-stage learning-based scheme comprising machine learning on the local device and further learning on the MEC server; the optimization problem was modeled as a Markov decision process (MDP), and then DQN and DDPG were used to optimize the offloading decision-making policy and the power allocation policy, respectively, to minimize the total execution delay, processing accuracy, and energy consumption of the MEC model [32]. These DRL-based optimization schemes can obtain an optimal or near-optimal offloading policy by training neural networks, but they either do not consider the time-varying channel state and the randomness of tasks, or they ignore the divisibility and priority of tasks.
Different from the existing related works, we formulate an MEC framework with multiple WNs and multiple MEC servers. To the best of our knowledge, this is the first work that considers the randomness and divisibility of the request tasks arriving from WNs, the time-varying channel state between each WN and MEC server, and the different priorities of tasks in the proposed MEC environment. Furthermore, we propose a decentralized multi-agent deep reinforcement learning-based partial task offloading and resource allocation scheme (DeMADRL) to minimize the long-term weighted cost of the MEC framework. The main contributions of this work are as follows:
  • First, we formulate a wireless computation and communication system with multiple WNs and multiple edge servers. Each WN generates divisible request tasks with a certain probability at a certain time. The tasks can be partially offloaded to the MEC server and partially executed by the local processor according to the priorities of the tasks and the channel state between the WN and the MEC server.
  • Next, we transform the cost minimization problem into an MDP; the state space, the action space, and the reward function are then described in detail. Due to the lack of a priori knowledge about the state transition probability matrix, the problem is solved by a DDQN-based computation offloading and resource allocation scheme.
  • Furthermore, to minimize the long-term weighted cost of the MEC framework, we propose a decentralized multi-agent deep reinforcement learning-based partial task offloading and resource allocation scheme (DeMADRL), which can learn the optimal decision-making policy according to the different priorities and sizes of each WN's tasks and the time-varying channel states between WNs and MEC servers.
  • Finally, we present a comprehensive simulation setup, comparisons, and results to verify the convergence and advantages of the proposed DeMADRL over three baseline algorithms.
The remainder of this paper is organized as follows. Section 2 elaborates on the network model, communication model, and computation model of the proposed MEC system. In Section 3, we state the optimization problem. Section 4 describes the DRL-based optimization, including the optimization design and the DRL-based scheme. In Section 5, extensive numerical results are presented. Finally, Section 6 concludes the paper.

2. System Model

In this section, we describe the proposed MEC system model in detail including the network model and formulation, communication model, and computation model.

2.1. Network Model and Formulation

First, we design an MEC communication and computation model composed of multiple wireless nodes (WNs), multiple MEC servers, and a cloud server center. Massive numbers of WNs, including smartwatches, unmanned vehicles, and even virtual reality/augmented reality (VR/AR) devices, are scattered within a certain communication region. These WNs generate computation-intensive request tasks such as high-definition facial recognition images, online navigation maps, etc. Limited by the insufficient computation resources of a WN's local processor, many tasks cannot be completed within their stringent deadlines. Within a certain region, multiple MEC servers, each equipped with a macro base station, are fixed at certain locations. In this case, WNs can communicate with the macro base station and offload their computation-intensive tasks to the MEC server, which receives the offloaded tasks and executes them on its processor. Thanks to its sufficient computation and communication resources, the request tasks generated by WNs can easily be finished within the required deadline. When too many WNs are served by the MEC servers, tasks can be transmitted to the cloud server for processing, but the backhaul delay may not be guaranteed.
As shown in Figure 1, we define the set of WNs as $n \in \mathcal{W}$ and the set of MEC servers as $e \in \mathcal{E}$; that is, the total numbers of WNs and MEC servers are $W$ and $E$, respectively. Each WN generates computation tasks of a certain size and type with a certain probability, and these tasks must be completed within a certain deadline. Under the traditional binary offloading pattern, if WN $n$ generates task $k_n$ and is covered by MEC server $e$, the task $k_n$ is either offloaded to MEC server $e$ via the linked macro base station or executed by the local processor. However, if the task is divisible or its size is too large, binary offloading cannot meet the delay requirement. Here, we consider the partial offloading pattern, that is, task $k_n$ can be divided into $K$ subtasks, which can be offloaded to the MEC server or executed by the local processor. Without loss of generality, we define $\rho_n \in [0,1]$ as the offloading ratio of task $k_n$. Therefore, if WN $n$ generates task $k_n$ in a certain time slot, the fraction $\rho_n$ of task $k_n$ is offloaded to the MEC server, and the fraction $1-\rho_n$ is executed by the local processor of WN $n$.
Let us take the example of task processing generated by unmanned vehicles. Currently, many CPU chipsets are used in unmanned vehicles, such as the NVIDIA DRIVE Atlan, Intel Mobileye EyeQ5, and Tesla FSD, and most of these chips are based on the ARM architecture. For instance, an unmanned vehicle (Tesla Model 3) with cameras, LiDAR, and radar can generate a large amount of data (pedestrian images, road images, digital maps) in real time. The vehicle's own chips can process the data with preinstalled applications (such as a pedestrian recognition program) and even perform AI training on the GPU. However, with the surge in the data volume generated by the sensors, especially when training is involved, most of the data needs to be offloaded to edge servers for processing in order to meet real-time requirements (the stringent delay constraint).

2.2. Communication Model

For convenience of processing, we discretize the continuous task processing period $\mathcal{T}$ into multiple time segments, i.e., time slots $t \in \mathcal{T}$. As described in Section 2.1, the request tasks generated by WNs can be transmitted to the macro base station of the MEC server. Here, we assume that the communication between WNs and MEC servers follows orthogonal frequency division multiple access (OFDMA); that is, the total bandwidth of the macro base station connected to MEC server $e$ is $B_e$, which is divided into $L$ subchannels. We assume that the channel state between WN $n$ and MEC server $e$ is time-varying in each time slot and follows a Markov distribution; the channel state can be modeled as [33]
$$ h_{n,e}(t) = \phi_e \frac{1}{D_{n,e}(t)^{\delta_n}} P_{n,e} \tag{1} $$
where $\phi_e$ is the path-loss coefficient, $D_{n,e}(t)$ is the physical distance between WN $n$ and MEC server $e$, $\delta_n$ is a constant, and $P_{n,e}$ is the predefined transition probability matrix of the channel state. For instance, let $[64, 128, 192, 256, 512]$ be the set of channel states between a WN and an MEC server. In time slot $t$, the channel state $h_{n,e}(t)$ is 128; in the next time slot $t+1$, the channel state $h_{n,e}(t+1)$ transitions to, for example, 256 according to the transition probabilities, such as $[0.2, 0.2, 0.2, 0.2, 0.2]$. In our experiments, the set of channel states is large (e.g., 100 or 200 states), and in each time slot the state transitions within this set according to the given probabilities; we use this approach to simulate the time-varying channel state.
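For illustration, the following Python sketch simulates this Markov-distributed channel state; the state set and the uniform transition probabilities are the example values quoted above, and the helper name is ours.

```python
import numpy as np

# Discrete set of channel states between a WN and an MEC server (example values from the text).
CHANNEL_STATES = np.array([64, 128, 192, 256, 512])

# Uniform transition probabilities, matching the example [0.2, 0.2, 0.2, 0.2, 0.2].
TRANSITION_PROBS = np.full(len(CHANNEL_STATES), 1.0 / len(CHANNEL_STATES))

def next_channel_state(rng: np.random.Generator) -> int:
    """Draw the channel state h_{n,e}(t+1) of the next time slot."""
    return int(rng.choice(CHANNEL_STATES, p=TRANSITION_PROBS))

rng = np.random.default_rng(seed=0)
h_t = 128                          # channel state h_{n,e}(t) in the current slot
h_next = next_channel_state(rng)   # channel state h_{n,e}(t+1), e.g., 256
```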
Therefore, according to the OFDMA characteristic and the channel state model [34,35], the achievable uplink and downlink transmission rates between WN $n$ and MEC server $e$ can be derived as
$$ r_{n,e}^{off\text{-}up}(t) = l_e^{off\text{-}up} B_e \log_2\!\left(1 + \frac{p_{n,e}^{off\text{-}up}(t)\, h_{n,e}(t)}{\sigma_e(t)^2}\right) \tag{2} $$
$$ r_{n,e}^{off\text{-}down}(t) = l_e^{off\text{-}down} B_e \log_2\!\left(1 + \frac{p_{n,e}^{off\text{-}down}(t)\, h_{n,e}(t)}{\sigma_e(t)^2}\right) \tag{3} $$
where $l_e^{off\text{-}up}$ and $l_e^{off\text{-}down}$ are the ratios of bandwidth allocated for transmitting the tasks of WN $n$, $p_{n,e}^{off\text{-}up}(t)$ and $p_{n,e}^{off\text{-}down}(t)$ are the uplink and downlink transmission power, respectively, and $\sigma_e(t)^2$ is the noise power.
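As a concrete reading of Eq. (2), the sketch below computes the achievable uplink rate from the allocated bandwidth ratio, the transmit power, the channel gain, and the noise power; the numeric values in the example call are illustrative assumptions rather than parameters reported in the paper.

```python
import numpy as np

def uplink_rate(l_ratio: float, bandwidth_hz: float, p_tx_w: float,
                channel_gain: float, noise_power_w: float) -> float:
    """Achievable uplink rate r_{n,e}^{off-up}(t) as in Eq. (2): the allocated share
    of the band times the Shannon capacity of that subchannel (bit/s)."""
    return l_ratio * bandwidth_hz * np.log2(1.0 + p_tx_w * channel_gain / noise_power_w)

# Example: 2 of 15 subchannels of a 15 MHz band, 1 W transmit power (illustrative values).
r_up = uplink_rate(l_ratio=2 / 15, bandwidth_hz=15e6, p_tx_w=1.0,
                   channel_gain=1e-7, noise_power_w=1e-10)
```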

2.3. Computation Model

WNs generate request tasks with a certain probability in each time slot. If the computation resources of a WN cannot meet the computation or delay requirements for completing the request tasks, part of the tasks is offloaded to the MEC servers. Examples of divisible request tasks include a short video of pedestrians walking over a period of time (2 min, 5 min) or a video of a vehicle driving (2 min, 10 min). Assuming that the duration of a video is 1 min, we can divide it into 6 subvideos of 10 s each, so that each subvideo is much smaller; processing these subvideos separately on MEC servers significantly reduces the task processing delay and improves the execution efficiency of the MEC servers. We define the request task generated by WN $n$ as consisting of $K$ subtasks. The task of WN $n$ is described by the tuple $s_{n,e}(t) = \{s_{n,e}^{size}(t), s_{n,e}^{cycle}(t), s_{n,e}^{priority}(t), s_{n,e}^{deadline}(t)\}$. The four elements are: (1) $s_{n,e}^{size}(t)$ (Mb), the size of the request task $s_{n,e}(t)$; (2) $s_{n,e}^{cycle}(t)$ (GHz/Mb), the required computation density, i.e., the total number of CPU cycles required for $s_{n,e}(t)$; (3) $s_{n,e}^{priority}(t)$, the computation priority of the request task $s_{n,e}(t)$; and (4) $s_{n,e}^{deadline}(t)$ (ms or time slots), the latency constraint on finishing the request task $s_{n,e}(t)$. In addition, we assume that the request task $s_{n,e}(t)$ generated by WN $n$ in time slot $t$ is stochastic; its arrival probability $p_n(t)$ follows a Bernoulli distribution with parameter $\mu_n$, so that $\mathbb{E}_{p_n}[F(s_{n,e}(t))] = \mu_n$, where $F$ is the indicator of whether the request workload $s_{n,e}(t)$ arrives or not. Thus, $F(s_{n,e}(t)) = 1$ means that a request computation task from WN $n$ arrives in time slot $t$.
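The following sketch illustrates this stochastic task model: a Bernoulli arrival with probability $p_n$ and the four-element task tuple. The arrival probability, subtask sizes, computation density, and priority-dependent deadlines are the values used in the simulation setup of Section 5.1; the class and function names, and the uniform choice of priority, are our assumptions.

```python
import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    size_mb: float        # s^{size}_{n,e}(t): task size in Mb
    density: float        # s^{cycle}_{n,e}(t): computation density in GHz/Mb
    priority: int         # s^{priority}_{n,e}(t): 1, 2, or 3
    deadline_slots: int   # s^{deadline}_{n,e}(t): latency constraint in time slots

def maybe_generate_task(rng: np.random.Generator, p_n: float = 0.35) -> Optional[Task]:
    """With Bernoulli probability p_n, a WN generates a divisible request task in slot t."""
    if rng.random() >= p_n:
        return None                                # no task arrives in this slot
    priority = int(rng.integers(1, 4))             # three priority levels (assumed uniform)
    deadline = {1: 6, 2: 8, 3: 10}[priority]       # deadlines used in Section 5.1
    subtask_sizes = rng.uniform(1.6, 4.2, size=5)  # 5 subtasks, each of size [1.6, 4.2] Mb
    return Task(size_mb=float(subtask_sizes.sum()), density=3.8,
                priority=priority, deadline_slots=deadline)
```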

2.3.1. Local Execution on WN

Since the task $s_{n,e}(t)$ in time slot $t$ is composed of multiple subtasks that can be divided, to improve the execution efficiency, some subtasks can remain on WN $n$ and be processed by its local processor. We assume that $N_e$ WNs are covered by MEC server $e$, where $N_e$ is the number of WNs that offload tasks $s_{n,e}(t)$ to MEC server $e$, with $N_e \subseteq U$ and $N_e \cap N_{\tilde{e}} = \varnothing$. Therefore, if one task is composed of 10 subtasks, the offloading ratio can take values in $[0, 0.1, 0.2, \ldots, 1.0]$, that is, $0, 1, 2, \ldots, 10$ subtasks are offloaded to the MEC servers, respectively. We first calculate the computation density allocated for locally executing the task $s_{n,e}(t)$ as follows [33,36]:
$$ f_n(t) = \sqrt{\frac{p_{n,l}(t)}{\kappa_n (1-\rho_n)\, s_{n,e}^{cycle}(t)}} \tag{4} $$
where $p_{n,l}(t)$ is the power allocated for locally processing tasks and $\kappa_n$ is the effective switched capacitance. We can see that the computation density $f_n(t)$ for locally executing the task $s_{n,e}(t)$ depends on the offloading ratio and thus reflects the trade-off between bandwidth consumption and local latency, which is discussed in the next section.
In this case, the computing delay for the task $s_{n,e}(t)$ can be written as
$$ d_n^{execute}(t) = \frac{(1-\rho_n)\, s_{n,e}^{cycle}(t)}{f_n(t)} \tag{5} $$
As described above, the tasks generated by WNs have different priorities. Therefore, we define a task buffer queue to hold the request computation tasks with different priorities, so that the execution order of tasks with different priorities can be effectively scheduled to improve execution efficiency. Without loss of generality, when the request task $s_{n,e}(t)$ is pushed into the buffer queue in time slot $t$, the end time slot at which the task $s_{n,e}(t)$ is finished (or remains unfinished) can be derived as
$$ d_n^{end}(t) = \min\left\{ t + d_n^{execute}(t) - 1,\; t + s_{n,e}^{deadline}(t) - 1 \right\} \tag{6} $$

2.3.2. Offloading to MEC Server

Due to the insufficient computation power of the WN's local processor, most tasks need to be offloaded to the MEC server for processing. Using the achievable transmission rates $r_{n,e}^{off\text{-}up}(t)$ and $r_{n,e}^{off\text{-}down}(t)$ from Section 2.2, the uplink transmission delay for offloading the task $s_{n,e}(t)$ to MEC server $e$ can be calculated as
$$ d_n^{off\text{-}up}(t) = \frac{\rho_n\, s_{n,e}^{size}(t)}{r_{n,e}^{off\text{-}up}(t)} \tag{7} $$
Meanwhile, the tasks offloaded to the MEC server must be processed by it; the processing delay is calculated as
$$ d_n^{off\text{-}exe}(t) = \frac{\rho_n\, s_{n,e}^{cycle}(t)}{f_{n,e}(t)} \tag{8} $$
where $f_{n,e}(t)$ is the computation density allocated for executing the task $s_{n,e}(t)$ at MEC server $e$, which can be derived as
$$ f_{n,e}(t) = \sqrt{\frac{p_{n,e}(t)}{\kappa_e\, \rho_n\, s_{n,e}^{cycle}(t)}} \tag{9} $$
where $p_{n,e}(t)$ is the power allocated for processing tasks at MEC server $e$ and $\kappa_e$ is the effective switched capacitance of MEC server $e$.
Similarly, when the request task $s_{n,e}(t)$ is pushed into the buffer queue at MEC server $e$ in time slot $t$, the end time slot at which the task $s_{n,e}(t)$ is finished (or remains unfinished) at MEC server $e$ can be derived as
$$ d_{n,e}^{end}(t) = \min\left\{ t + d_n^{off\text{-}exe}(t) + d_n^{execute}(t) - 1,\; t + s_{n,e}^{deadline}(t) - 1 \right\} \tag{10} $$
In the end, the computation result is returned from MEC server $e$ to WN $n$; the downlink transmission delay for returning the computation result of the task $s_{n,e}(t)$ can be calculated as
$$ d_n^{off\text{-}down}(t) = \frac{\eta_n^{back}\, \rho_n\, s_{n,e}^{size}(t)}{r_{n,e}^{off\text{-}down}(t)} \tag{11} $$
where $\eta_n^{back}$ is a small coefficient accounting for the size of the computation result.
It is noteworthy that the downlink communication only needs to return the calculation result to the WN, so the data transmitted on the downlink is much smaller than the data transmitted on the uplink. For instance, in a face recognition application, the user offloads a batch of high-definition face images (several megabytes, e.g., 2 Mb) to the MEC server through the uplink; after the MEC server processes the offloaded images, it only needs to return the recognition result (such as the user's name, a few bytes, e.g., 10 bytes) through the downlink. Therefore, because the downlink data volume is small, the downlink communication delay $d_n^{off\text{-}down}(t)$ is neglected in this paper.
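The per-slot delay components defined in Eqs. (5), (7), and (8) follow directly from the offloading ratio, as in the sketch below; the numbers in the example call are illustrative assumptions, not values from the paper.

```python
def local_delay(rho: float, cycles: float, f_local: float) -> float:
    """Local computing delay of Eq. (5): the non-offloaded (1 - rho) share of the
    required CPU cycles divided by the locally allocated computation density f_n(t)."""
    return (1.0 - rho) * cycles / f_local

def uplink_delay(rho: float, size_mb: float, r_up: float) -> float:
    """Uplink transmission delay of Eq. (7): offloaded data volume over the uplink rate."""
    return rho * size_mb / r_up

def mec_exec_delay(rho: float, cycles: float, f_mec: float) -> float:
    """MEC processing delay of Eq. (8): offloaded cycles over the MEC-allocated density."""
    return rho * cycles / f_mec

# Example: offload 40% of a task needing 10 GHz-cycles in total, with a 1.5 GHz local
# allocation, an 80 GHz MEC allocation, and an 8 Mb/ms uplink (illustrative units).
d_local = local_delay(0.4, cycles=10.0, f_local=1.5)
d_up = uplink_delay(0.4, size_mb=4.0, r_up=8.0)
d_mec = mec_exec_delay(0.4, cycles=10.0, f_mec=80.0)
```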

3. Optimization Problem Statement

The goal of our optimization problem is to minimize the long-term weighted total cost of the MEC system over the total period $\mathcal{T}$. As described in Section 2, we mainly consider the delay cost, including the local computation delay and the offloading computation delay. However, it is noteworthy that minimizing the latency cost consumes more energy or bandwidth, especially on the MEC servers. If only the delay cost were considered and the cost of the MEC server were ignored, tasks would preferentially be offloaded to the MEC server for processing, so that no tasks would be processed locally: the computation resources of the MEC server are much more powerful than those of the local processor, so, as long as the bandwidth is sufficient, a task can be transmitted to the MEC side with a transmission delay that is much smaller than the processing delay, and the reinforcement learning agent would learn to offload all tasks to the MEC server to obtain a smaller delay cost. In practice this is unreasonable, because renting the operator's MEC resources must be paid for, so the bandwidth cost should be considered. To avoid this situation, we jointly consider the delay cost and the bandwidth cost of transmitting the request tasks.
First, the delay for finishing the task $s_{n,e}(t)$ of WN $n$ in time slot $t$ can be derived as
$$ d_n(t) = \max\left\{ d_n^{end}(t) - \tilde{t},\; d_{n,e}^{end}(t) - \tilde{t} \right\} \tag{12} $$
where $\tilde{t}$ is the initial time slot for computing the last subtask of the corresponding request task of WN $n$ at the local processor or at the MEC server.
Therefore, the weighted total cost for finishing the task $s_{n,e}(t)$ of WN $n$ can be defined as
$$ C_n(t) = \beta_d\, d_n(t) + \beta_b\, \tau_{n,e}\, l_e^{off\text{-}up} B_e \tag{13} $$
where $\beta_d$ and $\beta_b$ are the weighting coefficients that trade off the delay cost and the bandwidth cost, with $\beta_d + \beta_b = 1$, and $\tau_{n,e}$ is a regulating coefficient for the bandwidth cost.
To this end, the optimization objective is to minimize the long-term weighted cost, including the delay cost and the bandwidth cost, over a certain period $\mathcal{T}$ for the request tasks generated by each WN, which can be defined as follows:
$$ \mathbf{P1}:\; \min_{(\rho_n(t),\, l_e^{off\text{-}up})}\; \mathbb{E}\!\left[ \lim_{T\to\infty} \frac{1}{T}\frac{1}{N} \sum_{t=1}^{T}\sum_{n=1}^{N} C_n(t) \right] \tag{14} $$
$$ \text{s.t.}\quad \rho_n(t) \in [0,1],\qquad 0 < l_e^{off\text{-}up} < L,\qquad t \ge \tilde{t}, $$
$$ d_n^{execute}(t),\; d_n^{off\text{-}exe}(t),\; d_n^{off\text{-}up}(t) \le s_{n,e}^{deadline}(t),\qquad f_n(t) \le F_n^{max},\quad f_{n,e}(t) \le F_e^{max} $$
where (14) is the optimization objective with two optimization variables: the offloading ratio of the request tasks and the ratio of allocated bandwidth. The first constraint limits the ratio of tasks offloaded to the MEC server to between 0 and 1. The second constraint guarantees that the bandwidth allocated for offloading tasks does not exceed the total bandwidth of any MEC server. The third constraint stipulates that the processing time slot of any task cannot precede the task arrival time slot; the task does not have to start processing in the slot immediately after its arrival, as long as it can be completed before its deadline. The fourth constraint is the delay constraint: the local processing delay, the transmission delay of the offloaded part, and the processing delay at the MEC server cannot exceed the task completion deadline. The fifth constraint means that the computation density allocated for executing tasks cannot exceed the maximum computation capacity of any WN or MEC server.

4. DRL-Based Optimization Method

4.1. Optimization Design

P1, as described in Section 3, is a typical nonconvex optimization problem, which is hard to solve by convex optimization methods. Therefore, we describe the optimization problem P1 as an MDP, in which the state of the current time slot is related only to the state of the previous time slot and is independent of the states of other time slots; the state space, action space, state transition probability matrix, and reward function can then be defined. An MDP is described by the tuple $\langle \mathcal{O}, \mathcal{A}, \mathcal{P}_{s,a}, \mathcal{U}, \gamma \rangle$, where $\mathcal{O}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}_{s,a}$ is the state transition probability space, $\mathcal{U}$ is the reward space, and $\gamma$ is the discount factor.

4.1.1. State Space

In terms of the state space of optimization problem P1, we consider several factors of the proposed MEC framework, including the size, the computation density, and the priority level of the task arriving at WN $n$. In addition, we need to observe some information about MEC server $e$, including the channel state between WN $n$ and MEC server $e$, the time slot $\tilde{t}$ for computing the last subtask of the task of WN $n$, the remaining computation capacity of MEC server $e$, and the length of the task queue of MEC server $e$ for all linked WNs. Therefore, the state of WN $n$ in time slot $t$ can be described as:
$$ o_n(t) = \left\{ s_{n,e}^{size}(t),\; s_{n,e}^{cycle}(t),\; s_{n,e}^{priority}(t),\; h_{n,e}(t),\; \tilde{t},\; F_e^{remain}(t),\; Q_e(t-1) \right\} \tag{15} $$
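A minimal sketch of how this observation could be assembled as the input of the agent's network is given below; the function name and the example values are ours.

```python
import numpy as np

def build_observation(size_mb: float, density: float, priority: int, channel_state: float,
                      last_subtask_slot: int, mec_remaining_capacity: float,
                      mec_queue_length: int) -> np.ndarray:
    """Pack the seven quantities of o_n(t) in Eq. (15) into a flat vector that can be
    fed to the input layer of the agent's M-Network."""
    return np.array([size_mb, density, priority, channel_state,
                     last_subtask_slot, mec_remaining_capacity, mec_queue_length],
                    dtype=np.float32)

o_t = build_observation(4.0, 3.8, 2, 128, 3, 60.0, 5)   # illustrative values
```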

4.1.2. Action Space

For the action space of optimization problem P1, there are two factors: the offloading ratio $\rho_n(t)$ of the request task $s_{n,e}(t)$ of WN $n$ and the allocated bandwidth ratio $l_e^{off\text{-}up}$ for transmitting it. Therefore, the action of WN $n$ in time slot $t$ can be written as:
$$ a_n(t) = \left\{ \rho_n(t),\; l_e^{off\text{-}up} \right\} \tag{16} $$
For instance, suppose a task generated by a WN is composed of 5 subtasks; the offloading ratio then lies in $[0, 0.2, 0.4, 0.6, 0.8, 1.0]$, meaning that 0, 1, 2, 3, 4, or 5 subtasks are offloaded to the MEC server, respectively. Similarly, if the total bandwidth is 15 MHz, the allocated bandwidth ratio for offloading lies in $[1/15, 2/15, \ldots, 14/15, 1]$ (0 is unreasonable), i.e., 1 MHz, 2 MHz, ..., 14 MHz, or 15 MHz of sub-bandwidth is allocated for offloading the tasks. If the action $a_n$ of the agent of WN $n$ is $[0.2, 2/15]$, WN $n$ offloads 1 subtask to the MEC server with 2 MHz (2/15) of allocated bandwidth, while the other 4 subtasks are executed by the local processor. A sketch of this discrete action set is given below.
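The sketch below builds this discrete action set as the Cartesian product of the offloading-ratio levels and the bandwidth-ratio levels; the constant names are ours.

```python
from itertools import product

# Offloading-ratio levels for a task of 5 subtasks and bandwidth-ratio levels for a
# 15 MHz band divided into 15 subchannels (Section 5 discretizes DeMADRL's action
# space into 11 and 15 levels in the same way).
OFFLOAD_RATIOS = [k / 5 for k in range(6)]           # 0, 0.2, ..., 1.0
BANDWIDTH_RATIOS = [l / 15 for l in range(1, 16)]    # 1/15, 2/15, ..., 1.0

# The discrete action set is the Cartesian product of the two ratio lists.
ACTIONS = list(product(OFFLOAD_RATIOS, BANDWIDTH_RATIOS))

# Example from the text: the action (0.2, 2/15) offloads 1 of 5 subtasks and uses
# 2 MHz of the 15 MHz band for the uplink.
a_n = (0.2, 2 / 15)
assert a_n in ACTIONS
```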

4.1.3. Reward Function

The reward is a reward or punishment value returned after performing a specific action $a_n(t)$ in the current state $o_n(t)$, and the quality of the reward definition directly determines the effectiveness of the algorithm. As defined in Section 3, the goal of the optimization problem is to minimize the long-term weighted total cost of the MEC system over the total period $\mathcal{T}$, and P1 is formulated to tackle this problem. According to (13) and (14), the reward function can be defined as $u_n(t) = F\!\left(-a\left(\beta_d d_n(t) + \beta_b \tau_{n,e} l_e^{off\text{-}up} B_e\right) - b\right)$, where $F(x)$ is a linear function of $x$, $a$ is a scaling coefficient, and $b$ is a penalty for unfinished tasks (see Section 5.1). In this way, we can obtain the minimum cost $C_n(t)$ by seeking the maximum reward $u_n(t)$ based on the defined linear function.
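A minimal sketch of this reward, assuming $F$ is the identity map (our simplification) and using the penalty rule $b = 2 \times s_{n,e}^{deadline}$ for unfinished tasks from Section 5.1, is:

```python
def reward(delay_slots: float, bandwidth_ratio: float, total_bandwidth: float,
           finished: bool, deadline_slots: int,
           beta_d: float = 0.5, beta_b: float = 0.5,
           tau: float = 0.12, a: float = 1.0) -> float:
    """Per-slot reward u_n(t): the negated weighted cost of Eq. (13), reduced by a
    penalty b = 2 * deadline when the task misses its deadline (Section 5.1).
    Treating F as the identity is our simplifying assumption."""
    cost = beta_d * delay_slots + beta_b * tau * bandwidth_ratio * total_bandwidth
    b = 0.0 if finished else 2.0 * deadline_slots
    return -(a * cost) - b

# Example: a Priority-2 task (deadline 8 slots) finished after 4 slots using 2/15 of a 15 MHz band.
u = reward(delay_slots=4, bandwidth_ratio=2 / 15, total_bandwidth=15,
           finished=True, deadline_slots=8)
```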

4.2. DRL-Based Scheme

It can be seen that $p_n(t)$, $s_{n,e}^{size}(t)$, $s_{n,e}^{cycle}(t)$, $s_{n,e}^{priority}(t)$, $h_{n,e}(t)$, and $\tilde{t}$ are local information that can easily be observed by WN $n$ without MEC server involvement. However, $F_e^{remain}(t)$ and $Q_e(t-1)$ are global information belonging to MEC server $e$. Consequently, the state transition probability matrix $\mathcal{P}_{s,a}$ of the MDP cannot be obtained, so the MDP cannot be solved directly, and we instead solve P1 with a model-free DRL algorithm. DRL uses a deep neural network (DNN) to approximate the optimizer of traditional RL algorithms and obtains the optimal action decision policy $\pi_n^{*}$ by continuously training the DNN:
$$ \pi_n^{*} = \arg\max_{a_n \sim \pi_n} \mathbb{E}\!\left[ \sum_{t=1}^{T} \gamma^{(t-1)} u_n(t) \right] \tag{17} $$
DQN is a typical, commonly used model-free and off-policy DRL algorithm that has been applied to many optimization problems, but its tricky problem is the overestimation of the expected cumulative reward. Therefore, we use DDQN, which tackles the overestimation of DQN by defining a target network (T-Network) whose structure is the same as that of the main network (M-Network) of DQN. Another technique, experience replay memory, is introduced to break the correlation between adjacent training samples.
Figure 2 shows the structure of the multi-agent DRL model based on DDQN. Agent $n$ of WN $n$ interacts with the simulated MEC environment in each time slot: based on the current state $o_n(t)$, the agent executes a certain action $a_n(t)$, transitions into the next state $o_n'(t)$, and obtains the immediate reward $u_n(t)$. M-Network $n$ is trained for WN $n$ and outputs the action $a_n(t)$. Although T-Network $n$ has the same structure as M-Network $n$, it does not participate in training; instead, it regularly updates all of its parameters from M-Network $n$ in a soft-update pattern. An experience replay memory stores the training samples of each agent. As shown in Figure 2, each sample is composed of four elements $\langle o_n, a_n, u_n, o_n' \rangle$: the current state, the action, the immediate reward, and the next state. Finally, through the continuous training of each agent, the loss function is minimized by stochastic gradient descent. A minimal replay-memory sketch is given below.
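A minimal experience replay memory for one agent could look as follows; the capacity matches Table 1, and the class and method names are ours.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of <o_n, a_n, u_n, o_n'> samples for one agent."""
    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)   # the oldest samples are dropped when full

    def push(self, o, a, u, o_next):
        """Store one transition observed while interacting with the environment."""
        self.buffer.append((o, a, u, o_next))

    def sample(self, batch_size: int):
        """Uniformly sample a mini-batch, breaking the correlation between adjacent samples."""
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```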
$$ y_n = u_n + \gamma\, Q'\!\left(o_n',\; \arg\max_{a_n'} Q\!\left(o_n', a_n' \mid \theta_n\right) \,\middle|\, \theta_n'\right) \tag{18} $$
$$ L_n(Q, Q') = \mathbb{E}_{\langle o_n, a_n, u_n, o_n' \rangle \sim M_n}\!\left[ \left( y_n - Q\!\left(o_n, a_n \mid \theta_n\right) \right)^2 \right] \tag{19} $$
where $Q(o_n(t), a_n(t) \mid \theta_n)$ and $Q'(o_n(t), a_n(t) \mid \theta_n')$ are the Q-functions of the M-Network and the T-Network, respectively. (It is worth noting that the complex training of the neural networks can be transferred to the edge servers (MEC servers), which have sufficient computation resources and can easily deploy hardware such as GPUs for training. After training is completed, the WN only needs to download the weights of the trained neural network from the edge server to execute the offloading decision-making policy. The weights of each WN's neural network are numerical data of about several hundred KB, which does not cause much transmission overhead.)
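The double-Q target of Eq. (18) and the squared loss of Eq. (19) can be sketched directly on arrays of Q-values, independent of the network library; this is an illustrative reading, not the authors' TensorFlow implementation.

```python
import numpy as np

def ddqn_targets(rewards: np.ndarray, q_next_main: np.ndarray,
                 q_next_target: np.ndarray, gamma: float = 0.85) -> np.ndarray:
    """Double-DQN targets y_n of Eq. (18): the M-Network selects the greedy next
    action and the T-Network evaluates it. Shapes: rewards (B,), Q arrays (B, A)."""
    greedy_actions = np.argmax(q_next_main, axis=1)                      # argmax_a' Q(o', a' | theta)
    evaluated = q_next_target[np.arange(len(rewards)), greedy_actions]   # Q'(o', a* | theta')
    return rewards + gamma * evaluated

def ddqn_loss(targets: np.ndarray, q_main: np.ndarray, actions: np.ndarray) -> float:
    """Mean squared error between the targets and Q(o, a | theta), as in Eq. (19)."""
    q_taken = q_main[np.arange(len(actions)), actions]
    return float(np.mean((targets - q_taken) ** 2))
```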
Algorithm 1 summarizes the proposed DRL-based optimization scheme; sketches of the $\epsilon$-greedy action selection in step 9 and the soft parameter update in step 17 follow the listing.
Algorithm 1 The proposed DRL-based optimization scheme
1: Initialize the parameters $\theta_n$ and $\theta_n'$ of the M-Network and T-Network for each WN $n \in N$;
2: Initialize the experience replay memory $M_n$ for each WN $n \in N$;
3: for $e = 1$ to $Episode_{Maximum}$ do
4:   Initialize and reset the stochastic simulated MEC environment for each WN $n \in N$;
5:   for $n = 1$ to $N$ do
6:     The agent of WN $n$ interacts with the simulation environment;
7:     for $t = 1$ to $T$ do
8:       Observe the current state $o_n(t)$, consisting of the size $s_{n,e}^{size}$, the computation density $s_{n,e}^{cycle}$, and the priority level $s_{n,e}^{priority}$ of the arriving task of WN $n$, the channel state $h_{n,e}(t)$ between WN $n$ and MEC server $e$, the time slot $\tilde{t}$ for computing the last subtask of the task of WN $n$, the remaining computation capacity $F_e^{remain}(t)$ of MEC server $e$, and the task queue length $Q_e(t-1)$ of MEC server $e$ for all linked WNs;
9:       Execute $a_n(t)$, composed of the offloading ratio $\rho_n(t)$ and the allocated bandwidth ratio $l_e^{off\text{-}up}$, according to the $\epsilon$-greedy policy;
10:      Gain the immediate reward $u_n(t)$ and receive the next state $o_n'$ for the agent of WN $n$;
11:      Save the sample tuple $\langle o_n(t), a_n(t), u_n(t), o_n'(t) \rangle$ and push it into $M_n$ when $M_n$ does not overflow;
12:      Randomly select a mini-batch of training samples from $M_n$ when the length of $M_n$ is greater than the mini-batch size;
13:      Compute $y_n(t)$ according to (18);
14:      Compute the loss $L_n(Q, Q')$ according to (19);
15:      Update $\theta_n$ by stochastic gradient descent;
16:    end for
17:    Softly copy all parameters from the M-Network to the T-Network, $\theta_n' \leftarrow \theta_n$;
18:  end for
19: end for
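The two steps of Algorithm 1 that are only named in the listing, the $\epsilon$-greedy action selection (step 9) and the soft parameter copy (step 17), could be realized as in the sketch below. We read $\epsilon = 0.992$ from Table 1 as the probability of acting greedily, and the soft-update coefficient is our assumption since the paper does not report its value.

```python
import numpy as np

def epsilon_greedy(q_values: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    """Step 9: act greedily on the M-Network's Q-values with probability epsilon,
    otherwise explore with a random action (epsilon = 0.992 in Table 1)."""
    if rng.random() > epsilon:
        return int(rng.integers(len(q_values)))   # random exploratory action
    return int(np.argmax(q_values))               # greedy action

def soft_update(theta_main: dict, theta_target: dict, tau: float = 0.01) -> None:
    """Step 17: softly copy the M-Network weights into the T-Network. The mixing
    coefficient tau is our assumption; the paper does not state it."""
    for name, weight in theta_main.items():
        theta_target[name] = tau * weight + (1.0 - tau) * theta_target[name]
```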

5. Simulation Results

The simulation setup, performance comparison, and simulation results are described in detail in this section.

5.1. Simulation Setup

The proposed DeMADRL for task offloading and resource allocation is run on a PC with an Intel Core i9-9900K CPU (maximum frequency 5.0 GHz) and an NVIDIA RTX 2080 Ti GPU with 11 GB of video memory. The integrated development environment is PyCharm 2020.1 with TensorFlow 1.13.1.
In the proposed MEC model, the number of WNs is set to $N = 15$. WN $n$ generates computation tasks with arrival probability $p_n = 0.35$, and each task is composed of 5 subtasks. The number of MEC servers is set to $E = 3$. The number of time slots is set to $T = 100$, and the duration of each time slot is 1 ms. The distance between a WN and an MEC server is in the range [60, 100] m. $\delta_n = 3$, and $P_{n,e}$ follows a Markov distribution with equal probabilities, that is, the channel state in time slot $t$ transitions to the channel state in time slot $t+1$ with the same probability for every state in the set. The total bandwidth is set to $B_e = 15$ MHz, and the number of sub-bandwidths is set to $L_e = 15$. $p_{n,e}^{off\text{-}up} = p_{n,e}^{off\text{-}down} = p_{n,l} = 1$ W. The maximum computation densities of a WN and an MEC server are set to $F_n^{max} = 15.5$ GHz and $F_e^{max} \in [78, 95]$ GHz. The computation density for executing one unit of task is set to 3.8 GHz/Mb. The size of each subtask from the WNs follows a uniform distribution on [1.6, 4.2]. The weight parameters for the latency cost and the bandwidth cost are set to $\beta_d = 0.5$ and $\beta_b = 0.5$, respectively. The request tasks have 3 priority levels; tasks with Priority 1, Priority 2, and Priority 3 are distinguished by different deadlines $s_{n,e}^{deadline}$, set to 6, 8, and 10 time slots, respectively.
In terms of the neural network of the proposed DeMADRL, the first layer is the input layer composed of the state of each WN, the second and third layers are two fully connected hidden layers, and the last layer is the output layer. For the reward function, the parameter $a$ of the linear function is set to 1; when a task is not completed before its deadline, $b$ is set to 16 ($2 s_{n,e}^{deadline}$), and otherwise $b = 0$. The purpose is to effectively improve the convergence speed of the proposed DeMADRL. The main parameter settings are listed in Table 1.

5.2. Performance Comparison

Three baselines are introduced for comparison with the proposed DeMADRL to verify its advantages in comprehensive simulations: a binary offloading DRL scheme (BiDRL), a scheme in which all tasks are offloaded to the MEC servers (All-MEC), and a scheme in which all tasks are executed by the local processors of the WNs (All-LPE). The action space of DeMADRL is discretized into 11 levels $[0, 0.1, 0.2, \ldots, 1.0]$ for the offloading ratio and 15 levels $[1/15, 2/15, \ldots, 14/15, 1.0]$ for the allocated bandwidth ratio.

5.2.1. BiDRL

Different from our proposed decentralized multi-agent reinforcement learning-based partial offloading scheme, in BiDRL the tasks are not divided into subtasks: the whole task is either executed by the local processor or offloaded to one MEC server. To improve the convergence, the size of the arriving tasks is set in the range [9, 13.5].

5.2.2. All-MEC

Regardless of the arriving tasks' size and arrival probability, all request tasks are offloaded to the MEC servers with the maximum bandwidth. All-MEC is more concerned with offloading than with resource allocation.

5.2.3. All-LPE

Regardless of the arriving tasks' size and arrival probability, all request tasks are executed by the local processors of the WNs. All-LPE does not account for the task completion deadline, owing to the WNs' insufficient computation resources and battery levels.

5.3. Simulation Results

First, we describe the convergence of the proposed DeMADRL in Figure 3, Figure 4 and Figure 5. To better evaluate the performance of DeMADRL, we define the average cumulative reward of the agents of all WNs over all time slots in each episode as:
$$ \mathbf{U}^{perEpisode} = \frac{1}{T}\frac{1}{N} \sum_{t=1}^{T}\sum_{n=1}^{N} u_n(t) \tag{20} $$
Figure 3 shows the performance of DeMADRL in terms of $\mathbf{U}^{perEpisode}$ with different learning rates $\alpha$. If $\alpha = 0.05$, DeMADRL cannot converge because the learning step is too large. When $\alpha = 0.01$, DeMADRL converges to a certain region in terms of $\mathbf{U}^{perEpisode}$, but not to the optimal solution. When $\alpha = 0.001$ or $0.0001$, the convergence and performance of DeMADRL are optimal; however, with $\alpha = 0.0001$ the convergence is slow and unstable. When $\alpha = 0.00005$, DeMADRL cannot learn the optimal policy even after 800 episodes. Therefore, we set the learning rate to $\alpha = 0.001$ in our work.
Figure 4 reveals the convergence of DeMADRL in terms of $\mathbf{U}^{perEpisode}$ with different bandwidth-cost coefficients $\tau_{n,e}$. According to (13), the average cumulative reward of DeMADRL decreases as $\tau_{n,e}$ increases. However, since DeMADRL can adaptively adjust the bandwidth allocation ratio by learning the optimal policy, the average cumulative reward $\mathbf{U}^{perEpisode}$ does not decrease linearly with the increase of $\tau_{n,e}$. From Figure 4, we can see that when $\tau_{n,e} = 0.12$ or $0.16$, the performance of DeMADRL is almost the same, and both tend toward the optimal decision-making scheme when the number of episodes exceeds 600.
As illustrated in Figure 5, request tasks with different priorities yield different performance in terms of the average cumulative reward $\mathbf{U}^{perEpisode}$. Tasks with Priority 1, Priority 2, and Priority 3 are distinguished by different deadlines $s_{n,e}^{deadline}$, set to 6, 8, and 10 time slots, respectively. The DeMADRL algorithm converges regardless of whether the priority is high or low, but the average cumulative reward $\mathbf{U}^{perEpisode}$ for tasks with low priority is significantly smaller.
Figure 6 and Figure 7 compare the four algorithms in terms of the average cumulative reward over all 800 episodes. Without loss of generality, we define the average cumulative reward of the agents of all WNs over all 800 episodes as:
$$ \mathbf{U}^{allEpisode} = \frac{1}{T}\frac{1}{Episode_{Maximum}}\frac{1}{N} \sum_{t=1}^{T}\sum_{e=1}^{Episode_{Maximum}}\sum_{n=1}^{N} u_n(t) \tag{21} $$
Figure 6 compares the four algorithms for different arrival probabilities $p_n(t)$ of tasks from the WNs, with the probability set to [0.25, 0.35, 0.45, 0.55, 0.65]. We can see from Figure 6 that the proposed DeMADRL outperforms BiDRL, All-MEC, and All-LPE in terms of $\mathbf{U}^{allEpisode}$ for all arrival probabilities. Because BiDRL adopts the binary offloading pattern, the size of the offloaded tasks is large, and the delay cost of processing tasks and the bandwidth cost of transmitting them increase accordingly, which leads to a decrease in the cumulative reward $\mathbf{U}^{allEpisode}$. Moreover, because the task completion rate is low, a penalty is added to the cumulative reward $\mathbf{U}^{allEpisode}$, which causes the worse performance of BiDRL, All-MEC, and All-LPE compared with DeMADRL.
Figure 7 compares the four algorithms for different numbers of subtasks per WN. The granularity of the request tasks (the number of subtasks) affects the performance of the four algorithms. When the number of subtasks is 3 or 4, the size of each subtask from each WN is larger, so $\mathbf{U}^{allEpisode}$ drops dramatically due to the penalty for unfinished tasks. However, when the number of subtasks is greater than 5, the performance of the four algorithms does not change much, because the small subtasks do not challenge the computation resources of the WNs' local processors and the MEC servers. $\mathbf{U}^{allEpisode}$ of BiDRL and All-LPE is stable because the number of subtasks is independent of these two algorithms' offloading and resource scheduling.
Furthermore, we discuss the delay and incomplete tasks of different priorities for two learning algorithms: DeMADRL and BiDRL.
Figure 8 shows the average delay of the two learning algorithms with three priority levels. Without loss of generality, the arrival probability of tasks is set to $p_n(t) = 0.35$, and the weight coefficients for the delay cost and the bandwidth cost are set to $\beta_d = \beta_b = 0.5$ to trade off the two types of cost. As revealed in Figure 8, the difference between the delay costs of the three priority levels is not large, because neither the arrival probability nor the size of the tasks is large, so they have little impact on task transmission and processing.
Figure 9 shows the ratio of incomplete tasks of the two learning algorithms with three priority levels. BiDRL is affected by the size of the tasks: although its average delay is low because the arrival probability of tasks is relatively small, there is no subtask division, so the whole task of each WN is offloaded to a single MEC server, which increases the bandwidth cost and the computation cost of the edge server. These factors cause the performance of BiDRL to be significantly worse than that of DeMADRL in terms of the task completion ratio.

6. Conclusions

In this paper, an MEC model with multiple WNs and multiple edge servers is formulated. The proposed MEC model comprehensively considers the randomness and divisibility of the request tasks arriving from the WNs, the time-varying channel state between each WN and MEC server, and the different priorities of tasks. Furthermore, to tackle the weighted cost minimization problem of the MEC system, a decentralized multi-agent deep reinforcement learning-based partial task offloading and resource allocation algorithm (DeMADRL) is proposed to learn the optimal decision-making scheme. In the simulation results, with comprehensive parameter settings, the convergence of DeMADRL is verified, and DeMADRL is compared with three baseline algorithms: BiDRL, All-MEC, and All-LPE. DeMADRL performs best in terms of the average cumulative reward over the whole time period across all episodes.

Author Contributions

H.K.: idea and physical model building, writing and revising the manuscript. H.W.: experiment execution and parts of the writing. H.S.: writing—original draft, supervision, English proofing, and submission. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Jilin Province Scientific and Technological Planning Project of China (20210101415JC, YDZJ202201ZYTS556) and the Jilin Province Education Department Scientific Research Planning Foundation of China (JJKH20210753KJ).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare that no conflict of interest exists in the submission of this manuscript, and the manuscript has been approved by all authors for publication. We declare on behalf of all coauthors that the work described is original research that has not been published previously and is not under consideration for publication elsewhere, in whole or in part. All the listed authors have approved the enclosed manuscript.

References

  1. Yang, W.; Wang, N.; Guan, Z.; Wu, L.; Du, X.; Guizani, M. A Practical Cross-Device Federated Learning Framework over 5G Networks. IEEE Wirel. Commun. 2022. [Google Scholar] [CrossRef]
  2. Lu, F.; Zhao, H.; Zhao, X.; Wang, X.; Saleem, A.; Zheng, G. Investigation of Near-Field Source Localization Using Uniform Rectangular Array. Electronics 2022, 11, 1916. [Google Scholar] [CrossRef]
  3. Cardellini, V.; Personé, V.D.N.; Di Valerio, V.; Facchinei, F.; Grassi, V.; Presti, F.L.; Piccialli, V. A game-theoretic approach to computation offloading in mobile cloud computing. Math. Program. 2016, 157, 421–449. [Google Scholar] [CrossRef]
  4. Guo, S.; Xiao, B.; Yang, Y.; Yang, Y. Energy-efficient dynamic offloading and resource scheduling in mobile cloud computing. In Proceedings of the IEEE INFOCOM 2016—The 35th Annual IEEE International Conference on Computer Communications, San Francisco, CA, USA, 10–14 April 2016; pp. 1–9. [Google Scholar]
  5. Lu, J.; Hao, Y.; Wu, K.; Chen, Y.; Wang, Q. Dynamic offloading for energy-aware scheduling in a mobile cloud. J. King Saud-Univ.-Comput. Inf. Sci. 2022, 34, 3167–3177. [Google Scholar] [CrossRef]
  6. Guo, Y.; Li, H. Exploration on the Optimal Application of Mobile Cloud Computing in Enterprise Financial Management under 5G Network Architecture. Adv. Multimed. 2022, 2022, 7500014. [Google Scholar] [CrossRef]
  7. uz Zaman, S.K.; Jehangiri, A.I.; Maqsood, T.; Ahmad, Z.; Umar, A.I.; Shuja, J.; Alanazi, E.; Alasmary, W. Mobility-aware computational offloading in mobile edge networks: A survey. Clust. Comput. 2021, 24, 2735–2756. [Google Scholar] [CrossRef]
  8. Plachy, J.; Becvar, Z.; Strinati, E.C.; Pietro, N.D. Dynamic Allocation of Computing and Communication Resources in Multi-Access Edge Computing for Mobile Users. IEEE Trans. Netw. Serv. Manag. 2021, 18, 2089–2106. [Google Scholar] [CrossRef]
  9. Wang, C.; He, Y.; Yu, F.R.; Chen, Q.; Tang, L. Integration of networking, caching, and computing in wireless systems: A survey, some research issues, and challenges. IEEE Commun. Surv. Tutor. 2017, 20, 7–38. [Google Scholar] [CrossRef]
  10. Zhao, F.; Chen, Y.; Zhang, Y.; Liu, Z.; Chen, X. Dynamic Offloading and Resource Scheduling for Mobile-Edge Computing with Energy Harvesting Devices. IEEE Trans. Netw. Serv. Manag. 2021, 18, 2154–2165. [Google Scholar] [CrossRef]
  11. Shuja, J.; Bilal, K.; Alasmary, W.; Sinky, H.; Alanazi, E. Applying machine learning techniques for caching in next-generation edge networks: A comprehensive survey. J. Netw. Comput. Appl. 2021, 181, 103005. [Google Scholar] [CrossRef]
  12. Abbas, N.; Zhang, Y.; Taherkordi, A.; Skeie, T. Mobile Edge Computing: A Survey. IEEE Internet Things J. 2018, 5, 450–465. [Google Scholar] [CrossRef] [Green Version]
  13. Li, C.; Wang, H.; Song, R. Mobility-Aware Offloading and Resource Allocation in NOMA-MEC Systems via DC. IEEE Commun. Lett. 2022, 26, 1091–1095. [Google Scholar] [CrossRef]
  14. Tian, K.; Chai, H.; Liu, Y.; Liu, B. Edge Intelligence Empowered Dynamic Offloading and Resource Management of MEC for Smart City Internet of Things. Electronics 2022, 11, 879. [Google Scholar] [CrossRef]
  15. Chen, C.; Zeng, Y.; Li, H.; Liu, Y.; Wan, S. A multi-hop task offloading decision model in MEC-enabled internet of vehicles. IEEE Internet Things J. 2022. [Google Scholar] [CrossRef]
  16. Kuang, Z.; Li, L.; Gao, J.; Zhao, L.; Liu, A. Partial offloading scheduling and power allocation for mobile edge computing systems. IEEE Internet Things J. 2019, 6, 6774–6785. [Google Scholar] [CrossRef]
  17. Li, L.; Kuang, Z.; Liu, A. Energy efficient and low delay partial offloading scheduling and power allocation for MEC. In Proceedings of the ICC 2019–2019 IEEE International Conference on Communications (ICC), Shanghai, China, 20–24 May 2019; pp. 1–6. [Google Scholar]
  18. Saleem, U.; Liu, Y.; Jangsher, S.; Tao, X.; Li, Y. Latency minimization for D2D-enabled partial computation offloading in mobile edge computing. IEEE Trans. Veh. Technol. 2020, 69, 4472–4486. [Google Scholar] [CrossRef]
  19. Sutton, R.S.; Barto, A.G. Introduction to Reinforcement Learning; MIT Press: Cambridge, MA, USA, 1998; Volume 135. [Google Scholar]
  20. François-Lavet, V.; Henderson, P.; Islam, R.; Bellemare, M.G.; Pineau, J. An introduction to deep reinforcement learning. Found. Trends® Mach. Learn. 2018, 11, 219–354. [Google Scholar] [CrossRef] [Green Version]
  21. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep reinforcement learning: A brief survey. IEEE Signal Process. Mag. 2017, 34, 26–38. [Google Scholar] [CrossRef] [Green Version]
  22. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529. [Google Scholar] [CrossRef]
  23. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016. [Google Scholar]
  24. Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.; Lanctot, M.; Freitas, N. Dueling Network Architectures for Deep Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1995–2003. [Google Scholar]
  25. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1928–1937. [Google Scholar]
  26. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870. [Google Scholar]
  27. Kakade, S.M. A natural policy gradient. Adv. Neural Inf. Process. Syst. 2001, 14, 1531–1538. [Google Scholar]
  28. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  29. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  30. Zhang, L.; Zhang, Z.Y.; Min, L.; Tang, C.; Zhang, H.Y.; Wang, Y.H.; Cai, P. Task offloading and trajectory control for UAV-assisted mobile edge computing using deep reinforcement learning. IEEE Access 2021, 9, 53708–53719. [Google Scholar] [CrossRef]
  31. Yang, S.; Liu, J.; Zhang, F.; Li, F.; Chen, X.; Fu, X. Caching-Enabled Computation Offloading in Multi-Region MEC Network via Deep Reinforcement Learning. IEEE Internet Things J. 2022. [Google Scholar] [CrossRef]
  32. Yang, H.; Wei, Z.; Feng, Z.; Chen, X.; Li, Y.; Zhang, P. Intelligent Computation Offloading for MEC-based Cooperative Vehicle Infrastructure System: A Deep Reinforcement Learning Approach. IEEE Trans. Veh. Technol. 2022. [Google Scholar] [CrossRef]
  33. Wang, J.; Ke, H.; Liu, X.; Wang, H. Optimization for computational offloading in multi-access edge computing: A deep reinforcement learning scheme. Comput. Netw. 2022, 204, 108690. [Google Scholar] [CrossRef]
  34. Kuang, Z.; Shi, Y.; Guo, S.; Xiao, B. Multi-user offloading game strategy in OFDMA mobile cloud computing system. IEEE Trans. Veh. Technol. 2019, 68, 12190–12201. [Google Scholar] [CrossRef]
  35. Wu, Y.; Wang, Y.; Zhou, F.; Hu, R.Q. Computation efficiency maximization in OFDMA-based mobile edge computing networks. IEEE Commun. Lett. 2019, 24, 159–163. [Google Scholar] [CrossRef] [Green Version]
  36. Chen, X.; Zhang, H.; Wu, C.; Mao, S.; Ji, Y.; Bennis, M. Optimized computation offloading performance in virtual edge computing systems via deep reinforcement learning. IEEE Internet Things J. 2018, 6, 4005–4018. [Google Scholar] [CrossRef] [Green Version]
Figure 1. Multi-WNs and multi-MEC servers communication and computation system.
Figure 2. Structure of proposed decentralized multi-agent deep reinforcement learning model.
Figure 3. Convergence of DeMADRL with different learning rates $\alpha$.
Figure 4. Convergence of DeMADRL with different coefficients of bandwidth cost $\tau_{n,e}$.
Figure 5. Convergence of DeMADRL with request tasks of different priorities.
Figure 6. Comparison of the four algorithms with different probabilities of arrival tasks.
Figure 7. Comparison of the four algorithms with different numbers of subtasks for each WN.
Figure 8. Average delay of two learning algorithms with three-level priorities.
Figure 9. Ratio of incomplete tasks of two learning algorithms with three-level priorities.
Table 1. Hyperparameters setup.
Number of neurons in the first hidden layer: 400
Number of neurons in the second hidden layer: 300
Reward decay rate $\gamma$: 0.85
Learning rate $\alpha$: 0.001
$\epsilon$-greedy $\epsilon$: 0.992
Memory size $M_n$: 10,000
Maximum episodes $Episode_{Maximum}$: 800
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
