1. Introduction
With the unprecedented amount and variety of data generated, users’ demand for high-quality services continues to increase. Edge computing is an emerging paradigm that provides storage, computing, and network resources between traditional cloud data centers and end devices. In edge computing, the basic infrastructure consists of edge nodes, including industrial switches, controllers, routers, video surveillance cameras, and embedded servers [1,2,3,4,5]. Because IoT devices constantly generate data, analytics must be highly time-sensitive. An important issue is to find a provisioning strategy for edge nodes that reduces both the monetary cost of edge resources and the transmission delay experienced by users. Most existing studies explore resource provisioning for user workloads by offloading computing tasks directly to individual edge nodes or to the cloud. This paper considers the task offloading problem of delay-sensitive users under capacity constraints in the online task scenario, focusing on resource allocation during online task offloading based on edge node collaboration. The resource allocation problem is to determine how the resources of edge nodes are allocated to multiple users, subject to user deadlines and the computing resources of the edge nodes, so as to minimize the total cost. Collaboration means that an edge node can jointly provide services with rented edge nodes. Our goal is to improve the cost efficiency of edge computing for network operators while maintaining the quality of service for users.
We use a concrete scenario to explain the motivation of the research problem in detail; the definitions and notation involved are explained in the follow-up problem description section. Although the example generalizes to an arbitrary number of edge nodes and users, we assume here that there are five heterogeneous edge nodes, each providing different computing resources to users. The computing capabilities of the edge nodes are limited, and edge nodes from different suppliers support collaborative services. We assume that, within a period, a set of users passes through this area, and each user’s connection range is a specific area, shown as the gray circle in Figure 1. In this area, users can offload tasks to the corresponding edge nodes. We assume that the connection and location of each edge node have been fixed by a third-party service provider or cloud data center, and that the edge node receives and responds to offloading requests from terminal devices within its service range, returning the result to the user after processing on the server. We assume that users offload tasks to their nearest edge nodes, and when tasks are offloaded to the same service entity, the running time on that entity grows linearly with the load.
According to the scenario above, the dynamic network changes caused by mobile edge devices have a significant impact on a user’s choice of offloading location and the resulting performance. We use the following example, shown in Figure 1, to illustrate this effect. Consider a user whose trajectory runs from the area covered by one edge node to the area covered by another, and assume that one of these nodes has better computing power than the other. One offloading scheme is to offload the user’s workload to the edge node at the new location; in this scheme, the user’s cost is the sum of the task processing cost at that node and the transmission cost of feeding back the result after completion. However, due to the dynamic movement of users on the edge network, suppose that while this user’s task is being offloaded, a second user reaches the same area at the same time and offloads its task to the same edge node. If the second user has a higher task priority and both users attempt to offload simultaneously, the first user’s task may suffer a sudden increase in processing cost and delay. Another solution is to offload the workload to the edge node at the target position of the user’s trajectory. This solution can effectively reduce the task processing delay, but the resulting communication overhead cannot be ignored. Therefore, for task offloading with online mobile users, both the user’s offloading location and the network dynamics caused by device mobility must be considered comprehensively.
In this paper, we study the problem of online task offloading in a multi-user mobile scenario, which, to the best of our knowledge, has not been addressed before. We jointly optimize the delay and energy consumption of user-requested tasks, and on this basis we consider a more realistic and complex setting, namely multi-user mobility.
Our problem poses the following unique challenges: (1) Because the computing power of edge nodes is limited and heterogeneous, when several groups of users arrive with workloads of different sizes, finding a feasible task offloading strategy that completes each user’s workload within its deadline is challenging. (2) In this paper’s problem scenario, end users are mobile, and their trajectories are random and time-varying. During movement, only some edge nodes can connect with a given user, and users must offload their workload to an edge node located in their effective area; quickly feeding the processing results back to mobile users is therefore difficult. (3) The collaborative work of edge nodes can reduce user delay, but it may increase communication cost. Achieving efficient edge resource allocation that satisfies multiple mobile user requests at minimal cost, balancing the trade-off between cost and latency, is not trivial.
This paper focuses on the online task offloading problem in the edge collaboration scenario and realizes the joint optimization of mobile user delay and energy consumption under capacity and computing power constraints. The main contributions of this paper are as follows:
- (1)
This paper discusses the problem of online task offloading for multiple mobile users and mainly studies the scenario of edge nodes working together in mobile edge computing. On this basis, a reward evaluation method based on deep reinforcement learning is proposed to realize online task offloading for multiple mobile users, jointly optimizing overall latency and energy consumption under the constraints of edge physical resources.
- (2)
According to the characteristics of user mobility, a trajectory prediction method based on multi-user movement is designed to reduce the delay caused by user mobility. On this basis, centering on the task-requesting user, a task scheduling optimization mechanism based on the multi-dimensional characteristics of tasks and edge resource conditions is proposed; it avoids invalid states in the decision-making module and approximates the policy there according to the introduced feasibility mechanism.
- (3)
Finally, we conduct experiments on both synthetic and real simulation datasets. On the synthetic dataset, we compare the proposed joint optimization method with several state-of-the-art methods and evaluate the results from different perspectives. On this basis, we take obstacle detection and recognition in autonomous driving as a concrete task and further verify the effectiveness of the proposed method on real datasets. The experimental results show that the scheme can serve an increased number of task requests from mobile terminal users in the edge collaborative service area while guaranteeing the service quality of higher-priority task requests.
2. Related Work
In order to effectively coordinate computing resources in the end-edge-cloud collaborative computing paradigm and free mobile devices from limited computing power and energy supply, both industry and academia regard computation offloading as a promising solution. Computation offloading is one of the hot research issues in the field of mobile edge computing, and many researchers have explored it; mobile edge computing algorithms with different theoretical characteristics have been proposed. Computation offloading in mobile edge computing raises two main problems: first, deciding when to offload computing tasks from devices to servers for processing; second, how to allocate server resources reasonably to meet user needs. To solve these problems, existing works have proposed models and algorithms based on different optimization objectives. Mukherjee et al. [6] considered the transmission delay and service rate from fog to cloud, and jointly optimized the number of tasks offloaded to adjacent fog nodes and the allocation of communication resources offloaded to the remote cloud through semi-definite relaxation. Sarkar et al. [7] used a task classification strategy to propose a dynamic task allocation strategy that allocates the required tasks, minimizes delay, and meets deadlines. Yang et al. [8] proposed a feed-forward neural network model based on multi-task learning to effectively solve binary offloading decision-making and computing resource allocation problems. This method transforms the original problem into a mixed-integer nonlinear programming problem and uses an MTFNN model trained offline to derive the optimal solution; compared with traditional optimization algorithms, it has lower computational cost and is significantly better in terms of computation time and inference accuracy. Zhan et al. [9,10,11] considered mobility during task offloading, converted the original global optimization problem into multiple local optimization problems, and proposed a heuristic mobility-aware offloading algorithm to obtain an approximately optimal offloading scheme. Zhou et al. [12] proposed a reliability stochastic optimization model based on dynamic programming to deal with the dynamic and stochastic characteristics of vehicular networks and ensure the reliability of vehicular computation offloading. They also proposed an optimal data transmission scheduling mechanism that accounts for the randomness of vehicle-infrastructure communication and maximizes the lower bound.
Beyond the above work, and with energy consumption in mind, various solutions to the task offloading problem have been proposed for different environments and scenarios. Under delay constraints, Zhang et al. [13] designed an energy offloading strategy using the artificial fish swarm algorithm, which considers link conditions and effectively reduces device energy consumption, though at high algorithmic complexity. In a multi-resource environment, Xu et al. [14] proposed an energy-minimizing particle swarm task scheduling algorithm to match multiple resources and reduce the energy consumption of edge terminal devices. Wei et al. [15,16,17] divided the task offloading problem into a mobility management problem and an energy saving problem and used a greedy algorithm to minimize the energy consumption of mobile devices. Lu et al. [18] provided an efficient resource allocation scheme to minimize the total cost of multiple mobile users by considering three different cases. Yu et al. [19] studied task offloading in ultra-dense network scenarios and proposed a task offloading algorithm based on Lyapunov optimization theory that effectively reduces the total energy consumption of base stations. Aiming at the high energy consumption and computing demands that mobile social platforms may cause, Guo et al. [20] proposed an energy consumption optimization model based on the Markov decision process, which considers the network status of different environments, dynamically selects the best network access, and refreshes downloads in the best image format to reduce power consumption. To address the privacy leakage problem that may occur in offloading decisions, Liu et al. [21] studied the offloading problem based on deep learning and proposed a deep-learning-based group sparse beamforming framework to optimize network power consumption. These task offloading decisions reduce delay; however, they do not consider the energy consumed by the terminal device during offloading and computation, and the terminal device may fail to operate normally due to insufficient power.
Some researchers have proposed solutions to the joint optimization of delay and energy consumption. Gao et al. [22] proposed a joint computation offloading and priority-based task scheduling scheme in mobile edge computing that uses a dynamic priority-level task scheduling algorithm while considering the urgency of tasks and the idleness of edge servers; assigning tasks to edge servers in this way reduces task completion time and improves service quality. Kim et al. [23] formulated the problem as a linear integer optimization problem to optimize delay and resource costs; they introduced MoDEMS, a system that optimizes the deployment of edge computing based on user mobility, and proposed a Seq-Greedy heuristic to generate a migration plan that minimizes system cost and user delay. Ale et al. [24] proposed an end-to-end deep reinforcement learning method to offload services to the best edge server and allocate optimal computing resources, maximizing the number of tasks completed before their respective deadlines while minimizing energy consumption. Hazra et al. [25] proposed a heuristic-based transmission scheduling strategy that transmits tasks according to their importance, together with a graph-based task offloading strategy that uses constrained mixed-integer linear programming to handle high traffic in peak-period scenarios while maintaining energy and delay constraints. Zhang et al. [26] considered the effect of task priority on offloading decisions; to better meet user needs, they proposed an importance-model-based method that accounts for differences between user tasks, taking the maximum tolerable delay and the amount of computing resources required as the two main factors of task importance, so that task characteristics are considered comprehensively. Yu et al. [27,28,29] integrated mobility prediction into the offloading strategy and resource allocation, combining the offloading strategy and mobility management modules; this method intelligently allocates tasks according to the user’s mobility pattern, placing tasks, as far as possible, at locations with better future network conditions to reduce energy consumption. From the above work, the individual differences between tasks determine their importance, which depends on the required computing resources and the maximum tolerable delay. At the same time, user mobility is closely related to service quality: frequent movement may cause a terminal device to leave the service range of a server, interrupting service, and forwarding offloaded tasks between servers incurs considerable transmission delay, leading to untimely responses and poor performance. Therefore, when solving the online task offloading problem, we consider multi-user mobility and differences in task characteristics comprehensively.
3. Problem Description
In the context of mobile edge computing, this section describes the system in detail. It is divided into three subsections, covering the system model, the transfer model, and the computational model.
3.1. System Model
In this subsection, the mobile edge computing system adopts the three-layer architecture of cloud, edge, and terminal. We consider a set of edge nodes deployed in a specific activity area; these edge nodes are connected to a base station and have limited computing power and storage capacity. A set of mobile users is served by the edge nodes. The mobility of each end user is described by its moving rate and its direction of movement. In order to better capture the movement of each user, the system is assumed to operate in discrete time slots [15,30]. The trajectories of the terminal devices are updated dynamically at the beginning of each time slot. Each task request in the task request set is characterized by its size, that is, the number of bits it contains. We assume that at the beginning of each time slot, an end user may generate an indivisible task, and each mobile end user sends its task request to an edge node. A binary indicator variable denotes whether a user offloads its task to a given edge server at a given time slot: it is 1 if so and 0 otherwise. For each edge node, we maintain the set of task requests placed on that edge server.
3.2. Transfer Model
This section presents a transfer model for user task offloading to edge nodes. We define the transmission rate of the offload link and the transmission power; the transmission rate is calculated as shown in Formula (1), which depends on the channel gain between the user and the edge node, the noise power, and the transmission power of the device.
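Formula (1) itself is not reproduced in this excerpt. A common form for the uplink rate of an offload link, consistent with the quantities named above (transmission power, channel gain, noise power), is the Shannon capacity; the sketch below assumes that form, and all parameter names are illustrative.

```python
import math

def transmission_rate(bandwidth_hz, tx_power_w, channel_gain, noise_power_w):
    """Shannon-capacity uplink rate in bits/s: r = B * log2(1 + p * h / sigma^2).

    A plausible model for Formula (1); the paper's exact form may differ.
    """
    snr = tx_power_w * channel_gain / noise_power_w  # signal-to-noise ratio
    return bandwidth_hz * math.log2(1.0 + snr)
```

With 1 MHz bandwidth, 0.5 W transmit power, a channel gain of 2e-6, and noise power 1e-7 W, the SNR is 10 and the rate is about 3.46 Mbit/s.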
3.3. Computational Model
This section defines computing models for end users and edge servers, respectively. For each end user, we define two queues: a local computing queue and a task offloading queue. At the start of each time slot, the lengths of the computing queue and the offloading queue reflect the device’s load condition. For each end user, we define its processor parameters as follows: the CPU frequency of the end user’s processor, that is, the computing power of the terminal device; and the standby power and computing power of the device’s processor. The standby power is used to calculate the standby energy loss while waiting for the offloading result, and the computing power is used to calculate the energy consumed to process a task; both enter the total benefit evaluation. On the server side, we assume that each edge server maintains a task queue for tasks offloaded to it, scheduled first-in-first-out. The queue at the beginning of each time slot, together with its length, reflects the server’s load. At the beginning of each time slot, the server node broadcasts the load condition of its task queue to all terminal devices in the service area so that they can make offloading decisions. The computing power of a server node is the CPU frequency of its processor.
Here, we define the set of edge servers in the area that can provide offloading services to users, each with a service radius. We evaluate the user’s task offloading delay through the delayed benefit. The terminal device must first determine whether there are feasible nodes in the server set that can provide offloading services at its current location; if there is no feasible node, the task cannot be offloaded to the edge side. The feasible edge node set is constructed as follows: (1) select an edge server and calculate the distance between the end user and that server based on their current locations; (2) compare the server’s service radius with this distance, and if the distance does not exceed the radius, add the server to the feasible set; (3) repeat the above steps until all edge nodes in the set have been traversed. Based on the feasible edge node set constructed above, a delay calculation is performed for each edge node in the set. The terminal device generates a task of a given size at the beginning of the time slot.
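The three-step construction of the feasible edge node set can be sketched as follows; the data layout (`servers` mapping server ids to positions and service radii) is an assumption for illustration.

```python
import math

def feasible_servers(user_pos, servers):
    """Build the feasible edge-node set: servers whose service radius
    covers the user's current position (steps (1)-(3) in the text).

    user_pos: (x, y) of the end user.
    servers: dict mapping server id -> ((x, y), radius).
    """
    feasible = []
    for sid, ((sx, sy), radius) in servers.items():
        # step (1): distance between the end user and the edge server
        dist = math.hypot(user_pos[0] - sx, user_pos[1] - sy)
        # step (2): compare the service radius with the distance
        if dist <= radius:
            feasible.append(sid)
    # step (3): the loop traverses every edge node in the set
    return feasible
```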
3.3.1. Execute on the Local Device
We define the local execution delay of the task of an end user; the specific calculation is shown in Formula (2), whose inputs are the length of the slot’s computing queue and the processing capability of the mobile terminal device.
We define the energy consumption of tasks executed locally; the specific calculation is shown in Formula (3), whose inputs are the local processing delay calculated by Formula (2) and the computing capability of the mobile terminal device.
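Formulas (2) and (3) are not reproduced in this excerpt. A minimal sketch consistent with the surrounding description (delay from the queue backlog plus the new task at the device's processing rate; energy as delay times computing power) might look like this; function and parameter names are illustrative assumptions.

```python
def local_delay(queue_bits, task_bits, cpu_rate):
    """Formula (2) sketch: time to drain the local computing queue plus
    the new task, at processing rate `cpu_rate` (bits per second)."""
    return (queue_bits + task_bits) / cpu_rate

def local_energy(delay_s, compute_power_w):
    """Formula (3) sketch: energy = local processing delay x computing power."""
    return delay_s * compute_power_w
```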
3.3.2. Offload Processing Delay
We define the execution delay of the task on the edge server, which depends on the length of the server’s offloading queue; the specific calculation is shown in Formula (4). On this basis, we define the delay of task transmission and the total delay of task offloading. Usually, the result data returned after task processing is small, so this paper does not consider its computation delay or the transmission time of the returned result; the total offloading delay therefore consists of the transmission delay of the task and its execution delay on the server. To facilitate the end user’s location decision during offloading, this section measures the process by a delayed benefit, calculated as shown in Formula (7).
We define the energy consumption of tasks offloaded to the server, which combines the standby energy the terminal consumes while waiting for the processing results with the execution delay on the server; the calculation is shown in Formula (8). We define the energy consumed by transmitting the task to the offloading server, calculated as shown in Formula (9). The total energy consumption of tasks offloaded to the edge server is the sum of the two, as shown in Formula (10). We define the corresponding energy consumption benefit, calculated as shown in Formula (11).
In order to balance delay against energy consumption, we define a preference factor that determines whether the optimization goal leans toward reducing delay or reducing energy consumption. The total benefit evaluation result is then calculated as shown in Formula (12).
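Formulas (7), (11), and (12) are not reproduced in this excerpt. The sketch below assumes a common form in which each benefit is the saving relative to local execution and the preference factor linearly weights the two; names and the linear weighting are assumptions.

```python
def delay_gain(local_delay_s, offload_delay_s):
    """Formula (7) sketch: delayed benefit = local delay - total offload delay."""
    return local_delay_s - offload_delay_s

def energy_gain(local_energy_j, offload_energy_j):
    """Formula (11) sketch: energy benefit = local energy - total offload energy."""
    return local_energy_j - offload_energy_j

def total_benefit(d_gain, e_gain, alpha):
    """Formula (12) sketch: weighted sum of delay and energy benefits.
    alpha in [0, 1] is the preference factor; larger alpha favors delay."""
    return alpha * d_gain + (1.0 - alpha) * e_gain
```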
In this paper, we mainly study the problem of online task offloading in the edge collaborative operation scenario. With the optimization goal of maximizing revenue under limited resources, a new online task offloading method based on deep reinforcement learning is proposed; total revenue is optimized subject to cost constraints. Formula (13) is the optimization objective, and Formulas (14) and (15) are the constraints. Formula (14) is a physical resource constraint: the computing resource queue of a terminal device cannot exceed its threshold, and the computing resource queue of an edge server likewise cannot exceed its threshold. Formula (15) indicates whether a user uses the service at a given time slot.
4. Edge Collaborative Online Task Offloading Strategy Based on Deep Reinforcement Learning
In order to minimize the total delay of the current user set over a continuous period, a novel decentralized dynamic service facility management framework based on deep reinforcement learning is designed in this section to achieve lower latency under physical resource and cost constraints.
4.1. Overall Policy Framework
In this subsection, a novel edge online task offloading management framework based on deep reinforcement learning (OTO-DRL) is designed to achieve higher overall profit under physical resource and cost constraints.
Figure 2 shows the overall structure of the OTO-DRL framework.
Since the decision-making process in online task offloading is a stochastic optimization process, this section studies the framework based on the deep deterministic policy gradient (DDPG) algorithm. In order to concisely and accurately describe the current environment and state space, we need to consider the task workload distribution on the edge servers and the states they provide to users. Therefore, we design the state and action spaces, reward functions, and state transition strategies in the reinforcement learning framework. The definition of the reinforcement learning design is shown below.
Definition 1 (State Space). The state space describes the current environmental state of the mobile edge network and is defined as a vector with three components. The first component records the sizes of the offloading tasks that the terminal devices send to the server nodes at the beginning of the current time slot; if a terminal device sends no offloading request at the beginning of the slot, the corresponding value is zero. The second component records the priorities of the offloading tasks sent to the edge servers at the beginning of the slot; if a terminal device sends no offloading request, the element at the corresponding position is zero, and otherwise it equals the priority of the offloading task. The third component records the remaining computing resources of each edge server at the beginning of the slot.
Definition 2 (Action Space). The action space describes the behavioral decisions of the agent and represents the migration strategy of tasks in the current time slot, that is, the range of edge servers that can be selected while migrating each task. Each per-task decision takes a value in {0, 1}, where 0 indicates local execution and 1 indicates offloading to the currently connected edge server. For each service, the optional edge servers are represented by a range of consecutive edge server numbers, from the minimum edge server number that can be selected during task migration up to the maximum number of collaborating edge servers.
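As a concrete illustration of Definition 1, the three components of the state can be flattened into a single vector; the layout below is an assumption for illustration, not the paper's exact encoding.

```python
def build_state(task_sizes, task_priorities, remaining_resources):
    """Flatten the three components of the state space into one vector:
    per-device offloading task sizes (0 if no request this slot),
    per-device task priorities (0 if no request this slot),
    and per-server remaining computing resources."""
    return list(task_sizes) + list(task_priorities) + list(remaining_resources)
```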
Since the problem we study here is an online learning process, the reward value in a single time slot cannot directly determine the final total profit of multiple mobile users. Taking one time slot as an example, the reward for completing a task considers both current and future states. We define a basic reward value, and the reward for completing a single task is the product of this basic reward value and the task priority. Taking a task generated by a terminal device at the beginning of the t-th time slot as an example, the reward obtained by completing it is as follows:
For an edge server during a time slot, we define the instantaneous profit as the total reward the edge server obtains after completing multiple tasks. We define the task queue of the server node at the beginning of the time slot and assume that the server completes the tasks with indices 1 to n in that queue during the slot. The instantaneous profit is then calculated as follows:
We define the expected future profit as the reward obtainable from the offloading requests that arrive at the server node at the beginning of the time slot, using an indicator of whether each task is processed; the specific calculation is as follows: We further define the expected future loss as the total reward value of the offloading tasks rejected after the offloading scheduling at the server node at the beginning of the time slot, calculated as follows:
Definition 3 (Reward). The reward value is determined by the above variables, considering the current profit of the edge server as well as the expected future profit and loss after performing offloading scheduling operations. The reward is defined as shown in (20).
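A minimal sketch of Definition 3, assuming the per-task reward is the basic reward value times the task priority and that Formula (20) combines the three quantities additively (the additive form is an assumption, since the formula is not reproduced here):

```python
def task_reward(base_reward, priority):
    """Reward for completing one task: basic reward value x task priority."""
    return base_reward * priority

def slot_reward(completed, accepted_future, rejected_future, base_reward=1.0):
    """Definition 3 sketch: instantaneous profit of completed tasks,
    plus expected future profit of accepted requests,
    minus expected future loss of rejected ones.
    Each argument is a list of task priorities."""
    profit = sum(task_reward(base_reward, p) for p in completed)
    future = sum(task_reward(base_reward, p) for p in accepted_future)
    loss = sum(task_reward(base_reward, p) for p in rejected_future)
    return profit + future - loss
```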
4.2. Load Balancing-Based Multi-Task Offloading Conflict Resolution Mechanism
This paper aims to maximize the total profit while enabling multiple mobile users to perform online task offloading. For each user’s task offloading request, the decision depends on observing the mobile edge network environment from the user’s own perspective during each training process. However, the mobile edge computing system has no prior knowledge: no service knows the size of a user’s data or its trajectory in advance. At the same time, the entire process is online and model-free, and multiple users move irregularly and independently during learning. Therefore, when the trajectories of multiple users are similar or overlap, resource allocation imbalance easily occurs at positions covered by multiple edge servers, increasing computation delay in some areas and decreasing user service quality. In order to maintain the performance of edge network services, this paper analyzes the tight coupling between end users’ activity trajectories and edge resource load balancing, and proposes a multi-task offloading conflict resolution mechanism based on edge network load balancing.
We propose a new multi-task offloading conflict resolution mechanism in the decision-making module to avoid invalid states and to approximate policies. The mechanism has two main stages: finding edge servers and service requests with uneven load distribution, and making collaborative job decisions for high-load edge nodes. Algorithm 1 describes the procedure: its input is the action and the current time slot, and its output is the updated service placement strategy. The algorithm also enables a conflict resolution mechanism to ensure that service placement on the edge servers can be implemented effectively. Based on the decision given by reinforcement learning in the current time slot, we first perform pre-offloading according to the task requests under each user’s predicted trajectory and then check the status of each edge server. If the number of tasks on an edge server exceeds its queue capacity after pre-offloading, the server is congested and may cause task offloading conflicts; otherwise, all services offloaded to that server can be completed. Based on this analysis, we make offloading decisions. For each congested edge server, we first construct the latest task conflict set on that server, which consists of all tasks requesting to be executed on the server simultaneously. We then repeatedly select the task with the maximum profit value in the conflict set for processing until the number of tasks reaches the queue threshold, updating the conflict set accordingly.
Definition 4 (Collaboration factor). The collaboration factor of an adjacent node with respect to an edge server at a given time slot is defined as shown in Formula (21). For the tasks in the conflict set that have not been processed, we define the collaboration factor based on node connectivity in the edge network. The core idea is to allocate tasks based on the distance between adjacent edge nodes and the queue length, with a tuning parameter balancing the two. For the edge nodes chosen as collaborators, the results are returned to the original edge server after completion and then fed back to the terminal users.
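Formula (21) is not reproduced in this excerpt. The sketch below assumes one plausible form of Definition 4 in which closer, less-loaded neighbors score higher, with `beta` as the tuning parameter; the exact functional form in the paper may differ.

```python
def collaboration_factor(distance, queue_len, beta=0.5):
    """Definition 4 sketch: score an adjacent node by its distance to the
    congested server and its current queue length. `beta` in [0, 1] is the
    tuning parameter trading off distance against load; the additive
    inverse form here is an assumption."""
    return beta / (1.0 + distance) + (1.0 - beta) / (1.0 + queue_len)
```

A nearby, lightly loaded neighbor thus scores higher than a distant, heavily loaded one, so overflow tasks are routed to it first.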
Algorithm 1. Load Balancing-Based Multi-Task Offloading Task Conflict Resolution Method (TCR) |
Input: |
Output: The updated action of tasks offloading decisions under the conflict edge servers; |
1 | for … do |
2 | Pre-offloading according to … ; |
3 | for … do |
4 | … ; |
5 | if … then |
6 | … ; |
7 | … ; |
8 | … ; |
9 | for … do |
10 | … ; |
11 | |
12 | end for |
13 | … ; |
14 | else |
15 | … ; |
16 | end if |
17 | end for |
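The core of Algorithm 1 can be sketched as follows, assuming per-server queue capacities and per-task profit values; the data structures (`requests`, `capacities`, `profits`) are illustrative, and collaborative dispatch of the overflow tasks (via Definition 4) is left to the caller.

```python
def resolve_conflicts(requests, capacities, profits):
    """Sketch of Algorithm 1 (TCR): for each congested server, keep the
    highest-profit tasks up to the queue threshold; the rest form the
    conflict overflow handed to collaborating nodes.

    requests: dict server -> list of task ids pre-offloaded to it.
    capacities: dict server -> queue capacity.
    profits: dict task id -> profit value.
    """
    kept, overflow = {}, {}
    for server, tasks in requests.items():
        cap = capacities[server]
        if len(tasks) <= cap:
            # no congestion: all offloaded services can be completed
            kept[server] = list(tasks)
            overflow[server] = []
        else:
            # conflict: repeatedly pick the maximum-profit task until
            # the queue threshold is reached
            ranked = sorted(tasks, key=lambda t: profits[t], reverse=True)
            kept[server] = ranked[:cap]
            overflow[server] = ranked[cap:]  # handed to collaborators
    return kept, overflow
```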
4.3. Online Task Offloading Strategy Based on Deep Reinforcement Learning
This section proposes a dynamic service placement strategy based on deep reinforcement learning. Based on the characteristics of the decision-making process, we investigate a solution based on the deep deterministic policy gradient (DDPG) algorithm. The main idea is to use a deep reinforcement learning agent to perform dynamic service placement for multiple mobile users to minimize the total delay. The specific steps are shown in Algorithm 2.
Algorithm 2. Online Task Offloading based on Deep Reinforcement Learning (OTO-DRL) |
Input: ; |
Output: ; |
1 | ; |
2 | ; |
3 | ; |
4 | for do |
5 | Initialize environmental parameters for edge servers and users, and generate an initial state ; |
6 | for each time slot from 1 to do |
7 | Select an action to determine the destination of migration by running the current policy network and exploration noise ; |
8 | Detect migration conflicts and resolve via Algorithm 1; |
9 | Execute action of each user agent independently, and observe reward and new state from the environment; |
10 | Store the transition tuple into replay buffer B; |
11 | Randomly sample a mini-batch of I transitions from replay buffer B; |
12 | Update the critic network by minimizing the loss function L in Equation (16); |
13 | Update the actor network by using the sampled policy gradient in Equation (17); |
14 | ; |
15 | end for |
16 | end for |
We use the sets of edge nodes, services, and users as the input, and a dynamic service placement policy as the output. We initialize the preliminary parameters of the reinforcement learning agent, including the main networks, target networks, and replay buffer, and begin training. Each edge server independently determines the placement strategy of its services (migration or maintaining the original position) through training. We begin by initializing the environment parameters for the edge servers and users, generating an initial state, and starting the training process. For each time slot, we select the action of the current state by running the current policy network with exploration noise, which determines the target migration position of each service. Since user mobility is unstable and autonomous, we detect and resolve any migration conflicts using Algorithm 1. We execute the action for each user agent and observe the reward and new state from the environment. We then store the state transition tuple in the replay buffer. The actor and critic networks are updated based on a sampled mini-batch: the critic network takes the state and action as inputs and outputs the action decision value [30]. Specifically, the critic network approximates the action-value function by minimizing the loss function shown in Formula (22):
The actor network represents the parameterized policy and is updated by stochastic gradient ascent on the expected return, using the sampled policy gradient. Finally, the target networks are softly updated from the main actor and critic networks.
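The soft target-network update at the end of each training step is standard in DDPG; with parameters represented as plain lists of floats for illustration (real implementations update tensors in place), it can be sketched as:

```python
def soft_update(target_params, main_params, tau=0.005):
    """DDPG soft target update: theta_target <- tau * theta_main
    + (1 - tau) * theta_target, applied element-wise. tau is a small
    mixing coefficient; 0.005 is a commonly used illustrative value."""
    return [tau * m + (1.0 - tau) * t
            for t, m in zip(target_params, main_params)]
```

Keeping `tau` small makes the target networks trail the main networks slowly, which stabilizes the bootstrapped critic targets.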