Mathematics · Open Access Article
19 July 2024

A Multi-Agent Reinforcement Learning-Based Task-Offloading Strategy in a Blockchain-Enabled Edge Computing Network

Affiliations:
1 Key Laboratory of Broadband Wireless Communication and Sensor Network Technology (Ministry of Education), Nanjing University of Posts and Telecommunications, New Mofan Road No. 66, Nanjing 210003, China
2 Post Big Data Technology and Application Engineering Research Center of Jiangsu Province, Nanjing University of Posts and Telecommunications, New Mofan Road No. 66, Nanjing 210003, China
3 Post Industry Technology Research and Development Center of the State Posts Bureau (Internet of Things Technology), Nanjing University of Posts and Telecommunications, New Mofan Road No. 66, Nanjing 210003, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Fuzzy Modeling and Fuzzy Control Systems

Abstract

In recent years, many mobile edge computing network solutions have enhanced data privacy and security and built a trusted network mechanism by introducing blockchain technology. However, this also complicates the task-offloading problem of blockchain-enabled mobile edge computing, which traditional evolutionary learning and single-agent reinforcement learning algorithms struggle to solve effectively. In this paper, we propose a blockchain-enabled mobile edge computing task-offloading strategy based on multi-agent reinforcement learning. First, we propose a novel blockchain-enabled mobile edge computing task-offloading model that comprehensively considers optimization objectives such as task execution energy consumption, processing delay, user privacy metrics, and blockchain incentive rewards. Then, we propose a deep reinforcement learning algorithm in which multiple agents share a global memory pool under the actor–critic architecture, enabling each agent to acquire the experience of the other agents during training and thereby enhancing collaboration among agents and overall performance. In addition, we add attenuatable Gaussian noise to the action-selection process in the actor network to avoid falling into local optima. Finally, experiments show that the proposed scheme improves comprehensive cost performance by more than 10% compared with other multi-agent reinforcement learning algorithms, while Gaussian-noise-based action exploration and the global memory pool improve performance by 38.36% and 43.59%, respectively.

1. Introduction

Mobile Edge Computing (MEC) is an emerging computing paradigm that deploys computing and storage resources at the edge of the network, close to the data source, to provide low-latency, high-bandwidth, and customizable services. MEC can improve the performance, efficiency, and security of applications that require low latency, high bandwidth, and data privacy, such as augmented reality, smart cities, and autonomous driving. However, because the MEC network environment is open and dynamic, it is vulnerable to malicious node intrusion and data attacks. Such attacks can cause problems such as shared data leakage, task execution interference, and resource allocation anomalies within the network, seriously affecting the security of MEC. Therefore, ensuring the safe sharing of data and trustworthy collaboration of nodes is an important issue that needs to be solved in MEC [1,2].
Blockchain, known as a distributed, tamper-proof, decentralized data storage technology, was first proposed by Nakamoto in the context of Bitcoin, which enables secure, transparent, and immutable transactions and transfer of data records between multiple parties without relying on a trusted third party [3]. Blockchain has been widely used in various fields, such as digital asset management, supply chain finance, and intelligent manufacturing. Blockchain-based MEC (BMEC) is a new type of architecture that applies blockchain technology to MEC systems and can solve many challenges, such as data security, privacy protection, incentive mechanisms, resource management, etc. [4]. The in-depth integration of blockchain and MEC has been widely discussed [5]. In telematics and intelligent transportation systems, blockchain can provide collaborative management of service resources [6], data security sharing management [7], and collaborative node identity authentication [8]. In the smart grid, blockchain-based MEC is mainly applied to system architecture design [9], energy transaction pricing [10], and transaction security [11]. In addition, benefiting from the advantages of blockchain-based MEC, intelligent health care [12] and artificial intelligence [13] are also beginning to be applied.
Although the blockchain-based MEC system has excellent application prospects and research value, it also faces some important problems, including task offloading. MEC task offloading refers to the technology of offloading computing tasks from user devices to edge nodes or clouds for execution in order to compensate for the deficiencies of user devices in resource storage, computational performance, and energy efficiency. For the task-offloading problem in blockchain-based MEC systems, current research still has limitations in the following respects: (1) most existing works only consider the quality of service indicators of traditional MEC task offloading, such as task processing latency and energy consumption, but ignore blockchain mechanisms and user privacy leakage, which makes the problem modeling insufficient [14,15,16], and (2) task-offloading algorithms are often based on heuristic learning methods or single-agent reinforcement learning algorithms [17,18,19], whose solution quality and efficiency are unsatisfactory for dynamically changing, high-dimensional, non-convex task-offloading problems. In this paper, blockchain-based mobile edge computing task-offloading modeling and a multi-agent reinforcement learning method are investigated, and the main innovative contributions are summarized as follows:
  • We propose a novel task-offloading model for blockchain-based MEC networks that comprehensively considers the blockchain-specific incentive mechanism and consensus mechanism. It also takes the user privacy metric as the optimization objective, together with the task service quality as the joint optimization objective, which makes the modeling of the optimization problem more in line with the practical environment;
  • We propose a reinforcement learning algorithm based on a multi-agent global memory pool. Agents can enhance the overall collaborative ability among the agents by sharing parameters;
  • We adopt attenuatable Gaussian random noise in the action space selection process in the actor network to enhance the search capability and avoid falling into local optimum;
  • We conduct several sets of comparative experiments to validate the performance of the proposed algorithm in dealing with the task-offloading problem.
This paper is structured as follows. Section 2 investigates state-of-the-art research related to the research content of this paper. Section 3 presents the proposed blockchain MEC network task-offloading optimization model to be solved in this paper. Section 4 describes the principle and process of the reinforcement learning algorithm used in this paper. Section 5 conducts simulation experiments to evaluate the performance and effectiveness of the proposed algorithm. Section 6 summarizes the full paper.

3. Model

In this section, we propose a blockchain-based MEC system architecture, then provide a system overview and describe the operational flow and the blockchain consensus process.

3.1. System Model

In this paper, we propose a blockchain mobile edge network task-offloading model, whose architecture is shown in Figure 1. The network model contains a blockchain layer, an edge server layer, and a device layer. The device layer contains the set of user devices; each device interacts with the MEC network environment, sends task-offloading requests to an edge service node, and receives the offloading policy fed back by that node to complete task offloading. The edge server layer contains the set of edge nodes; each node holds certain task-processing resources, receives the task-offloading requests sent by users, and completes them through cooperation between nodes. The edge service nodes also act as blockchain nodes, which participate in network consensus and reward allocation in the blockchain layer and jointly maintain the blockchain that stores network information, ensuring the security of the network and incentivizing node participation.
Figure 1. Blockchain-based edge computing network model.
The edge network of this model has a device set $U = \{u_1, u_2, \dots, u_n\}$ consisting of $n$ user devices and an edge node set $E = \{e_1, e_2, \dots, e_m\}$ consisting of $m$ edge nodes. Any user device $u_i = (pw_{l_i}, f_{l_i}, enc_i^{max}, tk_i^t, loc_{l_i}^t = (x_{l_i}^t, y_{l_i}^t))$ can move along an irregular trajectory within a time slot and initiate a task-offloading request to the edge servers, where $pw_{l_i}$ is the total transmission power of the device, $f_{l_i}$ is its processing speed (the number of processing cycles per second), $enc_i^{max}$ is its upper energy limit, $tk_i^t$ is the task initiated by the user in time slot $t$, $loc_{l_i}^t$ is the device location, and $x_{l_i}^t$ and $y_{l_i}^t$ are the position coordinates.
In addition, the task $tk_i^t$ of any user device can be expressed as a tuple $(d_i^t, D_i^t, Td_{i,max}^t)$, where $d_i^t$ (bit) is the size of the task, $D_i^t$ is the number of computation cycles the task requires (in this paper, processing 1 bit of task data takes 500 CPU cycles), and $Td_{i,max}^t$ is the maximum tolerable delay of the task. Due to the energy and computational limitations of the device, these tasks cannot all be computed locally at the same time, and part of each task needs to be offloaded to the edge nodes. We use $tk_i^{t,l}$ and $tk_{ij}^{t,o}$ to denote the locally computed part of task $tk_i^t$ and the part offloaded to node $e_j$, respectively, and denote the corresponding data sizes and computation cycles as $d_i^{t,l}$, $D_i^{t,l}$, $d_{ij}^{t,o}$, and $D_{ij}^{t,o}$.
For any edge node, $e_j = (k_j, pw_{e_j}, f_{e_j}, loc_{e_j} = (x_{e_j}, y_{e_j}))$ can receive the task data offloaded by a device, process the task using its computational resources, and return the result to the smart device after processing is completed. Here, $k_j$ denotes the number of tokens held by the blockchain node corresponding to the edge node, $pw_{e_j}$ denotes the transmission power of the edge node, $f_{e_j}$ its processing speed, $loc_{e_j}$ the fixed location of the server, and $x_{e_j}$ and $y_{e_j}$ the location coordinates.
In the blockchain of this model, all the mobile edge network nodes also have the role of blockchain nodes, sharing parameters and recording proof of workload through the blockchain. The consortium blockchain uses a Proof of Stake (PoS)-based consensus method to validate the workload of the computing nodes and distribute incentive rewards to each of the individual nodes involved in the computation of the offloading task.

3.2. Consensus Model

The consensus mechanism in blockchain is the core method of ensuring that all participating nodes agree on the state of the blockchain. Currently, mainstream blockchain systems use two main consensus mechanisms, namely Proof of Work (PoW) and Proof of Stake (PoS). In PoW, all entities compete to solve a mathematical puzzle to generate blocks and receive a reward. However, PoW is very computationally intensive and is therefore poorly suited to mobile edge network scenarios. PoS was proposed to address the limitations of PoW; unlike PoW, the probability of an entity obtaining the right to publish a block depends on its stake, i.e., the number of tokens the entity owns [38]. A comparison of the two consensus mechanisms is shown in Table 1.
Table 1. Consensus mechanism comparison.
In this paper, the PoS-based consensus mechanism is used to implement the workload consensus checking of computing nodes. Its execution process is as follows:
(1)
Packing node selection: The system periodically selects the block creation node ($e_g \in E$) among all staked nodes [39], based on the number of tokens held by each candidate, and this node constructs the new block;
(2)
New block creation: The block creation node packages all blockchain network transactions in the system during time slot $t$ into a new block. We assume block $b_i$ consists of two parts: the offloading transaction data $d_b^t$ of tasks $tk_i^t$ and the fixed block data $d_0$. The size of the transaction data is converted from the original task size at a conversion rate $s$. The block size $d_{b_i}^t$ can then be expressed as

$$d_{b_i}^t = d_b^t + d_0 = s \sum_{e_j \in E} d_{ij}^{t,o} + d_0$$
(3)
Block validation: The consortium chain computes each edge node's selection probability from the number of tokens it owns, using a Poisson distribution with parameter $\lambda$. The $v$ nodes with the highest probability constitute the validation node set $E_V$, where the probability of edge node $e_j$ being selected as a validation node is

$$p_j^v = P(K = k_j) = \frac{\lambda^{k_j}}{k_j!} e^{-\lambda}$$
(4)
Block addition: Once a new block is recognized by all the validation nodes, it is added to the blockchain;
(5)
Incentive distribution: Based on the incentive mechanism, a certain reward is provided to the network nodes that participate in the task to compute and verify the new block.
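Steps (1)–(3) of the consensus process can be sketched in a few lines of Python. The token counts, conversion rate, and fixed block data size below are hypothetical values for illustration only, not the paper's experimental settings:

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Selection probability p_j^v = P(K = k_j) = lam^k / k! * e^(-lam)."""
    return lam ** k / math.factorial(k) * math.exp(-lam)

def select_validators(tokens: dict, lam: float, v: int) -> list:
    """Rank edge nodes by the Poisson probability of their token count k_j
    and keep the top-v nodes as the validation set E_V (step 3)."""
    probs = {node: poisson_pmf(k, lam) for node, k in tokens.items()}
    return sorted(probs, key=probs.get, reverse=True)[:v]

def block_size(offloaded_bits: list, s: float, d0: float) -> float:
    """d_{b_i}^t = s * (sum of offloaded task data) + fixed block data d_0 (step 2)."""
    return s * sum(offloaded_bits) + d0

# Hypothetical token holdings for four edge nodes and two offloaded tasks
validators = select_validators({"e1": 2, "e2": 5, "e3": 3, "e4": 9}, lam=3.0, v=2)
size = block_size([4e5, 6e5], s=0.01, d0=2048)  # 0.01 * 1e6 + 2048 = 12048.0
```

Note that with $\lambda = 3$, nodes holding 2 and 3 tokens have equal selection probability, so the ranking rewards token counts near $\lambda$ rather than the largest holdings.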
The workflow of the blockchain-based MEC task-offloading system described in this section is shown in Figure 2.
Figure 2. BMEC data process.

3.3. Quality of Service Model

In this section, the blockchain-based MEC task-offloading quality of service model proposed in this paper is described in detail, following the design methodology of quality of service models in existing MEC task-offloading research [15,29,33]. To simulate a blockchain–mobile edge network, within each time slot we investigate the delay and energy consumption generated by user task computation, task-offloading communication, block verification, etc., and construct a communication model and a computation model oriented toward blockchain-based network quality of service.

3.3.1. Communication Model

In this paper, it is assumed that the size of the task calculation result is much smaller than the task itself and that the communication overhead required to transmit the result is negligible. Therefore, this paper mainly considers the two data communication scenarios of user device task offloading and block verification and calculates the energy consumption and transmission delay in the communication process.
In this system model, the devices and the MEC servers are linked through a wireless network, and the transmission rate between them is affected by the transmission environment, communication resources, and transmission distance. In this paper, we refer to [40,41] and calculate the channel gain $h_{ij}^t$ from any device $u_i$ to edge node $e_j$ in time slot $t$ using the following formula:

$$h_{ij}^t = h_0 \left( dist_{ij}^t \right)^{-\varphi/2}$$

where $h_0$ denotes the initial gain of the channel, $\varphi$ is the path loss exponent, and $dist_{ij}^t = \sqrt{\left( x_{l_i}^t - x_{e_j} \right)^2 + \left( y_{l_i}^t - y_{e_j} \right)^2}$ denotes the distance from device $u_i$ to edge node $e_j$ at time slot $t$.
The signal-to-interference-plus-noise ratio $SINR_{i,j}^t$ from device $u_i$ to edge node $e_j$ is

$$SINR_{i,j}^t = \frac{pw_{ij}^t \left| h_{ij}^t \right|^2}{\sum_{e_{j'} \in E \setminus \{e_j\}} pw_{ij'}^t \left| h_{ij'}^t \right|^2 + N_0}$$

where $pw_{ij}^t$, $N_0$, and $B$ denote the transmission power from device $u_i$ to edge node $e_j$ in time slot $t$, the Gaussian noise power in the channel, and the channel communication bandwidth, respectively, with $pw_{l_i} = \sum_{e_j \in E} pw_{ij}^t$. The data transmission rate from device $u_i$ to edge node $e_j$ in time slot $t$ is then

$$R_{ij}^t = B \cdot \log_2 \left( 1 + SINR_{i,j}^t \right)$$
Therefore, in the task-offloading communication scenario, the communication delay $Td_{ij,comm}^{t,o}$ and energy consumption $En_{ij,comm}^{t,o}$ of user device $u_i$ transmitting the offloaded task to edge node $e_j$ during time slot $t$ are expressed as follows:

$$Td_{ij,comm}^{t,o} = \frac{d_{ij}^{t,o}}{R_{ij}^t}$$

$$En_{ij,comm}^{t,o} = pw_{ij}^t \, Td_{ij,comm}^{t,o} = \frac{pw_{ij}^t \, d_{ij}^{t,o}}{R_{ij}^t}$$
During the block consensus process, the block generation node transmits the block to the validation nodes for verification. Assuming a fixed network transmission speed $R$ between the block generation node $e_g$ and the $v$ validation nodes, the consensus verification communication delay $Td_{iv,comm}^{t,v}$ and energy consumption $En_{iv,comm}^{t,v}$ between the block generation node $e_g$ and a validation node $e_v$ for block $b_i$ are

$$Td_{iv,comm}^{t,v} = \frac{d_{b_i}^t}{R}$$

$$En_{iv,comm}^{t,v} = pw_{e_g} \, Td_{iv,comm}^{t,v} = \frac{pw_{e_g} \, d_{b_i}^t}{R}$$
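The communication model above can be sketched directly from the formulas. The numerical values below (initial gain, distance, powers, bandwidth, noise power) are illustrative assumptions, not the paper's experimental settings:

```python
import math

def channel_gain(h0: float, dist: float, phi: float) -> float:
    """h_ij^t = h0 * dist^(-phi/2), so |h|^2 decays as dist^(-phi)."""
    return h0 * dist ** (-phi / 2)

def sinr(pw: float, h: float, interference: float, n0: float) -> float:
    """SINR = pw * |h|^2 / (interference from other links + N0)."""
    return pw * abs(h) ** 2 / (interference + n0)

def comm_cost(d_offload: float, pw: float, bandwidth: float, s: float):
    """Rate R = B * log2(1 + SINR), delay Td = d/R, energy En = pw * Td."""
    rate = bandwidth * math.log2(1 + s)
    td = d_offload / rate
    return rate, td, pw * td

# One device-to-node link with no interfering transmissions (assumed values)
h = channel_gain(h0=1.0, dist=100.0, phi=4.0)      # 1e-4
s = sinr(pw=0.5, h=h, interference=0.0, n0=1e-9)   # 5.0
rate, td, en = comm_cost(d_offload=1e6, pw=0.5, bandwidth=1e6, s=s)
```

With these numbers, a 1 Mbit offloaded task over a 1 MHz channel takes roughly 0.39 s to transmit, which shows how quickly the communication delay can approach a task's tolerable-delay budget.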

3.3.2. Computing Model

In this paper, we mainly consider three kinds of computing scenarios, namely local task computing, task-offloading computing, and block verification computing. The computing model must determine the processing delay and energy consumption according to the computing process. It is assumed that the blockchain selects block generation nodes according to the number of tokens owned by the nodes, and the calculation volume of generation node selection is ignored in this model.
In the task computation scenario executed locally by the user device, the energy consumption coefficient of the user device is assumed to be $\varepsilon_l = 10^{-11}$ [42]. The delay $Td_{i,comp}^{t,l}$ and energy consumption $En_{i,comp}^{t,l}$ of device $u_i$ for local task processing are

$$Td_{i,comp}^{t,l} = \frac{D_i^{t,l}}{f_{l_i}}$$

$$En_{i,comp}^{t,l} = \varepsilon_l \, D_i^{t,l} \, f_{l_i}^2$$
In the offloaded-task scenario executed by edge nodes, this paper assumes that each edge node provides a separate CPU core for each offloaded task, i.e., tasks offloaded to the same edge node have the same computing speed, and the number of offloaded tasks an edge node can host simultaneously is bounded by its number of CPU cores. The energy consumption coefficient of the edge node is defined as $\varepsilon_o = 10^{-27}$ [34]. The delay $Td_{ij,comp}^{t,o}$ and energy consumption $En_{ij,comp}^{t,o}$ of edge node $e_j$ in computing the offloaded task $tk_{ij}^{t,o}$ are

$$Td_{ij,comp}^{t,o} = \frac{D_{ij}^{t,o}}{f_{e_j}}$$

$$En_{ij,comp}^{t,o} = \varepsilon_o \, D_{ij}^{t,o} \, f_{e_j}^2$$
In the block consensus verification scenario, the delay and energy consumption of block generation are not counted, because the overhead of block creation is small compared to that of block verification, which involves a large number of validation links, so its impact on overall system performance is low. When an edge node performs block validation, assuming the validation computation period of block $b_i$ is $D_{b_i}^t$, the validation delay $Td_{iv,comp}^{t,v}$ and energy consumption $En_{iv,comp}^{t,v}$ of blockchain validation node $e_v$ are

$$Td_{iv,comp}^{t,v} = \frac{D_{b_i}^t}{f_{e_v}}$$

$$En_{iv,comp}^{t,v} = \varepsilon_o \, D_{b_i}^t \, f_{e_v}^2$$
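All three computing scenarios share the same delay and energy form, differing only in the energy coefficient and processor speed. A minimal sketch, with task size and clock speeds as assumed illustrative values:

```python
EPS_LOCAL = 1e-11  # epsilon_l, device energy coefficient [42]
EPS_EDGE = 1e-27   # epsilon_o, edge-node energy coefficient [34]

def comp_cost(cycles: float, f: float, eps: float):
    """Td = D / f and En = eps * D * f^2, the form shared by local
    computation, offloaded computation, and block validation."""
    return cycles / f, eps * cycles * f ** 2

# A 1 Mbit task at 500 cycles/bit; device at 1 GHz, edge core at 5 GHz (assumed)
D = 1e6 * 500
td_local, en_local = comp_cost(D, 1e9, EPS_LOCAL)  # 0.5 s locally
td_edge, en_edge = comp_cost(D, 5e9, EPS_EDGE)     # 0.1 s on the edge node
```

The faster edge core cuts the computation delay fivefold, which is the basic incentive for offloading in the first place.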

3.3.3. Comprehensive Model

In this paper, we comprehensively calculate the delay and energy cost of the blockchain-based MEC task-offloading model by combining the designed communication and computation models.
(1)
Latency Cost
The time delay in the quality of service model designed in this paper comprises two links, namely task processing and block verification. When calculating the delay of the task-processing link, it is assumed that all users start local task execution and task offloading from the same moment, i.e., local computation and offload transmission proceed in parallel, so the actual task-processing delay is the maximum of the local computation delay and the offload-processing delay. The task-offloading delay consists of the communication delay $Td_{ij,comm}^{t,o}$ of the user offloading the task to an edge node and the computation delay $Td_{ij,comp}^{t,o}$ of the task on that node. If the user offloads the task to more than one edge node, only the longest processing delay counts; the task-offloading delay $Td_i^{t,o}$ of user $u_i$ is thus denoted as

$$Td_i^{t,o} = \max \left( Td_{i1,comm}^{t,o} + Td_{i1,comp}^{t,o}, \dots, Td_{im,comm}^{t,o} + Td_{im,comp}^{t,o} \right)$$

Furthermore, the task-processing delay of user $u_i$, denoted $Td_i^{t,p}$, is

$$Td_i^{t,p} = \max \left( Td_{i,comp}^{t,l}, \; Td_i^{t,o} \right)$$

Similarly, when calculating the delay of the block verification link, since the packing node sends the block to all verification nodes simultaneously, the block verification delay $Td_i^{t,v}$ is the maximum delay over the verification nodes:

$$Td_i^{t,v} = \max \left( Td_{i1,comm}^{t,v} + Td_{i1,comp}^{t,v}, \dots, Td_{iv,comm}^{t,v} + Td_{iv,comp}^{t,v} \right)$$

In summary, the delay $Td_i^t$ of the quality of service model for user device $u_i$ in time slot $t$ is

$$Td_i^t = Td_i^{t,p} + Td_i^{t,v}$$
(2)
Energy Cost
In the energy consumption calculation process, the energy consumption of the communication model and the computation model is obtained by summing the processing energy consumption of each task.
Then, the communication and computation energy consumption of user device $u_i$ in time slot $t$ are

$$En_i^{t,comm} = \sum_{e_j \in E} En_{ij,comm}^{t,o} + \sum_{e_v \in E_V} En_{iv,comm}^{t,v}$$

$$En_i^{t,comp} = En_{i,comp}^{t,l} + \sum_{e_j \in E} En_{ij,comp}^{t,o} + \sum_{e_v \in E_V} En_{iv,comp}^{t,v}$$

In summary, the energy consumption $En_i^t$ of the quality of service model for user device $u_i$ in time slot $t$ is

$$En_i^t = En_i^{t,comm} + En_i^{t,comp}$$
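The asymmetry between the two cost aggregations (delays take the maximum over parallel paths, energies sum over all links) can be captured in a short sketch; the numeric values are illustrative assumptions:

```python
def task_delay(td_local: float, offload_pairs: list) -> float:
    """Task-processing delay: max of the local delay and the slowest
    (comm + comp) offload path, since all paths run in parallel."""
    td_off = max((comm + comp for comm, comp in offload_pairs), default=0.0)
    return max(td_local, td_off)

def total_delay(td_task: float, verify_pairs: list) -> float:
    """Adds the block-verification delay: the slowest validator dominates."""
    td_v = max((comm + comp for comm, comp in verify_pairs), default=0.0)
    return td_task + td_v

def total_energy(en_comm: list, en_comp: list) -> float:
    """Energies, unlike delays, simply sum across all links and tasks."""
    return sum(en_comm) + sum(en_comp)

# Illustrative delays in seconds: (comm, comp) per edge node / validator
td = task_delay(0.50, [(0.10, 0.30), (0.20, 0.35)])    # max(0.50, 0.55) = 0.55
total = total_delay(td, [(0.02, 0.05), (0.01, 0.08)])  # 0.55 + 0.09 = 0.64
```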

3.4. Incentive Reward Model

Previous research [22,43,44,45] has integrated the incentive mechanism of blockchain into the study of task pricing and resource allocation in MEC, balancing the allocation of edge service resources and value gains through game theory and auction theory. In this paper, the design of the incentive mechanism is simplified: only the edge nodes participating in task-offloading computation and block verification are provided with incentive tokens, in proportion to their energy consumption. Hence, the incentive model encourages edge nodes to obtain more incentive tokens and thereby gain more benefits. In the incentive model, the blockchain converts each unit of energy consumed by an edge node into tokens at a rate $\beta$; the tokens generated for user $u_i$'s tasks are then calculated as

$$I_i^t = \begin{cases} \beta \sum_{e_j \in E} \left( En_{ij,comp}^{t,o} + En_{ij,comp}^{t,v} \right), & e_j \in E_V \\[4pt] \beta \sum_{e_j \in E} En_{ij,comp}^{t,o}, & e_j \notin E_V \end{cases}$$
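The two-branch incentive rule reduces to a small function. The conversion rate and energy values below are hypothetical, for illustration only:

```python
def incentive_tokens(beta: float, en_offload: list, en_verify: list,
                     in_validator_set: bool) -> float:
    """I_i^t: offload-compute energy is always rewarded at rate beta;
    verification energy is rewarded only for nodes in E_V."""
    reward = beta * sum(en_offload)
    if in_validator_set:
        reward += beta * sum(en_verify)
    return reward

# beta = 0.1 tokens per joule, energies in joules (assumed values)
r_validator = incentive_tokens(0.1, [12.5, 8.0], [2.5], True)       # 2.3
r_non_validator = incentive_tokens(0.1, [12.5, 8.0], [2.5], False)  # 2.05
```

Validators earn strictly more for the same offload work, which (combined with the PoS selection in Section 3.2) couples token holdings to future earning opportunities.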

3.5. Privacy Model

In this section, we mainly consider that in the process of MEC task offloading, if we consider the energy consumption and delay factors of task communication and computation, user terminals often tend to offload a large number of tasks to edge nodes that are closer to them and have higher levels of resources. However, such a task-offloading method potentially risks data privacy leakage because MEC tasks usually contain sensitive private data such as the physical location of the device, identity characteristics, task data, etc. Suppose that many tasks containing private information are offloaded to an edge node. In that case, the edge node, out of its curiosity or due to being hijacked by an adversary, may collect and infer the user’s location and business characteristics based on the user’s offloading preferences. More seriously, the edge node may predict the user’s private information based on these data characteristics, resulting in user privacy leakage [46]. Therefore, it is necessary to design a privacy metric model to evaluate the degree of privacy leakage that may be caused by the user in the process of task offloading.
Information entropy is a concept that measures uncertainty or the amount of information, and it can be utilized in privacy computing models to assess and reduce privacy risks. The information entropy-based privacy measure is well suited to measuring the privacy leakage of user data and has been applied in research on MEC task offloading [41,47]. Therefore, this paper uses an information entropy-based privacy metric to measure the privacy protection effect of MEC task offloading.
We define user $u_i$'s task-offloading preference $P_i^t$ as the ratio of user $u_i$'s offloaded task data volume to its total task data volume, which measures the probability that user $u_i$'s data are exposed to edge nodes:

$$P_i^t = \frac{d_i^{t,o}}{d_i^t} = \frac{\sum_{e_j \in E} d_{ij}^{t,o}}{d_i^t}$$

Based on the user's task-offloading preference, we further adopt the concept of privacy entropy to describe the amount of private information carried by the offloading strategy of user $u_i$, denoted $H_i^t$. When there is no task offloading from the user's terminal, i.e., $P_i^t = 0$, the edge nodes cannot infer the user's task information, and the privacy entropy takes its maximum value $H_{max}$; in this paper, we set the maximum entropy to 10. The privacy entropy of user $u_i$ is calculated as

$$H_i^t = \begin{cases} -P_i^t \log_2 P_i^t, & 0 < P_i^t < 1 \\ H_{max}, & P_i^t = 0 \end{cases}$$
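A minimal sketch of the privacy metric follows; the minus sign keeps the entropy non-negative for $0 < P_i^t < 1$ (the extraction of the source dropped it, so this reconstruction is our reading of the standard entropy form):

```python
import math

H_MAX = 10.0  # maximum privacy entropy used in this paper

def offload_preference(d_offloaded: float, d_total: float) -> float:
    """P_i^t: share of the task data exposed to edge nodes."""
    return d_offloaded / d_total

def privacy_entropy(p: float) -> float:
    """H_i^t = -P log2 P for 0 < P < 1; H_MAX when nothing is offloaded."""
    if p == 0.0:
        return H_MAX
    return -p * math.log2(p)

h_half = privacy_entropy(offload_preference(5e5, 1e6))  # P = 0.5 -> H = 0.5
h_full = privacy_entropy(1.0)                           # all offloaded -> 0.0
```

Offloading everything drives the entropy to zero (the offloading pattern is fully predictable), which is exactly the behavior the optimization objective penalizes.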

4. Problem Description

This paper's optimization objectives for task offloading in mobile blockchain edge networks focus on privacy preservation, quality of service, and incentive reward. Privacy protection requires maximizing the privacy entropy of the privacy-preserving model, so that users do not offload too much private data to the edge servers and leak their privacy. A high quality of service requires minimizing the latency and energy consumption of offloading user tasks. Incentive rewards require maximizing the workload of nodes in the blockchain edge network, improving node utilization and efficiency. By comprehensively considering offloading privacy, quality of service, and incentive reward factors, the optimization problem can be formulated as maximizing the comprehensive objective for user device $u_i$ and the edge servers within time slot $t$ subject to multiple constraints. The specific optimization objective function and constraints are expressed as follows:
$$\mathbf{P}: \max \; C_i^t = \omega_1 H_i^t + \omega_2 I_i^t - \omega_3 Td_i^t - \omega_4 En_i^t$$

$$\text{s.t.} \quad Td_i^t \le Td_{i,max}^t \quad (27)$$

$$0 \le pw_{ij}^t \le pw_{l_i} \quad (28)$$

$$0 \le P_i^t \le 1 \quad (29)$$

$$H_i^t \le H_{max} \quad (30)$$

where $\omega_1$, $\omega_2$, $\omega_3$, and $\omega_4$ are the weights of the indicators, used to specify the relative importance of the different objectives. Constraint (27) means the total task delay is bounded by the maximum tolerable delay of the task. Constraint (28) means the device-to-node transmission power is bounded by the device's total transmission power. Constraint (29) means the amount of offloaded task data of any user device does not exceed the total task data. Constraint (30) means the user's privacy entropy is bounded by the maximum entropy value.
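Evaluating the comprehensive objective and checking feasibility is straightforward; the weights and metric values below are illustrative assumptions, not the paper's experimental settings:

```python
def comprehensive_objective(h: float, i_t: float, td: float, en: float,
                            w=(1.0, 1.0, 1.0, 1.0)) -> float:
    """C_i^t = w1*H_i^t + w2*I_i^t - w3*Td_i^t - w4*En_i^t."""
    w1, w2, w3, w4 = w
    return w1 * h + w2 * i_t - w3 * td - w4 * en

def feasible(td, td_max, pw, pw_total, p_ratio, h, h_max=10.0) -> bool:
    """Constraints (27)-(30): delay, power, offload-ratio, and entropy bounds."""
    return (td <= td_max and 0 <= pw <= pw_total
            and 0 <= p_ratio <= 1 and h <= h_max)

c = comprehensive_objective(h=0.5, i_t=2.3, td=0.64, en=1.2,
                            w=(1.0, 0.5, 0.8, 0.2))
ok = feasible(td=0.64, td_max=1.0, pw=0.5, pw_total=1.0, p_ratio=0.5, h=0.5)
```

The weight vector trades privacy and token income against latency and energy; the reinforcement learning reward in Section 5 reuses exactly this objective, with a penalty when `feasible` fails.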
It is not difficult to see that the optimization problem presented in this paper is a mixed-integer linear programming problem; such problems are usually NP-hard and therefore difficult to solve to global optimality. Moreover, the decision making for such problems occurs in a dynamically changing environment over a long optimization horizon, which makes it difficult for traditional convex optimization algorithms to adapt to unknown environments and perform adaptive optimization.

5. Algorithm

To address the environmental complexity and competing objectives of the above optimization problem, this section first proposes an actor–critic deep reinforcement learning algorithm based on multiple agents sharing a global memory pool to improve robustness and stability. Secondly, the optimization problem is reformulated as a Markov decision process (MDP) by constructing each agent's state space, action space, immediate rewards, and state transitions, and the algorithmic framework is described in detail.

5.1. Construction of the Markov Decision Process

In the blockchain mobile edge network task-offloading environment designed in this paper, each user device acts as a reinforcement learning agent, adopting a decentralized execution and centralized training model, which enables the agent to make independent decisions based on its observed and learned strategies. Multiple edge servers form a federated blockchain, sharing network parameters to jointly hold global information about the entire system. At the beginning of each time slot, user devices can initiate task processing requests, sending task and localization information to edge servers. After the edge server obtains the global network state information through blockchain sharing, it conducts centralized training. After training, each agent makes distributed local decisions based on its observations.
In order to solve the above optimization problem, it needs to be converted to the standard form of the Markov decision process (MDP) when using reinforcement learning algorithms. The key components of this transformation include defining the state space, action space, reward space, and state space transitions for each agent.
(1)
State Space
The state space $s_i^t$ of agent $i$ in time slot $t$ consists of the location $loc_{l_i}^t = (x_{l_i}^t, y_{l_i}^t)$ of its corresponding user device $u_i$ and the amount of requested task data $d_i^t$, i.e., $s_i^t = (loc_{l_i}^t, d_i^t)$. The overall state space of the reinforcement learning algorithm is therefore denoted as $s^t = (s_1^t, \dots, s_n^t)$.
(2)
Action space
The action space $a_i^t$ of agent $i$ in time slot $t$ represents the allocation of request-data processing and channel power of user device $u_i$ in the current network state, i.e., $a_i^t = (d_i^{t,l}, d_{i1}^{t,o}, \dots, d_{im}^{t,o}, pw_{i1}^t, \dots, pw_{im}^t)$.
(3)
Reward function
The reward function of the blockchain mobile edge network task-offloading model aims to maximize the optimization objective $C_i^t$ of each agent: it maximizes the privacy entropy of the user device to safeguard user data privacy and the blockchain rewards earned by completing offloaded tasks, while minimizing the task-processing latency and energy consumption of the user device to provide a higher quality of service. The reward function at time slot $t$ is expressed as follows:

$$r_i^t = \begin{cases} \omega_1 H_i^t + \omega_2 I_i^t - \omega_3 Td_i^t - \omega_4 En_i^t, & \text{Equations (27)--(30) satisfied} \\ r_0, & \text{otherwise} \end{cases}$$

where $r_0$ is a constant much smaller than 0 that represents the base reward given by the environment when the current policy does not satisfy the constraints of Equations (27)–(30).

5.2. Algorithmic Framework

The framework of the algorithm proposed in this paper is shown in Figure 3. The algorithm sets a corresponding agent for each user device, including an actor network, a critic network, and a random sampler. The actor network and critic network adopt a dual neural network structure. The current network is responsible for constructing the actor’s policy network ( π i ) and the critic’s value network ( Q i ). The Q value of the critic network represents the expected reward for taking a particular action in a given state. The target network is softly updated using the current network parameters ( θ i π and θ i Q ), thus guaranteeing the stability of network learning.
Figure 3. Algorithm structure.
We denote the value estimate of the critic target network in time slot $t$ by $Q_i'(s_i^t, a_i^t \mid \theta_i^{Q'})$; then, the target Q value can be calculated as
q_i = r_i^t + \gamma Q_i'(s^{t+1}, a_i^{t+1} \mid \theta_i^{Q'}),
where γ denotes the discount factor.
To update the critic’s current network parameter ( θ i Q ), the loss values of the parameters are computed using a mean-square error function. The mean-square error function can help the critic network accurately predict the value of a state or state–action pair.
Loss(\theta_i^Q) = \mathbb{E}\big[(Q_i(s^t, a_i^t \mid \theta_i^Q) - q_i)^2\big] = \frac{1}{n} \sum_{i=1}^{n} \big(Q_i(s^t, a_i^t \mid \theta_i^Q) - q_i\big)^2
We minimize $Loss(\theta_i^Q)$ by gradient descent, so the update of the parameter $\theta_i^Q$ is
\theta_i^Q \leftarrow \theta_i^Q - \alpha \nabla_{\theta_i^Q} Loss(\theta_i^Q),
where $\alpha$ is the learning rate of the critic's current network parameter $\theta_i^Q$.
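The two critic steps above, bootstrapping the target and scoring the prediction with a mean-square error, can be sketched numerically. The function names and default discount factor are illustrative, not the paper's implementation:

```python
def target_q(r, q_next, gamma=0.95):
    """Bootstrapped target: q_i = r_i^t + gamma * Q'(s^{t+1}, a^{t+1})."""
    return r + gamma * q_next

def critic_loss(q_pred, q_target):
    """Mean-square error between the critic's Q values and the targets."""
    n = len(q_pred)
    return sum((p - t) ** 2 for p, t in zip(q_pred, q_target)) / n
```

In practice, both steps run on mini-batches sampled from the memory pool, and the loss is minimized with a stochastic gradient method.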
The actor network constructs the action policy ( π i ) based on the state space ( s i t ) of the reinforcement learning agent in time slot t and the reward function ( r i t ) and generates the action ( a i t ) in the time slot, which can be represented as
a_i^t = \pi_i(s_i^t \mid \theta_i^\pi)
However, directly using the output of the policy network does not allow the agent to discover more strategies, so an exploration mechanism is constructed by adding noise:
a_i^t = \pi_i(s_i^t \mid \theta_i^\pi) + \tau N_t,
where $\tau$ denotes the attenuation factor of the noise, which gradually decreases with the number of algorithm iterations to guarantee the stability of network training, and $N_t$ is zero-mean Gaussian noise.
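The noisy action selection with a decaying factor $\tau$ can be sketched as follows; the noise scale, decay rate, and clipping range are illustrative assumptions rather than the paper's settings:

```python
import random

def noisy_action(policy_action, tau, sigma=0.2, low=0.0, high=1.0):
    """Add attenuatable Gaussian noise tau * N_t to the deterministic
    policy output, then clip into the valid action range."""
    noisy = [x + tau * random.gauss(0.0, sigma) for x in policy_action]
    return [min(high, max(low, x)) for x in noisy]

def decay_tau(tau, rate=0.995, tau_min=0.01):
    """Shrink the noise factor each iteration to stabilize late training."""
    return max(tau_min, tau * rate)
```

Early in training, a large $\tau$ encourages broad exploration of the action space; as $\tau$ decays toward its floor, the policy output dominates and training stabilizes.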
The policy objective function of the actor network is
J(\pi_i) = \mathbb{E}\big[Q_i(s^t, a_i^t \mid \theta_i^Q)\big]
Then, the gradient of the policy objective function is expressed as
\nabla_{\theta_i^\pi} J(\pi_i) = \mathbb{E}\big[\nabla_{a_i} Q_i(s^t, a_i^t \mid \theta_i^Q)\, \nabla_{\theta_i^\pi} \pi_i(s^t \mid \theta_i^\pi)\big]
Then, the update of the parameter $\theta_i^\pi$ is expressed as
\theta_i^\pi \leftarrow \theta_i^\pi + \beta \nabla_{\theta_i^\pi} J(\pi_i),
where $\beta$ is the learning rate of the actor network's parameter $\theta_i^\pi$.
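Both parameter updates are plain first-order steps: gradient descent on the critic loss and gradient ascent on $J(\pi_i)$. A one-line sketch with parameter vectors as Python lists (illustrative only):

```python
def ascend(theta, grad, lr):
    """theta <- theta + lr * grad: gradient ascent for the actor's J(pi).
    Passing the negated gradient gives the critic's descent step."""
    return [t + lr * g for t, g in zip(theta, grad)]
```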
In addition, the soft updates of the actor and critic target network parameters ($\theta_i^{\pi'}$ and $\theta_i^{Q'}$) can be represented as
\theta_i^{\pi'} \leftarrow \sigma \theta_i^\pi + (1 - \sigma)\, \theta_i^{\pi'}
\theta_i^{Q'} \leftarrow \sigma \theta_i^Q + (1 - \sigma)\, \theta_i^{Q'}
where $\sigma \in (0, 1)$ is the soft update weight.
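The soft update is an exponential moving average of the current parameters into the target parameters. A sketch with list-valued parameters (names illustrative):

```python
def soft_update(target, current, sigma=0.01):
    """theta' <- sigma * theta + (1 - sigma) * theta', element-wise."""
    return [sigma * c + (1.0 - sigma) * t for t, c in zip(target, current)]
```

Because $\sigma$ is small, the target networks drift slowly toward the current networks, which keeps the bootstrapped targets stable during learning.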
To reduce the environmental non-stationarity caused by the policy learning of other agents, this paper adopts a global memory pool that stores the experience samples $(s_i^t, s_i^{t+1}, a_i^t, r_i^t)$ of every agent and uses them to train the agents' neural networks. In practical deployments, the global memory pool can be implemented on the blockchain to share information among agents.
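A global memory pool shared by all agents can be sketched as a bounded buffer. The class name and capacity are illustrative, and the on-chain sharing mentioned above is abstracted away here:

```python
import random
from collections import deque

class GlobalMemoryPool:
    """Shared experience buffer: every agent pushes its transition
    (s_t, s_{t+1}, a_t, r_t) and samples from the common pool, so each
    agent can learn from the experience of the others."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest samples evicted first

    def push(self, s, s_next, a, r):
        self.buffer.append((s, s_next, a, r))

    def sample(self, batch_size):
        # Uniform random mini-batch; never larger than the stored pool.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```

Because all agents draw from the same buffer, a transition observed by one agent can shape the critic updates of every other agent, which is the collaboration mechanism the paper attributes to the global memory pool.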
In order to better understand the idea and process of this paper, the pseudo-code of the algorithm is shown in Algorithm 1.  
Algorithm 1: Actor–Critic Algorithm for Blockchain–MEC Task Offloading

5.3. Complexity Analysis

We measure the computational complexity of the proposed algorithm as the sum of the training overhead of all agents. Let $n$ be the number of agents, $L_a$ and $L_c$ the numbers of neural network layers of the actor and critic networks, $S$ the number of samples each agent draws from the global memory pool, $I$ the number of algorithm iterations, $d_s$ the state-space dimension, and $d_a$ the action-space dimension. The computational complexity of the algorithm is then $O(n S I (L_a + L_c)(d_s + d_a)^2)$.
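The complexity expression can be read as a rough operation count; a toy calculator (purely illustrative) makes the quadratic dependence on the combined state–action dimension explicit:

```python
def training_ops(n, S, I, L_a, L_c, d_s, d_a):
    """Operation count proportional to O(n*S*I*(L_a + L_c)*(d_s + d_a)^2)."""
    return n * S * I * (L_a + L_c) * (d_s + d_a) ** 2
```

Doubling $d_s + d_a$ quadruples the estimate, while each of the other factors scales it only linearly.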

6. Experiment and Discussion

In this section, our proposed algorithm is evaluated and analyzed through simulation experiments.

6.1. Experimental Environment

The hardware and software specifications of the experimental environment described in this paper are shown in Table 2.
Table 2. Hardware and software specifications.

6.2. Parameter Design

To simulate the network model, this paper emulates a real-world mobile-user task-offloading environment in a 1000 × 1000 area (Figure 4) containing four blockchain–MEC servers at fixed locations and user mobile devices moving along the path marked by black arrows. The servers receive task-offloading requests from the user devices and determine each device's offloading policy through collaborative planning among multiple servers. Each user device moves along the fixed, irregular black-arrow path with a constant step size per time slot and generates a random amount of task data, which is offloaded to one or more servers for processing according to the task-offloading policy.
Figure 4. Network environment simulation.
The parameters of the reinforcement learning algorithm and blockchain edge network environment are shown in the following Table 3.
Table 3. Experimental parameter settings.

6.3. Experimental Analysis

6.3.1. Contrasted Algorithms

In this paper, the following algorithms are selected to be analyzed and compared:
  • JODRL-PP [33]: JODRL-PP (Joint Optimal Deep Reinforcement Learning with Privacy Preservation) is a multi-agent deep reinforcement learning algorithm for task offloading in multi-access-point environments, built on stochastic game theory. The algorithm uses a trusted third party for centralized training and executes in a distributed manner to improve solution quality, while accounting for the dynamics of multi-user environments and handling the complexity of multiple users and access points through stochastic game theory.
  • IQL [48]: IQL (Independent Q-Learning) is a reinforcement learning algorithm applied in multi-agent systems. In a multi-agent system, each agent learns its own Q-value function independently without considering the actions and strategies of other agents and uses only its own state and action information in the learning process. In the IQL-based task-offloading algorithm, if an agent does not cache the corresponding requested service, the agent migrates the task to be executed to another agent that has cached the service based on the service cache information shared among the agents at the beginning of each time slot.
  • QMIX [49]: QMIX (Q-value Mixing Network) is a value-based multi-agent reinforcement learning algorithm that trains decentralized policies in a centralized, end-to-end manner. Its mixing network estimates the joint action value as a complex nonlinear combination of per-agent values that condition only on local observations. QMIX constrains the joint action value to be monotonic in each agent's value, which makes maximizing the joint action value tractable in off-policy learning and ensures consistency between the centralized and decentralized policies.
  • VDN [50]: The VDN (Value-Decomposition Network) is a value-decomposition method for multi-agent systems that decomposes the global value function into local value functions, so that each agent learns only the local value function associated with it. This architecture learns to decompose the team value function into per-agent value functions, addressing cooperative multi-agent reinforcement learning from a single joint reward signal. The VDN algorithm does not consider the type of service request or the state of the wireless network among agents; it directly decomposes the joint action value function into the sum of all agents' local action value functions.

6.3.2. Results

(1)
Experiment 1: Performance Comparison
We placed ten randomly moving users in the experimental simulation environment and recorded the reward function value over 1000 iterations of the reinforcement learning algorithm; the result is shown in Figure 5. The figure shows that the proposed algorithm's stabilized reward function value is significantly higher than that of the other algorithms and fluctuates less after stabilization.
Figure 5. Reward function value iteration.
To minimize the impact of single-run error on the results, we conducted five repeated experiments and recorded the average reward function values of the different algorithms over all training cycles; the comparison is shown in Figure 6. The proposed algorithm improves performance by more than 40% relative to QMIX, IQL, and VDN and also outperforms JODRL-PP, indicating that it obtains a better solution to the problem formulated in this paper.
Figure 6. Average reward function value comparison.
In our experiments, we also recorded the average costs of task processing energy consumption, task processing latency, user privacy metrics, and blockchain incentive rewards in the reward function; the comparison graphs are shown in Figure 7. The comparison shows that the proposed algorithm significantly outperforms QMIX, IQL, and VDN in all cost terms except blockchain incentive rewards. Compared with JODRL-PP, the proposed algorithm reduces the energy cost by 44.38% and improves the blockchain incentive rewards by 13.27%, although it is inferior in terms of task processing latency and user privacy metrics.
Figure 7. (a) Average processing delay comparison; (b) average energy consumption comparison; (c) average incentive reward comparison; (d) average privacy metric comparison.
(2)
Experiment 2: Performance Comparison under Different User Scales
To test how the algorithms behave on the proposed optimization problem under different user scales, we set the number of users to 10, 15, 20, 25, and 30 and recorded the average reward function value of each algorithm in five groups of repeated experiments. The results are shown in Figure 8. As the number of users increases, the average reward function value of all models decreases: more users means more user tasks, so the delay and energy required for task processing rise, the user privacy metric is bounded by its upper limit, and the blockchain incentive rewards are limited by the number of nodes able to receive tasks. A decrease in the reward function value is therefore expected. The proposed algorithm still achieves the best average reward function value at every agent scale.
Figure 8. Reward function value iteration for different user scales.
The experimental comparison graphs of the average costs of task processing energy consumption, task processing delay, user privacy metrics, and blockchain incentive rewards are shown in Figure 9. The proposed algorithm retains its advantages in several individual cost metrics as the user scale grows, and the results are similar to those of Experiment 1.
Figure 9. (a) Average processing delay comparison; (b) average energy consumption comparison; (c) average incentive reward comparison; (d) average privacy metric comparison for different agent scales.
(3)
Experiment 3: Ablation Experiment
In this paper, we design ablation experiments to investigate how the Gaussian noise-based action-space search and the agents' global memory pool affect the algorithm's performance. As in Experiment 1, we placed 10 randomly moving users in the simulation environment and recorded the reward function value over 1000 iterations; the result is shown in Figure 10. The curve of the proposed algorithm reaches a stable level faster than the other two configurations and fluctuates less after stabilization. In addition, its final stabilized reward function value is significantly higher, indicating improved overall performance.
Figure 10. Reward function value iteration.
In addition, we compared the average rewards of the different algorithm configurations over five repeated experiments, as shown in Figure 11. The average reward function value of the proposed algorithm maintains a significant advantage throughout the training cycle, improving performance by 38.36% and 43.59% over the configurations without Gaussian action-space selection noise and without the global memory pool, respectively.
Figure 11. Average reward function value comparison.
According to the ablation results, introducing Gaussian noise-based action-space search and the global shared memory pool significantly improved the algorithm's performance. These two mechanisms enhance the algorithm's ability to explore and to exploit historical information, improving learning efficiency and the long-run quality of the policy. This matters in complex, dynamic environments, where the algorithm must adapt quickly and discover new, better strategies.
In summary, the proposed algorithm was validated through extensive comparative experiments, which demonstrated its advantage over the compared algorithms on the global optimization objective. The ablation experiments confirmed the important roles of Gaussian noise-based action-space search and the global shared memory pool. However, the proposed algorithm remains at a disadvantage in task processing delay cost.

7. Conclusions and Future Works

In this paper, we propose a multi-agent reinforcement learning-based task-offloading strategy for blockchain-enabled MEC that uses a global memory pool so that each agent can acquire the experience of the other agents during training, enhancing inter-agent collaboration and overall system performance. The algorithm also introduces an exploration strategy that adds decayable Gaussian random noise to the action space, broadening the agents' exploration to avoid falling into local optima. For the optimization objective, this paper comprehensively considers cost factors such as task execution energy consumption, processing delay, user privacy metrics, and blockchain incentive rewards and innovatively proposes a blockchain-enabled MEC task-offloading model. The experimental results show that, compared with other algorithms, the proposed algorithm improves the global optimization objective by more than 10% and has clear advantages in energy consumption and blockchain incentive rewards. In addition, the ablation experiments show that the Gaussian action-space selection noise and the global memory pool improve performance by 38.36% and 43.59%, respectively.
However, this work has limitations in problem modeling and algorithm design. First, we used an existing consensus mechanism and a simplified incentive mechanism to simulate blockchain execution on MEC, which still deviates considerably from real scenarios. Second, the model design should consider more security aspects of MEC task offloading. Third, the algorithm's execution efficiency still needs improvement. Further research and optimization of problem modeling and algorithm design for blockchain-enabled MEC task offloading are therefore important directions for our future work.

Author Contributions

C.L. developed the idea, performed research and analyses, and wrote the manuscript. Z.S. verified and revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Natural Science Foundation of China (No. 62272239), the Postgraduate Research & Innovation Plan of Jiangsu Province (No. KYCX20_0761), and the Jiangsu Agriculture Science and Technology Innovation Fund (No. CX(22)1007).

Data Availability Statement

Data are contained within the article.

Acknowledgments

We wish to thank all code providers. We also wish to thank all colleagues, reviewers, and editors who provided valuable suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mao, Y.; You, C.; Zhang, J.; Huang, K.; Letaief, K.B. A Survey on Mobile Edge Computing: The Communication Perspective. IEEE Commun. Surv. Tutor. 2017, 19, 2322–2358. [Google Scholar] [CrossRef]
  2. Qiu, H.; Zhu, K.; Luong, N.C.; Yi, C.; Niyato, D.; Kim, D.I. Applications of Auction and Mechanism Design in Edge Computing: A Survey. IEEE Trans. Cogn. Commun. Netw. 2022, 8, 1034–1058. [Google Scholar] [CrossRef]
  3. Nakamoto, S. Bitcoin: A peer-to-peer electronic cash system. Decentralized Bus. Rev. 2008, 21260. Available online: https://bitcoin.org/bitcoin.pdf (accessed on 14 July 2024).
  4. Yu, R.; Oguti, A.M.; Obaidat, M.S.; Li, S.; Wang, P.; Hsiao, K.F. Blockchain-based solutions for mobile crowdsensing: A comprehensive survey. Comput. Sci. Rev. 2023, 50, 100589. [Google Scholar] [CrossRef]
  5. Yang, R.; Yu, F.R.; Si, P.; Yang, Z.; Zhang, Y. Integrated Blockchain and Edge Computing Systems: A Survey, Some Research Issues and Challenges. IEEE Commun. Surv. Tutor. 2019, 21, 1508–1532. [Google Scholar] [CrossRef]
  6. Wang, S.; Ye, D.; Huang, X.; Yu, R.; Wang, Y.; Zhang, Y. Consortium Blockchain for Secure Resource Sharing in Vehicular Edge Computing: A Contract-Based Approach. IEEE Trans. Netw. Sci. Eng. 2021, 8, 1189–1201. [Google Scholar] [CrossRef]
  7. Aujla, G.S.; Singh, A.; Singh, M.; Sharma, S.; Kumar, N.; Choo, K.K.R. BloCkEd: Blockchain-Based Secure Data Processing Framework in Edge Envisioned V2X Environment. IEEE Trans. Veh. Technol. 2020, 69, 5850–5863. [Google Scholar] [CrossRef]
  8. Liu, H.; Zhang, P.; Pu, G.; Yang, T.; Maharjan, S.; Zhang, Y. Blockchain Empowered Cooperative Authentication with Data Traceability in Vehicular Edge Computing. IEEE Trans. Veh. Technol. 2020, 69, 4221–4232. [Google Scholar] [CrossRef]
  9. Lu, Y.; Tang, X.; Liu, L.; Yu, F.R.; Dustdar, S. Speeding at the Edge: An Efficient and Secure Redactable Blockchain for IoT-Based Smart Grid Systems. IEEE Internet Things J. 2023, 10, 12886–12897. [Google Scholar] [CrossRef]
  10. Bao, Z.; Tang, C.; Lin, F.; Zheng, Z.; Yu, X. Rating-protocol optimization for blockchain-enabled hybrid energy trading in smart grids. Sci. China Inf. Sci. 2023, 66, 159205. [Google Scholar] [CrossRef]
  11. Guan, Z.; Zhou, X.; Liu, P.; Wu, L.; Yang, W. A Blockchain-Based Dual-Side Privacy-Preserving Multiparty Computation Scheme for Edge-Enabled Smart Grid. IEEE Internet Things J. 2022, 9, 14287–14299. [Google Scholar] [CrossRef]
  12. Li, Z.; Zhang, J.; Zhang, J.; Zheng, Y.; Zong, X. Integrated Edge Computing and Blockchain: A General Medical Data Sharing Framework. IEEE Trans. Emerg. Top. Comput. 2023, 1–14. [Google Scholar] [CrossRef]
  13. Sharma, D.; Kumar, R.; Jung, K.H. A bibliometric analysis of convergence of artificial intelligence and blockchain for edge of things. J. Grid Comput. 2023, 21, 79. [Google Scholar] [CrossRef]
  14. Lin, Y.; Kang, J.; Niyato, D.; Gao, Z.; Wang, Q. Efficient Consensus and Elastic Resource Allocation Empowered Blockchain for Vehicular Networks. IEEE Trans. Veh. Technol. 2023, 72, 5513–5517. [Google Scholar] [CrossRef]
  15. Zhang, X.; Zhu, X.; Chikuvanyanga, M.; Chen, M. Resource sharing of mobile edge computing networks based on auction game and blockchain. EURASIP J. Adv. Signal Process. 2021, 2021, 26. [Google Scholar] [CrossRef]
  16. Xu, S.; Liao, B.; Yang, C.; Guo, S.; Hu, B.; Zhao, J.; Jin, L. Deep reinforcement learning assisted edge-terminal collaborative offloading algorithm of blockchain computing tasks for energy Internet. Int. J. Electr. Power Energy Syst. 2021, 131, 107022. [Google Scholar] [CrossRef]
  17. Moghaddasi, K.; Rajabi, S.; Gharehchopogh, F.S. Multi-Objective Secure Task Offloading Strategy for Blockchain-Enabled IoV-MEC Systems: A Double Deep Q-Network Approach. IEEE Access 2024, 12, 3437–3463. [Google Scholar] [CrossRef]
  18. Wu, H.; Wolter, K.; Jiao, P.; Deng, Y.; Zhao, Y.; Xu, M. EEDTO: An Energy-Efficient Dynamic Task Offloading Algorithm for Blockchain-Enabled IoT-Edge-Cloud Orchestrated Computing. IEEE Internet Things J. 2021, 8, 2163–2176. [Google Scholar] [CrossRef]
  19. Nguyen, D.C.; Pathirana, P.N.; Ding, M.; Seneviratne, A. Privacy-Preserved Task Offloading in Mobile Blockchain with Deep Reinforcement Learning. IEEE Trans. Netw. Serv. Manag. 2020, 17, 2536–2549. [Google Scholar] [CrossRef]
  20. Le, Y.; Ling, X.; Wang, J.; Guo, R.; Huang, Y.; Wang, C.X.; You, X. Resource Sharing and Trading of Blockchain Radio Access Networks: Architecture and Prototype Design. IEEE Internet Things J. 2023, 10, 12025–12043. [Google Scholar] [CrossRef]
  21. Salim, M.M.; Pan, Y.; Park, J.H. Energy-efficient resource allocation in blockchain-based Cybertwin-driven 6G. J. Ambient. Intell. Humaniz. Comput. 2024, 15, 103–114. [Google Scholar] [CrossRef]
  22. Sun, W.; Liu, J.; Yue, Y.; Wang, P. Joint Resource Allocation and Incentive Design for Blockchain-Based Mobile Edge Computing. IEEE Trans. Wirel. Commun. 2020, 19, 6050–6064. [Google Scholar] [CrossRef]
  23. Ding, J.; Han, L.; Li, J.; Zhang, D. Resource allocation strategy for blockchain-enabled NOMA-based MEC networks. J. Cloud Comput. 2023, 12, 142. [Google Scholar] [CrossRef]
  24. Zhang, L.; Zou, Y.; Wang, W.; Jin, Z.; Su, Y.; Chen, H. Resource allocation and trust computing for blockchain-enabled edge computing system. Comput. Secur. 2021, 105, 102249. [Google Scholar] [CrossRef]
  25. Baranwal, G.; Kumar, D.; Vidyarthi, D.P. Blockchain based resource allocation in cloud and distributed edge computing: A survey. Comput. Commun. 2023, 209, 469–498. [Google Scholar] [CrossRef]
  26. Xue, H.; Chen, D.; Zhang, N.; Dai, H.N.; Yu, K. Integration of blockchain and edge computing in internet of things: A survey. Future Gener. Comput. Syst. 2023, 144, 307–326. [Google Scholar] [CrossRef]
  27. Liu, X. Towards blockchain-based resource allocation models for cloud-edge computing in IoT applications. Wirel. Pers. Commun. 2021, 135, 2483. [Google Scholar] [CrossRef]
  28. Guo, S.; Dai, Y.; Guo, S.; Qiu, X.; Qi, F. Blockchain Meets Edge Computing: Stackelberg Game and Double Auction Based Task Offloading for Mobile Blockchain. IEEE Trans. Veh. Technol. 2020, 69, 5549–5561. [Google Scholar] [CrossRef]
  29. Devi, I.; Karpagam, G.R. Energy-Aware Scheduling for Tasks with Target-Time in Blockchain based Data Centres. Comput. Syst. Sci. Eng. 2022, 40, 405–419. [Google Scholar] [CrossRef]
  30. Xiong, J.; Guo, P.; Wang, Y.; Meng, X.; Zhang, J.; Qian, L.; Yu, Z. Multi-agent deep reinforcement learning for task offloading in group distributed manufacturing systems. Eng. Appl. Artif. Intell. 2023, 118, 105710. [Google Scholar] [CrossRef]
  31. Lu, K.; Li, R.D.; Li, M.C.; Xu, G.R. MADDPG-based joint optimization of task partitioning and computation resource allocation in mobile edge computing. Neural Comput. Appl. 2023, 35, 16559–16576. [Google Scholar] [CrossRef]
  32. Li, K.; Wang, X.; He, Q.; Yang, M.; Huang, M.; Dustdar, S. Task Computation Offloading for Multi-Access Edge Computing via Attention Communication Deep Reinforcement Learning. IEEE Trans. Serv. Comput. 2023, 16, 2985–2999. [Google Scholar] [CrossRef]
  33. Wu, G.; Chen, X.; Gao, Z.; Zhang, H.; Yu, S.; Shen, S. Privacy-preserving offloading scheme in multi-access mobile edge computing based on MADRL. J. Parallel Distrib. Comput. 2024, 183, 104775. [Google Scholar] [CrossRef]
  34. Yang, L.; Li, M.; Si, P.; Yang, R.; Sun, E.; Zhang, Y. Energy-Efficient Resource Allocation for Blockchain-Enabled Industrial Internet of Things with Deep Reinforcement Learning. IEEE Internet Things J. 2021, 8, 2318–2329. [Google Scholar] [CrossRef]
  35. Nguyen, D.C.; Ding, M.; Pathirana, P.N.; Seneviratne, A.; Li, J.; Poor, H.V. Cooperative Task Offloading and Block Mining in Blockchain-Based Edge Computing with Multi-Agent Deep Reinforcement Learning. IEEE Trans. Mob. Comput. 2023, 22, 2021–2037. [Google Scholar] [CrossRef]
  36. Yao, S.; Wang, M.; Qu, Q.; Zhang, Z.; Zhang, Y.F.; Xu, K.; Xu, M. Blockchain-Empowered Collaborative Task Offloading for Cloud-Edge-Device Computing. IEEE J. Sel. Areas Commun. 2022, 40, 3485–3500. [Google Scholar] [CrossRef]
  37. Wang, C.; Jiang, C.; Wang, J.; Shen, S.; Guo, S.; Zhang, P. Blockchain-Aided Network Resource Orchestration in Intelligent Internet of Things. IEEE Internet Things J. 2023, 10, 6151–6163. [Google Scholar] [CrossRef]
  38. Du, Y.; Wang, Z.; Li, J.; Shi, L.; Jayakody, D.N.K.; Chen, Q.; Chen, W.; Han, Z. Blockchain-Aided Edge Computing Market: Smart Contract and Consensus Mechanisms. IEEE Trans. Mob. Comput. 2023, 22, 3193–3208. [Google Scholar] [CrossRef]
  39. Kaur, M.; Khan, M.Z.; Gupta, S.; Noorwali, A.; Chakraborty, C.; Pani, S.K. MBCP: Performance Analysis of Large Scale Mainstream Blockchain Consensus Protocols. IEEE Access 2021, 9, 80931–80944. [Google Scholar] [CrossRef]
  40. Liang, L.; Kim, J.; Jha, S.C.; Sivanesan, K.; Li, G.Y. Spectrum and Power Allocation for Vehicular Communications with Delayed CSI Feedback. IEEE Wirel. Commun. Lett. 2017, 6, 458–461. [Google Scholar] [CrossRef]
  41. Xu, X.; Liu, X.; Yin, X.; Wang, S.; Qi, Q.; Qi, L. Privacy-aware offloading for training tasks of generative adversarial network in edge computing. Inf. Sci. 2020, 532, 1–15. [Google Scholar] [CrossRef]
  42. Chen, X. Decentralized Computation Offloading Game for Mobile Cloud Computing. IEEE Trans. Parallel Distrib. Syst. 2015, 26, 974–983. [Google Scholar] [CrossRef]
  43. Huang, X.; Zhang, B.; Li, C. Incentive Mechanisms for Mobile Edge Computing: Present and Future Directions. IEEE Netw. 2022, 36, 199–205. [Google Scholar] [CrossRef]
  44. Xu, Y.; Zhang, H.; Li, X.; Yu, F.R.; Ji, H.; Leung, V.C.M. Blockchain-Based Edge Collaboration with Incentive Mechanism for MEC-Enabled VR Systems. IEEE Trans. Wirel. Commun. 2024, 23, 3706–3720. [Google Scholar] [CrossRef]
  45. Gao, Q.; Xiao, J.; Cao, Y.; Deng, S.; Ouyang, C.; Feng, Z. Blockchain-based collaborative edge computing: Efficiency, incentive and trust. J. Cloud Comput. 2023, 12, 72. [Google Scholar] [CrossRef]
  46. Li, X.; Liu, S.; Wu, F.; Kumari, S.; Rodrigues, J.J.P.C. Privacy Preserving Data Aggregation Scheme for Mobile Edge Computing Assisted IoT Applications. IEEE Internet Things J. 2019, 6, 4755–4763. [Google Scholar] [CrossRef]
  47. Xu, X.; He, C.; Xu, Z.; Qi, L.; Wan, S.; Bhuiyan, M.Z.A. Joint Optimization of Offloading Utility and Privacy for Edge Computing Enabled IoT. IEEE Internet Things J. 2020, 7, 2622–2629. [Google Scholar] [CrossRef]
  48. Tampuu, A.; Matiisen, T.; Kodelja, D.; Kuzovkin, I.; Korjus, K.; Aru, J.; Aru, J.; Vicente, R. Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE 2017, 12, e0172395. [Google Scholar] [CrossRef]
  49. Rashid, T.; Samvelyan, M.; de Witt, C.S.; Farquhar, G.; Foerster, J.; Whiteson, S. Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. J. Mach. Learn. Res. 2020, 21, 1–51. [Google Scholar]
  50. Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W.M.; Zambaldi, V.F.; Jaderberg, M.; Lanctot, M.; Sonnerat, N.; Leibo, J.Z.; Tuyls, K.; et al. Value-Decomposition Networks For Cooperative Multi-Agent Learning. arXiv 2017, arXiv:1706.05296. [Google Scholar]
