Article

AI Services-Oriented Dynamic Computing Resource Scheduling Algorithm Based on Distributed Data Parallelism in Edge Computing Network of Smart Grid

1 State Grid Economic and Technological Research Institute Ltd., Beijing 221005, China
2 School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
3 Information and Telecommunication Branch, State Grid Jiangsu Electric Power Ltd., Nanjing 211103, China
* Author to whom correspondence should be addressed.
Future Internet 2024, 16(9), 312; https://doi.org/10.3390/fi16090312
Submission received: 30 June 2024 / Revised: 25 July 2024 / Accepted: 16 August 2024 / Published: 28 August 2024

Abstract

A booming number of artificial intelligence (AI) services in the communication network of the smart grid require massive computational resources. To alleviate the computational pressure on data centers, the edge computing first network (ECFN) can serve as an effective solution for realizing distributed model training based on data parallelism for AI services in the smart grid. Because AI services are of diverse types, the workload of an edge data center varies over time. Moreover, selfish edge data centers from different edge suppliers are reluctant to share their computing resources without a rule for fair competition. AI services-oriented dynamic computational resource scheduling across edge data centers affects both the economic profit of AI service providers and computational resource utilization. This letter mainly discusses the partition and distribution of AI data for distributed model training and the dynamic computational resource scheduling problem among multiple edge data centers for AI services. To this end, a mixed integer linear programming (MILP) model and a Deep Reinforcement Learning (DRL)-based algorithm are proposed. Simulation results show that the proposed DRL-based algorithm outperforms the benchmark in terms of profit of the AI service provider, backlog of distributed model training tasks, running time, and multi-objective optimization.

1. Introduction

The smart grid integrates advanced information, communication, and control technologies to achieve intelligent management of the power system. With the high-quality power support provided by the smart grid, "Eastern Data Western Calculation" has become a reality, enabling data elements to flow across domains [1]. Smart grids are reforming towards utilizing massive data for operations and services. To this end, the existing infrastructure of the smart grid will be fully integrated with the new digital infrastructure. The emerging converged communication network of the smart grid and the computing first network will be key to promoting "Eastern Data Western Calculation". In the future, the smart grid will fully integrate and support the development of computing power while ensuring the transmission of electricity [2]. With the surge of artificial intelligence (AI) services (e.g., data analysis and model training), a large amount of AI data has become highly dependent on the computing first network of the smart grid. Specifically, these technologies make the smart grid suitable for many potential applications, such as industrial automation, augmented reality, real-time traffic control, smart homes, and data processing [3]. However, the model training of AI services usually requires substantial storage and computational resources, and it is difficult to store and process the massive data generated by a large number of AI services in one centralized data center. Recently, distributed model training based on data parallelism has effectively alleviated the enormous computational and bandwidth pressures caused by computationally intensive AI services (e.g., Probabilistic Latent Semantic Analysis and Federated Learning) [4].
Edge computing can serve as an effective solution for executing distributed model training for AI services across different edge data centers in a collaborative manner. To this end, the edge computing first network (ECFN)-enabled smart grid has emerged [5]. It not only provides low-latency and efficient model training for AI services, but also significantly reduces data storage and transmission costs for computational resource suppliers, thereby saving substantial operating and maintenance costs. Notably, edge data centers in the ECFN-enabled smart grid are often deployed near users by operators and by private organizations or individuals, such as third-party suppliers, idle computing resources of government and enterprises, and personal advanced edge data centers. From the economic perspective, because most computing resource suppliers bear the investment and maintenance costs of their edge data centers, most edge data centers care more about their own interests and may be unwilling to share their resources to support AI services. Without an incentive strategy that encourages edge data centers to share resources, or an optimal resource scheduling strategy, it is difficult to achieve efficient collaboration among different selfish edge data centers to support AI services. Moreover, the ubiquitous computational resources in the ECFN-enabled smart grid cannot be fully scheduled, resulting in inefficient resource utilization. How to provide a principle of fair competition and promote willingness for collaboration among edge data centers of different suppliers becomes a crucial issue.
In this paper, we mainly investigate dynamic, collaborative computational resource scheduling based on distributed data parallelism for AI services in the ECFN of smart grids. A MILP model and a DRL-based algorithm are proposed. Notably, a heuristic pricing-strategy-based computational resource scheduling algorithm is embedded in the DRL-based algorithm to reduce the action space and improve model training efficiency. The current work can be applied to the daily operation and maintenance process of the State Grid, especially when computing power in the smart grid is imbalanced. Computing power is unevenly distributed between the three major cloud centers and the provincial core data center, which leads to excessive load on some edge computing centers, links exceeding their limits, and suboptimal energy consumption of computing centers. Therefore, it is necessary to achieve collaborative optimization and scheduling of computing power networks based on resources obtained from the edge computing resource supply side and the network resource side.

2. Related Work

Resource scheduling in computing plays a crucial role in optimizing the utilization of resources and improving system efficiency. Various algorithms and techniques have been proposed to address the challenges associated with resource scheduling in edge computing environments, which are mainly divided into two categories: heuristic and machine-learning-based approaches for resource scheduling.

2.1. Heuristic for Resource Scheduling

This section reviews recent papers on heuristic-based resource allocation in edge computing. Recently, computing resource scheduling problems in smart grid communication networks with edge computing systems have been solved by a hierarchical game-based cloud resource scheduling algorithm [6] and a greedy-based algorithm [7]. A dynamic resource optimization allocation framework is presented for Fog resource allocation in the smart grid [8]. A priority-based resource allocation mechanism is presented in [9]. In this mechanism, the accessible resources are scheduled dynamically as requests arrive; a received request can be either normal or prioritized, and once identified, each request is processed with one of three potential outcomes. Jiang et al. in [10] allocated computing resources through a combination of bandwidth and computing frequency using complete information game theory. A joint task scheduling and resource allocation (JTSRA) algorithm for multi-access edge computing is proposed in [11]. This algorithm adjusts the order of tasks based on their different priorities and achieves a rational allocation of resources. Yan et al. [12] discussed resource scheduling algorithms used in edge computing for burst network flows, highlighting the crucial role of efficient resource allocation in such environments. Seah et al. [13] introduced a combined communication and computing resource scheduling approach for sliced 5G multi-access edge computing systems. The proposed scheduler aims to efficiently allocate communication and computing resources while meeting latency constraints. Nie [14] proposed a cloud computing resource scheduling strategy based on a genetic ant colony fusion algorithm, combining genetic and ant colony algorithms to improve stability and efficiency in resource scheduling. Additionally, in the realm of edge computing, a cooperative computing solution is proposed in [15]; this UAV-aided video streaming solution minimizes the video streaming time to improve overall computing performance. An online orchestration framework for general-purpose edge computing is presented in [16]; from the perspective of edge computing providers interconnected with multiple edge clouds, the framework maximizes their benefits.

2.2. Machine Learning-Based Approach for Resource Scheduling

With the development of machine learning, utilizing machine learning to solve existing problems has become an effective approach. A Deep Reinforcement Learning-based service-oriented resource allocation algorithm is proposed in [17]. Wang et al. [18] proposed a reinforcement learning approach to optimize scheduling games in mobile edge computing. Their method considers both the quality of mobile device-server links and the allocation of server computing resources. A resource allocation algorithm is presented in [19]. The algorithm combines Deep Reinforcement Learning and a Genetic Algorithm to solve problems in mobile edge networks, including transmission power allocation and computation resource allocation. In [20], the authors investigate the task scheduling and resource allocation problem in edge computing scenarios, with the objective of maximizing the long-term satisfaction of tasks allocated to virtual machines deployed on edge servers. A Markov decision process is formulated to model the problem, involving states, actions, state transitions, and rewards. Following that, a resource allocation algorithm based on a Deep Q Network is designed to determine an optimal offloading decision and resource allocation scheme.
In addition to utilizing AI to achieve efficient resource allocation, the execution of AI services itself involves the scheduling of computing power resources. Traditional centralized AI models cannot fully utilize edge network computing power. Distributed model training has been a topic of significant interest in recent years, with various studies focusing on improving the efficiency and scalability of training models across distributed systems. Machine learning has a wide range of practical application scenarios in the edge computing first network-enabled smart grid [21]. A DRL-based edge computing network-aided resource allocation algorithm was proposed for the smart grid [22]. Slama investigated edge computing strategies for smart grids and provided a comprehensive review of emerging issues of edge computing in a smart grid environment [3]. An architecture of an edge computing-enabled Internet of Things (IoT)-based smart grid is proposed, and three major power system scenarios, namely power distribution, micro-grid, and advanced metering systems, together with applications of edge computing, are presented [23].
However, existing computing resource scheduling algorithms cannot efficiently distribute AI services, which are characterized by diversified types and time variability, to edge data centers with changing workloads in the ECFN-enabled smart grid. Some research studies have investigated distributed model training based on data parallelism in edge computing networks, but the optimization goals of existing studies mainly focus on minimizing communication bandwidth for distributed training [24] and jointly optimizing training time and energy consumption [25]. Adaptive DNN model partition and deployment are investigated in edge computing-enabled metro optical interconnection networks [26]. Li et al. [27] optimized the partitioning and distribution of training data within edge computing-enabled optical networks to enhance the efficiency of distributed model training services. Wang et al. [28] introduced Pufferfish, a distributed training framework designed for communication and computation efficiency, which integrates gradient compression techniques into the model training process. Mansour et al. [29] investigated federated learning aggregation strategies, such as FedAvg, to derive a comprehensive "average" model from models trained across distributed clients.
Overall, the research in the above papers only considers the efficiency of training models and the performance of the models. Various techniques and frameworks have been introduced to address these challenges and improve the overall effectiveness of distributed model training. The literature on distributed model training emphasizes the importance of optimizing communication costs, resource utilization, and model performance to achieve efficient and scalable training across distributed systems. However, few studies focus on jointly optimizing computing resource consumption and the profit of AI service suppliers in the ECFN of smart grids. Therefore, this paper mainly studies dynamic collaborative computational resource allocation for the resource scheduling tasks of distributed AI services in the ECFN.

3. Materials and Methods

In this study, the distribution of AI services among edge data centers is discussed. Edge data centers supplied by operators and by private organizations or individuals are mesh-connected with optical transport network (OTN) technology. OTN not only provides large-granularity bandwidth multiplexing, flexible cross-connection, and configuration scheduling, but also offers enhanced networking and protection capabilities, as shown in Figure 1.
Notably, routing between each edge data center pair is pre-configured. Thus, to simplify the problem, routing and optical resource allocation problems are not considered in this letter. Given a set of AI services from different users and edge computational resources from different edge data center suppliers, this letter mainly focuses on the dynamic distribution of AI services to multiple edge data centers. The schematic diagram of the dynamic distribution problem is shown in Figure 1. Edge data centers are provided by different edge suppliers such as operators, third parties, etc. Moreover, idle edge computing resources of government, enterprise, and personal advanced terminals are also considered. Notably, edge data centers are characterized by self-interest and competitiveness. AI services usually require massive computational resources for a large amount of AI data. Based on distributed data parallelism, AI data can be split into multiple flexible data segments. In other words, an AI service can be divided into multiple distributed model training tasks, and distributed model training can be executed on multiple edge data centers. We formulate an AI service as $R_k$, and its distributed model training tasks are denoted as $\{r_{k,f}\}$. It is essential to integrate all these edge computing resources and encourage all edge suppliers to share their edge computing resources under fair competition, aiming to achieve efficient computing resource scheduling for all distributed model training tasks of AI services. The distribution of distributed model training tasks of AI services to multiple edge data centers in the ECFN of the smart grid requires solving several sub-problems, as follows.
1. Splitting of AI training data
An AI service includes a large amount of training data. How to split the AI training data into multiple data segments for distributed model training must be determined (see the sketch after this list).
2. Task scheduling order
As multiple distributed model training tasks of an AI service arrive, the task scheduling order impacts both the overall computing resource utilization and the training latency of the AI service. This is the foundation of computing resource scheduling.
3. Edge data center selection
Edge data center selection includes determining the number and location of edge data centers and how to assign the distributed model training tasks of AI services to them.
4. Edge computing resource scheduling
In order to schedule the limited computing resources of edge data centers for AI services, the economic profit of AI service providers and the backlog of distributed training tasks should be jointly optimized.
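To ground the sub-problems, the sketch below shows one possible Python data model for AI services and edge data centers, together with a naive greedy split of a service's training data across centers (sub-problem 1). All class and field names are illustrative assumptions, not definitions from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class AIService:
    user: int          # k-th user
    service_type: int  # f-th type of AI service
    demand: float      # requested computing resources B_{k,f} (GOPS)
    payment: float     # payment M_{k,f} per GOPS

@dataclass
class EdgeDataCenter:
    supplier: int      # i-th edge supplier
    index: int         # j-th edge data center of that supplier
    capacity: float    # remaining processing capacity C_{i,j} (GOPS)
    cost: float        # processing cost N_{i,j} per GOPS

def split_training_data(service: AIService,
                        centers: List[EdgeDataCenter]) -> Dict[Tuple[int, int], float]:
    """Naive greedy split of one service's training data (sub-problem 1):
    fill the given centers in order until the requested GOPS are covered."""
    remaining = service.demand
    segments: Dict[Tuple[int, int], float] = {}
    for dc in centers:
        if remaining <= 0:
            break
        share = min(remaining, dc.capacity)
        if share > 0:
            segments[(dc.supplier, dc.index)] = share
            dc.capacity -= share
            remaining -= share
    return segments
```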

3.1. MILP Model

A MILP model is formulated for the AI service-oriented dynamic, collaborative computational resource scheduling problem based on distributed model training in the ECFN of the smart grid. Firstly, the parameter definitions of the MILP are introduced in detail. For constant parameters, the set of edge suppliers and the set of edge data centers are denoted as $S$ and $E$, respectively. The set of users and the set of different types of AI services are denoted as $U$ and $R$, where $k \in U$ and $f \in R$. Moreover, $\varepsilon$ is defined as the backlog threshold of distributed model training tasks in an edge data center. For input variables, the computing processing capacity (GOPS) and cost ($) of the $j$-th edge data center of the $i$-th edge supplier are denoted as $C_{i,j}$ and $N_{i,j}$, respectively, where $i \in S$ and $j \in E$. $B_{k,f}$ and $M_{k,f}$ indicate the requested computing resources (GOPS) and payment ($) for the $k$-th user's $f$-th distributed model training task of an AI service, where $k \in U$ and $f \in R$. $P_{k,f}$ indicates the complaint rate of the distributed model training task, which is a statistical value. $D_{i,j}$ indicates the remaining amount of computing resources that can be complained about in the $j$-th edge data center of the $i$-th edge supplier. For decision variables, we define a Boolean variable $A_{i,j}^{t,k,f}$: if the requested computational resources of the $k$-th user's $f$-th type of distributed model training task are assigned to the $j$-th edge data center of the $i$-th edge supplier in the $t$-th time frame, $A_{i,j}^{t,k,f}$ equals 1. $\sigma_{i,j}^{t}$ denotes the backlog of computing tasks of the $j$-th edge data center of the $i$-th edge supplier in the $t$-th time frame. $X_{i,j}^{t}$ is defined as a Boolean variable: if the $j$-th edge data center of the $i$-th edge computing resource supplier is used in the $t$-th time frame, $X_{i,j}^{t}$ equals 1.
The respective constant parameters, input variables, and decision variables are given in Table 1, Table 2 and Table 3.
Then, the ultimate goal of the proposed MILP is to maximize the total profit of AI service providers and minimize the backlog of distributed training tasks, as formulated in Equation (1). The weight coefficients $\alpha$ and $\beta$ weight the profit term and the backlog term of the joint optimization objective, respectively. In addition, the constraints on edge computing resource requests, computing processing capacity, the task backlog threshold on edge servers, and the complaint rate threshold of AI services should be considered, as indicated in Equations (2)-(5), respectively. In detail, Equation (2) is the constraint on computational resource requests for AI services: the AI-distributed training data should be assigned to multiple edge data centers so as to satisfy the AI service request. The computational resource capacity of an edge data center is constrained in Equation (3): for each edge data center, the computational resources requested by AI services plus the backlog of computing tasks at time (t - 1) should not exceed the sum of the computational resource capacity of the edge data center and the backlog of computing request tasks at time t. Equation (4) indicates that the backlog of computing tasks in an edge data center should not exceed the maximum allowable backlog. Equation (5) indicates that the sum of complaint rates of all AI services allocated to an edge data center should not exceed the remaining number of computing tasks that can be complained about in that edge data center. The constraint on the usage indicator of edge data centers is given in Equation (6).
\alpha \cdot \Big( \sum_{t,i,j,k,f} M_{k,f} \cdot A_{i,j}^{t,k,f} - \sum_{t,i,j} N_{i,j} \cdot C_{i,j} \cdot X_{i,j}^{t} \Big) - \beta \cdot \sum_{t,i,j} \sigma_{i,j}^{t}   (1)

\sum_{t,i,j} A_{i,j}^{t,k,f} = B_{k,f}, \quad \forall k, f   (2)

\sum_{k,f} A_{i,j}^{t,k,f} + \sigma_{i,j}^{t-1} \le C_{i,j} + \sigma_{i,j}^{t}, \quad \forall t \in [1, T],\ \forall i, j   (3)

\sigma_{i,j}^{t} \le \varepsilon, \quad \forall t, i, j   (4)

\sum_{k,f} P_{k,f} \cdot A_{i,j}^{t,k,f} \le D_{i,j}, \quad \forall t, i, j   (5)

N_{max} \cdot X_{i,j}^{t} \ge A_{i,j}^{t,k,f}, \quad \forall t \in [1, T],\ \forall i, j, k, f   (6)
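To make the formulation concrete, the following is a minimal sketch of Equations (1)-(6) using the open-source PuLP modeling library (the paper's own experiments use IBM CPLEX; PuLP with the bundled CBC solver is only an illustrative stand-in). The sets and numerical values are placeholders, and the assignment variable A is relaxed here to a continuous amount of assigned GOPS so that constraint (2) balances against the requested resources B_{k,f}; this relaxation is our interpretation, not the authors' exact model.

```python
import pulp

# Illustrative sets and placeholder data (cf. Tables 1-3); real values come from the scenario.
T, S, E, U, R = range(2), range(2), range(2), range(2), range(2)
C = {(i, j): 100.0 for i in S for j in E}   # capacity C_{i,j} (GOPS)
N = {(i, j): 0.5 for i in S for j in E}     # unit cost N_{i,j}
B = {(k, f): 60.0 for k in U for f in R}    # requested resources B_{k,f}
M = {(k, f): 1.2 for k in U for f in R}     # payment M_{k,f}
P = {(k, f): 0.01 for k in U for f in R}    # complaint rate P_{k,f}
D = {(i, j): 5.0 for i in S for j in E}     # complaint budget D_{i,j}
eps, alpha, beta, N_max = 50.0, 0.8, 0.2, 1e6

prob = pulp.LpProblem("edge_resource_scheduling", pulp.LpMaximize)
idx = [(t, i, j, k, f) for t in T for i in S for j in E for k in U for f in R]
A = pulp.LpVariable.dicts("A", idx, lowBound=0)   # assigned resources (relaxed to continuous)
X = pulp.LpVariable.dicts("X", [(t, i, j) for t in T for i in S for j in E], cat="Binary")
sigma = pulp.LpVariable.dicts("sigma", [(t, i, j) for t in T for i in S for j in E], lowBound=0)

# Objective (1): weighted profit minus weighted backlog.
profit = pulp.lpSum(M[k, f] * A[t, i, j, k, f] for (t, i, j, k, f) in idx) \
       - pulp.lpSum(N[i, j] * C[i, j] * X[t, i, j] for t in T for i in S for j in E)
backlog = pulp.lpSum(sigma[t, i, j] for t in T for i in S for j in E)
prob += alpha * profit - beta * backlog

for k in U:
    for f in R:   # (2) demand satisfaction
        prob += pulp.lpSum(A[t, i, j, k, f] for t in T for i in S for j in E) == B[k, f]
for t in T:
    for i in S:
        for j in E:
            prev = sigma[t - 1, i, j] if t > 0 else 0
            # (3) capacity with backlog carry-over
            prob += pulp.lpSum(A[t, i, j, k, f] for k in U for f in R) + prev <= C[i, j] + sigma[t, i, j]
            prob += sigma[t, i, j] <= eps                                                   # (4) backlog threshold
            prob += pulp.lpSum(P[k, f] * A[t, i, j, k, f] for k in U for f in R) <= D[i, j]  # (5) complaint budget
            for k in U:
                for f in R:
                    prob += A[t, i, j, k, f] <= N_max * X[t, i, j]                          # (6) usage indicator

prob.solve(pulp.PULP_CBC_CMD(msg=False))
```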

3.2. DRL-Based Dynamic Collaborative Resource Scheduling Algorithm

Although the MILP model can provide an accurate solution in small networks, it is time-consuming and cannot be solved in large networks. Also, it cannot adapt to diversified types of service requests and dynamic network environments. To this end, a DRL-based AI service-oriented dynamic collaborative resource scheduling algorithm based on distributed data parallelism is proposed. Notably, our proposed DQN model is trained over sets of AI services during the time period t. As the time period t approaches infinity, the problem can be regarded as a dynamic planning problem, that is, the dynamic computing resource scheduling problem. Under diversified types of AI service requests and changing workloads on edge data centers, the proposed DRL-based algorithm provides an elastic collaborative scheduling strategy for edge computational resources based on distributed data parallelism. We first formulate the dynamic, collaborative scheduling problem of edge computational resources as a Markov decision process, represented by the quadruple <state, action, reward, Q value>. The state contains the remaining computing resource capacity $C_{i,j}$ of the edge data centers and the cost $N_{i,j}$ of processing one computational request task on each edge data center. The action is the set of computational request tasks $A_{i,j}^{t,k,f}$ allocated to each edge data center of the different suppliers in each time frame. The Q value represents the state-action value function obtained through neural network training; in other words, it is the maximum reward expectation obtained by selecting different actions in different states, and the reward is formulated in Equation (7).
\mathrm{Reward} = \begin{cases} \alpha \cdot \Big( \sum_{t,i,j,k,f} M_{k,f} \cdot A_{i,j}^{t,k,f} - \sum_{t,i,j} N_{i,j} \cdot C_{i,j} \cdot X_{i,j}^{t} \Big) - \beta \cdot \sum_{t,i,j} \sigma_{i,j}^{t}, & \text{if successfully assigned} \\ -R, & \text{if not successfully assigned} \end{cases}   (7)
where $\alpha$ and $\beta$ are the respective weight coefficients for the profit of the computational service and the backlog of computational request tasks on the edge data centers in the ECFN of the smart grid.
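As a small illustration, the piecewise reward of Equation (7) could be computed as below; the dictionary-based data layout and all parameter names are illustrative assumptions rather than the paper's implementation.

```python
def compute_reward(assignments, usage, backlogs, services, centers,
                   alpha: float, beta: float, penalty: float, assigned_ok: bool) -> float:
    """Reward of Equation (7): weighted profit minus weighted backlog if the
    request set was successfully assigned, otherwise a fixed negative penalty (-R).
    assignments[(i, j, k, f)] is the amount A assigned, usage[(i, j)] is X,
    backlogs[(i, j)] is sigma; all field names here are illustrative."""
    if not assigned_ok:
        return -penalty
    revenue = sum(services[k, f]["payment"] * a for (i, j, k, f), a in assignments.items())
    cost = sum(centers[i, j]["cost"] * centers[i, j]["capacity"] * x
               for (i, j), x in usage.items())
    backlog = sum(backlogs.values())
    return alpha * (revenue - cost) - beta * backlog
```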
The process steps of the DRL-based dynamic, collaborative resource scheduling algorithm are as follows (a condensed code sketch of the core steps is given after the list).
(1) Initialize the parameters of the edge data centers of the different edge computational resource suppliers.
(2) Input the request $B_{k,f}$ and the current state $s_t$ of the edge data centers in the $t$-th time frame into the DNN, which consists of an M-layer convolutional neural network and an N-layer fully connected network.
(3) Based on $B_{k,f}$ and $s_t$, the corresponding Q value is calculated for each action, and the DNN outputs the set of Q values $Q^{\pi_t}(s_t, a_t)$ for all possible actions under the current strategy $\pi_t$, where $a_t$ indicates the action in the $t$-th time frame.
(4) The action with the largest Q value is selected from the set of Q values to interact with the environment. The action can also be selected using the $\varepsilon$-greedy strategy: the action with the highest Q value is selected with probability $\varepsilon$, and an action is selected at random with probability $(1-\varepsilon)$ to fully explore the action space. The value of $\varepsilon$ gradually increases after each action selection.
(5) After executing the action $a_t^{*} = \arg\max_{a_t} Q^{\pi_t}(s_t, a_t)$, the state $s_t$ of the remaining computational resource capacity of the edge data centers is updated to the next state $s_{t+1}$, and the reward value $r_t$ is returned.
(6) The quadruple entry $\langle s_t, s_{t+1}, a_t, r_t \rangle$ is saved in the memory.
(7) At regular intervals, $L$ data entries are randomly sampled from the memory; in other words, a random mini-batch of entries $\langle s_t, s_{t+1}, a_t, r_t \rangle$ is sampled to update the DNN parameters by $\theta_{t+1} \leftarrow \theta_t - \alpha \cdot \sum_i d\xi_t / d\theta_i$.
(8) Train the DNN parameters using the extracted data entries based on the Bellman optimality equation, formulated as $Q^{\pi_t}(s_t, a_t) = r(s_t, a_t) + \gamma \cdot \max_{a_{t+1}} Q^{\pi_t}(s_{t+1}, a_{t+1})$, where $\gamma$ is a discount factor. The Bellman error, formulated as $\xi = 0.5 \cdot \mathbb{E}_{s_t, a_t}\big[ \big( Q^{\pi_t}(s_t, a_t, \theta_t) - r(s_t, a_t) - \gamma \cdot \max_{a_{t+1}} Q^{\pi_t}(s_{t+1}, a_{t+1}, \theta_t) \big)^{2} \big]$, is minimized to update the DNN parameters by the gradient descent method. The values of actions are fitted well by the DNN once the Bellman error converges after updating the DNN parameters.
(9) Repeat the above steps until the Q-value function converges. When the model is used, the current computational resource status of the edge data centers is input into the model along with the user requirements, and the model returns a dynamic, collaborative resource scheduling strategy.
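The core of steps (3)-(8) can be condensed into the following PyTorch-style sketch. The network architecture, optimizer, and epsilon handling are placeholder choices (the paper specifies an M-layer CNN plus N-layer fully connected DNN and the hyperparameters of Table 4); note that, following the paper's convention, $\varepsilon$ here is the exploitation probability and would be increased over training.

```python
import random
from collections import deque
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Placeholder DNN; the paper uses M convolutional plus N fully connected layers (Table 4)."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))
    def forward(self, s):
        return self.net(s)

class DQNScheduler:
    def __init__(self, state_dim, n_actions, gamma=0.9, lr=0.01, eps=0.1, memory_size=5000):
        self.q = QNetwork(state_dim, n_actions)
        self.opt = torch.optim.SGD(self.q.parameters(), lr=lr)
        self.memory = deque(maxlen=memory_size)
        self.gamma, self.eps, self.n_actions = gamma, eps, n_actions

    def select_action(self, state):
        # Step (4): exploit (highest Q value) with probability eps, explore otherwise;
        # eps would be increased after each selection, as described in the paper.
        if random.random() < self.eps:
            with torch.no_grad():
                return int(self.q(state).argmax().item())
        return random.randrange(self.n_actions)

    def store(self, s, s_next, a, r):
        # Step (6): save the transition quadruple in replay memory.
        self.memory.append((s, s_next, a, r))

    def update(self, batch_size=32):
        # Steps (7)-(8): sample a mini-batch and minimize the Bellman error by gradient descent.
        if len(self.memory) < batch_size:
            return
        batch = random.sample(self.memory, batch_size)
        s = torch.stack([b[0] for b in batch])
        s_next = torch.stack([b[1] for b in batch])
        a = torch.tensor([b[2] for b in batch])
        r = torch.tensor([b[3] for b in batch], dtype=torch.float32)
        q_sa = self.q(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            target = r + self.gamma * self.q(s_next).max(dim=1).values
        loss = 0.5 * ((q_sa - target) ** 2).mean()   # Bellman error of step (8)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
```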
Notably, to reduce the action space and facilitate the model training process, a heuristic algorithm called "Pricing Thinking oriented Greedy Strategy based Computing Resource Scheduling Algorithm (PT-GSCRSA)" is applied within the proposed DRL-based algorithm. The main idea of PT-GSCRSA is to assign the AI service with the highest payment to the edge data center with the lowest processing cost. In detail, when multiple AI services arrive, the services and the processing costs of the edge data centers are sorted in descending and ascending order, respectively. Then, the AI service with the highest payment is assigned to the edge data center with the lowest data processing cost. If the requested computing resources of the AI service exceed the remaining computational resource capacity of the allocated edge data center, the AI training data are split into multiple data segments, and all remaining computational resources of that edge data center are assigned to the AI service. Otherwise, if the requested computing resources of the AI service are smaller than the remaining computational resource capacity of the allocated edge data center, all AI training data are assigned to that edge data center, and its remaining computational resources are updated. When the workload of the edge data center with the lowest cost becomes full after processing part of the AI training data, the remaining AI training data are assigned to the edge data center with the second lowest cost. These steps are repeated until all AI training data are allocated to multiple edge data centers (a minimal code sketch is given below). The flowchart of the DRL-based dynamic computational resource scheduling algorithm is shown in Figure 2.
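The following is a minimal sketch of the PT-GSCRSA greedy assignment described above; the service and data center records are plain dictionaries with illustrative field names.

```python
def pt_gscrsa(services, centers):
    """Sketch of PT-GSCRSA: sort services by payment (descending) and edge data
    centers by unit processing cost (ascending), then split each service's
    training data greedily across the cheapest centers with remaining capacity."""
    services = sorted(services, key=lambda s: s["payment"], reverse=True)
    centers = sorted(centers, key=lambda c: c["cost"])
    plan = []  # (service id, center id, assigned GOPS)
    for svc in services:
        remaining = svc["demand"]
        for dc in centers:
            if remaining <= 0:
                break
            if dc["capacity"] <= 0:
                continue  # this center's workload is full, try the next cheapest one
            share = min(remaining, dc["capacity"])
            plan.append((svc["id"], dc["id"], share))
            dc["capacity"] -= share
            remaining -= share
    return plan
```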

4. Simulation Results and Analysis

4.1. Simulation Setup

In this section, a series of simulations is conducted to compare the performance of the proposed DRL-based algorithm against the benchmark. The MILP model serves as the exact solution for comparison, and a random distribution algorithm is chosen as the benchmark. For the simulation environment, the proposed MILP model and DRL-based algorithm are implemented on a computer with an Intel(R) Core(TM) i7-10710U CPU @ 1.10 GHz (1.61 GHz boost) and 16.0 GB RAM, using IBM CPLEX 12.8.0, MATLAB, and Python, respectively. In the simulations, we consider two test network scenarios, a small network and a large network, with the number of AI services varying from 20 to 220. We assume that each computational resource supplier has two edge data centers and that each AI user has two service requests. In addition, the simulation parameters for the DRL-based algorithm are listed in Table 4, including neural network parameters, discount factor, exploration rate, and other hyperparameters.
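For convenience, the DRL hyperparameters of Table 4 can be gathered into one configuration object; the dictionary below restates those values, and its wiring into the DQN sketch of Section 3.2 is only illustrative.

```python
# Hyperparameters from Table 4, collected into a configuration dict that could be
# passed to the DQNScheduler sketch from Section 3.2 (illustrative wiring only).
drl_config = {
    "cnn_layers": 2,
    "cnn_filters": 5,
    "cnn_kernel_size": 2,
    "cnn_strides": 1,
    "training_iterations": 3000,
    "learning_rate": 0.01,
    "reward_decay": 0.9,      # discount factor gamma
    "memory_size": 5000,
    "batch_size": 32,
    "exploration_rate": 0.9,
}
```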

4.2. Simulation Scenario I

In this simulation scenario of a small network, we consider a network topology of 30 edge data centers, comprising fifteen edge computational resource suppliers with two edge data centers each. The number of AI services is set to 100, consisting of fifty users with two types of AI-distributed model training tasks each. The backlog threshold of AI-distributed model training tasks in the edge data centers is set to 50. A series of simulations in scenario I is conducted to compare the proposed DRL-based algorithm, the MILP model, and the benchmark in terms of multi-objective optimization, profit, backlog, and running time.
(a) Multi-objective optimization vs. number of AI services
Figure 3 shows that when the weight coefficient ratio between profit and backlog is 100:1, the proposed DRL-based algorithm yields a multi-objective value that is on average 13.26% lower than MILP's, and the benchmark yields a multi-objective value that is on average 46% lower than DRL's. When the weight coefficient ratios between profit and backlog are 0.9:0.1 and 0.8:0.2, the DRL-based algorithm yields multi-objective values that are on average 12.35% and 10.68% lower than MILP's, respectively, and the benchmark yields values that are on average 31.55% and 18.58% lower than DRL's, respectively. Thus, the DRL-based algorithm achieves the performance closest to MILP, especially when the number of AI services is less than 60 and the weight coefficient ratio between profit and backlog is 0.8:0.2. Moreover, the benchmark loses the most multi-objective value, an average of 46% relative to DRL, when the weight coefficient ratio between profit and backlog is 100:1. This is because, under diversified types of AI service requests and changing workloads on edge data centers, DRL provides an elastic computing resource scheduling strategy, unlike MILP with a fixed exact solution and the benchmark with a random assignment strategy. Another reason is that the profit term has a larger weight than the backlog of distributed model training tasks in the multi-objective optimization.
(b) Profit vs. number of AI services
As shown in Figure 4a, the proposed DRL-based algorithm achieves performance approximating that of MILP, especially when the number of AI services is less than 60, and it outperforms the benchmark in terms of profit of the AI service provider. Another observation is that the benchmark yields on average 44.7% less profit than the DRL-based algorithm. This is reasonable because the benchmark uses a random-fit approach to assign computing service requests to edge data centers and therefore consumes more computational resources for less profit compared with the DRL-based algorithm.
(c) Backlog vs. number of AI services
As shown in Figure 4b, the proposed DRL-based algorithm reduces the backlog of distributed model training tasks by an average of 44.4% compared with the benchmark. Another observation is that, although MILP achieves the highest multi-objective value, its backlog performance is not stable because the backlog term carries a relatively small weight.
(d) Running time vs. number of AI services
It can be observed in Figure 4c that the DRL-based algorithm performs significantly better than MILP in terms of running time. This advantage in time consumption becomes more pronounced as the number of AI services grows.

4.3. Simulation Scenario II

In this simulation scenario of a large network, we consider a network topology of 60 edge data centers, comprising thirty edge computational resource suppliers with two edge data centers each. The number of AI services is set to 100, consisting of fifty users with two types of AI-distributed model training tasks each. The backlog threshold of AI-distributed model training tasks in the edge data centers is set to 50. A series of simulations in scenario II is conducted to compare the proposed DRL-based algorithm, the MILP model, and the benchmark in terms of multi-objective optimization, profit, backlog, and running time.
(a) Multi-objective optimization vs. number of AI services
Figure 5 shows that our proposed DRL-based algorithm significantly outperforms the benchmark and approaches MILP in terms of multi-objective optimization in the large network. Notably, MILP obtains the exact solution to the discussed problem. In Figure 5a, the average error between the DRL-based algorithm and MILP is 30.34%, and DQN achieves on average a 34.25% higher joint optimization value of profit and backlog than the benchmark when the weight coefficient ratio between profit and backlog is 100:1. The simulation results in Figure 5b,c show that the average errors between the DRL-based algorithm and MILP are 25.19% and 29.18%, respectively, and DQN achieves on average 12.83% and 86.28% higher joint optimization values of profit and backlog than the benchmark when the weight coefficient ratios between profit and backlog are 0.8:0.2 and 0.9:0.1, respectively. This is because, under diversified types of AI service requests and changing workloads on edge data centers, DRL adapts to changes better than MILP with a fixed exact solution and the benchmark with a random assignment strategy. Another reason is that the profit term has a larger weight than the backlog of distributed model training tasks in the multi-objective optimization. Overall, the DRL-based algorithm achieves on average a 28.24% higher multi-objective value than the benchmark in the large network.
(b) Profit vs. number of AI services
As shown in Figure 6a, DQN achieves on average 31.07% more profit than the benchmark, and the average error between the DRL-based algorithm and MILP is 30.41% in the large network. This is reasonable because the benchmark randomly assigns the AI-distributed model training tasks to edge data centers and therefore consumes more computational resources for less profit compared with the DRL-based algorithm.
(c) Backlog vs. number of AI services
We observe from Figure 6b that DQN reduces the backlog of distributed model training tasks by an average of 60% compared with the benchmark in the large network. The average error between DQN and MILP is 47.68%.
(d) Running time vs. number of AI services
It can be observed in Figure 6c that MILP consumes the most running time among the three algorithms, especially when the number of AI services exceeds 160. DQN reduces the running time by an average of 98.27% compared with MILP, while the running times of DQN and the benchmark do not differ much. This is reasonable because MILP has high time complexity, especially in large networks.

5. Discussion and Conclusions

This letter investigates an AI service-oriented dynamic, collaborative computational resource scheduling algorithm based on distributed data parallelism in the ECFN of the smart grid. A MILP model is formulated, and a DRL-based algorithm is proposed, which utilizes the heuristic algorithm PT-GSCRSA to facilitate the model training process. Simulation results show that DRL achieves performance closest to that of MILP, especially when the number of AI services is less than 60 and the weight coefficient ratio between profit and backlog in the multi-objective is 0.8:0.2. DRL outperforms the benchmark in terms of profit of the AI service provider, backlog of distributed model training tasks, running time, and multi-objective optimization in both small and large networks.

Author Contributions

Conceptualization, J.Z. and P.X.; methodology, J.Z.; software, J.Z.; validation, J.Z., P.X. and H.Z.; formal analysis, P.X.; investigation, H.Z.; resources, C.W.; data curation, H.Z.; writing—original draft preparation, H.Z.; writing—review and editing, C.W.; visualization, C.W.; supervision, H.Z.; project administration, L.W.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

This work was supported by the State Grid Corporation of China Science and Technology Project (5700-202318269A-1-1-ZN).

Conflicts of Interest

The authors, Jing Zou and Peizhe Xin, are engineers at State Grid Economic and Technological Research Institute, Ltd., and they declare no conflicts of interest. The authors Heli Zhang, Ying Wang and Chang Wang are researchers at Beijing University of Posts and Telecommunications, and they declare no conflicts of interest. The other authors declare no conflicts of interest.

References

  1. Tang, W.; Lu, W.; Tan, Q.; Ma, J.; Liu, Z. Research on the status and development trend of the data center from the perspective of energy and power. In Proceedings of the IEEE International Conference on Advances in Electrical Engineering and Computer Applications (AEECA), Dalian, China, 20–21 August 2022; pp. 562–565. [Google Scholar]
  2. Luo, F.; Zhao, J.; Dong, Z.Y.; Chen, Y.; Xu, Y.; Zhang, X.; Wong, K.P. Cloud-Based Information Infrastructure for Next-Generation Power Grid: Conception, Architecture, and Applications. IEEE Trans. Smart Grid 2016, 7, 1896–1912. [Google Scholar] [CrossRef]
  3. Slama, S.B. Prosumer in smart grids based on intelligent edge computing: A review on Artificial Intelligence Scheduling Techniques. Ain Shams Eng. J. 2022, 13, 101504. [Google Scholar] [CrossRef]
  4. Yang, Q.; Liu, Y.; Chen, T.; Tong, Y. Federated machine learning: Concept and applications. ACM Trans. Intell. Syst. Technol. 2019, 10, 1–19. [Google Scholar] [CrossRef]
  5. Feng, C.; Wang, Y.; Chen, Q.; Ding, Y.; Strbac, G.; Kang, C. Smart grid encounters edge computing: Opportunities and applications. Adv. Appl. Energy 2021, 1, 100006. [Google Scholar] [CrossRef]
  6. Gao, H.; Xia, W.; Yan, F.; Shen, L. Optimal Cloud Resource Scheduling in Smart Grid: A Hierarchical Game Approach. In Proceedings of the 2020 IEEE 91st Vehicular Technology Conference (VTC2020-Spring), Online, 25 May–31 July 2020. [Google Scholar]
  7. Yao, J.; Li, Z.; Li, Y.; Bai, J.; Wang, J.; Lin, P. Cost-Efficient Tasks Scheduling for Smart Grid Communication Network with Edge Computing System. In Proceedings of the 15th International Wireless Communications & Mobile Computing Conference (IWCMC), Tangier, Morocco, 24–28 June 2019; pp. 272–277. [Google Scholar]
  8. Li, Z.; Liu, Y.; Xin, R.; Gao, L.; Ding, X.; Hu, Y. A Dynamic Game Model for Resource Allocation in Fog Computing for Ubiquitous Smart Grid. In Proceedings of the 2019 28th Wireless and Optical Communications Conference (WOCC), Beijing, China, 9–10 May 2019; pp. 1–5. [Google Scholar]
  9. Sharif, Z.; Jung, L.T.; Razzak, I.; Alazab, M. Adaptive and Priority-Based Resource Allocation for Efficient Resources Utilization in Mobile-Edge Computing. IEEE Internet Things J. 2023, 10, 3079–3093. [Google Scholar] [CrossRef]
  10. Jiang, J.; Xin, P.; Wang, Y.; Liu, L.; Chai, Y.; Zhang, Y.; Liu, S. Computing Resource Allocation in Mobile Edge Networks Based on Game Theory. In Proceedings of the 2021 IEEE 4th International Conference on Electronics and Communication Engineering (ICECE), Xi′an, China, 17–19 December 2021; pp. 179–183. [Google Scholar]
  11. Wang, G.; Xu, F.; Zhao, C. Multi-Access Edge Computing Based Vehicular Network: Joint Task Scheduling and Resource Allocation Strategy. In Proceedings of the 2020 IEEE International Conference on Communications Workshops (ICC Workshops), Dublin, Ireland, 7–11 June 2020; pp. 1–6. [Google Scholar]
  12. Yan, J.; Rui, L.; Yang, Y.; Chen, S.; Chen, X. Resource Scheduling Algorithms for Burst Network Flow in Edge Computing. In Lecture Notes in Electrical Engineering, Proceedings of the 11th International Conference on Computer Engineering and Networks, Hechi, China, 21–25 October 2021; Liu, Q., Liu, X., Chen, B., Zhang, Y., Peng, J., Eds.; Springer: Singapore, 2022; Volume 808, p. 808. [Google Scholar]
  13. Seah, W.K.G.; Lee, C.-H.; Lin, Y.-D.; Lai, Y.-C. Combined Communication and Computing Resource Scheduling in Sliced 5G Multi-Access Edge Computing Systems. IEEE Trans. Veh. Technol. 2021, 71, 3144–3154. [Google Scholar] [CrossRef]
  14. Jun, N. Research on cloud computing resource scheduling strategy based on genetic ant colony fusion algorithm. In Proceedings of the Conference on Computer Science and Communication Technology, Tianjin, China, 12–13 March 2022. [Google Scholar]
  15. Liu, Z.; Zhan, C.; Cui, Y.; Wu, C.; Hu, H. Robust Edge Computing in UAV Systems via Scalable Computing and Cooperative Computing. IEEE Wirel. Commun. 2021, 28, 36–42. [Google Scholar] [CrossRef]
  16. Shao, X.; Hasegawa, G.; Dong, M.; Liu, Z.; Masui, H.; Ji, Y. An Online Orchestration Mechanism for General-Purpose Edge Computing. IEEE Trans. Serv. Comput. 2023, 16, 927–940. [Google Scholar] [CrossRef]
  17. Xi, L.; Wang, Y.; Wang, Y.; Wang, Z.; Wang, X.; Chen, Y. Deep Reinforcement Learning-Based Service-Oriented Resource Allocation in Smart Grids. IEEE Access 2021, 9, 77637–77648. [Google Scholar] [CrossRef]
  18. Wang, T.; Lu, B.; Wang, W.; Wei, W.; Yuan, X.; Li, J. Reinforcement Learning-Based Optimization for Mobile Edge Computing Scheduling Game. IEEE Trans. Emerg. Top. Comput. Intell. 2023, 7, 55–64. [Google Scholar] [CrossRef]
  19. Vijayasekaran, G.; Duraipandian, M. Deep Q-learning based Resource Scheduling in IoT Edge Computing. In Proceedings of the 2024 International Conference on Inventive Computation Technologies (ICICT), Lalitpur, Nepal, 24–26 April 2024; pp. 359–364. [Google Scholar]
  20. Wang, S.; Wu, Y.-C.; Xia, M.; Wang, R.; Poor, H.V. Machine intelligence at the edge with learning centric power allocation. IEEE Trans. Wireless Commun. 2020, 19, 7293–7308. [Google Scholar] [CrossRef]
  21. Chi, Y.; Zhang, Y.; Liu, Y.; Zhu, H.; Zheng, Z.; Liu, R.; Zhang, P. Deep Reinforcement Learning Based Edge Computing Network Aided Resource Allocation Algorithm for Smart Grid. IEEE Access 2022, 11, 6541–6550. [Google Scholar] [CrossRef]
  22. Chen, S.; Wen, H.; Wu, J.; Lei, W.; Hou, W.; Liu, W.; Xu, A.; Jiang, Y. Internet of Things Based Smart Grids Supported by Intelligent Edge Computing. IEEE Access 2019, 7, 74089–74102. [Google Scholar] [CrossRef]
  23. Lin, Y.; Han, S.; Mao, H.; Wang, Y.; Dally, W.J. Deep gradient compression: Reducing the communication bandwidth for distributed training. In Proceedings of the Sixth International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–14. [Google Scholar]
  24. Tran, N.H.; Bao, W.; Zomaya, A.; Nguyen, M.N.H.; Hong, C.S. Federated learning over wireless networks: Optimization model design and analysis. In Proceedings of the IEEE Conference on Computer Communications, Paris, France, 29 April–2 May 2019; pp. 1387–1395. [Google Scholar]
  25. Xu, J.; Liu, X.; Zhu, X. Deep Reinforcement Learning Based Computing Offloading and Resource Allocation Algorithm for Mobile Edge Networks. In Proceedings of the 2020 IEEE 6th International Conference on Computer and Communications (ICCC), Chengdu, China, 11–14 December 2020; pp. 1542–1547. [Google Scholar]
  26. Liu, M.; Li, Y.; Zhao, Y.; Yang, H.; Zhang, J. Adaptive DNN model partition and deployment in edge computing-enabled metro optical interconnection network. In Proceedings of the Optical Fiber Communication (OFC) Conference, San Diego, CA, USA, 8–12 March 2020. [Google Scholar]
  27. Li, Y.; Zeng, Z.; Li, J.; Yan, B.; Zhao, Y.; Zhang, J. Distributed Model Training Based on Data Parallelism in Edge Computing-Enabled Elastic Optical Networks. IEEE Commun. Lett. 2020, 25, 1241–1244. [Google Scholar] [CrossRef]
  28. Wang, H.; Agarwal, S.; Papailiopoulos, D. Pufferfish: Communication-efficient Models at No Extra Cost. arXiv 2021, arXiv:2103.03936. [Google Scholar]
  29. Mansour, A.B.; Carenini, G.; Duplessis, A.; Naccache, D. Federated Learning Aggregation: New Robust Algorithms with Guarantees. In Proceedings of the 2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA), Nassau, Bahamas, 12–14 December 2022; pp. 721–726. [Google Scholar]
Figure 1. The schematic diagram of dynamic computational resource scheduling for AI services.
Figure 2. The flowchart of DRL-based dynamic computational resource scheduling algorithm.
Figure 3. Performance comparisons in small network in terms of multi-objective optimization, (a) α : β = 100:1, (b) α : β = 0.8:0.2, (c) α : β = 0.9:0.1.
Figure 4. Performance comparisons in small network in terms of (a) profit vs. No. of services, (b) backlog vs. No. of services, (c) running time vs. No. of services.
Figure 5. Performance comparisons in large network in terms of multi-objective optimization, (a) α : β = 100:1, (b) α : β = 0.8:0.2, (c) α : β = 0.9:0.1.
Figure 6. Performance comparisons in large network in terms of (a) profit vs. No. of services, (b) backlog vs. No. of services, (c) running time vs. No. of services.
Table 1. Constant parameters.
Parameter | Description
$S$ | Set of edge computing resource suppliers, $i \in S$.
$E$ | Set of edge data centers, $j \in E$.
$U$ | Set of computing resource request users, $k \in U$.
$R$ | Set of different types of AI services requested by computing service request users, $f \in R$.
$\varepsilon$ | Backlog threshold of AI-distributed model training tasks in the edge data center (GOPS).
$N_{max}$ | A very large positive integer.
Table 2. Input variables.
Parameter | Description
$C_{i,j}$ | Computing resource processing capacity of the $j$-th edge data center of the $i$-th computing resource supplier (GOPS), $i \in S$, $j \in E$.
$N_{i,j}$ | Computational processing cost of the $j$-th edge data center of the $i$-th computing resource supplier (Yuan/GOPS), $i \in S$, $j \in E$.
$B_{k,f}$ | Edge computing service request of the $k$-th user's $f$-th type of AI service (GOPS), $k \in U$, $f \in R$.
$M_{k,f}$ | Payment for the data computing service of the $f$-th type of AI service of the $k$-th computing service request user (Yuan/GOPS), $k \in U$, $f \in R$.
$P_{k,f}$ | Complaint rate of AI-distributed model training tasks of the $f$-th type of AI service from the $k$-th computing service request user (statistical value), $k \in U$, $f \in R$.
$D_{i,j}$ | Remaining amount of AI-distributed model training tasks that can be complained about in the $j$-th edge data center of the $i$-th computing resource supplier (GOPS), $i \in S$, $j \in E$.
Table 3. Decision variables.
Parameter | Description
$A_{i,j}^{t,k,f}$ | Boolean variable. Equals 1 if the requested computing resources of the $k$-th user's $f$-th type of AI service are assigned to the $j$-th edge data center of the $i$-th edge computing resource supplier (GOPS) in the $t$-th time frame.
$\sigma_{i,j}^{t}$ | Backlog of AI-distributed model training tasks of the $j$-th edge data center of the $i$-th edge computing resource supplier (GOPS) in the $t$-th time frame.
$X_{i,j}^{t}$ | Boolean variable. Equals 1 if the $j$-th edge data center of the $i$-th edge computing resource supplier is used in the $t$-th time frame.
Table 4. Simulation parameters for the DRL-based algorithm.
Parameter | Value
Number of CNN layers | 2
Filters of CNN | 5
Kernel size of CNN | 2
Strides of CNN | 1
Number of training iterations | 3000
Learning rate | 0.01
Reward decay | 0.9
Memory size | 5000
Batch size | 32
Exploration rate | 0.9
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
