Article

An Approach for Deployment of Service-Oriented Simulation Run-Time Resources

College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(20), 11341; https://doi.org/10.3390/app132011341
Submission received: 12 September 2023 / Revised: 8 October 2023 / Accepted: 11 October 2023 / Published: 16 October 2023
(This article belongs to the Special Issue Advances in Edge Computing for Internet of Things)

Abstract:
The requirements for low latency and high stability in large-scale geo-distributed training simulations have made cloud-edge collaborative simulation an emerging trend. However, there is currently limited research on how to deploy simulation run-time resources (SRR), including edge servers, simulation services, and simulation members. On the one hand, the deployment schemes of these resources are coupled and affect each other, so it is difficult to ensure an overall optimum by deploying them separately. On the other hand, low latency and high system stability are often difficult to achieve simultaneously, because high stability implies a low server load, while a small number of simulation services implies a high response latency. We formulate this problem as a multi-objective optimization problem for the joint deployment of SRR, considering the complex combinatorial relationships between simulation services. Our objective is to minimize the system time cost and the resource usage rate of edge servers under constraints such as server resource capacity and the relationship between edge servers and base stations. To address this problem, we propose a learnable genetic algorithm for SRR deployment (LGASRD), in which the population can learn from elites and adaptively select evolution operators that perform well. Extensive experiments with different settings based on real-world data sets demonstrate that LGASRD outperforms the baseline policies in terms of optimality, feasibility, and convergence rate, verifying its effectiveness when deploying SRR.

1. Introduction

Training simulation is a safe, controllable, and cost-effective training method that can improve trainees' abilities, and it has become a critical means of talent cultivation in various industries. In large-scale geo-distributed training simulations, simulation members, i.e., users, can be categorized into three types [1]: live (L), virtual (V), and constructive (C) members, e.g., actual aircraft with various sensors, aircraft simulators controlled from simulated cockpits, and computer-generated aircraft, respectively. These members, dispersed across different locations, are integrated into a virtual space where they can see and interact with each other.
However, systems capable of supporting such large-scale simulation activities are often excessively large and require sufficient computing resources and network resources to support their operation. Otherwise, users will suffer from tremendous latency, which can significantly impair user experience quality.
To address this issue, researchers have proposed deploying simulation systems in the cloud-edge collaborative computing environment, utilizing both the computing resources of cloud computing and the networking resources of edge computing [2]. Under the cloud-edge collaborative simulation paradigm, developers usually adopt the micro-service (MS) architecture to develop simulation systems. By encapsulating common functional modules into containers to form services, MS architecture can achieve higher development efficiency, more efficient resource reuse, and better elasticity [3,4].
During the operation of an MS-based training simulation system deployed in a cloud-edge collaborative computing architecture, simulation members send service requests to the system to realize the functions they need. They also send messages to other members through the middleware service, both to interact with those members and to advance the current simulation activity.
As shown in Figure 1, each service request is composed of multiple micro-services and can be formulated as a directed acyclic graph (DAG) [5,6], in which a node represents a task, defined as the execution of its corresponding MS instance. Each task has predecessor and successor tasks. After a task is completed, its successor tasks are executed either on the same server or on other servers; the latter incurs service communication time, a part of the data transmission time. Furthermore, MS instances provide computation in a queue-based manner, which may lead to long service waiting times under heavy user demand and a small number of instances.
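To make this structure concrete, the following Python sketch represents a hypothetical service request as a DAG (task and micro-service names are invented for illustration, not taken from the paper) and recovers the topological order in which its tasks become schedulable:

```python
# A hypothetical path-planning request as a DAG: each task maps to a
# micro-service type and lists its predecessor tasks.
request = {
    "t1": {"ms": "MS_sense", "pred": []},
    "t2": {"ms": "MS_plan", "pred": ["t1"]},
    "t3": {"ms": "MS_render", "pred": ["t1", "t2"]},
}

def topological_order(dag):
    """Kahn's algorithm: the order in which tasks become schedulable, i.e.,
    a task runs only after all of its predecessors have finished."""
    indeg = {t: len(v["pred"]) for t, v in dag.items()}
    order = []
    ready = [t for t, d in indeg.items() if d == 0]
    while ready:
        t = ready.pop()
        order.append(t)
        for s, v in dag.items():
            if t in v["pred"]:
                indeg[s] -= 1
                if indeg[s] == 0:
                    ready.append(s)
    return order

print(topological_order(request))  # ['t1', 't2', 't3']
```

Because tasks are dispatched to servers in such an order, the placement of a task's predecessors determines whether its input data must cross the network, which is what makes deployment decisions matter for communication time.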
Furthermore, in large-scale geo-distributed training simulations, message sending may lead to long-distance data transmission, generating user interaction time. For example, in a training simulation activity, if user A needs to send a message to user B, the message needs to be sent to the middleware service first, and then the middleware will forward it to user B.
To implement the training simulation process mentioned above, the primary challenge is how to deploy edge servers, micro-services, and members in the cloud-edge collaborative computing environment before the simulation starts. On the one hand, edge servers provide both computing and communication capabilities, making them an important infrastructure of the cloud-edge collaborative computing architecture. In reality, there are often n available edge servers but m (n < m) candidate server locations, typically base stations (BSs) [7,8,9,10,11], so we need to determine where these n edge servers should be deployed. We assume that the cloud center already exists, so we do not consider its deployment. On the other hand, simulation members are the main participants of training simulations, and micro-services are indispensable components for enabling members to perform their required functions, so both are extremely important for training simulations. As software resources, their deployment schemes on servers must be determined. Note that the members referred to here are constructive members: in large-scale geo-distributed simulations, live members and virtual members are often difficult to move due to their large numbers, so we assume that their locations are known and fixed. For convenience of description, we define servers, micro-services, and constructive members as simulation run-time resources (SRR), and refer to the latter two software resources as simulation run-time software resources (SRSR).
Different deployment schemes of SRR have a great impact on the time cost of simulation systems, including data transmission time and service queuing time, as well as on the resource occupancy rate of edge servers. Data transmission is caused by users' service requests and by interactions between users. Without loss of generality, it is usually assumed that a task is always scheduled to the server hosting the corresponding MS instance that is closest to its input data [11,12]. Obviously, the same server will face different user demands under different deployment schemes of edge servers and simulation members. Deployment decisions for micro-services have a significant influence on task scheduling and service queuing time. In addition, scattered users lead to large data transmission times due to their messages. On the other hand, although more micro-service instances can reduce time costs, they increase the resource usage of edge servers, which raises the risk of crashes and should be avoided as much as possible. Therefore, the deployment schemes of SRR are closely coupled and should be considered together [13].
However, there is currently little research on this problem. Existing works mostly study either service deployment or server deployment in isolation. As mentioned earlier, the two are coupled, and it is difficult to achieve the overall optimum by studying them separately. In addition, most works only consider the deployment of monolithic services instead of micro-services, thus ignoring the data dependencies between micro-services [11,14]; such deployment schemes cannot fully capture the characteristics of systems based on the MS architecture. Finally, current related research focuses mainly on scenarios such as autonomous driving and the Internet of Things (IoT) rather than training simulation, which involves elements these fields do not possess, such as L, V, and C members and their interactions.
This work studies the joint deployment of SRR in cloud-edge collaborative computing environments, aiming to minimize system time cost, while minimizing the resource consumption rate of edge servers. We formulate the joint deployment problem and design a learnable multiple-objective evolutionary algorithm to solve it. Furthermore, we target the service-oriented training simulation field, a more specialized scenario. In this paper, the main contributions are as follows:
  • We propose a complete process for deploying edge servers and software resources together, taking a more complex service modeling approach into account. This is an improvement over existing work.
  • We formulate the joint deployment of edge servers and SRSR as a multi-objective optimization problem with two objectives: minimizing both the system's time cost and the resource usage rate of edge servers. To solve this problem, we propose a learnable genetic algorithm for simulation run-time resource deployment (LGASRD), which is capable of adaptively selecting genetic operators and evolving under the guidance of an elite population.
  • Extensive experiments are implemented in Python to evaluate the performance of the LGASRD based on the real-world EUA data set [15] and scientific workflows [16]. The experimental results show that LGASRD outperforms the other typical benchmark policies.
The remainder of this article is organized as follows. Section 2 provides an overview of related works. Section 3 elaborates on the motivations behind our study on the joint deployment of SRR. Section 4 introduces the system model and formulates a multi-objective optimization problem. LGASRD is proposed to solve the problem in Section 5 and its performance analysis is conducted based on real-world data sets in Section 6. Finally, Section 7 concludes this article.

2. Related Works

It has become a consensus that training simulation is one of the most effective and economical ways to enhance personnel's professional ability. Service-oriented simulation and the use of cloud and edge computing for simulation are an inevitable trend in the development of computer simulation technology [17,18,19,20]. Therefore, cloud-edge collaborative training simulation not only faces challenges common to cloud and edge computing, such as server deployment [21,22,23,24] and service deployment [25,26,27,28], but also has some unique problems. Training simulation differs from other types of simulation, such as supply chain, production, and manufacturing simulation: during the training simulation process, there is a large amount of interaction between virtual and real entities. Therefore, unlike conventional research on edge computing technology, we must consider the additional time cost caused by interaction in edge computing systems used for training simulation. In addition, the deployment of computer-generated constructive simulation members, which can be regarded as a special type of simulation service, also urgently needs to be resolved. To the best of our knowledge, however, no relevant research exists yet.
Currently, research on how to deploy services to distributed edge computing systems is a hot topic. Chen et al. [29] defined the problem as a combinatorial contextual bandit learning problem. They utilized the Multi-armed Bandit theory to learn the optimal deployment scheme of services. Ouyang et al. [30] studied the problem of how to deploy services in the distributed edge computing environment in a mobility-aware and dynamic way, aiming to optimize the long-term averaged migration cost due to user mobility. To achieve this, they developed two efficient heuristic schemes based on the Markov approximation and best response update techniques. Gao et al. [28] designed an iterative algorithm to study how to achieve optimal service deployment and access network selection. Hu et al. [31] proposed an algorithm to adjust the task scheduling decisions and resource allocation by achieving a trade-off between the execution time of tasks and energy consumption.
However, those studies only modeled services in a monolithic way, ignoring the composite property of services. Wang et al. [12,32] considered this factor but only allowed linear combination relationships between services; in reality, the business logic relationships between services should be represented by a DAG. In addition, although these works took different aspects into account, they assumed that every base station could be equipped with an edge server. In fact, the number of available edge servers is limited, so it is not reasonable to assume that all base stations can execute computing tasks. Furthermore, different server deployment schemes generate different service requests from users, resulting in different service deployment schemes. Therefore, the service deployment scheme cannot be determined until the deployment of the available edge servers has been fixed. Considering this, these two elements should be deployed jointly.
In edge computing, the most common deployment method is to place edge servers at base stations [21,23]. Li et al. [21] identified a gap in the related works regarding energy efficiency in edge server deployment. To address this, they proposed a multi-objective optimization approach incorporating an energy-aware edge server deployment algorithm, aiming to obtain a deployment solution that minimizes energy consumption while keeping access delay at an acceptable level. Mohan et al. [22] allowed edge servers to offload extra workload to the cloud, based on the assumption that the capacity constraints on each edge server were relaxed. Wang et al. [23] focused on the deployment of edge servers in a metropolitan area, assuming that all edge servers were homogeneous; they proposed a mixed-integer programming approach that aimed to minimize the edge server access latency while balancing the workload among edge servers. Lähderanta et al. [24] investigated the deployment of edge servers as a constrained location-allocation problem, incorporating both upper and lower server capacity constraints and aiming to minimize distance while balancing workload among the edge servers. However, these works only considered where to place the edge servers under different optimization goals and constraints, without taking into account the service placement under the resulting edge server placement scheme. Moreover, they assumed that each edge server hosts only a single type of service, overlooking users' diverse service requests.

3. Motivation

Figure 2 provides a more vivid illustration of the importance of considering the joint deployment of SRR, although there is currently little research on this topic. The figure includes training bases (TBs), edge servers (ESs), BSs, and a scheduling center located in the cloud. Software resources on ESs include MS instances and constructive members, depicted by colored circles and aircraft, respectively. To facilitate distinction, a jagged symbol is used to represent the middleware service instead of using a circle.
Taking Figure 2a as an example, only the deployment of servers is considered without considering data dependencies between services. In this case, three ESs are deployed on BS 1, 2, and 4, with purple, green, and middleware services deployed on them, respectively. A live user submits a service request for path planning. Since all ESs do not have the corresponding instance, the task can only be completed in the cloud, where all types of services are deployed. This will remarkably increase network latency. Additionally, a live user sends a message “Hello” to a constructive white aircraft, resulting in two data transmissions. However, if the white aircraft and middleware are deployed together, the second data transmission can be avoided.
The situation shown in Figure 2b optimizes the deployment of services and simulation members. The yellow service request submitted by a live user is dispatched to ES 1, avoiding huge communication costs generated by execution in the cloud. Additionally, some constructive members are deployed together with middleware, reducing user interaction costs to a certain extent. However, BS 3 and BS 5 are far away from TB A and B. Intuitively, it would be a better choice to deploy the two servers at BS 1 and 2.
Figure 2c illustrates the joint deployment of SRR, including ESs, micro-services, and constructive members, considering data dependencies between micro-services. A live user submits a request for path planning, which includes two tasks, $t_1$ and $t_2$. Task $t_1$ is scheduled to server 1, and its output is then sent to server 2 for processing of the second task, $t_2$. It should be noted that yellow MS instances are deployed on both server 1 and server 2; the reason $t_1$ is scheduled to server 1 rather than server 2 is that the instance on server 2 has two other tasks to process and therefore imposes a long waiting time.
Another significant difference among Figure 2a–c is that Figure 2c considers micro-services, using a more complex service representation, the DAG, whereas Figure 2a,b consider monolithic services. This difference stems from the granularity gap between the services the simulation system can provide and the services users require. The micro-service architecture depicted in Figure 2c is more flexible, so it is, and will remain, the mainstream architecture for developing large systems such as simulation systems [33]. Therefore, it is important, reasonable, and practical to study simulation resource deployment based on the MS architecture.

4. System Model and Problem Formulation

In this section, we first give mathematical definitions of services, tasks, servers, users, and the representation method for joint deployment decision-making. Then, we calculate two objective functions: time cost and resource occupancy rate. Finally, we formulate the joint deployment optimization model. The main notations used in this article are listed in Table 1.

4.1. Definitions

  • Service: In a system implemented with the MS architecture, the granularity of service requests submitted by users is usually greater than that of the individual micro-services (also called simulation services) the system can provide. In such cases, it is necessary to combine simulation services into a composite service (CS) to meet user needs. Typically, users' service requests are modeled as DAGs.
    Assume that the MS library consists of $n$ classes of micro-services, expressed by the set $MS = [MS_1, \dots, MS_n]$. To simplify the problem, assume that the resource requirement of $MS_i$ is expressed as $r_{MS_i} = [cpu_{MS_i}, gpu_{MS_i}, ram_{MS_i}]^T$ and that the input and output sizes of $MS_i$ are $ms_i^{in}$ and $ms_i^{out}$, respectively. All micro-services can be combined to generate $m$ composite services, represented by $S = [S_1, \dots, S_m]$, to meet users' demands.
  • Task: The execution of a fine-grained simulation service is referred to as a task. In the $j$-th type of service request $S_{ij}$ submitted by user $U_i$, the $l$-th task $t_{ij}^{l}$ has predecessor and successor tasks, represented by the sets $pred(t_{ij}^{l})$ and $succ(t_{ij}^{l})$, respectively. The initial and final tasks of the request are $t_{ij}^{I}$ and $t_{ij}^{F}$, with $pred(t_{ij}^{I}) = succ(t_{ij}^{F}) = U_i$. To relate tasks to their corresponding micro-services, we define $tms_i$ as the MS whose instances can process task $t_i$.
  • Server: Assume there are $k$ available edge servers $R = [R_1, \dots, R_k]$ for training simulation. The resource capacity of edge server $R_i$ is $Rc_i = [cpu_{R_i}, gpu_{R_i}, ram_{R_i}]^T$. Generally, each edge server is deployed at a BS. Assume that $q$ ($q \ge k$) BSs are available and that each BS can host at most one edge server. Once the mapping between edge servers and BSs is determined, each server's location can be described by a two-dimensional coordinate $L_{R_i} = (x_{R_i}, y_{R_i})$. Thus, the distances between the $k$ edge servers can be calculated as a matrix $A = [a_{ij}]_{k \times k}$.
  • User: Simulation users include live, virtual, and $h$ constructive simulation members, with a total of $u$ ($u > h$) users, so the user set can be represented as $U = [U_1, \dots, U_{u-h}, U_{u-h+1}, \dots, U_u]$. Without loss of generality, live and virtual users only make small movements around their original positions; we ignore these movements and treat their positions as fixed. For a constructive user, its location is that of the server hosting it. Thus, the location of user $U_i$ can be expressed as $L_{U_i} = (x_{U_i}, y_{U_i})$. As a kind of software resource, the computing resource demand of constructive member $C_i$ is defined as $r_{C_i} = [cpu_{C_i}, gpu_{C_i}, ram_{C_i}]^T$.
  • Deployment scheme: Resources to be deployed include edge servers, simulation services, and constructive members. Servers need to be deployed at BSs, and software resources need to be deployed on servers. Therefore, a deployment scheme $DS$ can be represented by a matrix, with the $i$-th column representing the deployment scheme for server $R_i$. The matrix is divided into three parts from top to bottom: $DS^{R}$, $DS^{S}$, and $DS^{C}$.
    The first part, $DS^{R} = [DS_1^{R}, \dots, DS_k^{R}]_{1 \times k}$, occupies one row of the matrix and indicates the mapping between edge servers and BSs. The value of its $i$-th element, $DS_i^{R}$, means that the $i$-th edge server is deployed on BS $DS_i^{R}$. Note that a BS can host at most one edge server, so $1 \le DS_i^{R}, DS_j^{R} \le q$ and $DS_i^{R} \ne DS_j^{R}$ for $i, j = 1, \dots, k$ and $i \ne j$.
    The second part, $DS^{MS} = [DS_1^{MS}, \dots, DS_k^{MS}]_{n \times k}$, is an $n$-row sub-matrix of $DS$, where $DS_i^{MS} = [DS_{1i}^{MS}, \dots, DS_{ni}^{MS}]^T$. $DS_{ji}^{MS}$ indicates the number of instances of the $j$-th type of micro-service deployed on server $R_i$. The number of instances is a non-negative integer, so $DS_{ji}^{MS} \in \mathbb{N}$ for $i = 1, \dots, k$ and $j = 1, \dots, n$, where $\mathbb{N}$ is the set of natural numbers. Note that, as a special simulation service, one and only one middleware instance is necessary for a training simulation activity. To reduce the interaction cost, we assume that it is deployed on an edge server rather than in the cloud. To distinguish it from other simulation services, we use $DS^{M} = [DS_1^{M}, \dots, DS_k^{M}]_{1 \times k}$, the first row of $DS^{MS}$, as its deployment decision. Due to the limitations mentioned above, $DS_i^{M}$ is a binary variable for $i = 1, \dots, k$, and $\sum_{i=1}^{k} DS_i^{M} = 1$.
    The third part, $DS^{C} = [DS_1^{C}, \dots, DS_k^{C}]_{h \times k}$, is a sub-matrix of $DS$ with $h$ rows, where $DS_i^{C} = [DS_{1i}^{C}, \dots, DS_{hi}^{C}]^T$. The binary variable $DS_{ji}^{C}$ indicates whether the $j$-th constructive member is deployed on edge server $R_i$. We assume that all constructive users are deployed at the edge instead of the cloud; thus $\sum_{i=1}^{k} DS_{ji}^{C} = 1$ for $j = 1, \dots, h$.
    Therefore, the deployment scheme decision $DS$ can be defined as
    $$DS = \begin{bmatrix} DS^{R} \\ DS^{S} \\ DS^{C} \end{bmatrix}_{(1+n+h) \times k} = \begin{bmatrix} DS^{R} \\ DS^{M} \\ DS^{S/M} \\ DS^{C} \end{bmatrix}_{(1+n+h) \times k} \qquad (1)$$
In addition to the previous definitions, we also define a few symbols for convenience.
  • $R(M)$: the edge server hosting the middleware;
  • $R_{U_i}(t_j)$: the edge server closest to user $U_i$ whose deployed instances can execute $t_j$;
  • $R(t_i)$: the edge server that executes task $t_i$, i.e., among the servers able to process $t_i$, the one with the shortest sum of distances to the servers executing the predecessor tasks of $t_i$;
  • $R(C_i)$: the edge server hosting constructive member $C_i$;
  • $R(U_i)$: the edge server closest to user $U_i$.
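As a concrete illustration of the encoding in Equation (1), the following sketch builds the decision matrix $DS$ for a hypothetical tiny instance; all sizes and placements below are invented for illustration:

```python
import numpy as np

# Hypothetical tiny instance: k=2 edge servers, q=3 base stations,
# n=3 micro-service types (row 0 of DS_MS is the middleware),
# h=2 constructive members.
k, q, n, h = 2, 3, 3, 2

DS_R = np.array([[1, 3]])             # server R1 -> BS 1, server R2 -> BS 3
DS_MS = np.array([[1, 0],             # middleware: exactly one instance overall
                  [2, 0],             # two instances of MS2 on R1
                  [0, 1]])            # one instance of MS3 on R2
DS_C = np.array([[1, 0],              # constructive member C1 on R1
                 [0, 1]])             # constructive member C2 on R2

# Stack the three parts into the (1+n+h) x k decision matrix DS.
DS = np.vstack([DS_R, DS_MS, DS_C])

# Feasibility checks mirroring the deployment rules of Section 4.4:
assert len(set(DS_R.ravel())) == k        # distinct BS for each server
assert DS_MS[0].sum() == 1                # one middleware instance in total
assert (DS_C.sum(axis=1) == 1).all()      # each member on exactly one server
print(DS.shape)  # (6, 2)
```

A genome of this shape is what the genetic operators of Section 5 manipulate, which is why the encoding makes the structural constraints easy to preserve by construction.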

4.2. Time Cost Model

4.2.1. Service Response Model

In terms of service response, we consider two factors: the network communication time, both between users and servers and between tasks, and the queuing time of tasks at their corresponding instances, calculated by an $M/M/c$ queuing model with $c \ge 1$.
  • Service communication model:
    Without loss of generality, we assume that a $j$-th type of service request $S_{ij}$ from user $U_i$ is always first scheduled to $R_{U_i}(t_{ij}^{I})$ to process the initial task $t_{ij}^{I}$, for $i = 1, \dots, u$ and $j = 1, \dots, m$. The data size sent from $U_i$ to $R_{U_i}(t_{ij}^{I})$ is $ms_{tms_{ij}^{I}}^{in}$. Subsequently, each task $t_{ij}^{k}$ after $t_{ij}^{I}$ is scheduled to $R(t_{ij}^{k})$ in topological order, and the data dependency size between this task and its predecessor task $t_{ij}^{p}$ is $ms_{pk}^{trs}$. When the service request finishes, the result, i.e., the output of $t_{ij}^{F}$, is sent back to user $U_i$.
    Therefore, the communication time caused by the service request $S_{ij}$ is
    $$t_{ij}^{C} = \frac{a_{in} \cdot a \cdot ms_{tms_{ij}^{I}}^{in}}{BW} + \sum_{k=2}^{N_{S_{ij}}} \sum_{t_{ij}^{p} \in pred(t_{ij}^{k})} \frac{a_{R(t_{ij}^{p}), R(t_{ij}^{k})} \cdot a \cdot ms_{pk}^{trs}}{BW} + \frac{a_{out} \cdot a \cdot ms_{tms_{ij}^{F}}^{out}}{BW} \qquad (2)$$
    where $a_{in} = a_{R(U_i), R_{U_i}(t_{ij}^{I})}$ and $a_{out} = a_{R_{U_i}(t_{ij}^{F}), R(U_i)}$. $N_{S_{ij}}$ denotes the number of tasks in this request, and $a$ and $BW$ are the network distance of one hop and the network bandwidth, respectively; in the uniformly configured network environment of training simulations, both are constants.
  • Service queuing model: If a task is scheduled to a busy instance, it will suffer a long queuing time. To simplify the problem without loss of generality, we assume that the service rate of all MS instances is identical and equal to $\mu$. Thus, the waiting time of task $t_k$ is calculated by the $M/M/c$ queuing model as
    $$t_{k}^{W} = \frac{\rho_k^{c_k + 1} \, p_k}{\lambda_k \, (c_k - 1)! \, (c_k - \rho_k)^2} \qquad (3)$$
    where $\lambda_k$ represents the arrival rate of tasks of the same type as $t_k$ and $\rho_k = \lambda_k / \mu$. $c_k \ge 1$ indicates the number of instances able to process $t_k$ that are deployed on the current server. Note that if a task is scheduled to the cloud rather than an edge server, we ignore its queuing time but account for the data transmission time. The parameter $p_k$ can be calculated as
    $$p_k = \left( \sum_{n=0}^{c_k - 1} \frac{\rho_k^{n}}{n!} + \frac{\rho_k^{c_k}}{c_k!} \cdot \frac{c_k}{c_k - \rho_k} \right)^{-1} \qquad (4)$$
    Therefore, the total queuing time of all tasks in the $j$-th type of service request from user $U_i$, $S_{ij}$, can be represented as
    $$t_{ij}^{W} = \sum_{k=1}^{N_{S_{ij}}} t_{ijk}^{W} \qquad (5)$$
Based on the proposed service communication model and service queuing model, the total time cost caused by users' service requests can be formulated as
$$t^{S} = \sum_{i=1}^{u} \sum_{j=1}^{m} d_{ij} \left( t_{ij}^{C} + t_{ij}^{W} \right) \qquad (6)$$
where $d_{ij}$ represents the demand of $U_i$ for service request $S_j$.
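The $M/M/c$ waiting-time expression above is the standard Erlang C result, and can be sketched directly in Python (variable names are ours; the numeric example is invented):

```python
from math import factorial

def mmc_wait_time(lam, mu, c):
    """Mean queuing delay of an M/M/c queue (the Erlang C result used above).
    lam: arrival rate of tasks of this type; mu: service rate of one instance;
    c: number of co-located instances able to process the task."""
    rho = lam / mu                       # offered load; stability needs rho < c
    assert rho < c, "unstable queue: require lam/mu < c"
    # probability that the system is empty
    p0 = 1.0 / (sum(rho**n / factorial(n) for n in range(c))
                + rho**c / factorial(c) * c / (c - rho))
    # mean waiting time in the queue
    return rho**(c + 1) * p0 / (lam * factorial(c - 1) * (c - rho)**2)

# With a single instance (c=1) at half load, this reduces to the M/M/1
# mean wait rho / (mu - lam).
print(mmc_wait_time(0.5, 1.0, 1))  # 1.0
```

Adding a second instance sharply reduces the wait at the same load, which is exactly the tension between the time-cost objective and the resource-usage objective.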

4.2.2. User Interaction Model

In training simulation, a user can subscribe to the messages of any other user on demand. If user $U_i$ subscribes to the messages of $U_j$, meaning that the results of service requests submitted by $U_j$ will be sent from $R(U_j)$ to $R(U_i)$, we define $Ita_{ij} = 1$. Thus, the time cost incurred by the subscription of $U_i$ to $U_j$'s messages due to the service request $S_l$ from $U_j$ is
$$t_{ijl}^{M} = \frac{a_{R(U_j), R(U_i)} \cdot a \cdot ms_{tms_{l}^{F}}^{out}}{BW} \qquad (7)$$
Therefore, the time cost of the subscription of $U_i$ to $U_j$ can be calculated as
$$t_{ij}^{M} = \sum_{l=1}^{m} t_{ijl}^{M} \, d_{jl} \qquad (8)$$
Furthermore, considering the subscription relationships between users, the interaction time cost is formulated as
$$t^{M} = \sum_{i=1}^{u} \sum_{j \ne i} \left( t_{ij}^{M} Ita_{ij} + t_{ji}^{M} Ita_{ji} \right) \qquad (9)$$
In summary, the first objective function, the total time cost due to data transmission within each service request from all users and to interactions between users, can be represented as
$$f_1(DS) = t^{S} + t^{M} \qquad (10)$$

4.3. Resource Occupancy Rate Model

The software resources that may be deployed on a server include simulation services and constructive members, whose resource demands are represented by $r_{MS}$ and $r_{C}$, respectively. Once a deployment scheme is determined, the resources occupied on server $R_i$ can be modeled as a 3D column vector:
$$B_i = \sum_{j=1}^{n} DS_{ji}^{MS} \, r_{MS_j} + \sum_{j=1}^{h} DS_{ji}^{C} \, r_{C_j} \qquad (11)$$
Server $R_i$'s resource occupancy rate is then $\max(B_i / Rc_i)$, where the maximum is taken over the three resource dimensions. The second objective function, the resource usage rate of all edge servers, can be formulated as
$$f_2(DS) = \max_{i = 1, \dots, k} \left\{ \max \left( B_i / Rc_i \right) \right\} \qquad (12)$$
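The occupied-resources vector and the resource usage objective above reduce to two matrix products; the following numpy sketch uses invented demands and capacities (rows are cpu, gpu, ram):

```python
import numpy as np

# Invented instance: n=2 service types, h=2 constructive members, k=2 servers.
r_MS = np.array([[2.0, 1.0],
                 [1.0, 0.5],
                 [4.0, 2.0]])   # shape (3, n): column j is r_{MS_j}
r_C = np.array([[1.0, 1.0],
                [0.0, 0.5],
                [2.0, 2.0]])    # shape (3, h): column j is r_{C_j}
Rc = np.array([[8.0, 8.0],
               [4.0, 4.0],
               [16.0, 16.0]])   # shape (3, k): column i is R_i's capacity

DS_MS = np.array([[1, 0],
                  [2, 1]])      # instance counts, shape (n, k)
DS_C = np.array([[1, 0],
                 [0, 1]])       # member placement, shape (h, k)

B = r_MS @ DS_MS + r_C @ DS_C   # occupied resources per server, shape (3, k)
usage = (B / Rc).max(axis=0)    # per-server occupancy rate over cpu/gpu/ram
f2 = usage.max()                # worst occupancy over all servers
print(f2)  # 0.625
```

The capacity constraint of Section 4.4 is simply `usage < 1` for every server.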

4.4. Problem Formulation

The objective of deploying SRR jointly is to minimize the overall time cost of the training simulation system while minimizing the resource occupancy rate of edge servers. At the same time, constraints on the resource capacities of edge servers, deployment rules, and variable properties must be satisfied. The joint deployment problem $P$, which has been proven to be NP-hard [13], is formally defined as follows:
$$\min f(DS) = (f_1(DS), f_2(DS)) \qquad (13)$$
$$\text{s.t.} \quad \max(B_i / Rc_i) < 1, \quad \forall R_i \in R \qquad (14)$$
$$(DS_i^{R} - DS_j^{R})^2 > 0, \quad i, j = 1, \dots, k \ \text{and} \ i \ne j \qquad (15)$$
$$\sum_{i=1}^{k} DS_i^{M} = 1 \qquad (16)$$
$$\sum_{j=1}^{k} DS_{ij}^{C} = 1, \quad \forall C_i \in C \qquad (17)$$
$$DS_i^{R} \in \{1, \dots, q\}, \quad i = 1, \dots, k \qquad (18)$$
$$DS_i^{M} \in \{0, 1\}, \quad i = 1, \dots, k \qquad (19)$$
$$DS_{ij}^{S} \in \mathbb{N}, \quad i = 1, \dots, n \ \text{and} \ j = 1, \dots, k \qquad (20)$$
$$DS_{ij}^{C} \in \{0, 1\}, \quad i = 1, \dots, h \ \text{and} \ j = 1, \dots, k \qquad (21)$$
The constraints are as follows: Equation (14) means that the software resources deployed on an edge server cannot exceed the server's capacity. Equation (15) indicates that one BS can host at most one edge server, that is, the BSs where two different edge servers are deployed must differ. Equations (16) and (17), respectively, specify that the middleware service and each constructive member must be deployed on exactly one edge server. Equations (18)–(21) specify the ranges of the decision variables, which must all be integers.
Obviously, $P$ is a nonlinear integer optimization problem that requires finding Pareto optimal solutions for the training simulation system's time cost and the edge servers' resource occupancy while not violating the constraints in Equations (14)–(21).

5. Simulation Run-Time Resources Deployment Algorithm

In this section, we elaborate on our learnable genetic algorithm for simulation run-time resource deployment (LGASRD), built on NSGA-II, for solving $P$. First, we present the framework of LGASRD and analyze its differences from NSGA-II. Then, based on the encoding method in Equation (1), we introduce several novel crossover and mutation operators and let LGASRD adaptively choose the operators that perform well. The details are presented as follows.

5.1. The LGASRD Algorithm

LGASRD is implemented based on the well-known Non-dominated Sorting Genetic Algorithm II (NSGA-II). As an evolutionary algorithm, NSGA-II requires many iterations before termination, has a poor ability to escape from local optima, and cannot guarantee feasible solutions for constrained optimization problems. Existing improvements have not completely solved these issues; for example, accelerating convergence often sacrifices population diversity, which can prevent the algorithm from finding good solutions. We therefore introduce elite knowledge and operator performance knowledge to speed up convergence without excessive loss of population diversity. A constraint handling mechanism is also designed to guarantee the feasibility of the obtained solutions.
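The adaptive use of operator performance knowledge can be sketched as roulette-wheel selection over per-operator success scores; the concrete credit-assignment rule below is an assumption for illustration, not the paper's exact mechanism:

```python
import random

def pick_operator(ops, scores):
    """Roulette-wheel selection over per-operator success scores: operators
    that recently produced better offspring are chosen more often."""
    total = sum(scores[o] for o in ops)
    r, acc = random.uniform(0, total), 0.0
    for o in ops:
        acc += scores[o]
        if r <= acc:
            return o
    return ops[-1]

# An operator holding all the credit is always picked (operator names invented).
print(pick_operator(["cx_uniform", "cx_column"],
                    {"cx_uniform": 1.0, "cx_column": 0.0}))  # cx_uniform
```

After each generation, the scores would be updated from the quality of the offspring each operator produced, so the selection pressure adapts as the search proceeds.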

5.1.1. Constraint Handling Mechanism

As shown in Equation (1), this encoding method makes it easy to ensure that some constraints are satisfied by reasonable crossover and mutation operators, which mainly restrict the properties of variables and the relationships between variables, as formulated in Equations (15)–(21). However, it is difficult to satisfy the server resource occupancy constraints, e.g., Equation (14), by intuition. In some cases where server resources are limited, we must adopt strategies to ensure that the solution obtained is feasible.
To address this issue and preserve the diversity of the population as much as possible, we have divided the algorithm into two stages. In the first stage, we transfer the two objectives into one by weighting them using ω 1 and ω 2 and apply a penalty function to penalize solutions that violate the resource capacity constraint. To enhance the effectiveness of this mechanism, the penalty coefficient M increases exponentially with the iteration of the algorithm. Therefore, the objective function in the first stage is
f(DS) = ω1·f1(DS) + ω2·f2(DS) + M·max{0, f2(DS) − 1}        (22)
In the second stage, we consider both f 1 ( D S ) and f 2 ( D S ) . To enhance the algorithm’s exploratory ability, we apply a parameter ϵ to relax the resource occupancy rate constraint and make ϵ rapidly approach 0 with each iteration of the algorithm. This ensures the feasibility of the solution obtained.
To ensure the availability of feasible solutions, our basic idea is to accept some unfeasible solutions in the early stages of the algorithm to increase the diversity of individuals in the population, while focusing more on the feasibility of solutions as the algorithm iterates.
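The two-stage mechanism above can be sketched as follows (a minimal Python sketch; the weights, initial penalty coefficient, and growth/decay schedules are illustrative assumptions, not the paper's tuned values):

```python
def penalized_fitness(f1, f2, w1, w2, M):
    """Stage-1 scalar objective (Eq. (22)): weighted sum plus a penalty
    whenever the resource occupancy rate f2 exceeds 1."""
    return w1 * f1 + w2 * f2 + M * max(0.0, f2 - 1.0)

def update_penalty_and_relaxation(generation, M0=10.0, growth=1.05,
                                  eps0=0.3, decay=0.9):
    """Penalty coefficient M grows exponentially with the generation count,
    while the stage-2 relaxation epsilon rapidly approaches 0."""
    M = M0 * growth ** generation
    eps = eps0 * decay ** generation
    return M, eps

def is_feasible(f2, eps=0.0):
    """Stage-2 feasibility: occupancy may exceed capacity by at most eps."""
    return f2 <= 1.0 + eps
```

Early generations thus tolerate mildly infeasible individuals for diversity, while the growing M and shrinking ε force feasibility as iterations proceed.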

5.1.2. Elite Knowledge

An elitist is the individual in a population with the best fitness value, i.e., the optimal objective function value, from the start of evolution until now; it possesses the best genetic structure and excellent traits. To prevent elitists in the current population from being destroyed or lost during subsequent iterative optimization, an elite retention strategy should be adopted, which adds an elitist from each generation to the Elite Population (EP). Rudolph has theoretically proven that the standard genetic algorithm with elite retention is globally convergent [34]. We therefore use the individuals in EP, called elite knowledge, to guide the subsequent optimization. Using elite knowledge involves two aspects: updating EP and migrating elitists.
The size of EP is E , and the top E individuals from the first generation population can be selected to form the first generation EP. Starting from the second-generation population, if there is an individual A in the current population that is superior to one of the individuals in EP, we replace this elitist with A. In the multi-objective stage, we consider that individual A is superior to individual B if one of the following three conditions is met:
  • Condition 1 (Equation (23)):
    f1(A) ≤ f1(B) ∧ f2(A) ≤ f2(B) ∧ A ∉ EP
  • Condition 2 (Equation (24)):
    f1(A) < f1(B) ∧ f2(A) ≥ f2(B) ∧ (f1(B) − f1(A))/f1(B) > (f2(A) − f2(B))/f2(B) ∧ randp < 1 − [(f2(A) − f2(B))/f2(B)] / [(f1(B) − f1(A))/f1(B)]
  • Condition 3 (Equation (25)):
    f1(A) ≥ f1(B) ∧ f2(A) < f2(B) ∧ (f1(A) − f1(B))/f1(B) < (f2(B) − f2(A))/f2(B) ∧ randp < 1 − [(f1(A) − f1(B))/f1(B)] / [(f2(B) − f2(A))/f2(B)]
where randp is a randomly generated probability. Equation (23) represents the case where A dominates B and no individual identical to A is already in EP. Equations (24) and (25) mean that although A does not perform well in terms of f1 or f2, it performs exceptionally well on the other objective function. Take Equation (24) as an example: although A is worse (larger) than B in terms of f2, it is better (smaller) than B in terms of f1, and the degree to which A improves on B for f1 is greater than the degree to which it degrades relative to B for f2.
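The three acceptance conditions (Equations (23)–(25)) can be sketched as a single predicate (a Python sketch; the tuple-based representation of individuals and the helper names are assumptions):

```python
import random

def superior(fA, fB, ep, randp=None):
    """Decide whether individual A (objectives fA = (f1, f2)) should replace
    B (fB) in the elite population ep, following Conditions 1-3."""
    f1A, f2A = fA
    f1B, f2B = fB
    if randp is None:
        randp = random.random()
    # Condition 1: A dominates B and no identical individual is in EP.
    if f1A <= f1B and f2A <= f2B and fA not in ep:
        return True
    # Condition 2: A improves much more on f1 than it degrades on f2.
    if f1A < f1B and f2A >= f2B:
        gain = (f1B - f1A) / f1B   # relative improvement on f1
        loss = (f2A - f2B) / f2B   # relative degradation on f2
        if gain > loss and randp < 1 - loss / gain:
            return True
    # Condition 3: the symmetric case for f2.
    if f2A < f2B and f1A >= f1B:
        gain = (f2B - f2A) / f2B
        loss = (f1A - f1B) / f1B
        if gain > loss and randp < 1 - loss / gain:
            return True
    return False
```

The random threshold makes the probabilistic acceptance stricter as the degradation-to-improvement ratio grows.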
In addition to trying to replace the inferior individuals in EP with outstanding ones in the current population to ensure the excellence of EP, we hope to use EP to guide the subsequent optimization process, e.g., elite migration. Every ten rounds of iteration in the second stage, we randomly select an elitist located in the top Pareto rank of EP and replace one individual in the lowest Pareto rank in the current population with it. This approach can bring the population closer to EP to some extent and accelerate the algorithm’s convergence speed without too much loss of diversity within the population.

5.1.3. Operator Performance Knowledge (OPK)

In the genetic algorithm, three operations are necessary: parent selection, crossover, and mutation. For a particular problem, using different operators to perform these operations can produce different outcomes. In standard NSGA-II [35], the operators performing these three operations are fixed, so the algorithm can execute them in only one way, leaving it without adaptive or learning ability. To improve this, we introduce OPK into NSGA-II, allowing the algorithm to record the performance of the operators for each operation. The higher an operator's historical performance, the greater the probability that it will be selected for the corresponding operation in a given iteration.
We design two crossover operators and three mutation operators (detailed later); thus, we obtain six different genetic methods. During each iteration, the algorithm records which crossover and mutation operator each individual used. If a new individual A dominates the corresponding old one B, that is, f1(new) ≤ f1(old) and f2(new) ≤ f2(old), or A performs exceptionally better than B in terms of one objective function, as represented by Equation (24) or Equation (25), the performance of the crossover and mutation operators used is increased by one. Take mutation as an example: if the current mutation succeeds through the j-th mutation operator, its performance becomes PK(j) = PK(j) + 1. When selecting an operator in the next iteration, the probability that the j-th operator is selected is
P(j) = PK(j) / Σ_{i=1}^{3} PK(i)        (26)
where P(j) represents the probability that the j-th operator is selected. As shown in Table 2, in the first iteration of our algorithm, the performance of each mutation operator is initialized to 1, so each operator is selected with probability 33.3% according to Equation (26). If the second operator is selected during the first mutation operation and the mutated individual is better than the original one, its historical performance increases to 2, and the selection probabilities become 25%, 50%, and 25%. In the second mutation, if the first operator is chosen but the mutated individual does not improve, the performance and selection probabilities of all operators remain unchanged.
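The probability update of Equation (26) and the Table 2 walkthrough above can be sketched as follows (hypothetical helper names; operator keys are arbitrary labels):

```python
import random

def selection_probs(pk):
    """Normalize historical performance PK into selection probabilities (Eq. (26))."""
    total = sum(pk.values())
    return {op: v / total for op, v in pk.items()}

def pick_operator(pk, rng=random.random):
    """Roulette-wheel choice over operators weighted by PK."""
    r = rng() * sum(pk.values())
    acc = 0.0
    for op, v in pk.items():
        acc += v
        if r <= acc:
            return op
    return op

# Walkthrough matching the text: three mutation operators start at PK = 1 each.
pk = {1: 1, 2: 1, 3: 1}      # each selected with probability 1/3
pk[2] += 1                   # operator 2 succeeds once -> PK(2) = 2
# selection probabilities are now 25% / 50% / 25%
```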

5.1.4. The LGASRD Framework

Based on the well-known NSGA-II and the aforementioned operations, we obtain the proposed LGASRD in Algorithm 1. Firstly, we randomly generate the initial population and select the E best-performing individuals to create EP based on their fitness values (lines 3–5). After that, LGASRD starts the first phase of iteration. At the beginning of each iteration, LGASRD executes the three genetic operations to create the new population; note that it uses the roulette wheel strategy for parent selection (line 7). Next, the fitness values of the new population are calculated, and OPK and EP are updated (lines 8–10). At the final step of this phase, some parameters are updated (line 11). After the first stage, the two objective functions of the current population are calculated and a new EP is created (lines 14–15). Compared to the first stage, the second stage of the algorithm differs in several respects:
  • LGASRD uses tournament strategy for parent selection in the second stage (line 17);
  • As the number of objective functions increases, some evaluation criteria become more complex, such as how to update OPK and EP (line 19);
  • Elite migration strategy is introduced (lines 25–27);
  • LGASRD stops iterating after multiple failed attempts to update EP (lines 28–30).

5.2. Crossover Operators

We design two crossover operators, which can be understood as vertical and horizontal, combined with the encoding method shown in Equation (1).
The first operator exchanges the three parts given in Equation (1) between the two parent individuals being crossed. Note that the two parts related to micro-services and constructive members cannot be exchanged in full, or the crossover would be ineffective; instead, we randomly select a range and exchange only the parts within it.
The second approach involves randomly selecting a server range and swapping the deployment schemes of servers within the range in the two parent chromosomes. However, this crossover operation may result in the newly obtained individuals violating certain constraints, such as Equations (14)–(17). Therefore, some repair mechanisms are introduced to ensure the feasibility of the resulting solutions.
These two crossover operators enable the exchange of chromosome segments between the two parent individuals in different ways, allowing LGASRD to adaptively select the crossover approach with better historical performance rather than using only one operator. As a result, information exchange between two parent individuals is more efficient.
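The second, range-based crossover might be sketched as follows (an illustrative Python sketch assuming a chromosome modeled as a list of per-server deployment entries; the constraint-repair step is omitted):

```python
import random

def range_swap_crossover(parent_a, parent_b, cut=None):
    """Randomly choose a server range [i, j] and swap the per-server
    deployment schemes between the two parents. A repair step (omitted
    here) would restore constraints such as Eqs. (14)-(17). `cut` can fix
    the range explicitly for reproducibility."""
    n = len(parent_a)
    i, j = cut if cut is not None else sorted(random.sample(range(n), 2))
    child_a = parent_a[:i] + parent_b[i:j + 1] + parent_a[j + 1:]
    child_b = parent_b[:i] + parent_a[i:j + 1] + parent_b[j + 1:]
    return child_a, child_b
```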

5.3. Mutation Operators

We design three mutation operators, which can be understood as random, heuristic, and elite-guided.
The first approach is random. It randomly selects two servers within the mutated chromosome, swapping the BSs deploying them, and randomly changing the locations of the middleware and constructive members. As for micro-services, it randomly selects a server and decides whether to increase or decrease the number of micro-services deployed on it based on its resource usage rate, i.e., the second optimization objective. The purpose of designing this operator is to enhance the randomness of our algorithm, thereby increasing its ability to escape from local optimal solutions.
Algorithm 1 Learnable GA-based Simulation Resources Deployment (LGASRD)
Require: Set of micro-services MS; resource demand of micro-services r_MS; set of composite services S; set of edge servers R; resource capacity of edge servers RC; location of base stations L_BS; set of users U; resource demand of constructive members RC; location of live and virtual members L_LV; user demand for composite services d; counts of interactions among users Ita
Ensure: Deployment schemes DS and EP
 1: g ← 1
 2: FailCount ← 0
 3: Randomly generate P individuals DS_1, …, DS_P to form the population DS
 4: Calculate f(DS) according to Equation (22)
 5: Select the top E individuals in DS to form the first generation EP
 6: while in the single-objective stage do
 7:     Perform parent selection, crossover, and mutation operations to generate the next generation DS according to OPK
 8:     Calculate f(DS)
 9:     Update OPK based on f(DS)
10:     Update EP based on DS and f(DS)
11:     Update parameters including M, ε, and P_m
12:     g ← g + 1
13: end while
14: Calculate f1(DS) and f2(DS) according to Equations (10)–(12)
15: Select the top E individuals in DS based on f1(DS), f2(DS), and their Pareto rank and crowding distance to form EP in the second stage
16: while in the multi-objective stage do
17:     Perform parent selection, crossover, and mutation operations to generate the next generation DS according to OPK
18:     Calculate f1(DS) and f2(DS) according to Equations (10)–(12)
19:     Update EP and OPK according to Equations (23)–(25)
20:     if fail to update EP then
21:         FailCount ← FailCount + 1
22:     else
23:         FailCount ← 0
24:     end if
25:     if g mod Replace_Freq = 0 then
26:         Replace one of the worst individuals in DS with one of the best individuals in EP
27:     end if
28:     if FailCount = G × Stop_Ratio then
29:         break
30:     end if
31:     Update parameters including M, ε, and P_m
32:     g ← g + 1
33: end while
34: return DS, EP
The second mutation operator is heuristic. It defines a weighted distance between the live and virtual users and a BS, summing the physical distances between these users and the BS weighted by the size of the communication data each user generates. The BS currently hosting a server that is farthest from these users can then be identified and replaced with a closer, idle BS. Additionally, resources permitting, the middleware and as many constructive members as possible are deployed on the BS with the smallest weighted distance. Evidently, this heuristic operator primarily aims to reduce the communication time caused by user interactions. We have also devised similar operators targeting other objectives, including service queuing time, service communication time, and resource occupancy rate, which will not be elaborated further. These operators exploit heuristic information to let our algorithm quickly find reasonable solutions; in other words, while guaranteeing a lower bound on solution quality, they greatly speed up the convergence rate of LGASRD.
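The weighted-distance computation driving this heuristic can be sketched as follows (field names and the data layout are assumptions for illustration):

```python
import math

def weighted_distance(bs, users):
    """Weighted distance between a base station and the live/virtual users:
    physical distances summed with each user's communication data size as
    the weight."""
    return sum(u["data_size"] * math.dist(bs["loc"], u["loc"]) for u in users)

def farthest_and_closest_bs(deployed, idle, users):
    """Find the deployed BS farthest from the users and the idle BS closest
    to them; the heuristic mutation then moves that server to the closer BS."""
    farthest = max(deployed, key=lambda b: weighted_distance(b, users))
    closest = min(idle, key=lambda b: weighted_distance(b, users))
    return farthest, closest
```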
The third approach utilizes the information in EP, allowing the current mutated individual to become either more or less similar to one elitist. If the individual's crowding distance is not equal to 0, it learns from the elitist and becomes more similar to it; for example, if the elitist deploys the middleware on server A, the mutated individual deploys it on B, the server closest to A. If the individual's crowding distance equals 0, we want it to be unique, that is, different from the other individuals in the current population. From an algorithmic perspective, the entire population tends toward EP, so we can distance the current mutated individual from the others by reducing its similarity to the imitated elitist. If the mutated individual overlaps with its adjacent individuals (solutions), this mutation operator increases population diversity; otherwise, as the individual approaches the imitated elitist, it may obtain better objective function values.
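A minimal sketch of this EP-guided mutation, assuming a dict-based individual and a precomputed neighbor table (both representations are ours, not the paper's):

```python
import random

def mutate_toward_elitist(individual, elitist, crowding_distance, neighbors,
                          rng=random.Random(1)):
    """`individual` and `elitist` hold a 'middleware' server id;
    `neighbors[s]` lists the other servers ordered by distance to s.
    Nonzero crowding distance: learn from the elitist by deploying the
    middleware on the server closest to the elitist's choice. Zero crowding
    distance: the individual overlaps its neighbors, so move away to a
    random other server to restore diversity."""
    child = dict(individual)
    target = elitist["middleware"]
    if crowding_distance != 0:
        child["middleware"] = neighbors[target][0]  # closest server to target
    else:
        child["middleware"] = rng.choice([s for s in neighbors if s != target])
    return child
```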

6. Experiment Evaluation

In this section, we report evaluation results using real-world data sets to verify the superiority of the proposed LGASRD. We first introduce the benchmark policies for evaluation. Then, we describe the data set and experimental settings. Finally, the detailed experiment results of LGASRD with respect to the compared methods are presented and discussed.

6.1. Benchmark Policies

Referring to the experimental evaluation in [8,10,13,35], we provide the following five representative baselines for comparative studies.
  • K-Means Clustering + NSGA-II (KMNSGA): This strategy first deploys edge servers on k BSs using the K-Means clustering algorithm. On this basis, it uses NSGA-II to determine how other software resources should be deployed.
  • NSGA-II: This approach is to directly use NSGA-II to solve the deployment problem for SRR we are facing.
  • K-Means Clustering + Top-k Service Placement (TServiceD): This algorithm first deploys k edge servers using the K-Means clustering algorithm. Then, it greedily deploys micro-services by first deploying the MS with the highest user demand, then the service with the second highest user demand, and so on. Finally, it randomly deploys the simulation middleware and constructive members.
  • Top-V Server Deployment + Top-V Service Deployment (TSD 2 ): This method first selects the k most heavily used BSs to deploy edge servers. Then, it greedily deploys micro-services by prioritizing the MS with the highest user demand, followed by the second highest, and so on. Finally, like TServiceD, it also randomly deploys the simulation middleware and constructive members.
  • Random: This policy randomly determines the deployment scheme while ensuring its feasibility.
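The K-Means server-placement step shared by KMNSGA and TServiceD might look roughly like this (a simplified stand-in with a hand-rolled k-means over BS coordinates, not the exact baseline implementation):

```python
import random

def kmeans_bs_placement(bs_locations, k, iters=20, rng=random.Random(0)):
    """Cluster BS coordinates with k-means and deploy one edge server at
    the BS nearest each centroid."""
    centroids = rng.sample(bs_locations, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in bs_locations:
            i = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                            + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # recompute centroids; keep the old one if a cluster is empty
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return [min(bs_locations, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
            for cx, cy in centroids]
```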

6.2. Data Sets

In our evaluation, we utilized the EUA data set [15], which was derived from the Wireless Communications Licensing data set published by the Australian Communications and Media Authority (ACMA). This data set contains detailed information about the geographic locations of all cellular base stations in Australia, including their respective latitudes and longitudes. We randomly selected q base stations for use in our experiment, which we referred to as base stations M. To model user distribution, we drew upon the distribution generated by the EUA data set, which comprised ( u h ) users.
To simulate the business logic relationships between multiple tasks in a service request, we used five workflows, provided by [16], that have been widely used in related research: Montage, Cybershake, Epigenomics, LIGO, SIPHT. Each workflow has a certain number of tasks, and each task can only be executed by containers that encapsulate the corresponding MS. The number of different types of micro-services included in each workflow is shown in Table 3.

6.3. Experimental Settings

All the experiments are implemented in Python 3.10.9 on a desktop computer equipped with a 2.30 GHz Intel Core i7-11800H CPU and 16 GB RAM (The source code is available by accessing https://github.com/ZkofZhang/An-Approach-for-Deployment-of-Service-Oriented-Simulation-Run-time-Resources (accessed on 1 September 2023)). The parameter settings are discussed as follows.
We conducted experiments with various scales, involving scale parameters such as the number of tasks in a workflow (Nt), the number of micro-services (Nms), the number of available edge servers (Nes), and the number of users (Nu), i.e., simulation members, and the number of available BSs that was fixed to 50. Users can be divided into live, virtual, and constructive users. In our experiments, we do not distinguish between live and virtual users and use normal users in the EUA data set to represent them. Moreover, for simplicity, we assume that the number of these two types of users is the same as that of constructive users. To facilitate description, we summarized the scale parameter settings for each set of experiments, as shown in Table 4. Note that the middleware is a type of MS that exists in all settings, so it does not appear in the table.
In our experiments, the resources required by each MS are directly read from the corresponding XML file. In the service-oriented training simulation, all members rely on requesting services to achieve their desired functions, and service execution is not performed locally for constructive members. Therefore, they do not require a large amount of CPU resources for computation; on the contrary, they need a certain amount of memory to store large volumes of data. Taking this into account, we assume that each constructive member needs only one unit of CPU resource and no GPU resources, while the memory it requires is randomly generated between 0.5 and 1.5 times the average memory required by all micro-services. Based on these settings, we can obtain the total amount of resources consumed in the simulation system and set the resource capacity of each edge server to between 0.05 and 0.75 times this total, reflecting the heterogeneity among servers. At the same time, we set each member's demand for each type of service request to a random integer between 0 and 20. During our experiments, we also adopted an extreme scenario in which all members needed to interact with each other by sending messages.
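The randomized settings described above can be sketched as follows (function and field names are ours; the value ranges follow the text):

```python
import random

def generate_member_demands(ms_memory, n_members, rng=random.Random(42)):
    """Each constructive member needs 1 CPU unit, 0 GPU units, and a memory
    demand uniform in [0.5, 1.5] x the average MS memory requirement; its
    demand for each service request is a random integer in [0, 20]."""
    avg_mem = sum(ms_memory) / len(ms_memory)
    members = [
        {"cpu": 1, "gpu": 0, "mem": rng.uniform(0.5 * avg_mem, 1.5 * avg_mem)}
        for _ in range(n_members)
    ]
    demands = [rng.randint(0, 20) for _ in range(n_members)]
    return members, demands

def server_capacities(total_resources, n_servers, rng=random.Random(42)):
    """Heterogeneous capacities: each edge server gets between 0.05 and
    0.75 times the total resources consumed in the simulation system."""
    return [rng.uniform(0.05, 0.75) * total_resources for _ in range(n_servers)]
```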
In LGASRD, we set the population size P to 50 and the maximum number of iterations G to 250. Every 5 generations, we select an elitist to replace one individual in the lowest Pareto rank of the current population. If EP fails to be updated for 0.15G consecutive generations, LGASRD terminates.

6.4. Experiment Results

We conducted extensive experiments to demonstrate the advantages of LGASRD over comparison methods. Firstly, we presented the results of six algorithms under settings A to L to illustrate that LGASRD has better performance than others over multiple problem scales. Then, we reduced the resources provided by edge servers to verify the effectiveness of the constraint-handling mechanism in LGASRD. Under one experimental setup, each algorithm was run 20 times, and we took the average of these 20 runs.

6.4.1. Optimality and Scalability

Figure 3, Figure 4 and Figure 5 present the two objective function values of solutions in the highest Pareto rank generated by the mentioned six algorithms in the upper half when the number of micro-services is 10, 35, and 40, respectively. They also illustrate the iteration process of three evolutionary algorithms, including KMNSGA, NSGA-II, and LGASRD, in the lower half. Due to the fact that some parameters related to resources and user demand were randomly generated for each run, TServiceD and TSD2 obtained different results even though they have specific deployment rules. In these figures, we present the results of LGASRD in two forms: the normal population and the elite population, represented by red and green stars, respectively. Note that since the solutions obtained by LGASRD are too dense, they appear as a single point in the figures. Therefore, we enlarged the points representing the results calculated by LGASRD and placed them in a black box in the blank area.
As can be observed from Figure 3, Figure 4 and Figure 5 and Table 5, LGASRD exhibits the best overall performance among the six algorithms tested. In the experiments conducted with settings A, C, and D, LGASRD achieved the minimum values on both objective functions. This fully demonstrates the effectiveness of the strategies we used, such as elite knowledge and OPK, in enhancing the performance of NSGA-II. Under the other experimental settings, LGASRD always obtained solutions that were not dominated by those of other algorithms, with the minimum value for at least one objective function.
Under these experimental settings, the other two evolutionary algorithms, KMNSGA and NSGA-II, as well as TServiceD, also performed well and sometimes obtained smaller values on one objective function than LGASRD. The reason may be that, as the scale of the experiments increased, LGASRD reached the termination condition before fully converging and thus failed to obtain better solutions. It can also be observed that Random, TServiceD, and TSD2 often produce infeasible solutions. This may be because they fail to consider the resources required by the simulation middleware and constructive members when applying heuristic deployment rules for micro-services, thus directing a large share of resources toward micro-service deployment.
Under three experimental settings where the number of micro-services is consistent, i.e., experiments presented in Figure 3, Figure 4 and Figure 5, as the number of edge servers and simulation members increases, the time cost within the system also increases. This is normal, as an increase in the number of elements within the simulation system will inevitably result in an increase in the scale and frequency of requests for services and data transmission.
In order to explore this in more detail, we summarize how the time costs of the Pareto optimal solutions obtained by the three evolutionary algorithms change as the scale of our experiment varies, as shown in Table 6. The experimental scale here refers mainly to the product of the number of users and edge servers, as well as the number of micro-services, presented in the left and right halves of Table 6, respectively. To provide a visual representation, we give only the ratios of time costs under different scales rather than specific numerical values. We are interested only in the impact of experiment scale on time cost, not in comparing the algorithms, so the results for each algorithm are independent. Due to the limited data, it is difficult to quantitatively describe how time cost grows as the experimental scale increases. However, it is evident that the increase in time cost caused by a larger number of micro-services is significant, and the underlying reasons still need to be explored. By contrast, the impact of the number of edge servers and users on time cost is relatively small, although when the number of edge servers and simulation users increases, the increment in time cost still exceeds the increase in experimental scale.
From the lower part of Figure 3, Figure 4 and Figure 5, it can be observed that, unlike KMNSGA and NSGA-II, LGASRD exhibits significant fluctuations in its iteration curve. After reaching a local minimum, LGASRD often continues to reduce the objective function value, guided by EP and OPK. This indicates that LGASRD has a stronger ability to escape local optimal solutions than the other two evolutionary algorithms, which is one of the main reasons for its superior performance. Furthermore, compared to KMNSGA and NSGA-II, LGASRD also converges faster: when EP fails to update for multiple consecutive generations, the algorithm terminates even if the normal population is still improving.
We also conducted experiments with settings J, K, and L, and the results are shown in Figure 6. It can be observed that LGASRD still exhibits excellent optimization performance, with its solutions dominating those generated by other comparison algorithms, except for slightly higher time cost compared to the solutions produced by TServiceD and TSD 2 in Figure 6b. Furthermore, if we compare the numerical values of time cost in Figure 3 and Figure 6, we will find that if other settings are the same, the granularity of service requests does not have a significant impact on the magnitude of the training simulation system’s time cost. The granularity of a service request refers to the number of micro-services involved in it, i.e., the number of tasks in a workflow application.

6.4.2. Effectiveness of Constraint Handling Policy

The SRR deployment problem is a constrained optimization problem. When solving it, certain rules can be set to ensure that the obtained solution does not violate the constraints. However, this approach is ineffective for constraints like the one shown in Equation (14), so additional constraint-handling methods, already discussed in Section 5.1.1, need to be introduced into our algorithm. To validate the effectiveness of the constraint-handling approach used in LGASRD, we conducted experiments using some of the previous experimental settings, but reduced the resource capacity of the edge servers to one-tenth of the original to simulate a resource-constrained scenario. The previous results showed better performance for the three evolutionary algorithms, while Random, TServiceD, and TSD2 performed poorly and generated infeasible solutions under certain settings; therefore, we do not discuss them further in this section.
As shown in Figure 7, in most cases, KMNSGA and NSGA-II can only generate a fraction of feasible solutions, or even produce entirely infeasible solutions, as illustrated in Figure 7a. In contrast, LGASRD is capable of producing feasible solutions in all scenarios, with the minimal time cost among the three algorithms. This clearly demonstrates the effectiveness and significance of the constraint-handling mechanism we employed.
In Figure 8, we present a comprehensive depiction of the time costs of the solutions obtained by the three algorithms in resource-constrained scenarios. The training simulation system's time cost is the sum of service communication time, service queuing time, and user interaction time. We can observe that the time spent queuing at MS instances is relatively small compared to the data transmission time incurred by the system's response to user-initiated service requests. Moreover, these two types of time cost do not increase significantly with the granularity of service requests, the number of servers, or the number of simulation users. In contrast, the time resulting from user interaction increases significantly with the number of edge servers and users. The reason is clear: if micro-services are deployed properly, users may experience low service latency; however, as the number of simulation members increases, the frequency of interactions between users inevitably rises, resulting in significant growth in user interaction time.

7. Conclusions

This paper investigates the deployment problem of SRR in the cloud-edge collaborative computing architecture, effectively improving the user experience during training simulation activities. Firstly, we formulate this problem as a multi-objective optimization problem, aiming to simultaneously minimize the time cost of the training simulation system and the resource occupancy rate of edge servers under various constraints. It is worth noting that we adopt a combinatorial service modeling approach, which is unique and valuable. Subsequently, we propose LGASRD to solve this problem. The algorithm is a two-stage approach that incorporates elite knowledge and operator performance knowledge, enabling it to adaptively learn and find better solutions at a faster pace. Extensive experiments based on real-world and widely used data sets demonstrate the superiority of LGASRD compared to the benchmark algorithms across multiple evaluation metrics. In future work, we plan to validate our algorithm in training simulation prototype systems, and explore more complex and coupled research by integrating resource deployment with task scheduling.

Author Contributions

Conceptualization, Z.Z. and Y.P.; methodology, Z.Z. and M.Z.; validation, Z.Z.; investigation, Z.Z.; data curation, Z.Z.; writing—original draft preparation, Z.Z.; writing—review and editing, Z.Z. and M.Z.; supervision, Q.Y. and Q.L.; funding acquisition, Q.Y. and Y.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China [grant number 62103425] and Natural Science Foundation of Hunan Province, China [grant number 2022JJ40559].

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We thank Xiuling Zhang and Hailiang Chen at the National University of Defense Technology for attending discussions and giving us important advice.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
L: Live
V: Virtual
C: Constructive
LGASRD: Learnable Genetic Algorithm for Simulation Run-time Resource Deployment
SRR: Simulation Run-time Resource
SRSR: Simulation Run-time Software Resource
MS: Micro-Service
BS: Base Station
EP: Elite Population
OPK: Operator Performance Knowledge
NSGA-II: Non-dominated Sorting Genetic Algorithm II
DAG: Directed Acyclic Graph
TB: Training Base
ES: Edge Server

References

  1. Gao, Y.; Zhang, Y.; Zhou, X.; Lu, H. Overview of Simulation Architectures Supporting Live Virtual Constructive (LVC) Integrated Training. In Proceedings of the 2021 6th International Conference on Control, Robotics and Cybernetics, Shanghai, China, 9–11 October 2021; pp. 333–338. [Google Scholar]
  2. Miao, Z.; Yong, P.; Jiancheng, Z.; Quanjun, Y. Efficient Flow-Based Scheduling for Geo-Distributed Simulation Tasks in Collaborative Edge and Cloud Environments. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 3442–3459. [Google Scholar] [CrossRef]
  3. Cerny, T.; Donahoo, M.J.; Trnka, M. Contextual Understanding of Microservice Architecture: Current and Future Directions. Appl. Comput. Rev. 2018, 17, 29–45. [Google Scholar] [CrossRef]
  4. Al-Masri, E. Enhancing the Microservices Architecture for the Internet of Things. In Proceedings of the 2018 IEEE International Conference on Big Data, Seattle, WA, USA, 10–13 December 2018; pp. 5119–5125. [Google Scholar]
  5. Wang, S.; Ding, Z.; Jiang, C. Elastic Scheduling for Microservice Applications in Clouds. IEEE Trans. Parallel Distrib. Syst. 2021, 32, 98–115. [Google Scholar] [CrossRef]
  6. Pallewatta, S.; Kostakos, V.; Buyya, R. QoS-aware Placement of Microservices-based IoT Applications in Fog Computing Environments. Futur. Gener. Comp. Syst. 2022, 131, 121–136. [Google Scholar] [CrossRef]
  7. Xu, J.; Chen, L.; Zhou, P. Joint Service Caching and Task Offloading for Mobile Edge Computing in Dense Networks. In Proceedings of the IEEE INFOCOM 2018-IEEE Conference on Computer Communications, Honolulu, HI, USA, 16–19 April 2018; pp. 207–215. [Google Scholar]
  8. Poularakis, K.; Llorca, J.; Tulino, A.M.; Taylor, I.; Tassiulas, L. Joint Service Placement and Request Routing in Multi-cell Mobile Edge Computing Networks. In Proceedings of the IEEE INFOCOM 2019-IEEE Conference on Computer Communications, Paris, France, 29 April–2 May 2019; pp. 10–18. [Google Scholar]
  9. Chen, L.; Shen, C.; Zhou, P.; Xu, J. Collaborative Service Placement for Edge Computing in Dense Small Cell Networks. IEEE Trans. Mob. Comput. 2021, 20, 377–390. [Google Scholar] [CrossRef]
  10. Farhadi, V.; Mehmeti, F.; He, T.; La Porta, T.F.; Khamfroush, H.; Wang, S.; Chan, K.S.; Poularakis, K. Service Placement and Request Scheduling for Data-Intensive Applications in Edge Clouds. IEEE-ACM Trans. Netw. 2021, 29, 779–792. [Google Scholar] [CrossRef]
  11. Wang, S.; Guo, Y.; Zhang, N.; Yang, P.; Zhou, A.; Shen, X. Delay-Aware Microservice Coordination in Mobile Edge Computing: A Reinforcement Learning Approach. IEEE Trans. Mob. Comput. 2021, 20, 939–951. [Google Scholar] [CrossRef]
  12. Zhao, H.; Deng, S.; Liu, Z.; Yin, J.; Dustdar, S. Distributed Redundant Placement for Microservice-based Applications at the Edge. IEEE Trans. Serv. Comput. 2022, 15, 1732–1745. [Google Scholar] [CrossRef]
  13. Zhang, X.; Li, Z.; Lai, C.; Zhang, J. Joint Edge Server Placement and Service Placement in Mobile-Edge Computing. IEEE Internet Things J. 2022, 9, 11261–11274. [Google Scholar] [CrossRef]
  14. Deng, S.; Xiang, Z.; Taheri, J.; Khoshkholghi, M.A.; Yin, J.; Zomaya, A.Y.; Dustdar, S. Optimal Application Deployment in Resource Constrained Distributed Edges. IEEE Trans. Mob. Comput. 2021, 20, 1907–1923. [Google Scholar] [CrossRef]
  15. Lai, P.; He, Q.; Abdelrazek, M.; Chen, F.; Hosking, J.; Grundy, J.; Yang, Y. Optimal Edge User Allocation in Edge Computing with Variable Sized Vector Bin Packing. In Proceedings of the Service-Oriented Computing: 16th International Conference, Hangzhou, China, 7 November 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 230–245. [Google Scholar]
  16. Juve, G.; Chervenak, A.; Deelman, E.; Bharathi, S.; Mehta, G.; Vahi, K. Characterizing and Profiling Scientific Workflows. Future Gener. Comp. Syst. 2013, 29, 682–692. [Google Scholar] [CrossRef]
  17. Smit, M.; Stroulia, E. Simulating Service-Oriented Systems: A Survey and the Services-Aware Simulation Framework. IEEE Trans. Serv. Comput. 2013, 6, 443–456. [Google Scholar] [CrossRef]
  18. Fujimoto, R.M. Research Challenges in Parallel and Distributed Simulation. ACM Trans. Model. Comput. Simul. 2016, 26, 1–29. [Google Scholar] [CrossRef]
  19. Taylor, S.J. Distributed Simulation: State-of-the-Art and Potential for Operational Research. Eur. J. Oper. Res. 2019, 273, 1–19. [Google Scholar] [CrossRef]
  20. Kratzke, N.; Siegfried, R. Towards Cloud-native Simulations – Lessons Learned from the Front-line of Cloud Computing. J. Def. Model. Simul.-Appl. Methodol. Technol.-JDMS 2021, 18, 39–58. [Google Scholar] [CrossRef]
  21. Li, Y.; Wang, S. An Energy-Aware Edge Server Placement Algorithm in Mobile Edge Computing. In Proceedings of the 2018 IEEE International Conference on Edge Computing, San Francisco, CA, USA, 2–7 July 2018; pp. 66–73. [Google Scholar]
  22. Mohan, N.; Zavodovski, A.; Zhou, P.; Kangasharju, J. Anveshak: Placing Edge Servers in The Wild. In Proceedings of the 2018 Workshop on Mobile Edge Communications, Budapest, Hungary, 7 August 2018; pp. 7–12. [Google Scholar]
  23. Wang, S.; Zhao, Y.; Xu, J.; Jie, Y.; Hsu, C.H. Edge Server Placement in Mobile Edge Computing. J. Parallel Distrib. Comput. 2019, 127, 160–168. [Google Scholar] [CrossRef]
  24. Lähderanta, T.; Leppänen, T.; Ruha, L.; Lovén, L.; Sillanpää, M.J. Edge Computing Server Placement with Capacitated Location Allocation. J. Parallel Distrib. Comput. 2021, 153, 130–149. [Google Scholar] [CrossRef]
  25. He, T.; Khamfroush, H.; Wang, S.; La Porta, T.; Stein, S. It’s Hard to Share: Joint Service Placement and Request Scheduling in Edge Clouds with Sharable and Non-Sharable Resources. In Proceedings of the 2018 IEEE 38th International Conference on Distributed Computing Systems, Vienna, Austria, 2–6 July 2018; pp. 365–375. [Google Scholar]
  26. Salaht, F.A.; Desprez, F.; Lebre, A.; Prud’Homme, C.; Abderrahim, M. Service Placement in Fog Computing Using Constraint Programming. In Proceedings of the 2019 IEEE International Conference on Services Computing, Milan, Italy, 8–13 July 2019; pp. 19–27. [Google Scholar]
  27. Fan, Q.; Ansari, N. On Cost Aware Cloudlet Placement for Mobile Edge Computing. IEEE/CAA J. Autom. Sin. 2019, 6, 926–937. [Google Scholar] [CrossRef]
  28. Gao, B.; Zhou, Z.; Liu, F.; Xu, F. Winning at the Starting Line: Joint Network Selection and Service Placement for Mobile Edge Computing. In Proceedings of the IEEE INFOCOM 2019-IEEE Conference on Computer Communications, Paris, France, 29 April–2 May 2019; pp. 1459–1467. [Google Scholar]
  29. Chen, L.; Xu, J.; Ren, S.; Zhou, P. Spatio–Temporal Edge Service Placement: A Bandit Learning Approach. IEEE Trans. Wirel. Commun. 2018, 17, 8388–8401. [Google Scholar] [CrossRef]
  30. Ouyang, T.; Zhou, Z.; Chen, X. Follow Me at the Edge: Mobility-Aware Dynamic Service Placement for Mobile Edge Computing. IEEE J. Sel. Areas Commun. 2018, 36, 2333–2345. [Google Scholar] [CrossRef]
  31. Hu, B.; Cao, Z.; Zhou, M. Scheduling Real-Time Parallel Applications in Cloud to Minimize Energy Consumption. IEEE Trans. Cloud Comput. 2022, 10, 662–674. [Google Scholar] [CrossRef]
  32. Wang, P.; Xu, J.; Zhou, M.; Albeshri, A. Budget-Constrained Optimal Deployment of Redundant Services in Edge Computing Environment. IEEE Internet Things J. 2023, 10, 9453–9464. [Google Scholar] [CrossRef]
  33. Di Francesco, P.; Malavolta, I.; Lago, P. Research on Architecting Microservices: Trends, Focus, and Potential for Industrial Adoption. In Proceedings of the 2017 IEEE International Conference on Software Architecture, Gothenburg, Sweden, 3–7 April 2017; pp. 21–30. [Google Scholar]
  34. Rudolph, G. Convergence Analysis of Canonical Genetic Algorithms. IEEE Trans. Neural Netw. 1994, 5, 96–101. [Google Scholar] [CrossRef]
  35. Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II. IEEE Trans. Evol. Comput. 2002, 6, 182–197. [Google Scholar] [CrossRef]
Figure 1. System Architecture.
Figure 2. Deployment schemes for different objects: (a) Edge server deployment; (b) SRSR deployment; (c) Joint SRR deployment.
Figure 3. Algorithm performance comparison and iteration process of three evolutionary algorithms under experimental settings A, B and C.
Figure 4. Algorithm performance comparison and iteration process of three evolutionary algorithms under experimental settings D, E and F.
Figure 5. Algorithm performance comparison and iteration process of three evolutionary algorithms under experimental settings G, H and I.
Figure 6. Algorithm performance comparison under experimental settings J, K and L.
Figure 7. Algorithm performance comparison of three evolutionary algorithms under the resource-constrained scenario.
Figure 8. Three types of time cost defined in Section 4.2.
Table 1. Summary of key notations.
MS_i: the i-th type of MS
r_{MS_i}: the resource requirement of MS_i
S_i: the i-th type of CS combined by micro-services
S_{ij}: the j-th type of service request from the i-th user
t_{ijl}: the l-th task in S_{ij}
R_i: the i-th edge server
RC_i: the resource capacity of R_i
LR_i: the location coordinate of R_i
U_i: the i-th user
rC_i: the resource requirement of the i-th constructive member
DS: the deployment scheme, i.e., the solution of the problem to be solved
R(M): the edge server deploying the middleware
R_{U_i}(t_j): the closest edge server to user U_i whose deployed instances can execute t_j
R(t_i): the edge server that can process t_i with the smallest average data transmission time from the servers executing the predecessor tasks of t_i
R(C_i): the edge server deploying constructive member C_i
R(U_i): the edge server closest to user U_i
t_i^C: the communication time caused by service request S_i
t_i^M: the total queuing time of all tasks in service request S_i
t^S: the data transmission time caused by users' service requests
t^M: the interaction time cost among users
B_i: the resource usage rate of R_i
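To illustrate the last notation in Table 1, B_i can be computed as the total resource requirement of the micro-service instances and constructive members deployed on R_i, relative to its capacity RC_i. A minimal sketch; the paper's exact accounting may differ:

```python
def resource_usage_rate(deployed_requirements, capacity):
    """B_i: fraction of edge server R_i's capacity RC_i consumed by
    whatever is deployed on it (illustrative accounting)."""
    used = sum(deployed_requirements)
    if used > capacity:
        raise ValueError("deployment exceeds server capacity RC_i")
    return used / capacity

# e.g. three micro-service instances plus one constructive member
print(resource_usage_rate([2.0, 1.5, 0.5, 1.0], capacity=10.0))  # → 0.5
```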
Table 2. Update and application of performance knowledge of mutation operators.
Stage: historic performance (Op. 1 / Op. 2 / Op. 3) → selection probability (Op. 1 / Op. 2 / Op. 3)
Initialization: 1 / 1 / 1 → 33.3% / 33.3% / 33.3%
Mutation 1: 1 / 2 / 1 → 25% / 50% / 25%
Mutation 2: 1 / 2 / 1 → 25% / 50% / 25%
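The selection probabilities in Table 2 follow from normalizing each mutation operator's historic performance knowledge (OPK); a minimal sketch of this update rule:

```python
def selection_probabilities(performance):
    """Normalize operator performance knowledge into selection
    probabilities, as in Table 2."""
    total = sum(performance)
    return [p / total for p in performance]

# Initialization: equal knowledge, so each operator is equally likely
print(selection_probabilities([1, 1, 1]))  # uniform, each ≈ 33.3%
# After a round where operator 2 performed better: 25% / 50% / 25%
print(selection_probabilities([1, 2, 1]))  # → [0.25, 0.5, 0.25]
```

Operators that produced better offspring accumulate knowledge and are chosen more often in later generations, which is the adaptive-selection mechanism the abstract describes.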
Table 3. Number of micro-services each workflow contains.
Montage (WfM): 9
Cybershake (WfC): 5
Epigenomics (WfE): 8
LIGO (WfL): 4
SIPHT (WfS): 13
Table 4. Summary of scale parameters.
Setting: Nt / Nms / Nes / Nu
A: 50 / WfM / 10 / 10 + 10
B: 50 / WfM / 15 / 30 + 30
C: 50 / WfM / 20 / 50 + 50
D: 50 / all except WfC / 10 / 10 + 10
E: 50 / all except WfC / 15 / 30 + 30
F: 50 / all except WfC / 20 / 50 + 50
G: 50 / all / 10 / 10 + 10
H: 50 / all / 15 / 30 + 30
I: 50 / all / 20 / 50 + 50
J: 100 / WfM / 10 / 10 + 10
K: 100 / WfM / 15 / 30 + 30
L: 100 / WfM / 20 / 50 + 50
Table 5. Performance of the mentioned algorithms on two objective functions across nine experimental settings.
Algorithm: Best on O1 / Better than LGASRD on O1 / Best on O2 / Better than LGASRD on O2
KMNSGA: 0 / 0 / 3 / 3
NSGA-II: 0 / 1 / 1 / 2
LGASRD: 7 / n/a / 5 / n/a
Random: 2 / 2 / 0 / 0
TServiceD: 0 / 0 / 0 / 0
TSD 2: 0 / 0 / 0 / 0
Table 6. Changes in Time Cost with the Increase in the Number of Edge Servers, Users, and Micro-services.
Increasing number of edge servers and users (Setting: Scale / KMNSGA / NSGA-II / LGASRD):
A: 1 / 1 / 1 / 1
B: 4.5 / 8.17 / 5.76 / 10.40
C: 10 / 37.14 / 22.62 / 29.66
D: 1 / 1 / 1 / 1
E: 4.5 / 7.87 / 6.77 / 10.07
F: 10 / 32.74 / 18.98 / 34.27
G: 1 / 1 / 1 / 1
H: 4.5 / 5.17 / 3.09 / 4.14
I: 10 / 12.74 / 9.13 / 11.19
Increasing number of micro-services (Setting: Scale / KMNSGA / NSGA-II / LGASRD):
A: 1 / 1 / 1 / 1
D: 3.5 / 1897.87 / 2052.43 / 1517.05
G: 4 / 2378.72 / 2059.93 / 2238.64
B: 1 / 1 / 1 / 1
E: 3.5 / 1828.13 / 2409.09 / 1469.95
H: 4 / 1505.21 / 1103.90 / 890.71
C: 1 / 1 / 1 / 1
F: 3.5 / 1672.39 / 1721.85 / 1752.87
I: 4 / 815.58 / 831.16 / 844.83

Share and Cite

Zhang, Z.; Peng, Y.; Zhang, M.; Yin, Q.; Li, Q. An Approach for Deployment of Service-Oriented Simulation Run-Time Resources. Appl. Sci. 2023, 13, 11341. https://doi.org/10.3390/app132011341

