Article

Microservice Workflow Scheduling with a Resource Configuration Model Under Deadline and Reliability Constraints

by Wenzheng Li 1,*, Xiaoping Li 1, Long Chen 1 and Mingjing Wang 2
1 School of Computer Science and Engineering, Southeast University, Nanjing 211189, China
2 School of Data Science and Artificial Intelligence, Wenzhou University of Technology, Wenzhou 325000, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(4), 1253; https://doi.org/10.3390/s25041253
Submission received: 18 November 2024 / Revised: 15 January 2025 / Accepted: 8 February 2025 / Published: 19 February 2025
(This article belongs to the Section Sensor Networks)

Abstract:
With the continuous evolution of microservice architecture and containerization technology, the challenge of efficiently and reliably scheduling large-scale cloud services has become increasingly prominent. In this paper, we present a cost-optimized scheduling approach with resource configuration for microservice workflows in container environments, taking deadline and reliability constraints into account. We introduce a graph deep learning model (DeepMCC) that automatically configures containers to meet various quality of service (QoS) requirements. Additionally, we propose a reliability microservice workflow scheduling algorithm (RMWS), which incorporates heuristic leasing and deployment strategies to ensure reliability while reducing cloud resource leasing costs. Experiments on four scientific workflow datasets show that the proposed approach achieves an average cost reduction of 44.59% compared to existing reliability scheduling algorithms, with improvements of 26.63% in the worst case and 73.72% in the best case.

1. Introduction

The rapid advancement of information technology has popularized microservice architecture and containerization in cloud application development and deployment [1]. This trend is driven by the demand for enhancing system flexibility, scalability, and efficient resource utilization. By decoupling complex services, microservice architecture enables the independent development, deployment, and scaling of individual services, significantly improving the agility of cloud applications. However, the adoption of microservice architecture also introduces a variety of challenges, particularly in the areas of resource configuration and scheduling.
In terms of microservice resource provisioning, existing research primarily focuses on elastic resource configuration based on load predictions for individual services or specific types of services. However, microservice applications with task dependencies, often represented as workflow models using Directed Acyclic Graphs (DAGs), exhibit a higher level of complexity than independent tasks. Configuring resources for these tasks requires a comprehensive understanding of service dependencies and Quality of Service (QoS) requirements. Unfortunately, most researchers do not take resource configuration for complex workflows into account during scheduling and neglect the reusability of configuration experience, leading to inefficient resource utilization.
In terms of microservice workflow scheduling, many cloud workflow scheduling algorithms lack a resource failure model and do not account for scheduling reliability. Resource failures, such as server crashes or network disruptions, can lead to diminished system performance, premature termination of program execution, and even data loss. They ultimately result in more tasks missing their deadlines, higher failure rates, and severely compromised reliability and stability in cloud computing. Because containers provide a lower level of isolation, their failure rate is higher than that of virtual machines. It is therefore essential to consider reliability requirements in microservice scenarios.
Moreover, containerization offers a lightweight and efficient way to deploy microservices, but it also introduces a dual-layer resource structure consisting of both virtual machines and containers. This two-tier resource structure complicates cost optimization, as it requires balancing trade-offs among resource utilization, performance, and cost. Traditional virtual machine allocation strategies struggle to accommodate the diverse and dynamic configurations of containers: they rely on limited virtual machine resources and fail to leverage the flexibility provided by containerization.
In summary, microservice architecture and containerization have significantly altered scheduling systems in terms of task scale, resource structure, and execution reliability requirements. The limitations of traditional scheduling systems have prompted a shift toward more intelligent approaches [2], such as advanced AI models (e.g., deep learning and reinforcement learning). This paper presents an intelligent approach for scheduling microservice workflows in container environments, integrating graph deep learning with heuristic methods. The specific contributions of our work are as follows:
  • Precise Resource Configuration: A deep learning model for microservice container configuration (DeepMCC) is designed. Leveraging the strengths of Graph Neural Networks (GNNs) in processing graph-structured data, the model efficiently configures container resources for each service to meet QoS requirements.
  • System Reliability Enhancement: A container replication strategy is employed to enhance system redundancy and reliability. Additionally, the container migration strategy is designed to improve resource utilization.
  • Cost-optimized Scheduling: For the dual-layer virtual resource environment of containers and virtual machines, a reliability microservice workflow scheduling algorithm (RMWS) is proposed. The algorithm integrates container configuration, fault tolerance, and container migration to optimize cost. Experiments demonstrate that RMWS minimizes cost while ensuring reliability compared to relevant algorithms.
This paper introduces several innovative contributions to the field of microservice workflow scheduling in containerized environments. Firstly, it presents DeepMCC, a novel deep learning model tailored for precise container resource configuration in microservice architectures. Secondly, this paper enhances system reliability through the implementation of a container replication strategy, which boosts redundancy and resilience against resource failures. Lastly, the proposed RMWS algorithm represents a significant advancement in cost-optimized scheduling for microservice workflows. The approach distinguishes itself from existing algorithms, demonstrating superior performance in managing the intricacies of modern cloud environments.
The rest of the paper is organized as follows. The related work is described in Section 2. Section 3 shows models and problem formulation. Section 4 describes the proposed methods. Experimental results are shown in Section 5. Finally, conclusions and future research are detailed in Section 6.

2. Related Work

2.1. Workflow Scheduling

Workflow scheduling is an optimization problem of mapping tasks to resources, which involves sequencing tasks and allocating resources to optimize performance metrics, including execution time, resource utilization, and cost.
In the study of workflow scheduling considering deadline constraints, researchers focus on ensuring that tasks are completed before the deadline, while the scheduling objective is to minimize cost. Wu et al. [3] introduced the concept of probability-based upward ranking and designed two algorithms, ProLiS and L-ACO, which consider both task urgency and cost factors to achieve cost optimization. Chakravarthi et al. [4] employed a novel encoding scheme and a well-designed population initialization strategy to propose a firefly-based metaheuristic algorithm called CEFA. Sahni et al. [5] considered dynamic performance changes in cloud resources and proposed a high-performance JIT-C scheduling algorithm. Toussi et al. [6] proposed the EDQWS algorithm based on the divide-and-conquer approach.
In the study of workflow scheduling considering cost and budget, Wu et al. [7] proposed a cluster-based heuristic algorithm called PCP-B2, which balances budget allocation among PCPs. Ghafouri et al. [8] proposed a heuristic algorithm called CB-DT that uses backtracking to assign critical tasks to faster resources and non-critical tasks to lower-cost resources, optimizing execution time while satisfying budget constraints. Faragardi et al. [9] extended the classic HEFT algorithm to propose the GRP-HEFT algorithm, which consists of a resource provisioning mechanism and a scheduler.
Although these studies have made significant progress in workflow scheduling, they overlook the unique challenges presented by microservice architecture, such as the low scheduling efficiency caused by the scale of tasks and resources.

2.2. Microservice Workflow Scheduling

In the context of microservice scheduling, Gu et al. [10] and Guerrero et al. [11] focused on container deployment issues. He et al. [12] proposed a greedy-based algorithm for the rapid deployment and continuous delivery of microservices in cloud and edge computing environments by modeling the problem as a Quadratic Sum-of-Ratios Fractional Problem. Bao et al. [13] addressed the performance issues of independent microservices by establishing a comprehensive model and making accurate predictions. Wang et al. [14] proposed the Elastic Scheduling of Microservices (ESMS) method, which combines task scheduling and automatic scaling to meet deadline constraints while minimizing the cost of virtual machines. Li et al. [15] introduced a heuristic algorithm called GSMS to minimize execution cost while satisfying deadline and reliability constraints. Abdullah et al. [16] presented the MSDSC framework to enhance the security of edge systems. Yu et al. [17] proposed a reinforcement learning algorithm with reliability constraints (WS-CCR).
Table 1 shows the relevant work on microservice workflow scheduling over the past five years (journal papers, excluding conference papers). Most scholars tend to design heuristic methods, and there is currently no research that considers deadlines, cost, and reliability simultaneously. The research presented in this paper fills this gap in microservice workflow scheduling.
Apart from heuristic methods, the recent advancement of Graph Neural Networks (GNNs) has presented a new opportunity to tackle resource allocation challenges. GNNs possess powerful capabilities in representing graph data, are not limited by the number or order of nodes, and excel in inductive learning, making them instrumental in graph analysis problems. Their successful applications in diverse domains have demonstrated their potential in addressing graph-structured problems [18]. Wang et al. [19] employed a GNN to mine correlations and predict service usage probabilities for task solutions, and then efficiently constructed initial solutions using PN-based reinforcement learning. Liu et al. [20] proposed a dynamic graph neural network based model called DySR to tackle service evolution and the semantic gap between services and mashups. Dong et al. [21] designed an adaptive fault-tolerant workflow scheduling framework (RLFTWS) using deep reinforcement learning, balancing makespan and resource usage while achieving fault tolerance. Inspired by these developments, this paper explores a GNN model to map task resource requirements to container configurations.

2.3. Fault-Tolerant Strategies in Scheduling

To enhance the fault tolerance of workflow scheduling with reliability requirements, the primary research focus is on adopting replication strategies that generate multiple duplicates of tasks to ensure their smooth execution. Passive replication re-executes a backup task on new resources when the primary task fails [22], but may result in deadline violations. Active replication, on the other hand, runs multiple copies of each task simultaneously on different resources [23,24,25,26] to ensure that at least one copy completes successfully.
Early studies use a fixed number of replicas to tolerate the maximum fault rate [23,24], leading to high and unnecessary execution cost. To address this issue, subsequent research introduced the concept of quantitative active replication, which optimizes cost by dynamically adjusting the number of replicas for different tasks. Zhao et al. [25] proposed an algorithm that minimizes the number of replicas while satisfying reliability requirements, thereby reducing cost. Xie et al. [26] also considered the trade-off between the number of replicas and cost, proposing the CGM algorithm for cost optimization under reliability constraints.
Apart from replication-based fault-tolerance methods, some studies have adopted resubmission strategies to ensure the reliable execution of workflows [27]. Rescheduling for fault tolerance mainly involves reallocating resources for tasks that cannot be executed smoothly due to resource failures during execution, thereby ensuring their successful completion. This strategy is suitable for tasks with low fault rates and redundant time.
Whether utilizing replication or resubmission as fault-tolerant strategies, these studies primarily focused on cloud resource scenarios with a single-layer resource structure of virtual machines, and there is almost no research on scenarios involving the dual-layer resource structure of both containers and virtual machines.

3. System Architecture and Problem Description

3.1. System Architecture

The system architecture is shown in Figure 1. Applications with user-specified deadlines and reliability requirements are submitted to the scheduling system. First, a deep learning model called DeepMCC receives the workflow information, which includes the Quality of Service (QoS) requirements. DeepMCC analyzes the resource characteristics of each workflow subtask and recommends appropriate container configurations. Subsequently, the workflow tasks with their container configurations are sent to the RMWS module. The execution cost evaluator assesses the cost based on cloud service provider billing, while the reliability evaluator generates task replicas that meet the reliability requirements using the replication fault-tolerant strategy.
As one of the core components of the system, the scheduler is responsible for comprehensively considering task characteristics, resource allocation, execution costs, and reliability requirements to develop the optimal scheduling strategy. The resource leasing manager is responsible for dynamically managing the leasing and release of cloud resources, while also monitoring the performance and status of resource utilization.

3.2. Workflow and Resource Model

A microservice workflow application is represented as a DAG (Directed Acyclic Graph) W = (V, E), where V denotes the set of tasks and E represents the set of task dependencies. The computational workload of a task v_i ∈ V is denoted as w_i. An edge e_{i,j} ∈ E between tasks v_i and v_j means that v_j can only start after the completion of v_i, and is associated with data_{i,j}, the size of the data transferred from v_i to v_j. v_i is called a predecessor of v_j, and v_j a successor of v_i. The sets of predecessor and successor tasks of v_i are denoted as pred(v_i) and succ(v_i), respectively. v_entry and v_exit refer to tasks that have no predecessor tasks and no successor tasks, respectively (the notation is summarized in Table 2). By adding two virtual tasks with an execution time of θ at the beginning and end of the workflow, it is assumed that the workflow has exactly one v_entry and one v_exit.
The execution cost of a virtual machine is linked to its type and charged based on a unit time cost (Budget Time Unit, BTU). If an hourly-based cost model similar to Amazon EC2 is used to charge customers for renting virtual machines, any duration exceeding one hour is rounded up to the nearest hour.
The resource pool is represented as a set of virtual machines M = {m_1, m_2, …}. The computing resource vector configured for a virtual machine instance m_k is denoted as R^a(k) = (R_1^a, …, R_{N_r}^a), where N_r represents the total number of computing resource types (e.g., CPU, RAM, etc.). Each task v_i is encapsulated in a container c_k, which is deployed on a virtual machine m_l and consumes a specific amount of resources denoted as R(c_k).
Similar to most microservice workflow scheduling algorithms [9], it is assumed that only one microservice can execute within a container at any given time. When task v i is assigned to container c k , its execution time E T ( v i , c k ) is calculated as follows:
$ET(v_i, c_k) = \frac{w_i}{speed(c_k)},$
It is assumed that all virtual machines are located in the same physical region of the cloud environment, with the bandwidth between virtual machines m_x and m_y denoted as bandwidth_{x,y} and the transmission delay as delay_{x,y}. The data transfer time TT(v_i, v_j) between task v_i and its successor task v_j is calculated as follows:
$TT(v_i, v_j) = \begin{cases} \dfrac{data_{i,j}}{bandwidth_{x,y}} + delay_{x,y}, & \text{if } m_x \neq m_y, \\ 0, & \text{otherwise}, \end{cases}$
where m x and m y represent the virtual machines that process v i and v j , respectively. If tasks v i and v j are on the same virtual machine, the data transfer time is 0.
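As a concrete illustration, the two timing formulas above can be sketched in Python (a minimal sketch; the function and variable names are ours, not part of the proposed system):

```python
def execution_time(workload: float, container_speed: float) -> float:
    """ET(v_i, c_k) = w_i / speed(c_k)."""
    return workload / container_speed

def transfer_time(data: float, bandwidth: float, delay: float,
                  same_vm: bool) -> float:
    """TT(v_i, v_j): zero when both tasks share a VM, otherwise
    data / bandwidth plus the link delay."""
    if same_vm:
        return 0.0
    return data / bandwidth + delay
```

For example, a task with workload 100 on a container of speed 25 executes in 4 time units, and transferring 50 units of data over a bandwidth of 10 with delay 0.5 takes 5.5 time units across VMs but 0 within the same VM.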

3.3. Container Configuration Model

Figure 2 illustrates container configuration for a microservice workflow. The requirement of task v_i in the workflow is represented by a_i, defined as a tuple a_i = (P^in, P^out), where P^in denotes the required input parameters and P^out the output parameters. These requirements are met by specifically configured containers (including hardware parameters, software environment, and dependent data). A container configuration, in the form of a container image file, is represented as a tuple s = (ID, P^in, P^out, Q), where ID is the unique identifier of a specific container image, P^in and P^out have the same meanings as above, and Q is the Quality of Service (QoS) indicator of the service, represented as a tuple of M QoS attributes, Q = {Q_1, Q_2, …, Q_M}. This paper considers two QoS attributes: execution time and reliability.
The candidate resource set C is a list of K different resource configuration types, represented as C = {s_1, s_2, …, s_K}. These resources, in the form of container image files, can provide the same functionality with varying service qualities.
The optimization objective is to find resource configuration types that satisfy the QoS preferences of a user-submitted workflow request. w = {w_1, w_2, …, w_L} represents the user's preferences for the QoS attributes, i.e., the weights of the different QoS attributes, where each w_i > 0 and the sum of all w_i equals 1.
The QoS values for different service type indicators are weighted and combined using the following formula:
$\mathrm{QoS} = \sum_{i=1}^{L} w_i \times norm\_Q_i$
In this paper, the objective for reliability is maximization, whereas the objective for execution time is minimization. To unify the measurement of different QoS attributes, Q is normalized to norm_Q, calculated using the following formula:
$norm\_Q_i = \begin{cases} \dfrac{Q_i - min\_Q_i}{max\_Q_i - min\_Q_i}, & \text{if } Q_i \text{ is reliability}, \\ \dfrac{max\_Q_i - Q_i}{max\_Q_i - min\_Q_i}, & \text{if } Q_i \text{ is makespan}, \end{cases}$
where max_Q_i and min_Q_i represent the maximum and minimum values of the i-th service indicator, respectively.
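The normalization and weighting steps can be illustrated as follows (a hedged sketch; the bounds and weights are example values, not taken from the paper):

```python
def norm_q(q, q_min, q_max, maximize):
    """Min-max normalize a QoS value: 'maximize' covers reliability
    (higher is better); its negation covers makespan (lower is better)."""
    span = q_max - q_min
    if maximize:                     # reliability-like attribute
        return (q - q_min) / span
    return (q_max - q) / span        # makespan-like attribute

def weighted_qos(values, bounds, maximize_flags, weights):
    """QoS = sum_i w_i * norm_Q_i; weights are assumed to sum to 1."""
    return sum(w * norm_q(q, lo, hi, mx)
               for q, (lo, hi), mx, w
               in zip(values, bounds, maximize_flags, weights))
```

For instance, a configuration with reliability 0.9 (bounds [0.5, 1.0]) and makespan 10 (bounds [5, 25]) under equal weights scores 0.5 × 0.8 + 0.5 × 0.75 = 0.775.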
With QoS as the optimization objective, the problem is defined as follows:
$\arg\max_{x} \sum_{i=1}^{N} \sum_{j=1}^{K_i} x_{i,j} \times \mathrm{QoS}_{i,j} \quad \text{s.t.} \quad x_{i,j} \in \{0, 1\},\; i = 1, \ldots, N,\; j = 1, \ldots, K_i; \qquad \sum_{j=1}^{K_i} x_{i,j} = 1,\; i = 1, \ldots, N$
From the mathematical model, it can be seen that the QoS-based container configuration is a classical combinatorial optimization problem.

3.4. Failure Model

The time of cloud resource failures follows a Poisson probability distribution [28,29]. Let λ_l denote the failure rate of virtual machine m_l in each time period, which can be obtained by calculating the statistical average based on the historical data of the virtual machine [30]. The reliability of task v_i executing in container c_k deployed on virtual machine m_l is expressed as:
$R(v_i, c_k, m_l) = e^{-\lambda_l \times ET(v_i, c_k)}$
The fault-tolerant mechanism in our work is based on replication technology, where each task has multiple replicas that can be assigned to containers located on different virtual machines. Let n_i denote the number of replicas of task v_i. The set of replicas of task v_i is denoted as rep(v_i) = {v_i^1, v_i^2, …, v_i^{n_i}}, where v_i^1 is the primary replica and the others are backup replicas. All n_i replicas share the same reliability model.
For task v i , reliability is calculated as follows:
$R(v_i) = 1 - \prod_{\tau=1}^{n_i} \left( 1 - R\left(v_i^{\tau}, c_{v_i^{\tau}}, m_{v_i^{\tau}}\right) \right)$
For a workflow W, reliability is calculated as:
$R(W) = \prod_{v_i \in V} R(v_i)$
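The failure model and the two reliability formulas above can be sketched as follows (illustrative code under the Poisson failure assumption, not the authors' implementation):

```python
import math

def replica_reliability(failure_rate: float, exec_time: float) -> float:
    """R(v_i, c_k, m_l) = exp(-lambda_l * ET(v_i, c_k))."""
    return math.exp(-failure_rate * exec_time)

def task_reliability(replicas):
    """R(v_i) = 1 - prod(1 - R_tau): the task fails only if every
    replica fails. `replicas` is a list of (failure_rate, exec_time)."""
    fail_all = 1.0
    for lam, et in replicas:
        fail_all *= 1.0 - replica_reliability(lam, et)
    return 1.0 - fail_all

def workflow_reliability(tasks):
    """R(W) = product over all tasks of R(v_i)."""
    r = 1.0
    for replicas in tasks:
        r *= task_reliability(replicas)
    return r
```

Adding a second identical replica raises a task's reliability from e^{-1} ≈ 0.368 to about 0.600 in this example, which is exactly the redundancy effect the replication strategy exploits.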

3.5. Problem Formulation

The start time of task v_i on container c_k depends on the availability of container c_k and the time when all data from v_i's predecessor tasks has been received. The start time ST(v_i, c_k) and finish time FT(v_i, c_k) are calculated as follows:
$ST(v_i, c_k) = \max\left\{ Avail(v_i, c_k),\; \max_{v_j \in pred(v_i)} \max_{v_j^{\tau} \in rep(v_j)} \left( FT(v_j^{\tau}) + TT(v_j^{\tau}, v_i) \right) \right\}$
$FT(v_i, c_k) = ST(v_i, c_k) + ET(v_i, c_k)$
where rep(v_j) represents the set of replicas of task v_j.
Avail(v_i, c_k) is the earliest time at which container c_k is ready to execute task v_i and is calculated using the following formula:
$Avail(v_i, c_k) = \max\left\{ IT(v_i, c_k),\; \max_{v_s \in Sche(c_k)} AFT(v_s) \right\}$
where Sche(c_k) and AFT(v_s) represent the set of tasks assigned to container c_k and the actual finish time of task v_s, respectively.
Additionally, the initialization time IT(v_i, c_k) of a task depends on the states of the virtual machine and the container and is calculated as follows:
$IT(v_i, c_k) = \begin{cases} 0, & \text{if } c_k \text{ is running}, \\ IT(c_k), & \text{if creating } c_k \text{ on } m_l, \\ IT(c_k) + IT(m_l), & \text{if creating } c_k \text{ and } m_l. \end{cases}$
  • If there is a running c k that can handle task v i , no initialization is required, i.e.,  I T ( v i , c k ) = 0 .
  • If there is no suitable microservice to handle task v_i, a new container c_k is created and then deployed on an existing virtual machine m_l. Therefore, initialization of the container is necessary.
  • If there is no available microservice to handle task v_i and no existing virtual machine, a new container c_k and a new virtual machine m_l are created.
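The three initialization cases listed above can be expressed compactly (a sketch; the flags and start-up costs are illustrative stand-ins for the container/VM states and IT(c_k), IT(m_l)):

```python
def init_time(container_running: bool, vm_exists: bool,
              container_init: float, vm_init: float) -> float:
    """IT(v_i, c_k): 0 if the container is already running; the
    container start-up cost if only the VM exists; container plus VM
    start-up cost when both must be created."""
    if container_running:
        return 0.0
    if vm_exists:
        return container_init
    return container_init + vm_init
```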
Based on the above model, the scheduling scheme is represented as π = (V, C, M, ST, FT). An entry (v_i, c_k, m_l, ST(v_i, c_k), FT(v_i, c_k)) of π indicates that task v_i is assigned to container c_k on virtual machine m_l from time ST(v_i, c_k) to FT(v_i, c_k).
The lease start time LST(c_k, m_l) and lease finish time LFT(c_k, m_l) of container c_k on virtual machine m_l are calculated as follows:
$LST(c_k, m_l) = \min_{v_i \in Sche(c_k)} ST(v_i, c_k)$
$LFT(c_k, m_l) = \max_{v_i \in Sche(c_k)} FT(v_i, c_k)$
The formulas for calculating the total execution time m a k e s p a n ( W ) and total execution cost c o s t ( W ) of the microservice workflow W are:
$makespan(W) = \max_{v_i \in V} \max_{v_i^{\tau} \in rep(v_i)} FT\left(v_i^{\tau}, c_{v_i^{\tau}}\right)$
$cost(W) = \sum_{m_l \in I} p_{m_l} \times \left\lceil \left( \max_{c_k \in Sche(m_l)} LFT(c_k, m_l) - \min_{c_k \in Sche(m_l)} LST(c_k, m_l) \right) / TU \right\rceil$
where I denotes the set of leased virtual machine instances, Sche(m_l) is the set of containers deployed on virtual machine m_l, TU represents the minimum unit time for virtual machine leasing, and p_{m_l} denotes the BTU cost of virtual machine m_l.
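The BTU-based billing of Section 3.2 and the cost formula can be illustrated as follows (a sketch assuming an hourly time unit TU = 3600 s; the prices are examples):

```python
import math

def vm_cost(btu_price: float, lease_start: float, lease_end: float,
            time_unit: float) -> float:
    """Cost of one VM: price per BTU times the lease span rounded up
    to whole billing units (the ceiling in cost(W))."""
    return btu_price * math.ceil((lease_end - lease_start) / time_unit)

def workflow_cost(vms):
    """cost(W): sum of vm_cost over every leased VM. Each entry is
    (btu_price, earliest LST of its containers, latest LFT), with an
    hourly billing unit assumed."""
    return sum(vm_cost(p, lst, lft, 3600.0) for p, lst, lft in vms)
```

Note how a lease of 3700 s is billed as two full hours, which is why consolidating containers onto already-leased VMs can reduce cost.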
Let D_req and R_req be the deadline and reliability requirements, respectively, for the workflow W submitted by the user. The optimization objective is defined as follows:
$\min\; cost(W) \quad \text{s.t.} \quad makespan(W) \le D_{req}, \quad R(W) \ge R_{req}.$

4. Proposed Methods

Based on the system architecture and mathematical formulation, this section presents the microservice container configuration method (DeepMCC) and the reliability microservice workflow scheduling algorithm (RMWS).

4.1. DeepMCC

The design of DeepMCC is shown in Figure 3; the model consists of five core blocks: an embedding block, a GCN block, a global pooling block, an MLP block, and a GAT block. Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs) are integrated into DeepMCC to enhance performance. Specifically, the GCN captures both local and global features of the workflow graph structure: by leveraging multi-layer convolutional operations, it progressively extracts high-level structural features, enabling a more precise characterization of task relationships and dependencies within the workflow. The GAT block further improves flexibility and adaptability: by employing an attention mechanism, it dynamically assigns weights to nodes or edges based on their importance, enabling the model to handle candidate resource sets of varying sizes. The implementation and training process of DeepMCC are shown in Algorithms 1 and 2.
  • Embedding Block: The initial step in task feature processing takes the user-submitted workflow graph G = (V, E) as input. Each task node v has a candidate configuration set represented as X_v^qos, and the edges E indicate task dependencies. The embedding block extracts features x_v from X_v^qos, which are then transformed by a multi-layer perceptron (MLP) m_{θ1} into refined feature vectors.
  • GCN Block: The feature vectors of each node are updated iteratively. In each iteration, the feature vector of the current node is updated based on the features of its neighbor nodes. After L iterations, a set of feature matrices is formed, i.e., X^L = {X^h | h = 1, 2, …, L}. These feature matrices are concatenated to form X_GCN, which contains the updated feature information for all nodes in the graph.
  • Global Pooling Block and MLP Block: The global pooling block applies an average pooling operation to X_GCN to extract a global feature vector X_global. This global feature vector is replicated N times (where N is the number of task nodes) to obtain X_repeat, which is then concatenated with X_GCN. The result is fed into an MLP to obtain the final feature representation X_GCN, integrating both local and global information.
  • GAT Block: For each task v, its GCN feature x_GCN^v and the features X_{v,j}^qos of each candidate resource configuration are transformed using the multi-layer perceptrons m_{θ4} and m_{θ5}. An attention coefficient s_{v,j} is calculated using a multi-layer perceptron m_{θ6} with a Tanh activation function, representing the importance of each candidate resource configuration. The attention coefficients are normalized with the softmax function to obtain the selection probability Prob_{v,j} of each candidate resource configuration.
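The GAT block's scoring step can be sketched with NumPy, using single linear layers as stand-ins for the MLPs m_{θ4}, m_{θ5}, and m_{θ6} (all dimensions and parameters are illustrative, not the model's actual architecture):

```python
import numpy as np

def mlp(w, b, x):
    """A single linear layer standing in for each m_theta MLP."""
    return x @ w + b

def attention_scores(task_feat, cand_feats, params):
    """GAT-block logic: score each candidate configuration of one task
    via tanh(m4(task) + m5(candidate)) -> m6, then softmax the scores
    into selection probabilities Prob_{v,j}."""
    w4, b4, w5, b5, w6, b6 = params
    h = np.tanh(mlp(w4, b4, task_feat) + mlp(w5, b5, cand_feats))
    s = mlp(w6, b6, h).ravel()       # one scalar score per candidate
    e = np.exp(s - s.max())          # numerically stable softmax
    return e / e.sum()
```

Because the scoring MLP is applied per candidate and the softmax runs over however many candidates exist, the same parameters handle candidate sets of varying sizes, which is the flexibility the GAT block is designed for.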
Algorithm 1 DeepMCC model.
Require: G (the workflow graph with requirements), X_v^qos (the QoS matrix of s_v in G), N (the number of tasks in G), L (the number of layers in the GCN)
Ensure: Prob_{v,j}
 1: x_v ← ExtractFeatures(X_v^qos)
 2: x_v ← m_{θ1}(x_v)
 3: for h = 1 to L do
 4:   x_v^h ← Σ_{u ∈ pred(v)} m_θ^h(x_u^{h−1} ‖ x_v^{h−1}) + x_v^{h−1}
 5:   x_v^h ← Σ_{u ∈ succ(v)} m_θ^h(x_u^h ‖ x_v^h) + x_v^h
 6:   X_GCN ← X_GCN ∪ {x_v^h}
 7: end for
 8: x_global ← POOLING(m_{θ2}(X_GCN))
 9: X_repeat ← REPEAT(x_global, N)
10: X_GCN^Concat ← CONCAT(X_GCN, X_repeat)
11: X_GCN ← m_{θ3}(X_GCN^Concat)
12: s_{v,j} ← m_{θ4}(x_GCN^v) + m_{θ5}(x_{v,j}^qos)
13: s_{v,j} ← m_{θ6}(s_{v,j})
14: Prob_{v,j} ← softmax(s_{v,j})
15: return Prob_{v,j}
The system converts the solution into a multi-hot integer vector y = (y_1, y_2, …, y_m). The length of this vector equals the total number of resource types across all candidate sets referenced by the current workflow's tasks. Here, y_i = 1 indicates that resource i is selected by the current workflow and y_i = 0 that it is not; exactly one resource type in each candidate set is selected. The system uses this multi-hot vector y as the label data for the network module. For example, the corresponding label data in Figure 2 are y = (1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1).
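Constructing such a multi-hot label vector can be sketched as follows (the candidate-set sizes and chosen indices below are illustrative, not the Figure 2 instance):

```python
def multi_hot(candidate_sizes, chosen):
    """Build the label vector y: one slot per resource type across all
    candidate sets, with exactly one 1 per set (the chosen index)."""
    y = []
    for size, idx in zip(candidate_sizes, chosen):
        row = [0] * size
        row[idx] = 1
        y.extend(row)
    return y
```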
Algorithm 2 Training of DeepMCC.
Require: G (the workflow graph with requirements), X_v^qos (the QoS matrix of s_v in G), N (the number of tasks in G), L (the number of layers in the GCN)
Ensure: θ (the network parameters)
1: while the stopping criterion is not satisfied do
2:   Prob_v ← DeepMCC(G, X_v^qos, N, L)
3:   L ← (1/N) Σ_{v=1}^{N} Σ_{i=1}^{K_v} Loss_{v,i}
4:   θ ← θ − η ∇_θ L
5: end while
6: return θ
Given a training sample with m candidate resources, the output of DeepMCC is a vector Prob = (p_{s_1}, p_{s_2}, …, p_{s_m}) representing the selection probability of each resource. Correspondingly, the training labels form a binary vector y = (y_1, y_2, …, y_m), where each y_i is either 0 or 1. Predicting the connection between a task and a resource is akin to binary classification; for m such connections, there are m binary classification problems, and each prediction is independent. This study employs the cross-entropy loss for these binary classifications, computed as follows:
$Loss_j = -y_j \cdot \log p_{s_j} - (1 - y_j) \cdot \log\left(1 - p_{s_j}\right)$
The loss function obtained after combining multiple QoS attributes is:
$L = \frac{1}{N} \sum_{v=1}^{N} \sum_{i=1}^{K_v} Loss_{v,i}$
Subsequently, stochastic gradient descent is employed for training, where each iteration involves using the label data of one resource pattern to update the parameters through gradient descent. The computation process is as follows:
$\theta \leftarrow \theta - \eta \nabla_{\theta} L$
where ∇_θ L represents the gradient of the loss function with respect to the parameters θ to be optimized, and η is the pre-defined learning rate.
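The per-resource cross-entropy and the gradient-descent update can be illustrated as follows (a sketch; in practice the gradient comes from backpropagation through DeepMCC, which is abstracted here as a given array):

```python
import numpy as np

def bce_loss(y, p, eps=1e-12):
    """Per-resource cross-entropy: -y*log(p) - (1-y)*log(1-p),
    clipped for numerical stability."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def sgd_step(theta, grad, lr=0.01):
    """One update: theta <- theta - eta * grad(L)."""
    return theta - lr * grad
```

A confident correct prediction (y = 1, p ≈ 1) yields a near-zero loss, while an uninformative p = 0.5 yields log 2 regardless of the label.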

4.2. RMWS

The process of RMWS is shown in Algorithm 3, and the searchResource() procedure is shown in Algorithm 4. The reliability requirement submitted by the user serves as the reliability constraint of the workflow. The algorithm assigns a subreliability to each task based on the average reliability lower bound (Line 3), calculated as follows:
$R(v_i^{req}) = R_{req}^{1/|V|}$
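This subreliability assignment, together with the replica-count step of RMWS, can be sketched as follows (a hedged reading: the |V|-th root follows from the product form of the workflow reliability, and the replica count is taken as the smallest n whose combined reliability reaches the task's target; both functions are illustrative):

```python
import math

def sub_reliability(r_req: float, num_tasks: int) -> float:
    """Per-task reliability bound: since R(W) is a product over all
    tasks, giving each task R_req ** (1/|V|) meets the workflow target."""
    return r_req ** (1.0 / num_tasks)

def replicas_needed(target: float, r_single: float) -> int:
    """Smallest n with 1 - (1 - r_single)^n >= target, i.e. enough
    replicas that not all of them fail (a sketch of the replica-count
    step in Algorithm 3)."""
    n = math.ceil(math.log(1.0 - target) / math.log(1.0 - r_single))
    return max(n, 1)
```

For example, a two-task workflow with a 0.81 target needs 0.9 per task, and a replica reliability of 0.8 requires two replicas to reach a 0.95 task target.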
Algorithm 3 RMWS.
Require: Workflow W, Deadline D_req, Reliability R_req
Ensure: Scheduling scheme π
 1: Initialize priority queue tasklist based on task priorities in W
 2: Initialize empty lists for containers C, VMs M, and scheduling scheme π
 3: Calculate the subreliability of each task: R(v_i^req) = R_req^{1/|V|}
 4: while tasklist is not empty do
 5:   v_{i,j} ← tasklist.pop()
 6:   R_r(i,j) ← DeepMCC(W, v_{i,j})
 7:   n_{i,j} ← Calculate the number of replicas for v_{i,j}
 8:   for r ← 1 to n_{i,j} do
 9:     c_k ← Earliest container in C
10:     if c_k not found then
11:       c_k ← Create container with D_req(v_{i,j})
12:       C.append(c_k)
13:     end if
14:     ES_{i,j} ← searchResource(v_{i,j}, M, R_r(i,j))
15:     if ES_{i,j} is not empty then
16:       m_l ← Select a VM according to VS_RVD, VS_KLD or VS_LBR
17:     else
18:       m_l ← Lease a VM according to VL_MIN, VL_RVD or VL_KLD
19:     end if
20:     M.append(m_l)
21:     Calculate ST(v_{i,j}, c_k) and FT(v_{i,j}, c_k)
22:     π.append((v_{i,j}, c_k, m_l, ST, FT))
23:   end for
24: end while
25: return π
The algorithm iteratively processes tasks from the t a s k l i s t , starting with the highest priority (Lines 4–24). For each task, it calculates the resource demand using DeepMCC and determines the number of replicas needed to meet the reliability requirement (Lines 6–7). For each replica (Lines 8–23), it finds or creates a suitable container and searches for an available resource block on existing or newly leased VMs (Lines 9–19). Task start and finish times are then calculated and the assignment is appended to the scheduling scheme (Lines 21–22). The process continues until all tasks are scheduled, ensuring that both reliability and resource constraints are respected.
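The replica count in Line 7 is not given in closed form here. One common model, assuming independent replica failures, picks the smallest n with 1 − (1 − r)^n ≥ R(v_i^req), where r is the reliability of a single replica. A minimal Python sketch (the function name and this failure model are illustrative assumptions, not necessarily the paper's exact formula):

```python
import math

def replica_count(task_rel, sub_rel_target):
    """Smallest n such that 1 - (1 - task_rel)^n >= sub_rel_target,
    assuming replicas fail independently (illustrative model)."""
    if task_rel >= sub_rel_target:
        return 1  # a single placement already meets the subreliability
    return math.ceil(
        math.log(1 - sub_rel_target) / math.log(1 - task_rel)
    )
```

For example, with a single-replica reliability of 0.9 and a subreliability target of 0.995, three replicas suffice.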
Algorithm 4 searchResource(v_{i,j}, M, R^r(i,j)).
1: Initialize ES_{i,j} ← ∅
2: for m_k ∈ M do
3:     t_1 ← current time or est_j^i
4:     while t_1 ≤ D_req(v_{i,j}) − T_{i,j}^e do
5:         flag ← true, t_2 ← t_1
6:         while t_2 ≤ t_1 + T_{i,j}^e do
7:             if R^f(m_k, t_2) ≥ R^r(i,j) then
8:                 t_2 ← t_2 + 1
9:             else
10:                flag ← false
11:                t_1 ← t_2
12:                break
13:            end if
14:        end while
15:        if flag then
16:            Add (m_k, t_1) to ES_{i,j}
17:        end if
18:        t_1 ← t_1 + 1
19:    end while
20: end for
21: return ES_{i,j}
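The scan in Algorithm 4 can be sketched compactly in Python. This is a simplified version: the per-time-unit resource check R^f(m_k, t) ≥ R^r(i,j) is abstracted into a `demand_ok` callback, and the optimization of jumping t_1 forward to the failing t_2 is omitted; all names are illustrative.

```python
def search_resource(vms, est, deadline, exec_time, demand_ok):
    """Slide a window of length exec_time over [est, deadline - exec_time]
    on each VM, keeping the start times where the VM's free resources
    cover the task's demand at every time unit inside the window."""
    candidates = []
    for vm in vms:
        t1 = est
        while t1 <= deadline - exec_time:
            # window [t1, t1 + exec_time) must satisfy the demand throughout
            if all(demand_ok(vm, t2) for t2 in range(t1, t1 + exec_time)):
                candidates.append((vm, t1))
            t1 += 1
    return candidates
```

For a VM whose resources are free only during time units 2 through 10, a task with execution time 3 and deadline 10 obtains feasible start times 2 through 7.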

4.2.1. Heuristic Strategies

Three heuristic strategies are designed for container placement and for virtual machine leasing, respectively. When E S i , j is non-empty, a virtual machine (VM) must be selected from the leased instances to deploy the container for task v i , j , while adhering to time constraints and reliability requirements. The feasible start time for task v i , j lies within [ e s t j i , d ( i , j ) − T i , j e ] , and the chosen VM’s idle resources must suffice for v i , j ’s demands during [ T i , j s , T i , j f ) . The suitable VM is selected based on resource similarity and load balancing, considering the resource request vector R r ( i , j ) and the VM’s remaining resources R f ( k , t ) . Three formulas are used to calculate the distance between resource vectors:
  • Resource Vector Distance ( R V D ):
    RVD(R^f(k,t), R^r(i,j)) = √( Σ_{l=1}^{m} ( x_l(k,t) − y_l(i,j) )² )
    where x_l(k,t) and y_l(i,j) represent the proportions of resource type l in R^f(k,t) and R^r(i,j), respectively.
  • Kullback–Leibler Distance ( K L D ):
    KLD(R^f(k,t), R^r(i,j)) = Σ_{l=1}^{m} x_l(k,t) · ln( x_l(k,t) / y_l(i,j) )
  • Load Balancing Rate ( L B R ):
    LBR(R_l^f(k,t), R_l^r(i,j)) = Σ_{l=1}^{m} ( ratio(R_l^a(k)) − ratio(R_l^f(k,t), R_l^r(i,j)) )²
    where ratio(R_l^a(k)) and ratio(R_l^f(k,t), R_l^r(i,j)) denote the corresponding resource proportions.
Based on the three criteria of distance, similarity, and load balancing, the following heuristic virtual machine selection rules are designed:
  • V S R V D : Select the VM in E S i , j with the smallest R V D ;
  • V S K L D : Select the VM in E S i , j with the smallest K L D ;
  • V S L B R : Select the VM in E S i , j with the smallest L B R .
Three heuristic rules for leasing virtual machines are designed for when E S i , j is empty:
  • V L M I N : Lease the VM for container c k with the lowest cost;
  • V L R V D : Lease the VM for container c k with the smallest R V D ;
  • V L K L D : Lease the VM for container c k with the smallest K L D .
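Under these rules, V S R V D and V S K L D reduce to a minimum search over the candidate set. A small Python sketch, assuming resource vectors are given as proportion lists; the function and variable names are illustrative:

```python
import math

def rvd(free, req):
    """Euclidean distance between resource-proportion vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(free, req)))

def kld(free, req):
    """Kullback-Leibler distance: sum of x_l * ln(x_l / y_l)."""
    return sum(x * math.log(x / y) for x, y in zip(free, req) if x > 0)

def select_vm(candidates, req, metric=rvd):
    """VS_RVD / VS_KLD: pick the candidate (vm, free_vector) pair whose
    free-resource vector is closest to the request vector."""
    return min(candidates, key=lambda c: metric(c[1], req))
```

With a request of 60% CPU and 40% memory, a VM whose free resources are split 50/50 is chosen over one split 90/10 under both metrics.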

4.2.2. Improving the Scheduling Solution

After generating the initial task scheduling solution π , Algorithm 5 is designed to optimize cost. The algorithm reduces virtual machine fragmentation through container migration: Algorithm 6 sequentially identifies idle time slots on each virtual machine, and whenever a slot longer than the VM startup time is found, Algorithm 5 leases a new virtual machine and migrates the eligible containers to it. Figure 4 illustrates a scheduling instance as an example.
Algorithm 5 scheduleImprove.
1: Initialize M as the set of all leased VMs
2: for each m_k ∈ M do
3:     TS_k ← call idleTimeSlots() for m_k
4:     for each time slot TS ∈ TS_k do
5:         t_s ← start time of slot TS
6:         t_e ← end time of slot TS
7:         if t_e − t_s > T_k^s then
8:             Lease a new VM instance m_new
9:             Migrate all containers scheduled on m_k with T_{i,j}^s > t_e to m_new
10:            Update lease time and release time of m_new
11:            Add m_new to M
12:        end if
13:    end for
14: end for
15: Update π
16: return π
Algorithm 6 idleTimeSlots.
1: for i = 1 to n do
2:     for j = 1 to μ_i do
3:         Add task v_{i,j} to list L
4:     end for
5: end for
6: Sort L by task start times T_{i,j}^s in ascending order
7: Initialize an empty set for time slots, TS_k
8: Set initial variables: v ← L[0], t_s ← T^f(v), and t_e ← T^f(v)
9: Define t_max as the maximum task finish time in the task list
10: while t_s ≤ t_max and t_e ≤ t_max do
11:    if Σ_{v_{i,j}∈L} y_{i,j,k}(t_s) > 0 then
12:        t_s ← the maximum T_{i,j}^f among tasks with y_{i,j,k}(t_s) = 1
13:    else
14:        t_e ← t_s + 1
15:        while Σ_{v_{i,j}∈L} y_{i,j,k}(t_e) = 0 and t_e ≤ t_max do
16:            t_e ← t_e + 1
17:        end while
18:        if t_e > t_s then
19:            Add the time slot (t_s, t_e) to TS_k
20:            t_s ← t_e
21:        end if
22:    end if
23: end while
24: return the set of identified time slots, TS_k
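The gap detection of Algorithm 6 can be condensed to a few lines when the busy intervals of a VM are known. A simplified Python sketch, interval-based rather than the time-stepped scan of the algorithm; the function name and interval representation are illustrative:

```python
def idle_time_slots(intervals):
    """Given (start, finish) pairs of the tasks placed on one VM,
    return the idle gaps between consecutive busy periods."""
    slots = []
    busy = sorted(intervals)          # order busy intervals by start time
    cur_end = busy[0][1]              # end of the current busy period
    for start, finish in busy[1:]:
        if start > cur_end:           # gap found: VM is idle in [cur_end, start)
            slots.append((cur_end, start))
        cur_end = max(cur_end, finish)
    return slots
```

Overlapping task intervals are merged implicitly by tracking the running maximum finish time, so only true idle gaps are reported.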

5. Experiment

5.1. Experimental Setting

Due to the scarcity of open-source datasets specifically tailored for microservice scenarios, this paper utilizes simulated data to conduct its experiments. The following details the data processing and generation methods employed:
  • Container Configuration Data: The foundation of our container configuration data is the Quality of Web Service dataset [31], which comprises 2507 real-world web service entries. To create a structured resource set, we applied text clustering techniques to these entries, resulting in 200 distinct clusters. Each cluster serves as a candidate resource set, with sizes ranging from 2 to 200 entries, providing a diverse range of options for our simulations.
  • Microservice Application Datasets: Our microservice application datasets are derived from four reputable scientific workflow datasets: Cybershake, LIGO, Montage, and SIPHT. These datasets encompass workflows with varying complexities, with the number of tasks in a single workflow ( μ i ) ranging from 100 to 1000 in increments of 100. This wide range allows us to test the robustness and scalability of our proposed methods across different workload sizes.
  • Training Samples for DeepMCC: We configured the generator to produce workflows with task counts ranging from 20 to 100, ensuring a diverse set of training instances. A Genetic Algorithm (GA) was then employed to determine near-optimal solutions for these workflows, which served as our training and validation data samples. For each sample size within the specified range, we generated 1000 training samples, 100 validation samples, and 50 test samples randomly.
  • User-Defined Deadlines and Reliability Requirements: To simulate real-world user constraints, we set user-defined deadlines ( D i ) based on the earliest finish time ( e f t μ i i ) of each workflow. These deadlines were calculated by multiplying e f t μ i i by a factor θ , which ranged from 0.02 to 0.2 in increments of 0.02. Similarly, user-defined reliability requirements ( R r e q ) were varied from 0.7 to 0.999 in increments of 0.05, allowing us to assess the model’s performance under different reliability constraints.
  • Virtual Machine Prices: The prices of virtual machines used in our simulations are based on the billing model of Amazon Elastic Container Service (Amazon ECS) [32]. A detailed table (Table 3) outlines the unit costs of various on-demand AWS EC2 instances, providing a realistic pricing structure for our cost-optimization analyses [32].
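For reproducibility, the two sweep grids described above (deadline factors θ and reliability requirements R_req) can be generated as follows. This is a sketch; the helper name `frange` is an illustrative assumption, and the reliability grid is capped at 0.999 as in the setting above.

```python
def frange(start, stop, step):
    """Inclusive float range with rounding to suppress accumulation error."""
    vals, v = [], start
    while v <= stop + 1e-9:
        vals.append(round(v, 3))
        v += step
    return vals

# Deadline factors theta: 0.02 to 0.2 in increments of 0.02.
thetas = frange(0.02, 0.2, 0.02)

# Reliability requirements: 0.7 upward in increments of 0.05,
# with the final grid point set to 0.999 per the experimental setting.
reliabilities = frange(0.7, 0.95, 0.05) + [0.999]
```
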

5.2. Performance Evaluation of DeepMCC

To verify its effectiveness, DeepMCC is compared with three combinatorial optimization algorithms: the Multiple Population Genetic Algorithm (MPGA) [33] as a representative metaheuristic, the Double Deep Q-Network algorithm (DDQN) [34] as a deep reinforcement learning approach, and a deep reinforcement learning algorithm pre-trained with QoS-labeled data (QoS-DRL). The hyperparameters for each algorithm are summarized in Table 4.
Each algorithm is run 10 times on the test samples, and the average QoS value is the solution result. The experimental results for different K values are shown in Figure 5. Table 5 and Table 6 present the average QoS and runtime for the algorithms under different workflow sizes (N) and resource sizes (K), respectively.
The QoS values of all algorithms gradually decrease as the number of workflow nodes increases. Specifically, when K is small, DeepMCC, MPGA, and QoS-DRL achieve higher QoS values, whereas DDQN lags behind them as the number of workflow nodes grows. When K = 10 , 000 , DeepMCC and MPGA perform the best. QoS-DRL’s QoS value decreases when the number of workflow nodes is large, but it still outperforms DDQN.
Figure 6 shows the trends in QoS optimization and runtime for K = 100 . When the number of candidate resources is small ( K = 100 ), DeepMCC exhibits a relatively short runtime as the number of tasks increases. In contrast, MPGA has a significantly longer runtime than DeepMCC. DDQN shows a longer runtime for smaller task sizes but becomes relatively faster as the task size increases. QoS-DRL consistently has the longest average runtime, making it the slowest among the four algorithms. In terms of QoS, MPGA performs best as the task size grows.
Figure 7 shows the trends for K = 10 , 000 . As the number of candidate resources increases ( K = 10 , 000 ), the performance of the algorithms varies significantly. DeepMCC remains the most advantageous algorithm, with a relatively slow increase in runtime as the workflow size grows. MPGA has a significantly longer runtime than DeepMCC. DDQN’s runtime varies little across workflow sizes but is generally long. QoS-DRL has the longest average runtime.
Summarizing the experimental results, DeepMCC outperforms MPGA by an average of 2.18% in QoS optimization, DDQN by 27.2%, and QoS-DRL by 3.37%. In terms of execution time, DeepMCC reduces runtime by an average of 71.2% compared to MPGA, 69.9% compared to DDQN, and 88.49% compared to QoS-DRL.

5.3. Performance Evaluation of RMWS

Three algorithms were used as baselines for comparison with RMWS: ProLiS [3], IRW [35], and CCRH [36]. ProLiS introduces a probabilistic task sorting and subdeadline allocation method to enhance fairness and efficiency in task scheduling. However, ProLiS does not consider reliability requirements. Therefore, we extend the reliability method to ProLiS by iteratively assigning task replicas to virtual machines until the subdeadlines are met. Two representative fault-tolerant scheduling algorithms, IRW and CCRH, are selected to evaluate the fault-tolerance performance. Both algorithms consider fault tolerance in scheduling, with IRW based on a resubmission strategy and CCRH employing a replication strategy. However, neither IRW nor CCRH considers deadline constraints. Therefore, the task sorting processes of IRW and CCRH are modified to prioritize tasks according to the subdeadline partitioning strategy.
Figure 8 illustrates the trends in execution cost as the deadline factor θ varies for the four algorithms on the Cybershake, LIGO, Montage, and SIPHT datasets, respectively. It is observed that the execution cost decreases as θ increases, allowing algorithms to utilize cheaper virtual machine resources to reduce costs.
When θ is less than 0.16, RMWS and ProLiS exhibit similar optimization performance. However, as θ increases beyond 0.16, ProLiS’s cost optimization ability gradually diminishes, widening the gap with RMWS, though it still outperforms IRW and CCRH. RMWS consistently demonstrates the best and most stable cost optimization on the Cybershake dataset, meeting cost optimization targets under various deadline constraints. Notably, cost reduction becomes ineffective for IRW and CCRH when θ falls below a certain threshold. Specifically, IRW’s execution cost surges when θ is less than 0.15 and remains high thereafter; CCRH exhibits the same phenomenon when θ is less than 0.13. The significant deviation of IRW and CCRH on Cybershake compared with the other three workflow datasets may be attributed to Cybershake’s abundant parallel structures, which require more virtual machines to accommodate tasks, leading to more resource contention and fragmentation.
Due to the cost surge of IRW and CCRH on the Cybershake dataset under tight deadline constraints ( θ < 0.13 ), their data are excluded from this experiment’s statistics. Across the three workflow application types (LIGO, Montage, and SIPHT), RMWS reduces execution costs by an average of 32.70%, 49.50%, and 73.72% compared to ProLiS, IRW, and CCRH, respectively.
Figure 9 presents the execution costs under various reliability constraints for the four algorithms on the Cybershake, LIGO, Montage, and SIPHT datasets, respectively. On the Cybershake dataset, IRW consistently fails to meet reliability requirements, so only ProLiS, CCRH, and RMWS are analyzed. CCRH incurs the highest cost, while the execution costs of the other algorithms increase slightly with R r e q . When R r e q exceeds 0.99, the execution costs of all algorithms surge dramatically. This is because when R r e q is low, reliability demands are easily met, with the primary constraint being the deadline. However, when R r e q exceeds a certain threshold, reliability demands become the primary obstacle, necessitating the leasing of additional cloud resources to deploy more task replicas. Since CCRH’s strategy selects the maximum number of replicas per task for reliability, its execution cost is the highest.
Although IRW fails to find feasible solutions meeting constraints on the Cybershake dataset, it outperforms ProLiS on the other three datasets. The performance of IRW under reliability requirements is most closely related to workflow type, with the most significant impact from workflow structure variations observed on the SIPHT dataset. Across all workflow applications, RMWS improves execution costs by 26.63%, 35.67%, and 70.03% compared to ProLiS, IRW, and CCRH, respectively.
Figure 10 shows the execution cost optimization results for different workflow sizes on the Cybershake, LIGO, Montage, and SIPHT datasets. As workflow size increases, the execution costs of all algorithms rise. When the workflow size ranges from 200 to 500, IRW outperforms the other algorithms. However, for larger workflows (e.g., more than 500 tasks), IRW’s performance on the SIPHT dataset is inferior to RMWS. RMWS demonstrates the best cost optimization ability for larger workflow sizes. Compared to IRW and CCRH, RMWS reduces the average cost by 27.84% and 47.57%, respectively.

6. Conclusions and Future Work

This paper considers the cost optimization of microservice workflows with deadline and reliability constraints in container environments. DeepMCC is designed to learn from resource combinations to predict container selection probabilities. Additionally, a reliability scheduling algorithm, RMWS, is proposed with replication-based task scheduling and a container migration process, which reduces VM leasing costs by utilizing idle fragments. Experimental results demonstrate that DeepMCC effectively solves the configuration problem of large-scale tasks, and that RMWS achieves high performance relative to the comparison algorithms. Specifically, the DeepMCC model reduces the average runtime by 76.53% compared to the three comparison algorithms, and RMWS achieves an average improvement of 44.59% in cost optimization over them. Future work includes validating the proposed algorithms on real-world data and utilizing unsupervised learning models to overcome the limitations of pre-training.

Author Contributions

All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by W.L., X.L., L.C., and M.W. The first draft of the manuscript was written by W.L., and all authors commented on previous versions of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Key Research and Development Program of China (No. 2022YFB3305500), the National Natural Science Foundation of China (Nos. 62273089 and 62102080), and the Natural Science Foundation of Jiangsu Province (No. BK20210204).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are available from the corresponding author upon reasonable request. These data were used under license for the current study and cannot be redistributed without permission from the data provider.

Acknowledgments

The financial support mentioned in the Funding part is gratefully acknowledged.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, G.; Huang, B.; Liang, Z.; Qin, M.; Zhou, H.; Li, Z. Microservices: Architecture, container, and challenges. In Proceedings of the 2020 IEEE 20th International Conference on Software Quality, Reliability and Security Companion (QRS-C), Macau, China, 11–14 December 2020; pp. 629–635. [Google Scholar] [CrossRef]
  2. Houmani, Z.; Balouek-Thomert, D.; Caron, E.; Parashar, M. Enhancing microservices architectures using data-driven service discovery and QoS guarantees. In Proceedings of the 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), Melbourne, VIC, Australia, 11–14 May 2020; pp. 290–299. [Google Scholar] [CrossRef]
  3. Wu, Q.; Ishikawa, F.; Zhu, Q.; Xia, Y.; Wen, J. Deadline-Constrained Cost Optimization Approaches for Workflow Scheduling in Clouds. IEEE Trans. Parallel Distrib. Syst. 2017, 28, 3401–3412. [Google Scholar] [CrossRef]
  4. Chakravarthi, K.; Loganathan, S.; Vaidehi, V. Cost-effective workflow scheduling approach on cloud under deadline constraint using firefly algorithm. Appl. Intell. 2021, 51, 1629–1644. [Google Scholar] [CrossRef]
  5. Sahni, J.; Vidyarthi, D.P. A Cost-Effective Deadline-Constrained Dynamic Scheduling Algorithm for Scientific Workflows in a Cloud Environment. IEEE Trans. Cloud Comput. 2018, 6, 2–18. [Google Scholar] [CrossRef]
  6. Toussi, G.; Naghibzadeh, M.; Abrishami, S.; Taheri, H.; Abrishami, H. EDQWS: An enhanced divide and conquer algorithm for workflow scheduling in cloud. J. Cloud Comput. 2022, 11, 13. [Google Scholar] [CrossRef]
  7. Wu, F.; Wu, Q.; Tan, Y.; Li, R.; Wang, W. PCP-B 2: Partial critical path budget balanced scheduling algorithms for scientific workflow applications. Future Gener. Comput. Syst. 2016, 60, 22–34. [Google Scholar] [CrossRef]
  8. Ghafouri, R.; Movaghar, A.; Mohsenzadeh, M. A budget constrained scheduling algorithm for executing workflow application in infrastructure as a service clouds. Peer Peer Netw. Appl. 2019, 12, 241–268. [Google Scholar] [CrossRef]
  9. Faragardi, H.R.; Saleh Sedghpour, M.R.; Fazliahmadi, S.; Fahringer, T.; Rasouli, N. GRP-HEFT: A Budget-Constrained Resource Provisioning Scheme for Workflow Scheduling in IaaS Clouds. IEEE Trans. Parallel Distrib. Syst. 2020, 31, 1239–1254. [Google Scholar] [CrossRef]
  10. Gu, H.; Li, X.; Liu, M.; Wang, S. Scheduling method with adaptive learning for microservice workflows with hybrid resource provisioning. Int. J. Mach. Learn. Cybern. 2021, 12, 3037–3048. [Google Scholar] [CrossRef]
  11. Guerrero, C.; Lera, I.; Juiz, C. Resource optimization of container orchestration: A case study in multi-cloud microservices-based applications. J. Supercomput. 2018, 74, 2956–2983. [Google Scholar] [CrossRef]
  12. He, X.; Tu, Z.; Wagner, M.; Xu, X.; Wang, Z. Online Deployment Algorithms for Microservice Systems with Complex Dependencies. IEEE Trans. Cloud Comput. 2023, 11, 1746–1763. [Google Scholar] [CrossRef]
  13. Bao, L.; Wu, C.; Bu, X.; Ren, N.; Shen, M. Performance Modeling and Workflow Scheduling of Microservice-Based Applications in Clouds. IEEE Trans. Parallel Distrib. Syst. 2019, 30, 2114–2129. [Google Scholar] [CrossRef]
  14. Wang, S.; Ding, Z.; Jiang, C. Elastic Scheduling for Microservice Applications in Clouds. IEEE Trans. Parallel Distrib. Syst. 2021, 32, 98–115. [Google Scholar] [CrossRef]
  15. Li, Z.; Yu, H.; Fan, G.; Zhang, J. Cost-Efficient Fault-Tolerant Workflow Scheduling for Deadline-Constrained Microservice-Based Applications in Clouds. IEEE Trans. Netw. Serv. Manag. 2023, 20, 3220–3232. [Google Scholar] [CrossRef]
  16. Lakhan, A.; Mohammed, M.A.; Rashid, A.N.; Kadry, S.; Abdulkareem, K.H.; Nedoma, J.; Martinek, R.; Razzak, I. Restricted Boltzmann Machine Assisted Secure Serverless Edge System for Internet of Medical Things. IEEE J. Biomed. Health Inform. 2023, 27, 673–683. [Google Scholar] [CrossRef] [PubMed]
  17. Yu, X.; Wu, W.; Wang, Y. Integrating Cognition Cost with Reliability QoS for Dynamic Workflow Scheduling Using Reinforcement Learning. IEEE Trans. Serv. Comput. 2023, 16, 2713–2726. [Google Scholar] [CrossRef]
  18. Li, Z.; Chen, Q.; Koltun, V. Combinatorial Optimization with Graph Convolutional Networks and Guided Tree Search. arXiv 2018, arXiv:1810.10659. [Google Scholar] [CrossRef]
  19. Wang, X.; Xu, H.; Wang, X.; Xu, X.; Wang, Z. A Graph Neural Network and Pointer Network-Based Approach for QoS-Aware Service Composition. IEEE Trans. Serv. Comput. 2023, 16, 1589–1603. [Google Scholar] [CrossRef]
  20. Liu, M.; Tu, Z.; Xu, H.; Xu, X.; Wang, Z. DySR: A Dynamic Graph Neural Network Based Service Bundle Recommendation Model for Mashup Creation. IEEE Trans. Serv. Comput. 2023, 16, 2592–2605. [Google Scholar] [CrossRef]
  21. Dong, T.; Xue, F.; Tang, H.; Xiao, C. Deep reinforcement learning for fault-tolerant workflow scheduling in cloud environment. Appl. Intell. 2022, 53, 9916–9932. [Google Scholar] [CrossRef]
  22. Zheng, Q.; Veeravalli, B.; Tham, C.K. On the Design of Fault-Tolerant Scheduling Strategies Using Primary-Backup Approach for Computational Grids with Low Replication Costs. IEEE Trans. Comput. 2009, 58, 380–393. [Google Scholar] [CrossRef]
  23. Benoit, A.; Hakem, M.; Robert, Y. Fault tolerant scheduling of precedence task graphs on heterogeneous platforms. In Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing, Miami, FL, USA, 14–18 April 2008; pp. 1–8. [Google Scholar] [CrossRef]
  24. Benoit, A.; Hakem, M.; Robert, Y. Contention awareness and fault-tolerant scheduling for precedence constrained tasks in heterogeneous systems. Parallel Comput. 2009, 35, 83–108. [Google Scholar] [CrossRef]
  25. Zhao, L.; Ren, Y.; Xiang, Y.; Sakurai, K. Fault-tolerant scheduling with dynamic number of replicas in heterogeneous systems. In Proceedings of the 2010 IEEE 12th International Conference on High Performance Computing and Communications (HPCC), Melbourne, VIC, Australia, 1–3 September 2010; pp. 434–441. [Google Scholar] [CrossRef]
  26. Xie, G.; Wei, Y.; Le, Y.; Li, R. Redundancy Minimization and Cost Reduction for Workflows with Reliability Requirements in Cloud-Based Services. IEEE Trans. Cloud Comput. 2022, 10, 633–647. [Google Scholar] [CrossRef]
  27. Plankensteiner, K.; Prodan, R. Meeting Soft Deadlines in Scientific Workflows Using Resubmission Impact. IEEE Trans. Parallel Distrib. Syst. 2012, 23, 890–901. [Google Scholar] [CrossRef]
  28. Xie, G.; Zeng, G.; Chen, Y.; Bai, Y.; Zhou, Z.; Li, R.; Li, K. Minimizing Redundancy to Satisfy Reliability Requirement for a Parallel Application on Heterogeneous Service-Oriented Systems. IEEE Trans. Serv. Comput. 2020, 13, 871–886. [Google Scholar] [CrossRef]
  29. Hu, B.; Cao, Z. Minimizing Resource Consumption Cost of DAG Applications With Reliability Requirement on Heterogeneous Processor Systems. IEEE Trans. Ind. Inform. 2020, 16, 7437–7447. [Google Scholar] [CrossRef]
  30. Qu, L.; Khabbaz, M.; Assi, C. Reliability-Aware Service Chaining In Carrier-Grade Softwarized Networks. IEEE J. Sel. Areas Commun. 2018, 36, 558–573. [Google Scholar] [CrossRef]
  31. Al-Masri, E.; Mahmoud, Q.H. Discovering the best web service: A neural network-based solution. In Proceedings of the 2009 IEEE International Conference on Systems, Man and Cybernetics, San Antonio, TX, USA, 11–14 October 2009; pp. 4250–4255. [Google Scholar] [CrossRef]
  32. Amazon Elastic Container Service. 2024. Available online: https://aws.amazon.com/ecs/ (accessed on 15 January 2025).
  33. Xie, Y.; Gui, F.X.; Wang, W.J.; Chien, C.F. A Two-stage Multi-population Genetic Algorithm with Heuristics for Workflow Scheduling in Heterogeneous Distributed Computing Environments. IEEE Trans. Cloud Comput. 2023, 11, 1446–1460. [Google Scholar] [CrossRef]
  34. Tong, Z.; Ye, F.; Liu, B.; Cai, J.; Mei, J. DDQN-TS: A novel bi-objective intelligent scheduling algorithm in the cloud environment. Neurocomputing 2021, 455, 419–430. [Google Scholar] [CrossRef]
  35. Zhang, X.; Yao, G.; Ding, Y.; Hao, K. An improved immune system-inspired routing recovery scheme for energy harvesting wireless sensor networks. Soft Comput. 2017, 21, 5893–5904. [Google Scholar] [CrossRef]
  36. Jing, W.; Liu, Y. Multiple DAGs reliability model and fault-tolerant scheduling algorithm in cloud computing system. Comput. Model. New Technol. 2014, 18, 22–30. [Google Scholar]
Figure 1. System architecture.
Figure 2. Mapping of resource requirements to container configuration.
Figure 3. DeepMCC diagram.
Figure 4. Instance of s c h e d u l e I m p r o v e .
Figure 5. Average QoS value for different workflow sizes.
Figure 6. The trends of QoS value and Runtime with varying number of nodes among different algorithms (K = 100).
Figure 7. The trends of QoS value and runtime with varying number of nodes among different algorithms (K = 10,000).
Figure 8. Cost optimization results under different deadlines.
Figure 9. Cost optimization results under different reliability constraints.
Figure 10. Cost optimization result under different workflow sizes.
Table 1. Related literature on microservice workflow scheduling.
Literature | Makespan/Deadline | Cost/Budget | Reliability | Algorithm
Bao et al. [13] | Heuristic
Wang et al. [14] | Heuristic
Li et al. [15] | Heuristic
Abdullah et al. [16] | Deep Learning
Yu et al. [17] | Reinforcement Learning
Our work | Graph Deep Learning & Heuristic
Table 2. Description of main symbols.
Symbol | Description
W | Microservice workflow
V | Set of tasks in W
E | Set of task dependencies in W
v_i | The i-th task in W
w_i | Computational workload of v_i
e_{i,j} | Edge between v_i and v_j
M | Set of virtual machines
m_l | The l-th virtual machine in M
p(m_l) | Price of virtual machine m_l
C | Set of candidate containers
c_k | The k-th container in C
R(c_k) | Resource vector of c_k
R^a(k) | Resource vector of m_k
ET(v_i, c_k) | Execution time of v_i on c_k
ST(v_i, c_k) | Start time of task v_i on c_k
FT(v_i, c_k) | Finish time of task v_i on c_k
a_i | Requirement parameters of v_i
s_{i,j} | The j-th candidate container of v_i
Q | Quality of Service (QoS)
w | Weights of user's QoS preferences
x_{i,j} | Decision variable for task-to-container assignment
Prob_{i,j} | Probability of task-to-container assignment
R(v_i, c_k, m_l) | Reliability of v_i in c_k on m_l
rep(v_i) | Set of replicas for v_i
π | Scheduling scheme
Table 3. Amazon EC2 pricing.
Instance Type | CPU Cores | Memory | BTU
m5.4xlarge | 16 vCPU | 64 GB | USD 0.7680
m5.2xlarge | 8 vCPU | 32 GB | USD 0.3840
m4.xlarge | 4 vCPU | 16 GB | USD 0.2000
m4.large | 2 vCPU | 8 GB | USD 0.1000
t2.medium | 2 vCPU | 4 GB | USD 0.0464
t2.small | 1 vCPU | 2 GB | USD 0.0230
Table 4. Hyperparameter settings for the algorithms.
Algorithm | Hyperparameters
DeepMCC | batch_size: 16, learning_rate: 0.0005, epoch_num: 30, gnn_layer: 6
MPGA | population_size: 100, group_num: 4, cross_rate: 0.5, epoch: 600
DDQN | frames: 10,000, batch_size: 32, buffer_size: 10,000, learning_rate: 0.01, GAMMA: 0.9, min_eps: 0.01, max_eps: 0.9, eps_frames: 10,000, sampling_weight: 0.4, Q_updates_num: 500
QoS-DRL | iter_num: 10, pretrain_epoch: 360, learning_epoch: 300, drl_lr: 0.0001, pretrain_lr: 0.001, sample_num: 64, best_num: 64
Table 5. Average QoS value.
Algorithm | K | N = 20 | N = 40 | N = 60 | N = 80 | N = 100
DeepMCC | 100 | 0.928 | 0.931 | 0.934 | 0.928 | 0.929
DeepMCC | 1000 | 0.9 | 0.891 | 0.879 | 0.871 | 0.868
DeepMCC | 10,000 | 0.873 | 0.849 | 0.822 | 0.811 | 0.806
MPGA | 100 | 0.926 | 0.934 | 0.941 | 0.937 | 0.94
MPGA | 1000 | 0.874 | 0.873 | 0.87 | 0.854 | 0.843
MPGA | 10,000 | 0.822 | 0.811 | 0.798 | 0.769 | 0.745
DDQN | 100 | 0.908 | 0.854 | 0.8 | 0.764 | 0.733
DDQN | 1000 | 0.839 | 0.75 | 0.658 | 0.623 | 0.593
DDQN | 10,000 | 0.77 | 0.644 | 0.515 | 0.482 | 0.452
QoS-DRL | 100 | 0.928 | 0.937 | 0.945 | 0.942 | 0.946
QoS-DRL | 1000 | 0.878 | 0.873 | 0.865 | 0.836 | 0.812
QoS-DRL | 10,000 | 0.828 | 0.807 | 0.783 | 0.727 | 0.677
Table 6. Average runtime (unit: seconds).
Algorithm | K | N = 20 | N = 40 | N = 60 | N = 80 | N = 100
DeepMCC | 100 | 0.653 | 5.259 | 9.785 | 21.112 | 33.75
DeepMCC | 1000 | 1.417 | 7.023 | 12.455 | 21.907 | 33.177
DeepMCC | 10,000 | 2.154 | 8.757 | 15.053 | 23.318 | 33.042
MPGA | 100 | 0.756 | 13.023 | 25.09 | 61.922 | 102.598
MPGA | 1000 | 2.877 | 22.347 | 41.261 | 77.37 | 119.899
MPGA | 10,000 | 4.945 | 31.626 | 57.198 | 95.027 | 138.799
DDQN | 100 | 5.88 | 23.204 | 40.173 | 41.345 | 45.084
DDQN | 1000 | 36.676 | 45.509 | 53.212 | 53.054 | 57.298
DDQN | 10,000 | 66.786 | 67.55 | 65.946 | 66.047 | 70.28
QoS-DRL | 100 | 7.298 | 53.418 | 98.72 | 180.362 | 273.203
QoS-DRL | 1000 | 12.006 | 62.568 | 111.575 | 190.02 | 284.231
QoS-DRL | 10,000 | 16.491 | 71.387 | 123.782 | 204.99 | 299.02