Microservice Workflow Scheduling with a Resource Configuration Model Under Deadline and Reliability Constraints
Abstract
1. Introduction
- Precise Resource Configuration: A deep learning model for microservice container configuration (DeepMCC) is designed. Leveraging the strengths of Graph Neural Networks (GNNs) in processing graph-structured data, the approach efficiently configures container resources for each service to meet QoS requirements.
- System Reliability Enhancement: A container replication strategy is employed to enhance system redundancy and reliability. Additionally, the container migration strategy is designed to improve resource utilization.
- Cost-optimized Scheduling: For the dual-layer virtual resource environment of containers and virtual machines, a reliability microservice workflow scheduling algorithm (RMWS) is proposed. This algorithm integrates container configuration, fault tolerance, and container migration to optimize cost. Experiments demonstrate that RMWS can minimize cost while ensuring reliability compared to relevant algorithms.
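The effect of the container replication strategy on reliability can be illustrated with a short sketch. This is an illustration under a standard independent-failure assumption, not the paper's exact formulation: a task served by several container replicas fails only if every replica fails.

```python
def combined_reliability(rels):
    """Reliability of a task replicated across containers: the task
    fails only if all replicas fail (independent failures assumed)."""
    p_fail = 1.0
    for r in rels:
        p_fail *= (1.0 - r)
    return 1.0 - p_fail

def replicas_needed(r_single, r_req, max_replicas=10):
    """Smallest replica count whose combined reliability meets r_req,
    or None if max_replicas is not enough."""
    for k in range(1, max_replicas + 1):
        if combined_reliability([r_single] * k) >= r_req:
            return k
    return None
```

For example, two replicas with individual reliability 0.9 yield a combined reliability of 0.99, which is why replication lets the scheduler meet a reliability requirement that no single container configuration satisfies.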
2. Related Work
2.1. Workflow Scheduling
2.2. Microservice Workflow Scheduling
2.3. Fault-Tolerant Strategies in Scheduling
3. System Architecture and Problem Description
3.1. System Architecture
3.2. Workflow and Resource Model
3.3. Container Configuration Model
3.4. Failure Model
3.5. Problem Formulation
- If a running container can already handle the task, no container initialization is required.
- If no suitable microservice container can handle the task, a new container is created and deployed on an existing virtual machine; container initialization is therefore necessary.
- If neither an available microservice container nor a suitable existing virtual machine exists, both a new container and a new virtual machine are created.
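The three placement cases above can be sketched as follows. This is a minimal illustration; the helper names `can_handle` and `has_capacity` are hypothetical stand-ins for the paper's container-suitability and VM-capacity checks.

```python
def place_task(task, running_containers, vms, can_handle, has_capacity):
    """Place a task following the three cases described above.
    Returns (container, vm, init_required)."""
    # Case 1: a running container can already handle the task.
    for c in running_containers:
        if can_handle(c, task):
            return c, c["vm"], False
    # Case 2: no suitable container, but an existing VM has room:
    # create a container there (container initialization required).
    for vm in vms:
        if has_capacity(vm, task):
            c = {"task_type": task["type"], "vm": vm}
            running_containers.append(c)
            return c, vm, True
    # Case 3: neither exists: lease a new VM and create a container on it.
    vm = {"id": len(vms)}
    vms.append(vm)
    c = {"task_type": task["type"], "vm": vm}
    running_containers.append(c)
    return c, vm, True
```

A scheduler would call this per task in topological order; only cases 2 and 3 add container initialization time (and case 3 additionally adds VM lease cost) to the schedule.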
4. Proposed Methods
4.1. DeepMCC
- Embedding Block: The initial step in task feature processing takes the user-submitted workflow graph as input. Each task node v has a candidate configuration set, and the edges E indicate task dependencies. The embedding block extracts features from each node and transforms them through a multi-layer perceptron (MLP) to obtain refined feature vectors.
- GCN Block: The feature vectors of each node are updated iteratively. In each iteration, the feature vector of the current node is updated based on the features of its neighbor nodes. After L iterations, a set of feature matrices is formed; these matrices are concatenated into a single representation that contains the updated feature information for all nodes in the graph.
- Global Pooling Block and MLP Block: The Global Pooling Block applies average pooling to the node features to extract a global feature vector. This global vector is replicated N times (where N is the number of task nodes) and concatenated with the per-node features. The result is fed into an MLP to obtain the final feature representation, integrating both local and global information.
- GAT Block: For each task v, the features of its candidate resource set and of each candidate resource configuration are transformed using multi-layer perceptrons. An attention coefficient, computed by a multi-layer perceptron with a Tanh activation function, represents the importance of each candidate resource configuration. The attention coefficients are normalized with the softmax function to obtain the selection probability of each candidate resource configuration.
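As a rough sketch of how the four blocks fit together, the following is a toy NumPy forward pass with randomly initialized single-layer MLPs standing in for the paper's learned MLPs; it mirrors the block structure (embedding, GCN averaging, global pooling with N-fold replication, and softmax attention over candidates), not the trained DeepMCC model.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w):
    # Single-layer stand-in for the paper's multi-layer perceptrons.
    return np.tanh(x @ w)

def deepmcc_forward(feat, adj, cand_feat, L=2, d=8):
    """feat: (N, d_in) task features; adj: (N, N) adjacency with self-loops;
    cand_feat: per-task list of (m_v, d_in) candidate-configuration features.
    Returns one selection distribution over candidates per task."""
    n, d_in = feat.shape
    w_embed = rng.standard_normal((d_in, d))
    h = mlp(feat, w_embed)                          # Embedding block
    deg = adj.sum(1, keepdims=True)
    for _ in range(L):                              # GCN block: average neighbors
        h = mlp(adj @ h / deg, rng.standard_normal((d, d)))
    g = np.tile(h.mean(0), (n, 1))                  # Global pooling, replicated N times
    z = mlp(np.concatenate([h, g], 1), rng.standard_normal((2 * d, d)))
    probs = []
    for v in range(n):                              # GAT block: attention over candidates
        c = mlp(cand_feat[v], w_embed)
        pair = np.concatenate([np.tile(z[v], (len(c), 1)), c], 1)
        scores = np.tanh(pair) @ rng.standard_normal(2 * d)
        e = np.exp(scores - scores.max())           # softmax normalization
        probs.append(e / e.sum())
    return probs
```

Each returned vector is a valid probability distribution over one task's candidate configurations, from which the highest-probability configuration can be selected.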
Algorithm 1: DeepMCC model.
Algorithm 2: Training of DeepMCC.
4.2. RMWS
Algorithm 3: RMWS.
Algorithm 4.
4.2.1. Heuristic Strategies
- Resource Vector Distance:
- Kullback–Leibler Distance:
- Load Balancing Rate:
- Select the VM in the candidate set with the smallest resource vector distance;
- Select the VM in the candidate set with the smallest Kullback–Leibler distance;
- Select the VM in the candidate set with the smallest load balancing rate.
- Lease the lowest-cost VM for the container;
- Lease the VM for the container with the smallest resource vector distance;
- Lease the VM for the container with the smallest Kullback–Leibler distance.
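The three heuristics can be sketched as below, assuming common definitions — Euclidean distance between resource vectors, KL divergence between normalized resource distributions, and variance of per-dimension utilization — which may differ in detail from the paper's exact formulas.

```python
import math

def rv_distance(task_rv, vm_free_rv):
    """Resource vector distance: Euclidean distance between the task's
    demand vector and the VM's free-resource vector (smaller = tighter fit)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(task_rv, vm_free_rv)))

def kl_distance(task_rv, vm_free_rv, eps=1e-12):
    """KL divergence between the normalized demand and free-resource
    distributions (smaller = more similar resource shape)."""
    p = [x / sum(task_rv) for x in task_rv]
    q = [x / sum(vm_free_rv) for x in vm_free_rv]
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def load_balance_rate(vm_used_rv, vm_total_rv):
    """Load balancing rate: variance of per-dimension utilization
    (smaller = more evenly used resources)."""
    util = [u / t for u, t in zip(vm_used_rv, vm_total_rv)]
    mean = sum(util) / len(util)
    return sum((u - mean) ** 2 for u in util) / len(util)

def pick_vm(vms, metric):
    """Each strategy above reduces to: pick the VM minimizing its metric."""
    return min(vms, key=metric)
```

For example, `pick_vm(vms, lambda v: rv_distance(task_rv, v["free"]))` implements the first selection strategy; swapping the metric gives the others.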
4.2.2. Improving the Scheduling Solution
Algorithm 5.
Algorithm 6.
5. Experiment
5.1. Experimental Setting
- Container Configuration Data: The foundation of our container configuration data is the Quality of Web Service (QWS) dataset [31], which comprises 2507 real-world web service entries. To create a structured resource set, we applied text clustering to these entries, producing 200 distinct clusters. Each cluster serves as a candidate resource set, with sizes ranging from 2 to 200 entries, providing a diverse range of options for our simulations.
- Microservice Application Datasets: Our microservice application datasets are derived from four reputable scientific workflow datasets: Cybershake, LIGO, Montage, and SIPHT. These datasets encompass workflows of varying complexity, with the number of tasks in a single workflow ranging from 100 to 1000 in increments of 100. This wide range allows us to test the robustness and scalability of our proposed methods across different workload sizes.
- Training Samples for DeepMCC: We configured the generator to produce workflows with task counts ranging from 20 to 100, ensuring a diverse set of training instances. A Genetic Algorithm (GA) was then employed to determine near-optimal solutions for these workflows, which served as our training and validation data samples. For each sample size within the specified range, we generated 1000 training samples, 100 validation samples, and 50 test samples randomly.
- User-Defined Deadlines and Reliability Requirements: To simulate real-world user constraints, we set user-defined deadlines based on the earliest finish time of each workflow, calculated by multiplying the earliest finish time by a factor that ranged from 0.02 to 0.2 in increments of 0.02. Similarly, user-defined reliability requirements were varied from 0.7 to 0.999 in increments of 0.05, allowing us to assess the model's performance under different reliability constraints.
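The parameter grid above can be generated as in the sketch below. The deadline formula D = (1 + α)·EFT is an assumption on our part, since a factor of 0.02–0.2 applied directly to the earliest finish time would yield deadlines below any feasible finish time.

```python
def experiment_settings(eft):
    """Build the deadline and reliability grids described above.
    eft: the workflow's earliest finish time.
    Assumes deadline D = (1 + alpha) * eft (not stated in the source)."""
    alphas = [round(0.02 * i, 2) for i in range(1, 11)]   # 0.02 .. 0.2
    deadlines = [(a, (1 + a) * eft) for a in alphas]
    rels = [round(0.7 + 0.05 * i, 2) for i in range(6)]   # 0.70 .. 0.95
    rels.append(0.999)                                    # stated upper end
    return deadlines, rels
```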
- Virtual Machine Prices: The prices of virtual machines used in our simulations are based on the billing model of Amazon Elastic Container Service (Amazon ECS) [32]. Table 3 outlines the unit costs of various on-demand AWS EC2 instances, providing a realistic pricing structure for our cost-optimization analyses.
5.2. Performance Evaluation of DeepMCC
5.3. Performance Evaluation of RMWS
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Liu, G.; Huang, B.; Liang, Z.; Qin, M.; Zhou, H.; Li, Z. Microservices: Architecture, container, and challenges. In Proceedings of the 2020 IEEE 20th International Conference on Software Quality, Reliability and Security Companion (QRS-C), Macau, China, 11–14 December 2020; pp. 629–635.
2. Houmani, Z.; Balouek-Thomert, D.; Caron, E.; Parashar, M. Enhancing microservices architectures using data-driven service discovery and QoS guarantees. In Proceedings of the 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), Melbourne, VIC, Australia, 11–14 May 2020; pp. 290–299.
3. Wu, Q.; Ishikawa, F.; Zhu, Q.; Xia, Y.; Wen, J. Deadline-Constrained Cost Optimization Approaches for Workflow Scheduling in Clouds. IEEE Trans. Parallel Distrib. Syst. 2017, 28, 3401–3412.
4. Chakravarthi, K.; Loganathan, S.; Vaidehi, V. Cost-effective workflow scheduling approach on cloud under deadline constraint using firefly algorithm. Appl. Intell. 2021, 51, 1629–1644.
5. Sahni, J.; Vidyarthi, D.P. A Cost-Effective Deadline-Constrained Dynamic Scheduling Algorithm for Scientific Workflows in a Cloud Environment. IEEE Trans. Cloud Comput. 2018, 6, 2–18.
6. Toussi, G.; Naghibzadeh, M.; Abrishami, S.; Taheri, H.; Abrishami, H. EDQWS: An enhanced divide and conquer algorithm for workflow scheduling in cloud. J. Cloud Comput. 2022, 11, 13.
7. Wu, F.; Wu, Q.; Tan, Y.; Li, R.; Wang, W. PCP-B2: Partial critical path budget balanced scheduling algorithms for scientific workflow applications. Future Gener. Comput. Syst. 2016, 60, 22–34.
8. Ghafouri, R.; Movaghar, A.; Mohsenzadeh, M. A budget constrained scheduling algorithm for executing workflow application in infrastructure as a service clouds. Peer Peer Netw. Appl. 2019, 12, 241–268.
9. Faragardi, H.R.; Saleh Sedghpour, M.R.; Fazliahmadi, S.; Fahringer, T.; Rasouli, N. GRP-HEFT: A Budget-Constrained Resource Provisioning Scheme for Workflow Scheduling in IaaS Clouds. IEEE Trans. Parallel Distrib. Syst. 2020, 31, 1239–1254.
10. Gu, H.; Li, X.; Liu, M.; Wang, S. Scheduling method with adaptive learning for microservice workflows with hybrid resource provisioning. Int. J. Mach. Learn. Cybern. 2021, 12, 3037–3048.
11. Guerrero, C.; Lera, I.; Juiz, C. Resource optimization of container orchestration: A case study in multi-cloud microservices-based applications. J. Supercomput. 2018, 74, 2956–2983.
12. He, X.; Tu, Z.; Wagner, M.; Xu, X.; Wang, Z. Online Deployment Algorithms for Microservice Systems with Complex Dependencies. IEEE Trans. Cloud Comput. 2023, 11, 1746–1763.
13. Bao, L.; Wu, C.; Bu, X.; Ren, N.; Shen, M. Performance Modeling and Workflow Scheduling of Microservice-Based Applications in Clouds. IEEE Trans. Parallel Distrib. Syst. 2019, 30, 2114–2129.
14. Wang, S.; Ding, Z.; Jiang, C. Elastic Scheduling for Microservice Applications in Clouds. IEEE Trans. Parallel Distrib. Syst. 2021, 32, 98–115.
15. Li, Z.; Yu, H.; Fan, G.; Zhang, J. Cost-Efficient Fault-Tolerant Workflow Scheduling for Deadline-Constrained Microservice-Based Applications in Clouds. IEEE Trans. Netw. Serv. Manag. 2023, 20, 3220–3232.
16. Lakhan, A.; Mohammed, M.A.; Rashid, A.N.; Kadry, S.; Abdulkareem, K.H.; Nedoma, J.; Martinek, R.; Razzak, I. Restricted Boltzmann Machine Assisted Secure Serverless Edge System for Internet of Medical Things. IEEE J. Biomed. Health Inform. 2023, 27, 673–683.
17. Yu, X.; Wu, W.; Wang, Y. Integrating Cognition Cost with Reliability QoS for Dynamic Workflow Scheduling Using Reinforcement Learning. IEEE Trans. Serv. Comput. 2023, 16, 2713–2726.
18. Li, Z.; Chen, Q.; Koltun, V. Combinatorial Optimization with Graph Convolutional Networks and Guided Tree Search. arXiv 2018, arXiv:1810.10659.
19. Wang, X.; Xu, H.; Wang, X.; Xu, X.; Wang, Z. A Graph Neural Network and Pointer Network-Based Approach for QoS-Aware Service Composition. IEEE Trans. Serv. Comput. 2023, 16, 1589–1603.
20. Liu, M.; Tu, Z.; Xu, H.; Xu, X.; Wang, Z. DySR: A Dynamic Graph Neural Network Based Service Bundle Recommendation Model for Mashup Creation. IEEE Trans. Serv. Comput. 2023, 16, 2592–2605.
21. Dong, T.; Xue, F.; Tang, H.; Xiao, C. Deep reinforcement learning for fault-tolerant workflow scheduling in cloud environment. Appl. Intell. 2022, 53, 9916–9932.
22. Zheng, Q.; Veeravalli, B.; Tham, C.K. On the Design of Fault-Tolerant Scheduling Strategies Using Primary-Backup Approach for Computational Grids with Low Replication Costs. IEEE Trans. Comput. 2009, 58, 380–393.
23. Benoit, A.; Hakem, M.; Robert, Y. Fault tolerant scheduling of precedence task graphs on heterogeneous platforms. In Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing, Miami, FL, USA, 14–18 April 2008; pp. 1–8.
24. Benoit, A.; Hakem, M.; Robert, Y. Contention awareness and fault-tolerant scheduling for precedence constrained tasks in heterogeneous systems. Parallel Comput. 2009, 35, 83–108.
25. Zhao, L.; Ren, Y.; Xiang, Y.; Sakurai, K. Fault-tolerant scheduling with dynamic number of replicas in heterogeneous systems. In Proceedings of the 2010 IEEE 12th International Conference on High Performance Computing and Communications (HPCC), Melbourne, VIC, Australia, 1–3 September 2010; pp. 434–441.
26. Xie, G.; Wei, Y.; Le, Y.; Li, R. Redundancy Minimization and Cost Reduction for Workflows with Reliability Requirements in Cloud-Based Services. IEEE Trans. Cloud Comput. 2022, 10, 633–647.
27. Plankensteiner, K.; Prodan, R. Meeting Soft Deadlines in Scientific Workflows Using Resubmission Impact. IEEE Trans. Parallel Distrib. Syst. 2012, 23, 890–901.
28. Xie, G.; Zeng, G.; Chen, Y.; Bai, Y.; Zhou, Z.; Li, R.; Li, K. Minimizing Redundancy to Satisfy Reliability Requirement for a Parallel Application on Heterogeneous Service-Oriented Systems. IEEE Trans. Serv. Comput. 2020, 13, 871–886.
29. Hu, B.; Cao, Z. Minimizing Resource Consumption Cost of DAG Applications With Reliability Requirement on Heterogeneous Processor Systems. IEEE Trans. Ind. Inform. 2020, 16, 7437–7447.
30. Qu, L.; Khabbaz, M.; Assi, C. Reliability-Aware Service Chaining In Carrier-Grade Softwarized Networks. IEEE J. Sel. Areas Commun. 2018, 36, 558–573.
31. Al-Masri, E.; Mahmoud, Q.H. Discovering the best web service: A neural network-based solution. In Proceedings of the 2009 IEEE International Conference on Systems, Man and Cybernetics, San Antonio, TX, USA, 11–14 October 2009; pp. 4250–4255.
32. Amazon Elastic Container Service. 2024. Available online: https://aws.amazon.com/ecs/ (accessed on 15 January 2025).
33. Xie, Y.; Gui, F.X.; Wang, W.J.; Chien, C.F. A Two-stage Multi-population Genetic Algorithm with Heuristics for Workflow Scheduling in Heterogeneous Distributed Computing Environments. IEEE Trans. Cloud Comput. 2023, 11, 1446–1460.
34. Tong, Z.; Ye, F.; Liu, B.; Cai, J.; Mei, J. DDQN-TS: A novel bi-objective intelligent scheduling algorithm in the cloud environment. Neurocomputing 2021, 455, 419–430.
35. Zhang, X.; Yao, G.; Ding, Y.; Hao, K. An improved immune system-inspired routing recovery scheme for energy harvesting wireless sensor networks. Soft Comput. 2017, 21, 5893–5904.
36. Jing, W.; Liu, Y. Multiple DAGs reliability model and fault-tolerant scheduling algorithm in cloud computing system. Comput. Model. New Technol. 2014, 18, 22–30.
| Literature | Makespan/Deadline | Cost/Budget | Reliability | Algorithm |
|---|---|---|---|---|
| Bao et al. [13] | ✓ | ✓ | | Heuristic |
| Wang et al. [14] | ✓ | ✓ | | Heuristic |
| Li et al. [15] | ✓ | ✓ | ✓ | Heuristic |
| Abdullah et al. [16] | ✓ | | | Deep Learning |
| Yu et al. [17] | | | ✓ | Reinforcement Learning |
| Our work | ✓ | ✓ | ✓ | Graph Deep Learning & Heuristic |
| Symbol | Description |
|---|---|
| W | Microservice workflow |
| V | Set of tasks in W |
| E | Set of task dependencies in W |
| | The i-th task in W |
| | Computational workload of |
| | Edge between and |
| M | Set of virtual machines |
| | The l-th virtual machine in M |
| | Price of virtual machine |
| C | Set of candidate containers |
| | The k-th container in C |
| | Resource vector of |
| | Resource vector of |
| | Execution time of on |
| | Start time of task on |
| | Finish time of task on |
| | Requirement parameters of |
| | The j-th candidate container of |
| Q | Quality of Service (QoS) |
| w | Weights of user's QoS preferences |
| | Decision variable for task-to-container assignment |
| | Probability of task-to-container assignment |
| | Reliability of in on |
| | Set of replicas for |
| | Scheduling scheme |
Table 3. Unit costs of on-demand AWS EC2 instances.

| Instance Type | CPU Cores | Memory | BTU |
|---|---|---|---|
| m5.4xlarge | 16 vCPU | 64 GB | USD 0.7680 |
| m5.2xlarge | 8 vCPU | 32 GB | USD 0.3840 |
| m4.xlarge | 4 vCPU | 16 GB | USD 0.2000 |
| m4.large | 2 vCPU | 8 GB | USD 0.1000 |
| t2.medium | 2 vCPU | 4 GB | USD 0.0464 |
| t2.small | 1 vCPU | 2 GB | USD 0.0230 |
| Algorithm | Hyperparameters |
|---|---|
| DeepMCC | batch_size: 16, learning_rate: 0.0005, epoch_num: 30, gnn_layer: 6 |
| MPGA | population_size: 100, group_num: 4, cross_rate: 0.5, epoch: 600 |
| DDQN | frames: 10,000, batch_size: 32, buffer_size: 10,000, learning_rate: 0.01, GAMMA: 0.9, min_eps: 0.01, max_eps: 0.9, eps_frames: 10,000, sampling_weight: 0.4, Q_updates_num: 500 |
| QoS-DRL | iter_num: 10, pretrain_epoch: 360, learning_epoch: 300, drl_lr: 0.0001, pretrain_lr: 0.001, sample_num: 64, best_num: 64 |
| Algorithm | K | N = 20 | N = 40 | N = 60 | N = 80 | N = 100 |
|---|---|---|---|---|---|---|
| DeepMCC | 100 | 0.928 | 0.931 | 0.934 | 0.928 | 0.929 |
| | 1000 | 0.900 | 0.891 | 0.879 | 0.871 | 0.868 |
| | 10,000 | 0.873 | 0.849 | 0.822 | 0.811 | 0.806 |
| MPGA | 100 | 0.926 | 0.934 | 0.941 | 0.937 | 0.940 |
| | 1000 | 0.874 | 0.873 | 0.870 | 0.854 | 0.843 |
| | 10,000 | 0.822 | 0.811 | 0.798 | 0.769 | 0.745 |
| DDQN | 100 | 0.908 | 0.854 | 0.800 | 0.764 | 0.733 |
| | 1000 | 0.839 | 0.750 | 0.658 | 0.623 | 0.593 |
| | 10,000 | 0.770 | 0.644 | 0.515 | 0.482 | 0.452 |
| QoS-DRL | 100 | 0.928 | 0.937 | 0.945 | 0.942 | 0.946 |
| | 1000 | 0.878 | 0.873 | 0.865 | 0.836 | 0.812 |
| | 10,000 | 0.828 | 0.807 | 0.783 | 0.727 | 0.677 |
| Algorithm | K | N = 20 | N = 40 | N = 60 | N = 80 | N = 100 |
|---|---|---|---|---|---|---|
| DeepMCC | 100 | 0.653 | 5.259 | 9.785 | 21.112 | 33.750 |
| | 1000 | 1.417 | 7.023 | 12.455 | 21.907 | 33.177 |
| | 10,000 | 2.154 | 8.757 | 15.053 | 23.318 | 33.042 |
| MPGA | 100 | 0.756 | 13.023 | 25.090 | 61.922 | 102.598 |
| | 1000 | 2.877 | 22.347 | 41.261 | 77.370 | 119.899 |
| | 10,000 | 4.945 | 31.626 | 57.198 | 95.027 | 138.799 |
| DDQN | 100 | 5.880 | 23.204 | 40.173 | 41.345 | 45.084 |
| | 1000 | 36.676 | 45.509 | 53.212 | 53.054 | 57.298 |
| | 10,000 | 66.786 | 67.550 | 65.946 | 66.047 | 70.280 |
| QoS-DRL | 100 | 7.298 | 53.418 | 98.720 | 180.362 | 273.203 |
| | 1000 | 12.006 | 62.568 | 111.575 | 190.020 | 284.231 |
| | 10,000 | 16.491 | 71.387 | 123.782 | 204.990 | 299.020 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, W.; Li, X.; Chen, L.; Wang, M. Microservice Workflow Scheduling with a Resource Configuration Model Under Deadline and Reliability Constraints. Sensors 2025, 25, 1253. https://doi.org/10.3390/s25041253