Efficient Orchestration of Distributed Workloads in Multi-Region Kubernetes Cluster

Furnadzhiev, Radoslav; Shopov, Mitko; Kakanakov, Nikolay

doi:10.3390/computers14040114

Open AccessArticle

Efficient Orchestration of Distributed Workloads in Multi-Region Kubernetes Cluster

by

Radoslav Furnadzhiev

,

Mitko Shopov

^*

and

Nikolay Kakanakov

Department of Computer Systems and Technologies, Faculty of Electronics and Automation, Technical University of Sofia, Plovdiv Branch, 1797 Sofia, Bulgaria

^*

Author to whom correspondence should be addressed.

Computers 2025, 14(4), 114; https://doi.org/10.3390/computers14040114

Submission received: 4 February 2025 / Revised: 17 March 2025 / Accepted: 19 March 2025 / Published: 21 March 2025

(This article belongs to the Special Issue Edge and Fog Computing for Internet of Things Systems (2nd Edition))

Download

Browse Figures

Versions Notes

Abstract

:

Distributed Kubernetes clusters provide robust solutions for geo-redundancy and fault tolerance in modern cloud architectures. However, default scheduling mechanisms primarily optimize for resource availability, often neglecting network topology, inter-node latency, and global resource efficiency, leading to suboptimal task placement in multi-region deployments. This paper proposes network-aware scheduling plugins that integrate heuristic, metaheuristic, and linear programming methods to optimize resource utilization and inter-zone communication latency for containerized workloads, particularly Apache Spark batch-processing tasks. Unlike the default scheduler, the presented approach incorporates inter-node latency constraints and prioritizes locality-aware scheduling, ensuring efficient pod distribution while minimizing network overhead. The proposed plugins are evaluated using the kube-scheduler-simulator, a tool that replicates Kubernetes scheduling behavior without deploying real workloads. Experiments cover multiple cluster configurations, varying in node count, region count, and inter-region latencies, with performance metrics recorded for scheduler efficiency, inter-zone communication impact, and execution time across different optimization algorithms. The obtained results indicate that network-aware scheduling approaches significantly improve latency-aware placement decisions, achieving lower inter-region communication delays while maintaining resource efficiency.

Keywords:

cloud computing; Kubernetes; orchestration; multi-region

1. Introduction

As modern computing infrastructure moves towards cloud-native architectures, Kubernetes has emerged as the dominant platform for orchestrating containerized workloads. Apache Spark has emerged as a leading platform for large-scale data processing, offering powerful parallel computing capabilities, fault tolerance, and scalability. With the rapid adoption of containerization and cloud-native architectures, deploying Apache Spark on Kubernetes has become a popular approach. Kubernetes, with its additional orchestration features, efficient resource management, and dynamic scaling, has become a complementary environment for running Spark applications. However, effectively scheduling these workloads on a Kubernetes cluster, especially one distributed across multiple availability zones, introduces unique challenges and opportunities.

The assignment of tasks to processors is a fundamental aspect that significantly affects execution and communication costs. Early models such as those proposed in [1] formalized task assignment by minimizing the combined cost of computation and interprocessor communication, a problem modeled as an NP-complete optimization. As progress was made in these models, additional constraints, such as limited memory on specific processors, were integrated.

In their study [2], Abdul-Rahman and Ayda examine in-depth user behavior on the Google Cloud infrastructure by analyzing data from publicly available Google trace logs. The complexity of cloud computing environments, such as Google’s back-end systems, and their ability to handle a variety of workloads ranging from large-scale web services to compute-intensive operations such as MapReduce were also studied. The authors highlight the phenomenon of ‘mice and elephants’, in which a small group of users (elephants) dominate resource consumption, while the majority of users (mice) contribute minimally. Their findings demonstrate that the recognition of user-specific patterns can guide policy-driven resource planning and scheduling. They also highlight the critical need to design infrastructure that can accommodate both high-priority, latency-sensitive processes and opportunistic, lower-priority workloads. Kubernetes, a widely adopted container orchestration platform, provides robust scalability, fault tolerance, and service continuity. Although it provides robust scaling and fault tolerance solutions, its default scheduling mechanisms often fail to effectively handle complex, latency-sensitive, and geographically dispersed workloads. The emergence of multi-regional infrastructures further compounds these challenges [3,4,5]. In modern cloud architectures, orchestrating containerized workloads across large clusters involves complex trade-offs between computational efficiency, communication latency, and fault tolerance. The rapid adoption of microservices, edge computing, and Internet of Things (IoT) applications highlights the need for advanced scheduling mechanisms capable of handling these trade-offs [3,4,5,6,7].

The scheduler, a component of Kubernetes container orchestration platform, is responsible for assigning pods (smallest deployable units) onto nodes in a cluster while adhering to constraints such as resource availability, affinity/anti-affinity rules, and QoS requirements. The default Kubernetes scheduler only considers current optimal node selection [8]. This approach often results in a locally optimal solution rather than a globally optimal one. It is based on heuristic algorithms that prioritize immediate scheduling needs. This neglects global optimization and could lead to suboptimal resource utilization. Traditional Kubernetes schedulers prioritize resource allocation based on CPU and memory constraints, without explicitly considering the underlying network topology or the interdependencies among distributed application components [8].

1.1. Related Work

In the field of Kubernetes scheduling, researchers [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19] have proposed various algorithms that address resource allocation and workload distribution. These algorithms can be broadly classified into heuristic approaches, metaheuristic-based methods, and hybrid frameworks that combine multiple strategies. For example, while the default Kubernetes scheduler relies primarily on heuristic rules, advanced proposals often incorporate metaheuristic techniques such as evolutionary algorithms or swarm intelligence to optimize scheduling decisions. The complexity of scheduling containerized workloads in Kubernetes, especially in geo-distributed environments, requires robust optimization algorithms. These algorithms address multi-objective optimization challenges such as minimizing latency, balancing resource usage, and ensuring high availability. The default scheduler struggles to accommodate complex workloads like machine learning computations, edge computing, and heterogeneous environments. Custom schedulers address these gaps through multi-objective optimization, focusing on energy efficiency and latency reduction. Numerous authors [6,7,8,9,10,11] aim to propose new algorithms to address the complexities of modern computing infrastructures within the Kubernetes framework. The importance of multi-objective optimization algorithms for solving complex optimization problems is highlighted in [7,12,13]. The properties and applications of multiple optimization algorithms are systematically reviewed, along with their respective application areas.

1.1.1. Ant Colony Optimization

Several papers [6,7,9,15] propose novel ant colony optimization (ACO) algorithms for scheduling workloads in a cloud environment, addressing key optimization objectives: balancing cluster load, minimizing network transmission overhead, improving QoS, and minimizing power usage. The Zeus system [6] implements a time sliding window algorithm that considers actual resource utilization rather than static resource requests. The system proved effective at scale, being deployed on a 5000+ node production Kubernetes cluster.

In their research, Lin et al. [7] extended previous work on ACO for cloud scheduling and presented a multi-objective optimization model called ACO-MCS (Ant Colony Optimization algorithm for Multi-objective of Container-based Microservice Scheduling). The algorithm incorporates heuristic information combining three objectives: network transmission cost between microservices, resource utilization balance across physical nodes, and service reliability based on request failures. In [15], the workload placement problem is modeled as a multidimensional container packing problem (MDBP), where physical machines represent containers and virtual machines (VMs) represent items to be packed. A novel ACO-based algorithm that dynamically calculates VM placement to minimize energy consumption while maintaining performance requirements is developed.

A hybrid ACO-PSO algorithm that uses ACO solutions as initial positions for PSO and implements adaptive weight adjustment in PSO is presented in [9]. The algorithm considered both resource cost and load balancing through a dual-objective function.

1.1.2. Non-Dominated Sorting Genetic Algorithm

Evolutionary approaches and algorithms are commonly used for resource management in cloud environments. The non-dominated Sorting Genetic Algorithm (NSGA-II) [12] in particular has been proven as performant and reliable. The elitism in NSGA-II ensures that the best solutions so far are preserved. Its iterative approach uses genetic operators to evolve scheduling plans, each representing a potential deployment strategy across a Kubernetes cluster. The work presented in [16] explores the modeling of microservices as a directed graph, where the nodes represent individual services and the edges denote interoperability relationships. The study proposes four primary objective functions for multi-objective optimization:

Resource utilization efficiency: The goal is to optimize the distribution of resources while ensuring that the requirements of microservices are met;
Load balance: This objective balances the usage of computational resources uniformly across the physical machines;
System reliability: This objective focuses on minimizing failure rates of applications, microservices, and containers;
Network distance: This objective seeks to minimize the network overhead between interconnected microservices.

These four objective functions are used to evaluate fitness and select solutions. The study employs the NSGA-II algorithm to explore these competing objectives and identify optimal container allocations.

1.1.3. Integer Linear Programming

By leveraging advanced optimization techniques such as Integer Linear Programming (ILP) and heuristics, studies [4,5,17] demonstrate the potential for substantial performance improvements in resource management. They emphasize the need for schedulers to address multidimensional constraints, including resource availability, network latency, and energy efficiency.

In [17], a container-oriented scheduling solution is presented based on an ILP model that focuses on optimizing resource allocation within containerized applications. The ILP model’s objective function integrates energy cost, container image pulling costs, and network transition costs.

Santos et al. [4] introduce a network-aware scheduling extension for Kubernetes tailored to the requirements of Fog Computing environments, particularly for Smart City applications. The proposed approach aims to address latency and bandwidth challenges in a Fog Computing environment. The extension, referred to as the Network-Aware Scheduler (NAS), builds upon Kubernetes’ existing scheduling framework using the “scheduler extender” model. This approach enables the default Kubernetes Scheduler (KS) to delegate certain scheduling tasks to the NAS, which filters and prioritizes nodes based on network criteria.

Pogonip [5] is a scheduler designed for asynchronous microservices in edge computing environments. This work targets the scheduling and placement challenges of microservice-based applications in edge cloud systems, particularly focusing on the unique requirements of MQTT message queues.

1.1.4. Simulated Annealing

Simulated annealing (SA) is a probabilistic optimization algorithm inspired by the annealing process in metallurgy, where controlled cooling allows materials to reach a stable crystalline structure. SA leverages this analogy to solve complex optimization problems by iteratively refining solutions while balancing exploration and exploitation. An SA-based approach to task scheduling has been proposed in [13,14] as an effective scheduling algorithm.

An implementation of SA for task scheduling in cloud environments is presented in [14]. It begins with a greedy algorithm to generate an initial task allocation and applies SA to optimize task assignments. The process involves reassigning tasks among physical machines based on performance metrics such as CPU, memory, and bandwidth utilization. The study demonstrates significant improvements in task completion time and system utilization, particularly as the system scales with more nodes.

Wei et al. in [19] addresses key challenges in leveraging cloud-based elastic computing resources for processing remotely sensed big data (RSBD) using a containerized Apache Spark engine deployed on Kubernetes. The proposed framework utilizes Spark’s capabilities on Kubernetes clusters, augmented with a novel task scheduling mechanism (TSM) that balances workloads and minimizes resource congestion. A Kubernetes operator is used; in its control loop, it intercepts resource requests and adjusts pod affinity based on a predefined policy, achieving performance improvements of up to 13.6% in experimental setups.

1.2. Evaluation of Kubernetes Scheduling Algorithms

When testing and evaluating new scheduling algorithms in Kubernetes environments, several approaches can be used, as evidenced by prior works [3,4,6,10,11,19]:

Direct Deployment on a Real Cluster: Some researchers integrate their custom scheduler or algorithm into an actual Kubernetes cluster to observe its behavior under realistic conditions. For example, the Zeus system by Zhang et al. [6] was deployed on a production Kubernetes cluster with more than 5000 nodes to assess its effectiveness at scale. By measuring real resource utilization and workload performance in production, they demonstrated the algorithm’s impact (Zeus focuses on co-locating workloads based on actual usage rather than static requests). This real-world testing provides high-fidelity results and proves stability at scale, though it requires extensive resources and may be less reproducible. Similarly, Huang et al. [19] evaluated their Spark scheduling enhancement by implementing it as a Kubernetes operator and running it in a cloud environment, observing performance improvements in an experimental cluster. These approaches ensure the algorithm is tested with live pods and realistic conditions.
Controlled Lab Testbeds: Many studies opt for controlled experiments on smaller clusters (physical or virtual). Researchers set up a Kubernetes cluster (on VMs, bare metal, or cloud instances) with a limited number of nodes and deploy benchmark workloads to compare scheduling policies. For instance, latency-aware scheduling research by Centofanti et al. [3] and Santos et al. [4] (both targeting edge/fog scenarios) used Kubernetes clusters where network latencies were emulated and the performance of their custom schedulers was measured relative to the default. In such testbeds, tools like Kubernetes taints/affinities can inject constraints, and network control tools can add latency to mimic specific scenarios. The advantages of a lab setup are the repeatability and the ability to instrument the cluster deeply (e.g., measuring internal scheduler metrics or application-level outcomes). However, the results might not scale linearly to larger systems.
Simulation and Emulation Tools: An emerging approach is to use simulated environments to evaluate scheduling algorithms. Simulation allows the running of the scheduler logic on hypothetical scenarios without needing full Kubernetes deployment. One can define multiple scenarios of nodes and pods with pre-set characteristics and capture the scheduling decisions and execution time for each algorithm. This approach is inspired by the need for quick, reproducible tests of scheduler behavior. The simulator executes the real scheduling code in a sanboxed environment, so the core logic is the same as a live cluster, but the outcomes are deterministic given the inputs. The downside is that simulators cannot capture run-time performance metrics (since no real pods are being started), and thus the focus is on scheduler-centric metrics like decision optimality, fairness, or algorithmic overhead. This could be addressed by focusing on the scheduler’s scoring time and placement distribution as metrics.
Trace-Driven Emulation: A variant of simulation is the use of real workload traces to drive the scheduler in either a simulator or a real cluster replay. Some of the literature (e.g., studies cited in surveys [10,11]) replays traces from Kubernetes node metrics and event logs to evaluate new scheduling algorithms under identical workload patterns. By doing so, scholars can compare how different scheduling policies would have performed on the same workload. This often involves custom harnesses that feed pod submission events to the scheduler and record decisions.

This research aims to evaluate advanced algorithms for Kubernetes-based multi-zone clusters, optimizing resource utilization and inter-zone network efficiency. The study examines both global optimization methods, such as Integer Linear Programming and Simulated Annealing, and probabilistic heuristics, including Ant Colony Optimization and the Non-Dominated Sorting Genetic Algorithm, to determine their effectiveness in dynamic scheduling scenarios. By comparing these approaches, the study aims to evaluate the trade-offs between solution quality and computational efficiency, highlighting their strengths and limitations. The presented research is based on the approach of using simulation and emulation tools.

2. Materials and Methods

In the paper, a set of network-aware scheduling plugins is proposed. The studied algorithms are implemented as custom Kubernetes scheduling plugins. The scheduler-plugins project [20], maintained by the Kubernetes Scheduling special interest group, provides a set of out-of-tree plugins for extending the default scheduler. It is a relatively new feature [20,21] that allows users to override the default scheduling algorithm at predefined extension points (Figure 1). It provides a flexible and extensible approach for adding new capabilities. The scheduling process follows a well-defined cycle, consisting of multiple extensible phases. In the score phase, every feasible node is evaluated and a score based on optimization goals and algorithm is assigned based on predefined criteria. When a pod is created, it is added to the activeQ (active queue), marking the beginning of the Kubernetes scheduling cycle. The scheduler continuously monitors this queue, selecting pending pods and evaluating their placement across available nodes. The scheduling cycle progresses through multiple phases, including PreScore, Score, and NormalizeScore, where various plugins influence the decision-making process. During this process, the scheduler computes node suitability scores. Once a pod is assigned to a node, the Bind phase finalizes the decision, transitioning the pod from the Pending state to the Running state.

Four different plugins are developed and evaluated, each intercepting the PreScore lifecycle phase to dynamically load network latency data between nodes from ConfigMaps, gather information about pods belonging to the same group, and initialize data structures used by scoring calculations in the subsequent phase. The Score phase assigns a score to each node based on one of the proposed algorithms optimizing the score functions.

Most of the papers reviewed deal with edge or fog deployments, where nodes are distributed across multiple locations. This necessitates awareness on node topology. The proposed scoring mechanism aims to integrate network topology and inter-node latency considerations into scheduling decisions and to investigate how they address the unique challenges of container workload scheduling in geographically distributed environments. To evaluate the plugin implementation, a simulated environment is used in which the scheduler’s solutions are tested against each algorithm and the time needed to score a pod placement is recorded.

Kubernetes rescheduling involves pod eviction and redeployment to different nodes, necessitating process termination. While stateless microservices can withstand such interruptions, batch-processing workloads (e.g., Apache Spark jobs) require continuous execution to prevent computational inefficiencies and data loss.

Latency measurements between nodes are taken periodically, applying these metrics exclusively during initial scheduling decisions. When infrastructure changes occur, such as data center unavailability, immediate workload rescheduling begins only for affected pods. Otherwise, the system waits for current batch jobs to complete naturally before implementing updated latency-aware scheduling in subsequent execution cycles. This approach preserves computational integrity and eliminates unnecessary workload migrations.

This methodology corresponds to established cloud scheduling practices, where disrupting large-scale distributed workloads is avoided due to potential data loss, recomputation expenses, and network overhead. By constraining rescheduling to new scheduling cycles, the system maintains stable batch execution while progressively adapting to infrastructure evolution.

The study examines a multi-zone data center topology, as illustrated in Figure 2. In a non-simulated environment, a lightweight task periodically pings all cluster hosts to measure network latency. These measurements help to establish the relative location between cluster nodes.

The scheduling problem can be formally defined using sets, resource constraints, and network latency considerations. The set P represents the collection of pods that need to be scheduled onto worker nodes in the cluster, denoted by N. Pods may belong to specific groups, G, which represent dependent tasks. Since the cluster consists of multiple worker nodes that are distributed across different availability zones, latency matrix L defines the communication delay between nodes, where

L (n_{i}, n_{j})

quantifies the latency between the nodes

n_{i}

and

n_{j}

, influencing the placement of pods belonging to the same group.

Resource constraints are also key factors in scheduling decisions. Each pod, of set P has specific resource demands, including CPU and memory requirements, denoted by

R_{c p u} (p)

and

R_{m e m} (p)

, respectively. Each pod is assigned to a node, represented by the function A(p), ensuring that the node has sufficient available resources. Each node, n, provides a fixed capacity of CPU and memory,

C_{c p u} (n)

and

C_{m e m} (n)

, while the currently allocated resources are tracked by

U_{c p u} (n)

and

U_{m e m} (n)

.

Table 1 defines the notation used in the mathematical formulation of the problem. The key parameters related to pods, nodes, resource capabilities, and network latency are defined.

The scheduling problem thus involves efficiently mapping pods to nodes while balancing resource utilization, minimizing cross-zone latency, and respecting node constraints. To optimize resource utilization and network efficiency while minimizing cross-zone communication costs, the following objective functions are defined: (1) and (2) complement a percentage of the resources used on the node if the pod is scheduled on it. Maximizing CPU and memory usage ensures efficient workload distribution.

S_{c p u} (p_{i}, n_{j}) = 1 - (U_{c p u} (n_{j}) + R_{c p u} (p_{i})) / C_{c p u} (n_{j})

(1)

S_{m e m} (p_{i}, n_{j}) = 1 - (U_{m e m} (n_{j}) + R_{m e m} (p_{i})) / C_{m e m} (n_{j})

(2)

A higher value of

S_{c p u}

and

S_{m e m}

indicates better utilization, guiding the scheduler to maximize the overall resource efficiency. Given that pods within the same group, G, may frequently communicate, reducing network latency is critical. The latency score measures the average network delay between the pod to be scheduled and pods of the same group:

S_{l a t} (p_{i}, n_{j}) = \frac{1}{|G| - 1} \sum_{k = 1, p_{k} \neq p_{i}}^{|G|} L (n_{j}, A (p_{k}))

(3)

This function computes the mean network latency between the node

n_{j}

, which is getting scored, and the nodes where the remaining pods in the group are scheduled. Lower latency values indicate better placement decisions, minimizing costly cross-zone communication.

The Score plugin assigns a score to each node, if not otherwise described, using the arithmetic mean of the three functions. Finding the global optimum of this non-convex function presents a significant computational challenge due to the presence of multiple local extrema. Several algorithmic approaches that have been proven to solve similar problems in previous research were explored.

2.1. Ant-Colony Optimization

The ACO plugin uses artificial ants to construct solutions iteratively. The plugin intercepts the PreScore lifecycle phase to collect and store node information, initialize pheromones and network latencies, all while registering each pod in a recognized group. The Score phase is then responsible for returning a final numerical evaluation that indicates how suitable a particular node is for the pod in question. The algorithm maintains pheromone traces for each node and utilizes both exploitation and exploration strategies to find optimal pod placements. The key ACO parameters are as follows: alpha—controls the relative weight assigned to pheromone intensity when an ant selects a path; beta—emphasizes the heuristic contribution of each candidate; q0—steers an ant’s decision toward deterministic exploitation or stochastic exploration; rho—controls the rate of pheromone evaporation and prevent premature convergence by guiding the search process. The chosen values are, respectively, as follows: alpha = 1.0, beta = 3.0, q0 = 0.5, and rho = 0.1.

2.2. Non-Dominated Sorting Genetic Algorithm

The scoring implementation in the NSGA-II plugin generates an initial population of 50 individuals, where each individual represents a potential node assignment with three primary objectives: CPU utilization, memory utilization, and network latency. The algorithm iterates through 50 generations to find the optimal pod placement. The final scoring calculation combines the three objectives from the best solution. The crossover operation employs a Simulated Binary Crossover (SBX) with a crossover rate of 0.9 and an eta value of 20.0, which controls the similarity between parents and offspring. The mutation operation uses a mutation rate of 0.1, along with an eta value of 20.0, allowing for controlled diversity introduction in the population. A tournament selection with a tournament size of 2 is used to select parent solutions for reproduction, maintaining selection pressure while preserving diversity in the population. The final scoring calculation combines the three objectives from the best solution (rank 1, highest crowding distance) using a weighted sum approach. Two competing objectives are defined: a resource objective

o b j 1 (p_{i}, n_{j}) = (S_{c p u} (p_{i}, n_{j}) + S_{m e m} (p_{i}, n_{j})) / 2

and a latency objective

o b j 2 (p_{i}, n_{j}) = S_{l a t} (p_{i}, n_{j})

.

2.3. Simulated Annealing

The simulated annealing algorithm calculates an energy function that considers the scheduling constraints and objectives. The core scoring mechanism utilizes simulated annealing with an initial temperature of 100.0 and a cooling rate of 0.95, continuing until the temperature drops below 0.1. The energy calculation considers the optimization objectives weighted equally. The resulting score is normalized to fit within Kubernetes framework constraints (0 to MaxNodeScore), providing a standardized metric for node selection.

2.4. Integer Linear Programming

The Integer Linear Programming formulation in this Kubernetes scheduling plugin employs the Simplex algorithm to optimize pod placement by considering multiple objectives and constraints. The specific constraints enforce CPU and memory capacity limits while incorporating a nonlinear constraint approximation for network latency using an exponential decay function. This formulation is encoded into a structured map where each key corresponds to a constraint, and the values represent the coefficients of the decision variables. The constraints are passed to the Simplex solver, along with their respective right-hand side values and direction operators.

The solution returned by the Simplex algorithm represents the optimal values for each decision variable that maximizes the objective function while satisfying all constraints. The final scheduling score is computed as a normalized weighted sum of these optimal values, ensuring the score falls within the range for compatibility with the Kubernetes scoring system.

3. Results

3.1. Experimental Setup

The kube-scheduler-simulator [22] is an open-source tool developed by SIG Scheduling that enables testing and debugging of scheduling algorithms in a controlled environment. Unlike real Kubernetes clusters, where modifying scheduler behavior requires privileged access and risks operational disruption, this simulator offers a “dry-run” environment with virtual nodes and pods. It operates a scheduler process with a simulated API server and no kubelets (client processes managing each worker node), allowing users to define virtual infrastructure resources, submit pods with specific requirements, and observe scheduling decisions without launching containers. This tool is particularly valuable for scheduling plugin developers, facilitating development without requiring actual cluster deployment.

The simulator provides a reliable multi-region testbed which facilitates the injection of custom data (including latency matrix via ConfigMaps) and scheduling plugins without deploying a complete Kubernetes cluster. Based on the KWOK toolkit (Kubernetes Without Kubelet) [23], the simulator creates virtual node objects to which the scheduler assigns pods. These pods receive annotations documenting the scheduling process, enabling detailed algorithmic analysis. It is noteworthy that the simulator evaluates only scheduling decisions and timing metrics, not operational performance metrics such as actual CPU and Memory utilization or network throughput. The fidelity of the simulator to real scheduling was sufficient for our comparative study, as it uses the actual Kubernetes scheduling code paths. By using the kube-scheduler-simulator, we ensured that our experiments are repeatable and free from external noise, focusing purely on scheduling performance and placement outcomes.

To evaluate the effectiveness of the proposed scheduling plugins, multiple experimental scenarios are defined, each representing a multi-region Kubernetes cluster with varying node counts and inter-region latencies. Table 2 outlines the cluster configurations, ranging from smaller deployments (e.g., 3 regions × 5 nodes) to larger setups (e.g., 4 regions × 20 nodes). Apache Spark is treated as a Guaranteed workload and resource requests matching actual resource usage are assumed, ensuring consistent scheduling decisions. Spark allows explicit limits on CPU and memory per worker process, aligning with static resource requests for pods. In each scenario, every node is provisioned with 16 vCPUs and 16 GiB of memory, while each Spark executor pod requests 1 vCPU and 1 GiB of memory. This strict request-to-limit matching reflects real-world Kubernetes deployments where batch-processing jobs require dedicated resource reservations to maintain predictable execution.

To emulate multi-region network constraints, a fixed latency matrix (Table 3) using Kubernetes ConfigMaps is introduced. These preconfigured values (e.g., 5 ms inter-region latency) remain unchanged throughout each simulation run, enabling repeatable experiments. The scheduler plugins access this matrix at runtime, incorporating network-aware placement strategies to optimize workload distribution. Since the kube-scheduler-simulator does not execute actual workloads, the focus is on evaluating the scheduler’s decision-making process, ensuring that Spark executors are assigned to nodes efficiently while considering inter-region communication delays.

Multiple test scenarios are created for the conducted experiments. Each test scenario consists of a set of nodes distributed across multiple regions. The number of nodes per region varies, allowing for different zone densities. The number of pods to be scheduled is adjusted to simulate varying levels of resource demand, with each pod assigned a fixed CPU and memory request. Each node has 16,000 milicores and 16,384 MiB memory available. Each pod requests 1000 milicores and 1024 MiB memory. This ensures consistency across different scheduling algorithms while allowing for controlled experimentation with proposed resource allocation strategies.

The latency between nodes is defined in ConfigMaps, which models the network delay associated with inter-node communication. In the test environment, these values remain static for consistency, but in a real cluster, latency would be periodically measured and dynamically updated to reflect changing network conditions. The latency values utilized in our experimentation mirror those observed in actual Kubernetes deployment, with measurements rounded while maintaining fidelity to real-world network behavior.

Within the same region, latency is minimal and set to 0.7 ms, reflecting low-cost, high-speed internal communication. The nodes in adjacent regions have a moderate latency of 3.0 ms, while the nodes separated by greater distances experience higher latencies of 5.0 ms or 7.0 ms. This latency structure emulates real-world multi-zone cloud deployment.

Each scheduling plugin undergoes a series of ten independent executions per test scenario to ensure statistical robustness and to capture the variability in scheduling outcomes. For probabilistic scheduling algorithms, such as ACO, SA, and NSGA-II, inherent randomness can lead to different scheduling results across multiple runs. At the end of each execution, cluster status metrics are recorded. The distribution of pods across regions is captured, showing how effectively a plugin minimizes inter-zone communication. Another performance metric is the time it takes the plugin to schedule all pods, which reflects the computational overhead associated with making scheduling decisions.

3.2. Results and Evaluation

The high variance in execution time or scheduling decisions suggests that an algorithm is more sensitive to initial conditions or random factors, while the low variance indicates stability and predictability. To analyze and visualize the results, a gonum/plot is used to generate statistical plots.

Latency distributions are represented using box plots, illustrating the distribution of pod-to-pod latencies for each plugin and individual point, marking outliers falling beyond the whiskers.

The results shown in Figure 3 and Figure 4 demonstrate the performance of each plugin in the defined scenarios. Algorithms searching for a global optimum, such as ILP and SA, experience a significant increase in execution time as the search space expands with larger cluster sizes. The reason behind this computational overhead lies in the way these algorithms explore potential scheduling solutions.

The ILP solver must examine an exponentially growing number of possible assignments to find the optimal configuration. As the number of nodes and pods increases, the number of possible scheduling combinations expand factorially, making ILP highly sensitive to cluster size. While ILP guarantees an optimal solution, its practical applicability diminishes in large clusters due to the prohibitive time required to compute results.

SA attempts to escape local optima by introducing controlled randomness in the search process. While this enables it to explore a broader solution space, its reliance on an iterative approach means that execution time scales with the number of pods and nodes. In a large cluster, SA requires significantly more iterations to converge to a globally optimal solution, increasing computational complexity.

In contrast, ACO and NSGA-II maintain nearly constant execution time. This stability arises because both algorithms employ heuristics that allow them to approximate solutions in a limited number of iterations, rather than performing an exhaustive or near-exhaustive search. Outlier and suboptimal solutions are plotted on Figure 4 as individual points. A limitation of the conducted tests is that the heuristic algorithms were configured in all scenarios with fixed parameters, restricting their ability to explore a broader range of potential solutions. Their effectiveness is highly dependent on factors like the number of ants, population size, and iteration count. Using the same parameter settings across different test scenarios may have constrained their ability to fully exploit the solution space.

The resources utilization distributions are presented on Figure 5 and Figure 6 and illustrate the percentage of resources allocated to scheduled nodes. This metric shows how effectively each algorithm balances requests for computational resources across available infrastructure. The inter-quartile range

(I Q R)

represented by each box indicates the central tendency of resource allocation and the whiskers extending to

1.5 \times I Q R

reveal the degree of variance. Outliers beyond these boundaries suggest potential over-provisioned or underutilized nodes. Significant differences in median values between algorithms indicate variations in their resource distribution strategies. Narrower IRQs show more consistent resource utilization patterns, ensuring predictable performance characteristics. In both ACO and NSGA-II, the data indicate instances where certain nodes are allocated significantly higher percentages of resources, as evidenced by extended upper whiskers in the box plots.

A significant concern identified in conducted analysis is the potential of node over-provisioning, which may cause cascading pod failures. Mitigating this vulnerability would require implementing one of several approaches: adjusting objective function weights, adding an additional load-balancing objective, or implementing a fail-safe, rejecting destabilizing allocation patterns.

The results highlight a trade-off between execution efficiency and solution quality among the scheduling algorithms. ILP and SA deliver more optimal results, but at the cost of significantly higher scheduling delay. On the other hand, ACO and NSGA-II scale well but require parameter tuning to ensure that solutions do not deteriorate in large clusters.

4. Discussion

The results demonstrate the potential to improve pod placement efficiency while reducing inter-zone communication latency. Based on these findings, the developed scheduler plugins could be deployed on a production Kubernetes cluster to further validate their effectiveness for scheduling batch jobs, processing, and maintaining large datasets. During deployment, key metrics such as resource utilization, scheduling latency, and application performance could be continuously monitored, along with system stability and pod evictions, and these data could be used to fine-tune algorithm parameters, ensuring that heuristic methods such as ACO and NSGA-II effectively balance resources and solution quality while deterministic approaches like ILP remain computationally feasible. The proposed plugins rely on static resource requests, which are typically defined with a pessimistic assumption about task execution to prevent overallocation and ensure workload stability. However, this approach does not reflect real-time resource consumption and a moving average of the used resources instead of static requests could be used, much like the one proposed in [6]. Additionally, expanding the optimization objectives beyond resource efficiency to include energy consumption, QoS policies, and cost-aware scheduling will further enhance the scheduler’s adaptability. Given the computational complexity of ILP-based scheduling, exploring Go language bindings for external solvers like Google OR-Tools and the GNU Linear Programming Kit (GLPK) could significantly improve its runtime efficiency.

While the primary focus of this study is network-aware scheduling, extending the optimization objectives to include energy efficiency presents an important opportunity for further research. In multi-region deployments, reducing inter-zone data transfers and balancing workload distribution can significantly impact overall energy consumption.

Several recent studies [7,8,13] have explored energy-aware scheduling strategies in Kubernetes, leveraging techniques such as the following:

Energy-efficient bin-packing heuristics, where workloads are consolidated onto a minimum number of active nodes while minimizing power wastage in underutilized regions.
Thermal-aware scheduling, where workloads are allocated based on server temperature metrics to improve cooling efficiency in data centers.
Renewable energy-aware scheduling, where jobs are assigned to regions with the highest availability of renewable power sources to minimize carbon footprint.

Future work could incorporate multi-objective optimization objectives to balance latency, resource efficiency, and energy consumption. For example, extending the Ant Colony Optimization (ACO) model to include energy-awareness pheromones could guide pod placement decisions based on power consumption trends. Additionally, integrating Machine Learning (ML)-based prediction models to optimize energy efficiency in Kubernetes clusters is a promising avenue for improving sustainable cloud operations. While static requests provide a stable baseline for comparing scheduling algorithms, we recognize that real-world Kubernetes workloads exhibit dynamic resource consumption. Modern schedulers, such as Zeus [6], leverage historical usage trends and real-time monitoring to adjust pod requests adaptively. Dynamic resource allocation strategies could be incorporated, integrating observed runtime metrics to refine scheduling decisions in real cluster deployments. A promising approach involves integrating predictive resource modeling, where historical CPU, memory, and network usage trends inform adaptive scheduling heuristics. Additionally, Kubernetes Vertical Pod Autoscaler (VPA) and dynamic QoS policies can be explored to enable the runtime scaling of pod resource limits, ensuring efficient allocation without over-provisioning.

5. Conclusions

This study introduces a set of network-aware scheduling plugins for Kubernetes, aiming to improve resource utilization and inter-zone communication efficiency in multi-region clusters for batch processing workloads. By leveraging latency-aware placement strategies, the approach used mitigates the limitations of the default Kubernetes scheduler, which primarily optimizes for local resource availability without considering network topology or inter-node delays. Unlike existing approaches that focus solely on local node capacity, the presented scheduler plugins adapt to multi-region constraints, ensuring more optimal scheduling decisions in geo-distributed environments.

The evaluated plugins demonstrate a clear pattern. The probabilistic ACO and genetic NSGA-II algorithms demonstrate nearly constant execution times, regardless of the cluster size. This consistency comes with a notable variability in solution quality. The quality–time trade-off can be optimized by increasing population sizes and iterations, thus enabling better exploration of the solution space. In contrast, ILP and SA algorithms produce better solutions, but at the cost of exponentially increasing execution time as the cluster size increases.

Leveraging their different properties, a tiered scoring could be implemented that combines multiple algorithms running in parallel. In this approach, a computational deadline needs to be established. The fast to converge algorithm will be used as an interim solution, while allowing the computationally intensive algorithm an opportunity to produce a solution within the time constraint. Should the second algorithm fail to converge within the deadline, the score would default to the interim solution, thus ensuring quality and timeliness in scheduling solutions.

The simulation-based testing approach used provides a controlled and reproducible environment to evaluate scheduling algorithms before deploying them in production. Through testing under predefined network conditions and resource constraints, the strengths and limitations of different scheduling strategies could be identified without affecting live workloads. This allows for the fine-tuning of algorithms to ensure stability and performance in real-world clusters.

As Kubernetes continues to evolve along with various orchestrated workloads, the need for adaptive mechanisms to incorporate multidimensional environment parameters into scheduling decisions will be required. Reviewed algorithms and implemented tools provide a strong foundation for advancing Kubernetes scheduling capabilities, allowing the rapid evaluation of novel scheduling approaches.

Author Contributions

Conceptualization, R.F. and M.S.; methodology, R.F. and M.S.; software, R.F.; validation, R.F.; formal analysis, R.F. and M.S.; investigation, R.F.; resources, R.F.; data curation, R.F. and M.S.; writing—original draft preparation, R.F. and M.S.; writing—review and editing, M.S. and N.K.; visualization, R.F.; supervision, M.S.; project administration, M.S.; funding acquisition, M.S. and N.K. All authors have read and agreed to the published version of the manuscript.

Funding

This study is financed by the European Union—NextGenerationEU, through the National Recovery and Resilience Plan of the Republic of Bulgaria, project № BG-RRP-2.004-0005. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Data Availability Statement

The implementation of the proposed framework is available at https://github.com/dinozavyr/scheduler-plugins (accessed on 31 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ACO	Ant Colony Optimization;
ACO-MCS	Ant Colony Optimization algorithm for Multi-objective of Container-based Microservice Scheduling;
CPU	Central processing unit;
ILP	Integer Linear Programming;
KS	Kubernetes Scheduler;
MDBP	Multidimensional Container Packing Problem;
MQTT	Message Queue Telemetry Transport;
NAS	Network-Aware Scheduler;
NSGA-II	Non-dominated Sorting Genetic Algorithm version 2;
PSO	Particle Swarm Optimization;
QoS	Quality of Service;
RSBD	Remotely Sensed Big Data;
SA	Simulated Annealing;
SBX	Simulated Binary Crossover;
TSM	Task Scheduling Mechanism;
VM	Virtual Machine.

References

Rao, G.; Stone, H.S.; Hu, T.C. Assignment of Tasks in a Distributed Processor System with Limited Memory. IEEE Trans. Comput. 1979, 30, 291–299. [Google Scholar] [CrossRef]
Abdul-Rahman, O.; Aida, K. Towards Understanding the Usage Behavior of Google Cloud Users: The Mice and Elephants Phenomenon. In Proceedings of the 2014 IEEE 6th International Conference on Cloud Computing Technology and Science, Singapore, 15–18 December 2014; pp. 272–277. [Google Scholar]
Centofanti, C.; Tiberti, W.; Marotta, A.; Graziosi, F.; Cassioli, D. Latency-Aware Kubernetes Scheduling for Microservices Orchestration at the Edge. In Proceedings of the 2023 IEEE 9th International Conference on Network Softwarization (NetSoft), Madrid, Spain, 19–23 June 2023; pp. 426–431. [Google Scholar]
Santos, J.; Wauters, T.; Volckaert, B.; Turck, F.D. Towards Network-Aware Resource Provisioning in Kubernetes for Fog Computing Applications. In Proceedings of the 2019 IEEE Conference on Network Softwarization (NetSoft), Paris, France, 24–28 June 2019; pp. 351–359. [Google Scholar]
Pusztai, T.W.; Rossi, F.; Dustdar, S. Pogonip: Scheduling Asynchronous Applications on the Edge. In Proceedings of the 2021 IEEE 14th International Conference on Cloud Computing (CLOUD), Chicago, IL, USA, 5–10 September 2021; pp. 660–670. [Google Scholar]
Zhang, X.; Li, L.; Wang, Y.; Chen, E.; Shou, L.; Zhang, X. Zeus: Improving Resource Efficiency via Workload Colocation for Massive Kubernetes Clusters. IEEE Access 2021, 9, 105192–105204. [Google Scholar] [CrossRef]
Lin, M.; Xi, J.; Bai, W.; Wu, J. Ant Colony Algorithm for Multi-Objective Optimization of Container-Based Microservice Scheduling in Cloud. IEEE Access 2019, 7, 83088–83100. [Google Scholar] [CrossRef]
Carrión, C. Kubernetes Scheduling: Taxonomy, Ongoing Issues and Challenges. ACM Comput. Surv. 2022, 55, 138. [Google Scholar] [CrossRef]
Zhang, W.-G.; Ma, X.-L.; Zhang, J.-Z. Research on Kubernetes’ Resource Scheduling Scheme. In Proceedings of the 8th International Conference on Communication and Network Security, Qingdao, China, 2–4 November 2018. [Google Scholar]
Senjab, K.; Abbas, S.; Ahmed, N.; Khan, A.u.R. A survey of Kubernetes scheduling algorithms. J. Cloud Comput. 2023, 12, 87. [Google Scholar]
Rejiba, Z.; Chamanara, J. Custom Scheduling in Kubernetes: A Survey on Common Problems and Solution Approaches. ACM Comput. Surv. 2022, 55, 151. [Google Scholar]
Sharma, S.; Kumar, V. A Comprehensive Review on Multi-objective Optimization Techniques: Past, Present and Future. Arch. Comput. Methods Eng. 2022, 29, 5605–5633. [Google Scholar] [CrossRef]
Bianchi, L.; Dorigo, M.; Gambardella, L.M.; Gutjahr, W.J. A survey on metaheuristics for stochastic combinatorial optimization. Nat. Comput. 2009, 8, 239–287. [Google Scholar] [CrossRef]
Liu, X.; Liu, J. A Task Scheduling Based on Simulated Annealing Algorithm in Cloud Computing. Int. J. Hybrid Inf. Technol. 2016, 9, 403–412. [Google Scholar] [CrossRef]
Feller, E.; Rilling, L.; Morin, C. Energy-Aware Ant Colony Based Workload Placement in Clouds. In Proceedings of the 2011 IEEE/ACM 12th International Conference on Grid Computing, Lyon, France, 21–23 September 2011; pp. 26–33. [Google Scholar]
Guerrero, C.; Lera, I.; Juiz, C. Genetic Algorithm for Multi-Objective Optimization of Container Allocation in Cloud Architecture. J. Grid Comput. 2017, 16, 113–135. [Google Scholar]
Zhang, D.; Yan, B.; Feng, Z.; Zhang, C.; Wang, Y.-X. Container oriented job scheduling using linear programming model. In Proceedings of the 2017 3rd International Conference on Information Management (ICIM), Chengdu, China, 21–23 April 2017; pp. 174–180. [Google Scholar]
Rodrigues, L.R.; Pasin, M.; Alves, O.C.; Miers, C.C.; Pillon, M.A.; Felber, P.; Koslovski, G.P. Network-Aware Container Scheduling in Multi-Tenant Data Center. In Proceedings of the 2019 IEEE Global Communications Conference (GLOBECOM), Waikoloa, HI, USA, 9–13 December 2019; pp. 1–6. [Google Scholar]
Huang, W.; Zhou, J.; Zhang, D. On-the-Fly Fusion of Remotely-Sensed Big Data Using an Elastic Computing Paradigm with a Containerized Spark Engine on Kubernetes. Sensors 2021, 21, 2971. [Google Scholar] [CrossRef] [PubMed]
Kubernetes-Sigs/Scheduler-Plugins: Repository for Out-of-Tree Scheduler Plugins Based on Scheduler Framework. Available online: https://github.com/kubernetes-sigs/scheduler-plugins (accessed on 20 January 2025).
Scheduling Framework|Kubernetes. Available online: https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/ (accessed on 20 January 2025).
kubernetes-Sigs/Kube-Scheduler-Simulator: The Simulator for the Kubernetes Scheduler. Available online: https://github.com/kubernetes-sigs/kube-scheduler-simulator/ (accessed on 20 January 2025).
kubernetes-Sigs/Kwok: Kubernetes Without Kubelet—Simulates Thousands of Nodes and Clusters. Available online: https://github.com/kubernetes-sigs/kwok (accessed on 20 January 2025).

Figure 1. Kubernetes scheduling lifecycle.

Figure 2. Cluster network topology.

Figure 3. Time needed to score a pod on all nodes. The ACO line is entirely overlaid by the NSGA-II line, indicating identical values.

Figure 4. Pod-to-Pod network latency distribution. Dashed lines represent variance and circles represent outliers.

Figure 5. CPU utilization distribution on nodes having at least one pod scheduled. Dashed lines represent variance and circles represent outliers.

Figure 6. Memory utilization distribution on nodes having at least one pod scheduled. Dashed lines represent variance and circles represent outliers.

Table 1. Definitions used.

Definition	Description
$P = \{p_{1}, p_{2}, \dots, p_{m}\}$	The set of pods to be scheduled.
$N = \{n_{1}, n_{2}, \dots, n_{k}\}$	The set of worker nodes in the cluster.
$G \subseteq P$	Set of pod groups.
$L : N \times N \to R +$	Network latency between nodes.
$L (n_{i}, n_{j})$	Network latency between $n_{i}$ and $n_{j}$ .
$R_{c p u} (p), R_{m e m} (p)$	CPU and memory resource requests.
$A (p)$	Assignment of pod p, where $A (p) \in N$ .
$C_{c p u} (n), C_{m e m} (n)$	Resource capacity (for each node $n \in N$ ).
$U_{c p u} (n), U_{m e m} (n)$	Amount of resources currently allocated on node n.

Table 2. Test scenario parameters.

Scenario Name	Regions	Nodes per Region	Total Nodes	Number of Pods
S	3	5	15	10
M	4	5	20	13
L	3	10	30	20
XL	4	10	40	26
XXL	3	20	60	40
XXXL	4	20	80	53

Table 3. Region latency matrix.

	Region 1	Region 2	Region 3	Region 4
Region 1	0.7 ms	3 ms	5 ms	7 ms
Region 2	3 ms	0.7 ms	5 ms	7 ms
Region 3	5 ms	5 ms	0.7 ms	3 ms
Region 4	7 ms	7 ms	3 ms	0.7 ms

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Furnadzhiev, R.; Shopov, M.; Kakanakov, N. Efficient Orchestration of Distributed Workloads in Multi-Region Kubernetes Cluster. Computers 2025, 14, 114. https://doi.org/10.3390/computers14040114

AMA Style

Furnadzhiev R, Shopov M, Kakanakov N. Efficient Orchestration of Distributed Workloads in Multi-Region Kubernetes Cluster. Computers. 2025; 14(4):114. https://doi.org/10.3390/computers14040114

Chicago/Turabian Style

Furnadzhiev, Radoslav, Mitko Shopov, and Nikolay Kakanakov. 2025. "Efficient Orchestration of Distributed Workloads in Multi-Region Kubernetes Cluster" Computers 14, no. 4: 114. https://doi.org/10.3390/computers14040114

APA Style

Furnadzhiev, R., Shopov, M., & Kakanakov, N. (2025). Efficient Orchestration of Distributed Workloads in Multi-Region Kubernetes Cluster. Computers, 14(4), 114. https://doi.org/10.3390/computers14040114

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Efficient Orchestration of Distributed Workloads in Multi-Region Kubernetes Cluster

Abstract

1. Introduction

1.1. Related Work

1.1.1. Ant Colony Optimization

1.1.2. Non-Dominated Sorting Genetic Algorithm

1.1.3. Integer Linear Programming

1.1.4. Simulated Annealing

1.2. Evaluation of Kubernetes Scheduling Algorithms

2. Materials and Methods

2.1. Ant-Colony Optimization

2.2. Non-Dominated Sorting Genetic Algorithm

2.3. Simulated Annealing

2.4. Integer Linear Programming

3. Results

3.1. Experimental Setup

3.2. Results and Evaluation

4. Discussion

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI