1. Introduction
Order scheduling is a crucial decision-making problem in supply chain management and the manufacturing industry; it plays an important role in the rational allocation and utilization of resources, making companies more competitive in the global marketplace. Before a new order can be put into production, a series of preliminary activities, often termed 'pre-manufacturing' or preproduction events, must be completed. Moreover, owing to the various uncertainties of the real production process, the daily production quantity of each order does not always match initial expectations [1]. For example, the rapid development of technology and the exponential growth of mobile internet services in China have led to a significant increase in the volume of mobile communications and the complexity of work processes. In response, dispatching system providers are actively seeking ways to improve the efficiency and cost effectiveness of work order scheduling. While current order scheduling systems partially meet operators' requirements, they face challenges such as inefficient utilization of human resources and an inability to fully automate work order dispatching.
Previous research on automatic work order scheduling optimization has primarily focused on refining dispatching rules based on the Priority Dispatching Rule (PDR) [
2], a widely used method in scheduling problems. Compared to complex optimization techniques such as mathematical programming, PDR offers computational efficiency, intuitiveness, ease of implementation, and inherent adaptability to uncertainties commonly encountered in practice [
3]. Current studies of order scheduling can be divided into two categories, namely, heuristic scheduling and deep learning scheduling. The core idea of heuristic scheduling is to find an approximately optimal solution through heuristic rules or methods. Heuristic rules are experience-based estimates or rules that guide the decision-making process of the chosen strategy. However, designing heuristic scheduling requires a great deal of specialized knowledge and often delivers limited performance across different situations. Moreover, most order scheduling models rely on heuristic methods [4], which lack a guarantee of global optimality and exhibit performance that varies with the specific problem instance and the designer's experience.
For deep learning scheduling, a deep neural network can be used to automatically extract scheduling policies from the data. However, most dispatching algorithms prioritize minimizing the operating time as their sole objective [
5]. Despite this, practical scenarios often demand the consideration of multiple metrics. For example, in manufacturing systems, a scheduling algorithm must handle dynamic demand, machine breakdowns, and uncertain processing times in real time. Similarly, in online ride-hailing services, a scheduling algorithm should aim to reduce both the average waiting time for customers and the pick-up cost for drivers [
6]. Taking the customer service of a telecom company as an example, the work order assignment process involves the customer, the telecom company, and the operator. When a customer generates a repair work order through a call, the telecom company needs to efficiently assign the nearest repairer to the customer’s location while minimizing costs [
7]. Therefore, a good scheduling algorithm needs to simultaneously guarantee time, cost, and distance.
Figure 1 illustrates the difference between traditional reinforcement learning and multiple-objective reinforcement learning. The latter allows for the simultaneous optimization of three objectives, enabling greater flexibility to meet diverse needs [
8].
To address the challenges of multi-objective order scheduling, we need to answer the following three questions: (1) how to formulate the multi-objective order scheduling problem? (2) how to construct a policy for performing order scheduling? and (3) how to obtain a Pareto-optimal order scheduling policy that meets the multiple requirements of order scheduling? For the first question, we formulate the problem as a Multiple-Objective Markov Decision Process (MOMDP), in which we represent the states using a disjunctive graph representation. This discrete graph structure effectively captures the characteristics of the order scheduling problem. For the second question, we introduce a Graph Neural Network (GNN) to construct a robust policy network. The GNN uses fixed-dimensional embeddings to represent the nodes of the disjunctive graph, enabling effective decision-making for order scheduling. By learning from the graph-structured data, extracting relevant features, and exploring patterns within the graph, the GNN provides valuable insights. For the last question, we employ policy gradient methods to train the network and overcome the limitations associated with traditional approaches in order to obtain high-quality Priority Dispatching Rules (PDRs). However, in multi-objective reinforcement learning, designing the reward function with subjectively set weights for each index often leads to imbalanced optimization of the objectives. To address this issue, our algorithm employs a linear embedding technique that achieves a more balanced solution. Furthermore, to ensure the best optimization of all objectives, we introduce the convex hull algorithm to guarantee that the weight vector yields a maximized linearly scalarized value function.
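To make the linear scalarization and convex hull idea concrete, the following minimal Python sketch (with hypothetical policy value vectors and candidate weights, not the exact procedure used in our implementation) shows how a weight vector can be chosen so that the linearly scalarized value is maximized over the achievable value vectors; only points on the convex hull of those vectors can maximize a linear scalarization.

```python
import numpy as np

def scalarize(value_vec, weights):
    """Linearly scalarized value: w . V(pi)."""
    return float(np.dot(weights, value_vec))

def best_weight(value_vectors, candidate_weights):
    """For each candidate weight, the best achievable scalarized value is attained
    by a value vector on the convex hull; return the weight (and score) for which
    this maximized scalarized value is largest."""
    best_w, best_score = None, -np.inf
    for w in candidate_weights:
        score = max(scalarize(v, w) for v in value_vectors)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score

# Hypothetical policy value vectors: (negated time, finish rate, negated cost).
V = [np.array([-10.0, 0.8, -5.0]),
     np.array([-12.0, 0.9, -3.0]),
     np.array([-9.0, 0.7, -6.0])]
# Hypothetical candidate weight vectors on the simplex.
W = [np.array([0.5, 0.3, 0.2]),
     np.array([1/3, 1/3, 1/3]),
     np.array([0.2, 0.5, 0.3])]

print(best_weight(V, W))
```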
To the best of our knowledge, no previous work has specifically addressed the same problem that we consider here. In this paper, we focus on the unique requirements of communication work order dispatching systems while considering the needs of both operators and customers. By developing a multi-objective deep reinforcement learning approach, each crucial objective can be optimized objectively on the basis of the available data.
2. Related Work
This research focuses on addressing a problem with multiple objectives. In this section, we discuss the problem used in the experiment and the model building process. One relevant problem that we consider is the Job Shop Scheduling Problem (JSSP), which is a well-known optimization problem in operations research.
Traditional approaches to solving the JSSP fall into four categories: analytical techniques, meta-heuristic [
9] algorithms, rule-based approaches, and simulation approaches. However, as the environments become more complex, traditional analytical techniques and simple mathematical models may not be capable of effectively analyzing them. While meta-heuristic algorithms have shown good performance in certain environments [
9], they cannot guarantee optimal solutions and are often unstable, yielding different results for different instances of the same problem. Rule-based algorithms, although effective, can be costly and difficult to transfer to new environments. In contrast, our proposed method is fully trained and can be directly applied to solve problems of different sizes without the need for transfer learning. Previous research has utilized the Genetic Algorithm (GA) [
10] and Shifting Bottleneck (SB) [11] methods to address the JSSP. However, these approaches require rebuilding whenever the environment changes, resulting in high computational costs and prompting researchers to develop new methods for solving the problem.
In 2020, a Reinforcement Learning (RL) approach was proposed to learn Priority Dispatching Rules (PDRs) for solving the JSSP [
12]. This approach utilized Graph Neural Networks (GNNs) to perform embeddings on the disjunctive graph, which captures the processing order and is size-agnostic in terms of both jobs and machines [
12]. However, this algorithm solely focuses on time optimization, while many scheduling goals involve multiple objectives, such as minimizing makespan and processing costs. Traditional RL methods cannot effectively handle these multi-objective scenarios, and may face challenges when the environment changes.
In another study proposed by Yang et al. [
13], the convex hull method was applied to implement ethical embedding in multiple-objective reinforcement learning [
14]. This algorithm aligns with current developments in the Multi-Objective Reinforcement Learning (MORL) literature by creating an ethical environment as a single-objective Markov Decision Process (MDP) derived from the multi-objective MDP that results from the reward specification process [
13].
Multi-objective optimization has already been applied to many different problems. In 2013, an NSGA-II-based Pareto optimization model was developed to handle scheduling involving multiple production departments and multiple production processes [15]. In a study by Debiao Li, a multi-objective optimization problem (MOP) minimizing collation delays and makespan with a min–max Pareto objective function was presented for order scheduling in a mail-order pharmacy automation system [16].
To summarize, while previous research has made progress in addressing the JSSP, there are limitations in handling multiple objectives and adapting to changing environments. Our research aims to address these challenges and to develop an effective and robust solution using multi-objective deep reinforcement learning.
3. Problem Description
In the communication work order scheduling scenario, there is a set of work orders J, such as complaint sheets, and a set of staff (workers) M. Because of process or permission requirements, each order always has to be reviewed or processed by different workers; each step is called an operation of the order and must be processed by one of a number of workers in a prescribed sequence. In addition, each worker can only work on one order at a time. To solve this problem, we need to find a dispatching approach that takes time as the primary target while simultaneously optimizing multiple objectives. In this paper, we set the additional targets as the finish rate and the cost.
3.1. Disjunctive Graph
It is well known that a disjunctive graph can represent the JSSP. A disjunctive graph models the scheduling problem as a directed graph G = (O, C ∪ D), where O is the set of all vertices of the directed graph, with each vertex representing an operation (plus two empty operations marking the start and the end); C is the set of directed arcs (conjunctions), which connect two adjacent operations of the same job in the directed graph G; and D is the set of all disjunctive arcs, which connect two operations corresponding to the same operator in the directed graph G. Consequently, finding a solution to a job-dispatching instance is equivalent to fixing the direction of each disjunction such that the resulting graph is a DAG [2].
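As a concrete (hypothetical) illustration of this construction, the sketch below builds a tiny two-job, two-worker disjunctive graph with networkx, fixes a direction for each disjunctive arc, and checks that the result is a DAG; the instance and the library choice are illustrative assumptions, not our actual implementation.

```python
import networkx as nx

# Hypothetical 2-job x 2-worker instance: each operation is (job, index),
# and worker_of[op] is the worker that must process it.
worker_of = {("J1", 0): "W1", ("J1", 1): "W2",
             ("J2", 0): "W2", ("J2", 1): "W1"}

G = nx.DiGraph()
G.add_nodes_from(["start", "end", *worker_of])

# Conjunctive arcs C: precedence between adjacent operations of the same job.
for job in ("J1", "J2"):
    G.add_edge("start", (job, 0))
    G.add_edge((job, 0), (job, 1))
    G.add_edge((job, 1), "end")

# Disjunctive arcs D: operations sharing a worker. A solution fixes one direction
# per disjunction; here we schedule J1's operation before J2's on both workers.
G.add_edge(("J1", 0), ("J2", 1))  # worker W1
G.add_edge(("J1", 1), ("J2", 0))  # worker W2

# A feasible solution must yield a directed acyclic graph.
print(nx.is_directed_acyclic_graph(G))  # True
```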
3.2. Markov Decision Process Formulation
The sequential decision method based on PDRs solves JSSP instances through a series of steps. In each step, the set of eligible operations (that is, operations whose preceding operations have all been scheduled) is first identified. A specific PDR is then applied to calculate the priority index of each eligible operation, and the one with the highest priority is selected for scheduling. Traditional PDRs define the priority through hand-crafted rules, such as selecting the operation with the shortest processing time. As mentioned above, solving a job dispatching instance can be viewed as the task of determining the direction of each disjunction. Therefore, we consider the dispatching decisions made by PDRs as actions that change the disjunctive graph, and formulate the underlying MDP model as follows.
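The following minimal sketch (in Python, with a hypothetical three-job instance) illustrates this generic PDR loop using the SPT rule as the priority index; MWKR would instead rank eligible operations by the total remaining processing time of their job.

```python
# Hypothetical instance: proc_time[j][k] is the processing time of the k-th operation of job j.
proc_time = {"J1": [3, 2], "J2": [4, 1], "J3": [2, 5]}

def eligible_ops(next_idx):
    """Operations whose predecessors are all scheduled: the next unscheduled operation of each job."""
    return [(j, k) for j, k in next_idx.items() if k < len(proc_time[j])]

def spt_priority(op):
    """Shortest Processing Time: a shorter operation gets a higher priority."""
    job, k = op
    return -proc_time[job][k]

next_idx = {j: 0 for j in proc_time}  # pointer to each job's next unscheduled operation
schedule = []
while eligible_ops(next_idx):
    chosen = max(eligible_ops(next_idx), key=spt_priority)  # highest-priority eligible operation
    schedule.append(chosen)
    next_idx[chosen[0]] += 1

print(schedule)  # dispatching order produced by the SPT rule
```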
State. The state s_t of the disjunctive graph G_t represents the partial solution up to decision step t, where C_t contains all of the disjunctive arcs that have been assigned a direction up to step t and D_t includes the arcs that have not yet been dispatched. For each node O in the graph, we record the recursively calculated value of each target together with a binary indicator whose value is 1 if the node has been scheduled in s_t. Each target value is accumulated recursively along the precedence arcs, i.e., the value at a node is the value at its predecessor plus the node's own contribution, where t(O) is the processing time of the operation, f(O) is its degree of completion, and c(O) is its cost. We calculate each target by considering only the precedence constraints from its predecessors.
Action. An action is the selection of one executable operation at decision step t.
Rewards. The reward function is vital for reinforcement learning. At each step, the agent selects an action based on the current state according to its policy, and the environment generates a reward signal; the cumulative reward is the quantity the agent seeks to maximize through its action selection.
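As a hedged sketch of how such a reward could look in the multi-objective setting, the snippet below returns one reward component per objective (the improvement of each per-objective estimate after an action, all treated as quantities to be minimized) and scalarizes it with a fixed weight vector; this is an illustrative assumption rather than the exact reward used in our implementation.

```python
import numpy as np

def vector_reward(prev_estimates, new_estimates):
    """One reward component per objective: how much each per-objective estimate
    (a quantity to be minimized, e.g., time, cost, distance) decreased after the action."""
    return np.asarray(prev_estimates) - np.asarray(new_estimates)

def scalarized_reward(prev_estimates, new_estimates, weights):
    """Linear scalarization of the vector reward with a fixed weight vector."""
    return float(np.dot(weights, vector_reward(prev_estimates, new_estimates)))

# Hypothetical estimates before and after scheduling one operation.
print(scalarized_reward(prev_estimates=[50.0, 30.0, 12.0],
                        new_estimates=[47.0, 28.0, 11.0],
                        weights=[0.5, 0.3, 0.2]))
```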
5. Experiments
We evaluated and compared the training and test performance of our multiple-objective PDR and a single-objective PDR. Both experiments were performed on generated instances provided by Zhang [12]. According to data from China Telecom Guangdong Company, work orders can be classified into four categories: consulting, complaints, business processing, and repair. When scheduling business processing and repair orders, the addresses always differ, and distance is an important objective that greatly affects operations. For the same reason, the cost of repair should be taken into account. For application to specific telecom work orders, we used maintenance as an example and designed three objectives, namely, the estimated time t, the cost c provided by the worker, and the distance d to the destination.
5.1. Models and Configurations
Our model was trained and validated on 100 instances, and we trained our policy network for 10,000 iterations with predefined hyperparameters. For the graph embedding, we used a GIN composed of four hidden layers with a hidden dimension of 64. For the GIN layers, we set the number of iterations K to 4. For PPO, the actor network and critic network shared the same structure, with four hidden layers of dimension 32. We set the number of training epochs of the network to 1, the clip parameter for PPO to 0.2, the critic loss coefficient to 1, the policy loss coefficient to 2, and the entropy loss coefficient to 0.01. During the training process, we set the reward discount to 1 and used the Adam optimizer with a fixed learning rate. All other parameters followed the default settings in PyTorch.
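For readability, the settings above can be gathered into a single configuration structure; the sketch below simply mirrors the stated hyperparameters (the key names are illustrative, and the learning rate is left unset because its value is not reproduced here).

```python
# Hyperparameters as stated above; names are illustrative and the learning rate is left unset.
config = {
    "train_instances": 100,
    "training_iterations": 10_000,
    "gin": {"hidden_layers": 4, "hidden_dim": 64, "num_iterations_K": 4},
    "ppo": {
        "actor_hidden_layers": 4, "actor_hidden_dim": 32,
        "critic_hidden_layers": 4, "critic_hidden_dim": 32,
        "epochs_per_update": 1,
        "clip_eps": 0.2,
        "critic_loss_coef": 1.0,
        "policy_loss_coef": 2.0,
        "entropy_coef": 0.01,
    },
    "reward_discount": 1.0,
    "optimizer": "Adam",
    "learning_rate": None,  # value not reproduced here
}
```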
5.2. Baselines
There are a number of methods that adopt reinforcement learning for job shop scheduling; however, all of them have focused only on time optimization. In order to demonstrate our method's effect on each objective, we used the algorithm proposed by Cong Zhang [12], which is implemented in Python. In addition, a number of PDRs have been proposed for the JSSP; thus, we selected two traditional PDRs based on their performance as reported in [23], namely, Shortest Processing Time (SPT) and Most Work Remaining (MWKR). SPT is one of the most widely used PDRs in research and industry, while MWKR has demonstrated excellent performance [23]. We compared our time objective with these traditional algorithms as well.
5.3. Results on Generated Instances
Figure 4 illustrates the process of a customer complaint handling system developed by Guangzhou Telecom. Customers call the customer service line to communicate their requirements, which can be divided into four categories: 31% Consulting, 24% Malfunction, 23% Complaint, and 22% Business Process. After orders are generated, 21% of the orders, which concern business expenses, are handled by the intensive process. The remaining work orders are distributed to the ICS (Intensive Customer Service), and these orders are handled by the district branches. Because service areas and maintenance costs vary, both distance and cost need to be considered, meaning that a single-target approach no longer works for this scenario. To simulate the dispatch data found in real applications, we generated a dataset with three objectives: time, cost, and distance.
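A minimal sketch of how such three-objective instances could be generated is shown below; the value ranges and array shapes are illustrative assumptions rather than the exact distributions used for our dataset.

```python
import numpy as np

def generate_instance(n_orders, n_workers, seed=0):
    """Generate one synthetic dispatching instance with three objective values
    (time, cost, distance) for every (order, worker) pair; ranges are illustrative."""
    rng = np.random.default_rng(seed)
    return {
        "time": rng.integers(1, 100, size=(n_orders, n_workers)),        # processing time
        "cost": rng.integers(10, 500, size=(n_orders, n_workers)),       # worker-quoted cost
        "distance": rng.uniform(0.5, 30.0, size=(n_orders, n_workers)),  # distance to the customer
    }

instance = generate_instance(n_orders=6, n_workers=6)
print({name: values.shape for name, values in instance.items()})
```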
First, we trained our method on generated data of a fixed size. The training curves for the three objectives are shown in Figure 5. We randomly generated 100 instances of each size ten times and report the average of the three objectives to indicate the effect of the optimization. All of the learning curves show that our algorithm carries out the optimization process effectively.
In order to highlight the significance of our multiple-objective algorithm, we performed testing on generated datasets of different sizes with the three objectives mentioned above. The results for these three objectives are recorded in Table 1. We compared our dispatching policy with the deep reinforcement learning policy proposed by Zhang [
12] and compared the time objective with a traditional baseline.
We tested and verified our method on generated instances of several sizes. The results in Table 1 show an obvious improvement with our method. As Table 1 shows, our algorithm behaves better on multiple-objective problems, with our MORL algorithm improving performance on both the cost objective and the distance objective. In particular, on certain dataset sizes we reached an approximately 20% reduction in the cost objective, while on others we reached an approximately 20% reduction in the distance objective, which ensures that the nearest service point takes the job. Furthermore, through comparison with two widely used PDRs, namely, SPT and MWKR, we verified the optimization of the time objective.
Overall, even though our method performed slightly worse on the time objective compared with the single-objective reinforcement learning algorithm, the results show better performance on multiple-objective optimization, which is more suitable for practical applications; moreover, the time objective was still optimized relative to the traditional PDRs.
In order to demonstrate the superiority of our algorithm, we compared a model optimized using the convex hull algorithm with a model that was not; the results are shown in Table 2. The convex hull algorithm was used to ensure that the weight vector used in the linearly scalarized value function was the one that maximized it, thereby achieving the best possible optimization across all objectives. The training curves for the three objectives are shown in Figure 5a–c. Based on our experimental results, we observed that the model optimized with the convex hull algorithm significantly outperforms the unoptimized one (see Table 3). This indicates that our optimization strategies effectively improved the performance of the algorithm, increasing the stability and accuracy of model training. In the future, we intend to further investigate and optimize the details of the algorithm in order to achieve even better results.
5.4. Results on Public Benchmarks
In order to ensure the validity of our experimental results, we selected the DMU dataset and divided it into eight groups based on instance size. We then generated policies for three of these groups, while the remaining five groups were used as the test set. The purpose of this experimental setup was to evaluate the performance of our policies against a baseline across a variety of dataset scenarios and thereby determine the effectiveness of our policies in improving performance on the DMU dataset. The results of our experiments, shown in Table 4, clearly demonstrate that our policies outperformed the baseline in all of the tested scenarios. This indicates that our policies have the potential to significantly improve performance on similar datasets in the future.