Article

Low-Carbon Flexible Job Shop Scheduling Problem Based on Deep Reinforcement Learning

by Yimin Tang 1, Lihong Shen 2 and Shuguang Han 2,*
1 School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
2 School of Science, Zhejiang Sci-Tech University, Hangzhou 310018, China
* Author to whom correspondence should be addressed.
Sustainability 2024, 16(11), 4544; https://doi.org/10.3390/su16114544
Submission received: 21 April 2024 / Revised: 20 May 2024 / Accepted: 23 May 2024 / Published: 27 May 2024

Abstract: As the focus on environmental sustainability sharpens, the significance of low-carbon manufacturing and energy conservation continues to rise. While traditional flexible job shop scheduling strategies are primarily concerned with minimizing completion times, they often overlook the energy consumption of machines. To address this gap, this paper introduces a novel solution utilizing deep reinforcement learning. The study begins by defining the Low-carbon Flexible Job Shop Scheduling problem (LC-FJSP) and constructing a disjunctive graph model. A sophisticated representation, based on the Markov Decision Process (MDP), incorporates a low-carbon graph attention network featuring multi-head attention modules and graph pooling techniques, aimed at boosting the model’s generalization capabilities. Additionally, Bayesian optimization is employed to enhance the solution refinement process, and the method is benchmarked against conventional models. The empirical results indicate that our algorithm markedly enhances scheduling efficiency by 5% to 12% and reduces carbon emissions by 3% to 8%. This work not only contributes new insights and methods to the realm of low-carbon manufacturing and green production but also underscores its considerable theoretical and practical implications.

1. Introduction

With carbon dioxide concentrations in the Earth’s atmosphere reaching the highest levels in at least 2 million years [1], the imperative to peak carbon emissions and achieve carbon neutrality [2] underscores the critical importance of advancing low-carbon manufacturing, green production, and carbon emission reduction efforts. Leveraging deep reinforcement learning algorithms for production scheduling optimization holds promise in significantly reducing resource consumption and environmental pollution associated with manufacturing processes.
The Job shop Scheduling Problem (JSP) stands as one of the most prevalent scheduling challenges. Johnson’s seminal work in 1954 [3] on scheduling for two machines laid the foundation for various scheduling problems, including JSP. Central to JSP is the allocation of each job operation to designated machines and the scheduling of processing sequences and start times under specific constraints to optimize performance metrics. The Flexible Job shop Scheduling Problem (FJSP), an extension of traditional JSP, was introduced by Brucker in 1990 [4]. As a complex NP-Hard problem [5], FJSP not only entails operation sequencing on machines but also necessitates machine assignment for each operation prior to sequencing, aligning more closely with real-world production scheduling environments.
In recent years, there has been a growing focus on environmental considerations in production scheduling. Jiang [6] and others proposed a multi-objective flexible job shop scheduling model incorporating energy consumption, maximum completion time, processing cost, and cost-weighted processing quality, alongside an enhanced non-dominated sorting genetic algorithm for solution. Zhang [7] and team established a low-carbon scheduling model for FJSP, considering factors like maximum completion time, machine load, and carbon emissions, and proposed a mixed NSGA-II algorithm for resolution. Jiang [8] developed a low-carbon flexible job shop scheduling model aiming to minimize the weighted sum of energy consumption costs and completion time costs, along with a tailored grey wolf algorithm for solutions. Luo [9] optimized for minimal maximum completion time and total energy consumption using an enhanced multi-objective grey wolf algorithm. Lu [10] introduced an effective multi-objective discrete virus optimization algorithm for solving the flexible job shop scheduling problem with variable processing times.
Despite their prevalence, the aforementioned problems are typically addressed with heuristic algorithms that lack generalization capability. Deep Reinforcement Learning (DRL)-based production scheduling has emerged as a potential solution, capable of learning and generalizing knowledge from training data to address new challenges. Notable studies have applied DRL to scheduling problems of various scales [11,12]. For instance, Naimi et al. [11] proposed and trained a multi-objective Q-learning algorithm for optimal rescheduling to minimize completion time and energy consumption changes. Proximal Policy Optimization (PPO), a mainstream DRL algorithm, has been employed [13,14] to solve FJSP. However, effective application of DRL algorithms to learn scheduling strategies requires neural network architectures capable of processing state graphs of varying sizes, because FJSP instances differ in scale. While Graph Neural Networks (GNNs) have shown promise [15,16] in handling variable-size instances, they are limited to homogeneous graph problems. Song [17] addressed this limitation by employing an HGNN with a two-stage embedding process for heterogeneous graph problems, along with a decision model for simultaneous operation and machine selection. Graph Attention Networks (GATs) [18] have demonstrated efficacy in processing machine and operation features by leveraging attention mechanisms [19] to learn node importance, aiding in identifying high-priority operations and suitable operation–machine arc combinations.
The success of DRL in scheduling underscores its potential in addressing flexible job shop scheduling problems. However, existing DRL-based research tends to focus on single-objective scheduling problems, with limited attention paid to multi-objective FJSPs, and even fewer studies considering environmental impacts. Additionally, DRL-based solutions face challenges in improving model performance and narrowing the gap with exact solutions. Existing methods often fail to adequately model machine relationships, which can be viewed as competing for remaining unscheduled operations. Yang [20] proposed a novel approach to solving multi-objective FJSP problems by leveraging deep reinforcement learning-trained hand controllers to manipulate objects, combined with Bayesian optimization methods.
In summary, there has been a growing concern about carbon emissions and their impact on the environment. Industries worldwide are seeking ways to reduce their carbon footprint while maintaining productivity and profitability. One area that requires attention is flexible job shop scheduling, a critical process in various sectors such as manufacturing, logistics, and services.
The existing research in flexible job shop scheduling mainly focuses on optimizing economic indicators such as production efficiency, makespan, and resource utilization. However, with the increasing emphasis on sustainability, it is crucial to consider environmental factors, particularly carbon emissions, in the scheduling process. This study addresses carbon emission concerns in flexible job shop scheduling, developing a multi-objective model and proposing a deep reinforcement learning-based algorithm for LC-FJSP resolution. The aim is to optimize both economic and environmental indicators.
The proposed model has several practical implications for real-world applications. By simultaneously considering economic and environmental objectives, it provides decision-makers with a comprehensive solution that balances production efficiency and carbon emission reduction. This approach can help industries achieve sustainable and efficient scheduling, contributing to their environmental targets while maintaining operational performance. The primary contributions of this research are outlined as follows:
  • This research broadens the scope of the classic Flexible Job shop Scheduling Problem (FJSP) by integrating considerations of both operating and idle energy consumption into the machine parameters, formulating a mathematical model for the Low-carbon Flexible Job shop Scheduling Problem (LC-FJSP). The model’s primary goal is to minimize the weighted sum of total completion times and carbon emissions.
  • The study introduces a sophisticated graph attention network that accounts for machine energy consumption in its feature extraction process, deploying an adaptive graph attention mechanism to accurately discern the relationships and priorities between each node and its adjacent nodes, thereby elucidating the intricate dependencies among machines and operations.
  • It establishes an innovative end-to-end training framework known as the Low-carbon Graph Reinforcement Learning (LCGRL) framework. This framework incorporates a Bayesian optimization module designed to fine-tune weight coefficients for multi-objective challenges, significantly enhancing the training convergence and delivering a Deep Reinforcement Learning (DRL) model with exemplary composite performance.
Our proposed approach demonstrates significant novelty:
  • Firstly, this extension of the traditional Flexible Job shop Scheduling Problem (FJSP) provides a novel perspective on environmental sustainability issues.
  • Secondly, the method employs a graph attention mechanism that considers machine energy consumption, accurately identifying the complex dependencies between machines and operations, which is innovative in this field.
  • Finally, our end-to-end training method not only enhances training convergence but also achieves exceptional overall performance. This framework represents a pioneering application in addressing similar multi-objective challenges.
The structure of the remainder of this paper is as follows. Section 2 reviews recent scheduling methods; Section 3 explores the Flexible Job shop Scheduling (FJSP) concept and details its mathematical modeling; Section 4 elaborates on the algorithm’s design; Section 5 presents experimental setups and comparative results. Section 6 summarizes the study and suggests avenues for future research.

2. Related Works

2.1. Traditional Optimization Methods

In traditional optimization methods for the flexible job shop scheduling problem, many studies focus on improving existing genetic algorithms and other heuristic approaches to address multi-objective optimization challenges. These methods often introduce new neighborhood search strategies, adaptive operations, and fuzzy logic to enhance algorithm performance, achieving significant progress in optimizing production efficiency and reducing energy consumption.
Zhao et al. [21] proposed an Enhanced Non-dominated Sorting Genetic Algorithm II (ENSGA-II) to address the Multi-Objective Energy-Saving Flexible Job Shop Scheduling Problem (MOEFJSP). Extensive experiments show that the ENSGA-II is effective in optimizing production criteria and reducing energy consumption. Li et al.’s [22] study investigates a Bi-Population Balancing Multi-Objective Evolutionary Algorithm (MOEA) for solving the Distributed Flexible Job Shop Scheduling Problem (DFJSP) in a steelmaking system. The algorithm simultaneously considers fuzzy processing times and crane transportation processes, aiming to minimize both the maximum fuzzy completion time and the energy consumption during machine processing and crane transportation. Zhang et al. [23] discuss the flexible job shop scheduling problem in the context of remanufacturing, proposing an optimized approach based on an improved genetic algorithm. The method incorporates several enhancements, such as adaptive mutation and dynamic crossover strategies, to accelerate convergence and improve solution quality in complex scheduling scenarios.

2.2. Quantum Computing

Traditional optimization methods often struggle to balance solution quality and computation time, highlighting the need for more efficient algorithms. Recent advancements in quantum computing have shown promising results in this area. Schworm et al. [24] have proposed a Quantum Annealing-based approach (QASA) to tackle the Flexible Job Shop Scheduling Problem (FJSSP), achieving superior solution quality over classical methods like tabu search and simulated annealing. This approach effectively balances multiple objectives, including makespan, total workload, and job priority, by leveraging the strengths of Quantum Annealing combined with classical optimization techniques. Additionally, other studies have demonstrated the potential of quantum annealing in solving complex scheduling problems, suggesting it as a viable alternative or complement to traditional methods. Denkena et al. [25] utilized a hybrid Quantum Annealing and digital annealing approach to solve larger instances of FJSSP, showing significant improvements in computational efficiency and solution quality compared to traditional methods. Their work suggests that Quantum Annealing can be particularly beneficial for industrial-scale scheduling problems where quick rescheduling is needed to adapt to dynamic changes such as machine failures or supply chain disruptions.
Additional studies have benchmarked Quantum Annealing against state-of-the-art classical algorithms [26], indicating that it not only achieves comparable or better solution quality but also significantly reduces computation times. This makes Quantum Annealing a viable alternative or complement to traditional methods for complex scheduling problems.

2.3. Deep Reinforcement Learning

Li et al. [27] addressed the Energy-Aware Distributed Heterogeneous Flexible Job Shop Scheduling (DHFJS) problem, which extends traditional flexible job shop scheduling with added complexity. Their goal was to minimize total energy consumption (TEC) and makespan. They proposed a Deep Q-networks-based Co-Evolution algorithm (DQCE) to solve this NP-hard problem. Huang et al. [28] address the Distributed Job Shop Scheduling Problem (DJSP) using an end-to-end deep reinforcement learning (DRL) approach based on Graph Neural Networks (GNNs). Traditional methods separate job selection and factory assignment, but this method combines them, treating the problem as a Markov Decision Process (MDP). A specially designed disjunctive graph represents the problem, and a GNN extracts state features. The Proximal Policy Optimization (PPO) algorithm trains the policy, leading to improved decision making. Chen et al. [29] proposed a Q-learning-based multi-objective immune algorithm (Q-MOIA) to address the flexible job shop scheduling problem (FJSSP) with fuzzy processing times and dynamic disruptions. Their approach integrates a predictive–reactive dynamic/static rescheduling model, utilizing a mixed integer linear programming (MILP) model for static scenarios and rescheduling heuristics for dynamic disruptions like new job arrivals and machine breakdowns. The Q-MOIA algorithm enhances initial solutions with active decoding and improves exploration and exploitation capabilities through clonal selection and Q-learning mechanisms. In recent advancements, Ding et al. [30] proposed a multi-policy deep reinforcement learning framework to tackle the Multi-Objective Multiplicity Flexible Job Shop Scheduling Problem (MOMFJSP), which aims to minimize makespan and total tardiness. The framework utilizes a multi-policy proximal policy optimization algorithm (MPPPO), treating the scheduling problem as a Markov decision process. This approach integrates multiple policy networks with different objective weight vectors, a fluid model for state feature extraction, and a multi-policy co-evolution mechanism (MPCEM) to enhance interaction among policies. Their results demonstrate that this method significantly improves decision accuracy and schedule optimization compared to traditional dispatching rules and other scheduling methods.
Quantum computing methods, such as quantum annealing and digital annealing, have shown promising results in solving complex optimization problems. These methods can potentially explore vast solution spaces more efficiently than classical algorithms due to their ability to simultaneously evaluate multiple solutions. Specifically, quantum annealing has been successfully applied to various scheduling problems, demonstrating its capability to handle the combinatorial nature of these tasks.
Despite the potential advantages of quantum computing, we chose DRL and LCGAN for several reasons:
  • DRL, especially when combined with advanced neural network architectures like LCGAN, provides a flexible framework that can adapt to dynamic changes in the scheduling environment. This adaptability is crucial for real-time applications where production schedules may need frequent adjustments due to unexpected disruptions.
  • The use of DRL allows for a more nuanced representation of the scheduling problem through state–action pairs, enabling the learning of complex policies that can optimize multiple objectives simultaneously. LCGAN further enhances this by incorporating environmental considerations, making it suitable for low-carbon manufacturing processes.
  • Current quantum computing technologies are still in their nascent stages, with practical implementations being limited by hardware constraints and error rates. In contrast, DRL and neural networks are well-established, with extensive libraries and frameworks available, facilitating easier implementation and experimentation.

3. Problem Description and Model Construction

3.1. Problem Description

Assume a workshop with a machine set $M$ and $n$ jobs that need to be processed, denoted as the job set $J = \{J_1, J_2, \ldots, J_n\}$. Each job $J_i$ consists of $n_i$ operations, and each operation $O_{ij}$ can be processed on any machine from a subset of capable machines $M_{ij} \subseteq M$. Each operation is assigned to only one machine, and the processing time of each operation varies depending on the machine used, with different operations also resulting in different amounts of carbon emissions. Typically, the performance of workshop scheduling is evaluated based on the maximum completion time of all jobs within the workshop. However, it is imperative to also consider the energy consumption costs within the workshop. This paper primarily focuses on two types of carbon emissions associated with workshop machines; one is the carbon emission produced during the machining process, and the other is the carbon emission resulting from electrical energy consumption when the machine is in an idle state. Addressing the Low-Carbon Flexible Job Shop Scheduling Problem (LC-FJSP) proposed in this study, the primary task involves assigning appropriate machines for operations and reasonably arranging the processing sequence of different job operations on the machines to minimize the idle carbon emissions of the machines. Thus, the LC-FJSP considered in this paper, similar to the general FJSP, encompasses two sub-problems: machine assignment and operation sequencing.
For the problem in question, the following assumptions are made:
1. Both the jobs and machines are available at the initial starting time of zero;
2. Each machine can process only one operation at a time, and each operation can only be processed by one available machine;
3. Once the processing of a job begins, it cannot be interrupted;
4. Different jobs are independent of each other, and there are precedence constraints between different operations within the same job;
5. The preparation time of machines before processing operations is not considered.
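To make the problem data concrete before the formal model, the following minimal Python sketch shows one possible in-memory representation of an LC-FJSP instance under the assumptions above; the class and field names (e.g., proc_time, proc_power, idle_power) and the toy numbers are illustrative choices, not notation from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Operation:
    job: int                       # index i of the job this operation belongs to
    idx: int                       # index j of the operation within the job
    # processing time on each eligible machine k: {machine_id: t_ijk}
    proc_time: Dict[int, float] = field(default_factory=dict)
    # processing power on each eligible machine k: {machine_id: p_ijk}
    proc_power: Dict[int, float] = field(default_factory=dict)

@dataclass
class LCFJSPInstance:
    n_jobs: int
    n_machines: int
    # jobs[i] is the ordered list of operations O_i1, O_i2, ... of job J_i
    jobs: List[List[Operation]]
    # idle power pe_k of each machine, used for the idle-emission term
    idle_power: List[float]
    # carbon emission factor alpha_e converting electrical energy to emissions
    emission_factor: float = 1.0

# A toy 2-job, 2-machine instance (all numbers are made up for illustration)
instance = LCFJSPInstance(
    n_jobs=2,
    n_machines=2,
    jobs=[
        [Operation(0, 0, {0: 3.0, 1: 5.0}, {0: 1.2, 1: 0.9})],
        [Operation(1, 0, {1: 4.0}, {1: 1.1})],
    ],
    idle_power=[0.4, 0.5],
)
```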

3.2. LC-FJSP Math Model

3.2.1. Symbol Definition

In our study, the following symbol definitions are used, as shown in Table 1.

3.2.2. Carbon Emissions

Under processing conditions, the carbon emissions are calculated as follows:
$$E_1 = \alpha_e \cdot W_1 \tag{1}$$
$$W_1 = \sum_{k=1}^{m}\sum_{i=1}^{n}\sum_{j=1}^{n_i} p_{ijk}\, t_{ijk}\, x_{ijk} \tag{2}$$
Under idle conditions, the carbon emissions are calculated as follows:
$$E_2 = \alpha_e \cdot W_2 \tag{3}$$
$$W_2 = \sum_{k=1}^{m} pe_k \left( CT_k - ST_k - \sum_{i=1}^{n}\sum_{j=1}^{n_i} t_{ijk}\, x_{ijk} \right) \tag{4}$$
Total carbon emissions are calculated as follows:
$$TCE = E_1 + E_2 \tag{5}$$
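As a concrete illustration of Equations (1)–(5), the short sketch below computes $E_1$, $E_2$, and $TCE$ for a finished schedule; the schedule record format (one dict per scheduled operation with machine, start, end, and power fields) is an assumption, not a structure defined in the paper.

```python
def total_carbon_emissions(schedule, idle_power, emission_factor):
    """Compute E1 (processing) and E2 (idle) emissions for a finished schedule.

    schedule: list of dicts with keys 'machine', 'start', 'end', 'power',
              one record per scheduled operation (assumed format).
    idle_power: list pe_k, idle power draw of each machine.
    emission_factor: alpha_e, converts electrical energy to carbon emissions.
    """
    # W1: energy consumed while machines are actually processing operations
    w1 = sum(rec["power"] * (rec["end"] - rec["start"]) for rec in schedule)

    # W2: idle energy, pe_k * (machine span CT_k - ST_k minus its busy time)
    w2 = 0.0
    for k, pe_k in enumerate(idle_power):
        recs = [r for r in schedule if r["machine"] == k]
        if not recs:
            continue
        span = max(r["end"] for r in recs) - min(r["start"] for r in recs)
        busy = sum(r["end"] - r["start"] for r in recs)
        w2 += pe_k * (span - busy)

    e1 = emission_factor * w1
    e2 = emission_factor * w2
    return e1, e2, e1 + e2   # TCE = E1 + E2
```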

3.2.3. LC-FJSP Mathematical Programming Model

This article aims to minimize the weighted sum of the maximum completion time and carbon emissions as optimization objectives. The constructed LC-FJSP model is as follows:
$$\min f = \min\left(\alpha C_{\max} + \beta\, TCE\right) \tag{6}$$
subject to:
$$C_{\max} \geq C_i, \quad \forall i \tag{7}$$
$$\sum_{M_k \in M_{ij}} x_{ijk} = 1, \quad \forall i, j \tag{8}$$
$$S_{i,j+1} \geq C_{ij}, \quad \forall i, j \tag{9}$$
$$C_{ij} - S_{ij} = \sum_{k=1}^{m} x_{ijk}\, t_{ijk}, \quad \forall i, j \tag{10}$$
$$S_{i'j'} + L \cdot \left(1 - y_{ij,i'j',k}\right) \geq S_{ij} + t_{ijk} = C_{ij}, \quad \forall i, j, i', j', k \tag{11}$$
$$S_{ij} + L \cdot y_{ij,i'j',k} \geq C_{i'j'}, \quad \forall i, j, i', j', k \tag{12}$$
The mathematical programming model for the Low-carbon Flexible Job shop Scheduling Problem (LC-FJSP) is formulated as follows. Equation (6) defines the objective to minimize the weighted sum of the maximum completion time and carbon emissions, with α and β serving as weight coefficients. These coefficients are optimally adjusted via Bayesian optimization to adapt to varying conditions. Equation (7) ensures that the completion time for each job does not exceed the established maximum completion time. Equation (8) mandates that each operation be exclusively assigned to a single machine for processing. Equation (9) establishes a hierarchy of priorities among operations within the same job. Equation (10) stipulates that operations, once commenced, must proceed uninterrupted. Equations (11) and (12) specify that a machine can only handle one operation at a time, with L denoted as a sufficiently large positive constant.

3.3. Instances of LC-FJSP and Disjunctive Graph Heterogeneity

LC-FJSP allows a job to be processed on multiple machines, and different machines may have different processing times, as shown in Table 2, which presents an instance of LC-FJSP.
In the provided table, J i denotes the i-th job, O i j denotes the j-th operation within job J i , and M k denotes the k-th machine. The Low-carbon Flexible Job shop Scheduling Problem (LC-FJSP) is effectively modeled using a disjunctive graph.
We introduce a disjunctive graph $G = (O, C, D)$, where $O = \{O_{ij} \mid \forall i, j\} \cup \{S, E\}$ comprises all nodes, including the operation nodes as well as the virtual start node $S$ and end node $E$. Set $C$ comprises directed arcs that enforce precedence constraints among consecutive operations within the same job. Set $D$ includes undirected disjunctive arcs linking operations that may be processed on a common machine. For instance, operations $O_{11}$, $O_{13}$, $O_{21}$, and $O_{32}$ in Figure 1 are interconnected by red disjunctive arcs, signifying that they are all eligible for processing on machine $M_1$.
Introduce machine nodes M 1 ,   M 2 ,   M 3 and establish connections between operations capable of being processed on these machines ( M i , where i = 1 ,   2 ,   3 ) and their respective machines using undirected arcs, thus forming a heterogeneous graph model, as depicted in Figure 2.

4. Design of Solution Algorithms

The LC-FJSP studied in this paper is an extension of the traditional FJSP, introducing carbon emission constraints, thereby increasing the complexity of the problem. With the increase in problem size, the search space for feasible solutions grows exponentially, categorizing it as an NP-Hard problem. This implies a challenge in finding polynomial-time algorithms, necessitating further research and exploration. Consequently, this paper devises a deep reinforcement learning algorithm to efficiently solve this problem. In this section, the LC-FJSP is regarded as a sequential decision-making problem, iteratively scheduling job operations, assigning operations to compatible machines at each state until all operations are scheduled. The overall framework of the algorithm is illustrated in Figure 3.

4.1. Markov Decision Process (MDP)

To address the LC-FJSP using deep reinforcement learning, we initially define the states, actions, state transitions, and rewards, transforming the problem into a Markov Decision Process (MDP). A DRL-based decision framework is then established, which treats the selection of operations and machines integrally and outputs a probability distribution for decision making. A greedy strategy is employed, focusing on selecting operation–machine pairs with the highest scores. Lastly, we explain the training methodology for the proposed model.
The scheduling process in FJSP is conceptualized as assigning a ready operation to a suitable idle machine. The procedure is as follows. At each decision point t (either at the start or upon the completion of an operation), the agent assesses the current state s t and selects an action a t , specifically assigning an unplanned operation to an available machine, beginning its execution from time T ( t ) . Subsequently, the system transitions to the next state at step t + 1 . This sequence continues until all operations are scheduled. The MDP framework is defined as follows:
State: The state representation captures the primary attributes and dynamics of the scheduling environment, considering both processes and machines as the composite state. The collective state of all processes and machines at any decision step t forms state s t , starting from the initial FJSP instance denoted as s 0 .
Action: The paper integrates process selection and machine assignment into a unified action choice, defining all feasible process–machine pairs as the action space. As scheduling progresses, the action space naturally diminishes as more operations are allocated.
State Transition: At each decision step $t$, the agent selects an action $a_t$ from the available action space in state $s_t$ and executes it; the environment then transitions to the subsequent state $s_{t+1}$.
Reward: The purpose of designing the reward function is to guide the agent to select actions that minimize the maximum completion time and total carbon emissions of all operations. The reward function at time step $t$ is defined as $r_t = f(s_t) - f(s_{t+1})$, where $f(s_t) = \alpha C_{\max}(s_t) + \beta\, TCE(s_t)$ is the weighted objective value in the current state $s_t$. When the discount factor $\gamma = 1$, the accumulated reward over an episode telescopes to $\sum_{t=0}^{|O|-1} r_t = f(s_0) - f(s_{|O|})$. For a given problem instance, $f(s_0)$ is a constant, implying that minimizing $f$ and maximizing the cumulative reward are equivalent.
Policy: We adopt a stochastic policy π ( a t | s t ) , which defines a probability distribution over the action set A t for each state s t . The distribution of this policy is generated by a deep reinforcement learning algorithm, optimizing specific parameters during training to maximize the cumulative reward.
For example, consider a simple scenario where there are two jobs, J 1 and J 2 , each with one operation, O 1 and O 2 , respectively, and three machines, M 1 , M 2 , and M 3 . At a decision point t, both O 1 and O 2 are ready to be processed.
At time t, the current state s t includes the status of all jobs and machines. For instance, J 1 ’s operation O 1 is pending assignment, J 2 ’s operation O 2 is pending assignment, and all machines M 1 , M 2 , and M 3 are idle.
The action space includes all possible job–machine assignments. In this scenario, the possible actions are:
1. Assign $O_1$ to $M_1$;
2. Assign $O_1$ to $M_2$;
3. Assign $O_1$ to $M_3$;
4. Assign $O_2$ to $M_1$;
5. Assign $O_2$ to $M_2$;
6. Assign $O_2$ to $M_3$.
Suppose the agent selects the action to assign $O_1$ to $M_1$. The system transitions to the next state $s_{t+1}$, in which $O_1$ is being processed on $M_1$. For example, the new state could be as follows: $O_1$ is running on $M_1$ with an expected completion time of 5 units, $O_2$ is still pending assignment, and $M_2$ and $M_3$ remain idle.
The reward for this action is calculated based on the reduction in the combined metric of maximum completion time ($C_{\max}$) and total carbon emissions ($TCE$). Suppose that at state $s_t$, $C_{\max}$ is 20 and $TCE$ is 30. After transitioning to state $s_{t+1}$, $C_{\max}$ reduces to 19 and $TCE$ reduces to 28. The reward function is $r_t = f(s_t) - f(s_{t+1})$, where $f$ is the weighted sum of $C_{\max}$ and $TCE$. Assuming equal weights, $f(s_t) = 20 + 30 = 50$ and $f(s_{t+1}) = 19 + 28 = 47$; hence, $r_t = 50 - 47 = 3$.
By considering both process selection and machine assignment in the actions, the agent effectively learns to balance the workload and optimize the overall performance metrics in LC-FJSP.
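The reward computation in this example reduces to simple arithmetic. The sketch below, a minimal illustration rather than the paper's implementation, encodes the state of the example as a $(C_{\max}, TCE)$ pair and reproduces the reward of 3; the equal unit weights are the assumption stated above.

```python
def objective(c_max, tce, alpha=1.0, beta=1.0):
    # f(s) = alpha * C_max(s) + beta * TCE(s), the weighted scheduling objective
    return alpha * c_max + beta * tce

def step_reward(state_t, state_t1, alpha=1.0, beta=1.0):
    # r_t = f(s_t) - f(s_{t+1}); positive when the action reduces the
    # weighted sum of makespan and total carbon emissions
    return objective(*state_t, alpha, beta) - objective(*state_t1, alpha, beta)

# (C_max, TCE) before and after assigning O_1 to M_1 in the worked example
print(step_reward((20, 30), (19, 28)))   # -> 3.0
```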

4.2. Low-Carbon Graph Attention Network (LCGAN)

To address the Low-carbon Flexible Job shop Scheduling (LC-FJSP) challenge, characteristics such as processing time on operation-to-machine (O-M) arcs are critical. Based on these times, we can calculate the carbon emissions E 1 during processing on different machines; likewise, from the machine’s idle times, we can infer the carbon emissions E 2 when the machines are not active. This study takes advantage of the unique features and benefits of the heterogeneous graph structure by introducing a tailored LCGAN network architecture specifically designed for LC-FJSP. By enhancing two attention modules, as cited in [31], this framework skillfully captures feature representations of process and operation nodes. Energy consumption features are added to the O-M arcs within the machine feature attention module to aid in the amalgamation and filtration of process features. To address the relationship between time and carbon emissions in subsequent calculations, we employ Bayesian optimization to adjust the weights for maximum completion time and carbon emissions, aiming to identify the optimal solution. The input feature dimensions for the operation and machine are d o and d m , respectively. Figure 4 illustrates the architecture of LCGAN.

4.2.1. Operation Feature Attention Module

The operation feature attention module aims to connect the operations within the same workpiece by finding the most important operations through their inherent attributes. For each input operation feature h O i j R d O of O i j O u , this module establishes relationships between O i j , its predecessor O i , j 1 , and successor O i , j + 1 by calculating their attention coefficients as follows:
$$e_{i,j,p} = \mathrm{LeakyReLU}\left(\mathbf{a}^{\top}\left[\, W h_{O_{ij}} \,\big\|\, W h_{O_{ip}} \,\right]\right)$$
where $\mathbf{a} \in \mathbb{R}^{2 d_O}$ and $W \in \mathbb{R}^{d_O \times d_O}$ are linear transformations, for all $|p - j| \leq 1$.
We chose the LeakyReLU activation function over the standard ReLU for several reasons. Firstly, LeakyReLU helps mitigate the “dying ReLU” problem, ensuring that neurons remain active and gradients flow during training, which is crucial for our operation feature attention module. Secondly, our preliminary experiments indicated the presence of noisy features and outliers in the dataset. LeakyReLU’s small negative slope allows for non-zero gradients when units are inactive, helping the model handle noise and outliers more robustly. Lastly, LeakyReLU’s ability to provide gradients for negative inputs aids in better gradient propagation, particularly beneficial for deep networks.
These calculations are similar to those in GAT but narrowed in scope. Since the predecessors (or successors) of some operations may not exist or may be removed at some step, dynamic masking is applied to the attention coefficients of these predecessors and successors. The softmax function normalizes all $e_{i,j,p}$ to obtain the normalized attention coefficients $\alpha_{i,j,p}$. Finally, by a weighted linear combination of the transformed input features $W h_{O_{i,j-1}}$, $W h_{O_{i,j}}$, and $W h_{O_{i,j+1}}$, followed by a nonlinear activation function $\sigma$, the output feature vector $h'_{O_{ij}} \in \mathbb{R}^{d_O}$ is obtained:
$$h'_{O_{ij}} = \sigma\left( \sum_{p=j-1}^{j+1} \alpha_{i,j,p}\, W h_{O_{ip}} \right)$$
By sequentially connecting multiple operation feature attention modules, the message of O i j can be propagated to all operations in J i .
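A minimal PyTorch sketch of this operation attention step is given below. It assumes each operation's features are stacked with those of its predecessor and successor, and that missing neighbors are handled by the dynamic mask; the tensor layout and module name are illustrative choices, and ELU is used as the output activation $\sigma$ as in the experimental configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OperationAttention(nn.Module):
    """GAT-style attention over an operation, its predecessor, and its successor."""
    def __init__(self, d_o):
        super().__init__()
        self.W = nn.Linear(d_o, d_o, bias=False)      # shared linear transform W
        self.a = nn.Linear(2 * d_o, 1, bias=False)    # attention vector a
        self.act = nn.LeakyReLU(0.2)

    def forward(self, h_ops, mask):
        # h_ops: (n_ops, 3, d_o) features of [predecessor, self, successor]
        # mask:  (n_ops, 3) boolean, False where a neighbour is absent or removed
        wh = self.W(h_ops)                             # W h_{O_ip}
        h_self = wh[:, 1:2, :].expand_as(wh)           # broadcast the centre node W h_{O_ij}
        e = self.act(self.a(torch.cat([h_self, wh], dim=-1))).squeeze(-1)
        e = e.masked_fill(~mask, float("-inf"))        # dynamic masking
        alpha = F.softmax(e, dim=-1)                   # normalised coefficients alpha_{i,j,p}
        return F.elu((alpha.unsqueeze(-1) * wh).sum(dim=1))   # sigma(sum alpha * W h)
```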

4.2.2. Machine Feature Attention Module

Each operation to be processed can ultimately be completed on only one machine; hence, there exists a competitive relationship between different machines contending for the same unscheduled operations. This competitive relationship may dynamically evolve as the production process progresses. We define $C_{kq}$ as the set of operations competed for between machines $M_k$ and $M_q$, while $Ep_{ijk}$ and $Ep_{ijq}$, respectively, represent the energy consumption of operation $O_{ij}$ on machine $M_k$ and machine $M_q$. Furthermore, we use $c_{kq} = \sum_{O_{ij} \in C_{kq}} (Ep_{ijk} + Ep_{ijq})\, h_{O_{ij}}$ to measure the intensity of competition between $M_k$ and $M_q$, where more intense competition indicates that the candidate operations are more important. The machine feature attention module uses $c_{kq}$ to calculate the attention coefficient $v_{kq}$. For each $M_k \in M_u$ with input feature $h_{M_k} \in \mathbb{R}^{d_m}$, the attention coefficients $v_{kq}$ for all machines competing with $M_k$ are calculated as follows:
$$v_{kq} = \mathrm{LeakyReLU}\left(\mathbf{b}^{\top}\left[\,(V_1 h_{M_k}) \,\big\|\, (V_1 h_{M_q}) \,\big\|\, (V_2 c_{kq})\,\right]\right)$$
where $V_1 \in \mathbb{R}^{d_m \times d_m}$ and $V_2 \in \mathbb{R}^{d_m \times d_o}$ are weight matrices, and $\mathbf{b} \in \mathbb{R}^{3 d_m}$ is a linear transformation.
$C_{kk}$ represents the set of unscheduled operations that $M_k$ can process; $c_{kk}$ can therefore be considered a measure of $M_k$'s processing capability, and we similarly apply the above formula to calculate $v_{kk}$. Normalized attention coefficients are then obtained using softmax, and the transformed input features are combined and activated with ELU to obtain the machine output feature $h'_{M_k} \in \mathbb{R}^{d_m}$.
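The sketch below illustrates, under simplifications of our own, how the competition intensity $c_{kq}$ and the coefficient $v_{kq}$ could be computed for a single pair of machines; batching over all machine pairs and the subsequent softmax/ELU aggregation are omitted for brevity.

```python
import torch
import torch.nn as nn

class MachinePairAttention(nn.Module):
    """Attention coefficient v_kq for one pair of competing machines."""
    def __init__(self, d_m, d_o):
        super().__init__()
        self.V1 = nn.Linear(d_m, d_m, bias=False)
        self.V2 = nn.Linear(d_o, d_m, bias=False)
        self.b = nn.Linear(3 * d_m, 1, bias=False)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, h_mk, h_mq, h_ops_comp, e_k, e_q):
        # h_ops_comp: (n_c, d_o) features of operations both machines can process
        # e_k, e_q:   (n_c,) energy consumption of those operations on M_k and M_q
        c_kq = ((e_k + e_q).unsqueeze(-1) * h_ops_comp).sum(dim=0)  # competition intensity
        z = torch.cat([self.V1(h_mk), self.V1(h_mq), self.V2(c_kq)], dim=-1)
        return self.act(self.b(z))                                   # scalar v_kq
```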

4.2.3. Multi-Head Attention Module

We utilize multiple attention heads to process the aforementioned modules, aiming to learn the diverse relationships between entities. Let H denote the number of attention heads in the attention layer; we apply 2 H attention mechanism modules, each containing different parameters. Firstly, parallel computations are performed to derive attention coefficients and combinations. Secondly, their outputs are integrated through an aggregation operator. We adopt concat as the aggregation operator, and an average operator is used in the last layer. Finally, an activation function σ is applied to obtain the output of the layer.

4.2.4. Graph Pooling

When we use graph neural networks (GNNs) to process graph-structured data, the input graphs may have a varying number of nodes and different edge connections. This diversity can make the model very sensitive to changes in the input graphs, making it difficult to generalize well to new graph data. Graph pooling operations can help solve this problem by aggregating nodes or subgraphs in the graph to obtain a higher-level representation. This higher-level representation is equivalent to a summary or abstraction of the entire graph, containing important information and key features of the graph. The original features of the operation O i j and the machine M k are denoted as h ( 0 ) O i j and h ( 0 ) M k , respectively. After processing by L layers of a GNN, the features aggregated with attention weights, h ( L ) O i j and h ( L ) M k , are used for subsequent decision tasks. Following the method in reference [19], we first average pool the features of the operations and machines, respectively, and then concatenate their results to form the global feature of the FJSP instance, as shown below:
$$h_G^{(L)} = \left[\; \frac{1}{|O_u|}\sum_{O_{ij} \in O_u} h_{O_{ij}}^{(L)} \;\Big\|\; \frac{1}{|M_u|}\sum_{M_k \in M_u} h_{M_k}^{(L)} \;\right]$$
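A mean-pooling implementation of this global feature is only a few lines; the sketch below assumes the final-layer embeddings are stored as two dense tensors.

```python
import torch

def global_graph_feature(h_ops, h_machines):
    """Mean-pool operation and machine embeddings, then concatenate.

    h_ops:      (n_ops, d_o) final-layer operation embeddings h^(L)_Oij
    h_machines: (n_machines, d_m) final-layer machine embeddings h^(L)_Mk
    returns:    (d_o + d_m,) global instance feature h^(L)_G
    """
    return torch.cat([h_ops.mean(dim=0), h_machines.mean(dim=0)], dim=-1)
```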

4.3. Making Decision

We designed a decision network based on the actor–critic reinforcement learning framework, where both the actor and critic utilize Multilayer Perceptrons (MLPs), with parameters denoted by θ and ϕ, respectively. The actor network first generates a scalar score for each candidate operation–machine pair and then uses the softmax function to output the desired distribution, resulting in a stochastic policy that guides the agent’s behavior in the environment. We concatenate all information related to $a_t$ into a single vector and feed it into $\mathrm{MLP}_{\theta}$:
$$\mu(a_t \mid s_t) = \mathrm{MLP}_{\theta}\left[\, h_{O_{ij}}^{(L)} \,\big\|\, h_{M_k}^{(L)} \,\big\|\, h_G^{(L)} \,\big\|\, h_{(O_{ij}, M_k)} \,\right]$$
The probability of choosing action a t is given by
$$\pi_{\theta}(a_t \mid s_t) = \frac{\exp\left(\mu(a_t \mid s_t)\right)}{\sum_{b_t \in A_t} \exp\left(\mu(b_t \mid s_t)\right)}$$
The critic, represented by the value function network, evaluates the value of taking a certain action in different states, i.e., predicts the cumulative reward under a given policy. It takes the global feature h G ( L ) as input and produces a scalar v ( s t ) as the estimate of the state value. The objective of the critic network is to estimate the state value as accurately as possible, thereby providing better feedback and guidance for the actor network to improve its strategy and obtain higher rewards.
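The following sketch outlines an actor–critic head of this kind, assuming each candidate operation–machine pair is described by its operation, machine, global, and arc features; the hidden sizes mirror the configuration in Section 5 (two 64-unit tanh layers), while the module and argument names are illustrative.

```python
import torch
import torch.nn as nn

def mlp(in_dim, hidden=64, out_dim=1):
    # two hidden layers of 64 units with tanh, as in the experimental setup
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                         nn.Linear(hidden, hidden), nn.Tanh(),
                         nn.Linear(hidden, out_dim))

class ActorCritic(nn.Module):
    def __init__(self, d_o, d_m, d_arc):
        super().__init__()
        d_g = d_o + d_m                               # global feature dimension
        self.actor = mlp(d_o + d_m + d_g + d_arc)     # scores one (O_ij, M_k) pair
        self.critic = mlp(d_g)                        # state value v(s_t) from h_G

    def forward(self, h_op, h_mach, h_global, h_arc):
        # h_op, h_mach, h_arc: (n_actions, .) features of each candidate pair
        g = h_global.expand(h_op.size(0), -1)         # broadcast the global feature
        scores = self.actor(torch.cat([h_op, h_mach, g, h_arc], dim=-1)).squeeze(-1)
        pi = torch.softmax(scores, dim=-1)            # policy pi_theta(a_t | s_t)
        value = self.critic(h_global)                 # critic estimate of the state value
        return pi, value
```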

4.4. Bayesian Optimization

Following the approach in Reference [32], we employ Bayesian optimization to determine the weights of the reward function. This method selects appropriate sampling points in the search space and adjusts subsequent sampling points based on the observed results, gradually approaching the optimal solution. Our objective is to optimize the black-box function $f(C, E) = \alpha C + \beta E$, where $\alpha$ and $\beta$ are the coefficients to be optimized. We utilize the results of the decision network to obtain sample points $(C_i, E_i)$ and their corresponding function values $f_i = f(C_i, E_i)$. Following the Bayesian optimization formulation, we establish a Gaussian process model to describe $f(C, E)$. To select the next sampling point, we seek a point $(C_{t+1}, E_{t+1})$ that minimizes the objective function under the current Gaussian process model:
$$(C_{t+1}, E_{t+1}) = \underset{(C, E) \in X}{\operatorname{arg\,min}}\; \mathbb{E}\left[\, f(C, E) \mid X, y \,\right]$$
To update the Gaussian process model, we need to observe the function value f t + 1 = f ( C t + 1 , E t + 1 ) at the point ( C t + 1 , E t + 1 ) , and then add ( C t + 1 , E t + 1 ) , f t + 1 into the sample points and function values. Next, using Bayesian theorem and Gaussian process regression methods, the mean vector and covariance matrix of the Gaussian process model are updated.
Repeat the above steps until convergence is reached or a preset number of iterations is achieved. The final mean vector μ ( X ) can be used to estimate the values of α and β . Specifically, this is represented as:
$$\alpha = \frac{\mu_C}{\mu}, \qquad \beta = \frac{\mu_E}{\mu}$$
where $\mu_C$ and $\mu_E$ represent the mean values of $C$ and $E$ over the input space, respectively, and $\mu$ represents the mean of the function values at all points in the input space.
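For intuition, the sketch below tunes α and β with a standard Gaussian-process loop from Scikit-Optimize (the library listed in the experimental environment); it is a conventional gp_minimize search rather than the exact mean-ratio update above, and the evaluate_policy callback, assumed to run the trained decision network on validation instances and return (C_max, TCE), is hypothetical.

```python
from skopt import gp_minimize
from skopt.space import Real

def make_objective(evaluate_policy):
    """Wrap the black-box f(C, E) = alpha*C + beta*E around a policy evaluation."""
    def objective(params):
        alpha, beta = params
        c_max, tce = evaluate_policy(alpha, beta)     # hypothetical evaluation callback
        return alpha * c_max + beta * tce
    return objective

# Gaussian-process-based search over the weight coefficients
search_space = [Real(0.0, 1.0, name="alpha"), Real(0.0, 1.0, name="beta")]

def tune_weights(evaluate_policy, n_calls=30):
    result = gp_minimize(make_objective(evaluate_policy), search_space,
                         n_calls=n_calls, random_state=0)
    alpha_opt, beta_opt = result.x
    return alpha_opt, beta_opt
```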

4.5. LCGRL Training Framework

In this study, our training framework leverages the Proximal Policy Optimization (PPO) algorithm [33] for the effective training of our scheduling model, utilizing a framework of 1000 episodes. Throughout each episode, the agent concurrently interacts with a batch of Low-carbon Flexible Job shop Scheduling (LC-FJSP) environments, denoted as X t r , gathering state transition data to facilitate the iterative refinement of model parameters. To ensure variability, the environments undergo a resampling process every N t r episodes. Additionally, for verification purposes, our policy undergoes evaluation at intervals of N v a l episodes on a predefined set of validation datasets E v a l , alongside the execution of Bayesian optimization for hyperparameter tuning within the reward function. Our experimental design includes two distinct action selection strategies: a greedy strategy, favoring actions with the highest likelihood μ ( a t | s t ) for outcome validation, and a random sampling strategy, derived from the policy distribution π θ , to enhance exploratory behavior during training. The methodology for the training process Algorithm 1 is summarized as follows:
Algorithm 1 Training LCGRL for LC-FJSP
1: Input: LCGAN network, policy network, and critic network with initial trainable parameters θ, ω, and ϕ; time weighting parameter α; energy weighting parameter β; pre-sampled training data X_tr; fixed validation data X_val
2: for iter = 1, 2, …, I do
3:   for i = 1, 2, …, |X_tr| do
4:     Initialize s_{i,t} based on instance i
5:     Update reward parameters α and β
6:     for t = 1, 2, …, T do
7:       Extract embeddings using LCGAN
8:       Sample a_{i,t} ∼ π_θ(· | s_{i,t})
9:       Receive reward r_{i,t} and next state s_{i,t+1}
10:      Collect the transition (s_{i,t}, a_{i,t}, r_{i,t}, s_{i,t+1})
11:      Update s_{i,t} ← s_{i,t+1}
12:    end for
13:    Compute the generalized advantage estimates Â_{i,t} for each step
14:  end for
15:  Compute the total PPO loss and optimize the parameters θ, ω, and ϕ for κ epochs
16:  if iter mod N_tr == 0 then
17:    Resample |X_tr| instances from the training data
18:  end if
19:  if iter mod N_val == 0 then
20:    Validate π_θ on X_val
21:    Compute the parameters α and β of the reward function using Bayesian optimization
22:  end if
23: end for

5. Experiments

This section presents the comparative experimental outcomes of our newly developed framework, LCGRL, against the DRL framework cited in [17], across both synthetic and public traditional FJSP instances. The objective is to validate the enhanced performance of our training framework on conventional FJSP problems. Furthermore, we assess our trained model against a fixed-weight counterpart on the LC-FJSP challenges, factoring in machine energy consumption, focusing on the differences in reward function outputs and carbon emissions. This comparison is intended to underscore the strengths of our model in effectively addressing the LC-FJSP.

5.1. Experimental Settings

5.1.1. Datasets

Consistent with the practices of the majority of related studies, this study generates synthetic FJSP instances for both training and testing purposes.
Given the scarcity of public benchmarks, the datasets currently available are inadequate for training deep reinforcement learning models. Consequently, this study employs the format of existing public datasets to randomly generate new instances that adhere to these standards, creating 100 instances for each specified scale. Table 3 details the data ranges for these instances. Additionally, the second dataset extension expands the processing time range for each task from (1, 20) to (1, 100).
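A simple generator in this spirit is sketched below; because Table 3 is not reproduced here, the number of operations per job and the size of each eligible-machine subset are drawn from assumed ranges, while the processing-time bounds (1, 20) and (1, 100) follow the text.

```python
import random

def generate_instance(n_jobs, n_machines, t_lo=1, t_hi=20, seed=None):
    """Randomly generate one synthetic FJSP instance (assumed generation rules)."""
    rng = random.Random(seed)
    jobs = []
    for _ in range(n_jobs):
        ops = []
        for _ in range(rng.randint(1, n_machines)):      # operations per job (assumed range)
            k = rng.randint(1, n_machines)                # number of eligible machines
            machines = rng.sample(range(n_machines), k)
            ops.append({m: rng.randint(t_lo, t_hi) for m in machines})
        jobs.append(ops)
    return jobs

# e.g. 100 instances of size 10 x 5 with processing times drawn from (1, 20)
dataset = [generate_instance(10, 5, seed=i) for i in range(100)]
```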

5.1.2. Configuration

In this study, the model training is set with $I = 1000$ iterations, $N_{tr} = 20$, and $N_{val} = 10$. LCGAN employs $L = 2$ attention layers, and each module consists of $H = 4$ attention heads, with ELU as the activation function $\sigma$. The output dimensions of each attention head in the first layer are $d_o^{(1)} = d_m^{(1)} = 32$, while in the second layer they are $d_o^{(2)} = d_m^{(2)} = 8$. Both $\mathrm{MLP}_{\theta}$ and $\mathrm{MLP}_{\phi}$ have two hidden layers with 64 dimensions, and tanh serves as the activation function.
Regarding the PPO parameters, the coefficients in the loss function for the policy, value, and entropy terms are set to 1, 0.5, and 0.01, respectively. The clipping parameter $\varepsilon$, GAE parameter $\lambda$, and discount factor $\gamma$ are set to 0.2, 0.98, and 1, respectively. During training, the Adam optimizer [34] is utilized with a learning rate of $lr = 3 \times 10^{-4}$, and the policy network is updated four times per episode.
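For reference, the sketch below assembles a clipped PPO loss with these coefficients (policy 1, value 0.5, entropy 0.01) and clipping parameter 0.2; the tensor arguments are assumed to come from a rollout buffer and generalized advantage estimation, and the function is an illustrative composition rather than the paper's training code.

```python
import torch

def ppo_loss(new_logp, old_logp, advantages, values, returns, entropy,
             clip_eps=0.2, c_policy=1.0, c_value=0.5, c_entropy=0.01):
    """Clipped PPO surrogate combined with value and entropy terms."""
    ratio = torch.exp(new_logp - old_logp)                  # pi_theta / pi_theta_old
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()           # clipped policy surrogate
    value_loss = (returns - values).pow(2).mean()           # critic regression loss
    return c_policy * policy_loss + c_value * value_loss - c_entropy * entropy.mean()
```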
The hardware specifications for our experiments include a machine with an Intel Xeon Gold 6152 CPU, a single NVIDIA Tesla T4 GPU, and Ubuntu 16.04 64-bit operating system. The software environment includes Python 3.8, PyTorch 1.8.1 for deep learning, OpenAI Gym 0.18.0 for reinforcement learning, and Scikit-Optimize 0.8.1 for optimization. Additional libraries used are NumPy 1.21.6, Pandas 1.3.5, and Matplotlib 3.3.4.

5.1.3. Baselines

This research contrasts the LCGRL scheduling strategy, developed through comprehensive training, against the DRL scheduling strategy [17], and four heuristic priority scheduling rules that are extensively applied in real-world scenarios, as documented in [35], namely FIFO, MOR, SPT, and MWKR. To guarantee a fair evaluation, these heuristic rules are executed within the same computational setting as our proposed approach, confirming the superiority of our learned strategy over traditional manual methods. Additionally, our findings are benchmarked against the Google OR-Tools solver [36]. In the context of public benchmark scenarios, this study further evaluates the efficacy of a two-stage genetic algorithm, as detailed in [37].

5.2. Results on Synthetic Datasets

The data presented in Table 4 clearly demonstrate that, for small to medium instances, our LCGRL model significantly surpasses the DRL model in terms of scheduling outcomes, regardless of whether actions are selected through greedy or random sampling strategies. In small to medium instances, our LCGRLG exhibits faster resolution times than DRLG. On extended datasets, both models perform comparably, with the increased variability in processing times having a minimal effect on the efficacy of greedy solutions. Additionally, the performance of the four PDRs on these extended sets highlights that, with broader processing time ranges, the SPT method significantly outperforms other PDRs, consistent with its inherent properties. Under random sampling conditions, our model not only shows considerable improvements but also surpasses the OR-tools solver on the standard dataset. Despite DRLS having quicker resolution times under random sampling strategies, the integration of a low-carbon graph neural network and Bayesian optimization modules does introduce some time costs. Nonetheless, the significant enhancements in solution quality achieved by LCGRLS are substantial and indicate potential for future advancements. To validate the generalization capabilities of the model, we documented its performance on large-scale instances after training on small to medium-sized datasets.
As illustrated in Table 5, the LCGRL model consistently outperforms both the DRL model and heuristic algorithms across various scenarios. In Range A with a 30 × 10 size, LCGRL (10 × 5) achieved a lower objective value and gap compared to DRL (10 × 5) and the heuristic MWKR. For a 40 × 10 size in Range A, LCGRL (10 × 5) again outperformed DRL (10 × 5) and MWKR, demonstrating better solution quality. In Range B with a 30 × 10 size, LCGRL (10 × 5) achieved significantly better results than DRL (10 × 5) and the heuristic SPT, with a notably lower objective value and gap. For a 40 × 10 size in Range B, LCGRL (10 × 5) maintained its superior performance over DRL (10 × 5) and SPT. These results highlight the robustness and efficiency of the LCGRL model in handling different problem scales and processing times, consistently delivering superior performance compared to both DRL and heuristic algorithms.

5.3. Results on Public Datasets

In this subsection, the scheduling strategies discussed previously are applied directly across four distinct public datasets of varying sizes to further assess their generalization capabilities on two standard public benchmarks frequently utilized in conventional studies. The results also include those from the DRL method [17] and the two-stage genetic algorithm [37]. The Gap is calculated based on the optimal solutions provided in [38].
Given that real-world scheduling challenges feature non-static data distributions, we delved into the LCGRL model’s efficacy across four distinct public benchmarks, divergent from the training instance distributions. Each benchmark comprises instances varying in size. For instance, the mk benchmark involves job and machine counts ranging from 10 to 20 and 5 to 15, respectively. We tested models trained on unexpanded datasets, which demonstrated the most favorable outcomes on these benchmarks. Results presented in Table 6 indicate that MWKR outperforms the other three PDRs. According to the performance of models trained under OR-Tools, the 2SGA baseline, and DRL with greedy and random sampling strategies, while OR-Tools and 2SGA generally lead, they require longer computational times. In contrast, the two deep reinforcement learning approaches not only deliver commendable performances, surpassing the best PDR, but also maintain reasonable runtimes. LCGRL consistently exceeds DRL in most tests, with slight variations noted in tests utilizing sampling strategies on la (edata). Moreover, in mk instances, LCGRL substantially excels over DRL under both action selection strategies. These findings underscore LCGRL’s capability to authentically discern the inherent structural nuances of FJSP, focusing on distinguishing compatible machine–operation pairs rather than simply mastering the regularities underlying specific data distributions.

5.4. t-Tests and Confidence Intervals for Synthetic and Public Data Results

The objective of this section is to perform a statistical comparison of different methods on synthetic and public datasets. By leveraging t-tests and confidence intervals, we aim to evaluate the performance and effectiveness of these methods under various conditions.
Table 7 presents the t-test results for synthetic and public datasets. In the synthetic data, several methods, such as FIFO vs. MOR, FIFO vs. DRLG, and SPT vs. LCGRLG, show significant differences, with p-values less than 0.05. This indicates that these methods have statistically different performances in synthetic environments.
In the public dataset, significant differences are less frequent but still present in some cases. For example, the SPT method shows significant differences with methods like FIFO, MWKR, and DRLG, suggesting distinct performance profiles in real-world settings.
Specifically, our trained LCGRLG model shows significant differences in several comparisons. On synthetic data, LCGRLG significantly differs from FIFO, MOR, MWKR, DRLG, and SPT methods, with all p-values less than 0.05; on public data, LCGRLG shows significant differences with DRLG and SPT methods.
Table 8 displays the confidence intervals for synthetic and public data. For synthetic data, the LCGRLG method has a mean performance of 388.75 with a confidence interval from 264.24 to 513.26, indicating some variability in performance. On public data, LCGRLG’s mean performance is 834.65 with a confidence interval from 491.28 to 1178.03, showing high and stable performance in real-world scenarios.
Overall, the t-test and confidence interval analyses indicate that the LCGRLG method performs well across different datasets, particularly showing high performance and significant statistical differences on public data. This suggests that LCGRLG has strong competitiveness and applicability in practical scenarios.

5.5. Results Considering Machine Energy Consumption

In the referenced experiments, the machine energy consumption coefficient was assumed to be zero, making the objective equivalent to makespan. To explore further, we incorporated machine energy consumption into our experiments and trained a variant of our model, designated LCGRLF (Low-Carbon Graph Reinforcement Learning Fixed), without the Bayesian optimization module. This allowed us to evaluate the benefits of including this module. We utilized synthetic datasets of sizes 10 × 5 and 20 × 10, with respective machine energy coefficients of [0.8, 0.9, 1.0, 1.1, 1.2] and [0.8, 0.8, 0.9, 0.9, 1.0, 1.0, 1.1, 1.1, 1.2, 1.2], maintaining fixed weights of 0.5 for both Cmax and TCE. Throughout the training process, we monitored the reward function values at each iteration and carbon emissions every ten samples, facilitating a comparative analysis with our Bayesian-enhanced LCGRL model, as depicted in Figure 5 and Figure 6.
Figure 5’s comparative results distinctly illustrate that, following stabilization, the carbon emissions of the LCGRL with the Bayesian optimization module are substantially lower than those of LCGRLF. This demonstrates that incorporating Bayesian optimization effectively supports our low-carbon objectives in scheduling tasks that account for machine energy consumption. The reward function comparison in Figure 6 reveals that the model enhanced with the Bayesian module converges more swiftly and maintains a lower variance in reward values post-convergence compared to the model without the module, indicating superior stability in managing the variability of multi-objective reward functions.

6. Conclusions

In this study, we introduce an innovative end-to-end deep reinforcement learning framework, LCGRL, marking the first integration of Bayesian optimization with deep reinforcement learning to tackle the flexible job shop scheduling problem. Our approach is designed to derive decision-making rules tailored for the Low-carbon Flexible Job shop Scheduling Problem and to train robust reinforcement learning models. Utilizing a heterogeneous graph model to delineate the scheduling state, our framework employs the Low-carbon Graph Attention Network to process the input of decision-related operations and raw machine features, achieving feature representation across heterogeneous graph nodes. We developed an actor–critic architecture embedded with Bayesian optimization to dynamically adjust the weight parameters of various objectives during training, facilitating the generation of low-carbon decisions that address both operation sequencing and machine allocation. The decision network is trained using the PPO algorithm. Empirical results illustrate that our trained decision rules significantly surpass heuristic scheduling rules, exhibiting superior generalization across large and public datasets. Our method outperforms conventional reinforcement learning techniques, particularly by incorporating a Bayesian optimization module that enhances the speed of reward function convergence and aligns outcomes more closely with low-carbon objectives.
While our method has demonstrated significant advantages, it is not without limitations. The primary limitation is the increased computational complexity introduced by the incorporation of multi-head attention modules and Bayesian optimization. This complexity could potentially hinder the implementation and scalability of our approach in real-world manufacturing environments, especially those with limited computational resources. Additionally, the current model may face challenges when dealing with real-time scheduling tasks due to the computational overhead.
To mitigate these limitations, future work could focus on optimizing the computational efficiency of the model and exploring hardware acceleration techniques. Another limitation is the assumption of static machine and job parameters, which might not hold true in highly dynamic manufacturing settings. Future research could address this by integrating adaptive mechanisms to handle such dynamic uncertainties.
The complexity introduced by multi-head attention modules and Bayesian optimization indeed poses challenges. However, these can be managed through various strategies. For instance, adopting high-performance computing resources and distributed computing techniques can help manage the increased computational demands. Furthermore, modularizing the framework allows for incremental upgrades and scalability across different manufacturing environments. Our empirical results have shown good generalization across various datasets, indicating potential applicability in diverse manufacturing contexts. Nevertheless, extensive real-world validation is necessary to fully establish the generalizability of our approach across different industries.
Our optimization approach primarily aims at minimizing carbon emissions while maintaining operational efficiency. However, there could be potential trade-offs such as increased production costs due to the need for advanced computational resources and potential delays in real-time decision making. Additionally, focusing heavily on low-carbon objectives might impact other sustainability metrics like resource utilization and overall production throughput.
In the future, our methodology could be adapted to address multi-objective flexible job shop scheduling issues characterized by dynamic uncertainties, such as the sudden insertion of new jobs and the unpredictability of deadlines. Additionally, future research might explore integrating additional optimization objectives, accounting for various sources of carbon emissions, and employing more sophisticated exploration techniques to develop practical scheduling solutions for complex manufacturing environments.

Author Contributions

Conceptualization, S.H.; methodology, Y.T. and L.S.; validation, Y.T.; data curation, Y.T.; writing—original draft preparation, Y.T. and L.S.; writing—review and editing, S.H.; funding acquisition, S.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (NSFC-12071436).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data and models that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Arias, P.; Bellouin, N.; Coppola, E.; Jones, R.; Krinner, G.; Marotzke, J.; Naik, V.; Palmer, M.; Plattner, G.K.; Rogelj, J.; et al. Climate Change 2021: The physical science basis. In Proceedings of the Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change, Technical Summary, Geneva, Switzerland, 26 July–6 August 2021. [Google Scholar]
  2. Yin, R.; Liu, Z.; Shangguan, F. Thoughts on the Implementation path to a carbon peak and carbon neutrality in China’s steel industry. Engineering 2021, 7, 1680–1683. [Google Scholar] [CrossRef]
  3. Johnson, S.M. Optimal two-and three-stage production schedules with setup times included. Nav. Res. Logist. Q. 1954, 1, 61–68. [Google Scholar] [CrossRef]
  4. Brucker, P.; Schlie, R. Job-shop scheduling with multipurpose machines. Computing 1990, 45, 369–375. [Google Scholar] [CrossRef]
  5. Garey, M.R.; Johnson, D.S.; Sethi, R. The complexity of flowshop and jobshop scheduling. Math. Oper. Res. 1976, 1, 117–129. [Google Scholar] [CrossRef]
  6. Jiang, Z.; Zuo, L. Multi-objective Flexible Job-shop Scheduling under Low Carbon Strategy. Comput. Integr. Manuf. Syst. 2015, 21, 1023. [Google Scholar] [CrossRef]
  7. Zhang, C.; Gu, P.; Jiang, P. Low-carbon scheduling and estimating for a flexible job shop based on carbon footprint and carbon efficiency of multi-job processing. Proc. Inst. Mech. Eng. Part J. Eng. Manuf. 2015, 229, 328–342. [Google Scholar] [CrossRef]
  8. Jiang, T. Low Carbon Workshop Scheduling Problem Based on Grey Wolf Optimization Algorithm. Comput. Integr. Manuf. Syst. 2018, 24, 2428. [Google Scholar] [CrossRef]
  9. Luo, S.; Zhang, L.; Fan, Y. Energy-efficient scheduling for multi-objective flexible job shops with variable processing speeds by grey wolf optimization. J. Clean. Prod. 2019, 234, 1365–1384. [Google Scholar] [CrossRef]
  10. Lu, C.; Li, X.; Gao, L.; Liao, W.; Yi, J. An effective multi-objective discrete virus optimization algorithm for flexible job-shop scheduling problem with controllable processing times. Comput. Ind. Eng. 2017, 104, 156–174. [Google Scholar] [CrossRef]
  11. Naimi, R.; Nouiri, M.; Cardin, O. A Q-Learning rescheduling approach to the flexible job shop problem combining energy and productivity objectives. Sustainability 2021, 13, 13016. [Google Scholar] [CrossRef]
  12. Wang, S.; Li, J.; Tang, H.; Wang, J. Cea-fjsp: Carbon emission-aware flexible job-shop scheduling based on deep reinforcement learning. Front. Environ. Sci. 2022, 10, 1059451. [Google Scholar] [CrossRef]
  13. Luo, S.; Zhang, L.; Fan, Y. Real-time scheduling for dynamic partial-no-wait multiobjective flexible job shop by deep reinforcement learning. IEEE Trans. Autom. Sci. Eng. 2021, 19, 3020–3038. [Google Scholar] [CrossRef]
  14. Feng, Y.; Zhang, L.; Yang, Z.; Guo, Y.; Yang, D. Flexible job shop scheduling based on deep reinforcement learning. In Proceedings of the 2021 5th Asian Conference on Artificial Intelligence Technology (ACAIT), Haikou, China, 29–31 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 660–666. [Google Scholar] [CrossRef]
  15. Ni, F.; Hao, J.; Lu, J.; Tong, X.; Yuan, M.; Duan, J.; Ma, Y.; He, K. A multi-graph attributed reinforcement learning based optimization algorithm for large-scale hybrid flow shop scheduling problem. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 3441–3451. [Google Scholar] [CrossRef]
  16. Park, J.; Chun, J.; Kim, S.H.; Kim, Y.; Park, J. Learning to schedule job-shop problems: Representation and policy learning using graph neural network and reinforcement learning. Int. J. Prod. Res. 2021, 59, 3360–3377. [Google Scholar] [CrossRef]
  17. Song, W.; Chen, X.; Li, Q.; Cao, Z. Flexible job-shop scheduling via graph neural network and deep reinforcement learning. IEEE Trans. Ind. Inform. 2022, 19, 1600–1610. [Google Scholar] [CrossRef]
  18. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph attention networks. arXiv 2018, arXiv:1710.10903. [Google Scholar] [CrossRef]
  19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.u.; Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar] [CrossRef]
  20. Yang, Z.; Yin, K.; Liu, L. Learning to use chopsticks in diverse gripping styles. Acm Trans. Graph. 2022, 41, 1–17. [Google Scholar] [CrossRef]
  21. Luan, F.; Zhao, H.; Liu, S.Q.; He, Y.; Tang, B. Enhanced NSGA-II for multi-objective energy-saving flexible job shop scheduling. Sustain. Comput. Inform. Syst. 2023, 39, 100901. [Google Scholar] [CrossRef]
  22. Li, J.; Han, Y.; Gao, K.; Xiao, X.; Duan, P. Bi-population balancing multi-objective algorithm for fuzzy flexible job shop with energy and transportation. IEEE Trans. Autom. Sci. Eng. 2023, 1–17. [Google Scholar] [CrossRef]
  23. Zhang, W.; Zheng, Y.; Ahmad, R. An energy-efficient multi-objective scheduling for flexible job-shop-type remanufacturing system. J. Manuf. Syst. 2023, 66, 211–232. [Google Scholar] [CrossRef]
  24. Schworm, P.; Wu, X.; Klar, M.; Glatt, M.; Aurich, J.C. Multi-objective Quantum Annealing approach for solving flexible job shop scheduling in manufacturing. J. Manuf. Syst. 2024, 72, 142–153. [Google Scholar] [CrossRef]
  25. Schworm, P.; Wu, X.; Glatt, M.; Aurich, J.C. Solving flexible job shop scheduling problems in manufacturing with Quantum Annealing. Prod. Eng. 2023, 17, 105–115. [Google Scholar] [CrossRef]
  26. Schworm, P.; Wu, X.; Glatt, M.; Aurich, J.C. Responsiveness to sudden disturbances in manufacturing through dynamic job shop scheduling using Quantum Annealing. Procedia Cirp 2023, 120, 511–516. [Google Scholar] [CrossRef]
  27. Li, R.; Gong, W.; Wang, L.; Lu, C.; Dong, C. Co-evolution with deep reinforcement learning for energy-aware distributed heterogeneous flexible job shop scheduling. IEEE Trans. Syst. Man, Cybern. Syst. 2023, 54, 201–211. [Google Scholar] [CrossRef]
  28. Huang, J.P.; Gao, L.; Li, X.Y. An end-to-end deep reinforcement learning method based on graph neural network for distributed job-shop scheduling problem. Expert Syst. Appl. 2024, 238, 121756. [Google Scholar] [CrossRef]
  29. Chen, X.l.; Li, J.q.; Xu, Y. Q-learning based multi-objective immune algorithm for fuzzy flexible job shop scheduling problem considering dynamic disruptions. Swarm Evol. Comput. 2023, 83, 101414. [Google Scholar] [CrossRef]
  30. Ding, L.; Guan, Z.; Rauf, M.; Yue, L. Multi-policy deep reinforcement learning for multi-objective multiplicity flexible job shop scheduling. Swarm Evol. Comput. 2024, 87, 101550. [Google Scholar] [CrossRef]
  31. Wang, R.; Wang, G.; Sun, J.; Deng, F.; Chen, J. Flexible Job Shop Scheduling via Dual Attention Network-Based Reinforcement Learning. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 3091–3102. [Google Scholar] [CrossRef]
  32. Wilson, A.; Fern, A.; Tadepalli, P. Using Trajectory Data to Improve Bayesian Optimization for Reinforcement Learning. J. Mach. Learn. Res. 2014, 15, 253–282. [Google Scholar]
  33. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  34. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar] [CrossRef]
  35. Panwalkar, S.S.; Iskander, W. A survey of scheduling rules. Oper. Res. 1977, 25, 45–61. [Google Scholar] [CrossRef]
  36. Da Col, G.; Teppan, E.C. Industrial-size job shop scheduling with constraint programming. Oper. Res. Perspect. 2022, 9, 100249. [Google Scholar] [CrossRef]
  37. Rooyani, D.; Defersha, F.M. An efficient two-stage genetic algorithm for flexible job-shop scheduling. Ifac-Papersonline 2019, 52, 2519–2524. [Google Scholar] [CrossRef]
  38. Behnke, D.; Geiger, M.J. Test instances for the flexible job shop scheduling problem with work centers. Arbeitspapier /Research Paper/ Helmut-Schmidt-Universität, Lehrstuhl für Betriebswirtschaftslehre, insbes. Logistik-Management 2012, 1–13. [Google Scholar] [CrossRef]
Figure 1. Disjunctive graph example. Node S represents start, node E represents end, other nodes represent operations, directed arcs show precedence constraints, and undirected arcs represent potential sequences of operations on machines, with different colors indicating different machines.
Figure 2. Heterogeneous graph example. Node S represents start, node E represents end, other nodes represent operations and machines, directed arcs show the sequence of operations, and undirected arcs represent the relationship between operations and machines.
Figure 3. Workflow of the algorithm framework.
Figure 4. Details of the LCGAN. Directed arcs represent the embedding order of feature vectors, while colored directed arcs represent the complete application of the current feature.
Figure 5. The total carbon emissions change during the training process.
Figure 6. The variation in reward function values during the training process.
Table 1. Symbol definitions of LC-FJSP Math Model.
Symbol | Description
n | Number of jobs
m | Number of machines
J_i | The i-th job
M_k | The k-th machine
n_i | Total number of operations in job J_i
O_ij | The j-th operation of job J_i
M_ij | Set of optional machines for the j-th operation of job J_i
t_ijk | Processing time of operation O_ij on machine M_k
S_ij | Start time of operation O_ij
C_ij | Completion time of operation O_ij
C_i | Completion time of job J_i
C_max | Maximum completion time (makespan)
ST_k | Start time of machine M_k
CT_k | Completion time of machine M_k
p_ijk | Average energy consumption of operation O_ij on machine M_k
pe_k | Average energy consumption of machine M_k under idle condition
α_e | Carbon emission coefficient of energy consumption
TCE | Total carbon emission
x_ijk | = 1 if operation O_ij is processed on machine M_k; 0 otherwise
y_ij,i'j',k | = 1 if operation O_ij precedes operation O_i'j' on machine M_k; 0 otherwise
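For reference, a total carbon emission expression that is consistent with the symbols above (processing energy plus idle energy, scaled by the emission coefficient) can be written as in the sketch below. This is only an assumed illustrative form; the exact mathematical model is stated in an earlier section of the paper.

```latex
% Illustrative formulation assembled from the symbol definitions in Table 1;
% the model presented earlier in the paper is authoritative.
\begin{equation}
TCE = \alpha_e \left(
      \sum_{i=1}^{n} \sum_{j=1}^{n_i} \sum_{k=1}^{m} x_{ijk}\, p_{ijk}\, t_{ijk}
    + \sum_{k=1}^{m} pe_k \Bigl[ (CT_k - ST_k)
    - \sum_{i=1}^{n} \sum_{j=1}^{n_i} x_{ijk}\, t_{ijk} \Bigr]
  \right)
\end{equation}
```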
Table 2. An instance of LC-FJSP.
Job | Operation | M_1 | M_2 | M_3
J_1 | O_11 | 5 | – | –
J_1 | O_12 | – | 5 | 2
J_1 | O_13 | 6 | – | 7
J_2 | O_21 | 3 | – | 5
J_2 | O_22 | – | 4 | –
J_3 | O_31 | – | – | 3
J_3 | O_32 | 3 | 6 | –
J_3 | O_33 | – | 2 | –
Entries are processing times on the optional machines; "–" indicates that the machine cannot process the operation.
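As an illustration, the instance in Table 2 can be encoded as a simple nested data structure of the kind a scheduling environment typically consumes. The layout below is a hypothetical encoding, not the data format used by our implementation.

```python
# Hypothetical encoding of the LC-FJSP instance from Table 2:
# for each job, a list of operations; for each operation, a mapping
# from eligible machine index to processing time.
instance = {
    "J1": [{1: 5}, {2: 5, 3: 2}, {1: 6, 3: 7}],   # O11, O12, O13
    "J2": [{1: 3, 3: 5}, {2: 4}],                  # O21, O22
    "J3": [{3: 3}, {1: 3, 2: 6}, {2: 2}],          # O31, O32, O33
}

# Example query: machines eligible for O_32 and their processing times.
print(sorted(instance["J3"][1].items()))   # [(1, 3), (2, 6)]
```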
Table 3. Instance generation distributions.
Size (n × m) | n_i ¹ | |M_ij| ² | t_ij ³
10 × 5 | U(4, 6) | U(1, 5) | U(1, 20)
20 × 5 | U(4, 6) | U(1, 5) | U(1, 20)
15 × 10 | U(4, 6) | U(1, 10) | U(1, 20)
20 × 10 | U(4, 6) | U(1, 10) | U(1, 20)
30 × 10 | U(4, 6) | U(1, 10) | U(1, 20)
40 × 10 | U(4, 6) | U(1, 10) | U(1, 20)
¹ Represents the number of operations in job J_i. ² Represents the number of machines that operation O_ij can select. ³ Represents the average processing time of operation O_ij.
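A generator following the distributions in Table 3 might look like the sketch below; the function and field names are illustrative, and only the sampling ranges come from the table (the table specifies the average processing time range, whereas this sketch samples a time per eligible machine).

```python
# Illustrative synthetic-instance generator following Table 3:
# operations per job ~ U(4, 6), eligible machines per operation ~ U(1, k_max),
# processing time per eligible machine ~ U(1, t_max).
import random

def generate_instance(n_jobs, n_machines, k_max, t_max=20, seed=0):
    # k_max must not exceed n_machines.
    rng = random.Random(seed)
    jobs = []
    for _ in range(n_jobs):
        ops = []
        for _ in range(rng.randint(4, 6)):                 # n_i ~ U(4, 6)
            eligible = rng.sample(range(n_machines),
                                  rng.randint(1, k_max))   # |M_ij| ~ U(1, k_max)
            ops.append({m: rng.randint(1, t_max)           # t_ijk ~ U(1, t_max)
                        for m in eligible})
        jobs.append(ops)
    return jobs

# Example: one 10 x 5 instance with up to 5 eligible machines per operation.
example = generate_instance(n_jobs=10, n_machines=5, k_max=5)
print(len(example), "jobs;", sum(len(j) for j in example), "operations")
```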
Table 4. Results on synthetic datasets with two small to medium training scales.
(Column groups: PDRs: FIFO, MOR, SPT, MWKR; Greedy: DRLG, LCGRLG; Sample: DRLS, LCGRLS; exact solver: OR-Tools.)
Size | Metric | FIFO | MOR | SPT | MWKR | DRLG | LCGRLG | DRLS | LCGRLS | OR-Tools
Range A ¹
10 × 5 | Obj | 119.13 | 114.87 | 130.23 | 113.69 | 111.67 | 96.39 | 105.59 | 94.51 | 96.32
10 × 5 | Gap | 23.74% | 19.32% | 35.21% | 18.03% | 16.03% | 0.03% | 9.66% | −1.87% | –
10 × 5 | Time | 0.04 | 0.04 | 0.04 | 0.04 | 0.45 | 0.29 | 1.11 | 0.99 | –
15 × 10 | Obj | 182.82 | 171.42 | 195.28 | 168.74 | 166.92 | 153.22 | 160.86 | 141.20 | 143.53
15 × 10 | Gap | 27.45% | 19.47% | 36.01% | 17.59% | 16.33% | 4.56% | 12.13% | −1.57% | –
15 × 10 | Time | 0.13 | 0.13 | 0.12 | 0.13 | 1.43 | 0.97 | 3.98 | 5.25 | –
20 × 5 | Obj | 230.73 | 231.37 | 236.52 | 236.86 | 211.22 | 189.49 | 207.53 | 180.27 | 188.15
20 × 5 | Gap | 22.64% | 22.96% | 25.74% | 25.79% | 12.27% | 0.69% | 10.31% | −4.17% | –
20 × 5 | Time | 0.16 | 0.16 | 0.16 | 0.16 | 0.90 | 0.60 | 2.36 | 2.64 | –
20 × 10 | Obj | 244.33 | 231.34 | 266.08 | 234.46 | 215.78 | 197.32 | 214.81 | 185.82 | 195.98
20 × 10 | Gap | 24.73% | 18.07% | 35.69% | 19.7% | 10.15% | 0.63% | 9.64% | −5.13% | –
20 × 10 | Time | 0.36 | 0.36 | 0.35 | 0.36 | 1.91 | 1.34 | 6.23 | 8.67 | –
Range B ²
10 × 5 | Obj | 561.56 | 550.92 | 507.52 | 536.53 | 553.61 | 400.69 | 483.9 | 331.49 | 326.24
10 × 5 | Gap | 73.78% | 70.59% | 55.8% | 66.06% | 71.42% | 23.1% | 49.71% | 1.78% | –
10 × 5 | Time | 0.04 | 0.04 | 0.04 | 0.04 | 0.46 | 0.25 | 1.11 | 0.72 | –
15 × 10 | Obj | 883.53 | 855.44 | 728.92 | 843.85 | 807.47 | 582.21 | 756.07 | 473.51 | 377.17
15 × 10 | Gap | 135.57% | 128.09% | 93.69% | 124.94% | 115.26% | 54.7% | 101.52% | 25.77% | –
15 × 10 | Time | 0.12 | 0.12 | 0.12 | 0.12 | 0.88 | 0.75 | 4.16 | 5.20 | –
20 × 5 | Obj | 1074.52 | 1067.48 | 842.22 | 1080.28 | 1059.04 | 677.57 | 962.9 | 573.76 | 602.04
20 × 5 | Gap | 79.5% | 78.17% | 40.01% | 80.67% | 76.79% | 12.65% | 60.7% | −4.68% | –
20 × 5 | Time | 0.16 | 0.16 | 0.16 | 0.16 | 0.61 | 0.48 | 2.37 | 2.71 | –
20 × 10 | Obj | 1148.42 | 1115.98 | 870.06 | 1100.89 | 1045.82 | 613.11 | 990.37 | 506.01 | 464.16
20 × 10 | Gap | 148.27% | 141.42% | 87.64% | 137.97% | 126.12% | 32.25% | 114.15% | 9.09% | –
20 × 10 | Time | 0.35 | 0.35 | 0.36 | 0.35 | 1.95 | 1.33 | 6.03 | 8.54 | –
¹ The processing time for each operation ranges from 1 to 20. ² The processing time for each operation ranges from 1 to 100.
Table 5. Results on synthetic datasets with two big training scales.
(Column groups: PDRs: MWKR, SPT; Greedy (G) and Sample (S) decoding for DRL and LCGRL, each trained on the 10 × 5 or 20 × 10 scale; exact solver: OR-Tools.)
Size | Metric | MWKR | SPT | DRL (G, 10 × 5) | DRL (G, 20 × 10) | LCGRL (G, 10 × 5) | LCGRL (G, 20 × 10) | DRL (S, 10 × 5) | DRL (S, 20 × 10) | LCGRL (S, 10 × 5) | LCGRL (S, 20 × 10) | OR-Tools
Range A ¹
30 × 10 | Obj | 312.93 | 350.07 | 314.71 | 313.04 | 291.21 | 291.52 | 308.55 | 312.59 | 289.59 | 290.56 | 274.67
30 × 10 | Gap | 13.96% | 27.47% | 14.61% | 14.01% | 6.04% | 6.16% | 12.36% | 13.83% | 5.45% | 5.81% | –
30 × 10 | Time | 0.56 | 0.56 | 2.84 | 2.86 | 2.06 | 2.07 | 12.79 | 12.80 | 17.44 | 17.27 | –
40 × 10 | Obj | 414.82 | 445.17 | 417.87 | 416.18 | 383.30 | 382.95 | 410.76 | 415.25 | 384.28 | 384.86 | 365.96
40 × 10 | Gap | 13.37% | 21.66% | 14.21% | 13.75% | 4.75% | 4.67% | 12.26% | 13.49% | 5.02% | 5.18% | –
40 × 10 | Time | 0.80 | 0.80 | 3.82 | 3.81 | 3.05 | 3.01 | 24.54 | 24.40 | 29.20 | 27.92 | –
Range B ²
30 × 10 | Obj | 1539.666 | 1105.99 | 1564.57 | 1543.69 | 911.18 | 923.37 | 1486.56 | 1461.16 | 884.32 | 883.34 | 692.26
30 × 10 | Gap | 122.89% | 59.74% | 126.55% | 123.57% | 31.75% | 33.56% | 115.21% | 111.51% | 27.86% | 27.72% | –
30 × 10 | Time | 0.56 | 0.55 | 2.93 | 2.93 | 2.13 | 2.13 | 12.88 | 12.75 | 18.32 | 18.38 | –
40 × 10 | Obj | 2037.652 | 1357.16 | 2048.96 | 2032.54 | 1112.51 | 1135.62 | 1976.25 | 1945.53 | 1111.92 | 1107.50 | 274.67
40 × 10 | Gap | 108.66% | 38.74% | 109.87% | 108.12% | 13.76% | 16.18% | 102.45% | 99.26% | 13.76% | 13.3% | –
40 × 10 | Time | 0.79 | 0.77 | 3.87 | 3.93 | 3.01 | 2.96 | 24.55 | 24.50 | 30.76 | 28.72 | –
¹ The processing time for each operation ranges from 1 to 20. ² The processing time for each operation ranges from 1 to 100.
Table 6. Results on public datasets.
(For each benchmark group the three values are Obj / Gap / Time.)
Method | mk | la (rdata) | la (edata) | la (vdata)
DRLG 10 × 5 | 201.40 / 28.52% / 1.25 | 1030.83 / 11.15% / 1.40 | 1187.48 / 15.53% / 1.40 | 955.90 / 4.25% / 1.37
DRLG 15 × 10 | 198.50 / 26.77% / 1.25 | 1030.38 / 11.14% / 1.40 | 1182.08 / 15.03% / 1.40 | 954.33 / 4.02% / 1.37
LCGRLG 10 × 5 | 184.90 / 14.61% / 0.81 | 1026.60 / 10.64% / 0.89 | 1181.65 / 15.14% / 0.90 | 943.08 / 3.01% / 0.90
LCGRLG 15 × 10 | 186.30 / 15.43% / 0.82 | 1029.98 / 11.13% / 0.90 | 1179.58 / 15.01% / 0.90 | 945.13 / 3.22% / 0.89
DRLS 10 × 5 | 190.30 / 18.56% / 4.12 | 985.30 / 5.57% / 4.81 | 1116.68 / 8.17% / 4.90 | 930.80 / 1.32% / 4.71
DRLS 15 × 10 | 190.60 / 19.0% / 4.13 | 988.38 / 5.95% / 4.81 | 1119.43 / 8.69% / 4.87 | 931.33 / 1.34% / 4.72
LCGRLS 10 × 5 | 180.40 / 8.87% / 5.65 | 980.68 / 5.21% / 6.74 | 1119.13 / 8.73% / 6.56 | 924.43 / 0.57% / 6.86
LCGRLS 15 × 10 | 180.40 / 8.87% / 5.75 | 981.75 / 5.39% / 6.85 | 1119.37 / 8.81% / 6.78 | 924.20 / 0.56% / 6.81
MOR | 200.36 / 28.08% / 0.11 | 1066.73 / 15.07% / 0.12 | 1227.07 / 19.24% / 0.12 | 966.01 / 5.68% / 0.12
SPT | 237.52 / 44.88% / 0.11 | 1200.41 / 29.47% / 0.12 | 1312.84 / 26.79% / 0.12 | 1082.88 / 18.2% / 0.12
FIFO | 205.56 / 31.82% / 0.11 | 1087.12 / 17.25% / 0.12 | 1244.92 / 20.83% / 0.12 | 982.89 / 7.58% / 0.12
MWKR | 201.74 / 28.91% / 0.11 | 1053.10 / 13.86% / 0.12 | 1219.01 / 18.6% / 0.12 | 952.01 / 4.22% / 0.12
2SGA | 175.20 / 3.17% / 57.60 | – / – / – | – / – / – | 922.20 / 0.39% / 51.43
OR-tools | 174.20 / 1.5% / 1447.1 | 935.80 / 0.11% / 1397.4 | 1028.93 / −0.03% / 899.60 | 919.60 / −0.01% / 639.17
Table 7. t-Tests for synthetic and public data results.
(For each data group the three values are t-statistic / p-value / significant.)
Comparison | Synthetic data | Public data
FIFO vs. MOR | 3.29 / 0.0133 / Yes | 4.47 / 0.0209 / Yes
FIFO vs. SPT | 1.95 / 0.0928 / No | −4.32 / 0.0229 / Yes
FIFO vs. MWKR | 2.31 / 0.0539 / No | 3.47 / 0.0404 / Yes
FIFO vs. DRLG | 2.73 / 0.0292 / Yes | 2.83 / 0.0662 / No
FIFO vs. LCGRLG | 2.76 / 0.0283 / Yes | 4.63 / 0.0190 / Yes
FIFO vs. DRLS | 3.54 / 0.0095 / Yes | 2.95 / 0.0602 / No
FIFO vs. LCGRLS | 2.85 / 0.0247 / Yes | 3.46 / 0.0407 / Yes
FIFO vs. OR-tools | 2.76 / 0.0281 / Yes | 2.75 / 0.0708 / No
MOR vs. SPT | 1.74 / 0.1261 / No | −4.40 / 0.0217 / Yes
MOR vs. MWKR | 0.82 / 0.4369 / No | 2.39 / 0.0966 / No
MOR vs. DRLG | 2.34 / 0.0520 / No | 2.14 / 0.1219 / No
MOR vs. LCGRLG | 2.67 / 0.0320 / Yes | 4.39 / 0.0219 / Yes
MOR vs. DRLS | 3.37 / 0.0119 / Yes | 2.63 / 0.0785 / No
MOR vs. LCGRLS | 2.79 / 0.0271 / Yes | 3.17 / 0.0503 / No
MOR vs. OR-tools | 2.71 / 0.0303 / Yes | 2.53 / 0.0855 / No
SPT vs. MWKR | −1.68 / 0.1366 / No | 4.12 / 0.0259 / Yes
SPT vs. DRLG | −1.39 / 0.2085 / No | 4.08 / 0.0266 / Yes
SPT vs. LCGRLG | 3.96 / 0.0055 / Yes | 4.85 / 0.0167 / Yes
SPT vs. DRLS | −0.53 / 0.6112 / No | 4.07 / 0.0268 / Yes
SPT vs. LCGRLS | 3.70 / 0.0077 / Yes | 4.41 / 0.0216 / Yes
SPT vs. OR-tools | 3.33 / 0.0126 / Yes | 3.81 / 0.0319 / Yes
MWKR vs. DRLG | 2.26 / 0.0580 / No | 1.47 / 0.2374 / No
MWKR vs. LCGRLG | 2.66 / 0.0325 / Yes | 3.65 / 0.0356 / Yes
MWKR vs. DRLS | 3.38 / 0.0118 / Yes | 2.40 / 0.0962 / No
MWKR vs. LCGRLS | 2.78 / 0.0273 / Yes | 2.96 / 0.0598 / No
MWKR vs. OR-tools | 2.71 / 0.0301 / Yes | 2.37 / 0.0982 / No
DRLG vs. LCGRLG | 2.58 / 0.0365 / Yes | 3.40 / 0.0426 / Yes
DRLG vs. DRLS | 2.79 / 0.0270 / Yes | 2.94 / 0.0606 / No
DRLG vs. LCGRLS | 2.72 / 0.0297 / Yes | 4.09 / 0.0264 / Yes
DRLG vs. OR-tools | 2.66 / 0.0324 / Yes | 2.61 / 0.0799 / No
LCGRLG vs. DRLS | −2.40 / 0.0475 / Yes | 1.82 / 0.1667 / No
LCGRLG vs. LCGRLS | 3.06 / 0.0184 / Yes | 2.51 / 0.0867 / No
LCGRLG vs. OR-tools | 2.35 / 0.0511 / No | 2.11 / 0.1250 / No
DRLS vs. LCGRLS | 2.62 / 0.0343 / Yes | 1.78 / 0.1740 / No
DRLS vs. OR-tools | 2.54 / 0.0385 / Yes | 2.32 / 0.1029 / No
LCGRLS vs. OR-tools | 0.83 / 0.4334 / No | 1.81 / 0.1676 / No
Table 8. Confidence intervals for synthetic and public data results.
(For each data group the three values are Mean / Lower CI / Upper CI.)
Method | Synthetic data | Public data
FIFO | 555.38 / 327.42 / 783.34 | 880.12 / 144.30 / 1615.95
MOR | 541.48 / 318.98 / 763.98 | 865.04 / 139.48 / 1590.60
SPT | 471.61 / 305.62 / 637.60 | 958.41 / 179.22 / 1737.60
MWKR | 538.91 / 330.77 / 747.05 | 856.46 / 140.18 / 1572.75
DRLG | 558.94 / 338.91 / 779.97 | 842.61 / 502.84 / 1182.38
LCGRLG | 388.75 / 264.24 / 513.26 | 834.65 / 491.28 / 1178.03
DRLS | 484.39 / 306.11 / 662.67 | 806.60 / 482.93 / 1130.28
LCGRLS | 300.34 / 193.78 / 406.90 | 801.30 / 474.71 / 1127.88
OR-tools | 299.77 / 171.53 / 428.01 | 764.63 / 133.62 / 1395.65
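Statistics of the kind reported in Tables 7 and 8 can be reproduced with standard SciPy routines. The snippet below is a generic illustration: it uses only the four Range A "Obj" rows of Table 4 for FIFO and LCGRLG as example inputs, so its printed values will not match Table 7, which aggregates over all synthetic groups.

```python
# Illustrative paired t-test and 95% confidence interval, in the style of
# Tables 7 and 8. Example data: Range A mean objectives from Table 4.
import numpy as np
from scipy import stats

fifo = np.array([119.13, 182.82, 230.73, 244.33])      # FIFO, Range A
lcgrlg = np.array([96.39, 153.22, 189.49, 197.32])      # LCGRLG, Range A

# Paired t-test between the two methods on the same instance groups.
t_stat, p_value = stats.ttest_rel(fifo, lcgrlg)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}, "
      f"significant at 0.05: {p_value < 0.05}")

# 95% confidence interval for the mean objective of one method.
mean = fifo.mean()
sem = stats.sem(fifo)
lower, upper = stats.t.interval(0.95, df=len(fifo) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```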