4.1. Markov Decision Process (MDP)
To address the LC-FJSP using deep reinforcement learning, we initially define the states, actions, state transitions, and rewards, transforming the problem into a Markov Decision Process (MDP). A DRL-based decision framework is then established, which treats the selection of operations and machines integrally and outputs a probability distribution for decision making. A greedy strategy is employed, focusing on selecting operation–machine pairs with the highest scores. Lastly, we explain the training methodology for the proposed model.
The scheduling process in FJSP is conceptualized as assigning a ready operation to a suitable idle machine. The procedure is as follows. At each decision point t (either at the start or upon the completion of an operation), the agent assesses the current state $s_t$ and selects an action $a_t$, specifically assigning an unscheduled operation to an available machine, beginning its execution at time $t$. Subsequently, the system transitions to the next state $s_{t+1}$ at step $t+1$. This sequence continues until all operations are scheduled. The MDP framework is defined as follows:
State: The state representation captures the primary attributes and dynamics of the scheduling environment, considering both operations and machines as the composite state. The collective state of all operations and machines at any decision step t forms state $s_t$, starting from the initial FJSP instance denoted as $s_0$.
Action: This paper integrates operation selection and machine assignment into a unified action choice, defining all feasible operation–machine pairs as the action space. As scheduling progresses, the action space naturally shrinks as more operations are allocated.
State Transition: At each decision step t, from state $s_t$, the agent selects an action $a_t$ from the available action space $A_t$; performing $a_t$ leads the environment to transition to the subsequent state $s_{t+1}$.
Reward: The purpose of designing the reward function is to guide the agent to select actions that minimize the maximum completion time and total carbon emissions of all operations. The reward function at time step t is defined as $r_t = f(s_t) - f(s_{t+1})$, where $f(s_t)$ represents the value of the weighted objective (combining maximum completion time and total carbon emissions) in the current state $s_t$. When the discount factor $\gamma = 1$, the accumulation of rewards at each step telescopes to $\sum_{t} r_t = f(s_0) - f(s_T)$, where $s_T$ is the terminal state. In a specific problem instance, $f(s_0)$ is a constant, implying that minimizing $f(s_T)$ and maximizing the cumulative reward are equivalent.
Policy: We adopt a stochastic policy $\pi(a_t \mid s_t)$, which defines a probability distribution over the action set $A_t$ for each state $s_t$. This distribution is generated by the deep reinforcement learning model, whose parameters are optimized during training to maximize the cumulative reward.
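To make this interface concrete, the following is a minimal Python sketch of the MDP described above; it is not the paper's implementation. The class name `LcFjspEnv`, the weights `w_c`/`w_e`, and the partial-schedule estimate of $f$ are illustrative assumptions (in particular, $f$ is computed here only from already-scheduled operations, a simplification of a state-dependent objective estimate).

```python
from dataclasses import dataclass, field

@dataclass
class LcFjspEnv:
    proc_time: dict          # (op, machine) -> processing time
    emis: dict               # (op, machine) -> carbon emission
    w_c: float = 0.5         # weight on makespan
    w_e: float = 0.5         # weight on total emissions
    schedule: dict = field(default_factory=dict)       # op -> (machine, start)
    machine_free: dict = field(default_factory=dict)   # machine -> free time

    def actions(self):
        """Feasible operation-machine pairs: every unscheduled operation
        paired with each machine that can process it (precedence
        constraints are omitted for brevity)."""
        return [(o, m) for (o, m) in self.proc_time if o not in self.schedule]

    def _f(self):
        """Weighted objective f(s) = w_c * C_max + w_e * E_total, estimated
        here only from the already-scheduled operations."""
        if not self.schedule:
            return 0.0
        c_max = max(start + self.proc_time[(o, m)]
                    for o, (m, start) in self.schedule.items())
        e_tot = sum(self.emis[(o, m)] for o, (m, _) in self.schedule.items())
        return self.w_c * c_max + self.w_e * e_tot

    def step(self, action):
        """Assign one (operation, machine) pair; the reward is the change
        r_t = f(s_t) - f(s_{t+1}) in the weighted objective estimate."""
        f_before = self._f()
        op, machine = action
        start = self.machine_free.get(machine, 0.0)
        self.schedule[op] = (machine, start)
        self.machine_free[machine] = start + self.proc_time[(op, machine)]
        n_ops = len({o for (o, _) in self.proc_time})
        return f_before - self._f(), len(self.schedule) == n_ops
```

Calling `step` repeatedly until the returned flag is true reproduces the episode structure described above.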
For example, consider a simple scenario where there are two jobs, $J_1$ and $J_2$, each with one operation, $O_{11}$ and $O_{21}$, respectively, and three machines, $M_1$, $M_2$, and $M_3$. At a decision point t, both $O_{11}$ and $O_{21}$ are ready to be processed.
At time t, the current state $s_t$ includes the status of all jobs and machines. For instance, $J_1$'s operation $O_{11}$ is pending assignment, $J_2$'s operation $O_{21}$ is pending assignment, and all machines $M_1$, $M_2$, and $M_3$ are idle.
The action space $A_t$ includes all possible operation–machine assignments. In this scenario, the possible actions are:
1. Assign $O_{11}$ to $M_1$;
2. Assign $O_{11}$ to $M_2$;
3. Assign $O_{11}$ to $M_3$;
4. Assign $O_{21}$ to $M_1$;
5. Assign $O_{21}$ to $M_2$;
6. Assign $O_{21}$ to $M_3$.
Suppose the agent selects the action to assign $O_{11}$ to $M_1$. The system transitions to the next state $s_{t+1}$, where $O_{11}$ is being processed on $M_1$. For example, the new state could be as follows: $O_{11}$ is running on $M_1$ with an expected completion time of 5 units, $O_{21}$ is still pending assignment, and $M_2$ and $M_3$ remain idle.
The reward for this action is calculated based on the reduction in the combined metric of maximum completion time ($C_{\max}$) and total carbon emissions ($E_{\text{total}}$). Suppose at state $s_t$ that $C_{\max}$ is 20 and $E_{\text{total}}$ is 30. After transitioning to state $s_{t+1}$, $C_{\max}$ reduces to 19 and $E_{\text{total}}$ reduces to 28. The reward function is defined as $r_t = f(s_t) - f(s_{t+1})$, where $f$ represents the weighted sum of $C_{\max}$ and $E_{\text{total}}$. Assuming equal weights, $f(s_t) = 0.5 \times 20 + 0.5 \times 30 = 25$ and $f(s_{t+1}) = 0.5 \times 19 + 0.5 \times 28 = 23.5$; hence, $r_t = 25 - 23.5 = 1.5$.
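As a quick numerical check, the reward computation for this example reduces to a difference of weighted sums (the 0.5/0.5 split is the example's assumption):

```python
# Numerical check of the worked example with equal weights on C_max and E_total.
w_c, w_e = 0.5, 0.5
f_t  = w_c * 20 + w_e * 30   # f(s_t)     = 25.0
f_t1 = w_c * 19 + w_e * 28   # f(s_{t+1}) = 23.5
reward = f_t - f_t1          # r_t = 1.5
print(reward)
```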
By considering both operation selection and machine assignment in its actions, the agent effectively learns to balance the workload and optimize the overall performance metrics in LC-FJSP.
4.2. Low-Carbon Graph Attention Network (LCGAN)
To address the low-carbon flexible job shop scheduling problem (LC-FJSP), characteristics such as the processing time on operation–machine (O–M) arcs are critical. From these times, we can calculate the carbon emissions $E^{\text{proc}}$ generated during processing on different machines; likewise, from the machines' idle times, we can infer the carbon emissions $E^{\text{idle}}$ generated when the machines are not active. This study takes advantage of the unique features and benefits of the heterogeneous graph structure by introducing a tailored LCGAN network architecture specifically designed for LC-FJSP. By enhancing two attention modules, as cited in [31], this framework captures feature representations of operation and machine nodes. Energy consumption features are added to the O–M arcs within the machine feature attention module to aid in the aggregation and filtering of operation features. To address the trade-off between time and carbon emissions in subsequent calculations, we employ Bayesian optimization to adjust the weights for maximum completion time and carbon emissions, aiming to identify the optimal solution. The input feature dimensions for the operation and machine nodes are $d_O$ and $d_M$, respectively.
Figure 4 illustrates the architecture of LCGAN.
4.2.1. Operation Feature Attention Module
The operation feature attention module aims to connect the operations within the same job by identifying the most important operations through their inherent attributes. For each input operation feature $\mu_{ij}$ of $O_{ij}$, this module establishes relationships between $O_{ij}$, its predecessor $O_{i,j-1}$, and its successor $O_{i,j+1}$ by calculating their attention coefficients as follows:

$$ e_{ij,p} = \mathrm{LeakyReLU}\left(\mathbf{a}^{\top}\left[\mathbf{W}\mu_{ij} \,\Vert\, \mathbf{W}\mu_{p}\right]\right), $$

where $\mathbf{W}$ and $\mathbf{a}$ are linear transformations, for all $O_{p} \in \{O_{i,j-1},\, O_{ij},\, O_{i,j+1}\}$.
We chose the LeakyReLU activation function over the standard ReLU for several reasons. Firstly, LeakyReLU helps mitigate the “dying ReLU” problem, ensuring that neurons remain active and gradients flow during training, which is crucial for our operation feature attention module. Secondly, our preliminary experiments indicated the presence of noisy features and outliers in the dataset. LeakyReLU’s small negative slope allows for non-zero gradients when units are inactive, helping the model handle noise and outliers more robustly. Lastly, LeakyReLU’s ability to provide gradients for negative inputs aids in better gradient propagation, particularly beneficial for deep networks.
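For reference, a two-line comparison shows the difference on negative inputs (the slope 0.2 is an illustrative value, not necessarily the one used in our experiments):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.0])
print(F.relu(x))                            # [0.0, 0.0, 0.0, 1.0]: negatives zeroed
print(F.leaky_relu(x, negative_slope=0.2))  # [-0.4, -0.1, 0.0, 1.0]: small slope kept
```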
These calculations are similar to those in GAT but narrowed in scope. Since the predecessors (or successors) of some operations may not exist or may be removed at some step, dynamic masking is applied to the attention coefficients of these predecessors and successors. The softmax function normalizes all $e_{ij,p}$ to obtain the normalized attention coefficients $\alpha_{ij,p}$. Finally, by a weighted linear combination of the transformed input features $\mathbf{W}\mu_{i,j-1}$, $\mathbf{W}\mu_{ij}$, and $\mathbf{W}\mu_{i,j+1}$, followed by a nonlinear activation function $\sigma$, the output feature vector $\mu'_{ij}$ is obtained:

$$ \mu'_{ij} = \sigma\left(\sum_{O_{p}} \alpha_{ij,p}\,\mathbf{W}\mu_{p}\right). $$
By sequentially connecting multiple operation feature attention modules, the message of $O_{ij}$ can be propagated to all operations in job $J_i$.
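As a concrete, simplified illustration, the following PyTorch sketch implements this predecessor/self/successor attention with dynamic masking. The tensor layout, the masking convention, and the ELU output activation are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OperationAttention(nn.Module):
    """Sketch of the operation feature attention module: each operation
    attends only to itself, its predecessor, and its successor within the
    same job (GAT-style, narrowed in scope)."""
    def __init__(self, d_in, d_out, slope=0.2):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)   # shared linear map W
        self.a = nn.Linear(2 * d_out, 1, bias=False)  # attention vector a
        self.slope = slope

    def forward(self, mu, mask):
        # mu:   (n_ops, d_in) features of one job's operations, in order
        # mask: (n_ops, 3) booleans; True where predecessor/self/successor exists
        h = self.W(mu)                                      # W * mu
        pred = torch.roll(h, 1, dims=0)                     # h_{i,j-1}
        succ = torch.roll(h, -1, dims=0)                    # h_{i,j+1}
        neigh = torch.stack([pred, h, succ], dim=1)         # (n, 3, d_out)
        pair = torch.cat([h.unsqueeze(1).expand_as(neigh), neigh], dim=-1)
        e = F.leaky_relu(self.a(pair).squeeze(-1), self.slope)  # e_{ij,p}
        e = e.masked_fill(~mask, float('-inf'))             # dynamic masking
        alpha = torch.softmax(e, dim=-1)                    # normalized coeffs
        return F.elu((alpha.unsqueeze(-1) * neigh).sum(1))  # mu'_{ij}
```

For a job with three operations, the mask would be `[[False, True, True], [True, True, True], [True, True, False]]`, marking the missing predecessor of the first operation and the missing successor of the last.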
4.2.2. Machine Feature Attention Module
Each operation to be processed can ultimately be completed on only one machine; hence, a competitive relationship exists between different machines over the same candidate operations. This competitive relationship may evolve dynamically as the production process progresses. We define $\mathcal{C}_{kl}$ as the set of operations over which machines $M_k$ and $M_l$ compete, and $E_{ij}^{k}$ and $E_{ij}^{l}$, respectively, represent the energy consumption of operation $O_{ij}$ on machine $M_k$ and machine $M_l$. Furthermore, we use $u_{kl}$, an aggregation of the features and energy consumption of the operations in $\mathcal{C}_{kl}$, to measure the intensity of competition between $M_k$ and $M_l$, where more intense competition indicates that the candidate operations are more important. The machine feature attention module uses $u_{kl}$ to calculate the attention coefficient $e_{kl}$. For each machine $M_k$ with input feature $\nu_k$, the attention coefficients $e_{kl}$ for all $M_l$ competing with $M_k$ are calculated as follows:

$$ e_{kl} = \mathrm{LeakyReLU}\left(\mathbf{a}^{\top}\left[\mathbf{W}_1 \nu_k \,\Vert\, \mathbf{W}_1 \nu_l \,\Vert\, \mathbf{W}_2 u_{kl}\right]\right), $$

where $\mathbf{W}_1$ and $\mathbf{W}_2$ are weight matrices, and $\mathbf{a}$ is a linear transformation.
$\mathcal{C}_{kk}$ represents the set of unscheduled operations that $M_k$ can process, so $u_{kk}$ can be considered a measure of $M_k$'s processing capability, and we similarly apply the above formula to calculate $e_{kk}$. Then, normalized attention coefficients $\alpha_{kl}$ are obtained using softmax, and the transformed input features are combined and activated with ELU to obtain the machine output feature $\nu'_k$.
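A companion sketch for the machine side, under the same caveats as before; in particular, the tensor `u` of pairwise competition intensities $u_{kl}$ is assumed to be precomputed from the candidate operations' features and energies:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MachineAttention(nn.Module):
    """Sketch of the machine feature attention module: machine M_k attends
    to every machine M_l it competes with, with the competition intensity
    u_kl injected into the attention score."""
    def __init__(self, d_mach, d_comp, d_out, slope=0.2):
        super().__init__()
        self.W1 = nn.Linear(d_mach, d_out, bias=False)  # machine features
        self.W2 = nn.Linear(d_comp, d_out, bias=False)  # competition term u_kl
        self.a = nn.Linear(3 * d_out, 1, bias=False)
        self.slope = slope

    def forward(self, nu, u, mask):
        # nu:   (m, d_mach) machine features
        # u:    (m, m, d_comp) competition intensities u_kl (u_kk on diagonal)
        # mask: (m, m) booleans; True where M_k and M_l share candidate ops
        h = self.W1(nu)
        m = h.size(0)
        pair = torch.cat([h.unsqueeze(1).expand(m, m, -1),
                          h.unsqueeze(0).expand(m, m, -1),
                          self.W2(u)], dim=-1)                  # (m, m, 3*d_out)
        e = F.leaky_relu(self.a(pair).squeeze(-1), self.slope)  # e_{kl}
        alpha = torch.softmax(e.masked_fill(~mask, float('-inf')), dim=-1)
        return F.elu(alpha @ h)                                 # nu'_k
```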
4.2.3. Multi-Head Attention Module
We utilize multiple attention heads to process the aforementioned modules, aiming to learn the diverse relationships between entities. Let H denote the number of attention heads in the attention layer; we apply H attention mechanism modules, each containing different parameters. Firstly, parallel computations are performed to derive attention coefficients and combinations. Secondly, their outputs are integrated through an aggregation operator: we adopt concatenation as the aggregation operator in hidden layers, and an average operator in the last layer. Finally, an activation function is applied to obtain the output of the layer.
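A minimal sketch of this aggregation step, assuming all H heads produce outputs of equal width:

```python
import torch

def aggregate_heads(head_outputs, last_layer=False):
    """Combine the outputs of H attention heads, each of shape (n, d_out):
    concatenation in hidden layers, averaging in the final layer."""
    if last_layer:
        return torch.stack(head_outputs).mean(dim=0)   # (n, d_out)
    return torch.cat(head_outputs, dim=-1)             # (n, H * d_out)
```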
4.2.4. Graph Pooling
When we use graph neural networks (GNNs) to process graph-structured data, the input graphs may have a varying number of nodes and different edge connections. This diversity can make the model very sensitive to changes in the input graphs, making it difficult to generalize well to new graph data. Graph pooling operations can help solve this problem by aggregating nodes or subgraphs in the graph to obtain a higher-level representation. This higher-level representation is equivalent to a summary or abstraction of the entire graph, containing important information and key features of the graph. The original features of operation $O_{ij}$ and machine $M_k$ are denoted as $\mu_{ij}^{(0)}$ and $\nu_k^{(0)}$, respectively. After processing by L layers of a GNN, the features aggregated with attention weights, $\mu_{ij}^{(L)}$ and $\nu_k^{(L)}$, are used for subsequent decision tasks. Following the method in reference [19], we first average pool the features of the operations and machines, respectively, and then concatenate the results to form the global feature of the FJSP instance, as shown below:

$$ h_t = \left[\frac{1}{|\mathcal{O}|}\sum_{O_{ij}\in\mathcal{O}} \mu_{ij}^{(L)} \,\middle\Vert\, \frac{1}{|\mathcal{M}|}\sum_{M_k\in\mathcal{M}} \nu_k^{(L)}\right]. $$
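In code, this pooling is a one-liner; here `op_emb` and `mach_emb` stand for the final-layer embeddings $\mu^{(L)}$ and $\nu^{(L)}$:

```python
import torch

def global_feature(op_emb, mach_emb):
    """Mean-pool operation and machine embeddings after L GNN layers,
    then concatenate them into the global instance feature h_t."""
    # op_emb: (n_ops, d), mach_emb: (n_mach, d') -> h_t: (d + d',)
    return torch.cat([op_emb.mean(dim=0), mach_emb.mean(dim=0)], dim=-1)
```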
4.4. Bayesian Optimization
Following reference [32], we employ Bayesian optimization to determine the weights of the reward function. This method selects appropriate sampling points in the search space and adjusts their positions based on observation results, gradually approaching the optimal solution. Our objective is to optimize the black-box function $f(\omega_C, \omega_E)$, where $\omega_C$ and $\omega_E$ are the coefficients to be optimized. We utilize the results of the decision network to obtain some sample points $X = \{x_1, \ldots, x_n\}$ and their corresponding function values $Y = \{f(x_1), \ldots, f(x_n)\}$. From these observations, we establish a Gaussian process model to describe $f(x)$. To select the optimal sampling point, we need to find a point $x_{n+1}$ under the current Gaussian process model that minimizes the objective function:

$$ x_{n+1} = \arg\min_{x}\left(\mu_n(x) - \kappa\,\sigma_n(x)\right), $$

where $\mu_n(x)$ and $\sigma_n(x)$ denote the posterior mean and standard deviation of the Gaussian process, and $\kappa$ balances exploitation against exploration.
To update the Gaussian process model, we observe the function value $f(x_{n+1})$ at the point $x_{n+1}$ and then add $(x_{n+1}, f(x_{n+1}))$ to the sample points and function values. Next, using Bayes' theorem and Gaussian process regression methods, the mean vector and covariance matrix of the Gaussian process model are updated.
The above steps are repeated until convergence is reached or a preset number of iterations is exhausted. The final mean vector $\mu^{*}$ can be used to estimate the values of $\omega_C$ and $\omega_E$. Specifically, this is represented as:

$$ \omega_C = \frac{\bar{\mu}}{\bar{C}}, \qquad \omega_E = \frac{\bar{\mu}}{\bar{E}}, $$

where $\bar{C}$ and $\bar{E}$ represent the mean values of C and E in the input space, respectively, and $\bar{\mu}$ represents the mean value of the function values at all points in the input space.
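The loop above can be sketched with off-the-shelf Gaussian process tools. In the following Python sketch, the stub `objective` (standing in for a run of the trained scheduler with candidate weights), the lower-confidence-bound acquisition, and all hyperparameters (Matérn kernel, $\kappa$, sample counts) are illustrative assumptions rather than the paper's configuration:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Hypothetical black-box: the objective value the scheduler achieves when
# trained/evaluated with weights (w_c, w_e); stubbed here for illustration.
def objective(x):
    w_c, w_e = x
    return (w_c - 0.6) ** 2 + (w_e - 0.4) ** 2   # stand-in for f(w_c, w_e)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(5, 2))           # initial sample points
Y = np.array([objective(x) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
kappa = 2.0                                       # exploration weight
for _ in range(20):
    gp.fit(X, Y)
    # Minimize the lower-confidence-bound acquisition over random candidates.
    cand = rng.uniform(0.0, 1.0, size=(256, 2))
    mean, std = gp.predict(cand, return_std=True)
    x_next = cand[np.argmin(mean - kappa * std)]
    # Observe f at the chosen point and augment the sample set.
    X = np.vstack([X, x_next])
    Y = np.append(Y, objective(x_next))

w_c, w_e = X[np.argmin(Y)]                        # best weights found
print(w_c, w_e)
```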