Systematic Review

A Systematic Review on Reinforcement Learning for Industrial Combinatorial Optimization Problems

IDMEC, Instituto Superior Técnico, Universidade de Lisboa, 1049-001 Lisbon, Portugal
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(3), 1211; https://doi.org/10.3390/app15031211
Submission received: 26 November 2024 / Revised: 14 January 2025 / Accepted: 21 January 2025 / Published: 24 January 2025

Abstract

This paper presents a systematic review of reinforcement learning approaches for combinatorial optimization problems based on real-world industrial applications. While this topic is growing in popularity, explicit implementation details are not always available in the literature. The main objective of this paper is to characterize the agent–environment interactions, namely the state space representation, the action space mapping and the reward design. The main limitations for practical implementation and the required future developments are also identified. The selected literature covers a wide range of industrial combinatorial optimization problems found in the IEEE Xplore, Scopus and Web of Science databases. A total of 715 unique papers were extracted from the query. Then, out-of-scope applications, reviews, surveys and papers with insufficient implementation details were removed, resulting in 298 papers that align with the focus of the review and provide sufficient implementation details. The state space representation shows the most variety, while the reward design is based on combinations of different modules. The presented studies use a large variety of features and strategies. However, one of the main limitations is that, even with state-of-the-art complex models, the scalability issues caused by increasing problem complexity cannot be fully solved. No methods were used to assess risk of bias or to automatically synthesize the results.

1. Introduction

In recent years, there has been an explosion in the development and application of machine learning (ML). As models grow in complexity and number of parameters, more ambitious tasks become feasible. Performance is tied not only to computational power but also to large volumes of high-quality data. However, most industrial problems are commonly represented as mathematical formulations or simulations rather than datasets.
Reinforcement learning (RL) is a subset of ML methods that is less dependent on data. At the core of the RL loop is the iterative interaction between the RL agent and the environment. Instead of inferring patterns from large amounts of data, RL agents learn by interacting with and exploring a simulated environment step by step, improving with each interaction. This makes RL approaches well suited for simulations and formulations of real problems, relinquishing the need for large datasets.

1.1. Reinforcement Learning Foundations

RL is a decision-making framework in which an agent learns by interacting with an environment to maximize the total received rewards. Many RL approaches model their environment as a Markov Decision Process (MDP), which consists of specific states, actions that navigate those states, rewards associated with reaching each state and the transition probabilities for the action–state paths. As represented in Figure 1, the agent takes an action on the environment, changing it. The state at the next time step, as well as a reward signal, is returned to the agent. This is repeated, possibly until a terminal state is reached, creating the agent–environment loop.
The reward signal quantifies how desirable it is to reach a state, and since the goal of the agent is to maximize the cumulative sum of all expected future rewards, this signal guides the agent towards the desired behavior. This can be represented by value functions, where the state-value function V estimates the future cumulative reward starting from a certain state, and the action-value function Q estimates the reward of taking a certain action in a specific state. These functions can be updated recursively to better estimate the long-term expected rewards when selecting a state or a state–action pair.
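As a point of reference, a minimal sketch of these value functions under a policy π with discount factor γ, written in standard RL notation (not drawn from any specific reviewed paper), is:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s \right]
\qquad
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\; a_t = a \right]
```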
The mapping between states and actions is the policy, and different algorithms are used to adjust it. The Q-value is the expected cumulative reward for taking a specific action in a certain state and then following the policy. One core challenge of RL implementations is the trade-off between exploration, trying new actions to find better rewards, and exploitation, choosing known high-reward actions. A common strategy is the ϵ-greedy algorithm, which exploits most of the time but has a small probability of exploring by selecting a random action.
To store the value function estimates, tabular representations can be used. These lookup tables store the Q-values for each state, in a column, or for each state–action pair, in a matrix. If the number of states is large or the state space is continuous, function approximation techniques can be used instead. These include linear models and neural networks, which are parametrized functions that generalize across states but may introduce approximation errors. When deep learning is used for this purpose, it is commonly referred to as deep reinforcement learning (DRL).
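To make these ideas concrete, the following is a minimal, self-contained sketch of a tabular Q-learning update with ϵ-greedy action selection. It is illustrative only: the environment interface (reset/step) and all parameter values are assumptions, not taken from any reviewed paper.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Run one episode, updating the tabular action-value estimates in place.

    `env` is assumed to expose reset() -> state and step(action) -> (next_state, reward, done).
    """
    state = env.reset()
    done = False
    while not done:
        action = epsilon_greedy(Q, state, actions, epsilon)
        next_state, reward, done = env.step(action)
        # Temporal-difference update towards reward plus the best estimated next value.
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
    return Q

# The lookup table over (state, action) pairs, defaulting to zero for unseen pairs.
Q = defaultdict(float)
```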

1.2. Previous Reviews

Many reviews and surveys on the use of RL for the optimization of industrial problems exist, as seen in Table 1. These reviews were selected from the results of the query used for the literature review. Reviews commonly base their analysis on the problems solved and algorithms used, either only briefly commenting on the agent–environment interactions or doing so for a very specific context. Practical implementation details, such as state encoding or reward design, often fall outside the scope of such analyses or have only a few lines dedicated to them. There are important and novel insights that can be borrowed from various areas but that do not easily propagate across different application types. Thus, this review focuses heavily on the different approaches to the agent–environment interactions, independently of the application or algorithms used.

1.3. Research Questions

A systematic literature review is conducted in this paper. The focus is on combinatorial optimization problems that represent real-world industrial problems using RL. This paper analyzes and compares different approaches to three important design decisions of any RL framework, the agent–environment interactions. The objectives of this paper can be described by the following research questions (RQ):
  • RQ1: how to sufficiently encode the problem (state representation);
  • RQ2: what control is necessary over the problem (action space);
  • RQ3: how to guide the agent towards the desired behavior (reward design);
  • RQ4: what the main limitations are for practical implementation;
  • RQ5: which future developments are needed.
When relevant, model architecture and problem constraints are also discussed. For the targeted industrial problems, most approaches either do not mention transition probabilities or define them as deterministic. The exceptions are some maintenance and resource breakdown problems, which explicitly detail the transition probabilities. For many problems, the traditional MDP representation is not feasible, for example, if there are too many states. When transition probabilities are fixed at one, the MDP representation also becomes irrelevant. Thus, explicit transition probabilities are considered out of scope for the work developed here.
The review methodology is presented in Section 2. Section 3 discusses state representation and format. The action space is described in Section 4 and reward design in Section 5. The research questions are discussed in Section 6. Finally, Section 7 presents future research avenues.

2. Review Methodology

The methodology used in this review follows the PRISMA guidelines [14]. For this study, the following platforms were selected: IEEE Xplore, Scopus and Web of Science (WOS). For the Scopus (https://dev.elsevier.com/, accessed on 13 March 2024) and WOS (https://developer.clarivate.com/apis/wos, accessed on 13 March 2024) databases, the official APIs were used to extract the results. For IEEE Xplore, the same keyword search was conducted in the browser advanced search tool (https://ieeexplore.ieee.org/search/advanced, accessed on 13 March 2024). The keyword search for each platform was as follows:
(reinforcement learning) AND (industrial OR industry) AND ((combinatorial optimization) OR (operations research) OR (job shop) OR scheduling OR routing OR (bin packing) OR knapsack) AND (real-world OR application OR benchmark OR (case study) OR (framework))

2.1. Search Criteria

For each database, the title, abstract and keywords were searched on 13 March 2024. All journal papers, conference papers and book sections were considered without date restriction. All the steps, from database extraction to the final included papers, are represented in the PRISMA flow diagram in Figure 2.
A total of 1286 items were returned from the three databases: IEEE Xplore (347), Scopus (565) and WOS (374). Each database search matched the title, abstract and keywords with the query. First, all retrieved items without DOIs were considered ineligible and discarded (119). Then, duplicate items across the three database queries (459) were removed, resulting in a total of 715 unique papers, as seen in Figure 2. No papers were excluded in the screening phase. Of these, 33 papers were not retrieved due to lack of access (29) or not being available in English (4).

2.2. Eligibility Criteria

For the collected unique papers, the exclusion criteria used to determine eligibility, which resulted in 298 papers being included in the review, were as follows:
  • Out of scope: irrelevant topics outside of industrial applications or not of combinatorial optimization (221);
  • Insufficient details: irreproducible papers without all needed explicit descriptions of agent–environment interactions (99);
  • Reviews and surveys (64).
To collect these data, one author manually checked for sufficient descriptions of agent–environment interactions. This information, together with the title and abstract, was also used to evaluate if papers were in the scope of this review.
This review focuses on industrial combinatorial optimization problems. Thus, we are mostly concerned with problems that deal with resource allocation, such as entity distribution or selection. Common examples include manufacturing, scheduling, and network allocation. Papers are considered out of scope if the application mainly deals with control or energy management systems. For these examples, the action is often a continuous control signal.
The resulting 298 papers contain verified and thorough implementations of RL for industrial combinatorial optimization problems. The state representation, action space and reward design are clearly explained in every paper selected, and the following sections group and categorize the different approaches to each of the three agent–environment interactions. The distribution of publications over time is presented in Figure 3. It can be seen that while little interest was given prior to 2015, in the last 10 years the number of publications has kept growing.
Besides manually describing and categorizing the agent–environment interactions, algorithms used, etc., which were then used to populate the respective tables and figures of this document, no other synthesis methods were used. No certainty assessment was performed, no effect measures were used, and no risk of bias due to missing results was assessed.

3. State Representation

The state is the representation of the environment available to the agent. This section groups and details different state representations found in this review. Table 2 categorizes the most common state features in the reviewed papers, which are then further detailed in the following sections.
The most common approach is to use environment variables as different state features. These variables, representing different model states, change over time as the environment reacts to the actions given. By tracking key variables and related features, many approaches manage to create a short yet representative state.
Some of the most popular state features are resource-related. This includes machines, workers, vehicles, channels, etc. Other popular features relate to system entities such as jobs, operations, tasks, orders, products, etc. Job and order are used interchangeably to refer to the highest-level entity. A job can be the manufacturing of an object. Operation, sometimes used interchangeably with task, is a specific step in a job’s production process. Each step requires different resources. However, tasks can also be used as smaller units of work within an operation [29].
When the current environment should be represented, time-related metrics or solution representations can be part of the state. Time-related metrics can be computed for resources or entities and track lateness, wait times, etc. When representing the solution or its encoding, the objective function values or certain infeasibilities can be informative to the agent.
Static problem parameters provide extra context to the agent. These are frequent when the same agent must solve different problem instances, as it needs to be aware of changes in available resources, number and types of jobs, processing times, etc.
Some state representations are very problem-specific, only making sense for certain topics. This is the case for networks and energy problems. When the agent has a hybrid role, the state representation is also problem-dependent.
Lastly, it is also common for approaches to normalize state features [41,57,90,132]. In [132], the state is a list of seven features, such as percentage of finished jobs or idle time since last operation. Each feature has a unique scaling factor to keep all state values between zero and one. In [57], min–max scaling is applied to each job feature of the state.
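As an illustration of this kind of feature scaling, the following is a minimal sketch of min–max normalization applied to a list-of-features state, in the spirit of [57,132]; the feature names and bounds are assumptions made for the example.

```python
import numpy as np

def normalize_state(raw_features, lower_bounds, upper_bounds):
    """Min-max scale each state feature to the [0, 1] range."""
    raw = np.asarray(raw_features, dtype=float)
    lo = np.asarray(lower_bounds, dtype=float)
    hi = np.asarray(upper_bounds, dtype=float)
    return (raw - lo) / np.maximum(hi - lo, 1e-9)  # guard against zero-width ranges

# Example: [fraction of finished jobs, idle time since last operation (min), queue length]
state = normalize_state([0.4, 12.0, 3.0],
                        lower_bounds=[0.0, 0.0, 0.0],
                        upper_bounds=[1.0, 60.0, 10.0])
```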

3.1. Resource Features

It is very common to represent resource efficiency by measuring its utilization [18,29,33,40,41,42]. Non-normalized utilization is occupation, the amount of time a resource is busy [17,26,38,78].
Many approaches take particular interest in measuring queues [44,48,55,66]. While the number of queued jobs [45,49,50,54,60] is the most popular metric, the utilization of queues [60,68] or the free spaces available [40] are also used.
Load and demand can also give important information. In [78,96], the number of jobs currently being processed by machines is measured. In [93], the state contains the resource load, and in [17], the state has the current allocated demand. When describing more than the resource time needed, for example, considering volume and complexity of tasks, the workload is used [82,83,84].
Categorical information can be used to translate the condition or situation of a resource at a certain time, that is, its state. One of the most common state examples is the resource working behavior [60,69,93,110,113], for example, idle, working, under maintenance, etc. This notion of resource state is sometimes used interchangeably with status, which more often describes the position or condition of a resource in relation to something else.

3.2. Entity Features

Many metrics evaluate job completeness. Examples include the percentage of operations completed, or completion rate [35,41,94,136], the number of operations left [96,113] and the overall job demand [79,81,137]. Instead of counting jobs, refs. [41,132,133] count the remaining processing time. The working status of a job, if it is waiting, completed, etc., is a common feature [26,96,132], either categorical or one-hot-encoded.
Some problems require monitoring resource or entity locations. In [75], each vehicle is described by its current zone, available seats, pick-up time and destination zone of each passenger. In [165], each of the four flight legs is described in terms of origin and arrival locations and departure and arrival times. For entities, tracking the current entity location [81,140,161,168] can be of interest.

3.3. Time-Related

When a job entity is completed after its due date, the deadline for completion, there is lateness, or tardiness [18,27]. The two terms are usually used interchangeably, although lateness is sometimes reserved for tardiness of larger magnitude.
Earliness [18,57] measures the surplus time from job completion to due date. Slack time [29,34,41,172] is used to measure how much a task start can be delayed without impacting overall completion time. If a job has zero slack time, it is the bottleneck of the system [59].
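For a job j with completion time C_j, due date d_j and remaining processing time p_j at current time t, these metrics can be written with the standard definitions below (a reference sketch in common scheduling notation, not quoted from any single reviewed paper; the slack expression is one common variant):

```latex
L_j = C_j - d_j                      % lateness (may be negative)
T_j = \max(0,\; C_j - d_j)           % tardiness
E_j = \max(0,\; d_j - C_j)           % earliness
\text{slack}_j = d_j - t - p_j       % remaining slack before the job becomes late
```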
To measure idle time, the average waiting time [16,40,60,78,93], or its expected value [39], tracks how long job entities remain idle.

3.4. Solution

A solution representation can be part of the state. Some examples include operations sorted by start time [179,192], vehicles by sequence of visited nodes [188] or bin packing insertion order [176,179]. The whole solution [79,153,176,184], a partial solution [81] or the initial solution [180] can be used.
The objective function value [30,41,184,192,197] or similar metrics are also common, such as remaining processing time [41,57,62] or number of tasks left [142]. Some approaches measure or detail solution infeasibility [29,110,140,204].
Simulation-related metrics, such as simulation time [69,79,92], or past information, such as previous actions [20,197,206], can be useful in the state.

3.5. Static Parameters

As mentioned above, static problem parameters provide extra problem context for the agent and are frequent when the same agent must solve different problem instances with changes in available resources, number and types of jobs, processing times, etc.
It is very common to include entity [34,41,160] and resource [93,160,211] properties, such as resource maximum capacity [33,69,85], job processing time [49,57,69,78,192], arrival rates [65] or due dates [66,213].

3.6. Problem-Specific

Certain measurements only make sense considering the problem tackled. Energy management problems care about energy demands [132,198,213,225], energy levels [100,214,229] and consumption [201,203].
For network problems, many specific metrics are used, for example, signal-to-noise ratio [46]; resource backhaul transmission volume [91]; and resource radio channel quality [68].
For maintenance-related problems, specific state features include machine maintenance duration [45,116] or resource degradation level [48,89].

3.7. Multi-Agent

For multi-agent problems, the state of a single agent often contains all agents’ features [206,246,251]. Listing the resource metrics is also viable [250], especially when pairing agents and resources [241,244,249].
For example, the agent input in [244] consists of three different matrices: processing time of jobs, jobs assigned to each agent and completed jobs. In [249], the state also includes information on the current jobs requesting a decision and the number of jobs in the other resources, the conveyors. Alternatively, some approaches can have a unique state per agent [49].

3.8. Hybrid Strategies

For some approaches, the RL agent is not used as the optimizer or solution generator. Instead, it takes an auxiliary role to other optimization strategies, using information about the state of the optimization itself rather than the solution. Large-scale and uncertain resource scheduling problems are solved in [261], with an agent making pre-selections and ordering the solving process to simplify the problem that a linear programming model must solve next. In [254], an RL agent complements a constraint programming algorithm; the state representation includes information on the instance being solved, the current model and statistics from past solving operations.
The RL agent is often paired with metaheuristics [253,262], such as ant colony optimization [263] or particle swarm optimization [259]. Agents can read population metrics [203,253,257,260], such as average fitness, diversity or population improvement [260], or directly use algorithm optimization parameters as the state [252]. In [253], where the agent is used to control the parameters of a Cuckoo search metaheuristic, the state describes the population diversity and tracks whether solutions are being diversified or intensified.
The set of optimization parameters used to generate circuit designs is the state in [252]. A fuzzy rules-based approach is used in [258], where the conditional part of the fuzzy rules is the state. In [259], the state is the current population levels of a particle swarm optimization algorithm, and the agent only acts on the number of levels. This level-based learning approach forces particles to learn from upper levels only: lower-level particles focus on exploration and higher-level particles on exploitation.

3.9. State Format

The reviewed state representations always include some of the features previously described. The most common state format that gathers these features is a fixed list of variables. However, there are other noteworthy formats and considerations, as illustrated in Table 3, which summarizes how state representations can be shaped by providing a structured overview of the different formats in an application- and feature-agnostic categorization. Note that the accompanying references are non-exhaustive lists of representative examples.
Entity lists, for example, resources or jobs, often present one of two approaches. When the entity features are averaged over all resources [16,18,27,34], it results in a compact, low-dimensional representation that is easy to handle but might overlook some relevant information. Alternatively, state features can each directly relate to individual entities [20,35,247], resulting in a more granular representation. However, this option requires a longer state that must also be fixed-sized, hurting generality.
Spatial-based representations allow for the data position within the structure to also play a role. States represented as matrices [71,221,232,233] leverage structured environments, such as grids and images, where the spatial proximity and order are relevant. An example of a specific matrix representation is a heightmap [195,221,232,233], often used to encode three-dimensional space into a two-dimensional representation. It is also important to highlight applications using convolutional neural networks (CNNs) [61,205,232,266] that, while more cumbersome to train, can automatically extract spatial features.
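As an illustration of the heightmap idea, the toy sketch below reduces a 3D packing state to a 2D grid storing the current stack height at each cell; the container size, placement interface and box dimensions are assumptions for the example, not an implementation from the reviewed papers.

```python
import numpy as np

def place_box(heightmap, x, y, width, depth, box_height):
    """Place an axis-aligned box on the heightmap and return the updated 2D state.

    The box rests on the highest point of the footprint it covers.
    """
    footprint = heightmap[x:x + width, y:y + depth]
    base = footprint.max()
    heightmap[x:x + width, y:y + depth] = base + box_height
    return heightmap

# A 10 x 10 container floor, initially empty; the 2D array itself is the agent's state.
state = np.zeros((10, 10))
state = place_box(state, x=2, y=3, width=4, depth=2, box_height=5)
```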
Graph-based representations are useful for relational and entity-based problems. Directed and undirected graphs [142,219,246,270] capture relational dependencies, respectively, asymmetric and symmetric interactions. An example of a directed arc would be a precedence constraint between operations, while an undirected arc could connect all job nodes compatible with the same resource. Disjunctive graphs [272,274,277,278] use both types of arcs and can encode structures with complex relationships, which is especially useful for multi-entity interactions such as scheduling.
Graph node features [81,246,278,282] allow for considering multiple features per resource regardless of the problem size, while edge features [19,113,150,239] are more focused on relationships between entities. These are both features used in graph neural networks (GNNs) [188,246,271,278], which process graph-based states more generally at the cost of higher computational complexity.
GNNs have the advantage of being size-agnostic. Other approaches also allow for variable-sized state formats [141,181], which are applicable to complex problems where the number of features can change. An example is recurrent neural networks (RNNs) [47,178,184,238], useful for capturing temporal dependencies for sequential decision making problems. These bring great flexibility and scalability to the problem, again at the cost of extra computational effort. Lastly, fuzzy approaches [83,246] can be used to handle uncertainty and imprecise states.
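As a minimal illustration of a graph-based state, the sketch below encodes an assumed toy two-job, two-machine instance as operation nodes with features, directed precedence arcs and undirected machine arcs; a GNN-based agent would consume these structures as its input.

```python
from itertools import combinations

# Each operation node carries (job id, machine id, processing time) as its feature vector.
node_features = {
    "J1_O1": (1, "M1", 3), "J1_O2": (1, "M2", 2),
    "J2_O1": (2, "M2", 4), "J2_O2": (2, "M1", 1),
}

# Directed arcs encode precedence constraints inside each job.
directed_edges = [("J1_O1", "J1_O2"), ("J2_O1", "J2_O2")]

# Undirected arcs connect operations competing for the same machine (disjunctive arcs).
same_machine = {}
for op, (_, machine, _) in node_features.items():
    same_machine.setdefault(machine, []).append(op)
undirected_edges = [pair for ops in same_machine.values() for pair in combinations(ops, 2)]
```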

4. Action Space

The following section details popular actions made available to the agent in different literature approaches. Table 4 groups all the cited papers into relevant categories. Two common strategies compatible with multiple action spaces are ϵ-greedy exploration and action masking.
The classic approach is to have the RL agent estimate the Q-value of a certain state or state–action pair. However, it has become more popular to have the agent directly select a candidate option from a list, such as the next entity to consider or the next resource to allocate.
Alternatively, the list of actions can be a smaller, repeatable set of decisions, for example, accepting or rejecting a certain entity into a resource. Some approaches conduct hyperheuristic selection, having the agent choose the most appropriate heuristic based on the current state. The agent can also make multiple selections at once, combining any examples from this section.
Depending on the architecture, the agent might output a full solution. This is typically from approaches that work with variable-sized outputs. Finally, the output can produce one or multiple number estimates. The use cases include simulation or metaheuristic parameters.
In ϵ-greedy approaches [45,175,225,250,310], ϵ is a small number. With probability 1−ϵ, the action estimated to lead to the highest Q-value is selected; otherwise, with probability ϵ, a random action is selected. This strategy is often used to balance the exploration of new solutions against the exploitation of currently promising solutions. It is a common alternative to always selecting the highest Q-value [267,280,289]. The ϵ-greedy threshold can also decrease over training [95,260].
Action masking refers to only allowing the agent to select feasible actions [139,204,233,271], for example by masking unavailable resources [84,121], entities [57,80,96,132,179] or locations [39]. In [24], all options are available, but if the candidate option returns an infeasible solution, the agent does nothing instead. The authors of [170] include the feasible actions in the state.
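A common way to implement action masking over value outputs is to set infeasible actions to minus infinity before taking the maximum; the short sketch below illustrates this (the masking trick and example values are assumptions, not quoted from a specific paper).

```python
import numpy as np

def masked_greedy_action(q_values, feasible_mask):
    """Select the highest-valued action among the feasible ones only."""
    masked = np.where(feasible_mask, q_values, -np.inf)
    return int(np.argmax(masked))

q_values = np.array([0.3, 1.2, 0.7, 0.9])
feasible = np.array([True, False, True, True])  # e.g., resource 1 is currently unavailable
action = masked_greedy_action(q_values, feasible)  # selects action 3
```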

4.1. Q-Value Estimation

Estimating the Q-value is of fundamental importance to RL approaches. Traditional methods estimate the Q-value of a state [150,207] or a state–action pair [45,165,294,296,298]. By comparing the Q-values of reachable states, the agent can select the action leading to the highest expected return.
Function approximation methods can use non-linear models, such as deep neural networks, to estimate the Q-values of each state [138,153,160,297] or state–action pair [84,91,149,164], even if the number of states is infinite. For example, to predict state–action pair values, ref. [290] uses a long-short term memory (LSTM) model and ref. [160] uses a deep neural network with an attention mechanism.
Value-based approaches, which compute Q-values for states or state–action pairs, can apply a softmax layer to the value outputs and use the result as an expected return probability distribution over the actions [30,78]. This works even for tabular approaches [73]. Instead of searching for the maximum Q-value, the agent can directly select a candidate option from a list. The policy of policy gradient models outputs the selection probability for each element of that list [20,79,178,206,208], also using a softmax layer.
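A minimal sketch of turning value outputs into a selection distribution with a softmax is shown below; the temperature parameter and the sampling step are illustrative assumptions.

```python
import numpy as np

def softmax_policy(q_values, temperature=1.0):
    """Convert Q-values into action-selection probabilities."""
    z = np.asarray(q_values, dtype=float) / temperature
    z -= z.max()  # shift for numerical stability
    probs = np.exp(z)
    return probs / probs.sum()

probs = softmax_policy([1.0, 2.0, 0.5])
action = np.random.choice(len(probs), p=probs)  # sample an action from the distribution
```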

4.2. List Selection

Besides the possible states or state–action combos, the agent can be used to select a candidate option from a list. This is a very common option, and this selection can be conducted with or without repetition of the elements.
The action space can be a list of resources [38,49,122,160,296], jobs [49,94,198,213,232] or operations [31,115,271,278]. Locations [136,239,299] can also be selected by the agents.

4.3. Sequential Decisions

Agent decisions can be used to change the resource allocated to entities [40,113,167,246] or to move entities in and out of buffers [49,217,304]. They can also be used to change the resource state [40,113,167,246,304].
When the agent can move in a grid, it is common for the available decisions to be the cardinal directions [240,305] and in some cases also the ordinal directions [156,206]. Not moving or halting movement [206,240] is also common.
Multiple output approaches also use agent decisions [123,210,243,307], but often in only one of the outputs, having some other type of agent action in the other output.

4.4. Heuristic Selection

Heuristic selection strategies are both list selection approaches, since a single heuristic is picked from a list, and hybrid strategies. They are also extremely popular. In particular, dispatching rule selection is widely used in the literature. By using simple rules such as Shortest Processing Time (SPT) or Earliest Due Date (EDD), an agent can sequentially build a complete schedule [18,22,35,41,307], as sketched below.
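The sketch shows dispatching rule selection as a discrete action space: the agent picks a rule and the rule picks the next job. The rule set and the job fields are assumptions for illustration.

```python
def spt(queue):
    """Shortest Processing Time: pick the queued job with the smallest processing time."""
    return min(queue, key=lambda job: job["processing_time"])

def edd(queue):
    """Earliest Due Date: pick the queued job with the earliest due date."""
    return min(queue, key=lambda job: job["due_date"])

DISPATCHING_RULES = [spt, edd]  # the agent's discrete action space

def apply_action(action_index, queue):
    """The agent selects a rule; the selected rule decides the next job to schedule."""
    return DISPATCHING_RULES[action_index](queue)

queue = [{"id": 1, "processing_time": 5, "due_date": 20},
         {"id": 2, "processing_time": 3, "due_date": 12}]
next_job = apply_action(0, queue)  # SPT selects job 2
```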
Instead of dispatching rules, local search methods can also be the heuristics selected. These simple operators explore neighboring solutions in search of an improvement [29,208,260,314].

4.5. Multiple Selections

To make multiple selections with a single action, some approaches provide all possible combinations of resources as individual actions to the agent [116,275,309]. For scheduling, these can be resource and job pairings [59,64] or job and factory pairings [275]. More commonly, the agent outputs multiple decisions as separate output values, often pairs, or as a list with as many elements as resources [23,47,103].
While some bin packing approaches also use every combination of item, location and orientation [233], others have each decision as a separate output [189]. For graph-based problems, the agent can also select nodes [81] or links [242].

4.6. Variable-Sized Output

While not common, some approaches present variable-sized action outputs to solve different-sized problems with the same model. Some approaches give the Q-value of each resource [65] or a new full solution [184,292]. This is achieved with specific RNN architectures [47,243,289], for example, using an LSTM model [65,184,292].

4.7. Number Estimation

The output of an agent can be a number estimate [140,247], such as the number of local updates [310] or the number of packets to forward [207]. For these examples, the agent action is an integer number.
Continuous values can be used directly from the agent output [21,67], which is very common for energy management systems [98,105,214,227,236]. Alternatively, discretized continuous value ranges are sometimes used [88,141,311,312].
Hybrid strategies with RL agents and other optimization strategies working together are often used to estimate certain variables. One approach is to output parameters for an optimization algorithm [252,253,254,257,261,262].
As an alternative to parameters, the agent in [263] learns the value of different cities to complement an ant colony optimization algorithm. In [313], the agent outputs three weights, which are used to select the next node. In other approaches, the agent is used to change resource limits, such as inventory [261] or reconfigurations [159].

5. Reward Design

The following section summarizes the types of reward functions used. Different approaches use different combinations of the categories mentioned below, as seen in Table 5. Some approaches normalize the rewards [95,152,163], for example, using an upper bound [17], min–max scaling [207,249], the z-score [296] or the average expected time [170].
One of the most direct and simplest approaches is to mirror the objective function as the iteration reward or penalty. The reward value can also reflect the relative improvement of the solution over successive iterations. Both these approaches are often complemented with extra terms that help the agent better achieve the original objective.
Some rewards are the result of multiple reward functions, where certain environment conditions decide which function is used in each iteration. Based on the problem to solve, specific objectives might also be applicable. Multi-agent approaches can use multiple reward functions per agent or have specific rewards based on local and global events. Finally, when it is not possible or convenient to measure the objective function, alternative goals can be used.

5.1. Mirror Objective Function

One of the simplest and yet most common approaches is to reward based on the current objective function value [49,79,81,175]. The current value, the difference between iterations or some linear transformation of either case can also be used. The values measured can be summations, averages and standard deviations. Makespan, or completion time, is very common in scheduling problems [113,279,286,307].
Non-linear transformations to the objective function are also used. Using the max operator [110,153,184] breaks linearity but forces measurements to only take meaningful numbers. For example, negative lateness is not desired [18,27]. Ref. [290] punishes delay increments more severely when the total delay is small than when it is large.
This approach is very popular due to its simplicity and ease of implementation, since it directly optimizes the objective function. However, it can also be too sensitive to problem scale and magnitude differences in returned reward values.
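A sketch of a mirrored-objective reward for a scheduling setting is given below, including a max operator so that only positive tardiness contributes; the weighting factor is an assumption for illustration, not taken from a specific paper.

```python
def mirror_reward(makespan, completion_times, due_dates, tardiness_weight=0.1):
    """Reward that directly mirrors the objective: negative makespan plus a tardiness penalty."""
    total_tardiness = sum(max(0.0, c - d) for c, d in zip(completion_times, due_dates))
    return -(makespan + tardiness_weight * total_tardiness)
```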

5.2. Relative Rewards

It is common to reward the difference with respect to the previous value [29,122,198,232,278], i.e., the temporal difference (TD). Comparisons can also be made with other solutions, such as the initial [178,179,181,195] or best [29,188,244,259,263,286] solution, or with other baselines [211,212]. This relative fitness can be either the ratio of the two values [190,233,269] or their absolute difference [49,208,274].
The makespan is often used as a temporal difference measurement [49,84,142,188,278], but other measures such as profit [98,208,266] or resource utilization [109,178] are sometimes used.
Relative rewards shift the focus to improvement over time instead of immediate absolute values. They often have some type of normalization built in, which makes rewards across different problems more comparable. However, if wrong reference points are used or if noise is too impactful, the agent can converge to suboptimal solutions.
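A minimal sketch of a relative reward based on improvement over the best solution found so far, normalized by the initial value, is shown below; the normalization choice is an assumption for illustration.

```python
def relative_reward(current_makespan, best_makespan, initial_makespan):
    """Positive reward only when the best-known makespan is improved, scaled by the initial value."""
    improvement = best_makespan - current_makespan
    return max(0.0, improvement) / initial_makespan
```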

5.3. Extra Terms

When composing the reward, certain metrics can complement the objective function [75,184,195,246,247]. This can be performed to influence the agent behavior to account for costs [115,225,267] or lateness [71,173,208], for example.
Problem constraints can also be used as extra penalty terms [267,292,301], allowing infeasible solutions to be explored while also increasing the state space, for example, penalizing exceeded resource limits [258,261] or missed entity due dates [44,213].
Extra terms bring flexibility and can make approaches more general, allowing penalties and secondary objectives to further guide the agent. However, balancing multiple terms becomes especially important, as extra terms can dominate the primary reward function and introduce biases.

5.4. Conditional Rewards

Instead of having a single reward function, some authors prefer if–else statements to alternate between different reward functions [40,165,217,250,284]. Many approaches provide a zero reward [40,165,249,250] or a small negative reward [76,156,284,308] at every step in which no key event is achieved, falling under the else or otherwise condition. Alternatively, some approaches provide a fixed positive reward [24,185,219] on key events and zero or negative values otherwise.
Certain events can trigger conditional terms, often from entity or resource behavior. For example, in [309], the default reward is a small penalty, but if traffic flow is successfully received, the reward is zero instead. On the other hand, ref. [106] constantly provides a small fixed reward, but if an open edge is selected, then a larger reward is returned.
The selected action can also decide the conditional reward terms [123,132,168,224], for example, if the candidate action is not valid [76,136,189,206] or if the action is successful [165,303]. Highly desirable actions can have greater reward magnitudes [146]. This can also happen upon visiting certain states [116,219] or state transitions [298,315], such as if undesirable states are reached [69,243].
Specific metrics can be used as threshold values to switch between terms [18,48,167,238,291]. The evolution of the solution can be used to select the appropriate reward equation [160,203,257,270,311].
A reward returned only at the end of the episode is called terminal or episodic. This can mean replacing the reward function with a specific terminal reward [69,136,282], with a fixed positive [76,282] or fixed negative [24,136] value. Extra penalty terms can also punish unfinished tasks [92] or training length [234]. Alternatively, agents might only be rewarded at the terminal state [35,176,177,234,302].
Conditional rewards also offer flexibility to the reward function by switching between functions that better adapt to ongoing events. However, these event-based triggers must be carefully designed so that the abrupt reward transitions do not lead to unstable learning. Thus, they require careful tuning.
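The sketch below combines several of these ideas into a single conditional reward with a small step penalty, event-based terms and a terminal bonus; the thresholds and magnitudes are illustrative assumptions rather than values from a reviewed paper.

```python
def conditional_reward(action_was_valid, job_completed, is_terminal, all_jobs_done):
    """Switch between reward terms depending on the events of the current step."""
    if not action_was_valid:
        return -1.0                                # penalize infeasible choices
    if is_terminal:
        return 10.0 if all_jobs_done else -10.0    # terminal (episodic) reward
    if job_completed:
        return 1.0                                 # key event achieved
    return -0.01                                   # small negative reward on every other step
```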

5.5. Problem-Specific Objective

Certain metrics, either as extra terms or objective functions, only make sense for specific problems. This is valid for thematic areas, such as energy and routing problems, but also for different formulation types, such as non-linear and multi-objective approaches.
Some approaches consider multiple objectives, yet many combine them into a single objective function. In this scenario, it is common to represent each objective as a different weighted term of the reward function [113,141,161,193,249,289,296]. True multi-objective approaches [18,132,196,257] consider each objective variable separately. Multi-objective goals are conflicting: improving one deteriorates the others.
Domain-specific requirements might call for problem-specific rewards, which provide more meaningful performance evaluations by considering the context. This, however, hurts the generality of the approaches.

5.6. Multi-Agent

Multi-agent approaches can distinguish between individual agent rewards, local rewards, and rewards shared by all agents, global rewards. In [172,206,287], an additional shared reward is given to all agents upon completing the schedule [172,287] or all routes [206]. During the episode, a smaller reward can be given based on the percentage of completed tasks [287] or upon job completion [172].
Some multi-agent approaches return different rewards for each agent. Ref. [164] minimizes routing costs using three different cooperative agents: while one agent’s reward considers utilization and daily rewards, the others share a reward function that includes extra penalty costs. In [78], one agent balances the load among factories, while another minimizes the factory makespan. The former reward is the difference between the maximum and minimum factory total processing times, while the latter reflects the makespan increment at each step.
Specific multi-agent rewards allow for coordination and competition between agents, creating scalable solutions for decentralized systems. However, the complexity increases greatly with the number of agents, and considering global and local rewards requires careful balancing.

5.7. Alternative Goals

Sometimes, the objective is not available during the episode, such as when minimizing total tardiness [66,174] or final makespan [22,38,54,247,307], forcing the approaches to use alternative or estimated metrics for the reward calculation. Ref. [22] demonstrates that, for their formulation, the makespan and the resource utilization are equivalent and uses the resource utilization increments as the reward. Some estimation examples use tardiness [66] or the worst possible task completion [38,91].
When the agent’s role is not to optimize but to assist other optimization strategies in hybrid approaches, the reward might not be related to the overall objective function. In [252], the agent output updates the optimization parameters for a simulation software package, and the resulting design-performance value is used to update the critic network using the mean squared error. For the constraint programming approach in [254], the agent is penalized at every step, encouraging it to reach feasible solutions faster. In the ant colony optimization of [263], where metaheuristic parameters are given by the agent, the reward depends only on the metaheuristic finding a better solution. Lastly, in [276], no reward is used since the authors use imitation learning.
Learning through proxy metrics can be the only viable choice when the direct objective value is not available. This also enables hybrid approaches to optimize for their particular task and not for the overall problem. However, it may lead to unintended behaviors if the alternative metrics are not fully representative of the true goal.

6. Discussion

From the reviewed papers, it is clear that successful applications rely on good problem encoding and manipulation. The algorithm choice and the complexity need to fit the problem, and the reward signal must successfully guide the agent. Depending on how well the problem encoding leverages the state representation and the action space, simpler approaches are often as effective as complex state-of-the-art methods.
RL agents are used for many different purposes. Table 6 summarizes the different roles RL agents take, showing only selected citations instead of the exhaustive approach from previous tables.

6.1. Research Questions

6.1.1. RQ1—State Representation

The state space representation is crucial to supply the agent with sufficient information so that it can make an informed decision. Of the three agent–environment interactions, the state representation shows the most format variety, as summarized in Table 3. Variables, lists, matrices and even 3D matrices can be used for very different applications. Lists of environment variables are the most popular approach. The metrics included vary greatly but often include information about the resources and entities. While a list of metrics is a simple representation, it has been shown multiple times that it is sufficient for many applications.
Two specific model architectures show great success on multiple examples: recurrent and graph approaches. Both strategies create a state that aggregates information from individually considering all its parts, while still being compatible with problems of any size. Normalizing the state also seems to be beneficial.

6.1.2. RQ2—Action Space

It is important for the action space to allow the agent enough freedom to explore the solution space thoroughly. If the number of actions is finite, ultimately the agent is selecting one option from a list. For approaches that use this strategy, using some form of action masking seems to always be beneficial.
When the model outputs multiple Q-values or selection probabilities, the agent must take an extra step to choose one of the actions; for exploration purposes, this is not always the one with the best value. Naturally, one common use for agents is to have them select from a predetermined list. This is often paired with action masking, only allowing feasible actions to be selected, which seems to be fundamental for faster convergence on complex problems. Selecting resources and entities is very popular, either as a list item or a graph node.

6.1.3. RQ3—Reward Design

The reward design heavily influences whether the agent will behave as expected. It will not only guide the agent throughout the optimization, but it will also influence the speed at which it converges to the desired behavior. There is no clear preference regarding mixing penalties and rewards or simply using one of them. However, a careful balance between the magnitude of all rewards returned during an episode seems to be useful.
Interestingly, the reward design can be seen as the most modular approach of the three agent–environment interactions. There are many approaches to the design of the reward function, yet these can be broken down into a relatively small set of concepts, such as mirroring the objective function as the reward. Reward functions can be highly customizable by simply picking and choosing a number of options from this set.
Most approaches tend to mirror the objective function change between different episode iterations. Moreover, it is very common to have a conditional reward function. In either case, the studied approaches often include extra penalties based on lateness and formulation constraints. Similarly, rewards can be associated with successful entity or resource events.

6.1.4. RQ4—Limitations

There is a huge variety of states, actions and rewards. Since they are often very problem-dependent, there is a lack of consensus on what the best approaches are for each agent–environment interaction.
Exploration is a key part of RL approaches. They require extensive interactions with the environment, which can be computationally expensive. As a result, RL training is likely slower than other end-to-end deep learning approaches, such as supervised learning, that train on large, curated datasets. However, dataset generation can be cumbersome and expensive, so the development of data-based approaches can also be slow. Since RL approaches can instead leverage existing mathematical formulations, they effectively skip this step. Moreover, various methods such as curriculum learning and offline RL help reduce inefficiencies by improving both training speed and sample efficiency.
While RL training can be slow, a deployed agent is fast when compared to traditional exact or search-based methods used for combinatorial problems. For example, linear programming models and metaheuristic methods can take significantly longer to reach good solutions. Also, the iterative nature of RL easily allows for hybrid approaches.
There are multiple examples where very simple representations and simple approaches are enough to solve small instances of complex problems. However, these smaller state and action spaces are often not enough to represent and interact with the real problem. Industrial problems can be represented in many ways, as seen in Table 2. However, a very complex and thorough state might still not solve the problem. Adding unnecessary or redundant information increases the complexity of the approaches, often increasing the number of parameters greatly, without being matched by the same increase in performance. A careful balance is needed between state and action space length versus their inherent complexity increase.
When more complex approaches are needed, such as GNNs or RNNs, these require some expertise to train efficiently. They are often accompanied by comparatively longer training times, which can prevent these models from converging and achieving competitive results. Furthermore, while they bring some advantages over traditional models, such as variable-sized inputs or outputs, they do not solve the scalability issues of these representations. The number of edges in a graph can grow quadratically with the number of nodes, and recurrent approaches are forced to use extra mechanisms, such as attention, to avoid losing important information.

6.1.5. RQ5—Future Developments

It is a recurring issue from the reviewed papers that adding more constraints to the problem is desired, for example, adding due dates to consider lateness or adding resource breakdowns to consider a dynamic scenario. This further exacerbates the scalability issues, which require an urgent solution.
GNN and RNN approaches cannot yet fully solve this problem. Graph-based approaches have trouble scaling past a certain point, since increasing the number of nodes can quadratically increase the number of edges. Recurrent approaches need more and more strategies to remain competitive, such as attention weights or the latest model architectures, further increasing the required computational power.
It is important to use simpler architectures and problem encodings whenever feasible. However, more efficient agent–environment interactions, training frameworks or algorithms are needed to satisfy the desire to further constrain the already-complex problems tackled.
Traditional optimization methods still outperform RL for combinatorial optimization problems. However, RL offers an alternative approach better suited for dynamic environments and changing problems without customized problem tuning.
Various hybrid approaches where RL methods complement classic optimization methods have already been discussed in this document. However, there is still potential for RL to further save computational effort and increase efficiency by reducing the search space, providing metaheuristics parameters and warm-starting both solvers and metaheuristics, for example.

6.2. Popular Algorithms

There are many options available for RL algorithms, as shown in Figure 4. There is some overlap between policy gradient and actor–critic methods; for clarity, priority is given to algorithm naming conventions. It is common for papers to simply refer to their approach as DRL. These cases are grouped under the “undisclosed” label in the figure.
Q-learning is a common choice, allowing the agent to estimate Q-values for given state–action pairs [73,165,175,207,296]. Other algorithms used in the same context include REINFORCE [178,216,288], value iteration [92], TD-learning [166] and Bayesian RL [110].
Many approaches use deep Q-learning (DQN) [38,52,54,121,160]. Some common variations include double Q-learning [70,80,91,149,301] and dueling double DQN [168].
Actor–critic methods [63,105] are also used for option selection. Many variations are used, including advantage actor–critic (A2C) [20,79,122], soft actor–critic (SAC) [100,103] and asynchronous advantage actor–critic (A3C) [173]. Proximal policy optimization algorithms (PPO) [104,132,197,277,285] and deep deterministic policy gradient (DDPG) [68,227,316,317] are also very common.

6.3. The Limitations of This Review

Regarding what is included in this review, there are multiple noteworthy considerations. While the review was aimed at industrial applications with RL, by also restricting to combinatorial problems, some important industrial sectors can be under-represented, such as robotics and control applications. Similarly, by requiring the term industry or industrial, some services might be less represented, such as transportation problems. Excluding papers without sufficient, reproducible details regarding all agent–environment interactions might also discard more theoretical papers with valuable insights. Finally, our search was only conducted in English-language databases, possibly overlooking other high-quality papers.
Regarding the review process used, despite our best efforts to manually check more than 700 papers, this labor-intensive undertaking is subject to human error and bias. Also, there are diverse definitions of states, actions and rewards, and their categorization is inherently subjective. Differences in terminology and framework definition can also influence the classifications. It is also inevitable that for some agent–environment interactions their categorization is oversimplified, ignoring some category overlaps. Also, by not considering papers without practical implementation details, some valuable theoretical contributions might be missed.
Considering the results presented here, key categories and examples of actions, states and rewards are provided for future practitioners of RL-based industrial applications. These findings can be used to guide future research towards the most popular approaches, since their popularity suggests their efficacy. At the same time, the identified less-explored approaches can point to untapped opportunities.
Note that due to the heavy manual data processing conducted in this review, no assessment or reporting of risk of biases, study heterogeneity, robustness or confidence in the results was performed. No statistical synthesis and no assessment of certainty to the body of evidence were conducted. The present review was not registered, and no review protocol was prepared.

7. Conclusions

This review explores the RL state of the art for the combinatorial optimization of industrial problems on the following three databases: IEEE Xplore, Scopus and WOS. The main focus of the analysis was the agent–environment interactions, namely, the state representation, the action mapping and the reward design. Also, the current limitations and needed future developments are explored. A thorough categorization of each approach regarding all three agent–environment interactions is proposed. This review analyzes 298 studies out of the retrieved 715. The most common example is resource scheduling, either for manufacturing or wireless network problems. This includes partial resource allocation to different tasks, such as in network resource sharing.
The analysis of the literature shows that many studies are not always clear on practical implementation details, which hurts both readability and reproducibility. As an extreme example, some studies do not clearly state the algorithm used. As research in this area grows, it is important to make papers informative and unambiguous. In general, authors tend to focus on two main topics for future work: adding more constraints to the problem and using more advanced models and algorithms.
The desired extra constraints intend to bring the formulation closer to the real problem. Two common examples are considering due dates and making the resource availability dynamic. However, these increases in complexity must be well managed, since heavily constrained problems translate into agents that are harder to train.
Many approaches want to complement state-of-the-art RL methods with complex ML models. RNNs, CNNs and GNNs are all increasing in popularity. While the results are promising, the computational effort to train these models increases greatly for small performance gains. It is important to keep in mind that simpler RL approaches can also provide good results, as long as the agent–environment interactions are well designed.
Since most studies reflect a future desire to increase the complexity of either the problem or the approach, the authors suggest that future research should focus on problem representation and training frameworks that can handle this inevitable scaling in complexity.
RL still struggles with sample efficiency, interpretability and generalization of certain encodings. However, by combining RL with established optimization techniques, the combinatorial optimization community can develop efficient, scalable and adaptive solvers that leverage the strengths of both areas.

Author Contributions

Conceptualization, M.S.E.M., J.M.C.S. and S.V.; methodology, M.S.E.M., J.M.C.S. and S.V.; software, M.S.E.M.; validation, J.M.C.S. and S.V.; formal analysis, M.S.E.M.; investigation, M.S.E.M.; resources, J.M.C.S. and S.V.; data curation, M.S.E.M.; writing—original draft preparation, M.S.E.M.; writing—review and editing, M.S.E.M., J.M.C.S. and S.V.; visualization, M.S.E.M.; supervision, J.M.C.S. and S.V.; project administration, J.M.C.S. and S.V.; funding acquisition, M.S.E.M., J.M.C.S. and S.V. All authors have read and agreed to the published version of the manuscript.

Funding

The authors acknowledge Fundação para a Ciência e a Tecnologia (FCT) for its financial support via the projects LAETA Base Funding (DOI: 10.54499/UIDB/50022/2020) and LAETA Programatic Funding (DOI: 10.54499/UIDP/50022/2020). This research was funded by Fundação para a Ciência e a Tecnologia (FCT) under the PhD scholarship 10.54499/2020.08776.BD (Miguel S. E. Martins).

Data Availability Statement

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sutton, R.S.; Barto, A.G. Reinforcement Learning; MIT Press: Cambridge, MA, USA, 2020; p. 282. [Google Scholar]
  2. Patel, P.P.; Jhaveri, R.H. Soft computing techniques to address various issues in wireless sensor networks: A survey. In Proceedings of the IEEE International Conference on Computing, Communication and Automation, ICCCA 2016, Greater Noida, India, 29–30 April 2016; pp. 399–404. [Google Scholar] [CrossRef]
  3. Cheng, L.; Yu, T. A new generation of AI: A review and perspective on machine learning technologies applied to smart energy and electric power systems. Int. J. Energy Res. 2019, 43, 1928–1973. [Google Scholar] [CrossRef]
  4. Cunha, B.; Madureira, A.M.; Fonseca, B.; Coelho, D. Deep Reinforcement Learning as a Job Shop Scheduling Solver: A Literature Review. In Proceedings of the 18th International Conference on Hybrid Intelligent Systems (HIS 2018), Porto, Portugal, 13–15 December 2018. [Google Scholar] [CrossRef]
  5. Morocho-Cayamcela, M.E.; Lee, H.; Lim, W. Machine learning for 5G/B5G mobile and wireless communications: Potential, limitations, and future directions. IEEE Access 2019, 7, 137184–137206. [Google Scholar] [CrossRef]
  6. Khan, S.; Farnsworth, M.; McWilliam, R.; Erkoyuncu, J. On the requirements of digital twin-driven autonomous maintenance. Annu. Rev. Control 2020, 50, 13–28. [Google Scholar] [CrossRef]
  7. Naeem, M.; Rizvi, S.T.H.; Coronato, A. A Gentle Introduction to Reinforcement Learning and its Application in Different Fields. IEEE Access 2020, 8, 209320–209344. [Google Scholar] [CrossRef]
  8. Quach, H.N.; Yeom, S.; Kim, K. Survey on reinforcement learning based efficient routing in SDN. In Proceedings of the 9th International Conference on Smart Media and Applications, Jeju, Republic of Korea, 17–19 September 2020; pp. 196–200. [Google Scholar] [CrossRef]
  9. Frikha, M.S.; Gammar, S.M.; Lahmadi, A.; Andrey, L. Reinforcement and deep reinforcement learning for wireless Internet of Things: A survey. Comput. Commun. 2021, 178, 98–113. [Google Scholar] [CrossRef]
  10. Xiao, Y.; Liu, J.; Wu, J.; Ansari, N. Leveraging Deep Reinforcement Learning for Traffic Engineering: A Survey. IEEE Commun. Surv. Tutor. 2021, 23, 2064–2097. [Google Scholar] [CrossRef]
  11. Esteso, A.; Peidro, D.; Mula, J.; Díaz-Madroñero, M. Reinforcement learning applied to production planning and control. Int. J. Prod. Res. 2023, 61, 5772–5789. [Google Scholar] [CrossRef]
  12. Torres, A.d.R.; Andreiana, D.S.; Roldán, Á.O.; Bustos, A.H.; Galicia, L.E.A. A Review of Deep Reinforcement Learning Approaches for Smart Manufacturing in Industry 4.0 and 5.0 Framework. Appl. Sci. 2022, 12, 12377. [Google Scholar] [CrossRef]
  13. Ogunfowora, O.; Najjaran, H. Reinforcement and deep reinforcement learning-based solutions for machine maintenance planning, scheduling policies, and optimization. J. Manuf. Syst. 2023, 70, 244–263. [Google Scholar] [CrossRef]
  14. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71. [Google Scholar] [CrossRef]
  15. Haddaway, N.R.; Page, M.J.; Pritchard, C.C.; McGuinness, L.A. PRISMA2020: An R package and Shiny app for producing PRISMA 2020-compliant flow diagrams, with interactivity for optimised digital transparency and Open Synthesis. Campbell Syst. Rev. 2022, 18, e1230. [Google Scholar] [CrossRef] [PubMed]
  16. Bai, Y.; Lv, Y. Reinforcement Learning-based Job Shop Scheduling for Remanufacturing Production. In Proceedings of the IEEE International Conference on Industrial Engineering and Engineering Management, Kuala Lumpur, Malaysia, 7–10 December 2022; pp. 246–251. [Google Scholar] [CrossRef]
  17. Bretas, A.M.; Mendes, A.; Chalup, S.; Jackson, M.; Clement, R.; Sanhueza, C. Addressing deadlock in large-scale, complex rail networks via multi-agent deep reinforcement learning. Expert Syst. 2023, 42, e13315. [Google Scholar] [CrossRef]
  18. Chang, J.; Yu, D.; Zhou, Z.; He, W.; Zhang, L. Hierarchical Reinforcement Learning for Multi-Objective Real-Time Flexible Scheduling in a Smart Shop Floor. Machines 2022, 10, 1195. [Google Scholar] [CrossRef]
  19. Chen, Q.; Huang, W.; Peng, Y.; Huang, Y. A Reinforcement Learning-Based Framework for Solving the IP Mapping Problem. IEEE Trans. Very Large Scale Integr. Syst. 2021, 29, 1638–1651. [Google Scholar] [CrossRef]
  20. Danino, T.; Ben-Shimol, Y.; Greenberg, S. Container Allocation in Cloud Environment Using Multi-Agent Deep Reinforcement Learning. Electronics 2023, 12, 2614. [Google Scholar] [CrossRef]
  21. Geng, N.; Lan, T.; Aggarwal, V.; Yang, Y.; Xu, M. A Multi-agent Reinforcement Learning Perspective on Distributed Traffic Engineering. In Proceedings of the 2020 IEEE 28th International Conference on Network Protocols (ICNP), Madrid, Spain, 13–16 October 2020. [Google Scholar] [CrossRef]
  22. Han, B.A.; Yang, J.J. Research on adaptive job shop scheduling problems based on dueling double DQN. IEEE Access 2020, 8, 186474–186495. [Google Scholar] [CrossRef]
  23. Huang, Y.; Hao, C.; Mao, Y.; Zhou, F. Dynamic Resource Configuration for Low-Power IoT Networks: A Multi-Objective Reinforcement Learning Method. IEEE Commun. Lett. 2021, 25, 2285–2289. [Google Scholar] [CrossRef]
  24. Islam, M.T.; Karunasekera, S.; Buyya, R. Performance and Cost-Efficient Spark Job Scheduling Based on Deep Reinforcement Learning in Cloud Computing Environments. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 1695–1710. [Google Scholar] [CrossRef]
  25. Li, X.; Wang, J.; Sawhney, R. Reinforcement learning for joint pricing, lead-time and scheduling decisions in make-to-order systems. Eur. J. Oper. Res. 2012, 221, 99–109. [Google Scholar] [CrossRef]
  26. Li, X.; Fang, Y.; Pan, C.; Cai, Y.; Zhou, M. Resource Scheduling for UAV-Assisted Failure-Prone MEC in Industrial Internet. Drones 2023, 7, 259. [Google Scholar] [CrossRef]
  27. Liu, W.; Wu, S.; Zhu, H.; Zhang, H. An Integration Method of Heterogeneous Models for Process Scheduling Based on Deep Q-Learning Integration Agent. In Proceedings of the 16th IEEE Conference on Industrial Electronics and Applications, ICIEA 2021, Chengdu, China, 1–4 August 2021; pp. 1966–1971. [Google Scholar] [CrossRef]
  28. Ma, S.; Ilyushkin, A.; Stegehuis, A.; Iosup, A. Ananke: A Q-Learning-Based Portfolio Scheduler for Complex Industrial Workflows. In Proceedings of the 2017 IEEE International Conference on Autonomic Computing, ICAC 2017, Columbus, OH, USA, 17–21 July 2017; pp. 227–232. [Google Scholar] [CrossRef]
  29. Martins, M.S.; Viegas, J.L.; Coito, T.; Firme, B.M.; Sousa, J.M.; Figueredo, J.; Vieira, S.M. Reinforcement learning for dual-resource constrained scheduling. IFAC-PapersOnLine 2020, 53, 10810–10815. [Google Scholar] [CrossRef]
  30. Moon, J.; Yang, M.; Jeong, J. A novel approach to the job shop scheduling problem based on the deep Q-network in a cooperative multi-access edge computing ecosystem. Sensors 2021, 21, 4553. [Google Scholar] [CrossRef] [PubMed]
  31. Siddesha, K.; Jayaramaiah, G.V.; Singh, C. A novel deep reinforcement learning scheme for task scheduling in cloud computing. Clust. Comput. 2022, 25, 4171–4188. [Google Scholar] [CrossRef]
  32. Silva, T.; Azevedo, A. Production flow control through the use of reinforcement learning. Procedia Manuf. 2019, 38, 194–202. [Google Scholar] [CrossRef]
  33. Williem, R.S.; Setiawan, K. Reinforcement learning combined with radial basis function neural network to solve job-shop scheduling problem. In Proceedings of the APBITM 2011—2011 IEEE International Summer Conference of Asia Pacific Business Innovation and Technology Management, Dalian, China, 10–12 July 2011; pp. 29–32. [Google Scholar] [CrossRef]
  34. Wang, T.; Hu, X.; Zhang, Y. A DRL based approach for adaptive scheduling of one-of-a-kind production. Comput. Oper. Res. 2023, 158, 106306. [Google Scholar] [CrossRef]
  35. Xu, N.; Bu, T.M. Policy network for solving flexible job shop scheduling problem with setup times and rescoure constraints. In Proceedings of the GECCO 2022 Companion—2022 Genetic and Evolutionary Computation Conference, Boston, MA, USA, 9–13 July 2022; pp. 208–211. [Google Scholar] [CrossRef]
  36. Yang, Y.; Chen, X.; Yang, M.; Guo, W.; Jiang, P. Designing an Industrial Product Service System for Robot-Driven Sanding Processing Line: A Reinforcement Learning Based Approach. Machines 2024, 12, 136. [Google Scholar] [CrossRef]
  37. Yuan, M.; Huang, H.; Li, Z.; Zhang, C.; Pei, F.; Gu, W. A multi-agent double Deep-Q-network based on state machine and event stream for flexible job shop scheduling problem. Adv. Eng. Inform. 2023, 58, 102230. [Google Scholar] [CrossRef]
  38. Yu, L.; Yu, P.S.; Duan, Y.; Qiao, H. A resource scheduling method for reliable and trusted distributed composite services in cloud environment based on deep reinforcement learning. Front. Genet. 2022, 13, 964784. [Google Scholar] [CrossRef]
  39. Zhang, C.; Odonkor, P.; Zheng, S.; Khorasgani, H.; Serita, S.; Gupta, C.; Wang, H. Dynamic Dispatching for Large-Scale Heterogeneous Fleet via Multi-agent Deep Reinforcement Learning. In Proceedings of the 2020 IEEE International Conference on Big Data, Big Data 2020, Atlanta, GA, USA, 10–13 December 2020; pp. 1436–1441. [Google Scholar] [CrossRef]
  40. Zhang, M.; Lu, Y.; Hu, Y.; Amaitik, N.; Xu, Y. Dynamic Scheduling Method for Job-Shop Manufacturing Systems by Deep Reinforcement Learning with Proximal Policy Optimization. Sustainability 2022, 14, 5177. [Google Scholar] [CrossRef]
  41. Zhao, Y.; Wang, Y.; Tan, Y.; Zhang, J.; Yu, H. Dynamic Jobshop Scheduling Algorithm Based on Deep Q Network. IEEE Access 2021, 9, 122995–123011. [Google Scholar] [CrossRef]
  42. Zhao, C.; Deng, N. An actor-critic framework based on deep reinforcement learning for addressing flexible job shop scheduling problems. Math. Biosci. Eng. 2024, 21, 1445–1471. [Google Scholar] [CrossRef] [PubMed]
  43. Chen, J.; Yang, P.; Ren, S.; Zhao, Z.; Cao, X.; Wu, D. Enhancing AIoT Device Association With Task Offloading in Aerial MEC Networks. IEEE Internet Things J. 2024, 11, 174–187. [Google Scholar] [CrossRef]
  44. Gao, Y.; Wu, W.; Nan, H.; Sun, Y.; Si, P. Deep Reinforcement Learning based Task Scheduling in Mobile Blockchain for IoT Applications. In Proceedings of the ICC 2020—2020 IEEE International Conference on Communications (ICC), Dublin, Ireland, 7–11 June 2020. [Google Scholar]
  45. Geurtsen, M.; Adan, I.; Atan, Z. Dynamic Scheduling of Maintenance by a Reinforcement Learning Approach—A Semiconductor Simulation Study. In Proceedings of the Winter Simulation Conference, Singapore, 11–14 December 2022; pp. 3110–3121. [Google Scholar] [CrossRef]
  46. Gong, Y.; Sun, S.; Wei, Y.; Song, M. Deep Reinforcement Learning for Edge Computing Resource Allocation in Blockchain Network Slicing Broker Framework. In Proceedings of the IEEE Vehicular Technology Conference, Helsinki, Finland, 25–28 April 2021. [Google Scholar] [CrossRef]
  47. Hao, Y.; Li, F.; Zhao, C.; Yang, S. Delay-Oriented Scheduling in 5G Downlink Wireless Networks Based on Reinforcement Learning With Partial Observations. IEEE/ACM Trans. Netw. 2023, 31, 380–394. [Google Scholar] [CrossRef]
  48. Lamprecht, R.; Wurst, F.; Huber, M.F. Reinforcement Learning based Condition-oriented Maintenance Scheduling for Flow Line Systems. In Proceedings of the IEEE International Conference on Industrial Informatics (INDIN), Palma de Mallorca, Spain, 21–23 July 2021; pp. 1–7. [Google Scholar] [CrossRef]
  49. Lei, K.; Guo, P.; Wang, Y.; Xiong, J.; Zhao, W. An End-to-end Hierarchical Reinforcement Learning Framework for Large-scale Dynamic Flexible Job-shop Scheduling Problem. In Proceedings of the International Joint Conference on Neural Networks, Padua, Italy, 18–23 July 2022; pp. 1–8. [Google Scholar] [CrossRef]
  50. Li, Y.L.; Fadda, E.; Manerba, D.; Roohnavazfar, M.; Tadei, R.; Terzo, O. Online Single-Machine Scheduling via Reinforcement Learning. In Recent Advances in Computational Optimization; Springer: Cham, Switzerland, 2022. [Google Scholar] [CrossRef]
  51. Li, K.; Ni, W.; Dressler, F. LSTM-Characterized Deep Reinforcement Learning for Continuous Flight Control and Resource Allocation in UAV-Assisted Sensor Network. IEEE Internet Things J. 2022, 9, 4179–4189. [Google Scholar] [CrossRef]
  52. Marchesano, M.G.; Guizzi, G.; Popolo, V.; Converso, G. Dynamic scheduling of a due date constrained flow shop with Deep Reinforcement Learning. IFAC-PapersOnLine 2022, 55, 2932–2937. [Google Scholar] [CrossRef]
  53. Meng, T.; Huang, J.; Li, H.; Li, Z.; Jiang, Y.; Zhong, Z. Q-Learning Based Optimisation Framework for Real-Time Mixed-Task Scheduling. Cyber-Phys. Syst. 2022, 8, 173–191. [Google Scholar] [CrossRef]
  54. Palacio, J.C.; Jiménez, Y.M.; Schietgat, L.; Doninck, B.V.; Nowé, A. A Q-Learning algorithm for flexible job shop scheduling in a real-world manufacturing scenario. Procedia CIRP 2022, 106, 227–232. [Google Scholar] [CrossRef]
  55. Raeissi, M.M.; Brooks, N.; Farinelli, A. A Balking Queue Approach for Modeling Human-Multi-Robot Interaction for Water Monitoring. In Proceedings of the PRIMA 2017: Principles and Practice of Multi-Agent Systems—20th International Conference, Nice, France, 30 October–3 November 2017; 10621 LNAI. pp. 212–223. [Google Scholar] [CrossRef]
  56. Tan, Q.; Tong, Y.; Wu, S.; Li, D. Modeling, planning, and scheduling of shop-floor assembly process with dynamic cyber-physical interactions: A case study for CPS-based smart industrial robot production. Int. J. Adv. Manuf. Technol. 2019, 105, 3979–3989. [Google Scholar] [CrossRef]
  57. Tassel, P.; Kovács, B.; Gebser, M.; Schekotihin, K.; Kohlenbrein, W.; Schrott-Kostwein, P. Reinforcement Learning of Dispatching Strategies for Large-Scale Industrial Scheduling. In Proceedings of the International Conference on Automated Planning and Scheduling, ICAPS, Virtual, 13–24 June 2022; Volume 32, pp. 638–646. [Google Scholar] [CrossRef]
  58. Tripathy, S.S.; Bebortta, S.; Gadekallu, T.R. Sustainable Fog-Assisted Intelligent Monitoring Framework for Consumer Electronics in Industry 5.0 Applications. IEEE Trans. Consum. Electron. 2024, 70, 1501–1510. [Google Scholar] [CrossRef]
  59. Thomas, T.E.; Koo, J.; Chaterji, S.; Bagchi, S. Minerva: A reinforcement learning-based technique for optimal scheduling and bottleneck detection in distributed factory operations. In Proceedings of the 2018 10th International Conference on Communication Systems and Networks, COMSNETS 2018, Bengaluru, India, 3–7 January 2018; pp. 129–136. [Google Scholar] [CrossRef]
  60. Valet, A.; Altenmüller, T.; Waschneck, B.; May, M.C.; Kuhnle, A.; Lanza, G. Opportunistic maintenance scheduling with deep reinforcement learning. J. Manuf. Syst. 2022, 64, 518–534. [Google Scholar] [CrossRef]
  61. Xing, Y.; Yang, L.; Hu, X.; Mei, C.; Wang, H.; Li, J. 6G Deterministic Network Technology Based on Hierarchical Reinforcement Learning Framework. In Proceedings of the IEEE International Symposium on Broadband Multimedia Systems and Broadcasting, BMSB, Beijing, China, 14–16 June 2023; pp. 1–6. [Google Scholar] [CrossRef]
  62. Wang, Z.; Liao, W. Smart scheduling of dynamic job shop based on discrete event simulation and deep reinforcement learning. J. Intell. Manuf. 2023, 35, 2593–2610. [Google Scholar] [CrossRef]
  63. Yang, D.; Gong, K.; Zhang, W.; Guo, K.; Chen, J. enDRTS: Deep Reinforcement Learning Based Deterministic Scheduling for Chain Flows in TSN. In Proceedings of the 2023 International Conference on Networking and Network Applications (NaNA), Qingdao, China, 18–21 August 2023; pp. 239–244. [Google Scholar] [CrossRef]
  64. Zhang, Z.; Li, S.; Yan, X.; Zhang, L. Self-organizing network control with a TD learning algorithm. In Proceedings of the IEEE International Conference on Industrial Engineering and Engineering Management, Singapore, 10–13 December 2018; pp. 2159–2163. [Google Scholar] [CrossRef]
  65. Zhang, T.; Shen, S.; Mao, S.; Chang, G.K. Delay-aware Cellular Traffic Scheduling with Deep Reinforcement Learning. In Proceedings of the GLOBECOM 2020—2020 IEEE Global Communications Conference, Taipei, Taiwan, 7–11 December 2020; pp. 1–5. [Google Scholar] [CrossRef]
  66. Zhang, L.; Yang, C.; Yan, Y.; Hu, Y. Distributed Real-Time Scheduling in Cloud Manufacturing by Deep Reinforcement Learning. IEEE Trans. Ind. Inform. 2022, 18, 8999–9007. [Google Scholar] [CrossRef]
  67. Zhang, F.; Han, G.; Liu, L.; Zhang, Y.; Peng, Y.; Li, C. Cooperative Partial Task Offloading and Resource Allocation for IIoT Based on Decentralized Multi-Agent Deep Reinforcement Learning. IEEE Internet Things J. 2024, 11, 5526–5544. [Google Scholar] [CrossRef]
  68. Zhu, Y.; Sun, L.; Wang, J.; Huang, R.; Jia, X. Deep Reinforcement Learning-Based Joint Scheduling of 5G and TSN in Industrial Networks. Electronics 2023, 12, 2686. [Google Scholar] [CrossRef]
  69. Aissani, N.; Bekrar, A.; Trentesaux, D.; Beldjilali, B. Dynamic scheduling for multi-site companies: A decisional approach based on reinforcement multi-agent learning. J. Intell. Manuf. 2012, 23, 2513–2529. [Google Scholar] [CrossRef]
  70. Amaral, P.; Simoes, D. Deep Reinforcement Learning Based Routing in IP Media Broadcast Networks: Feasibility and Performance. IEEE Access 2022, 10, 62459–62470. [Google Scholar] [CrossRef]
  71. Bulbul, N.S.; Fischer, M. Reinforcement Learning assisted Routing for Time Sensitive Networks. In Proceedings of the GLOBECOM 2022—2022 IEEE Global Communications Conference, Rio de Janeiro, Brazil, 4–8 December 2022; pp. 3863–3868. [Google Scholar] [CrossRef]
  72. Chen, B.; Wan, J.; Lan, Y.; Imran, M.; Li, D.; Guizani, N. Improving cognitive ability of edge intelligent IIoT through machine learning. IEEE Netw. 2019, 33, 61–67. [Google Scholar] [CrossRef]
  73. Dahl, T.S.; Matarić, M.; Sukhatme, G.S. Multi-robot task allocation through vacancy chain scheduling. Robot. Auton. Syst. 2009, 57, 674–687. [Google Scholar] [CrossRef]
  74. Farahani, A.; Genga, L.; DIjkman, R. Online Multimodal Transportation Planning using Deep Reinforcement Learning. In Proceedings of the 2021 IEEE International Conference on Systems, Man and Cybernetics, Melbourne, Australia, 17–20 October 2021; pp. 1691–1698. [Google Scholar] [CrossRef]
  75. Haliem, M.; Mani, G.; Aggarwal, V.; Bhargava, B. A Distributed Model-Free Ride-Sharing Approach for Joint Matching, Pricing, and Dispatching Using Deep Reinforcement Learning. IEEE Trans. Intell. Transp. Syst. 2021, 22, 7931–7942. [Google Scholar] [CrossRef]
  76. Hazra, A.; Amgoth, T. CeCO: Cost-Efficient Computation Offloading of IoT Applications in Green Industrial Fog Networks. IEEE Trans. Ind. Inform. 2022, 18, 6255–6263. [Google Scholar] [CrossRef]
  77. Höpken, A.; Pargmann, H.; Schallner, H.; Galczynski, A.; Gerdes, L. Delivery scheduling in meat industry using reinforcement learning. Procedia CIRP 2023, 118, 68–73. [Google Scholar] [CrossRef]
  78. Huang, J.P.; Gao, L.; Li, X.Y. A Cooperative Hierarchical Deep Reinforcement Lerning based Multi-agent Method for Distributed Job Shop Scheduling Problem with Random Job Arrivals. Comput. Ind. Eng. 2023, 185, 109650. [Google Scholar] [CrossRef]
  79. Hubbs, C.D.; Li, C.; Sahinidis, N.V.; Grossmann, I.E.; Wassick, J.M. A deep reinforcement learning approach for chemical production scheduling. Comput. Chem. Eng. 2020, 141, 106982. [Google Scholar] [CrossRef]
  80. Lei, K.; Guo, P.; Wang, Y.; Zhang, J.; Meng, X.; Qian, L. Large-Scale Dynamic Scheduling for Flexible Job-Shop With Random Arrivals of New Jobs by Hierarchical Reinforcement Learning. IEEE Trans. Ind. Inform. 2024, 20, 1007–1018. [Google Scholar] [CrossRef]
  81. Li, J.; Ma, Y.; Gao, R.; Cao, Z.; Lim, A.; Song, W.; Zhang, J. Deep Reinforcement Learning for Solving the Heterogeneous Capacitated Vehicle Routing Problem. IEEE Trans. Cybern. 2022, 52, 13572–13585. [Google Scholar] [CrossRef]
  82. Li, H.; Assis, K.D.R.; Yan, S.; Simeonidou, D. DRL-Based Long-Term Resource Planning for Task Offloading Policies in Multiserver Edge Computing Networks. IEEE Trans. Netw. Serv. Manag. 2022, 19, 4151–4164. [Google Scholar] [CrossRef]
  83. Li, K. Optimizing warehouse logistics scheduling strategy using soft computing and advanced machine learning techniques. Soft Comput. 2023, 27, 18077–18092. [Google Scholar] [CrossRef]
  84. Liang, W.; Xie, W.; Zhou, X.; Wang, K.I.; Ma, J.; Jin, Q. Bi-Dueling DQN Enhanced Two-stage Scheduling for Augmented Surveillance in Smart EMS. IEEE Trans. Ind. Inform. 2022, 19, 8218–8228. [Google Scholar] [CrossRef]
  85. Liu, Z.; Long, C.; Lu, X.; Hu, Z.; Zhang, J.; Wang, Y. Which Channel to Ask My Question? Personalized Customer Service Request Stream Routing Using Deep Reinforcement Learning. IEEE Access 2019, 7, 107744–107756. [Google Scholar] [CrossRef]
  86. Lu, Y.; Huang, X.; Zhang, K.; Maharjan, S.; Zhang, Y. Communication-Efficient Federated Learning and Permissioned Blockchain for Digital Twin Edge Networks. IEEE Internet Things J. 2021, 8, 2276–2288. [Google Scholar] [CrossRef]
  87. Méndez-Hernández, B.M.; Rodríguez-Bazan, E.D.; Martinez-Jimenez, Y.; Libin, P.; Nowé, A. A Multi-objective Reinforcement Learning Algorithm for JSSP. In Proceedings of the 28th International Conference on Artificial Neural Networks, Munich, Germany, 17–19 September 2019; pp. 567–584. [Google Scholar] [CrossRef]
  88. Mhaisen, N.; Fetais, N.; Massoud, A. Real-Time Scheduling for Electric Vehicles Charging/Discharging Using Reinforcement Learning. In Proceedings of the 2020 IEEE International Conference on Informatics, IoT, and Enabling Technologies, ICIoT 2020, Doha, Qatar, 2–5 February 2020; pp. 1–6. [Google Scholar] [CrossRef]
  89. Paraschos, P.D.; Koulinas, G.K.; Koulouriotis, D.E. A reinforcement learning/ad-hoc planning and scheduling mechanism for flexible and sustainable manufacturing systems. Flex. Serv. Manuf. J. 2024, 36, 714–736. [Google Scholar] [CrossRef]
  90. Park, I.B.; Huh, J.; Kim, J.; Park, J. A Reinforcement Learning Approach to Robust Scheduling of Semiconductor Manufacturing Facilities. IEEE Trans. Autom. Sci. Eng. 2020, 17, 1420–1431. [Google Scholar] [CrossRef]
  91. Roy, S.B.; Tan, E. Multi-hop Computational Offloading with Reinforcement Learning for Industrial IoT Networks. In Proceedings of the 2023 IEEE 97th Vehicular Technology Conference (VTC2023-Spring), Florence, Italy, 20–23 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  92. Schneider, J.G.; Boyan, J.A.; Moore, A.W. Stochastic Production Scheduling to meet Demand Forecasts. In Proceedings of the 37th IEEE Conference on Decision & Control, Tampa, FL, USA, 18 December 1998; pp. 2722–2727. [Google Scholar]
  93. Shen, X.; Liu, S.; Zhou, B.; Wu, T.; Zhang, Q.; Bao, J. Digital Twin-Driven Reinforcement Learning Method for Marine Equipment Vehicles Scheduling Problem. IEEE Trans. Autom. Sci. Eng. 2024, 21, 2173–2183. [Google Scholar] [CrossRef]
  94. Song, W.; Mi, N.; Li, Q.; Zhuang, J.; Cao, Z. Stochastic Economic Lot Scheduling via Self-Attention Based Deep Reinforcement Learning. IEEE Trans. Autom. Sci. Eng. 2024, 21, 1457–1468. [Google Scholar] [CrossRef]
  95. Tong, Z.; Wang, J.; Wang, Y.; Liu, B.; Li, Q. Energy and Performance-Efficient Dynamic Consolidate VMs Using Deep-Q Neural Network. IEEE Trans. Ind. Inform. 2023, 19, 11030–11040. [Google Scholar] [CrossRef]
  96. Vivekanandan, D.; Wirth, S.; Karlbauer, P.; Klarmann, N. A Reinforcement Learning Approach for Scheduling Problems with Improved Generalization through Order Swapping. Mach. Learn. Knowl. Extr. 2023, 5, 418–430. [Google Scholar] [CrossRef]
  97. Yu, X.; Wang, R.; Hao, J.; Wu, Q.; Yi, C.; Wang, P.; Niyato, D. Priority-Aware Deployment of Autoscaling Service Function Chains based On Deep Reinforcement Learning. IEEE Trans. Cogn. Commun. Netw. 2024, 10, 1050–1062. [Google Scholar] [CrossRef]
  98. Wang, S.; Bi, S.; Zhang, Y.A. Reinforcement Learning for Real-Time Pricing and Scheduling Control in EV Charging Stations. IEEE Trans. Ind. Inform. 2021, 17, 849–859. [Google Scholar] [CrossRef]
  99. Zhang, H.; Feng, L.; Liu, X.; Long, K.; Karagiannidis, G.K. User Scheduling and Task Offloading in Multi-Tier Computing 6G Vehicular Network. IEEE J. Sel. Areas Commun. 2023, 41, 446–456. [Google Scholar] [CrossRef]
  100. Zhang, F.; Han, G.; Li, A.; Lin, C.; Liu, L. QoS-Driven Distributed Cooperative Data Offloading and Heterogeneous Resource Scheduling for IIoT. IEEE Internet Things Mag. 2023, 6, 118–124. [Google Scholar] [CrossRef]
  101. Zhang, J.; Kong, L.; Zhang, H. Coordinated Ride-hailing Order Scheduling and Charging for Autonomous Electric Vehicles Based on Deep Reinforcement Learning. In Proceedings of the 2023 IEEE IAS Industrial and Commercial Power System Asia, I and CPS Asia 2023, Chongqing, China, 7–9 July 2023; pp. 2038–2044. [Google Scholar] [CrossRef]
  102. Chen, Z.; Wang, H.; Wang, B.; Yang, L.; Song, C.; Zhang, X.; Lin, F.; Cheng, J.C. Scheduling optimization of electric ready mixed concrete vehicles using an improved model-based reinforcement learning. Autom. Constr. 2024, 160, 105308. [Google Scholar] [CrossRef]
  103. Fu, F.; Kang, Y.; Zhang, Z.; Yu, F.R.; Wu, T. Soft Actor-Critic DRL for Live Transcoding and Streaming in Vehicular Fog-Computing-Enabled IoV. IEEE Internet Things J. 2021, 8, 1308–1321. [Google Scholar] [CrossRef]
  104. Gao, Y.; Zhang, C.; Xie, Z.; Qi, Z.; Zhou, J. Cost-Efficient and Quality-of-Experience-Aware Player Request Scheduling and Rendering Server Allocation for Edge-Computing-Assisted Multiplayer Cloud Gaming. IEEE Internet Things J. 2022, 9, 12029–12040. [Google Scholar] [CrossRef]
  105. Huang, Y.; Sun, Y.; Ding, Z. Renewable Energy Integration Driven Charging Scheme for Electric Vehicle Based Large Scale Delivery System. In Proceedings of the 2022 IEEE/IAS Industrial and Commercial Power System Asia (I&CPS Asia), Shanghai, China, 8–11 July 2022; pp. 1251–1256. [Google Scholar] [CrossRef]
  106. Ingalalli, A.; Kamalasadan, S.; Dong, Z.; Bharati, G.; Chakraborty, S. An Extended Q-Routing-based Event-driven Dynamic Reconfiguration of Networked Microgrids. In Proceedings of the 2022 IEEE Industry Applications Society Annual Meeting (IAS), Detroit, MI, USA, 9–14 October 2022; pp. 1–6. [Google Scholar] [CrossRef]
  107. Kim, S. Multi-Agent Learning and Bargaining Scheme for Cooperative Spectrum Sharing Process. IEEE Access 2023, 11, 47863–47872. [Google Scholar] [CrossRef]
  108. Lee, Y.H.; Lee, S. Deep reinforcement learning based scheduling within production plan in semiconductor fabrication. Expert Syst. Appl. 2022, 191, 116222. [Google Scholar] [CrossRef]
  109. Lei, J.; Hui, J.; Chang, F.; Dassari, S.; Ding, K. Reinforcement learning-based dynamic production-logistics-integrated tasks allocation in smart factories. Int. J. Prod. Res. 2023, 61, 4419–4436. [Google Scholar] [CrossRef]
  110. Li, M.; Chen, C.; Hua, C.; Guan, X. Learning-Based Autonomous Scheduling for AoI-Aware Industrial Wireless Networks. IEEE Internet Things J. 2020, 7, 9175–9188. [Google Scholar] [CrossRef]
  111. Ong, K.S.H.; Wang, W.; Hieu, N.Q.; Niyato, D.; Friedrichs, T. Predictive Maintenance Model for IIoT-Based Manufacturing: A Transferable Deep Reinforcement Learning Approach. IEEE Internet Things J. 2022, 9, 15725–15741. [Google Scholar] [CrossRef]
  112. Onishi, T.; Takahashi, E.; Nishikawa, Y.; Maruyama, S. AppDAS: An Application QoS-Aware Distributed Antenna Selection for 5G Industrial Applications. In Proceedings of the 2023 IEEE 20th Consumer Communications & Networking Conference (CCNC), Las Vegas, NV, USA, 8–11 January 2023; pp. 1027–1032. [Google Scholar] [CrossRef]
  113. Peng, S.; Xiong, G.; Ren, Y.; Shen, Z.; Liu, S.; Han, Y. A Parallel Learning Approach for the Flexible Job Shop Scheduling Problem. IEEE J. Radio Freq. Identif. 2022, 6, 851–856. [Google Scholar] [CrossRef]
  114. Redhu, S.; Hegde, R.M. Cooperative Network Model for Joint Mobile Sink Scheduling and Dynamic Buffer Management Using Q-Learning. IEEE Trans. Netw. Serv. Manag. 2020, 17, 1853–1864. [Google Scholar] [CrossRef]
  115. Rjoub, G.; Bentahar, J.; Abdel Wahab, O.; Saleh Bataineh, A. Deep and reinforcement learning for automated task scheduling in large-scale cloud computing systems. Concurr. Comput. Pract. Exp. 2021, 33, e5919. [Google Scholar] [CrossRef]
  116. Ruiz Rodríguez, M.L.; Kubler, S.; de Giorgio, A.; Cordy, M.; Robert, J.; Le Traon, Y. Multi-agent deep reinforcement learning based Predictive Maintenance on parallel machines. Robot. Comput. Integr. Manuf. 2022, 78, 102406. [Google Scholar] [CrossRef]
  117. Song, W.; Chen, X.; Li, Q.; Cao, Z. Flexible Job-Shop Scheduling via Graph Neural Network and Deep Reinforcement Learning. IEEE Trans. Ind. Inform. 2023, 19, 1600–1610. [Google Scholar] [CrossRef]
  118. Tan, L.; Hai, X.; Ma, K.; Fan, D.; Qiu, H.; Feng, Q. Digital Twin-Enabled Decision-Making Framework for Multi-UAV Mission Planning: A Multiagent Deep Reinforcement Learning Perspective. In Proceedings of the IECON 2023—49th Annual Conference of the IEEE Industrial Electronics Society, Singapore, 16–19 October 2023; pp. 14–19. [Google Scholar] [CrossRef]
  119. Waschneck, B.; Reichstaller, A.; Belzner, L.; Altenmuller, T.; Bauernhansl, T.; Knapp, A.; Kyek, A. Deep reinforcement learning for semiconductor production scheduling. In Proceedings of the 2018 29th Annual SEMI Advanced Semiconductor Manufacturing Conference, ASMC 2018, Saratoga Springs, NY, USA, 30 April–3 May 2018; pp. 301–306. [Google Scholar] [CrossRef]
  120. Xia, M.; Liu, H.; Li, M.; Wang, L. A dynamic scheduling method with Conv-Dueling and generalized representation based on reinforcement learning. Int. J. Ind. Eng. Comput. 2023, 14, 805–820. [Google Scholar] [CrossRef]
  121. Xie, R.; Gu, D.; Tang, Q.; Huang, T.; Yu, F.R. Workflow Scheduling in Serverless Edge Computing for the Industrial Internet of Things: A Learning Approach. IEEE Trans. Ind. Inform. 2022, 19, 8242–8252. [Google Scholar] [CrossRef]
  122. Xu, Y.; Zhao, J. Actor-Critic with Transformer for Cloud Computing Resource Three Stage Job Scheduling. In Proceedings of the 2022 7th International Conference on Cloud Computing and Big Data Analytics, ICCCBDA 2022, Chengdu, China, 22–24 April 2022; pp. 33–37. [Google Scholar] [CrossRef]
  123. Yan, K.; Shan, H.; Sun, T.; Hu, H.; Wu, Y.; Yu, L.; Zhang, Z.; Quek, T.Q. Reinforcement Learning-Based Mobile Edge Computing and Transmission Scheduling for Video Surveillance. IEEE Trans. Emerg. Top. Comput. 2022, 10, 1142–1156. [Google Scholar] [CrossRef]
  124. Wang, S.; Li, J.; Luo, Y. Smart Scheduling for Flexible and Hybrid Production with Multi-Agent Deep Reinforcement Learning. In Proceedings of the 2021 IEEE 2nd International Conference on Information Technology, Big Data and Artificial Intelligence, ICIBA 2021, Chongqing, China, 17–19 December 2021; Volume 2, pp. 288–294. [Google Scholar] [CrossRef]
  125. Wang, Z.; Liao, W. Job Shop Scheduling Problem Using Proximal Policy Optimization. In Proceedings of the 2023 IEEE International Conference on Industrial Engineering and Engineering Management, IEEM 2023, Singapore, 18–21 December 2023; pp. 1517–1521. [Google Scholar] [CrossRef]
  126. Wei, Z.; Li, M.; Wei, Z.; Cheng, L.; Lyu, Z.; Liu, F. A novel on-demand charging strategy based on swarm reinforcement learning in WRSNs. IEEE Access 2020, 8, 84258–84271. [Google Scholar] [CrossRef]
  127. Wu, D.; Liu, T.; Li, Z.; Tang, T.; Wang, R. Delay-Aware Edge-Terminal Collaboration in Green Internet of Vehicles: A Multiagent Soft Actor-Critic Approach. IEEE Trans. Green Commun. Netw. 2023, 7, 1090–1102. [Google Scholar] [CrossRef]
  128. Yan, L.; Shen, H.; Kang, L.; Zhao, J.; Zhang, Z.; Xu, C. MobiCharger: Optimal Scheduling for Cooperative EV-to-EV Dynamic Wireless Charging. IEEE Trans. Mob. Comput. 2023, 22, 6889–6906. [Google Scholar] [CrossRef]
  129. Zisgen, H.; Miltenberger, R.; Hochhaus, M.; Stöhr, N. Dynamic Scheduling of Gantry Robots using Simulation and Reinforcement Learning. In Proceedings of the 2023 Winter Simulation Conference (WSC), San Antonio, TX, USA, 10–13 December 2023; pp. 3026–3034. [Google Scholar] [CrossRef]
  130. Zhao, Y.; Luo, X.; Zhang, Y. The application of heterogeneous graph neural network and deep reinforcement learning in hybrid flow shop scheduling problem. Comput. Ind. Eng. 2024, 187, 109802. [Google Scholar] [CrossRef]
  131. Zhou, J.; Zheng, L.; Fan, W. Multirobot collaborative task dynamic scheduling based on multiagent reinforcement learning with heuristic graph convolution considering robot service performance. J. Manuf. Syst. 2024, 72, 122–141. [Google Scholar] [CrossRef]
  132. Felder, M.; Steiner, D.; Busch, P.; Trat, M.; Sun, C.; Bender, J.; Ovtcharova, J. Energy-Flexible Job-Shop Scheduling Using Deep Reinforcement Learning. In Proceedings of the Conference on Production Systems and Logistics, Santiago de Querétaro, Mexico, 28 February–2 March 2023; pp. 353–362. [Google Scholar] [CrossRef]
  133. Lara-Cárdenas, E.; Silva-Gálves, A.; Ortiz-Bayliss, J.C.; Amaya, I.; Cruz-Duarte, J.M.; Terashima-Marín, H. Exploring Reward-based Hyper-heuristics for the Job-shop Scheduling Problem. In Proceedings of the 2020 IEEE Symposium Series on Computational Intelligence (SSCI), Canberra, Australia, 1–4 December 2020; pp. 3133–3140. [Google Scholar]
  134. Qu, S.; Jie, W.; Shivani, G. Learning adaptive dispatching rules for a manufacturing process system by using reinforcement learning approach. In Proceedings of the IEEE International Conference on Emerging Technologies and Factory Automation, ETFA, Berlin, Germany, 6–9 September 2016. [Google Scholar] [CrossRef]
  135. Teng, Y.; Li, L.; Song, L.; Yu, F.R.; Leung, V.C. Profit Maximizing Smart Manufacturing over AI-Enabled Configurable Blockchains. IEEE Internet Things J. 2022, 9, 346–358. [Google Scholar] [CrossRef]
  136. Wang, X.; Zhang, L.; Liu, Y.; Zhao, C. Logistics-involved task scheduling in cloud manufacturing with offline deep reinforcement learning. J. Ind. Inf. Integr. 2023, 34, 100471. [Google Scholar] [CrossRef]
  137. Klein, N.; Prunte, J. A New Deep Reinforcement Learning Algorithm for the Online Stochastic Profitable Tour Problem. In Proceedings of the 2022 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM), Kuala Lumpur, Malaysia, 7–10 December 2022; pp. 635–639. [Google Scholar] [CrossRef]
  138. Liu, Y.; Yang, C.; Jiang, L.; Xie, S.; Zhang, Y. Intelligent Edge Computing for IoT-Based Energy Management in Smart Cities. IEEE Netw. 2019, 33, 111–117. [Google Scholar] [CrossRef]
  139. Muller, A.; Grumbach, F.; Kattenstroth, F. Reinforcement Learning for Two-Stage Permutation Flow Shop Scheduling—A Real-World Application in Household Appliance Production. IEEE Access 2024, 12, 11388–11399. [Google Scholar] [CrossRef]
  140. Wang, Y.; Chen, X.; Wang, L. Deep Reinforcement Learning-Based Rescue Resource Distribution Scheduling of Storm Surge Inundation Emergency Logistics. IEEE Trans. Ind. Inform. 2023, 19, 10004–10013. [Google Scholar] [CrossRef]
  141. Yan, H.; Cui, Z.; Chen, X.; Ma, X. Distributed Multiagent Deep Reinforcement Learning for Multiline Dynamic Bus Timetable Optimization. IEEE Trans. Ind. Inform. 2023, 19, 469–479. [Google Scholar] [CrossRef]
  142. Zhou, Y.; Li, X.; Luo, J.; Yuan, M.; Zeng, J.; Yao, J. Learning to Optimize DAG Scheduling in Heterogeneous Environment. In Proceedings of the IEEE International Conference on Mobile Data Management, Paphos, Cyprus, 6–9 June 2022; pp. 137–146. [Google Scholar] [CrossRef]
  143. Chen, Q.; Zheng, Z.; Hu, C.; Wang, D.; Liu, F. Data-driven task allocation for multi-task transfer learning on the edge. In Proceedings of the International Conference on Distributed Computing Systems, Dallas, TX, USA, 7–10 July 2019; pp. 1040–1050. [Google Scholar] [CrossRef]
  144. Choi, G.; Jeon, S.; Cho, J.; Moon, J. A Seed Scheduling Method with a Reinforcement Learning for a Coverage Guided Fuzzing. IEEE Access 2023, 11, 2048–2057. [Google Scholar] [CrossRef]
  145. Du, H.; Xu, W.; Yao, B.; Zhou, Z.; Hu, Y. Collaborative optimization of service scheduling for industrial cloud robotics based on knowledge sharing. Procedia CIRP 2019, 83, 132–138. [Google Scholar] [CrossRef]
  146. Fechter, J.; Beham, A.; Wagner, S.; Affenzeller, M. Approximate Q-Learning for Stacking Problems with Continuous Production and Retrieval. Appl. Artif. Intell. 2019, 33, 68–86. [Google Scholar] [CrossRef]
  147. Fu, F.; Kang, Y.; Zhang, Z.; Yu, F.R. Transcoding for live streaming-based on vehicular fog computing: An actor-critic DRL approach. In Proceedings of the IEEE INFOCOM 2020—IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Toronto, ON, Canada, 6–9 July 2020; pp. 1015–1020. [Google Scholar] [CrossRef]
  148. Iwamura, K.; Mayumi, N.; Tanimizu, Y.; Sugimura, N. A study on real-time scheduling for holonic manufacturing systems—Determination of utility values based on multi-agent reinforcement learning. In Proceedings of the 4th International Conference on Industrial Applications of Holonic and Multi-Agent Systems, HoloMAS 2009, Linz, Austria, 31 August–2 September 2009; pp. 135–144. [Google Scholar] [CrossRef]
  149. Lei, W.; Ye, Y.; Xiao, M. Deep Reinforcement Learning-Based Spectrum Allocation in Integrated Access and Backhaul Networks. IEEE Trans. Cogn. Commun. Netw. 2020, 6, 970–979. [Google Scholar] [CrossRef]
  150. Li, X.; Luo, W.; Yuan, M.; Wang, J.; Lu, J.; Wang, J.; Lu, J.; Zeng, J. Learning to optimize industry-scale dynamic pickup and delivery problems. In Proceedings of the International Conference on Data Engineering, Chania, Greece, 19–22 April 2021; pp. 2511–2522. [Google Scholar] [CrossRef]
  151. Liu, Y.; Yang, M.; Guo, Z. Reinforcement learning based optimal decision making towards product lifecycle sustainability. Int. J. Comput. Integr. Manuf. 2022, 35, 1269–1296. [Google Scholar] [CrossRef]
  152. Ma, S.; Ruan, J.; Du, Y.; Bucknall, R.; Liu, Y. An End-to-End Deep Reinforcement Learning Based Modular Task Allocation Framework for Autonomous Mobile Systems. IEEE Trans. Autom. Sci. Eng. 2024, 1–15. [Google Scholar] [CrossRef]
  153. Melnik, M.; Dolgov, I.; Nasonov, D. Hybrid intellectual scheme for scheduling of heterogeneous workflows based on evolutionary approach and reinforcement learning. In Proceedings of the IJCCI 2020—12th International Joint Conference on Computational Intelligence, Budapest, Hungary, 2–4 November 2020; pp. 200–211. [Google Scholar] [CrossRef]
  154. Muller-Zhang, Z.; Kuhn, T. A Digital Twin-based Approach Performing Integrated Process Planning and Scheduling for Service-based Production. In Proceedings of the IEEE International Conference on Emerging Technologies and Factory Automation, ETFA, Stuttgart, Germany, 6–9 September 2022. [Google Scholar] [CrossRef]
  155. Phiboonbanakit, T.; Horanont, T.; Huynh, V.N.; Supnithi, T. A Hybrid Reinforcement Learning-Based Model for the Vehicle Routing Problem in Transportation Logistics. IEEE Access 2021, 9, 163325–163347. [Google Scholar] [CrossRef]
  156. Song, G.; Xia, M.; Zhang, D. Deep Reinforcement Learning for Risk and Disaster Management in Energy-Efficient Marine Ranching. Energies 2023, 16, 6092. [Google Scholar] [CrossRef]
  157. Szwarcfiter, C.; Herer, Y.T.; Shtub, A. Balancing Project Schedule, Cost, and Value under Uncertainty: A Reinforcement Learning Approach. Algorithms 2023, 16, 395. [Google Scholar] [CrossRef]
  158. Troch, A.; Mannens, E.; Mercelis, S. Solving the Storage Location Assignment Problem Using Reinforcement Learning. In Proceedings of the 2023 the 8th International Conference on Mathematics and Artificial Intelligence, Chongqing, China, 7–9 April 2023; pp. 89–95. [Google Scholar] [CrossRef]
  159. Troia, S.; Alvizu, R.; Maier, G. Reinforcement learning for service function chain reconfiguration in NFV-SDN metro-core optical networks. IEEE Access 2019, 7, 167944–167957. [Google Scholar] [CrossRef]
  160. Zhang, J.; Lv, Y.; Li, Y.; Liu, J. An Improved QMIX-Based AGV Scheduling Approach for Material Handling Towards Intelligent Manufacturing. In Proceedings of the 2022 IEEE 20th International Conference on Embedded and Ubiquitous Computing, EUC 2022, Wuhan, China, 9–11 December 2022; pp. 54–59. [Google Scholar] [CrossRef]
  161. Guo, Y.; Li, J.; Xiao, L.; Allaoui, H.; Choudhary, A.; Zhang, L. Efficient inventory routing for Bike-Sharing Systems: A combinatorial reinforcement learning framework. Transp. Res. Part E Logist. Transp. Rev. 2024, 182, 103415. [Google Scholar] [CrossRef]
  162. Kumar, A.; Dimitrakopoulos, R. Production scheduling in industrial mining complexes with incoming new information using tree search and deep reinforcement learning. Appl. Soft Comput. 2021, 110, 107644. [Google Scholar] [CrossRef]
  163. Liu, S.; Wang, W.; Zhong, S.; Peng, Y.; Tian, Q.; Li, R.; Sun, X.; Yang, Y. A graph-based approach for integrating massive data in container terminals with application to scheduling problem. Int. J. Prod. Res. 2024, 62, 5945–5965. [Google Scholar] [CrossRef]
  164. Lu, Y.; Fang, S.; Niu, T.; Chen, G.; Liao, R. Battery Swapping Strategy for Electric Transfer-Vehicles in Seaport: A Deep Q-Network Approach. In Proceedings of the 2023 IEEE/IAS 59th Industrial and Commercial Power Systems Technical Conference (I&CPS), Las Vegas, NV, USA, 21–25 May 2023; pp. 1–6. [Google Scholar] [CrossRef]
  165. Ruan, J.H.; Wang, Z.X.; Chan, F.T.; Patnaik, S.; Tiwari, M.K. A reinforcement learning-based algorithm for the aircraft maintenance routing problem. Expert Syst. Appl. 2021, 169, 114399. [Google Scholar] [CrossRef]
  166. Sun, Y.; Long, Y.; Xu, L.; Tan, W.; Huang, L.; Zhao, L.; Liu, W. Long-Term Matching Optimization With Federated Neural Temporal Difference Learning in Mobility-on-Demand Systems. IEEE Internet Things J. 2023, 10, 1426–1445. [Google Scholar] [CrossRef]
  167. Wang, G.; Qin, Z.; Wang, S.; Sun, H.; Dong, Z.; Zhang, D. Towards Accessible Shared Autonomous Electric Mobility with Dynamic Deadlines. IEEE Trans. Mob. Comput. 2024, 23, 925–940. [Google Scholar] [CrossRef]
  168. Zhang, L.; Yan, Y.; Hu, Y. Deep reinforcement learning for dynamic scheduling of energy-efficient automated guided vehicles. J. Intell. Manuf. 2024, 35, 3875–3888. [Google Scholar] [CrossRef]
  169. Zhang, L.; Yang, C.; Yan, Y.; Cai, Z.; Hu, Y. Automated guided vehicle dispatching and routing integration via digital twin with deep reinforcement learning. J. Manuf. Syst. 2024, 72, 492–503. [Google Scholar] [CrossRef]
  170. Gankin, D.; Mayer, S.; Zinn, J.; Vogel-Heuser, B.; Endisch, C. Modular Production Control with Multi-Agent Deep Q-Learning. In Proceedings of the IEEE International Conference on Emerging Technologies and Factory Automation, ETFA, Vasteras, Sweden, 7–10 September 2021; pp. 1–8. [Google Scholar] [CrossRef]
  171. Stöckermann, P.; Immordino, A.; Altenmüller, T.; Seidel, G. Dispatching in Real Frontend Fabs With Industrial Grade Discrete-Event Simulations by Deep Reinforcement Learning with Evolution Strategies. In Proceedings of the 2023 Winter Simulation Conference (WSC), San Antonio, TX, USA, 10–13 December 2023; pp. 1–23. [Google Scholar] [CrossRef]
  172. Liu, R.; Piplani, R.; Toro, C. A deep multi-agent reinforcement learning approach to solve dynamic job shop scheduling problem. Comput. Oper. Res. 2023, 159, 106294. [Google Scholar] [CrossRef]
  173. Farag, H.; Gidlund, M.; Stefanovic, C. A Deep Reinforcement Learning Approach for Improving Age of Information in Mission-Critical IoT. In Proceedings of the 2021 IEEE Global Conference on Artificial Intelligence and Internet of Things, GCAIoT 2021, Dubai, United Arab Emirates, 12–16 December 2021; pp. 14–18. [Google Scholar] [CrossRef]
  174. Lee, S.; Cho, Y.; Lee, Y.H. Injection mold production sustainable scheduling using deep reinforcement learning. Sustainability 2020, 12, 8718. [Google Scholar] [CrossRef]
  175. Alitappeh, R.J.; Jeddisaravi, K. Multi-robot exploration in task allocation problem. Appl. Intell. 2022, 52, 2189–2211. [Google Scholar] [CrossRef]
  176. Ao, W.; Zhang, G.; Li, Y.; Jin, D. Learning to Solve Grouped 2D Bin Packing Problems in the Manufacturing Industry. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 6–10 August 2023; pp. 3713–3723. [Google Scholar] [CrossRef]
  177. Arishi, A.; Krishnan, K.; Arishi, M. Machine learning approach for truck-drones based last-mile delivery in the era of industry 4.0. Eng. Appl. Artif. Intell. 2022, 116, 105439. [Google Scholar] [CrossRef]
  178. Fang, J.; Rao, Y.; Luo, Q.; Xu, J. Solving One-Dimensional Cutting Stock Problems with the Deep Reinforcement Learning. Mathematics 2023, 11, 1028. [Google Scholar] [CrossRef]
  179. Liu, H.; Zhou, L.; Yang, J.; Zhao, J. The 3D bin packing problem for multiple boxes and irregular items based on deep Q-network. Appl. Intell. 2023, 53, 23398–23425. [Google Scholar] [CrossRef]
  180. Palombarini, J.; Martínez, E. SmartGantt—An intelligent system for real time rescheduling based on relational reinforcement learning. Expert Syst. Appl. 2012, 39, 10251–10268. [Google Scholar] [CrossRef]
  181. Palombarini, J.A.; Martínez, E.C. End-to-end on-line rescheduling from Gantt chart images using deep reinforcement learning. Int. J. Prod. Res. 2022, 60, 4434–4463. [Google Scholar] [CrossRef]
  182. Saroliya, U.; Arima, E.; Liu, D.; Schulz, M. Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach. In Proceedings of the IEEE International Conference on Cluster Computing, ICCC, Santa Fe, NM, USA, 31 October–3 November 2023; pp. 185–196. [Google Scholar] [CrossRef]
  183. Servadei, L.; Zheng, J.; Arjona-Medina, J.; Werner, M.; Esen, V.; Hochreiter, S.; Ecker, W.; Wille, R. Cost optimization at early stages of design using deep reinforcement learning. In Proceedings of the MLCAD 2020—2020 ACM/IEEE Workshop on Machine Learning for CAD, Reykjavik, Iceland, 16–20 November 2020; pp. 37–42. [Google Scholar] [CrossRef]
  184. Wang, X.; Ren, T.; Bai, D.; Chu, F.; Yu, Y.; Meng, F.; Wu, C.C. Scheduling a multi-agent flow shop with two scenarios and release dates. Int. J. Prod. Res. 2023, 62, 421–443. [Google Scholar] [CrossRef]
  185. Wang, J.; Xing, C.; Liu, J. Intelligent preamble allocation for coexistence of mMTC/URLLC devices: A hierarchical Q-learning based approach. China Commun. 2023, 20, 44–53. [Google Scholar] [CrossRef]
  186. Wang, Z.; Chen, Y.; Liu, C.; Lin, W.; Yang, L. Guided Reinforce Learning Through Spatial Residual Value for Online 3D Bin Packing. In Proceedings of the IECON 2023—49th Annual Conference of the IEEE Industrial Electronics Society, Singapore, 16–19 October 2023; pp. 1–5. [Google Scholar] [CrossRef]
  187. Wang, C.; Shen, X.; Wang, H.; Xie, W.; Zhang, H.; Mei, H. Multi-Agent Reinforcement Learning-Based Routing Protocol for Underwater Wireless Sensor Networks with Value of Information. IEEE Sens. J. 2024, 24, 7042–7054. [Google Scholar] [CrossRef]
  188. Wu, Y.; Song, W.; Cao, Z.; Zhang, J.; Lim, A. Learning Improvement Heuristics for Solving Routing Problems. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 5057–5069. [Google Scholar] [CrossRef]
  189. Xiong, H.; Ding, K.; Ding, W.; Peng, J.; Xu, J. Towards reliable robot packing system based on deep reinforcement learning. Adv. Eng. Inform. 2023, 57, 102028. [Google Scholar] [CrossRef]
  190. Yuan, J.; Zhang, J.; Cai, Z.; Yan, J. Towards Variance Reduction for Reinforcement Learning of Industrial Decision-making Tasks: A Bi-Critic based Demand-Constraint Decoupling Approach. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 6–10 August 2023; pp. 3162–3172. [Google Scholar] [CrossRef]
  191. Zhang, H.; Hao, J.; Li, X. A method for deploying distributed denial of service attack defense strategies on edge servers using reinforcement learning. IEEE Access 2020, 8, 78482–78491. [Google Scholar] [CrossRef]
  192. Zhao, F.; Jiang, T.; Wang, L. A Reinforcement Learning Driven Cooperative Meta-Heuristic Algorithm for Energy-Efficient Distributed No-Wait Flow-Shop Scheduling with Sequence-Dependent Setup Time. IEEE Trans. Ind. Inform. 2022, 19, 8427–8440. [Google Scholar] [CrossRef]
  193. Zheng, X.; Chen, Z. An improved deep Q-learning algorithm for a trade-off between energy consumption and productivity in batch scheduling. Comput. Ind. Eng. 2024, 188, 109925. [Google Scholar] [CrossRef]
  194. Zhang, J.; Liu, Y.; Qin, X.; Xu, X. Energy-Efficient Federated Learning Framework for Digital Twin-Enabled Industrial Internet of Things. In Proceedings of the IEEE International Symposium on Personal, Indoor and Mobile Radio Communications, PIMRC, Helsinki, Finland, 13–16 September 2021; pp. 1160–1166. [Google Scholar] [CrossRef]
  195. Yang, Z.; Yang, S.; Song, S.; Zhang, W.; Song, R.; Cheng, J.; Li, Y. PackerBot: Variable-Sized Product Packing with Heuristic Deep Reinforcement Learning. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Prague, Czech Republic, 27 September–1 October 2021; pp. 5002–5008. [Google Scholar] [CrossRef]
  196. Grumbach, F.; Badr, N.E.A.; Reusch, P.; Trojahn, S. A Memetic Algorithm with Reinforcement Learning for Sociotechnical Production Scheduling. IEEE Access 2022, 11, 68760–68775. [Google Scholar] [CrossRef]
  197. Kallestad, J.; Hasibi, R.; Hemmati, A.; Sörensen, K. A general deep reinforcement learning hyperheuristic framework for solving combinatorial optimization problems. Eur. J. Oper. Res. 2023, 309, 446–468. [Google Scholar] [CrossRef]
  198. Liu, C.; Zhu, H.; Tang, D.; Nie, Q.; Zhou, T.; Wang, L.; Song, Y. Probing an intelligent predictive maintenance approach with deep learning and augmented reality for machine tools in IoT-enabled manufacturing. Robot. Comput. Integr. Manuf. 2022, 77, 102357. [Google Scholar] [CrossRef]
  199. Ran, Y.; Zhou, X.; Hu, H.; Wen, Y. Optimizing Data Center Energy Efficiency via Event-Driven Deep Reinforcement Learning. IEEE Trans. Serv. Comput. 2023, 16, 1296–1309. [Google Scholar] [CrossRef]
  200. Shafiq, S.; Mayr-Dorn, C.; Mashkoor, A.; Egyed, A. Towards Optimal Assembly Line Order Sequencing with Reinforcement Learning: A Case Study. In Proceedings of the IEEE International Conference on Emerging Technologies and Factory Automation, ETFA, Vienna, Austria, 8–11 September 2020; pp. 982–989. [Google Scholar] [CrossRef]
  201. Tong, Z.; Liu, B.; Mei, J.; Wang, J.; Peng, X.; Li, K. Data Security Aware and Effective Task Offloading Strategy in Mobile Edge Computing. J. Grid Comput. 2023, 21, 41. [Google Scholar] [CrossRef]
  202. Wang, X.; Wang, J.; Liu, J. Vehicle to Grid Frequency Regulation Capacity Optimal Scheduling for Battery Swapping Station Using Deep Q-Network. IEEE Trans. Ind. Inform. 2021, 17, 1342–1351. [Google Scholar] [CrossRef]
  203. Zhang, Z.Q.; Wu, F.C.; Qian, B.; Hu, R.; Wang, L.; Jin, H.P. A Q-learning-based hyper-heuristic evolutionary algorithm for the distributed flexible job-shop scheduling problem with crane transportation. Expert Syst. Appl. 2023, 234, 121050. [Google Scholar] [CrossRef]
  204. Hou, J.; Chen, G.; Li, Z.; He, W.; Gu, S.; Knoll, A.; Jiang, C. Hybrid Residual Multiexpert Reinforcement Learning for Spatial Scheduling of High-Density Parking Lots. IEEE Trans. Cybern. 2024, 54, 2771–2783. [Google Scholar] [CrossRef]
  205. Wang, H.; Bai, Y.; Xie, X. Deep Reinforcement Learning Based Resource Allocation in Delay-Tolerance-Aware 5G Industrial IoT Systems. IEEE Trans. Commun. 2024, 72, 209–221. [Google Scholar] [CrossRef]
  206. Yeh, Y.H.; Chen, S.Y.H.; Chen, H.M.; Tu, D.Y.; Fang, G.Q.; Kuo, Y.C.; Chen, P.Y. DPRoute: Deep Learning Framework for Package Routing. In Proceedings of the 2023 28th Asia and South Pacific Design Automation Conference (ASP-DAC), Tokyo, Japan, 16–19 January 2023; pp. 277–282. [Google Scholar] [CrossRef]
  207. Perin, G.; Nophut, D.; Badia, L.; Fitzek, F.H. Maximizing Airtime Efficiency for Reliable Broadcast Streams in WMNs with Multi-Armed Bandits. In Proceedings of the 2020 11th IEEE Annual Ubiquitous Computing, Electronics and Mobile Communication Conference, UEMCON 2020, New York, NY, USA, 28–31 October 2020; pp. 472–478. [Google Scholar] [CrossRef]
  208. Yaakoubi, Y.; Dimitrakopoulos, R. Learning to schedule heuristics for the simultaneous stochastic optimization of mining complexes. Comput. Oper. Res. 2023, 159, 106349. [Google Scholar] [CrossRef]
  209. Lin, C.C.; Deng, D.J.; Chih, Y.L.; Chiu, H.T. Smart Manufacturing Scheduling with Edge Computing Using Multiclass Deep Q Network. IEEE Trans. Ind. Inform. 2019, 15, 4276–4284. [Google Scholar] [CrossRef]
  210. Paeng, B.; Park, I.B.; Park, J. Deep Reinforcement Learning for Minimizing Tardiness in Parallel Machine Scheduling with Sequence Dependent Family Setups. IEEE Access 2021, 9, 101390–101401. [Google Scholar] [CrossRef]
  211. Yang, F.; Tian, J.; Feng, T.; Xu, F.; Qiu, C.; Zhao, C. Blockchain-Enabled Parallel Learning in Industrial Edge-Cloud Network: A Fuzzy DPoSt-PBFT Approach. In Proceedings of the 2021 IEEE Globecom Workshops, GC Wkshps 2021, Madrid, Spain, 7–11 December 2021; pp. 1–6. [Google Scholar] [CrossRef]
  212. Zhao, Y.; Zhang, H. Application of machine learning and rule scheduling in a job-shop production control system. Int. J. Simul. Model. 2021, 20, 410–421. [Google Scholar] [CrossRef]
  213. Chen, J.; Yi, C.; Wang, R.; Zhu, K.; Cai, J. Learning Aided Joint Sensor Activation and Mobile Charging Vehicle Scheduling for Energy-Efficient WRSN-Based Industrial IoT. IEEE Trans. Veh. Technol. 2023, 72, 5064–5078. [Google Scholar] [CrossRef]
  214. Dai, B.; Ren, T.; Niu, J.; Hu, Z.; Hu, S.; Qiu, M. A Distributed Computation Offloading Scheduling Framework based on Deep Reinforcement Learning. In Proceedings of the 19th IEEE International Symposium on Parallel and Distributed Processing with Applications, 11th IEEE International Conference on Big Data and Cloud Computing, 14th IEEE International Conference on Social Computing and Networking and 11th IEEE Internation, New York, NY, USA, 30 September–3 October 2021; pp. 413–420. [Google Scholar] [CrossRef]
  215. Qu, S.; Wang, J.; Govil, S.; Leckie, J.O. Optimized Adaptive Scheduling of a Manufacturing Process System with Multi-skill Workforce and Multiple Machine Types: An Ontology-based, Multi-agent Reinforcement Learning Approach. Procedia CIRP 2016, 57, 55–60. [Google Scholar] [CrossRef]
  216. Antuori, V.; Hebrard, E.; Huguet, M.J.; Essodaigui, S.; Nguyen, A. Leveraging Reinforcement Learning, Constraint Programming and Local Search: A Case Study in Car Manufacturing. In Principles and Practice of Constraint Programming; Simonis, H., Ed.; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 657–672. [Google Scholar] [CrossRef]
  217. Johnson, D.; Chen, G.; Lu, Y. Multi-Agent Reinforcement Learning for Real-Time Dynamic Production Scheduling in a Robot Assembly Cell. IEEE Robot. Autom. Lett. 2022, 7, 7684–7691. [Google Scholar] [CrossRef]
  218. Rudolf, T.; Flögel, D.; Schürmann, T.; Süß, S.; Schwab, S.; Hohmann, S. ReACT: Reinforcement Learning for Controller Parametrization Using B-Spline Geometries. In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Honolulu, Oahu, HI, USA, 1–4 October 2023; pp. 3385–3391. [Google Scholar] [CrossRef]
  219. Sun, M.; Wang, X.; Liu, X.; Wu, S.; Zhou, X.; Ouyang, C.X. A Multi-agent Reinforcement Learning Routing Protocol in Mobile Robot Network. In Proceedings of the 2021 4th International Conference on Information Communication and Signal Processing, ICICSP 2021, Shanghai, China, 24–26 September 2021; pp. 469–475. [Google Scholar] [CrossRef]
  220. Sun, B.; Theile, M.; Qin, Z.; Bernardini, D.; Roy, D.; Bastoni, A.; Caccamo, M. Edge Generation Scheduling for DAG Tasks using Deep Reinforcement Learning. IEEE Trans. Comput. 2024, 73, 1034–1047. [Google Scholar] [CrossRef]
  221. Yang, S.; Song, S.; Chu, S.; Song, R.; Cheng, J.; Li, Y.; Zhang, W. Heuristics Integrated Deep Reinforcement Learning for Online 3D Bin Packing. IEEE Trans. Autom. Sci. Eng. 2024, 21, 939–950. [Google Scholar] [CrossRef]
  222. Zhang, J.; Shuai, T. Online Three-Dimensional Bin Packing: A DRL Algorithm with the Buffer Zone. Found. Comput. Decis. Sci. 2024, 49, 63–74. [Google Scholar] [CrossRef]
  223. Zhou, Y.; Yan, S.; Peng, M. Content placement with unknown popularity in fog radio access networks. In Proceedings of the IEEE International Conference on Industrial Internet Cloud, ICII 2019, Orlando, FL, USA, 11–12 November 2019; pp. 361–367. [Google Scholar] [CrossRef]
  224. Chen, S.; Jiang, C.; Li, J.; Xiang, J.; Xiao, W. Improved deep q-network for user-side battery energy storage charging and discharging strategy in industrial parks. Entropy 2021, 23, 1311. [Google Scholar] [CrossRef] [PubMed]
  225. Ding, L.; Lin, Z.; Yan, G. Multi-agent Deep Reinforcement Learning Algorithm for Distributed Economic Dispatch in Smart Grid. In Proceedings of the IECON 2020 The 46th Annual Conference of the IEEE Industrial Electronics Society, Singapore, 18–21 October 2020; pp. 3529–3534. [Google Scholar] [CrossRef]
  226. Li, J.; Zhou, C.; Liu, J.; Sheng, M.; Zhao, N.; Su, Y. Reinforcement Learning-Based Resource Allocation for Coverage Continuity in High Dynamic UAV Communication Networks. IEEE Trans. Wirel. Commun. 2024, 23, 848–860. [Google Scholar] [CrossRef]
  227. Tan, Y.; Shen, Y.; Yu, X.; Lu, X. Low-carbon economic dispatch of the combined heat and power-virtual power plants: A improved deep reinforcement learning-based approach. IET Renew. Power Gener. 2023, 17, 982–1007. [Google Scholar] [CrossRef]
  228. Van Den Bovenkamp, N.; Giraldo, J.S.; Salazar Duque, E.M.; Vergara, P.P.; Konstantinou, C.; Palensky, P. Optimal Energy Scheduling of Flexible Industrial Prosumers via Reinforcement Learning. In Proceedings of the 2023 IEEE Belgrade PowerTech, PowerTech 2023, Belgrade, Serbia, 25–29 June 2023; pp. 1–6. [Google Scholar] [CrossRef]
  229. Villaverde, B.C.; Rea, S.; Pesch, D. InRout—A QoS aware route selection algorithm for industrial wireless sensor networks. Ad Hoc Netw. 2012, 10, 458–478. [Google Scholar] [CrossRef]
  230. Xu, J.; Zhu, K.; Wang, R. RF aerially charging scheduling for UAV Fleet: A Q-learning approach. In Proceedings of the 2019 15th International Conference on Mobile Ad-Hoc and Sensor Networks, MSN 2019, Shenzhen, China, 11–13 December 2019; pp. 194–199. [Google Scholar] [CrossRef]
  231. Ludeke, R.; Heyns, P.S. Towards a Deep Reinforcement Learning based approach for real time decision making and resource allocation for Prognostics and Health Management applications. In Proceedings of the 2023 IEEE International Conference on Prognostics and Health Management, ICPHM 2023, Montreal, QC, Canada, 5–7 June 2023; pp. 20–29. [Google Scholar] [CrossRef]
  232. Huang, S.; Wang, Z.; Zhou, J.; Lu, J. Planning Irregular Object Packing via Hierarchical Reinforcement Learning. IEEE Robot. Autom. Lett. 2023, 8, 81–88. [Google Scholar] [CrossRef]
  233. Puche, A.V.; Lee, S. Online 3D Bin Packing Reinforcement Learning Solution with Buffer. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Kyoto, Japan, 23–27 October 2022; pp. 8902–8909. [Google Scholar] [CrossRef]
  234. Wu, Y.; Yao, L. Research on the Problem of 3D Bin Packing under Incomplete Information Based on Deep Reinforcement Learning. In Proceedings of the 2021 International Conference on E-Commerce and E-Management, ICECEM 2021, Dalian, China, 24–26 September 2021; pp. 38–42. [Google Scholar] [CrossRef]
  235. Chen, G.; Chen, Y.; Du, J.; Du, L.; Mai, Z.; Hao, C. A Hybrid DRL-Based Adaptive Traffic Matching Strategy for Transmitting and Computing in MEC-Enabled IIoT. IEEE Commun. Lett. 2024, 28, 238–242. [Google Scholar] [CrossRef]
  236. Ho, T.M.; Nguyen, K.K.; Cheriet, M. Game Theoretic Reinforcement Learning Framework For Industrial Internet of Things. In Proceedings of the IEEE Wireless Communications and Networking Conference, WCNC, Austin, TX, USA, 10–13 April 2022; pp. 2112–2117. [Google Scholar] [CrossRef]
  237. Li, J.; Wang, R.; Wang, K. Service Function Chaining in Industrial Internet of Things With Edge Intelligence: A Natural Actor-Critic Approach. IEEE Trans. Ind. Inform. 2023, 19, 491–502. [Google Scholar] [CrossRef]
  238. Ong, K.S.H.; Wang, W.; Niyato, D.; Friedrichs, T. Deep-Reinforcement-Learning-Based Predictive Maintenance Model for Effective Resource Management in Industrial IoT. IEEE Internet Things J. 2022, 9, 5173–5188. [Google Scholar] [CrossRef]
  239. Akbari, M.; Abedi, M.R.; Joda, R.; Pourghasemian, M.; Mokari, N.; Erol-Kantarci, M. Age of Information Aware VNF Scheduling in Industrial IoT Using Deep Reinforcement Learning. IEEE J. Sel. Areas Commun. 2021, 39, 2487–2500. [Google Scholar] [CrossRef]
  240. Bao, Q.; Zheng, P.; Dai, S. A digital twin-driven dynamic path planning approach for multiple automatic guided vehicles based on deep reinforcement learning. Proc. Inst. Mech. Eng. Part B J. Eng. Manuf. 2024, 238, 488–499. [Google Scholar] [CrossRef]
  241. Gowri, A.S.; Shanthi Bala, P. An agent based resource provision for IoT through machine learning in Fog computing. In Proceedings of the 2019 IEEE International Conference on System, Computation, Automation and Networking, ICSCAN 2019, Pondicherry, India, 29–30 March 2019; pp. 12–16. [Google Scholar] [CrossRef]
  242. Li, B.; Zhang, R.; Tian, X.; Zhu, Z. Multi-Agent and Cooperative Deep Reinforcement Learning for Scalable Network Automation in Multi-Domain SD-EONs. IEEE Trans. Netw. Serv. Manag. 2021, 18, 4801–4813. [Google Scholar] [CrossRef]
  243. Lin, L.; Zhou, W.; Yang, Z.; Liu, J. Deep reinforcement learning-based task scheduling and resource allocation for NOMA-MEC in Industrial Internet of Things. Peer-to-Peer Netw. Appl. 2023, 16, 170–188. [Google Scholar] [CrossRef]
  244. Liu, M.; Teng, Y.; Yu, F.R.; Leung, V.C.; Song, M. A Deep Reinforcement Learning-Based Transcoder Selection Framework for Blockchain-Enabled Wireless D2D Transcoding. IEEE Trans. Commun. 2020, 68, 3426–3439. [Google Scholar] [CrossRef]
  245. Liu, C.L.; Chang, C.C.; Tseng, C.J. Actor-critic deep reinforcement learning for solving job shop scheduling problems. IEEE Access 2020, 8, 71752–71762. [Google Scholar] [CrossRef]
  246. Liu, X.; Wang, G.; Chen, K. Option-Based Multi-Agent Reinforcement Learning for Painting With Multiple Large-Sized Robots. IEEE Trans. Intell. Transp. Syst. 2022, 23, 15707–15715. [Google Scholar] [CrossRef]
  247. Liu, P.; Wu, Z.; Shan, H.; Lin, F.; Wang, Q.; Wang, Q. Task offloading optimization for AGVs with fixed routes in industrial IoT environment. China Commun. 2023, 20, 302–314. [Google Scholar] [CrossRef]
  248. Lu, R.; Li, Y.C.; Li, Y.; Jiang, J.; Ding, Y. Multi-agent deep reinforcement learning based demand response for discrete manufacturing systems energy management. Appl. Energy 2020, 276, 115473. [Google Scholar] [CrossRef]
  249. Peng, Z.; Lin, J. A multi-objective trade-off framework for cloud resource scheduling based on the Deep Q-network algorithm. Clust. Comput. 2020, 23, 2753–2767. [Google Scholar] [CrossRef]
  250. Thanh, P.D.; Hoan, T.N.K.; Giang, H.T.H.; Koo, I. Packet Delivery Maximization Using Deep Reinforcement Learning-Based Transmission Scheduling for Industrial Cognitive Radio Systems. IEEE Access 2021, 9, 146492–146508. [Google Scholar] [CrossRef]
  251. Wang, S.; Yuen, C.; Ni, W.; Guan, Y.L.; Lv, T. Multiagent Deep Reinforcement Learning for Cost- and Delay-Sensitive Virtual Network Function Placement and Routing. IEEE Trans. Commun. 2022, 70, 5208–5224. [Google Scholar] [CrossRef]
  252. Budak, A.F.; Bhansali, P.; Liu, B.; Sun, N.; Pan, D.Z.; Kashyap, C.V. DNN-Opt: An RL Inspired Optimization for Analog Circuit Sizing using Deep Neural Networks. In Proceedings of the Design Automation Conference, San Francisco, CA, USA, 5–9 December 2021; pp. 1219–1224. [Google Scholar] [CrossRef]
  253. Cao, Z.; Lin, C.; Zhou, M.; Huang, R. Scheduling Semiconductor Testing Facility by Using Cuckoo Search Algorithm with Reinforcement Learning and Surrogate Modeling. IEEE Trans. Autom. Sci. Eng. 2019, 16, 825–837. [Google Scholar] [CrossRef]
  254. Chalumeau, F.; Coulon, I.; Cappart, Q.; Rousseau, L.M. SeaPearl: A Constraint Programming Solver Guided by Reinforcement Learning. In Integration of Constraint Programming, Artificial Intelligence, and Operations Research; Stuckey, P.J., Ed.; Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 392–409. [Google Scholar]
  255. Lin, C.R.; Cao, Z.C.; Zhou, M.C. Learning-Based Grey Wolf Optimizer for Stochastic Flexible Job Shop Scheduling. IEEE Trans. Autom. Sci. Eng. 2022, 19, 3659–3671. [Google Scholar] [CrossRef]
  256. Lin, C.R.; Cao, Z.C.; Zhou, M.C. Learning-Based Cuckoo Search Algorithm to Schedule a Flexible Job Shop With Sequencing Flexibility. IEEE Trans. Cybern. 2023, 53, 6663–6675. [Google Scholar] [CrossRef] [PubMed]
  257. Tang, H.; Xiao, Y.; Zhang, W.; Lei, D.; Wang, J.; Xu, T. A DQL-NSGA-III algorithm for solving the flexible job shop dynamic scheduling problem. Expert Syst. Appl. 2024, 237, 121723. [Google Scholar] [CrossRef]
  258. Wang, T.; Zhao, J.; Xu, Q.; Pedrycz, W.; Wang, W. A Dynamic Scheduling Framework for Byproduct Gas System Combining Expert Knowledge and Production Plan. IEEE Trans. Autom. Sci. Eng. 2023, 20, 541–552. [Google Scholar] [CrossRef]
  259. Wang, X.; Yao, H.; Mai, T.; Guo, S.; Liu, Y. Reinforcement Learning-Based Particle Swarm Optimization for End-to-End Traffic Scheduling in TSN-5G Networks. IEEE/ACM Trans. Netw. 2023, 31, 3254–3268. [Google Scholar] [CrossRef]
  260. Zhao, F.; Zhou, G.; Xu, T.; Zhu, N.; Jonrinaldi. A knowledge-driven cooperative scatter search algorithm with reinforcement learning for the distributed blocking flow shop scheduling problem. Expert Syst. Appl. 2023, 230, 120571. [Google Scholar] [CrossRef]
  261. Ma, N.; Wang, Z.; Ba, Z.; Li, X.; Yang, N.; Yang, X.; Zhang, H. Hierarchical Reinforcement Learning for Crude Oil Supply Chain Scheduling. Algorithms 2023, 16, 354. [Google Scholar] [CrossRef]
  262. Goh, S.L.; Kendall, G.; Sabar, N.R. Simulated annealing with improved reheating and learning for the post enrolment course timetabling problem. J. Oper. Res. Soc. 2019, 70, 873–888. [Google Scholar] [CrossRef]
  263. Fairee, S.; Khompatraporn, C.; Prom-on, S.; Sirinaovakul, B. Combinatorial Artificial Bee Colony Optimization with Reinforcement Learning Updating for Travelling Salesman Problem. In Proceedings of the 2019 16th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), Pattaya, Thailand, 10–13 July 2019; pp. 93–96. [Google Scholar] [CrossRef]
  264. Durst, P.; Jia, X.; Li, L. Multi-Objective Optimization of AGV Real-Time Scheduling Based on Deep Reinforcement Learning. In Proceedings of the 42nd Chinese Control Conference, Tianjin, China, 24–26 July 2023; pp. 5535–5540. [Google Scholar] [CrossRef]
  265. Wang, L.; Yang, C.; Wang, X.; Li, J.; Wang, Y.; Wang, Y. Integrated Resource Scheduling for User Experience Enhancement: A Heuristically Accelerated DRL. In Proceedings of the 2019 11th International Conference on Wireless Communications and Signal Processing, WCSP 2019, Xi’an, China, 23–25 October 2019; pp. 1–6. [Google Scholar] [CrossRef]
  266. Carvalho, J.P.; Dimitrakopoulos, R. Integrating short-term stochastic production planning updating with mining fleet management in industrial mining complexes: An actor-critic reinforcement learning approach. Appl. Intell. 2023, 53, 23179–23202. [Google Scholar] [CrossRef]
  267. Soman, R.K.; Molina-Solana, M. Automating look-ahead schedule generation for construction using linked-data based constraint checking and reinforcement learning. Autom. Constr. 2022, 134, 104069. [Google Scholar] [CrossRef]
  268. Wei, W.; Fu, L.; Gu, H.; Zhang, Y.; Zou, T.; Wang, C.; Wang, N. GRL-PS: Graph embedding-based DRL approach for adaptive path selection. IEEE Trans. Netw. Serv. Manag. 2023, 20, 2639–2651. [Google Scholar] [CrossRef]
  269. Zhang, P.; Wang, C.; Kumar, N.; Liu, L. Space-Air-Ground Integrated Multi-Domain Network Resource Orchestration Based on Virtual Network Architecture: A DRL Method. IEEE Trans. Intell. Transp. Syst. 2022, 23, 2798–2808. [Google Scholar] [CrossRef]
  270. Zhang, Z.Q.; Qian, B.; Hu, R.; Yang, J.B. Q-learning-based hyper-heuristic evolutionary algorithm for the distributed assembly blocking flowshop scheduling problem. Appl. Soft Comput. 2023, 146, 110695. [Google Scholar] [CrossRef]
  271. Chen, R.; Li, W.; Yang, H. A Deep Reinforcement Learning Framework Based on an Attention Mechanism and Disjunctive Graph Embedding for the Job-Shop Scheduling Problem. IEEE Trans. Ind. Inform. 2023, 19, 1322–1331. [Google Scholar] [CrossRef]
  272. Elsayed, E.K.; Elsayed, A.K.; Eldahshan, K.A. Deep Reinforcement Learning-Based Job Shop Scheduling of Smart Manufacturing. Comput. Mater. Contin. 2022, 73, 5103–5120. [Google Scholar] [CrossRef]
  273. Farahani, A.; Elzakker, M.V.; Genga, L.; Troubil, P.; Dijkman, R. Relational Graph Attention-Based Deep Reinforcement Learning: An Application to Flexible Job Shop Scheduling with Sequence-Dependent Setup Times. In Proceedings of the 17th International Conference, LION 17, Nice, France, 4–8 June 2023; pp. 150–164. [Google Scholar] [CrossRef]
  274. Gan, X.M.; Zuo, Y.; Zhang, A.S.; Li, S.B.; Tao, F. Digital twin-enabled adaptive scheduling strategy based on deep reinforcement learning. Sci. China Technol. Sci. 2023, 66, 1937–1951. [Google Scholar] [CrossRef]
  275. Huang, J.P.; Gao, L.; Li, X.Y. An end-to-end deep reinforcement learning method based on graph neural network for distributed job-shop scheduling problem. Expert Syst. Appl. 2024, 238, 121756. [Google Scholar] [CrossRef]
  276. Lee, J.H.; Kim, H.J. Imitation Learning for Real-Time Job Shop Scheduling Using Graph-Based Representation. In Proceedings of the 2022 Winter Simulation Conference, Singapore, 11–14 December 2022; pp. 3285–3296. [Google Scholar] [CrossRef]
  277. Liu, C.L.; Huang, T.H. Dynamic Job-Shop Scheduling Problems Using Graph Neural Network and Deep Reinforcement Learning. IEEE Trans. Syst. Man Cybern. Syst. 2023, 53, 6836–6848. [Google Scholar] [CrossRef]
  278. Zhao, X.; Song, W.; Li, Q.; Shi, H.; Kang, Z.; Zhang, C. A Deep Reinforcement Learning Approach for Resource-Constrained Project Scheduling. In Proceedings of the 2022 IEEE Symposium Series on Computational Intelligence, SSCI 2022, Singapore, 4–7 December 2022; pp. 1226–1234. [Google Scholar] [CrossRef]
  279. Chilukuri, S.; Pesch, D. RECCE: Deep Reinforcement Learning for Joint Routing and Scheduling in Time-Constrained Wireless Networks. IEEE Access 2021, 9, 132053–132063. [Google Scholar] [CrossRef]
  280. Vijayalakshmi, V.; Saravanan, M. Reinforcement learning-based multi-objective energy-efficient task scheduling in fog-cloud industrial IoT-based systems. Soft Comput. 2023, 27, 17473–17491. [Google Scholar] [CrossRef]
  281. Wang, C.; Shen, X.; Wang, H.; Xie, W.; Mei, H.; Zhang, H. Q Learning-Based Routing Protocol with Accelerating Convergence for Underwater Wireless Sensor Networks. IEEE Sens. J. 2024, 24, 11562–11573. [Google Scholar] [CrossRef]
  282. Yan, Z.; Du, H.; Zhang, J.; Li, G. Cherrypick: Solving the Steiner Tree Problem in Graphs using Deep Reinforcement Learning. In Proceedings of the 16th IEEE Conference on Industrial Electronics and Applications, ICIEA 2021, Chengdu, China, 1–4 August 2021; pp. 35–40. [Google Scholar] [CrossRef]
  283. Yuan, Y.; Li, H.; Ji, L. Application of Deep Reinforcement Learning Algorithm in Uncertain Logistics Transportation Scheduling. Comput. Intell. Neurosci. 2021, 2021, 5672227. [Google Scholar] [CrossRef] [PubMed]
  284. Zhong, C.; Jia, H.; Wan, H.; Zhao, X. DRLS: A Deep Reinforcement Learning Based Scheduler for Time-Triggered Ethernet. In Proceedings of the International Conference on Computer Communications and Networks, ICCCN, Athens, Greece, 19–22 July 2021; pp. 1–11. [Google Scholar] [CrossRef]
  285. Chen, H.; Hsu, K.C.; Turner, W.J.; Wei, P.H.; Zhu, K.; Pan, D.Z.; Ren, H. Reinforcement Learning Guided Detailed Routing for Custom Circuits. Proc. Int. Symp. Phys. Des. 2023, 1, 26–34. [Google Scholar] [CrossRef]
  286. Da Costa, P.; Zhang, Y.; Akcay, A.; Kaymak, U. Learning 2-opt Local Search from Heuristics as Expert Demonstrations. In Proceedings of the International Joint Conference on Neural Networks, Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar] [CrossRef]
  287. He, X.; Zhuge, X.; Dang, F.; Xu, W.; Yang, Z. DeepScheduler: Enabling Flow-Aware Scheduling in Time-Sensitive Networking. In Proceedings of the IEEE INFOCOM 2023—IEEE Conference on Computer Communications, New York, NY, USA, 17–20 May 2023; pp. 1–10. [Google Scholar] [CrossRef]
  288. Wu, Y.; Zhou, J.; Xia, Y.; Zhang, X.; Cao, Z.; Zhang, J. Neural Airport Ground Handling. IEEE Trans. Intell. Transp. Syst. 2023, 24, 15652–15666. [Google Scholar] [CrossRef]
  289. Baek, J.; Kaddoum, G. Online Partial Offloading and Task Scheduling in SDN-Fog Networks with Deep Recurrent Reinforcement Learning. IEEE Internet Things J. 2022, 9, 11578–11589. [Google Scholar] [CrossRef]
  290. Elsayed, M.; Erol-Kantarci, M. Deep Reinforcement Learning for Reducing Latency in Mission Critical Services. In Proceedings of the 2018 IEEE Global Communications Conference, GLOBECOM 2018, Abu Dhabi, United Arab Emirates, 9–13 December 2018. [Google Scholar] [CrossRef]
  291. Servadei, L.; Lee, J.H.; Medina, J.A.; Werner, M.; Hochreiter, S.; Ecker, W.; Wille, R. Deep Reinforcement Learning for Optimization at Early Design Stages. IEEE Des. Test 2023, 40, 43–51. [Google Scholar] [CrossRef]
  292. Solozabal, R.; Ceberio, J.; Sanchoyerto, A.; Zabala, L.; Blanco, B.; Liberal, F. Virtual Network Function Placement Optimization with Deep Reinforcement Learning. IEEE J. Sel. Areas Commun. 2020, 38, 292–303. [Google Scholar] [CrossRef]
  293. Zou, Y.; Wu, H.; Yin, Y.; Dhamotharan, L.; Chen, D.; Tiwari, A.K. An improved transformer model with multi-head attention and attention to attention for low-carbon multi-depot vehicle routing problem. Ann. Oper. Res. 2024, 339, 517–536. [Google Scholar] [CrossRef]
  294. Ahmed, B.S.; Enoiu, E.; Afzal, W.; Zamli, K.Z. An evaluation of Monte Carlo-based hyper-heuristic for interaction testing of industrial embedded software applications. Soft Comput. 2020, 24, 13929–13954. [Google Scholar] [CrossRef]
  295. Li, Y.; Fadda, E.; Manerba, D.; Tadei, R.; Terzo, O. Reinforcement Learning Algorithms for Online Single-Machine Scheduling. In Proceedings of the 2020 Federated Conference on Computer Science and Information Systems, FedCSIS 2020, Sofia, Bulgaria, 6–9 September 2020; Volume 21, pp. 277–283. [Google Scholar] [CrossRef]
  296. Ma, X.; Xu, H.; Gao, H.; Bian, M.; Hussain, W. Real-Time Virtual Machine Scheduling in Industry IoT Network: A Reinforcement Learning Method. IEEE Trans. Ind. Inform. 2023, 19, 2129–2139. [Google Scholar] [CrossRef]
  297. Wang, Y.; Wang, K.; Huang, H.; Miyazaki, T.; Guo, S. Traffic and Computation Co-Offloading with Reinforcement Learning in Fog Computing for Industrial Applications. IEEE Trans. Ind. Inform. 2019, 15, 976–986. [Google Scholar] [CrossRef]
  298. Xu, Z.; Han, G.; Liu, L.; Martinez-Garcia, M.; Wang, Z. Multi-energy scheduling of an industrial integrated energy system by reinforcement learning-based differential evolution. IEEE Trans. Green Commun. Netw. 2021, 5, 1077–1090. [Google Scholar] [CrossRef]
  299. Jiang, T.; Zeng, B.; Wang, Y.; Yan, W. A New Heuristic Reinforcement Learning for Container Relocation Problem. J. Phys. Conf. Ser. 2021, 1873, 012050. [Google Scholar] [CrossRef]
  300. De Mars, P.; O’Sullivan, A. Applying reinforcement learning and tree search to the unit commitment problem. Appl. Energy 2021, 302, 117519. [Google Scholar] [CrossRef]
  301. Revadekar, A.; Soni, R.; Nimkar, A.V. QORAl: Q Learning based Delivery Optimization for Pharmacies. In Proceedings of the 2020 11th International Conference on Computing, Communication and Networking Technologies, ICCCNT 2020, Kharagpur, India, 1–3 July 2020. [Google Scholar] [CrossRef]
  302. Kuhnle, A.; Kaiser, J.P.; Theiß, F.; Stricker, N.; Lanza, G. Designing an adaptive production control system using reinforcement learning. J. Intell. Manuf. 2021, 32, 855–876. [Google Scholar] [CrossRef]
  303. Arredondo, F.; Martinez, E. Learning and adaptation of a policy for dynamic order acceptance in make-to-order manufacturing. Comput. Ind. Eng. 2010, 58, 70–83. [Google Scholar] [CrossRef]
  304. Guan, W.; Zhang, H.; Leung, V.C. Customized Slicing for 6G: Enforcing Artificial Intelligence on Resource Management. IEEE Netw. 2021, 35, 264–271. [Google Scholar] [CrossRef]
  305. Kan, H.; Shuai, L.; Chen, H.; Zhang, W. Automated Guided Logistics Handling Vehicle Path Routing under Multi-Task Scenarios. In Proceedings of the 2020 IEEE International Conference on Mechatronics and Automation, ICMA 2020, Beijing, China, 13–16 October 2020; pp. 1173–1177. [Google Scholar] [CrossRef]
  306. Ghaleb, M.; Namoura, H.A.; Taghipour, S. Reinforcement Learning-based Real-time Scheduling under Random Machine Breakdowns and Other Disturbances: A Case Study. In Proceedings of the Annual Reliability and Maintainability Symposium, Orlando, FL, USA, 24–27 May 2021; pp. 1–8. [Google Scholar] [CrossRef]
  307. Hu, H.; Jia, X.; He, Q.; Fu, S.; Liu, K. Deep reinforcement learning based AGVs real-time scheduling with mixed rule for flexible shop floor in industry 4.0. Comput. Ind. Eng. 2020, 149, 106749. [Google Scholar] [CrossRef]
  308. Luo, S.; Zhang, L.; Fan, Y. Real-Time Scheduling for Dynamic Partial-No-Wait Multiobjective Flexible Job Shop by Deep Reinforcement Learning. IEEE Trans. Autom. Sci. Eng. 2022, 19, 3020–3038. [Google Scholar] [CrossRef]
  309. Wu, J.; Zhang, G.; Nie, J.; Peng, Y.; Zhang, Y. Deep Reinforcement Learning for Scheduling in an Edge Computing-Based Industrial Internet of Things. In Wireless Communications and Mobile Computing; John Wiley and Sons Ltd.: Hoboken, NJ, USA, 2021. [Google Scholar]
  310. Song, Q.; Lei, S.; Sun, W.; Zhang, Y. Adaptive federated learning for digital twin driven industrial internet of things. In Proceedings of the IEEE Wireless Communications and Networking Conference, WCNC, Nanjing, China, 29 March–1 April 2021. [Google Scholar] [CrossRef]
  311. Li, Y.; Liao, C.; Wang, L.; Xiao, Y.; Cao, Y.; Guo, S. A Reinforcement Learning-Artificial Bee Colony algorithm for Flexible Job-shop Scheduling Problem with Lot Streaming. Appl. Soft Comput. 2023, 146, 110658. [Google Scholar] [CrossRef]
  312. Naghibi-Sistani, M.B.; Akbarzadeh-Tootoonchi, M.R.; Javidi-Dashte Bayaz, M.H.; Rajabi-Mashhadi, H. Application of Q-learning with temperature variation for bidding strategies in market based power systems. Energy Convers. Manag. 2006, 47, 1529–1538. [Google Scholar] [CrossRef]
  313. Kunzel, G.; Indrusiak, L.S.; Pereira, C.E. Latency and Lifetime Enhancements in Industrial Wireless Sensor Networks: A Q-Learning Approach for Graph Routing. IEEE Trans. Ind. Inform. 2020, 16, 5617–5625. [Google Scholar] [CrossRef]
  314. Lu, H.; Zhang, X.; Yang, S. A Learning-based Iterative Method for Solving Vehicle Routing Problems. In Proceedings of the International Conference on Learning Representations, Virtual, 26 April–1 May 2020. [Google Scholar]
  315. Nain, Z.; Musaddiq, A.; Qadri, Y.A.; Nauman, A.; Afzal, M.K.; Kim, S.W. RIATA: A Reinforcement Learning-Based Intelligent Routing Update Scheme for Future Generation IoT Networks. IEEE Access 2021, 9, 81161–81172. [Google Scholar] [CrossRef]
  316. Zheng, K.; Luo, R.; Liu, X.; Qiu, J.; Liu, J. Distributed DDPG-Based Resource Allocation for Age of Information Minimization in Mobile Wireless-Powered Internet of Things. IEEE Internet Things J. 2024, 11, 29102–29115. [Google Scholar] [CrossRef]
  317. Liu, X.; Xu, J.; Zheng, K.; Zhang, G.; Liu, J.; Shiratori, N. Throughput Maximization with an AoI Constraint in Energy Harvesting D2D-enabled Cellular Networks: An MSRA-TD3 Approach. IEEE Trans. Wirel. Commun. 2024; early access. [Google Scholar] [CrossRef]
Figure 1. Reinforcement learning loop representation, based on [1].
Figure 2. PRISMA flow diagram of query results breakdown, made using [15].
Figure 3. Distribution by publication date. The earliest paper is from 2002.
Figure 4. Algorithm type distribution.
Table 1. Selected literature reviews and surveys from the query used in this paper.
Reference | Type | Methods | Area | Objective
[2] | Survey | Soft computing | Wireless sensor networks | Approach overview
[3] | Review | ML | Smart energy and electric power systems | Approach overview
[4] | Review | DRL and evolutionary | Job shop scheduling | Overview
[5] | Review | ML | 5G wireless communications | Potential solutions for area
[6] | Vision article | RL and digital twins | Maintenance | Approach overview
[7] | Review | RL | Multiple topics | MDP, RL algorithms and theory
[8] | Survey | RL | Software-defined network routing | Identifying and analyzing recent studies
[9] | Survey | RL and DRL | IoT communication and networking | Application analysis
[10] | Survey | DRL | Traffic engineering, routing and congestion | Application and approach overview
[11] | Review | RL | Production planning and control | Characteristics, algorithms and tools
[12] | Review | DRL | Intelligent manufacture | DRL applicability versus alternatives
[13] | Review | RL and DRL | Maintenance planning and optimization | Application taxonomy
Table 2. Common state representation features.
Group | Features | References
Resource features | Utilization and efficiency | [16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42]
Resource features | Queues | [40,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68]
Resource features | Load | [17,25,63,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101]
Resource features | State and status | [17,48,60,68,69,89,90,93,96,100,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131]
Entity features | Completed | [16,27,35,41,94,96,113,131,132,133,134,135,136]
Entity features | Demand | [18,32,58,79,81,94,137,138,139,140,141,142]
Entity features | State and status | [26,27,59,73,86,93,96,110,117,120,124,132,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160]
Entity features | Location | [26,61,69,75,81,83,93,119,140,141,156,161,162,163,164,165,166,167,168,169]
Time-related | Lateness and due dates | [18,27,52,66,75,134,168,170,171]
Time-related | Earliness and slack time | [18,29,33,34,41,57,59,134,171,172]
Time-related | Idle and waiting time | [16,27,39,40,60,78,93,123,132,173,174]
Solution | Full | [79,81,153,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193]
Solution | Objectives and other metrics | [22,30,35,41,47,49,59,76,80,170,172,184,192,194,195,196,197,198,199,200,201,202,203]
Solution | Infeasibility | [29,33,110,140,204]
Solution | Simulation and prediction | [69,79,92,96,129,172,205,206]
Solution | Past info | [20,25,64,73,79,81,90,197,206,207,208]
Static parameters | Processing times | [22,34,41,49,57,69,78,93,119,131,135,146,160,192,193,209,210,211,212]
Static parameters | Arrival times or rates | [32,51,65,66,75,98,119,165,166,167,210,212,213,214]
Static parameters | Capacity and demand | [24,33,39,69,81,85,121,134,155,176,177,186,215]
Static parameters | Details | [18,27,30,38,52,60,64,72,73,74,81,85,93,103,119,123,136,140,155,160,174,176,177,182,189,201,206,207,213,214,216,217,218,219,220,221,222,223]
Problem-specific | Energy | [51,88,98,100,101,105,132,164,168,198,201,203,213,214,224,225,226,227,228,229,230]
Problem-specific | Maintenance | [27,45,48,60,89,116,155,231]
Problem-specific | Bin packing | [176,179,186,190,195,221,222,232,233,234]
Problem-specific | Networks | [51,63,72,76,159,173,186,205,226,231,235,236,237,238]
Multi-agent | | [20,21,39,49,54,55,56,62,67,69,78,82,87,89,100,107,109,116,119,131,138,141,167,172,184,206,207,215,217,225,231,239,240,241,242,243,244,245,246,247,248,249,250,251]
Hybrid strategies | | [163,203,252,253,254,255,256,257,258,259,260]
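The feature groups in Table 2 are typically concatenated into a fixed-length numerical state vector before being passed to the agent. The snippet below is a minimal sketch of how resource, entity and time-related features could be assembled for a small scheduling environment; the data layout, field names and normalization choices are illustrative assumptions and are not taken from any of the cited works.

```python
import numpy as np

def build_state(machines, jobs, current_time, horizon):
    """Assemble a fixed-length state vector from common feature groups.

    Illustrative data layout (an assumption, not a standard interface):
    - machines: list of dicts with 'busy_time' and 'queue' (list of job ids)
    - jobs: list of dicts with 'due_date', 'remaining_ops' and 'total_ops'
    """
    # Resource features: utilization and queue length per machine (scaled to roughly [0, 1]).
    utilization = [m["busy_time"] / max(current_time, 1.0) for m in machines]
    queue_len = [len(m["queue"]) / max(len(jobs), 1) for m in machines]

    # Time-related features: slack until the due date per job, clipped and scaled.
    slack = [float(np.clip((j["due_date"] - current_time) / horizon, -1.0, 1.0)) for j in jobs]

    # Entity features: fraction of operations already completed per job.
    progress = [1.0 - j["remaining_ops"] / max(j["total_ops"], 1) for j in jobs]

    return np.asarray(utilization + queue_len + slack + progress, dtype=np.float32)
```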
Table 3. Categorization of state representation formats with representative examples.
Format | Content | References | Insights
Entity lists | Averages | [16,18,27,34,35,37,42,78,203,264] | Compact and simple, loses individual details
Entity lists | Per resource | [20,35,37,74,95,101,108,122,127,169,205,226,247,265] | Granular details but less scalable
Spatial representations | Matrices | [22,70,71,76,112,120,152,166,167,172,193,194,195,208,210,212,221,228,232,233,234,266,267] | For structured environments and spatial reasoning
Spatial representations | Heightmaps | [195,221,222,232,233,234] | Capture 3D variations in a 2D representation
Spatial representations | Convolutional approaches | [22,61,173,181,198,205,232,234,266] | Automatic feature extraction
Graph solutions | Undirected and directed | [19,21,142,219,220,228,246,254,268,269,270] | Symmetric and asymmetric relational dependencies
Graph solutions | Disjunctive graphs | [49,78,80,117,130,271,272,273,274,275,276,277,278] | Complex relationships and multi-entity interactions
Graph solutions | Graph node features | [81,219,229,246,261,268,278,279,280,281,282,283] | Per-entity attributes
Graph solutions | Graph edge features | [19,21,113,136,150,220,229,239,273,284] | Captures relationships and dependencies between entities
Graph solutions | Graph NN | [21,70,78,80,130,142,188,219,246,254,261,268,269,270,271,274,275,277,278,279,283,285,286,287,288] | Process graph states, improving generalization at computational cost
Variable-sized | | [141,179,181,220] | Flexible, adapts to dynamic state spaces
Recurrent neural networks | | [20,47,65,104,115,136,156,178,184,198,238,243,271,288,289,290,291,292,293] | Capture temporal dependencies in sequential decisions
Fuzzy | | [83,246,258] | Model uncertainty, useful for imprecise states
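Among the spatial representations in Table 3, heightmaps are a compact way to describe a partially filled bin in online 3D packing: a 2D grid stores the current stack height at each floor cell. The class below is a hypothetical sketch of how such a state could be maintained and exposed as a normalized observation (e.g., for a convolutional policy network); it does not reproduce the encoding of any specific cited paper.

```python
import numpy as np

class BinHeightmapState:
    """Heightmap state for online 3D bin packing (illustrative sketch)."""

    def __init__(self, length, width):
        # One stack-height value per floor cell of the bin.
        self.heightmap = np.zeros((length, width), dtype=np.int32)

    def place(self, x, y, box_l, box_w, box_h):
        """Place a box with its corner at (x, y); returns the new top height."""
        footprint = self.heightmap[x:x + box_l, y:y + box_w]
        base = footprint.max()        # the box rests on the highest cell beneath it
        footprint[:] = base + box_h   # update the heightmap in place
        return base + box_h

    def as_observation(self, max_height):
        # Normalized 2D observation suitable for a convolutional policy.
        return (self.heightmap / max_height).astype(np.float32)
```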
Table 4. Popular agent action categories.
Group | Features | References
Q-value estimation | | [20,23,30,45,73,78,79,83,84,91,107,110,138,149,150,153,156,160,164,165,178,200,206,207,208,211,263,280,290,294,295,296,297,298]
List selection | Resources | [20,38,49,64,74,81,86,91,105,111,112,121,122,124,126,131,149,150,153,157,160,164,168,173,204,230,237,241,242,288,296]
List selection | Entity | [31,32,47,49,50,52,54,57,59,64,68,73,79,80,83,84,87,94,96,108,109,115,130,131,132,134,135,139,143,148,154,163,172,176,177,179,198,199,210,213,215,220,232,271,273,274,277,278,280,286,287,291]
List selection | Locations, positions and nodes | [70,71,75,76,81,97,102,128,136,137,142,152,158,166,175,186,187,188,190,191,195,219,221,222,229,239,242,251,268,269,276,281,282,284,285,299,300,301]
List selection | Network-specific | [20,23,24,38,63,82,91,103,104,121,122,123,127,147,149,153,173,194,201,211,241,243,245,250]
Sequential decisions | Resource reallocation | [17,40,46,49,55,69,80,89,113,114,125,129,146,164,167,170,246,249,266,302]
Sequential decisions | Accept or reject entities | [49,55,80,89,161,217,226,267,303,304]
Sequential decisions | State change | [17,39,40,43,49,55,59,61,69,80,89,90,113,114,164,167,168,223,246,249,266,303,304]
Sequential decisions | Cardinal and ordinal directions | [118,156,206,240,305]
Sequential decisions | Maintenance actions | [45,48,60,151,231,238]
Heuristic selection | Dispatching | [16,18,22,27,30,34,35,36,37,41,42,53,62,66,93,120,169,171,209,242,244,264,289,306,307,308]
Heuristic selection | Local rules | [29,33,181,183,197,203,208,260,270]
Multiple selections | Single-action-encoded | [59,64,116,117,189,205,233,275,309]
Multiple selections | Multiple decisions | [25,114,123,127,147,168,193,196,201,210,228,243,245,289,307]
Multiple selections | Resources | [23,26,28,47,59,64,82,86,95,99,100,103,123,135,147,162,182,191,226,239,307]
Variable-sized output | | [19,47,65,184,185,243,289,292]
Number estimation | Number output | [140,202,207,247,310]
Number estimation | Continuous value | [23,58,77,92,95,98,101,105,106,164,199,214,224,225,227,228,236]
Number estimation | Discretized continuous values | [88,127,141,193,311,312]
Number estimation | Hybrid strategy parameters | [144,159,218,235,252,253,254,255,256,257,258,259,261,262,263,286,298,313]
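The list-selection actions in Table 4 are commonly realized by scoring every candidate (job, machine, node) with the policy and masking entries that are currently invalid, since iterative selection would otherwise propose already-used or infeasible choices. The following is a minimal masked-selection sketch with assumed array shapes, not tied to any particular cited method.

```python
import numpy as np

def masked_action_selection(logits, valid_mask, greedy=True, rng=None):
    """Select one candidate index from `logits`, ignoring masked-out entries.

    logits: 1D array of unnormalized scores, one per candidate (job, node, ...).
    valid_mask: boolean array of the same shape; False marks invalid candidates.
    """
    masked = np.where(valid_mask, logits, -np.inf)  # invalid entries can never be chosen
    if greedy:
        return int(np.argmax(masked))
    # Softmax over valid entries only, then sample (stochastic policy).
    shifted = masked - masked[valid_mask].max()
    probs = np.exp(shifted) * valid_mask
    probs /= probs.sum()
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(logits), p=probs))
```

Masking invalid candidates directly, rather than discouraging them only through the reward, is consistent with the observation in Table 6 that iterative list selection often requires action masking.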
Table 5. Common reward designs.
Group | Features | References
Mirror objective function | | [19,21,22,28,33,37,51,65,72,79,93,107,114,119,131,138,149,152,163,164,204,206,214,222,223,226,230,247,264,267,279,286,307,308]
Mirror objective function | Negative | [49,50,80,81,175,199,237,295]
Mirror objective function | Weighted | [46,59,73,89,113,159,228,249,284]
Mirror objective function | Inverse | [16,55,87,247,280]
Mirror objective function | Multiple objectives in one | [23,28,46,59,89,103,109,113,134,136,141,201,232,241,249,264,285,289,296]
Relative rewards | Previous solution increment | [26,29,35,49,56,62,77,78,80,82,84,98,109,112,117,122,130,139,142,144,189,190,198,204,208,211,212,220,232,259,260,266,268,269,273,274,275,277,278,308]
Relative rewards | Initial solution | [109,178,179,181,195]
Relative rewards | Global best | [29,188,244,259,263,286]
Relative rewards | Other baselines | [32,95,138,148,190,211,212,233]
Extra terms | Scheduling | [75,110,154,194,195,212,245,246,247,280,301]
Extra terms | Lateness | [71,92,105,140,153,173,184,191,208,215]
Extra terms | Cost | [101,108,115,150,151,215,225,227,267,280,307]
Extra terms | Network-related | [46,47,128,162,187,205,213,236,246]
Extra terms | Constraint as penalty | [43,44,131,155,199,213,235,239,251,258,261,267,283,292,301]
Conditional rewards | | [24,36,40,42,58,76,90,106,120,156,161,165,169,185,193,217,219,249,250,284,308,309]
Conditional rewards | Events and states | [26,40,57,68,69,71,86,88,97,116,118,124,129,145,166,168,206,219,231,243,249,250,279,281,282,298,306,315]
Conditional rewards | Action selection or infeasibility | [17,24,53,74,76,108,123,125,132,133,136,146,156,163,165,168,185,189,200,206,224,241,242,262,263,294,299,303,305]
Conditional rewards | Metric threshold | [18,48,60,70,71,111,116,132,133,159,167,168,183,207,218,238,279,291,311]
Conditional rewards | Solution evolution | [92,96,160,192,196,197,203,257,270,311,313]
Conditional rewards | Terminal | [24,34,35,69,76,92,136,143,172,176,177,182,234,282,287,288,302]
Problem-specific | Continuous production systems | [49,59,80,100,173]
Problem-specific | Due dates | [50,65,87,102,119,170,295,307,308]
Problem-specific | Profit and costs | [46,79,89,102,105,134,135,137,162,208,227,304]
Problem-specific | Energy | [72,88,100,128,194,225,289,312]
Problem-specific | Routing | [81,175,206,249]
Problem-specific | Non-linear objective function | [18,27,67,85,110,153,184,210,215,290]
Problem-specific | Multiple and many objective | [18,23,32,63,87,104,132,193,196,226,257,280]
Multi-agent | | [20,78,164,172,206,287]
Alternative goals | Unavailable objective function | [22,31,38,39,42,54,55,61,64,66,91,99,115,161,174,186,202,221,247,304,307,308]
Alternative goals | Ratios | [38,39,55,71,126,158,171,185,221,256,304,308]
Alternative goals | Estimation | [22,38,66,91]
Alternative goals | Hybrid strategies | [235,252,253,254,255,263,300]
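In practice, the reward groups in Table 5 are combined into a single scalar per step, typically an increment relative to the previous solution, penalty terms for constraint violations and, occasionally, a sparse terminal bonus. The function below is a generic sketch of such a modular reward for a makespan-minimization setting; the weights, terms and their combination are illustrative assumptions rather than a recommended configuration.

```python
def modular_reward(prev_makespan, new_makespan, violations,
                   done=False, final_makespan=None,
                   w_increment=1.0, w_penalty=10.0, w_terminal=0.1):
    """Combine common reward modules into one scalar (illustrative sketch).

    - Incremental term: improvement over the previous solution ("Relative rewards" in Table 5).
    - Penalty term: constraint violations expressed as negative reward ("Constraint as penalty").
    - Terminal term: sparse bonus based on the final objective value ("Terminal").
    """
    reward = w_increment * (prev_makespan - new_makespan)   # positive if the makespan improved
    reward -= w_penalty * violations                        # each violation is penalized
    if done and final_makespan is not None:
        reward += w_terminal * (1.0 / max(final_makespan, 1e-6))  # better final solution, larger bonus
    return reward
```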
Table 6. Common RL agent roles and highlighted examples.
RL Role | Advantages | Disadvantages | Examples
Tabular methods | Simple implementation. Explainable results. | Limited state representation. Must sufficiently explore all states. | [45,166,175,294,296]
Iterative list selection | Very flexible. Widespread literature adoption. | Single-use actions. Often requires action masking. | [31,40,49,217,232]
Hybrid approaches | Simplify agents’ decisions. Enhance other methods with RL decision making. | RL only optimizes a subset. External methods or tools might be insufficient to optimize. | [18,123,184,254,258]
Specific neural network models | Variable-sized inputs and outputs. Relative positions of pixels, nodes and tokens provide extra context. | Frameworks are harder to train. Scalability issues. | [141,188,232,238,271]
Multi-agent | Agents can make simpler decisions. Specialized agents for smaller tasks. | More complex frameworks. Decentralization requires communication protocols. | [49,116,217,246,251]
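As a point of reference for the roles in Table 6, the simplest option, a tabular method, reduces to a Q-table over discretized states updated with the standard temporal-difference rule. The sketch below assumes a small environment exposing hypothetical reset(), step() and actions() methods; it is a generic illustration, not tied to any cited implementation.

```python
import random
from collections import defaultdict

def tabular_q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Minimal tabular Q-learning loop (illustrative sketch).

    Assumes env.reset() returns a hashable state, env.step(action) returns
    (next_state, reward, done), and env.actions(state) lists the legal actions.
    """
    q = defaultdict(float)  # (state, action) -> estimated return

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            actions = env.actions(state)
            if random.random() < epsilon:                       # explore
                action = random.choice(actions)
            else:                                               # exploit
                action = max(actions, key=lambda a: q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = max((q[(next_state, a)] for a in env.actions(next_state)), default=0.0)
            # Q-learning temporal-difference update.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q
```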
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
