Article

Reinforcement Learning Agent for Multi-Objective Online Process Parameter Optimization of Manufacturing Processes

1 IconPro GmbH, Friedlandstraße 18, 52064 Aachen, Germany
2 Manufacturing Technology Institute, MTI of RWTH Aachen University, Campus-Boulevard 30, 52074 Aachen, Germany
3 Aachen Institute for Advanced Study in Computational Engineering Science (AICES), RWTH Aachen, Schinkelstrasse 2, 52062 Aachen, Germany
4 Laboratory for Machine Tools and Production Engineering, WZL of RWTH Aachen University, Campus-Boulevard 30, 52074 Aachen, Germany
5 Fraunhofer Institute for Production Technology IPT, Steinbachstr. 17, 52074 Aachen, Germany
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(13), 7279; https://doi.org/10.3390/app15137279
Submission received: 31 May 2025 / Revised: 22 June 2025 / Accepted: 25 June 2025 / Published: 27 June 2025
(This article belongs to the Special Issue Multi-Objective Optimization: Techniques and Applications)

Abstract

Featured Application

This study presents a multi-objective optimization approach for manufacturing pinions, aiming to balance conflicting objectives such as geometric quality metrics related to radius and thickness. Preliminary validation is conducted using test cases, and the methodology is further substantiated through application to open-source manufacturing datasets and virtual experimentation scenarios.

Abstract

Optimizing manufacturing processes to reduce scrap and enhance process stability presents significant challenges, particularly when multiple conflicting objectives must be addressed concurrently. As the number of objectives increases, the complexity of the optimization task escalates. This difficulty is further intensified in online optimization scenarios, where optimal parameter settings must be delivered in real time within active production environments. In this work, we propose a reinforcement learning-based framework for the multi-objective optimization of manufacturing parameters, demonstrated through a case study on pinion gear manufacturing. The framework utilizes the Multi-Objective Maximum a Posteriori Optimization (MO-MPO) algorithm to train a reinforcement learning agent. A high-fidelity simulation of the pinion manufacturing process is constructed in Simufact, serving both data generation and validation purposes. The agent’s performance is assessed using a hold-out test set along with additional simulations of the physical process. To ensure the generalizability of the approach, further validation is performed using open-source manufacturing datasets and synthetically generated data. The results demonstrate the feasibility of the proposed method for real-time industrial deployment. Moreover, Pareto-optimality is verified via half-space analysis, emphasizing the framework’s effectiveness in managing trade-offs among competing objectives.

1. Introduction

Scrap in manufacturing processes often results from process instabilities, which may arise due to factors such as tool wear, raw material variation, environmental changes, or improper process parameter settings [1]. These instabilities can be quantified using quality data from manufactured parts, typically measured by coordinate measuring machines (CMMs). To compensate for such deviations, quality control loops are commonly implemented to adjust process parameters based on observed quality metrics [2]. Effective monitoring and control of manufacturing processes require real-time feedback to support timely corrective actions under dynamic conditions. Consequently, a central requirement of quality control loops is the ability to identify optimal process parameter settings in real time [3]. Prior research in this area has largely focused on single-objective optimization using reinforcement learning (RL), validated through theoretical modeling or industrial case studies.
This study advances the field by exploring an RL-based strategy for multi-objective optimization of manufacturing process parameters. In particular, it aims to determine optimal settings for the production of pinion gears, with simultaneous consideration of two key quality attributes: flange radius and flange thickness. Pinion gears are essential components in a wide range of mechanical systems, prized for their role in transmitting rotational motion and, in rack-and-pinion arrangements, converting it into linear displacement. Common applications include steering assemblies, conveyor systems, gearboxes, bicycle drivetrains, and electric motors, highlighting their widespread industrial significance. This work focuses exclusively on addressing the challenges of multi-objective optimization through RL; single-objective methods are outside the scope of this investigation.

1.1. Outline of the Article and Research Contributions

This study is organized as follows. This section covers recent work related to multi-objective RL optimization and developments in pinion manufacturing. Section 2 formalizes the problem of multi-objective optimization in mathematical terms. Section 3 describes the Simufact simulation used to generate synthetic data for pinion manufacturing, along with the corresponding machine learning-based surrogate model. Section 4 presents the theoretical background and methodology for training the RL agent. Validation on two use cases, pinion manufacturing and open-source data, is presented in Section 5. Section 6 discusses considerations for integrating the proposed approach into production environments. Finally, Section 7 summarizes the main findings of the study.
The key contributions of this research are as follows:
  • Algorithmic Development: A multi-objective RL framework specifically adapted for manufacturing control tasks, developed by extending the Multi-Objective Maximum a Posteriori Optimization (MO-MPO) algorithm.
  • Industrial Validation: Validation of the proposed framework across two distinct industrial use cases, utilizing both real-world and synthetically generated datasets.
  • Empirical Evaluation: Demonstration of effectiveness through improved Process Capability Index (Cp) values and Pareto front approximations, verified using a gradient-based half-space analysis technique.
  • Production-Ready Implementation: Deployment of a production-ready system via containerized deployment with REST API integration, enabling seamless integration into computer-integrated manufacturing workflows.

1.2. Related Work

In recent years, the integration of advanced computational techniques into manufacturing processes has attracted considerable interest [4]. Among these, the application of RL for multi-objective optimization has demonstrated early potential and remains an active area of research [5]. RL, a subfield of machine learning, enables agents to learn optimal strategies through interaction with an environment. This paradigm is particularly well-suited for complex decision-making problems, such as optimizing process parameters under temporal and operational constraints.
The field of process parameter optimization has been extensively studied over the past two decades, with most research efforts centered on offline optimization techniques. A significant portion of these approaches employ evolutionary algorithms, such as Genetic Algorithms (GAs), Particle Swarm Optimization (PSO), and the Non-dominated Sorting Genetic Algorithm II (NSGA-II). While these methods are effective for both single- and multi-objective optimization problems, they are often computationally expensive. Their complexity tends to scale exponentially with the number of control parameters, rendering them less suitable for real-time or dynamically changing manufacturing environments. In recent years, RL has emerged as a promising alternative, offering improved efficiency, adaptability, and quality outcomes. Over the last five years, several studies have demonstrated RL’s applicability to process optimization tasks. Paranjape et al. [6] proposed an RL-based framework for process parameter optimization and benchmarked it against conventional methods such as GA and PSO. Their findings revealed that the RL-based approach achieved comparable performance to the best traditional algorithms while reducing response time by a factor of 100. These results were validated on both open-source and industrial datasets. The broader applicability of RL-based optimization has also been highlighted in other domains. For instance, Minak [7] applied a multi-objective optimization framework to the design of composite structures for solar vehicles, illustrating the generalizability of such approaches beyond conventional manufacturing contexts. Additional evidence of RL’s advantages in response time and robustness can be found across various application areas. Khdoudi et al. [8] showed that a Twin Delayed Deep Deterministic Policy Gradient (TD3) agent achieved faster convergence and lower execution time than a GA, while maintaining consistent performance across episodes, demonstrating improved robustness in manufacturing optimization. In the field of chemical process control, Zhao et al. [9] employed Proximal Policy Optimization (PPO), reporting convergence speeds 10 to 10,000 times faster than other methods, with stable outcomes under varying initial conditions. Similarly, Li et al. [10] proposed an RL-guided PSO variant that outperformed several traditional PSO algorithms on 28 benchmark functions, showing superior optimization performance and solution stability. Collectively, these studies underscore the potential of RL not only to deliver solution quality on par with traditional techniques but also to significantly enhance response time and robustness, particularly in online and dynamic process parameter optimization scenarios.
In addition to demonstrating superior performance in controlled evaluations, RL has been successfully applied across a range of manufacturing domains, delivering measurable improvements in quality, efficiency, and adaptability. In the field of injection molding, Guo et al. [11] implemented an RL actor–critic architecture to enhance the process capability index (Cpk) for lens thickness in continuous production, achieving a notable improvement from 0.315 to 1.720. Zimmerling et al. [12] applied a convolutional neural network (CNN)-based actor–critic RL framework to optimize variable geometry in textile draping, leading to enhanced forming accuracy. He et al. [13] proposed a multi-agent deep RL approach using deep Q-networks (DQNs) in textile manufacturing, which demonstrated improvements in product quality, productivity, and cost efficiency through a case study on the ozonation process. In the domain of real-time laser welding, Quang et al. [14] employed a Q-learning-based control system that used sensor feedback to regulate weld quality, achieving target performance without relying on surrogate models. In machining operations, Zhao et al. [15] used a Proximal Policy Optimization (PPO)-based RL algorithm to optimize cutting speed, feed rate, and depth with respect to tool wear, resulting in a 7.6% reduction in energy consumption and an 8.6% decrease in production time. Similarly, Huang et al. [16] integrated Graph Neural Networks with multi-agent RL for camshaft grinding optimization; however, their model was not validated using real-world data. Beyond traditional manufacturing, RL has also been adopted in process control applications. Ballard et al. [17] applied the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm to emulsion polymerization, achieving better copolymer composition and reduced reaction times. In parallel, Marcineková and Sujová [18] utilized neural networks to optimize multiple objectives in CNC milling for furniture manufacturing. While effective, their static architecture lacked the dynamic adaptation capabilities of RL systems, which can adjust policies in real-time based on environmental feedback. Hybrid modeling approaches also show potential. Vujovic et al. [19] combined Finite Element Method (FEM) simulations with neural networks to optimize the design of traditional Montenegrin furniture, highlighting the benefits of integrating physics-based and data-driven models. While RL-based methods have demonstrated strong results in single-objective manufacturing optimization tasks, significant research gaps remain in the development of RL systems capable of simultaneously balancing competing quality criteria. Addressing these challenges is essential for advancing RL applications in multi-objective industrial optimization scenarios.
In gear manufacturing, numerous studies have focused on optimizing process and tool design to enhance production quality and efficiency. Although relatively few works specifically address pinion gear manufacturing, this study considers pinions within the broader context of gear systems due to their shared principles, such as hobbing, while also recognizing key distinctions in geometry, function, and process parameters. Deshmukh and Thakare [20] optimized the heat treatment process for pinion gears using the Taguchi method, identifying optimal settings of a 910 °C furnace temperature and 60 min quenching time to improve surface hardness and consistency for automotive applications. Sun et al. [21] applied predictive modeling techniques to optimize hobbing parameters for gear production. Similarly, Chen et al. [22] proposed a hybrid approach combining K-means clustering, Multi-Objective Hunger Games Search (MOHGS), and the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) to improve gear hobbing performance. Deptula et al. [23] employed genetic algorithms (GAs) for gear parameter optimization. All of these studies rely on offline optimization approaches. Kamratowski et al. [24] used BevelCut simulation to optimize hypoid gear designs including pinions by analyzing cutting characteristics such as chip thickness and distribution. However, their work did not incorporate data-driven or machine learning-based methods.
This article advances the application of RL in manufacturing by addressing the challenges of multi-objective optimization. A use case involving dynamic optimization in pinion manufacturing is presented, demonstrating the potential of RL in real-world production settings. To the best of the authors’ knowledge, this study represents the first application of a multi-agent RL-based optimization framework specifically in the context of pinion gear manufacturing. To evaluate the generalizability of the proposed approach and to conduct ablation studies, the method is also tested on a publicly available manufacturing dataset, enabling broader validation beyond the industrial use case.

2. Problem Statement

This study addresses the general formulation of a multi-objective process parameter optimization problem, where the objective is to identify an optimization model F mulOpt that determines an optimal set of controllable parameters x c . The formal expression is given as follows:
$$x_c = F_{\text{mulOpt}}\big(x_u, \, T_0, \, M_{\text{process}}\big) \quad \text{such that} \quad \big\lVert M_{\text{process}}(x_u, x_c) - T_0 \big\rVert \;\rightarrow\; \min$$
Here, $x_c$ denotes the vector of controllable process parameters that are subject to optimization, while $x_u$ represents the uncontrollable process parameters, which are known a priori but cannot be modified. The vector $T_0$ defines the desired target values of the process outputs, and $M_{\text{process}}$ represents the manufacturing process model that maps input parameters to output performance indicators. The goal is to identify a function $F_{\text{mulOpt}}$ that computes an optimal configuration of $x_c$, such that the process output $M_{\text{process}}(x_u, x_c)$ closely approximates the target vector $T_0$, while accounting for the influence of uncontrollable factors. This work explores an RL-based approach for modeling $F_{\text{mulOpt}}$, tailored for the multi-objective optimization of process parameters in pinion manufacturing.
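To make the formulation concrete, the minimal sketch below frames $F_{\text{mulOpt}}$ as a search over the controllable parameters for a given black-box process model. The functions process_model and deviation, the parameter bounds, and the random-search loop are illustrative assumptions for this sketch, not the method actually used in this work.

```python
import numpy as np

def process_model(x_u: np.ndarray, x_c: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for M_process: maps uncontrollable and
    controllable parameters to a vector of process outputs."""
    return np.array([x_u[0] + 0.5 * x_c[0], x_u[1] - 0.2 * x_c[1]])

def deviation(x_c: np.ndarray, x_u: np.ndarray, t0: np.ndarray) -> float:
    """Norm of the deviation ||M_process(x_u, x_c) - T_0|| to be minimized."""
    return float(np.linalg.norm(process_model(x_u, x_c) - t0))

# A naive stand-in for F_mulOpt: evaluate random candidate settings and
# keep the one whose outputs come closest to the target vector T_0.
rng = np.random.default_rng(0)
x_u = np.array([1.0, 2.0])          # known but uncontrollable parameters
t0 = np.array([1.5, 1.8])           # desired process outputs
candidates = rng.uniform(-1.0, 1.0, size=(1000, 2))
x_c_best = min(candidates, key=lambda x_c: deviation(x_c, x_u, t0))
print(x_c_best, deviation(x_c_best, x_u, t0))
```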

3. Process Simulation

Pinion manufacturing is typically performed using a cold forming forward extrusion process. To train an RL agent for this task, a simulated environment is essential, as the agent learns through interaction with its surroundings. Direct training on the physical manufacturing system is generally infeasible due to the risk of equipment damage and the high associated operational costs. A common alternative is to employ a surrogate model of the process, allowing for safe and efficient experimentation. In this study, a high-fidelity simulation of the pinion manufacturing process was developed using Simufact, which was then used to generate a dataset suitable for training a machine learning-based surrogate model. While surrogate models can also be trained using historical manufacturing data, such data were unavailable for this specific application. The simulation setup using Simufact is described in Section 3.1, and the development of the surrogate model is presented in Section 3.2.

3.1. Finite Element Method-Based Process Model

The axial compression forming process, employed to decrease the workpiece height while increasing its diameter, cannot be described in detail due to confidentiality constraints. However, the following overview outlines the origin of the dataset used in this study. The process is first modeled using the Finite Element Method (FEM). The simulation was developed using Simufact Forming 2021 software from Hexagon AB, which employs an implicit finite element solver. To simplify the model, a two-dimensional, rotationally symmetric approach was chosen. This approach is permissible due to the rotationally symmetric forming in the process investigated. Figure 1a depicts the setup of the simulation model. The simulation model consists of all parts that interact with the workpiece. Accordingly, the punch, the workpiece, the die, and the ejector are considered. The punch, die, and ejector are simulated as rigid bodies, as the yield strength of the tool materials is significantly higher than that of the workpiece material, in order to reduce simulation time. In particular, to account for the machine stiffness k, a spring element is included for the punch. The spring constant is considered an input parameter, since changes may occur due to different machines, lubrication conditions, or temperatures. A vertical movement of the punch realizes the kinematics of the forming process. The vertical motion is therefore not only influenced by the stroke h but also by the machine stiffness k. The positions of the die and the ejector are fixed in place. The elastic–plastic material behavior of the C14C (1.0401) workpiece is taken from the Simufact Forming material database [25]. The initial mesh of the workpiece is shown in Figure 1b. The minimum mesh size e min of the hexahedral mesh in the area of forming is e min = 0.1 mm. During the simulation, remeshing was taken into account to avoid excessive distortion of individual elements. The strain of the elements triggers remeshing. In addition to the machine stiffness k and the stroke h, the initial tool temperature T tool _ 0 and the initial flange thickness b Fl _ 0 were used as input parameters. As output parameters, the flange thickness b Fl, the flange radius r Fl, and twelve forming forces evenly distributed over the stroke h were considered. Experimental forming tests validate the simulation model using a tool provided by ESW GmbH on a Schuler HPX400 from Schuler Group GmbH. An initial tool temperature T tool _ 0 corresponding to room temperature was ensured by allowing sufficient cooling times between each experiment. The initial flange thickness, b Fl _ 0, is measured using an outside micrometer. Three measurements are averaged for each sample. Two different strokes, h 1 and h 2, are used. For both strokes, twenty repeats are conducted for statistical validation. The results of the first stroke, h 1, are used to calibrate the machine stiffness, k. The second stroke h 2 is higher than the first stroke h 1 by Δh = 0.1 mm. The results corresponding to the second stroke h 2 are used to validate the simulation model (see Figure 1c; cf. Table 1). The deviation e between the simulation results and experimental data is found to be less than 1% (i.e., e < 1%). Based on this minimal deviation, the simulation model is considered sufficiently accurate for use in further analysis.
A full factorial design of experiments with three levels was employed to generate a database, yielding 81 data points. Bayesian optimization of a Gaussian Process Regression (GPR) model was then used to identify additional simulation points. Figure 1d shows the deviation of the predicted flange thickness (GPR) from the simulated flange thickness (FEM) Δ b Fl and the coefficient of determination R 2 of the GPR model over the cycles of Bayesian optimization. The coefficient of determination, R 2 , was calculated after all simulations, considering only the unknown data points. A total of 600 FEM simulations were performed.

3.2. Machine Learning-Based Process Model

The first step in the proposed framework involves training a surrogate model of the manufacturing process. The high-fidelity simulation developed using Simufact (cf. Section 3.1) exhibits substantial inference latency, rendering it impractical for direct use in RL agent training. To address this, machine learning-based surrogate models were employed to approximate the simulation outputs and significantly reduce inference time. These models were trained on synthetic data generated from the simulation environment. The dataset comprises four input parameters and fourteen output variables, as summarized in Table 2. Among the outputs, twelve correspond to force measurements recorded during the process, while the remaining two, flange thickness and flange radius, serve as key quality indicators of the manufactured pinion. An illustration of the pinion geometry is provided in Figure 2.
A total of 600 samples were generated using the simulation tool. The surrogate model was designed to predict the final flange thickness and flange radius based on the four input parameters listed in Table 2. The twelve force-related outputs were excluded from training, as they reflect process dynamics rather than quality characteristics. The process model comprises two independent feed-forward neural networks, each tasked with predicting one of the two target quality metrics: flange thickness and flange radius. Neural networks were selected over tree-based regression models due to their lower inference latency and capability to compute gradients with respect to input parameters, which is an essential feature required for validating Pareto-optimality. Hyperparameter tuning was conducted using the random search method implemented via KerasTuner. Each model was trained for a maximum of 500 epochs with a learning rate of 0.001, subject to early stopping based on validation loss. The final models demonstrated high predictive performance, achieving R 2 scores of 0.97 for flange radius and 0.99 for flange thickness on a 30% test split. These trained models were subsequently used to emulate the pinion manufacturing environment for training the RL agents.
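As an illustration of this step, the sketch below trains one of the two feed-forward regressors with Keras, using the stated learning rate (0.001), epoch budget (up to 500), and early stopping. The hidden-layer sizes and the random training data are assumptions, since the actual layer widths were selected by KerasTuner's random search on the simulation dataset.

```python
import numpy as np
from tensorflow import keras

# Minimal sketch of one surrogate regressor (e.g., flange thickness).
def build_surrogate(n_inputs: int = 4) -> keras.Model:
    model = keras.Sequential([
        keras.layers.Input(shape=(n_inputs,)),
        keras.layers.Dense(64, activation="relu"),   # layer widths are illustrative
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1),                        # one quality metric per network
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
    return model

# X: (n_samples, 4) process inputs, y: (n_samples,) flange thickness (dummy data here)
X = np.random.rand(600, 4)
y = np.random.rand(600)
model = build_surrogate()
model.fit(X, y, validation_split=0.3, epochs=500,
          callbacks=[keras.callbacks.EarlyStopping(patience=20,
                                                   restore_best_weights=True)],
          verbose=0)
```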
While surrogate models provide efficient approximations of complex physical processes, their accuracy is inherently limited by the quality and scope of the training data. They may also fail to capture nonlinearities, interactions, or dynamic behaviors present in real-world systems. Such limitations can lead to suboptimal or even unsafe control decisions during optimization [26]. Although quantifying model uncertainty is beyond the scope of this work, it is important to acknowledge the associated risks when deploying surrogate-based solutions.

4. Methodology

This article proposes an RL agent as a solution for the online optimization of manufacturing process parameters, motivated by its ability to perform adaptive, model-free optimization in real time in dynamic manufacturing environments [5]. While prior research by Paranjape et al. [6] focused on single-objective optimization using Maximum a Posteriori Policy Optimization (MPO), the present study extends this approach to a multi-objective context through a multi-agent training framework. The theoretical background of RL is provided in Section 4.1, the architecture of the proposed model is discussed in Section 4.1.2, and details on model training are provided in Section 4.2.

4.1. Background: Reinforcement Learning

RL is one of the three primary paradigms of machine learning, alongside supervised and unsupervised learning [27]. The main components of an RL system are illustrated in Figure 3 and described below.
Environment: The environment defines the external system within which the RL agent operates. It encapsulates the state and action spaces, as well as the reward function, and governs the dynamics of interaction. The environment responds to the agent’s actions by providing feedback in the form of rewards and new states (cf. Figure 3).
Agent: The agent is the core decision-making component in an RL framework. It observes the current state of the environment and selects actions according to a policy, which is iteratively improved through interaction. The agent’s goal is to learn a policy that maximizes the expected cumulative reward over time.
Reward: The reward is a scalar feedback signal emitted by the environment in response to the agent’s action. It quantifies the immediate utility or cost associated with a particular state–action transition. This signal drives the agent’s learning, incentivizing behaviors that yield higher long-term returns.
Action: An action represents a decision or move made by the agent based on the observed state. The set of all permissible actions constitutes the action space. Each action affects the environment’s state transition and the reward subsequently received by the agent, thereby influencing future rewards.
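To make these components concrete for the present use case, the sketch below wraps a surrogate process model in a minimal environment interface. The class name, the increment-style actions, and the simple per-objective reward are illustrative assumptions, not the exact environment used for training.

```python
import numpy as np

class PinionEnv:
    """Illustrative environment wrapping a surrogate process model.
    State = current controllable parameters; action = parameter increments.
    The reward here is a placeholder for the shaped reward of Section 4.2.1."""

    def __init__(self, surrogate, bounds, targets):
        self.surrogate = surrogate          # callable: params -> (radius, thickness)
        self.bounds = np.asarray(bounds)    # shape (n_params, 2): [low, high] per parameter
        self.targets = np.asarray(targets)  # desired quality values
        self.state = None

    def reset(self):
        low, high = self.bounds[:, 0], self.bounds[:, 1]
        self.state = np.random.uniform(low, high)
        return self.state

    def step(self, action):
        low, high = self.bounds[:, 0], self.bounds[:, 1]
        self.state = np.clip(self.state + action, low, high)
        outputs = self.surrogate(self.state)
        # one reward per objective (vector-valued reward, cf. Section 4.2.1)
        rewards = np.maximum(1.0 - np.abs(outputs - self.targets), 0.0)
        done = False
        return self.state, rewards, done, {}
```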

4.1.1. Model Selection

Model selection is a critical preliminary step before building and tuning a model’s hyperparameters. RL algorithms can generally be classified into value-based, policy-based, or hybrid methods, such as actor–critic algorithms [27]. The key distinction among these categories lies in whether the algorithm learns a state–action value function or a stochastic policy, each with varying degrees of modeling flexibility. Actor–critic methods combine the strengths of both value- and policy-based approaches. An overview of various model-free RL algorithms is presented in Figure 4 for the reader’s reference. As illustrated in Figure 4, RL algorithms can be categorized into value-based, policy-based, and actor–critic methods. The Maximum a Posteriori Policy Optimization (MPO) algorithm, an actor–critic method, was selected over alternative actor–critic algorithms for the following reasons [28]:
1.
Stable EM-Style Updates: MPO employs a two-step update procedure (E-step and M-step), reminiscent of the Expectation–Maximization algorithm. This structure provides smooth and stable policy updates, which is especially advantageous in both low- and high-dimensional action spaces by mitigating the risk of destructive gradient updates.
2.
KL-Constrained and Off-Policy Learning: MPO enforces a Kullback–Leibler (KL) divergence constraint to limit how far the updated policy can deviate from the current one. This constraint, combined with off-policy learning via experience replay, enables efficient learning in complex environments without requiring new data at each iteration.
3.
Strong Performance in Continuous Control Tasks: Empirical studies have shown that MPO performs well in continuous and high-dimensional tasks such as robotics and simulated control environments. Its robust theoretical foundation and modular architecture also allow it to generalize effectively to simpler problems.

4.1.2. Model Architecture

Since MO-MPO is an actor–critic algorithm, this section details the neural network architectures used for both the policy (actor) and the value estimation (critic) networks. Determining the number of layers and neurons per layer constitutes a key hyperparameter choice that involves balancing model complexity with the capacity to capture meaningful patterns in the data. Following the architectural guidelines recommended by Hoffman et al. [29], we adopt a configuration consisting of three fully connected hidden layers. This architecture offers a practical trade-off between expressiveness and computational efficiency, making it well-suited for continuous control tasks.
Actor Network: The actor network π comprises three fully connected hidden layers, each with 256 neurons. This architecture strikes a well-established balance between depth and width, and its 256-unit layer width aligns with the default configuration of the Soft Actor–Critic algorithm as implemented in Stable-Baselines3 [30]. The choice of 256 neurons per layer is motivated not only by its empirical effectiveness, but also by its computational efficiency: 256 is a multiple of the 32-thread CUDA warp size, which supports efficient GPU utilization and kernel occupancy [31,32]. This specific 3 × 256 configuration is also recommended in the Acme framework’s implementation of MO-MPO by Hoffman et al. [29], where it is applied to both the actor and critic networks. The input dimension corresponds to the number of controllable parameters (four in this case). The network has two output heads, a mean vector $\mu$ and a covariance matrix $\Sigma$, which together define the parameters of the policy distribution. The actor is trained using a two-step EM-style procedure:
E-step: Constructs a non-parametric action distribution $q(a \mid s)$ that is weighted by a prioritized combination of Q-functions, using weights $\alpha_k$ for each objective.
$$q(a \mid s) \propto \exp\!\left(\frac{1}{\eta} \sum_{k=1}^{N} \alpha_k \, Q_k^{\pi}(s, a)\right)$$
M-step: Updates the policy $\pi_\theta(a \mid s)$ by minimizing the KL divergence between the new action distribution $q(a \mid s)$ and the current policy (a numerical sketch of both steps follows the symbol list below).
$$\mathcal{L}_{\text{actor}} = \mathbb{E}_{s \sim \mathcal{D}}\left[\operatorname{KL}\big(q(a \mid s) \,\|\, \pi_\theta(a \mid s)\big)\right]$$
  • $\pi_\theta(a \mid s)$: Parametric actor policy.
  • $q(a \mid s)$: Target distribution computed from the weighted sum of Q-functions.
  • $\operatorname{KL}(q \,\|\, \pi)$: KL divergence between the target distribution and the current policy.
  • $\mathbb{E}_{s \sim \mathcal{D}}$: Expected value over the state distribution $s \sim \mathcal{D}$, where $\mathcal{D}$ denotes the dataset containing sampled states.
  • $\mathcal{L}_{\text{actor}}$: The actor loss function.
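The following numerical sketch illustrates both steps for a single state with a discrete set of sampled candidate actions. The temperature, the weights, and the random Q-values are placeholders; the real MO-MPO implementation works with Gaussian policies and additionally optimizes the temperature(s) rather than fixing them.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_objectives = 8, 2
eta = 1.0                                    # temperature (placeholder value)
alpha = np.array([0.5, 0.5])                 # per-objective preference weights alpha_k
q_values = rng.normal(size=(n_objectives, n_actions))   # Q_k(s, a) for sampled actions
policy = np.full(n_actions, 1.0 / n_actions)             # current pi_theta(a | s)

# E-step: non-parametric target q(a|s) proportional to exp(sum_k alpha_k Q_k / eta),
# evaluated over actions sampled from the current policy.
logits = (alpha[:, None] * q_values).sum(axis=0) / eta
weights = np.exp(logits - logits.max())
q_target = weights / weights.sum()

# M-step: the actor loss is KL(q || pi_theta); here it is only evaluated,
# whereas in training it is minimized with respect to the policy parameters.
actor_loss = float(np.sum(q_target * (np.log(q_target) - np.log(policy))))
print(f"KL(q || pi_theta) = {actor_loss:.4f}")
```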
Critic Network: Similar to the actor network, the critic network contains three fully connected hidden layers with 256 neurons per layer. As each objective requires its own Q-function, the total number of critic networks equals the number of objectives, two in this implementation. Both the state and action spaces determine the input dimension. Batches of state–action pairs are sampled from the replay buffer and passed to the critic, which returns a scalar Q-value for each objective. The critic’s primary role is to evaluate actions by approximating their expected returns, a process formally referred to as policy evaluation.
The MO-MPO algorithm incorporates a technique known as retracing, wherein state–action pairs are recomputed using the updated policy parameters. This mechanism is applied across all objectives and can be extended with a prioritization strategy tailored for multi-objective settings [33]. As described by Abdolmaleki et al. [28], the critic network is trained using the following loss function:
$$\min_{\{\phi_k\}_{k=1}^{N}} \; \sum_{k=1}^{N} \mathbb{E}_{(s,a) \sim \mathcal{D}}\left[\left(\hat{Q}_k^{\text{ret}}(s,a) - Q_k^{\pi_{\text{old}}}(s,a;\phi_k)\right)^{2}\right]$$
In this formulation, $\phi_k$ denotes the trainable parameters of the $k$th critic network, and $\hat{Q}_k^{\text{ret}}(s,a)$ represents the retrace target for objective $k$. The dataset $\mathcal{D}$ refers to a replay buffer comprising historical state–action pairs $(s,a)$ collected by the actor. The critic evaluates the prior policy $\pi_{\text{old}}$ by minimizing the squared error between predicted Q-values $Q_k^{\pi_{\text{old}}}$ and retraced targets $\hat{Q}_k^{\text{ret}}$. Beyond policy evaluation, the critic also facilitates policy improvement by informing updates to the actor network parameters $\theta$. This is achieved by minimizing the KL divergence between the updated action distribution and the current actor policy $\pi$. In multi-objective contexts, divergence metrics can be used to prioritize certain objectives, ensuring that the learned policy respects these trade-offs while maximizing the expected cumulative return.
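A simplified rendering of this per-objective regression is sketched below. A one-step bootstrapped target replaces the full Retrace estimator for brevity, and all tensors are random placeholders standing in for replay-buffer samples.

```python
import numpy as np

rng = np.random.default_rng(1)
batch, n_objectives, gamma = 32, 2, 0.99

rewards = rng.random((n_objectives, batch))        # r_k(s_t, a_t)
q_next = rng.normal(size=(n_objectives, batch))    # Q_k(s_{t+1}, a') from target critics
q_pred = rng.normal(size=(n_objectives, batch))    # Q_k(s_t, a_t; phi_k) from current critics

targets = rewards + gamma * q_next                 # simplified stand-in for the retrace target
critic_losses = ((targets - q_pred) ** 2).mean(axis=1)   # one squared-error loss per objective
print(critic_losses)                               # gradients w.r.t. phi_k would follow from these
```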

4.2. Model Training

The training process begins from an initial state, denoted as $s_0$, which is randomly sampled from the state space. The actor selects an action $a_t$ based on the current state $s_t$, leading to a transition to the next state $s_{t+1}$ and yielding a corresponding reward $r_t$. Each such interaction is recorded as a transition tuple:
$$\tau_t = \big(s_t, \, a_t, \, r_t, \, s_{t+1}, \, \pi_\theta(a_t \mid s_t)\big)$$
These transition tuples are stored in a replay buffer [34], which enables off-policy learning by allowing the model to sample past experiences during training. Each transition contains the current state $s_t$, the selected action $a_t$, the received reward $r_t$, the next state $s_{t+1}$, and the probability of the selected action under the current policy $\pi_\theta$. The environment provides feedback in the form of rewards based on these transitions. This interaction process continues iteratively over a fixed number of time steps within each episode. In parallel, the critic samples mini-batches of transitions from the replay buffer to perform policy evaluation, and these evaluations are used to periodically update the policy network weights and thereby improve the actor’s decision-making. The actor and critic operate asynchronously [35], with the critic initiating updates only after a sufficient number of transitions have been accumulated in the replay buffer. The state space $S$ and action space $A$ are defined based on the set of controllable process parameters and their respective operational bounds.
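A minimal replay buffer holding such transition tuples could look as follows; the capacity and batch size are illustrative values rather than the settings used in the experiments.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "state action reward next_state log_prob")

class ReplayBuffer:
    """Fixed-capacity store of transitions for off-policy learning."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, *transition):
        self.buffer.append(Transition(*transition))

    def sample(self, batch_size: int = 256):
        # the critic draws mini-batches from here for policy evaluation
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```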

4.2.1. Reward Modeling

An appropriately designed reward function is critical for effective model training [36]. Higher reward values correspond to closer proximity to the optimal state. Given the need to simultaneously satisfy multiple objectives, a vector-valued reward function is adopted, as proposed in [37]:
$$\mathbf{r}(s_t, a_t) = \big[\, r_0(s_t, a_t), \, \ldots, \, r_k(s_t, a_t) \,\big]$$
Here, k denotes the total number of objectives. The scalar reward associated with each objective is defined as follows:
$$r(s_t, a_t) = O(s_{t+1})^{c} \cdot \frac{1}{d(s_0, s_{t+1})} \cdot \frac{1}{1 + \left(\frac{P(s)}{K}\right)^{2}}, \qquad r(s_t, a_t) \in [0, 1]$$
In this formulation, $s_t$ and $a_t$ represent the state and action vectors, respectively. The term $d(s_0, s_{t+1})$ denotes a distance function between the initial and next states. The gradient norm $P(s)$ is scaled by a constant $K$ to form a smooth reward-shaping component, penalizing transitions into regions with steep gradients. The exponent $c$ was empirically set to 2, as it yielded optimal performance across evaluated tasks. The gradient term acts as a regularizer, discouraging the agent from exploring unstable or high-variance regions in the state space. The operator $O(s)$ is defined as follows:
$$O(s) = \max\!\big(1 - \lvert y - \hat{y} \rvert, \, 0\big) \in [0, 1]$$
Here, $y$ represents the predicted output from the surrogate model, and $\hat{y}$ denotes the corresponding target or optimal value. This term evaluates prediction accuracy and encourages proximity to known optima. The overall reward formulation ensures that values remain bounded and continuous, contributing to training stability. The constant $K$ governs the sensitivity to local gradient magnitude: larger values of $K$ reduce the penalty effect, whereas smaller values amplify it. Collectively, this design promotes convergence to smooth, stable, and high-quality solutions.
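The sketch below renders one scalar reward component under the formulation above. Because the published formula is reconstructed here, the exact form of the distance factor, the flooring used to keep the reward bounded, and the default values of c and K should be read as assumptions.

```python
import numpy as np

def objective_term(y_pred: float, y_target: float) -> float:
    """O(s): accuracy term, equal to 1 at the target and clipped to [0, 1]."""
    return max(1.0 - abs(y_pred - y_target), 0.0)

def shaped_reward(y_pred, y_target, s0, s_next, grad_norm, c=2.0, K=10.0):
    """One objective's reward: accuracy term, distance penalty, gradient shaping.
    The distance is floored at 1 here (an assumption) so the product stays in [0, 1]."""
    distance = max(np.linalg.norm(np.asarray(s_next) - np.asarray(s0)), 1.0)
    smoothness = 1.0 / (1.0 + (grad_norm / K) ** 2)   # penalizes steep-gradient regions
    return objective_term(y_pred, y_target) ** c / distance * smoothness
```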

4.2.2. Training Setup

The neural network weights were initialized using the default Xavier uniform initializer provided by Keras. The dataset was partitioned into a 70% training set and a 30% hold-out evaluation set to assess the generalization performance of the trained RL agent. Details regarding the architecture of the actor and critic networks are provided in Section 4.1.2. The Adam optimizer was employed for training due to its combined advantages of adaptive learning rates and momentum, making it well-suited for deep learning scenarios involving sparse gradients or noisy observations, as is typical in RL settings. All training was conducted on a workstation running Ubuntu 22.04, equipped with an NVIDIA RTX 2080 Ti GPU.
To rigorously evaluate the performance of the trained RL optimization model, a 30% hold-out split was further analyzed to identify edge cases. Specifically, samples in the evaluation set were ranked according to the deviation between their objective values and known optima. The top 100 samples exhibiting the highest errors, hereafter referred to as faulty samples, were selected as inputs to the RL agent. These samples represent scenarios where conventional optimization techniques often perform poorly, and thus serve as a challenging benchmark to evaluate the agent’s post-optimization corrective capabilities. During evaluation, the RL agent was tasked with recommending optimal control parameter values for these faulty samples, thereby guiding the system toward improved output quality. The performance of the model was assessed based on the following criteria:
  • Mean reward: The average reward obtained across all optimized faulty samples.
  • Stability: The consistency of predicted outputs across multiple evaluation trials.
The RL agent consistently demonstrated the ability to reduce error in these difficult cases, validating its effectiveness as a robust post-optimization correction mechanism. The complete set of hyperparameters used for training the actor and critic networks is summarized in Table 3.

4.2.3. Training Stability

Training stability is a critical concern in actor–critic RL methods due to the close coupling between learning targets and policy updates. In this work, the Maximum a Posteriori Policy Optimization (MPO) algorithm [38] provides a robust foundation for stable learning by enforcing a hard Kullback–Leibler (KL) divergence constraint during the E-step of the policy update. This constraint limits deviations from the previous policy, thereby ensuring smoother transitions between successive policies. The corresponding constrained optimization objective is defined as follows:
$$\arg\max_{q} \; \mathbb{E}_{\mu(s)}\!\left[\mathbb{E}_{q(a \mid s)}\!\left[Q^{\pi_i}(s, a)\right]\right] \quad \text{subject to} \quad \mathbb{E}_{\mu(s)}\!\left[\operatorname{KL}\big(q(a \mid s) \,\|\, \pi(a \mid s; \theta_i)\big)\right] < \epsilon$$
This formulation seeks to maximize the expected Q-value under a new action distribution $q(a \mid s)$, while ensuring that it remains close to the current policy $\pi(a \mid s; \theta_i)$ within a KL divergence bound $\epsilon$. To further enhance training robustness, several stabilization techniques were integrated into the training process. A soft target update mechanism [39] was applied to the critic network, with a smoothing coefficient of $\tau = 0.005$, to reduce variance in bootstrapped target estimates. Gradient clipping with a global norm threshold of 1.0 was implemented to prevent instability in weight updates, particularly during the early phases of training [40]. Entropy regularization [41] was maintained during the E-step to encourage exploration and mitigate premature policy convergence.
Separate Adam optimizers [42] were employed for the actor and critic networks, with learning rates set to $10^{-4}$ and $10^{-3}$, respectively. Additionally, a short warm-up phase for the learning rates was introduced to smooth the gradient dynamics at the start of training. All experiments were conducted with fixed random seeds to ensure reproducibility. The MPO algorithm inherently incorporates strong regularization mechanisms that contribute to stable learning behavior. As a result, minimal hyperparameter tuning or manual intervention was required. When combined with the aforementioned stabilization strategies, the algorithm consistently demonstrated reliable and robust training performance across all evaluated environments.
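For reference, the soft target update and global-norm gradient clipping described above can be expressed in a few lines. The TensorFlow variable handling shown here is an illustrative sketch under the stated settings (tau = 0.005, clip norm = 1.0), not the exact training code.

```python
import tensorflow as tf

TAU, CLIP_NORM = 0.005, 1.0

def soft_update(target_vars, online_vars, tau=TAU):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    for t_var, o_var in zip(target_vars, online_vars):
        t_var.assign(tau * o_var + (1.0 - tau) * t_var)

def apply_clipped_gradients(optimizer, gradients, variables, clip_norm=CLIP_NORM):
    """Clip gradients by their global norm before applying the update."""
    clipped, _ = tf.clip_by_global_norm(gradients, clip_norm)
    optimizer.apply_gradients(zip(clipped, variables))
```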

5. Experiments and Results

To evaluate the effectiveness of the proposed MO-MPO algorithm for multi-objective process parameter optimization, experiments were conducted on two datasets: the pinion manufacturing dataset (Section 5.1) and an open-source multi-stage continuous-flow manufacturing dataset (Section 5.2).
These experiments were designed to assess the algorithm’s capability to handle conflicting objectives inherent in real-world industrial processes. For instance, in the pinion manufacturing task, increasing the flange radius often leads to a reduction in part thickness, which highlights a typical trade-off scenario in multi-objective optimization. The optimized process parameters obtained through the MO-MPO algorithm were evaluated using predictions from the trained surrogate models on both the test set and an independent hold-out split. As explicit Pareto fronts were not provided for these use cases, Pareto optimality was assessed using a half-space validation approach to verify the quality and trade-off behavior of the obtained solutions.

5.1. Results on Pinion Manufacturing Dataset

Section 3.2 describes the surrogate model developed for the pinion manufacturing process, which serves as the first step in the proposed framework. Based on the model’s predictive performance, it was subsequently employed as the training environment for the RL agent to recommend optimal hub stroke values. Given the relative simplicity of the process, the RL agent achieved strong results within just 50 training episodes. As outlined in Section 3.2, the surrogate model attained an R 2 score of 0.97 for flange radius and 0.99 for flange thickness on a held-out test split. The RL agent achieved reward values of 0.864 for flange radius and 0.866 for flange thickness, demonstrating a high success rate in identifying optimal stroke values as control parameters. Both flange radius and flange thickness are conflicting quality objectives. The optimization resulted in target values of 22.3 mm for radius and 3.93 mm for thickness. Figure 5 illustrates the model’s performance on the evaluation data, showing the ground truth values, the RL agent’s predictions, and the optimal solution point within the radius–thickness design space.

5.2. Results on Open-Source Dataset

The dataset originates from a multi-stage continuous-flow manufacturing process featuring both parallel and series configurations. Measurement machines within the system are anonymized using numerical identifiers. In the initial stage, machines M1–M3 operate in parallel and feed into a combiner. At this point, the combined output is measured at 15 discrete locations along the outer surface of the material. Measurement values of 0.0 indicate missing data at specific positions. Due to the high frequency of missing values, only the top five measurement points with the highest data availability were selected for optimization. The primary objective was to identify optimal control parameters that minimize the deviation of measured outputs from their respective setpoints. This dataset reflects the complexity of a real-world, multi-stage manufacturing system and serves as a robust testbed for evaluating the generalization capability and performance of the proposed RL-based optimization framework. It comprises over 14,000 records, each containing sensor readings, control variable settings, system states, and output measurements across various production stages. Setpoints are provided and serve as targets for the multi-objective optimization.
Table 4 categorizes the dataset parameters. Motor-related features, such as amperage and RPM, were selected as control parameters due to their direct influence on the manufacturing process and real-time adjustability. These variables are typically monitored and adjusted in operational environments, making them both logically relevant and practically implementable for an RL-based control strategy. Furthermore, their consistent presence in the dataset supports reliable model training and evaluation. Other parameters were treated as either uncontrollable or static.
Model performance is reported in terms of reward values obtained on a hold-out validation set after training. High reward scores were observed across all selected measurement points, indicating effective alignment with optimization objectives. Table 5 summarizes the RL rewards for two different control parameter configurations. The limited reward control setup is a subset of the extended control configuration, enabling comparative analysis of the optimization model’s effectiveness under varying degrees of control freedom. Appendix A provides further detail on limited and extended control configuration for this dataset.

5.3. Validation

The proposed technique is validated for process control using two complementary approaches. First, the Pareto optimality of the resulting solutions is analyzed to ensure their efficiency in addressing multiple objectives. Second, the solutions are assessed using process capability metrics to evaluate their practical effectiveness within the RL-based optimization framework.

5.3.1. Pareto-Optimal Validation

Abdolmaleki et al. [28] claim that the MO-MPO algorithm returns Pareto-optimal solutions in multi-objective settings. This section provides a formal validation of that claim by analyzing whether the obtained results satisfy the conditions for Pareto optimality, using a gradient-based geometric approach [43]. Initially, only continuous input parameters are considered for the gradient-based validation; the analysis is later extended to include discrete input parameters as well.
Let $F_i : X \to \mathbb{R}$ denote the predictive loss of objective $i$ ($i = 1, \ldots, n$) on the continuous input space $X \subseteq \mathbb{R}^d$, and let $x \in X$ be a candidate solution.
1.
Gradient calculation: The negative direction of the gradient of an objective loss function at the current solution indicates the direction of improvement. The first step is to calculate the negative gradient of the loss function for all objectives at the current solution $x$, which is $-\nabla_x F_i(x)$, $i = 1, \ldots, n$.
2.
Objective-wise improvement half-spaces: If a direction $z \in X$ exists such that the dot product between $z$ and the negative gradient $-\nabla_x F_i(x)$ is greater than zero, then for objective $i$ there is a direction of improvement. The set of all such directions constitutes the half-space
$$H_i(x) = \big\{\, x + z \;\big|\; z \in X, \; \langle z, \, -\nabla_x F_i(x) \rangle > 0 \,\big\}, \qquad i = 1, \ldots, n.$$
3.
Intersection of half-spaces: Considering all objectives, if the intersection of half-spaces is not empty, any element of this intersection represents a direction that improves all objectives. Thus, $x$ cannot be a Pareto-optimal solution. Let us define the intersection of half-spaces as $A(x) = \bigcap_{i=1}^{n} H_i(x)$.
  • If $A(x) \neq \emptyset$, a single direction exists that decreases every $F_i$ simultaneously; hence, $x$ is not Pareto-optimal, as all objectives can be improved at the current solution.
  • If $A(x) = \emptyset$, no direction exists that can improve all objectives at once; consequently, $x$ satisfies the necessary condition for Pareto optimality in the continuous subspace.
This reasoning is formally captured as follows:
$$\text{If } A = \bigcap_{i=1}^{n} H_i(x) \neq \emptyset \;\Rightarrow\; x \text{ is not Pareto-optimal for the continuous part},$$
$$\text{If } A = \emptyset \text{ and } \nabla_x F_i(x) \neq 0 \;\; \forall \, i = 1, \ldots, n \;\Rightarrow\; x \text{ is Pareto-optimal for the continuous part}.$$
Figure 6 illustrates the scenario for two objective loss functions, F 1 and F 2 , where the intersection is not empty. Therefore, there exists a direction in which both objectives can improve. The lines shown in the figure represent the level curves of F 1 and F 2 .
Once the solution is validated in the continuous space, it must also be verified for categorical or discrete inputs (if present). For discrete variables, the validation is based on the continuity of their dependency. If a small perturbation in the continuous variables does not affect the output of the discrete objectives, the solution can be considered Pareto optimal in the discrete part as well. This is formalized as follows:
$$D_i(x + \epsilon z, \, y) = D_i(x, y) \quad \text{for } z \in A, \; \epsilon \ll 1, \; i = 1, \ldots, n$$
This provides numerical evidence of Pareto optimality for discrete objectives, although it does not constitute a full validation due to the approximate nature of continuous optimization. The validation procedure is implemented using linear programming techniques; a minimal sketch of the corresponding feasibility test is given below.
The described validation methodology was applied to the pinion manufacturing dataset and the open-source multi-stage continuous-flow dataset. Given that certain process parameters are non-controllable, not all optimization results are expected to satisfy Pareto optimality. However, using the proposed half-space and discrete validation criteria, it was found that approximately 81% of the evaluated solutions were Pareto-optimal. This result confirms that the MO-MPO framework effectively produces high-quality solutions under realistic process constraints, supporting its practical applicability in complex manufacturing optimization scenarios.
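The sketch below illustrates the linear-programming feasibility test behind the half-space check. It searches for a bounded direction z with a positive margin against every negative gradient; the example gradients, the tolerance, and the box bound on z are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def common_descent_direction(gradients: np.ndarray, tol: float = 1e-9):
    """Half-space intersection test (sketch).

    gradients: array of shape (n_objectives, d) holding grad_x F_i(x).
    Returns (exists, z): exists is True if a direction z with
    <z, -grad F_i(x)> > 0 for every objective can be found, in which
    case x cannot be Pareto-optimal.
    """
    n_obj, d = gradients.shape
    c = np.zeros(d + 1)
    c[-1] = -1.0                                   # maximize the margin t
    # <grad F_i, z> + t <= 0   <=>   <z, -grad F_i> >= t
    A_ub = np.hstack([gradients, np.ones((n_obj, 1))])
    b_ub = np.zeros(n_obj)
    bounds = [(-1.0, 1.0)] * d + [(None, 1.0)]     # box-bound z, cap t
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        return False, None
    z, t = res.x[:-1], res.x[-1]
    return t > tol, z

# Example: two objectives with opposing gradients -> no common improvement
# direction, so the necessary Pareto condition holds at this point.
grads = np.array([[1.0, 0.0], [-1.0, 0.0]])
exists, _ = common_descent_direction(grads)
print("common improvement direction exists:", exists)
```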

5.3.2. Validation Through Improvement in Cp Values

For post-optimization validation, stakeholders such as process engineers rely on process capability metrics to assess the industrial viability of the proposed solution. One of the most widely used metrics is the process capability index C p , which quantifies the potential of a process to produce outputs within specified tolerance limits [44]. It is defined as follows:
$$C_p = \frac{USL - LSL}{6\sigma}$$
where $USL$ and $LSL$ represent the upper and lower specification limits, respectively, and $\sigma$ is the standard deviation of the process. A higher $C_p$ value indicates lower process variability and better conformance to quality specifications.
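For reference, the index can be computed directly from a sample of the quality characteristic, as in the short sketch below; the specification limits and sample values are illustrative numbers, not the article's actual tolerances.

```python
import numpy as np

def process_capability(values, lsl, usl):
    """C_p = (USL - LSL) / (6 * sigma), with sigma estimated from the sample."""
    return (usl - lsl) / (6.0 * np.std(values, ddof=1))

# Illustrative numbers only: a simulated thickness sample around 3.93 mm.
thickness = np.random.default_rng(2).normal(loc=3.93, scale=0.05, size=200)
print(round(process_capability(thickness, lsl=3.8, usl=4.06), 2))
```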
To demonstrate that the optimization is not only mathematically sound but also practically effective, this section presents the improvement in C p values before and after optimization. An increase in C p reflects enhanced process stability and manufacturing precision. In addition to identifying Pareto-optimal trade-offs, the RL-based solution resulted in a notable enhancement in process capability. For the pinion manufacturing dataset, the  C p value for thickness improved from 0.33 to 0.44, while the C p for the pinion radius increased from 0.32 to 0.38. Similarly, for the open-source continuous manufacturing dataset, an average improvement of approximately 30% in C p values was observed across all measured target parameters, further validating the robustness and practical utility of the approach.

5.4. Virtual Experiments

To assess the robustness and generalizability of the proposed optimization framework, a series of virtual experiments were conducted. Unlike real-world industrial data, these experiments were based on synthetically generated datasets designed to simulate a controlled multi-objective manufacturing environment with varying complexity. Two synthetic datasets generated with the scikit-learn library were used in the experiments. Both datasets contained 16 input parameters, simulating sensor readings and process variables. The output dimensions varied: the first dataset had two output targets, while the second included five. The primary variable of interest was the number of controllable parameters, which was successively doubled (i.e., one, two, four, and eight control parameters). The objective was to observe how increasing the dimensionality of the control space influences the optimization performance across different targets.
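The two synthetic benchmarks can be reproduced in spirit with scikit-learn's regression generator, as sketched below; the sample count and noise level are assumptions, since only the feature and target dimensions are reported.

```python
from sklearn.datasets import make_regression

# 16 input features with 2 and 5 output targets, respectively.
X2, y2 = make_regression(n_samples=2000, n_features=16, n_targets=2,
                         noise=0.1, random_state=0)
X5, y5 = make_regression(n_samples=2000, n_features=16, n_targets=5,
                         noise=0.1, random_state=0)
print(X2.shape, y2.shape, X5.shape, y5.shape)   # (2000, 16) (2000, 2) (2000, 16) (2000, 5)
```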
Figure 7a (left) shows the reward trends for Target 1 and Target 2 for the two-output dataset. Both targets exhibit a monotonic improvement in reward with increasing control dimensionality. Notably, Target 2 benefits significantly from a richer control space, with reward increasing from approximately 0.45 to 0.9. Target 1, which starts with a relatively higher baseline, shows marginal but consistent improvement. In contrast, the five-output dataset results, shown in Figure 7b (right), demonstrate a more complex relationship. Targets 1 and 2 initially improve as the control parameters increase, but their performance eventually plateaus or begins to decline, likely due to overfitting or the introduction of less-informative control dimensions. Targets 3 and 4 remain relatively stable across the control dimensionality sweep, indicating insensitivity to added parameters. Target 5, which starts with the lowest reward, shows consistent improvement, suggesting that certain objectives require more complex control representations to be adequately optimized.
These virtual experiments confirm that increasing the number of control parameters can enhance performance but also reveal diminishing or negative returns for certain objectives beyond a threshold. This highlights the importance of feature selection and dimensionality control in multi-objective RL settings.

6. Integration in Production

After finalization of the RL model, it was deployed using containerization to ensure a consistent and scalable runtime environment. Docker was used to encapsulate the application [45], along with all necessary dependencies. This containerized deployment enabled portability across different operating environments and minimized compatibility issues. For real-time interaction, the model was served via an API built using FastAPI [46]. The interface supported POST requests, allowing external systems to submit input data and receive predictions. FastAPI’s asynchronous architecture is particularly well suited to high-throughput environments, enabling non-blocking communication and improved response times.
The deployed model was hosted on an Amazon Web Services (AWS) EC2 instance with fixed compute specifications. API endpoints were created and documented using OpenAPI Generator (https://github.com/OpenAPITools/openapi-generator, accessed on 15 January 2025), facilitating seamless integration with computer-integrated manufacturing systems. A higher-level system architecture is illustrated in Figure 8. This setup allows automated decision-making by process engineers, leveraging RL-driven optimization in real time. To ensure reliability and robustness in deployment, a preliminary out-of-distribution (OOD) monitoring mechanism was implemented using the Isolation Forest algorithm [47] to detect unfamiliar inputs. This safeguard helped maintain the integrity of the predictions during live operation.
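A stripped-down version of such an endpoint with an Isolation Forest gate is sketched below. The route name, request schema, feature layout, and the dummy training data for the detector are hypothetical; the actual service returns recommendations from the trained RL policy and fits the detector on the training distribution.

```python
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel
from sklearn.ensemble import IsolationForest

app = FastAPI()

class OptimizationRequest(BaseModel):
    uncontrollable: list[float]    # e.g., two uncontrollable parameters
    current_controls: list[float]  # e.g., two controllable parameters

# OOD detector fitted here on dummy 4-dimensional data (2 + 2 features) for illustration.
ood_detector = IsolationForest(random_state=0).fit(np.random.rand(500, 4))

@app.post("/optimize")
def optimize(req: OptimizationRequest):
    features = np.array(req.uncontrollable + req.current_controls).reshape(1, -1)
    if ood_detector.predict(features)[0] == -1:   # -1 flags an outlier
        return {"status": "rejected", "reason": "input outside training distribution"}
    recommendation = req.current_controls         # placeholder for the RL agent's output
    return {"status": "ok", "recommended_controls": recommendation}
```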
Latency tests showed that the average response time for a prediction request was approximately 300 milliseconds, making the system suitable for near real-time applications in production workflows. Model training was conducted offline and did not interfere with the deployed inference system.

7. Conclusions

This research advances the field of process parameter optimization by presenting an end-to-end RL framework for multi-objective control. The proposed solution is demonstrated on two industrial use cases and supported by experiments using synthetically generated datasets. A primary contribution of this work is the extension of the MO-MPO algorithm to real-time multi-objective RL applications, particularly within the context of dynamic manufacturing systems.
A critical challenge in manufacturing is the ability to make optimal decisions against multiple quality metrics instantly as process conditions change. The RL-based framework presented here meets this demand by offering scalable, real-time optimization capabilities. It effectively bridges simulation-based training with production-level deployment, achieving a high level of operational maturity.

7.1. Key Contributions from the Study

1.
A generalizable multi-objective RL agent tailored for manufacturing control tasks, developed as an extension of the MO-MPO algorithm.
2.
Successful deployment and testing of the framework on two industrial use cases, including real and synthetic data.
3.
Implementation in a containerized production environment with REST API access, supporting computer-integrated manufacturing workflows.
4.
Empirical validation of solution quality using Cp (Process Capability Index) improvement and gradient-based Pareto optimality checks.
5.
Demonstration of robust performance in high-dimensional control spaces with varying objective trade-offs.

7.2. Limitations and Future Work

Despite its strengths, the proposed approach has certain limitations. First, the performance of the RL agent may deteriorate when faced with highly sparse or noisy data, particularly if the surrogate model lacks sufficient fidelity. Second, the use of gradient-based Pareto validation relies on the assumption of continuous differentiability, which may not be valid for categorical or hybrid control parameters without the use of approximation techniques. Third, the current framework does not support online learning; all retraining must be conducted offline, limiting the system’s adaptability in production environments that evolve over time.
These limitations point to important directions for future research. While a preliminary out-of-distribution (OOD) detection mechanism was implemented (cf. Figure 8), a more comprehensive analysis is planned. Future work will explore integrating RL methods with online model updating strategies, allowing the agent to adapt continuously without requiring full retraining cycles. This capability is expected to enhance the system’s responsiveness and robustness in dynamic production environments. Additionally, extending the optimization framework to handle mixed-variable inputs, combining both continuous and discrete control parameters such as machine modes or configuration settings, will enhance its applicability across a wider range of industrial use cases. These advancements are expected to improve both the robustness and flexibility of the framework in real-world, data-constrained manufacturing settings.

Author Contributions

Conceptualization, A.P. and N.Q.; methodology, A.P. and N.Q.; software, A.P., N.Q. and L.U.; validation, A.P., N.Q. and B.B.; formal analysis, A.P.; investigation, A.P. and N.Q.; resources, L.U.; data curation, L.U.; writing—original draft preparation, A.P., N.Q. and L.U.; writing—review and editing, A.P. and N.Q.; visualization, A.P. and L.U.; review and editing, D.W. and A.P.; supervision, R.H.S., B.B. and T.B.; project administration, R.H.S.; funding acquisition, not applicable. All authors have read and agreed to the published version of the manuscript.

Funding

This research was sponsored by the Federal Ministry of Education and Research in Germany (BMBF) within the collaborative research project IRLeQuM (Industrial Reinforcement Learning for Quality Control of Massive Forming Processes), with funding reference code 02P20A073. The research project did not fund the APC.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data used in this study are publicly available and can be accessed at the following GitHub repository: https://github.com/akshayparanjape/multi_rl_MDPI (accessed on 1 January 2020).

Acknowledgments

The authors would like to thank the industrial partner ESW (https://www.esw-group.eu/en/, accessed on 30 May 2025) for their support in data generation and the industrial testing of the implemented method on-premise. Special thanks are extended to Stefan Hoppe and Tobias Artmann from ESW for their contributions. The authors would also like to thank the ARES team at IconPro GmbH for their ongoing feedback and valuable discussion sessions throughout this research.

Conflicts of Interest

Authors Akshay Paranjape and Nahid Quader were employed by the company IconPro GmbH. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. Open-Source Data Explanation

There are a total of 15 output measurements in the open-source continuous factory process dataset. Not all of them were used, because several contain a large share of missing or non-measured values; the details are provided in Table A1. For the process parameter optimization experiments, the input parameters of the surrogate predictive model were categorized as either controllable or uncontrollable (static) based on engineering judgment. Two experimental scenarios, limited and extended controllability, were evaluated, as outlined in Section 5. A comprehensive classification of these parameters is provided in Table A2.
All typically controllable motor parameters were used as control inputs. To test the proposed algorithm’s flexibility and robustness, an extended set of parameters was added, including raw material feeder and temperature-related variables. Feeder parameters are generally adjustable in industrial settings. Temperature parameters, based on their naming convention (…Actual), may represent either sensor readings or setpoints, depending on the system. Due to the lack of clear documentation, the authors used engineering judgment to classify these variables. This extended configuration enabled broader experimental evaluation under more flexible control scenarios.
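The screening summarized in Table A1 and the grouping in Table A2 can be reproduced with a short script along the following lines; the CSV file name and the exact column labels are assumptions about the public dataset and may require adjustment.

```python
# Sketch of the appendix preprocessing (file name and substring filters are
# assumptions about the public dataset; adjust them to the actual column labels).
import pandas as pd

df = pd.read_csv("continuous_factory_process.csv")  # assumed file name

# Share of zero-valued entries per output measurement (cf. Table A1).
output_cols = [c for c in df.columns if "Output" in c and "Measurement" in c]
zero_share = (df[output_cols] == 0).mean().sort_values(ascending=False) * 100
print(zero_share.round(2))

# Grouping of candidate control inputs (cf. Table A2); filters are illustrative.
controllable = [c for c in df.columns if "MotorAmperage" in c or "MotorRPM" in c]
extended = [c for c in df.columns
            if "RawMaterialFeederParameter" in c
            or ("Zone" in c and "Temperature" in c)
            or ("CombinerOperation" in c and "Temperature" in c)]
```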
Table A1. Percentage of zero values for each output measurement.
Output Measurement | % Missing Values
Measurement5 | 95.12%
Measurement11 | 74.26%
Measurement7 | 62.19%
Measurement1 | 41.88%
Measurement14 | 35.53%
Measurement6 | 33.38%
Measurement12 | 22.63%
Measurement8 | 5.52%
Measurement9 | 5.12%
Measurement13 | 2.42%
Measurement10 | 1.90%
Measurement4 | 1.24%
Measurement3 | 0.96%
Measurement2 | 0.60%
Table A2. Control parameters used for experimental evaluation.
Parameter | Category | Reason for Inclusion
Machine1 MotorAmperage U Actual | Controllable | Direct motor control setting
Machine1 MotorRPM C Actual | Controllable | Direct motor speed setting
Machine2 MotorAmperage U Actual | Controllable | Direct motor control setting
Machine2 MotorRPM C Actual | Controllable | Direct motor speed setting
Machine3 MotorAmperage U Actual | Controllable | Direct motor control setting
Machine3 MotorRPM C Actual | Controllable | Direct motor speed setting
Machine1 RawMaterialFeederParameter U Actual | Extended Control | Feed rate typically adjustable by design
Machine2 RawMaterialFeederParameter U Actual | Extended Control | Feed rate typically adjustable by design
Machine3 RawMaterialFeederParameter U Actual | Extended Control | Feed rate typically adjustable by design
Machine1 Zone1Temperature C Actual | Extended Control * | Likely a measured variable; depends on implementation
Machine1 Zone2Temperature C Actual | Extended Control * | Likely a measured variable; depends on implementation
Machine2 Zone1Temperature C Actual | Extended Control * | Likely a measured variable; depends on implementation
Machine2 Zone2Temperature C Actual | Extended Control * | Likely a measured variable; depends on implementation
Machine3 Zone1Temperature C Actual | Extended Control * | Likely a measured variable; depends on implementation
Machine3 Zone2Temperature C Actual | Extended Control * | Likely a measured variable; depends on implementation
FirstStage CombinerOperation Temperature1 U Actual | Extended Control * | Likely a measured variable; depends on system control design
FirstStage CombinerOperation Temperature2 U Actual | Extended Control * | Likely a measured variable; depends on system control design
FirstStage CombinerOperation Temperature3 C Actual | Extended Control * | Likely a measured variable; depends on system control design
* These parameters were included under the extended control setting to test the robustness of the proposed framework. The authors do not claim whether these parameters are definitively controllable or uncontrollable; their inclusion is intended solely for experimental comparison, based on informed assumptions in the absence of definitive guidance from the dataset provider.

References

  1. Weber, C.; Moslehi, B.; Dutta, M. An Integrated Framework for Yield Management and Defect/Fault Reduction. IEEE Trans. Semicond. Manuf. 1995, 8, 110–120. [Google Scholar] [CrossRef]
  2. Magnanini, M.C.; Demir, O.; Colledani, M.; Tolio, T. Performance Evaluation of Multi-Stage Manufacturing Systems Operating under Feedback and Feedforward Quality Control Loops. CIRP Ann. 2024, 73, 349–352. [Google Scholar] [CrossRef]
  3. Gu, W.; Li, Y.; Tang, D.; Wang, X.; Yuan, M. Using Real-Time Manufacturing Data to Schedule a Smart Factory via Reinforcement Learning. Comput. Ind. Eng. 2022, 171, 108406. [Google Scholar] [CrossRef]
  4. Weichert, D.; Link, P.; Stoll, A.; Rüping, S.; Ihlenfeldt, S.; Wrobel, S. A Review of Machine Learning for the Optimization of Production Processes. Int. J. Adv. Manuf. Technol. 2019, 104, 1889–1902. [Google Scholar] [CrossRef]
  5. Panzer, M.; Bender, B. Deep Reinforcement Learning in Production Systems: A Systematic Literature Review. Int. J. Prod. Res. 2022, 60, 4316–4341. [Google Scholar] [CrossRef]
  6. Paranjape, A.; Plettenberg, N.; Ohlenforst, M.; Schmitt, R.H. Reinforcement Learning for Quality-Oriented Production Process Parameter Optimization Based on Predictive Models. Adv. Transdiscipl. Eng. 2023, 35, 327–344. [Google Scholar] [CrossRef]
  7. Pavlovic, A.; Sintoni, D.; Fragassa, C.; Minak, G. Multi-Objective Design Optimization of the Reinforced Composite Roof in a Solar Vehicle. Appl. Sci. 2020, 10, 2665. [Google Scholar] [CrossRef]
  8. Khdoudi, A.; Masrour, T.; El Hassani, I.; El Mazgualdi, C. A Deep-Reinforcement-Learning-Based Digital Twin for Manufacturing Process Optimization. Systems 2024, 12, 38. [Google Scholar] [CrossRef]
  9. Zhao, J.; Zhang, X.; Wang, Y.; Wang, W.; Liu, Y. Reinforcement Learning for Process Optimization in Chemical Engineering. Processes 2020, 8, 1497. [Google Scholar] [CrossRef]
  10. Li, H.; Liu, Z.; Zhang, Y.; Zhang, J.; Wang, Y. Reinforcement Learning-Based Adaptive Mechanisms for Metaheuristics: A Case with PSO. arXiv 2022, arXiv:2206.00835. [Google Scholar] [CrossRef]
  11. Guo, F.; Zhou, X.; Liu, J.; Zhang, Y.; Li, D.; Zhou, H. A Reinforcement Learning Decision Model for Online Process Parameters Optimization from Offline Data in Injection Molding. Appl. Soft Comput. 2019, 85, 105828. [Google Scholar] [CrossRef]
  12. Zimmerling, C.; Poppe, C.; Kärger, L. Estimating Optimum Process Parameters in Textile Draping of Variable Part Geometries—A Reinforcement Learning Approach. Procedia Manuf. 2020, 47, 847–854. [Google Scholar] [CrossRef]
  13. He, Z.; Tran, K.P.; Thomassey, S.; Zeng, X.; Xu, J.; Yi, C. Multi-Objective Optimization of the Textile Manufacturing Process Using Deep-Q-Network Based Multi-Agent Reinforcement Learning. J. Manuf. Syst. 2022, 62, 939–949. [Google Scholar] [CrossRef]
  14. Le Quang, T.; Meylan, B.; Masinelli, G.; Saeidi, F.; Shevchik, S.A.; Vakili Farahani, F.; Wasmer, K. Smart Closed-Loop Control of Laser Welding Using Reinforcement Learning. Procedia CIRP 2022, 111, 479–483. [Google Scholar] [CrossRef]
  15. Zhao, X.; Li, C.; Tang, Y.; Li, X.; Chen, X. Reinforcement Learning-Based Cutting Parameter Dynamic Decision Method Considering Tool Wear for a Turning Machining Process. Int. J. Precis. Eng. Manuf. Green Technol. 2024, 11, 1053–1070. [Google Scholar] [CrossRef]
  16. Huang, C.; Su, Y.; Chang, K. Camshaft Grinding Optimization Using Graph Neural Networks and Multi-Agent RL. J. Manuf. Process. 2022, 75, 210–220. [Google Scholar] [CrossRef]
  17. Ballard, N.; Farajzadehahary, K.; Hamzehlou, S.; Mori, U.; Asua, J.M. Reinforcement Learning for the Optimization and Online Control of Emulsion Polymerization Reactors: Particle Morphology. Comput. Chem. Eng. 2024, 187, 108739. [Google Scholar] [CrossRef]
  18. Marcineková, K.; Janáková Sujová, A. Multi-Objective Optimization of Manufacturing Process Using Artificial Neural Networks. Systems 2024, 12, 569. [Google Scholar] [CrossRef]
  19. Vujovic, A.; Krivokapic, Z.; Grujicic, R.; Jovanovic, J.; Pavlovic, A. Combining FEM and Neural Networking in the Design of Optimization of Traditional Montenegrin Chair. FME Trans. 2016, 44, 374–379. [Google Scholar] [CrossRef]
  20. Deshmukh, S.S.; Thakare, S.R. Optimization of Heat Treatment Process for Pinion by Using Taguchi Technique: A Case Study. Int. J. Eng. Res. Appl. 2012, 2, 592–598. Available online: https://www.ijera.com/papers/Vol2_issue6/CH26592598.pdf (accessed on 21 June 2025).
  21. Sun, S.; Wang, S.; Wang, Y.; Lim, T.C.; Yang, Y. Prediction and optimization of hobbing gear geometric deviations. Mech. Mach. Theory 2018, 120, 288–301. [Google Scholar] [CrossRef]
  22. Chen, X.; Li, X.; Li, Z.; Cao, W.; Zhang, Y.; Ni, J.; Wu, D.; Wang, Y. Control parameter optimization of dry hobbing under user evaluation. J. Manuf. Process. 2025, 133, 46–54. [Google Scholar] [CrossRef]
  23. Deptula, A.; Osinski, P. Optimization of gear pump operating parameters using genetic algorithms and performance analysis. Adv. Sci. Technol. Res. J. 2025, 19, 211–227. [Google Scholar] [CrossRef] [PubMed]
  24. Kamratowski, M.; Mazak, J.; Brimmers, J.; Bergs, T. Process and tool design optimization for hypoid gears with the help of the manufacturing simulation BevelCut. Procedia CIRP 2024, 126, 525–530. [Google Scholar] [CrossRef]
  25. Simufact Engineering GmbH. Simufact Forming [Computer Software]. Version 2023.1, MSC Software, Hexagon AB. 2023. Available online: https://www.simufact.com (accessed on 30 May 2025).
  26. Forrester, A.I.J.; Keane, A.J. Recent Advances in Surrogate-Based Optimization. Progress Aerosp. Sci. 2009, 45, 50–79. [Google Scholar] [CrossRef]
  27. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  28. Abdolmaleki, A.; Huang, S.; Hasenclever, L.; Neunert, M.; Song, F.; Zambelli, M.; Martins, M.; Heess, N.; Hadsell, R.; Riedmiller, M. A distributional view on multi-objective policy optimization. In Proceedings of the 37th International Conference on Machine Learning, Virtual Event, 13–18 July 2020; Daumé III, H., Singh, A., Eds.; Proceedings of Machine Learning Research, PMLR. 2020; Volume 119, pp. 11–22. Available online: https://proceedings.mlr.press/v119/abdolmaleki20a.html (accessed on 21 June 2025).
  29. Hoffman, M.W.; Shahriari, B.; Aslanides, J.; Barth-Maron, G.; Momchev, N.; Sinopalnikov, D.; Stańczyk, P.; Ramos, S.; Raichuk, A.; Vincent, D.; et al. Acme: A Research Framework for Distributed Reinforcement Learning. arXiv 2022, arXiv:2006.00979. [Google Scholar]
  30. Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; Dormann, N. Stable-Baselines3: Reliable Reinforcement Learning Implementations. 2021. Available online: https://github.com/DLR-RM/stable-baselines3 (accessed on 22 June 2025).
  31. Li, M.; Bi, Z.; Wang, T.; Wen, Y.; Niu, Q.; Liu, J.; Peng, B.; Zhang, S.; Pan, X.; Xu, J.; et al. Deep Learning and Machine Learning with GPGPU and CUDA: Unlocking the Power of Parallel Computing. arXiv 2024, arXiv:2410.05686. Available online: https://arxiv.org/abs/2410.05686 (accessed on 22 June 2025).
  32. Li, S.; Xu, Y. Understanding the GPU Hardware Efficiency for Deep Learning. arXiv 2020, arXiv:2005.08803. [Google Scholar]
  33. Munos, R.; Stepleton, T.; Harutyunyan, A.; Bellemare, M. Safe and Efficient Off-Policy Reinforcement Learning. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems (NeurIPS), Barcelona, Spain, 5–10 December 2016. [Google Scholar] [CrossRef]
  34. Lin, L.-J. Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching. Mach. Learn. 1992, 8, 293–321. [Google Scholar] [CrossRef]
  35. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; Volume 48, pp. 1928–1937. Available online: https://proceedings.mlr.press/v48/mniha16.html (accessed on 22 June 2025).
  36. Ng, A.Y.; Harada, D.; Russell, S. Policy Invariance under Reward Transformations: Theory and Application to Reward Shaping. In Proceedings of the 16th International Conference on Machine Learning (ICML), Bled, Slovenia, 27–30 June 1999. [Google Scholar]
  37. Roijers, D.M.; Vamplew, P.; Whiteson, S.; Dazeley, R. A Survey of Multi-Objective Sequential Decision-Making. J. Artif. Intell. Res. 2013, 48, 67–113. [Google Scholar] [CrossRef]
  38. Abdolmaleki, A.; Springenberg, J.T.; Tassa, Y.; Munos, R.; Heess, N.; Riedmiller, M. Maximum a Posteriori Policy Optimisation. arXiv 2018, arXiv:1806.06920. [Google Scholar] [CrossRef]
  39. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous Control with Deep Reinforcement Learning. arXiv 2015, arXiv:1509.02971. [Google Scholar] [CrossRef]
  40. Pascanu, R.; Mikolov, T.; Bengio, Y. On the Difficulty of Training Recurrent Neural Networks. In Proceedings of the 30th International Conference on Machine Learning (ICML), Atlanta, GA, USA, 16–21 June 2013. [Google Scholar] [CrossRef]
  41. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  42. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar] [CrossRef]
  43. Miettinen, K. Nonlinear Multiobjective Optimization; Springer: Boston, MA, USA, 1999. [Google Scholar] [CrossRef]
  44. Montgomery, D.C. Introduction to Statistical Quality Control, 7th ed.; Wiley: Hoboken, NJ, USA, 2012. [Google Scholar]
  45. Docker. Empowering App Development for Developers. 2013. Available online: https://www.docker.com (accessed on 22 June 2025).
  46. Ramírez, S. FastAPI. 2018. Available online: https://fastapi.tiangolo.com (accessed on 22 June 2025).
  47. Liu, F.T.; Ting, K.M.; Zhou, Z.-H. Isolation-Based Anomaly Detection. In Proceedings of the 2008 IEEE International Conference on Data Mining (ICDM), Pisa, Italy, 15–19 December 2008; pp. 413–422. [Google Scholar] [CrossRef]
Figure 1. Setup of the FE model (a), detailed view of the meshing of the workpiece (b), validation of the simulation model (c), and deviation of the prediction of the GPR from the simulated flange thickness as well as the coefficient of determination of the GPR over the cycles of Bayesian optimization (d).
Figure 2. Image of a manufactured pinion gear with integrated ball bearings and internal splines, designed for high-performance mechanical systems. Image © ESW GmbH, used with permission. https://www.esw-group.eu/fertigungsverfahren/praegetechnik/ (accessed on 30 March 2025).
Figure 3. Components of an RL system and their interactions [27]. R_t, S_t, and A_t represent the reward, state, and action taken at time t, respectively.
Figure 4. Taxonomy of RL algorithms, categorized into model-free and model-based methods [27]. Common algorithm names are abbreviated.
Figure 5. Visualization of the RL agent’s performance on the pinion manufacturing dataset. The plot displays the true target values, the agent’s predicted outcomes based on the suggested control parameters, and the optimal point (red cross) in the radius–thickness design space.
Figure 6. Contour representation of two continuous objective functions and their intersecting feasible half-spaces in a multi-objective optimization setting.
Figure 7. Virtual experiments on synthetic datasets with increasing control parameter dimensions. (a) Reward trends for two targets with rising control parameters. (b) Reward trends for five targets with rising control parameters.
Figure 8. Architecture of the RL optimization engine for process parameter optimization. The trained RL agent interacts with a surrogate process model to generate optimized control parameters. The engine receives current process states via a REST API and suggests control parameters to the process engineer through the same interface.
Table 1. Experimental design parameters and their corresponding levels.
Factor | Symbol | Levels
Machine stiffness | k | Low, Medium, High
Stroke | h | h_1, (h_1 + Δh/2), h_2
Initial tool temperature | T_tool_0 | 20 °C, 100 °C, 200 °C
Initial flange thickness | b_Fl_0 | 3.0 mm, 3.5 mm, 4.0 mm
Table 2. Description of parameters simulated using Simufact.
Process Parameter [Unit] | Description
Stroke | Stroke applied during the forming process
Tool temperature | Temperature of the tool during forming
Machine stiffness | Machine stiffness at room temperature
Flange thickness | Initial thickness of the flange
Output Forces | Twelve force values recorded at equal stroke intervals
Output Flange thickness | Final flange thickness of the product
Output Flange radius | Final flange radius of the product
Table 3. Hyperparameters used for the actor and critic networks of the RL agent.
Hyperparameter | Value
Actor network architecture | (256, 256, 256)
Critic network architecture | (256, 256, 256)
Batch size | 128
ϵ_μ (mean KL bound) | 0.0001
ϵ_Σ (covariance KL bound) | 0.0001
Discount factor γ | 0.99
Number of episodes | 1500
Episode length | 500
Optimizer | Adam
Activation function | ELU
Target update period | 100
Prediction step | 200
Objective priorities | 0.1 for all objectives
Stability constant | 200
Table 4. Categorization and count of input parameters used in the continuous-flow manufacturing dataset, grouped by their functional role in the process.
Anonymized Parameter Category | Count
Ambient Conditions | 2
Raw Material | 15
Temperature | 15
Pressure | 3
Motor | 6
Combiner Operation | 3
Total Input Parameters | 44
Table 5. Performance of the individual surrogate models and the mean obtained reward on the test split of the open-source dataset.
Target | Process Model r²-Score | Reward (Limited Control) | Reward (Extended Control)
Measurement 1 | 0.94 | 0.73 | 0.69
Measurement 2 | 0.92 | 0.55 | 0.86
Measurement 3 | 0.93 | 0.78 | 0.86
Measurement 4 | 0.79 | 0.88 | 0.91
Measurement 5 | 0.95 | 0.78 | 0.88
