Reinforcement Learning Agent for Multi-Objective Online Process Parameter Optimization of Manufacturing Processes
Abstract
Featured Application
1. Introduction
1.1. Outline of the Article and Research Contributions
- Algorithmic Development: A multi-objective RL framework specifically adapted for manufacturing control tasks, developed by extending the Multi-Objective Maximum a Posteriori Optimization (MO-MPO) algorithm.
- Industrial Validation: Validation of the proposed framework across two distinct industrial use cases, utilizing both real-world and synthetically generated datasets.
- Empirical Evaluation: Demonstration of effectiveness through improved Process Capability Index (Cp) values and Pareto front approximations, verified using a gradient-based half-space analysis technique.
- Production-Ready Implementation: Deployment of the system as a containerized service with REST API integration, enabling seamless incorporation into computer-integrated manufacturing workflows.
1.2. Related Work
2. Problem Statement
3. Process Simulation
3.1. Finite Element Method-Based Process Model
3.2. Machine Learning-Based Process Model
4. Methodology
4.1. Background: Reinforcement Learning
4.1.1. Model Selection
1. Stable EM-Style Updates: MPO employs a two-step update procedure (E-step and M-step), reminiscent of the Expectation–Maximization algorithm. This structure provides smooth and stable policy updates, which is especially advantageous in both low- and high-dimensional action spaces by mitigating the risk of destructive gradient updates (a minimal sketch of this two-step update follows the list).
2. KL-Constrained and Off-Policy Learning: MPO enforces a Kullback–Leibler (KL) divergence constraint to limit how far the updated policy can deviate from the current one. This constraint, combined with off-policy learning via experience replay, enables efficient learning in complex environments without requiring new data at each iteration.
3. Strong Performance in Continuous Control Tasks: Empirical studies have shown that MPO performs well in continuous and high-dimensional tasks such as robotics and simulated control environments. Its robust theoretical foundation and modular architecture also allow it to generalize effectively to simpler problems.
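The following PyTorch-style sketch illustrates the E-step/M-step structure described in point 1, under simplifying assumptions: a Gaussian policy, a single scalar critic, and a fixed temperature. The names `policy`, `critic`, and `eta` are illustrative and do not reflect the authors' implementation.

```python
# Minimal sketch of MPO's two-step policy improvement (E-step / M-step).
# Assumes `policy(states)` returns a torch.distributions.Independent Gaussian
# and `critic(states, actions)` returns Q-values broadcast over sampled actions.
import torch
import torch.distributions as td

def mpo_policy_improvement_loss(policy, critic, states, eta=1.0, eps_kl=1e-4, n_samples=20):
    with torch.no_grad():
        old_dist = policy(states)                       # current policy pi_old(a|s)
        actions = old_dist.sample((n_samples,))         # [N, B, action_dim]
        q_values = critic(states, actions)              # [N, B]
        # E-step: non-parametric target q(a|s) proportional to
        # pi_old(a|s) * exp(Q(s,a) / eta), realized here as softmax weights
        # over the sampled actions.
        weights = torch.softmax(q_values / eta, dim=0)  # [N, B]

    # M-step: fit the parametric policy to the target by weighted maximum
    # likelihood, i.e. minimize KL(q || pi_theta) over the sampled actions.
    new_dist = policy(states)
    log_probs = new_dist.log_prob(actions)              # [N, B]
    ml_loss = -(weights * log_probs).sum(dim=0).mean()

    # Soft trust region: penalize deviation from the old policy beyond the
    # KL bound eps_kl (the original algorithm enforces this as a hard
    # constraint via Lagrange multipliers).
    kl = td.kl_divergence(old_dist, new_dist).mean()
    return ml_loss + (kl - eps_kl).clamp(min=0.0)
```

In practice, the cited Acme implementation of MPO uses separate KL bounds for the policy mean and covariance (cf. the hyperparameter table in the Appendix) and learns the temperature and Lagrange multipliers rather than fixing them.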
4.1.2. Model Architecture
- $\pi_\theta(a \mid s)$: Parametric actor policy.
- $q(a \mid s)$: Target distribution computed from the weighted sum of Q-functions.
- $\mathrm{KL}\big(q(a \mid s)\,\|\,\pi_\theta(a \mid s)\big)$: KL divergence between the target and the current policy.
- $\mathbb{E}_{s \sim \mathcal{D}}$ represents the expected value over the state distribution $s \sim \mathcal{D}$, where $\mathcal{D}$ denotes the dataset containing sampled states.
- $\mathcal{L}(\theta)$ denotes the loss function (a reconstructed form of the full objective is given below).
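Putting the symbols above together, the policy-improvement objective takes the following form; this is a reconstruction consistent with the MO-MPO formulation of Abdolmaleki et al. and with the description of $q$ as a weighted sum of Q-functions, not a verbatim copy of the article's equation (the temperature $\eta$ and objective weights $w_k$ are notation introduced here):

$$
q(a \mid s) \;\propto\; \pi_{\mathrm{old}}(a \mid s)\, \exp\!\left(\frac{1}{\eta}\sum_{k} w_k\, Q_k(s, a)\right),
\qquad
\mathcal{L}(\theta) \;=\; \mathbb{E}_{s \sim \mathcal{D}}\Big[\mathrm{KL}\big(q(a \mid s)\,\big\|\,\pi_\theta(a \mid s)\big)\Big].
$$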
4.2. Model Training
4.2.1. Reward Modeling
4.2.2. Training Setup
- Mean reward: The average reward obtained across all optimized faulty samples.
- Stability: The consistency of predicted outputs across multiple evaluation trials (one possible computation of both metrics is sketched after this list).
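As a purely illustrative sketch, the two metrics could be computed as follows; here "stability" is interpreted as the spread of the optimized outputs across repeated trials, which is one possible reading of the qualitative description above, and `optimize_fn` is a hypothetical callable wrapping the trained agent.

```python
# Hypothetical evaluation helper for the two metrics described above.
import numpy as np

def evaluate_agent(optimize_fn, faulty_samples, n_trials=10):
    """optimize_fn(sample) -> (reward, predicted_output) for one faulty sample."""
    rewards, outputs = [], []
    for _ in range(n_trials):
        trial = [optimize_fn(s) for s in faulty_samples]
        rewards.append(np.mean([r for r, _ in trial]))
        outputs.append([o for _, o in trial])

    mean_reward = float(np.mean(rewards))                             # average over samples and trials
    stability = float(np.mean(np.std(np.asarray(outputs), axis=0)))   # spread across trials
    return mean_reward, stability
```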
4.2.3. Training Stability
5. Experiments and Results
- A pinion manufacturing dataset generated using Simufact simulations (cf. Section 3.1).
- A continuous-flow manufacturing dataset available at https://github.com/nicolasj92/industrial-ml-datasets (accessed on 30 March 2025).
5.1. Results on Pinion Manufacturing Dataset
5.2. Results on Open-Source Dataset
5.3. Validation
5.3.1. Pareto-Optimal Validation
1. Gradient calculation: The negative direction of the gradient of an objective loss function at the current solution indicates the direction of improvement. The first step is to calculate the negative gradient of the loss function for all objectives at the current solution $x$, which is $-\nabla f_i(x)$ for each objective $i = 1, \dots, m$.
2. Objective-wise improvement half-spaces: If a direction $z$ exists such that the dot product between $z$ and the negative gradient is greater than zero, it means that for objective $i$ there is a direction of improvement. The set of all such directions constitutes the half-space $H_i = \{ z \in \mathbb{R}^n : \langle z, -\nabla f_i(x) \rangle > 0 \}$.
3. Intersection of half-spaces: Considering all objectives, if the intersection of half-spaces is not empty, any element of this intersection represents a direction that improves all objectives. Thus, $x$ cannot be a Pareto-optimal solution. Let us define the intersection of half-spaces as $H = \bigcap_{i=1}^{m} H_i$.
- If $H \neq \emptyset$, a single direction exists that decreases every $f_i$ simultaneously; hence, $x$ is not Pareto-optimal, as all objectives can be improved at the current solution.
- If $H = \emptyset$, no direction exists that can improve all objectives at once; consequently, $x$ satisfies the necessary condition for Pareto optimality in the continuous subspace (a numerical sketch of this check follows the list).
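The existence of a common improvement direction can be checked numerically with a small linear program: maximize a margin $t$ subject to $\langle z, -\nabla f_i(x) \rangle \ge t$ for all objectives; a strictly positive optimum means the intersection of half-spaces is non-empty. The SciPy-based sketch below is an illustrative implementation of this check under that formulation, not the authors' exact procedure.

```python
# Sketch of the half-space check: a common improvement direction z with
# <z, -grad f_i(x)> > 0 for all i exists iff the optimal margin t is positive.
import numpy as np
from scipy.optimize import linprog

def has_common_descent_direction(gradients, tol=1e-9):
    """gradients: array of shape (m, n), rows are grad f_i(x)."""
    g = np.asarray(gradients, dtype=float)
    m, n = g.shape
    # Variables: [z_1, ..., z_n, t]; maximize t  <=>  minimize -t.
    c = np.zeros(n + 1)
    c[-1] = -1.0
    # Constraints: grad f_i(x) . z + t <= 0, i.e. <z, -grad f_i(x)> >= t.
    A_ub = np.hstack([g, np.ones((m, 1))])
    b_ub = np.zeros(m)
    bounds = [(-1.0, 1.0)] * n + [(0.0, 1.0)]        # bound z to keep the LP finite
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return bool(res.success and res.x[-1] > tol)     # True -> x is NOT Pareto-stationary

# Conflicting objectives -> no common descent direction (candidate may be Pareto-optimal).
print(has_common_descent_direction([[1.0, 0.0], [-1.0, 0.0]]))   # False
# Aligned objectives -> a common descent direction exists (candidate is not Pareto-optimal).
print(has_common_descent_direction([[1.0, 0.0], [0.5, 0.2]]))    # True
```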
5.3.2. Validation Through Improvement in Cp Values
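For reference, the Process Capability Index follows its standard definition from statistical quality control (cf. Montgomery), where USL and LSL are the upper and lower specification limits of the quality characteristic and $\sigma$ is the process standard deviation:

$$
C_p = \frac{\mathrm{USL} - \mathrm{LSL}}{6\sigma}.
$$

An increase in $C_p$ after applying the recommended parameters therefore indicates a tighter process spread relative to the specification window.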
5.4. Virtual Experiments
6. Integration in Production
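The references list Docker and FastAPI for the containerized REST deployment; the sketch below shows what such an endpoint could look like. The route name, payload fields, and the `recommend_parameters` helper are hypothetical and not taken from the article.

```python
# Minimal sketch of a REST endpoint exposing the trained agent via FastAPI.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Process parameter optimization agent")

class ProcessState(BaseModel):
    measurements: list[float]   # current (faulty) quality measurements
    parameters: list[float]     # current process parameter settings

class Recommendation(BaseModel):
    parameters: list[float]     # optimized process parameter settings

def recommend_parameters(state: ProcessState) -> list[float]:
    # Placeholder for loading the trained policy and running inference.
    return state.parameters

@app.post("/optimize", response_model=Recommendation)
def optimize(state: ProcessState) -> Recommendation:
    return Recommendation(parameters=recommend_parameters(state))

# Run with: uvicorn service:app --host 0.0.0.0 --port 8000
```

A container image would typically bundle this service together with the trained policy weights and expose the port through Docker.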
7. Conclusions
7.1. Key Contributions from the Study
1. A generalizable multi-objective RL agent tailored for manufacturing control tasks, developed as an extension of the MO-MPO algorithm.
2. Successful deployment and testing of the framework on two industrial use cases, including real and synthetic data.
3. Implementation in a containerized production environment with REST API access, supporting computer-integrated manufacturing workflows.
4. Empirical validation of solution quality using Cp (Process Capability Index) improvement and gradient-based Pareto optimality checks.
5. Demonstration of robust performance in high-dimensional control spaces with varying objective trade-offs.
7.2. Limitations and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Open-Source Data Explanation
Output Measurement | % Missing Values |
---|---|
Measurement5 | 95.12% |
Measurement11 | 74.26% |
Measurement7 | 62.19% |
Measurement1 | 41.88% |
Measurement14 | 35.53% |
Measurement6 | 33.38% |
Measurement12 | 22.63% |
Measurement8 | 5.52% |
Measurement9 | 5.12% |
Measurement13 | 2.42% |
Measurement10 | 1.90% |
Measurement4 | 1.24% |
Measurement3 | 0.96% |
Measurement2 | 0.60% |
Parameter | Category | Reason for Inclusion |
---|---|---|
Machine1 MotorAmperage U Actual | Controllable | Direct motor control setting |
Machine1 MotorRPM C Actual | Controllable | Direct motor speed setting |
Machine2 MotorAmperage U Actual | Controllable | Direct motor control setting |
Machine2 MotorRPM C Actual | Controllable | Direct motor speed setting |
Machine3 MotorAmperage U Actual | Controllable | Direct motor control setting |
Machine3 MotorRPM C Actual | Controllable | Direct motor speed setting |
Machine1 RawMaterialFeederParameter U Actual | Extended Control | Feed rate typically adjustable by design |
Machine2 RawMaterialFeederParameter U Actual | Extended Control | Feed rate typically adjustable by design |
Machine3 RawMaterialFeederParameter U Actual | Extended Control | Feed rate typically adjustable by design |
Machine1 Zone1Temperature C Actual | Extended Control * | Likely a measured variable; depends on implementation |
Machine1 Zone2Temperature C Actual | Extended Control * | Likely a measured variable; depends on implementation |
Machine2 Zone1Temperature C Actual | Extended Control * | Likely a measured variable; depends on implementation |
Machine2 Zone2Temperature C Actual | Extended Control * | Likely a measured variable; depends on implementation |
Machine3 Zone1Temperature C Actual | Extended Control * | Likely a measured variable; depends on implementation |
Machine3 Zone2Temperature C Actual | Extended Control * | Likely a measured variable; depends on implementation |
FirstStage CombinerOperation Temperature1 U Actual | Extended Control * | Likely a measured variable; depends on system control design |
FirstStage CombinerOperation Temperature2 U Actual | Extended Control * | Likely a measured variable; depends on system control design |
FirstStage CombinerOperation Temperature3 C Actual | Extended Control * | Likely a measured variable; depends on system control design |
References
- Weber, C.; Moslehi, B.; Dutta, M. An Integrated Framework for Yield Management and Defect/Fault Reduction. IEEE Trans. Semicond. Manuf. 1995, 8, 110–120. [Google Scholar] [CrossRef]
- Magnanini, M.C.; Demir, O.; Colledani, M.; Tolio, T. Performance Evaluation of Multi-Stage Manufacturing Systems Operating under Feedback and Feedforward Quality Control Loops. CIRP Ann. 2024, 73, 349–352. [Google Scholar] [CrossRef]
- Gu, W.; Li, Y.; Tang, D.; Wang, X.; Yuan, M. Using Real-Time Manufacturing Data to Schedule a Smart Factory via Reinforcement Learning. Comput. Ind. Eng. 2022, 171, 108406. [Google Scholar] [CrossRef]
- Weichert, D.; Link, P.; Stoll, A.; Rüping, S.; Ihlenfeldt, S.; Wrobel, S. A Review of Machine Learning for the Optimization of Production Processes. Int. J. Adv. Manuf. Technol. 2019, 104, 1889–1902. [Google Scholar] [CrossRef]
- Panzer, M.; Bender, B. Deep Reinforcement Learning in Production Systems: A Systematic Literature Review. Int. J. Prod. Res. 2022, 60, 4316–4341. [Google Scholar] [CrossRef]
- Paranjape, A.; Plettenberg, N.; Ohlenforst, M.; Schmitt, R.H. Reinforcement Learning for Quality-Oriented Production Process Parameter Optimization Based on Predictive Models. Adv. Transdiscipl. Eng. 2023, 35, 327–344. [Google Scholar] [CrossRef]
- Pavlovic, A.; Sintoni, D.; Fragassa, C.; Minak, G. Multi-Objective Design Optimization of the Reinforced Composite Roof in a Solar Vehicle. Appl. Sci. 2020, 10, 2665. [Google Scholar] [CrossRef]
- Khdoudi, A.; Masrour, T.; El Hassani, I.; El Mazgualdi, C. A Deep-Reinforcement-Learning-Based Digital Twin for Manufacturing Process Optimization. Systems 2024, 12, 38. [Google Scholar] [CrossRef]
- Zhao, J.; Zhang, X.; Wang, Y.; Wang, W.; Liu, Y. Reinforcement Learning for Process Optimization in Chemical Engineering. Processes 2020, 8, 1497. [Google Scholar] [CrossRef]
- Li, H.; Liu, Z.; Zhang, Y.; Zhang, J.; Wang, Y. Reinforcement Learning-Based Adaptive Mechanisms for Metaheuristics: A Case with PSO. arXiv 2022, arXiv:2206.00835. [Google Scholar] [CrossRef]
- Guo, F.; Zhou, X.; Liu, J.; Zhang, Y.; Li, D.; Zhou, H. A Reinforcement Learning Decision Model for Online Process Parameters Optimization from Offline Data in Injection Molding. Appl. Soft Comput. 2019, 85, 105828. [Google Scholar] [CrossRef]
- Zimmerling, C.; Poppe, C.; Kärger, L. Estimating Optimum Process Parameters in Textile Draping of Variable Part Geometries—A Reinforcement Learning Approach. Procedia Manuf. 2020, 47, 847–854. [Google Scholar] [CrossRef]
- He, Z.; Tran, K.P.; Thomassey, S.; Zeng, X.; Xu, J.; Yi, C. Multi-Objective Optimization of the Textile Manufacturing Process Using Deep-Q-Network Based Multi-Agent Reinforcement Learning. J. Manuf. Syst. 2022, 62, 939–949. [Google Scholar] [CrossRef]
- Le Quang, T.; Meylan, B.; Masinelli, G.; Saeidi, F.; Shevchik, S.A.; Vakili Farahani, F.; Wasmer, K. Smart Closed-Loop Control of Laser Welding Using Reinforcement Learning. Procedia CIRP 2022, 111, 479–483. [Google Scholar] [CrossRef]
- Zhao, X.; Li, C.; Tang, Y.; Li, X.; Chen, X. Reinforcement Learning-Based Cutting Parameter Dynamic Decision Method Considering Tool Wear for a Turning Machining Process. Int. J. Precis. Eng. Manuf. Green Technol. 2024, 11, 1053–1070. [Google Scholar] [CrossRef]
- Huang, C.; Su, Y.; Chang, K. Camshaft Grinding Optimization Using Graph Neural Networks and Multi-Agent RL. J. Manuf. Process. 2022, 75, 210–220. [Google Scholar] [CrossRef]
- Ballard, N.; Farajzadehahary, K.; Hamzehlou, S.; Mori, U.; Asua, J.M. Reinforcement Learning for the Optimization and Online Control of Emulsion Polymerization Reactors: Particle Morphology. Comput. Chem. Eng. 2024, 187, 108739. [Google Scholar] [CrossRef]
- Marcineková, K.; Janáková Sujová, A. Multi-Objective Optimization of Manufacturing Process Using Artificial Neural Networks. Systems 2024, 12, 569. [Google Scholar] [CrossRef]
- Vujovic, A.; Krivokapic, Z.; Grujicic, R.; Jovanovic, J.; Pavlovic, A. Combining FEM and Neural Networking in the Design of Optimization of Traditional Montenegrin Chair. FME Trans. 2016, 44, 374–379. [Google Scholar] [CrossRef]
- Deshmukh, S.S.; Thakare, S.R. Optimization of Heat Treatment Process for Pinion by Using Taguchi Technique: A Case Study. Int. J. Eng. Res. Appl. 2012, 2, 592–598. Available online: https://www.ijera.com/papers/Vol2_issue6/CH26592598.pdf (accessed on 21 June 2025).
- Sun, S.; Wang, S.; Wang, Y.; Lim, T.C.; Yang, Y. Prediction and optimization of hobbing gear geometric deviations. Mech. Mach. Theory 2018, 120, 288–301. [Google Scholar] [CrossRef]
- Chen, X.; Li, X.; Li, Z.; Cao, W.; Zhang, Y.; Ni, J.; Wu, D.; Wang, Y. Control parameter optimization of dry hobbing under user evaluation. J. Manuf. Process. 2025, 133, 46–54. [Google Scholar] [CrossRef]
- Deptula, A.; Osinski, P. Optimization of gear pump operating parameters using genetic algorithms and performance analysis. Adv. Sci. Technol. Res. J. 2025, 19, 211–227. [Google Scholar] [CrossRef] [PubMed]
- Kamratowski, M.; Mazak, J.; Brimmers, J.; Bergs, T. Process and tool design optimization for hypoid gears with the help of the manufacturing simulation BevelCut. Procedia CIRP 2024, 126, 525–530. [Google Scholar] [CrossRef]
- Simufact Engineering GmbH. Simufact Forming [Computer Software]. Version 2023.1, MSC Software, Hexagon AB. 2023. Available online: https://www.simufact.com (accessed on 30 May 2025).
- Forrester, A.I.J.; Keane, A.J. Recent Advances in Surrogate-Based Optimization. Progress Aerosp. Sci. 2009, 45, 50–79. [Google Scholar] [CrossRef]
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
- Abdolmaleki, A.; Huang, S.; Hasenclever, L.; Neunert, M.; Song, F.; Zambelli, M.; Martins, M.; Heess, N.; Hadsell, R.; Riedmiller, M. A distributional view on multi-objective policy optimization. In Proceedings of the 37th International Conference on Machine Learning, Virtual Event, 13–18 July 2020; Daumé III, H., Singh, A., Eds.; Proceedings of Machine Learning Research, PMLR. 2020; Volume 119, pp. 11–22. Available online: https://proceedings.mlr.press/v119/abdolmaleki20a.html (accessed on 21 June 2025).
- Hoffman, M.W.; Shahriari, B.; Aslanides, J.; Barth-Maron, G.; Momchev, N.; Sinopalnikov, D.; Stańczyk, P.; Ramos, S.; Raichuk, A.; Vincent, D.; et al. Acme: A Research Framework for Distributed Reinforcement Learning. arXiv 2022, arXiv:2006.00979. [Google Scholar]
- Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; Dormann, N. Stable-Baselines3: Reliable Reinforcement Learning Implementations. 2021. Available online: https://github.com/DLR-RM/stable-baselines3 (accessed on 22 June 2025).
- Li, M.; Bi, Z.; Wang, T.; Wen, Y.; Niu, Q.; Liu, J.; Peng, B.; Zhang, S.; Pan, X.; Xu, J.; et al. Deep Learning and Machine Learning with GPGPU and CUDA: Unlocking the Power of Parallel Computing. arXiv 2024, arXiv:2410.05686. Available online: https://arxiv.org/abs/2410.05686 (accessed on 22 June 2025).
- Li, S.; Xu, Y. Understanding the GPU Hardware Efficiency for Deep Learning. arXiv 2020, arXiv:2005.08803. [Google Scholar]
- Munos, R.; Stepleton, T.; Harutyunyan, A.; Bellemare, M. Safe and Efficient Off-Policy Reinforcement Learning. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems (NeurIPS), Barcelona, Spain, 5–10 December 2016. [Google Scholar] [CrossRef]
- Lin, L.-J. Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching. Mach. Learn. 1992, 8, 293–321. [Google Scholar] [CrossRef]
- Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; Volume 48, pp. 1928–1937. Available online: https://proceedings.mlr.press/v48/mniha16.html (accessed on 22 June 2025).
- Ng, A.Y.; Harada, D.; Russell, S. Policy Invariance under Reward Transformations: Theory and Application to Reward Shaping. In Proceedings of the 16th International Conference on Machine Learning (ICML), Bled, Slovenia, 27–30 June 1999. [Google Scholar]
- Roijers, D.M.; Vamplew, P.; Whiteson, S.; Dazeley, R. A Survey of Multi-Objective Sequential Decision-Making. J. Artif. Intell. Res. 2013, 48, 67–113. [Google Scholar] [CrossRef]
- Abdolmaleki, A.; Springenberg, J.T.; Tassa, Y.; Munos, R.; Heess, N.; Riedmiller, M. Maximum a Posteriori Policy Optimisation. arXiv 2018, arXiv:1806.06920. [Google Scholar] [CrossRef]
- Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous Control with Deep Reinforcement Learning. arXiv 2015, arXiv:1509.02971. [Google Scholar] [CrossRef]
- Pascanu, R.; Mikolov, T.; Bengio, Y. On the Difficulty of Training Recurrent Neural Networks. In Proceedings of the 30th International Conference on Machine Learning (ICML), Atlanta, GA, USA, 16–21 June 2013. [Google Scholar] [CrossRef]
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar] [CrossRef]
- Miettinen, K. Nonlinear Multiobjective Optimization; Springer: Boston, MA, USA, 1999. [Google Scholar] [CrossRef]
- Montgomery, D.C. Introduction to Statistical Quality Control, 7th ed.; Wiley: Hoboken, NJ, USA, 2012. [Google Scholar]
- Docker. Empowering App Development for Developers. 2013. Available online: https://www.docker.com (accessed on 22 June 2025).
- Ramírez, S. FastAPI. 2018. Available online: https://fastapi.tiangolo.com (accessed on 22 June 2025).
- Liu, F.T.; Ting, K.M.; Zhou, Z.-H. Isolation-Based Anomaly Detection. In Proceedings of the 2008 IEEE International Conference on Data Mining (ICDM), Pisa, Italy, 15–19 December 2008; pp. 413–422. [Google Scholar] [CrossRef]
Factor | Symbol | Levels |
---|---|---|
Machine stiffness | k | Low, Medium, High |
Stroke | h | , , |
Initial tool temperature | | 20 °C, 100 °C, 200 °C |
Initial flange thickness | | 3.0 mm, 3.5 mm, 4.0 mm |
Process Parameter [Unit] | Description |
---|---|
Stroke | Stroke applied during the forming process |
Tool temperature | Temperature of the tool during forming |
Machine stiffness | Machine stiffness at room temperature |
Flange thickness | Initial thickness of the flange |
Output Forces | Twelve force values recorded at equal stroke intervals |
Output Flange thickness | Final flange thickness of the product |
Output Flange radius | Final flange radius of the product |
Hyperparameter | Value |
---|---|
Actor network architecture | (256, 256, 256) |
Critic network architecture | (256, 256, 256) |
Batch size | 128 |
ε_μ (mean KL bound) | 0.0001
ε_Σ (covariance KL bound) | 0.0001
Discount factor | 0.99 |
Number of episodes | 1500 |
Episode length | 500 |
Optimizer | Adam |
Activation function | ELU |
Target update period | 100 |
Prediction step | 200 |
Objective priorities | 0.1 for all objectives |
Stability constant | 200 |
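Collected as a plain Python dictionary, the hyperparameters from the table above might look as follows; the key names are illustrative and do not correspond to a specific library's configuration schema.

```python
# Training hyperparameters from the table above as a plain configuration dict.
mo_mpo_config = {
    "actor_network": (256, 256, 256),
    "critic_network": (256, 256, 256),
    "batch_size": 128,
    "epsilon_mean": 1e-4,          # KL bound on the policy mean
    "epsilon_covariance": 1e-4,    # KL bound on the policy covariance
    "discount_factor": 0.99,
    "num_episodes": 1500,
    "episode_length": 500,
    "optimizer": "Adam",
    "activation": "ELU",
    "target_update_period": 100,
    "prediction_step": 200,
    "objective_priorities": 0.1,   # identical priority for all objectives
    "stability_constant": 200,
}
```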
Anonymized Parameter Category | Count |
---|---|
Ambient Conditions | 2 |
Raw Material | 15 |
Temperature | 15 |
Pressure | 3 |
Motor | 6 |
Combiner Operation | 3 |
Total Input Parameters | 44 |
Target | Process Model R²-Score | Reward (Limited Control) | Reward (Extended Control) |
---|---|---|---|
Measurement 1 | 0.94 | 0.73 | 0.69 |
Measurement 2 | 0.92 | 0.55 | 0.86 |
Measurement 3 | 0.93 | 0.78 | 0.86 |
Measurement 4 | 0.79 | 0.88 | 0.91 |
Measurement 5 | 0.95 | 0.78 | 0.88 |