Reducing Computational Time in Pixel-Based Path Planning for GMA-DED by Using Multi-Armed Bandit Reinforcement Learning Algorithm
Abstract
1. Introduction
2. AI Reinforcement Learning Background Applied to Path Planning Strategies in GMA-DED
3. Proposal of a Trajectory Planning Strategy Assisted by AI Reinforcement Learning Using a MAB: The Case of the Advanced-Pixel
3.1. The Enhanced-Pixel Strategy
Algorithm 1 Enhanced-Pixel Strategy

    INPUT: Number of loops t
    OUTPUT: BestValues matrix
    Initialize Matrix1[10][ ] ← empty
    Initialize BestValues[ ] ← empty
    Order nodes in X direction (AOx)
    Order nodes in Y direction (AOy)
    for i = 1 to t step 1 do
        Choose a random start node
        ▷ Generate trajectories along the AOx direction
        for each heuristic h ∈ {NNH, AH, BH, RCH, CH} do
            Generate trajectory with h starting from the random node
            Apply 2-opt optimisation to the trajectory
            Calculate distance d
            Store d in Matrix1
        end for
        ▷ Generate trajectories along the AOy direction
        for each heuristic h ∈ {NNH, AH, BH, RCH, CH} do
            Generate trajectory with h starting from the random node
            Apply 2-opt optimisation to the trajectory
            Calculate distance d
            Store d in Matrix1
        end for
        Extract the minimum distance dmin from Matrix1
        Store dmin in the BestValues matrix
    end for
    return BestValues
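To make the control flow of Algorithm 1 concrete, a minimal Python sketch is provided below. It reproduces only the bookkeeping of the Enhanced-Pixel loop; `build_trajectory`, `two_opt`, and `trajectory_length` are hypothetical placeholders for the heuristic constructions (NNH, AH, BH, RCH, CH), the 2-opt refinement, and the distance calculation, and do not correspond to the authors' actual implementation.

```python
import random

# Hypothetical stand-ins; real versions would implement the NNH/AH/BH/RCH/CH
# constructions, the 2-opt local search, and the Euclidean tour length.
def build_trajectory(nodes, heuristic, start):
    return list(nodes)               # placeholder: a real heuristic would order the nodes

def two_opt(trajectory):
    return trajectory                # placeholder: a real 2-opt would shorten the route

def trajectory_length(trajectory):
    return sum(((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
               for (x1, y1), (x2, y2) in zip(trajectory, trajectory[1:]))

def enhanced_pixel(nodes, loops, heuristics=("NNH", "AH", "BH", "RCH", "CH")):
    """Exhaustive Enhanced-Pixel loop: every heuristic is evaluated in both axis orders."""
    ao_x = sorted(nodes, key=lambda p: (p[0], p[1]))   # nodes ordered along X (AOx)
    ao_y = sorted(nodes, key=lambda p: (p[1], p[0]))   # nodes ordered along Y (AOy)
    best_values = []
    for _ in range(loops):
        start = random.choice(nodes)
        distances = []
        for axis_order in (ao_x, ao_y):                # 2 axis orders x 5 heuristics = 10 candidates
            for h in heuristics:
                trajectory = two_opt(build_trajectory(axis_order, h, start))
                distances.append(trajectory_length(trajectory))
        best_values.append(min(distances))             # keep the best of the 10 candidates
    return best_values
```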
3.2. The Advanced-Pixel Strategy
Algorithm 2 Advanced-Pixel Strategy—MAB-Based Trajectory Planning

    INPUT: Number of iterations n
    INPUT: Hyperparameters of the MAB policy tool (ϵ-greedy, UCB, TS)
    OUTPUT: BestValueMatrix
    Define the set of AO-HTP combinations: H = {h1, h2, h3, h4, h5, h6, h7, h8, h9, h10}
    Initialize Q(h) = 0 for all h ∈ H    ▷ Value function for each AO-HTP combination
    Initialize N(h) = 0 for all h ∈ H    ▷ Counter of selections for each h
    Initialize BestValueMatrix ← empty
    for t = 1 to n do
        Select a random starting node
        ▷ Choose an AO-HTP combination according to the MAB policy tool
        Choose h ∈ H based on the policy (e.g., ϵ-greedy, UCB, TS)
        Generate trajectory using combination h
        Calculate trajectory distance Dt(h)
        ▷ Update statistics for h
        N(h) ← N(h) + 1
        Q(h) ← Q(h) + [Dt(h) − Q(h)] / N(h)
        ▷ Store trajectory distance in BestValueMatrix
        Store Dt(h) in BestValueMatrix
    end for
    return BestValueMatrix
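A compact Python sketch of the MAB loop in Algorithm 2 follows. It assumes that the value function Q(h) tracks the running mean trajectory distance of each AO-HTP arm (lower is better) and is updated with the standard incremental-mean rule; `evaluate_arm` is a hypothetical callback meaning "generate a trajectory with combination h and measure its length", and the selection policy is injected so that any of the tools in Sections 3.2.1–3.2.3 can be plugged in.

```python
import random

def advanced_pixel(evaluate_arm, policy, n_arms=10, n_iterations=500):
    """Generic MAB loop over the ten AO-HTP arms: Q holds the running mean
    trajectory distance per arm (lower is better), N the selection counts."""
    q = [0.0] * n_arms
    n = [0] * n_arms
    best_values = []
    for t in range(1, n_iterations + 1):
        arm = policy(q, n, t)                    # epsilon-greedy, UCB or TS chooses the arm
        distance = evaluate_arm(arm)             # build a trajectory with that AO-HTP, measure it
        n[arm] += 1
        q[arm] += (distance - q[arm]) / n[arm]   # incremental mean update of the value function
        best_values.append(distance)
    return best_values, q, n

# Illustrative run with a dummy evaluator and a purely random policy (both hypothetical):
if __name__ == "__main__":
    dummy = lambda arm: 800.0 + 10.0 * arm + random.random()
    values, q, n = advanced_pixel(dummy, policy=lambda q, n, t: random.randrange(len(q)))
    print(min(values))
```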
3.2.1. ε-Greedy Policy Tool
Algorithm 3 ϵ-Greedy Policy for Selecting AO-HTP Combinations

    INPUT: Set of AO-HTP combinations H = {h1, h2, h3, h4, h5, h6, h7, h8, h9, h10}
    INPUT: Value function Qt(h) for all h ∈ H
    INPUT: Exploration probability ϵ (0 ≤ ϵ ≤ 1)
    function EpsilonGreedySelection(H, Q, ϵ)
        Generate a random number r in [0, 1]
        if r < ϵ then    ▷ Exploration: choose randomly
            Randomly select h from H
        else             ▷ Exploitation: choose the best option
            Select the h ∈ H with the best value Qt(h) (lowest expected trajectory distance)
        end if
        return h
    end function
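Under the same conventions, the ε-greedy selection of Algorithm 3 can be sketched as follows. The guard that forces every untried arm to be evaluated once is an implementation convenience added here, not part of the pseudocode; exploitation picks the arm with the lowest estimated distance.

```python
import random

def epsilon_greedy(q, n, t, epsilon=0.3):
    """Explore with probability epsilon, otherwise exploit the arm whose
    estimated mean distance is lowest."""
    untried = [arm for arm, pulls in enumerate(n) if pulls == 0]
    if untried:                                          # convenience guard: try every arm once
        return random.choice(untried)
    if random.random() < epsilon:                        # exploration branch
        return random.randrange(len(q))
    return min(range(len(q)), key=lambda arm: q[arm])    # exploitation branch
```

A decaying schedule such as the "ε = 1 with 1% decay" configuration tested later could be emulated by passing, for example, epsilon = 0.99 ** t, although the exact decay rule used in the study is not reproduced here.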
3.2.2. Upper Confidence Bound (UCB) Policy Tool
Algorithm 4 UCB Policy for Selecting AO-HTP Combinations

    INPUT: Set of AO-HTP combinations H = {h1, h2, h3, h4, h5, h6, h7, h8, h9, h10}
    INPUT: Value function Qt(h) for all h ∈ H
    INPUT: Number of iterations t
    INPUT: Exploration hyperparameter c (positive value)
    INPUT: Nt(h), number of times each combination has been selected
    function UCBSelection(H, Q, t, c, Nt)
        for each h ∈ H do
            if Nt(h) = 0 then    ▷ Untried combination: evaluate it once
                Generate trajectory using combination h
                Calculate trajectory distance Dt(h)
                Nt(h) ← Nt(h) + 1
                Qt(h) ← Dt(h)
            else                 ▷ Confidence-bound score for distance minimisation
                UCBt(h) ← Qt(h) − c · sqrt(ln t / Nt(h))
            end if
        end for
        Select the h ∈ H with the best UCBt(h) (lowest confidence-bound score)
        return h
    end function
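A possible Python rendering of the UCB policy is given below, written for the distance-minimisation setting assumed above: the confidence term c·sqrt(ln t / N(h)) is subtracted so that rarely selected combinations receive optimistically short scores. Whether the study applies the bound to distances directly or to a distance-derived reward is not visible in the pseudocode, so the sign convention should be read as an assumption.

```python
import math
import random

def ucb_selection(q, n, t, c=0.5):
    """Confidence-bound selection for a minimisation objective: subtracting the
    exploration bonus makes seldom-tried arms look shorter than their mean."""
    untried = [arm for arm, pulls in enumerate(n) if pulls == 0]
    if untried:                                          # evaluate every arm at least once
        return random.choice(untried)
    scores = [q[arm] - c * math.sqrt(math.log(t) / n[arm]) for arm in range(len(q))]
    return min(range(len(scores)), key=lambda arm: scores[arm])
```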
3.2.3. Thompson Sampling (TS) Policy Tool
Algorithm 5 Thompson Sampling for AO-HTP Combination Selection

    INPUT: Set of AO-HTP combinations H = {h1, h2, h3, h4, h5, h6, h7, h8, h9, h10}
    INPUT: Alpha α = 1 and Beta β = 1 for each h ∈ H
    INPUT: Number of iterations t
    INPUT: Distance of the trajectory for each h ∈ H (from previous iterations)
    function ThompsonSampling(H, α, β, t)
        if Dt(h) < Dt−1(h) and t ≠ 1 then    ▷ The selected h improved on its previous distance
            Reward is a success
        else
            Reward is a failure
        end if
        if Reward is a success then
            α(h) ← α(h) + 1    ▷ Increment α for successful h
        else
            β(h) ← β(h) + 1    ▷ Increment β for unsuccessful h
        end if
        For each h ∈ H, sample a value Qt(h) ∼ Beta(α(h), β(h))
        Select the h ∈ H with the highest sampled Qt(h)
        return h
    end function
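Finally, the Beta-Bernoulli sampling step of Algorithm 5 can be sketched in a few lines. The success criterion (the distance obtained with the last selected arm improved on its previous value) is one reading of the comparison in the pseudocode; it is judged outside this function and passed in as a boolean, and both `last_arm` and `improved` are hypothetical parameter names.

```python
import random

def thompson_sampling_step(alpha, beta, last_arm=None, improved=None):
    """One Thompson Sampling step: optionally update the Beta parameters of the
    arm evaluated last (improved=True counts as a success), then sample one
    Beta(alpha, beta) value per arm and return the index of the highest sample."""
    if last_arm is not None:
        if improved:
            alpha[last_arm] += 1      # success: the trajectory distance got shorter
        else:
            beta[last_arm] += 1       # failure: no improvement was observed
    samples = [random.betavariate(a, b) for a, b in zip(alpha, beta)]
    return max(range(len(samples)), key=lambda arm: samples[arm])

# Usage sketch: uniform Beta(1, 1) priors for the ten AO-HTP arms, first pick without feedback.
alpha, beta = [1.0] * 10, [1.0] * 10
first_arm = thompson_sampling_step(alpha, beta)
```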
4. Computational Validation of the Advanced-Pixel Strategy Assisted by an AI Reinforcement Learning Approach
4.1. Methodology to Assess Computational Efficiency Increase with AI Reinforcement Learning
4.2. Results and Discussions
5. Experimental Validation of the Advanced-Pixel Strategy Assisted by an AI Reinforcement Learning Approach
5.1. Methodology to Assess Experimentally the Efficiency and Effectiveness Increase with AI Reinforcement Learning
5.2. Results and Discussions
6. A Case Study of Using the Advanced-Pixel Strategy Assisted by an AI Reinforcement Learning Approach
7. Conclusions and Future Work
- (a) The Advanced-Pixel algorithm reaches the optimised solution faster than its predecessor (Enhanced-Pixel) because it requires fewer iterations.
- (b) Reducing the number of iterations does not degrade trajectory planning performance under the Reinforcement Learning approach. In fact, the performance gain shows that Advanced-Pixel converges, in most cases, to the shortest trajectory and to shorter printing times. However, since the solution applied in Advanced-Pixel is based on probabilistic concepts, the advanced version cannot be expected to beat its Pixel predecessor in 100% of the cases.
- (c) The sensitivity of the algorithm performance comparison increases for larger printable parts (higher number of nodes).
- (d) Therefore, the implementation of Reinforcement Learning through the MAB problem succeeded in upgrading the Pixel family of space-filling trajectory planners.
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Essop, A. 3D Printing Industry. Oil and Gas Industry Consortium Completes Two Projects to Accelerate Adoption of AM. 3D Printing Industry. 2020. Available online: https://3dprintingindustry.com/news/oil-and-gas-industry-consortium-completes-two-projects-to-accelerate-adoption-of-am-169056/ (accessed on 14 May 2021).
- Singh, S.; Sharma, S.K.; Rathod, D.W. A review on process planning strategies and challenges of WAAM. Mater. Today Proc. 2021, 47, 6564–6675. [Google Scholar] [CrossRef]
- Amal, M.S.; Justus, P.C.T.; Senthilkumar, V. Simulation of wire arc additive manufacturing to find out the optimal path planning strategy. Mater. Today Proc. 2022, 66, 2405–2410. [Google Scholar] [CrossRef]
- Ding, D.; Pan, Z.; Cuiuri, D.; Li, H. A multi-bead overlapping model for robotic wire and arc additive manufacturing (WAAM). Robot. Comput. Integr. Manuf. 2015, 31, 101–110. [Google Scholar] [CrossRef]
- Hu, Z.; Qin, X.; Li, Y.; Yuan, J.; Wu, Q. Multi-bead overlapping model with varying cross-section profile for robotic GMAW-based additive manufacturing. J. Intell. Manuf. 2020, 31, 1133–1147. [Google Scholar] [CrossRef]
- Jafari, D.; Vaneker, T.H.J.; Gibson, I. Wire and arc additive manufacturing: Opportunities and challenges to control the quality and accuracy of manufactured parts. Mater. Des. 2021, 202, 109471. [Google Scholar] [CrossRef]
- Ding, D.; Pan, Z.; Cuiuri, D.; Li, H. A practical path planning methodology for wire and arc additive manufacturing of thin-walled structures. Robot. Comput. Integr. Manuf. 2015, 34, 8–19. [Google Scholar] [CrossRef]
- Ding, D.; Pan, Z.; Cuiuri, D.; Li, H.; Larkin, N. Adaptive path planning for wire-feed additive manufacturing using medial axis transformation. J. Clean. Prod. 2016, 133, 942–952. [Google Scholar] [CrossRef]
- Wang, X.; Wang, A.; Li, Y. A sequential path-planning methodology for wire and arc additive manufacturing based on a water-pouring rule. Int. J. Adv. Manuf. Technol. 2019, 103, 3813–3830. [Google Scholar] [CrossRef]
- Cox, J.J.; Takezaki, Y.; Ferguson, H.R.P.; Kohkonen, K.E.; Mulkay, E.L. Space-filling curves in tool-path applications. Comput.-Aided Des. 1994, 26, 215–224. [Google Scholar] [CrossRef]
- Vishwanath, N.; Suryakumar, S. Use of fractal curves for reducing spatial thermal gradients and distortion control. J. Manuf. Process. 2022, 81, 594–604. [Google Scholar] [CrossRef]
- Singh, S.; Singh, A.; Kapil, S.; Das, M. Utilisation of a TSP solver for generating non-retractable, direction favouring toolpath for additive manufacturing. Addit. Manuf. 2022, 59, 103126. [Google Scholar] [CrossRef]
- Ferreira, R.P.; Scotti, A. The Concept of a Novel Path Planning Strategy for Wire + Arc Additive Manufacturing of Bulky Parts: Pixel. Metals 2021, 11, 498. [Google Scholar] [CrossRef]
- Ferreira, R.P.; Vilarinho, L.O.; Scotti, A. Enhanced-pixel strategy for wire arc additive manufacturing trajectory planning: Operational efficiency and effectiveness analyses. Rapid Prototyp. J. 2024, 30, 1–15. [Google Scholar] [CrossRef]
- Ferreira, R.P.; Schubert, E.; Scotti, A. Exploring Multi-Armed Bandit (MAB) as an AI Tool for Optimising GMA-WAAM Path Planning. J. Manuf. Mater. Process. 2024, 8, 99. [Google Scholar] [CrossRef]
- Kumar, S.; Gopi, T.; Harikeerthana, N.; Gupta, M.K.; Gaur, V.; Krolczyk, G.M.; Wu, C. Machine learning techniques in additive manufacturing: A state of the art review on design, processes and production control. J. Intell. Manuf. 2023, 34, 21–55. [Google Scholar] [CrossRef]
- Wang, Y.; Xu, X.; Zhao, Z.; Deng, W.; Han, J.; Bai, L.; Liang, X.; Yao, J. Coordinated monitoring and control method of deposited layer width and reinforcement in WAAM process. J. Manuf. Process. 2021, 71, 306–316. [Google Scholar] [CrossRef]
- Mattera, G.; Caggiano, A.; Nele, L. Optimal data-driven control of manufacturing processes using reinforcement learning: An application to wire arc additive manufacturing. J. Intell. Manuf. 2025, 36, 1291–1310. [Google Scholar] [CrossRef]
- Petrik, J.; Bambach, M. RLTube: Reinforcement learning based deposition path planner for thin-walled bent tubes with optionally varying diameter manufactured by wire-arc additive manufacturing. Manuf. Lett. 2024, 40, 31–36. [Google Scholar] [CrossRef]
- Petrik, J.; Bambach, M. Reinforcement learning and optimisation based path planning for thin-walled structures in wire arc additive manufacturing. J. Manuf. Process. 2023, 93, 75–89. [Google Scholar] [CrossRef]
- Singh, V.; Chen, S.-S.; Singhania, M.; Nanavati, B.; Kar, A.K.; Gupta, A. How are reinforcement learning and deep learning algorithms used for big data based decision making in financial industries–A review and research agenda. Int. J. Inf. Manag. Data Insights 2022, 2, 100094. [Google Scholar] [CrossRef]
- Hutsebaut-Buysse, M.; Mets, K.; Latré, S. Hierarchical Reinforcement Learning: A Survey and Open Research Challenges. Mach. Learn. Knowl. Extr. 2022, 4, 172–221. [Google Scholar] [CrossRef]
- Bouneffouf, D.; Rish, I.; Aggarwal, C. Survey on Applications of Multi-Armed and Contextual Bandits. In Proceedings of the 2020 IEEE Congress on Evolutionary Computation (CEC), Glasgow, UK, 24 July 2020; pp. 1–8. [Google Scholar] [CrossRef]
- Silva, N.; Werneck, H.; Silva, T.; Pereira, A.C.M.; Rocha, L. Multi-Armed Bandits in Recommendation Systems: A survey of the state-of-the-art and future directions. Expert. Syst. Appl. 2022, 197, 116669. [Google Scholar] [CrossRef]
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
- Almasri, M.; Mansour, A.; Moy, C.; Assoum, A.; Le Jeune, D.; Osswald, C. Distributed competitive decision making using multi-armed bandit algorithms. Wirel. Pers. Commun. 2021, 118, 1165–1188. [Google Scholar] [CrossRef]
- Russo, D.J.; Roy, B.V.; Kazerouni, A.; Osband, I.; Wen, Z. A Tutorial on Thompson Sampling. Found. Trends Mach. Learn. 2018, 11, 1–96. Available online: https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf (accessed on 30 May 2023). [CrossRef]
- Martín, M.; Jiménez-Martín, A.; Mateos, A.; Hernández, J.Z. Improving A/B Testing on the Basis of Possibilistic Reward Methods: A Numerical Analysis. Symmetry 2021, 13, 2175. [Google Scholar] [CrossRef]
- Jain, S.; Bhat, S.; Ghalme, G.; Padmanabhan, D.; Narahari, Y. Mechanisms with learning for stochastic multi-armed bandit problems. Indian J. Pure. Appl. Math. 2016, 47, 229–272. [Google Scholar] [CrossRef]
- Gupta, A.K.; Nadarajah, S. Handbook of Beta Distribution and Its Applications; CRC Press: Boca Raton, FL, USA, 2004. [Google Scholar]
- Dirkx, R.; Dimitrakopoulos, R. Optimising infill drilling decisions using multi-armed bandits: Application in a long-term, multi-element stockpile. Math. Geosci. 2018, 50, 35–52. [Google Scholar] [CrossRef]
- Mignon, A.; Rocha, A.; Luis, R. An Adaptive Implementation of ε-Greedy in Reinforcement Learning. Procedia Comput. Sci. 2017, 109, 1146–1151. [Google Scholar] [CrossRef]
- Koch, P.H.; Rosenkranz, J. Sequential decision-making in mining and processing based on geometallurgical inputs. Miner. Eng. 2020, 149, 106262. [Google Scholar] [CrossRef]
- Marković, D.; Stojić, H.; Schwöbel, S.; Kiebel, S.J. An empirical evaluation of active inference in Multi-Armed Bandits. Neural Netw. 2021, 144, 229–246. [Google Scholar] [CrossRef]
- Li, Y.; Han, Q.; Zhang, G.; Horváth, I. A layers-overlapping strategy for robotic wire and arc additive manufacturing of multi-layer multi-bead components with homogeneous layers. Int. J. Adv. Manuf. Technol. 2018, 96, 3331–3344. [Google Scholar] [CrossRef]
- Cui, J.; Yuan, L.; Commins, P.; He, F.; Wang, J.; Pan, Z. WAAM process for metal block structure parts based on mixed heat input. Int. J. Adv. Manuf. Technol. 2021, 113, 503–521. [Google Scholar] [CrossRef]
- Tao, W.; Leu, M.C. Design of lattice structure for additive manufacturing. In Proceedings of the International Symposium on Flexible Automation (ISFA), Cleveland, OH, USA, 1–8 August 2016; pp. 325–332. [Google Scholar] [CrossRef]
Reinforcement Learning Terms | Definition |
---|---|
Environment | The place where the agent gathers information, interacts with its surroundings and acquires knowledge through learning processes. |
Agent | The entity that takes actions affecting the environment. |
Action | The set of all possible operations/moves the agent can make. |
Episode | A set of interactions between the agent and the environment during a single run of the algorithm. |
Reward and Regret | A feedback signal provided by the environment to the learning agent (positive feedback is termed a reward; negative feedback, a regret). |
Exploration | To gather information to understand the environment better. |
Exploitation | Using the knowledge gathered during exploration to reach the target results. |
Policy | The set of rules an agent follows, at a given state, to select an action that maximises reward and avoids regret. |
Value function | The metric used to estimate the expected return or cumulative reward an agent can obtain in a given state, ruled by the policy. |
Strategy | Policy Tool | Part 1: Trajectory Distance (mm) After 500 Iterations (*) | Part 1: Iterations to Converge | Part 1: Regret | Part 2: Trajectory Distance (mm) After 500 Iterations (*) | Part 2: Iterations to Converge | Part 2: Regret | Part 3: Trajectory Distance (mm) After 500 Iterations (*) | Part 3: Iterations to Converge | Part 3: Regret
---|---|---|---|---|---|---|---|---|---|---
Enhanced-Pixel | | 845.4 | 450 | 8364.8 | 1461.9 | 277 | 17,505.6 | 1904.7 | 75 | 16,374.5
Advanced-Pixel | ε-greedy, ε = 0.3 | 844.3 | 243 | 5102.61 | 1461.9 | 94 | 11,547.3 | 1904.0 | 491 | 10,533.7
Advanced-Pixel | ε-greedy, ε = 0.5 | 845.9 | 223 | 5661.1 | 1461.9 | 69 | 14,459.4 | 1900.0 | 425 | 12,565.9
Advanced-Pixel | ε-greedy, ε = 1 with 1% decay | 844.3 | 355 | 4899.8 | 1465.4 | 154 | 9884.5 | 1901.2 | 353 | 9190.8
Advanced-Pixel | UCB, c = 0.3 | 845.9 | 23 | 3956.3 | 1461.9 | 53 | 9904.3 | 1901.1 | 36 | 6737.7
Advanced-Pixel | UCB, c = 0.5 | 844.3 | 111 | 4343.7 | 1475.5 | 378 | 4005.3 | 1901.1 | 22 | 10,911.5
Advanced-Pixel | UCB, c = 3 | 844.3 | 341 | 3815.3 | 1461.9 | 31 | 9518.7 | 1901.1 | 150 | 7878.9
Advanced-Pixel | UCB, c = 5 | 844.3 | 113 | 4383.3 | 1475.5 | 63 | 4407.2 | 1901.1 | 421 | 8762.6
Advanced-Pixel | TS | 845.9 | 31 | 4229.4 | 1461.9 | 132 | 10,643 | 1901.1 | 213 | 8797.4
Parameter | Value
---|---
Process | Kinects (a Cold Metal Transfer technology from Abicor Binzel)
Arc welding equipment | iRob 501 Pro (Abicor Binzel)
Torch movement system | ABB Robot IRB 1520 ID
Substrate | SAE 1020 carbon steel (200 × 200 × 12 mm)
Substrate cooling | Natural air cooling
Wire | AWS ER70S-6, ϕ 1.2 mm
Shielding gas | Ar + 2% CO2, 15 L/min
CTWD * | 12 mm
Deposition speed (travel speed) | 48.0 cm/min
Set wire feed speed | 3.7 m/min
Set voltage | 15.2 V
Set current | 136 A
Interlayer temperature | >80 °C (around the whole top surface area)
Part Number | Layers | Criteria | Enhanced-Pixel | Advanced-Pixel |
---|---|---|---|---|
1 | Odd | Trajectory Distance (mm) | 845.81 | 845.56
1 | Odd | Printing Time (s) | 125.07 | 124.36
1 | Even | Trajectory Distance (mm) | 846.45 | 845.56
1 | Even | Printing Time (s) | 125.11 | 124.35
2 | Odd | Trajectory Distance (mm) | 1462.67 | 1462.15
2 | Odd | Printing Time (s) | 215.19 | 215.19
2 | Even | Trajectory Distance (mm) | 1463.58 | 1462.33
2 | Even | Printing Time (s) | 215.19 | 214.82
3 | Odd | Trajectory Distance (mm) | 1903.96 | 1903.39
3 | Odd | Printing Time (s) | 289.01 | 284.14
3 | Even | Trajectory Distance (mm) | 1902.91 | 1902.33
3 | Even | Printing Time (s) | 288.49 | 284.10
Part | Layers | Criteria | Enhanced-Pixel | Advanced-Pixel |
---|---|---|---|---|
Angled grid (Figure 12) | Odd | Trajectory distance (mm) | 9525.08 | 9420.39
Angled grid (Figure 12) | Odd | Printing time (s) | 1609 | 1523
Angled grid (Figure 12) | Even | Trajectory distance (mm) | 9546.08 | 9422.64
Angled grid (Figure 12) | Even | Printing time (s) | 1620 | 1556