1. Introduction
Measuring and characterizing radiation fields efficiently and safely is a regular part of nuclear power plant maintenance, and in the response to a radiological disaster the stakes are only higher. In response to the Fukushima Daiichi disaster, over 13 robots were deployed with the goal of conducting radiation surveys. These robots varied greatly in locomotion, size, operating environment, and power, but most were teleoperated. This teleoperation sometimes required communication cables, which occasionally became caught and severed, resulting in early losses such as that of QUINCE1 [1,2]. Keeping track of the internal radiation conditions was important, as static sensors had failed across the building due to the flooding. In addition, several dosimeters were rendered inoperable by seawater from the tsunami, and several areas of Fukushima were beyond the 1000 mSv range, requiring human operators to turn back [3].
Due to the inherent danger of spending time in radioactive environments, it has always been of interest to make the survey process itself more efficient and safer for humans. It is no surprise that the field has a deep relationship with robotic automation [4,5]. However, radiation chips away at all materials, not just humans, and before long concerns must also turn to the constitution of the robot. Therefore, it is not enough to build hardier surveyors; we must also survey smarter. However, improving radiation survey efficiency while maintaining the same reliability and explainability is difficult: the properties of radiation plague the exercise with many issues that can be hard to resolve, in addition to the inherent mathematical difficulty of finding a global maximum. Algorithms for radiation surveys must contend with non-static noise in the form of background radiation, the navigation of extensive low-count areas where little useful information can be gleaned, and radiation's complex attenuation behavior.
Bayesian optimization (BO) has been observed to be very effective at efficiently tackling global optimization problems in many fields [6]. It is hypothesized that BO could provide an effective method for radiation surveying and localization because it yields precise and explainable uncertainties, and at each step it provides easily analyzable data that allow a human operator to quickly understand and, if needed, take control of the search process [7]. However, the usual framework of BO is not a perfect fit for a quick and effective radiation survey, as it only specifies points to examine at the next step and does not account for the inefficiency incurred by transporting the surveyor. Not only does this make the process less efficient, but it throws away what could be useful data from the transportation step (i.e., readings are still recorded while moving to a new spot). It is hoped that reinforcement learning (RL) can add path-planning capabilities to BO, making it more efficient for localizing and surveying radiation fields. It is also vital to consider that radiation is attenuated by most materials at a rate per unit length determined by nuclear characteristics and atomic density. This intense material attenuation makes radiation unique among commonly sought-after signals, such as light and radio.
BO is composed of two parts: a regression function with uncertainty, and an acquisition function. The regression function (usually a Gaussian Process Regressor [GPR]) uses training data to optimize its own hyperparameters and establish a fit. The acquisition function uses the predicted mean and uncertainty to score each point of the domain of interest and select new points to add to the training data; it is usually tuned to find the maximum or the minimum. The label of each acquired point is then found by observation, and the point and its observed label are appended to the training data. The newly augmented training data are used to fit the regressor again, completing the loop. GPRs are often used because they make the implementation of both parts of the process simple: GPRs usually only need to fit the hyperparameters of their kernels, an extensively explored problem, and the uncertainty predicted by the GPR can be fed directly into the acquisition function. In this paper, the Expected Improvement (EI) acquisition function is used, as it provides a pragmatic middle point between the exploration of new data and the exploitation of better-understood data [8,9].
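To make the loop concrete, the following is a minimal sketch of a standard BO iteration, assuming a one-dimensional toy objective and scikit-learn's GPR; the grid, objective, and hyperparameters are illustrative and are not the radiation model used in this work.

```python
# Minimal Bayesian-optimization loop: GPR fit -> EI acquisition -> observe -> refit.
# The toy objective and grid below are illustrative, not the radiation model.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(mu, sigma, best, xi=0.01):
    """Standard EI for maximization; near zero where the GPR is both low and certain."""
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def objective(x):                                      # stand-in for a field measurement
    return np.exp(-(x - 0.7) ** 2 / 0.01)

grid = np.linspace(0, 1, 200).reshape(-1, 1)           # discretized search domain
X = np.array([[0.1], [0.5], [0.9]])                    # initial training inputs
y = objective(X).ravel()                               # observed labels

gpr = GaussianProcessRegressor(kernel=Matern(nu=1.5), normalize_y=True)
for _ in range(10):
    gpr.fit(X, y)                                      # refit kernel hyperparameters
    mu, sigma = gpr.predict(grid, return_std=True)
    ei = expected_improvement(mu, sigma, best=y.max())
    x_next = grid[np.argmax(ei)]                       # acquisition picks the next point
    y_next = objective(x_next)                         # "observation" step
    X = np.vstack([X, [x_next]])
    y = np.append(y, y_next)
```

Each iteration refits the kernel hyperparameters and lets EI trade the predicted mean off against the predicted uncertainty when choosing the next point.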
RL is a machine learning framework for optimizing the performance of automatic processes using rewards and penalties from past actions performed while in certain states [10]. By keeping track of rewards and penalties and their associated state-action pairs, RL can construct automated decision rules (called policies) that choose new actions when presented with states [11]. RL is a natural fit for this problem, as navigating between data exploration and exploitation is central to its construction. While RL has had numerous successes in many related disciplines [12,13,14], especially in path planning within robotics [15], it is much more computationally taxing and requires much more training data to provide even rudimentary results compared to BO.
Much of the structure of this model draws inspiration from Liu et al. [16], who used double Q-learning with the Q function estimated by a convolutional neural network. The agent recorded measurements and occupancy and used these, in addition to the given map geometry, as inputs for the Q estimation. Though it required intensive amounts of processing power, Liu's strategy showed a marked improvement over gradient search, especially in instances where obstacles directly blocked the signal from the radiation source. Romanchek [17] builds directly upon Liu's work, adding a stop-search action. This stop action necessitates a negative reward for premature stops, measured proportionally to distance. This approach retains Liu's ability to navigate around obstacles and explore but can beat statistical methods in terms of quickly finishing the search in an empty room.
Building upon Liu, Hu [18] takes Liu's convolutional approach but uses proximal policy optimization instead of double Q-learning. Hu also separates the search protocol into two phases, localization and exploration, and the decision of which policy to use is made by a third classification policy. For exploration, Hu uses the same inputs as Liu with some amendments. The robot does not have knowledge of the map; instead, walls are dynamically added to an observed-geometry input map when they come within a small radius around the robot. The spaces covered by this small radius also form their own input. The localization policy only uses a 5 × 5 unit area centered on the robot as input, making the algorithm faster, and takes only radiation measurements, observed geometry, and previously visited spaces as inputs. The classification policy is approximated by a simple neural network that uses the last five measurement readings to determine which policy is favored for the next measurement. Hu's protocol is among the most computationally expensive of its contemporaries but is not trapped as easily as others and shows superb generality across different room types.
In addition to these convolutional strategies, a parallel strategy was developed by Proctor [19]. Much like Hu, Proctor uses proximal policy optimization, but on a different class of neural network: the Gated Recurrent Unit (GRU), an improvement on recurrent neural networks that addresses the problem of exploding gradients with gates. This results in higher stability, which leads to longer history memory. However, to train the GRU, an estimated position for the source is needed. Thus, Proctor utilizes another GRU, augmented with particle filtering (a statistical method that converges upon a source location by iterating over a shrinking field of particles), to estimate the position of the true source, which is then used by the Proximal Policy Optimization (PPO) GRU to estimate the RL loss. The complexity of this method rivals Hu's, but it is shown to succeed in an even wider set of environments, including different signal-to-background ratios as well as different obstacle orientations. The Proctor method also requires much less initial knowledge, such as the background rate.
It is worth noting that none of the above approaches make significant use of radiation's material attenuation. They model the occlusion of signals by obstacles, but this does not accurately capture the way real radiation signals are modified by interactions with obstacles. By using real radiation data, the method introduced in this work aims to overcome this limitation.
Another common property among these different strategies is the reliance upon a single-point source. In a real radiation source search, there will be distributed low-level sources, and there may be more than one radioactive source. The rewards of the localizer in Hu, the reward from the stopper in Romanchek, and the reward used in Liu all depend on the distance to a single-point source.
There is one final source to examine, and it is an outlier, as it is not radiation-specific. Morere [20] is nevertheless notable for combining the two frameworks of BO source search (in this case for obstacle reconstruction) and RL, though in a much different way. Morere is concerned with many of the same limitations of BO observed in the motivation above, but instead of putting RL in charge of modifying BO-suggested data, he uses BO to select among actions, whose future trajectories and rewards are modeled as a Partially Observable Markov Decision Process (POMDP) solved with Monte Carlo tree search. This makes the search protocol non-myopic, i.e., it makes decisions based on how the next action will affect future actions. At the cost of adding another trade-off parameter κ, significant improvements are observed over more myopic BO implementations [21].
In this paper, a novel lightweight RL augmentation of BO is implemented, allowing for more complex paths and improved learning. The improved learning acts as a richer source of information for a simulated robot, which can then more efficiently characterize the radiation field. This implementation is tested using a probabilistic simulation derived from real-world radiation data. After a large batch of randomized simulations, the performance of the BO+RL method is compared against BO alone across several metrics, such as computational load, time required to run the test, and root mean square error in radiation field predictions.
2. Materials and Methods
A basic flow chart showing each significant process in the model is provided in Figure 1, and pseudocode is provided in Algorithm 1. Throughout the rest of this section, following as near to chronological order as is convenient, each process is explicated.
Algorithm 1. The steps of the BO+RL algorithm
Input: the grid of points that can be sampled, the initial training data from the exploration path, and the distance threshold
repeat
| Fit the GPR to the training data and compute EI(x) over the grid
| Select the point with the maximum EI
| if the selected point is within the distance threshold of the current point then
| | Collect all points and their EI within the threshold of the current point
| | Run on-policy first-visit MC policy optimization over these points
| | Dynamically sample along the resulting trajectory
| | Update the current point to the last point of the trajectory
| else
| | Travel to the selected point, dynamically sampling along the way, and take a static sample
| Append the new samples to the training data
until the stopping condition is met
end
2.1. Initial Exploration Path
Realistically, any survey will begin with an exploratory initial path [22]. This initial path is usually constructed by analyzing how to 'cover' the space given the initial data, typically with complex radiation estimation software [23]. However, utilizing such software here would present two issues:
It would require a more grounded specification of the scenario start, namely where and which readings are 'original' to the scenario before the survey process began. This semi-arbitrary decision would erode the credibility of using such software in the first place;
The selection of data needs to have some randomness, both to show how the algorithm performs beyond a fixed initial condition and to introduce variation into the trials.
To these mutual ends, a K-means algorithm was chosen, but with the number of iterations curbed so that the initial random locations influence the positioning of the final centroids. A line was then drawn through the centroids in an order that minimizes the distance traveled, taking advantage of the scenario room's geometric shape. This method was found to achieve both goals; more discussion of this method can be found in [24].
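As an illustration, the following sketch shows one way to build such an initial path, assuming a rectangular grid of candidate points; the cluster count, the iteration cap, and the greedy nearest-neighbour ordering are illustrative stand-ins for the geometry-aware choices described in [24].

```python
# Sketch of the initial exploration path: K-means with a curbed iteration count,
# then centroids visited in an order chosen by a simple nearest-neighbour heuristic.
# The grid shape, cluster count, and iteration cap are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng()
xx, yy = np.meshgrid(np.arange(20), np.arange(10))        # candidate grid points
points = np.column_stack([xx.ravel(), yy.ravel()]).astype(float)

# Curbed iterations so the random initial centroids still influence the result.
km = KMeans(n_clusters=6, init="random", n_init=1, max_iter=2,
            random_state=int(rng.integers(1 << 31)))
km.fit(points)
centroids = km.cluster_centers_

# Greedy nearest-neighbour ordering as a stand-in for the geometry-aware ordering
# used in the paper, which exploits the room's shape.
order = [0]
remaining = set(range(1, len(centroids)))
while remaining:
    last = centroids[order[-1]]
    nxt = min(remaining, key=lambda i: np.abs(centroids[i] - last).sum())
    order.append(nxt)
    remaining.remove(nxt)
initial_path = centroids[order]
```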
2.2. GPR Kernel
The key to a successful BO process is choosing the correct kernel. In this case, we utilize the Matern kernel [25]:

$$k(x_i, x_j) = \frac{1}{\Gamma(\nu)\,2^{\nu-1}} \left( \frac{\sqrt{2\nu}}{\ell}\, d(x_i, x_j) \right)^{\nu} K_{\nu}\!\left( \frac{\sqrt{2\nu}}{\ell}\, d(x_i, x_j) \right),$$

where $x_i$ and $x_j$ are two points in the search domain, $d(\cdot,\cdot)$ is the distance between them, $\nu$ is the smoothness parameter, $\ell$ is the length scale, $K_\nu$ is the modified Bessel function, and $\Gamma$ is the Gamma function. We choose $\nu = 3/2$, as the Matern 3/2 kernel is a common default for modeling complex non-stationary processes, and its structure is somewhat analogous to radiation attenuation. In addition to the default settings, we set more conservative length scale extremes based on physics; the details of these limits are given in [24]. The fitting is performed by optimizing the log-likelihood of the kernel to the data using scikit-learn's internal Limited-memory Broyden–Fletcher–Goldfarb–Shanno (lbfgs) implementation [26]. Lbfgs is a well-established optimization algorithm suited to lower computational power.
The GPR is trained on the list of grid points explored so far, the count rate corresponding to each point, and the count rate variance calculated as specified in the Sampling section below. Being able to use the radiation counting uncertainty as a bias for the fitting is one of the key benefits of using a GPR.
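A minimal sketch of this configuration, assuming scikit-learn: the Matern 3/2 kernel with a bounded length scale is fit by maximizing the log-marginal likelihood with lbfgs, and the per-point count-rate variances enter through the alpha argument. The training arrays and the length-scale bounds are placeholders, not the physics-based limits from [24].

```python
# GPR configuration sketch: Matern 3/2 kernel with a constrained length scale,
# fit by maximizing the log-marginal likelihood with lbfgs (scikit-learn's default).
# Training values and length-scale bounds below are illustrative placeholders.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern

X_train = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])   # explored grid points
y_train = np.array([12.0, 45.0, 20.0])                     # count rates (CPS)
var_train = np.array([0.4, 1.5, 0.7])                      # count-rate variances

kernel = ConstantKernel(1.0) * Matern(length_scale=1.0,
                                      length_scale_bounds=(0.1, 10.0),  # placeholder
                                      nu=1.5)
gpr = GaussianProcessRegressor(kernel=kernel,
                               alpha=var_train,            # per-point measurement variance
                               optimizer="fmin_l_bfgs_b")
gpr.fit(X_train, y_train)
mu, sigma = gpr.predict(np.array([[2.0, 2.0]]), return_std=True)
```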
2.3. Acquisition Function
The EI acquisition function was selected to choose the next BO point, as well as to provide the rewards for the RL process.
EI is particularly popular as a default [27], but there are additional reasons why it should be used for this problem. The Probability of Improvement acquisition function saturates at 1 and will not suggest far-away points that may have even greater returns, limiting exploration behavior. Conversely, the Upper Confidence Bound (UCB) [28] is so optimistic that the RL may be reluctant to exploit any points beyond the already found local maxima. EI is specifically designed to balance both tendencies, but in this implementation, we wish to ensure a higher degree of explorative exhaustivity. Thus, instead of using the highest point found in the training data, we use the highest point in the predicted data. This results in more explorative behavior while the data fit is poorer. Finally, EI provides for the creation of a stopping condition, as over iterations EI converges to zero for all points once it is determined that the maximum has been found.
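A sketch of this EI variant is given below, where the incumbent best value is taken from the GPR's predicted mean over the domain rather than from the best observed training label; the xi exploration offset and the function name are illustrative.

```python
# EI variant described above: the incumbent "best" is the maximum of the predicted
# mean over the whole domain, rather than the best observed training label, which
# keeps the acquisition more explorative while the fit is still poor.
import numpy as np
from scipy.stats import norm

def expected_improvement_predicted_best(gpr, grid, xi=0.01):
    """EI over `grid` with the incumbent taken from the predicted mean."""
    mu, sigma = gpr.predict(grid, return_std=True)
    best = mu.max()                          # predicted, not observed, incumbent
    sigma = np.maximum(sigma, 1e-12)         # guard against zero predicted std
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)
```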
2.4. Thresholding Positions for RL
The simplest way to split the exploration and exploitation regimes is to check whether BO suggests an exploitative point, and if so to switch over to RL to maximize rewards through trajectories. However, what determines an exploitative point? A comparison of the two EI terms could be used, but it seems more pertinent (if less mathematically rigorous) to define it via how far we wish to restrict the total movement (and therefore, total computational effort) of the RL algorithm. This matches well with our goal of reserving RL for cases where it can provide the greatest benefit per unit of computational cost. In Algorithm 1, this threshold appears as the input distance threshold.
The distance is measured in the L1 or Taxicab distance, as this is a perfect match for the movements that the RL algorithm can make (explained in Section 2.5).
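A minimal sketch of this regime switch under the assumptions above; the function name and the string labels it returns are hypothetical placeholders.

```python
# Regime-switch sketch: exploit locally with RL when the BO suggestion lies within
# the taxicab-distance threshold of the current position; otherwise travel to it.
# The function name and the string labels returned here are hypothetical.
import numpy as np

def choose_regime(current_pos, bo_pos, threshold):
    taxicab = int(np.abs(np.asarray(bo_pos) - np.asarray(current_pos)).sum())
    if taxicab <= threshold:
        return "rl_exploit"   # run the MC policy over nearby points (Section 2.5)
    return "bo_travel"        # move to the BO point, sampling dynamically en route
```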
2.5. ϵ-Soft MC Policy
This algorithm's implementation is based on the implementation provided in [10], Chapter 5.4. This specific algorithm was chosen because it does not rely on the unrealistic exploring-starts requirement to calculate Q. Instead, all episodes begin from the same point that the agent would start from. This results in increased efficiency, as many states given to the RL algorithm are not likely to be visited, so calculating Q for them would be a waste.
The RL specifics are given as follows:
States = {grid locations within dataset};
Actions = {up, down, left, right} [the agent can also stay still by attempting to move in an invalid direction];
Reward = EI calculated at each point;
Horizon = (taxicab distance between the current position and the BO-selected position) + 1.
The choices of states and actions are simple: they break the movement problem down into its smallest possible components (with respect to the discretized dataset). However, why EI specifically should be used for the reward is worth interrogating. Consider that there is a significant deviation in behavior between EI and the second most obvious choice, the predicted mean. The mean will always provide some reward for any point, but EI converges to zero for anything it deems below the current highest reading. Because of this, RL based on EI is less likely to explore and more likely to exploit. Further, known points take a significant penalty in EI, which discourages the RL from simply resampling the starting point (achievable if the starting point is next to a wall). Therefore, exploitation is the focus, but not to the point where new data will not be collected.
The horizon for the episodes was chosen as the number of steps required to barely exceed the BO-suggested point. This decision was made to save processing time: if the BO-suggested point is near the current point, then the RL should not have to move far to properly exploit it.
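The following is a minimal sketch of on-policy first-visit MC control with an ε-soft policy, specialized to the grid setup above; the EI reward map, grid shape, episode count, and ε are illustrative assumptions rather than the values used in this work.

```python
# Sketch of on-policy first-visit MC control with an epsilon-soft policy
# (Sutton & Barto, Sec. 5.4), specialized to the grid setup described above.
# `ei_map` (reward per grid cell), episode count, and epsilon are illustrative.
import numpy as np
from collections import defaultdict

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # up, down, left, right

def step(state, action, shape):
    """Move if valid; attempting an invalid move leaves the agent in place."""
    nxt = (state[0] + action[0], state[1] + action[1])
    if 0 <= nxt[0] < shape[0] and 0 <= nxt[1] < shape[1]:
        return nxt
    return state

def mc_epsilon_soft(ei_map, start, horizon, episodes=2000, eps=0.1, gamma=1.0):
    """Estimate Q for the local exploitation problem; rewards are the EI values."""
    shape = ei_map.shape
    Q = defaultdict(lambda: np.zeros(len(ACTIONS)))
    counts = defaultdict(lambda: np.zeros(len(ACTIONS)))
    rng = np.random.default_rng()

    for _ in range(episodes):
        # Every episode starts from the agent's actual position (no exploring starts).
        state, episode = start, []
        for _ in range(horizon):
            if rng.random() < eps:
                a = int(rng.integers(len(ACTIONS)))
            else:
                a = int(np.argmax(Q[state]))
            nxt = step(state, ACTIONS[a], shape)
            episode.append((state, a, float(ei_map[nxt])))  # reward = EI at visited cell
            state = nxt

        # First-visit MC return, incremental-mean update of Q.
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if all((s, a) != (e[0], e[1]) for e in episode[:t]):
                counts[s][a] += 1
                Q[s][a] += (G - Q[s][a]) / counts[s][a]
    return Q
```

A trajectory can then be rolled out (greedily or ε-softly) from the returned Q up to the horizon, and the dynamic samples gathered along it are appended to the GPR training data.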
The number of episodes used in training and the choice of RL-specific parameters are given in Section 4.1.
2.6. Sampling
There are two types of data collection used in this experiment: static collection and dynamic sampling (dynamic sampling is used for both A* travel and RL trajectories). Static collection represents low-uncertainty data measured over a period of 30 s once the agent is at its destination, while dynamic sampling is higher-uncertainty data gathered second by second as the robot traverses the room, ending with a static sample at the final point. It was assumed that the agent moves at a constant speed with no acceleration, that the radiation Poisson characteristics are constant (per grid square) across the movement, and that the robot spends only one second in traversal per grid point.
The statistics are based on the collection of counts, which are then converted into a count rate. The collection of counts is a Poisson process with mean $\lambda$. However, for use in a GPR, the data must be treated as Gaussian, so the counts are approximated as a Gaussian with mean $\lambda$ and standard deviation $\sqrt{\lambda}$, where $\lambda$ is the single free parameter of the Poisson distribution, which in practice is modeled as the number of counts. Finally, the data are given to the model in terms of count rate over the residence time $t$, which results in a normal with a mean of $\lambda/t$ and a standard deviation of $\sqrt{\lambda}/t$. When a position $x$ has already been visited, the recorded quantities are updated according to statistical principles [29]:

$$\mu_x' = \frac{\mu_x t_x + \mu_1 t_1}{t_x + t_1}, \qquad \sigma_x'^2 = \frac{\sigma_x^2 t_x^2 + \sigma_1^2 t_1^2}{(t_x + t_1)^2}, \qquad t_x' = t_x + t_1,$$

where $\mu_x$, $\sigma_x^2$, and $t_x$ are the recorded mean, variance, and time spent at position $x$, and $\mu_1$, $\sigma_1^2$, and $t_1$ are the new data being incorporated into the database. When a 0 signal is detected, the residence time is updated, but the signal and its error are not added to the training data, as this would result in 0-error, 0-CPS training data, which is overly restrictive and not physical.
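As a sketch of these conventions, the following illustrates converting simulated Poisson counts into a Gaussian count rate and combining a repeat visit by residence-time weighting; the true rate and dwell times are illustrative.

```python
# Sketch of the sampling statistics in Section 2.6: Poisson counts are converted to a
# Gaussian count rate (mean lambda/t, std sqrt(lambda)/t), and repeat visits to a grid
# point are combined by residence-time weighting. True rates here are illustrative.
import numpy as np

rng = np.random.default_rng()

def sample_count_rate(true_rate_cps, dwell_s):
    """Simulate a measurement: Poisson counts over dwell_s seconds -> Gaussian count rate."""
    counts = rng.poisson(true_rate_cps * dwell_s)     # lambda modeled by observed counts
    mean = counts / dwell_s
    var = counts / dwell_s ** 2                       # (sqrt(counts) / dwell_s) ** 2
    return mean, var

def combine(mean0, var0, t0, mean1, var1, t1):
    """Time-weighted combination of two count-rate estimates at the same position."""
    t_new = t0 + t1
    mean_new = (mean0 * t0 + mean1 * t1) / t_new
    var_new = (var0 * t0 ** 2 + var1 * t1 ** 2) / t_new ** 2
    return mean_new, var_new, t_new

# Static sample (30 s) followed by a 1 s dynamic sample at the same grid point.
m0, v0 = sample_count_rate(25.0, 30.0)
m1, v1 = sample_count_rate(25.0, 1.0)
m, v, t = combine(m0, v0, 30.0, m1, v1, 1.0)
```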
5. Conclusions
To improve the exploitation behavior of BO with path-based sampling, RL policy control was introduced. Over three different scenarios, comparisons between BO+RL and BO alone were undertaken. Although the amount of time taken and the accuracy of reconstruction did not demonstrably improve, the primary metric (the magnitude of the most active data point) was significantly increased in the hallway scenario, validating the hypothesis that introducing RL allows BO to provide more efficient hotspot identification. RL's ability to converge more efficiently on high-dose sources has implications for deployment in real radioactive survey applications. In maintenance scenarios, the BO+RL algorithm can provide increased performance in quantifying hotspots. In anomaly identification scenarios, the BO+RL algorithm can converge to the source more quickly, but this could come with the risk of increased radiation dose, as it spends more time around high-source areas. Thankfully, the algorithm's design allows human operators to easily evaluate the predicted signal and its uncertainty, interceding according to their judgment. But before such physical surveys, more analysis is required to optimize the deployment of RL within this BO framework. Results have shown that, in some contexts and metrics, there can be significant unfavorable deviation between BO+RL and BO.
With regard to future study, there are several vectors of exploration. The MC convergence was not well established for scenarios other than the hallway, and though it is unlikely, it is possible that incomplete RL convergence could be the cause of the lagging metrics observed in those scenarios. If a return to dynamic stopping is validated, then it would be worth evaluating a stopping condition based on the per-episode changes to the policy, rather than to Q, since the policy is not as scenario-dependent as Q: it is bounded and can only change among 5 values per state.