1. Introduction
Measuring and characterizing radiation fields efficiently and safely is a regular part of nuclear power plant maintenance, and in the response to a radiological disaster the stakes are only higher. In response to the Fukushima Daiichi disaster, over 13 robots were deployed with the goal of conducting radiation surveys. These robots varied greatly in locomotion, size, operating environment, and power, but most were teleoperated. This teleoperation sometimes required communication cables, which occasionally became caught and severed, resulting in early losses such as that of QUINCE1 [1,2]. Keeping track of the internal radiation conditions was important, as static sensors had failed across the building due to the flooding. In addition, several dosimeters were rendered inoperable by seawater from the tsunami, and several areas of Fukushima were beyond the 1000 mSv range, requiring human operators to turn back [3].
Due to the inherent danger of spending time in radioactive environments, it has always been of interest to make the survey process itself more efficient and safer for humans. It is no surprise that the field has a deep relationship with robotic automation [4,5]. However, radiation chips away at all materials, not just humans, and before long concerns must also turn to the constitution of the robot. Therefore, it is not enough to build hardier surveyors; we must also survey smarter. However, improving radiation survey efficiency while maintaining the same reliability and explainability is difficult: the properties of radiation plague the exercise with many issues that can be hard to resolve, in addition to the inherent mathematical difficulty of finding a global maximum. Algorithms for radiation surveys must contend with non-static noise in the form of background radiation, the navigation of extensive low-count areas where little useful information can be gleaned, and radiation's complex attenuation behavior.
Bayesian optimization (BO) has been observed to be very effective at efficiently tackling global optimization problems in many fields [6]. It is hypothesized that BO could provide an effective method for radiation surveying and localization because it yields precise and explainable uncertainties, and at each step it provides easily analyzable data that allow a human operator to quickly understand and, if needed, take control of the search process [7]. However, the usual framework of BO is not a perfect fit for a quick and effective radiation survey, as it only specifies points to examine at the next step and does not account for the inefficiency incurred by transporting the surveyor. Not only does this make the process less efficient, but it throws away what could be useful data from the transportation step (i.e., readings are still recorded while moving to a new spot). It is hoped that reinforcement learning (RL) can add path-planning capabilities to BO, making it more efficient for localizing and surveying radiation fields. It is also vital to consider that radiation is attenuated by most materials at a rate per unit length determined by nuclear characteristics and atomic density. This intense material attenuation makes radiation unique among commonly sought-after signals, such as light and radio.
BO is composed of two parts: a regression function with uncertainty, and an acquisition function. The regression function (usually a Gaussian Process Regressor [GPR]) uses training data to optimize its own hyperparameters and establish a fit. The acquisition function uses the predicted mean and uncertainty to score each point of the domain of interest and select new points to add to the training data; it is usually tuned to find the maximum or the minimum. The label of each acquired point is then found by observation, and the point and its observed label are appended to the training data. The newly augmented training data are used to fit the regressor again, completing the loop. GPRs are often used because they make the implementation of both parts of the process simple: GPRs usually only need to fit the hyperparameters of their kernels, an extensively explored problem, and the uncertainty predicted by the GPR can be fed directly into the acquisition function. In this paper, the Expected Improvement (EI) acquisition function is used, as it provides a pragmatic middle point between the exploration of new data and the exploitation of better-understood data [8,9].
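To make the loop concrete, the following is a minimal sketch of a standard BO iteration, assuming a one-dimensional toy objective and scikit-learn's GPR; the grid, objective, and hyperparameters are illustrative and are not the radiation model used in this work.

```python
# Minimal Bayesian-optimization loop: GPR fit -> EI acquisition -> observe -> refit.
# The toy objective and grid below are illustrative, not the radiation model.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(mu, sigma, best, xi=0.01):
    """Standard EI for maximization; near zero where the GPR is both low and certain."""
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def objective(x):                                      # stand-in for a field measurement
    return np.exp(-(x - 0.7) ** 2 / 0.01)

grid = np.linspace(0, 1, 200).reshape(-1, 1)           # discretized search domain
X = np.array([[0.1], [0.5], [0.9]])                    # initial training inputs
y = objective(X).ravel()                               # observed labels

gpr = GaussianProcessRegressor(kernel=Matern(nu=1.5), normalize_y=True)
for _ in range(10):
    gpr.fit(X, y)                                      # refit kernel hyperparameters
    mu, sigma = gpr.predict(grid, return_std=True)
    ei = expected_improvement(mu, sigma, best=y.max())
    x_next = grid[np.argmax(ei)]                       # acquisition picks the next point
    y_next = objective(x_next)                         # "observation" step
    X = np.vstack([X, [x_next]])
    y = np.append(y, y_next)
```

Each iteration refits the kernel hyperparameters and lets EI trade the predicted mean off against the predicted uncertainty when choosing the next point.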
RL is a machine learning framework for optimizing the performance of automatic processes using rewards and penalties from past actions performed while in certain states [10]. By keeping track of rewards and penalties and their associated state-action pairs, RL can construct automated decision rules (called policies) that choose new actions when presented with states [11]. RL is a natural fit for this problem, as navigating between data exploration and exploitation is central to its construction. While RL has had numerous successes in many related disciplines [12,13,14], especially in path planning within robotics [15], it is much more computationally taxing and requires much more training data to provide even rudimentary results compared to BO.
Much of the structure of this model draws inspiration from Liu et al. [16], who used double Q-learning with the Q function estimated by a convolutional neural network. The agent recorded measurements and occupancy and used these, in addition to the given map geometry, as inputs for the Q estimation. Though it required intensive amounts of processing power, Liu's strategy showed a marked improvement over gradient search, especially in instances where obstacles directly blocked the signal from the radiation source. Romanchek [17] builds directly upon Liu's work, adding a stop-search action. This stop action necessitates a negative reward for premature stops, measured proportionally to distance. This approach retains Liu's ability to navigate around obstacles and explore but can beat statistical methods in terms of quickly finishing the search in an empty room.
Building upon Liu, Hu [18] takes Liu's convolutional approach but uses proximal policy optimization instead of double Q-learning. Hu also separates the search protocol into two phases, localization and exploration, and the decision of which policy to use is made by a third classification policy. For exploration, Hu uses the same inputs as Liu with some amendments. The robot does not have knowledge of the map; instead, walls are dynamically added to an observed-geometry input map when they come within a small radius around the robot. The spaces covered by this small radius also form their own input. The localization policy only uses a 5 × 5 unit area centered on the robot as input, making the algorithm faster, and takes only radiation measurements, observed geometry, and previously visited spaces as inputs. The classification policy is approximated by a simple neural network that uses the last five measurement readings to determine which policy is favored for the next measurement. Hu's protocol is among the most computationally expensive of its contemporaries but is not trapped as easily as others and shows superb generality across different room types.
In addition to these convolutional strategies, a parallel strategy was developed by Proctor [19]. Much like Hu, Proctor uses proximal policy optimization, but on a different class of neural network: the Gated Recurrent Unit (GRU), an improvement on recurrent neural networks that addresses the problem of exploding gradients with gates. This results in higher stability, which leads to longer history memory. However, to train the GRU, an estimated position for the source is needed. Thus, Proctor utilizes another GRU, augmented with particle filtering (a statistical method that converges upon a source location by iterating over a shrinking field of particles), to estimate the position of the true source, which is then used by the Proximal Policy Optimization (PPO) GRU to estimate the RL loss. The complexity of this method rivals Hu's, but it is shown to succeed in an even wider set of environments, including different signal-to-background ratios as well as different obstacle orientations. The Proctor method also requires much less initial knowledge, such as the background rate.
It is worth noting that none of the above approaches make significant use of radiation's material attenuation. They model the occlusion of signals by obstacles, but this does not accurately capture the way real radiation signals are modified by interactions with obstacles. By using real radiation data, the method introduced in this work aims to overcome this limitation.
Another common property among these different strategies is the reliance upon a single-point source. In a real radiation source search, there will be distributed low-level sources, and there may be more than one radioactive source. The rewards of the localizer in Hu, the reward from the stopper in Romanchek, and the reward used in Liu all depend on the distance to a single-point source.
There is one final source to examine, and it is an outlier, as it is not radiation-specific. Morere [20] is nevertheless notable for combining the two frameworks of BO source search (in this case for obstacle reconstruction) and RL, though in a much different way. Morere is concerned with many of the same limitations of BO observed in the motivation above, but instead of putting RL in charge of modifying BO-suggested data, he uses BO to select among actions, whose future trajectories and rewards are modeled as a Partially Observable Markov Decision Process (POMDP) solved with Monte Carlo tree search. This makes the search protocol non-myopic, i.e., it makes decisions based on how the next action will affect future actions. At the cost of adding another trade-off parameter κ, significant improvements are observed over more myopic BO implementations [21].
In this paper, a novel lightweight RL augmentation of BO is implemented, allowing for more complex paths and improved learning. The improved learning acts as a richer source of information for a simulated robot, which can then more efficiently characterize the radiation field. This implementation is tested using a probabilistic simulation derived from real-world radiation data. After a large batch of randomized simulations, the performance of the BO+RL method is compared against BO alone across several metrics, such as computational load, time required to run the test, and root mean square error in radiation field predictions.
2. Materials and Methods
A basic flow chart showing each significant process in the model is provided in Figure 1, and pseudocode is provided in Algorithm 1. Throughout the rest of this section, following as near to chronological order as is convenient, each process is explicated.
Algorithm 1. The steps of the BO+RL algorithm
Input: the grid of points that can be sampled, the initial training data from the exploration path, and the distance threshold
repeat
| Fit the GPR to the training data and compute EI(x) over the grid
| Select the point with the maximum EI
| if the selected point is within the distance threshold of the current point then
| | Collect all points and their EI within the threshold of the current point
| | Run on-policy first-visit MC policy optimization over these points
| | Dynamically sample along the resulting trajectory
| | Update the current point to the last point of the trajectory
| else
| | Travel to the selected point, dynamically sampling along the way, and take a static sample
| Append the new samples to the training data
until the stopping condition is met
end
2.1. Initial Exploration Path
Realistically, any survey will begin with an exploratory initial path [22]. This initial path is usually constructed by analyzing how to 'cover' the space given the initial data, typically with complex radiation estimation software [23]. However, utilizing such software here would present two issues:
It would require a more grounded specification of the scenario start, namely where and which readings are 'original' to the scenario before the survey process began. This semi-arbitrary decision would erode the credibility of using such software in the first place;
The selection of data needs to have some randomness, both to show how the algorithm performs beyond a fixed initial condition and to introduce variation into the trials.
To these mutual ends, a K-means algorithm was chosen, but with the number of iterations curbed so that the initial random locations influence the positioning of the final centroids. A line was then drawn through the centroids in an order that minimizes the distance traveled, taking advantage of the scenario room's geometric shape. This method was found to achieve both goals; more discussion of this method can be found in [24].
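As an illustration, the following sketch shows one way to build such an initial path, assuming a rectangular grid of candidate points; the cluster count, the iteration cap, and the greedy nearest-neighbour ordering are illustrative stand-ins for the geometry-aware choices described in [24].

```python
# Sketch of the initial exploration path: K-means with a curbed iteration count,
# then centroids visited in an order chosen by a simple nearest-neighbour heuristic.
# The grid shape, cluster count, and iteration cap are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng()
xx, yy = np.meshgrid(np.arange(20), np.arange(10))        # candidate grid points
points = np.column_stack([xx.ravel(), yy.ravel()]).astype(float)

# Curbed iterations so the random initial centroids still influence the result.
km = KMeans(n_clusters=6, init="random", n_init=1, max_iter=2,
            random_state=int(rng.integers(1 << 31)))
km.fit(points)
centroids = km.cluster_centers_

# Greedy nearest-neighbour ordering as a stand-in for the geometry-aware ordering
# used in the paper, which exploits the room's shape.
order = [0]
remaining = set(range(1, len(centroids)))
while remaining:
    last = centroids[order[-1]]
    nxt = min(remaining, key=lambda i: np.abs(centroids[i] - last).sum())
    order.append(nxt)
    remaining.remove(nxt)
initial_path = centroids[order]
```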
2.2. GPR Kernel
The key to a successful BO process is choosing the correct kernel. In this case, we utilize the Matern kernel [25]:

$$k(x_i, x_j) = \frac{1}{\Gamma(\nu)\,2^{\nu-1}} \left( \frac{\sqrt{2\nu}}{\ell}\, d(x_i, x_j) \right)^{\nu} K_{\nu}\!\left( \frac{\sqrt{2\nu}}{\ell}\, d(x_i, x_j) \right),$$

where $x_i$ and $x_j$ are two points in the search domain, $d(\cdot,\cdot)$ is the distance between them, $\nu$ is the smoothness parameter, $\ell$ is the length scale, $K_\nu$ is the modified Bessel function, and $\Gamma$ is the Gamma function. We choose $\nu = 3/2$, as the Matern 3/2 kernel is a common default for modeling complex non-stationary processes, and its structure is somewhat analogous to radiation attenuation. In addition to the default settings, we set more conservative length scale extremes based on physics; the details of these limits are given in [24]. The fitting is performed by optimizing the log-likelihood of the kernel to the data using scikit-learn's internal Limited-memory Broyden–Fletcher–Goldfarb–Shanno (lbfgs) implementation [26]. Lbfgs is a well-established optimization algorithm suited to lower computational power.
The GPR is trained on the list of grid points explored so far, the count rate corresponding to each point, and the count rate variance calculated as specified in the Sampling section below. Being able to use the radiation counting uncertainty as a bias for the fitting is one of the key benefits of using a GPR.
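A minimal sketch of this configuration, assuming scikit-learn: the Matern 3/2 kernel with a bounded length scale is fit by maximizing the log-marginal likelihood with lbfgs, and the per-point count-rate variances enter through the alpha argument. The training arrays and the length-scale bounds are placeholders, not the physics-based limits from [24].

```python
# GPR configuration sketch: Matern 3/2 kernel with a constrained length scale,
# fit by maximizing the log-marginal likelihood with lbfgs (scikit-learn's default).
# Training values and length-scale bounds below are illustrative placeholders.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern

X_train = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])   # explored grid points
y_train = np.array([12.0, 45.0, 20.0])                     # count rates (CPS)
var_train = np.array([0.4, 1.5, 0.7])                      # count-rate variances

kernel = ConstantKernel(1.0) * Matern(length_scale=1.0,
                                      length_scale_bounds=(0.1, 10.0),  # placeholder
                                      nu=1.5)
gpr = GaussianProcessRegressor(kernel=kernel,
                               alpha=var_train,            # per-point measurement variance
                               optimizer="fmin_l_bfgs_b")
gpr.fit(X_train, y_train)
mu, sigma = gpr.predict(np.array([[2.0, 2.0]]), return_std=True)
```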
2.3. Acquisition Function
The EI acquisition function was selected to choose the next BO point, as well as to provide the rewards for the RL process.
EI is particularly popular as a default [27], but there are additional reasons why it should be used for this problem. The Probability of Improvement acquisition function saturates at 1 and will not suggest far-away points that may have even greater returns, limiting exploration behavior. Conversely, the Upper Confidence Bound (UCB) [28] is so optimistic that the RL may be reluctant to exploit any points beyond the already found local maxima. EI is specifically designed to balance both tendencies, but in this implementation, we wish to ensure a higher degree of explorative exhaustivity. Thus, instead of using the highest point found in the training data, we use the highest point in the predicted data. This results in more explorative behavior while the data fit is poorer. Finally, EI provides for the creation of a stopping condition, as over iterations EI converges to zero for all points once it is determined that the maximum has been found.
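A sketch of this EI variant is given below, where the incumbent best value is taken from the GPR's predicted mean over the domain rather than from the best observed training label; the xi exploration offset and the function name are illustrative.

```python
# EI variant described above: the incumbent "best" is the maximum of the predicted
# mean over the whole domain, rather than the best observed training label, which
# keeps the acquisition more explorative while the fit is still poor.
import numpy as np
from scipy.stats import norm

def expected_improvement_predicted_best(gpr, grid, xi=0.01):
    """EI over `grid` with the incumbent taken from the predicted mean."""
    mu, sigma = gpr.predict(grid, return_std=True)
    best = mu.max()                          # predicted, not observed, incumbent
    sigma = np.maximum(sigma, 1e-12)         # guard against zero predicted std
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)
```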
2.4. Thresholding Positions for RL
The simplest way to split the exploration and exploitation regimes is to check whether BO suggests an exploitative point, and if so to switch over to RL to maximize rewards through trajectories. However, what determines an exploitative point? A comparison of the two EI terms could be used, but it seems more pertinent (if less mathematically rigorous) to define it via how far we wish to restrict the total movement (and therefore, total computational effort) of the RL algorithm. This matches well with our goal of reserving RL for cases where it can provide the greatest benefit per unit of computational cost. In Algorithm 1, this threshold appears as the input distance threshold.
The distance is measured in the L1 or Taxicab distance, as this is a perfect match for the movements that the RL algorithm can make (explained in Section 2.5).
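A minimal sketch of this regime switch under the assumptions above; the function name and the string labels it returns are hypothetical placeholders.

```python
# Regime-switch sketch: exploit locally with RL when the BO suggestion lies within
# the taxicab-distance threshold of the current position; otherwise travel to it.
# The function name and the string labels returned here are hypothetical.
import numpy as np

def choose_regime(current_pos, bo_pos, threshold):
    taxicab = int(np.abs(np.asarray(bo_pos) - np.asarray(current_pos)).sum())
    if taxicab <= threshold:
        return "rl_exploit"   # run the MC policy over nearby points (Section 2.5)
    return "bo_travel"        # move to the BO point, sampling dynamically en route
```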
2.5. ϵ-Soft MC Policy
This algorithm's implementation is based on the implementation provided in [10], Chapter 5.4. This specific algorithm was chosen because it does not rely on the unrealistic exploring-starts requirement to calculate Q. Instead, all episodes begin from the same point that the agent would start from. This results in increased efficiency, as many states given to the RL algorithm are not likely to be visited, so calculating Q for them would be a waste.
The RL specifics are given as follows:
States = {grid locations within dataset};
Actions = {up, down, left, right} [the agent can also stay still by attempting to move in an invalid direction];
Reward = EI calculated at each point;
Horizon = (taxicab distance between the current position and the BO-selected position) + 1.
The choices of states and actions are simple: they break the movement problem down into its smallest possible components (with respect to the discretized dataset). However, why EI specifically should be used for the reward is worth interrogating. Consider that there is a significant deviation in behavior between EI and the second most obvious choice, the predicted mean. The mean will always provide some reward for any point, but EI converges to zero for anything it deems below the current highest reading. Because of this, RL based on EI is less likely to explore and more likely to exploit. Further, known points take a significant penalty in EI, which discourages the RL from simply resampling the starting point (achievable if the starting point is next to a wall). Therefore, exploitation is the focus, but not to the point where new data will not be collected.
The horizon for the episodes was chosen as the number of steps required to barely exceed the BO-suggested point. This decision was made to save processing time: if the BO-suggested point is near the current point, then the RL should not have to move far to properly exploit it.
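The following is a minimal sketch of on-policy first-visit MC control with an ε-soft policy, specialized to the grid setup above; the EI reward map, grid shape, episode count, and ε are illustrative assumptions rather than the values used in this work.

```python
# Sketch of on-policy first-visit MC control with an epsilon-soft policy
# (Sutton & Barto, Sec. 5.4), specialized to the grid setup described above.
# `ei_map` (reward per grid cell), episode count, and epsilon are illustrative.
import numpy as np
from collections import defaultdict

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # up, down, left, right

def step(state, action, shape):
    """Move if valid; attempting an invalid move leaves the agent in place."""
    nxt = (state[0] + action[0], state[1] + action[1])
    if 0 <= nxt[0] < shape[0] and 0 <= nxt[1] < shape[1]:
        return nxt
    return state

def mc_epsilon_soft(ei_map, start, horizon, episodes=2000, eps=0.1, gamma=1.0):
    """Estimate Q for the local exploitation problem; rewards are the EI values."""
    shape = ei_map.shape
    Q = defaultdict(lambda: np.zeros(len(ACTIONS)))
    counts = defaultdict(lambda: np.zeros(len(ACTIONS)))
    rng = np.random.default_rng()

    for _ in range(episodes):
        # Every episode starts from the agent's actual position (no exploring starts).
        state, episode = start, []
        for _ in range(horizon):
            if rng.random() < eps:
                a = int(rng.integers(len(ACTIONS)))
            else:
                a = int(np.argmax(Q[state]))
            nxt = step(state, ACTIONS[a], shape)
            episode.append((state, a, float(ei_map[nxt])))  # reward = EI at visited cell
            state = nxt

        # First-visit MC return, incremental-mean update of Q.
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if all((s, a) != (e[0], e[1]) for e in episode[:t]):
                counts[s][a] += 1
                Q[s][a] += (G - Q[s][a]) / counts[s][a]
    return Q
```

A trajectory can then be rolled out (greedily or ε-softly) from the returned Q up to the horizon, and the dynamic samples gathered along it are appended to the GPR training data.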
The number of episodes used in training and the choice of RL-specific parameters are given in Section 4.1.
2.6. Sampling
There are two types of data collection used in this experiment: static collection and dynamic sampling (dynamic sampling is used for both A* travel and RL trajectories). Static collection represents low-uncertainty data measured over a period of 30 s once the agent is at its destination, while dynamic sampling is higher-uncertainty data gathered second by second as the robot traverses the room, ending with a static sample at the final point. It was assumed that the agent moves at a constant speed with no acceleration, that the radiation Poisson characteristics are constant (per grid square) across the movement, and that the robot spends only one second in traversal per grid point.
The statistics are based on the collection of counts, which are then converted into a count rate. The collection of counts is a Poisson process with mean $\lambda$. However, for use in a GPR, the data must be treated as Gaussian, so the counts are approximated as a Gaussian with mean $\lambda$ and standard deviation $\sqrt{\lambda}$, where $\lambda$ is the single free parameter of the Poisson distribution, which in practice is modeled as the number of counts. Finally, the data are given to the model in terms of count rate over the residence time $t$, which results in a normal with a mean of $\lambda/t$ and a standard deviation of $\sqrt{\lambda}/t$. When a position $x$ has already been visited, the recorded quantities are updated according to statistical principles [29]:

$$\mu_x' = \frac{\mu_x t_x + \mu_1 t_1}{t_x + t_1}, \qquad \sigma_x'^2 = \frac{\sigma_x^2 t_x^2 + \sigma_1^2 t_1^2}{(t_x + t_1)^2}, \qquad t_x' = t_x + t_1,$$

where $\mu_x$, $\sigma_x^2$, and $t_x$ are the recorded mean, variance, and time spent at position $x$, and $\mu_1$, $\sigma_1^2$, and $t_1$ are the new data being incorporated into the database. When a 0 signal is detected, the residence time is updated, but the signal and its error are not added to the training data, as this would result in 0-error, 0-CPS training data, which is overly restrictive and not physical.
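As a sketch of these conventions, the following illustrates converting simulated Poisson counts into a Gaussian count rate and combining a repeat visit by residence-time weighting; the true rate and dwell times are illustrative.

```python
# Sketch of the sampling statistics in Section 2.6: Poisson counts are converted to a
# Gaussian count rate (mean lambda/t, std sqrt(lambda)/t), and repeat visits to a grid
# point are combined by residence-time weighting. True rates here are illustrative.
import numpy as np

rng = np.random.default_rng()

def sample_count_rate(true_rate_cps, dwell_s):
    """Simulate a measurement: Poisson counts over dwell_s seconds -> Gaussian count rate."""
    counts = rng.poisson(true_rate_cps * dwell_s)     # lambda modeled by observed counts
    mean = counts / dwell_s
    var = counts / dwell_s ** 2                       # (sqrt(counts) / dwell_s) ** 2
    return mean, var

def combine(mean0, var0, t0, mean1, var1, t1):
    """Time-weighted combination of two count-rate estimates at the same position."""
    t_new = t0 + t1
    mean_new = (mean0 * t0 + mean1 * t1) / t_new
    var_new = (var0 * t0 ** 2 + var1 * t1 ** 2) / t_new ** 2
    return mean_new, var_new, t_new

# Static sample (30 s) followed by a 1 s dynamic sample at the same grid point.
m0, v0 = sample_count_rate(25.0, 30.0)
m1, v1 = sample_count_rate(25.0, 1.0)
m, v, t = combine(m0, v0, 30.0, m1, v1, 1.0)
```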
5. Conclusions
To improve the exploitation behavior of BO with path-based sampling, RL policy control was introduced. Over three different scenarios, comparisons between BO+RL and BO alone were undertaken. Although the amount of time taken and the accuracy of reconstruction did not demonstrably improve, the primary metric (the magnitude of the most active data point) was significantly increased in the hallway scenario, validating the hypothesis that introducing RL allows BO to provide more efficient hotspot identification. RL's ability to converge more efficiently on high-dose sources has implications for deployment in real radioactive survey applications. In maintenance scenarios, the BO+RL algorithm can provide increased performance in quantifying hotspots. In anomaly identification scenarios, the BO+RL algorithm can converge to the source more quickly, but this could come with the risk of increased radiation dose, as it spends more time around high-source areas. Thankfully, the algorithm's design allows human operators to easily evaluate the predicted signal and its uncertainty, interceding according to their judgment. But before such physical surveys, more analysis is required to optimize the deployment of RL within this BO framework. Results have shown that, in some contexts and metrics, there can be significant unfavorable deviation between BO+RL and BO.
With regard to future study, there are several vectors of exploration. The MC convergence was not well established for scenarios other than the hallway, and though it is unlikely, it is possible that incomplete RL convergence could be the cause of the lagging metrics observed in those scenarios. If a return to dynamic stopping is validated, then it would be worth evaluating a stopping condition based on the per-episode changes to the policy, rather than to Q, since the policy is not as scenario-dependent as Q: it is bounded and can only change among 5 values per state.