Review

Deep Reinforcement Learning for Automated Insulin Delivery Systems: Algorithms, Applications, and Prospects

1 College of Information Science and Engineering, Northeastern University, Shenyang 110819, China
2 Department of Endocrinology and Metabolism, Shanghai Jiao Tong University Affiliated Sixth People’s Hospital, Shanghai Clinical Center for Diabetes, Shanghai 200233, China
3 Department of Chemical and Biological Engineering, Illinois Institute of Technology, Chicago, IL 60616, USA
* Author to whom correspondence should be addressed.
Submission received: 11 March 2025 / Revised: 20 April 2025 / Accepted: 21 April 2025 / Published: 23 April 2025

Abstract

Advances in continuous glucose monitoring (CGM) technologies and wearable devices are enabling the enhancement of automated insulin delivery systems (AIDs) towards fully automated closed-loop systems, aiming to achieve secure, personalized, and optimal blood glucose concentration (BGC) management for individuals with diabetes. While model predictive control provides a flexible framework for developing AIDs control algorithms, models that capture inter- and intra-patient variability and perturbation uncertainty are needed for accurate and effective regulation of BGC. Advances in artificial intelligence present new opportunities for developing data-driven, fully closed-loop AIDs. Among them, deep reinforcement learning (DRL) has attracted much attention due to its potential resistance to perturbations. To this end, this paper conducts a literature review on DRL-based BGC control algorithms for AIDs. First, this paper systematically analyzes the benefits of utilizing DRL algorithms in AIDs. Then, a comprehensive review is provided of the various DRL techniques and extensions that have been proposed to address the challenges arising from their integration with AIDs, including considerations related to low sample availability, personalization, and security. Additionally, the paper provides an application-oriented investigation of DRL-based AIDs control algorithms, emphasizing significant challenges in practical implementations. Finally, the paper discusses solutions to relevant BGC control problems, outlines prospects for practical applications, and suggests future research directions.

1. Introduction

Diabetes is a chronic disease that can be broadly categorized into four types as described in the Classification and Diagnosis of Diabetes 2023: type 1 diabetes (T1D), type 2 diabetes (T2D), diabetes due to other causes, and gestational diabetes [1]. Currently, one out of eleven people in the world has diabetes, and the number is predicted to rise to one out of every ten people within the next decade. Among these, T1D is marked by a lack or near lack of beta cell function and necessitates exogenous insulin therapy [2]. People with T1D make an average of 180 decisions every day to regulate their blood glucose concentration (BGC) [3], generating an ongoing mental burden. As diabetes progresses, insulin therapy is also prescribed for T2D cases with impaired beta cell function and/or severe insulin resistance [4].
Insulin administration modalities can be categorized into manual insulin injection methods and automated insulin infusion methods. Manual insulin injections (called multiple daily injections) involve the use of syringes or insulin pens to inject insulin several times every day [5,6,7], and this is the method of insulin delivery used by the majority of people with diabetes [8]. The use of insulin pumps can be manual (called sensor-augmented pump (SAP) therapy) or automated (known as automated insulin delivery systems (AIDs)) [9]. T1D is prevalent in childhood or adolescence [10], a population that often requires assistance from others for insulin injections and has poor adherence. Furthermore, young children (<6 years of age) are at high risk of hyperglycemia/hypoglycemia [11]. They also exhibit more glycemic variability than older children [12]. Hence, manual insulin injections may not be appropriate for them, since injections cannot respond promptly to continuous glycemic variability and are typically administered only during the day. Recent studies have shown that in adolescents and young adults, the use of continuous glucose monitoring (CGM) devices reduces hemoglobin A1c (HbA1c), which reflects the average BGC over the past 2–3 months, and increases the time in range (TIR) of BGC (70–180 mg/dL) [13,14]. Moreover, AIDs have been shown to have a modest advantage in reducing the incidence of severe hypoglycemia (BGC < 54 mg/dL) in children and adults [15]. Clinical studies and data reported by AIDs manufacturers indicate that AIDs significantly increase the TIR of glucose concentrations and reduce time-below-range (hypoglycemia), reducing the fear of hypoglycemia in people with diabetes and improving their quality of life [4]. As the glucose control algorithms in current AIDs have improved TIR from 50–60% with SAP therapy to 70–80%, AIDs are becoming more popular for reducing the burden of BGC management.
An AIDs consists of three components: a CGM device, an insulin pump, and a BGC controller that computes the required insulin dosages automatically [16] (as shown in Figure 1). The AIDs is also known as an artificial pancreas system [17]. The AIDs increases or decreases the dose of insulin infused into subcutaneous tissue based on the glucose levels measured by the CGM to imitate physiological insulin delivery by the pancreas [18]. Several studies have been conducted in adults and children using various AIDs with different algorithms, pumps, and sensors [19,20,21,22]. The current commercially available AIDs are hybrid closed-loop AIDs that need manual entry of meal information for insulin bolusing to regulate postprandial BGC and of exercise information for reducing insulin infusion to prevent hypoglycemia and increase the TIR [23,24]. Studies have shown that AIDs may also increase user satisfaction and/or reduce diabetes-related burden [25]. Currently, fully automated AIDs that eliminate the need for carbohydrate counting and manual entry of meal information or adjustments for scheduled exercise are under development [4].
The human body is a complex system, and the control systems of AIDs face many challenges due to several issues listed below.
(1)
Non-linearity. Due to the existence of synergistic effects of human hormones as well as the complexity of biochemical reactions, the glycemic control of people with diabetes is a non-linear problem.
(2)
Inter- and intra-patient variability. There are significant differences in glycemic metabolism between patients, and for the same individual, the long-term and short-term glycemic performance can vary due to circadian rhythms, emotional fluctuations, external perturbations, and other factors.
(3)
Time lag. A time delay of up to one hour exists between making insulin infusions and the resulting BGC changes [26], necessitating adjustments in dosage calculations.
(4)
Uncertain disturbances. The patient’s BGC is not only determined by the insulin dose but also affected by external disturbances such as physical activities (type, intensity, and duration), emotional fluctuations, and dietary intake. Many of these factors are difficult to detect, classify, and characterize based on only CGM and insulin infusion data. They cause unknown disturbances to closed-loop AIDs, leading to high uncertainty.
(5)
High safety requirements. Excessive insulin may lead to severe hypoglycemic events that can be life-threatening, so safety must always be ensured in the design and operation of the control algorithm.
These factors increase the complexity of the control algorithm design and bring great challenges to the performance of AIDs. Many review papers have discussed different control algorithms [27,28,29,30,31]. Conventional Proportional Integral Derivative (PID) controllers have been used to eliminate steady-state errors in BGC levels [32,33,34,35]. Fuzzy logic (FL) control algorithms that provide some robustness due to the fuzziness of their rules have also been considered for BGC control [36,37]. The most commonly used control algorithm is Model Predictive Control (MPC), which has provided good results in both simulation and clinical studies [38,39,40,41,42,43].
However, the commercially available hybrid closed-loop AIDs have TIR values in the 65–80% range; there is room for further improvement of TIR in free living by leveraging artificial intelligence (AI) in control decisions. With the widespread use of advanced sensors and insulin pumps, AIDs generate a large amount of interconnected data, providing the basis for data-driven AI approaches such as Deep Reinforcement Learning (DRL). DRL is an advanced form of Reinforcement Learning. Both approaches rely on reward signals to learn optimal behavior through trial and error, while DRL leverages deep neural networks (NNs) to handle high-dimensional state and action spaces, enabling more sophisticated and scalable solutions. The autonomous learning, adaptive tuning, and optimal decision-making capabilities of DRL [44] make it a suitable candidate for consideration in AIDs, either as the controller or in combination with traditional control algorithms. For a deeper understanding of the advantages and limitations of traditional control algorithms such as MPC in AIDs applications, as well as their comparison to DRL, a detailed comparative analysis is provided in Section 6. Specifically, DRL is a learning process of making periodic decisions, observing the effects of these decisions, and automatically adjusting its behavior to achieve an optimal strategy. DRL has exhibited great potential in many fields, including playing Go [45], autonomous driving [46], smart grid operation optimization [44], and drug delivery [47]. In recent years, DRL control algorithms have been considered for BGC regulation [48,49].
In a broad sense, DRL is a computational framework through which machines achieve goals by interacting with their environment. A single interaction involves the machine, in a specific state of the environment, making an action decision, applying this action to the environment, and receiving feedback in the form of a reward and the subsequent state of the environment. This process iterates over multiple rounds, with the objective of maximizing the expected cumulative rewards. In DRL, the decision-making machine is referred to as an agent. Unlike the “model” in supervised learning, the “agent” in DRL emphasizes not only the machine’s ability to perceive information from its environment but also its capability to directly influence and modify the environment through decision-making, rather than merely producing predictive signals.
In general, the DRL approach offers the following advantages.
(1)
Potential resistance to perturbations. The “Agent” considers all states when evaluating the expected value, enabling the DRL algorithm to adapt to potential external perturbations of the “Environment”;
(2)
Adaptation to inter- and intra-patient variability. People with diabetes are considered the “Environment” in DRL, and the agent makes appropriate decisions based on its interactions with the individualized environment, which gives it an advantage over fixed model-based control algorithms in adapting to the variability of the “Environment”;
(3)
Adaptation to time lag. The reward function in DRL calculates the cumulative reward, which can address the problem of latency in the “Environment”;
(4)
Adaptation to sequential decision-making tasks. The BGC control task is a sequential decision-making task, where the action at the current time step influences the subsequent state change, which in turn determines the action that needs to be made at the next time step, and DRL has a natural match with sequential decision-making tasks.
However, DRL systems still face a series of formidable challenges.
(1)
Low sample availability. When confronted with high-dimensional, continuous action spaces, DRL often struggles to effectively utilize sample data, leading to low learning efficiency and excessive reliance on data volume. This issue not only impacts the training speed of DRL systems but also restricts their application in the real world, as acquiring a large amount of sample data in practical environments may be limited by cost and time constraints.
(2)
Personalization. When an initial or pre-trained DRL controller is deployed to a patient, the agent encounters the challenge of distributional bias. During the training process, DRL models often only have access to local data distributions, resulting in unstable or even ineffective performance when facing different environments or unknown scenarios. Additionally, the DRL controller is required to achieve personalized control objectives due to inter-patient variability in blood glucose dynamics and individualized treatment plans.
(3)
Security. Since the DRL systems typically operate in dynamic and uncertain environments, their decisions may be influenced by external attacks or unexpected disturbances, leading to unexpected behaviors or outcomes.
Although some DRL reviews have been published, there is still a lack of detailed discussions on the application of DRL in AIDs in recent years [26,49,50,51,52,53]. Specifically, earlier reviews focused on DRL methods or AIDs without discussing the problem of applying the DRL algorithm to AIDs. The rapid rise in interest in AI-based healthcare systems has led to a surge in publications employing DRL, with numerous cutting-edge DRL algorithms being developed for AIDs. This rapid advancement necessitates a thorough analysis of potential DRL applications. Hence, this paper aims to offer a comprehensive overview of the various DRL-based methods applied to AIDs.
The main contributions of this paper are as follows:
(1)
To provide a well-organized overview of the methodology and applications, including the basic concepts, a detailed analysis of the DRL components applied to AIDs, and a discussion of how DRL algorithms can be better matched to AIDs in terms of state space, action space, and reward function.
(2)
To classify the problems of AIDs into three key categories to illustrate the challenges of DRL in low data availability, personalization, and security.
(3)
To provide insights into the challenges, potential solutions, and future directions for DRL-based AIDs, with prospects for further development in terms of data requirement and computational power.
A total of 106 articles were retrieved by searching for “Reinforcement learning” and “Blood glucose” on the Web of Science. About 60 relevant articles remained after restricting the publication period from 2020 to the present, and finally, 32 articles using DRL methods were retained for this review. The search date was 1 January 2025. Figure 2 illustrates the change in the number of publications over time; the number increased gradually from 2020 to 2024.
This paper aims to provide a relatively exhaustive review of DRL-based AIDs, especially publications from 2020 to the present. More importantly, specific potential research directions are highlighted for interested parties by summarizing, highlighting, and analyzing DRL features and their applications in AIDs. The rest of the paper is organized as follows. Section 2 introduces the basic theory of DRL, and discusses its state-of-the-art and its extensions for applications in AIDs. Section 3 details the DRL-based AIDs regarding reward functions, state spaces, and action spaces. Following this, Section 4 discusses the challenges to be addressed in DRL-based AIDs. Section 5 discusses the practical applications, including the data needs and the computational power. Section 6 provides a comparison of the main control algorithms for AIDs. Finally, Section 7 provides conclusions.

2. Overview of DRL

2.1. The Basic Structure of DRL

DRL algorithms, as a class of powerful learning algorithms, learn how to maximize rewards in the current environment to achieve goals by directly interacting with the environment and obtaining feedback rewards, without heavily relying on prior experience [54]. This section focuses on the basic framework of DRL, shown in Figure 3, and its fundamental learning and decision-making processes, aiming to provide a quick start for beginners.
DRL models the environment as a Markov Decision Process (MDP), describing the basic structure of decision-making problems through a five-tuple $\langle S, A, P, R, \gamma \rangle$. Here, $S$ represents the state space, $A$ represents the action space, $P(s' \mid s, a)$ describes the probability of transitioning from the current state $s$ to the next state $s'$ through action $a$, $R(s, a)$ denotes the immediate reward, and $\gamma$ is the discount factor, which balances the relative importance of short-term rewards and long-term gains. The agent's goal is to learn a policy $\pi(a \mid s)$ through repeated interactions with the environment, thereby maximizing its expected cumulative reward.
The core idea of DRL is to learn optimal behavior through multiple rounds of interaction. In a typical interaction, the agent observes the current state $s_t$ at time step $t$, selects an action $a_t$ according to the policy $\pi(a \mid s)$, and after the action is applied to the environment, the environment returns a new state $s_{t+1}$ and an immediate reward $r_t$. This interaction iterates, with the agent aiming to maximize the expected cumulative return. The cumulative return is defined as:
$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$
where the discount factor $\gamma \in [0, 1]$ controls the influence of future rewards on current decisions.
To guide the agent’s decision-making, RL defines two key value functions: the state-value function and the action-value function. The state-value function $V_\pi(s)$ represents the expected cumulative reward starting from state $s$ under policy $\pi$:
$$V_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid s_t = s \right]$$
while the action-value function $Q_\pi(s, a)$ further evaluates the expected cumulative reward after executing action $a$ in a specific state $s$:
$$Q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid s_t = s, a_t = a \right]$$
For the estimation of value functions, Temporal-Difference (TD) learning is a crucial method. The TD method combines the recursive structure of dynamic programming with the sampling idea of Monte Carlo methods, its core being the use of the TD error $\delta_t$ to update estimates. For the Q value, the TD error is defined as:
$$\delta_t = r_t + \gamma Q(s_{t+1}, a') - Q(s_t, a_t)$$
where $a'$ is the action chosen in the next state $s_{t+1}$ (as derived from a policy or optimality conditions). The corresponding update is:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \delta_t$$
where $\alpha$ is the learning rate.
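As a concrete illustration of the TD update above, the following minimal tabular sketch applies the same rule in Python; the state discretization, learning rate, and discount factor are illustrative assumptions rather than values from any of the reviewed studies.

```python
import numpy as np

# Hypothetical sizes for a discretized glucose-control problem (illustrative only)
N_STATES, N_ACTIONS = 100, 11     # e.g., binned CGM readings x discrete insulin doses
ALPHA, GAMMA = 0.1, 0.99          # learning rate and discount factor (assumed)

Q = np.zeros((N_STATES, N_ACTIONS))

def td_update(s, a, r, s_next):
    """One temporal-difference (Q-learning) update of Q(s, a), matching the
    TD error and update rule above with a' chosen greedily in s_next."""
    a_next = int(np.argmax(Q[s_next]))
    td_error = r + GAMMA * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += ALPHA * td_error
    return td_error
```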
In the decision-making process, RL strategies can be divided into two categories: value-based strategies and policy-based optimization.
1. Value-based algorithms, such as Deep Q-learning (DQN), use a greedy or $\varepsilon$-greedy policy to derive an action from the value function. This method is generally limited to discrete action environments or low-dimensional tasks:
$$a_t = \arg\max_{a} Q(s_t, a)$$
2. Policy-based algorithms, such as policy gradient methods, explicitly learn a parameterized policy $\pi_\theta(a \mid s)$ and sample actions directly from it. The objective of policy gradient optimization is to maximize the expected cumulative reward of the policy, with its gradient given by:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s) \cdot Q_\pi(s, a) \right]$$
This method can naturally extend to continuous action spaces while avoiding the action search problem of value function methods.
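To illustrate the policy gradient above, the following is a minimal REINFORCE-style sketch in PyTorch, assuming a Gaussian policy over a single continuous insulin action; the network sizes, learning rate, and the use of sampled returns in place of $Q_\pi(s, a)$ are illustrative assumptions, not details from the reviewed studies.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Minimal stochastic policy over one continuous action (e.g., an insulin rate)."""
    def __init__(self, state_dim=4, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, 1)               # mean of the action distribution
        self.log_std = nn.Parameter(torch.zeros(1))  # learned log standard deviation

    def forward(self, state):
        h = self.body(state)
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())

policy = GaussianPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def policy_gradient_step(states, actions, returns):
    """One REINFORCE-style update: ascend E[log pi(a|s) * G_t], where the sampled
    return G_t stands in for Q_pi(s, a)."""
    dist = policy(states)
    log_prob = dist.log_prob(actions).sum(dim=-1)
    loss = -(log_prob * returns).mean()   # negative sign because the optimizer minimizes
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```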
In the case of AIDs, the environment consists of a patient, and the action is usually the dose determined for the insulin pen or the infusion flowrate determined for the insulin pump. By performing actions on the environment and calculating the corresponding rewards, the agent is able to learn strategies that minimize a given loss function [55]. The general process is shown in Figure 4.
In the medical scenario of DRL-based AIDs, the agent makes decisions regarding insulin dosage based on the internal Policy and acts through the insulin pump on the Environment (patient). The CGM feeds back BGC information as Observations to the agent for the next-round decision-making. The system uses Rewards to reflect the outcome of the previous action. If the insulin dosage decision keeps the BGC within the target range, a positive reward is given to indicate a gain; if it causes the BGC to deviate from the target range, a negative reward is provided based on the degree and duration of the deviation to represent a loss, thus guiding the agent to optimize its decisions.
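To make this concrete, below is a minimal sketch of one such piecewise reward; the thresholds and penalty magnitudes are illustrative choices rather than values taken from a specific study (Section 3.3 surveys the reward functions actually used in the literature).

```python
def bgc_reward(bgc_mg_dl: float) -> float:
    """Illustrative piecewise reward: positive inside the 70-180 mg/dL target range,
    increasingly negative as BGC deviates, with hypoglycemia penalized more heavily
    than hyperglycemia (assumed slopes, for illustration only)."""
    if 70.0 <= bgc_mg_dl <= 180.0:
        return 1.0
    if bgc_mg_dl < 70.0:                            # hypoglycemia: steeper penalty
        return -2.0 - 0.1 * (70.0 - bgc_mg_dl)
    return -1.0 - 0.01 * (bgc_mg_dl - 180.0)        # hyperglycemia: milder penalty
```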

2.2. Classification of DRL and Their Characteristics

Since its inception, DRL has seen extensive research dedicated to developing algorithms tailored to various tasks and requirements. In the context of AIDs and recent research trends, DRL algorithms can be broadly classified into three categories: (1) value-based and policy-based, (2) on-policy and off-policy strategies, and (3) model-based and model-free. This section provides a comprehensive introduction and analysis of the characteristics of different types of algorithms and their applicability in AIDs. The classification of DRL algorithms is shown in Figure 5.
1. Value-based and Policy-based: The value-based approach selects optimal actions by directly optimizing the action-value function $Q_\pi(s, a)$. This method is suitable for discrete action spaces and has been used to design decision tree strategies based on blood glucose states. As a classic method, DQN includes two Q-networks (a current network and a target network) for determining actions and estimating target values, respectively, ensuring stable Q-value estimation [56,57,58,59,60]. Double DQN is an improved version of DQN that incorporates two sets of Q-networks (two current and two target networks, four in total) to avoid overestimation issues; specifically, when calculating target Q-values, it always selects the lower of the two estimates [61].
However, methods that rely solely on value functions are significantly limited in continuous action spaces, as actions cannot be directly selected through a continuous value function, which affects the control accuracy of AIDs. To address this, policy optimization methods (such as PPO [62,63,64,65,66,67,68] and TRPO [69]) tackle complex continuous control tasks by directly optimizing the policy function $\pi_\theta(a \mid s)$. PPO is an improved version of TRPO, and both algorithms restrict policy updates to a trust region to avoid the policy deterioration caused by excessive updates of deep networks.
Combination methods of value and policy (such as DDPG, SAC, and TD3) have demonstrated impressive performance. These methods estimate the value function while improving the policy based on that value, as shown in Figure 6. This framework is flexible and can incorporate powerful components such as double Q-networks (SAC, TD3), deterministic policies (DDPG), exploratory stochastic policies (SAC), and delayed updates (TD3), exhibiting robust adaptability in AIDs. Some studies implement the Actor and Critic of DDPG with deep NNs for better performance [70,71,72]. Others apply the normalized advantage functions method, which simplifies the DDPG network structure, to AIDs to seek higher efficiency [73]. However, since DDPG is a deterministic policy gradient algorithm, it may exhibit insufficient exploration capability in some complex environments; moreover, in a continuous and infinite action space, the DDPG network structure is sensitive to hyper-parameter settings and converges poorly [70]. In response to these two deficiencies of DDPG, SAC [48,65,70,74,75,76,77,78,79] first incorporates an entropy regularization term and controls the importance of entropy (exploration) through the entropy regularization coefficient. Second, SAC contains a total of five networks: one actor, two current critic (Q) networks, and two corresponding target critic networks; by always using the smaller of the two Q-value estimates, it alleviates the Q-value overestimation problem. Improvements in the SAC algorithm's ability to utilize states have also been studied by using a dual attention network to extend the state and assign attention scores [74]. Recently, dual-agent SAC has been studied for AIDs [80]. TD3 was proposed almost simultaneously with SAC to address the same problems of DDPG [81,82,83,84].
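As a small illustration of the twin-critic trick mentioned above, the following sketch computes the clipped double-Q TD target used by SAC- and TD3-style methods; tensor shapes and the discount factor are assumptions for illustration.

```python
import torch

def clipped_double_q_target(reward, done, next_q1, next_q2, gamma=0.99):
    """TD target used by twin-critic methods (e.g., SAC, TD3): taking the minimum
    of two critic estimates curbs Q-value overestimation. All arguments are
    tensors of shape (batch,); gamma is an assumed discount factor."""
    next_q = torch.min(next_q1, next_q2)
    return reward + gamma * (1.0 - done) * next_q
```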
2. On-policy and Off-policy: On-policy strategies (such as TRPO and PPO) maintain consistency between the behavior policy (data collection strategy) and the target policy (desired strategy), emphasizing that the agent updates solely through interaction data sampled from its current policy. This approach excels in policy stability and convergence because On-policy methods consistently utilize samples generated by the latest policy for optimization, ensuring that policy updates align with the current policy direction and avoiding interference from historical or other policy distributions, thereby enhancing the accuracy of updates [85]. Additionally, On-policy methods restrict the exploration scope, keeping it centered around the current policy distribution, effectively mitigating risks associated with excessive exploration [85]. In the context of dynamic blood glucose control, this characteristic prevents significant adjustments to insulin doses, thereby reducing uncontrollable blood glucose fluctuations.
In contrast, Off-policy strategies (such as DQN, DDPG, and SAC) differ between the behavior policy and the target policy, aiming directly at obtaining the optimal target policy. Consequently, these algorithms can leverage historical data or data generated by other policies for updates, significantly boosting sample efficiency. This characteristic makes Off-policy strategies highly suitable for leveraging historical data and optimizing simulated environments in AIDs. For instance, historical data from patients undergoing different insulin injection treatment strategies and data generated from patient simulation models can be used to train Off-policy algorithms, reducing dependence on actual patient data while accelerating the policy learning process. However, Off-policy strategies may suffer from distribution shifts in response to patient dynamics (such as intra-patient variability or simulation errors), which necessitates reliable environment modeling or algorithmic adjustments for resolution.
3. Model-based and Model-free: Model-free methods rely on direct learning of policies or value functions from interaction data with the environment, thereby avoiding the need for precise modeling of patients’ insulin-glucose dynamics. This grants them significant advantages in high-dimensional and complex tasks, especially for personalized glucose control. However, model-free methods require a vast amount of interaction data, which may pose ethical challenges and high data acquisition costs in real patient scenarios.
In contrast, model-based methods (such as MPC-RL [75] or MBPO [60]) leverage explicit modeling of patients’ metabolic processes combined with RL algorithms for policy optimization. The crux of this approach lies in its ability to explicitly model environmental dynamics, reducing the demand for interaction data and providing higher sample efficiency. For instance, model-based methods can simulate patients’ postprandial glucose changes, guiding the optimization of insulin injection timing and dosage. However, model errors may significantly affect policy performance, necessitating mitigation through uncertainty estimation or model correction.

3. DRL-Based AIDs

3.1. State Space Variables Selection

Table 1 summarizes the state variables used in the literature and indicates whether each study corresponds to a fully closed-loop or hybrid closed-loop AIDs.
On the one hand, a fully closed-loop AIDs excludes additional external inputs, thereby circumventing control bias stemming from inaccurate estimations. On the other hand, hybrid closed-loop systems integrate external inputs (such as dietary information) into the state variables, enabling the agent to directly access data regarding external disturbances rather than having to learn them. Hence, if an external disturbance estimator proves sufficiently accurate, a hybrid closed-loop system can be deemed to enhance agent efficiency.
In the research on DRL-based AIDs summarized above, some studies expand the dimension of the state space, enabling the agent to make insulin decisions with the help of more information without the need to construct additional mathematical models, which is an advantage of AI techniques. In addition, different state spaces can be constructed according to the conditions of different patients (for example, some patients can reliably provide their dietary information while others cannot) to achieve BGC control under different conditions.
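As a simple illustration of how fully and hybrid closed-loop state spaces differ, the following sketch assembles a state vector from CGM and insulin histories, with meal information as an optional input; the window lengths and scaling constants are illustrative assumptions.

```python
import numpy as np

def build_state(cgm_history, insulin_history, meal_carbs=None):
    """Illustrative state construction for a DRL agent in an AIDs.
    cgm_history: recent CGM readings (mg/dL), e.g., the last 2 h at 5-min intervals.
    insulin_history: recent insulin deliveries (U) over the same window.
    meal_carbs: announced carbohydrates (g) for a hybrid closed-loop setup,
                or None for a fully closed-loop setup."""
    state = [
        np.asarray(cgm_history, dtype=np.float32) / 400.0,    # rough glucose scaling
        np.asarray(insulin_history, dtype=np.float32) / 5.0,  # rough insulin scaling
    ]
    if meal_carbs is not None:                                # hybrid closed loop only
        state.append(np.array([meal_carbs / 100.0], dtype=np.float32))
    return np.concatenate(state)

# Fully closed-loop state: CGM and insulin histories only
s = build_state(cgm_history=[120, 135, 150], insulin_history=[0.5, 0.5, 1.0])
```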

3.2. Action Space Variables Selection

Insulin secretion in the physiological state can be divided into two parts: first, continuous micro-insulin secretion independent of meals, i.e., basal insulin secretion, which is released in small pulses continuously over 24 h to maintain BGC in the fasting and basal states; and second, a large amount of insulin secretion stimulated by the elevation of BGC after a meal, which forms a curvilinear secretion wave, i.e., mealtime insulin secretion [9]. Equation (8) summarizes the insulin bolus dose used in the standard bolus calculator model [88], which is closely related to the computational goals of DRL-based AIDs. The action variables of DRL-based AIDs are summarized in Table 2.
Bolus insulin = Meal insulin + Correction insulin − IOB
In Table 2, the action space definitions include different variables such as insulin dose (basal and/or bolus), insulin injection time, food intake, or optimization of the bolus by tuning the parameters used to calculate the insulin bolus [89]. Studies that did not distinguish basal insulin from bolus insulin combined the two into a single output (insulin dose) used to regulate the fluctuations in the patient's BGC.
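For reference, the following sketch implements the standard bolus calculator of Equation (8); the insulin-to-carbohydrate ratio (ICR), correction factor, and target glucose are hypothetical per-patient parameters used only for illustration.

```python
def bolus_calculator(carbs_g, current_bg, iob_u,
                     icr_g_per_u=10.0, cf_mg_dl_per_u=40.0, target_bg=110.0):
    """Standard bolus calculator corresponding to Equation (8):
    bolus = meal insulin + correction insulin - insulin on board (IOB).
    ICR, correction factor, and target BG are illustrative per-patient values."""
    meal_insulin = carbs_g / icr_g_per_u
    correction_insulin = max(current_bg - target_bg, 0.0) / cf_mg_dl_per_u
    bolus = meal_insulin + correction_insulin - iob_u
    return max(bolus, 0.0)   # never recommend a negative dose

# Example: 60 g meal, BG of 190 mg/dL, 1.5 U still on board -> 6.0 + 2.0 - 1.5 = 6.5 U
print(bolus_calculator(60, 190, 1.5))
```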
When the demand for insulin optimization reported in these studies concerns only mild cases with relatively smooth intraday fluctuations, or patients who have received long-term insulin therapy, optimizing a few mealtime insulin doses per day can suffice, since their daily and postprandial insulin requirements are close to stable. This not only reduces the physiological and psychological burdens of data monitoring on patients but also enables optimization of the daily insulin dose to adapt to long-term BGC fluctuations. For patients with severe intraday BGC fluctuations, especially adolescents, optimizing the insulin dose on a daily basis is far from enough. Glucose changes need to be measured at the CGM sampling frequency (every 1, 3, or 5 min), and insulin infusions need to be optimized at a frequency much higher than once a day to accommodate their more dramatic glucose fluctuations. With such high-frequency administration, AIDs obtain a large amount of data, and the DRL control algorithm, being a data-driven method, can make decisions more effectively. In addition, DRL can capture the characteristics of blood glucose fluctuations in this situation by expanding the state space, effectively reducing the risks without having to reconstruct the mathematical model.
Table 2 shows that the discrete or continuous nature of the action space is related to the type of algorithm. Most studies have also added restrictions to the action space to give a safe “trust region” for the output insulin dose and to accelerate the convergence of the algorithm to improve efficiency.

3.3. Reward Function Selection

This subsection summarizes and analyzes the reward functions used to develop control algorithms, which measure the level of success of agents based on the selected actions that result in changes in the state of the environment. The reward function can be customized based on the goals of the task and expert knowledge. Table 3 summarizes the wide variety of reward functions defined in the literature.
When designing the reward function, the acceptable BGC is in the range of 70 mg/dL to 180 mg/dL (time in range, TIR) [4]. The risk associated with hypoglycemia differs significantly from that of hyperglycemia; in general, hypoglycemia poses the greater risk. Hyperglycemia rarely causes serious consequences in a short period, apart from ketoacidosis, whereas hypoglycemia can cause headaches, coma, and even sudden death within a few hours. In addition, the level of fluctuation in BGC values is one of the criteria for evaluating the quality of the recommended insulin dose. Under the premise that BGC is controlled within the target range (70–180 mg/dL), the smaller the fluctuation of BGC, the better the control effect and the smaller the damage to the vasculature. Moreover, for an iterative process such as DRL, a suitable reward function related to BGC must be formulated to evaluate whether an insulin dose in a given state is good or bad, thereby defining the objective of the DRL problem.
Magni et al. [90] mapped the risk for BGC values of 70 and 280 mg/dL to the same value of 25, and the risk for BGC values of 50 and 400 mg/dL to the same value of 50, by constructing a risk function: the risk value rises rapidly in the hypoglycemic region and slowly in the hyperglycemic region. Based on this risk function, scholars have defined various new reward functions with always negative reward values [48,59,77,81,82,86]. In addition, some papers introduced a Gaussian function as part of the reward function [64,69,74,89,91,92,93]. Other papers used the absolute value of the relative error between the actual BGC and the average normal BGC as the reward function, so that the reward for each step was between 0 and 1 [94]. Some papers reported mining more glycemic features and combining them with the weights used to scale the hypoglycemic and hyperglycemic components to form a reward function [95,96,97,98]. For example, an exponential function with the natural constant e as the base was used as the reward function, and the median value of 127 mg/dL at the boundary between hypoglycemia and hyperglycemia was used as the target glucose value to reward the actual action, with the reward value always ranging from 0 to 1 [99]. In addition, many papers have focused mainly on high and low glucose events, setting the reward function as a penalty value (−3, −2, −1, 0, or multiplied by different weighting coefficients) in different glucose concentration intervals (hyperglycemic, hypoglycemic, and normoglycemic) [59,60,61,64,65,69,71,73,83,87]. Some papers set a target glucose value (e.g., 80 or 90 mg/dL) and use the negative of the absolute error between the glucose concentration and the target value as the reward function [100,101].
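As an illustration of the risk-based rewards described above, the sketch below implements a risk function of the form used by Magni et al. [90] together with an always-negative reward derived from it; the constants shown are the commonly used values for this blood-glucose risk index and should be treated as an assumption here, and the exact reward shaping differs across the cited studies.

```python
import math

def magni_risk(bgc_mg_dl, c0=3.35506, c1=0.8353, c2=3.7932):
    """Blood-glucose risk index in the style of Magni et al. [90]: the risk rises
    steeply in the hypoglycemic region and slowly in the hyperglycemic region."""
    f = c0 * (math.log(bgc_mg_dl) ** c1 - c2)
    return 10.0 * f ** 2

def risk_reward(bgc_mg_dl):
    """One common construction: an always-negative reward equal to the negative
    risk, pushing the agent toward the low-risk glucose region."""
    return -magni_risk(bgc_mg_dl)

# The risk (and hence the penalty) grows much faster per mg/dL below 70 mg/dL
# than above 180 mg/dL, matching the asymmetry described in the text.
for g in (50, 70, 120, 280, 400):
    print(g, round(magni_risk(g), 1))
```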
An improperly defined reward function may lead to risky behavior called reward hacking, in which the agent maximizes reward through unexpected behavior. For example, when glucose fails to reach a safe level, the cumulative reward may be maximized by ending the episode as soon as possible. To avoid this, termination penalties are added to trajectories that enter a dangerous glucose state (glucose levels below a lower limit or above an upper limit) [48]. In extreme scenarios, given the objective of avoiding termination, this penalty should be set to infinity, or at the very least to a value negative enough to offset the patient's remaining life expectancy [65]. Turning to safety, in addition to the safety constraint rules on actions discussed in the previous subsection, some studies have also included actions (insulin doses) in the reward function, which helps to avoid regions of instability [99] and may lead to less aggressive dosing for the patient.
Due to the different individual needs of patients, research related to DRL-based AIDs often requires customizing the reward function to match the differentiated diagnosis and treatment needs. Since the reward function is the main carrier for driving the agent to optimize its strategy, when designing the reward function, more emphasis should be placed on safety to reduce the agent’s dangerous exploration.
Although some current studies have made improvements in this regard, it is still not possible to completely ensure that every output action is safe. This may be because relying solely on the guidance of the reward function cannot completely prevent the agent from taking dangerous actions. In the future, it is necessary to further constrain the action space to find action outputs within the safe range.

4. Challenges to Be Addressed

The above summary of the algorithms, state space, action space, and reward function settings indicates that the application of DRL to AIDs promotes the development of AI in healthcare. However, some important challenges must be addressed for the implementation of DRL algorithms in AIDs applications.

4.1. Low Sample Availability

There are limitations to applying NN-based DRL to clinical trials for insulin administration [74]. One of the drawbacks is the problem of low sample availability, i.e., many direct interaction trials with the environment are required upfront, and successes and failures must be experienced to shape the information from trial-and-error events to determine a good action strategy. However, patients cannot be subjected intentionally to treatment failures such as hypoglycemia or hyperglycemia. Simulation studies provide an alternative, but high-fidelity personalized models are needed. Figure 7 shows various methods used to address this problem in current studies. High costs of developing high-fidelity models, extensive simulations to cover many real-life scenarios, and high data collection costs over extended periods of time remain obstacles to applying DRL to AIDs.
As illustrated in Figure 7, some studies have used NN-based techniques for DRL to improve sample utilization, such as offline DRL algorithms, model-based DRL algorithms, pre-training with datasets, and meta-training methods. Offline RL [60,82,83] trains the agent offline with datasets containing state-action pairs and deploys it to the real environment after the offline training is completed, thus avoiding the process of interacting with the real environment, which is beneficial for AIDs with high safety requirements. Model-based RL [58,102] uses the state-action pairs generated by the interaction between the agent and the environment to learn a model, and then uses the model to interact with the agent and generate actions, or learns from the generated simulated data. A typical algorithm is Dyna-Q [58]: after each interaction with the environment, it performs one Q-learning update followed by n Q-planning updates using the learned model, and the agent of model-based RL switches to interacting with the model after the initial period of interaction with the real environment, lessening the need for subsequent trial-and-error work. During pre-training, some approaches [48,59,71,102] (or transfer learning in a narrow sense) initialize some or all of the networks in DRL based on an existing dataset and then use them to interact with the real environment after pre-training is completed. For instance, one might opt to fine-tune all layers of the generalized model, or retain the weights of certain earlier layers while fine-tuning only the higher-level components of the network to mitigate overfitting. It is reported that the earlier layers contain more generalized features (e.g., insulin suspension during hypoglycemic trends), which should be helpful for all BGC control approaches in T1D [59]. The purpose of pre-training is to find parameters that are good for all tasks, but there is no guarantee that taking trained parameters and training them on other tasks will yield good results. The pre-training approach uses a dataset to train the network parameters in advance before migrating them to a new task, which is somewhat safer than having a completely untrained network interact with the environment directly.
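The following minimal sketch illustrates the Dyna-Q scheme described above, i.e., one Q-learning update from real experience followed by n Q-planning updates replayed from a learned model; the table sizes and hyper-parameters are illustrative assumptions.

```python
import random
import numpy as np

N_STATES, N_ACTIONS = 100, 11
ALPHA, GAMMA, N_PLANNING = 0.1, 0.99, 10     # assumed hyper-parameters

Q = np.zeros((N_STATES, N_ACTIONS))
model = {}                                    # (s, a) -> (r, s') learned from experience

def q_update(s, a, r, s_next):
    Q[s, a] += ALPHA * (r + GAMMA * Q[s_next].max() - Q[s, a])

def dyna_q_step(s, a, r, s_next):
    """One Dyna-Q step: a direct Q-learning update from real experience,
    followed by n Q-planning updates replayed from the learned model."""
    q_update(s, a, r, s_next)                 # 1 Q-learning update
    model[(s, a)] = (r, s_next)               # update the (deterministic) model
    for _ in range(N_PLANNING):               # n Q-planning updates
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        q_update(ps, pa, pr, ps_next)
```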
In contrast, meta-training [75,76] uses multiple tasks, each with a small number of samples, to learn the mechanisms they have in common so that the model can be adapted quickly to a new task; for example, patients' personalized information can be treated as multiple sub-tasks through a probabilistic encoder embedding algorithm [63], so that the model adapts more quickly when applied to a new person with diabetes. In contrast to pre-training, the purpose of meta-training is not to assess how well the network parameters perform on each sub-task but rather to focus on how well the overall network parameters trained across the sub-tasks perform [103]. In fact, the meta-training approach also trains a model in advance, but unlike narrow pre-training, meta-training is task-based and requires less data than data-based pre-training, which further improves sample utilization. The loss evaluated in meta-training is the test loss after a task has been trained, whereas pre-training directly seeks to minimize the loss on the original base without task-specific training. Meta-training methods used in combination with DRL therefore play a role similar to pre-training and narrow pre-training: they avoid interacting with the environment with a completely untrained network. However, pre-trained models cannot always adapt quickly to new tasks, because differences exist between the new tasks and the population model, a situation that will certainly occur in personalized medicine. Therefore, some studies use a probabilistic encoder to obtain valuable information from new experiences and from the patient's personalized information to adapt to new tasks [76], which may provide an idea for alleviating the distribution error problem.
For the problem of low sample availability, in the future, it would also be possible to use imitation learning to train using data generated from the decisions/actions of domain experts. This will enable achieving outputs close to the expert’s strategy in real environments, an idea that has already been demonstrated in other domains [104], and which will hopefully be used in the future in the design of DRL-based AIDs controllers.

4.2. Personalization

“Offline RL”, “Model-based RL”, and “Transfer Learning” train a strategy in advance based only on sample data. If this initial strategy is deployed in a real environment, the problem of distributional bias can occur. Specifically, if the next state encountered is one that has not been observed before in the data, the strategy, never having been trained in this state or a similar one, may choose an action essentially at random, causing the next state to deviate further from the distribution of data encountered by the expert strategy. This is called a “compounding error” in behavioral cloning and a “distributional shift” in offline RL. Moreover, a model that has been trained with a large amount of data for a certain task usually needs to be retrained after switching to another task, as shown in Figure 8, which is very time-consuming and labor-intensive.
The distribution bias illustrated in Figure 8 is the personalization problem: whether the method can be adapted quickly when deployed to different patients. Various proposed solutions are outlined below. Firstly, the meta-learning method [76] can effectively alleviate the computational cost caused by a large number of tuning parameters and by retraining models when switching tasks. It tries to give the model the ability to learn how to tune its parameters and to quickly learn new tasks based on existing knowledge. Since it is trained on tasks rather than on individual samples, it is characterized by fast learning of new concepts or skills with a small number of samples, and no retraining is required when facing a new task. Compared to the three methods mentioned above, meta-trained models are expected to be more effective when deployed in real environments and to adapt quickly.
However, meta-learning requires the training and testing tasks to have distributions as similar as possible, which seems contrary to solving the distribution error problem. It has been proposed to use active learning to select and label those samples that are more representative of the testing task, reducing the generalization problem caused by differing distributions [76]. Hence, the ability to “learn to learn” also becomes a unique advantage of meta-learning in alleviating the distribution error problem. Secondly, some work specifically addresses the extrapolation error (i.e., the distributional error) of “offline RL” [82].
To address the problem of adapting a model to different individuals, the “classification before training” approach [48,61] provides another solution by clustering the patient population into smaller categories to enhance personalization. Essentially, however, this does not increase the level of personalization of a single model and does not guarantee that all characteristics are taken into account, since the human body is so complex that it is not possible to differentiate between patients by classification alone or to provide a model for every patient. The “population-then-migration-to-individual” approach [59,71,83] also suffers from distributional errors and has been shown to increase catastrophic failure rates, i.e., “over-fitting”, when training on individual patients. Other alternatives include the “reward function, action space, or observation frequency personalization” approach [65,102], which can serve as a complementary measure because both the reward function and the action space are customizable, and the “hyper-parameter personalization” approach [63]. Moreover, inverse DRL can be considered for personalization: it learns individual preferences and goals from historical patient data without explicitly categorizing the patient, so that the underlying objective function is inferred from observed behavior and the goal of the glycemic control task can be learned automatically without manually defining personalized goals. It also reveals how the model is learned from the patient, improving interpretability, and it may be beneficial to design personalized action spaces using clinically identified individual parameters [62].

4.3. Security

The issue of low error tolerance (i.e., safety), which is a feasibility issue in the practical application of DRL in combination with AIDs, involves two obstacles: the low number of samples in the earlier stages of controller development and the post hoc protection measures. The first problem has already been discussed in Section 4.1. The second, the post hoc protection measures, can be understood as the ability of the clinician to intervene with the controller and the patient when the situation is incorrectly predicted, based on clues about the internal state of the patient and the controller. Most studies have made some effort in this regard, e.g., by “imposing termination penalties on reward functions”, “narrowing the action space” [48,61,70,71,87,89,105], “restricting the search strategy” [83], and “threshold pause” [59] to make the algorithm pay more attention to safety, or by introducing the idea of “switching control” [58,74,75] to switch to a safer strategy when DRL does not perform well.
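As a simple illustration of such post hoc protection measures, the sketch below wraps a DRL agent's output in a safety layer that narrows the action space and applies a threshold pause; the thresholds and the crude glucose projection are illustrative assumptions, not clinically validated rules.

```python
def safe_insulin(agent_action_u, cgm_mg_dl, cgm_trend_mg_dl_min,
                 u_max=2.0, pause_threshold=80.0):
    """Illustrative safety layer wrapped around a DRL agent's output: clip the
    dose to a trusted range ("narrowing the action space") and suspend insulin
    when glucose is low or projected to go low ("threshold pause")."""
    action = min(max(agent_action_u, 0.0), u_max)           # clip to a safe range
    projected = cgm_mg_dl + 30.0 * cgm_trend_mg_dl_min      # crude 30-min projection
    if cgm_mg_dl < pause_threshold or projected < 70.0:
        return 0.0                                          # suspend insulin delivery
    return action
```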
In the future, hierarchical DRL is expected to be introduced to cope with different safety levels, e.g., having a bottom-level control policy for routine situations and triggering a high-level emergency control policy when a dangerous situation is detected. Safe Reinforcement Learning is another method to enhance the security of AIDs. It ensures that the agent does not enter unsafe states by adding constraints during the policy optimization process and has achieved good results in fields such as robotics [106] and autonomous driving [107]. Furthermore, before putting these algorithms into practice, extensive evaluation is necessary, in addition to taking extra precautions and setting strict constraints to make the automated decision-making process as safe as possible. Perhaps more important is the need to collaborate with regulatory agencies to develop extensive tests and criteria to assess the safety of the proposed methods at every step.

5. Practical Applications

In recent years, DRL algorithms have been proposed as an alternative control technology for AIDs for people with diabetes, especially in T1D. The performance of the proposed methods has usually been illustrated with simulation studies. Clinical data are often difficult to obtain because patients and healthcare professionals must handle sensor data collection with care to ensure accuracy and reliability, in addition to the ethical issues related to using them [49,81]. However, AIDs also confront several practical challenges that impede the smooth implementation of DRL in AIDs. Current AIDs have a high cost that limits widespread adoption unless covered by health insurance. Growing data collected from current users indicate that the average time-in-range of glucose concentration has improved compared to alternative approaches for insulin dosing decisions and delivery. Early models of AIDs had limitations in capturing and interpreting some dynamic variations in glucose levels that led to improper insulin actions. Feedback from users has enabled the improvement of the algorithms and eliminated many of these shortcomings. Usability has also improved with improvements in the user interfaces. The current systems rely on glucose and insulin information and necessitate manual entries for meals and exercise events. Multivariable AIDs that use data from additional sensors can eliminate the need for manual entries by users and enhance the situational awareness of the AIDs in real-world use. Advances in technology, competition among different AIDs companies, and higher volumes of production will enable further improvement of AIDs technologies and reduce the cost of these devices. Table 4 provides a summary analysis of recent literature regarding practical applications of DRL algorithms in AIDs. In the table, a √ indicates that the corresponding approach was adopted.
As can be seen from Table 4, most of the current studies use simulators to collect data to train agents, and the test subjects are virtual patients in simulators. This can leave a big gap from the real world, because most of these studies use virtual platforms with fixed patient parameters and meal scenario settings and do not consider disruptive conditions such as physical activity, which would be unavoidable in a real-world scenario. A few studies use mathematical models to directly generate data for agents to learn and test, which is similar to using simulators but has the disadvantage of not being an accurate representation of the real world. Some studies are now gradually trying to transition from virtual platforms to the real world, using real electronic medical record datasets for agent training and evaluation, which may indicate the potential for the clinical application of DRL in the future. A recent study implemented DRL for clinical evaluation under strict constraints [86].

5.1. Data

There may be several reasons for using simulated data rather than real-world data in most studies listed in Table 4:
(1)
Collecting large amounts of clinical data is usually expensive and time-consuming [75,108], and only a few studies have performed clinical validations [86,94]. The algorithms reported in Table 4 used simulated data from 5 days [76], 6 days [75], 30 days [59], 180 days [61,71], 1050 days [77], and 1000 days [48] to develop personalized dosing strategies, whereas it would be unethical to allow DRL algorithms to control patient BGC for several years in vivo without any associated safety guarantees [82]. To validate the scalability of offline DRL methods in more intricate environments, algorithms should undergo training and evaluation using genuine retrospective patient data samples (e.g., data available through the OhioT1DM dataset) [83], which is why it is crucial to perform off-policy evaluation of offline algorithms [83,109].
(2)
Assessing algorithms in human subjects without preclinical validation or suitable safety constraints can pose risks [71]. In fact, due to the nature of DRL techniques, the use of simulated environments is particularly appropriate since the learning of the model is achieved through trial-and-error interactions with the environment, and therefore, performing such processes on virtual objects is critical to avoid dangerous situations for people with diabetes [61].
However, there are several disadvantages to using in-silico data:
(1)
Simulations often overestimate the benefits of testing interventions because they fail to account for all the uncertainties and perturbations that occur in real-life scenarios [89].
(2)
Patterns learned from large amounts of simulated data do not always match the real clinical world [76].
In response to the first point, many studies have provided explanations. For example, some studies consider that, as long as the algorithm does not rely on dietary announcements or expert knowledge to work properly, an accurate simulation is sufficient and the learning phase does not need to be completed with real-world data. Of course, this would require a precise mathematical model with parameters adjusted separately for each patient [73]. However, the lack of real-world data remains a problem if this approach is to be applied to treating other diseases. A recent proof-of-concept feasibility trial demonstrated the feasibility of DRL in hospitalized patients with T2D and showed that the acceptability, effectiveness, and safety of AI protocols are similar to those of treating physicians [86].
In response to the second point, the distribution error problem is summarized in Section 4.2. While the UVa/Padova simulator [110] may not adequately capture inter-patient differences or changes in the glucose regulatory system over time, its acceptance by the FDA as an alternative to animal testing is a non-trivial achievement [48]. It is the most frequently used simulator in the reviewed studies, followed by the AIDA and Bergman minimal models [111,112] as the second most common choice [49]. However, while the current situation may highlight the challenges in obtaining real-world data from individuals with diabetes, there is indeed a pressing need to transition research from simulated to clinical data [94] to facilitate the validation of algorithms [49]. To mitigate the problem of low sample utilization in DRL applications to AIDs, several studies [59,61,62,71,75,76,82,87] used a simulator or dataset for “pre-training” to train models in advance and then test them. Some measures to address this issue have also emerged in recent years, such as the combination of clinical electronic health record (EHR) data and DRL, which can consistently recommend medication dosages in different patient physiological states. This approach can take into account more complex physiological variables and mortality information than studies that simulate glucose management [60] and is more suitable for inpatient diabetes management, as are other EHR-based studies [101,113,114,115]; thus, studies based on EHR data appear to be a promising research direction for the future.
There is also the issue of instrument noise in the data. Previous studies have shown that CGM noise affects the performance of the model [71]. Numerous uncertainties and noises exist in the real world, such as artifacts from the CGM system, physical activity, and health conditions, which may affect the BGC level and reduce the effectiveness of the control algorithms. Future work should include clinical validation of the proposed algorithms, the addition of stochastic noise in simulators (already implemented in several studies), and the addition of various disturbances such as physical activity, acute stress, and variations in insulin sensitivity over time to approximate real-world scenarios.
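As an example of adding stochastic sensor noise to a simulator, the following sketch superimposes simple autocorrelated (AR(1)) noise on a simulated glucose trace; the noise parameters are illustrative and do not correspond to a published CGM error model.

```python
import numpy as np

def add_cgm_noise(true_glucose, phi=0.7, sigma=5.0, seed=None):
    """Add simple autocorrelated (AR(1)) sensor noise to a simulated glucose trace
    to roughly mimic CGM measurement error. phi and sigma (mg/dL) are illustrative
    values, not taken from a specific CGM error model."""
    rng = np.random.default_rng(seed)
    noise = np.zeros(len(true_glucose))
    for t in range(1, len(true_glucose)):
        noise[t] = phi * noise[t - 1] + rng.normal(0.0, sigma)
    return np.asarray(true_glucose, dtype=float) + noise

noisy_trace = add_cgm_noise([120, 125, 130, 140, 150], seed=0)
```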

5.2. Computational Power

The main drawback of using DRL is the computational complexity of training [69]. Many studies have shown that more than 1000 days of training in simulators are required to obtain good BGC control strategies [48,59,61,62,77,116], and the resulting computational requirements are very high. Using a mid-level computing device (Intel i9-10920X, 64 GB RAM, 2 NVIDIA RTX 2080 GPUs), practical parameter and hyper-parameter training for a single patient usually takes 7 to 10 days, even if 4 to 8 patients can be trained simultaneously [65]. Therefore, it is essential to keep the state-space model small enough to prevent memory overrun when the simulation must cover a given number of days. Some studies have estimated that GPU usage would be roughly ten times the original for the theoretical case in which the state space grows [102]. The training process can also be accelerated by the methods summarized in Section 4.1, by prioritized experience replay [117,118], and by the use of a nonlinear action mapping function [62], but training efficiency still leaves much to be desired.
Regarding computational constraints on embedding the algorithm, technological advances in medical devices and communications over the last few years have brought mobile apps (which can act as carriers for the control algorithm), CGMs, and insulin pumps closer together, making AIDs more convenient. Many researchers have already integrated simple control algorithms into apps to automatically manage or recommend insulin infusions and have evaluated them in clinical trials [119,120,121,122]. At present, algorithm models are usually trained on cloud servers. Future research may migrate these models into smartphone operating systems (e.g., iOS, Android) to enable local training of DRL models with real-time data using the smartphone's central processing unit. Such data-driven algorithms could also potentially be applied to people with T2D who are supported by insulin therapy, although further research is needed to achieve this goal. Once the models are trained to maturity, they can be readily embedded into smartphone applications, providing a viable solution for future clinical trials [71]. For example, if TensorFlow is used to develop DRL models, they can be deployed on smartphones or embedded devices through the TensorFlow Lite converter. Furthermore, embedded algorithms have the potential to be continuously trained and optimized with new data obtained from devices (e.g., CGMs, insulin pumps, activity monitors) and user inputs (e.g., dietary information) [59].
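As a minimal sketch of the deployment path just described, the snippet below converts a placeholder Keras policy network with the TensorFlow Lite converter; the network architecture, input size, and output scaling are illustrative assumptions rather than a model from the reviewed studies.

```python
import tensorflow as tf

# Placeholder policy network mapping a short CGM/insulin history to a
# normalized basal insulin rate; the architecture is illustrative only.
policy = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(12,)),              # e.g., last 12 CGM samples
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # normalized insulin rate
])

# Convert the (trained) model for on-device smartphone/embedded inference
converter = tf.lite.TFLiteConverter.from_keras_model(policy)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional quantization
tflite_model = converter.convert()

with open("policy.tflite", "wb") as f:
    f.write(tflite_model)
```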

6. Comparison of Main Control Algorithms for AIDs

6.1. Model Predictive Control

The MPC algorithm forecasts future BGC with the aim of driving the BGC into the target range. Operating the MPC involves two steps: building the predictive model and solving the constrained optimization problem. There are three approaches to building a predictive model: data-driven modeling, physiological first-principles (usually compartmental) modeling, and hybrid modeling. Model-based algorithms require a physiological model of the glucose-insulin kinetics of the human body, and significant effort has been devoted to fitting such physiological models [38,123]. Solving the constrained optimization problem aims to minimize the gap between the predictive model output and the reference value; the general structure is shown in Figure 9.
The prediction modeling is generally done using the state space equations shown below [90]:
$x_{k+1} = A x_k + B u_k + M d_k, \quad y_k = C x_k$
where $x_k$, $u_k$, $d_k$, and $y_k$ are the states, inputs, disturbances, and outputs of the model, and $A$, $B$, $M$, and $C$ are the corresponding parameter matrices. The optimization control objective is generally in the form of a quadratic polynomial [124]:
$J(u) = \left\| y_k - y_r \right\|^2 + \lambda \left\| \Delta_k \right\|^2$
where $y_k$ is the measured value of the patient's BGC at time $k$, $y_r$ is the target value of the patient's BGC, $\Delta_k$ is the information associated with the regularization term, and $\lambda$ is the regularization coefficient, which takes values in the range 0 to 1.
The MPC algorithm uses previous BGC values (either the reported CGM data or BGC estimated from the CGM data) together with insulin infusion data; the established prediction model forecasts future BGC, the optimizer then computes the most appropriate current insulin infusion, and the process is repeated at the next sampling instant to obtain the most appropriate insulin infusion for that moment (Figure 10).
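The receding-horizon loop described above can be illustrated with a minimal sketch based on the linear state-space model and quadratic cost shown earlier. The scalar model coefficients, horizon length, weights, and insulin bounds below are illustrative assumptions, not fitted patient parameters or values from the cited studies.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative scalar model x_{k+1} = A x_k + B u_k + M d_k, y_k = C x_k
A, B, M, C = 0.98, -0.5, 0.1, 1.0
y_ref, lam, N = 110.0, 0.5, 6          # target BGC (mg/dL), weight, horizon

def predict(x0, u_seq, d_seq):
    """Roll the model over the horizon and return the predicted outputs."""
    x, y_pred = x0, []
    for u, d in zip(u_seq, d_seq):
        x = A * x + B * u + M * d
        y_pred.append(C * x)
    return np.array(y_pred)

def cost(u_seq, x0, d_seq):
    """Quadratic objective: tracking error plus a move-suppression term."""
    y_pred = predict(x0, u_seq, d_seq)
    du = np.diff(np.concatenate(([0.0], u_seq)))
    return np.sum((y_pred - y_ref) ** 2) + lam * np.sum(du ** 2)

def mpc_step(x0, d_seq):
    """Solve the horizon problem and apply only the first insulin move."""
    res = minimize(cost, x0=np.zeros(N), args=(x0, d_seq),
                   bounds=[(0.0, 5.0)] * N)   # insulin rate limits (U/h)
    return res.x[0]

print(mpc_step(x0=160.0, d_seq=np.zeros(N)))  # one receding-horizon step
```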
DRL control algorithms offer numerous advantages over the MPC algorithms frequently utilized in current AIDs. These advantages are as follows.
1. 
Adaptation to potential perturbations
The DRL control algorithm has a natural advantage under uncertain perturbations; the comparison with the MPC control algorithm is as follows. First, since MPC relies on explicit models for prediction and explicit models cannot capture all perturbations, MPC is susceptible to uncertain perturbations. Most of these predictive models are linear (data-driven) models of the form
$y_k + a_1 y_{k-1} + \cdots + a_{n_a} y_{k-n_a} = b_0 u_{k-1} + \cdots + b_{n_b} u_{k-n_b} + f_0 \varepsilon_k + \cdots + f_{n_c} \varepsilon_{k-n_c}$
where $y$ is the BGC of the patient with type 1 diabetes at discrete time $k$, $u$ is the insulin infusion rate, $\varepsilon$ is the model prediction error, and $a_1, \ldots, a_{n_a}, b_0, \ldots, b_{n_b}, f_0, \ldots, f_{n_c}$ are the parameters to be determined in the model [125].
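As an illustration of how such a linear data-driven model produces a one-step-ahead BGC prediction, the sketch below evaluates the equation above with the unknown current noise term taken at its expected value of zero; the coefficients and past values are hypothetical placeholders, not identified model parameters.

```python
import numpy as np

def arx_predict(y_past, u_past, e_past, a, b, f):
    """One-step-ahead BGC prediction from the linear model above.

    y_past, u_past, e_past hold recent BGC values, insulin rates, and model
    residuals (newest first); a, b, f are the fitted coefficients of the
    corresponding past terms. The unknown current noise term is taken at
    its expected value of zero."""
    return -np.dot(a, y_past) + np.dot(b, u_past) + np.dot(f, e_past)

y_hat = arx_predict(y_past=np.array([150.0, 145.0]),   # y_{k-1}, y_{k-2}
                    u_past=np.array([1.2, 1.0]),       # u_{k-1}, u_{k-2}
                    e_past=np.array([0.5, -0.3]),      # past residuals
                    a=np.array([-1.5, 0.6]),
                    b=np.array([-2.0, -1.0]),
                    f=np.array([0.4, 0.1]))
print(y_hat)
```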
Instead of the above linear model, a nonlinear model can be developed. This could be a physiological, first-principles model or a neural network (NN) based model to fit the relationship between the inputs and outputs. Figure 11 illustrates a feedforward NN.
At this point, when an NN is used, the computation of the state variables cannot be expressed as an equation embedded in the optimization, because the function $f(\cdot)$ in $y = f(x, u)$ cannot be expressed explicitly; that is, the control input $u$ cannot be solved for through the state-space equation, and the subsequent optimization must instead be sought by iterating through all admissible control inputs, as shown in Figure 12.
Although predictive models utilizing nonlinearities are able to capture nonlinear relationships better as well as handle multidimensional data, there are still two problems:
(1)
The model cannot adapt to uncertain perturbations because the inputs to the predictor are only factors already known to affect BGC.
(2)
Even with a neural network-based model, MPC still needs to solve the optimization problem at each time step (traversal solving), which can become very difficult in highly uncertain environments.
As for the DRL control algorithm, since DRL follows a value-based strategy, the product of the state-transition probabilities of all possible subsequent states and the corresponding values is taken into account in the value estimation, as described in Equations (2) and (3). The DRL control algorithm can therefore adapt to potential perturbations, whereas the MPC control algorithm cannot adapt until the disturbance is either explicitly modeled or its effect appears in the output (the BGC reported by the CGM), which causes significant delays in response.
2. 
Responding to Patient Variability
The control algorithm of AIDs should be able to adapt quickly to different patients and to the inter-day and intra-day variability of a single patient, which is referred to as the problem of inter- and intra-patient variability.
For the MPC control algorithm, although the continuous optimization of the predictive model has shown promising effects in terms of glycemic control, the controller may not be able to achieve the desired effect because it can never truly fit the actual patient’s physiological processes. The limitations of the predictive model with fixed parameters cause the degradation of performance when large disturbances occur [126]. Moreover, though the predictive model parameters of MPC can be personalized for different patients, individualized parameter tuning is invasive and costly [127]. In addition, executing the online optimization problem on an embedded device results in increased computational expenses.
In contrast, DRL control algorithms, by virtue of interacting with the environment, can discriminate between different patients, or the variability of the same patient over different time intervals, by implicitly adjusting the state-transition probabilities within the algorithm in a closed-loop manner.
3. 
Computational Efficiency
For DRL, once the network is trained, there is no need to solve an optimization problem at every time step as MPC does, which yields an efficiency gain.
4. 
Exploratory Capacity
MPC incorporates prediction, which partially addresses the delay problem. DRL also embodies the concept of prediction, realized through the value function (the value function is a prediction of the overall future reward). However, MPC evaluates the goodness of a state through $y_p$, the predicted future BGC, which is entirely determined by the performance of the predictive model, and through $u$, which is computed from $y_p$; this means that the control performance of MPC depends on the quality of the predictive model.
DRL, by contrast, evaluates the goodness of a state through the value $Q$, the expected overall future reward, and $u$ is not obtained directly from $Q$. Instead, an exploratory strategy is adopted to ensure that the agent does not fall into a local optimum.
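As a concrete illustration of such an exploratory strategy, the sketch below shows ε-greedy action selection over a discretized set of insulin doses; the dose grid, Q-values, and ε are hypothetical placeholders, and practical AIDs studies typically impose additional safety constraints on top of this.

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon=0.1, rng=None):
    """Select an insulin action index from Q-value estimates.

    With probability epsilon a random action is explored; otherwise the
    greedy (highest-value) action is exploited."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

doses = [0.0, 0.5, 1.0, 1.5, 2.0]            # candidate basal rates (U/h)
q = np.array([-1.2, -0.4, 0.3, 0.1, -0.8])   # hypothetical Q(s, a) values
print(doses[epsilon_greedy_action(q, epsilon=0.1)])
```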
5. 
Adaptation to Delayed Insulin Action
Because DRL maximizes the overall reward received in the future, the value function is the expected value of the overall future reward, as shown below. An action may therefore have a long-term effect on the outcome, which aligns well with the delayed action of insulin, and several studies innovate in this regard [63,102].
$V(s) = \mathbb{E}\left[ G_t \mid S_t = s \right] = \mathbb{E}\left[ R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots \mid S_t = s \right] = \mathbb{E}\left[ R_t + \gamma \left( R_{t+1} + \gamma R_{t+2} + \cdots \right) \mid S_t = s \right] = \mathbb{E}\left[ R_t + \gamma V\left( S_{t+1} \right) \mid S_t = s \right]$
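The sketch below evaluates the discounted return in the equation above for a hypothetical reward trace, showing how a penalty incurred several steps after an insulin action still propagates back into the value of that action; the rewards and discount factor are illustrative placeholders.

```python
def discounted_return(rewards, gamma=0.95):
    """Compute G_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ... (for t = 0).

    A reward or penalty that arrives several steps after an insulin action
    still contributes to the value of that action, discounted by gamma."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical per-step rewards: in-range rewards followed by a late
# penalty caused by an earlier dosing decision
rewards = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, -0.4]
print(discounted_return(rewards))
```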
Table 5 presents the advantages and disadvantages of both approaches and concurrently explores potential future research directions.
Based on the above analysis, there are two alternative approaches to providing control systems for AIDs that address several issues discussed in this paper. The first is MPC that uses adaptive models based on recursive system identification, together with multivariable control systems that leverage data collected in real time to detect various disturbances, such as physical activity, and predict their potential effect on BGC well before these effects appear in the BGC [128,129,130,131,132,133]. While this approach involves real-time optimization with some computational load, applications that run on smartphones are being tested. The second approach is the replacement of the MPC-based controller with a DNN-based controller trained on a large number of simulation results from the UVA/Padova simulator [134]. Both technologies can be combined with DRL to benefit from the advantages of each and develop powerful AIDs.

6.2. Proportional Integral Derivative

The PID algorithm utilizes three components to mimic, as closely as possible, the physiological secretion of insulin by human β cells [135]. The proportional component of the PID controller in AIDs represents the quantity of insulin secreted in response to the deviation between the measured BGC and the target BGC; it plays a crucial role in adjusting the insulin amount to stabilize the BGC at the target value. The integral component accumulates past deviations over time, which helps eliminate persistent offsets from the target. The derivative component is designed to rapidly adjust insulin secretion when there is a swift change in the BGC, as illustrated:
$PID(t) = K_P \left( G - G_0 \right) + K_I \int \left( G - G_0 \right) dt + K_D \frac{dG}{dt}$
where $PID(t)$ represents the insulin action, $G_0$ is the target BGC, $G$ is the real-time measurement of BGC, and $K_P$, $K_I$, and $K_D$ are the gains of the proportional, integral, and derivative terms, respectively.
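A minimal discrete-time sketch of the PID law above is given below; the gains, target, and sampling interval are illustrative placeholders and are not clinically tuned values.

```python
def pid_insulin(glucose, target=110.0, kp=0.02, ki=0.001, kd=0.05, dt=5.0):
    """Discrete-time PID insulin rate from a CGM trace (mg/dL, dt in minutes).

    The gains, target, and sampling interval are illustrative placeholders
    and are not clinically tuned."""
    integral, prev_error, rates = 0.0, None, []
    for g in glucose:
        error = g - target
        integral += error * dt
        derivative = 0.0 if prev_error is None else (error - prev_error) / dt
        prev_error = error
        # Delivered insulin cannot be negative, so the output is clipped at zero
        rates.append(max(0.0, kp * error + ki * integral + kd * derivative))
    return rates

print(pid_insulin([150.0, 160.0, 170.0, 165.0, 155.0]))
```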
Table 6 presents the advantages and disadvantages of both approaches and concurrently explores potential future research directions.

6.3. Fuzzy Logic

In the context of AIDs, the application principle of FL lies in summarizing the long-term accumulated experience in diabetes treatment and establishing an expert knowledge database. The conditions and operations involved in clinical treatment are represented by fuzzy sets, and these FL rules are stored within the expert knowledge database. By utilizing fuzzy reasoning, the appropriate insulin infusion dose parameters can be derived. The FL algorithm does not necessitate the prior construction of complex models, and it has greater compatibility and is more straightforward to integrate with hardware systems.
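To make this mechanism concrete, the sketch below implements a toy fuzzy rule base with triangular membership functions and weighted-average (Sugeno-style) defuzzification; the membership breakpoints and rule outputs are assumptions made for illustration, not values from a clinical expert knowledge base.

```python
def tri(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzy_insulin_adjustment(glucose):
    """Weighted-average defuzzification over three toy fuzzy rules.

    Breakpoints and rule outputs are illustrative assumptions only."""
    weights = {
        "low":    tri(glucose,  40.0,  70.0, 100.0),
        "normal": tri(glucose,  80.0, 120.0, 160.0),
        "high":   tri(glucose, 140.0, 220.0, 400.0),
    }
    outputs = {"low": -0.5, "normal": 0.0, "high": 0.8}  # basal change (U/h)
    total = sum(weights.values())
    if total == 0.0:
        return 0.0
    return sum(weights[k] * outputs[k] for k in outputs) / total

print(fuzzy_insulin_adjustment(180.0))
```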
Nonetheless, once integrated into the system, the FL algorithm lacks the flexibility to be modified. Given the significant individual variations among diabetic patients, when the controlled parameters change substantially, the control effectiveness of the FL algorithm is notably diminished, and it may even perform worse than a standard PID control. The fuzzy rules and membership functions lack specific theoretical guidance, and their comprehension and implementation can only be achieved through heuristic approaches.
Table 7 presents the advantages and disadvantages of both approaches and concurrently explores potential future research directions.

7. Summary and Conclusions

The relevant research results from 2020 to the present exceed the sum of publications in all previous years, indicating that the opportunities and challenges in this area have attracted attention from the machine learning and medical fields worldwide. In this paper, we provide a comprehensive review of the application of DRL in AIDs. First, an overview of DRL is given. Second, the methods used in the recent literature are classified into two categories of DRL techniques based on their optimization strategies, i.e., value-based and policy-based algorithms; a detailed theoretical description of typical value-based DRL algorithms, including DQN and its variants, is presented, and several popular policy-based algorithms, covering randomized and deterministic policies, are introduced. Third, we provide an exhaustive survey, comparison, and analysis of the three elements of DRL methods applied to AIDs, namely the state space, action space, and reward function. Fourth, the main challenges, possible solutions, and future research directions are discussed from the perspective of combining DRL with AIDs. Fifth, several control algorithms applied to AIDs are compared in detail, and their advantages and disadvantages are contrasted to clarify their respective roles and effectiveness. Finally, based on the survey of practical applications in the recent literature, two critical challenges, data and algorithms, and the corresponding opportunities are summarized in terms of the use of data and equipment resources.
While this manuscript reports the promising performance of DRL-based AIDs, there are reasons why industry continues to use MPC. The appeal of MPC includes the ability to check and ensure the stability of the models it uses, well-defined optimization criteria, and the possibility of modifying the objective function over time depending on the state of the individual. The use of DRL in AIDs must address these issues and also gain acceptance from regulatory agencies such as the US Food and Drug Administration and its equivalents in many countries around the world. Clinical trials conducted in hospitals, hotels, and free living would be necessary to establish first the non-inferiority and then the superiority of DRL-based AIDs in order to move this approach from academic research to a medical treatment available to people with diabetes.

Author Contributions

X.Y. designed the study and summarized the methods. Z.Y. reviewed the literature and wrote the first draft of this manuscript. X.S. was a major contributor to writing the manuscript. H.L. (Hao Liu) provided revision suggestions to improve the content of the manuscript. H.L. (Hongru Li), J.L. and J.Z. offered suggestions about this study. A.C. supervised this study and is responsible for coordinating writing, reviewing, and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by Noncommunicable Chronic Diseases-National Science and Technology Major Project (2024ZD0532000; 2024ZD0532003), Young Scientists Fund of the National Natural Science Foundation of China (62403114) and Liaoning Provincial Natural Science Foundation of China (2023-BSBA-128).

Institutional Review Board Statement

Our manuscript is a review article that synthesizes and analyzes existing published literature. It does not involve any new human or animal research, including data collection, experiments, or interventions. Therefore, no ethical approval or informed consent was required for this study. We have adhered to all ethical standards in the process of reviewing and presenting the information from previous studies, ensuring the accuracy and integrity of our work.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The funders had no role in the design of the study. The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial intelligence
AIDs: Automated insulin delivery system
BGC: Blood glucose concentration
CGM: Continuous glucose monitoring
CHO: Carbohydrate
DDPG: Deep deterministic policy gradient
DNN: Deep neural networks
Double DQN: Double deep Q-Network
DRL: Deep reinforcement learning
FL: Fuzzy logic
HbA1c: Hemoglobin A1c
IOB: Insulin on board
MBPO: Model-based policy optimization
MPC: Model predictive control
PID: Proportional integral derivative
PPO: Proximal policy optimization
SAC: Soft actor critic
SAP: Sensor augmented pump
TD3: Twin delayed deep deterministic policy gradient
TD3-BC: Twin delayed DDPG with behavioral cloning
T1D: Type 1 diabetes
T2D: Type 2 diabetes
TIR: Time in range
TRPO: Trust region policy optimization

References

  1. Association, A.D. Diagnosis and Classification of Diabetes Mellitus. Diabetes Care 2012, 36, S67–S74. [Google Scholar] [CrossRef] [PubMed]
  2. ElSayed, N.A.; Aleppo, G.; Aroda, V.R.; Bannuru, R.R.; Brown, F.M.; Bruemmer, D.; Collins, B.S.; Hilliard, M.E.; Isaacs, D.; Johnson, E.L.; et al. 9. Pharmacologic Approaches to Glycemic Treatment: Standards of Care in Diabetes—2023. Diabetes Care 2023, 46, S140–S157. [Google Scholar] [CrossRef] [PubMed]
  3. Stanford Medicine. New Research Keeps Diabetics Safer During Sleep. Available online: http://scopeblog.stanford.edu/2014/05/08/new-research-keeps-diabetics-safer-during-sleep/ (accessed on 23 October 2024).
  4. Phillip, M.; Nimri, R.; Bergenstal, R.M.; Barnard-Kelly, K.; Danne, T.; Hovorka, R.; Kovatchev, B.P.; Messer, L.H.; Parkin, C.G.; Ambler-Osborn, L.; et al. Consensus Recommendations for the Use of Automated Insulin Delivery Technologies in Clinical Practice. Endocr. Rev. 2023, 44, 254–280. [Google Scholar] [CrossRef]
  5. Oliveira, C.P.; Mitchell, B.D.; Fan, L.D.; Garey, C.; Liao, B.R.; Bispham, J.; Vint, N.; Perez-Nieves, M.; Hughes, A.; McAuliffe-Fogarty, A. Patient perspectives on the use of half-unit insulin pens by people with type 1 diabetes: A cross-sectional observational study. Curr. Med. Res. Opin. 2021, 37, 45–51. [Google Scholar] [CrossRef]
  6. Kamrul-Hasan, A.B.M.; Hannan, M.A.; Alam, M.S.; Rahman, M.M.; Asaduzzaman, M.; Mustari, M.; Paul, A.K.; Kabir, M.L.; Chowdhury, S.R.; Talukder, S.K.; et al. Comparison of simplicity, convenience, safety, and cost-effectiveness between use of insulin pen devices and disposable plastic syringes by patients with type 2 diabetes mellitus: A cross-sectional study from Bangladesh. BMC Endocr. Disord. 2023, 23, 37. [Google Scholar] [CrossRef]
  7. Machry, R.V.; Cipriani, G.F.; Pedroso, H.U.; Nunes, R.R.; Pires, T.L.S.; Ferreira, R.; Vescovi, B.; De Moura, G.P.; Rodrigues, T.C. Pens versus syringes to deliver insulin among elderly patients with type 2 diabetes: A randomized controlled clinical trial. Diabetol. Metab. Syndr. 2021, 13, 64. [Google Scholar] [CrossRef]
  8. ElSayed, N.A.; Aleppo, G.; Aroda, V.R.; Bannuru, R.R.; Brown, F.M.; Bruemmer, D.; Collins, B.S.; Hilliard, M.E.; Isaacs, D.; Johnson, E.L.; et al. 7. Diabetes Technology: Standards of Care in Diabetes—2023. Diabetes Care 2023, 46, S111–S127. [Google Scholar] [CrossRef]
  9. Chinese Society of Endocrinology. Chinese Insulin Pump Treatment Guidelines (2021 edition). Chin. J. Endocrinol. Metab. 2021, 37, 679–701. [Google Scholar]
  10. Tonnies, T.; Brinks, R.; Isom, S.; Dabelea, D.; Divers, J.; Mayer-Davis, E.J.; Lawrence, J.M.; Pihoker, C.; Dolan, L.; Liese, A.D.; et al. Projections of Type 1 and Type 2 Diabetes Burden in the US Population Aged <20 Years Through 2060: The SEARCH for Diabetes in Youth Study. Diabetes Care 2023, 46, 313–320. [Google Scholar] [CrossRef]
  11. ElSayed, N.A.; Aleppo, G.; Aroda, V.R.; Bannuru, R.R.; Brown, F.M.; Bruemmer, D.; Collins, B.S.; Hilliard, M.E.; Isaacs, D.; Johnson, E.L.; et al. 14. Children and Adolescents: Standards of Care in Diabetes—2023. Diabetes Care 2023, 46, S230–S253. [Google Scholar] [CrossRef]
  12. Wadwa, R.P.; Reed, Z.W.; Buckingham, B.A.; DeBoer, M.D.; Ekhlaspour, L.; Forlenza, G.P.; Schoelwer, M.; Lum, J.; Kollman, C.; Beck, R.W.; et al. Trial of Hybrid Closed-Loop Control in Young Children with Type 1 Diabetes. N. Engl. J. Med. 2023, 388, 991–1001. [Google Scholar] [CrossRef] [PubMed]
  13. DiMeglio, L.A.; Kanapka, L.G.; DeSalvo, D.J.; Hilliard, M.E.; Laffel, L.M.; Tamborlane, W.V.; Van Name, M.A.; Woerner, S.; Adi, S.; Albanese-O’Neill, A.; et al. A Randomized Clinical Trial Assessing Continuous Glucose Monitoring (CGM) Use with Standardized Education with or Without a Family Behavioral Intervention Compared with Fingerstick Blood Glucose Monitoring in Very Young Children with Type 1 Diabetes. Diabetes Care 2021, 44, 464–472. [Google Scholar] [CrossRef]
  14. Laffel, L.M.; Kanapka, L.G.; Beck, R.W.; Bergamo, K.; Clements, M.A.; Criego, A.; DeSalvo, D.J.; Goland, R.; Hood, K.; Liljenquist, D.; et al. Effect of Continuous Glucose Monitoring on Glycemic Control in Adolescents and Young Adults with Type 1 Diabetes A Randomized Clinical Trial. JAMA J. Am. Med. Assoc. 2020, 323, 2388–2396. [Google Scholar] [CrossRef] [PubMed]
  15. Yeh, H.C.; Brown, T.T.; Maruthur, N.; Ranasinghe, P.; Berger, Z.; Suh, Y.D.; Wilson, L.M.; Haberl, E.B.; Brick, J.; Bass, E.B.; et al. Comparative Effectiveness and Safety of Methods of Insulin Delivery and Glucose Monitoring for Diabetes Mellitus: A Systematic Review and Meta-analysis. Ann. Intern. Med. 2012, 157, 336–347. [Google Scholar] [CrossRef]
  16. Forlenza, G.P.; Lal, R.A. Current Status and Emerging Options for Automated Insulin Delivery Systems. Diabetes Technol. Ther. 2022, 24, 362–371. [Google Scholar] [CrossRef]
  17. Lal, R.A.; Ekhlaspour, L.; Hood, K.; Buckingham, B. Realizing a Closed-Loop (Artificial Pancreas) System for the Treatment of Type 1 Diabetes. Endocr. Rev. 2019, 40, 1521–1546. [Google Scholar] [CrossRef]
  18. Boughton, C.K.; Hovorka, R. New closed-loop insulin systems. Diabetologia 2021, 64, 1007–1015. [Google Scholar] [CrossRef]
  19. Karageorgiou, V.; Papaioannou, T.G.; Bellos, I.; Alexandraki, K.; Tentolouris, N.; Stefanadis, C.; Chrousos, G.P.; Tousoulis, D. Effectiveness of artificial pancreas in the non-adult population: A systematic review and network meta-analysis. Metab. Clin. Exp. 2019, 90, 20–30. [Google Scholar] [CrossRef]
  20. Anderson, S.M.; Buckingham, B.A.; Breton, M.D.; Robic, J.L.; Barnett, C.L.; Wakeman, C.A.; Oliveri, M.C.; Brown, S.A.; Ly, T.T.; Clinton, P.K.; et al. Hybrid Closed-Loop Control Is Safe and Effective for People with Type 1 Diabetes Who Are at Moderate to High Risk for Hypoglycemia. Diabetes Technol. Ther. 2019, 21, 356–363. [Google Scholar] [CrossRef]
  21. Forlenza, G.P.; Ekhlaspour, L.; Breton, M.; Maahs, D.M.; Wadwa, R.P.; DeBoer, M.; Messer, L.H.; Town, M.; Pinnata, J.; Kruse, G.; et al. Successful At-Home Use of the Tandem Control-IQ Artificial Pancreas System in Young Children During a Randomized Controlled Trial. Diabetes Technol. Ther. 2019, 21, 159–169. [Google Scholar] [CrossRef]
  22. Forlenza, G.P.; Pinhas-Hamiel, O.; Liljenquist, D.R.; Shulman, D.I.; Bailey, T.S.; Bode, B.W.; Wood, M.A.; Buckingham, B.A.; Kaiserman, K.B.; Shin, J.; et al. Safety Evaluation of the MiniMed 670G System in Children 7–13 Years of Age with Type 1 Diabetes. Diabetes Technol. Ther. 2019, 21, 11–19. [Google Scholar] [CrossRef] [PubMed]
  23. Sherr, J.L.; Cengiz, E.; Palerm, C.C.; Clark, B.; Kurtz, N.; Roy, A.; Carria, L.; Cantwell, M.; Tamborlane, W.V.; Weinzimer, S.A. Reduced Hypoglycemia and Increased Time in Target Using Closed-Loop Insulin Delivery During Nights with or Without Antecedent Afternoon Exercise in Type 1 Diabetes. Diabetes Care 2013, 36, 2909–2914. [Google Scholar] [CrossRef] [PubMed]
  24. Zaharieva, D.P.; Messer, L.H.; Paldus, B.; O’Neal, D.N.; Maahs, D.M.; Riddell, M.C. Glucose Control During Physical Activity and Exercise Using Closed Loop Technology in Adults and Adolescents with Type 1 Diabetes. Can. J. Diabetes 2020, 44, 740–749. [Google Scholar] [CrossRef]
  25. Carlson, A.L.; Sherr, J.L.; Shulman, D.I.; Garg, S.K.; Pop-Busui, R.; Bode, B.W.; Lilenquist, D.R.; Brazg, R.L.; Kaiserman, K.B.; Kipnes, M.S.; et al. Safety and Glycemic Outcomes During the MiniMed (TM) Advanced Hybrid Closed-Loop System Pivotal Trial in Adolescents and Adults with Type 1 Diabetes. Diabetes Technol. Ther. 2022, 24, 178–189. [Google Scholar] [CrossRef]
  26. Bothe, M.K.; Dickens, L.; Reichel, K.; Tellmann, A.; Ellger, B.; Westphal, M.; Faisal, A.A. The use of reinforcement learning algorithms to meet the challenges of an artificial pancreas. Expert Rev. Med. Devices 2013, 10, 661–673. [Google Scholar] [CrossRef]
  27. Bhonsle, S.; Saxena, S. A review on control-relevant glucose-insulin dynamics models and regulation strategies. Proc. Inst. Mech. Eng. Part I J. Syst. Control. Eng. 2020, 234, 596–608. [Google Scholar] [CrossRef]
  28. Kovatchev, B. A Century of Diabetes Technology: Signals, Models, and Artificial Pancreas Control. Trends Endocrinol. Metab. 2019, 30, 432–444. [Google Scholar] [CrossRef]
  29. Quiroz, G. The evolution of control algorithms in artificial pancreas: A historical perspective. Annu. Rev. Control 2019, 48, 222–232. [Google Scholar] [CrossRef]
  30. Thomas, A.; Heinemann, L. Algorithm for automated insulin delivery (AID): An overview. Diabetologie 2022, 18, 862–874. [Google Scholar] [CrossRef]
  31. Fuchs, J.; Hovorka, R. Closed-loop control in insulin pumps for type-1 diabetes mellitus: Safety and efficacy. Expert. Rev. Med. Devices 2020, 17, 707–720. [Google Scholar] [CrossRef]
  32. Marchetti, G.; Barolo, M.; Jovanovic, L.; Zisser, H.; Seborg, D.E. An improved PID switching control strategy for type 1 diabetes. IEEE Trans. Biomed. Eng. 2008, 55, 857–865. [Google Scholar] [CrossRef] [PubMed]
  33. MohammadRidha, T.; Ait-Ahmed, M.; Chaillous, L.; Krempf, M.; Guilhem, I.; Poirier, J.Y.; Moog, C.H. Model Free iPID Control for Glycemia Regulation of Type-1 Diabetes. IEEE Trans. Biomed. Eng. 2018, 65, 199–206. [Google Scholar] [CrossRef] [PubMed]
  34. Al-Hussein, A.-B.A.; Tahir, F.R.; Viet-Thanh, P. Fixed-time synergetic control for chaos suppression in endocrine glucose-insulin regulatory system. Control Eng. Pract. 2021, 108, 104723. [Google Scholar] [CrossRef]
  35. Skogestad, S. Simple analytic rules for model reduction and PID controller tuning. Model. Identif. Control 2004, 25, 85–120. [Google Scholar] [CrossRef]
  36. Nath, A.; Dey, R.; Balas, V.E. Closed Loop Blood Glucose Regulation of Type 1 Diabetic Patient Using Takagi-Sugeno Fuzzy Logic Control. In Soft Computing Applications, Sofa 2016; Springer: Cham, Switzerland, 2018; Volume 634, pp. 286–296. [Google Scholar] [CrossRef]
  37. Yadav, J.; Rani, A.; Singh, V. Performance Analysis of Fuzzy-PID Controller for Blood Glucose Regulation in Type-1 Diabetic Patients. J. Med. Syst. 2016, 40, 254. [Google Scholar] [CrossRef]
  38. Bondia, J.; Romero-Vivo, S.; Ricarte, B.; Diez, J.L. Insulin Estimation and Prediction: A Review of the Estimation and Prediction of Subcutaneous Inaulin Pharmacokinetics in Closed-loop Glucose Control. IEEE Control Syst. Mag. 2018, 38, 47–66. [Google Scholar] [CrossRef]
  39. Oviedo, S.; Vehi, J.; Calm, R.; Armengol, J. A review of personalized blood glucose prediction strategies for T1DM patients. Int. J. Numer. Methods Biomed. Eng. 2017, 33, e2833. [Google Scholar] [CrossRef]
  40. Gondhalekar, R.; Dassau, E.; Doyle, F.J. Velocity-weighting & velocity-penalty MPC of an artificial pancreas: Improved safety & performance. Automatica 2018, 91, 105–117. [Google Scholar] [CrossRef]
  41. Shi, D.W.; Dassau, E.; Doyle, F.J. Adaptive Zone Model Predictive Control of Artificial Pancreas Based on Glucose- and Velocity-Dependent Control Penalties. IEEE Trans. Biomed. Eng. 2019, 66, 1045–1054. [Google Scholar] [CrossRef]
  42. Birjandi, S.Z.; Sani, S.K.H.; Pariz, N. Insulin infusion rate control in type 1 diabetes patients using information-theoretic model predictive control. Biomed. Signal Process. Control 2022, 76, 103635. [Google Scholar] [CrossRef]
  43. Williams, G.; Drews, P.; Goldfain, B.; Rehg, J.M.; Theodorou, E.A. Information-Theoretic Model Predictive Control: Theory and Applications to Autonomous Driving. IEEE Trans. Robot. 2017, 34, 1603–1622. [Google Scholar] [CrossRef]
  44. Li, Y.; Yu, C.; Shahidehpour, M.; Yang, T.; Zeng, Z.; Chai, T. Deep Reinforcement Learning for Smart Grid Operations: Algorithms, Applications, and Prospects. Proc. IEEE 2023, 111, 1055–1096. [Google Scholar] [CrossRef]
  45. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of Go without human knowledge. Nature 2017, 550, 354–359. [Google Scholar] [CrossRef] [PubMed]
  46. Kiran, B.R.; Sobh, I.; Talpaert, V.; Mannion, P.; Sallab, A.A.A.; Yogamani, S.; Pérez, P. Deep Reinforcement Learning for Autonomous Driving: A Survey. IEEE Trans. Intell. Transp. Syst. 2022, 23, 4909–4926. [Google Scholar] [CrossRef]
  47. Nemati, S.; Ghassemi, M.M.; Clifford, G.D. Optimal medication dosing from suboptimal clinical examples: A deep reinforcement learning approach. In Proceedings of the 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Orlando, FL, USA, 16–20 August 2016; pp. 2978–2981. [Google Scholar]
  48. Fox, I.; Lee, J.; Pop-Busui, R.; Wiens, J. Deep Reinforcement Learning for Closed-Loop Blood Glucose Control. Mach. Learn. Healthc. 2020, 126, 508–536. [Google Scholar]
  49. Tejedor, M.; Woldaregay, A.Z.; Godtliebsen, F. Reinforcement learning application in diabetes blood glucose control: A systematic review. Artif. Intell. Med. 2020, 104, 101836. [Google Scholar] [CrossRef]
  50. Sharma, R.; Singh, D.; Gaur, P.; Joshi, D. Intelligent automated drug administration and therapy: Future of healthcare. Drug Deliv. Transl. Res. 2021, 11, 1878–1902. [Google Scholar] [CrossRef]
  51. Toffanin, C.; Visentin, R.; Messori, M.; Palma, F.D.; Magni, L.; Cobelli, C. Toward a Run-to-Run Adaptive Artificial Pancreas: In Silico Results. IEEE Trans. Biomed. Eng. 2018, 65, 479–488. [Google Scholar] [CrossRef]
  52. Yau, K.-L.A.; Chong, Y.-W.; Fan, X.; Wu, C.; Saleem, Y.; Lim, P.-C. Reinforcement Learning Models and Algorithms for Diabetes Management. IEEE Access 2023, 11, 28391–28415. [Google Scholar] [CrossRef]
  53. Denes-Fazakas, L.; Fazakas, G.D.; Eigner, G.; Kovacs, L.; Szilagyi, L. Review of Reinforcement Learning-Based Control Algorithms in Artificial Pancreas Systems for Diabetes Mellitus Management. In Proceedings of the 18th International Symposium on Applied Computational Intelligence and Informatics (SACI), Timisoara, Romania, 23–25 May 2024; pp. 565–571. [Google Scholar] [CrossRef]
  54. Shin, J.; Badgwell, T.A.; Liu, K.-H.; Lee, J.H. Reinforcement Learning—Overview of recent progress and implications for process control. Comput. Chem. Eng. 2019, 127, 282–294. [Google Scholar] [CrossRef]
  55. Daskalaki, E.; Diem, P.; Mougiakakou, S.G. Model-Free Machine Learning in Biomedicine: Feasibility Study in Type 1 Diabetes. PLoS ONE 2016, 11, e0158722. [Google Scholar] [CrossRef] [PubMed]
  56. Wang, S.; Gu, W. An Improved Strategy for Blood Glucose Control Using Multi-Step Deep Reinforcement Learning. arXiv 2024, arXiv:2403.07566. [Google Scholar]
  57. Ahmad, S.; Beneyto, A.; Zhu, T.; Contreras, I.; Georgiou, P.; Vehi, J. An automatic deep reinforcement learning bolus calculator for automated insulin delivery systems. Sci. Rep. 2024, 14, 15245. [Google Scholar] [CrossRef] [PubMed]
  58. Del Giorno, S.; D’Antoni, F.; Piemonte, V.; Merone, M. A New Glycemic closed-loop control based on Dyna-Q for Type-1-Diabetes. Biomed. Signal Process. Control 2023, 81, 104492. [Google Scholar] [CrossRef]
  59. Zhu, T.; Li, K.; Herrero, P.; Georgiou, P. Basal Glucose Control in Type 1 Diabetes Using Deep Reinforcement Learning: An In Silico Validation. IEEE J. Biomed. Health Inform. 2021, 25, 1223–1232. [Google Scholar] [CrossRef]
  60. Li, T.; Wang, Z.; Lu, W.; Zhang, Q.; Li, D. Electronic health records based reinforcement learning for treatment optimizing. Inf. Syst. 2022, 104, 101878. [Google Scholar] [CrossRef]
  61. Noaro, G.; Zhu, T.; Cappon, G.; Facchinetti, A.; Georgiou, P. A Personalized and Adaptive Insulin Bolus Calculator Based on Double Deep Q—Learning to Improve Type 1 Diabetes Management. IEEE J. Biomed. Health Inform. 2023, 27, 2536–2544. [Google Scholar] [CrossRef]
  62. Hettiarachchi, C.; Malagutti, N.; Nolan, C.J.; Suominen, H.; Daskalaki, E. Non-linear Continuous Action Spaces for Reinforcement Learning in Type 1 Diabetes. In Ai 2022: Advances in Artificial Intelligence; Springer: Cham, Switzerland, 2022; Volume 13728, pp. 557–570. [Google Scholar] [CrossRef]
  63. Lee, S.; Kim, J.; Park, S.W.; Jin, S.M.; Park, S.M. Toward a Fully Automated Artificial Pancreas System Using a Bioinspired Reinforcement Learning Design: In Silico Validation. IEEE J. Biomed. Health Inform. 2021, 25, 536–546. [Google Scholar] [CrossRef]
  64. Lehel, D.-F.; Siket, M.; Szilágyi, L.; Eigner, G.; Kovács, L. Investigation of reward functions for controlling blood glucose level using reinforcement learning. In Proceedings of the 2023 IEEE 17th International Symposium on Applied Computational Intelligence and Informatics (SACI), Timisoara, Romania, 23–26 May 2023; pp. 387–392. [Google Scholar]
  65. Viroonluecha, P.; Egea-Lopez, E.; Santa, J. Evaluation of blood glucose level control in type 1 diabetic patients using deep reinforcement learning. PLoS ONE 2022, 17, e0274608. [Google Scholar] [CrossRef]
  66. Hettiarachchi, C.; Malagutti, N.; Nolan, C.J.; Suominen, H.; Daskalaki, E. G2P2C—A modular reinforcement learning algorithm for glucose control by glucose prediction and planning in Type 1 Diabetes. Biomed. Signal Process. Control 2024, 90, 105839. [Google Scholar] [CrossRef]
  67. El Fathi, A.; Pryor, E.; Breton, M.D. Attention Networks for Personalized Mealtime Insulin Dosing in People with Type 1 Diabetes. IFAC-PapersOnLine 2024, 58, 245–250. [Google Scholar] [CrossRef]
  68. Denes-Fazakas, L.; Szilagyi, L.; Kovacs, L.; De Gaetano, A.; Eigner, G. Reinforcement Learning: A Paradigm Shift in Personalized Blood Glucose Management for Diabetes. Biomedicines 2024, 12, 2143. [Google Scholar] [CrossRef] [PubMed]
  69. Nordhaug Myhre, J.; Tejedor, M.; Kalervo Launonen, I.; El Fathi, A.; Godtliebsen, F. In-Silico Evaluation of Glucose Regulation Using Policy Gradient Reinforcement Learning for Patients with Type 1 Diabetes Mellitus. Appl. Sci. 2020, 10, 6350. [Google Scholar] [CrossRef]
  70. Di Felice, F.; Borri, A.; Di Benedetto, M.D. Deep reinforcement learning for closed-loop blood glucose control: Two approaches. IFAC Pap. 2022, 55, 115–120. [Google Scholar] [CrossRef]
  71. Zhu, T.; Li, K.; Kuang, L.; Herrero, P.; Georgiou, P. An Insulin Bolus Advisor for Type 1 Diabetes Using Deep Reinforcement Learning. Sensors 2020, 20, 5058. [Google Scholar] [CrossRef]
  72. Ellis, Z. Application of Reinforcement Learning Algorithm to Minimize the Dosage of Insulin Infusion; East Carolina University: Greenville, NC, USA, 2024; p. 59. [Google Scholar]
  73. Raheb, M.A.; Niazmand, V.R.; Eqra, N.; Vatankhah, R. Subcutaneous insulin administration by deep reinforcement learning for blood glucose level control of type-2 diabetic patients. Comput. Biol. Med. 2022, 148, 105860. [Google Scholar] [CrossRef]
  74. Lim, M.H.; Lee, W.H.; Jeon, B.; Kim, S. A Blood Glucose Control Framework Based on Reinforcement Learning With Safety and Interpretability: In Silico Validation. IEEE Access 2021, 9, 105756–105775. [Google Scholar] [CrossRef]
  75. Lv, W.; Wu, T.; Xiong, L.; Wu, L.; Zhou, J.; Tang, Y.; Qian, F. Hybrid Control Policy for Artificial Pancreas via Ensemble Deep Reinforcement Learning. IEEE Trans. Biomed. Eng. 2023, 72, 309–323. [Google Scholar] [CrossRef]
  76. Yu, X.; Guan, Y.; Yan, L.; Li, S.; Fu, X.; Jiang, J. ARLPE: A meta reinforcement learning framework for glucose regulation in type 1 diabetics. Expert Syst. Appl. 2023, 228, 120156. [Google Scholar] [CrossRef]
  77. Zhu, J.; Zhang, Y.; Rao, W.; Zhao, Q.; Li, J.; Wang, C. Reinforcement Learning for Diabetes Blood Glucose Control with Meal Information. Bioinform. Res. Appl. 2021, 13064, 80–91. [Google Scholar] [CrossRef]
  78. Jiang, J.; Shen, R.; Wang, B.; Guan, Y. Blood Glucose Control Via Pre-trained Counterfactual Invertible Neural Networks. arXiv 2024, arXiv:2405.17458. [Google Scholar]
  79. Chlumsky-Harttmann, M.; Ayad, A.; Schmeink, A. HypoTreat: Reducing Hypoglycemia in Artificial Pancreas Simulation. In Proceedings of the 2024 IEEE Conference on Evolving and Adaptive Intelligent Systems (EAIS), Madrid, Spain, 23–24 May 2024; pp. 56–62. [Google Scholar] [CrossRef]
  80. Jaloli, M.; Cescon, M. Basal-bolus advisor for type 1 diabetes (T1D) patients using multi-agent reinforcement learning (RL) methodology. Control Eng. Pract. 2024, 142, 105762. [Google Scholar] [CrossRef]
  81. Mackey, A.; Furey, E. Artificial Pancreas Control for Diabetes using TD3 Deep Reinforcement Learning. In Proceedings of the 2022 33rd Irish Signals and Systems Conference (ISSC), Cork, Ireland, 9–10 June 2022; pp. 1–6. [Google Scholar]
  82. Emerson, H.; Guy, M.; McConville, R. Offline reinforcement learning for safer blood glucose control in people with type 1 diabetes. J. Biomed. Inform. 2023, 142, 104376. [Google Scholar] [CrossRef]
  83. Zhu, T.; Li, K.; Georgiou, P. Offline Deep Reinforcement Learning and Off-Policy Evaluation for Personalized Basal Insulin Control in Type 1 Diabetes. IEEE J. Biomed. Health Inform. 2023, 27, 5087–5098. [Google Scholar] [CrossRef]
  84. Beolet, T.; Adenis, A.; Huneker, E.; Louis, M. End-to-end offline reinforcement learning for glycemia control. Artif. Intell. Med. 2024, 154, 102920. [Google Scholar] [CrossRef]
  85. Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.; Abbeel, P. Trust Region Policy Optimization. Int. Conf. Mach. Learn. 2015, 37, 1889–1897. [Google Scholar]
  86. Wang, G.; Liu, X.; Ying, Z.; Yang, G.; Chen, Z.; Liu, Z.; Zhang, M.; Yan, H.; Lu, Y.; Gao, Y.; et al. Optimized glycemic control of type 2 diabetes with reinforcement learning: A proof-of-concept trial. Nat. Med. 2023, 29, 2633–2642. [Google Scholar] [CrossRef]
  87. Wang, Z.; Xie, Z.; Tu, E.; Zhong, A.; Liu, Y.; Ding, J.; Yang, J. Reinforcement Learning-Based Insulin Injection Time And Dosages Optimization. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar]
  88. Unsworth, R.; Avari, P.; Lett, A.M.; Oliver, N.; Reddy, M. Adaptive bolus calculators for people with type 1 diabetes: A systematic review. Diabetes Obes. Metab. 2023, 25, 3103–3113. [Google Scholar] [CrossRef]
  89. Jafar, A.; Fathi, A.E.; Haidar, A. Long-term use of the hybrid artificial pancreas by adjusting carbohydrate ratios and programmed basal rate: A reinforcement learning approach. Comput. Methods Programs Biomed. 2021, 200, 105936. [Google Scholar] [CrossRef]
  90. Magni, L. Model Predictive Control of Type 1 Diabetes: An in Silico Trial. J. Diabetes Sci. Technol. 2007, 1, 804–812. [Google Scholar] [CrossRef]
  91. De Paula, M.; Acosta, G.G.; Martínez, E.C. On-line policy learning and adaptation for real-time personalization of an artificial pancreas. Expert Syst. Appl. 2015, 42, 2234–2255. [Google Scholar] [CrossRef]
  92. De Paula, M.; Ávila, L.O.; Martínez, E.C. Controlling blood glucose variability under uncertainty using reinforcement learning and Gaussian processes. Appl. Soft Comput. 2015, 35, 310–332. [Google Scholar] [CrossRef]
  93. De Paula, M.; Martinez, E. Probabilistic optimal control of blood glucose under uncertainty. In 22 European Symposium on Computer Aided Process Engineering; Elsevier: Amsterdam, The Netherlands, 2012; Volume 30, pp. 1357–1361. [Google Scholar]
  94. Akbari Torkestani, J.; Ghanaat Pisheh, E. A learning automata-based blood glucose regulation mechanism in type 2 diabetes. Control Eng. Pract. 2014, 26, 151–159. [Google Scholar] [CrossRef]
  95. Daskalaki, E.; Diem, P.; Mougiakakou, S.G. An Actor-Critic based controller for glucose regulation in type 1 diabetes. Comput. Methods Programs Biomed. 2013, 109, 116–125. [Google Scholar] [CrossRef]
  96. Sun, Q.; Jankovic, M.V.; Budzinski, J.; Moore, B.; Diem, P.; Stettler, C.; Mougiakakou, S.G. A Dual Mode Adaptive Basal-Bolus Advisor Based on Reinforcement Learning. IEEE J. Biomed. Health Inform. 2019, 23, 2633–2641. [Google Scholar] [CrossRef]
  97. Sun, Q.; Jankovic, M.V.; Mougiakakou, G.S. Reinforcement Learning-Based Adaptive Insulin Advisor for Individuals with Type 1 Diabetes Patients under Multiple Daily Injections Therapy. In Proceedings of the 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Berlin, Germany, 23–27 July 2019; pp. 3609–3612. [Google Scholar]
  98. Sun, Q.; Jankovic, M.V.; Stettler, C.; Mougiakakou, S. Personalised adaptive basal-bolus algorithm using SMBG/CGM data. In Proceedings of the 11th International Conference on Advanced Technologies and Treatments for Diabetes, Vienna, Austria, 14–17 February 2018. [Google Scholar]
  99. Thananjeyan, B.; Balakrishna, A.; Nair, S.; Luo, M.; Srinivasan, K.; Hwang, M.; Gonzalez, J.E.; Ibarz, J.; Finn, C.; Goldberg, K. Recovery RL: Safe Reinforcement Learning With Learned Recovery Zones. IEEE Robot. Autom. Lett. 2021, 6, 4915–4922. [Google Scholar] [CrossRef]
  100. Yasini, S.; Naghibi-Sistani, M.B.; Karimpour, A. Agent-based simulation for blood glucose control in diabetic patients. World Acad. Sci. Eng. Technol. 2009, 33, 672–679. [Google Scholar]
  101. Myhre, J.N.; Launonen, I.K.; Wei, S.; Godtliebsen, F. Controlling Blood Glucose Levels in Patients with Type 1 Diabetes Using Fitted Q-Iterations and Functional Features. In Proceedings of the 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP), Aalborg, Denmark, 17–20 September 2018; pp. 1–6. [Google Scholar]
  102. Shifrin, M.; Siegelmann, H. Near-optimal insulin treatment for diabetes patients: A machine learning approach. Artif. Intell. Med. 2020, 107, 101917. [Google Scholar] [CrossRef]
  103. Shu, Y.; Cao, Z.; Gao, J.; Wang, J.; Yu, P.S.; Long, M. Omni-Training: Bridging Pre-Training and Meta-Training for Few-Shot Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 45, 15275–15291. [Google Scholar] [CrossRef]
  104. Reske, A.; Carius, J.; Ma, Y.; Farshidian, F.; Hutter, M. Imitation Learning from MPC for Quadrupedal Multi-Gait Control. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 5014–5020. [Google Scholar]
  105. Ahmad, S.; Beneyto, A.; Contreras, I.; Vehi, J. Bolus Insulin calculation without meal information. A reinforcement learning approach. Artif. Intell. Med. 2022, 134, 102436. [Google Scholar] [CrossRef]
  106. Brunke, L.; Greeff, M.; Hall, A.W.; Yuan, Z.; Zhou, S.; Panerati, J.; Schoellig, A.P. Safe Learning in Robotics: From Learning-Based Control to Safe Reinforcement Learning. Annu. Rev. Control Robot. Auton. Syst. 2022, 5, 411–444. [Google Scholar] [CrossRef]
  107. Zhang, Y.; Liang, X.; Li, D.; Ge, S.S.; Gao, B.; Chen, H.; Lee, T.H. Adaptive Safe Reinforcement Learning with Full-State Constraints and Constrained Adaptation for Autonomous Vehicles. IEEE Trans. Cybern. 2024, 54, 1907–1920. [Google Scholar] [CrossRef] [PubMed]
  108. Artman, W.J.; Nahum-Shani, I.; Wu, T.; McKay, J.R.; Ertefaie, A. Power analysis in a SMART design: Sample size estimation for determining the best embedded dynamic treatment regime. Biostatistics 2018, 21, 1468–4357. [Google Scholar] [CrossRef] [PubMed]
  109. Fu, J.; Norouzi, M.; Nachum, O.; Tucker, G.; Wang, Z.; Novikov, A.; Yang, M.; Zhang, M.R.; Chen, Y.; Kumar, A.; et al. Benchmarks for Deep Off-Policy Evaluation. arXiv 2021, arXiv:2103.16596. [Google Scholar]
  110. Kovatchev, B.P.; Breton, M.; Dalla Man, C.; Cobelli, C. In Silico Preclinical Trials: A Proof of Concept in Closed-Loop Control of Type 1 Diabetes. J. Diabetes Sci. Technol. 2009, 3, 44–55. [Google Scholar] [CrossRef]
  111. Lehmann, E.D.; Deutsch, T. A physiological model of glucose-insulin interaction in type 1 diabetes mellitus. J. Biomed. Eng. 1992, 14, 235–242. [Google Scholar] [CrossRef]
  112. Bergman, R.N. Minimal model: Perspective from 2005. Horm. Res. 2005, 64 (Suppl. S3), 8–15. [Google Scholar] [CrossRef]
  113. Liu, Z.; Ji, L.; Jiang, X.; Zhao, W.; Liao, X.; Zhao, T.; Liu, S.; Sun, X.; Hu, G.; Feng, M.; et al. A Deep Reinforcement Learning Approach for Type 2 Diabetes Mellitus Treatment. In Proceedings of the 2020 IEEE International Conference on Healthcare Informatics (ICHI), Oldenburg, Germany, 30 November–3 December 2020; pp. 1–9. [Google Scholar]
  114. Lopez-Martinez, D.; Eschenfeldt, P.; Ostvar, S.; Ingram, M.; Hur, C.; Picard, R. Deep Reinforcement Learning for Optimal Critical Care Pain Management with Morphine using Dueling Double-Deep Q Networks. In Proceedings of the 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Berlin, Germany, 23–27 July 2019; pp. 3960–3963. [Google Scholar]
  115. Weng, W.-H.; Gao, M.; He, Z.; Yan, S.; Szolovits, P. Representation and Reinforcement Learning for Personalized Glycemic Control in Septic Patients. arXiv 2017, arXiv:1712.00654. [Google Scholar]
  116. Fox, I.; Wiens, J. Reinforcement learning for blood glucose control: Challenges and opportunities. In Proceedings of the 2019 International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; Volume 126, pp. 508–536. [Google Scholar]
  117. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay. arXiv 2015, arXiv:1511.05952. [Google Scholar]
  118. Hester, T.; Vecerik, M.; Pietquin, O.; Lanctot, M.; Schaul, T.; Piot, B.; Horgan, D.; Quan, J.; Sendonaris, A.; Osband, I.; et al. Deep Q-learning from demonstrations. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; p. 394. [Google Scholar]
  119. Herrero, P.; El-Sharkawy, M.; Daniels, J.; Jugnee, N.; Uduku, C.N.; Reddy, M.; Oliver, N.; Georgiou, P. The Bio-inspired Artificial Pancreas for Type 1 Diabetes Control in the Home: System Architecture and Preliminary Results. J. Diabetes Sci. Technol. 2019, 13, 1017–1025. [Google Scholar] [CrossRef]
  120. Deshpande, S.; Pinsker, J.E.; Zavitsanou, S.; Shi, D.; Tompot, R.W.; Church, M.M.; Andre, C.C.; Doyle, F.J.; Dassau, E. Design and Clinical Evaluation of the Interoperable Artificial Pancreas System (iAPS) Smartphone App: Interoperable Components with Modular Design for Progressive Artificial Pancreas Research and Development. Diabetes Technol. Ther. 2019, 21, 35–43. [Google Scholar] [CrossRef] [PubMed]
  121. Li, K.; Liu, C.; Zhu, T.; Herrero, P.; Georgiou, P. GluNet: A Deep Learning Framework for Accurate Glucose Forecasting. IEEE J. Biomed. Health Inform. 2020, 24, 414–423. [Google Scholar] [CrossRef] [PubMed]
  122. Li, K.; Daniels, J.; Liu, C.; Herrero, P.; Georgiou, P. Convolutional Recurrent Neural Networks for Glucose Prediction. IEEE J. Biomed. Health Inform. 2020, 24, 603–613. [Google Scholar] [CrossRef]
  123. Cobelli, C. Diabetes: Models, Signals and Control. In Proceedings of the 13th Imeko Tc1-Tc7 Joint Symposium—Without Measurement No Science, without Science No Measurement, London, UK, 1–3 September 2010; Volume 238. [Google Scholar] [CrossRef]
  124. Copp, D.; Gondhalekar, R.; Hespanha, J. Simultaneous model predictive control and moving horizon estimation for blood glucose regulation in Type 1 diabetes. Optim. Control Appl. Methods 2017, 39, 904–918. [Google Scholar] [CrossRef]
  125. Sun, X. Study on Dynamic Control of Blood Glucose in Type 1 Diabetes Mellitus Based on GPC. Master’s Thesis, Northeastern University, Boston, MA, USA, 2018. [Google Scholar]
  126. Babar, S.A.; Rana, I.A.; Mughal, I.S.; Khan, S.A. Terminal Synergetic and State Feedback Linearization Based Controllers for Artificial Pancreas in Type 1 Diabetic Patients. IEEE Access 2021, 9, 28012–28019. [Google Scholar] [CrossRef]
  127. Messori, M.; Incremona, G.P.; Cobelli, C.; Magni, L. Individualized model predictive control for the artificial pancreas: In silico evaluation of closed-loop glucose control. IEEE Control Syst. Mag. 2018, 38, 86–104. [Google Scholar] [CrossRef]
  128. Turksoy, K.; Bayrak, E.S.; Quinn, L.; Littlejohn, E.; Çinar, A. Multivariable adaptive closed-loop control of an artificial pancreas without meal and activity announcement. Diabetes Technol. Ther. 2013, 15, 386–400. [Google Scholar] [CrossRef]
  129. Turksoy, K.; Hajizadeh, I.; Hobbs, N.; Kilkus, J.; Littlejohn, E.; Samadi, S.; Feng, J.; Sevil, M.; Lazaro, C.; Ritthaler, J.; et al. Multivariable Artificial Pancreas for Various Exercise Types and Intensities. Diabetes Technol. Ther. 2018, 20, 662–671. [Google Scholar] [CrossRef]
  130. Hajizadeh, I.; Rashid, M.; Samadi, S.; Sevil, M.; Hobbs, N.; Brandt, R.; Cinar, A. Adaptive personalized multivariable artificial pancreas using plasma insulin estimates. J. Process Control 2019, 80, 26–40. [Google Scholar] [CrossRef]
  131. Hajizadeh, I.; Rashid, M.M.; Turksoy, K.; Samadi, S.; Feng, J.; Sevil, M.; Hobbs, N.; Lazaro, C.; Maloney, Z.; Littlejohn, E.; et al. Incorporating Unannounced Meals and Exercise in Adaptive Learning of Personalized Models for Multivariable Artificial Pancreas Systems. J. Diabetes Sci. Technol. 2018, 12, 953–966. [Google Scholar] [CrossRef]
  132. Sun, X.; Rashid, M.; Askari, M.R.; Cinar, A. Adaptive personalized prior-knowledge-informed model predictive control for type 1 diabetes. Control Eng. Pract. 2023, 131, 105386. [Google Scholar] [CrossRef] [PubMed]
  133. Sun, X.; Rashid, M.; Hobbs, N.; Askari, M.R.; Brandt, R.; Shahidehpour, A.; Cinar, A. Prior informed regularization of recursively updated latent-variables-based models with missing observations. Control Eng. Pract. 2021, 116, 104933. [Google Scholar] [CrossRef] [PubMed]
  134. Castillo, A.; Villa-Tamayo, M.F.; Pryor, E.; Garcia-Tirado, J.F.; Colmegna, P.; Breton, M. Deep Neural Network Architectures for an Embedded MPC Implementation: Application to an Automated Insulin Delivery System. IFAC-Pap. 2023, 56, 11521–11526. [Google Scholar] [CrossRef]
  135. Steil, G.; Rebrin, K.; Mastrototaro, J.J. Metabolic modelling and the closed-loop insulin delivery problem. Diabetes Res. Clin. Pract. 2006, 74, S183–S186. [Google Scholar] [CrossRef]
Figure 1. Basic structure of the AIDs.
Figure 2. Number of relevant articles over years.
Figure 3. Basic structure of DRL.
Figure 4. DRL-based AIDs.
Figure 5. Classification of DRL algorithms (thick-bordered boxes represent different categories, while other boxes indicate specific algorithms).
Figure 6. Actor-Critic algorithm.
Figure 7. Existing solutions to the problem of low sample utilization in DRL.
Figure 8. Distribution error problem.
Figure 9. Model predictive control structure diagram.
Figure 10. Structure of model predictive control algorithm for Artificial Pancreas.
Figure 11. Feedforward Neural Network Glucose Prediction Model. (Glucose* represents the predicted value of blood glucose).
Figure 12. Ergodic solution in AIDs using MPC.
Table 1. Selection of state space variables in the study.
References | State Space Variables | Fully/Hybrid Closed Loop
Ian Fox [48], Chirath Hettiarachchi [62,66], Dénes-Fazakas Lehel [68], Senquan Wang [56], Miriam Chlumsky-Harttmann [79] | BGC, insulin | Fully
Seunghyun Lee [63] | BGC, rate of BGC, IOB
Francesco Di Felice [70], Alan Mackey [81], Phuwadol Viroonluecha [65], Silvia Del Giorno [58], Dénes-Fazakas Lehel [64], Jingchi Jiang [78] | BGC
Zackarie Ellis [72] | BGC, insulin, error between the BGC and 4.5 mmol/L
Sayyar Ahmad [57] | BGC, maximum, minimum, area under the curve
Mohammad Ali Raheb [73] | BGC, IOB
Guangyu Wang [86] | BGC, demographics, diagnosis, symptom, medication, laboratory test index
Jonas Nordhaug Myhre [69] | BGC, insulin, IOB | Hybrid
Taiyu Zhu [71] | BGC, CHO, IOB, injection time of bolus
Min Hyuk Lim [74], Jinhao Zhu [77], Mehrad Jaloli [80], Anas El Fathi [67] | BGC, insulin, CHO
Zihao Wang [87] | BGC, time stamp, CHO
Taiyu Zhu [59] | BGC, CHO, basal rate, bolus, glucagon
Tianhao Li [60] | Dynamic and static variables in the electronic medical record
Harry Emerson [82] | BGC, IOB, CHO
Wenzhou Lv [75] | BGC, bolus, CHO, IOB
Giulia Noaro [61] | BGC, CHO, rate of change, carbohydrate ratio, standard postprandial insulin dose
Xuehui Yu [76] | Daily minimum and maximum BGC, posterior distribution probabilities formed from the product of the distributions of prior and individualized information (age, weight, etc.)
Taiyu Zhu [83] | BGC, mean, maximum, minimum, maximum difference between adjacent measurements, percentage of high and low glucose, time stamp, number of hours to last CHO, bolus
Tristan Beolet [84] | BGC, insulin, IOB, TDD, COB, time stamp, weight
Table 2. Selection of action space variables in the study.
References | Action Space Variables | Continuous/Discrete | Borderless/Bounded
Basal | Bolus | Insulin Doses | Continuous | Discrete | Borderless | Bounded
Ian Fox [48], Chirath Hettiarachchi [62,66], Francesco Di Felice [70], Zihao Wang [87], Taiyu Zhu [83], Guangyu Wang [86], Anas El Fathi [67], Miriam Chlumsky-Harttmann [79], Dénes-Fazakas Lehel [68]
Phuwadol Viroonluecha [65], Wenzhou Lv [75], Alan Mackey [81], Mehrad Jaloli [80], Tristan Beolet [84]
Taiyu Zhu [71], Mehrad Jaloli [80]
Dénes-Fazakas Lehel [64], Mohammad Ali Raheb [73], Jinhao Zhu [77], Min Hyuk Lim [74], Zackarie Ellis [72], Jingchi Jiang [78]
Giulia Noaro [61], Xuehui Yu [76]
Jonas Nordhaug Myhre [69], Harry Emerson [82], Xuehui Yu [76]
Silvia Del Giorno [58], Sayyar Ahmad [57]
Tianhao Li [60], Taiyu Zhu [59], Seunghyun Lee [63], Senquan Wang [56]
Supplement: Bounded and Borderless refer to the existence and absence of boundaries in the action space respectively; Continuous and Discrete indicate that the action space is continuous and discrete, respectively.
Table 3. Selection of the reward function in the study.
Log Function (Risk Function) | Gaussian Function | Power Function | Linear Function | Others
Ian Fox [48], Harry Emerson [82], Alan Mackey [81], Guangyu Wang [86], Chirath Hettiarachchi [66], Jinhao Zhu [77], Anas El Fathi [67], Tristan Beolet [84], Miriam Chlumsky-Harttmann [79]. e.g.,
R(g) = \begin{cases} -10\left[3.35506\left(\log(g)^{0.8353} - 3.7932\right)\right]^{2}, & 10\ \text{mg/dL} \le g \le 1000\ \text{mg/dL} \\ a, & g < 10\ \text{mg/dL}\ \text{or}\ g > 1000\ \text{mg/dL} \end{cases}
where a is a penalty term whose value differs between studies.
Jonas Nordhaug Myhre [69], Mehrad Jaloli [80], Min Hyuk Lim [74], Dénes-Fazakas Lehel [64,68]. e.g.,
R(g) = \exp\left(-\varepsilon \left| g - g_{\mathrm{ref}} \right|\right), \quad \varepsilon > 0
where g_{ref} = 127 mg/dL.
Francesco Di Felice [70], Silvia Del Giorno [58], Wenzhou Lv [75], Jingchi Jiang [78]. e.g.,
R(g) = -0.001\,(g - 100)^{2}
Taiyu Zhu [59], Taiyu Zhu [83], Taiyu Zhu [71], Tianhao Li [60], Mohammad Ali Raheb [73], Phuwadol Viroonluecha [65], Giulia Noaro [61], Seunghyun Lee [63], Zihao Wang [87], Zackarie Ellis [72], Sayyar Ahmad [57], Senquan Wang [56]. e.g.,
R(g) = \begin{cases} 1, & 90\ \text{mg/dL} \le g \le 140\ \text{mg/dL} \\ 0.1, & 70\ \text{mg/dL} \le g < 90\ \text{mg/dL}\ \text{or}\ 140\ \text{mg/dL} < g \le 180\ \text{mg/dL} \\ -0.4 - (g - 180)/200, & 180\ \text{mg/dL} < g \le 300\ \text{mg/dL} \\ -0.6 + (g - 70)/100, & 30\ \text{mg/dL} \le g < 70\ \text{mg/dL} \\ -1, & \text{otherwise} \end{cases}
Xuehui Yu [76]. e.g.,
R(g) = r(g_{t+1,\max}) + r(g_{t+1,\min})
where g_{t+1,\max} and g_{t+1,\min} are the maximum and minimum BG values one day after insulin infusion, and
r(g) = \beta - \mathrm{dist}_{\mathrm{outside}}(g; q) - \alpha \cdot \mathrm{dist}_{\mathrm{inside}}(g; q)
where g is g_{t+1,\max} or g_{t+1,\min}; q = (q_{\min}, q_{\max}) \in \mathbb{R}^{2d} is a query box set to [70, 180]; \beta is a fixed scalar margin; and 0 < \alpha < 1 is a discount factor that adjusts the attention paid to \mathrm{dist}_{\mathrm{outside}} and \mathrm{dist}_{\mathrm{inside}}:
\mathrm{dist}_{\mathrm{outside}}(g; q) = \left\| \max(g - q_{\max}, 0) + \max(q_{\min} - g, 0) \right\|_{1}
\mathrm{dist}_{\mathrm{inside}}(g; q) = \left\| \mathrm{cen}(q) - \min\left(q_{\max}, \max(q_{\min}, g)\right) \right\|_{1}
where \mathrm{cen}(q) = (q_{\max} + q_{\min})/2 is the central point of q_{\min} and q_{\max}.
Log function (risk function) — Advantages: it quantifies the risk associated with a given BGC, encouraging the agent to pay more attention to glucose levels that carry high risk and thus to avoid extreme hyperglycemia and hypoglycemia events. Disadvantages: it does not account for the quality of BGC fluctuations, and because glucose risk varies between patients, it cannot provide personalized reward assessments.
Gaussian function — Advantages: the penalty grows quickly with distance from the ideal BGC, which helps the agent more strongly avoid extreme situations. Disadvantages: the strength of the penalty must be tuned through its coefficient, and this tuning can be sensitive.
Power function — Advantages: it directly penalizes deviations from the ideal blood glucose value, with larger deviations penalized more severely. Disadvantages: deviations above and below the ideal BGC receive the same penalty, which hinders the RL agent from distinguishing the risk of hyperglycemia from that of hypoglycemia.
Linear (piecewise) function — Advantages: it simplifies reward calculation and clearly defines rewards or penalties for different blood glucose ranges. Disadvantages: it is discontinuous at interval boundaries, which can cause unstable agent behavior near those boundaries, and the choice of intervals may be subjective, with different individuals and situations requiring different intervals.
Others (e.g., the box-based reward above) — Advantages: it provides a more granular assessment of different BGC values. Disadvantages: it does not consider the quality of BGC fluctuations, nor does it provide patient-specific assessment, since the safe BGC range varies from patient to patient.
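For concreteness, the reward shapes compared in Table 3 can be written in a few lines of code. The sketch below implements the risk-based (logarithmic), Gaussian, power, and piecewise linear forms using the example formulas quoted above; the out-of-range penalty and the coefficient ε are placeholders that individual studies tune differently.

import math

def risk_reward(g, penalty=-15.0):
    """Risk-based reward: negative logarithmic blood glucose risk (g in mg/dL)."""
    if g < 10 or g > 1000:
        return penalty
    return -10.0 * (3.35506 * (math.log(g) ** 0.8353 - 3.7932)) ** 2

def gaussian_reward(g, g_ref=127.0, eps=0.01):
    """Exponential reward centred on the reference glucose value."""
    return math.exp(-eps * abs(g - g_ref))

def power_reward(g):
    """Quadratic penalty on the deviation from 100 mg/dL."""
    return -0.001 * (g - 100.0) ** 2

def piecewise_reward(g):
    """Piecewise linear reward over clinically motivated glucose ranges."""
    if 90 <= g <= 140:
        return 1.0
    if 70 <= g < 90 or 140 < g <= 180:
        return 0.1
    if 180 < g <= 300:
        return -0.4 - (g - 180.0) / 200.0
    if 30 <= g < 70:
        return -0.6 + (g - 70.0) / 100.0
    return -1.0

for g in (55, 110, 250):
    print(g, risk_reward(g), gaussian_reward(g), power_reward(g), piecewise_reward(g))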
Table 4. Practical applications of DRL algorithms in AIDs.
References | Data Source (Virtual Platform / Mathematical Model / Electronic Medical Records) | Subjects (Real / Virtual / Offline Clinical Datasets) | Type of Patients (T1DM / T2DM / No Distinction)
Xuehui Yu [76], Harry Emerson [82], Chirath Hettiarachchi [66], Mehrad Jaloli [80], Wenzhou Lv [75], Giulia Noaro [61], Zihao Wang [87], Ian Fox [48], Chirath Hettiarachchi [62], Jinhao Zhu [77], Seunghyun Lee [63], Alan Mackey [81], Phuwadol Viroonluecha [65], Min Hyuk Lim [74], Taiyu Zhu [83], Taiyu Zhu [59], Anas El Fathi [67], Sayyar Ahmad [57], Jingchi Jiang [78], Miriam Chlumsky-Harttmann [79], Senquan Wang [56]
Taiyu Zhu [71]
Silvia Del Giorno [58]
Jonas Nordhaug Myhre [69], Francesco Di Felice [70], Dénes-Fazakas Lehel [64,68]
Mohammad Ali Raheb [73]
Zackarie Ellis [72]
Tianhao Li [60]
Tristan Beolet [84]
Guangyu Wang [86]
Table 5. Comparison between MPC and DRL.
Advantages
  MPC:
  • Strong prediction ability
  • Good constraint-handling ability
  • Advantage in multivariable control
  DRL:
  • No need for precise models
  • No need for complex calculations
  • Strong adaptability
  • Ability to handle complex environments
Disadvantages
  MPC:
  • High dependence on models
  • High computational complexity
  • Limited adaptability
  DRL:
  • High data requirements
  • Poor interpretability
  • Safety concerns for situations not captured in past data
Future research directions
  MPC:
  • Improvement of adaptive models
  • Integration with detection and classification technologies
  • Reduction of computational complexity
  DRL:
  • Efficient data learning
  • Improvement of interpretability
  • Strengthening of safety mechanisms
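The model dependence and per-step optimization burden listed for MPC above can be illustrated with a toy receding-horizon loop. The sketch assumes a hypothetical one-state linear glucose–insulin model (the parameters A and B, the horizon N, and the target are illustrative), not one of the physiological models used in the cited work.

import numpy as np
from scipy.optimize import minimize

# Hypothetical discrete-time linear model: glucose deviation x (mg/dL), insulin action u (U/h).
A, B = 0.98, -2.0          # model parameters the controller depends on
N = 6                      # prediction horizon (steps)
g_target = 120.0

def predict(x0, u_seq):
    x, traj = x0, []
    for u in u_seq:
        x = A * x + B * u   # model prediction step
        traj.append(x)
    return np.array(traj)

def mpc_step(g_now, u_max=3.0):
    x0 = g_now - g_target
    cost = lambda u: np.sum(predict(x0, u) ** 2) + 0.1 * np.sum(u ** 2)
    res = minimize(cost, x0=np.zeros(N), bounds=[(0.0, u_max)] * N)  # solved at every control step
    return res.x[0]          # apply only the first move, then re-optimize at the next step

print(mpc_step(180.0))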
Table 6. Comparison between PID and DRL.
Advantages
  PID:
  • Few parameters and simple operation
  • Mature theory
  DRL:
  • Strong adaptability
  • No need for precise models
  • Ability to handle complex environments
Disadvantages
  PID:
  • Limitation of the fixed form of the controller equation
  • Poor adaptability
  • Inability to handle constraints
  • Inability to handle complex dynamics, nonlinearities
  DRL:
  • The same as Table 5
Future research directions
  PID:
  • Optimization of adaptive parameter tuning
  • Expansion of the integration of intelligent algorithms
  DRL:
  • The same as Table 5
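The "fixed form of the controller equation" noted for PID refers to the single control law mapping the glucose error to an insulin adjustment. A minimal sketch follows; the gains, target, and sampling interval are illustrative and not clinically tuned.

class PIDBasalController:
    """Textbook discrete PID acting on the glucose error; the control law itself never changes."""

    def __init__(self, kp=0.01, ki=0.0005, kd=0.05, target=120.0, dt=5.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.target, self.dt = target, dt
        self.integral, self.prev_error = 0.0, 0.0

    def insulin_rate(self, glucose, basal=1.0, max_rate=3.0):
        error = glucose - self.target
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        u = basal + self.kp * error + self.ki * self.integral + self.kd * derivative
        return min(max(u, 0.0), max_rate)   # clamp to pump limits (no formal constraint handling)

pid = PIDBasalController()
print(pid.insulin_rate(180.0), pid.insulin_rate(175.0))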
Table 7. Comparison between FL and DRL.
Advantages
  FL:
  • Based on clinical experience
  • Without the need for complex models
  DRL:
  • Strong adaptive ability
  • Get rid of model dependence
  • Respond flexibly to complex environments
Disadvantages
  FL:
  • Lack of flexibility
  • Lack of rigorous theoretical guidance
  • Limited adaptability
  DRL:
  • The same as Table 5
Future research directions
  FL:
  • New techniques to optimize fuzzy rules and functions
  • Integration with other technologies
  • Integration with artificial intelligence technologies
  DRL:
  • The same as Table 5
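Fuzzy logic (FL) control encodes clinical experience as if–then rules over fuzzy glucose categories. The following minimal sketch uses two hypothetical rules and triangular membership functions; the categories, breakpoints, and output scales are purely illustrative and not drawn from any cited rule base.

def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzy_basal_scale(glucose):
    # Fuzzy categories for glucose (mg/dL).
    high = tri(glucose, 140.0, 220.0, 300.0)
    in_range = tri(glucose, 70.0, 120.0, 180.0)
    # Rule 1: IF glucose is high      THEN increase basal (scale 1.5).
    # Rule 2: IF glucose is in range  THEN keep basal     (scale 1.0).
    weights = [high, in_range]
    outputs = [1.5, 1.0]
    if sum(weights) == 0.0:
        return 1.0   # no rule fires: fall back to the nominal basal rate
    # Weighted-average (Sugeno-style) defuzzification.
    return sum(w * o for w, o in zip(weights, outputs)) / sum(weights)

print(fuzzy_basal_scale(95.0), fuzzy_basal_scale(250.0))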
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
