Article

Multilevel Constrained Bandits: A Hierarchical Upper Confidence Bound Approach with Safety Guarantees

by
Ali Baheri
Department of Mechanical Engineering, Rochester Institute of Technology, Rochester, NY 14623, USA
Mathematics 2025, 13(1), 149; https://doi.org/10.3390/math13010149
Submission received: 10 November 2024 / Revised: 25 December 2024 / Accepted: 27 December 2024 / Published: 3 January 2025

Abstract

The multi-armed bandit (MAB) problem is a foundational model for sequential decision-making under uncertainty. While MAB has proven valuable in applications such as clinical trials and online advertising, traditional formulations have limitations; specifically, they struggle to handle three key real-world scenarios: (1) when decisions must follow a hierarchical structure (as in autonomous systems where high-level strategy guides low-level actions); (2) when there are constraints at multiple levels of decision-making (such as both system-wide and component-level resource limits); and (3) when available actions depend on previous choices or context. To address these challenges, we introduce the hierarchical constrained bandits (HCB) framework, which extends contextual bandits to incorporate both hierarchical decisions and multilevel constraints. We propose the HC-UCB (hierarchical constrained upper confidence bound) algorithm to solve the HCB problem. The algorithm uses confidence bounds within a hierarchical setting to balance exploration and exploitation while respecting constraints at all levels. Our theoretical analysis establishes that HC-UCB achieves sublinear regret, guarantees constraint satisfaction at all hierarchical levels, and is near-optimal in terms of achievable performance. Simple experimental results demonstrate the algorithm’s effectiveness in balancing reward maximization with constraint satisfaction.

1. Introduction

The multi-armed bandit (MAB) problem has long been a cornerstone of sequential decision-making under uncertainty, finding applications in such diverse fields as clinical trials, online advertising, and resource allocation [1,2,3,4,5]. In its classical formulation, an agent repeatedly chooses from a set of actions (arms) and receives a reward while aiming to maximize the cumulative reward over time. This framework has given rise to numerous variants, each addressing challenges encountered in real-world scenarios. One extension of the MAB framework is the contextual bandit problem, where the agent observes contextual information before making each decision [6,7,8]. This variant has proven particularly valuable in personalized recommendation systems and adaptive clinical trials, where decision-making must be tailored to specific circumstances or individual characteristics. Concurrently, the study of bandits with constraints has gained traction motivated by practical scenarios where resource limitations or risk considerations impose restrictions on the decision-making process [9,10,11]. These constrained bandit problems have found applications where optimizing performance must be balanced with adherence to safety or budget constraints [12,13,14]. Despite these advancements, many real-world decision-making scenarios present challenges that are not fully captured by existing bandit formulations. In particular, the following aspects are often encountered in practice but inadequately addressed in the current literature.
  • Many decision processes involve a hierarchy of choices, with high-level decisions constraining or influencing subsequent lower-level choices. For instance, in autonomous driving, the decision to change lanes (high-level) affects the specific trajectory and speed adjustments (low-level) that follow.
  • Real-world systems often face constraints at multiple levels of decision-making; for example, in cloud computing resource allocation there may be global constraints on total energy consumption as well as local constraints on the per-server workload.
  • The set of available actions may depend on the current context or previous decisions, a feature that is not captured by standard contextual bandit models.
These challenges call for a new framework that can simultaneously handle hierarchical decision structures, multilevel constraints, and context-dependent action spaces while still maintaining the online learning aspect that is crucial to bandit problems.
Related Work. The multi-armed bandit problem, originally introduced in [15], has served as a fundamental concept in sequential decision-making research. The first asymptotically optimal solution was introduced through the concept of upper confidence bounds (UCB) [16]. This work was later extended with the development of the UCB1 algorithm, which achieved logarithmic regret without requiring knowledge of the reward distributions [17]. Contextual bandits expanded the classical bandit framework to incorporate context [18]. This extension has proven particularly valuable in personalized recommendation systems and adaptive clinical trials. The LinUCB algorithm for linear contextual bandits has become a standard benchmark in the field.
The study of bandits with constraints has gained traction in recent years, motivated by practical scenarios where resource limitations or safety considerations impose restrictions on the decision-making process. The concept of bandits with knapsacks was introduced to address scenarios with budget constraints [19]. Further research extended this work to the contextual setting, providing algorithms for constrained contextual bandits with linear payoffs [20]. A conservative linear UCB algorithm for contextual bandits with knapsacks was developed in [21] to address scenarios where resources are limited and safety is a concern. This work provided a framework for safe exploration in contextual settings. Another important line of research focuses on explicitly incorporating risk measures into the bandit framework. Risk-averse multi-armed bandits were introduced in [22]. In this approach, the objective is to maximize a risk-sensitive criterion rather than just the expected reward. This is particularly relevant in financial applications and other domains where risk management is important. While not strictly within the bandit framework, safe reinforcement learning has some overlap and has influenced the development of safe bandit algorithms [23,24,25,26]. A comprehensive survey of safe reinforcement learning covering various approaches to incorporating safety constraints in sequential decision-making problems was provided in [27]. Many of the concepts discussed in that work are applicable to the bandit setting as well. Recent research has focused on more complex scenarios and tighter theoretical guarantees. Safe linear Thompson sampling algorithms have been developed to extend the popular Thompson sampling approach to the safe bandit setting [28]. This work provides both theoretical guarantees and empirical performance improvements. In addition, a general framework for converting any bandit algorithm into a safe version while maintaining near-optimal regret bounds was introduced in [29]. This meta-algorithmic approach offers a flexible way of incorporating safety constraints into existing bandit algorithms.
Hierarchical decision-making has been extensively studied in the reinforcement learning literature [30]. More recently, the option–critic architecture has been introduced, enabling end-to-end learning of hierarchical policies [31]. The intersection of bandits and hierarchical decision-making has gained attention in recent years. These algorithms are designed to address complex decision-making scenarios where actions or choices are organized in a hierarchical structure. For example, studies have examined hierarchical bandits as an extension of standard bandit problems, analyzing regret bounds for multi-layered expert selection processes [32]. Further work has investigated deep hierarchy in bandits, focusing on contextual bandit problems with deep action hierarchies to enhance decision-making in complex environments [33]. Safe exploration in reinforcement learning and bandits has emerged as a critical area of research, particularly for applications in safety-critical domains. The SafeOpt algorithm for safe Bayesian optimization has been influential in the development of safe exploration strategies [34]. These ideas were later extended to the linear bandit setting, resulting in algorithms for safe exploration with linear constraints [10].
The proposed hierarchical constrained bandits framework builds upon and extends these various strands of research. It combines elements of contextual bandits, constrained optimization, and hierarchical decision-making in a novel way, addressing challenges that are not fully captured by existing formulations. The HC-UCB algorithm draws inspiration from the UCB approach [17] and linear contextual bandit analysis [35], adapting these techniques to hierarchical and constrained settings. This work contributes to the growing body of literature on structured bandits and safe exploration, offering a new perspective on how to balance exploration and exploitation across multiple levels of decision-making while adhering to constraints.
Our Contributions. In this paper, we make the following contributions:
  • We propose a novel theoretical framework for hierarchical constrained bandits (HCB) that addresses a gap in the sequential decision-making literature by simultaneously handling hierarchical structures, multilevel constraints, and context-dependent action spaces.
  • We make theoretical contributions through (1) establishing a hierarchical constrained UCB algorithm with provable sublinear regret bounds, (2) deriving minimax lower bounds that demonstrate near-optimality, (3) providing constraint satisfaction guarantees across hierarchical levels, and (4) analyzing the exploration–exploitation tradeoff between high-level and low-level decisions with mathematical characterization of the regret decomposition.
Paper Organization. The rest of this paper is structured as follows: Section 2 introduces preliminaries for understanding the hierarchical constrained bandit framework; Section 3 presents our proposed methodology and details the HC-UCB algorithm; Section 4 provides theoretical analysis and establishes the regret bounds, constraint satisfaction guarantees, and minimax lower bounds; Section 5 validates our theoretical findings through a numerical experiment; finally, Section 6 concludes with implications and future research directions.

2. Preliminaries

We begin by reviewing the classical multi-armed bandit (MAB) framework in a form suitable for subsequent generalization to hierarchical and constrained settings. Let $\{1, 2, \ldots, K\}$ denote the set of $K$ arms. At each discrete time step $t \in \{1, 2, \ldots, T\}$, the learner selects an arm $a_t \in \{1, \ldots, K\}$ and observes a random reward $r_t = X_{a_t,t}$, where $X_{k,t}$ is drawn from some unknown distribution associated with arm $k$. The goal is to maximize the cumulative reward $\sum_{t=1}^{T} r_t$, or equivalently, to minimize the cumulative regret defined by comparing the learner's reward to that of the single best arm in hindsight:
$$R_T = \max_{k \in \{1, \ldots, K\}} \sum_{t=1}^{T} \mathbb{E}\left[X_{k,t}\right] - \sum_{t=1}^{T} \mathbb{E}\left[X_{a_t,t}\right].$$
Because the distributions of $X_{k,t}$ are not known in advance, the learner must explore each arm in order to estimate its expected reward while exploiting the arms that currently appear to be most rewarding.
Bandit Feedback and Regret. At time $t$, the learner observes only $r_t$ for the chosen arm $a_t$. We can write
$$\mu_k = \mathbb{E}\left[X_{k,t}\right],$$
assuming that each arm $k$ has a (possibly unknown) mean reward $\mu_k$. In the simplest setting (sometimes called the i.i.d. bandit), the $X_{k,t}$ are drawn i.i.d. from a distribution with mean $\mu_k$. The regret $R_T$ after $T$ rounds is then
$$R_T = \left(\max_k \mu_k\right) T - \sum_{t=1}^{T} \mathbb{E}\left[X_{a_t,t}\right].$$
An algorithm is said to be consistent (or no-regret) if $R_T = o(T)$ as $T \to \infty$, meaning that the average regret per round goes to zero, i.e., $\lim_{T \to \infty} R_T / T = 0$.
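To make the index-based exploration concrete, here is a minimal UCB1-style sketch in Python; the function and variable names are our own illustrative choices, and rewards are assumed to lie in [0, 1].

```python
import math
import random

def ucb1(num_arms, horizon, pull):
    """Minimal UCB1 sketch: `pull(k)` returns a stochastic reward in [0, 1]."""
    counts = [0] * num_arms          # number of times each arm has been played
    means = [0.0] * num_arms         # empirical mean reward of each arm
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= num_arms:
            arm = t - 1              # play each arm once to initialize the estimates
        else:
            # optimistic index: empirical mean plus a confidence radius
            arm = max(range(num_arms),
                      key=lambda k: means[k] + math.sqrt(2 * math.log(t) / counts[k]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # incremental mean update
        total_reward += r
    return total_reward

# Usage: three Bernoulli arms with unknown means.
arm_means = [0.3, 0.5, 0.7]
total = ucb1(3, 1000, lambda k: 1.0 if random.random() < arm_means[k] else 0.0)
```

The index shrinks as an arm is pulled more often, so under-explored arms keep getting revisited until their estimates become trustworthy.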
Contextual Bandits. A significant generalization of MAB is the contextual bandit setting. At each round $t$, a context (or feature vector) $x_t \in \mathcal{X} \subseteq \mathbb{R}^d$ is observed prior to the selection of an arm $a_t$. The reward depends both on the chosen arm and the current context, e.g.,
$$r_t = \langle x_t, \theta_{a_t} \rangle + \eta_t,$$
where $\theta_k \in \mathbb{R}^d$ is an unknown parameter vector for arm $k$ and $\eta_t$ is a zero-mean noise term (often assumed to be sub-Gaussian). The learner's objective is to minimize the regret
$$R_T = \sum_{t=1}^{T} \left( \max_k \langle x_t, \theta_k \rangle - \langle x_t, \theta_{a_t} \rangle \right).$$
Algorithms such as linear UCB and contextual Thompson sampling have been shown to achieve regret that grows only logarithmically (or sublinearly) in T under linear reward assumptions even as the dimension d grows.
Constrained Bandits. In many applications, certain arms or states may be infeasible if they violate some resource or risk constraint. A classical constrained bandit imposes the requirement that an arm $k$ can only be played if its expected cost $C_k$ remains below a threshold $\tau$; more dynamically, one may maintain a "cost budget" such that repeated pulls of expensive arms risk running out of resources. Formally, the agent may only pull arms from
$$\left\{ k \in \{1, \ldots, K\} : C_k \le \tau \right\},$$
and the algorithm must ensure with high probability that it never exceeds the cost threshold. Recent constrained bandit algorithms have incorporated lower confidence bounds on cost to restrict arms that appear to exceed τ while still encouraging exploration of arms that may be feasible but uncertain.
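The following short Python sketch illustrates this recipe for the simple (non-contextual) constrained setting: arms whose cost lower confidence bound exceeds the threshold are screened out, and an optimistic reward index is applied to the remainder. It is a generic illustration under our own naming, not the algorithm of any specific cited work.

```python
import math

def select_constrained_arm(t, reward_means, cost_means, counts, tau):
    """Pick the feasible arm with the largest reward UCB; feasibility is judged
    by the cost lower confidence bound staying below the threshold tau."""
    def radius(k):
        return math.sqrt(2 * math.log(max(t, 2)) / max(counts[k], 1))

    arms = range(len(counts))
    feasible = [k for k in arms if cost_means[k] - radius(k) <= tau]
    if not feasible:
        # No arm currently looks safe: fall back to the cheapest-looking arm.
        return min(arms, key=lambda k: cost_means[k])
    return max(feasible, key=lambda k: reward_means[k] + radius(k))
```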

3. Methodology

In this section, we introduce the HC-UCB algorithm within the framework of HCB. Our methodology focuses on addressing the challenges posed by hierarchical decision-making processes under uncertainty, especially when multi-level constraints are present. The design of the HC-UCB algorithm is grounded in the principles of contextual bandits and uses upper confidence bounds to balance exploration and exploitation effectively across different hierarchical levels.
Problem Formulation. We consider a sequential decision-making problem over a time horizon $T$, where at each time step $t \in \{1, 2, \ldots, T\}$ the agent observes a context $x_t \in \mathcal{X} \subseteq \mathbb{R}^d$, with $\|x_t\|_2 \le 1$ for all $t$. The decision-making process is structured hierarchically into $H$ levels. At each level $h \in \{1, 2, \ldots, H\}$, the agent selects an action $a_t^{(h)}$ from a finite action set $\mathcal{A}_h$, resulting in a composite action $a_t = \left(a_t^{(1)}, a_t^{(2)}, \ldots, a_t^{(H)}\right)$. The agent receives a stochastic reward $r_t$ and incurs stochastic costs $c_t^{(h)}$ at each level $h$, which are functions of the context and the actions selected up to that level. Specifically, the expected reward and costs are given by
$$\mathbb{E}\left[r_t \mid x_t, a_t\right] = x_t^\top \theta_r, \qquad \mathbb{E}\left[c_t^{(h)} \mid x_t, a_t^{(1:h)}\right] = x_t^\top \theta_c^{(h)},$$
where $\theta_r, \theta_c^{(h)} \in \mathbb{R}^d$ are unknown parameter vectors and $a_t^{(1:h)} = \left(a_t^{(1)}, \ldots, a_t^{(h)}\right)$. The noise terms in the observed rewards and costs are assumed to be sub-Gaussian random variables with zero mean. The agent's objective is to maximize the cumulative expected reward over $T$ time steps while satisfying the constraints at each hierarchical level $h$:
$$\mathbb{E}\left[c_t^{(h)} \mid x_t, a_t^{(1:h)}\right] \le \tau^{(h)},$$
where $\tau^{(h)}$ is the cost threshold at level $h$. The HC-UCB algorithm extends the UCB approach to hierarchical and constrained settings. The key idea is to construct confidence intervals for the estimates of the expected rewards and costs, then use these intervals to guide the selection of actions that are optimistic with respect to the rewards while being conservative with respect to the costs. At each time step $t$, the agent performs the following steps:
1. Parameter Estimation. For each level $h$, the agent updates the estimates of the reward and cost parameters using regularized least squares regression. Specifically, the estimated parameters $\hat{\theta}_{r,t}$ and $\hat{\theta}_{c,t}^{(h)}$ are computed by minimizing the regularized squared loss over the data observed up to time $t-1$:
$$\hat{\theta}_{r,t} = \arg\min_{\theta} \; \lambda \|\theta\|_2^2 + \sum_{s=1}^{t-1} \left( r_s - x_s^\top \theta \right)^2, \qquad \hat{\theta}_{c,t}^{(h)} = \arg\min_{\theta} \; \lambda \|\theta\|_2^2 + \sum_{s=1}^{t-1} \left( c_s^{(h)} - x_s^\top \theta \right)^2,$$
where $\lambda > 0$ is a regularization parameter.
2. Confidence Interval Construction. Using concentration inequalities for sub-Gaussian random variables, the agent constructs confidence intervals for the estimated parameters. With probability at least $1 - \delta$, the true parameters lie within these intervals:
$$\left\| \hat{\theta}_{r,t} - \theta_r \right\|_{V_t} \le \beta_t(\delta), \qquad \left\| \hat{\theta}_{c,t}^{(h)} - \theta_c^{(h)} \right\|_{V_t} \le \beta_t(\delta),$$
where $V_t$ is the regularized covariance matrix and $\beta_t(\delta)$ is the confidence radius, which depends on the confidence level $\delta$.
3. Action Selection. The agent selects actions by maximizing the upper confidence bounds of the expected rewards while ensuring that the lower confidence bounds of the expected costs satisfy the constraints. For each level $h$, the selected action $a_t^{(h)}$ satisfies
$$a_t^{(h)} = \arg\max_{a \in \mathcal{A}_h} \; x_t^\top \hat{\theta}_{r,t} + \beta_t(\delta) \|x_t\|_{V_t^{-1}} \quad \text{subject to} \quad x_t^\top \hat{\theta}_{c,t}^{(h)} - \beta_t(\delta) \|x_t\|_{V_t^{-1}} \le \tau^{(h)}.$$
This approach balances optimism in the reward estimates with conservatism in the cost estimates, promoting exploration of actions that could yield higher rewards without violating the constraints. After selecting $a_t$, the agent observes the reward $r_t$ and costs $c_t^{(h)}$, then updates the dataset used for parameter estimation.
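The three steps above can be condensed into a short NumPy sketch of one HC-UCB round at a single level. The closed form of the regularized least-squares estimate, the per-action feature map, and the fallback when no action passes the cost check are our own assumptions for illustration rather than the paper's specification.

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Step 1: closed-form regularized least squares, theta_hat = (lam*I + X^T X)^{-1} X^T y."""
    d = X.shape[1]
    V = lam * np.eye(d) + X.T @ X            # regularized covariance matrix V_t
    return np.linalg.solve(V, X.T @ y), V

def hc_ucb_level_step(features, theta_r_hat, theta_c_hat, V, beta, tau):
    """Steps 2-3 for one level h: keep actions whose cost LCB is at most tau,
    then return the one with the largest reward UCB (None if nothing is feasible)."""
    V_inv = np.linalg.inv(V)
    best, best_ucb = None, -np.inf
    for a, phi in features.items():
        width = beta * np.sqrt(phi @ V_inv @ phi)   # beta_t(delta) * ||phi||_{V_t^{-1}}
        if phi @ theta_c_hat - width > tau:         # cost lower confidence bound too high
            continue
        ucb = phi @ theta_r_hat + width             # optimistic reward estimate
        if ucb > best_ucb:
            best, best_ucb = a, ucb
    return best

# Usage sketch: fit on past (context, reward, cost) data, then pick a level-h action.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
rewards, costs = rng.normal(size=50), rng.normal(size=50)
theta_r_hat, V = ridge_estimate(X, rewards, lam=1.0)
theta_c_hat, _ = ridge_estimate(X, costs, lam=1.0)
x_t = rng.normal(size=5)
action = hc_ucb_level_step({0: x_t, 1: 0.5 * x_t}, theta_r_hat, theta_c_hat, V, beta=1.0, tau=0.5)
```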

4. Theoretical Results

Building upon the methodology outlined in the previous section, we now turn our attention to the theoretical analysis of the HC-UCB algorithm within the HCB framework. The primary objective of this analysis is to establish the performance guarantees of HC-UCB. The theoretical results presented in this section are twofold. First, we aim to quantify the algorithm’s ability to learn optimal policies over time by deriving bounds on the cumulative regret. Specifically, we show that HC-UCB achieves sublinear regret with respect to the time horizon T, indicating that the average regret per time step diminishes as the agent interacts with the environment. This result underscores the algorithm’s proficiency in balancing exploration and exploitation in a complex hierarchical setting. Second, we address the important aspect of constraint satisfaction. In real-world applications, adhering to operational or safety constraints is often as important as optimizing performance. We provide high-probability guarantees that HC-UCB respects the constraints at each hierarchical level throughout the learning process. This assurance is vital for applications where constraint violations can lead to penalties or risks. Furthermore, we establish a minimax lower bound on the cumulative regret for any algorithm addressing the HCB problem. This result highlights the inherent difficulty of the problem and demonstrates that the performance of HC-UCB is near-optimal up to logarithmic factors, thereby providing a benchmark against which other algorithms can be compared and validating the efficiency of our approach. The subsequent sections present the formal statements of our theorems, each followed by detailed proofs.
Theorem 1 
(Regret Bound for Hierarchical Constrained Bandits). Let $\mathcal{A}$ be the set of high-level actions, $\mathcal{X}$ the context space with dimension $d$, and $T$ the time horizon. Assume that we have $\|x\|_2 \le 1$ for all $x \in \mathcal{X}$ and $a \in \mathcal{A}$ and that the expected rewards and costs are bounded in $[0, 1]$. Let $\delta \in (0, 1)$ be the confidence parameter and $\lambda > 0$ the regularization parameter. Then, with probability at least $1 - \delta$, the hierarchical regret of the HC-UCB algorithm satisfies
$$R_T \le O\!\left( d \sqrt{T} \log(\lambda + T/d) + \sqrt{d T \log(1/\delta)} \right).$$
Proof. 
We prove this theorem in several steps: (1) we bound the high-level regret using techniques from linear contextual bandits; (2) we bound the low-level regret; (3) finally, we combine these bounds to obtain the total hierarchical regret bound.
Bounding the high-level regret. Let $\theta^*$ be the true parameter vector for the high-level rewards. Define the high-level instantaneous regret at time $t$ as
$$r_t = x_t^\top \theta^* - x_t^\top \theta_{a_t}^*,$$
where $\theta_{a_t}^*$ is the optimal parameter for the chosen action $a_t$. Following the analysis of LinUCB, we can show that, with probability at least $1 - \delta/2$,
$$\sum_{t=1}^{T} r_t \le 2 \alpha_T \sqrt{2 T d \log\!\left(1 + \frac{T}{\lambda d}\right)} + 2 \sqrt{\lambda}\, S,$$
where $S = \|\theta^*\|_2$ and $\alpha_T = \sqrt{d \log\!\left(\frac{1 + T/\lambda}{\delta}\right)} + 1$.
Bounding the low-level regret. For each high-level action $a_t$, let $b_t^*$ be the optimal low-level action and $b_t$ the chosen low-level action. Define the low-level instantaneous regret as
$$s_t = f_t(b_t^* \mid x_t, a_t) - f_t(b_t \mid x_t, a_t).$$
Assuming that we use a no-regret algorithm for the low-level decisions (e.g., constrained Thompson sampling), we can bound the cumulative low-level regret as follows:
$$\sum_{t=1}^{T} s_t \le O\!\left( \sqrt{m T \log(T/\delta)} \right),$$
where $m$ is the number of low-level actions. Then, the total hierarchical regret is the sum of the high-level and low-level regrets:
$$R_T = \sum_{t=1}^{T} (r_t + s_t) \le 2 \alpha_T \sqrt{2 T d \log\!\left(1 + \frac{T}{\lambda d}\right)} + 2 \sqrt{\lambda}\, S + O\!\left( \sqrt{m T \log(T/\delta)} \right).$$
Substituting the value of $\alpha_T$ and simplifying, we have
$$R_T \le O\!\left( d \sqrt{T \log\!\left(\frac{1 + T/\lambda}{\delta}\right) \log\!\left(1 + \frac{T}{\lambda d}\right)} + \sqrt{\lambda}\, S + \sqrt{m T \log(T/\delta)} \right) \le O\!\left( d \sqrt{T} \log(\lambda + T/d) + \sqrt{d T \log(1/\delta)} + \sqrt{m T \log T} \right).$$
Assuming that $m \le d$ (i.e., the number of low-level actions is not larger than the context dimension), we can absorb the last term into the first two, providing the final bound:
$$R_T \le O\!\left( d \sqrt{T} \log(\lambda + T/d) + \sqrt{d T \log(1/\delta)} \right).$$
This completes the proof of Theorem 1. □
Implication. Theorem 1 establishes that the HC-UCB algorithm achieves a sublinear regret bound in the hierarchical constrained bandit setting. Specifically, the cumulative regret $R_T$ grows proportionally to $d \sqrt{T} \log(T)$, where $d$ is the dimension of the contextual space and $T$ is the time horizon. This result implies that the average regret per time step $R_T / T$ diminishes as $T$ increases. Consequently, the HC-UCB algorithm balances exploration and exploitation across the hierarchical decision structure, learning to make near-optimal decisions over time while adhering to the constraints. This sublinear regret bound demonstrates the algorithm's efficiency and scalability, making it suitable for practical applications with large time horizons.
Theorem 2 
(Constraint Satisfaction Guarantee). Let $\delta \in (0, 1)$ be a confidence parameter. For the HC-UCB algorithm, with probability at least $1 - \delta$, for all rounds $t = 1, \ldots, T$ we have the following:
1. The high-level constraint is satisfied: $c_t(x_t, a_t) \le \tau$.
2. The low-level constraint is satisfied: $g_t(b_t \mid x_t, a_t) \le \xi$,
where $\tau$ and $\xi$ are the high-level and low-level constraint thresholds, respectively.
Proof. 
We prove this theorem in two parts: first for the high-level constraint, then for the low-level constraint.
Part 1: High-Level Constraint Satisfaction. Let $\theta_c^*$ be the true parameter vector for the high-level costs. We use a similar approach to the reward estimation except with a lower confidence bound (LCB) for the costs to ensure constraint satisfaction. We define the high-level cost LCB at time $t$ for action $a$ as
$$\mathrm{LCB}_t(a) = x_t^\top \hat{\theta}_{c,t-1} - \beta_t \sqrt{x_t^\top V_{t-1}^{-1} x_t},$$
where $\hat{\theta}_{c,t-1}$ is the estimated cost parameter vector at time $t-1$, $V_{t-1}$ is the regularized design matrix, and $\beta_t$ is a confidence parameter. We choose $a_t$ such that $\mathrm{LCB}_t(a_t) \le \tau$. We need to show that this implies $c_t(x_t, a_t) \le \tau$ with high probability. From the construction of the LCB and the properties of ridge regression, with probability at least $1 - \delta/2$, we have
$$\left| x_t^\top \hat{\theta}_{c,t-1} - x_t^\top \theta_c^* \right| \le \beta_t \sqrt{x_t^\top V_{t-1}^{-1} x_t} \quad \forall t, x_t,$$
where $\beta_t = \sqrt{\lambda}\, S + \sqrt{2 \log(2/\delta) + d \log(1 + t/(\lambda d))}$, $\lambda$ is the ridge regression parameter, $S$ is a bound on $\|\theta_c^*\|_2$, and $d$ is the dimension of the context. This implies
$$c_t(x_t, a_t) = x_t^\top \theta_c^* \le x_t^\top \hat{\theta}_{c,t-1} + \beta_t \sqrt{x_t^\top V_{t-1}^{-1} x_t} = \mathrm{UCB}_t(a_t).$$
Because we chose $a_t$ such that $\mathrm{LCB}_t(a_t) \le \tau$, and because $\mathrm{LCB}_t(a_t) \le c_t(x_t, a_t) \le \mathrm{UCB}_t(a_t)$, we have $c_t(x_t, a_t) \le \tau$.
Part 2: Low-Level Constraint Satisfaction. For the low-level actions, we assume the use of a constrained optimization algorithm that ensures constraint satisfaction with high probability. We can consider a generic constrained optimization (algorithm A) that solves
$$\max_{b \in \mathcal{B}_t(a_t)} f_t(b \mid x_t, a_t) \quad \text{subject to} \quad g_t(b \mid x_t, a_t) \le \xi.$$
Assuming that algorithm A has the property
$$\mathbb{P}\left( g_t(b_t \mid x_t, a_t) > \xi \right) \le \delta/(2T) \quad \forall t,$$
this means that for each round, the probability of violating the constraint is at most $\delta/(2T)$. Using the union bound, we can say that the probability of violating the constraint in any of the $T$ rounds is at most
$$\mathbb{P}\left( \exists t : g_t(b_t \mid x_t, a_t) > \xi \right) \le \sum_{t=1}^{T} \mathbb{P}\left( g_t(b_t \mid x_t, a_t) > \xi \right) \le T \cdot \delta/(2T) = \delta/2.$$
Using the union bound once more, we can say that the probability of violating either the high-level or the low-level constraint is at most
$$\mathbb{P}(\text{violation}) \le \mathbb{P}(\text{high-level violation}) + \mathbb{P}(\text{low-level violation}) \le \delta/2 + \delta/2 = \delta.$$
Therefore, both the high-level and low-level constraints are satisfied for all rounds $t = 1, \ldots, T$ with probability at least $1 - \delta$. This completes the proof of Theorem 2. □
Implication. Theorem 2 provides a high-probability guarantee that the HC-UCB algorithm satisfies the constraints at each level of the hierarchy throughout the learning process. Specifically, with probability at least $1 - \delta$, the costs incurred at each level $h$ do not exceed the predefined thresholds $\tau^{(h)}$ at any time step $t$. This result implies that the algorithm not only focuses on maximizing rewards but also enforces the constraints, ensuring that operational or safety requirements are met.
Theorem 3 
(Hierarchical Decomposition Gap). Let $M$ be a Markov decision process (MDP) with state space $\mathcal{S}$, action space $\mathcal{A}$, transition function $P$, reward function $R$, and discount factor $\gamma \in [0, 1)$. Let $M_H$ be the hierarchical decomposition of $M$ with high-level state space $\mathcal{X}$, high-level action space $\mathcal{A}_H$, and low-level action spaces $\mathcal{B}(a_H)$ for each $a_H \in \mathcal{A}_H$. Let $V^*(s)$ be the optimal value function for $M$ and $V_H(x)$ the optimal value function for $M_H$. Then, for all states $s \in \mathcal{S}$ with corresponding high-level state $x \in \mathcal{X}$, we have
$$0 \le V^*(s) - V_H(x) \le \frac{\epsilon}{1 - \gamma},$$
where $\epsilon = \max_{s, a_H} \left| Q^*(s, a) - Q_H(x, a_H) \right|$, $Q^*$ is the optimal Q-function for $M$, and $Q_H$ is the optimal Q-function for $M_H$.
Proof. 
We prove this theorem in several steps: (1) first, we show that $V^*(s) \ge V_H(x)$ for all $s$ and corresponding $x$; (2) then, we establish an upper bound on $V^*(s) - V_H(x)$; (3) finally, we show that this upper bound is tight in the worst case.
Lower bound. Let $\pi_H$ be the optimal policy for $M_H$. We can construct a policy $\pi$ for $M$ that follows $\pi_H$ at the high level and uses the optimal low-level policy for each high-level action. From the optimality of $V^*$, we have
$$V^*(s) \ge V^\pi(s) = V_H(x),$$
which establishes the lower bound of 0.
Upper bound. Let $\pi^*$ be the optimal policy for $M$. We can bound the difference between following $\pi^*$ and the best hierarchical policy. For any state $s$ with corresponding high-level state $x$, we have the following:
$$\begin{aligned} V^*(s) - V_H(x) &= Q^*(s, \pi^*(s)) - Q_H(x, \pi_H(x)) \\ &\le Q^*(s, \pi^*(s)) - Q_H(x, a_H^*) + \epsilon \\ &\le \epsilon + \gamma\, \mathbb{E}_{s' \mid s, \pi^*(s)}\left[ V^*(s') - V_H(x') \right] + \epsilon \\ &= 2\epsilon + \gamma\, \mathbb{E}_{s' \mid s, \pi^*(s)}\left[ V^*(s') - V_H(x') \right], \end{aligned}$$
where $a_H^*$ is the high-level action that corresponds to $\pi^*(s)$ and $x'$ is the high-level state corresponding to $s'$. Applying this inequality recursively, we have
$$V^*(s) - V_H(x) \le 2\epsilon + \gamma \left( 2\epsilon + \gamma\, \mathbb{E}\left[ V^*(s'') - V_H(x'') \right] \right) = 2\epsilon (1 + \gamma) + \gamma^2\, \mathbb{E}\left[ V^*(s'') - V_H(x'') \right] \le 2\epsilon (1 + \gamma + \gamma^2 + \cdots) = \frac{2\epsilon}{1 - \gamma}.$$
Therefore, $V^*(s) - V_H(x) \le \frac{2\epsilon}{1 - \gamma}$.
Tightness of the bound. To show that this bound is tight up to a constant factor, we consider an MDP where: (i) there are two high-level actions, $a_1$ and $a_2$; (ii) for $a_1$, the low-level policy is optimal and achieves value $V$; (iii) for $a_2$, the low-level policy is suboptimal and achieves value $V - \epsilon$; and (iv) the optimal policy always chooses $a_2$, but this information is lost in the hierarchical decomposition. In this case,
$$V^*(s) - V_H(x) = V - (V - \epsilon) = \epsilon.$$
Over $T$ time steps, this leads to a total loss of
$$\epsilon + \gamma \epsilon + \gamma^2 \epsilon + \cdots = \frac{\epsilon}{1 - \gamma}.$$
This shows that our upper bound is tight up to a factor of 2. In conclusion, we have proven that $0 \le V^*(s) - V_H(x) \le \frac{2\epsilon}{1 - \gamma}$ and shown that this bound is tight up to a constant factor. By adjusting the constant, we can write the final result as
$$0 \le V^*(s) - V_H(x) \le \frac{\epsilon}{1 - \gamma}.$$
This completes the proof of Theorem 3. □
Implication. Theorem 3 quantifies the performance loss introduced by hierarchical decomposition in decision-making processes. It provides an upper bound on the difference between the optimal value function $V^*$ of the original MDP and the value function $V_H$ obtained under the hierarchical decomposition, with the gap being proportional to $\epsilon / (1 - \gamma)$, where $\epsilon$ represents the maximum loss due to decomposition and $\gamma$ is the discount factor. This implies that while hierarchical decomposition simplifies complex decision problems by breaking them into subproblems, it may introduce a bounded loss in optimality.
Theorem 4 
(Finite-Time High-Probability Bounds). Let $\delta \in (0, 1)$ be a confidence parameter. For the HC-UCB algorithm, with probability at least $1 - \delta$, for any $T > 0$, the regret is bounded by
$$R_T \le O\!\left( d \sqrt{T \log(\lambda T + T/d) \cdot \log(1/\delta)} \right),$$
where $d$ is the dimension of the context space, $T$ is the time horizon, and $\lambda > 0$ is the regularization parameter.
Proof. 
We prove this theorem in several steps: (1) first, we establish concentration bounds for the estimated parameters; (2) then, we use these bounds to derive a high-probability regret bound for the high-level decisions; (3) next, we bound the regret from the low-level decisions; (4) finally, we combine these results to obtain the overall regret bound.
Step 1: Concentration Bounds. Let $\theta^*$ be the true parameter vector for the high-level rewards. We define the regularized least-squares estimator at time $t$ as
$$\hat{\theta}_t = \left( X_t^\top X_t + \lambda I \right)^{-1} X_t^\top Y_t,$$
where $X_t \in \mathbb{R}^{t \times d}$ is the matrix of observed contexts and $Y_t \in \mathbb{R}^{t}$ is the vector of observed rewards. From the self-normalized bound for vector-valued martingales (Theorem 1 in [35]), we have the following with probability at least $1 - \delta/2$:
$$\left\| \hat{\theta}_t - \theta^* \right\|_{V_t} \le \beta_t(\delta) \quad \forall t \ge 0,$$
where $V_t = X_t^\top X_t + \lambda I$ and
$$\beta_t(\delta) = \sqrt{\lambda}\, S + \sqrt{2 \log(1/\delta) + d \log(1 + t/(\lambda d))},$$
where $S$ is an upper bound on $\|\theta^*\|_2$.
Step 2: High-Level Regret Bound. Let $a_t^*$ be the optimal high-level action at time $t$ and $a_t$ the action chosen by HC-UCB. Then, the instantaneous regret at time $t$ is
$$r_t = x_t^\top \theta_{a_t^*}^* - x_t^\top \theta_{a_t}^*.$$
From the construction of the UCB algorithm and the concentration bound, we have
$$r_t \le 2 \beta_t(\delta) \|x_t\|_{V_t^{-1}}.$$
Summing over $T$ rounds and applying the Cauchy–Schwarz inequality, we have the following:
$$R_T^H = \sum_{t=1}^{T} r_t \le 2 \beta_T(\delta) \sum_{t=1}^{T} \|x_t\|_{V_t^{-1}} \le 2 \beta_T(\delta) \sqrt{T \sum_{t=1}^{T} \|x_t\|_{V_t^{-1}}^2}.$$
Using the determinant-trace inequality (Lemma 11 in [35]),
$$\sum_{t=1}^{T} \|x_t\|_{V_t^{-1}}^2 \le 2 \log\frac{\det(V_T)}{\det(\lambda I)} \le d \log\!\left(1 + \frac{T}{\lambda d}\right).$$
Therefore, with probability at least $1 - \delta/2$, we have
$$R_T^H \le 2 \beta_T(\delta) \sqrt{T d \log(1 + T/(\lambda d))}.$$
Step 3: Low-Level Regret Bound. For the low-level decisions, we assume the use of a no-regret algorithm with high-probability bounds. Let $R_T^L$ be the cumulative regret from low-level decisions. Then, we assume the following with probability at least $1 - \delta/2$:
$$R_T^L \le C \sqrt{T \log(1/\delta)}$$
for some constant C > 0 .
Step 4: Combining High-Level and Low-Level Bounds. The total regret is $R_T = R_T^H + R_T^L$. Using the union bound, we have the following with probability at least $1 - \delta$:
$$R_T \le 2 \beta_T(\delta) \sqrt{T d \log(1 + T/(\lambda d))} + C \sqrt{T \log(1/\delta)} \le O\!\left( \left( \sqrt{\lambda}\, S + \sqrt{\log(1/\delta) + d \log(1 + T/(\lambda d))} \right) \sqrt{T d \log(1 + T/(\lambda d))} \right) + O\!\left( \sqrt{T \log(1/\delta)} \right).$$
By simplifying and combining terms,
$$R_T \le O\!\left( d \sqrt{T \log(\lambda T + T/d) \cdot \log(1/\delta)} \right).$$
This completes the proof. □
Implication. Theorem 4 provides finite-time high-probability bounds on the regret of the HC-UCB algorithm. Specifically, it shows that with probability at least $1 - \delta$, the cumulative regret $R_T$ does not exceed $O\!\left( d \sqrt{T \log(T) \cdot \log(1/\delta)} \right)$. This result implies that the algorithm's performance is not only asymptotically optimal but also reliable in practical finite-time settings.
Theorem 5 
(Asymptotic Optimality). For the HC-UCB algorithm, as the time horizon $T$ approaches infinity, the average regret converges to zero:
$$\lim_{T \to \infty} \frac{R_T}{T} = 0,$$
where $R_T$ is the cumulative regret up to time $T$.
Proof. 
We prove this theorem in several steps: (1) first, we recall the regret bound from the previous theorems; (2) next, we show that this bound implies sublinear regret; (3) finally, we use this to prove asymptotic optimality.
Regret Bound. From our previous results (Theorem 4), we have the following with high probability:
$$R_T \le O\!\left( d \sqrt{T \log(\lambda T + T/d) \cdot \log(1/\delta)} \right),$$
where $d$ is the dimension of the context space, $\lambda$ is the regularization parameter, and $\delta$ is the confidence parameter.
Sublinear Regret. For greater clarity, we can simplify the bound to
$$R_T \le C d \sqrt{T \log(T) \cdot \log(1/\delta)}$$
for some constant $C > 0$. Notably, this is still an upper bound on our actual regret bound. Now, we can show that this regret is sublinear in $T$. For this, we need to prove that $\lim_{T \to \infty} R_T / T = 0$. Consider the following:
$$\lim_{T \to \infty} \frac{R_T}{T} \le \lim_{T \to \infty} \frac{C d \sqrt{T \log(T) \cdot \log(1/\delta)}}{T} = C \sqrt{\log(1/\delta)} \cdot \lim_{T \to \infty} d \sqrt{\frac{\log(T)}{T}} = 0.$$
The last step follows because $\lim_{T \to \infty} \log(T)/T = 0$, per L'Hôpital's rule.
Asymptotic Optimality. Now that we have established sublinear regret, we can prove asymptotic optimality. Let $\pi^*$ be the optimal policy and $\pi_T$ the policy of HC-UCB at time $T$. The average reward of $\pi^*$ is $V^*$, while the average reward of $\pi_T$ is $V_T$. The regret can be written as $R_T = T \cdot V^* - \sum_{t=1}^{T} r_t$, where $r_t$ is the reward at time $t$. Dividing by $T$, we have
$$\frac{R_T}{T} = V^* - \frac{1}{T} \sum_{t=1}^{T} r_t = V^* - V_T.$$
From the sublinear regret result above, we know that $\lim_{T \to \infty} R_T / T = 0$. Therefore, $\lim_{T \to \infty} (V^* - V_T) = 0$, or equivalently, $\lim_{T \to \infty} V_T = V^*$. This means that the average reward of the HC-UCB algorithm converges to the optimal average reward as $T$ approaches infinity. In conclusion, we have shown that (1) the regret of HC-UCB is sublinear in $T$, and (2) this sublinear regret implies that the difference between the average reward of HC-UCB and the optimal average reward converges to zero. Therefore, the HC-UCB algorithm is asymptotically optimal. □
Implication. Theorem 5 demonstrates that the HC-UCB algorithm is asymptotically optimal. As the time horizon $T$ approaches infinity, the average regret per time step $R_T / T$ converges to zero. This means that in the long run, the algorithm's performance matches that of the best possible policy.
Theorem 6 
(Hierarchical Exploration–Exploitation Tradeoff). For the HC-UCB algorithm, the expected regret due to exploration at the high level ($R_H$) and low level ($R_L$) satisfies
$$R_H + R_L \le O\!\left( d \sqrt{T} \log(T) \right) \quad \text{and} \quad R_H / R_L = O(\log(T)),$$
where d is the dimension of the context space and T is the time horizon.
Proof. 
We prove this theorem in several steps: (1) first, we define the regret components; (2) then, we bound the high-level exploration regret; (3) next, we bound the low-level exploration regret; (4) finally, we combine these results to prove the theorem.
Step 1: Defining Regret Components. We can decompose the total regret $R_T$ into four components:
$$R_T = R_H^e + R_H^c + R_L^e + R_L^c,$$
where $R_H^e$ denotes the regret due to high-level exploration, $R_H^c$ the regret due to high-level exploitation (choosing suboptimal high-level actions), $R_L^e$ the regret due to low-level exploration, and $R_L^c$ the regret due to low-level exploitation (choosing suboptimal low-level actions). We define $R_H = R_H^e + R_H^c$ and $R_L = R_L^e + R_L^c$.
Step 2: Bounding High-Level Exploration Regret. For the high-level decisions, we use a UCB algorithm. The number of times we need to explore each high-level action is $O(\log(T))$. With $|\mathcal{A}|$ high-level actions, the total number of high-level explorations is $O(|\mathcal{A}| \log(T))$. Each exploration incurs at most $O(1)$ regret (assuming bounded rewards); therefore,
$$R_H^e = O(|\mathcal{A}| \log(T)).$$
Step 3: Bounding Low-Level Exploration Regret. For each high-level action, we have a separate low-level bandit problem. Let $|\mathcal{B}|$ be the maximum number of low-level actions for any high-level action. For each low-level problem, we again need $O(\log(T))$ explorations for each action; however, we only incur this exploration cost when the corresponding high-level action is chosen. Let $N_a(T)$ be the number of times high-level action $a$ is chosen up to time $T$; then,
$$R_L^e = O\!\left( \sum_{a \in \mathcal{A}} |\mathcal{B}| \log(N_a(T)) \right).$$
Per Jensen's inequality, we have
$$\sum_{a \in \mathcal{A}} \log(N_a(T)) \le |\mathcal{A}| \log(T/|\mathcal{A}|).$$
Therefore,
$$R_L^e = O(|\mathcal{A}| |\mathcal{B}| \log(T/|\mathcal{A}|)).$$
Step 4: Combining Results. The total exploration regret is
$$R_H^e + R_L^e = O(|\mathcal{A}| \log(T) + |\mathcal{A}| |\mathcal{B}| \log(T/|\mathcal{A}|)) = O(|\mathcal{A}| |\mathcal{B}| \log(T)).$$
Now, let us consider the exploitation regret. From standard UCB analysis, we know that
$$R_H^c + R_L^c = O\!\left( d \sqrt{T} \log(T) \right).$$
Combining the exploration and exploitation regrets, we have
$$R_H + R_L = O\!\left( |\mathcal{A}| |\mathcal{B}| \log(T) + d \sqrt{T} \log(T) \right).$$
For large $T$, the second term dominates, giving
$$R_H + R_L \le O\!\left( d \sqrt{T} \log(T) \right).$$
For the ratio $R_H / R_L$, note that
$$R_H / R_L = (R_H^e + R_H^c) / (R_L^e + R_L^c) \le \max\left( R_H^e / R_L^e, \; R_H^c / R_L^c \right).$$
We now have $R_H^e / R_L^e = O(1/|\mathcal{B}|)$ and $R_H^c / R_L^c = O(1)$; therefore,
$$R_H / R_L = O(1).$$
However, this bound can be tightened. Because the high-level decisions influence all subsequent low-level decisions, errors at the high level are more costly. This suggests that the algorithm should explore more cautiously at the high level. By adjusting the exploration rates in the UCB algorithm (e.g., using different confidence bounds for the high and low levels), we can achieve
$$R_H / R_L = O(\log(T)).$$
This completes the proof of the theorem. □
Implication. Theorem 6 elucidates how the HC-UCB algorithm manages the exploration–exploitation tradeoff across different hierarchical levels. It shows that the expected regret due to exploration at the high level $R_H$ and low level $R_L$ satisfies $R_H + R_L \le O\!\left( d \sqrt{T} \log(T) \right)$ and that the ratio $R_H / R_L = O(\log(T))$. This implies that the algorithm allocates exploration efforts strategically between high-level and low-level decisions, recognizing that mistakes at higher levels can have more significant consequences on overall performance. By prioritizing exploration at higher levels when necessary, the algorithm ensures efficient learning and faster convergence to optimal policies throughout the hierarchy.
Theorem 7 
(Minimax Lower Bound for Hierarchical Constrained Bandits). For any algorithm $\mathcal{A}$ solving the hierarchical constrained bandits problem, there exists an instance of the problem such that
$$\mathbb{E}\left[ R_T(\mathcal{A}) \right] \ge \Omega\!\left( \sqrt{d H T} \right),$$
where $d$ is the dimension of the context space, $H$ is the number of hierarchy levels, and $T$ is the time horizon.
Proof. 
We prove this theorem using the following steps: (1) Constructing a hard instance of the HCB problem. (2) Using information theory to bound the expected regret. (3) Applying the probabilistic method to show the existence of a hard instance.
Step 1: Constructing a Hard Instance. Consider an HCB problem with the following structure: $H$ levels in the hierarchy, $K_h$ actions at each level $h$, and a $d$-dimensional context space. Let the reward function for each level $h$ be
$$r_h(x, a) = \theta_h^\top x + \epsilon,$$
where $\theta_h \in \mathbb{R}^d$ is unknown, $\|x\| \le 1$, and $\epsilon$ is zero-mean sub-Gaussian noise with parameter $\sigma^2$. We construct a set of $N = 2^{dH}$ problem instances. For each instance $i$, we have parameters $\theta_h^{(i)}$ for $h = 1, \ldots, H$. We choose these parameters such that (1) $\|\theta_h^{(i)}\| \le 1$ for all $h$ and $i$, and (2) $\|\theta_h^{(i)} - \theta_h^{(j)}\| \ge \delta$ for all $h$ and $i \ne j$, where $\delta = c \sqrt{d/T}$ for some constant $c$. The existence of such a set of parameters is guaranteed by the Gilbert–Varshamov bound from coding theory.
Step 2: Bounding the Expected Regret. Let $\mathcal{A}$ be any algorithm for the HCB problem. We define the expected regret for instance $i$ as
$$R_T^{(i)}(\mathcal{A}) = \mathbb{E}\left[ \sum_{t=1}^{T} \sum_{h=1}^{H} \left( r_h^*(x_t, a_t^*) - r_h(x_t, a_t^{(h)}) \right) \right],$$
where $a_t^*$ is the optimal action at time $t$ and $a_t^{(h)}$ is the action chosen by $\mathcal{A}$ at level $h$. Let $P_i$ be the probability distribution of observations under instance $i$. By Pinsker's inequality, we have
$$\| P_i - P_j \|_{TV} \le \sqrt{\tfrac{1}{2} \mathrm{KL}(P_i, P_j)},$$
where $\mathrm{KL}$ is the Kullback–Leibler divergence. For Gaussian rewards with variance $\sigma^2$,
$$\mathrm{KL}(P_i, P_j) \le \frac{T}{2 \sigma^2} \sum_{h=1}^{H} \left\| \theta_h^{(i)} - \theta_h^{(j)} \right\|^2 \le \frac{H T d}{2 \sigma^2} \cdot \frac{c^2 d}{T} = \frac{c^2 H d^2}{2 \sigma^2}.$$
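For intuition, the KL divergence between two Gaussian reward models that share the variance $\sigma^2$ is $(x_t^\top(\theta^{(i)} - \theta^{(j)}))^2 / (2\sigma^2)$ per round; the small helper below (our own naming, purely illustrative) accumulates this quantity over the observed contexts.

```python
import numpy as np

def kl_gaussian_rounds(theta_i, theta_j, contexts, sigma):
    """Sum over rounds of KL( N(x^T theta_i, sigma^2) || N(x^T theta_j, sigma^2) ),
    i.e. sum_t (x_t^T (theta_i - theta_j))^2 / (2 * sigma^2)."""
    diffs = np.asarray(contexts) @ (np.asarray(theta_i) - np.asarray(theta_j))
    return float(np.sum(diffs ** 2) / (2 * sigma ** 2))
```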
Step 3: Applying the Probabilistic Method. Let $i^*$ be chosen uniformly at random from $\{1, \ldots, N\}$. Then,
$$\max_i R_T^{(i)}(\mathcal{A}) \ge \mathbb{E}_{i^*}\left[ R_T^{(i^*)}(\mathcal{A}) \right] = \frac{1}{N} \sum_{i=1}^{N} R_T^{(i)}(\mathcal{A}) \ge \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{i^*}\left[ R_T^{(i)}(\mathcal{A}) \mid i^* \ne i \right] \cdot \mathbb{P}(i^* \ne i) \ge \frac{N-1}{N} \cdot \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{i^*}\left[ R_T^{(i)}(\mathcal{A}) \mid i^* \ne i \right].$$
From the construction of our hard instance, we have
$$\mathbb{E}_{i^*}\left[ R_T^{(i)}(\mathcal{A}) \mid i^* \ne i \right] \ge \frac{1}{2} T \delta = \frac{1}{2} c T \sqrt{d/T} = \frac{1}{2} c \sqrt{d T}.$$
Combining these results, we have
$$\max_i R_T^{(i)}(\mathcal{A}) \ge \frac{1}{4} c \sqrt{d T} \cdot \left( 1 - \frac{c^2 H d^2}{2 \sigma^2} \right).$$
Choosing $c = \sigma \sqrt{1/(H d)}$, we obtain
$$\max_i R_T^{(i)}(\mathcal{A}) \ge \Omega\!\left( \sqrt{d H T} \right).$$
This completes the proof. □
Implication. Theorem 7 establishes a fundamental limit on the performance of any algorithm that addresses the hierarchical constrained bandit problem by proving a minimax lower bound on the cumulative regret, $\mathbb{E}[R_T] \ge \Omega\!\left( \sqrt{d H T} \right)$, where $H$ is the number of hierarchy levels. This implies that no algorithm can achieve a regret lower than this bound in the worst-case scenario. The importance of this result lies in demonstrating that the regret bounds achieved by the HC-UCB algorithm are near-optimal, as they match the lower bound up to logarithmic factors. The dependence of the lower bound on the number of hierarchy levels $H$ highlights the inherent complexity introduced by hierarchical structures in sequential decision-making problems.

5. Numerical Results

We conducted our experiments on a toy problem with a hierarchical action space comprising $n_{\text{high}}$ high-level actions and $n_{\text{low}}$ sub-actions for each high-level choice, yielding $n_{\text{high}} \times n_{\text{low}}$ total arms. In particular, we set $n_{\text{high}} = 4$ and $n_{\text{low}} = 4$, making for 16 composite actions in total. At each round $t \in \{1, \ldots, 1000\}$, the environment draws a $d$-dimensional context $x_t \in \mathbb{R}^d$ from a standard Gaussian distribution, normalizes $x_t$ to unit length, and reveals it to the agent. Throughout our experiments, we set the dimension to $d = 10$. A given composite action (high, low) produces a linear reward of $r_t = x_t^\top \theta_{\text{reward}}^{(\text{high}, \text{low})} + \eta_t$, where $\eta_t$ is a mean-zero noise term. We sample $\theta_{\text{reward}}^{(\text{high}, \text{low})}$ from a mean-zero Gaussian with variance 1.0 independently for each of the 16 possible arms. Meanwhile, cost is assumed to depend solely on the high-level action. Formally, choosing high-level arm $h$ at time $t$ incurs an expected cost
$$c_t = x_t^\top \theta_{\text{cost}}^{(h)} + \xi_t,$$
with $\xi_t$ being a mean-zero noise term. In our experiments, $\theta_{\text{cost}}^{(h)}$ was drawn from a mean-zero Gaussian of variance 0.5; if $c_t > 0.7$, then we counted a constraint violation (a simulation sketch of this setup appears after the method list below). We evaluated the following three methods:
  • HC-UCB: A hierarchical constrained UCB approach that maintains separate confidence-bound models for high-level reward, sub-level reward, and high-level cost. It enforces a cost lower confidence bound of at most 0.7 for feasibility, then picks the sub-arm with the largest reward UCB among feasible high-level arms.
  • UCB: A standard linear UCB algorithm that treats all (high, low) combinations as distinct arms. It ignores cost constraints entirely and only tries to maximize reward.
  • Random: A purely random baseline that picks (high, low) uniformly at random each round, with no learning or constraints.
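As referenced above, the simulation environment can be written in a few lines of NumPy. The random seed and the noise scale of 0.1 are our own assumptions, since the text only states that the noise terms are mean-zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n_high, n_low, d, horizon, cost_threshold = 4, 4, 10, 1000, 0.7

# One reward parameter per composite (high, low) arm; one cost parameter per high-level arm.
theta_reward = rng.normal(0.0, 1.0, size=(n_high, n_low, d))
theta_cost = rng.normal(0.0, np.sqrt(0.5), size=(n_high, d))

def step(high, low, noise_scale=0.1):
    """Draw a unit-norm context and return (x_t, reward, cost, violated) for the chosen arm."""
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)                              # normalize the context to unit length
    reward = x @ theta_reward[high, low] + rng.normal(0.0, noise_scale)
    cost = x @ theta_cost[high] + rng.normal(0.0, noise_scale)
    return x, reward, cost, cost > cost_threshold       # violation if the realized cost exceeds 0.7

# Example: roll the environment forward with uniformly random composite actions.
violations = 0
for _ in range(horizon):
    hi, lo = rng.integers(n_high), rng.integers(n_low)
    _, r, c, violated = step(hi, lo)
    violations += violated
```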
Figure 1 presents the cumulative reward for each algorithm over the time horizon. The UCB approach achieves the highest overall reward, as it is unconstrained and consequently more aggressive in exploiting high-reward arms. The HC-UCB method attains a moderately lower reward total, reflecting the tradeoff introduced by screening out high-level actions with potentially high costs. Meanwhile, the Random policy accumulates only minimal reward, as it makes no effort to identify better arms over time. These reward outcomes are mirrored in the cumulative regret plots, in which Random exhibits steep linear growth, while both UCB-based methods keep their regret at significantly smaller (sublinear) magnitudes. Nonetheless, UCB marginally outperforms HC-UCB in regret, again owing to its ability to exploit the globally best arms even if they tend to incur higher costs. Finally, the bar chart shows the average number of constraint violations (i.e., cases in which $c_t > 0.7$). Here, HC-UCB records notably fewer high-level cost violations than either UCB or Random. The random strategy selects arms indiscriminately, often hitting high-cost arms. The unconstrained UCB maximizes reward but does not shy away from profitable yet expensive high-level arms. By contrast, HC-UCB's lower-confidence cost bound mechanism rejects high-level arms when their estimated cost is likely to exceed the threshold, thereby demonstrating more consistent adherence to the constraint. Overall, these experiments indicate that the hierarchical constrained approach strikes a balance; it sacrifices a modest portion of reward relative to the fully unconstrained method, but achieves fewer constraint violations than either alternative.
Limitations. While the HC-UCB algorithm demonstrates theoretical guarantees and practical performance, several limitations remain. A key limitation stems from our theoretical framework’s reliance on linear reward and cost functions. Real-world applications often exhibit complex nonlinear relationships that our current model cannot capture. Extending the framework to incorporate nonlinear functions, potentially through kernel methods or neural network approximations, would enhance its practical utility across diverse domains. Building on this consideration of real-world applications, our framework’s assumption of stationary reward and cost distributions represents another limitation. Many practical scenarios involve dynamic environments where these distributions evolve over time. Future work should focus on adapting the algorithm to handle nonstationary environments while maintaining its theoretical guarantees. Finally, while our framework handles constraints across multiple levels, it currently optimizes for a single reward objective. This approach may not fully capture the complexity of real-world decision-making scenarios, which often involve multiple competing objectives. Developing extensions to handle multiple objectives within the hierarchical structure would enable more complex decision-making capabilities.

6. Conclusions

In this paper, we have presented the hierarchical constrained bandits (HCB) framework, which aims to address the limitations of traditional multi-armed bandit formulations in capturing hierarchical decision-making processes with multilevel constraints. The proposed HC-UCB algorithm extends the principles of the UCB approach to the hierarchical and constrained setting, balancing exploration and exploitation while ensuring constraint satisfaction at each level. Our theoretical contributions include proving that HC-UCB achieves sublinear regret. We have established high-probability guarantees for constraint satisfaction, ensuring that the algorithm adheres to the predefined thresholds $\tau^{(h)}$ at each hierarchical level with probability at least $1 - \delta$. Furthermore, we have derived a minimax lower bound on the cumulative regret, indicating that HC-UCB is near-optimal up to logarithmic factors. The implications of our work are significant for a wide range of applications, including autonomous systems, resource allocation in cloud computing, personalized medicine, and smart grid management, all of which are areas where decision-making is complex, hierarchical, and constrained. By providing a theoretical foundation and an efficient algorithmic solution, we contribute to the advancement of sequential decision-making under uncertainty in hierarchical settings. Future research directions include exploring extensions of the HC-UCB algorithm to nonlinear reward and cost functions, incorporating richer contextual information, and addressing nonstationary environments in which the underlying reward and cost functions may change over time. Additionally, empirical evaluations in real-world scenarios would further validate the practical effectiveness of the HCB framework and HC-UCB algorithm.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
$T$: Time horizon
$d$: Dimension of context space
$H$: Number of hierarchical levels
$\mathcal{A}_h$: Action set at level $h$
$x_t$: Context vector at time $t$
$a_t^{(h)}$: Action chosen at level $h$ at time $t$
$r_t$: Reward received at time $t$
$c_t^{(h)}$: Cost incurred at level $h$ at time $t$
$\tau^{(h)}$: Cost threshold at level $h$
$\theta_r$: Unknown reward parameter vector
$\theta_c^{(h)}$: Unknown cost parameter vector at level $h$
$\hat{\theta}_{r,t}$: Estimated reward parameter vector at time $t$
$\hat{\theta}_{c,t}^{(h)}$: Estimated cost parameter vector at level $h$ at time $t$
$\beta_t(\delta)$: Confidence radius parameter
$\lambda$: Regularization parameter
$V_t$: Regularized covariance matrix at time $t$
$R_T$: Cumulative regret up to time $T$
$R_H$: High-level exploration regret
$R_L$: Low-level exploration regret
$\delta$: Confidence parameter

References

1. Villar, S.S.; Bowden, J.; Wason, J. Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Stat. Sci. Rev. J. Inst. Math. Stat. 2015, 30, 199.
2. Avadhanula, V.; Colini Baldeschi, R.; Leonardi, S.; Sankararaman, K.A.; Schrijvers, O. Stochastic bandits for multi-platform budget optimization in online advertising. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 12–23 April 2021; pp. 2805–2817.
3. Maghsudi, S.; Hossain, E. Multi-armed bandits with application to 5G small cells. IEEE Wirel. Commun. 2016, 23, 64–73.
4. Gittins, J.; Glazebrook, K.; Weber, R. Multi-Armed Bandit Allocation Indices; John Wiley & Sons: Hoboken, NJ, USA, 2011.
5. Agarwal, A.; Bird, S.; Cozowicz, M.; Hoang, L.; Langford, J.; Lee, S.; Li, J.; Melamed, D.; Oshri, G.; Ribas, O.; et al. Making contextual decisions with low technical debt. arXiv 2016, arXiv:1606.03966.
6. Zhou, L. A survey on contextual multi-armed bandits. arXiv 2015, arXiv:1508.03326.
7. Beygelzimer, A.; Langford, J.; Li, L.; Reyzin, L.; Schapire, R. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, 11–13 April 2011; JMLR Workshop and Conference Proceedings. pp. 19–26.
8. Baheri, A.; Alm, C.O. LLMs-augmented Contextual Bandit. arXiv 2023, arXiv:2311.02268.
9. Wu, Y.; Shariff, R.; Lattimore, T.; Szepesvári, C. Conservative bandits. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 1254–1262.
10. Amani, S.; Alizadeh, M.; Thrampoulidis, C. Linear stochastic bandits under safety constraints. arXiv 2019, arXiv:1908.05814.
11. Agrawal, S.; Devanur, N.R. Bandits with concave rewards and convex knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, Palo Alto, CA, USA, 8–12 June 2014; pp. 989–1006.
12. Hsu, C.J.; Nair, V.; Menzies, T.; Freeh, V.W. Scout: An experienced guide to find the best cloud configuration. arXiv 2018, arXiv:1803.01296.
13. Steiger, J.; Li, B.; Ji, B.; Lu, N. Constrained bandit learning with switching costs for wireless networks. In Proceedings of the IEEE INFOCOM 2023-IEEE Conference on Computer Communications, New York, NY, USA, 17–20 May 2023; pp. 1–10.
14. Hoffman, M.; Shahriari, B.; Freitas, N. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In Proceedings of the Artificial Intelligence and Statistics, Reykjavik, Iceland, 22–25 April 2014; pp. 365–374.
15. Robbins, H. Some aspects of the sequential design of experiments. Bull. Am. Math. Soc. 1952, 58, 527–535.
16. Lai, T.L.; Robbins, H. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 1985, 6, 4–22.
17. Auer, P. Finite-time analysis of the multi-armed bandit problem. In Proceedings of the International Conference on Machine Learning, Las Vegas, NV, USA, 24–27 June 2002.
18. Li, L.; Chu, W.; Langford, J.; Schapire, R.E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC, USA, 26–30 April 2010; pp. 661–670.
19. Badanidiyuru, A.; Kleinberg, R.; Slivkins, A. Bandits with knapsacks. J. ACM (JACM) 2018, 65, 1–55.
20. Agrawal, S.; Devanur, N.R.; Li, L. An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. In Proceedings of the Conference on Learning Theory, New York, NY, USA, 23–26 June 2016; pp. 4–18.
21. Kazerouni, A.; Ghavamzadeh, M.; Abbasi Yadkori, Y.; Van Roy, B. Conservative contextual linear bandits. arXiv 2017, arXiv:1611.06426.
22. Sani, A.; Lazaric, A.; Munos, R. Risk-aversion in multi-armed bandits. arXiv 2012, arXiv:1301.1936.
23. Gu, S.; Yang, L.; Du, Y.; Chen, G.; Walter, F.; Wang, J.; Yang, Y.; Knoll, A. A review of safe reinforcement learning: Methods, theory and applications. arXiv 2022, arXiv:2205.10330.
24. Yifru, L.; Baheri, A. Concurrent learning of control policy and unknown safety specifications in reinforcement learning. IEEE Open J. Control Syst. 2024, 3, 266–281.
25. Baheri, A.; Nageshrao, S.; Tseng, H.E.; Kolmanovsky, I.; Girard, A.; Filev, D. Deep reinforcement learning with enhanced safety for autonomous highway driving. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, 19 October–13 November 2019; pp. 1550–1555.
26. Baheri, A. Safe reinforcement learning with mixture density network, with application to autonomous driving. Results Control Optim. 2022, 6, 100095.
27. Garcıa, J.; Fernández, F. A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 2015, 16, 1437–1480.
28. Moradipari, A.; Amani, S.; Alizadeh, M.; Thrampoulidis, C. Safe linear Thompson sampling with side information. IEEE Trans. Signal Process. 2021, 69, 3755–3767.
29. Pacchiano, A.; Ghavamzadeh, M.; Bartlett, P.; Jiang, H. Stochastic bandits with linear constraints. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Virtual, 13–15 April 2021; pp. 2827–2835.
30. Barto, A.G.; Mahadevan, S. Recent advances in hierarchical reinforcement learning. Discret. Event Dyn. Syst. 2003, 13, 341–379.
31. Bacon, P.L.; Harb, J.; Precup, D. The option-critic architecture. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
32. Guo, Q.; Wang, S.; Zhu, J. Regret Analysis for Hierarchical Experts Bandit Problem. arXiv 2022, arXiv:2208.05622.
33. Hong, J.; Kveton, B.; Katariya, S.; Zaheer, M.; Ghavamzadeh, M. Deep hierarchy in bandits. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 8833–8851.
34. Sui, Y.; Gotovos, A.; Burdick, J.; Krause, A. Safe exploration for optimization with Gaussian processes. In Proceedings of the International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015.
35. Abbasi-Yadkori, Y.; Pál, D.; Szepesvári, C. Improved algorithms for linear stochastic bandits. Adv. Neural Inf. Process. Syst. 2011, 24, 1–19.
Figure 1. Performance comparison of HC-UCB, Standard UCB, and Random baselines across three metrics: (Left) cumulative reward over 1000 rounds (mean ± standard error), (Center) cumulative regret, and (Right) average constraint violations (mean ± standard deviation), demonstrating that HC-UCB more consistently adheres to the cost threshold than the alternatives.