Article

Thompson Sampling for Stochastic Bandits with Noisy Contexts: An Information-Theoretic Regret Analysis

by
Sharu Theresa Jose
1,* and
Shana Moothedath
2,*
1
School of Computer Science, University of Birmingham, Birmingham B15 2TT, UK
2
Department of Electrical Engineering, Iowa State University, Ames, IA 50011, USA
*
Authors to whom correspondence should be addressed.
Entropy 2024, 26(7), 606; https://doi.org/10.3390/e26070606
Submission received: 20 May 2024 / Revised: 9 July 2024 / Accepted: 15 July 2024 / Published: 17 July 2024
(This article belongs to the Special Issue Information Theoretic Learning with Its Applications)

Abstract

We study stochastic linear contextual bandits (CB) where the agent observes a noisy version of the true context through a noise channel with unknown channel parameters. Our objective is to design an action policy that can “approximate” that of a Bayesian oracle that has access to the reward model and the noise channel parameter. We introduce a modified Thompson sampling algorithm and analyze its Bayesian cumulative regret with respect to the oracle action policy via information-theoretic tools. For Gaussian bandits with Gaussian context noise, our information-theoretic analysis shows that under certain conditions on the prior variance, the Bayesian cumulative regret scales as O ˜ ( m T ) , where m is the dimension of the feature vector and T is the time horizon. We also consider the problem setting where the agent observes the true context with some delay after receiving the reward, and show that delayed true contexts lead to lower regret. Finally, we empirically demonstrate the performance of the proposed algorithms against baselines.

1. Introduction

Decision-making in the face of uncertainty is a widespread challenge found across various domains such as control and robotics [1], clinical trials [2], communications [3], and ecology [4]. To tackle this challenge, learning algorithms have been developed to uncover effective policies for optimal decision-making. One notable framework for addressing this is contextual bandits (CBs), which capture the essence of sequential decision-making by incorporating side information, termed context [5].
In the standard CB model, an agent interacts with the environment over numerous rounds. In each round, the environment presents a context to the agent based on which the agent chooses an action and receives a reward from the environment. The reward is stochastic, drawn from a probability distribution whose mean reward (which is a function of context-action pair) is unknown to the agent. The goal of the agent is to design a policy for action selection that can maximize the cumulative mean reward accrued over a T-length horizon.
In this paper, we focus on a CB model that assumes stochastic rewards with linear mean-reward functions, also called stochastic linear contextual bandits. Stochastic linear CB models find applications in various settings including internet advertisement selection [6], where the advertisement (i.e., action) and webpage features (i.e., context) are used to construct a linear predictor of the probability that a user clicks on a given advertisement, and article recommendation on web portals [7].
While most prior research on CBs has primarily focused on models with known exact contexts [8,9,10], in many real-world applications, the contexts are noisy, e.g., imprecise measurement of patient conditions in clinical trials, weather or stock market predictions. In such scenarios, when the exact contexts are unknown, the agent must utilize the observed noisy contexts to estimate the mean reward associated with the true context. However, this results in a biased estimate that renders the application of standard CB algorithms unsuitable. Consequently, recent efforts have been made to develop CB algorithms tailored to noisy context settings.
Related Works: Ref. [11] considers a setting where there is bounded zero-mean noise in the m-dimensional feature vector (denoted by ϕ ( a , c ) , where a is the action and c is the context) rather than in the context vector, and the agent observes only noisy features. For this setting, they develop an upper confidence bound (UCB) algorithm. Ref. [12] models the uncertainty regarding the true contexts by a context distribution that is known to the agent, while the agent never observes the true context, and develops a UCB algorithm for this setting. A similar setting has also been considered in [13]. Differing from these works, ref. [14] considers the setting where the true feature vectors are sampled from an unknown feature distribution at each time, but the agent observes only a noisy feature vector. Assuming Gaussian feature noise with unknown mean and covariance, they develop an Optimism in the Face of Uncertainty (OFUL) algorithm. A variant of this setting has been studied in [15].
Motivation and Problem Setting: In this work, inspired by [14], we consider the following noisy CB setting. In each round, the environment samples a true context vector c t from a context distribution that is known to the agent. The agent, however, does not observe the true context but observes a noisy context c ^ t obtained as the output of a noise channel P ( c ^ t | c t , γ * ) parameterized by γ * . The agent is aware of the noise present but does not know the channel parameter γ * . Following [14], we consider Gaussian noise channels for our regret analysis.
Based on the observed noisy contexts, the agent chooses an action a t and observes a reward r t corresponding to the true context. We consider a linear bandit whose mean reward ϕ ( a t , c t ) θ * is determined by an unknown reward parameter θ * . The goal of the agent is to design an action policy that minimizes the Bayesian cumulative regret with respect to the action policy of a Bayesian oracle. The oracle has access to the reward model and the channel parameter γ * , and uses the predictive distribution of the true context given the observed noisy context to select an action.
Our setting differs from [14] in that we assume noisy contexts rather than noisy feature vectors and that the agent knows the context distribution. The noise model, incorporating noise in the feature vector, allows [14] to transform the original problem into a different CB problem that estimates a modified reward parameter. Such a transformation, however, is not straightforward in our setting with noise in contexts rather than in feature vectors, where we wish to analyze the Bayesian regret. Additionally, we propose a de-noising approach to estimate the predictive distribution of the true context from given noisy contexts, offering potential benefits for future analyses.
The assumption of known context distribution follows from [12]. This can be motivated by considering the example of an online recommendation engine that pre-processes the user account registration information or contexts (e.g., age, gender, device, location, item preferences) to group them into different clusters [16]. The engine can then infer the ‘empirical’ distribution of users within each cluster to define a context distribution over true contextual information. A noisy contextual information scenario occurs when a guest with different preferences logs into a user’s account.
Challenges and Novelty: Different from existing works that developed UCB-based algorithms, we propose a fully Bayesian Thompson Sampling (TS) algorithm that approximates the Bayesian oracle policy. The proposed algorithm differs from the standard contextual TS [10] in the following aspects. Firstly, since the true context vectors are not accessible at each round and the channel parameter γ * is unknown, the agent uses its knowledge of the context distribution and the past observed noisy contexts to infer a predictive posterior distribution of the true context from the current observed noisy context. The inferred predictive distribution is then used to choose the action. This de-noising step enables our algorithm to ‘approximate’ the oracle action policy that uses knowledge of the channel parameter γ * to implement exact de-noising. Secondly, the reward r t received by the agent corresponds to the unobserved true context c t . Hence, the agent cannot accurately evaluate the posterior distribution of θ * and sample from it as is conducted in standard contextual TS. Instead, our algorithm proposes to use a sampling distribution that ‘approximates’ the posterior.
Different from existing works that focus on frequentist regret analysis, we derive novel information-theoretic bounds on the Bayesian cumulative regret of our algorithm. For Gaussian bandits, our information-theoretic regret bounds scale as O ˜ ( m T ) (the notation O ˜ ( ) suppresses logarithmic terms in •), where m and T denote the dimension of the feature vector and the time horizon, respectively, under certain conditions on the variance of the prior on θ * . Furthermore, our Bayesian regret analysis shows that the posterior mismatch, which arises from replacing the true posterior distribution with a sampling distribution, introduces an approximation error that is captured via the Kullback–Leibler (KL) divergence between the two distributions. To the best of our knowledge, quantifying the posterior mismatch via KL divergence has not been studied before and is of independent interest.
Finally, we also extend our algorithm to a setting where the agent observes the true context after the decision is made and the reward is observed [12]. We call this setting CBs with delayed true contexts. Such scenarios arise in many applications where only a prediction of the context is available at the time of decision-making, while the true context becomes available later. For instance, in a farming-recommender system, at the time of deciding which crop to cultivate in a year, the true contextual information about the weather pattern is unavailable and only ‘noisy’ weather predictions are at hand; the true weather pattern is observed only after the decision is made. We show that our TS algorithm for this setting with delayed true contexts results in reduced Bayesian regret. Table 1 compares our regret bound with that of the state-of-the-art algorithms in the noiseless and noisy CB settings.

2. Problem Setting

In this section, we present the stochastic linear CB problem studied in this paper. Let A denote the action set with K actions and C denote the (possibly infinite) set of d-dimensional context vectors. At iteration t N , the environment randomly draws a context vector c t C according to a context distribution  P ( c ) defined over the space C of context vectors. The context distribution P ( c ) is known to the agent. The agent, however, does not observe the true context c t drawn by the environment. Instead, it observes a noisy version c ^ t of the true context, obtained as the output of a noisy, stochastic channel P ( c ^ t | c t , γ * ) with the true context c t as the input. The noise channel P ( c ^ t | c t , γ * ) is parameterized by the noise channel parameter  γ * that is unknown to the agent.
Having observed the noisy context c ^ t at iteration t, the agent chooses an action a t A according to an action policy  π t ( · | c ^ t ) . The action policy may be stochastic describing a probability distribution over the set A of actions. Corresponding to the chosen action a t , the agent receives a reward from the environment given by
r t = f ( θ * , a t , c t ) + ξ t ,
where f ( θ * , a t , c t ) = ϕ ( a t , c t ) θ * is the linear mean-reward function and ξ t is a zero-mean reward noise variable. The mean reward function f ( θ * , a t , c t ) is defined via the feature map  ϕ : A × C R m , that maps the action and true context to an m-dimensional feature vector, and via the reward parameter θ * R m that is unknown to the agent.
We call the noisy CB problem described above CBs with unobserved true context (see Setting 1) since the agent does not observe the true context c t and the selection of action is based solely on the observed noisy context. Accordingly, at the end of iteration t, the agent has accrued the history H t , r , a , c ^ = { r τ , a τ , c ^ τ } τ = 1 t of observed reward-action-noisy context tuples. The action policy π t + 1 ( · | c ^ t + 1 ) at ( t + 1 ) th iteration may depend on the history H t , r , a , c ^ .
Setting 1: CBs with unobserved true contexts
1:
for  t = 1 , , T   do
2:
    Environment samples c t P ( c ) .
3:
    Agent observes noisy context c ^ t P ( c ^ t | c t , γ * ) .
4:
    Agent chooses an action a t π t ( · | c ^ t ) .
5:
    Agent receives reward r t according to (1).
6:
end for
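For concreteness, the following Python sketch simulates one pass of Setting 1. The Gaussian context distribution and noise channel anticipate the model of Section 3.1, while the specific dimensions, the feature map phi, and the placeholder random policy are illustrative assumptions rather than part of the formal setting.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, K, T = 5, 15, 10, 100                    # context dim, feature dim, #actions, horizon
sigma = 1.0                                    # reward noise standard deviation

# Environment-side quantities; only the context distribution P(c) is known to the agent.
mu_c, Sigma_c = np.zeros(d), np.eye(d)         # context distribution P(c)
Sigma_n = 0.5 * np.eye(d)                      # noise channel covariance
gamma_star = rng.normal(size=d)                # unknown channel parameter gamma*
theta_star = rng.normal(size=m)                # unknown reward parameter theta*

def phi(a, c):
    """Illustrative feature map phi(a, c) in R^m (an assumption of this sketch)."""
    return np.concatenate([a**2, c**2, a * c])

actions = [rng.normal(size=d) for _ in range(K)]

def policy(c_hat):
    """Placeholder policy pi_t(.|c_hat); a real agent would run Algorithm 1 here."""
    return actions[rng.integers(K)]

for t in range(T):
    c_t = rng.multivariate_normal(mu_c, Sigma_c)                  # environment samples c_t ~ P(c)
    c_hat_t = rng.multivariate_normal(c_t + gamma_star, Sigma_n)  # noisy context from the channel
    a_t = policy(c_hat_t)                                         # agent acts on the noisy context only
    r_t = phi(a_t, c_t) @ theta_star + sigma * rng.normal()       # reward (1) uses the *true* context
```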
We also consider a variant of the above problem where the agent has access to a delayed observation of the true context c t as studied in [12]. We call this setting CBs with delayed true context. In this setting, at iteration t, the agent first observes a noisy context c ^ t , chooses action a t π t ( · | c ^ t ) , and receives reward r t . Later, the true context c t is observed. It is important to note that the agent has no access to the true context at the time of decision-making. Thus, at the end of iteration t, the agent has collected the history H t , r , a , c , c ^ = { r τ , a τ , c τ , c ^ τ } τ = 1 t of observed reward-action-context-noisy context tuples.
In both of the problem settings described above, the agent’s objective is to devise an action policy that minimizes the Bayesian cumulative regret with respect to a baseline action policy. We define Bayesian cumulative regret next.

Bayesian Cumulative Regret

The cumulative regret of an action policy π t ( · | c ^ t ) quantifies how different the mean reward accumulated over T iterations is from that accrued by a baseline action policy π t * ( · | c ^ t ) . In this work, we consider as baseline the action policy of an oracle that has access to the channel noise parameter γ * , the reward parameter θ * , the context distribution P ( c ) and the noise channel likelihood P ( c ^ t | c t , γ * ) . Accordingly, at each iteration t, the oracle can infer the exact predictive distribution P ( c t | c ^ t , γ * ) of the true context from the observed noisy context c ^ t via Bayes’ rule as
P ( c t | c ^ t , γ * ) = P ( c t , c ^ t | γ * ) P ( c ^ t | γ * ) .
Here, P ( c t , c ^ t | γ * ) = P ( c t ) P ( c ^ t | c t , γ * ) is the joint distribution of the true and noisy contexts given the noise channel parameter γ * , and  P ( c ^ t | γ * ) is the distribution obtained by marginalizing P ( c t , c ^ t | γ * ) over the true contexts, i.e., 
P ( c ^ t | γ * ) = E P ( c t ) [ P ( c ^ t | c t , γ * ) ] ,
where E P ( · ) [ · ] denotes expectation with respect to the distribution P ( · ) in the subscript. The oracle action policy then adopts an action
a t * = arg max a A E P ( c t | c ^ t , γ * ) [ ϕ ( a , c t ) θ * ] = arg max a A ψ ( a , c ^ t | γ * ) θ * ,
at iteration t, where ψ ( a , c ^ | γ * ) : = E P ( c | c ^ , γ * ) [ ϕ ( a , c ) ] . Note that, as in [14,18], we do not choose the stronger oracle action policy arg max a A ϕ ( a , c t ) θ * , which requires access to the true context c t , as it is generally not achievable by an agent that observes only the noisy context c ^ t and has no access to γ * .
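For a general feature map, the expectation ψ ( a , c ^ | γ * ) has no closed form, but it can be approximated by Monte Carlo using the Gaussian predictive distribution P ( c | c ^ , γ * ) = N ( A t , M 1 ) derived in Appendix B.2.1. The following Python sketch of the oracle policy (4) is illustrative; the function names and the sampling-based approximation are our own assumptions.

```python
import numpy as np

def psi_oracle(a, c_hat, gamma, mu_c, Sigma_c, Sigma_n, phi, n_samples=2000, rng=None):
    """Monte Carlo estimate of psi(a, c_hat | gamma*) = E_{P(c | c_hat, gamma*)}[phi(a, c)],
    using P(c | c_hat, gamma*) = N(A, M^{-1}) with M = Sigma_c^{-1} + Sigma_n^{-1}
    and A = M^{-1}(Sigma_c^{-1} mu_c + Sigma_n^{-1}(c_hat - gamma))  (Appendix B.2.1)."""
    rng = rng or np.random.default_rng()
    M = np.linalg.inv(Sigma_c) + np.linalg.inv(Sigma_n)
    M_inv = np.linalg.inv(M)
    A = M_inv @ (np.linalg.solve(Sigma_c, mu_c) + np.linalg.solve(Sigma_n, c_hat - gamma))
    cs = rng.multivariate_normal(A, M_inv, size=n_samples)
    return np.mean([phi(a, c) for c in cs], axis=0)

def oracle_action(actions, c_hat, gamma, theta_star, **model):
    """Oracle policy (4): play the action maximizing psi(a, c_hat | gamma*)^T theta*."""
    scores = [psi_oracle(a, c_hat, gamma, **model) @ theta_star for a in actions]
    return actions[int(np.argmax(scores))]
```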
For fixed parameters θ * and γ * , we define the cumulative regret of the action policy π t ( · | c ^ t ) as
R T ( π | θ * , γ * ) = t = 1 T E [ ϕ ( a t * , c t ) θ * ϕ ( a t , c t ) θ * | θ * , γ * ] ,
the expected difference in mean rewards of the oracle decision policy and the agent’s decision policy over T iterations. In (5), the expectation is taken with respect to the joint distribution P ( H t 1 , r , c ^ , c , a | θ * , γ * ) P ( c ^ t , c t , a t | H t 1 , r , c ^ , c , a , θ * , γ * ) , where P ( c ^ t , c t , a t | H t 1 , r , c ^ , c , a , θ * , γ * ) = P ( c ^ t , c t | γ * ) π t ( a t | c ^ t ) = P ( c ^ t | γ * ) P ( c t | c ^ t , γ * ) π t ( a t | c ^ t ) . Using this, the cumulative regret (5) can be written as
R T ( π | θ * , γ * ) = t = 1 T E P ( H t 1 , r , c ^ , c , a | θ * , γ * ) E P ( c ^ t | γ * ) π t ( a t | c ^ t ) E P ( c t | c ^ t , γ * ) ϕ ( a t * , c t ) θ * ϕ ( a t , c t ) θ * = t = 1 T E P ( H t 1 , r , c ^ , c , a | θ * , γ * ) E P ( c ^ t | γ * ) π t ( a t | c ^ t ) ψ ( a t * , c ^ t | γ * ) θ * ψ ( a t , c ^ t | γ * ) θ * : = t = 1 T E ψ ( a t * , c ^ t | γ * ) θ * ψ ( a t , c ^ t | γ * ) θ * | θ * , γ * .
Our focus in this work is on a Bayesian framework where we assume that the reward parameter θ * Θ and channel noise parameter γ * Γ are independently sampled by the environment from prior distributions P ( θ * ) , defined on the set Θ of reward parameters, and  P ( γ * ) , defined on the set Γ of channel noise parameters, respectively. The agent has knowledge of the prior distributions, the reward likelihood in (1) and the noise channel likelihood P ( c ^ t | c t , γ * ) , although it does not observe the sampled γ * and θ * . Using the above prior distributions, we define Bayesian cumulative regret of the action policy π t ( · | c ^ t ) as
R T ( π ) = E [ R T ( π | θ * , γ * ) ] ,
where the expectation is taken with respect to the priors P ( θ * ) and P ( γ * ) .
In the next sections, we present our novel TS algorithms to minimize the Bayesian cumulative regret for the two problem settings considered in this paper.

3. Modified TS for CB with Unobserved True Contexts

In this section, we consider Setting 1 where the agent only observes the noisy context c ^ t at each iteration t. Our proposed modified TS Algorithm is given in Algorithm 1.
Algorithm 1: TS with unobserved true contexts ( π TS )
1:
for  t = 1 , , T   do
2:
    The environment selects a true context c t .
3:
    Agent observes noisy context c ^ t .
4:
    Agent evaluates the predictive posterior distribution P ( c t | c ^ t , H t 1 , c ^ ) as in (8).
5:
    Agent samples θ t P ¯ ( θ * | H t 1 , r , a , c ^ ) .
6:
    Agent chooses action a t as in (11).
7:
    Agent observes reward r t as in (1).
8:
end for
The proposed algorithm implements two steps in each iteration t N . In the first step, called the de-noising step, the agent uses the current observed noisy context c ^ t and the history H t 1 , c ^ = { c ^ τ } τ = 1 t 1 of past observed noisy contexts to obtain a predictive posterior distribution  P ( c t | c ^ t , H t 1 , c ^ ) of the true context c t . This is a two-step process, where firstly the agent uses the history H t 1 , c ^ of past observed noisy contexts to compute the posterior distribution of γ * as P ( γ * | H t 1 , c ^ ) P ( γ * ) τ = 1 t 1 P ( c ^ τ | γ * ) , where the conditional distribution P ( c ^ t | γ * ) is evaluated as in (3). Note, that to evaluate the posterior, the agent uses its knowledge of the context distribution P ( c ) , the prior P ( γ * ) and the noise channel likelihood P ( c ^ t | c t , γ * ) . Using the derived posterior P ( γ * | H t 1 , c ^ ) , the predictive posterior distribution of the true context is then obtained as
P ( c t | c ^ t , H t 1 , c ^ ) = E P ( γ * | H t 1 , c ^ ) [ P ( c t | c ^ t , γ * ) ] ,
where P ( c t | c ^ t , γ * ) is defined as in (2).
The second step of the algorithm implements a modified Thompson sampling step. Note that, since the agent does not have access to the true contexts, it cannot evaluate the posterior distribution with known contexts,
P ( θ * | H t 1 , r , a , c ) P ( θ * ) τ = 1 t 1 P ( r τ | a τ , c τ , θ * ) ,
as is conducted in standard contextual TS. Instead, the agent must evaluate the true posterior distribution under noisy contexts,
P t ( θ * ) : = P ( θ * | H t 1 , r , a , c ^ ) P ( θ * ) E P ( γ * ) τ = 1 t 1 E P ( c τ ) [ P ( c ^ τ | c τ , γ * ) P ( r τ | a τ , c τ , θ * ) ] .
However, evaluating the marginal distribution E P ( γ * ) τ = 1 t 1 E P ( c τ ) [ P ( c ^ τ | c τ , γ * ) P ( r τ | a τ , c τ , θ * ) ] is challenging even for Gaussian bandits as the mean ϕ ( a τ , c τ ) θ * of the reward distribution P ( r τ | a τ , c τ , θ * ) is, in general, a non-linear function of the true context c τ . As a result, the posterior P t ( θ * ) is analytically intractable.
Consequently, at each iteration t, the agent samples θ t P ¯ ( θ * | H t 1 , r , a , c ^ ) from a distribution P ¯ ( θ * | H t 1 , r , a , c ^ ) that ‘approximates’ the true posterior P t ( θ * ) . The specific choice of this sampling distribution depends on the problem setting. Ideally, one must choose a distribution that is sufficiently ‘close’ to the true posterior. In the next sub-section, we will explain the choice for Gaussian bandits.
Using the sampled θ t and the predictive posterior distribution P ( c t | c ^ t , H t 1 , c ^ ) obtained from the denoising step, the agent then chooses action a t at iteration t as
a t = arg max a A ψ ( a , c ^ t | H c ^ ) θ t , where
ψ ( a t , c ^ t | H c ^ ) : = E P ( c t | c ^ t , H t 1 , c ^ ) [ ϕ ( a t , c t ) ]
is the expected feature map with respect to P ( c t | c ^ t , H t 1 , c ^ ) .
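A minimal sketch of this action-selection step is given below, taking the Gaussian predictive posterior N ( V t , R t 1 ) computed in the de-noising step (as in Section 3.1) and the sampled θ t as inputs; the sampling-based approximation of the expected feature map in (12) is our own illustrative choice.

```python
import numpy as np

def choose_action(actions, theta_t, V_t, R_t, phi, n_samples=2000, rng=None):
    """Action step (11)-(12): approximate psi(a, c_hat_t | H) = E[phi(a, c_t)] by averaging
    phi over samples from the predictive posterior N(V_t, R_t^{-1}), then maximize psi^T theta_t."""
    rng = rng or np.random.default_rng()
    cov = np.linalg.inv(R_t)                    # R_t is the inverse covariance matrix
    cs = rng.multivariate_normal(V_t, cov, size=n_samples)
    best_action, best_value = None, -np.inf
    for a in actions:
        psi = np.mean([phi(a, c) for c in cs], axis=0)
        value = float(psi @ theta_t)
        if value > best_value:
            best_action, best_value = a, value
    return best_action
```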

3.1. Linear-Gaussian Stochastic CBs

We now instantiate Algorithm 1 for Gaussian CBs. Specifically, we consider Gaussian bandits where the reward noise ξ t in (1) is Gaussian N ( 0 , σ 2 ) with mean 0 and variance σ 2 > 0 . We also assume a Gaussian prior P ( θ * ) = N ( 0 , λ I ) on the reward parameter θ * with zero mean and an m × m diagonal covariance matrix with diagonal entries λ > 0 . Here, I denotes the identity matrix. The assumption of a diagonal prior covariance is in line with Lemma 3 in [19].
We consider a multivariate Gaussian context distribution P ( c ) = N ( μ c , Σ c ) with mean μ c R d and covariance matrix Σ c R d × d . The context noise channel P ( c ^ | c , γ * ) is similarly Gaussian with mean ( γ * + c ) and covariance matrix Σ n R d × d . We assume the prior on the noise channel parameter γ * to be Gaussian P ( γ * ) = N ( 0 , Σ γ ) with d-dimensional zero mean vector 0 and covariance matrix Σ γ R d × d . We assume that Σ c , Σ γ and Σ n are all positive definite matrices known to the agent. The assumption of positive definite covariance matrices is to facilitate the Bayesian analysis adopted in this work. Similar assumptions were also required in the related work of [14].
For this setting, we can analytically evaluate the predictive posterior distribution P ( c t | c ^ t , H t 1 , c ^ ) = N ( c t | V t , R t 1 ) as a multi-variate Gaussian with inverse covariance matrix,
R t = M Σ n 1 ( H t 1 ) Σ n 1 ,
where H t = ( t 1 ) Σ n 1 ( t 2 ) Σ n 1 M 1 Σ n 1 + Σ γ 1 and M = Σ c 1 + Σ n 1 , and with the mean vector
V t = ( R t 1 ) Σ c 1 μ c + Σ n 1 c ^ t Σ n 1 ( H t 1 ) L t ,
where
L t = Σ n 1 M 1 ( Σ c 1 μ c + Σ n 1 c ^ t ) + ( Σ n 1 Σ n 1 M 1 Σ n 1 ) τ = 1 t 1 c ^ τ ( t 1 ) Σ n 1 M 1 Σ c 1 μ c .
Derivations are presented in Appendix C.1.2.
For the modified-TS step, we sample θ t from the approximate posterior distribution
P ¯ t ( θ * ) : = P ¯ ( θ * | H t 1 , r , a , c ^ ) P ( θ * ) τ = 1 t 1 P ¯ ( r τ | a τ , c ^ τ , H τ 1 , c ^ , θ * ) ,
where
P ¯ ( r τ | a τ , c ^ τ , H τ 1 , c ^ , θ * ) = N ( ψ ( a τ , c ^ τ | H c ^ ) θ * , σ 2 )
and ψ ( a t , c ^ t | H c ^ ) is the expected feature map defined in (12). This yields the approximate posterior to be a Gaussian distribution P ¯ t ( θ * ) = N ( μ t 1 , Σ t 1 1 ) whose inverse covariance matrix and mean, respectively, evaluate as
Σ t 1 = I λ + 1 σ 2 τ = 1 t 1 ψ ( a τ , c ^ τ | H c ^ ) ψ ( a τ , c ^ τ | H c ^ )
μ t 1 = Σ t 1 1 σ 2 τ = 1 t 1 r τ ψ ( a τ , c ^ τ | H c ^ ) .
The sampling distribution P ¯ t ( θ * ) considered above is different from the true posterior distribution (10), which is analytically intractable. However, it bears resemblance to the posterior (9) when the true contexts are known, with the reward distribution P ( r τ | a τ , c τ , θ * ) replaced by P ¯ ( r τ | a τ , c ^ τ , H τ 1 , c ^ , θ * ) . In Section 3.2.2, we show that the above choice of sampling distribution is indeed ‘close’ to the true posterior.
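The update (17)-(18) is the standard Bayesian linear-regression recursion with the expected feature maps ψ ( a τ , c ^ τ | H c ^ ) in place of the unobserved features ϕ ( a τ , c τ ) . A minimal sketch (the function and variable names are ours):

```python
import numpy as np

def approximate_posterior(psis, rewards, lam, sigma2):
    """Approximate posterior (17)-(18): Sigma_{t-1} = I/lam + (1/sigma^2) sum_tau psi psi^T and
    mu_{t-1} = Sigma_{t-1}^{-1} (1/sigma^2) sum_tau r_tau psi.  Returns (mean, covariance).

    psis: array of shape (t-1, m) of expected feature maps; rewards: array of shape (t-1,)."""
    psis, rewards = np.asarray(psis), np.asarray(rewards)
    m = psis.shape[1]
    Sigma = np.eye(m) / lam + (psis.T @ psis) / sigma2        # inverse covariance Sigma_{t-1}
    mu = np.linalg.solve(Sigma, psis.T @ rewards / sigma2)    # posterior mean mu_{t-1}
    return mu, np.linalg.inv(Sigma)

def sample_theta(mu, cov, rng=None):
    """Thompson sample theta_t from N(mu_{t-1}, Sigma_{t-1}^{-1})."""
    rng = rng or np.random.default_rng()
    return rng.multivariate_normal(mu, cov)
```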

3.2. Bayesian Regret Analysis

In this section, we derive information-theoretic upper bounds on the Bayesian regret (7) of the modified TS algorithm for Gaussian CBs. To this end, we first outline the key information-theoretic tools required to derive our bound.

3.2.1. Preliminaries

To start, let P ( x ) and Q ( x ) denote two probability distributions defined over the space X of random variables x. Then, the Kullback–Leibler (KL)-divergence between the distributions P ( x ) and Q ( x ) is defined as
$D_{\mathrm{KL}}(P(x) \,\|\, Q(x)) = \mathbb{E}_{P(x)}\left[ \log \frac{P(x)}{Q(x)} \right],$
if P ( x ) is absolutely continuous with respect to Q ( x ) , and takes value otherwise. If x and y denote two random variables described by the joint probability distribution P ( x , y ) , the mutual information I ( x ; y ) between x and y is defined as I ( x ; y ) = D KL ( P ( x , y ) P ( x ) P ( y ) ) , where P ( x ) (and P ( y ) ) is the marginal distribution of x (and y). More generally, for three random variables x, y and z with joint distribution P ( x , y , z ) , the conditional mutual information I ( x ; y | z ) between x and y given z evaluates as
$I(x; y \mid z) = \mathbb{E}_{P(z)}\left[ D_{\mathrm{KL}}\big( P(x, y \mid z) \,\|\, P(x \mid z) P(y \mid z) \big) \right],$
where P ( x | z ) and P ( y | z ) are conditional distributions. We will also use the following variational representation of the KL-divergence, also termed the Donsker–Varadhan (DV) inequality,
$D_{\mathrm{KL}}(P(x) \,\|\, Q(x)) \ge \mathbb{E}_{P(x)}[f(x)] - \log \mathbb{E}_{Q(x)}[\exp(f(x))],$
which holds for any measurable function f : X → R satisfying E Q ( x ) [ exp ( f ( x ) ) ] < ∞ .
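As a quick numerical illustration (not part of the analysis), the sketch below compares the closed-form KL divergence between two univariate Gaussians with the DV lower bound obtained from an arbitrary test function f with a finite exponential moment under Q.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_p, s_p, mu_q, s_q = 1.0, 1.0, 0.0, 2.0       # P = N(1, 1), Q = N(0, 4)

# Closed-form KL divergence between the two univariate Gaussians.
kl = np.log(s_q / s_p) + (s_p**2 + (mu_p - mu_q)**2) / (2 * s_q**2) - 0.5

# Donsker-Varadhan lower bound (20) with a test function f; E_Q[exp(f)] is finite here.
f = lambda x: 0.3 * x - 0.05 * x**2
x_p = rng.normal(mu_p, s_p, size=200_000)
x_q = rng.normal(mu_q, s_q, size=200_000)
dv_bound = np.mean(f(x_p)) - np.log(np.mean(np.exp(f(x_q))))

print(f"KL(P||Q) = {kl:.4f} >= DV bound = {dv_bound:.4f}")     # the bound never exceeds the KL
```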

3.2.2. Information-Theoretic Bayesian Regret Bounds

In this section, we present information-theoretic upper bounds on the Bayesian regret of the modified TS algorithm. To this end, we first state our main assumption.
Assumption 1.
The feature map ϕ ( · , · ) R m has bounded norm, i.e.,  ϕ ( · , · ) 2 1 .
The following theorem gives our main result.
Theorem 1.
Assume that the covariance matrices satisfy Σ n Σ c 1 0 and Σ n Σ γ 1 Σ n M 0 where M = Σ n 1 + Σ c 1 . Under Assumption 1, if  λ σ 2 1 T 1 , the following upper bound on the Bayesian regret of the modified TS algorithm holds,
R T ( π TS ) U ( m , σ 2 T ) + 2 T m σ 2 + 2 T σ 2 ( log ( K ) + m ) + 4 T m σ 2 2 T π + 2 4 σ 2 m log 2 m T Tr ( ( Σ n Σ γ 1 Σ n M ) 1 ) + log ( T ) Tr ( Σ c Σ n 1 ) ,
where
U ( m , λ ) = 2 T m σ 2 min { m , 2 log ( 1 + K ) } log 1 + T λ m σ 2 .
The theorem above shows that the proposed TS algorithm achieves O ˜ ( m T ) regret when the prior P ( θ * ) is highly informative with variance parameter satisfying the constraint λ σ 2 / T .
Remark 1.
The assumption on covariance matrices in Theorem 1 directly holds for diagonal covariance matrices with positive eigenvalues.
To prove the regret bound of Theorem 1, we start by defining
a ^ t = arg max a A ψ ( a , c ^ t | H c ^ ) θ *
as the action that maximizes the mean reward ψ ( a , c ^ t | H c ^ ) θ * corresponding to reward parameter θ * . Using the above, the Bayesian cumulative regret (7) for the proposed TS algorithm π TS can be decomposed as
R T ( π TS ) = R CB T + R EE 1 T + R EE 2 T , where R CB T = t = 1 T E ψ ( a ^ t , c ^ t | H c ^ ) θ * ψ ( a t , c ^ t | H c ^ ) θ * , R EE 1 T = t = 1 T E ψ ( a t * , c ^ t | γ * ) θ * ψ ( a ^ t , c ^ t | H c ^ ) θ * , R EE 2 T = t = 1 T E ψ ( a t , c ^ t | H c ^ ) θ * ψ ( a t , c ^ t | γ * ) θ * .
In (23), the first term R CB T quantifies the Bayesian regret of our action policy (11) with respect to the action policy (22) for a CB with mean reward function ψ ( a , c ^ t | H c ^ ) θ * . The second term R EE 1 T accounts for the average difference in the cumulative mean rewards of the oracle optimal action policy (4), evaluated using the exact predictive distribution P ( c t | c ^ t , γ * ) , and our action policy (11), that uses the inferred predictive posterior distribution P ( c t | c ^ t , H t 1 , c ^ ) . In this sense, R EE 1 T captures the error in approximating the exact predictive distribution P ( c t | c ^ t , γ * ) via the inferred predictive distribution P ( c t | c ^ t , H c ^ ) . The third term R EE 2 T similarly accounts for the average approximation error.
To derive an upper bound on the Bayesian regret R T ( π TS ) , we separately upper bound each of the three terms in (23) as derived in the following lemmas. The lemma below presents an upper bound on R CB T .
Lemma 1.
Under Assumption 1, the following upper bound holds if λ σ 2 1 T 1 ,
R CB T U ( m , σ 2 T ) + 2 σ 2 t = 1 T D t + 2 σ 2 T log ( K ) + t = 1 T D t ,
U ( m , σ 2 T ) + 2 T m σ 2 + 2 T σ 2 ( log ( K ) + m ) ,
where D t = E [ D KL ( P t ( θ * ) P ¯ t ( θ * ) ) ] and U ( m , λ ) is as defined in (21).
To derive the upper bound in (24), we leverage results from [19] that study the information-theoretic Bayesian regret of standard contextual TS algorithms via the lifted information ratio. However, the results do not directly apply to our algorithm due to the posterior mismatch between the sampling distribution P ¯ t ( θ * ) and the true posterior distribution P t ( θ * ) . Consequently, our upper bound (24) consists of three terms: the first term, defined as in (21), corresponds to the upper bound on the Bayesian regret of contextual TS that assumes P ¯ t ( θ * ) as the true posterior. This can be obtained by applying the lifted information ratio-based analysis of Cor. 2 in [19]. The second and third terms account for the posterior mismatch via the expected KL-divergence E [ D KL ( P t ( θ * ) P ¯ t ( θ * ) ) ] between the true posterior P t ( θ * ) and the sampling distribution P ¯ t ( θ * ) . In particular, this expected KL divergence can be upper bounded by 2 ( t 1 ) λ m / σ 2 (see Appendix C.1.3 for the proof) under the prior P ( θ * ) = N ( 0 , λ I ) . Importantly, our result holds when this prior distribution is sufficiently concentrated, with its variance satisfying the inequality λ σ 2 / T . This ensures that the contribution of the posterior mismatch to the Bayesian regret is O ( m T ) .
The following lemma gives an upper bound on the sum R EE 1 T + R EE 2 T .
Lemma 2.
Under Assumption 1, the following upper bound holds for δ ( 0 , 1 ) ,
R EE 1 T + R EE 2 T 2 R EE 1 T 4 δ 2 T m λ 2 π + 2 4 λ T m log 2 m δ t = 1 T I ( γ * ; c t | c ^ t , H t 1 , c ^ ) .
In addition, if the covariance matrices satisfy that Σ n Σ c 1 0 and Σ n Σ γ 1 Σ n M 0 where M = Σ n 1 + Σ c 1 , then (26) can be further upper bounded as
R EE 1 T + R EE 2 T 4 δ 2 T m λ 2 π + 2 4 λ T m log 2 m δ Tr ( ( Σ n Σ γ 1 Σ n M ) 1 ) + log ( T ) Tr ( Σ c Σ n 1 ) .
Lemma 2 shows that the error in approximating P ( c t | c ^ t , γ * ) with P ( c t | c ^ t , H t 1 , c ^ ) , on average, can be quantified via the conditional mutual information I ( γ * ; c t | c ^ t , H t 1 , c ^ ) between γ * and true context c t given knowledge of observed noisy contexts up to and including iteration t.
Finally, combining Lemmas 1 and  2 with the choice of δ = 1 / T and λ σ 2 / T gives us the regret bound in Theorem 1.

3.3. Beyond Gaussian Bandits

In the previous sections, we studied Gaussian bandits and analyzed the Bayesian regret. We will now discuss the potential extension of results beyond Gaussian bandits. As in [14], we will focus on Gaussian context distribution and context noise distribution, which helps to derive the upper bound on the estimation errors in Lemma 2.
To extend the Bayesian regret analysis to non-Gaussian bandits, Lemma 1 requires bandit-specific modifications. Specifically, the derivation of the term U ( m , λ ) , which captures the standard Bayesian regret of contextual TS with P ¯ t ( θ * ) as the true posterior, and that of the posterior mismatch term via the expected KL divergence critically depend on the type of bandit and the choice of the sampling posterior. The Bayesian regret bound U ( m , λ ) is derived using the lifted information ratio-based approach of [19]. This can indeed be extended to non-Gaussian bandits like logistic bandits (see [19]) to obtain a modified U ( m , λ ) term.
However, the analysis of posterior mismatch term for non-Gaussian bandits is non-trivial and depends on the specific bandit assumed. Firstly, to characterize the posterior mismatch via the expected KL divergence, our analysis requires the chosen sampling distribution P ¯ t ( θ * ) to be sub-Gaussian. To choose the sampling distribution, one can follow the framework adopted in (15) and (16) and use an ‘appropriate’ reward distribution P ¯ ( r τ | a τ , c ^ τ , H τ 1 , c ^ , θ * ) such that ( a ) the KL divergence D KL ( P ( r τ | a τ , c τ , θ * ) P ¯ ( r τ | a τ , c ^ τ , H τ 1 , c ^ , θ * ) ) between the true reward distribution and the chosen reward distribution is small to minimize posterior mismatch, and  ( b ) the resulting sampling distribution is easy to sample from and has sub-Gaussian tails. Thus, analyzing the posterior mismatch for non-Gaussian bandits requires a case-by-case treatment. For Gaussian bandits, we control the above KL divergence by choosing a Gaussian distribution P ¯ ( r τ | a τ , c ^ τ , H τ 1 , c ^ , θ * ) with mean ψ ( a τ , c ^ τ | H c ^ ) as in (16). Finally, in Section 5, we extend Algorithm 1 to logistic bandits with the choice of sampling distribution motivated by (15) and (16) and use Langevin Monte Carlo to sample from this distribution.

4. TS for CB with Delayed True Contexts

In this section, we consider the CBs with delayed true context setting where the agent observes the true context c t after it observes the reward r t corresponding to the chosen action a t . Note that, at the time of choosing action a t , the agent has access only to the noisy context. We specialize our TS algorithm to this setting and call it Algorithm 2 (or π delay TS ).
Algorithm 2: TS for Delayed True contexts ( π delay TS )
1:
for  t = 1 , , T   do
2:
    The environment selects a true context c t .
3:
    Agent observes noisy context c ^ t .
4:
    Agent evaluates the predictive posterior distribution P ( c t | c ^ t , H t 1 , c , c ^ ) as in (28).
5:
    Agent samples θ t P ( θ * | H t 1 , r , a , c ) .
6:
    Agent chooses action a t as in (33).
7:
    Agent observes reward r t (as in (1)) and the true context c t .
8:
end for
Algorithm 2 follows steps similar to those of Algorithm 1. However, unlike Algorithm 1, at the tth iteration the agent knows the history H t 1 , c , c ^ of true contexts in addition to that of noisy contexts. Consequently, in the de-noising step, the agent evaluates the predictive posterior distribution as
P ( c t | c ^ t , H t 1 , c , c ^ ) = E P ( γ * | H t 1 , c , c ^ ) [ P ( c t | c ^ t , γ * ) ] ,
where P ( c t | c ^ t , γ * ) is as defined in (2) and the posterior distribution P ( γ * | H t 1 , c , c ^ ) is obtained via Bayes’ rule as P ( γ * | H t 1 , c , c ^ ) ∝ P ( γ * ) τ = 1 t 1 P ( c τ , c ^ τ | γ * ) using the history of true and noisy contexts.
For the Gaussian context noise considered in Section 3.1, the predictive posterior distribution P ( c t | c ^ t , H t 1 , c , c ^ ) = N ( V ˜ t , R ˜ t 1 ) is multivariate Gaussian with inverse covariance matrix,
R t ˜ = M Σ n 1 H ˜ t 1 Σ n 1 ,
and the mean vector
V t ˜ = R ˜ t 1 Σ c 1 μ c + Σ n 1 c ^ t + Σ n 1 H ˜ t 1 Σ n 1 τ = 1 t 1 ( c ^ τ c τ ) Σ n 1 H ˜ t 1 Σ n 1 M 1 ( Σ c 1 μ c Σ n 1 c ^ t ) ,
where M = Σ c 1 + Σ n 1 and H ˜ t = Σ n 1 M 1 Σ n 1 + ( t 1 ) Σ n 1 + Σ γ 1 . Derivation can be found in Appendix B.2.4.
Following the denoising step, the next step in Algorithm 2 is a conventional Thompson sampling step, thanks to the access to delayed true contexts. Consequently, the agent can evaluate the posterior distribution P ( θ * | H t 1 , r , a , c ) with known contexts as in (9) and use it to sample θ t P ( θ * | H t 1 , r , a , c ) . For a Gaussian bandit with a Gaussian prior on θ * , the posterior distribution P ( θ * | H t 1 , r , a , c ) = N ( μ ˜ t 1 , Σ ˜ t 1 1 ) is a multivariate Gaussian distribution whose inverse covariance matrix and mean, respectively, evaluate as
Σ ˜ t 1 = 1 λ I + 1 σ 2 τ = 1 t 1 ϕ ( a τ , c τ ) ϕ ( a τ , c τ )
μ ˜ t 1 = Σ ˜ t 1 1 σ 2 τ = 1 t 1 r τ ϕ ( a τ , c τ ) .
Using the sampled θ t and the obtained predictive posterior distribution P ( c t | c ^ t , H t 1 , c , c ^ ) , the agent then chooses action a t as
a t = arg max a A ψ ( a , c ^ t | H c , c ^ ) θ t ,
where we use the expected feature map ψ ( a t , c ^ t | H c , c ^ ) : = E P ( c t | c ^ t , H t 1 , c , c ^ ) [ ϕ ( a t , c t ) ] .

Information-Theoretic Bayesian Regret Bounds

In this section, we derive an information-theoretic upper bound on the Bayesian regret (7) of Algorithm 2 for Gaussian CBs. The following theorem presents our main result.
Theorem 2.
Under Assumption 1 and assuming that covariance matrices satisfy Σ γ Σ n 1 0 , the following inequality holds for δ ( 0 , 1 ) when λ σ 2 ,
R T ( π delay TS ) U ( m , λ ) + 4 T δ 2 2 m λ π + 2 2 λ m T d log 1 + T Tr ( Σ γ Σ n 1 ) / d log 2 m δ ,
where U ( m , λ ) is as defined in (21).
Theorem 2 shows that Algorithm 2 achieves O ˜ ( m T ) regret with the choice of δ = 1 / T if d = O ( m ) . Furthermore, due to the absence of posterior mismatch, the upper bound above is tighter than that of Theorem 1.
We now outline the main lemmas required to prove Theorem 2. To this end, we re-use the notation
a ^ t = arg max a A ψ ( a , c ^ t | H c , c ^ ) θ *
to define the optimal action maximizing the mean reward ψ ( a , c ^ t | H c , c ^ ) θ * .
To derive the regret upper bound in Theorem 2, we first decompose the Bayesian cumulative regret (7) of Algorithm 2 ( π delay TS ), similar to (23), into the following three terms,
R T ( π delay TS ) = R d , CB T + R d , EE 1 T + R d , EE 2 T where , R d , CB T = t = 1 T E ψ ( a ^ t , c ^ t | H c , c ^ ) θ * ψ ( a t , c ^ t | H c , c ^ ) θ * , R d , EE 1 T = t = 1 T E ψ ( a t * , c ^ t | γ * ) θ * ψ ( a ^ t , c ^ t | H c , c ^ ) θ * , R d , EE 2 T = t = 1 T E ψ ( a t , c ^ t | H c , c ^ ) θ * ψ ( a t , c ^ t | γ * ) θ * .
An upper bound on R T ( π delay TS ) can be then obtained by separately bounding each of the three terms in (35).
In (35), the first term R d , CB T corresponds to the Bayesian cumulative regret of a standard contextual TS algorithm that uses ψ ( a , c ^ t | H c , c ^ ) θ * for a A as the mean reward function. Note that, due to the availability of delayed true contexts, there is no posterior mismatch in Algorithm 2. Hence, we apply Cor. 3 in [19] to yield the following upper bound on R d , CB T .
Lemma 3.
Under Assumption 1, the following upper bound on R d , CB T holds for λ σ 2 1 ,
R d , CB T U ( m , λ ) ,
where U ( m , λ ) is defined as in (21).
Lemma 3 gives a tighter bound in comparison to Lemma 1 where the posterior mismatch results in additional error terms in the regret bound.
We now upper bound the second term R d , EE 1 T of (35), which similar to the term R EE 1 T in (23), captures the error in approximating the exact predictive distribution P ( c t | c ^ t , γ * ) via the inferred predictive distribution P ( c t | c ^ t , H c , c ^ ) . The following lemma shows that this approximation error over T iterations can be quantified, on average, via the mutual information I ( γ * ; H T , c , c ^ ) between γ * and the T-length history of observed true and noisy contexts. This bound also holds for the third term R d , EE 2 T of (35) which similarly accounts for the average approximation error.
Lemma 4.
Under Assumption 1, for any δ ( 0 , 1 ) , we have the following upper bound,
R d , EE 1 T + R d , EE 2 T 2 R d , EE 1 T 4 m λ T log 2 m δ I ( γ * ; H T , c , c ^ ) + 4 T δ 2 2 m λ π .
Furthermore, if the covariance matrices satisfy that Σ γ Σ n 1 0 , we obtain that
I ( γ * ; H T , c , c ^ ) ≤ ( d / 2 ) log ( 1 + T Tr ( Σ γ Σ n 1 ) / d ) .
Combining Lemmas 3 and 4 then gives us the upper bound on R T ( π delay TS ) in Theorem 2.

5. Experiments and Final Remarks

In this section, we experimentally validate the performance of the proposed algorithms on synthetic and real-world datasets. Details of implementation can be found in Appendix D.
Synthetic Datasets: For synthetic datasets, we go beyond Gaussian bandits and evaluate our algorithms for logistic contextual bandits (see Figure 1 (Left) and (Center)). In both these settings, we consider Gaussian contexts and context noise as in Section 3.1 with parameters Σ c = I , Σ γ = σ γ 2 I , Σ n = σ n 2 I for some σ γ 2 , σ n 2 > 0 . We further consider action a A and context c C to be d = 5 dimensional vectors with a i and c i , respectively, denoting their ith component. We use ϕ ( a , c ) = [ a 1 2 , a 2 2 , a 3 2 , a 4 2 , a 5 2 , c 1 2 , c 2 2 , c 3 2 , c 4 2 , c 5 2 , a 1 c 1 , a 2 c 2 , a 3 c 3 , a 4 c 4 , a 5 c 5 ] as the m = 15 dimensional feature vector.
Gaussian Bandits: The mean reward function is given by f ( θ * , a , c ) = ϕ ( a , c ) θ * with the feature map described above. Other parameters are fixed as σ γ 2 = σ n 2 = 1.1 , σ 2 = 2 and λ = 0.01 . Plots are averaged over 100 independent trials.
Logistic Bandits: The reward r t { 0 , 1 } is Bernoulli with mean reward given by μ ( ϕ ( a , c ) θ * ) , where μ ( z ) = 1 / ( 1 + exp ( z ) ) is the sigmoid function. We consider a Gaussian prior N ( 0 , I ) over θ * . In Algorithm 1, we choose the sampling distribution
P ¯ t ( θ * ) P ( θ * ) τ = 1 t 1 Ber ( μ ( ψ ( a τ , c ^ τ | H c ^ ) θ * ) ) .
However, the posterior P ¯ t ( θ * ) is analytically intractable since the Bernoulli reward and Gaussian prior form a non-conjugate pair. Consequently, we use Langevin Monte Carlo (LMC) [20] to sample from P ¯ t ( θ * ) . We run LMC for I = 50 iterations with learning rate η t = 0.2 / t and inverse temperature β 1 = 0.001 . Plots are averaged over 10 independent trials.
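A minimal sketch of this sampler is shown below. The gradient is that of the log of the sampling distribution (39) under the N ( 0 , I ) prior; the iteration count, step size, and inverse temperature are passed as parameters and are only indicative of the experimental choices reported above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_log_post(theta, psis, rewards):
    """Gradient of log P_bar_t(theta) in (39): N(0, I) prior plus Bernoulli likelihoods
    with means sigmoid(psi^T theta).  psis: (n, m) array, rewards: (n,) array in {0, 1}."""
    return -theta + psis.T @ (rewards - sigmoid(psis @ theta))

def lmc_sample(psis, rewards, m, n_iter=50, step=0.02, inv_temp=1e-3, rng=None):
    """Langevin Monte Carlo: theta <- theta + step * grad + sqrt(2 * step * inv_temp) * noise."""
    rng = rng or np.random.default_rng()
    theta = rng.normal(size=m)                   # initialize from the N(0, I) prior
    for _ in range(n_iter):
        theta = (theta + step * grad_log_post(theta, psis, rewards)
                 + np.sqrt(2.0 * step * inv_temp) * rng.normal(size=m))
    return theta
```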
MovieLens Dataset: We use the MovieLens-100K dataset [21] to evaluate the performances. To utilise this dataset, we first perform non-negative matrix factorization on the rating matrix R = [ r c , a ] R 943 × 1682 with 3 latent factors to obtain R = W H , where W R 943 × 3 and H R 3 × 1682 . Each row vector W c corresponds to an user context, while each column vector H a corresponds to movie (action) features. The mean and variance of the Gaussian context distribution is estimated from the row vectors of W. We then add Gaussian noise to context as in the synthetic settings with σ n 2 = 0.1 .
We apply the K-means algorithm to the column vectors of H to group the actions into K = 20 clusters. We use m k R 3 to denote the centroid and v k to denote the variance of the kth cluster. We then fix the mean and variance of the Gaussian prior over θ * as μ θ = ( m 1 , … , m K ) and Σ θ = diag ( v 1 I 3 , … , v K I 3 ) , respectively, with I 3 denoting the 3 × 3 identity matrix. The feature vector ϕ ( a , c ) is then fixed as a 60-dimensional vector with vector W c at the index of the cluster k to which action a belongs and zeros everywhere else. We further add mean-zero Gaussian noise to the mean reward ϕ ( a , c ) θ * with variance σ 2 = 0.01 . The Bayesian oracle in this experiment has access to the exact context noise parameter γ * sampled from the Gaussian prior with variance Σ γ = σ γ 2 I , as well as the true θ * sampled from the Gaussian prior P ( θ * ) .
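The preprocessing pipeline can be sketched with scikit-learn as follows. The NMF initialization, the treatment of unrated entries, the per-cluster variance computation, and the placeholder rating matrix are our own assumptions; the placeholder must be replaced by the actual MovieLens-100K rating matrix.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

# Placeholder stand-in for the 943 x 1682 MovieLens-100K rating matrix
# (in the real pipeline, unrated entries are filled before factorization).
R = np.random.default_rng(0).random((943, 1682))

# Rank-3 non-negative matrix factorization R ~ W H.
nmf = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(R)                  # rows of W: user (context) vectors in R^3
H = nmf.components_                       # columns of H: movie (action) feature vectors in R^3

# Gaussian context distribution estimated from the user vectors.
mu_c = W.mean(axis=0)
Sigma_c = np.cov(W, rowvar=False)

# Cluster the movie feature vectors into K = 20 action clusters.
K = 20
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(H.T)
centroids = kmeans.cluster_centers_                                        # m_k in R^3
variances = np.array([H.T[kmeans.labels_ == k].var() for k in range(K)])   # v_k

# Prior over theta*: mean stacks the centroids; covariance is block-diagonal v_k I_3.
mu_theta = centroids.reshape(-1)                       # 60-dimensional mean
Sigma_theta = np.diag(np.repeat(variances, 3))         # 60 x 60 covariance
```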
Baselines: We compare our algorithms with two baselines: TS _ naive and TS _ oracle . In  TS _ naive , the agent observes only noisy contexts but is unaware of the presence of noise. Consequently, it naively implements conventional TS with noisy context c ^ t . This sets the benchmark for the worst-case achievable regret. The second baseline TS _ oracle assumes that the agent knows the true channel parameter γ * , a setting studied in [18], and can thus perform exact denoising via the predictive posterior distribution P ( c t | c ^ , γ * ) . This algorithm sets the benchmark for the best achievable regret.
Figure 1 (Left) corroborates our theoretical findings for Gaussian bandits. In particular, our algorithms (Algorithms 1 and 2) demonstrate sub-linear regret and achieve robust performance comparable to the best achievable performance of TS _ oracle . We remark that while our regret analysis of Gaussian bandits is motivated by the tractability of posterior distributions and the concentration properties of Gaussians, our empirical results for logistic bandits in Figure 1 (Center) show a promising extension of our algorithms to non-conjugate distributions. Extension of the Bayesian regret analysis to such general distributions is left for future work. Further, our experiments on MovieLens data in Figure 1 (Right) validate the effectiveness of our algorithm in comparison to the benchmarks. The plot shows that our approach outperforms TS _ naive and achieves regret comparable to that of TS _ oracle , which is the best achievable regret.

6. Conclusions

We studied a stochastic CB problem where the agent observes noisy contexts through a noise channel with an unknown channel parameter. For Gaussian bandits and Gaussian context noise, we introduced a TS algorithm that achieves O ˜ ( m T ) Bayesian regret. The setting of Gaussian bandits with Gaussian noise was chosen for easy tractability of the posterior distributions used in the proposed TS algorithms. We believe that the algorithm and key lemmas can be extended to settings where the likelihood and prior form a conjugate pair. Extension to general distributions is left for future work.
Finally, we conjecture that our proposed modified TS algorithm and the information-theoretic Bayesian regret analysis could be extended to noisy contexts in multi-task bandit settings. In this regard, a good starting point would be to leverage prior works that study multi-armed hierarchical bandits [22] and contextual hierarchical bandits [23] with linear-Gaussian reward models. However, the critical challenge is to evaluate the posterior mismatch which requires a case-by-case analysis.

Author Contributions

Conceptualization, S.T.J. and S.M.; Methodology, S.T.J.; Formal analysis, S.T.J. and S.M.; Writing – original draft, S.T.J. and S.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are openly available in Kaggle, https://www.kaggle.com/datasets/prajitdatta/movielens-100k-dataset, accessed on 1 July 2023.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Preliminaries

Definition A1
(Sub-Gaussian Random Variable). A random variable y is said to be s 2 -sub-Gaussian with respect to the distribution P ( y ) if the following inequality holds:
$\mathbb{E}_{P(y)}\big[ \exp\big( \lambda ( y - \mathbb{E}_{P(y)}[y] ) \big) \big] \le \exp\!\left( \frac{\lambda^2 s^2}{2} \right).$
Lemma A1
(Change of Measure Inequality). Let x R n be a random vector and g : R n R denote a real-valued function. Let P ( x ) and Q ( x ) be two probability distributions defined on the space of x. If  g ( x ) is s 2 -sub-Gaussian with respect to Q ( x ) , then the following inequality holds,
$\big| \mathbb{E}_{P(x)}[g(x)] - \mathbb{E}_{Q(x)}[g(x)] \big| \le \sqrt{ 2 s^2 D_{\mathrm{KL}}(P(x) \,\|\, Q(x)) }.$
Proof. 
The inequality (A2) follows by using the Donsker-Varadhan inequality (20) with f ( x ) = λ g ( x ) for λ R . This yields that
D KL ( P ( x ) Q ( x ) ) E P ( x ) [ λ g ( x ) ] log E Q ( x ) [ exp ( λ g ( x ) ) ] E P ( x ) [ λ g ( x ) ] E Q ( x ) [ λ g ( x ) ] λ 2 s 2 2
where the last inequality follows from the assumption of sub-Gaussianity. Rearranging, we obtain
E P ( x ) [ λ g ( x ) ] E Q ( x ) [ λ g ( x ) ] λ 2 s 2 2 + D KL ( P ( x ) Q ( x ) ) .
For λ > 0 , we obtain that
E P ( x ) [ g ( x ) ] E Q ( x ) [ g ( x ) ] λ s 2 2 + D KL ( P ( x ) Q ( x ) ) λ ,
and optimizing over λ > 0 then yields that
E P ( x ) [ g ( x ) ] E Q ( x ) [ g ( x ) ] 2 s 2 D KL ( P ( x ) Q ( x ) ) .
Similarly, for  λ < 0 , we obtain that
E Q ( x ) [ g ( x ) ] E P ( x ) [ g ( x ) ] 2 s 2 D KL ( P ( x ) Q ( x ) ) .
Lemma A2.
Let x R n be distributed according to Q ( x ) = i = 1 n N ( x i | μ i , σ i 2 ) , i.e., each element of the random vector is independently distributed according to a Gaussian distribution with mean μ i and variance σ i 2 . Let g ( x ) = max i = 1 , n x i denote the maximum of n Gaussian random variables. Then, the following inequality holds for λ 0 ,
$\log \mathbb{E}_{Q(x)}[\exp(\lambda g(x))] \le \log n + \lambda \max_i \mu_i + \frac{\lambda^2 \max_i \sigma_i^2}{2}.$
For any distribution P ( x ) that is absolutely continuous with respect to Q ( x ) , we then have the following change of measure inequality,
$\mathbb{E}_{P(x)}[g(x)] - \mathbb{E}_{Q(x)}[g(x)] \le \sqrt{ 2 \big( \log n + D_{\mathrm{KL}}(P(x) \,\|\, Q(x)) \big) \max_i \sigma_i^2 }.$
Proof. 
The proof of inequality (A8) follows from standard analysis (see [14]). We present it here for the sake of completeness. The following sequence of relations hold for any λ 0 ,
E Q ( x ) [ exp ( λ g ( x ) ) ] = E Q ( x ) [ max i exp ( λ x i ) ] i = 1 n E Q ( x i ) [ exp ( λ x i ) ] = i = 1 n exp λ μ i + λ 2 σ i 2 / 2 n exp ( λ max i μ i + λ 2 max i σ i 2 / 2 ) .
Taking logarithm on both sides of the inequality yields the upper bound in (A8). We now apply the DV inequality (20) as in (A3). This yields that
D KL ( P ( x ) Q ( x ) ) E P ( x ) [ λ g ( x ) ] log E Q ( x ) [ exp ( λ g ( x ) ) ] ( a ) E P ( x ) [ λ g ( x ) ] log n λ max i μ i λ 2 max i σ i 2 2 ( b ) E P ( x ) [ λ g ( x ) ] E Q ( x ) [ λ g ( x ) ] log n λ 2 max i σ i 2 2 ,
where the inequality in ( a ) follows from (A8). The inequality in ( b ) follows from observing that max i x i x i for all i, whereby we obtain that E Q ( x ) [ max i x i ] μ i which holds for all i. The latter inequality implies that E Q ( x ) [ max i x i ] max i μ i . Re-arranging and optimizing over λ 0 then yields the required inequality in (A9). □
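A quick Monte Carlo check of the log-moment bound (A8), included only as an illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 5, 0.7
mu = rng.normal(size=n)               # means mu_i
sig = rng.uniform(0.5, 2.0, size=n)   # standard deviations sigma_i

# Monte Carlo estimate of log E_Q[exp(lam * max_i x_i)] for independent Gaussians x_i.
x = rng.normal(mu, sig, size=(500_000, n))
lhs = np.log(np.mean(np.exp(lam * x.max(axis=1))))

# Right-hand side of (A8).
rhs = np.log(n) + lam * mu.max() + lam**2 * sig.max()**2 / 2

print(f"log-MGF estimate {lhs:.4f} <= bound {rhs:.4f}")
```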

Appendix B. Linear-Gaussian Contextual Bandits with Delayed Contexts

In this section, we provide all the details relevant to the Bayesian cumulative regret analysis of TS for delayed, linear-Gaussian contextual bandits.

Appendix B.1. TS Algorithm for Linear-Gaussian Bandits with Delayed True Contexts

The pseudocode for the TS algorithm for Gaussian bandits is given in Algorithm A1.
Algorithm A1: TS with Delayed Contexts for Gaussian Bandits ( π delay TS )
1:
Given parameters: ( Σ n , σ 2 , λ , Σ γ , μ c , Σ c ) . Initialize μ ˜ 0 = 0 R m and Σ ˜ 0 1 = ( 1 / λ ) I
2:
for  t = 1 , , T  do
3:
      The environment selects a true context c t .
4:
      Agent observes noisy context c ^ t .
5:
      Agent computes R ˜ t and V ˜ t using (29) and (30) to evaluate P ( c t | c ^ t , H t 1 , c , c ^ ) = N ( V ˜ t , R ˜ t 1 ) .
6:
      Agent samples θ t N ( μ ˜ t 1 , Σ ˜ t 1 1 ) where μ ˜ t 1 and Σ ˜ t 1 are defined as in (32) and (31).
7:
      Agent chooses action a t as in (33) using θ t and P ( c t | c ^ t , H t 1 , c , c ^ ) .
8:
      Agent observes reward r t corresponding to a t , and the true context c t .
9:
end for

Appendix B.2. Derivation of Posterior and Predictive Posterior Distributions

In this section, we provide detailed derivation of posterior predictive distribution for Gaussian bandits. To this end, we first derive the exact predictive distribution P ( c t | c ^ t , γ * ) .

Appendix B.2.1. Derivation of P ( c t | c ^ t , γ * )

We begin by noting that
P ( c t | c ^ t , γ * ) = P ( c t ) P ( c ^ t | c t , γ * ) P ( c ^ t | γ * ) P ( c t ) P ( c ^ t | c t , γ * ) = N ( μ c , Σ c ) N ( c t + γ * , Σ n ) .
Subsequently,
log ( P ( c t | c ^ t , γ * ) ) log ( P ( c t ) P ( c ^ t | c t , γ * ) ) 1 2 ( c t μ c ) Σ c 1 ( c t μ ) + ( c ^ t c t γ * ) Σ n 1 ( c ^ t c t γ * ) = 1 2 ( c t ( Σ c 1 + Σ n 1 ) c t μ c Σ c 1 + ( c ^ t γ * ) Σ n 1 c t c t Σ c 1 μ c + Σ n 1 ( c ^ t γ * ) + ( c ^ t γ * ) Σ n 1 ( c ^ t γ * ) ) = 1 2 ( c t M c t A t M c t c t M A + A t M A A t M A + ( c ^ t γ * ) Σ n 1 ( c ^ t γ * ) ) ,
where we have defined
M = Σ c 1 + Σ n 1
A t = ( M 1 ) Σ c 1 μ c + Σ n 1 ( c ^ t γ * ) .
From (A12), we obtain
log ( P ( c t | c ^ t , γ * ) ) 1 2 c t M c t A t M c t c t M A + A t M A .
This implies that
P ( c t | c ^ t , γ * ) = N ( A t , M 1 ) .

Appendix B.2.2. Derivation of P ( c ^ t | γ * )

We now derive the distribution P ( c ^ t | γ * ) which is defined in (3) as
P ( c ^ t | γ * ) = E P ( c t ) [ P ( c ^ t | c t , γ * ) ] .
Hence, P ( c ^ t | γ * ) can be obtained by marginalizing the joint distribution P ( c t ) P ( c ^ t | c t , γ * ) = P ( c t | c ^ t , γ * ) P ( c ^ t | γ * ) over c t . To this end, we use (A12) to obtain,
log ( P ( c t ) P ( c ^ t | c t , γ * ) ) = log ( P ( c t | c ^ t , γ * ) P ( c ^ t | γ * ) ) log ( P ( c t | c ^ t , γ * ) ) 1 2 A t M A + ( c ^ t γ * ) Σ n 1 ( c ^ t γ * ) ,
which implies that
log ( P ( c ^ t | γ * ) ) 1 2 A t M A + ( c ^ t γ * ) Σ n 1 ( c ^ t γ * ) 1 2 ( c ^ t Σ n 1 Σ n 1 ( M 1 ) Σ n 1 c ^ t c ^ t ( Σ n 1 ( M 1 ) ( Σ n 1 γ * + Σ c 1 μ c ) + Σ n 1 γ * ) ( μ c Σ c 1 γ * Σ n 1 ) ( M 1 ) Σ n 1 + γ * Σ n 1 c ^ ) = 1 2 c ^ t G c ^ t F G c ^ t G F c ^ t 1 2 ( c ^ t F ) G ( c ^ t F ) ,
where
G = Σ n 1 Σ n 1 ( M 1 ) Σ n 1
F = ( G 1 ) Σ n 1 ( M 1 ) ( Σ n 1 γ * + Σ c 1 μ c ) + Σ n 1 γ * = ( G 1 ) ( G γ * + Σ n 1 ( M 1 ) Σ c 1 μ c ) .
Thus,
P ( c ^ t | γ * ) = N ( F , G 1 ) .

Appendix B.2.3. Derivation of P ( γ * | H t 1 , c , c ^ )

We now derive the posterior distribution P ( γ * | H t 1 , c , c ^ ) . To this end, we use Bayes’ theorem as
P ( γ * | H t 1 , c , c ^ ) τ = 1 t 1 P ( c ^ τ | c τ , γ * ) P ( γ * ) = τ = 1 t 1 N ( c τ + γ * , Σ n ) N ( 0 , Σ γ ) .
We then have,
log P ( γ * | H t 1 , c , c ^ ) 1 2 τ = 1 t 1 ( c ^ τ c τ γ * ) Σ n 1 ( c ^ τ c τ γ * ) + γ * Σ γ 1 γ * = 1 2 τ = 1 t 1 ( c ^ τ + c τ + γ * ) Σ n 1 ( c ^ τ + c τ + γ * ) + γ * Σ γ 1 γ * 1 2 ( γ * ( ( t 1 ) Σ n 1 + Σ γ 1 ) γ * γ * Σ n 1 ( τ = 1 t 1 ( c ^ τ c τ ) ) ( τ = 1 t 1 ( c ^ τ c τ ) ) Σ n 1 γ * ) .
Consequently, we obtain that,
P ( γ * | H t 1 , c , c ^ ) = N ( γ * | Y t , W t 1 )
where
W t = ( t 1 ) Σ n 1 + Σ γ 1
Y t = ( W t 1 ) Σ n 1 τ = 1 t 1 ( c ^ τ c τ ) .
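A small numerical sketch of this posterior (the variable names are ours): with a growing number of true/noisy context pairs, the posterior mean Y t concentrates around the sampled γ * .

```python
import numpy as np

def gamma_posterior(cs, c_hats, Sigma_n, Sigma_gamma):
    """Posterior (A20)-(A22): P(gamma* | H_{t-1,c,c_hat}) = N(Y_t, W_t^{-1}) with
    W_t = (t-1) Sigma_n^{-1} + Sigma_gamma^{-1} and
    Y_t = W_t^{-1} Sigma_n^{-1} sum_tau (c_hat_tau - c_tau)."""
    cs, c_hats = np.asarray(cs), np.asarray(c_hats)
    n_pairs, d = cs.shape                                   # n_pairs plays the role of t - 1
    Sn_inv = np.linalg.inv(Sigma_n)
    W = n_pairs * Sn_inv + np.linalg.inv(Sigma_gamma)
    Y = np.linalg.solve(W, Sn_inv @ (c_hats - cs).sum(axis=0))
    return Y, np.linalg.inv(W)

# Sanity check: the posterior mean approaches the sampled gamma*.
rng = np.random.default_rng(0)
d, n = 3, 500
Sigma_n, Sigma_gamma = 0.5 * np.eye(d), np.eye(d)
gamma_star = rng.multivariate_normal(np.zeros(d), Sigma_gamma)
cs = rng.normal(size=(n, d))                                            # true contexts
c_hats = cs + gamma_star + rng.multivariate_normal(np.zeros(d), Sigma_n, size=n)
Y, W_inv = gamma_posterior(cs, c_hats, Sigma_n, Sigma_gamma)
print(np.round(gamma_star, 3), np.round(Y, 3))                          # should be close
```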

Appendix B.2.4. Derivation of Posterior Predictive Distribution P ( c t | c ^ t , H t 1 , c , c ^ )

Using results from previous subsections, we are now ready to derive the posterior predictive distribution P ( c t | c ^ t , H t 1 , c , c ^ ) . Note, that P ( c t | c ^ t , H t 1 , c , c ^ ) = E P ( γ * | H t 1 , c , c ^ ) [ P ( c t | c ^ t , γ * ) ] . We then have the following set of relations:
log P ( γ * | H t 1 , c , c ^ ) P ( c t | c ^ t , γ * ) 1 2 ( c t A t ) M ( c t A t ) + ( γ * Y t ) W t ( γ * Y t ) = 1 2 ( ( c t D E t + ( M 1 ) Σ n 1 γ * ) M ( c t D E t + ( M 1 ) Σ n 1 γ * ) + ( γ * Y t ) W t ( γ * Y t ) ) 1 2 ( γ * ( Σ n 1 ( M 1 ) Σ n 1 + W t ) γ * γ * ( Σ n 1 ( D + E t c t ) + W t Y t ) ( ( D + E t c t ) Σ n 1 + Y t W t ) γ * + ( c t D E t ) M ( c t D E t ) ) = 1 2 γ * H ˜ t γ * γ * H ˜ t J ˜ t J ˜ t H ˜ t γ * + J ˜ t H ˜ t J ˜ t J ˜ t H ˜ t J ˜ t + ( c t D E t ) M ( c t D E t ) log ( P ( γ * | H t , c , c ^ ) ) 1 2 J ˜ t H ˜ t J ˜ t + ( c t D E t ) M ( c t D E t )
where D = ( M 1 ) Σ c 1 μ c , E t = ( M 1 ) Σ n 1 c ^ t , H ˜ t = Σ n 1 ( M 1 ) Σ n 1 + W t and J ˜ t = ( H ˜ t 1 ) ( Σ n 1 ( D + E t c t ) + W t Y t ) .
Since log P ( γ * | H t 1 , c , c ^ ) P ( c t | c ^ t , γ * ) = log P ( c t | c ^ t , H t 1 , c , c ^ ) P ( γ * | H t , c , c ^ ) , we obtain
log ( P ( c t | c ^ t , H t 1 , c , c ^ ) ) 1 2 J ˜ t H ˜ t J ˜ t + ( c t D E t ) M ( c t D E t ) 1 2 ( c t M Σ n 1 ( H ˜ t 1 ) Σ n 1 c t c t ( M ( D + E t ) Σ n 1 ( H ˜ t 1 ) ( Σ n 1 ( D + E t ) + W t Y t ) ) ( M ( D + E t ) Σ n 1 ( H ˜ t 1 ) ( Σ n 1 ( D + E t ) + W t Y t ) ) c t ) .
This gives
P ( c t | c ^ t , H t 1 , c , c ^ ) = N ( V ˜ t , R ˜ t 1 ) w h e r e
R ˜ t = M Σ n 1 ( H ˜ t 1 ) Σ n 1
V ˜ t = ( R ˜ t 1 ) M ( D + E t ) Σ n 1 ( H ˜ t 1 ) ( Σ n 1 ( D + E t ) + W t Y t ) .

Appendix B.3. Proof of Lemma 3

We now present the proof of Lemma 3. To this end, we first recall that E t [ · ] = E [ · | F t ] where F t = H t 1 , r , a , c , c ^ c ^ t , and we denote P t ( θ * ) as the posterior distribution P ( θ * | H t 1 , r , a , c ) of θ * given the history of observed reward-action-context tuples. We can then equivalently write R d , CB T as
R d , CB T = t = 1 T E E t ψ ( a ^ t , c ^ t | H c , c ^ ) θ * ψ ( a t , c ^ t | H c , c ^ ) θ * = t = 1 T E [ E t f ( θ * , a ^ t , c t ) f ( θ * , a t , c t ) : = Δ t ] ,
where ψ ( a , c ^ t | H c , c ^ ) = E P ( c t | c ^ t , H t 1 , c , c ^ ) [ ϕ ( a , c t ) ] is as defined in (33) and we have used f ( θ * , a t , c t ) = ϕ ( a t , c t ) θ * to denote the mean-reward function. To obtain an upper bound on R d , CB T , we define the following lifted information ratio as in [19],
Γ t = ( E t [ Δ t ] ) 2 Λ t
where
Λ t = E t f ( θ * , a t , c t ) f ¯ ( a t , c t ) 2 ,
with f ¯ ( a t , c t ) = E t [ f ( θ * , a t , c t ) | a t , c t ] denoting the expectation of mean reward with respect to the posterior distribution P t ( θ * ) . Subsequently, we obtain the following upper bound
R d , CB T t = 1 T E Γ t Λ t E t = 1 T Γ t t = 1 T E Λ t ,
where the last inequality follows by an application of Cauchy–Schwarz inequality. An upper bound on R d , CB T then follows by obtaining an upper bound on the lifted information ratio Γ t as well as on Λ t .
We first evaluate the term $\Lambda_t$. To this end, note that $\bar{f}(a_t, c_t) = \phi(a_t, c_t)^\top\tilde{\mu}_{t-1}$, with $\tilde{\mu}_{t-1}$ defined as in (32). Using this, we obtain
$$ \Lambda_t = \mathbb{E}_t\bigl[ \bigl(\phi(a_t, c_t)^\top(\theta^* - \tilde{\mu}_{t-1})\bigr)^2 \bigr] = \mathbb{E}_t\bigl[ \phi(a_t, c_t)^\top(\theta^* - \tilde{\mu}_{t-1})(\theta^* - \tilde{\mu}_{t-1})^\top\phi(a_t, c_t) \bigr] = \mathbb{E}_t\bigl[ \phi(a_t, c_t)^\top\tilde{\Sigma}_{t-1}^{-1}\phi(a_t, c_t) \bigr] = \mathbb{E}_t\bigl[ \|\phi(a_t, c_t)\|^2_{\tilde{\Sigma}_{t-1}^{-1}} \bigr], $$
where $\tilde{\Sigma}_{t-1}$ is as in (31), and the third equality follows since, conditional on $\mathcal{F}_t$, $(a_t, c_t)$ is independent of $\theta^*$. Subsequently, we can apply the elliptical potential lemma (Lemma 19.4 in [24]) using the assumption that $\|\phi(\cdot,\cdot)\|_2 \leq 1$ and that $\sigma^2/\lambda \geq 1$. This results in
$$ \sum_{t=1}^T \|\phi(a_t, c_t)\|^2_{\tilde{\Sigma}_{t-1}^{-1}} = \sigma^2\sum_{t=1}^T \|\phi(a_t, c_t)\|^2_{(\sigma^2\tilde{\Sigma}_{t-1})^{-1}} \leq 2\sigma^2\log\frac{\det(\sigma^2\tilde{\Sigma}_T)}{\det((\sigma^2/\lambda) I)} \leq 2\sigma^2\Bigl( m\log\bigl(\sigma^2/\lambda + T/m\bigr) - m\log(\sigma^2/\lambda) \Bigr) = 2m\sigma^2\log\Bigl(1 + \frac{T\lambda}{m\sigma^2}\Bigr). $$
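The elliptical potential bound above is easy to probe numerically. The sketch below (our own illustration, not code from the paper) draws unit-norm features, updates a precision matrix of the assumed form $\tilde{\Sigma}_t = I/\lambda + \sigma^{-2}\sum_{\tau\le t}\phi_\tau\phi_\tau^\top$ (consistent with (31)), and checks the stated bound with $\sigma^2/\lambda = 1$.

```python
import numpy as np

# Numerical illustration (not from the paper's code) of the elliptical potential bound used
# above: with unit-norm features and sigma^2 / lambda >= 1, the sum of squared
# Sigma_tilde_{t-1}^{-1}-weighted feature norms stays below 2 m sigma^2 log(1 + T lambda / (m sigma^2)).
rng = np.random.default_rng(0)
m, T, lam, sigma2 = 5, 2000, 1.0, 1.0          # here sigma^2 / lambda = 1

Prec = np.eye(m) / lam                          # Sigma_tilde_0
lhs = 0.0
for t in range(T):
    phi = rng.normal(size=m)
    phi /= np.linalg.norm(phi)                  # ||phi(a_t, c_t)||_2 = 1
    lhs += phi @ np.linalg.solve(Prec, phi)     # ||phi||^2 w.r.t. Sigma_tilde_{t-1}^{-1}
    Prec += np.outer(phi, phi) / sigma2         # Sigma_tilde_t update

rhs = 2 * m * sigma2 * np.log(1 + T * lam / (m * sigma2))
print(lhs, "<=", rhs)
```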
To upper bound the lifted information ratio term $\Gamma_t$, we can use Lemma 7 in [19]. To demonstrate how to leverage results from [19], we start by showing that the inequality $\Gamma_t \leq m$ holds. To this end, we note that the lifted information ratio can be equivalently written as
$$ \Gamma_t = \frac{\bigl(\mathbb{E}_t[f(\theta^*, \hat{a}_t, c_t) - \bar{f}(a_t, c_t)]\bigr)^2}{\Lambda_t}, $$
which follows since
$$ \mathbb{E}_t[\Delta_t] = \mathbb{E}_t\bigl[ f(\theta^*, \hat{a}_t, c_t) - f(\theta^*, a_t, c_t) \bigr] = \mathbb{E}_t\bigl[ f(\theta^*, \hat{a}_t, c_t) - \bar{f}(a_t, c_t) \bigr] = \sum_a P_t(\hat{a}_t = a)\,\mathbb{E}_t\bigl[ f(\theta^*, a, c_t) \mid \hat{a}_t = a \bigr] - \sum_a P_t(a_t = a)\,\mathbb{E}_t\bigl[ \bar{f}(a, c_t) \bigr] = \sum_a P_t(\hat{a}_t = a)\Bigl( \mathbb{E}_t\bigl[ f(\theta^*, a, c_t) \mid \hat{a}_t = a \bigr] - \mathbb{E}_t\bigl[ \bar{f}(a, c_t) \bigr] \Bigr), $$
where the second equality holds since, conditioned on $\mathcal{F}_t$, $(a_t, c_t)$ and $\theta^*$ are independent. In the third equality, we denote $P_t(\hat{a}_t) = P(\hat{a}_t \mid \mathcal{F}_t)$ and $P_t(a_t) = P(a_t \mid \mathcal{F}_t)$. Using these, the last equality follows since $a_t \stackrel{d}{=} \hat{a}_t$, i.e., $P_t(a_t) = P_t(\hat{a}_t)$. Now, let us define a $K \times K$ matrix $M$ with entries given by
$$ M_{a,a'} = \sqrt{P_t(\hat{a}_t = a) P_t(a_t = a')}\,\Bigl( \mathbb{E}_t\bigl[ f(\theta^*, a', c_t) \mid \hat{a}_t = a \bigr] - \mathbb{E}_t\bigl[ \bar{f}(a', c_t) \bigr] \Bigr). $$
Using this and noting that $P_t(\hat{a}_t) = P_t(a_t)$, we obtain that $\mathbb{E}_t[\Delta_t] = \mathrm{Tr}(M)$. We now bound $\Lambda_t$ in terms of the matrix $M$. To see this, we can equivalently write $\Lambda_t$ as
$$ \Lambda_t = \sum_{a'} P_t(a_t = a')\,\mathbb{E}_t\bigl[ \bigl(f(\theta^*, a', c_t) - \bar{f}(a', c_t)\bigr)^2 \bigr] = \sum_{a,a'} P_t(a_t = a') P_t(\hat{a}_t = a)\,\mathbb{E}_t\bigl[ \bigl(f(\theta^*, a', c_t) - \bar{f}(a', c_t)\bigr)^2 \mid \hat{a}_t = a \bigr] \geq \sum_{a,a'} P_t(a_t = a') P_t(\hat{a}_t = a)\,\Bigl( \mathbb{E}_t\bigl[ f(\theta^*, a', c_t) - \bar{f}(a', c_t) \mid \hat{a}_t = a \bigr] \Bigr)^2 = \sum_{a,a'} P_t(a_t = a') P_t(\hat{a}_t = a)\,\Bigl( \mathbb{E}_t\bigl[ f(\theta^*, a', c_t) \mid \hat{a}_t = a \bigr] - \mathbb{E}_t\bigl[ \bar{f}(a', c_t) \bigr] \Bigr)^2 = \|M\|_F^2, $$
where the inequality follows by the application of Jensen's inequality. We thus obtain that
$$ \Gamma_t \leq \frac{\mathrm{Tr}(M)^2}{\|M\|_F^2} \leq m, $$
where the last inequality follows from Prop. 5 in [25]. From Lemma 3 in [19], we also obtain $\Gamma_t \leq 2\log(1+K)$. This results in the upper bound $\Gamma_t \leq \min\{m, 2\log(1+K)\}$.
Using this and the bound of (A31) in (A28), we obtain
$$ R_{d,\mathrm{CB}}^T \leq \sqrt{2Tm\sigma^2\min\{m, 2\log(1+K)\}\log\Bigl(1 + \frac{T\lambda}{m\sigma^2}\Bigr)}. $$

Appendix B.4. Proof of Lemma 4

We now prove an upper bound on the term $R_{d,\mathrm{EE}_1}^T$. To this end, let us define the following event:
$$ \mathcal{E} := \Bigl\{ \|\theta^*\|_2 \leq \underbrace{\sqrt{2\lambda m\log\tfrac{2m}{\delta}}}_{:=U} \Bigr\}. $$
Note that since $\theta^* \sim \mathcal{N}(\theta^* \mid 0, \lambda I)$, with probability at least $1-\delta$ the inequality $\|\theta^*\|_\infty \leq \sqrt{2\lambda\log\tfrac{2m}{\delta}}$ holds. Since $\|\theta^*\|_2 \leq \sqrt{m}\|\theta^*\|_\infty$, the above inequality in turn implies the event $\mathcal{E}$, so that $P(\mathcal{E}) \geq 1-\delta$.
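As a quick numerical illustration (our own sketch, not code from the paper), the Monte Carlo check below estimates the probability of $\mathcal{E}$ for a Gaussian prior $\mathcal{N}(0,\lambda I)$ and compares it against $1-\delta$; the specific values of $m$, $\lambda$ and $\delta$ are arbitrary.

```python
import numpy as np

# Monte Carlo check (not from the paper's code) that the event E holds with probability at
# least 1 - delta: for theta* ~ N(0, lambda I) in R^m, ||theta*||_2 <= sqrt(2 lambda m log(2m/delta)).
rng = np.random.default_rng(0)
m, lam, delta, n_trials = 10, 0.5, 0.05, 200_000

theta = rng.normal(scale=np.sqrt(lam), size=(n_trials, m))
U = np.sqrt(2 * lam * m * np.log(2 * m / delta))
freq = np.mean(np.linalg.norm(theta, axis=1) <= U)
print(freq, ">=", 1 - delta)                    # empirical frequency of E vs. 1 - delta
```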
We then have the following set of relations:
$$ R_{d,\mathrm{EE}_1}^T \stackrel{(a)}{\leq} \sum_{t=1}^T \mathbb{E}\bigl[ \psi(a_t^*, \hat{c}_t \mid \gamma^*)^\top\theta^* - \psi(a_t^*, \hat{c}_t \mid H^{c,\hat{c}})^\top\theta^* \bigr] = \sum_{t=1}^T \mathbb{E}\Bigl[ \underbrace{\mathbb{E}_{P(c_t \mid \hat{c}_t, \gamma^*)}\bigl[\phi(a_t^*, c_t)^\top\theta^*\bigr] - \mathbb{E}_{P(c_t \mid \hat{c}_t, H^{c,\hat{c}})}\bigl[\phi(a_t^*, c_t)^\top\theta^*\bigr]}_{:=\Delta_{P(c_t \mid \hat{c}_t, \gamma^*), P(c_t \mid \hat{c}_t, H^{c,\hat{c}})}} \Bigr] $$
$$ = \sum_{t=1}^T \mathbb{E}\bigl[ \Delta_{P(c_t \mid \hat{c}_t, \gamma^*), P(c_t \mid \hat{c}_t, H^{c,\hat{c}})}\,\mathbf{1}\{\mathcal{E}\} \bigr] + \sum_{t=1}^T \mathbb{E}\bigl[ \Delta_{P(c_t \mid \hat{c}_t, \gamma^*), P(c_t \mid \hat{c}_t, H^{c,\hat{c}})}\,\mathbf{1}\{\mathcal{E}^c\} \bigr] $$
$$ \stackrel{(b)}{\leq} \sum_{t=1}^T \mathbb{E}\bigl[ \Delta_{P(c_t \mid \hat{c}_t, \gamma^*), P(c_t \mid \hat{c}_t, H^{c,\hat{c}})}\,\mathbf{1}\{\mathcal{E}\} \bigr] + 2\delta T\,\mathbb{E}\bigl[ \|\theta^*\|_2 \mid \mathcal{E}^c \bigr], $$
where the inequality $(a)$ follows from the definition of $\hat{a}_t = \arg\max_{a \in \mathcal{A}}\psi(a, \hat{c}_t \mid H^{c,\hat{c}})^\top\theta^*$, and $\mathbf{1}\{\cdot\}$ denotes the indicator function, which takes value 1 when its argument is true and value 0 otherwise. The inequality in $(b)$ follows by noting that
$$ \mathbb{E}\bigl[ \Delta_{P(c_t \mid \hat{c}_t, \gamma^*), P(c_t \mid \hat{c}_t, H^{c,\hat{c}})}\,\mathbf{1}\{\mathcal{E}^c\} \bigr] \leq \mathbb{E}\Bigl[ \bigl\| \mathbb{E}_{P(c_t \mid \hat{c}_t, \gamma^*)}[\phi(a_t^*, c_t)] - \mathbb{E}_{P(c_t \mid \hat{c}_t, H^{c,\hat{c}})}[\phi(a_t^*, c_t)] \bigr\|_2\,\|\theta^*\|_2\,\mathbf{1}\{\mathcal{E}^c\} \Bigr] \leq 2\,\mathbb{E}\bigl[ \|\theta^*\|_2\,\mathbf{1}\{\mathcal{E}^c\} \bigr] = 2 P(\mathcal{E}^c)\,\mathbb{E}\bigl[ \|\theta^*\|_2 \mid \mathcal{E}^c \bigr] \leq 2\delta\,\mathbb{E}\bigl[ \|\theta^*\|_2 \mid \mathcal{E}^c \bigr], $$
where the last inequality is due to $P(\mathcal{E}^c) = 1 - P(\mathcal{E}) \leq \delta$. To obtain an upper bound on $\mathbb{E}[\|\theta^*\|_2 \mid \mathcal{E}^c]$, we note that the following set of inequalities holds:
$$ \mathbb{E}\bigl[ \|\theta^*\|_2 \mid \mathcal{E}^c \bigr] \stackrel{(a)}{\leq} \sqrt{m}\,\mathbb{E}\bigl[ \|\theta^*\|_\infty \mid \mathcal{E}^c \bigr] \stackrel{(b)}{=} \sqrt{m}\,\mathbb{E}\bigl[ \|\theta^*\|_\infty \mid \|\theta^*\|_\infty > u \bigr] = \sqrt{m}\sum_{i=1}^m P\bigl(\|\theta^*\|_\infty = |\theta_i^*|\bigr)\,\mathbb{E}\bigl[ \|\theta^*\|_\infty \mid \|\theta^*\|_\infty = |\theta_i^*|, \|\theta^*\|_\infty > u \bigr] \leq \sqrt{m}\sum_{i=1}^m \mathbb{E}\bigl[ |\theta_i^*| \mid |\theta_i^*| > u \bigr] \stackrel{(c)}{=} 2\sqrt{m}\sum_{i=1}^m \int_{x > u} x\, g(x)\,dx \stackrel{(d)}{=} 2\lambda\sqrt{m}\sum_{i=1}^m g(u) = 2\lambda m^{3/2} g(u) = 2\lambda m^{3/2}\frac{1}{\sqrt{2\pi\lambda}}\exp\bigl(-u^2/(2\lambda)\bigr) = \delta\sqrt{\frac{m\lambda}{2\pi}}, $$
where $(a)$ follows since $\|\theta\|_2 \leq \sqrt{m}\|\theta\|_\infty$, and $(b)$ follows since, under $\mathcal{E}^c$, $\|\theta^*\|_2 > \sqrt{2\lambda m\log\tfrac{2m}{\delta}}$ implies that $\|\theta^*\|_\infty > \sqrt{2\lambda\log\tfrac{2m}{\delta}} := u$. The equality in $(c)$ follows by noting that $|\theta_i^*|$, where $\theta_i^* \sim \mathcal{N}(0, \lambda)$, follows a folded Gaussian distribution with density $2g(\cdot)$, where $g(x) = \frac{1}{\sqrt{2\pi\lambda}}\exp(-x^2/(2\lambda))$ is the Gaussian density. The equality in $(d)$ follows by noting that $x g(x) = -\lambda g'(x)$, where $g'(x)$ is the derivative of the Gaussian density. Thus, we have the following upper bound
$$ \sum_{t=1}^T \mathbb{E}\bigl[ \Delta_{P(c_t \mid \hat{c}_t, \gamma^*), P(c_t \mid \hat{c}_t, H^{c,\hat{c}})}\,\mathbf{1}\{\mathcal{E}^c\} \bigr] \leq 2T\delta^2\sqrt{\frac{m\lambda}{2\pi}}. $$
We now obtain an upper bound on $\sum_{t=1}^T \mathbb{E}[\Delta_{P(c_t \mid \hat{c}_t, \gamma^*), P(c_t \mid \hat{c}_t, H^{c,\hat{c}})}\mathbf{1}\{\mathcal{E}\}]$. To this end, note that
$$ \mathbb{E}\bigl[ \Delta_{P(c_t \mid \hat{c}_t, \gamma^*), P(c_t \mid \hat{c}_t, H^{c,\hat{c}})}\,\mathbf{1}\{\mathcal{E}\} \bigr] = P(\mathcal{E})\,\mathbb{E}\bigl[ \Delta_{P(c_t \mid \hat{c}_t, \gamma^*), P(c_t \mid \hat{c}_t, H^{c,\hat{c}})} \mid \mathcal{E} \bigr] \leq \mathbb{E}\bigl[ \Delta_{P(c_t \mid \hat{c}_t, \gamma^*), P(c_t \mid \hat{c}_t, H^{c,\hat{c}})} \mid \mathcal{E} \bigr]. $$
Note that under the event $\mathcal{E}$, we have the following relation,
$$ |\phi(a_t^*, c_t)^\top\theta^*| \leq \|\phi(a_t^*, c_t)\|_2\,\|\theta^*\|_2 \leq U, $$
whereby $\phi(a_t^*, c_t)^\top\theta^*$ is $U^2$-sub-Gaussian.
Consequently, applying Lemma A1 gives the following upper bound
$$ \sum_{t=1}^T \mathbb{E}\bigl[ \Delta_{P(c_t \mid \hat{c}_t, \gamma^*), P(c_t \mid \hat{c}_t, H^{c,\hat{c}})}\,\mathbf{1}\{\mathcal{E}\} \bigr] \leq \sum_{t=1}^T \mathbb{E}\Bigl[ \sqrt{2U^2 D_{\mathrm{KL}}\bigl(P(c_t \mid \hat{c}_t, \gamma^*)\,\|\,P(c_t \mid \hat{c}_t, H^{c,\hat{c}})\bigr)} \Bigr] \leq \sqrt{2TU^2\sum_{t=1}^T \mathbb{E}\Bigl[ D_{\mathrm{KL}}\bigl(P(c_t \mid \hat{c}_t, \gamma^*)\,\|\,P(c_t \mid \hat{c}_t, H^{c,\hat{c}})\bigr) \Bigr]} \stackrel{(a)}{=} \sqrt{2TU^2\sum_{t=1}^T I(c_t; \gamma^* \mid \hat{c}_t, H^{c,\hat{c}})} $$
$$ \stackrel{(b)}{\leq} \sqrt{2TU^2\sum_{t=1}^T I(c_t, \hat{c}_t; \gamma^* \mid H^{c,\hat{c}})} \stackrel{(c)}{=} \sqrt{2TU^2\, I(H^{T,c,\hat{c}}; \gamma^*)}, $$
where the equality in $(a)$ follows by the definition of conditional mutual information,
$$ I(c_t; \gamma^* \mid \hat{c}_t, H^{c,\hat{c}}) := \mathbb{E}\Bigl[ D_{\mathrm{KL}}\bigl( P(c_t, \gamma^* \mid \hat{c}_t, H^{c,\hat{c}})\,\|\,P(c_t \mid \hat{c}_t, H^{c,\hat{c}}) P(\gamma^* \mid \hat{c}_t, H^{c,\hat{c}}) \bigr) \Bigr], $$
the inequality in $(b)$ follows since $I(c_t, \hat{c}_t; \gamma^* \mid H^{c,\hat{c}}) = I(\hat{c}_t; \gamma^* \mid H^{c,\hat{c}}) + I(c_t; \gamma^* \mid \hat{c}_t, H^{c,\hat{c}}) \geq I(c_t; \gamma^* \mid \hat{c}_t, H^{c,\hat{c}})$ due to the non-negativity of mutual information, and finally, the equality in $(c)$ follows from the chain rule of mutual information.
We now analyze the mutual information $I(\gamma^*; H^{T,c,\hat{c}})$, which can be written as
$$ I(\gamma^*; H^{T,c,\hat{c}}) = H(\gamma^*) - H(\gamma^* \mid H^{T,c,\hat{c}}) = \frac{1}{2}\log\bigl((2\pi e)^d\det(\Sigma_\gamma)\bigr) - \frac{1}{2}\log\bigl((2\pi e)^d\det(W^{-1})\bigr) = \frac{1}{2}\log\frac{\det(\Sigma_\gamma)}{\det(W^{-1})}, $$
where $W = (T-1)\Sigma_n^{-1} + \Sigma_\gamma^{-1} = \Sigma_\gamma^{-1}\bigl(I + (T-1)\Sigma_\gamma\Sigma_n^{-1}\bigr)$. Using this, we can equivalently write
$$ I(\gamma^*; H^{T,c,\hat{c}}) = \frac{1}{2}\log\frac{1}{\det\bigl((I + (T-1)\Sigma_\gamma\Sigma_n^{-1})^{-1}\bigr)} = \frac{1}{2}\log\det\bigl(I + (T-1)\Sigma_\gamma\Sigma_n^{-1}\bigr). $$
Under the assumption that $\Sigma_\gamma\Sigma_n^{-1} \succ 0$, we have $I + (T-1)\Sigma_\gamma\Sigma_n^{-1} \succ 0$, whereby using the determinant–trace inequality we obtain
$$ I(\gamma^*; H^{T,c,\hat{c}}) = \frac{1}{2}\log\det\bigl(I + (T-1)\Sigma_\gamma\Sigma_n^{-1}\bigr) \leq \frac{d}{2}\log\Bigl( \mathrm{Tr}\bigl(I + (T-1)\Sigma_\gamma\Sigma_n^{-1}\bigr)/d \Bigr) = \frac{d}{2}\log\Bigl( 1 + (T-1)\,\mathrm{Tr}\bigl(\Sigma_\gamma\Sigma_n^{-1}\bigr)/d \Bigr) \leq \frac{d}{2}\log\Bigl( 1 + T\,\mathrm{Tr}\bigl(\Sigma_\gamma\Sigma_n^{-1}\bigr)/d \Bigr). $$
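The closed form and the determinant–trace bound above can be sanity-checked numerically; the sketch below (our own illustration, not code from the paper) uses arbitrary positive-definite covariance choices.

```python
import numpy as np

# Numerical sanity check (not from the paper's code) of the closed form
# I(gamma*; H^{T,c,chat}) = 0.5 * logdet(I + (T-1) Sigma_gamma Sigma_n^{-1})
# and of the determinant-trace upper bound used above.
rng = np.random.default_rng(1)
d, T = 4, 50
A = rng.normal(size=(d, d))
Sigma_gamma = A @ A.T + np.eye(d)          # arbitrary positive-definite prior covariance
Sigma_n = 0.5 * np.eye(d)

B = Sigma_gamma @ np.linalg.inv(Sigma_n)
mi = 0.5 * np.linalg.slogdet(np.eye(d) + (T - 1) * B)[1]
bound = 0.5 * d * np.log(np.trace(np.eye(d) + (T - 1) * B) / d)

print(mi, "<=", bound)                      # the determinant-trace inequality always holds here
```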
Using the above bound in (A44), we obtain that
$$ \sum_{t=1}^T \mathbb{E}\bigl[ \Delta_{P(c_t \mid \hat{c}_t, \gamma^*), P(c_t \mid \hat{c}_t, H^{c,\hat{c}})}\,\mathbf{1}\{\mathcal{E}\} \bigr] \leq \sqrt{T d\, U^2\log\Bigl(1 + T\,\mathrm{Tr}\bigl(\Sigma_\gamma\Sigma_n^{-1}\bigr)/d\Bigr)}. $$
Finally, using this in (A39) gives the following upper bound
$$ R_{d,\mathrm{EE}_1}^T \leq \sqrt{2\lambda m T d\log\Bigl(1 + T\,\mathrm{Tr}\bigl(\Sigma_\gamma\Sigma_n^{-1}\bigr)/d\Bigr)\log\frac{2m}{\delta}} + 2T\delta^2\sqrt{\frac{m\lambda}{2\pi}}. $$
We finally note that the same upper bound holds for the term $R_{d,\mathrm{EE}_2}^T$.

Appendix C. Linear-Gaussian Noisy Contextual Bandits with Unobserved True Contexts

Appendix C.1. Derivation of Posterior Predictive Distribution

In this section, we derive the posterior predictive distribution $P(c_t \mid \hat{c}_t, H^{t-1,\hat{c}})$ for Gaussian bandits with Gaussian context noise. To this end, we first derive the posterior $P(\gamma^* \mid H^{t-1,\hat{c}})$.

Appendix C.1.1. Derivation of Posterior $P(\gamma^* \mid H^{t-1,\hat{c}})$

Using Bayes' theorem, we have
$$ P(\gamma^* \mid H^{t-1,\hat{c}}) \propto P(\gamma^*)\prod_{\tau=1}^{t-1} P(\hat{c}_\tau \mid \gamma^*), $$
where $P(\hat{c}_\tau \mid \gamma^*)$ is derived in (A18). Subsequently, we have that
$$ \log P(\gamma^* \mid H^{t-1,\hat{c}}) \propto -\frac{1}{2}\sum_{\tau=1}^{t-1}(\hat{c}_\tau - F)^\top G(\hat{c}_\tau - F) - \frac{1}{2}\gamma^{*\top}\Sigma_\gamma^{-1}\gamma^* \propto -\frac{1}{2}\Bigl[ \gamma^{*\top}\bigl((t-1)G + \Sigma_\gamma^{-1}\bigr)\gamma^* - \gamma^{*\top}\bigl(G\hat{c}_{1:t-1} - (t-1)\Sigma_n^{-1}M^{-1}\Sigma_c^{-1}\mu_c\bigr) - \bigl(\hat{c}_{1:t-1}^\top G - (t-1)\mu_c^\top\Sigma_c^{-1}M^{-1}\Sigma_n^{-1}\bigr)\gamma^* \Bigr], $$
where we have denoted $\hat{c}_{1:t-1} = \sum_{\tau=1}^{t-1}\hat{c}_\tau$. We then obtain
$$ P(\gamma^* \mid H^{t-1,\hat{c}}) = \mathcal{N}(\tilde{M}_t, N_t^{-1}), \quad \text{where} $$
$$ N_t = (t-1)G + \Sigma_\gamma^{-1}, \qquad \tilde{M}_t = N_t^{-1}\bigl( G\hat{c}_{1:t-1} - (t-1)\Sigma_n^{-1}M^{-1}\Sigma_c^{-1}\mu_c \bigr). $$

Appendix C.1.2. Derivation of $P(c_t \mid \hat{c}_t, H^{t-1,\hat{c}})$

The derivation of the posterior predictive distribution $P(c_t \mid \hat{c}_t, H^{t-1,\hat{c}})$ follows along similar lines to that in Appendix B.2.4. We start the derivation by noting that $P(c_t \mid \hat{c}_t, H^{t-1,\hat{c}}) = \mathbb{E}_{P(\gamma^* \mid H^{t-1,\hat{c}})}[P(c_t \mid \hat{c}_t, \gamma^*)]$.
We have
$$ \log\bigl( P(\gamma^* \mid H^{t-1,\hat{c}}) P(c_t \mid \hat{c}_t, \gamma^*) \bigr) \propto -\frac{1}{2}\Bigl[ (c_t - A_t)^\top M (c_t - A_t) + (\gamma^* - \tilde{M}_t)^\top N_t (\gamma^* - \tilde{M}_t) \Bigr] = -\frac{1}{2}\Bigl[ \bigl( M^{-1}\Sigma_n^{-1}\gamma^* + c_t - D - E_t \bigr)^\top M \bigl( M^{-1}\Sigma_n^{-1}\gamma^* + c_t - D - E_t \bigr) + (\gamma^* - \tilde{M}_t)^\top N_t (\gamma^* - \tilde{M}_t) \Bigr] = -\frac{1}{2}\Bigl[ (\gamma^* - J_t)^\top H_t (\gamma^* - J_t) - J_t^\top H_t J_t + (c_t - D - E_t)^\top M (c_t - D - E_t) + \tilde{M}_t^\top N_t\tilde{M}_t \Bigr], $$
where $A_t$ is defined in (A14), $M$ in (A13), $\tilde{M}_t$ in (A53), $N_t$ in (A52), $D = M^{-1}\Sigma_c^{-1}\mu_c$, $E_t = M^{-1}\Sigma_n^{-1}\hat{c}_t$, $H_t = \Sigma_n^{-1}M^{-1}\Sigma_n^{-1} + N_t$ and $J_t = H_t^{-1}\bigl( \Sigma_n^{-1}(D + E_t - c_t) + N_t\tilde{M}_t \bigr)$.
Subsequently, we have
$$ \log P(c_t \mid \hat{c}_t, H^{t-1,\hat{c}}) \propto -\frac{1}{2}\Bigl[ -J_t^\top H_t J_t + (c_t - D - E_t)^\top M (c_t - D - E_t) \Bigr] \propto -\frac{1}{2}\Bigl[ c_t^\top\bigl( M - \Sigma_n^{-1}H_t^{-1}\Sigma_n^{-1} \bigr)c_t - c_t^\top\bigl( M(D + E_t) - \Sigma_n^{-1}H_t^{-1}L_t \bigr) - \bigl( M(D + E_t) - \Sigma_n^{-1}H_t^{-1}L_t \bigr)^\top c_t \Bigr], $$
where $L_t = \Sigma_n^{-1}(D + E_t) + N_t\tilde{M}_t$. Thus, we have
$$ P(c_t \mid \hat{c}_t, H^{t-1,\hat{c}}) = \mathcal{N}(c_t \mid V_t, R_t^{-1}), \quad \text{where} $$
$$ R_t = M - \Sigma_n^{-1}H_t^{-1}\Sigma_n^{-1}, \qquad V_t = R_t^{-1}\bigl( M(D + E_t) - \Sigma_n^{-1}H_t^{-1}L_t \bigr). $$

Appendix C.1.3. Evaluating the KL Divergence between the True Posterior $P_t(\theta^*)$ and the Sampling Distribution $\bar{P}_t(\theta^*)$

In this subsection, we analyze the true posterior distribution $P_t(\theta^*) := P(\theta^* \mid H^{t-1,r,a,\hat{c}})$ and the approximate sampling distribution $\bar{P}_t(\theta^*) := \bar{P}(\theta^* \mid H^{t-1,r,a,\hat{c}})$, and derive the KL divergence $D_{\mathrm{KL}}(P_t(\theta^*)\,\|\,\bar{P}_t(\theta^*))$ between them. To see this, note that from Bayes' theorem, we have the following joint probability distribution
$$ P(\theta^*, H^{t-1,r,\hat{c}} \mid H^{t-1,a}) = P(\theta^*)\,\underbrace{\mathbb{E}_{P(\gamma^*)}\Bigl[ \prod_{\tau=1}^{t-1}\mathbb{E}_{P(c_\tau)}\bigl[ P(\hat{c}_\tau \mid c_\tau, \gamma^*) P(r_\tau \mid a_\tau, c_\tau, \theta^*) \bigr] \Bigr]}_{:=P(H^{t-1,r,\hat{c}} \mid H^{t-1,a},\,\theta^*)} = P(\theta^*) P(H^{t-1,r,\hat{c}} \mid H^{t-1,a}, \theta^*), $$
whereby we obtain that
$$ P_t(\theta^*) \propto P(\theta^*) P(H^{t-1,r,\hat{c}} \mid H^{t-1,a}, \theta^*). $$
In particular, for general feature maps $\phi(a, c)$, the distribution $P(H^{t-1,r,\hat{c}} \mid H^{t-1,a}, \theta^*)$ cannot be evaluated exactly, even under Gaussian assumptions, rendering the posterior $P_t(\theta^*)$ intractable in general.
In contrast to this, our approximate TS-algorithm scheme assumes the following joint probability distribution,
$$ \bar{P}(\theta^*, H^{t-1,r,\hat{c}} \mid H^{t-1,a}) = P(\theta^*)\,\underbrace{\mathbb{E}_{P(\gamma^*)}\Bigl[ \prod_{\tau=1}^{t-1}\mathbb{E}_{P(c_\tau)}\bigl[ P(\hat{c}_\tau \mid c_\tau, \gamma^*)\,\bar{P}(r_\tau \mid a_\tau, \hat{c}_\tau, H^{\tau-1,\hat{c}}, \theta^*) \bigr] \Bigr]}_{:=\bar{P}(H^{t-1,r,\hat{c}} \mid H^{t-1,a},\,\theta^*)} = P(\theta^*)\bar{P}(H^{t-1,r,\hat{c}} \mid H^{t-1,a}, \theta^*), $$
where
$$ \bar{P}(r_\tau \mid a_\tau, \hat{c}_\tau, H^{\tau-1,\hat{c}}, \theta^*) = \mathcal{N}\bigl( \psi(a_\tau, \hat{c}_\tau \mid H^{\tau-1,\hat{c}})^\top\theta^*, \sigma^2 \bigr). $$
Consequently, we have
$$ \bar{P}_t(\theta^*) \propto P(\theta^*)\bar{P}(H^{t-1,r,\hat{c}} \mid H^{t-1,a}, \theta^*) = P(\theta^*)\prod_{\tau=1}^{t-1}\bar{P}(r_\tau \mid a_\tau, \hat{c}_\tau, H^{\tau-1,\hat{c}}, \theta^*)\,\mathbb{E}_{P(\gamma^*)}\Bigl[ \prod_{\tau=1}^{t-1} P(\hat{c}_\tau \mid \gamma^*) \Bigr]. $$
As a result, we can upper bound the KL divergence $D_{\mathrm{KL}}(P_t(\theta^*)\,\|\,\bar{P}_t(\theta^*))$ as
$$ D_{\mathrm{KL}}\bigl(P_t(\theta^*)\,\|\,\bar{P}_t(\theta^*)\bigr) \leq D_{\mathrm{KL}}\bigl( P(\theta^*, H^{t-1,r,\hat{c}} \mid H^{t-1,a})\,\|\,\bar{P}(\theta^*, H^{t-1,r,\hat{c}} \mid H^{t-1,a}) \bigr) = D_{\mathrm{KL}}\bigl( P(\theta^*) P(H^{t-1,r,\hat{c}} \mid \theta^*, H^{t-1,a})\,\|\,P(\theta^*)\bar{P}(H^{t-1,r,\hat{c}} \mid H^{t-1,a}, \theta^*) \bigr) = \mathbb{E}_{P(\theta^*)}\Bigl[ D_{\mathrm{KL}}\bigl( P(H^{t-1,r,\hat{c}} \mid \theta^*, H^{t-1,a})\,\|\,\bar{P}(H^{t-1,r,\hat{c}} \mid H^{t-1,a}, \theta^*) \bigr) \Bigr] \stackrel{(a)}{\leq} \mathbb{E}_{P(\theta^*)P(\gamma^*)}\mathbb{E}_{P(H^{t-1,c})}\Bigl[ D_{\mathrm{KL}}\Bigl( \prod_{\tau=1}^{t-1} P(\hat{c}_\tau \mid c_\tau, \gamma^*) P(r_\tau \mid a_\tau, c_\tau, \theta^*)\,\Big\|\,\prod_{\tau=1}^{t-1} P(\hat{c}_\tau \mid c_\tau, \gamma^*)\bar{P}(r_\tau \mid a_\tau, \hat{c}_\tau, H^{\tau-1,\hat{c}}, \theta^*) \Bigr) \Bigr] = \mathbb{E}_{P(\theta^*)}\mathbb{E}_{P(\gamma^*)}\mathbb{E}_{P(H^{t-1,c})}\Bigl[ \sum_{\tau=1}^{t-1}\mathbb{E}_{P(H^{\tau-1,\hat{c}} \mid H^{\tau-1,c},\gamma^*)}\Bigl[ D_{\mathrm{KL}}\bigl( P(r_\tau \mid a_\tau, c_\tau, \theta^*)\,\|\,\bar{P}(r_\tau \mid a_\tau, \hat{c}_\tau, H^{\tau-1,\hat{c}}, \theta^*) \bigr) \Bigr] \Bigr] $$
$$ \stackrel{(b)}{=} \mathbb{E}_{P(\theta^*)}\mathbb{E}_{P(\gamma^*)}\mathbb{E}_{P(H^{t-1,c})}\Bigl[ \sum_{\tau=1}^{t-1}\mathbb{E}_{P(H^{\tau-1,\hat{c}} \mid H^{\tau-1,c},\gamma^*)}\Bigl[ \frac{\bigl( \phi(a_\tau, c_\tau)^\top\theta^* - \psi(\hat{c}_\tau, a_\tau \mid H^{\tau-1,\hat{c}})^\top\theta^* \bigr)^2}{2\sigma^2} \Bigr] \Bigr] = \mathbb{E}_{P(\theta^*)}\mathbb{E}_{P(\gamma^*)}\mathbb{E}_{P(H^{t-1,c})}\Bigl[ \sum_{\tau=1}^{t-1}\mathbb{E}_{P(H^{\tau-1,\hat{c}} \mid H^{\tau-1,c},\gamma^*)}\Bigl[ \frac{\bigl| \bigl( \phi(a_\tau, c_\tau) - \psi(\hat{c}_\tau, a_\tau \mid H^{\tau-1,\hat{c}}) \bigr)^\top\theta^* \bigr|^2}{2\sigma^2} \Bigr] \Bigr] \stackrel{(c)}{\leq} \mathbb{E}_{P(\theta^*)}\mathbb{E}_{P(\gamma^*)}\mathbb{E}_{P(H^{t-1,c})}\Bigl[ \sum_{\tau=1}^{t-1}\mathbb{E}_{P(H^{\tau-1,\hat{c}} \mid H^{\tau-1,c},\gamma^*)}\Bigl[ \frac{\|\phi(a_\tau, c_\tau) - \psi(\hat{c}_\tau, a_\tau \mid H^{\tau-1,\hat{c}})\|_2^2\,\|\theta^*\|_2^2}{2\sigma^2} \Bigr] \Bigr] \stackrel{(d)}{\leq} \mathbb{E}_{P(\theta^*)}\mathbb{E}_{P(\gamma^*)}\mathbb{E}_{P(H^{t-1,c})}\Bigl[ \sum_{\tau=1}^{t-1}\mathbb{E}_{P(H^{\tau-1,\hat{c}} \mid H^{\tau-1,c},\gamma^*)}\Bigl[ \frac{4\|\theta^*\|_2^2}{2\sigma^2} \Bigr] \Bigr] = \sum_{\tau=1}^{t-1}\mathbb{E}_{P(\theta^*)}\Bigl[ \frac{2\|\theta^*\|_2^2}{\sigma^2} \Bigr] \stackrel{(e)}{=} \frac{2(t-1)\lambda m}{\sigma^2}. $$
In the above series of relationships,
  • the inequality in $(a)$ follows by noting that
    $$ \bar{P}(H^{t-1,r,\hat{c}} \mid H^{t-1,a}, \theta^*) = \mathbb{E}_{P(\gamma^*)}\mathbb{E}_{P(H^{t-1,c})}\Bigl[ \prod_{\tau=1}^{t-1} P(\hat{c}_\tau \mid c_\tau, \gamma^*)\bar{P}(r_\tau \mid a_\tau, \hat{c}_\tau, \theta^*) \Bigr] $$
    and
    $$ P(H^{t-1,r,\hat{c}} \mid H^{t-1,a}, \theta^*) = \mathbb{E}_{P(\gamma^*)}\mathbb{E}_{P(H^{t-1,c})}\Bigl[ \prod_{\tau=1}^{t-1} P(\hat{c}_\tau \mid c_\tau, \gamma^*) P(r_\tau \mid a_\tau, c_\tau, \theta^*) \Bigr], $$
    and applying Jensen's inequality to the jointly convex KL divergence,
  • the equality in $(b)$ follows from evaluating the KL divergence between two Gaussian distributions with the same variance $\sigma^2$ and with means $\phi(a_\tau, c_\tau)^\top\theta^*$ and $\psi(\hat{c}_\tau, a_\tau \mid H^{\tau-1,\hat{c}})^\top\theta^*$, respectively (a numerical sanity check of this formula is given after this list),
  • the inequality in $(c)$ follows from an application of the Cauchy–Schwarz inequality,
  • the inequality in $(d)$ follows from Assumption 1,
  • the equality in $(e)$ follows from
    $$ \mathbb{E}_{P(\theta^*)}\bigl[ \|\theta^*\|_2^2 \bigr] = \sum_{j=1}^m \mathbb{E}\bigl[ (\theta_j^*)^2 \bigr] = \lambda m, $$
    since $P(\theta^*) = \mathcal{N}(0, \lambda I)$.
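The equal-variance Gaussian KL formula invoked in step (b) can be verified with a few lines of code; the sketch below (our own illustration, not code from the paper) uses arbitrary means and variance.

```python
import numpy as np

# Quick check (not from the paper's code) of the Gaussian KL formula used in step (b):
# D_KL(N(m1, s^2) || N(m2, s^2)) = (m1 - m2)^2 / (2 s^2), estimated here by Monte Carlo.
rng = np.random.default_rng(0)
m1, m2, s2 = 1.3, -0.4, 0.7
x = rng.normal(m1, np.sqrt(s2), size=1_000_000)
log_ratio = (-(x - m1) ** 2 + (x - m2) ** 2) / (2 * s2)   # log N(x|m1,s2) - log N(x|m2,s2)
print(log_ratio.mean(), "vs", (m1 - m2) ** 2 / (2 * s2))
```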

Appendix C.2. Proof of Lemma 1

For notational simplicity, throughout this section we use $\psi(a) := \psi(a, \hat{c}_t \mid H^{\hat{c}}) = \mathbb{E}_{P(c_t \mid \hat{c}_t, H^{t-1,\hat{c}})}[\phi(a, c_t)]$ to denote the expected feature map. Furthermore, we use $\mathcal{F}_t = H^{t-1,r,a,\hat{c}} \cup \{\hat{c}_t\}$.
We start by distinguishing the true and approximated posterior distributions. Recall that $P_t(\theta^*) := P(\theta^* \mid H^{t-1,r,a,\hat{c}})$ denotes the true posterior and $\bar{P}_t(\theta^*) := \bar{P}(\theta^* \mid H^{t-1,r,a,\hat{c}})$ denotes the approximated posterior. We then denote by $P_t(\hat{a}_t, \theta^*) := P(\hat{a}_t, \theta^* \mid \mathcal{F}_t)$ the distribution of $\hat{a}_t$ and $\theta^*$ conditioned on $\mathcal{F}_t$, while $\bar{P}_t(\hat{a}_t, \theta^*) := \bar{P}(\hat{a}_t, \theta^* \mid \mathcal{F}_t)$ denotes the distribution of $\hat{a}_t$ and $\theta^*$ under the sampling distribution. Furthermore, we have that $\bar{P}_t(a_t, \theta^*) = \bar{P}_t(a_t)\bar{P}_t(\theta^*) = P_t(a_t)\bar{P}_t(\theta^*)$. We start by decomposing $R_{\mathrm{CB}}^T$ into the following three differences:
$$ R_{\mathrm{CB}}^T = \sum_{t=1}^T \mathbb{E}\Bigl[ \mathbb{E}_{P_t(\hat{a}_t,\theta^*)}\bigl[ \psi(\hat{a}_t)^\top\theta^* \bigr] - \mathbb{E}_{P_t(a_t,\theta^*)}\bigl[ \psi(a_t)^\top\theta^* \bigr] \Bigr] = \sum_{t=1}^T \mathbb{E}\Bigl[ \underbrace{\mathbb{E}_{\bar{P}_t(\hat{a}_t,\theta^*)}\bigl[ \psi(\hat{a}_t)^\top\theta^* \bigr] - \mathbb{E}_{\bar{P}_t(a_t,\theta^*)}\bigl[ \psi(a_t)^\top\theta^* \bigr]}_{:=\mathrm{Term}_1} + \underbrace{\mathbb{E}_{P_t(\hat{a}_t,\theta^*)}\bigl[ \psi(\hat{a}_t)^\top\theta^* \bigr] - \mathbb{E}_{\bar{P}_t(\hat{a}_t,\theta^*)}\bigl[ \psi(\hat{a}_t)^\top\theta^* \bigr]}_{:=\mathrm{Term}_2} + \underbrace{\mathbb{E}_{\bar{P}_t(a_t,\theta^*)}\bigl[ \psi(a_t)^\top\theta^* \bigr] - \mathbb{E}_{P_t(a_t,\theta^*)}\bigl[ \psi(a_t)^\top\theta^* \bigr]}_{:=\mathrm{Term}_3} \Bigr]. $$
We will separately upper bound each of the three terms in the above decomposition.

Appendix C.2.1. Upper Bound on Term 2

To obtain an upper bound on $\mathrm{Term}_2$, note that the following equivalence holds: $\mathbb{E}_{P_t(\hat{a}_t,\theta^*)}[\psi(\hat{a}_t)^\top\theta^*] = \mathbb{E}_{P_t(\theta^*)}[\max_a\psi(a)^\top\theta^*]$. Using this, we can rewrite $\mathrm{Term}_2$ as
$$ \mathrm{Term}_2 = \mathbb{E}_{P_t(\theta^*)}\bigl[ \max_a\psi(a)^\top\theta^* \bigr] - \mathbb{E}_{\bar{P}_t(\theta^*)}\bigl[ \max_a\psi(a)^\top\theta^* \bigr]. $$
Note here that when $\theta^* \sim \bar{P}_t(\theta^*)$, for each $a \in \mathcal{A}$, the variable $z_a = \psi(a)^\top\theta^*$ follows the Gaussian distribution $\mathcal{N}(z_a \mid \psi(a)^\top\mu_{t-1}, \psi(a)^\top\Sigma_{t-1}^{-1}\psi(a))$ with mean $\psi(a)^\top\mu_{t-1}$ and variance $\psi(a)^\top\Sigma_{t-1}^{-1}\psi(a)$, where $\mu_{t-1}$ and $\Sigma_{t-1}$ are as defined in (18) and (17), respectively. Thus, $\mathbb{E}_{\bar{P}_t(\theta^*)}[\max_a z_a]$ is the expectation of a maximum of Gaussian random variables. We can then apply Lemma A2 with $P(x) = P_t(\theta^*)$, $Q(x) = \bar{P}_t(\theta^*)$, $n = |\mathcal{A}| = K$, $\mu_i = \psi(a)^\top\mu_{t-1}$ and $\sigma_i = \sqrt{\psi(a)^\top\Sigma_{t-1}^{-1}\psi(a)}$ to obtain that
$$ \mathrm{Term}_2 \leq \sqrt{2\bigl( \log K + D_{\mathrm{KL}}(P_t(\theta^*)\,\|\,\bar{P}_t(\theta^*)) \bigr)\max_a\psi(a)^\top\Sigma_{t-1}^{-1}\psi(a)}. $$
Using this, we obtain that
$$ \sum_{t=1}^T \mathbb{E}[\mathrm{Term}_2] \leq \mathbb{E}\Bigl[ \sum_{t=1}^T\sqrt{2\bigl( \log K + D_{\mathrm{KL}}(P_t(\theta^*)\,\|\,\bar{P}_t(\theta^*)) \bigr)\max_a\psi(a)^\top\Sigma_{t-1}^{-1}\psi(a)} \Bigr] \stackrel{(a)}{\leq} \sqrt{\sum_{t=1}^T\mathbb{E}\bigl[ 2\bigl( \log K + D_{\mathrm{KL}}(P_t(\theta^*)\,\|\,\bar{P}_t(\theta^*)) \bigr) \bigr]}\sqrt{\sum_{t=1}^T\mathbb{E}\Bigl[ \max_a\psi(a)^\top\Sigma_{t-1}^{-1}\psi(a) \Bigr]} \stackrel{(b)}{\leq} \sqrt{2\lambda T}\sqrt{\sum_{t=1}^T\mathbb{E}\bigl[ \log K + D_{\mathrm{KL}}(P_t(\theta^*)\,\|\,\bar{P}_t(\theta^*)) \bigr]} \stackrel{(c)}{\leq} \sqrt{2\lambda T}\sqrt{T\log K + \sum_{t=1}^T\frac{2(t-1)\lambda m}{\sigma^2}} = \sqrt{2\lambda T^2\log K + \frac{4\lambda^2 T(T^2 - T)m}{2\sigma^2}} \leq \sqrt{2\lambda T^2\log K + \frac{2\lambda^2 T^3 m}{\sigma^2}}, $$
where the inequality in $(a)$ follows from the Cauchy–Schwarz inequality, and the inequality in $(b)$ follows since
$$ \sum_{t=1}^T\mathbb{E}\Bigl[ \max_a\psi(a)^\top\Sigma_{t-1}^{-1}\psi(a) \Bigr] \leq \sum_{t=1}^T\mathbb{E}\Bigl[ \max_a\psi(a)^\top(\lambda I)\psi(a) \Bigr] \leq \lambda T, $$
which holds since $\Sigma_{t-1}^{-1} \preceq \lambda I$ and $\|\psi(a)\| \leq 1$. The inequality in $(c)$ follows from (A66).
If $\lambda \leq \sigma^2/T$, we obtain that
$$ \sum_{t=1}^T \mathbb{E}[\mathrm{Term}_2] \leq \sqrt{2T\sigma^2(\log K + m)}. $$

Appendix C.2.2. Upper Bound on Term 3

We can bound $\mathrm{Term}_3$ by observing that
$$ \mathrm{Term}_3 = \mathbb{E}_{P_t(a_t)}\bigl[ \psi(a_t) \bigr]^\top\Bigl( \mathbb{E}_{\bar{P}_t(\theta^*)}[\theta^*] - \mathbb{E}_{P_t(\theta^*)}[\theta^*] \Bigr) = \mathbb{E}_{\bar{P}_t(\theta^*)}\bigl[ \Psi_t^\top\theta^* \bigr] - \mathbb{E}_{P_t(\theta^*)}\bigl[ \Psi_t^\top\theta^* \bigr], $$
where we used $\Psi_t = \mathbb{E}_{P_t(a_t)}[\psi(a_t)]$. Note that for $\theta^* \sim \bar{P}_t(\theta^*)$, the random variable $\Psi_t^\top\theta^*$ is Gaussian with mean $\Psi_t^\top\mu_{t-1}$ and variance $\Psi_t^\top\Sigma_{t-1}^{-1}\Psi_t$. Consequently, $\Psi_t^\top\theta^*$ is also $\Psi_t^\top\Sigma_{t-1}^{-1}\Psi_t$-sub-Gaussian according to Definition A.1. By using Lemma A1, we then obtain that
$$ |\mathrm{Term}_3| \leq \sqrt{2\bigl( \Psi_t^\top\Sigma_{t-1}^{-1}\Psi_t \bigr) D_{\mathrm{KL}}\bigl( P_t(\theta^*)\,\|\,\bar{P}_t(\theta^*) \bigr)}. $$
Using the Cauchy–Schwarz inequality then yields that
$$ \mathbb{E}\Bigl[ \sum_{t=1}^T|\mathrm{Term}_3| \Bigr] \leq \sqrt{\sum_{t=1}^T\mathbb{E}\bigl[ \Psi_t^\top\Sigma_{t-1}^{-1}\Psi_t \bigr]}\sqrt{\sum_{t=1}^T\mathbb{E}\bigl[ 2 D_{\mathrm{KL}}\bigl( P_t(\theta^*)\,\|\,\bar{P}_t(\theta^*) \bigr) \bigr]} \leq \sqrt{\lambda T}\sqrt{\frac{2\lambda T^2 m}{\sigma^2}} = \sqrt{\frac{2\lambda^2 T^3 m}{\sigma^2}}, $$
where the second inequality follows from (A66). As before, if $\lambda \leq \sigma^2/T$, we then obtain that
$$ \mathbb{E}\Bigl[ \sum_{t=1}^T|\mathrm{Term}_3| \Bigr] \leq \sqrt{2Tm\sigma^2}. $$

Appendix C.2.3. Upper Bound on Term 1

Note that in $\mathrm{Term}_1$, $\bar{P}_t(\hat{a}_t) = \bar{P}_t(a_t) = P_t(a_t)$, whereby the posterior is matched. Hence, one can apply bounds from conventional contextual Thompson sampling here. For simplicity, we write $\bar{\mathbb{E}}_t[\cdot] = \mathbb{E}_{\bar{P}}[\cdot \mid \mathcal{F}_t]$ to denote the expectation with respect to $\bar{P}_t(a, \theta)$. To this end, as in the proof of Lemma 3, we start by defining an information ratio,
$$ \Gamma_t = \frac{\mathrm{Term}_1^2}{\underbrace{\bar{\mathbb{E}}_t\bigl[ \bigl( \psi(a_t)^\top\theta^* - \psi(a_t)^\top\mu_{t-1} \bigr)^2 \bigr]}_{:=\Lambda_t}}, $$
using which we obtain the upper bound on $\mathrm{Term}_1$ as
$$ \sum_{t=1}^T\mathbb{E}[\mathrm{Term}_1] \leq \mathbb{E}\Bigl[ \sum_{t=1}^T\sqrt{\Gamma_t\Lambda_t} \Bigr] \leq \sqrt{\sum_{t=1}^T\mathbb{E}[\Gamma_t]}\sqrt{\sum_{t=1}^T\mathbb{E}[\Lambda_t]} $$
by the Cauchy–Schwarz inequality.
Furthermore, we have
$$ \Lambda_t = \bar{\mathbb{E}}_t\bigl[ \bigl( \psi(a_t)^\top(\theta^* - \mu_{t-1}) \bigr)^2 \bigr] = \bar{\mathbb{E}}_t\bigl[ \psi(a_t)^\top(\theta^* - \mu_{t-1})(\theta^* - \mu_{t-1})^\top\psi(a_t) \bigr] = \bar{\mathbb{E}}_t\bigl[ \psi(a_t)^\top\Sigma_{t-1}^{-1}\psi(a_t) \bigr] = \bar{\mathbb{E}}_t\bigl[ \|\psi(a_t)\|^2_{\Sigma_{t-1}^{-1}} \bigr], $$
where $\mu_{t-1}$ and $\Sigma_{t-1}$ are defined as in (18) and (17). Subsequently, using the elliptical potential lemma, we obtain
$$ \sum_{t=1}^T\Lambda_t \leq 2m\sigma^2\log\Bigl( 1 + \frac{T\lambda}{m\sigma^2} \Bigr). $$
To obtain an upper bound on the information ratio $\Gamma_t$, we define $\bar{f}(\theta^*, a) = \psi(a)^\top\theta^*$ and $\bar{f}(a) = \psi(a)^\top\mu_{t-1}$, and let
$$ M_{a,a'} = \sqrt{\bar{P}_t(a_t = a)\bar{P}_t(\hat{a}_t = a')}\,\Bigl( \bar{\mathbb{E}}_t\bigl[ \bar{f}(\theta^*, a) \mid \hat{a}_t = a' \bigr] - \bar{f}(a) \Bigr). $$
It is easy to see that
$$ \mathrm{Term}_1 = \bar{\mathbb{E}}_t\bigl[ \bar{f}(\theta^*, \hat{a}_t) \bigr] - \bar{\mathbb{E}}_t\bigl[ \bar{f}(a_t) \bigr] = \sum_a\bar{P}_t(\hat{a}_t = a)\,\bar{\mathbb{E}}_t\bigl[ \bar{f}(\theta^*, a) \mid \hat{a}_t = a \bigr] - \bar{\mathbb{E}}_t\bigl[ \bar{f}(\hat{a}_t) \bigr] = \sum_a\bar{P}_t(\hat{a}_t = a)\Bigl( \bar{\mathbb{E}}_t\bigl[ \bar{f}(\theta^*, a) \mid \hat{a}_t = a \bigr] - \bar{f}(a) \Bigr) = \mathrm{Tr}(M), $$
where the second and last equalities follow since $\bar{P}_t(a_t) = \bar{P}_t(\hat{a}_t)$. Similarly, we can relate $\Lambda_t$ to the matrix $(M_{a,a'})$ as
$$ \Lambda_t = \bar{\mathbb{E}}_t\bigl[ \bigl( \bar{f}(\theta^*, a_t) - \bar{f}(a_t) \bigr)^2 \bigr] = \sum_a\bar{P}_t(a_t = a)\,\bar{\mathbb{E}}_t\bigl[ \bigl( \bar{f}(\theta^*, a) - \bar{f}(a) \bigr)^2 \bigr] = \sum_{a,a'}\bar{P}_t(a_t = a)\bar{P}_t(\hat{a}_t = a')\,\bar{\mathbb{E}}_t\bigl[ \bigl( \bar{f}(\theta^*, a) - \bar{f}(a) \bigr)^2 \mid \hat{a}_t = a' \bigr] \geq \sum_{a,a'}\bar{P}_t(a_t = a)\bar{P}_t(\hat{a}_t = a')\Bigl( \bar{\mathbb{E}}_t\bigl[ \bar{f}(\theta^*, a) - \bar{f}(a) \mid \hat{a}_t = a' \bigr] \Bigr)^2 = \sum_{a,a'}\bar{P}_t(a_t = a)\bar{P}_t(\hat{a}_t = a')\Bigl( \bar{\mathbb{E}}_t\bigl[ \bar{f}(\theta^*, a) \mid \hat{a}_t = a' \bigr] - \bar{f}(a) \Bigr)^2 = \|M\|_F^2, $$
whereby we obtain
$$ \Gamma_t \leq \frac{\mathrm{Tr}(M)^2}{\|M\|_F^2} \leq m, $$
where the last inequality can be proved as in [25]. Following [19], it can be seen that $\Gamma_t \leq 2\log(1+K)$ also holds. Using this, together with the upper bound (A79), gives
$$ \sum_{t=1}^T\mathbb{E}[\mathrm{Term}_1] \leq \sqrt{2Tm\sigma^2\min\{m, 2\log(1+K)\}\log\Bigl( 1 + \frac{T\lambda}{m\sigma^2} \Bigr)}. $$

Appendix C.3. Proof of Lemma 2

We first give an upper bound on the estimation error that does not require the assumption of a linear feature map.

Appendix C.3.1. A General Upper Bound on $R_{\mathrm{EE}_1}^T$

To obtain an upper bound on $R_{\mathrm{EE}_1}^T$ that does not require the assumption that $\phi(a, c) = G_a c$, we leverage the same analysis as in the proof of Lemma 4. Subsequently, we obtain that
$$ R_{\mathrm{EE}_1}^T \leq \sum_{t=1}^T\mathbb{E}\bigl[ \Delta_{P(c_t \mid \hat{c}_t, \gamma^*), P(c_t \mid \hat{c}_t, H^{\hat{c}})}\,\mathbf{1}\{\mathcal{E}\} \bigr] + 2\delta T\,\mathbb{E}\bigl[ \|\theta^*\|_2 \mid \mathcal{E}^c \bigr] \leq \sum_{t=1}^T\mathbb{E}\bigl[ \Delta_{P(c_t \mid \hat{c}_t, \gamma^*), P(c_t \mid \hat{c}_t, H^{\hat{c}})}\,\mathbf{1}\{\mathcal{E}\} \bigr] + 2\delta^2 T\sqrt{\frac{m\lambda}{2\pi}}, $$
where the event $\mathcal{E}$ is defined as in (A37). Subsequently, the first summation can be upper bounded as
$$ \sum_{t=1}^T\mathbb{E}\bigl[ \Delta_{P(c_t \mid \hat{c}_t, \gamma^*), P(c_t \mid \hat{c}_t, H^{\hat{c}})}\,\mathbf{1}\{\mathcal{E}\} \bigr] \leq \sqrt{2TU^2\sum_{t=1}^T I(c_t; \gamma^* \mid \hat{c}_t, H^{\hat{c}})} = \sqrt{2TU^2\sum_{t=1}^T\bigl( H(c_t \mid \hat{c}_t, H^{\hat{c}}) - H(c_t \mid \hat{c}_t, \gamma^*) \bigr)} = \sqrt{TU^2\sum_{t=1}^T\log\bigl( \det(R_t^{-1})\det(M_t) \bigr)} \leq \sqrt{TU^2\Bigl( \mathrm{Tr}\bigl( (\Sigma_n\Sigma_\gamma^{-1}\Sigma_n M)^{-1} \bigr) + \log(T)\,\mathrm{Tr}\bigl( \Sigma_c\Sigma_n^{-1} \bigr) \Bigr)}, $$
where $U$ is defined as in (A37), $M_t = \Sigma_c^{-1} + \Sigma_n^{-1}$ and $R_t$ is as in (13).
To derive the last inequality, we observe the following series of relationships starting from (13):
$$ \Sigma_n^{-1}H_t^{-1}\Sigma_n^{-1} = (\Sigma_n H_t\Sigma_n)^{-1} = \bigl( (t-1)\Sigma_n - (t-2)M^{-1} + \Sigma_n\Sigma_\gamma^{-1}\Sigma_n \bigr)^{-1}, $$
$$ R_t = M - \Sigma_n^{-1}H_t^{-1}\Sigma_n^{-1} = M - \bigl( (t-1)\Sigma_n - (t-2)M^{-1} + \Sigma_n\Sigma_\gamma^{-1}\Sigma_n \bigr)^{-1} = M\Bigl( I - M^{-1}\bigl( (t-1)\Sigma_n - (t-2)M^{-1} + \Sigma_n\Sigma_\gamma^{-1}\Sigma_n \bigr)^{-1} \Bigr), $$
$$ R_t^{-1} = \Bigl( I - M^{-1}\bigl( (t-1)\Sigma_n - (t-2)M^{-1} + \Sigma_n\Sigma_\gamma^{-1}\Sigma_n \bigr)^{-1} \Bigr)^{-1}M^{-1} = \Bigl( I - \bigl( (t-1)\Sigma_n M - (t-2)I + \Sigma_n\Sigma_\gamma^{-1}\Sigma_n M \bigr)^{-1} \Bigr)^{-1}M^{-1}, $$
$$ R_t^{-1}M = \Bigl( I - \bigl( (t-1)\Sigma_n M - (t-2)I + \Sigma_n\Sigma_\gamma^{-1}\Sigma_n M \bigr)^{-1} \Bigr)^{-1} = \Bigl( I - \bigl( (t-1)\Sigma_n\Sigma_c^{-1} + (t-1)I - (t-2)I + \Sigma_n\Sigma_\gamma^{-1}\Sigma_n M \bigr)^{-1} \Bigr)^{-1} = \Bigl( I - \bigl( I + \underbrace{(t-1)\Sigma_n\Sigma_c^{-1} + \Sigma_n\Sigma_\gamma^{-1}\Sigma_n M}_{:=P_t} \bigr)^{-1} \Bigr)^{-1} \stackrel{(a)}{=} \Bigl( I - \bigl( I - P_t(I + P_t)^{-1} \bigr) \Bigr)^{-1} = \bigl( P_t(I + P_t)^{-1} \bigr)^{-1}, $$
where the equality in $(a)$ follows from the Woodbury matrix identity, and by the assumptions $\Sigma_n\Sigma_\gamma^{-1}\Sigma_n M \succ 0$ and $\Sigma_n\Sigma_c^{-1} \succeq 0$, $P_t \succ 0$ is invertible. Now,
$$ \det(R_t^{-1}M) = \frac{1}{\det\bigl( P_t(I + P_t)^{-1} \bigr)} = \det\bigl( P_t^{-1}(I + P_t) \bigr) = \det\bigl( P_t^{-1} + I \bigr) \stackrel{(b)}{\leq} \Bigl( \frac{\mathrm{Tr}(P_t^{-1} + I)}{d} \Bigr)^d = \bigl( 1 + \mathrm{Tr}(P_t^{-1})/d \bigr)^d, $$
where the inequality in $(b)$ follows from the determinant–trace inequality. Subsequently, we have
$$ \sum_{t=1}^T\log\det(R_t^{-1}M) \leq \sum_{t=1}^T d\log\bigl( 1 + \mathrm{Tr}(P_t^{-1})/d \bigr) \stackrel{(c)}{\leq} \sum_{t=1}^T\mathrm{Tr}(P_t^{-1}) = \mathrm{Tr}(P_1^{-1}) + \sum_{t=2}^T\mathrm{Tr}(P_t^{-1}) = \mathrm{Tr}\bigl( (\Sigma_n\Sigma_\gamma^{-1}\Sigma_n M)^{-1} \bigr) + \sum_{t=2}^T\mathrm{Tr}(P_t^{-1}) \stackrel{(d)}{\leq} \mathrm{Tr}\bigl( (\Sigma_n\Sigma_\gamma^{-1}\Sigma_n M)^{-1} \bigr) + \log(T)\,\mathrm{Tr}\bigl( \Sigma_c\Sigma_n^{-1} \bigr), $$
where the inequality in $(c)$ follows since $\log(1+x) \leq x$ for $x > 0$, and the inequality in $(d)$ follows since by assumption $P_t \succeq (t-1)\Sigma_n\Sigma_c^{-1}$, whereby $P_t^{-1} \preceq \frac{1}{t-1}\Sigma_c\Sigma_n^{-1}$ and, consequently, $\mathrm{Tr}(P_t^{-1}) \leq \frac{1}{t-1}\mathrm{Tr}(\Sigma_c\Sigma_n^{-1})$. Finally, we use that $\sum_{s=1}^T 1/s \leq \log(T)$.
We thus obtain that
$$ R_{\mathrm{EE}_1}^T \leq \sqrt{2\lambda m T\log(2m/\delta)\Bigl( \mathrm{Tr}\bigl( (\Sigma_n\Sigma_\gamma^{-1}\Sigma_n M)^{-1} \bigr) + \log(T)\,\mathrm{Tr}\bigl( \Sigma_c\Sigma_n^{-1} \bigr) \Bigr)} + 2\delta^2 T\sqrt{\frac{m\lambda}{2\pi}} $$
for $\delta \in (0, 1)$. We note that the same upper bound holds for the term $R_{\mathrm{EE}_2}^T$.

Appendix C.3.2. Upper Bound for Linear Feature Maps and Scaled Diagonal Covariance Matrices

We now obtain an upper bound on the estimation error under the assumption of a linear feature map $\phi(a, c) = G_a c$ such that $\|\phi(a, c)\|_2 \leq 1$. The following set of inequalities holds:
$$ R_{\mathrm{EE}_1}^T = \sum_{t=1}^T\mathbb{E}\bigl[ \psi(a_t^*, \hat{c}_t \mid \gamma^*)^\top\theta^* - \psi(\hat{a}_t, \hat{c}_t \mid H^{\hat{c}})^\top\theta^* \bigr] \leq \sum_{t=1}^T\mathbb{E}\bigl[ \psi(a_t^*, \hat{c}_t \mid \gamma^*)^\top\theta^* - \psi(a_t^*, \hat{c}_t \mid H^{\hat{c}})^\top\theta^* \bigr] = \sum_{t=1}^T\mathbb{E}\Bigl[ \mathbb{E}_{P(c_t \mid \hat{c}_t, \gamma^*)}\bigl[ \phi(a_t^*, c_t)^\top\theta^* \bigr] - \mathbb{E}_{P(c_t \mid \hat{c}_t, H^{\hat{c}})}\bigl[ \phi(a_t^*, c_t)^\top\theta^* \bigr] \Bigr]. $$
Note that $P(c_t \mid \hat{c}_t, H^{\hat{c}}) = \mathcal{N}(c_t \mid V_t, R_t^{-1})$, where $R_t$ and $V_t$ are, respectively, defined in (13) and (14). Consequently, $\phi(a_t^*, c_t)^\top\theta^* = c_t^\top G_{a_t^*}^\top\theta^*$ is $s_t^2$-sub-Gaussian with respect to $P(c_t \mid \hat{c}_t, H^{\hat{c}})$, where $s_t^2 = \theta^{*\top}G_{a_t^*}R_t^{-1}G_{a_t^*}^\top\theta^*$. Hence, using Lemma A1, we can upper bound the inner expectation of (A89) as
$$ \bigl| \mathbb{E}_{P(c_t \mid \hat{c}_t, \gamma^*)}\bigl[ \phi(a_t^*, c_t)^\top\theta^* \bigr] - \mathbb{E}_{P(c_t \mid \hat{c}_t, H^{\hat{c}})}\bigl[ \phi(a_t^*, c_t)^\top\theta^* \bigr] \bigr| \leq \sqrt{2 s_t^2 D_{\mathrm{KL}}\bigl( P(c_t \mid \hat{c}_t, \gamma^*)\,\|\,P(c_t \mid \hat{c}_t, H^{\hat{c}}) \bigr)}. $$
Summing over $t$ and using the Cauchy–Schwarz inequality then gives
$$ R_{\mathrm{EE}_1}^T \leq \sqrt{2\Bigl( \sum_{t=1}^T\mathbb{E}\bigl[ s_t^2 \bigr] \Bigr)\sum_{t=1}^T\mathbb{E}\Bigl[ D_{\mathrm{KL}}\bigl( P(c_t \mid \hat{c}_t, \gamma^*)\,\|\,P(c_t \mid \hat{c}_t, H^{\hat{c}}) \bigr) \Bigr]}. $$
We now evaluate the KL-divergence term. To this end, note that conditioned on $\gamma^*$ and $\hat{c}_t$, $c_t$ is independent of $H^{\hat{c}}$, i.e., $P(c_t \mid \hat{c}_t, \gamma^*, H^{\hat{c}}) = P(c_t \mid \hat{c}_t, \gamma^*)$. This gives
$$ \sum_{t=1}^T\mathbb{E}\Bigl[ D_{\mathrm{KL}}\bigl( P(c_t \mid \hat{c}_t, \gamma^*)\,\|\,P(c_t \mid \hat{c}_t, H^{\hat{c}}) \bigr) \Bigr] = \sum_{t=1}^T I(c_t; \gamma^* \mid \hat{c}_t, H^{\hat{c}}) = \sum_{t=1}^T\bigl( H(c_t \mid \hat{c}_t, H^{\hat{c}}) - H(c_t \mid \hat{c}_t, \gamma^*) \bigr) = \frac{1}{2}\sum_{t=1}^T\mathbb{E}_{P(\hat{c}_t, H^{\hat{c}})}\bigl[ \log\det(R_t^{-1}) \bigr] - \frac{1}{2}\sum_{t=1}^T\mathbb{E}_{P(\hat{c}_t, \gamma^*)}\bigl[ \log\det(M^{-1}) \bigr] = \sum_{t=1}^T\frac{1}{2}\mathbb{E}_{P(\hat{c}_t, H^{\hat{c}})}\bigl[ \log\bigl(\det(R_t^{-1})\det(M)\bigr) \bigr] \leq d\,\frac{\sigma_c^2}{\sigma_n^2}\Bigl( \frac{\sigma_\gamma^2}{\sigma_n^2 + \sigma_c^2} + \log(T-1) \Bigr), $$
where $M = \Sigma_c^{-1} + \Sigma_n^{-1}$ and $R_t$ is as in (13). The first equality follows by noting that
$$ \mathbb{E}\Bigl[ D_{\mathrm{KL}}\bigl( P(c_t \mid \hat{c}_t, \gamma^*)\,\|\,P(c_t \mid \hat{c}_t, H^{\hat{c}}) \bigr) \Bigr] = \mathbb{E}_{P(\hat{c}_t, H^{\hat{c}})}\mathbb{E}_{P(\gamma^* \mid \hat{c}_t, H^{\hat{c}})}\Bigl[ D_{\mathrm{KL}}\bigl( P(c_t \mid \hat{c}_t, \gamma^*, H^{\hat{c}})\,\|\,P(c_t \mid \hat{c}_t, H^{\hat{c}}) \bigr) \Bigr] = \mathbb{E}_{P(\hat{c}_t, H^{\hat{c}})}\bigl[ I(c_t; \gamma^* \mid \hat{c}_t, H^{\hat{c}}) \bigr], $$
with the outer expectation taken over $\hat{c}_t$ and $H^{\hat{c}}$. The last inequality is proved in Appendix C.3.3 using that $\Sigma_c = \sigma_c^2 I$, $\Sigma_n = \sigma_n^2 I$ and $\Sigma_\gamma = \sigma_\gamma^2 I$.
We can now upper bound $\sum_t\mathbb{E}[s_t^2]$ as follows:
$$ \mathbb{E}\bigl[ s_t^2 \bigr] \leq \mathbb{E}\Bigl[ \max_a\theta^{*\top}G_a R_t^{-1}G_a^\top\theta^* \Bigr] \leq \sum_a\mathbb{E}\bigl[ \theta^{*\top}G_a R_t^{-1}G_a^\top\theta^* \bigr] = \sum_a\mathrm{Tr}\bigl( G_a R_t^{-1}G_a^\top\mathbb{E}[\theta^*\theta^{*\top}] \bigr) = \lambda\sum_a\mathrm{Tr}\bigl( R_t^{-1}G_a^\top G_a \bigr) = \lambda b_t\,\mathrm{Tr}\Bigl( \sum_a G_a^\top G_a \Bigr), $$
where the last equality uses $R_t^{-1} = b_t I$ as in (A95). Using (A95), we obtain
$$ \sum_t b_t = \sum_t\frac{\sigma_c^2\sigma_n^2}{f}\Bigl( 1 + \frac{\sigma_c^2\sigma_\gamma^2}{(t-1)\sigma_\gamma^2\sigma_n^2 + f\sigma_n^2} \Bigr) = \frac{\sigma_c^2\sigma_n^2 T}{f} + \sum_t\frac{\sigma_c^2}{f}\cdot\frac{\sigma_c^2\sigma_\gamma^2}{(t-1)\sigma_\gamma^2 + f} = \frac{\sigma_c^2\sigma_n^2 T}{f} + \frac{\sigma_c^4\sigma_\gamma^2}{f^2} + \sum_{t>1}\frac{\sigma_c^2}{f}\cdot\frac{\sigma_c^2\sigma_\gamma^2}{(t-1)\sigma_\gamma^2 + f} \leq \frac{\sigma_c^2\sigma_n^2}{f}T + \frac{\sigma_c^4\sigma_\gamma^2}{f^2} + \frac{\sigma_c^4}{f}\log(T-1). $$
Using the above relation, we obtain
$$ \sum_t\mathbb{E}\bigl[ s_t^2 \bigr] = \lambda\,\mathrm{Tr}\Bigl( \sum_a G_a^\top G_a \Bigr)\sum_t b_t \leq \lambda K\frac{\sigma_c^2}{f}\max_a\mathrm{Tr}\bigl( G_a^\top G_a \bigr)\Bigl( \sigma_n^2 T + \frac{\sigma_c^2\sigma_\gamma^2}{f} + \sigma_c^2\log(T-1) \Bigr). $$
If $\lambda \leq d\sigma^2/T$, we obtain
$$ \sum_t\mathbb{E}\bigl[ s_t^2 \bigr] \leq d\sigma^2 K\frac{\sigma_c^2}{f}\max_a\mathrm{Tr}\bigl( G_a^\top G_a \bigr)\Bigl( \sigma_n^2 + \frac{\sigma_c^2\sigma_\gamma^2}{Tf} + \frac{\sigma_c^2\log(T-1)}{T} \Bigr) := L. $$
Using the above inequality together with (A92) in (A91) yields
$$ R_{\mathrm{EE}_1}^T \leq \sqrt{2Ld\,\frac{\sigma_c^2}{\sigma_n^2}\Bigl( \frac{\sigma_\gamma^2}{\sigma_n^2 + \sigma_c^2} + \log(T-1) \Bigr)}. $$
An upper bound on $R_{\mathrm{EE}_2}^T$ similarly follows.

Appendix C.3.3. Analysis of $R_t$ for Scaled Diagonal Covariance Matrices

Assume that $\Sigma_n = \sigma_n^2 I$, $\Sigma_c = \sigma_c^2 I$ and $\Sigma_\gamma = \sigma_\gamma^2 I$. Then, from (13), we obtain
$$ H_t = (t-1)\Sigma_n^{-1} - (t-2)\Sigma_n^{-1}M^{-1}\Sigma_n^{-1} + \Sigma_\gamma^{-1} = \Bigl( \frac{t-1}{\sigma_n^2} - (t-2)\frac{\sigma_c^2\sigma_n^2}{\sigma_n^4\underbrace{(\sigma_c^2 + \sigma_n^2)}_{:=f}} + \frac{1}{\sigma_\gamma^2} \Bigr)I = \frac{(t-1)\sigma_\gamma^2 - (t-2)\sigma_c^2\sigma_\gamma^2/f + \sigma_n^2}{\sigma_n^2\sigma_\gamma^2}\,I. $$
This implies
$$ H_t^{-1} = \frac{\sigma_n^2\sigma_\gamma^2}{(t-1)\sigma_\gamma^2 - (t-2)\sigma_c^2\sigma_\gamma^2/f + \sigma_n^2}\,I, $$
whereby we obtain
$$ R_t = \frac{f\bigl( (t-1)\sigma_\gamma^2 + \sigma_n^2 + \sigma_c^2 \bigr)}{\sigma_c^2\bigl( (t-1)\sigma_\gamma^2\sigma_n^2 + \sigma_c^2\sigma_\gamma^2 + f\sigma_n^2 \bigr)}\,I \quad\text{and}\quad R_t^{-1} = \frac{\sigma_c^2\bigl( (t-1)\sigma_\gamma^2\sigma_n^2 + \sigma_c^2\sigma_\gamma^2 + f\sigma_n^2 \bigr)}{f\bigl( (t-1)\sigma_\gamma^2 + \sigma_n^2 + \sigma_c^2 \bigr)}\,I = b_t I. $$
Noting that $M_t = \frac{f}{\sigma_n^2\sigma_c^2}\,I$, we then have
$$ R_t^{-1}M_t = \frac{(t-1)\sigma_\gamma^2\sigma_n^2 + \sigma_c^2\sigma_\gamma^2 + f\sigma_n^2}{\sigma_n^2\bigl( (t-1)\sigma_\gamma^2 + \sigma_n^2 + \sigma_c^2 \bigr)}\,I = \frac{(t-1)\sigma_\gamma^2\sigma_n^2 + \sigma_c^2\sigma_\gamma^2 + f\sigma_n^2}{(t-1)\sigma_\gamma^2\sigma_n^2 + f\sigma_n^2}\,I = \Bigl( 1 + \frac{\sigma_c^2\sigma_\gamma^2}{(t-1)\sigma_\gamma^2\sigma_n^2 + f\sigma_n^2} \Bigr)I. $$
Subsequently, we obtain that, for $t > 1$,
$$ \log\det(R_t^{-1}M_t) = d\log\Bigl( 1 + \frac{\sigma_c^2\sigma_\gamma^2}{(t-1)\sigma_\gamma^2\sigma_n^2 + f\sigma_n^2} \Bigr) \leq d\log\Bigl( 1 + \frac{\sigma_c^2/\sigma_n^2}{t-1} \Bigr) \leq \frac{d\,\sigma_c^2/\sigma_n^2}{t-1}, $$
whereby
$$ \sum_{t=1}^T\log\det(R_t^{-1}M_t) \leq d\log\Bigl( 1 + \frac{\sigma_c^2\sigma_\gamma^2}{f\sigma_n^2} \Bigr) + \sum_{t>1}^T\frac{d\,\sigma_c^2/\sigma_n^2}{t-1} \leq d\,\frac{\sigma_c^2}{\sigma_n^2}\Bigl( \frac{\sigma_\gamma^2}{f} + \log(T-1) \Bigr), $$
where the last inequality follows since $\log(1+x) \leq x$ and $\sum_{s=1}^T\frac{1}{s} \leq \log(T)$.
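The scalar closed form for $R_t^{-1}M_t$ can be verified directly against the matrix expressions above; the sketch below (our own illustration, not code from the paper) does so for arbitrary variance values.

```python
import numpy as np

# Numerical sanity check (not from the paper's code) of the closed form
# R_t^{-1} M_t = (1 + sigma_c^2 sigma_g^2 / ((t-1) sigma_g^2 sigma_n^2 + f sigma_n^2)) I
# for scaled identity covariances, computing R_t from the matrix expressions above.
d, t = 3, 7
sc2, sn2, sg2 = 0.9, 0.4, 1.7                     # sigma_c^2, sigma_n^2, sigma_gamma^2
f = sc2 + sn2

I = np.eye(d)
Sigma_c, Sigma_n, Sigma_g = sc2 * I, sn2 * I, sg2 * I
M = np.linalg.inv(Sigma_c) + np.linalg.inv(Sigma_n)
H_t = ((t - 1) * np.linalg.inv(Sigma_n)
       - (t - 2) * np.linalg.inv(Sigma_n) @ np.linalg.inv(M) @ np.linalg.inv(Sigma_n)
       + np.linalg.inv(Sigma_g))
R_t = M - np.linalg.inv(Sigma_n) @ np.linalg.inv(H_t) @ np.linalg.inv(Sigma_n)

lhs = np.linalg.inv(R_t) @ M                       # M_t equals M for scaled identities
rhs = (1.0 + sc2 * sg2 / ((t - 1) * sg2 * sn2 + f * sn2)) * I
print(np.allclose(lhs, rhs))                       # expected: True
```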

Appendix D. Details on Experiments

In this section, we present details on the baselines implemented for stochastic CBs with unobserved true contexts.

Appendix D.1. Gaussian Bandits

For Gaussian bandits, we implemented the baselines as explained below.
  • TS_naive: This algorithm implements the following action policy at each iteration $t$,
$$ a_t = \arg\max_{a \in \mathcal{A}}\phi(a, \hat{c}_t)^\top\theta_t, $$
where $\theta_t$ is sampled from a Gaussian distribution $\mathcal{N}(\mu_{t-1,\mathrm{naive}}, \Sigma_{t-1,\mathrm{naive}}^{-1})$ with
$$ \Sigma_{t-1,\mathrm{naive}} = \frac{I}{\lambda} + \frac{1}{\sigma^2}\sum_{\tau=1}^{t-1}\phi(a_\tau, \hat{c}_\tau)\phi(a_\tau, \hat{c}_\tau)^\top, \qquad \mu_{t-1,\mathrm{naive}} = \Sigma_{t-1,\mathrm{naive}}^{-1}\frac{1}{\sigma^2}\sum_{\tau=1}^{t-1} r_\tau\phi(a_\tau, \hat{c}_\tau). $$
A short numerical sketch of this Gaussian posterior update is given after this list.
  • TS_oracle: In this baseline, the agent has knowledge of the true predictive distribution $P(c_t \mid \hat{c}_t, \gamma^*)$. Consequently, at each iteration $t$, the algorithm chooses action
$$ a_t = \arg\max_{a \in \mathcal{A}}\psi(a, \hat{c}_t \mid \gamma^*)^\top\theta_t, $$
where $\theta_t$ is sampled from a Gaussian distribution $\mathcal{N}(\mu_{t-1,\mathrm{poc}}, \Sigma_{t-1,\mathrm{poc}}^{-1})$ with
$$ \Sigma_{t-1,\mathrm{poc}} = \frac{I}{\lambda} + \frac{1}{\sigma^2}\sum_{\tau=1}^{t-1}\psi(a_\tau, \hat{c}_\tau \mid \gamma^*)\psi(a_\tau, \hat{c}_\tau \mid \gamma^*)^\top, \qquad \mu_{t-1,\mathrm{poc}} = \Sigma_{t-1,\mathrm{poc}}^{-1}\frac{1}{\sigma^2}\sum_{\tau=1}^{t-1} r_\tau\psi(a_\tau, \hat{c}_\tau \mid \gamma^*). $$
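The two baselines share the same Gaussian posterior update and differ only in the feature vector used ($\phi(a_\tau, \hat{c}_\tau)$ for TS_naive, $\psi(a_\tau, \hat{c}_\tau \mid \gamma^*)$ for TS_oracle). The following minimal sketch (our own illustration, not the code used for the experiments) implements one round of this update and the corresponding action selection; the function and argument names are hypothetical.

```python
import numpy as np

# Minimal sketch (not the paper's experiment code) of the Gaussian Thompson sampling step
# shared by the TS_naive and TS_oracle baselines described above.
def ts_gaussian_step(X, r, lam, sigma2, X_cand, rng=None):
    """X: (n, m) past feature vectors; r: (n,) rewards; X_cand: (K, m) candidate features."""
    rng = rng or np.random.default_rng()
    m = X.shape[1]
    Prec = np.eye(m) / lam + (X.T @ X) / sigma2          # Sigma_{t-1} (precision matrix)
    mu = np.linalg.solve(Prec, X.T @ r / sigma2)         # mu_{t-1}
    theta_t = rng.multivariate_normal(mu, np.linalg.inv(Prec))
    return int(np.argmax(X_cand @ theta_t))              # a_t = argmax_a x_a^T theta_t
```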
For Gaussian bandits, Figure A1 shows an additional experiment comparing the performance of our proposed Algorithm 1 for varying values of the number $K$ of actions. All other parameters are set as in Figure 1 (Left).
Figure A1. Bayesian cumulative regret of Algorithm 1 as a function of iterations over varying number K of actions.

Appendix D.2. Logistic Bandits

In the case of logistic bandits, we implemented the baselines as explained below.
  • TS_naive: This algorithm considers the following sampling distribution:
$$ Q(\theta^* \mid H^{t-1,r,a,\hat{c}}) \propto P(\theta^*)\prod_{\tau=1}^{t-1}\mathrm{Ber}\bigl( r_\tau;\, \mu\bigl( \phi(a_\tau, \hat{c}_\tau)^\top\theta^* \bigr) \bigr). $$
However, due to the non-conjugacy of the Gaussian prior $P(\theta^*)$ and the Bernoulli reward likelihood, sampling from the above posterior distribution is not straightforward. Consequently, we adopt the Langevin Monte Carlo (LMC) sampling approach from [20]. To sample $\theta_t$ at iteration $t$, we run LMC for $I = 50$ iterations with learning rate $\eta_t = 0.2/t$ and inverse temperature parameter $\beta^{-1} = 0.001$; $\theta_t$ is then taken as the output of LMC after these $I = 50$ iterations (a short sketch of this sampler is given after this list). Using the sampled $\theta_t$, the algorithm then chooses the action $a_t$ as
$$ a_t = \arg\max_{a \in \mathcal{A}}\phi(a, \hat{c}_t)^\top\theta_t. $$
  • TS_oracle: This algorithm considers the following sampling distribution:
$$ Q(\theta^* \mid H^{t-1,r,a,\hat{c}}) \propto P(\theta^*)\prod_{\tau=1}^{t-1}\mathrm{Ber}\bigl( r_\tau;\, \mu\bigl( \psi(a_\tau, \hat{c}_\tau \mid \gamma^*)^\top\theta^* \bigr) \bigr), $$
where $\psi(a_t, \hat{c}_t \mid \gamma^*) = \mathbb{E}_{P(c_t \mid \hat{c}_t, \gamma^*)}[\phi(a_t, c_t)]$ is the expected feature map under the posterior predictive distribution with known $\gamma^*$. As before, to sample from the above distribution, we use $I = 50$ iterations of LMC with learning rate $\eta_t = 0.2/t$ and inverse temperature parameter $\beta^{-1} = 0.001$. Using the sampled $\theta_t$, the algorithm then chooses the action $a_t$ as
$$ a_t = \arg\max_{a \in \mathcal{A}}\psi(a, \hat{c}_t \mid \gamma^*)^\top\theta_t. $$
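For concreteness, the sketch below (our own illustration, not the code used for the experiments) shows one possible implementation of the LMC sampler for the logistic TS_naive baseline under the assumptions stated above: Gaussian prior $\mathcal{N}(0, \lambda I)$, logistic mean $\mu(x) = 1/(1+e^{-x})$, $I = 50$ LMC steps, learning rate $\eta_t = 0.2/t$ and inverse temperature $\beta^{-1} = 0.001$; all function and variable names are hypothetical.

```python
import numpy as np

# Minimal sketch (not the paper's code) of Langevin Monte Carlo sampling of theta_t for the
# logistic TS_naive baseline, under the assumptions stated in the lead-in above.
def sample_theta_lmc(Phi, r, lam, t, n_steps=50, beta_inv=1e-3, rng=None):
    """Phi: (n, m) features phi(a_tau, c_hat_tau); r: (n,) binary rewards."""
    rng = rng or np.random.default_rng()
    m = Phi.shape[1]
    eta = 0.2 / t
    theta = np.zeros(m)
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-Phi @ theta))           # mu(phi^T theta)
        grad = Phi.T @ (p - r) + theta / lam             # gradient of negative log posterior
        theta = theta - eta * grad + np.sqrt(2 * eta * beta_inv) * rng.normal(size=m)
    return theta

# Usage (hypothetical histories): theta_t = sample_theta_lmc(Phi_hist, r_hist, lam=1.0, t=10)
# followed by a_t = int(np.argmax(Phi_candidates @ theta_t)).
```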

References

  1. Srivastava, V.; Reverdy, P.; Leonard, N.E. Surveillance in an abruptly changing world via multiarmed bandits. In Proceedings of the IEEE Conference on Decision and Control (CDC), Los Angeles, CA, USA, 15–17 December 2014; pp. 692–697.
  2. Aziz, M.; Kaufmann, E.; Riviere, M.K. On multi-armed bandit designs for dose-finding clinical trials. J. Mach. Learn. Res. 2021, 22, 686–723.
  3. Anandkumar, A.; Michael, N.; Tang, A.K.; Swami, A. Distributed algorithms for learning and cognitive medium access with logarithmic regret. IEEE J. Sel. Areas Commun. 2011, 29, 731–745.
  4. Srivastava, V.; Reverdy, P.; Leonard, N.E. On optimal foraging and multi-armed bandits. In Proceedings of the Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 2–4 October 2013; pp. 494–499.
  5. Bubeck, S.; Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv 2012, arXiv:1204.5721.
  6. Abe, N.; Biermann, A.W.; Long, P.M. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica 2003, 37, 263–293.
  7. Agarwal, D.; Chen, B.C.; Elango, P.; Motgi, N.; Park, S.T.; Ramakrishnan, R.; Roy, S.; Zachariah, J. Online models for content optimization. Adv. Neural Inf. Process. Syst. 2008, 21, 17–24.
  8. Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 2002, 47, 235–256.
  9. Chu, W.; Li, L.; Reyzin, L.; Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 208–214.
  10. Agrawal, S.; Goyal, N. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 127–135.
  11. Lamprier, S.; Gisselbrecht, T.; Gallinari, P. Profile-based bandit with unknown profiles. J. Mach. Learn. Res. 2018, 19, 2060–2099.
  12. Kirschner, J.; Krause, A. Stochastic bandits with context distributions. Adv. Neural Inf. Process. Syst. 2019, 32, 14113–14122.
  13. Yang, L.; Yang, J.; Ren, S. Multi-feedback bandit learning with probabilistic contexts. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence Main Track, Yokohama, Japan, 11–17 July 2020.
  14. Kim, J.h.; Yun, S.Y.; Jeong, M.; Nam, J.; Shin, J.; Combes, R. Contextual Linear Bandits under Noisy Features: Towards Bayesian Oracles. In Proceedings of the International Conference on Artificial Intelligence and Statistics, PMLR, Valencia, Spain, 25–27 April 2023; pp. 1624–1645.
  15. Guo, Y.; Murphy, S. Online learning in bandits with predicted context. arXiv 2023, arXiv:2307.13916.
  16. Roy, D.; Dutta, M. A systematic review and research perspective on recommender systems. J. Big Data 2022, 9, 59.
  17. Russo, D.; Van Roy, B. Learning to optimize via posterior sampling. Math. Oper. Res. 2014, 39, 1221–1243.
  18. Park, H.; Faradonbeh, M.K.S. Analysis of Thompson sampling for partially observable contextual multi-armed bandits. IEEE Control Syst. Lett. 2021, 6, 2150–2155.
  19. Neu, G.; Olkhovskaia, I.; Papini, M.; Schwartz, L. Lifting the information ratio: An information-theoretic analysis of Thompson sampling for contextual bandits. Adv. Neural Inf. Process. Syst. 2022, 35, 9486–9498.
  20. Xu, P.; Zheng, H.; Mazumdar, E.V.; Azizzadenesheli, K.; Anandkumar, A. Langevin Monte Carlo for contextual bandits. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 24830–24850.
  21. Harper, F.M.; Konstan, J.A. The MovieLens datasets: History and context. ACM Trans. Interact. Intell. Syst. (TIIS) 2015, 5, 1–19.
  22. Hong, J.; Kveton, B.; Zaheer, M.; Ghavamzadeh, M. Hierarchical Bayesian bandits. In Proceedings of the International Conference on Artificial Intelligence and Statistics, PMLR, Virtual, 28–30 March 2022; pp. 7724–7741.
  23. Hong, J.; Kveton, B.; Katariya, S.; Zaheer, M.; Ghavamzadeh, M. Deep hierarchy in bandits. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 8833–8851.
  24. Lattimore, T.; Szepesvári, C. Bandit Algorithms; Cambridge University Press: Cambridge, UK, 2020.
  25. Russo, D.; Van Roy, B. An information-theoretic analysis of Thompson sampling. J. Mach. Learn. Res. 2016, 17, 2442–2471.
Figure 1. Comparison of Bayesian regret of proposed algorithms with baselines as a function of number of iterations. (Left) Gaussian bandits with K = 40, σn² = σγ² = 1.1; (Center) Logistic bandits with K = 40, σn² = 2, σγ² = 2.5; (Right) MovieLens dataset with added Gaussian context noise and Gaussian prior: parameters set as σn² = 0.1, σγ² = 0.6.
Table 1. Comparison of the regret bounds of our proposed TS algorithm for noisy CB with state-of-the-art algorithms.

Reference | Setting | Algorithm | Regret | Bound
--- | --- | --- | --- | ---
[8] | Linear CB | LinRel | Frequentist | $\tilde{O}(m\sqrt{T})$
[9] | Linear CB | Lin-UCB | Frequentist | $\tilde{O}(m\sqrt{T})$
[10] | Linear CB | TS | Frequentist | $O(m\sqrt{T}\log^{3/2}T)$
[17] | Linear CB | TS | Bayesian | $O(m\sqrt{T}\log T)$
[11] | Noisy CB | SampLinUCB | Frequentist | $\tilde{O}(m\sqrt{T})$
[12] | Noisy CB | UCB | Frequentist | $\tilde{O}(m\sqrt{T})$
[14] | Noisy CB | OFUL | Frequentist | $\tilde{O}(m\sqrt{T})$
Our work | Noisy CB | TS | Bayesian | $\tilde{O}(m\sqrt{T})$
