Abstract
We study stochastic linear contextual bandits (CB) where the agent observes a noisy version of the true context through a noise channel with unknown channel parameters. Our objective is to design an action policy that can “approximate” that of a Bayesian oracle that has access to the reward model and the noise channel parameter. We introduce a modified Thompson sampling algorithm and analyze its Bayesian cumulative regret with respect to the oracle action policy via information-theoretic tools. For Gaussian bandits with Gaussian context noise, our information-theoretic analysis shows that under certain conditions on the prior variance, the Bayesian cumulative regret scales as , where m is the dimension of the feature vector and T is the time horizon. We also consider the problem setting where the agent observes the true context with some delay after receiving the reward, and show that delayed true contexts lead to lower regret. Finally, we empirically demonstrate the performance of the proposed algorithms against baselines.
1. Introduction
Decision-making in the face of uncertainty is a widespread challenge found across various domains such as control and robotics [1], clinical trials [2], communications [3], and ecology [4]. To tackle this challenge, learning algorithms have been developed to uncover effective policies for optimal decision-making. One notable framework for addressing this is contextual bandits (CBs), which capture the essence of sequential decision-making by incorporating side information, termed context [5].
In the standard CB model, an agent interacts with the environment over numerous rounds. In each round, the environment presents a context to the agent based on which the agent chooses an action and receives a reward from the environment. The reward is stochastic, drawn from a probability distribution whose mean reward (which is a function of context-action pair) is unknown to the agent. The goal of the agent is to design a policy for action selection that can maximize the cumulative mean reward accrued over a T-length horizon.
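To fix ideas, the interaction protocol can be sketched as follows (a minimal illustration in Python; the `env` and `agent` objects and their methods are placeholders of our own, not part of any specific library):

```python
# Minimal sketch of the standard contextual-bandit protocol (illustrative names only).
def run_episode(env, agent, T):
    total_reward = 0.0
    for t in range(T):
        context = env.sample_context()                # environment draws the context c_t
        action = agent.select_action(context)         # agent's policy maps context to action
        reward = env.sample_reward(action, context)   # stochastic reward with unknown mean
        agent.update(context, action, reward)         # agent refines its reward estimates
        total_reward += reward
    return total_reward
```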
In this paper, we focus on a CB model that assumes stochastic rewards with linear mean-reward functions, also called stochastic linear contextual bandits. Stochastic linear CB models find applications in various settings including internet advertisement selection [6], where the advertisement (i.e., action) and webpage features (i.e., context) are used to construct a linear predictor of the probability that a user clicks on a given advertisement, and article recommendation on web portals [7].
While most prior research on CBs has primarily focused on models with known exact contexts [8,9,10], in many real-world applications, the contexts are noisy, e.g., imprecise measurement of patient conditions in clinical trials, weather or stock market predictions. In such scenarios, when the exact contexts are unknown, the agent must utilize the observed noisy contexts to estimate the mean reward associated with the true context. However, this results in a biased estimate that renders the application of standard CB algorithms unsuitable. Consequently, recent efforts have been made to develop CB algorithms tailored to noisy context settings.
Related Works: Ref. [11] considers a setting where there is a bounded zero-mean noise in the m-dimensional feature vector (denoted by , where a is the action and c is the context) rather than in the context vector, and the agent observes only noisy features. For this setting, they develop an upper confidence bound (UCB) algorithm. Ref. [12] models the uncertainty regarding the true contexts via a context distribution that is known to the agent, while the agent never observes the true context, and develops a UCB algorithm for this setting. A similar setting has also been considered in [13]. Differing from these works, ref. [14] considers the setting where the true feature vectors are sampled from an unknown feature distribution at each time, but the agent observes only a noisy feature vector. Assuming Gaussian feature noise with unknown mean and covariance, they develop an Optimism in the Face of Uncertainty (OFUL) algorithm. A variant of this setting has been studied in [15].
Motivation and Problem Setting: In this work, inspired by [14], we consider the following noisy CB setting. In each round, the environment samples a true context vector from a context distribution that is known to the agent. The agent, however, does not observe the true context but observes a noisy context obtained as the output of a noise channel parameterized by . The agent is aware of the noise present but does not know the channel parameter . Following [14], we consider Gaussian noise channels for our regret analysis.
Based on the observed noisy contexts, the agent chooses an action and observes a reward corresponding to the true context. We consider a linear bandit whose mean reward is determined by an unknown reward parameter . The goal of the agent is to design an action policy that minimizes the Bayesian cumulative regret with respect to the action policy of a Bayesian oracle. The oracle has access to the reward model and the channel parameter , and uses the predictive distribution of the true context given the observed noisy context to select an action.
Our setting differs from [14] in that we assume noisy contexts rather than noisy feature vectors and that the agent knows the context distribution. The noise model, incorporating noise in the feature vector, allows [14] to transform the original problem into a different CB problem that estimates a modified reward parameter. Such a transformation, however, is not straightforward in our setting with noise in contexts rather than in feature vectors, where we wish to analyze the Bayesian regret. Additionally, we propose a de-noising approach to estimate the predictive distribution of the true context from given noisy contexts, offering potential benefits for future analyses.
The assumption of known context distribution follows from [12]. This can be motivated by considering the example of an online recommendation engine that pre-processes the user account registration information or contexts (e.g., age, gender, device, location, item preferences) to group them into different clusters [16]. The engine can then infer the ‘empirical’ distribution of users within each cluster to define a context distribution over true contextual information. A noisy contextual information scenario occurs when a guest with different preferences logs into a user’s account.
Challenges and Novelty: Different from existing works that developed UCB-based algorithms, we propose a fully Bayesian Thompson Sampling (TS) algorithm that approximates the Bayesian oracle policy. The proposed algorithm differs from the standard contextual TS [10] in the following aspects. Firstly, since the true context vectors are not accessible at each round and the channel parameter is unknown, the agent uses its knowledge of the context distribution and the past observed noisy contexts to infer a predictive posterior distribution of the true context from the current observed noisy context. The inferred predictive distribution is then used to choose the action. This de-noising step enables our algorithm to ‘approximate’ the oracle action policy that uses knowledge of the channel parameter to implement exact de-noising. Secondly, the reward received by the agent corresponds to the unobserved true context . Hence, the agent cannot accurately evaluate the posterior distribution of and sample from it as is conducted in standard contextual TS. Instead, our algorithm proposes to use a sampling distribution that ‘approximates’ the posterior.
Different from existing works that focus on frequentist regret analysis, we derive novel information-theoretic bounds on the Bayesian cumulative regret of our algorithm. For Gaussian bandits, our information-theoretic regret bounds scale as (where the notation suppresses logarithmic terms in its argument), with m and T denoting the dimension of the feature vector and the time horizon, respectively, under certain conditions on the variance of the prior on . Furthermore, our Bayesian regret analysis shows that the posterior mismatch, resulting from replacing the true posterior distribution with a sampling distribution, yields an approximation error that is captured via the Kullback–Leibler (KL) divergence between the two distributions. To the best of our knowledge, quantifying the posterior mismatch via the KL divergence has not been studied before and is of independent interest.
Finally, we also extend our algorithm to a setting where the agent observes the true context after the decision is made and the reward is observed [12]. We call this setting CBs with delayed true contexts. Such scenarios arise in many applications where only a prediction of the context is available at the time of decision-making, while the true context becomes available later. For instance, in a farming-recommender system, at the time of deciding which crop to cultivate in a year, the true contextual information about the weather pattern is unavailable and only some ‘noisy’ weather predictions can be used; the true weather pattern is observed only after the decision is made. We show that our TS algorithm for this setting with delayed true contexts results in reduced Bayesian regret. Table 1 compares our regret bound with that of the state-of-the-art algorithms in the noiseless and noisy CB settings.
Table 1.
Comparison of the regret bounds of our proposed TS algorithm for noisy CB with state-of-the-art algorithms.
2. Problem Setting
In this section, we present the stochastic linear CB problem studied in this paper. Let denote the action set with K actions and denote the (possibly infinite) set of d-dimensional context vectors. At iteration , the environment randomly draws a context vector according to a context distribution defined over the space of context vectors. The context distribution is known to the agent. The agent, however, does not observe the true context drawn by the environment. Instead, it observes a noisy version of the true context, obtained as the output of a noisy, stochastic channel with the true context as the input. The noise channel is parameterized by the noise channel parameter that is unknown to the agent.
Having observed the noisy context at iteration t, the agent chooses an action according to an action policy . The action policy may be stochastic, describing a probability distribution over the set of actions. Corresponding to the chosen action , the agent receives a reward from the environment given by
where is the linear mean-reward function and is a zero-mean reward noise variable. The mean-reward function is defined via the feature map , which maps the action and true context to an m-dimensional feature vector, and via the reward parameter that is unknown to the agent.
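Since the display equation is not reproduced above, the following is a sketch of the linear reward model in generic notation; the symbols $r_t$, $a_t$, $c_t$, $\phi$, $\theta^*$, and $\epsilon_t$ are placeholders for the paper's notation rather than its exact symbols:

```latex
% Sketch of the linear mean-reward model (illustrative notation):
% r_t: reward, a_t: action, c_t: true context, \phi: feature map into R^m,
% \theta^*: unknown reward parameter, \epsilon_t: zero-mean reward noise.
r_t \;=\; \big\langle \theta^*,\, \phi(a_t, c_t) \big\rangle \;+\; \epsilon_t .
```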
We call the noisy CB problem described above CBs with unobserved true context (see Setting 1) since the agent does not observe the true context and the selection of action is based solely on the observed noisy context. Accordingly, at the end of iteration t, the agent has accrued the history of observed reward-action-noisy context tuples. The action policy at iteration may depend on the history .
| Setting 1: CBs with unobserved true contexts |
We also consider a variant of the above problem where the agent has access to a delayed observation of the true context as studied in [12]. We call this setting CBs with delayed true context. In this setting, at iteration t, the agent first observes a noisy context , chooses action , and receives reward . Later, the true context is observed. It is important to note that the agent has no access to the true context at the time of decision-making. Thus, at the end of iteration t, the agent has collected the history of observed reward-action-context-noisy context tuples.
In both of the problem settings described above, the agent’s objective is to devise an action policy that minimizes the Bayesian cumulative regret with respect to a baseline action policy. We define Bayesian cumulative regret next.
Bayesian Cumulative Regret
The cumulative regret of an action policy quantifies how different the mean reward accumulated over T iterations is from that accrued by a baseline action policy . In this work, we consider as baseline the action policy of an oracle that has access to the channel noise parameter , the reward parameter , the context distribution and the noise channel likelihood . Accordingly, at each iteration t, the oracle can infer the exact predictive distribution of the true context from the observed noisy context via Bayes’ rule as
Here, is the joint distribution of the true and noisy contexts given the noise channel parameter , and is the distribution obtained by marginalizing over the true contexts, i.e.,
where denotes expectation with respect to ‘’. The oracle action policy then adopts an action
at iteration t, where . Note that, as in [14,18], we do not choose the stronger oracle action policy of , which requires access to the true context , as it is generally not achievable by an agent that observes only the noisy context and has no access to .
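In the same placeholder notation (with $\hat{c}_t$ for the noisy context and $\gamma^*$ for the channel parameter), the oracle's de-noising and action rule can be sketched as:

```latex
% Oracle predictive distribution of the true context via Bayes' rule (illustrative notation):
P(c_t \mid \hat{c}_t, \gamma^*)
  \;=\; \frac{P(\hat{c}_t \mid c_t, \gamma^*)\, P(c_t)}
             {\int P(\hat{c}_t \mid c, \gamma^*)\, P(c)\, \mathrm{d}c},
\qquad
% Oracle action: maximize the mean reward averaged over the predictive distribution.
a_t^{\mathrm{o}}
  \;=\; \arg\max_{a \in \mathcal{A}}
        \big\langle \theta^*,\; \mathbb{E}_{c \sim P(\cdot \mid \hat{c}_t, \gamma^*)}\big[\phi(a, c)\big] \big\rangle .
```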
For fixed parameters and , we define the cumulative regret of the action policy as
the expected difference in mean rewards of the oracle decision policy and the agent’s decision policy over T iterations. In (5), the expectation is taken with respect to the joint distribution , where . Using this, the cumulative regret (5) can be written as
Our focus in this work is on a Bayesian framework where we assume that the reward parameter and channel noise parameter are independently sampled by the environment from prior distributions , defined on the set of reward parameters, and , defined on the set of channel noise parameters, respectively. The agent has knowledge of the prior distributions, the reward likelihood in (1) and the noise channel likelihood , although it does not observe the sampled and . Using the above prior distributions, we define Bayesian cumulative regret of the action policy as
where the expectation is taken with respect to the priors and .
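Written out in the same illustrative notation, the cumulative regret (5) and the Bayesian cumulative regret (7) take the following schematic form (the precise conditioning follows the paper's definitions):

```latex
% Schematic form of the regrets (illustrative notation):
\mathcal{R}(T \mid \theta^*, \gamma^*)
  \;=\; \sum_{t=1}^{T} \mathbb{E}\Big[ \big\langle \theta^*, \phi(a_t^{\mathrm{o}}, c_t) \big\rangle
        - \big\langle \theta^*, \phi(a_t, c_t) \big\rangle \Big],
\qquad
\mathrm{BR}(T)
  \;=\; \mathbb{E}_{\theta^*}\,\mathbb{E}_{\gamma^*}\big[ \mathcal{R}(T \mid \theta^*, \gamma^*) \big].
```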
In the next sections, we present our novel TS algorithms to minimize the Bayesian cumulative regret for the two problem settings considered in this paper.
3. Modified TS for CB with Unobserved True Contexts
In this section, we consider Setting 1 where the agent only observes the noisy context at each iteration t. Our proposed modified TS Algorithm is given in Algorithm 1.
| Algorithm 1: TS with unobserved true contexts () |
The proposed algorithm implements two steps in each iteration . In the first step, called the de-noising step, the agent uses the current observed noisy context and the history of past observed noisy contexts to obtain a predictive posterior distribution of the true context . This is a two-step process: first, the agent uses the history of past observed noisy contexts to compute the posterior distribution of as , where the conditional distribution is evaluated as in (3). Note that, to evaluate this posterior, the agent uses its knowledge of the context distribution , the prior and the noise channel likelihood . Using the derived posterior , the predictive posterior distribution of the true context is then obtained as
where is defined as in (2).
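In the illustrative notation used above, the two-step de-noising can be summarized as follows: a posterior over the channel parameter is formed from the past noisy contexts, and the predictive posterior of the current true context is then obtained by averaging the Bayes-rule predictive distribution over that posterior (a sketch of our reading of the de-noising equations, not a verbatim reproduction):

```latex
% De-noising step (illustrative notation): posterior over the channel parameter,
P(\gamma \mid \hat{c}_{1:t-1}) \;\propto\; P(\gamma)\, \prod_{s=1}^{t-1} P(\hat{c}_s \mid \gamma),
\qquad
% followed by the predictive posterior of the current true context,
\hat{P}(c_t \mid \hat{c}_{1:t})
  \;=\; \mathbb{E}_{\gamma \sim P(\cdot \mid \hat{c}_{1:t-1})}\big[ P(c_t \mid \hat{c}_t, \gamma) \big].
```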
The second step of the algorithm implements a modified Thompson sampling step. Note that, since the agent does not have access to the true contexts, it cannot evaluate the posterior distribution with known contexts,
as is conducted in standard contextual TS. Instead, the agent must evaluate the true posterior distribution under noisy contexts,
However, evaluating the marginal distribution is challenging even for Gaussian bandits as the mean of the reward distribution is, in general, a non-linear function of the true context . As a result, the posterior is analytically intractable.
Consequently, at each iteration t, the agent samples from a distribution that ‘approximates’ the true posterior . The specific choice of this sampling distribution depends on the problem setting. Ideally, one should choose a distribution that is sufficiently ‘close’ to the true posterior. In the next subsection, we explain this choice for Gaussian bandits.
Using the sampled and the predictive posterior distribution obtained from the de-noising step, the agent then chooses action at iteration t as
where is the expected feature map with respect to .
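To make the two steps concrete, the following is a minimal Python sketch of one round of the modified TS procedure; the callables `denoiser`, `sampler`, and `feature_map`, and all variable names, are our own abstractions rather than the paper's interface, and the expected feature map is estimated by Monte Carlo.

```python
import numpy as np

def modified_ts_round(actions, noisy_context, denoiser, sampler, feature_map, n_mc=100):
    """One round of the modified TS step (illustrative sketch).

    denoiser(noisy_context, n): draws n samples from the inferred predictive
        posterior of the true context (the de-noising step).
    sampler(): draws a reward parameter from the approximate posterior.
    feature_map(a, c): returns the m-dimensional feature vector.
    """
    theta = sampler()                              # approximate-posterior draw of the reward parameter
    context_draws = denoiser(noisy_context, n_mc)  # draws from the predictive posterior of the true context
    scores = []
    for a in actions:
        # Monte Carlo estimate of the expected feature map under the predictive posterior
        phi_bar = np.mean([feature_map(a, c) for c in context_draws], axis=0)
        scores.append(float(phi_bar @ theta))      # estimated mean reward of action a
    return actions[int(np.argmax(scores))]
```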
3.1. Linear-Gaussian Stochastic CBs
We now instantiate Algorithm 1 for Gaussian CBs. Specifically, we consider Gaussian bandits with the reward noise in (1) being Gaussian with mean 0 and variance . We also assume a Gaussian prior on the reward parameter with mean zero and a diagonal covariance matrix with entries . Here, denotes the identity matrix. The assumption of a diagonal prior covariance is in line with Lemma 3 in [19].
We consider a multivariate Gaussian context distribution with mean and covariance matrix . The context noise channel is similarly Gaussian with mean and covariance matrix . We assume the prior on the noise channel parameter to be Gaussian with a d-dimensional zero-mean vector and covariance matrix . We assume that and are all positive definite matrices known to the agent. The assumption of positive definite covariance matrices facilitates the Bayesian analysis adopted in this work. Similar assumptions were also required in the related work of [14].
For this setting, we can analytically evaluate the predictive posterior distribution as a multi-variate Gaussian with inverse covariance matrix,
where and , and with the mean vector
where
Derivations are presented in Appendix C.1.2.
For the modified-TS step, we sample from the approximate posterior distribution
where
and is the expected feature map defined in (12). This yields the approximate posterior to be a Gaussian distribution whose inverse covariance matrix and mean, respectively, evaluate as
The sampling distribution considered above is different from the true posterior distribution (10), which is analytically intractable. However, it bears resemblance to the posterior (9) when the true contexts are known, with the reward distribution replaced by . In Section 3.2.2, we show that the above choice of sampling distribution is indeed ‘close’ to the true posterior.
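A minimal sketch of this sampling step is given below, under the assumption (our reading of the construction above) that the approximate posterior is the Gaussian obtained by Bayesian linear regression on the expected feature maps; all names are ours.

```python
import numpy as np

def sample_theta_approx(phi_bars, rewards, sigma0_sq, sigma_sq, rng):
    """Draw theta from the approximate Gaussian posterior built from expected features.

    phi_bars : list of m-dimensional expected feature vectors, one per past round
    rewards  : list of observed rewards
    sigma0_sq: per-coordinate prior variance of theta
    sigma_sq : reward-noise variance
    """
    m = phi_bars[0].shape[0]
    precision = np.eye(m) / sigma0_sq                        # prior inverse covariance
    b = np.zeros(m)
    for phi_bar, r in zip(phi_bars, rewards):
        precision += np.outer(phi_bar, phi_bar) / sigma_sq   # accumulate inverse covariance
        b += phi_bar * r / sigma_sq
    cov = np.linalg.inv(precision)
    mean = cov @ b                                           # posterior mean
    return rng.multivariate_normal(mean, cov)                # Thompson sample of theta
```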
3.2. Bayesian Regret Analysis
In this section, we derive information-theoretic upper bounds on the Bayesian regret (7) of the modified TS algorithm for Gaussian CBs. To this end, we first outline the key information-theoretic tools required to derive our bound.
3.2.1. Preliminaries
To start, let and denote two probability distributions defined over the space of random variables x. Then, the Kullback–Leibler (KL)-divergence between the distributions and is defined as
if is absolutely continuous with respect to , and takes value ∞ otherwise. If x and y denote two random variables described by the joint probability distribution , the mutual information between x and y is defined as , where (and ) is the marginal distribution of x (and y). More generally, for three random variables x, y and z with joint distribution , the conditional mutual information between x and y given z evaluates as
where and are conditional distributions. We will also use the following variational representation of the KL-divergence, also termed the Donskar–Varadhan (DV) inequality,
which holds for any measurable function satifying the inequality .
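For completeness, the standard forms of the quantities referred to above are (in generic notation):

```latex
% KL divergence, (conditional) mutual information, and the Donsker--Varadhan inequality:
D(P \,\|\, Q) \;=\; \mathbb{E}_{x \sim P}\!\left[ \log \frac{\mathrm{d}P}{\mathrm{d}Q}(x) \right],
\qquad
I(x; y) \;=\; D\big( P_{x,y} \,\|\, P_x \otimes P_y \big),
\\[4pt]
I(x; y \mid z) \;=\; \mathbb{E}_{z}\Big[ D\big( P_{x,y \mid z} \,\|\, P_{x \mid z} \otimes P_{y \mid z} \big) \Big],
\qquad
D(P \,\|\, Q) \;\ge\; \mathbb{E}_{P}\big[ f(x) \big] - \log \mathbb{E}_{Q}\big[ e^{f(x)} \big].
```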
3.2.2. Information-Theoretic Bayesian Regret Bounds
In this section, we present information-theoretic upper bounds on the Bayesian regret of the modified TS algorithm. To this end, we first state our main assumption.
Assumption 1.
The feature map has bounded norm, i.e., .
The following theorem gives our main result.
Theorem 1.
Assume that the covariance matrices satisfy and where . Under Assumption 1, if , the following upper bound on the Bayesian regret of the modified TS algorithm holds,
where
The theorem above shows that the proposed TS algorithm achieves regret when the prior is highly informative with variance parameter satisfying the constraint .
Remark 1.
The assumption on the covariance matrices in Theorem 1 directly holds for diagonal covariance matrices with positive eigenvalues.
To prove the regret bound of Theorem 1, we start by defining
as the action that maximizes the mean reward corresponding to reward parameter . Using the above, the Bayesian cumulative regret (7) for the proposed TS algorithm can be decomposed as
In (23), the first term quantifies the Bayesian regret of our action policy (11) with respect to the action policy (22) for a CB with mean reward function . The second term accounts for the average difference in the cumulative mean rewards of the oracle optimal action policy (4), evaluated using the exact predictive distribution , and our action policy (11), that uses the inferred predictive posterior distribution . In this sense, captures the error in approximating the exact predictive distribution via the inferred predictive distribution . The third term similarly accounts for the average approximation error.
To derive an upper bound on the Bayesian regret , we separately upper bound each of the three terms in (23) as derived in the following lemmas. The lemma below presents an upper bound on .
Lemma 1.
To derive the upper bound in (24), we leverage results from [19] that study the information-theoretic Bayesian regret of standard contextual TS algorithms via the lifted information ratio. However, the results do not directly apply to our algorithm due to the posterior mismatch between the sampling distribution and the true posterior distribution . Consequently, our upper bound (24) consists of three terms: the first term, defined as in (21), corresponds to the upper bound on the Bayesian regret of contextual TS that assumes as the true posterior. This can be obtained by applying the lifted information ratio-based analysis of Cor. 2 in [19]. The second and third terms account for the posterior mismatch via the expected KL-divergence between the true posterior and the sampling distribution . In particular, this expected KL divergence can be upper bounded by (see Appendix C.1.3 for the proof) under the prior . Importantly, our result holds when this prior distribution is sufficiently concentrated, with its variance satisfying the inequality . This ensures that the contribution of the posterior mismatch to the Bayesian regret scales as .
The following lemma gives an upper bound on the sum .
Lemma 2.
Under Assumption 1, the following upper bound holds for ,
In addition, if the covariance matrices satisfy that and where , then (26) can be further upper bounded as
Lemma 2 shows that the error in approximating with , on average, can be quantified via the conditional mutual information between and true context given knowledge of observed noisy contexts up to and including iteration t.
Finally, combining Lemmas 1 and 2 with the choice of and gives us the regret bound in Theorem 1.
3.3. Beyond Gaussian Bandits
In the previous sections, we studied Gaussian bandits and analyzed the Bayesian regret. We will now discuss the potential extension of results beyond Gaussian bandits. As in [14], we will focus on Gaussian context distribution and context noise distribution, which helps to derive the upper bound on the estimation errors in Lemma 2.
To extend the Bayesian regret analysis to non-Gaussian bandits, Lemma 1 requires bandit-specific modifications. Specifically, the derivation of the term , that captures the standard Bayesian regret of contextual TS with as the true posterior, and that of the posterior mismatch term via the expected KL divergence critically depends on the type of bandit and the choice of the sampling posterior. The Bayesian regret bound is derived using the lifted information ratio-based approach of [19]. This can indeed be extended to non-Gaussian bandits like logistic bandits (see [19]) to obtain a modified term.
However, the analysis of posterior mismatch term for non-Gaussian bandits is non-trivial and depends on the specific bandit assumed. Firstly, to characterize the posterior mismatch via the expected KL divergence, our analysis requires the chosen sampling distribution to be sub-Gaussian. To choose the sampling distribution, one can follow the framework adopted in (15) and (16) and use an ‘appropriate’ reward distribution such that the KL divergence between the true reward distribution and the chosen reward distribution is small to minimize posterior mismatch, and the resulting sampling distribution is easy to sample from and has sub-Gaussian tails. Thus, analyzing the posterior mismatch for non-Gaussian bandits requires a case-by-case treatment. For Gaussian bandits, we control the above KL divergence by choosing a Gaussian distribution with mean as in (16). Finally, in Section 5, we extend Algorithm 1 to logistic bandits with the choice of sampling distribution motivated by (15) and (16) and use Langevin Monte Carlo to sample from this distribution.
4. TS for CB with Delayed True Contexts
In this section, we consider the CBs with delayed true context setting where the agent observes the true context after it observes the reward corresponding to the chosen action . Note, that at the time of choosing action , the agent has access only to noisy contexts. We specialize our TS algorithm to this setting, and call it Algorithm 2 (or ).
| Algorithm 2: TS for Delayed True contexts () |
Algorithm 2 follows similar steps as in Algorithm 1. However, different from Algorithm 1, at the tth iteration, the agent knows the history of true contexts in addition to that of noisy contexts. Consequently, in the de-noising step, the agent evaluates the predictive posterior distribution as
where is as defined in (2) and the posterior distribution is obtained via Bayes’ rule as , using the history of true and noisy contexts.
For the Gaussian context noise considered in Section 3.1, the predictive posterior distribution is multivariate Gaussian with the inverse covariance matrix,
and the mean vector
where and . Derivation can be found in Appendix B.2.4.
Following the de-noising step, the next step in Algorithm 2 is a conventional Thompson sampling step, thanks to the access to delayed true contexts. Consequently, the agent can evaluate the posterior distribution with known contexts as in (9) and use it to sample . For a Gaussian bandit with a Gaussian prior on , this posterior distribution is a multivariate Gaussian distribution whose inverse covariance matrix and mean, respectively, evaluate as
Using the sampled and the obtained predictive posterior distribution , the agent then chooses action as
where we use the expected feature map .
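For the Gaussian bandit, this conventional TS step reduces to the usual conjugate Gaussian (Bayesian linear-regression) update, which in our illustrative notation reads (with $\phi_s = \phi(a_s, c_s)$ computed from the delayed true contexts):

```latex
% Standard conjugate Gaussian posterior over the reward parameter (illustrative notation):
\Sigma_t^{-1} \;=\; \frac{1}{\sigma_0^2}\, I_m \;+\; \frac{1}{\sigma^2} \sum_{s=1}^{t} \phi_s \phi_s^{\top},
\qquad
\mu_t \;=\; \Sigma_t \left( \frac{1}{\sigma^2} \sum_{s=1}^{t} \phi_s\, r_s \right),
\qquad
\theta_t \;\sim\; \mathcal{N}(\mu_t, \Sigma_t).
```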
Information-Theoretic Bayesian Regret Bounds
In this section, we derive an information-theoretic upper bound on the Bayesian regret (7) of Algorithm 2 for Gaussian CBs. The following theorem presents our main result.
Theorem 2.
Under Assumption 1 and assuming that covariance matrices satisfy , the following inequality holds for when ,
where is as defined in (21).
Theorem 2 shows that Algorithm 2 achieves regret with the choice of if . Furthermore, due to the absence of posterior mismatch, the upper bound above is tighter than that of Theorem 1.
We now outline the main lemmas required to prove Theorem 2. To this end, we re-use the notation
to define the optimal action maximizing the mean reward .
To derive the regret upper bound in Theorem 2, we first decompose the Bayesian cumulative regret (7) of Algorithm 2 (), similar to (23), into the following three terms,
An upper bound on can then be obtained by separately bounding each of the three terms in (35).
In (35), the first term corresponds to the Bayesian cumulative regret of a standard contextual TS algorithm that uses for as the mean reward function. Note that, due to the availability of delayed true contexts, there is no posterior mismatch in Algorithm 2. Hence, we apply Cor. 3 in [19] to yield the following upper bound on .
Lemma 3.
Lemma 3 gives a tighter bound in comparison to Lemma 1 where the posterior mismatch results in additional error terms in the regret bound.
We now upper bound the second term of (35), which, similar to the term in (23), captures the error in approximating the exact predictive distribution via the inferred predictive distribution . The following lemma shows that this approximation error over T iterations can be quantified, on average, via the mutual information between and the T-length history of observed true and noisy contexts. This bound also holds for the third term of (35), which similarly accounts for the average approximation error.
Lemma 4.
Under Assumption 1, for any , we have the following upper bound,
Furthermore, if the covariance matrices satisfy that , we obtain that
Combining Lemmas 3 and 4 then gives us the upper bound on in Theorem 2.
5. Experiments and Final Remarks
In this section, we experimentally validate the performance of the proposed algorithms on synthetic and real-world datasets. Details of implementation can be found in Appendix D.
Synthetic Datasets: For synthetic datasets, we go beyond Gaussian bandits and evaluate our algorithms for logistic contextual bandits (see Figure 1 (Left) and (Center)). In both settings, we consider Gaussian contexts and context noise as in Section 3.1 with parameters , , for some . We further consider the action and context to be dimensional vectors, with and , respectively, denoting their ith components. We use as the m-dimensional feature vector.
Figure 1.
Comparison of Bayesian regret of proposed algorithms with baselines as a function of number of iterations. (Left): Gaussian bandits with , ; (Center) Logistic bandits with , , ; (Right) MovieLens dataset with added Gaussian context noise and Gaussian prior: parameters set as , .
Gaussian Bandits: The mean reward function is given by with the feature map described above. Other parameters are fixed as , and . Plots are averaged over 100 independent trials.
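For reference, the Bayesian regret curves reported in Figure 1 can be estimated along the following lines (a sketch with our own naming; `run_trial` stands for one simulated episode of whichever policy is being evaluated and is not a function from the paper):

```python
import numpy as np

def estimate_bayesian_regret(run_trial, T, n_trials, rng):
    """Monte Carlo estimate of the cumulative Bayesian regret curve.

    run_trial(rng, T) is assumed to (i) draw the reward and channel parameters from
    their priors, (ii) simulate T bandit rounds, and (iii) return the per-round gap
    between the oracle's mean reward and the agent's mean reward on the same contexts.
    """
    regret = np.zeros(T)
    for _ in range(n_trials):
        per_round_gap = np.asarray(run_trial(rng, T))
        regret += np.cumsum(per_round_gap)          # cumulative regret for this trial
    return regret / n_trials                        # average over trials, i.e., over the priors
```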
Logistic Bandits: The reward is Bernoulli with mean reward given by , where is the sigmoid function. We consider a Gaussian prior over . In Algorithm 1, we choose the sampling distribution
However, the posterior is analytically intractable since Bernoulli reward-Gaussian prior forms a non-conjugate distribution pair. Consequently, we use Langevin Monte Carlo (LMC) [20] to sample from . We run LMC for iterations with learning rate and inverse temperature . Plots are averaged over 10 independent trials.
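A sketch of the LMC sampler we have in mind is given below; the approximate log-posterior is built from the expected feature maps (our assumption, mirroring the Gaussian-bandit construction), and all variable names are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lmc_sample_theta(phi_bars, rewards, sigma0_sq, n_iters, step, inv_temp, rng):
    """Langevin Monte Carlo draw from the approximate logistic-bandit posterior (sketch).

    The negative log-posterior combines the Gaussian prior on theta with the Bernoulli
    log-likelihood evaluated at the expected feature maps phi_bars.
    """
    Phi = np.stack(phi_bars)                              # (t, m) expected feature maps
    r = np.asarray(rewards, dtype=float)                  # (t,) binary rewards
    m = Phi.shape[1]
    theta = rng.standard_normal(m) * np.sqrt(sigma0_sq)   # initialize from the prior
    for _ in range(n_iters):
        mu = sigmoid(Phi @ theta)
        grad = theta / sigma0_sq - Phi.T @ (r - mu)       # gradient of the negative log-posterior
        noise = rng.standard_normal(m)
        theta = theta - step * grad + np.sqrt(2.0 * step / inv_temp) * noise
    return theta
```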
MovieLens Dataset: We use the MovieLens-100K dataset [21] to evaluate the performance of our algorithms. To utilize this dataset, we first perform non-negative matrix factorization on the rating matrix with 3 latent factors to obtain , where and . Each row vector corresponds to a user context, while each column vector corresponds to movie (action) features. The mean and variance of the Gaussian context distribution are estimated from the row vectors of W. We then add Gaussian noise to the contexts as in the synthetic settings with .
We apply the K-means algorithm to the column vectors of H to group the actions into clusters. We use to denote the centroid and to denote the variance of the kth cluster. We then fix the mean and variance of the Gaussian prior over as and , respectively, with denoting the identity matrix. The feature vector is then fixed as a 60-dimensional vector with at the index of the cluster k to which action a belongs and zeros everywhere else. We further add zero-mean Gaussian noise to the mean reward with variance . The Bayesian oracle in this experiment has access to the exact context noise parameter sampled from the Gaussian prior with variance , as well as the true sampled from the Gaussian prior .
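The preprocessing described above can be reproduced roughly as follows (a sketch using scikit-learn; the number of clusters, random seeds, and the exact feature construction are our assumptions based on the description, not the paper's code):

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

def preprocess_movielens(rating_matrix, n_latent=3, n_clusters=20, seed=0):
    """Factorize the rating matrix and build context/prior statistics (sketch)."""
    nmf = NMF(n_components=n_latent, init="nndsvda", random_state=seed, max_iter=500)
    W = nmf.fit_transform(rating_matrix)      # rows of W: user contexts
    H = nmf.components_                       # columns of H: movie (action) features

    mu_context = W.mean(axis=0)               # mean of the Gaussian context distribution
    var_context = W.var(axis=0)               # per-coordinate variance estimate

    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    labels = km.fit_predict(H.T)              # cluster the movie feature vectors
    centroids = km.cluster_centers_           # per-cluster prior means for the reward parameter
    cluster_vars = np.array([H.T[labels == k].var() for k in range(n_clusters)])

    return W, (mu_context, var_context), (centroids, cluster_vars), labels
```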
Baselines: We compare our algorithms with two baselines: and . In , the agent observes only noisy contexts but is unaware of the presence of noise. Consequently, it naively implements conventional TS with noisy context . This sets the benchmark for the worst-case achievable regret. The second baseline assumes that the agent knows the true channel parameter , a setting studied in [18], and can thus perform exact denoising via the predictive posterior distribution . This algorithm sets the benchmark for the best achievable regret.
Figure 1 (Left) corroborates our theoretical findings for Gaussian bandits. In particular, our algorithms (Algorithms 1 and 2) demonstrate sub-linear regret and achieve robust performance comparable to the best achievable performance of . We remark that while our regret analysis of Gaussian bandits is motivated by the tractability of the posterior distributions and the concentration properties of Gaussians, our empirical results for logistic bandits in Figure 1 (Center) show a promising extension of our algorithms to non-conjugate distributions. The extension of the Bayesian regret analysis to such general distributions is left for future work. Further, our experiments on the MovieLens data in Figure 1 (Right) validate the effectiveness of our algorithm in comparison to the benchmarks. The plot shows that our approach outperforms and achieves regret comparable to that of , which is the best achievable.
6. Conclusions
We studied a stochastic CB problem where the agent observes noisy contexts through a noise channel with an unknown channel parameter. For Gaussian bandits and Gaussian context noise, we introduced a TS algorithm that achieves Bayesian regret. The setting of Gaussian bandits with Gaussian noise was chosen for the tractability of the posterior distributions used in the proposed TS algorithms. We believe that the algorithm and the key lemmas can be extended to settings where the likelihood and prior form a conjugate pair. Extension to general distributions is left for future work.
Finally, we conjecture that our proposed modified TS algorithm and the information-theoretic Bayesian regret analysis could be extended to noisy contexts in multi-task bandit settings. In this regard, a good starting point would be to leverage prior works that study multi-armed hierarchical bandits [22] and contextual hierarchical bandits [23] with linear-Gaussian reward models. However, the critical challenge is to evaluate the posterior mismatch which requires a case-by-case analysis.
Author Contributions
Conceptualization, S.T.J. and S.M.; Methodology, S.T.J.; Formal analysis, S.T.J. and S.M.; Writing – original draft, S.T.J. and S.M. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The data presented in this study are openly available in Kaggle, https://www.kaggle.com/datasets/prajitdatta/movielens-100k-dataset, accessed on 1 July 2023.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Preliminaries
Definition A1
(Sub-Gaussian Random Variable). A random variable y is said to be -sub-Gaussian with respect to the distribution if the following inequality holds:
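In its standard form (which is what the definition above refers to), the sub-Gaussian condition reads:

```latex
% Standard sub-Gaussian condition for a random variable y under distribution P:
\mathbb{E}_{y \sim P}\Big[ \exp\big( \lambda\, ( y - \mathbb{E}_{P}[y] ) \big) \Big]
  \;\le\; \exp\!\left( \frac{\lambda^2 \sigma^2}{2} \right)
  \qquad \text{for all } \lambda \in \mathbb{R}.
```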
Lemma A1
(Change of Measure Inequality). Let be a random vector and denote a real-valued function. Let and be two probability distributions defined on the space of x. If is -sub-Gaussian with respect to , then the following inequality holds,
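The inequality in question is the usual sub-Gaussian change-of-measure bound, which in generic notation (our reconstruction, since the display is not reproduced) reads:

```latex
% Change-of-measure inequality for a function f that is sigma-sub-Gaussian under Q:
\mathbb{E}_{x \sim P}\big[ f(x) \big] - \mathbb{E}_{x \sim Q}\big[ f(x) \big]
  \;\le\; \sqrt{\, 2\, \sigma^2\, D(P \,\|\, Q) \,}.
```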
Proof.
Lemma A2.
Let be distributed according to , i.e., each element of the random vector is independently distributed according to a Gaussian distribution with mean and variance . Let denote the maximum of n Gaussian random variables. Then, the following inequality holds for ,
For any distribution that is absolutely continuous with respect to , we then have the following change of measure inequality,
Proof.
The proof of inequality (A8) follows from standard analysis (see [14]). We present it here for the sake of completeness. The following sequence of relations hold for any ,
Taking logarithm on both sides of the inequality yields the upper bound in (A8). We now apply the DV inequality (20) as in (A3). This yields that
where the inequality in follows from (A8). The inequality in follows from observing that for all i, whereby we obtain that which holds for all i. The latter inequality implies that . Re-arranging and optimizing over then yields the required inequality in (A9). □
Appendix B. Linear-Gaussian Contextual Bandits with Delayed Contexts
In this section, we provide all the details relevant to the Bayesian cumulative regret analysis of TS for delayed, linear-Gaussian contextual bandits.
Appendix B.1. TS Algorithm for Linear-Gaussian Bandits with Delayed True Contexts
The pseudocode for the TS algorithm for Gaussian bandits is given in Algorithm A1.
| Algorithm A1: TS with Delayed Contexts for Gaussian Bandits () |
Appendix B.2. Derivation of Posterior and Predictive Posterior Distributions
In this section, we provide detailed derivation of posterior predictive distribution for Gaussian bandits. To this end, we first derive the exact predictive distribution .
Appendix B.2.1. Derivation of
Appendix B.2.2. Derivation of
Appendix B.2.3. Derivation of
We now derive the posterior distribution . To this end, we use Bayes’ theorem as
We then have,
Consequently, we obtain that,
where
Appendix B.2.4. Derivation of Posterior Predictive Distribution
Using results from previous subsections, we are now ready to derive the posterior predictive distribution . Note, that . We then have the following set of relations:
where , , and .
Since , we obtain
This gives
Appendix B.3. Proof of Lemma 3
We now present the proof of Lemma 3. To this end, we first recall that where , and we denote as the posterior distribution of given the history of observed reward-action-context tuples. We can then equivalently write as
where is as defined in (33) and we have used to denote the mean-reward function. To obtain an upper bound on , we define the following lifted information ratio as in [19],
where
with denoting the expectation of mean reward with respect to the posterior distribution . Subsequently, we obtain the following upper bound
where the last inequality follows by an application of Cauchy–Schwarz inequality. An upper bound on then follows by obtaining an upper bound on the lifted information ratio as well as on
We first evaluate the term . To this end, note that , with defined as in (32). Using this, we obtain
where is as in (31), and the third equality follows since, conditional on , is independent of . Subsequently, we can apply the elliptical potential lemma (Lemma 19.4 in [24]) using the assumption that and that . This results in
To upper bound the lifted information ratio term , we can use Lemma 7 in [19]. To demonstrate how to leverage results from [19], we start by showing that the inequality holds. To this end, we note that the lifted information ratio can be equivalently written as
which follows since
where the second equality holds since, conditioned on , and are independent. In the third equality, we denote and . Using these, the last equality follows since , i.e., . Now, let us define a matrix M with entries given by
Using this and noting that , we obtain that . We now try to bound in terms of the matrix M. To see this, we can equivalently write as
where the inequality follows by the application of Jensen’s inequality. We thus obtain that
where the last inequality follows from Prop. 5 in [25]. From Lemma 3 in [19], we also obtain . This results in the upper bound .
Appendix B.4. Proof of Lemma 4
We now prove an upper bound on the term . To this end, let us define the following event:
Note, that since , we obtain that with probability at least , the following inequality holds . Since , the above inequality in turn implies the event such that .
where the inequality follows from the definition of , and denotes the indicator function which takes value 1 when • is true and takes value 0 otherwise. The inequality in follows by noting that
where the last inequality is due to . To obtain an upper bound on , we note that the following set of inequalities hold:
where follows since , follows since implies that . The equality in follows by noting that , where , follows a folded Gaussian distribution with density where is the Gaussian density. The equality in follows by noting that , where is the derivative of the Gaussian density. Thus, we have the following upper bound
We now obtain an upper bound on . To this end, note that
Note, that under the event , we have the following relation,
whereby is -sub-Gaussian.
Consequently, applying Lemma A1 gives the following upper bound
where the equality in follows by the definition of conditional mutual information
and inequality in follows since due to the non-negativity of mutual information, and finally, the equality in follows from the chain rule of mutual information.
We now analyze the mutual information which can be written as
where . Using this, we can equivalently write
Under the assumption that , we have , whereby using the determinant-trace inequality we obtain,
Using this in (A44), we obtain that
Finally, using this in (A39), gives the following upper bound
We finally note that the same upper bound holds for the term .
Appendix C. Linear-Gaussian Noisy Contextual Bandits with Unobserved True Contexts
Appendix C.1. Derivation of Posterior Predictive Distribution
In this section, we derive the posterior predictive distribution for Gaussian bandits with Gaussian context noise. To this end, we first derive the posterior .
Appendix C.1.1. Derivation of Posterior
Using Bayes’ theorem, we have
where is derived in (A18). Subsequently, we have that
where we have denoted . We then obtain
Appendix C.1.2. Derivation of
The derivation of the posterior predictive distribution follows along similar lines as that in Appendix B.2.4. We start the derivation by noting that .
Subsequently, we have
where . Thus, we have,
Appendix C.1.3. Evaluating the KL Divergence between the True Posterior and Sampling Distribution
In this subsection, we analyze the true posterior distribution and the approximate sampling distribution , and derive the KL divergence between them. To see this, note that from Bayes’ theorem, we have the following joint probability distribution
whereby we obtain that
In particular, for general feature maps , the distribution cannot be exactly evaluated, even under Gaussian assumptions, rendering the posterior intractable in general.
In contrast to this, our approximate TS-algorithm scheme assumes the following joint probability distribution,
where
Consequently, we have
As a result, we can upper bound the KL divergence as
In the above series of relationships,
- equality in follows by noting that and , and applying Jensen’s inequality to the jointly convex KL divergence,
- equality in follows from evaluating the KL divergence between two Gaussian distributions with same variance and with means and respectively,
- inequality in follows from application of Cauchy–Schwarz inequality,
- inequality in follows from Assumption 1,
- inequality in follows from , since .
Appendix C.2. Proof of Lemma 1
For notational simplicity, throughout this section we use to denote the expected feature map. Furthermore, we use .
We start by distinguishing the true and approximated posterior distributions. Recall that denotes the true posterior and denotes the approximated posterior. We then denote as the distribution of and conditioned on , while denote the distribution of and under the sampling distribution. Furthermore, we have that . We start by decomposing into the following three differences,
We will separately upper bound each of the three terms in the above decomposition.
Appendix C.2.1. Upper Bound on
To obtain an upper bound on , note that the following equivalence holds . Using this, we can rewrite as
Note here that when , for each , follows a Gaussian distribution with mean and variance , where and are as defined in (18) and (17), respectively. Thus, is the average of the maximum of Gaussian random variables. We can then apply Lemma A2 with , , , and to obtain that
Using this, we obtain that
where the inequality in follows from Cauchy–Schwarz inequality, and the inequality in follows since
which follows since and . The inequality in follows from (A66).
If , we obtain that
Appendix C.2.2. Upper Bound on
We can bound by observing that
where we used . Note, that for , the random variable is Gaussian with mean and variance . Consequently, is also -sub-Gaussian according to Definition A.1. By using Lemma A1, we then obtain that
Using Cauchy–Schwarz inequality then yields that
where the second inequality follows from (A66). As before, if , we then obtain that
Appendix C.2.3. Upper Bound on
Note that in , , whereby the posterior is matched. Hence, one can apply bounds from conventional contextual Thompson sampling here. For simplicity, we use to denote the expectation with respect to . To this end, as in the proof of Lemma 3, we start by defining an information ratio,
using which we obtain the upper bound on as
by the Cauchy–Schwarz inequality.
Furthermore, we have
where and are defined as in (18) and (17). Subsequently, using the elliptical potential lemma, we obtain
To obtain an upper bound on the information ratio , we define and and let
It is easy to see that
where the second and last equality follows since . Similarly, we can relate with the matrix as
whereby we obtain
where the last inequality can be proved as in [25]. Following [19], it can be seen that also holds. Using this, together with the upper bound (A79) gives
Appendix C.3. Proof of Lemma 2
We first give an upper bound on the estimation error that does not require the assumption of a linear feature map.
Appendix C.3.1. A General Upper Bound on
To obtain an upper bound on that does not require the assumption that , we leverage the same analysis as in the proof of Lemma 4. Subsequently, we obtain that
where the event is defined as in (A37). Subsequently, the first summation can be upper bounded as
where U is defined as in (A37), and is as in (13).
To derive the last inequality, we observe the following series of relationships starting from (13):
where the equality in follows from the Woodbury matrix identity and since, by the assumption that , , the matrix is invertible. Now,
where the inequality in follows from the determinant-trace inequality. Subsequently, we have
where the inequality in follows since for and the inequality in follows since by assumption , whereby we have and consequently, . Finally, we use that .
We thus obtain that
for . We note that same upper bound holds for the term .
Appendix C.3.2. Upper Bound for Linear Feature Maps and Scaled Diagonal Covariance Matrices
We now obtain an upper bound on the estimation error under the assumption of a linear feature map such that . The following set of inequalities hold:
Note that , where and are, respectively, defined in (13) and (14). Consequently, is -sub-Gaussian with respect to . Using Lemma A1, we can then upper bound the inner expectation of (A89) as
Summing over t and using Cauchy–Schwarz inequality then gives
We now evaluate the KL-divergence term. To this end, note that conditioned on and , is independent of , i.e., . This gives
where and is as in (13). The first equality follows by noting that
with the outer expectation taken over and . The last inequality is proved in Appendix C.3.3 using that , and .
Appendix C.3.3. Analysis of for Scaled Diagonal Covariance Matrices
Assume that , and . Then, from (13), we obtain
This implies
whereby we obtain
Noting that we then have
Subsequently, we obtain that for ,
whereby
where the last inequality follows since and
Appendix D. Details on Experiments
In this section, we present details on the baselines implemented for stochastic CBs with unobserved true contexts.
Appendix D.1. Gaussian Bandits
For Gaussian bandits, we implemented the baselines as explained below.
- TS_naive: This algorithm implements the following action policy at each iteration t,
- TS_oracle: In this baseline, the agent has knowledge of the true predictive distribution . Consequently, at each iteration t, the algorithm chooses action
Figure A1.
Bayesian cumulative regret of Algorithm 1 as a function of iterations over varying number K of actions.
Appendix D.2. Logistic Bandits
In the case of logistic bandits, we implemented the baselines as explained below.
- TS_naive: This algorithm considers the following sampling distribution:
- TS_oracle: This algorithm considers the following sampling distribution:
References
- Srivastava, V.; Reverdy, P.; Leonard, N.E. Surveillance in an abruptly changing world via multiarmed bandits. In Proceedings of the IEEE Conference on Decision and Control (CDC), Los Angeles, CA, USA, 15–17 December 2014; pp. 692–697.
- Aziz, M.; Kaufmann, E.; Riviere, M.K. On multi-armed bandit designs for dose-finding clinical trials. J. Mach. Learn. Res. 2021, 22, 686–723.
- Anandkumar, A.; Michael, N.; Tang, A.K.; Swami, A. Distributed algorithms for learning and cognitive medium access with logarithmic regret. IEEE J. Sel. Areas Commun. 2011, 29, 731–745.
- Srivastava, V.; Reverdy, P.; Leonard, N.E. On optimal foraging and multi-armed bandits. In Proceedings of the Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 2–4 October 2013; pp. 494–499.
- Bubeck, S.; Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv 2012, arXiv:1204.5721.
- Abe, N.; Biermann, A.W.; Long, P.M. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica 2003, 37, 263–293.
- Agarwal, D.; Chen, B.C.; Elango, P.; Motgi, N.; Park, S.T.; Ramakrishnan, R.; Roy, S.; Zachariah, J. Online models for content optimization. Adv. Neural Inf. Process. Syst. 2008, 21, 17–24.
- Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 2002, 47, 235–256.
- Chu, W.; Li, L.; Reyzin, L.; Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 208–214.
- Agrawal, S.; Goyal, N. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 127–135.
- Lamprier, S.; Gisselbrecht, T.; Gallinari, P. Profile-based bandit with unknown profiles. J. Mach. Learn. Res. 2018, 19, 2060–2099.
- Kirschner, J.; Krause, A. Stochastic bandits with context distributions. Adv. Neural Inf. Process. Syst. 2019, 32, 14113–14122.
- Yang, L.; Yang, J.; Ren, S. Multi-feedback bandit learning with probabilistic contexts. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, Main Track, Yokohama, Japan, 11–17 July 2020.
- Kim, J.H.; Yun, S.Y.; Jeong, M.; Nam, J.; Shin, J.; Combes, R. Contextual linear bandits under noisy features: Towards Bayesian oracles. In Proceedings of the International Conference on Artificial Intelligence and Statistics, PMLR, Valencia, Spain, 25–27 April 2023; pp. 1624–1645.
- Guo, Y.; Murphy, S. Online learning in bandits with predicted context. arXiv 2023, arXiv:2307.13916.
- Roy, D.; Dutta, M. A systematic review and research perspective on recommender systems. J. Big Data 2022, 9, 59.
- Russo, D.; Van Roy, B. Learning to optimize via posterior sampling. Math. Oper. Res. 2014, 39, 1221–1243.
- Park, H.; Faradonbeh, M.K.S. Analysis of Thompson sampling for partially observable contextual multi-armed bandits. IEEE Control Syst. Lett. 2021, 6, 2150–2155.
- Neu, G.; Olkhovskaia, I.; Papini, M.; Schwartz, L. Lifting the information ratio: An information-theoretic analysis of Thompson sampling for contextual bandits. Adv. Neural Inf. Process. Syst. 2022, 35, 9486–9498.
- Xu, P.; Zheng, H.; Mazumdar, E.V.; Azizzadenesheli, K.; Anandkumar, A. Langevin Monte Carlo for contextual bandits. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 24830–24850.
- Harper, F.M.; Konstan, J.A. The MovieLens datasets: History and context. ACM Trans. Interact. Intell. Syst. (TIIS) 2015, 5, 1–19.
- Hong, J.; Kveton, B.; Zaheer, M.; Ghavamzadeh, M. Hierarchical Bayesian bandits. In Proceedings of the International Conference on Artificial Intelligence and Statistics, PMLR, Virtual, 28–30 March 2022; pp. 7724–7741.
- Hong, J.; Kveton, B.; Katariya, S.; Zaheer, M.; Ghavamzadeh, M. Deep hierarchy in bandits. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 8833–8851.
- Lattimore, T.; Szepesvári, C. Bandit Algorithms; Cambridge University Press: Cambridge, UK, 2020.
- Russo, D.; Van Roy, B. An information-theoretic analysis of Thompson sampling. J. Mach. Learn. Res. 2016, 17, 2442–2471.