Article

Single-Stage Causal Incentive Design via Optimal Interventions

by Sebastián Bejos 1,*, Eduardo F. Morales 1, Luis Enrique Sucar 1 and Enrique Munoz de Cote 2
1 Computer Science Department, National Institute of Astrophysics, Optics and Electronics, Puebla 72840, Mexico
2 Mutable Tactics, London CB2 9PJ, UK
* Author to whom correspondence should be addressed.
Entropy 2026, 28(1), 4; https://doi.org/10.3390/e28010004
Submission received: 5 July 2025 / Revised: 2 December 2025 / Accepted: 5 December 2025 / Published: 19 December 2025
(This article belongs to the Special Issue Causal Graphical Models and Their Applications)

Abstract

We introduce Causal Incentive Design (CID), a framework that applies causal inference to canonical single-stage principal–agent problems (PAPs) characterized by bilateral private information. Within CID, the operating rules of PAPs are formalized using an additive-noise causal graphical model (CGM). Incentives are modeled as interventions on a function space variable, Γ , which correspond to policy interventions in the principal–follower causal relation. The causal inference target estimand V ( γ ) is defined as the expected value of the principal’s utility variable under a specified policy intervention in the post-intervention distribution. In the context of additive-Gaussian independent noise, the estimand V ( γ ) decomposes into a two-layer expectation: (i) an inner Gaussian smoothing of the principal’s utility regression; and (ii) an outer averaging over the conditional probability of the follower’s action given the incentive policy. A Gauss–Hermite quadrature method is employed to efficiently estimate the first layer, while a policy-local kernel reweighting approach is used for the second. For offline selection of a single incentive policy, a Functional Causal Bayesian Optimization (FCBO) algorithm is introduced. This algorithm models the objective functional γ ↦ V ( γ ) using a functional Gaussian process surrogate defined on a Reproducing Kernel Hilbert Space (RKHS) domain and utilizes an Upper Confidence Bound (UCB) acquisition functional. Consequently, the policy value V ( γ ) becomes an interventional query that can be answered using offline observational data under standard identifiability assumptions. High-probability cumulative-regret bounds are established in terms of differential information gain for the proposed FCBO algorithm. Collectively, these elements constitute the central contributions of the CID framework, which integrates causal inference through identification and estimation with policy search in principal–agent problems under private information. This approach establishes a causal decision-making pipeline that enables commitment to a high-performing incentive in a single-shot game, supported by regret guarantees. Provided that the data used for estimation are sufficient, the resulting offline pipeline is appropriate for scenarios where adaptive deployment is impractical or costly. Beyond the methodological contribution, this work introduces a novel application of causal graphical models and causal reasoning to incentive design and principal–agent problems, which are central to economics and multi-agent systems.

1. Introduction

In an incentive design (ID) problem, a central institution alters the behavior of self-interested agents to improve overall performance by implementing an incentive function that adjusts their individual payoffs. The central institution, typically represented as a principal agent, may aim to drive the system’s performance towards a more desirable behavior, such as maximizing revenue or social welfare within the multi-agent system. This work concentrates on the single-stage canonical principal–agent problem (PAP), in which an incentive designer agent (the principal) commits to an incentive function that influences the behavior of a single follower. The interaction is a one-shot hierarchical inverse Stackelberg game, in which the principal first announces an incentive function; the follower then responds optimally based on that commitment, and the outcomes are realized without any subsequent adaptation or additional rounds. We maintain the assumption of private information for both players; specifically, the principal is unaware of the follower’s utility function and vice versa.
The single-stage PAP is the canonical problem in incentive design (ID) because it includes the core structural features in a one-shot setting: the principal commits once to an incentive, the follower best-responds, and outcomes are realized; information asymmetry and incomplete knowledge persist; and strategic reasoning is driven by ex ante beliefs about utilities or types. While there is no intertemporal adaptation or belief updating between rounds, the principal–agent (leader–follower) tension remains fundamental: one party creates incentives in the face of uncertainty, while the other optimizes behavior in response to that commitment. These features remain structurally invariant when extending the single-period formulation to multiple followers and multiple principals. With private information on both sides, the design task still entails bilateral inference at design time—reasoning about the other party’s unknown objectives—which is the core difficulty that scales to richer multi-agent settings. Thus, the single-stage formulation preserves the irreducible skeleton of ID problems and the central epistemic and incentive-alignment challenges that govern their multi-agent variants [1].
Consider a lender as the principal and a borrower as the follower. The lender must choose, once, a contract rule—an incentive function that maps borrower attributes or application information to loan terms (e.g., interest rate, credit limit, collateral, or a repayment rebate). The borrower then decides whether to accept the contract and, if funded, how much effort to exert toward repayment. Both sides hold private information (e.g., the borrower’s risk and effort costs, and the lender’s risk appetite and cost structure). The outcome—acceptance, repayment, or default, and the lender’s realized payoff—occurs once; there is no opportunity to update terms after observing behavior. This one-shot contract choice is exactly a single-stage PAP: the lender designs the incentive rule under uncertainty; the borrower best-responds; the realized payoff reflects both the selection induced by the rule and the behavioral response it elicits.
We treat the principal’s single-stage choice as the endpoint of a finite-horizon sequential optimization carried out before deployment. The principal starts from historical observations of the system—past incentive rules, the follower’s responses, and realized payoffs—and uses them to estimate the principal’s expected utility under any candidate incentive function. A data-efficient search procedure (Bayesian optimization over a space of incentive functions) iteratively proposes candidate incentives, updates beliefs about expected utility using the accumulated evidence, and records the most promising options. After a fixed number of iterations, the principal selects one incentive function—the estimated best—to implement in the single-stage game. This procedure makes no restrictive assumptions about the players’ utility forms and respects the private-information setting; it is simply a principled way to use prior observations to select a high-value one-shot incentive.

1.1. Approach and Contributions

Even in the single-stage canonical principal–agent problem (PAP), it is difficult to predict how a committed incentive will affect the follower’s behavior under bilateral private information. We address this challenge by integrating causal inference with incentive design: we represent the single-stage PAP as a causal graphical model (CGM) in which the incentive function is treated as an intervention on the function space variable Γ , and we target the principal’s utility variable J L under that policy, V ( γ ) := E [ J L ∣ do ( Γ = γ ) ] . This causal target is identified via the g-formula, enabling estimation of policy values from observational data without requiring controlled experiments. We then cast policy selection as a black-box sequential offline optimization over a function space of incentives, using Bayesian optimization to efficiently search among costly evaluations of V ( γ ) before committing to a single policy in the one-shot game.
In the proposed framework, the leader’s policy Γ acts as a functional intervention, the follower’s action is an explicit mediator, and the leader’s payoff J L is the outcome node. The policy value V ( γ ) = E [ J L ∣ d o ( Γ = γ ) ] constitutes an interventional and counterfactual query on this causal graphical model, which can be identified and estimated from observational data (offline logs) under the assumptions of positivity and additive noise. Accordingly, this work advances causal reasoning with graphical models and introduces a novel application domain in incentive design and principal–agent models, thereby complementing established applications in biology, medicine, and economics.
We formalize the single-stage canonical PAP (one principal and one follower) as an additive-noise CGM with endogenous variables { Γ , Ω L , Ω F , J F , J L } that capture the inverse-Stackelberg semantics Ω L = γ ( Ω F ) + U L , where incentives are interventions do ( Γ = γ ) . From this model, we derive a non-parametric identification of V ( γ ) by marginalizing over Ω L and Ω F . In the additive-Gaussian setting, the estimand becomes a nested expectation with two interpretable components: an inner expectation (a Gaussian smoothing of the outcome regression μ L ( Ω L , Ω F , Γ ) ) and an outer expectation over the mediator law p ( Ω F ∣ Γ = γ ) induced by the incentive. Practically, we compute the inner term by Gauss–Hermite quadrature and approximate the outer term by policy-local kernel reweighting in policy space; a linear credit-market example (affine pricing rule) illustrates a closed-form specialization.
We optimize γ V ( γ ) over an admissible set Γ modeled as a Reproducing Kernel Hilbert Space (RKHS) H k . A functional Gaussian process (GP) surrogate on H k —with a functional RBF kernel driven by RKHS distances between incentive functions—provides posterior mean and uncertainty over policy values. A functional GP-UCB acquisition functional trades off exploration and exploitation to propose the next incentive to evaluate, building a data set { ( γ t , V ^ ( γ t ) ) } t = 1 T from which we select the best γ + to deploy in the single-stage game. We provide high-probability cumulative-regret bounds for this Stackelberg Functional Causal Bayesian Optimization (FCBO) procedure in terms of differential information gain, with guarantees for both finite policy sets and infinite (non-parametric) RKHS domains.
Our pipeline builds on established ingredients: causal identification via the g-formula and policy interventions [2,3], RKHS/GP surrogates for Bayesian optimization [4,5,6], and GP-UCB-style exploration with information-gain regret analysis [7,8]. We make four key contributions: (i) a PAP-specific CGM that treats incentives as functional interventions and makes the follower action an explicit mediator of policy value; (ii) a two-layer identification and estimation recipe for single-stage PAPs, using Gaussian inner smoothing with policy-local outer reweighting; (iii) a functional GP surrogate on an RKHS of incentive rules with a policy-space RBF and support-aware acquisition; and (iv) regret bounds stated in terms of differential information gain for the Stackelberg functional setting with a uniform sub-Gaussian envelope for estimator noise. Figure 1 sketches the full CID pipeline (identification and estimation of V ( γ ) and FCBO policy search) and Figure 2 depicts one FCBO iteration; shaded boxes in Figures 1 and 2 mark components introduced in this paper. Our contributions can be summarized as follows:
  • CID for single-stage PAPs. We establish a generic CGM for the single-stage canonical PAP in which incentives are formalized as functional interventions, yielding a principled causal estimand for policy value V ( γ ) = E [ J L ∣ do ( Γ = γ ) ] identified from observational data via the g-formula.
  • Semi-parametric estimation strategy. We develop a practical, modular estimator for V ( γ ) under additive Gaussian noise: (a) learn the principal’s action mechanism and outcome regression; (b) compute the inner Gaussian expectation by Gauss–Hermite quadrature; and (c) approximate the induced mediator law with policy-local kernel weights. A linear credit-market pricing example provides a closed form that clarifies the roles of policy parameters and the induced borrowing response.
  • Functional Bayesian optimization for incentives. We propose the Stackelberg FCBO algorithm: a functional GP surrogate on H k with a functional GP-UCB acquisition functional to sequentially (offline) optimize over incentive functions when policy evaluations are expensive and noisy.
  • Theoretical guarantees. We prove high-probability cumulative-regret bounds that scale with √( T β T I T ) , where I T is the information gain and β T is the exploration schedule. We extend the analysis from finite admissible sets to infinite RKHS domains via covering arguments and uniform approximation—quantifying the data-efficiency and reliability of the offline design that precedes the one-shot deployment.
Figure 1. Overview of the CID pipeline. Identification yields a two-layer estimand; the estimator feeds a functional GP surrogate and a support-aware GP-UCB search to select a single policy offline. Shaded boxes indicate contributions introduced in this paper.
Figure 2. One iteration of the Stackelberg FCBO algorithm. Novel elements are highlighted with shaded boxes: support-aware GP-UCB selection and the nested estimator used for off-policy evaluation.

1.2. Related Work

Incentive design (ID) has long been studied in economics and control through the lens of principal–agent theory [9,10]. Within this tradition, the single-stage (one-shot) PAP is a canonical formulation of contracting under asymmetric information: the principal commits to a contract, the agent best-responds, and outcomes are realized once. Classical approaches model uncertainty with priors over types and often assume a known mapping from actions to outcomes, yielding foundational results in adverse selection and moral hazard [1,9]. Control-theoretic treatments similarly emphasize leader–follower structure but typically rely on parametric models of dynamics and payoffs (e.g., early closed-loop Stackelberg formulations and multi-level control [10], and modern bilevel optimization theory and hardness results [11]).
More recently, the machine learning literature has approached ID through online learning paradigms—such as multi-armed bandits, reinforcement learning, and Bayesian optimization [5,12]—where the principal adapts incentives across rounds while observing partial feedback. This work has produced algorithms for repeated principal–agent settings, combinatorial bandits for dynamic preferences [13,14], meta-gradient methods for adaptive mechanism design [12,15], and BO-based procedures that steer multi-agent systems toward desirable equilibria [16]. While powerful, these approaches are inherently multi-stage and often treat the environment as a black box, emphasizing empirical adaptation rather than explicit reasoning about how a committed incentive causes behavioral change in a one-shot deployment. Beyond incentive design, robustness studies in multilayer transportation networks similarly demonstrate that inter-layer coupling critically governs aggregate performance—see the bilayer railway–aviation analysis with discrete cross-layer assignment [17].
A parallel line of research integrates causal reasoning into sequential decision-making [18,19,20,21]. Causal bandits, counterfactual policy evaluation, and causal reinforcement learning utilize graphical models to structure exploration and target the effects of interventions [22,23]. In Causal Bayesian Optimization (CBO) [24], the optimization routine exploits known causal structure among inputs to identify high-value interventions more efficiently. These methods, however, typically consider atomic (variable-level) interventions and multi-round interactions. In contrast, our focus is on a one-shot commitment problem: the principal must select a single incentive function before deployment and cannot adapt it afterward.
Our work connects these strands by introducing a causal, offline design pipeline tailored to the single-stage PAP. We represent the principal–agent interaction with a causal graphical model (CGM) and interpret incentives as functional (conditional) interventions and policies as interventions on the node that embodies the principal’s rule. This yields a policy value target for the principal (the expected utility under the post-intervention distribution) that can be identified from observational data via the g-formula and estimated without requiring online experimentation. To efficiently search over a space of incentive functions before deployment, we adopt a Functional Bayesian Optimization procedure [6,25]: a functional Gaussian process surrogate on an RKHS of admissible incentives, coupled with a UCB-type acquisition functional, guides a small number of costly offline evaluations of policy value. Taken together, the proposed approach complements classical single-stage contract theory by supplying a causal identification–estimation pathway for policy value under private information. Finally, our analysis provides high-probability regret guarantees in terms of differential information gain, quantifying how efficiently the offline procedure learns the principal’s objective over the function space.
Our emphasis is on the theoretical underpinnings of single-stage causal incentive design, specifically, (i) identification of policy value under functional interventions; (ii) a two-layer estimation approach with policy-local support diagnostics; and (iii) support-aware Functional Bayesian Optimization with regret control under a uniform sub-Gaussian envelope. To maintain focus and adhere to space constraints, we reserve numerical simulations and applied case studies for subsequent work that will use the present framework as a baseline to explore design choices, including policy classes, kernel metrics, and bandwidth selection across various domains.

1.3. Outline

The rest of the paper is organized as follows. The preliminary background, Section 2, first reviews the bilevel viewpoint of hierarchical Stackelberg games and inverse Stackelberg formulations. We then formalize the single-stage principal–agent problem (SS-PAP) and specify the information model with bilateral private information. The background closes with the essentials of causal graphical models (CGMs) and the notions of hard and functional (policy) interventions used later for identification and estimation. We introduce our Causal Incentive Design (CID) framework for canonical SS-PAPs in Section 3. We present an additive-noise Gaussian CGM for the canonical single-stage PAP and its assumptions, define the principal’s policy value target estimand V ( γ ) = E [ J L ∣ do ( Γ = γ ) ] , and provide a semi-parametric identification formula for V ( γ ) via the g-formula. We then describe a practical offline estimation pipeline for the identification formula in general and provide an illustrative credit-market example with an affine pricing rule that demonstrates a closed-form specialization of the estimand. Section 4 details the Stackelberg FCBO procedure for offline policy selection. We model the objective γ ↦ V ( γ ) with a functional Gaussian-process surrogate over an RKHS of admissible incentives and employ a functional GP-UCB acquisition functional. We then present the Stackelberg FCBO algorithm and provide high-probability cumulative-regret bounds in terms of differential information gain. Section 5 reflects on the implications of CID for single-shot incentive design and outlines directions for empirical validation and broader multi-agent extensions.

2. Preliminary Background

The background is organized into three components: (i) the Stackelberg and inverse Stackelberg perspectives for incentive design (Section 2.1); (ii) the single-stage principal–agent problem (PAP) setting and the information model for the single-stage PAP with bilateral private information (Section 2.2); and (iii) the fundamentals of causal graphical models (CGMs), hard interventions, and functional (policy) interventions, which serve as the foundation for interpreting incentives as policies (Section 2.3).

2.1. Incentive Design as Inverse Stackelberg Games

Games characterized by a hierarchical decision-making structure, where one or more players, called the leaders, declare their strategy first and impose this strategy upon the other players, called the followers, are referred to as Stackelberg games. If leaders declare their strategy as a mapping from the followers’ decision space into their own decision space, we refer to these as inverse Stackelberg games. The essence of a Stackelberg game can be described by considering the basic single-leader single-follower single-stage game as follows. Let us denote by ω L ∈ Ω L ⊆ R n L and ω F ∈ Ω F ⊆ R n F the leader and follower decision variables, respectively. The leader chooses an action or decision from its decision space Ω L , and the follower, informed of the leader’s choice, subsequently chooses an action from its decision space Ω F . Each player aims to maximize its utility function, J L : Ω L × Ω F → R for the leader and J F : Ω L × Ω F → R for the follower, both defined on the decision spaces Ω L and Ω F .
Since the leader acts first and announces his decision ω L , which is subsequently made known to the follower, the follower’s action becomes a function of the leader’s action, ω F = r F ( ω L ) . The function r F : Ω L → Ω F is called the reaction function of the follower (as it indicates how the follower will react to the leader’s decision). Knowing ω L , the follower chooses ω F * , with
$$\omega_F^{*} \in \arg\max_{\omega_F \in \Omega_F} J_F(\omega_L, \omega_F), \qquad J_F(\omega_L, \omega_F^{*}) = J_F(\omega_L, r_F(\omega_L)).$$
Taking the aforementioned into account, before the leader announces his decision ω L , he will realize how the follower will react; hence, based on this knowledge, the leader will choose and subsequently announce his optimal decision ω L * ∈ Ω L , with
$$\omega_L^{*} \in \arg\max_{\omega_L \in \Omega_L} J_L(\omega_L, r_F(\omega_L)).$$
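A minimal numerical sketch of this backward induction follows. It solves the two nested argmax problems above by grid search for an illustrative pair of quadratic utilities; the utility forms, grids, and all constants are assumptions made for the example, not part of the formulation above.

```python
# Minimal sketch: backward induction in a single-stage Stackelberg game.
# The quadratic utilities below are illustrative assumptions.
import numpy as np

omega_L_grid = np.linspace(-2.0, 2.0, 201)   # leader's decision space Omega_L
omega_F_grid = np.linspace(-2.0, 2.0, 201)   # follower's decision space Omega_F

def J_F(wL, wF):   # follower's utility (illustrative)
    return -(wF - 0.5 * wL) ** 2

def J_L(wL, wF):   # leader's utility (illustrative)
    return -(wL - 1.0) ** 2 + wF

def reaction(wL):
    """Follower's reaction function r_F: best response to the leader's action."""
    return omega_F_grid[int(np.argmax(J_F(wL, omega_F_grid)))]

# The leader anticipates r_F and maximizes J_L(wL, r_F(wL)).
values = [J_L(wL, reaction(wL)) for wL in omega_L_grid]
wL_star = omega_L_grid[int(np.argmax(values))]
wF_star = reaction(wL_star)
print(f"Stackelberg solution: omega_L* = {wL_star:.2f}, omega_F* = {wF_star:.2f}")
```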
In contrast to the original Stackelberg game discussed above, in the inverse Stackelberg game the leader acts first by deciding and announcing an incentive function γ to the follower, instead of ω L . This incentive function is a mapping from the follower’s decision space to the leader’s decision space, i.e., γ : Ω F → Ω L , so the leader’s decision now follows directly from the follower’s decision. Note that the Stackelberg game is a special case of the inverse Stackelberg game in which the incentive function maps to a constant, that is, the case in which γ : Ω F → { ω L } . In addition, the incentive function approach allows the leader to first determine a particular desired point ( ω L d , ω F d ) that he intends to achieve. A natural choice for ( ω L d , ω F d ) would be a global optimum of the leader, ( ω L * , ω F * ) ∈ arg max ω L ∈ Ω L , ω F ∈ Ω F J L ( ω L , ω F ) . Given the desired point, the problem can be formulated as follows:
$$\text{Find:}\quad \gamma \in \Gamma, \tag{1}$$
$$\text{s.t.}\quad (\omega_L^{d}, \omega_F^{d}) \in \arg\max_{\omega_F \in \Omega_F} J_F(\gamma(\omega_F), \omega_F), \tag{2}$$
where Γ = { γ : Ω F → Ω L } is the admissible set of incentive functions γ from which the leader can choose. Even this static inverse Stackelberg problem with a single leader and a single follower is hard to solve analytically, because of the function composition, the possible existence of multiple optimal incentive functions, and the possible non-uniqueness of the follower’s best response. We can say for sure that this inverse Stackelberg game is at least strongly NP-hard, as it can be posed as the following program:
$$(\omega_L^{d}, \omega_F^{d}) \in \arg\max_{\omega_L \in \Omega_L,\, \omega_F \in \Omega_F} J_L(\gamma(\omega_F), \omega_F), \tag{3}$$
$$\text{s.t.}\quad \omega_F^{d} \in \arg\max_{\omega_F \in \Omega_F} J_F(\gamma(\omega_F), \omega_F). \tag{4}$$
Taking ω L as a free variable for γ ( ω F ) , this is equivalent to the bilevel optimization problem formulation shown in Equation (5), which is proven to be strongly NP-hard even when the objective functions are linear [11].
$$\max_{x_U \in X_U,\, x_L \in X_L} F(x_U, x_L), \qquad \text{s.t.}\quad x_L \in \arg\max\left\{ f(x_U, x_L) \,\middle|\, g_i(x_U, x_L) \leq 0,\ i \in I \right\}, \qquad G_j(x_U, x_L) \leq 0,\ j \in J, \tag{5}$$
where X U and X L denote the decision spaces of the upper and lower levels, respectively; F : X U × X L → R is the upper-level objective function; f : X U × X L → R is the lower-level objective function; and the functions g i : X U × X L → R and G j : X U × X L → R represent the lower-level and upper-level constraints, respectively, for i ∈ I ⊆ N , j ∈ J ⊆ N . This nested two-level optimization structure requires that only optimal solutions of the lower-level task are acceptable as feasible solutions of the upper-level task.

2.2. The Single-Stage Principal–Agent Problem

In practice, the leader and follower agents base their decisions on a set of information that is readily accessible to them. Incentive design (ID), also known as contract theory, investigates inverse Stackelberg games under a variety of information models available to the leader and the follower agent and the information asymmetries between them. In this domain, the inverse Stackelberg game is called the principal–agent problem, with the leader as the principal and the follower as the agent. The way this information set is conceived and theoretically described is a major factor in how the fields of economics, control theory, and machine learning approach the principal–agent problem.
In this research, we focus on the single-stage principal–agent problem (SS-PAP), which is a hierarchical inverse Stackelberg game with a specific information model, played between a single principal (leader) and a single follower. This single-stage game proceeds as follows: First, the principal decides or selects an incentive function γ from a predetermined set of functions Γ = { γ : Ω F → Ω L } and announces γ to the follower, without knowledge of the follower’s utility. Then, with knowledge of this incentive function γ , the follower agent selects an action ω F * ∈ Ω F that maximizes his utility J F ( γ ( ω F ) , ω F ) . We do not impose restrictive assumptions on the functional form of the utilities. Regarding the information model, we assume that both the principal and the follower agent possess private information; i.e., the utility function of the follower is unknown to the principal and vice versa. The outcome of this single-stage principal–agent problem is the pair of utilities J F ( γ ( ω F * ) , ω F * ) and J L ( γ ( ω F * ) , ω F * ) realized after the principal decides the incentive function γ . Our focus is on the principal’s perspective of the problem, where the principal wants to select the incentive function γ that maximizes its utility J L .
We assume that the principal solves this problem as a sequential optimization problem with finite horizon T, using previously observed historical data, given as N observations D 1 = { ( γ i , ω F i , ω L i , J L i ) } i = 1 N of the system. That is, using D 1 , the principal estimates E [ J L t ] given that an incentive function γ t ∈ Γ is applied; after T rounds, the principal decides on the best γ * ∈ Γ to implement in the SS-PAP. In other words, at each stage t ∈ [ T ] , the principal (as the decision-maker) is responsible for the development of a policy π t : D t − 1 ↦ γ t , where D t − 1 = { ( γ 1 , E [ J L 1 ] ) , … , ( γ t − 1 , E [ J L t − 1 ] ) } ; and when the accumulated history data set D T is complete, the principal decides the incentive function γ + such that
$$\gamma^{+} \in \arg\max_{\left( \gamma_t,\, E[J_L^{t}] \right) \in D_T} \left\{ E[J_L^{1}], \ldots, E[J_L^{T}] \right\}. \tag{6}$$
Therefore, we deal with a sequential optimization problem, and we take a Bayesian approach to the optimization of the principal’s functional objective. Appendix C provides details about Bayesian optimization (BO), which specializes in sequentially optimizing objectives with the above-mentioned characteristics, relying on a statistical model of the objective function whose beliefs guide the algorithm in making the most fruitful decisions.
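To make the loop concrete, the sketch below runs a toy version of this sequential scheme: a Gaussian-process surrogate with a UCB rule proposes incentive parameters, a black-box evaluator returns noisy estimates of E [ J L ] , and the best observed policy is committed at the end. The one-dimensional policy family, the stub evaluator estimate_EJL, the RBF kernel, and all hyperparameters are illustrative assumptions standing in for the FCBO machinery of Section 4.

```python
# Sketch: offline BO over a finite family of candidate incentives (GP-UCB).
# `estimate_EJL` is a synthetic stand-in for the offline causal estimator
# of E[J_L | do(Gamma = gamma)] developed in Section 3.
import numpy as np

rng = np.random.default_rng(0)
candidates = np.linspace(0.0, 1.0, 50)          # 1-D policy parameter (illustrative)

def estimate_EJL(g):                            # noisy black-box evaluation (stub)
    return -(g - 0.7) ** 2 + 0.05 * rng.standard_normal()

def rbf(a, b, ls=0.1):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

X, y = [], []
for t in range(15):
    if X:                                       # GP posterior over the candidates
        Xa, ya = np.array(X), np.array(y)
        K = rbf(Xa, Xa) + 1e-4 * np.eye(len(Xa))
        ks = rbf(candidates, Xa)
        mu = ks @ np.linalg.solve(K, ya)
        var = 1.0 - np.sum(ks * np.linalg.solve(K, ks.T).T, axis=1)
        ucb = mu + 2.0 * np.sqrt(np.clip(var, 0.0, None))
    else:
        ucb = np.zeros_like(candidates)         # first round: arbitrary pick
    g_t = candidates[int(np.argmax(ucb))]       # propose the next incentive
    X.append(g_t); y.append(estimate_EJL(g_t))  # record (gamma_t, E[J_L^t]) in D_t

g_plus = X[int(np.argmax(y))]                   # commit to the best observed policy
print(f"selected incentive parameter gamma+: {g_plus:.3f}")
```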

2.3. Causal Graphical Models and Causal Inference

This research proposes solving the single-stage principal–agent problem (SS-PAP), as defined in the previous subsection, by incorporating causal reasoning. In this section, we provide the essential concepts of causal inference required to understand the Causal Incentive Design (CID) framework proposed in Section 3 to solve the SS-PAP.
A causal graphical model (CGM) consists of a directed acyclic graph (DAG) G , which is called the causal structure, together with a four-tuple ⟨ U , V , F , p ( U ) ⟩ , where
  • U is a set { U i } i ∈ [ p ] of exogenous (unobserved) variables, which are determined by factors external to the model;
  • V is a set { V i } i ∈ [ p ] , with p ∈ N , of endogenous (observed) variables;
  • F = { f V 1 , … , f V p } is a set of functions known as structural equations, such that V i = f V i ( pa ( V i ) , u i ) for each V i ∈ V , with pa ( V i ) ⊆ V denoting the parents of V i ;
  • p ( U ) is a set of probability distributions { p ( U i ) } i ∈ [ p ] , one for each U i , where U i is a random disturbance distributed according to p ( U i ) , independently of all other U j with j ≠ i .
The set of vertices in the causal structure G corresponds to a set of random variables V = { V 1 , … , V p } , and each edge in the set of edges E represents a direct functional relationship between the corresponding variables, expressed by saying that V i is a direct cause of V j for an edge ( V i , V j ) ∈ E . So, the graph G encodes the causal relationships between the variables in V . We assume that the system V is causally sufficient; that is, there are no unmeasured variables that are common direct causes of two or more variables in V . Let G be the causal structure of a CGM, i.e., a directed acyclic graph (DAG) with variables V . The statistical model associated with G consists of all distributions that factorize as
$$p(\mathbf{V}) = \prod_{V \in \mathbf{V}} p\left( V \mid \mathrm{pa}_{\mathcal{G}}(V) \right), \tag{7}$$
where pa G ( V ) denotes the parents of V in G .
The notation d o ( X = x ) indicates an idealized experiment or intervention in which the values of X are set to x, for some X ⊆ V . Such interventions are called hard interventions. A fundamental problem in causal inference is to identify p ( y | d o ( X = x ) ) from observational data; in practice, experimental data are not always accessible because randomized controlled experiments might be impractical, costly, or even unethical. Within the causal system V , we distinguish three types of variables: non-manipulable variables C , treatment variables X that can be set to specific values (i.e., intervened on), and a target variable Y that represents the outcome of interest. For a fully observed causal system V with causal structure DAG G , the post-intervention distribution can be identified through the g-formula, also known as the truncated factorization. We follow the standard g-formula identification [26]:
$$p\left( \mathbf{V} \setminus \mathbf{X} \mid do(\mathbf{X} = \mathbf{x}) \right) = \prod_{V \in \mathbf{V} \setminus \mathbf{X}} p\left( V \mid \mathrm{pa}(V) \right)\Big|_{\mathbf{X} = \mathbf{x}}. \tag{8}$$
So, a CGM induces joint observational and interventional probability distributions, given in Equations (7) and (8). Pearl (see [26]) formalized the concept of the causal effect of X on Y after manipulating X, that is, doing X = x for some x ∈ R ( X ) , where R ( · ) denotes the range of a random variable, as the expected value of the distribution p ( Y | d o ( X = x ) ) , i.e., the distribution of Y after the operation d o ( X = x ) , denoted as E p ( Y ∣ d o ( X = x ) ) [ Y ] or E [ Y ∣ d o ( X = x ) ] . Like any ordinary probability distribution, p ( · ∣ d o ( X = x ) ) obeys the same rules of conditioning and marginalization, so we can specify expectations in the conventional manner, as shown in Equation (9).
$$E\left[ Y \mid do(X = x) \right] = \int_{R(Y)} y\, p\left( y \mid do(X = x) \right) dy. \tag{9}$$
The traditional hard interventions discussed above are unconditional actions that simply force a variable X to adopt a designated value x. Functional interventions refer to a broader category of interventions in which a variable X is manipulated to respond in a predetermined manner to a set of other observable variables Z ⊆ Pa ( X ) in the causal system, through a functional relationship X = g ( z ) . Specifically, a functional intervention can be considered as a substitution of the structural equation f X ∈ F of the intervened variable X with a deterministic function of some parents in the CGM. In other words, by a functional intervention, we substitute the structural equation X = f X ( pa ( x ) , u x ) with a function X = g ( z ) for some Z ⊆ Pa ( X ) in the causal model. It is important to note that hard interventions are the specific instance of functional interventions in which the structural equation f X is substituted by a constant function g.
Pearl first explored functional interventions under the name of conditional actions in [2]. More recently, Correa and Bareinboim in [3] introduced a new set of inference rules that extends the do-calculus for hard interventions in [2] to derive claims about functional interventions. Following these, let p ( y ∣ d o ( X = g ( z ) ) ) stand for the post-intervention distribution of Y established under the policy X = g ( z ) . To compute the post-intervention distribution p ( y ∣ d o ( X = g ( z ) ) ) for the functional intervention d o ( X = g ( z ) ) , we condition on Z and write
$$p(y \mid do(X = g(\mathbf{z}))) = \int_{R(\mathbf{Z})} p(y \mid do(X = g(\mathbf{z})), \mathbf{z})\, p(\mathbf{z} \mid do(X = g(\mathbf{z})))\, d\mathbf{z} = \int_{R(\mathbf{Z})} p(y \mid do(X = g(\mathbf{z})), \mathbf{z})\, p(\mathbf{z})\, d\mathbf{z} = E_{\mathbf{Z}}\left[ p(y \mid do(X = g(\mathbf{z})), \mathbf{z}) \right], \quad \text{with } x = g(\mathbf{z}), \tag{10}$$
where the equality p ( z ∣ d o ( X = g ( z ) ) ) = p ( z ) comes from the fact that Z cannot be a descendant of X, and therefore any intervention on X has no effect on the distribution of Z . Thus, the causal effect of a functional intervention X = g ( Z ) can be computed from the expression p ( y ∣ d o ( X = x ) , z ) by substituting g ( z ) for x and taking the expectation over Z .
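Equation (10) can be checked numerically. The sketch below simulates a toy linear SCM Z → X → Y (with a direct edge Z → Y) and compares the directly simulated post-intervention mean under d o ( X = g ( Z ) ) with the g-formula route that substitutes x = g ( z ) and averages over Z; every structural equation here is an illustrative assumption.

```python
# Sketch: verifying the functional-intervention identity (Eq. (10)) on a toy SCM.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
g = lambda z: 0.5 * z + 1.0                    # policy: X = g(Z)

# Direct simulation of the post-intervention world do(X = g(Z)).
Z = rng.normal(0.0, 1.0, n)
X = g(Z)                                       # structural equation of X replaced by g
Y = 2.0 * X - Z + rng.normal(0.0, 1.0, n)
direct = Y.mean()

# g-formula route: E_Z[ p(y | do(X = x), z) ] with x = g(z); for this SCM,
# E[Y | do(X = x), Z = z] = 2x - z, so the target expectation is:
gform = np.mean(2.0 * g(Z) - Z)

print(f"direct simulation: {direct:.3f}   g-formula: {gform:.3f}")   # both ~ 2.0
```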

3. Causal Incentive Design for Single-Stage Canonical PAPs

In this section, we develop the proposed Causal Incentive Design (CID) framework by showing how causal inference can be incorporated into the principal–agent problem described in Section 2.2 to solve it. First, in Section 3.1, we propose the general additive-noise Gaussian CGM M L 1 F 1 that represents the dynamics of the single-stage canonical principal–agent problem P A P L 1 F 1 and describe its key semantic aspects and underlying assumptions. This P A P L 1 F 1 is the canonical version in which the principal–agent problem is analyzed with a single principal controlling a unique decision variable ω L ∈ Ω L and a single follower agent with a unique decision variable ω F ∈ Ω F . The core concepts of CID can be elucidated through the examination of the canonical principal–agent problem P A P L 1 F 1 and the CGM M L 1 F 1 that represents it.
The CID for the single-stage PAP is developed in three parts: (i) specification of the additive-noise CGM for the single-stage PAP and definition of the policy-value target V ( γ ) = E [ J L ∣ d o ( Γ = γ ) ] (Section 3.1); (ii) derivation of a semi-parametric identification for V ( γ ) via the g-formula, demonstrating its decomposition into an inner Gaussian smoothing and an outer expectation over the follower’s action (Section 3.2); and (iii) presentation of a two-layer estimator (Section 3.3), which first computes the inner Gaussian expectation (Section 3.3.3), and then approximates the outer expectation using policy-local kernel reweighting with accompanying diagnostics (Section 3.3.4).

3.1. A Causal Graphical Model for the Single-Stage Canonical Principal–Agent Problem

We consider the single-stage canonical principal–agent problem P A P L 1 F 1 as a causal system with endogenous observable random variables V = { Γ , Ω L , Ω F , J F , J L } , where Γ is the incentive variable, whose range R ( Γ ) = Γ is the admissible set of incentive functions. The variables Ω L and Ω F represent the decision variables of the principal and the follower, respectively, with ranges R ( Ω L ) = Ω L and R ( Ω F ) = Ω F , where Ω L and Ω F are the decision spaces of the principal and the follower. The variables J F and J L represent the values of the utility functions of the follower and the principal, respectively.
Definition 1.
The CGM M L 1 F 1 for P A P L 1 F 1 has endogenous variables set V , exogenous variables set U , and structural equations set F given as follows:
V = { Γ , Ω L , Ω F , J F , J L } , U = { U Ω L , U Ω F , U J F , U J L } , F = { f Γ , f Ω L , f Ω F , f J F , f J L } , with
$$\Gamma := \gamma, \tag{11}$$
$$\Omega_F := f_{\Omega_F}(\gamma, \epsilon_{\Omega_F}) = BR(\gamma) + \epsilon_{\Omega_F}, \tag{12}$$
$$\Omega_L := f_{\Omega_L}(\gamma, \Omega_F, \epsilon_{\Omega_L}) = \gamma(\Omega_F) + \epsilon_{\Omega_L}, \tag{13}$$
$$J_F := f_{J_F}(\Omega_L, \Omega_F, \epsilon_{J_F}) = J_F(\Omega_L, \Omega_F) + \epsilon_{J_F}, \tag{14}$$
$$J_L := f_{J_L}(\Omega_L, \Omega_F, \epsilon_{J_L}) = J_L(\Omega_L, \Omega_F) + \epsilon_{J_L}, \tag{15}$$
where ϵ Ω F ∈ R ( U Ω F ) , ϵ Ω L ∈ R ( U Ω L ) , ϵ J F ∈ R ( U J F ) , ϵ J L ∈ R ( U J L ) ; with p ( U i ) = N ( 0 , σ i 2 ) for all U i ∈ U , and the components of U jointly independent. The function γ : R ( Ω F ) → R ( Ω L ) , γ ∈ Γ , is an incentive function, decided by the principal from the function space Γ. The functional B R ( γ ) is the best-response functional of the follower agent given γ. The functions J F ( ω L , ω F ) and J L ( ω L , ω F ) are the utility functions of the follower and the leader, respectively, in the P A P L 1 F 1 .
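A minimal forward simulator of this CGM is sketched below for a fixed incentive γ. The best-response functional B R and both utility functions are illustrative stand-ins (in the paper's information model they are private or black-box), and all noise scales are assumptions.

```python
# Sketch: forward-sampling the additive-noise CGM M_{L1F1} of Definition 1.
import numpy as np

rng = np.random.default_rng(2)

def sample_pap(gamma, BR, J_F, J_L, n=1000, s_F=0.1, s_L=0.1, s_JF=0.1, s_JL=0.1):
    omega_F = BR(gamma) + rng.normal(0.0, s_F, n)           # Eq. (12)
    omega_L = gamma(omega_F) + rng.normal(0.0, s_L, n)      # Eq. (13)
    jF = J_F(omega_L, omega_F) + rng.normal(0.0, s_JF, n)   # Eq. (14)
    jL = J_L(omega_L, omega_F) + rng.normal(0.0, s_JL, n)   # Eq. (15)
    return omega_F, omega_L, jF, jL

# Illustrative instantiation: affine incentive, stub best response, toy utilities.
gamma = lambda wF: 0.8 * wF + 0.2
BR = lambda g: 1.0                      # hidden follower best response (stub)
J_F = lambda wL, wF: -(wF - 1.0) ** 2 - wL
J_L = lambda wL, wF: wL - 0.1 * wF ** 2

_, _, _, jL = sample_pap(gamma, BR, J_F, J_L)
print(f"Monte Carlo estimate of V(gamma): {jL.mean():.3f}")
```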

Assumptions Underlying the CGMs for Single-Stage Canonical PAPs

The CGM M L 1 F 1 is an additive-noise Gaussian structural causal model (SCM); it is not inherently linear, as we do not assume that the structural equations f V i ( pa ( V i ) , u i ) , for each V i ∈ V , have a specific functional form. We assume that the set of endogenous variables V = { Γ , Ω L , Ω F , J F , J L } in the CGM—namely, the incentives variable Γ , the principal’s decision variable Ω L , the follower’s decision variable Ω F , and their respective utilities J L and J F —is causally sufficient. That is, there are no latent common causes outside this set that jointly influence any two of these variables.
The causal structure G L 1 F 1 of M L 1 F 1 has directed edges from Γ to { Ω L , Ω F } , from Ω L to { J F , J L } , and from Ω F to { Ω L , J F , J L } . The variable Γ does not depend on an exogenous random variable or on any other variable in the causal system; it is a deterministic root variable whose value is decided and fixed by the principal’s choice of γ ∈ Γ . Thus, γ acts as a fixed input to the causal system, selected by the principal agent.
The best-response functional B R ( γ ) in the structural Equation (12) is the mapping from the principal’s chosen incentive function γ to the action ω F * that maximizes the follower’s utility. That is,
$$BR(\gamma) = \omega_F^{*} \in \arg\max_{\omega_F \in \Omega_F} E\left[ J_F\left( \gamma(\omega_F) + \epsilon_{\Omega_L},\ \omega_F \right) \right]. \tag{16}$$
This formulation reveals how the follower’s decision responds to the incentive function chosen by the principal. The structural Equation (13), Ω L := f Ω L ( γ , Ω F , ϵ Ω L ) = γ ( Ω F ) + ϵ Ω L , is central to representing the inverse Stackelberg game semantics. It expresses that, after the principal proposes the incentive function γ , the principal’s action Ω L is given as a function γ ( ω F ) of the follower’s action ω F . This composition forms the directed acyclic triangle Γ → Ω F , Ω F → Ω L , Γ → Ω L with the structural Equations (12) and (13), and it highlights how the principal’s concrete action Ω L responds to the follower’s action, as influenced by the incentive function γ selected by the principal. The utility variables J L and J F are defined as functions of ( Ω L , Ω F ) , consistent with conventional formulations of principal–agent problems.
The generative causal model M L 1 F 1 must be examined from two perspectives: the principal’s and the follower’s. From the principal’s view, the follower’s utility J F is an unobservable or hidden variable. So, for the principal, the set of hidden variables is H L = { J F } , while the set of observable variables is O L = { Γ , Ω L , Ω F , J L } , with H L ∪ O L = V . On the other hand, from the follower’s view, H F = { J L } and O F = { Γ , Ω L , Ω F , J F } , with H F ∪ O F = V . These separated perspectives are just marginalizations of the same generative SCM M L 1 F 1 , but they maintain consistency with the information model in P A P L 1 F 1 .
It is also important to establish that, with respect to the information model in P A P L 1 F 1 , the principal views the follower mechanism f Ω F ( γ , U Ω F ) as a black box. In other words, the principal does not know the function B R ( γ ) that the follower uses to decide ω F , but only observes the decision ω F . This work centers on the principal’s perspective; for the remainder, we concentrate exclusively on this view of the causal model M L 1 F 1 , assuming V = O L ∪ H L and treating f Ω F ( γ , U Ω F ) as a black box.

3.2. A Non-Parametric Identification for the Causal Inference Target

In the Causal Incentive Design (CID) framework, with respect to the single-stage P A P L 1 F 1 , the principal’s objective is to determine an incentive function γ * such that its utility J L is maximized, utilizing prior observations of the system in question. To accomplish this, the principal should first be able to estimate the expected value of their utility J L , in the particular scenario where the incentive function γ is applied, ideally utilizing only observational data.
In the context of causal inference, this requires determining whether the expected value of the utility of the principal E [ J L ] in the post-intervention distribution p J L d o ( Γ = γ ) is identifiable. In other words, determine whether
$$E_{p(J_L \mid do(\Gamma = \gamma))}\left[ J_L \right], \quad \text{or equivalently,} \quad E\left[ J_L \mid do(\Gamma = \gamma) \right], \tag{17}$$
can be estimated using exclusively observational data from the variables Γ , Ω L , Ω F , J L , i.e., a data set D = { ( γ i , ω F i , ω L i , j L i ) } i = 1 N of N ∈ N observations of the system. We also employ the simpler notation E γ [ J L ] to denote E [ J L ∣ do ( Γ = γ ) ] . Recall that the variable J F is hidden from the principal’s perspective, so the principal has no knowledge or information about this variable.

Identification via the g-Formula

We first use the g-formula over the principal’s perspective to construct a non-parametric identification for the causal inference target E [ J L ∣ do ( Γ = γ ) ] , which then becomes a semi-parametric target estimand once the Gaussian additive independent errors of the CGM M L 1 F 1 are taken into account. The g-formula over the principal’s perspective for the intervention d o ( Γ = γ ) is given by
$$p\left( \mathbf{O}_L \setminus \Gamma \mid do(\Gamma = \gamma) \right) = \prod_{O \in \mathbf{O}_L \setminus \Gamma} p\left( O \mid \mathrm{pa}(O) \right)\Big|_{\Gamma = \gamma}. \tag{18}$$
That is, we obtain the post-intervention distribution p ( O L ∖ Γ ∣ do ( Γ = γ ) ) = p ( ω F , ω L , j L ∣ do ( Γ = γ ) ) by removing from the pre-intervention distribution p ( O L ) all factors corresponding to the intervened variable and substituting Γ = γ into the remaining factors. So, the post-intervention distribution p ( ω F , ω L , j L ∣ do ( Γ = γ ) ) is factorized as shown in Equation (19). Then, we can estimate the target post-intervention distribution p ( j L ∣ do ( Γ = γ ) ) by marginalizing the variables ω F and ω L in the factorization of p ( ω F , ω L , j L ∣ do ( Γ = γ ) ) , as shown in Equation (20).
$$p\left( \omega_F, \omega_L, j_L \mid do(\Gamma = \gamma) \right) = p(\omega_F \mid \Gamma = \gamma)\, p(\omega_L \mid \omega_F, \Gamma = \gamma)\, p(j_L \mid \omega_L, \omega_F, \Gamma = \gamma), \tag{19}$$
$$p\left( j_L \mid do(\Gamma = \gamma) \right) = \iint p(j_L \mid \omega_L, \omega_F, \Gamma = \gamma)\, p(\omega_L \mid \omega_F, \Gamma = \gamma)\, p(\omega_F \mid \Gamma = \gamma)\, d\omega_L\, d\omega_F. \tag{20}$$
The target functional value V ( γ ) = E [ J L ∣ do ( Γ = γ ) ] under the incentive function (policy) γ is the expectation of J L in the post-intervention world, that is,
$$V(\gamma) = E_{\gamma}[J_L] = \int j_L\, p\left( j_L \mid do(\Gamma = \gamma) \right) dj_L. \tag{21}$$
Therefore, substituting the factorization form (Equation (20)) of the post-intervention distribution p ( j L ∣ do ( Γ = γ ) ) , we obtain the following:
$$V(\gamma) = \iiint j\, p(j \mid \omega_L, \omega_F, \Gamma = \gamma)\, p(\omega_L \mid \omega_F, \Gamma = \gamma)\, p(\omega_F \mid \Gamma = \gamma)\, d\omega_L\, d\omega_F\, dj \tag{22}$$
$$= \iint \left[ \int j\, p(j \mid \omega_L, \omega_F, \Gamma = \gamma)\, dj \right] p(\omega_L \mid \omega_F, \Gamma = \gamma)\, p(\omega_F \mid \Gamma = \gamma)\, d\omega_L\, d\omega_F, \tag{23}$$
where the exchange of integrals is allowed by the Tonelli/Fubini Theorem. Observe that the inner integral is the following conditional mean:
$$\int j\, p(j \mid \omega_L, \omega_F, \Gamma = \gamma)\, dj = E\left[ J_L \mid \Omega_L = \omega_L, \Omega_F = \omega_F, \Gamma = \gamma \right]. \tag{24}$$
Let us denote μ L ( ω L , ω F , γ ) := E [ J L ∣ Ω L = ω L , Ω F = ω F , Γ = γ ] . Thus, we have
$$V(\gamma) = \iint \mu_L(\omega_L, \omega_F, \gamma)\, p(\omega_L \mid \omega_F, \Gamma = \gamma)\, p(\omega_F \mid \Gamma = \gamma)\, d\omega_L\, d\omega_F \tag{25}$$
$$= \int \left[ \int \mu_L(\omega_L, \omega_F, \gamma)\, p(\omega_L \mid \omega_F, \Gamma = \gamma)\, d\omega_L \right] p(\omega_F \mid \Gamma = \gamma)\, d\omega_F. \tag{26}$$
From structural Equation (13), Ω L := f Ω L ( γ , Ω F , ϵ Ω L ) = γ ( Ω F ) + ϵ Ω L ; if U Ω L has density f U Ω L , then for a fixed ω F we have p ( ω L ∣ ω F , Γ = γ ) = f U Ω L ( ω L − γ ( ω F ) ) . Hence, the inner integral of Equation (26) becomes
$$\int \mu_L(\omega_L, \omega_F, \gamma)\, f_{U_{\Omega_L}}\left( \omega_L - \gamma(\omega_F) \right) d\omega_L, \tag{27}$$
which, by the change of variable r = ω L − γ ( ω F ) (so that ω L = γ ( ω F ) + r and d ω L = d r ), becomes
$$\int \mu_L\left( \gamma(\omega_F) + r,\ \omega_F,\ \gamma \right) f_{U_{\Omega_L}}(r)\, dr = E_{U_{\Omega_L}}\left[ \mu_L\left( \gamma(\omega_F) + U_{\Omega_L},\ \omega_F,\ \gamma \right) \right], \tag{28}$$
since r = ω L − γ ( ω F ) = ϵ Ω L ∼ p ( U Ω L ) , as Ω L = γ ( ω F ) + U Ω L with U Ω L independent of Ω F . Therefore, Equation (26) can be expressed as
$$V(\gamma) = \int \left[ \int \mu_L(\omega_L, \omega_F, \gamma)\, p(\omega_L \mid \omega_F, \Gamma = \gamma)\, d\omega_L \right] p(\omega_F \mid \Gamma = \gamma)\, d\omega_F = \int E_{U_{\Omega_L}}\left[ \mu_L\left( \gamma(\omega_F) + U_{\Omega_L},\ \omega_F,\ \gamma \right) \right] p(\omega_F \mid \Gamma = \gamma)\, d\omega_F. \tag{29}$$
The outer integral of Equation (29) corresponds to averaging over the follower’s action distribution p ( Ω F ∣ Γ = γ ) induced by the policy γ . Recall that from the principal’s perspective, the follower mechanism Ω F = f Ω F ( γ , U Ω F ) is unknown and is treated as a black-box mechanism.

3.3. Estimations on the Semi-Parametric Identification Formula

We now demonstrate how to estimate V ( γ ) ; first through a working example in the context of the credit market, and subsequently, we describe how the estimation can be conducted in general under additive, independent, mean-zero Gaussian noise, without committing to a particular parametric form for the structural functions. In this section, for simplicity and clarity on the estimation task, we rename two of the causal system variables. Here, we use M : = Ω F to denote the follower’s action and refer to it as the mediator, because it acts as a mediator on the causal path from a fixed incentive policy γ Γ to the principal’s utility J L . We refer here to the principal’s utility as the outcome of the system, and we denote it here as Y : = J L for convenience.
Before going into the details of the estimation for the linear parametrized PAP example and the general-case estimation, we introduce the general components to be learned in the identified estimand:
$$V(\gamma) = E\left[ Y \mid do(\Gamma = \gamma) \right] \tag{30}$$
$$= \int \underbrace{ E_{U_L}\left[ \mu_L\left( \gamma(m) + U_L,\ m,\ \gamma \right) \right] }_{\text{inner expectation for the outcome model}}\ \underbrace{ p(m \mid \Gamma = \gamma) }_{\text{black-box mediator conditional}}\, dm, \tag{31}$$
where μ L ( ω L , m , γ ) = E [ Y ∣ Ω L = ω L , M = m , Γ = γ ] and Ω L = γ ( M ) + U L holds. The two main components are the inner expectation for the outcome model and the outer black-box mediator conditional, as shown in Equation (31).
For the first, we unpack the nested structure by first estimating the action mechanism Ω L = f Ω L ( γ , M , U L ) = f ^ L ( γ , M ) + σ ^ L ( γ , M ) U L , learning f ^ L ( γ , M ) and σ ^ L ( γ , M ) from the data; i.e., we regress Ω L on ( M , Γ ) for the mean f ^ L and regress the squared residuals on ( M , Γ ) to obtain σ ^ L 2 . With these, we learn the conditional outcome mean μ L ( ω L , m , γ ) = E [ Y ∣ Ω L = ω L , M = m , Γ = γ ] as a regression problem with features ( Ω L , M , Γ ) and outcome Y. The inner expectation for the outcome model (see Equation (31)) may then be computed by Gauss–Hermite quadrature or by Monte Carlo using common random numbers across m to reduce variance, as sketched below.
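A minimal sketch of the inner-layer computation follows, assuming U L ∼ N ( 0 , σ L 2 ) and an already-fitted outcome regression; the closures mu_L and gamma and the noise scale are illustrative placeholders for the learned models.

```python
# Sketch: inner expectation E_{U_L}[ mu_L(gamma(m) + U_L, m, gamma) ] by
# Gauss-Hermite quadrature. For U ~ N(0, sigma^2):
#   E[h(U)] = (1/sqrt(pi)) * sum_i w_i * h(sqrt(2) * sigma * x_i).
import numpy as np

def inner_expectation(mu_L, gamma, m, sigma_L, n_nodes=20):
    x, w = np.polynomial.hermite.hermgauss(n_nodes)   # physicists' Hermite nodes
    u = np.sqrt(2.0) * sigma_L * x
    return np.sum(w * mu_L(gamma(m) + u, m)) / np.sqrt(np.pi)

# Illustrative check against a closed form: with mu_L(w, m) = w**2 we expect
# E[(gamma(m) + U)^2] = gamma(m)^2 + sigma^2.
gamma = lambda m: 0.8 * m + 0.2
mu_L = lambda wL, m: wL ** 2
print(inner_expectation(mu_L, gamma, m=1.0, sigma_L=0.3))   # ~ 1.0 + 0.09 = 1.09
```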
The second main component is the post-intervention mediator law (what the follower does under the policy γ ), i.e., the black-box mediator conditional p ( m ∣ Γ = γ ) . We approximate p ( m ∣ Γ = γ ) near the target policy by a policy-local reweighting of the observations, selecting a policy-space metric d G and a kernel K with bandwidth h. Thus, the outer expectation over the mediator law under a fixed γ is computed by policy-local kernel weights. Therefore, in general, the inner layer (first component) is a Gaussian smoothing of the outcome regression around the mean principal’s action (in linear models, the variance drops out and we recover the closed form; in general, we keep a small quadrature), and the outer layer (second component) is a weighted average over observed mediators. The kernel with bandwidth provides a principled way to compute those weights with global regularization, as we show in the following sections and in the sketch below.
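The outer layer can be sketched in the same spirit. Below, Gaussian kernel weights over an illustrative Euclidean policy-space metric reweight synthetic logged policies, and the policy value is a weighted average of stand-in inner values; the metric, kernel, and bandwidth are assumptions to be chosen per application.

```python
# Sketch: outer expectation over p(m | Gamma = gamma) via policy-local
# kernel reweighting of logged observations.
import numpy as np

def policy_weights(logged_policies, target_policy, h=0.5):
    d = np.linalg.norm(logged_policies - target_policy, axis=1)   # metric d_G
    w = np.exp(-0.5 * (d / h) ** 2)                               # Gaussian kernel
    return w / w.sum()

def outer_expectation(inner_values, logged_policies, target_policy, h=0.5):
    """V_hat(gamma) = sum_i w_i(h) * inner_value_i over logged mediators m_i."""
    w = policy_weights(logged_policies, target_policy, h)
    return float(np.dot(w, inner_values))

# Illustrative usage with synthetic logs (policy parameters, inner values):
rng = np.random.default_rng(3)
logged = rng.uniform(0.0, 1.0, size=(500, 2))      # logged (alpha_i, beta_i)
inner_vals = logged @ np.array([1.0, 0.5])         # stand-in inner expectations
print(outer_expectation(inner_vals, logged, np.array([0.4, 0.6])))
```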
Since empirical performance is determined by domain-specific design choices, such as policy parameterization, kernel selection on policy space, bandwidth h, and support criteria, we defer numerical examples and simulations to future work that will focus on applied instantiations of the present framework. We concentrate on formal properties that hold independent of the chosen application.

3.3.1. The Estimation of a Linear Parametrized Instance

Consider a financial institution that is required to establish a transparent pricing plan before evaluating a potential borrower’s response. The bank (the principal) announces an affine policy, which includes a fixed fee and an interest “slope” that depends on the size of the loan. The bank thereafter observes how much the business (the follower) actually borrows. This is an instance of a canonical P A P L 1 F 1 , which we can parameterize as the following linear Gaussian structural causal model (SCM):
$$\Gamma := \gamma_{(\alpha, \beta)}(m) = \alpha m + \beta, \tag{32}$$
$$M := \mu_M(\alpha, \beta) + U_M, \qquad U_M \sim \mathcal{N}(0, \sigma_M^2), \tag{33}$$
$$\Omega_L := \alpha M + \beta + U_{\Omega_L}, \qquad U_{\Omega_L} \sim \mathcal{N}(0, \sigma_L^2), \tag{34}$$
$$Y := b + r\, \Omega_L + q\, M + U_Y, \qquad U_Y \sim \mathcal{N}(0, \sigma_Y^2), \tag{35}$$
with U M , U Ω L , U Y mutually independent and independent of Γ . We model the bank’s pricing rule as an affine policy γ ( α , β ) ( m ) = α m + β with α ≥ 0 , β ≥ 0 ; that is, the incentive-function family consists of affine pricing rules. As assumed, from the principal’s perspective the follower’s mechanism and utility remain hidden, so the function μ M ( α , β ) is unknown to the principal. In this context, Ω L can be seen as the realized cash inflow from the borrower; i.e., the bank’s collections follow the mechanism Ω L = γ ( M ) + U Ω L = α M + β + U Ω L , the interest on the borrowed amount M plus the fee actually received at the single stage, possibly noisy. The principal’s utility is considered linear, given as Y = b + r Ω L + q M + U Y , where b represents the baseline overheads, r is the realization factor that converts collections into value (often r = 1 if a dollar collected is a dollar of revenue before costs; r < 1 if haircut rules apply), and q is the marginal value of the loan size M; it is typically negative if it aggregates funding cost, expected credit loss, and capital operating costs per unit of balance.
$$\underbrace{J_L}_{\text{Profit}} = \underbrace{r\, \Omega_L}_{\text{Collections}} + \underbrace{q\, M}_{\text{Funding and risk cost per unit}} + \underbrace{b}_{\text{Fixed margin}} = r\left( \alpha M + \beta \right) + q M + b.$$
In short, profit is given as the collections minus the funding and risk cost per unit of size (hence the negative q). This aligns with the canonical P A P L 1 F 1 , where the utility of the principal depends on the decisions of the leader Ω L and the follower Ω F .
Given this outcome model, the conditional mean μ L ( ω L , m , γ ) used inside the g-formula identification in Equation (29) is
$$\mu_L(\omega_L, m, \gamma) = E\left[ Y \mid \Omega_L = \omega_L, M = m, \Gamma = \gamma \right] = b + r\, \omega_L + q\, m,$$
since E [ U Y ∣ Ω L , M , Γ ] = 0 . Then, we reduce E U Ω L [ μ L ( γ ( m ) + U Ω L , m , γ ) ] , the inner expectation over the principal’s action noise U Ω L in the identification formula, using Ω L = γ ( m ) + U Ω L and E [ U Ω L ∣ M = m , Γ = γ ] = 0 . If we fix m, the inner term becomes
$$E_{U_{\Omega_L}}\left[ \mu_L\left( \gamma(m) + U_{\Omega_L},\ m,\ \gamma \right) \right] = E\left[ b + r\left( \gamma(m) + U_{\Omega_L} \right) + q\, m \right] = b + r\, \gamma(m) + q\, m. \tag{36}$$
Because μ L is linear in ω L and U Ω L is mean-zero (conditional on m), the action-noise variance does not affect the mean; only the mean of Ω L matters at this step. We reduce the outer integral by plugging the reduced inner term (36) into the outer integral in (29) and taking the (outer) expectation with respect to the mediator law p ( m ∣ Γ = γ ) :
$$V(\gamma) = \int \left[ b + r\, \gamma(m) + q\, m \right] p(m \mid \Gamma = \gamma)\, dm = b + r\, E\left[ \gamma(M) \mid \Gamma = \gamma \right] + q\, E\left[ M \mid \Gamma = \gamma \right].$$
Now, substituting the affine policy γ ( α , β ) ( m ) = α m + β :
$$E\left[ \gamma(M) \mid \Gamma = (\alpha, \beta) \right] = E\left[ \alpha M + \beta \mid \Gamma = (\alpha, \beta) \right] = \alpha\, \mu_M(\alpha, \beta) + \beta,$$
with μ M ( α , β ) := E [ M ∣ Γ = ( α , β ) ] . Hence,
$$V(\alpha, \beta) = b + r\left[ \alpha\, \mu_M(\alpha, \beta) + \beta \right] + q\, \mu_M(\alpha, \beta) = b + r\, \beta + \left( r\, \alpha + q \right) \mu_M(\alpha, \beta). \tag{37}$$
Therefore, the expression in (37) for V ( α , β ) is the estimand we want to approximate from the data, given as N i.i.d. observations { ( α i , β i , M i , Ω L , i , Y i ) } i = 1 N . The formula in (37) depends on (i) the outcome coefficients ( b , r , q ) and (ii) the black-box mean of the mediator under the policy, μ M ( α , β ) = E [ M ∣ Γ = ( α , β ) ] . We estimate these separately in two steps and subsequently plug them into the closed form: in the first step, we estimate ( b ^ , r ^ , q ^ ) from a regression of Y on ( Ω L , M ) ; in the second step, we estimate μ ^ M ( α , β ) through a regression of M on the policy coordinates ( α , β ) . The closed-form estimator is then
$$\widehat{V}(\alpha, \beta) = \widehat{b} + \widehat{r}\, \beta + \left( \widehat{r}\, \alpha + \widehat{q} \right) \widehat{\mu}_M(\alpha, \beta).$$
A natural way to estimate ( b ^ , r ^ , q ^ ) is by weighting observations according to their proximity to the target policy ( α , β ) . A weighted least-squares (WLS) fit then yields ( b ^ , r ^ , q ^ ) tuned to the neighborhood of the target policy γ ( α , β ) . That is, for a target incentive policy γ ( α , β ) ( m ) = α m + β , we fit a local linear outcome model Y ≈ b + r Ω L + q M + ε , weighting observations according to their proximity to γ ( α , β ) in policy space.
Let X i = [ 1 , Ω L , i , M i ] ⊤ and y i = Y i for i ∈ [ N ] , and arrange the data in the matrix X ∈ R N × 3 and the vector y ∈ R N . A practical choice for a distance d ( ( α i , β i ) , ( α , β ) ) in the policy space is an isotropic distance defined as
$$d\left( (\alpha_i, \beta_i), (\alpha, \beta) \right) = \sqrt{ \left( \tilde{\alpha}_i - \tilde{\alpha} \right)^2 + \left( \tilde{\beta}_i - \tilde{\beta} \right)^2 },$$
with standardized coordinates α ~ = ( α − α ¯ ) / sd ^ ( α ) (and analogously for β ).
Let K ( · ) be a radial kernel (Gaussian or Epanechnikov) and h > 0 a scalar bandwidth. We then define the unnormalized and normalized weights as follows:
$$\tilde{w}_i(h) = K\!\left( \frac{ d\left( (\alpha_i, \beta_i), (\alpha, \beta) \right) }{ h } \right), \qquad w_i(h) = \frac{ \tilde{w}_i(h) }{ \sum_{j=1}^{N} \tilde{w}_j(h) }, \qquad \sum_i w_i(h) = 1,$$
and W = diag ( w 1 ( h ) , … , w N ( h ) ) . Then, the WLS estimator θ ^ = ( b ^ , r ^ , q ^ ) ⊤ is
$$\widehat{\theta} = \begin{pmatrix} \widehat{b} \\ \widehat{r} \\ \widehat{q} \end{pmatrix} = \left( X^{\top} W X \right)^{-1} X^{\top} W y.$$
We estimate μ M ( α , β ) = E [ M ∣ Γ = ( α , β ) ] for a fixed policy ( α , β ) by regressing the realized mediator M on the policy coordinates ( α , β ) , using the observed tuples { ( α i , β i , M i ) } i = 1 N . We proceed similarly in this regression, weighting observations according to their proximity to γ ( α , β ) . Let θ i = ( α i , β i ) and θ = ( α , β ) denote the observed and target incentive policies. We use the same policy-space isotropic distance and unnormalized weights as in the previous regression.
We can approximate μ M by a local plane around θ = ( α , β ) , given as the linear combination θ 0 + θ 1 ( α i − α ) + θ 2 ( β i − β ) of the variables ( α i − α ) and ( β i − β ) . Thus, using the approximation M i ≈ θ 0 + θ 1 ( α i − α ) + θ 2 ( β i − β ) , the WLS problem is given by
$$\min_{\theta_0, \theta_1, \theta_2} \sum_{i=1}^{N} w_i(h) \left[ M_i - \left( \theta_0 + \theta_1 (\alpha_i - \alpha) + \theta_2 (\beta_i - \beta) \right) \right]^2,$$
with the same kernel weights w i ( h ) . Then, the estimate of the mean at the target is the intercept in this linear model. Therefore, θ 0 ^ represents the estimated mean value of the plane at the specific target policy ( α , β ) and μ ^ M ( α , β ) = θ 0 ^ . In matrix form, let the local design row be x i = [ 1 , α i α , β i β ] , arranging in the matrix X R n × 3 the x i , and W = diag w 1 ( h ) , , w n ( h ) . A local linear WLS estimator for μ M ( α , β ) = E [ M Γ = ( α , β ) ] is
θ ^ = ( X W X ) 1 X W M , μ ^ M ( α , β ) = θ ^ 0 , with M = ( M 1 , , M n )
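A matching sketch for the second step, under the same illustrative naming assumptions, recovers $\hat\mu_M(\alpha,\beta)$ as the intercept of the local plane and plugs both steps into the closed form $\hat V(\alpha,\beta)$; the weight vector `w` is assumed to be the one returned by the previous sketch:

```python
import numpy as np

def mu_M_local(alphas, betas, M, alpha, beta, w):
    """Local-linear WLS estimate of E[M | Gamma=(alpha,beta)]; the intercept
    of the fitted local plane is the estimate (illustrative sketch)."""
    X = np.column_stack([np.ones_like(M), alphas - alpha, betas - beta])
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ M)
    return theta[0]  # mu_hat_M(alpha, beta)

def V_hat_affine(b, r, q, mu_M, alpha, beta):
    # Plug-in closed form: V = b + r*beta + (r*alpha + q)*mu_M.
    return b + r * beta + (r * alpha + q) * mu_M
```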

3.3.2. The Estimation in the General Gaussian Additive-Noise Case

We model the conditional distribution of the leader’s realized action Ω L given the mediator M and policy γ as the following Gaussian:
$$p(\omega_L \mid M = m, \Gamma = \gamma) \sim \mathcal{N}\big(\hat f_L(m,\gamma),\, \sigma_L^2(m,\gamma)\big).$$
We learn the mean map $\hat f_L: M \times A \to \mathbb{R}$ and the variance map $\sigma_L^2: M \times A \to (0,\infty)$ non-parametrically from data $\{(M_i, A_i, \Omega_{L,i})\}_{i=1}^N$ using kernel regression in an RKHS, i.e., Kernel Ridge Regression (KRR) on $(M, \Gamma)$ [4,27]. To learn the mean function $\hat f_L$, we solve the KRR problem, whose goal is to find a function $f$ that minimizes a regularized loss:
$$\hat f_L(m,\gamma) \in \arg\min_{f \in \mathcal{H}_{MA}} \sum_i \big(\omega_{L,i} - f(m_i, \gamma_i)\big)^2 + \lambda_f \|f\|_{\mathcal{H}_{MA}}^2.$$
The first term measures how well the function fits the training data. The second term is a regularization penalty on the complexity of the function, measured by its RKHS norm, with the hyperparameter $\lambda_f$ balancing the trade-off. By the representer theorem, the solution of problem (41) can be expressed as a linear combination of kernel functions centered at the training points, $f(\cdot) = \sum_{i \in [N]} \alpha_{f,i}\, k_{MA}(x_i, \cdot)$. In matrix form, $\hat f_L(m,a) = k(m,a)^\top \alpha_f$, where $k(m,a) = \big(k_{MA}(x_1,(m,a)), \ldots, k_{MA}(x_N,(m,a))\big)^\top$ and $\alpha_f = (K + \lambda_f I)^{-1} y$, with $K \in \mathbb{R}^{N \times N}$, $K_{ij} = k_{MA}(x_i, x_j)$, and $y = (y_1, \ldots, y_N)^\top$.
A product kernel on $(m,a)$ can capture interactions, as in our case where the effect of $m$ on $\Omega_L$ depends on the policy $\gamma$. We can therefore establish a product kernel on $(m,a)$ as follows:
$$k_{MA}^{\mathrm{prod}}\big((m,a),(m',a')\big) = \underbrace{\exp\!\left(-\frac{(m-m')^2}{2 l_M^2}\right)}_{k_M(m,m')} \times \underbrace{\exp\!\left(-\frac{\|a-a'\|^2}{2 l_\Gamma^2}\right)}_{k_\Gamma(a,a')}.$$
Likewise, for the variance function $\sigma_L^2(m,\gamma)$, we compute the residuals $r_i = \omega_{L,i} - \hat f_L(m_i,\gamma_i)$ and define pseudo-responses $z_i = \log(r_i^2 + \epsilon)$ with a small $\epsilon > 0$ for numerical stability. Modeling $\log \sigma_L^2(m,\gamma) = v(m,\gamma)$ in an RKHS guarantees strict positivity after exponentiation, turns multiplicative scale effects into additive structure, and improves residual behavior under Gaussian assumptions. Therefore, we fit KRR on $(m,\gamma) \mapsto z$ and solve
$$\hat v_L(m,\gamma) \in \arg\min_{v \in \mathcal{H}_{MA}} \sum_i \big(z_i - v(m_i,\gamma_i)\big)^2 + \lambda_v \|v\|_{\mathcal{H}_{MA}}^2.$$
As before, $v(\cdot) = \sum_{i \in [N]} \alpha_{v,i}\, k_{MA}(x_i, \cdot)$ with $\alpha_v = (K + \lambda_v I)^{-1} z$ and $z = (z_1, \ldots, z_N)^\top$. We predict $\hat v(m,a)$ and set $\hat\sigma_L^2(m,a) = \exp\big(\hat v(m,a)\big) > 0$.
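As a hedged illustration of this two-stage KRR construction (mean map plus log-variance map on the residuals), the following self-contained NumPy sketch uses a simple isotropic RBF kernel; the helper names and hyperparameter values are assumptions made for the example, not the paper's implementation:

```python
import numpy as np

def rbf(A, B, ls=1.0):
    # Squared-exponential kernel between row-stacked inputs A (n,d) and B (m,d).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2 / ls ** 2).sum(-1)
    return np.exp(-0.5 * d2)

def fit_krr_mean_logvar(X, y, ls=1.0, lam_f=1e-2, lam_v=1e-2, eps=1e-6):
    """KRR fit of the mean map f_L and log-variance map v_L on inputs
    X = stacked (m_i, gamma_i) features (hedged sketch)."""
    K = rbf(X, X, ls)
    n = len(y)
    alpha_f = np.linalg.solve(K + lam_f * np.eye(n), y)   # mean weights
    resid = y - K @ alpha_f                               # in-sample residuals
    z = np.log(resid ** 2 + eps)                          # pseudo-responses
    alpha_v = np.linalg.solve(K + lam_v * np.eye(n), z)   # log-variance weights
    def predict(Xq):
        Kq = rbf(Xq, X, ls)
        return Kq @ alpha_f, np.exp(Kq @ alpha_v)         # mean, strictly positive variance
    return predict
```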
Using the above, we can now estimate $\mu_L(\omega,m,a) = \mathbb{E}[Y \mid \Omega_L = \omega, M = m, \Gamma = a]$. We use a weighted KRR formulation to compute $\hat\mu_L(\omega,m,a)$ with observations $x_i = (\omega_i, m_i, a_i)$ and outputs $y_i = Y_i$, $i = 1, \ldots, N$. Let $k_X$ be a positive-definite kernel on $x = (\omega, m, a)$ with RKHS $\mathcal{H}_X$. Then, we solve the following weighted regularized least-squares problem:
$$\hat\mu_L(x) \in \arg\min_{\mu \in \mathcal{H}_X} \sum_{i=1}^{N} w_i(\gamma)\big(y_i - \mu(x_i)\big)^2 + \lambda_\mu \|\mu\|_{\mathcal{H}_X}^2,$$
employing $\mu(\cdot) = \sum_{i \in [N]} \alpha_{\mu,i}\, k_X(x_i, \cdot)$ with $\alpha_\mu = (W K + \lambda_\mu I_N)^{-1} W y$; thus, $\hat\mu_L(x) = k_x^\top \alpha_\mu$, where $K \in \mathbb{R}^{N \times N}$ with $K_{ij} = k_X(x_i, x_j)$, $W = \mathrm{diag}\big(w_1(\gamma), \ldots, w_N(\gamma)\big)$, $y = (y_1, \ldots, y_N)^\top$, and $k_x = [k_X(x_1,x), \ldots, k_X(x_N,x)]^\top$. The policy weights $w_i(\gamma) \geq 0$ with $\sum_i w_i(\gamma) = 1$ prioritize fidelity near $\gamma$. More details about these weights are given in Section 3.3.4.

3.3.3. Computing the Inner Gaussian Expectation

So far, we have estimated the quantities that appear within the curly brackets, before integration, in the identification Formula (29), shown again below for clarity:
$$V(\gamma) = \int \left\{ \int \mu_L(\omega_L, m, \gamma)\, p(\omega_L \mid m, \Gamma = \gamma)\, d\omega_L \right\} p(m \mid \Gamma = \gamma)\, dm = \int \mathbb{E}_{U_{\Omega_L}}\big[\mu_L(\gamma(m) + U_{\Omega_L}, m, \gamma)\big]\, p(m \mid \Gamma = \gamma)\, dm,$$
where $\mu_L(\omega_L, m, \gamma) := \mathbb{E}[Y \mid \Omega_L = \omega_L, M = m, \Gamma = \gamma]$. That is, we showed above how to estimate $\mu_L(\omega_L, m, \gamma)$ and $p(\omega_L \mid m, \Gamma = \gamma)$, and now we show how to approximate
$$\int \mu_L(\omega_L, m, \gamma)\, p(\omega_L \mid m, \Gamma = \gamma)\, d\omega_L = \mathbb{E}_{U_{\Omega_L}}\big[\mu_L(\gamma(m) + U_{\Omega_L}, m, \gamma)\big] = \mathbb{E}_{U_{\Omega_L}}\big[\mu_L(\Omega_L, m, \gamma) \,\big|\, M = m, \Gamma = \gamma\big] = g_{U_{\Omega_L}}(m;\gamma).$$
An efficient and natural way to compute the inner Gaussian expectation g ^ U Ω L ( m ; γ ) is by the Gauss–Hermite (GH) quadrature method. The integral for g U Ω L ( m ; γ ) can be expressed as
$$g_{U_{\Omega_L}}(m;\gamma) = \int \mu_L(\omega_L, m, \gamma)\, \frac{1}{\sqrt{2\pi}\,\sigma_{\Omega_L}} \exp\!\left(-\frac{(\omega_L - \mu_{\Omega_L})^2}{2\sigma_{\Omega_L}^2}\right) d\omega_L,$$
where $\mu_{\Omega_L} := \hat f_L(m,\gamma)$ and $\sigma_{\Omega_L} := \sigma_L(m,\gamma) > 0$ are the previously estimated maps, so that $p(\omega_L \mid M = m, \Gamma = \gamma) \sim \mathcal{N}(\mu_{\Omega_L}, \sigma_{\Omega_L}^2)$. Writing $\Omega_L = \mu_{\Omega_L} + U_L$ with $U_L \sim \mathcal{N}(0, \sigma_{\Omega_L}^2)$, the change of variables $U_L = \sigma_{\Omega_L} Z$, i.e., $Z = U_L/\sigma_{\Omega_L}$, transforms the expectation to the standard-normal scale. No information is lost; it is the same integral expressed in a form convenient for the GH method. Therefore, with a standard normal $Z \sim \mathcal{N}(0,1)$, equivalently, we have
$$g_Z(m;\gamma) = \mathbb{E}_Z\big[\mu_L(\mu_{\Omega_L} + \sigma_{\Omega_L} Z,\, m,\, \gamma)\big] = \int \mu_L(\mu_{\Omega_L} + \sigma_{\Omega_L} z,\, m,\, \gamma)\, \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, dz.$$
The GH method approximates integrals of the form $\int e^{-x^2} f(x)\, dx \approx \sum_{j=1}^{M} w_j^{GH} f(x_j^{GH})$, where $\{x_j^{GH}, w_j^{GH}\}_{j=1}^M$ are Hermite nodes and weights. To bring the standard-normal expectation (44) to this form, set $z = \sqrt{2}\,x$ ($dz = \sqrt{2}\,dx$), yielding
$$g_Z(m;\gamma) = \frac{1}{\sqrt{\pi}} \int \mu_L\big(\mu_{\Omega_L} + \sigma_{\Omega_L}\sqrt{2}\,x,\; m,\; \gamma\big)\, e^{-x^2}\, dx \quad (45)$$
$$\approx \frac{1}{\sqrt{\pi}} \sum_{j=1}^{M} w_j^{GH}\, \mu_L\big(\mu_{\Omega_L} + \sigma_{\Omega_L}\sqrt{2}\,x_j^{GH},\; m,\; \gamma\big), \quad (46)$$
where the computable approximation in (46) to the definite integral in (45) is obtained by applying an $M$-point GH rule. The weights of the GH rule always sum to $\sum_{j=1}^{M} w_j^{GH} = \int e^{-x^2}\, dx = \sqrt{\pi}$, and the Golub–Welsch algorithm is a standard numerical method for computing the nodes and weights of any Gaussian quadrature rule, including Gauss–Hermite. Unlike Monte Carlo, GH has no sampling variance; this stability is valuable when comparing many candidate policies $\gamma$. The inner integral is always one-dimensional (action noise), so GH is computationally light, and the nodes $\{x_j^{GH}, w_j^{GH}\}_{j=1}^M$ are pre-computable and reusable.
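A minimal sketch of the GH evaluation of (46), using NumPy's `numpy.polynomial.hermite.hermgauss` to generate the nodes and weights (the callable `mu_L` stands in for any fitted outcome regression; names are illustrative):

```python
import numpy as np

def inner_gh(mu_L, mu_Omega, sigma_Omega, m, gamma_feat, n_nodes=20):
    """Gauss-Hermite approximation of the inner Gaussian expectation."""
    x, w = np.polynomial.hermite.hermgauss(n_nodes)     # nodes/weights for e^{-x^2}
    omega = mu_Omega + sigma_Omega * np.sqrt(2.0) * x   # change of variables z = sqrt(2) x
    vals = np.array([mu_L(o, m, gamma_feat) for o in omega])
    return (w * vals).sum() / np.sqrt(np.pi)
```

Because the nodes and weights depend only on `n_nodes`, they can be computed once and reused across all candidate policies and mediator values.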

3.3.4. Policy-Local Empirical Measure Construction in the Outer Integral

Finally, the identified value of a fixed policy $\gamma$ can be written as a conditional expectation $\mathbb{E}\big[g_\gamma(M) \mid \Gamma = \gamma\big]$, since:
$$V(\gamma) = \int \mathbb{E}_{U_{\Omega_L}}\big[\mu_L(\gamma(m) + U_{\Omega_L}, m, \gamma)\big]\, p(m \mid \Gamma = \gamma)\, dm = \int g_{U_{\Omega_L}}(m;\gamma)\, p(m \mid \Gamma = \gamma)\, dm = \mathbb{E}\big[g_\gamma(M) \mid \Gamma = \gamma\big].$$
So, the task is to estimate the expectation over the mediator distribution under policy $\gamma$, which we do with a local kernel smoother in policy space. We observe i.i.d. pairs $\{(m_i, \gamma_i)\}_{i=1}^N$ with realized policies $\gamma_i$ (e.g., $\gamma_i = (\alpha_i, \beta_i)$). Because $\Gamma$ is continuous, the event $\{\Gamma = \gamma\}$ has probability zero; we therefore approximate the conditional expectation by localizing to policies $\gamma_i$ that are close to $\gamma$ and forming a policy-local empirical distribution over the observed mediators $\{M_i\}$. For that distribution, we need a distance $d(\gamma_i, \gamma)$ in the policy space. Since $\gamma_i, \gamma$ lie in an RKHS $\mathcal{H}_k$, we use $d(\gamma_i, \gamma) = \|\gamma_i - \gamma\|_{\mathcal{H}_k}$ to reflect the policy geometry. Additionally, we select a non-negative radial kernel $K(u) \geq 0$ with $u = d/h$ and a bandwidth $h > 0$; for example, the Gaussian kernel, also known as the Radial Basis Function (RBF) kernel, $K(u) = \exp(-u^2/2)$. So, as in the linear parametrized instance before, we define the unnormalized weights:
$$\tilde w_i(\gamma; h) = K\!\left(\frac{d(\gamma_i, \gamma)}{h}\right), \qquad w_i(\gamma; h) = \frac{\tilde w_i(\gamma; h)}{\sum_{j=1}^{N} \tilde w_j(\gamma; h)} \geq 0, \qquad \text{with } \sum_{i \in [N]} w_i = 1.$$
Thus, we approximate E g γ ( M ) Γ = γ using a policy-local empirical measure supported on the observed mediators for the conditional p ( m Γ = γ ) . The policy-local empirical distribution for approximating p ( m Γ = γ ) is defined as
$$\hat P_\gamma^{(h)}(dm) = \sum_{i=1}^{N} w_i(\gamma; h)\, \delta_{m_i}(dm),$$
where $\delta_{m_i}$ is the unit point mass at $m_i$. Intuitively, $\hat P_\gamma^{(h)}$ is the law of $M$ one would obtain by resampling historical mediators with probability proportional to how close their policies $\gamma_i$ are to the target $\gamma$. This approximation of $p(m \mid \Gamma = \gamma)$ turns the outer integral into an interpretable, positivity-respecting kernel average. Integrating the test function $\hat g_{U_{\Omega_L}}(\cdot;\gamma)$ against this measure yields
$$\hat V(\gamma) = \hat{\mathbb{E}}\big[g_\gamma(M) \mid \Gamma = \gamma\big] \quad (49)$$
$$= \int \hat g_{U_{\Omega_L}}(m;\gamma)\, \hat P_\gamma^{(h)}(dm) \quad (50)$$
$$= \sum_{i=1}^{N} w_i(\gamma; h)\, \hat g(M_i;\gamma), \quad (51)$$
which is precisely the outer half of the nested g-computation estimator $\hat V(\gamma)$. The policy-local reweighting in (49)–(51) implicitly requires a positivity (overlap) condition: for any target policy $\gamma$, there must be sufficient probability mass of logged policies in a neighborhood of $\gamma$ so that the conditional law $p(M \mid \Gamma = \gamma)$ can be well approximated by the empirical measure $\hat P_\gamma^{(h)}$. In finite samples, this induces an empirical support constraint: $\hat V(\gamma)$ is reliable only if the weights $w_i(\gamma; h)$ do not degenerate. We quantify local support via the effective sample size (ESS):
$$\mathrm{ESS}(\gamma; h) := \frac{\big(\sum_{i=1}^{N} w_i(\gamma; h)\big)^2}{\sum_{i=1}^{N} w_i(\gamma; h)^2}.$$
The ESS is a measure of how many independent samples the weighted sample set is equivalent to. A low ESS indicates high variance and instability in the estimate V ^ ( γ ) because the estimate is dominated by a small number of data points. We also define the support set:
$$S_{\mathrm{support}}(h, \tau, w_{\max}) := \Big\{\gamma \in \Gamma : \mathrm{ESS}(\gamma; h) \geq \tau,\; \max_i w_i(\gamma; h) \leq w_{\max}\Big\}.$$
The support set $S_{\mathrm{support}}$ defines the region of the policy space $\Gamma$ where the estimate $\hat V(\gamma)$ is considered empirically reliable. Policies $\gamma$ outside this set are not trustworthy due to a lack of data overlap. The $\mathrm{ESS}(\gamma;h) \geq \tau$ condition ensures that the evaluation is based on a sufficiently large effective sample size, preventing the estimate from being dominated by just a few data points. By requiring $\mathrm{ESS}(\gamma;h) \geq \tau$, we enforce a degree of overlap between the target policy $\gamma$ and the logged policies. Higher $\tau$ means we demand better overlap and a more stable estimate, which shrinks the support set $S_{\mathrm{support}}$. The $\max_i w_i(\gamma;h) \leq w_{\max}$ condition directly guards against extreme importance weights, which are a hallmark of poor overlap or model mismatch in off-policy evaluation. This constraint helps bound the influence of rare, yet highly weighted, events. Policies $\gamma$ that require assigning an enormous weight to any single data point are excluded from the support set. Lower $w_{\max}$ means we tolerate less weight variance and demand better weight balance, which also shrinks the support set $S_{\mathrm{support}}$. The effective sample size (ESS) serves as a proxy for uncertainty in the outer expectation. A small ESS suggests high estimator variance and low information, which necessitates a trust-region restriction in policy space. Conversely, a large ESS indicates data-supported proposals and reduced uncertainty.
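The weight, ESS, and support-set computations are a few lines of NumPy; the threshold values `tau` and `w_max` below are illustrative placeholders:

```python
import numpy as np

def policy_local_weights(dists, h):
    # Normalized Gaussian-kernel weights over RKHS distances d(gamma_i, gamma).
    w = np.exp(-0.5 * (dists / h) ** 2)
    return w / w.sum()

def ess(w):
    # Effective sample size of normalized weights: (sum w)^2 / sum w^2.
    return (w.sum() ** 2) / (w ** 2).sum()

def in_support(w, tau=30.0, w_max=0.2):
    # Membership test for the support set S_support(h, tau, w_max).
    return ess(w) >= tau and w.max() <= w_max

# Outer estimate, cf. Eq. (51): V_hat = sum_i w_i * g_hat(M_i; gamma).
```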

4. A Functional Bayesian Optimization Algorithm for Single-Stage Canonical PAPs

Functional Bayesian Optimization (FBO) is the version of Bayesian Optimization (BO) (see Appendix C) in which the domain $D$ of the objective function to maximize, $f_{\mathrm{obj}}: D \to \mathbb{R}$, is a space of functions. We leverage the sequential optimization process of BO and the Gaussian Process Upper Confidence Bound (GP-UCB) acquisition function (see Appendix C.2) to solve the single-stage canonical $PAP_{L_1F_1}$. Specifically, we propose an FBO algorithm to sequentially solve the following optimization problem:
$$\gamma^* \in \arg\max_{\gamma \in \Gamma} \mathbb{E}\big[J_L \mid do(\Gamma = \gamma)\big].$$
This is an objective functional, since its domain is a space of incentive functions Γ . We employ a GP-UCB acquisition functional (see Appendix C.2 and Section 4.2) in the proposed FBO algorithm as the strategy to guide the selection of the best next incentive function evaluation γ t in the sequential optimization of Equation (52). Additionally, we use a GP-UCB acquisition functional to establish upper bounds on the cumulative regret R T for sequentially optimizing the objective functional in Equation (52) with horizon T when the FBO algorithm described in the following subsections is employed.
In order to describe the proposed FBO algorithm to solve Equation (52), we begin by establishing a functional Gaussian process as a surrogate function model for the objective functional in Section 4.1. For this, we set up a functional space for the incentive functions space Γ , which allows the definition of a Gaussian process kernel over this space of functions in Section 4.1.1, and in light of this, we show how to compute the posterior distribution for this functional Gaussian process in Section 4.1.2. Later, the GP-UCB acquisition functional for functional search is described in Section 4.2. Having these elements available, we present, in Section 4.3, Algorithm 1 to solve Equation (52). This algorithm is referred to as the Stackelberg Functional Causal Bayesian Optimization (FCBO) algorithm for P A P L 1 F 1 . In Section 4.4, we show cumulative-regret bounds results in terms of differential information gain for the Stackelberg FCBO algorithm.
Algorithm 1: Stackelberg FCBO for single-stage $PAP_{L_1F_1}$
Input: $D_1 = \{(\gamma_i, \omega_{F,i}, \omega_{L,i}, J_{L,i})\}_{i=1}^N$; the $\mathcal{H}_k$ specification; the horizon $T$; the functional GP kernel $K$ on $\Gamma \subseteq \mathcal{H}_k$; the exploration schedule $\{\beta_t\}_{t=1}^T$; and the CGM $M_{L_1F_1}$ for a $PAP_{L_1F_1}$ from the principal's perspective.
Output: $D_T = \{(\gamma_t, \mathbb{E}_{\gamma_t}[J_L])\}_{t \in [T]}$, with $\gamma_t \in \mathcal{H}_k^d$ or $\gamma_t \in \mathcal{H}_k$
1: Initialize the functional GP $\mathcal{GP}\big(\mu_0(\gamma), K_0(\gamma, \gamma')\big)$;
2: for $t = 1, \ldots, T$ do
3:   Select $\gamma_t \in \arg\max_{\gamma \in \mathcal{H}_k} \mu_t(\gamma) + \sqrt{\beta_t}\,\sigma_t(\gamma)$ (see Section 4.2);
4:   Estimate $\mathbb{E}_{\gamma_t}[J_L]$ as $\hat V(\gamma_t) = \hat{\mathbb{E}}\big[J_L \mid do(\Gamma = \gamma_t)\big]$ (see Section 3.3);
5:   Set $D_t \leftarrow D_{t-1} \cup \{(\gamma_t, \mathbb{E}_{\gamma_t}[J_L])\}$;
6:   Update the functional GP $\mathcal{GP}\big(\mu_t(\gamma), K_t(\gamma, \gamma'); D_t\big)$ (see Equation (55));
7: end
8: return $D_T = \{(\gamma_t, \mathbb{E}_{\gamma_t}[J_L])\}_{t \in [T]}$

4.1. Functional Gaussian Process Surrogate Model

We employ a Reproducing Kernel Hilbert Space (RKHS) as the function-space domain of the objective functional $\mathbb{E}_{\gamma}[J_L]: \Gamma \to \mathbb{R}$ in Equation (52). Taking $\Gamma$ as $\mathcal{H}_k$, that is, $\mathbb{E}_{\gamma}[J_L]: \mathcal{H}_k \to \mathbb{R}$, where $\mathcal{H}_k$ is an RKHS with reproducing kernel $k: X \times X \to \mathbb{R}$ and $X \subseteq \mathbb{R}$ (see Appendix B), allows us to directly define a Gaussian process kernel over functions in $\mathcal{H}_k$ (see Section 4.1.1). In each round $t$, an incentive function $\gamma_t \in \mathcal{H}_k$ is selected by means of a GP-UCB acquisition functional (see Section 4.2), and a noisy evaluation of $\mathbb{E}_{\gamma_t}[J_L]$ is returned.
We use a Gaussian process (GP) as a surrogate model for the objective functional $\mathbb{E}_{\gamma}[J_L]: \mathcal{H}_k \to \mathbb{R}$. Assuming the input function space is an RKHS $\mathcal{H}_k$ allows us to define a GP kernel over functions in this space in Section 4.1.1. The GP is the stochastic process $F_\gamma = \{f(\gamma) \mid \gamma \in \mathcal{H}_k\}$, in which every finite collection of random variables $f(\gamma)$ has a multivariate normal distribution, and we write $F_\gamma \sim \mathcal{GP}(\mu(\gamma), K(\gamma,\gamma'))$. Next, based on [6], we formulate a GP kernel $K(\gamma,\gamma') = \mathrm{cov}[f(\gamma), f(\gamma')]$ with $\gamma, \gamma' \in \mathcal{H}_k$. See [4] for GP fundamentals and [6,25] for functional BO surrogates.

4.1.1. Functional Gaussian Process Kernel

Based on the RKHS $\mathcal{H}_k$ kernel $k$, the GP kernel $K(\gamma,\gamma') = \mathrm{cov}[f(\gamma), f(\gamma')]$ can be constructed. We focus here on the construction of a functional version of the standard Radial Basis Function (RBF) kernel, which we assume in the regret analysis of this functional Bayesian framework.
The functional RBF kernel is stated in Equation (53), where $\|\cdot\|_{\mathcal{H}_k}$ is the norm in the RKHS $\mathcal{H}_k$, so $\|\gamma - \gamma'\|_{\mathcal{H}_k}^2$ is the squared distance between the functions $\gamma, \gamma' \in \mathcal{H}_k$, computable as the inner product $\langle \gamma - \gamma', \gamma - \gamma' \rangle_{\mathcal{H}_k}$ in the RKHS $\mathcal{H}_k$.
$$K(\gamma, \gamma') = \exp\!\left(-\frac{\|\gamma - \gamma'\|_{\mathcal{H}_k}^2}{2\sigma^2}\right).$$
Thus, for incentive functions with kernel expansions $\gamma = \sum_{i \in [n]} \upsilon_i\, k(x_i, \cdot)$ and $\gamma' = \sum_{j \in [m]} \xi_j\, k(x'_j, \cdot)$, the squared distance $\|\gamma - \gamma'\|_{\mathcal{H}_k}^2$ in the functional RBF is given by Equation (54), with $\upsilon_i, \xi_j \in \mathbb{R}$, $n, m \in \mathbb{N}$, and $x_i, x'_j \in X$.
$$\|\gamma - \gamma'\|_{\mathcal{H}_k}^2 = \langle \gamma - \gamma', \gamma - \gamma' \rangle_{\mathcal{H}_k} = \sum_{i \in [n]}\sum_{j \in [n]} \upsilon_i \upsilon_j\, k(x_i, x_j) - 2 \sum_{i \in [n]}\sum_{j \in [m]} \upsilon_i \xi_j\, k(x_i, x'_j) + \sum_{i \in [m]}\sum_{j \in [m]} \xi_i \xi_j\, k(x'_i, x'_j).$$
We reserve lower-case k ( x , x ) for the reproducing kernel of the input RKHS H k over X, and upper-case K ( γ , γ ) for the Gaussian-process kernel over incentive functions Γ = H k . Thus, k appears only inside · H k and expansions like Equation (54), while K and its posterior K t govern GP means, covariances, and Gram matrices.
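Since incentive functions enter only through their kernel expansions, the squared RKHS distance in Equation (54) and the functional RBF in Equation (53) can be evaluated directly from expansion coefficients. A small sketch, assuming a scalar-input RBF reproducing kernel $k$ purely for illustration:

```python
import numpy as np

def rkhs_sq_dist(ups, xs, xis, xps, k):
    """Squared RKHS distance ||gamma - gamma'||^2 for functions given as
    kernel expansions gamma = sum_i ups[i] k(xs[i], .), per Eq. (54)."""
    Kxx = k(xs[:, None], xs[None, :])
    Kxxp = k(xs[:, None], xps[None, :])
    Kxpxp = k(xps[:, None], xps[None, :])
    return ups @ Kxx @ ups - 2.0 * ups @ Kxxp @ xis + xis @ Kxpxp @ xis

def functional_rbf(ups, xs, xis, xps, k, sigma=1.0):
    # Functional RBF kernel K(gamma, gamma') of Eq. (53).
    return np.exp(-rkhs_sq_dist(ups, xs, xis, xps, k) / (2.0 * sigma ** 2))

# Example input-space reproducing kernel on X subset of R (an assumption here):
k = lambda x, xp: np.exp(-0.5 * (x - xp) ** 2)
```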

4.1.2. Functional Gaussian Process Posterior Distribution

After building the functional RBF kernel, primarily by determining how to measure the distance between elements $\gamma$ of the function space $\mathcal{H}_k$, the posterior distribution of the functional GP $F_\gamma \sim \mathcal{GP}(\mu(\cdot), K(\cdot,\cdot))$ may be updated in a way very similar to a scalar GP. The posterior distribution is again a GP, with functional mean $\mu_t(\cdot)$ and covariance kernel $K_t(\cdot,\cdot)$ given as follows:
$$\mu_t(\gamma) = k_t(\gamma)^\top (G_t + \sigma^2 I)^{-1} y_t, \qquad K_t(\gamma, \gamma') = K(\gamma, \gamma') - k_t(\gamma)^\top (G_t + \sigma^2 I)^{-1} k_t(\gamma'),$$
where $k_t(\gamma) = [K(\gamma, \gamma_1), K(\gamma, \gamma_2), \ldots, K(\gamma, \gamma_t)]^\top$, $G_t$ is the $t \times t$ Gram matrix with entries $G_t[i,j] = K(\gamma_i, \gamma_j)$ for $i,j \in [t]$ (see Appendix B), and $y_t = [y_1, \ldots, y_t]^\top$ is the vector of estimates $y_1 = \mathbb{E}_{\gamma_1}[J_L], \ldots, y_t = \mathbb{E}_{\gamma_t}[J_L]$.
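Equation (55) is the standard GP regression update with the Gram matrix taken over incentive functions. A compact sketch, where `Kfun` is any functional kernel such as the functional RBF above (names and default values are illustrative):

```python
import numpy as np

def gp_posterior(Kfun, gammas, y, gamma_q, sigma2=1e-2):
    """Posterior mean/variance of the functional GP at a query incentive
    gamma_q, per Eq. (55); a hedged sketch, not a production implementation."""
    t = len(gammas)
    G = np.array([[Kfun(a, b) for b in gammas] for a in gammas])  # Gram matrix G_t
    k_q = np.array([Kfun(gamma_q, g) for g in gammas])            # k_t(gamma_q)
    A = G + sigma2 * np.eye(t)
    mu = k_q @ np.linalg.solve(A, np.asarray(y, dtype=float))
    var = Kfun(gamma_q, gamma_q) - k_q @ np.linalg.solve(A, k_q)
    return mu, max(var, 0.0)  # clamp tiny negative variances from round-off
```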

4.2. Upper Confidence Bound Acquisition Functional

The functional version of the GP-UCB acquisition function (see Appendix C.2, Equation (A20)) is given in Equation (56):
$$\alpha_{\mathrm{fGP\text{-}UCB}}(\gamma; D, \beta_t) = \mu_t(\gamma) + \sqrt{\beta_t}\,\sigma_t(\gamma),$$
where $\sigma_t(\gamma) = \sqrt{K_t(\gamma,\gamma)}$ (see Equation (55)). This acquisition functional can be interpreted as the strategy to choose an incentive function as follows:
$$\gamma_{t+1} = \arg\max_{\gamma \in \mathcal{H}_k} \mu_t(\gamma) + \sqrt{\beta_t}\,\sigma_t(\gamma),$$
If we removed the second term $\sqrt{\beta_t}\,\sigma_t(\gamma)$ and optimized only $\mu_t(\gamma)$, i.e., maximized the expected reward under the posterior so far, the rule would be too greedy too soon and would get stuck in shallow local optima. Instead, the full objective $\mu_t(\gamma) + \sqrt{\beta_t}\,\sigma_t(\gamma)$ in Equation (57) prefers incentive functions $\gamma$ where $f(\gamma)$ is uncertain, that is, with large $\sigma_t(\gamma)$, as well as those where we expect high rewards, i.e., with large $\mu_t(\gamma)$. Thus, it implicitly handles the trade-off between exploration and exploitation. An interpretation of this intervention rule is that it greedily selects intervention functions $\gamma$ such that $f(\gamma)$ should be a reasonable upper bound on $f(\gamma^*)$, where $\gamma^*$ is the optimal incentive function (see Appendix C.2; this follows the GP-UCB principle [7]). The UCB term $\beta_t^{1/2}\sigma_{t-1}(\gamma)$ promotes policies with high expected information gain about the functional value. As shown in the regret analysis in Section 4.4, this connects the functional acquisition directly to mutual-information control. Our analysis offers theoretical guidance for acquisition, specifically through trust-region methods using ESS and weight caps. Empirical tuning and benchmarking are intentionally deferred to a companion, application-driven study.

4.3. The Stackelberg FCBO Algorithm for Single-Stage Canonical PAPs

We present a Functional Causal Bayesian Optimization (FCBO) algorithm for the single-stage P A P L 1 F 1 by integrating all components of Section 4.1 and Section 4.2. We refer to this algorithm, shown in Algorithm 1, as the Stackelberg FCBO algorithm for the single-stage P A P L 1 F 1 . Recall that Algorithm 1 addresses the optimization problem of Equation (52).
Algorithm 1 receives as input a data set of past observations of the system, $D_1 = \{(\gamma_i, \omega_{F,i}, \omega_{L,i}, J_{L,i})\}_{i=1}^N$; the specification of the incentive function space given as an RKHS $\mathcal{H}_k$, where $\gamma_t \in \mathcal{H}_k^d$ (finite function space) or $\gamma_t \in \mathcal{H}_k$ (infinite function space); the number of rounds $T$ of the sequential optimization; the kernel $K$ of the functional GP; the exploration schedule $\{\beta_t\}_{t=1}^T$; and the CGM $M_{L_1F_1}$ for a $PAP_{L_1F_1}$ from the principal's perspective. The functional space $\mathcal{H}_k$ can be specified by stating the reproducing kernel $k$ of the RKHS $\mathcal{H}_k$ (see Appendix B). In addition, the functional GP kernel $K$ over $\Gamma \subseteq \mathcal{H}_k$ must be specified (by default, we can use the functional RBF kernel of Equation (53)). In Algorithm 1, we use a time-indexed exploration coefficient $\beta_t$ to match the high-probability guarantees in Section 4.4. For these results, we instantiate
$$\beta_t = 2 \log\!\left(\frac{t^2 \pi^2 |\Gamma|}{6\delta}\right),$$
so the acquisition is $\mu_t(\gamma) + \sqrt{\beta_t}\,\sigma_t(\gamma)$. In practice, one may instead tune a constant confidence level $\rho$ and set $\beta = \Phi^{-1}(\rho)$ (where $\Phi^{-1}$ is the quantile function, i.e., the inverse cumulative distribution function (CDF); see Appendix C.2). This is equivalent to using a constant exploration coefficient $\beta_t \equiv \beta$. Our regret bounds in Section 4.4 hold whenever the run uses the exploration schedule $\beta_t = 2\log\big(t^2\pi^2|\Gamma|/(6\delta)\big)$ or any schedule that dominates it pointwise. We set this schedule as the default in Algorithm 1.
The CGM $M_{L_1F_1}$ from the principal's perspective has, as causal structure, the DAG induced on the set of observable variables $O_L = \{\Gamma, \Omega_L, \Omega_F, J_L\}$ by the arrows $\Gamma \to \Omega_L$, $\Gamma \to \Omega_F$, $\Omega_F \to \Omega_L$, $\Omega_L \to J_L$, and $\Omega_F \to J_L$. Additionally, the follower utility $J_F$ is an unobservable or hidden variable, and the follower mechanism $f_{\Omega_F}(\gamma, U_{\Omega_F})$ is a black-box function. Algorithm 1 returns the accumulated history data set $D_T$ after $T$ rounds of the sequential optimization:
$$D_T = \big\{(\gamma_1, \mathbb{E}_{\gamma_1}[J_L]), \ldots, (\gamma_T, \mathbb{E}_{\gamma_T}[J_L])\big\}.$$
We model $f(\gamma)$ as a sample from $F_\gamma = \{f(\gamma) \mid \gamma \in \mathcal{H}_k\}$ with $F_\gamma \sim \mathcal{GP}(\mu(\gamma), K(\gamma,\gamma'))$, which serves as the surrogate model for the objective functional $\mathbb{E}_{\gamma}[J_L]: \mathcal{H}_k \to \mathbb{R}$ (see Section 4.1). This GP is completely specified by its mean functional $\mu(\gamma) = \mathbb{E}[f(\gamma)]$ and kernel functional $K(\gamma,\gamma') = \mathbb{E}\big[(f(\gamma) - \mu(\gamma))(f(\gamma') - \mu(\gamma'))\big]$. We assume, without loss of generality (see [4]), a GP prior over $F_\gamma$ with zero mean function, i.e., $\mu \equiv 0$. So, we initialize the functional GP in Line 1 with $\mu_0(\gamma)$ as the zero function and $K_0(\gamma,\gamma')$ as the functional RBF in Equation (53), to begin the inductive process of Algorithm 1.
From Line 2 to Line 7, the sequential optimization is performed over $T$ rounds. The decision on the incentive function $\gamma_t$ is made with the GP-UCB acquisition functional (see Section 4.2), viewed as a policy $\pi_t: D_{t-1} \mapsto \gamma_t$. The expected value of $J_L$ under the post-intervention distribution $p(J_L \mid do(\Gamma = \gamma_t))$ is estimated by computing $\hat V(\gamma_t) = \hat{\mathbb{E}}\big[J_L \mid do(\Gamma = \gamma_t)\big]$, using the techniques developed in Section 3.3. The estimation of $\mathbb{E}_{\gamma_t}[J_L]$ is how Algorithm 1 gathers information about the objective functional of Equation (52) to guide the search for $\gamma^*$. To complete a round $t$, the data set $D_t$ is first updated in Line 5 by adding the new observation pair $(\gamma_t, \mathbb{E}_{\gamma_t}[J_L])$; then, in Line 6, the surrogate for the objective functional, i.e., the functional GP, is updated using $D_t$. Figure 2 complements Algorithm 1 with a flow chart of one FCBO round. The support-aware GP-UCB step and the two-layer estimator are highlighted as novel elements.
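Putting the pieces together, one run of this loop over a finite candidate pool can be sketched as below; `estimate_V` stands in for the two-layer estimator of Section 3.3, `gp_posterior` is the posterior sketch above, and the restriction to a candidate pool is an assumption of this example (Algorithm 1 itself optimizes the acquisition over $\mathcal{H}_k$):

```python
import numpy as np

def fcbo_loop(candidates, Kfun, estimate_V, T, delta=0.05, sigma2=1e-2):
    """Hedged sketch of the Stackelberg FCBO rounds over a finite pool."""
    gammas, ys = [], []
    for t in range(1, T + 1):
        # Exploration schedule beta_t = 2 log(t^2 pi^2 |Gamma| / (6 delta)).
        beta_t = 2.0 * np.log(t**2 * np.pi**2 * len(candidates) / (6.0 * delta))
        scores = []
        for g in candidates:
            if not gammas:                     # prior: zero mean, K(g, g) variance
                mu, var = 0.0, Kfun(g, g)
            else:
                mu, var = gp_posterior(Kfun, gammas, ys, g, sigma2)
            scores.append(mu + np.sqrt(beta_t * var))   # GP-UCB acquisition
        g_t = candidates[int(np.argmax(scores))]
        gammas.append(g_t)
        ys.append(estimate_V(g_t))             # offline estimate of V(gamma_t)
    best = int(np.argmax(ys))
    return gammas[best], ys[best]              # gamma_plus and its estimated value
```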
Once the T rounds are completed, Algorithm 1 returns D T . Observe that, despite the fact that the algorithm returns the cumulative history data set in Equation (58), it implicitly addresses the optimization problem of Equation (52), since the following holds for large T:
$$\gamma^* \approx \gamma^+ \in \arg\max_{(\gamma_t,\, \mathbb{E}_{\gamma_t}[J_L]) \in D_T} \big\{ \mathbb{E}_{\gamma_1}[J_L], \ldots, \mathbb{E}_{\gamma_T}[J_L] \big\}.$$
That is, due to the convergence result in Section 4.4 for Algorithm 1, we keep the best incentive function $\gamma^+$ from $(\gamma^+, \mathbb{E}_{\gamma^+}[J_L]) \in D_T$ as the solution to Equation (52), since $\gamma^+$ approximates the optimum $\gamma^*$ of the functional optimization problem in Equation (52). Furthermore, we can bound, with high probability, the cumulative regret $R_{D_T}^T$ incurred when using Algorithm 1 to address the single-stage canonical $PAP_{L_1F_1}$. This result is presented in the following subsection.

4.4. Information-Theoretic Regret Bounds on the Stackelberg FCBO Algorithm

The cumulative-regret bounds for Algorithm 1 are derived by adapting the regret bounds results in terms of differential information gain (see Section 4.4.1) from [7] to our problem setting. In our setting, each observation y t = V ^ ( γ t ) aggregates information about an interventional query V ( γ t ) on the underlying CGM. The resulting regret bounds therefore quantify how quickly an information-guided sequence of interventions d o ( Γ = γ t ) converges, in policy value, to the best available intervention in the considered class. First, in Section 4.4.2, we deal with the setting in which the set of admissible incentive functions Γ is finite. Specifically, we assume that Γ is represented as the finite RKHS H k d provided in Appendix B, so | Γ | = d N . Then, we show how to extend the results for the finite case from Section 4.4.2 to the general case where Γ is an infinite RKHS H k in Section 4.4.3.
Recall from Section 4.1 that the objective functional in Equation (52) is viewed as a sample path $f$ of the functional GP $F_\gamma = \{f(\gamma) \mid \gamma \in \mathcal{H}_k\}$ with $F_\gamma \sim \mathcal{GP}(\mu(\gamma), K(\gamma,\gamma'))$. Algorithm 1 uses a centered GP $\mathcal{GP}\big(\mu_0, K(\gamma,\gamma')\big)$ as the prior distribution over $F_\gamma$; that is, we assume a priori that the objective functional in Equation (52) is a sample path $f$ from a centered GP. The initialization of the functional GP in Algorithm 1 also specifies $K_0(\gamma,\gamma')$ as the functional RBF in Equation (53). However, following the analysis in [7], we do not assume a specific functional kernel.
At round $t$, the observation is $y_t = \hat V(\gamma_t)$ with estimation error $\varepsilon_t := y_t - V(\gamma_t)$. We assume that $\{\varepsilon_t\}_{t\geq 1}$ forms a conditionally sub-Gaussian martingale-difference sequence with respect to the algorithm's filtration $\{\mathcal{F}_t\}$, i.e., $\mathbb{E}[\varepsilon_t \mid \mathcal{F}_{t-1}] = 0$ and $\mathbb{E}[\exp(\lambda\varepsilon_t) \mid \mathcal{F}_{t-1}] \leq \exp(\lambda^2 R^2/2)$ for all $\lambda \in \mathbb{R}$ and some envelope $R > 0$ (cf. kernelized bandit analyses with conditionally sub-Gaussian noise [8]). The variance may depend on $\gamma_t$ (heteroskedasticity). For the analysis, we further assume a uniform bound $\mathrm{Var}(\varepsilon_t \mid \mathcal{F}_{t-1}) \leq \sigma_*^2$ on the support set where the Stackelberg FCBO algorithm proposes candidates. Under this uniform envelope, the standard GP-UCB high-probability confidence sets and regret bounds apply, with $\sigma_*^2$ entering the information-gain and $\beta_t$ expressions (Section 4.4.1).

4.4.1. Differential Information Gain

A primary concern in the regret analysis of Algorithm 1 is to measure how efficiently $\tau$ noisy observations $D_\tau = \big\{(\gamma_1, \mathbb{E}_{\gamma_1}[J_L]), \ldots, (\gamma_\tau, \mathbb{E}_{\gamma_\tau}[J_L])\big\}$ acquire knowledge about the objective functional $\mathbb{E}\big[J_L \mid do(\Gamma = \gamma)\big]$ in Equation (52). This learning becomes apparent through the improvement of the GP surrogate model $\mathcal{GP}(\mu(\gamma), K(\gamma,\gamma'))$ after each observation step, in which Algorithm 1 decides on $\gamma_t$ via the GP-UCB acquisition functional policy $\pi_t: D_{t-1} \mapsto \gamma_t$ and then, after the follower response $\omega_F$, computes $\hat V(\gamma_t) = \hat{\mathbb{E}}\big[J_L \mid do(\Gamma = \gamma_t)\big]$ to obtain the observed pair $(\gamma_t, \mathbb{E}_{\gamma_t}[J_L])$. The informativeness of this observation process, as a function of the number of observations $\tau$, provides a way to bound the ability to learn about the objective functional.
Here, we write $\hat{\mathbb{E}}_{\gamma_t}[J_L] = \mathbb{E}_{\gamma_t}[J_L] + \epsilon_t$ to explicitly distinguish the assumed noisy observations $\hat{\mathbb{E}}_{\gamma_t}[J_L]$ from exact evaluations $\mathbb{E}_{\gamma_t}[J_L]$ of the objective functional. The information capacity of an arbitrary set of sampling points $\Gamma_\tau = \{\gamma_1, \ldots, \gamma_\tau\} \subseteq \Gamma$ regarding the objective functional can be measured by the mutual information $I_\tau$, also known as information gain, between $f_\tau = \{\mathbb{E}_{\gamma_1}[J_L], \ldots, \mathbb{E}_{\gamma_\tau}[J_L]\}$ and the noisy observations $y_\tau = \{\hat{\mathbb{E}}_{\gamma_1}[J_L], \ldots, \hat{\mathbb{E}}_{\gamma_\tau}[J_L]\}$ at these points. That is, it measures the reduction in uncertainty about the objective functional resulting from the disclosure of $y_\tau$, calculated as in Equation (60).
$$I_\tau(f_\tau; y_\tau) = H(f_\tau) - H(f_\tau \mid y_\tau) = H(y_\tau) - H(y_\tau \mid f_\tau) = I_\tau(y_\tau; f_\tau),$$
as mutual information is symmetric, where $H(f_\tau)$ denotes the differential entropy of $f_\tau$ and $H(y_\tau \mid f_\tau)$ the conditional differential entropy of $y_\tau$ given $f_\tau$. The differential entropy of a multivariate Gaussian vector is $H(X_1, \ldots, X_n) = H(\mathcal{N}_n(\mu, K)) = \frac{1}{2}\log\big[(2\pi e)^n |K|\big]$, where $|K|$ denotes the determinant of the covariance matrix $K$ (see [28]). Thus, we can write the covariance of $y_\tau$ as $K_\tau + \sigma^2 I$, separating the additive noise into a diagonal matrix $\sigma^2 I$, where $K_\tau$ is the Gram matrix of the functional kernel at the sampled points, with entries $K(\gamma_i, \gamma_j)$ for $\gamma_i, \gamma_j \in \Gamma_\tau$. Then, we can compute $H(y_\tau)$ as follows:
$$H(y_\tau) = \tfrac{1}{2}\log\big[(2\pi e)^\tau\, |K_\tau + \sigma^2 I|\big].$$
To compute $H(y_\tau \mid f_\tau)$, observe that given $f_\tau$, the only randomness left is the noise, which has covariance $\sigma^2 I$, so
$$H(y_\tau \mid f_\tau) = \tfrac{1}{2}\log\big[(2\pi e)^\tau |\sigma^2 I|\big] = \tfrac{1}{2}\log\big[(2\pi e)^\tau \sigma^{2\tau}\big].$$
Putting it all together, the information gain $I_\tau(f_\tau; y_\tau)$ is given by $\frac{1}{2}\log\big|I + \sigma^{-2} K_\tau\big|$, as shown next in Equation (63):
$$I_\tau(f_\tau; y_\tau) = H(y_\tau) - H(y_\tau \mid f_\tau) = \tfrac{1}{2}\log\frac{|K_\tau + \sigma^2 I|}{|\sigma^2 I|} = \tfrac{1}{2}\log\big|I + \sigma^{-2} K_\tau\big|.$$
In our setting, we define the maximum information gain I T after T rounds as
$$I_T = \tfrac{1}{2}\log\big|I + \Sigma_T^{-1} G_T\big|, \qquad \Sigma_T := \mathrm{diag}(\sigma_1^2, \ldots, \sigma_T^2),$$
where $I$ is the identity matrix and $G_T$ is the Gram matrix of the functional kernel at the queried policies. For notational simplicity in the stated bounds, we adopt the conservative specialization $\Sigma_T \equiv \sigma_*^2 I$, which yields $I_T \leq \frac{1}{2}\log\big|I + \sigma_*^{-2} G_T\big|$ and recovers the expressions in our regret constants.
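Numerically, the information gain is a log-determinant, best computed with a stable `slogdet`; a minimal sketch under the conservative specialization $\Sigma_T \equiv \sigma_*^2 I$:

```python
import numpy as np

def info_gain(G, sigma2):
    """Information gain I = 0.5 * log det(I + G / sigma2), cf. Eq. (63),
    computed via slogdet for numerical stability (illustrative sketch)."""
    T = G.shape[0]
    sign, logdet = np.linalg.slogdet(np.eye(T) + G / sigma2)
    return 0.5 * logdet
```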
Having presented mutual information based on differential entropy as the measure of informativeness, with respect to the objective functional, of a set of noisy observations $y_\tau = \{\hat{\mathbb{E}}_{\gamma_1}[J_L], \ldots, \hat{\mathbb{E}}_{\gamma_\tau}[J_L]\}$ at sampling points $\Gamma_\tau = \{\gamma_1, \ldots, \gamma_\tau\} \subseteq \Gamma$, we can now present the regret analysis for Algorithm 1 in terms of the information gain $I_\tau(f_\tau; y_\tau)$.

4.4.2. Regret Bound for a Finite Incentive Function Space

The cumulative regret of a sequential optimization process is the loss in reward resulting from the inability to know in advance the points that maximize the objective function. For Algorithm 1, this is the error between the chosen incentive functions $\gamma_1, \ldots, \gamma_T$ and those that maximize the objective functional $\mathbb{E}\big[J_L \mid do(\Gamma = \gamma)\big]$. That is, letting $\gamma^* \in \arg\max_{\gamma \in \Gamma} f(\gamma)$ with $f(\gamma) = \mathbb{E}\big[J_L \mid do(\Gamma = \gamma)\big]$, an incentive function choice $\gamma_t$ in round $t$ incurs instantaneous regret $r_t = f(\gamma^*) - f(\gamma_t)$, and the cumulative regret is $R_T = \sum_{t=1}^T r_t$.
Theorem 1 establishes a probabilistic bound for the cumulative regret of Algorithm 1, ensuring its convergence and efficiency with high probability. For a confidence parameter $\delta \in (0,1)$, Theorem 1 states that the cumulative regret $R_T$ of Algorithm 1 is bounded by the square root of a linear expression in the information gain $I_T(f_T; y_T)$, with probability at least $1 - \delta$. Recall from Section 4.4.1 that the information gain $I_T(f_T; y_T)$ measures the reduction in uncertainty about the objective functional resulting from the disclosure of the observations $y_T = \{\hat{\mathbb{E}}_{\gamma_1}[J_L], \ldots, \hat{\mathbb{E}}_{\gamma_T}[J_L]\}$ generated by the chosen incentive functions, i.e., sampling points, $\Gamma_T = \{\gamma_1, \ldots, \gamma_T\} \subseteq \Gamma$. The linear expression in the information gain involves the slowly growing coefficients $\beta_T$ and $C_1$. The coefficient $\beta_t$ is the exploration coefficient of the acquisition functional (see Section 4.2) in the selection rule of Equation (57); for the purpose of the bound, we use $\beta_t = 2\log\big(t^2\pi^2|\Gamma|/(6\delta)\big)$ for all $t \in [T]$. This choice reconciles the exploration coefficient used in Algorithm 1 with the regret analysis: any implementation that uses this $\beta_t$ (or a pointwise larger schedule) satisfies the bound of Theorem 1. For the auxiliary coefficient $C_1$, we use $C_1 = 4\sigma^2/\log(1+\sigma^{-2})$. Theorem 1, given below, assumes the finite functional space $\mathcal{H}_k^d$ established in Appendix B as the set of allowed incentive functions, i.e., $\Gamma = \mathcal{H}_k^d$, so $|\Gamma| = d$ for some $d \in \mathbb{N}$. Recall from Section 4.1 that the objective functional in Equation (52) is viewed as a sample path $f$ of the functional GP $F_\gamma = \{f(\gamma) \mid \gamma \in \mathcal{H}_k\}$ with $F_\gamma \sim \mathcal{GP}(\mu(\gamma), K(\gamma,\gamma'))$. Assume further that the observation errors $\{\varepsilon_t\}$ are conditionally $R$-sub-Gaussian and form a martingale-difference sequence with a uniform variance envelope $\mathrm{Var}(\varepsilon_t \mid \mathcal{F}_{t-1}) \leq \sigma_*^2$ over the proposal support.
Theorem 1.
Let $\delta \in (0,1)$ and assume that $\Gamma$ is finite, with $\Gamma = \mathcal{H}_k^d$ for some $d \in \mathbb{N}$. Running Algorithm 1 with $\beta_t = 2\log\big(t^2\pi^2|\Gamma|/(6\delta)\big)$ for a sample path $f(\gamma) = \mathbb{E}\big[J_L \mid do(\Gamma = \gamma)\big]$ of a functional GP $F_\gamma = \{f(\gamma) \mid \gamma \in \mathcal{H}_k^d\} \sim \mathcal{GP}(\mu(\gamma), K(\gamma,\gamma'))$ with zero mean function $\mu_t(\gamma_t) = 0$ and covariance function $K_t(\gamma,\gamma')$, we get a cumulative-regret bound of $\mathcal{O}^*\big(\sqrt{T\, I_\tau(f_\tau; y_\tau)\, \log|\Gamma|}\big)$ with high probability. Specifically,
$$\Pr\Big\{ R_{D_T}^T \leq \sqrt{C_1\, T\, \beta_T\, I_T(f_T; y_T)} \Big\} \geq 1 - \delta,$$
for all $T \geq 1$, where $C_1 = 4\sigma^2/\log(1+\sigma^{-2})$.
The proof of Theorem 1 is provided in Appendix A by demonstrating a series of partial results given as Propositions A1–A4.

4.4.3. Regret Bound on an Infinite Incentive Function Space

The regret bound of Theorem 1 is established under the assumption that the set of admissible incentive functions $\Gamma$ is finite, specifically $\Gamma = \mathcal{H}_k^d$. In practice, however, the Stackelberg FCBO algorithm can operate over a continuous, potentially infinite-dimensional admissible set of incentive functions $\Gamma \subseteq \mathcal{H}_k$, where $\mathcal{H}_k$ is a bounded RKHS. In this subsection, we extend the cumulative-regret analysis of the Stackelberg FCBO algorithm to an infinite admissible space of incentive functions $\Gamma \subseteq \mathcal{H}_k$, where $\mathcal{H}_k$ is a bounded separable RKHS. This corresponds to the setting where incentive functions $\gamma_t \in \Gamma = \mathcal{H}_k$ are sampled and optimized over a non-parametric space of continuous functions. In accordance with the infinite-domain analysis in [7], we assume that the functional $f(\gamma) = \mathbb{E}\big[J_L \mid do(\Gamma = \gamma)\big]$ is Lipschitz continuous and is a member of the RKHS induced by the functional kernel.
Applying the regret bounds of Theorem 1 to a finite subset of H k obtained by discretizing H k does not ensure that the regret bound is applicable to the original continuous infinite domain H k . The issue is that the objective functional f: H k R , even if continuous, may not be well approximated on all of the function space H k by its restriction to a finite set, and this would only control regret over the discretized surrogate problem, not the full functional optimization problem:
$$\gamma^* \in \arg\max_{\gamma \in \Gamma \subseteq \mathcal{H}_k} \mathbb{E}\big[J_L \mid do(\Gamma = \gamma)\big].$$
Furthermore, regret is accumulated over the true sequence of evaluations, which may lie outside any fixed discretization. So, to rigorously lift the regret guarantees from a finite discretization to the full RKHS domain H k , we require uniform approximation of both (i) the infinite set of admissible incentive functions and (ii) the objective functional f: H k R itself. Therefore, the key step is to approximate the infinite-dimensional domain H k and the functional f ( γ ) in a way that allows the application of the previously developed finite-set regret analysis from Theorem 1.
In order to overcome this, we approximate the infinite function space by constructing a finite-dimensional subspace by means of two applications of the Stone–Weierstrass theorem (see Appendix D). In the first application, the objective functional f: Γ R , with Γ H k compact, is uniformly approximated on Γ by finite linear combinations of kernel sections K ( · , γ i ) associated with a carefully chosen finite set of basis incentive functions { γ 1 , , γ d } H k . As a consequence, for every ε 1 > 0 , there exists such a finite set and coefficients α 1 , , α d R satisfying
$$\sup_{\gamma \in \Gamma}\Big| f(\gamma) - \sum_{i=1}^{d} \alpha_i K(\gamma_i, \gamma) \Big| < \varepsilon_1,$$
and the set $\{\gamma_i\}_{i=1}^d \subseteq \Gamma$ then acts as a basis of reference incentive functions, with the approximant $\sum_{i=1}^d \alpha_i K(\gamma_i, \gamma)$ providing a finite-dimensional parametric model for $f$. Details of this construction are given in Appendix D, together with two formal methods for selecting such basis incentive functions (one based on minimal $\varepsilon$-covers and one based on spectral decomposition of the kernel operator) in Appendix D.2.
From the first application of the Stone–Weierstrass theorem, we obtained a finite set of basis incentive functions { γ 1 , , γ d } H k and coefficients { α i } such that f ( γ ) is uniformly approximated, to within an error ε 1 > 0 , by
$$\sum_{i=1}^{d} \alpha_i K(\gamma_i, \gamma).$$
In the second application, each basis function γ i is itself uniformly approximated on the compact input domain Ω F , the input space of the follower’s decision variable, by a polynomial p i ( n ) of degree at most n, so that the entire approximation to f is parameterized by finitely many coefficients. That is, we now apply the Stone–Weierstrass theorem a second time, but this time to each basis function γ i individually, treating it as a continuous scalar-valued function on the compact domain Ω F . Let P n denote the algebra of real polynomials on Ω F of degree at most n. The Stone–Weierstrass theorem on Ω F , compact in the Euclidean topology, ensures that P n is uniformly dense in C ( Ω F , R ) . Therefore, for any ε 2 > 0 and for each i { 1 , , d } , there exists a polynomial p i ( n ) P n such that
$$\sup_{x \in \Omega_F} \big|\gamma_i(x) - p_i^{(n)}(x)\big| < \varepsilon_2.$$
Therefore, each γ i can be replaced by a polynomial p i ( n ) of degree n with uniform approximation error ε 2 on Ω F . So, the resulting functional approximation becomes
$$f(\gamma) \approx \sum_{i=1}^{d} \alpha_i K\big(p_i^{(n)}, \gamma\big),$$
where K is evaluated between the polynomial surrogate p i ( n ) and the input γ .
Replacing each $\gamma_i$ with $p_i^{(n)}$ parameterizes the basis incentive functions by their polynomial coefficients. The full parameterization of the approximant to $f$ now consists of (1) the $d$ coefficients $\{\alpha_i\}$ from the first approximation step, and (2) the coefficients of the $d$ polynomials $\{p_i^{(n)}\}$, each with finitely many coefficients determined by its degree $n$. This yields a finite-dimensional cross-parameter space:
$$\Theta_{d,n} = \Big\{\big(\alpha_1, \ldots, \alpha_d,\; \mathrm{coeffs}(p_1^{(n)}), \ldots, \mathrm{coeffs}(p_d^{(n)})\big)\Big\}$$
that fully describes the approximant to f. Hence, the second Stone–Weierstrass application replaces infinite-dimensional basis functions γ i H k with finite-dimensional polynomial surrogates p i ( n ) , thus turning the problem of learning f over Γ into an optimization over the finite-dimensional parameter set Θ d , n .
This two-step reduction maps the original optimization problem over the infinite-dimensional domain Γ H k to an optimization over a finite-dimensional parameter space, while controlling the approximation error. The uniform error ε 2 in approximating each γ i propagates to the approximation of f in a way controlled by the Lipschitz constant of f with respect to γ , ensuring that the total error from both applications remains bounded when d and n are chosen large enough. So, by Lipschitz continuity of f on Γ with respect to · H k , with constant L > 0 , this perturbation changes the output of f by at most L ε 2 . Consequently, the regret bound in the infinite-dimensional case satisfies
$$R_{\mathrm{inf}}^T \leq R_{\mathrm{finit}}^T + T(\varepsilon_1 + L\varepsilon_2),$$
where $R_{\mathrm{finit}}^T$ is the cumulative-regret bound from the finite-basis case. By choosing the basis size $d$ and polynomial degree $n$ large enough, both $\varepsilon_1$ and $\varepsilon_2$ can be made arbitrarily small, so that the additive term $T(\varepsilon_1 + L\varepsilon_2)$ remains negligible compared to the leading $\sqrt{T}$-scale term in $R_{\mathrm{inf}}^T$. See Theorem 2 below for explicit assumptions and a sublinear cumulative-regret guarantee in the infinite-domain setting.
This reduction method explicitly connects the choice of basis to the covering number $N\big(\Gamma, \|\cdot\|_{\mathcal{H}_k}, \varepsilon\big)$, which determines the discretization size $|\Gamma_\varepsilon|$, where $\Gamma_\varepsilon = \{\gamma_1, \ldots, \gamma_d\}$ and
$$N\big(\Gamma, \|\cdot\|_{\mathcal{H}_k}, \varepsilon\big) = \min\big\{ |\Gamma_\varepsilon| : \Gamma_\varepsilon \subseteq \Gamma \text{ is an } \varepsilon\text{-cover of } \Gamma \text{ for } \|\cdot\|_{\mathcal{H}_k} \big\}.$$
In the finite function space regret analysis of Theorem 1 (see Equation (A4)), the confidence parameter has the form $\beta_T = 2\log\big(\pi^2 T^2 |\Gamma|/(6\delta)\big)$, which leads to the following:
$$\beta_T = 2\log\!\left(\frac{\pi^2 T^2\, N\big(\Gamma, \|\cdot\|_{\mathcal{H}_k}, \varepsilon\big)}{6\delta}\right) \leq 2\log\!\left(\frac{\pi^2 T^2\, |\Gamma_\varepsilon|}{6\delta}\right);$$
in practice, the set may come from a near-minimal cover, so $d = |\Gamma_\varepsilon| \geq N\big(\Gamma, \|\cdot\|_{\mathcal{H}_k}, \varepsilon\big)$; i.e., the selected set forms a valid, but not necessarily minimal, $\varepsilon$-cover, typically within a small factor (often at most logarithmic in $N$) for standard greedy covering procedures on compact metric spaces (see Appendix D.2). The relationship in Equation (65) shows that $\beta_T$ grows logarithmically with $T$ but also depends logarithmically on the covering number of $\Gamma$, so reducing $d$ toward the minimal covering number directly yields a smaller $\beta_T$ and, therefore, a tighter regret bound. Nevertheless, even for a valid but not necessarily minimal $\varepsilon$-cover, $|\Gamma_\varepsilon|$ is constant for fixed $\varepsilon$, so $\beta_T \in \mathcal{O}(\log T)$.
Considering all the elements above and following the same reasoning as in the proof of Theorem 1 (see Appendix A), the regret analysis for $R_{\mathrm{finit}}^T$ can be extended to the infinite-dimensional domain $\Gamma \subseteq \mathcal{H}_k$ by a discretization argument as follows. We first construct an $\epsilon_T$-cover $\Gamma_{\epsilon_T} \subseteq \Gamma$ in the $\mathcal{H}_k$ norm, with size $|\Gamma_{\epsilon_T}| = d_{\epsilon_T}$, so that every $\gamma \in \Gamma$ is within $\epsilon_T$ of some $\tilde\gamma \in \Gamma_{\epsilon_T}$. A union bound over all $\gamma \in \Gamma_{\epsilon_T}$ and all $t \leq T$ yields, with probability at least $1 - \delta$, a uniform confidence interval of the form
$$|f(\gamma) - \mu_{t-1}(\gamma)| \leq \sqrt{\beta_T}\,\sigma_{t-1}(\gamma) + L\epsilon_T,$$
where $\beta_T = 2\log\big(T^2\pi^2 d_{\epsilon_T}/(6\delta)\big)$ and $L$ is the Lipschitz constant of $f$ in the $\mathcal{H}_k$ norm $\|\cdot\|_{\mathcal{H}_k}$.
Applying this bound to both the optimal $\gamma^*$ and the chosen $\gamma_t$ in round $t$ shows that the instantaneous regret satisfies $r_t = f(\gamma^*) - f(\gamma_t) \leq 2\sqrt{\beta_T}\,\sigma_{t-1}(\gamma_t) + 2L\epsilon_T$ (Proposition A2). Summing over $t = 1, \ldots, T$ and using Cauchy–Schwarz together with the information-gain bound $\sum_{t=1}^T \sigma_{t-1}^2(\gamma_t) \leq C_1\, I_\tau(f_\tau; y_\tau)$ (Proposition A3), we obtain
$$R_{\mathrm{inf}}^T = \sum_{t=1}^{T} \big(f(\gamma^*) - f(\gamma_t)\big) \leq 2\sqrt{C_1\, T\, \beta_T\, I_T(f_T; y_T)} + 2TL\epsilon_T.$$
Choosing the summable schedule $\epsilon_t = 1/t^2$ makes the additive term $2L\sum_t \epsilon_t$ a constant, which can be upper-bounded using $\sum_{t=1}^{\infty} 1/t^2 = \pi^2/6$. The final result is the high-probability bound
$$R_{\mathrm{inf}}^T \leq \sqrt{C_1\, T\, \beta_T\, I_T(f_T; y_T)} + \frac{\pi^2}{3}L.$$
Theorem 2.
Let $\Gamma \subseteq \mathcal{H}_k$ be compact in $\|\cdot\|_{\mathcal{H}_k}$, and let $K: \Gamma \times \Gamma \to \mathbb{R}$ be bounded and continuous. Suppose observations follow $y_t = f(\gamma_t) + \varepsilon_t$ with $\sigma$-sub-Gaussian noise $\varepsilon_t$, and $f: \Gamma \to \mathbb{R}$ is a sample path of the centered functional GP $\mathcal{GP}(0, K)$ that is $L$-Lipschitz with respect to $\|\cdot\|_{\mathcal{H}_k}$. For the analysis, fix any discretization schedule $\{\varepsilon_t\}_{t=1}^T$ and $\varepsilon_t$-covers $\Gamma_{\varepsilon_t} \subseteq \Gamma$ of size $N\big(\Gamma, \|\cdot\|_{\mathcal{H}_k}, \varepsilon_t\big)$. Run Algorithm 1 with GP-UCB and the exploration schedule
$$\beta_t = 2\log\!\left(\frac{\pi^2 t^2\, N\big(\Gamma, \|\cdot\|_{\mathcal{H}_k}, \varepsilon_t\big)}{6\delta}\right), \qquad \delta \in (0,1).$$
Let I T denote the (maximum) information gain after T rounds. Then, with probability at least 1 δ ,
$$R_{\mathrm{inf}}^T \leq \sqrt{C_1\, T\, \beta_T\, I_T} + 2L\sum_{t=1}^{T}\varepsilon_t, \qquad C_1 = 4\sigma^2/\log(1+\sigma^{-2}).$$
In particular, if $\sum_{t=1}^{\infty}\varepsilon_t < \infty$ (e.g., $\varepsilon_t = c/t^2$), then $R_{\mathrm{inf}}^T = \tilde{\mathcal{O}}\big(\sqrt{T\, I_T}\big)$ is sublinear; if, moreover, $I_T = o(T)$, then $R_{\mathrm{inf}}^T = o(T)$.
Proof. 
Combine the finite-$|\Gamma|$ GP-UCB bound with a union bound over the $\varepsilon_t$-covers $\Gamma_{\varepsilon_t}$ to obtain uniform confidence bands on $\Gamma_{\varepsilon_t}$, which yields
$$\beta_t = 2\log\big(\pi^2 t^2\, N(\Gamma, \|\cdot\|_{\mathcal{H}_k}, \varepsilon_t)/(6\delta)\big).$$
Lipschitz continuity lifts these bands from $\Gamma_{\varepsilon_t}$ to all of $\Gamma$, adding $2L\varepsilon_t$ to the per-round regret; summing over $t$ produces the stated bound. The two-step uniform-approximation argument (via discretization) recovers $R_{\mathrm{inf}}^T \leq R_{\mathrm{finit}}^T + T(\varepsilon_1 + L\varepsilon_2)$, absorbed by $\sum_t \varepsilon_t$ when the schedule is summable. □
Therefore, the schedule $\beta_t$ mirrors the finite-set case with $|\Gamma|$ replaced by the covering number, and the bound extends Theorem 1 by adding a discretization penalty $2L\sum_t \varepsilon_t$. The infinite-domain analysis relies on the following assumptions: (i) $f \sim \mathcal{GP}(0, K)$ on $\Gamma$ with $K$ bounded and continuous; (ii) $f$ is $L$-Lipschitz in $\|\cdot\|_{\mathcal{H}_k}$ on $\Gamma$; (iii) the noise $\varepsilon_t$ is $\sigma$-sub-Gaussian and Algorithm 1 uses $\beta_t$ as above; (iv) $\Gamma \subseteq \mathcal{H}_k$ is compact and $\varepsilon_t$-covers $\Gamma_{\varepsilon_t}$ exist with size $N\big(\Gamma, \|\cdot\|_{\mathcal{H}_k}, \varepsilon_t\big)$; (v) the information gain $I_T$ is finite.

4.4.4. Practical Implications of the Regret Bounds

The regret bounds for Stackelberg FCBO turn our theory into concrete guarantees for one-shot incentive design. They quantify, with high probability, how far the principal’s expected utility under the selected incentive can be from that of the best admissible incentive, before deployment. This matters in practice because the single-stage PAP requires a single commitment—there is no opportunity to “learn on the fly.”
The bounds certify a maximum difference in actual versus expected performance at the time of decision-making, offering an up-front measure of acceptable risk for using an incentive in important decisions (such as setting credit terms, granting subsidies, or allocating resources). Because the bound scales with differential information gain—that is, the knowledge added by each experiment—each offline evaluation reflects its informational value. This allows practitioners to prioritize experiments that reduce uncertainty the most, rather than running large, costly test batteries.
The bounds yield practical stopping rules. For example, stop when the acquisition-driven Upper Confidence Bound (a statistical estimate indicating the highest likely value) on the best candidate falls below a target gap (such as a minimum acceptable return on investment, ROI, or a maximum loss threshold). It also helps determine how many offline evaluations are needed to reach a desired near-optimality level, meaning how close the chosen candidate is to the optimal option.
In summary, the regret bounds provide actionable guarantees for offline policy selection: they quantify residual risk, guide the allocation of evaluation effort, and ensure that the final one-shot incentive is provably close to optimal under the given causal model and data.

4.5. On Extending CID to Multi-Follower Single-Stage Principal–Agent Problems

This subsection addresses the extension of the CID framework for single-stage principal–agent problems (SS–PAPs) to multiple followers. We illustrate this extension with two contrasting CGMs for the base case of a PAP consisting of one principal and two followers. We use the notation $M_{L_v^mF_u^n}$, where the superscripts $m, n$ represent the number of principal and follower agents in the CGM, respectively (we omit a superscript when it equals one, as in the canonical CGM $M_{L_1F_1}$), and the subscripts $v, u$ the number of variables controlled by the principals and followers, respectively. To show how causal inference on canonical SS–PAPs extends to multi-follower SS–PAPs, we employ two representative models: the CGM $M_{L_2F_1^2}$ with individualized incentives and independent follower utility functions, and the CGM $M_{L_1F_1^2}$ with a universal incentive and joint utility functions. For these CGMs, the endogenous variables are the incentive function space variable $\Gamma$; the follower actions $\Omega_{F_1}, \Omega_{F_2}$; the principal actions $\Omega_{L_1}, \Omega_{L_2}$ (individualized) or a single action $\Omega_L$ (universal); and the utility variables $J_{F_1}, J_{F_2}, J_L$. We assume, as before, that the exogenous variables are mutually independent, mean-zero, and Gaussian when needed for analytic smoothing: $U_{A_1}, U_{A_2}$ (follower action noises), $U_{L_1}, U_{L_2}$ or $U_L$ (leader-action noises), and $U_{J_{F_1}}, U_{J_{F_2}}, U_{J_L}$ (utility noises). In this subsection, for clarity of notation, we sometimes write $a_1, a_2$ or $a_i, a_{-i}$ for the follower actions $\omega_{F_1}$ and $\omega_{F_2}$, and $A_1, A_2$ for $\Omega_{F_1}, \Omega_{F_2}$. As consistently assumed throughout this work, we take the principal's perspective; i.e., the principal does not know the followers' utilities or their best-response maps.

4.5.1. CGMs with Individualized Incentives and Independent Follower Utilities

In $M_{L_2F_1^2}$, the principal has two decision variables $\Omega_{L_1}$ and $\Omega_{L_2}$; i.e., it must decide on two distinct incentive functions $\gamma_1 \in \Gamma$ and $\gamma_2 \in \Gamma$, one for each follower. This type of CGM for SS–PAPs with multiple followers allows the selection of distinct incentives for different followers, which is useful in multi-agent systems where it is important to break symmetries between followers. Furthermore, in the $M_{L_2F_1^2}$ CGM, the utility function of each follower does not depend on the decisions of the other followers; i.e., there is no interference between the follower utilities: $J_{F_i}$ depends only on its own channel (though the principal utility $J_L$ may couple channels). $\Gamma_1$ affects $\Omega_{F_1}$ and $\Gamma_2$ affects $\Omega_{F_2}$ via the best responses in (68), as shown in the structural equations $\mathcal{F}$ given below:
$$\Gamma_1 = \gamma_1; \qquad \Gamma_2 = \gamma_2,$$
$$\Omega_{F_1} = \mathrm{BR}_1(\Gamma_1, U_{A_1}), \qquad \Omega_{F_2} = \mathrm{BR}_2(\Gamma_2, U_{A_2}), \quad (68)$$
$$\Omega_{L_1} = \Gamma_1(\Omega_{F_1}) + U_{L_1}, \qquad \Omega_{L_2} = \Gamma_2(\Omega_{F_2}) + U_{L_2}, \quad (69)$$
$$J_{F_1} = g_{F_1}(\Omega_{L_1}, \Omega_{F_1}) + U_{J_{F_1}}, \qquad J_{F_2} = g_{F_2}(\Omega_{L_2}, \Omega_{F_2}) + U_{J_{F_2}}, \quad (70)$$
$$J_L = g_L(\Omega_{L_1}, \Omega_{L_2}, \Omega_{F_1}, \Omega_{F_2}) + U_{J_L}. \quad (71)$$
Therefore, choosing policies $\Gamma_1 = \gamma_1, \Gamma_2 = \gamma_2$, the principal wants to estimate the causal estimand $V(\gamma_1, \gamma_2) = \mathbb{E}\big[J_L \mid do(\Gamma_1 = \gamma_1, \Gamma_2 = \gamma_2)\big]$ and targets incentive policies $\gamma_1^*$ and $\gamma_2^*$ such that
$$(\gamma_1^*, \gamma_2^*) \in \arg\max_{\gamma_1 \in \Gamma_1,\, \gamma_2 \in \Gamma_2} \mathbb{E}\big[J_L \mid do(\Gamma_1 = \gamma_1, \Gamma_2 = \gamma_2)\big].$$
Using the g-formula, the identification formula for V ( γ 1 , γ 2 ) admits the familiar nested expectation form:
$$V(\gamma_1, \gamma_2) = \int\!\!\int \underbrace{\mathbb{E}_{U_{L_1}, U_{L_2}}\Big[g_L\big(\gamma_1(a_1) + U_{L_1},\; \gamma_2(a_2) + U_{L_2},\; a_1,\; a_2\big)\Big]}_{\text{inner expectation (Gaussian smoothing)}} \quad (73)$$
$$\times\; p(a_1 \mid \Gamma_1 = \gamma_1)\, p(a_2 \mid \Gamma_2 = \gamma_2)\, da_1\, da_2. \quad (74)$$

4.5.2. CGMs with Universal Incentive and Joint Follower Utilities

In contrast, in M L 1 F 1 2 , incentives are said to be universal, that is, the incentive function γ is applied equally to all followers, and here the utility function of each follower depends on the decisions of the other followers. The structural equations F for M L 1 F 1 2 are given in Equations (76)–(79).
(75) Γ = γ (76) ( Ω F 1 , Ω F 2 ) = Φ ( Γ , U A 1 , U A 2 ) ( Joint best-response equilibrium map ) , (77) Ω L = Γ ( Ω F 1 , Ω F 2 ) + U L , (78) J F i = g F i ( Ω L , Ω F 1 , Ω F 2 ) + U J F i , i { 1 , 2 } , (79) J L = g L ( Ω L , Ω F 1 , Ω F 2 ) + U J L .
The direct causal relations $\Gamma \to (\Omega_{F_1}, \Omega_{F_2})$ act via the joint best-response equilibrium map $\Phi$, as shown in structural Equation (76). One can write $\Omega_{F_1} = \mathrm{BR}_1(\Gamma, \Omega_{F_2}, U_{A_1})$ and $\Omega_{F_2} = \mathrm{BR}_2(\Gamma, \Omega_{F_1}, U_{A_2})$, and define $\Phi$ as the fixed-point solver. That is, each follower $i$ has payoff $J_{F_i}(a_i, a_{-i}; \gamma)$ with the best-response correspondence:
$$\mathrm{BR}_i(\gamma, a_{-i}) := \arg\max_{a_i \in A_i} J_{F_i}(a_i, a_{-i}; \gamma), \qquad i \in \{1,2\},$$
and the joint best-response correspondence is
$$B_\gamma(a_1, a_2) = \mathrm{BR}_1(\gamma, a_2) \times \mathrm{BR}_2(\gamma, a_1) \subseteq A_1 \times A_2,$$
for compact action sets $A_i \subseteq \mathbb{R}^{d_i}$. So, $B_\gamma(a_1, a_2)$ takes an action profile $(a_1, a_2)$ as input and returns the set of all joint actions in which each player best responds to the other player's input action. A Nash equilibrium under $\gamma$ is then any fixed point $(a_1^*, a_2^*) \in B_\gamma(a_1^*, a_2^*)$: $a_1^* \in \mathrm{BR}_1(\gamma, a_2^*)$, i.e., follower 1's action is a best response to follower 2's action, and $a_2^* \in \mathrm{BR}_2(\gamma, a_1^*)$, i.e., follower 2's action is a best response to follower 1's action. However, when the followers' game presents multiple Nash equilibria (NE), the conventional definition of NE as a fixed point does not determine which outcome the players will actually select, leaving the model without predictive capability. It is therefore important to consider additional assumptions on the joint best-response equilibrium map $\Phi$, such as existence of the equilibrium and the specification of an equilibrium selection rule, where the goal is to move from the Nash equilibrium correspondence (a set of possible outcomes) to a unique joint action law (a single predicted outcome).
Under $M_{L_1F_1^2}$, the principal wants to estimate the causal estimand $V(\gamma) = \mathbb{E}\big[J_L \mid do(\Gamma = \gamma)\big]$ and targets the incentive function $\gamma^*$ such that:
$$\gamma^* \in \arg\max_{\gamma \in \Gamma} \mathbb{E}\big[J_L \mid do(\Gamma = \gamma)\big].$$
By the g-formula, the identification for V ( γ ) also admits the nested expectation form:
$$V(\gamma) = \int \underbrace{\mathbb{E}_{U_L}\Big[g_L\big(\gamma(a_1, a_2) + U_L,\; a_1,\; a_2\big)\Big]}_{\text{inner expectation (Gaussian smoothing)}}\; p(a_1, a_2 \mid \Gamma = \gamma)\, da_1\, da_2.$$
In the identification formula for $V(\gamma)$, we can observe the importance of having a joint best-response equilibrium map $\Phi$ with an equilibrium selection rule. With multiple equilibria, $p(a_1, a_2 \mid \Gamma = \gamma)$ becomes non-unique, ambiguous, and very hard to estimate from data; so it is important to impose or learn an equilibrium selection rule that makes the joint action law unique, or as unambiguous as possible.

5. Discussion

This work formalizes the Causal Incentive Design (CID) framework for the canonical single-stage principal–agent problem (SS–PAP) by treating incentives as interventions in a causal graphical model (CGM). The causal target $V(\gamma) = \mathbb{E}\big[J_L \mid do(\Gamma = \gamma)\big]$ is the expected value of the principal's utility variable $J_L$ under a specified policy intervention $\Gamma = \gamma$ in the post-intervention distribution $p(J_L \mid do(\Gamma = \gamma))$. The estimand is constructed by identifying $V(\gamma)$ via the g-formula, and a single deployment incentive policy is selected through a Functional Bayesian Optimization algorithm. The CID pipeline for SS–PAPs is explicitly offline: historical data records from the system are mapped into estimates of $V(\gamma)$, which are then optimized prior to a one-shot commitment. This connection between causal identification and sample-efficient search yields high-probability regret guarantees for the offline selection step.
A central contribution is the anatomy of the estimand $V(\gamma)$ as a nested expectation with two interpretable layers: (i) an inner Gaussian smoothing of the outcome regression around the principal's induced action, and (ii) an outer averaging with respect to the follower's induced action law $p(M \mid \Gamma = \gamma)$, which extends to the multi-follower setting. The inner layer admits quadrature-based evaluation (e.g., Gauss–Hermite), while the outer layer is handled by a policy-local reweighting scheme that respects positivity and interpretability. In the affine credit-market specialization (Section 3.3.1), this decomposition collapses to a closed form, clarifying how pricing slope and fees interact with induced borrowing behavior.
The CID framework demonstrates a recent advance in causal reasoning grounded in causal graphical models (CGMs): policies are formalized as functional interventions on a policy node, policy value is represented as an interventional or counterfactual query, and identification and estimation are performed directly on the CGM using observational logs. Additionally, the single-stage principal–agent model examined here introduces a novel application domain for CGMs in incentive design, complementing established applications in biology, medicine, and economics, and facilitating further exploration of multi-agent scenarios.
The remainder of the Discussion is structured as follows. Section 5.1 revisits the Stackelberg FCBO procedure and interprets the information–gain regret bounds as causal decision guarantees for one-shot policy selection in single-stage PAPs. Section 5.2 reviews the primary modeling, identification, and learning assumptions of the CID framework, including the CGM structure, positivity/overlap, the additive-noise model, and the uniform sub-Gaussian noise envelope used in the regret analysis. Section 5.3 discusses the practical scope of the current single-stage, offline analysis, emphasizing limitations related to policy-local support, potential model misspecification such as heavy-tailed noise or unmodeled confounding, and robustness considerations, as well as the importance of effective-sample-size diagnostics. Section 5.4 analyzes the computational demands of the CID pipeline and the FCBO algorithm, specifying how time and memory requirements scale with the number of logged units, the number of policy evaluations, and GP updates, and indicating regimes in which value estimation or surrogate modeling is the dominant computational cost. Section 5.5 describes extensions of the CID framework and nested estimand beyond the single-follower case, considering individualized versus joint incentives and the impact of interactions among multiple followers on the CGM, identification, and the design of learning algorithms.

5.1. Offline Functional BO and Causal Decision Guarantees

The Stackelberg FCBO algorithm models γ ↦ V(γ) as a black-box functional objective on a Reproducing Kernel Hilbert Space (RKHS) and uses a functional GP surrogate with a GP-UCB-style acquisition functional. The resulting cumulative-regret bounds for the Stackelberg FCBO algorithm scale with √(T β_T I_T), where I_T is an information-gain term. This offers concrete guidance on evaluation budgets (sampling costs), stopping rules (such as UCB gaps), and the exploration schedule prior to the one-shot deployment; i.e., it determines when enough data have been collected to make a final decision.
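As an illustration of such a UCB-gap stopping rule, the following sketch halts exploration once no candidate's upper confidence bound exceeds the best lower confidence bound by more than a tolerance ε; the tolerance and the posterior inputs are assumptions of this sketch rather than prescriptions of the algorithm.

```python
import numpy as np

def should_stop(mu, sd, beta_t, eps=1e-2):
    """UCB-gap stopping rule: once the best upper bound is within eps of the
    best lower bound, further evaluations cannot change the one-shot
    deployment decision by more than eps."""
    ucb = mu + np.sqrt(beta_t) * sd   # optimistic values per candidate
    lcb = mu - np.sqrt(beta_t) * sd   # pessimistic values per candidate
    return ucb.max() - lcb.max() <= eps
```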
In contrast to online-learning methods such as bandits and reinforcement learning over actions, CID for SS–PAPs is distinctly one-shot and offline, exploiting the intrinsic causal structure of PAPs and causal reasoning to assess policies without deployment. In relation to causal BO methods, we argue that interventions do(Γ = γ) on the function-valued random variable Γ, which induce the functional (conditional) intervention do(Ω_L = γ(ω_F)), since
E[J_L | do(Γ = γ)] = E[J_L | do(Γ = γ), do(Ω_L = γ(ω_F))],
are the right abstraction for causal reasoning about incentive functions in PAPs. In M_{L1F1}, the node Γ explicitly represents the incentive policy and is connected to both Ω_L and Ω_F. These connections Γ → Ω_L and Γ → Ω_F, together with Ω_F → Ω_L, Ω_L → J_L, and Ω_F → J_L, constitute the minimal policy-aware motif that captures how incentives γ ∈ Γ shape the leader's realized decision Ω_L and the follower's decision Ω_F to produce the principal's utility variable J_L in the system. In structural terms, the leader's action is generated by the policy applied to the follower context, Ω_L = γ(Ω_F), while the adoption of a policy also reshapes the distribution of follower variables through Γ → Ω_F.
This inverse Stackelberg-aware approach makes the hierarchical semantics explicit: the dependence between principal and follower decisions is encoded both by Ω_F → Ω_L (the policy takes the follower context as input) and through the behavior modifications driven by incentive policies, Γ → Ω_L and Γ → Ω_F (the policy governs how actions are produced and how follower behavior is distributed under intervention). Moreover, the follower decision Ω_F acts as a mediator M on the causal path from Γ to J_L in the principal's perspective of M_{L1F1}, in agreement with the intrinsic semantics of PAPs.
The CID framework for SS–PAPs illustrates the fundamental difference between accounting for the causal relations among the input variables (Γ and Ω_F in SS–PAPs) and ignoring them in an optimization problem. This distinction is the central motivation for the causal decision-making framework of Causal Bayesian Optimization (CBO) [24] over classical Bayesian optimization. In this sense, the CID framework for SS–PAPs exhibits PAPs, inverse Stackelberg games, and bilevel optimization problems as a fundamental family of problems that elucidates this distinction (between BO and CBO), since the hierarchical semantics of an inverse Stackelberg game rely on the causal relationships Γ → Ω_L, Γ → Ω_F, and Ω_F → Ω_L. Disregarding the causal relationships among these three variables Γ, Ω_F, and Ω_L only distorts the semantics of the functional optimization problem that solves a PAP.

5.2. Assumptions

The identification and estimation steps of CID rely on the following assumptions. (i) Consistency: the recorded outcome equals the counterfactual outcome under the realized policy. (ii) Causal sufficiency: there is no latent confounding among the variables {Γ, Ω_L, Ω_F, J_L, J_F} in the generative CGM M_{L1F1}. (iii) Black-box follower best-response: we work from the principal's perspective, where the observed variables are {Γ, Ω_F, Ω_L, J_L} and the follower's best-response is treated as a black-box function. (iv) Positivity (overlap): in a policy-local sense, we assume there exists a neighborhood N_ρ(γ) such that p(Ω_F | Γ ∈ N_ρ(γ)) > 0 wherever the estimator needs support to approximate p(Ω_F | Γ = γ). (v) Policy compliance: the action implemented by the principal in practical deployment is presupposed to be consistent with the distribution of Ω_L = γ(Ω_F) + U_L; if this alignment fails, the intervention do(Γ = γ) and the behavior observed in practice will differ significantly. (vi) Additive, zero-mean Gaussian noise SCM: with mutually independent exogenous terms; in the leader's action, in particular, this is what enables the inner Gaussian smoothing in the identification formula. We also assume i.i.d. logged units and no interference across units, i.e., the Stable Unit Treatment Value Assumption (SUTVA). Violations or weak forms of these conditions, such as latent confounding, measurement error, or insufficient policy-local overlap, can bias V(γ) and compromise the calibration of uncertainty. (vii) Noise envelope for regret analysis: while the outer estimator is heteroskedastic across γ, we assume a uniform sub-Gaussian envelope on ε_t = V̂(γ_t) − V(γ_t) over the proposal support. This implies a diagonal noise model in Section 4.4.1 with Σ_T ⪯ σ_*² I, under which the stated GP-UCB bounds hold (see [8,29]).

5.3. Scope and Limitations

The present CID analysis is single-stage and offline: it commits once to a policy and cannot adapt after deployment. Dynamics (carry-over effects), learning-by-doing, and path-dependent constraints are beyond the scope of this paper. The follower mechanism is treated as a black box (we observe Ω_F, Ω_L, and J_L, but not the follower utilities); this pushes strategic dependence into the observed mediator law and increases sensitivity to support gaps in Ω_F near the target policy γ. Consequently, in the absence of extensive historical policy support, the sensitivity of the outer layer increases. The additive-Gaussian assumption simplifies the inner layer but may under-represent heavy tails or heteroskedasticity; when the noise departs from Gaussianity, the inner smoothing and the variance estimates can become inaccurate.
For diagnostics and protection against weak overlap (see the sketch after this list): (i) Monitor the weight distributions, looking for outlier weights that point to weak overlap and high variance. (ii) Monitor the effective sample size (ESS), which measures the effective number of independent samples available after applying the weights:
ESS(γ; h) := ( ∑_{i=1}^{N} w_i(γ; h) )² / ∑_{i=1}^{N} w_i(γ; h)².
A low ESS relative to the total sample size N indicates that a few heavily weighted units dominate the estimation, signaling a high-variance problem. (iii) Monitor sensitivity to the policy-space bandwidth by re-running the evaluation with different values of the bandwidth parameter h: if the estimated value of a policy changes significantly under small adjustments of the bandwidth, the estimate is unstable and highly dependent on a tuning parameter, which reduces confidence in the result. (iv) Reduce overfitting bias in the outer layer by employing sample-splitting, cross-fitting, and orthogonalization.
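A minimal sketch of diagnostics (i)–(iii) is given below; the Gaussian weight kernel over a policy-space distance, the bandwidth grid, and the estimate_value hook are illustrative assumptions of this sketch.

```python
import numpy as np

def ess(w):
    """Effective sample size: (sum w_i)^2 / sum w_i^2."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

def overlap_report(dists, bandwidths, estimate_value):
    """dists: distances from each logged policy to the target policy gamma;
    estimate_value(h): hypothetical hook recomputing V_hat(gamma) at bandwidth h."""
    rows = []
    for h in bandwidths:
        w = np.exp(-0.5 * (dists / h) ** 2)            # policy-local weights
        rows.append({"h": h,
                     "ESS": ess(w),                    # low ESS flags weak overlap
                     "max_weight": w.max() / w.sum(),  # outlier-weight check (i)
                     "V_hat": estimate_value(h)})      # bandwidth sensitivity (iii)
    return rows
```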
In practice, Algorithm 1 can constrain candidate policies to a trust region S_support (see Section 3.3.4 for the definition of S_support), adapt the bandwidth h to balance bias and variance, and/or inflate the posterior uncertainty as γ approaches the boundary of S_support. Crucially, kernel reweighting does not create support: if γ is far from the historical policies, P̂_γ(h) collapses, and V̂(γ) becomes high-variance (and possibly biased). Therefore, the practical search space of Algorithm 1 is data-dependent and effectively limited to neighborhoods of previously observed policies. Because the outer estimator relies on policy-local overlap, we can incorporate a policy-support score s(γ) := ESS(γ; h)/N into the selection. Concretely, we can either constrain GP-UCB to γ ∈ S_support(h, τ, w_max) (a hard trust region) or use a soft penalty Ã_t(γ) = A_t(γ) · φ(s(γ)), where A_t(γ) is the acquisition function and φ(s) → 0 as s → 0, i.e., the penalty must vanish as the policy-support score s(γ) approaches zero. Conversely, when s(γ) is high (close to 1), φ(s) should be close to 1 so as to minimally affect the acquisition function A_t(γ). This prevents proposals in regions where p(M | Γ = γ) cannot be estimated from the logs and makes the practical search space explicit and data-dependent.
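One possible instantiation of the soft penalty φ is a logistic gate on the support score, sketched below; the gate location s_min and sharpness are tuning choices assumed for illustration, not prescriptions of the framework.

```python
import numpy as np

def penalized_acquisition(acq, s, s_min=0.05, sharpness=10.0):
    """Support-aware soft penalty: phi(s) -> 0 as s -> 0 and phi(s) -> 1 as
    s -> 1, so poorly supported policies are suppressed while well-supported
    policies keep (almost) their original acquisition value."""
    phi = 1.0 / (1.0 + np.exp(-sharpness * (s - s_min)))
    return acq * phi
```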

5.4. Time–Space Complexity over a Full FBO Run

Estimating V(γ) once via the semi-parametric identification formula (assuming additive, zero-mean Gaussian noise) costs O(N(P + C_γ + Q_eff C_μ)) under policy-local reweighting over N recorded tuples, where P is the policy-space representation size, C_γ is the cost to compute γ(Ω_F) once, Q_eff is the inner integration budget (e.g., Gauss–Hermite nodes or MC points), and C_μ is the cost to evaluate the fitted outcome regression once. Across a BO horizon of T iterations (one evaluation per step), the total cost to produce these offline evaluations is O(T N (P + C_γ + Q_eff C_μ)). With local reweighting over only the top L ≪ N nearest policies, this becomes O(T (P log N + L (P + C_γ + Q_eff C_μ))). The one-time training of the outcome model contributes an upfront C_train^μ that is then amortized. The memory requirement for the estimator is O(N) (or O(L) when local weighting is applied), in addition to O(N P) if the policy embeddings are stored in cache. The quadrature nodes are O(Q_eff) and can be considered negligible.
The implementation of an exact functional Gaussian process surrogate with incremental updates throughout the run adds O((1+B) T³ + (1+B) T² P) for the Cholesky updates and the acquisition scoring of B candidates per iteration. Across the full horizon T, the kernel updates cost ∑_{t=1}^{T} O(N_t P) = O(T² P); the Cholesky factorizations cost ∑_{t=1}^{T} O(N_t²) = O(T³); and the total UCB acquisition-scoring work (each score needs a mean and a variance) is ∑_{t=1}^{T} O(B (N_t P + N_t²)) = O(B (T² P + T³)), where N_t is the number of observed policy evaluations at iteration t (so N_t = t). Merging the estimation cost with the exact GP updates and the functional acquisition work, the total end-to-end time is
Total time ≲ C_train^μ + O(T N (P + C_γ + Q_eff C_μ)) [all V̂(γ) evaluations] + O((1+B) T³ + (1+B) T² P) [exact functional GP].
Replace N by L and add T P log N under local weighting. Comparing the estimation cost T N (P + C_γ + Q_eff C_μ) of V(γ) with the dominant cubic GP term (1+B) T³, we can establish that the GP dominates when
T ≳ √( N (P + C_γ + Q_eff C_μ) / (1 + B) ),
while the estimation cost of V(γ) dominates below that scale. In terms of space, the exact GP stores dense Gram/Cholesky factors of size T × T, i.e., O(T²).
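For intuition, the crossover horizon can be computed directly; the unit costs below are hypothetical placeholders chosen only to make the arithmetic concrete.

```python
import math

# Hypothetical unit costs: N logged units, policy representation size P,
# one policy evaluation C_gamma, quadrature budget Q_eff, one regression
# call C_mu, and B acquisition candidates per iteration.
N, P, C_gamma, Q_eff, C_mu, B = 50_000, 64, 1.0, 16, 2.0, 32

per_unit = P + C_gamma + Q_eff * C_mu
# (1+B) T^3 exceeds T * N * per_unit once T > sqrt(N * per_unit / (1+B)).
T_star = math.sqrt(N * per_unit / (1 + B))
print(f"exact-GP cost dominates beyond roughly T = {T_star:.0f} iterations")
```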

5.5. Toward Multi-Follower Settings

We outline a multi-follower generalization of CID that preserves the CGM semantics of the canonical case. Let the followers be indexed by i ∈ [n] with contexts Ω_F^(i), and let the leader's (possibly vector-valued) action be Ω_L. The policy node Γ represents the incentive rule(s) chosen by the principal and, in the policy-augmented CGM, carries arrows Γ → Ω_L and Γ → Ω_F^(i) for all i. As in the single-follower case, contexts feed into the leader's realized action via Ω_F^(i) → Ω_L, but now the post-intervention law involves the joint mediator M = (Ω_F^(1), …, Ω_F^(n)), since both Ω_L and {Ω_F^(i)} affect the principal's payoff J_L. Two instructive extensions clarify how identification and learning must adapt beyond the canonical case: (i) the IIIU regime, individualized incentives with conditionally independent follower utilities (factorized mediator law); and (ii) the UIJU regime, universal incentives with joint follower utilities (coupled actions via a joint best-response map).
In the IIIU regime, the principal deploys a collection of incentive functions (e.g., one per channel, segment, or product), and the followers’ utilities are conditionally independent given their own contexts and the relevant part of the policy. Graphically, the mediator law factorizes under interventions as follows:
p( M | do({Γ^(i) = γ^(i)}_{i=1}^{n}) ) = ∏_{i=1}^{n} p( Ω_F^(i) | do(Γ^(i) = γ^(i)) ).
Estimation and optimization inherit a separable structure: the outer-layer reweighting and the inner smoothing decompose over i, enabling parallel computation and straightforward uncertainty aggregation. Surrogate modeling should mirror this factorization, e.g., with parallel GPs (modeling the functional objective as a sum or collection of independent functional GPs) or with block-diagonal kernels over policy components (combining separate, independent kernels for each component), as in the sketch below; identifiability then reduces to policy-local overlap for each channel. This setting suits applications where incentives are personalized or targeted and cross-follower externalities are negligible.
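A sketch of this factorized surrogate structure follows, assuming (for illustration only) that the principal's payoff is additive across channels; the per-channel kernels k_i and value estimators V_i are placeholders.

```python
def additive_policy_kernel(kernels):
    """Block-structured kernel over per-follower policy components:
    K(gamma, gamma') = sum_i K_i(gamma_i, gamma'_i), i.e., independent
    additive parts, equivalent to a block-diagonal covariance structure."""
    def K(gamma, gamma_prime):
        return sum(k_i(g, gp) for k_i, g, gp in zip(kernels, gamma, gamma_prime))
    return K

def factorized_value(channel_values, gamma):
    """Under IIIU with an additive payoff (an assumption of this sketch),
    V(gamma) = sum_i V_i(gamma_i), each V_i estimated with its own outer
    reweighting and inner smoothing."""
    return sum(v_i(g_i) for v_i, g_i in zip(channel_values, gamma))
```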
In the UIJU regime, a single policy γ applies to all followers, whose actions are strategically coupled; we therefore use a joint best-response map A = Φ(γ, M, U_F), with exogenous terms U_F, where A denotes the actions carried out by the followers {ω_F^(1), …, ω_F^(n)}. Identification requires an equilibrium-selection rule so that the post-intervention law p(A | M, do(Γ = γ)) is well defined through a selection mapping S: (γ, M) ↦ A. Three practical approaches are as follows. (i) Uniqueness: assume that do(Γ = γ) almost surely results in a unique equilibrium outcome A. (ii) Parametric selection: encode tie-breaking or stability criteria in S and learn its parameters from observed play. (iii) Stochastic selection: model the choice among equilibria as a stochastic process, where each equilibrium is chosen with a specific probability.
Surrogate modeling for the UIJU regime should reflect cross-follower dependence (e.g., using multi-task GPs with coregionalization and structured kernels over policy components), and outer-layer estimation must use joint reweighting over M (and, if modeled, A) rather than product weights. Positivity must hold in a joint sense: support near γ is needed for the relevant regions of the ( M , A ) -space, a stricter requirement than in the individualized case.

6. Conclusions

This study introduces a principled, offline framework for single-stage incentive design using causal inference, grounded in causal graphical models (CGMs) and information-theoretic analysis. The single-stage principal–agent problem (SS–PAP) is formalized as a CGM, where the principal's incentive rule is modeled as a functional intervention on a policy node Γ, the follower's action serves as an explicit causal mediator, and the principal's payoff J_L is the outcome. The causal target estimand is defined as the principal's expected utility under an incentive policy intervention, V(γ) = E[J_L | do(Γ = γ)], which quantifies how a committed incentive causally influences the follower's behavior and the principal's payoff under private information.
For the identification of the estimand, a semi-parametric expression for V ( γ ) is derived using the g-formula, which decomposes into an inner Gaussian smoothing over noise and an outer expectation with respect to the post-intervention distribution of the mediator. Building on this structure, a two-layer estimator is proposed for offline logs: the inner layer computes a Gaussian expectation, implemented numerically via Gauss–Hermite quadrature, while the outer layer approximates the interventional mediator law using policy-local kernel reweighting with effective-sample-size and weight-cap diagnostics. This design explicitly addresses the bias–variance trade-off and ensures that off-policy evaluation maintains overlap in policy space.
For policy selection, the value functional γ V ( γ ) is embedded in a functional Gaussian process surrogate over a Reproducing Kernel Hilbert Space (RKHS) of admissible incentives, and a support-aware GP-UCB search is performed. The resulting regret bounds are characterized in terms of differential information gain, directly connecting the exploration–exploitation trade-off to information-theoretic quantities under a uniform sub-Gaussian envelope for estimator noise. Collectively, these components establish a unified pipeline for causal reasoning and optimal intervention design in single-stage principal–agent settings using only offline observational logs.
The proposed offline approach is particularly suitable in scenarios where adaptive experimentation is infeasible, ethically constrained, or operationally costly. For instance, this applies when contracts, tariffs, or credit terms must be fixed prior to deployment and cannot be iteratively adjusted. Within this regime, the analysis clarifies the conditions under which causal identification is possible, the extent to which extrapolation in policy space is feasible while maintaining overlap, and methods for quantifying uncertainty through effective sample size and information gain.
The CID framework extends beyond the canonical single-follower case. In a multi-follower generalization, the nested form of the estimand is preserved, but the analysis depends strongly on how the follower utilities interact. In regimes with individualized incentives and conditionally independent utilities, identification, estimation, and surrogate modeling decompose across followers, enabling parallelization and simpler diagnostics. Conversely, in regimes with universal incentives and jointly coupled utilities, identification depends on equilibrium-selection assumptions, and outer-layer estimation must address joint mediator laws and stricter positivity requirements. These extensions demonstrate both the robustness of the two-layer CID structure and the additional challenges encountered in more complex multi-agent systems.
Overall, the presented CID framework advances the field of causal graphical models and their applications by integrating a CGM-based representation of principal–agent interactions with explicit causal reasoning about policy interventions and information-theoretic guarantees for offline policy search. This framework establishes a foundational link between causal inference and policy search in principal–agent problems, supporting reliable single-stage incentive design in complex systems and providing a basis for future research involving empirical studies, richer policy classes, and more complex multi-follower environments.

Future Work

The proposed approach enables the integration of causal reasoning into incentive design, especially for canonical single-stage PAPs. It also reveals opportunities to strengthen robustness in causal inference estimation, improve computational scalability, validate the framework under realistic constraints, and expand to multi-follower and multi-stage PAPs. Future work will address these limitations and explore extensions to make CID dependable at scale and deployable in practice.
Beyond the scope of this work, an important future extension is a robust identification toolkit that tolerates limited policy-local overlap in the data logs and mild unmeasured confounding. Specifically, we plan to incorporate partial-identification bounds, i.e., reporting intervals for the causal target estimand V(γ) rather than a single point estimate, with techniques based on [30] tailored to policy interventions, or proximal/negative-control strategies when credible proxy variables are available, together with targeted sensitivity analyses using weight clipping and bandwidth sweeps with effective-sample-size diagnostics. Addressing this limitation improves the practical utility of CID in data regimes where perfect overlap and strict sufficiency are unrealistic.
Other directions for strengthening robustness involve moving beyond additive-Gaussian noise and augmenting policy-local reweighting with structural conditional density models that learn the mediator law given an incentive policy, p(M | Γ). Regarding the first, we plan to replace the inner Gaussian smoothing with robust or heavy-tailed quadrature methods, such as Student-t distributions or mixture models, which better account for outliers and extreme data points, and to explore heteroskedastic noise models and variance-adaptive smoothing, propagating these refinements into the uncertainty quantification. Regarding conditional density models for p(M | Γ), the plan is to utilize conditional normalizing flows or score-based estimators while preserving the identification logic and benchmarking the bias–variance trade-offs. These improvements are essential to reduce bias in heavy-tailed regimes and to maintain calibrated uncertainty when the noise scale depends on context.
Forthcoming work will involve the complete extension of CID to the multi-follower setting across the distinct regimes, which will require regime-specific adaptations of the surrogate modeling and adjusted regret bounds. Many systems sit between the two regimes of individualized incentives with conditionally independent follower utilities and universal incentives with joint follower utilities. A pragmatic approach is to adopt clustered dependence: partition the followers into groups with strong in-group coupling and weak cross-group effects; the CGM then mixes block-factorized mediator laws with group-level equilibrium selection. For large numbers of followers, exchangeability and mean-field approximations can effectively reduce dimensionality. These techniques summarize follower interactions through aggregate statistics (such as empirical means or occupancy measures), which serve as inputs to the system's laws and, in particular, to the joint best-response map, yielding scalable identification and surrogate models with kernels over aggregates. This amounts to measuring the similarity between two policies through the similarity of their predicted aggregate statistics, making the optimization process efficient for massive populations.
Beyond the existing framework, a promising direction is constraint-aware CID. The core motivation is that incentive policies deployed in real-world settings (such as credit, pricing, or subsidies) must reliably satisfy constraints, and not just on average. This means mitigating the risk of rare and catastrophic failures. Essential constraints include safety (avoiding dangerous or unintended side effects), budget (ensuring costs remain within defined limits), and fairness (ensuring similar treatment or impact across groups, minimizing disparities in approval rates or resource allocation). The planned efforts on this matter involve implementing three interconnected components in the search process for policy interventions: (i) constrained acquisitions that enforce high-probability feasibility; (ii) risk-sensitive objectives and chance constraints; and (iii) fairness controls. The analysis of this extension must delineate clear Pareto trade-offs, enabling decision-makers to transparently assess trade-offs between maximizing utility and satisfying constraints such as safety, fairness, and budget.
A paramount direction is to generalize from the single-stage setting to multi-stage environments, where incentives are applied over time. Specifically, we plan to formalize a dynamic CGM with history-dependent policies Π , identify the causal target V ( Π ) with a sequential (stage-wise) analogue of our two-layer estimator, and lift the optimization to a Dynamic Functional Causal Bayesian Optimization algorithm that accounts for temporal causal relations and stage-wise uncertainty. This expansion is essential to capture learning-by-doing, carryover effects, and path-dependent constraints—features central to many real-world deployments.
Finally, translating these theoretical gains into robust real-world value is a key long-term goal. This involves curated benchmarks (credit pricing, platform subsidies, and market mechanisms), rigorous baselines (contextual bandits with off-policy evaluation (OPE), policy-gradient RL with OPE, and non-causal BO), and comprehensive reporting of offline policy regret, uncertainty coverage, and constraint satisfaction—with open, reproducible artifacts. Future work will include detailed simulations and case studies to demonstrate (a) calibration of the two-layer estimator under policy dispersion and overlap constraints; (b) separation of outer policy-local reweighting and inner Gaussian smoothing through ablation studies; and (c) performance of end-to-end Functional Causal Bayesian Optimization (FCBO) with support-aware acquisition across representative policy classes.
Additionally, it is essential to monitor run-time and memory usage as policy classes and evaluation budgets expand. The primary bottleneck in Bayesian Optimization (BO) is the Gaussian Process (GP) surrogate model, which often has cubic time complexity over the BO horizon. This can be overcome by replacing costly exact computations with approximations, such as random-feature surrogates chosen in the policy space, and by coordinating multi-fidelity evaluations of V ( γ ) . That is, using “cheap samples” that can be sampled quickly at a low computational cost in the inner layer of the identification formula to gain a coarse understanding of the policy’s value, and reserve the expensive, high-accuracy estimations (like high-fidelity quadrature for integrating probabilities) for only the most promising policies. This is essential for validating scalability in resource-constrained environments and for enabling larger, more expressive policy classes without sacrificing regret guarantees.
Taken together, these directions aim not only to refine the current methodology but also to broaden CID’s theoretical and practical footprint. Specifically, the field must transition from single-stage to multi-stage decision-making; move from brittle point estimates to robust identification; progress from exact, heavy surrogates to scalable multi-fidelity optimization; expand from isolated agents to coupled, constraint-laden systems; and shift from stylized demonstrations to reproducible, high-stakes applications. Advancing along these directions will facilitate the development of reliable, deployable incentive designs grounded in causal reasoning.

Author Contributions

Conceptualization, S.B., E.F.M., L.E.S. and E.M.d.C.; methodology, S.B., E.F.M., L.E.S. and E.M.d.C.; formal analysis, S.B., E.F.M. and E.M.d.C.; investigation, S.B., E.F.M., L.E.S. and E.M.d.C.; resources, S.B., E.F.M., L.E.S. and E.M.d.C.; writing—original draft preparation, S.B.; writing—review and editing, S.B., E.F.M., L.E.S. and E.M.d.C.; supervision, E.F.M., L.E.S. and E.M.d.C.; project administration, E.F.M., L.E.S. and E.M.d.C.; funding acquisition, E.F.M., L.E.S. and E.M.d.C. All authors have read and agreed to the published version of the manuscript.

Funding

S.B. is supported by Secretaría de Ciencia, Humanidades, Tecnología e Innovación (SECIHTI), Scholarship Number 1562. Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BO Bayesian Optimization
CBO Causal Bayesian Optimization
CID Causal Incentive Design
CGM Causal Graphical Models
FCBO Functional Causal Bayesian Optimization
FBO Functional Bayesian Optimization
GP Gaussian Process
MAS Multi-Agent System
PAP Principal–Agent Problem
RKHS Reproducing Kernel Hilbert Space
SCM Structural Causal Model

Appendix A. Regret Bound Proofs in a Finite Function Space

The details of the proofs of the propositions needed to establish Theorem 1 are presented next. Proposition A1 provides a high-probability bound on the difference |f(γ) − μ_{t−1}(γ)| between the true function value f(γ) and the posterior mean μ_{t−1}(γ), across the time steps 1, …, T and the incentive functions γ_1, …, γ_T ∈ Γ in the finite functional decision space Γ, showing that the posterior mean μ_{t−1}(γ) is a good estimate of the function value f(γ) up to a confidence width of √β_t σ_{t−1}(γ).
Proposition A1.
Let Γ be the finite decision set and let δ ∈ (0, 1) be the desired confidence level. Set β_t = 2 log( π² t² |Γ| / (6δ) ). Then
|f(γ) − μ_{t−1}(γ)| ≤ √β_t σ_{t−1}(γ),  ∀ γ ∈ Γ, ∀ t ≥ 1,
holds with probability at least 1 − δ.
Proof. 
We show that, for appropriately chosen confidence parameters {β_t}, t ∈ [T], every such confidence interval is valid with high probability. A random variable X ~ N(0, 1) satisfies the tail bound Pr(X > c) ≤ (1/2) e^{−c²/2} for c > 0 (see [31]), and Pr(|X| > c) = 2 Pr(X > c), since the standard normal distribution is symmetric. For fixed t ≥ 1, conditioned on the observations y_{t−1} = { Ê_{p_{γ_1}}[J_{L,1}], …, Ê_{p_{γ_{t−1}}}[J_{L,t−1}] }, the incentive functions γ_1, …, γ_{t−1} ∈ Γ are deterministic and the posterior distribution at time t − 1 is f(γ) ~ N(μ_{t−1}(γ), σ_{t−1}²(γ)). We normalize this Gaussian as Z = ( f(γ) − μ_{t−1}(γ) ) / σ_{t−1}(γ) ~ N(0, 1), so that the tail bound and the symmetry property apply to Z. Therefore, taking X = Z and c = √β_t, we have the following bound:
Pr( |f(γ) − μ_{t−1}(γ)| / σ_{t−1}(γ) > √β_t ) = Pr( |f(γ) − μ_{t−1}(γ)| > √β_t σ_{t−1}(γ) ) ≤ e^{−β_t/2}.
As we want the confidence interval to hold for all γ ∈ Γ and all time steps t ∈ [T] in Algorithm 1, we use the union bound to construct appropriate parameters {β_t}, t ∈ [T]. Let E be the event that at least one confidence interval fails for some γ ∈ Γ at some time step t ∈ [T] in Algorithm 1, so
E = ⋃_{γ ∈ Γ, t ∈ [T]} { |f(γ) − μ_{t−1}(γ)| > √β_t σ_{t−1}(γ) }.
Applying the union bound, we know that
Pr(E) = Pr( ⋃_{γ ∈ Γ, t ∈ [T]} { |f(γ) − μ_{t−1}(γ)| > √β_t σ_{t−1}(γ) } ) ≤ ∑_{t ∈ [T]} ∑_{γ ∈ Γ} Pr( |f(γ) − μ_{t−1}(γ)| > √β_t σ_{t−1}(γ) ).
Running Algorithm 1 up to T time steps gives T|Γ| events of the form { |f(γ) − μ_{t−1}(γ)| > √β_t σ_{t−1}(γ) }; since we want to ensure that Pr(E) ≤ δ, we can set
Pr( |f(γ) − μ_{t−1}(γ)| > √β_t σ_{t−1}(γ) ) ≤ δ / (T |Γ|).
Combining this with Equation (A1), we solve e^{−β_t/2} = δ/(T|Γ|) for β_t, obtaining β_t = 2 log(T|Γ|/δ). Finally, to analyze Algorithm 1 as T → ∞, the identity ∑_{t=1}^{∞} 1/t² = π²/6 is used to design a time-dependent probability budget for each time step, accounting for the union of infinitely many events. This increases the confidence parameter β_t over time, so that the per-step failure probability decreases suitably quickly and the probability of a failure anywhere, at any time, remains small. We now want
Pr(E) = Pr( ⋃_{t=1}^{∞} ⋃_{γ ∈ Γ} { |f(γ) − μ_{t−1}(γ)| > √β_t σ_{t−1}(γ) } ) ≤ δ.
Similar to Equation (A2), but now taking advantage of the identity ∑_{t=1}^{∞} 1/t² = π²/6, we set
Pr( |f(γ) − μ_{t−1}(γ)| > √β_t σ_{t−1}(γ) ) ≤ (δ/|Γ|) · (6/(π² t²)).
Then, by the union bound for this T → ∞ case, we have the following:
Pr(E) ≤ ∑_{t=1}^{∞} ∑_{γ ∈ Γ} Pr( |f(γ) − μ_{t−1}(γ)| > √β_t σ_{t−1}(γ) ) ≤ ∑_{γ ∈ Γ} ∑_{t=1}^{∞} (δ/|Γ|) · (6/(π² t²)) = ∑_{γ ∈ Γ} (δ/|Γ|) · (6/π²) ∑_{t=1}^{∞} 1/t² = ∑_{γ ∈ Γ} δ/|Γ| = δ.
As above, using the tail bound from Equation (A1), we solve e^{−β_t/2} = 6δ/(π² t² |Γ|) for β_t, from which we obtain
β_t = 2 log( π² t² |Γ| / (6δ) ).
Proposition A2.
Fix t ≥ 1. If |f(γ) − μ_{t−1}(γ)| ≤ √β_t σ_{t−1}(γ) for all γ ∈ Γ, then the instantaneous regret satisfies r_t = f(γ*) − f(γ_t) ≤ 2√β_t σ_{t−1}(γ_t).
Proof. 
Algorithm 1 selects the incentive function γ_t ∈ Γ according to the acquisition functional
γ_t = argmax_{γ ∈ Γ} { μ_{t−1}(γ) + √β_t σ_{t−1}(γ) }.
So, for all γ ∈ Γ, μ_{t−1}(γ) + √β_t σ_{t−1}(γ) ≤ μ_{t−1}(γ_t) + √β_t σ_{t−1}(γ_t). In particular, for the unknown optimal incentive function γ*, we have that
μ_{t−1}(γ*) + √β_t σ_{t−1}(γ*) ≤ μ_{t−1}(γ_t) + √β_t σ_{t−1}(γ_t).
By hypothesis, in the particular cases of γ* and γ_t, we have
f(γ*) ≤ μ_{t−1}(γ*) + √β_t σ_{t−1}(γ*),
f(γ_t) ≥ μ_{t−1}(γ_t) − √β_t σ_{t−1}(γ_t).
Then, combining these inequalities, we obtain the following bound for the instantaneous regret r_t:
r_t = f(γ*) − f(γ_t) ≤ μ_{t−1}(γ*) + √β_t σ_{t−1}(γ*) − μ_{t−1}(γ_t) + √β_t σ_{t−1}(γ_t) ≤ μ_{t−1}(γ_t) + √β_t σ_{t−1}(γ_t) − μ_{t−1}(γ_t) + √β_t σ_{t−1}(γ_t) = 2√β_t σ_{t−1}(γ_t).
Proposition A3 shows that the information gain for a selected set of incentive functions can be expressed in terms of the predictive variances and that we can bound this variance in terms of the information gain.
Proposition A3.
For any finite sequence of incentive functions γ 1 , , γ τ Γ
t = 1 τ ( σ t ) 2 ( γ t ) σ 2 log ( 1 + σ 2 ) I τ ( f τ ; y τ ) .
Proof. 
Following Section 4.4, y_τ = {y_1, …, y_τ} = { Ê_{p_{γ_1}}[J_{L,1}], …, Ê_{p_{γ_τ}}[J_{L,τ}] }, where each incentive function γ_t is deterministic conditioned on y_{t−1}. From Equation (63), we know that I_τ(f_τ; y_τ) = H(y_τ) − H(y_τ | f_τ). Using the chain rule for differential entropy, we have the following alternative expressions for H(y_τ) in Equation (A8) and H(y_τ | f_τ) in Equation (A9), equivalent to Equation (63):
H(y_τ) = H(y_1, …, y_τ) = ∑_{t=1}^{τ} H(y_t | y_1, …, y_{t−1}) = ∑_{t=1}^{τ} (1/2) log( 2πe ( σ_{t−1}²(γ_t) + σ² ) ),
H(y_τ | f_τ) = ∑_{t=1}^{τ} H(ε_t) = ∑_{t=1}^{τ} (1/2) log( 2πe σ² ),
since, given f_τ, the only remaining randomness is the noise, as y_t = Ê_{p_{γ_t}}[J_{L,t}] = f(γ_t) + ε_t, with ε_t ~ N(0, σ²). Using Equations (A8) and (A9), we can now express the information gain for the selected incentive functions as follows:
I_τ(f_τ; y_τ) = H(y_τ) − H(y_τ | f_τ) = ∑_{t=1}^{τ} (1/2) log( 2πe ( σ_{t−1}²(γ_t) + σ² ) ) − ∑_{t=1}^{τ} (1/2) log( 2πe σ² ) = (1/2) ∑_{t=1}^{τ} log( ( σ_{t−1}²(γ_t) + σ² ) / σ² ) = (1/2) ∑_{t=1}^{τ} log( 1 + σ_{t−1}²(γ_t) / σ² ).
Observe that in Equation (A10) we have expressed the information gain for a selected set of incentive functions exclusively in terms of predictive variances. Now, we use the elementary inequality z ≤ ( z* / log(1 + z*) ) log(1 + z), valid for all z ∈ [0, z*] because z ↦ z / log(1 + z) is nondecreasing, and define z_t = σ_{t−1}²(γ_t) / σ², so that σ_{t−1}²(γ_t) = σ² z_t. Taking z* = σ^{−2}, we have
z_t ≤ ( σ^{−2} / log(1 + σ^{−2}) ) log(1 + z_t),
because z_t ≤ σ^{−2}. Now, multiplying both sides of Equation (A11) by σ² and summing over t = 1, …, τ, we have
∑_{t=1}^{τ} σ_{t−1}²(γ_t) ≤ ( σ² / log(1 + σ^{−2}) ) ∑_{t=1}^{τ} log(1 + z_t) = ( σ² / log(1 + σ^{−2}) ) ∑_{t=1}^{τ} log( 1 + σ_{t−1}²(γ_t) / σ² ) = ( σ² / log(1 + σ^{−2}) ) I_τ(f_τ; y_τ).
Proposition A4.
Let δ ( 0 , 1 ) be the desired confidence level and let β t = 2 log ( π 2 t 2 6 δ | Γ | ) . Then, the following holds with probability at least 1 δ :
∑_{t=1}^{T} r_t² ≤ β_T C_1 I_T(f_T; y_T),  ∀ T ≥ 1,
where C_1 = 4σ² / log(1 + σ^{−2}).
Proof. 
From Propositions A1 and A2, we have that { r_t² ≤ 4 β_t σ_{t−1}²(γ_t), ∀ t ≥ 1 } holds with probability at least 1 − δ, obtained by squaring both sides of the expression r_t ≤ 2√β_t σ_{t−1}(γ_t) in Proposition A2. Then, by the monotonicity of β_t (so that β_t ≤ β_T for all t ∈ [T]) and the result of Proposition A3, we have
∑_{t=1}^{T} r_t² ≤ 4 ∑_{t=1}^{T} β_t σ_{t−1}²(γ_t) ≤ 4 β_T ∑_{t=1}^{T} σ_{t−1}²(γ_t) ≤ β_T ( 4σ² / log(1 + σ^{−2}) ) I_T(f_T; y_T) = β_T C_1 I_T(f_T; y_T).
Finally, Theorem 1 follows from applying the Cauchy–Schwarz inequality ∑_{t=1}^{T} r_t ≤ √( T ∑_{t=1}^{T} r_t² ) to the result of Proposition A4, which yields the bound
R_T = ∑_{t=1}^{T} r_t ≤ √( T ∑_{t=1}^{T} r_t² ) ≤ √( T β_T C_1 I_T(f_T; y_T) ).

Appendix B. Reproducing Kernel Hilbert Spaces

Before describing the notion of a Reproducing Kernel Hilbert Space (RKHS), we discuss kernel functions. Let X be a nonempty set and let R^X be the set of all real-valued functions on X. Recall that a Hilbert space H is a complete vector space, i.e., one in which all Cauchy sequences converge, with an inner product ⟨·,·⟩_H that gives rise to a norm ‖h‖_H = √⟨h, h⟩_H for every h ∈ H. A function k: X × X → R is called a kernel on X if there exist a real Hilbert space H and a map Φ: X → H such that for all x, x′ ∈ X, we have k(x, x′) = ⟨Φ(x), Φ(x′)⟩. We call Φ a feature map and H a feature space of k. Furthermore, a kernel is positive definite in the sense that for all n ∈ N, ω_1, …, ω_n ∈ R, and all x_1, …, x_n ∈ X, we have ∑_{i∈[n]} ∑_{j∈[n]} ω_i ω_j k(x_i, x_j) ≥ 0. The matrix K = (k(x_j, x_i))_{ij} is called the Gram matrix, and the positive definiteness of the kernel k is equivalent to the Gram matrix being positive semidefinite for every choice of points. An RKHS H_k on X is a Hilbert space H ⊂ R^X with a kernel function k: X × X → R, called the reproducing kernel, such that for all x ∈ X and all functions f ∈ H, we have f(x) = ⟨f, k_x⟩_H, where k_x = k(·, x) ∈ H.
Beyond the definition, a more intuitive understanding of the rationale underlying an RKHS can be gained through the following construction, which builds a Hilbert function space (an RKHS) from a kernel k: X × X → R. Given X and a function k, consider the feature map Φ: X → R^X defined by x ↦ Φ(x) = k_x = k(x, ·), known as the reproducing kernel map. That is, the point x ∈ X is mapped to the function k_x: X → R with k_x(y) = k(x, y) for y ∈ X. We can then construct a vector space by taking the images {k_x | x ∈ X} as a spanning set, i.e., the vector space of all finite linear combinations of the functions k(·, x). Thus, given a kernel k(x, x′), the RKHS of functions H_k and its inner product are given in Equations (A16) and (A17), respectively.
H_k = { h(·) = ∑_{i∈[n]} ω_i k(x_i, ·) | ω_i ∈ R, n ∈ N, x_i ∈ X },
⟨h, g⟩ = ⟨ ∑_{i∈[n]} ω_i k(x_i, ·), ∑_{j∈[m]} ξ_j k(x_j′, ·) ⟩ = ∑_{i∈[n]} ∑_{j∈[m]} ω_i ξ_j k(x_i, x_j′),
with ω_i, ξ_j ∈ R, n, m ∈ N, and x_i, x_j′ ∈ X. From the inner product defined in Equation (A17), we can deduce the reproducing properties ⟨k(·, x), h⟩ = h(x) and ⟨k(·, x), k(·, x′)⟩ = k(x, x′), from which we see that ⟨Φ(x), Φ(x′)⟩ = ⟨k(·, x), k(·, x′)⟩ = k(x, x′). Hilbert spaces of scalar-valued functions with reproducing kernels were introduced and studied in [32]. The kernel k in this appendix is the reproducing kernel on X for H_k. In the main text, the GP over functions Γ ⊂ H_k uses a distinct kernel K(γ, γ′) (see Section 4.1.1), constructed from ‖·‖_{H_k}.

A Finite Reproducing Kernel Hilbert Space

Here, we provide the construction of the functional space H_k^d used in the regret analysis of Algorithm 1 in Theorem 1. The most general construction of an RKHS can be achieved using a positive semidefinite kernel defined on a finite set. Let X = {x_1, x_2, …, x_d} be a finite nonempty set, and let k: X × X → R be a symmetric, positive semidefinite kernel function, i.e., for any choice of real coefficients c_1, …, c_d, we have ∑_{i=1}^{d} ∑_{j=1}^{d} c_i c_j k(x_i, x_j) ≥ 0. We obtain a finite RKHS by defining the function space H_k^d as the span of the kernel sections at the finite points, H_k^d := span{ k(·, x_1), k(·, x_2), …, k(·, x_d) }. Every function f ∈ H_k^d then has the form f(x) = ∑_{i=1}^{d} c_i k(x, x_i) for some vector of coefficients c = (c_1, …, c_d) ∈ R^d. The Gram matrix K ∈ R^{d×d} is defined by K_ij := k(x_i, x_j); since k is positive semidefinite, K is symmetric and positive semidefinite. For functions f, g ∈ H_k^d defined by
f(x) = ∑_{i=1}^{d} c_i k(x, x_i),  g(x) = ∑_{i=1}^{d} b_i k(x, x_i),
the inner product is given by ⟨f, g⟩_{H_k^d} := c^⊤ K b, and the associated norm by ‖f‖²_{H_k^d} = ⟨f, f⟩_{H_k^d} = c^⊤ K c. The space H_k^d satisfies the reproducing property f(x_j) = ⟨f, k(·, x_j)⟩_{H_k^d}, j = 1, …, d, because for any f ∈ H_k^d, we have
f(x_j) = ∑_{i=1}^{d} c_i k(x_j, x_i) = ⟨f, k(·, x_j)⟩_{H_k^d}.
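The finite construction can be checked numerically, as in the short sketch below; the RBF kernel and the sample points are illustrative choices for this verification only.

```python
import numpy as np

k = lambda x, y: np.exp(-0.5 * (x - y) ** 2)   # illustrative RBF kernel
X = np.array([0.0, 0.5, 1.0, 2.0])             # finite point set x_1..x_d
K = k(X[:, None], X[None, :])                  # Gram matrix K_ij = k(x_i, x_j)

c = np.array([1.0, -0.5, 0.2, 0.3])            # f = sum_i c_i k(., x_i)
b = np.array([0.1, 0.4, -0.2, 0.0])            # g = sum_i b_i k(., x_i)

inner_fg = c @ K @ b                           # <f, g> = c^T K b
norm_f_sq = c @ K @ c                          # ||f||^2 = c^T K c

# Reproducing property: f(x_j) = <f, k(., x_j)> = (K c)_j for each j.
f_vals = np.array([np.sum(c * k(xj, X)) for xj in X])
assert np.allclose(f_vals, K @ c)
```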
The aforementioned construction can be further extended to consider H k d as a finite-dimensional RKHS of a generic kernel, with an infinite and continuous input space X R d . The key idea is to restrict the RKHS to a finite-dimensional subspace by choosing a finite number of basis functions associated with the kernel.

Appendix C. Bayesian Optimization

Given a real-valued objective function f_obj: X → R defined over some domain X, Bayesian optimization (BO) takes a probabilistic approach to methodically search for a point x* ∈ argmax_{x∈X} f_obj(x) that achieves the global maximum value f*_obj = max_{x∈X} f_obj(x). It is worth highlighting that we make no assumptions about the characteristics of the domain X. In particular, it is not required to be a Euclidean space; it may be a space with a more complex structure, such as a function space, which is the focus of our investigation. In addition, BO differs from traditional optimization in that it does not require f_obj(x) to have a known, restricted functional form. BO is an instance of the more general framework of sequential optimization, where the input is an initial (possibly empty) data set D = {(x, y)} from which we design a sequence of experiments to gather information about f_obj(x). In each interaction, a policy optimizer selects a point x_α at which to make the next observation. Having selected the point x = x_α, we obtain feedback from the system under study to observe f_obj(x). We append the newly observed information to update the data set, i.e., D = D ∪ {(x, y)}, and repeat until a termination condition is reached. There are different mechanisms for deciding when to terminate the sequential optimization process; for our purposes in this research, we adopt as stopping criterion the exhaustion of an experimentation budget, that is, a finite time horizon T ∈ N. Moreover, we discard the strong assumption of an exact observation model and instead adopt an additive Gaussian noise observation model: the value of f_obj observed at x is modeled as y = f_obj(x) + ε, where ε ~ N(0, σ²) represents the measurement error.
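The loop just described can be summarized in the following schematic sketch, where the acquisition policy acquire is left abstract and the noisy observation model is the additive Gaussian one above; all names are illustrative.

```python
import numpy as np

def bayes_opt(f_obj, candidates, acquire, T=30, noise_sd=0.1, seed=0):
    """Generic sequential-optimization skeleton for BO: select a point,
    observe it noisily, update the data set, and stop once the
    experimentation budget T is exhausted."""
    rng = np.random.default_rng(seed)
    D = []                                         # data set of (x, y) pairs
    for _ in range(T):
        x = acquire(D, candidates)                 # next point from the policy
        y = f_obj(x) + rng.normal(0.0, noise_sd)   # y = f_obj(x) + eps
        D.append((x, y))
    return max(D, key=lambda xy: xy[1])            # incumbent (under noise)
```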
What distinguishes BO from the more general sequential optimization framework is that BO relies on Bayesian inference (Bayes' theorem) to reason about the uncertain quantities of a system of interest in light of our prior knowledge and any available data. This especially includes the objective function f_obj, which is typically treated as a stochastic process in this context, i.e., a probability distribution over an infinite family of random variables y = f_obj(x) + ε in a probability space, indexed by the points x ∈ X.

Appendix C.1. Gaussian Process Surrogate Model

We focus on the Gaussian process (GP) representation of the objective function f o b j . A GP is a stochastic process such that every finite collection of random variables in the stochastic process has a multivariate normal distribution. The distribution of a GP is the joint distribution of all those (infinitely many) random variables, and as such it is a distribution over functions with a continuous domain. Thus, a GP can be used as a prior probability distribution over functions to perform Bayesian inference, i.e., to compute posterior distribution refining our initial beliefs according to the observed data.
We specify a Gaussian process as GP(μ(·), k(·,·)), with mean function μ and covariance kernel k, which we assume to be bounded: k(x, x) ≤ 1 for all x ∈ X, without loss of generality. The mean function μ(x) = E[f_obj(x)] determines the expected value of the model for f_obj at any location x, thus serving as a location parameter that represents the central tendency of our stochastic model of f_obj. The covariance kernel k(x, x′) = cov[f_obj(x), f_obj(x′)] determines how deviations from the mean are structured in our model for f_obj, i.e., for its samples f_obj ~ GP(μ, k), and also encodes expected properties of the stochastic model of f_obj, such as differentiability and smoothness. The squared exponential kernel, also known as the Radial Basis Function (RBF) kernel, and the kernels of the Matérn family are two frequently used examples in the GP framework.

Appendix C.2. Acquisition Functions

Every BO method fundamentally has two major components: (i) a probabilistic surrogate model representing the belief over the black-box objective function f_obj, such as the Gaussian processes previously discussed; and (ii) an acquisition function (AF), α: X → R, which assigns to each candidate evaluation point a utility value derived from the posterior distribution of the probabilistic surrogate model, reflecting our preferences over locations for the next observation and thereby guiding the selection of the next evaluation point.
Even though there is a large collection of acquisition functions, we restrict our attention to the Gaussian Process Upper Confidence Bound (GP-UCB) acquisition function. The Upper Confidence Bound criterion is commonly used to deal with the exploration–exploitation dilemma and to evaluate strategies in Multi-Armed Bandit (MAB) algorithms. The GP-UCB acquisition function extends the UCB criterion from MABs to BO, as BO can be interpreted as an infinite-armed bandit problem. Assume we have gathered an arbitrary data set D, and consider an arbitrary point x ∈ X. Consider the quantile function associated with the predictive distribution p(y | x, D), given by
q(ρ; x, D) = inf{ y′ | Pr(y ≤ y′ | x, D) ≥ ρ }.
This quantile function satisfies y ≤ q(ρ; x, D) with probability ρ ∈ (0, 1), so we can interpret q(ρ; x, D) as an Upper Confidence Bound (UCB) on the objective value y, i.e., the value of y exceeds the bound only with tunable probability 1 − ρ. As a function of x, q(ρ; x, D) is an optimistic estimate of y, suggesting that we observe where this Upper Confidence Bound is maximized, which yields the acquisition function α_UCB(x; D, ρ) = q(ρ; x, D), where q is the quantile function with confidence ρ ∈ (0, 1). For a Gaussian process, this quantile takes the simple form of Equation (A20):
α_GP-UCB(x; D, ρ) = μ(x) + β σ(x),
where β = Φ^{−1}(ρ) depends on the confidence level and can be computed from the inverse Gaussian cumulative distribution function (Φ^{−1} denotes the quantile function, i.e., the inverse cumulative distribution function (CDF)). Low confidence values (e.g., ρ = 0.8) downweight sites with high uncertainty, thereby favoring exploitation; increasing the confidence parameter (e.g., ρ = 0.95 or ρ = 0.99) leads to increasingly exploratory behavior. The policy implemented by sequentially maximizing Upper Confidence Bounds is backed by robust theoretical guarantees: Srinivas et al. [7] demonstrated that this policy is guaranteed to effectively maximize the objective at a non-trivial rate for GP models under reasonable assumptions. In the main text (Sections 4.2 and 4.3), we parameterize the acquisition by a (possibly time-varying) exploration coefficient β_t. For the regret analysis (Section 4.4), we take β_t = 2 log( π² t² |Γ| / (6δ) ), which is equivalent to using a time-varying confidence level ρ_t = Φ(√β_t).
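The mapping from confidence level to exploration coefficient is a one-liner, sketched here with SciPy's standard-normal quantile function.

```python
import numpy as np
from scipy.stats import norm

def gp_ucb(mu, sigma, rho=0.95):
    """alpha_GP-UCB(x) = mu(x) + beta * sigma(x), with beta = Phi^{-1}(rho)."""
    beta = norm.ppf(rho)        # inverse standard-normal CDF (quantile function)
    return mu + beta * sigma

# rho = 0.8 favors exploitation; rho = 0.99 favors exploration.
```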

Appendix D. Functional Approximation via the Stone–Weierstrass Theorem

The Stone–Weierstrass theorem provides sufficient conditions for approximating continuous functions on compact spaces using elements from a subalgebra. In its classical real-valued formulation, it characterizes when an algebra of continuous functions is uniformly dense in the full space C ( X , R ) , the Banach space of real-valued continuous functions on a compact Hausdorff space X, endowed with the uniform norm (see, for instance, ref. [33]).
Let ( X , d ) be a compact Hausdorff space and let C ( X , R ) denote the space of continuous real-valued functions on X, equipped with the uniform norm
‖g‖_∞ := sup_{x ∈ X} |g(x)|.
A subset A ⊂ C(X, R) is called an algebra if it is closed under pointwise addition, scalar multiplication, and pointwise multiplication. We say that A separates points of X if for every pair of distinct points x, y ∈ X, there exists g ∈ A such that g(x) ≠ g(y); this means the algebra is rich enough to distinguish any two points of the domain. We say that A contains the constants if the constant function 1(x) ≡ 1 belongs to A.
Theorem A1
(Stone–Weierstrass Theorem). If A ⊂ C(X, R) is a subalgebra that separates points of X and contains the constants, then A is dense in C(X, R) with respect to the uniform norm. That is, for every f ∈ C(X, R) and every ε > 0, there exists g ∈ A such that
‖f − g‖_∞ < ε.
In our context, X will be either the compact subset Γ ⊂ H_k of admissible incentive functions or the compact input domain Ω_F of a follower's decision variable. The algebra A will be constructed from either (1) the span of kernel sections K(·, γ_0) with γ_0 ∈ Γ, for the approximation of the functional f: Γ → R, or (2) the span of kernel sections k(·, x) with x ∈ Ω_F, or the polynomial algebra P_n, for the approximation of scalar-valued incentive functions γ: Ω_F → R.
The separation property follows from the positive definiteness of the kernels K or k, which guarantees that kernel sections distinguish distinct points. The constants are contained in the algebra because constant functions can be obtained (or approximated) from kernel sections or are trivially included in P n .
Theorem A1 then ensures that any continuous target functional or incentive function can be uniformly approximated by finite linear combinations of functions from these algebras, which is the foundation for the two approximation steps described in the following subsections.

Appendix D.1. Approximation of the Objective Functional

Let G ⊂ H_k be a compact subset of the RKHS of incentive functions, equipped with the topology induced by the RKHS norm ‖·‖_{H_k}. That is, a sequence {γ_n} ⊂ H_k converges to γ ∈ H_k if and only if ‖γ_n − γ‖_{H_k} → 0. We consider the objective functional f: G → R defined by
f(γ) = E[ J_L | do(Ω_L = γ(ω_F)) ],
which we assume to be continuous with respect to ‖·‖_{H_k}. The functional kernel K: H_k × H_k → R induces a family of evaluation maps of the form γ ↦ K(γ, γ_0) for fixed γ_0 ∈ G. We define the algebra
A_1 := span{ K(·, γ_0) | γ_0 ∈ G } ⊂ C(G, R).
This algebra separates points, since K is strictly positive definite and, for γ_1 ≠ γ_2, there exists γ_0 such that K(γ_1, γ_0) ≠ K(γ_2, γ_0). Constant functions are also included, since K(γ, γ) is bounded below by a positive value and adding scaled copies yields constant approximations. Hence, by the Stone–Weierstrass theorem, A_1 is uniformly dense in C(G, R), and any continuous objective functional f can be approximated by elements of A_1. We can rewrite A_1 in the language needed for the regret bounds. Let Γ ⊂ H_k be a compact subset of the RKHS of incentive functions, endowed with the norm topology of ‖·‖_{H_k}, and assume that the objective functional f: Γ → R is continuous, f ∈ C(Γ). We rewrite the algebra A_1 as follows:
A_1 := span{ K(·, γ_0) | γ_0 ∈ Γ },
where K: H_k × H_k → R is the functional kernel used in the GP prior. Because K is positive definite, A_1 separates points in Γ and contains the constant functions, as stated above. By the Stone–Weierstrass theorem, A_1 is dense in C(Γ) under the uniform norm.
As a consequence, for every ε_1 > 0, there exist a finite set {γ_1, …, γ_d} ⊂ Γ and coefficients α_1, …, α_d ∈ R such that
sup_{γ ∈ Γ} | f(γ) − ∑_{i=1}^{d} α_i K(γ_i, γ) | < ε_1.
Therefore, the finite set {γ_i}_{i=1}^{d} plays the role of a basis of reference incentive functions. Each γ ∈ Γ is represented through its coordinates (K(γ_1, γ), …, K(γ_d, γ)) in this basis, and the approximant ∑_{i=1}^{d} α_i K(γ_i, γ) is a finite-dimensional parametric model for f with parameters {α_i}. Thus, the algebraic density statement of Stone–Weierstrass and the finite-basis parametric view are equivalent; both state that a continuous functional f on Γ can be uniformly approximated by an element of the finite-dimensional span of the kernel sections K(·, γ_i). This is the first step in reducing the infinite-dimensional functional optimization problem to one over a finite parameter space.

Appendix D.2. Selection Methods of a Finite Set of Basis Incentive Functions

In practice, the first application of the Stone–Weierstrass theorem requires the construction of a finite set {γ_1, …, γ_d} ⊂ H_k such that the algebra they generate, A_1 = span{ K(·, γ_i) : i = 1, …, d }, separates points and contains the constants, ensuring density in C(H_k, R) under the uniform norm on compact subsets. Below, we describe two formal approaches for constructing such a set.
Method 1: ε-cover-based selection. Let Γ ⊂ H_k be compact in the RKHS norm ‖·‖_{H_k}. For ε > 0, an ε-cover of Γ with respect to the H_k norm is a finite subset G_ε ⊂ Γ such that for every γ ∈ Γ, there exists γ̃ ∈ G_ε satisfying ‖γ − γ̃‖_{H_k} ≤ ε. An ε-cover is said to be minimal if it has the smallest possible cardinality among all ε-covers of Γ. This minimal cardinality is called the ε-covering number and is denoted by N(Γ, ‖·‖_{H_k}, ε).
Given ε > 0, a minimal ε-cover can be constructed by iteratively selecting points from Γ such that (i) the first point is chosen arbitrarily from Γ, (ii) each subsequent point is chosen to lie at H_k-norm distance strictly greater than ε from all previously chosen points, and (iii) the process stops once the selected points cover Γ in the sense above. This greedy selection achieves a covering whose cardinality is within a constant factor of the optimum for compact metric spaces. Basis selection. Once a minimal (or near-minimal) ε-cover G_ε = {γ̃_1, …, γ̃_d} is obtained, we set {γ_1, …, γ_d} := G_ε. By the properties of continuous positive-definite kernels K, the kernel sections K(·, γ_i) separate points in Γ, and the constants are contained in A_1 = span{ K(·, γ_i) }.
By the Stone–Weierstrass theorem, A_1 is then dense in C(Γ, R) in the uniform topology (see, for instance, [27]). This method explicitly links the choice of basis to the covering numbers N(Γ, ‖·‖_{H_k}, ε), which determine the discretization size |G_ε| = d and are also relevant in the analysis of the confidence parameter β_T in the regret bounds.
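A greedy ε-cover can be sketched as follows, assuming (for illustration) that the incentive functions are represented by coefficient vectors over shared kernel sections, so that ‖γ − γ′‖²_{H_k} = (a − a′)^⊤ K (a − a′) with K the Gram matrix of those sections.

```python
import numpy as np

def rkhs_distance(a1, a2, K):
    """RKHS distance between two functions given by coefficient vectors a1, a2
    over the same kernel sections; K is the Gram matrix of the sections."""
    d = np.asarray(a1, dtype=float) - np.asarray(a2, dtype=float)
    return float(np.sqrt(d @ K @ d))

def greedy_eps_cover(coeffs, K, eps):
    """Greedy eps-cover: add a candidate whenever it lies farther than eps
    (in RKHS norm) from every previously selected point."""
    cover = []
    for a in coeffs:
        if all(rkhs_distance(a, c, K) > eps for c in cover):
            cover.append(a)
    return cover
```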
Method 2: Spectral decomposition of the kernel operator. Assume that K: Γ × Γ → R is continuous and positive-definite on a compact Γ, with associated Mercer decomposition
K(γ, γ′) = ∑_{m=1}^{∞} λ_m φ_m(γ) φ_m(γ′),
where (λ_m)_{m≥1} are positive eigenvalues and (φ_m)_{m≥1} are orthonormal eigenfunctions in L²(Γ, μ) for some finite Borel measure μ. Select γ_i ∈ Γ such that γ_i corresponds to the RKHS representative associated with φ_i, and take d large enough that the tail ∑_{m>d} λ_m falls below a desired threshold. The kernel sections K(·, γ_i) inherit the separation and constant-inclusion properties, and the span of the first d eigenfunctions produces a finite-dimensional approximation space whose richness improves with d. For details on Mercer expansions and spectral selection strategies, see [34].
In both methods, the returned finite set { γ 1 , , γ d } serves as a basis for the algebra A 1 , ensuring that the approximation property required by the Stone–Weierstrass theorem holds to within the desired tolerance on compact subsets of H k .

References

  1. Ratliff, L.J.; Dong, R.; Sekar, S.; Fiez, T. A perspective on incentive design: Challenges and opportunities. Annu. Rev. Control Robot. Auton. Syst. 2019, 2, 305–338. [Google Scholar]
  2. Pearl, J. A probabilistic calculus of actions. In Uncertainty in Artificial Intelligence; Morgan Kaufmann: San Mateo, CA, USA, 1994; pp. 454–462. [Google Scholar]
  3. Correa, J.; Bareinboim, E. A Calculus for Stochastic Interventions: Causal Effect Identification and Surrogate Experiments. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10093–10100. [Google Scholar]
  4. Williams, C.K.; Rasmussen, C.E. Gaussian Processes for Machine Learning; The MIT Press: Cambridge, MA, USA, 2006. [Google Scholar]
  5. Snoek, J.; Larochelle, H.; Adams, R.P. Practical bayesian optimization of machine learning algorithms. Adv. Neural Inf. Process. Syst. 2012, 25. [Google Scholar]
  6. Vien, N.A.; Zimmermann, H.; Toussaint, M. Bayesian Functional Optimization. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 4171–4178. [Google Scholar]
  7. Srinivas, N.; Krause, A.; Kakade, S.M.; Seeger, M.W. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Trans. Inf. Theory 2012, 58, 3250–3265. [Google Scholar] [CrossRef]
  8. Chowdhury, S.R.; Gopalan, A. On kernelized multi-armed bandits. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 844–853. [Google Scholar]
  9. Laffont, J.J.; Martimort, D. The Theory of Incentives: The Principal-Agent Model; Princeton University Press: Princeton, NJ, USA, 2009. [Google Scholar]
  10. Basar, T.; Selbuz, H. Closed-Loop Stackelberg Strategies with Applications in the Optimal Control of Multilevel Systems. IEEE Trans. Autom. Control 1979, 24, 166–179. [Google Scholar] [CrossRef]
  11. Dempe, S.; Zemkoho, A. (Eds.) Bilevel Optimization: Advances and Next Challenges; Springer International Publishing: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  12. Yang, J.; Wang, E.; Trivedi, R.; Zhao, T.; Zha, H. Adaptive Incentive Design with Multi-Agent Meta-Gradient Reinforcement Learning. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, Virtual, 9–13 May 2022; pp. 1436–1445. [Google Scholar]
  13. Ho, C.-J.; Slivkins, A.; Vaughan, J.W. Adaptive Contract Design for Crowdsourcing Markets: Bandit Algorithms for Repeated Principal–Agent Problems. J. Artif. Intell. Res. 2016, 55, 317–359. [Google Scholar] [CrossRef]
  14. Fiez, T.; Sekar, S.; Zheng, L.; Ratliff, L.J. Combinatorial Bandits for Incentivizing Agents with Dynamic Preferences. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence, Monterey, CA, USA, 6–10 August 2018; pp. 247–257. [Google Scholar]
  15. Guresti, B.; Vanlioglu, A.; Ure, N.K. IQ-Flow: Mechanism Design for Inducing Cooperative Behavior to Self-Interested Agents in Sequential Social Dilemmas. In Proceedings of the 22nd International Conference on Autonomous Agents and Multiagent Systems, London, UK, 29 May–2 June 2023; pp. 2143–2160. [Google Scholar]
  16. Mguni, D.; Jennings, J.; Macua, S.V.; Sison, E.; Ceppi, S.; Cote, E.M.D. Coordinating the Crowd: Inducing Desirable Equilibria in Non-Cooperative Systems. In Proceedings of the 18th International Conference on Autonomous Agents and Multiagent Systems, Montreal, QC, Canada, 13–17 May 2019; pp. 386–394. [Google Scholar]
  17. Jiang, J.; Wu, L.; Yu, J.; Wang, M.; Kong, H.; Zhang, Z.; Wang, J. Robustness of bilayer railway-aviation transportation network considering discrete cross-layer traffic flow assignment. Transp. Res. Part D Transp. Environ. 2024, 127, 104071. [Google Scholar] [CrossRef]
  18. Lattimore, F.; Lattimore, T.; Reid, M.D. Causal Bandits: Learning Good Interventions via Causal Inference. Adv. Neural Inf. Process. Syst. 2016, 29, 1181–1189. [Google Scholar]
  19. Bareinboim, E.; Forney, A.; Pearl, J. Bandits with Unobserved Confounders: A Causal Approach. Adv. Neural Inf. Process. Syst. 2015, 28, 1342–1350. [Google Scholar]
  20. Lee, S.; Bareinboim, E. Structural Causal Bandits: Where to Intervene? Adv. Neural Inf. Process. Syst. 2018, 31, 6276–6286. [Google Scholar]
  21. Lee, S.; Bareinboim, E. Structural Causal Bandits with Non-Manipulable Variables. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 4164–4172. [Google Scholar]
  22. Buesing, L.; Weber, T.; Zwols, Y.; Racaniere, S.; Guez, A.; Lespiau, J.B.; Heess, N. Woulda, Coulda, Shoulda: Counterfactually- Guided Policy Search. arXiv 2019, arXiv:1811.06272. [Google Scholar]
  23. Madumal, P.; Miller, T.; Sonenberg, L.; Vetere, F. Explainable Reinforcement Learning Through a Causal Lens. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 2493–2500. [Google Scholar]
  24. Aglietti, V.; Lu, X.; Paleyes, A.; González, J. Causal Bayesian Optimization. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, Online, 26–28 August 2020; Volume 108, pp. 22–31. [Google Scholar]
  25. Gultchin, L.; Virginia, A.; Alexis, B.; Silvia, C. Functional Causal Bayesian Optimization. In Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, Pittsburgh, PA, USA, 31 July–4 August 2023; Volume 216, pp. 756–765. [Google Scholar]
  26. Pearl, J. Causality: Models, Reasoning, and Inference, 2nd ed.; Cambridge University Press: Cambridge, UK, 2009. [Google Scholar]
  27. Cucker, F.; Smale, S. On the mathematical foundations of learning. Bull. Am. Math. Soc. 2002, 39, 1–49. [Google Scholar] [CrossRef]
  28. Thomas, M.T.C.A.J.; Joy, A.T. Elements of Information Theory; Wiley-Interscience: New York, NY, USA, 2006. [Google Scholar]
  29. Goldberg, P.; Williams, C.; Bishop, C. Regression with input-dependent noise: A Gaussian process treatment. Adv. Neural Inf. Process. Syst. 1997, 10, 493–495. [Google Scholar]
  30. Bejos, S.; Sucar, L.E.; Morales, E.F. Estimating Bounds on Causal Effects Considering Unmeasured Common Causes. In Proceedings of the International Conference on Probabilistic Graphical Models (PMLR), Nijmegen, The Netherlands, 11–13 September 2024; pp. 498–514. [Google Scholar]
  31. Zhang, T. Mathematical Analysis of Machine Learning Algorithms; Cambridge University Press: Cambridge, UK, 2023. [Google Scholar]
  32. Aronszajn, N. Theory of reproducing kernels. Trans. Am. Math. Soc. 1950, 68, 337–404. [Google Scholar] [CrossRef]
  33. Rudin, W. Principles of Mathematical Analysis, 3rd ed.; International Series in Pure and Applied Mathematics; McGraw-Hill: New York, NY, USA, 1976. [Google Scholar]
  34. Bach, F. On the equivalence between kernel quadrature rules and random feature expansions. J. Mach. Learn. Res. 2017, 18, 1–38. [Google Scholar]