Draw from the posterior predictive distribution:

```
y.obs <- data$y
z <- data$z
## posterior predictive draws of each unit's counterfactual outcome
y.cf.samples <- extract(fit, sample = "test", value = "ppd")
## combine observed outcomes with counterfactual draws to obtain
## draws of both potential outcomes
y.1.samples <- z * y.obs + (1 - z) * y.cf.samples
y.0.samples <- (1 - z) * y.obs + z * y.cf.samples
ite.samples <- y.1.samples - y.0.samples
## average over individuals within each posterior draw
## (observations in rows, posterior draws in columns)
sate.samples <- colMeans(ite.samples)
sate <- mean(sate.samples)
sate.lb <- sate - 1.96 * sd(sate.samples)
sate.ub <- sate + 1.96 * sd(sate.samples)
```
To obtain intervals and estimates for effects on the treated population, subset the individual effect matrices prior to averaging across rows.
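For example, continuing from the code above (a sketch that assumes `ite.samples` holds one row per observation and one column per posterior draw, and that `z` is the treatment indicator):

```
## keep only rows for treated units before averaging
satt.samples <- colMeans(ite.samples[z == 1, , drop = FALSE])
satt <- mean(satt.samples)
satt.lb <- satt - 1.96 * sd(satt.samples)
satt.ub <- satt + 1.96 * sd(satt.samples)
```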

### *5.3. Fixed vs. Random Effects*

It is worth noting that we assume our causal assumptions have not changed from above. That is, the grouping variables are not acting as confounders; they impact only the error structure of the data generating process (henceforth, DGP). Of course, in practice, in any given setting, it is always possible that ignorability would not be satisfied solely given the other covariates but would be satisfied when conditioning on the grouping variable as well. In that case it might be helpful to include the grouping variable as a fixed effect, since the random effects assumption would not be expected to hold, and conditioning on group-level fixed effects allows one to control for any unmeasured group-level confounders. In the most likely scenario that ignorability is not satisfied even conditional on the grouping variable—that is, there are unmeasured individual-level confounders—a random effects specification tends to be a reasonable compromise between ignoring the group-level structure entirely and using fixed effects, as fixed effects can act as bias-amplifying covariates [56,57].

### **6. Simulation Design**

We designed a set of simulations to better understand the properties of stan4bart relative to close alternatives that either (1) make parametric assumptions or (2) cannot explicitly accommodate more general error structures. This section outlines our simulation design, which has the general goal of mimicking a realistic data structure.

### *6.1. Original IHDP Simulation*

The basic structure of our simulation mimics the simulation structure developed by Hill [1] in the paper that first introduced machine learning for causal inference. This simulation used data from a randomized experiment called the Infant Health and Development Program (IHDP; [58,59]) conducted in the 1980s to understand whether intensive childcare in the first few years of life could have a positive impact on the development of children who were born low-birth-weight and premature.

This study randomized roughly one third of the 985 participating families to participate in the IHDP intervention. Participants were eligible for intensive, high-quality child care and home visits from a trained provider during the first three years of infancy. A subset of the covariates collected during the baseline phase of that study and used frequently in subsequent evaluations of the IHDP program were included as the covariates for that simulation. Thus, the simulation reflected the actual distributions for and associations among covariates found naturally in existing data. The simulation covariates comprised six continuous, nine binary, and two unordered categorical variables reflecting child measurements at birth, the mother's sociodemographic characteristics at the time of birth, behaviors engaged in during pregnancy, and indicators for the study site.

To construct an observational study for the simulation, a hypothetical treatment assignment was induced by removing a nonrandom portion of the originally randomized treatment group, those children born to nonwhite mothers. This destroyed the independence between the originally randomized treatment assignment and the covariates but maintained the common support for the new treatment group. By simulating outcomes for the remaining sample with a mean structure that was a function solely of the treatment and covariates, ignorability was satisfied by construction.

To explore the ability of BART to flexibly fit nonlinear response surfaces, three different DGPs were used to generate potential outcomes. Response surface A was linear for both E[*Y*(0) | *X* = *x*] and E[*Y*(1) | *X* = *x*] and had a constant treatment effect. Response surface B created heterogeneous treatment effects by keeping the model for E[*Y*(0) | *X* = *x*] linear but allowing the model for E[*Y*(1) | *X* = *x*] to be nonlinear by exponentiating a linear combination of the covariates. Response surface C created heterogeneous treatment effects by including a variety of squared terms and interactions.

In the original paper [1], this simulation was used to demonstrate the superior performance of BART for causal inference relative to linear regression and a generic implementation of propensity score methods. Since Hill [1] was published, testing grounds have been developed that allow for comparisons between BART and propensity score methods in which the propensity score methods were carefully curated by expert methodologists; these comparisons have also shown the superior performance of BART [15]. While that paper also explored performance in settings where the common support assumption was violated, the current study restricted attention to scenarios where common support was satisfied, to allow space for exploring features of the data specific to multilevel settings.

### *6.2. Extensions to the Original IHDP Simulation*

This section details how we extended the original IHDP simulation to allow for a group structure and explore other features of the DGP.

#### 6.2.1. Adding Group Structure to the Response Surfaces

We wanted to create a grouped structure that would mimic those features of a grouped data structure that exist naturally. Therefore, we repurposed two variables that were used as covariates in the original IHDP simulation and treated them as grouping variables in the current simulation. The first of these was the collection of eight indicators for the study site (a blocking variable in the original IHDP experiment). The other was a variable representing the mothers' age at birth (treated as continuous in the original simulation) which had 26 levels.

These groups were incorporated into the response surface in two different ways. The **varying intercepts** setting generated data from the respective response surface

$$\begin{array}{rcl} Y_i(0) \mid \lambda^{\text{int}}_{g[i]}, \epsilon_i^0 &=& h^0(\boldsymbol{x}_i) + \lambda^{\text{int}}_{g[i]} + \epsilon_i^0,\\ Y_i(1) \mid \lambda^{\text{int}}_{g[i]}, \epsilon_i^1 &=& h^1(\boldsymbol{x}_i) + \lambda^{\text{int}}_{g[i]} + \tau^* + \epsilon_i^1,\\ \lambda^{\text{int}}_{g} &\sim& \mathsf{N}(0, \sigma_{\lambda^{\text{int}}}),\\ \epsilon_i^0 &\sim& \mathsf{N}(0, \sigma_0),\\ \epsilon_i^1 &\sim& \mathsf{N}(0, \sigma_1), \end{array}$$

where *h*<sup>*z*</sup>(***x***<sub>*i*</sub>) reflects the function of the covariates specific to the given potential outcome and to response surface A, B, or C. *τ*<sup>∗</sup> appears only in the model for *Y*(1) and represents the constant treatment effect when *h*<sup>0</sup>(***x***<sub>*i*</sub>) = *h*<sup>1</sup>(***x***<sub>*i*</sub>), as in response surface A. In response surfaces B and C these are not equal, and thus heterogeneous treatment effects that vary with the levels of the covariates are induced. The asterisk is meant to remind the reader that *τ*<sup>∗</sup> should not necessarily be interpreted as a constant or average treatment effect. *λ*<sup>int</sup><sub>*g*</sub> is the varying intercept that corresponds to the grouping variable in question.
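As an illustrative sketch of this DGP (hypothetical names throughout: `h0` and `h1` stand in for the response-surface functions, `X` for the covariate matrix, and the sample size and variance parameters are assumptions):

```
n.groups <- 8
g <- sample(n.groups, n, replace = TRUE)            # group membership g[i]
lambda.int <- rnorm(n.groups, 0, sigma.lambda.int)  # varying intercepts
y0 <- h0(X) + lambda.int[g] + rnorm(n, 0, sigma0)
y1 <- h1(X) + lambda.int[g] + tau.star + rnorm(n, 0, sigma1)
```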

In contrast, the **varying intercepts and slopes** setting generated data from an augmented version of the above

$$\begin{array}{rcl} Y_i(0) \mid \lambda^{\text{int}}_{g[i]}, \epsilon_i^0 &=& h^0(\boldsymbol{x}_i) + \lambda^{\text{int}}_{g[i]} + \epsilon_i^0,\\ Y_i(1) \mid \lambda^{\text{int}}_{g[i]}, \lambda^{\text{slo}}_{g[i]}, \epsilon_i^1 &=& h^1(\boldsymbol{x}_i) + \lambda^{\text{int}}_{g[i]} + \lambda^{\text{slo}}_{g[i]} + \epsilon_i^1,\\ \lambda^{\text{int}}_{g} &\sim& \mathsf{N}(0, \sigma_{\lambda^{\text{int}}}),\\ \lambda^{\text{slo}}_{g} &\sim& \mathsf{N}(0, \sigma_{\lambda^{\text{slo}}}),\\ \epsilon_i^0 &\sim& \mathsf{N}(0, \sigma_0),\\ \epsilon_i^1 &\sim& \mathsf{N}(0, \sigma_1). \end{array}$$

This specification allowed the model for *Y*(1) to include the term *λ*<sup>slo</sup><sub>*g*[*i*]</sub> rather than the *τ*<sup>∗</sup> of the varying intercepts specification, so that treatment effects could vary explicitly by group according to a distribution of varying slopes.

The choice of grouping variable and whether or not the varying slopes were included in the DGP represented two distinct simulation knobs, each with two levels. Combined with the three response surfaces discussed above, this created 12 different settings within which to evaluate performance.

#### 6.2.2. Additional Simulation Knobs Explored

We also explored the variation in performance across two settings that are not represented in the results in the next section, for the sake of parsimonious exposition. First, we assessed the variation in performance based on the size of the treatment effect. Expressed in units standardized by the standard deviation of the outcome, the effect sizes we examined were 0, 0.2, 0.5, and 0.8. We found no difference in results across these choices. We also tested differences in performance based on intraclass correlation values of 0.2, 0.333, and 0.5, and again found no difference in results across these choices.

### *6.3. Methods Compared*

We compared the performance of a variety of methods in an attempt to understand the advantages of combining flexible modeling with the ability to explicitly incorporate more complicated grouped error structures.

#### 6.3.1. Linear Models

We fit several linear models to the data. **Linear full pool** is a linear regression that ignores the groups entirely. **Linear f.e.** is a linear regression with fixed effects included for the grouping variables; this represents our no-pooling option. **Linear v.i.** is a linear regression with varying intercepts. **Linear v.i.s** is a linear regression with varying intercepts and varying slopes, where the slopes in question are the coefficients on the treatment variable. Each of the last two was fit using the stan\_lmer function in rstanarm.
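For instance, the **Linear v.i.s** specification might look like the following (a sketch, not the exact code used in the study; `x1` and `x2` are placeholder covariates, `z` is the treatment, and `g` is the grouping variable):

```
library(rstanarm)

## varying intercepts and varying treatment (z) slopes by group
fit.vis <- stan_lmer(y ~ z + x1 + x2 + (1 + z | g), data = data)
```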

Given that the group-level estimands were one of our areas of focus, it seemed unfair not to include versions of the above that more explicitly targeted these estimands. We included two additional models with this in mind. **LinearX f.e.** is a standard linear regression that includes both fixed effects and interactions between the fixed effects and the treatment variable. **LinearX v.i.s.** is an implementation of the stan\_lmer function that allows for both varying intercepts and varying treatment effects.

#### 6.3.2. BART-Based Models

We also fit several different versions of BART models. **vanilla BART** uses a traditional BART specification similar to that used in Hill [1] but specifically omitting the grouping variables and including the propensity score as a covariate. **BART f.e.** extends this basic implementation by adding fixed effects for the grouping variables. **BART v.i.** is a BART implementation that allows for varying intercepts through the rbart\_vi function in dbarts. All BART implementations included a propensity score, as suggested by Hahn et al. [7]. The propensity score was itself estimated with BART, using a hyperprior on the end-node variance that makes it extremely unlikely to take on small values and thus overfit, essentially guarding against the problems induced by the originally proposed implementation [37]. Finally, we also implemented Bayesian causal forests, which we denote **vanilla BCF** and **BCF f.e.**
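A sketch of the **BART v.i.** setup (hypothetical variable names; `ps` denotes the estimated propensity score entering as an ordinary covariate alongside placeholder covariates `x1` and `x2`):

```
library(dbarts)

## BART with varying intercepts by the grouping variable g
fit.bvi <- rbart_vi(y ~ z + x1 + x2 + ps, data = data, group.by = data$g)
```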

#### 6.3.3. stan4bart Implementations

We implemented two different versions of stan4bart. The simpler version, **stan4bart v.i.**, allows for varying intercepts. The slightly more complicated version, **stan4bart v.i.s.**, allows for varying intercepts and slopes.

To fit **stan4bart v.i.**, models with varying intercepts were specified as:

```
fit <- stan4bart(
  y ~ bart(. - g) + (1 | g),
  train = data,
  test = data.test
)
```
Fitting **stan4bart v.i.s.** allowed for variation in both intercepts and slopes and was specified as:

```
fit <- stan4bart(
  y ~ bart(. - g) + (1 + z | g),
  train = data,
  test = data.test
)
```
### **7. Simulation Results**

We compared methods based on performance with respect to several criteria for each of our targeted estimands. We present the results for each estimand in turn.

### *7.1. SATT*

We evaluated the performance with respect to SATT for each of our methods across the six different settings by focusing on the root-mean-square error (RMSE), average interval length, and coverage. The RMSE and interval length were standardized by the standard deviation of the outcome variable so that the absolute size of each measure was more meaningful.

Figure 1 displays the results of our simulations for each method with respect to SATT as measured by RMSE (y-axis) and the average interval length using six plots. Rows correspond to response surfaces (A, B, or C) and columns to the metric displayed (RMSE or interval length). The results specific to the choice of grouping variable (group 1 or group 2) are displayed on each plot with different shapes (triangle or circle, respectively). The grouping structure is represented by whether the plotted shape is hollow (varying intercept) or filled (varying intercept and slope).

**Figure 1.** Results of our simulations for each method with respect to SATT as measured by RMSE (**left** panel) and average interval length (**right** panel). Each row corresponds to one of the three response surfaces (A, B, or C). Shapes are used to represent one of two grouping structures, triangles are for results from grouping structure 1, and circles for results from grouping structure 2. Hollow shapes represent results from DGPs with random intercepts and solid shapes represent results from DGPs with random intercepts and random slopes.

The results for the linear response surface (A) demonstrate strong performance overall from all methods with regard to RMSE, with the possible exceptions of the vanilla BART and BCF implementations and the pooled linear regressions in the group 2 version of the DGPs. Given the simplicity of the response surface, these results are not surprising: the only complexity is the grouping structure. The differences across methods are more apparent in the average interval lengths. Here, the linear models that allow for variation (in either intercepts or slopes) have the shortest intervals, followed by the BCF and stan4bart methods. These are followed by BCF with fixed effects and linear regression, and then the BART methods with fixed effects and varying effects. Not surprisingly, the BART implementations that completely ignore the grouping variables perform the worst overall on this metric. One odd result is the linear model with varying slopes, which performs reasonably well with regard to interval length for the first grouping variable but much worse for the second. We suspect this has to do with the fact that, while the group 1 version has more levels, the correlation structure of group 2 is more complex. The effect of different correlation structures on the performance of different methods is beyond the scope of this paper but is an issue that could be explored in future simulation studies.

The ordering with regard to performance changes for some methods once we move to the results for the nonlinear response surfaces in the second and third rows. These are more challenging for all of the methods (note the change in the y-axis) but particularly for those that have strict linear parametric requirements. The strongest consistent performers with regard to RMSE are the stan4bart methods, **BCF f.e.**, **BART v.i.**, and **BART f.e.**. The versions of BCF and BART that ignore the group structure perform fine in the setting with the first grouping variable (triangles) but less well with the second (circles). The linear models perform the worst. The best performers with regard to the average interval length are again the flexible fitters with an edge once again for the stan4bart and BCF methods.

The performance with regard to interval length for response surfaces B and C highlights the differences between the stan4bart methods and **BCF f.e.** relative to vanilla BCF (with just slightly longer intervals) and the BART methods with grouping structure. The linear methods trail behind, with **LinearX v.i.s.** demonstrating by far the longest intervals. Vanilla BART has longer intervals than the best linear models for response surface B and slightly longer ones for response surface C.

A shorter interval length is only an asset, however, if nominal coverage is achieved. Figure 2 displays the coverage results for the top contenders across our 12 settings. In addition, the plots include the average interval length across grouping settings for each response surface as part of the method label on the x-axis. These plots indicate that the stan4bart methods seem to strike the best balance between having a low RMSE and shorter intervals while still maintaining nominal coverage. The BCF methods which performed similarly to the stan4bart methods with regard to the RMSE and interval length struggled a bit more to achieve nominal coverage, particularly for response surface B.

### *7.2. GSATT*

The results for the group-level ATTs are more complicated because we have many more estimands to consider (one for each group). Thus, we organized the plots to display the RMSE and interval length results on separate plots. Since there was virtually no distinction in the results between the two grouping settings—varying intercept versus varying intercept and slope—we elected to collapse those results. Instead, we broke out our group 1 and group 2 results into separate sets of plots (top and bottom panels).

Figure 3 displays the RMSE for each method (x-axis) and group-level estimand across the six settings defined by the response surface (columns) and grouping variable (rows). The performance for each method is displayed in its own column with separate points for each estimand (group-level ATT). The average RMSE across estimands for each method is displayed next to the label for its name for each response surface (collapsed across settings defined by grouping variable). Across all of the response surfaces, the stan4bart methods perform the best, followed very closely by **BCF f.e.** and then the other BART-based methods. The linear methods perform noticeably worse in all settings, but in particular when the response surface is nonlinear (B) and additionally when the treatment effects are heterogeneous by covariate values (C).

**Figure 2.** Percentage of 95% intervals that covered the true SATT for each of the top-performing methods. Results are presented separately by settings defined by response surface (columns A, B, C) and multilevel structure (rows: varying intercepts or varying intercepts and slopes). Results from settings defined by grouping variable are displayed on the same plot with different symbols. Labels on the x-axis additionally provide the average interval length (across both grouping settings).

Figure 4 displays the average interval length for each method (*x*-axis) across the six settings defined by the response surface (columns) and grouping variables (rows). The performance for each method is displayed in its own column with separate points for each estimand (group-level ATT). The results that achieved nominal coverage for a given estimand are displayed with solid rather than open circles for each group-level estimand.

The average coverage for each method and response surface combination (collapsed across other sources of variability) is displayed next to the name of each method. For response surface A, the linear methods with varying intercepts and slopes have the shortest intervals; however, their coverage with respect to the group estimands is quite poor, averaging 41% and 44%. The interval length for these methods increases with the more complicated response surfaces and, in the scenarios with the first grouping variable, is more variable across group estimands. **LinearX f.e.** performs the worst in terms of interval length but has better coverage properties across the board.

The other methods perform reasonably similarly with regard to the distribution of interval lengths across group-level estimands; however, the stan4bart implementations and **BCF f.e.** are also able to maintain the best coverage. **stan4bart v.i.s** is the only method that achieves nominal average coverage across all three response surfaces and **vanilla BCF** performs the worst in this regard with an average coverage dipping to 80% for response surface B.

Figure 5 displays the coverage percentages separately for each combination of method, grouping variable, and estimand and is thus capable of revealing greater distinctions across methods that looked similar in the previous plot. With one exception, the stan4bart methods demonstrate the least variability in coverage rates across groups. **vanilla BCF** has the greatest variability in coverage among the flexible models. **LinearX f.e.** is unable to provide reasonable coverage in the setting with the second grouping variable; however, it performs far better with respect to the coverage of group-level estimands than the other linear methods for the setting defined by the first grouping variable.

**Figure 3.** RMSE results for the group-level estimands across methods. Each plot corresponds to a setting defined by grouping variable (row) and response surface (column). Results are collapsed across the settings defined by varying intercepts versus varying intercepts and slopes. Average RMSE across these settings and across estimands are displayed numerically next to the name of each method, separately for each response surface. Estimands that were covered by a 95% interval produced by the method were filled in rather than left hollow.

### *7.3. iCATEs*

We evaluated the ability of each method to estimate the CATE for each combination of covariate values that manifested in each sample as the iCATEs. To compare performance, we used the metric proposed in Hill [1], the precision in estimation of heterogeneous effects measure, or PEHE. This was calculated within each dataset for a given method as the square root of the average of the squared differences between the estimate of the iCATE and the true iCATE for each person.
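In symbols, writing $\hat{\tau}_i$ for the estimated and $\tau_i$ for the true iCATE of individual *i* in a sample of size *n*, this is

$$\text{PEHE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{\tau}_i - \tau_i\right)^2}.$$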

Figure 6 displays the PEHE results for each of the methods across the six settings defined by response surface and multilevel setting (varying intercepts versus varying intercepts and slopes). Results are collapsed across the DGPs defined by the grouping variable.

For the linear response surface A, which has a constant treatment effect, all of the methods perform similarly, which is not surprising given the ease of the task. The only method that performs noticeably worse is the linear model with fixed effects interacted with the treatment, likely because it is overfitting. The landscape changes for the nonlinear response surfaces, where the top-performing methods are the flexible models, with the strongest performance demonstrated by the stan4bart methods, BCF with fixed effects, and BART with varying intercepts.

**Figure 4.** Interval length results for the group-level estimands across methods. Each plot corresponds to a setting defined by grouping variable (row) and response surface (column). Results are collapsed across the settings defined by varying intercepts versus varying intercepts and slopes. Average coverage across these settings and across estimands are displayed numerically next to the name of each method, separately for each response surface. Within each vertical panel the methods are ordered by average interval length across both grouping variable settings and estimands. Estimands that were covered by a 95% interval produced by the method were filled in rather than left hollow.

**Figure 5.** Coverage rates for each method with respect to each of the group-level estimands. Plots vary by settings defined by grouping variable (rows) and response surface (columns A, B, and C) and are collapsed across grouping scenarios (varying intercept versus varying intercept and slope).

**Figure 6.** PEHE results for each of the methods across the six settings defined by response surface (columns A, B, and C) and multilevel setting (rows corresponding to varying intercepts versus varying intercepts and slopes). Results are collapsed across the DGPs defined by the grouping variable.

### **8. Discussion**

The goal of this work was to develop a method that could extend the BART framework for the flexible fitting of response surfaces to accommodate more complex error structures. We evaluated the utility of this approach by assessing performance in a causal inference context that allowed for varying intercepts or varying intercepts and slopes. For one of our three response surfaces, this heterogeneity was in addition to the heterogeneity in treatment effects that was a systematic (nonrandom) function of observed confounders.

Our results indicated that the stan4bart models provided superior performance compared with both methods that fit flexibly but do not allow for a more complicated error structure and methods that explicitly accommodate a grouped error structure but assume a linear parametric mean structure. Throughout, BCF was a strong competitor on all performance measures even though it did not explicitly accommodate the error structure.

We evaluated stan4bart in a causal setting, which is generally more challenging than standard prediction settings. Given its strong performance in this challenging setting, we recommend the use of stan4bart both in causal and noncausal settings. More broadly, we hope that stan4bart will be a jumping-off point for the further development of methods that aim to marry flexible mean structures with parametric approaches to either the mean structure or the grouped error structure.

**Author Contributions:** Conceptualization, J.L.H., V.D. and B.G.; methodology, V.D.; software, V.D. and G.P.; validation, V.D. and G.P.; formal analysis of properties of estimators, G.P.; investigation, G.P.; resources, J.L.H. and B.G.; data curation, G.P.; writing—original draft preparation, J.L.H., V.D., B.G. and G.P.; writing—review and editing, J.L.H., V.D., B.G. and G.P.; visualization, G.P.; supervision, J.L.H.; project administration, J.L.H. and G.P.; funding acquisition, J.L.H. and B.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Office of Naval Research grant number N00014-17-1-2141, the Institute of Education Sciences grant number R305D200019, and the National Science Foundation grant numbers 2051246 and 2153019.

**Institutional Review Board Statement:** Not applicable.

**Data Availability Statement:** The data can be found here: https://github.com/gperrett/stan4bartstudy (accessed on 14 August 2022).

**Conflicts of Interest:** The authors declare no conflict of interest.
