2.1. STAR Models and Deep Learning
Recently, owing to the flexibility of implementations such as TensorFlow (see Abadi et al. 2016), deep learning approaches can also be used to fit STAR models. Agarwal et al. (2021) discuss neural additive models. Such models use a sub-network to represent each input variable by a smooth function and connect the sub-network output nodes directly to the response Y, possibly transformed by a non-linear activation function; see Figure 1a for an illustration. A neural additive model thus allows the fitting of additive models in line with (2). To fit a STAR model in the general case of (1) through deep learning, we need to slightly generalize the structure of the neural additive model, allowing each sub-network to use as input all covariates of the corresponding component; see Figure 1b for a schematic example.
Related models are discussed in Rügamer et al. (2021) in an even more general context of a multivariate functional ξ, allowing the full conditional distribution of Y to be modeled.
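To make the structure concrete, the following is a minimal numpy sketch of a neural-additive-style model. It is not the architecture of Agarwal et al. (2021): instead of training full sub-networks end-to-end, each input variable is passed through a fixed random hidden layer, and only the output weights are fitted (jointly, by least squares). The prediction is nevertheless an intercept plus a sum of per-feature smooth components, which is the point of the additive architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data from an additive truth: y = sin(3*x1) + x2^2 + noise
n = 2000
X = rng.uniform(-1, 1, size=(n, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, n)

def subnet(x, n_hidden=32, seed=0):
    """Hidden layer of a per-feature sub-network (fixed random weights)."""
    r = np.random.default_rng(seed)
    w = r.normal(scale=3.0, size=n_hidden)
    b = r.uniform(-2, 2, size=n_hidden)
    return np.tanh(np.outer(x, w) + b)

# One sub-network per covariate; only the output weights are fitted (jointly),
# so the prediction is an intercept plus a sum of per-feature smooth functions.
H = np.column_stack([np.ones(n), subnet(X[:, 0], seed=1), subnet(X[:, 1], seed=2)])
beta, *_ = np.linalg.lstsq(H, y, rcond=None)

rmse = np.sqrt(np.mean((H @ beta - y) ** 2))
print("training RMSE:", round(rmse, 3))
```

Because the hidden layers of the two sub-networks do not share inputs, the fitted surface is additive in x1 and x2 by construction, mirroring Figure 1a.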
2.2. STAR Models and Gradient Boosting
However, it is less known that some modern implementations of gradient boosting are actually able to fit STAR models of the form (1) non-parametrically. The solution is based on imposing so-called feature interaction constraints, an idea that goes back to Lee et al. (2015). Their approach was implemented in XGBoost in 2018 and, on our request (Mayer 2020), in LightGBM in 2020 as well. Interaction constraints are specified as a collection of feature subsets S_1, …, S_m, where each S_j specifies a group of features that are allowed to interact. Algorithmically, the constraints are enforced by the following simple rule during tree growth: at each split, the set of candidate features is the union of those feature sets S_j that contain all split variables previously used in the current branch. Consequently, each tree branch uses features from one feature set S_j only. Its associated prediction on the scale of ξ contributes to the component f_j of an implicitly defined STAR model of the form (1), in which each model component f_j uses the feature subset S_j specified by the constraints.
An important type of constraint is a feature partition, i.e., a collection of disjoint feature sets S_1, …, S_m. Partitions include the special case of the collection of singletons, which would produce an additive model of the form (2). For partitions, by the above rule, the first split variable of a tree determines the feature set to be used throughout that tree.
Figure 2 illustrates a simple example of such a model; the corresponding constraints form a feature partition.
In the two programming languages R and Python, interaction constraints for XGBoost and LightGBM are specified as part of the parameter list passed to the corresponding train method. Table 1 shows how to specify the interaction_constraints parameter for an example with covariates A, B, C, and D and two interaction constraint sets corresponding to a model as in Figure 2.
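For illustration, the parameter lists might look as follows in Python. The concrete partition {A, B} and {C, D} is a hypothetical choice for this sketch (not necessarily the one in Table 1), and the feature columns are assumed to be ordered A, B, C, D:

```python
# Feature columns in order: A, B, C, D (indices 0, 1, 2, 3).
# Hypothetical partition: {A, B} may interact, and {C, D} may interact.

# XGBoost: the constraint is passed as a string of nested lists of
# feature indices inside the parameter dict given to xgboost.train().
xgb_params = {
    "objective": "reg:squarederror",
    "interaction_constraints": "[[0, 1], [2, 3]]",
}

# LightGBM: the analogous parameter is a list of lists of feature indices.
lgb_params = {
    "objective": "regression",
    "interaction_constraints": [[0, 1], [2, 3]],
}

# xgboost.train(xgb_params, dtrain, ...)   # not run here
# lightgbm.train(lgb_params, dtrain, ...)  # not run here
```

Apart from the syntax of the constraint itself, the training call is unchanged, which is what makes this route to STAR models so convenient.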
XGBoost and LightGBM offer different loss/objective functions. These specify the functional ξ to be modeled; see Table 2 for some of the possibilities.
Remark 1.
At the time of writing, and until at least XGBoost version 1.4.1.1, XGBoost respects only non-overlapping interaction constraint sets, i.e., partitions. LightGBM can also deal with overlapping constraint sets.
Boosted trees with interaction constraints support only non-linear components, unlike, e.g., deep learning and component-wise boosting, both of which allow a mix of linear and non-linear components. See Remark 4, as well as our two case studies, for a two-step procedure that would linearize some effects fitted by XGBoost or LightGBM, thus overcoming this limitation.
Feature pre-processing for boosted trees is simpler than for other modeling techniques. Missing values are allowed for most implementations, feature outliers are unproblematic, and some implementations (including LightGBM) can directly deal with unordered categorical variables. Furthermore, highly correlated features are only problematic for interpretation, not for model fitting.
Model-based boosting “mboost” with trees as building blocks is an alternative to using XGBoost or LightGBM with interaction constraints.
2.3. STAR Models and Supervised Dimension Reduction
Since STAR models (including GAMs with two-dimensional interaction surfaces) typically use only a small subset of (interacting) features per component, the keyword “dimension reduction” rarely appears in connection with this type of model. However, a strength of tree-based components (e.g., trees as base learners in model-based boosting “mboost” or the approach via interaction constraints) and deep learning is that some components can use even a large number of interacting features. Such components serve as one-dimensional representations of their features, conditional on the other features. The values of such components (or sometimes of sums of multiple components) might be used as derived features in another model or analysis. Thus, STAR models offer an effective way to perform (one-dimensional) supervised dimension reduction. Note that, by supervised, we mean that the dimension reduction procedure uses the response variable of the model. Some examples illustrate this.
Example 1.
- 1.
House price models with additive effects for structural characteristics and time (for maximal interpretability) and one multivariate component using all locational variables with complex interactions (for maximal predictive performance). The model equation could be of the form
ξ = f_1(x_1) + ⋯ + f_p(x_p) + f_loc(x_loc),    (4)
where x_1, …, x_p are the structural characteristics and the date of sale, and x_loc collects all locational variables. The component f_loc provides a one-dimensional representation of all locational variables. We will see an example of such a model in the Florida case study.
- 2.
This is similar to the first example, but adding the date of sale to the component with all locational variables, leading to a model with time-dependent location effects. The component depending on locational variables and time represents those variables by a one-dimensional function. Such a model will be shown in the Swiss case study below.
In the above examples, the feature subsets used by the components are non-overlapping, i.e., they form partitions. For a STAR model f of the general form (1), where features might appear in multiple components, we can extend the above idea of dimensionality reduction.
Definition 1 (Purely additive contributions and encoders).
Let S be a feature subset of interest and f_1, …, f_m the fitted additive components of a STAR model f as in Equation (1). The contribution of S to the predictor f is given by the partial predictor f_S(x) = ∑_{j ∈ J(S)} f_j(x_{S_j}), where J(S) = {j : S_j ∩ S ≠ ∅} denotes the index set of components using features in S. Furthermore, we call f_S a purely additive contribution or encoder of S if ∪_{j ∈ J(S)} S_j ⊆ S, i.e., if f_S depends solely on features in S. In this case, we say that S has a purely additive contribution or is encodable, and we write the encoder as f_S(x_S), where x_S are the values of the feature vector corresponding to S.
Remark 2.
From the above definition, it directly follows that the purely additive contribution f_S of a (possibly large) set of features S provides a supervised one-dimensional representation of the features in S, optimized for predictions on the scale of ξ and conditional on the effects of the other features.
For simplicity, we assume that each model component is irreducible, i.e., it uses only as many features as necessary. In particular, a component additive in its features would be represented by multiple components instead.
The encoder f_S of S is defined only up to an additive shift, i.e., f_S + c for any constant c is an encoder of S as well. If c is chosen so that the average value of the encoder is 0 on some reference dataset, e.g., the training dataset, we speak of a centered encoder.
Example 2. Consider the fitted STAR model f(x) = f_1(x_1) + f_2(x_2, x_3). Following the above definition, we can say:
- 1.
Let S = {x_1}. Then, f_1 is a purely additive contribution of S.
- 2.
Due to interaction effects, the singletons {x_2} and {x_3} are not encodable.
- 3.
Let S = {x_2, x_3} and f_S = f_2. Then, f_S is an encoder of S.
- 4.
The fitted model f is an encoder of the set of all features. This is true for STAR models in general.
Next, we consider a fitted STAR model in which the feature sets S_1, …, S_m form a partition. As a direct consequence of Definition 1, each feature set of a partition (and any union of such sets) is encodable. This property will implicitly be used in both of our case studies.
Depending on the implementation of the STAR algorithm, it might be possible to directly extract the encoder of a feature set S from the fitted model. A procedure that works independently of the implementation is described in the following Algorithm 1, which requires only access to the prediction function on the scale of ξ.
Algorithm 1: Encoder extraction.
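A minimal numpy sketch of such an extraction. The toy prediction function predict_xi and the helper extract_encoder are ours and only illustrate the idea: fix one reference observation, vary the features in S over a grid of values, predict on the scale of ξ, and optionally center the result.

```python
import numpy as np

# Stand-in for a fitted model's prediction function on the scale of xi,
# with partition components f_1(x1) = 2*x1 and f_2(x2, x3) = sin(x2)*x3:
def predict_xi(X):
    return 2.0 * X[:, 0] + np.sin(X[:, 1]) * X[:, 2]

def extract_encoder(predict, x_ref, S, grid, center=True):
    """Evaluate the encoder of feature subset S over `grid`.

    x_ref: one reference observation of shape (1, p); features outside S stay fixed.
    S: column indices of the feature subset.
    grid: array of shape (n_grid, len(S)) with values for the features in S.
    """
    X = np.repeat(x_ref, len(grid), axis=0)
    X[:, S] = grid
    values = predict(X)
    if center:
        values = values - values.mean()  # encoders are defined up to a shift
    return values

x_ref = np.array([[0.5, 1.0, -1.0]])
grid = np.linspace(-2.0, 2.0, 5).reshape(-1, 1)
enc = extract_encoder(predict_xi, x_ref, [0], grid)
print(enc)  # centered version of 2 * grid: [-4. -2.  0.  2.  4.]
```

Because {x1} is encodable here (the feature sets form a partition), the extracted values do not depend on the choice of the reference observation, up to the additive shift removed by centering.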
Remark 3 (Raw scores and boosting).
By default, predictions of XGBoost and LightGBM are on the scale of Y. If the functional ξ involves a link function (e.g., log or logit), one can obtain predictions on the scale of ξ via the argument outputmargin (XGBoost) or rawscore (LightGBM) of the corresponding predict method. This is relevant for the application of Algorithm 1 and also for the interpretation of effects, as in Section 2.4.
The values of a (possibly centered) encoder of a feature set S can be used as a one-dimensional representation of x_S for subsequent analyses, e.g., as a derived covariate in a simplified regression model. In the case studies below, we will see that this approach produces models with an excellent trade-off between interpretability and predictive strength.
Example 3. For instance, after fitting the STAR model (4) of Example 1 with XGBoost, we could extract the (centered) purely additive contribution of all location covariates using Algorithm 1 and calculate a subsequent linear regression in which the structural covariates enter linearly and the extracted encoder values of the locational variables enter as a single derived covariate. The main difference with the initial XGBoost model is that the effects of the building characteristics are now linear.
Remark 4 (Modeling strategy).
The workflow in the last example is in line with the following general modeling strategy. Groups of related features (for instance, a large set of locational variables) are sometimes difficult to represent in a linear regression model. How should the features be transformed? Which interactions are important? How should one deal with strong multicollinearity? These burdens can be delegated to an initial STAR model of suitable structure. The model components representing such feature groups are then extracted and plugged as high-level features into a subsequent linear regression model. This allows one, e.g., to linearize some additive effects of a fitted boosted trees model with interaction constraints.
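A numpy sketch of this two-step strategy under simplified assumptions: the fitted STAR model is replaced by an explicit stand-in function, its "location" component is extracted as a centered encoder, and the encoder values then enter a linear regression as a derived covariate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a fitted STAR model on the scale of xi: a non-linear "structure"
# effect of x1 plus a "location" component with interacting x2, x3:
def predict_xi(X):
    return np.sign(X[:, 0]) * np.sqrt(np.abs(X[:, 0])) + np.sin(X[:, 1]) * X[:, 2]

n = 500
X = rng.uniform(-1, 1, size=(n, 3))
y = predict_xi(X) + rng.normal(0, 0.05, n)

# Step 1: extract the centered encoder of the "location" subset {x2, x3} by
# fixing the remaining feature at a reference value (valid here because the
# feature sets form a partition):
X_fixed = X.copy()
X_fixed[:, 0] = 0.0
loc_encoder = predict_xi(X_fixed)
loc_encoder -= loc_encoder.mean()

# Step 2: plug the encoder into a linear regression with a linear effect of x1
# and the location encoder as a derived covariate:
design = np.column_stack([np.ones(n), X[:, 0], loc_encoder])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
print("coefficient of the location encoder:", round(beta[2], 2))
```

The non-linear effect of x1 is deliberately linearized in the second step, while the complex location effect survives intact inside the derived covariate; its regression coefficient is accordingly close to 1.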
Encoders are also helpful for interpreting effects in general STAR models, as will be explained in the next section.
2.4. Interpreting STAR Models
In this section, we provide some information on how to interpret the effects of features and feature sets in general STAR models (1), with a special focus on simple, fully transparent descriptions. These techniques will be used later in the case studies.
One of the most common techniques to describe the effect of a feature set S in an ML model is the partial dependence plot (PDP) introduced in Friedman (2001). It visualizes the average partial effect of S by taking the average of many individual conditional expectation (ICE) profiles (see Goldstein et al. 2015). The ICE profile for the ith observation with observed covariate vector x_i and feature set S is calculated by evaluating predictions over a sufficiently fine grid of values for the vector components corresponding to S, keeping all other components of x_i fixed. The stronger the interaction effects from other features, the less parallel the ICE profiles across multiple observations are. Thus, a visual test for additivity can be performed by plotting many ICE profiles and checking whether they are parallel (see Goldstein et al. 2015). Conversely, if all ICE profiles of S are parallel, S is represented by the model in an additive way. In that case, a single ICE profile (or the PDP) serves as a fully transparent description of the effect of S in the sense that it is clear how S acts on ξ globally for all observations, ceteris paribus.
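The visual test can also be mimicked numerically. The following numpy sketch (the toy models and the parallelism measure are ours) computes ICE profiles for an additive and an interacting model and measures how far the centered profiles deviate from being parallel.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(100, 2))
grid = np.linspace(-1, 1, 21)

def ice_profiles(predict, X, feature, grid):
    """One ICE profile per observation: vary `feature` over `grid`, rest fixed."""
    profiles = np.empty((len(X), len(grid)))
    for i, x in enumerate(X):
        Xi = np.tile(x, (len(grid), 1))
        Xi[:, feature] = grid
        profiles[i] = predict(Xi)
    return profiles

def additive(X):
    return np.sin(3 * X[:, 0]) + X[:, 1] ** 2

def interacting(X):
    return np.sin(3 * X[:, 0]) * X[:, 1]

for name, model in [("additive", additive), ("interacting", interacting)]:
    prof = ice_profiles(model, X, feature=0, grid=grid)
    centered = prof - prof.mean(axis=1, keepdims=True)
    # If the feature is represented additively, all centered profiles coincide
    # and the spread across observations is (numerically) zero:
    print(name, "max spread of centered ICE profiles:", centered.std(axis=0).max())
```

For the additive model, the spread is zero up to floating-point noise; for the interacting model, it is clearly positive, which is exactly the parallelism test described above.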
Studying ICE and PDP plots is not only interesting for interpreting feature effects in complex ML models. Along with other general-purpose tools from the field of explainable ML (see, e.g., Biecek and Burzykowski 2021; Molnar 2019), they can also be used to interpret models of restricted complexity, such as linear regression models, GAMs, or STAR models. There, they serve as (feature-centric) alternatives to partial residual plots (Wood 2017), which are frequently used to visualize effects of single model components. Note that, up to a vertical shift, partial residual plots coincide with ICE curves for features that appear in only one (single-feature) component.
In the context of STAR models, it is easy to see from Definition 1 that, for a feature set S with a purely additive contribution, the ICE profiles of S are parallel across all observations, and their values correspond to the values of the encoder evaluated over the domain of x_S (up to an additive constant). This includes centered and uncentered encoders as extracted by Algorithm 1. In fact, Algorithm 1 differs from the calculation of ICE profiles only in technical details.
Thus, the effects of an encodable feature set S can be described in a simple yet transparent way by, e.g., showing one ICE profile or the PDP, or by evaluating its encoder over the domain of x_S. This is a major advantage of STAR models over unstructured ML models, where transparent descriptions of feature effects are unrealistic due to complex high-order interactions.
However, since it is difficult to visualize ICE profiles, PDPs, or encoders of more than two features, this concept is limited from a practical perspective to feature sets with only one or two features.
Remark 5.
To benefit from additivity, effects of features in STAR models are typically interpreted on the scale of ξ.
ICE profiles and the PDP can be used to interpret effects of non-encodable feature sets as well. Due to interaction effects, however, a single ICE profile or a PDP cannot give a complete picture of such an effect.
We have mentioned that describing multivariable ICE profiles, PDPs, or encoders of more than two features is difficult in practice. However, depending on the modeling situation, it is not uncommon that a possibly large, encodable feature set S represents a low-dimensional feature set S′ through a mapping x_S = m(x_{S′}). Here, x_S and x_{S′} denote the feature vectors corresponding to the feature sets S and S′. In this case, the ICE profile, PDP, or encoder can be evaluated over values of x_{S′} to provide, again, a fully transparent description of the effects of S and S′. Thus, instead of using the encoder with the high-dimensional x_S, we would use the equivalent encoder with the low-dimensional x_{S′} to describe the effects of feature set S (or S′), even if we cannot exactly describe the effects of single features in S.
We have seen three such situations in Example 1: the full set of location variables can represent low-dimensional features, such as:
- The administrative unit (a one-dimensional feature).
- The address (also a one-dimensional feature).
- Latitude/longitude (a two-dimensional feature).
Similarly, time-dependent location effects could represent the two features “administrative unit” and “time”.
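A toy numpy sketch of this projection trick (the setup and names are ours): ten one-hot location dummies represent a single "administrative unit" feature, and the encoder of the ten-dimensional dummy set is evaluated over the ten unit values instead of over the dummy space.

```python
import numpy as np

rng = np.random.default_rng(3)

# Ten administrative units, each represented by ten one-hot location dummies:
n_units = 10
unit_effect = rng.normal(0, 1, n_units)  # toy location effect per unit

def location_encoder(X_loc):
    """Stand-in encoder of the high-dimensional location dummies."""
    return X_loc @ unit_effect

def m(units):
    """Mapping from the low-dimensional feature 'unit' to the dummy vectors."""
    return np.eye(n_units)[units]

# Evaluate the encoder over the ten unit values instead of the
# ten-dimensional dummy space:
units = np.arange(n_units)
enc = location_encoder(m(units))
print(enc.shape)  # one encoder value per administrative unit: (10,)
```

The resulting table of one encoder value per unit is a fully transparent description of the location effect, even though the effects of the single dummies were never interpreted individually.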
We use this projection trick in both of our case studies in
Section 3 and
Section 4 to describe multivariate location effects.