Precise Tensor Product Smoothing via Spectral Splines

Helwig, Nathaniel E.

doi:10.3390/stats7010003

Open AccessArticle

Precise Tensor Product Smoothing via Spectral Splines

by

Nathaniel E. Helwig

^1,2

¹

Department of Psychology, University of Minnesota, 75 E River Road, Minneapolis, MN 55455, USA

²

School of Statistics, University of Minnesota, 224 Church Street SE, Minneapolis, MN 55455, USA

Stats 2024, 7(1), 34-53; https://doi.org/10.3390/stats7010003

Submission received: 9 December 2023 / Revised: 27 December 2023 / Accepted: 3 January 2024 / Published: 10 January 2024

(This article belongs to the Special Issue Novel Semiparametric Methods)

Download

Browse Figures

Versions Notes

Abstract

:

Tensor product smoothers are frequently used to include interaction effects in multiple nonparametric regression models. Current implementations of tensor product smoothers either require using approximate penalties, such as those typically used in generalized additive models, or costly parameterizations, such as those used in smoothing spline analysis of variance models. In this paper, I propose a computationally efficient and theoretically precise approach for tensor product smoothing. Specifically, I propose a spectral representation of a univariate smoothing spline basis, and I develop an efficient approach for building tensor product smooths from marginal spectral spline representations. The developed theory suggests that current tensor product smoothing methods could be improved by incorporating the proposed tensor product spectral smoothers. Simulation results demonstrate that the proposed approach can outperform popular tensor product smoothing implementations, which supports the theoretical results developed in the paper.

Keywords:

generalized additive model; linear mixed model; multiple nonparametric regression; penalized spline; smoothing spline analysis of variance

1. Introduction

Consider a multiple nonparametric regression model [1] of the form

Y = f (X) + ϵ

(1)

where

Y \in R

is the observed response variable,

X = {(X_{1}, \dots, X_{p})}^{⊤} \in X

is the observed predictor vector,

X = X_{(1)} \times \dots \times X_{(p)}

is the product domain with

X_{(j)}

denoting the domain of the j-th predictor,

f : X \to R

is the (unknown) real-valued function connecting the response and predictors, and

ϵ

is an error term that satisfies

E (ϵ) = 0

and

E (ϵ^{2}) = σ^{2} < \infty

. Note that this implies that

E (Y | X) = f (X)

, i.e., the function

f (\cdot)

is the conditional expectation of the response variable Y given the predictor vector

X

. Given a sample of training data, the goal is to estimate the unknown mean function f without having any a priori information about the parametric nature of the functional relationship (e.g., without assuming linearity).

Let

{(x_{i}, y_{i})}_{i = 1}^{n}

denote a sample of n independent observations from the model in Equation (1), where

y_{i} \in R

is the i-th observation’s realization of the response variable, and

x_{i} = {(x_{i 1}, \dots, x_{i p})}^{⊤} \in X

is the i-th observation’s realization of the predictor vector. To estimate f, it is typical to minimize a penalized least squares functional of the form

\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - f (x_{i}))}^{2} + λ P (f)

(2)

where

P (f) \geq 0

denotes some non-negative penalty that describes the complexity of f, i.e., if

P (f) > P (g)

, then the function f is more complex (less smooth) than the function g, and the tuning parameter

λ \geq 0

controls the influence of the penalty. To find a reasonable balance between fitting (the data) and smoothing (the function),

λ

is often chosen via cross-validation, information theory, or maximum likelihood estimation [2].

When the penalty P is a semi-norm in a (tensor product) reproducing kernel Hilbert space (RKHS), the minimizer of Equation (2) is referred to as a (tensor product) smoothing spline [1,3,4,5,6,7]. Note that (tensor product) smoothing splines are used within multiple nonparametric regression frameworks, such as generalized additive models (GAMs) [7,8] and smoothing spline analysis of variance (SSANOVA) models [3,5,6]. Such methods have proven powerful for nonparametric (multivariate) function estimation for a variety of different types of data, such as oceanography [9], social media [10], clinical biomechanics [11], self-esteem development [12], smile perception [13], clinical neuroimaging [14], psychiatry [15], and demography [16].

To find the

{\hat{f}}_{λ}

that minimizes Equation (2), it is first necessary to specify the assumed model form. For example, with

p = 2

predictors, we could consider one of two forms:

\begin{matrix} additive : & f (X) = f_{0} + f_{1} (X_{1}) + f_{2} (X_{2}) \\ interactive : & f (X) = f_{0} + f_{1} (X_{1}) + f_{2} (X_{2}) + f_{12} (X_{1}, X_{2}) \end{matrix}

where

X = {(X_{1}, X_{2})}^{⊤} \in X = X_{(1)} \times X_{(2)}

is the bidimensional predictor,

f_{0} \in R

is an intercept,

f_{1} : X_{(1)} \to R

is the main effect of the first predictor,

f_{2} : X_{(2)} \to R

is the main effect of the second predictor, and

f_{12} : X \to R

is the two-way interaction effect. Note that these models are nested given that the additive model is equivalent to the interaction model if

f_{12} = 0

.

For additive models,

f_{j}

is typically represented by a spline basis of rank

r_{j}

, such as

f_{j} (X_{j}) = Z_{j}^{⊤} β_{j}

. Note that

Z_{j} = {(Z_{1}^{(j)} (X_{j}), \dots, Z_{r_{j}}^{(j)} (X_{j}))}^{⊤} \in R^{r_{j}}

denotes the known spline basis vector that depends on the chosen knots (later described), and

β_{j} \in R^{r_{j}}

is the unknown coefficient vector. To define the complexity of each (additive) effect, it is typical to consider penalties of the form

P_{j} (f_{j}) = β_{j}^{⊤} Q_{j} β_{j}

, where

Q_{j}

is a semi-positive definite matrix. Using these representations of the function evaluation and penalty, Equation (2) can be written as

\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - f_{0} - \sum_{j = 1}^{p} z_{i j}^{⊤} β_{j})}^{2} + \sum_{j = 1}^{p} λ_{j} β_{j}^{⊤} Q_{j} β_{j}

(3)

where

z_{i j} = {(Z_{1}^{(j)} (x_{i j}), \dots, Z_{r_{j}}^{(j)} (x_{i j}))}^{⊤}

is the i-th observation’s realization of the

Z_{j}

vector, and

λ_{j} \geq 0

are tuning parameters that control the influence of each penalty.

When the model contains interaction effects, different approaches can be used to represent and penalize the interaction terms. In GAMs, it is typical to (i) represent interaction effects by taking an outer (Kronecker) product of marginal basis vectors, and (ii) penalize interaction effects using an equidistant (grid) approximation (see [7] (pp. 227–237)). In SSANOVA models, it is typical to represent and penalize interaction effects using a tensor product RK function (see [5] (pp. 40–48)). For a thorough comparison of the two approaches, see Helwig [1]. In both frameworks, estimation of interaction effects can be costly when using a moderate to large number of knots, which is true even when using scalable parameterizations and algorithms [9,17,18]. This is because efficient computational tools for exact tensor product function representation and penalization are lacking from the literature, which hinders the widespread application of tensor product smoothing splines.

To address this practical issue, this paper (i) proposes a spectral representer theorem for univariate smoothing spline estimators, and (ii) develops efficient computational strategies for constructing tensor product smoothing splines from marginal spectral representations. The marginal spectral spline representation that I propose is similar to that proposed by [19]; however, the version that I consider penalizes all of the non-constant functions of each predictor. The tensor product basis construction approach that I propose generally follows the idea proposed by [20], where tensor products are built from outer (Kronecker) products of marginal bases. However, unlike this approach, I leverage reproducing kernel theory to develop exact analytical penalties for tensor product smooth terms. The proposed approach makes it possible to fit tensor product smoothing spline models (a) with interaction effects between any combination of predictors, and (b) using any linear mixed modeling software.

The remainder of this paper is organized as follows: Section 2 provides background on the reproducing kernel Hilbert space theory relevant to univariate smoothing splines; Section 3 provides background on the tensor product smoothing splines; Section 4 proposes an alternative tensor product smoothing spline (like) framework that penalizes all non-constant functions of the predictor; Section 5 develops the spectral representer theories necessary for efficiently computing exact tensor product penalties; Section 6 conducts a simulation study to compare the proposed approach to an existing (comparable) method; Section 7 demonstrates the proposed approach using a real dataset; and Section 8 discusses potential extensions of the proposed approach.

2. Smoothing Spline Foundations

2.1. Reproducing Kernel Hilbert Spaces

Consider a single predictor (i.e.,

p = 1

) that satisfies

x \in X

, and let

H

denote a RKHS of functions on

X

. The unknown function f from Equation (2) is assumed to be an element of

H

, which will be denoted by

f \in H

. Suppose that the space

H

can be decomposed into two orthogonal subspaces, such as

H = H_{0} \oplus H_{1}

, where ⊕ denotes the tensor summation. Note that

H_{0} = {f : P (f) = 0, f \in H}

is the null space, which contains all functions (in

H

) that have zero penalty, and

H_{1} = {f : P (f) > 0, f \in H}

is the contrast space, which contains all functions (in

H

) that have a non-zero penalty. In some cases, the null space can be further decomposed, such as

H_{0} = H_{00} \oplus H_{01}

, where

H_{00} = {f : f \in H_{0}, f (x) \propto 1 \forall x \in X}

is a space of constant functions (intercept), and

H_{01} = {f : f \in H_{0}, f (x)

1 \forall x \in X}

is a space of non-constant functions (unpenalized). For example, when using a cubic smoothing spline,

H_{01}

contains the linear effect of X, which is unpenalized.

The inner product of

H

will be denoted by

〈 f, g 〉

for any

f, g \in H

, and the corresponding norm will be written as

∥ f ∥ = \sqrt{〈 f, f 〉}

for any

f \in H

. Given the tensor sum decomposition of

H

, the inner product can be written as a summation of the corresponding subspaces’ inner products, such as

〈 f, g 〉 = {〈 f, g 〉}_{0} + {〈 f, g 〉}_{1}

. Note that

{〈 f, g 〉}_{0}

is the null space inner product for any

f, g \in H_{0}

, and

{〈 f, g 〉}_{1}

is the contrast space inner product for any

f, g \in H_{1}

. The corresponding norms will be denoted by

{∥ f ∥}_{0} = \sqrt{{〈 f, f 〉}_{0}}

(norm of

H_{0}

) and

{∥ f ∥}_{1} = \sqrt{{〈 f, f 〉}_{1}}

(norm of

H_{1}

). When the null space consists of non-constant functions, i.e, when

H_{0} = H_{00} \oplus H_{01}

, the null space inner product can be written as

{〈 f, g 〉}_{0} = {〈 f, g 〉}_{00} + {〈 f, g 〉}_{01}

, where

{〈 f, g 〉}_{00}

is the inner product of

H_{00}

for any

f, g \in H_{00}

, and

{〈 f, g 〉}_{01}

is the inner product of

H_{01}

for any

f, g \in H_{01}

. The corresponding norm can be written as

{∥ f ∥}_{0} = \sqrt{{∥ f ∥}_{00}^{2} + {∥ f ∥}_{01}^{2}}

, where

{∥ f ∥}_{00} = \sqrt{{〈 f, f 〉}_{00}}

and

{∥ f ∥}_{01} = \sqrt{{〈 f, f 〉}_{01}}

denote the norms of

H_{00}

and

H_{01}

, respectively.

The RK of

H

will be denoted by

R (x, z) = R_{z} (x) = R_{x} (z)

for any

x, z \in X

. Note that the RK is an element of the RKHS, i.e.,

R \in H

for any

x, z \in X

. By definition, the RK is the representer of the evaluation functional in

H

, which implies that the RK satisfies

f (x) = 〈 R_{x} (z), f (z) 〉

for any

f \in H

and any

x, z \in X

. This important property, which is referred to as the “reproducing property” of the (reproducing) kernel function, implies that any function in

H

can be evaluated through the inner product and RK function. Following the decompositions of the inner product, the RK function can be written as

R (x, z) = R_{0} (x, z) + R_{1} (x, z)

, where

R_{0} \in H_{0}

and

R_{1} \in H_{1}

denotes the RKs of

H_{0}

and

H_{1}

, respectively. Furthermore, when

H_{0} = H_{00} \oplus H_{01}

, the null space RK can be decomposed such as

R_{0} (x, z) = R_{00} (x, z) + R_{01} (x, z)

, where

R_{00} \in H_{00}

and

R_{01} \in H_{01}

denotes the RKs of

H_{00}

and

H_{01}

, respectively. By definition,

R_{00} (x, z) = β_{0}

for all

x, z \in X

, where

β_{0} \in R

is some constant.

The tensor sum decomposition

H = H_{0} \oplus H_{1}

implies that any function

f \in H

can be written as a summation of two components, such as

f (x) = f_{0} (x) + f_{1} (x)

where

f_{0} \in H_{0}

is the null space contribution and

f_{1} \in H_{1}

is the contrast space contribution. Furthermore, when

H_{0} = H_{00} \oplus H_{01}

, the null space component can be further decomposed into its constant and non-constant contributions, such as

f_{0} (x) = f_{00} (x) + f_{01} (x)

, where

f_{00} (x) \propto 1

for all

x \in X

. Let

P_{0}

denote the projection operator for the null space, such that

P_{0} f = f_{0}

for any

f \in H

. Similarly, let

P_{1}

denote the projection operator for the contrast space, such that

P_{1} f = f_{1}

for any

f \in H

. Note that

f_{0} \in H_{0}

is referred to as the “parametric component” of f, given that

H_{0}

is a finite dimensional subspace. In contrast,

f_{1} \in H_{1}

is the “nonparametric component” of f, given that

H_{1}

is an infinite dimensional subspace.

2.2. Representer Theorem

Still consider a single predictor (i.e.,

p = 1

) that satisfies

x \in X

with

H = H_{0} \oplus H_{1}

denoting a RKHS of functions on

X

. Now, suppose that the penalty functional in Equation (2) is defined to be the squared norm of the function’s projection into the contrast space, i.e.,

P (f) = ∥ P_{1} {f ∥}^{2} = {∥ f_{1} ∥}_{1}^{2}

. Note that the second equality is due to the fact that

∥ f_{1} ∥_{0}^{2} = 0

for any

f_{1} \in H_{1}

, which is a consequence of the orthogonality of

H_{0}

and

H_{1}

. More specifically, given

{(x_{i}, y_{i})}_{i = 1}^{n}

, consider the problem of finding the function

{\hat{f}}_{λ} = \underset{f \in H}{arg min} [\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - f (x_{i}))}^{2} + λ {∥ P_{1} f ∥}^{2}]

(4)

where

P_{1}

is the projection operator for the contrast space

H_{1}

. Note that the solution is subscripted with

λ

to emphasize the dependence on the tuning parameter.

Suppose that the null space has dimension

m \geq 1

. Note that

m = 1

when

H_{0}

only consists of the constant (intercept) subspace, whereas

m \geq 2

when

H_{0} = H_{00} \oplus H_{01}

. Let

{N_{0}, N_{1}, \dots, N_{m - 1}}

denote a basis for the null space

H_{0}

, such that any

f_{0} \in H_{0}

can be written as

f_{0} (x) = \sum_{j = 0}^{m - 1} β_{j} N_{j} (x)

for some coefficient vector

β = {(β_{0}, \dots, β_{m - 1})}^{⊤} \in R^{m}

. The representer theorem of Kimeldorf and Wahba [21] reveals that the optimal smoothing spline estimator from Equation (4) has the form

f_{λ} (x) = \sum_{j = 0}^{m - 1} β_{j} N_{j} (x) + \sum_{i = 1}^{n} α_{i} R_{1} (x, x_{i})

(5)

where

R_{1} \in H_{1}

is the RK of the contrast space, and

α = {(α_{1}, \dots, α_{n})}^{⊤} \in R^{n}

is the coefficient vector that combines the training data RK evaluations.

The representer theorem in Equation (5) reveals that the smoothing spline estimator can be written as

f_{λ} (x) = f_{0 λ} (x) + f_{1 λ} (x)

, where

f_{0 λ} (x) = \sum_{j = 0}^{m - 1} β_{j} N_{j} (x)

is the null space contribution and

f_{1 λ} (x) = \sum_{i = 1}^{n} α_{i} R_{1} (x, x_{i})

is the contrast space contribution. Using the optimal representation from Equation (5), the penalty has the form

\begin{matrix} ∥ P_{1} f_{λ} ∥^{2} & = ∥ \sum_{i = 1}^{n} α_{i} R_{1} (x, x_{i}) ∥_{1}^{2} \\ = \sum_{i = 1}^{n} \sum_{j = 1}^{n} α_{i} α_{j} {〈 R_{1} (x, x_{i}), R_{1} (x, x_{j}) 〉}_{1} \\ = α^{⊤} Q α \end{matrix}

(6)

where

Q = [R_{1} (x_{i}, x_{i^{'}})]

evaluates the RK function at all combinations of

i, i^{'} \in {1, \dots, n}

. Note that the first line is due to the fact that

P_{1} f_{λ} = f_{1 λ}

for any

f_{λ} \in H

, the second line is due to the bilinear nature of the inner product, and the third line is due to the reproducing property of the RK function.

2.3. Scalable Computation

The optimal solution given by the representer theorem in Equation (5) uses all training data points to represent

f_{1 λ}

, which could be computationally costly when n is large. For more scalable computation, it is typical to approximate

f_{1 λ}

by evaluating the contrast space RK at all combinations of

r < n

knots, which are typically placed at the quantiles of the training data predictor scores. Using this type of (low-rank) smoothing spline approximation, the approximation to the representer theorem becomes

f_{λ} (x) \approx \sum_{j = 0}^{m - 1} β_{j} N_{j} (x) + \sum_{ℓ = 1}^{r} α_{ℓ} R_{1} (x, x_{ℓ}^{*})

(7)

where

{x_{ℓ}^{*}}_{ℓ = 1}^{r}

are the chosen knots. As long as enough knots are used in the representation, the approximate representer theorem in Equation (7) can produce theoretically optimal function estimates [22,23]. For optimal asymptotic properties, the number of knots should be on the order of

r ≍ O (n^{2 / (4 δ + 1)})

, where

δ \in [1, 2]

depends on the smoothness of the unknown true function. Note that

δ = 1

is necessary when

P (f) < \infty

is barely satisfied, whereas

δ = 2

can be used when f is sufficiently smooth (see [5,22,23]).

Using the approximate representer theorem in Equation (7), the penalized least squares functional from Equation (4) becomes a penalized least squares problem of the form

[\begin{matrix} {\hat{β}}_{λ} \\ {\hat{α}}_{λ} \end{matrix}] = \underset{β \in R^{m}, α \in R^{n}}{arg min} [\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - N_{i}^{⊤} β - R_{i}^{⊤} α)}^{2} + λ α^{⊤} Q α]

(8)

where

N_{i} = {(N_{0} (x_{i}), \dots, N_{m - 1} (x_{i}))}^{⊤}

is the i-th observation’s null space basis function vector, and

R_{i} = {(R_{1} (x_{i}, x_{1}^{*}), \dots, R_{1} (x_{i}, x_{r}^{*}))}^{⊤}

is the i-th observation’s contrast space basis function vector. Note that

Q = [R_{1} (x_{k}^{*}, x_{ℓ}^{*})]

evaluates the contrast space RK at all combinations of knots. Given a choice of the smoothing parameter

λ

, the solution has the form

[\begin{matrix} {\hat{β}}_{λ} \\ {\hat{α}}_{λ} \end{matrix}] = {[\begin{matrix} N^{⊤} N & N^{⊤} R \\ R^{⊤} N & R^{⊤} R + n λ Q \end{matrix}]}^{†} [\begin{matrix} N^{⊤} \\ R^{⊤} \end{matrix}] y

(9)

where

N = {(N_{1}, \dots, N_{n})}^{⊤}

is the null space design matrix with

N_{i}

as rows,

R = {(R_{1}, \dots, R_{n})}^{⊤}

is the contrast space design matrix with

R_{i}

as rows,

y = {(y_{1}, \dots, y_{n})}^{⊤}

is the response vector, and

{(\cdot)}^{†}

denotes the Moore–Penrose pseudoinverse [24,25].

3. Tensor Product Smoothing

3.1. Marginal Function Space Notation

Now, consider the multiple nonparametric regression model in Equation (1), where

X = {(X_{1}, \dots, X_{p})}^{⊤} \in X

is the observed predictor vector. Note that

X = X_{(1)} \times \dots \times X_{(p)}

is the product domain with

X_{(j)}

denoting the domain of

X_{j}

. Following the discussion from Section 2.1, let

H_{(j)}

denote a RKHS of functions on

X_{(j)}

for

j = 1, \dots, p

. Suppose that the complexity (i.e., lack of smoothness) for each predictor’s marginal RKHS is defined according to some non-negative penalty functional

P_{j}

. This implies that each RKHS can be decomposed such as

H_{(j)} = H_{0 (j)} \oplus H_{1 (j)}

, where

H_{0 (j)} = {f : P_{j} (f) = 0, f \in H_{(j)}}

is the j-th predictor’s null space, which contains all functions (in

H_{(j)}

) that have zero penalty, and

H_{1 (j)} = {f : P_{j} (f) > 0, f \in H_{(j)}}

is the j-th predictor’s contrast space, which contains all functions (in

H_{(j)}

) that have a non-zero penalty. When relevant, the j-th predictor’s null space can be further decomposed such as

H_{0 (j)} = H_{00 (j)} \oplus H_{01 (j)}

, where

H_{00 (j)}

is a constant (intercept) subspace, and

H_{01 (j)}

contains non-constant functions that are unpenalized.

The inner product of

H_{(j)}

will be denoted by

{〈 f, g 〉}_{(j)}

for any

f, g \in H_{(j)}

, and the corresponding norm will be written as

{∥ f ∥}_{(j)} = \sqrt{{〈 f, f 〉}_{(j)}}

for any

f \in H_{(j)}

. Each inner product can be decomposed into its null and contrast contributions, such as

{〈 f, g 〉}_{(j)} = {〈 f, g 〉}_{0 (j)} + {〈 f, g 〉}_{1 (j)}

, and the corresponding norms will be denoted by

{∥ f ∥}_{0 (j)} = \sqrt{{〈 f, f 〉}_{0 (j)}}

(norm of

H_{0 (j)}

) and

{∥ f ∥}_{1 (j)} = \sqrt{{〈 f, f 〉}_{1 (j)}}

(norm of

H_{1 (j)}

). When

H_{0 (j)} = H_{00 (j)} \oplus H_{01 (j)}

, the null space inner product can be written as

{〈 f, g 〉}_{0 (j)} = {〈 f, g 〉}_{00 (j)} + {〈 f, g 〉}_{01 (j)}

, where

{〈 f, g 〉}_{00 (j)}

is the inner product of

H_{00 (j)}

for any

f, g \in H_{00 (j)}

, and

{〈 f, g 〉}_{01 (j)}

is the inner product of

H_{01 (j)}

for any

f, g \in H_{01 (j)}

. The corresponding norm can be written as

{∥ f ∥}_{0 (j)} = \sqrt{{∥ f ∥}_{00 (j)}^{2} + {∥ f ∥}_{01 (j)}^{2}}

, where

{∥ f ∥}_{00 (j)} = \sqrt{{〈 f, f 〉}_{00 (j)}}

and

{∥ f ∥}_{01 (j)} = \sqrt{{〈 f, f 〉}_{01 (j)}}

denote the norms of

H_{00 (j)}

and

H_{01 (j)}

, respectively.

The RK of

H_{(j)}

will be denoted by

R_{(j)} (x, z)

for any

x, z \in X_{(j)}

, and note that the RK is an element of the j-th predictor’s RKHS, i.e.,

R_{(j)} \in H_{(j)}

for any

x, z \in X_{(j)}

. By definition, the RK is the representer of the evaluation functional in

H_{(j)}

, which implies that the RK satisfies

f (x) = 〈 R_{(j)} (x, z), f (z) 〉

for any

f \in H_{(j)}

and any

x, z \in X_{(j)}

. Note that

R_{(j)} (x, z) = R_{0 (j)} (x, z) + R_{1 (j)} (x, z)

, where where

R_{0 (j)} \in H_{0 (j)}

is the null space RK and

R_{1 (j)} \in H_{1 (j)}

is the contrast space RK. Furthermore, when

H_{0 (j)} = H_{00 (j)} \oplus H_{01 (j)}

, the null space RK can be decomposed such as

R_{0 (j)} (x, z) = R_{00 (j)} (x, z) + R_{01 (j)} (x, z)

, where

R_{00 (j)} \in H_{00 (j)}

and

R_{01 (j)} \in H_{01 (j)}

denotes the RKs of

H_{00 (j)}

and

H_{01 (j)}

, respectively. Note that

R_{00 (j)} (x, z) = β_{0 (j)}

for all

x, z \in X_{(j)}

, where

β_{0 (j)} \in R

is some constant, given that

H_{00 (j)}

is assumed to be a constant (intercept) subspace for all p predictors.

3.2. Tensor Product Function Spaces

Consider the construction of a tensor product function space

H

that is formed by combining the marginal spaces

{H_{(1)}, \dots, H_{(p)}}

. The largest space that could be constructed includes all possible main and interaction effects, such as

\begin{matrix} H & = H_{(1)} \otimes \dots \otimes H_{(p)} \\ = H_{{0}} \oplus H_{{1}} \oplus \dots \oplus H_{{p}} \end{matrix}

(10)

where

H_{{0}} = {f : f (X) \propto 1 \forall X \in X}

is the tensor product constant (intercept) space, and each

H_{{j}}

consists of

(\binom{p}{j})

orthogonal subspaces that capture different main and/or interaction effects of the predictors. For example,

H_{{1}} = \oplus_{j = 1}^{p} H_{(j)}

consists of p main effect subspaces,

H_{{2}} = \oplus_{k = 2}^{p} \oplus_{j = 1}^{k - 1} H_{(j)} \otimes H_{(k)}

consists of

(\binom{p}{2}) = \frac{p (p - 1)}{2}

two-way interaction effect subspaces, etc. Note that different (more parsimonious) statistical models can be formed by excluding subspaces from the tensor product RKHS defined in Equation (10). For example, the tensor product space corresponding to the additive model has the form

H = H_{{0}} \oplus H_{{1}}

. For the model that includes all main effects and two-way interactions, the tensor product RKHS has the form

H = H_{{0}} \oplus H_{{1}} \oplus H_{{2}}

.

Let

X = {(X_{1}, \dots, X_{p})}^{⊤} \in X

and

Z = {(Z_{1}, \dots, Z_{p})}^{⊤} \in X

denote two arbitrary predictor vectors. To evaluate functions in

H

, the tensor product RK can be defined as

\begin{matrix} R (X, Z) & = \prod_{j = 1}^{p} R_{(j)} (X_{j}, Z_{j}) \\ = R_{{0}} (X, Z) + R_{{1}} (X, Z) + \dots + R_{{p}} (X, Z) \end{matrix}

(11)

where

R_{{0}} (X, Z) = 1

is the constant (intercept) term, and each

R_{{j}}

consists of a summation of

(\binom{p}{j})

RKs from orthogonal subspaces that capture different main and/or interaction effects of the predictors. For example,

R_{{1}} = \sum_{j = 1}^{p} R_{(j)}

consists of p main effect RKs,

R_{{2}} = \sum_{k = 2}^{p} \sum_{j = 1}^{k - 1} R_{(j)} R_{(k)}

consists of

(\binom{p}{2}) = \frac{p (p - 1)}{2}

two-way interaction effect RKs, etc. When different (more parsimonious) models are formed by excluding subspaces of the tensor product RKHS, the corresponding components of the tensor product RK are also excluded. For example, the tensor product RK corresponding to the additive model has the form

R = R_{{0}} + R_{{1}}

, and the tensor product RK for the model that includes all main effects and two-way interactions has the form

R = R_{{0}} + R_{{1}} + R_{{2}}

.

The inner product of the tensor product RKHS

H

can be written as

〈 f, g 〉 = {〈 f, g 〉}_{{0}} + {〈 f, g 〉}_{{1}} + \dots + {〈 f, g 〉}_{{p}}

(12)

where

{〈 f, g 〉}_{{0}}

is the inner product of

H_{{0}}

, and

{〈 f, g 〉}_{{j}}

consists of a summation of

(\binom{p}{j})

inner products corresponding to orthogonal subspaces that capture different main and/or interaction effects of the predictors. For example,

{〈 f, g 〉}_{{1}}

consists of the summation of p main effect inner products, and

{〈 f, g 〉}_{{2}}

consists of the summation of

\frac{p (p - 1)}{2}

two-way interaction effect inner products. The specifics of each subspace’s inner product will depend on the type of spline used for each predictor. This is because each subspace’s inner product (and, consequently, penalty) aggregates information across the penalized components after “averaging out” information from unpenalized components (see [5] (pp. 40–48)).

3.3. Representation and Computation

Given an assumed model form, the tensor product RKHS can be written as

H = H_{0}^{★} \oplus H_{1} \oplus \dots \oplus H_{K}

(13)

where

H_{0}^{★} = {f : P (f) = 0, f \in H}

is the tensor product null space with

P (\cdot)

denoting the tensor product penalty (later defined), and

H_{k}

is the k-th orthogonal subspace of the tensor product contrast space

H_{1}^{★} = H_{1} \oplus \dots \oplus H_{K}

. Note that

H_{k}

corresponds to the different main and/or interaction effect subspaces that are included in the assumed model form. The corresponding inner product and RK can be written as

\begin{matrix} 〈 f, g 〉 & = {〈 f, g 〉}_{0}^{★} + \sum_{k = 1}^{K} θ_{k}^{- 1} {〈 f, g 〉}_{k} \\ R (X, Z) & = R_{0}^{★} (X, Z) + \sum_{k = 1}^{K} θ_{k} R_{k} (X, Z) \end{matrix}

(14)

where

{〈 f, g 〉}_{0}^{★}

and

R_{0}^{★} (X, Z)

denote the inner product and RK of

H_{0}^{★}

(the tensor product null space),

{〈 f, g 〉}_{k}

and

R_{k} (X, Z)

denote the inner product and RK of

H_{k}

for

k = 1, \dots, K

, and the

θ_{k} > 0

are additional non-negative tuning parameters that control the influence of each subspace’s contribution. Note that including the

θ_{k}

parameters is essential given that the different subspaces do not necessarily have comparable metrics.

Suppose that the tensor product penalty

P (f)

is defined to be the squared norm of the function’s projection into the (tensor product) contrast space, i.e.,

P (f) = {〈 f, f 〉}_{1}^{★} = \sum_{k = 1}^{K} θ_{k}^{- 1} {∥ f ∥}_{k}^{2}

(15)

where

{∥ f ∥}_{k} = \sqrt{{〈 f, f 〉}_{k}}

is the norm for

H_{k}

(the k-th orthogonal subspace of

H_{1}^{★}

). Using this definition of the penalty, the function minimizing the penalized least squares functional in Equation (2) can be written according to the representer theorem in Equation (5). In this case, the set of functions

{N_{0}, N_{1}, \dots, N_{m - 1}}

forms a basis for the tensor product null space

H_{0}^{★}

, and the RK of the contrast space is defined as

R_{1}^{★} = \sum_{k = 1}^{K} θ_{k} R_{k}

. Using this optimal representation, the penalty can be written according to Equation (6) with the penalty matrix defined as

Q = \sum_{k = 1}^{K} θ_{k} Q_{k}

where

Q_{k} = [R_{k} (x_{i}, x_{i^{'}})]

evaluates the k-th subspace’s RK function at all combinations of training data points.

For scalable computation as n becomes large, the approximate representer theorem in Equation (7) can be applied using the knots

{x_{ℓ}^{*}}_{ℓ = 1}^{r}

, where

x_{ℓ}^{*} = {(x_{ℓ 1}^{*}, \dots, x_{ℓ p}^{*})}^{⊤} \in X

for all

ℓ = 1, \dots, r

. Using the approximately optimal representation from Equation (7), the penalized least squares problem can be written according to Equation (8), and the optimal coefficients can be written according to Equation (9). In the tensor product case, the optimal coefficients should really be subscripted with

λ = (λ, θ_{1}, \dots, θ_{K})

, given that these estimates depend on the overall tuning parameter

λ

, as well as the K tuning (hyper)parameters for each of the contrast subspaces. Note that the penalty only depends on

(λ_{1}, \dots, λ_{K})

where

λ_{k} = λ / θ_{k}

for

k = 1, \dots, K

. However, it is often helpful (for tuning purposes) to separate the overall tuning parameter

λ

from the tuning parameters that control the individual effect functions, i.e., the

θ_{k}

tuning parameters.

4. Refined Tensor Product Smoothing

4.1. Smoothing Spline Like Estimators

Consider a single predictor (i.e.,

p = 1

) that satisfies

x \in X

, and let

H

denote a RKHS of functions on

X

. Consider a decomposition of the function space such as

H = H_{00} \oplus H_{11}

, where

H_{11} = H_{01} \oplus H_{1}

is a space of non-constant functions that either sum to zero (for categorical x) or integrate to zero (for continuous x) across the domain

X

. The inner product of

H

can be written as

〈 f, g 〉 = {〈 f, g 〉}_{00} + {〈 f, g 〉}_{11}

for any

f, g \in H

, where

{〈 f, g 〉}_{11} = {〈 f, g 〉}_{01} + {〈 f, g 〉}_{1}

is the inner product of

H_{11}

. The corresponding RK can be written as

R (x, z) = R_{00} (x, z) + R_{11} (x, z)

for any

x, z \in X

, where

R_{11} = R_{01} + R_{1}

is the RK for

H_{11}

. Given a sample of n observations

{(x_{i}, y_{i})}_{i = 1}^{n}

, consider finding the

f \in H

that satisfies

{\hat{f}}_{λ} = \underset{f \in H}{arg min} \{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - f (x_{i}))}^{2} + λ {∥ f ∥}_{11}^{2}\}

(16)

where

{∥ f ∥}_{11}^{2} = {〈 f, f 〉}_{11}

is the squared norm of the projection of f into

H_{11}

. The

{\hat{f}}_{λ}

defined in Equation (16) is a smoothing spline if

H_{0} = H_{00}

, which will be the case for nominal, ordinal, and linear smoothing splines. However, for cubic (and higher-order) smoothing splines, the

H_{01}

subspace consists of non-constant lower-order polynomial terms, which are unpenalized. Note that the

{\hat{f}}_{λ}

in Equation (16) penalizes all non-constant terms, so it will not be equivalent to a cubic smoothing spline—even when

H = {f : \int | f^{2} (x) |^{2} d x < \infty, \forall x \in X}

is the same RKHS used for cubic smoothing spline estimation.

Theorem 1

(Representer Theorem). The

f \in H

that minimizes Equation (16) has the form

f_{λ} (x) = β + \sum_{i = 1}^{n} α_{i} R_{11} (x, x_{i})

where

β \in R

is an intercept parameter and

α = {(α_{1}, \dots, α_{n})}^{⊤} \in R^{n}

is a vector of coefficients that combine the reproducing kernel function evaluations.

Proof.

The theorem is simply a version of the representer theorem from Equation (5) where the null space has dimension one. □

Corollary 1

(Low-Rank Approximation). The function

f \in H

that minimizes Equation (16) can be well-approximated via

f_{λ} (x) \approx β + \sum_{ℓ = 1}^{r} α_{ℓ}^{*} R_{11} (x, x_{ℓ}^{*})

where

{x_{ℓ}^{*}}_{ℓ = 1}^{r}

are the selected knots with

r ≍ O (n^{2 / (4 δ + 1)})

for some

δ \in [1, 2]

.

Proof.

The corollary is simply a version of the approximate representer theorem from Equation (7) where the null space has dimension one. □

These results imply that the penalized least squares functional from Equation (16) can be rewritten as the penalized least squares problem in Equation (8) where (i) the null space only contains the intercept column, i.e.,

N_{i} = 1

and

β = β

, and (ii) the contrast space RK

R_{1}

is replaced by

R_{11}

in the function and penalty representation, i.e.,

R_{i} = {(R_{11} (x, x_{1}^{*}), \dots, R_{11} (x, x_{r}^{*}))}^{⊤}

and

Q = [R_{11} (x_{ℓ}^{*}, x_{ℓ^{'}}^{*})]

. Using these modifications the optimal coefficients can be written according to Equation (9).

4.2. Tensor Product Formulation

Now, consider the model in Equation (1) with

p \geq 2

predictors. Given an assumed model form, the tensor product RKHS

H

can be written according to the tensor sum decomposition in Equation (13) with

H_{0}^{★} = {f : f (X) \propto 1 \forall X \in X}

denoting the constant (intercept) subspace. Similarly, the inner product and RK of

H

can be written according to Equation (14), and the tensor product penalty can be written according to Equation (15). Unlike the previous tensor product treatment, this tensor product formulation assumes that

H_{0}^{★}

contains only the constant (intercept) subspace, which implies that the

H_{k}

subspaces contain all non-constant functions of the predictors. Furthermore, this implies that the proposed formulation of the tensor product penalty in Equation (15) penalizes all non-constant functions of the predictors. Note that if all p predictors have a null space dimension of one, i.e., if

H_{0 (j)} = H_{00 (j)}

for all

j = 1, \dots, p

, then the proposed formulation will be equivalent to the classic formulation. However, if

H_{01 (j)}

exists for any predictor, then the proposed formulation will differ from the classic formulation because the functions in

H_{01 (j)}

will be penalized using the proposed formulation.

Given a sample of n observations

{(x_{i}, y_{i})}_{i = 1}^{n}

with

x_{i} = (x_{i 1}, \dots, x_{i p}) \in X

and

y_{i} \in R

, consider the problem of finding the function

f \in H

that satisfies

{\hat{f}}_{λ} = \underset{f \in H}{arg min} \{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - f (x_{i}))}^{2} + λ \sum_{k = 1}^{K} ω_{k} {∥ f ∥}_{k}^{2}\}

(17)

where the

ω_{k} \geq 0

are additional tuning parameters (penalty weights) that control the influence of each component function’s penalty contribution.

Theorem 2

(Tensor Product Representer Theorem). The minimizer of Equation (17) has the form

f_{λ} = \sum_{k = 0}^{K} f_{k λ}

, where

f_{0 λ} \in R

is an intercept, and

f_{k λ} \in H_{k}

is the k-th effect function for

k = 1, \dots, K

. The optimal effect functions can be expressed as

f_{k λ} (x) = \sum_{i = 1}^{n} α_{i k} R_{k} (x, x_{i})

for all

x \in X

, where the coefficient vector

α_{k} = {(α_{1 k}, \dots, α_{n k})}^{⊤} \in R^{n}

depends on the chosen hyperparameters (i.e., λ and

ω_{k}

) for

k = 1, \dots, K

.

Proof.

The result in Theorem 2 can be considered a generalization of the typical result used in tensor product smoothing spline estimators (see [3,5]). More specifically, the SSANOVA approach assumes that the function can be represented according to the form in Theorem 2 with the coefficients defined as

α_{i k} = α_{i} θ_{k}

, where the vector

α = {(α_{1}, \dots, α_{n})}^{⊤}

is common to all K terms. □

Compared to the tensor product representation used in the SSANOVA modeling approach, the proposed approach combines the marginal RK information in a more flexible manner, such as

\begin{matrix} SSANOVA : & f_{λ} (x) = f_{0 λ} + \sum_{i = 1}^{n} \sum_{k = 1}^{K} α_{i} θ_{k} R_{k} (x, x_{i}) \\ Proposed : & f_{λ} (x) = f_{0 λ} + \sum_{i = 1}^{n} \sum_{k = 1}^{K} α_{i k} R_{k} (x, x_{i}) \end{matrix}

Clearly, the two representations are equivalent when

α_{i k} = α_{i} θ_{k}

for all

i \in {1, \dots, n}

and all

k \in {1, \dots, K}

. However, such a constraint is not necessary in practice. At first glance, it may appear that the proposed approach has made the estimation problem more challenging, given that the number of parameters has increased from

n + K

to

n K

. However, for estimation and inference purposes, it is beneficial to allow each term to have unique coefficients, given that this makes is possible to treat the tuning parameters as variance components in a linear mixed effects modeling framework [2,26,27].

4.3. Scalable Computation

The tensor product representer theorem in Theorem 2 is computationally costly for large n and/or K, given that it requires estimation of

n K

coefficients. For more practical computation, it is possible to apply knot-based approximations in a tensor product function space, as described in the following corollary.

Corollary 2

(Tensor Product Low-Rank Approximation). The minimizer of Equation (17) has the form

f_{λ} = \sum_{k = 0}^{K} f_{k λ}

, and the effect functions can be approximated via

f_{k λ} (x) \approx \sum_{ℓ = 1}^{r_{k}} α_{ℓ k} R_{k} (x, x_{ℓ k}^{*})

where

{x_{ℓ k}^{*}}_{ℓ = 1}^{r_{k}}

are the selected knots for the k-th effect with

x_{ℓ k}^{*} \in X_{k} \subset X \forall ℓ, k

.

The proposed representation also allows for more flexible knot placement within each of the K subspaces of the tensor product contrast space. In particular, each of the K contrast subspaces is permitted to have a different number of knots

r_{k}

using this formulation. Furthermore, note that

x_{ℓ k}^{*}

only needs to contain knot values for the predictors that are included in the k-th effect, e.g.,

x_{ℓ k}^{*}

is a scalar for main effects, a vector of length two for two-way interactions, etc. For main effects, it is typical to place the knots at the (univariate) data quantiles for each predictor. For two-way interactions, many different knot placement strategies are possible, e.g., fixed grid, random sample, bivariate quantiles, strategic placement, etc. In this paper, I only consider multivariate knot placements that involve taking combinations of univariate knots (as in [20]), but my ideas are easily applicable to other knot placement schemes.

Theorem 3

(Tensor Product Penalties). Suppose that

K \geq p

and

H_{k}

captures the k-th predictor’s main effect for

k = 1, \dots, p

. Given any

x = (x_{1}, \dots, x_{p}) \in X

, the k-th basis vector is defined as

R_{k} = (R_{11 (k)} (x_{k}, x_{1 k}^{*}), \dots, R_{11 (k)} {(x_{k}, x_{r_{k} k}^{*})}^{⊤}

and the k-th penalty matrix is

Q_{k} = [R_{11 (k)} (x_{ℓ k}^{*}, x_{ℓ^{'} k}^{*})]

, where

R_{11 (k)} = R_{01 (k)} + R_{1 (k)}

is the non-constant portion of each predictor’s marginal RK function for

k = 1, \dots, p

. Now, suppose that

H_{k}

(for some

k > p

) captures the interaction effect between

X_{a}

and

X_{b}

for some

a, b \in {1, \dots, p}

. If the basis vector is defined as

R_{k} = R_{a} \tilde{\otimes} R_{b}

, where

\tilde{\otimes}

denotes the Kronecker product, then the penalty matrix has the form

Q_{k} = Q_{a} \tilde{\otimes} Q_{b}

. Now, suppose that

H_{k}

(for some

k > p

) captures the three-way interaction between

(X_{a}, X_{b}, X_{c})

for some

a, b, c \in {1, \dots, p}

. If the basis vector is defined as

R_{k} = R_{a} \tilde{\otimes} R_{b} \tilde{\otimes} R_{c}

, then the penalty matrix has the form

Q_{k} = Q_{a} \tilde{\otimes} Q_{b} \tilde{\otimes} Q_{c}

. Basis vectors and penalty matrices for higher-order interactions can be efficiently constructed in a similar fashion.

Proof.

To prove the theorem, it suffices to prove the result for two-way interactions, given that three-way (and higher-order) interactions can be built by recursively applying the results from the two-way interaction scenario. Specifically, it suffices to show that

Q_{k} = Q_{a} \tilde{\otimes} Q_{b}

is the penalty matrix corresponding to

R_{k} = R_{a} \tilde{\otimes} R_{b}

. First note that the vector

R_{k} = R_{a} \tilde{\otimes} R_{b} = {(R_{1 k}, \dots, R_{r_{k} k})}^{⊤}

has length

r_{k} = r_{a} r_{b}

for any

a, b \in {1, \dots, p}

. The ℓ-th entry

R_{ℓ k}

can be written in terms of the corresponding entries of

R_{a}

and

R_{b}

, such as

R_{ℓ k} = R_{k} (x, x_{ℓ k}^{*}) = R_{11 (a)} (x_{a}, x_{u a}^{*}) R_{11 (b)} (x_{b}, x_{v b}^{*})

(18)

where

x = (x_{a}, x_{b})

is the bivariate vector at which the RK is evaluated, and

x_{ℓ k}^{*} = (x_{u a}^{*}, x_{v b}^{*})

is the bivariate knot. Note that

ℓ = v + r_{b} (u - 1)

indexes the tensor product vector

R_{k}

, and

u \in {1, \dots, r_{a}}

and

v \in {1, \dots, r_{b}}

index the marginal

R_{a}

and

R_{b}

vectors. Letting

α_{k} = {(α_{1 k}, \dots, α_{r_{k} k})}^{⊤} \in R^{r_{k}}

denote an arbitrary coefficient vector, the penalty for the k-th term has the form

\begin{matrix} {∥ f ∥}_{k}^{2} & = \sum_{ℓ = 1}^{r_{k}} \sum_{ℓ^{'} = 1}^{r_{k}} α_{ℓ k} α_{ℓ^{'} k} {〈R_{ℓ k}, R_{ℓ^{'} k}〉}_{k} \\ = \sum_{ℓ = 1}^{r_{k}} \sum_{ℓ^{'} = 1}^{r_{k}} α_{ℓ k} α_{ℓ^{'} k} R_{k} (x_{ℓ k}^{*}, x_{ℓ^{'} k}^{*}) \\ = α_{k}^{⊤} (Q_{a} \tilde{\otimes} Q_{b}) α_{k} \end{matrix}

(19)

where the first line is due to the bilinearity of the inner product, the second line is due to the reproducing property of the RK function, and the third line is a straightforward (algebraic) simplification of the second line. □

5. Tensor Product Spectral Smoothing

5.1. Spectral Representater Theorem

For a more convenient representation of univariate smoothing spline (like) estimators, I introduce the spectral version of the representer theorem from Theorem 1, which will be particularly useful for tensor product function building.

Theorem 4

(Spectral Representer Theorem). Let

R = {(R_{11} (x, x_{1}), \dots, R_{11} (x, x_{n}))}^{⊤}

denote the vector of RK evaluations at the training data for an arbitrary

x \in X

, and let

Q = [R_{11} (x_{i}, x_{i^{'}})]

denote the corresponding penalty matrix. Consider an eigen-decomposition of

Q

of the form

Q = V D^{2} V^{⊤}

, where

V = (v_{1}, \dots, v_{n})

is the matrix of eigenvectors, and

D^{2} = diag (d_{1}^{2}, \dots, d_{n}^{2})

is the diagonal matrix of eigenvalues (

d_{i} > 0

is the i-th singular value). The function

f \in H

that minimizes Equation (16) can be written as

f_{λ} (x) = β + \sum_{i = 1}^{n} γ_{i} S_{i} (x)

where

γ = {(γ_{1}, \dots, γ_{n})}^{⊤} \in R^{n}

is a vector of coefficients, and

S_{i} (x) = d_{i}^{- 1} v_{i}^{⊤} R

. The spectral basis functions satisfy

{〈 S_{i}, S_{i^{'}} 〉}_{11} = δ_{i i^{'}}

, where

δ_{i i^{'}}

is Kronecker’s delta, which implies that

∥ S^{⊤} {γ ∥}_{11}^{2} = \sum_{i = 1}^{n} γ_{i}^{2}

for any

γ \in R^{n}

, where

S = {(S_{1} (x), \dots, S_{n} (x))}^{⊤}

.

Proof.

To prove the first part of the theorem, we need to prove that

R^{⊤} α = S^{⊤} γ

, where

S = {(S_{1} (x), \dots, S_{n} (x))}^{⊤}

. To establish the connection between the classic and spectral representations, first note that we can write the transformed (spectral) basis as

S = D^{- 1} V^{⊤} R

, and the corresponding transformed coefficients as

γ = D V^{⊤} α

. This implies that

S^{⊤} γ = R^{⊤} V D^{- 1} D V^{⊤} α = R^{⊤} α

given that

V D^{- 1} D V^{⊤} = I_{n}

, which completes the proof of the first part of the theorem. The prove the second part of the theorem, note that

{〈 S_{i}, S_{i^{'}} 〉}_{1} = d_{i}^{- 1} d_{i^{'}}^{- 1} v_{i}^{⊤} V D^{2} V^{⊤} v_{i^{'}} = δ_{i i^{'}}

, which is a consequence of the fact that

{〈 R, R 〉}_{11} = Q

due to the reproducing property, and the fact that

v_{i}^{⊤} v_{i^{'}} = δ_{i i^{'}}

due to the orthonormality of the eigenvectors. □

Note that Theorem 4 reveals that modified representation in Theorem 1 can be equivalently expressed in terms of the empirical eigen-decomposition of the penalty matrix, which we refer to as the spectral representation of the smoothing spline. Furthermore, note that the theorem reveals that the spectral basis functions

{S_{1}, S_{2}, \dots, S_{n}}

serve as empirical eigenfunctions for

H

, in the sense that these functions are a sample dependent basis that is orthonormal with respect to the contrast space inner-product. These eigenfunctions have the typical sign-changing behavior that is characteristic of spectral representations, such that

S_{i + 1}

has more sign changes than

S_{i}

for

i = 1, \dots, n

, see Figure 1. Note that the (scaled) ordinal and linear smoothing spline spectra are nearly identical to one another, which is not surprising given the asymptotic equivalence of these kernel functions [28]. Furthermore, note that the (scaled) cubic and quintic smoothing spline spectra are rather similar in appearance, especially for the first four empirical eigenfunctions.

5.2. Tensor Product Formulation

For a more convenient representation of tensor product smoothing spline (like) estimators, I introduce the spectral version of the representer theorem from Theorem 2, which will be particularly useful for tensor product function building.

Theorem 5

(Spectral Tensor Product Representer Theorem). The minimizer of Equation (17) has the form

f_{λ} = \sum_{k = 0}^{K} f_{k λ}

, where

f_{0 λ} \in R

is an intercept, and

f_{k λ} \in H_{k}

is the k-th effect function for

k = 1, \dots, K

. The optimal effect functions can be expressed as

f_{k λ} (x) = \sum_{i = 1}^{n} γ_{i k} S_{i k} (x)

for all

x \in X

, where

γ_{k} = {(γ_{1 k}, \dots, γ_{n k})}^{⊤} \in R^{n}

is the coefficient vector and

{S_{i k}}_{i = 1}^{n}

are the spectral basis functions for

k = 1, \dots, K

. The spectral basis functions can be defined to satisfy

{〈 S_{i k}, S_{i^{'} k} 〉}_{k} = δ_{i i^{'}}

, where

δ_{i i^{'}}

is Kronecker’s delta, which implies that

∥ S_{k} γ_{k} ∥_{k}^{2} = \sum_{i = 1}^{n} γ_{i k}^{2}

for any

γ_{k} \in R^{n}

, where

S_{k} = {(S_{1 k} (x), \dots, S_{n k} (x))}^{⊤}

.

Proof.

The result in Theorem 5 is essentially a combination of the results in Theorem 2 and Theorem 4. To prove the result, let

R_{k} = {(R_{k} (x, x_{1}), \dots, R_{k} (x, x_{n}))}^{⊤}

denote the vector of RK evaluations at the training data for an arbitrary

x \in X

, and let

Q_{k} = [R_{k} (x_{i}, x_{i^{'}})]

denote the corresponding penalty matrix. Furthermore, let

Q_{k} = V_{k} D_{k}^{2} V_{k}^{⊤}

denote the eigen-decomposition of the penalty matrix, where

V_{k} = (v_{1 k}, \dots, v_{n k})

is the matrix of eigenvectors, and

D_{k}^{2} = diag (d_{1 k}^{2}, \dots, d_{n k}^{2})

is the diagonal matrix of eigenvalues (

d_{i k} > 0

is the i-th singular value). Then the spectral basis functions can be defined as

S_{i k} = d_{i k}^{- 1} v_{i k}^{⊤} R_{k}

, which ensures that

∥ S_{k} γ_{k} ∥_{k}^{2} = \sum_{i = 1}^{n} γ_{i k}^{2}

for any

γ_{k} \in R^{n}

. □

Using the spectral tensor products, multiple and generalized nonparametric regression models can be easily fit using standard mixed effects modeling software, such as lme4 [31]. See Figure 2 for a visualization of the spectral tensor product basis functions.

5.3. Scalable Computation

For large n, the spectral basis functions defined in Theorem 5 are not computationally feasible, given that computing the eigen-decomposition of the penalty requires

O (n^{3})

flops. For more scalable computation, I present a spectral version of Corollary 2.

Corollary 3

(Spectral Tensor Product Low-Rank Approximation). The minimizer of Equation (17) has the form

f_{λ} = \sum_{k = 0}^{K} f_{k λ}

, and the effect functions can be approximated via

f_{k λ} (x) \approx \sum_{ℓ = 1}^{r_{k}} γ_{ℓ k} S_{ℓ k} (x)

where

S_{k} = {(S_{1 k} (x), \dots, S_{r_{k} k} (x))}^{⊤}

is the vector of spectral basis functions corresponding to

{x_{ℓ k}^{*}}_{ℓ = 1}^{r_{k}}

, which are the selected knots for the k-th effect with

x_{ℓ k}^{*} \in X_{k} \subset X \forall ℓ, k

.

Using the low-rank approximation, the penalty matrix

Q_{k} = [R_{k} (x_{ℓ k}^{*}, x_{ℓ^{'} k}^{*})]

evaluates the RK function at all combinations of the selected knots

{x_{ℓ k}^{*}}_{ℓ = 1}^{r_{k}}

. Note that the eigen-decomposition of

Q_{k}

only requires

O (n r_{k}^{2})

flops, which is a substantial improvement if

r_{k} ≪ n

. For the main effects, the spectral basis functions can be defined as

S_{ℓ k} = d_{ℓ k}^{- 1} v_{ℓ k}^{⊤} R_{k}

, where

Q_{k} = V_{k} D_{k}^{2} V_{k}^{⊤}

is the eigen-decomposition of the penalty matrix with

(d_{ℓ k}^{2}, v_{ℓ k})

denoting the ℓ-th eigenvalue/vector pair. As will be demonstrated in the subsequent theorem, spectral basis functions for interaction effects can be defined in a more efficient fashion via the computational tools from Theorem 3.

Theorem 6

(Spectral Tensor Product Penalties). Suppose that

K \geq p

and

H_{k}

captures the k-th predictor’s main effect for

k = 1, \dots, p

. Given any

x = (x_{1}, \dots, x_{p}) \in X

, the k-th basis vector is defined as

R_{k} = {(R_{11 (k)} (x_{k}, x_{1 k}^{*}), \dots, R_{11 (k)} (x_{k}, x_{r_{k} k}^{*}))}^{⊤}

and the k-th penalty matrix is

Q_{k} = [R_{11 (k)} (x_{ℓ k}^{*}, x_{ℓ^{'} k}^{*})]

, where

R_{11 (k)} = R_{01 (k)} + R_{1 (k)}

is the non-constant portion of each predictor’s marginal RK function for

k = 1, \dots, p

. Then the k-th spectral basis vector is defined as

S_{k} = D_{k}^{- 1} V_{k}^{⊤} R_{k}

, and the corresponding penalty matrix is the identity matrix. Now, suppose that

H_{k}

(for some

k > p

) captures the interaction effect between

X_{a}

and

X_{b}

for some

a, b \in {1, \dots, p}

. If the basis vector is defined as

S_{k} = S_{a} \tilde{\otimes} S_{b}

, where

\tilde{\otimes}

denotes the Kronecker product, then the penalty matrix is the identity matrix. Now, suppose that

H_{k}

(for some

k > p

) captures the three-way interaction between

(X_{a}, X_{b}, X_{c})

for some

a, b, c \in {1, \dots, p}

. If the basis vector is defined as

S_{k} = S_{a} \tilde{\otimes} S_{b} \tilde{\otimes} S_{c}

, then the penalty matrix is the identity matrix. Basis vectors for higher-order interactions can be efficiently constructed in a similar fashion.

Proof.

To prove the theorem, it suffices to prove the result for two-way interactions, given that three-way (and higher-order) interactions can be built by recursively applying the results from the two-way interaction scenario. Specifically, it suffices to show that the penalty matrix corresponding to

S_{k} = S_{a} \tilde{\otimes} S_{b}

is the identity matrix. Letting

α_{k} \in R^{r_{k}}

and

γ_{k} \in R^{r_{k}}

denote arbitrary coefficient vectors, the representation for the k-th term is

f_{k} (x) = R_{k}^{⊤} α_{k} = S_{k}^{⊤} γ_{k}

(20)

where the reparameterized basis and coefficient vector can be written as

\begin{matrix} S_{k} & = ((D_{a}^{- 1} V_{a}^{⊤}) \tilde{\otimes} (D_{b}^{- 1} V_{b}^{⊤})) R_{k} \\ γ_{k} & = ((D_{a} V_{a}^{⊤}) \tilde{\otimes} (D_{b} V_{b}^{⊤})) α_{k} \end{matrix}

(21)

Now, note that the squared Euclidean norm of the reparameterized coefficients is

\begin{matrix} \sum_{ℓ = 1}^{r_{k}} γ_{ℓ k}^{2} & = α_{k}^{⊤} {((D_{a} V_{a}^{⊤}) \tilde{\otimes} (D_{b} V_{b}^{⊤}))}^{⊤} ((D_{a} V_{a}^{⊤}) \tilde{\otimes} (D_{b} V_{b}^{⊤})) α_{k} \\ = α_{k}^{⊤} ((V_{a} D_{a}) \tilde{\otimes} (V_{b} D_{b})) ((D_{a} V_{a}^{⊤}) \tilde{\otimes} (D_{b} V_{b}^{⊤})) α_{k} \\ = α_{k}^{⊤} ((V_{a} D_{a} D_{a} V_{a}^{⊤}) \tilde{\otimes} (V_{b} D_{b} D_{b} V_{b}^{⊤})) α_{k} \\ = α_{k}^{⊤} (Q_{a} \tilde{\otimes} Q_{b}) α_{k} \end{matrix}

(22)

where the first line plugs in the definition of the squared Euclidean norm, the second line uses the fact that

{(A \tilde{\otimes} B)}^{⊤} = (A^{⊤} \tilde{\otimes} B^{⊤})

, the third line uses the fact that

(A \tilde{\otimes} B) (C \tilde{\otimes} D) = (A C) \tilde{\otimes} (B D)

, and the fourth line plugs in the definition of the penalty matrices. □

6. Simulated Example

To demonstrate the potential of the proposed approach, I designed a simple simulation study to compare the performance of the proposed tensor product smoothing approach with the approach of Wood et al. [20], which is implemented in the popular gamm4 package [32] in R [29]. The gamm4 package [32] uses the mgcv package [33] to build the smooth basis matrices, and then uses the lme4 package [31] to tune the smoothing parameters (which are treated as variance parameters). For a fair comparison, I have implemented the proposed tensor product spectral smoothing (TPSS) approach using the lme4 package to tune the smoothing parameters, which I refer to as tpss4. This ensures that any difference in the results is due to the employed (reparameterized) basis functions instead of due to differences in the tuning procedure.

Given

p = 2

predictors with

X = [0, 1] \times [0, 1]

, the true mean function is defined as

f (x_{1}, x_{2}) = f_{1} (x_{1}) + f_{2} (x_{2}) + f_{12} (x_{1}, x_{2})

where

f_{1} (x_{1}) = 4 cos (2 π [x_{1} - π])

is the main effect of the first predictor, and

f_{2} (x_{2}) = 120 {(x_{2} - 0.6)}^{5}

is the main effect of the second predictor. The interaction effect is defined as

f_{12} (x_{1}, x_{2}) = 0

for the additive function, and

f_{12} (x_{1}, x_{2}) = 4 sin (π [x_{1} - x_{2}])

for the interaction function. Note that this interaction function has been used in previous simulation work that explored tensor product smoothers (see [9,34]). Two different sample sizes were considered

n \in {1000, 2000}

. For each sample size and data-generating mean function, n observations were (independently) randomly sampled from

X

, and the response was defined as

y_{i} = f (x_{i 1}, x_{i 2}) + ϵ_{i}

, where

ϵ_{i}

follows a standard normal distribution.

For both the gamm4 package and the proposed tpss4 implementation, (i) I fit the model using

r_{k} \in {5, 6, \dots, 10}

marginal knots for each predictor, and (ii) I used restricted maximum likelihood (REML) to tune the smoothing parameters. For the gamm4 package, the tensor product smooth was formed using the t2() function, which allows for main and interaction effects of the predictors. For the tpss4 method, the implementation in the smooth2d() function (see Supplementary Materials) allows for both main and interaction effects. Thus, for both methods, the fit model is misspecified for additive models and correctly specified for interaction models.

I compared the quality of the solutions using the root mean squared error (RMSE)

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(f (x_{i 1}, x_{i 2}) - {\hat{f}}_{λ} (x_{i 1}, x_{i 2}))}^{2}}

and the mean absolute error (MAE)

MAE = \frac{1}{n} \sum_{i = 1}^{n} |f (x_{i 1}, x_{i 2}) - {\hat{f}}_{λ} (x_{i 1}, x_{i 2})|

where

f (x_{i 1}, x_{i 2})

is the data-generating mean function and

{\hat{f}}_{λ}

is the estimated function. The data generation and analysis procedure was repeated 100 times for each sample size.

Box plots of the RMSE and MAE for each method under each combination of

n \in {1000, 2000}

and

r_{k} \in {5, 6, \dots, 10}

are displayed in Figure 3 and Figure 4. As expected, both the RMSE and MAE decrease as the number of knots increases for both methods. For each

r_{k}

, the proposed tpss4 method tends to result in smaller RMSE and MAE values compared to the gamm4 implementation. For the (misspecified) additive function, the benefit of the proposed approach is noteworthy and persists across all

r_{k}

. For the interaction model, the benefit of the proposed tpss4 approach is particularly noticeable for small

r_{k}

, but is still existent for larger numbers of knots.

The runtime for each method is displayed in Figure 5. The proposed tpss4 method produces runtimes that are slightly larger than the gamm4 method in most situations. Despite using the same number of marginal knots for each predictor, the gamm4 approach uses an approximation that estimates slightly fewer coefficients, which is likely causing the timing differences. However, it is possible that these timing differences could be due to running compiled code (in gamm4) versus uncompiled code (in tpss4). Regardless of the source of the differences, the timing differences are rather small and disappear as

r_{k}

increases, which reveals the practicality of the proposed approach.

7. Real Data Example

To demonstrate the proposed approach using real data, I make use of the Bike Sharing Dataset [35] from the UCI Machine Learning Repository [36]. This dataset contains the number (count) of bikes rented from the Capital Bike Share system in Washington DC. The rental counts are recorded by the hour from the years 2011 and 2012, which produced a dataset with n = 17,379 observations. In addition to the counts, the dataset contains various situational factors that might affect the number of rented bikes. In this example, I will focus on modeling the number of bike rentals as a function of the hour of the day (which takes values 0, 1, …, 23) and the month of the year (which takes values 1, 2, …, 12).

The proposed approach was used to fit a tensor product spectral smoother (TPSS) to the data using 12 knots for the hour variable and 6 knots for the month variable. The counts were modeled on the log10 scale, and then transformed back to the original (data) scale for visualization purposes. As in the simulation study, the smoothing parameters were tuned using the REML method in the lme4 package. Figure 6 displays the average number of bike rentals by hour and month, as well as the TPSS model predictions. As is evident from the figure, the TPSS solution closely resembles the average data; however, the model predictions are substantially smoother, which improves the interpretation.

Looking at the bike rental patterns by hour of the day, it is evident that there are two surges in the number of rentals: (1) during the morning rush hour (∼8:00–9:00) and (2) during the evening rush hour (∼17:00–18:00). The results also reveal another (smaller) surge that occurs during the lunch hour (∼12:00–13:00). Interestingly, the predictions in Figure 6 reveal that the bike rental surge during the morning rush hour is more localized in time (lasting about one hour), whereas the evening surge is more temporally diffuse (lasting 2–3 h). The bike rentals tend to peak during the afternoon rush hour, and are at their lowest expected value during the evening hours (∼23:00–06:00).

The month effect is less pronounced than the hour effect, but it still produces some interpretable insights. In particular, we see that there are fewer people using the bikes during the winter months (Dec, Jan, Feb), which is expected. The drops in the number of rentals during the winter are particularly noticeable during the lunch surge, which suggests that fewer people use the bikes to compute for lunch during the winter. The peak in the rentals occurs during the summer months (Jun, Jul, Aug). Combining the hour and month information suggests that the evening rush hour during the summer months is when the Capital Bike Share system sees it greatest surge in demand.

8. Discussion

This paper proposes efficient and flexible approaches for fitting tensor product smoothing spline-like models. The refined smoothing spline approach developed in Section 4 offers an alternative approach for tensor product smoothing splines that penalizes all non-constant effects of the predictors. In particular, Theorem 1 proposes a representer theorem for univariate smoothing spline-like estimators that penalizes all non-constant functions, Theorem 2 provides a tensor product extension of the proposed estimator, and Theorem 3 develops efficient computational tools for forming tensor product penalties. Furthermore, the spectral tensor product approach developed in Section 5 makes it possible to use exact (instead of approximate) tensor product penalties, which can be easily implemented in any standard mixed effects modeling software. In particular, Theorem 4 presents a spectral representer theorem for univariate smoothing, Theorem 5 provides a tensor product extension of the spectral representation, and Theorem 6 develops efficient computational tools for forming tensor product penalties.

The principal results in this paper reveal that if basis functions are formed by taking Kronecker products of spectral spline representations, then the resulting (exact) penalty matrix is the identity matrix. This implies that it is no longer necessary to choose between approximate penalties or costly parameterizations. Note that the results in this paper provide some theoretical support for the tensor product approach of Wood et al. [20], which uses a similar approach with different basis functions. The simulation results support the theoretical results given that the proposed approach (which uses the exact penalty) outperforms the approach of Wood et al. [20] in gamm4 [32]. As a result, I expect that the proposed approach will be quite useful for fitting (generalized) nonparametric models using modern mixed effects and penalized regression modeling softwares such as lme4 or grpnet. Furthermore, I expect that the proposed approach will be useful for conducting inference with tensor product smoothing splines, e.g., using nonparametric permutation tests [37] or standard hypothesis tests for variance components [38].

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/stats7010003/s1.

Name	Type	Description
smooth2d	R function (.R)	Function for 2-dimensional smoothing
tpss_ex	R script (.R)	Script for the bike sharing analyses and Figure 6
tpss_figs	R script (.R)	Script for reproducing Figure 1 and Figure 2
tpss_sim	R script (.R)	Script for the simulation study and Figure 3, Figure 4 and Figure 5

Funding

This research was funded by National Institutes of Health (NIH) grants R01EY030890, U01DA046413, and R01MH115046.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

R code to reproduce the results is included as Supplementary Materials. The analyzed data are open source and publicly available.

Conflicts of Interest

The author declares no conflict of interest.

References

Helwig, N.E. Multiple and Generalized Nonparametric Regression. In SAGE Research Methods Foundations; Atkinson, P., Delamont, S., Cernat, A., Sakshaug, J.W., Williams, R.A., Eds.; SAGE Publications Ltd.: London, UK, 2020. [Google Scholar] [CrossRef]
Berry, L.N.; Helwig, N.E. Cross-validation, information theory, or maximum likelihood? A comparison of tuning methods for penalized splines. Stats 2021, 4, 701–724. [Google Scholar] [CrossRef]
Wahba, G. Spline Models for Observational Data; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 1990. [Google Scholar]
de Boor, C. A Practical Guide to Splines; revised ed.; Springer: New York, NY, USA, 2001. [Google Scholar]
Gu, C. Smoothing Spline ANOVA Models, 2nd ed.; Springer: New York, NY, USA, 2013. [Google Scholar] [CrossRef]
Wang, Y. Smoothing Splines: Methods and Applications; CRC Press: Boca Raton, FL, USA, 2011. [Google Scholar]
Wood, S.N. Generalized Additive Models: An Introduction with R, 2nd ed.; Chapman & Hall: Boca Raton, FL, USA, 2017. [Google Scholar]
Hastie, T.; Tibshirani, R. Generalized Additive Models; Chapman and Hall/CRC: New York, NY, USA, 1990. [Google Scholar]
Helwig, N.E.; Ma, P. Fast and stable multiple smoothing parameter selection in smoothing spline analysis of variance models with large samples. J. Comput. Graph. Stat. 2015, 24, 715–732. [Google Scholar] [CrossRef]
Helwig, N.E.; Gao, Y.; Wang, S.; Ma, P. Analyzing spatiotemporal trends in social media data via smoothing spline analysis of variance. Spat. Stat. 2015, 14, 491–504. [Google Scholar] [CrossRef]
Helwig, N.E.; Shorter, K.A.; Hsiao-Wecksler, E.T.; Ma, P. Smoothing spline analysis of variance models: A new tool for the analysis of cyclic biomechaniacal data. J. Biomech. 2016, 49, 3216–3222. [Google Scholar] [CrossRef] [PubMed]
Helwig, N.E.; Ruprecht, M.R. Age, gender, and self-esteem: A sociocultural look through a nonparametric lens. Arch. Sci. Psychol. 2017, 5, 19–31. [Google Scholar] [CrossRef]
Helwig, N.E.; Sohre, N.E.; Ruprecht, M.R.; Guy, S.J.; Lyford-Pike, S. Dynamic properties of successful smiles. PLoS ONE 2017, 12, e0179708. [Google Scholar] [CrossRef]
Helwig, N.E.; Snodgress, M.A. Exploring individual and group differences in latent brain networks using cross-validated simultaneous component analysis. NeuroImage 2019, 201, 116019. [Google Scholar] [CrossRef] [PubMed]
Hammell, A.E.; Helwig, N.E.; Kaczkurkin, A.N.; Sponheim, S.R.; Lissek, S. The temporal course of over-generalized conditioned threat expectancies in posttraumatic stress disorder. Behav. Res. Ther. 2020, 124, 103513. [Google Scholar] [CrossRef]
Almquist, Z.W.; Helwig, N.E.; You, Y. Connecting Continuum of Care point-in-time homeless counts to United States Census areal units. Math. Popul. Stud. 2020, 27, 46–58. [Google Scholar] [CrossRef]
Helwig, N.E. Efficient estimation of variance components in nonparametric mixed-effects models with large samples. Stat. Comput. 2016, 26, 1319–1336. [Google Scholar] [CrossRef]
Helwig, N.E.; Ma, P. Smoothing spline ANOVA for super-large samples: Scalable computation via rounding parameters. Stat. Its Interface 2016, 9, 433–444. [Google Scholar] [CrossRef]
Demmler, A.; Reinsch, C. Oscillation matrices with spline smoothing. Numer. Math. 1975, 24, 375–382. [Google Scholar] [CrossRef]
Wood, S.N.; Scheipl, F.; Faraway, J.J. Straightforward intermediate rank tensor product smoothing in mixed models. Stat. Comput. 2013, 23, 341–360. [Google Scholar] [CrossRef]
Kimeldorf, G.; Wahba, G. Some results on Tchebycheffian spline functions. J. Math. Anal. Appl. 1971, 33, 82–95. [Google Scholar] [CrossRef]
Gu, C.; Kim, Y.J. Penalized likelihood regression: General formulation and efficient approximation. Can. J. Stat. 2002, 30, 619–628. [Google Scholar] [CrossRef]
Kim, Y.J.; Gu, C. Smoothing spline Gaussian regression: More scalable computation via efficient approximation. J. R. Stat. Soc. Ser. B 2004, 66, 337–356. [Google Scholar] [CrossRef]
Moore, E.H. On the reciprocal of the general algebraic matrix. Bull. Am. Math. Soc. 1920, 26, 394–395. [Google Scholar] [CrossRef]
Penrose, R. A generalized inverse for matrices. Math. Proc. Camb. Philos. Soc. 1955, 51, 406–413. [Google Scholar] [CrossRef]
Wang, Y. Mixed effects smoothing spline analysis of variance. J. R. Stat. Soc. Ser. B 1998, 60, 159–174. [Google Scholar] [CrossRef]
Wang, Y. Smoothing spline models with correlated random errors. J. Am. Stat. Assoc. 1998, 93, 341–348. [Google Scholar] [CrossRef]
Helwig, N.E. Regression with ordered predictors via ordinal smoothing splines. Front. Appl. Math. Stat. 2017, 3, 15. [Google Scholar] [CrossRef]
R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2023; R version 4.3.1. [Google Scholar]
Helwig, N.E. grpnet: Group Elastic Net Regularized GLM, R package version 0.2; Comprehensive R Archive Network: Vienna, Austria, 2023. [Google Scholar]
Bates, D.; Mächler, M.; Bolker, B.M.; Walker, S.C. Fitting Linear Mixed-Effects Models Using lme4. J. Stat. Softw. 2015, 67, 1–48. [Google Scholar] [CrossRef]
Wood, S.; Scheipl, F. gamm4: Generalized Additive Mixed Models Using ‘mgcv’ and ‘lme4’, R package version 0.2-6; Comprehensive R Archive Network: Vienna, Austria, 2020. [Google Scholar]
Wood, S.N. mgcv: Mixed GAM Computation Vehicle with GCV/AIC/REML Smoothness Estimation and GAMMs by REML/PQL, R package version 1.9-1; Comprehensive R Archive Network: Vienna, Austria, 2023. [Google Scholar]
Helwig, N.E. Spectrally sparse nonparametric regression via elastic net regularized smoothers. J. Comput. Graph. Stat. 2021, 30, 182–191. [Google Scholar] [CrossRef]
Fanaee-T, H.; Gama, J. Event labeling combining ensemble detectors and background knowledge. Prog. Artif. Intell. 2013, 2, 1–15. [Google Scholar] [CrossRef]
Kelly, M.; Longjohn, R.; Nottingham, K. The University of California Irvine (UCI) Machine Learning Repository. Available online: https://archive.ics.uci.edu/ (accessed on 26 December 2023).
Helwig, N.E. Robust Permutation Tests for Penalized Splines. Stats 2022, 5, 916–933. [Google Scholar] [CrossRef]
Kuznetsova, A.; Brockhoff, P.B.; Christensen, R.H.B. lmerTest Package: Tests in Linear Mixed Effects Models. J. Stat. Softw. 2017, 82, 1–26. [Google Scholar] [CrossRef]

Figure 1. Spectral basis functions for different types of reproducing kernel functions using five equidistant knots. Basis functions were evaluated at

x_{i} = \frac{i - 1}{100}

for

i = 1, \dots, 101

and were scaled for visualization purposes. Produced by R [29] using the rk() function in the grpnet package [30].

Figure 1. Spectral basis functions for different types of reproducing kernel functions using five equidistant knots. Basis functions were evaluated at

x_{i} = \frac{i - 1}{100}

for

i = 1, \dots, 101

and were scaled for visualization purposes. Produced by R [29] using the rk() function in the grpnet package [30].

Figure 2. Spectral tensor product basis functions formed from

p = 2

cubic smoothing spline marginals with

r_{k} = 5

equidistant knots for each predictor. From left to right, the basis functions become less smooth with respect to

X_{2}

. From top to bottom, the basis functions become less smooth with respect to

X_{1}

. Produced by R [29] using the rk() function in the grpnet package [30].

Figure 2. Spectral tensor product basis functions formed from

p = 2

cubic smoothing spline marginals with

r_{k} = 5

equidistant knots for each predictor. From left to right, the basis functions become less smooth with respect to

X_{2}

. From top to bottom, the basis functions become less smooth with respect to

X_{1}

. Produced by R [29] using the rk() function in the grpnet package [30].

Figure 3. Box plots of the root mean squared error (RMSE) of the function estimate for each method. Rows show results for the additive function (top) and interaction function (bottom). Columns show the results as a function of the number of knots for

n = 1000

(left) and

n = 2000