Article

Robust Permutation Tests for Penalized Splines

by Nathaniel E. Helwig 1,2
1 Department of Psychology, University of Minnesota, Minneapolis, MN 55455, USA
2 School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA
Stats 2022, 5(3), 916-933; https://doi.org/10.3390/stats5030053
Submission received: 18 August 2022 / Revised: 9 September 2022 / Accepted: 11 September 2022 / Published: 16 September 2022

Abstract
Penalized splines are frequently used in applied research for understanding functional relationships between variables. In most applications, statistical inference for penalized splines is conducted using the random effects or Bayesian interpretation of a smoothing spline. These interpretations can be used to assess the uncertainty of the fitted values and the estimated component functions. However, statistical tests about the nature of the function are more difficult, because such tests often involve testing a null hypothesis that a variance component is equal to zero. Furthermore, valid statistical inference using the random effects or Bayesian interpretation depends on the validity of the utilized parametric assumptions. To overcome these limitations, I propose a flexible and robust permutation testing framework for inference with penalized splines. The proposed approach can be used to test omnibus hypotheses about functional relationships, as well as more flexible hypotheses about conditional relationships. I establish the conditions under which the methods will produce exact results, as well as the asymptotic behavior of the various permutation tests. Additionally, I present extensive simulation results to demonstrate the robustness and superiority of the proposed approach compared to commonly used methods.


1. Introduction

1.1. Penalized Spline Prevalence

Penalized splines are used within modern multiple and generalized nonparametric regression frameworks [1,2], such as the smoothing spline analysis of variance models [3,4,5] and generalized additive models [6,7,8], to discover unknown functional relationships between a response variable $Y$ and a collection of predictors $X_1, \ldots, X_d$. Penalized splines and their variants have been applied to understand functional relationships in data from a variety of different disciplines. For example, recent applications include the use of penalized splines to model spatiotemporal patterns in US homelessness [9], biomechanical analysis of locomotion data [10,11], fear learning curves in veterans with PTSD [12], functional properties of successful smiles [13], self-esteem trajectories across the lifespan [14], and spatiotemporal trends in social media [15].
In addition to being a commonly used tool in applied research, penalized splines are also frequently the focus of theoretical and computational statistics research. For example, there has been recent interest in fitting penalized spline models with ordinal predictors and responses [16,17], mixed-effects models with penalized spline components [18,19,20], efficient approximation for large samples [21,22], and efficient algorithms for fitting penalized spline models to big data [23,24]. Furthermore, given the frequent usage of penalized splines for the analysis of real world data, there has been a growing interest in the robustness of penalized spline tuning and inference methods under model misspecification [25]. Recently, there has also been the development of alternative spline penalization approaches for nonparametric function estimation from noisy data [26].

1.2. Penalized Spline Definition

Consider a multiple nonparametric regression model [1,2] of the form
$$Y_i = \eta(X_i) + \epsilon_i \qquad (1)$$
where $Y_i$ is the $i$-th realization of the response variable $Y \in \mathbb{R}$, $X_i = (X_{i1}, \ldots, X_{id})^\top$ is the $i$-th realization of the predictor variable $X \in \mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_d$, and $\epsilon_i$ is the $i$-th realization of the error term $\epsilon \in \mathbb{R}$. The error terms are assumed to satisfy $E(\epsilon_i) = 0$ and $E(\epsilon_i^2) = \sigma_i^2$ with $\sigma_i^2 < \infty$ denoting the $i$-th observation's error variance. To estimate $\eta$, it is typical to minimize the penalized least squares (PLS) functional
$$\frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \eta(X_i) \right)^2 + \lambda J(\eta)$$
where $J(\cdot)$ is the penalty functional, and $\lambda \geq 0$ is the smoothing parameter, which controls the balance between fitting and smoothing.
Given $\lambda$, the Kimeldorf-Wahba representer theorem [27] reveals that the minimizer of the PLS functional has the form
$$\eta_\lambda(X) = a + \sum_{v=1}^{m-1} b_v N_v(X) + \sum_{i=1}^{r} c_i R(X, X_i^*)$$
where $\{N_v\}_{v=0}^{m-1}$ are known functions that span the null space $\mathcal{H}_0 = \{\eta : J(\eta) = 0\}$ with $N_0(X) = 1$, the symmetric and bivariate function $R(\cdot,\cdot)$ is the known reproducing kernel of the contrast space $\mathcal{H}_1 = \{\eta : J(\eta) < \infty\}$, the collection of predictor scores $\{X_i^*\}_{i=1}^{r}$ are the selected spline knots, $a \in \mathbb{R}$ is the unknown intercept, and $b = (b_1, \ldots, b_{m-1})^\top$ and $c = (c_1, \ldots, c_r)^\top$ are the unknown basis function coefficient vectors. The representer theorem uses all design points as knots (i.e., $r = n$ and $X_i^* = X_i$), but reasonable approximations can be obtained using $r < n$ knots [21,22,28].

1.3. Penalized Spline Estimation

Using the representer theorem, the nonparametric regression model can be written as
$$Y_i = a + N_i^\top b + R_i^\top c + \epsilon_i$$
where $N_i = (N_1(X_i), \ldots, N_{m-1}(X_i))^\top$ is the null space basis function vector for the $i$-th observation, and $R_i = (R(X_i, X_1^*), \ldots, R(X_i, X_r^*))^\top$ is the reproducing kernel function evaluated at $X_i$ and the selected knots. Let $Y_i^c = Y_i - \bar{Y}$ denote the centered response and let $N_i^c = N_i - \bar{N}$ and $R_i^c = R_i - \bar{R}$ denote the centered basis functions, where $\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i$, $\bar{N} = \frac{1}{n}\sum_{i=1}^n N_i$, and $\bar{R} = \frac{1}{n}\sum_{i=1}^n R_i$. Then, the PLS functional can be rewritten as
$$\frac{1}{n} \left\| Y^c - N_c b - R_c c \right\|^2 + \lambda c^\top Q c$$
where $Y^c = (Y_1^c, \ldots, Y_n^c)^\top$ is the centered response vector, $N_c = [N_1^c, \ldots, N_n^c]^\top$ and $R_c = [R_1^c, \ldots, R_n^c]^\top$ are the mean centered basis function matrices, and $Q = [R(X_i^*, X_j^*)]$ is the penalty matrix.
Let $d = (b_1, \ldots, b_{m-1}, c_1, \ldots, c_r)^\top$ denote the combined basis function coefficient vector, and let $K_c = (N_c, R_c)$ denote the combined (centered) basis function matrix. Given the smoothing parameter $\lambda$, it is well known that the basis function coefficients that minimize the PLS functional can be written as
$$\hat{d}_\lambda = \left( K_c^\top K_c + n \lambda Q^* \right)^\dagger K_c^\top Y^c$$
where $Q^* = \mathrm{bdiag}(0_{m-1}, Q)$ is the block diagonal penalty matrix, and $(\cdot)^\dagger$ denotes the Moore-Penrose pseudo-inverse [29,30]. The least squares estimate of the intercept term can then be written as $\hat{a} = \bar{Y} - \bar{K} \hat{d}_\lambda$, where $\bar{K} = (\bar{N}, \bar{R})$. The fitted values have the form $\hat{\eta}_\lambda = \hat{a} + \hat{\eta}_{0\lambda} + \hat{\eta}_{1\lambda}$, where $\hat{\eta}_{0\lambda} = N_c \hat{b}_\lambda$ is the (non-constant) null space contribution, and $\hat{\eta}_{1\lambda} = R_c \hat{c}_\lambda$ is the contrast space contribution with $\hat{d}_\lambda = (\hat{b}_\lambda^\top, \hat{c}_\lambda^\top)^\top$.
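To make these estimation steps concrete, the following base-R sketch builds the centered basis matrices and solves the PLS problem directly. It is a minimal illustration, not the npreg implementation: the cubic smoothing spline reproducing kernel is written in its standard scaled Bernoulli polynomial form (see, e.g., [5]), the smoothing parameter is held fixed rather than tuned by GCV, and solve() stands in for the Moore-Penrose pseudo-inverse (assuming the penalized crossproduct matrix is invertible). The helper names rk() and fit_pspline() are illustrative.

```r
## Cubic smoothing spline reproducing kernel on [0, 1] via scaled
## Bernoulli polynomials: R(x, z) = k2(x) k2(z) - k4(|x - z|)
k1 <- function(x) x - 1/2
k2 <- function(x) (k1(x)^2 - 1/12) / 2
k4 <- function(x) (k1(x)^4 - k1(x)^2 / 2 + 7/240) / 24
rk <- function(x, z) k2(x) * k2(z) - k4(abs(x - z))

fit_pspline <- function(x, y, knots = quantile(x, seq(0, 1, length.out = 5)),
                        lambda = 1e-4) {
  n <- length(y)
  N <- cbind(x)                        # non-constant null space basis: N_1(x) = x
  R <- outer(x, knots, rk)             # contrast space basis R(x, x*)
  Q <- outer(knots, knots, rk)         # penalty matrix Q = [R(x*_i, x*_j)]
  Kc <- cbind(scale(N, scale = FALSE), scale(R, scale = FALSE))  # centered bases
  yc <- y - mean(y)                    # centered response
  Qs <- rbind(0, cbind(0, Q))          # block diagonal penalty bdiag(0, Q)
  ## solve (Kc'Kc + n lambda Q*) d = Kc'yc (plain solve, assuming full rank)
  d <- solve(crossprod(Kc) + n * lambda * Qs, crossprod(Kc, yc))
  a <- mean(y) - drop(c(colMeans(N), colMeans(R)) %*% d)  # intercept estimate
  list(a = a, d = drop(d), fitted = drop(mean(y) + Kc %*% d))
}

set.seed(1)
x <- seq(0, 1, length.out = 100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
fit <- fit_pspline(x, y)
```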

1.4. Bayesian Interpretation

Given the estimated function $\hat{\eta}_\lambda$, statistical inference about the unknown function $\eta$ is often conducted using the Bayesian interpretation of a smoothing spline [31,32]. This approach assumes that $\eta = \eta_0 + \eta_1$, where the null space function $\eta_0$ has a vague prior, and the contrast space function $\eta_1$ has a Gaussian process prior with mean zero and covariance matrix proportional to $Q$. Given these prior assumptions, it can be demonstrated that the posterior distribution of $\eta$ given $Y$ is multivariate normal with mean vector $\hat{\eta}_\lambda$ and covariance matrix proportional to $K (K^\top K + n \lambda Q^*)^\dagger K^\top$, where $K = (1_n, N, R)$. When the smoothing parameter $\lambda$ is chosen via the generalized cross-validation (GCV) criterion [33], confidence intervals formed using the Bayesian covariance matrix tend to have an "across-the-function" coverage property [32]. See [25] for an investigation of the Bayesian CI coverage properties.
The Bayesian confidence intervals can be used to assess the uncertainty of the fitted values and the component functions, i.e., the main and interaction effects of the $d$ predictors [34,35]. However, testing hypotheses about the nature of the function $\eta$ is a more difficult problem. Various approaches have been proposed for testing hypotheses about penalized smoothers [36,37,38,39,40,41,42,43]. However, these methods are primarily designed for testing specific hypotheses (e.g., $H_0: \eta \in \mathcal{H}_0$) under the assumption of homoscedastic Gaussian errors. The exception is Wood's approach [42], which is designed to be more general, but the validity of Wood's approach depends on having a correctly specified parametric (exponential family) distribution. As a result, the generalizability of the existing inference methods is limited, and the validity of these methods is suspect when the parametric assumptions are questionable.

1.5. Proposed Approach

When working with real data, it can be difficult to determine whether or not the parametric assumptions that are required for valid hypothesis testing are reasonably met. Furthermore, if the error terms are non-Gaussian and/or heteroscedastic, the previously proposed hypothesis tests may produce substantially misleading inferential results. To overcome this important practical issue, I propose a flexible permutation testing framework for robust inference in nonparametric regression models. The proposed approach extends recent advances in robust permutation tests for linear models [44] to generalized ridge regression (GRR) and penalized smoothing problems. As I demonstrate in the following sections, the proposed framework can be used for overall (omnibus) tests about functional relationships, as well as more specific (conditional) tests.
The remainder of this paper is organized as follows: Section 2 develops the foundations and theory for omnibus permutation tests using GRR estimators, Section 3 extends the permutation testing framework to conditional tests of effects, Section 4 presents extensive simulation results to validate the theoretical results derived in the previous sections, and Section 5 discusses how the proposed framework can be flexibly adapted for testing a variety of hypotheses about semi- and non-parametric regression models.

2. Omnibus Regression Tests

2.1. Model and Estimation

Given an independent sample of n observations, consider the linear regression model
$$Y_i = \alpha + X_i^\top \beta + \epsilon_i \qquad (2)$$
for $i \in \{1, \ldots, n\}$, where $Y_i$ is the $i$-th realization of the response variable $Y \in \mathbb{R}$, $X_i = (X_{i1}, \ldots, X_{ip})^\top$ is the $i$-th realization of the predictor vector $X = (X_1, \ldots, X_p)^\top \in \mathbb{R}^p$, and $\epsilon_i$ is the $i$-th realization of the error term $\epsilon \in \mathbb{R}$. The error terms are assumed to satisfy $E(\epsilon_i) = 0$ and $E(\epsilon_i^2) = \sigma_i^2$ with $\sigma_i^2 < \infty$ denoting the $i$-th observation's (finite) error variance. Without loss of generality, we can assume that the response and predictors are centered, which implies that the intercept $\alpha$ can be dropped from the model for estimation purposes. Let $Y_i^c = Y_i - \bar{Y}$ and $X_i^c = X_i - \bar{X}$ denote the mean centered response and predictor vector for the $i$-th observation, where $\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i$ and $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$.
To estimate the coefficients in β , consider minimizing the GRR loss function [45]
$$\frac{1}{n} (Y^c - X_c \beta)^\top (Y^c - X_c \beta) + \beta^\top \Delta \beta \qquad (3)$$
where $Y^c = (Y_1^c, \ldots, Y_n^c)^\top$ is the centered response vector, $X_c = [X_1^c, \ldots, X_n^c]^\top$ is the centered design matrix, and $\Delta$ is a $p \times p$ symmetric and positive semi-definite penalty matrix. The coefficients that minimize the GRR problem in Equation (3) have the form
$$\hat{\beta}_\Delta = \left( \tfrac{1}{n} X_c^\top X_c + \Delta \right)^{-1} \tfrac{1}{n} X_c^\top Y^c \qquad (4)$$
which is subscripted to emphasize that the estimated coefficients depend on the penalty matrix $\Delta$. Given the estimated slope vector $\hat{\beta}_\Delta$, the least squares estimate of the intercept has the form $\hat{\alpha} = \bar{Y} - \bar{X}^\top \hat{\beta}_\Delta$.
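A minimal sketch of the GRR estimator in Equation (4) takes only a few lines of base R; the function name grr() and argument Delta are illustrative, and solve() assumes the penalized crossproduct matrix is invertible.

```r
## Sketch of the generalized ridge regression (GRR) estimator in Equation (4)
grr <- function(X, y, Delta) {
  n <- nrow(X)
  Xc <- scale(X, scale = FALSE)                   # centered design matrix
  yc <- y - mean(y)                               # centered response
  beta <- drop(solve(crossprod(Xc) / n + Delta, crossprod(Xc, yc) / n))
  alpha <- mean(y) - drop(colMeans(X) %*% beta)   # least squares intercept
  list(alpha = alpha, beta = beta)
}
```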

2.2. Asymptotic Distributions

Consider the linear regression model from Equation (2) with the assumptions:
A1.
$Y_i = \alpha + X_i^\top \beta + \epsilon_i$ with $E(\epsilon_i) = 0$ and $E(\epsilon_i^2) = \sigma_i^2 < \infty$ for $i = 1, \ldots, n$
A2.
$(Y_i, X_i)$ are iid from a distribution satisfying $E(\epsilon_i \mid X_i) = 0$
A3.
$\Sigma_X = E[(X_i - \mu_X)(X_i - \mu_X)^\top]$ and $\Omega_X = E[\epsilon_i^2 (X_i - \mu_X)(X_i - \mu_X)^\top]$ are nonsingular, where $\mu_X = E(X_i)$, and $\frac{1}{n} X_c^\top X_c + \Delta$ is almost surely invertible. In the fixed predictors case, the mean vector is $\mu_X = \frac{1}{n}\sum_{i=1}^n X_i$ and the covariance matrix terms are defined as $\Sigma_X = \frac{1}{n} X_c^\top X_c$ and $\Omega_X = \frac{1}{n} X_c^\top \Psi X_c$, where $\Psi = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$.
Given assumptions A1–A3, the GRR estimator provides an estimate of $\beta_\Delta = \Sigma_{X\Delta}^{-1} \sigma_{XY}$, where $\Sigma_{X\Delta} = \Sigma_X + \Delta$ and $\sigma_{XY} = E[(X_i - \mu_X)(Y_i - \mu_Y)]$ is the covariance between $X_i$ and $Y_i$. Note that $\beta_\Delta = \Sigma_{X\Delta}^{-1} \Sigma_X \beta$, where $\beta = \Sigma_X^{-1} \sigma_{XY}$ is estimated by the ordinary least squares (OLS) estimator $\hat{\beta} = \hat{\Sigma}_X^{-1} \hat{\sigma}_{XY}$ with $\hat{\Sigma}_X = \frac{1}{n} X_c^\top X_c$ and $\hat{\sigma}_{XY} = \frac{1}{n} X_c^\top Y^c$. This implies that the GRR estimator can be written as $\hat{\beta}_\Delta = \hat{\Sigma}_{X\Delta}^{-1} \hat{\Sigma}_X \hat{\beta}$, where $\hat{\Sigma}_{X\Delta} = \hat{\Sigma}_X + \Delta$.
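This identity is easy to check numerically; the sketch below (reusing the illustrative grr() function from the previous subsection) verifies that the GRR estimate equals $\hat{\Sigma}_{X\Delta}^{-1} \hat{\Sigma}_X \hat{\beta}$ on simulated data.

```r
## Numeric check of beta_hat_Delta = SigmaXDelta^{-1} SigmaX beta_hat
set.seed(2)
n <- 200; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- drop(1 + X %*% c(0.5, -0.25, 0) + rnorm(n))
Delta <- diag(0.1, p)
Xc <- scale(X, scale = FALSE); yc <- y - mean(y)
SigmaX <- crossprod(Xc) / n                       # Sigma_hat_X
betaOLS <- drop(solve(SigmaX, crossprod(Xc, yc) / n))
betaGRR <- drop(solve(SigmaX + Delta, SigmaX %*% betaOLS))
max(abs(betaGRR - grr(X, y, Delta)$beta))         # should be numerically zero
```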
Lemma 1.
Given assumptions A1–A3, the GRR estimator $\hat{\beta}_\Delta$ from Equation (4) is asymptotically normal with mean vector $\beta_\Delta$ and covariance matrix $\frac{1}{n} \Sigma_{X\Delta}^{-1} \Omega_X \Sigma_{X\Delta}^{-1}$, i.e.,
$$\sqrt{n} \left( \hat{\beta}_\Delta - \beta_\Delta \right) \overset{d}{\to} N\!\left( 0, \, \Sigma_{X\Delta}^{-1} \Omega_X \Sigma_{X\Delta}^{-1} \right)$$
as $n \to \infty$, where the notation $\overset{d}{\to}$ denotes convergence in distribution.
Lemma 1 can be proved using results of White [46], who demonstrated that the OLS estimator $\hat{\beta}$ is asymptotically normal with mean vector $\beta$ and covariance matrix $\frac{1}{n} \Sigma_X^{-1} \Omega_X \Sigma_X^{-1}$ under assumptions A1–A3. Noting that $\hat{\beta}_\Delta = \hat{\Sigma}_{X\Delta}^{-1} \hat{\Sigma}_X \hat{\beta}$ and $\beta_\Delta = \Sigma_{X\Delta}^{-1} \Sigma_X \beta$ completes the proof of Lemma 1, given that $\hat{\Sigma}_X$ is a consistent estimator of $\Sigma_X$. See Appendix A.1 for details.
Lemma 2.
Consider the linear model in Equation (2) with the assumptions $\beta \sim N(0_p, \frac{\sigma^2}{n} \Delta^{-1})$ and $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$, and suppose that $\beta$ and $\epsilon_i$ are independent of one another. Under these assumptions, the posterior distribution of $\beta$ given $Y$ is multivariate normal with mean vector $\hat{\beta}_\Delta$ and covariance matrix $\frac{\sigma^2}{n} \hat{\Sigma}_{X\Delta}^{-1}$. The asymptotic mean vector is $\beta_\Delta$ and the asymptotic covariance matrix is $\frac{\sigma^2}{n} \Sigma_{X\Delta}^{-1}$.
Lemma 2 can be proved by using the results of Henderson [47,48], who derived the covariance matrix of the best linear unbiased estimator (BLUE) and the best linear unbiased predictor (BLUP) in linear mixed models (e.g., see [49]). The asymptotic mean vector and covariance matrix result from the facts that Σ ^ X and σ ^ X Y are consistent estimators of Σ X and σ X Y , respectively. See Appendix A.2 for details.

2.3. Test Statistics

Consider the linear model in Equation (2), and suppose that we want to test the null hypothesis $H_0: \beta = 0_p$ versus the alternative hypothesis $H_1: \beta \neq 0_p$, where the notation $0_p$ denotes a $p \times 1$ vector of zeros. If $\beta \sim N(0_p, \frac{\sigma^2}{n} \Delta^{-1})$ and $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$ are independent of one another, one may consider using the F statistic
$$F = \frac{n}{p \, \hat{\sigma}^2} \, \hat{\beta}_\Delta^\top \hat{\Sigma}_{X\Delta} \hat{\beta}_\Delta \qquad (5)$$
where $\hat{\sigma}^2 = \frac{1}{n-p-1} \sum_{i=1}^n \hat{\epsilon}_i^2$ is the estimated error variance, and $\hat{\epsilon}_i = Y_i - \hat{Y}_i$ is the residual with $\hat{Y}_i = \hat{\alpha} + X_i^\top \hat{\beta}_\Delta$ denoting the fitted value for the $i$-th observation. Note that if $H_0$ is true and $\beta$ is fixed, then the F statistic would approach an F distribution with degrees of freedom parameters $p$ and $n-p-1$ as $\Delta \to 0$. However, under the assumptions of Lemma 2, the F statistic in Equation (5) will not follow an $F_{p, n-p-1}$ distribution under $H_0$, and it may produce asymptotically invalid results when used in a permutation test.
The assumption $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$ may be questionable in many real data situations, where the error terms may be non-Gaussian and/or heteroscedastic. If the error terms are heteroscedastic, i.e., if $E(\epsilon_i^2) = \sigma_i^2$ varies across observations, then the F statistic may not produce valid results even in a permutation test (see [44,50,51]). Note that even if $\epsilon_i \overset{iid}{\sim} (0, \sigma^2)$ for some non-Gaussian distribution, the F statistic may produce asymptotically invalid results when used in a permutation test. Instead, consider the Wald test statistic
$$W = n \, \hat{\beta}_\Delta^\top \left( \hat{\Sigma}_{X\Delta}^{-1} \hat{\Omega}_X \hat{\Sigma}_{X\Delta}^{-1} \right)^{-1} \hat{\beta}_\Delta \qquad (6)$$
where $\hat{\Omega}_X = \frac{1}{n} X_c^\top D_Y^2 X_c$ with $D_Y = \mathrm{diag}(Y_1^c, \ldots, Y_n^c)$. Under assumptions A1–A3, the W statistic asymptotically follows a $\chi_p^2$ distribution when $H_0$ is true, which is a result of Lemma 1 and the consistency of the estimators $\hat{\Sigma}_X$ and $\hat{\Omega}_X$ under $H_0$.
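The following sketch computes both statistics in base R, continuing the illustrative notation of the grr() sketch; omnibus_stats() is a hypothetical helper name, and the centered residuals are equivalent to the intercept-adjusted residuals defined below Equation (5).

```r
## F statistic (Eq. 5) and robust Wald statistic (Eq. 6) for H0: beta = 0
omnibus_stats <- function(X, y, Delta) {
  n <- nrow(X); p <- ncol(X)
  Xc <- scale(X, scale = FALSE)
  yc <- y - mean(y)
  SXD <- crossprod(Xc) / n + Delta                # Sigma_hat_XDelta
  beta <- solve(SXD, crossprod(Xc, yc) / n)
  res <- drop(yc - Xc %*% beta)                   # residuals (intercept absorbed)
  sig2 <- sum(res^2) / (n - p - 1)                # error variance estimate
  Fstat <- drop(n * crossprod(beta, SXD %*% beta) / (p * sig2))
  OmegaX <- crossprod(Xc * yc) / n                # (1/n) Xc' D_Y^2 Xc
  V <- solve(SXD, OmegaX) %*% solve(SXD)          # sandwich covariance
  Wstat <- drop(n * crossprod(beta, solve(V, beta)))
  c(F = Fstat, W = Wstat)
}
```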

2.4. Permutation Inference

Let the vector $\pi = (\pi_1, \ldots, \pi_n)$ denote some permutation of the integers $\{1, \ldots, n\}$, and define $Y^\pi = (Y_{\pi_1}^c, \ldots, Y_{\pi_n}^c)^\top$ to be the permuted (and centered) response vector using the permutation vector $\pi$. Furthermore, let $F(Y^\pi, X)$ and $W(Y^\pi, X)$ denote the test statistics from Equations (5) and (6) calculated using the permuted response vector $Y^\pi$. When $X_i$ is independent of $\epsilon_i$, the permutation test conducted using $W(Y^\pi, X)$ will be exact, given that $\Omega_X = E(\sigma_i^2) \Sigma_X$ when $X_i$ and $\epsilon_i$ are independent. However, the permutation test using $F(Y^\pi, X)$ is not guaranteed to be exact or asymptotically valid, given that the asymptotic sampling distribution of $F(Y, X)$ may not be the same as the permutation distribution (see [44]). When there is dependence between $X_i$ and $\epsilon_i$, the permutation test conducted using $F(Y^\pi, X)$ will be inexact and asymptotically invalid, whereas the permutation test conducted using $W(Y^\pi, X)$ will be inexact and asymptotically valid. Define the additional assumption A4: $E(Y_i^4) < \infty$ and $E(X_{ij}^4) < \infty$ for all $j$.
Theorem 1.
Consider the linear model with assumptions A1–A4. When $\beta = 0_p$, the permutation distribution of $W(Y^\pi, X)$ converges to a $\chi_p^2$ distribution as $n \to \infty$.
Theorem 1 can be proved by combining the results in Lemma 1 with the results in Theorem 3.1 of DiCiccio and Romano [44], who derived the asymptotic nature of the permutation distribution of $W(Y^\pi, X)$ for the OLS estimator $\hat{\beta}$. Note that Theorem 1 reveals that a permutation test using $W(Y^\pi, X)$ will produce asymptotically level $\alpha$ rejection rates when the null hypothesis $H_0: \beta = 0_p$ is true.
Let the vector $\psi = (\psi_1, \ldots, \psi_n)$ denote a resigning vector where $\psi_i \in \{-1, 1\}$ for all $i$, and define $Y^\psi = (\psi_1 Y_1^c, \ldots, \psi_n Y_n^c)^\top$ to be the resigned (and centered) response vector using the resigning vector $\psi$. Furthermore, let $F(Y^\psi, X)$ and $W(Y^\psi, X)$ denote the test statistics from Equations (5) and (6) calculated using the resigned response vector $Y^\psi$. When $X_i$ is independent of $\epsilon_i$ and the errors are symmetric, the permutation test conducted using $W(Y^\psi, X)$ will be exact, but the permutation test using $F(Y^\psi, X)$ may be invalid. When $X_i$ and $\epsilon_i$ are dependent, the permutation test conducted using $F(Y^\psi, X)$ will be inexact and asymptotically invalid, whereas the permutation test conducted using $W(Y^\psi, X)$ will be inexact and asymptotically valid, regardless of whether or not the errors are symmetric.
Theorem 2.
Consider the linear model with assumptions A1–A4. When $\beta = 0_p$, the permutation distribution of $W(Y^\psi, X)$ converges to a $\chi_p^2$ distribution as $n \to \infty$.
Theorem 2 can be proved by combining the results in Lemma 1 with the results in Theorem 3.2 of DiCiccio and Romano [44], who derived the asymptotic nature of the permutation distribution of $W(Y^\psi, X)$ for the OLS estimator $\hat{\beta}$. Note that Theorem 2 reveals that a permutation test using $W(Y^\psi, X)$ will produce asymptotically level $\alpha$ rejection rates when the null hypothesis $H_0: \beta = 0_p$ is true.
Corollary 1.
Define $Y^{\pi\psi} = (\psi_1 Y_{\pi_1}^c, \ldots, \psi_n Y_{\pi_n}^c)^\top$ to be the permuted and resigned (centered) response vector using the permutation vector $\pi$ and resigning vector $\psi$. Consider the linear model with assumptions A1–A4. When $\beta = 0_p$, the permutation distribution of $W(Y^{\pi\psi}, X)$ converges to a $\chi_p^2$ distribution as $n \to \infty$, where $W(Y^{\pi\psi}, X)$ is the test statistic in Equation (6), calculated using the permuted and resigned response vector $Y^{\pi\psi}$.
Corollary 1 follows directly from the results of Theorems 1 and 2.
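A compact sketch of all three resampling schemes with the W statistic is given below, reusing the hypothetical omnibus_stats() helper. The resampled response is recentered inside omnibus_stats(), which is innocuous under the null hypothesis; in practice, the np.reg.test() function in the nptest package provides optimized implementations of these tests.

```r
## Omnibus permutation test with the robust W statistic: permute the
## centered response, sign-flip it, or both, then recompute the statistic
perm_test_W <- function(X, y, Delta, method = c("perm", "flip", "both"),
                        R = 1999) {
  method <- match.arg(method)
  n <- length(y)
  W0 <- omnibus_stats(X, y, Delta)["W"]           # observed statistic
  Wr <- replicate(R, {
    yc <- y - mean(y)
    if (method %in% c("perm", "both")) yc <- yc[sample(n)]
    if (method %in% c("flip", "both")) yc <- yc * sample(c(-1, 1), n, replace = TRUE)
    omnibus_stats(X, yc + mean(y), Delta)["W"]    # statistic on resampled response
  })
  mean(c(W0, Wr) >= W0)                           # permutation p-value
}
```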

3. Conditional Regression Tests

3.1. Model and Estimation

Given an independent sample of n observations, consider the linear regression model
$$Y_i = \alpha + X_i^\top \beta + Z_i^\top \gamma + \epsilon_i \qquad (7)$$
for $i \in \{1, \ldots, n\}$, where $Y_i$ is the $i$-th realization of the response variable $Y \in \mathbb{R}$, $X_i = (X_{i1}, \ldots, X_{ip})^\top$ and $Z_i = (Z_{i1}, \ldots, Z_{iq})^\top$ are the $i$-th realizations of the predictor vectors $X = (X_1, \ldots, X_p)^\top \in \mathbb{R}^p$ and $Z = (Z_1, \ldots, Z_q)^\top \in \mathbb{R}^q$, and $\epsilon_i$ is the $i$-th realization of the error term $\epsilon \in \mathbb{R}$. The predictor variables are assumed to be partitioned into two sets, such that $X$ contains the variables of interest for inference purposes, and $Z$ contains the covariates that will be conditioned on. Let $M_i = (X_i^\top, Z_i^\top)^\top$ denote the combined predictor vector, and let $\theta = (\beta_1, \ldots, \beta_p, \gamma_1, \ldots, \gamma_q)^\top$ denote the combined coefficient vector. Furthermore, let $M_i^c = M_i - \bar{M}$ denote the mean centered (combined) predictor vector, where $\bar{M} = \frac{1}{n}\sum_{i=1}^n M_i$ is the sample average of the combined predictor vector.
To estimate the coefficients in θ , consider minimizing the GRR loss function
$$\frac{1}{n} (Y^c - M_c \theta)^\top (Y^c - M_c \theta) + \theta^\top \Delta \theta \qquad (8)$$
where $M_c = [M_1^c, \ldots, M_n^c]^\top$ is the centered design matrix, and $\Delta$ is an $r \times r$ symmetric and positive semi-definite penalty matrix (where $r = p + q$ is the total number of slope coefficients). The coefficients that minimize Equation (8) can be written as
$$\hat{\theta}_\Delta = \left( \tfrac{1}{n} M_c^\top M_c + \Delta \right)^{-1} \tfrac{1}{n} M_c^\top Y^c \qquad (9)$$
which is subscripted to emphasize that the estimated coefficients depend on the penalty matrix $\Delta$. Given the estimated slope vector $\hat{\theta}_\Delta$, the least squares estimate of the intercept has the form $\hat{\alpha} = \bar{Y} - \bar{M}^\top \hat{\theta}_\Delta$.

3.2. Asymptotic Distributions

Consider the linear regression model from Equation (7) with the assumptions:
B1.
$Y_i = \alpha + M_i^\top \theta + \epsilon_i$ with $E(\epsilon_i) = 0$ and $E(\epsilon_i^2) = \sigma_i^2 < \infty$ for $i = 1, \ldots, n$
B2.
$(Y_i, M_i)$ are iid from a distribution satisfying $E(\epsilon_i \mid M_i) = 0$
B3.
$\Sigma_M = E[(M_i - \mu_M)(M_i - \mu_M)^\top]$ and $\Omega_M = E[\epsilon_i^2 (M_i - \mu_M)(M_i - \mu_M)^\top]$ are nonsingular, where $\mu_M = E(M_i)$, and $\frac{1}{n} M_c^\top M_c + \Delta$ is almost surely invertible. In the fixed predictors case, the mean vector is $\mu_M = \frac{1}{n}\sum_{i=1}^n M_i$ and the covariance matrix terms are defined as $\Sigma_M = \frac{1}{n} M_c^\top M_c$ and $\Omega_M = \frac{1}{n} M_c^\top \Psi M_c$, where $\Psi = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$.
Given assumptions B1–B3, the GRR estimator provides an estimate of $\theta_\Delta = \Sigma_{M\Delta}^{-1} \sigma_{MY}$, where $\Sigma_{M\Delta} = \Sigma_M + \Delta$ and $\sigma_{MY} = E[(M_i - \mu_M)(Y_i - \mu_Y)]$ is the covariance between $M_i$ and $Y_i$. Note that $\theta_\Delta = \Sigma_{M\Delta}^{-1} \Sigma_M \theta$, where $\theta = \Sigma_M^{-1} \sigma_{MY}$ is estimated by the OLS estimator $\hat{\theta} = \hat{\Sigma}_M^{-1} \hat{\sigma}_{MY}$ with $\hat{\Sigma}_M = \frac{1}{n} M_c^\top M_c$ and $\hat{\sigma}_{MY} = \frac{1}{n} M_c^\top Y^c$. This implies that the GRR estimator can be written as $\hat{\theta}_\Delta = \hat{\Sigma}_{M\Delta}^{-1} \hat{\Sigma}_M \hat{\theta}$, where $\hat{\Sigma}_{M\Delta} = \hat{\Sigma}_M + \Delta$.
Lemma 3.
Given assumptions B1–B3, the GRR estimator $\hat{\theta}_\Delta$ from Equation (9) is asymptotically normal with mean vector $\theta_\Delta$ and covariance matrix $\frac{1}{n} \Sigma_{M\Delta}^{-1} \Omega_M \Sigma_{M\Delta}^{-1}$, i.e.,
$$\sqrt{n} \left( \hat{\theta}_\Delta - \theta_\Delta \right) \overset{d}{\to} N\!\left( 0, \, \Sigma_{M\Delta}^{-1} \Omega_M \Sigma_{M\Delta}^{-1} \right)$$
as $n \to \infty$, where the notation $\overset{d}{\to}$ denotes convergence in distribution.
Lemma 4.
Consider the linear model in Equation (7) with the assumptions $\theta \sim N(0_r, \frac{\sigma^2}{n} \Delta^{-1})$ and $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$, and suppose that $\theta$ and $\epsilon_i$ are independent of one another. Under these assumptions, the posterior distribution of $\theta$ given $Y$ is multivariate normal with mean vector $\hat{\theta}_\Delta$ and covariance matrix $\frac{\sigma^2}{n} \hat{\Sigma}_{M\Delta}^{-1}$. The asymptotic mean vector is $\theta_\Delta$ and the asymptotic covariance matrix is $\frac{\sigma^2}{n} \Sigma_{M\Delta}^{-1}$.
Note that Lemmas 3 and 4 can be proved using analogues of the results that were used to prove Lemmas 1 and 2. Specifically, for Lemma 3, we can use a direct analogue of the proof for Lemma 1 with $\theta_\Delta$ replacing $\beta_\Delta$ and the matrices $\Omega_M$ and $\Sigma_{M\Delta}$ replacing the matrices $\Omega_X$ and $\Sigma_{X\Delta}$. For Lemma 4, we can use a direct analogue of the proof for Lemma 2 with $\theta_\Delta$ replacing $\beta_\Delta$ and $\Sigma_{M\Delta}$ replacing $\Sigma_{X\Delta}$. For both cases, we need to replace the unpenalized coefficient vector $\beta$ with the unpenalized coefficient vector $\theta$.

3.3. Test Statistics

Consider the linear model in Equation (7), and suppose that we want to test the null hypothesis $H_0: \beta = 0_p$ versus the alternative hypothesis $H_1: \beta \neq 0_p$. This is the same null hypothesis that was considered in the previous section, but now the nuisance effects $Z_i^\top \gamma$ are included in the model (i.e., conditioned on) while testing the significance of the $\beta$ vector. Assuming that $\theta \sim N(0_r, \frac{\sigma^2}{n} \Delta^{-1})$ and $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$ are independent of one another, we could use the F test statistic
$$F = \frac{n}{p} \, \hat{\beta}_\Delta^\top \left( \hat{\sigma}^2 S \hat{\Sigma}_{M\Delta}^{-1} S^\top \right)^{-1} \hat{\beta}_\Delta \qquad (10)$$
where $S = [I_p, 0_{p \times q}]$ is a $p \times r$ selection matrix such that $S \theta_\Delta = \beta_\Delta$ (note that $I_p$ is the $p \times p$ identity matrix and $0_{p \times q}$ is a $p \times q$ matrix of zeros), $\hat{\sigma}^2 = \frac{1}{n-r-1} \sum_{i=1}^n \hat{\epsilon}_i^2$ is the estimated error variance, and $\hat{\epsilon}_i = Y_i - \hat{Y}_i$ is the residual with $\hat{Y}_i = \hat{\alpha} + M_i^\top \hat{\theta}_\Delta$ denoting the fitted value for the $i$-th observation.
When $H_0$ is true and the assumptions in Lemma 4 are met, the F statistic approaches an F distribution with degrees of freedom parameters $p$ and $n-r-1$ as $\Delta \to 0$. For non-zero penalties, the F statistic in Equation (10) will not follow an $F_{p, n-r-1}$ distribution, and may produce asymptotically invalid results when used in a permutation test, especially when the error terms are heteroscedastic (see [44,50,51]). In such cases, the Wald test statistic should be preferred:
$$W = n \, \hat{\beta}_\Delta^\top \left( S \hat{\Sigma}_{M\Delta}^{-1} \hat{\Omega}_M \hat{\Sigma}_{M\Delta}^{-1} S^\top \right)^{-1} \hat{\beta}_\Delta \qquad (11)$$
where $\hat{\Omega}_M = \frac{1}{n} M_c^\top D_{\hat{\epsilon}}^2 M_c$ with $D_{\hat{\epsilon}} = \mathrm{diag}(\hat{\epsilon}_1, \ldots, \hat{\epsilon}_n)$. Under assumptions B1–B3, the W statistic asymptotically follows a $\chi_p^2$ distribution when $H_0$ is true, which is a result of Lemma 3 (and the consistency of the estimators $\hat{\Sigma}_M$ and $\hat{\Omega}_M$).
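A base-R sketch of the conditional W statistic follows; conditional_W() is an illustrative name, and the construction mirrors Equation (11) with the selection matrix $S$ extracting the first $p$ coefficients.

```r
## Conditional Wald statistic (Eq. 11) for H0: beta = 0 with nuisance Z
conditional_W <- function(X, Z, y, Delta) {
  M <- cbind(X, Z)
  n <- nrow(M); p <- ncol(X); r <- ncol(M)
  Mc <- scale(M, scale = FALSE)
  yc <- y - mean(y)
  SMD <- crossprod(Mc) / n + Delta                # Sigma_hat_MDelta
  theta <- solve(SMD, crossprod(Mc, yc) / n)
  res <- drop(yc - Mc %*% theta)                  # residuals from the full fit
  OmegaM <- crossprod(Mc * res) / n               # (1/n) Mc' D_eps^2 Mc
  S <- cbind(diag(p), matrix(0, p, r - p))        # selection matrix [I_p, 0]
  V <- S %*% solve(SMD, OmegaM) %*% solve(SMD, t(S))
  beta <- drop(S %*% theta)
  drop(n * crossprod(beta, solve(V, beta)))       # W statistic
}
```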

3.4. Permutation Inference

Table 1 depicts eight different permutation methods that have been proposed for testing the significance of regression coefficients in the presence of nuisance parameters. The eight methods can be split into three different groups: (i) methods that permute the rows of X [52,53,54], (ii) methods that permute Y with Z included in the model [55,56,57], and (iii) methods that permute Y after partialling out Z [58,59,60]. All of these methods were originally proposed for use with the OLS estimator θ ^ and the F test statistic. Recent works have incorporated the use of the robust W test statistic with these permutation methods [44,50,51]. However, these authors only considered a theoretical analysis of the DS and FL permutation methods, and no previous works seem to have studied these methods using the GRR estimators from Equations (4) and (9).
To understand the motivation of the various permutation methods in Table 1, assume that the penalty matrix is block diagonal, i.e., $\Delta = \mathrm{bdiag}(\Delta_X, \Delta_Z)$, where $\Delta_X$ and $\Delta_Z$ denote the penalty matrices for $\beta$ and $\gamma$, respectively. Using the well-known form for the inverse of a block matrix [61,62,63,64], the coefficient estimates from Equation (9) have the form
$$\hat{\beta}_\Delta = \left( X_c^\top R_Z X_c + n \Delta_X \right)^{-1} X_c^\top R_Z Y^c \qquad \hat{\gamma}_\Delta = \left( Z_c^\top Z_c + n \Delta_Z \right)^{-1} Z_c^\top \left( Y^c - X_c \hat{\beta}_\Delta \right)$$
where $R_Z = I_n - Z_c (Z_c^\top Z_c + n \Delta_Z)^{-1} Z_c^\top$ is the residual forming matrix for the model that only includes the nuisance effects, i.e., $Y = \alpha + Z^\top \gamma + \epsilon$. This implies that the (centered) fitted values can be written as $\hat{Y}^c = X_c \hat{\beta}_\Delta + Z_c \hat{\gamma}_\Delta = R_Z X_c \hat{\beta}_\Delta + H_Z Y^c$, where $H_Z = Z_c (Z_c^\top Z_c + n \Delta_Z)^{-1} Z_c^\top$ is the hat matrix for the linear model that only includes the nuisance effects. Thus, when $\Delta_Z = 0_{q \times q}$, all of the permutation methods except SW will produce the same observed F statistic (when the permutation matrix is $P = I_n$).
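The block-inverse representation above can be checked numerically against the joint GRR solution from Equation (9); the following sketch (all names illustrative) confirms that the two agree.

```r
## Check: partitioned beta_hat_Delta equals the first block of the joint solve
set.seed(3)
n <- 100; p <- 2; q <- 2
X <- matrix(rnorm(n * p), n, p); Z <- matrix(rnorm(n * q), n, q)
y <- drop(1 + X %*% c(1, 0) + Z %*% c(0.5, -0.5) + rnorm(n))
DX <- diag(0.05, p); DZ <- diag(0.05, q)
Xc <- scale(X, scale = FALSE); Zc <- scale(Z, scale = FALSE); yc <- y - mean(y)
Mc <- cbind(Xc, Zc)
Delta <- rbind(cbind(DX, matrix(0, p, q)), cbind(matrix(0, q, p), DZ))
theta <- solve(crossprod(Mc) / n + Delta, crossprod(Mc, yc) / n)
RZ <- diag(n) - Zc %*% solve(crossprod(Zc) + n * DZ, t(Zc))  # residual former
betaX <- solve(t(Xc) %*% RZ %*% Xc + n * DX, t(Xc) %*% RZ %*% yc)
max(abs(betaX - theta[1:p]))                      # should be numerically zero
```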
Consider the additional assumption that the response and predictor variables have finite fourth moments, i.e., B4: $E(Y_i^4) < \infty$, $E(X_{ij}^4) < \infty$ for all $j$, and $E(Z_{ik}^4) < \infty$ for all $k$. Assume B1–B4 and that the null hypothesis $H_0: \beta = 0_p$ is true. Using the W test statistic from Equation (11), the following can be said about the finite sample and asymptotic properties of the various permutation methods in Table 1: the DS method is exact when $X \perp (Y, Z)$ and asymptotically valid otherwise; the MA method is exact when $Y \perp (X, Z)$ and asymptotically valid otherwise; the SW method is inexact and asymptotically valid only when $E[(X - \mu_X)(Z - \mu_Z)^\top] = 0_{p \times q}$; the other five methods (OS, FL, TB, KC, HJ) are inexact and asymptotically valid.
The asymptotic behaviors of the DS and FL methods were proved by DiCiccio and Romano [44]. The asymptotic validity of the OS method can be proved using a similar result as used for the DS method, given that $\frac{1}{n} X_c^\top R_Z X_c$ is a consistent estimator of $\Sigma_X - \Sigma_{XZ} \Sigma_{Z\Delta}^{-1} \Sigma_{ZX}$ and $\frac{1}{n} X_c^\top R_Z Y^c$ is a consistent estimator of $\sigma_{XY} - \Sigma_{XZ} \Sigma_{Z\Delta}^{-1} \sigma_{ZY}$. The asymptotic validity of the MA method can also be proved using a similar result as used for the DS method. The asymptotic validity of the TB method can be proved using a similar result as used for the FL method, given that $\frac{1}{n} M_c^\top R_M M_c$ is a consistent estimator of $\Sigma_M - \Sigma_M \Sigma_{M\Delta}^{-1} \Sigma_M$ and $\frac{1}{n} M_c^\top R_M Y^c$ is a consistent estimator of $\sigma_{MY} - \Sigma_M \Sigma_{M\Delta}^{-1} \sigma_{MY}$. Finally, note that the KC and HJ methods are asymptotically equivalent, and the SW method is asymptotically equivalent to the KC and HJ methods when $X$ and $Z$ are uncorrelated. The asymptotic validity of the KC and HJ methods follows from the results in Theorem 1, given that these methods permute the response after partialling out the nuisance effects. It is important to note that if $X$ and $Z$ are correlated, the SW method will produce asymptotically invalid results because $Z$ is partialled out of $Y$ but not $X$.
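To illustrate how the nuisance-aware schemes operate, the sketch below implements the Freedman-Lane (FL) strategy with the robust W statistic: the resampled response is the nuisance fit plus permuted nuisance residuals, $H_Z Y^c + P R_Z Y^c$. It reuses the hypothetical conditional_W() helper and assumes a block diagonal penalty; the np.reg.test() function in nptest implements all eight methods of Table 1.

```r
## Freedman-Lane permutation test with the conditional W statistic
fl_perm_test <- function(X, Z, y, Delta, R = 1999) {
  n <- length(y); p <- ncol(X); q <- ncol(Z)
  Zc <- scale(Z, scale = FALSE)
  yc <- y - mean(y)
  DZ <- Delta[(p + 1):(p + q), (p + 1):(p + q), drop = FALSE]
  HZ <- Zc %*% solve(crossprod(Zc) + n * DZ, t(Zc))  # nuisance hat matrix
  fitZ <- drop(HZ %*% yc)                            # H_Z y
  resZ <- yc - fitZ                                  # R_Z y
  W0 <- conditional_W(X, Z, y, Delta)                # observed statistic
  Wr <- replicate(R, conditional_W(X, Z, mean(y) + fitZ + resZ[sample(n)], Delta))
  mean(c(W0, Wr) >= W0)                              # permutation p-value
}
```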

4. Simulation Studies

4.1. Simulation A

The first simulation study was designed to explore the validity of the claims made about the omnibus permutation tests discussed in Section 2. Simulation A was a fully crossed design that manipulated three design factors: (i) the data generating distribution (three levels: left skew normal, standard normal, right skew normal), (ii) the error standard deviation (three levels: constant, increasing, and parabolic), and (iii) the sample size (five levels: $n \in \{10, 25, 50, 100, 200\}$); see Figure 1. For each of the 45 combinations of simulation design parameters (3 error distributions × 3 error standard deviations × 5 sample sizes), I generated 10,000 independent copies of the data from the model in Equation (1) with $X_i = (i-1)/(n-1)$ and $\eta(X_i) = 0$ (so that $Y_i = \epsilon_i$). For each generated sample of data, the ss() function in the npreg R package [65] was used to fit a cubic smoothing spline with $r = 5$ knots, which were placed at the quantiles of the predictor scores. The generalized cross-validation (GCV) criterion [33] was used to select the smoothing parameter $\lambda$.
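A single replicate of this design can be sketched as follows, reusing the illustrative rk() and perm_test_W() helpers from earlier sections. The sketch fixes the smoothing parameter instead of tuning it by GCV and uses the parabolic (PAR) heteroscedasticity pattern from Figure 1; across many replicates, the rejection rate of the W-based permutation test should be near the nominal level.

```r
## One Simulation A replicate under the null with PAR heteroscedasticity
set.seed(4)
n <- 100
x <- (seq_len(n) - 1) / (n - 1)                   # X_i = (i - 1)/(n - 1)
sigma <- 1/2 + 4 * (x - 1/2)^2                    # PAR error standard deviation
y <- rnorm(n, mean = 0, sd = sigma)               # eta(x) = 0, so Y_i = eps_i
knots <- quantile(x, seq(0, 1, length.out = 5))   # r = 5 knots at quantiles
B <- cbind(x, outer(x, knots, rk))                # spline basis [N, R]
Qs <- rbind(0, cbind(0, outer(knots, knots, rk))) # penalty bdiag(0, Q)
perm_test_W(B, y, Delta = 1e-3 * Qs, method = "both", R = 999)
```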
Ten different inference methods (six permutation tests and four parametric tests) were used to test the null hypothesis of no functional relationship. The six permutation tests were formed using the two test statistics (F and W) combined with the three permutation methods discussed in Section 2: permute Y, sign-flip Y, and both permuting and sign-flipping. The permutation tests were implemented using the np.reg.test() function in the nptest R package [66] using the default number of resamples (the default uses $R = \min(R_0, 9999)$ resamples, where $R_0$ is the number of elements of the exact null distribution: $R_0 = n!$ for the permute method, $R_0 = 2^n$ for the sign-flip method, and $R_0 = n! \, 2^n$ for the combined method). The first two parametric tests were formed by comparing the F statistic from Equation (5) to an $F_{p, n-p-1}$ distribution, and the W statistic from Equation (6) to a $\chi_p^2$ distribution. The other two parametric tests were the F tests that are implemented by the mgcv R package [67] and the npreg R package [65]. Note that these F tests compare the F statistic from Equation (5) to an $F_{\nu, n-\nu-1}$ distribution, where $\nu$ is an estimate of the effective degrees of freedom. The npreg package defines $\nu$ as the trace of the smoothing matrix, whereas the mgcv package uses a more complex estimate of $\nu$ (see [42]).
Figure 2 displays the type I error rate for each inference method in each combination of Simulation A conditions. All of the inference methods perform similarly across the three data generating distributions, so the following discussion of the results applies to each of the three data generating distributions. When using the W test statistic, the three permutation methods produced accurate type I error rates for all combinations of data generating conditions (see red square, blue circle, and green triangle), and the parametric $\chi_p^2$ approximation produced asymptotically accurate results (see purple plus sign). When using the F test statistic, the permutation methods produced inflated type I error rates (see orange x, yellow diamond, and brown upside-down triangle), and the parametric $F_{p, n-p-1}$ approximation produced inconsistent results across the different error standard deviation conditions (see pink square with x). The parametric $F_{\nu, n-\nu-1}$ tests implemented by the mgcv and npreg packages also produced inconsistent results across the different error standard deviation conditions, such that the type I error rate is inflated in the increasing and parabolic conditions (see gray asterisk and black diamond with plus sign).

4.2. Simulation B

The second simulation study was designed to explore the validity of the claims made about the conditional permutation tests discussed in Section 3. Simulation B was a fully crossed design that manipulated two design factors: (i) the error standard deviation (three levels: constant, increasing, and parabolic), and (ii) the sample size (five levels: $n \in \{10, 25, 50, 100, 200\}$). Given that the results in Simulation A did not noticeably differ across the three data generating distributions, errors were generated from a normal distribution throughout Simulation B. For each of the 15 combinations of data generating parameters (3 error standard deviations × 5 sample sizes), I generated 10,000 independent copies of the data from the model in Equation (1) with $X_i = (i-1)/(n-1)$ and $\eta(X_i) = X_i$ (unlike Simulation A, the data generating mean function now includes a linear effect). As in the previous simulation, the ss() function in the npreg R package [65] was used to fit a cubic smoothing spline with $r = 5$ knots, and the GCV criterion was used to select the smoothing parameter.
The null hypothesis of a linear relationship was tested using a total of 19 inference methods: 16 permutation tests and 3 parametric tests. The 16 permutation tests were formed using the two test statistics (F and W) combined with the eight permutation methods in Table 1. As in Simulation A, the permutation tests were implemented using the np.reg.test() function in the nptest R package [66] using the default number of resamples (i.e., $R = 9999$) to form the permutation distribution. The first two parametric tests were formed by comparing the F statistic from Equation (10) to an $F_{p, n-r-1}$ distribution, and the W statistic from Equation (11) to a $\chi_p^2$ distribution. The other parametric test is the F test that is implemented by the npreg package, which compares the F statistic from Equation (10) to an $F_{\nu - m, n - \nu - 1}$ distribution. Note that the "cardinal" spline parameterization used in the mgcv R package does not separate the linear and non-linear portions of the function; thus, a test of linearity is not possible using this package.
Figure 3 displays the type I error rate for each inference method in each combination of Simulation B conditions. When the errors have constant variance, using the F statistic produces (i) inflated type I error rates using all permutation methods, (ii) deflated type I error rates using the parametric test, and (iii) asymptotically accurate error rates using the npreg test. In contrast, when using the W test statistic, all of the inference methods except SW produce asymptotically accurate error rates when the errors have constant variance. As expected, the F statistic produces asymptotically invalid results when the errors have non-constant variance, with the performance being the worst in the parabolic condition. In contrast, the W statistic produced asymptotically valid type I error rates when the errors have non-constant variance (using all methods except SW).

5. Discussion

5.1. Summary of Findings

Penalized splines are frequently used in applied research for understanding functional relationships between variables. Although there has been a considerable body of work on the estimation and computation of penalized splines, there have been relatively few papers on statistical inference about the nature of the functional relation. In most applications, the statistical inference for penalized splines is conducted using the random effects or Bayesian interpretation of a smoothing spline. These inferential frameworks rely on parametric assumptions about the unknown function and error terms, i.e., that η is a Gaussian process and ϵ i are iid Gaussian variables. Even when these parametric assumptions are met, valid statistical inference can be challenging due to the need for a reasonable estimate of the degrees of freedom of the penalized spline estimate. Furthermore, when these parametric assumptions are incorrect (e.g., due to heteroscedastic and/or non-Gaussian errors), the standard inferential tools can produce substantially misleading results.
In this paper, I developed a flexible and robust permutation testing framework for inference using penalized splines. Unlike the existing methods for statistical inference with penalized splines, the proposed methods can provide asymptotically valid inferential results under a variety of data generating situations. Furthermore, unlike a majority of the existing methods for hypothesis testing with penalized splines, the proposed approach can be flexibly adapted to test a variety of standard and non-standard hypotheses about the nature of the functional relationships between variables. The omnibus test in Section 2 is frequently of interest in practical applications, but has been ignored by most inferential tests for penalized splines (e.g., see [42]). The conditional test in Section 3 can be used to test the classic hypothesis $\eta \in \mathcal{H}_0$ (which has been considered in many papers), as well as more specialized hypotheses about the main and/or interaction effects of predictors.
The simulation results in Section 4 clearly demonstrate the benefits of the proposed approach over standard methods used for inference with penalized splines. In particular, the simulation results demonstrate that classic (parametric) methods that rely on an F test statistic can produce substantially inflated type I error rates when the error terms are heteroscedastic. Furthermore, the simulation results demonstrate that the F test statistic can even produce inaccurate results when used in a permutation test. In contrast, the permutation tests using the robust W test statistic can produce exact results for the omnibus tests and asymptotically valid results for the conditional tests, even when the errors depend on the predictor(s). Moreover, the $\chi_p^2$ approximation using the robust W statistic produced asymptotically valid results that were reasonably close to the nominal $\alpha = 0.05$ rate with only $n = 100$ (for the omnibus test) or $n = 200$ (for the conditional test).

5.2. Future Directions

Although the simulation studies only explored the omnibus and conditional tests with a single predictor, the theoretical results in Section 2 and Section 3 can be readily applied to penalized spline models with multiple predictors. With d > 1 predictors, the number of possible hypotheses that could be tested increases, given that it is possible to test omnibus or conditional tests about any of the main or interaction effects. The simulation results for a single predictor revealed that the permutation methods of Kennedy and Cade [59] and Huh and Jhun [60] performed the best for conditional tests with a single predictor, so I hypothesize that these procedures will be ideal for tests of main and/or interaction effects with d > 1 predictors. However, future theoretical and Monte Carlo simulation work is needed to determine which permutation strategies should be preferred for conditional tests of main and/or interaction effects of multiple predictors.
The theoretical results in Section 2 and Section 3 were developed for the classic formulation of penalized splines, which solves a GRR problem to obtain the function estimate. Note that the penalized spline GRR problem is a special case of the elastic net penalty [68] that is used in the recently proposed kernel eigenvector smoothing and selection operator (kesso) regression method [26]. Although the theorems derived in Section 2 and Section 3 assume a ridge penalty, it is straightforward to develop extensions of these formulas for elastic net penalties used in the kesso regression. Specifically, the vector version of the coefficients, which is given after Equation (8) in [26], can be used to express the kesso coefficients as a modification of the least squares coefficients. However, future Monte Carlo research is needed to explore the accuracy of the various permutation strategies when using elastic net penalties to fit penalized spline regression models.
Finally, it is worth noting that this paper only considered the GCV tuning method, which is one of several smoothing parameter selection methods available in the ss() function. Recent work has demonstrated that the different tuning methods tend to produce similar results across a wide variety of data generating conditions—including non-normal and/or heteroscedastic errors (see [25]). However, it was noted that the maximum likelihood based tuning method tended to perform (i) worse than the cross-validation based methods when the sample size was small and (ii) better than the cross-validation based methods when the errors were correlated. Given these past findings, I would not expect the simulation results in this paper to significantly change if a different tuning criterion were used instead of the GCV. However, future research should explore the performance of the proposed permutation testing approach using different combinations of tuning methods and permutation strategies to analyze data from a wide variety of data-generating conditions.

Funding

This research was funded by the following National Institutes of Health (NIH) grants: R01EY030890, R01MH115046, U01DA046413, and R43AG074740.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The supporting information contains the following files:
simA: R script for reproducing Simulation A (.R file)
simB: R script for reproducing Simulation B (.R file)
sstest: R function for omnibus and conditional smoothing spline permutation tests (.R file)

Conflicts of Interest

The author declares no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
BLUE: Best Linear Unbiased Estimator
BLUP: Best Linear Unbiased Predictor
GCV: Generalized Cross-Validation
GRR: Generalized Ridge Regression
OLS: Ordinary Least Squares
PLS: Penalized Least Squares

Appendix A. Proofs

Appendix A.1. Proof of Lemma 1

To prove Lemma 1, one needs to demonstrate that under assumptions A1–A3, the OLS estimator $\hat{\beta} = \hat{\Sigma}_X^{-1} \hat{\sigma}_{XY}$ is asymptotically normal with mean vector $\beta = \Sigma_X^{-1} \sigma_{XY}$ and covariance matrix $\frac{1}{n} \Sigma_X^{-1} \Omega_X \Sigma_X^{-1}$. In other words, one needs to demonstrate that
$$\sqrt{n} \left( \hat{\beta} - \beta \right) \overset{d}{\to} N\!\left( 0, \, \Sigma_X^{-1} \Omega_X \Sigma_X^{-1} \right)$$
where $\overset{d}{\to}$ denotes that the random vector on the lefthand side converges in distribution to the probability distribution specified on the righthand side. This result, which is originally due to White [46], can be derived using basic rules of expectation and covariance operators in combination with a multivariate central limit theorem.
First, under assumptions A1–A3, note that the expectation of $\hat{\beta}$ is
$$\begin{aligned}
E(\hat{\beta}) &= E\left[ \hat{\Sigma}_X^{-1} \hat{\sigma}_{XY} \right] \\
&= E\left[ (X_c^\top X_c)^{-1} X_c^\top (\alpha 1_n + X \beta + \epsilon) \right] \\
&= E\left[ (X_c^\top X_c)^{-1} X_c^\top X_c \beta + (X_c^\top X_c)^{-1} X_c^\top \epsilon \right] \\
&= \beta
\end{aligned}$$
where the second line is due to the fact that $X_c^\top Y^c = X_c^\top Y$, the third line is due to the facts that $X_c^\top 1_n = 0_p$ and $X_c^\top X = X_c^\top X_c$ by definition, and the fourth line is due to the fact that $E(X_c^\top \epsilon) = 0_p$ by assumption A2.
Second, under assumptions A1–A3, note that the covariance matrix of $\hat{\beta}$ satisfies
$$\begin{aligned}
\mathrm{Cov}(\hat{\beta}) &= \mathrm{Cov}\left( \beta + (X_c^\top X_c)^{-1} X_c^\top \epsilon \right) \\
&= \mathrm{Cov}\left( (X_c^\top X_c)^{-1} X_c^\top \epsilon \right) \\
&= E\left[ (X_c^\top X_c)^{-1} X_c^\top \epsilon \epsilon^\top X_c (X_c^\top X_c)^{-1} \right] \\
&\asymp \tfrac{1}{n} \Sigma_X^{-1} \Omega_X \Sigma_X^{-1}
\end{aligned}$$
where the first line uses the previous result from the expectation derivation, the second line uses the fact that $\beta$ is a constant vector, the third line uses the fact that $E\left[ (X_c^\top X_c)^{-1} X_c^\top \epsilon \right] = 0_p$ by assumption A2, and the fourth line uses the definitions of $\Sigma_X$ and $\Omega_X$ given in assumption A3. Note that the notation $\asymp$ should be read as 'is asymptotically equal to'.
Thus, under assumptions A1–A3, the OLS coefficient estimates have mean vector $E(\hat{\beta}) = \beta$ and asymptotic covariance matrix $\mathrm{Cov}(\hat{\beta}) \asymp \frac{1}{n} \Sigma_X^{-1} \Omega_X \Sigma_X^{-1}$. The asymptotic multivariate normality of $\hat{\beta}$ results from applying the multivariate central limit theorem using the consistent estimators $\hat{\Sigma}_X = \frac{1}{n} X_c^\top X_c$ and $\hat{\Omega}_X = \frac{1}{n} X_c^\top \mathrm{diag}(\hat{\epsilon})^2 X_c$ in place of the unknown asymptotic covariance matrix components $\Sigma_X$ and $\Omega_X$.
Finally, note that the asymptotic expectation of $\hat{\beta}_\Delta$ has the form
$$E(\hat{\beta}_\Delta) \asymp \Sigma_{X\Delta}^{-1} \Sigma_X E(\hat{\beta}) = \Sigma_{X\Delta}^{-1} \Sigma_X \beta = \beta_\Delta$$
and the asymptotic covariance of $\hat{\beta}_\Delta$ has the form
$$\mathrm{Cov}(\hat{\beta}_\Delta) \asymp \Sigma_{X\Delta}^{-1} \Sigma_X \mathrm{Cov}(\hat{\beta}) \Sigma_X \Sigma_{X\Delta}^{-1} = \tfrac{1}{n} \Sigma_{X\Delta}^{-1} \Omega_X \Sigma_{X\Delta}^{-1}$$
given that $\hat{\beta}_\Delta = \hat{\Sigma}_{X\Delta}^{-1} \hat{\Sigma}_X \hat{\beta}$, which completes the proof.
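A quick Monte Carlo check of this sandwich form, in the scalar case with errors whose variance depends on the predictor, is sketched below: the empirical variance of $\sqrt{n} \, \hat{\beta}_\Delta$ should approach $\Sigma_{X\Delta}^{-1} \Omega_X \Sigma_{X\Delta}^{-1}$, which is computable in closed form for $x \sim U(0,1)$ (so that $\Sigma_X = 1/12$).

```r
## Monte Carlo check of the sandwich covariance in Lemma 1 (scalar case)
set.seed(5)
n <- 500; delta <- 0.1                            # scalar ridge penalty
one_fit <- function() {
  x <- runif(n)
  y <- rnorm(n, sd = 1/2 + x)                     # beta = 0, sd depends on x
  xc <- x - mean(x); yc <- y - mean(y)
  mean(xc * yc) / (mean(xc^2) + delta)            # scalar GRR estimate
}
betas <- replicate(5000, one_fit())
var(sqrt(n) * betas)                              # empirical variance
OmegaX <- integrate(function(x) (1/2 + x)^2 * (x - 1/2)^2, 0, 1)$value
OmegaX / (1/12 + delta)^2                         # sandwich variance target
```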

Appendix A.2. Proof of Lemma 2

To prove Lemma 2, one needs to demonstrate that under assumptions A1–A3 and the prior distribution assumptions, the posterior distribution of the coefficients is multivariate normal with mean vector $\hat{\beta}_\Delta = \hat{\Sigma}_{X\Delta}^{-1} \hat{\sigma}_{XY}$ and covariance matrix $\frac{\sigma^2}{n} \hat{\Sigma}_{X\Delta}^{-1}$. In other words, one needs to demonstrate that
$$\sqrt{n} \left( \hat{\beta}_\Delta - \beta \right) \mid Y \sim N\!\left( 0, \, \sigma^2 \hat{\Sigma}_{X\Delta}^{-1} \right)$$
where the sign of the term on the lefthand side has been flipped to emphasize the similarity in form to the result in Appendix A.1. This result, which was first given by Henderson [47,48], can be proven using a classic result of multivariate normal theory (e.g., see [69]).
Consider a partitioned vector $Z$ that has a multivariate normal distribution, i.e.,
$$Z = \begin{pmatrix} X \\ Y \end{pmatrix} \sim N\!\left( \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix} \right).$$
The posterior distribution of $X$ given $Y = y$ is multivariate normal, i.e., $(X \mid Y = y) \sim N(\mu_{X|Y}, \Sigma_{X|Y})$, where the posterior mean vector and covariance matrix have the form
$$\mu_{X|Y} = \mu_X + \Sigma_{XY} \Sigma_{YY}^{-1} (y - \mu_Y), \qquad \Sigma_{X|Y} = \Sigma_{XX} - \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX}.$$
In this case, $Z = (\beta^\top, Y^\top)^\top$, where $Y = \alpha 1_n + X_c \beta + \epsilon$ is the response vector. Note that the prior mean vectors are $\mu_\beta = 0_p$ and $\mu_Y = \alpha 1_n$, the prior covariance matrices are $\Sigma_{\beta\beta} = \frac{\sigma^2}{n} \Delta^{-1}$ and $\Sigma_{YY} = \sigma^2 (\frac{1}{n} X_c \Delta^{-1} X_c^\top + I_n)$, and the covariance between $\beta$ and $Y$ is $\Sigma_{\beta Y} = \frac{\sigma^2}{n} \Delta^{-1} X_c^\top$. Thus, the posterior distribution of $\beta$ given $Y$ has mean vector
$$\begin{aligned}
\mu_{\beta|Y} &= \mu_\beta + \Sigma_{\beta Y} \Sigma_{YY}^{-1} (y - \mu_Y) \\
&= \tfrac{1}{n} \Delta^{-1} X_c^\top \left( \tfrac{1}{n} X_c \Delta^{-1} X_c^\top + I_n \right)^{-1} Y^c \\
&= \tfrac{1}{n} \Delta^{-1} X_c^\top \left( I_n - X_c (X_c^\top X_c + n \Delta)^{-1} X_c^\top \right) Y^c \\
&= (X_c^\top X_c + n \Delta)^{-1} X_c^\top Y^c
\end{aligned}$$
where the second line plugs in the definitions of the parameters, the third line uses Equation (17) of Henderson and Searle [70] to more conveniently write the matrix inverse, and the last line results from straightforward algebraic simplification of the third line. Similarly, the posterior covariance matrix of $\beta$ given $Y$ has the form
$$\begin{aligned}
\Sigma_{\beta|Y} &= \Sigma_{\beta\beta} - \Sigma_{\beta Y} \Sigma_{YY}^{-1} \Sigma_{Y\beta} \\
&= \tfrac{\sigma^2}{n} \left( \Delta^{-1} - \tfrac{1}{n} \Delta^{-1} X_c^\top \left( \tfrac{1}{n} X_c \Delta^{-1} X_c^\top + I_n \right)^{-1} X_c \Delta^{-1} \right) \\
&= \tfrac{\sigma^2}{n} \left( \Delta^{-1} - \tfrac{1}{n} \Delta^{-1} X_c^\top \left( I_n - X_c (X_c^\top X_c + n \Delta)^{-1} X_c^\top \right) X_c \Delta^{-1} \right) \\
&= \sigma^2 (X_c^\top X_c + n \Delta)^{-1}
\end{aligned}$$
where the second line plugs in the definitions of the parameters, the third line plugs in the convenient representation of $\Sigma_{YY}^{-1}$ from Equation (17) of Henderson and Searle [70], and the final line results from the straightforward algebraic manipulation of the third line. Noting that $\mu_{\beta|Y} = \hat{\beta}_\Delta$ and $\Sigma_{\beta|Y} = \frac{\sigma^2}{n} \hat{\Sigma}_{X\Delta}^{-1}$ completes the proof.
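The Henderson-Searle simplification used in both derivations is easy to verify numerically, as in this short sketch: the posterior mean expression on the second line above equals the ridge-type solution on the last line.

```r
## Check: posterior mean form equals (Xc'Xc + n Delta)^{-1} Xc' yc
set.seed(6)
n <- 50; p <- 3
Xc <- scale(matrix(rnorm(n * p), n, p), scale = FALSE)
yc <- rnorm(n); yc <- yc - mean(yc)
Delta <- diag(0.2, p)
lhs <- (1/n) * solve(Delta, t(Xc)) %*%
  solve(Xc %*% solve(Delta, t(Xc)) / n + diag(n), yc)       # posterior mean form
rhs <- solve(crossprod(Xc) + n * Delta, crossprod(Xc, yc))  # ridge form
max(abs(lhs - rhs))                               # should be numerically zero
```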

References

  1. Fox, J. Quantitative Applications in the Social Sciences: Multiple and Generalized Nonparametric Regression; SAGE Publications, Inc.: Thousand Oaks, CA, USA, 2000. [Google Scholar] [CrossRef]
  2. Helwig, N.E. Multiple and Generalized Nonparametric Regression. In SAGE Research Methods Foundations; Atkinson, P., Delamont, S., Cernat, A., Sakshaug, J.W., Williams, R.A., Eds.; SAGE Publications, Inc.: London, England, 2020. [Google Scholar] [CrossRef]
  3. Wahba, G. Spline Models for Observational Data; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 1990. [Google Scholar]
  4. Wang, Y. Smoothing Splines: Methods and Applications; CRC Press: Boca Raton, FL, USA, 2011. [Google Scholar]
  5. Gu, C. Smoothing Spline ANOVA Models, 2nd ed.; Springer: New York, NY, USA, 2013. [Google Scholar] [CrossRef]
  6. Hastie, T.; Tibshirani, R. Generalized Additive Models; Chapman and Hall/CRC: New York, NY, USA, 1990. [Google Scholar]
  7. Ruppert, D.; Wand, M.P.; Carroll, R.J. Semiparametric Regression; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
  8. Wood, S.N. Generalized Additive Models: An Introduction with R, 2nd ed.; Chapman & Hall: Boca Raton, FL, USA, 2017. [Google Scholar]
  9. Almquist, Z.W.; Helwig, N.E.; You, Y. Connecting Continuum of Care point-in-time homeless counts to United States Census areal units. Math. Popul. Stud. 2020, 27, 46–58. [Google Scholar] [CrossRef]
  10. Kage, C.C.; Helwig, N.E.; Ellingson, A.M. Normative cervical spine kinematics of a circumduction task. J. Electromyogr. Kinesiol. 2021, 61, 102591. [Google Scholar] [CrossRef]
  11. Helwig, N.E.; Shorter, K.A.; Hsiao-Wecksler, E.T.; Ma, P. Smoothing spline analysis of variance models: A new tool for the analysis of cyclic biomechaniacal data. J. Biomech. 2016, 49, 3216–3222. [Google Scholar] [CrossRef] [PubMed]
  12. Hammell, A.E.; Helwig, N.E.; Kaczkurkin, A.N.; Sponheim, S.R.; Lissek, S. The temporal course of over-generalized conditioned threat expectancies in posttraumatic stress disorder. Behav. Res. Ther. 2020, 124, 103513. [Google Scholar] [CrossRef] [PubMed]
  13. Helwig, N.E.; Sohre, N.E.; Ruprecht, M.R.; Guy, S.J.; Lyford-Pike, S. Dynamic properties of successful smiles. PLoS ONE 2017, 12, e0179708. [Google Scholar] [CrossRef]
  14. Helwig, N.E.; Ruprecht, M.R. Age, gender, and self-esteem: A sociocultural look through a nonparametric lens. Arch. Sci. Psychol. 2017, 5, 19–31. [Google Scholar] [CrossRef]
  15. Helwig, N.E.; Gao, Y.; Wang, S.; Ma, P. Analyzing spatiotemporal trends in social media data via smoothing spline analysis of variance. Spat. Stat. 2015, 14, 491–504. [Google Scholar] [CrossRef]
  16. Helwig, N.E. Regression with ordered predictors via ordinal smoothing splines. Front. Appl. Math. Stat. 2017, 3, 1–13. [Google Scholar] [CrossRef]
  17. Gu, C. Nonparametric regression with ordinal responses. Stat 2021, 10, e365. [Google Scholar] [CrossRef]
  18. Gu, C.; Ma, P. Optimal smoothing in nonparametric mixed-effect models. Ann. Stat. 2005, 33, 1357–1379. [Google Scholar] [CrossRef]
  19. Gu, C.; Ma, P. Generalized Nonparametric Mixed-Effect Models: Computation and Smoothing Parameter Selection. J. Comput. Graph. Stat. 2005, 14, 485–504. [Google Scholar] [CrossRef]
  20. Helwig, N.E. Efficient estimation of variance components in nonparametric mixed-effects models with large samples. Stat. Comput. 2016, 26, 1319–1336. [Google Scholar] [CrossRef]
  21. Kim, Y.J.; Gu, C. Smoothing spline Gaussian regression: More scalable computation via efficient approximation. J. R. Stat. Soc. Ser. B 2004, 66, 337–356. [Google Scholar] [CrossRef]
  22. Gu, C.; Kim, Y.J. Penalized likelihood regression: General formulation and efficient approximation. Can. J. Stat. 2002, 30, 619–628. [Google Scholar] [CrossRef]
  23. Helwig, N.E.; Ma, P. Fast and stable multiple smoothing parameter selection in smoothing spline analysis of variance models with large samples. J. Comput. Graph. Stat. 2015, 24, 715–732. [Google Scholar] [CrossRef]
  24. Helwig, N.E.; Ma, P. Smoothing spline ANOVA for super-large samples: Scalable computation via rounding parameters. Stat. Interface 2016, 9, 433–444. [Google Scholar] [CrossRef]
  25. Berry, L.N.; Helwig, N.E. Cross-validation, information theory, or maximum likelihood? A comparison of tuning methods for penalized splines. Stats 2021, 4, 701–724. [Google Scholar] [CrossRef]
  26. Helwig, N.E. Spectrally sparse nonparametric regression via elastic net regularized smoothers. J. Comput. Graph. Stat. 2021, 30, 182–191. [Google Scholar] [CrossRef]
  27. Kimeldorf, G.; Wahba, G. Some results on Tchebycheffian spline functions. J. Math. Anal. Appl. 1971, 33, 82–95. [Google Scholar] [CrossRef]
  28. Ma, P.; Huang, J.; Zhang, N. Efficient computation of smoothing splines via adaptive basis sampling. Biometrika 2015, 102, 631–645. [Google Scholar] [CrossRef]
  29. Moore, E.H. On the reciprocal of the general algebraic matrix. Bull. Am. Math. Soc. 1920, 26, 394–395. [Google Scholar] [CrossRef]
  30. Penrose, R. A generalized inverse for matrices. Math. Proc. Camb. Philos. Soc. 1955, 51, 406–413. [Google Scholar] [CrossRef]
  31. Wahba, G. Bayesian “confidence intervals” for the cross-validated smoothing spline. J. R. Stat. Soc. Ser. B 1983, 45, 133–150. [Google Scholar] [CrossRef]
  32. Nychka, D. Bayesian confidence intervals for smoothing splines. J. Am. Stat. Assoc. 1988, 83, 1134–1143. [Google Scholar] [CrossRef]
  33. Craven, P.; Wahba, G. Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math. 1979, 31, 377–403. [Google Scholar] [CrossRef]
  34. Gu, C.; Wahba, G. Smoothing spline ANOVA with component-wise Bayesian “confidence intervals”. J. Comput. Graph. Stat. 1993, 2, 97–117. [Google Scholar]
  35. Marra, G.; Wood, S.N. Coverage properties of confidence intervals for generalized additive model components. Scand. J. Stat. 2012, 39, 53–74. [Google Scholar] [CrossRef]
  36. Cox, D.; Koh, E.; Wahba, G.; Yandell, B.S. Testing the (Parametric) Null Model Hypothesis in (Semiparametric) Partial and Generalized Spline Models. Ann. Stat. 1988, 16, 113–119. [Google Scholar] [CrossRef]
  37. Zhang, D.; Lin, X. Hypothesis testing in semiparametric additive mixed models. Biostatistics 2003, 4, 57–74. [Google Scholar] [CrossRef]
  38. Liu, A.; Wang, Y. Hypothesis testing in smoothing spline models. J. Stat. Comput. Simul. 2004, 74, 581–597. [Google Scholar] [CrossRef]
  39. Crainiceanu, C.; Ruppert, D.; Claeskens, G.; Wand, M.P. Exact likelihood ratio tests for penalised splines. Biometrika 2005, 92, 91–103. [Google Scholar] [CrossRef]
  40. Scheipl, F.; Greven, S.; Küchenhoff, H. Size and power of tests for a zero random effect variance or polynomial regression in additive and linear mixed models. Comput. Stat. Data Anal. 2008, 52, 3283–3299. [Google Scholar] [CrossRef]
  41. Nummi, T.; Pan, J.; Siren, T.; Liu, K. Testing for Cubic Smoothing Splines under Dependent Data. Biometrics 2011, 67, 871–875. [Google Scholar] [CrossRef] [PubMed]
  42. Wood, S.N. On p-values for smooth components of an extended generalized additive model. Biometrika 2013, 100, 221–228. [Google Scholar] [CrossRef]
  43. Wood, S.N. A simple test for random effects in regression models. Biometrika 2013, 100, 1005–1010. [Google Scholar] [CrossRef]
  44. DiCiccio, C.J.; Romano, J.P. Robust Permutation Tests For Correlation And Regression Coefficients. J. Am. Stat. Assoc. 2017, 112, 1211–1220. [Google Scholar] [CrossRef]
  45. Hoerl, A.; Kennard, R. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 1970, 12, 55–67. [Google Scholar] [CrossRef]
  46. White, H. A Heteroscedasticity-Consistent Covariance Matrix and a Direct Test for Heteroscedasticity. Econometrica 1980, 48, 817–838. [Google Scholar] [CrossRef]
  47. Henderson, C.R. Estimation of genetic parameters (abstract). Ann. Math. Stat. 1950, 21, 309–310. [Google Scholar]
  48. Henderson, C.R. Best Linear Unbiased Estimation and Prediction under a Selection Model. Biometrics 1975, 31, 423–447. [Google Scholar] [CrossRef]
  49. Robinson, G.K. That BLUP is a Good Thing: The Estimation of Random Effects. Stat. Sci. 1991, 6, 15–32. [Google Scholar] [CrossRef]
  50. Helwig, N.E. Robust nonparametric tests of general linear model coefficients: A comparison of permutation methods and test statistics. NeuroImage 2019, 201, 116030. [Google Scholar] [CrossRef] [PubMed]
  51. Helwig, N.E. Statistical nonparametric mapping: Multivariate permutation tests for location, correlation, and regression problems in neuroimaging. WIREs Comput. Stat. 2019, 2, e1457. [Google Scholar] [CrossRef]
52. Draper, N.R.; Stoneman, D.M. Testing for the Inclusion of Variables in Linear Regression by a Randomisation Technique. Technometrics 1966, 8, 695–699.
53. O’Gorman, T.W. The Performance of Randomization Tests that Use Permutations of Independent Variables. Commun. Stat. Simul. Comput. 2005, 34, 895–908.
54. Nichols, T.E.; Ridgway, G.R.; Webster, M.G.; Smith, S.M. GLM permutation: Nonparametric inference for arbitrary general linear models. NeuroImage 2008, 41, S72.
55. Manly, B. Randomization and regression methods for testing for associations with geographical, environmental and biological distances between populations. Res. Popul. Ecol. 1986, 28, 201–218.
56. Freedman, D.; Lane, D. A Nonstochastic Interpretation of Reported Significance Levels. J. Bus. Econ. Stat. 1983, 1, 292–298.
57. ter Braak, C.J.F. Permutation Versus Bootstrap Significance Tests in Multiple Regression and ANOVA. In Bootstrapping and Related Techniques; Lecture Notes in Economics and Mathematical Systems; Jöckel, K.H., Rothe, G., Sendler, W., Eds.; Springer: Berlin/Heidelberg, Germany, 1992; Volume 376, pp. 79–86.
58. Still, A.W.; White, A.P. The approximate randomization test as an alternative to the F test in analysis of variance. Br. J. Math. Stat. Psychol. 1981, 34, 243–252.
59. Kennedy, P.E.; Cade, B.S. Randomization tests for multiple regression. Commun. Stat. Simul. Comput. 1996, 25, 923–936.
60. Huh, M.H.; Jhun, M. Random Permutation Testing in Multiple Linear Regression. Commun. Stat. Theory Methods 2001, 30, 2023–2032.
61. Schur, J. Über Potenzreihen, die im Innern des Einheitskreises beschränkt sind. J. Reine Angew. Math. 1917, 147, 205–232.
62. Hotelling, H. Further Points on Matrix Calculation and Simultaneous Equations. Ann. Math. Stat. 1943, 14, 440–441.
63. Hotelling, H. Some New Methods in Matrix Calculation. Ann. Math. Stat. 1943, 14, 1–34.
64. Duncan, W.J. Some devices for the solution of large sets of simultaneous linear equations (with an appendix on the reciprocation of partitioned matrices). Lond. Edinb. Dublin Philos. Mag. J. Sci. Seventh Ser. 1944, 35, 660–670.
65. Helwig, N.E. npreg: Nonparametric Regression via Smoothing Splines; R Package Version 1.0-9; R Foundation for Statistical Computing: Vienna, Austria, 2022. Available online: https://cran.r-project.org/package=npreg.
66. Helwig, N.E. nptest: Nonparametric Tests; R Package Version 1.0-3; R Foundation for Statistical Computing: Vienna, Austria, 2021. Available online: https://cran.r-project.org/package=nptest.
67. Wood, S.N. mgcv: Mixed GAM Computation Vehicle with GCV/AIC/REML Smoothness Estimation and GAMMs by REML/PQL; R Package Version 1.8-40; R Foundation for Statistical Computing: Vienna, Austria, 2022. Available online: https://cran.r-project.org/package=mgcv.
68. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 2005, 67, 301–320.
69. Kalpić, D.; Hlupić, N. Multivariate Normal Distributions. In International Encyclopedia of Statistical Science; Lovric, M., Ed.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 907–910.
70. Henderson, H.V.; Searle, S.R. On deriving the inverse of a sum of matrices. SIAM Rev. 1981, 23, 53–60.
Figure 1. Simulation A Design. The top row shows the three data generating distributions, each with mean μ = 0 and variance σ² = 1. The bottom row shows the three error standard deviation functions: CON is a constant error standard deviation with σ(x) = 1, INC is an increasing error standard deviation with σ(x) = 1/2 + x, and PAR is a parabolic error standard deviation with σ(x) = 1/2 + 4(x − 1/2)². All nine combinations (3 f(ϵ) distributions × 3 σ(x) patterns) of these data generating factors were explored in the simulation.
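The three error standard deviation patterns in Figure 1 are simple to reproduce. The minimal R sketch below evaluates the CON, INC, and PAR functions and draws heteroscedastic errors under one of them; the predictor grid on [0, 1], the sample size, and the Gaussian draw are illustrative assumptions rather than the paper's exact simulation code.

## Minimal R sketch of the Figure 1 error standard deviation functions.
## The [0, 1] predictor grid and Gaussian errors are assumptions for illustration.
sd_con <- function(x) rep(1, length(x))      # CON: sigma(x) = 1
sd_inc <- function(x) 1/2 + x                # INC: sigma(x) = 1/2 + x
sd_par <- function(x) 1/2 + 4 * (x - 1/2)^2  # PAR: sigma(x) = 1/2 + 4(x - 1/2)^2

n <- 100
x <- runif(n)                                # assumed predictor on [0, 1]
eps <- rnorm(n, mean = 0, sd = sd_inc(x))    # heteroscedastic errors under INC

Swapping sd_inc for sd_con or sd_par generates errors under the other two patterns.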
Figure 2. Simulation A Results. Within each subplot, the Type I error rate is plotted for each inference method at each sample size. The rows denote the three different data generating distributions, and the columns denote the three different error standard deviation functions. The nominal α = 0.05 rate is denoted with a dotted line.
Figure 3. Simulation B Results. Within each subplot, the Type I error rate is plotted for each inference method at each sample size. The rows denote the two different test statistics (W and F), and the columns denote the three different error standard deviation functions. The nominal α = 0.05 rate is denoted with a dotted line.
Table 1. Permutation methods for testing H_0: β = 0_p in the presence of the nuisance parameters γ. The intercept α is excluded from each model for notational simplicity.
Code | Method | Permutation Method
DS | Draper-Stoneman (1966) | Y = P X β + Z γ + ϵ
OS | O’Gorman-Smith (2005/8) | Y = P R_Z X β + Z γ + ϵ
MA | Manly (1986) | P Y = X β + Z γ + ϵ
FL | Freedman-Lane (1983) | (H_Z + P R_Z) Y = X β + Z γ + ϵ
TB | ter Braak (1992) | (H_M + P R_M) Y = X β + Z γ + ϵ
SW | Still-White (1981) | P R_Z Y = X β + ϵ
KC | Kennedy-Cade (1996) | P R_Z Y = R_Z X β + ϵ
HJ | Huh-Jhun (2001) | P Q′ R_Z Y = Q′ R_Z X β + ϵ
Notes. P is a permutation matrix. R_Z = I − H_Z, where H_Z = Z(Z′Z + nΔ_Z)⁻¹Z′ is the hat matrix with Z in the model. R_M = I − H_M, where H_M = M(M′M + nΔ)⁻¹M′ is the hat matrix with M = (X, Z) in the model. For the Huh-Jhun method, R_Z = QQ′, where the columns of Q are mutually orthogonal.
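As a concrete illustration of Table 1, the sketch below implements the Freedman-Lane (FL) scheme using the penalized hat matrices from the table notes. The function name fl_perm_test, the F-type test statistic, and the defaults are my own assumptions for illustration; they are not the exact statistics studied in the simulations.

## Minimal R sketch of the Freedman-Lane (FL) scheme from Table 1, with the
## penalized hat matrices defined in the table notes. The F-type statistic
## and 'nperm' default are illustrative assumptions.
fl_perm_test <- function(y, X, Z, DeltaZ, Delta, nperm = 1999) {
  n <- length(y)
  M <- cbind(X, Z)
  HZ <- Z %*% solve(crossprod(Z) + n * DeltaZ, t(Z))  # H_Z: null-model hat matrix
  HM <- M %*% solve(crossprod(M) + n * Delta, t(M))   # H_M: full-model hat matrix
  fstat <- function(yy) {                             # F-type statistic (assumed)
    rss0 <- sum((yy - HZ %*% yy)^2)                   # residual SS under H_0
    rss1 <- sum((yy - HM %*% yy)^2)                   # residual SS under H_1
    (rss0 - rss1) / rss1
  }
  fit0 <- as.vector(HZ %*% y)                         # H_Z y: null fitted values
  res0 <- y - fit0                                    # R_Z y: null residuals
  f0 <- fstat(y)                                      # observed statistic
  fp <- replicate(nperm, fstat(fit0 + sample(res0)))  # permuted (H_Z + P R_Z) Y
  mean(c(f0, fp) >= f0)                               # permutation p-value
}

Replacing the permuted response fit0 + sample(res0) with the response implied by another row of Table 1 (e.g., sample(y) for MA, or sample(res0) alone for SW) yields the corresponding variant of the test.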