Fast Algorithm for Impact Point Selection in Semiparametric Functional Models

Novo, Silvia; Aneiros, Germán; Vieu, Philippe

doi:10.3390/proceedings2019021014

Silvia Novo ¹,*, Germán Aneiros ¹ and Philippe Vieu ²

¹ MODES Research Group, CITIC, Universidade da Coruña, 15071 A Coruña, Spain

² Institut de Mathématiques, Université Paul Sabatier, 31062 Toulouse, France

* Author to whom correspondence should be addressed.

Presented at the 2nd XoveTIC Conference, A Coruña, Spain, 5–6 September 2019.

Proceedings 2019, 21(1), 14; https://doi.org/10.3390/proceedings2019021014

Published: 31 July 2019

(This article belongs to the Proceedings of The 2nd XoveTIC Conference (XoveTIC 2019))


Abstract

A new sparse semiparametric functional model is proposed that incorporates the influence of two functional variables on a scalar response in a simple and interpretable way. One of the functional variables enters through a single-index structure and the other enters linearly, through the high-dimensional vector of its discretized observations. For this model, a new algorithm for impact point selection in the linear part and for model estimation is proposed. The procedure exploits the functional origin of the linear covariates. Some asymptotic results will support the good performance of the method. The computational efficiency of the algorithm, without loss of predictive power, will be shown through a simulation study and a real data application, by comparing its results with those obtained through the standard PLS method.

Keywords:

functional data analysis; multi-functional covariates; dimension reduction; variable selection; functional single-index model; semiparametric model

1. Introduction

In the big data era, it is increasingly common to have observations of variables measured on a continuous support (the data are curves or images). The richness of information provided by functional variables means that they often appear in regression problems. In many situations, we have a scalar variable of interest and we want to know which points of a functional variable are the most influential (points of impact) on this scalar variable (see [1]). The problem is that functional variables are usually observed at many points, and standard variable selection methods from the multidimensional context can give inadequate results. On the one hand, these procedures are affected by the dependence between observations, which in this case derives directly from their functional origin. On the other hand, the large number of observations makes it difficult to obtain results in a reasonable amount of time.

In this work, we focus on a regression model with scalar response that incorporates the influence of two functional variables: one of them enters through a single-index type structure (see [2,3] for details) and the other one enters linearly, through a high-dimensional vector formed by its discretized observations (see [1,4] for details and motivation of this structure). In this way we obtain a very flexible model, which combines interpretable estimates with dimension reduction. For this model, the so-called Multi-functional Partial Linear Single-Index Model (MFPLSIM), we work in the framework where there is a very large number of linear covariates but only a few of them have a real influence on the response (sparse context). Accordingly, we develop an efficient algorithm for impact point selection in the linear part and for estimation of the model, the Fast Algorithm for Sparse Semiparametric Multi-functional Regression (FASSMR), which takes advantage of the functional origin of the scalar variables included in the linear part. The good practical behaviour of the proposed methodology will be shown through a simulation study and a real data application. In both cases, we will show its computational efficiency, without loss of predictive power, by comparing its results with those of the standard penalized least squares (PLS) procedure. Furthermore, some asymptotic results will support the FASSMR theoretically.

2. The Model

The MFPLSIM is defined by the relationship

$$Y = \sum_{j=1}^{p_n} \beta_{0j}\, \zeta(t_j) + m(\langle \theta_0, X \rangle) + \varepsilon, \tag{1}$$

where $Y$ is a real random response, $X$ denotes a random curve taking values in some Hilbert space $H$ with inner product $\langle \cdot, \cdot \rangle$ and $\zeta$ denotes another random curve defined on some interval $[c, d]$. The curve $\zeta$ is observed at the points $c \leq t_1 < \dots < t_{p_n} \leq d$, and we denote by $\zeta(t_j)$, $j = 1, \dots, p_n$, its discretized observations; $(\beta_{01}, \dots, \beta_{0p_n})^{\top}$ is a vector of unknown coefficients, $m$ is an unknown link function and $\theta_0$ denotes an unknown curve in $H$. Finally, $\varepsilon$ is the random error, which satisfies

$$E(\varepsilon \mid \zeta(t_1), \dots, \zeta(t_{p_n}), X) = 0.$$

In model (1), we assume that only a few points of the curve $\zeta$ have an effect on the response $Y$. That is, denoting $S_n = \{j = 1, \dots, p_n : \beta_{0j} \neq 0\}$, we have $\sharp S_n = s_n = o(p_n)$.
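As a rough illustration of model (1), the following Python sketch draws a synthetic sample from it. Everything specific here is an assumption made for the example and does not come from the paper: the interval $[c, d] = [0, 1]$, the shapes of the curves $\zeta_i$ and $X_i$, the direction `theta0`, the cubic link $m(u) = u^3$, the Gaussian error, and the function name `simulate_mfplsim`.

```python
import math
import random

def simulate_mfplsim(n, p_n, impact_points, coefs, sigma=0.1, seed=0):
    """Draw n observations (zeta_i, X_i, Y_i) from a toy version of model (1)."""
    rng = random.Random(seed)
    # Discretization points t_1 < ... < t_{p_n} of zeta on [c, d] = [0, 1] (assumed).
    t = [(j + 1) / p_n for j in range(p_n)]
    # Grid on which the curves X_i and theta_0 are stored (assumed: 50 midpoints).
    s = [(l + 0.5) / 50 for l in range(50)]
    theta0 = [math.sqrt(2.0) * math.sin(math.pi * u) for u in s]  # hypothetical direction
    # Sparse coefficient vector: nonzero only at the (1-based) impact points.
    beta0 = [0.0] * p_n
    for j, b in zip(impact_points, coefs):
        beta0[j - 1] = b
    sample = []
    for _ in range(n):
        a1, a2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        zeta = [a1 * math.cos(2.0 * math.pi * tj) + a2 * tj for tj in t]
        c1 = rng.uniform(-1.0, 1.0)
        x = [c1 * math.sin(math.pi * u) + u for u in s]
        # Riemann-sum approximation of the inner product <theta_0, X_i> on [0, 1].
        index = sum(th * xv for th, xv in zip(theta0, x)) / len(s)
        linear = sum(bj * zj for bj, zj in zip(beta0, zeta))
        y = linear + index ** 3 + rng.gauss(0.0, sigma)  # assumed link m(u) = u^3
        sample.append((zeta, x, y))
    return t, beta0, sample
```

Calling, for example, `simulate_mfplsim(200, 100, [5, 45], [2.0, -1.0])` gives a sample in which only $\zeta(t_5)$ and $\zeta(t_{45})$ act on the response, so $s_n = 2$ while $p_n = 100$.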

3. The FASSMR

Our procedure is based on the fact that the variables $\zeta(t_j)$, $j = 1, \dots, p_n$, come from the discretization of the functional variable $\zeta$. Hence, when $t_j$ is close to $t_k$, the two corresponding variables $\zeta(t_j)$ and $\zeta(t_k)$ contain roughly the same information on the response. As a consequence, some variables can be discarded before applying the variable selection procedure.

To present the FASSMR, let us assume that we have a statistical sample of size $n$, $\{(\zeta_i, X_i, Y_i),\ i = 1, \dots, n\}$, i.i.d. as $(\zeta, X, Y)$. We will consider, without loss of generality, that $p_n$ can be factorized as $p_n = q_n w_n$, with $q_n$ and $w_n$ integers. The previous considerations lead us to the following set of variables:

$$R_n^{1} = \{\zeta(t_k^{1}) = \zeta(t_{[(2k-1) q_n / 2]}),\ k = 1, \dots, w_n\},$$

where $[z]$ denotes the smallest integer not less than $z \in \mathbb{R}$. Note that the correlation between consecutive variables in $R_n^{1}$ is much weaker than in the whole set of $p_n$ initial linear covariates. As a consequence, the variable selection procedure will be carried out only on the variables belonging to $R_n^{1}$. In other words, we will consider the following model with only $w_n$ linear covariates:

$$Y_i = \sum_{k=1}^{w_n} \beta_{0k}^{1}\, \zeta_i(t_k^{1}) + m^{1}(\langle \theta_0^{1}, X_i \rangle) + \varepsilon_i^{1}. \tag{2}$$
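The index rule defining $R_n^{1}$ can be sketched in a few lines of Python. This is only an illustration: the function name `reduced_grid_indices` is hypothetical, indices are 1-based as in the text, and $q_n$ is recovered as $p_n / w_n$.

```python
def reduced_grid_indices(p_n, w_n):
    """Return the 1-based indices ceil((2k - 1) * q_n / 2) for k = 1, ..., w_n."""
    if w_n <= 0 or p_n % w_n != 0:
        raise ValueError("p_n must be a positive multiple of w_n")
    q_n = p_n // w_n
    # For an integer a, ceil(a / 2) equals (a + 1) // 2, which avoids floating point.
    return [((2 * k - 1) * q_n + 1) // 2 for k in range(1, w_n + 1)]
```

For instance, with $p_n = 100$ and $w_n = 10$ (so $q_n = 10$), the retained points are $t_5, t_{15}, \dots, t_{95}$, that is, roughly the midpoint of each block of $q_n$ consecutive discretization points.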

Then, the variable selection task can be carried out following the standard procedure described in [5] and detailed in [6], which is based on transforming model (2) into a linear one and applying the PLS procedure. We denote by $(\hat{\beta}_0^{1}, \hat{\theta}_0^{1})$ the estimates of the parameters of model (2), where $\hat{\beta}_0^{1} = (\hat{\beta}_{01}^{1}, \dots, \hat{\beta}_{0w_n}^{1})^{\top}$. Then, $\zeta(t_k^{1})$ is selected in $R_n^{1}$ if and only if $\hat{\beta}_{0k}^{1} \neq 0$.

Considering the whole set of $p_n$ initial linear covariates, that is, returning to model (1), a variable $\zeta(t_j) \in \{\zeta(t_1), \dots, \zeta(t_{p_n})\}$ is selected if and only if it belongs to $R_n^{1}$ and its estimated coefficient, which can be denoted by $\hat{\beta}_{0k_j}^{1}$, is nonzero. Then,

$$\hat{S}_n = \{j = 1, \dots, p_n : \zeta(t_j) = \zeta(t_{k_j}^{1}) \in R_n^{1} \text{ and } \hat{\beta}_{0k_j}^{1} \neq 0\}$$

and $\hat{\beta}_{0j} = \hat{\beta}_{0k_j}^{1}$ if $j \in \hat{S}_n$, and $\hat{\beta}_{0j} = 0$ otherwise. Finally, $\hat{\theta}_0 = \hat{\theta}_0^{1}$, and an estimator of the function $m_{\theta_0}(\cdot) \equiv m(\langle \theta_0, \cdot \rangle)$, denoted by $\hat{m}_{\hat{\theta}_0}(\chi)$, can be obtained by smoothing the residuals from the parametric fit (see Appendix A).
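The step back from model (2) to model (1) amounts to placing the reduced-model estimates at their original positions and setting every other coefficient to zero. A minimal Python sketch follows, assuming that the estimates `beta_reduced` have already been produced by some PLS fit (not shown here) and that `indices` holds the 1-based positions in the original grid of the variables in $R_n^{1}$; the function name is hypothetical.

```python
def expand_coefficients(beta_reduced, indices, p_n):
    """Embed the estimates of model (2) into a length-p_n coefficient vector."""
    if len(beta_reduced) != len(indices):
        raise ValueError("one estimated coefficient is needed per retained index")
    beta_full = [0.0] * p_n   # coefficients outside the selected set stay at zero
    selected = []             # estimated set of impact points (1-based indices)
    for b, j in zip(beta_reduced, indices):
        if b != 0.0:
            beta_full[j - 1] = b
            selected.append(j)
    return beta_full, selected
```

By construction, a point $t_j$ that does not belong to the reduced grid can never be selected, which is the price paid for fitting only $w_n$ instead of $p_n$ linear covariates.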

4. Theory, Simulation and Real Data Application Conclusions

The good behaviour of the proposed algorithm will be supported theoretically by asymptotic results. Furthermore, the simulation study will show that the FASSMR delivers variable selection and estimation for model (1) in a reasonable amount of time, even for very large values of $p_n$. As will be seen in that study, the algorithm clearly outperforms the standard PLS procedure in computational time, without loss of predictive power. A real data application will also illustrate the flexibility and applicability of model (1) together with the FASSMR estimation.

Funding

The authors acknowledge partial support by MINECO grants MTM2014-52876-R and MTM2017-82724-R (EU ERDF support included). Additionally, financial support from the Xunta de Galicia (Centro Singular de Investigación de Galicia accreditation ED431G/01 2016-2019 and Grupos de Referencia Competitiva ED431C2016-015) and the European Union (European Regional Development Fund—ERDF), is gratefully acknowledged. The first author also thanks the financial support from the Xunta de Galicia and the European Union (European Social Fund—ESF), the reference of which is ED481A-2018/191.

Conflicts of Interest

The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

FASSMR	Fast Algorithm for Sparse Semiparametric Multi-functional Regression
i.i.d.	Independent and identically distributed
MFPLSIM	Multi-functional Partial Linear Single-Index Model
PLS	Penalized Least Squares

Appendix A

Denoting by $\hat{\beta}_0$ the vector of estimated parameters,

$$\hat{m}_{\hat{\theta}_0}(\chi) \equiv \hat{m}(\langle \hat{\theta}_0, \chi \rangle) = \frac{\sum_{i=1}^{n} (Y_i - \zeta_i^{\top} \hat{\beta}_0)\, K(d_{\hat{\theta}_0}(\chi, X_i)/h)}{\sum_{i=1}^{n} K(d_{\hat{\theta}_0}(\chi, X_i)/h)},$$

where we have denoted $\zeta_i = (\zeta_i(t_1), \dots, \zeta_i(t_{p_n}))^{\top}$, $h > 0$ is a bandwidth, $K$ is a kernel and, for any $\theta \in H$, $d_{\theta}(\cdot, \cdot)$ is the semimetric defined as $d_{\theta}(\chi, \chi') = |\langle \theta, \chi - \chi' \rangle|$ for each $\chi, \chi' \in H$.
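The estimator above is a kernel-weighted average of the partial residuals $Y_i - \zeta_i^{\top} \hat{\beta}_0$. The following Python sketch shows one way to compute it. It rests on assumptions that are not in the paper: the curves are stored on a common equispaced grid with spacing `step`, the inner product is approximated by a Riemann sum, and the kernel is an Epanechnikov-type kernel supported on $[0, 1]$. All function names are hypothetical.

```python
def inner_product(f, g, step):
    """Riemann-sum approximation of <f, g> for curves on a common equispaced grid."""
    return step * sum(a * b for a, b in zip(f, g))

def kernel(u):
    """Epanechnikov-type kernel restricted to [0, 1] (an illustrative choice)."""
    return 0.75 * (1.0 - u * u) if 0.0 <= u <= 1.0 else 0.0

def m_hat(chi, X, residuals, theta, h, step):
    """Kernel-weighted average of the residuals Y_i - zeta_i^T beta_hat at the curve chi."""
    proj_chi = inner_product(theta, chi, step)
    num = 0.0
    den = 0.0
    for x_i, r_i in zip(X, residuals):
        # d_theta(chi, X_i) = |<theta, chi - X_i>| = |<theta, chi> - <theta, X_i>|
        d = abs(proj_chi - inner_product(theta, x_i, step))
        w = kernel(d / h)
        num += r_i * w
        den += w
    if den == 0.0:
        raise ValueError("no sample curve lies within bandwidth h of chi")
    return num / den
```

Because the kernel has compact support, only curves whose projection onto $\hat{\theta}_0$ lies within $h$ of that of $\chi$ contribute to the average, and the sketch raises an error when there are none.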

References

1. Aneiros, G.; Vieu, P. Variable selection in infinite-dimensional problems. Stat. Probab. Lett. 2014, 94, 12–20.
2. Ait-Saïdi, A.; Ferraty, F.; Kassa, R.; Vieu, P. Cross-validated estimations in the single-functional index model. Statistics 2008, 42, 475–494.
3. Novo, S.; Aneiros, G.; Vieu, P. Automatic and location-adaptive estimation in functional single-index regression. J. Nonparametric Stat. 2019, 31, 364–392.
4. Aneiros, G.; Vieu, P. Partial linear modelling with multi-functional covariates. Comput. Stat. 2015, 30, 647–671.
5. Novo, S.; Aneiros, G.; Vieu, P. Sparse semi-functional partial linear single-index regression. Proceedings 2018, 2, 1190.
6. Novo, S.; Aneiros, G.; Vieu, P. Sparse semiparametric regression when predictors are mixture of functional and high-dimensional variables. Preprint.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
