**Big Data Analytics and Information Science for Business and Biomedical Applications II**

Editors

**S. Ejaz Ahmed and Farouk Nathoo**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editors* S. Ejaz Ahmed Brock University Canada

Farouk Nathoo University of Victoria Canada

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Entropy* (ISSN 1099-4300) (available at: https://www.mdpi.com/journal/entropy/special_issues/Big_Data_Biomedical_II).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-5549-2 (Hbk) ISBN 978-3-0365-5550-8 (PDF)**

© 2022 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.


### **About the Editors**

### **S. Ejaz Ahmed**

S. Ejaz Ahmed is a Professor of Statistics and Dean of the Faculty of Math and Science at Brock University, Canada. Previously, he was Professor and Head of the Mathematics and Statistics Department at the University of Windsor, Canada, and University of Regina, Canada, as well as Assistant Professor at the University of the Western Ontario, Canada. He holds adjunct professorship positions at many Canadian and International universities. He has supervised more than 20 Ph.D. students and organized several international workshops and conferences around the globe. He is a Fellow of the American Statistical Association and has held the prestigious ASEAN Chair Professorship position. His areas of expertise include big data analysis, statistical learning, and shrinkage estimation strategy. He has authored several books and edited and co-edited several volumes and special issues of scientific journals. He has been a Technometrics Review Editor for the past ten years. Further, he is editor and associate editor of many statistical journals. Overall, he has published more than 200 articles in scientific journals and reviewed more than 100 books. Having been among the Board of Directors of the Statistical Society of Canada, he was also Chairman of its Education Committee. Moreover, he was Vice President of Communications for The International Society for Business and Industrial Statistics (ISBIS) as well as a member of the "Discovery Grants Evaluation Group" and the "Grant Selection Committee" of the Natural Sciences and Engineering Research Council of Canada.

### **Farouk Nathoo**

Farouk Nathoo received his B.Sc. in Mathematics and Statistics (combined honours) from the University of British Columbia in 1998, his M.Math. from the University of Waterloo in 2000, and his Ph.D. in statistics from Simon Fraser University in 2006. He joined the Department of Mathematics and Statistics at the University of Victoria in 2006 and became a Full Professor in 2021. He currently holds the Tier 2 Canada Research Chair in Biostatistics for Spatial and High-Dimensional Data. His research interests focus on Bayesian Methods, High-Dimensional Data, Statistical Computation, Neuroimaging Statistics, and Machine Learning. He is President-Elect of the Business and Industrial Statistics Section of the Statistical Society of Canada (2022–2023) and a Member of the Board of Directors of the Canadian Statistical Sciences Institute (2018–2024).

### **Preface to "Big Data Analytics and Information Science for Business and Biomedical Applications II"**

This book is the second volume of *Big Data Analytics and Information Science for Business and Biomedical Applications*. As with the first volume, it provides a venue for the presentation of cutting-edge research and discussion of powerful statistical methods developed for the analysis of Big Data in these areas. This second volume comprises nine papers showcasing both theoretical and applied developments.

In the first article, Shahhosseini and Miranda review techniques for the estimation of functional brain connectivity, with an emphasis on functional magnetic resonance imaging (fMRI) data. In the second article, Chakraborty and Shojaie discuss the problem of learning the structure of directed acyclic graphs within the setting of non-Gaussian data. They develop nonparametric methods and associated algorithms for learning causal structure in high dimensions. In the third article, Fan and Bu discuss the diagnosis of lung disease from X-ray imaging using deep neural networks, with transfer learning from existing pretrained networks to handle small sample sizes. The fourth article examines associations between longitudinal gestational weight gain and infant birth weight: Pietrosanu et al. develop a Bayesian joint modelling approach in which parameters representing trajectories in a longitudinal model for gestational weight gain are incorporated as predictors of infant birth weight. In the fifth article, Naiman and Song consider high-frequency data collected by mobile devices and develop semiparametric kernel machine regression with variable selection for functional predictors. In the sixth article, Liu et al. present a study examining the differences in social support communication among people with different types of cancers in online health communities using a network analysis. In the seventh article, Söderbäck et al. develop an improved estimation strategy for financial quantities by accounting for the high resolution and heteroscedastic nature of intraday data from liquid financial markets. In the eighth article, Opoku et al. consider the estimation of fixed effects in the high-dimensional linear mixed model in settings where there is some prior information in the form of linear restrictions on the parameters. Shrinkage estimators are developed based on a full ridge regression estimator as a base model.
The final contribution focuses on denoising image sequences such as those that arise from satellite imaging or fMRI. Yi and Qiu develop an edge-preserving image denoising procedure based on jump-preserving local smoothing. The procedure incorporates tuning parameters representing bandwidths chosen to account for spatio-temporal correlation.

We hope that this second volume will continue to generate new ideas and research focussed on the many modern problems involving big data and high-dimensional inference.

> **S. Ejaz Ahmed and Farouk Nathoo** *Editors*

### *Review* **Functional Connectivity Methods and Their Applications in fMRI Data**

**Yasaman Shahhosseini and Michelle F. Miranda \***

Department of Mathematics and Statistics, University of Victoria, Victoria, BC V8W 2Y2, Canada; yshahhosseini@uvic.ca

**\*** Correspondence: michellemiranda@uvic.ca

**Abstract:** The availability of powerful non-invasive neuroimaging techniques has given rise to various studies that aim to map the human brain. These studies focus not only on finding brain activation signatures but also on understanding the overall organization of functional communication in the brain network. Based on the principle that distinct brain regions are functionally connected and continuously share information with each other, various approaches to finding these functional networks have been proposed in the literature. In this paper, we present an overview of the most common methods to estimate and characterize functional connectivity in fMRI data. We illustrate these methodologies with resting-state functional MRI data from the Human Connectome Project, providing details of their implementation and insights on the interpretation of the results. We aim to guide researchers who are new to the field of neuroimaging by providing the necessary tools to estimate and characterize brain circuitry.

**Keywords:** fMRI; functional connectivity; brain network; Human Connectome Project; statistics

### **1. Introduction**

Functional magnetic resonance imaging (fMRI) techniques have emerged as a powerful tool for the characterization of human brain connectivity and its relationship to health, behavior, and lifestyle [1]. The fMRI measurements comprise an indirect and non-invasive measurement of brain activity based on the blood oxygen level dependent (BOLD) contrast [2]. Compared to alternative brain imaging modalities such as positron emission tomography (PET) and electroencephalography (EEG), fMRI is non-invasive and has a high spatial resolution, which makes it a popular choice in large brain imaging studies. An example of such studies is the Human Connectome Project, which aims at understanding the underlying function of the brain by describing the patterns of connectivity in the healthy adult human brain [3].

There are mainly two goals in such studies: first, to identify location signatures in the brain that respond to external stimuli, and second, to identify brain space–time association patterns that emerge when the brain is either at rest or performing a task. These association patterns are measures of co-activation in functionally connected time series of anatomically different brain regions, known as functional connectivity [4,5]. There is evidence that individual differences in these connectivity patterns are responsible for important differences in cognitive and behavioral functions. Therefore, understanding these patterns can play an important role in predicting the early onset of neurodegenerative diseases and in monitoring disease care and treatment [6,7].

Functional MRI data are often high-dimensional and consist of images of 3D brain volumes collected over a period of time. In a typical study, the number of voxels *Nv* is in the hundreds of thousands, and the number of time points *T* is in the hundreds. Therefore, estimating the *Nv* × *Nv* correlation matrix of voxelwise spatial connectivities is challenging and requires a few strategies and assumptions. A simple technique is to first pre-specify regions called *seeds* and then compute the cross-correlation of seeds and the functional time series of every other voxel in the brain. This *seed-based approach* became popular due to its straightforward calculation and interpretation. Seeds can vary in size and be as small as a single voxel. If the seed is a region, it is customary to average the time courses of the region and use that as the reference time course to be correlated with all the other voxels. In order to improve scalability, it is also common to first parcellate the brain into small regions and use the average time series of these regions to estimate the networks. The seed-based method can be a helpful resource when comparing patterns of neuropathologies and the normal brain. For example, ref. [8] uses this method to show that connectivity between the hippocampus and other brain regions changes with the early signs of Alzheimer's disease when compared to control subjects. Despite the popularity of these approaches, there are various criticisms of the method. First, by focusing on pre-determined seeds, potential patterns that emerge in different brain regions are ignored [9]. Second, the method neglects the variability across voxels by averaging the time series in the ROIs. Third, the approach computes correlations between pairs of nodes and ignores the potential influence of other nodes in the entire network.

**Citation:** Shahhosseini, Y.; Miranda, M.F. Functional Connectivity Methods and Their Applications in fMRI Data. *Entropy* **2022**, *24*, 390. https://doi.org/10.3390/e24030390

Academic Editors: S. Ejaz Ahmed and Farouk Nathoo

Received: 15 January 2022; Accepted: 8 March 2022; Published: 11 March 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In contrast to region pre-specification, dimension reduction approaches characterize the spatial and/or temporal connectivity patterns by representing the data through a small number of latent components [10]. Principal component analysis (PCA) and independent component analysis (ICA) are the two most common of these methods. Both methods project the high-dimensional imaging data into a low-dimensional subspace. In PCA, this projection consists of orthogonal components that maximize the variance of the data projected into the low-dimensional subspace [11]. In ICA, the projection consists of components that are as spatially independent as possible [12]. Each of these components is then assembled into a brain map, with the value in each voxel representing the relative amount by which that voxel is modulated by the activation of that component. Compared to the seed-based approach, both PCA and ICA have the advantage of providing automated components with no need for the pre-specification of a seed region, i.e., these methods are data-driven. The authors in reference [13] used ICA to decompose brain networks into spatial sub-networks with similar functions in both resting-state and task fMRI data.

Other methods combine the brain parcellation strategy used in seed-based methods with dimension reduction approaches to characterize brain circuitry. Reference [10] uses an anatomical atlas to pre-determine clusters (ROIs) and then extract features from each cluster via principal components. Multiple extracted components were then used to estimate connectivity between these ROIs using the RV coefficient, a measure that summarizes the correlation among sets of features.

In addition to the methods utilized to estimate connectivity, it is common to characterize the functional networks by using the tools of *graph theory*. In a graph, brain networks are treated as a collection of nodes connected by edges. Commonly, the edges are defined by an estimated connectivity. Following the specification of the nodes, a binary matrix is obtained by thresholding the connectivity matrix. The binary graph is then used to compute various graph parameters that describe the nature of the brain network. These parameters express key characteristics of the network and usually include quantities that help determine if the graph nodes are connected in a random or small-world order. Random networks have a more globally connected pattern while small-world networks show a high level of local ordering [14]. Statistical network models take these graph measures as inputs for the prediction of global networks that characterize multiple individuals.

The goal of this paper is to provide an overview of the most commonly used methods to estimate and characterize functional connectivity in resting-state fMRI data. We illustrate these methods by analyzing data from a single subject in the Human Connectome Project. Although we do not attempt to offer an exhaustive presentation of the rapidly evolving methods in the field, we expect that the information provided here will guide researchers who are new to the field of neuroimaging in exploring these data.

The remainder of the paper is organized as follows: In Section 2, we describe the different methods of estimating functional connectivity, focusing on data from a single subject. In Section 3, we estimate functional connectivity for single-subject resting-state data from the Human Connectome Project, using the methods described in Section 2. In Section 4, we present a few multiple-subject estimation methods. In Section 5, we describe statistical network models. Finally, in Section 6, we present some final remarks on the topic.

### **2. Methods for Functional Connectivity**

In this section, we review the different methods to estimate functional connectivity for single-subject data; group connectivity is discussed in Section 4. For all calculations, let *Y* be a matrix of size *T* × *Nv*, consisting of *Nv* time courses representing the BOLD signal at each voxel *v* = 1, ... , *Nv* [2] for a single subject. Here, for simplicity, we center the matrix *Y* by subtracting from each voxel's data (column) the average of its time course. The goal of a connectivity-based analysis is to describe how various brain regions interact, either when the brain is resting or performing a task [15].

### *2.1. Seed-Based Analysis*

It is computationally expensive to compute all pairwise correlations among a large number of voxels, as it would require $N_v^2$ operations. Seed-based analysis (SBA) relies on estimating pairwise correlations between regions of interest (ROIs) or between a seed region and all the other voxels across the brain.

To estimate correlations between ROIs, a common approach is to divide the brain volume according to anatomical templates, usually called *brain atlases* [16]. There are several human brain atlases available, including the popular *Automated Anatomical Labelling (AAL)* atlas, the *Talairach atlas*, and the *MNI-152 atlas* [16,17]. To estimate correlations between a seed region and the other voxels, the seed is usually selected either by expert opinion or by choosing the voxel that shows the greatest activation during the fMRI experiment. The latter is more common during experiments involving tasks. After selection, the connectivity is estimated through a measure that quantifies correlation. Various measures have been proposed in the literature; we provide more details about these measures in Appendix A. We can summarize the seed-based approach in the following way.


Alternatively, after dividing the brain into various ROIs using an atlas, we can summarize the time series of each region, either by averaging across voxels or by calculating the first principal component [15]. Next, we correlate those summary time series between all pairs of regions. We illustrate both options in Section 3.
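
As an illustration of the two seed-based options described in this subsection, the following Python sketch computes (i) the correlation map between an average seed time course and every voxel and (ii) the correlation matrix between ROI-averaged time series. It uses the Pearson correlation as the connectivity measure. The function names, the `labels` array of atlas labels, and the use of NumPy are assumptions made for this example and are not part of the original analysis.

```python
import numpy as np

def seed_correlation_map(Y, seed_idx):
    """Correlate the average seed time course with every voxel.

    Y        : (T, Nv) array, one BOLD time course per column (voxel).
    seed_idx : index or indices of the voxels forming the seed.
    Returns a length-Nv vector of Pearson correlations.
    """
    seed_idx = np.atleast_1d(seed_idx)
    Yc = Y - Y.mean(axis=0)                  # center each voxel's time course
    seed = Yc[:, seed_idx].mean(axis=1)      # reference time course (already centered)
    num = Yc.T @ seed                        # inner product with every voxel
    den = np.linalg.norm(Yc, axis=0) * np.linalg.norm(seed)
    return num / den                         # assumes no constant time courses

def roi_correlation_matrix(Y, labels):
    """Average the voxels within each ROI, then correlate the ROI averages.

    labels : length-Nv array with the (hypothetical) atlas label of each voxel.
    Returns an (R, R) correlation matrix, R = number of distinct labels.
    """
    labels = np.asarray(labels)
    rois = np.unique(labels)
    means = np.column_stack([Y[:, labels == r].mean(axis=1) for r in rois])
    return np.corrcoef(means, rowvar=False)
```

With real data, `Y` would hold the voxel time courses inside a brain mask, and `labels` would be read from the chosen atlas; thresholding the resulting correlations is a separate step.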

### *2.2. Decomposition Methods*

Although seed-based methods have a straightforward interpretation, they are biased by the seed selection technique [18] and, therefore, should be used with caution. Principal component analysis and independent component analysis aim to address this issue by providing a data-driven approach to functional connectivity. These decomposition methods play many roles in functional neuroimaging applications. They are used in the pre-processing steps to remove data artifacts and to reduce data dimensionality, and they will likely appear in at least one step of estimating functional connectivity in various populations. In this section, we focus on their role as a method to describe functional connectivity in single-subject fMRI data, while in Section 4, we explore their contribution to multi-subject analysis.

As an alternative to seed-based analysis, the goal of the decomposition methods is to represent the voxel domain as a smaller subset of spatial components. Each spatial component has a separate time course and represents simultaneous changes in the fMRI signals of many voxels [12]. In this section, we assume that for each column of *Y* the average was subtracted from the data.

### 2.2.1. Principal Component Analysis (PCA)

PCA is a common method to reduce data dimensionality while retaining as much of the data variability as possible [11]. The principal components are obtained either by the eigendecomposition of the sample covariance matrix $Y^TY$ or by finding the singular vectors of the data matrix $Y$ using the singular value decomposition (SVD). The rank of the data matrix is $r = \min\{T, N_v\}$ (usually $T < N_v$ and $r = T$), and therefore we can find $r$ principal components through the decomposition

$$\mathbf{Y} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{W}^T = \sum_{k=1}^r \sigma_k \mathbf{u}_k \mathbf{w}_k^T,\tag{1}$$

where the $T \times r$ matrix $U$ contains the orthonormal left singular vectors $u_k \in \mathbb{R}^{T}$ as columns, the $N_v \times r$ matrix $W$ contains the orthonormal right singular vectors $w_k \in \mathbb{R}^{N_v}$ as columns, and the $r \times r$ diagonal matrix $\Sigma$ contains the ordered singular values $\sigma_k$ [11,15,19]. Notice that the eigendecomposition of $Y^TY$ is $W\Sigma^2W^T$. The orthonormal columns of $W$ (the rows of the $r \times N_v$ matrix $W^T$) are referred to as eigenimages and can be assembled into brain maps, each representing the relative amount by which a given voxel is modulated by the activation of that component.

A different approach is to project the original fMRI data into the space spanned by the first *p* principal components, where the choice of *p* is based on the amount of data variability explained by the components. The projected data matrix, *YW*, consists of the time series of regions in this new subspace. The authors in reference [20] used this idea to reduce the dimensionality of the fMRI data in certain ROIs and then applied a Granger causality analysis on the block time series of two brain regions to infer directional connections. Although it is possible to compute correlations using the time series of these projected data points, the results have no clear interpretation since each of these spatial regions in the new subspace represents a linear combination of different voxels in the original data space.
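
The decomposition in Equation (1) and the projection onto the first *p* components can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy and a data matrix small enough for a direct SVD; the function name `pca_svd` and the default 90% variance threshold are choices made for this example only.

```python
import numpy as np

def pca_svd(Y, var_explained=0.9):
    """PCA of a (T, Nv) data matrix through the thin SVD.

    Returns the first p eigenimages (rows, length Nv), the projected
    time series (T, p), and the cumulative proportion of variance
    explained, where p is the smallest number of components whose
    cumulative proportion reaches `var_explained`.
    """
    Yc = Y - Y.mean(axis=0)                              # center each column
    U, s, Wt = np.linalg.svd(Yc, full_matrices=False)    # Yc = U diag(s) Wt
    ratio = np.cumsum(s**2) / np.sum(s**2)
    p = min(int(np.searchsorted(ratio, var_explained)) + 1, len(s))
    eigenimages = Wt[:p]                                 # right singular vectors
    scores = Yc @ eigenimages.T                          # equals U[:, :p] * s[:p]
    return eigenimages, scores, ratio[:p]
```

Working with the thin SVD of the centered data avoids forming the $N_v \times N_v$ matrix $Y^TY$ explicitly, which matters when the number of voxels is much larger than the number of time points.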

### 2.2.2. Independent Component Analysis (ICA)

ICA aims at representing the brain data using a latent representation of independent factors. Unlike PCA, the goal is to decompose *Y* as the product of a mixing matrix and a matrix of spatially independent components (ICs):

$$\mathbf{Y} = \mathbf{M}\mathbf{C} + \mathbf{E} = \sum_{k=1}^{K} \mathbf{m}_k \mathbf{c}_k + \mathbf{E},\tag{2}$$

where $M$ is the $T \times K$ mixing matrix with columns $m_k$, and the $K \times N_v$ matrix $C$ is the matrix of independent components with rows $c_k$, where each $c_k$ contains the brain network corresponding to component $k$, for a total of $K$ independent components. These components represent the networks of various functions, such as motor, vision, auditory, etc. The elements of the matrix $E$ are independent Gaussian noise terms.

It is assumed that the component maps, $c_k$, $k = 1, \ldots, K$, represent possibly overlapping and statistically dependent signals, but the individual component map distributions are independent, i.e., if $P(c_k)$ represents the probability distribution of the voxel values in the $k$th component map, we have

$$P(\mathbf{c}_1, \mathbf{c}_2, \dots, \mathbf{c}_K) = \prod_{k=1}^K P(\mathbf{c}_k). \tag{3}$$

Each independent component $c_k$ is a vector of size $N_v$ and can be assembled into a brain map. As in PCA, these maps represent the relative amount by which a given voxel is modulated by the activation of that component.

### *2.3. Computational Aspects*

In imaging applications, estimating the principal components requires the decomposition of the $N_v \times N_v$ matrix $Y^TY$, which is usually infeasible. Many algorithms have been proposed in the literature to efficiently estimate the components in such high-dimensional settings. Ref. [21] develops sparse PCA (SPCA), which is based on a regression optimization problem using a lasso penalty; ref. [22] proposes a multilevel functional principal component analysis for high-dimensional settings; and ref. [23] estimates a sparse set of principal components through an iterative thresholding algorithm. Some of these methods are available as toolboxes that can be downloaded from the authors' websites.

Similarly, estimating the independent components is not straightforward, and ranking the components is challenging because the ICs are usually not orthogonal and the variances explained by the individual components do not sum to the variance of the original data. One of the first algorithms was Infomax, which aims at maximizing the joint entropy of suitably transformed component maps [12,24]. More recent algorithms focus on extracting a sparse set of features from data matrices containing a very large number of features. An example is ICA with a reconstruction cost (RICA), proposed by [25], which is available as a Matlab toolbox.

### *2.4. A Hybrid Method*

A different approach to estimating functional connectivity is given by reference [10]. The authors propose a multi-scale model based on networks at multiple topological scales, from the voxel level to regions consisting of clusters of voxels, and larger networks consisting of collections of those regions. In practice, these collections of voxels are pre-specified and usually taken as anatomical ROIs. These anatomical ROIs can then be combined to form clusters of ROIs. Their approach consists of a dimension reduction step through a factor model within each ROI. Let $r$ index the ROIs, and let $Y_r$ be a $T \times p_r$ data matrix consisting of the time series of the voxels belonging to the $r$-th ROI (containing a total of $p_r$ voxels, where $\sum_{r=1}^{R} p_r = N_v$ and $R$ is the total number of ROIs). Then, we write

$$\mathbf{Y}_r(t) = \mathbf{Q}_r f_r(t) + \mathbf{E}_r(t),\tag{4}$$

where $Y_r(t)$ is a column vector of size $p_r$, $f_r(t)$ is an $m_r \times 1$ vector of latent common factors with a number of factors $m_r \ll p_r$, $Q_r$ is a $p_r \times m_r$ factor-loading matrix that defines the dependence between the $p_r$ voxels through the mixing of $f_r$, and $E_r(t) = [e_{r1}(t), \ldots, e_{rp_r}(t)]$ is a $p_r \times 1$ vector of white noise with $\mathrm{E}(E_r(t)) = 0$ and $\Sigma_{E_rE_r} = \mathrm{Cov}(E_r(t)) = \mathrm{diag}(\sigma^2_{e_{r1}}, \ldots, \sigma^2_{e_{rp_r}})$.

These factor models are then concatenated to define

$$\mathbf{Y}(t) = \mathbf{Q}f(t) + \mathbf{E}(t),\tag{5}$$

where $Y(t)$ is a column vector of size $\sum_{r=1}^{R} p_r = N_v$, $Q = \mathrm{diag}(Q_1, \ldots, Q_R)$ is a $\left(\sum_{r=1}^{R} p_r\right) \times \left(\sum_{r=1}^{R} m_r\right)$ block-diagonal mixing matrix, and $f(t) = [f_1(t), \ldots, f_R(t)]$ is a $\left(\sum_{r=1}^{R} m_r\right) \times 1$ vector of aggregated latent factors.

Network covariance matrices at these different topological scales are estimated using the low-rank factor model in the following way. Let $\Sigma_{Y_rY_r}$ be the covariance matrix within ROI $r$. Model (4) implies the following decomposition:

$$
\Sigma_{\mathbf{Y}_r\mathbf{Y}_r} = \mathbf{Q}_r \Sigma_{f_rf_r} \mathbf{Q}_r^\prime + \Sigma_{\mathbf{E}_r\mathbf{E}_r}. \tag{6}
$$

Similarly, from Model (5) we have

$$
\Sigma_{\mathbf{Y}\mathbf{Y}} = \mathbf{Q}\Sigma_{ff}\mathbf{Q}^{\prime} + \Sigma_{\mathbf{E}\mathbf{E}}.\tag{7}
$$

The low-dimensional factor covariance matrix $\Sigma_{ff}$ is a block matrix used to estimate the lag-zero dependency between ROIs as follows.

$$
\Sigma_{ff} = \begin{pmatrix}
\Sigma_{f_1f_1} & \cdots & \Sigma_{f_1f_R} \\
\vdots & \ddots & \vdots \\
\Sigma_{f_Rf_1} & \cdots & \Sigma_{f_Rf_R}
\end{pmatrix}.
$$

The diagonal blocks $\Sigma_{f_rf_r}$, $r = 1, \ldots, R$, are diagonal covariance matrices, capturing the total variance of factors within each ROI. The off-diagonal blocks $\Sigma_{f_kf_j}$, $j \neq k$, are cross-covariance matrices between factors and summarize the dependence between clusters $j$ and $k$.

The authors summarize the dependence between ROIs using the RV coefficient, a multivariate generalization of the squared correlation coefficient. The RV coefficient between factors in clusters j and k is defined by

$$RV = \frac{\text{tr}(\mathbf{C}_{f_k f_j} \mathbf{C}_{f_j f_k})}{\sqrt{\text{tr}(\mathbf{C}_{f_j f_j} \mathbf{C}_{f_j f_j})\, \text{tr}(\mathbf{C}_{f_k f_k} \mathbf{C}_{f_k f_k})}},\tag{8}$$

where $C_{f_jf_k} = (\Sigma_{f_jf_j})^{-1/2}\, \Sigma_{f_jf_k}\, (\Sigma_{f_kf_k})^{-1/2}$.

In practice, the authors apply this model to estimate resting-state networks. They estimate the factors $f_r$ and matrices $Q_r$ using PCA and pre-specify the ROIs based on an anatomical atlas. The authors in reference [26] use this approach to estimate background functional connectivity between ROIs using data from the Working Memory Task in the Human Connectome Project.
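
To make Equation (8) concrete, the following Python sketch estimates the factors of two ROIs by PCA and computes the RV coefficient from the whitened cross-covariance matrices. It is a sketch under stated assumptions rather than the implementation of reference [10]: the function names, the use of sample covariances, and the eigenvalue floor `eps` are choices made for this example.

```python
import numpy as np

def inv_sqrt(S, eps=1e-10):
    """Inverse symmetric square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    vals = np.clip(vals, eps, None)          # guard against tiny eigenvalues
    return (vecs / np.sqrt(vals)) @ vecs.T   # V diag(1/sqrt(lambda)) V'

def roi_factors(Yr, m):
    """First m principal-component scores of one ROI's (T, p_r) data matrix."""
    Yc = Yr - Yr.mean(axis=0)
    U, s, _ = np.linalg.svd(Yc, full_matrices=False)
    return U[:, :m] * s[:m]

def rv_coefficient(Fj, Fk):
    """RV coefficient between two sets of factors, following Equation (8)."""
    T = Fj.shape[0]
    Fj = Fj - Fj.mean(axis=0)
    Fk = Fk - Fk.mean(axis=0)
    Sjj = Fj.T @ Fj / (T - 1)
    Skk = Fk.T @ Fk / (T - 1)
    Sjk = Fj.T @ Fk / (T - 1)
    Wj, Wk = inv_sqrt(Sjj), inv_sqrt(Skk)
    Cjk = Wj @ Sjk @ Wk                      # whitened cross-covariance
    Cjj = Wj @ Sjj @ Wj                      # approximately the identity
    Ckk = Wk @ Skk @ Wk
    num = np.trace(Cjk.T @ Cjk)
    den = np.sqrt(np.trace(Cjj @ Cjj) * np.trace(Ckk @ Ckk))
    return num / den
```

Because the blocks are whitened, the numerator is the sum of squared canonical correlations between the two factor sets, so the coefficient lies between 0 and 1.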

### *2.5. Brain Networks*

It is common to represent the brain using tools from *graph theory*. In this framework, we can think of functional connectivity as a network represented by a graph, where the spatial units are the nodes (vertices) and the connections between them are the links (edges). The graph (network) is represented as the pair *G* = (*V*, *E*), where *V* and *E* are the sets of vertices and edges, respectively. In addition, graphs may be weighted and, in such cases, will be denoted by the triple *G* = (*V*, *E*, *W*), with *W*(*E*) indicating the weight for each edge.

The first decision to make is the selection of the nodes of the network. Similar to the seed-based connectivity, these nodes are defined by either voxels or the ROI parcellations given by anatomical atlases. Following the specification of the nodes, their edges (links) must be determined. These edges quantify the strength of association between these different nodes, i.e., they are the functional connectivity. The same measures discussed previously for functional connectivity and described in Appendix A are used to quantify the strength of the edges.

Most of the standard tools of graph theory have been developed for binary networks, where each edge is either present or not. A binary matrix, usually called an *adjacency matrix*, is obtained by thresholding the connectivity matrix. Although it is convenient to threshold the weighted graphs to apply the standard graph theoretical machinery, information about the original signal is discarded in the process. Moreover, in most situations, the choice of a threshold is not unique, and such a decision may be difficult to justify. One strategy is the use of a mass-univariate approach, in which a statistical test is performed for every possible edge in the network and then corrected for multiple comparisons using standard techniques, such as the Bonferroni correction or the false discovery rate (FDR) [27,28].
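
The mass-univariate thresholding strategy can be sketched as follows. This example assumes that each correlation is tested with the Fisher z-transformation, treating the *T* time points as independent, and that a two-sided Bonferroni correction is applied over all distinct node pairs. Those choices, the function name, and the clipping constant are assumptions of this sketch; they are not taken from Appendix A.

```python
import math
import numpy as np

def bonferroni_adjacency(R, T, alpha=0.05):
    """Binary adjacency matrix from a correlation matrix R (n x n).

    Each off-diagonal correlation r is mapped to z = atanh(r) * sqrt(T - 3),
    converted to a two-sided p-value under N(0, 1), and kept as an edge
    when p < alpha / m, with m the number of distinct node pairs.
    """
    n = R.shape[0]
    iu = np.triu_indices(n, k=1)                      # distinct pairs (i < j)
    r = np.clip(R[iu], -0.999999, 0.999999)           # keep atanh finite
    z = np.arctanh(r) * math.sqrt(T - 3)
    # two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    m = len(p)
    A = np.zeros((n, n), dtype=int)
    A[iu] = (p < alpha / m).astype(int)
    return A + A.T                                    # symmetric, zero diagonal
```

Replacing the Bonferroni rule with an FDR procedure would change only the line that compares `p` with `alpha / m`.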

After the network is estimated, some descriptive measures are calculated as a means of describing the topological properties of the graph. In brain networks, the popular metrics are the characteristic path length, the clustering coefficient, and the degree distribution. For a comprehensive list of the topological measures used in neuroimaging, see reference [29].


Many other network measures are greatly influenced by basic network characteristics, such as the number of nodes and links and the degree distribution presented in this section.
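
The three metrics named above can be computed directly from a binary, symmetric adjacency matrix. The sketch below uses common textbook definitions, which are assumed here: the degree is the number of links of a node, the local clustering coefficient is the fraction of pairs of neighbours that are themselves linked, and the characteristic path length is the average shortest-path length, taken in this example over connected pairs only. The function names are hypothetical.

```python
import numpy as np
from collections import deque

def node_degrees(A):
    """Number of links attached to each node."""
    return np.asarray(A).sum(axis=1)

def clustering_coefficients(A):
    """Local clustering coefficient of each node (0 when degree < 2)."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)
    triangles = np.diag(A @ A @ A) / 2.0     # triangles through each node
    possible = k * (k - 1) / 2.0             # pairs of neighbours
    c = np.zeros_like(k)
    mask = possible > 0
    c[mask] = triangles[mask] / possible[mask]
    return c

def characteristic_path_length(A):
    """Average shortest-path length over connected ordered node pairs."""
    A = np.asarray(A)
    n = A.shape[0]
    neighbors = [np.flatnonzero(A[i]) for i in range(n)]
    total, pairs = 0, 0
    for source in range(n):
        dist = np.full(n, -1)
        dist[source] = 0
        queue = deque([source])
        while queue:                         # breadth-first search
            u = queue.popleft()
            for v in neighbors[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        reached = dist > 0
        total += dist[reached].sum()
        pairs += reached.sum()
    return total / pairs if pairs > 0 else float("inf")
```

The degree distribution is then the histogram of `node_degrees(A)`, and a network-level clustering coefficient is the mean of the local values.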

### **3. Real Data Example**

We analyzed the resting-state data from the Human Connectome Project (HCP). We chose to work with the data that had been previously denoised using the FIX pipeline [30]. This pipeline uses a gentle high-pass temporal filter, performs motion regression (i.e., the regression of 24 movement parameters: six rigid-body motion parameters, their backward temporal derivatives, and squares of those 12 time series), and applies a regression based on ICA to remove the variance in noise components that is orthogonal to signal components [31]. For the single-subject analysis, we considered the volumes collected with right–left phase encoding for the example subject, Subject 100307. fMRI volumes were acquired every 720 ms. Each volume consisted of an image of size 91 × 109 × 91, for a total of 1200 time frames.

### *3.1. Single-Subject Examples*

### 3.1.1. ROI-Based Connectivity

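We partitioned the brain into ROIs using the version of the AAL atlas registered to the MNI152 space. We considered a total of 166 ROIs and estimated the connectivity using the following methods:

- (a) cross correlation of the average time series of the ROIs;
- (b) partial cross correlation of the average time series of the ROIs;
- (c) cross correlation of the time series of the ROI data projected with the first PC;
- (d) partial cross correlation of the time series of the ROI data projected with the first PC;
- (e) the RV coefficient, with each ROI retaining the principal components that explain 20% of its variability.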


The results for the estimated connectivity values are shown in Figure 1. Inspecting Figure 1 reveals that cross-correlation measures in panels (a) and (c) capture larger correlations than their corresponding partial cross-correlation measures (panels (b) and (d)). The RV coefficient from the method described in (e) seems to be able to capture a large number of small correlations among ROIs. Before drawing any conclusions from the figure, we should first test whether these values are significant. For the first four matrices, the test is done by first transforming these values to z-scores and then thresholding them to identify important correlations. For the RV coefficient in panel (e), significance is based on the asymptotic distribution of the coefficient as detailed in reference [10].

Next, we used these connectivity matrices to obtain a binary graph with the edges determined based on the *p*-values obtained from the z-scores of the correlation coefficient, as described in Appendix A, Equation (A2). The *p*-values were thresholded based on the Bonferroni correction and a significance level of 5%. For the RV coefficient in panel (e), we used the asymptotic distribution of the coefficients to convert the values to z-scores and thresholded them at the quantile of the standard normal distribution given by the Bonferroni correction with a significance level of 5%. Using these criteria, we computed the adjacency matrices shown in Figure 2.
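The thresholding described above can be sketched as follows. This is an illustration rather than the code used for Figure 2: it assumes that, under the null hypothesis, the Fisher z-score multiplied by √(*T* − 3) is approximately standard normal (which ignores temporal autocorrelation), and the function name `bonferroni_adjacency` and the simulated data are hypothetical.

```python
import math
import numpy as np

def bonferroni_adjacency(ts, alpha=0.05):
    """ts: (T, R) array of ROI time series -> binary R x R adjacency matrix."""
    T, R = ts.shape
    cc = np.corrcoef(ts, rowvar=False)
    iu = np.triu_indices(R, k=1)              # each undirected edge once
    r = np.clip(cc[iu], -0.999999, 0.999999)  # keep arctanh finite
    z = np.arctanh(r)                         # Fisher's Z-transform, Equation (A2)
    # Assumption: sqrt(T - 3) * z is roughly N(0, 1) under the null
    stat = np.abs(z) * math.sqrt(T - 3)
    p = np.array([math.erfc(s / math.sqrt(2.0)) for s in stat])  # two-sided
    sig = p <= alpha / p.size                 # Bonferroni correction
    adj = np.zeros((R, R), dtype=int)
    adj[iu[0][sig], iu[1][sig]] = 1
    return adj + adj.T

rng = np.random.default_rng(0)
ts = rng.standard_normal((1200, 6))
ts[:, 1] += 0.8 * ts[:, 0]   # make ROI 0 and ROI 1 strongly correlated
adj = bonferroni_adjacency(ts)
```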

**Figure 1.** Estimated connectivity for the ROIs based on the AAL parcellation. Panel (**a**) depicts the cross-correlation for the average time series of the ROIs, panel (**b**) depicts the partial cross correlation for the average time series of the ROIs, panel (**c**) depicts the cross correlation for the time series of the ROI data projected with the first PC, panel (**d**) depicts the partial cross correlation for the time series of the ROI data projected with the first PC, and panel (**e**) represents the RV coefficient with each ROI retaining the principal components that explain 20% of its variability.

Inspecting Figure 2 reveals the presence of a large number of edges for both (a) and (c) graphs. This indicates a high level of interaction between the different anatomical regions. This high-interaction level was not found in graphs (b) and (d). In panel (e), we observe a moderate level of interaction with a few ROIs connecting with many others, while some regions are quiet during the resting-state experiment.

**Figure 2.** Binary graphs obtained from the thresholded connectivity matrices of Figure 1. For all panels, the white color indicates an edge between the ROIs. Panel (**a**) is the graph obtained by thresholding the cross correlation of the average time series of the ROIs, panel (**b**) depicts the graph from the thresholded partial cross correlation for the average time series of the ROIs, panel (**c**) depicts the graph obtained by thresholding the cross correlation for the time series of the ROI data projected with the first PC, panel (**d**) depicts the graph obtained by thresholding the partial cross correlation for the time series of the ROI data projected with the first PC, and panel (**e**) represents the graph obtained by thresholding the RV coefficient.

### 3.1.2. Network Summary Measures

We used the binary graphs obtained above to estimate a few summary measures, using graph theory as described in Section 2.5. The formulas used in each calculation are detailed in Appendix B. Table 1 shows the results. *CPL* is the characteristic path length excluding all infinite paths from the network, *DG* is the average degree of the network, where the degree is the number of links at each node, *CC* is the average clustering coefficient of the network, and *Inf* is the number of infinite paths in the network. The quantities in Table 1 reflect what we observe in Figure 2. The degree indicates the number of connections between regions. As noted before, the graphs in panels (a) and (c) have a high degree, with many interactions between ROIs. The characteristic path length (CPL) of the RV coefficient indicates that, on average, the network has a short path length, with a value that is comparable to the networks in panels (a) and (c) of Figure 2. This indicates that, despite few regions being connected, the ones that are connected are near each other.



### 3.1.3. Volume-Based Connectivity

**Seed-Based Analysis.** For seed-based analysis, we chose the left pars opercularis (left inferior frontal gyrus) as the seed [32]. We take the average time series for this region and compute the cross correlation with the remaining voxels. We perform a Bonferroni correction with *α* = 0.05 to threshold the correlation values. Figure 3 shows the resulting brain map. We display clusters bigger than 125 as significant voxels, and their mask is overlaid on a template brain consisting of the average over time points of the example subject data used here.

**Figure 3.** Seed-based connectivity of the left pars opercularis. The figure shows sagittal slices, with voxels that have a significant connection with the seed ROI depicted in red.
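A seed-based correlation map of this kind can be sketched as below. The array layout (time points by voxels), the function name `seed_correlation_map`, and the simulated data are assumptions made for illustration; the returned map also contains values for the seed voxels themselves, which can be masked out afterwards.

```python
import numpy as np

def seed_correlation_map(data, seed_mask):
    """data: (T, Nv) voxel time series; seed_mask: boolean (Nv,) marking seed voxels.

    Returns the cross correlation of every voxel with the average seed time series.
    """
    seed_ts = data[:, seed_mask].mean(axis=1)
    dc = data - data.mean(axis=0)          # center each voxel time series
    sc = seed_ts - seed_ts.mean()          # center the seed time series
    num = dc.T @ sc
    den = np.sqrt((dc ** 2).sum(axis=0) * (sc ** 2).sum())
    return num / den

rng = np.random.default_rng(1)
T, Nv = 300, 50
data = rng.standard_normal((T, Nv))
seed_mask = np.zeros(Nv, dtype=bool)
seed_mask[:5] = True                                 # hypothetical seed region
data[:, 10] += 2.0 * data[:, :5].mean(axis=1)        # voxel coupled to the seed
r_map = seed_correlation_map(data, seed_mask)
```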

**Decomposition Methods.** We first obtain the principal components of the data matrix *Y*. It is important to notice that a large number of principal components is needed to represent data variability and that traditional principal components have the issues discussed in Section 2.3. For this particular data, 150 components are needed to represent 20% of the data variability and 463 are needed to represent 50%. We illustrate the first five components scaled by their eigenvalues (i.e., the loadings) in Figure 4.
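The number of components needed to reach a given fraction of the variability can be obtained from the singular values of the centered data matrix. The sketch below assumes a data matrix with time points in rows and voxels in columns; `n_components_for` is a hypothetical helper, and the toy data are simulated.

```python
import numpy as np

def n_components_for(Y, fraction):
    """Smallest number of principal components explaining at least `fraction` of the variance."""
    Yc = Y - Y.mean(axis=0)                       # center each column
    s = np.linalg.svd(Yc, compute_uv=False)       # singular values, in decreasing order
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    return min(int(np.searchsorted(explained, fraction)) + 1, s.size)

# Toy data with one dominant direction of variability
rng = np.random.default_rng(4)
scales = np.r_[10.0, np.ones(19)]
Y = rng.standard_normal((100, 20)) * scales
n_half = n_components_for(Y, 0.50)
n_most = n_components_for(Y, 0.99)
```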

Next, to estimate the independent components, we use the probabilistic independent components analysis proposed in reference [33] and implemented in the MELODIC (multivariate exploratory linear optimized decomposition into independent components) function in FSL. Figure 5 depicts the results.

For illustration purposes, we present the components without thresholding their values. It is more common to use the individual components' maps as inputs in a multisubject approach and then perform thresholding in the group components to identify a group network. We comment more on the topic in the next section.

**Figure 4.** Sagittal view of the ordered principal components' maps from first (**top**) to fifth (**bottom**).

**Figure 5.** Sagittal view of the independent components' maps ordered based on increasing amounts of uniquely explained variance from first (**top**) to fifth (**bottom**).

### **4. Multiple-Subject Functional Connectivity**

When modeling functional MRI, an important goal is to identify the functional connectivity structure in multi-subject data by leveraging a shared structure across subjects. Multi-subject functional connectivity models can range from constrained tensor decomposition models, e.g., PARAFAC, to more flexible approaches where subject-specific connectivity matrices or PCA and ICA models are estimated first, and their concatenated results are used as inputs on a group-based estimation. The optimal model will depend on which level of flexibility best captures the functional connectivity features within the group [34].

In multi-subject ICA models, a simple procedure is to estimate the single-subject connectivity matrix using pre-specified ROIs, as in the seed-based approach described in Section 2, and then aggregate those results into a single matrix, subsequently further decomposing this matrix using principal components. The principal components can then be mapped to estimate a group-based functional connectivity. Ref. [35] used this idea to estimate a dynamical group-based resting-state connectivity of minimally disabled relapsing–remitting patients.

A multi-stage approach is implemented in reference [36] to compare functional connectivity between subjects at a high familial risk for Alzheimer's disease that are clinically asymptomatic versus matched controls. The method follows four steps, including subject-specific SVD, a population-level decomposition of aggregated subject-specific eigenvectors, a projection of the subject-level data onto the population eigenvectors to obtain subject-specific loadings, and the use of the subject-specific loadings in a functional logistic regression model.

A group of methods proposes a *group ICA* approach, where fMRI data is either temporally concatenated across subjects or taken as a multi-dimensional array. The FMRIB Software Library (FSL), a software library containing image analysis and statistical tools for various imaging data, provides group ICA and tensorial ICA in its MELODIC function. This section will focus on these two approaches.

*4.1. Group ICA*

Ref. [37] was the first to propose an approach for performing ICA on fMRI data from a group of subjects. Suppose we observe fMRI data from $N$ subjects. Let $Y_i$ be a matrix of size $T \times N_v$ consisting of $N_v$ time courses representing the BOLD signal at each voxel $v = 1, \ldots, N_v$ for subject $i = 1, \ldots, N$. Their model involves a multi-stage approach as follows.

1. Subject-level data reduction. In this step, reduction is applied in the temporal domain. For each subject $i = 1, \ldots, N$, the reduced data is given by

$$X_i = F_i^{-1} Y_i,$$

where $F_i^{-1}$ is an $L \times T$ reducing matrix and $X_i$ is an $L \times N_v$ matrix representing the reduced data. In practice, $F_i^{-1}$ is obtained by PCA decomposition;


$$X = MC,$$

where $M$ is a $K \times K$ mixing matrix and $C$ is a $K \times N_v$ component map matrix. The resulting group ICA components can be thresholded by first converting them into Z-scores.

Individual maps can be obtained by partitioning $GM$ (where $G = (G^{-1})^{T}$) by subject and going back along the previous steps as follows.

$$GX = GMC = \begin{bmatrix} F_1^{-1} Y_1 \\ \vdots \\ F_N^{-1} Y_N \end{bmatrix}.$$

Based on these steps, $GMC$ is a matrix of size $NL \times N_v$ of individual maps and can be partitioned such that $G_i M_i C_i = F_i^{-1} Y_i$, and $C_i$ contains the individual maps.
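The matrix bookkeeping around group ICA can be sketched as follows. The code only illustrates the reduction and partitioning steps: the ICA estimation of $M$ and $C$ is omitted, the stacking of the reduced subjects followed by a second PCA reduction to $K$ components is an assumption about the intermediate steps, and the helper `pca_reducer` and all sizes are hypothetical. Centering is left out for brevity.

```python
import numpy as np

def pca_reducer(Y, n_keep):
    """Return an (n_keep x rows) matrix projecting onto the top left singular vectors of Y."""
    U, _, _ = np.linalg.svd(Y, full_matrices=False)
    return U[:, :n_keep].T

rng = np.random.default_rng(2)
N, T, Nv, L, K = 3, 40, 200, 10, 4        # hypothetical sizes
Ys = [rng.standard_normal((T, Nv)) for _ in range(N)]

# Step 1: subject-level temporal reduction, X_i = F_i^{-1} Y_i  (L x Nv)
F_inv = [pca_reducer(Y, L) for Y in Ys]
X_sub = [Fi @ Y for Fi, Y in zip(F_inv, Ys)]

# Assumed intermediate step: stack the reduced subjects and reduce to K components
stacked = np.vstack(X_sub)                # (N*L) x Nv
G_inv = pca_reducer(stacked, K)           # K x (N*L)
X = G_inv @ stacked                       # K x Nv, the matrix modeled as X = M C

# Back-projection: the rows of G_inv are orthonormal, so G = (G^{-1})^T
G = G_inv.T                               # (N*L) x K
G_parts = np.split(G, N, axis=0)          # one L x K block G_i per subject
```

Once $M$ and $C$ have been estimated from `X` by an ICA routine, each block `G_parts[i]` links the group decomposition back to the reduced data of subject $i$.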

### *4.2. Tensorial ICA*

The tensor ICA is based on tensor decomposition, which obtains a low-rank representation of a multi-dimensional array. PARAFAC is a common decomposition method [38]. Let $X \in \mathbb{R}^{T \times N_v \times N}$ be an array of fMRI data whose dimensions index time points, voxels, and subjects, respectively. The three-dimensional array $X$ can be decomposed as a sum of $R$ outer products in the following way

$$X = \sum_{r=1}^{R} a_r \circ b_r \circ c_r,$$

where $a_r \in \mathbb{R}^{T}$, $b_r \in \mathbb{R}^{N_v}$, and $c_r \in \mathbb{R}^{N}$. This decomposition implies that each element in the array $X$ can be written as

$$x_{ijk} = \sum_{r=1}^{R} a_{ir} b_{jr} c_{kr}.$$

The vectors in the decomposition can be collected in matrices, e.g., $A = [a_1\ a_2\ \ldots\ a_R]$, and likewise for the matrices $B$ and $C$. The three-dimensional array can be unfolded into matrices in a process called matricization. The unfolding can happen along any of the three dimensions. Along the second dimension, $X_{(2)} \in \mathbb{R}^{N_v \times NT}$ is the mode-two matricization of $X$. Similarly, we can use the unfolding to generate mode-one and mode-three matrices. For details on the PARAFAC decomposition and matricization, please refer to reference [38]. Using these definitions, we can write:

$$X_{(2)} = B(C \odot A)^T,$$

where $\odot$ denotes the Khatri–Rao product. In reference [39], the authors propose an ICA decomposition of the form

$$X^* = (C \odot A)B^T + E,$$

where $X^* = X_{(2)}^T$, and the mixing matrix $M = (C \odot A)$ and the component matrix $B^T$ are estimated as in reference [33].
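The relationship between the PARAFAC factors and the mode-two unfolding can be checked numerically. The sketch below builds a noiseless rank-$R$ array from random factors and verifies the matricized identity; the sizes are arbitrary, `khatri_rao` is a hypothetical helper, and the column ordering of the unfolding (time index varying fastest within each subject) is the convention assumed here.

```python
import numpy as np

def khatri_rao(C, A):
    """Column-wise Kronecker product: column r equals kron(C[:, r], A[:, r])."""
    return np.einsum("kr,ir->kir", C, A).reshape(C.shape[0] * A.shape[0], -1)

rng = np.random.default_rng(3)
T, Nv, N, R = 5, 7, 3, 2                  # arbitrary sizes
A = rng.standard_normal((T, R))
B = rng.standard_normal((Nv, R))
C = rng.standard_normal((N, R))

# x_ijk = sum_r a_ir * b_jr * c_kr
X = np.einsum("ir,jr,kr->ijk", A, B, C)

# Mode-two unfolding: rows index voxels, columns index (subject, time) pairs
X2 = X.transpose(1, 2, 0).reshape(Nv, N * T)
X_star = X2.T                             # the matrix modeled as (C kr A) B^T + E
```

With these definitions, `X2` equals `B @ khatri_rao(C, A).T` up to floating-point error, and `X_star` equals `khatri_rao(C, A) @ B.T` because no noise term was added.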

### **5. Statistical Network Models**

In this section, we follow the notation in reference [40] to describe statistical network models with the purpose of characterizing brain circuitry. In these models, individual functional connectivity is estimated first, using the techniques described in Section 2. After individual estimation, the effects of multiple variables of interest and topological network features are taken into account on the overall network structure.

Let $(\mathcal{Y}_i, \mathcal{X}_i)$ represent the network and covariates for subject $i$, respectively. The probability density function of the network given the covariates is denoted by $P(\mathcal{Y}_i \mid \mathcal{X}_i, \theta_i)$, where $\theta_i$ describes the relationship between $\mathcal{Y}_i$ and $\mathcal{X}_i$. These covariates can be node-specific covariates, such as brain location, as well as functions of the network $\mathcal{Y}_i$, such as the path length or other metrics described in Section 2.5. Popular ways of modeling the density function include exponential random graph models (ERGMs) and mixed models [40].

In ERGMs, we consider binary graphs, and the models are fitted for each subject individually as follows. Let $\mathcal{Y}_i$ be a network with $R$ nodes, represented by an $R \times R$ binary matrix. Then, $\mathcal{Y}_{ijk} = 1$ if a link exists between nodes $j$ and $k$, and $\mathcal{Y}_{ijk} = 0$ otherwise. The probability mass function has the form of a regular exponential family:

$$P(\mathcal{Y}_i = y_i \mid \mathcal{X}_i) = \kappa(\theta)^{-1} \exp\left\{ \theta^T g(y_i, \mathcal{X}_i) \right\},$$

where $g(\cdot)$ is a vector of explanatory network metrics and $\kappa(\theta)$ is the normalizing constant.
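For very small graphs, the exponential-family form can be evaluated exactly by enumerating all possible networks, which makes the role of the normalizing constant explicit. The sketch below is purely illustrative: the choice of statistics (number of edges and number of triangles), the parameter values, and the omission of covariates are assumptions, and enumeration is infeasible for networks of realistic size.

```python
import itertools
import numpy as np

def graph_stats(adj):
    """g(y): number of edges and number of triangles of an undirected binary graph."""
    edges = adj[np.triu_indices_from(adj, k=1)].sum()
    triangles = np.trace(adj @ adj @ adj) / 6
    return np.array([edges, triangles], dtype=float)

def ergm_pmf(n_nodes, theta):
    """Enumerate all undirected graphs on n_nodes and return them with their probabilities."""
    pairs = list(itertools.combinations(range(n_nodes), 2))
    graphs, weights = [], []
    for bits in itertools.product([0, 1], repeat=len(pairs)):
        adj = np.zeros((n_nodes, n_nodes), dtype=int)
        for (j, k), b in zip(pairs, bits):
            adj[j, k] = adj[k, j] = b
        graphs.append(adj)
        weights.append(np.exp(theta @ graph_stats(adj)))
    weights = np.array(weights)
    return graphs, weights / weights.sum()   # dividing by kappa(theta)

theta = np.array([-1.0, 0.5])                # hypothetical parameter values
graphs, probs = ergm_pmf(4, theta)
```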

The parameters $\theta$ are estimated by Markov chain Monte Carlo maximum likelihood estimation (MCMC MLE). In reference [41], the authors identify the most important explanatory metrics $g(y_i)$ for each subject's network. Next, they create a group-based summary measure of the fitted parameter values $\theta$ for all subjects. They then use these group-based explanatory metrics and parameters to fit a group-based representative network via ERGMs.

One limitation of the current estimation methods for ERGMs is scalability. The major issue is not the number of ROIs per se but the edge structure of the network, which can cause convergence problems. Moreover, most models were developed for binary graphs and are not well-suited for link-level examination [40].

As an alternative to ERGMs, mixed models allow for both link-level examination and multiple-subject comparisons. The framework defines a two-part mixed-effects model that describes both the probability of a connection being present or absent and the strength of a connection if it exists. Let $\mathcal{Y}_i$ be a representation of the functional connectivity strength given by one of the correlation measures listed in Appendix A, and let $\mathcal{R}_{ijk}$ be an indicator of whether a connection between nodes $j$ and $k$ is present. Then the conditional probabilities are

$$P(\mathcal{R}_{ijk} = r_{ijk} \mid \beta_r; b_{ri}) = \begin{cases} 1 - p_{ijk}(\beta_r; b_{ri}), & \text{if } r_{ijk} = 0, \\ p_{ijk}(\beta_r; b_{ri}), & \text{if } r_{ijk} = 1, \end{cases}$$

where $\beta_r$ is the vector of fixed effects that relates the probability of a connection to the covariates $\mathcal{X}_{ijk}$ for each participant and pair of nodes, and $b_{ri}$ are random effects representing subject-specific and node-specific parameters.

Let $Z_{ijk}$ be the design matrix associated with the random effects $b_{ri}$; the model is divided into two parts. The first part uses a logit link function to relate the probability of a connection between nodes $j$ and $k$ to the covariates as follows.

$$\operatorname{logit}(p_{ijk}) = \mathcal{X}'_{ijk}\beta_r + Z'_{ijk} b_{ri}.$$

The second part models the strength of the connection between nodes $j$ and $k$, given that the connection is present, by linearly linking the Fisher's Z-transform of the correlation coefficient between nodes $j$ and $k$ to the covariates. Let $S_{ijk} = \mathcal{Y}_{ijk} \mid \mathcal{R}_{ijk} = 1$; then

$$\text{Fisher's Z-transform}(S_{ijk}) = \mathcal{X}'_{ijk}\beta_s + Z'_{ijk} b_{si} + e_{ijk},$$

where $\beta_s$ is a vector of population parameters that relates the strength of connection to the same set of covariates $\mathcal{X}_{ijk}$ for each participant and pair of nodes, $b_{si}$ is a vector of subject- and node-specific parameters that capture how this relationship varies about the population average $\beta_s$, and $e_{ijk}$ is the random noise for subject $i$ and nodes $j$ and $k$. Details of the two-part modeling approach are presented in reference [42].
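To make the two parts concrete, the following sketch evaluates the conditional log-likelihood contribution of a single node pair, given the random effects. It assumes Gaussian noise with standard deviation `sigma` for the strength part, which is an assumption of this illustration; the function `two_part_loglik` and its argument names are hypothetical.

```python
import math
import numpy as np

def two_part_loglik(r, y, x, z, beta_r, b_r, beta_s, b_s, sigma):
    """Conditional log-likelihood of one node pair.

    r: 1 if the connection is present, 0 otherwise
    y: correlation value for the pair (used only when r == 1)
    x, z: covariate vectors for the fixed and random effects
    """
    # Part 1: logit link for the probability that the connection is present
    eta = float(x @ beta_r + z @ b_r)
    p = 1.0 / (1.0 + math.exp(-eta))
    if r == 0:
        return math.log(1.0 - p)
    # Part 2: linear model for the Fisher Z-transformed strength (Gaussian noise assumed)
    resid = math.atanh(y) - float(x @ beta_s + z @ b_s)
    log_density = -0.5 * math.log(2.0 * math.pi * sigma ** 2) - resid ** 2 / (2.0 * sigma ** 2)
    return math.log(p) + log_density

zeros = np.zeros(2)
ll_absent = two_part_loglik(0, 0.0, zeros, zeros, zeros, zeros, zeros, zeros, 1.0)
ll_present = two_part_loglik(1, 0.0, zeros, zeros, zeros, zeros, zeros, zeros, 1.0)
```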

One issue that arises from these models is that thresholding choices based on the connectivity weights will impact the network topology, even if multiple comparisons are taken into account. The authors in reference [40] argue that persistent homology provides a multi-scale hierarchical framework that addresses the threshold issue. The method is a technique of computational topology that provides a coherent mathematical framework for comparing networks. Instead of looking at the networks at a fixed threshold, persistent homology records the changes in topological network features over multiple resolutions and scales. By doing so, it reveals the features that are robust to noise, i.e., the most 'persistent' topological features.
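The idea of following a topological feature across thresholds can be illustrated with the simplest such feature, the number of connected components. The sketch below counts components of the thresholded network over a decreasing sequence of thresholds; it is a simplified illustration that covers only this zeroth-order feature and is not a full persistent homology computation. The weight matrix is made up.

```python
import numpy as np

def n_components(adj):
    """Number of connected components of an undirected graph (union-find)."""
    n = adj.shape[0]
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path compression
            a = parent[a]
        return a

    for j in range(n):
        for k in range(j + 1, n):
            if adj[j, k]:
                parent[find(j)] = find(k)
    return len({find(a) for a in range(n)})

def component_curve(weights, thresholds):
    """Number of components of the graph {weights >= t} for each threshold t."""
    return [n_components(weights >= t) for t in thresholds]

W = np.array([[1.0, 0.9, 0.2, 0.1],
              [0.9, 1.0, 0.3, 0.1],
              [0.2, 0.3, 1.0, 0.8],
              [0.1, 0.1, 0.8, 1.0]])
curve = component_curve(W, thresholds=[0.95, 0.85, 0.5, 0.25, 0.05])
```

Components that merge only at low thresholds correspond to groups of nodes that stay separated over a wide range of scales.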

### **6. Summary**

In this paper, we have reviewed the most common methods to estimate functional connectivity in fMRI data. For single-subject data, estimation can be done by directly quantifying correlations across regions of interest and/or seed regions, or by finding a set of latent components that represent simultaneous activity. While interpretation is straightforward for the former approach, it is not as clear for the latter. In the example provided, the number of component maps needed to represent the data variability is very high and, therefore, the investigation of only a few components might not reflect the whole picture of the brain network.

The results obtained in Section 3 indicate that even if the regions are defined in an equivalent way, different estimation procedures of connectivity will lead to different interpretations of the networks. Therefore, it is of great importance to be aware of the limitations of each approach, especially when interpreting results from individual data.

Despite the challenges with the single-subject analysis, a consistent procedure, applied to various subjects, might translate into a successful representation of multiple-subject networks. This is especially true if the method does not require a multi-stage approach and performs, instead, a joint estimation as in the tensorial ICA framework. Other emerging multi-subject network methods, such as persistent homology, are a promising way to estimate brain circuitry, especially if scalability can be achieved.

**Funding:** This research was funded by Natural Sciences and Engineering Research Council grant number RGPIN-2020-06941.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **Appendix A. Methods to Quantify Correlation**

**Cross Correlation.** Cross correlation measures the (lagged) temporal dependencies between two signals, and it was first proposed in reference [43] as an effective way to describe functional connectivity. Suppose we want to calculate the correlation between the BOLD time series for a given voxel $v$, i.e., $y_v(t)$, $t = 1, \ldots, T$, and a reference time series $r_{v'}(t)$, $t = 1, \ldots, T$, for $v' \neq v$. Let $\mu_y$ and $\mu_r$ be the average values of the vectors $y_v$ and $r_{v'}$, respectively. Then, the cross correlation between the vectors $y_v$ and $r_{v'}$ is defined as

$$cc_{y,r} = \frac{\sum_{t=1}^{T} (y_v(t) - \mu_y)(r_{v'}(t) - \mu_r)}{\sqrt{\sum_{t=1}^{T} (y_v(t) - \mu_y)^2} \sqrt{\sum_{t=1}^{T} (r_{v'}(t) - \mu_r)^2}}. \tag{A1}$$

The reference vector can be a pre-selected voxel, the seed, or it can be an average of time series in a certain region. For cross correlation between ROIs, both *y*(*t*), *t* = 1, ... , *T* and *r*(*t*), *t* = 1, ... , *T* can be the average time series in the pre-determined regions *y* and *r*, respectively.

It is common to transform each correlation coefficient obtained in (A1) using Fisher's Z-transformation as follows

$$z\text{-score} = \frac{\ln(1 + cc_{y,r}) - \ln(1 - cc_{y,r})}{2}. \tag{A2}$$

These coefficients are approximately normally distributed, and cutoff values are obtained from the standard normal distribution.
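Equations (A1) and (A2) translate directly into code. The sketch below is an illustration with simulated signals; the function names are hypothetical.

```python
import numpy as np

def cross_correlation(y, r):
    """Zero-lag cross correlation of two time series, Equation (A1)."""
    yc, rc = y - y.mean(), r - r.mean()
    return float((yc @ rc) / np.sqrt((yc @ yc) * (rc @ rc)))

def fisher_z(cc):
    """Fisher's Z-transformation of a correlation coefficient, Equation (A2)."""
    return 0.5 * (np.log(1.0 + cc) - np.log(1.0 - cc))

rng = np.random.default_rng(5)
y = rng.standard_normal(200)
r = 0.5 * y + rng.standard_normal(200)    # simulated reference series related to y
cc = cross_correlation(y, r)
z = fisher_z(cc)
```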

**Partial Cross Correlation.** Cross correlation quantifies only the marginal linear dependence between two signals and does not consider the effect of a third signal [15,44]. To remove the linear influence of a third signal $k(t)$, we define the partial correlation as follows.

$$PCC_{y,r|k} = \frac{cc_{y,r} - cc_{y,k}\,cc_{r,k}}{\sqrt{1 - cc_{y,k}^2}\sqrt{1 - cc_{r,k}^2}}. \tag{A3}$$

Partial cross correlation is a valuable metric for estimating brain networks because it can estimate the direct relationship between two signals [15].
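The effect of conditioning on a third signal can be seen in a small simulation in which two signals are related only through a common driver. The code below is a sketch of Equation (A3); the function name and the simulated signals are hypothetical.

```python
import numpy as np

def partial_cross_correlation(y, r, k):
    """Partial correlation of y and r given k, Equation (A3)."""
    cc = np.corrcoef(np.vstack([y, r, k]))
    cyr, cyk, crk = cc[0, 1], cc[0, 2], cc[1, 2]
    return float((cyr - cyk * crk) / np.sqrt((1 - cyk ** 2) * (1 - crk ** 2)))

rng = np.random.default_rng(6)
k = 2.0 * rng.standard_normal(2000)       # common driver
y = k + rng.standard_normal(2000)
r = k + rng.standard_normal(2000)

marginal = float(np.corrcoef(y, r)[0, 1])      # large, induced by the common driver
partial = partial_cross_correlation(y, r, k)   # close to zero once k is removed
```

For three signals, the same value can be read off the inverse of the correlation matrix, which is the form that generalizes to conditioning on more than one signal.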

The calculation of cross correlation and partial cross-correlation measures assumes the signals to be stationary. When this assumption is not satisfied, detrended cross correlation and detrended partial cross correlation should be used instead [45].

**Time-varying connectivity.** It is possible to obtain a dynamical functional connectivity to understand its pattern over time. Both static measures mentioned in this section have a natural time-varying analogue in conjunction with a sliding window [15].

The concept of the sliding window is simple. Starting from the first time point, a window (a fixed number of time points) is selected, and all data points within the window are used to estimate the FC. This window is then shifted a certain number of time points, and the FC is estimated on the new set of data points. The process is repeated until the end of the time course. The series of estimated FC over time is the time-varying FC.
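A sliding-window estimate based on cross correlation can be sketched as follows; the window width and step are user choices, and the function name and simulated data are hypothetical.

```python
import numpy as np

def sliding_window_fc(ts, width, step):
    """ts: (T, R) array of time series -> (n_windows, R, R) array of correlation matrices."""
    T = ts.shape[0]
    starts = range(0, T - width + 1, step)
    return np.stack([np.corrcoef(ts[s:s + width], rowvar=False) for s in starts])

rng = np.random.default_rng(7)
ts = rng.standard_normal((100, 4))
fc = sliding_window_fc(ts, width=30, step=10)   # windows start at 0, 10, ..., 70
```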

### **Appendix B. Calculation of Network Measures**

For completeness, we present the mathematical definitions of the network measures presented in Section 2.5. For a complete list of network measures, please refer to reference [29].

We use the graph notation as defined in Section 2.5. Let $n$ be the number of nodes in the network and $N$ be the set of all nodes. Let $l$ be the number of links in the network and $L$ be the set of all links. Then, $(i, j)$ is a link between nodes $i$ and $j$, and $a_{ij} = 1$ when there is a link $(i, j)$. We define $l = \sum_{i,j} a_{ij}$ (counting each undirected link twice).

**Degree of a node.** The degree of a node $i$ is the number of links connected to the node and is defined as

$$k_i = \sum_{j \in N} a_{ij}.$$

**Shortest path length.** The shortest path length measures the shortest distance between nodes $i$ and $j$ and is defined as:

$$d_{ij} = \sum_{a_{uv} \in g_{i \leftrightarrow j}} a_{uv},$$

where $g_{i \leftrightarrow j}$ is the shortest path (geodesic) between $i$ and $j$. For all disconnected pairs $(i, j)$, $d_{ij} = \infty$.

**Characteristic path length.** Let $L_i$ be the average distance between node $i$ and all other nodes. The characteristic path length is defined as

$$L = \frac{1}{n} \sum_{i \in N} L_i = \frac{1}{n} \sum_{i \in N} \frac{\sum_{j \in N, j \neq i} d_{ij}}{n - 1}.$$

**Number of triangles.** The number of triangles around a node $i$ is defined as

$$t_i = \frac{1}{2} \sum_{j,h \in N} a_{ij} a_{ih} a_{jh}.$$

**Clustering coefficient.** The clustering coefficient of the network is defined as

$$C = \frac{1}{n} \sum_{i \in N} C_i = \frac{1}{n} \sum_{i \in N} \frac{2t_i}{k_i(k_i - 1)}.$$
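The measures above can be computed from a binary adjacency matrix as in the following sketch. Two conventions are assumptions of this illustration: the characteristic path length is averaged over connected pairs only (infinite paths are excluded and counted separately, once per unordered pair), and nodes with fewer than two neighbors receive a clustering coefficient of zero. The function names and the toy graph are hypothetical.

```python
from collections import deque
import numpy as np

def shortest_paths(adj):
    """All-pairs shortest path lengths of a binary graph via breadth-first search."""
    n = adj.shape[0]
    dist = np.full((n, n), np.inf)
    for s in range(n):
        dist[s, s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in np.nonzero(adj[u])[0]:
                if np.isinf(dist[s, v]):
                    dist[s, v] = dist[s, u] + 1
                    queue.append(v)
    return dist

def network_summary(adj):
    n = adj.shape[0]
    k = adj.sum(axis=1)                               # degree k_i
    t = np.diag(adj @ adj @ adj) / 2                  # triangles t_i
    with np.errstate(divide="ignore", invalid="ignore"):
        c = np.where(k > 1, 2 * t / (k * (k - 1)), 0.0)   # C_i, zero if k_i < 2
    d = shortest_paths(adj)
    off_diag = ~np.eye(n, dtype=bool)
    finite = np.isfinite(d) & off_diag
    return {
        "DG": float(k.mean()),
        "CC": float(c.mean()),
        "CPL": float(d[finite].mean()),               # finite paths only
        "Inf": int((~np.isfinite(d) & off_diag).sum() // 2),
    }

# Toy graph: triangle 0-1-2, node 3 attached to node 2, node 4 isolated
adj = np.zeros((5, 5), dtype=int)
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3)]:
    adj[i, j] = adj[j, i] = 1
summary = network_summary(adj)
```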

### **References**


### *Article* **Nonparametric Causal Structure Learning in High Dimensions**

**Shubhadeep Chakraborty and Ali Shojaie \***

Department of Biostatistics, University of Washington, Seattle, WA 98195, USA; deep20@uw.edu **\*** Correspondence: ashojaie@uw.edu

**Abstract:** The PC and FCI algorithms are popular constraint-based methods for learning the structure of directed acyclic graphs (DAGs) in the absence and presence of latent and selection variables, respectively. These algorithms (and their order-independent variants, PC-stable and FCI-stable) have been shown to be consistent for learning sparse high-dimensional DAGs based on partial correlations. However, inferring conditional independences from partial correlations is valid if the data are jointly Gaussian or generated from a linear structural equation model, an assumption that may be violated in many applications. To broaden the scope of high-dimensional causal structure learning, we propose nonparametric variants of the PC-stable and FCI-stable algorithms that employ the conditional distance covariance (CdCov) to test for conditional independence relationships. As the key theoretical contribution, we prove that the high-dimensional consistency of the PC-stable and FCI-stable algorithms carries over to general distributions over DAGs when we implement CdCov-based nonparametric tests for conditional independence. Numerical studies demonstrate that our proposed algorithms perform nearly as well as the PC-stable and FCI-stable algorithms for Gaussian distributions, and offer advantages in non-Gaussian graphical models.

**Keywords:** causal structure learning; consistency; FCI algorithm; high dimensionality; nonparametric testing; PC algorithm

### **1. Introduction**

Directed acyclic graphs (DAGs) are commonly used to represent causal relationships among random variables [1–3]. The PC algorithm [3] is the most popular constraint-based method for learning DAGs from observational data under the assumption of causal sufficiency, i.e., when there are no unmeasured common causes and no selection variables. It first estimates the skeleton of a DAG by recursively performing a sequence of conditional independence tests, and then uses the information from the conditional independence relations to partially orient the edges, resulting in a completed partially directed acyclic graph (CPDAG). In Section 2, we provide a review of these and other notions commonly used in the graphical modeling literature that are relevant to our work. In addition, we refer to estimating the CPDAG as structure learning of the underlying DAG throughout the rest of the paper.

Observational studies often involve latent and selection variables, which complicate the causal structure learning problem. Ignoring such unmeasured variables can make the causal inference based on the PC algorithm erroneous; see, e.g., Section 1.2 in [4] for some illustrations. The Fast Causal Inference (FCI) algorithm and its variants [3–6] utilize similar strategies as the PC algorithm to learn the DAG structure in the presence of latent and selection variables.

Both PC and FCI algorithms adopt a hierarchical search strategy—they recursively perform conditional independence tests given subsets of increasingly larger cardinalities in some appropriate search pool. The PC algorithm is usually order-dependent, in the sense that its output depends on the order in which pairs of adjacent vertices and subsets of their adjacency sets are considered. The FCI algorithm suffers from a similar limitation. To overcome this limitation, Ref. [7] proposed two variants of the PC and FCI algorithms, namely the PC-stable and FCI-stable algorithms that resolve the order dependence at different stages of the algorithms.

**Citation:** Chakraborty, S.; Shojaie, A. Nonparametric Causal Structure Learning in High Dimensions. *Entropy* **2022**, *24*, 351. https://doi.org/10.3390/e24030351

Academic Editors: S. Ejaz Ahmed and Farouk Nathoo

Received: 20 January 2022 Accepted: 25 February 2022 Published: 28 February 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In general, testing for conditional independence is a problem of central importance in causal structure learning. The literature on the PC and FCI algorithms predominantly uses partial correlations to infer conditional independence relations. It is well-known that the characterization of conditional independence by partial correlations, or, in other words, the equivalence between conditional independence and zero partial correlations, only holds for multivariate normal random variables. Therefore, the high-dimensional consistency results for the PC and FCI algorithms [4,8] are limited to Gaussian graphical models, where the nodes correspond to random variables with a joint Gaussian distribution. Although the Gaussian graphical model is the standard parametric model for continuous data, it may not hold in many real data applications. This limitation can be somewhat relaxed by considering linear structural equation models (SEMs) with general noise distributions [9]; however, linear SEMs and joint Gaussianity are essentially equivalent [10]. Moreover, neither approach is appropriate when the observations are categorical, discrete, or are supported on a subset of the real line. In Section 4.3, for example, we present a real application where all the observed variables are categorical, and therefore far from being Gaussian. As an improvement, ref. [11] used rank-based partial correlations to test for conditional independence relations, showing that the high-dimensional consistency of the PC algorithm holds for a broader class of Gaussian copula models. Some nonparametric versions of the PC algorithm have also been proposed in the literature via kernel-based tests for conditional independence [12,13]; however, they lack theoretical justifications of the correctness of the algorithms, and are not studied in high dimensions.

This work aims to broaden the applicability of the PC-stable and FCI-stable algorithms to general distributions by employing a nonparametric test for conditional independence relationships. To this end, we utilize recent developments on dependence metrics that quantify nonlinear and non-monotone dependence between multivariate random variables. More specifically, our work builds on the idea of distance covariance (dCov) proposed by [14] and its extension to conditional distance covariance (CdCov) by [15] as a nonparametric measure of nonlinear and non-monotone conditional dependence between two random vectors of arbitrary dimensions given a third. Utilizing this flexibility, we use CdCov to test for conditional independence relationships in the sample versions of the PC-stable and FCI-stable algorithms. The resulting algorithms, which, for distinction, are termed *nonPC* and *nonFCI*, facilitate causal structure learning from general distributions over DAGs and are shown to be consistent in sparse high-dimensional settings. We establish the consistency of the proposed algorithms using some moment and tail conditions on the variables, without requiring strict distributional assumptions. To our knowledge, the proposed generalizations of the PC/PC-stable and FCI/FCI-stable algorithms provide the first general nonparametric framework for causal structure learning with theoretical guarantees in high dimensions.
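To give a sense of the kind of dependence that distance-based metrics can detect, the following sketch computes the standard sample (unconditional) squared distance covariance from double-centered pairwise distance matrices. It does not implement CdCov or the tests used in this paper; the function name and the simulated data are hypothetical.

```python
import numpy as np

def distance_covariance_sq(x, y):
    """Sample squared distance covariance (V-statistic) of two samples of equal length."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)   # pairwise distances in x
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)   # pairwise distances in y
    # Double centering: subtract row and column means, add back the grand mean
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    return float((A * B).mean())

rng = np.random.default_rng(8)
x = rng.uniform(-1.0, 1.0, size=200)
y = x ** 2                                   # nonlinear, non-monotone dependence on x
d_dep = distance_covariance_sq(x, y)
d_perm = distance_covariance_sq(x, rng.permutation(y))   # dependence broken by shuffling
```

In this example, the linear correlation between `x` and `y` is close to zero by symmetry, whereas the distance covariance is clearly larger for the dependent pair than for the shuffled pair.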

The rest of the paper is organized as follows: In Section 2, we review the relevant background, including preliminaries on graphical modeling (Section 2.1), an outline of the PC-stable and FCI-stable algorithms (Section 2.2) and a brief overview of dCov and CdCov (Section 2.3). The nonparametric version of the PC-stable algorithm is presented in Section 3.1. As a key contribution of the paper, we establish that the algorithm consistently estimates the skeleton and the equivalence class of the underlying sparse high-dimensional DAG in a general nonparametric framework. We then present the nonparametric version of the FCI-stable algorithm in Section 3.2 and establish its consistency in sparse high-dimensional settings. As the FCI involves the adjacency search of the PC algorithm, any improvement on the PC/PC-stable directly carries over to the FCI/FCI-stable as well. In Section 4, we compare the performances of our algorithms with the PC-stable and FCI-stable using both simulated datasets (involving both Gaussian and non-Gaussian examples), as well as a real dataset. These numerical studies clearly demonstrate that the nonPC and nonFCI algorithms are comparable with PC-stable and FCI-stable for Gaussian data and offer improvements for non-Gaussian data.

### **2. Background**

### *2.1. Preliminaries on Graphical Modeling*

We start with introducing some necessary terminologies and background information. Our notations and terminologies follow standard conventions in graphical modeling (see, e.g., [3]). A graph G = (*V*, *E*) consists of a vertex set *V* = {1, ... , *p*} and an edge set *E* ⊆ *V* × *V*. In a graphical model, the vertices or nodes are associated with random variables *Xa* for 1 ≤ *a* ≤ *p*. Throughout, we index the nodes by the corresponding random variables. We also allow the edge set *E* of the graph G to contain (a subset of) the following six types of edges: → (*directed*), ↔ (*bidirected*), − (*undirected*), ◦−◦ (*nondirected*), ◦− (*partially undirected*) and ◦→ (*partially directed*). The endpoints of an edge are called marks, which can be tails, arrowheads or circles. A "◦" at the end of an edge indicates it is not known whether an arrowhead should occur at that place. We use the symbol '-' to denote an arbitrary edge mark; for example, the symbol -→ represents an edge of the type →, ↔ or ◦→ in the graph. A *mixed graph* is a graph containing directed, bidirected and undirected edges. A graph containing only directed edges (→) is called a *directed graph*, one containing only undirected edges (−) is called an *undirected graph*, and one containing directed and undirected edges is called a *partially directed graph*.

The *adjacency set* of a vertex *Xa* in the graph G = (*V*, *E*), denoted adj(G, *Xa*), is the set of all vertices in *V* that are adjacent to *Xa*, or, in other words, are connected to *Xa* by an edge. The *degree* of a vertex *Xa*, |adj(G, *Xa*)|, is defined as the number of vertices adjacent to it. A graph is *complete* if all pairs of vertices in the graph are adjacent. A vertex *Xb* ∈ adj(G, *Xa*) is called a *parent* of *Xa* if *Xb* → *Xa*, a *child* of *Xa* if *Xa* → *Xb* and a *neighbor* of *Xa* if *Xa* − *Xb*. The *skeleton* of the graph G is the undirected graph obtained by replacing all the edges of G by undirected edges, in other words, ignoring all the edge orientations. Three vertices *Xa*, *Xb*, *Xc* are called an *unshielded triple* if *Xa* and *Xb* are adjacent, *Xb* and *Xc* are adjacent, but *Xa* and *Xc* are not adjacent. A *path* is a sequence of distinct adjacent vertices. A node *Xa* is an *ancestor* of its *descendant Xb*, if G contains a directed path *Xa* → ··· → *Xb*. A non-endpoint vertex *Xa* on a path is called a collider on the path if both the edges preceding and succeeding it have an arrowhead at *Xa*, or, in other words, the path contains -→ *Xa* ←-. An unshielded triple *Xa*, *Xb*, *Xc* is called a *v-structure* if *Xb* is a collider on the path *Xa*, *Xb*, *Xc*.

A *cycle* occurs in a graph when there is a path from *Xa* to *Xb*, and *Xa* and *Xb* are adjacent. A directed path from *Xa* to *Xb* forms a *directed cycle* together with the edge *Xb* → *Xa*, and it forms an *almost directed cycle* together with the edge *Xb* ↔ *Xa*. Three vertices that form a cycle are called a *triangle*. A *directed acyclic graph* (DAG) is a directed graph that does not contain any cycle. A DAG entails conditional independence relationships via a graphical criterion called *d-separation* (Section 1.2.3 in [16]). Two vertices *Xa* and *Xb* that are not adjacent in a DAG G are d-separated in G by a subset *XS* ⊆ *V*\{*Xa*, *Xb*}. A probability distribution *P* on $\mathbb{R}^p$ is said to be *faithful* with respect to the DAG G if the conditional independence relationships in *P* can be inferred from G using d-separation and vice versa; in other words, *Xa* ⊥⊥ *Xb*|*XS* if and only if *Xa* and *Xb* are d-separated in G by *XS*.
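To make the d-separation criterion concrete, the following minimal sketch checks it on a small DAG. It relies on the standard moralization characterization (restrict to the ancestors of the queried nodes, connect parents that share a child, drop edge directions, then test ordinary graph separation), which is a textbook equivalent of d-separation and is not taken from this paper. The function names and the edge-list encoding are hypothetical choices made for illustration, and the query nodes are assumed not to belong to the conditioning set.

```python
from itertools import combinations


def ancestors_of(nodes, parents):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for u in parents.get(v, ()):
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen


def d_separated(edges, a, b, cond):
    """True if a and b are d-separated by `cond` in the DAG `edges`.

    edges: iterable of (parent, child) pairs; a and b must not be in cond.
    """
    cond = set(cond)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)

    # Step 1: keep only a, b, cond and their ancestors.
    keep = ancestors_of({a, b} | cond, parents)

    # Step 2: moralize (link each node to its parents, and marry co-parents).
    nbrs = {v: set() for v in keep}
    for v in keep:
        pa = sorted(parents.get(v, ()))
        for u in pa:
            nbrs[u].add(v)
            nbrs[v].add(u)
        for u, w in combinations(pa, 2):
            nbrs[u].add(w)
            nbrs[w].add(u)

    # Step 3: a and b are d-separated iff every undirected path meets cond.
    seen, stack = {a}, [a]
    while stack:
        v = stack.pop()
        if v == b:
            return False
        for u in nbrs[v]:
            if u not in seen and u not in cond:
                seen.add(u)
                stack.append(u)
    return True
```

For the chain 0 → 1 → 2, conditioning on the middle node separates the endpoints, whereas for the collider 0 → 2 ← 1 conditioning on the common child connects them.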

A graph that is both (partially) directed and acyclic is called a *partially directed acyclic graph (PDAG)*. DAGs that encode the same set of conditional independence relations form a Markov equivalence class [17]. Two DAGs belong to the same Markov equivalence class if and only if they have the same skeleton and the same v-structures. A Markov equivalence class of DAGs can be uniquely represented by a *completed partially directed acyclic graph (CPDAG)*, which is a PDAG that satisfies the following: (i) *Xa* → *Xb* in the CPDAG if *Xa* → *Xb* in every DAG in the Markov equivalence class, and (ii) *Xa* − *Xb* in the CPDAG if the Markov equivalence class contains a DAG in which *Xa* → *Xb* as well as a DAG in which *Xa* ← *Xb*.
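The skeleton-plus-v-structures characterization of Markov equivalence stated above can be checked mechanically. The sketch below is only an illustration: the edge-list encoding and the function names are assumptions, and node labels are assumed to be sortable.

```python
from itertools import combinations


def skeleton(edges):
    """Undirected version of a DAG given as (parent, child) pairs."""
    return {frozenset(e) for e in edges}


def v_structures(edges):
    """All triples (u, c, w) with u -> c <- w and u, w non-adjacent."""
    skel = skeleton(edges)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    found = set()
    for child, pa in parents.items():
        for u, w in combinations(sorted(pa), 2):
            if frozenset((u, w)) not in skel:
                found.add((u, child, w))
    return found


def markov_equivalent(dag1, dag2):
    """Same skeleton and same v-structures."""
    return (skeleton(dag1) == skeleton(dag2)
            and v_structures(dag1) == v_structures(dag2))
```

The chain 0 → 1 → 2, its reversal and the fork 0 ← 1 → 2 are all equivalent under this check, while the collider 0 → 1 ← 2 forms its own class.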

### *2.2. The PC-Stable and FCI-Stable Algorithms*

In this section, we provide an outline of the PC/PC-stable and FCI/FCI-stable algorithms. Estimation of the CPDAG by the PC algorithm involves two steps: (1) estimation of the skeleton and separating sets (also called the adjacency search step); and (2) partial orientation of edges; see Algorithms 1 and 2 in [8] for details.

Intuitively, the PC algorithm works as follows. In the first step (the adjacency search step), the algorithm starts with a complete undirected graph. Then, for conditioning sets of increasing cardinality, *k* = 0, 1, ..., the algorithm removes an edge *Xa* − *Xb* if *Xa* and *Xb* are conditionally independent given a subset *S* of size *k* chosen among the current neighbors of nodes *a* and *b*. This process continues up to the order *q* − 1, where *q* is the maximum degree of the underlying DAG. By searching over the neighboring nodes, the algorithm is adaptive and can efficiently infer sparse high-dimensional DAGs, where the sparsity is characterized by the maximum node degree, *q*.

In the presence of latent and selection variables, one needs a generalization of a DAG, called a *maximal ancestral graph* (MAG). A mixed graph is called an *ancestral graph* if it contains no directed or almost directed cycles and no subgraph of the type *Xa* − *Xb* ← - *Xc*. DAGs form a subset of ancestral graphs. A MAG is an ancestral graph in which every missing edge corresponds to a conditional independence relationship via the m-separation criterion [18], a generalization of the notion of d-separation. Multiple MAGs may represent the same set of conditional independence relations. Such MAGs form a Markov equivalence class which can be represented by a *partial ancestral graph* (PAG) [19]; see [18] for additional details.

Under the faithfulness assumption, the Markov equivalence class of a DAG with latent and selection variables can be learned using the FCI algorithm (e.g., Algorithm 3.1 in [4]), which is a modification of the PC algorithm. The FCI algorithm first employs the adjacency search of the PC algorithm, and then performs additional conditional independence queries because of the presence of latent variables followed by partial orientation of the edges, resulting in an estimated PAG. The FCI algorithm adopts the same hierarchical search strategy as the PC algorithm: It starts with a complete undirected graph and recursively removes edges via conditional independence queries given subsets of increasingly larger cardinalities in some appropriate search pool.

The PC algorithm is usually order-dependent, in the sense that its output depends on the order in which pairs of adjacent vertices and subsets of their adjacency sets are considered. The FCI algorithm suffers from a similar limitation, as it shares the adjacency search step of the PC algorithm as its first step. To overcome this limitation, ref. [7] proposed variants of the PC and FCI algorithms, namely the PC-stable and FCI-stable algorithms that resolve the order dependence at different stages of the algorithms. The basic difference between the PC algorithm and the PC-stable algorithm is that, in the adjacency search step, the latter computes and stores the adjacency sets of all the variables after each new cardinality, *k* = 0, 1, ..., of the conditioning sets. These stored adjacency sets are then used to search for conditioning sets of this given size *k*. As a consequence, the removal of an edge no longer affects which conditional independence relations need to be checked for other pairs of variables at this given size of the conditioning sets.
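The adjacency search with frozen adjacency sets described above can be sketched as follows. This is a simplified illustration and not the pseudocode of Appendix A: `indep` is a hypothetical user-supplied callable that answers conditional independence queries (an oracle or a statistical test), node labels are assumed to be sortable, and the stopping rule is the usual one of halting once no pair has enough neighbors left to form a conditioning set of size *k*.

```python
from itertools import combinations


def pc_stable_skeleton(nodes, indep, max_order=None):
    """Order-independent adjacency search in the spirit of PC-stable.

    nodes: list of node labels.
    indep(a, b, S): returns True if a and b are judged independent given set S.
    Returns (adjacency dict, separating sets keyed by frozenset({a, b})).
    """
    adj = {v: set(nodes) - {v} for v in nodes}
    sepset = {}
    k = 0
    while max_order is None or k <= max_order:
        # Freeze the adjacency sets once per conditioning-set size k, so that
        # edge removals at this level do not change which tests are run.
        frozen = {v: set(adj[v]) for v in nodes}
        any_tested = False
        for a in nodes:
            for b in sorted(frozen[a]):
                if b not in adj[a]:
                    continue  # edge already removed at this level
                candidates = frozen[a] - {b}
                if len(candidates) < k:
                    continue
                any_tested = True
                for subset in combinations(sorted(candidates), k):
                    if indep(a, b, set(subset)):
                        adj[a].discard(b)
                        adj[b].discard(a)
                        sepset[frozenset((a, b))] = set(subset)
                        break
        if not any_tested:
            break
        k += 1
    return adj, sepset
```

With an oracle for the chain 0 → 1 → 2, whose only conditional independence is between nodes 0 and 2 given node 1, the search removes exactly the edge between 0 and 2 and records {1} as its separating set.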

We would refer the reader to Appendix A, where we provide in full detail the pseudocodes of the *oracle* versions of the PC-stable and FCI-stable algorithms. In the *oracle* versions of the algorithms, it is assumed that perfect knowledge is available about all the necessary conditional independence relations. As such, conditional independence relations are not estimated from data. Of course, this perfect knowledge is not available in practice. *Sample* versions of the PC-stable and FCI-stable algorithms can be obtained by replacing the conditional independence queries by a suitable test for conditional independence at some pre-specified level. For example, if the variables are jointly Gaussian, one can test for zero partial correlations (see, e.g., [8]). The next subsection is devoted to discussions on nonparametric tests for independence and conditional independence.

### *2.3. Distance Covariance and Conditional Distance Covariance*

We start by describing the notation used throughout the paper. We denote by $\|\cdot\|_p$ the Euclidean norm of $\mathbb{R}^p$ and use $\|\cdot\|$ when the dimension is clear from the context. We use *X* ⊥⊥ *Y* to denote the independence of *X* and *Y* and use $\mathbb{E}_U$ to denote expectation with respect to the probability distribution of the random variable *U*. For any set *S*, we denote its cardinality by |*S*|.

We use the usual asymptotic notation, '*O*' and '*o*', as well as their probabilistic counterparts, *Op* and *op*, which denote stochastic boundedness and convergence in probability, respectively. For two sequences of real numbers $\{a_n\}_{n=1}^{\infty}$ and $\{b_n\}_{n=1}^{\infty}$, $a_n \asymp b_n$ if and only if $a_n/b_n = O(1)$ and $b_n/a_n = O(1)$ as $n \to \infty$. We use the symbol "$a \lesssim b$" to indicate that $a \le C\,b$ for some constant $C > 0$. For a matrix $A = (a_{kl})_{k,l=1}^{n} \in \mathbb{R}^{n \times n}$, we denote its determinant by $|A|$ and define its $\mathcal{U}$-centered version $\tilde{A} = (\tilde{a}_{kl})_{k,l=1}^{n}$ as

$$\tilde{a}_{kl} = \begin{cases} a_{kl} - \frac{1}{n-2} \sum_{j=1}^{n} a_{kj} - \frac{1}{n-2} \sum_{i=1}^{n} a_{il} + \frac{1}{(n-1)(n-2)} \sum_{i,j=1}^{n} a_{ij}, & k \neq l, \\ 0, & k = l, \end{cases} \tag{1}$$

for *k*, *l* = 1, ... , *n*. We denote the indicator function of any set *A* by **1**(*A*). Finally, we denote the integer part of $a \in \mathbb{R}$ by $\lfloor a \rfloor$.

Ref. [14], in their seminal paper, introduced the notion of distance covariance (dCov, henceforth) to quantify nonlinear and non-monotone dependence between two random vectors of arbitrary dimensions. Consider two random vectors $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$ with $\mathbb{E}\|X\|_p < \infty$ and $\mathbb{E}\|Y\|_q < \infty$. The distance covariance between *X* and *Y* is defined as the positive square root of

$$\mathrm{dCov}^2(X,Y) = \frac{1}{c_p c_q} \int_{\mathbb{R}^{p+q}} \frac{|f_{X,Y}(t,s) - f_X(t) f_Y(s)|^2}{\|t\|_p^{1+p}\, \|s\|_q^{1+q}}\, dt\, ds,$$

where $f_X$, $f_Y$ and $f_{X,Y}$ are the individual and joint characteristic functions of *X* and *Y*, respectively, and $c_p = \pi^{(1+p)/2} / \Gamma((1+p)/2)$ is a constant with $\Gamma(\cdot)$ being the complete gamma function.

The key feature of dCov is that it completely characterizes the independence between two random vectors, or in other words dCov(*X*,*Y*) = 0 if and only if *X* ⊥⊥ *Y*. According to Remark 3 in [14], dCov can be equivalently expressed as

$$\begin{split} \mathrm{dCov}^{2}(X,Y) &= \mathbb{E}\left\|X - X'\right\|_{p} \left\|Y - Y'\right\|_{q} + \mathbb{E}\left\|X - X'\right\|_{p}\, \mathbb{E}\left\|Y - Y'\right\|_{q} \\ &\quad - 2\,\mathbb{E}\left\|X - X'\right\|_{p} \left\|Y - Y''\right\|_{q}, \end{split}$$

where (*X*′, *Y*′) and (*X*″, *Y*″) are independent copies of (*X*, *Y*).

This alternate expression comes in handy in constructing V- or U-statistic type estimators for the quantity. For an observed random sample $(X_i, Y_i)_{i=1}^{n}$ from the joint distribution of *X* and *Y*, define the distance matrices $d^X = (d^X_{ij})_{i,j=1}^{n}$ and $d^Y = (d^Y_{ij})_{i,j=1}^{n} \in \mathbb{R}^{n \times n}$, where $d^X_{ij} := \|X_i - X_j\|_p$ and $d^Y_{ij} := \|Y_i - Y_j\|_q$. Following the $\mathcal{U}$-centering idea in [20], an unbiased U-statistic type estimator of $\mathrm{dCov}^2(X,Y)$ can be expressed as

$$\mathrm{dCov}^2_n(X, Y) := (\tilde{d}^X \cdot \tilde{d}^Y) := \frac{1}{n(n-3)} \sum_{i \neq j} \tilde{d}^X_{ij}\, \tilde{d}^Y_{ij}, \tag{2}$$

where $\tilde{d}^X = (\tilde{d}^X_{ij})_{i,j=1}^{n}$ and $\tilde{d}^Y = (\tilde{d}^Y_{ij})_{i,j=1}^{n}$ are the $\mathcal{U}$-centered versions of the matrices $d^X$ and $d^Y$, respectively, as defined in (1).
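The U-centering in (1) and the estimator in (2) translate directly into code. The sketch below is a plain-Python illustration written for readability rather than speed; the function names are hypothetical, each observation is assumed to be a tuple of floats, and at least four observations are required so that the factors *n* − 2 and *n* − 3 are positive.

```python
import math


def distance_matrix(points):
    """Pairwise Euclidean distances between observations (tuples of floats)."""
    return [[math.dist(p, q) for q in points] for p in points]


def u_center(a):
    """U-centered version of a symmetric, zero-diagonal matrix, as in (1)."""
    n = len(a)
    row = [sum(r) for r in a]
    col = [sum(a[i][l] for i in range(n)) for l in range(n)]
    total = sum(row)
    return [
        [
            0.0 if k == l else
            a[k][l] - row[k] / (n - 2) - col[l] / (n - 2)
            + total / ((n - 1) * (n - 2))
            for l in range(n)
        ]
        for k in range(n)
    ]


def dcov2_unbiased(x, y):
    """U-statistic type estimate of dCov^2(X, Y), as in (2)."""
    n = len(x)
    if len(y) != n or n < 4:
        raise ValueError("need two samples of equal size n >= 4")
    dx = u_center(distance_matrix(x))
    dy = u_center(distance_matrix(y))
    off_diag = sum(dx[i][j] * dy[i][j]
                   for i in range(n) for j in range(n) if i != j)
    return off_diag / (n * (n - 3))
```

A scalar sample is passed as one-element tuples, for example `dcov2_unbiased([(0.0,), (1.0,), (2.0,), (4.0,)], [(1.0,), (0.0,), (3.0,), (2.0,)])`. Because the estimator is unbiased, it can take small negative values when the two samples are independent.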

Ref. [15] generalized the notion of dCov and introduced the conditional distance covariance (CdCov, henceforth) as a measure of conditional dependence between two random vectors of arbitrary dimensions given a third. CdCov essentially replaces the characteristic functions used in the definition of dCov by conditional characteristic functions. Consider a third random vector $Z \in \mathbb{R}^r$ with $\mathbb{E}(\|X\|_p + \|Y\|_q \mid Z) < \infty$. Denote by $f_{X,Y|Z}$ the conditional joint characteristic function of *X* and *Y* given *Z*, and by $f_{X|Z}$ and $f_{Y|Z}$ the conditional marginal characteristic functions of *X* and *Y* given *Z*, respectively. Then, CdCov between *X* and *Y* given *Z* is defined as the positive square root of

$$\mathrm{CdCov}^2(X, Y | Z) = \frac{1}{c_p c_q} \int_{\mathbb{R}^{p+q}} \frac{|f_{X,Y|Z}(t, s) - f_{X|Z}(t)\, f_{Y|Z}(s)|^2}{\|t\|_p^{1+p}\, \|s\|_q^{1+q}}\, dt\, ds.$$

The key feature of CdCov is that CdCov (*X*,*Y*|*Z*) = 0 almost surely if and only if *X* ⊥⊥ *Y*|*Z*, which is quite straightforward to see from the definition.

Similar to dCov, an equivalent alternative expression can be established for CdCov that avoids complicated integrations involving conditional characteristic functions. Let $\{W_i = (X_i, Y_i, Z_i)\}_{i=1}^{n}$ be an i.i.d. sample from the joint distribution of $W := (X, Y, Z)$. Define $d_{ijkl} := (d^X_{ij} + d^X_{kl} - d^X_{ik} - d^X_{jl})(d^Y_{ij} + d^Y_{kl} - d^Y_{ik} - d^Y_{jl})$, which is not symmetric with respect to $\{i, j, k, l\}$, and therefore necessitates defining the following symmetric form: $d^S_{ijkl} := d_{ijkl} + d_{ijlk} + d_{ilkj}$. Lemma 1 in [15] establishes an equivalent representation of $\mathrm{CdCov}^2(X, Y | Z = z)$ as

$$\mathrm{CdCov}^2(X, Y | Z=z) = \frac{1}{12}\, \mathbb{E}\left[ d^{S}_{1234} \mid Z_1 = z, Z_2 = z, Z_3 = z, Z_4 = z \right]. \tag{3}$$

**Remark 1.** *In a recent work, [21] explore the connection between conditional independence measures induced by distances on a metric space and reproducing kernels associated with a reproducing kernel Hilbert space (RKHS). They generalize CdCov to arbitrary metric spaces of negative type, termed generalized CdCov (gCdCov), and develop a kernel-based measure of conditional independence, namely the Hilbert–Schmidt conditional independence criterion (HSCIC). Theorem 1 in their paper establishes an equivalence between gCdCov and HSCIC, or, in other words, between distance and kernel-based measures of conditional independence.*

For $w \in \mathbb{R}^r$, let $K_H(w) := |H|^{-1} K(H^{-1} w)$ be a kernel function, where *H* is the diagonal matrix diag(*h*, ... , *h*) determined by a bandwidth parameter *h*. $K_H$ is typically considered to be the Gaussian kernel $K_H(w) = (2\pi)^{-r/2}\, |H|^{-1} \exp\left(-\tfrac{1}{2} w^T H^{-2} w\right)$.

Let $K_{iu} := K_H(Z_i - Z_u) = |H|^{-1} K(H^{-1}(Z_i - Z_u))$ and $K_i(Z) := K_H(Z - Z_i)$ for $1 \le i, u \le n$. Then, by virtue of the equivalent representation of CdCov in (3), a V-statistic type estimator of $\mathrm{CdCov}^2(X, Y | Z)$ can be constructed as

$$\mathrm{CdCov}_n^2(X, Y | Z) := \sum_{i,j,k,l} \frac{K_i(Z)\, K_j(Z)\, K_k(Z)\, K_l(Z)}{12 \left(\sum_{i=1}^{n} K_i(Z)\right)^4}\, d^S_{ijkl}. \tag{4}$$

Under certain regularity conditions, Theorem 4 in [15] shows that, conditioned on *Z*, $\mathrm{CdCov}_n^2(X, Y | Z) \xrightarrow{P} \mathrm{CdCov}^2(X, Y | Z)$ as $n \to \infty$.

### **3. Methodology and Theory**

*3.1. The Nonparametric PC Algorithm in High Dimensions*

To obtain a measure of conditional independence between *X* and *Y* given *Z* that is free of *Z*, we define

$$\rho_0^*(X, Y | Z) := \mathbb{E}\left[\mathrm{CdCov}^2(X, Y | Z)\right]. \tag{5}$$

Clearly, $\rho_0^*(X, Y | Z) = 0$ if and only if *X* ⊥⊥ *Y* | *Z*. Consider a plug-in estimate of $\rho_0^*(X, Y | Z)$ as

$$\begin{aligned} \hat{\rho}^*(X, Y | Z) &:= \frac{1}{n} \sum_{u=1}^n \mathrm{CdCov}_n^2(X, Y | Z_u) = \frac{1}{n} \sum_{u=1}^n \Delta_{i,j,k,l;u}, \\ \text{where} \qquad \Delta_{i,j,k,l;u} &:= \sum_{i,j,k,l} \frac{K_{iu}\, K_{ju}\, K_{ku}\, K_{lu}}{12 \left(\sum_{i=1}^n K_{iu}\right)^4}\, d_{ijkl}^S. \end{aligned} \tag{6}$$

We reject $H_0 : X \perp\!\!\!\perp Y \mid Z$ vs $H_A : X \not\perp\!\!\!\perp Y \mid Z$ at level $\alpha \in (0, 1)$ if $\hat{\rho}^*(X, Y | Z) > \xi_\alpha$, for a suitably chosen threshold $\xi_\alpha$. In Appendix A, we present a local bootstrap procedure for choosing $\xi_\alpha$ in practice, which is also used in our numerical studies. Henceforth, we will often denote $\rho_0^*(X, Y | Z)$ and $\hat{\rho}^*(X, Y | Z)$ simply by $\rho_0^*$ and $\hat{\rho}^*$, respectively, for notational simplicity, whenever there is no confusion.
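A minimal sketch of the estimators in (4) and (6) follows, assuming a Gaussian kernel with a single user-chosen bandwidth `h`; all function names are hypothetical. `cdcov2_bruteforce` evaluates the four-fold sum in (4) literally. `cdcov2_fast` uses an algebraic rearrangement of that sum into three weighted moments, which is our own simplification rather than a formula from the paper; it relies only on the symmetry of the distance matrices. The kernel's normalizing constant cancels in the ratio in (4), so it is omitted, and the threshold $\xi_\alpha$ (obtained by the local bootstrap of Appendix A) is not computed here.

```python
import math
from itertools import product


def gaussian_weights(z_points, z, h):
    """Normalized weights K_i(z) / sum_i K_i(z) for a Gaussian kernel.

    Assumes h is large enough that the kernel values do not all underflow.
    """
    k = [math.exp(-0.5 * math.dist(zi, z) ** 2 / h ** 2) for zi in z_points]
    total = sum(k)
    return [ki / total for ki in k]


def cdcov2_bruteforce(dx, dy, w):
    """Equation (4) taken literally: O(n^4) weighted sum of d^S_{ijkl} / 12."""
    n = len(w)

    def d(i, j, k, l):
        return ((dx[i][j] + dx[k][l] - dx[i][k] - dx[j][l])
                * (dy[i][j] + dy[k][l] - dy[i][k] - dy[j][l]))

    total = 0.0
    for i, j, k, l in product(range(n), repeat=4):
        d_sym = d(i, j, k, l) + d(i, j, l, k) + d(i, l, k, j)
        total += w[i] * w[j] * w[k] * w[l] * d_sym
    return total / 12.0


def cdcov2_fast(dx, dy, w):
    """O(n^2) rearrangement S1 + S2 - 2 * S3 of the same weighted sum."""
    n = len(w)
    ax = [sum(w[j] * dx[i][j] for j in range(n)) for i in range(n)]
    ay = [sum(w[j] * dy[i][j] for j in range(n)) for i in range(n)]
    s1 = sum(w[i] * w[j] * dx[i][j] * dy[i][j]
             for i in range(n) for j in range(n))
    s2 = (sum(w[i] * ax[i] for i in range(n))
          * sum(w[i] * ay[i] for i in range(n)))
    s3 = sum(w[i] * ax[i] * ay[i] for i in range(n))
    return s1 + s2 - 2.0 * s3


def rho_hat(x, y, z, h):
    """Plug-in estimate (6): average of CdCov_n^2(X, Y | Z_u) over the sample."""
    dx = [[math.dist(p, q) for q in x] for p in x]
    dy = [[math.dist(p, q) for q in y] for p in y]
    values = [cdcov2_fast(dx, dy, gaussian_weights(z, zu, h)) for zu in z]
    return sum(values) / len(values)
```

The brute-force version is useful mainly as a cross-check on small samples; the rearranged version keeps the cost of `rho_hat` at order *n*³ instead of *n*⁵.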

In view of the complete characterization of conditional independence by $\rho_0^*$, we propose testing for conditional independence relations nonparametrically in the sample version of the PC-stable algorithm based on $\rho_0^*$, rather than partial correlations. We coin the resulting algorithm the 'nonPC' algorithm, to emphasize that it is a nonparametric generalization of parametric PC-stable algorithms.

The *oracle version* of the first step of nonPC, or the skeleton estimation step, is exactly the same as that of the PC-stable algorithm (Algorithm A1 in Appendix A). The second step, which extends the skeleton estimated in the first step to a CPDAG (Algorithm A2 in Appendix A), consists of purely deterministic rules for edge orientations, and is exactly the same for both the nonPC and PC-stable as well. The only difference lies in the implementation of the tests for conditional independence relationships in the *sample versions* of the first step. Specifically, we replace all the conditional independence queries in the first step by tests based on $\rho_0^*(X, Y | Z)$. At some pre-specified significance level $\alpha$, we infer that $X_a \perp\!\!\!\perp X_b \mid X_S$ when $\hat{\rho}^*(X_a, X_b | X_S) \le \xi_{n,\alpha}$, where $a, b \in V$ and $S \subseteq V$, $S \neq \emptyset$. When $S = \emptyset$, $\hat{\rho}^*(X_a, X_b | X_S) = \mathrm{dCov}_n^2(X_a, X_b)$ and $\rho_0^*(X, Y | Z) = \mathrm{dCov}^2(X, Y)$. The critical value $\xi_{n,\alpha}$ in this case is obtained by a bootstrap procedure (see, e.g., Section 4 in [22] with *d* = 2).

Given that the equivalence between conditional independence and zero partial correlations only holds for multivariate normal random variables, our generalization broadens the scope of applicability of causal structure learning by the PC/PC-stable algorithm to general distributions over DAGs. This nonparametric approach is thus a natural extension of Gaussian and Gaussian copula models. It enables capturing nonlinear and non-monotone conditional dependence relationships among the variables, which partial correlations fail to detect.

Next, we establish theoretical guarantees on the correctness of the nonPC algorithm in learning the true underlying causal structure in sparse high-dimensional settings. Our consistency results only require mild moment and tail conditions on the set of variables, without making any strict distributional assumptions. Denote by $m_p$ the maximum cardinality of the conditioning sets considered in the adjacency search step of the PC-stable algorithm. Clearly, $m_p \le q$, where $q := \max_{1 \le a \le p} |\mathrm{adj}(\mathcal{G}, a)|$ is the maximum degree of the DAG G. For a fixed pair of nodes $a, b \in V$, the conditioning sets considered in the adjacency search step are elements of $J_{a,b}^{m_p} := \{S \subseteq V \setminus \{a, b\} : |S| \le m_p\}$.

We first establish a concentration inequality that gives the rate at which the absolute difference of $\rho_0^*(X_a, X_b | X_S)$ and its plug-in estimate $\hat{\rho}^*(X_a, X_b | X_S)$ decays to zero, for any fixed pair of nodes $a, b \in V$ and a fixed conditioning set *S*. Towards that, we impose the following regularity conditions.

(A1) There exists $s_0 > 0$ such that, for $0 \le s < s_0$, $\sup_p \max_{1 \le a \le p} \mathbb{E} \exp(s X_a^2) < \infty$.

(A2) The kernel function *K*(·) is non-negative and uniformly bounded over its support.

Condition (A1) imposes a sub-exponential tail bound on the squares of the random variables. This is a quite commonly used condition, for example, in the high-dimensional feature screening literature (see, for example, [23]). Condition (A2) is a mild condition on the kernel function *K*(·) that is guaranteed by many commonly used kernels, including the Gaussian kernel. Under conditions (A1) and (A2), the next result shows that the plug-in estimate $\hat{\rho}^*(X_a, X_b | X_S)$ converges in probability to its population counterpart $\rho_0^*(X_a, X_b | X_S)$ exponentially fast.

**Theorem 1.** *Under conditions (A1) and (A2), for any* $\epsilon > 0$*, there exist positive constants A, B and* $\gamma \in (0, 1/4)$ *such that*

$$\mathbb{P}\left(|\hat{\rho}^*(X_a, X_b | X_S) - \rho_0^*(X_a, X_b | X_S)| > \epsilon\right) \leq O\left(2\exp\left(-A\, n^{1-2\gamma} \epsilon^2\right) + n^4 \exp\left(-B\, n^{\gamma}\right)\right).$$

The proof of Theorem 1 is long and somewhat technical; it is thus relegated to Appendix B. Theorem 1 serves as the main building block towards establishing the consistency of the nonPC algorithm in sparse high-dimensional settings.

For notational convenience, henceforth, we denote $\rho_0^*(X_a, X_b | X_S)$ and $\hat{\rho}^*(X_a, X_b | X_S)$ by $\rho^*_{0;ab|S}$ and $\hat{\rho}^*_{ab|S}$, respectively. In Theorem 2 below, we establish a uniform bound for the errors in inferring conditional independence relationships using the $\rho_0^*$-based test in the skeleton estimation step of the sample version of the nonPC algorithm.

**Theorem 2.** *Under conditions (A1) and (A2), for any* $\epsilon > 0$*, there exist positive constants A, B and* $\gamma \in (0, 1/4)$ *such that*

$$\begin{split} \sup_{\substack{a,b \in V \\ S \in J_{a,b}^{m_p}}} \mathbb{P}\left( |\hat{\rho}^{*}_{ab|S} - \rho^{*}_{0;ab|S}| > \epsilon \right) &\leq \mathbb{P}\left( \sup_{\substack{a,b \in V \\ S \in J_{a,b}^{m_p}}} |\hat{\rho}^{*}_{ab|S} - \rho^{*}_{0;ab|S}| > \epsilon \right) \\ &\leq O\left( p^{m_p + 2} \left[ 2 \exp\left( -A\, n^{1 - 2\gamma} \epsilon^{2} \right) + n^{4} \exp\left( -B\, n^{\gamma} \right) \right] \right). \end{split} \tag{7}$$
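The factor $p^{m_p+2}$ in (7) is what a union bound over all triples $(a, b, S)$ would produce. As a sketch of that counting step (our own reading, assuming $p \ge 2$ and that the constants in Theorem 1 can be chosen uniformly over the triples; it is not the authors' proof), the number of triples satisfies

$$\#\{(a, b, S) : a, b \in V,\ S \in J_{a,b}^{m_p}\} \;\le\; p^2 \sum_{t=0}^{m_p} \binom{p-2}{t} \;\le\; p^2 \sum_{t=0}^{m_p} p^{t} \;\le\; 2\, p^{m_p+2},$$

so applying the bound of Theorem 1 to each triple and summing gives a right-hand side of the same order as in (7).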

Next, we turn to proving the consistency of the nonPC algorithm in the high-dimensional setting where the dimension *p* can be much larger than the sample size *n*, but the DAG is considered to be sparse. We impose the following regularity conditions, which are similar to the assumptions imposed in Section 3.1 of [8] in order to prove the consistency of the PC algorithm for Gaussian graphical models. We let the number of variables *p* grow with the sample size *n* and consider $p = p_n$, and also the DAG $\mathcal{G} = \mathcal{G}_n := (V_n, E_n)$ and the distribution $P = P_n$.


*Xa* and *Xb* are d-separated by *XS* ⇐⇒ *Xa* ⊥⊥ *Xb* | *XS* ⇐⇒ $\rho^*_{0;ab|S} = 0$.

Moreover, the $\rho^*_{0;ab|S}$ values are uniformly bounded both from above and below. Formally,

$$C_{\min} := \inf_{\substack{a,b \in V_n,\ S \in J_{a,b}^{m_{p_n}} \\ \rho^*_{0;ab|S} \neq 0}} \rho^*_{0;ab|S} \;\geq\; \lambda_{\min}, \quad \lambda_{\min}^{-1} = O(n^{v}), \qquad \text{and} \qquad C_{\max} := \sup_{\substack{a,b \in V_n \\ S \in J_{a,b}^{m_{p_n}}}} \rho^*_{0;ab|S} \;\leq\; \lambda_{\max},$$

where *λmax* is a positive constant and 0 < *v* < 1/4.

Condition (A3) allows the dimension to grow at any arbitrary polynomial rate of the sample size. Condition (A4) is a sparsity assumption on the underlying true DAG, allowing the maximum degree of the DAG to also grow, but at a slower rate than *n*. Since $m_p \le q_n$, we also have $m_p = O(n^{1-b})$. Finally, Condition (A5) is the strong faithfulness assumption (Definition 1.3 in [24]) on $P_n$ and is similar to condition (A4) in [8]. This essentially requires $\rho^*_{0;ab|S}$ to be bounded away from zero when the vertices *Xa* and *Xb* are not d-separated by *XS*. It is worth noting that the faithfulness assumption alone is not enough to prove the consistency of the PC/PC-stable/nonPC algorithms in high-dimensional settings, and the more stringent strong faithfulness condition is required.

**Remark 2.** *For notational convenience, treat* $X_a$, $X_b$ *and* $X_S$ *as X, Y and Z, respectively, for any* $a, b \in V_n$ *and* $S \in J_{a,b}^{m_{p_n}}$*. From Equation (3), we have*

$$\text{CdCov}^2(\text{X}, \text{Y}|\text{Z}) \;= \frac{1}{12} \mathbb{E} \left[ \text{d}\_{1234}^{\text{S}} \, | \, \text{Z}\_1 = \text{Z}, \dots, \text{Z}\_4 = \text{Z} \right],$$

*which implies*

$$\rho\_0^\* = \mathbb{E}\left[\text{CdCov}^2(\mathbb{X}, \mathbb{Y}|\mathbb{Z})\right] \\ = \frac{1}{12} \mathbb{E}\left[\text{d}\_{1234}^\mathbb{S}\right] \\ = \frac{1}{12} \mathbb{E}\left[\text{d}\_{1234} + \text{d}\_{1243} + \text{d}\_{1432}\right].$$

*Condition (A1) implies* $\sup_p \max_{1 \le a \le p} \mathbb{E} X_a^2 < \infty$*. With this and the definition of* $d_{ijkl}$ *in Section 2.3, it follows from some simple algebra and the Cauchy–Schwarz inequality that* $\rho_0^* < \infty$*. This provides a justification for the second part of Assumption (A5) that* $\sup_{a,b \in V_n,\ S \in J_{a,b}^{m_{p_n}}} \rho^*_{0;ab|S} \le \lambda_{\max}$ *for some positive constant* $\lambda_{\max}$*.*

The next theorem establishes that the nonPC algorithm consistently estimates the skeleton of a sparse high-dimensional DAG, thereby providing the necessary theoretical guarantees to our proposed methodology. It is worth noting that, in the sample version of the PC-stable and hence the nonPC algorithm, all the inference is done during the skeleton estimation step. The second step that involves appropriately orienting the edges of the estimated skeleton is purely deterministic (see Sections 4.2 and 4.3 in [7]). Therefore, to prove the consistency of the nonPC algorithm in estimating the equivalence class of the underlying true DAG, it is enough to prove the consistency of the estimated skeleton. We include the detailed proof of Theorem 3 in Appendix B.

**Theorem 3.** *Assume that Conditions (A1)–(A5) hold. Let* $\mathcal{G}_{skel,n}$ *be the true skeleton of the graph* $\mathcal{G}_n$*, and* $\hat{\mathcal{G}}_{skel,n}$ *be the skeleton estimated by the nonPC algorithm. Then, as* $n \to \infty$*,* $\mathbb{P}\left(\hat{\mathcal{G}}_{skel,n} = \mathcal{G}_{skel,n}\right) \to 1$*.*

**Remark 3.** *In the proof of Theorem 3, we consider the threshold ξα to be of constant order. However, the proof continues to work as long as ξα is of the same order as Cmin as n* → ∞*.*

### *3.2. The Nonparametric FCI Algorithm in High Dimensions*

The FCI is a modification of the PC algorithm that accounts for latent and selection variables. Thus, generalizations of the PC algorithm naturally extend to the FCI as well. Similar to nonPC, we propose testing for conditional independence relations nonparametrically in the *sample version* of the FCI-stable algorithm (Algorithm A3 in Appendix A) based on $\rho_0^*$, instead of partial correlations. We coin the resulting algorithm the 'nonFCI' algorithm, to emphasize that it is a generalization of parametric FCI-stable algorithms. Again, the *oracle version* of the nonFCI is exactly the same as that of the FCI-stable algorithm. The difference is in the implementation of the tests for conditional independence relationships in their *sample versions*. This broadens the scope of the FCI algorithm in causal structure learning for observational data in the presence of latent and selection variables when Gaussianity is not a viable assumption. More specifically, it enables capturing nonlinear and non-monotone conditional dependence relationships among the variables that partial correlations would fail to detect.

Equipped with the theoretical guarantees we established for the nonPC in Section 3.1, we establish below in Theorem 4 the consistency of the nonFCI algorithm for general distributions in sparse high-dimensional settings. Let $\mathcal{H} = (V, E)$ be a DAG with the vertex set partitioned as $V = V_X \cup V_L \cup V_T$, where $V_X$ indexes the set of *p* observed variables, $V_L$ denotes the set of latent variables and $V_T$ stands for the set of selection variables. Let $\mathcal{M}$ be the unique MAG over $V_X$. We let *p* grow with *n* and consider $p = p_n$, $\mathcal{H} = \mathcal{H}_n$ and $Q = Q_n$, where *Q* is the distribution of $(U_1, \dots, U_p) := (X_1 \mid V_T, \dots, X_p \mid V_T)$. We provide below the definition of possible-D-SEP sets (Definition 3.3 in [4]).

**Definition 1.** *Let* C *be a graph with any of the following edge types :* ◦−◦*,* ◦→ *and* ↔*. A possible-D-SEP* (*Xa*, *Xb*) *in* C*, denoted* pds(C, *Xa*, *Xb*)*, is defined as follows: Xc* ∈ pds(C, *Xa*, *Xb*) *if and only if there is a path π between Xa and Xc in* C *such that, for every subpath Xe*, *Xf* , *Xg of π, Xf is a collider on the subpath in* C *or Xe*, *Xf* , *Xg is a triangle in* C*.*

To prove the consistency of the nonFCI algorithm in sparse high-dimensional settings, we impose the following regularity conditions, which are similar to the assumptions imposed in Section 4 in [4].


$$\inf\left\{ |\rho_0^*(U_i, U_j | U_S)| : \rho_0^*(U_i, U_j | U_S) \neq 0 \right\} \geq \lambda_{\min}, \qquad (\lambda_{\min})^{-1} = O(n^{v}), \qquad \text{and} \qquad \sup |\rho_0^*(U_i, U_j | U_S)| \leq \lambda_{\max},$$

where $\lambda_{\max}$ is a positive constant and $0 < v < 1/4$.

**Theorem 4.** *Suppose conditions (A1)–(A3) and (C3)–(C5) hold. Denote by* $\mathcal{C}_n$ *and* $\mathcal{C}_n^*$ *the true underlying FCI-PAG and the output of the nonFCI algorithm, respectively. Then, as* $n \to \infty$*,* $\mathbb{P}\big(\mathcal{C}_n^* = \mathcal{C}_n\big) \to 1$*.*

### **4. Numerical Studies**

### *4.1. Performance of the NonPC Algorithm*

In this subsection, we compare the performances of the nonPC and the PC-stable algorithms in finding the skeleton and the CPDAG for various simulated datasets. We simulate random DAGs in the following examples and sample from probability distributions faithful to them.

**Example 1** (Linear SEM)**.** *We first fix a sparsity parameter s* ∈ (0, 1) *and enumerate the vertices as V* = {1, ... , *p*}*. We then construct a p* × *p adjacency matrix* Λ *as follows. First, initialize* Λ *as a zero matrix. Next, fill every entry in the lower triangle (below the diagonal) of* Λ *by independent realizations of Bernoulli random variables with success probability s. Finally, replace each nonzero entry in* Λ *by independent realizations of a Uniform*(0.1, 1) *random variable.*
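The construction of Λ described in Example 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the reported simulations; the function name `random_weighted_dag` and the choices `p=10`, `s=0.3` are assumptions made for the example.

```python
import numpy as np

def random_weighted_dag(p, s, rng):
    """Lower-triangular weighted adjacency matrix, following Example 1."""
    # Bernoulli(s) edge indicators, kept only strictly below the diagonal.
    indicators = np.tril(rng.random((p, p)) < s, k=-1)
    # Independent Uniform(0.1, 1) weights for the selected entries.
    weights = rng.uniform(0.1, 1.0, size=(p, p))
    return np.where(indicators, weights, 0.0)

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
Lam = random_weighted_dag(p=10, s=0.3, rng=rng)
```

Because only the strict lower triangle is filled, the index order 1, ..., *p* is a topological order of the resulting DAG.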

In this scheme, each node has the same expected degree $\mathbb{E}(m) = (p - 1)s$, where *m* is the degree of a node and follows a Binomial(*p* − 1, *s*) distribution. Using the adjacency matrix Λ, the data are then generated from the following linear structural equation model (SEM):

$$\mathbf{X} = \Lambda \mathbf{X} + \epsilon$$

where $\epsilon = (\epsilon_1, \dots, \epsilon_p)$ and $\epsilon_1, \dots, \epsilon_p$ are jointly independent. To obtain samples $\{X_1^k, \dots, X_p^k\}_{k=1}^n$ on $\{X_1, \dots, X_p\}$, we first sample $\{\epsilon_1^k, \dots, \epsilon_p^k\}_{k=1}^n$ from the following three data-generating schemes. For $1 \le k \le n$ and $1 \le i \le p$,


**Example 2** (Nonlinear SEM)**.** *In this example, we first generate a p* × *p adjacency matrix* Λ *in a similar way as in Example 1 and then generate the data from the following nonlinear SEM (similar to [10]):* $X_i = \sum_{j \,:\, \Lambda_{ij} \neq 0} f_{ij}(X_j) + \epsilon_i$ *with* $\epsilon_i \overset{i.i.d.}{\sim} N(0, 1)$*, where* $1 \le j < i \le p$*. If the functions* $f_{ij}$ *are chosen to be nonlinear, then the data will typically not correspond to a well-known multivariate distribution. We consider* $f_{ij}(x_j) = b_{ij1} x_j + b_{ij2} x_j^2$*, where* $b_{ij1}$ *and* $b_{ij2}$ *are independently sampled from N*(0, 1) *and N*(0, 0.5) *distributions, respectively.*
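A minimal Python sketch of sampling from the nonlinear SEM of Example 2 is given below. It is not the simulation code behind the reported results. Two assumptions are made and flagged in the comments: the second parameter of *N*(0, 0.5) is read as a variance, and the function name `sample_nonlinear_sem` and the small problem size are illustrative only.

```python
import numpy as np

def sample_nonlinear_sem(Lam, n, rng, b2_sd=np.sqrt(0.5)):
    """Draw n samples from X_i = sum_j f_ij(X_j) + eps_i over the support of Lam."""
    p = Lam.shape[0]
    support = Lam != 0                          # parents of i are {j : Lam[i, j] != 0}
    b1 = rng.normal(0.0, 1.0, size=(p, p))      # linear coefficients b_ij1 ~ N(0, 1)
    # ASSUMPTION: N(0, 0.5) is interpreted as variance 0.5 (sd = sqrt(0.5)).
    b2 = rng.normal(0.0, b2_sd, size=(p, p))    # quadratic coefficients b_ij2
    X = np.zeros((n, p))
    for i in range(p):                          # index order is a topological order
        parents = np.flatnonzero(support[i])
        Xpa = X[:, parents]
        noise = rng.normal(0.0, 1.0, size=n)    # eps_i ~ N(0, 1)
        X[:, i] = Xpa @ b1[i, parents] + (Xpa ** 2) @ b2[i, parents] + noise
    return X

rng = np.random.default_rng(0)
p = 5
# Strictly lower-triangular support with Uniform(0.1, 1) entries, as in Example 1.
Lam = np.tril(rng.uniform(0.1, 1.0, (p, p)) * (rng.random((p, p)) < 0.5), k=-1)
X = sample_nonlinear_sem(Lam, n=50, rng=rng)
```

Only the support of Λ is used here, since the functions $f_{ij}$ carry their own coefficients.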

With the exception of Example 1.1, the above examples are all non-Gaussian graphical models. We would thus expect the nonPC to perform better than the PC-stable in learning the unknown causal structure in these examples. For each of the four data generating methods considered above, we compare the Structural Hamming Distance (SHD) [25] between the estimated and the true skeletons of the underlying DAGs using the nonPC and PC-stable algorithms. The SHD between two undirected graphs is the number of edge additions or deletions necessary to make the two graphs match. Therefore, larger SHD values between the estimated and the true skeleton correspond to worse estimates.
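For concreteness, the SHD between two skeletons as defined above can be computed from their adjacency matrices. The sketch below is illustrative; the function name `skeleton_shd` and the three-node example are assumptions.

```python
import numpy as np

def skeleton_shd(A, B):
    """Number of edge additions/deletions needed to turn skeleton A into skeleton B."""
    A = np.asarray(A, dtype=bool)
    B = np.asarray(B, dtype=bool)
    # Symmetrize so that an edge recorded in either direction counts once.
    A = A | A.T
    B = B | B.T
    # Count mismatching edges in the strict upper triangle only.
    return int(np.triu(A ^ B, k=1).sum())

true_skel = np.zeros((3, 3), dtype=int)
true_skel[0, 1] = true_skel[1, 2] = 1      # edges 0-1 and 1-2
est_skel = np.zeros((3, 3), dtype=int)
est_skel[0, 1] = est_skel[0, 2] = 1        # edges 0-1 and 0-2
shd = skeleton_shd(est_skel, true_skel)    # one missing edge plus one extra edge
```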

We consider 199 bootstrap replicates for the CdCov-based conditional independence tests in the implementation of our nonPC algorithm and the significance level *α* = 0.05. Table 1 presents the average SHD for the different data generating schemes over 20 simulation runs, for different choices of *n*, *p* and E(*m*).

**Table 1.** Comparison of the average structural Hamming distances (SHD) of nonPC and PC-stable algorithms across simulation studies.


The results in Table 1 demonstrate that the nonPC performs nearly as well as the PC-stable for the Gaussian data example, in terms of the average SHD. However, for each of the non-Gaussian data examples, the nonPC performs better than the PC-stable in estimating the true skeleton of the underlying DAGs. The improvement in SHD becomes more substantial as the dimension grows. The superior performance of the nonPC over PC-stable for the non-Gaussian graphical models is expected, as the characterization of conditional independence by partial correlations is only valid under the assumption of joint Gaussianity.

### *4.2. Performance of the NonFCI Algorithm*

In this subsection, we compare the performances of the nonFCI and the FCI-stable algorithms over various simulated datasets. We first generate random DAGs as in Examples 1 and 2. To assess the impact of latent variables, we randomly define half of the variables with no parents and at least one child as latent. We do not consider selection variables. We run both the nonFCI and the FCI-stable algorithms on the above data examples with *n* = 200, *p* = {10, 20, 30, 100, 200} and *α* = 0.01, using 199 bootstrap replicates for the CdCov-based conditional independence tests. We consider 20 simulation runs for each of the data generating models. Table 2 reports the average SHD between the estimated and true PAG skeleton by the nonFCI and FCI-stable algorithms.


**Table 2.** Comparison of the average structural Hamming distances (SHD) of nonFCI and FCI-stable algorithms across simulation studies.

The results in Table 2 demonstrate that, in both the Gaussian and non-Gaussian examples, the nonFCI algorithm outperforms the FCI-stable in estimating the true PAG skeleton.

### *4.3. Real Data Example*

A major difficulty in assessing whether nonPC and nonFCI provide more reasonable estimates than the parametric versions of the algorithms in high-dimensional real data settings is that the true causal graph is unknown in most cases. In the absence of the truth, we may only be able to draw some conclusions about sensible causal mechanisms by examining known or logical relationships among pairs of variables. However, this becomes increasingly difficult for larger networks, where even visualization becomes challenging. This is why we first choose a relatively small dataset in Section 4.3.1, where we can draw upon background knowledge to glean insight into potential causal mechanisms in a setting where the data are clearly non-Gaussian. This example highlights the main focus of the paper: with non-Gaussian data (categorical, as in this example), nonPC is expected to perform better than the PC-stable in learning the true causal structure of the underlying DAG. In Section 4.3.2, we consider a larger example and examine the performance of PC-stable and nonPC in learning the DAG from both seemingly Gaussian data and a categorized version of the same data. This example clearly illustrates the potential limitations of PC-stable: in contrast to nonPC, the output of PC-stable can be strikingly different when applied to a categorized version of the original data.

### 4.3.1. Montana Poll Dataset

To demonstrate the flexibility of our proposed framework, we first apply the nonPC algorithm to the Montana Economic Outlook Poll dataset. The poll was conducted in May 1992, when a random sample of 209 Montana residents was asked whether their personal financial status was worse, the same or better than a year ago, and whether they thought the state economic outlook was better than the year before. Accompanying demographic information on the respondents' age, income, political orientation, and area of residence in the state was also recorded. We obtained the dataset from the Data and Story Library (DASL), available at https://math.tntech.edu/e-stat/DASL/page4.html (accessed on 25 March 2021). The study comprises the following seven categorical variables: AGE = 1 for under 35, 2 for 35–54, 3 for 55 and over; SEX = 0 for male, 1 for female; INC = yearly income: 1 for under \$20 K, 2 for \$20–35 K, 3 for over \$35 K; POL = 1 for Democrat, 2 for Independent, 3 for Republican; AREA = 1 for Western, 2 for Northeastern, 3 for Southeastern Montana; FIN (=Financial status): 1 for worse, 2 for same, 3 for better than a year ago; and STAT (=State economic outlook): 1 for better, 0 for not better than a year ago.

After removing the cases with missing values, we are left with *n* = 163 samples. Since all the variables are categorical, the Gaussianity assumption is clearly violated. Thus, we would expect the nonPC to perform better than the PC-stable in learning the true causal structure among the variables in this case. Figure 1 below presents the CPDAGs estimated by the nonPC and PC-stable algorithms at a significance level *α* = 0.1. We consider 199 bootstrap replicates for the CdCov-based conditional independence tests in the implementation of the nonPC algorithm.

It is quite intuitive that age and sex are likely to affect the income; one's financial status and the area of residence might also influence their political inclination; and improvements or downturns in the state economic outlook might impact an individual's financial status. The CPDAG estimated by the nonPC algorithm in Figure 1a affirms such common-sense understanding of these causal influences. However, in the CPDAG estimated by the PC-stable in Figure 1b, the edge between age and income is missing. In addition, the directed edges POL → AREA and POL → FIN seem to make little sense in this case.

**Figure 1.** CPDAGs estimated by the nonPC and PC-stable algorithms for the Montana poll dataset.

### 4.3.2. Protein Expression Data

We next consider a protein expression dataset of 410 patients with breast cancer from The Cancer Genome Atlas (TCGA). The dataset consists of *p* = 118 genes, and we randomly select a subset of *n* = 100 patients with PR-negative status. Since the true causal structure of the genes in the cancer cells may be different from that of normal cells [26], we apply both the nonPC and PC-stable algorithms to learn the causal structure. To put the performances of the nonPC and PC-stable under scrutiny as the data depart farther from Gaussianity, we categorize the protein expression data for each of the *p* genes, denoted by $\{X_a^k\}_{k=1}^n$, $1 \le a \le p$, as follows. We compute the three quartiles $Q_{1,a}$, $Q_{2,a}$ and $Q_{3,a}$ of the protein expression values for every $1 \le a \le p$. Consequently, we obtain categorized protein expressions $\{X_{C,a}^k\}_{k=1}^n$ for $1 \le a \le p$, where

$$X_{C,a}^{k} := \begin{cases} 0 & \text{if } X_{a}^{k} \le Q_{1,a} \\ 1 & \text{if } Q_{1,a} < X_{a}^{k} \le Q_{2,a} \\ 2 & \text{if } Q_{2,a} < X_{a}^{k} \le Q_{3,a} \\ 3 & \text{if } X_{a}^{k} > Q_{3,a}. \end{cases}$$
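The quartile-based categorization above can be sketched as follows. This is an illustrative sketch only: the function name `categorize_by_quartiles` is hypothetical, and the quartiles are computed with NumPy's default (linear interpolation) rule, which is an assumption since the text does not state which quantile definition was used.

```python
import numpy as np

def categorize_by_quartiles(X):
    """Map each column of an (n, p) array to categories 0..3 by its own quartiles."""
    X = np.asarray(X, dtype=float)
    # ASSUMPTION: NumPy's default linear-interpolation quantiles.
    q = np.quantile(X, [0.25, 0.5, 0.75], axis=0)         # shape (3, p)
    # The category equals the number of quartiles that the value strictly exceeds.
    return (X[None, :, :] > q[:, None, :]).sum(axis=0)    # shape (n, p)

X_toy = np.arange(1, 9, dtype=float).reshape(8, 1)        # one variable, values 1..8
C_toy = categorize_by_quartiles(X_toy)
```

Counting strict exceedances reproduces the four cases of the display, including the convention that a value equal to a quartile falls in the lower category.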

We apply the nonPC and PC-stable algorithms to both the original and the categorized protein expression data at a significance level *α* = 0.01. We consider 199 bootstrap replicates for the CdCov-based conditional independence tests in the implementation of the nonPC algorithm. Table 3 below shows the SHD between the skeletons estimated from the original and the categorized data by the nonPC and PC-stable algorithms. It can be seen that the SHD between the skeletons estimated from the original and categorized data by the PC-stable algorithm is much larger than that for nonPC. This example highlights the potential limitation of parametric implementations of the PC algorithm: when the data deviate farther away from Gaussianity (in this case, being categorical), the estimates produced by the PC-stable may deviate considerably more from the estimates from the original data. In contrast, the nonparametric test in nonPC delivers more stable estimates regardless of the data distribution.

**Table 3.** Comparison of the SHD between the skeletons estimated from the original and the categorized protein expression data by the nonPC and PC-stable algorithms.


### **5. Discussion**

We proposed nonparametric variants of the widely popular PC-stable and FCI-stable algorithms, which employ conditional distance covariance (CdCov) to test for conditional independence relationships in their sample versions. Our proposed algorithms broaden the applicability of the PC/PC-stable and FCI/FCI-stable algorithms to general distributions over DAGs, and enable taking into account nonlinear and non-monotone conditional dependence among the random variables, which partial correlations fail to capture. We show that the high-dimensional consistency of the PC-stable and FCI-stable algorithms carries over to more general distributions over DAGs when we implement CdCov-based nonparametric tests for conditional independence. These results are obtained without imposing any strict distributional assumptions and only require moment and tail conditions on the variables.

There are several intriguing potential directions for future research. First, it is generally difficult to select the tuning parameter (i.e., the significance threshold for the CdCov test) in causal structure learning. One possible strategy is to use ideas based on *stability selection* [27,28]. By assessing the stability of the estimated graphs in multiple subsamples, this strategy allows us to choose the tuning parameter in order to control the false positive error. However, the repeated subsampling increases the computational burden. Second, the computational and sample complexities of the PC and FCI algorithms (and hence those of the nonPC and nonFCI) scale with the maximum degree of the DAG, which is assumed to be small relative to the sample size. However, in many applications, one encounters sparse graphs containing a small number of highly connected 'hub' nodes. In such cases, ref. [29] proposed a low-complexity variant of the PC algorithm, namely the *reduced PC* (rPC) algorithm that exploits the local separation property of large random networks [30]. The rPC is shown to consistently estimate the skeleton of a high-dimensional DAG by conditioning only on sets of small cardinality. More recently, ref. [31] have generalized this approach to account for unobserved confounders. In this light, it would be intriguing to develop computationally faster variants of the nonPC and nonFCI in the future by exploiting the idea of local separation.

**Author Contributions:** Conceptualization, S.C. and A.S.; methodology, S.C. and A.S.; formal analysis, S.C.; investigation, S.C.; writing–original draft preparation, S.C.; writing–review and editing, S.C. and A.S.; supervision, A.S.; funding acquisition, A.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** The authors gratefully acknowledge the funding from grants R01GM114029 and R01GM133848 from the US National Institutes of Health and grant DMS-1915855 from the US National Science Foundation.

**Data Availability Statement:** The Montana Poll dataset has been accessed from the Data and Story Library (DASL) at https://math.tntech.edu/e-stat/DASL/page4.html (accessed on 25 March 2021).

**Conflicts of Interest:** The authors declare no conflict of interest.

### **Appendix A. Preliminaries and Background**

For the sake of completeness, we illustrate in this section the pseudocodes of the oracle versions of the PC-stable and FCI-stable algorithms. We also outline a local bootstrap procedure that can be used to approximate the threshold *ξα* mentioned in Section 3.1 and is used throughout the numerical studies in the paper.

Algorithm A1 presents the pseudocode of the oracle version of Step 1 of the PC-stable algorithm (Algorithm 4.1 of [7]), which estimates the skeleton of the underlying DAG. Algorithm A2 presents the pseudocode of Step 2 of the PC-stable algorithm (Algorithm 2 of [8]) that extends the skeleton estimated in Step 1 to the CPDAG. Algorithm A3 presents the pseudocode of the FCI-stable algorithm (Section 4.4 in [7]). It implements Algorithm A4 to obtain an initial skeleton of the underlying PAG, Algorithm A5 to orient the v-structures, and finally Algorithm A6 to obtain the final skeleton that the FCI-stable returns.

To approximate the threshold $\xi_\alpha$ to test for $H_0 : X \perp\!\!\!\perp Y \mid Z$ vs. $H_A : X \not\perp\!\!\!\perp Y \mid Z$ at level $\alpha \in (0, 1)$ (see Section 3.1), we consider the following local bootstrap procedure in the light of Section 4.3 in [15]. Given the i.i.d. sample $\{W_i = (X_i, Y_i, Z_i)\}_{i=1}^n$ from the joint distribution of $W = (X, Y, Z)$, draw a local bootstrap sample $\{W_i^\dagger = (X_i^\dagger, Y_i, Z_i)\}_{i=1}^n$ and compute the bootstrap statistic. The detailed steps, labeled A and B, are listed after the algorithm listings below.

### **Algorithm A1** Step 1 of the PC-stable algorithm (oracle version).

**Require**: Conditional independence information among all variables in *V*, and an ordering order(*V*) on the variables.

- Form the complete undirected graph C on the vertex set *V*.
- Let *l* = −1;
- **repeat**
  - *l* = *l* + 1;
  - **for** all vertices *Xa* in C **do**
    - let *u*(*Xa*) = *adj*(C, *Xa*)
  - **end for**
  - **repeat**
    - Select a (new) ordered pair of vertices (*Xa*, *Xb*) that are adjacent in C such that |*u*(*Xa*) \ {*Xb*}| ≥ *l*, using order(*V*);
    - **repeat**
      - Choose a (new) set *S* ⊆ *u*(*Xa*) \ {*Xb*} with |*S*| = *l*, using order(*V*);
      - **if** *Xa* ⊥⊥ *Xb* | *S* **then**
        - Delete the edge *Xa* − *Xb* from C;
        - Let *sepset*(*Xa*, *Xb*) = *sepset*(*Xb*, *Xa*) = *S*;
      - **end if**
    - **until** *Xa* and *Xb* are no longer adjacent in C or all *S* ⊆ *u*(*Xa*) \ {*Xb*} with |*S*| = *l* have been considered
  - **until** all ordered pairs of adjacent vertices (*Xa*, *Xb*) in C with |*u*(*Xa*) \ {*Xb*}| ≥ *l* have been considered
- **until** all pairs of adjacent vertices (*Xa*, *Xb*) in C satisfy |*u*(*Xa*) \ {*Xb*}| ≤ *l*

**Output**: The estimated skeleton C, separation sets *sepset*.
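The oracle skeleton search of Algorithm A1 can be sketched in Python as below. This is a simplified illustration rather than a reference implementation: the function name `pc_stable_skeleton` and the callable `indep(a, b, S)`, which stands in for the conditional independence oracle, are assumptions introduced for the example. In a sample version, `indep` would be replaced by a statistical test.

```python
from itertools import combinations

def pc_stable_skeleton(nodes, indep):
    """nodes: list of vertices in the chosen order; indep(a, b, S) -> bool."""
    adj = {v: set(nodes) - {v} for v in nodes}   # complete undirected graph
    sepset = {}
    l = -1
    while True:
        l += 1
        # Freeze the adjacency sets for this level (the 'stable' modification).
        u = {v: set(adj[v]) for v in nodes}
        for a in nodes:
            for b in nodes:
                if a == b or b not in adj[a]:
                    continue                     # pair no longer adjacent
                candidates = sorted(u[a] - {b}, key=nodes.index)
                if len(candidates) < l:
                    continue
                for S in combinations(candidates, l):
                    if indep(a, b, set(S)):
                        adj[a].discard(b)
                        adj[b].discard(a)
                        sepset[(a, b)] = sepset[(b, a)] = set(S)
                        break
        # Stop once every remaining adjacent pair satisfies |u(a) \ {b}| <= l.
        if all(len(u[a] - {b}) <= l for a in nodes for b in adj[a]):
            break
    return adj, sepset

# Toy oracle for the chain 0 -> 1 -> 2: the only independence is 0 _||_ 2 | {1}.
chain_oracle = lambda a, b, S: {a, b} == {0, 2} and 1 in S
adj, sepset = pc_stable_skeleton([0, 1, 2], chain_oracle)
```

Freezing `u` at the start of each level is what makes the edge removals within a level independent of the order in which the pairs are visited.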

**Algorithm A2** Step 2 of the PC-stable algorithm.

**Require**: Skeleton C, separation sets *sepset*.

- **for** all pairs of nonadjacent vertices *Xa*, *Xc* with common neighbor *Xb* in C **do**
  - **if** *Xb* ∉ *sepset*(*Xa*, *Xc*) **then**
    - Replace *Xa* − *Xb* − *Xc* in C by *Xa* → *Xb* ← *Xc*;
  - **end if**
- **end for**

In the resulting PDAG, try to orient as many undirected edges as possible by repeated applications of the following rules:

**(R1)** Orient *Xb* − *Xc* into *Xb* → *Xc* whenever there is an arrow *Xa* → *Xb* such that *Xa* and *Xc* are nonadjacent (otherwise, a new v-structure is created).

**(R2)** Orient *Xa* − *Xc* into *Xa* → *Xc* whenever there is a chain *Xa* → *Xb* → *Xc* (otherwise, a directed cycle is created).

**(R3)** Orient *Xa* − *Xc* into *Xa* → *Xc* whenever there are two chains *Xa* − *Xb* → *Xc* and *Xa* − *Xd* → *Xc* such that *Xb* and *Xd* are nonadjacent (otherwise, a new v-structure or a directed cycle is created).
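The first part of Algorithm A2, the orientation of v-structures, can be sketched as follows; the propagation rules (R1)–(R3) are not implemented here. The function name `orient_v_structures` and the dictionary-based graph representation are assumptions made for illustration.

```python
from itertools import combinations

def orient_v_structures(adj, sepset):
    """Return the set of arrowheads (a, b), meaning a -> b, implied by v-structures."""
    arrows = set()
    for b in adj:
        for a, c in combinations(sorted(adj[b]), 2):
            # (a, b, c) is unshielded if a and c are nonadjacent.
            unshielded = c not in adj[a]
            # ASSUMPTION: a missing separation set is treated as empty.
            if unshielded and b not in sepset.get((a, c), set()):
                arrows.add((a, b))
                arrows.add((c, b))
    return arrows

skeleton = {0: {1}, 1: {0, 2}, 2: {1}}
# Collider 0 -> 1 <- 2: vertices 0 and 2 are separated by the empty set.
collider_arrows = orient_v_structures(skeleton, {(0, 2): set(), (2, 0): set()})
# Chain 0 -> 1 -> 2: vertices 0 and 2 are separated by {1}, so nothing is oriented.
chain_arrows = orient_v_structures(skeleton, {(0, 2): {1}, (2, 0): {1}})
```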

**Algorithm A3** The FCI-stable algorithm (oracle version).

**Require** : Conditional independence information among all variables in *VX* given *VT*. Use Algorithm A4 to find an initial skeleton (C), separation sets (sepset) and unshielded triple list (M);

Use Algorithm A5 to orient v-structures (update C);

Use Algorithm A6 to find the final skeleton (update C and sepset);

Use Algorithm A5 to orient v-structures (update C);

Use rules (R1)-(R10) of [6] to orient as many edge marks as possible (update C); **Output** : C, sepset.

**Algorithm A4** Obtaining an initial skeleton in the FCI-stable algorithm (Algorithm 4.1 in the supplement of [4]).

**Require** : Conditional independence information among all variables in *VX* given *VT*, and an ordering order(*VX*) on the variables.

```
Form the complete undirected graph C on the vertex set VX with edges ◦−◦.
Let l = −1;
repeat
   l = l + 1;
   for all vertices Xa in C do
       let u(Xa) = adj(C, Xa)
   end for
   repeat
       Select a (new) ordered pair of vertices (Xa, Xb) that are adjacent in C such that
       |u(Xa) \ {Xb}| ≥ l, using order (VX);
       repeat
          Choose a (new) set Y ⊆ u(Xa) \ {Xb} with |Y| = l, using order(VX);
          if Xa ⊥⊥ Xb | Y ∪ VT then
              Delete the edge Xa ◦−◦ Xb from C;
              Let sepset(Xa, Xb) = sepset(Xb, Xa) = Y;
          end if
       until Xa and Xb are no longer adjacent in C or all Y ⊆ u(Xa) \ {Xb} with |Y| = l
       have been considered
   until all ordered pairs of adjacent vertices (Xa, Xb) in C with |u(Xa) \ {Xb}| ≥ l
   have been considered
until all pairs of adjacent vertices (Xa, Xb) in C satisfy |u(Xa) \ {Xb}| ≤ l
Form a list M of all unshielded triples ⟨Xc, ·, Xd⟩ (i.e., the middle vertex is left unspecified)
in C with c < d.
Output : C, sepset, M.
```
**Algorithm A5** Orienting v-structures in the FCI-stable algorithm (Algorithm 4.2 in the supplement of [4]).

**Require**: Initial skeleton (C), separation sets (sepset) and unshielded triple list (M).

- **for** all elements ⟨*Xa*, *Xb*, *Xc*⟩ of M **do**
  - **if** *Xb* ∉ sepset(*Xa*, *Xc*) **then**
    - Orient *Xa* ∗−◦ *Xb* ◦−∗ *Xc* as *Xa* ∗→ *Xb* ←∗ *Xc*
  - **end if**
- **end for**

**Output**: C, sepset.

### **Algorithm A6** Obtaining the final skeleton in the FCI-stable algorithm (Algorithm 4.3 in the supplement of [4]).

**Require**: Partially oriented graph (C) and separation sets (sepset).

- **for** all vertices *Xa* in C **do**
  - let *v*(*Xa*) = pds(C, *Xa*, ·);
  - **for** all vertices *Xb* ∈ *adj*(C, *Xa*) **do**
    - Let *l* = −1;
    - **repeat**
      - *l* = *l* + 1;
      - **repeat**
        - Choose a (new) set *Y* ⊆ *v*(*Xa*) \ {*Xb*} with |*Y*| = *l*;
        - **if** *Xa* ⊥⊥ *Xb* | *Y* ∪ *VT* **then**
          - Delete the edge *Xa* ∗−∗ *Xb* from C;
          - Let sepset(*Xa*, *Xb*) = sepset(*Xb*, *Xa*) = *Y*;
        - **end if**
      - **until** *Xa* and *Xb* are no longer adjacent in C or all *Y* ⊆ *v*(*Xa*) \ {*Xb*} with |*Y*| = *l* have been considered
    - **until** *Xa* and *Xb* are no longer adjacent in C or |*v*(*Xa*) \ {*Xb*}| < *l*
  - **end for**
- **end for**
- Reorient all edges in C as ◦−◦.
- Form a list M of all unshielded triples ⟨*Xc*, ·, *Xd*⟩ in C with *c* < *d*.

**Output**: C, sepset, M.

A. For $i = 1, \dots, n$, draw $X_i^\dagger$ from

$$
\widehat{F}\_{X|Z=Z\_i} = \frac{\sum\_{j=1}^n K\_{ij} \mathbf{1}(-\infty, X\_j](x)}{\sum\_{j=1}^n K\_{ij}} \, .
$$

Compute $\hat{\rho}^{*\dagger}$ based on the local bootstrap sample $\{W_i^\dagger = (X_i^\dagger, Y_i, Z_i)\}_{i=1}^n$.

B. Repeat Step A *B* times to obtain $\{\hat{\rho}_b^{*\dagger}\}_{b=1}^B$. Obtain $\xi_{n,\alpha}^*$ as the $100(1-\alpha)$th percentile of $\{n h^{r/2} \hat{\rho}_b^{*\dagger}\}_{b=1}^B$. Then, $\frac{1}{n h^{r/2}}\, \xi_{n,\alpha}^*$ can be considered as an approximation for $\xi_\alpha$.
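Steps A and B can be sketched in Python as follows. This is a hedged illustration, not the implementation used in the paper. Several pieces are assumptions: a Gaussian kernel is used for the weights $K_{ij}$, *h* is taken to be the kernel bandwidth and *r* the dimension of *Z*, and the test statistic is supplied by the caller through the hypothetical argument `stat`. The `toy_stat` below is only a placeholder and is not the CdCov-based statistic.

```python
import numpy as np

def local_bootstrap_threshold(X, Y, Z, stat, h, alpha, B, rng):
    """Approximate the threshold via the local bootstrap of Steps A and B."""
    X = np.asarray(X)
    n = len(X)
    Z2 = np.asarray(Z, dtype=float).reshape(n, -1)
    r = Z2.shape[1]                                   # ASSUMPTION: r = dim(Z)
    # ASSUMPTION: Gaussian kernel weights K_ij with bandwidth h.
    d2 = ((Z2[:, None, :] - Z2[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / (2.0 * h ** 2))
    P = K / K.sum(axis=1, keepdims=True)              # row i: law of X_i^dagger over {X_j}
    scale = n * h ** (r / 2)
    boot = np.empty(B)
    for b in range(B):
        # Step A: draw X_i^dagger from the kernel-weighted empirical distribution.
        idx = np.array([rng.choice(n, p=P[i]) for i in range(n)])
        boot[b] = scale * stat(X[idx], Y, Z)
    # Step B: percentile of the scaled statistics, then undo the scaling.
    return np.quantile(boot, 1.0 - alpha) / scale

def toy_stat(x, y, z):
    # Placeholder statistic for illustration only (NOT the CdCov statistic).
    return abs(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(1)
Z = rng.normal(size=40)
X = Z + rng.normal(size=40)
Y = Z + rng.normal(size=40)
xi = local_bootstrap_threshold(X, Y, Z, toy_stat, h=0.5, alpha=0.05, B=50, rng=rng)
```

Resampling only *X* locally in *Z*, while keeping the pairs $(Y_i, Z_i)$ fixed, mimics the null hypothesis of conditional independence.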

### **Appendix B. Proofs of the Theoretical Results**

In this section, we provide detailed technical proofs of the theoretical results presented in the paper. We first state a concentration inequality in Lemma A1. The result in Lemma A1 is not new and can be seen as a corollary of Theorem A in Section 5.6.1 of [32]; however, it is a key technical ingredient in the proof of Theorem 1, which is the main theoretical innovation of our paper. For completeness, we include a short proof for Lemma A1.

**Lemma A1.** *Consider a U-statistic* $U_n = U(X_1, \dots, X_n) = \binom{n}{m}^{-1} \sum_{i_1 < \cdots < i_m} h(X_{i_1}, \dots, X_{i_m})$ *with a symmetric kernel h such that* $\mathbb{E}\, U_n = \mathbb{E}\, h(X_1, \dots, X_m) = \theta$*. Further suppose* $|h(X_1, \dots, X_m)| \le M$ *for some M* > 0*. Then, for any* $\epsilon > 0$*, we have*

$$\mathbb{P}(|U_n - \theta| > \epsilon) \le 2 \exp\left(-\frac{\epsilon^2 k}{2M^2}\right)$$

*where* $k := \lfloor n/m \rfloor$*.*
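To get a feel for the magnitude of the bound in Lemma A1, it can be evaluated numerically. The helper below is a small illustrative sketch; the name `u_statistic_tail_bound` and the chosen inputs are assumptions, and *k* is taken as the integer part of *n*/*m*.

```python
import math

def u_statistic_tail_bound(n, m, M, eps):
    """Evaluate 2 * exp(-eps^2 * k / (2 * M^2)) with k = floor(n / m)."""
    k = n // m
    return 2.0 * math.exp(-(eps ** 2) * k / (2.0 * M ** 2))

# Kernel of order m = 2 bounded by M = 1, with n = 100 observations.
bound = u_statistic_tail_bound(n=100, m=2, M=1.0, eps=0.5)
```

With these inputs, *k* = 50 and the exponent equals −6.25, so the bound is already small at moderate sample sizes.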

### **Proof of Lemma A1.** Define

$$\mathcal{W}(X_1, \dots, X_n) := \frac{1}{k} \left[ h(X_1, \dots, X_m) + h(X_{m+1}, \dots, X_{2m}) + \dots + h(X_{km-m+1}, \dots, X_{km}) \right].$$

Then, following Section 5.1.6 in [32], we can write

$$\mathcal{U}\_n = \frac{1}{n!} \sum\_{\pi} \mathcal{W}(X\_{i\_1}, \dots, X\_{i\_n}) \tag{A1}$$

where $\sum_{\pi}$ denotes summation over all *n*! permutations $(i_1, \dots, i_n)$ of $(1, 2, \dots, n)$. Thus, $U_n$ can be expressed as an average of *n*! terms, each of which is an average of *k* i.i.d. random variables. Using Markov's inequality, convexity of the exponential function and Jensen's inequality, we have, for any *t* > 0,

$$\begin{split} \mathbb{P}(\mathcal{U}_n - \theta > \epsilon) &= \mathbb{P}\big(\exp(t(\mathcal{U}_n - \theta)) > \exp(t\epsilon)\big) \\ &\leq \exp(-t\epsilon)\exp(-t\theta)\,\mathbb{E}\left[\exp(t\,\mathcal{U}_n)\right] \\ &= \exp(-t\epsilon)\exp(-t\theta)\,\mathbb{E}\left[\exp\left(t\,\frac{1}{n!}\sum_{\pi}\mathcal{W}(X_{i_1},\ldots,X_{i_n})\right)\right] \\ &\leq \exp(-t\epsilon)\exp(-t\theta)\,\frac{1}{n!}\sum_{\pi}\mathbb{E}\left[\exp\big(t\,\mathcal{W}(X_{i_1},\ldots,X_{i_n})\big)\right] \\ &= \exp(-t\epsilon)\exp(-t\theta)\left[\mathbb{E}\exp\left(\frac{t}{k}h\right)\right]^{k} \\ &= \exp(-t\epsilon)\left[\mathbb{E}\exp\left(\frac{t}{k}(h-\theta)\right)\right]^{k} \end{split} \tag{A2}$$

where, for notational simplicity, we use *h* to denote *h*(*X*1, ... , *Xm*). Using Hoeffding's Lemma, we have from (A2)

$$\mathbb{P}(\mathcal{U}_n - \theta > \epsilon) \le \exp\left(-t\epsilon + k\,\frac{1}{8}\,\frac{t^2}{k^2}(2M)^2\right) = \exp\left(-t\epsilon + \frac{t^2M^2}{2k}\right).$$

Symmetrically, we obtain

$$\mathbb{P}(|\mathcal{U}_n - \theta| > \epsilon) \le 2 \exp\left(-t\epsilon + \frac{t^2 M^2}{2k}\right). \tag{A3}$$

The right-hand side of (A3) is minimized at $t = \epsilon k / M^2$. Therefore, choosing $t = \epsilon k / M^2$, we obtain

$$\mathbb{P}(|\mathcal{U}_n - \theta| > \epsilon) \le 2 \exp\left(-\frac{\epsilon^2 k}{2M^2}\right).$$
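As a worked check of the minimization step, write $g(t) := -t\epsilon + t^2M^2/(2k)$ for the exponent in (A3); the symbol $g$ is introduced here only for illustration. Then

$$g'(t) = -\epsilon + \frac{tM^2}{k} = 0 \iff t = \frac{\epsilon k}{M^2}, \qquad g\left(\frac{\epsilon k}{M^2}\right) = -\frac{\epsilon^2 k}{M^2} + \frac{\epsilon^2 k}{2M^2} = -\frac{\epsilon^2 k}{2M^2}.$$

Since $g''(t) = M^2/k > 0$, this stationary point is the global minimizer, which gives the exponent in the final bound.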

**Proof of Theorem 1.** When |*S*| = 0, it can be shown along the lines of Theorem 1 in Li et al. (2012) [33] that, for any $\epsilon > 0$, there exist positive constants *A*, *B* and *γ* ∈ (0, 1/4) such that

$$\mathbb{P}\left(|\hat{\rho}^*(X_a, X_b|X_S) - \rho_0^*(X_a, X_b|X_S)| > \epsilon\right) \leq O\left(2\exp\left(-An^{1-2\gamma}\epsilon^2\right) + n\exp\left(-Bn^{\gamma}\right)\right).$$

Now, consider the case 0 < |*S*| ≤ *mp*.

For notational convenience, we treat *Xa*, *Xb* and *XS* as *X*, *Y* and *Z*, respectively.

Denote $\delta_Z := \text{CdCov}^2(X, Y|Z)$. Then, $\rho_0^* = \mathbb{E}[\delta_Z]$. Recall that

$$\begin{aligned} \hat{\rho}\,^\*(X,Y|Z) &:= \frac{1}{n} \sum\_{u=1}^n \text{CdCov}\_n^2(X,Y|Z\_u) \quad \coloneqq \frac{1}{n} \sum\_{u=1}^n \Delta\_{i,j,k,l;u} \\ \text{where} \qquad \Delta\_{i,j,k,l;u} &:= \sum\_{i,j,k,l} \frac{K\_{iu}K\_{ju}K\_{ku}K\_{lu}}{12\left(\sum\_{i=1}^n K\_{iu}\right)^4} d\_{ijkl}^S. \end{aligned} \tag{A4}$$

From (A4), we have

$$\begin{aligned} &\mathbb{E}\left[\text{CdCov}_{n}^{2}(X,Y|Z_u)\,|\,Z\right] \\ &=\frac{1}{12}\,\mathbb{E}\left[d_{1234}^{S}\,|\,Z_1=Z_u,\ldots,Z_4=Z_u\right]\sum_{i,j,k,l}K_{iu}K_{ju}K_{ku}K_{lu}\Big/\left(\sum_{i=1}^{n}K_{iu}\right)^{4} \\ &=\frac{1}{12}\,\mathbb{E}\left[d_{1234}^{S}\,|\,Z_1=Z_u,\ldots,Z_4=Z_u\right] = \delta_{Z_u} \end{aligned} \tag{A5}$$

where the last equality follows from Lemma 1 in [15]. Together, (A4) and (A5) imply $\mathbb{E}[\hat{\rho}^*] = \rho_0^*$.

Now, consider the truncation

$$\begin{split} \rho\_0^\* &= \rho\_{01}^\* + \rho\_{02}^\* \\ &:= \mathbb{E}\left[\frac{1}{12} d\_{i,j,k,l}^S \mathbf{1}\left(\left|\frac{1}{12} d\_{i,j,k,l}^S \right| \le M\right)\right] + \mathbb{E}\left[\frac{1}{12} d\_{i,j,k,l}^S \mathbf{1}\left(\left|\frac{1}{12} d\_{i,j,k,l}^S\right| > M\right)\right] \end{split} \tag{A6}$$

where *M* > 0 will be specified later. Then, using triangle inequality,

$$\begin{split} \mathbb{P}(|\hat{\rho}^{*} - \rho_{0}^{*}| > \epsilon) &= \mathbb{P}\left( \left| \frac{1}{n} \sum_{u=1}^{n} \left( \sum_{i,j,k,l} \Delta_{i,j,k,l;u} - \rho_{0}^{*} \right) \right| > \epsilon \right) \\ &\leq \mathbb{P}\left( \left| \frac{1}{n} \sum_{u=1}^{n} \left( \sum_{i,j,k,l} \Delta_{i,j,k,l;u} \mathbf{1} \left( \left| \frac{1}{12} d_{i,j,k,l}^{S} \right| \leq M \right) - \rho_{01}^{*} \right) \right| > \epsilon/2 \right) \\ &\quad + \mathbb{P}\left( \left| \frac{1}{n} \sum_{u=1}^{n} \left( \sum_{i,j,k,l} \Delta_{i,j,k,l;u} \mathbf{1} \left( \left| \frac{1}{12} d_{i,j,k,l}^{S} \right| > M \right) - \rho_{02}^{*} \right) \right| > \epsilon/2 \right) \\ &=: \mathrm{I} + \mathrm{II}. \end{split} \tag{A7}$$

Clearly, from (A4), we have $|\Delta_{i,j,k,l;u}| \le M$ when $\left|\frac{1}{12} d_{i,j,k,l}^S\right| \le M$. With this observation, we have

$$\mathrm{I} \le 2 \exp\left(-\frac{n\epsilon^2}{8M^2}\right) \tag{A8}$$

which follows from Lemma A1 by setting $m = 1$, $k = n$ and replacing $\epsilon$ by $\epsilon/2$. Choosing $M = c\,n^{\gamma}$ for *γ* ∈ (0, 1/4) and some positive constant *c*, it follows from (A8) that

$$\mathrm{I} \le 2 \exp\left(-C_1 n^{1-2\gamma} \epsilon^2\right) \tag{A9}$$

for some *C*<sup>1</sup> > 0.
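The constant in (A9) can be made explicit by a direct substitution. Under the stated choice $M = c\,n^{\gamma}$, the exponent in (A8) becomes

$$\frac{n\epsilon^2}{8M^2} = \frac{n\epsilon^2}{8c^2 n^{2\gamma}} = \frac{1}{8c^2}\, n^{1-2\gamma}\epsilon^2,$$

so (A9) holds, for instance, with $C_1 = 1/(8c^2)$. Because $\gamma < 1/4$, the exponent $1 - 2\gamma$ exceeds $1/2$, and the bound therefore decays as *n* grows for any fixed $\epsilon > 0$.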

Now, to find a suitable upper bound for II, note that a simple application of triangle inequality yields

$$\begin{split} \frac{\varepsilon}{2} &< \left| \frac{1}{n} \sum\_{u=1}^{n} \sum\_{i,j,k,l} \Delta\_{i,j,k,l;u} \mathbf{1} \left( \left| \frac{1}{12} d\_{i,j,k,l}^{\rm S} \right| > M \right) - \rho\_{02}^{\*} \right| \\ &\leq \left| \frac{1}{n} \sum\_{u=1}^{n} \sum\_{i,j,k,l} \Delta\_{i,j,k,l;u} \mathbf{1} \left( \left| \frac{1}{12} d\_{i,j,k,l}^{\rm S} \right| > M \right) \right| + \left| \rho\_{02}^{\*} \right|. \end{split} \tag{A10}$$

For the choice of *M* = *c n<sup>γ</sup>*, we have

$$\rho\_{02}^{\*} = \mathbb{E}\left[\frac{1}{12} d\_{i,j,k,l}^{S} \mathbf{1}\left(\left|\frac{1}{12} d\_{i,j,k,l}^{S}\right| > M\right)\right] < \frac{\epsilon}{4} \tag{A11}$$

for sufficiently large *n* (see, for example, Exercise 6 in Chapter 5, [34]). Combining (A10) and (A11), we obtain

$$\begin{split} & \left\{ \left| \frac{1}{n} \sum\_{u=1}^{n} \sum\_{i,j,k,l} \Delta\_{i,j,k,l;u} \mathbf{1} \left( \left| \frac{1}{12} \, d\_{i,j,k,l}^{S} \right| > M \right) - \rho\_{02}^{\*} \right| > \epsilon/2 \right\} \\ & \subseteq \left\{ \left| \frac{1}{n} \sum\_{u=1}^{n} \sum\_{i,j,k,l} \Delta\_{i,j,k,l;u} \mathbf{1} \left( \left| \frac{1}{12} \, d\_{i,j,k,l}^{S} \right| > M \right) \right| > \epsilon/4 \right\} \\ & \subseteq \left\{ \left[ \left| \frac{1}{12} \, d\_{i,j,k,l}^{S} \right| > M \right] \text{ for some } 1 \le i, j, k, l \le n \right\}, \end{split}$$

which implies

$$\begin{split} &\mathbb{P}\left(\left|\frac{1}{n}\sum\_{u=1}^{n}\sum\_{i,j,k,l}\Delta\_{i,j,k,l;u}\mathbf{1}\left(\left|\frac{1}{12}d\_{i,j,k,l}^{S}\right|>M\right)-\rho\_{02}^{\*}\right|>\epsilon/2\right) \\ &\leq \mathbb{P}\left(\left|\frac{1}{n}\sum\_{u=1}^{n}\sum\_{i,j,k,l}\Delta\_{i,j,k,l;u}\mathbf{1}\left(\left|\frac{1}{12}d\_{i,j,k,l}^{S}\right|>M\right)\right|>\epsilon/4\right) \\ &\leq n^{4}\mathbb{P}\left(\left|\frac{1}{12}d\_{i,j,k,l}^{S}\right|>M\right). \end{split} \tag{A12}$$

This is because, if $\left| \frac{1}{12} d\_{i,j,k,l}^{S} \right| \leq M$ for all $1 \le i, j, k, l \le n$, then

$$n^{-1} \sum\_{u=1}^{n} \sum\_{i,j,k,l} \Delta\_{i,j,k,l;u} \mathbf{1} \left( \left| \frac{1}{12} \, d\_{i,j,k,l}^{S} \right| > M \right) = 0.$$

Under Condition (A1), Lemma 2 in the supplementary materials of [35] proves that there exists *s* > 0 for which $\mathbb{E}\left[\exp\left(s \left| d\_{1234}^{S} \right|\right)\right]$ is finite. Using Markov's inequality, we have

$$\begin{split} \mathbb{P}\left( \left| \frac{1}{12} \, d\_{i,j,k,l}^{S} \right| > M \right) &\leq \mathbb{P}\left( \exp\left( s \, \left| \frac{1}{12} \, d\_{i,j,k,l}^{S} \right| \right) > \exp(sM) \right) \\ &\leq \exp(-sM) \mathbb{E}\left[ \exp\left( s \, \left| \frac{1}{12} \, d\_{i,j,k,l}^{S} \right| \right) \right] \\ &\leq C\_2 \exp(-sM) \leq C\_2 \exp(-s\_1 n^{\gamma}) \end{split} \tag{A13}$$

for some positive constants *C*<sub>2</sub> and *s*<sub>1</sub>, where the last line uses the fact that *M* = *c n<sup>γ</sup>*. Combining (A12) and (A13), we have

$$
\mathrm{II} \le C\_2 n^4 \exp(-s\_1 n^\gamma) \,. \tag{A14}
$$

Finally, combining (A7), (A9) and (A14), we obtain

$$\mathbb{P}(|\hat{\rho}^{\,\*} - \rho\_0^{\,\*}| > \epsilon) \le 2 \exp\left(-C\_1 n^{1-2\gamma} \epsilon^2\right) + C\_2 n^4 \exp(-s\_1 n^{\gamma})$$

for some positive constants *γ*, *C*1, *C*<sup>2</sup> and *s*1. This completes the proof of the theorem. 

**Proof of Theorem 2.** The first inequality in Theorem 2 follows from the fact that, for any generic random sequence $\{X\_n\}\_{n=1}^{\infty}$ and any $\epsilon > 0$,

$$P(|X\_n| > \epsilon) \le P(\sup\_n |X\_n| > \epsilon)$$

for all *n* ≥ 1, which, in turn, implies

$$\sup\_{n} P(|X\_n| > \epsilon) \le P(\sup\_{n} |X\_n| > \epsilon).$$

The second inequality follows from the union bound and Theorem 1.

**Proof of Theorem 3.** Denote by $E\_{ab|S}$ the event that "an error occurs while testing for $X\_a \perp\!\!\!\perp X\_b \mid X\_S$" for *a*, *b* ∈ *V* and $S \in J\_{a,b}^{m\_{p\_n}}$. Then,

$$\mathbb{P}(\text{an error occurs in the nonPC algorithm}) \le \mathbb{P}\left(\bigcup\_{\substack{a,b \in V\\ S \in J\_{a,b}^{m\_{p\_n}}}} E\_{ab|S}\right) \lesssim p\_n^{m\_{p\_n} + 2} \sup\_{a,b,S} \mathbb{P}(E\_{ab|S}), \tag{A15}$$

which is essentially due to the union bound. Now, we can write $E\_{ab|S} = E\_{ab|S}^{\mathrm{I}} \cup E\_{ab|S}^{\mathrm{II}}$, where

$$\begin{array}{llll} & \text{(Type I error)} & E\_{ab|S}^{\mathrm{I}}: & |\hat{\rho}^{\*}\_{ab|S}| > \xi\_{a} \ \text{ when } \rho^{\*}\_{0:ab|S} = 0 \\ \text{and} & \text{(Type II error)} & E\_{ab|S}^{\mathrm{II}}: & |\hat{\rho}^{\*}\_{ab|S}| \le \xi\_{a} \ \text{ when } \rho^{\*}\_{0:ab|S} > 0 \ . \end{array}$$

Then, by using the triangle inequality,

$$\begin{split} \mathbb{P}(E\_{ab|S}^{\mathrm{I}}) &= \mathbb{P}(|\hat{\rho}^{\*}\_{ab|S}| > \xi\_{a}) = \mathbb{P}(|\hat{\rho}^{\*}\_{ab|S} - \rho^{\*}\_{0:ab|S} + \rho^{\*}\_{0:ab|S}| > \xi\_{a}) \\ &\leq \mathbb{P}(|\hat{\rho}^{\*}\_{ab|S} - \rho^{\*}\_{0:ab|S}| > \xi\_{a} - C\_{\max}) \\ &\leq 2\exp\left(-A\,n^{1-2\gamma}(\xi\_{a} - C\_{\max})^{2}\right) + n^{4}\exp\left(-Bn^{\gamma}\right) \end{split} \tag{A16}$$

for positive constants *A*, *B* and *γ* ∈ (0, 1/4), where the last inequality follows from Theorem 2. Similarly, using the definition of $C\_{\min}$ and the inequality $|a| - |b| \le |a - b|$ for $a, b \in \mathbb{R}$, we have

$$\begin{split} \mathbb{P}(E\_{ab|S}^{\mathrm{II}}) &= \mathbb{P}(|\hat{\rho}\_{ab|S}^{\*}| \leq \xi\_{a}) = \mathbb{P}(-|\hat{\rho}\_{ab|S}^{\*}| \geq -\xi\_{a}) \\ &= \mathbb{P}(|\rho\_{0:ab|S}^{\*}| - |\hat{\rho}\_{ab|S}^{\*}| \geq |\rho\_{0:ab|S}^{\*}| - \xi\_{a}) \\ &\leq \mathbb{P}(|\rho\_{0:ab|S}^{\*} - \hat{\rho}\_{ab|S}^{\*}| \geq C\_{\min} - \xi\_{a}) \\ &\lesssim 2\exp\left(-A\,n^{1-2\gamma}(\xi\_{a} - C\_{\min})^{2}\right) + n^{4}\exp\left(-Bn^{\gamma}\right). \end{split} \tag{A17}$$

Again, the last inequality follows from Theorem 2. Combining Equations (A15)–(A17), we have

$$\begin{aligned} &\mathbb{P}(\text{an error occurs in the nonPC algorithm}) \\ &\quad = O\Big(p\_n^{m\_{p\_n}+2} \Big[ 2 \exp\left(-A\,n^{1-2\gamma}(\xi\_a - C\_{\max})^2\right) + 2 \exp\left(-A\,n^{1-2\gamma}(\xi\_a - C\_{\min})^2\right) \\ &\qquad\quad + n^4 \exp\left(-B n^{\gamma}\right) \Big]\Big) \\ &\quad = o(1), \end{aligned}$$

where the last step follows from the fact that *γ* ∈ (0, 1/4) and Assumption (A5). This implies that, as *n* → ∞,

$$\mathbb{P}\left(\hat{G}\_{\mathrm{skel},n} = G\_{\mathrm{skel},n}\right) = 1 - \mathbb{P}(\text{an error occurs in the nonPC algorithm}) \to 1.$$

**Proof of Theorem 4.** The proof follows along the same lines as the proof of Theorem 4.2 in [4], with Lemma 1.4 in their supplement replaced by Theorem 2 in our paper.

### **References**


### *Article* **Transfer-Learning-Based Approach for the Diagnosis of Lung Diseases from Chest X-ray Images**

**Rong Fan <sup>1</sup> and Shengrong Bu 2,\***


**Abstract:** Using chest X-ray images is one of the least expensive and easiest ways to diagnose patients who suffer from lung diseases such as pneumonia and bronchitis. Inspired by existing work, a deep learning model is proposed to classify chest X-ray images into 14 lung-related pathological conditions. However, small datasets are not sufficient to train the deep learning model. Two methods were used to tackle this: (1) transfer learning based on two pretrained neural networks, DenseNet and ResNet, was employed; (2) data were preprocessed, including checking data leakage, handling class imbalance, and performing data augmentation, before feeding the neural network. The proposed model was evaluated according to the classification accuracy and receiver operating characteristic (ROC) curves, as well as visualized by class activation maps. DenseNet121 and ResNet50 were used in the simulations, and the results showed that the model trained by DenseNet121 had better accuracy than that trained by ResNet50.

**Keywords:** transfer learning; deep learning; pretrained neural networks; chest X-ray images; lung diseases

### **1. Introduction**

Many people suffer from lung diseases such as pneumonia and emphysema every year. Chest X-ray imaging is one of the most widely used and lowest-cost diagnostic tools for lung diseases [1]. However, since more than one pathology may need to be detected from a chest X-ray for a given disease [2], diagnosis can be challenging for doctors. Computer-aided diagnosis for various diseases has been researched to improve the efficiency and accuracy of the diagnosis [3]. Various deep learning methods [4] for medical image classification have the potential to predict and diagnose diseases even more accurately than the average radiologist [5].

Since the global coronavirus pandemic, researchers have developed methods to analyze radiographic chest images more efficiently to make the diagnosis of COVID-19 easier. Heidari et al. developed a novel deep learning model to detect non-pneumonia, non-COVID-19-infected pneumonia and COVID-19-infected pneumonia [6]. In [7], the authors presented a deep learning approach to diagnose pulmonary hypertension by analyzing chest radiographs and compared the performance of ResNet50, Xception, and Inception V3. Yu et al. built a multi-task deep learning network consisting of an extraction architecture and three different routes for various functions by using chest X-rays from peripherally inserted central catheters [8]. Jaiswal et al. localized and identified pneumonia in chest X-ray images using a deep learning model derived from mask-RCNN [9]. In [5], a modified AlexNet with many handcrafted features was proposed to detect whether the chest X-ray images were in the normal or in the pneumonia class.

However, a medical image dataset can be too small to train a neural network, since the images have to be labeled by professionals. Transfer learning originated from terms such as knowledge transfer or inductive transfer in 1995 [10], and later, in 2005, it was defined as the technique of applying knowledge and skills learned in previous tasks to novel tasks [11]. Since then, many studies have employed transfer learning on small medical datasets and trained neural networks to realize image recognition and classification. Minaee et al. applied transfer learning to process chest X-ray images for the detection of COVID-19, and DenseNet121, ResNet18, ResNet50, and SqueezeNet were utilized as the pre-trained networks [12]. In [13], the advantages and challenges of deep transfer learning were studied. Ravishankar et al. detected kidneys in ultrasound images using transfer learning [14]. A deep convolutional neural network (DCNN) was proposed to study the advantages of transfer learning in medicine [15]. Subspace-based techniques, such as in [16], can be used together with transfer learning to increase the accuracy when the dataset is small.

**Citation:** Fan, R.; Bu, S. Transfer-Learning-Based Approach for the Diagnosis of Lung Diseases from Chest X-ray Images. *Entropy* **2022**, *24*, 313. https://doi.org/10.3390/e24030313

Academic Editors: S. Ejaz Ahmed and Farouk Nathoo

Received: 12 January 2022 Accepted: 15 February 2022 Published: 22 February 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Class imbalance is a common challenge in medical image diagnosis [17], since the numbers of positive and negative samples in each class may differ considerably. In this kind of application, the rare or minority occurrences are much more important than the majority classes [18]. As a result, the contributions of these two kinds of data to the loss are not the same, and the small amount of data in some classes affects the overall training performance. Various methods can be used to handle imbalanced datasets, including setting appropriate class weights for the model and random under-sampling and over-sampling.

In this paper, a transfer learning method is proposed to classify 14 lung-related pathologies using frontal-view chest X-ray images. The contributions of this paper are as follows:


The structure of this paper is as follows. The methods and principles with respect to transfer learning, data augmentation, evaluation, and visualization are presented in Section 2. Section 3 then presents the experimental process and results. Finally, the conclusion of this paper is drawn in Section 4.

### **2. Proposed Transfer Learning Method**

In our work, transfer learning was used for the chest X-ray image classification task. Transfer learning is an effective method in the image processing domain that can take advantage of well-developed models to solve new tasks [19]. There are two main ways to utilize pretrained networks in transfer learning: First, a pretrained model can be used as the feature extractor for the new dataset. Once the features are extracted, added layers such as a linear classifier can be trained for the new task. Second, the whole or some part of the pretrained network will be fine-tuned for the new classification task. Thus, the weights of the pretrained model are considered as the initial values and will be updated during the training process. In our work, the first method was used since the dataset was small and the computing power was limited. Two networks, i.e., DenseNet121 and ResNet50, were used as the base models for transfer learning. In the following, the principle of transfer learning, the framework of the networks, and the measures for the evaluation are discussed.

### *2.1. Transfer Learning with a Data Augmentation Approach*

Two pretrained networks were employed as the base models in this project. The first one is ResNet50, from the ResNet family that won first place in the 2015 ImageNet competition. This model uses shortcut connections, which are the basis of a residual network; a shortcut ensures that the features of a preceding layer are fed into a later layer, skipping some of the layers in between. Therefore, any layer in this framework has information from the preceding layers. The design mitigates the problem that learning slows down and classification accuracy stops improving as the network becomes deeper. The second one is DenseNet121; the DenseNet architecture received the Best Paper Award at CVPR 2017 and has been widely applied in deep learning. DenseNet consists of DenseBlock layers and transition layers, and each layer in a DenseBlock receives additional inputs from all preceding layers. These additional inputs, together with the feature maps of the current layer, are all passed on to the subsequent layers, and thus, shortcuts between all the former layers and each latter layer are built densely. For comparison, a traditional CNN with *l* layers has *l* connections between adjacent layers, whereas DenseNet has *l*(*l* + 1)/2 connections in total because of its shortcut feature [6]. Thus, the learned features can be reused, and the network needs fewer channels as a result of the collective knowledge available to each layer. This also leads to better performance with fewer parameters and a low computing cost, and it mitigates the vanishing gradient problem. In contrast, ResNet only has shortcuts between a former layer and a latter layer, and DenseNet has demonstrated better performance. For these reasons, DenseNet121 can be much deeper than ResNet50, with more than 100 layers, while its training process can be more effective and its accuracy higher.
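As a quick check of the connection counts quoted above, the following minimal sketch compares a plain chain of layers with a densely connected block. The function names are illustrative and are not taken from the paper.

```python
def plain_connections(num_layers: int) -> int:
    """A plain CNN links each layer only to the next one: l connections."""
    return num_layers


def dense_connections(num_layers: int) -> int:
    """Dense connectivity: each layer receives the block input and the outputs
    of all earlier layers, which gives l * (l + 1) / 2 connections."""
    return num_layers * (num_layers + 1) // 2


if __name__ == "__main__":
    for layers in (4, 6, 12):
        print(layers, plain_connections(layers), dense_connections(layers))
```

The quadratic growth of the second count is what allows every layer to reuse features from all earlier layers.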

One basic problem of deep learning is the tension between optimization and generalization [20]. Optimization is the learning process that adjusts the model to obtain the best performance on the training data, while generalization is the performance of the model on new, unseen data. The goal of learning is to achieve satisfactory generalization, but this cannot be controlled directly, so the models are always adjusted based on the training data. As training proceeds, the generalization can become worse after a number of iterations, which means the model is overfitting, and this is a common problem in training neural networks. Among the various methods used to prevent neural networks from overfitting, data augmentation is among the most effective and is widely used in computer vision, especially when the dataset is small. In *Keras*, data augmentation can be realized by using the *ImageDataGenerator* class, which transforms the images randomly. Some commonly adjusted parameters include the following: *rotation*\_*range* is the rotation range of the image; *width*\_*shift*\_*range* and *height*\_*shift*\_*range* are the ranges of shifting in the horizontal and vertical directions, respectively; *horizontal*\_*flip* enables random horizontal flipping; *shear*\_*range* is the random shear angle of the image.

### *2.2. Evaluation Methods*

The performance of the network needs to be evaluated after testing. Accuracy and receiver operating characteristic (ROC) curves with the AUCROC were used as the metrics for the evaluation. Accuracy shows the general performance of all testing images, and the ROC curves with the AUCROC indicate the classification performance for each label.

The classification task in our project was a multi-task classification because one image might correspond to more than one pathological condition. Therefore, the *Accuracy* can be calculated as follows, since there are 14 pathological conditions:

$$Accuracy = \frac{\text{sum of truly predicted labels}}{14 \ast (\text{number of testing images})} \tag{1}$$

The accurately predicted labels for all *testing images* were considered together instead of calculating the accuracy of each image and then averaging them. The *sum of the truly predicted labels* was calculated by first finding the number of truly predicted images for each label and adding them together.
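The pooled accuracy of Equation (1) can be sketched as follows. This is a minimal illustration, assuming that the predictions are per-label probabilities thresholded at 0.5 (the value mentioned later in Section 3.3); the function name and the toy data are hypothetical, and the number of labels is taken from the data rather than fixed at 14.

```python
def multilabel_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of correctly predicted labels over all images and all labels.

    y_true: list of 0/1 label rows, one row per image.
    y_prob: list of predicted probability rows with the same shape.
    """
    num_images = len(y_true)
    num_labels = len(y_true[0])
    correct = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for label, prob in zip(true_row, prob_row):
            predicted = 1 if prob > threshold else 0
            correct += int(predicted == label)
    # Denominator: (number of labels) * (number of testing images), as in Eq. (1).
    return correct / (num_labels * num_images)


if __name__ == "__main__":
    labels = [[1, 0, 0], [0, 1, 1]]
    probs = [[0.9, 0.2, 0.7], [0.1, 0.8, 0.6]]
    print(multilabel_accuracy(labels, probs))
```

In the toy data, five of the six labels are predicted correctly, so all labels are pooled before dividing rather than averaging per-image accuracies.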

An ROC curve is a classification evaluation tool in deep learning. In real-world applications, some datasets have the problem of class imbalance. For example, a common case is that the number of negative images is larger than that of the positive images for medical datasets. A stable evaluation curve could be achieved by using the ROC curve. To summarize, the ROC curve has the following features: First, the curve can be used to check the impact of a specific threshold value on the generalization ability of a classifier. Second, the ROC can help determine the best threshold value, since the closer it is to the upper-left corner, the better the classifier is. Third, the ROC is a good tool to compare the performance of many different classifiers for each class intuitively.

In the figure of a typical ROC curve, the horizontal coordinate, i.e., false positive rate (*FPR*), and the vertical coordinate, i.e., true positive rate (*TPR*), are defined as follows:

$$TPR = \frac{TP}{P} = \frac{TP}{TP + FN}, \tag{2}$$

$$FPR = \frac{FP}{N} = \frac{FP}{TN + FP}, \tag{3}$$

where *P* is the number of real positive samples and *N* is the number of real negative samples. *TP* means true positive, which is the number of positive samples that are predicted as positive by the model; *FP* means false positive, which is the number of negative samples that are predicted as positive by the model; *FN* means false negative, which is the number of positive samples that are predicted as negative by the model; *TN* means true negative, which is the number of negative samples that are predicted as negative by the model. For a specified classifier and threshold, a pair of TPR and FPR values can be obtained according to the testing performance. As a result, this classifier can be mapped to a point on the ROC plane. The area under the ROC curve (AUCROC) is used to quantify the classification ability, and a larger AUCROC indicates better classification performance.
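Equations (2) and (3) can be illustrated for a single label and a single threshold with the short sketch below. The function name and the sample data are hypothetical, and the sketch assumes that both positive and negative samples are present.

```python
def tpr_fpr(labels, scores, threshold=0.5):
    """Return (TPR, FPR) for binary labels and predicted scores at one threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score > threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    # TPR = TP / (TP + FN), FPR = FP / (TN + FP)
    return tp / (tp + fn), fp / (tn + fp)


if __name__ == "__main__":
    labels = [1, 1, 0, 0, 1]
    scores = [0.9, 0.4, 0.6, 0.2, 0.7]
    print(tpr_fpr(labels, scores))
```

Sweeping the threshold from high to low and plotting the resulting (FPR, TPR) pairs traces out the ROC curve.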

There are three methods to calculate the AUC manually, namely the trapezoidal rule, the Mann–Whitney statistic [21], and the parametric method. The first method draws a vertical line at each point on the x-axis and sums the areas of the resulting small trapezoids. The second method is suitable for medical images, because it calculates the probability that a positive sample receives a higher score than a negative sample. The third method uses the mean and variance when the samples follow a Gaussian distribution. In our work, the two functions *roc*\_*auc*\_*score* and *roc*\_*curve* were imported directly from the *sklearn.metrics* library. After the AUC value is calculated, the performance of the classifier can be analyzed: (1) If *AUC* = 1, the classifier is perfect. (2) If 0.5 < *AUC* < 1, the performance is better than guessing randomly. If a proper threshold value is set, the classifier can predict most of the cases correctly. (3) If *AUC* = 0.5, the prediction is the same as a random guess, and the classifier has no predictive value. (4) If *AUC* < 0.5, it is worse than guessing. However, if the predictions are inverted, the situation is similar to the second case.
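The first two manual methods can be compared on a small example. The sketch below is illustrative only: the function names and the data are hypothetical, ties are counted as one half in the Mann–Whitney version, and the ROC points passed to the trapezoidal version are assumed to be sorted by increasing FPR.

```python
def auc_mann_whitney(labels, scores):
    """AUC as the probability that a positive sample outscores a negative one."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def auc_trapezoid(roc_points):
    """AUC from (FPR, TPR) points sorted by FPR, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(roc_points, roc_points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    # ROC points of the same data, obtained by sweeping the threshold.
    points = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
    print(auc_mann_whitney(labels, scores), auc_trapezoid(points))
```

Both functions give the same value on this example, which reflects the fact that the area under the empirical ROC curve equals the normalized Mann–Whitney statistic.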

### *2.3. Visualization Using Class Activation Maps*

Visualization of neural networks increases the interpretability of the networks in the field of computer vision. The complexity of medical images often makes the visualization harder. In our work, the class activation map (CAM) was used for visualization. The basic principle of the CAM is that it produces a heat map of the input image, indicating which regions contribute most to the predicted class. Specifically, the technique used in this work was gradient-weighted class activation mapping (Grad-CAM) [22]. This method generates a localization map with the significant parts of the image highlighted by extracting the gradient of the classification target and letting the gradient flow into the last convolutional layer.

A convolutional neural network normally consists of a feature extractor, which is used to extract useful features, and a classifier, which classifies according to the extracted features. There are two kinds of classification models. One is feature extraction with flatten and softmax layers: A flatten layer is used to transform the three-dimensional feature maps into one-dimensional vectors. A dense layer is then added, and finally, a softmax function is used as the activation function for the output. The other is feature extraction with global average pooling (GAP) and softmax, where a global average pooling layer is used to substitute the flatten layer: this has the advantages of reducing the number of parameters, making the training process easier, and helping to prevent overfitting. Based on the classification model, the CAM is generated.

For a traditional CNN model that has a flatten layer, if the last layer of the CNN has *n* feature maps, which means there are *n* weights for a neuron in the classifier layer and each neuron relates to a class, then the class activation map [22] for class *c* can be calculated as follows:

$$L\_{CAM}^c = \sum\_{i=1}^n w\_i^c A^i \, , \tag{4}$$

where the weights for the *i*th neuron are $w^{i}\_{1}, w^{i}\_{2}, w^{i}\_{3}, \cdots, w^{i}\_{n}$, and $A^{i}$ indicates the feature maps in the last layer. If a GAP is used to substitute the flatten layer, the classification score of class *c* [22] can be calculated as follows:

$$S\_c = \sum\_{i=1}^{n} w\_i^c \, GAP(A^i) = \frac{1}{Z} \sum\_{i=1}^{n} \sum\_{k=1}^{c1} \sum\_{j=1}^{c2} A\_{kj}^i w\_{i}^c, \tag{5}$$

where $w^{c}\_{i}$ is the weight for the GAP and the size of a feature map is *Z* = *c*1 ∗ *c*2. The value of $S\_c$ is determined by the pixel values $A^{i}\_{kj}$ and weights $w^{c}\_{i}$. If the multiplication of the pixel value and weights is larger than 1, the sample will be classified into this current class *c*, and the model considers the original image as related to this class. This equation helps decide which part of the original image corresponds to a specific pixel.
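Equations (4) and (5) can be checked numerically on a toy example. The sketch below is only an illustration: the function names, the two 2 × 2 feature maps, and the class weights are made up, and feature maps are represented as plain nested lists.

```python
def gap(feature_map):
    """Global average pooling: mean of all entries of one feature map."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def class_score(feature_maps, class_weights):
    """S_c = sum_i w_i^c * GAP(A^i), as in Eq. (5)."""
    return sum(w * gap(a) for w, a in zip(class_weights, feature_maps))


def class_activation_map(feature_maps, class_weights):
    """L_CAM^c = sum_i w_i^c * A^i, computed position by position, as in Eq. (4)."""
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [sum(w * a[k][j] for w, a in zip(class_weights, feature_maps))
         for j in range(cols)]
        for k in range(rows)
    ]


if __name__ == "__main__":
    maps = [[[1, 2], [3, 4]], [[0, 0], [0, 4]]]  # two toy feature maps A^1, A^2
    weights = [2, -1]                            # toy weights w_1^c, w_2^c
    cam = class_activation_map(maps, weights)
    print(cam, class_score(maps, weights), gap(cam))
```

In this example, averaging the class activation map over all positions reproduces the class score, which is why the map can be read as a spatial breakdown of the score.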

CAMs are a very powerful tool for the visualization of the neural network's decision-making process. However, they have certain limitations: (1) We can apply CAMs only if the CNN contains a GAP layer; (2) heat maps can be generated only for the last convolutional layer. To address these issues, gradient-weighted class activation mapping (Grad-CAM) was proposed. The class activation mapping for class *c* [22] can be generated by:

$$L\_{Grad\text{-}CAM}^{c} = \frac{1}{Z} \sum\_{i=1}^{n} \sum\_{k=1}^{c1} \sum\_{j=1}^{c2} \frac{\partial S\_{c}}{\partial A\_{kj}^{i}} A^{i}. \tag{6}$$

Grad-CAM is the generalization of the CAM, and the gradient operator indicates the backpropagation. Grad-CAM was employed in our work due to its advantages. The code implementation included the following steps: (1) The output of the batch normalization (BN) layer [23] and the output of the whole network were extracted. (2) Backpropagation was computed from the output of the whole network to the output of the BN layer by using function *gradients* in *TensorFlow* to calculate the gradient automatically. (3) We used the gradients as the weights and multiplied them with the output of the BN layer. (4) Function *resize* in the *OpenCV* library was used to compound the feature maps to visualize.
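The weighting step of Grad-CAM (step (3) above, following Equation (6)) can be sketched without any deep learning framework, assuming that the feature maps and the gradients of the class score with respect to them have already been extracted. The function name and the toy numbers are hypothetical; the optional ReLU comes from the original Grad-CAM formulation and is not part of Equation (6).

```python
def grad_cam(feature_maps, gradients, apply_relu=False):
    """Weight each feature map by its spatially averaged gradient and sum.

    feature_maps[i][k][j] is A^i_kj; gradients[i][k][j] is dS_c / dA^i_kj.
    """
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    z = rows * cols
    # One weight per feature map: (1/Z) * sum over k, j of the gradient.
    alphas = [sum(g for row in grad for g in row) / z for grad in gradients]
    heat_map = [
        [sum(alpha * a[k][j] for alpha, a in zip(alphas, feature_maps))
         for j in range(cols)]
        for k in range(rows)
    ]
    if apply_relu:
        heat_map = [[max(v, 0.0) for v in row] for row in heat_map]
    return heat_map


if __name__ == "__main__":
    maps = [[[1, 2], [3, 4]], [[0, 0], [0, 4]]]
    # For a GAP classifier with weights (2, -1), dS_c/dA^i_kj = w_i / Z with Z = 4.
    grads = [[[0.5, 0.5], [0.5, 0.5]], [[-0.25, -0.25], [-0.25, -0.25]]]
    print(grad_cam(maps, grads))
```

With gradients taken from a GAP classifier, the result is the plain CAM scaled by 1/*Z*, which illustrates in what sense Grad-CAM generalizes the CAM.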

### **3. Simulation Results**

Our simulation process can be divided into three parts: (1) The raw data need to be preprocessed, including checking the data leakage, handling the class imbalance, performing the data augmentation, and generating new images. (2) The training process was conducted. (3) The testing and evaluation results showed the generalization ability of the model. Simulations were conducted on a GPU-equipped computer, using *TensorFlow* and *Keras*.

### *3.1. Data Preprocessing*

The data used in our work were frontal-view chest X-ray images from patients. The whole dataset was obtained from https://nihcc.app.box.com/v/ChestXray-NIHCC (accessed on 10 February 2021). Each image in the dataset includes 14 labels for 14 pathological conditions, such as consolidation, effusion, edema, atelectasis and so on. For each label, 1 means positive and 0 means negative. After classification, the pathological conditions can be utilized by physicians to detect eight different diseases. The original datasets were divided into three groups for training, validation, and testing, respectively.

Data leakage is a common problem when processing medical images, because one patient may have multiple images. Data leakage leads to the overfitting problem, since the model learns from very similar features and then has difficulty predicting new ones. To ensure that there is no data leakage between any two datasets, the datasets should not contain images from the same patient. The unique patient identifiers in each set were collected by using the *set* function in Python, and then, the *intersection* function was used to check whether two datasets contained images from the same patient.
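The leakage check described above can be sketched in a few lines. The function name and the patient identifiers below are hypothetical; only the use of Python's *set* and *intersection* follows the text.

```python
def overlapping_patients(ids_a, ids_b):
    """Return the patient identifiers that appear in both datasets."""
    return set(ids_a).intersection(set(ids_b))


if __name__ == "__main__":
    train_ids = ["p001", "p002", "p002", "p003"]  # one patient may have several images
    valid_ids = ["p004", "p005"]
    test_ids = ["p003", "p006"]

    print(overlapping_patients(train_ids, valid_ids))  # empty set: no leakage
    print(overlapping_patients(train_ids, test_ids))   # {'p003'}: leakage detected
```

A non-empty result means that images of the listed patients should be moved so that each patient appears in only one of the datasets.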

Neural networks can only process the data in the format of float tensor. Therefore, formatting is important, since the original dataset contains images in PNG files. In *Keras*, there is a class named *ImageDataGenerator*, which can be used to finish the following tasks in sequence: read image files; encode the PNG files into RGB pixels; transform these pixels into a float tensor; scale the pixels in the range of [0,1]. Then, three generators are defined to load the images into the network. Several parameters can be set to proper values in *ImageDataGenerator*:


The data augmentation module was added to the generator, which means that the data were already augmented before feeding them into the neural network. In order to compare the image before and after data augmentation, the first image of the dataset is shown in Figure 1 by using the *plt.imshow* function. As shown in Figure 2, the image was shifted and zoomed after augmentation.

**Figure 1.** A chest X-ray image.

**Figure 2.** A chest X-ray image with data augmentation.

Class imbalance was handled by using a weighted loss as the loss function. Specifically, for each label, the loss was weighted by the frequency of positive data (*w<sub>p</sub>*) and that of negative data (*w<sub>n</sub>*) as shown below:

$$L(X, y) = \begin{cases} w\_p \ast (-\log P(Y = 1|X)), & \text{if } y = 1, \\ w\_n \ast (-\log P(Y = 0|X)), & \text{if } y = 0, \end{cases} \tag{7}$$

where *Y* stands for the prediction, *X* is the input image, and *y* is the true label.
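A per-label weighted loss of this kind can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the negative-label term is assumed to use the probability of the negative class, as in the usual weighted binary cross-entropy; the weights are passed in as plain numbers; and the small constant `eps` is added only to avoid taking the logarithm of zero.

```python
import math


def weighted_loss(y_true, prob_positive, w_p, w_n, eps=1e-7):
    """Weighted binary cross-entropy for one label of one image.

    y_true: 1 for a positive label, 0 for a negative label.
    prob_positive: predicted probability P(Y = 1 | X).
    w_p, w_n: weights applied to positive and negative samples.
    """
    p = min(max(prob_positive, eps), 1.0 - eps)  # keep the logarithm finite
    if y_true == 1:
        return w_p * -math.log(p)
    return w_n * -math.log(1.0 - p)


if __name__ == "__main__":
    # The same uncertain prediction costs more for the class with the larger weight.
    print(weighted_loss(1, 0.5, w_p=2.0, w_n=1.0))
    print(weighted_loss(0, 0.5, w_p=2.0, w_n=1.0))
```

Summing this quantity over all labels and images gives a total loss in which the rarer class of each label is not drowned out by the more frequent one.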

### *3.2. Training*

The pretrained network was used as the base model. A global pooling layer was added using function *GlobalAveragePooling2D*, and a fully connected layer was placed as the output layer by employing the Dense function with *Softmax* activation. In our work, the aim was to realize the classification of 14 pathological conditions, which is a multi-task classification problem. In this scenario, the effective activation was *Softmax*. The final output of the model is called the prediction, which is a 14-length vector with each element indicating the probability of a certain pathological condition. In order to compile the whole model, function *compile* was used, and several related parameters were set. For example, compiling the model required the type of loss function and the optimizer. The weighted loss was considered as the loss function, since the class imbalance problem was handled by the weighted loss. *Adam* was used as the optimizer since it has better performance than the traditional optimizers, such as the *Momentum* and *RMSprop* optimizers. Since "accuracy" was used as the metric, the accuracy of each training step and each validation step was displayed while running the code.

After all the preparations were completed, the network was trained using the training images and labels. The goal of the training was optimization, which means that the model builds the connection between the input and the output and learns the features. By using the *fit*\_*generator* function, the model first fits the data to realize training and then performs the validation. Some parameters are important for the training and/or validation process:


When each epoch of training was finished, the weights of the current trained network were saved in a weight file, by calling the *model.save* function. The later training was based on the formerly saved weights.

The next step was to plot the loss curves for training and validation, which is useful for observing network convergence and overfitting. The *Matplotlib* library was used for plotting. After all training and validation epochs, the loss for each epoch can be retrieved from the *history* attribute of the object returned by the training function.

The training loss and validation loss of DenseNet121 without DA, DenseNet121 with DA, and ResNet50 with DA as the base model are shown in Figure 3. The results without DA and with DA were compared first. Figure 3a,b shows that the training losses both with and without DA decreased from one to nearly zero as the number of epochs increased, while the validation losses increased from one to almost five, which means that the model was overfitting. Ideally, the training loss and validation loss should follow the same trend if the model is well fit. The figures also demonstrate that the model with DA performed better than that without DA. The model with DA learned more slowly than that without DA, since more images needed to be fed into the network after data augmentation. The curves for DenseNet121 without DA fluctuated more than those with DA. The loss curves obtained using ResNet50 as the base model with DA are also presented. Compared with DenseNet121, ResNet50 took more time to train, as its training loss converged only at around the 70th epoch.

**Figure 3.** Loss curves for the models with/without DA. (**a**) DenseNet121 without DA. (**b**) DenseNet121 with DA. (**c**) ResNet50 with DA.
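The overfitting pattern described above (training loss falling while validation loss rises) can be checked numerically from the per-epoch losses. The following is a minimal sketch only: the helper names and the loss values are hypothetical and are not taken from the experiments reported here.

```python
def best_epoch(val_loss):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_loss)), key=val_loss.__getitem__) + 1


def generalization_gap(train_loss, val_loss):
    """Per-epoch difference between validation loss and training loss."""
    return [v - t for t, v in zip(train_loss, val_loss)]


# Hypothetical loss histories (illustrative values, not the paper's data).
train_loss = [1.00, 0.60, 0.35, 0.20, 0.10, 0.05]
val_loss = [1.00, 0.90, 1.20, 2.00, 3.50, 4.80]

gaps = generalization_gap(train_loss, val_loss)
print("epoch with lowest validation loss:", best_epoch(val_loss))
print("gap widened over training:", gaps[-1] > gaps[0])
```

A gap that keeps widening after the epoch with the lowest validation loss is one simple signal of overfitting.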

The training accuracy of using these three models is shown in Table 1. DenseNet121 with DA had the highest training accuracy, followed by DenseNet121 without DA and then ResNet50. The reason was that the dataset became larger and more diversified after DA, and thus, the network was trained to be optimal. ResNet50 had the lowest training accuracy, since there were fewer shortcut connections inside of the base model, and consequently, the learning ability was poorer.



### *3.3. Testing and Evaluation*

All the testing images were fed into the model, and the prediction results could be obtained. To test the network, function *predict\_generator* was used as the major function. The output of this function was a list, which included the probability of classification for each label. When this probability was larger than the threshold value of 0.5, the program considered the prediction as correct. After comparing the prediction results with the real label of each image, the generalization ability of the model was assessed with a self-defined function that calculates the testing accuracy. The classification accuracy for testing the datasets using DenseNet121 without DA, DenseNet121 with DA, and ResNet50 with DA is shown in Table 2. This table shows that DenseNet121 had better performance than ResNet50, and DA was beneficial for improving the classification accuracy.
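The thresholding and accuracy calculation can be sketched as follows. This is an illustration under assumptions, not the self-defined function used in the study: it treats a probability above 0.5 as a positive prediction for that label and reports the fraction of images whose predicted label matches the true label. The function names and the toy data are hypothetical.

```python
def binarize(probabilities, threshold=0.5):
    """Turn per-label probabilities into 0/1 predictions."""
    return [[1 if p > threshold else 0 for p in row] for row in probabilities]


def per_label_accuracy(y_true, y_pred):
    """Fraction of images predicted correctly, computed separately per label."""
    n_images = len(y_true)
    n_labels = len(y_true[0])
    return [
        sum(y_true[i][k] == y_pred[i][k] for i in range(n_images)) / n_images
        for k in range(n_labels)
    ]


# Toy example: 4 images, 3 labels (hypothetical values).
probs = [[0.9, 0.2, 0.6], [0.4, 0.7, 0.1], [0.8, 0.3, 0.55], [0.1, 0.6, 0.3]]
y_true = [[1, 0, 0], [0, 1, 0], [1, 0, 1], [1, 1, 0]]

accuracy = per_label_accuracy(y_true, binarize(probs))
print(accuracy)
```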


**Table 2.** Testing accuracy for different networks.

In order to evaluate the model, the receiver operating characteristic (ROC) curves were generated, and the area under the curve (AUC) was calculated. The Python library *sklearn* (scikit-learn) provides advanced computations for machine learning. For the evaluation, functions *roc\_auc\_score* and *roc\_curve* were imported from the library to calculate the AUCROC and to derive the ROC curve. Figure 4 illustrates the ROC curves and the AUCROC values of DenseNet121 without DA for the 14 pathological conditions. The horizontal axis indicates the false positive rate, while the vertical axis indicates the true positive rate. The AUCROC score for each class is listed at the lower-right corner of this figure, e.g., for cardiomegaly, the AUCROC was 0.51. The figure shows that the ROC curves for several pathologies lie below the straight line that passes through points (0,0) and (1,1). For these pathologies, the classifier worked even worse than random guessing. The AUCROC values of five pathologies, i.e., emphysema, infiltration, pneumothorax, pleural thickening, and pneumonia, were all less than 0.5, which means that the classifier could not diagnose most of the images in these classes correctly. Therefore, this figure indicates that the classification ability of DenseNet121 without DA was relatively poor.

**Figure 4.** The ROC and AUCROC for DenseNet121 without DA.
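To make the ROC construction concrete, the following dependency-free sketch computes the ROC points and the trapezoidal area for a single label. It is only an illustration of the quantities that *roc\_curve* and *roc\_auc\_score* report, not the library implementation; the function names and the toy scores are hypothetical.

```python
def roc_points(y_true, scores):
    """Sweep the decision threshold from high to low; return (fpr, tpr) lists."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample in a group of tied scores.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
        for k in range(1, len(fpr))
    )


y_true = [0, 0, 1, 1]
good_scores = [0.1, 0.4, 0.35, 0.8]   # positives mostly ranked higher
bad_scores = [0.9, 0.6, 0.65, 0.2]    # ranking reversed

auc_good = auc(*roc_points(y_true, good_scores))
auc_bad = auc(*roc_points(y_true, bad_scores))
print(auc_good, auc_bad)
```

In this toy case the reversed ranking yields an area below 0.5, which corresponds to a curve lying under the diagonal through (0,0) and (1,1).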

Figure 5 illustrates the ROC curves and the AUCROC values of DenseNet121 with DA for the 14 pathological conditions. This figure shows that most of these ROC curves are located above the dotted line that passes through points (0,0) and (1,1), and all of the AUCROC values are larger than 0.5. The reason was that the images were preprocessed with DA, which led to a better-trained network. For fibrosis, the ROC curve lies significantly higher than the other curves and is mostly close to the upper-left corner, and its AUCROC was the largest with a value of 0.775, which means that its classifier had the best performance among all 14 classifiers. For nodule and infiltration, their AUCROC values were just slightly larger than 0.5, which means that these classifiers could help predict these pathological conditions, but the performance was relatively poor.

**Figure 5.** The ROC and AUCROC for DenseNet121 with DA.

Figure 6 illustrates the ROC curves and the AUCROC values of ResNet50 with DA for the 14 pathological conditions. Compared to Figure 5, more ROC curves using ResNet50 with DA lie below the straight dotted line that passes through points (0,0) and (1,1) than those using DenseNet121 with DA. The largest AUCROC value was for fibrosis, with the value of 0.68, which was smaller than that of using DenseNet121 with DA. The AUCROC values of three classes, i.e., emphysema, pneumothorax, and pneumonia, were smaller than 0.5, which means that these classifiers could not help predict these pathological conditions.

The comparison of the ROC curves and AUCROC values for different networks demonstrated that the classifiers trained by DenseNet121 had better performance than those trained by ResNet50. The results also indicated that DA improved the classification capability for all of the classes. Most of the ROC curves lie above the straight dotted line that passes through points (0,0) and (1,1), but they are not close enough to the upper-left corner, because the dataset used for testing was relatively small.

**Figure 6.** The ROC and AUCROC for ResNet50 with DA.

### *3.4. Visualization*

The visual explanation of classification decision-making was produced by using Grad-CAM techniques. The heat maps of using DenseNet121 as the base model are shown in Figures 7 and 8. These chest X-rays were randomly selected from the datasets, and only the four most probable diagnosis heat maps are shown in each figure. The probability of diagnosing a certain pathological condition is given in each of the subfigures. For example, in Figure 7, the original chest X-ray image is shown in the first subfigure. The second and third subfigures indicate that it is impossible for the image to be classified as cardiomegaly or hernia. The fourth and fifth subfigures mean that the image has a probability of 0.763 and 0.593 to be diagnosed as nodule and edema, respectively. Figure 8 shows that the original image may be diagnosed with any of four pathological conditions, and the most probable one is nodule with a probability of 0.822.

**Figure 7.** Visualization of the diagnosis heat maps of one image example by the use of Grad-CAM.

**Figure 8.** Visualization of the diagnosis heat maps of the second example by the use of Grad-CAM.

### **4. Conclusions**

A deep learning approach was proposed to use transfer learning and pretrained networks to recognize and classify chest X-ray images into 14 pathological conditions, and therefore help with diagnosing diseases related to these pathological conditions. The performance of the two adopted pretrained networks, DenseNet121 and ResNet50, was compared, and DA was also used to further improve the performance. Evaluation metrics, such as the accuracy, ROC curves, and AUCROC values, were utilized. The simulation results showed that the network using DenseNet121 as the base model with DA had a better generalization ability on the testing datasets. In the future, multiple transfer learning methods could be used together with ensemble classifiers to further improve the performance of the proposed work. The potential use of other datasets, such as PadChest, CheXpert, and MIMIC-CXR, will be explored in our future work.

**Author Contributions:** Formal analysis, R.F.; Funding acquisition, S.B.; Investigation, R.F.; Methodology, R.F.; Software, R.F.; Supervision, S.B.; Validation, R.F.; Writing—original draft, R.F.; Writing review & editing, S.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by start-up funds provided by Brock University.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data analyzed in this study are openly available at https://nihcc.app.box.com/v/ChestXray-NIHCC (accessed on 10 February 2021).

**Conflicts of Interest:** The authors declare no conflict of interest.


### *Article* **Associations between Longitudinal Gestational Weight Gain and Scalar Infant Birth Weight: A Bayesian Joint Modeling Approach**

**Matthew Pietrosanu <sup>1</sup>, Linglong Kong <sup>1</sup>, Yan Yuan <sup>2</sup>, Rhonda C. Bell <sup>3</sup>, Nicole Letourneau <sup>4</sup> and Bei Jiang <sup>1,\*</sup>**


**Abstract:** Despite the importance of maternal gestational weight gain, it is not yet conclusively understood how weight gain during different stages of pregnancy influences health outcomes for either mother or child. We partially attribute this to differences in and the validity of statistical methods for the analysis of longitudinal and scalar outcome data. In this paper, we propose a Bayesian joint regression model that estimates and uses trajectory parameters as predictors of a scalar response. Our model remedies notable issues with traditional linear regression approaches found in the clinical literature. In particular, our methodology accommodates nonprospective designs by correcting for bias in self-reported prestudy measures; truly accommodates sparse longitudinal observations and short-term variation without data aggregation or precomputation; and is more robust to the choice of model changepoints. We demonstrate these advantages through a real-world application to the Alberta Pregnancy Outcomes and Nutrition (APrON) dataset and a comparison to a linear regression approach from the clinical literature. Our methods extend naturally to other maternal and infant outcomes as well as to areas of research that employ similarly structured data.

**Keywords:** Bayesian modeling; functional regression; gestational weight; infant birth weight; joint modeling; longitudinal data; maternal weight gain

### **1. Introduction**

Maternal weight gain supports fetal growth and holds important health implications for both mother and child during and after pregnancy [1–3]. Insufficient weight gain is associated with preterm birth and low infant birth weight, while excessive weight gain is linked to postpartum weight retention, gestational diabetes, hypertension, infant macrosomia, and other complications [3–5]. A growing amount of clinical literature further implicates maternal gestational weight gain outside of recommendations in adverse, long-term health outcomes for the child, including a heightened future risk of cardiovascular disease [6,7].

It is not yet conclusively understood how weight gain in different stages of pregnancy affects health outcomes for either mother or child. This is despite previous findings that gestational weight trajectories are similar across human populations with varying genetic, cultural, and lifestyle traits [8]. As an example central to this article, previous studies present conflicting conclusions on the effect of first- and second-trimester weight gain on infant birth weight [8–12]. We attribute this in part to differences in and the validity of the statistical methods currently used to jointly analyze scalar outcomes and longitudinal data. Thus, developments in methodology for analyzing how patterns in longitudinal data (e.g., gestational weight gain) influence scalar outcomes (e.g., infant birth weight) are both statistically and clinically relevant.

**Citation:** Pietrosanu, M.; Kong, L.; Yuan, Y.; Bell, R.C.; Letourneau, N.; Jiang, B. Associations between Longitudinal Gestational Weight Gain and Scalar Infant Birth Weight: A Bayesian Joint Modeling Approach. *Entropy* **2022**, *24*, 232. https://doi.org/10.3390/e24020232

Academic Editors: S. Ejaz Ahmed and Farouk Nathoo

Received: 16 December 2021 Accepted: 29 January 2022 Published: 2 February 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Retnakaran et al. [13] investigate the relationship between infant birth weight and gestational weight gain in different periods of pregnancy using traditional linear regression. The work's models include, as predictors, demographic covariates together with pregravid weight and interval-specific average weight gain. The authors opt for clinical data in order to avoid bias in self-reported pregravid measurements that they claim is prevalent in other studies [5,12]. The resulting preconception study design presents a few practical problems: this design is more difficult to implement, limits the use of secondary data, and can introduce other sampling biases and restrict model generalizability (e.g., through the exclusion of unplanned pregnancies). Despite the supposed benefit of bias reduction, the work's average weight gain measurements are precomputed (as differences in average weight between gestational intervals) and may be highly variable due to clinical measurement error and the small number of observations in each gestational interval. As Richardson notes, ignoring this measurement error can lead to unreliable effect estimates and misleading conclusions [14]. This linear regression approach furthermore does not account for gestational age at each weight measurement and, through its initial precomputing stage, reduces the amount of data used to fit the model. The consequent coarsening of information may contribute to unreliable effect estimates and conclusions.

To address these issues, we turn to other approaches for modeling longitudinal data. Joint models that simultaneously consider longitudinal responses and scalar health outcomes are well established in the statistical literature [15–22]. These models were originally motivated by HIV/AIDS and cancer research to predict patient outcomes using a time-dependent covariate trajectory. Relevant methodology has since evolved to incorporate techniques from functional data analysis, semiparametric inference, robust estimation, and Bayesian methods [23].

In this paper, we consider a joint model for infant birth weight and gestational weight gain trajectories that also incorporates clinical covariates. Our approach efficiently uses information from estimated mean weight trajectories—including estimated pregravid weight, interval-specific rates of weight gain, and individual residual variance—to predict infant birth weight. As a result, our model can correct for bias in self-reported weight measurements (when combined with clinical observations) and permits nonprospective study designs with unbalanced longitudinal observations.

We employ the Bayesian joint modeling approach of Jiang et al. [23]. Our model uses parameter estimates that describe individual gestational weight trajectories to model the association between infant birth weight and gestational weight gain. We model the mean [24,25] and measurement error [26,27] of these trajectories using a robust, semiparametric mixed effects model and a Bayesian linear spline approach [23].

Our joint model remedies the issues noted above for linear regression [13]. First, by using estimated mean trajectory parameters as predictors of infant birth weight, our approach obtains more-efficient estimates of the time-dependent effects of gestational weight gain. More generally, our joint modeling method, implemented in a Bayesian framework, borrows information from all observations and patients in a one-stage procedure. On the other hand, the predictors in the traditional linear model, such as interval-specific weight gain, are precomputed in an initial step independently for each patient using only a small proportion of the available data at a time. Second, our approach truly accommodates longitudinal data by explicitly accounting for gestational age at each weight measurement when estimating weight gain trajectories. Third, unlike other studies that treat within-patient residual variance as a nuisance parameter, our method models measurement error variance and uses it as a random effect to predict infant birth weight.

Our approach to mean trajectory modeling mitigates bias in self-reported prestudy measurements and accounts for variability inherent in observed data. These are notable advantages over traditional methods such as the linear regression approach above, where the amalgamation of data from different sources can negatively impact an analysis. Another advantage of the proposed model is its potential to be used for prediction and intervention: our model can be applied to predict infant birth weight well before term and can thus be conveniently deployed in clinical settings. More generally, while infant birth weight is the primary focus of the present paper, our approach and discussions apply to other maternal and infant outcomes and to other areas of research that employ similarly structured data.

In Section 2, we introduce the pregnancy outcomes dataset used in this article and the proposed model. This section also presents our chosen prior distributions and computational methods. We present estimates for the effect of time-specific maternal weight gain on infant birth weight obtained under the proposed model in Section 3, and compare these estimates to those obtained using the linear regression approach described above [13]. In Section 4, we discuss our results and provide some concluding remarks on the general significance of our approach and future directions.

### **2. Materials and Methods**

*2.1. Data*

Throughout this paper, we use data from the 2009–2012 Alberta Pregnancy Outcomes and Nutrition (APrON) study [28]. The 2189 women in the APrON study, all of whom were at least 16 years of age and at most 27 weeks into gestation, are part of a longitudinal cohort [28,29]. As part of the APrON study, maternal weight and gestational age were measured at each trimester following registration. Participants recruited before 13 weeks gestation have measurements corresponding to all three trimesters, while those recruited between 14 and 27 weeks gestation have measurements only for the second and third trimesters. Pregravid weight, along with other demographic characteristics, was self-reported by each participant upon recruitment. Gestational age at delivery was assessed postpartum. In addition to the APrON data, clinical weight measurements were collected from all participants at regularly scheduled prenatal visits. The number of weight measurements for each participant varies due to missing appointments or data. The longitudinal weight data in this study may be considered sparse and has been previously examined in the functional data analysis literature [30].

We only include participants with a live, singleton birth in the following analyses. We exclude individuals without a reported pregravid weight; those with fewer than three weight measurements during pregnancy; and those with missing gestational age at delivery, infant birth weight, marital status, education level, income level, ethnic origin, parity, or age. We do not consider any postpartum weight measurements in our analyses.

The final analytic sample consists of *n* = 1340 participants with *N* = 15,183 weight observations. Demographic characteristics for this sample, stratified by infant birth weight class, are summarized in Table 1. We use <2.5 kg, ≥2.5 kg and <4 kg, and ≥4 kg as criteria defining low, normal, and high infant birth weight classes [31]. Clinical weight measurements (i.e., not including self-reported pregravid measurements) were taken at gestational ages ranging from 4.4 to 41.7 weeks, with a median of 30.3 weeks. Participants have a median of 12 recorded weight measurements each.


**Table 1.** Summary of demographic covariates for the analytic sample in the APrON dataset. For categorical variables, counts and relative percentages are reported. A \* indicates the chosen reference category. For continuous variables, means (and standard deviations, in parentheses) are reported.

### *2.2. Joint Model*

We now present our joint model for infant birth weight and longitudinal gestational weight gain. As a main feature, the model estimates the former using parameter estimates from patient-specific maternal weight trajectories:

$$\begin{aligned} Y_i \mid \mathbf{b}_i &= (1, \mathbf{z}_i^\top, \mathbf{b}_i^\top, \ln \sigma_i^2) \boldsymbol{\theta} + \varepsilon_i \\ \varepsilon_i &\overset{\text{i.i.d.}}{\sim} N(0, \sigma^2) \end{aligned}$$

for $i = 1, \dots, n$, where $Y_i$ denotes an observed infant birth weight; $\mathbf{z}_i$ an observed demographic covariate vector; $\mathbf{b}_i$ a vector of random weight trajectory parameters; and $\sigma_i^2$ the trajectory's residual variance for the $i$th patient. The vector $\boldsymbol{\theta}$ contains the corresponding fixed and random effects.

Individual longitudinal weight trajectories influence *Yi* through the random trajectory parameters *bi* in the longitudinal submodel

$$\begin{aligned} X_{ij} &= f(t_{ij}; \mathbf{b}_i) + \varepsilon_{ij} \\ \varepsilon_{ij} &\overset{\text{i.i.d.}}{\sim} N(0, \sigma_i^2) \\ \mathbf{b}_i &\overset{\text{i.i.d.}}{\sim} N(\boldsymbol{\beta}, \Sigma) \end{aligned}$$

for *j* = 1, ... , *ni*, where *Xij* is the observed weight of the *i*th patient at gestational age *tij* and *ni* is the total number of longitudinal observations for the *i*th patient. We consider a piecewise linear weight trajectory (as a function of gestational age *t* ≥ 0) [32]

$$f(t; \mathbf{b} = (b_0, b_1, \dots, b_K)^\top) = b_0 + \sum_{k=1}^K b_k (t - t_k^*)_+,$$

where $x_+ = \max\{0, x\}$ for $x \in \mathbb{R}$ and $(t_1^* = 0, \dots, t_K^*, t_{K+1}^* = \infty)$ is a fixed, increasing sequence of changepoint locations. Consequently, $b_0$ is the mean pregravid weight and $\sum_{k=1}^{k_0} b_k$ is the mean rate of weight gain in the gestational age interval $[t_{k_0}^*, t_{k_0+1}^*)$, for $k_0 = 1, \dots, K$. Following common trimester boundaries [13], we take $K = 8$ with $t_2^* = 13$, $t_3^* = 18$, $t_4^* = 23$, $t_5^* = 27$, $t_6^* = 32$, $t_7^* = 37$, and $t_8^* = 45$.

Under the proposed model, $\boldsymbol{\beta}$ describes an average, "prototype" trajectory, while the random $\mathbf{b}_i$s describe patient-specific trajectories and deviations from $\boldsymbol{\beta}$. Our longitudinal model accounts for short-term variation and measurement error in patient trajectories by using $\ln \sigma_i^2$ as a predictor of $Y_i$.
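As an illustration of the piecewise linear trajectory, the sketch below evaluates the spline and the interval-specific rate of weight gain. It is a toy example only: the function names and the parameter vector `b` are hypothetical and are not estimates from the APrON data. The changepoints are those stated in the text.

```python
# Changepoints t*_1, ..., t*_8 (weeks of gestation) as given in the text.
CHANGEPOINTS = [0, 13, 18, 23, 27, 32, 37, 45]


def trajectory(t, b, changepoints=CHANGEPOINTS):
    """Evaluate f(t; b) = b[0] + sum_k b[k] * max(0, t - t*_k)."""
    return b[0] + sum(bk * max(0.0, t - tk) for bk, tk in zip(b[1:], changepoints))


def interval_rate(b, k0):
    """Rate of weight gain on [t*_{k0}, t*_{k0+1}): the sum b_1 + ... + b_{k0}."""
    return sum(b[1:k0 + 1])


# Hypothetical parameters (b_0, b_1, ..., b_8): 60 kg pregravid weight,
# 0.1 kg/week before week 13, then 0.4 kg/week from week 13 onwards.
b = [60.0, 0.1, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

print(trajectory(0, b))      # pregravid weight b_0
print(interval_rate(b, 2))   # slope on [13, 18)
print((trajectory(18, b) - trajectory(13, b)) / 5)  # same slope from f itself
```

The last two printed values agree up to floating-point rounding, which reflects why the cumulative sums of the $b_k$, rather than the individual $b_k$, are interpreted as rates of weight gain.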

### *2.3. Bayesian Framework and Model Estimation*

We take a Bayesian approach to parameter estimation in the proposed model.

In the longitudinal submodel, we model random trajectory parameters as $\mathbf{b}_i \overset{\text{i.i.d.}}{\sim} N(\boldsymbol{\beta}, \Sigma)$ under the diffuse prior $\boldsymbol{\beta} \sim N(\mathbf{0}, 10I)$. Additional tests, not presented here, indicate no need to consider a Gaussian mixture [23] in the distribution of the $\mathbf{b}_i$s for our APrON dataset. To avoid issues with unbounded likelihood [33] when using an unstructured random effect covariance matrix $\Sigma$, we implement the empirical Bayes Wishart prior [34]

$$\Sigma \sim \mathcal{W}\left(m = 2 + \frac{K+1}{2}, \ \Lambda = \sum_{i=1}^{n} \widehat{\text{Cov}}(\hat{\mathbf{b}}_{i}^{(\text{OLS})})^{-1}\right),$$

where $\widehat{\text{Cov}}(\hat{\mathbf{b}}_i^{(\text{OLS})})$ is an estimate of the covariance matrix of the ordinary least squares (OLS) estimator of $\mathbf{b}_i$. For the $\sigma_i^2$s, the trajectory residual variances, we assume a lognormal prior $\ln \sigma_i^2 \overset{\text{i.i.d.}}{\sim} N(\mu, \tau^2)$ under the diffuse hyperpriors $\mu \sim N(0, 10^3)$ and $\tau^2 \sim \text{Inv-Gamma}(10^{-4}, 10^{-4})$. For the scalar response $Y_i$, we take $\boldsymbol{\theta} \sim N(\mathbf{0}, 10I)$ and $\sigma^2 \sim \text{Inv-Gamma}(10^{-4}, 10^{-4})$.

For notational simplicity, let $\boldsymbol{\phi} = \{\boldsymbol{\theta}, \sigma^2, \boldsymbol{\beta}, \Sigma, \mu, \tau^2\}$ be the collection of model parameters. We assume that all elements of $\boldsymbol{\phi}$ have independent prior distributions and denote the joint prior of $\boldsymbol{\phi}$ by $\pi$. Define $\eta_i^\mu = (1, \mathbf{z}_i^\top, \mathbf{b}_i^\top, \ln \sigma_i^2)\boldsymbol{\theta}$ as the linear predictor corresponding to $Y_i$.

The full likelihood of *ϕ* for our model is

$$\begin{aligned} L(\boldsymbol{\phi}) &= \pi(\boldsymbol{\phi}) \prod_{i=1}^{n} \Bigg[ |\Sigma|^{-0.5} \exp \left\{ -0.5 (\mathbf{b}_{i} - \boldsymbol{\beta})^{\top} \Sigma^{-1} (\mathbf{b}_{i} - \boldsymbol{\beta}) \right\} \\ &\quad \times \prod_{j=1}^{n_{i}} \left[ \sigma_{i}^{-1} \exp \left\{ -0.5 \sigma_{i}^{-2} (x_{ij} - f(t_{ij}; \mathbf{b}_{i}))^{2} \right\} \right] \\ &\quad \times \tau^{-1} \exp \left\{ -0.5 \tau^{-2} (\ln \sigma_{i}^{2} - \mu)^{2} \right\} \\ &\quad \times \sigma^{-1} \exp \left\{ -0.5 \sigma^{-2} (y_{i} - \eta_{i}^{\mu})^{2} \right\} \Bigg]. \end{aligned}$$

We implement a Gibbs sampler to perform posterior draws. For analytic derivations of the posterior distributions, see Jiang et al. [23]. As the full conditional posterior of $\sigma_i^2$ has no closed form, we obtain draws using the inverse cumulative distribution function method. In our Markov chain Monte Carlo (MCMC) procedure, we run a chain of 150,000 iterations and use the first 50,000 iterations as a burn-in period; however, in this particular application, we observe that the model converges very quickly and that even 10,000 total iterations are sufficient. To reduce autocorrelation in subsequent draws, we thin posterior draws by saving only every 10th. We implement our model in C++ using the Scythe open-source statistical library [35] and R [36].
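The burn-in and thinning schedule above determines how many draws are retained. The following sketch is only bookkeeping and assumes that thinning keeps every 10th iteration counted from the end of the burn-in; the function name and the exact offset are assumptions, not details of the C++ implementation.

```python
def kept_iterations(n_total=150_000, n_burn_in=50_000, thin=10):
    """Return the 1-based indices of iterations retained after burn-in and thinning."""
    return list(range(n_burn_in + thin, n_total + 1, thin))


kept = kept_iterations()
print(len(kept), kept[0], kept[-1])
```

Under these assumptions, 10,000 posterior draws are saved from the 150,000 iterations.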

We consider two models, each accounting for a different set of demographic covariates. The first model (JM1) includes education level, income level, ethnic origin, parity, age at pregnancy, and gestational age at delivery. The second model (JM2) includes only demographic variables whose 95% credible intervals in JM1 do not contain zero.

### *2.4. Comparison to Linear Regression*

We compare our proposed method against the previously noted traditional linear regression (LR) approach. We focus specifically on differences in the effects of maternal weight gain rate in different gestational age periods on infant birth weight. To make this comparison easier, we use the rate of weight gain in each gestational period (rather than period-specific absolute weight gain) as a predictor of infant birth weight *Yi*.

We use the same gestational age intervals in both models: [0, 13), [13, 18), [18, 23), [23, 27), [27, 32), [32, 37), and [37, 45). To compute the average rate of weight gain $\tilde{b}_k$ in the $k$th interval, we first calculate the averages, $\mu_k$ and $\mu_{k-1}$, of weight measurements taken in the $k$th and $(k-1)$th intervals, respectively. We then calculate the rate of weight gain as $\tilde{b}_k = (\mu_k - \mu_{k-1})/(m_k - m_{k-1})$, where $m_k$ is the midpoint of the $k$th gestational age interval. For the sake of notation, we let $k = 0$ refer to pregravid measurements (i.e., at week zero).
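The precomputation step of the LR approach can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the toy measurements are hypothetical, $m_0$ is taken to be week zero, and an interval with no observations simply yields `None` (how such cases were handled in the actual analysis is not described here).

```python
# Gestational age intervals [lo, hi) in weeks, as listed in the text.
INTERVALS = [(0, 13), (13, 18), (18, 23), (23, 27), (27, 32), (32, 37), (37, 45)]


def interval_rates(pregravid_weight, ages, weights, intervals=INTERVALS):
    """Compute (mu_k - mu_{k-1}) / (m_k - m_{k-1}) for k = 1, ..., len(intervals)."""
    means = [pregravid_weight]  # mu_0: pregravid weight
    mids = [0.0]                # m_0: week zero (assumption)
    for lo, hi in intervals:
        in_k = [w for a, w in zip(ages, weights) if lo <= a < hi]
        means.append(sum(in_k) / len(in_k) if in_k else None)
        mids.append((lo + hi) / 2)
    rates = []
    for k in range(1, len(means)):
        if means[k] is None or means[k - 1] is None:
            rates.append(None)  # rate undefined without data in both intervals
        else:
            rates.append((means[k] - means[k - 1]) / (mids[k] - mids[k - 1]))
    return rates


# Toy patient: pregravid weight 60 kg and four clinical measurements.
rates = interval_rates(60.0, ages=[10, 12, 15, 20], weights=[61.0, 62.0, 64.0, 66.0])
print(rates)
```

Each rate here rests on only one or two measurements, and the gestational age of each measurement within its interval is discarded, which illustrates the coarsening discussed in the text.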

As noted previously, our joint model addresses numerous shortcomings of the LR approach. First, the LR model does not fully take into account the timing of individual maternal weight measurements, while our JM approach estimates patient-specific weight trajectories as functions of time. Second, LR model estimates are subject to short-term measurement error and variability: this is because only a small number of measurements contribute to pregravid weight and the estimated rates of weight gain. Our hierarchical Bayesian framework borrows information from all observations to estimate these quantities via patient-specific trajectory parameters. As another feature that may be clinically relevant in some applications, our model also estimates and uses short-term variability in maternal weight as another predictor.

We similarly consider two linear regression models in the following analyses. The first (LR1) uses estimated rates of weight gain (i.e., the $\tilde{b}_k$s), average pregravid weight $\tilde{b}_0 = \mu_0$, and the same demographic variables as JM1. Similar to JM2, the second model (LR2) includes only the demographic covariates whose 95% confidence intervals in LR1 do not contain zero.

### **3. Results and Discussion**

Table 2 presents parameter estimates for all four of the models described in the previous section. Model convergence for the joint models was assessed visually and numerically using five parallel chains. Trace plots for each of the coefficients in Table 2 suggest adequate convergence and mixing. Numerically, Rubin–Gelman statistics [37] for these coefficients range from 1.005 to 1.027 and also imply model convergence.
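For readers unfamiliar with this diagnostic, the sketch below computes the basic potential scale reduction factor for one coefficient from several parallel chains. It is a textbook-style illustration, not the routine used to produce the values reported here; the function name and the toy chains are hypothetical.

```python
from statistics import mean, variance


def potential_scale_reduction(chains):
    """Basic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])   # W: mean within-chain variance
    between = n * variance(chain_means)            # B: between-chain variance
    pooled = (n - 1) / n * within + between / n    # pooled variance estimate
    return (pooled / within) ** 0.5


# Toy chains: two that agree, and two stuck in different regions.
agreeing = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
disagreeing = [[0.0, 1.0, 2.0], [10.0, 11.0, 12.0]]

print(potential_scale_reduction(agreeing))
print(potential_scale_reduction(disagreeing))
```

Values close to one indicate that the between-chain and within-chain variability are comparable, whereas values well above one indicate that the chains have not mixed.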

We observe major differences in the estimated effects of weight gain between the LR and JM approaches. Both LR models find rate of weight gain to be a useful predictor of infant birth weight only after 18 weeks gestation. On the other hand, the JM models find this to be true throughout gestation, including before 18 weeks.

**Table 2.** Parameter estimates obtained using the LR and the proposed JM models, with 95% confidence and credible intervals, respectively. For JM model interpretability, we present estimates for $\sum_{j=1}^{k} b_j$ (rather than for just $b_k$), which can be interpreted as the effect of weight gain rate in the $k$th gestational interval. Boldface indicates an estimate whose corresponding credible (or confidence) interval does not contain zero.


Further, we note a difference in the direction of the estimated effect of weight gain during weeks 32–37 between the JM and LR models. Our JM approach estimates this effect to be positive, while the LR model estimates a negative effect. Given the positive estimates for other gestational intervals and the positive estimate originally reported in Retnakaran et al. [13], we suspect that the LR model is inaccurate here. As discussed previously, this could be attributed to the loss of time information or the precomputation of average weight gain measurements. These results illustrate how the LR approach might not yield reliable conclusions, even with relatively large datasets. Towards the end of this section, we also discuss the sensitivity of the LR approach to the choice of gestational intervals.

Other differences in the effect of rate of weight gain are less drastic but important nonetheless. In general, effect estimates in the LR models (relative to those in the JM models) are shrunk towards zero. We attribute this shrinkage to attenuation bias in the LR models due to self-reporting bias (in pregravid measurements) and the LR models' inability to account for short-term variation in the weight trajectories. As discussed previously, this can be due to the small number of observations used to compute each patient's pregravid weight ($\tilde{b}_0$) and interval-specific rates of weight gain (the $\tilde{b}_k$s).

Figure 1 illustrates the importance of accounting for deviation in patient-level trajectories (described by the *bi*s) from the prototype trajectory (described by *β*) in our JM approach. While an overall trend in individual fitted trajectories is apparent, we see significant amounts of variation in gestational weight gain trajectories between patients. Figure 2 illustrates our proposed model's ability to accommodate individual longitudinal trajectories even in the presence of between-patient variability.

In a separate analysis not shown in Table 2, we consider a different set of gestational intervals (i.e., the sequence of $t_k^*$s): [0, 15), [15, 20), [20, 25), [25, 30), [30, 35), and [35, 45), this time chosen out of convenience. The JM models yield similar conclusions with these different intervals while the LR models find weight gain during only 20–30 weeks gestation to be associated with infant birth weight. This demonstrates that the LR model is not robust with respect to the precomputation of interval-specific weight gain measurements and, as above, calls into question the validity of this approach.

**Figure 1.** Posterior mean estimates from the proposed JM1 model for the mean weight gain trajectory *β* (solid blue) and twenty randomly selected individual trajectories *bi* (solid grey), both as functions of gestational age (GA). The light blue and grey regions describe 95% credible bands for *β* and *bi*, respectively. Dotted grey lines indicate model changepoints (i.e., at GA = *t*<sup>∗</sup><sub>*k*</sub>).

**Examples of estimated trajectories**

**Figure 2.** Eight randomly selected estimates of individual trajectories *bi* from the JM1 model as functions of gestational age (GA) (solid grey) and corresponding observed weights *Xij*. Observed weights from the eight patients are denoted by 1, 2, ... , 8. Light grey regions denote 95% credible bands for *Xij* (each for a fixed *i*). Dotted grey lines indicate model changepoints (i.e., at GA = *t*<sup>∗</sup><sub>*k*</sub>).

### **4. Conclusions**

In this paper, we provided a hierarchical Bayesian model for the joint analysis of scalar and longitudinal data based on Jiang et al. [23]. Our work was motivated by a question in maternal health research on the relationship between (scalar) infant birth weight and (longitudinal) gestational weight gain during different periods of pregnancy. We contrasted our joint modeling approach with one using traditional linear regression that has appeared in the clinical literature [13] and is reminiscent of analyses commonly seen in applied research.

This comparative LR approach was originally proposed for a preconception cohort study to eliminate self-reporting bias in pregravid measurements [13]. However, in addition to the design's inconvenience, this approach does not fully account for gestational age or clinical measurement error and uses only a small number of observations to pre-estimate (i.e., in an initial stage separate from model estimation) weight gain in each gestational period. This results in high-variance model estimates that are not robust to the choice of gestational intervals. In contrast, through a one-stage, hierarchical Bayesian framework, our JM approach accounts for gestational age and short-term variability in longitudinal measurements, and borrows information from all observations to reduce bias and obtain more-reliable estimates.

The benefits of our model over the LR approach are apparent in our real-world study using the APrON pregnancy outcomes dataset. Beyond the LR model's questionable negative estimated association between infant birth weight and maternal weight gain for 32–37 weeks gestation, we observed relative shrinkage in LR effect estimates towards zero. This illustrates the unreliability of the LR methodology and the impact of attenuation bias on effect estimates. On the other hand, our JM approach produced estimates that were reasonable and stable, even when considering different gestational periods.

We have demonstrated the usefulness of our joint modeling approach in settings with continuous scalar and longitudinal responses. Our approach extends naturally to other submodels and data types such as ordinal health outcomes (e.g., through an appropriate (cumulative) probit or logit link function at the response level of the model) [23]. While our focus in this paper was on comparing the JM and LR approaches, the proposed model can be further optimized for predictive purposes. Our developments hold immediate implications for clinical interventions, such as the early identification of pregnant women at risk of birth complications (e.g., extreme infant birth weight or other outcomes, whether scalar or ordinal) using self-reported prepregnancy data or sparse clinical observations.

**Author Contributions:** Conceptualization, B.J. and Y.Y.; methodology, B.J. and M.P.; software, B.J. and M.P.; validation, M.P.; formal analysis, M.P.; investigation, M.P.; resources, B.J., R.C.B. and N.L.; data curation, Y.Y., R.C.B. and N.L.; writing – original draft preparation, M.P.; writing – review and editing, M.P., B.J., Y.Y., R.C.B. and N.L.; visualization, M.P.; supervision, B.J. and L.K.; project administration, B.J., L.K., R.C.B. and N.L.; funding acquisition, B.J., R.C.B. and N.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** B.J.'s research is supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada. L.K.'s research is supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada and a Canada Research Chair in Statistical Learning. M.P.'s graduate studies are supported by a Canadian Graduate Scholarship (Master's and Doctoral) from the Natural Sciences and Engineering Research Council of Canada. Y.Y.'s research is supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (RGPIN-2019-04-862) and the Women and Children's Health Research Institute. The APrON cohort was established by an interdisciplinary team grant from Alberta Innovates Health Solutions (formerly the Alberta Heritage Foundation for Medical Research). Additional funding from the Alberta Children's Hospital Foundation assisted with the collection and analysis of data presented in this manuscript.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Human Ethics Review Board (Biomedical Panel) of the University of Alberta (Pro00002954; 4 March 2009) and the Research Ethics Board of the University of Calgary (14-702; 2008, renewed 4 November 2021).

**Informed Consent Statement:** Informed consent was obtained from all participants involved in the study.

**Data Availability Statement:** Data are available from the Secondary Analyses to Generate Evidence (SAGE) databases held within the Policy Wise for Children and Families (nongovernmental) organization in Alberta, Canada: https://policywise.com/, accessed on 16 December 2021. Data are available subject to appropriate review and approvals. Access to Alberta Pregnancy Outcomes and Nutrition (APrON) data is administered by SAGE: requests can be made to data@policywise.com.

**Acknowledgments:** The authors are grateful to the families who took part in the APrON study and the APrON team (http://APrONstudy.ca, accessed on 16 December 2021), investigators, research assistants, graduate and undergraduate students, volunteers, clerical staff, and managers.

**Conflicts of Interest:** The authors declare no conflict of interest. The APrON cohort was established by an interdisciplinary team grant from Alberta Innovates Health Solutions (formerly the Alberta Heritage Foundation for Medical Research). Additional funding from the Alberta Children's Hospital Foundation assisted with the collection and analysis of data presented in this manuscript.

### **References**


### *Article* **Multivariate Functional Kernel Machine Regression and Sparse Functional Feature Selection**

**Joseph Naiman and Peter Xuekun Song \***

Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA; jnaiman@umich.edu

**\*** Correspondence: pxsong@umich.edu

**Abstract:** Motivated by mobile devices that record data at a high frequency, we propose a new methodological framework for analyzing a semi-parametric regression model that allows us to study a nonlinear relationship between a scalar response and multiple functional predictors in the presence of scalar covariates. Utilizing functional principal component analysis (FPCA) and the least-squares kernel machine method (LSKM), we are able to substantially extend the framework of semi-parametric regression models of scalar responses on scalar predictors by allowing multiple functional predictors to enter the nonlinear model. Regularization is established for feature selection in the setting of reproducing kernel Hilbert spaces. Our method simultaneously performs model fitting and variable selection on functional features. For the implementation, we propose an effective algorithm to solve the related optimization problems, in which iterations alternate between fitting linear mixed-effects models and applying a variable selection method (e.g., sparse group lasso). We show algorithmic convergence results and theoretical guarantees for the proposed methodology. We illustrate its performance through simulation experiments and an analysis of accelerometer data.

**Keywords:** functional principal component analysis; functional predictor; linear mixed-effects model; mobile device; sparse group regularization; wearable device data

### **1. Introduction**

Data captured by mobile devices have lately received much attention in the data science community. Such data are typically recorded at a high frequency, giving rise to an ample volume of information at a very fine scale, and thus present many methodological challenges in statistical modeling and data analyses. In this paper, we plan to utilize the strength of the classical kernel machine method, which enjoys fast computing speed via the linear mixed-effects model, to deal with such high-frequency data using a functional data analysis approach. The motivation for our proposed framework comes from data collected from a tri-axis accelerometer. Accelerometers, worn on the hip or wrist as a way of monitoring physical activity, are becoming more and more common [1–4]. There are several different accelerometers available, such as ActiGraph GT3X+ (ActiGraph, Pensacola, FL, USA) and Actical (Philips Respironics, Bend, OR, USA). Raw accelerometer data are often collected as high-resolution signals with a sampling frequency ranging from 30 to 100 Hz. The commercial software on these devices provides activity counts (ACs) [2,4], which are calculated from the raw accelerometer data using proprietary algorithms. As an example from our motivating dataset, Figure 1 displays a three-dimensional time series of ACs per minute, each on one axis, from one subject wearing the GT3X+ over a period of 7 days (d).

Oftentimes, different types of summaries of the tri-axis ACs are suggested in the literature rather than the use of all three raw functionals [5–8]. These summary-data-based approaches may be regarded as a quick and dirty dimension reduction strategy that yields summarized data of computationally manageable volume, which are then analyzed by existing methods and software. One concern with the use of summarized data is the loss of potential fine features that can only be captured in data of high resolution. Recently, some researchers have attempted to use the entire functional AC curve through functional data analysis techniques [6,9,10]. Further details on current methods being used to retrieve and interpret accelerometer data can be found in [11]. Our contribution in this paper pertains to a new framework in which tri-axis accelerometer data are used as three-dimensional correlated functional predictors in an association analysis with a potential health outcome such as the Body Mass Index (BMI). The relationship between physical activities and childhood obesity has long been a central interest of public health sciences, and our new scalar-on-functional regression model can provide some new insights into this important scientific problem.

**Citation:** Naiman, J.; Song, P.X. Multivariate Functional Kernel Machine Regression and Sparse Functional Feature Selection. *Entropy* **2022**, *24*, 203. https://doi.org/10.3390/e24020203

Academic Editors: S. Ejaz Ahmed and Farouk Nathoo

Received: 4 January 2022 Accepted: 26 January 2022 Published: 28 January 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**Figure 1.** Activity counts over 7 d from a tri-axis (*X*-, *Y*- and *Z*-axis) accelerometer of a subject.

We begin with a brief review of existing functional data models, the least-squares kernel machine model, and different variable selection techniques, which together set the stage for the framework of this paper.

### *1.1. Functional Regression*

There has been much attention in recent years given to functional data analysis (FDA), where the covariates, the response, or both are functional rather than scalar in nature [12–17]. In this paper, we focus on the methodology that allows us to relate multiple functional covariates to a scalar outcome in a nonlinear way in the presence of other scalar covariates. To proceed, let us introduce some notation. Let $L^2(\mathcal{T})$ be the class of square-integrable functions on a compact set $\mathcal{T}$. This is a separable Hilbert space with inner product $\langle f, g \rangle := \int_{\mathcal{T}} f g$ for $f, g \in L^2(\mathcal{T})$. Consider a probability space $(\Omega, \mathcal{F}, P)$, where $Z$ denotes a functional random variable that maps into $L^2(\mathcal{T})$, namely $Z : \Omega \to L^2(\mathcal{T})$. Define $L^2(\Omega) := \{Z : (\int_{\Omega} \|Z\|^2 \, dP)^{1/2} < \infty\}$, where $P$ is a certain probability measure, $\|Z\|^2 = \langle Z, Z \rangle$, and assume $Z \in L^2(\Omega)$ in the rest of this paper. For convenience, we also assume that $Z$ is mean centered, namely $E(Z) = 0$.

The class of functional linear models (FLM) (e.g., [13–15]) is proposed to relate a functional covariate $Z$ with a mean-centered scalar outcome $y$, which is also known as scalar-on-functional regression: $y = \langle b, Z \rangle + \varepsilon$, where the error term $\varepsilon$ is a mean zero random variable uncorrelated with $Z$. An optimal solution of the unknown functional parameter $b \in L^2(\mathcal{T})$ is typically obtained by minimizing the mean-squared error: $\inf_{b \in L^2(\mathcal{T})} E(y - \langle b, Z \rangle)^2$. Moreover, the mean model for the mean-centered scalar $y$ takes the form $E(y|Z) = \int_{\mathcal{T}} Z(t) b(t) \, dt$.

As suggested in the literature, we may obtain an optimal estimator of $b$ by expanding the functional predictor $Z$ under certain basis functions. In this paper, we focus on the utility of functional principal component analysis (FPCA) to perform the decomposition of the functional $Z$. By the Karhunen–Loève expansion (e.g., [18–20]), we may write $Z(t) = \sum_{k=1}^{\infty} \sqrt{\varsigma_k} \, \xi_k \phi_k(t)$, where $\varsigma_k > 0$ are the eigenvalues, and the loadings are given by $\xi_k := \frac{1}{\sqrt{\varsigma_k}} \langle Z, \phi_k \rangle$. These coefficients satisfy (i) mean zero, $E(\xi_k) = 0$; (ii) variance one, $E(\xi_k^2) = 1$; (iii) uncorrelated, $E(\xi_k \xi_j) = 0$ for $k \neq j$. Then, the mean model may be rewritten as follows,

$$E(y|Z) = \sum_{k=1}^{\infty} \beta_k \xi_k, \tag{1}$$

where the coefficients $\beta_k = \langle b, \sqrt{\varsigma_k} \phi_k \rangle$, $k = 1, 2, \ldots$, are unknown because $b$ is unknown. Equation (1) presents a linear projection of the scalar outcome $y$ on the space spanned by the standardized principal components (PCs) $\xi_k$ of the functional predictor $Z$. Along these lines of research, Müller and Yao (2008) [21] proposed a class of functional additive models (FAMs) that extends Equation (1) by allowing a nonparametric form of the projection:

$$E(y|Z) = \sum_{k=1}^{\infty} f_k(\xi_k), \tag{2}$$

where $f_k$ is a fully unspecified nonlinear smooth function to be estimated. It is obvious that Müller and Yao's extension given in (2) takes an additive model on the individual coefficient (or feature) components $\xi_k$. Regularization is often needed for both (1) and (2) in order to deal with these infinite-dimensional unknowns. One of the challenges concerning regularization for (2) lies in the technical treatment in the functional space. Müller and Yao (2008) [21] proposed truncation (or a hard threshold) of the eigenspace to retain only the leading components that explain the majority of the total variation in $Z$. Zhu, Yao, and Zhang (2014) [15] proposed another regularization for the functions $f_k$ using the powerful COSSO method [22]. One advantage of this kind of regularization method is that higher-order functional principal components can potentially be included in the fitted model if they make stronger contributions to the functional relationship than the leading functional principal components. This regularization method [15] begins with an additive model $E(y|Z) = \sum_{k=1}^{s} f_k(\xi_k)$, where $s$ represents an initial degree of truncation that specifies the total number of additive components to be considered. Then, COSSO helps simultaneously regularize and select important functional components among the $s$ functions $f_k$. Although the above discussion has a single functional predictor $Z$ in mind, it is appealing to extend such a framework to multiple functional predictors for a broad range of problems.

When multiple functional predictors, say $Z_1, \ldots, Z_p$, are considered, it is not clear whether the above additive model specification remains suitable to handle the complexity, especially when a non-additive relationship (e.g., interactions) may be of interest for understanding the association between a scalar outcome and multiple functional predictors. In effect, from the perspectives of both theoretical advances and application needs, relaxing the additive relationship is an important task in functional data analysis. Alternatively, there are some methods (e.g., [16,17]) in the literature that do not use the strategy of decomposing $Z$ into its functional components. In this paper, we adopt the framework of kernel machine regression models to extend the methodologies to non-additive relationships between multiple functional predictors and the scalar outcome.
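The Karhunen–Loève decomposition reviewed in this subsection can be sketched numerically for curves observed on a common, equally spaced grid. The block below is only an illustration under that assumption; the function name `fpca_scores` and the toy curves are hypothetical and are not taken from the paper. It estimates the eigenvalues $\varsigma_k$ and eigenfunctions $\phi_k$ from the discretized sample covariance and returns the standardized scores $\xi_k = \langle Z, \phi_k \rangle / \sqrt{\varsigma_k}$.

```python
import numpy as np

def fpca_scores(curves, grid, n_components):
    """Standardized FPC scores for curves sampled on a common, equally spaced grid.

    curves: (n, m) array, one row per subject; grid: (m,) array of time points.
    """
    dt = grid[1] - grid[0]                           # quadrature weight (equal spacing assumed)
    centered = curves - curves.mean(axis=0)          # empirical version of E(Z) = 0
    cov = centered.T @ centered / curves.shape[0]    # sample covariance surface on the grid
    evals, evecs = np.linalg.eigh(cov * dt)          # discretized covariance operator
    order = np.argsort(evals)[::-1][:n_components]   # keep the leading components
    eigenvalues = evals[order]                       # estimates of varsigma_k
    phi = evecs[:, order] / np.sqrt(dt)              # eigenfunctions with unit L2 norm
    inner = centered @ phi * dt                      # <Z_i, phi_k> by a Riemann sum
    scores = inner / np.sqrt(eigenvalues)            # xi_k = <Z, phi_k> / sqrt(varsigma_k)
    return scores, eigenvalues, phi

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 50)
# Hypothetical toy curves: two random sinusoidal modes plus small noise
curves = (rng.normal(size=(200, 1)) * np.sin(2 * np.pi * grid)
          + 0.5 * rng.normal(size=(200, 1)) * np.cos(2 * np.pi * grid)
          + 0.05 * rng.normal(size=(200, 50)))
scores, eigenvalues, phi = fpca_scores(curves, grid, n_components=2)
```

By construction, the sample scores have mean zero, unit second moment, and zero cross-moments, mirroring properties (i) to (iii) of the loadings. Irregular or sparse sampling would require a smoothing-based FPCA instead of this direct eigendecomposition.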

### *1.2. Least-Squares Kernel Machine*

Liu, Lin, and Ghosh (2007) [23] proposed a semi-parametric regression model $y_i = \mathbf{x}_i^\top \beta + h(\mathbf{z}_i) + \varepsilon_i$ for subject $i = 1, \ldots, n$, where they used the least-squares kernel machine (LSKM) to analyze multidimensional genetic pathways denoted by a vector $\mathbf{z}_i$. The key feature of this model is the nonlinear relationship between the outcome $y_i$ and a vector of gene expressions $\mathbf{z}_i$, which is characterized by a nonparametric smooth function $h$. Under the theory of smoothing splines, the function $h$ is assumed to lie in a reproducing kernel Hilbert space (RKHS), $\mathcal{H}_{\mathcal{K}}$, generated by a positive-definite kernel function $\mathcal{K}(\cdot, \cdot)$. For ease of exposition, we suppress the bandwidth for the kernel $\mathcal{K}$ in the following discussion. Then, both the parameter $\beta$ and the function $h$ are estimated by maximizing the scaled penalized likelihood function:

$$J(h, \beta) = -\frac{1}{2} \sum_{i=1}^{n} \{y_i - \mathbf{x}_i^\top \beta - h(\mathbf{z}_i)\}^2 - \frac{1}{2} \lambda_1 \|h\|_{\mathcal{H}_{\mathcal{K}}}^2, \tag{3}$$

where $\lambda_1 > 0$ is the tuning parameter and $\|\cdot\|_{\mathcal{H}_{\mathcal{K}}}$ is the norm of the RKHS. For a function $h \in L^2(\mathcal{H}_{\mathcal{K}})$, we have $h(\cdot) = \sum_{i=1}^{n} \alpha_i \mathcal{K}(\cdot, \mathbf{z}_i)$. Then, $\|h\|_{\mathcal{H}_{\mathcal{K}}}^2 = \alpha^\top \mathbf{K} \alpha$, where $\mathbf{K}$ is an $n \times n$ matrix whose $(i, j)$ entry is $\mathcal{K}(\mathbf{z}_i, \mathbf{z}_j)$ and $\alpha = (\alpha_1, \ldots, \alpha_n)^\top$.

It is known in the literature (e.g., [23,24]) that maximizing $J(h, \beta)$ in (3) turns out to be equivalent to solving the normal equations from the following linear mixed-effects model (LMM): $\mathbf{Y} = \mathbf{X}\beta + \mathbf{h} + \boldsymbol{\varepsilon}$, where $\mathbf{h}$ is an $n \times 1$ vector of random effects with distribution $N(\mathbf{0}, \tau\mathbf{K})$ and $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2\mathbf{I})$ is an $n$-dimensional vector of error terms, with $\tau = \lambda_1^{-1}\sigma^2 > 0$. One remarkable advantage of solving (3) through the existing numerical procedure of the LMM, widely advocated in the literature [25], is that we can determine the smoothing parameter $\lambda_1$ as part of the estimation of the variance components of the LMM. Therefore, instead of using cross-validation or other information-based tuning methods on $\lambda_1$, we can solve simultaneously for all the model parameters in (3), as shown in [23]. Utilizing this numerical strength of the kernel machine regression model, we propose a semi-parametric regression model by incorporating functional principal components of functional predictors (i.e., the $\mathbf{z}_i$) to evaluate a nonlinear relationship of a scalar outcome with multiple functional covariates in a non-additive way. Assuming that the function $h$ belongs to an RKHS, we can use existing software packages for solving LMMs to obtain estimates of all model parameters and the smoothing parameter.
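For a fixed tuning parameter, the maximizer of (3) has a closed form that coincides with the generalized least-squares estimate and best linear unbiased predictor of the LMM above, because the marginal covariance $\tau\mathbf{K} + \sigma^2\mathbf{I}$ is proportional to $\mathbf{K} + \lambda_1\mathbf{I}$. The following minimal sketch illustrates this; it assumes a Gaussian kernel and holds $\lambda_1$ fixed rather than estimating it by REML, and the helper names `gaussian_kernel` and `lskm_fit` as well as the simulated data are hypothetical.

```python
import numpy as np

def gaussian_kernel(Z, bandwidth=1.0):
    """n x n Gaussian kernel matrix with entries exp(-||z_i - z_j||^2 / bandwidth)."""
    sq = np.sum(Z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-np.maximum(d2, 0.0) / bandwidth)

def lskm_fit(y, X, K, lam):
    """Maximize (3) over (beta, alpha) for a fixed tuning parameter lam = lambda_1."""
    n = len(y)
    A = K + lam * np.eye(n)                  # proportional to the LMM marginal covariance
    Ainv_X = np.linalg.solve(A, X)
    Ainv_y = np.linalg.solve(A, y)
    beta = np.linalg.solve(X.T @ Ainv_X, X.T @ Ainv_y)   # generalized least squares
    alpha = np.linalg.solve(A, y - X @ beta)             # fitted h at the data is K @ alpha
    return beta, alpha

rng = np.random.default_rng(1)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])    # intercept plus one scalar covariate
Z = rng.normal(size=(n, 2))                              # two hypothetical features
y = X @ np.array([1.0, 2.0]) + np.sin(Z[:, 0]) * Z[:, 1] + 0.1 * rng.normal(size=n)
K = gaussian_kernel(Z, bandwidth=2.0)
beta, alpha = lskm_fit(y, X, K, lam=0.5)
h_hat = K @ alpha                                        # estimated nonparametric component
```

In practice, $\lambda_1$ would be obtained from the estimated variance components of the LMM rather than supplied by hand, which is the advantage emphasized in the paragraph above.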

### *1.3. Feature Selection*

To deal with high-dimensional functional principal components from functional covariates, we invoke the sparse regularization approach in the kernel machine regression model. Note that for both mean models (1) and (2), one needs to truncate the series from the Karhunen–Loève expansion. Regularization helps reduce an infinite number of terms to a finite sum. To introduce some notation, we present a brief review of the group lasso (GL) [26], sparse group lasso (SGL) [27], and non-negative garrote [28]. See also the series of work originated by COSSO [22]. Yuan and Lin (2007) [26] proposed the group lasso, which solves the convex optimization problem $\min_{\beta \in \mathbb{R}^p} \left\| \mathbf{Y} - \sum_{\ell=1}^{L} \mathbf{X}^{\ell} \beta^{\ell} \right\|_2^2 + \lambda \sum_{\ell=1}^{L} \left\| \beta^{\ell} \right\|_2$, where $L$ is the total number of groups of covariates and $\mathbf{X}^{\ell}$ refers to the subset of covariates associated with group $\ell$. Friedman, Hastie, and Tibshirani [27] extended the group lasso to allow within-group sparsity, namely SGL, given as $\min_{\beta \in \mathbb{R}^p} \left\| \mathbf{Y} - \sum_{\ell=1}^{L} \mathbf{X}^{\ell} \beta^{\ell} \right\|_2^2 + \lambda(1 - \delta) \sum_{\ell=1}^{L} \left\| \beta^{\ell} \right\|_2 + \lambda\delta \|\beta\|_1$, where $\delta \in [0, 1]$. The additional $\ell_1$-norm penalty term on $\beta$ encourages individual sparsity, while the first penalty targets sparsity at the group level. It is easy to see that the group lasso is a special case of the SGL when $\delta = 0$.
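The interplay between the two penalty terms can be checked on a small numerical example. The sketch below evaluates the SGL penalty for a hypothetical grouped coefficient vector; the function name `sgl_penalty` and the coefficient values are illustrative assumptions only.

```python
import math

def sgl_penalty(groups, lam, delta):
    """lam * [(1 - delta) * sum_l ||beta^l||_2 + delta * ||beta||_1] for grouped coefficients."""
    group_term = sum(math.sqrt(sum(b * b for b in g)) for g in groups)
    l1_term = sum(abs(b) for g in groups for b in g)
    return lam * ((1.0 - delta) * group_term + delta * l1_term)

beta_groups = [[3.0, 4.0], [0.0, 0.0], [-1.0]]   # hypothetical coefficients in three groups
group_lasso = sgl_penalty(beta_groups, lam=2.0, delta=0.0)   # delta = 0: group lasso penalty
lasso = sgl_penalty(beta_groups, lam=2.0, delta=1.0)         # delta = 1: plain lasso penalty
mixed = sgl_penalty(beta_groups, lam=2.0, delta=0.5)         # equal weight on both terms
```

Here the group norms are 5, 0, and 1 and the $\ell_1$ norm is 8, so the three penalties evaluate to 12, 16, and 14, respectively. A group whose coefficients are all zero contributes nothing to either term.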

The non-negative garrote proposed by Breiman (1995) [28] is another useful means of variable selection. It invokes a scaled version of least-squares estimation given by $\arg\min_{\mathbf{d}} \frac{1}{2} \left\| \mathbf{Y} - \tilde{\mathbf{X}}\mathbf{d} \right\|_2^2 + \lambda \sum_{j=1}^{p} d_j$, subject to $d_j \geq 0$, $j = 1, \ldots, p$. Here, $\tilde{\mathbf{X}} = (\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_p)$ is an $n \times p$ matrix with columns $\tilde{\mathbf{x}}_j = \mathbf{x}_j \hat{\beta}_j^{OLS}$, with $\hat{\beta}_j^{OLS}$ being the least-squares estimates from $\arg\min_{\beta} \frac{1}{2} \|\mathbf{Y} - \mathbf{X}\beta\|_2^2$ with no constraints. Obviously, an estimate $\hat{d}_j = 0$ implies that covariate $x_j$ is excluded from the fitted model. Breiman's formulation, which turns a variable selection problem into a parameter estimation problem, will be applied for the development of feature selection on functional principal components in this paper.
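To see how the garrote turns selection into estimation, consider the special case of an orthonormal design, $\mathbf{X}^\top\mathbf{X} = \mathbf{I}$ (an assumption made here purely for illustration). The objective above then separates across $j$, and setting its derivative to zero gives $\hat{d}_j = \max\{0, 1 - \lambda / (\hat{\beta}_j^{OLS})^2\}$. The sketch below applies this closed form; the function name and the coefficient values are hypothetical.

```python
def garrote_orthonormal(beta_ols, lam):
    """Garrote factors d_j = max(0, 1 - lam / beta_j^2), valid when X^T X = I."""
    return [max(0.0, 1.0 - lam / (b * b)) if b != 0.0 else 0.0 for b in beta_ols]

beta_ols = [2.0, 0.5, -1.0]                 # hypothetical least-squares estimates
d = garrote_orthonormal(beta_ols, lam=0.5)  # shrinkage factors, one per covariate
beta_garrote = [dj * bj for dj, bj in zip(d, beta_ols)]
```

With these values the factors are 0.875, 0, and 0.5: the covariate with the smallest least-squares coefficient is dropped, while the others are shrunk in proportion to how weak their least-squares signal is. For a general design, the factors would instead be obtained with a constrained quadratic programming solver.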

This paper is organized as follows. Section 2 introduces our proposed high-dimensional kernel machine regression. Section 3 outlines a simple step-by-step algorithm that is used to implement the sparse estimation method. Section 4 concerns asymptotic properties for our proposed sparse kernel machine regression. Section 5 provides simulation results to examine the performance of our method, with comparisons with existing methods. Section 6 illustrates the proposed method by an association analysis of the relationship between the BMI and functional accelerometer data. Section 7 includes our conclusions. Appendix A contains some key technical details, including the proofs of the theoretical results, while Appendix B presents a discussion on the model identifiability issue.

### **2. Model and Estimation**

Consider a regression analysis of a scalar outcome $y$ on $p$ functional covariates, $Z_\ell$, $\ell = 1, \ldots, p$. Let $\mathbf{z}_i^{\ell} = (\xi_1^{\ell}, \ldots, \xi_{s_\ell}^{\ell})_i^\top$ be the $s_\ell$-element vector of functional principal component (FPC) features from the $i$th observation of the $\ell$th functional covariate $Z_\ell$, and let $\vec{\mathbf{z}}_i = [(\mathbf{z}_i^{1})^\top, \ldots, (\mathbf{z}_i^{p})^\top]^\top$ be the grand vector of all FPC features from all $p$ functional covariates for subject $i$, $i = 1, \ldots, n$. Clearly, the set of FPC features from each functional covariate forms a group, and in total, there are $p$ groups with $s = \sum_{\ell=1}^{p} s_\ell$ FPC features and $\vec{\mathbf{z}}_i \in \mathbb{R}^s$. The high dimensionality of FPC features presents the key methodological challenge in the analysis. We consider the following functional kernel machine regression (FKMR) model:

$$y_i = \mathbf{x}_i^\top \beta + h(\vec{\mathbf{z}}_i) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{4}$$

where $\beta \in \mathbb{R}^q$ is a set of parameters for the effects of $q$ scalar covariates $\mathbf{x} = (x_1, \ldots, x_q)^\top$, $h \in \mathcal{H}_{\mathcal{K}}$ is an $s$-variate smooth nonparametric function with $\mathcal{H}_{\mathcal{K}}$ being the functional space generated by a *Mercer kernel* $\mathcal{K}$, and the error terms $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$. The FKMR model (4) allows for not only nonlinear, but also non-additive relationships between multiple functional covariates $Z_\ell$, $\ell = 1, \ldots, p$, via their FPC features, and a scalar outcome $y$. The statistical task is to estimate and select important functional covariates that are related to the outcome of interest through regularizing the FPC features within each functional covariate. To proceed, following Breiman's [28] non-negative garrote method, we introduce a new $s$-dimensional scaling vector $\gamma \in \mathbb{R}^s$, $\gamma = (\gamma_1, \ldots, \gamma_{s_1}, \ldots, \gamma_s)^\top$, by which we can set $\gamma \circ \vec{\mathbf{z}}_i = (\gamma_1 \xi_1^{1}, \ldots, \gamma_{s_1} \xi_{s_1}^{1}, \ldots, \gamma_s \xi_{s_p}^{p})_i^\top$, a new vector of FPC features weighted by $\gamma$ via the Hadamard product (i.e., elementwise product). Note that $\gamma$ is grouped and denoted by $\gamma = ((\gamma^{1})^\top, \ldots, (\gamma^{p})^\top)^\top$, where $\gamma^{\ell}$ is the $s_\ell$-element vector of scaling factors for the FPC features $\mathbf{z}^{\ell}$ of the $\ell$th functional covariate $Z_\ell$. When an element, say $\gamma_j$, is equal to zero, the corresponding FPC feature $\xi_j$ will not be selected in the set of important FPCs, and moreover, the functional covariate $Z_\ell$ is excluded from the FKMR model when the entire vector $\gamma^{\ell} = \mathbf{0}$.

We estimate the unknowns in the FKMR model (4), as well as the scaling parameters *γ* by minimizing the penalized objective function *J*1(*h*, *β*, *γ*), whose expression is given on the right-hand side of the following Equation (5):

$$\min_{h, \beta, \gamma} J_1(h, \beta, \gamma) = \min_{h, \beta, \gamma} \frac{1}{2n} \sum_{i=1}^{n} \left\{ y_i - \mathbf{x}_i^\top \beta - h(\gamma \circ \vec{\mathbf{z}}_i) \right\}^2 + \frac{1}{2} \lambda_1 \|h\|_{\mathcal{H}_{\mathcal{K}}}^2 + \lambda_2 \rho(\gamma; \delta), \tag{5}$$

where $\lambda_1 > 0$ and $\lambda_2 > 0$ are two tuning parameters, and the penalty $\rho(\gamma; \delta)$ may be specified according to a certain regularization method. For the case of sparse group lasso (SGL), we take $\rho(\gamma; \delta) = (1 - \delta) \sum_{\ell=1}^{p} \left\| \gamma^{\ell} \right\|_2 + \delta \|\gamma\|_1$, $\delta \in [0, 1]$. Typically, $\delta$ is predetermined and set to 0.95 or 0.05 depending on the trade-off between group and within-group sparsity, while the factor $(1 - \delta)$ controls the relative weight of group sparsity to individual sparsity for each functional predictor $Z_\ell$. Meanwhile, a large value of the tuning parameter $\lambda_2$ would remove a certain group of FPC features from the FKMR model, which occurs when all elements in the vector $\gamma^{\ell}$ are zero. Given $h \in \mathcal{H}_{\mathcal{K}}$, an equivalent optimization to the above (5) can be formulated as follows:

$$\begin{split} \min_{\alpha, \beta, \gamma} J_2(\alpha, \beta, \gamma) = \min_{\alpha, \beta, \gamma} \; & \frac{1}{2n} \sum_{i=1}^{n} \left\{ y_i - \mathbf{x}_i^\top \beta - \sum_{k=1}^{n} \alpha_k \mathcal{K}(\gamma \circ \vec{\mathbf{z}}_i, \gamma \circ \vec{\mathbf{z}}_k) \right\}^2 \\ & + \frac{1}{2} \lambda_1 \alpha^\top \mathbf{K}(\gamma; \mathbf{Z}) \alpha + \lambda_2 \rho(\gamma; \delta), \end{split} \tag{6}$$

where $\mathbf{K}(\gamma; \mathbf{Z})$ is an $n \times n$ matrix whose $(i, k)$th element is $[\mathbf{K}(\gamma; \mathbf{Z})]_{ik} = \mathcal{K}(\gamma \circ \vec{\mathbf{z}}_i, \gamma \circ \vec{\mathbf{z}}_k)$. Lemma 1 below establishes the equivalence of the optimization solutions between (5) and (6), which is crucial in our estimation procedure.
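The role of the scaling vector in the kernel matrix of (6) can be illustrated with a small example. The sketch below assumes a Gaussian kernel, which is one common Mercer kernel and is used here only for illustration; the function name and the feature values are hypothetical. Setting every entry of a group of $\gamma$ to zero makes the kernel matrix insensitive to that group's FPC features, which is how a functional covariate drops out of the model.

```python
import math

def scaled_gaussian_kernel(features, gamma, bandwidth=1.0):
    """Matrix with (i, k) entry K(gamma * z_i, gamma * z_k) for a Gaussian kernel."""
    scaled = [[g * z for g, z in zip(gamma, row)] for row in features]
    n = len(scaled)
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(scaled[i], scaled[k])) / bandwidth)
             for k in range(n)] for i in range(n)]

# Three subjects; the first two features belong to covariate 1, the third to covariate 2
features_a = [[0.2, -1.0, 3.0], [1.5, 0.4, -2.0], [-0.7, 0.9, 0.5]]
features_b = [[0.2, -1.0, -9.0], [1.5, 0.4, 7.0], [-0.7, 0.9, 0.0]]  # only feature 3 differs
gamma = [1.0, 0.5, 0.0]                   # the second group is scaled to zero
K_a = scaled_gaussian_kernel(features_a, gamma)
K_b = scaled_gaussian_kernel(features_b, gamma)
```

Because the third scaling factor is zero, `K_a` and `K_b` coincide even though the third feature differs between the two data sets, so the second covariate has no influence on the fitted function.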

**Lemma 1.** *A solution* $(\hat{h}, \hat{\beta}, \hat{\gamma})$ *is a minimizer of* (5) *if and only if* $(\hat{\alpha}, \hat{\beta}, \hat{\gamma})$ *is a minimizer of* (6)*, where* $\hat{h}(\hat{\gamma} \circ \vec{\mathbf{z}}) = \sum_{k=1}^{n} \hat{\alpha}_k \mathcal{K}(\hat{\gamma} \circ \vec{\mathbf{z}}, \hat{\gamma} \circ \vec{\mathbf{z}}_k)$*.*

The proof of Lemma 1 is given in Appendix A.1.

**Theorem 1** (Existence of optimizers)**.** *If the kernel* $\mathcal{K}(\cdot, \gamma \circ \vec{\mathbf{z}})$ *is continuous with respect to* $\gamma \in \mathbb{R}^s$*, then there exists a global minimizer* $(\hat{h}, \hat{\beta}, \hat{\gamma})$ *for the optimization problem* (5)*.*

The proof of Theorem 1 is given in Appendix A.3. Note that there may exist multiple optimal minimizers for (5); Theorem 1 ensures only the existence of optimal solutions, but provides no guarantees for uniqueness due to the fact that (5) or (6) is a nonlinear and non-convex optimization problem. It is worth noting that in both (5) and (6), we set the bandwidth for the kernel at a fixed value due to the identifiability issue with respect to the scaling parameters *γ*. Refer to Appendix B for more detailed discussions on the issue of parameter identifiability.

### **3. Implementation and Algorithm**

We propose an iterative algorithm to implement our proposed estimation procedure, for which we require the differentiability of the kernel with respect to the scaling factor $\gamma$ and some additional assumptions presented below in order to ensure algorithmic convergence. One part of the algorithm solving (5) is carried out under fixed $\gamma$, where the resulting minimization problem reduces to the equivalent maximization problem in the least-squares kernel machine (3) with the FPC features $\vec{\mathbf{z}}_i$ replaced by $\gamma \circ \vec{\mathbf{z}}_i$. As pointed out in Section 1.2, this numerical step can be easily executed in the same fashion as the solution of the linear mixed model, including the REML estimation of the smoothing parameter $\lambda_1$. The other part of the algorithm is performed under fixed $\alpha$, $\beta$, and $\lambda_1$, where we solve the nonlinear and non-convex optimization problem to update the estimate of $\gamma$. Lemma 2 below helps us solve for the scaling parameter $\gamma$.

**Lemma 2.** *For fixed (α, β, λ*1*), minimizing* (6) *over γ is equivalent to minimizing over γ the following objective function:*

$$\frac{1}{2n} \left\| \mathbf{F}(\gamma) - \mathbf{\tilde{Y}} \right\|\_{2}^{2} + \lambda\_{2} \rho(\gamma; \delta), \text{ for } \lambda\_{2} > 0,\tag{7}$$

*where* $\mathbf{F}(\gamma) = \mathbf{K}(\gamma; Z)\alpha$ *and* $\tilde{\mathbf{Y}} = \mathbf{Y} - \mathbf{X}\beta - \frac{n}{2}\lambda_1\alpha$*.*

The proof of Lemma 2 is given in Appendix A.2. Linearizing the function **F**(*γ*) in (7) leads to an equivalent form:

$$\min\_{\gamma} \frac{1}{2n} \left\| \hat{\mathbf{Y}} - \sum\_{\ell=1}^{p} \nabla\_{\gamma} \mathbf{F}^{(\ell)}(\tilde{\gamma}) \boldsymbol{\gamma}^{\ell} \right\|\_{2}^{2} + \lambda\_{2} \rho(\gamma; \delta), \tag{8}$$

where $\hat{\mathbf{Y}} = \mathbf{Y} - \mathbf{X}\beta - \frac{n}{2}\lambda_1\alpha - \mathbf{F}(\tilde{\gamma}) + \nabla_{\gamma}\mathbf{F}(\tilde{\gamma})\tilde{\gamma}$, with $\nabla_{\gamma}\mathbf{F}(\tilde{\gamma})$ being the gradient of the function $\mathbf{F}$ with respect to $\gamma$ evaluated at some $\tilde{\gamma}$, and $\nabla_{\gamma}\mathbf{F}^{(\ell)}(\tilde{\gamma})$ being the columns of $\nabla_{\gamma}\mathbf{F}(\tilde{\gamma})$ associated with the $\ell$th group of $\gamma$. This has precisely the form of the standard sparse group regularization problem, $\min_{\beta \in \mathbb{R}^p} \frac{1}{2n}\left\|\mathbf{Y} - \sum_{\ell=1}^{p}\mathbf{X}^{(\ell)}\beta^{\ell}\right\|_2^2 + \lambda_2\rho(\beta;\delta)$, so (8) is a standard sparse group regularization problem with a specific choice of penalty function $\rho(\gamma;\delta)$.

The convergence of the above iterative search algorithm for updating $\tilde{\gamma}$ under fixed (*α*, *β*, *λ*1) can be justified by the proximal Gauss–Newton method [29]; readers are referred to [30] for details on this method. One of the key assumptions of the proximal Gauss–Newton method is the existence of a local minimizer. This condition is satisfied for (8) because, according to Theorem 1, a global minimizer exists.
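The linearize-then-shrink idea behind (8) can be sketched in a deliberately simplified setting. The code below is only an illustration under stated assumptions, not the authors' implementation: it takes a scalar *γ*, replaces the penalty *ρ* by a plain lasso penalty *λ*|*γ*|, and adds a simple step-halving safeguard (our addition) so that the objective never increases. The function names `soft_threshold`, `objective`, and `prox_gauss_newton` are hypothetical.

```python
import math

def soft_threshold(x, lam):
    """Proximal operator of lam * |.| (scalar soft-thresholding)."""
    return math.copysign(max(abs(x) - lam, 0.0), x)

def objective(F, y_tilde, gamma, lam):
    """(1/2n) * ||F(gamma) - y_tilde||^2 + lam * |gamma|, a scalar analogue of (7)."""
    n = len(y_tilde)
    rss = sum((f - y) ** 2 for f, y in zip(F(gamma), y_tilde))
    return rss / (2 * n) + lam * abs(gamma)

def prox_gauss_newton(F, dF, y_tilde, gamma0, lam, n_iter=100, tol=1e-10):
    """Repeatedly linearize F around the current gamma and solve the lasso subproblem."""
    n = len(y_tilde)
    gamma = gamma0
    for _ in range(n_iter):
        f, g = F(gamma), dF(gamma)
        # Working response of the linearized problem, as in (8).
        y_hat = [y - fi + gi * gamma for y, fi, gi in zip(y_tilde, f, g)]
        gg = sum(gi * gi for gi in g) / n
        if gg == 0.0:
            break  # zero gradient: the linearized problem carries no information
        gy = sum(gi * yi for gi, yi in zip(g, y_hat)) / n
        target = soft_threshold(gy, lam) / gg  # closed-form scalar lasso solution
        current = objective(F, y_tilde, gamma, lam)
        step, new = 1.0, target
        # Step halving (an added safeguard, not part of the source algorithm).
        while objective(F, y_tilde, new, lam) > current and step > 1e-8:
            step *= 0.5
            new = gamma + step * (target - gamma)
        if objective(F, y_tilde, new, lam) > current or abs(new - gamma) < tol:
            break
        gamma = new
    return gamma
```

When `F` is linear in *γ*, a single iteration already returns the exact lasso solution; for a nonlinear `F` the iteration can stop at a stationary point only, in line with remark (iii) on the two algorithms.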

Algorithm 1 summarizes these iterative steps, which can be shown to satisfy a descent property, $J_2(\alpha^{(r+1)}, \beta^{(r+1)}, \gamma^{(r+1)}) \le J_2(\alpha^{(r)}, \beta^{(r)}, \gamma^{(r)})$, under the convergence of the proximal Gauss–Newton algorithm for Step 2.2.

**Algorithm 1** An iterative algorithm for optimization in FKMR.


To speed up Algorithm 1, we propose the following operational scheme, which avoids setting up the pairs (*λ*1, *λ*2) and performing Step 3.1. Here are a few remarks on the two algorithms. (i) Algorithm 2 depends on good starting values in order to enjoy a fast search. (ii) The main difference between Algorithms 1 and 2 is that *λ*2 is fixed in Algorithm 1, while it changes in Algorithm 2. Similar algorithms with changing tuning parameters have been proposed in the literature, for example for the single index model [31]. (iii) There is no guarantee that either algorithm converges to a global minimizer, and the proximal Gauss–Newton method used in the implementation can only find stationary points. Numerical solvers for the optimization problem in (5) or (6) indeed remain an open problem in the field of nonlinear and non-convex optimization.

**Algorithm 2** A fast operational scheme of Algorithm 1.


### **4. Theoretical Guarantees**

Our theoretical analysis focuses on finite-sample *L*<sub>2</sub> error bounds for the estimators $(\hat{h}, \hat{\gamma})$ obtained by (5) or (6), from which we establish estimation consistency. For simplicity, we set $\beta = \mathbf{0}$ and consider a general setting of random vectors $\mathbf{z}_1, \ldots, \mathbf{z}_n$, so that the FPC features $\mathbf{z}_1, \ldots, \mathbf{z}_n$ correspond to a special case. Along similar lines as those of [15,32], the estimation consistency is proven in the case of the SGL penalty function. We define a map $\Gamma$ with an *s*-element vector $\gamma \in \mathbb{R}^s$, which gives rise to the collection of all scaling map functions: $\mathcal{A} = \{\Gamma : \mathbb{R}^s \to \mathbb{R}^s \mid \Gamma(\mathbf{z}) = \gamma \circ \mathbf{z},\ \mathbf{z} \in \mathbb{R}^s \text{ and } \gamma \in \mathbb{R}^s\}$. Since $\Gamma$ is a linear (and bounded) operator, $\mathcal{A}$ is a real vector space where $(c_1\Gamma_1 + c_2\Gamma_2)(\mathbf{z}) = c_1\Gamma_1(\mathbf{z}) + c_2\Gamma_2(\mathbf{z})$ for any $c_1, c_2 \in \mathbb{R}$ and $\Gamma_1, \Gamma_2 \in \mathcal{A}$. To perform a group regularization estimation, we define an SGL penalty by a norm on $\mathcal{A}$ for a fixed $\delta \in [0, 1]$ as follows:

$$\|\Gamma\|_{SGL} = \delta \sum_{\ell=1}^{p} \left\| \gamma^{\ell} \right\|_{2} + (1 - \delta) \|\gamma\|_{1}. \tag{9}$$
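As a small illustration of (9), the following sketch evaluates the SGL norm for a scaling vector that has been split into groups. The function name `sgl_norm` and the list-of-lists layout for the groups are assumptions made for this example only.

```python
import math

def sgl_norm(gamma_groups, delta):
    """SGL norm of (9): delta * sum of group L2 norms + (1 - delta) * L1 norm.

    gamma_groups is a list of groups, each group a list of scaling factors.
    """
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    group_part = sum(math.sqrt(sum(g * g for g in grp)) for grp in gamma_groups)
    l1_part = sum(abs(g) for grp in gamma_groups for g in grp)
    return delta * group_part + (1.0 - delta) * l1_part
```

With `delta = 1` only the group-wise L2 norms remain, and with `delta = 0` the expression reduces to the ordinary L1 norm of the full vector.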

Consequently, the SGL regularization estimation requires the following constrained optimization:

$$\min_{\Gamma \in \mathcal{A},\, h \in \mathcal{H}_{\mathcal{K}}} J_3(\Gamma, h) = \min_{\Gamma \in \mathcal{A},\, h \in \mathcal{H}_{\mathcal{K}}} \|\mathbf{Y} - h \circ \Gamma\|_n^2 + \lambda_1 \|h\|_{\mathcal{H}_{\mathcal{K}}}^2 + \lambda_2 \|\Gamma\|_{SGL}, \tag{10}$$

where $\|\mathbf{Y} - h \circ \Gamma\|_n^2 = \frac{1}{n}\sum_{i=1}^{n}\{y_i - (h \circ \Gamma)(\mathbf{z}_i)\}^2$. Lemma 3 below provides the essential finite-sample inequality that leads to the estimation consistency.

**Lemma 3** (Basic inequality)**.** *Let* $\hat{h} \circ \hat{\Gamma}$ *be the minimizer of* (10)*. Let* $h_0 \circ \Gamma_0$ *be the true function. Then, we have:*

$$J_3(\hat{\Gamma}, \hat{h}) \le 2(\boldsymbol{\epsilon}, \hat{h} \circ \hat{\Gamma} - h_0 \circ \Gamma_0)_n + \lambda_1 \|h_0\|_{\mathcal{H}_{\mathcal{K}}}^2 + \lambda_2 \|\Gamma_0\|_{SGL}, \tag{11}$$

*where* $2(\boldsymbol{\epsilon}, \hat{h} \circ \hat{\Gamma} - h_0 \circ \Gamma_0)_n = \frac{2}{n}\sum_{i=1}^{n} \epsilon_i \left\{ (\hat{h} \circ \hat{\Gamma})(\mathbf{z}_i) - (h_0 \circ \Gamma_0)(\mathbf{z}_i) \right\}$*.*

We need the following notation before presenting our theoretical guarantees. Let $\mathcal{N}(\delta, \mathcal{M}, P_n)$ denote the minimal $\delta$-covering number of the function set $\mathcal{M}$ under the empirical metric $P_n$ based on the random vectors $\mathbf{z}_1, \ldots, \mathbf{z}_n$, and let $N = \mathcal{N}(\delta, \mathcal{M}, P_n)$ be a shorthand notation. This means that there exist functions $m_1, \ldots, m_N$ (not necessarily in the set $\mathcal{M}$) such that for every function $m \in \mathcal{M}$, there exists a $j \in \{1, \ldots, N\}$ such that $\|m - m_j\|_{P_n} \le \delta$, with $\|m - m_j\|_{P_n} := \frac{1}{n}\sum_{i=1}^{n}\{m(\mathbf{z}_i) - m_j(\mathbf{z}_i)\}^2$. Define the $\delta$-entropy of $\mathcal{M}$ for the empirical metric $P_n$ as $H(\delta, \mathcal{M}, P_n) := \log \mathcal{N}(\delta, \mathcal{M}, P_n)$. Consider a functional space of the form:

$$\mathcal{B} = \left\{ b := b(h, \Gamma) = \frac{h \circ \Gamma - h_0 \circ \Gamma_0}{\|h\|_{\mathcal{H}_{\mathcal{K}}}^2 + \|h_0\|_{\mathcal{H}_{\mathcal{K}}}^2 + \|\Gamma\|_{SGL}^2 + \|\Gamma_0\|_{SGL}^2} \;\middle|\; h \in \mathcal{H}_{\mathcal{K}},\ \Gamma \in \mathcal{A} \right\}.$$

We postulate the following assumptions.

**Assumption 1.** *The error term* $\boldsymbol{\epsilon} = (\epsilon_1, \ldots, \epsilon_n)$ *is uniformly sub-Gaussian; that is, for constants* $C_1$ *and* $C_2$*,*

$$\max\_{n\geq 1} \max\_{i=1,\cdots,n} C\_1^2 \left[ E\left\{ \exp\left(\frac{\epsilon\_i^2}{C\_1^2}\right) \right\} - 1 \right] \leq C\_2.$$

*Clearly, the left-hand side of this moment condition is bounded below by zero.*

**Assumption 2.** $\|\Gamma_0\|_{SGL}^2 + \|h_0\|_{\mathcal{H}_{\mathcal{K}}}^2 > 0$*, and the entropy of space* $\mathcal{B}$ *with respect to the empirical metric* $P_n$ *is bounded as follows:*

$$H(\delta, \mathcal{B}, P_n) \le C_3 \delta^{-2\psi},$$

*where* $C_3$ *is some constant and* $\psi \in (0, 1)$*.*

**Assumption 3.** $\sup_{b \in \mathcal{B}} \|b\|_{P_n} \le C_4$ *for some constant* $C_4$*.*

**Theorem 2** (Consistency)**.** *Under Assumptions 1–3 above, if the tuning parameters* $\lambda_1$ *and* $\lambda_2$ *satisfy*

$$\lambda_2^{-1} = n^{\frac{1}{1+\psi}} \left( \|h_0\|_{\mathcal{H}_{\mathcal{K}}}^2 + \|\Gamma_0\|_{SGL} \right)^{\frac{1-\psi}{1+\psi}}, \text{ and } \lambda_1 = O_P(1)\lambda_2,$$

*then we have*

$$\left\| \hat{h} \circ \hat{\Gamma} - h_0 \circ \Gamma_0 \right\|_n = O_P\left(n^{-\frac{1}{2+2\psi}}\right) \left( \|h_0\|_{\mathcal{H}_{\mathcal{K}}}^2 + \|\Gamma_0\|_{SGL} \right)^{\frac{\psi}{1+\psi}}, \text{ and} \tag{12}$$

$$\|\hat{h}\|_{\mathcal{H}_{\mathcal{K}}}^2 + \|\hat{\Gamma}\|_{SGL} = O_P(1) \left( \|h_0\|_{\mathcal{H}_{\mathcal{K}}}^2 + \|\Gamma_0\|_{SGL} \right). \tag{13}$$

Theorem 2 implies estimation consistency under the right rates for the two tuning parameters *λ*1 and *λ*2. Although the estimator $(\hat{h}, \hat{\Gamma})$ may not be unique because of the potential identifiability issues explained in detail in Appendix B, (13) shows that the sum of the norms of $\hat{h}$ and $\hat{\Gamma}$ is not too far from that of the true $h_0$ and $\Gamma_0$.
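To make the rates in Theorem 2 concrete, the sketch below evaluates the prescribed *λ*2 and the deterministic factor of the error bound in (12) for given *n* and *ψ*. It rests on two assumptions that should be kept in mind: the norm factor in (12) is read as that of the true pair (*h*0, Γ0), and the *O*P(1) constants are ignored. The function name `theorem2_tuning` and the argument `complexity` (standing for the squared RKHS norm of *h*0 plus the SGL norm of Γ0) are hypothetical.

```python
def theorem2_tuning(n, psi, complexity):
    """Return (lambda_2, rate) implied by Theorem 2, up to O_P(1) constants.

    complexity stands for ||h_0||^2 + ||Gamma_0||_SGL and must be positive.
    """
    if not 0.0 < psi < 1.0:
        raise ValueError("psi must lie in (0, 1)")
    if n < 1 or complexity <= 0.0:
        raise ValueError("n must be >= 1 and complexity must be positive")
    lam2 = 1.0 / (n ** (1.0 / (1.0 + psi)) * complexity ** ((1.0 - psi) / (1.0 + psi)))
    rate = n ** (-1.0 / (2.0 + 2.0 * psi)) * complexity ** (psi / (1.0 + psi))
    return lam2, rate
```

Under this reading, the squared rate equals `lam2 * complexity`, so a faster-vanishing *λ*2 goes hand in hand with a faster error rate.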

**Corollary 1.** *If the RKHS* $\mathcal{H}_{\mathcal{K}}$ *contains differentiable functions whose gradient norm* $\|\nabla h(\mathbf{z})\|_{\mathcal{H}_{\mathcal{K}}}$ *is uniformly bounded for all functions* $h \in \mathcal{H}_{\mathcal{K}}$ *and* $\mathbf{z} \in \mathbb{R}^s$*, then Theorem 2 holds when Assumption 2 is replaced by* $H(\delta, \mathcal{H}_{\mathcal{K}}, P_n) \le C_1\delta^{-2\psi}$ *for all* $\delta > 0$*.*

The proofs of Theorem 2 and Corollary 1 are given in Appendices A.4 and A.5, respectively. Often, when we are only interested in a subset of functions in the RKHS (e.g., functions with norm less than one), we can substitute the full space $\mathcal{H}_{\mathcal{K}}$ in Corollary 1 with the subset of interest. See [15] and [32], both of which considered an RKHS (namely, a Sobolev space) with functions of norm less than or equal to one.

### **5. Simulation Experiments**

We performed extensive simulations to investigate the performance of our proposed procedure, including the performance of SGL variable selection and its overall accuracy. Owing to space limitations, we include results from two simulation experiments in this section; more results can be found in the first author's Ph.D. dissertation [30].

### *5.1. Setup*

In the evaluation of the performance accuracy, following [15], we used both quasi-*R*<sup>2</sup> and adjusted quasi-*R*<sup>2</sup> defined as follows:

$$R_Q^2 := 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}, \text{ and } R_{AQ}^2 := 1 - \left(1 - R_Q^2\right) \left(\frac{n-1}{n - (k+1)}\right).$$

The latter is known to be appealing for comparing models with different degrees of estimation sparsity. In addition to model accuracy, we summarize the performance in variable selection by the sensitivity and specificity of both functional selection and feature selection in these simulation experiments. Our algorithm uses existing R packages, including emmreml, kspm, and oem.
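The two accuracy measures can be computed directly from observed and predicted outcomes. The sketch below is illustrative only: the function names are hypothetical, and *k* is treated as the number of selected predictors, which is an assumption because the text does not define it.

```python
def quasi_r2(y, y_hat):
    """Quasi-R^2: one minus the ratio of residual to total sum of squares."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - sse / sst

def adjusted_quasi_r2(y, y_hat, k):
    """Adjusted quasi-R^2 with k predictors (k is assumed to be the selected model size)."""
    n = len(y)
    if n - (k + 1) <= 0:
        raise ValueError("need n > k + 1")
    return 1.0 - (1.0 - quasi_r2(y, y_hat)) * (n - 1) / (n - (k + 1))
```

Because the adjustment factor grows with *k*, two fits with the same quasi-*R*<sup>2</sup> are ranked in favor of the sparser one.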

Specifically, we designed the following two simulation settings.

Scenario 1: A single functional predictor with sparsity in the FPC features.

Scenario 2: Multiple functional predictors with sparsity in the functional predictors and with sparsity in the FPC features of important functional predictors.

Each of these two scenarios was handled using a penalty function suited to the designed sparsity; for example, in Scenario 2 we used a two-level variable selection penalty (e.g., SGL) to deal with the two types of sparsity in the true model. In all analyses, we used the Gaussian kernel $\mathcal{K}(u, v) = \exp\left(-\frac{1}{p}\|u - v\|^2\right)$ in our estimation, where *p* was set as the number of features, which is equivalent to dividing the $\gamma$ vector by $\sqrt{p}$. This bandwidth parameter may be either estimated or set to the number of features to overcome the identifiability issue, following [33], where theoretical justification was given for using the number of features as the bandwidth parameter of the Gaussian kernel.
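A minimal sketch of this kernel evaluated on scaled features *γ* ◦ **z** follows; the function name `scaled_gaussian_kernel` is hypothetical. It shows why a zero scaling factor removes the corresponding feature from the fit: that coordinate no longer contributes to the distance.

```python
import math

def scaled_gaussian_kernel(u, v, gamma):
    """Gaussian kernel K(gamma∘u, gamma∘v) with the bandwidth fixed at p = len(gamma)."""
    if not (len(u) == len(v) == len(gamma)):
        raise ValueError("u, v and gamma must have the same length")
    p = len(gamma)
    sq_dist = sum((g * (a - b)) ** 2 for g, a, b in zip(gamma, u, v))
    return math.exp(-sq_dist / p)
```

For instance, with `gamma = [1.0, 0.0]` two points that differ only in their second coordinate have kernel value one, exactly as if that feature had been dropped.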

Following [23], because the estimated *s*-dimensional function $h(\cdot)$ of $\mathbf{z}$ is difficult to display graphically, we summarized the goodness-of-fit by regressing the true $h$ on the estimated $\hat{h}$, both evaluated at the design points. From this concordance regression analysis, we measure the goodness-of-fit of $\hat{h}$ through the average intercept, slope, and *R*-squared (also known as the coefficient of determination) over the replications. Clearly, a high-quality fit is reflected by (i) the intercept being close to zero, (ii) the slope being close to one, and (iii) the *R*-squared being close to one. Moreover, we graphically display the estimated function $\hat{h}$ by setting all variables equal to 0.5 except the one of interest, which varies over a grid of 100 equally spaced points on the interval [0, 1]. Such visualization of the functional estimation at each margin further facilitates the evaluation of the proposed algorithm in addition to the results obtained from the concordance regression analyses.
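The concordance regression amounts to a simple least-squares fit of the true function values on the estimated ones. A minimal sketch for a single replication follows (the function name `concordance_regression` is hypothetical; averaging over replications is left to the caller).

```python
def concordance_regression(h_true, h_hat):
    """Regress true h on estimated h; return (intercept, slope, r_squared)."""
    n = len(h_true)
    mx = sum(h_hat) / n
    my = sum(h_true) / n
    sxx = sum((x - mx) ** 2 for x in h_hat)
    sxy = sum((x - mx) * (y - my) for x, y in zip(h_hat, h_true))
    syy = sum((y - my) ** 2 for y in h_true)
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(h_hat, h_true))
    return intercept, slope, 1.0 - sse / syy
```

A perfect estimate returns an intercept of zero, a slope of one, and an *R*-squared of one; a slope well below one would point to an estimate whose range is inflated relative to the truth.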

In all scenarios, we generated 1000 IID functional paths, of which 750 paths were assigned to the training set and 250 paths were assigned to the test set for an external performance evaluation. It is the test set that we used to display the performance accuracy. We used a one-dimensional covariate *xi* to show the flexibility of our model in a semiparametric setting, with independent copies of *xi* ∼ *N*(0, 1). We chose the true coefficients in the kernel machine model similar to those given in [23].

### *5.2. Simulation in Scenario 1*

In this simple scenario with a single functional predictor, we simulated data from a model with sparsity in its FPC features. To do so, we generated a single functional predictor from the first 15 Fourier basis functions, which serve as the eigenbasis, over the interval [0, 1]: $Z(t) = \sum_{j=1}^{15} \sqrt{\varsigma_j}\,\xi_j\phi_j(t)$. That is, the functional predictor was created as a linear combination of the 15 basis functions, where $\phi_j(\cdot)$ is the *j*th Fourier basis function, $\varsigma_j$ is the *j*th eigenvalue of *Z*, and $\xi_j$ is the *j*th FPC feature, simulated from a normal distribution as detailed below.

There were 100 sampled points that were first equally spaced in the interval [0, 1] and then perturbed by small deviations drawn from $\nu \sim N(0, 0.001)$. We set $\varsigma_j = 45 \times 0.64^j$ and $\xi_j \sim N(0, 1)$ independently over $j = 1, \ldots, 15$. As was done in [17], instead of directly using $\xi_j$, we used $\zeta_j = \Phi(\xi_j)$, where $\Phi$ is the CDF of the standard normal distribution. This resulted in $\mathbf{z} = (\zeta_1, \ldots, \zeta_{15})$. We chose the second, $\zeta_2$, and ninth, $\zeta_9$, features as important features in the following true nonlinear non-additive model:

$$y_i = 2x_i + 20\cos(2\pi\zeta_{i2}) - 10\sin(2\pi\zeta_{i9}) + \zeta_{i2}\zeta_{i9} + \varepsilon_i,$$

with $\varepsilon_i \overset{iid}{\sim} N(0, 1)$. FPCA was performed by the R package PACE [34], producing the estimated FPC scores, $\hat{\xi}_j$, as well as the estimated eigenvalues, $\hat{\varsigma}_j$, which in turn enabled us to compute $\hat{\zeta}_j$, $j = 1, \ldots, 15$.
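The outcome model of Scenario 1 can be simulated in a few lines. The sketch below is a simplification and not the authors' simulation code: it works directly with the true transformed features (the oracle setting), and therefore skips the generation of noisy curves and the FPCA step. The function names and the seed handling are assumptions for illustration.

```python
import math
import random

def std_normal_cdf(x):
    """CDF of the standard normal distribution, written via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def simulate_scenario1(n, seed=0):
    """Draw n triples (y, x, zeta) from the Scenario 1 model using the true features."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                      # scalar covariate
        xi = [rng.gauss(0.0, 1.0) for _ in range(15)]  # FPC features
        zeta = [std_normal_cdf(v) for v in xi]       # Phi-transformed features
        z2, z9 = zeta[1], zeta[8]                    # second and ninth features
        y = (2.0 * x + 20.0 * math.cos(2.0 * math.pi * z2)
             - 10.0 * math.sin(2.0 * math.pi * z9) + z2 * z9
             + rng.gauss(0.0, 1.0))
        data.append((y, x, zeta))
    return data
```

Only two of the fifteen features enter the outcome, so a selection method applied to such data should ideally keep `zeta[1]` and `zeta[8]` and discard the other thirteen.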

We applied both LASSO and MCP penalty functions in our implementation, termed $FKMR_{Lasso}$ and $FKMR_{MCP}$, respectively. We compared the results of our method with the standard linear approach with both LASSO and MCP under the assumption of linear functional relationships, as well as with the COSSO method for functional additive regression [15] using the R package COSSO [15,34]. Since the COSSO package is built for nonparametric regression (and not partial linear models), we adopted the backfitting strategy and regressed the residuals with our estimated effect of $x_i$ removed.

In addition, we compared our method with an oracle FKMR estimator, called $FKMR_{oracle}$, that assumed full knowledge of the two true nonzero signals, $\zeta_2$ and $\zeta_9$. We also considered two oracle versions of our proposed algorithm, $FKMR^{oracle}_{Lasso}$ and $FKMR^{oracle}_{MCP}$, both of which used the true $\zeta_j$ (rather than the FPCA estimates) in order to evaluate the performance of the FPCA procedure. Note that once we used FPCA to obtain the $\hat{\zeta}_j$ features, our algorithm essentially works in a standard regression setting with sparsity of covariates. This evaluation is therefore also important because our proposed procedure can in principle be used in simpler cases with scalar covariates that do not involve functional covariates.

In Scenario 1, due to the highly nonlinear relationships between the FPC features and the outcome, the naive linear model, as expected, performed poorly in terms of both model selection and model consistency. The detailed simulation results for Scenario 1 can be found in the first author's Ph.D. dissertation [30]. In brief, our proposed method worked well in all aspects. In this setting, COSSO also worked well in terms of model fit, but it tended to select noisy features more frequently than our proposed method, leading to more false positives.

### *5.3. Simulation in Scenario 2*

Now, we generated four functional predictors of the form $Z^{\ell}(t) = \sum_{j=1}^{9} \sqrt{\varsigma_j^{\ell}}\,\xi_j^{\ell}\phi_j^{\ell}(t)$, $\ell = 1, \ldots, 4$, where $\phi_j^{\ell}$, $\varsigma_j^{\ell}$, and $\xi_j^{\ell}$ were set in the same way as in Scenario 1. It follows that $\mathbf{z} = (\zeta_1^1, \ldots, \zeta_9^1, \ldots, \zeta_1^4, \ldots, \zeta_9^4)$, where $\zeta_j^{\ell}$ is the *j*th $\Phi$-transformed feature for the $\ell$th functional covariate. Sparsity was specified as follows: the first and second functional covariates, $Z^1$ and $Z^2$, were chosen as important signals, in which the transformed FPC features $\{\zeta_1^1, \zeta_3^1, \zeta_4^1, \zeta_2^2, \zeta_7^2\}$ are the five important features (three from $Z^1$ and two from $Z^2$) that are related to the outcome:

$$\begin{split} y_i &= 2x_i + \zeta_{i1}^1 + \zeta_{i3}^1 + \zeta_{i4}^1 + \zeta_{i2}^2 + \zeta_{i7}^2 + 10\cos(2\pi\zeta_{i1}^1) - 10\left(\zeta_{i2}^2\right)^2 + 10\left(\zeta_{i7}^2\right)^2 - 10\left(\zeta_{i3}^1\right)^2 \\ &+ 10\exp(-\zeta_{i3}^1)\zeta_{i4}^1 - 8\sin(2\pi\zeta_{i7}^2)\cos(2\pi\zeta_{i3}^1) + 20\zeta_{i1}^1\zeta_{i7}^2 + \varepsilon_i, \quad i = 1, \ldots, n, \end{split}$$

where $\varepsilon_i \overset{iid}{\sim} N(0, 1)$. This model specifies both group sparsity (two of the four functional predictors) and within-group sparsity (three of the nine FPC features in $Z^1$ and two of the nine FPC features in $Z^2$). In addition, we specified non-additive relationships in the true model across multiple functional covariates.

We fit the data using the proposed methods, including $FKMR^{oracle}_{GMCP}$, $FKMR_{Lasso}$, $FKMR_{GLasso}$, $FKMR_{SGL}$, $FKMR_{MCP}$, and $FKMR_{GMCP}$, and the results based on 100 replicates are summarized in Table 1. For comparison, we also fit the simulated data by existing methods, including the linear model (denoted by LM + penalty), COSSO functional additive regression, and the oracle method using the knowledge of the true important features in the analysis, as done in the simulation of Scenario 1. Regarding the goodness-of-fit, Table 1 shows that all of our FKMR estimators outperformed the standard linear estimators in terms of $R^2_{AQ}$ for every penalty function, and they outperformed COSSO for penalties that accounted for group sparsity. In the concordance regression analysis, all intercepts were close to zero, all slopes close to one, and all $R^2$ close to one, indicating a high goodness-of-fit for functional estimation. COSSO tended to perform on par with FKMR for penalties that did not account for group sparsity (LASSO and MCP). Methods using a group sparsity penalty function (SGL, GLasso, and GMCP) clearly outperformed the methods that did not regularize the grouping of covariates (Lasso and MCP). In addition, our FKMR estimators (except $FKMR_{Lasso}$) performed as well as the oracle estimator $FKMR^{oracle}_{GMCP}$ both in terms of $R^2_{AQ}$ and in terms of our estimate of the function *h*. The results also indicated that there was little difference between using a concave (MCP or GMCP) penalty function and using a convex (GLasso or SGL) penalty function.

As regards the group sparsity, Table 2 indicates that all methods had a high sensitivity in detecting functional signals, while the proposed FKMR methods had better specificity than both the sparse linear models and COSSO. Concerning the within-group sparsity, a bigger difference was seen depending on the type of penalty function used in feature selection. As shown in Tables 3 and 4, using a general penalty (e.g., Lasso or MCP) that does not take the grouping structure into account tended to under-select important features within a group. COSSO tended to perform well in terms of within-group sparsity. Moreover, Figure 2 shows that the FKMR method estimated the five marginal signal functions (for the important features of $Z^1$ and $Z^2$) well.
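Sensitivity and specificity of a selection result can be computed from the set of selected items, the set of truly important items, and the full candidate set. The sketch below is generic and applies equally to functional predictors and to FPC features; the function name and the set-based interface are assumptions for illustration.

```python
def sensitivity_specificity(selected, truth, universe):
    """Return (sensitivity, specificity) of a selected set against the true set.

    universe lists every candidate (functional predictors or FPC features).
    """
    selected, truth, universe = set(selected), set(truth), set(universe)
    noise = universe - truth
    if not truth or not noise:
        raise ValueError("need at least one true signal and one noise item")
    true_pos = len(selected & truth)
    true_neg = len(noise - selected)
    return true_pos / len(truth), true_neg / len(noise)
```

In the Scenario 2 design with four functional predictors of which the first two are signals, selecting predictors 1, 2 and 3 gives full sensitivity but a specificity of one half; averaging such values over replications yields summaries of the kind reported in the tables.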


**Table 1.** Goodness-of-fit and the concordance regression for Scenario 2.

**Table 2.** Sensitivity and specificity of functional selection for Scenario 2.


**Table 3.** FPC feature selection for signal functional *Z*<sup>1</sup> in Scenario 2.



**Table 4.** FPC feature selection for signal functional *Z*<sup>2</sup> in Scenario 2.

**Figure 2.** Five marginal estimates of important feature functions with 95% shaded confidence bands evaluated at 100 grid points while holding all other components equal to 0.5 in Scenario 2.

### **6. Data Example**

To show the usefulness of our proposed methodology, we analyzed data from 550 children recruited by the ELEMENT study [35] who consented to wear an actigraph (ActiGraph GT3X+; ActiGraph LLC, Pensacola, FL, USA). This wearable was to be placed on the non-dominant wrist for five to seven days with no interruption. The actigraph measured tri-axis accelerometer data sampled at 30 Hz, capturing a person's movement in three different directions. The BMI was the outcome of interest as it is a biomarker of obesity. Sex and age were confounding factors used in the analysis. Due to some missing data, our analysis only included children who wore the device properly for 85% or more of the study period, which resulted in 395 participants, consisting of 189 males and 206 females. Other studies such as [36] have excluded days of accelerometer data with more than five percent missing. The mean ± SD BMI of the study cohort was 21.5 ± 4.1, and the mean ± SD age of the study participants was 14.3 ± 2.1 years. A more detailed description of the dataset used for this paper can be found in [37]. Our primary interest was to see if the BMI is associated with physical activity in the presence of other covariates, specifically sex and age. We preprocessed the activity counts over the 7 d of wear by taking, for each 1 min epoch, the median over the entire 7 d of wear. For example, since all the participants started wearing the device at 3 p.m., the first data point for each individual was the median of 7 ACs (one for each day) for the 1 min epoch of 3:00–3:01 p.m. This procedure of taking medians across the same minute on different days has been considered in other applications such as [36]. See Figure 3 for an example of the resulting time series of medians derived from the AC data displayed in Figure 1.
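The preprocessing step can be sketched as follows, assuming each subject's activity counts are stored as one list per day, aligned so that position *m* in every list refers to the same 1 min epoch. The function name `minute_medians` and this storage layout are assumptions for the example.

```python
from statistics import median

def minute_medians(days):
    """Collapse per-day activity counts into one median per 1 min epoch.

    days is a list of equally long lists, one per day of wear.
    """
    if not days:
        raise ValueError("at least one day of data is required")
    n_epochs = len(days[0])
    if any(len(day) != n_epochs for day in days):
        raise ValueError("all days must cover the same epochs")
    return [median(day[m] for day in days) for m in range(n_epochs)]
```

With seven days of 1440 one-minute epochs each, the result is a single 24 h profile of 1440 medians per subject, which is less sensitive to an unusual day than the mean would be.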

**Figure 3.** The 24 h minute-by-minute medians of 7 d ACs for one subject.

We applied the following five models, labeled as M0–M4 for convenience, to analyze the data with the 24 h median ACs as functional predictors. Let $\xi_{ij}^{k}$ be the *i*th person's *k*th FPC score for functional predictor *j*.


The BMI and age were mean centered and scaled to have a standard deviation of one, so $\beta_0$ was absent from the models. Here are some key findings from the data analyses. First, in terms of the goodness-of-fit, Table 5 suggests that M3, i.e., our proposed FKMR model with the SGL penalty, gave the best performance: the adjusted $R^2$ of M3 was nearly twice as large as that of each of the other four models. Second, it is interesting to note that neither COSSO nor $FKMR_{SGL}$ selected the FPC scores associated with the Z-axis. Third, as shown in Table 6, all of the FPC components chosen by COSSO were also chosen by $FKMR_{SGL}$. It is worth noting that the linear model with the SGL penalty selected the highest number of FPC components, yet performed the worst in terms of model fit.


**Table 5.** Goodness-of-fit for the five models used in the data analysis.

**Table 6.** Axis-specific FPC feature selection.


### **7. Conclusions**

In this paper, we proposed a method to model the nonlinear relationship between multiple functional predictors and a scalar outcome in the presence of other scalar confounders. We used FPCA to decompose the functional predictors for feature extraction and used the LSKM framework to model the functional relationship between the outcome and principal components. We developed a simultaneous procedure to select important functional predictors and important features within selected functionals. We proposed a computationally efficient algorithm to implement our regularization method, which was easily programmed in R with the help of multiple existing R packages. It should be noted that although we focused on functional regression in this paper, the proposed method can also be applied to non-functional predictors. In effect, by using functional principal components, we essentially bypassed the infinite-dimensional problem and worked effectively in a non-functional framework with the FPC features. Through simulations and an analysis of data from the ELEMENT study, we demonstrated how the FKMR estimator outperformed existing methods in terms of both variable selection and model fit. It should be noted that the existing COSSO method did perform well in terms of variable selection, as shown in Section 5.

A technical issue pertains to identifiability limitations with regard to the bandwidth parameter and to the RKHS estimator. To overcome this, we suggested fixing the bandwidth parameter; see the detailed discussion in Section 3. We established key theoretical guarantees for our proposed estimator. In the case where there are multiple proposed estimators (and thus the identifiability issues arise), the established theoretical properties in Section 4 apply to any of those estimators.

Variable selection for functional predictors presents many technical challenges, and many methodological problems remain unsolved. This paper demonstrated a possible framework to regularize estimation with a bi-level sparsity of functional group sparsity and within-group sparsity. In the LSKM paper [23], it was briefly mentioned that if the relationship between the scalar outcome and *p* genetic pathways is additive, we can modify the model as $y_i = x_i^{\top}\beta + h_1(z_i^1) + \cdots + h_p(z_i^p) + \epsilon_i$, where each $h_j$ belongs to its own RKHS. It is easy to extend our method and algorithms to handle this case. For future research, an extension to longitudinal outcomes may be considered via a mixed-effects model $y_{ij} = x_i^{\top}\beta + h(z_{ij}) + u_{ij}^{\top}v_i + \epsilon_{ij}$, where $u_{ij}^{\top}v_i$ are the random effects. Other useful extensions of the proposed paradigm would be along the lines of generalized linear models and Cox regression models.

**Author Contributions:** Conceptualization, P.X.S. and J.N.; Formal analysis, J.N.; Methodology, J.N. and P.X.S.; Supervision, P.X.S.; Writing—original draft, J.N.; Writing—review & editing, P.X.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported in part by NSF DMS#2113564.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data used on physical activity counts, BMI and demographic variables (sex and age) are available upon request through a formal data request procedure outlined by the ELEMENT Cohort Study. Contact the corresponding author of this paper for details.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **Appendix A. Technical Assumptions and Proofs**

*Appendix A.1. Proof of Lemma 1*

It suffices to show that for any $J_1(h, \beta, \gamma)$ in (5) we can always find $\alpha \in \mathbb{R}^n$ such that $J_1(\tilde{h}, \beta, \gamma) \le J_1(h, \beta, \gamma)$ with $\tilde{h} = \sum_{i=1}^{n}\alpha_i\mathcal{K}(\cdot, \gamma \circ \mathbf{z}_i)$, where $\tilde{h}$ is the projection of $h$ onto the linear span $\mathrm{span}\{\mathcal{K}(\cdot, \gamma \circ \mathbf{z}_1), \ldots, \mathcal{K}(\cdot, \gamma \circ \mathbf{z}_n)\}$. For any $h$ we can write $h = h^{\perp} + \tilde{h}$, where $h^{\perp} \in \mathrm{span}\{\mathcal{K}(\cdot, \gamma \circ \mathbf{z}_1), \ldots, \mathcal{K}(\cdot, \gamma \circ \mathbf{z}_n)\}^{\perp}$. Since $\mathcal{H}_{\mathcal{K}}$ is a reproducing kernel Hilbert space, we can rewrite (5) as follows:

$$J_1(h, \beta, \gamma) = \frac{1}{2n} \sum_{i=1}^n \left\{y_i - \mathbf{x}_i^\top \beta - \langle h, \mathcal{K}(\cdot, \gamma \circ \mathbf{z}_i) \rangle \right\}^2 + \frac{1}{2} \lambda_1 \|h\|_{\mathcal{H}_{\mathcal{K}}}^2 + \lambda_2 \rho(\gamma; \delta).$$

Since $\langle h^{\perp}, \mathcal{K}(\cdot, \gamma \circ \mathbf{z}_i) \rangle = 0$ for every *i*, we obtain

$$\begin{split} J_1(h, \beta, \gamma) &= \frac{1}{2n} \sum_{i=1}^{n} \left\{ y_i - \mathbf{x}_i^{\top} \beta - \sum_{k=1}^{n} \alpha_k \mathcal{K}(\gamma \circ \mathbf{z}_i, \gamma \circ \mathbf{z}_k) \right\}^2 + \frac{1}{2} \lambda_1 \left\| h^{\perp} + \tilde{h} \right\|_{\mathcal{H}_{\mathcal{K}}}^2 + \lambda_2 \rho(\gamma; \delta) \\ &\geq \frac{1}{2n} \sum_{i=1}^{n} \left\{ y_i - \mathbf{x}_i^{\top} \beta - \sum_{k=1}^{n} \alpha_k \mathcal{K}(\gamma \circ \mathbf{z}_i, \gamma \circ \mathbf{z}_k) \right\}^2 + \frac{1}{2} \lambda_1 \left\| \tilde{h} \right\|_{\mathcal{H}_{\mathcal{K}}}^2 + \lambda_2 \rho(\gamma; \delta) \\ &= J_1(\tilde{h}, \beta, \gamma). \end{split}$$

### *Appendix A.2. Proof of Lemma 2*

The equivalence of the two forms becomes clear once we rewrite (6) in matrix notation. Equation (6) can be written as follows:

$$\min\_{\boldsymbol{\alpha}, \boldsymbol{\beta}, \gamma} J\_2(\boldsymbol{\alpha}, \boldsymbol{\beta}, \gamma) = \min\_{\boldsymbol{\alpha}, \boldsymbol{\beta}, \gamma} \frac{1}{2n} \left\| \mathbf{Y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{K}(\gamma; Z)\boldsymbol{\alpha} \right\|\_2^2 + \frac{1}{2} \lambda\_1 \boldsymbol{\alpha}^\top \mathbf{K}(\gamma; Z)\boldsymbol{\alpha} + \lambda\_2 \rho(\gamma; \delta). \tag{A1}$$

For fixed $\boldsymbol{\alpha}$, $\boldsymbol{\beta}$ and $\lambda\_1$, minimizing the function in (A1) with respect to $\gamma$ is equivalent to

$$\min\_{\gamma} \left\{ \frac{1}{2n} \left\| \left( \mathbf{Y} - \mathbf{X}\boldsymbol{\beta} - \frac{n}{2} \lambda\_1 \boldsymbol{\alpha} \right) - \mathbf{K}(\gamma; Z)\boldsymbol{\alpha} \right\|\_2^2 + \lambda\_2 \rho(\gamma; \delta) \right\}. \tag{A2}$$
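
Expanding the square in (A2) shows that it differs from (A1) only by terms that do not involve $\gamma$. The following sketch checks this numerically. The Gaussian Gram matrix, the random data and the function names are assumptions made for illustration; the common term $\lambda\_2 \rho(\gamma; \delta)$ is omitted from both objectives because it cancels in the difference.

```python
import math
import random

def gaussian_gram(gamma, Z):
    """Gram matrix with entries exp(-||gamma∘z_i - gamma∘z_k||^2) (assumed kernel)."""
    scaled = [[g * v for g, v in zip(gamma, z)] for z in Z]
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(u, v))) for v in scaled]
            for u in scaled]

def matvec(K, a):
    return [sum(k * x for k, x in zip(row, a)) for row in K]

def objective_a1(K, resid, alpha, lam1):
    """(A1) without the gamma penalty; resid stands for Y - X beta."""
    n = len(resid)
    Ka = matvec(K, alpha)
    loss = sum((r - k) ** 2 for r, k in zip(resid, Ka)) / (2 * n)
    return loss + 0.5 * lam1 * sum(a * k for a, k in zip(alpha, Ka))

def objective_a2(K, resid, alpha, lam1):
    """(A2) without the gamma penalty."""
    n = len(resid)
    Ka = matvec(K, alpha)
    shifted = [r - 0.5 * n * lam1 * a for r, a in zip(resid, alpha)]
    return sum((s - k) ** 2 for s, k in zip(shifted, Ka)) / (2 * n)

rng = random.Random(0)
n, s, lam1 = 5, 3, 0.7
Z = [[rng.gauss(0, 1) for _ in range(s)] for _ in range(n)]
resid = [rng.gauss(0, 1) for _ in range(n)]
alpha = [rng.gauss(0, 1) for _ in range(n)]

# Difference (A2) - (A1) for several gamma values.
gaps = []
for _ in range(4):
    gamma = [rng.uniform(0, 1) for _ in range(s)]
    K = gaussian_gram(gamma, Z)
    gaps.append(objective_a2(K, resid, alpha, lam1) - objective_a1(K, resid, alpha, lam1))

# Gamma-free constant predicted by expanding the square in (A2).
expected_gap = (n * lam1 ** 2 * sum(a * a for a in alpha) / 8
                - 0.5 * lam1 * sum(a * r for a, r in zip(alpha, resid)))
print(gaps, expected_gap)
```

Since the gap is the same for every $\gamma$, the two problems share the same minimizer in $\gamma$, which is the content of Lemma 2.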

### *Appendix A.3. Proof of Theorem 1*

Without loss of generality, we use the penalty function for the sparse group lasso, but this proof can easily be modified for other penalty functions. Also, we fix $\lambda\_1 = \lambda\_2 = \delta = 1$, consider $\beta \in \mathbb{R}$, and scale the design matrix $\mathbf{X}$ (a vector in this case) to have norm 1. The case of $\boldsymbol{\beta} \in \mathbb{R}^q$ follows along similar lines. Let $\gamma \in D\_3$ with $D\_3 = \{\gamma : \|\gamma\|\_1 \le \frac{1}{2n} \|\mathbf{Y}\|\_2^2\}$. Define $f(\gamma) = \|\mathbf{K}(\gamma; Z)\| = \eta\_{max}(\mathbf{K}(\gamma; Z)) \ge 0$, where $\eta\_{max}(\mathbf{K}(\gamma; Z))$ denotes the largest eigenvalue of $\mathbf{K}(\gamma; Z)$ and $\|\mathbf{K}(\gamma; Z)\|$ is the operator norm, defined in the usual way as $\|\mathbf{K}(\gamma; Z)\| = \sup\{\|\mathbf{K}(\gamma; Z)\mathbf{x}\|\_2 : \|\mathbf{x}\|\_2 = 1\}$. Since $D\_3$ is compact and $\mathbf{K}(\gamma; Z)$ is continuous with respect to $\gamma$, $f$ achieves its maximum over $D\_3$. Thus, we define $\eta^\star = \sup\_{\gamma \in D\_3} f(\gamma) \ge 0$. Define $D\_2 = \{\beta : |\beta| \le (1 + \eta^\star)\|\mathbf{Y}\|\_2\}$, where the upper bound is denoted by $b^\star = (1 + \eta^\star)\|\mathbf{Y}\|\_2 \ge 0$. Moreover, define $D\_1 = \{\boldsymbol{\alpha} : \|\boldsymbol{\alpha}\|\_2 \le \sqrt{n}(\|\mathbf{Y}\|\_2 + b^\star)\}$.

Since $D\_1$, $D\_2$ and $D\_3$ are compact, there exists a $(\boldsymbol{\alpha}^\star, \beta^\star, \gamma^\star)$ such that $J\_2(\boldsymbol{\alpha}^\star, \beta^\star, \gamma^\star) \le J\_2(\boldsymbol{\alpha}, \beta, \gamma)$ for all $(\boldsymbol{\alpha}, \beta, \gamma) \in D\_1 \times D\_2 \times D\_3$. Note that $J\_2(\mathbf{0}, 0, \mathbf{0}) = \frac{1}{2n}\|\mathbf{Y}\|\_2^2$ and $(\mathbf{0}, 0, \mathbf{0}) \in D\_1 \times D\_2 \times D\_3$. We claim that $(\boldsymbol{\alpha}^\star, \beta^\star, \gamma^\star)$ is a global minimizer, which is proved below by contradiction.

Suppose that there exists $(\tilde{\boldsymbol{\alpha}}, \tilde{\beta}, \tilde{\gamma}) \notin D\_1 \times D\_2 \times D\_3$ where $J\_2(\tilde{\boldsymbol{\alpha}}, \tilde{\beta}, \tilde{\gamma}) < J\_2(\boldsymbol{\alpha}^\star, \beta^\star, \gamma^\star)$. We must have that $\tilde{\gamma} \in D\_3$; if not, we have $J\_2(\tilde{\boldsymbol{\alpha}}, \tilde{\beta}, \tilde{\gamma}) \ge \|\tilde{\gamma}\|\_1 \ge J\_2(\mathbf{0}, 0, \mathbf{0}) \ge J\_2(\boldsymbol{\alpha}^\star, \beta^\star, \gamma^\star)$. Let $q\_1, \cdots, q\_n$ be the orthonormal eigenvectors of $\mathbf{K}(\tilde{\gamma}; Z)$ with associated eigenvalues $\eta\_1 \ge \cdots \ge \eta\_n \ge 0$. We can write $\tilde{\boldsymbol{\alpha}}$, $\mathbf{X}$ and $\mathbf{Y}$ in terms of this basis as $\tilde{\boldsymbol{\alpha}} = \sum\_{i=1}^n \langle \tilde{\boldsymbol{\alpha}}, q\_i \rangle q\_i$, $\mathbf{Y} = \sum\_{i=1}^n \langle \mathbf{Y}, q\_i \rangle q\_i$ and $\mathbf{X} = \sum\_{i=1}^n \langle \mathbf{X}, q\_i \rangle q\_i$. Let $C\_i^{\tilde{\alpha}} = \langle \tilde{\boldsymbol{\alpha}}, q\_i \rangle$, $C\_i^{\mathbf{Y}} = \langle \mathbf{Y}, q\_i \rangle$ and $C\_i^{\mathbf{X}} = \langle \mathbf{X}, q\_i \rangle$. It follows that

$$J\_2(\tilde{\boldsymbol{\alpha}}, \tilde{\beta}, \tilde{\gamma}) \ge \frac{1}{2n} \left\| \sum\_{i=1}^n C\_i^{\mathbf{Y}} q\_i - \sum\_{i=1}^n C\_i^{\mathbf{X}} \tilde{\beta} q\_i - \sum\_{i=1}^n C\_i^{\tilde{\alpha}} \eta\_i q\_i \right\|\_2^2 + \frac{1}{2} \sum\_{i=1}^n (C\_i^{\tilde{\alpha}})^2 \eta\_i,$$

which is equal to $\frac{1}{2n} \sum\_{i=1}^n (C\_i^{\mathbf{Y}} - C\_i^{\mathbf{X}} \tilde{\beta} - C\_i^{\tilde{\alpha}} \eta\_i)^2 + \frac{1}{2} \sum\_{i=1}^n (C\_i^{\tilde{\alpha}})^2 \eta\_i$. We can minimize the above objective function with respect to $C\_i^{\tilde{\alpha}}$ and $\tilde{\beta}$. First, note that for any $\eta\_i = 0$ we can let $C\_i^{\tilde{\alpha}} = 0$, as this does not affect the expression above. It is therefore sufficient to consider $\eta\_i > 0$. Taking the first derivatives and setting them equal to zero, we obtain the score equations that the minimizing $\tilde{\beta}$ and $C\_i^{\tilde{\alpha}}$ must satisfy:

$$\beta = \sum\_{i=1}^{n} C\_i^{\mathbf{X}} (C\_i^{\mathbf{Y}} - C\_i^{\tilde{\alpha}} \eta\_i), \tag{A3}$$

$$C\_i^{\tilde{\alpha}} = \frac{1}{n + \eta\_i} (C\_i^{\mathbf{Y}} - C\_i^{\mathbf{X}} \beta). \tag{A4}$$

In the above derivation we used the fact that $1 = \|\mathbf{X}\|\_2^2 = \sum\_{i=1}^n (C\_i^{\mathbf{X}})^2$. Plugging (A4) into (A3), we obtain

$$\beta = \frac{\sum\_{i=1}^{n} C\_i^{\mathbf{X}} C\_i^{\mathbf{Y}} \left(1 - \frac{\eta\_i}{n + \eta\_i}\right)}{1 - \sum\_{i=1}^{n} (C\_i^{\mathbf{X}})^2 \frac{\eta\_i}{n + \eta\_i}}. \tag{A5}$$

It follows that

$$|\beta| \le \frac{\sum\_{i=1}^{n} |C\_i^{\mathbf{X}} C\_i^{\mathbf{Y}}|}{1 - \sum\_{i=1}^{n} (C\_i^{\mathbf{X}})^2 \frac{\eta^\star}{n + \eta^\star}} \le \frac{\|\mathbf{X}\|\_2 \|\mathbf{Y}\|\_2}{\|\mathbf{X}\|\_2^2 \left(1 - \frac{\eta^\star}{n + \eta^\star}\right)} \le \frac{\|\mathbf{Y}\|\_2}{1 - \frac{\eta^\star}{1 + \eta^\star}} = b^\star.$$
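
As a sanity check on (A3)–(A5) and the bound $b^\star$, the sketch below solves the score equations in the eigenbasis for a small hypothetical example. The eigenvalues and coordinates are made-up numbers, and $\eta^\star$ is taken to be the largest eigenvalue of this single Gram matrix rather than a supremum over $D\_3$.

```python
import math

n = 4
eta = [2.0, 1.0, 0.5, 0.0]              # hypothetical eigenvalues of K(gamma; Z)
cx_raw = [1.0, -2.0, 0.5, 1.5]
norm_x = math.sqrt(sum(v * v for v in cx_raw))
cx = [v / norm_x for v in cx_raw]       # coordinates of X, scaled so ||X||_2 = 1
cy = [0.3, -1.2, 2.0, 0.7]              # hypothetical coordinates of Y

shrink = [e / (n + e) for e in eta]

# (A5): closed form for beta.
beta = (sum(x * y * (1 - s) for x, y, s in zip(cx, cy, shrink))
        / (1 - sum(x * x * s for x, s in zip(cx, shrink))))

# (A4): coordinates of alpha in the eigenbasis.
c_alpha = [(y - x * beta) / (n + e) for x, y, e in zip(cx, cy, eta)]

# Right-hand side of (A3); it should reproduce beta.
a3_rhs = sum(x * (y - a * e) for x, y, a, e in zip(cx, cy, c_alpha, eta))

def objective(b, ca):
    """Objective written in the eigenbasis, as in the display before (A3)."""
    loss = sum((y - x * b - a * e) ** 2 for x, y, a, e in zip(cx, cy, ca, eta)) / (2 * n)
    return loss + 0.5 * sum(a * a * e for a, e in zip(ca, eta))

eta_star = max(eta)
norm_y = math.sqrt(sum(v * v for v in cy))
b_star = (1 + eta_star) * norm_y

print(beta, a3_rhs, b_star)
```

Because the objective is a convex quadratic in $(\beta, C\_1^{\tilde{\alpha}}, \cdots, C\_n^{\tilde{\alpha}})$, any solution of the score equations is a global minimizer, so perturbing the solution cannot decrease the objective.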

Thus, the $\beta$ that minimizes $J\_2$ for a given $\gamma \in D\_3$ is in $D\_2$. Also, (A4) implies that $|C\_i^{\tilde{\alpha}}| \le (\|\mathbf{Y}\|\_2 + \|\mathbf{X}\|\_2 \|\beta\|\_2)$; consequently, the optimal $\boldsymbol{\alpha}$ that minimizes $J\_2$ for the given $\tilde{\gamma} \in D\_3$ and $\beta \in D\_2$ satisfies $\|\boldsymbol{\alpha}\|\_2 \le \sqrt{n}(\|\mathbf{Y}\|\_2 + b^\star)$. As a result, $\boldsymbol{\alpha} \in D\_1$. This shows that for any $(\tilde{\boldsymbol{\alpha}}, \tilde{\beta}, \tilde{\gamma}) \notin D\_1 \times D\_2 \times D\_3$ we can find an $(\boldsymbol{\alpha}, \beta, \gamma) \in D\_1 \times D\_2 \times D\_3$ such that $J\_2(\tilde{\boldsymbol{\alpha}}, \tilde{\beta}, \tilde{\gamma}) \ge J\_2(\boldsymbol{\alpha}, \beta, \gamma)$, which contradicts the assumption.

### *Appendix A.4. Proof of Theorem 2*

By Lemma 8.4 on page 129 in [32], Assumptions 1, 2, and 3 imply:

$$P\left(\sup\_{b\in\mathcal{B}}\frac{\frac{1}{\sqrt{n}}|\sum\_{i=1}^{n}\varepsilon\_{i}b(\mathbf{z}\_{i})|}{||b||\_{P\_{n}}^{1-\psi}}\geq T\right)\leq c\exp\left(-\frac{T^{2}}{c^{2}}\right),\ T\geq c\tag{A6}$$

where the constant $c$ depends on $C\_1$, $C\_2$, $C\_3$, $C\_4$, and $\psi$. It follows that

$$\sup\_{b \in \mathcal{B}} \frac{\frac{1}{\sqrt{n}} |\sum\_{i=1}^{n} \varepsilon\_i b(\mathbf{z}\_i)|}{\|b\|\_{P\_n}^{1-\psi}} = O\_p(1). \tag{A7}$$

Therefore, for any $h \in \mathcal{H}\_{\mathcal{K}}$ and a scaling map function $\Gamma \in \mathcal{A}$, we obtain

$$\frac{\sqrt{n} (\varepsilon, h \circ \Gamma - h\_0 \circ \Gamma\_0)\_n \left( \|h\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\Gamma\|\_{SGL}^2 + \|\Gamma\_0\|\_{SGL}^2 \right)^{-\psi}}{\|h \circ \Gamma - h\_0 \circ \Gamma\_0\|\_{P\_n}^{1-\psi}} = O\_p(1). \tag{A8}$$

For our estimators, $\hat{h}$ and $\hat{\Gamma}$, it is easy to see that

$$(\varepsilon, \hat{h} \circ \hat{\Gamma} - h\_0 \circ \Gamma\_0)\_n = O\_p(n^{-\frac{1}{2}}) \left\| \hat{h} \circ \hat{\Gamma} - h\_0 \circ \Gamma\_0 \right\|\_n^{1-\psi} \left( \|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\hat{\Gamma}\|\_{SGL}^2 + \|\Gamma\_0\|\_{SGL}^2 \right)^{\psi}. \tag{A9}$$

From (A9), we obtain the following inequality:

$$\begin{split} & \left\| \hat{h} \circ \hat{\Gamma} - h\_0 \circ \Gamma\_0 \right\|\_n^2 + \lambda\_1 \|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \lambda\_2 \|\hat{\Gamma}\|\_{SGL}^2 \leq \\ & O\_p(n^{-\frac{1}{2}}) \left\| \hat{h} \circ \hat{\Gamma} - h\_0 \circ \Gamma\_0 \right\|\_n^{1-\psi} \left( \|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\hat{\Gamma}\|\_{SGL}^2 + \|\Gamma\_0\|\_{SGL}^2 \right)^{\psi} \\ & + \lambda\_1 \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \lambda\_2 \|\Gamma\_0\|\_{SGL}^2. \end{split} \tag{A10}$$

We require $\lambda\_1 = O\_p(1)\lambda\_2$, namely that $\lambda\_2$ and $\lambda\_1$ go to zero at the same rate. We will show at the end of the proof what happens if they are not of the same order. Therefore, without loss of generality, we set $\lambda\_1 = \lambda\_2$, denoted by $\lambda$. In what follows, we divide (A10) into two cases.

Case 1: Suppose that

$$\begin{split} & O\_p(n^{-\frac{1}{2}}) \left\| \hat{h} \circ \hat{\Gamma} - h\_0 \circ \Gamma\_0 \right\|\_n^{1-\psi} \left( \|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\hat{\Gamma}\|\_{SGL}^2 + \|\Gamma\_0\|\_{SGL}^2 \right)^{\psi} \\ & \qquad \ge \lambda \left( \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\Gamma\_0\|\_{SGL}^2 \right). \end{split}$$

In this case, we have

$$\begin{split} & \left\| \hat{h} \circ \hat{\Gamma} - h\_0 \circ \Gamma\_0 \right\|\_n^2 + \lambda \left( \|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\hat{\Gamma}\|\_{SGL}^2 \right) \leq \\ & O\_p(n^{-\frac{1}{2}}) \left\| \hat{h} \circ \hat{\Gamma} - h\_0 \circ \Gamma\_0 \right\|\_n^{1-\psi} \left( \|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\hat{\Gamma}\|\_{SGL}^2 + \|\Gamma\_0\|\_{SGL}^2 \right)^{\psi}. \end{split} \tag{A11}$$

Inequality (A11) is further discussed separately in two sub-cases.

Case 1a: If $\|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\Gamma\_0\|\_{SGL}^2 \le \|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\hat{\Gamma}\|\_{SGL}^2$, then we have

$$\begin{split} & \left\| \hat{h} \circ \hat{\Gamma} - h\_0 \circ \Gamma\_0 \right\|\_n^2 + \lambda \left( \|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\hat{\Gamma}\|\_{SGL}^2 \right) \leq \\ & O\_p(n^{-\frac{1}{2}}) \left\| \hat{h} \circ \hat{\Gamma} - h\_0 \circ \Gamma\_0 \right\|\_n^{1-\psi} \left( \|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\hat{\Gamma}\|\_{SGL}^2 \right)^{\psi}. \end{split} \tag{A12}$$

Therefore,

$$\left( \|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\hat{\Gamma}\|\_{SGL}^2 \right)^{\psi} \le O\_p(n^{-\frac{\psi}{2(1-\psi)}}) \left\| \hat{h} \circ \hat{\Gamma} - h\_0 \circ \Gamma\_0 \right\|\_n^{\psi} \lambda^{-\frac{\psi}{1-\psi}}. \tag{A13}$$

It follows that

$$\begin{aligned} \left\| \hat{h} \circ \hat{\Gamma} - h\_0 \circ \Gamma\_0 \right\|\_n &= O\_p(n^{-\frac{1}{2(1-\psi)}}) O\_p(\lambda^{-\frac{\psi}{1-\psi}}), \\ \|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\hat{\Gamma}\|\_{SGL}^2 &= O\_p(n^{-\frac{1}{1-\psi}}) O\_p(\lambda^{-\frac{1+\psi}{1-\psi}}). \end{aligned} \tag{A14}$$
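
The passage from (A12) to (A14) can be unpacked as follows. The shorthand $\Delta = \| \hat{h} \circ \hat{\Gamma} - h\_0 \circ \Gamma\_0 \|\_n$ and $S = \|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\hat{\Gamma}\|\_{SGL}^2$ is introduced only for this sketch, and constants are absorbed into the $O\_p(\cdot)$ terms. Each of the two terms on the left of (A12) is bounded by the right-hand side, so

$$\begin{aligned} \lambda S \le O\_p(n^{-\frac{1}{2}}) \Delta^{1-\psi} S^{\psi} &\;\Longrightarrow\; S \le O\_p(n^{-\frac{1}{2(1-\psi)}}) \, \Delta \, \lambda^{-\frac{1}{1-\psi}}, \\ \Delta^2 \le O\_p(n^{-\frac{1}{2}}) \Delta^{1-\psi} S^{\psi} &\;\Longrightarrow\; \Delta^2 \le O\_p(n^{-\frac{1}{2} - \frac{\psi}{2(1-\psi)}}) \, \Delta \, \lambda^{-\frac{\psi}{1-\psi}} \;\Longrightarrow\; \Delta \le O\_p(n^{-\frac{1}{2(1-\psi)}}) \, \lambda^{-\frac{\psi}{1-\psi}}, \\ S &\le O\_p(n^{-\frac{1}{2(1-\psi)}}) \, \lambda^{-\frac{1}{1-\psi}} \cdot O\_p(n^{-\frac{1}{2(1-\psi)}}) \, \lambda^{-\frac{\psi}{1-\psi}} = O\_p(n^{-\frac{1}{1-\psi}}) \, \lambda^{-\frac{1+\psi}{1-\psi}}. \end{aligned}$$

The first line, raised to the power $\psi$, is (A13); the second line substitutes (A13) and uses $\frac{1}{2} + \frac{\psi}{2(1-\psi)} = \frac{1}{2(1-\psi)}$; the third line inserts the bound on $\Delta$ into the bound on $S$.
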
Case 1b: If $\|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\Gamma\_0\|\_{SGL}^2 \ge \|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\hat{\Gamma}\|\_{SGL}^2$, then:

$$\|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\hat{\Gamma}\|\_{SGL}^2 = O\_p\left( \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\Gamma\_0\|\_{SGL}^2 \right) O\_p(1).$$

Therefore,

$$\left\| \hat{h} \circ \hat{\Gamma} - h\_0 \circ \Gamma\_0 \right\|\_n = O\_p(n^{-\frac{1}{2(1+\psi)}}) \left( \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\Gamma\_0\|\_{SGL}^2 \right)^{\frac{\psi}{1+\psi}}.$$

Consequently, we obtain

$$\begin{split} \left\| \hat{h} \circ \hat{\Gamma} - h\_0 \circ \Gamma\_0 \right\|\_n &= O\_p(n^{-\frac{1}{2(1-\psi)}}) O\_p(\lambda^{-\frac{\psi}{1-\psi}}), \\ \|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\hat{\Gamma}\|\_{SGL}^2 &= O\_p(n^{-\frac{1}{1-\psi}}) O\_p(\lambda^{-\frac{1+\psi}{1-\psi}}). \end{split} \tag{A15}$$

Both rates in (A15) are the same as those in (A14).

Case 2: Suppose that

$$\begin{split} & O\_p(n^{-\frac{1}{2}}) \left\| \hat{h} \circ \hat{\Gamma} - h\_0 \circ \Gamma\_0 \right\|\_n^{1-\psi} \left( \|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\hat{\Gamma}\|\_{SGL}^2 + \|\Gamma\_0\|\_{SGL}^2 \right)^{\psi} \\ & \qquad \leq \lambda \left( \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\Gamma\_0\|\_{SGL}^2 \right). \end{split}$$

Then, we have

$$\left\| \hat{h} \circ \hat{\Gamma} - h\_0 \circ \Gamma\_0 \right\|\_n^2 + \lambda \left( \|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\hat{\Gamma}\|\_{SGL}^2 \right) \le 2\lambda \left( \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\Gamma\_0\|\_{SGL}^2 \right).$$

This implies that

$$\begin{aligned} \left\| \hat{h} \circ \hat{\Gamma} - h\_0 \circ \Gamma\_0 \right\|\_n &= O\_p(\lambda^{\frac{1}{2}}) \left( \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\Gamma\_0\|\_{SGL}^2 \right)^{\frac{1}{2}}, \\ \|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\hat{\Gamma}\|\_{SGL}^2 &= O\_p(1) \left( \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\Gamma\_0\|\_{SGL}^2 \right). \end{aligned} \tag{A16}$$

In order to make (A14) and (A16) have the same rates, we first equate the two terms $O\_p(\lambda^{\frac{1}{2}}) \left( \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\Gamma\_0\|\_{SGL}^2 \right)^{\frac{1}{2}}$ and $O\_p(n^{-\frac{1}{2(1-\psi)}}) O\_p(\lambda^{-\frac{\psi}{1-\psi}})$, and then solve for a common $\lambda$. The solution is given as follows:

$$\lambda^{-1} = n^{\frac{1}{1+\psi}} \left( \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\Gamma\_0\|\_{SGL}^2 \right)^{\frac{1-\psi}{1+\psi}}.$$

Under this value of $\lambda$, (A14)–(A16) take the form:

$$\left\| \hat{h} \circ \hat{\Gamma} - h\_0 \circ \Gamma\_0 \right\|\_n = O\_p(n^{-\frac{1}{2(1+\psi)}}) \left( \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\Gamma\_0\|\_{SGL}^2 \right)^{\frac{\psi}{1+\psi}}, \tag{A17}$$

$$\|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\hat{\Gamma}\|\_{SGL}^2 = O\_p(1) \left( \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\Gamma\_0\|\_{SGL}^2 \right). \tag{A18}$$
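
The choice of $\lambda$ can be checked numerically. In the sketch below all constants hidden in the $O\_p(\cdot)$ terms are set to 1, `s0` stands for $\|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\Gamma\_0\|\_{SGL}^2$, and the values of `n`, `psi` and `s0` are arbitrary illustrative choices.

```python
import math

def balanced_lambda(n, psi, s0):
    """lambda with 1/lambda = n^{1/(1+psi)} * s0^{(1-psi)/(1+psi)}."""
    return 1.0 / (n ** (1.0 / (1.0 + psi)) * s0 ** ((1.0 - psi) / (1.0 + psi)))

def rate_case1(n, psi, lam):
    """First line of (A14), constants dropped."""
    return n ** (-1.0 / (2.0 * (1.0 - psi))) * lam ** (-psi / (1.0 - psi))

def rate_case2(lam, s0):
    """First line of (A16), constants dropped."""
    return math.sqrt(lam * s0)

def rate_a17(n, psi, s0):
    """Right-hand side of (A17), constants dropped."""
    return n ** (-1.0 / (2.0 * (1.0 + psi))) * s0 ** (psi / (1.0 + psi))

n, psi, s0 = 500, 0.4, 3.0
lam = balanced_lambda(n, psi, s0)

print(rate_case1(n, psi, lam), rate_case2(lam, s0), rate_a17(n, psi, s0))
```

With this $\lambda$ the Case 1 and Case 2 rates coincide, and both equal the rate stated in (A17).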

This completes the proof of Theorem 2.

Now we discuss the situation where the tuning parameters $\lambda\_1$ and $\lambda\_2$ are not of the same order. As seen below, the selection consistency may not be guaranteed. Take Case 2 as an example. Suppose that

$$\begin{split} & O\_p(n^{-\frac{1}{2}}) \left\| \hat{h} \circ \hat{\Gamma} - h\_0 \circ \Gamma\_0 \right\|\_n^{1-\psi} \left( \|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\hat{\Gamma}\|\_{SGL}^2 + \|\Gamma\_0\|\_{SGL}^2 \right)^{\psi} \\ & \qquad \leq \lambda\_1 \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \lambda\_2 \|\Gamma\_0\|\_{SGL}^2. \end{split}$$

Let us consider two cases.

Case 2a: If $\lambda\_1 \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 \le \lambda\_2 \|\Gamma\_0\|\_{SGL}^2$, then following the same arguments as above, we have

$$\begin{aligned} \left\| \hat{h} \circ \hat{\Gamma} - h\_0 \circ \Gamma\_0 \right\|\_n &= O\_p(\lambda\_2^{\frac{1}{2}}) \|\Gamma\_0\|\_{SGL}, \\ \|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 &= O\_p\left(\frac{\lambda\_2}{\lambda\_1}\right) \|\Gamma\_0\|\_{SGL}^2, \\ \|\hat{\Gamma}\|\_{SGL}^2 &= O\_p(1) \|\Gamma\_0\|\_{SGL}^2. \end{aligned} \tag{A19}$$

Case 2b: If $\lambda\_1 \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 \ge \lambda\_2 \|\Gamma\_0\|\_{SGL}^2$, then following the same logic as before:

$$\begin{aligned} \left\| \hat{h} \circ \hat{\Gamma} - h\_0 \circ \Gamma\_0 \right\|\_n &= O\_p(\lambda\_1^{\frac{1}{2}}) \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}, \\ \|\hat{\Gamma}\|\_{SGL}^2 &= O\_p\left(\frac{\lambda\_1}{\lambda\_2}\right) \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2, \\ \|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 &= O\_p(1) \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2. \end{aligned} \tag{A20}$$

The bounds in (A19) and (A20) involve $O\_p(\frac{\lambda\_2}{\lambda\_1})$ and $O\_p(\frac{\lambda\_1}{\lambda\_2})$, respectively, indicating that the two tuning parameters $\lambda\_1$ and $\lambda\_2$ should go to zero at the same rate. Moreover, we can think of our estimator $\hat{h} \circ \hat{\Gamma}$ as one operational object. See Appendix B for more details on this, which further explains the need for a single rate for the two penalties.

### *Appendix A.5. Proof of Corollary 1*

For convenience, we present the following lemma proved by [32] (on page 20).

**Lemma A1.** *(Geer's Lemma) A $d$-dimensional ball of radius $R$, $B\_d(R)$, in $\mathbb{R}^d$ with the Euclidean metric can be covered by $\left(\frac{4R+\delta}{\delta}\right)^d$ balls of radius $\delta$.*
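
To get a feel for the size of this bound, the sketch below evaluates it and its logarithm. The function names and the numerical values of the radius, $\delta$, dimension and $\psi$ are illustrative assumptions. The last lines compare the logarithmic entropy $d \log\left(\frac{4R+\delta}{\delta}\right)$ with a polynomial entropy of order $\delta^{-2\psi}$, the kind of bound assumed for the RKHS later in this appendix.

```python
import math

def covering_bound(radius, delta, dim):
    """Number of delta-balls from Lemma A1: ((4R + delta) / delta)^d."""
    return ((4.0 * radius + delta) / delta) ** dim

def log_covering_bound(radius, delta, dim):
    """Entropy (log covering number) implied by Lemma A1."""
    return dim * math.log((4.0 * radius + delta) / delta)

# Unit ball in R^3 covered by balls of radius 0.5: (4.5 / 0.5)^3 = 9^3.
n1 = covering_bound(1.0, 0.5, 3)

# Ratio of the logarithmic entropy to delta^(-2 psi) for shrinking delta.
psi, dim = 0.5, 3
ratios = [log_covering_bound(1.0, d, dim) / d ** (-2.0 * psi)
          for d in (1e-2, 1e-4, 1e-6)]
print(n1, ratios)
```

The ratios shrink towards zero as $\delta$ decreases, which is the sense in which a logarithmic entropy term is dominated by a polynomial one.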

We have shown in the proof of Theorem 1 that the optimal $\gamma$ vector is restricted to be within a ball of a radius that depends on the norm of $\mathbf{Y}$. For the sake of simplicity, let us confine our $\gamma$ to be within a norm ball of radius 1, $\gamma \in \mathcal{G} = \{\gamma : \|\gamma\|\_2^2 \le 1\}$. We then restrict the set $\mathcal{A}$ to those $\gamma$, that is, $\mathcal{A} = \{\Gamma : \Gamma(\mathbf{z}) = \gamma \circ \mathbf{z}, \gamma \in \mathcal{G}\}$. Since $\gamma \in \mathbb{R}^s$, we can use Lemma A1 above and cover our set $\mathcal{A}$ with $N\_1 = \left(\frac{4+\delta}{\delta}\right)^s$ functions in the following sense. The ball of radius 1 in $\mathbb{R}^s$ can be covered (using the Euclidean metric) by $\{\gamma\_1, \cdots, \gamma\_{N\_1}\}$. Since there is a one-to-one relationship between the functions $\Gamma$ and $\gamma$, take the set $\{\Gamma\_1, \ldots, \Gamma\_{N\_1}\}$ and define the metric between some $\Gamma\_j$ and $\Gamma\_k$ in the set $\mathcal{A}$ as $d(\Gamma\_j, \Gamma\_k) = \|\gamma\_j - \gamma\_k\|\_2$. Then, the set of functions $\{\Gamma\_1, \ldots, \Gamma\_{N\_1}\}$ is a $\delta$-covering for $\mathcal{A}$ under this metric with entropy $s \log\left(\frac{4+\delta}{\delta}\right)$. For each $\Gamma\_j$ we have an induced RKHS, $\mathcal{H}\_{\mathcal{K} \circ \Gamma\_j} = \{h \circ \Gamma\_j : h \in \mathcal{H}\_{\mathcal{K}}\}$, with entropy no larger than that of $\mathcal{H}\_{\mathcal{K}}$, which according to the assumption has entropy $\le A\delta^{-2\psi}$ for some $\psi \in (0, 1)$ and $A \in \mathbb{R}$. Therefore, the covering number satisfies $N\_2 = N(\delta, \mathcal{H}\_{\mathcal{K} \circ \Gamma\_j}, P\_n) \le \exp\{A\delta^{-2\psi}\}$. This implies that for every $\Gamma\_j$ there exists a set $\{h\_{j1} \circ \Gamma\_j, \cdots, h\_{jN\_2} \circ \Gamma\_j\}$ such that for every $h \circ \Gamma\_j \in \mathcal{H}\_{\mathcal{K} \circ \Gamma\_j}$ there exists an integer $i \in \{1, \ldots, N\_2\}$ with $\|h \circ \Gamma\_j - h\_{ji} \circ \Gamma\_j\|\_{P\_n} \le \delta$.

The set $\mathcal{B}$ is essentially the union of the different Hilbert spaces of the form $\mathcal{H}\_{\mathcal{K} \circ \Gamma}$. Under this setup, a natural estimate of the $\delta$-covering number of this set is approximately of size $N\_1 \times N\_2$, where the covering functions take the form $\{h\_{11} \circ \Gamma\_1, \cdots, h\_{1N\_2} \circ \Gamma\_1, \cdots, h\_{N\_1 1} \circ \Gamma\_{N\_1}, \cdots, h\_{N\_1 N\_2} \circ \Gamma\_{N\_1}\}$. In addition, we add $N\_2$ functions from the set $\{h\_1 \circ \Gamma\_0, \cdots, h\_{N\_2} \circ \Gamma\_0\}$, where $\Gamma\_0$ is the true $\Gamma\_0$ (or one of the true $\Gamma\_0$). Since $\mathcal{H}\_{\mathcal{K} \circ \Gamma\_j}$ is a Hilbert space for every $j$, if $h \circ \Gamma\_j \in \mathcal{H}\_{\mathcal{K} \circ \Gamma\_j}$, then so is

$$\frac{h \circ \Gamma\_j}{\|h\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\Gamma\_j\|\_{SGL}^2 + \|\Gamma\_0\|\_{SGL}^2}.$$

We can simply ignore the denominator and replace this quotient with $\tilde{h} \circ \Gamma\_j \in \mathcal{H}\_{\mathcal{K} \circ \Gamma\_j}$, where $\tilde{h} = h / \left( \|h\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\Gamma\_j\|\_{SGL}^2 + \|\Gamma\_0\|\_{SGL}^2 \right)$. We now prove Corollary 1.

**Proof.** Set $M = \sup\_h \langle \nabla h(\mathbf{z}), \nabla h(\mathbf{z}) \rangle$, where the inner product is the standard Euclidean inner product. This is for a fixed $\mathbf{z}$; alternatively, under the assumption that the gradient is uniformly bounded, we can take $\sup\_{h \in \mathcal{H}\_{\mathcal{K}}, \mathbf{z} \in \mathbb{R}^s} \langle \nabla h(\mathbf{z}), \nabla h(\mathbf{z}) \rangle$. Let

$$N\_1 = \left( \frac{4 + \frac{\delta}{3M^{\frac{1}{2}}}}{\frac{\delta}{3M^{\frac{1}{2}}}} \right)^s,$$

which is the number of balls needed to provide a $\frac{\delta}{3M^{\frac{1}{2}}}$ covering for a norm 1 ball in $\mathbb{R}^s$. Let $N\_2 = \exp\{A(\frac{\delta}{3})^{-2\psi}\}$, which is the covering number needed to provide a $\frac{\delta}{3}$ cover of our space $\mathcal{H}\_{\mathcal{K}}$. Let:

$$\begin{aligned} & \tilde{\hat{h}} \circ \hat{\Gamma} - \tilde{h}\_0 \circ \Gamma\_0 = \\ & \frac{\hat{h} \circ \hat{\Gamma}}{\|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\hat{\Gamma}\|\_{SGL}^2 + \|\Gamma\_0\|\_{SGL}^2} - \frac{h\_0 \circ \Gamma\_0}{\|\hat{h}\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|h\_0\|\_{\mathcal{H}\_{\mathcal{K}}}^2 + \|\hat{\Gamma}\|\_{SGL}^2 + \|\Gamma\_0\|\_{SGL}^2} \end{aligned}$$

be an arbitrary function in the set $\mathcal{B}$. There exists a $\Gamma\_j$ with $j \in \{1, \ldots, N\_1\}$ such that $d(\Gamma\_j, \hat{\Gamma}) \le \frac{\delta}{3 \max\_{i=1,\cdots,n} \|\mathbf{z}\_i\|\_2 \sqrt{M}}$, and there exists an $i \in \{1, \ldots, N\_2\}$ such that $\| \tilde{\hat{h}} \circ \Gamma\_j - h\_{ji} \circ \Gamma\_j \|\_{P\_n} \le \frac{\delta}{3}$.

Similarly, there exists a $t \in \{1, \ldots, N\_2\}$ such that $\| \tilde{h}\_0 \circ \Gamma\_0 - h\_t \circ \Gamma\_0 \|\_{P\_n} \le \frac{\delta}{3}$. We construct our approximating function of $\tilde{\hat{h}} \circ \hat{\Gamma} - \tilde{h}\_0 \circ \Gamma\_0$ as $h\_{ji} \circ \Gamma\_j - h\_t \circ \Gamma\_0$. We now show that this function is within $\delta$ of our arbitrary function $\tilde{\hat{h}} \circ \hat{\Gamma} - \tilde{h}\_0 \circ \Gamma\_0$. Applying the mean value theorem for multivariate functions, $\tilde{\hat{h}} \circ \hat{\Gamma}(\mathbf{z}) = \tilde{\hat{h}} \circ \Gamma\_j(\mathbf{z}) + \nabla \tilde{\hat{h}}(C(\mathbf{z}))(\hat{\Gamma}(\mathbf{z}) - \Gamma\_j(\mathbf{z}))$, we have:

$$\begin{aligned} & \left\| \left( \tilde{\hat{h}} \circ \hat{\Gamma} - \tilde{h}\_0 \circ \Gamma\_0 \right) - \left( h\_{ji} \circ \Gamma\_j - h\_t \circ \Gamma\_0 \right) \right\|\_{P\_n} \\ & \qquad \le \left\| \tilde{\hat{h}} \circ \hat{\Gamma} - h\_{ji} \circ \Gamma\_j \right\|\_{P\_n} + \left\| \tilde{h}\_0 \circ \Gamma\_0 - h\_t \circ \Gamma\_0 \right\|\_{P\_n} \\ & \qquad \le \left\| \tilde{\hat{h}} \circ \hat{\Gamma} - h\_{ji} \circ \Gamma\_j \right\|\_{P\_n} + \frac{\delta}{3} \\ & \qquad = \left\| \tilde{\hat{h}} \circ \Gamma\_j - h\_{ji} \circ \Gamma\_j + \nabla \tilde{\hat{h}}(C(\cdot))(\hat{\Gamma} - \Gamma\_j) \right\|\_{P\_n} + \frac{\delta}{3}, \end{aligned}$$

where the vector $C(\mathbf{z}) \in \mathbb{R}^s$ lies on the segment between $\gamma\_j \circ \mathbf{z}$ and $\hat{\gamma} \circ \mathbf{z}$, and $C(\cdot)$ is an unknown function mapping $\mathbb{R}^s$ into $\mathbb{R}^s$ for which the formula holds. Continuing our chain of inequalities, we obtain:

$$\begin{split} & \left\| \tilde{\hat{h}} \circ \Gamma\_j - h\_{ji} \circ \Gamma\_j + \nabla \tilde{\hat{h}}(C(\cdot))(\hat{\Gamma} - \Gamma\_j) \right\|\_{P\_n} + \frac{\delta}{3} \\ & \qquad \le \left\| \nabla \tilde{\hat{h}}(C(\cdot))(\hat{\Gamma} - \Gamma\_j) \right\|\_{P\_n} + \frac{\delta}{3} + \frac{\delta}{3} \\ & \qquad = \sqrt{\frac{1}{n} \sum\_{i=1}^{n} \left( \nabla \tilde{\hat{h}}(C(\mathbf{z}\_i))(\hat{\Gamma}(\mathbf{z}\_i) - \Gamma\_j(\mathbf{z}\_i)) \right)^2} + \frac{\delta}{3} + \frac{\delta}{3} \\ & \qquad \le \sqrt{\frac{1}{n} \sum\_{i=1}^{n} M \left\| \hat{\gamma} \circ \mathbf{z}\_i - \gamma\_j \circ \mathbf{z}\_i \right\|\_2^2} + \frac{\delta}{3} + \frac{\delta}{3} \\ & \qquad \le \sqrt{M \left( \frac{\delta}{3 \max\_{i=1,\cdots,n} \|\mathbf{z}\_i\|\_2 \sqrt{M}} \right)^2 \max\_{i=1,\cdots,n} \|\mathbf{z}\_i\|\_2^2} + \frac{\delta}{3} + \frac{\delta}{3} \\ & \qquad = \frac{\delta}{3} + \frac{\delta}{3} + \frac{\delta}{3} = \delta. \end{split}$$

Therefore, to provide a $\delta$ cover we need $N\_1 \times N\_2 + N\_2$ functions, or:

$$\exp\left\{ A\left(\frac{\delta}{3}\right)^{-2\psi} \right\} \left( \frac{4 + \frac{\delta}{3M^{\frac{1}{2}}}}{\frac{\delta}{3M^{\frac{1}{2}}}} \right)^s + \exp\left\{ A\left(\frac{\delta}{3}\right)^{-2\psi} \right\} = \exp\{\tilde{A}\delta^{-2\psi}\} \left( \frac{C + \delta}{\delta} \right)^s + \exp\{\tilde{A}\delta^{-2\psi}\},$$

where $\tilde{A} = A \cdot 3^{2\psi}$ and $C = 12M^{\frac{1}{2}}$. Taking the log, we see that the entropy is $\le \tilde{A}\delta^{-2\psi} + \log\left( \left(\frac{C+\delta}{\delta}\right)^s + 1 \right)$, which is of the same order as $\tilde{A}\delta^{-2\psi}$ (the $\log$ term is dominated by the first term). Therefore a sufficient (but not necessary) condition for our set $\mathcal{B}$ to have the same entropy as that of the original RKHS $\mathcal{H}\_{\mathcal{K}}$ is for $\sup\_h \langle \nabla h(\mathbf{z}), \nabla h(\mathbf{z}) \rangle$ to be bounded. Having bounded derivatives is reasonable for any RKHS, since every RKHS satisfies a Lipschitz condition of the form:

$$|h(X) - h(Y)| = |\langle h, \mathcal{K}\_X \rangle - \langle h, \mathcal{K}\_Y \rangle| \le \|h\|\_{\mathcal{H}\_{\mathcal{K}}} \langle \mathcal{K}\_X - \mathcal{K}\_Y, \mathcal{K}\_X - \mathcal{K}\_Y \rangle^{\frac{1}{2}} = \|h\|\_{\mathcal{H}\_{\mathcal{K}}} d(X, Y),$$

where the distance metric in $\mathbb{R}^s$ is defined by $d(X, Y)^2 = \mathcal{K}(X, X) - 2\mathcal{K}(X, Y) + \mathcal{K}(Y, Y)$. If we restrict our functions in the RKHS to have norm $\le C$ for some constant $C$, then we have a universal Lipschitz constant $C$ to ensure bounded derivatives.
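
The Lipschitz condition can be checked numerically for a concrete kernel. The sketch below assumes a Gaussian kernel $\mathcal{K}(X, Y) = \exp(-\|X - Y\|\_2^2)$ and a function $h = \sum\_i a\_i \mathcal{K}(\cdot, c\_i)$ with randomly drawn, purely illustrative centers $c\_i$ and coefficients $a\_i$; its RKHS norm is computed from the Gram matrix of the centers.

```python
import math
import random

def gauss_kernel(x, y):
    """Gaussian kernel exp(-||x - y||^2) (an assumed choice of K)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)))

def kernel_distance(x, y):
    """d(X, Y) with d^2 = K(X,X) - 2 K(X,Y) + K(Y,Y)."""
    d2 = gauss_kernel(x, x) - 2.0 * gauss_kernel(x, y) + gauss_kernel(y, y)
    return math.sqrt(max(d2, 0.0))

rng = random.Random(1)
centers = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(4)]
coef = [rng.gauss(0, 1) for _ in range(4)]

def h(x):
    """h = sum_i a_i K(., c_i), a function in the RKHS."""
    return sum(a * gauss_kernel(c, x) for a, c in zip(coef, centers))

# RKHS norm of h: sqrt(a^T G a) with G the Gram matrix of the centers.
h_norm = math.sqrt(max(sum(a * b * gauss_kernel(c, d)
                           for a, c in zip(coef, centers)
                           for b, d in zip(coef, centers)), 0.0))

violations = 0
for _ in range(200):
    x = [rng.gauss(0, 1) for _ in range(2)]
    y = [rng.gauss(0, 1) for _ in range(2)]
    if abs(h(x) - h(y)) > h_norm * kernel_distance(x, y) + 1e-12:
        violations += 1

print(violations)
```

No random pair should violate the bound, since it is an instance of the Cauchy–Schwarz inequality in the RKHS.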

### **Appendix B. Discussion about the FKMR Estimator**

We introduce *γ* as a way of performing variable selection on our vector of FPC features. We want to illustrate this technical trick with some concrete examples and discuss identifiability issues with the resulting estimator. There are two ways of looking at the estimation of the unknown functions *h*<sup>0</sup> and Γ0. The first way is to view our feature vector, **z**, as being related to the dependent variable *y* through the composite function *h* ◦ Γ, as explained in Section 4. The second and equivalent way is to view our features as unknown. The true features take the form of *γ* ◦ **z**, where in this case the ◦ denotes the Hadamard product. We are given **z** and need to estimate the "true" features *γ* ◦ **z**. In addition, we need to estimate the relationship between *γ* ◦ **z** and *y*, which is done through the function *h* ∈ HK.

The first way is to estimate the function $h\_0 \circ \Gamma\_0$. The function belongs to the RKHS $\mathcal{H}\_{\mathcal{K} \circ \Gamma}$. We essentially consider many different function spaces to construct our estimator. The intersection between the function spaces is not necessarily empty, implying that our estimator may not be unique. We now make this discussion more formal. Let $\mathcal{K} : \mathbb{R}^s \times \mathbb{R}^s \to \mathbb{R}$ be a positive definite function. Let $\Gamma : \mathbb{R}^s \to \mathbb{R}^s$. We define $\mathcal{K} \circ \Gamma : \mathbb{R}^s \times \mathbb{R}^s \to \mathbb{R}$ as the function given by $\mathcal{K} \circ \Gamma(\mathbf{s}, \mathbf{t}) = \mathcal{K}(\Gamma(\mathbf{s}), \Gamma(\mathbf{t}))$. This new function, $\mathcal{K} \circ \Gamma$, is positive definite. There is a relationship between the original RKHS, $\mathcal{H}\_{\mathcal{K}}$, and the new RKHS, $\mathcal{H}\_{\mathcal{K} \circ \Gamma}$, namely $\mathcal{H}\_{\mathcal{K} \circ \Gamma} = \{h \circ \Gamma : h \in \mathcal{H}\_{\mathcal{K}}\}$. For any vector $u \in \mathcal{H}\_{\mathcal{K} \circ \Gamma}$, we have that $\|u\|\_{\mathcal{H}\_{\mathcal{K} \circ \Gamma}} = \inf\{\|h\|\_{\mathcal{H}\_{\mathcal{K}}} : u = h \circ \Gamma\}$. In general, $\mathcal{H}\_{\mathcal{K} \circ \Gamma} \subset \mathcal{H}\_{\mathcal{K}}$. In (5), we take the norm with respect to the original space $\mathcal{H}\_{\mathcal{K}}$. Our iterative procedure essentially follows the second way, in which the true features are unknown, whereas our theoretical arguments are justified through the first way. Given knowledge of the features (which translates to fixing a $\gamma$), we are confined to just one RKHS, $\mathcal{H}\_{\mathcal{K}}$.

Take the linear kernel, $\mathcal{K}(\mathbf{x}\_1, \mathbf{x}\_2) = \mathbf{x}\_1^\top \mathbf{x}\_2$, as an example. Suppose the truth is that $y$ is related to a one-dimensional feature $z\_0$ through the following formulation: $y = h\_0(z\_0) + \varepsilon$ where $h\_0 \in \mathcal{H}\_{\mathcal{K}\_1}$ and $\mathcal{K}\_1$ is the kernel that maps from $\mathbb{R} \times \mathbb{R} \to \mathbb{R}$. Therefore, if we knew the feature $z\_1$, we would proceed to optimize (6) using the standard LSKM. However, suppose that each $y$ is associated with a two-dimensional vector $\mathbf{z} = (z\_1, z\_2)$, where $z\_2$ is a "noisy" feature unrelated to $y$, and that *a priori* we do not know this information. Typically we use a model $y = h(z\_1, z\_2) + \varepsilon$ where $h \in \mathcal{H}\_{\mathcal{K}}$ and $\mathcal{K}$ is the kernel that maps from $\mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}$. In this case, we introduce our $\gamma$ vector $(\gamma\_1, \gamma\_2)$ and formulate $y = h(\gamma\_1 z\_1, \gamma\_2 z\_2) + \varepsilon$. All functions $h$ in the space $\mathcal{H}\_{\mathcal{K}}$ are of the form $h(\mathbf{z}) = \mathbf{x}^\top \mathbf{z}$ for some two-dimensional vector $\mathbf{x} = (x\_1, x\_2)$. There is a one-to-one relationship between $h$ and $\mathbf{x}$. The true function, $h\_0$, has an associated real number $c$ where $h\_1(z\_1) = cz\_1$. We can recover $h\_1 \in \mathcal{H}\_{\mathcal{K}\_1}$ from our estimation of $h$ and $\gamma$ if we set $\gamma = (1, 0)$ and $\mathbf{x} = (c, \star)$, where "$\star$" is any real number. Equivalently, we can recover $h\_1$ under $\gamma = (1, 1)$ where $\mathbf{x} = (c, 0)$. There are many functions that may recover the original function in the RKHS corresponding to the linear space kernel. Formulating our problem in the first way, through function composition, we can estimate $\Gamma\_0$ with the $\gamma$ being $(1, 0)$ or $(1, 1)$.

We can now see that in the intersection between $\mathcal{H}_{K \circ \Gamma_1}$ and $\mathcal{H}_{K \circ \Gamma_2}$, where $\Gamma_1$ has associated $\gamma_{\mathbf{1}} = (1, 0)$ and $\Gamma_2$ has associated $\gamma_{\mathbf{2}} = (1, 1)$, lies our estimate of $h_1$. In truth, for the linear space RKHS, there is no need to apply our method since $h_0 \in \mathcal{H}_{K_1}$ can be estimated directly from the larger space $\mathcal{H}_K$ where we set $h(\mathbf{z}) = \mathbf{x}^\top \mathbf{z}$ where $\mathbf{x} = (c, 0)$. We can never hope to have variable selection consistency, nor can we hope to have identifiability of our estimator, for these types of spaces. However, from a goodness-of-fit standpoint, we are able to do just as good a job with many types of function compositions. Our hope is that we can glean some variable selection by penalizing the $\gamma$ vector with the $\rho(\gamma; \delta)$ term which, going back to the above scenario, should give preference to $\gamma = (1, 0)$ over $\gamma = (1, 1)$. For the RKHS associated with the Gaussian Kernel, the "larger dimensional space", a Gaussian Kernel mapping from higher dimensions, does not necessarily contain the functions from a "lower dimensional space", a Gaussian Kernel mapping from lower dimensions. However, through the introduction of the $\gamma$ transformation of the features, we can recover the equivalent functions of the "lower dimensional space".


### *Article* **Comparative Analysis of Social Support in Online Health Communities Using a Word Co-Occurrence Network Analysis Approach**

**Mengque Liu <sup>1</sup>, Xia Zou <sup>1</sup>, Jiyin Chen <sup>1</sup> and Shuangge Ma <sup>2,</sup>\***


**Abstract:** Online health communities (OHCs) have become a major source of social support for people with health problems. Members of OHCs interact online with others facing similar health problems and receive multiple types of social support, including but not limited to informational support, emotional support, and companionship. The aim of this study is to examine the differences in social support communication among people with different types of cancers. A novel approach is developed to better understand the types of social support embedded in OHC posts. Our approach, based on the word co-occurrence network analysis, preserves the semantic structures of the texts. Information extraction from the semantic structures is supported by the interplay of quantitative and qualitative analyses of the network structures. Our analysis shows that significant differences in social support exist across cancer types, and evidence for the differences across diseases in terms of communication preferences and language use is also identified. Overall, this study can establish a new venue for extracting and analyzing information, so as to inform social support for clinical care.

**Keywords:** online health community; social support; network analysis; cancer

### **1. Introduction**

A cancer diagnosis and treatment can cause significant changes to a person's path in life and affect his/her daily activities, work, relationships, and family roles. Cancer patients (and their surrounding members) often suffer from a high level of psychological stress, which can lead to anxiety and depression. They strongly demand social support, which is broadly defined as resources or aids that are exchanged by members within a specific community. Extensive research [1–3] has reported social support as a complex construction with direct and buffering effects on a person's well-being and psychological adjustment to cancer. For example, studies have suggested the association between social support and cancer progression [4]. In addition, insufficient social support can lead to poor health behaviors, which may result in an increased vulnerability toward cancer and its associated mortality [5]. It has also been identified as a consistent indicator for survival.

According to the Health Information National Trends Survey, the proportion of cancer survivors reporting internet use has increased over time, from 49.5% in 2003 to 76.9% in 2017 [6]. Consistent with that, social support is also increasingly exchanged via computer-mediated communication, which has been referred to as computer-mediated social support. It can be developed among strangers whose only connection is their common affliction or concern about a source of personal discomfort. The anonymous nature of online communities also allows patients to exchange personal concerns and advice without the fear of being judged or recognized [7]. We refer to published studies for more discussions on the advantages of computer-mediated social support [8–10]. Online health communities (OHCs) are online social networks with a focus on health. OHCs can be categorized as either general-purpose communities or those dedicated to a specific health issue. Many OHCs have their own websites, while others are built on existing social networking services, such as Facebook. Compared to traditional health-related websites that only allow users to retrieve information, OHCs can increase members' ability to interact with peers facing similar health problems and, as a result, better meet their immediate needs for social support. People show emotional support for others in OHCs by offering encouragement, reassurance, compassion, etc. OHCs are helpful in empowering patients through personal participation and providing access to information as well as emotional support.

**Citation:** Liu, M.; Zou, X.; Chen, J.; Ma, S. Comparative Analysis of Social Support in Online Health Communities Using a Word Co-Occurrence Network Analysis Approach. *Entropy* **2022**, *24*, 174. https://doi.org/10.3390/e24020174

Academic Editors: S. Ejaz Ahmed and Farouk Nathoo

Received: 21 December 2021 Accepted: 22 January 2022 Published: 25 January 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Understanding how members of these online groups interact with each other and make use of online support resources is of critical interest. A handful of content analyses have been conducted, examining the nature of support messages communicated in OHCs [11]. In several studies that analyzed a variety of cancer support groups, information support was found to be the predominant type of support exchanged [12,13]. Some other studies reported that emotional support was the most frequent type of support message [14,15]. Questions, though, about when and why social support messages in computer-mediated contexts vary systematically remain largely unanswered [16]. Blank et al. [17] and Seale et al. [18] revealed significant gender differences. There is also evidence that the support needs of those who were diagnosed, and their families, vary by disease [12,19,20]. It is noted that these studies are mostly limited to breast cancer and prostate cancer, which are mostly gender-specific. Our literature review suggests that, in general, differences across diseases have not been sufficiently examined—something that is critical for understanding patients' needs related to information, emotional support, and relationship-building in OHCs. Only by understanding patients' more specific perceptions and needs can we further optimize the designs and services of OHCs, especially for cancer survivors, who have complex support needs and require different levels of care [21].

Our objective is to provide a detailed and inductively generated account of cancer-type differences in a large number of postings in online cancer support forums. To this end, a novel approach is applied to better understand the types of social support embedded in OHC posts. Some previous studies relied on a commensurate coding scheme with all posts coded [22], which is not feasible with a large amount of data. In contrast, our approach, based on a word co-occurrence network analysis technique, provides a macroscopic, field-wide view for extracting information from big data, making it possible to process a massive amount of online community data. Some other studies adopted quantitative analysis approaches. For example, Seale et al. [18] conducted a comparative keyword analysis to facilitate an interpretive and qualitative examination focused on the meanings of word clusters associated with keywords. There are limitations, however, such as a lack of relevance of word clusters and an inaccurate expression of text themes. Wang et al. [23] used machine learning techniques to reveal the types of social support embedded in each post of an OHC. Wu et al. [24] proposed a social support classification method that uses LDA (latent Dirichlet allocation) to extract topic features from data. A significant limitation of this type of analysis is that a certain amount of human annotation is needed, which can be time-consuming and subjective. In addition, an unbalanced data distribution can affect prediction accuracy and performance. The approach adopted in this study builds on the aforementioned and other studies and directly addresses their limitations. Text data are organized and analyzed from a network perspective, which is system-oriented. Our analysis can identify patterns and relationships among all the words in a system. It can capture properties of individual words and provide insight into how individual words are tied to a larger web (a collection of interconnections).

Overall, this study fits well in the scope of information theory-based research. Specifically, it extracts information by conducting complex text mining, and generates knowledge on a complex system by conducting an advanced network analysis, which can more effectively describe variables by taking a system perspective and modeling interconnections. Although the analytic methods adopted in this article have roots in the existing literature, their "combination" and application to a new domain and new biomedical questions are novel. The most essential merit of this study may come from its data analysis findings, which can reveal the social support needed for multiple deadly cancers and the significant differences across cancer types: this has been suggested in the literature but not well quantified to date. The findings can be valuable for stakeholders at multiple levels including healthcare providers, patients, family members, and others. This study can also serve as a prototype for future social support analyses using state-of-the-art network and information analysis techniques, given that existing social support analyses have mostly been based on less advanced methods.

### **2. Materials and Methods**

### *2.1. Data Source*

Patientslikeme.com (PLM) is the world's largest personalized health network, with a growing community of more than 830,000 users. It was designed to facilitate information sharing between users within disease-specific communities, with the goal of improving the well-being of all users through knowledge derived from shared, real-world experiences and outcomes. In addition to general social networking service (SNS) tools such as user profiles, comments, and private messages, each community has disease-specific tools that allow patients to track and share relevant information such as symptoms, treatments, and medical data. These features have enabled PLM to play a leading role in empowering patients and facilitating social support exchanges and communication online. We note that PLM is not specific to cancer. However, it may still be one of the best resources for studying cancer social support. Beyond the aforementioned advantages, it also has a close working relationship with various healthcare providers. For example, two-thirds of its users felt that their healthcare providers approved/supported using PLM, and about one-third had printed out their patient profiles for use during healthcare visits [25].

PLM has a representative cancer community of more than 50,000 people with over 50 types of cancers, and it is focused on providing customized, disease-specific services that are closely related to our research goal. Extensive research into patient perspectives has been based on this information source. For example, there have been several evaluations of patient perspectives on diseases as well as patient-reported clinical and treatment experience studies of social support groups [26,27]. Other OHCs, such as Breastcancer.org [28], Google Groups [19], and WebMD [29], have also been utilized as data resources in related research.

A web crawler was designed and used to collect data from the PLM online cancer forums, which were launched in 2011. The original dataset consists of all the public posts and user profile information from February 2011 to September 2020. There are 12,150 posts that were contributed by 1358 users who were cancer patients or family members. All posts were in English. The cancer patients were then filtered (according to tags and conditions), leading to 6262 posts. Most of the posts (87.85%) are related to eight cancers. Our exploration shows that the dominating majority of patients had a single type of cancer, which matches clinical practice. Additional details are presented in Figure 1. Our study is centered around these eight specific cancers.

**Figure 1.** Percentages of posts for the eight types of cancer.

### *2.2. Method and Procedures*

The key steps include the construction of the word co-occurrence network, module detection, social support examination, and interpretation. They are discussed in detail in the following subsections.

### **Step 1: Word Co-Occurrence Network Construction**

The posts are split into sentences. For pre-processing, we first conduct tokenization. Stop words that are not informative are removed. Punctuation marks are excluded. Multi-word tokenization is also conducted to expand a raw token into multiple syntactic words. A word co-occurrence network is created with unigram tokens and concatenated multi-word units.
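The pre-processing and counting steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' pipeline: the stop-word list and example sentences are hypothetical, multi-word tokenization is omitted, and the edge weight is assumed to be the number of sentences in which both words appear.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately tiny stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "i", "is", "was", "my", "of", "to", "for", "me"}

def tokenize(sentence):
    """Lowercase the sentence, drop punctuation, and remove stop words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]

def cooccurrence_edges(sentences):
    """Map each unordered word pair to the number of sentences containing both words."""
    weights = Counter()
    for sentence in sentences:
        unique_words = sorted(set(tokenize(sentence)))
        for u, v in combinations(unique_words, 2):
            weights[(u, v)] += 1  # assumption: one count per sentence
    return weights

# Hypothetical example sentences (not taken from the dataset).
sentences = [
    "The chemo was hard for me.",
    "Chemo and radiation helped.",
    "Radiation was hard, chemo was hard.",
]
edges = cooccurrence_edges(sentences)
for (u, v), w in sorted(edges.items()):
    print(u, v, w)
```

Each key of `edges` is an undirected edge between two word nodes, and its value plays the role of the edge weight.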

A word co-occurrence network can be expressed as $G = (V, E)$, where $V$ is a set of nodes (where each node represents a word) and $E$ is a set of edges. Edge $e_{ij} \in E$ connects nodes $i$ and $j$ if those two words co-occur within at least one sentence. The number of edges is denoted as $m = |E|$, and $n = |V|$ denotes the number of nodes. The degree of a node $i$ is the number of edges connected to that node, that is, $k_i = |\{ j \in V \mid \{i, j\} \in E \}|$. The weight $w_{ij}$ of edge $e_{ij}$ is defined as the count of joint word occurrence, describing the co-occurrence relationship between the corresponding words in one sentence. The network is undirected by construction. Figure 2 shows a representative word co-occurrence network plotted using the software *Gephi* and containing information on the words and semantic structures. Some important statistical parameters that characterize a network are examined. First, the average shortest-path length (*ASPL*) is the average value of the shortest-path length between any two nodes in the network, which is calculated as:

$$ASPL = \frac{2\sum_{i>j} d_{ij}}{n(n-1)},$$

where $d_{ij}$ is the shortest-path length between nodes $i$ and $j$. Second, the clustering coefficient of the network, *CC*, is the average of the clustering coefficients of all the nodes in the network, defined as:

$$\text{CC} = \frac{1}{n} \sum_{i} \frac{m_i}{k_i (k_i - 1) / 2},$$

where $k_i$ is the degree of node $i$, and $m_i$ is the number of edges among the $k_i$ neighbor nodes. For reference, an Erdős–Rényi random network with the same $n$ and $m$ has an average shortest-path length of $ASPL_r \approx \ln(n)/(\ln(2m) - \ln(n))$ and a clustering coefficient of $CC_r \approx 2m/(n(n-1))$. A network is said to be a small-world network if $ASPL \approx ASPL_r$ and $CC \gg CC_r$ [30]. Third, the degree distribution $p(k)$ is defined as the probability that a randomly chosen node has exactly degree $k$. For example, if $p(k)$ follows a power-law degree distribution, that is, $p(k) \propto k^{-\gamma}$, where $\gamma$ is a positive constant, then the network is said to be scale-free [31].
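As a minimal sketch of how these quantities can be computed, the following pure-Python code evaluates *ASPL*, *CC*, and the Erdős–Rényi reference values on a small hypothetical graph. It assumes a connected, unweighted graph stored as an adjacency dictionary, and it assigns a clustering coefficient of 0 to nodes with fewer than two neighbors; that convention is an assumption, since the formula is undefined in that case.

```python
import math
from collections import deque

def shortest_path_lengths(adj, source):
    """Breadth-first search distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def aspl(adj):
    """Average shortest-path length; assumes the graph is connected."""
    n = len(adj)
    total = sum(d for s in adj for d in shortest_path_lengths(adj, s).values())
    # `total` counts every pair twice, which matches 2 * sum_{i>j} d_ij.
    return total / (n * (n - 1))

def clustering_coefficient(adj):
    """Mean local clustering coefficient (degree < 2 contributes 0 by convention)."""
    coeffs = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        m_i = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        coeffs.append(m_i / (k * (k - 1) / 2))
    return sum(coeffs) / len(coeffs)

def random_reference(n, m):
    """Approximate ASPL and CC of an Erdos-Renyi network with n nodes and m edges."""
    return math.log(n) / (math.log(2 * m) - math.log(n)), 2 * m / (n * (n - 1))

# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
n = len(adj)
m = sum(len(nbrs) for nbrs in adj.values()) // 2
print(aspl(adj), clustering_coefficient(adj), random_reference(n, m))
```

On a graph this small the comparison with the random reference is not meaningful; the example only shows how the formulas map to code.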

**Figure 2.** A sample word co-occurrence network for randomly selected posts.

The study of co-occurrence can allow researchers to quantitatively describe the semantic structures of posts. However, significant challenges appear immediately. The word co-occurrence network of posts is usually very hard to visualize, and it is impossible to directly extract meaningful information. As such, there is a strong need to simplify the network, which can reduce complexity, improve visualization, and serve other purposes. One approach is to construct subgraphs, in which most of the useful information contained in the initial graph can be preserved. Here, we achieve this goal via network modules.

### **Step 2: Module Detection**

A module is defined as a set of densely connected nodes that are sparsely connected to the other modules in the network. The Louvain algorithm [32], which is based on the optimization of the quality function known as modularity over all possible divisions of a network, is adopted in this analysis and realized using the *Gephi* software. More specifically, this algorithm identifies modules by maximizing:

$$Q(c) = \frac{1}{2M} \sum_{i} \sum_{j} \left[ w_{ij} - \lambda \frac{\ell_i \ell_j}{2M} \right] \delta_{ij}(c),$$

where $c$ is a partition of nodes, $w_{ij}$ is the edge weight between nodes $i$ and $j$, $\lambda$ is a tuning parameter,

$$M = \frac{1}{2} \sum_{i} \sum_{j} w_{ij},$$

$$\ell_i = \sum_{j} w_{ij},$$

and

$$\delta_{ij}(c) = \begin{cases} 1 & \text{if } c(i) = c(j) \\ 0 & \text{otherwise} \end{cases}.$$

Here *c*(*i*) denotes the module to which node *i* belongs in the partition *c*.
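To make the quality function concrete, the sketch below evaluates $Q(c)$ for a given partition of a small weighted graph. It is not the Louvain search itself (which the study runs in *Gephi*); it only scores one candidate partition. The graph, the node labels, and the function name `modularity` are hypothetical, and the graph is assumed to have no self-loops.

```python
def modularity(weights, partition, resolution=1.0):
    """Evaluate Q(c) for an undirected weighted graph without self-loops.

    weights:    dict mapping unordered node pairs (i, j), i != j, to w_ij
    partition:  dict mapping each node to its module label c(i)
    resolution: the tuning parameter lambda in Q(c)
    """
    strength = {node: 0.0 for node in partition}  # l_i = sum_j w_ij
    for (i, j), w in weights.items():
        strength[i] += w
        strength[j] += w
    two_m = sum(strength.values())  # equals 2M

    # Observed term: each undirected edge appears twice in the double sum.
    q = sum(2 * w for (i, j), w in weights.items() if partition[i] == partition[j])

    # Expected term: the sum of l_i * l_j over same-module ordered pairs
    # equals the squared total strength of each module.
    module_strength = {}
    for node, s in strength.items():
        label = partition[node]
        module_strength[label] = module_strength.get(label, 0.0) + s
    q -= sum(resolution * s * s / two_m for s in module_strength.values())
    return q / two_m

# Hypothetical toy graph: two triangles joined by a single edge (3, 4).
weights = {(1, 2): 1, (1, 3): 1, (2, 3): 1, (4, 5): 1, (4, 6): 1, (5, 6): 1, (3, 4): 1}
two_modules = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
one_module = {node: "A" for node in two_modules}
print(modularity(weights, two_modules), modularity(weights, one_module))
```

In this toy case the split into the two triangles scores higher than placing every node in a single module, which is the behavior a modularity-based method rewards.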

The algorithm can unfold a complete hierarchical modular structure for the network, thereby giving access to different resolutions of module detection. In *Gephi*, the resolution parameter, which describes how much between-group edges impact the modularity score, determines the granularity level at which modules are detected [33], with a lower resolution value resulting in more modules. It has been suggested that this algorithm outperforms all other module detection methods in computation time. Moreover, highly satisfactory module detection has been observed in practice. For our analysis, module detection of the word co-occurrence network can reduce the size of data, and the analysis of co-occurrences in an individual module can allow researchers to keep track of the semantic structures, which are useful in understanding social support.

### **Step 3: Social Support Quantification and Interpretation**

The analysis of word co-occurrences involves clustering words together without breaking their semantic links. In this step, we examine social support by analyzing the semantic structures of the identified modules. As a representative example, Figure 3 presents a module in the word co-occurrence network for ovarian cancer. The words grouped in one module are likely to describe tightly connected topics. For example, most of the words in Figure 3 are related to treatments and medical terminologies. As such, this module can be considered as describing informational support.

• The Taxonomy of Social Support.

Several taxonomies have been developed for the categories of support messages (see for example, [34,35]). Literature on social support suggests that OHCs mainly offer three types of social support: informational support, emotional support, and companionship [11,36]. Informational support is the transmission of facts, suggestions, and/or guidance to community users. Example topics include medication side effects, ways to deal with a symptom, experience with a physician, and medical insurance problems. Emotional support is the expression of understanding, encouragement, empathy, affection, affirmation, caring, and concern. Such support can help reduce stress and anxiety. Companionship consists of chatting, humor, teasing, and discussions of daily life that are not necessarily related to health problems. Examples include diet plans, birthday wishes, holiday plans, and online scrabble games. Companionship helps expand or reinforce a group member's connections.

**Figure 3.** A sample module from the word co-occurrence network for ovarian cancer.

Through the quantitative analysis of semantic structures, the prevalence of specific types of support messages can be revealed. To do this, the first step is to calculate the proportion of edges in each module, which is defined as:

$$P_{C_k} = \frac{\sum_{i \in C_k} \left| \{ j \in C_k \mid \{ i, j \} \in E \} \right|}{\sum_{k=1}^{K} \sum_{i \in C_k} \left| \{ j \in C_k \mid \{ i, j \} \in E \} \right|}, \quad k = 1, \dots, K,$$

where $K$ is the number of modules, $C_k$ represents module $k$, and $\sum_{i \in C_k} \left| \{ j \in C_k \mid \{ i, j \} \in E \} \right|$ denotes the sum of edges between nodes in $C_k$. Then, we can compute the proportion of each social support category by summing up the proportions from the individual modules. Exploring communication preferences and language use can also be achieved by taking a closer look at the semantic structures.
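The two-stage calculation (per-module edge proportions, then aggregation by support category) can be sketched as follows. The edge list, module assignment, and category labels are hypothetical; in the study, categories are assigned by interpreting each module's semantic structure. Each within-module edge is counted once here, whereas the formula counts it from both endpoints; the resulting ratio is the same.

```python
from collections import Counter

def module_edge_proportions(edges, module_of):
    """Share of all within-module edges that falls in each module."""
    within = Counter()
    for i, j in edges:
        if module_of[i] == module_of[j]:  # edges between modules are ignored
            within[module_of[i]] += 1
    total = sum(within.values())
    return {k: count / total for k, count in within.items()}

def support_proportions(module_props, support_of):
    """Sum the module proportions within each (manually assigned) support category."""
    totals = Counter()
    for k, p in module_props.items():
        totals[support_of[k]] += p
    return dict(totals)

# Hypothetical toy network with four modules and one between-module edge.
edges = [
    ("chemo", "dose"), ("chemo", "scan"), ("dose", "scan"),  # module 1
    ("hope", "fear"),                                        # module 2
    ("birthday", "cake"), ("cake", "party"),                 # module 3
    ("doctor", "insurance"),                                 # module 4
    ("scan", "fear"),                                        # between modules 1 and 2
]
module_of = {
    "chemo": 1, "dose": 1, "scan": 1, "hope": 2, "fear": 2,
    "birthday": 3, "cake": 3, "party": 3, "doctor": 4, "insurance": 4,
}
support_of = {1: "informational", 2: "emotional", 3: "companionship", 4: "informational"}

props = module_edge_proportions(edges, module_of)
print(support_proportions(props, support_of))
```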

### **3. Results**

We apply the analysis approach described above to the data on individual cancers. Pancreatic cancer is highlighted as a representative example.

### *3.1. Word Co-Occurrence Network*

Sentences drawn from the posts were tokenized prior to the co-occurrence search, resulting in a list of unique co-occurrence pairs. The word co-occurrence network was then constructed for each cancer. Summary information on the word co-occurrence networks is provided in Table 1. Based on this, an overview of the co-occurrence networks can be provided.


**Table 1.** Summary of the word co-occurrence networks.

Compared to a same-scale random network, all the networks have similar average shortest-path lengths and higher clustering coefficients. For example, the average shortest-path length of the pancreatic cancer network is 3.595 (in comparison, an Erdős–Rényi random network has a value of 2.258), and the average clustering coefficient is 0.861 (in comparison, an Erdős–Rényi random network has a value of 0.013). This suggests the presence of the small-world phenomenon in the networks.

In the analysis of degree distribution, it is found that all networks exhibit power-law degree distributions, with the power-law exponent *γ* ranging between 2.4 and 4.8. Table 1 shows that *γ* of the ovarian cancer network is the largest, and that of the lung cancer network is the smallest. The scale-free characteristics suggest that the connectivity values of a small number of nodes are quite large (with a large number of connections), rendering them leading roles in the networks. On the other hand, most other nodes have limited connections.
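The text does not state how the exponent $\gamma$ was estimated. As one possible illustration, the sketch below computes the empirical degree distribution $p(k)$ and a commonly used approximate maximum-likelihood estimate, $\hat{\gamma} \approx 1 + n_{\text{tail}} / \sum_i \ln\big(k_i / (k_{\min} - 1/2)\big)$, over nodes with degree at least $k_{\min}$. The choice of estimator, the cutoff `k_min`, and the degree sequence are all assumptions made for illustration, not the authors' procedure.

```python
import math
from collections import Counter

def degree_distribution(degrees):
    """Empirical p(k): fraction of nodes with degree exactly k."""
    n = len(degrees)
    return {k: count / n for k, count in sorted(Counter(degrees).items())}

def power_law_exponent(degrees, k_min):
    """Approximate MLE of gamma for the tail k >= k_min (k_min is chosen by the analyst)."""
    tail = [k for k in degrees if k >= k_min]
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in tail)
    return 1.0 + len(tail) / log_sum

# Hypothetical degree sequence: many low-degree nodes and a few hubs.
degrees = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 6, 9, 15]
p_k = degree_distribution(degrees)
gamma_hat = power_law_exponent(degrees, k_min=2)
print(p_k)
print(round(gamma_hat, 3))
```

The approximation is generally considered more reliable for larger values of `k_min`, and a real analysis would also check goodness of fit before calling a network scale-free.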

### *3.2. Module Detection*

Take pancreatic cancer as an example. When we visualize its network (Figure 4), words in different modules are represented with different colors. Under the default resolution value of 1.0, there are 72 modules, and the modularity is 0.769. Modules with fewer than five words are removed to improve presentation, leading to 25 modules. Among the remaining modules, the average clustering coefficient is 0.890, suggesting a significant clustering effect. The silhouette for each module is also calculated. The mean silhouette value is 0.649. The silhouette values of the five largest modules are shown in Table 2, which suggest a satisfactory partitioning of the network. The same analysis is also conducted on the other cancers, and the summary of the module detection results is presented in Table 3.

**Figure 4.** Word co-occurrence network for pancreatic cancer. Different modules are represented using different colors.


**Table 2.** Information on the five largest modules for pancreatic cancer.


**Table 3.** Summary of module detection.

### *3.3. Social Support Quantification and Interpretation*

Summary information for the five largest modules for pancreatic cancer is shown in Table 2. It is observed that the themes of modules 1–4 are mainly concentrated around cancer information, that is, information social support. The keywords of module 5 are mostly associated with the feelings of patients, corresponding to emotional social support. With a similar analysis of the other modules, the proportion of edges in each module is calculated, and the proportions of different social support types after aggregation are obtained. Results are shown in Table 4.

**Table 4.** Proportions of different social support categories.


### 3.3.1. Differences across Diseases in Types of Social Support

Table 4 shows the proportion of each social support category for each cancer type. Overall, information support (mean 47.14%) and companionship (mean 28.26%) are exchanged most frequently. Sharing is caring, and most posts talk about medical treatments and daily life. The Chi-squared analysis confirms that the overall distribution of social support categories is significantly different across cancer types (*p* < 0.001). Specifically, lung cancer, colon cancer, and pancreatic cancer have the highest percentages (above 50%) of information support. Ovarian and breast cancers have the lowest percentages of information support. Breast cancer has the highest percentage of emotional support (40.45%), followed by prostate cancer (36.73%), ovarian cancer (36.43%), and skin cancer (24.19%). Skin cancer has the highest percentage of companionship (33.79%), while breast cancer (18.87%) and prostate cancer (22.12%) have the lowest.
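For readers unfamiliar with the test, the sketch below computes a Pearson chi-squared statistic for a contingency table of cancer type by support category. The counts are entirely hypothetical and are not the study's data; the text does not specify the counting unit, so the rows here simply stand in for counts (for instance, within-module edges) per cancer type.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column variables.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof

# HYPOTHETICAL counts. Rows: three cancer types.
# Columns: informational, emotional, companionship.
counts = [
    [60, 15, 25],
    [40, 40, 20],
    [45, 25, 30],
]
stat, dof = chi_squared_statistic(counts)
print(f"chi2 = {stat:.3f}, dof = {dof}")
```

The statistic is then compared against a chi-squared distribution with `dof` degrees of freedom to obtain a *p*-value, typically with a statistics library.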

### 3.3.2. Differences across Diseases in Communication Preference and Language Use

There is evidence of differences in language use and communication preference across diseases. Four cancers (breast, ovarian, prostate, and skin) have pronounced communication preference and language use patterns. Figure 5 shows the representative network modules, revealing the emotional support of these four cancers. It is observed that breast and ovarian cancer patients mainly talked about their pains and feelings, and their language style was sentimental. In comparison, prostate cancer patients talked more about their thoughts and beliefs, and their language style was calmer and more rational. Figure 6 shows the companionship traits of the four cancers. Skin and breast cancer patients mainly talked about their daily lives, ovarian cancer patients talked more about their family members, and prostate cancer patients talked more broadly. Differences in language use and communication preference mainly exist in the categories of emotional support and companionship. Overall, these findings can reveal several key differences in the use of OHCs across cancer types.

**Figure 5.** Emotional social support revealed by network modules: (**a**) breast cancer; (**b**) prostate cancer; (**c**) ovarian cancer; (**d**) skin cancer.

**Figure 6.** Companionship revealed by network modules: (**a**) breast cancer; (**b**) prostate cancer; (**c**) ovarian cancer; (**d**) skin cancer.

### **4. Discussion**

Our findings are mostly consistent with published research. For example, information support has been identified as the most common type of social support, and published literature has suggested that messages of emotional well-being and medical-related comments are most common on breast cancer sites [17,19,37]. Meanwhile, our research has also added to the existing knowledge of the significant differences between social support categories across cancer types. For example, lung cancer, colon cancer, and pancreatic cancer survivors have been found to mainly utilize OHCs for information-gathering. Notably, prostate cancer survivors also used OHCs as a source of emotional support. Breast, ovarian, prostate, and skin cancer survivors appeared to be in most need of emotional social support. This is likely because people with these cancers had to bear more mental pressure and had a higher risk of also experiencing depression after a new cancer diagnosis [38]. For skin cancer, the high percentage of companionship indicates that the survivors had many daily struggles that led them to seek out support.

Besides adding to existing knowledge by complementing and extending previous research into computer-mediated social support communicated by cancer patients, our analysis has also demonstrated the need for greater recognition of the differences between people with different types of cancer. This knowledge can assist in the design of OHCs. The work can also be a resource for guiding cancer survivors and their families to OHCs that tend to focus more on their specific types of cancer and issues. Similarly, clinicians need to be more aware of the different needs of patients and their families and be able to direct them to online resources that are the most likely to be supportive. In this line, recent studies have shown that the internet has changed the patterns of doctor–patient communication. Social support in OHCs has sometimes played an ambiguous role, making patients behave in a strategic, uncooperative way toward physicians [39,40]. Patient care services have been recommended to enhance the patient–physician relationship. More studies on patients' specific support needs and patient–physician cooperation are needed.

The adopted analysis method can also be used, along with or in replacement of machine learning techniques, in the identification of user roles in OHCs. Further studies on user roles (for example, the differences between lurkers and posters, their specific behaviors, and impact) are also warranted.

### *Limitations*

This study inevitably has limitations. Although PLM is representative and its data have also been examined in other published studies, it is a single OHC and may be subject to bias, although this has not been observed in existing studies. We have extracted all cancer forum data from PLM. Still, the amount of data for some cancers is limited. This may be true for pancreatic, ovarian, and renal cell cancers. Another data limitation is the possible lack of reliability. Medical information researchers have found that social media sites are characterized by limited information [41]. Online users may also be vulnerable to both hidden and overt conflicts of interest, which they may be incapable of interpreting [42]. In this dataset, there is a lack of information on the duration of diagnosis. As such, we are not able to conduct, for example, a longitudinal analysis to examine temporal trends. Another missed opportunity is that, with a small number of patients with multiple types of cancers, we are not able to provide insights into polychronic conditions.

There may also be methodological limitations. For example, there is an emphasis on a module-based analysis over individual-message based, which may lead to certain challenges in result interpretation. We have studied the most essential network properties, and it may be of interest to explore more subtle network information.

### **5. Conclusions**

This study has made both domain-specific and methodological contributions to the investigation of OHC use among cancer survivors. There is evidence, some of which confirms and some of which adds to the existing literature, about the significant differences across diseases in terms of social support needs. Specifically, lung cancer, colon cancer, and pancreatic cancer survivors mainly utilized OHCs to meet information support needs. Healthcare providers and physicians are recommended to provide guidance to patients and families on how to gather information and verify its authenticity. Breast, ovarian, prostate, and skin cancer survivors were found to be the most in need of emotional support. For them, targeted patient care can be advice and help to build healthy relationships in a community. Moreover, there is evidence for differences across diseases in language use and communication preference when exchanging social support. For example, skin and breast cancer patients mainly talked about their daily lives, ovarian cancer patients talked more about their family members, and prostate cancer patients talked more about their thoughts and beliefs. Getting familiar with patients' communication preferences can be valuable for establishing the patient–provider bond. With collaboration, liking, and trust, patients are more likely to adhere to treatment especially for long-term medical issues. This work has also introduced a novel method for social support quantification and interpretation, which has multiple advantages over the analyses applied in previous studies.

**Author Contributions:** Conceptualization, S.M. and M.L.; methodology, S.M. and M.L.; software, M.L.; investigation, X.Z. and J.C.; data curation, X.Z. and J.C.; writing—original draft preparation, M.L.; writing—review and editing, S.M. and M.L.; visualization, M.L.; supervision, S.M.; project administration, S.M.; funding acquisition, M.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was partly supported by the China Postdoctoral Science Foundation, grant number 2019M663764, the big data visualization technology open sharing platform of Science and Technology Department of Shaanxi Province, grant number 2020PT-029 and National Institutes of Health R03 CA241699.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The analyzed data are in the public domain and accessible to all researchers. However, we do not have the authority to re-distribute data.

**Acknowledgments:** We thank the editors and reviewers for their kind consideration and careful review.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


### *Article* **Improved Dividend Estimation from Intraday Quotes**

**Pontus Söderbäck <sup>1</sup>, Jörgen Blomvall <sup>1</sup> and Martin Singull <sup>2,\*</sup>**


**Abstract:** Liquid financial markets, such as the options market of the S&P 500 index, create vast amounts of data every day, i.e., so-called intraday data. However, this highly granular data is often reduced to a single time when used to estimate financial quantities. This under-utilization of the data may reduce the quality of the estimates. In this paper, we study the impact on estimation quality of using intraday data to estimate dividends. The methodology is based on earlier linear regression (ordinary least squares) estimates, which have been adapted to intraday data. Further, the method is generalized in two respects. First, the dividends are expressed as present values of future dividends rather than dividend yields. Second, to account for heteroscedasticity, the estimation methodology is formulated as a weighted least squares problem, where the weights are determined from the market data. This method is compared with a traditional method on out-of-sample S&P 500 European options market data. The results show that estimates based on intraday data have, with statistical significance, a higher quality than the corresponding single-time estimates. Additionally, the two generalizations of the methodology are shown to improve the estimation quality further.

**Keywords:** big data adaptation; dividend estimation; options markets; weighted least squares

### **1. Introduction**

This paper presents a method for extracting dividend information from the equity derivatives market using exchange-traded European-typed call and put options. The central methodology in this paper is an extension of the work of Desmettre et al. [1], in which the well-known put–call parity is formulated as a linear regression. Moreover, we present a novel option position (the sloped asset position), from which it is possible to compute a dividend estimate without specifying an interest rate. Furthermore, throughout the paper, the primary application in mind for the estimates is derivative pricing. This application framing may prima facie seem like an unnecessary limitation, but we argue that the estimates have often-overlooked inherent assumptions that should be aligned with the application. The derivative pricing application follows naturally, whereas other applications require non-trivial adjustments.

One research question is the connection between an asset and its dividends. One of the earliest examples is the asset valuation method of discounted cash flow. The principal idea of that method is that there is a relationship between the price of an asset and its future dividend payments. A related research question is to understand the effect of dividend payments on asset prices. The price of a dividend-paying asset in a frictionless market, ceteris paribus, would drop when a dividend is paid, and the size of the drop would be the size of the dividend payment, see, e.g., Campbell and Beranek [2] and Miller and Modigliani [3]. However, this theory is not supported in empirical studies, and the price generally drops less than the size of the dividend. Campbell and Beranek [2] attribute the differences to tax effects. This idea was elaborated into a formula by Elton and Gruber [4], where the differences between dividend and capital gain taxes were key. Other explanations have been presented, such as transaction costs and behavioral effects. The former was studied by, e.g., Kalay [5] and Boyd and Jagannathan [6], and the latter by Hartzmark and Solomon [7]. Practical imperfections such as a time difference between the ex-dividend date and the payment date can also explain this discrepancy, as was claimed by Wilmott [8]. This paper neither elaborates upon the discounted cash flows method nor provides explanations of imperfect drops in asset prices. Still, these effects must be considered when estimating dividends and evaluating these estimates. We present the implications of the imperfections for estimate interpretation and how to evaluate estimates accordingly.

**Citation:** Söderbäck, P.; Blomvall, J.; Singull, M. Improved Dividend Estimation from Intraday Quotes. *Entropy* **2022**, *24*, 95. https://doi.org/10.3390/e24010095

Academic Editors: S. Ejaz Ahmed and Farouk Nathoo

Received: 6 December 2021 Accepted: 29 December 2021 Published: 7 January 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Dividends are also central in the derivative pricing literature. Dividends have recently started to be seen as an independent asset class according to Filipović and Willems [9], who also provide an overview of this market. This asset class has some interesting properties, but it is not used in this paper. We elaborate on this decision in Section 2.1. Instead, we follow the traditional focus, which has been on modeling the effect of dividend payments on the asset price. One of the first to incorporate dividends in derivative pricing was Merton [10], who modeled dividends as continuous adjustments. Another approach is to have discrete adjustments. Discrete adjustments can be applied either as an adjustment of the spot price or an adjustment of the price as time evolves, where the former is sometimes known as an escrowed model (for an overview of this model, see Haug et al. [11], Frishling [12], and Vellekoop and Nieuwenhuis [13]). The use of discrete adjustments is limited since these models have drawbacks. The former contains the possibility of arbitrage opportunities and logical flaws, while the main problem with the latter is its complexity, which often leads to costly methods; see the more detailed discussion in Haug et al. [11] and Vellekoop and Nieuwenhuis [13]. These problems can be avoided by following Merton [10] and modeling the dividend as a constant continuous yield for each period of maturity, even though that is a poor representation of reality. For example, that method has been applied to models based on stochastic differential equations, such as Carr and Madan [14], Duffie et al. [15], and Carr et al. [16]; implied volatility models such as Gatheral and Jacquier [17]; and local volatility models such as Derman and Kani [18], Derman and Kani [19], and Geng et al. [20].

Regardless of the method, a critical concept is making pricing consistent, which is a critique against the escrowed approach. Even so, estimating dividends, either as a yield or as a present value of future dividends, from market data is not well-studied in the literature. The estimation method that we propose naturally handles consistent pricing. Furthermore, we study the difference between estimating yield and present value and find that the latter is preferable, regardless of the choice of pricing model. We explain this performance difference by the inherent connection between the dividend yield and the price of the underlying asset.

Although there has been little effort to estimate dividends from the derivative pricing perspective, the topic has received more attention in other fields. For example, dividends have long been of interest in studies, such as Fama and French [21], on how dividend yields predict stock returns. Fama and French [21] were not the first to take an interest in this topic; for an overview of preceding work see their paper and for succeeding papers see Golez [22]. The aims of these papers are of limited relevance to the current study; what is relevant is how dividends are estimated. Earlier papers used historical (realized) dividends, but Golez [22] claims that using those could decrease predictability and argued further that inferring dividend yields from the derivatives market is beneficial. Bilson et al. [23] complemented the work of Golez [22] by introducing a novel approach to dividend growth rates implied by market data. It is important to note that Fama and French [21], Golez [22], and Bilson et al. [23] had other aims than to develop a dividend estimation methodology.

Our focus on estimation methodology is not common, but a similar exception is the linear regression (ordinary least squares) methodology presented by Desmettre et al. [1]. In this paper, we generalize the work of Desmettre et al. [1]. The work of Desmettre et al. [1] and our paper can be seen as a parallel to recent work in interest rate estimation by Azzone and Baviera [24] and Blomvall et al. [25]. The estimation methodologies are similar for interest rates and dividends, but the latter contains additional nuances that must be considered. Papers that have estimated dividend quantities from data, such as Golez [22], Bilson et al. [23], and Desmettre et al. [1], have all based their estimates on data from a single time. We expand the data, in the same way as Blomvall et al. [25] do, to use intraday data and find that it provides more stable estimates, i.e., estimates that are less sensitive to market noise. Moreover, intraday data introduces a coupling to market dynamics that must be considered via a slight reformulation of the regression developed by Desmettre et al. [1]. Additionally, we present a generalization in the form of weighted least squares formulations.

The estimation methodology is one part of this paper. Furthermore, Desmettre et al. [1] argue that their method and results are limited to markets that meet specific conditions, e.g., the French and German equity markets. This paper presents another interpretation of the quantities, enabling us to evaluate the dividend estimates for more markets, e.g., the US S&P 500 equity market. However, our method is not applicable when used along with equity shares since our methodology relies on relationships between European-typed options. A key element of the result evaluation is the introduction of the sloped asset position, which acts as an independent method. This position is analogous to the box position used in interest-rate estimation.

The remainder of this paper is arranged as follows. First, we start with the modeling of dividends, where different estimation methods are also presented. We continue by discussing our data set: the raw data used in the studies and the processing that we performed on the data. In the subsequent section, we present our evaluation methodology, numerical results, and related discussions. Finally, the paper ends with a conclusion and summary of the results found in the paper.

### **2. Dividend Modeling and Estimation**

A dividend payment is a way to distribute value from companies to their shareholders. The basic dynamic that we utilize in our methodology is that the asset price drops when the asset pays a dividend. To obtain a forward-looking estimate, we use the derivatives market. To schematically exemplify this, assume that we have a—highly theoretical—situation with two European-styled call options with identical contract specifications, i.e., identical time to maturity, strike price, and underlying asset, but one option has an underlying asset that pays a dividend while the underlying asset of the other option does not. The option with the dividend-paying asset has a lower price than the other since its payoff at expiry is smaller. In this highly theoretical—but unrealistic—setting, we could infer the dividend effect from the price difference between the options. It is possible to achieve a similar inference in a realistic setting by utilizing the derivatives market.

The idea is simple, but the interpretation of the estimated quantity, even in the idealized setting of the above example, is rather complex. First, in the example above, the difference between the two option prices is not the dividend, since the option owner is not entitled to the dividend in either case. The difference is instead dependent on how the asset price reacts to dividend payments. This insight is, in our view, not sufficiently emphasized in the dividend estimation literature. Nevertheless, it is of significant importance when interpreting estimates.

The difference between drops and dividends has been empirically studied for shares, and different explanations have been proposed. This research question has not been closed, but when we look into options with an index as the underlying, we argue that other additional effects may also be present. The most apparent difference between a single share and an index is that the latter is neither a traded asset nor pays dividends. The value of an index, i.e., the quoted index value, is not a traded price but a computation from the index constituents' prices according to an index methodology. From this computation, it follows that the index quote should, ideally, experience a drop that reflects each constituent's equity price drop and weight. There might also be details in the index methodology that further complicate the situation. For example, the S&P 500 index quote is not adjusted for standard cash dividends but is adjusted for extra cash dividends. Hence, in theory, the type of dividend payment is reflected differently in the index quotes. These complications make the estimate interpretation more complex than for a single share.

To summarize, the crucial insight is that the effect seen in the market is not only due to the dividend, but is a result of a mixture of the dividend and its imperfections. Despite the importance of this insight, it has received little attention. Desmettre et al. [1] present a related argumentation, but they limit the imperfections to the tax situation. We instead argue that the estimates should not perfectly reflect the realized dividends but rather a latent quantity. This claim is similar to the claim of Desmettre et al. [1], but the difference is that we do not see tax as the sole imperfection. Nevertheless, throughout this paper, we do not explicitly clarify this point repeatedly and refer to the quantity simply as a dividend to increase readability.

The initial theoretical situation—with two different behaviors, i.e., prices, of the same underlying asset—is impossible to replicate in reality. The key to making the idea usable in reality is creating derivative positions related to the underlying asset. In the following two sections, we first discuss data and different relationships that can be used, and then we discuss how to infer estimates from the relationships.

### *2.1. Market Dividend Relations*

The aim of this paper is to create derivative relationships, or positions, of exchange-traded contracts from which it is possible to infer the dividend. The relationships that we use should fulfill two properties. First, the data quality of the position should be high, and, second, the position should not require complex modeling of, e.g., the underlying asset, but rather suffice with few assumptions. The former is not a strict definition, but we regard quality as a synonym for liquidity in this paper, i.e., high liquidity is high quality. The reason that we want high-quality data is to ensure that the data quality does not limit the estimates. The limitation of the second property comes from an interpretation of the estimates. The drawback of complex modeling is that the dividend is strongly coupled to the specific model. Estimating the dividend from such a model requires a calibration of the other model parameters. In essence, this coupling makes the dividend an additional model parameter in the calibration process, and, hence, the dividend is affected by the other parameters. This may be a valid method for the calibration of the model, but the dividend is not transferable to other models or applications. Thus, the market contracts we considered in this study were limited by the two properties: high-quality data and non-complex modeling.

The traditional market, when inferring dividends, has been the equity derivatives market, such as equity futures or equity options. An alternative that may seem attractive is the dividend derivatives market, because of its close connection to dividends, and also because it was used to infer dividend information by van Binsbergen et al. [26]. The market is interesting, but we see three drawbacks of using this market for dividend estimation. First and most important, the underlying of these derivatives is a dividend point index, which is computed from realized dividends. Therefore, the inherent information in the dividend derivatives is linked to realized dividends rather than the effect dividends have on the equity index. This discrepancy makes the dividend derivatives market ill-suited for our estimation since we want to estimate the effect on the asset rather than the dividend itself. Second, van Binsbergen et al. [26] introduced a model that makes the corresponding estimates less tractable and violates our second property. Additionally, the method used has been questioned by Tunaru [27], who argues that van Binsbergen et al. [26] fail to recognize that dividend derivatives are part of an incomplete market and, thus, that results obtained using them are invalid. Third, the asset class is not well developed in most markets, and its liquidity is low.

To avoid both illogical approaches and poor liquidity, we use the equity market. In theory, a wide range of derivatives could be used, from plain vanilla to exotic contracts. However, we exclude contracts in the latter category since they require pricing models or have low liquidity. To conclude, in this study, we considered futures contracts and plain vanilla call and put options to infer dividend information without introducing models while using liquid market data.

The literature for estimating dividends from market data has had two prevailing contract types: futures contracts and plain vanilla European options. The relationships that typically relate to these contracts are the future-basis and the put–call parity. Variations of these positions have been presented, but the common denominator is that they can be constructed almost exclusively and uniquely with exchange-traded contracts. The only component of the relationships that is not directly market observable is the spot interest rates, which match the periods of maturity of the contracts. These unobservable interest rates must be computed from market data. This computation and the corresponding contracts are undesirable because of their increased complexity and reduced tractability.

### 2.1.1. General Notation

This paper works with two dividend formulations: a yield formulation and a present value formulation. We let *δ*(*t*; *T*) denote the dividend yield estimated at time *t* for the period [*t*, *T*] and *D*(*t*; *T*<sub>1</sub>, *T*<sub>2</sub>) denote the estimated present value at time *t* of dividends paid in the period [*T*<sub>1</sub>, *T*<sub>2</sub>]. To simplify the notation when the start of the period coincides with the time of estimation, we also introduce *D*(*t*; *T*) ≡ *D*(*t*; *t*, *T*). Another key component in the dividend estimation is the continuously compounded interest rate. We use the formulation used by Blomvall et al. [25] and decompose the interest rate into two terms: a risk-less interest rate and an additional spread. We denote the risk-less interest rate and the spread at time *t* for the period [*t*, *T*] as *r<sub>o</sub>*(*t*; *T*) and *s*(*t*; *T*), respectively.

This paper considers options with S&P 500 as their underlying, where the standard S&P 500-option contract is of the European-type. The choice of European-typed options was not made inadvertently. Our method does not hold, in general, for American-typed options. European call and put options are used, but they are always considered in a pair as synthetic forward positions. A synthetic forward position is created from a call–put option-pair, i.e., two options with the same underlying, the same time to maturity, and the same strike price. A long (short) synthetic forward position is equivalent to a long (short) call option position and a short (long) put option position. The name synthetic forward position stems from the payoff, which is similar to a standard forward contract, i.e., linear in the price of the underlying. The payoffs are similar, but there are differences between a standard forward contract and the synthetic forward position. The former is unique for each time of maturity, and, upon entering, the two parties agree on a forward price that marks the contract to the market, i.e., no money is transferred upon entering. The synthetic forward position, on the contrary, is not unique for each time of maturity, and it is possible to specify the strike prices. Thus a money transfer can be necessary to mark the contract to the market.
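The linearity of the synthetic forward payoff can be checked with a minimal sketch. The function names and the strike value below are hypothetical and chosen only for illustration; they do not come from the paper.

```python
def call_payoff(spot_at_expiry: float, strike: float) -> float:
    """Payoff at expiry of a long European call."""
    return max(spot_at_expiry - strike, 0.0)


def put_payoff(spot_at_expiry: float, strike: float) -> float:
    """Payoff at expiry of a long European put."""
    return max(strike - spot_at_expiry, 0.0)


def synthetic_forward_payoff(spot_at_expiry: float, strike: float) -> float:
    """Long call plus short put with the same strike and maturity."""
    return call_payoff(spot_at_expiry, strike) - put_payoff(spot_at_expiry, strike)


strike = 4000.0  # hypothetical strike price
payoffs = [synthetic_forward_payoff(s, strike) for s in (3500.0, 4000.0, 4500.0)]
print(payoffs)  # each payoff equals spot_at_expiry - strike
```

Whatever the index level at expiry, the combined payoff equals the index level minus the strike, which is the linear payoff of a forward.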

The quote of the S&P 500 index is computed and presented as a unique value, but the market prices for tradable financial assets are only precise down to a bid–ask spread. Despite this market feature, we formulated all the relationships with a unique price in the remainder of this section. Details are discussed in Section 3, but the unique prices used were mid-prices, i.e., the arithmetic means of the bid and ask prices.

### 2.1.2. Future-Basis

The future-basis is the relationship between a future and spot price for a futures contract. This position is, in essence, used by Andersen and Brotherton-Ratcliffe [28], and it is also described in various textbooks and practitioner-geared literature, such as Wilmott [8] (p. 1040). Let *S*(*t*) denote the spot price at time *t*, and *F*(*t*; *T*) the future price at time *t* with a time of maturity *T*, then the future-basis can be written as

$$F(t;T) = S(t)e^{\left(r\_o(t;T) + s(t;T) - \delta(t;T)\right)(T-t)}\tag{1}$$

and

$$F(t;T) = S(t)e^{(r\_o(t;T) + s(t;T))(T-t)} - D(t;T),\tag{2}$$

where the former holds for the dividend yield and the latter for a present value of dividends. These relationships can be rewritten as dividend estimates:

$$\hat{\delta}(t;T) = r\_o(t;T) + s(t;T) - \frac{1}{T-t} \ln\left[\frac{F(t;T)}{S(t)}\right]\tag{3}$$

and

$$\hat{D}(t;T) = S(t)e^{(r\_o(t;T)+s(t;T))(T-t)} - F(t;T). \tag{4}$$

One clear advantage of basing dividend estimates on the future-basis is that the estimates are uniquely specified. On the other hand, we see three drawbacks to basing the estimation on the futures market. First, the interest rate must be determined, and potential misspecification affects the dividend estimate. Second, the liquidity of the futures contract is only high for short times to maturity, and, hence, estimations corresponding to longer times to maturity are challenging. Third, the uniqueness of the estimate comes with a drawback: relying on a single contract for an estimate makes it fragile to noise in the futures price. The second and third drawbacks can be resolved using the options market, e.g., via the put–call parity. Moreover, all three drawbacks can be removed entirely with the sloped asset position, but at the cost of non-unique estimates. It is also possible to mitigate the first and third problems using a suitable estimation method, which is discussed in Section 2.2.1.
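The future-basis estimates, and the first drawback in particular, can be illustrated with a minimal sketch. The yield is obtained by solving Equation (1) for *δ* and the present value by solving Equation (2) for *D*. All numerical inputs and function names are hypothetical; they are assumptions made for illustration and are not data or code from the paper.

```python
import math


def yield_from_future_basis(futures, spot, rate, spread, tau):
    """Dividend yield obtained by solving Equation (1) for delta.

    tau is the time to maturity T - t in years; rate and spread are the
    continuously compounded r_o(t;T) and s(t;T).
    """
    return (rate + spread) - math.log(futures / spot) / tau


def pv_dividend_from_future_basis(futures, spot, rate, spread, tau):
    """Dividend quantity D(t;T) obtained by solving Equation (2) for D."""
    return spot * math.exp((rate + spread) * tau) - futures


# Hypothetical inputs, chosen only for illustration.
spot, rate, spread, tau = 4500.0, 0.010, 0.002, 0.5
true_yield = 0.015
futures = spot * math.exp((rate + spread - true_yield) * tau)  # Equation (1)

recovered = yield_from_future_basis(futures, spot, rate, spread, tau)

# A 10 basis point error in the assumed rate shifts the estimate one-for-one.
misspecified = yield_from_future_basis(futures, spot, rate + 0.001, spread, tau)
rate_error_passthrough = misspecified - recovered

print(f"recovered yield: {recovered:.6f}")
print(f"shift caused by rate error: {rate_error_passthrough:.6f}")
```

In this sketch the single futures quote determines the estimate completely, so any noise in that quote or any error in the assumed rate carries straight through to the dividend estimate.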

### 2.1.3. Put–Call Parity

The put–call parity is a relationship that relates the price of a European call option, the price of a European put option, and the price of their underlying asset. The put–call parity does not hold for American call and put options since American-typed options can be exercised early, i.e., prior to maturity. This optionality gives the American-typed options a premium that violates the parity. However, Kragt [29] presents a methodology to estimate these premiums simultaneously with the dividend component, which is outside the scope of this paper. Basing dividend estimates on the put–call parity is not new. Examples are van Binsbergen et al. [30], Hull [31], and Desmettre et al. [1], where the second formulates the parity with a dividend yield and the other two with the present value of the dividends. We let *c*(*t*; *K*, *T*) and *p*(*t*; *K*, *T*) denote the European call and put option prices, respectively, at time *t* for options with strike price *K* and time of maturity *T*. The put–call parity formulated with a yield and a present value can be written as:

$$c(t; K, T) - p(t; K, T) = S(t)e^{-\delta(t; T)(T - t)} - Ke^{-(r\_o(t; T) + s(t; T))(T - t)},\tag{5}$$

and

$$c(t; K, T) - p(t; K, T) = S(t) - D(t; T) - Ke^{-(r\_o(t; T) + s(t; T))(T - t)},\tag{6}$$

respectively. The left-hand sides of the two relationships can be identified as synthetic forward positions, which we denote as *f*(*t*; *K*, *T*) ≡ *c*(*t*; *K*, *T*) − *p*(*t*; *K*, *T*). From the parities and fixed *t*, *T*, and *K*, it is possible to find direct formulas of the dividend estimates:

$$\begin{split} \hat{\delta}(t;T) &= -\frac{1}{T-t} \ln \left[ \frac{c(t;K,T) - p(t;K,T) + Ke^{-(r\_o(t;T) + s(t;T))(T-t)}}{S(t)} \right] \\ &= -\frac{1}{T-t} \ln \left[ \frac{f(t;K,T) + Ke^{-(r\_o(t;T) + s(t;T))(T-t)}}{S(t)} \right], \end{split} \tag{7}$$

and

$$\hat{D}(t;T) = S(t) - Ke^{-(r\_o(t;T) + s(t;T))(T-t)} - c(t;K,T) + p(t;K,T) \tag{9}$$

$$=S(t) - Ke^{-(r\_o(t;T) + s(t;T))(T-t)} - f(t;K,T),\tag{10}$$

for the yield and present value, respectively. All estimates for a given time of maturity should, in theory, be the same irrespective of the strike price. This does not hold in practice, and the estimates differ across strike prices. These multiple estimates mitigate the fragility of relying on a single contract but at the cost of non-uniqueness. If a single-valued estimate is necessary, an aggregation method is required. Additionally, the options market is more liquid than the futures market for most times to maturity. The exception is short times to maturity, where the futures market is more liquid than the options market. The first drawback of the future-basis (the need for an interest rate) is also present for the put–call parity. One way to remove this need is to utilize a new option position: the sloped asset position.
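How per-strike estimates differ once the quotes deviate from the parity can be sketched as follows. The snapshot is hypothetical: the spot level, rate, strikes, and noise terms are made-up numbers, and the function names are assumptions introduced for illustration only.

```python
import math


def pv_dividend_from_parity(spot, strike, forward, rate, spread, tau):
    """Equation (10): present value of dividends from one call-put pair.

    forward is the synthetic forward price f = c - p at the given strike.
    """
    discount = math.exp(-(rate + spread) * tau)
    return spot - strike * discount - forward


def yield_from_parity(spot, strike, forward, rate, spread, tau):
    """Equation (7): dividend yield from one call-put pair."""
    discount = math.exp(-(rate + spread) * tau)
    return -math.log((forward + strike * discount) / spot) / tau


# Hypothetical market snapshot for a single maturity.
spot, rate, spread, tau = 4500.0, 0.010, 0.002, 0.5
true_pv = 30.0
discount = math.exp(-(rate + spread) * tau)
strikes = [4300.0, 4500.0, 4700.0]
noise = [0.4, -0.3, 0.1]  # made-up deviations of c - p from the parity

# Synthetic forward prices consistent with Equation (6), plus noise.
forwards = [spot - true_pv - k * discount + e for k, e in zip(strikes, noise)]
pv_estimates = [
    pv_dividend_from_parity(spot, k, f, rate, spread, tau)
    for k, f in zip(strikes, forwards)
]
for k, d in zip(strikes, pv_estimates):
    print(f"K={k:.0f}: D_hat={d:.2f}")
```

Each strike yields its own estimate, and the noise in the synthetic forward price passes directly into the corresponding dividend estimate, which is why some form of aggregation is needed.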

### 2.1.4. Sloped Asset Position

Ronn and Ronn [32] presented an option position, the box-position, from which a market-implied interest rate could be estimated without specifying a dividend. The position has been used in the literature, e.g., van Binsbergen et al. [33] and Blomvall et al. [25]. The box-position is constructed by combining two put–call parities or, equivalently, two synthetic forward positions. We build upon the same logic but choose the number of synthetic forward contracts differently. Let *K*<sub>1</sub> ∈ ℝ<sup>+</sup> and *K*<sub>2</sub> ∈ ℝ<sup>+</sup>, and let the new position consist of one long position in a synthetic forward with the strike price *K*<sub>1</sub> and *K*<sub>1</sub>/*K*<sub>2</sub> short synthetic forward positions with the strike price *K*<sub>2</sub>. We refer to this position as the sloped asset position, where the name stems from the payoff of the position. From (5) and (6), we can write two relationships (see Appendix A for details):

$$\frac{f(t;K\_1,T)K\_2 - f(t;K\_2,T)K\_1}{K\_2 - K\_1} = S(t)e^{-\delta(t;T)(T-t)}\tag{11}$$

and

$$\frac{f(t;K\_1,T)K\_2 - f(t;K\_2,T)K\_1}{K\_2 - K\_1} = S(t) - D(t;T),\tag{12}$$

respectively. We note that the left-hand sides are the same but that the right-hand sides differ, and we introduce the concept of an adjusted spot price, to simplify the notation, thus:

$$S^\*(t; K\_1, K\_2, T) := \frac{f(t; K\_1, T)K\_2 - f(t; K\_2, T)K\_1}{K\_2 - K\_1}. \tag{13}$$
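The step from the parities (5) and (6) to the relationships (11) and (12) can be sketched in a few lines. The shorthand symbols *d* and *A* are introduced here only for this sketch and are not notation from the paper:

$$d := e^{-(r\_o(t;T)+s(t;T))(T-t)}, \qquad A := S(t)e^{-\delta(t;T)(T-t)} \quad \text{or} \quad A := S(t) - D(t;T),$$

so that both parities read *f*(*t*; *K*, *T*) = *A* − *Kd*. For two distinct strike prices, the position then satisfies

$$\begin{aligned} f(t;K\_1,T) - \frac{K\_1}{K\_2} f(t;K\_2,T) &= (A - K\_1 d) - \frac{K\_1}{K\_2}(A - K\_2 d) \\ &= A \frac{K\_2 - K\_1}{K\_2}, \end{aligned}$$

and multiplying both sides by *K*<sub>2</sub>/(*K*<sub>2</sub> − *K*<sub>1</sub>) shows that the adjusted spot price equals *A*. The discount factor *d* cancels, which is why no interest rate has to be specified.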

It is possible to reformulate (11) and (12) with the adjusted spot price into:

$$\hat{\delta}(t;T) = -\frac{1}{T-t} \ln \left[ \frac{S^\*(t; K\_1, K\_2, T)}{S(t)} \right],\tag{14}$$

and

$$\hat{D}(t;T) = S(t) - S^\*(t; K\_1, K\_2, T), \tag{15}$$

respectively. The advantage of the position is twofold. First, it is less exposed to noise since it, like the put–call parity, is not based on a single data point. Second, contrary to the put–call parity and the future-basis, it does not need an interest rate specification. The reduced noise exposure comes with two drawbacks, both of which arise because many positions can be constructed. First, similar to the put–call parity, the estimates must be aggregated if a single value is wanted. Second, the method is infeasible for some data sets that the other relationships could manage. For example, with a data set consisting of $n$ option pairs for a given time of maturity (i.e., $n$ synthetic forward contracts), it is possible to construct $n(n-1)/2$ different sloped asset positions and thus equally many estimates. This quadratic growth makes the position computationally infeasible for data sizes that are feasible for the future-basis and the put–call parity. A solution to this infeasibility problem is to limit the data set, but we have chosen not to do so, because it is difficult to make such a limitation generally and systematically.
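To illustrate Equations (13)–(15) and the quadratic growth in the number of estimates, the following minimal Python sketch enumerates all strike pairs for a single time and maturity. The function names, the dictionary layout, and the numerical inputs are assumptions made for this illustration only; they are not taken from the paper.

```python
import math
from itertools import combinations

def adjusted_spot(f1, k1, f2, k2):
    """Adjusted spot price S* of Eq. (13) from two synthetic forward prices."""
    return (f1 * k2 - f2 * k1) / (k2 - k1)

def sloped_asset_estimates(forwards, spot, ttm):
    """Return one (K1, K2, yield, present value) tuple per pair of strikes.

    forwards: dict mapping strike K -> synthetic forward price f(t; K, T)
    spot:     index quote S(t)
    ttm:      time to maturity T - t, in years
    """
    estimates = []
    for k1, k2 in combinations(sorted(forwards), 2):
        s_star = adjusted_spot(forwards[k1], k1, forwards[k2], k2)
        dividend_yield = -math.log(s_star / spot) / ttm   # Eq. (14)
        present_value = spot - s_star                      # Eq. (15)
        estimates.append((k1, k2, dividend_yield, present_value))
    return estimates

# Made-up, noise-free inputs: forwards consistent with a 2% dividend yield.
spot, ttm, true_yield, rate = 3000.0, 0.5, 0.02, 0.01
forwards = {k: spot * math.exp(-true_yield * ttm) - k * math.exp(-rate * ttm)
            for k in (2800.0, 2900.0, 3000.0, 3100.0)}
estimates = sloped_asset_estimates(forwards, spot, ttm)
print(len(estimates))  # n = 4 synthetic forwards give n(n - 1)/2 pairs
```

With noise-free inputs every pair returns the same yield, and the interest rate used to generate the forwards never enters the estimator. With market data the pairwise estimates differ and must be aggregated.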

### *2.2. Estimation Methods*

The three relationships (the future-basis, the put–call parity, and the sloped asset position) could all be used to estimate a dividend quantity, either a yield or a present value, for specific times, *t*, and times of maturity, *T*. The estimation method aims to produce a single estimate for each date, but we have multiple times for every date. Furthermore, the future-basis implies a unique estimate for each time of maturity, while multiple estimates can be inferred from the other two relations. For practical applications, multi-valued estimates do not suffice, and aggregation is therefore a necessary element of the estimation method.

A straightforward approach to produce a single estimate is to limit the data. This meets the aim of the method, but the drawback is that it probably introduces additional noise in the estimates, since the chosen data points can imply biased estimates. An alternative could be to select data points such that the noise is reduced. The disadvantage of such an alternative is twofold. First, it is challenging to design a method that makes this selection possible. Second, it is a strong assumption that a few data points are representative of the whole market. Therefore, we can adjust the estimates instead of adjusting the (input) data. A technique that considers all available data points could aggregate the estimates with, e.g., a mean or a median computation. The drawbacks of this approach are that it requires the interest rate to be specified exogenously and that the weights given to specific estimates are arbitrary. For example, in the case of the median, all of the weight is put on a single estimate. To mitigate these drawbacks, we followed the method used by Desmettre et al. [1] and formulated the put–call parity as a linear regression model. Similar formulations have also been used by van Binsbergen et al. [33], Azzone and Baviera [24], and Blomvall et al. [25] for interest rate estimation methodologies.

The regression used by Desmettre et al. [1] is the foundation of our work, but we present three expansions. First, instead of limiting the data used to data from a single time (single-time data), we use data from a whole day (intraday data). Second, we formulate two regressions with different modeling of the dividend: one where the dividend is formulated as a yield and one where it is formulated as a present value. Third, we generalize the regression from an ordinary to a weighted least squares model. It would also be possible to formulate a regression from the future-basis, since we use intraday data. We elaborate slightly in the next section, but we do not see it as an appropriate approach, primarily because of the drawbacks presented in Section 2.1.2.

### 2.2.1. Linear Regression

We formulated one linear regression model for each time of maturity and each put–call parity formulation, (5) and (6). The first regression model was formulated with a dividend yield, and the second used a present-value formulation. In contrast to Desmettre et al. [1], we used intraday data rather than data from a single time. Further, Desmettre et al. [1] correctly point out that by estimating the dividend with regression, the interest rate is estimated simultaneously, eliminating the need for a separate interest rate estimate. Therefore, it may seem strange to reintroduce the estimation need by formulating the interest rate as the sum of an interest rate and an interest rate spread, but the reintroduction is necessary because we use intraday data. The interest rate for a single time is constant, but it is not, in general, constant across a whole day. Consequently, formulating the regression with a fixed interest rate will inevitably involve an approximation. To make a more realistic and suitable formulation, we model the spread as a constant and keep intraday dynamics for the total interest rate. The rationale is the same as that used by Blomvall et al. [25], i.e., that the spread is more stable intraday than the risk-less component.

In the formulation, we let $N^d$ denote the number of days for which we estimated the dividends and let $d \in \lbrace 1, \ldots, N^d \rbrace$ denote the day. Moreover, we let $\kappa$ denote a pair of one (intraday) time, $t$, and one strike price, $K$, i.e., $\kappa = (t, K)$. For a day $d$ and a time of maturity $T$, we collected the pairs in a set $\mathcal{H}^{d,T}$ and enumerated the pairs as $1, \ldots, N^{d,T}$, where $N^{d,T} = \left\lvert \mathcal{H}^{d,T} \right\rvert$. (The operator $\lvert \cdot \rvert$ denotes the cardinality of the set. The order is unimportant, and the pair with index $i$ is thus $\kappa\_i = (t\_i, K\_i)$.) We also introduced $\tau^{d,T}$, which is the time to maturity computed at the beginning of the day. To compute the time from the beginning of the day, we followed the interest rate market convention. The put–call parity, Equation (5), can then be written as

$$f\_i^T = S(t\_i) \mathrm{e}^{-\delta^{d,T} \tau^{d,T}} - K\_i \mathrm{e}^{-r\_o^T(t\_i) \tau^{d,T}} \mathrm{e}^{-s^{d,T} \tau^{d,T}}, \quad \forall i = 1, \ldots, N^{d,T}, \tag{16}$$

where $f\_i^T \equiv f(t\_i; K\_i, T)$. We introduced $X\_{1,i}^{d,T} := -K\_i \mathrm{e}^{-r\_o^T(t\_i) \tau^{d,T}}$ and $X\_{2,i} := S(t\_i)$ to simplify the notation. (Note that $X\_{2,i}$ depends on neither the day nor the time of maturity.) We wanted to estimate $\mathrm{e}^{-\delta^{d,T} \tau^{d,T}}$ and $\mathrm{e}^{-s^{d,T} \tau^{d,T}}$, and denoted the corresponding regression coefficients as $\gamma\_2^{d,T}$ and $\gamma\_1^{d,T}$, respectively. Thus, it is possible to write the linear regression as

$$f\_i^T = X\_{1,i}^{d,T} \gamma\_1^{d,T} + X\_{2,i} \gamma\_2^{d,T}, \quad \forall i = 1, \dots, N^{d,T}. \tag{17}$$

It is possible to write a similar regression, based on (6), by introducing $h\_i^T := f\_i^T - S(t\_i)$ and a regression constant, $\gamma\_0$,

$$h\_i^T = \gamma\_0 + X\_{1,i}^{d,T} \gamma\_1, \quad \forall i = 1, \ldots, N^{d,T}. \tag{18}$$

To summarize, we can write the financial quantity estimates from the regression estimates as

$$\hat{D}^{d,T} = -\hat{\gamma}\_0, \tag{19}$$

$$\hat{\delta}^{d,T} = -\frac{1}{\tau^{d,T}} \ln \left[ \hat{\gamma}\_2^{d,T} \right],\tag{20}$$

$$\hat{s}^{d,T} = -\frac{1}{\tau^{d,T}} \ln \left[ \hat{\gamma}\_1^{d,T} \right],\tag{21}$$

where $\hat{D}^{d,T}$ is the estimate of the present value for the time of maturity $T$. The interest rate spread estimate, $\hat{s}^{d,T}$, can be obtained from either of the regression models (17) and (18), but the two estimates are not generally equal, except when single-time data is used.
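The two regressions and the conversions (19)–(21) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the 2×2 normal equations are solved in closed form for brevity, and the intraday observations are synthetic and noise-free.

```python
import math

def fit_two_regressors(y, xa, xb, w=None):
    """Least squares for y ~ xa*ga + xb*gb via the 2x2 normal equations.

    Optional non-negative weights w turn the fit into weighted least squares.
    """
    if w is None:
        w = [1.0] * len(y)
    saa = sum(wi * a * a for wi, a in zip(w, xa))
    sab = sum(wi * a * b for wi, a, b in zip(w, xa, xb))
    sbb = sum(wi * b * b for wi, b in zip(w, xb))
    say = sum(wi * a * yi for wi, a, yi in zip(w, xa, y))
    sby = sum(wi * b * yi for wi, b, yi in zip(w, xb, y))
    det = saa * sbb - sab * sab
    return (say * sbb - sby * sab) / det, (saa * sby - sab * say) / det

def x1_regressor(strikes, rates, tau):
    """X_1 = -K * exp(-r_o * tau): the discounted strike with a minus sign."""
    return [-k * math.exp(-r * tau) for k, r in zip(strikes, rates)]

def estimate_yield_form(f, strikes, spots, rates, tau):
    """Regression (17); returns (dividend yield, spread) via (20) and (21)."""
    g1, g2 = fit_two_regressors(f, x1_regressor(strikes, rates, tau), spots)
    return -math.log(g2) / tau, -math.log(g1) / tau

def estimate_pv_form(f, strikes, spots, rates, tau):
    """Regression (18); returns (present value, spread) via (19) and (21)."""
    h = [fi - s for fi, s in zip(f, spots)]
    ones = [1.0] * len(h)
    g0, g1 = fit_two_regressors(h, ones, x1_regressor(strikes, rates, tau))
    return -g0, -math.log(g1) / tau

# Synthetic intraday observations; every number below is made up.
tau, delta, spread, pv_div = 0.5, 0.02, 0.003, 30.0
strikes = [2800.0, 3000.0, 3200.0, 2900.0, 3100.0, 3000.0]
spots = [3000.0, 3004.0, 2996.0, 3010.0, 2990.0, 3002.0]
rates = [0.010, 0.010, 0.011, 0.011, 0.012, 0.012]
f_yield = [s * math.exp(-delta * tau) - k * math.exp(-(r + spread) * tau)
           for s, k, r in zip(spots, strikes, rates)]
f_pv = [s - pv_div - k * math.exp(-(r + spread) * tau)
        for s, k, r in zip(spots, strikes, rates)]

delta_hat, s_hat_yield = estimate_yield_form(f_yield, strikes, spots, rates, tau)
pv_hat, s_hat_pv = estimate_pv_form(f_pv, strikes, spots, rates, tau)
```

Because the spot price and the reference rate vary across the observations, the two data sets above are generated from different models. This mirrors the point that the yield and present-value formulations are no longer interchangeable with intraday data.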

It is possible to see the differences between using intraday data and single-time data in the regressions. The interest rate, $r\_o^T(t\_i)$, is fixed when single-time data is used. This fixed interest rate makes the decomposed interest rate form redundant since the sum $r\_o^T(t\_i) + s$ is a constant. Further, the spot price, $S(t\_i)$, is also fixed, making it possible to convert the dividend yield estimate to a present value dividend estimate, and vice versa, without loss or distortion of the estimates. This perfect conversion makes the two different dividend formulations redundant. These redundancies are not present when intraday data is used, since neither $r\_o^T(t\_i)$ nor $S(t\_i)$ is constant, making the decomposed interest rate necessary and the dual regression formulations interesting.
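As a short worked step (a sketch that assumes a single time $t$, so that $S(t)$ is fixed), the conversion can be made explicit by equating the right-hand sides of (11) and (12), whose left-hand sides coincide:

$$S(t)\,\mathrm{e}^{-\delta(t;T)(T-t)} = S(t) - D(t;T) \quad\Longrightarrow\quad D(t;T) = S(t)\left(1 - \mathrm{e}^{-\delta(t;T)(T-t)}\right), \qquad \delta(t;T) = -\frac{1}{T-t}\ln\left[1 - \frac{D(t;T)}{S(t)}\right].$$

With intraday data, $S(t)$ takes different values across the observations of a day, so a single yield and a single present value cannot, in general, satisfy this identity for every observation.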

Finally, from (17) and (18), it is easy to see how the future-basis regressions, based on (1) and (2), would follow. The formulation is made possible by the utilization of intraday data rather than single-time data. Despite the analogy with the put–call parity, the regression has one shortcoming compared to its put–call parity counterpart. The regression coefficient for the dividend yield-formulated regression combines the dividend yield and the spread, $\mathrm{e}^{(s(t;T)-\delta(t;T))(T-t)}$, so the two cannot be separated. Hence, the future-basis could only be used for present value estimates. This shortcoming and the previously mentioned drawbacks are why we do not consider this regression in this paper.

### 2.2.2. Linear Regression—Weighted Least Squares

In an ordinary least squares formulation, all of the data are considered equally important. This implicit assumption is likely to be incorrect since the quality of the data points likely differs. The ordinary least squares formulation does not adjust for this difference in data quality and thus has a drawback. One approach to counteract this behavior is to value some data points more and some less. To formulate this in a mathematically rigorous way, we followed the idea in Blomvall et al. [25] and used weighted least squares. Considering the models (17) and (18), we can formulate the weighted least squares problems

$$\min\_{\gamma=(\gamma\_1,\gamma\_2)} \sum\_{i=1}^{N^{d,T}} w\_i^{d,T} \left( f\_i^T - X\_{1,i}^{d,T} \gamma\_1 - X\_{2,i} \gamma\_2 \right)^2,\tag{22}$$

and

$$\min\_{\gamma=(\gamma\_0,\gamma\_1)} \sum\_{i=1}^{N^{d,T}} w\_i^{d,T} \left( h\_i^T - \gamma\_0 - X\_{1,i}^{d,T} \gamma\_1 \right)^2,\tag{23}$$

where $w\_1^{d,T}, \ldots, w\_{N^{d,T}}^{d,T}$ are non-negative weights. Note that if all weights equal the same positive constant, i.e., $0 < w = w\_1^{d,T} = \ldots = w\_{N^{d,T}}^{d,T}$, we obtain the ordinary least squares estimator, although the objective value equals the ordinary sum of squared errors only if $w = 1$. The crux of these formulations is to determine the weights. The key idea is to choose them such that the resulting estimator has good properties. An essential property of the ordinary least squares estimator is that if the residuals are independent and homoscedastic (same finite variance), the estimator is the BLUE (best linear unbiased estimator). The residuals from the regressions (17) and (18) likely do not fulfill the homoscedasticity assumption, in which case the ordinary least squares estimator is not the BLUE. One reason is that the liquidity of the data varies between strike prices, where illiquidity typically leads to higher variance.

The heteroscedasticity can be counteracted, and it is possible to achieve the BLUE with a specific weighting scheme. According to Aitken [34] (the result can also be found in textbooks such as Zwanzig and Liero [35]), if the weights are chosen to be inversely proportional to the variances, the estimator is the BLUE. In addition to the statistical properties, Blomvall et al. [25] pointed out that weights chosen inversely to the residuals also have an economic rationale. The residuals can be interpreted as a measure of the repricing capabilities of the linear models, where smaller residuals indicate accurate repricing. Nevertheless, the appealing theoretical property has a practical drawback since the variances are unknown and need to be estimated. Estimating the variance of the residuals is non-trivial, since we only have a single residual if we fix the time, strike price, and time of maturity, i.e., we do not have repeated estimates of a quantity. To mitigate this problem, we make the same assumption as Blomvall et al. [25] that the variance is constant intraday, i.e., for a fixed strike price and time of maturity. Hence, the variances can be estimated from different intraday times. The weights are computed with the same four-step process used by Blomvall et al. [25].

First, an ordinary least squares estimate is computed, and the (raw) residuals are determined, which we, for each strike price and time of maturity, denote as $e\_i$, $\forall i = 1, \ldots, N^{d,T}$. Second, the residuals, $e\_i$, are grouped into (index) groups according to their strike prices, $\mathcal{R}\_K^{d,T} = \lbrace i \mid i \in \lbrace 1, \ldots, N^{d,T} \rbrace \text{ and } K\_i^{d,T} = K \rbrace$. Third, a variance is estimated for each group, where

$$\mu\_K^{d,T} = \frac{1}{\left| \mathcal{R}\_K^{d,T} \right|} \sum\_{i \in \mathcal{R}\_K^{d,T}} e\_i, \tag{24}$$

$$\nu\_K^{d,T} = \frac{1}{\left| \mathcal{R}\_K^{d,T} \right| - 1} \sum\_{i \in \mathcal{R}\_K^{d,T}} \left( e\_i - \mu\_K^{d,T} \right)^2,\tag{25}$$

denote the estimated mean and variance, respectively, for the group associated with the date, $d$, the time of maturity, $T$, and the strike price, $K$. Finally, the weights in (22) and (23) can be determined from the auxiliary weights $\tilde{w}\_K^{d,T} = 1/\nu\_K^{d,T}$, as

$$w\_i^{d,T} = \tilde{w}\_{K\_i^{d,T}}^{d,T}, \quad \forall i = 1, \dots, N^{d,T}. \tag{26}$$
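Steps two to four of this process, i.e., Equations (24)–(26), can be sketched as follows, taking the ordinary least squares residuals from the first step as input. The function name is hypothetical. Giving a zero weight to strikes with fewer than two residuals, or with a zero sample variance, is an assumption of this sketch and is not described in the text.

```python
from collections import defaultdict

def strike_variance_weights(residuals, strikes):
    """Group OLS residuals by strike, estimate a variance per group, invert it.

    residuals: raw OLS residuals e_i for one day and one time of maturity
    strikes:   strike price K_i of each observation, in the same order
    """
    groups = defaultdict(list)
    for e, k in zip(residuals, strikes):
        groups[k].append(e)                      # index groups R_K

    aux = {}
    for k, es in groups.items():
        if len(es) < 2:
            continue                             # assumption: variance undefined
        mu = sum(es) / len(es)                                 # Eq. (24)
        nu = sum((e - mu) ** 2 for e in es) / (len(es) - 1)    # Eq. (25)
        if nu > 0.0:
            aux[k] = 1.0 / nu                    # auxiliary weight per strike

    # Eq. (26): every observation inherits the weight of its strike group.
    return [aux.get(k, 0.0) for k in strikes]

# Made-up residuals: a noisy strike, a precise strike, and a lone observation.
residuals = [1.0, -1.0, 0.1, -0.1, 0.5]
strikes = [100.0, 100.0, 110.0, 110.0, 120.0]
weights = strike_variance_weights(residuals, strikes)
```

The resulting weights can be passed to any weighted least squares routine to solve (22) or (23). The precise strike receives a weight one hundred times larger than the noisy one in this example.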

### **3. Data**

The data set used in this paper is the same data set used by Blomvall et al. [25]. All the data have been collected from the data provider Thomson Reuters Refinitiv Eikon, and the data set consists of three types of intraday data. First, quotes of the S&P 500 index. Second, bid and ask quotes of European call and put options with the S&P 500 index as their underlying. Third, payer and receiver quotes of fixed rates of USD-denominated overnight index swap contracts with the federal funds rate as the reference rate.

The tick data is collected for all dates in the period from 1 March 2020 to 31 January 2021 between 9 a.m. and 4 p.m. The European options are all the available monthly options for the given dates, i.e., all options expiring on the third Friday of each month. The USD overnight index swap fixed rates have a maturity between 1 and 10 years. The data set consists of 6 million S&P 500 index quotes, 110 million bid and ask quotes of the USD overnight index swaps, and 54 billion option prices.

Although granular, the data set must be processed to be useful in the paper. The collected data has four inherent problems. First, we collected tick data, but it is difficult to use because of its irregularities. The data is transformed to a more usable form where the level of granularity is preserved. Second, the quotes of the fixed rates of overnight index swaps are not directly usable since (17) and (18) require continuous spot rates, and thus a transformation is needed. Third, in Section 2, all regressions were formulated with a unique price, but in the data set, the prices are only known up to a bid–ask spread. Earlier, we mentioned that the price used is the mid-price, and below, we discuss this issue. Fourth, the data set contains unrealistic data points, and we discuss how to identify and remove them.

### *3.1. Synthetic Forward and Sloped Asset Positions*

In Section 2, we used synthetic forward and sloped asset positions with unique prices. Neither of these positions is traded in the market; only the options are. Hence, neither synthetic forward nor sloped asset positions have quoted bid or ask prices. To circumvent the missing prices of these positions, we first computed their bid and ask prices. The mid-prices were then computed from these bid and ask prices. The bid and ask prices were created by artificially replicating the market prices of entering such positions.

Let $c$, $p$, and $f$, respectively, denote the price of a call option, put option, and synthetic forward; and let $a$ and $b$ denote the ask and bid price, respectively. The payments of the ask and bid positions can be summarized as $f\_a = c\_a - p\_b$ and $f\_b = c\_b - p\_a$. We compute the mid-price of the synthetic forward as

$$f\_m = \frac{f\_a + f\_b}{2} = \frac{(c\_a - p\_b) + (c\_b - p\_a)}{2} = c\_m - p\_m. \tag{27}$$

It is thus necessary to have four prices—bid and ask prices for both the call and put options—to compute the mid-price of the synthetic forward position. Hence, if one or more quotes are missing, the mid-quote is not computable and thus not used in the regression.

We likewise compute mid-prices of the sloped asset position by first computing the bid and ask prices; the argument is analogous to that for the synthetic forward position. We compute the price of entering the position in two directions, and the mid-price is the average. Let $\phi\_{i,j}$ denote the sloped asset position, which is long one synthetic forward with strike price $K\_i$ and short $K\_i/K\_j$ synthetic forwards with strike price $K\_j$. The cost of entering such a position is the ask price, $f\_i^a$, reduced by the bid price, $f\_j^b$, for each of the $K\_i/K\_j$ contracts. The ask price of this position can thus be written as $\phi\_{i,j}^a = f\_i^a - (K\_i/K\_j) f\_j^b$. Alternatively, we receive $f\_i^b$ and must pay $f\_j^a$ for each of the $K\_i/K\_j$ contracts, and the bid price of the sloped asset position is thus $\phi\_{i,j}^b = f\_i^b - (K\_i/K\_j) f\_j^a$. The mid-price of the sloped asset position is

$$\phi\_{i,j}^m = \frac{\phi\_{i,j}^a + \phi\_{i,j}^b}{2} = \frac{\left(f\_i^a - \frac{K\_i}{K\_j} f\_j^b\right) + \left(f\_i^b - \frac{K\_i}{K\_j} f\_j^a\right)}{2} \tag{28}$$

$$= \frac{\left(f\_i^a + f\_i^b\right) - \frac{K\_i}{K\_j}\left(f\_j^b + f\_j^a\right)}{2} = f\_i^m - \frac{K\_i}{K\_j} f\_j^m.\tag{29}$$

We see that the data of synthetic forward contracts is sufficient to express both the put–call parity and the sloped asset position.
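The mid-price computations in (27)–(29) can be sketched as below. The function names and the use of `None` for a missing quote are assumptions of this illustration, and the quotes in the example are made up.

```python
def forward_mid(call_bid, call_ask, put_bid, put_ask):
    """Mid-price of a synthetic forward, Eq. (27).

    Returns None when any of the four quotes is missing, in which case the
    observation cannot be used in the regression.
    """
    if any(q is None for q in (call_bid, call_ask, put_bid, put_ask)):
        return None
    f_ask = call_ask - put_bid   # buy the call at the ask, sell the put at the bid
    f_bid = call_bid - put_ask   # sell the call at the bid, buy the put at the ask
    return 0.5 * (f_ask + f_bid)

def sloped_asset_mid(f_i_bid, f_i_ask, k_i, f_j_bid, f_j_ask, k_j):
    """Mid-price of the sloped asset position, Eqs. (28) and (29)."""
    ratio = k_i / k_j
    phi_ask = f_i_ask - ratio * f_j_bid
    phi_bid = f_i_bid - ratio * f_j_ask
    return 0.5 * (phi_ask + phi_bid)

mid = forward_mid(10.0, 12.0, 4.0, 6.0)
missing = forward_mid(None, 12.0, 4.0, 6.0)
phi_mid = sloped_asset_mid(98.0, 102.0, 100.0, 48.0, 52.0, 200.0)
```

In this example `mid` equals the call mid-price minus the put mid-price, and `phi_mid` equals the first forward mid-price minus $K\_i/K\_j$ times the second, in line with the last equalities of (27) and (29).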

### *3.2. Transformation and Cleaning of Option Tick Data*

Our data management has two aims. First, to make the data appropriate for the (estimation) method, and second, to clean the data from artifacts. While the former is a necessity, the latter can lead to the validity of the method being questioned; hence, the data cleaning is moderate. In this paper, we address two features of this data set: a sudden and temporary downward spike in bid quotes and a lack of bid quotes for out-of-the-money options. We classify the former as a data artifact that normal market dynamics cannot explain, while the latter has a natural explanation.

The first problem (downward spikes) is illustrated in Figure 1. We deem the drops of approximately \$1200 to be non-realistic and deem further that those spikes have been created in the data collection. We cannot explain why only the bid quotes are affected by this effect. The problem of the downward spikes is easily solved by removing them from the data set. The crux is to determine which of the bid quotes are artifacts and which are not. In Figure 1, the artifact is evident, but there could be other cases where the spikes are not as obvious. A rough description is that the drops are more pronounced for (deep) in-the-money options than out-of-the-money options, since the former options naturally have higher prices. However, the silver lining is that the effect on the data is less severe for out-of-the-money options, since they have a lower price; thus, the drops cannot be as big. Therefore, we limited the data cleaning to in-the-money options, since they are more affected and easier to find than out-of-the-money options. The data cleaning procedure that we used was to discard all call (put) options with strike prices greater (less) than the spot price as well as all quotes smaller than \$1.

The second problem (missing bid quotes) is not as trivial or obvious as the spikes. Many (deep) out-of-the-money options in the data set lack bid quotes but have corresponding ask quotes. We attribute this data property to the tick size of the market, i.e., the minimum amount by which a quote can be changed. This amount may be greater than the fair price of some options, and any (positive) bid quote would thus be overpriced. If the only possible price is an overprice, the only sensible action is not to quote. The ask prices do not suffer from the same dynamics, since it is natural to ask for a higher price than a fair price. The drawback of missing bid quotes for the estimation method is that they decrease the data set significantly. As noted above, a single synthetic forward mid-price requires both a bid and an ask quote for one call and one put option. To reduce the data waste, we recreated the bid quotes.

**Figure 1.** The two panels illustrate the bid quotes of a call option on the 9 March 2020. The strike price of the option is \$1500 (in-the-money), and its expiration is the 20 March 2020. The two panels illustrate the same data, but the lower panel focuses on smaller values and thus has a smaller y-axis than the upper panel.

The strict natural lower limit of plain vanilla option quotes is zero. A price of zero means that someone, essentially, gives away a contract with only non-negative payoffs for free, which is an arbitrage opportunity and, thus, a non-realistic scenario. However, the bid and ask prices are not used individually, but rather only in pairs, to compute mid-prices. Therefore, we argue that when the fair option price is within one tick from zero, it is a valid approximation to set the bid price to zero. In general, however, a zero bid price is too low, so the resulting mid-price is too low and a bias is introduced in the mid-prices. In order not to introduce such biases, we only replaced missing bid quotes for some call and put options, which are essentially options with small prices. Let $S$ denote the intraday median spot price, $K$ the strike price, and let $a^{d,T} := a \sigma \sqrt{\tau^{d,T}}$, where $a > 0$ and $\sigma > 0$. We replaced missing bid quotes for call and put options if $S(1 + a^{d,T}) < K$ and $S(1 - a^{d,T}) > K$, respectively. (The economic interpretation is that bid quotes are only replaced for options with a strike price at least $a$ standard deviations, $\sigma$, from the current spot price of the underlying asset, i.e., deep out-of-the-money options.)
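The replacement rule can be sketched as a small function. The function name and the parameter values in the example ($a = 2$, $\sigma = 0.2$) are hypothetical, since the text does not state which values were used.

```python
import math

def fill_missing_bid(bid, option_type, strike, median_spot, tau, a, sigma):
    """Return the bid quote, replacing a missing one by zero when allowed.

    A missing bid (None) is set to zero only for deep out-of-the-money
    options, i.e., calls with S(1 + a*sigma*sqrt(tau)) < K and puts with
    S(1 - a*sigma*sqrt(tau)) > K. Otherwise the bid stays missing.
    """
    if bid is not None:
        return bid
    band = a * sigma * math.sqrt(tau)
    if option_type == "call" and median_spot * (1.0 + band) < strike:
        return 0.0
    if option_type == "put" and median_spot * (1.0 - band) > strike:
        return 0.0
    return None

# Hypothetical parameters: the band is 2 * 0.2 * sqrt(0.25) = 20% of the spot.
params = dict(median_spot=3000.0, tau=0.25, a=2.0, sigma=0.2)
deep_otm_call = fill_missing_bid(None, "call", 3700.0, **params)
near_call = fill_missing_bid(None, "call", 3500.0, **params)
deep_otm_put = fill_missing_bid(None, "put", 2300.0, **params)
near_put = fill_missing_bid(None, "put", 2500.0, **params)
quoted = fill_missing_bid(1.5, "call", 3700.0, **params)
```

Options that stay without a bid quote after this step cannot contribute a synthetic forward mid-price and drop out of the regression.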

Data cleaning is the first step in the data transformation process; the second is to process the data into a better-suited format. The collected option data is tick data of values and timestamps, where the timestamps have a precision of one second. One alternative would be to transform the tick data set into a set where the data points are spaced with a fixed time unit, e.g., a second. In essence, the idea is to use, for every time unit, the most recent tick quoted in the market. The assumption is that, as long as new ticks have not reached the market, i.e., no new information has reached the market, the old ticks are still valid. There are two practical benefits of such an approach. First, it is easy to work with such data. Second, the data utilization is high. Furthermore, if no new information has reached the market, that implies that the market's dividend and interest rate beliefs are unchanged. (The converse is not true: changed prices do not necessarily signify changes in the market's dividend or interest rate beliefs, but can reflect a myriad of factors.)

The drawback of such an approach is the risk of amplifying noise. All market information carries some noise, and repeating individual data points would assign higher confidence or weight to arbitrary points and consequently amplify the noise in these points. In order not to indirectly assign higher weights to certain points while still keeping the data utilization high, we are only interested in times where at least one quote of at least one option has changed from the previous time. In this method, the prices of the options should correspond to the index quote for the same times. The transformation that we propose is a two-step process. First, the tick data set is transformed into a set with a specific frequency, i.e., the time between data points, e.g., 1 second, which is the frequency used in this paper. Second, this data set is transformed into the final data set, where only the data points that have changed are kept. Small schematic examples of mock tick data, fixed time unit data, and the final data are presented in Tables 1–3, respectively. Note that, before the first tick of the day, the quote is written as not available (N/A). The value from a tick prevails until a new tick comes or the day ends (4 p.m.). This first transformation is performed for all options and all fixed rates. The second transformation is from the one-second data to a data set that only contains seconds that coincide with ticks. Table 3 presents a continuation of the example in Table 2. Note that this transformation is not a reversal of the first transformation. The first transformation was made for individual options' bid and ask quotes, and the second considers all the options' bid and ask quotes (for a given day and time of maturity) simultaneously.
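The two-step transformation can be sketched in plain Python as follows. The data layout (one list of `(second, value)` ticks per quote series, with seconds counted from the start of the trading day), the function names, and the mock quotes are assumptions of this illustration.

```python
def to_one_second_grid(ticks, start, end):
    """Step 1: forward-fill one quote series onto a one-second grid.

    ticks: list of (second, value) pairs sorted by time.
    Returns {second: value}, with None (N/A) before the first tick.
    """
    grid, last, idx = {}, None, 0
    for sec in range(start, end + 1):
        while idx < len(ticks) and ticks[idx][0] <= sec:
            last = ticks[idx][1]          # the most recent tick prevails
            idx += 1
        grid[sec] = last
    return grid

def keep_tick_seconds(series, start, end):
    """Step 2: keep only the seconds at which at least one series ticked.

    series: dict mapping a series name to its list of ticks.
    Returns a list of (second, {name: value}) rows.
    """
    grids = {name: to_one_second_grid(t, start, end) for name, t in series.items()}
    tick_seconds = sorted({sec for t in series.values() for sec, _ in t
                           if start <= sec <= end})
    return [(sec, {name: grids[name][sec] for name in series})
            for sec in tick_seconds]

# Mock bid quotes of two assets over a six-second window.
series = {"A_bid": [(1, 10.0), (4, 10.5)], "B_bid": [(2, 20.0)]}
grid_a = to_one_second_grid(series["A_bid"], 0, 5)
rows = keep_tick_seconds(series, 0, 5)
```

The second step operates on all series jointly: a row is kept when any series ticked, and series that did not tick at that second carry their forward-filled value, which is why the second step is not a reversal of the first.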

**Table 1.** The two panels schematically exemplify mock tick market data of two assets. The marker N.U. indicates Not Updated.


**Table 2.** The two panels schematically exemplify one-second data that have been derived from Table 1. The left and right panels are derived from the left and right panels in Table 1, respectively. The bold quotes indicate that those quotes were ticks and not repeats of an earlier tick. Bold times indicate that, at that time, at least one of the quotes (bid and ask) was a tick.



**Table 3.** This table schematically exemplifies one-second data, which combines the two panels in Table 2. The bold numbers indicate that those numbers were ticks in the tick data. (Note that every row has at least one bolded number) The difference between the panels in Table 2 and this table is that times that lack a tick have been removed from this table.

### *3.3. Overnight Index Swap Implied Spot-Rates*

The regression formulations (17) and (18) require continuous spot interest rates. In this paper, we follow the arguments in Blomvall et al. [25] and base these rates on OIS contracts. A specific interest rate is not critical since we estimate a spread over this rate, and most rates are stable intraday. From that point of view, we could have used interest rates from a data provider, such as Thomson Reuters Eikon Refinitiv.

However, the interest rate data must match the frequency of the option data, and thus we must compute the rates ourselves. We use the technique proposed by Blomvall [36], which produces a complete forward term structure of daily forward interest rates. In this paper, only the spot rates that correspond to the options' times of maturity are of interest, and these rates are computed from the forward rates.

### **4. Results and Discussion**

This results and discussion section consists of four parts. First, we present the characteristics of the estimates in plots, which are the foundation of the next part in the section. In the second part, in-sample results are presented, that is, results where the data set has not been divided into training and test sets. The in-sample results answer some questions, but the validity of the results can be partly questioned since the results could be the effect of over-fitting. The third part presents the methodology for performing out-of-sample testing, i.e., the data partitioning and evaluation methods. The results include both some basic statistics and a statistical Diebold–Mariano test. Throughout the section, we discuss and highlight results when presented, but one question spans multiple parts—the difference in estimating yield and present value, and hence, it is discussed in the fourth, and final, part of this section.

In addition to the question of the difference between yield and present value, two additional questions are discussed in this section. First, various regressions for dividend estimation have been presented, which can be grouped according to two properties: the weighting scheme and the type of data. The regressions have been formulated generally to handle intraday data, which is similar to the approach used by Blomvall et al. [25]. Contrariwise, in Desmettre et al. [1] and other methodologically similar approaches for interest rate estimation, single-time data is used, see Blomvall et al. [25] for an overview of the latter. From these regressions we make two comparisons. First, we compare the single-time and the intraday dataset. Second, we also study the differences between the weighted least squares and the ordinary least squares models.

### *4.1. Characteristics of Estimates*

The characterization of the estimates is divided into two parts. First, we have three illustrations of the estimates, both for intraday data and single-time data. The data set is not partitioned in this section; rather, all the data for each time has been used. Second, we start by presenting some surface plots in Figure 2. The surface plots provide an overview, but it is difficult to see any small differences in them. The line plots in Figures 3 and 4 complement the surface plots.

**Figure 2.** The figure shows the surface plots of the ordinary least squares dividend estimates, which can be grouped by two properties. First, the two upper and two lower panels are computed with intraday and single-time data, respectively. Second, the left and right panels are computed as dividend yields and the present value of dividends, respectively. The z-axes of all four panels indicate the estimated values. The x- and y-axes correspond to the date and time to maturity (measured in days), respectively, to which the estimates correspond.

The overall impression from the surface plots and the line plots is that the present value estimates have a downward-sloping trend as time progresses. These trends, for the longer maturities, are supported by the mean values in Table 4, which indicate that the mean daily changes are negative. These slopes are expected since the present value, for a specific time of maturity, naturally decreases when ex-dividend dates are passed. On the other hand, the yield estimates do not form a slope. Instead, the main effect is that they converge for longer times of maturity. This convergence can be interpreted as expected dividends, in dollars, being stable over the years.

**Table 4.** This table shows statistics for the daily differences in market-matched present value dividend estimates for four series of estimates with a constant time of maturity. The numbers (382, 473, 655, 1019) in the first column—TTM—are the times to maturity on the 2 March 2020 (i.e., the first date in the data set.) for the series.


We can observe that both the yield and present value approaches are stable as time evolves, but the estimates vary substantially for different maturities. We can also see an additional effect: estimates drift off shortly before the expiration date. This effect is easily observable for the yields in Figures 2 and 4. The effect is also observable for the present value data, but the scale of the plot masks it. In most cases, the drift is positive, but we can observe some negative estimates. A negative yield or a negative present value can be interpreted as a cash flow that lifts the price (a negative drop). It is an improbable market dynamic, and, since we experience these negative estimates adjacent to other spurious estimates, we argue that these are not to be taken at face value. Instead, the estimates in these regions should rather be seen as artifacts of the estimation method. A similar effect was reported by Blomvall et al. [25] for interest rate spread estimates, and we follow their argument and explain this effect with low option data quality. Finally, we can see that these spurious values are more pronounced for the single-time data than for the intraday data. To reduce the impact of these spurious values on the results, we only consider option pairs with a time to maturity exceeding five days.

**Figure 3.** The figure consists of two panels, where the upper and lower panels illustrate the present value dividend estimates that are market-matched and stripped, respectively. The plots illustrate a series of dividend estimates with fixed times of maturity, where the x-axis is the date. The series in each panel consists of either estimates determined by intraday data or by single-time data recorded at 3 p.m. The legend of the upper panel indicates the type of data, and the number is the number of days to maturity at the first date. The legend of the lower panel contains a period, which is given by the numbers of days to maturity of the two contracts that define the stripped dividend.

The above observation illustrates potential data problems and provides some insights into how the market reacted during the period. The core idea of the paper is not to understand and study the market dynamics. Nevertheless, the illustrations indicate shifts in the market, which are too significant to leave without comment. The comments are not detailed but instead focus on the holistic picture. In Figure 4, we can see that around April 2020, when the global pandemic started to affect the markets, the estimates behave differently than in the rest of the period. It is possible to see that the estimates of present value squeezed together, i.e., the difference between the estimates of longer and shorter times of maturity was reduced. The yield estimates were also affected, but in the opposite direction. The difference between the long and short times of maturity increased. We can also observe that the S&P 500 index quote experienced a downturn. The effects on the estimates can prima facie seem contrary, but both behaviors have the same underlying reason. During this period, many companies cut, either partly or entirely, their future dividends but kept dividends that were closer in time (e.g., announced dividends), and the market anticipated further cuts for future dividends. For the present value dividend estimates, the effect was direct. Present value dividend estimates corresponding to longer maturities were reduced more than those corresponding to short times to maturity. This phenomenon is natural, since both the realized and anticipated dividend cuts were more pronounced for longer times to maturity. A similar effect would have been seen in the yields if the S&P 500 quote had been constant, but the downturn of the S&P 500 offset the effect of lower yields, especially for short times of maturity, and resulted in higher yield estimates for shorter times to maturity. We can see that the estimates have captured these market dynamics and could potentially be a good measure of how the market predicted large dividend cuts.

**Figure 4.** This figure consists of two panels which share the same labels. The upper panel illustrates dividend yield estimates, and the lower panel illustrates present value dividend estimates. The data illustrated in these two panels are part of the data in Figure 2. However, in these two panels, we plot dividend estimates that have a constant time of maturity (one of the dimensions of the surf plots has been removed). The times of maturity that are illustrated are those that were present in the market as of 2 March 2020 (i.e., the first date in our data set). The legends show the time to maturity (measured in days), corresponding to the times of maturity at the first date.

The interconnection between the yield and the quote of the underlying asset is interesting in two regards. First, it gives rise to counter-intuitive behavior. Second, and more importantly from the viewpoint of the model assumptions, a yield that is constant intraday would imply highly fluctuating present value dividend estimates during the day. Such behavior seems unlikely among market participants. This view is complemented by Vellekoop and Nieuwenhuis [13], who claim that market makers prefer to specify fixed cash amounts rather than yields. We take this as an indication that the constant dividend yield formulation has inherent problems. In the upcoming sections, we present results that support the claim that yield estimates perform worse than their present value counterparts, and that these differences in performance can be related to the variability of the underlying asset.

### *4.2. In-Sample*

We have two types of in-sample results. First, we elaborate on whether the yield or the present value should be used. (This question is also discussed in the next section, where an out-of-sample analysis is performed). Second, we examine the difference between the intraday and single-time data.

One interpretation of the linear models is that they are pricing methods for synthetic forward positions, and an obvious way to measure the relative performance of the models is to compare the residuals. The residuals are informative but difficult to compare. Therefore, we compare the mean squared errors rather than the residuals themselves in the analyses.

### 4.2.1. Yield and Present Value Comparison

We consider two different regression models, (17) and (18), where the former is formulated with a dividend yield and the latter with the present values of dividends. The results of the regressions are shown in Table 5, and we can observe that the present value dividend formulation (18) outperforms the yield formulation (17), i.e., the former has a lower mean squared error than the latter.

**Table 5.** This table consists of mean squared errors (MSE) for in-sample dividend ordinary least squares (OLS) estimates based on intraday and single-time (recorded at 3 p.m.) data. The table shows the mean squared errors for the yield and present value dividend formulation. The MSE for the single-time data is—by construction—equal for both dividend formulations; hence, these are written on the same row.


In Table 5, only the ordinary least squares results are presented, not the corresponding weighted least squares results. The reason is that the ordinary least squares estimator by construction produces a lower mean squared error than the weighted least squares estimator, cf. (22) and (23). Therefore, to have a meaningful comparison, we will compare the ordinary and weighted least squares estimators out-of-sample in the next section.
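The "by construction" argument can be illustrated with a minimal sketch. It assumes a single-regressor line fit rather than the regressions (17) and (18), and the data, the weights, and the helper names `fit_line` and `unweighted_mse` are hypothetical. Because ordinary least squares minimizes the unweighted sum of squared residuals, any other fit, including a weighted one, cannot reach a lower unweighted in-sample error.

```python
def fit_line(x, y, weights=None):
    """Fit y = a + b*x by (weighted) least squares; weights=None gives OLS."""
    if weights is None:
        weights = [1.0] * len(x)
    total = sum(weights)
    x_mean = sum(w * xi for w, xi in zip(weights, x)) / total
    y_mean = sum(w * yi for w, yi in zip(weights, y)) / total
    s_xx = sum(w * (xi - x_mean) ** 2 for w, xi in zip(weights, x))
    s_xy = sum(w * (xi - x_mean) * (yi - y_mean)
               for w, xi, yi in zip(weights, x, y))
    slope = s_xy / s_xx
    return y_mean - slope * x_mean, slope


def unweighted_mse(x, y, intercept, slope):
    """Mean of the squared residuals, with every observation weighted equally."""
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)) / len(x)


# Hypothetical data and weights, chosen only for illustration.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.1, 0.9, 2.3, 2.8, 4.4]
weights = [5.0, 1.0, 1.0, 1.0, 5.0]

mse_ols = unweighted_mse(x, y, *fit_line(x, y))
mse_wls = unweighted_mse(x, y, *fit_line(x, y, weights))
print(mse_ols <= mse_wls)
```

The comparison only says something about in-sample fit, which is why a fair comparison of the two estimators has to be made on data that were not used in the regression.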

It is possible to see a difference in predictability between the yield and present value formulation, but it is not easy to relate the two quantities and determine the magnitude of the difference. The yield is transformed into a present value to enable a comparison between the estimates. We represent the yield-implied present value with $D_y$. The key in the transformation is that both estimates can be interpreted as spot price adjustments. The yield and present value adjusted spot prices can be written as $S(t)\mathrm{e}^{-\delta(t;T)(T-t)}$ and $S(t) - D(t;T)$, respectively. By equating these two adjusted spot prices, we write the yield-implied present value as

$$D_y(t;T) = S(t)\Big(1 - \mathrm{e}^{-\delta(t;T)(T-t)}\Big),\tag{30}$$

which can be rewritten into conversions between yield and present value. We can see the results of this conversion in Table 6. The differences between the yield implied and the estimated present value are not big, but there are differences. The statistics in the table do not present any clear differences between the two estimates. The summarized picture shows that the differences are close and symmetric around zero, since the mean values are close to zero with a low standard deviation, while the skewness and kurtosis indicate that there are extreme points. The skewness shows that the implied present value dividend is higher than the present value dividend in eight of eleven ranges and in total. Furthermore, the high kurtosis shows differences notably more extreme than a couple of standard deviations from the mean. The only visible trends in the data are that the standard deviations and the absolute differences seem to increase with longer times to maturity. These results are thus inconclusive as to whether there is a difference or if the estimations only are noisier for longer times to maturity. We continue this discussion around the out-of-sample tests.
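The conversion in (30) and its inverse can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the time to maturity is assumed to be measured in years so that it matches an annualized yield.

```python
import math


def yield_implied_pv(spot, div_yield, ttm_years):
    """Equation (30): D_y = S * (1 - exp(-delta * (T - t)))."""
    return spot * (1.0 - math.exp(-div_yield * ttm_years))


def pv_implied_yield(spot, pv_dividends, ttm_years):
    """Inverse of (30): delta = -ln(1 - D / S) / (T - t)."""
    if spot <= 0.0 or ttm_years <= 0.0:
        raise ValueError("spot and time to maturity must be positive")
    if pv_dividends >= spot:
        raise ValueError("present value of dividends must be below the spot price")
    return -math.log(1.0 - pv_dividends / spot) / ttm_years
```

Note that the inverse is also defined for a negative present value, which maps to a negative yield, in line with the negative estimates discussed earlier.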

**Table 6.** This table presents the statistics of the differences between the present value estimates and the implied present value quantity, $D_y$. The second column (Abs. Mean) shows the mean of the absolute value of the differences. The last row (All TTM) shows the statistics for all differences. The other rows show groups of differences corresponding to the times to maturity (days) in the range.


### 4.2.2. Intraday and Single-Time Data

Blomvall et al. [25] concluded that intraday data produce more stable and higher quality estimates than data recorded from a single time. Therefore, we undertook a similar analysis and performed linear regressions where only data points recorded at 3 p.m. were used. First, Table 5 shows mean squared errors for both the intraday and the single-time data set, but the mean squared errors are not directly comparable since the errors are computed from different data sets. Consequently, we do not make any such comparisons in-sample but postpone them to the out-of-sample analysis.

It is possible to consider the surf plots in Figure 2 again. The differences between the intraday and single-time data seem small, but it is possible to observe a wave for both estimate types, which indicates that some estimates differ from adjacent estimates. These estimate differences are more visible in the upper panel of Figure 3 than in Figure 2. We can see that around the period of March–May of 2020, the single-time estimates seem to be more volatile than the intraday estimates. Further, later in the studied period, there are occasional single estimates that are considerably different from their adjacent estimates.

We can study the estimates that correspond to the times of maturity of the options market, which we refer to as market-matched dividend estimates. We want two properties when computing the statistics of the estimates: to estimate the same quantity every day and to have large sample sizes, i.e., long time series of estimates. The latter property is achieved by limiting the data set, such that only the times of maturity that are present in the market for the whole period of study are included. The first property is impossible to achieve completely since the market changes with time. In a period, [*t*, *T*], the value of the dividends changes because the ex-dividend dates are passed as *t* evolves. Additionally, the estimated quantity may also change since the beliefs of future dividends change. By studying the difference between the present value dividend estimates of two maturities that have not passed in the period, the impact of passed ex-dividend dates is removed. We refer to these differences as stripped dividend estimates. We use the notation introduced in Section 2.1.1, where *D*(*t*; *τ*, *T*) is the present value of dividends with ex-dividend dates in the period [*τ*, *T*]. We can then measure some statistics of these stripped dividends, an analysis that is similar to the analysis conducted by Desmettre et al. [1].
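The stripped series and the statistics of its daily differences can be sketched as below. The sketch reads the stripped dividend as the difference between the present value estimates of the longer and the shorter maturity on the same date; the function names are hypothetical, and the lag-one auto-correlation uses the common sample estimator, which may differ in detail from the one behind the reported tables.

```python
from statistics import mean


def stripped_dividends(pv_long, pv_short):
    """Per-date difference between the estimates of two maturities."""
    return [long_pv - short_pv for long_pv, short_pv in zip(pv_long, pv_short)]


def daily_differences(series):
    """Change from one date to the next."""
    return [nxt - cur for cur, nxt in zip(series, series[1:])]


def lag1_autocorrelation(values):
    """Sample auto-correlation at lag one."""
    centre = mean(values)
    denominator = sum((v - centre) ** 2 for v in values)
    if denominator == 0:
        raise ValueError("auto-correlation is undefined for a constant series")
    numerator = sum((nxt - centre) * (cur - centre)
                    for cur, nxt in zip(values, values[1:]))
    return numerator / denominator
```

A series that jumps up one day and back the next produces daily differences with alternating signs, and hence a clearly negative value from `lag1_autocorrelation`.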

The market-matched and stripped dividend estimates are similar, but they have some differences in their interpretations. Statistics of the market-matched estimates can be seen in Table 4, and statistics for the stripped dividend estimates are presented in Table 7. Further, the market-matched and stripped dividends are presented in the upper and lower panels of Figure 3, respectively. The stripped dividend estimates do not have the downward slopes that the market-matched dividend estimates have. The line plots in the lower panel of Figure 3 are flat, and the means of Table 7 are approximately zero. The reason for the slope is that the ex-dividend dates are passed for the market-matched dividend estimates, but since the stripped dividends are further in the future, no ex-dividend dates have passed.

**Table 7.** This table shows the statistics for the daily differences of the stripped present value dividend estimates for four series of estimates with a constant time of maturity. The intervals (e.g., 473–382) in the first column—Tenors (TTM)—indicate time to maturity intervals on 2 March 2020 (the first date in the data set) that the stripped dividend estimates correspond to.


The means are similar for both data sets, but the single-time data estimates have higher volatility values than the intraday data estimates. Furthermore, the standard deviation (volatility) values are similar between the market-matched and stripped dividends, which is surprising. The stripped dividends are estimates of fewer dividends than the market-matched dividends, and additionally, those dividends are shared with the market-matched estimates. Therefore, a natural assumption is that the dispersion of the former would be smaller. A possible explanation is that the future dividends are uncertain. Another explanation is that there is noise in the estimates, which may be because options with longer times to maturity are less liquid than options with shorter times to maturity.

Furthermore, the auto-correlation of the daily differences holds interesting information. We can see in Figure 3 that there are some upward spikes for individual days, i.e., an estimate goes up one day and then comes back to a similar level the following day. This pattern is a clear sign of noise in the estimates. We can contrast the single-time data plots with the plots of the intraday data, which lack clear spikes. We measured the auto-correlation to see how much the estimates were affected and presented the results in Tables 4 and 7. We noted that the market-matched dividend estimates had a lower auto-correlation since these estimates had a downward trend. This downward trend reduced the information in the auto-correlation, and, thus, the auto-correlation values of the stripped dividends are better indicators of the noise for each method. We can see in Table 7 that the auto-correlation is negative for both the intraday and the single-time data, but the auto-correlations are smaller (more negative) for the latter. The negative sign indicates that both types of estimates are affected by noise, and further, the differences between the auto-correlations indicate that intraday estimates contain less noise than the single-time estimates. Further, it is impossible to make statements concerning the noise level in the market-matched versus the stripped dividend estimates since the market-matched dividends have a natural downward slope, which thus increases the auto-correlation of the daily differences.
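A short worked calculation, under simplifying assumptions that are ours rather than the paper's, shows why a negative auto-correlation of the daily differences points to noise. Suppose the estimate on day $i$ is $\hat{D}_i = D_i + \varepsilon_i$, where the increments $u_i = D_i - D_{i-1}$ of the underlying value are mutually uncorrelated with variance $\sigma_u^2$, and the estimation noise $\varepsilon_i$ is independent and identically distributed with variance $\sigma_\varepsilon^2$ and independent of the increments. The daily difference is then $\Delta_i = u_i + \varepsilon_i - \varepsilon_{i-1}$, and

$$\operatorname{Var}(\Delta_i) = \sigma_u^2 + 2\sigma_\varepsilon^2, \qquad \operatorname{Cov}(\Delta_i, \Delta_{i-1}) = -\sigma_\varepsilon^2, \qquad \rho_1 = -\frac{\sigma_\varepsilon^2}{\sigma_u^2 + 2\sigma_\varepsilon^2}.$$

In this stylized setting, the lag-one auto-correlation $\rho_1$ is zero without noise and approaches $-1/2$ as the noise dominates, so a more negative value corresponds to a larger share of noise in the daily differences.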

### *4.3. Out-of-Sample*

The in-sample results indicate that the present value of the dividends performs better than the yield estimates. However, these results can be questioned, since the performance may be a result of over-fitting. In this section, we perform an out-of-sample analysis. The analysis is a two-step approach. First, we discuss how to partition the data into two sets: the training and test sets. The former was used for estimating, while the latter was used for evaluating the estimates. Second, we present the evaluation method.

### 4.3.1. Partitioning the Data Set

In order to make an out-of-sample analysis, the data set needed to be divided into two parts. The data consisted of all (business) dates from 1 March 2020 to 1 February 2021. Each date had some times of maturity, and linear regressions were performed for each time of maturity. The partitions into in- and out-of-sample sets were performed on each such unit, since there was neither data sharing between the dates nor the times of maturity. The data set used for estimation consisted of three data types: the spot price of the underlying (i.e., quotes of the S&P 500 index), spot interest rates, and synthetic forward mid-prices. The regressions use both different times and different strike prices. We argued in Section 3 that information reaches the market over time and that the times are important. For each (intraday) time, a single and unique S&P 500 quote and a single unique spot rate exist. This uniqueness creates the need for these points to be used both in- and out-of-sample. On the other hand, the synthetic forward prices can be partitioned into two sets.

The partition is performed according to two principles. First, we want a wide range of strike prices in-sample since they are important for making good estimates. Second, we want to have a greater portion in-sample than out-of-sample. The set of synthetic forward positions is divided into an in- and out-of-sample set according to two criteria. The first criterion is that a synthetic forward is included in-sample if its strike price is below a lower limit, $\ell \in \mathbb{R}_+$, or above an upper limit, $u \in \mathbb{R}_+$. The second criterion is that, of the synthetic forwards not included in-sample by the first criterion, every $k$th is placed in the out-of-sample set, while the remaining are placed in-sample, where $k \in \mathbb{N}_+$, i.e., a strictly positive integer. It would be improper to make the first inclusion criterion static, since the index value changes during the studied period, and thus the limits of in- and out-of-the-money change. Therefore, rather than assigning static values to $\ell$ and $u$, we assign values relative to the index value for each day. The index value was not constant intraday, and we computed the daily index reference value as the median of all intraday index quotes and denoted it by $\hat{S}$, and we defined $\ell = \bar{\ell}\hat{S}$ and $u = \bar{u}\hat{S}$, where $\bar{\ell} \in \mathbb{R}_+$ and $\bar{u} \in \mathbb{R}_+$.
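The two criteria can be sketched as follows. This is a minimal illustration under stated assumptions: the strikes are processed in ascending order, the counting of every $k$th forward starts at the first strike inside the limits, and the names `partition_strikes`, `lower_rel`, and `upper_rel` (the limits relative to the daily median index quote) are hypothetical. The paper does not specify the ordering or where the counting starts.

```python
from statistics import median


def partition_strikes(strikes, intraday_index_quotes, lower_rel, upper_rel, k):
    """Split strikes into (in_sample, out_of_sample) lists."""
    if k < 1:
        raise ValueError("k must be a strictly positive integer")
    reference = median(intraday_index_quotes)  # daily index reference value
    lower, upper = lower_rel * reference, upper_rel * reference
    in_sample, out_of_sample = [], []
    inside_count = 0
    for strike in sorted(strikes):
        if strike < lower or strike > upper:
            # First criterion: strikes outside the limits stay in-sample.
            in_sample.append(strike)
            continue
        inside_count += 1
        if inside_count % k == 0:
            # Second criterion: every k-th remaining strike goes out-of-sample.
            out_of_sample.append(strike)
        else:
            in_sample.append(strike)
    return in_sample, out_of_sample
```

With $k > 2$, the in-sample set is always the larger one, which matches the second principle stated above.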

The in-sample and out-of-sample data are from the same data set, but their roles are not equal. The in-sample set should, in essence, be the data used for estimation. The out-of-sample set, on the other hand, was used as a reference, and we could therefore be more selective when forming this set, e.g., by using additional filters. One rough measure of the quality of prices is the size of the bid–ask spread, where a wide spread indicates a less reliable price and a narrow spread a more reliable price. The idea is to remove options with too wide spreads, an idea which was used by Blomvall et al. [25] and Azzone and Baviera [24]. The crux is to characterize a typical and reasonable spread. One natural dynamic to keep in mind is that options with higher prices have wider spreads than options with lower prices, if the spreads are measured in an absolute dollar amount. This relationship means that in-sample options have wider spreads than out-of-sample options, and options with longer maturity times have larger spreads than options close to expiry. However, this dynamic is not a big problem in practice. The latter is not a problem, since each time of maturity is managed independently. The former is slightly more challenging, but since the out-of-sample set is a subset from which deep in- and out-of-the-money options have been excluded, the potential impact is limited. Further, the spreads can also vary between days, and thus a fixed cutoff value is not suitable. Instead, a reference is computed for each date to account for this variability.

The additional filter handles call and put options separately and is applied for each date and time to maturity. Let $N_{d,T}$ denote the number of option pairs for date $d$ with time of maturity $T$, and let $\Delta c_i = c^i_a - c^i_b$ and $\Delta p_i = p^i_a - p^i_b$ denote the spread of the $i$th call option and put option, respectively. The scaled median of the spreads is computed as $m_c = (1 + b_c)\,\operatorname{median}_{i \in \{1,\dots,N_{d,T}\}} \Delta c_i$ and $m_p = (1 + b_p)\,\operatorname{median}_{i \in \{1,\dots,N_{d,T}\}} \Delta p_i$, where $b_c \in \mathbb{R}_+$ and $b_p \in \mathbb{R}_+$. We kept an option if its spread was below the scaled median. Note that a complete option pair was required to compute the synthetic forward price, and, hence, if one option in the pair was removed, the other one became useless. The parameters used to generate all out-of-sample results are presented in Table 8.
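The spread filter can be sketched as below for the option pairs of a single date and time of maturity. The dictionary keys, the function name, and the strict inequality for "below the scaled median" are assumptions made for the illustration.

```python
from statistics import median


def filter_pairs_by_spread(pairs, b_call, b_put):
    """Keep only pairs where both options have a spread below the scaled median.

    Each pair is a dict with the keys "call_bid", "call_ask", "put_bid"
    and "put_ask" (hypothetical field names).
    """
    if not pairs:
        return []
    call_spreads = [p["call_ask"] - p["call_bid"] for p in pairs]
    put_spreads = [p["put_ask"] - p["put_bid"] for p in pairs]
    call_limit = (1.0 + b_call) * median(call_spreads)
    put_limit = (1.0 + b_put) * median(put_spreads)
    # A synthetic forward needs both legs, so a pair is dropped
    # as soon as either option fails its own test.
    return [pair
            for pair, dc, dp in zip(pairs, call_spreads, put_spreads)
            if dc < call_limit and dp < put_limit]
```

Because the limits are recomputed from the medians of each date and maturity, the filter adapts to days and maturities with generally wider spreads.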

**Table 8.** This table shows the parameters used to partition the data set into in- and out-of-sample data sets.


### 4.3.2. Evaluation Method

It is critical to choose how to evaluate an estimate. One approach would be to follow the path used by Desmettre et al. [1]. They estimated the dividends for individual shares and compared the results of their estimates with the realized dividends, but we argue that this approach has some intrinsic drawbacks. First, Desmettre et al. [1] discuss a difference in their estimates of the market consensus of the present value dividends and the actual dividends. They used market data for specific markets with a tax setting that they argued was suitable. This favorable tax setting is not present in the US market, and, in Section 2.1, we argue that we do not measure the dividends but rather how the index is affected by them. Second, there is also a practical problem with index data. The index itself does not pay dividends; its constituents do, which results in considerably more dividend payments, and the payments must be scaled with the weights of the constituents. All these technical details make the method error-prone and thus not suitable for use. To summarize, even in ideal conditions, it is not generally valid to compare dividend estimates with their realized counterparts.

Another natural approach would be to use the linear models and their prediction errors, which, in essence, measure how well the linear models reprice the out-of-sample options. The advantage of the prediction errors is that they are easily computable and allow an easy model comparison. The primary disadvantage is that the linear regressions of the put–call parity also include estimations of the interest rate spread. Hence, prediction errors are affected by the quality of both the dividend and the interest rate spread estimates. The results are, thus, in a strict sense, a measure of linear model performance, but not necessarily of the dividend. Consequently, we base our evaluation on another approach: utilizing the sloped asset position. The limitation of using this position as an estimator is the vast amount of combinations. A potential solution to this limitation is to limit the data, but the drawback is figuring out how to make such a limitation systematically. However, in the out-of-sample testing, the data set was, by construction, small enough to use sloped asset positions. The sloped asset position makes it possible to test the estimates isolated from the potential effects of the interest rate. Furthermore, we use the adjusted share price formulation, *S*∗, to compare yield and present value since the two types are not directly comparable. The regressions were run in-sample, and they were then compared with the help of the out-of-sample data.

The mean squared error of the residuals is one way of measuring and comparing the different methods. However, it is difficult to judge from it whether the difference between methods is big or small. Therefore, we complemented the mean squared error with a statistical test on the out-of-sample data. We use the version of the Diebold–Mariano test that was used by Blomvall et al. [25]. This test is a version of the original test presented by Diebold and Mariano [37]. The test consists of four steps. First, we partitioned the data into in- and out-of-sample sets. Second, the regression was performed (in-sample). Third, the linear models were evaluated on the out-of-sample data to measure the errors. Fourth, we performed the Diebold–Mariano test on the errors. We denoted the errors of the two regressions being compared by $s_{i,1}$ and $s_{i,2}$, respectively, where $i = 1, \dots, n$. Let $d_i = s_{i,1}^2 - s_{i,2}^2$, $\forall i = 1, \dots, n$, denote the loss differentials, and let

$$\bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i,$$

denote the mean of the loss differentials, and the autocovariance with lag *k* be

$$\gamma_k = \frac{1}{n} \sum_{i=k+1}^n \left( d_i - \bar{d} \right) \left( d_{i-k} - \bar{d} \right). \tag{31}$$

The Diebold–Mariano statistic was formulated as

$$DM = \frac{\bar{d}}{\sqrt{\frac{1}{n} \left(\gamma_0 + 2\sum_{k=1}^{h-1} \gamma_k\right)}}, \tag{32}$$

where $h \in \mathbb{N}_+$, i.e., a strictly positive integer, and we chose $h = n^{1/3} + 1$. The Diebold–Mariano test statistic follows a standard normal distribution, $N(0, 1)$, under the null hypothesis $H_0 : \mathbb{E}[d_i] = \mu = 0$. We computed the errors, $s_{i,1}$ and $s_{i,2}$, in two ways. First, we used the prediction errors of the linear models of the out-of-sample data. Second, we also used the adjusted spot price that was implied by the sloped asset position.

Further, the Diebold–Mariano test only determines if there is a (significant) difference between methods, but the test does not quantify this difference. However, Blomvall et al. [25] present one measure, $\sqrt{2/\pi}\,\sqrt{\bar{d}}$, that can be interpreted as the average improvement between estimates. The measure has the same unit as the errors, $s_{i,1}$ and $s_{i,2}$.
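The statistic in (31) and (32) and the improvement measure can be sketched as below. This is a minimal illustration and not the authors' code: the function names are hypothetical, the lag parameter is assumed to be obtained by truncating $n^{1/3}$ to an integer before adding one, and the first error series is taken to belong to the reference method so that a positive statistic favors the alternative.

```python
import math


def diebold_mariano(errors_ref, errors_alt, h=None):
    """Return (DM statistic, mean loss differential) for squared-error loss."""
    if len(errors_ref) != len(errors_alt) or len(errors_ref) < 2:
        raise ValueError("need two error series of equal length (at least 2)")
    d = [a * a - b * b for a, b in zip(errors_ref, errors_alt)]
    n = len(d)
    d_bar = sum(d) / n
    if h is None:
        # Assumption: n**(1/3) is truncated to an integer before adding 1.
        h = math.floor(n ** (1.0 / 3.0) + 1e-9) + 1
    centred = [di - d_bar for di in d]

    def gamma(k):
        # Autocovariance of the loss differentials at lag k, Equation (31).
        return sum(centred[i] * centred[i - k] for i in range(k, n)) / n

    long_run_var = gamma(0) + 2.0 * sum(gamma(k) for k in range(1, h))
    if long_run_var <= 0.0:
        # The unweighted truncated sum is not guaranteed to be positive.
        raise ValueError("non-positive variance estimate; try a different h")
    return d_bar / math.sqrt(long_run_var / n), d_bar


def average_improvement(d_bar):
    """sqrt(2/pi) * sqrt(d_bar), defined here only for a non-negative d_bar."""
    if d_bar < 0.0:
        raise ValueError("d_bar must be non-negative")
    return math.sqrt(2.0 / math.pi) * math.sqrt(d_bar)
```

The statistic would then be compared with standard normal quantiles, e.g., an absolute value above roughly 1.96 for a two-sided test at the 5% level.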

### 4.3.3. Results

The results are divided between the mean squared errors presented in Table 9 and the Diebold–Mariano tests presented in Table 10. The Diebold–Mariano tests cover the comparisons between yield and present value, ordinary and weighted least squares formulations, and single-time and intraday data. Table 9 shows that the out-of-sample results are consistent with the in-sample results, since the present value outperformed the yield formulation. Moreover, the weighted least squares formulation performed better than the ordinary least squares formulation. The mean squared errors indicate the relative performance of the different models, but they do not show whether the differences are statistically significant.

**Table 9.** This table shows the out-of-sample mean squared errors (MSE) of the ordinary least squares (OLS) and the weighted least squares (WLS) formulations and the differences between these two regressions. This table presents the results of two different measures: dividend yield and present value dividend; and two evaluation methods: regression residuals and difference to the sloped asset position.


**Table 10.** This table presents the Diebold–Mariano test statistics for the comparison between the different estimation methods. The first four columns show information about the method: the type of dividend formulation, yield or present value (PV); the regression form, ordinary (OLS) or weighted least squares (WLS); the type of data used, either single-time (Single) or intraday (I-day) data; and the errors, which can be based on prediction or sloped asset positions. The Diebold–Mariano test compares a pair of methods, each row in the table is one such comparison, and the compared methods are indicated with "reference method" vs. "alternative method". For example, in the first row, the yield and present value formulation are compared. The fifth and sixth columns contain the measure based on the mean of the loss differentials and the Diebold–Mariano test statistic, where a positive or negative sign indicates that the alternative method is better or worse, respectively, than the reference method.


To assess the statistical significance of the differences between the models, we discuss the Diebold–Mariano results in this section. The Diebold–Mariano test results are presented in Table 10. That table presents the test statistic, and all the comparisons show significant differences. Further, the fifth column, $\sqrt{2/\pi}\,\sqrt{\bar{d}}$, is a measure of the differences between the methods. The statistical test and the values of the measures yield the same results, which can be summarized in three points. First, the present value dividend formulation is significantly better than the yield formulation, and the improvements are between 13.91 and 17.61 cents. Second, the weighted least squares formulation is significantly better than the ordinary least squares formulation, and the improvements are between 3.95 and 19.37 cents. Third, basing dividend estimates on intraday data is significantly better than basing them on single-time data, and the improvement is 54.57 cents. These quantitative results align with the earlier qualitative results.

### *4.4. Performance Difference between Yield and Present Value*

We have seen that the present value formulation has a superior performance to the yield formulation both in-sample and out-of-sample for intraday data. If single-time data is used, there is no difference between the two formulations. The methods are similar in assumptions but with a crucial difference. The dividend quantity is assumed constant in both regressions, but a constant dividend yield implies different adjustments to the spot price, which is incompatible with the market participants' perception.

We tested if this variability in the adjustment can explain the inferior performance. We performed a regression that related the difference between the methods and the intraday variability of the spot price to each other. It is possible to create many variability measures, but there are two features that we would like the measure to have. First, the absolute quote changes are less interesting than the relative changes, i.e., the changes should be related to spot price. Second, we want the regression to be easily computable and tractable.

It is also possible to create several measures of dividend differences. We chose to measure the intraday variability of the spot price as the intraday range of the spot price divided by the median spot price. First, we introduced times $t_j$, $j = 1, \dots, M^d$, which were the times when the spot price of the index was recorded, and $M^d$ was the number of such times for day $d$. The variability of the spot price for a day $d$ was then written as

$$\Delta S_d = \frac{\max_{j \in \{1, \dots, M^d\}} S(t_j) - \min_{j \in \{1, \dots, M^d\}} S(t_j)}{\operatorname*{median}_{j \in \{1, \dots, M^d\}} S(t_j)},\tag{33}$$

and the difference between the dividends of the two formulations as

$$Y_d(T) = \frac{1}{T} \left| \hat{D}^{d,T} - \hat{D}_y^{d,T} \right|, \tag{34}$$

where

$$\hat{D}_y^{d,T} = \operatorname*{median}_{j \in \{1, \dots, M^d\}} S(t_j) \left(1 - \hat{\gamma}_2^{d,T}\right). \tag{35}$$

The values of $\hat{D}_y^{d,T}$ were aggregated into a single value, $Y_d$, for each date as the mean of $Y_d(T)$. The regression can then be formulated as

$$Y_d = \beta \Delta S_d, \tag{36}$$

and the results are presented in Table 11 and Figure 5. We can see that the t-statistic indicates that the coefficient is significantly different from zero, indicating that the spot price variability partly explains the difference. Furthermore, we can see from Figure 5 that the spot price variability is probably not the sole explanation, but it is possible to conclude that increased variability increases the difference between the two dividend formulations.
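The variability measure (33) and a regression of the form (36) can be sketched as below. The sketch assumes an ordinary least squares fit without an intercept, as (36) is written, and the usual standard error for that case; the function names are hypothetical, and the estimator behind the reported results may differ.

```python
import math
from statistics import median


def spot_variability(intraday_spots):
    """Equation (33): intraday range divided by the intraday median."""
    return (max(intraday_spots) - min(intraday_spots)) / median(intraday_spots)


def fit_through_origin(x, y):
    """Least squares fit of y = beta * x; returns (beta, standard error)."""
    n = len(x)
    s_xx = sum(xi * xi for xi in x)
    if n < 2 or len(y) != n or s_xx == 0.0:
        raise ValueError("need matching series with at least two points")
    beta = sum(xi * yi for xi, yi in zip(x, y)) / s_xx
    sse = sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))
    # One estimated parameter, hence n - 1 degrees of freedom.
    std_err = math.sqrt(sse / (n - 1) / s_xx)
    return beta, std_err
```

The t-statistic for the slope is the ratio `beta / std_err`, provided the standard error is nonzero.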

**Table 11.** This table shows the results of the regression (36), which relates the difference between the dividend yield and present value dividend estimates to the intraday variability of the spot price.

**Figure 5.** This figure illustrates the regression results between the variability of the underlying and the difference between estimating a dividend yield and a present value of a dividend. The slope coefficient is 4.0059, which means that greater variability predicts a bigger difference between the yield and the present value estimate.

### *4.5. Conclusions*

This paper has made both practical and theoretical contributions to the literature in this area. The practical contribution is that we have expanded and generalized the regression method presented by Desmettre et al. [1] in two regards. First, we have generalized the regression from an ordinary least squares formulation to a weighted least squares formulation. Second, the regression has been reformulated to utilize intraday data rather than being limited to data recorded at a single time. We have proven that both of these changes improve the quality of the dividend estimates with statistical significance. The latter improved the estimation more than the former. Additionally, one key component of this analysis is the new European option position (the sloped asset position) that we have introduced. This position makes it possible to evaluate dividend estimates independent of interest rate estimates.

The main theoretical contribution is that we have proven that the present value dividend formulation performs significantly better than the yield formulation. We have also proposed an explanation for this phenomenon: the worse performance of the yield formulation is caused by the inherent connection between the yield and the spot price. We have also contributed theoretically by clarifying the interpretation of the dividend. These realizations could affect, e.g., the dividend adjustments in derivative pricing.

**Author Contributions:** Conceptualization: P.S., J.B. and M.S.; methodology: P.S., J.B. and M.S.; software: P.S.; validation: P.S.; formal analysis: P.S., J.B. and M.S.; investigation: P.S.; resources: P.S.; data curation: P.S. and J.B.; writing–original draft preparation: P.S.; writing–review and editing: P.S., J.B. and M.S.; visualization: P.S.; supervision: J.B. and M.S.; project administration: P.S.; and funding acquisition, N/A. All authors have read and agreed to the published version of the manuscript, please see the following link: CRediT taxonomy (accessed on 5 December 2021) for explanation of terms. Authorship has been limited to those who have contributed substantially to the work reported.

**Funding:** This research received no external funding.

**Data Availability Statement:** All the data have been collected from the data provider Thomson Reuters Refinitiv Eikon.

**Acknowledgments:** The authors would like to thank Jonas Ekblom, Johan Hagenbjörk and the anonymous reviewers for several valuable and helpful suggestions and comments to improve the presentation of the paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **Appendix A**

The *sloped asset* position can be derived with the dividend expressed either as a yield or as a present value of future dividends. Here, we derive both versions. Let $f_i$ be a synthetic forward with strike price $K_i$, i.e., the options in the option pair used to construct the synthetic forward have the strike price $K_i$. Let $\delta$ and $D$ denote the dividend yield and the present value of the dividends, respectively, and let $r$ denote the continuous interest rate, although the rate is not needed to prove the relation. Finally, let $T$ denote the time to maturity when entering the contracts, and let $t$ denote the time at which the contracts are entered into.

The sloped asset position consists of a long position in $f_1$ and $\frac{K_1}{K_2}$ short positions in $f_2$. The former has a payoff that can be written as $g_1(s) = s - K_1$, while the short positions provide a payoff of $g_2(s) = -\frac{K_1}{K_2}(s - K_2) = \frac{K_1}{K_2}(K_2 - s)$. The total payoff at expiration for the complete position is:

$$g(s) = g_1(s) + g_2(s) = (s - K_1) + \left(\frac{K_1}{K_2}(K_2 - s)\right) = s\left(1 - \frac{K_1}{K_2}\right).$$

The payoff can be interpreted as a fractional position, either long or short, in the shares. The price of this contract upon entering is the share price adjusted for dividends (scaled with the factor), i.e., the following:

$$f\_1 - \frac{K\_1}{K\_2} f\_2 = S^\* \left( 1 - \frac{K\_1}{K\_2} \right) = S e^{-\delta \left( T - t \right)} \left( 1 - \frac{K\_1}{K\_2} \right) \tag{A1}$$

$$f\_1 - \frac{K\_1}{K\_2} f\_2 = S^\* \left( 1 - \frac{K\_1}{K\_2} \right) = (S - D) \left( 1 - \frac{K\_1}{K\_2} \right). \tag{A2}$$

It is possible to find the dividend yields and the present values of the dividends directly from the expressions by rearranging them thus:

$$\delta = -\frac{1}{T - t} \ln[S^*/S] = -\frac{1}{T - t} \ln\left[\frac{1}{S} \frac{f_1 K_2 - f_2 K_1}{K_2 - K_1}\right],\tag{A3}$$

$$D = S - S^\* = S - \frac{f\_1 K\_2 - f\_2 K\_1}{K\_2 - K\_1}.\tag{A4}$$
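Equations (A3) and (A4) can be evaluated directly from two synthetic forward prices. The sketch below is illustrative only: the function names are hypothetical, and the usage example assumes that each synthetic forward is priced by put-call parity as $f_i = S^* - K_i e^{-r(T-t)}$, which is an assumption made here to generate consistent inputs.

```python
import math


def implied_adjusted_spot(f1, f2, k1, k2):
    """Dividend-adjusted spot S* = (f1*K2 - f2*K1) / (K2 - K1)."""
    if k1 == k2:
        raise ValueError("the two strike prices must differ")
    return (f1 * k2 - f2 * k1) / (k2 - k1)


def implied_dividend_yield(spot, f1, f2, k1, k2, tau):
    """Dividend yield from Equation (A3); tau is T - t in years."""
    s_star = implied_adjusted_spot(f1, f2, k1, k2)
    return -math.log(s_star / spot) / tau


def implied_dividend_pv(spot, f1, f2, k1, k2):
    """Present value of dividends from Equation (A4)."""
    return spot - implied_adjusted_spot(f1, f2, k1, k2)


# Hypothetical inputs: S = 100, D = 2, r = 3%, six months to maturity.
spot, pv_div, rate, tau = 100.0, 2.0, 0.03, 0.5
k1, k2 = 95.0, 105.0
discount = math.exp(-rate * tau)
f1 = (spot - pv_div) - k1 * discount  # assumed parity price, strike K1
f2 = (spot - pv_div) - k2 * discount  # assumed parity price, strike K2

d_hat = implied_dividend_pv(spot, f1, f2, k1, k2)
delta_hat = implied_dividend_yield(spot, f1, f2, k1, k2, tau)
```

In this example the discount factor cancels when the two forwards are combined, which reflects the point made above that the dividend estimate does not depend on the interest rate.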

### **References**


### *Article* **Sparse Estimation Strategies in Linear Mixed Effect Models for High-Dimensional Data Application**

**Eugene A. Opoku <sup>1,</sup>\*, Syed Ejaz Ahmed <sup>2</sup> and Farouk S. Nathoo <sup>1</sup>**


**Abstract:** In a host of business applications, biomedical and epidemiological studies, the problem of multicollinearity among predictor variables is a frequent issue in longitudinal data analysis for linear mixed models (LMM). We consider an efficient estimation strategy for high-dimensional data application, where the dimensions of the parameters are larger than the number of observations. In this paper, we are interested in estimating the fixed effects parameters of the LMM when it is assumed that some prior information is available in the form of linear restrictions on the parameters. We propose the pretest and shrinkage estimation strategies using the ridge full model as the base estimator. We establish the asymptotic distributional bias and risks of the suggested estimators and investigate their relative performance with respect to the ridge full model estimator. Furthermore, we compare the numerical performance of the LASSO-type estimators with the pretest and shrinkage ridge estimators. The methodology is investigated using simulation studies and then demonstrated on an application exploring how effective brain connectivity in the default mode network (DMN) may be related to genetics within the context of Alzheimer's disease.

**Citation:** Opoku, E.A.; Ahmed, S.E.; Nathoo, F.S. Sparse Estimation Strategies in Linear Mixed Effect Models for High-Dimensional Data Application. *Entropy* **2021**, *23*, 1348. https://doi.org/10.3390/e23101348

Academic Editor: Matteo Convertino

Received: 9 September 2021 Accepted: 12 October 2021 Published: 15 October 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**Keywords:** linear mixed model; ridge estimation; pretest and shrinkage estimation; multicollinearity; asymptotic bias and risk; LASSO estimation; high-dimensional data

### **1. Introduction**

In many fields such as bio-informatics, physical biology, and epidemiology, the response of interest is represented by repeated measures of some variables of interest that are collected over a specified time period for different independent subjects or individuals. These types of data are commonly encountered in medical research where the responses are subject to various time-dependent and time-constant effects such as pre- and post-treatment types, gender effect, and baseline measures, among others. A widely used statistical tool in the analysis and modeling of longitudinal and repeated measures data is the linear mixed effects model (LMM) [1,2]. This model provides an effective and flexible way to describe the means and the covariance structures of a response variable after accounting for within-subject correlation.

The rapid growth in the size and scope of longitudinal data has created a need for innovative statistical strategies in longitudinal data analysis. Classical methods are based on the assumption that the number of predictors is less than the number of observations. However, there is an increasing demand for efficient prediction strategies for the analysis of high-dimensional data, where the number of observed data elements (sample size) is smaller than the number of predictors in a linear model context. Existing techniques that deal with high-dimensional data mostly rely on various penalized estimators. Due to the trade-off between model complexity and model prediction, the statistical inference of model selection becomes an extremely important and challenging problem in high-dimensional data analysis.

Over the years, many penalized regularization approaches have been developed to do variable selection and estimation simultaneously. Among them, the least absolute shrinkage and selection operator (LASSO) is commonly used [3]. It is a useful estimation technique in part due to its convexity and computational efficiency. The LASSO approach is based on an $\ell_1$ penalty for regularization of regression parameters. Ref. [4] provides a comprehensive summary of the consistency properties of the LASSO approach. Related penalized likelihood methods have been extensively studied in the literature; see, for example, [5–10]. The penalized likelihood methods have a close connection to Bayesian procedures. In particular, the LASSO estimate corresponds to a Bayes method that puts a Laplacian (double-exponential) prior on the regression coefficients [11,12].

In this paper, our interest lies in estimating the fixed effect parameters of the LMM using a ridge estimation technique when it is assumed that some prior information is available in the form of potential linear restrictions on the parameters. One possible source of prior information is using a Bayesian approach. An alternative source of prior information may be obtained from previous studies or expert knowledge that search for or assume sparsity patterns.

We consider the problem of fixed effect parameter estimation for LMMs when there exist many predictors relative to the sample size. These predictors may be classified into two groups: sparse and non-sparse. Thus, there are two choices to be considered: a full model with all predictors, and a sub-model that contains only non-sparse predictors. When the sub-model based on available subspace information is true (i.e., the assumed restriction holds), it provides more efficient statistical inferences than those based on a full model. In contrast, if the sub-model is not true, the estimates could become biased and inefficient. The consequences of incorporating subspace information therefore depend on the quality or reliability of the information being incorporated into the estimation procedure. One way to deal with uncertain subspace information is to use a pretest estimation strategy, in which the validity of the information is tested before it is incorporated into a final estimator. Another approach is shrinkage estimation, which shrinks the full model estimator toward the sub-model estimator by utilizing subspace information. Besides these estimation strategies, there is a growing literature on simultaneous model selection and estimation. These approaches are known as penalty strategies. By shrinking some regression coefficients toward zero, the penalty methods simultaneously select a sub-model and estimate its regression parameters. Several authors have investigated the pretest, shrinkage, and penalty estimation strategies in partial linear models, Poisson regression models, and Weibull censored regression models [13–15].

To formulate the problem, we suppose that the vector of the fixed effects parameter $\boldsymbol{\beta}$ in the LMM can be partitioned into two sub-vectors $\boldsymbol{\beta} = (\boldsymbol{\beta}_1^\top, \boldsymbol{\beta}_2^\top)^\top$, where $\boldsymbol{\beta}_1$ is the coefficient vector of non-sparse predictors and $\boldsymbol{\beta}_2$ is the coefficient vector of sparse predictors. Our interest lies in the estimation of $\boldsymbol{\beta}_1$ when $\boldsymbol{\beta}_2$ is close to zero. To deal with this problem in the context of low-dimensional data, ref. [16] propose an improved estimation strategy using sub-model selection and post-estimation for the LMM. Within this framework, linear shrinkage and shrinkage pretest estimation strategies are developed, which combine full model and sub-model estimators in an effective way as a trade-off between bias and variance. Ref. [17] extend this study by using a likelihood ratio test to develop James–Stein shrinkage and pretest estimation methods based on LMM for longitudinal data. In addition, the non-penalty estimators are compared with several penalty estimators (LASSO, adaptive LASSO and Elastic Net) for best performance.

In most real data situations, there is also the problem of multicollinearity among predictor variables for high-dimensional data. Various biased estimation techniques such as shrinkage estimation, partial least squares estimation [18] and Liu estimators [19] have been implemented to deal with this problem, but the most widely used technique is ridge estimation [20]. The ridge estimator addresses the weakness of the least squares estimator under multicollinearity by attaining a smaller mean squared error. To combat multicollinearity, ref. [21] propose pretest and Stein-type ridge regression estimators for linear and partially linear models. Furthermore, ref. [22] develop shrinkage estimation based on Liu regression to overcome multicollinearity in linear models.

Our primary focus is on the estimation and prediction problem for linear mixed effect models when there are many potential predictors that have a weak or no influence on the response of interest. The proposed method controls overfitting by using generalized least squares estimation with a roughness penalty. We propose pretest and shrinkage estimation strategies using the ridge estimation technique as a base estimator and numerically compare their performance with the LASSO and adaptive LASSO estimators. Our proposed estimation strategy is applied to both high-dimensional and low-dimensional data.

The rest of this article is organized as follows. In Section 2, we present the linear mixed effect model and the proposed estimation techniques. We introduce the full and sub-model estimators based on ridge estimation. Thereafter, we construct the pretest and shrinkage ridge estimators. Section 3 provides the asymptotic bias and risk of these estimators. A Monte Carlo simulation is used to evaluate the performance of the estimators, including a comparison with the LASSO-type estimators, and the results are reported in Section 4. Section 5 presents a demonstration of the proposed methodology on high-dimensional resting-state effective brain connectivity and genetic data. We also illustrate the proposed estimation methods in an application to a low-dimensional Amsterdam growth and health study. Section 6 presents a discussion with recommendations.

### **2. Model and Estimation Strategies**

In this section, we present the linear mixed effect model and the proposed estimation strategies.

### *2.1. Linear Mixed Model*

Suppose that we have a sample of $n$ subjects. For the $i$th subject, we collect the response variable $y_{ij}$ at the $j$th time, where $i = 1, \ldots, n$; $j = 1, \ldots, n_i$ and $N = \sum_{i=1}^{n} n_i$. Let $\mathbf{Y}_i = (y_{i1}, \ldots, y_{in_i})^\top$ denote the $n_i \times 1$ vector of responses from the $i$th subject. Let $\mathbf{X}_i = (\mathbf{x}_{i1}, \ldots, \mathbf{x}_{in_i})^\top$ and $\mathbf{Z}_i = (\mathbf{z}_{i1}, \ldots, \mathbf{z}_{in_i})^\top$ be the known $n_i \times p$ fixed-effects and $n_i \times q$ random-effects design matrices for the $i$th subject, of full rank $p$ and $q$, respectively. The linear mixed effect model [1] for a vector of repeated responses $\mathbf{Y}_i$ on the $i$th subject is assumed to have the form

$$\mathbf{Y}_{i} = \mathbf{X}_{i}\boldsymbol{\beta} + \mathbf{Z}_{i}\mathbf{a}_{i} + \boldsymbol{\epsilon}_{i},\tag{1}$$

where $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^\top$ is the $p \times 1$ vector of unknown fixed-effect parameters or regression coefficients, and $\mathbf{a}_i$ is the $q \times 1$ vector of unobservable random effects for the $i$th subject, assumed to come from a multivariate normal distribution with zero mean and an unknown $q \times q$ covariance matrix $\mathbf{G}$. Further, $\boldsymbol{\epsilon}_i$ denotes the $n_i \times 1$ vector of error terms, assumed to be normally distributed with zero mean and covariance matrix $\sigma^2\mathbf{I}_{n_i}$. The errors $\boldsymbol{\epsilon}_i$ are assumed to be independent of the random effects $\mathbf{a}_i$.

The marginal distribution of the response $\mathbf{Y}_i$ is normal with mean $\mathbf{X}_i\boldsymbol{\beta}$ and covariance matrix $\operatorname{Cov}(\mathbf{Y}_i) = \mathbf{Z}_i\sigma_i^2\mathbf{Z}_i^\top + \sigma^2\mathbf{I}_n$. By stacking the vectors, the mixed model can be expressed as $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{a} + \boldsymbol{\epsilon}$. From Equation (1), the distribution of the model follows $\mathbf{Y} \sim \mathcal{N}_n(\mathbf{X}\boldsymbol{\beta}, \mathbf{V})$, where $E(\mathbf{Y}) = \mathbf{X}\boldsymbol{\beta}$ with covariance $\mathbf{V} = \sum_{i=1}^{n} \mathbf{Z}_i\sigma_i^2\mathbf{Z}_i^\top + \sigma^2\mathbf{I}_n$.

### *2.2. Ridge Full Model and Sub-Model Estimator*

The generalized least squares (GLS) estimator is defined as $\hat{\boldsymbol{\beta}}^{\text{GLS}} = (\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{Y}$, and the ridge full model estimator is obtained from the penalized criterion

$$\hat{\boldsymbol{\beta}}^{\text{Ridge}} = \arg\min_{\boldsymbol{\beta}} \left\{ (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^\top\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) + k\boldsymbol{\beta}^\top\boldsymbol{\beta} \right\} = (\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X} + k\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{Y},$$

where $\hat{\boldsymbol{\beta}}^{\text{Ridge}}$ is the ridge full model estimator and $k \in [0, \infty)$ is the tuning parameter. If $k = 0$, $\hat{\boldsymbol{\beta}}^{\text{Ridge}}$ is the GLS estimator, and $\hat{\boldsymbol{\beta}}^{\text{Ridge}} \to \mathbf{0}$ as $k \to \infty$. We select the value of $k$ using cross-validation.
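The closed form of the ridge full model estimator can be sketched numerically as follows. This is a minimal illustration using NumPy, assuming that the covariance matrix $\mathbf{V}$ is known and symmetric positive definite; in practice $\mathbf{V}$ has to be estimated, and the function name `ridge_gls` is hypothetical.

```python
import numpy as np


def ridge_gls(X, V, y, k=0.0):
    """Ridge GLS estimate (X' V^-1 X + k I)^-1 X' V^-1 y.

    Assumption: V is a known symmetric positive definite matrix.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    V = np.asarray(V, dtype=float)
    # Solve V W = X rather than forming V^-1 explicitly.
    Vinv_X = np.linalg.solve(V, X)
    A = X.T @ Vinv_X + k * np.eye(X.shape[1])  # X' V^-1 X + k I
    b = Vinv_X.T @ y                           # X' V^-1 y (V symmetric)
    return np.linalg.solve(A, b)


# Hypothetical toy data with an exact linear fit and V = I.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
V = np.eye(3)

beta_gls = ridge_gls(X, V, y, k=0.0)    # k = 0 reduces to GLS
beta_big_k = ridge_gls(X, V, y, k=1e6)  # large k shrinks toward zero
```

When the number of predictors exceeds the number of observations, $\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X}$ is singular, and a positive $k$ is what makes the linear system solvable.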

We let $\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2)$, where $\mathbf{X}_1$ is an $n \times p_1$ sub-matrix containing the non-sparse predictors and $\mathbf{X}_2$ is an $n \times p_2$ sub-matrix that contains the sparse predictors. Accordingly, $\boldsymbol{\beta} = (\boldsymbol{\beta}_1^\top, \boldsymbol{\beta}_2^\top)^\top$, where $\boldsymbol{\beta}_1$ and $\boldsymbol{\beta}_2$ have dimensions $p_1$ and $p_2$, respectively, with $p_1 + p_2 = p$ and $p_i \ge 0$ for $i = 1, 2$.

A sub-model is defined as $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{a} + \boldsymbol{\epsilon}$ subject to $\boldsymbol{\beta}^\top\boldsymbol{\beta} \le \varphi$ and $\boldsymbol{\beta}_2 = \mathbf{0}$, which corresponds to $\mathbf{Y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{Z}\mathbf{a} + \boldsymbol{\epsilon}$ subject to $\boldsymbol{\beta}_1^\top\boldsymbol{\beta}_1 \le \varphi$. The sub-model estimator $\hat{\boldsymbol{\beta}}_1^{\text{RSM}}$ of $\boldsymbol{\beta}_1$ has the form

$$\hat{\boldsymbol{\beta}}_1^{\text{RSM}} = (\mathbf{X}_1^\top\mathbf{V}^{-1}\mathbf{X}_1 + k\mathbf{I})^{-1}\mathbf{X}_1^\top\mathbf{V}^{-1}\mathbf{Y}.$$

We denote by $\hat{\boldsymbol{\beta}}_1^{\text{RFM}}$ the full model ridge estimator of $\boldsymbol{\beta}_1$, which is given as

$$\hat{\boldsymbol{\beta}}_1^{\text{RFM}} = (\mathbf{X}_1^\top\mathbf{V}^{-1/2}\mathbf{M}_{\mathbf{X}_2}\mathbf{V}^{-1/2}\mathbf{X}_1 + k\mathbf{I})^{-1}\mathbf{X}_1^\top\mathbf{V}^{-1/2}\mathbf{M}_{\mathbf{X}_2}\mathbf{V}^{-1/2}\mathbf{Y},$$

where $\mathbf{M}_{\mathbf{X}_2} = \mathbf{I} - \mathbf{P} = \mathbf{I} - \mathbf{V}^{-1/2}\mathbf{X}_2(\mathbf{X}_2^\top\mathbf{V}^{-1}\mathbf{X}_2)^{-1}\mathbf{X}_2^\top\mathbf{V}^{-1/2}$.
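The two estimators of $\boldsymbol{\beta}_1$ can be sketched in NumPy as shown below. The sketch assumes a known symmetric positive definite $\mathbf{V}$ and an invertible $\mathbf{X}_2^\top\mathbf{V}^{-1}\mathbf{X}_2$ (so it covers only the case in which $p_2$ does not exceed the number of observations); all function names and the simulated data are hypothetical.

```python
import numpy as np


def inv_sqrt_spd(V):
    """Symmetric inverse square root V^(-1/2) of an SPD matrix."""
    w, U = np.linalg.eigh(V)
    return (U / np.sqrt(w)) @ U.T


def ridge_submodel(X1, V, y, k):
    """Sub-model ridge estimate (X1' V^-1 X1 + k I)^-1 X1' V^-1 y."""
    Vinv_X1 = np.linalg.solve(V, X1)
    A = X1.T @ Vinv_X1 + k * np.eye(X1.shape[1])
    return np.linalg.solve(A, Vinv_X1.T @ y)


def ridge_fullmodel_beta1(X1, X2, V, y, k):
    """Full-model ridge estimate of beta_1 after projecting out X2."""
    W = inv_sqrt_spd(V)
    X1w, X2w, yw = W @ X1, W @ X2, W @ y  # whitened quantities
    G = X2w.T @ X2w                       # X2' V^-1 X2 (assumed invertible)
    # Apply M = I - X2w (X2w' X2w)^-1 X2w' to X1w and yw.
    M_X1 = X1w - X2w @ np.linalg.solve(G, X2w.T @ X1w)
    M_y = yw - X2w @ np.linalg.solve(G, X2w.T @ yw)
    A = X1w.T @ M_X1 + k * np.eye(X1.shape[1])
    return np.linalg.solve(A, X1w.T @ M_y)


# Hypothetical simulated data: 12 observations, p1 = 2, p2 = 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4))
L = rng.normal(size=(12, 12))
V = L @ L.T / 12.0 + np.eye(12)
y = rng.normal(size=12)
X1, X2 = X[:, :2], X[:, 2:]

beta1_fm = ridge_fullmodel_beta1(X1, X2, V, y, k=0.0)
beta1_sm = ridge_submodel(X1, V, y, k=0.0)
```

With $k = 0$, the projected estimate coincides with the first $p_1$ components of the GLS estimate from the full design, which gives a convenient consistency check.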

### *2.3. Pretest Ridge Estimation Strategy*

Generally, the sub-model estimator will be more efficient than the full model estimator if the information embodied in the imposed linear restrictions is valid, that is, if $\boldsymbol{\beta}_2$ is close to zero. However, if the information is not valid, the sub-model estimator is likely to be more biased and may have a higher risk than the full model estimator. There is, therefore, some doubt as to whether or not to impose the restrictions on the model's parameters. In response to this uncertainty, a statistical test may be used to determine the validity of the proposed restrictions. Accordingly, the procedure to follow in practice is to pretest the validity of the restrictions. If the outcome of the pretest suggests that they are correct, then the model parameters are estimated incorporating the restrictions. If the pretest rejects the restrictions, then the parameters are estimated from the sample information alone. This motivates the consideration of the pretest estimation strategy for the LMM.

The pretest estimator is a combination of the full model estimator $\hat{\boldsymbol{\beta}}_1^{\text{RFM}}$ and the sub-model estimator $\hat{\boldsymbol{\beta}}_1^{\text{RSM}}$ through an indicator function $I(\mathcal{L}_n \le d_{n,\alpha})$, where $\mathcal{L}_n$ is an appropriate test statistic to test $H_0 : \boldsymbol{\beta}_2 = \mathbf{0}$ versus $H_A : \boldsymbol{\beta}_2 \ne \mathbf{0}$. Moreover, $d_{n,\alpha}$ is an $\alpha$-level critical value based on the distribution of $\mathcal{L}_n$ under $H_0$. We define the test statistic based on the log-likelihood ratio test as $\mathcal{L}_n = 2\left\{\ell^*(\hat{\boldsymbol{\beta}}^{\text{RFM}} \mid \mathbf{Y}) - \ell^*(\hat{\boldsymbol{\beta}}^{\text{RSM}} \mid \mathbf{Y})\right\}$.

Under $H_0$, the test statistic $\mathcal{L}_n$ asymptotically follows a chi-square distribution with $p_2$ degrees of freedom. The pretest ridge estimator $\hat{\boldsymbol{\beta}}_1^{\text{RPT}}$ of $\boldsymbol{\beta}_1$ is then defined by

$$
\hat{\boldsymbol{\beta}}_{1}^{\text{RPT}} = \hat{\boldsymbol{\beta}}_{1}^{\text{RFM}} - (\hat{\boldsymbol{\beta}}_{1}^{\text{RFM}} - \hat{\boldsymbol{\beta}}_{1}^{\text{RSM}})\,I(\mathcal{L}_{n} \le d_{n,\alpha}), \quad p_{2} \ge 1.
$$
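The pretest rule can be written as a short function. This is a sketch under stated assumptions: the two estimates and the test statistic are taken as given, the critical value is supplied by the caller (for example, the upper $\alpha$ quantile of a chi-square distribution with $p_2$ degrees of freedom), and the function name and numbers are hypothetical.

```python
def pretest_estimate(beta_fm, beta_sm, test_stat, critical_value):
    """Pretest estimator: beta_fm - (beta_fm - beta_sm) * I(L_n <= d).

    beta_fm, beta_sm: sequences holding the full-model and sub-model
    estimates of beta_1; test_stat is L_n; critical_value is d_{n,alpha}.
    """
    if len(beta_fm) != len(beta_sm):
        raise ValueError("estimates must have the same length")
    # Indicator is 1 when the restriction beta_2 = 0 is not rejected.
    indicator = 1.0 if test_stat <= critical_value else 0.0
    return [fm - (fm - sm) * indicator for fm, sm in zip(beta_fm, beta_sm)]


# Hypothetical estimates; 7.81 is roughly the 5% critical value for p2 = 3.
full = [1.0, 2.0]
sub = [0.5, 1.5]
kept_restriction = pretest_estimate(full, sub, test_stat=3.0, critical_value=7.81)
rejected_restriction = pretest_estimate(full, sub, test_stat=10.0, critical_value=7.81)
```

The estimator therefore returns one of the two candidates in full, which is the hard-threshold behaviour discussed in the next subsection.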

### *2.4. Shrinkage Ridge Estimation Strategy*

The pretest estimator is a discontinuous function of the sub-model estimator $\hat{\boldsymbol{\beta}}_1^{\text{RSM}}$ and the full model estimator $\hat{\boldsymbol{\beta}}_1^{\text{RFM}}$, and it depends on the hard threshold $d_{n,\alpha} = \chi^2_{p_2,\alpha}$. We address this limitation by defining the shrinkage ridge estimator based on soft thresholding. The shrinkage ridge estimator (RSE) of $\boldsymbol{\beta}_1$, denoted as $\hat{\boldsymbol{\beta}}_1^{\text{RSE}}$, is defined as

$$
\hat{\boldsymbol{\beta}}_1^{\text{RSE}} = \hat{\boldsymbol{\beta}}_1^{\text{RSM}} + (\hat{\boldsymbol{\beta}}_1^{\text{RFM}} - \hat{\boldsymbol{\beta}}_1^{\text{RSM}}) (1 - (p_2 - 2)\mathcal{L}_n^{-1}), \quad p_2 \ge 3.
$$

Here, $\hat{\boldsymbol{\beta}}_1^{\text{RSE}}$ is a linear combination of the full model estimate $\hat{\boldsymbol{\beta}}_1^{\text{RFM}}$ and the sub-model estimate $\hat{\boldsymbol{\beta}}_1^{\text{RSM}}$. If $\mathcal{L}_n \le (p_2 - 2)$, then a relatively large weight is placed on $\hat{\boldsymbol{\beta}}_1^{\text{RSM}}$; otherwise, more weight is on $\hat{\boldsymbol{\beta}}_1^{\text{RFM}}$. A drawback of $\hat{\boldsymbol{\beta}}_1^{\text{RSE}}$ is that it is not a convex combination of $\hat{\boldsymbol{\beta}}_1^{\text{RFM}}$ and $\hat{\boldsymbol{\beta}}_1^{\text{RSM}}$. This can cause over-shrinkage, which gives the estimator the opposite sign of $\hat{\boldsymbol{\beta}}_1^{\text{RFM}}$. This could happen if $(p_2 - 2)\mathcal{L}_n^{-1}$ is larger than one. To counter this, we use the positive-part shrinkage ridge estimator (RPS) defined as

$$
\hat{\boldsymbol{\beta}}_{1}^{\text{RPS}} = \hat{\boldsymbol{\beta}}_{1}^{\text{RSM}} + (\hat{\boldsymbol{\beta}}_{1}^{\text{RFM}} - \hat{\boldsymbol{\beta}}_{1}^{\text{RSM}}) (1 - (p_2 - 2)\mathcal{L}_n^{-1})^+, \quad p_2 \ge 3,
$$

where $(1 - (p_2 - 2)\mathcal{L}_n^{-1})^+ = \max(0, 1 - (p_2 - 2)\mathcal{L}_n^{-1})$. The RPS estimator controls possible over-shrinking in the RSE estimator.
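The difference between the RSE and RPS estimators is easiest to see numerically. The sketch below takes the two candidate estimates and the test statistic as given; the function names and the numbers are hypothetical.

```python
def shrinkage_weight(test_stat, p2):
    """Weight 1 - (p2 - 2) / L_n placed on the full-model estimate."""
    if p2 < 3:
        raise ValueError("the shrinkage estimator requires p2 >= 3")
    if test_stat <= 0:
        raise ValueError("the test statistic must be positive")
    return 1.0 - (p2 - 2) / test_stat


def shrinkage_estimate(beta_fm, beta_sm, test_stat, p2, positive_part=False):
    """RSE (default) or RPS (positive_part=True) estimate of beta_1."""
    if len(beta_fm) != len(beta_sm):
        raise ValueError("estimates must have the same length")
    weight = shrinkage_weight(test_stat, p2)
    if positive_part:
        # Truncate the weight at zero to avoid over-shrinkage.
        weight = max(0.0, weight)
    return [sm + (fm - sm) * weight for fm, sm in zip(beta_fm, beta_sm)]


# Hypothetical values with p2 = 5, so that p2 - 2 = 3.
full, sub = [1.0], [0.0]
rse_small_stat = shrinkage_estimate(full, sub, test_stat=1.5, p2=5)
rps_small_stat = shrinkage_estimate(full, sub, test_stat=1.5, p2=5, positive_part=True)
rse_large_stat = shrinkage_estimate(full, sub, test_stat=6.0, p2=5)
```

In this example a small test statistic gives the RSE estimate the opposite sign of the full-model estimate, whereas the RPS estimate falls back to the sub-model estimate.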

### **3. Asymptotic Results**

In this section, we derive the asymptotic distributional bias and risk of the estimators considered in Section 2. We examine the properties of the estimators for increasing *n* and as *β*<sup>2</sup> approaches the null vector under the sequence of local alternatives defined as

$$K_n: \boldsymbol{\beta}_2 = \boldsymbol{\beta}_{2(n)} = \frac{\boldsymbol{\kappa}}{\sqrt{n}}, \tag{2}$$

where $\boldsymbol{\kappa} = (\kappa_1, \kappa_2, \ldots, \kappa_{p_2})^\top \in \mathbb{R}^{p_2}$ is a fixed vector. The vector $\boldsymbol{\kappa}/\sqrt{n}$ measures how far the local alternatives $K_n$ differ from the subspace information $\boldsymbol{\beta}_2 = \mathbf{0}$. In order to evaluate the performance of the estimators, we define the asymptotic distributional bias of the estimator $\hat{\boldsymbol{\beta}}_1^*$ as

$$\text{ADB}(\hat{\boldsymbol{\beta}}_1^*) = \lim_{n \to \infty} E\{\sqrt{n}(\hat{\boldsymbol{\beta}}_1^* - \boldsymbol{\beta}_1)\}.$$

In order to compute the risk functions, we first compute the asymptotic covariance of the estimators. The asymptotic covariance of an estimator $\hat{\boldsymbol{\beta}}_1^*$ is expressed as

$$\operatorname{Cov}(\hat{\boldsymbol{\beta}}_1^*) = \lim_{n \to \infty} E\left\{ n (\hat{\boldsymbol{\beta}}_1^* - \boldsymbol{\beta}_1)(\hat{\boldsymbol{\beta}}_1^* - \boldsymbol{\beta}_1)^\top \right\}.$$

Following the asymptotic covariance matrix, we define the asymptotic risk of an estimator $\hat{\boldsymbol{\beta}}_1^*$ as $R(\hat{\boldsymbol{\beta}}_1^*) = \operatorname{tr}\left(\mathbf{Q}\operatorname{Cov}(\hat{\boldsymbol{\beta}}_1^*)\right)$, where $\mathbf{Q}$ is a positive definite matrix of weights with dimensions of $p \times p$. We set $\mathbf{Q} = \mathbf{I}$ in this study.

**Assumption 1.** *We assume the following two regularity conditions to establish the asymptotic properties of the estimators.*

$$\begin{aligned} &\text{1. } \frac{1}{n} \max_{1 \le i \le n} \mathbf{x}_i^\top \left[\mathbf{X}^\top \mathbf{V}^{-1} \mathbf{X}\right]^{-1} \mathbf{x}_i \to 0 \text{ as } n \to \infty, \text{ where } \mathbf{x}_i^\top \text{ is the } i\text{th row of } \mathbf{X}.\\ &\text{2. } \mathbf{B}_{n} = n^{-1} \left[\mathbf{X}^\top \mathbf{V}^{-1} \mathbf{X}\right]^{-1} \to \mathbf{B}, \text{ for some finite } \mathbf{B} = \begin{pmatrix} \mathbf{B}_{11} & \mathbf{B}_{12} \\ \mathbf{B}_{21} & \mathbf{B}_{22} \end{pmatrix}. \end{aligned}$$

**Theorem 1.** *For $k < \infty$, if $k/\sqrt{n} \to \lambda_o$ and $\mathbf{B}$ is non-singular, the distribution of the full model ridge estimator $\hat{\boldsymbol{\beta}}_n^{\text{RFM}}$ is*

$$\sqrt{n}(\hat{\boldsymbol{\beta}}_n^{\text{RFM}} - \boldsymbol{\beta}) \stackrel{D}{\rightarrow} \mathcal{N}(-\lambda_o\mathbf{B}^{-1}\boldsymbol{\beta}, \mathbf{B}^{-1}),$$

*where $\stackrel{D}{\rightarrow}$ denotes convergence in distribution.*

**Proof.** See Theorem 2 in [23].

**Proposition 1.** *Suppose Assumption 1 and Theorem 1 hold. Then, under the local alternatives $K_n$, we have*

$$
\begin{aligned}
\begin{pmatrix} \phi_{1} \\ \phi_{3} \end{pmatrix} & \stackrel{D}{\rightarrow} \mathcal{N} \left[ \begin{pmatrix} -\boldsymbol{\mu}_{11.2} \\ \boldsymbol{\delta} \end{pmatrix}, \begin{pmatrix} \mathbf{B}_{11.2}^{-1} & \boldsymbol{\Phi} \\ \boldsymbol{\Phi} & \boldsymbol{\Phi} \end{pmatrix} \right], \\
\begin{pmatrix} \phi_{3} \\ \phi_{2} \end{pmatrix} & \stackrel{D}{\rightarrow} \mathcal{N} \left[ \begin{pmatrix} \boldsymbol{\delta} \\ -\boldsymbol{\gamma} \end{pmatrix}, \begin{pmatrix} \boldsymbol{\Phi} & \mathbf{0} \\ \mathbf{0} & \mathbf{B}_{11}^{-1} \end{pmatrix} \right],
\end{aligned}
$$

where $\phi_1 = \sqrt{n}(\hat{\boldsymbol{\beta}}_1^{\text{RFM}} - \boldsymbol{\beta}_1)$, $\phi_2 = \sqrt{n}(\hat{\boldsymbol{\beta}}_1^{\text{RSM}} - \boldsymbol{\beta}_1)$, $\phi_3 = \sqrt{n}(\hat{\boldsymbol{\beta}}_1^{\text{RFM}} - \hat{\boldsymbol{\beta}}_1^{\text{RSM}})$, $\boldsymbol{\gamma} = \boldsymbol{\mu}_{11.2} + \boldsymbol{\delta}$, $\boldsymbol{\delta} = \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\boldsymbol{\kappa}$, $\boldsymbol{\Phi} = \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\mathbf{B}_{22.1}^{-1}\mathbf{B}_{21}\mathbf{B}_{11}^{-1}$, $\mathbf{B}_{22.1} = \mathbf{B}_{22} - \mathbf{B}_{21}\mathbf{B}_{11}^{-1}\mathbf{B}_{12}$, $\boldsymbol{\mu} = -\lambda_o\mathbf{B}^{-1}\boldsymbol{\beta} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}$ and $\boldsymbol{\mu}_{11.2} = \boldsymbol{\mu}_1 - \mathbf{B}_{12}\mathbf{B}_{22}^{-1}((\boldsymbol{\beta}_2 - \boldsymbol{\kappa}) - \boldsymbol{\mu}_2)$.

**Proof.** See Appendix A.

**Theorem 2.** *Under the condition of Theorem 1 and the local alternatives Kn, the ADBs of the proposed estimators are*

$$\begin{aligned} \text{ADB}(\hat{\boldsymbol{\beta}}_{1}^{\text{RFM}}) &= -\boldsymbol{\mu}_{11.2}, \\ \text{ADB}(\hat{\boldsymbol{\beta}}_{1}^{\text{RSM}}) &= -\boldsymbol{\mu}_{11.2} - \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\boldsymbol{\kappa} = -\boldsymbol{\gamma}, \\ \text{ADB}(\hat{\boldsymbol{\beta}}_{1}^{\text{RPT}}) &= -\boldsymbol{\mu}_{11.2} - \boldsymbol{\delta} H_{p_{2}+2}(\chi_{p_{2},\alpha}^{2}; \Delta), \\ \text{ADB}(\hat{\boldsymbol{\beta}}_{1}^{\text{RSE}}) &= -\boldsymbol{\mu}_{11.2} - (p_{2}-2)\boldsymbol{\delta} E(\chi_{p_{2}+2}^{-2}(\Delta)), \\ \text{ADB}(\hat{\boldsymbol{\beta}}_{1}^{\text{RPS}}) &= -\boldsymbol{\mu}_{11.2} - \boldsymbol{\delta} H_{p_{2}+2}(\chi_{p_{2}-2}^{2}; \Delta) - (p_{2}-2)\boldsymbol{\delta} E\left\{\chi_{p_{2}+2}^{-2}(\Delta) I(\chi_{p_{2}+2}^{2}(\Delta) > p_{2}-2)\right\}, \end{aligned}$$

*where $\Delta = \boldsymbol{\kappa}^\top\mathbf{B}_{22.1}^{-1}\boldsymbol{\kappa}$, $\mathbf{B}_{22.1} = \mathbf{B}_{22} - \mathbf{B}_{21}\mathbf{B}_{11}^{-1}\mathbf{B}_{12}$, $H_v(x; \Delta)$ is the cumulative distribution function of the non-central chi-squared distribution with non-centrality parameter $\Delta$ and $v$ degrees of freedom, and $E(\chi_v^{-2j}(\Delta))$ is the expected value of the inverse of a non-central $\chi^2$ distribution with $v$ degrees of freedom and non-centrality parameter $\Delta$,*

$$E(\chi_v^{-2j}(\Delta)) = \int_0^\infty x^{-2j} \, dH_v(x, \Delta).$$

**Proof.** See Appendix B.1.
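The expectation appearing in the bias of the shrinkage estimator can be evaluated numerically. The sketch below is an illustration under two assumptions that are not stated in the text: it reads $E(\chi_v^{-2}(\Delta))$ as the mean of the reciprocal of a non-central chi-squared variable, and it uses the convention in which that variable is a Poisson($\Delta/2$) mixture of central chi-squared variables with $v + 2k$ degrees of freedom. The function name is hypothetical.

```python
import math


def inv_noncentral_chi2_mean(v, delta, tol=1e-12, max_terms=10_000):
    """E[1 / X] for X ~ non-central chi-squared(v, delta), with v > 2.

    Uses the Poisson mixture representation together with the identity
    E[1 / chi2_m] = 1 / (m - 2) for a central chi-squared with m > 2.
    Intended for moderate delta; exp(-delta / 2) underflows for very
    large values.
    """
    if v <= 2:
        raise ValueError("the inverse moment requires v > 2")
    if delta < 0:
        raise ValueError("the non-centrality parameter must be >= 0")
    half = delta / 2.0
    weight = math.exp(-half)  # Poisson probability at k = 0
    total = 0.0
    cumulative = 0.0
    for k in range(max_terms):
        total += weight / (v + 2 * k - 2)
        cumulative += weight
        if 1.0 - cumulative < tol:
            break
        weight *= half / (k + 1)  # Poisson probability at k + 1
    return total


# With p2 = 3, the relevant degrees of freedom are p2 + 2 = 5.
central_value = inv_noncentral_chi2_mean(5, 0.0)
noncentral_value = inv_noncentral_chi2_mean(5, 4.0)
```

The value decreases as $\Delta$ grows, so the bias contribution of the shrinkage term shrinks as the local alternative moves away from the restriction.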

Since the ADBs of the estimators are in non-scalar form, we define the following asymptotic quadratic distributional bias (AQDB) of $\hat{\boldsymbol{\beta}}_1^*$ by

$$\text{AQDB}(\hat{\boldsymbol{\beta}}_1^*) = \left(\text{ADB}(\hat{\boldsymbol{\beta}}_1^*)\right)^\top \mathbf{B}_{11.2} \left(\text{ADB}(\hat{\boldsymbol{\beta}}_1^*)\right),$$

where $\mathbf{B}_{11.2} = \mathbf{B}_{11} - \mathbf{B}_{12}\mathbf{B}_{22}^{-1}\mathbf{B}_{21}$.

**Corollary 1.** *Suppose Theorem 2 holds. Then, under* {*Kn*}*, the AQDBs of the estimators are*

$$\begin{aligned} \text{AQDB}(\hat{\boldsymbol{\beta}}_{1}^{\text{RFM}}) &= \boldsymbol{\mu}_{11.2}^\top\mathbf{B}_{11.2}\boldsymbol{\mu}_{11.2}, \\ \text{AQDB}(\hat{\boldsymbol{\beta}}_{1}^{\text{RSM}}) &= \boldsymbol{\gamma}^\top\mathbf{B}_{11.2}\boldsymbol{\gamma}, \\ \text{AQDB}(\hat{\boldsymbol{\beta}}_{1}^{\text{RPT}}) &= \boldsymbol{\mu}_{11.2}^\top\mathbf{B}_{11.2}\boldsymbol{\mu}_{11.2} + \boldsymbol{\mu}_{11.2}^\top\mathbf{B}_{11.2}\boldsymbol{\delta} H_{p_{2}+2}(\chi_{p_{2},\alpha}^{2};\Delta) \\ &\quad + \boldsymbol{\delta}^\top\mathbf{B}_{11.2}\boldsymbol{\mu}_{11.2} H_{p_{2}+2}(\chi_{p_{2},\alpha}^{2};\Delta) + \boldsymbol{\delta}^\top\mathbf{B}_{11.2}\boldsymbol{\delta} H_{p_{2}+2}^{2}(\chi_{p_{2},\alpha}^{2};\Delta), \\ \text{AQDB}(\hat{\boldsymbol{\beta}}_{1}^{\text{RSE}}) &= \boldsymbol{\mu}_{11.2}^\top\mathbf{B}_{11.2}\boldsymbol{\mu}_{11.2} + (p_{2}-2)\boldsymbol{\mu}_{11.2}^\top\mathbf{B}_{11.2}\boldsymbol{\delta} E(\chi_{p_{2}+2}^{-2}(\Delta)) \\ &\quad + (p_{2}-2)\boldsymbol{\delta}^\top\mathbf{B}_{11.2}\boldsymbol{\mu}_{11.2} E(\chi_{p_{2}+2}^{-2}(\Delta)) + (p_{2}-2)^{2}\boldsymbol{\delta}^\top\mathbf{B}_{11.2}\boldsymbol{\delta} \left[E(\chi_{p_{2}+2}^{-2}(\Delta))\right]^{2}. \end{aligned}$$

When $\mathbf{B}_{11.2} = \mathbf{0}$, the AQDBs of all estimators are equivalent, and the estimators are therefore asymptotically unbiased. If we assume that $\mathbf{B}_{11.2} \neq \mathbf{0}$, the results for the bias of the estimators can be summarized as follows:

1. The AQDB of $\hat{\boldsymbol{\beta}}_1^{\mathrm{RSM}}$ is an unbounded function of $\boldsymbol{\gamma}^{\top}\mathbf{B}_{11.2}\boldsymbol{\gamma}$.


**Theorem 3.** *Suppose Theorem 1 holds. Then, under the local alternatives* $\{K_n\}$*, the covariance matrices of the estimators are*

$$\begin{aligned}
\operatorname{Cov}(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}}) &= \mathbf{B}_{11.2}^{-1} + \boldsymbol{\mu}_{11.2}\boldsymbol{\mu}_{11.2}^{\top}, \\
\operatorname{Cov}(\hat{\boldsymbol{\beta}}_1^{\mathrm{RSM}}) &= \mathbf{B}_{11}^{-1} + \boldsymbol{\gamma}\boldsymbol{\gamma}^{\top}, \\
\operatorname{Cov}(\hat{\boldsymbol{\beta}}_1^{\mathrm{RPT}}) &= \mathbf{B}_{11.2}^{-1} + \boldsymbol{\mu}_{11.2}\boldsymbol{\mu}_{11.2}^{\top} + 2\boldsymbol{\delta}\boldsymbol{\mu}_{11.2}^{\top} H_{p_2+2}(\chi^2_{p_2}; \Delta) - \boldsymbol{\Phi} H_{p_2+2}(\chi^2_{p_2}; \Delta) \\
&\quad + \boldsymbol{\delta}\boldsymbol{\delta}^{\top}\left\{2H_{p_2+2}(\chi^2_{p_2}; \Delta) - H_{p_2+4}(\chi^2_{p_2}; \Delta)\right\}, \\
\operatorname{Cov}(\hat{\boldsymbol{\beta}}_1^{\mathrm{RSE}}) &= \mathbf{B}_{11.2}^{-1} + \boldsymbol{\mu}_{11.2}\boldsymbol{\mu}_{11.2}^{\top} + 2(p_2-2)\boldsymbol{\delta}\boldsymbol{\mu}_{11.2}^{\top} E\left(\chi^{-2}_{p_2+2}(\Delta)\right) \\
&\quad - (p_2-2)\boldsymbol{\Phi}\left\{2E\left(\chi^{-2}_{p_2+2}(\Delta)\right) - (p_2-2)E\left(\chi^{-4}_{p_2+2}(\Delta)\right)\right\} \\
&\quad + (p_2-2)\boldsymbol{\delta}\boldsymbol{\delta}^{\top}\left\{-2E\left(\chi^{-2}_{p_2+4}(\Delta)\right) + 2E\left(\chi^{-2}_{p_2+2}(\Delta)\right) + (p_2-2)E\left(\chi^{-4}_{p_2+4}(\Delta)\right)\right\}, \\
\operatorname{Cov}(\hat{\boldsymbol{\beta}}_1^{\mathrm{RPS}}) &= \operatorname{Cov}(\hat{\boldsymbol{\beta}}_1^{\mathrm{RSE}}) + 2\boldsymbol{\delta}\boldsymbol{\mu}_{11.2}^{\top} E\left[\left\{1 - (p_2-2)\chi^{-2}_{p_2+2}(\Delta)\right\} I\left(\chi^2_{p_2+2}(\Delta) \le p_2-2\right)\right] \\
&\quad - 2\boldsymbol{\Phi} E\left[\left\{1 - (p_2-2)\chi^{-2}_{p_2+2}(\Delta)\right\} I\left(\chi^2_{p_2+2}(\Delta) \le p_2-2\right)\right] \\
&\quad - 2\boldsymbol{\delta}\boldsymbol{\delta}^{\top} E\left[\left\{1 - (p_2-2)\chi^{-2}_{p_2+4}(\Delta)\right\} I\left(\chi^2_{p_2+4}(\Delta) \le p_2-2\right)\right] \\
&\quad + 2\boldsymbol{\delta}\boldsymbol{\delta}^{\top} E\left[\left\{1 - (p_2-2)\chi^{-2}_{p_2+2}(\Delta)\right\} I\left(\chi^2_{p_2+2}(\Delta) \le p_2-2\right)\right] \\
&\quad - (p_2-2)^2\boldsymbol{\Phi} E\left[\chi^{-4}_{p_2+2}(\Delta) I\left(\chi^2_{p_2+2}(\Delta) \le p_2-2\right)\right] \\
&\quad - (p_2-2)^2\boldsymbol{\delta}\boldsymbol{\delta}^{\top} E\left[\chi^{-4}_{p_2+2}(\Delta) I\left(\chi^2_{p_2+2}(\Delta) \le p_2-2\right)\right] \\
&\quad + \boldsymbol{\Phi} H_{p_2+2}(p_2-2; \Delta) + \boldsymbol{\delta}\boldsymbol{\delta}^{\top} H_{p_2+4}(p_2-2; \Delta).
\end{aligned}$$

**Proof.** See Appendix B.2.

**Corollary 2.** *Under the local alternatives* $\{K_n\}$ *and from Theorem 3, the risks of the estimators are obtained as*

$$\begin{aligned}
R(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}}) &= \operatorname{tr}(\mathbf{Q}\mathbf{B}_{11.2}^{-1}) + \boldsymbol{\mu}_{11.2}^{\top}\mathbf{Q}\boldsymbol{\mu}_{11.2}, \\
R(\hat{\boldsymbol{\beta}}_1^{\mathrm{RSM}}) &= \operatorname{tr}(\mathbf{Q}\mathbf{B}_{11}^{-1}) + \boldsymbol{\gamma}^{\top}\mathbf{Q}\boldsymbol{\gamma}, \\
R(\hat{\boldsymbol{\beta}}_1^{\mathrm{RPT}}) &= \operatorname{tr}(\mathbf{Q}\mathbf{B}_{11.2}^{-1}) + \boldsymbol{\mu}_{11.2}^{\top}\mathbf{Q}\boldsymbol{\mu}_{11.2} + 2\boldsymbol{\mu}_{11.2}^{\top}\mathbf{Q}\boldsymbol{\delta}\, H_{p_2+2}(\chi^2_{p_2}; \Delta) - \operatorname{tr}(\mathbf{Q}\boldsymbol{\Phi}) H_{p_2+2}(\chi^2_{p_2}; \Delta) \\
&\quad + \boldsymbol{\delta}^{\top}\mathbf{Q}\boldsymbol{\delta}\left\{2H_{p_2+2}(\chi^2_{p_2}; \Delta) - H_{p_2+4}(\chi^2_{p_2}; \Delta)\right\}, \\
R(\hat{\boldsymbol{\beta}}_1^{\mathrm{RSE}}) &= \operatorname{tr}(\mathbf{Q}\mathbf{B}_{11.2}^{-1}) + \boldsymbol{\mu}_{11.2}^{\top}\mathbf{Q}\boldsymbol{\mu}_{11.2} + 2(p_2-2)\boldsymbol{\mu}_{11.2}^{\top}\mathbf{Q}\boldsymbol{\delta}\, E\left(\chi^{-2}_{p_2+2}(\Delta)\right) \\
&\quad - (p_2-2)\operatorname{tr}(\mathbf{Q}\boldsymbol{\Phi})\left\{2E\left(\chi^{-2}_{p_2+2}(\Delta)\right) - (p_2-2)E\left(\chi^{-4}_{p_2+2}(\Delta)\right)\right\} \\
&\quad + (p_2-2)\boldsymbol{\delta}^{\top}\mathbf{Q}\boldsymbol{\delta}\left\{2E\left(\chi^{-2}_{p_2+2}(\Delta)\right) - 2E\left(\chi^{-2}_{p_2+4}(\Delta)\right) + (p_2-2)E\left(\chi^{-4}_{p_2+4}(\Delta)\right)\right\}, \\
R(\hat{\boldsymbol{\beta}}_1^{\mathrm{RPS}}) &= R(\hat{\boldsymbol{\beta}}_1^{\mathrm{RSE}}) + 2\boldsymbol{\mu}_{11.2}^{\top}\mathbf{Q}\boldsymbol{\delta}\, E\left[\left\{1 - (p_2-2)\chi^{-2}_{p_2+2}(\Delta)\right\} I\left(\chi^2_{p_2+2}(\Delta) \le p_2-2\right)\right] \\
&\quad - 2\operatorname{tr}(\mathbf{Q}\boldsymbol{\Phi}) E\left[\left\{1 - (p_2-2)\chi^{-2}_{p_2+2}(\Delta)\right\} I\left(\chi^2_{p_2+2}(\Delta) \le p_2-2\right)\right] \\
&\quad - 2\boldsymbol{\delta}^{\top}\mathbf{Q}\boldsymbol{\delta}\, E\left[\left\{1 - (p_2-2)\chi^{-2}_{p_2+4}(\Delta)\right\} I\left(\chi^2_{p_2+4}(\Delta) \le p_2-2\right)\right] \\
&\quad + 2\boldsymbol{\delta}^{\top}\mathbf{Q}\boldsymbol{\delta}\, E\left[\left\{1 - (p_2-2)\chi^{-2}_{p_2+2}(\Delta)\right\} I\left(\chi^2_{p_2+2}(\Delta) \le p_2-2\right)\right] \\
&\quad - (p_2-2)^2\operatorname{tr}(\mathbf{Q}\boldsymbol{\Phi}) E\left[\chi^{-4}_{p_2+2}(\Delta) I\left(\chi^2_{p_2+2}(\Delta) \le p_2-2\right)\right] \\
&\quad - (p_2-2)^2\boldsymbol{\delta}^{\top}\mathbf{Q}\boldsymbol{\delta}\, E\left[\chi^{-4}_{p_2+2}(\Delta) I\left(\chi^2_{p_2+2}(\Delta) \le p_2-2\right)\right] \\
&\quad + \operatorname{tr}(\mathbf{Q}\boldsymbol{\Phi}) H_{p_2+2}(p_2-2; \Delta) + \boldsymbol{\delta}^{\top}\mathbf{Q}\boldsymbol{\delta}\, H_{p_2+4}(p_2-2; \Delta).
\end{aligned}$$

From Theorem 2, when $\mathbf{B}_{12} = \mathbf{0}$, the risks of the estimators $\hat{\boldsymbol{\beta}}_1^{\mathrm{RSM}}$, $\hat{\boldsymbol{\beta}}_1^{\mathrm{RPT}}$, $\hat{\boldsymbol{\beta}}_1^{\mathrm{RSE}}$, and $\hat{\boldsymbol{\beta}}_1^{\mathrm{RPS}}$ reduce to the common value $\operatorname{tr}(\mathbf{Q}\mathbf{B}_{11.2}^{-1}) + \boldsymbol{\mu}_{11.2}^{\top}\mathbf{Q}\boldsymbol{\mu}_{11.2}$, the risk of $\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}}$. If $\mathbf{B}_{12} \neq \mathbf{0}$, the results can be summarized as follows:


### **4. Simulation Studies**

In this section, we conduct a simulation study to assess the performance of the suggested estimators for finite samples. The criterion for comparing the performance of any estimator in our study is the mean square error. We simulate the response from the following LMM:

$$\mathbf{Y}_{i} = \mathbf{X}_{i}\boldsymbol{\beta} + \mathbf{Z}_{i}\mathbf{a}_{i} + \boldsymbol{\epsilon}_{i}, \tag{3}$$

where $\boldsymbol{\epsilon}_i \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_{n_i})$ with $\sigma^2 = 1$. We generate the random effects $\mathbf{a}_i$ from a multivariate normal distribution with zero mean and covariance matrix $\mathbf{G} = 0.5\,\mathbf{I}_{2\times 2}$, where $\mathbf{I}_{2\times 2}$ is the $2 \times 2$ identity matrix. The design matrix $\mathbf{X}_i = (\mathbf{x}_{i1}, \ldots, \mathbf{x}_{in_i})$ is generated from an $n_i$-multivariate normal distribution with mean vector and covariance matrix $\boldsymbol{\Sigma}_x$. Furthermore, we assume that the off-diagonal elements of the covariance matrix $\boldsymbol{\Sigma}_x$ are equal to $\rho$, which is the coefficient of correlation between any two predictors, with $\rho = 0.3, 0.7, 0.9$. The ratio of the largest eigenvalue to the smallest eigenvalue of the matrix $\mathbf{X}^{\top}\mathbf{V}^{-1}\mathbf{X}$ is calculated as a condition number index (CNI) [24], which assesses the existence of multicollinearity in the design matrix. If the CNI is larger than 30, then the model has significant multicollinearity. Our simulations are based on the linear mixed effects model in Equation (3) with $n = 60$ and $100$ subjects.
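The design generation and the CNI computation described above can be sketched as follows. This is only an illustrative sketch under several assumptions that the text does not fix: the predictors have zero mean and unit variance, each subject contributes `n_i` rows, the random-effects design $\mathbf{Z}_i$ holds an intercept and a time index, and $\mathbf{V}_i = \mathbf{Z}_i\mathbf{G}\mathbf{Z}_i^{\top} + \sigma^2\mathbf{I}_{n_i}$. The function names and the values of `n_i` and `p` are hypothetical.

```python
import numpy as np

def equicorrelated_cov(p, rho):
    """p x p covariance with unit diagonal (assumed) and rho off the diagonal."""
    return (1.0 - rho) * np.eye(p) + rho * np.ones((p, p))

def simulate_cni(n_subjects, n_i, p, rho, sigma2=1.0, seed=0):
    """Simulate a design and return the CNI of X' V^{-1} X.

    X' V^{-1} X is accumulated subject by subject because V is block diagonal.
    """
    rng = np.random.default_rng(seed)
    G = 0.5 * np.eye(2)                      # random-effects covariance from the text
    Sigma_x = equicorrelated_cov(p, rho)
    XtVinvX = np.zeros((p, p))
    for _ in range(n_subjects):
        # Zero mean vector is an assumption made for this sketch.
        X_i = rng.multivariate_normal(np.zeros(p), Sigma_x, size=n_i)
        # Assumed random-effects design: intercept and time index.
        Z_i = np.column_stack([np.ones(n_i), np.arange(n_i)])
        V_i = Z_i @ G @ Z_i.T + sigma2 * np.eye(n_i)
        XtVinvX += X_i.T @ np.linalg.solve(V_i, X_i)
    eigvals = np.linalg.eigvalsh(XtVinvX)    # ascending order
    return eigvals[-1] / eigvals[0]

# CNI for each correlation level used in the study (n = 60 subjects).
cni = {rho: simulate_cni(n_subjects=60, n_i=5, p=8, rho=rho)
       for rho in (0.3, 0.7, 0.9)}
for rho, value in cni.items():
    print(rho, value, "multicollinear" if value > 30 else "acceptable")
```

Accumulating $\mathbf{X}_i^{\top}\mathbf{V}_i^{-1}\mathbf{X}_i$ per subject avoids forming the full block-diagonal $\mathbf{V}$, which matters once $p_2$ is in the hundreds or thousands.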

We consider a situation when the model is assumed to be sparse. In this study, our interest lies in testing the hypothesis $H_0: \boldsymbol{\beta}_2 = \mathbf{0}$, and our goal is to estimate the fixed effect coefficient $\boldsymbol{\beta}_1$. We partition the fixed effects coefficients as $\boldsymbol{\beta} = (\boldsymbol{\beta}_1^{\top}, \boldsymbol{\beta}_2^{\top})^{\top} = (\boldsymbol{\beta}_1^{\top}, \mathbf{0}_{p_2}^{\top})^{\top}$. The coefficients $\boldsymbol{\beta}_1$ and $\boldsymbol{\beta}_2$ are $p_1$- and $p_2$-dimensional vectors, respectively, with $p = p_1 + p_2$.

In order to investigate the behavior of the estimators, we define $\Delta^* = \|\boldsymbol{\beta} - \boldsymbol{\beta}_o\|$, where $\boldsymbol{\beta}_o = (\boldsymbol{\beta}_1^{\top}, \mathbf{0}_{p_2}^{\top})^{\top}$ and $\|\cdot\|$ is the Euclidean norm. We considered $\Delta^*$ values between 0 and 4. If $\Delta^* = 0$, then we will have $\boldsymbol{\beta} = (1, 1, 1, 1, \underbrace{0, 0, \ldots, 0}_{p_2})^{\top}$ to generate the response under the null hypothesis. On the other hand, when $\Delta^* > 0$, say $\Delta^* = 4$, we will have $\boldsymbol{\beta} = (1, 1, 1, 1, 4, \underbrace{0, 0, \ldots, 0}_{p_2 - 1})^{\top}$ to generate the response under the local alternative hypothesis.
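The construction of the coefficient vector for a given $\Delta^*$ can be sketched in a few lines. The helper name `make_beta` is hypothetical; following the two examples in the text, the sketch assumes that the whole departure from the null is placed on the first component of $\boldsymbol{\beta}_2$.

```python
import numpy as np

def make_beta(beta1, p2, delta_star=0.0):
    """Stack beta1 with a p2-vector whose first entry carries the departure.

    With delta_star = 0 this is the null vector (beta1', 0')'; otherwise the
    Euclidean distance to the null vector equals delta_star.
    """
    beta2 = np.zeros(p2)
    beta2[0] = delta_star
    return np.concatenate([np.asarray(beta1, dtype=float), beta2])

beta1 = [1.0, 1.0, 1.0, 1.0]                        # active coefficients from the text
beta_null = make_beta(beta1, p2=40)                 # Delta* = 0
beta_alt = make_beta(beta1, p2=40, delta_star=4.0)  # Delta* = 4

# Delta* is the Euclidean distance between beta and the null vector.
delta_check = np.linalg.norm(beta_alt - beta_null)
print(delta_check)
```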

In our simulation study, we consider the number of fixed effect or predictor variables as $(p_1, p_2) \in \{(5, 40), (5, 500), (5, 1000)\}$. Each realization is repeated 5000 times to obtain consistent results and compute the MSE of the suggested estimators with $\alpha = 0.05$.

Based on the simulated data, we calculate the mean square error (MSE) of all the estimators as

$$\text{MSE}(\hat{\boldsymbol{\beta}}) = \frac{1}{5000}\sum_{j=1}^{5000} (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^{\top}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}),$$

where $\hat{\boldsymbol{\beta}}$ denotes any one of $\hat{\boldsymbol{\beta}}^{\mathrm{RSM}}$, $\hat{\boldsymbol{\beta}}^{\mathrm{RPT}}$, $\hat{\boldsymbol{\beta}}^{\mathrm{RSE}}$ and $\hat{\boldsymbol{\beta}}^{\mathrm{RPS}}$ in the $j$th repetition. We use the relative mean squared efficiency (RMSE), or the ratio of MSEs, for risk performance comparison. The RMSE of an estimator $\hat{\boldsymbol{\beta}}_1^*$ with respect to the baseline full model ridge estimator $\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}}$ is defined as

$$\text{RMSE}(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} : \hat{\boldsymbol{\beta}}_1^*) = \frac{\text{MSE}(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}})}{\text{MSE}(\hat{\boldsymbol{\beta}}_1^*)},$$

where $\hat{\boldsymbol{\beta}}_1^*$ is one of the suggested estimators under consideration.
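The two criteria above reduce to a few array operations once the estimates from all repetitions are stored row by row. The sketch below is a minimal illustration, not the authors' code: the function names are hypothetical, and the "estimates" in the demonstration are synthetic noise around a known vector rather than output of the ridge-type estimators.

```python
import numpy as np

def mse(estimates, beta_true):
    """Average squared Euclidean error over repetitions.

    estimates: array of shape (n_repetitions, p), one estimate per row.
    """
    diff = np.asarray(estimates, dtype=float) - np.asarray(beta_true, dtype=float)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def rmse(est_full, est_candidate, beta_true):
    """MSE of the full model estimator divided by MSE of the candidate."""
    return mse(est_full, beta_true) / mse(est_candidate, beta_true)

# Synthetic demonstration: the candidate has half the error standard deviation
# of the baseline, so its relative efficiency should be close to 4.
rng = np.random.default_rng(0)
beta = np.ones(5)
est_full = beta + rng.normal(scale=1.0, size=(5000, 5))
est_candidate = beta + rng.normal(scale=0.5, size=(5000, 5))
rmse_demo = rmse(est_full, est_candidate, beta)
print(rmse_demo)
```

An RMSE above 1 means the candidate estimator has smaller MSE than the full model ridge estimator.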

### *4.1. Simulation Results*

In this subsection, we present the results from our simulation study. The results for $n = 60, 100$ and $p_1 = 5$ with different values of the correlation coefficient $\rho$ are reported in Tables 1 and 2. Furthermore, we plot the RMSEs against $\Delta^*$ in Figures 1 and 2. The findings can be summarized as follows:


other hand, if the sub-model is misspecified, the gain slowly diminishes. However, in terms of risk, the shrinkage estimators are at least as good as the full ridge model estimator. Therefore, the use of shrinkage estimators makes sense in application when a sub-model cannot be correctly specified.

5. The RMSEs of the ridge-type estimators are increasing functions of the amount of multicollinearity. This indicates that the ridge-type estimators perform better than the classical estimator in the presence of multicollinearity among predictor variables.

**Figure 1.** RMSE of estimators as a function of the non-centrality parameter Δ when *n* = 60, and *p*<sup>1</sup> = 5.

**Figure 2.** RMSE of estimators as a function of the non-centrality parameter Δ when *n* = 100, and *p*<sup>1</sup> = 5.


**Table 1.** RMSEs of RSM, RPT, RSE, and RPS estimators with respect to $\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}}$ when $\Delta \ge 0$ for $p_1 = 5$ and $n = 60$.


**Table 2.** RMSEs of RSM, RPT, RSE, and RPS estimators with respect to $\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}}$ when $\Delta \ge 0$ for $p_1 = 5$ and $n = 100$.

### *4.2. Comparison with LASSO-Type Estimators*

We compare our listed estimators with the LASSO and adaptive LASSO estimators. A 10-fold cross-validation is used for selecting the optimal value of the penalty parameters that minimizes the mean square errors for the LASSO-type estimators. The results for *ρ* = 0.3, 0.7, 0.9, *n* = 60, 100, *p*<sup>1</sup> = 10 and *p*<sup>2</sup> = 50, 500, 1000, 2000 are presented in Table 3. We observe the following from Table 3.



**Table 3.** RMSEs of estimators with respect to $\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}}$ when $\Delta = 0$ for $p_1 = 10$.

### **5. Real Data Application**

We consider two real data analyses, one using the Amsterdam Growth and Health Data and one using genetic and brain network connectivity edge weight data, to illustrate the performance of the proposed estimators.

### *5.1. Amsterdam Growth and Health Data (AGHD)*

The AGHD data are obtained from the Amsterdam Growth and Health Study [25]. The goal of this study is to investigate the relationship between lifestyle and health from adolescence into young adulthood. The response variable $Y$ is the total serum cholesterol measured over six time points. There are five covariates: $X_1$ is the baseline fitness level measured as the maximum oxygen uptake on a treadmill, $X_2$ is the amount of body fat estimated by the sum of the thickness of four skinfolds, $X_3$ is a smoking indicator (0 = no, 1 = yes), $X_4$ is the gender (1 = female, 2 = male), and $X_5$ is the time of measurement. The model also includes subject-specific random effects.

A total of 147 subjects participated in the study, and all variables were measured on $n_i = 6$ occasions. To apply the proposed methods, we first use an AIC-based variable selection procedure to select the sub-model. For the AGHD data, we fit a linear mixed model with all five covariates as both fixed and subject-specific random effects, using a two-stage selection procedure to choose both the random and the fixed effects. The analysis found $X_2$ and $X_5$ to be significant covariates for predicting the response variable, serum cholesterol; the other variables are not significant and are ignored. Based on this information, the sub-model contains $X_2$ and $X_5$, and the full model includes all the covariates. We construct the shrinkage estimators from the full model and the sub-model. In terms of the null hypothesis, the restriction can be written as $\boldsymbol{\beta}_2 = (\beta_1, \beta_3, \beta_4) = (0, 0, 0)$ with $p = 5$, $p_1 = 2$ and $p_2 = 3$.

To evaluate the performance of the estimators, we obtain the mean square prediction error (MSPE) using bootstrap samples. We draw 1000 bootstrap samples of the 147 subjects from the data matrix $\{(Y_{ij}, \mathbf{X}_{ij}),\ i = 1, 2, \ldots, 147;\ j = 1, 2, \ldots, 6\}$. We then calculate the relative prediction error (RPE) of $\hat{\boldsymbol{\beta}}_1^*$ with respect to $\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}}$, the full model estimator. The RPE is defined as

$$\text{RPE}(\boldsymbol{\hat{\beta}}\_{1}^{\text{RFM}} : \boldsymbol{\hat{\beta}}\_{1}^{\*}) = \frac{\text{MSPE}(\boldsymbol{\hat{\beta}}\_{1}^{\*})}{\text{MSPE}(\boldsymbol{\hat{\beta}}\_{1}^{\text{RFM}})} = \frac{(\mathbf{Y} - \mathbf{X}\_{1}\boldsymbol{\hat{\beta}}\_{1}^{\*})'(\mathbf{Y} - \mathbf{X}\_{1}\boldsymbol{\hat{\beta}}\_{1}^{\*})}{(\mathbf{Y} - \mathbf{X}\_{1}\boldsymbol{\hat{\beta}}\_{1}^{\text{RFM}})'(\mathbf{Y} - \mathbf{X}\_{1}\boldsymbol{\hat{\beta}}\_{1}^{\text{RFM}})},$$

where $\hat{\boldsymbol{\beta}}_1^*$ is one of the listed estimators. If RPE $< 1$, then $\hat{\boldsymbol{\beta}}_1^*$ outperforms $\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}}$.
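The RPE computation with subject-level resampling can be sketched as follows. This is a sketch under stated assumptions rather than the authors' procedure: the data are simulated, the estimators are refitted on every bootstrap sample (the text does not say whether they are), and a plain ridge fit that ignores the random effects is swapped in for the ridge-type LMM estimators purely to keep the example short. All function names are hypothetical.

```python
import numpy as np

def rpe(y, X1, beta_star, beta_full):
    """Ratio of residual sums of squares: candidate over full model estimator."""
    r_star = y - X1 @ beta_star
    r_full = y - X1 @ beta_full
    return float(r_star @ r_star) / float(r_full @ r_full)

def bootstrap_subjects(subject_ids, rng):
    """Row indices of one bootstrap sample that resamples whole subjects."""
    unique_ids = np.unique(subject_ids)
    drawn = rng.choice(unique_ids, size=unique_ids.size, replace=True)
    return np.concatenate([np.flatnonzero(subject_ids == s) for s in drawn])

def ridge(X, y, lam):
    """Plain ridge fit; a stand-in for the ridge-type LMM estimators."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Simulated longitudinal data: 147 subjects, 6 occasions, p1 = 2, p2 = 3.
rng = np.random.default_rng(1)
n_sub, n_i, p1, p2 = 147, 6, 2, 3
ids = np.repeat(np.arange(n_sub), n_i)
X = rng.normal(size=(n_sub * n_i, p1 + p2))
y = X[:, :p1] @ np.array([1.0, -0.5]) + rng.normal(size=n_sub * n_i)

rpes = []
for _ in range(200):                                  # the text uses 1000 draws
    rows = bootstrap_subjects(ids, rng)
    Xb, yb = X[rows], y[rows]
    beta_full_1 = ridge(Xb, yb, lam=1.0)[:p1]         # beta_1 part of the full fit
    beta_sub_1 = ridge(Xb[:, :p1], yb, lam=1.0)       # sub-model fit
    rpes.append(rpe(yb, Xb[:, :p1], beta_sub_1, beta_full_1))
mean_rpe = float(np.mean(rpes))
print(mean_rpe)
```

Resampling whole subjects rather than individual rows keeps the within-subject dependence of the repeated measurements intact in each bootstrap sample.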

Table 4 reports the estimates, the standard errors of the non-sparse predictors and the RPEs of the estimators with respect to the full model. As expected, the sub-model ridge estimator $\hat{\boldsymbol{\beta}}_1^{\mathrm{RSM}}$ has the minimum RPE because it is computed when the sub-model is correct, that is, $\Delta^* = 0$. It is evident from the RPE values in Table 4 that the shrinkage estimators are superior to the LASSO-type estimators. Furthermore, the positive shrinkage ridge estimator is more efficient than the shrinkage ridge estimator.

**Table 4.** Estimate, standard error for the active predictors and RPEs of estimators with respect to full-model estimator for the Amsterdam Growth and Health Study data.


### *5.2. Resting-State Effective Brain Connectivity and Genetic Data*

These data come from a longitudinal resting-state functional magnetic resonance imaging (rs-fMRI) effective brain connectivity network and genetic study [26] and were obtained from a sample of 111 subjects with a total of 319 rs-fMRI scans from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. The 111 subjects comprise 36 cognitively normal (CN), 63 mild cognitive impairment (MCI) and 12 Alzheimer's Disease (AD) subjects. The response is a network connection between regions of interest estimated from an rs-fMRI scan within the Default Mode Network (DMN), and we observe a longitudinal sequence of such connections for each subject with the number of repeated measurements. The DMN consists of a set of brain regions that tend to be active in the resting state, when a subject is mind wandering with no intended task. For this data analysis, we consider the network edge weight from the left intraparietal cortex to the posterior cingulate cortex (LIPC → PCC) as our response. The genetic data are single nucleotide polymorphisms (SNPs) from non-sex chromosomes, i.e., chromosome 1 to chromosome 22. SNPs with a minor allele frequency less than 5% are removed, as are SNPs with a Hardy–Weinberg equilibrium p-value lower than $10^{-6}$ or a missing rate greater than 5%. After preprocessing, we are left with 1,220,955 SNPs and the longitudinal rs-fMRI effective connectivity network for the 111 subjects with rs-fMRI data. The response is the network edge weight. The SNPs are the fixed effects, and the model includes subject-specific random effects.

In order to apply the proposed methods, we use a genome-wide association study (GWAS) to screen the genetic data down to 100 SNPs. We implement a second screening by applying multinomial logistic regression to identify a smaller subset of the 100 SNPs that are potentially associated with disease status (CN/MCI/AD). This yields a subset of the top 10 SNPs, which are treated as the most important predictors; the other 90 SNPs are ignored as not significant. We now have two models: the full model with all 100 SNPs and the sub-model with the 10 selected SNPs. Finally, we construct the pretest and shrinkage estimators from the full model and the sub-model.

We draw 1000 bootstrap samples with replacement from the corresponding data matrix $\{(Y_{ij}, \mathbf{X}_{ij}),\ i = 1, \ldots, 111;\ j = 1, \ldots, n_i\}$. We report the RPEs of the estimators based on the bootstrap simulation with respect to the full model ridge estimator in Table 5. We observe that the sub-model, pretest, shrinkage and positive shrinkage ridge estimators all outperform the full model estimator in terms of RPE. Clearly, the sub-model ridge estimator has the smallest RPE since it is computed when the candidate sub-model is correct, i.e., $\Delta = 0$. Both shrinkage ridge estimators outperform the pretest ridge estimator. In particular, the positive shrinkage estimator performs better than the shrinkage estimator. Both the shrinkage and the pretest ridge estimators perform better than the LASSO-type estimators. Thus, the data analysis is in line with our simulation and theoretical findings.

**Table 5.** RPEs of estimators.


### **6. Conclusions**

In this paper, we present efficient estimation strategies for the linear mixed effect model when there exists multicollinearity among predictor variables for high-dimensional data application. We considered the estimation of fixed effects parameters in the linear mixed model when some of the predictors may have a very weak influence on the response of interest. We introduced pretest and shrinkage estimation in our model using the ridge estimation as the reference estimator. In addition, we established the asymptotic properties of the pretest and shrinkage ridge estimators. Our theoretical findings demonstrate that the shrinkage ridge estimators outperform the full model ridge estimator and perform relatively better than the sub-model estimator in a wide range of the parameter space.

Additionally, a Monte Carlo simulation was conducted to investigate and assess the finite sample behavior of the proposed estimators when the model is sparse (restrictions on parameters hold). As expected, the sub-model ridge estimator outshines all other estimators when the restrictions hold. However, when this assumption is violated, the shrinkage and pretest ridge estimators outperform the sub-model estimator. Furthermore, when the number of sparse predictors is extremely large relative to the sample size, the shrinkage estimators outperform the pretest ridge estimator. These numerical results are consistent with our asymptotic results. We also assess the performance of the LASSO-type estimators relative to our ridge-type estimators. We observe that the performance of the pretest and shrinkage ridge estimators is superior to that of the LASSO-type estimators when predictors are highly correlated. For our real data application, the shrinkage ridge estimators are superior, with smaller relative prediction errors than the LASSO-type estimators.

In summary, the results of the data analyses strongly confirm the findings of the simulation study and suggest the use of the shrinkage ridge estimation strategy when no prior information about the parameter subspace is available. The results of our simulation study and real data application are consistent with available results in [27–29].

In our future work, we will focus on other penalty estimators like the Elastic-Net, the minimax concave penalty (MCP), and the smoothly clipped absolute deviation method (SCAD) as estimation strategy in LMM for high-dimensional data. These estimators will be assessed and compared with the proposed ridge-type estimators. Another interesting extension will be integrating two sub-models by incorporating ridge-type estimation strategies in the linear mixed effect models. The goal is to improve the estimation accuracy of the non-sparse set of the fixed effects parameters by combining an over-fitted model estimator with an under-fitted one [27,29]. This approach will include combining two sub-models produced by two different variable selection techniques from the LMM [28].

**Author Contributions:** Conceptualization, E.A.O. and S.E.A.; methodology, E.A.O. and F.S.N.; formal analysis, E.A.O.; writing—original draft preparation, E.A.O.; writing—review and editing, E.A.O., S.E.A. and F.S.N.; supervision, F.S.N. and S.E.A.; funding acquisition, F.S.N. and S.E.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Natural Sciences and Engineering Research Council of Canada (NSERC).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Publicly available datasets were analyzed in this study. This data can be found here https://pubmed.ncbi.nlm.nih.gov/22434862/ (accessed on 20 April 2021).

**Acknowledgments:** Research is supported by the Visual and Automated Disease Analytics (VADA) graduate training program.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **Appendix A**

**Proof of Proposition 1.** To obtain the asymptotic relationship between the sub-model and full model estimators of $\boldsymbol{\beta}_1$, we use the following argument. Let $\hat{\mathbf{Y}} = \mathbf{Y} - \mathbf{X}_2\hat{\boldsymbol{\beta}}_2^{\mathrm{RFM}}$; then

$$\begin{aligned}
\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} &= \arg\min_{\boldsymbol{\beta}_1} \left\{ (\hat{\mathbf{Y}} - \mathbf{X}_1\boldsymbol{\beta}_1)^{\top}\mathbf{V}^{-1}(\hat{\mathbf{Y}} - \mathbf{X}_1\boldsymbol{\beta}_1) + \lambda\|\boldsymbol{\beta}_1\|^2 \right\} \\
&= \left[\mathbf{X}_1^{\top}\mathbf{V}^{-1}\mathbf{X}_1 + \lambda\mathbf{I}_{p_1}\right]^{-1}\mathbf{X}_1^{\top}\mathbf{V}^{-1}\hat{\mathbf{Y}} \\
&= \left[\mathbf{X}_1^{\top}\mathbf{V}^{-1}\mathbf{X}_1 + \lambda\mathbf{I}_{p_1}\right]^{-1}\mathbf{X}_1^{\top}\mathbf{V}^{-1}\mathbf{Y} - \left[\mathbf{X}_1^{\top}\mathbf{V}^{-1}\mathbf{X}_1 + \lambda\mathbf{I}_{p_1}\right]^{-1}\mathbf{X}_1^{\top}\mathbf{V}^{-1}\mathbf{X}_2\hat{\boldsymbol{\beta}}_2^{\mathrm{RFM}} \\
&= \hat{\boldsymbol{\beta}}_1^{\mathrm{RSM}} - \left[\mathbf{X}_1^{\top}\mathbf{V}^{-1}\mathbf{X}_1 + \lambda\mathbf{I}_{p_1}\right]^{-1}\mathbf{X}_1^{\top}\mathbf{V}^{-1}\mathbf{X}_2\hat{\boldsymbol{\beta}}_2^{\mathrm{RFM}} \\
&= \hat{\boldsymbol{\beta}}_1^{\mathrm{RSM}} - \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\hat{\boldsymbol{\beta}}_2^{\mathrm{RFM}}.
\end{aligned}$$

From Theorem 1, we partition $\sqrt{n}(\hat{\boldsymbol{\beta}}^{\mathrm{RFM}} - \boldsymbol{\beta})$ as $\sqrt{n}(\hat{\boldsymbol{\beta}}^{\mathrm{RFM}} - \boldsymbol{\beta}) = \left(\sqrt{n}(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} - \boldsymbol{\beta}_1), \sqrt{n}(\hat{\boldsymbol{\beta}}_2^{\mathrm{RFM}} - \boldsymbol{\beta}_2)\right)$. We obtain $\sqrt{n}(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} - \boldsymbol{\beta}_1) \stackrel{D}{\to} \mathcal{N}_{p_1}(-\boldsymbol{\mu}_{11.2}, \mathbf{B}_{11.2}^{-1})$, where $\mathbf{B}_{11.2} = \mathbf{B}_{11} - \mathbf{B}_{12}\mathbf{B}_{22}^{-1}\mathbf{B}_{21}$. We have shown that $\hat{\boldsymbol{\beta}}_1^{\mathrm{RSM}} = \hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} + \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\hat{\boldsymbol{\beta}}_2^{\mathrm{RFM}}$. Using this expression and under the local alternative $\{K_n\}$, we obtain the following expressions:

$$\begin{aligned}
\boldsymbol{\phi}_2 &= \sqrt{n}\left(\hat{\boldsymbol{\beta}}_1^{\mathrm{RSM}} - \boldsymbol{\beta}_1\right) \\
&= \sqrt{n}\left(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} + \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\hat{\boldsymbol{\beta}}_2^{\mathrm{RFM}} - \boldsymbol{\beta}_1\right) \\
&= \boldsymbol{\phi}_1 + \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\sqrt{n}\hat{\boldsymbol{\beta}}_2^{\mathrm{RFM}}, \\
\boldsymbol{\phi}_3 &= \sqrt{n}\left(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} - \hat{\boldsymbol{\beta}}_1^{\mathrm{RSM}}\right) \\
&= \sqrt{n}\left(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} - \boldsymbol{\beta}_1\right) - \sqrt{n}\left(\hat{\boldsymbol{\beta}}_1^{\mathrm{RSM}} - \boldsymbol{\beta}_1\right) \\
&= \boldsymbol{\phi}_1 - \boldsymbol{\phi}_2.
\end{aligned}$$

Since $\boldsymbol{\phi}_2$ and $\boldsymbol{\phi}_3$ are linear functions of $\boldsymbol{\phi}_1$, as $n \to \infty$, they are also asymptotically normally distributed. Their mean vectors and covariance matrices are as follows:

$$\begin{aligned}
E(\boldsymbol{\phi}_1) &= E\left\{\sqrt{n}\left(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} - \boldsymbol{\beta}_1\right)\right\} = -\boldsymbol{\mu}_{11.2}, \\
E(\boldsymbol{\phi}_2) &= E\left\{\boldsymbol{\phi}_1 + \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\sqrt{n}\hat{\boldsymbol{\beta}}_2^{\mathrm{RFM}}\right\} = E(\boldsymbol{\phi}_1) + \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\sqrt{n}E(\hat{\boldsymbol{\beta}}_2^{\mathrm{RFM}}) \\
&= -\boldsymbol{\mu}_{11.2} + \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\boldsymbol{\kappa} = -(\boldsymbol{\mu}_{11.2} - \boldsymbol{\delta}) = -\boldsymbol{\gamma}, \\
E(\boldsymbol{\phi}_3) &= E(\boldsymbol{\phi}_1 - \boldsymbol{\phi}_2) = -\boldsymbol{\mu}_{11.2} - \left(-(\boldsymbol{\mu}_{11.2} - \boldsymbol{\delta})\right) = \boldsymbol{\delta}, \\
\operatorname{Var}(\boldsymbol{\phi}_1) &= \mathbf{B}_{11.2}^{-1}, \\
\operatorname{Var}(\boldsymbol{\phi}_2) &= \operatorname{Var}\left\{\boldsymbol{\phi}_1 + \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\sqrt{n}\hat{\boldsymbol{\beta}}_2^{\mathrm{RFM}}\right\} \\
&= \operatorname{Var}(\boldsymbol{\phi}_1) + \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\mathbf{B}_{22.1}^{-1}\mathbf{B}_{21}\mathbf{B}_{11}^{-1} + 2\operatorname{Cov}\left\{\sqrt{n}(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} - \boldsymbol{\beta}_1), \sqrt{n}(\hat{\boldsymbol{\beta}}_2^{\mathrm{RFM}} - \boldsymbol{\beta}_2)\right\}(\mathbf{B}_{11}^{-1}\mathbf{B}_{12})^{\top} \\
&= \mathbf{B}_{11.2}^{-1} - \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\mathbf{B}_{22.1}^{-1}\mathbf{B}_{21}\mathbf{B}_{11}^{-1} = \mathbf{B}_{11}^{-1}, \\
\operatorname{Var}(\boldsymbol{\phi}_3) &= \operatorname{Var}\left\{\sqrt{n}\left(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} - \hat{\boldsymbol{\beta}}_1^{\mathrm{RSM}}\right)\right\} = \operatorname{Var}\left\{\sqrt{n}\left(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} - \hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} - \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\hat{\boldsymbol{\beta}}_2^{\mathrm{RFM}}\right)\right\} \\
&= \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\operatorname{Var}\left(\sqrt{n}\hat{\boldsymbol{\beta}}_2^{\mathrm{RFM}}\right)(\mathbf{B}_{11}^{-1}\mathbf{B}_{12})^{\top} = \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\mathbf{B}_{22.1}^{-1}\mathbf{B}_{21}\mathbf{B}_{11}^{-1} = \boldsymbol{\Phi}, \\
\operatorname{Cov}(\boldsymbol{\phi}_1, \boldsymbol{\phi}_3) &= \operatorname{Cov}\left\{\sqrt{n}\left(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} - \boldsymbol{\beta}_1\right), \sqrt{n}\left(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} - \hat{\boldsymbol{\beta}}_1^{\mathrm{RSM}}\right)\right\} \\
&= \operatorname{Var}\left\{\sqrt{n}\left(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} - \boldsymbol{\beta}_1\right)\right\} - \operatorname{Cov}\left\{\sqrt{n}\left(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} - \boldsymbol{\beta}_1\right), \sqrt{n}\left(\hat{\boldsymbol{\beta}}_1^{\mathrm{RSM}} - \boldsymbol{\beta}_1\right)\right\} \\
&= \operatorname{Var}(\boldsymbol{\phi}_1) - \operatorname{Cov}\left\{\sqrt{n}\left(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} - \boldsymbol{\beta}_1\right), \sqrt{n}\left(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} - \boldsymbol{\beta}_1\right) + \sqrt{n}\mathbf{B}_{11}^{-1}\mathbf{B}_{12}\hat{\boldsymbol{\beta}}_2^{\mathrm{RFM}}\right\} \\
&= \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\mathbf{B}_{22.1}^{-1}\mathbf{B}_{21}\mathbf{B}_{11}^{-1} = \boldsymbol{\Phi}.
\end{aligned}$$

$$\begin{aligned}
\operatorname{Cov}(\boldsymbol{\phi}_2, \boldsymbol{\phi}_3) &= \operatorname{Cov}\left[\sqrt{n}\left(\hat{\boldsymbol{\beta}}_1^{\mathrm{RSM}} - \boldsymbol{\beta}_1\right), \sqrt{n}\left(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} - \hat{\boldsymbol{\beta}}_1^{\mathrm{RSM}}\right)\right] \\
&= \operatorname{Cov}\left[\sqrt{n}\left(\hat{\boldsymbol{\beta}}_1^{\mathrm{RSM}} - \boldsymbol{\beta}_1\right), \sqrt{n}\left(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} - \boldsymbol{\beta}_1\right)\right] - \operatorname{Var}\left(\sqrt{n}\left(\hat{\boldsymbol{\beta}}_1^{\mathrm{RSM}} - \boldsymbol{\beta}_1\right)\right) \\
&= \mathbf{B}_{11.2}^{-1} - \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\mathbf{B}_{22.1}^{-1}\mathbf{B}_{21}\mathbf{B}_{11}^{-1} - \mathbf{B}_{11}^{-1} \\
&= \mathbf{B}_{11.2}^{-1} - \left(\mathbf{B}_{11.2}^{-1} - \mathbf{B}_{11}^{-1}\right) - \mathbf{B}_{11}^{-1} = \mathbf{0}.
\end{aligned}$$

Therefore, the asymptotic distributions of the vectors $\boldsymbol{\phi}_2$ and $\boldsymbol{\phi}_3$ are obtained as follows:

$$\begin{aligned}
\boldsymbol{\phi}_2 &= \sqrt{n}(\hat{\boldsymbol{\beta}}_1^{\mathrm{RSM}} - \boldsymbol{\beta}_1) \stackrel{D}{\to} \mathcal{N}_{p_1}(-\boldsymbol{\gamma}, \mathbf{B}_{11}^{-1}), \\
\boldsymbol{\phi}_3 &= \sqrt{n}(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} - \hat{\boldsymbol{\beta}}_1^{\mathrm{RSM}}) \stackrel{D}{\to} \mathcal{N}_{p_1}(\boldsymbol{\delta}, \boldsymbol{\Phi}).
\end{aligned}$$
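The two matrix identities used in this proof, $\mathbf{B}_{11.2}^{-1} - \boldsymbol{\Phi} = \mathbf{B}_{11}^{-1}$ and $\operatorname{Cov}(\boldsymbol{\phi}_2, \boldsymbol{\phi}_3) = \mathbf{0}$, can be checked numerically on an arbitrary positive definite matrix. The sketch below does this with NumPy; the matrix `B` is randomly generated for illustration, and the definition $\mathbf{B}_{22.1} = \mathbf{B}_{22} - \mathbf{B}_{21}\mathbf{B}_{11}^{-1}\mathbf{B}_{12}$ is assumed by analogy with $\mathbf{B}_{11.2}$.

```python
import numpy as np
from numpy.linalg import inv

# Arbitrary symmetric positive definite matrix, partitioned into blocks.
rng = np.random.default_rng(0)
p1, p2 = 3, 4
A = rng.normal(size=(p1 + p2, p1 + p2))
B = A @ A.T + (p1 + p2) * np.eye(p1 + p2)
B11, B12 = B[:p1, :p1], B[:p1, p1:]
B21, B22 = B[p1:, :p1], B[p1:, p1:]

# Schur complements (B22_1 is defined by analogy; an assumption of this sketch).
B11_2 = B11 - B12 @ inv(B22) @ B21
B22_1 = B22 - B21 @ inv(B11) @ B12

# Phi = B11^{-1} B12 B22.1^{-1} B21 B11^{-1}, the asymptotic variance of phi_3.
Phi = inv(B11) @ B12 @ inv(B22_1) @ B21 @ inv(B11)

# Var(phi_2) as derived in the proof, and Cov(phi_2, phi_3).
var_phi2 = inv(B11_2) - Phi
cov_phi2_phi3 = inv(B11_2) - Phi - inv(B11)

print(np.allclose(var_phi2, inv(B11)))
print(np.allclose(cov_phi2_phi3, 0.0))
```

Both checks follow from the block-inverse (Woodbury) identity $\mathbf{B}_{11.2}^{-1} = \mathbf{B}_{11}^{-1} + \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\mathbf{B}_{22.1}^{-1}\mathbf{B}_{21}\mathbf{B}_{11}^{-1}$.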

### **Appendix B**

We next introduce the lemmas given in [30] to aid with the proof of the bias and covariance of the estimators.

**Lemma A1.** *Let* $\mathbf{V} = (V_1, V_2, \ldots, V_p)^{\top}$ *be a p-dimensional normal vector distributed as* $\mathcal{N}_p(\boldsymbol{\mu}_v, \boldsymbol{\Sigma}_p)$*; then, for a measurable function* $\Psi$*, we have*

$$E\left[\mathbf{V}\Psi(\mathbf{V}^{\top}\mathbf{V})\right] = \boldsymbol{\mu}_v E\left[\Psi\left(\chi^2_{p+2}(\Delta)\right)\right],$$

$$E\left[\mathbf{V}\mathbf{V}^{\top}\Psi(\mathbf{V}^{\top}\mathbf{V})\right] = \boldsymbol{\Sigma}_p E\left[\Psi\left(\chi^2_{p+2}(\Delta)\right)\right] + \boldsymbol{\mu}_v\boldsymbol{\mu}_v^{\top} E\left[\Psi\left(\chi^2_{p+4}(\Delta)\right)\right],$$

*where* $\chi^2_k(\Delta)$ *is a non-central chi-square distribution with k degrees of freedom and non-centrality parameter* $\Delta$*.*
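A quick Monte Carlo check can make the two identities of the lemma concrete. The sketch below is restricted to the special case $\boldsymbol{\Sigma}_p = \mathbf{I}_p$ and takes $\Delta = \boldsymbol{\mu}_v^{\top}\boldsymbol{\mu}_v$, which is NumPy's parametrisation of the non-centrality; both choices, as well as the test function $\Psi(t) = 1/(1+t)$ and the mean vector `mu`, are assumptions made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)
p, n_draws = 5, 200_000
mu = np.array([0.8, -0.5, 0.3, 0.0, 1.0])     # made-up mean vector
delta = float(mu @ mu)                        # assumed non-centrality mu' mu

def psi(t):
    """A bounded measurable test function chosen for illustration."""
    return 1.0 / (1.0 + t)

# Left-hand sides: averages over draws of V ~ N_p(mu, I_p).
V = rng.normal(size=(n_draws, p)) + mu
w = psi(np.sum(V ** 2, axis=1))
lhs_first = (V * w[:, None]).mean(axis=0)     # E[V psi(V'V)]
lhs_second = (V * w[:, None]).T @ V / n_draws # E[V V' psi(V'V)]

# Right-hand sides: expectations of psi under non-central chi-square laws.
e_p2 = psi(rng.noncentral_chisquare(p + 2, delta, n_draws)).mean()
e_p4 = psi(rng.noncentral_chisquare(p + 4, delta, n_draws)).mean()
rhs_first = mu * e_p2
rhs_second = np.eye(p) * e_p2 + np.outer(mu, mu) * e_p4

print(np.max(np.abs(lhs_first - rhs_first)))
print(np.max(np.abs(lhs_second - rhs_second)))
```

With this many draws, the two sides should agree up to Monte Carlo error; in the proofs that follow, $\Psi$ is an indicator or an inverse power of the test statistic, which gives the $H_{p_2+2}$ and $E(\chi^{-2}_{p_2+2}(\Delta))$ terms.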

*Appendix B.1* **Proof of Theorem 2.**

$$\begin{aligned}
\text{ADB}(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}}) &= E\left\{\lim_{n \to \infty} \sqrt{n}(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} - \boldsymbol{\beta}_1)\right\} \\
&= -\boldsymbol{\mu}_{11.2}.
\end{aligned}$$

$$\begin{aligned}
\text{ADB}(\hat{\boldsymbol{\beta}}_1^{\mathrm{RSM}}) &= E\left\{\lim_{n \to \infty} \sqrt{n}(\hat{\boldsymbol{\beta}}_1^{\mathrm{RSM}} - \boldsymbol{\beta}_1)\right\} \\
&= E\left\{\lim_{n \to \infty} \sqrt{n}(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} - \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\hat{\boldsymbol{\beta}}_2^{\mathrm{RFM}} - \boldsymbol{\beta}_1)\right\} \\
&= E\left\{\lim_{n \to \infty} \sqrt{n}(\hat{\boldsymbol{\beta}}_1^{\mathrm{RFM}} - \boldsymbol{\beta}_1)\right\} - E\left\{\lim_{n \to \infty} \sqrt{n}(\mathbf{B}_{11}^{-1}\mathbf{B}_{12}\hat{\boldsymbol{\beta}}_2^{\mathrm{RFM}})\right\} \\
&= -\boldsymbol{\mu}_{11.2} - E\left\{\lim_{n \to \infty} \sqrt{n}(\mathbf{B}_{11}^{-1}\mathbf{B}_{12}\hat{\boldsymbol{\beta}}_2^{\mathrm{RFM}})\right\} \\
&= -\boldsymbol{\mu}_{11.2} - \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\boldsymbol{\kappa} = -(\boldsymbol{\mu}_{11.2} + \boldsymbol{\delta}) = -\boldsymbol{\gamma}.
\end{aligned}$$

Using Lemma A1,

$$\begin{split} \text{ADB}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RPT}}) &= E\left\{ \lim\_{n \to \infty} \sqrt{n} (\hat{\boldsymbol{\beta}}\_{1}^{\text{RPT}} - \boldsymbol{\beta}\_{1}) \right\} \\ &= E\left\{ \lim\_{n \to \infty} \sqrt{n} \left( \hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}}) I (\mathcal{L}\_{n} \le d\_{n,\alpha}) - \boldsymbol{\beta}\_{1} \right) \right\} \\ &= E\left\{ \lim\_{n \to \infty} \sqrt{n} (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \boldsymbol{\beta}\_{1}) \right\} - E\left\{ \lim\_{n \to \infty} \sqrt{n} (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}}) I (\mathcal{L}\_{n} \le d\_{n,\alpha}) \right\} \\ &= -\boldsymbol{\mu}\_{11.2} - E\left\{ \lim\_{n \to \infty} \sqrt{n} (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}}) I (\mathcal{L}\_{n} \le d\_{n,\alpha}) \right\} \\ &= -\boldsymbol{\mu}\_{11.2} - \boldsymbol{\delta} \mathbf{H}\_{p\_{2} + 2}(\chi\_{p\_{2}}^{2}; \Delta) . \end{split}$$

$$\begin{split} \text{ADB}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RSE}}) &= E\left\{ \lim\_{n \to \infty} \sqrt{n} (\hat{\boldsymbol{\beta}}\_{1}^{\text{RSE}} - \boldsymbol{\beta}\_{1}) \right\} \\ &= E\left\{ \lim\_{n \to \infty} \sqrt{n} \left( \hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}}) (p\_{2} - 2) \mathcal{L}\_{n}^{-1} - \boldsymbol{\beta}\_{1} \right) \right\} \\ &= E\left\{ \lim\_{n \to \infty} \sqrt{n} (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \boldsymbol{\beta}\_{1}) \right\} - E\left\{ \lim\_{n \to \infty} \sqrt{n} (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}}) (p\_{2} - 2) \mathcal{L}\_{n}^{-1} \right\} \\ &= -\boldsymbol{\mu}\_{11.2} - E\left\{ \lim\_{n \to \infty} \sqrt{n} (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}}) (p\_{2} - 2) \mathcal{L}\_{n}^{-1} \right\} \\ &= -\boldsymbol{\mu}\_{11.2} - (p\_{2} - 2) \boldsymbol{\delta} E\left(\chi\_{p\_{2}+2}^{-2}(\Delta)\right) . \end{split}$$

$$\begin{split} \text{ADB}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RPS}}) &= E\left\{ \lim\_{n \to \infty} \sqrt{n} (\hat{\boldsymbol{\beta}}\_{1}^{\text{RPS}} - \boldsymbol{\beta}\_{1}) \right\} \\ &= E\left\{ \lim\_{n \to \infty} \sqrt{n} \left( \hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}} + (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}}) \left(1 - (p\_{2} - 2) \mathcal{L}\_{n}^{-1}\right) I(\mathcal{L}\_{n} > p\_{2} - 2) - \boldsymbol{\beta}\_{1} \right) \right\} \\ &= E\left\{ \lim\_{n \to \infty} \sqrt{n} \left[ \hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}} + (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}}) \left(1 - I(\mathcal{L}\_{n} \le p\_{2} - 2)\right) \right. \right. \\ &\qquad \left. \left. - (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}}) (p\_{2} - 2) \mathcal{L}\_{n}^{-1} I(\mathcal{L}\_{n} > p\_{2} - 2) - \boldsymbol{\beta}\_{1} \right] \right\} \\ &= E\left\{ \lim\_{n \to \infty} \sqrt{n} (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \boldsymbol{\beta}\_{1}) \right\} - E\left\{ \lim\_{n \to \infty} \sqrt{n} (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}}) I(\mathcal{L}\_{n} \le p\_{2} - 2) \right\} \\ &\qquad - E\left\{ \lim\_{n \to \infty} \sqrt{n} (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}}) (p\_{2} - 2) \mathcal{L}\_{n}^{-1} I(\mathcal{L}\_{n} > p\_{2} - 2) \right\} \\ &= -\boldsymbol{\mu}\_{11.2} - \boldsymbol{\delta} \mathbf{H}\_{p\_{2}+2}(p\_{2} - 2; \Delta) - (p\_{2} - 2) \boldsymbol{\delta} E\left\{ \chi\_{p\_{2}+2}^{-2}(\Delta) I\left(\chi\_{p\_{2}+2}^{2}(\Delta) > p\_{2} - 2\right) \right\} . \end{split}$$
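The three combined estimators whose biases are derived in this proof differ only in how they weight the full-model and submodel estimates through the test statistic. As a minimal sketch (the function names, argument names, and example numbers are hypothetical; the statistic `L_n` and the critical value `d_alpha` are taken as given inputs rather than computed), the combination rules can be written as:

```python
def pretest(beta_fm, beta_sm, L_n, d_alpha):
    """RPT: keep the full-model estimate unless L_n <= d_alpha, then use the submodel one."""
    ind = 1.0 if L_n <= d_alpha else 0.0
    return [f - (f - s) * ind for f, s in zip(beta_fm, beta_sm)]


def stein(beta_fm, beta_sm, L_n, p2):
    """RSE: shrink the full-model estimate towards the submodel estimate."""
    shrink = (p2 - 2) / L_n
    return [f - (f - s) * shrink for f, s in zip(beta_fm, beta_sm)]


def positive_stein(beta_fm, beta_sm, L_n, p2):
    """RPS: as RSE, but the factor 1 - (p2 - 2) / L_n is truncated at zero."""
    factor = max(0.0, 1.0 - (p2 - 2) / L_n)
    return [s + (f - s) * factor for f, s in zip(beta_fm, beta_sm)]


# Hypothetical numbers, for illustration only.
beta_fm, beta_sm, p2 = [1.0, 2.0], [0.5, 1.0], 5
for L_n in (2.0, 6.0):
    print(L_n,
          pretest(beta_fm, beta_sm, L_n, d_alpha=4.0),
          stein(beta_fm, beta_sm, L_n, p2),
          positive_stein(beta_fm, beta_sm, L_n, p2))
```

When `L_n` is at most `p2 - 2`, the Stein-type factor exceeds one and the RSE overshoots past the submodel estimate; the positive-part version returns the submodel estimate in that case and coincides with the RSE otherwise.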

*Appendix B.2*

In order to compute the risk functions, we first compute the asymptotic covariance of the estimators. The asymptotic covariance of an estimator **β̂**<sub>1</sub><sup>∗</sup> is expressed as

$$\operatorname{Cov}(\hat{\boldsymbol{\beta}}\_1^{\*}) = \lim\_{n \to \infty} E\left\{ n(\hat{\boldsymbol{\beta}}\_1^{\*} - \boldsymbol{\beta}\_1)(\hat{\boldsymbol{\beta}}\_1^{\*} - \boldsymbol{\beta}\_1)^{\mathrm{T}} \right\}.$$

**Proof of Theorem 3.** We start by computing the asymptotic covariance of the estimator **β̂**<sub>1</sub><sup>RFM</sup>:

$$\begin{split} \text{Cov}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}}) &= E\{\lim\_{n\to\infty} \sqrt{n}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \boldsymbol{\beta}\_{1}) \sqrt{n}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \boldsymbol{\beta}\_{1})^{\mathrm{T}}\} \\ &= E(\boldsymbol{\varrho}\_{1}\boldsymbol{\varrho}\_{1}^{\mathrm{T}}) = \text{Cov}(\boldsymbol{\varrho}\_{1}\boldsymbol{\varrho}\_{1}^{\mathrm{T}}) + E(\boldsymbol{\varrho}\_{1})E(\boldsymbol{\varrho}\_{1}^{\mathrm{T}}) \\ &= \mathbf{B}\_{11.2}^{-1} + \boldsymbol{\mu}\_{11.2}\boldsymbol{\mu}\_{11.2}^{\mathrm{T}} . \end{split}$$

Similarly, the asymptotic covariance of the estimator **β̂**<sub>1</sub><sup>RSM</sup> is obtained as:

$$\begin{split} \text{Cov}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}}) &= E\{\lim\_{n\to\infty} \sqrt{n}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}} - \boldsymbol{\beta}\_{1}) \sqrt{n}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}} - \boldsymbol{\beta}\_{1})^{\mathrm{T}}\} \\ &= E(\boldsymbol{\varrho}\_{2}\boldsymbol{\varrho}\_{2}^{\mathrm{T}}) = \text{Cov}(\boldsymbol{\varrho}\_{2}\boldsymbol{\varrho}\_{2}^{\mathrm{T}}) + E(\boldsymbol{\varrho}\_{2})E(\boldsymbol{\varrho}\_{2}^{\mathrm{T}}) \\ &= \mathbf{B}\_{11}^{-1} + \boldsymbol{\gamma}\boldsymbol{\gamma}^{\mathrm{T}}. \end{split}$$

The asymptotic covariance of the estimator **β̂**<sub>1</sub><sup>RPT</sup> is obtained as:

$$\begin{split} \text{Cov}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RPT}}) &= E\{\lim\_{n\to\infty} \sqrt{n}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RPT}} - \boldsymbol{\beta}\_{1}) \sqrt{n}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RPT}} - \boldsymbol{\beta}\_{1})^{\mathrm{T}}\} \\ &= E\{\lim\_{n\to\infty} n \left[ (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \boldsymbol{\beta}\_{1}) - (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}})I(\mathcal{L}\_{n} \le d\_{n,\alpha}) \right] \\ &\quad \left[ (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \boldsymbol{\beta}\_{1}) - (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}})I(\mathcal{L}\_{n} \le d\_{n,\alpha}) \right]^{\mathrm{T}} \} \\ &= E\left\{ [\boldsymbol{\varrho}\_{1} - \boldsymbol{\varrho}\_{3}I(\mathcal{L}\_{n} \le d\_{n,\alpha})] \left[ \boldsymbol{\varrho}\_{1} - \boldsymbol{\varrho}\_{3}I(\mathcal{L}\_{n} \le d\_{n,\alpha}) \right]^{\mathrm{T}} \right\} \\ &= E\left\{ \boldsymbol{\varrho}\_{1} \boldsymbol{\varrho}\_{1}^{\mathrm{T}} - 2\boldsymbol{\varrho}\_{3} \boldsymbol{\varrho}\_{1}^{\mathrm{T}}I(\mathcal{L}\_{n} \le d\_{n,\alpha}) + \boldsymbol{\varrho}\_{3} \boldsymbol{\varrho}\_{3}^{\mathrm{T}}I(\mathcal{L}\_{n} \le d\_{n,\alpha}) \right\}. \end{split}$$

Thus, we need to find *E*{**ϱ**<sub>1</sub>**ϱ**<sub>1</sub><sup>T</sup>}, *E*{**ϱ**<sub>3</sub>**ϱ**<sub>1</sub><sup>T</sup>*I*(ℒ<sub>*n*</sub> ≤ *d*<sub>*n*,*α*</sub>)} and *E*{**ϱ**<sub>3</sub>**ϱ**<sub>3</sub><sup>T</sup>*I*(ℒ<sub>*n*</sub> ≤ *d*<sub>*n*,*α*</sub>)}. The first term is *E*{**ϱ**<sub>1</sub>**ϱ**<sub>1</sub><sup>T</sup>} = **B**<sub>11.2</sub><sup>−1</sup> + **μ**<sub>11.2</sub>**μ**<sub>11.2</sub><sup>T</sup>. Using Lemma A1, the third term is computed as:

$$E\left\{\boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{3}^{\mathrm{T}}I(\mathcal{L}\_{n}\leq d\_{n,\alpha})\right\}=\boldsymbol{\Phi}\mathbf{H}\_{p\_{2}+2}(\chi\_{p\_{2}}^{2};\Delta)+\boldsymbol{\delta}\boldsymbol{\delta}^{\mathrm{T}}\mathbf{H}\_{p\_{2}+4}(\chi\_{p\_{2}}^{2};\Delta).$$

The second term, *E*{**ϱ**<sub>3</sub>**ϱ**<sub>1</sub><sup>T</sup>*I*(ℒ<sub>*n*</sub> ≤ *d*<sub>*n*,*α*</sub>)}, can be computed from normal theory as

$$\begin{split} E\left\{\boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{1}^{\mathrm{T}}I(\mathcal{L}\_{n}\leq d\_{n,\alpha})\right\} &= E\left\{E\left(\boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{1}^{\mathrm{T}}I(\mathcal{L}\_{n}\leq d\_{n,\alpha}) \mid \boldsymbol{\varrho}\_{3}\right)\right\} = E\left\{\boldsymbol{\varrho}\_{3}E\left(\boldsymbol{\varrho}\_{1}^{\mathrm{T}}I(\mathcal{L}\_{n}\leq d\_{n,\alpha}) \mid \boldsymbol{\varrho}\_{3}\right)\right\} \\ &= E\left\{\boldsymbol{\varrho}\_{3}\left[-\boldsymbol{\mu}\_{11.2}+(\boldsymbol{\varrho}\_{3}-\boldsymbol{\delta})\right]^{\mathrm{T}}I(\mathcal{L}\_{n}\leq d\_{n,\alpha})\right\} \\ &= -E\left\{\boldsymbol{\varrho}\_{3}\boldsymbol{\mu}\_{11.2}^{\mathrm{T}}I(\mathcal{L}\_{n}\leq d\_{n,\alpha})\right\} + E\left\{\boldsymbol{\varrho}\_{3}(\boldsymbol{\varrho}\_{3}-\boldsymbol{\delta})^{\mathrm{T}}I(\mathcal{L}\_{n}\leq d\_{n,\alpha})\right\} \\ &= -\boldsymbol{\mu}\_{11.2}^{\mathrm{T}}E\left\{\boldsymbol{\varrho}\_{3}I(\mathcal{L}\_{n}\leq d\_{n,\alpha})\right\} + E\left\{\boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{3}^{\mathrm{T}}I(\mathcal{L}\_{n}\leq d\_{n,\alpha})\right\} - E\left\{\boldsymbol{\varrho}\_{3}\boldsymbol{\delta}^{\mathrm{T}}I(\mathcal{L}\_{n}\leq d\_{n,\alpha})\right\} \\ &= -\boldsymbol{\mu}\_{11.2}^{\mathrm{T}}\boldsymbol{\delta}\mathbf{H}\_{p\_{2}+2}(\chi\_{p\_{2}}^{2};\Delta) + \left\{ \text{Cov}(\boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{3}^{\mathrm{T}})\mathbf{H}\_{p\_{2}+2}(\chi\_{p\_{2}}^{2};\Delta) + E(\boldsymbol{\varrho}\_{3})E(\boldsymbol{\varrho}\_{3}^{\mathrm{T}})\mathbf{H}\_{p\_{2}+4}(\chi\_{p\_{2}}^{2};\Delta) - \boldsymbol{\delta}\boldsymbol{\delta}^{\mathrm{T}}\mathbf{H}\_{p\_{2}+2}(\chi\_{p\_{2}}^{2};\Delta) \right\} \\ &= -\boldsymbol{\mu}\_{11.2}^{\mathrm{T}}\boldsymbol{\delta}\mathbf{H}\_{p\_{2}+2}(\chi\_{p\_{2}}^{2};\Delta) + \boldsymbol{\Phi}\mathbf{H}\_{p\_{2}+2}(\chi\_{p\_{2}}^{2};\Delta) + \boldsymbol{\delta}\boldsymbol{\delta}^{\mathrm{T}}\mathbf{H}\_{p\_{2}+4}(\chi\_{p\_{2}}^{2};\Delta) - \boldsymbol{\delta}\boldsymbol{\delta}^{\mathrm{T}}\mathbf{H}\_{p\_{2}+2}(\chi\_{p\_{2}}^{2};\Delta). \end{split}$$

Putting all the terms together and simplifying, we obtain

$$\begin{split} & \text{Cov}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RPT}}) \\ &= \boldsymbol{\mu}\_{11.2}\boldsymbol{\mu}\_{11.2}^{\mathrm{T}} + 2\boldsymbol{\mu}\_{11.2}^{\mathrm{T}}\boldsymbol{\delta}\mathbf{H}\_{p\_{2}+2}(\chi\_{p\_{2}}^{2};\Delta) + \mathbf{B}\_{11.2}^{-1} - \boldsymbol{\Phi}\mathbf{H}\_{p\_{2}+2}(\chi\_{p\_{2}}^{2};\Delta) - \boldsymbol{\delta}\boldsymbol{\delta}^{\mathrm{T}}\mathbf{H}\_{p\_{2}+4}(\chi\_{p\_{2}}^{2};\Delta) \\ &+ 2\boldsymbol{\delta}\boldsymbol{\delta}^{\mathrm{T}}\mathbf{H}\_{p\_{2}+2}(\chi\_{p\_{2}}^{2};\Delta) \\ &= \mathbf{B}\_{11.2}^{-1} + \boldsymbol{\mu}\_{11.2}\boldsymbol{\mu}\_{11.2}^{\mathrm{T}} + 2\boldsymbol{\mu}\_{11.2}^{\mathrm{T}}\boldsymbol{\delta}\mathbf{H}\_{p\_{2}+2}(\chi\_{p\_{2}}^{2};\Delta) - \boldsymbol{\Phi}\mathbf{H}\_{p\_{2}+2}(\chi\_{p\_{2}}^{2};\Delta) \\ &+ \boldsymbol{\delta}\boldsymbol{\delta}^{\mathrm{T}}\left[2\mathbf{H}\_{p\_{2}+2}(\chi\_{p\_{2}}^{2};\Delta) - \mathbf{H}\_{p\_{2}+4}(\chi\_{p\_{2}}^{2};\Delta)\right]. \end{split}$$

The asymptotic covariance of the estimator **β̂**<sub>1</sub><sup>RSE</sup> can be obtained as

$$\begin{split} \text{Cov}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RSE}}) &= E\{\lim\_{n\to\infty} \sqrt{n}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RSE}} - \boldsymbol{\beta}\_{1})\sqrt{n}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RSE}} - \boldsymbol{\beta}\_{1})^{\mathrm{T}}\} \\ &= E\{\lim\_{n\to\infty} n \left[ (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \boldsymbol{\beta}\_{1}) - (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}})(p\_{2} - 2)\mathcal{L}\_{n}^{-1} \right] \\ &\quad \left[ (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \boldsymbol{\beta}\_{1}) - (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}})(p\_{2} - 2)\mathcal{L}\_{n}^{-1} \right]^{\mathrm{T}} \} \\ &= E\left\{ [\boldsymbol{\varrho}\_{1} - \boldsymbol{\varrho}\_{3}(p\_{2} - 2)\mathcal{L}\_{n}^{-1}][\boldsymbol{\varrho}\_{1} - \boldsymbol{\varrho}\_{3}(p\_{2} - 2)\mathcal{L}\_{n}^{-1}]^{\mathrm{T}} \right\} \\ &= E\left\{ \boldsymbol{\varrho}\_{1}\boldsymbol{\varrho}\_{1}^{\mathrm{T}} - 2(p\_{2} - 2)\boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{1}^{\mathrm{T}}\mathcal{L}\_{n}^{-1} + (p\_{2} - 2)^{2}\boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{3}^{\mathrm{T}}\mathcal{L}\_{n}^{-2} \right\}. \end{split}$$

We need to compute *E*{**ϱ**<sub>3</sub>**ϱ**<sub>3</sub><sup>T</sup>ℒ<sub>*n*</sub><sup>−2</sup>} and *E*{**ϱ**<sub>3</sub>**ϱ**<sub>1</sub><sup>T</sup>ℒ<sub>*n*</sub><sup>−1</sup>}. By using Lemma A1, the first term is obtained as follows:

$$E\left\{\boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{3}^{\mathrm{T}}\mathcal{L}\_{n}^{-2}\right\}=\boldsymbol{\Phi}E\left(\chi\_{p\_{2}+2}^{-4}(\Delta)\right)+\boldsymbol{\delta}\boldsymbol{\delta}^{\mathrm{T}}E\left(\chi\_{p\_{2}+4}^{-4}(\Delta)\right).$$

The second term is computed from normal theory as

$$\begin{split} E\left\{\boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{1}^{\mathrm{T}}\mathcal{L}\_{n}^{-1}\right\} &= E\left\{E\left(\boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{1}^{\mathrm{T}}\mathcal{L}\_{n}^{-1} \mid \boldsymbol{\varrho}\_{3}\right)\right\} = E\left\{\boldsymbol{\varrho}\_{3}E\left(\boldsymbol{\varrho}\_{1}^{\mathrm{T}}\mathcal{L}\_{n}^{-1} \mid \boldsymbol{\varrho}\_{3}\right)\right\} \\ &= E\left\{\boldsymbol{\varrho}\_{3}\left[-\boldsymbol{\mu}\_{11.2}+\left(\boldsymbol{\varrho}\_{3}-\boldsymbol{\delta}\right)\right]^{\mathrm{T}}\mathcal{L}\_{n}^{-1}\right\} \\ &= -E\left\{\boldsymbol{\varrho}\_{3}\boldsymbol{\mu}\_{11.2}^{\mathrm{T}}\mathcal{L}\_{n}^{-1}\right\} + E\left\{\boldsymbol{\varrho}\_{3}\left(\boldsymbol{\varrho}\_{3}-\boldsymbol{\delta}\right)^{\mathrm{T}}\mathcal{L}\_{n}^{-1}\right\} \\ &= -\boldsymbol{\mu}\_{11.2}^{\mathrm{T}}E\left\{\boldsymbol{\varrho}\_{3}\mathcal{L}\_{n}^{-1}\right\} + E\left\{\boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{3}^{\mathrm{T}}\mathcal{L}\_{n}^{-1}\right\} - E\left\{\boldsymbol{\varrho}\_{3}\boldsymbol{\delta}^{\mathrm{T}}\mathcal{L}\_{n}^{-1}\right\}. \end{split}$$

From the above, we find

$$E\left\{\boldsymbol{\varrho}\_{3}\boldsymbol{\delta}^{\mathrm{T}}\mathcal{L}\_{n}^{-1}\right\} = \boldsymbol{\delta}\boldsymbol{\delta}^{\mathrm{T}}E\left(\chi\_{p\_{2}+2}^{-2}(\Delta)\right), \qquad E\left\{\boldsymbol{\varrho}\_{3}\mathcal{L}\_{n}^{-1}\right\} = \boldsymbol{\delta}E\left(\chi\_{p\_{2}+2}^{-2}(\Delta)\right).$$

Putting these terms together and simplifying, we obtain

$$\begin{split} \text{Cov}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RSE}}) &= \mathbf{B}\_{11.2}^{-1} + \boldsymbol{\mu}\_{11.2}\boldsymbol{\mu}\_{11.2}^{\mathrm{T}} + 2(p\_{2} - 2)\boldsymbol{\mu}\_{11.2}^{\mathrm{T}} \boldsymbol{\delta} E \left(\chi\_{p\_{2} + 2}^{-2}(\Delta)\right) \\ &- (p\_{2} - 2)\boldsymbol{\Phi} \left\{ 2E \left(\chi\_{p\_{2} + 2}^{-2}(\Delta)\right) - (p\_{2} - 2)E \left(\chi\_{p\_{2} + 2}^{-4}(\Delta)\right) \right\} \\ &+ (p\_{2} - 2)\boldsymbol{\delta}\boldsymbol{\delta}^{\mathrm{T}} \left\{ -2E \left(\chi\_{p\_{2} + 4}^{-2}(\Delta)\right) + 2E \left(\chi\_{p\_{2} + 2}^{-2}(\Delta)\right) + (p\_{2} - 2)E \left(\chi\_{p\_{2} + 4}^{-4}(\Delta)\right) \right\}. \end{split}$$
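The inverse moments of the non-central chi-square distribution that appear in the bias and covariance of the Stein-type estimator can be evaluated numerically. The sketch below is an illustration rather than the authors' computation. It assumes the common convention in which a non-central chi-square variable with non-centrality Δ is a Poisson(Δ/2) mixture of central chi-square variables (the paper's own scaling of Δ may differ), and it uses the standard central moments *E*(χ<sub>*m*</sub><sup>−2</sup>) = 1/(*m* − 2) and *E*(χ<sub>*m*</sub><sup>−4</sup>) = 1/{(*m* − 2)(*m* − 4)}. The function name is hypothetical.

```python
import math


def inv_chi2_moment(df, delta, order=1, tol=1e-12, max_terms=10_000):
    """E[(chi^2_df(delta))^(-order)] for order 1 or 2, via a Poisson mixture.

    Assumed convention: chi^2_df(delta) is a Poisson(delta / 2) mixture of
    central chi^2_{df + 2j} variables. Intended for moderate values of delta.
    """
    if order not in (1, 2):
        raise ValueError("order must be 1 or 2")
    if df <= 2 * order:
        raise ValueError("moment is infinite unless df > 2 * order")
    lam = delta / 2.0
    weight = math.exp(-lam)  # Poisson weight for j = 0
    total, cum = 0.0, 0.0
    for j in range(max_terms):
        m = df + 2 * j
        central = 1.0 / (m - 2) if order == 1 else 1.0 / ((m - 2) * (m - 4))
        total += weight * central
        cum += weight
        if 1.0 - cum < tol:  # remaining Poisson mass is negligible
            break
        weight *= lam / (j + 1)
    return total


# Shrinkage factor multiplying delta in ADB of the RSE, for a hypothetical p2 = 5.
p2 = 5
for delta in (0.0, 1.0, 4.0):
    print(delta, (p2 - 2) * inv_chi2_moment(p2 + 2, delta))
```

At Δ = 0 the factor reduces to (*p*<sub>2</sub> − 2)/*p*<sub>2</sub>, and it decreases as Δ grows, so under this convention the shrinkage contribution to the bias vanishes as the submodel becomes less plausible.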

Since

$$\hat{\boldsymbol{\beta}}\_{1}^{\text{RPS}} = \hat{\boldsymbol{\beta}}\_{1}^{\text{RSE}} - (\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}}) \left\{ 1 - (p\_{2} - 2)\mathcal{L}\_{n}^{-1} \right\} I (\mathcal{L}\_{n} \leq p\_{2} - 2),$$

we derive the covariance of the estimator **β̂**<sub>1</sub><sup>RPS</sup> as follows.

$$\begin{split} \text{Cov}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RPS}}) &= E\left\{\lim\_{n\to\infty} \sqrt{n}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RPS}} - \boldsymbol{\beta}\_{1})\sqrt{n}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RPS}} - \boldsymbol{\beta}\_{1})^{\mathrm{T}}\right\} \\ &= E\left\{\lim\_{n\to\infty} \left[\sqrt{n}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RSE}} - \boldsymbol{\beta}\_{1}) - \sqrt{n}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}})\left\{1 - (p\_{2} - 2)\mathcal{L}\_{n}^{-1}\right\} I(\mathcal{L}\_{n} \le p\_{2} - 2)\right] \right. \\ &\qquad \times \left. \left[\sqrt{n}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RSE}} - \boldsymbol{\beta}\_{1}) - \sqrt{n}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RFM}} - \hat{\boldsymbol{\beta}}\_{1}^{\text{RSM}})\left\{1 - (p\_{2} - 2)\mathcal{L}\_{n}^{-1}\right\} I(\mathcal{L}\_{n} \le p\_{2} - 2)\right]^{\mathrm{T}} \right\} \\ &= E\left\{\lim\_{n\to\infty} \left[\sqrt{n}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RSE}} - \boldsymbol{\beta}\_{1})\sqrt{n}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RSE}} - \boldsymbol{\beta}\_{1})^{\mathrm{T}} - 2\boldsymbol{\varrho}\_{3}\sqrt{n}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RSE}} - \boldsymbol{\beta}\_{1})^{\mathrm{T}}\left\{1 - (p\_{2} - 2)\mathcal{L}\_{n}^{-1}\right\} I(\mathcal{L}\_{n} \le p\_{2} - 2) \right. \right. \\ &\qquad \left. \left. + \boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{3}^{\mathrm{T}}\left\{1 - (p\_{2} - 2)\mathcal{L}\_{n}^{-1}\right\}^{2} I(\mathcal{L}\_{n} \le p\_{2} - 2)\right] \right\} \\ &= \text{Cov}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RSE}}) - 2E\left\{\lim\_{n\to\infty} \boldsymbol{\varrho}\_{3}\sqrt{n}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RSE}} - \boldsymbol{\beta}\_{1})^{\mathrm{T}}\left\{1 - (p\_{2} - 2)\mathcal{L}\_{n}^{-1}\right\} I(\mathcal{L}\_{n} \le p\_{2} - 2)\right\} \\ &\qquad + E\left\{\lim\_{n\to\infty} \boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{3}^{\mathrm{T}}\left\{1 - (p\_{2} - 2)\mathcal{L}\_{n}^{-1}\right\}^{2} I(\mathcal{L}\_{n} \le p\_{2} - 2)\right\} \\ &= \text{Cov}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RSE}}) - 2E\left\{\lim\_{n\to\infty} \boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{1}^{\mathrm{T}}\left\{1 - (p\_{2} - 2)\mathcal{L}\_{n}^{-1}\right\} I(\mathcal{L}\_{n} \le p\_{2} - 2)\right\} \\ &\qquad + 2E\left\{\lim\_{n\to\infty} \boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{3}^{\mathrm{T}}(p\_{2} - 2)\mathcal{L}\_{n}^{-1}\left\{1 - (p\_{2} - 2)\mathcal{L}\_{n}^{-1}\right\} I(\mathcal{L}\_{n} \le p\_{2} - 2)\right\} \\ &\qquad + E\left\{\lim\_{n\to\infty} \boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{3}^{\mathrm{T}}\left\{1 - (p\_{2} - 2)\mathcal{L}\_{n}^{-1}\right\}^{2} I(\mathcal{L}\_{n} \le p\_{2} - 2)\right\} \\ &= \text{Cov}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RSE}}) - 2E\left\{\lim\_{n\to\infty} \boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{1}^{\mathrm{T}}\left\{1 - (p\_{2} - 2)\mathcal{L}\_{n}^{-1}\right\} I(\mathcal{L}\_{n} \le p\_{2} - 2)\right\} \\ &\qquad - E\left\{\lim\_{n\to\infty} \boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{3}^{\mathrm{T}}(p\_{2} - 2)^{2}\mathcal{L}\_{n}^{-2} I(\mathcal{L}\_{n} \le p\_{2} - 2)\right\} + E\left\{\lim\_{n\to\infty} \boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{3}^{\mathrm{T}} I(\mathcal{L}\_{n} \le p\_{2} - 2)\right\}. \end{split}$$

We first compute the last term in the equation above:

$$E\left\{\boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{3}^{\mathrm{T}} I(\mathcal{L}\_{n} \le p\_{2} - 2)\right\} = \boldsymbol{\Phi}\mathbf{H}\_{p\_{2}+2}(p\_{2} - 2; \Delta) + \boldsymbol{\delta}\boldsymbol{\delta}^{\mathrm{T}}\mathbf{H}\_{p\_{2}+4}(p\_{2} - 2; \Delta).$$

Using Lemma A1 and normal theory, we find

$$\begin{split} & E\left\{\boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{1}^{\mathrm{T}}\left\{1 - (p\_{2} - 2)\mathcal{L}\_{n}^{-1}\right\} I(\mathcal{L}\_{n} \le p\_{2} - 2)\right\} \\ &= E\left\{E\left(\boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{1}^{\mathrm{T}}\left\{1 - (p\_{2} - 2)\mathcal{L}\_{n}^{-1}\right\} I(\mathcal{L}\_{n} \le p\_{2} - 2) \mid \boldsymbol{\varrho}\_{3}\right)\right\} \\ &= E\left\{\boldsymbol{\varrho}\_{3}E\left(\boldsymbol{\varrho}\_{1}^{\mathrm{T}}\left\{1 - (p\_{2} - 2)\mathcal{L}\_{n}^{-1}\right\} I(\mathcal{L}\_{n} \le p\_{2} - 2) \mid \boldsymbol{\varrho}\_{3}\right)\right\} \\ &= E\left\{\boldsymbol{\varrho}\_{3}\left[-\boldsymbol{\mu}\_{11.2} + (\boldsymbol{\varrho}\_{3} - \boldsymbol{\delta})\right]^{\mathrm{T}}\left\{1 - (p\_{2} - 2)\mathcal{L}\_{n}^{-1}\right\} I(\mathcal{L}\_{n} \le p\_{2} - 2)\right\} \\ &= -\boldsymbol{\mu}\_{11.2}^{\mathrm{T}} E\left\{\boldsymbol{\varrho}\_{3}\left\{1 - (p\_{2} - 2)\mathcal{L}\_{n}^{-1}\right\} I(\mathcal{L}\_{n} \le p\_{2} - 2)\right\} + E\left\{\boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{3}^{\mathrm{T}}\left\{1 - (p\_{2} - 2)\mathcal{L}\_{n}^{-1}\right\} I(\mathcal{L}\_{n} \le p\_{2} - 2)\right\} \\ &\qquad - E\left\{\boldsymbol{\varrho}\_{3}\boldsymbol{\delta}^{\mathrm{T}}\left\{1 - (p\_{2} - 2)\mathcal{L}\_{n}^{-1}\right\} I(\mathcal{L}\_{n} \le p\_{2} - 2)\right\} \\ &= -\boldsymbol{\delta}\boldsymbol{\mu}\_{11.2}^{\mathrm{T}} E\left\{\left[1 - (p\_{2} - 2)\chi\_{p\_{2}+2}^{-2}(\Delta)\right] I\left(\chi\_{p\_{2}+2}^{2}(\Delta) \le p\_{2} - 2\right)\right\} \\ &\qquad + \boldsymbol{\Phi} E\left\{\left[1 - (p\_{2} - 2)\chi\_{p\_{2}+2}^{-2}(\Delta)\right] I\left(\chi\_{p\_{2}+2}^{2}(\Delta) \le p\_{2} - 2\right)\right\} \\ &\qquad + \boldsymbol{\delta}\boldsymbol{\delta}^{\mathrm{T}} E\left\{\left[1 - (p\_{2} - 2)\chi\_{p\_{2}+4}^{-2}(\Delta)\right] I\left(\chi\_{p\_{2}+4}^{2}(\Delta) \le p\_{2} - 2\right)\right\} \\ &\qquad - \boldsymbol{\delta}\boldsymbol{\delta}^{\mathrm{T}} E\left\{\left[1 - (p\_{2} - 2)\chi\_{p\_{2}+2}^{-2}(\Delta)\right] I\left(\chi\_{p\_{2}+2}^{2}(\Delta) \le p\_{2} - 2\right)\right\}. \end{split}$$

$$\begin{split} E\left\{\boldsymbol{\varrho}\_{3}\boldsymbol{\varrho}\_{3}^{\mathrm{T}}(p\_{2}-2)^{2}\mathcal{L}\_{n}^{-2}I(\mathcal{L}\_{n}\leq p\_{2}-2)\right\} &= (p\_{2}-2)^{2}\boldsymbol{\Phi} E\left(\chi\_{p\_{2}+2}^{-4}(\Delta)I\left(\chi\_{p\_{2}+2}^{2}(\Delta)\leq p\_{2}-2\right)\right) \\ &+ (p\_{2}-2)^{2}\boldsymbol{\delta}\boldsymbol{\delta}^{\mathrm{T}}E\left(\chi\_{p\_{2}+2}^{-4}(\Delta)I\left(\chi\_{p\_{2}+2}^{2}(\Delta)\leq p\_{2}-2\right)\right). \end{split}$$

Putting all the terms together, we obtain

$$\begin{split} \text{Cov}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RPS}}) &= \text{Cov}(\hat{\boldsymbol{\beta}}\_{1}^{\text{RSE}}) + 2\boldsymbol{\delta}\boldsymbol{\mu}\_{11.2}^{\mathrm{T}} E\left\{\left[1 - (p\_{2} - 2)\chi\_{p\_{2}+2}^{-2}(\Delta)\right] I\left(\chi\_{p\_{2}+2}^{2}(\Delta) \le p\_{2} - 2\right)\right\} \\ &- 2\boldsymbol{\Phi} E\left\{\left[1 - (p\_{2} - 2)\chi\_{p\_{2}+2}^{-2}(\Delta)\right] I\left(\chi\_{p\_{2}+2}^{2}(\Delta) \le p\_{2} - 2\right)\right\} \\ &- 2\boldsymbol{\delta}\boldsymbol{\delta}^{\mathrm{T}} E\left\{\left[1 - (p\_{2} - 2)\chi\_{p\_{2}+4}^{-2}(\Delta)\right] I\left(\chi\_{p\_{2}+4}^{2}(\Delta) \le p\_{2} - 2\right)\right\} \\ &+ 2\boldsymbol{\delta}\boldsymbol{\delta}^{\mathrm{T}} E\left\{\left[1 - (p\_{2} - 2)\chi\_{p\_{2}+2}^{-2}(\Delta)\right] I\left(\chi\_{p\_{2}+2}^{2}(\Delta) \le p\_{2} - 2\right)\right\} \\ &- (p\_{2} - 2)^{2}\boldsymbol{\Phi} E\left\{\chi\_{p\_{2}+2}^{-4}(\Delta) I\left(\chi\_{p\_{2}+2}^{2}(\Delta) \le p\_{2} - 2\right)\right\} \\ &- (p\_{2} - 2)^{2}\boldsymbol{\delta}\boldsymbol{\delta}^{\mathrm{T}} E\left\{\chi\_{p\_{2}+2}^{-4}(\Delta) I\left(\chi\_{p\_{2}+2}^{2}(\Delta) \le p\_{2} - 2\right)\right\} \\ &+ \boldsymbol{\Phi}\mathbf{H}\_{p\_{2}+2}(p\_{2} - 2; \Delta) + \boldsymbol{\delta}\boldsymbol{\delta}^{\mathrm{T}}\mathbf{H}\_{p\_{2}+4}(p\_{2} - 2; \Delta). \end{split}$$

### **References**


### *Article* **Edge-Preserving Denoising of Image Sequences**

**Fan Yi \* and Peihua Qiu**

Department of Biostatistics, University of Florida, Gainesville, FL 32603, USA; pqiu@ufl.edu

**\*** Correspondence: yifan@ufl.edu; Tel.: +1-352-745-4977

**Abstract:** To monitor the Earth's surface, the satellite of the NASA Landsat program provides us image sequences of any region on the Earth constantly over time. These image sequences give us a unique resource to study the Earth's surface, changes of the Earth resource over time, and their implications in agriculture, geology, forestry, and more. Besides natural sciences, image sequences are also commonly used in functional magnetic resonance imaging (fMRI) of medical studies for understanding the functioning of brains and other organs. In practice, observed images almost always contain noise and other contaminations. For a reliable subsequent image analysis, it is important to remove such contaminations in advance. This paper focuses on image sequence denoising, which has not been well-discussed in the literature yet. To this end, an edge-preserving image denoising procedure is suggested. The suggested method is based on a jump-preserving local smoothing procedure, in which the bandwidths are chosen such that the possible spatio-temporal correlations in the observed image intensities are accommodated properly. Both theoretical arguments and numerical studies show that this method works well in the various cases considered.

**Keywords:** bandwidth selection; correlation; edge-preserving image denoising; image sequence; jump regression analysis; local smoothing; nonparametric regression; spatio-temporal data

**Citation:** Yi, F.; Qiu, P. Edge-Preserving Denoising of Image Sequences. *Entropy* **2021**, *23*, 1332. https://doi.org/10.3390/e23101332

Academic Editors: Amelia Carolina Sparavigna, Farouk Nathoo and S. Ejaz Ahmed

Received: 2 September 2021 Accepted: 7 October 2021 Published: 12 October 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

### **1. Introduction**

The Landsat project, led by the US Geological Survey (USGS) and NASA, has launched eight satellites since 1972 to continuously provide scientifically valuable images of the Earth's surface. These images can be freely accessed by researchers around the world (cf., Zanter [1]). This rich archive of Landsat images has become a major resource for scientific research about the Earth's surface and its resources in different scientific disciplines, including forest science, climate science, agriculture, ecology, fire science, and many more. As an example, Figure 1 shows two images of the Las Vegas area in Nevada taken in 1984 and 2007, respectively. These two images clearly show the increasing urban sprawl of Las Vegas during the 23-year period, and consequently, the environment in that region has changed dramatically. The current satellite (i.e., the Landsat 8) can deliver an image of a given region roughly every 16 days. So, we have a sequence of images of that region collected sequentially over time, stored in the Landsat database, which is increasing all the time. Image sequences are commonly used in many other applications, including functional magnetic resonance imaging (fMRI) in neuroscience and quality control in manufacturing industries (Qiu [2]). In practice, observed images usually contain noise and other contaminations (Gonzalez and Woods [3]). For reliable subsequent image analyses, such contaminations should be removed in advance. In the image processing literature, the removal of noise from an observed image is referred to as image denoising. This paper focuses on image denoising for analyzing observed image sequences.

In the literature, there has been extensive discussion on image denoising (Qiu [4]). Many early methods in the computer science literature are based on the Markov random field (MRF) framework, in which observed image intensities of an image are assumed to have the Markov property that the observed intensity at a given pixel depends only on the observed intensities in a neighborhood of the given pixel (Geman and Geman [5]). Then, if the true image is assumed to have a prior distribution which is also an MRF, its posterior distribution would be an MRF too, and consequently, the true image can be estimated by the maximum a posteriori (MAP) estimator (e.g., Geman and Geman [5], Besag [6], Fessler et al. [7]). Other popular image denoising methods include those based on diffusion equations (e.g., Perona and Malik [8], Weickert [9]), total variation (Beck and Teboulle [10], Rudin et al. [11], Yuan et al. [12]), wavelet transformations (e.g., Chang et al. [13], Mrázek [14]), jump regression analysis (e.g., Gijbels et al. [15], Qiu [16], Qiu [17], Qiu and Mukherjee [18]), adaptive weights smoothing (e.g., Polzehl and Spokoiny [19]), spatial adaption (e.g., Kervrann and Boulanger [20]) and more. Besides noise removal, edge-preserving is important for image denoising because edges are important structures of the images. Some of the methods mentioned above can preserve edges well, such as the ones based on jump regression analysis, total variation, and wavelet transformations. Thorough surveys of popular edge-preserving image denoising methods can be found in Jain and Tyagi [21] and Qiu [4].

**Figure 1.** Two Landsat images of the Las Vegas area taken in 1984 (**left panel**) and 2007 (**right panel**).

Although there are already some existing methods for edge-preserving image denoising, almost all of them handle observed images taken at a single time point. So far, we have not found much discussion about denoising image sequences, which is the focus of the current paper. A given image sequence often describes a gradual change in appearance over time, subject to the underlying process. For instance, the sequence of images of the Las Vegas area acquired by the Landsat satellite (cf., Figure 1) describes the gradual change of the Earth's surface in that area over time. As mentioned above, two consecutive images in the sequence acquired by the current Landsat satellite are only about 16 days apart. So, their difference should be very small. However, the images could be substantially different after a long period of time, as shown in Figure 1. In such applications, it should be reasonable to assume that edge locations in different images either do not change or change gradually over time. To handle such image sequences, the neighboring images should be useful when denoising the image at a given time point, or information in neighboring images should be shared during image denoising. By noticing such features of image sequences, we propose an edge-preserving image denoising procedure for analyzing image sequences in this paper. Our proposed method is based on the jump regression analysis (JRA) used for regression modeling when the underlying regression function has jumps or other singularities (Qiu [22]). It is a local smoothing procedure, and the possible spatio-temporal correlation in the observed image data has been accommodated properly in its construction. Both theoretical arguments and numerical studies show that this method works well in a variety of cases.

The remaining parts of the article are organized as follows. The proposed method is described in detail in Section 2. Its statistical properties and the numerical studies about its performance in different finite-sample cases are presented in Section 3. Several concluding remarks are provided in Section 4. Some technical details are given in Appendix A.

### **2. Materials and Methods**

This section describes our proposed method in two parts. A JRA model for describing an image sequence and the model estimation are discussed in Section 2.1. Selection of several parameters used in model estimation is discussed in Section 2.2.

### *2.1. JRA Model and Its Estimation*

To describe an image sequence, let us consider the following JRA model:

$$Z\_{ijk} = f(x\_i, y\_j; t\_k) + \varepsilon\_{ijk}, \quad i = 1, 2, \dots, n\_x, \; j = 1, 2, \dots, n\_y, \; k = 1, 2, \dots, n\_t, \tag{1}$$

where *Z<sub>ijk</sub>* is the observed image intensity level at the (*i*, *j*)-th pixel (*x<sub>i</sub>*, *y<sub>j</sub>*) and at the *k*-th time point *t<sub>k</sub>*, *f*(*x<sub>i</sub>*, *y<sub>j</sub>*; *t<sub>k</sub>*) is the true image intensity level, and *ε<sub>ijk</sub>* is the pointwise random noise with mean 0 and variance *σ*<sup>2</sup>. In model (1), spatio-temporal data correlation is allowed; namely, {*ε<sub>ijk</sub>*} could be correlated over *i*, *j* and *k*. For image data, the pixel locations are usually regularly spaced. Without loss of generality, it is assumed that they are equally spaced in the design space Ω = [0, 1] × [0, 1], namely, (*x<sub>i</sub>*, *y<sub>j</sub>*) = (*i*/*n<sub>x</sub>*, *j*/*n<sub>y</sub>*), for all *i* and *j*, where *n<sub>x</sub>* and *n<sub>y</sub>* are the numbers of rows and columns, respectively. The observation times {*t<sub>k</sub>*, *k* = 1, 2, ..., *n<sub>t</sub>*} are also assumed to be equally spaced in the time interval [0, 1]. The true image intensity function *f*(*x*, *y*; *t*), for (*x*, *y*) ∈ Ω, is continuous in the design space Ω at each *t* ∈ [0, 1], except on the edges where it has jumps.
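To make the data layout of model (1) concrete, the following is a minimal simulation sketch. The intensity function (a smooth background plus a disk whose radius grows over time), the choice *t<sub>k</sub>* = *k*/*n<sub>t</sub>*, and the use of independent Gaussian noise are all illustrative assumptions of ours; model (1) itself allows correlated noise and does not prescribe any of these choices. The function names are hypothetical.

```python
import random


def true_intensity(x, y, t):
    """Hypothetical f(x, y; t): smooth background plus a disk whose radius grows with t."""
    background = 0.5 * x + 0.25 * y
    radius = 0.2 + 0.1 * t
    inside = (x - 0.5) ** 2 + (y - 0.5) ** 2 <= radius ** 2
    return background + (1.0 if inside else 0.0)  # jump of size 1 across the edge


def simulate_sequence(n_x, n_y, n_t, sigma, seed=0):
    """Return Z[i][j][k] on the grid (x_i, y_j) = (i/n_x, j/n_y), t_k = k/n_t.

    Indices are 0-based in code, so Z[i - 1][j - 1][k - 1] corresponds to Z_ijk.
    Noise is i.i.d. N(0, sigma^2) here, a simplifying assumption.
    """
    rng = random.Random(seed)
    return [
        [
            [
                true_intensity(i / n_x, j / n_y, k / n_t) + rng.gauss(0.0, sigma)
                for k in range(1, n_t + 1)
            ]
            for j in range(1, n_y + 1)
        ]
        for i in range(1, n_x + 1)
    ]


Z = simulate_sequence(n_x=8, n_y=6, n_t=4, sigma=0.1)
print(len(Z), len(Z[0]), len(Z[0][0]))
```

Because the disk radius changes slowly in *t*, consecutive images in the simulated sequence share nearly the same edge locations, which is the feature of image sequences that the proposed method exploits.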

To estimate the unknown image intensity function *f*(*x*, *y*; *t*) in model (1), we consider using a local smoothing method, instead of a global smoothing method (e.g., the smoothing spline method), because of the large amount of data involved in the current problem. Moreover, it has been well discussed in the JRA literature that conventional smoothing methods (e.g., conventional local kernel smoothing methods) would not work well for estimating models like (1), where the true image intensity function *f*(*x*, *y*; *t*) has jumps at the edges, because the jumps would be blurred by such conventional methods (cf., Qiu [22]). In this paper, we suggest a jump-preserving local smoothing method for estimating (1), described in detail below. For a given point (*x*, *y*; *t*) ∈ Ω × [0, 1], define a local neighborhood

$$\begin{aligned} O(x, y; t) &= \{ \left( x', y'; t' \right) : \left( x', y'; t' \right) \in \Omega \times [0, 1], \\ &\sqrt{\frac{(x' - x)^2}{h\_x^2} + \frac{(y' - y)^2}{h\_y^2}} \le 1, |t' - t| / h\_t \le 1 \} \end{aligned}$$

where *hx*, *hy* and *ht* are the bandwidths along the *x*-, *y*- and *t*-axes, respectively. In *O*(*x*, *y*; *t*), we first consider the following local linear kernel (LLK) smoothing procedure (Fan and Gijbels [23]):

$$\min\_{a,b,c,d} \sum\_{i=1}^{n\_x} \sum\_{j=1}^{n\_y} \sum\_{k=1}^{n\_t} \left\{ Z\_{ijk} - \left[ a + b(x\_i - x) + c(y\_j - y) + d(t\_k - t) \right] \right\}^2 K\left( \frac{x\_i - x}{h\_x}, \frac{y\_j - y}{h\_y} \right) K\left( \frac{t\_k - t}{h\_t} \right), \tag{2}$$

where *K*(*v*) is a density kernel function with the support {*v* : |*v*| ≤ 1}. The solutions to (*a*, *b*, *c*, *d*) of the minimization problem (2) are denoted as $\widehat{a}(x, y; t)$, $\widehat{b}(x, y; t)$, $\widehat{c}(x, y; t)$ and $\widehat{d}(x, y; t)$, respectively. It can be checked that they have the following expressions:

$$
\begin{bmatrix}
\widehat{a}(x,y;t) \\
\widehat{b}(x,y;t) \\
\widehat{c}(x,y;t) \\
\widehat{d}(x,y;t)
\end{bmatrix} = \begin{bmatrix}
m\_{000} & m\_{100} & m\_{010} & m\_{001} \\
m\_{100} & m\_{200} & m\_{110} & m\_{101} \\
m\_{010} & m\_{110} & m\_{020} & m\_{011} \\
m\_{001} & m\_{101} & m\_{011} & m\_{002}
\end{bmatrix}^{-1} \begin{bmatrix}
\sum\_{ijk} Z\_{ijk} K\_{ijk} \\
\sum\_{ijk} (x\_i - x) Z\_{ijk} K\_{ijk} \\
\sum\_{ijk} (y\_j - y) Z\_{ijk} K\_{ijk} \\
\sum\_{ijk} (t\_k - t) Z\_{ijk} K\_{ijk}
\end{bmatrix} \tag{3}
$$

where $\sum\_{ijk}$ denotes $\sum\_{i=1}^{n\_x} \sum\_{j=1}^{n\_y} \sum\_{k=1}^{n\_t}$, $K\_{ijk}$ denotes $K\left(\frac{x\_i - x}{h\_x}, \frac{y\_j - y}{h\_y}\right) K\left(\frac{t\_k - t}{h\_t}\right)$, and $m\_{rsl} = \sum\_{ijk} (x\_i - x)^r (y\_j - y)^s (t\_k - t)^l K\_{ijk}$, for $r, s, l = 0, 1, 2$. The LLK estimator of $f(x, y; t)$ is defined to be $\widehat{a}(x, y; t)$. The estimated gradient direction of $f(x, y; t)$ at $(x, y; t)$ is $G(x, y; t) = (\widehat{b}(x, y; t), \widehat{c}(x, y; t), \widehat{d}(x, y; t))$, which indicates the direction in which the plane estimated in $O(x, y; t)$ by the LLK procedure (2) increases the fastest. If there is an edge surface in $O(x, y; t)$, then $G(x, y; t)$ would be (approximately) orthogonal to that surface.
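The LLK fit in (2) and (3) is a weighted least-squares plane fit, which can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `llk_fit`, the Epanechnikov-type kernels, and the time grid $t\_k = k/n\_t$ are assumptions made for the example.

```python
import numpy as np


def llk_fit(Z, x, y, t, hx, hy, ht):
    """Local linear kernel fit at (x, y; t); returns the estimates (a, b, c, d)."""
    nx, ny, nt = Z.shape
    # Design points (x_i, y_j) = (i/nx, j/ny); t_k = k/nt is an assumed spacing.
    xi = np.arange(1, nx + 1) / nx
    yj = np.arange(1, ny + 1) / ny
    tk = np.arange(1, nt + 1) / nt
    grids = np.meshgrid(xi - x, yj - y, tk - t, indexing="ij")
    dx, dy, dt = (g.ravel() for g in grids)
    # Assumed Epanechnikov-type kernels on the unit disk and on [-1, 1].
    # Normalizing constants are omitted because they cancel in the solution.
    k_space = np.clip(1.0 - (dx / hx) ** 2 - (dy / hy) ** 2, 0.0, None)
    k_time = np.clip(1.0 - (dt / ht) ** 2, 0.0, None)
    w = k_space * k_time
    keep = w > 0  # only points inside the neighborhood O(x, y; t) contribute
    w = w[keep]
    design = np.column_stack(
        [np.ones(w.size), dx[keep], dy[keep], dt[keep]]
    )
    # Weighted normal equations: the moment matrix (m_rsl) and right-hand side of (3).
    moments = design.T @ (w[:, None] * design)
    rhs = design.T @ (w * Z.ravel()[keep])
    return np.linalg.solve(moments, rhs)
```

Because the fit is locally linear, it reproduces a noise-free plane exactly at interior points, which gives a convenient sanity check. The gradient direction $G(x, y; t)$ is the last three entries of the returned vector.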

In cases when there are no edges in the neighborhood $O(x, y; t)$, $\widehat{a}(x, y; t)$ would be a good estimate of $f(x, y; t)$. Otherwise, it cannot be a good estimate: because $\widehat{a}(x, y; t)$ is a weighted average of all observed image intensities in $O(x, y; t)$, the jumps in the image intensity surface would be smoothed out in the weighted average, and the estimate $\widehat{a}(x, y; t)$ would be biased for estimating $f(x, y; t)$. To overcome that limitation, we consider the following one-sided smoothing idea. Let $O(x, y; t)$ be divided into two parts $O^{(1)}(x, y; t)$ and $O^{(2)}(x, y; t)$ by a plane that passes through $(x, y; t)$ and is perpendicular to $G(x, y; t)$. See Figure 2 for an example.

**Figure 2.** The neighborhood $O(x, y; t)$ is divided into two parts by a plane that passes through $(x, y; t)$ and is perpendicular to the estimated gradient direction $G(x, y; t)$.

Then, in cases when there is an edge surface in $O(x, y; t)$, that plane would be (approximately) parallel to the edge surface. Consequently, at least one of $O^{(1)}(x, y; t)$ and $O^{(2)}(x, y; t)$ would be (mostly) located on a single side of the edge surface in such cases. Now, let us consider the following one-sided LLK smoothing procedure: for $l = 1, 2$,

$$\min\_{a,b,c,d} \sum\_{(x\_i, y\_j; t\_k) \in O^{(l)}(x, y; t)} \left\{ Z\_{ijk} - \left[ a + b(x\_i - x) + c(y\_j - y) + d(t\_k - t) \right] \right\}^2 K\left( \frac{x\_i - x}{h\_x}, \frac{y\_j - y}{h\_y} \right) K\left( \frac{t\_k - t}{h\_t} \right). \tag{4}$$

The solutions of (4) to $(a, b, c, d)$ are denoted as $(\widehat{a}^{(l)}(x, y; t), \widehat{b}^{(l)}(x, y; t), \widehat{c}^{(l)}(x, y; t), \widehat{d}^{(l)}(x, y; t))$, for $l = 1, 2$. Intuitively, when there are no edges in $O(x, y; t)$, $\widehat{a}(x, y; t)$, $\widehat{a}^{(1)}(x, y; t)$ and $\widehat{a}^{(2)}(x, y; t)$ are all consistent estimates of $f(x, y; t)$ under some regularity conditions. In such cases, $\widehat{a}(x, y; t)$ would be preferred since it averages more observations and consequently has a smaller variance. When there are edges in $O(x, y; t)$, $\widehat{a}(x, y; t)$ would not be a good estimate of $f(x, y; t)$ as explained above, but one of $\widehat{a}^{(1)}(x, y; t)$ and $\widehat{a}^{(2)}(x, y; t)$ should estimate $f(x, y; t)$ well. Therefore, in all cases, at least one of the three estimators $\widehat{a}(x, y; t)$, $\widehat{a}^{(1)}(x, y; t)$ and $\widehat{a}^{(2)}(x, y; t)$ should estimate $f(x, y; t)$ well.
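The one-sided procedure (4) can be sketched by first fitting the full neighborhood, then splitting its points by the sign of their inner product with the estimated gradient. The names `one_sided_fits` and `_weighted_plane`, the Epanechnikov-type kernels, the time grid $t\_k = k/n\_t$, and the convention that part 1 lies on the side the gradient points to are all assumptions of this sketch, not details taken from the paper.

```python
import numpy as np


def _weighted_plane(dx, dy, dt, z, w):
    """Weighted least-squares plane; returns the coefficients (a, b, c, d)."""
    design = np.column_stack([np.ones_like(dx), dx, dy, dt])
    lhs = design.T @ (w[:, None] * design)
    rhs = design.T @ (w * z)
    return np.linalg.solve(lhs, rhs)


def one_sided_fits(Z, x, y, t, hx, hy, ht):
    """Return coefficient vectors (full, part1, part2) at the point (x, y; t)."""
    nx, ny, nt = Z.shape
    xi = np.arange(1, nx + 1) / nx
    yj = np.arange(1, ny + 1) / ny
    tk = np.arange(1, nt + 1) / nt  # assumed time spacing
    grids = np.meshgrid(xi - x, yj - y, tk - t, indexing="ij")
    dx, dy, dt = (g.ravel() for g in grids)
    # Assumed Epanechnikov-type kernels (unnormalized).
    w = (np.clip(1.0 - (dx / hx) ** 2 - (dy / hy) ** 2, 0.0, None)
         * np.clip(1.0 - (dt / ht) ** 2, 0.0, None))
    keep = w > 0
    dx, dy, dt, z, w = dx[keep], dy[keep], dt[keep], Z.ravel()[keep], w[keep]

    full = _weighted_plane(dx, dy, dt, z, w)
    # Signed position relative to the plane through (x, y; t) that is
    # perpendicular to the estimated gradient G = (b, c, d).
    side = full[1] * dx + full[2] * dy + full[3] * dt
    part1 = side >= 0  # assumed labelling of O^(1)
    part2 = ~part1     # assumed labelling of O^(2)
    fit1 = _weighted_plane(dx[part1], dy[part1], dt[part1], z[part1], w[part1])
    fit2 = _weighted_plane(dx[part2], dy[part2], dt[part2], z[part2], w[part2])
    return full, fit1, fit2
```

For a noise-free step edge evaluated just beside the jump, the two one-sided intercepts recover the two sides of the step, while the full-neighborhood intercept falls in between. The sketch does not guard against a half-neighborhood with too few points, which would make the linear system singular.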

Next, we need to choose a good estimator from $\widehat{a}(x, y; t)$, $\widehat{a}^{(1)}(x, y; t)$ and $\widehat{a}^{(2)}(x, y; t)$ based on the observed data, which is not straightforward, partly because we do not know in advance whether there are edges in the neighborhood $O(x, y; t)$ and whether the edges are mostly contained in $O^{(1)}(x, y; t)$ or $O^{(2)}(x, y; t)$ if the answer to the first question is positive. To overcome this difficulty, let us consider the following weighted residual mean squares (WRMS) of the fitted local plane by the LLK procedure (2):

$$e(x, y; t) = \left\{ \sum\_{ijk} \left[ Z\_{ijk} - \widehat{a}(x, y; t) - \widehat{b}(x, y; t)(x\_i - x) - \widehat{c}(x, y; t)(y\_j - y) - \widehat{d}(x, y; t)(t\_k - t) \right]^2 K\_{ijk} \right\} \Big/ \sum\_{ijk} K\_{ijk}. \tag{5}$$

The above WRMS measures how well the fitted local plane describes the observed data in $O(x, y; t)$. If there are edges in $O(x, y; t)$, this quantity would be relatively large, due mainly to the jumps in the image intensity surface. Otherwise, it would be relatively small. So, the quantity $e(x, y; t)$ contains useful information about the existence of edges in $O(x, y; t)$. Similarly, we can define WRMS values for the two one-sided local planes fitted in $O^{(1)}(x, y; t)$ and $O^{(2)}(x, y; t)$. They are denoted as $e^{(1)}(x, y; t)$ and $e^{(2)}(x, y; t)$. Based on these WRMS values, we define our edge-preserving estimator of $f(x, y; t)$ to be

$$\begin{split} \hat{f}(\mathbf{x}, y; t) &= \hat{a}(\mathbf{x}, y; t) I(D(\mathbf{x}, y; t) \le u) \\ &+ \hat{a}^{(1)}(\mathbf{x}, y; t) I(D(\mathbf{x}, y; t) > u) I(e^{(1)}(\mathbf{x}, y; t) < e^{(2)}(\mathbf{x}, y; t)) \\ &+ \hat{a}^{(2)}(\mathbf{x}, y; t) I(D(\mathbf{x}, y; t) > u) I(e^{(1)}(\mathbf{x}, y; t) > e^{(2)}(\mathbf{x}, y; t)) \\ &+ \frac{\hat{a}^{(1)}(\mathbf{x}, y; t) + \hat{a}^{(2)}(\mathbf{x}, y; t)}{2} I(D(\mathbf{x}, y; t) > u) I(e^{(1)}(\mathbf{x}, y; t) = e^{(2)}(\mathbf{x}, y; t)), \end{split} \tag{6}$$

where $D(x, y; t) = \max(e(x, y; t) - e^{(1)}(x, y; t), e(x, y; t) - e^{(2)}(x, y; t))$, $I(\cdot)$ is the indicator function, and $u > 0$ is a threshold parameter. By (6), $\widehat{f}(x, y; t)$ is defined to be one of $\widehat{a}(x, y; t)$, $\widehat{a}^{(1)}(x, y; t)$ and $\widehat{a}^{(2)}(x, y; t)$, apart from the tie case discussed below. The quantity $\widehat{a}(x, y; t)$, which is obtained from the entire neighborhood $O(x, y; t)$, is chosen if the observed data indicate no edges in $O(x, y; t)$, supported by the event $D(x, y; t) \le u$. Otherwise, the one of the two one-sided quantities, $\widehat{a}^{(1)}(x, y; t)$ and $\widehat{a}^{(2)}(x, y; t)$, with the smaller WRMS value is chosen. Although, theoretically, the event $e^{(1)}(x, y; t) = e^{(2)}(x, y; t)$ would have probability zero of happening, the last term on the right-hand side of (6) is still included for completeness of the definition of $\widehat{f}(x, y; t)$ and for the consideration that $e^{(1)}(x, y; t)$ and $e^{(2)}(x, y; t)$ could be considered the same in certain algorithms when their values are close.
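The WRMS in (5) and the selection rule in (6) reduce to a few lines once the three fits are available. The sketch below assumes the fitted values and kernel weights have already been computed; the function names and the optional tolerance `tol` (a hypothetical way to treat nearly equal WRMS values as a tie) are illustrative choices, not part of the paper.

```python
import numpy as np


def wrms(z, fitted, k):
    """Weighted residual mean square of a local fit, as in (5)."""
    z, fitted, k = (np.asarray(v, dtype=float) for v in (z, fitted, k))
    return float(np.sum((z - fitted) ** 2 * k) / np.sum(k))


def select_estimate(a, a1, a2, e, e1, e2, u, tol=0.0):
    """Pick the estimate of f(x, y; t) according to rule (6).

    a, a1, a2 : full and one-sided LLK intercepts
    e, e1, e2 : the corresponding WRMS values
    u         : threshold parameter (u > 0)
    tol       : hypothetical tolerance for declaring e1 and e2 equal
    """
    d = max(e - e1, e - e2)
    if d <= u:
        return a                 # no evidence of an edge: use the full fit
    if abs(e1 - e2) <= tol:
        return 0.5 * (a1 + a2)   # tie: average the two one-sided fits
    return a1 if e1 < e2 else a2  # otherwise keep the better one-sided fit
```

With `tol=0.0` the function follows (6) literally; a small positive `tol` corresponds to the remark that the two one-sided WRMS values may be treated as equal when they are numerically close.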

### *2.2. Parameter Selection*

In our proposed method described in Section 2.1, there are four parameters, *hx*, *hy*, *ht* and *u*, that need to be chosen properly in advance. For that purpose, it is natural to consider the cross-validation (CV) procedure, especially in the current research problem where the observed data are quite large in size. However, it has been well demonstrated in the literature that the conventional CV procedure would not work well in cases when the observed data are autocorrelated, because it cannot effectively distinguish the data correlation structure from the mean structure (cf., Altman [24], Opsomer et al. [25]). In the current problem, spatio-temporal data correlation is possible in almost all applications. Thus, the conventional CV procedure is not feasible in such cases. In the univariate regression setup, Brabanter et al. [26] suggested a modified CV procedure for choosing smoothing parameters in cases with correlated data. This procedure is generalized here for choosing the parameters *hx*, *hy*, *ht* and *u* used in the proposed method, as described below. Let the modified CV score for choosing *hx*, *hy*, *ht* and *u* be defined as

$$CV(h\_x, h\_y, h\_t, u) = \frac{1}{n\_x n\_y n\_t} \sum\_{ijk} \left[ \widehat{f}\_{-(ijk)}(x\_i, y\_j; t\_k) - Z\_{ijk} \right]^2, \tag{7}$$

where $\widehat{f}\_{-(ijk)}(x\_i, y\_j; t\_k)$ is the leave-one-out estimate of $f(x\_i, y\_j; t\_k)$ by (2)–(6) after the observation $Z\_{ijk}$ is removed from the estimation process and after the kernel function is replaced by the so-called $\varepsilon$-optimal bimodal kernel function $K\_{\varepsilon}(v)$ defined to be

$$K\_{\varepsilon}(v) = \frac{4}{4 - 3\varepsilon - \varepsilon^{3}} \times \begin{cases} \frac{3}{4} (1 - v^{2}) I(|v| \le 1), & \text{if } |v| \ge \varepsilon, \\\frac{3(1 - \varepsilon^{2})}{4\varepsilon} |v|, & \text{if } |v| < \varepsilon, \end{cases} \tag{8}$$

where $0 < \varepsilon < 1$ is a parameter. Based on a large simulation study, Brabanter et al. [26] suggested choosing $\varepsilon$ to be 0.1, which is adopted in this paper. Then, by the above modified CV procedure, (7) and (8), the parameters *hx*, *hy*, *ht* and *u* can be chosen by minimizing the modified CV score *CV*(*hx*, *hy*, *ht*, *u*).
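The bimodal kernel in (8) is straightforward to evaluate numerically. The sketch below (the function name `bimodal_kernel` is an illustrative choice) implements the univariate formula as written in (8), with the default $\varepsilon = 0.1$.

```python
import numpy as np


def bimodal_kernel(v, eps=0.1):
    """Epsilon-optimal bimodal kernel K_eps(v) from (8)."""
    v = np.abs(np.asarray(v, dtype=float))
    const = 4.0 / (4.0 - 3.0 * eps - eps ** 3)
    # Epanechnikov shape away from the origin, zero outside [-1, 1].
    outer = 0.75 * (1.0 - v ** 2) * (v <= 1.0)
    # Linear ramp that drives the weight to zero at v = 0.
    inner = 0.75 * (1.0 - eps ** 2) / eps * v
    return const * np.where(v >= eps, outer, inner)
```

Two properties can be checked directly from the formula: the kernel vanishes at $v = 0$, so the observation at the evaluation point receives zero weight and nearby observations are down-weighted, and the leading constant makes the kernel integrate to one over $[-1, 1]$.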

### **3. Results**

### *3.1. Statistical Properties*

In this part, we discuss some statistical properties of the proposed edge-preserving image sequence denoising method (2)–(6). First, we have the following proposition.

**Proposition 1.** *Assume that i) the kernel function K*(*v*) *used in* (2) *is a Lipschitz-1 continuous density function, and ii) the noise terms* {*εijk*, *i* = 1, 2, ... , *nx*, *j* = 1, 2, ... , *ny*, *k* = 1, 2, ... , *nt*} *in model* (1) *form a strong mixing stochastic process with the following strong mixing coefficients:*

$$\begin{split} \alpha(d) &= \sup\_{(ijk), (i'j'k')} \sup\_{A,B} \left\{ |P(A \cap B) - P(A)P(B)|, A \in \sigma(\varepsilon\_{ijk}), B \in \sigma(\varepsilon\_{i'j'k'}), \right. \\ &\quad \left. \max\{ |i - i'|, |j - j'|, |k - k'| \} > d \right\}, \end{split}$$

*which have the property that* $\alpha(d) \le c\_1 \sigma^2 \rho^{c\_2 d}$, *where* $c\_1, c\_2 > 0$ *and* $0 < \rho < 1$ *are constants, and iii)* $E(\varepsilon\_{111}^6) < \infty$. *Let* $N = n\_x n\_y n\_t$, $H = h\_x h\_y h\_t$, $n\_{\min} = \min(n\_x, n\_y, n\_t)$, *and* $h\_{\min} = \min(h\_x, h\_y, h\_t)$. *Then, for any* $(x, y; t) \in \Omega\_h = [h\_x, 1 - h\_x] \times [h\_y, 1 - h\_y] \times [h\_t, 1 - h\_t]$, *we have*

$$\left| \frac{1}{NH} \sum\_{ijk} K\left(\frac{x\_i - x}{h\_x}, \frac{y\_j - y}{h\_y}\right) K\left(\frac{t\_k - t}{h\_t}\right) - 1 \right| = O\left(\frac{1}{n\_{\min} h\_{\min}}\right),$$

$$E\left[ \left| \frac{1}{NH} \sum\_{ijk} \varepsilon\_{ijk} K\left(\frac{x\_i - x}{h\_x}, \frac{y\_j - y}{h\_y}\right) K\left(\frac{t\_k - t}{h\_t}\right) \right|^2\right] = O\left(\frac{1}{NH}\right),$$

$$E\left[ \left| \frac{1}{NH} \sum\_{ijk} (\varepsilon\_{ijk}^2 - \sigma^2) K\left(\frac{x\_i - x}{h\_x}, \frac{y\_j - y}{h\_y}\right) K\left(\frac{t\_k - t}{h\_t}\right) \right|^2\right] = O\left(\frac{1}{NH}\right).$$

Based on the results in Proposition 1, we can derive the following properties of the LLK estimates defined in (3).

**Theorem 1.** *Besides the conditions in Proposition 1, we further assume that the true image intensity function f*(*x*, *y*; *t*) *has continuous first-order partial derivatives with respect to x, y and t in the design space* Ω *except at the edge curves. Then, for any* $(x, y; t) \in \Omega\_h \setminus J\_h$, *we have*

$$
\begin{bmatrix}
\widehat{a}(\mathbf{x},\mathbf{y};t) \\
\widehat{b}(\mathbf{x},\mathbf{y};t) \\
\widehat{c}(\mathbf{x},\mathbf{y};t) \\
\widehat{d}(\mathbf{x},\mathbf{y};t)
\end{bmatrix} = \begin{bmatrix}
f(\mathbf{x},\mathbf{y};t) \\
f\_{\mathbf{x}}'(\mathbf{x},\mathbf{y};t) \\
f\_{\mathbf{y}}'(\mathbf{x},\mathbf{y};t) \\
f\_{\mathbf{t}}'(\mathbf{x},\mathbf{y};t)
\end{bmatrix} + \begin{bmatrix}
O(\frac{h\_{\mathbf{x}}^{2}+h\_{\mathbf{y}}^{2}+h\_{t}^{2}}{h\_{\mathbf{x}}}) \\
O(\frac{h\_{\mathbf{x}}^{2}+h\_{\mathbf{y}}^{2}+h\_{t}^{2}}{h\_{\mathbf{y}}}) \\
O(\frac{h\_{\mathbf{x}}^{2}+h\_{\mathbf{y}}^{2}+h\_{t}^{2}}{h\_{\mathbf{y}}}) \\
O(\frac{h\_{\mathbf{x}}^{2}+h\_{\mathbf{y}}^{2}+h\_{t}^{2}}{h\_{t}})
\end{bmatrix} + \begin{bmatrix}
O\_{p}(\frac{1}{\sqrt{NH}}) \\
O\_{p}(\frac{1}{\sqrt{NH}}) \\
O\_{p}(\frac{1}{\sqrt{NH}}) \\
O\_{p}(\frac{1}{\sqrt{NH}})
\end{bmatrix}.
$$

*for any* (*x*, *y*, *t*) ∈ *Jh* \ *Sh, we have*

$$
\begin{bmatrix}
\hat{a}(x,y;t) \\
\hat{b}(x,y;t) \\
\hat{c}(x,y;t) \\
\hat{d}(x,y;t)
\end{bmatrix} = \begin{bmatrix}
f\_{-}(x\_{\tau},y\_{\tau};t\_{\tau}) + d\_{\tau}\xi\_{000}^{(2)} \\
\frac{d\_{\tau}}{\xi\_{200}h\_{x}}\xi\_{100}^{(2)} \\
\frac{d\_{\tau}}{\xi\_{020}h\_{y}}\xi\_{010}^{(2)} \\
\frac{d\_{\tau}}{\xi\_{002}h\_{t}}\xi\_{001}^{(2)}
\end{bmatrix} + \begin{bmatrix}
O(\sqrt{h\_{x}^{2}+h\_{y}^{2}+h\_{t}^{2}}) \\
O(\frac{\sqrt{h\_{x}^{2}+h\_{y}^{2}+h\_{t}^{2}}}{h\_{x}}) \\
O(\frac{\sqrt{h\_{x}^{2}+h\_{y}^{2}+h\_{t}^{2}}}{h\_{y}}) \\
O(\frac{\sqrt{h\_{x}^{2}+h\_{y}^{2}+h\_{t}^{2}}}{h\_{t}})
\end{bmatrix} + \begin{bmatrix}
O\_{p}(\frac{1}{\sqrt{NH}}) \\
O\_{p}(\frac{1}{h\_{x}\sqrt{NH}}) \\
O\_{p}(\frac{1}{h\_{y}\sqrt{NH}}) \\
O\_{p}(\frac{1}{h\_{t}\sqrt{NH}})
\end{bmatrix},\tag{9}
$$

*where* $\xi\_{rsl} = \int\_{\Omega \times [0,1]} u^r v^s w^l K(u, v) K(w) \, du \, dv \, dw$, $\xi\_{rsl}^{(2)} = \int\_{Q^{(2)}} u^r v^s w^l K(u, v) K(w) \, du \, dv \, dw$, *for* $r, s, l = 0, 1, 2$, *J is the closure of the set of all jump points of f*(*x*, *y*; *t*), $J\_h = \{(x, y; t) : (x, y; t) \in \Omega\_h, (x - x^{\ast})^2/h\_x^2 + (y - y^{\ast})^2/h\_y^2 \le 1, |t - t^{\ast}|/h\_t \le 1, \text{ for any } (x^{\ast}, y^{\ast}, t^{\ast}) \in J\}$, *S is the set of singular points in J, including the crossing points of two or more edges, points on an edge surface at which the edge surface does not have a unique tangent surface, and points in J at which the jump sizes in f*(*x*, *y*; *t*) *are zero,* $S\_h = \{(x, y; t) : (x, y; t) \in \Omega\_h, (x - x^{\ast})^2/h\_x^2 + (y - y^{\ast})^2/h\_y^2 \le 1, |t - t^{\ast}|/h\_t \le 1, \text{ for any } (x^{\ast}, y^{\ast}, t^{\ast}) \in S\}$, $(x\_\tau, y\_\tau; t\_\tau) \in J \setminus S$ *is the projection of* $(x, y; t)$ *to J with the Euclidean distance between the two points being* $c\sqrt{h\_x^2 + h\_y^2 + h\_t^2}$, *for a constant* $0 < c < 1$, *and* $f\_{-}(x\_\tau, y\_\tau; t\_\tau)$ *is the smaller one of the two one-sided limits of f*(*x*, *y*; *t*) *at* $(x\_\tau, y\_\tau; t\_\tau)$. *In cases when* $O(x, y; t)$ *contains jumps, without loss of generality, it is assumed that* $O(x, y; t)$ *is divided by the edge surface into two parts* $I\_1$ *and* $I\_2$ *with a positive jump size* $d\_\tau$ *from* $I\_1$ *to* $I\_2$ *at* $(x\_\tau, y\_\tau; t\_\tau)$, *and* $Q^{(1)}$ *and* $Q^{(2)}$ *are the two corresponding parts in the support of* $K(u, v)K(w)$.

The next two theorems establish the consistency of the proposed edge-preserving image denoising procedure (2)–(6). First, we have the following theorem about the WRMS values defined in (5).

**Theorem 2.** *Assume that the conditions in Theorem 1 are satisfied,* $h\_x^2 + h\_y^2 + h\_t^2 = o(1)$, $(h\_x^2 + h\_y^2 + h\_t^2)/h\_{\min} = o(1)$, $1/(NH) = o(1)$ *and* $1/(NHh\_{\min}^2) = o(1)$. *Then, we have the following results: for any* $(x, y; t) \in \Omega\_h \setminus J\_h$,

$$\begin{aligned} e(x, y; t) &= \sigma^2 + o\_p(1), \\ e^{(l)}(x, y; t) &= \sigma^2 + o\_p(1), \quad \text{for } l = 1, 2; \end{aligned} \tag{10}$$

*for any* (*x*, *y*; *t*) ∈ *Jh*\*Sh,*

$$\begin{aligned} e(x, y; t) &= \sigma^2 + d\_\tau C\_\tau^2 + o\_p(1), \\ e^{(l)}(x, y; t) &= \sigma^2 + d\_\tau \left[C\_\tau^{(l)}\right]^2 + o\_p(1), \quad \text{for } l = 1, 2, \end{aligned} \tag{11}$$

*where*

$$\begin{array}{rcl} C\_{\tau} &=& \Big( \int \int \int\_{Q^{(1)}} \left[ \xi\_{000}^{(2)} + \frac{\xi\_{100}^{(2)}}{\xi\_{200}} u + \frac{\xi\_{010}^{(2)}}{\xi\_{020}} v + \frac{\xi\_{001}^{(2)}}{\xi\_{002}} w \right]^{2} K(u, v) K(w) \, du \, dv \, dw \; + \\ && \int \int \int\_{Q^{(2)}} \left[ 1 - \xi\_{000}^{(2)} - \frac{\xi\_{100}^{(2)}}{\xi\_{200}} u - \frac{\xi\_{010}^{(2)}}{\xi\_{020}} v - \frac{\xi\_{001}^{(2)}}{\xi\_{002}} w \right]^{2} K(u, v) K(w) \, du \, dv \, dw \Big)^{1/2} \end{array}$$

*and*

$$\begin{array}{rcl} C\_{\tau}^{(l)} &=& \Big( 2 \int \int \int\_{Q^{(1l)}} \left[ B\_{0l} + \frac{B\_{1l}}{\xi\_{200}} u + \frac{B\_{2l}}{\xi\_{020}} v + \frac{B\_{3l}}{\xi\_{002}} w \right]^2 K(u, v) K(w) \, du \, dv \, dw \; + \\ && 2 \int \int \int\_{Q^{(2l)}} \left[ 1 - B\_{0l} - \frac{B\_{1l}}{\xi\_{200}} u - \frac{B\_{2l}}{\xi\_{020}} v - \frac{B\_{3l}}{\xi\_{002}} w \right]^2 K(u, v) K(w) \, du \, dv \, dw \Big)^{1/2}, \end{array}$$

*with the quantities* $Q^{(1l)}$, $Q^{(2l)}$, $B\_{0l}$, $B\_{1l}$, $B\_{2l}$ *and* $B\_{3l}$ *defined as follows. Let* $\vec{g} = \left( \frac{d\_\tau}{\xi\_{200} h\_x} \xi\_{100}^{(2)}, \frac{d\_\tau}{\xi\_{020} h\_y} \xi\_{010}^{(2)}, \frac{d\_\tau}{\xi\_{002} h\_t} \xi\_{001}^{(2)} \right)$. *Then, from (9),* $\vec{g}$ *is actually the asymptotic direction of the gradient vector* $G(x, y; t)$. *Let* $\widetilde{O}^{(l)}(x, y; t)$, *for* $l = 1, 2$, *be the two halves of the neighborhood* $O(x, y; t)$ *separated by a plane passing through the point* $(x, y; t)$ *in the direction perpendicular to* $\vec{g}$, *and let* $\widetilde{Q}^{(l)}$ *be the two corresponding parts in the support of* $K(u, v)K(w)$. *Then,* $Q^{(1l)} = Q^{(1)} \cap \widetilde{Q}^{(l)}$, $Q^{(2l)} = Q^{(2)} \cap \widetilde{Q}^{(l)}$, $B\_{0l} = \int \int \int\_{Q^{(2l)}} K(u, v) K(w) \, du \, dv \, dw$, $B\_{1l} = \int \int \int\_{Q^{(2l)}} u K(u, v) K(w) \, du \, dv \, dw$, $B\_{2l} = \int \int \int\_{Q^{(2l)}} v K(u, v) K(w) \, du \, dv \, dw$, *and* $B\_{3l} = \int \int \int\_{Q^{(2l)}} w K(u, v) K(w) \, du \, dv \, dw$, *for* $l = 1, 2$.

**Theorem 3.** *Under the conditions in Theorem 2 and the extra assumption that threshold parameter u* = *uN* → 0 *as N* → ∞*, we have, for any* (*x*, *y*; *t*) ∈ Ω*h,*

$$\hat{f}(x, y; t) = f(x, y; t) + o\_p(1).$$

The proofs of these theoretical results are given in Appendix A.

### *3.2. Numerical Studies*

In this part, we study the numerical performance of our proposed method for denoising an image sequence. First, we consider a simulation example in which the true image intensity function in model (1) has the following expression:

$$f(\mathbf{x}, y; t) = \begin{cases} -2(\mathbf{x} - 0.5)^2 - 2(y - 0.5)^2 - 0.1 \sin(2\pi t) + 1, & \text{if } r(\mathbf{x}, y; t) \le 0.25^2, \\\ -2(\mathbf{x} - 0.5)^2 - 2(y - 0.5)^2 - 0.1 \sin(2\pi t), & \text{otherwise,} \end{cases}$$

where $r(x, y; t) = (x - 0.5)^2 + (y - 0.5)^2 + 0.01 \sin(2\pi t)$, $(x, y) \in \Omega = [0, 1] \times [0, 1]$, and $t \in [0, 1]$. At a given value of $t$, $f(x, y; t)$ has a circular edge curve $r(x, y; t) = 0.25^2$ with a constant jump size of 1 in $f(x, y; t)$ at the edges. The radius of the circular edge curve, $\sqrt{0.25^2 - 0.01 \sin(2\pi t)}$, changes periodically over $t \in [0, 1]$. The image intensity function $f(x, y; t)$ at $t = 0.01$ and $0.25$ and its temporal profile $f(0.25, 0.25; t)$ are shown in Figure 3. It can be seen that both the image intensity level at a given pixel and the edge curve change gradually when $t$ changes in $[0, 1]$.
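For readers who wish to reproduce the test surface, the true intensity function above can be evaluated on the pixel grid as in the following sketch. The function names and the time grid $t\_k = k/n\_t$ are assumptions of this example; the noise generation by the R package is not reproduced here.

```python
import numpy as np


def true_intensity(x, y, t):
    """True image intensity f(x, y; t) of the simulation example."""
    smooth = (-2.0 * (x - 0.5) ** 2 - 2.0 * (y - 0.5) ** 2
              - 0.1 * np.sin(2.0 * np.pi * t))
    r = (x - 0.5) ** 2 + (y - 0.5) ** 2 + 0.01 * np.sin(2.0 * np.pi * t)
    # Jump of size 1 inside the circular edge curve r(x, y; t) = 0.25^2.
    return smooth + np.where(r <= 0.25 ** 2, 1.0, 0.0)


def image_sequence(nx, ny, nt):
    """Noise-free sequence on the grid (i/nx, j/ny; k/nt), shape (nx, ny, nt)."""
    xi = np.arange(1, nx + 1) / nx
    yj = np.arange(1, ny + 1) / ny
    tk = np.arange(1, nt + 1) / nt  # assumed spacing of the observation times
    X, Y, T = np.meshgrid(xi, yj, tk, indexing="ij")
    return true_intensity(X, Y, T)
```

For example, `image_sequence(128, 128, 100)` gives an array with one 128 × 128 image per time point, to which simulated noise can then be added.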

**Figure 3.** (**a**) The true image intensity function *f*(*x*, *y*; *t*) at *t* = 0.01 (left) and *t* = 0.25 (right). (**b**) The temporal profile *f*(0.25, 0.25; *t*) when *t* changes in [0, 1].

In model (1), the random errors {*εijk*, *i* = 1, 2, ... , *nx*, *j* = 1, 2, ... , *ny*, *k* = 1, 2, ... , *nt*} are generated by the function spatialnoise() in the R-package neuRosim (cf., Welvaert et al. [27]). In that R function, there are two parameters *ρ* and *σ* to specify in advance, where *ρ* controls the data autocorrelation in all three dimensions and *σ* is the common standard deviation of the random errors. In all our examples, *σ* is fixed at 0.1, 0.2 or 0.3, and *ρ* is fixed at 0.1, 0.3 or 0.5, to study the possible impact of data noise level and data correlation on the performance of the proposed method. Without loss of generality, we set *nx* = *ny* in all examples. In the model estimation procedure (2)–(6), we set *hx* = *hy*, and the kernel function *K*(*v*) is chosen to be the following truncated Gaussian density function:

$$K(v) = \begin{cases} \frac{\exp(-v^2/2) - \exp(-0.5)}{2\pi - 3\pi \exp(-0.5)}, & \text{if } |v| \le 1, \\\ 0, & \text{otherwise.} \end{cases}$$
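As a quick numerical check of the kernel above, note that its normalizing constant $2\pi - 3\pi \exp(-0.5)$ equals the integral of $\exp(-|v|^2/2) - \exp(-0.5)$ over the unit disk. The sketch below therefore reads $|v|$ as the Euclidean norm of a two-dimensional argument $(u, v)$; that reading, and the function names, are assumptions of this example.

```python
import numpy as np


def truncated_gaussian_kernel(u, v):
    """Truncated Gaussian kernel, evaluated at the 2-D point (u, v)."""
    r2 = np.asarray(u, dtype=float) ** 2 + np.asarray(v, dtype=float) ** 2
    const = 2.0 * np.pi - 3.0 * np.pi * np.exp(-0.5)
    inside = (np.exp(-r2 / 2.0) - np.exp(-0.5)) / const
    return np.where(r2 <= 1.0, inside, 0.0)


def integral_over_unit_disk(n=2000):
    """Midpoint rule in polar coordinates: integral of K(r, 0) * 2*pi*r over [0, 1]."""
    r = (np.arange(n) + 0.5) / n
    return float(np.sum(truncated_gaussian_kernel(r, 0.0) * 2.0 * np.pi * r) / n)
```

Under this reading the kernel is nonnegative, vanishes continuously at the boundary of the unit disk, and integrates to one, so it is a proper density kernel for the spatial part of the weights.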

In cases when *σ* = 0.1, 0.2 or 0.3, *nx* = 64 or 128, *nt* = 50 or 100, *ρ* = 0.1, 0.3 or 0.5, the MSE values of the estimator $\widehat{f}(x, y; t)$ defined in (6) are presented in Table 1, along with the corresponding parameters *hx*, *ht* and *u* selected by the modified CV procedure (7) and (8). In each case considered, the MSE value is computed based on 10 replicated simulations. For comparison purposes, the optimal MSE value of the estimator $\widehat{f}(x, y; t)$, when its parameters (*hx*, *ht* and *u*) are chosen such that the MSE value reaches the minimum in each case considered, is also presented in the table, along with the corresponding parameter values. From the table, we can draw the following conclusions. (i) The MSE values are smaller when either *nx* or *nt* is larger, which confirms the consistency results discussed in Section 3.1. (ii) When *ρ* is larger (i.e., the spatio-temporal data correlation is stronger), the MSE values are larger. So, data correlation does have an impact on the performance of the proposed method, which is intuitively reasonable. (iii) By comparing the MSE and the optimal MSE values, we can see that the MSE values are usually larger than their optimal values, but the differences are small in almost all cases considered. This conclusion indicates that the modified CV procedure (7) and (8) for determining the values of the parameters (*hx*, *ht*, *u*) is quite effective. (iv) The parameter values chosen by the modified CV procedure (7) and (8) are quite close to the optimal parameter values in most cases considered.

**Table 1.** In each entry, the MSE of $\widehat{f}(x, y; t)$ in (6) is presented in the first line with its standard error (in parentheses); the corresponding values of (*hx*, *ht*, *u*) chosen by the modified CV procedure (7) and (8) are presented in the second line; the optimal MSE is presented in the third line with its standard error (in parentheses); the optimal values of (*hxy*, *ht*, *u*) are presented in the fourth line. MSE in the table has been multiplied by 10³ and standard error has been multiplied by 10⁵.



Next, we compare our proposed method, denoted as NEW, with some alternative methods described below. The first alternative method is the conventional LLK procedure (2), by which $f(x, y; t)$ is estimated by $\widehat{a}(x, y; t)$ defined in (3). Its bandwidths are chosen by the conventional CV procedure, without considering any possible spatio-temporal data correlation. As explained in Section 2.1, this estimator would blur edges while removing noise. The second alternative method is to use $\widehat{a}(x, y; t)$ for estimating $f(x, y; t)$, but with its bandwidths chosen by the modified CV procedure (7) and (8). The above two alternative methods are denoted as LLK-C and LLK, respectively, where LLK-C denotes the first conventional LLK procedure that does not accommodate data correlation. The third alternative method is the one by Gijbels et al. [15], which is used for edge-preserving image denoising of a single image. To apply this method to the current problem, individual images collected at different time points can be denoised by it separately. This method assumes that the observed image intensities at different pixels are independent of each other, and thus its bandwidths can be chosen by the conventional CV procedure. This method is denoted as GLQ. The fourth alternative method is to use $\widehat{f}(x, y; t)$ in (6) to estimate $f(x, y; t)$, but with the parameters (*hx*, *ht*, *u*) chosen by the conventional CV procedure. This method is denoted as NEW-C.
By considering all these four alternative methods (i.e., LLK-C, LLK, GLQ and NEW-C), we can check whether the current problem to denoise an image sequence can be handled properly by the conventional LLK procedure with or without using the modified CV procedure, by an existing edge-preserving image denoising method designed for denoising a single image, or by the proposed method without considering the possible spatio-temporal data correlation. To evaluate their performance, in addition to the regular MSE criterion, we also consider the following edge-preservation (EP) criterion originally discussed in Hall and Qiu [28]:

$$EP(\hat{f}) = |JS(\hat{f}) - JS(f)| / JS(f),$$

where

$$JS(f) = \frac{1}{(n\_x - 2)(n\_y - 2)(n\_t - 2)} \sum\_{i=2}^{n\_x - 1} \sum\_{j=2}^{n\_y - 1} \sum\_{k=2}^{n\_t - 1} \Big( [f(x\_{i+1}, y\_j; t\_k) - f(x\_{i-1}, y\_j; t\_k)]^2 + [f(x\_i, y\_{j+1}; t\_k) - f(x\_i, y\_{j-1}; t\_k)]^2 + [f(x\_i, y\_j; t\_{k+1}) - f(x\_i, y\_j; t\_{k-1})]^2 \Big)^{1/2}$$

and $JS(\hat{f})$ is defined similarly. According to Hall and Qiu [28], $JS(f)$ is a reasonable measure of the cumulative jump magnitude of $f$ at the edge locations. So, $EP(\hat{f})$ provides a measure of the percentage of the cumulative jump magnitude of $f$ that has been lost during data smoothing by using the estimator $\hat{f}$. By this explanation, the smaller its value, the better. In cases when *σ* = 0.1, 0.2 or 0.3, *nx* = 128, *nt* = 100, and *ρ* = 0.1, 0.3 or 0.5, the MSE and EP values of the related methods are presented in Table 2. From the table, it can be seen that the proposed method NEW has the smallest MSE values, by quite large margins, among all five methods in all cases considered, except the case when *σ* = 0.1 and *ρ* = 0.1, where NEW-C has a slightly smaller MSE value than that of NEW due to the weak data correlation in that case. Likewise, NEW has much smaller EP values in all cases considered, compared to the four competing methods. This example confirms that it is necessary to consider edge-preserving procedures when denoising image sequences and that the possible spatio-temporal data correlation should be taken into account during the denoising process. It also confirms the benefit of sharing useful information among neighboring images when denoising an image sequence.
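The JS and EP criteria defined above are simple to compute from arrays of function values on the pixel grid. The sketch below assumes the true and estimated sequences are stored as arrays of shape $(n\_x, n\_y, n\_t)$; the function names are illustrative.

```python
import numpy as np


def jump_strength(F):
    """JS criterion: mean norm of the central differences over interior points."""
    F = np.asarray(F, dtype=float)
    dx = F[2:, 1:-1, 1:-1] - F[:-2, 1:-1, 1:-1]   # f(x_{i+1}) - f(x_{i-1})
    dy = F[1:-1, 2:, 1:-1] - F[1:-1, :-2, 1:-1]   # f(y_{j+1}) - f(y_{j-1})
    dt = F[1:-1, 1:-1, 2:] - F[1:-1, 1:-1, :-2]   # f(t_{k+1}) - f(t_{k-1})
    return float(np.mean(np.sqrt(dx ** 2 + dy ** 2 + dt ** 2)))


def edge_preservation(F_hat, F):
    """EP criterion: relative difference between JS of the estimate and of f."""
    js_true = jump_strength(F)
    return abs(jump_strength(F_hat) - js_true) / js_true
```

As a simple check, an estimate that halves every intensity difference of $f$ has an EP value of 0.5, while a perfect reconstruction has an EP value of 0.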


**Table 2.** In each entry, the first line is the MSE value with its standard error (in parentheses), and the second line is the EP value. MSE values in the table are in the unit of 10³ and the standard error values are in the unit of 10⁵.

In the cases when *σ* = 0.2 and *ρ* = 0.1, 0.3 or 0.5, Figure 4 shows the observed images at *t* = 0.5 in the first column, and the denoised images by the methods LLK-C, LLK, GLQ, NEW-C and NEW in columns 2–6. From the figure, it can be seen that the denoised images by NEW are the best in removing noise and preserving edges. As a comparison, the denoised images by LLK-C and NEW-C are quite noisy because their bandwidths selected by the conventional CV procedure are relatively small, since the conventional CV procedure cannot distinguish the data correlation from the mean structure, as discussed in Section 2.2. The denoised images by LLK are quite blurry because the method does not take the edges into account when denoising the images. The denoised images by GLQ are quite blurry as well since GLQ denoises individual images at different time points separately and the serial data correlation is ignored in this method.

**Figure 4.** The first column shows the observed images at *t* = 0.5 when *σ* = 0.2 and *ρ* = 0.1 (1st row), 0.3 (2nd row), and 0.5 (3rd row). Second to sixth columns show the denoised images by LLK-C, LLK, GLQ, NEW-C and NEW, respectively.

Next, we apply the proposed method NEW and the four alternative methods LLK-C, LLK, GLQ and NEW-C to a sequence of cell images that records the vasculogenesis process. The sequence has 100 images, and each image has 128 × 128 pixels. A detailed description of the data can be found in Svoboda et al. [29]. The 1st, 50th and 100th images of the sequence are shown in Figure 5.

**Figure 5.** The 1st, 50th and 100th cell images of the image sequence for describing a vasculogenesis process.

In the image denoising literature, to test the noise removal ability of an image denoising method, it is common practice to add random noise at a certain level to the test images and then apply the image denoising method to the noisy test images (cf., Gijbels et al. [15]). To follow this convention, spatio-temporally correlated noise is first generated using the R-package neuRosim and then added to the sequence of 100 cell images described above. When generating the noise, *σ* is chosen to be 0.1, 0.2 or 0.3 and *ρ* is chosen to be 0.1, 0.3 or 0.5, as in the simulation examples presented above. The MSE and EP values of the five image denoising methods based on 10 replicated simulations are presented in Table 3. From the table, it can be seen that NEW still has smaller MSE and EP values than the four competing methods in this example, except in a small number of cases when *σ* and *ρ* are relatively small.
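The noise-adding step can be illustrated with a minimal Python sketch. It does not reproduce the neuRosim algorithm: it assumes a simple separable AR(1) model along the two spatial axes and the time axis, and the function names `ar1_along_axis`, `correlated_noise` and `add_noise` are hypothetical. Under this assumed model, the marginal standard deviation of the noise is *σ* and the lag-one correlation along each axis is *ρ*.

```python
import numpy as np

def ar1_along_axis(white, rho, axis):
    """Turn i.i.d. N(0, 1) values into a unit-variance AR(1) series along one axis."""
    w = np.moveaxis(white, axis, 0)
    out = np.empty_like(w)
    out[0] = w[0]
    scale = np.sqrt(1.0 - rho ** 2)  # keeps the marginal variance equal to 1
    for i in range(1, w.shape[0]):
        out[i] = rho * out[i - 1] + scale * w[i]
    return np.moveaxis(out, 0, axis)

def correlated_noise(shape, sigma, rho, seed=0):
    """Separable AR(1) noise (an assumed model, not neuRosim's) with std sigma."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(shape)
    for axis in range(len(shape)):
        noise = ar1_along_axis(noise, rho, axis)
    return sigma * noise

def add_noise(images, sigma, rho, seed=0):
    """Add correlated noise to an image sequence stored as an array (time, row, col)."""
    return images + correlated_noise(images.shape, sigma, rho, seed)
```

For example, `add_noise(images, sigma=0.2, rho=0.3)` would return a noisy copy of an array `images` of shape (100, 128, 128), which could then be passed to each denoising method.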


**Table 3.** Results for denoising a sequence of 100 cell images. In each entry, the first line is the MSE value with its standard error (in parentheses), and the second line is the EP value. MSE values in the table are in units of 10<sup>−3</sup> and the standard errors are in units of 10<sup>−5</sup>.

The 50th observed test image after the spatio-temporally correlated noise with *ρ* = 0.1, 0.3 or 0.5 has been added is shown in the first column of Figure 6. The denoised images by the five methods LLK-C, LLK, GLQ, NEW-C and NEW are shown in columns 2–6 of the figure. Conclusions similar to those from Figure 4 can be drawn here, and the denoised images by NEW look reasonably good, as the algorithm works well in removing noise and preserving edges.

**Figure 6.** First column shows the 50th observed cell image after the spatio-temporally correlated noise with *ρ* = 0.1 (1st row), 0.3 (2nd row) or 0.5 (3rd row) being added. The second to sixth columns show the denoised images by LLK-C, LLK, GLQ, NEW-C and NEW, respectively.

Finally, we apply the five methods considered in the above examples to a sequence of Landsat images of the Salton Sea region. The Salton Sea is the largest inland lake located at the southern border of California, US, and has a great impact on the local ecosystem (Shuford et al. [30]). The Landsat images used here were taken between 27 May 2000 and 24 December 2001. There are a total of 20 images collected at roughly equally spaced time points, and each image has 100 × 100 pixels. In this example, we consider the case when *σ* = 0.3 and *ρ* = 0.3. The MSE values of the five methods LLK-C, LLK, GLQ, NEW-C, and NEW, calculated in the same way as before, are 9.70, 4.78, 12.03, 9.77, and 4.82, respectively. Their EP values are 85.54%, 20.18%, 109.91%, 86.15%, and 19.14%, respectively. So, the NEW method has the best edge-preserving performance among the five methods in this example, and NEW and LLK have the best overall noise removal performance. The 10th noisy observed test image, taken on 28 April 2001, and its denoised versions by the five methods are shown in Figure 7. It can be seen from the figure that the denoised images by the methods LLK-C, GLQ, and NEW-C are still quite noisy, while the noise in the images generated by NEW and LLK is mostly removed and the edges are preserved reasonably well.

**Figure 7.** The first image is the observed Landsat image of the Salton Sea region taken on 28 April 2001 after the spatio-temporally correlated noise with *σ* = 0.3 and *ρ* = 0.3 has been added. The second to sixth images are its denoised versions by LLK-C, LLK, GLQ, NEW-C, and NEW, respectively.

### **4. Conclusions**

In this paper, we have described our proposed edge-preserving image denoising method for handling image sequences. Some major features of the proposed method are that (i) helpful information in neighboring images is shared during image denoising, (ii) edge structures in the observed images can be preserved when removing noise, and (iii) possible spatio-temporal data correlation can be accommodated in the related local smoothing procedure. Theoretical arguments given in Section 3.1 and numerical studies presented in Section 3.2 show that the proposed method works well in the various cases considered. Some issues about the proposed method remain for future research. For instance, in the proposed local smoothing procedure (2)–(6), each of the bandwidths (*hx*, *hy*, *ht*) is chosen by the modified CV procedure (7) and (8) to be the same in the entire design space Ω × [0, 1]. Intuitively, relatively small bandwidths are preferred at places where the image intensity surface *f*(*x*, *y*; *t*) has large curvature, and relatively large bandwidths are preferred at places where the curvature of *f*(*x*, *y*; *t*) is small. Thus, in applications where the curvature of *f*(*x*, *y*; *t*) could change quite dramatically in the design space, variable bandwidths might be helpful. Such issues will be studied carefully in our future research.

**Author Contributions:** Methodology, P.Q.; Formal analysis, F.Y.; Writing—original draft preparation, F.Y.; Writing—review and editing, P.Q.; Funding acquisition, P.Q.; Supervision, P.Q. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Science Foundation grant DMS-1914639.

**Data Availability Statement:** Publicly available datasets were analyzed in this study. They can be found from the links: https://cbia.fi.muni.cz/datasets/ and https://earthexplorer.usgs.gov.

**Acknowledgments:** We thank the four referees for many constructive comments and suggestions about the paper which greatly improved its quality. This research is supported in part by the National Science Foundation grant DMS-1914639.

**Conflicts of Interest:** The authors declare no conflicts of interest.

### **Appendix A**

*Appendix A.1. Proof of Proposition 1*

Define

$$B_h(x, y, t) = \left\{ (x', y'; t') : \left( \frac{|x' - x|}{h_x} \right)^2 + \left( \frac{|y' - y|}{h_y} \right)^2 \le 1,\ |t' - t| \le h_t,\ (x', y'; t') \in [0, 1] \times [0, 1] \times [0, 1] \right\},$$

$$\Delta_{ijk} = [x_{i-1}, x_i] \times [y_{j-1}, y_j] \times [t_{k-1}, t_k], \qquad x_0 = y_0 = t_0 = 0.$$

Then it can be seen that

$$
\begin{aligned}
&\left| \frac{1}{NH} \sum_{ijk} K\left( \frac{x_i - x}{h_x}, \frac{y_j - y}{h_y} \right) K\left( \frac{t_k - t}{h_t} \right) - 1 \right| \\
&= \left| \frac{1}{H} \sum_{ijk} \int_{\Delta_{ijk}} K\left( \frac{x_i - x}{h_x}, \frac{y_j - y}{h_y} \right) K\left( \frac{t_k - t}{h_t} \right) du\,dv\,dw - 1 \right| \\
&= \left| \frac{1}{H} \sum_{ijk} \int_{\Delta_{ijk}} K\left( \frac{x_i - x}{h_x}, \frac{y_j - y}{h_y} \right) K\left( \frac{t_k - t}{h_t} \right) du\,dv\,dw - \frac{1}{H} \int_{B_h(x,y,t)} K\left( \frac{u - x}{h_x}, \frac{v - y}{h_y} \right) K\left( \frac{w - t}{h_t} \right) du\,dv\,dw \right| \\
&= \left| \frac{1}{H} \sum_{ijk} \int_{\Delta_{ijk}} K\left( \frac{x_i - x}{h_x}, \frac{y_j - y}{h_y} \right) K\left( \frac{t_k - t}{h_t} \right) du\,dv\,dw - \frac{1}{H} \sum_{ijk} \int_{B_h(x,y,t) \cap \Delta_{ijk}} K\left( \frac{u - x}{h_x}, \frac{v - y}{h_y} \right) K\left( \frac{w - t}{h_t} \right) du\,dv\,dw \right| \\
&= \left| \frac{1}{H} \sum_{ijk} \int_{B_h(x,y,t)^c \cap \Delta_{ijk}} K\left( \frac{x_i - x}{h_x}, \frac{y_j - y}{h_y} \right) K\left( \frac{t_k - t}{h_t} \right) du\,dv\,dw \right. \\
&\qquad + \frac{1}{H} \sum_{ijk} \int_{B_h(x,y,t) \cap \Delta_{ijk}} K\left( \frac{x_i - x}{h_x}, \frac{y_j - y}{h_y} \right) K\left( \frac{t_k - t}{h_t} \right) du\,dv\,dw \\
&\qquad \left. - \frac{1}{H} \sum_{ijk} \int_{B_h(x,y,t) \cap \Delta_{ijk}} K\left( \frac{u - x}{h_x}, \frac{v - y}{h_y} \right) K\left( \frac{w - t}{h_t} \right) du\,dv\,dw \right| \\
&\le O\left( \frac{1}{n_{\min} h_{\min}} \right) + \frac{1}{H} \sum_{ijk} \int_{B_h(x,y,t) \cap \Delta_{ijk}} \left| K\left( \frac{x_i - x}{h_x}, \frac{y_j - y}{h_y} \right) K\left( \frac{t_k - t}{h_t} \right) - K\left( \frac{u - x}{h_x}, \frac{v - y}{h_y} \right) K\left( \frac{w - t}{h_t} \right) \right| du\,dv\,dw \\
&\le O\left( \frac{1}{n_{\min} h_{\min}} \right) + \frac{1}{H} \sum_{ijk} \int_{B_h(x,y,t) \cap \Delta_{ijk}} \frac{(1 + \sqrt{2})C}{n_{\min} h_{\min}}\, du\,dv\,dw \\
&= O\left( \frac{1}{n_{\min} h_{\min}} \right) + \frac{1}{H} \frac{(1 + \sqrt{2})C}{n_{\min} h_{\min}} \int_{B_h(x,y,t)} 1\, du\,dv\,dw \\
&= O\left( \frac{1}{n_{\min} h_{\min}} \right),
\end{aligned}
$$

where *C* ≥ 0 is the Lipschitz constant that satisfies the condition |*K*(*u*) − *K*(*u*′)| ≤ *C*|*u* − *u*′|. So, the first result in Proposition 1 is valid.
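The first result can be checked numerically with a short Python sketch. Everything specific in it is an assumption made for illustration: the design points are taken as *i*/*n* on an *n* × *n* × *n* grid, the three bandwidths are equal to a common *h* so that *N* = *n*³ and *H* = *h*³, and the kernels are Epanechnikov-type densities supported on the unit disk and on [−1, 1]. The helper names `k2`, `k1` and `kernel_riemann_sum` are hypothetical.

```python
import numpy as np

def k2(u, v):
    """Bivariate Epanechnikov-type density on the unit disk (integrates to 1)."""
    r2 = u ** 2 + v ** 2
    return np.where(r2 <= 1.0, (2.0 / np.pi) * (1.0 - r2), 0.0)

def k1(w):
    """Univariate Epanechnikov density on [-1, 1] (integrates to 1)."""
    return np.where(np.abs(w) <= 1.0, 0.75 * (1.0 - w ** 2), 0.0)

def kernel_riemann_sum(n, h, point=(0.5, 0.5, 0.5)):
    """(1 / (N H)) * sum over the grid of K((x_i-x)/h, (y_j-y)/h) * K((t_k-t)/h)."""
    x, y, t = point
    grid = np.arange(1, n + 1) / n              # assumed design points i / n
    u = (grid - x) / h
    v = (grid - y) / h
    w = (grid - t) / h
    spatial = k2(u[:, None], v[None, :]).sum()  # sum over (i, j)
    temporal = k1(w).sum()                      # sum over k
    return spatial * temporal / (n ** 3 * h ** 3)
```

Because the weights factor into a spatial part and a temporal part, the triple sum is computed as a product of two smaller sums. For an interior point, the returned value should be close to 1 once *nh* is reasonably large.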

To prove the second result, it can be checked that

$$
\begin{aligned}
&E \left| \frac{1}{NH} \sum_{ijk} \varepsilon_{ijk} K\left( \frac{x_i - x}{h_x}, \frac{y_j - y}{h_y} \right) K\left( \frac{t_k - t}{h_t} \right) \right|^2 \\
&= Var\left( \frac{1}{NH} \sum_{ijk} \varepsilon_{ijk} K\left( \frac{x_i - x}{h_x}, \frac{y_j - y}{h_y} \right) K\left( \frac{t_k - t}{h_t} \right) \right) \\
&= \frac{1}{N^2 H^2} \sum_{ijk} \sum_{i'j'k'} K\left( \frac{x_i - x}{h_x}, \frac{y_j - y}{h_y} \right) K\left( \frac{t_k - t}{h_t} \right) K\left( \frac{x_{i'} - x}{h_x}, \frac{y_{j'} - y}{h_y} \right) K\left( \frac{t_{k'} - t}{h_t} \right) Cov(\varepsilon_{ijk}, \varepsilon_{i'j'k'}) \\
&\le \frac{1}{N^2 H^2} \sum_{ijk} \sum_{i'j'k'} K\left( \frac{x_i - x}{h_x}, \frac{y_j - y}{h_y} \right) K\left( \frac{t_k - t}{h_t} \right) K\left( \frac{x_{i'} - x}{h_x}, \frac{y_{j'} - y}{h_y} \right) K\left( \frac{t_{k'} - t}{h_t} \right) c_1 \sigma^2 \rho^{c_2 \max\{|i - i'|, |j - j'|, |k - k'|\}} \\
&\le \frac{1}{N^2 H^2} \sum_{ijk} K\left( \frac{x_i - x}{h_x}, \frac{y_j - y}{h_y} \right) K\left( \frac{t_k - t}{h_t} \right) c_1 \sigma^2\, 24 \int_0^\infty \tau^2 \rho^{\tau}\, d\tau \\
&= O\left( \frac{1}{NH} \right).
\end{aligned}
$$

Similarly, it can be checked that

$$
\begin{aligned}
&E \left| \frac{1}{NH} \sum_{ijk} (\varepsilon_{ijk}^2 - \sigma^2) K\left( \frac{x_i - x}{h_x}, \frac{y_j - y}{h_y} \right) K\left( \frac{t_k - t}{h_t} \right) \right|^2 \\
&= Var\left( \frac{1}{NH} \sum_{ijk} \varepsilon_{ijk}^2 K\left( \frac{x_i - x}{h_x}, \frac{y_j - y}{h_y} \right) K\left( \frac{t_k - t}{h_t} \right) \right) \\
&= \frac{1}{N^2 H^2} \sum_{ijk} \sum_{i'j'k'} K\left( \frac{x_i - x}{h_x}, \frac{y_j - y}{h_y} \right) K\left( \frac{t_k - t}{h_t} \right) K\left( \frac{x_{i'} - x}{h_x}, \frac{y_{j'} - y}{h_y} \right) K\left( \frac{t_{k'} - t}{h_t} \right) Cov(\varepsilon_{ijk}^2, \varepsilon_{i'j'k'}^2) \\
&\le \frac{1}{N^2 H^2} \sum_{ijk} \sum_{i'j'k'} K\left( \frac{x_i - x}{h_x}, \frac{y_j - y}{h_y} \right) K\left( \frac{t_k - t}{h_t} \right) K\left( \frac{x_{i'} - x}{h_x}, \frac{y_{j'} - y}{h_y} \right) K\left( \frac{t_{k'} - t}{h_t} \right) 12 \left( c_1 \sigma^2 \rho^{c_2 \max\{|i - i'|, |j - j'|, |k - k'|\}} \right)^{1/4} E(\varepsilon_{111}^4) \\
&\le \frac{1}{N^2 H^2} \sum_{ijk} K\left( \frac{x_i - x}{h_x}, \frac{y_j - y}{h_y} \right) K\left( \frac{t_k - t}{h_t} \right) 12 \left( c_1 \sigma^2\, 24 \int_0^\infty \tau^2 \rho^{\tau}\, d\tau \right)^{1/3} \left( E(\varepsilon_{111}^6) \right)^{2/3} \\
&= O\left( \frac{1}{NH} \right).
\end{aligned}
$$

The first inequality in the above expression is based on the result in Davydov [31]. So, the third result is valid.
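For reference, the covariance inequality of this type is often quoted in the following form. The notation is an assumption made here for illustration and is not spelled out in the proof: *α* denotes the strong-mixing coefficient between the two random variables, ‖·‖<sub>*p*</sub> denotes the *L<sub>p</sub>* norm, and the constant 12 is the one used in the display above.

$$
|Cov(X, Y)| \le 12\, \alpha^{1/r}\, \|X\|_p\, \|Y\|_q, \qquad \frac{1}{p} + \frac{1}{q} + \frac{1}{r} = 1.
$$

As a worked instance, taking *X* and *Y* to be two squared noise terms with a common marginal distribution and choosing *p* = *q* = *r* = 3 gives ‖*X*‖<sub>3</sub> = ‖*Y*‖<sub>3</sub> = (*E*(*ε*<sup>6</sup>))<sup>1/3</sup>, so that

$$
|Cov(X, Y)| \le 12\, \alpha^{1/3} \left( E(\varepsilon^6) \right)^{2/3},
$$

which has the same structure as the exponent 1/3 and the sixth-moment factor in the last bound above.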

### *Appendix A.2. Proof of Theorem 1*

We first consider the case when (*x*, *y*; *t*) ∈ Ω<sub>*h*</sub> \ *J*<sub>*h*</sub>. By Taylor expansion, we have

$$\begin{aligned} Z_{ijk} &= f(x_i, y_j; t_k) + \varepsilon_{ijk} \\ &= f(x, y; t) + (x_i - x) f_x'(x, y; t) + (y_j - y) f_y'(x, y; t) + (t_k - t) f_t'(x, y; t) \\ &\quad + O(h_x^2 + h_y^2 + h_t^2) + \varepsilon_{ijk}. \end{aligned}$$

So, it can be checked that

$$
\begin{bmatrix}
\sum_{ijk} Z_{ijk} K_{ijk} \\
\sum_{ijk} (x_i - x) Z_{ijk} K_{ijk} \\
\sum_{ijk} (y_j - y) Z_{ijk} K_{ijk} \\
\sum_{ijk} (t_k - t) Z_{ijk} K_{ijk}
\end{bmatrix} = M \begin{bmatrix}
f(x, y; t) \\
f_x'(x, y; t) \\
f_y'(x, y; t) \\
f_t'(x, y; t)
\end{bmatrix} + \begin{bmatrix}
\sum_{ijk} O(h_x^2 + h_y^2 + h_t^2) K_{ijk} \\
\sum_{ijk} (x_i - x) O(h_x^2 + h_y^2 + h_t^2) K_{ijk} \\
\sum_{ijk} (y_j - y) O(h_x^2 + h_y^2 + h_t^2) K_{ijk} \\
\sum_{ijk} (t_k - t) O(h_x^2 + h_y^2 + h_t^2) K_{ijk}
\end{bmatrix} + \begin{bmatrix}
\sum_{ijk} \varepsilon_{ijk} K_{ijk} \\
\sum_{ijk} (x_i - x) \varepsilon_{ijk} K_{ijk} \\
\sum_{ijk} (y_j - y) \varepsilon_{ijk} K_{ijk} \\
\sum_{ijk} (t_k - t) \varepsilon_{ijk} K_{ijk}
\end{bmatrix},
$$

where

$$M = \begin{bmatrix} m\_{000} & m\_{100} & m\_{010} & m\_{001} \\ m\_{100} & m\_{200} & m\_{110} & m\_{101} \\ m\_{010} & m\_{110} & m\_{020} & m\_{011} \\ m\_{001} & m\_{101} & m\_{011} & m\_{002} \end{bmatrix}.$$

From Expression (3), we have

$$
\begin{split}
\begin{bmatrix}
\widehat{a}(x, y; t) \\
\widehat{b}(x, y; t) \\
\widehat{c}(x, y; t) \\
\widehat{d}(x, y; t)
\end{bmatrix} &= \begin{bmatrix}
f(x, y; t) \\
f_x'(x, y; t) \\
f_y'(x, y; t) \\
f_t'(x, y; t)
\end{bmatrix} + M^{-1} \begin{bmatrix}
\sum_{ijk} O(h_x^2 + h_y^2 + h_t^2) K_{ijk} \\
\sum_{ijk} (x_i - x) O(h_x^2 + h_y^2 + h_t^2) K_{ijk} \\
\sum_{ijk} (y_j - y) O(h_x^2 + h_y^2 + h_t^2) K_{ijk} \\
\sum_{ijk} (t_k - t) O(h_x^2 + h_y^2 + h_t^2) K_{ijk}
\end{bmatrix} \\
&\quad + M^{-1} \begin{bmatrix}
\sum_{ijk} \varepsilon_{ijk} K_{ijk} \\
\sum_{ijk} (x_i - x) \varepsilon_{ijk} K_{ijk} \\
\sum_{ijk} (y_j - y) \varepsilon_{ijk} K_{ijk} \\
\sum_{ijk} (t_k - t) \varepsilon_{ijk} K_{ijk}
\end{bmatrix}.
\end{split}
$$

By some simple algebraic manipulations, we have

$$M^{-1} = \begin{bmatrix} O(\frac{1}{NH}) & O(\frac{1}{NH \cdot h_x}) & O(\frac{1}{NH \cdot h_y}) & O(\frac{1}{NH \cdot h_t}) \\ O(\frac{1}{NH \cdot h_x}) & O(\frac{1}{NH \cdot h_x^2}) & O(\frac{1}{NH \cdot h_x h_y}) & O(\frac{1}{NH \cdot h_x h_t}) \\ O(\frac{1}{NH \cdot h_y}) & O(\frac{1}{NH \cdot h_x h_y}) & O(\frac{1}{NH \cdot h_y^2}) & O(\frac{1}{NH \cdot h_y h_t}) \\ O(\frac{1}{NH \cdot h_t}) & O(\frac{1}{NH \cdot h_x h_t}) & O(\frac{1}{NH \cdot h_y h_t}) & O(\frac{1}{NH \cdot h_t^2}) \end{bmatrix}.$$

Then,

$$
\begin{bmatrix}
\widehat{a}(x, y; t) \\
\widehat{b}(x, y; t) \\
\widehat{c}(x, y; t) \\
\widehat{d}(x, y; t)
\end{bmatrix} = \begin{bmatrix}
f(x, y; t) \\
f_x'(x, y; t) \\
f_y'(x, y; t) \\
f_t'(x, y; t)
\end{bmatrix} + \begin{bmatrix}
O(h_x^2 + h_y^2 + h_t^2) \\
O(\frac{h_x^2 + h_y^2 + h_t^2}{h_x}) \\
O(\frac{h_x^2 + h_y^2 + h_t^2}{h_y}) \\
O(\frac{h_x^2 + h_y^2 + h_t^2}{h_t})
\end{bmatrix} + \begin{bmatrix}
O_p(\frac{1}{\sqrt{NH}}) \\
O_p(\frac{1}{h_x \sqrt{NH}}) \\
O_p(\frac{1}{h_y \sqrt{NH}}) \\
O_p(\frac{1}{h_t \sqrt{NH}})
\end{bmatrix}.
$$
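The estimator analyzed in this case is a kernel-weighted least-squares fit of a local plane, and its normal equations are exactly the system with the moment matrix *M* above. The following Python sketch solves that system at a single point. It is only an illustration under stated assumptions: design points *i*/*n* in each direction, Epanechnikov-type kernels, and the hypothetical helper names `k2`, `k1` and `local_linear_fit`. It does not include the edge-preserving or correlation-handling parts of the proposed method.

```python
import numpy as np

def k2(u, v):
    """Bivariate Epanechnikov-type density on the unit disk."""
    r2 = u ** 2 + v ** 2
    return np.where(r2 <= 1.0, (2.0 / np.pi) * (1.0 - r2), 0.0)

def k1(w):
    """Univariate Epanechnikov density on [-1, 1]."""
    return np.where(np.abs(w) <= 1.0, 0.75 * (1.0 - w ** 2), 0.0)

def local_linear_fit(Z, point, h):
    """Return (a, b, c, d): fitted value and gradient at `point`."""
    nx, ny, nt = Z.shape
    hx, hy, ht = h
    x, y, t = point
    dx = (np.arange(1, nx + 1) / nx - x)[:, None, None]  # x_i - x
    dy = (np.arange(1, ny + 1) / ny - y)[None, :, None]  # y_j - y
    dt = (np.arange(1, nt + 1) / nt - t)[None, None, :]  # t_k - t
    K = k2(dx / hx, dy / hy) * k1(dt / ht)               # weights K_ijk
    ones = np.ones_like(K)
    basis = [ones, dx * ones, dy * ones, dt * ones]
    # Moment matrix M and right-hand side of the normal equations.
    M = np.array([[np.sum(p * q * K) for q in basis] for p in basis])
    rhs = np.array([np.sum(p * Z * K) for p in basis])
    return np.linalg.solve(M, rhs)
```

When the observations come from an exactly linear surface without noise, the fit reproduces the surface value and its three partial derivatives, which mirrors the vanishing of the remainder terms in the expansion above.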

Now, we consider the case when (*x*, *y*; *t*) ∈ *Jh* \ *Sh*. If (*xi*, *yj*; *tk*) ∈ *I*1, then we have

$$\begin{aligned} Z_{ijk} &= f(x_i, y_j; t_k) + \varepsilon_{ijk} \\ &= f_{-}(x_\tau, y_\tau; t_\tau) + O\left(\sqrt{h_x^2 + h_y^2 + h_t^2}\right) + \varepsilon_{ijk}, \end{aligned}$$

and if (*xi*, *yj*; *tk*) ∈ *I*2, we have

$$\begin{aligned} Z_{ijk} &= f(x_i, y_j; t_k) + \varepsilon_{ijk} \\ &= f_{-}(x_\tau, y_\tau; t_\tau) + d_\tau + O\left(\sqrt{h_x^2 + h_y^2 + h_t^2}\right) + \varepsilon_{ijk}. \end{aligned}$$

By some similar arguments to those in the case considered above, we have

$$
\begin{aligned}
\begin{bmatrix}
\widehat{a}(x, y; t) \\
\widehat{b}(x, y; t) \\
\widehat{c}(x, y; t) \\
\widehat{d}(x, y; t)
\end{bmatrix} &= \begin{bmatrix}
f_{-}(x_\tau, y_\tau; t_\tau) + d_\tau \dfrac{\sum_{(x_i, y_j; t_k) \in I^2} K_{ijk}}{\sum_{ijk} K_{ijk}} \\
\dfrac{d_\tau}{h_x} \dfrac{\sum_{(x_i, y_j; t_k) \in I^2} [(x_i - x)/h_x] K_{ijk}}{\sum_{ijk} [(x_i - x)/h_x]^2 K_{ijk}} \\
\dfrac{d_\tau}{h_y} \dfrac{\sum_{(x_i, y_j; t_k) \in I^2} [(y_j - y)/h_y] K_{ijk}}{\sum_{ijk} [(y_j - y)/h_y]^2 K_{ijk}} \\
\dfrac{d_\tau}{h_t} \dfrac{\sum_{(x_i, y_j; t_k) \in I^2} [(t_k - t)/h_t] K_{ijk}}{\sum_{ijk} [(t_k - t)/h_t]^2 K_{ijk}}
\end{bmatrix} + \begin{bmatrix}
O(h_x^2 + h_y^2 + h_t^2) \\
O(\frac{h_x^2 + h_y^2 + h_t^2}{h_x}) \\
O(\frac{h_x^2 + h_y^2 + h_t^2}{h_y}) \\
O(\frac{h_x^2 + h_y^2 + h_t^2}{h_t})
\end{bmatrix} + \begin{bmatrix}
O_p(\frac{1}{\sqrt{NH}}) \\
O_p(\frac{1}{h_x \sqrt{NH}}) \\
O_p(\frac{1}{h_y \sqrt{NH}}) \\
O_p(\frac{1}{h_t \sqrt{NH}})
\end{bmatrix} \\
&= \begin{bmatrix}
f_{-}(x_\tau, y_\tau; t_\tau) + d_\tau \xi_{000}^{(2)} \\
\dfrac{d_\tau}{\xi_{200} h_x} \xi_{100}^{(2)} \\
\dfrac{d_\tau}{\xi_{020} h_y} \xi_{010}^{(2)} \\
\dfrac{d_\tau}{\xi_{002} h_t} \xi_{001}^{(2)}
\end{bmatrix} + \begin{bmatrix}
O(h_x^2 + h_y^2 + h_t^2) \\
O(\frac{h_x^2 + h_y^2 + h_t^2}{h_x}) \\
O(\frac{h_x^2 + h_y^2 + h_t^2}{h_y}) \\
O(\frac{h_x^2 + h_y^2 + h_t^2}{h_t})
\end{bmatrix} + \begin{bmatrix}
O_p(\frac{1}{\sqrt{NH}}) \\
O_p(\frac{1}{h_x \sqrt{NH}}) \\
O_p(\frac{1}{h_y \sqrt{NH}}) \\
O_p(\frac{1}{h_t \sqrt{NH}})
\end{bmatrix}.
\end{aligned}
$$

*Appendix A.3. Proof of Theorem 2*

We prove the second equations in (10) and (11) here. The first equations can be proved similarly. For simplicity, from now on we write *â*<sup>(*l*)</sup>(*x*, *y*; *t*), *b̂*<sup>(*l*)</sup>(*x*, *y*; *t*), *ĉ*<sup>(*l*)</sup>(*x*, *y*; *t*), *d̂*<sup>(*l*)</sup>(*x*, *y*; *t*), *O*<sup>(*l*)</sup>(*x*, *y*; *t*) and *Õ*<sup>(*l*)</sup>(*x*, *y*; *t*) as *â*<sup>(*l*)</sup>, *b̂*<sup>(*l*)</sup>, *ĉ*<sup>(*l*)</sup>, *d̂*<sup>(*l*)</sup>, *O*<sup>(*l*)</sup> and *Õ*<sup>(*l*)</sup>, respectively. First, by Proposition 1, it is easy to show that

$$\frac{\sum_{ijk} \varepsilon_{ijk} K\left(\frac{x_i - x}{h_x}, \frac{y_j - y}{h_y}\right) K\left(\frac{t_k - t}{h_t}\right)}{\sum_{ijk} K\left(\frac{x_i - x}{h_x}, \frac{y_j - y}{h_y}\right) K\left(\frac{t_k - t}{h_t}\right)} = O_p\left(\frac{1}{\sqrt{NH}}\right), \tag{A1}$$

$$\frac{\sum_{ijk} (\varepsilon_{ijk}^2 - \sigma^2) K\left(\frac{x_i - x}{h_x}, \frac{y_j - y}{h_y}\right) K\left(\frac{t_k - t}{h_t}\right)}{\sum_{ijk} K\left(\frac{x_i - x}{h_x}, \frac{y_j - y}{h_y}\right) K\left(\frac{t_k - t}{h_t}\right)} = o_p(1). \tag{A2}$$

Let us first consider the case when (*x*, *y*; *t*) ∈ Ω<sub>*h*</sub> \ *J*<sub>*h*</sub>. In such a case, it can be checked that

$$
\begin{aligned}
e^{(l)}(x, y; t) &= \left\{ \sum_{(x_i, y_j; t_k) \in O^{(l)}} \left[ \varepsilon_{ijk} + f(x_i, y_j; t_k) - \widehat{a}^{(l)} - \widehat{b}^{(l)}(x_i - x) - \widehat{c}^{(l)}(y_j - y) - \widehat{d}^{(l)}(t_k - t) \right]^2 K_{ijk} \right\} \Big/ \sum_{(x_i, y_j; t_k) \in O^{(l)}} K_{ijk} \\
&= \left\{ \sum_{(x_i, y_j; t_k) \in O^{(l)}} \varepsilon_{ijk}^2 K_{ijk} \right\} \Big/ \sum_{(x_i, y_j; t_k) \in O^{(l)}} K_{ijk} \\
&\quad + \left\{ 2 \sum_{(x_i, y_j; t_k) \in O^{(l)}} \varepsilon_{ijk} \left[ f(x_i, y_j; t_k) - \widehat{a}^{(l)} - \widehat{b}^{(l)}(x_i - x) - \widehat{c}^{(l)}(y_j - y) - \widehat{d}^{(l)}(t_k - t) \right] K_{ijk} \right\} \Big/ \sum_{(x_i, y_j; t_k) \in O^{(l)}} K_{ijk} \\
&\quad + \left\{ \sum_{(x_i, y_j; t_k) \in O^{(l)}} \left[ f(x_i, y_j; t_k) - \widehat{a}^{(l)} - \widehat{b}^{(l)}(x_i - x) - \widehat{c}^{(l)}(y_j - y) - \widehat{d}^{(l)}(t_k - t) \right]^2 K_{ijk} \right\} \Big/ \sum_{(x_i, y_j; t_k) \in O^{(l)}} K_{ijk} \\
&=: A_1^{(l)}(x, y; t) + A_2^{(l)}(x, y; t) + A_3^{(l)}(x, y; t).
\end{aligned}
$$

Similar to (A2), we have

$$A_1^{(l)}(x, y; t) = \sigma^2 + o_p(1). \tag{A3}$$

By the Taylor expansion of *f*(*xi*, *yj*; *tk*) at the point (*x*, *y*; *t*), the results in Theorem 1, and arguments similar to those for (A1), we have

$$
\begin{aligned}
A_2^{(l)}(x, y; t) &\le 2 \left| f(x, y; t) - \widehat{a}^{(l)} \right| \left| \frac{\sum_{(x_i, y_j; t_k) \in O^{(l)}} \varepsilon_{ijk} K_{ijk}}{\sum_{(x_i, y_j; t_k) \in O^{(l)}} K_{ijk}} \right| \\
&\quad + 2 h_x \left| f_x'(x, y; t) - \widehat{b}^{(l)} \right| \left| \frac{\sum_{(x_i, y_j; t_k) \in O^{(l)}} \varepsilon_{ijk} \frac{x_i - x}{h_x} K_{ijk}}{\sum_{(x_i, y_j; t_k) \in O^{(l)}} K_{ijk}} \right| \\
&\quad + 2 h_y \left| f_y'(x, y; t) - \widehat{c}^{(l)} \right| \left| \frac{\sum_{(x_i, y_j; t_k) \in O^{(l)}} \varepsilon_{ijk} \frac{y_j - y}{h_y} K_{ijk}}{\sum_{(x_i, y_j; t_k) \in O^{(l)}} K_{ijk}} \right| \\
&\quad + 2 h_t \left| f_t'(x, y; t) - \widehat{d}^{(l)} \right| \left| \frac{\sum_{(x_i, y_j; t_k) \in O^{(l)}} \varepsilon_{ijk} \frac{t_k - t}{h_t} K_{ijk}}{\sum_{(x_i, y_j; t_k) \in O^{(l)}} K_{ijk}} \right| \\
&= o_p(1).
\end{aligned} \tag{A4}
$$

Similarly, we have

$$A_3^{(l)}(x, y; t) = o_p(1). \tag{A5}$$

By combining (A3)–(A5), we have

$$
e^{(l)}(x, y; t) = \sigma^2 + o_p(1).
$$

Now, let us consider the case when (*x*, *y*; *t*) ∈ *Jh* \ *Sh*. Similar to the above case, let us write

$$
e^{(l)}(x, y; t) = A_1^{(l)}(x, y; t) + A_2^{(l)}(x, y; t) + A_3^{(l)}(x, y; t).
$$

Here, we still have

$$A_1^{(l)}(x, y; t) = \sigma^2 + o_p(1). \tag{A6}$$

For *A*<sub>2</sub><sup>(*l*)</sup>(*x*, *y*; *t*), we have

$$
\begin{aligned}
A_2^{(l)}(x, y; t) &= \left\{ 2 \sum_{(x_i, y_j; t_k) \in I^1 \cap O^{(l)}} \varepsilon_{ijk} \left[ f(x_i, y_j; t_k) - \widehat{a}^{(l)} - \widehat{b}^{(l)}(x_i - x) - \widehat{c}^{(l)}(y_j - y) - \widehat{d}^{(l)}(t_k - t) \right] K_{ijk} \right\} \Big/ \sum_{(x_i, y_j; t_k) \in O^{(l)}} K_{ijk} \\
&\quad + \left\{ 2 \sum_{(x_i, y_j; t_k) \in I^2 \cap O^{(l)}} \varepsilon_{ijk} \left[ f(x_i, y_j; t_k) - \widehat{a}^{(l)} - \widehat{b}^{(l)}(x_i - x) - \widehat{c}^{(l)}(y_j - y) - \widehat{d}^{(l)}(t_k - t) \right] K_{ijk} \right\} \Big/ \sum_{(x_i, y_j; t_k) \in O^{(l)}} K_{ijk} \\
&=: A_{21}^{(l)}(x, y; t) + A_{22}^{(l)}(x, y; t).
\end{aligned}
$$

By the results in Theorem 1, we have

$$
\begin{aligned}
A_{21}^{(l)}(x, y; t) &= \frac{2 \sum_{(x_i, y_j; t_k) \in I^1 \cap O^{(l)}} \varepsilon_{ijk} \left[ f(x_i, y_j; t_k) - f_{-}(x_\tau, y_\tau; t_\tau) \right] K_{ijk}}{\sum_{(x_i, y_j; t_k) \in O^{(l)}} K_{ijk}} \\
&\quad - (D_1 + o_p(1)) \frac{\sum_{(x_i, y_j; t_k) \in I^1 \cap O^{(l)}} \varepsilon_{ijk} K_{ijk}}{\sum_{(x_i, y_j; t_k) \in O^{(l)}} K_{ijk}} \\
&\quad - (D_2 + o_p(1)) \frac{\sum_{(x_i, y_j; t_k) \in I^1 \cap O^{(l)}} \varepsilon_{ijk} \frac{x_i - x}{h_x} K_{ijk}}{\sum_{(x_i, y_j; t_k) \in O^{(l)}} K_{ijk}} \\
&\quad - (D_3 + o_p(1)) \frac{\sum_{(x_i, y_j; t_k) \in I^1 \cap O^{(l)}} \varepsilon_{ijk} \frac{y_j - y}{h_y} K_{ijk}}{\sum_{(x_i, y_j; t_k) \in O^{(l)}} K_{ijk}} \\
&\quad - (D_4 + o_p(1)) \frac{\sum_{(x_i, y_j; t_k) \in I^1 \cap O^{(l)}} \varepsilon_{ijk} \frac{t_k - t}{h_t} K_{ijk}}{\sum_{(x_i, y_j; t_k) \in O^{(l)}} K_{ijk}},
\end{aligned}
$$

where *D*<sub>1</sub>, *D*<sub>2</sub>, *D*<sub>3</sub> and *D*<sub>4</sub> are constants. By arguments similar to those for (A1), we can conclude that

$$A\_{21}^{(l)} = o\_p(1).$$

Similarly, we have

$$A\_{22}^{(l)} = o\_p(1).$$

So,

$$A_2^{(l)}(x, y; t) = o_p(1). \tag{A7}$$

By similar arguments to those about Proposition 1, we have

$$\left| \frac{1}{NH} \sum_{(x_i, y_j; t_k) \in O^{(l)}} K_{ijk} - \frac{1}{2} \right| = o(1).$$
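The limit 1/2 reflects the fact that a one-sided neighborhood carries half of the total kernel mass. A minimal Python sketch of this, under assumptions that are not taken from this appendix: the one-sided neighborhoods are modeled as the two halves of the full neighborhood cut by a plane through (*x*, *y*; *t*) with a given normal `direction`, the design points are *i*/*n*, the bandwidths share a common value *h*, and the kernels are Epanechnikov-type densities. All function names are hypothetical.

```python
import numpy as np

def k2(u, v):
    """Bivariate Epanechnikov-type density on the unit disk."""
    r2 = u ** 2 + v ** 2
    return np.where(r2 <= 1.0, (2.0 / np.pi) * (1.0 - r2), 0.0)

def k1(w):
    """Univariate Epanechnikov density on [-1, 1]."""
    return np.where(np.abs(w) <= 1.0, 0.75 * (1.0 - w ** 2), 0.0)

def one_sided_kernel_mass(n, h, direction, point=(0.5, 0.5, 0.5)):
    """Normalized kernel sums over the two half-neighborhoods split by a plane."""
    x, y, t = point
    g = np.arange(1, n + 1) / n                 # assumed design points i / n
    dx = (g - x)[:, None, None]
    dy = (g - y)[None, :, None]
    dt = (g - t)[None, None, :]
    K = k2(dx / h, dy / h) * k1(dt / h)         # weights K_ijk
    # Signed position of each design point relative to the splitting plane.
    side = direction[0] * dx + direction[1] * dy + direction[2] * dt
    NH = n ** 3 * h ** 3
    upper = K[side >= 0].sum() / NH
    lower = K[side < 0].sum() / NH
    return upper, lower
```

For instance, `one_sided_kernel_mass(60, 0.25, (1.0, 2 ** 0.5, 3 ** 0.5))` should return two values that are each close to 0.5.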

For a function *φ*(*x*, *y*; *t*) satisfying the condition that sup<sub>*x*<sup>2</sup>+*y*<sup>2</sup>+*t*<sup>2</sup>≤1</sub> |*φ*(*x*, *y*; *t*)| ≤ *b*<sub>*φ*</sub> < ∞, we have

$$\begin{split} & \left| \frac{1}{NH} \sum_{(x_i, y_j; t_k) \in I^1 \cap O^{(l)}} \varphi\left(\frac{x_i - x}{h_x}, \frac{y_j - y}{h_y}; \frac{t_k - t}{h_t}\right) K_{ijk} - \frac{1}{NH} \sum_{(x_i, y_j; t_k) \in I^1 \cap \widetilde{O}^{(l)}} \varphi\left(\frac{x_i - x}{h_x}, \frac{y_j - y}{h_y}; \frac{t_k - t}{h_t}\right) K_{ijk} \right| \\ & \le b_\varphi \|K\| \frac{1}{NH} \sum_{(x_i, y_j; t_k) \in O^{(l)} \Delta \widetilde{O}^{(l)}} 1 \\ & = o(1), \end{split}$$

where *O*<sup>(*l*)</sup>Δ*Õ*<sup>(*l*)</sup> = (*O*<sup>(*l*)</sup> ∪ *Õ*<sup>(*l*)</sup>) \ (*O*<sup>(*l*)</sup> ∩ *Õ*<sup>(*l*)</sup>). The last equation above is a direct conclusion of (9). By the above results, we have

$$
\begin{aligned}
A_3^{(l)}(x, y; t) &= \frac{2}{NH} \sum_{(x_i, y_j; t_k) \in O^{(l)}} \left[ f(x_i, y_j; t_k) - \widehat{a}^{(l)} - \widehat{b}^{(l)}(x_i - x) - \widehat{c}^{(l)}(y_j - y) - \widehat{d}^{(l)}(t_k - t) \right]^2 K_{ijk} \\
&= \frac{2}{NH} \sum_{(x_i, y_j; t_k) \in O^{(l)}} \left[ f(x_i, y_j; t_k) - f_{-}(x_\tau, y_\tau; t_\tau) - d_\tau B_{0l} - \frac{d_\tau B_{1l}}{\tilde{\xi}_{200}} \frac{x_i - x}{h_x} - \frac{d_\tau B_{2l}}{\tilde{\xi}_{020}} \frac{y_j - y}{h_y} - \frac{d_\tau B_{3l}}{\tilde{\xi}_{002}} \frac{t_k - t}{h_t} \right]^2 K_{ijk} + o_p(1) \\
&= \frac{2}{NH} \sum_{(x_i, y_j; t_k) \in I^1 \cap O^{(l)}} \left[ - d_\tau B_{0l} - \frac{d_\tau B_{1l}}{\tilde{\xi}_{200}} \frac{x_i - x}{h_x} - \frac{d_\tau B_{2l}}{\tilde{\xi}_{020}} \frac{y_j - y}{h_y} - \frac{d_\tau B_{3l}}{\tilde{\xi}_{002}} \frac{t_k - t}{h_t} \right]^2 K_{ijk} \\
&\quad + \frac{2}{NH} \sum_{(x_i, y_j; t_k) \in I^2 \cap O^{(l)}} \left[ d_\tau - d_\tau B_{0l} - \frac{d_\tau B_{1l}}{\tilde{\xi}_{200}} \frac{x_i - x}{h_x} - \frac{d_\tau B_{2l}}{\tilde{\xi}_{020}} \frac{y_j - y}{h_y} - \frac{d_\tau B_{3l}}{\tilde{\xi}_{002}} \frac{t_k - t}{h_t} \right]^2 K_{ijk} + o_p(1) \\
&= 2 d_\tau^2 \iiint_{Q^{(1l)}} \left[ B_{0l} + \frac{B_{1l}}{\tilde{\xi}_{200}} u + \frac{B_{2l}}{\tilde{\xi}_{020}} v + \frac{B_{3l}}{\tilde{\xi}_{002}} w \right]^2 K(u, v) K(w)\, du\,dv\,dw \\
&\quad + 2 d_\tau^2 \iiint_{Q^{(2l)}} \left[ 1 - B_{0l} - \frac{B_{1l}}{\tilde{\xi}_{200}} u - \frac{B_{2l}}{\tilde{\xi}_{020}} v - \frac{B_{3l}}{\tilde{\xi}_{002}} w \right]^2 K(u, v) K(w)\, du\,dv\,dw + o_p(1) \\
&= d_\tau^2 (C_\tau^{(l)})^2 + o_p(1),
\end{aligned} \tag{A8}
$$

where

$$\begin{aligned} C_\tau^{(l)} &= \left( 2 \iiint_{Q^{(1l)}} \left[ B_{0l} + \frac{B_{1l}}{\tilde{\xi}_{200}} u + \frac{B_{2l}}{\tilde{\xi}_{020}} v + \frac{B_{3l}}{\tilde{\xi}_{002}} w \right]^2 K(u, v) K(w)\, du\,dv\,dw \right. \\ &\quad \left. + 2 \iiint_{Q^{(2l)}} \left[ 1 - B_{0l} - \frac{B_{1l}}{\tilde{\xi}_{200}} u - \frac{B_{2l}}{\tilde{\xi}_{020}} v - \frac{B_{3l}}{\tilde{\xi}_{002}} w \right]^2 K(u, v) K(w)\, du\,dv\,dw \right)^{1/2}. \end{aligned}$$

Then, by Equations (A6)–(A8), we have

$$
e^{(l)}(x, y; t) = \sigma^2 + d_\tau^2 (C_\tau^{(l)})^2 + o_p(1).
$$

Similarly, we can prove that

$$e(x, y; t) = \sigma^2 + d_\tau^2 (C_\tau)^2 + o_p(1),$$

where

$$\begin{aligned} C_\tau &= \left( \iiint_{Q^{(1)}} \left[ \xi_{000}^{(2)} + \frac{\xi_{100}^{(2)}}{\xi_{200}} u + \frac{\xi_{010}^{(2)}}{\xi_{020}} v + \frac{\xi_{001}^{(2)}}{\xi_{002}} w \right]^2 K(u, v) K(w)\, du\,dv\,dw \right. \\ &\quad \left. + \iiint_{Q^{(2)}} \left[ 1 - \xi_{000}^{(2)} - \frac{\xi_{100}^{(2)}}{\xi_{200}} u - \frac{\xi_{010}^{(2)}}{\xi_{020}} v - \frac{\xi_{001}^{(2)}}{\xi_{002}} w \right]^2 K(u, v) K(w)\, du\,dv\,dw \right)^{1/2}. \end{aligned}$$

The main difference between this case and the previous case in the proof is in the derivation of the result of (A8). For *e*(*x*, *y*; *t*), the corresponding result is

$$
\begin{aligned}
A_3(x, y; t) &= \frac{1}{NH} \sum_{(x_i, y_j; t_k)} \left[ f(x_i, y_j; t_k) - \widehat{a}(x, y; t) - \widehat{b}(x, y; t)(x_i - x) - \widehat{c}(x, y; t)(y_j - y) - \widehat{d}(x, y; t)(t_k - t) \right]^2 K_{ijk} \\
&= \frac{1}{NH} \sum_{(x_i, y_j; t_k)} \left[ f(x_i, y_j; t_k) - f_{-}(x_\tau, y_\tau; t_\tau) - d_\tau \xi_{000}^{(2)} - \frac{d_\tau \xi_{100}^{(2)}}{\xi_{200}} \frac{x_i - x}{h_x} - \frac{d_\tau \xi_{010}^{(2)}}{\xi_{020}} \frac{y_j - y}{h_y} - \frac{d_\tau \xi_{001}^{(2)}}{\xi_{002}} \frac{t_k - t}{h_t} \right]^2 K_{ijk} + o_p(1) \\
&= \frac{1}{NH} \left( \sum_{(x_i, y_j; t_k) \in I^1} + \sum_{(x_i, y_j; t_k) \in I^2} \right) \left[ f(x_i, y_j; t_k) - f_{-}(x_\tau, y_\tau; t_\tau) - d_\tau \xi_{000}^{(2)} - \frac{d_\tau \xi_{100}^{(2)}}{\xi_{200}} \frac{x_i - x}{h_x} - \frac{d_\tau \xi_{010}^{(2)}}{\xi_{020}} \frac{y_j - y}{h_y} - \frac{d_\tau \xi_{001}^{(2)}}{\xi_{002}} \frac{t_k - t}{h_t} \right]^2 K_{ijk} + o_p(1) \\
&= \frac{1}{NH} \sum_{(x_i, y_j; t_k) \in I^1} \left[ - d_\tau \xi_{000}^{(2)} - \frac{d_\tau \xi_{100}^{(2)}}{\xi_{200}} \frac{x_i - x}{h_x} - \frac{d_\tau \xi_{010}^{(2)}}{\xi_{020}} \frac{y_j - y}{h_y} - \frac{d_\tau \xi_{001}^{(2)}}{\xi_{002}} \frac{t_k - t}{h_t} \right]^2 K_{ijk} \\
&\quad + \frac{1}{NH} \sum_{(x_i, y_j; t_k) \in I^2} \left[ d_\tau - d_\tau \xi_{000}^{(2)} - \frac{d_\tau \xi_{100}^{(2)}}{\xi_{200}} \frac{x_i - x}{h_x} - \frac{d_\tau \xi_{010}^{(2)}}{\xi_{020}} \frac{y_j - y}{h_y} - \frac{d_\tau \xi_{001}^{(2)}}{\xi_{002}} \frac{t_k - t}{h_t} \right]^2 K_{ijk} + o_p(1) \\
&= d_\tau^2 \iiint_{Q^{(1)}} \left[ \xi_{000}^{(2)} + \frac{\xi_{100}^{(2)}}{\xi_{200}} u + \frac{\xi_{010}^{(2)}}{\xi_{020}} v + \frac{\xi_{001}^{(2)}}{\xi_{002}} w \right]^2 K(u, v) K(w)\, du\,dv\,dw \\
&\quad + d_\tau^2 \iiint_{Q^{(2)}} \left[ 1 - \xi_{000}^{(2)} - \frac{\xi_{100}^{(2)}}{\xi_{200}} u - \frac{\xi_{010}^{(2)}}{\xi_{020}} v - \frac{\xi_{001}^{(2)}}{\xi_{002}} w \right]^2 K(u, v) K(w)\, du\,dv\,dw + o_p(1) \\
&= d_\tau^2 (C_\tau)^2 + o_p(1).
\end{aligned}
$$

*Appendix A.4. Proof of Theorem 3*

For the case when (*x*, *y*; *t*) ∈ Ω<sub>*h*</sub> \ *J*<sub>*h*</sub>, the estimator *f̂*(*x*, *y*; *t*) is one of *â*(*x*, *y*; *t*), *â*<sup>(1)</sup>(*x*, *y*; *t*), *â*<sup>(2)</sup>(*x*, *y*; *t*) and (*â*<sup>(1)</sup>(*x*, *y*; *t*) + *â*<sup>(2)</sup>(*x*, *y*; *t*))/2, all of which are consistent estimators of *f*(*x*, *y*; *t*). So, we have the result in the theorem.

For the case when (*x*, *y*; *t*) ∈ *J*<sub>*h*</sub> \ *S*<sub>*h*</sub>, it is easy to see that we have either (i) *e*(*x*, *y*; *t*) = *σ*<sup>2</sup> + *d*<sub>*τ*</sub><sup>2</sup>(*C*<sub>*τ*</sub>)<sup>2</sup> + *o*<sub>*p*</sub>(1), *e*<sup>(1)</sup>(*x*, *y*; *t*) = *σ*<sup>2</sup> + *o*<sub>*p*</sub>(1), and *e*<sup>(2)</sup>(*x*, *y*; *t*) = *σ*<sup>2</sup> + *d*<sub>*τ*</sub><sup>2</sup>(*C*<sub>*τ*</sub><sup>(2)</sup>)<sup>2</sup> + *o*<sub>*p*</sub>(1), or (ii) *e*(*x*, *y*; *t*) = *σ*<sup>2</sup> + *d*<sub>*τ*</sub><sup>2</sup>(*C*<sub>*τ*</sub>)<sup>2</sup> + *o*<sub>*p*</sub>(1), *e*<sup>(1)</sup>(*x*, *y*; *t*) = *σ*<sup>2</sup> + *d*<sub>*τ*</sub><sup>2</sup>(*C*<sub>*τ*</sub><sup>(1)</sup>)<sup>2</sup> + *o*<sub>*p*</sub>(1), and *e*<sup>(2)</sup>(*x*, *y*; *t*) = *σ*<sup>2</sup> + *o*<sub>*p*</sub>(1). In both cases, we have *D*(*x*, *y*; *t*) = *d*<sub>*τ*</sub><sup>2</sup>(*C*<sub>*τ*</sub>)<sup>2</sup> + *o*<sub>*p*</sub>(1). Therefore, asymptotically *D*(*x*, *y*; *t*) > *u*. Since *e*<sup>(1)</sup>(*x*, *y*; *t*) < *e*<sup>(2)</sup>(*x*, *y*; *t*) in (i), the estimator *f̂*(*x*, *y*; *t*) is *â*<sup>(1)</sup>(*x*, *y*; *t*) in this case, which is a consistent estimator of *f*(*x*, *y*; *t*). A similar result follows in case (ii).
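The case analysis in this proof can be read as a selection rule among the full-neighborhood fit and the two one-sided fits. The Python sketch below is an assumption-laden illustration, not the procedure as defined in the main text: it takes *D* to be the larger of the two differences between the full-neighborhood residual measure and the one-sided ones, which is consistent with the limits stated above, and it averages the two one-sided fits in the event of a tie. The function name `select_estimate` and its argument names are hypothetical.

```python
def select_estimate(a, a1, a2, e, e1, e2, u):
    """Pick a fitted value from the full fit (a) and the one-sided fits (a1, a2).

    e, e1, e2 are the corresponding residual measures and u is the threshold.
    """
    D = max(e - e1, e - e2)  # assumed form of D, consistent with the limits above
    if D <= u:
        return a             # no evidence of an edge: keep the full-neighborhood fit
    if e1 < e2:
        return a1            # first one-sided neighborhood fits better
    if e2 < e1:
        return a2            # second one-sided neighborhood fits better
    return 0.5 * (a1 + a2)   # tie: average the two one-sided fits
```

In case (i) above, the residual measure of the first one-sided fit stays near *σ*<sup>2</sup> while the other two are inflated by the jump, so the sketch returns the first one-sided fit.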

### **References**


