Article

On a Low-Rank Matrix Single-Index Model

Department of Mathematical Sciences, Norwegian University of Science and Technology, 7034 Trondheim, Norway
Mathematics 2023, 11(9), 2065; https://doi.org/10.3390/math11092065
Submission received: 17 March 2023 / Revised: 25 April 2023 / Accepted: 25 April 2023 / Published: 26 April 2023
(This article belongs to the Special Issue New Advances in High-Dimensional and Non-asymptotic Statistics)

Abstract

In this paper, we conduct a theoretical examination of a low-rank matrix single-index model. This model has recently been introduced in the field of biostatistics, but its theoretical properties for jointly estimating the link function and the coefficient matrix have not yet been fully explored. In this paper, we make use of the PAC-Bayesian bounds technique to provide a thorough theoretical understanding of the joint estimation of the link function and the coefficient matrix. This allows us to gain a deeper insight into the properties of this model and its potential applications in different fields.

1. Introduction

In this study, we investigate a particular type of single-index model, where the response variable, denoted by $Y$, is a real number and the covariate, represented by $X$, is a real matrix of dimension $d \times d$. The model is defined as
$$Y = f^*\big(\langle X, B^* \rangle\big) + \epsilon. \tag{1}$$
Here, $\langle X, B \rangle = \mathrm{trace}(X^\top B)$ denotes the inner product between the matrices $X$ and $B$, where $B^*$ is an unknown coefficient matrix of dimension $d \times d$. The link function $f^*$ is an unknown univariate measurable function. The noise term $\epsilon$ is assumed to have mean $0$ and to be independent of the covariate $X$.
In line with the recent research presented in [1,2], we assume that the coefficient matrix $B^*$ is a symmetric, low-rank matrix with $\mathrm{rank}(B^*) < d$. Additionally, in order to ensure the identifiability of the model, we impose the condition that the Frobenius norm of $B^*$ equals $1$, i.e., $\|B^*\|_F = 1$.
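For concreteness, the following minimal Python sketch simulates data from model (1) under the constraints above. It is an illustration only: the link function (here $\tanh$), the true rank, the dimensions, the noise level, and the covariate scaling are all hypothetical choices, not quantities prescribed by the paper.

```python
# Minimal simulation sketch of model (1): Y = f*(<X, B*>) + eps, with B* symmetric,
# low-rank, ||B*||_F = 1.  All specific choices below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 10, 200, 2                               # dimension, sample size, assumed rank

# Build a symmetric rank-r matrix B_star with unit Frobenius norm.
U = np.linalg.qr(rng.standard_normal((d, r)))[0]   # orthonormal columns
lam = rng.uniform(0.5, 1.0, size=r)
B_star = U @ np.diag(lam) @ U.T
B_star /= np.linalg.norm(B_star, "fro")            # enforce ||B*||_F = 1

f_star = np.tanh                                   # hypothetical bounded link function

# Covariates with entries in [-1/d, 1/d], so ||X||_F <= 1 and |<X, B*>| <= 1 (Cauchy-Schwarz).
X = rng.uniform(-1.0, 1.0, size=(n, d, d)) / d
index = np.einsum("nij,ij->n", X, B_star)          # <X_i, B*> = trace(X_i^T B*)
Y = f_star(index) + 0.1 * rng.standard_normal(n)   # Gaussian noise (subexponential)
```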
Previous studies have been conducted on a similar model to the one presented in this paper, where the unknown coefficient matrix B is assumed to have sparse elements. In particular, the work of [1] in the field of biostatistics has been used to examine the correlation between a response variable and the functional connectivity associated with a certain brain region. Additionally, recent research by [2] has focused on developing methods for estimating the unknown low-rank matrix B by using implicit regularization techniques.
The model discussed in this paper can be thought of as a nonparametric version of the trace regression model that has been previously proposed in the literature, specifically in the works in [3,4,5]. This trace regression model utilizes the identity function as the link function, and encompasses a diverse array of statistical models, including but not limited to reduced rank regression, matrix completion, and linear regression.
The single-index model is a versatile extension of the linear model that admits a natural interpretation: the response depends on the covariates only through a one-dimensional projection onto the direction of the parameter (vector/matrix), and the form of this dependence is captured by the link function $f^*$. The model has been the subject of extensive research in the literature, with various studies exploring its applications and extensions in different fields; examples of such works include [1,6,7,8,9,10,11,12,13,14]. These studies demonstrate the versatility and utility of the single-index model in a wide range of contexts, making it a valuable tool for researchers in many areas.
Definition 1.
Let $S_1^d$ denote the set of all symmetric matrices $B \in \mathbb{R}^{d \times d}$ such that $\|B\|_F = 1$.
Given the covariates $\{X_i\}_{i=1}^{n}$, the response variables $\{Y_i\}_{i=1}^{n}$ are i.i.d. generated from model (1). We define the expected risk, for any measurable $f : \mathbb{R} \to \mathbb{R}$ and $B \in S_1^d$, as
$$R(B, f) = \mathbb{E}\big[\big(Y - f(\langle X, B\rangle)\big)^2\big],$$
and denote the empirical counterpart of $R(B, f)$ by
$$r_n(B, f) = \frac{1}{n}\sum_{i=1}^{n}\big(Y_i - f(\langle X_i, B\rangle)\big)^2.$$
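The empirical risk is straightforward to evaluate; the short helper below is an illustrative sketch (the candidate pair $(B, f)$, the data, and the $\tanh$ link are placeholders, not quantities from the paper).

```python
# Sketch of the empirical risk r_n(B, f); data and candidate pair are synthetic placeholders.
import numpy as np

def empirical_risk(Y, X, B, f):
    """r_n(B, f) = (1/n) * sum_i (Y_i - f(<X_i, B>))^2, with <X, B> = trace(X^T B)."""
    index = np.einsum("nij,ij->n", X, B)
    return np.mean((Y - f(index)) ** 2)

rng = np.random.default_rng(1)
n, d = 50, 5
X = rng.uniform(-1, 1, size=(n, d, d)) / d
B = np.eye(d) / np.sqrt(d)                 # a valid candidate: symmetric, ||B||_F = 1
Y = rng.standard_normal(n)                 # placeholder responses
print(empirical_risk(Y, X, B, np.tanh))    # hypothetical link f = tanh
```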
In this research, we examine the predictive ability of the model. More specifically, we consider a pair $(B, f)$ to have predictive performance comparable to that of $(B^*, f^*)$ if the difference $R(B, f) - R(B^*, f^*)$ is small.
Our approach in this work is built on the PAC-Bayesian bound technique, which is a powerful tool for obtaining oracle inequalities [15]. As in Bayesian analysis, an important ingredient of a PAC-Bayesian bound is the specification of a prior distribution over the parameter space. In our approach, we adopt the prior distribution for the link function from [11], while the prior distribution for the matrix parameter $B$ is inspired by the eigendecomposition of the matrix. The specifics of our approach and the details of the chosen prior distributions are discussed in the next section. The use of the PAC-Bayesian bound technique, in combination with carefully chosen prior distributions, allows us to obtain reliable and accurate estimates of the unknown parameters in our model.

2. Main Result

2.1. Method

We make the additional assumption in our model (1) that $\mathbb{E}[\epsilon \mid X] = 0$, and we impose the following conditional moment assumption on the noise $\epsilon$.
Assumption 1.
We assume that there exist two constants $\sigma > 0$ and $L > 0$ such that, for all integers $s \geq 2$,
$$\mathbb{E}\big[\,|\epsilon|^s \,\big|\, X\,\big] \leq \frac{s!}{2}\,\sigma^2 L^{s-2}.$$
Remark 1.
The assumption stated above implies that the noise term in our model is subexponential. This class of distributions includes, for example, Gaussian noise and bounded noise, as discussed in [16]. In simpler terms, the tails of the noise decay at least exponentially fast, although possibly more slowly than Gaussian tails. This assumption is important for the application of our approach, as it allows us to obtain accurate and reliable estimates of the unknown parameters under a wide range of noise conditions. This matters because the presence of noise can have a significant impact on the accuracy of the estimates obtained from our model. By assuming subexponential noise, we can be confident that our estimates are robust to its presence.
In addition to the assumptions stated previously, it is also necessary to assume that the covariate matrix $X$ is almost surely bounded by a constant. The unknown link function $f^*$ is also assumed to be bounded by some known positive constant. To make this precise, we write $\|X\|_\infty$ for the supremum norm of $X$ and $\|f^*\|_\infty$ for the supremum norm of $f^*$ over the interval $[-1, 1]$. Based on these definitions, we make the following assumption:
Assumption 2.
We assume that $\|X\|_\infty \leq 1$ a.s., and that there exists $C \geq 1$ such that $\|f^*\|_\infty \leq C$.
In order to keep the technical proofs as clear and simple as possible, we did not attempt to optimize the constants used in the proofs. In particular, the condition $C \geq 1$ is merely a convenience; it could be removed by working with $\max(C, 1)$ in the proofs.
The link function $f^*$ is approximated through a given countable set of measurable functions (a dictionary) $\{\varphi_k\}_{k=1}^{\infty}$. For this purpose, the set of finite linear combinations of functions from the dictionary is utilized, and we denote this vector space by $F$. We assume that each element $\varphi_k$ of the dictionary is defined on the interval $[-1, 1]$ and takes values in $[-1, 1]$.
Assumption 3.
For the sake of simplicity, we assume that the basis functions are differentiable and that there exists some constant $C_\varphi > 0$ such that
$$\|\varphi_k'\|_\infty \leq C_\varphi\, k.$$
An example of such a collection of functions is the system of non-normalized trigonometric functions,
$$\varphi_1(t) = 1, \quad \varphi_{2k}(t) = \cos(\pi k t), \quad \varphi_{2k+1}(t) = \sin(\pi k t), \quad k = 1, 2, \ldots,$$
which satisfies this assumption. This assumption on the dictionary functions enables us to approximate the unknown link function $f^*$ by a finite linear combination of these functions.
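As a small illustration of Assumption 3, the sketch below evaluates the trigonometric dictionary numerically; the value $C_\varphi = \pi$ mentioned in the comments is our own illustrative choice of constant, not one fixed by the paper.

```python
# Sketch: the trigonometric dictionary and a numerical check that sup|phi_k'|
# grows at most like C_phi * k (here C_phi = pi is an illustrative constant).
import numpy as np

def phi(k, t):
    """phi_1 = 1, phi_{2m} = cos(pi m t), phi_{2m+1} = sin(pi m t)."""
    if k == 1:
        return np.ones_like(t)
    m = k // 2
    return np.cos(np.pi * m * t) if k % 2 == 0 else np.sin(np.pi * m * t)

t = np.linspace(-1.0, 1.0, 2001)
for k in range(1, 8):
    deriv = np.gradient(phi(k, t), t)      # numerical derivative on [-1, 1]
    # sup|phi_k'| = pi * floor(k / 2) <= pi * k, consistent with Assumption 3.
    print(k, np.max(np.abs(deriv)))
```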
Our approach is inspired by the work of [11], where the authors explored the PAC-Bayesian approach of [15] for a sparse-vector single-index model. The method first requires specifying a distribution $\pi$ on $S_1^d \times F$, analogous to the prior distribution in Bayesian analysis. In our framework, this prior distribution should enforce the characteristics of the underlying link function and the parameter matrix. In this work, we consider a prior distribution of the product form
$$d\pi(B, f) = d\mu(B)\, d\nu(f);$$
in other words, the prior distribution of the index matrix and the prior distribution over the link functions are assumed to be independent.
In this study, the matrix $B$ is treated as a symmetric matrix and can be expressed through its eigendecomposition $B = U \Lambda U^\top$. The matrix $U$ is an orthogonal matrix with $U^\top U = U U^\top = I_d$ (the identity matrix of dimension $d \times d$), and the diagonal matrix $\Lambda$ holds the corresponding eigenvalues $\lambda_1, \ldots, \lambda_d$. To enforce $\|B\|_F = 1$, the sum of the squares of the eigenvalues must equal $1$, since $\|B\|_F = \sqrt{\mathrm{trace}(B^2)}$ and $\mathrm{trace}(B^2) = \sum_{j=1}^{d} \lambda_j^2$. Additionally, the requirement of low-rankness on $B$ means that most of the eigenvalues $\lambda_1, \ldots, \lambda_d$ are close to zero, with only a few being significantly larger.
With the goal of obtaining an appropriate low-rank-promoting prior for $B$, we propose the following approach. We simulate an orthogonal matrix $V$ and a vector $(\gamma_1, \ldots, \gamma_d)$ from a Dirichlet distribution $\mathrm{Dir}(\alpha_1, \ldots, \alpha_d)$, and put
$$B = V\, \mathrm{diag}\big(\gamma_1^{1/2}, \ldots, \gamma_d^{1/2}\big)\, V^\top.$$
To obtain an approximately low-rank matrix, we take all parameters of the Dirichlet distribution very close to $0$, for example by setting $\alpha_1 = \cdots = \alpha_d = 1/d$. It is worth noting that a typical draw from such a Dirichlet distribution has one of the $\gamma_i$ close to $1$ and the others close to $0$. For more detailed discussions on how to choose the parameters of the Dirichlet distribution, one can refer to [17].
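The construction of this low-rank-promoting prior can be mimicked numerically as follows; drawing the orthogonal matrix $V$ via the QR decomposition of a Gaussian matrix is an assumption made for illustration (any procedure producing a uniformly distributed orthogonal matrix would serve).

```python
# Sketch of the low-rank-promoting prior: B = V diag(gamma^{1/2}) V^T with
# gamma ~ Dirichlet(1/d, ..., 1/d) and V a random orthogonal matrix.
import numpy as np

rng = np.random.default_rng(2)
d = 20

def draw_B(rng, d):
    gamma = rng.dirichlet(np.full(d, 1.0 / d))         # mass concentrates on few coordinates
    V = np.linalg.qr(rng.standard_normal((d, d)))[0]   # random orthogonal matrix (illustrative)
    return V @ np.diag(np.sqrt(gamma)) @ V.T

B = draw_B(rng, d)
print(np.linalg.norm(B, "fro"))             # equals 1, since sum(gamma) = 1
eigvals = np.linalg.eigvalsh(B)
print(np.sum(eigvals > 0.05))               # typically only a few non-negligible eigenvalues
```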
Now, we present a prior distribution on $F$. We opt to use the prior introduced in [11]. For any integer $M$ with $0 < M \leq n$ and any $c_\Lambda > 0$, put
$$B_M(c_\Lambda) = \Big\{ (\beta_1, \ldots, \beta_M) \in \mathbb{R}^M : \sum_{s=1}^{M} s\,|\beta_s| \leq c_\Lambda \ \text{and}\ \beta_M \neq 0 \Big\}.$$
We then define $F_M(c_\Lambda) \subset F$ as the image of $B_M(c_\Lambda)$ under the map
$$G_M : \mathbb{R}^M \to F, \qquad (\beta_1, \ldots, \beta_M) \mapsto \sum_{j=1}^{M} \beta_j \varphi_j.$$
Remark 2.
Corollary 1 (below) provides a discussion regarding the approximation of Sobolev spaces (see [18]) by the sets $F_M(c_\Lambda)$; this approximation becomes more accurate as $M$ increases.
Now, a prior distribution $\nu_M(df)$ is defined on the set $F_M(C+1)$ as the image, under the map $G_M$, of the uniform measure on $B_M(C+1)$. We then consider the following choice for the prior distribution $\nu$ on $F$:
$$d\nu(f) = \sum_{M=1}^{n} \frac{10^{-M}\,\nu_M(df)}{1 - (1/10)^{n}}. \tag{2}$$
The reason for choosing $C+1$ rather than $C$ in the above definition of the support of the prior is essentially technical: it ensures that, as soon as the underlying link function $f^*$ belongs to $F_n(C)$, there exists a small ball around it that is contained in $F_n(C+1)$. One could safely replace $C+1$ by $C + a_n$, where $\{a_n\}_{n=1}^{\infty}$ is any positive sequence vanishing sufficiently slowly as $n \to \infty$.
Remark 3.
The integer $M$ can be viewed as a measure of the “dimension” of the function $f$: the larger $M$ is, the more complex the function. The prior $\nu$ thus again adapts to the sparsity idea by penalizing large-dimensional functions $f$. The coefficient $10^{-M}$ appearing in (2) shows that more complex models receive a geometrically decreasing weight. The value $10$, inspired by the practical results in [11], is a somewhat arbitrary choice; it could in general be replaced by another constant larger than $1$, although this would require additional technical care.
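One possible way to draw from the prior $\nu$ is sketched below. The uniform sampler on the weighted $\ell_1$-ball $B_M(C+1)$ (a Dirichlet draw, random signs, and a radial factor) is one standard construction, used here only as an illustrative assumption; the paper itself does not prescribe a sampling scheme.

```python
# Sketch of the hierarchical prior nu: draw M with weights proportional to 10^{-M}
# (truncated at n), then draw beta uniformly from B_M(C+1).
import numpy as np

rng = np.random.default_rng(3)
n, C = 100, 1.0

# P(M = m) proportional to 10^{-m}, m = 1, ..., n.
weights = 10.0 ** (-np.arange(1, n + 1))
M = rng.choice(np.arange(1, n + 1), p=weights / weights.sum())

# Uniform draw from {beta in R^M : sum_s s*|beta_s| <= C + 1}:
# draw u uniformly from the l1 ball, then set beta_s = (C + 1) * u_s / s.
p = rng.dirichlet(np.ones(M))                   # point on the l1 sphere (cone measure)
signs = rng.choice([-1.0, 1.0], size=M)
radius = rng.uniform() ** (1.0 / M)             # radial part of the uniform l1-ball law
u = radius * signs * p
beta = (C + 1.0) * u / np.arange(1, M + 1)

print(M, np.sum(np.arange(1, M + 1) * np.abs(beta)))   # <= C + 1 by construction
```

Note that $\beta_M \neq 0$ holds almost surely under this draw, so the sample lies in $B_M(C+1)$ as required.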

2.2. The Proposed Estimator

Definition 2.
The Gibbs posterior distribution over $S_1^d \times F_n(C+1)$ is defined as
$$\hat{\rho}_\lambda(B, f) = \frac{\exp\big(-\lambda\, r_n(B, f)\big)\, d\pi(B, f)}{\int \exp\big(-\lambda\, r_n(B, f)\big)\, d\pi(B, f)}.$$
Now, we define an estimator as follows. Let $\lambda > 0$ be a tuning parameter, sometimes called the inverse temperature parameter. The estimator $(\hat{B}_\lambda, \hat{f}_\lambda)$ of $(B^*, f^*)$ is simply obtained as a random draw from $\hat{\rho}_\lambda$, the Gibbs posterior distribution above.
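The paper does not prescribe a sampling algorithm for $\hat{\rho}_\lambda$ (implementation is left to future work; see the concluding section). The following independence Metropolis–Hastings sketch, with proposals drawn from the prior, is therefore only a hypothetical illustration of how a draw from the Gibbs posterior could be produced; the data, the dimensions, the fixed dictionary size $M$, and the prior samplers are all assumptions.

```python
# Illustrative independence Metropolis-Hastings sampler for the Gibbs posterior:
# proposals come from the prior pi, so the acceptance ratio reduces to
# exp(-lambda * (r_n(proposal) - r_n(current))).  This is a sketch, not the
# paper's prescribed algorithm.
import numpy as np

rng = np.random.default_rng(4)
d, n, M, C, lam = 5, 100, 5, 1.0, 50.0

# --- synthetic data from a hypothetical instance of model (1) ---
B_star = np.eye(d) / np.sqrt(d)
X = rng.uniform(-1, 1, size=(n, d, d)) / d
Y = np.tanh(np.einsum("nij,ij->n", X, B_star)) + 0.1 * rng.standard_normal(n)

def phi_matrix(t):
    """Columns phi_1, ..., phi_M of the trigonometric dictionary evaluated at t."""
    cols = [np.ones_like(t)]
    for m in range(1, (M + 1) // 2 + 1):
        cols += [np.cos(np.pi * m * t), np.sin(np.pi * m * t)]
    return np.stack(cols[:M], axis=1)

def draw_prior(rng):
    gamma = rng.dirichlet(np.full(d, 1.0 / d))
    V = np.linalg.qr(rng.standard_normal((d, d)))[0]
    B = V @ np.diag(np.sqrt(gamma)) @ V.T
    p = rng.dirichlet(np.ones(M)) * rng.choice([-1.0, 1.0], size=M) * rng.uniform() ** (1 / M)
    beta = (C + 1.0) * p / np.arange(1, M + 1)        # draw from B_M(C+1)
    return B, beta

def risk(B, beta):
    index = np.einsum("nij,ij->n", X, B)
    return np.mean((Y - phi_matrix(index) @ beta) ** 2)

B, beta = draw_prior(rng)
r_cur = risk(B, beta)
for _ in range(2000):
    B_new, beta_new = draw_prior(rng)
    r_new = risk(B_new, beta_new)
    if np.log(rng.uniform()) < -lam * (r_new - r_cur):   # MH acceptance step
        B, beta, r_cur = B_new, beta_new, r_new
# (B, beta) is approximately one draw from the Gibbs posterior rho_hat_lambda.
```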

2.3. Theoretical Results

Since $\mathbb{E}[Y \mid X] = f^*(\langle X, B^* \rangle)$ almost surely, it is noted that, for all $(B, f) \in S_1^d \times F_n(C+1)$,
$$R(B, f) - R(B^*, f^*) = \mathbb{E}\big[\big(Y - f(\langle X, B\rangle)\big)^2\big] - \mathbb{E}\big[\big(Y - f^*(\langle X, B^*\rangle)\big)^2\big] = \mathbb{E}\big[\big(f(\langle X, B\rangle) - f^*(\langle X, B^*\rangle)\big)^2\big]$$
(Pythagoras theorem).
Definition 3.
For any positive integer $M \leq n$, we set
$$(B_M, f_M) \in \arg\min_{(B, f) \in S_1^d \times F_M(C)} R(B, f).$$
Remark 4.
It is noted here that the minimizer $(B_M, f_M)$ is defined over $F_M(C)$ for each value of $M$, whereas the prior distribution is defined on the slightly larger set $F_M(C+1)$.
Let us define
$$w := 64\,(C+1)\max[L,\ C+1], \qquad C_1 := 8\big[(C+1)^2 + \sigma^2\big].$$
The theoretical results in this work mainly come from the following theorem, the proof of which is provided in Section 3. It should be noted that throughout the paper, the phrase “with probability 1 δ ” refers to the probability calculated with respect to both the distribution P n of the data and the conditional Gibbs distribution ρ ^ λ .
Theorem 1.
Assume that Assumptions 1 and 2 hold, and set
$$\lambda = \frac{n}{w + 2 C_1}. \tag{3}$$
Then, for all $\delta \in (0, 1)$, with probability at least $1 - \delta$,
$$R(\hat{B}_\lambda, \hat{f}_\lambda) - R(B^*, f^*) \leq \mathcal{C} \inf_{1 \leq M \leq n}\Big\{ R(B_M, f_M) - R(B^*, f^*) + \frac{\log(n)\big(M + d\,\mathrm{rank}(B^*) + d\log(d)\big) + \log\frac{2}{\delta}}{n} \Big\},$$
where $\mathcal{C} > 0$ is a constant depending only on $L, \sigma, C, C_\varphi$.
Remark 5.
In practice, the values of $w$ and $C_1$ are not known, so the theoretical value of $\lambda$ cannot be used directly. However, it provides a useful order of magnitude for tuning this parameter, for example via cross-validation.
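As a small illustration of this remark, the helper below computes the theoretical value $\lambda = n/(w + 2 C_1)$ from the constants defined above and builds a cross-validation grid around that order of magnitude; the specific values of $C$, $L$, and $\sigma$ are placeholders.

```python
# Sketch: theoretical inverse temperature lambda = n / (w + 2*C_1) with
# w = 64(C+1)max(L, C+1) and C_1 = 8[(C+1)^2 + sigma^2], plus a CV grid around it.
import numpy as np

def theoretical_lambda(n, C, L, sigma):
    w = 64 * (C + 1) * max(L, C + 1)
    C1 = 8 * ((C + 1) ** 2 + sigma ** 2)
    return n / (w + 2 * C1)

n, C, L, sigma = 500, 1.0, 1.0, 0.5        # illustrative values
lam0 = theoretical_lambda(n, C, L, sigma)
grid = lam0 * np.logspace(-1, 1, 7)        # candidate lambdas for cross-validation
print(lam0, grid)
```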
Remark 6.
Theorem 1 can be interpreted in a straightforward manner. Essentially, it states that if there exist a “small” $M$ and a small $\mathrm{rank}(B^*)$ such that the difference between $R(B_M, f_M)$ and $R(B^*, f^*)$ is small, then the difference between $R(\hat{B}_\lambda, \hat{f}_\lambda)$ and $R(B^*, f^*)$ will also be small, of order $\log(n)/n$. On the other hand, if neither of these conditions is met, then the rate $M \log(n)/n$ or $\mathrm{rank}(B^*)\, d \log(n)/n$ (or both) will start to dominate, thus degrading the overall convergence rate.
We can obtain a good convergence rate as soon as a low-rank assumption holds. This is typically the case when $B^*$ is already low-rank or can be well approximated by a low-rank matrix. If, in addition, $f^*$ is sufficiently regular, we can obtain a good approximation with a “small” $M$.
As shown in [11], when $f^*$ belongs to a Sobolev space, we can derive a more specific nonparametric rate from the above theorem. For example, assume that $\{\varphi_k\}_{k=1}^{\infty}$ is the system of trigonometric functions and, in addition, that the link function $f^*$ belongs to the following Sobolev ellipsoid [18],
$$W\Big(k, \frac{6 C^2}{\pi^2}\Big) = \Big\{ f \in L_2([-1, 1]) : f = \sum_{j=1}^{\infty} \beta_j \varphi_j \ \text{and}\ \sum_{j=1}^{\infty} j^{2k} \beta_j^2 \leq \frac{6 C^2}{\pi^2} \Big\},$$
where $k \geq 2$ is an unknown regularity parameter. In this context, the approximation set $F_M(C+1)$ takes the form
$$F_M(C+1) = \Big\{ f \in L_2([-1, 1]) : f = \sum_{s=1}^{M} \beta_s \varphi_s,\ \sum_{s=1}^{M} s\,|\beta_s| \leq C+1 \ \text{and}\ \beta_M \neq 0 \Big\}.$$
It should be noted that the results presented in this paper are in the so-called adaptive setting, where the regularity parameter k is not assumed to be known. However, in order to obtain these results, it is necessary to make an additional assumption.
Assumption 4.
We assume that the random variable $\langle X, B^* \rangle$ admits a probability density on $[-1, 1]$ that is upper-bounded by a constant $A > 0$.
Corollary 1.
Assume that the conditions of Theorem 1 and, additionally, Assumption 4 hold. Moreover, assume that $f^*$ belongs to the Sobolev ellipsoid $W(k, 6C^2/\pi^2)$, where the regularity parameter $k \geq 2$ is unknown. The tuning parameter $\lambda$ is as in (3). Then, for all $\delta \in (0, 1)$, with probability at least $1 - \delta$,
$$R(\hat{B}_\lambda, \hat{f}_\lambda) - R(B^*, f^*) \leq \mathcal{C}\left[ \Big(\frac{\log(n)}{n}\Big)^{\frac{2k}{2k+1}} + \frac{\log(n)\big(d\,\mathrm{rank}(B^*) + d\log d\big) + \log\frac{2}{\delta}}{n} \right],$$
where $\mathcal{C} > 0$ is a constant depending only on $L, C, \sigma, C_\varphi, A$.
The proof for Corollary 1 follows a similar approach to that of Corollary 4 in [11], and thus, it is not included in this paper.
Remark 7.
From an asymptotic point of view, where $d$ is fixed and $n \to \infty$, the leading term on the right-hand side of the above corollary is $(\log(n)/n)^{\frac{2k}{2k+1}}$. This is known to be the minimax rate of convergence, up to a $\log(n)$ factor, over a Sobolev class; see [18]. On the other hand, in a nonasymptotic setting where $n$ is “small”, we obtain the estimation rate $\mathrm{rank}(B^*)\, d \log(n)/n$, which was also obtained in [2] and is minimax optimal up to a logarithmic term, as in [3].
From Theorem 1, it is actually possible to show that the Gibbs posterior $\hat{\rho}_\lambda$ contracts around $(B^*, f^*)$ at the optimal rate.
Theorem 2.
Under the same assumptions as in Theorem 1 and with the same definition of $\lambda$, let $\varepsilon_n$ be any sequence in $(0, 1)$ such that $\varepsilon_n \to 0$ as $n \to \infty$. Define
$$E_n = \Big\{ (B, f) \in S_1^d \times F_n(C+1) : R(B, f) - R(B^*, f^*) \leq \mathcal{C} \inf_{1 \leq M \leq n}\Big\{ R(B_M, f_M) - R(B^*, f^*) + \frac{\log(n)\big(M + \mathrm{rank}(B^*)\, d + d\log d\big) + \log\frac{2}{\varepsilon_n}}{n} \Big\} \Big\}.$$
Then,
$$\mathbb{E}\Big[ P_{(B, f) \sim \hat{\rho}_\lambda}\big( (B, f) \in E_n \big) \Big] \geq 1 - \varepsilon_n \xrightarrow[n \to \infty]{} 1.$$

3. Proofs

For the sake of simplicity in the proofs, we put
$$R^* := R(B^*, f^*), \qquad r_n^* := r_n(B^*, f^*).$$
We also note that, for each $f = \sum_{j=1}^{M} \beta_j \varphi_j \in F_M(C+1)$, $\|f\|_\infty \leq \sum_{j=1}^{M} |\beta_j| \leq C + 1$.
The following lemma, Lemma 1, is a Bernstein-type inequality [16] that is useful for our proofs. We denote by $(Z)_+$ the positive part of a random variable $Z$.
Lemma 1.
Let $Z_1, \ldots, Z_n$ be independent real-valued random variables. Assume that there exist two constants $v > 0$ and $w > 0$ such that, for all integers $r \geq 2$, $\sum_{s=1}^{n} \mathbb{E}\big[(Z_s)_+^r\big] \leq \frac{r!}{2}\, v\, w^{r-2}$. Then, for any $\zeta \in (0, 1/w)$,
$$\mathbb{E}\, e^{\zeta \sum_{s=1}^{n} (Z_s - \mathbb{E} Z_s)} \leq \exp\Big( \frac{v \zeta^2}{2(1 - w\zeta)} \Big).$$
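A quick numerical sanity check of this bound is given below for i.i.d. uniform random variables; the constants $v = n/3$ and $w = 1$ are hand-picked so that the moment condition of Lemma 1 holds in this particular example, and are assumptions for illustration only.

```python
# Sanity check of the Bernstein-type bound for Z_i ~ Uniform(0, 1):
# sum_i E[(Z_i)_+^r] = n/(r+1) <= (r!/2) * v * w^{r-2} holds with v = n/3, w = 1.
import numpy as np

n, v, w = 30, 30 / 3.0, 1.0
zeta = np.linspace(0.01, 0.9, 10)               # zeta must stay in (0, 1/w)

# Exact MGF of Z - E[Z] for Z ~ Uniform(0, 1): E exp(z (Z - 1/2)) = sinh(z/2) / (z/2).
lhs = (np.sinh(zeta / 2) / (zeta / 2)) ** n      # E exp(zeta * sum_i (Z_i - E Z_i))
rhs = np.exp(v * zeta ** 2 / (2 * (1 - w * zeta)))
print(np.all(lhs <= rhs))                        # True: the bound holds on this grid
```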
Let $(A, \mathcal{A})$ be a measurable space and let $\gamma_1$ and $\gamma_2$ be two probability measures on $(A, \mathcal{A})$. Denote by $K(\gamma_1, \gamma_2)$ the Kullback–Leibler divergence of $\gamma_1$ with respect to $\gamma_2$. Lemma 2 is a classical result, and its proof can be found, for example, in [15] (page 4).
Lemma 2.
Let $(A, \mathcal{A})$ be a measurable space. For any probability measure $\nu$ on $(A, \mathcal{A})$ and any measurable function $g : A \to \mathbb{R}$ such that $\int (\exp \circ\, g)\, d\nu < \infty$, we have
$$\log \int (\exp \circ\, g)\, d\nu = \sup_{\kappa}\Big\{ \int g\, d\kappa - K(\kappa, \nu) \Big\}, \tag{5}$$
where the supremum is taken over all probability measures $\kappa$ on $(A, \mathcal{A})$, with the convention $\infty - \infty = -\infty$. In addition, when $g$ is upper-bounded on the support of $\nu$, the supremum in (5) is attained by the Gibbs distribution $\rho_g$, given by
$$\frac{d\rho_g}{d\nu}(a) = \frac{\exp(g(a))}{\int (\exp \circ\, g)\, d\nu}, \quad a \in A.$$
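Lemma 2 can be checked numerically on a finite space, as in the sketch below; the base measure $\nu$ and the function $g$ are arbitrary illustrative choices.

```python
# Finite-space illustration of Lemma 2 (Donsker-Varadhan duality):
# log E_nu[exp g] = sup_kappa { E_kappa[g] - KL(kappa, nu) }, attained at the Gibbs law.
import numpy as np

rng = np.random.default_rng(5)
K = 6
nu = rng.dirichlet(np.ones(K))                  # a base probability measure
g = rng.standard_normal(K)                      # an arbitrary bounded function

lhs = np.log(np.sum(np.exp(g) * nu))

gibbs = np.exp(g) * nu
gibbs /= gibbs.sum()                            # the maximizing Gibbs distribution
kl = np.sum(gibbs * np.log(gibbs / nu))
print(lhs, np.sum(gibbs * g) - kl)              # the two values coincide (up to rounding)

# Any other kappa gives a smaller value, e.g. kappa = nu itself (KL = 0):
print(np.sum(nu * g) <= lhs)                    # True, by Jensen's inequality
```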
Lemma 3.
Assume that Assumption 1 is satisfied. Put $w = 16(C+1)\max[L,\ 2(C+1)]$ and $C_1 := 8\big[(C+1)^2 + \sigma^2\big]$, take $\lambda \in \big(0, \frac{n}{w + C_1}\big)$, and put
$$\alpha = \lambda - \frac{\lambda^2 C_1}{2n\big(1 - \frac{w\lambda}{n}\big)} \quad \text{and} \quad \beta = \lambda + \frac{\lambda^2 C_1}{2n\big(1 - \frac{w\lambda}{n}\big)}. \tag{6}$$
Then, for any $\delta \in (0, 1)$ and any distribution $\hat{\rho}_\lambda \ll \pi$, we have
$$\mathbb{E} \int \exp\Big[ \alpha\big( R(B, f) - R^* \big) - \lambda\big( r_n(B, f) - r_n^* \big) - \log \frac{d\hat{\rho}_\lambda}{d\pi}(B, f) - \log\frac{2}{\delta} \Big]\, d\hat{\rho}_\lambda(B, f) \leq \delta/2, \tag{7}$$
$$\mathbb{E} \sup_{\rho} \exp\Big[ -\beta\Big( \int R(B, f)\, d\rho - R^* \Big) + \lambda\Big( \int r_n(B, f)\, d\rho - r_n^* \Big) - K(\rho, \pi) - \log\frac{2}{\delta} \Big] \leq \delta/2. \tag{8}$$
Proof. 
Fix $B \in S_1^d$ and $f \in F_n(C+1)$. We start by applying Lemma 1 to the random variables
$$T_i = -\big(Y_i - f(\langle X_i, B\rangle)\big)^2 + \big(Y_i - f^*(\langle X_i, B^*\rangle)\big)^2, \quad i = 1, \ldots, n.$$
Note that $T_1, \ldots, T_n$ are independent, and we have
$$\sum_{i=1}^{n} \mathbb{E}\, T_i^2 = \sum_{i=1}^{n} \mathbb{E}\Big[ \big(2Y_i - f(\langle X_i, B\rangle) - f^*(\langle X_i, B^*\rangle)\big)^2 \big(f(\langle X_i, B\rangle) - f^*(\langle X_i, B^*\rangle)\big)^2 \Big]$$
$$= \sum_{i=1}^{n} \mathbb{E}\Big[ \big(2\epsilon_i + f^*(\langle X_i, B^*\rangle) - f(\langle X_i, B\rangle)\big)^2 \big(f(\langle X_i, B\rangle) - f^*(\langle X_i, B^*\rangle)\big)^2 \Big]$$
$$\leq \sum_{i=1}^{n} \mathbb{E}\Big[ \big(8\epsilon_i^2 + 8(C+1)^2\big) \big(f(\langle X_i, B\rangle) - f^*(\langle X_i, B^*\rangle)\big)^2 \Big]$$
$$\leq 8\big[(C+1)^2 + \sigma^2\big] \sum_{i=1}^{n} \mathbb{E}\Big[ \big(f(\langle X_i, B\rangle) - f^*(\langle X_i, B^*\rangle)\big)^2 \Big] =: v,$$
where we recall that $C_1 := 8[(C+1)^2 + \sigma^2]$, so that $v = n\, C_1 \big( R(B, f) - R^* \big)$.
Now, for all integers $k \geq 3$, we have
$$\sum_{i=1}^{n} \mathbb{E}\big[(T_i)_+^k\big] \leq \sum_{i=1}^{n} \mathbb{E}\Big[ \big|2Y_i - f(\langle X_i, B\rangle) - f^*(\langle X_i, B^*\rangle)\big|^k \big|f(\langle X_i, B\rangle) - f^*(\langle X_i, B^*\rangle)\big|^k \Big]$$
$$= \sum_{i=1}^{n} \mathbb{E}\Big[ \big|2\epsilon_i + f^*(\langle X_i, B^*\rangle) - f(\langle X_i, B\rangle)\big|^k \big|f(\langle X_i, B\rangle) - f^*(\langle X_i, B^*\rangle)\big|^k \Big]$$
$$\leq 2^{k-1} \sum_{i=1}^{n} \mathbb{E}\Big[ \big(2^k |\epsilon_i|^k + 2^k (C+1)^k\big)\, 2^{k-2}(C+1)^{k-2} \big(f(\langle X_i, B\rangle) - f^*(\langle X_i, B^*\rangle)\big)^2 \Big].$$
In the last inequality, we used the fact that $|q + w|^k \leq 2^{k-1}\big(|q|^k + |w|^k\big)$, together with $|f - f^*|^{k-2} \leq \big(2(C+1)\big)^{k-2}$. We obtain
$$\sum_{i=1}^{n} \mathbb{E}\big[(T_i)_+^k\big] \leq \sum_{i=1}^{n} \Big[ 2^{2k-2}\, k!\, \sigma^2 L^{k-2} + 2^{2k-1} (C+1)^k \Big]\, 2^{k-2}(C+1)^{k-2}\, \mathbb{E}\Big[ \big(f(\langle X_i, B\rangle) - f^*(\langle X_i, B^*\rangle)\big)^2 \Big]$$
$$= v \times \frac{\Big[ 2^{2k-2}\, k!\, \sigma^2 L^{k-2} + 2^{2k-1} (C+1)^k \Big]\, 2^{k-2}(C+1)^{k-2}}{8\big[(C+1)^2 + \sigma^2\big]} \leq \frac{k!}{2}\, v\, w^{k-2},$$
with $w = 64(C+1)\max[L,\ C+1]$.
Thus, for any $\lambda \in (0, n/w)$, taking $\zeta = \lambda/n$, we apply Lemma 1 to obtain
$$\mathbb{E} \exp\Big[ \lambda\big( R(B, f) - R^* \big) - \lambda\big( r_n(B, f) - r_n^* \big) \Big] \leq \exp\Big( \frac{v\lambda^2}{2n^2\big(1 - \frac{w\lambda}{n}\big)} \Big) = \exp\Big( \frac{C_1\big( R(B, f) - R^* \big)\lambda^2}{2n\big(1 - \frac{w\lambda}{n}\big)} \Big).$$
Therefore, with the $\alpha$ given in (6), we obtain
$$\mathbb{E}\, e^{\alpha\left( R(B, f) - R^* \right) - \lambda\left( r_n(B, f) - r_n^* \right) - \log\frac{2}{\delta}} \leq \delta/2.$$
Next, integrating with respect to $\pi$ and using Fubini's theorem, we obtain
$$\mathbb{E} \int \exp\Big[ \alpha\big( R(B, f) - R^* \big) - \lambda\big( r_n(B, f) - r_n^* \big) - \log\frac{2}{\delta} \Big]\, d\pi(B, f) \leq \delta/2.$$
To obtain (7), it suffices to note that, for any measurable function $h$,
$$\int \exp\big[ h(B, f) \big]\, d\pi = \int \exp\Big[ h(B, f) - \log \frac{d\hat{\rho}_\lambda}{d\pi}(B, f) \Big]\, d\hat{\rho}_\lambda.$$
The proof of (8) is similar. More precisely, we apply Lemma 1 with $T_i = \big(Y_i - f(\langle X_i, B\rangle)\big)^2 - \big(Y_i - f^*(\langle X_i, B^*\rangle)\big)^2$. We obtain, for any $\lambda \in (0, n/w)$,
$$\mathbb{E} \exp\Big[ \lambda\big( r_n(B, f) - r_n^* \big) - \lambda\big( R(B, f) - R^* \big) \Big] \leq \exp\Big( \frac{v\lambda^2}{2n^2\big(1 - \frac{w\lambda}{n}\big)} \Big).$$
By rearranging terms, using the definition of $\beta$ in (6), and multiplying both sides by $\delta/2$, we obtain
$$\mathbb{E} \exp\Big[ -\beta\big( R(B, f) - R^* \big) + \lambda\big( r_n(B, f) - r_n^* \big) - \log\frac{2}{\delta} \Big] \leq \delta/2.$$
Integrating with respect to $\pi$ and using Fubini's theorem, we obtain
$$\mathbb{E} \int \exp\Big[ -\beta\big( R(B, f) - R^* \big) + \lambda\big( r_n(B, f) - r_n^* \big) - \log\frac{2}{\delta} \Big]\, d\pi \leq \delta/2.$$
Now, Lemma 2 is applied to the integral, and this directly yields (8). □
Proof of Theorem 1.
Recall that $P_n$ stands for the distribution of the sample $\mathcal{D}_n$. Inequality (7) can be written conveniently as
$$\mathbb{E}_{\mathcal{D}_n \sim P_n}\, \mathbb{E}_{(\hat{B}, \hat{f}) \sim \hat{\rho}_\lambda} \exp\Big[ \alpha\big( R(\hat{B}, \hat{f}) - R^* \big) - \lambda\big( r_n(\hat{B}, \hat{f}) - r_n^* \big) - \log \frac{d\hat{\rho}_\lambda}{d\pi}(\hat{B}, \hat{f}) - \log\frac{2}{\delta} \Big] \leq \delta/2.$$
Now, we use the standard Chernoff trick to transform an exponential moment inequality into a deviation inequality, i.e., we use $\exp(\lambda x) \geq \mathbf{1}_{\mathbb{R}_+}(x)$. We obtain, with probability at least $1 - \delta/2$, for any $\delta \in (0, 1)$,
$$R(\hat{B}, \hat{f}) - R^* \leq \frac{\lambda}{\alpha}\Big( r_n(\hat{B}, \hat{f}) - r_n^* + \frac{\log \frac{d\hat{\rho}_\lambda}{d\pi}(\hat{B}, \hat{f}) + \log\frac{2}{\delta}}{\lambda} \Big).$$
It is noted that
$$\log \frac{d\hat{\rho}_\lambda}{d\pi}(\hat{B}, \hat{f}) = \log \frac{\exp\big(-\lambda\, r_n(\hat{B}, \hat{f})\big)}{\int \exp\big(-\lambda\, r_n(B, f)\big)\, d\pi} = -\lambda\, r_n(\hat{B}, \hat{f}) - \log \int e^{-\lambda\, r_n(B, f)}\, d\pi;$$
thus, we obtain, with probability larger than $1 - \delta/2$,
$$R(\hat{B}, \hat{f}) - R^* \leq \frac{1}{\alpha}\Big( -\log \int \exp\big(-\lambda\, r_n(B, f)\big)\, d\pi - \lambda\, r_n^* + \log\frac{2}{\delta} \Big).$$
Now, using Lemma 2, it follows that, with probability larger than $1 - \delta/2$,
$$R(\hat{B}, \hat{f}) - R^* \leq \frac{\lambda}{\alpha}\Big( \int r_n(B, f)\, d\hat{\rho}_\lambda - r_n^* + \frac{K(\hat{\rho}_\lambda, \pi) + \log\frac{2}{\delta}}{\lambda} \Big). \tag{9}$$
Now, from (8), an application of the standard Chernoff trick yields, with probability larger than $1 - \delta/2$, for any $\delta \in (0, 1)$ and simultaneously for all distributions $\rho \ll \pi$,
$$\int r_n(B, f)\, d\rho - r_n^* \leq \frac{\beta}{\lambda}\Big( \int R(B, f)\, d\rho - R^* \Big) + \frac{K(\rho, \pi) + \log\frac{2}{\delta}}{\lambda}. \tag{10}$$
Combining (9) and (10) through a union bound argument (and noting that, by Lemma 2, the right-hand side of (9) can equivalently be written as an infimum over all $\rho \ll \pi$) gives, with probability larger than $1 - \delta$,
$$R(\hat{B}, \hat{f}) - R^* \leq \inf_{\rho}\Big\{ \frac{\beta}{\alpha}\Big( \int R(B, f)\, d\rho - R^* \Big) + \frac{2\big( K(\rho, \pi) + \log\frac{2}{\delta} \big)}{\alpha} \Big\}. \tag{11}$$
The final steps of the proof involve making the right-hand side of the inequality more explicit. To achieve this, we limit the infimum bound to a specific distribution. This allows us to have a more concrete understanding of the result and to explicitly obtain the error rate.
Write $B^* = U^* \Lambda^* (U^*)^\top$ and let $r^* = \#\{ i : \Lambda^*_i > \varepsilon \}$, for a small $\varepsilon \in (0, 1)$. Take
$$d\rho^1_\eta \propto \mathbf{1}\Big( \forall i : |v_i - \Lambda^*_i| \leq \varepsilon;\ \forall i = 1, \ldots, r^* : \|u_i - U^*_i\|_F \leq \eta \Big)\, \mu(du, dv).$$
For any positive integer $M \leq n$ and any $\eta, \gamma \in (0, 1/n)$, let the probability measure $\rho_{M, \eta, \gamma}$ be defined by
$$d\rho_{M, \eta, \gamma}(B, f) = d\rho^1_\eta(B)\, d\rho^2_{M, \gamma}(f),$$
with
$$\rho^2_{M, \gamma}(f) \propto \mathbf{1}\big[ \|f - f_M\|_M \leq \gamma \big]\, \nu_M(f).$$
Here, for $f = \sum_{s=1}^{M} \beta_s \varphi_s \in F_M(C+1)$, we denote $\|f\|_M = \sum_{j=1}^{M} j\,|\beta_j|$.
Inequality (11) leads to
$$R(\hat{B}, \hat{f}) - R^* \leq \inf_{1 \leq M \leq n}\ \inf_{\eta, \gamma > 0}\Big\{ \frac{\beta}{\alpha}\Big( \int R(B, f)\, d\rho_{M, \eta, \gamma}(B, f) - R^* \Big) + \frac{2\big( K(\rho_{M, \eta, \gamma}, \pi) + \log\frac{2}{\delta} \big)}{\alpha} \Big\}. \tag{12}$$
To finish the proof, we have to control the different terms in (12). Note first that
$$K(\rho_{M, \eta, \gamma}, \pi) = K\big(\rho^1_\eta \otimes \rho^2_{M, \gamma},\ \mu \otimes \nu\big) = K(\rho^1_\eta, \mu) + K(\rho^2_{M, \gamma}, \nu_M) + \log\Big[ 10^{M}\big(1 - (1/10)^n\big) \Big].$$
By the technical Lemma 4 below, we know that
$$K(\rho^1_\eta, \mu) \leq r^* d \log(16/\eta) + C_3\, d \log d\, \big(1 + \log(2/\varepsilon)\big).$$
Additionally, by the technical Lemma 10 in [11], we have
$$K(\rho^2_{M, \gamma}, \nu_M) = M \log\frac{C+1}{\gamma}.$$
Bringing all the parts together, we arrive at
$$K(\rho_{M, \eta, \gamma}, \pi) \leq r^* d \log(16/\eta) + C_3\, d \log d\, \big(1 + \log(2/\varepsilon)\big) + M \log\frac{C+1}{\gamma} + M \log 10. \tag{13}$$
Finally, it remains to control the term $\int R(B, f)\, d\rho_{M, \eta, \gamma}(B, f)$. To this aim, we write
$$\int R(B, f)\, d\rho_{M, \eta, \gamma}(B, f) = \int \mathbb{E}\big[ \big(Y - f(\langle X, B\rangle)\big)^2 \big]\, d\rho_{M, \eta, \gamma}(B, f)$$
$$= \int \mathbb{E}\Big[ \big( Y - f_M(\langle X, B_M\rangle) + f_M(\langle X, B_M\rangle) - f(\langle X, B_M\rangle) + f(\langle X, B_M\rangle) - f(\langle X, B\rangle) \big)^2 \Big]\, d\rho_{M, \eta, \gamma}(B, f)$$
$$= R(B_M, f_M) + \int \mathbb{E}\Big[ \big(f_M(\langle X, B_M\rangle) - f(\langle X, B_M\rangle)\big)^2 + \big(f(\langle X, B_M\rangle) - f(\langle X, B\rangle)\big)^2$$
$$+ 2\big(Y - f_M(\langle X, B_M\rangle)\big)\big(f_M(\langle X, B_M\rangle) - f(\langle X, B_M\rangle)\big) + 2\big(Y - f_M(\langle X, B_M\rangle)\big)\big(f(\langle X, B_M\rangle) - f(\langle X, B\rangle)\big)$$
$$+ 2\big(f_M(\langle X, B_M\rangle) - f(\langle X, B_M\rangle)\big)\big(f(\langle X, B_M\rangle) - f(\langle X, B\rangle)\big) \Big]\, d\rho_{M, \eta, \gamma}(B, f)$$
$$=: R(B_M, f_M) + A + B + C + D + E.$$
Computation of C by Fubini’s theorem:
$$C = \int \mathbb{E}\Big[ 2\big(Y - f_M(\langle X, B_M\rangle)\big)\big(f_M(\langle X, B_M\rangle) - f(\langle X, B_M\rangle)\big) \Big]\, d\rho_{M, \eta, \gamma}(B, f) = \int\!\!\int \mathbb{E}\Big[ 2\big(Y - f_M(\langle X, B_M\rangle)\big)\big(f_M(\langle X, B_M\rangle) - f(\langle X, B_M\rangle)\big) \Big]\, d\rho^2_{M, \gamma}(f)\, d\rho^1_\eta(B).$$
Using the triangle inequality, we obtain that, for $f = \sum_{s=1}^{M} \beta_s \varphi_s$ and $f_M = \sum_{s=1}^{M} (\beta_M)_s \varphi_s$,
$$\sum_{j=1}^{M} j\,|\beta_j| \leq \sum_{j=1}^{M} j\,|\beta_j - (\beta_M)_j| + \sum_{j=1}^{M} j\,|(\beta_M)_j|.$$
Since $f_M \in F_M(C)$, and thus $\sum_{s=1}^{M} s\,|(\beta_M)_s| \leq C$, we have $\sum_{s=1}^{M} s\,|\beta_s| \leq C + 1$ as soon as $\|f - f_M\|_M \leq 1$. This shows that the set
$$\Big\{ f = \sum_{j=1}^{M} \beta_j \varphi_j : \|f - f_M\|_M \leq \gamma \Big\}$$
is contained in the support of $\nu_M$. In particular, this implies that $\rho^2_{M, \gamma}$ is centered at $f_M$ and, consequently,
$$\int \big( f_M(\langle X, B_M\rangle) - f(\langle X, B_M\rangle) \big)\, d\rho^2_{M, \gamma}(f) = 0.$$
This proves that $C = 0$.
Control of A: Clearly,
$$A \leq \sup_{y \in [-1, 1]} \int \big( f_M(y) - f(y) \big)^2\, d\rho^2_{M, \gamma}(f) \leq \gamma^2.$$
Control of B: We have
$$B = \int \mathbb{E}\Big[ \big( f(\langle X, B_M\rangle) - f(\langle X, B\rangle) \big)^2 \Big]\, d\rho_{M, \eta, \gamma}(B, f) \leq \int \mathbb{E}\Big[ C_\varphi^2 (C+1)^2\, \langle B_M - B, X\rangle^2 \Big]\, d\rho^1_\eta(B) \quad (\text{using the mean value theorem})$$
$$\leq C_\varphi^2 (C+1)^2\, \mathbb{E}\big[\|X\|^2\big] \int \|B_M - B\|_F^2\, d\rho^1_\eta(B) \quad (\text{by Assumption 2}).$$
Using Lemma 6 from [19], we have
$$\int \|B_M - B\|_F^2\, d\rho^1_\eta(B) \leq \big( 3 d \varepsilon + 2 r^* \eta \big)^2.$$
Thus,
$$B \leq C_\varphi^2 (C+1)^2 \big( 3 d \varepsilon + 2 r^* \eta \big)^2.$$
Control of E: We have that
$$|E| \leq 2 \int \mathbb{E}\Big[ \big| f_M(\langle X, B_M\rangle) - f(\langle X, B_M\rangle) \big|\, \big| f(\langle X, B_M\rangle) - f(\langle X, B\rangle) \big| \Big]\, d\rho_{M, \eta, \gamma}(B, f)$$
$$\leq 2 \int \mathbb{E}\Big[ \big| f_M(\langle X, B_M\rangle) - f(\langle X, B_M\rangle) \big|\, C_\varphi (C+1)\, \big| \langle B_M - B, X\rangle \big| \Big]\, d\rho_{M, \eta, \gamma}(B, f)$$
$$\leq 2 \Big( \int \mathbb{E}\big[ \big(f_M(\langle X, B_M\rangle) - f(\langle X, B_M\rangle)\big)^2 \big]\, d\rho_{M, \eta, \gamma}(B, f) \Big)^{1/2} \Big( \int \mathbb{E}\big[ C_\varphi^2 (C+1)^2 \langle B_M - B, X\rangle^2 \big]\, d\rho_{M, \eta, \gamma}(B, f) \Big)^{1/2}$$
$$\leq 2 \big( \gamma^2 \big)^{1/2} \Big( C_\varphi^2 (C+1)^2 \big( 3 d \varepsilon + 2 r^* \eta \big)^2 \Big)^{1/2} = 2\, C_\varphi (C+1)\, \gamma\, \big( 3 d \varepsilon + 2 r^* \eta \big).$$
Control of D: Finally,
$$D = 2 \int \mathbb{E}\Big[ \big(Y - f_M(\langle X, B_M\rangle)\big)\big( f(\langle X, B_M\rangle) - f(\langle X, B\rangle) \big) \Big]\, d\rho_{M, \eta, \gamma}(B, f)$$
$$= 2 \int \mathbb{E}\Big[ \big(Y - f_M(\langle X, B_M\rangle)\big)\big( f_M(\langle X, B_M\rangle) - f_M(\langle X, B\rangle) \big) \Big]\, d\rho^1_\eta(B) \quad \Big(\text{since } \int f\, d\rho^2_{M, \gamma}(f) = f_M\Big)$$
$$\leq 2 \Big( \mathbb{E}\big[ \big(Y - f_M(\langle X, B_M\rangle)\big)^2 \big] \Big)^{1/2} \Big( \mathbb{E}\Big[ \Big( \int \big( f_M(\langle X, B_M\rangle) - f_M(\langle X, B\rangle) \big)\, d\rho^1_\eta(B) \Big)^2 \Big] \Big)^{1/2}$$
$$= 2 \Big( R(B_M, f_M)\, \mathbb{E}\Big[ \Big( \int \big( f_M(\langle X, B_M\rangle) - f_M(\langle X, B\rangle) \big)\, d\rho^1_\eta(B) \Big)^2 \Big] \Big)^{1/2}.$$
As we have
$$\big| f_M(\langle X, B_M\rangle) - f_M(\langle X, B\rangle) \big| \leq C_\varphi (C+1)\, \big| \langle B_M - B, X\rangle \big| \leq C_\varphi (C+1)\, \|B_M - B\|_F,$$
it follows that
$$\mathbb{E}\Big[ \Big( \int \big( f_M(\langle X, B_M\rangle) - f_M(\langle X, B\rangle) \big)\, d\rho^1_\eta(B) \Big)^2 \Big] \leq C_\varphi^2 (C+1)^2 \Big( \int \|B_M - B\|_F\, d\rho^1_\eta(B) \Big)^2 \leq C_\varphi^2 (C+1)^2 \big( 3 d \varepsilon + 2 r^* \eta \big)^2,$$
and therefore,
$$D \leq 2\, C_\varphi (C+1) \big( 3 d \varepsilon + 2 r^* \eta \big)\, R(0, 0)^{1/2} \leq 2\, C_\varphi (C+1) \big( 3 d \varepsilon + 2 r^* \eta \big) \sqrt{C^2 + \sigma^2}.$$
Thus, taking $\eta = \gamma = \varepsilon = 1/n$ and assembling all the components, we obtain
$$A + B + C + D + E \leq \frac{C_1'}{n},$$
where $C_1'$ is a positive constant depending only on $C$, $\sigma$, and $C_\varphi$. Combining this inequality with (12) and (13) yields, with probability larger than $1 - \delta$,
$$R(\hat{B}_\lambda, \hat{f}_\lambda) - R^* \leq \inf_{1 \leq M \leq n}\Big\{ \frac{\beta}{\alpha}\big( R(B_M, f_M) - R^* \big) + \frac{C_1'}{n} + \frac{2 M \log\big( (C+1)\, 10\, n \big) + r^* d \log(16 n) + C_3\, d \log d\, \log(2 n e) + \log\frac{2}{\delta}}{\lambda} \Big\}.$$
Finally, choosing $\lambda = \frac{n}{w + 2 C_1}$, it follows that there exists a constant $C_2 > 0$, depending only on $L, \sigma, C, C_\varphi$, such that, with probability at least $1 - \delta$,
$$R(\hat{B}_\lambda, \hat{f}_\lambda) - R^* \leq C_2 \inf_{1 \leq M \leq n}\Big\{ R(B_M, f_M) - R^* + \frac{M \log(10\, C\, n) + r^* d \log(16 n) + C_3\, d \log d\, \log(2 n e) + \log\frac{2}{\delta}}{n} \Big\}.$$
This concludes the proof of Theorem 1. □
Lemma 4.
Let $r^* = \#\{ i : \Lambda^*_i > \varepsilon \}$, with small $\varepsilon \in [0, 1)$. Take
$$d\rho^1_\eta \propto \mathbf{1}\Big( \forall i : |v_i - \Lambda^*_i| \leq \varepsilon;\ \forall i = 1, \ldots, r^* : \|u_i - U^*_i\|_F \leq \eta \Big)\, \mu(du, dv).$$
Then,
$$K(\rho^1_\eta, \mu) \leq r^* d \log(16/\eta) + C_3\, d \log d\, \log(2 e / \varepsilon),$$
where C 3 is a universal constant.
Proof. 
We have
$$K(\rho^1_\eta, \mu) = \log \frac{1}{\mu\big( \{ u, v : \forall i : |v_i - \Lambda^*_i| \leq \varepsilon;\ \forall i = 1, \ldots, r^* : \|u_i - U^*_i\|_F \leq \eta \} \big)}$$
$$= \log \frac{1}{\mu\big( \{ \forall i = 1, \ldots, r^* : \|u_i - U^*_i\|_F \leq \eta \} \big)} + \log \frac{1}{\mu\big( \{ \forall i : |v_i - \Lambda^*_i| \leq \varepsilon \} \big)}.$$
For the first log term,
$$\mu\big( \{ \forall i = 1, \ldots, r^* : \|u_i - U^*_i\|_F \leq \eta \} \big) \geq \prod_{i=1}^{r^*} \frac{\pi^{(d-1)/2} (\eta/2)^{d-1} \big/ \Gamma\big(\tfrac{d-1}{2} + 1\big)}{2\, \pi^{(d+1)/2} \big/ \Gamma\big(\tfrac{d+1}{2}\big)} \geq \Big( \frac{\eta^{d-1}}{2^d \pi} \Big)^{r^*} \geq \frac{\eta^{r^*(d-1)}}{2^{4 r^* d}}.$$
Note the following for the above calculation: firstly, the distribution of the orthogonal vectors is approximated by the uniform distribution on the sphere [20]; secondly, each probability is greater than or equal to the volume of a $(d-1)$-dimensional ball of radius $\eta/2$ divided by the surface area of the unit sphere.
It is noted that if $\gamma \sim \mathrm{Beta}(a, b)$ (beta distribution), then $\gamma^{1/2}$ has probability density $h(x) = \frac{2\, x^{2a-1}(1 - x^2)^{b-1}}{\mathrm{Be}(a, b)}$, $0 < x < 1$, where $\mathrm{Be}(a, b)$ is the beta function. For the second log term in the Kullback–Leibler divergence, with $a = \alpha_i$, $b = \sum_{j=1}^{d} \alpha_j - \alpha_i$ and $\alpha_i = 1/d$, we have
$$\mu\big( \{ \forall i : |v_i - \Lambda^*_i| \leq \varepsilon \} \big) = \prod_{i=1}^{d} \int_{\max(\Lambda^*_i - \varepsilon,\, 0)}^{\min(\Lambda^*_i + \varepsilon,\, 1)} \frac{2\, v_i^{2a-1}(1 - v_i^2)^{b-1}}{\mathrm{Be}(a, b)}\, d v_i \geq C_3' \Big( \frac{\varepsilon}{2 d} \Big)^{d} e^{-d \log d},$$
since each interval of integration contains an interval of length at least $\varepsilon$. Thus, we obtain
$$K(\rho^1_\eta, \mu) \leq \log \frac{2^{4 r^* d}}{\eta^{r^*(d-1)}} + \log \frac{(2 d)^d\, e^{d \log d}}{C_3'\, \varepsilon^d} \leq r^* d \log\Big( \frac{16}{\eta} \Big) + C_3\, d \log d\, \log\Big( \frac{2 e}{\varepsilon} \Big)$$
for some absolute numerical constants $C_3', C_3$ that do not depend on $r^*$, $n$ or $d$. □
Proof of Theorem 2.
We again apply Lemma 3 and focus on (7), applied with $\delta := \varepsilon_n$, that is,
$$\mathbb{E} \int \exp\Big[ \alpha\big( R(B, f) - R^* \big) - \lambda\big( r_n(B, f) - r_n^* \big) - \log \frac{d\hat{\rho}_\lambda}{d\pi}(B, f) - \log\frac{2}{\varepsilon_n} \Big]\, d\hat{\rho}_\lambda(B, f) \leq \varepsilon_n/2.$$
Using Chernoff's inequality, this leads to
$$\mathbb{E}\Big[ P_{(B, f) \sim \hat{\rho}_\lambda}\big( (B, f) \in A_n \big) \Big] \geq 1 - \frac{\varepsilon_n}{2},$$
where
$$A_n = \Big\{ (B, f) : \alpha\big( R(B, f) - R^* \big) \leq \lambda\big( r_n(B, f) - r_n^* \big) + \log \frac{d\hat{\rho}_\lambda}{d\pi}(B, f) + \log\frac{2}{\varepsilon_n} \Big\}.$$
From the definition of $\hat{\rho}_\lambda$, for $(B, f) \in A_n$, we obtain
$$\alpha\big( R(B, f) - R^* \big) \leq \lambda\big( r_n(B, f) - r_n^* \big) + \log \frac{d\hat{\rho}_\lambda}{d\pi}(B, f) + \log\frac{2}{\varepsilon_n}$$
$$= -\log \int \exp\big( -\lambda\, r_n(B, f) \big)\, \pi(d(B, f)) - \lambda\, r_n^* + \log\frac{2}{\varepsilon_n}$$
$$= \lambda\Big( \int r_n(B, f)\, \hat{\rho}_\lambda(d(B, f)) - r_n^* \Big) + K(\hat{\rho}_\lambda, \pi) + \log\frac{2}{\varepsilon_n}$$
$$= \inf_{\rho}\Big\{ \lambda\Big( \int r_n(B, f)\, \rho(d(B, f)) - r_n^* \Big) + K(\rho, \pi) \Big\} + \log\frac{2}{\varepsilon_n}.$$
Now, put
$$\mathcal{B}_n := \Big\{ \forall \rho :\ \lambda\Big( \int r_n\, d\rho - r_n^* \Big) \leq \beta\Big( \int R(B, f)\, d\rho - R^* \Big) + K(\rho, \pi) + \log\frac{2}{\varepsilon_n} \Big\}.$$
Using (8), we have
$$\mathbb{E}\big[ \mathbf{1}_{\mathcal{B}_n} \big] \geq 1 - \frac{\varepsilon_n}{2}.$$
We now prove that, if $\lambda$ is such that $\alpha > 0$, then on $\mathcal{B}_n$ we have $A_n \subset E_n$, so that
$$\mathbb{E}\Big[ P_{(B, f) \sim \hat{\rho}_\lambda}\big( (B, f) \in E_n \big) \Big] \geq \mathbb{E}\Big[ P_{(B, f) \sim \hat{\rho}_\lambda}\big( (B, f) \in A_n \big)\, \mathbf{1}_{\mathcal{B}_n} \Big],$$
and, together with
$$\mathbb{E}\Big[ P_{(B, f) \sim \hat{\rho}_\lambda}\big( (B, f) \in A_n \big)\, \mathbf{1}_{\mathcal{B}_n} \Big] = \mathbb{E}\Big[ \big( 1 - P_{(B, f) \sim \hat{\rho}_\lambda}\big( (B, f) \notin A_n \big) \big)\big( 1 - \mathbf{1}_{\mathcal{B}_n^c} \big) \Big] \geq \mathbb{E}\Big[ 1 - P_{(B, f) \sim \hat{\rho}_\lambda}\big( (B, f) \notin A_n \big) - \mathbf{1}_{\mathcal{B}_n^c} \Big] \geq 1 - \varepsilon_n,$$
this leads to
$$\mathbb{E}\Big[ P_{(B, f) \sim \hat{\rho}_\lambda}\big( (B, f) \in E_n \big) \Big] \geq 1 - \varepsilon_n.$$
To obtain the inclusion, assume that we are on the event $\mathcal{B}_n$, and let $(B, f) \in A_n$. Then,
$$\alpha\big( R(B, f) - R^* \big) \leq \inf_{\rho}\Big\{ \lambda\Big( \int r_n(B, f)\, \rho(d(B, f)) - r_n^* \Big) + K(\rho, \pi) \Big\} + \log\frac{2}{\varepsilon_n}$$
$$\leq \inf_{\rho}\Big\{ \beta\Big( \int R(B, f)\, \rho(d(B, f)) - R^* \Big) + 2 K(\rho, \pi) \Big\} + 2\log\frac{2}{\varepsilon_n},$$
that is,
$$R(B, f) - R^* \leq \inf_{\rho}\frac{ \beta\big( \int R\, d\rho - R^* \big) + 2 K(\rho, \pi) + 2\log\frac{2}{\varepsilon_n} }{\alpha}.$$
We upper-bound the right-hand side exactly as in the proof of Theorem 1, which shows that $(B, f) \in E_n$. □

4. Conclusions

In this paper, we conduct a theoretical study of a low-rank matrix single-index model, in which the link function and the coefficient matrix are estimated jointly. We leverage the PAC-Bayesian bounds technique to gain deeper insight into the properties of this model and its potential applications. The study extends previous work in the field by considering a low-rank matrix, rather than a sparse vector, as the coefficient parameter. We also provide a detailed explanation of the choice of prior distributions for the link function and the coefficient matrix, which allows us to obtain accurate and reliable estimates of the unknown parameters. Overall, this study provides a thorough theoretical understanding of the low-rank matrix single-index model.
Future research will focus on implementing the proposed approach. There are various possible avenues to explore; one promising direction is the reversible jump Markov chain Monte Carlo method, which has been successfully applied in the past to the sparse-vector single-index model, as documented in [11].

Funding

This research was funded by Norwegian Research Council grant number 309960 through the Centre for Geophysical Forecasting at NTNU.

Data Availability Statement

No new data were created or analyzed in this study.

Acknowledgments

The author is grateful to two anonymous reviewers for their expert analysis and helpful suggestions.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Weaver, C.; Xiao, L.; Lindquist, M.A. Single-index models with functional connectivity network predictors. Biostatistics 2021, 24, 52–67. [Google Scholar] [CrossRef] [PubMed]
  2. Fan, J.; Yang, Z.; Yu, M. Understanding Implicit Regularization in Over-Parameterized Single Index Model. J. Am. Stat. Assoc. 2022, 1–14. [Google Scholar] [CrossRef]
  3. Rohde, A.; Tsybakov, A.B. Estimation of high-dimensional low-rank matrices. Ann. Stat. 2011, 39, 887–930. [Google Scholar] [CrossRef]
  4. Koltchinskii, V.; Lounici, K.; Tsybakov, A.B. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Stat. 2011, 39, 2302–2329. [Google Scholar] [CrossRef]
  5. Zhao, J.; Niu, L.; Zhan, S. Trace regression model with simultaneously low rank and row (column) sparse parameter. Comput. Stat. Data Anal. 2017, 116, 1–18. [Google Scholar] [CrossRef]
  6. Nelder, J.A.; Wedderburn, R.W. Generalized linear models. J. R. Stat. Soc. Ser. A Gen. 1972, 135, 370–384. [Google Scholar] [CrossRef]
  7. Hardle, W.; Hall, P.; Ichimura, H. Optimal smoothing in single-index models. Ann. Stat. 1993, 21, 157–178. [Google Scholar] [CrossRef]
  8. Ichimura, H. Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. J. Econom. 1993, 58, 71–120. [Google Scholar] [CrossRef]
  9. Jiang, B.; Liu, J.S. Variable selection for general index models via sliced inverse regression. Ann. Stat. 2014, 42, 1751–1786. [Google Scholar] [CrossRef]
  10. Kong, E.; Xia, Y. Variable selection for the single-index model. Biometrika 2007, 94, 217–229. [Google Scholar] [CrossRef]
  11. Alquier, P.; Biau, G. Sparse Single-Index Model. JMLR 2013, 14, 243–280. [Google Scholar]
  12. Putra, I.; Dana, I.M. Study of Optimal Portfolio Performance Comparison: Single Index Model and Markowitz Model on LQ45 Stocks in Indonesia Stock Exchange. Am. J. Humanit. Soc. Sci. Res. 2020, 3, 237–244. [Google Scholar]
  13. Pananjady, A.; Foster, D.P. Single-index models in the high signal regime. IEEE Trans. Inf. Theory 2021, 67, 4092–4124. [Google Scholar] [CrossRef]
  14. Ganti, R.S.; Balzano, L.; Willett, R. Matrix completion under monotonic single index models. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
  15. Catoni, O. Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning; Institute of Mathematical Statistics Lecture Notes—Monograph Series 56; Institute of Mathematical Statistics: Beachwood, OH, USA, 2007; Volume 5544465. [Google Scholar]
  16. Boucheron, S.; Lugosi, G.; Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence; Oxford University Press: Oxford, UK, 2013. [Google Scholar]
  17. Wallach, H.; Mimno, D.; McCallum, A. Rethinking LDA: Why priors matter. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 7–10 December 2009; Volume 22. [Google Scholar]
  18. Tsybakov, A.B. Introduction to Nonparametric Estimation; Springer Series in Statistics; Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar] [CrossRef]
  19. Mai, T.T.; Alquier, P. Pseudo-Bayesian quantum tomography with rank-adaptation. J. Stat. Plan. Inference 2017, 184, 62–76. [Google Scholar] [CrossRef]
  20. Goldstein, S.; Lebowitz, J.L.; Tumulka, R.; Zanghî, N. Any orthonormal basis in high dimension is uniformly distributed over the sphere. Ann. L’Institut Henri Poincaré Probab. Stat. 2017, 53, 701–717. [Google Scholar] [CrossRef]