**1. Introduction**

Many fields of science, and especially health studies, require the solution of problems in which the number of characteristics is larger than the sample size. These problems are referred to as high-dimensional problems. In the present paper, we focus on solving high-dimensional problems in statistical modelling.

We consider the linear regression model

$$y = X\beta + \varepsilon, \tag{1}$$

where $X = [\mathbf{1}\; x_1\; \cdots\; x_d]$ is the design matrix of order $n \times (d+1)$, which is supposed to be high-dimensional, i.e., $n < d$. The columns $x_i \sim N(\mathbf{0}_n, \sigma_i^2 I_n)$, $i = 1, 2, \ldots, d$, are the correlated covariates of the model, and all the elements of the first column of the design matrix are equal to 1, in correspondence with the mean effect. The response vector *y* has length *n*, and $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)^T$ is the *n*-vector of independent and identically distributed (i.i.d.) random errors, where $\varepsilon_i \sim N(0, \sigma^2)$ for all $i = 1, 2, \ldots, n$.

In the present study we focus on the following two points.

1. Estimation of the regression parameter $\beta \in \mathbb{R}^{d+1}$.

2. To ascertain the most important factors of the statistical model.

From a numerical linear algebra point of view, the statistical model (1) can be considered as an underdetermined system. Such a system has infinitely many solutions. The first way to determine the desired vector *β* is to keep the solution with the minimum norm. This solution is referred to as the minimum norm solution (MNS) [1] (p. 264). Another way of solving these problems is based on regularisation techniques. Specifically, these methods allow us to solve a different problem which has a unique solution and thus to estimate the desired vector *β*. One of the most popular regularisation methods is Tikhonov regularisation [2]. Another regularisation technique in use is the *p*-*q* regularisation [3,4].

**Citation:** Choudalakis, S.; Mitrouli, M.; Polychronou, A.; Roupa, P. Solving High-Dimensional Problems in Statistical Modelling: A Comparative Study. *Mathematics* **2021**, *9*, 1806. https://doi.org/10.3390/math9151806

Academic Editor: Nickolai Kosmatov

Received: 3 June 2021; Accepted: 26 July 2021; Published: 30 July 2021

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

It is of major importance to decide whether problem (1) can be solved directly in the least squares sense or whether regularisation is required. Therefore, we describe a way of choosing the appropriate method for solving (1) for design matrices with correlated covariates, and we study the properties of these matrices extensively. We prove that as the correlation of the covariates increases, the generalised condition number of the design matrix increases as well, and thus the design matrix becomes ill-conditioned.

Variable selection is a major issue in solving high-dimensional problems. By variable selection we refer to the specification of the important variables (active factors) in the linear regression model, i.e., the variables which play a crucial role in the model. The rest of the variables (inactive factors) can be omitted.

We deal with variable selection in supersaturated designs (SSDs), which are fractional factorial designs in which the run size is less than the number of all the main effects. In this class of designs, the columns of *X*, except the first column, have elements ±1. The symbols 1 and −1 are usually utilised to denote the high and low level of each factor, respectively. The correlation of SSDs is usually small, i.e., *r* ≤ 0.5. The analysis of SSDs is a main issue in Statistics, and many methods for analysing these designs have been proposed. In [5], a Dantzig selector was introduced. Recently, a sure independence screening method has been applied within a model selection method for SSDs [6], and a support vector machine recursive feature elimination method has been used for feature selection [7]. In our study, as we want to retain sparsity in variable selection, we adopt the *p*-*q* regularisation and the SVD principal regression method [8] in order to determine the most important factors of the statistical model.

In the regression model (1), there is no error in the design matrix *X* which defines the model. The matrix *X* is always considered unperturbed, with covariates from the normal distribution and with well-determined rank. However, we assume the i.i.d. random errors $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)^T$, $\varepsilon_i \sim N(0, \sigma^2)$ for all $i = 1, 2, \ldots, n$, incorporated in the model as given in relation (1). Thus, we have well-posed problems on the given data set, according to the work in [9].

The paper is organised as follows. In Section 2, we briefly present some methods for solving high-dimensional problems. We first display the MNS and subsequently present two regularisation methods: Tikhonov regularisation and a more general technique, the *p*-*q* regularisation method. The described methods are used for estimating the regression parameter *β* of (1) for design matrices with correlated covariates, and the results are given in Section 3. These methods can be applied to ill-posed problems as well. Variable selection for SSDs can be found in Section 4. We conclude this work with several remarks in Section 5.

#### **2. Methods Overview**

In this section, we present some methods for solving high-dimensional problems.

#### *2.1. Minimum Norm Solution*

The system (1), which is underdetermined, does not have a unique solution. In fact, it has infinitely many solutions, and among them we seek the solution whose norm is minimised, i.e., the minimum norm solution (MNS) of $\min_{\beta \in \mathbb{R}^{d+1}} \|y - X\beta\|_2^2$ [1] (p. 264). A necessary and sufficient condition for the existence of the MNS is given in the following theorem.

**Theorem 1.** *Let $X \in \mathbb{R}^{n\times(d+1)}$ be a high-dimensional matrix, i.e., $n < d$, with $\operatorname{rank}(X) = n$, and let $\beta^*$ be a solution of the underdetermined system $X\beta = y$. Then, $\beta^*$ is a MNS if and only if $\beta^* \in \operatorname{Range}(X^T)$.*

**Proof.** As *β*∗ is a solution of the underdetermined system *Xβ* = *y*, we have

$$X\beta^* = y \iff (\beta^*)^T X^T = y^T. \tag{2}$$

Let us consider the *QR* factorisation of $X^T$, i.e.,

$$X^T = QR = Q \begin{bmatrix} R_1 \\ 0_{d+1-n,\,n} \end{bmatrix},$$

where $Q \in \mathbb{R}^{(d+1)\times(d+1)}$ is orthogonal and $R_1 \in \mathbb{R}^{n\times n}$ is upper triangular. Therefore, (2) can be rewritten as

$$(\beta^*)^T QR = y^T \iff (Q^T \beta^*)^T R = y^T. \tag{3}$$

If we set

$$z = Q^T \beta^*, \tag{4}$$

then

$$(3) \iff z^T R = y^T \iff R^T z = y.$$

Moreover, we have

$$(4) \iff Q^{-1}\beta^* = z \iff \beta^* = Qz \iff \beta^* \in \operatorname{Range}(Q) \iff \beta^* \in \operatorname{Range}(X^T).$$

Taking into account the result of the above theorem, we obtain the formula for the MNS *β*<sup>∗</sup>, which is given by

$$\beta^* = X^T (XX^T)^{-1} y. \tag{5}$$

Formula (5) cannot be used directly for calculating the vector *β*, as it is not a stable computation. Therefore, we state Algorithm 1, a stable way of calculating the MNS through the singular value decomposition (SVD) of the design matrix *X* [1] (p. 265). The operation count for this algorithm is dominated by the computation of the SVD, which requires a cost of O(*nd*<sup>2</sup>) flops.
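Algorithm 1 itself is not reproduced in this excerpt; the SVD-based computation it refers to can be sketched as follows. This is a minimal Python/NumPy sketch (the function name is ours; the paper's own experiments were run in Julia):

```python
import numpy as np

def mns_via_svd(X, y, tol=1e-12):
    """Minimum norm solution of the underdetermined system X beta = y,
    computed stably through the SVD of the design matrix X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(s) V^T
    keep = s > tol * s[0]                             # drop numerically zero singular values
    # beta* = sum_i (u_i^T y / s_i) v_i over the retained singular triplets
    return Vt[keep].T @ ((U[:, keep].T @ y) / s[keep])
```

For a full row rank *X*, this coincides with formula (5) while avoiding the explicit inverse of $XX^T$.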


#### *2.2. The Discrete Picard Condition*

It is crucial to identify when problem (1) can be solved directly with a satisfactory MNS, or when different ways of handling the solution must be employed. In [10], a criterion is proposed for deciding whether a least squares problem can have a satisfactory direct solution or not. This criterion employs the SVD of the design matrix *X* and the discrete Picard condition as defined in [11,12]. Let $X = USV^T = \sum_{i=1}^{n} s_i u_i v_i^T$ be the SVD of *X*, where $s_i$ are the singular values of *X* with corresponding left singular vectors $u_i$ and right singular vectors $v_i$, $i = 1, 2, \ldots, n$. The discrete Picard condition ensures that the solution can be approximated by a regularised solution [13].

**Definition 1** (The discrete Picard condition)**.** *The discrete Picard condition (DPC) requires that the ratio $|c_i|/s_i$ decreases to zero as $i \to n$, i.e.,*

$$\frac{|c_i|}{s_i} \to 0, \quad \text{as } i \to n,$$

*where $c_i = u_i^T y$. The DPC implies that the constants $|c_i|$ tend to zero faster than the singular values tend to zero.*

**Example 1.** *Let us now consider design matrices of order $50 \times 101$ whose columns have the same variance $\sigma^2$ and the same correlation structure r. In particular, we test two design matrices X with $(r, \sigma^2) = (0.9, 0.25)$ and $(r, \sigma^2) = (0.999, 1)$. In Figure 1, we display the ratios $|c_i|/s_i$ and $|\hat{c}_i|/s_i$, which correspond to the noise-free and the noisy problem, where $c_i = u_i^T y$, $\hat{c}_i = u_i^T \hat{y}$, $i = 1, 2, \ldots, n$, and $\hat{y} = y + \varepsilon$. If the graphs are close enough, the MNS is satisfactory; otherwise, regularisation techniques are necessary for deriving a good approximation of the desired vector β. As we can see in Figure 1, the values of the depicted ratios are very close for the design matrix with $r = 0.9$, whereas for the highly correlated matrix with $r = 0.999$ the ratios differ. This implies that a regularisation method is necessary in the second case.*

**Figure 1.** The ratios $|c_i|/s_i$ and $|\hat{c}_i|/s_i$ for the design matrices of order 50 × 101 with $(r, \sigma^2) = (0.9, 0.25)$ (**left**) and $(r, \sigma^2) = (0.999, 1)$ (**right**).
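The comparison of Example 1 can be reproduced in a few lines. A Python sketch (`picard_ratios` is our name, not from the paper):

```python
import numpy as np

def picard_ratios(X, y):
    """Ratios |c_i|/s_i with c_i = u_i^T y, used to inspect the
    discrete Picard condition for the right-hand side y."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return np.abs(U.T @ y) / s
```

Evaluating `picard_ratios(X, y)` for the noise-free response and for a noisy one $\hat{y} = y + \varepsilon$, and plotting both against *i*, reproduces the kind of comparison shown in Figure 1.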

#### *2.3. Regularisation Techniques*

There are cases where the MNS $\beta^*$ cannot achieve a good approximation of the desired unknown solution *β*. As the design matrix *X* in the linear regression model (1) is always unperturbed, its rank can be known a priori, and we can adopt regularisation techniques. In the present section, we present two regularisation methods: the popular Tikhonov regularisation [2] and the *p*-*q* regularisation, which has recently received considerable attention [3,4]. Both of these techniques replace the initial problem with another one which is close to the original.

#### 2.3.1. Tikhonov Regularisation

A regularisation method that is widely used is Tikhonov regularisation. The standard form of Tikhonov regularisation, which corresponds to the linear regression model (1), is given by

$$\min_{\beta\in\mathbb{R}^{d+1}} \{\|y - X\beta\|_2^2 + \lambda^2 \|\beta\|_2^2\}, \tag{6}$$

where *λ* is the regularisation parameter. The solution of the penalised least-squares problem (6) is given by the formula

$$\beta_\lambda = (X^T X + \lambda^2 I_{d+1})^{-1} X^T y = X^T (XX^T + \lambda^2 I_n)^{-1} y,$$

as the identity $(X^TX + \lambda^2 I_{d+1})^{-1}X^T = X^T(XX^T + \lambda^2 I_n)^{-1}$ holds. Indeed, we have $X^T(XX^T + \lambda^2 I_n) = (X^TX + \lambda^2 I_{d+1})X^T \Rightarrow (X^TX + \lambda^2 I_{d+1})^{-1}X^T = X^T(XX^T + \lambda^2 I_n)^{-1}$.

As we can see, Tikhonov regularisation depends on the regularisation parameter *λ*. An appropriate method for selecting *λ* leads to a satisfactory approximation $\beta_\lambda$ of the desired regression parameter *β*. The error in the input data of the statistical model that we study follows the normal distribution, i.e., $\varepsilon \sim N(\mathbf{0}_n, \sigma^2 I_n)$. Therefore, the norm of the error is known, and it is given by $\|\varepsilon\|_2 = \sqrt{n-1}\,\sigma$. In the case of a known error norm, the appropriate method for selecting the regularisation parameter is the discrepancy principle, which is reported in Algorithm 2 [14] (p. 283). Following also the analysis presented in [9], and due to the uniqueness of *λ* for most reasonable values of the error norm (see, for example, in [15]), we adopt this method for our study.
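A minimal Python sketch of this step, assuming NumPy. Algorithm 2 is not reproduced in this excerpt, so `discrepancy_lambda` below is only a plain grid-search stand-in for the discrepancy principle; both function names are ours:

```python
import numpy as np

def tikhonov(X, y, lam):
    """beta_lambda via the identity
    (X^T X + lam^2 I_{d+1})^{-1} X^T y = X^T (X X^T + lam^2 I_n)^{-1} y,
    so only an n x n system is solved when n < d."""
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam**2 * np.eye(n), y)

def discrepancy_lambda(X, y, err_norm, grid):
    """Smallest lambda on the grid whose residual norm reaches the known
    error norm -- a grid-search stand-in for the discrepancy principle."""
    for lam in sorted(grid):
        if np.linalg.norm(y - X @ tikhonov(X, y, lam)) >= err_norm:
            return lam
    return max(grid)
```

The residual norm grows monotonically with *λ*, which is why scanning the grid upwards and stopping at the known error norm mimics the discrepancy principle.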


#### 2.3.2. *p*-*q* Regularisation

A more general regularisation technique is the so-called *p*-*q* regularisation [3]. The main idea of this approach is to replace the minimisation of $\|y - X\beta\|_2$ by a *p*-*q* minimisation problem of the form

$$\min_{\beta\in\mathbb{R}^{d+1}}\left\{\frac{1}{p}\|y-X\beta\|_p^p+\mu\frac{1}{q}\|\beta\|_q^q\right\},\tag{7}$$

where *μ* > 0 is the regularisation parameter and 0 < *p*, *q* ≤ 2. The solution of the minimisation problem (7) is given by

$$\hat{\beta}_{\mu} = \operatorname*{argmin}_{\beta \in \mathbb{R}^{d+1}} \left\{ \frac{1}{p}\|y - X\beta\|_p^p + \mu \frac{1}{q}\|\beta\|_q^q \right\}. \tag{8}$$

**Remark 1.** *In case of p* = *q* = 2*, the regularised minimisation problem* (7) *reduces to Tikhonov regularisation.*

Concerning the selection of the regularisation parameter, we choose the optimal value of *μ*, i.e., the value that minimises the error norm $\|\hat{\beta}_\mu - \beta\|_2$ over a given grid of values for *μ*. Concerning the computational cost, the implementation of the *p*-*q* regularisation requires O(*nd*) flops.
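The paper does not commit to a specific solver for (7). One common approach for this functional is smoothed iteratively reweighted least squares (IRLS), sketched below in Python under that assumption (the function name and the smoothing constant `eps` are ours):

```python
import numpy as np

def lp_lq_irls(X, y, p=0.8, q=0.1, mu=5.0, iters=100, eps=1e-8):
    """Smoothed IRLS for min (1/p)||y - X b||_p^p + mu (1/q)||b||_q^q,
    0 < p, q <= 2. Each step solves a weighted Tikhonov-type system."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares start
    for _ in range(iters):
        r = y - X @ beta
        wr = (r**2 + eps) ** (p / 2 - 1)          # smoothed |r_i|^{p-2}
        wb = (beta**2 + eps) ** (q / 2 - 1)       # smoothed |beta_i|^{q-2}
        A = X.T @ (wr[:, None] * X) + mu * np.diag(wb)
        beta = np.linalg.solve(A, X.T @ (wr * y))
    return beta
```

The weight `wb` heavily penalises small coefficients when *q* is small, which is what drives the sparsity exploited in Section 4.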

#### **3. Design Matrix with Correlated Covariates**

In high-dimensional applications, the design matrix $X = [\mathbf{1}\; x_1\; \cdots\; x_d]$ has correlated covariates $x_i \sim N(\mathbf{0}_n, \sigma_i^2 I_n)$, $i = 1, \ldots, d$, where $\sigma_i^2$ is the variance of $x_i$, and the correlation structure is given by the relation

$$r_{ij} = \operatorname{cor}(x_i, x_j) = \frac{x_i^T x_j}{\|x_i\| \|x_j\|}, \quad i, j = 1, \ldots, d, \ i \neq j,$$

with −1 ≤ *rij* ≤ 1.

Next, we present a thorough investigation of the properties that characterise these matrices.

#### *3.1. Correlated Covariates with Same Variance and Correlation*

We initially consider design matrices with correlated covariates which have the same variance $\sigma^2$ and the same correlation *r*. In the following theorem, we formulate and prove in detail the formulae for the singular values of the design matrix *X*. This case of design matrix is considered in [16], where a brief description of the eigenvalues of the matrix $X^TX$ is given.

**Theorem 2.** *Let $X = [\mathbf{1}\; x_1\; \cdots\; x_d] \in \mathbb{R}^{n\times(d+1)}$ be a high-dimensional design matrix of full rank whose columns $x_i \sim N(\mathbf{0}_n, \sigma^2 I_n)$, $i = 1, 2, \ldots, d$, have correlation structure r. The singular values of the matrix X are*

$$s\_1 = \sqrt{n}, \ s\_2 = \dots = s\_{n-1} = \sigma \sqrt{(n-1)(1-r)}, \ s\_n = \sigma \sqrt{(n-1)[(d-1)r+1]}.$$

**Proof.** The *n* singular values of *X* are the square roots of the *n* non-zero eigenvalues of $X^TX$. Therefore, we compute the matrix $X^TX$, i.e.,

$$X^T X = \begin{bmatrix} 1 & \dots & 1 \\ x_{11} & \dots & x_{n1} \\ \vdots & \ddots & \vdots \\ x_{1d} & \dots & x_{nd} \end{bmatrix} \begin{bmatrix} 1 & x_{11} & \dots & x_{1d} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \dots & x_{nd} \end{bmatrix} = \begin{bmatrix} n & \sum_{j=1}^{n} x_{j1} & \dots & \sum_{j=1}^{n} x_{jd} \\ \sum_{j=1}^{n} x_{j1} & & & \\ \vdots & & \hat{X}^T \hat{X} & \\ \sum_{j=1}^{n} x_{jd} & & & \end{bmatrix} = \begin{bmatrix} n & 0 & \dots & 0 \\ 0 & & & \\ \vdots & & \hat{X}^T \hat{X} & \\ 0 & & & \end{bmatrix},$$

where $\hat{X} = [x_1\; \cdots\; x_d]$ and $\sum_{j=1}^{n} x_{ji} = 0$, $\forall\, i = 1, \ldots, d$, due to the construction of the design matrix *X* according to the normal distribution. Therefore, the matrix $X^TX$ has one eigenvalue equal to *n*.

Moreover, we can express the variance $\sigma^2$ of each covariate $x_i = (x_{1i}\; x_{2i}\; \cdots\; x_{ni})^T$ in terms of vector norms as follows:

$$\sigma^2 = \frac{1}{n-1} \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)^2 = \frac{1}{n-1} \|x_i - \bar{x}_i \mathbf{1}\|^2,$$

where $\bar{x}_i$ denotes the mean value of each $x_i$. As the mean value of each $x_i$ is zero, we have

$$\sigma^2 = \frac{1}{n-1}\|x_i\|^2 \Rightarrow \|x_i\|^2 = (n-1)\sigma^2, \ \forall\ i = 1, \ldots, d. \tag{9}$$

The submatrix $\hat{X}^T\hat{X}$ of $X^TX$ can be written as

$$\hat{X}^T \hat{X} = \begin{bmatrix} \|x_1\|^2 & r\|x_1\|\|x_2\| & \dots & r\|x_1\|\|x_d\| \\ r\|x_1\|\|x_2\| & \|x_2\|^2 & \dots & r\|x_2\|\|x_d\| \\ \vdots & \vdots & \ddots & \vdots \\ r\|x_d\|\|x_1\| & r\|x_d\|\|x_2\| & \dots & \|x_d\|^2 \end{bmatrix} \stackrel{(9)}{=} (n-1)\sigma^2 \begin{bmatrix} 1 & r & \dots & r \\ r & 1 & \dots & r \\ \vdots & \vdots & \ddots & \vdots \\ r & r & \dots & 1 \end{bmatrix} = (n-1)\sigma^2 [(1-r)I + rJ],$$

where *J* is the $d \times d$ matrix with all elements equal to 1. The non-zero eigenvalues of $\hat{X}^T\hat{X}$ are $\lambda_1 = (n-1)\sigma^2(1-r)$ with algebraic multiplicity $n-2$ and $\lambda_2 = (n-1)\sigma^2[(d-1)r+1]$ with algebraic multiplicity 1. Therefore, the singular values of *X* are $s_1 = \sqrt{n}$, $s_2 = \cdots = s_{n-1} = \sigma\sqrt{(n-1)(1-r)}$, $s_n = \sigma\sqrt{(n-1)[(d-1)r+1]}$.
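The eigenvalue structure used in the proof can be checked numerically. Note that the check below is on the structured $d \times d$ matrix $(n-1)\sigma^2[(1-r)I + rJ]$ itself (a rank-$n$ sample Gram matrix only realises $n-1$ of these eigenvalues); the values of *n*, *d*, $\sigma^2$, *r* are illustrative:

```python
import numpy as np

n, d, sigma2, r = 50, 100, 2.0, 0.9
# Structured covariate Gram matrix: (n-1) sigma^2 [(1-r) I + r J]
G = (n - 1) * sigma2 * ((1 - r) * np.eye(d) + r * np.ones((d, d)))
eig = np.sort(np.linalg.eigvalsh(G))
small = (n - 1) * sigma2 * (1 - r)            # multiplicity d - 1
large = (n - 1) * sigma2 * ((d - 1) * r + 1)  # multiplicity 1
```

The spectrum splits into one large eigenvalue, driven by the common correlation *r*, and a cluster of small ones, which is exactly what makes the matrix ill-conditioned as $r \to 1$.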

Let us denote by $\kappa(X)$ the generalised condition number of *X*, i.e., $\kappa(X) = \|X\|_2 \cdot \|X^\dagger\|_2$, where $X^\dagger = X^T(XX^T)^{-1}$ is the pseudoinverse of *X* [1] (p. 246). It is known that the generalised condition number can be expressed in terms of the maximum $s_{max}$ and the minimum $s_{min}$ singular values of *X* as $\kappa(X) = s_{max}/s_{min}$ [1] (p. 216). In Theorem 3, we express the generalised condition number of *X* in terms of the correlation structure *r*.

**Theorem 3.** *Let $X = [\mathbf{1}\; x_1\; \cdots\; x_d] \in \mathbb{R}^{n\times(d+1)}$ be a high-dimensional design matrix of full rank whose columns $x_i \sim N(\mathbf{0}_n, \sigma^2 I_n)$, $i = 1, 2, \ldots, d$, have correlation structure r. The generalised condition number of X is given by*

$$\begin{aligned} \text{1.} \quad & \kappa(X) = \sqrt{\frac{n}{(n-1)\sigma^2(1-r)}}, \quad \text{if } r \le \frac{1}{d-1}\left(\frac{n}{(n-1)\sigma^2} - 1\right), \\ \text{2.} \quad & \kappa(X) = \sqrt{\frac{(d-1)r+1}{1-r}}, \quad \text{if } \left(r > \frac{1}{d-1}\left(\frac{n}{(n-1)\sigma^2} - 1\right) \text{ and } \sigma^2 < \frac{n}{n-1}\right) \text{ or} \\ & \qquad \left(r > 1 - \frac{n}{(n-1)\sigma^2} \text{ and } \sigma^2 > \frac{n}{n-1}\right), \\ \text{3.} \quad & \kappa(X) = \sqrt{\frac{(n-1)\sigma^2((d-1)r+1)}{n}}, \quad \text{if } r < 1 - \frac{n}{(n-1)\sigma^2}. \end{aligned}$$

**Proof.** It is obvious that $s_n = \sigma\sqrt{(n-1)[(d-1)r+1]} > s_i$, $i = 2, \ldots, n-1$, holds. Therefore, we have to distinguish three cases: the first case is $s_1 \ge s_n$, the second case is $s_i < s_1 < s_n$, and the last one is $s_1 \le s_i$.

First case: If $s_1 \ge s_n$, then $\kappa(X) = \frac{s_1}{s_i} = \sqrt{\frac{n}{(n-1)\sigma^2(1-r)}}$. The restriction $s_1 \ge s_n$ can be rewritten as follows:

$$\begin{aligned} n &\geq (n-1)\sigma^2((d-1)r+1) \\ \Leftrightarrow & \frac{n}{(n-1)\sigma^2} \geq (d-1)r+1 \\ \Leftrightarrow & \frac{n}{(n-1)\sigma^2} - 1 \geq (d-1)r \\ \Leftrightarrow & r \leq \frac{1}{d-1} \left(\frac{n}{(n-1)\sigma^2} - 1\right). \end{aligned}$$

Second case: If $s_i < s_1 < s_n$, then $\kappa(X) = \frac{s_n}{s_i} = \sqrt{\frac{(d-1)r+1}{1-r}}$. The restriction $s_i < s_1 < s_n$ can be reformulated as follows:

$$\begin{aligned} &(n-1)\sigma^2(1-r) < n < (n-1)\sigma^2(dr+1-r) \\ \Leftrightarrow & \begin{cases} 1-r < \frac{n}{(n-1)\sigma^2} \\ (d-1)r > \frac{n}{(n-1)\sigma^2} - 1 \end{cases} \Leftrightarrow \begin{cases} r > 1 - \frac{n}{(n-1)\sigma^2} \\ r > \frac{1}{d-1} \left(\frac{n}{(n-1)\sigma^2} - 1\right) \end{cases} \end{aligned}$$

Moreover, we compare the two lower bounds for *r*:

$$\begin{aligned} \frac{1}{d-1} \left( \frac{n}{(n-1)\sigma^2} - 1 \right) &< 1 - \frac{n}{(n-1)\sigma^2} \\ \Leftrightarrow \quad n - (n-1)\sigma^2 &< (d-1)(n-1)\sigma^2 - n(d-1) \\ \Leftrightarrow \quad (n-1)\sigma^2(1+d-1) &> n+nd-n \\ \Leftrightarrow \quad \sigma^2 &> \frac{n}{n-1} .\end{aligned}$$

Therefore, we conclude that the generalised condition number $\kappa(X)$ is equal to $\sqrt{\frac{(d-1)r+1}{1-r}}$ if the following relation holds:

$$\left(r > \frac{1}{d-1}\left(\frac{n}{(n-1)\sigma^2} - 1\right) \ \text{and} \ \sigma^2 < \frac{n}{n-1}\right) \quad \text{or} \quad \left(r > 1 - \frac{n}{(n-1)\sigma^2} \ \text{and} \ \sigma^2 > \frac{n}{n-1}\right).$$

Third case: If $s_1 \le s_i$, then $\kappa(X) = \frac{s_n}{s_1} = \sqrt{\frac{(n-1)\sigma^2((d-1)r+1)}{n}}$. This restriction is equivalently written as

$$n < (n-1)\sigma^2(1-r) \Leftrightarrow \frac{n}{(n-1)\sigma^2} < 1 - r \Leftrightarrow r < 1 - \frac{n}{(n-1)\sigma^2}.$$

Taking into consideration the derived formulae for the generalised condition number of the design matrix *X*, we see that if *r* ≈ 1 the generalised condition number *κ*(*X*) becomes large. A detailed example is presented next.

**Example 2.** *In this example, we plot the generalised condition number of X as a function of the correlation r. We consider $n = 50$, $d = 100$, and $\sigma^2 = 2$. In Figure 2, we display $\kappa(X)$ for correlation $r \to 1$. As we see in Figure 2, as the correlation r tends to 1, the generalised condition number $\kappa(X)$ increases rapidly.*

**Figure 2.** The generalised condition number of *X* as a function of the correlation *r*.
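The case distinction of Theorem 3 can be evaluated directly. A Python sketch (the defaults mirror Example 2; the function name is ours):

```python
import numpy as np

def kappa(r, n=50, d=100, sigma2=2.0):
    """Generalised condition number of X from Theorem 3, selecting
    the case that applies for the given correlation r."""
    t = n / ((n - 1) * sigma2)
    if r <= (t - 1) / (d - 1):                        # case 1: s_1 >= s_n
        return np.sqrt(n / ((n - 1) * sigma2 * (1 - r)))
    if r < 1 - t:                                     # case 3: s_1 <= s_i
        return np.sqrt((n - 1) * sigma2 * ((d - 1) * r + 1) / n)
    return np.sqrt(((d - 1) * r + 1) / (1 - r))       # case 2: s_i < s_1 < s_n
```

For the parameters of Example 2, `kappa(0.999)` is an order of magnitude larger than `kappa(0.9)`, matching the rapid growth shown in Figure 2.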

*3.2. Highly Correlated Covariates with Different Variance and Correlation*

Next, we consider a more general and more usual case, in which the covariates $x_i \sim N(\mathbf{0}_n, \sigma_i^2 I_n)$ of the design matrix *X* have different variances $\sigma_i^2$ and correlations $r_{ij}$, $i, j = 1, \ldots, d$. Based on the results presented in [17] for the eigenvalues of the matrix $X^TX$, we record analytic formulae for the singular values of *X* in the following theorem.

**Theorem 4.** *Let $X = [\mathbf{1}\; x_1\; \cdots\; x_d] \in \mathbb{R}^{n\times(d+1)}$ be a high-dimensional design matrix of full rank whose columns $x_i \sim N(\mathbf{0}_n, \sigma_i^2 I_n)$, $i = 1, 2, \ldots, d$, are highly correlated with correlation structure $r_{ij}$. The singular values of the matrix X are*

$$s_1 = \sqrt{n}, \ s_2 = \sqrt{(n-1)\sum_{i=1}^{d} \sigma_i^2 + \mathcal{O}(\delta)}, \ s_3 = \dots = s_n = \sqrt{\mathcal{O}(\delta)},$$

*assuming that $1 - r_{ij} = \mathcal{O}(\delta)$ as $\delta \to 0$.*

As we record in Section 3.1, the generalised condition number is equal to the ratio $s_{max}/s_{min}$, and in the present case $s_{min} = \sqrt{\mathcal{O}(\delta)}$, considering that $1 - r_{ij} = \mathcal{O}(\delta)$, i.e., highly correlated covariates. Therefore, the value of $\kappa(X)$ is large, and this affects the solution of the corresponding problem.

**Remark 2.** *As the correlation r increases, the generalised condition number κ(X) increases as well. From Theorems 2 and 4, we deduce that the case of highly correlated covariates leads to possible instability, and thus regularisation is recommended. This result is confirmed by Table 1, which is presented in Section 3.3.*

#### *3.3. Numerical Implementation*

The implementation of the simulation study presented in this section and in Section 4 has been done by using the Julia Programming Language.

Given the high-dimensional design matrix *X* of order $n \times (d+1)$, the response vector *y* of order *n*, and the *n*-vector $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)^T$ of i.i.d. random errors, $\varepsilon_j \sim N(0, 1)$, $j = 1, 2, \ldots, n$, we estimate the vector *β* by using the methods described in Section 2. We consider design matrices *X* with correlated covariates and distinguish the two aforementioned cases. The results for the first case, i.e., the covariates of the design matrices having the same correlation *r* and the same variance $\sigma^2$, are recorded in Tables 1 and 2. The results for the second case are displayed in Table 3.

The implemented simulation scheme is the following. For each design matrix *X*, a random vector *β* is generated, and $y = X\beta$ denotes the noise-free response vector. Then, 100 iterations are performed; in each one, the response vector is perturbed by noise $\varepsilon_i$, resulting in a noisy response vector $\hat{y} = y + \varepsilon_i$, $i = 1, 2, \ldots, 100$. Eventually, the regression parameter $\hat{\beta}$ is computed by using both the MNS given by Algorithm 1 and the regularisation techniques. The quality of the generated approximate solution $\hat{\beta}$ is assessed by the mean square error (MSE) between *β* and $\hat{\beta}$, which is given by the formula

$$MSE(\hat{\beta}) = E[\|\hat{\beta} - \beta\|_2^2].$$

In Algorithm 3, we summarise the simulation scheme.
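Algorithm 3 itself is not reproduced in this excerpt; the simulation scheme described above can be sketched as follows (Python with NumPy; the paper's experiments were run in Julia, and `simulate_mse` is our name):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_mse(X, estimator, n_rep=100):
    """Draw a random beta, perturb y = X beta with fresh N(0, 1) noise,
    re-estimate beta, and average the squared errors over n_rep runs."""
    n, d1 = X.shape
    beta = rng.standard_normal(d1)
    y = X @ beta                               # noise-free response
    sq_errs = []
    for _ in range(n_rep):
        y_noisy = y + rng.standard_normal(n)   # eps_j ~ N(0, 1)
        sq_errs.append(np.sum((estimator(X, y_noisy) - beta) ** 2))
    return np.mean(sq_errs)
```

Any estimator with the signature `estimator(X, y)` can be plugged in, e.g. `simulate_mse(X, lambda A, z: np.linalg.pinv(A) @ z)` for the MNS, or a regularised solver for the comparisons of Tables 1–3.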


In Tables 1–3, we present the results of estimating the regression parameter *β* for different orders of the design matrices *X*. In the first two columns of the tables, the correlation *r* and the variance $\sigma^2$ of the covariates are recorded, respectively. In Table 3, we record the intervals in which the correlation and the variance lie. In the third column, the adopted methods are written; specifically, we record the MNS, Tikhonov regularisation, and the *p*-*q* regularisation technique for different pairs $(p, q)$. The fourth column contains the grid of values used for the regularisation parameter *λ* or *μ* for Tikhonov or *p*-*q* regularisation, respectively. In the last column, the $MSE(\hat{\beta})$ of the derived approximate solutions $\hat{\beta}$ is recorded.




**Table 2.** Results for $X_{50\times 101}$.

**Table 3.** Results for $X_{25\times 51}$.


As we can see in these tables, in the case of highly correlated design matrices, regularisation is necessary for deriving a good approximation of the desired vector *β*. On the other hand, if the correlation of the design matrix is not high, the MNS can achieve a fair estimation, and a regularisation method does not improve the results, as verified by the $MSE(\hat{\beta})$. Therefore, according to the presented results, for matrices with moderately correlated covariates, regularisation is redundant, as the MNS yields adequate results. However, as the correlation between the covariates rises, regularisation is essential.

Note that in the case of design matrices with the same variance and correlation $r = 0.999$ (Tables 1 and 2), the regularisation techniques, Tikhonov and *p*-*q*, achieve comparable results. The choice of the pair of parameters $(p, q)$ and the values of the required regularisation parameter play an important role in the efficient implementation of both methods.

#### **4. Variable Selection in SSDs**

In this section, we are interested in selecting the active factors of SSDs by using the methods described in Section 2. In our comparison, we also include the SVD principal regression method, which is used in SSDs and was proposed in [8]. We briefly refer to this method as SVD regression. The main computational cost of this approach is the evaluation of the SVD.

We measure the effectiveness of these methods through the Type I and Type II error rates. In particular, Type I error measures the cost of declaring an inactive factor to be active and Type II measures the cost of declaring an active effect to be inactive. In our numerical experiments, we consider 500 different realisations of the error and in the presented tables we record the mean value of Type I, II error rates.
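For a single realisation, the two error rates described above can be computed as follows (a Python sketch; the function name and the thresholding tolerance are ours):

```python
import numpy as np

def error_rates(beta_hat, active, tol=1e-6):
    """Type I (inactive declared active) and Type II (active declared
    inactive) error rates for one variable-selection outcome.
    `active` holds the indices of the truly active factors."""
    declared = set(np.flatnonzero(np.abs(np.asarray(beta_hat)) > tol))
    truth = set(active)
    inactive = set(range(len(beta_hat))) - truth
    type1 = len(declared - truth) / len(inactive) if inactive else 0.0
    type2 = len(truth - declared) / len(truth) if truth else 0.0
    return type1, type2
```

Averaging these rates over the 500 error realisations gives the mean Type I and Type II values reported in the tables.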

It is worth mentioning that both the MNS and Tikhonov regularisation declare all the factors active, i.e., Type I = 1, Type II = 0, for all the tested SSDs. Therefore, these methods are not suitable for variable selection, and we do not include them in the tables presented below.

**Example 3** (An illustrative example)**.** *In this example, we shall exhibit in detail the performance of each method for a particular problem. For this purpose, we adopt the illustrative example presented in [8], with design matrix*

$$X = \begin{bmatrix} + & - & - & - & - & - & - & - & - & - \\ + & + & + & + & + & + & + & - & - & - \\ + & + & + & + & - & - & - & + & + & + \\ - & + & - & - & + & + & - & + & + & - \\ - & - & + & - & + & - & + & + & - & + \\ - & - & - & + & - & + & + & - & + & + \end{bmatrix} = \begin{bmatrix} x_1 & x_2 & \dots & x_{10} \end{bmatrix}.$$

*Then a first column $x_0$ with all entries equal to* 1 *is added to the matrix, which corresponds to the mean effect. The simulated data are generated by the model*

$$y = 5x_0 + 4x_2 + 3x_5 + \varepsilon,$$

*where $\varepsilon \sim N(\mathbf{0}_6, I_6)$. A response vector y obtained by using this model is*

$$y = \begin{bmatrix} -1.54 & 12.02 & 6.82 & 12.44 & 4.62 & -1.21 \end{bmatrix}^T$$

*The exact regression parameter β and the predicted coefficients by each method are demonstrated below.*

$$\beta = \begin{bmatrix} 5 & 0 & 4 & 0 & 0 & 3 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}^T,$$
$$\hat{\beta}_{MNS} = \begin{bmatrix} 5.525 & 0.1208 & 2.4508 & 1.1475 & 0.1758 & 2.0842 & 1.1125 & -0.1908 & 1.2175 & 0.2458 & -1.0575 \end{bmatrix}^T,$$
$$\hat{\beta}_{Tik} = \begin{bmatrix} 4.7237 & 0.1114 & 2.2592 & 1.0578 & 0.1621 & 1.9212 & 1.0255 & -0.1759 & 1.1223 & 0.2266 & -0.9748 \end{bmatrix}^T,$$
$$\hat{\beta}_{2\text{-}0.1} = \begin{bmatrix} 5.5214 & 0.0 & 3.9482 & 0.0 & 0.0 & 2.8458 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \end{bmatrix}^T,$$
$$\hat{\beta}_{SVD} = \begin{bmatrix} 5.5245 & 0.0 & 3.901 & 0.0 & 0.0 & 2.801 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \end{bmatrix}^T.$$

As we can see from the generated approximate solutions $\hat{\beta}$, the MNS and the Tikhonov regularised solution cannot specify the active factors of the model and completely spoil the sparsity. On the other hand, the *p*-*q* regularisation method and the SVD regression can determine the active factors of the model appropriately.

**Example 4** (Williams' data)**.** *We consider the well-known Williams' dataset (rubber age data), which is reported in Table 4. It is a classical SSD dataset and has been tested in several works, such as in [8]. As noted in [8], since columns* 13 *and* 16 *in the original design matrix are identical, column* 13 *is removed for our numerical experiments. For this dataset we consider the real case and* 3 *synthetic cases.*

We initially deal with the real case, where the design matrix *X* and the response vector *y* are given, without initial knowledge of the desired vector *β*. In the literature, it is reported that the active factor is $x_{15}$. In this case, according to our numerical experiments, the SVD regression and the *p*-*q* regularisation method for $p = 0.8$, $q = 0.1$ indicate that the factor $x_{15}$ is important. In particular, the proposed models, i.e., the coefficients $\beta_i$, are given in Table 5.

> The second case corresponds to the 3 synthetic models, see [8] and references therein, given below. For these simulated cases, we record the results in Table 6. In particular, we compute the Type I and Type II error rates of the described methods. We apply the *p*-*q* regularisation method for *μ* = 5 and the SVD regression for significance level *a* = 0.05. As the table shows, both the *p*-*q* regularisation and the SVD regression select the important factors adequately, as the corresponding Type I and II error rates confirm. The first model has the particularity that it includes the interaction of the factors *x*5 and *x*9, which does not usually appear in the analysis of SSDs, and it is a challenging case for all the methods.

Model 1: $y \sim N(15x_1 + 8x_5 - 6x_9 + 3x_5x_9,\ I_{14})$

Model 2: $y \sim N(8x_1 + 5x_{12},\ I_{14})$

Model 3: $y \sim N(10x_1 + 9x_2 + 2x_3,\ I_{14})$
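A minimal sketch of how the Type I and Type II error rates reported in Table 6 can be computed once a method has returned its set of selected factors. The helper `error_rates` is hypothetical (an illustration of the standard definitions, not a routine from the paper):

```python
def error_rates(selected, active, d):
    """Type I rate: fraction of inactive factors wrongly declared active.
    Type II rate: fraction of active factors missed by the method."""
    selected, active = set(selected), set(active)
    inactive = set(range(1, d + 1)) - active
    type1 = len(selected & inactive) / len(inactive)
    type2 = len(active - selected) / len(active)
    return type1, type2

# Model 2 of Example 4: active factors x1 and x12 among the d = 23
# remaining columns (after removing the duplicated column 13).
# Perfect selection gives error rates (0.0, 0.0).
t1, t2 = error_rates(selected={1, 12}, active={1, 12}, d=23)
print(t1, t2)  # 0.0 0.0
```

A method that selects nothing would score Type I = 0 but Type II = 1, which is why both rates are reported together.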

 **Table 4.** The Williams' data—rubber age data.


**Table 5.** The selected model for Williams' data (real case).


**Table 6.** Results for William's Data (synthetic cases).


**Example 5** (A 3-circulant SSD)**.** *In this example, we consider one more SSD, which is also used in [18] and is recorded in Table 7. We test the behaviour of the methods for variable selection by considering three models, which can be found in [19] and are given below.*

*Model 1:* $y \sim N(10x_1,\ I_8)$

*Model 2:* $y \sim N(-15x_1 + 8x_5 - 2x_9,\ I_8)$

*Model 3:* $y \sim N(-15x_1 + 12x_5 - 8x_9 + 6x_{13} - 2x_{17},\ I_8)$

*The results are presented in Table 8. For the three models, we apply the p-q regularisation method for μ* = 5, 5.5, 0.5*, respectively, and the SVD regression for a* = 0.25*. According to the presented numerical results, both the p-q regularisation and the SVD regression achieve satisfactory Type I and II error rates for Model 1. For Model 2, the SVD regression fails to specify the active factors, whereas the p-q regularisation method achieves a better Type II error. However, neither method produces fair results for Model 3; the coefficients of this model are not sufficiently close in magnitude, and this fact affects the behaviour of both methods.*

**Table 7.** A 3-circulant SSD.


**Table 8.** Results for 3-circulant SSD.


**Example 6.** *In this example we consider a real dataset, presented in [20], that deals with moss bags of Rhynchostegium riparioides exposed to different water concentrations of* 11 *trace elements under laboratory conditions. The design matrix X can be found in Table 1 of [20]. We consider the main effects and the second- and third-order interactions of the influential factors. Therefore, we have a* 67 × 232 *SSD, and we select the important factors by applying the p-q regularisation for μ* = 0.75 *and the SVD regression for significance level a* = 0.05*.*
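The column count of the 67 × 232 design can be verified by enumerating the effects: 11 main effects, C(11, 2) = 55 second-order and C(11, 3) = 165 third-order interactions, plus the intercept column. A quick check, assuming this decomposition:

```python
from itertools import combinations

# Count the columns of the SSD in Example 6: with 11 trace elements,
# sum the main effects and all 2nd- and 3rd-order interactions.
n_elements = 11
main = n_elements                                       # 11 main effects
second = len(list(combinations(range(n_elements), 2)))  # C(11,2) = 55
third = len(list(combinations(range(n_elements), 3)))   # C(11,3) = 165
total = 1 + main + second + third                       # intercept included
print(total)  # 232
```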

From Table 9, we see that both the *p*-*q* regularisation method for *p* = 2, *q* = 0.1 and the SVD regression identify the main effect Zn as an active factor. The second-order interactions Cd/Mn, As/Pb and Mn/Ni are also identified as active. These results are in agreement with [20].



## **5. Conclusions**

In the present work, we analysed the properties of design matrices with correlated covariates. Specifically, we derived and proved formulae for the singular values of these matrices and studied the connection of the generalised condition number with the correlation structure. Moreover, we described some available methods for solving high-dimensional problems. We examined the behaviour of the MNS and the necessity of applying regularisation techniques when estimating the regression parameter *β* in the linear regression model. We concluded that, when solving high-dimensional statistical problems, the following remarks must be taken into consideration.


In conclusion, the proposed scheme for selecting the appropriate method for the solution of high-dimensional statistical problems is summarised in the logical diagram of Figure 3.

**Figure 3.** Logical diagram for choosing the appropriate method for the solution of high-dimensional statistical problems.

> **Author Contributions:** Conceptualization, M.M.; methodology, M.M. and P.R.; software, S.C., A.P., P.R.; validation, M.M. and P.R.; formal analysis, S.C., A.P., P.R.; investigation, M.M.; data curation, S.C., A.P., P.R.; writing—original draft preparation, S.C., M.M., A.P., P.R.; writing—review and editing, S.C., M.M., A.P., P.R.; supervision, M.M.; project administration, M.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Acknowledgments:** The authors are grateful to the reviewers of this paper, whose valuable remarks improved this work.

**Conflicts of Interest:** The authors declare no conflict of interest.
