
Large Sample Behavior of the Least Trimmed Squares Estimator

Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
Mathematics 2024, 12(22), 3586; https://doi.org/10.3390/math12223586
Submission received: 30 September 2024 / Revised: 8 November 2024 / Accepted: 11 November 2024 / Published: 15 November 2024
(This article belongs to the Special Issue Advances in High-Dimensional Data Analysis)

Abstract

The least trimmed squares (LTS) estimator is popular in the location, regression, machine learning, and AI literature. Although the empirical version of the LTS has been studied repeatedly, the population version of the LTS has never been introduced and studied. This lack hinders the study of the large sample properties of the LTS via empirical process theory. Novel properties of the LTS objective function, in both the empirical and population settings, along with other properties, are established for the first time in this article. The primary properties of the objective function facilitate the establishment of other original results, including the influence function and Fisher consistency. Strong consistency is established, for the first time, with the help of a generalized Glivenko–Cantelli theorem over a class of functions. Differentiability and stochastic equicontinuity promote the establishment of asymptotic normality with a concise and novel approach.

1. Introduction

In classical multiple linear regression analysis, it is assumed that there is a relationship for a given data set $\{(\mathbf{x}_i, y_i), i \in \{1, \dots, n\}\}$:

$$y_i = (1, \mathbf{x}_i^\top)\boldsymbol{\beta}_0 + e_i, \quad i \in \{1, \dots, n\}, \tag{1}$$

where $y_i$ and $e_i$ (an error term, a random variable assumed in classic regression theory to have zero mean and unknown variance $\sigma^2$) are in $\mathbb{R}^1$, $\top$ stands for the transpose, $\boldsymbol{\beta}_0 = (\beta_{01}, \dots, \beta_{0p})^\top$ is the true unknown parameter, and $\mathbf{x}_i = (x_{i1}, \dots, x_{i(p-1)})^\top$ is in $\mathbb{R}^{p-1}$ ($p \geq 2$) and could be random. It is seen that $\beta_{01}$ is the intercept term. Writing $\mathbf{w}_i^\top = (1, \mathbf{x}_i^\top)$, one has $y_i = \mathbf{w}_i^\top\boldsymbol{\beta}_0 + e_i$. Classic assumptions such as linearity and homoscedasticity are implicitly assumed here; others will be introduced later when needed.
The goal is to estimate $\boldsymbol{\beta}_0$ based on the given sample $\mathbf{z}^{(n)} := \{(\mathbf{x}_i, y_i), i \in \{1, \dots, n\}\}$ (hereafter it is implicitly assumed that the observations are i.i.d. copies of a parent $(\mathbf{x}, y)$). For a candidate coefficient vector $\boldsymbol{\beta}$, call the difference between $y_i$ (observed) and $\mathbf{w}_i^\top\boldsymbol{\beta}$ (predicted) the $i$th residual $r_i(\boldsymbol{\beta})$ ($\boldsymbol{\beta}$ is often suppressed). That is,

$$r_i := r_i(\boldsymbol{\beta}) = y_i - \mathbf{w}_i^\top\boldsymbol{\beta}. \tag{2}$$
To estimate $\boldsymbol{\beta}_0$, the classic least squares (LS) estimator minimizes the sum of squared residuals,

$$\widehat{\boldsymbol{\beta}}_{ls} := \arg\min_{\boldsymbol{\beta} \in \mathbb{R}^p} \sum_{i=1}^n r_i^2.$$

Alternatively, one can replace the square above with the absolute value to obtain the least absolute deviations estimator (aka the $L_1$ estimator, in contrast to the $L_2$ (LS) estimator).
Due to its great computability and its optimal properties when the error $e_i$ follows a Gaussian distribution, the LS estimator is popular in practice across multiple disciplines. It can misbehave, however, when the error distribution departs even slightly from the Gaussian assumption, particularly when the errors are heavy-tailed or contain outliers. Both the $L_1$ and $L_2$ estimators have the worst possible asymptotic breakdown point, $0\%$, in sharp contrast to the $50\%$ of the least trimmed squares estimator [1]. The latter is one of the most robust alternatives to the LS estimator. Robust alternatives to the LS estimator are abundant in the literature; the most popular are M-estimators [2], least median squares (LMS) and least trimmed squares (LTS) estimators [3], S-estimators [4], MM-estimators [5], $\tau$-estimators [6], and maximum depth estimators [7,8,9], among others.
Although the M-estimator was the first robust alternative to the LS estimator, it has a poor breakdown point, $1/n$, just like the LS estimator. The MM-estimator can have a higher breakdown point, but that depends on its initial estimator, which must itself be highly breakdown-robust (such as the LTS). Thus the MM-estimator can achieve better efficiency than the LTS, but not better robustness.
Due to the cube-root consistency of the LMS [3] and its other drawbacks, the LTS is preferred over the LMS (see [10]). The LTS is popular in the literature in view of its fast computability and high robustness, and it often serves as the initial estimator for many high-breakdown iterative procedures (e.g., the S- and MM-estimators). The LTS is defined as the minimizer of the sum of the $h$ smallest squared residuals. Namely,

$$\widehat{\boldsymbol{\beta}}_{lts} := \arg\min_{\boldsymbol{\beta} \in \mathbb{R}^p} \sum_{i=1}^{h} r_{i:n}^2, \tag{3}$$

where $r_{1:n}^2 \leq r_{2:n}^2 \leq \cdots \leq r_{n:n}^2$ are the ordered squared residuals, $\lceil n/2 \rceil \leq h < n$, and $\lceil x \rceil$ is the ceiling function.
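For concreteness, here is a minimal Python sketch of the trimmed objective in (3): sort the squared residuals at a candidate $\boldsymbol{\beta}$ and sum the $h$ smallest. The helper name `lts_objective` is hypothetical, introduced only for illustration.

```python
import numpy as np

def lts_objective(beta, X, y, h):
    """Sum of the h smallest squared residuals at beta -- the LTS objective in (3)."""
    W = np.column_stack([np.ones(len(y)), X])   # rows w_i^T = (1, x_i^T)
    r2 = (y - W @ beta) ** 2                    # squared residuals r_i^2(beta)
    return np.sort(r2)[:h].sum()
```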
There are copious studies of the LTS in the literature. Most focus on its computation, e.g., [1,10,11,12,13,14,15,16,17,18].
The LTS has been extended to the penalized regression setting with a sparse model where the dimension $p$ (in the thousands) is much larger than the sample size $n$ (in the tens or hundreds); see, e.g., [19,20]. The resulting estimator performs outstandingly well, especially in terms of robustness.
Other studies of the LTS sporadically addressed the asymptotics. For example, Refs. [1,21] addressed the asymptotic normality of the LTS, but only in the location case, that is, when $p = 1$. Refs. [22,23,24] also addressed the asymptotics of the LTS, without employing advanced technical tools, in a series of three lengthy articles on consistency, root-$n$ consistency, and asymptotic normality, respectively. Their analysis is technically demanding, rests on difficult-to-verify assumptions A, B, C, and is furthermore limited to the case of non-random vectors $\mathbf{x}_i$. In this article, without those assumptions and limitations, those results are established concisely with the help of advanced empirical process theory.
Replacing $(1, \mathbf{x}_i^\top)\boldsymbol{\beta}_0$ by an unspecified nonlinear function $h(\mathbf{x}_i, \boldsymbol{\beta}_0)$, Refs. [25,26,27] discussed the asymptotics of the LTS in a nonlinear regression setting. Now that the more general nonlinear case has been addressed, one might wonder whether there is any merit to discussing the special linear case in this article.
There are at least three merits: (i) the nonlinear function $h(\mathbf{x}_i, \boldsymbol{\beta}_0)$ cannot always cover the linear case of $(1, \mathbf{x}_i^\top)\boldsymbol{\beta}_0$ for the usual LTS (e.g., in the exponential and power regression cases); (ii) many assumptions for the nonlinear case (see A1–A4 in [25]; H1–H6, D1, D2, I1, I2 in [26,27]), which are usually difficult to verify, can be dropped for the linear case, as demonstrated in this article; and (iii) a key assumption that $\{h(\mathbf{x}, \boldsymbol{\beta}), \boldsymbol{\beta} \in \Theta\}$ forms a VC class of functions over a compact parameter space $\Theta$ (see [25,26,27]) can be verified directly in this article.
To avoid all the drawbacks and limitations discussed above and take advantage of the standard results of empirical process theory, this article defines the population version of the LTS (Section 2.1), introduces a novel partition of the parameter space (Section 2.2), and investigates, for the first time, the primary properties of the objective function of the LTS in both the empirical and population settings (Section 2). The obtained novel results facilitate the verification of some fundamental assumptions conveniently made previously in the literature. The major contributions of this article thus include the following:
(a) Introducing a novel partition of the parameter space and defining an original population version of the LTS for the first time;
(b) Investigating primary properties of the sample and population versions of the objective function for the LTS, obtaining original results;
(c) For the first time, obtaining the influence function ($p \geq 2$) and Fisher consistency for the LTS;
(d) For the first time, establishing the strong consistency of the sample LTS via a generalized Glivenko–Cantelli theorem without artificial assumptions; and
(e) For the first time, employing a novel and concise approach based on empirical process theory to establish the asymptotic normality of the sample LTS.
The rest of the article is organized as follows. Section 2 introduces, for the first time, the population version of the LTS and addresses the properties of the LTS estimator in both the empirical and population settings, including the global continuity and local differentiability and convexity of its objective function; its influence function (in the $p \geq 2$ case) and Fisher consistency are established for the first time. Section 3 establishes the strong consistency via a generalized Glivenko–Cantelli theorem and re-establishes the asymptotic normality of the estimator via a very different and concise approach (stochastic equicontinuity) rather than the previous approaches in the literature. Section 4 addresses asymptotic inference procedures based on the asymptotic normality and on bootstrapping. Concluding remarks in Section 5 end the article. Major proofs are deferred to Appendix A.

2. Definition and Properties of the LTS

2.1. Definition

Denote by $F_{(\mathbf{x},y)}$ the joint distribution of $\mathbf{x}$ and $y$ in model (1). Throughout, $F_Z$ stands for the distribution function of the random vector $Z$. For a given $\boldsymbol{\beta} \in \mathbb{R}^p$ and an $\alpha \in [1/2, c]$, $1/2 < c < 1$, let $q(\boldsymbol{\beta}, \alpha) = F_W^{-1}(\alpha)$ be the $\alpha$th quantile of $F_W$ with $W := W(\boldsymbol{\beta}) = (y - \mathbf{w}^\top\boldsymbol{\beta})^2$, where $\mathbf{w}^\top = (1, \mathbf{x}^\top)$. The $c = 1$ case is excluded to avoid an unbounded $q(\boldsymbol{\beta}, \alpha)$ and the LS case. Define an objective function

$$O(F_{(\mathbf{x},y)}, \boldsymbol{\beta}, \alpha) = \int (y - (1, \mathbf{x}^\top)\boldsymbol{\beta})^2\,\mathbb{1}\big((y - (1, \mathbf{x}^\top)\boldsymbol{\beta})^2 \leq q(\boldsymbol{\beta}, \alpha)\big)\,dF_{(\mathbf{x},y)}(\mathbf{x}, y), \tag{4}$$

and a regression functional

$$\boldsymbol{\beta}_{lts}(F_{(\mathbf{x},y)}, \alpha) = \arg\min_{\boldsymbol{\beta} \in \mathbb{R}^p} O(F_{(\mathbf{x},y)}, \boldsymbol{\beta}, \alpha), \tag{5}$$

where $\mathbb{1}(A)$ is the indicator of $A$ (i.e., it is one if $A$ holds and zero otherwise). Let $F^n_{(\mathbf{x},y)}$ be the empirical version of $F_{(\mathbf{x},y)}$ based on a sample $\mathbf{z}^{(n)} := \{(\mathbf{x}_i, y_i), i \in \{1, 2, \dots, n\}\}$; $F^n_{(\mathbf{x},y)}$ and $\mathbf{z}^{(n)}$ will be used interchangeably. Using $F^n_{(\mathbf{x},y)}$, one obtains the sample version

$$O(F^n_{(\mathbf{x},y)}, \boldsymbol{\beta}, \alpha) = \frac{1}{n}\sum_{i=1}^{\lfloor \alpha n \rfloor + 1} r_{i:n}^2, \tag{6}$$

where $\lfloor x \rfloor$ is the floor function. Further,

$$\widehat{\boldsymbol{\beta}}^n_{lts} := \boldsymbol{\beta}_{lts}(F^n_{(\mathbf{x},y)}, \alpha) = \arg\min_{\boldsymbol{\beta} \in \mathbb{R}^p} O(F^n_{(\mathbf{x},y)}, \boldsymbol{\beta}, \alpha). \tag{7}$$

It is readily seen that $\widehat{\boldsymbol{\beta}}^n_{lts}$ above is identical to $\widehat{\boldsymbol{\beta}}_{lts}$ in (3) with $h = \lfloor \alpha n \rfloor + 1$. Henceforth, we treat $\widehat{\boldsymbol{\beta}}^n_{lts}$ rather than $\widehat{\boldsymbol{\beta}}_{lts}$ in (3).
The first natural question concerns the existence of the minimizer on the right-hand side (RHS) of (7), that is, the existence of $\widehat{\boldsymbol{\beta}}^n_{lts}$. Does it always exist? If it exists, is it unique? Unique existence is a key precondition for the study of the asymptotics of an estimator.
One might take existence for granted, since the objective function is non-negative and has a finite infimum that can be approximated by the objective values of a sequence of $\boldsymbol{\beta}$s; there is then a subsequence of $\boldsymbol{\beta}$s whose objective values converge to the infimum, which is a minimum by the continuity of the objective function, and the subsequence converges to a point $\boldsymbol{\beta}^*$, the minimizer of the RHS. There are multiple gaps in this argument, however: the convergence of the $\boldsymbol{\beta}$ subsequence (to a minimizer) and the continuity of the objective function both need to be proved. In the sequel, we take a different approach.

2.2. Properties in the Empirical Case

Write $O_n(\boldsymbol{\beta})$ and $\mathbb{1}_i$ for $O(F^n_{(\mathbf{x},y)}, \boldsymbol{\beta}, \alpha)$ and $\mathbb{1}(r_i^2 \leq r_{h:n}^2)$, respectively. It is seen that

$$O_n(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^n r_i^2\,\mathbb{1}(r_i^2 \leq r_{h:n}^2) = \frac{1}{n}\sum_{i=1}^n r_i^2\,\mathbb{1}_i, \tag{8}$$

where $h = \lfloor \alpha n \rfloor + 1$. The factor $1/n$ will often be ignored in the following discussion.
  • Existence and uniqueness
  • Partitioning the parameter space
For a given sample $\mathbf{z}^{(n)}$, an $\alpha$ (or $h$), and any $\boldsymbol{\beta}_1 \in \mathbb{R}^p$, let $r_{i:n}^2(\boldsymbol{\beta}_1) = r_{k_i}^2(\boldsymbol{\beta}_1)$ for an integer $k_i$. Note that $r_j$ and $k_i$ depend on $\boldsymbol{\beta}_1$, i.e., $r_j := r_j(\boldsymbol{\beta}_1)$, $k_i := k_i(\boldsymbol{\beta}_1)$. Obviously $r_{k_1}^2 \leq r_{k_2}^2 \leq \cdots \leq r_{k_n}^2$. Call $\{k_i, 1 \leq i \leq h\}$ the $\boldsymbol{\beta}_1$-$h$-integer set. If $r_i^2 \neq r_j^2$ for any distinct $i$ and $j$, then the $h$-integer set is unique. Hereafter, we assume (A0): $W$ has a density for any given $\boldsymbol{\beta}$. Then, almost surely (a.s.), the $h$-integer set is unique.
Consider the unique case. There can be other $\boldsymbol{\beta}$s in $\mathbb{R}^p$ that share the same $h$-integer set as $\boldsymbol{\beta}_1$. Denote the set of such points by

$$S_{\boldsymbol{\beta}_1} := \big\{\boldsymbol{\beta} \in \mathbb{R}^p : k_i(\boldsymbol{\beta}) = k_i(\boldsymbol{\beta}_1) = k_i,\ i \in \{1, 2, \dots, h\},\ \{k_i, 1 \leq i \leq h\} \text{ is unique}\big\}. \tag{9}$$

If (A0) holds, then $S_{\boldsymbol{\beta}_1} \neq \emptyset$ (a.s.). If it is $\mathbb{R}^p$, then we have a trivial case (see Remark 1 below). Otherwise, there are only finitely many such sets (for a fixed $n$) that partition $\mathbb{R}^p$: $\bigcup_{l=1}^{L} \overline{S}_{\boldsymbol{\beta}_l} = \mathbb{R}^p$, where the $S_{\boldsymbol{\beta}_l}$s are defined similarly to (9) and are disjoint for different $l$, $1 \leq l \leq L := \binom{n}{h}$, and $\overline{A}$ is the closure of the set $A$. Write $\mathbf{X}_n = (\mathbf{w}_1, \dots, \mathbf{w}_n)^\top$, an $n \times p$ matrix. Assume (A1): $\mathbf{X}_n$ and any $h$ of its rows have full rank $p$. As with the R function ltsReg in the R package robustbase (version 0.99-4-1), we hereafter assume that $p < n/2$.
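The $h$-integer set is easy to compute, and two parameter vectors lie in the same piece $S_{\boldsymbol{\beta}}$ of the partition (9) precisely when they share the set. A minimal sketch (the helper name `h_integer_set` is hypothetical):

```python
import numpy as np

def h_integer_set(beta, X, y, h):
    """The beta-h-integer set {k_1, ..., k_h}: indices of the h smallest squared
    residuals at beta, in increasing order of r_k^2; unique a.s. under (A0)."""
    W = np.column_stack([np.ones(len(y)), X])
    r2 = (y - W @ beta) ** 2
    return tuple(np.argsort(r2)[:h])

# beta1 and beta2 lie in the same piece S_beta of the partition (9) exactly when
# h_integer_set(beta1, X, y, h) == h_integer_set(beta2, X, y, h).
```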
Lemma 1.
Assume that (A0) and (A1) hold. Then,
(i) (a) For any $l$ ($1 \leq l \leq L$), $r^2_{k_1(\boldsymbol{\beta}_l)} < r^2_{k_2(\boldsymbol{\beta}_l)} < \cdots < r^2_{k_h(\boldsymbol{\beta}_l)}$ over $S_{\boldsymbol{\beta}_l}$.
(b) For any $\boldsymbol{\eta} \in S_{\boldsymbol{\beta}_l}$, there exists an open ball $B(\boldsymbol{\eta}, \delta)$ centered at $\boldsymbol{\eta}$ with radius $\delta > 0$ such that for any $\boldsymbol{\beta} \in B(\boldsymbol{\eta}, \delta)$,

$$O_n(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^{h} r^2_{k_i(\boldsymbol{\beta})}(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^{h} r^2_{k_i(\boldsymbol{\eta})}(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^{h} r^2_{k_i(\boldsymbol{\beta}_l)}(\boldsymbol{\beta}) \quad (a.s.). \tag{10}$$

(ii) The graph of $O_n(\boldsymbol{\beta})$ over $\boldsymbol{\beta} \in \mathbb{R}^p$ is composed of the $L$ closures of the graphs of the quadratic functions $\frac{1}{n}\sum_{i=1}^{h} r^2_{k_i(\boldsymbol{\beta}_l)}(\boldsymbol{\beta})$ of $\boldsymbol{\beta}$ (one for each $l$, $1 \leq l \leq L$), joined together.
(iii) $O_n(\boldsymbol{\beta})$ is continuous in $\boldsymbol{\beta} \in \mathbb{R}^p$.
(iv) $O_n(\boldsymbol{\beta})$ is differentiable and strictly convex over each $S_{\boldsymbol{\beta}_l}$, $1 \leq l \leq L$.
Proof. 
See the Appendix A. □
Remark 1.
(a) If $S_{\boldsymbol{\beta}_0} = \mathbb{R}^p$, then $O_n(\boldsymbol{\beta})$ is a twice differentiable and strictly convex quadratic function of $\boldsymbol{\beta}$, and the existence and uniqueness of $\widehat{\boldsymbol{\beta}}^n_{lts}$ are trivial as long as $\mathbf{X}_n$ has full rank.
(b) Replacing $(1, \mathbf{x}_i^\top)\boldsymbol{\beta}$ by a nonlinear $h(\mathbf{x}_i, \boldsymbol{\beta})$ and conveniently assuming that (i) $F_W$ is twice differentiable around the points corresponding to the square roots of the $\alpha$-quantiles of $W$, (ii) $h(\mathbf{x}_i, \boldsymbol{\beta})$ is continuous over the parameter space $B$, (iii) $h(\mathbf{x}_i, \boldsymbol{\beta})$ is twice differentiable in $\boldsymbol{\beta}$ for $\boldsymbol{\beta} \in B(\boldsymbol{\beta}_0, \delta)$ a.s., and (iv) $\partial h(\mathbf{x}_i, \boldsymbol{\beta})/\partial\boldsymbol{\beta}$ is continuous in $\boldsymbol{\beta}$, Refs. [26,27] also addressed the continuity and differentiability of the objective function of the LTS. Those assumptions were never verified in [26,27], though; here they are proved (or not required) in Lemma 1.
(c) Inferring continuity and differentiability just from $O_n(\boldsymbol{\beta})$ being a sum of $h$ continuous and differentiable functions (squared residuals), without (i) above or (10), might not be flawless. In general, $O_n(\boldsymbol{\beta})$ is neither differentiable nor convex in $\boldsymbol{\beta}$ globally.
Let $\mathbf{y}_n := (y_1, \dots, y_n)^\top$ and $M_n := M(\mathbf{y}_n, \mathbf{X}_n, \boldsymbol{\beta}, \alpha) = \sum_{i=1}^n \mathbf{w}_i\mathbf{w}_i^\top\mathbb{1}_i = \sum_{i=1}^h \mathbf{w}_{k_i(\boldsymbol{\beta})}\mathbf{w}_{k_i(\boldsymbol{\beta})}^\top$. Note that $\mathbb{1}_i$ depends on $\boldsymbol{\beta}$.
Theorem 1.
Assume that (A0) and (A1) hold. Then,
(i) $\widehat{\boldsymbol{\beta}}^n_{lts}$ exists and is a local minimum of $O_n(\boldsymbol{\beta})$ over $S_{\boldsymbol{\beta}_{l_0}}$ for some $l_0$ ($1 \leq l_0 \leq L$).
(ii) Over $S_{\boldsymbol{\beta}_{l_0}}$, $\widehat{\boldsymbol{\beta}}^n_{lts}$ is the solution of the system of equations

$$\sum_{i=1}^n (y_i - \mathbf{w}_i^\top\boldsymbol{\beta})\,\mathbf{w}_i\,\mathbb{1}_i = 0. \tag{11}$$

(iii) Over $S_{\boldsymbol{\beta}_{l_0}}$, the unique solution is

$$\widehat{\boldsymbol{\beta}}^n_{lts} = M_n(\mathbf{y}_n, \mathbf{X}_n, \widehat{\boldsymbol{\beta}}^n_{lts}, \alpha)^{-1}\sum_{i=1}^h y_{k_i(\boldsymbol{\beta}_{l_0})}\mathbf{w}_{k_i(\boldsymbol{\beta}_{l_0})}. \tag{12}$$
Proof. 
The given conditions and Lemma 1 allow one to focus on a single piece $S_{\boldsymbol{\beta}_l}$, $1 \leq l \leq L$; all results then follow in a straightforward fashion. For more details, see Appendix A. □
Remark 2.
(a) Unique existence, which is often implicitly assumed or ignored in the literature, is central to the discussion of the asymptotics of $\widehat{\boldsymbol{\beta}}^n_{lts}$. The existence of $\widehat{\boldsymbol{\beta}}^n_{lts}$ could also be established under the assumption that no $(n+1)/2$ sample points of $\mathbf{z}^{(n)}$ are contained in any $(p-1)$-dimensional hyperplane, similarly to Theorem 2.2 for the LST in [28]. It is established here without such an assumption, nevertheless.
(b) A sufficient condition for the invertibility of $M_n$ is that any $h$ rows of $\mathbf{X}_n$ form a full-rank sub-matrix, which is true if (A1) holds.
(c) Ref. [22] also addressed the existence of $\widehat{\boldsymbol{\beta}}^n_{lts}$ (Assertion 1) for non-random covariates (carriers) satisfying many demanding assumptions (A, B) that were never verified. The uniqueness was left unaddressed, though.
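The closed form (12) suggests a natural fixed-point iteration: fix the current $h$-integer set, refit least squares on those $h$ observations, and repeat. This is the "C-step" at the heart of FAST-LTS [10,14]. A minimal Python sketch follows (the helper name `concentration_steps` and the single user-supplied start are illustrative assumptions; a practical solver launches many random starts and keeps the best fit):

```python
import numpy as np

def concentration_steps(X, y, h, beta0, max_iter=100):
    """Iterate the closed form (12): given the current h-integer set, refit
    least squares on those h observations. Each step cannot increase O_n,
    so the iteration settles on one piece S_beta_l."""
    W = np.column_stack([np.ones(len(y)), X])
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        idx = np.argsort((y - W @ beta) ** 2)[:h]   # current h-integer set
        beta_new, *_ = np.linalg.lstsq(W[idx], y[idx], rcond=None)
        if np.allclose(beta_new, beta):
            break
        beta = beta_new
    return beta
```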

2.3. Properties in the Population Case

The best breakdown point of the LTS (see p. 132 of [1]) reflects its global robustness. We now examine its local robustness via the influence function, to complete the picture of its robustness.

2.3.1. Definition of Influence Function

For a distribution $F$ on $\mathbb{R}^p$ and an $\varepsilon \in (0, 1/2)$, the version of $F$ contaminated by an $\varepsilon$ amount of an arbitrary distribution $G$ on $\mathbb{R}^p$ is denoted by $F(\varepsilon, G) = (1 - \varepsilon)F + \varepsilon G$ (an $\varepsilon$-amount deviation from the assumed $F$). $F(\varepsilon, G)$ is a convex contamination of $F$; there are other types of contamination, such as contamination in total variation or Hellinger distance. We cite the definition given in [29].
Definition 1
([29]). The influence function (IF) of a functional $\mathbf{t}$ at a given point $\mathbf{x} \in \mathbb{R}^p$ for a given $F$ is defined as

$$\operatorname{IF}(\mathbf{x}; \mathbf{t}, F) = \lim_{\varepsilon \to 0^+} \frac{\mathbf{t}(F(\varepsilon, \delta_{\mathbf{x}})) - \mathbf{t}(F)}{\varepsilon}, \tag{13}$$

where $\delta_{\mathbf{x}}$ is the point-mass probability measure at $\mathbf{x} \in \mathbb{R}^p$.
The function $\operatorname{IF}(\mathbf{x}; \mathbf{t}, F)$ describes the relative effect (influence) on $\mathbf{t}$ of an infinitesimal point-mass contamination at $\mathbf{x}$ and measures the local robustness of $\mathbf{t}$.
To establish the IF of the functional $\boldsymbol{\beta}_{lts}(F_{(\mathbf{x},y)}, \alpha)$, we first need to show its existence and uniqueness with and without point-mass contamination. To that end, write

$$F_\varepsilon(z) := F(\varepsilon, \delta_z) = (1 - \varepsilon)F_{(\mathbf{x},y)} + \varepsilon\delta_z, \tag{14}$$

with $\mathbf{u} = (\mathbf{s}^\top, t)^\top \in \mathbb{R}^p$, $\mathbf{s} \in \mathbb{R}^{p-1}$, $t \in \mathbb{R}^1$, the corresponding random vector (i.e., $F_{\mathbf{u}} = F_\varepsilon(z) = F(\varepsilon, \delta_z)$). The versions of (4) and (5) at the contaminated $F(\varepsilon, \delta_z)$ are, respectively,

$$O(F_\varepsilon(z), \boldsymbol{\beta}, \alpha) = \int (t - (1, \mathbf{s}^\top)\boldsymbol{\beta})^2\,\mathbb{1}\big((t - (1, \mathbf{s}^\top)\boldsymbol{\beta})^2 \leq q_\varepsilon(z, \boldsymbol{\beta}, \alpha)\big)\,dF_{\mathbf{u}}(\mathbf{s}, t), \tag{15}$$

with $q_\varepsilon(z, \boldsymbol{\beta}, \alpha)$ the $\alpha$th quantile of the distribution of $(t - \mathbf{v}^\top\boldsymbol{\beta})^2$, $\mathbf{v}^\top = (1, \mathbf{s}^\top)$, and

$$\boldsymbol{\beta}_{lts}(F_\varepsilon(z), \alpha) = \arg\min_{\boldsymbol{\beta} \in \mathbb{R}^p} O(F_\varepsilon(z), \boldsymbol{\beta}, \alpha).$$

For $\boldsymbol{\beta}_{lts}(F_{(\mathbf{x},y)}, \alpha)$ defined in (5) and $\boldsymbol{\beta}_{lts}(F_\varepsilon(z), \alpha)$ above, we have a result analogous to Theorem 1. (Assume that the counterpart of model (1) is $y = (1, \mathbf{x}^\top)\boldsymbol{\beta}_0 + e = \mathbf{w}^\top\boldsymbol{\beta}_0 + e$.) Before deriving the influence function, we need to establish existence and uniqueness.

2.3.2. Existence and Uniqueness

Write $O(\boldsymbol{\beta})$ for $O(F_{(\mathbf{x},y)}, \boldsymbol{\beta}, \alpha)$ in (4). To obtain a counterpart of Lemma 1, we need (A2): $W$ has a positive density in a small neighborhood of $q(\boldsymbol{\beta}, \alpha)$ for the given $\alpha$ and $\boldsymbol{\beta}$.
Lemma 2.
Assume (A2) holds and $E(\mathbf{w}\mathbf{w}^\top)$ exists. Then,
(i) $O(\boldsymbol{\beta})$ is continuous in $\boldsymbol{\beta} \in \mathbb{R}^p$;
(ii) $O(\boldsymbol{\beta})$ is twice differentiable in $\boldsymbol{\beta} \in \mathbb{R}^p$, with

$$\partial^2 O(\boldsymbol{\beta})/\partial\boldsymbol{\beta}^2 = 2E\big(\mathbf{w}\mathbf{w}^\top\mathbb{1}\big((y - \mathbf{w}^\top\boldsymbol{\beta})^2 \leq q(\boldsymbol{\beta}, \alpha)\big)\big);$$

(iii) $O(\boldsymbol{\beta})$ is strictly convex in $\boldsymbol{\beta} \in \mathbb{R}^p$.
Proof. 
The boundedness of the integrand in (4), the given conditions, and the Lebesgue dominated convergence theorem lead to the desired results. For details, see Appendix A. □
Note that (ii) and (iii) above are global in $\boldsymbol{\beta}$, stronger than their empirical counterparts; the difference is entirely attributable to the boundary-of-$S_{\boldsymbol{\beta}_l}$ issue. We now treat the existence and uniqueness of $\boldsymbol{\beta}_{lts}$, which is central to the study of the asymptotics.
Theorem 2.
Assume (A2) holds, $E(\mathbf{w}\mathbf{w}^\top)$ exists, $q_\varepsilon(z, \boldsymbol{\beta}, \alpha)$ is continuous in $\boldsymbol{\beta}$, and $P\big((t - \mathbf{v}^\top\boldsymbol{\beta})^2 = q_\varepsilon(z, \boldsymbol{\beta}, \alpha)\big) = 0$ for any $\boldsymbol{\beta} \in \mathbb{R}^p$ and the given $\alpha$. Then,
(i) $\boldsymbol{\beta}_{lts}(F_{(\mathbf{x},y)}, \alpha)$ and $\boldsymbol{\beta}_{lts}(F_\varepsilon(z), \alpha)$ exist.
(ii) Furthermore, they are the solutions of the systems of equations, respectively,

$$\int (y - (1, \mathbf{x}^\top)\boldsymbol{\beta})(1, \mathbf{x}^\top)^\top\,\mathbb{1}\big((y - (1, \mathbf{x}^\top)\boldsymbol{\beta})^2 \leq q(\boldsymbol{\beta}, \alpha)\big)\,dF_{(\mathbf{x},y)}(\mathbf{x}, y) = 0, \tag{16}$$

$$\int (t - (1, \mathbf{s}^\top)\boldsymbol{\beta})(1, \mathbf{s}^\top)^\top\,\mathbb{1}\big((t - (1, \mathbf{s}^\top)\boldsymbol{\beta})^2 \leq q_\varepsilon(z, \boldsymbol{\beta}, \alpha)\big)\,dF_{\mathbf{u}}(\mathbf{s}, t) = 0. \tag{17}$$

(iii) $\boldsymbol{\beta}_{lts}(F_{(\mathbf{x},y)}, \alpha)$ and $\boldsymbol{\beta}_{lts}(F_\varepsilon(z), \alpha)$ are unique provided that

$$\int (1, \mathbf{x}^\top)^\top(1, \mathbf{x}^\top)\,\mathbb{1}\big((y - (1, \mathbf{x}^\top)\boldsymbol{\beta})^2 \leq q(\boldsymbol{\beta}, \alpha)\big)\,dF_{(\mathbf{x},y)}(\mathbf{x}, y), \tag{18}$$

$$\int (1, \mathbf{s}^\top)^\top(1, \mathbf{s}^\top)\,\mathbb{1}\big((t - (1, \mathbf{s}^\top)\boldsymbol{\beta})^2 \leq q_\varepsilon(z, \boldsymbol{\beta}, \alpha)\big)\,dF_{\mathbf{u}}(\mathbf{s}, t) \tag{19}$$

are invertible for $\boldsymbol{\beta}$ in a small neighborhood of $\boldsymbol{\beta}_{lts}(F_{(\mathbf{x},y)}, \alpha)$ and $\boldsymbol{\beta}_{lts}(F_\varepsilon(z), \alpha)$, respectively.
Proof. 
In light of Lemma 2, the proof is straightforward, see Appendix A. □
The continuity of $q_\varepsilon(z, \boldsymbol{\beta}, \alpha)$ in $\boldsymbol{\beta}$ is necessary for the differentiability of $O(F_\varepsilon(z), \boldsymbol{\beta}, \alpha)$. In the non-contaminated case, the continuity of $q(\boldsymbol{\beta}, \alpha)$ is guaranteed by (A2).
Does the population version of the LTS, $\boldsymbol{\beta}_{lts}$ defined in (5), have anything to do with $\boldsymbol{\beta}_0$? It turns out that under some conditions they are identical, a property called Fisher consistency.

2.3.3. Fisher Consistency

Theorem 3.
Assume (A2) holds and $E(\mathbf{w}\mathbf{w}^\top)$ exists. Then $\boldsymbol{\beta}_{lts}(F_{(\mathbf{x},y)}, \alpha) = \boldsymbol{\beta}_0$ provided that
(i) $E_{(\mathbf{x},y)}\big[\mathbf{w}\mathbf{w}^\top\mathbb{1}\big(r(\boldsymbol{\beta})^2 \leq F^{-1}_{r(\boldsymbol{\beta})^2}(\alpha)\big)\big]$ is invertible, and
(ii) $E_{(\mathbf{x},y)}\big[e\,\mathbf{w}\,\mathbb{1}\big(e^2 \leq F^{-1}_{e^2}(\alpha)\big)\big] = 0$, where $r(\boldsymbol{\beta}) = y - \mathbf{w}^\top\boldsymbol{\beta}$.
Proof. 
Theorem 2 leads directly to the desired result; see Appendix A. □

2.3.4. Influence Function

Theorem 4.
Assume that the assumptions of Theorem 2 hold. Set $\boldsymbol{\beta}_{lts} := \boldsymbol{\beta}_{lts}(F_{(\mathbf{x},y)}, \alpha)$. Then, for any $z_0 := (\mathbf{s}_0^\top, t_0)^\top \in \mathbb{R}^p$, we have

$$\operatorname{IF}(z_0; \boldsymbol{\beta}_{lts}, F_{(\mathbf{x},y)}) = \begin{cases} 0, & \text{if } (t_0 - \mathbf{v}_0^\top\boldsymbol{\beta}_{lts})^2 > q(\boldsymbol{\beta}_{lts}, \alpha), \\ M^{-1}(t_0 - \mathbf{v}_0^\top\boldsymbol{\beta}_{lts})\mathbf{v}_0, & \text{otherwise}, \end{cases} \tag{20}$$

provided that $M = E_{(\mathbf{x},y)}\big[\mathbf{w}\mathbf{w}^\top\mathbb{1}\big(r(\boldsymbol{\beta}_{lts})^2 \leq q(\boldsymbol{\beta}_{lts}, \alpha)\big)\big]$ is invertible, where $\mathbf{v}_0^\top = (1, \mathbf{s}_0^\top)$.
Proof. 
The connection to the derivative of a functional is the key, see Appendix A. □
Remark 3.
(a) When $p = 1$, the problem in model (1) becomes a location problem (see p. 158 of [1]), and the IF of the LTS estimation functional is given on p. 191 of [1]. In the location setting, Ref. [30] also studied the IF of the LTS. When $p = 2$, namely in the simple regression case, Ref. [31] studied the IF of the sparse-LTS functional under the assumption that $\mathbf{x}$ and $e$ are independent and normally distributed. Under stringent assumptions on the error terms $e_i$ and on $\mathbf{x}$, Ref. [21] also addressed the IF of the LTS for any $p$, but with the point mass placed at $(\mathbf{x}, z)$, $z$ being the error term — an unusual contaminating point. The result above is much more general and valid for any $p \geq 1$, $\mathbf{x}$, and $e$.
(b) The influence function of $\boldsymbol{\beta}_{lts}$ remains bounded if the contaminating point $(\mathbf{s}_0^\top, t_0)$ does not follow the model (i.e., if its residual is extremely large), in particular for bad leverage points and vertical outliers. This shows the good robustness properties of the LTS.
(c) The influence function of $\boldsymbol{\beta}_{lts}$, unfortunately, might be unbounded (in the $p > 1$ case), sharing this drawback with the sparse-LTS (in the $p = 2$ case), as shown in [31]. Trimming based on residuals (or squared residuals) suffers this type of drawback, since the term $\mathbf{w}^\top\boldsymbol{\beta}$ can be bounded while $\mathbf{x}$ might not be.
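Formula (20) is straightforward to evaluate numerically. The sketch below (the name `lts_influence` is hypothetical) estimates $M$ and $q(\boldsymbol{\beta}_{lts}, \alpha)$ by Monte Carlo under the spherical standard-normal model of Section 4 with $\boldsymbol{\beta}_{lts} = 0$, illustrating both the zero influence of vertical outliers and the unboundedness in $\mathbf{x}$ noted in Remark 3 (b) and (c):

```python
import numpy as np

def lts_influence(s0, t0, alpha=0.5, p=2, n_mc=200_000, seed=0):
    """Monte Carlo evaluation of the influence function (20) at z0 = (s0, t0),
    assuming (x, y) spherical standard normal and beta_lts = 0; a sketch only."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_mc, p - 1))
    y = rng.standard_normal(n_mc)          # residual r = y at beta_lts = 0
    w = np.column_stack([np.ones(n_mc), x])
    q = np.quantile(y ** 2, alpha)         # estimate of q(beta_lts, alpha)
    keep = y ** 2 <= q
    M = w[keep].T @ w[keep] / n_mc         # estimate of E[w w^T 1(r^2 <= q)]
    v0 = np.concatenate(([1.0], np.atleast_1d(s0)))
    r0 = t0                                # residual of z0 at beta_lts = 0
    return np.zeros(p) if r0 ** 2 > q else np.linalg.solve(M, r0 * v0)

print(lts_influence(s0=0.3, t0=100.0))    # vertical outlier: zero influence
print(lts_influence(s0=50.0, t0=0.5))     # kept point with huge |x|: large norm
```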

3. Asymptotic Properties

Refs. [22,23,24] rigorously addressed the consistency, root-$n$ consistency, and asymptotic normality of the LTS in a restrictive setting (the $\mathbf{x}_i$s are non-random covariates), plus many assumptions on the $\mathbf{x}_i$s and on the distribution of $e_i$, in a series of three lengthy papers.
Refs. [26,27] also addressed the asymptotic properties of an extended LTS under $\beta$-mixing conditions for $\mathbf{x}_i$ with a nonlinear regression function $h(\mathbf{x}_i, \boldsymbol{\beta})$ — a seemingly more general setting, but with numerous artificial assumptions (H1–H6; D1, D2; I1, I2) that were never verified in any concrete example, not even for the linear LTS case. That is, Refs. [26,27] do not cover the LTS in (3).
Here, we address the asymptotic properties of $\widehat{\boldsymbol{\beta}}^n_{lts}$ without the artificial assumptions made in the literature for the LTS with a nonlinear regression function. Strong consistency was addressed in [25] in a nonlinear setting without verification of their conveniently assumed key assumptions for the linear LTS. We now rigorously establish strong consistency.

3.1. Strong Consistency

Following the notation of [32], write

$$O(\boldsymbol{\beta}, P) := O(F_{(\mathbf{x},y)}, \boldsymbol{\beta}, \alpha) = P\big[(y - \mathbf{w}^\top\boldsymbol{\beta})^2\,\mathbb{1}\big(r(\boldsymbol{\beta})^2 \leq F^{-1}_{r(\boldsymbol{\beta})^2}(\alpha)\big)\big] = Pf,$$
$$O(\boldsymbol{\beta}, P_n) := O(F^n_{(\mathbf{x},y)}, \boldsymbol{\beta}, \alpha) = \frac{1}{n}\sum_{i=1}^n r_i^2\,\mathbb{1}(r_i^2 \leq r_{h:n}^2) = P_n f,$$

where $f := f(\mathbf{x}, y, \boldsymbol{\beta}, \alpha) = (y - \mathbf{w}^\top\boldsymbol{\beta})^2\,\mathbb{1}\big(r(\boldsymbol{\beta})^2 \leq F^{-1}_{r(\boldsymbol{\beta})^2}(\alpha)\big)$, and $\alpha$ and $h = \lfloor \alpha n \rfloor + 1$ are fixed.
Under the corresponding assumptions in Theorems 1 and 2, $\widehat{\boldsymbol{\beta}}^n_{lts}$ and $\boldsymbol{\beta}_{lts}$ are the unique minimizers of $O(\boldsymbol{\beta}, P_n)$ and $O(\boldsymbol{\beta}, P)$, respectively.
To show that $\widehat{\boldsymbol{\beta}}^n_{lts}$ converges to $\boldsymbol{\beta}_{lts}$ a.s., one could take the approach given in Section 4.2 of [28]. Here, however, we take a different and more direct approach.
It suffices to prove that $O(\widehat{\boldsymbol{\beta}}^n_{lts}, P) \to O(\boldsymbol{\beta}_{lts}, P)$ a.s., because $O(\boldsymbol{\beta}, P)$ is bounded away from $O(\boldsymbol{\beta}_{lts}, P)$ outside each neighborhood of $\boldsymbol{\beta}_{lts}$, in light of continuity and compactness. Let $\Theta$ be a closed ball centered at $\boldsymbol{\beta}_{lts}$ with radius $r > 0$. Define a class of functions

$$\mathcal{F}(\boldsymbol{\beta}, \alpha) = \Big\{f(\mathbf{x}, y, \boldsymbol{\beta}, \alpha) = (y - \mathbf{w}^\top\boldsymbol{\beta})^2\,\mathbb{1}\big(r(\boldsymbol{\beta})^2 \leq F^{-1}_{r(\boldsymbol{\beta})^2}(\alpha)\big) : \boldsymbol{\beta} \in \Theta,\ \alpha \in [1/2, c]\Big\}. \tag{21}$$

If we prove the uniform almost sure convergence of $P_n$ to $P$ over $\mathcal{F}$ (see Lemma 3 below), then $O(\widehat{\boldsymbol{\beta}}^n_{lts}, P) \to O(\boldsymbol{\beta}_{lts}, P)$ a.s. can be deduced from

$$O(\widehat{\boldsymbol{\beta}}^n_{lts}, P_n) - O(\widehat{\boldsymbol{\beta}}^n_{lts}, P) \to 0 \ \ \text{(in light of Lemma 3)}, \quad\text{and}\quad O(\widehat{\boldsymbol{\beta}}^n_{lts}, P_n) \leq O(\boldsymbol{\beta}_{lts}, P_n) \to O(\boldsymbol{\beta}_{lts}, P) \leq O(\widehat{\boldsymbol{\beta}}^n_{lts}, P).$$
The above discussion and arguments lead to the following theorem.
Theorem 5.
Under the corresponding assumptions in Theorems 1 and 2 for the uniqueness of $\widehat{\boldsymbol{\beta}}^n_{lts}$ and $\boldsymbol{\beta}_{lts}$, respectively, $\widehat{\boldsymbol{\beta}}^n_{lts}$ converges a.s. to $\boldsymbol{\beta}_{lts}$ (i.e., $\|\widehat{\boldsymbol{\beta}}^n_{lts} - \boldsymbol{\beta}_{lts}\| = o(1)$, a.s.).
The above is based on the following generalized Glivenko–Cantelli Theorem.
Lemma 3.
$\sup_{f \in \mathcal{F}} |P_n f - Pf| \to 0$ a.s., provided that (A2) holds.
Proof. 
Verifying the two requirements of Theorem 24 in II.5 of [32] leads to the result. Showing that the covering number for functions in $\mathcal{F}$ is bounded is challenging; essentially, one needs to show that the graphs of functions in $\mathcal{F}$ form a VC class of sets (this was avoided in the literature, e.g., in [22,23,24,25,26,27]). For details, see Appendix A. □
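As a quick numerical illustration of Theorem 5, the following sketch (reusing the hypothetical `lts_objective` and `concentration_steps` helpers from Section 2) fits the LTS on growing samples from the model $y = 0 + 0\cdot x + e$; the fitted vector should shrink toward $\boldsymbol{\beta}_{lts} = \boldsymbol{\beta}_0 = 0$:

```python
import numpy as np

rng = np.random.default_rng(1)
for n in (100, 1_000, 10_000):
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)                       # y = 0 + 0*x + e
    h = int(np.floor(0.5 * n)) + 1
    fits = [concentration_steps(x[:, None], y, h, rng.standard_normal(2))
            for _ in range(10)]                      # several random starts
    best = min(fits, key=lambda b: lts_objective(b, x[:, None], y, h))
    print(n, np.linalg.norm(best))                   # should decrease toward 0
```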

3.2. Root-n Consistency and Asymptotic Normality

Instead of treating root-$n$ consistency separately as in [22,23,24], we establish the asymptotic normality of $\widehat{\boldsymbol{\beta}}^n_{lts}$ directly via stochastic equicontinuity (see p. 139 of [32]).
Stochastic equicontinuity refers to a sequence of stochastic processes $\{Z_n(t) : t \in T\}$ whose shared index set $T$ comes equipped with a semi-metric $d(\cdot, \cdot)$.
Definition 2
(VII.1, Def. 2 of [32]). Call $Z_n$ stochastically equicontinuous at $t_0$ if for each $\eta > 0$ and $\epsilon > 0$ there exists a neighborhood $U$ of $t_0$ for which

$$\limsup_n P\Big(\sup_{t \in U} |Z_n(t) - Z_n(t_0)| > \eta\Big) < \epsilon.$$

It is readily seen (see [32]) that if $\tau_n$ is a sequence of random elements of $T$ that converges in probability to $t_0$, then

$$Z_n(\tau_n) - Z_n(t_0) \to 0 \quad \text{in probability},$$

because, with probability tending to one, $\tau_n$ will belong to each $U$. This form is easier to apply, especially when the behavior of a particular $\tau_n$ sequence is under investigation.
Suppose $\mathcal{F} = \{f(\cdot, t) : t \in T\}$, with $T$ a subset of $\mathbb{R}^k$, is a collection of real, $P$-integrable functions on the set $S$ where $P$ (a probability measure) lives. Denote by $P_n$ the empirical measure formed from $n$ independent observations on $P$, and define the empirical process $E_n$ as the signed measure $n^{1/2}(P_n - P)$. Define

$$F(t) = Pf(\cdot, t), \qquad F_n(t) = P_n f(\cdot, t).$$

Suppose $f(\cdot, t)$ has a linear approximation near the $t_0$ at which $F(\cdot)$ takes its minimum value:

$$f(\cdot, t) = f(\cdot, t_0) + (t - t_0)^\top\nabla(\cdot) + |t - t_0|\,r(\cdot, t). \tag{22}$$

For completeness, set $r(\cdot, t_0) = 0$; here $\nabla$ (the differential operator applied to $f$ at $t_0$) is a vector of $k$ real functions on $S$. We cite Theorem 5 in VII.1 of [32] (p. 141) for the asymptotic normality of $\tau_n$.
Lemma 4
([32]). Suppose $\{\tau_n\}$ is a sequence of random vectors converging in probability to the value $t_0$ at which $F(\cdot)$ has its minimum. Define $r(\cdot, t)$ and the vector of functions $\nabla(\cdot)$ by (22). If
(i) $t_0$ is an interior point of the parameter set $T$;
(ii) $F(\cdot)$ has a non-singular second derivative matrix $V$ at $t_0$;
(iii) $F_n(\tau_n) \leq o_p(n^{-1}) + \inf_t F_n(t)$;
(iv) the components of $\nabla(\cdot)$ all belong to $L^2(P)$;
(v) the sequence $\{E_n r(\cdot, t)\}$ is stochastically equicontinuous at $t_0$;
then

$$n^{1/2}(\tau_n - t_0) \xrightarrow{d} N\big(0,\ V^{-1}\big[P(\nabla\nabla^\top) - (P\nabla)(P\nabla)^\top\big]V^{-1}\big). \tag{23}$$
By Theorems 1 and 5, assume, without loss of generality (w.l.o.g.), that $\widehat{\boldsymbol{\beta}}^n_{lts}$ and $\boldsymbol{\beta}_{lts}$ belong to a ball $B(\boldsymbol{\beta}_{lts}, r_0)$ centered at $\boldsymbol{\beta}_{lts}$ with a large enough radius $r_0$, and that $\Theta = B(\boldsymbol{\beta}_{lts}, r_0)$ is, hereafter, our parameter space for $\boldsymbol{\beta}$. To apply the lemma, we first note that in our case $\widehat{\boldsymbol{\beta}}^n_{lts}$ and $\boldsymbol{\beta}_{lts}$ correspond to $\tau_n$ and $t_0$ (assume, w.l.o.g., that $\boldsymbol{\beta}_{lts} = 0$ in light of regression equivariance; see Section 4); $\boldsymbol{\beta}$ and $\Theta$ correspond to $t$ and $T$; and $f(\cdot, t) := f(\mathbf{x}, y, \boldsymbol{\beta}, \alpha) = (y - \mathbf{w}^\top\boldsymbol{\beta})^2\,\mathbb{1}\big(r(\boldsymbol{\beta})^2 \leq F^{-1}_{r(\boldsymbol{\beta})^2}(\alpha)\big)$. In our case,

$$\nabla(\cdot) := \frac{\partial}{\partial\boldsymbol{\beta}}f(\mathbf{x}, y, \boldsymbol{\beta}, \alpha) = -2(y - \mathbf{w}^\top\boldsymbol{\beta})\,\mathbf{w}\,\mathbb{1}\big(r(\boldsymbol{\beta})^2 \leq F^{-1}_{r(\boldsymbol{\beta})^2}(\alpha)\big).$$

We have to assume that $P(\nabla_i^2) = P\big(4(y - \mathbf{w}^\top\boldsymbol{\beta})^2 w_i^2\,\mathbb{1}\big(r(\boldsymbol{\beta})^2 \leq F^{-1}_{r(\boldsymbol{\beta})^2}(\alpha)\big)\big)$ exists to meet (iv) of the lemma, where $i \in \{1, \dots, p\}$ and $\mathbf{w}^\top = (1, \mathbf{x}^\top) = (1, x_1, \dots, x_{p-1})$. It is readily seen that a sufficient condition for this assumption is the existence of $P(x_i^2)$. In our case, $V = 2P\big(\mathbf{w}\mathbf{w}^\top\mathbb{1}\big(r(\boldsymbol{\beta})^2 \leq F^{-1}_{r(\boldsymbol{\beta})^2}(\alpha)\big)\big)$, and we have to assume that it is invertible when $\boldsymbol{\beta}$ is replaced by $\boldsymbol{\beta}_{lts}$ (this is covered by (18)) to meet (ii) of the lemma. In our case,

$$r(\cdot, t) = \boldsymbol{\beta}^\top V\boldsymbol{\beta}/(2\|\boldsymbol{\beta}\|).$$

We assume that $\lambda_{min}$ and $\lambda_{max}$ are the minimum and maximum eigenvalues of the positive semidefinite matrix $V$ over all $\boldsymbol{\beta} \in \Theta$ and $\alpha \in [1/2, c]$.
Theorem 6.
Assume that
(i) the uniqueness assumptions for $\widehat{\boldsymbol{\beta}}^n_{lts}$ and $\boldsymbol{\beta}_{lts}$ in Theorems 1 and 2 hold, respectively;
(ii) $P(x_i^2)$ exists, with $\mathbf{x} = (x_1, \dots, x_{p-1})^\top$;
then

$$n^{1/2}(\widehat{\boldsymbol{\beta}}^n_{lts} - \boldsymbol{\beta}_{lts}) \xrightarrow{d} N\big(0,\ V^{-1}\big[P(\nabla\nabla^\top) - (P\nabla)(P\nabla)^\top\big]V^{-1}\big), \tag{24}$$

where $\boldsymbol{\beta}$ in $V$ and $\nabla$ is replaced by $\boldsymbol{\beta}_{lts}$ (which can be assumed to be zero).
Proof. 
The key to applying this Lemma is to verify (v). For details, see Appendix A. □
Remark 4.
(a) In the case of $p = 1$, that is, in the location case, the asymptotic normality of the LTS has been studied in [21,33,34].
(b) Ref. [35], under the rank-based optimization framework and stringent assumptions on the error term $e_i$ (an even density that is strictly decreasing for positive values, bounded absolute first moment) and on $\mathbf{x}_i$ (bounded fourth moment), covers the asymptotic normality of the LTS. Ref. [24] also treated the general case $p \geq 1$ and obtained the asymptotic normality of the LTS under many stringent conditions on the non-random covariates $\mathbf{x}_i$ and the distributions of $e_i$ in a 27-page article; the assumption C there is quite artificial and was never verified. Refs. [26,27] addressed the asymptotic normality of the LTS in nonlinear regression under a dependence setting; for these extensions, many artificial assumptions (D1, D2, H1–H6, I1, I2) are imposed but never verified, even for the linear LTS case. So those results do not cover the LTS in (3).
(c) Furthermore, since there was no population version like (4) and (5) before, empirical process theory could not be employed to verify the VC class of functions in [22,23,24,26,27]. Our approach here is quite different from the former classical analyses and much neater and more concise (employing the standard empirical process theory that was asserted to be inapplicable in [26,27]).

4. Inference Procedures

To utilize the asymptotic normality result in Theorem 6, we need to work out the asymptotic covariance. For simplicity, assume that $\mathbf{z} = (\mathbf{x}^\top, y)^\top$ follows an elliptical distribution $E(g; \boldsymbol{\mu}, \Sigma)$ with density

$$f_{\mathbf{z}}(\mathbf{x}, y) = \frac{g\big(((\mathbf{x}^\top, y) - \boldsymbol{\mu}^\top)\,\Sigma^{-1}\,((\mathbf{x}^\top, y)^\top - \boldsymbol{\mu})\big)}{\sqrt{\det(\Sigma)}},$$

where $\boldsymbol{\mu} \in \mathbb{R}^p$ and $\Sigma$ is a positive definite matrix of size $p$, proportional to the covariance matrix when the latter exists. We assume that $f_{\mathbf{z}}$ is unimodal.

4.1. Equivariance

A regression estimation functional $\mathbf{t}(\cdot)$ is said to be regression, scale, and affine equivariant (see [8]) if, respectively,

$$\mathbf{t}(F_{(\mathbf{w},\, y + \mathbf{w}^\top\mathbf{b})}) = \mathbf{t}(F_{(\mathbf{w}, y)}) + \mathbf{b}, \quad \forall\,\mathbf{b} \in \mathbb{R}^p;$$
$$\mathbf{t}(F_{(\mathbf{w},\, sy)}) = s\,\mathbf{t}(F_{(\mathbf{w}, y)}), \quad \forall\, s \in \mathbb{R};$$
$$\mathbf{t}(F_{(A^\top\mathbf{w},\, y)}) = A^{-1}\,\mathbf{t}(F_{(\mathbf{w}, y)}), \quad \forall\ \text{nonsingular } A \in \mathbb{R}^{p \times p}.$$
Theorem 7.
$\boldsymbol{\beta}_{lts}(F_{(\mathbf{x},y)}, \alpha)$ is regression, scale, and affine equivariant.
Proof. 
See the empirical version treatment given in [1] (p. 132). □

4.2. Transformation

Assume the Cholesky decomposition of $\Sigma$ yields a non-singular lower triangular matrix $L$ of the form

$$L = \begin{pmatrix} A & 0 \\ \mathbf{v}^\top & c \end{pmatrix}$$

with $\Sigma = LL^\top$. Hence, $\det(A) \neq 0 \neq c$. Now transform $(\mathbf{x}^\top, y)^\top$ to $(\mathbf{s}^\top, t)^\top$ with $(\mathbf{s}^\top, t)^\top = L^{-1}\big((\mathbf{x}^\top, y)^\top - \boldsymbol{\mu}\big)$. It is readily seen that the distribution of $(\mathbf{s}^\top, t)^\top$ is $E(g; 0, I_{p \times p})$.
Note that $(\mathbf{x}^\top, y)^\top = L(\mathbf{s}^\top, t)^\top + (\boldsymbol{\mu}_1^\top, \mu_2)^\top$ with $\boldsymbol{\mu} = (\boldsymbol{\mu}_1^\top, \mu_2)^\top$. That is,

$$\mathbf{x} = A\mathbf{s} + \boldsymbol{\mu}_1, \qquad y = \mathbf{v}^\top\mathbf{s} + ct + \mu_2.$$

Equivalently,

$$(1, \mathbf{s}^\top)^\top = B^{-1}(1, \mathbf{x}^\top)^\top, \tag{25}$$
$$t = \frac{y - (1, \mathbf{s}^\top)(\mu_2, \mathbf{v}^\top)^\top}{c}, \tag{26}$$

where

$$B = \begin{pmatrix} 1 & 0 \\ \boldsymbol{\mu}_1 & A \end{pmatrix}, \qquad B^{-1} = \begin{pmatrix} 1 & 0 \\ -A^{-1}\boldsymbol{\mu}_1 & A^{-1} \end{pmatrix}.$$

It is readily seen that (25) is an affine transformation on $\mathbf{w}$ and that (26) is first an affine transformation on $\mathbf{w}$, then a regression transformation on $y$, followed by a scale transformation on $y$. In light of Theorem 7, we can assume hereafter, w.l.o.g., that $(\mathbf{x}^\top, y)$ follows an $E(g; 0, I_{p \times p})$ (spherical) distribution, with $I_{p \times p}$ the covariance matrix of $(\mathbf{x}^\top, y)^\top$.
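The transformation is easy to check numerically. A short sketch (the $\boldsymbol{\mu}$ and $\Sigma$ below are arbitrary illustrative choices, and a normal $g$ is used for convenience):

```python
import numpy as np

# Numerical check of the Section 4.2 transformation: with Sigma = L L^T
# (Cholesky, L lower triangular), (s, t) = L^{-1}((x, y) - mu) has identity
# covariance.
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.5, 0.2],
                  [0.3, 0.2, 1.0]])
L = np.linalg.cholesky(Sigma)                         # lower triangular, Sigma = L L^T
Z = rng.multivariate_normal(mu, Sigma, size=100_000)  # rows (x^T, y)
ST = np.linalg.solve(L, (Z - mu).T).T                 # rows (s^T, t)
print(np.round(np.cov(ST, rowvar=False), 2))          # approximately I_{3x3}
```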
Theorem 8.
Assume that $e \sim N(0, \sigma^2)$ and that $e$ and $\mathbf{x}$ are independent. Then,
(1) $P\nabla = 0$ and $P(\nabla\nabla^\top) = 8\sigma^2 C I_{p \times p}$, with $C = \Phi(c) - 1/2 - c e^{-c^2/2}/\sqrt{2\pi}$, where $c = \sqrt{F^{-1}_{\chi^2(1)}(\alpha)}$, $\Phi(x)$ is the CDF of $N(0, 1)$, and $\chi^2(1)$ is a chi-square random variable with one degree of freedom.
(2) $V = 2C_1 I_{p \times p}$ with $C_1 = 2\Phi(c) - 1$.
(3) $n^{1/2}(\widehat{\boldsymbol{\beta}}^n_{lts} - \boldsymbol{\beta}_{lts}) \xrightarrow{d} N\big(0,\ (2C\sigma^2/C_1^2)\, I_{p \times p}\big)$, where $C$ and $C_1$ are defined in (1) and (2) above.
Proof. 
See Appendix A. □
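The constants in Theorem 8 are directly computable. A minimal sketch (the helper name `lts_asymptotic_variance` is hypothetical, and $e^{-c^2/2}/\sqrt{2\pi}$ is evaluated as the standard normal density at $c$):

```python
import numpy as np
from scipy import stats

def lts_asymptotic_variance(alpha=0.5, sigma2=1.0):
    """Constants of Theorem 8: c = sqrt of the alpha-quantile of chi^2(1),
    C = Phi(c) - 1/2 - c*phi(c), C1 = 2*Phi(c) - 1 (= alpha), and the
    per-coordinate asymptotic variance 2*C*sigma^2/C1^2."""
    c = np.sqrt(stats.chi2.ppf(alpha, df=1))
    C = stats.norm.cdf(c) - 0.5 - c * stats.norm.pdf(c)
    C1 = 2.0 * stats.norm.cdf(c) - 1.0
    return 2.0 * C * sigma2 / C1 ** 2

print(lts_asymptotic_variance(alpha=0.5))   # 2*C*sigma^2/C1^2 at alpha = 1/2
```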

4.3. Approximate $100(1-\gamma)\%$ Confidence Region

(i) Based on the asymptotic normality. Under the setting of Theorem 8, an approximate $100(1-\gamma)\%$ confidence region for the unknown regression parameter $\boldsymbol{\beta}_0$ is

$$\Big\{\boldsymbol{\beta} \in \mathbb{R}^p : \|\boldsymbol{\beta} - \widehat{\boldsymbol{\beta}}^n_{lts}\|^2 \leq \frac{2C\sigma^2}{C_1^2\, n}\, F^{-1}_{\chi^2(p)}(1-\gamma)\Big\},$$

where $\|\cdot\|$ stands for the Euclidean norm (a membership-test sketch follows). Without asymptotic normality, one can appeal to procedure (ii) below.
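A minimal sketch of the membership test, assuming the setting of Theorem 8 with $\sigma^2$ known and reusing the hypothetical `lts_asymptotic_variance` helper from the previous sketch:

```python
import numpy as np
from scipy import stats

def in_confidence_region(beta, beta_hat, n, alpha=0.5, sigma2=1.0, gamma=0.05):
    """Check whether beta lies in the approximate 100(1-gamma)% confidence
    region of Section 4.3 (i)."""
    p = len(beta_hat)
    avar = lts_asymptotic_variance(alpha, sigma2)            # 2*C*sigma^2/C1^2
    radius2 = avar / n * stats.chi2.ppf(1.0 - gamma, df=p)   # squared radius
    return float(np.sum((np.asarray(beta) - beta_hat) ** 2)) <= radius2
```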
(ii) Based on a bootstrapping scheme and the depth median and depth quantiles. Here, no assumptions on the underlying distribution are needed. This approximate procedure first re-samples $n$ points with replacement from the original sample points and computes a $\widehat{\boldsymbol{\beta}}^n_{lts}$. Repeat this $m$ (a large number, say $10^4$) times to obtain $m$ such $\widehat{\boldsymbol{\beta}}^n_{lts}$s.
The next step is to compute the depth, with respect to a location depth function (e.g., halfspace depth [36] or projection depth [37,38]), of these $m$ points in the parameter space of $\boldsymbol{\beta}$. Trimming the $\lceil \gamma m \rceil$ least deep points among the $m$ points, the remaining points form a convex hull, which is an approximate $100(1-\gamma)\%$ confidence region for the unknown regression parameter $\boldsymbol{\beta}_0$, in the location case and in low dimensions; a code sketch follows.
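A sketch of procedure (ii). It reuses the hypothetical `concentration_steps` helper from Section 2 (with a crude zero start; a real solver would use many random starts), and it substitutes a crude random-projection approximation for the exact halfspace depth of [36] — an assumption made purely to keep the sketch short:

```python
import numpy as np
from scipy import stats

def bootstrap_depth_region(X, y, h, m=1_000, gamma=0.05, n_dirs=500, seed=0):
    """Bootstrap m LTS fits, approximate each fit's halfspace depth by the
    minimum one-dimensional rank depth over random projections, and drop the
    ceil(gamma*m) least deep fits; the convex hull of the returned points
    approximates the 100(1-gamma)% confidence region."""
    rng = np.random.default_rng(seed)
    n, p = len(y), X.shape[1] + 1
    betas = np.empty((m, p))
    for b in range(m):
        idx = rng.integers(0, n, size=n)               # resample with replacement
        betas[b] = concentration_steps(X[idx], y[idx], h, np.zeros(p))
    U = rng.standard_normal((n_dirs, p))
    U /= np.linalg.norm(U, axis=1, keepdims=True)      # random unit directions
    proj = betas @ U.T                                 # (m, n_dirs) projections
    rk = stats.rankdata(proj, axis=0)
    depth = np.minimum(rk, m + 1 - rk).min(axis=1) / m # approximate depth
    cut = np.sort(depth)[int(np.ceil(gamma * m)) - 1]  # gamma*m-th smallest depth
    return betas[depth > cut]                          # trimmed set of fits
```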
Example 1.
To illustrate the normality of $\widehat{\boldsymbol{\beta}}^n_{lts}$, we carry out a small-scale simulation here. We generate $N = 500, 1000, 10{,}000$ $\widehat{\boldsymbol{\beta}}^n_{lts}$s, each obtained from a bivariate standard normal sample $\{(x_i, y_i)\}$ of size $n = 50$. For each $N$, we provide scatter plots of the $N$ $\widehat{\boldsymbol{\beta}}^n_{lts}$s and marginal histograms. Inspecting Figure 1, Figure 2 and Figure 3 reveals that the plots of the $\widehat{\boldsymbol{\beta}}^n_{lts}$s resemble a bivariate normal pattern more and more closely as the number of $\widehat{\boldsymbol{\beta}}^n_{lts}$s increases. The marginal histograms confirm the normality.
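A sketch reproducing the simulation (shown here for $N = 1000$ replications), reusing the hypothetical `concentration_steps` and `lts_objective` helpers from Section 2:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2024)
N, n = 1_000, 50
h = int(np.floor(0.5 * n)) + 1
est = np.empty((N, 2))
for k in range(N):
    x = rng.standard_normal(n)              # (x_i, y_i) bivariate standard normal
    y = rng.standard_normal(n)
    fits = [concentration_steps(x[:, None], y, h, rng.standard_normal(2))
            for _ in range(10)]             # several random starts per sample
    est[k] = min(fits, key=lambda b: lts_objective(b, x[:, None], y, h))
plt.scatter(est[:, 0], est[:, 1], s=4)      # should look bivariate normal
plt.xlabel("intercept estimate"); plt.ylabel("slope estimate")
plt.show()
```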

5. Concluding Remarks

Without the population version of the LTS (see (5)), it would be difficult to apply empirical process theory to study the asymptotics of the LTS; for example, verifying the key VC-class property of the class of regression functions (indexed by $\boldsymbol{\beta}$) would be challenging. To avoid this challenge, some authors addressed the asymptotics of the nonlinear LTS, where, without an explicit regression function (unlike the linear case), they could conveniently assume this VC-class property. Refs. [26,27] even believed that the standard empirical process theory does not apply to the asymptotics of the LTS, while Refs. [22,23,24] addressed the asymptotics without any advanced tools, employing elementary tools with numerous artificial, difficult-to-verify assumptions in lengthy articles.
By partitioning the parameter space and introducing the population version of the LTS, this article establishes some fundamental and primary properties of the objective function of the LTS in both the empirical and population settings. These newly obtained original results verify some key facts that were conveniently assumed (but never verified) in the nonlinear-regression literature on the LTS, and they facilitate the application of standard empirical process theory to establish the asymptotic normality of the sample LTS concisely and neatly. Some of the newly obtained results, such as Fisher consistency, strong consistency, and the influence function, are original and obtained as by-products.
The asymptotic normality is applied in Theorem 8 to the practical inference procedure of confidence regions for the regression parameter $\boldsymbol{\beta}_0$. Open problems remain: one is the estimation of the variance of $e$, which is here unrealistically assumed to be known; the other is the testing of hypotheses on $\boldsymbol{\beta}_0$.

Funding

The author declares that no funding was received for this study.

Data Availability Statement

The data will be made available by the author on request.

Acknowledgments

Insightful comments and useful suggestions from Wei Shao and Derek Young have significantly improved the manuscript and are highly appreciated. Special thanks go to Derek Young for making the technical report of Chen, Stromberg, and Zhou available.

Conflicts of Interest

The author declares no conflicts of interest.


Appendix A. Proofs

Proof of Lemma 1.
(i) By the definition in (9), over $S_{\boldsymbol{\beta}_l}$ there are no ties among the smallest $h$ squared residuals; assertion (a) follows straightforwardly.
The first and last equalities in (b) are trivial; it suffices to focus on the middle one. Let $k_i := k_i(\boldsymbol{\eta})$ ($= k_i(\boldsymbol{\beta}_l)$ in light of (9)). By (6), we have

$$O_n(\boldsymbol{\eta}) = \frac{1}{n}\sum_{i=1}^h r_{k_i}^2(\boldsymbol{\eta}).$$

Let $r_i := r_i(\boldsymbol{\eta}) = y_i - \mathbf{w}_i^\top\boldsymbol{\eta}$ and $\gamma = \min\{\min_{1 \leq i \neq j \leq n}|r_i^2 - r_j^2|,\ 1\}$. Then $1 \geq \gamma > 0$ (a.s.).
By the continuity of $r_{k_i}^2(\boldsymbol{\beta})$ in $\boldsymbol{\beta}$, for any $1 \leq i \leq h$ and any given $\varepsilon \in (0, 1)$, we can fix a small $\delta > 0$ so that $|r_{k_i}^2(\boldsymbol{\beta}) - r_{k_i}^2(\boldsymbol{\eta})| < \gamma\varepsilon/4h$ for any $\boldsymbol{\beta} \in B(\boldsymbol{\eta}, \delta)$. Now, for any $\boldsymbol{\beta} \in B(\boldsymbol{\eta}, \delta)$ (assume below $2 \leq i \leq h$),

$$r_{k_i}^2(\boldsymbol{\beta}) - r_{k_{i-1}}^2(\boldsymbol{\beta}) > r_{k_i}^2(\boldsymbol{\eta}) - \frac{\gamma\varepsilon}{4h} - \Big[r_{k_{i-1}}^2(\boldsymbol{\eta}) + \frac{\gamma\varepsilon}{4h}\Big] = r_{k_i}^2(\boldsymbol{\eta}) - r_{k_{i-1}}^2(\boldsymbol{\eta}) - \frac{\gamma\varepsilon}{2h} \geq \gamma - \frac{\gamma\varepsilon}{2h} > 0 \quad (a.s.).$$

Thus $\{k_i = k_i(\boldsymbol{\eta}),\ 1 \leq i \leq h\}$ forms the $h$-integer set for any $\boldsymbol{\beta} \in B(\boldsymbol{\eta}, \delta)$. Part (b) follows.
(ii) The domain of $O_n(\boldsymbol{\beta})$ is the union of the pieces $\overline{S}_{\boldsymbol{\beta}_l}$, and over $S_{\boldsymbol{\beta}_l}$, $O_n(\boldsymbol{\beta})$ is a quadratic function of $\boldsymbol{\beta}$: $O_n(\boldsymbol{\beta}) = \frac{1}{n}\sum_{j=1}^h r_{k_j(\boldsymbol{\beta}_l)}^2(\boldsymbol{\beta})$. The statement follows.
(iii) By (ii), $O_n(\boldsymbol{\beta})$ is clearly continuous in $\boldsymbol{\beta}$ over each piece $S_{\boldsymbol{\beta}_l}$. We only need to show continuity for $\boldsymbol{\beta}$ on the boundary of an $S_{\boldsymbol{\beta}_l}$.
Let $\boldsymbol{\eta}$ lie on the common boundary of $S_{\boldsymbol{\beta}_s}$ and $S_{\boldsymbol{\beta}_t}$. Then $O_n(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^h r_{k_i(\boldsymbol{\beta}_s)}^2(\boldsymbol{\beta})$ for any $\boldsymbol{\beta} \in \overline{S}_{\boldsymbol{\beta}_s}$ [this is obviously true if $\boldsymbol{\beta} \in S_{\boldsymbol{\beta}_s}$; it is also true if $\boldsymbol{\beta}$ is on the boundary of $S_{\boldsymbol{\beta}_s}$, since in that case the $\boldsymbol{\beta}$-$h$-integer set is not unique — there are at least two, one of them $\{k_1(\boldsymbol{\beta}_s), \dots, k_h(\boldsymbol{\beta}_s)\}$], and $O_n(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^h r_{k_i(\boldsymbol{\beta}_t)}^2(\boldsymbol{\beta})$ for any $\boldsymbol{\beta} \in \overline{S}_{\boldsymbol{\beta}_t}$. Let $\{\boldsymbol{\beta}_j\}$ be a sequence approaching $\boldsymbol{\eta}$, where each $\boldsymbol{\beta}_j$ may lie in $\overline{S}_{\boldsymbol{\beta}_s}$ or in $\overline{S}_{\boldsymbol{\beta}_t}$. We show that $O_n(\boldsymbol{\beta}_j)$ approaches $O_n(\boldsymbol{\eta})$. Note that $O_n(\boldsymbol{\eta}) = \frac{1}{n}\sum_{i=1}^h r_{k_i(\boldsymbol{\beta}_s)}^2(\boldsymbol{\eta}) = \frac{1}{n}\sum_{i=1}^h r_{k_i(\boldsymbol{\beta}_t)}^2(\boldsymbol{\eta})$. Partition $\{\boldsymbol{\beta}_j\}$ into $\{\boldsymbol{\beta}_j^s\}$ and $\{\boldsymbol{\beta}_j^t\}$ so that all members of the former belong to $\overline{S}_{\boldsymbol{\beta}_s}$ while the latter all lie within $\overline{S}_{\boldsymbol{\beta}_t}$. By the continuity of the sum of $h$ squared residuals in $\boldsymbol{\beta}$, both $O_n(\boldsymbol{\beta}_j^s)$ and $O_n(\boldsymbol{\beta}_j^t)$ approach $O_n(\boldsymbol{\eta})$ as $j \to \infty$, since both subsequences approach $\boldsymbol{\eta}$.
(iv) Note that for any $l$, $1 \leq l \leq L$, over $S_{\boldsymbol{\beta}_l}$ one has a least squares problem with $n$ reduced to $h$: $O_n(\boldsymbol{\beta})$ is a quadratic function and hence twice differentiable and strictly convex, in light of the following:

$$n\frac{\partial}{\partial\boldsymbol{\beta}}O_n(\boldsymbol{\beta}) = -2\sum_{i=1}^n r_i\mathbb{1}_i\mathbf{w}_i = -2\mathbf{X}_n^\top D R, \qquad n\frac{\partial^2}{\partial\boldsymbol{\beta}^2}O_n(\boldsymbol{\beta}) = 2\mathbf{X}_n^\top D\mathbf{X}_n = 2\mathbf{X}_{*n}^\top\mathbf{X}_{*n} = 2\sum_{i=1}^h \mathbf{w}_{k_i(\boldsymbol{\beta}_l)}\mathbf{w}_{k_i(\boldsymbol{\beta}_l)}^\top,$$

where $R = (r_1, r_2, \dots, r_n)^\top$, $D = \operatorname{diag}(\mathbb{1}_i)$, and $\mathbf{X}_{*n} = D\mathbf{X}_n$. Strict convexity follows from the positive definiteness of the Hessian matrix $\frac{2}{n}\mathbf{X}_{*n}^\top\mathbf{X}_{*n}$ (an invertible matrix due to (A1); see (iii) in the proof of Theorem 1). □
Proof of Theorem 1.
(i) Over each $S_{\boldsymbol{\beta}_l}$, an open set, $O_n(\boldsymbol{\beta})$ is twice differentiable and strictly convex in light of the given conditions; hence it has a unique local minimizer (otherwise, by openness and strict convexity, one can show there is a third point in $S_{\boldsymbol{\beta}_l}$ attaining a strictly smaller objective value than the two minimizers). Since there are only finitely many $S_{\boldsymbol{\beta}_l}$, the assertion follows if we can prove that the minimum is not attained at a boundary point of some $S_{\boldsymbol{\beta}_l}$.
Assume otherwise, i.e., that $O_n(\boldsymbol{\beta})$ attains its global minimum at a point $\boldsymbol{\beta}_1$ that is a boundary point of $S_{\boldsymbol{\beta}_l}$ for some $l$. Assume that over $S_{\boldsymbol{\beta}_l}$, $O_n(\boldsymbol{\beta})$ attains its local minimum value at the unique point $\boldsymbol{\beta}_2$. Then $O_n(\boldsymbol{\beta}_1) \leq O_n(\boldsymbol{\beta}_2)$. If equality holds, we have the desired result (since there would be points besides $\boldsymbol{\beta}_2$ in $S_{\boldsymbol{\beta}_l}$ attaining the same minimum value as $\boldsymbol{\beta}_2$, a contradiction). Otherwise, there is a point $\boldsymbol{\beta}_3$ in a small neighborhood of $\boldsymbol{\beta}_1$ such that $O_n(\boldsymbol{\beta}_3) \leq O_n(\boldsymbol{\beta}_1) + (O_n(\boldsymbol{\beta}_2) - O_n(\boldsymbol{\beta}_1))/2 < O_n(\boldsymbol{\beta}_2)$. A contradiction appears.
(ii) It is seen from (i) that $O_n(\boldsymbol{\beta})$ is twice continuously differentiable over $S_{\boldsymbol{\beta}_{l_0}}$; hence its first derivative evaluated at the global minimum must be zero. By (i), we obtain Equation (11).
(iii) This part follows directly from (ii) and the invertibility of $M_n$. The latter follows from (A1), which implies that the $p$ columns of the matrix $\mathbf{X}_n$ are linearly independent and that any $h$ rows of $\mathbf{X}_n$ form a sub-matrix of full rank. □
Proof of Lemma 2.
Write $G(\boldsymbol{\beta})$ for $(y - \mathbf{w}^\top\boldsymbol{\beta})^2\,\mathbb{1}\big((y - \mathbf{w}^\top\boldsymbol{\beta})^2 \leq q(\boldsymbol{\beta}, \alpha)\big)$, the integrand in (4), for a point $(\mathbf{x}^\top, y) \in \mathbb{R}^p$. Note that $G(\boldsymbol{\beta}) = (y - \mathbf{w}^\top\boldsymbol{\beta})^2\big[1 - \mathbb{1}\big((y - \mathbf{w}^\top\boldsymbol{\beta})^2 > q(\boldsymbol{\beta}, \alpha)\big)\big]$.
(i) By the strict monotonicity of $F_W$ around $q(\boldsymbol{\beta}, \alpha)$, we have the continuity of $q(\boldsymbol{\beta}, \alpha)$ in $\boldsymbol{\beta}$. Consequently, $G(\boldsymbol{\beta})$ is obviously continuous, and so is $O(\boldsymbol{\beta})$, in $\boldsymbol{\beta} \in \mathbb{R}^p$.
(ii) For arbitrary points $(\mathbf{x}^\top, y)$ and $\boldsymbol{\beta}$ in $\mathbb{R}^p$, there are three cases for the relationship between the squared residual and its quantile: (a) $(y - \mathbf{w}^\top\boldsymbol{\beta})^2 > q(\boldsymbol{\beta}, \alpha)$, (b) $(y - \mathbf{w}^\top\boldsymbol{\beta})^2 < q(\boldsymbol{\beta}, \alpha)$, and (c) $(y - \mathbf{w}^\top\boldsymbol{\beta})^2 = q(\boldsymbol{\beta}, \alpha)$. Case (c) happens with probability zero, so we skip it and treat (a) and (b) only. By the continuity in $\boldsymbol{\beta}$, there is a small neighborhood $B(\boldsymbol{\beta}, \delta)$, centered at $\boldsymbol{\beta}$ with radius $\delta$, such that (a) (or (b)) holds throughout $B(\boldsymbol{\beta}, \delta)$. This implies that

$$\frac{\partial}{\partial\boldsymbol{\beta}}\mathbb{1}\big((y - \mathbf{w}^\top\boldsymbol{\beta})^2 > q(\boldsymbol{\beta}, \alpha)\big) = 0 \quad (a.s.)$$

and

$$\frac{\partial}{\partial\boldsymbol{\beta}}G(\boldsymbol{\beta}) = -2(y - \mathbf{w}^\top\boldsymbol{\beta})\,\mathbf{w}\,\mathbb{1}\big((y - \mathbf{w}^\top\boldsymbol{\beta})^2 \leq q(\boldsymbol{\beta}, \alpha)\big) \quad (a.s.).$$

Hence, we have

$$\frac{\partial^2}{\partial\boldsymbol{\beta}^2}G(\boldsymbol{\beta}) = 2\mathbf{w}\mathbf{w}^\top\mathbb{1}\big((y - \mathbf{w}^\top\boldsymbol{\beta})^2 \leq q(\boldsymbol{\beta}, \alpha)\big) \quad (a.s.).$$

Note that $E(\mathbf{w}\mathbf{w}^\top)$ exists. Then, by the Lebesgue dominated convergence theorem, the desired result follows.
(iii) The strict convexity follows from the twice differentiability and the positive definiteness of the second-order derivative of $O(\boldsymbol{\beta})$. □
Proof of Theorem 2.
We treat $\boldsymbol{\beta}_{lts}(F_{(\mathbf{x},y)}, \alpha)$; the counterpart $\boldsymbol{\beta}_{lts}(F_\varepsilon(z), \alpha)$ can be treated analogously.
(i) Existence follows from the positive semi-definiteness of the Hessian matrix (see the proof of (ii) of Lemma 2) and the convexity of $O(\boldsymbol{\beta})$.
(ii) The equations follow from the differentiability and the first-order derivative of $O(\boldsymbol{\beta})$ given in the proof of (ii) of Lemma 2.
(iii) The uniqueness follows from the positive definiteness of the Hessian matrix based on the given condition (invertibility). □
Proof of Theorem 3.
By Theorem 2, (i) and the given conditions guarantee the existence and uniqueness of $\boldsymbol{\beta}_{lts}(F_{(\mathbf{x},y)}, \alpha)$, which is the unique solution of the system of equations

$$\int (y - \mathbf{w}^\top\boldsymbol{\beta})\,\mathbf{w}\,\mathbb{1}\big((y - \mathbf{w}^\top\boldsymbol{\beta})^2 \leq q(\boldsymbol{\beta}, \alpha)\big)\,dF_{(\mathbf{x},y)}(\mathbf{x}, y) = 0.$$

Notice that $y - \mathbf{w}^\top\boldsymbol{\beta} = \mathbf{w}^\top(\boldsymbol{\beta}_0 - \boldsymbol{\beta}) + e$. Inserting this into the above equation, we have

$$\int \big(\mathbf{w}^\top(\boldsymbol{\beta}_0 - \boldsymbol{\beta}) + e\big)\,\mathbf{w}\,\mathbb{1}\Big(\big(\mathbf{w}^\top(\boldsymbol{\beta}_0 - \boldsymbol{\beta}) + e\big)^2 \leq F^{-1}_{(\mathbf{w}^\top(\boldsymbol{\beta}_0 - \boldsymbol{\beta}) + e)^2}(\alpha)\Big)\,dF_{(\mathbf{x},y)}(\mathbf{x}, y) = 0.$$

By (ii), it is readily seen that $\boldsymbol{\beta} = \boldsymbol{\beta}_0$ is a solution of this system of equations. Uniqueness leads to the desired result. □
Proof of Theorem 4.
Write $\boldsymbol{\beta}^\varepsilon_{lts}(z_0)$ for $\boldsymbol{\beta}_{lts}(F_\varepsilon(z_0), \alpha)$, insert it for $\boldsymbol{\beta}$ into (17), take the derivative with respect to $\varepsilon$ on both sides of (17), and let $\varepsilon \to 0$. We obtain (in light of the dominated convergence theorem)

$$\Big(\int \frac{\partial}{\partial\boldsymbol{\beta}}\Big[r(\boldsymbol{\beta})\,\mathbf{v}\,\mathbb{1}\big(r(\boldsymbol{\beta})^2 \leq q_\varepsilon(z, \boldsymbol{\beta}, \alpha)\big)\Big]_{\boldsymbol{\beta} = \boldsymbol{\beta}^\varepsilon_{lts}(z_0),\,\varepsilon \to 0}\,dF_{(\mathbf{x},y)}\Big)\,\operatorname{IF}(z_0; \boldsymbol{\beta}_{lts}, F_{(\mathbf{x},y)}) + \int r(\boldsymbol{\beta}_{lts})\,\mathbf{w}\,\mathbb{1}\big(r(\boldsymbol{\beta}_{lts})^2 \leq q(\boldsymbol{\beta}_{lts}, \alpha)\big)\,d(\delta_{z_0} - F_{(\mathbf{x},y)}) = 0, \tag{A2}$$

where $r(\boldsymbol{\beta}) = t - \mathbf{v}^\top\boldsymbol{\beta}$ in the first term on the left-hand side (LHS) and $r(\boldsymbol{\beta}) = y - \mathbf{w}^\top\boldsymbol{\beta}$ in the second term on the LHS. Call the two terms on the LHS $T_1$ and $T_2$, respectively, and call the integrand in $T_1$ $T_0$. It is seen that (see the proof of (i) of Theorem 1)

$$T_0 = \frac{\partial}{\partial\boldsymbol{\beta}}\Big[(t - \mathbf{v}^\top\boldsymbol{\beta})\,\mathbf{v}\,\mathbb{1}\big((t - \mathbf{v}^\top\boldsymbol{\beta})^2 \leq q_\varepsilon(z, \boldsymbol{\beta}, \alpha)\big)\Big]_{\boldsymbol{\beta} = \boldsymbol{\beta}^\varepsilon_{lts}(z_0),\,\varepsilon \to 0} = -\mathbf{w}\mathbf{w}^\top\mathbb{1}\big((y - \mathbf{w}^\top\boldsymbol{\beta}_{lts})^2 \leq q(\boldsymbol{\beta}_{lts}, \alpha)\big).$$

Focusing on $T_2$, it is readily seen that

$$T_2 = \int (y - \mathbf{w}^\top\boldsymbol{\beta}_{lts})\,\mathbf{w}\,\mathbb{1}\big((y - \mathbf{w}^\top\boldsymbol{\beta}_{lts})^2 \leq q(\boldsymbol{\beta}_{lts}, \alpha)\big)\,d\delta_{z_0} - \int (y - \mathbf{w}^\top\boldsymbol{\beta}_{lts})\,\mathbf{w}\,\mathbb{1}\big((y - \mathbf{w}^\top\boldsymbol{\beta}_{lts})^2 \leq q(\boldsymbol{\beta}_{lts}, \alpha)\big)\,dF_{(\mathbf{x},y)}.$$

In light of (16), the second integral vanishes, so

$$T_2 = \int (y - \mathbf{w}^\top\boldsymbol{\beta}_{lts})\,\mathbf{w}\,\mathbb{1}\big((y - \mathbf{w}^\top\boldsymbol{\beta}_{lts})^2 \leq q(\boldsymbol{\beta}_{lts}, \alpha)\big)\,d\delta_{z_0} = \begin{cases} 0, & \text{if } (t_0 - \mathbf{v}_0^\top\boldsymbol{\beta}_{lts})^2 > q(\boldsymbol{\beta}_{lts}, \alpha), \\ (t_0 - \mathbf{v}_0^\top\boldsymbol{\beta}_{lts})\mathbf{v}_0, & \text{otherwise}. \end{cases}$$

This, $T_0$, and display (A2) lead to the desired result. □
Proof of Lemma 3.
We invoke Theorem 24 in II.5 of [32]. The first requirement of the theorem is the existence of an envelope for $\mathcal{F}$. Such an envelope is $\sup_{\boldsymbol{\beta} \in \Theta} F^{-1}_{r(\boldsymbol{\beta})^2}(c)$, which is bounded since $\Theta$ is compact, $F_W^{-1}$ is continuous in $\boldsymbol{\beta}$, and $F_W^{-1}(\alpha)$ is non-decreasing in $\alpha \in [1/2, c]$. To complete the proof, we only need to verify the second requirement of the theorem.
For the second requirement, that is, to bound the covering numbers, it suffices to show that the graphs of functions in $\mathcal{F}(\boldsymbol{\beta}, \alpha)$ have only polynomial discrimination (see Theorem 25 and Example 26 in II.5 of [32]).
The graph of a real-valued function $f$ on a set $S$ is defined as the subset (see p. 27 of [32])

$$G_f = \{(s, t) : 0 \leq t \leq f(s) \ \text{or}\ f(s) \leq t \leq 0,\ s \in S\}.$$

The graph of a function in $\mathcal{F}(\boldsymbol{\beta}, \alpha)$ contains a point $(\mathbf{x}(\omega), y(\omega), t)$ if and only if $0 \leq t \leq f(\mathbf{x}, y, \boldsymbol{\beta}, \alpha)$ or $f(\mathbf{x}, y, \boldsymbol{\beta}, \alpha) \leq t \leq 0$. The latter case can be excluded, since the function is always non-negative (and the equal-to-zero case is covered by the former). The former case happens if and only if $0 \leq \sqrt{t} \leq y - \mathbf{w}^\top\boldsymbol{\beta}$ or $0 \leq \sqrt{t} \leq -y + \mathbf{w}^\top\boldsymbol{\beta}$.
Given a collection of $n$ points $(\mathbf{x}_i, y_i, t_i)$ ($t_i \geq 0$), the graph of a function in $\mathcal{F}(\boldsymbol{\beta}, \alpha)$ picks out only the points that belong to $\{z_i \geq 0\} \cap \{y_i - \boldsymbol{\beta}^\top\mathbf{w}_i - z_i \geq 0\}$ or $\{z_i \geq 0\} \cap \{-y_i + \boldsymbol{\beta}^\top\mathbf{w}_i - z_i \geq 0\}$, where $z_i := \sqrt{t_i}$. Introduce $n$ new points $(\mathbf{w}_i^\top, y_i, z_i) := ((1, \mathbf{x}_i^\top), y_i, \sqrt{t_i})$ in $\mathbb{R}^{p+2}$. On $\mathbb{R}^{p+2}$, define a vector space $\mathcal{G}$ of functions

$$g_{\mathbf{a},b,c}(\mathbf{w}, y, z) = \mathbf{a}^\top\mathbf{w} + by + cz,$$

where $\mathbf{a} \in \mathbb{R}^p$, $b \in \mathbb{R}^1$, and $c \in \mathbb{R}^1$; $\mathcal{G} := \{g_{\mathbf{a},b,c} : \mathbf{a} \in \mathbb{R}^p,\ b \in \mathbb{R}^1,\ c \in \mathbb{R}^1\}$ is a $(p+2)$-dimensional vector space.
It is clear now that the graph of a function in $\mathcal{F}(\boldsymbol{\beta}, \alpha)$ picks out only points that belong to sets of the form $\{g \geq 0\}$ for $g \in \mathcal{G}$ (ignoring the union and intersection operations at this moment). By Lemma 18 in II.4 of [32] (p. 20), the sets $\{g \geq 0\}$, $g \in \mathcal{G}$, pick out only polynomially many subsets of $\{p_i := (\mathbf{w}_i^\top, y_i, z_i),\ i \in \{1, \dots, n\}\}$; the sets corresponding to $g \in \mathcal{G}$ with $\mathbf{a} \in \{0, \boldsymbol{\beta}, -\boldsymbol{\beta}\}$, $b \in \{0, 1, -1\}$, and $c \in \{-1, 1\}$ pick out even fewer subsets of $\{p_i,\ i \in \{1, \dots, n\}\}$. This, in conjunction with Lemma 15 in II.4 of [32] (p. 18), yields that the graphs of functions in $\mathcal{F}(\boldsymbol{\beta}, \alpha)$ have only polynomial discrimination. By Theorem 24 in II.5 of [32], the proof is complete. □
Proof of Theorem 6.
To apply Lemma 4, we need to verify its five conditions; among them, only (iii) and (v) need to be addressed, all others being trivially satisfied. Condition (iii) holds automatically, since our $\tau_n = \widehat{\boldsymbol{\beta}}^n_{lts}$ is defined to be the minimizer of $F_n(t)$ over $t \in T\ (= \Theta)$.
So the only condition that needs to be verified is (v), the stochastic equicontinuity of $\{E_n r(\cdot, t)\}$ at $t_0$. For that, we appeal to the Equicontinuity Lemma (VII.4 of [32], p. 150). To apply that lemma, we verify that the random covering numbers satisfy the required uniformity condition. To that end, consider the class of functions

$$\mathcal{R}(\boldsymbol{\beta}, \alpha) = \big\{r(\cdot, \cdot, \alpha, \boldsymbol{\beta}) = \boldsymbol{\beta}^\top V\boldsymbol{\beta}/(2\|\boldsymbol{\beta}\|) : \boldsymbol{\beta} \in \Theta,\ \alpha \in [1/2, c]\big\}.$$

Obviously, $\lambda_{max} r_0/2$ is an envelope for the class $\mathcal{R}$ in $L^2(P)$, where $r_0$ is the radius of the ball $\Theta = B(\boldsymbol{\beta}_{lts}, r_0)$. We now show that the covering numbers of $\mathcal{R}$ are uniformly bounded, which amply suffices for the Equicontinuity Lemma. For this, we invoke Lemmas II.25 and II.36 of [32]. To apply Lemma II.25, we need to show that the graphs of functions in $\mathcal{R}$ have only polynomial discrimination. The graph of $r(\mathbf{x}, y, \alpha, \boldsymbol{\beta})$ contains a point $(\mathbf{x}^\top, y, t)$, $t \geq 0$, if and only if $\boldsymbol{\beta}^\top V\boldsymbol{\beta}/(2\|\boldsymbol{\beta}\|) \geq t$, for $\boldsymbol{\beta} \in \Theta$ and $\alpha \in [1/2, c]$.
Equivalently, the graph of $r(\mathbf{x}, y, \alpha, \boldsymbol{\beta})$ contains a point $(\mathbf{x}^\top, y, t)$, $t \geq 0$, if and only if $\lambda_{min}\|\boldsymbol{\beta}\|/2 \geq t$. For a collection of $n$ points $(\mathbf{x}_i, y_i, t_i)$ with $t_i \geq 0$, the graph picks out those points satisfying $\lambda_{min}\|\boldsymbol{\beta}\|/2 - t_i \geq 0$. Construct from $(\mathbf{x}_i, y_i, t_i)$ a point $z_i = t_i$ in $\mathbb{R}$. On $\mathbb{R}$, define a vector space $\mathcal{G}$ of functions

$$g_{a,b}(x) = ax + b, \qquad a, b \in \mathbb{R}.$$

By Lemma 18 of [32], the sets $\{g \geq 0\}$, for $g \in \mathcal{G}$, pick out only a polynomial number of subsets from $\{z_i\}$; the sets corresponding to functions in $\mathcal{G}$ with $a = -1$ and $b = \lambda_{min}\|\boldsymbol{\beta}\|/2$ pick out even fewer subsets from $\{z_i\}$. Thus, the graphs of functions in $\mathcal{R}$ have only polynomial discrimination. □
Proof of Theorem 8.
To invoke Theorem 6, we only need to check the uniqueness of $\widehat{\boldsymbol{\beta}}^n_{lts}$ and $\boldsymbol{\beta}_{lts}$. The former is guaranteed by (iii) of Theorem 1, since (A1) holds true a.s. This is because any $p$ columns of $\mathbf{X}_n$, or of any $h$ of its rows, can be regarded as a sample from a continuous random vector of dimension $n$ or $h$, and the probability that these $p$ points lie in a $(p-1)$-dimensional non-degenerate hyperplane (one with a non-zero normal vector) is zero.
The latter is guaranteed by (iii) of Theorem 2, since $W = (y - \mathbf{w}^\top\boldsymbol{\beta})^2$ is the square of a normal random variable (with mean $-\beta_1$) and hence has a positive density, and (18) becomes $2(\Phi(c/\sigma) - 1/2)I_{p \times p}$, hence invertible, where $c$ is defined in Theorem 8. By Theorems 3 and 7, we can assume, w.l.o.g., that $\boldsymbol{\beta}_{lts} = \boldsymbol{\beta}_0 = 0$. Utilizing the independence between $e$ and $\mathbf{x}$ and Theorem 6, a straightforward calculation leads to the results. □

References

  1. Rousseeuw, P.J.; Leroy, A. Robust Regression and Outlier Detection; Wiley: New York, NY, USA, 1987. [Google Scholar]
  2. Huber, P.J. Robust estimation of a location parameter. Ann. Math. Stat. 1964, 35, 73–101. [Google Scholar] [CrossRef]
  3. Rousseeuw, P.J. Least median of squares regression. J. Am. Stat. Assoc. 1984, 79, 871–880. [Google Scholar] [CrossRef]
  4. Rousseeuw, P.J.; Yohai, V.J. Robust regression by means of S-estimators. In Robust and Nonlinear Time Series Analysis; Lecture Notes in Statistics; Springer: New York, NY, USA, 1984; Volume 26, pp. 256–272. [Google Scholar]
  5. Yohai, V.J. High breakdown-point and high efficiency estimates for regression. Ann. Stat. 1987, 15, 642–656. [Google Scholar] [CrossRef]
  6. Yohai, V.J.; Zamar, R.H. High breakdown estimates of regression by means of the minimization of an efficient scale. J. Am. Stat. Assoc. 1988, 83, 406–413. [Google Scholar] [CrossRef]
  7. Rousseeuw, P.J.; Hubert, M. Regression depth (with discussion). J. Am. Stat. Assoc. 1999, 94, 388–433. [Google Scholar] [CrossRef]
  8. Zuo, Y. On general notions of depth for regression. Stat. Sci. 2021, 36, 142–157. [Google Scholar] [CrossRef]
  9. Zuo, Y. Robustness of the deepest projection regression depth functional. Stat. Pap. 2021, 62, 1167–1193. [Google Scholar] [CrossRef]
  10. Rousseeuw, P.J.; Van Driessen, K. Computing LTS Regression for Large Data Sets. Data Min. Knowl. Discov. 2006, 12, 29–45. [Google Scholar] [CrossRef]
  11. Stromberg, A.J. Computation of High Breakdown Nonlinear Regression Parameters. J. Am. Stat. Assoc. 1993, 88, 237–244. [Google Scholar] [CrossRef]
  12. Hawkins, D.M. The feasible solution algorithm for least trimmed squares regression. Comput. Stat. Data Anal. 1994, 17, 185–196. [Google Scholar] [CrossRef]
  13. Hössjer, O. Exact computation of the least trimmed squares estimate in simple linear regression. Comput. Stat. Data Anal. 1995, 19, 265–282. [Google Scholar] [CrossRef]
  14. Rousseeuw, P.J.; Van Driessen, K. A fast algorithm for the minimum covariance determinant estimator. Technometrics 1999, 41, 212–223. [Google Scholar] [CrossRef]
  15. Hawkins, D.M.; Olive, D.J. Improved feasible solution algorithms for high breakdown estimation. Comput. Stat. Data Anal. 1999, 30, 1–11. [Google Scholar] [CrossRef]
  16. Agullö, J. New algorithms for computing the least trimmed squares regression estimator. Comput. Stat. Data Anal. 2001, 36, 425–439. [Google Scholar] [CrossRef]
  17. Hofmann, M.; Gatu, C.; Kontoghiorghes, E.J. An Exact Least Trimmed Squares Algorithm for a Range of Coverage Values. J. Comput. Graph. Stat. 2010, 19, 191–204. [Google Scholar] [CrossRef]
  18. Klouda, K. An Exact Polynomial Time Algorithm for Computing the Least Trimmed Squares Estimate. Comput. Stat. Data Anal. 2015, 84, 27–40. [Google Scholar] [CrossRef]
  19. Alfons, A.; Croux, C.; Gelper, S. Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Ann. Appl. Stat. 2013, 7, 226–248. [Google Scholar] [CrossRef]
  20. Kurnaz, F.S.; Hoffmann, I.; Filzmoser, P. Robust and sparse estimation methods for high-dimensional linear and logistic regression. Chemom. Intell. Lab. Syst. 2018, 172, 211–222. [Google Scholar] [CrossRef]
  21. Mašíček, L. Optimality of the Least Weighted Squares Estimator. Kybernetika 2004, 40, 715–734. [Google Scholar]
  22. Víšek, J.Á. The least trimmed squares. Part I: Consistency. Kybernetika 2006, 42, 1–36. [Google Scholar]
  23. Víšek, J.Á. The least trimmed squares. Part II: √n-consistency. Kybernetika 2006, 42, 181–202. [Google Scholar]
  24. Víšek, J.Á. The least trimmed squares. Part III: Asymptotic normality. Kybernetika 2006, 42, 203–224. [Google Scholar]
  25. Chen, Y.; Stromberg, A.; Zhou, M. The Least Trimmed Squares Estimate in Nonlinear Regression; Technical Report, 1997/365; Department of Statistics, University of Kentucky: Lexington, KY, USA, 1997. [Google Scholar]
  26. Čížek, P. Asymptotics of Least Trimmed Squares Regression; CentER Discussion Paper 2004-72; Tilburg University: Tilburg, The Netherlands, 2004. [Google Scholar]
  27. Čížek, P. Least Trimmed Squares in nonlinear regression under dependence. J. Stat. Plan. Inference 2005, 136, 3967–3988. [Google Scholar] [CrossRef]
  28. Zuo, Y.; Zuo, H. Least sum of squares of trimmed residuals regression. Electron. J. Stat. 2023, 17, 2447–2484. [Google Scholar] [CrossRef]
  29. Hampel, F.R.; Ronchetti, E.M.; Rousseeuw, P.J.; Stahel, W.A. Robust Statistics: The Approach Based on Influence Functions; John Wiley & Sons: New York, NY, USA, 1986. [Google Scholar]
  30. Tableman, M. The influence functions for the least trimmed squares and the least trimmed absolute deviations estimators. Stat. Probab. Lett. 1994, 19, 329–337. [Google Scholar] [CrossRef]
  31. Öllerer, V.; Croux, C.; Alfons, A. The influence function of penalized regression estimators. Statistics 2015, 49, 741–765. [Google Scholar] [CrossRef]
  32. Pollard, D. Convergence of Stochastic Processes; Springer: Berlin, Germany, 1984. [Google Scholar]
  33. Bednarski, T.; Clarke, B.R. Trimmed likelihood estimation of location and scale of the normal distribution. Aust. J. Stat. 1993, 35, 141–153. [Google Scholar] [CrossRef]
  34. Butler, R.W. Nonparametric interval point prediction using data trimmed by a Grubbs type outlier rule. Ann. Stat. 1982, 10, 197–204. [Google Scholar] [CrossRef]
  35. Hössjer, O. Rank-Based Estimates in the Linear Model with High Breakdown Point. J. Am. Stat. Assoc. 1994, 89, 149–158. [Google Scholar] [CrossRef]
  36. Zuo, Y. A new approach for the computation of halfspace depth in high dimensions. Commun. Stat. Simul. Comput. 2018, 48, 900–921. [Google Scholar] [CrossRef]
  37. Zuo, Y. Projection-based depth functions and associated medians. Ann. Stat. 2003, 31, 1460–1490. [Google Scholar] [CrossRef]
  38. Shao, W.; Zuo, Y.; Luo, J. Employing the MCMC Technique to Compute the Projection Depth in High Dimensions. J. Comput. Appl. Math. 2022, 411, 114278. [Google Scholar] [CrossRef]
Figure 1. Marginal histograms and scatter plot of 500 $\widehat{\boldsymbol{\beta}}^n_{lts}$s with sample size $n = 50$.
Figure 2. Marginal histograms and scatter plot of 1000 $\widehat{\boldsymbol{\beta}}^n_{lts}$s with sample size $n = 50$.
Figure 3. Marginal histograms and scatter plot of 10,000 $\widehat{\boldsymbol{\beta}}^n_{lts}$s with sample size $n = 50$.