Robust Algorithms for Change-Point Regressions Using the t-Distribution

Lu, Kang-Ping; Chang, Shao-Tung

doi:10.3390/math9192394

Open AccessArticle

Robust Algorithms for Change-Point Regressions Using the t-Distribution

by

Kang-Ping Lu

¹ and

Shao-Tung Chang

^2,*

¹

Department of Applied Statistics, National Taichung University of Science and Technology, Taichung 404336, Taiwan

²

Department of Mathematics, National Taiwan Normal University, Taipei 116059, Taiwan

^*

Author to whom correspondence should be addressed.

Mathematics 2021, 9(19), 2394; https://doi.org/10.3390/math9192394

Submission received: 26 August 2021 / Revised: 20 September 2021 / Accepted: 22 September 2021 / Published: 26 September 2021

Download

Browse Figures

Versions Notes

Abstract

:

Regression models with change-points have been widely applied in various fields. Most methodologies for change-point regressions assume Gaussian errors. For many real data having longer-than-normal tails or atypical observations, the use of normal errors may unduly affect the fit of change-point regression models. This paper proposes two robust algorithms called EMT and FCT for change-point regressions by incorporating the t-distribution with the expectation and maximization algorithm and the fuzzy classification procedure, respectively. For better resistance to high leverage outliers, we introduce a modified version of the proposed method, which fits the t change-point regression model to the data after moderately pruning high leverage points. The selection of the degrees of freedom is discussed. The robustness properties of the proposed methods are also analyzed and validated. Simulation studies show the effectiveness and resistance of the proposed methods against outliers and heavy-tailed distributions. Extensive experiments demonstrate the preference of the t-based approach over normal-based methods for better robustness and computational efficiency. EMT and FCT generally work well, and FCT always performs better for less biased estimates, especially in cases of data contamination. Real examples show the need and the practicability of the proposed method.

Keywords:

change-point regression; robust; t-distribution; outlier; EMT algorithm; FCT algorithm

1. Introduction

In regression analyses, the relationship between the response and explanatory variables often changes abruptly at some positions, termed change-points (CPs) or thresholds. Change-point (CP) locations signify the place of changes in the data structure, which is usually pivotal for decision making or for realizing some scientific issues. Hence, CP regression models have been widely used for uncovering the meaning of CPs and regression coefficients in many fields, such as engineering, economics, medicine, climatology, genomics and linguistics. There are some motivating examples of CP detection in reality. In medicine, CPs often denote threshold values that are altered by the effect of some risk factor on the response. For example, the risk of a baby having Down Syndrome (DS) at birth is low and unrelated to the mother’s age until an age threshold, but then afterward, the DS risk grows steeply with the mother’s age [1]. Another use of CP models in medical studies is for monitoring patients’ health, such as detecting heart rate trend [2] or inspecting the activity of epileptics [3]. Besides, in econometrics, CPs often signal the time of market structure changes as a result of government interference or policy changes [4]. More applications include discovering the time of changes in the temperature trend for studying the global warming effect in climatology [5] and finding changes in the DNA copy number in genomics [6].

Outliers frequently occur in regression data due to measurement errors or improperly sampling a fraction of data from a different population. Outliers can lead to distortions in the model fitting. Outliers even have greater impacts on CP regressions since they may not only distort the estimations of model parameters but also twist the segmented structure of data. Fearnhead and Rigaill [6] provided a real example to illustrate the CP detection errors caused by outliers. They remarked that many CP methods are sensitive to outliers because of the modeling assumption of Gaussian noises, with much work on CP regressions being carried out for Gaussian models, such as in [7,8,9,10,11]. Despite that the distortion very likely resulted from outliers and the considerable concern for robust procedures in the literature of CP detection, there are few studies on robust estimations for CP regressions.

In regression analysis, the two common methods for solving outlier problems are outlier diagnostic and robust procedures. The robust approach is more recommended [12,13] since the diagnostic approach has issues such as it may fail due to the masking effect, the probability distribution of the resulting statistics is unknown and the variations of estimators may be underestimated. Hence, in this study, we aim at finding a new robust procedure for CP regressions. Here, we regard an estimation method to be robust in the sense of insensitivity to outliers and heavy-tailed distributions.

In this paper, we propose a robust CP regression model using t-distributions. The longer tail of a t-distribution makes it a more robust approach to CP regression data fitting by giving less weight to observations that are atypical in the computation of CPs and regression parameters. With the t CP regression approach, the normal distribution in each regime is embedded in a broader class of symmetric distributions, with an additional parameter called the degrees of freedom, r. As commented by Lange et al. [14] and Peel and McLachlan [15], the parameter r is helpful for tuning the robustness of the t-based method so that it is crucial for achieving resistant estimations. The proposed method treats CPs as latent class variables representing data points lying in respective regimes so that the new method is free of the difficulty of non-differentiability of the log-likelihood function with respect to CPs regarded as parameters. The computations of t log-likelihood are simplified by writing the t-distribution as a scale mixture of normal distributions. Through two latent variables introduced into the observed data, the expectation and maximization (EM) algorithm is employed to estimate the CP model.

Considering the indefiniteness of the demarcation between two adjacent pieces, we extend the crisp CP model to a fuzzy class model. We create a fuzzy classification t (FCT) method which is more robust than the t-based EM-type algorithm. To make the t CP model more tolerant of high leverage outliers, we further propose a modified version to first moderately prune high leverage points and then employ the new method to the pruned dataset. The rules for deciding the proportion of pruning and the degrees of freedom of t-distributions are also discussed. Through numerous simulations and real applications, we demonstrate the effectiveness and superiority of the proposed methods by comparing them with existing methods.

The rest of this paper is organized as follows. In Section 2, we present some related work in the literature. In Section 3, we develop the new methods and illustrate the determination of the degrees of freedom of the t-distribution, the resistance to high leverage outliers and the robust properties of the proposed methods. In Section 4, experiments with numeric and real examples are provided to show the effectiveness and superiority of the proposed methods. Finally, the discussion and conclusions are stated in Section 5.

2. Related Work

Since change-point (CP) problems are often encountered in practice, numerous methodological approaches have been implemented for estimating CP models, such as likelihood-based estimation, Bayesian analyses, grid-searching approaches and clustering methods. A main difficulty with the likelihood estimations for CP regressions is the non-differentiability of the likelihood function with respect to the CPs viewed as parameters. Researchers have tried to overcome this problem through various approximations or smooth transitions between two adjoining regimes. For example, Muggeo [16] created a segmented (SEG) regression method by means of a simple linearization technique. This method allows simultaneous inference on CPs and regression coefficients, but it is not resistant to outliers and only workable for continuous models. Recently, Lu and Chang [10] conquered the non-differentiable problem by transforming the CP regression model into a mixture regression model and introduced an EM-BP algorithm to derive estimates. There have been more recent methods for solving CP problems. Chakar et al. [17] consider the infeasibility of the dynamic programming algorithm for dependent series. They proposed a robust estimator of the autocorrelation parameter to de-correlate the autoregressive time series. Then, the classical inference for independent series can be utilized to estimate the CPs in the autoregressive series.

There are some recent Bayesian CP analyses. For instance, Ko et al. [18] proposed the Dirichlet process hidden Markov model (DPHMM) to detect CPs without requiring the specification of the number of CPs a priori. Bardwell and Fearnhead [19] presented a Bayesian approach to detect abnormal regions in multiple time series, allowing the presence of changes in a small subset of the time series.

Nonparametric CP methods were also suggested for the data, which cannot be well-approximated by any parametric distribution, such as in the works of Zou et al. [20] and Haynes et al. [21]. Moreover, Frick et al. [7] considered the question of search accuracy in CP analysis. They introduced a simultaneous multiscale CP estimator (SMUCE) to provide the estimates of an unknown step function with the probability of over- and underestimating the number of CPs being controlled. Pein et al. [8] further proposed a heterogeneous simultaneous multiscale change-point estimator (H-SMUCE) for a heterogeneous piecewise constant model. More recently, various researchers have focused on developing computationally efficient methods [22,23]. For more reviews on CP work, we refer to articles [24,25] and references therein.

In spite of recent rapid developments on various aspects of CP detection, there is less work on the robustness to the presence of outliers [6,24]. To alleviate the disturbance of outliers, data may be pre-processed to remove outliers. After editing suspected outliers, the subsequent analysis is still based on the normality assumption. A serious problem with this approach is that the resulted influences may fail to reflect uncertainty in the “cleaned” data; particularly, standard errors may be too small.

The least absolute deviations (LAD) estimation has been used to estimate CP regression models robustly by a few researchers [26,27]. A robust approach connected with the LAD procedure is assuming that regression errors have a Laplace distribution. Yang [28] and Jafari et al. [29] employ Laplace distributions to estimate CP regression models robustly. The restriction of a single CP detection limits their applicability because multiple CPs often exist in reality. Although the LAD method is more robust than the least squares (LS) estimation, it is computationally intensive to optimize the objective function [25].

M-estimation is another popular robust approach in linear regressions, which replaces the least squares criterion with a resistant criterion. Considering the robustness of M-estimation, Fearnhead and Rigaill [6] proposed to associate the penalized cost approaches with a less sensitive loss function for estimating CPs robustly. The authors claimed that a bounded loss, such as the bi-weight loss, is favorable for robust estimations. However, their method relies on a constant variance assumption which is unsuitable in many applications [8]. Recently, a robust estimator based on the Wilcoxon statistic was proposed for estimating the CP of the mean [30], but this approach is restricted to a single CP detection.

The t-distribution has also been employed for robust fitting of a statistical model to the data structure. For example, Lange et al. [14] proposed to use t-distributions for the robust fit of a traditional linear regression model to data. Peel and McLachlan [15] proposed a mixture model of t-distributions for robust clustering. Yao et al. [31] presented a t mixture regression model for estimating model parameters robustly. Lin et al. [32,33] presented the tests on heteroscedasticity in Student t and skew-t-normal regression models, respectively. Although t-distributions have been used in mixture regressions for making resistant estimations, they were barely considered for CP regression problems. Though locating CPs in a piecewise regression model is similar to classifying data into classes of similar objects, CP regression problems are different from mixture regressions in some aspects. For example, data points in CP regressions are ordered according to the order of some partitioning variable, and hence there are restrictions on the memberships of data points belonging to each individual segment. Subsequently, the robust mixture regressions of t-distributions cannot be applied to CP regressions directly. The rare implementation of t-distributions in CP regressions may be because the t likelihood function is non-differentiable with respect to CPs and there is no explicit form for the maximizer of the t-log-likelihood.

A few studies have utilized t-distributions to solve CP regression problems. Osorio and Galea [34] located a single CP for t regression models based on Schwarz Information Criterion (SIC). Their method is restricted to a single CP detection, and the degrees of freedom of t-distributions are not well-estimated. Lin et al. [35] adopted a Bayesian approach to identify the position of changes in variance for a t regression model with the regression coefficients assumed to be constant. The assumption of constant regression coefficients is restrictive and unrealistic. Furthermore, the burden of their computations is heavy because six steps are required in one pass of the Gibbs sampler for locating a single CP only, not mentioned for detecting multiple CPs. That is why they suggested detecting multiple CPs one by one in a hierarchical binary way instead of simultaneous detection. However, the hierarchical binary method for locating multiple CPs may not result in a correct optimum [11]. Moreover, these researchers did not illustrate the resistance of their method against high leverage outliers. It is known that t-distributions are sensitive to high leverage outliers, and these kinds of outliers usually have severe effects on the fit of a regression model to the data.

3. Robust Change-Point Regression Using t-Distributions

3.1. An EMT Algorithm for Change-Point Regression Models

Assume a dataset,

\{(x_{1}^{'}, y_{1}), \dots, (x_{n}^{'}, y_{n})\}

, is randomly drawn from a population of CP regressions, with

x_{j} = (x_{j 1,} \dots, x_{j p})^{'} \in ℜ^{p}

(

(\cdot)^{'}

represents the transpose of a vector or a matrix) and

y_{j} \in ℜ

,

j = 1, \dots, n

expressing the jth independent observation and response, respectively. Suppose that

x_{d}^{*} = (x_{1 d}, x_{2 d}, \dots, x_{n d})^{'}

, for some

d \in \{1, \dots, p\}

, is an independent variable that splits the domain of independent variables into sub-domains, such that regression models vary from sub-domain to sub-domain. Thus, we may assume that the dataset

\{(x_{1}^{'}, y_{1}), \dots, (x_{n}^{'}, y_{n})\}

has been positioned in the increasing order of

x_{d}^{*}

. Accordingly, CPs can be stated by the rank of the ordered data. Suppose a CP regression model contains c − 1 CPs at

τ_{1}, τ_{2}, \dots, τ_{c - 1}

, with

1 \leq τ_{1} < τ_{2} < \dots < τ_{c - 1} \leq n - 1

(

τ_{c - 1} < n

since the last observation cannot be a CP). Set

τ_{0} = 0

and

τ_{c} = n

. Then, the CP regression model can be represented as follows. Given

(x_{j}^{'}, y_{j})

lying in the ith segment, i.e.,

τ_{i - 1} < j \leq τ_{i}

,

y_{j | i} = β_{i 0} + β_{i 1} x_{j 1} + \dots + β_{i p} x_{j p} + e_{j | i}, i = 1, \dots, c,

(1)

where

e_{j | i}

are independent with mean 0 and variance

σ_{i}^{2}

,

i = 1, \dots, c

. Since CPs can appear randomly among

\{1, \dots, n - 1\}

, the c − 1 CPs can be viewed as any possible combination of c – 1 numbers drawn from

\{1, \dots, n - 1\}

. Let

τ = {(τ_{1}, τ_{2}, \dots, τ_{c - 1})}^{'}

with

1 \leq τ_{1} < τ_{2} < \dots < τ_{c - 1} \leq n - 1

be a collection of c − 1 CPs. Define a random variable

w_{τ}

as:

w_{τ} = \{\begin{matrix} 1, & if 1 \leq τ_{1} < τ_{2} < \dots < τ_{c - 1} \leq n - 1 are CPs, \\ 0, & otherwise, \end{matrix}

(2)

with:

P (w_{τ} = 1) = ε_{τ}

and

\underset{1 \leq τ_{1} < τ_{2} < \dots < τ_{c - 1} \leq n - 1}{\sum \dots \sum} ε_{τ} = 1 .

Then, define

z_{i j}

as:

z_{i j} = \{\begin{matrix} \underset{\begin{matrix} 1 \leq τ_{1} < τ_{2} < \dots < τ_{c - 1} \leq n - 1 \\ τ_{i - 1} < j \leq τ_{i} \end{matrix}}{\sum \dots \sum} w_{τ}, & 1 \leq i \leq c, 1 \leq j \leq n, \\ 0, & otherwise . \end{matrix}

(3)

Note that

z_{i j}

,

i = 1, \dots, c

,

j = 1, \dots, n

are actually the class variables, where

z_{i j} = \{\begin{matrix} 1, & if (x_{j}, y_{j}) is from segment i, \\ 0, & otherwise . \end{matrix}

Thus, the sample

(x, y) = {((x_{1}^{'}, y_{1}), \dots, (x_{n}^{'}, y_{n}))}^{'}

is incomplete with missing data

z = {(z_{11}, z_{12}, \dots, z_{c n})}^{'} .

The EM algorithm is used for estimations. The complete likelihood function is:

L^{c} (θ | (x, y), z) = \prod_{i = 1}^{c} \prod_{j = 1}^{n} {[π_{i} f_{i} (y_{j} | x_{j})]}^{z_{i j}},

where

π_{i}

are mixing proportions and

θ

is the vector of parameters included in regressions, and

f_{i} (y_{j} | x_{j})

is the probability density function of response

y_{j}

given by regressors

x_{j}

in the ith segment. The complete log-likelihood function is given by:

l^{c} (θ | (x, y), z) = \sum_{i = 1}^{c} \sum_{j = 1}^{n} z_{i j} \log (π_{i} \ln f_{i} (y_{j} | x_{j}))

(4)

For the E step of EM, we need the conditional expectation of

z_{i j}

given by

(x, y)

under

θ^{0}

.

E (z_{i j} | (x, y), θ^{0}) = \underset{\begin{matrix} 1 \leq τ_{1} < τ_{2} < \dots < τ_{c - 1} \leq n - 1 \\ τ_{i - 1} < j \leq τ_{i} \end{matrix}}{\sum \dots \sum} E (w_{τ} | (x, y), θ^{0})

= \underset{\begin{matrix} 1 \leq τ_{1} < τ_{2} < \dots < τ_{c - 1} \leq n - 1 \\ τ_{i - 1} < j \leq τ_{i} \end{matrix}}{\sum \dots \sum} P (w_{τ} = 1 | (x, y), θ^{0}) .

Based on Bayes’ theorem,

P (w_{τ} = 1 | (x, y), θ^{0})

is estimated by:

ε_{τ}^{*} = \frac{\prod_{j = 1}^{τ_{1}} f_{1}^{0} (y_{j} | x_{j}) \prod_{j = τ_{1} + 1}^{τ_{2}} f_{2}^{0} (y_{j} | x_{j}) \dots \prod_{j = τ_{c - 1} + 1}^{n} f_{c}^{0} (y_{j} | x_{j}) ε_{τ}^{0}}{\underset{1 \leq k_{1} < k_{2} < \dots < k_{c - 1} \leq n - 1}{\sum \dots \sum} \prod_{j = 1}^{k_{1}} f_{1}^{0} (y_{j} | x_{j}) \dots \prod_{j = k_{c - 1} + 1}^{n} f_{c}^{0} (y_{j} | x_{j}) ε_{k}^{0}},

(5)

where the superscript 0 indicates that parameters are given as

θ^{0}

. Then,

E (z_{i j} | (x, y), θ^{0})

is estimated by:

p_{i j}^{*} = \{\begin{matrix} \underset{\begin{matrix} 1 \leq τ_{1} < τ_{2} < \dots < τ_{c - 1} \leq n - 1 \\ τ_{i - 1} < j \leq τ_{i} \end{matrix}}{\sum \dots \sum} ε_{τ}^{*}, & i = 1, \dots, c, j = 1, \dots, n, \\ 0, & otherwise . \end{matrix}

(6)

Replace

z_{i j}

in (4) with

p_{i j}^{*}

in Equation (6) and denote the resulted expression as

Q (θ | (x, y), θ^{0})

. The M step is to maximize

Q (θ | (x, y), θ^{0}) = \sum_{i = 1}^{c} \sum_{j = 1}^{n} p_{i j}^{*} \ln (π_{i} f_{i} (y_{j} | x_{j})) .

(7)

Repeating the steps E and M until the stop criterion is met, we then obtain the estimator of CPs as:

\hat{τ} = \underset{1 \leq τ_{1} < τ_{2} < \dots < τ_{c - 1} \leq n - 1}{\arg \max} (ε_{τ}), τ = {(τ_{1}, τ_{2}, \dots, τ_{c - 1})}^{'} .

To robustly estimate the CP regression model given by Equation (1), assume that the regression errors

e_{j | i}

,

j = 1, \dots, n

, follow a t-distribution, with

r_{i}

degrees of freedom (DF), scale parameter

σ_{i}

and the density:

f (e | σ, r) = \frac{1}{σ} h (\frac{e}{σ} | r), h (t | r) = \frac{Γ (\frac{r + 1}{2})}{{(π r)}^{1 / 2} Γ (\frac{r}{2}) {\{1 + \frac{t^{2}}{r}\}}^{\frac{1}{2} (r + 1)}},

(8)

where

h (t | r)

is the density of a standard t-distribution with r DF, t(r). We first suppose

r_{i}

s are known, and we discuss the decision of

r_{i}

s later. The complete log-likelihood is:

l^{c} (θ | (x, y), z) = \sum_{i = 1}^{c} \sum_{j = 1}^{n} z_{i j} \ln (π_{i} f_{i} (y_{j} - {X^{'}}_{j} β_{i} | x_{j}, r_{i}, σ_{i})),

where

f_{i} (\cdot)

is the density of Equation (8),

X_{j}^{'} = (1, x_{j}^{'}) = (1, x_{j 1}, \dots, x_{j p})

,

β_{i} = (β_{i 0}, β_{i 1}, \dots, β_{i p})^{'}

,

θ = {(β_{1}^{'}, \dots, β_{c}^{'}, σ_{1}, \dots, σ_{c})}^{'}

. To simplify the computations, we express the t-distribution as a scale mixture of normal distributions (c.f. [12,28]). We introduce into the data vector additional missing data

v = (v_{1}, \dots, v_{n})^{'}

, satisfying that given

z_{i j} = 1

,

y_{j} - X_{j}^{'} β_{i} | x_{j}, v_{j}, z_{i j} = 1 ~ N (0, σ_{i}^{2} / v_{j}), v_{j} | z_{i j} = 1 ~ gamma (\frac{r_{i}}{2}, \frac{r_{i}}{2}),

(9)

where

gamma (α, λ)

has density

g (v; α, λ) = \frac{1}{Γ (α)} λ^{α} v^{α - 1} e^{- λ v}

, g > 0. Then,

y_{j} - X_{j}^{'} β_{i} | x_{j}, z_{i j} = 1

has a

t (r_{i})

distribution with a scale parameter

σ_{i}

. The complete log likelihood function becomes:

\begin{array}{l} l^{c} (θ | (x, y), z, v) \\ = \sum_{j = 1}^{n} \sum_{i = 1}^{c} z_{i j} \ln (π_{i} \frac{\sqrt{v_{j}}}{σ_{i}} ϕ (\frac{\sqrt{v_{j}} (y_{j} - X_{j}^{'} β_{i})}{σ_{i}}) g (v_{j} | r_{i})) \\ = \sum_{j = 1}^{n} \sum_{i = 1}^{c} z_{i j} \ln (π_{i} \frac{\sqrt{v_{j}}}{\sqrt{2 π} σ_{i}} \exp (- \frac{v_{j} {(y_{j} - X_{j}^{'} β_{i})}^{2}}{2 σ_{i}^{2}}) \frac{{(\frac{r_{i}}{2})}^{r_{i} / 2}}{Γ (\frac{r_{i}}{2})} v_{j}^{r_{i} / 2 - 1} \exp (- v_{j} r_{i} / 2)) \\ = \sum_{j = 1}^{n} \sum_{i = 1}^{c} z_{i j} \ln π_{i} - \frac{1}{2} \sum_{j = 1}^{n} \sum_{i = 1}^{c} z_{i j} \ln σ_{i}^{2} - \frac{1}{2} \sum_{j = 1}^{n} \sum_{i = 1}^{c} z_{i j} \frac{v_{j} {(y_{j} - X_{j}^{'} β_{i})}^{2}}{σ_{i}^{2}} \\ + \sum_{j = 1}^{n} \sum_{i = 1}^{c} z_{i j} \ln (\frac{\sqrt{v_{j}}}{\sqrt{2 π}} \frac{{(\frac{r_{i}}{2})}^{r_{i} / 2}}{Γ (\frac{r_{i}}{2})} v_{j}^{r_{i} / 2 - 1}) - \frac{1}{2} \sum_{j = 1}^{n} \sum_{i = 1}^{c} z_{i j} v_{j} r_{i}, \end{array}

where

ϕ (\cdot)

is the density of

N (0, 1)

. To use EM, given the current estimate

θ^{(k)}

, since the last two terms do not contain unknown parameters

θ

, the computation of

E (l^{c} (θ | (x, y), z, v) | (x, y), θ^{(k)})

in step E simplifies to computing

E (z_{i j} | (x, y), θ^{(k)})

and

E (v_{j} | (x, y), θ^{(k)}, z_{i j} = 1)

. Based on Equation (6):

E (z_{i j} | (x, y), θ^{(k)}) = p_{i j}^{(k + 1)} = \{\begin{matrix} \underset{\begin{matrix} 1 \leq τ_{1} < τ_{2} < \dots < τ_{c - 1} \leq n - 1 \\ τ_{i - 1} < j \leq τ_{i} \end{matrix}}{\sum \dots \sum} ε_{τ}^{(k)}, & i = 1, \dots, c, j = 1, \dots, n, \\ 0, & otherwise, \end{matrix},

(10)

where

ε_{τ}^{(k)}

is computed by Equation (5) with the density

f_{i} (y_{j} - X_{j}^{'} β_{i}^{(k)} | σ^{(k)}, r_{i})

given by Equation (8).

Moreover, the gamma distribution is the conjugate prior distribution for

v_{j}

. It is easy to show that:

v_{j} | (x, y), z_{i j} = 1, θ^{(k)} ~ gamma (\frac{r_{i} + 1}{2}, \frac{1}{2} [r_{i} + {(\frac{y_{j} - X_{j}^{'} β_{i}^{(k)}}{σ_{i}^{(k)}})}^{2}]),

(11)

E (v_{j} | (x, y), θ^{(k)}, z_{i j} = 1) = q_{i j}^{(k + 1)} = \frac{r_{i} + 1}{r_{i} + {[(y_{j} - X_{j}^{'} β_{i}^{(k)}) / σ_{i}^{(k)}]}^{2}} .

(12)

Then, step M is to maximize:

E (l^{c} (θ | (x, y), z, v) | (x, y), θ^{(k)})

\propto Q (θ | (x, y), θ^{(k)}) = \sum_{j = 1}^{n} \sum_{i = 1}^{c} p_{i j}^{(k + 1)} \ln π_{i} - \frac{1}{2} \sum_{j = 1}^{n} \sum_{i = 1}^{c} p_{i j}^{(k + 1)} \ln {(σ_{i}^{(k)})}^{2} - \frac{1}{2} \sum_{j = 1}^{n} \sum_{i = 1}^{c} p_{i j}^{(k + 1)} q_{i j}^{(k + 1)} {(\frac{y_{j} - X_{j}^{'} β_{i}^{(k)}}{σ_{i}^{(k)}})}^{2}

Taking partial derivatives of

Q (θ | (x, y), θ^{(k)})

with respect to all parameters with the Lagrange multiplier under constraint

\sum_{i = 1}^{c} π_{i} = 1

, we obtain the updated estimates:

β_{i}^{(k + 1)} = {(X^{'} U_{i}^{(k + 1)} X)}^{- 1} X^{'} U_{i}^{(k + 1)} Y,

(13)

{(σ_{i}^{2})}^{(k + 1)} = \frac{\sum_{j = 1}^{n} p_{i j}^{(k + 1)} q_{i j}^{(k + 1)} {(y_{j} - X_{j}^{'} β_{i}^{(k + 1)})}^{2}}{\sum_{j = 1}^{n} p_{i j}^{(k + 1)}},

(14)

where

X \in ℜ^{n \times (p + 1)}

is a matrix with the jth row

X_{j}^{'} = (1, x_{j}^{'}) = (1, x_{j 1}, \dots, x_{j p})

,

Y \in ℜ^{n}

is a vector with the jth component

y_{j}

and

U_{i}^{(k + 1)}

is a diagonal matrix with the jth diagonal element equal to

p_{i j}^{(k + 1)} q_{i j}^{(k + 1)}

. Based on the above, we propose an EMT Algorithm 1 for estimating CP regression models, as follows.

Algorithm 1. EMT

Set s = C_{c - 1}^{n - 1}

, ε^{(k)} = {[ε_{τ}^{(k)} : τ = {(τ_{1}, τ_{2}, \dots, τ_{c - 1})}^{'}, 1 \leq τ_{1} < τ_{2} < \dots < τ_{c - 1} \leq n - 1]}_{s \times 1}

,

p^{(k + 1)} = {[p_{i j}^{(k + 1)} : 1 \leq 1 \leq c, 1 \leq j \leq n]}_{c n \times 1} \in ℜ^{c n}

,

q^{(k + 1)} = {[q_{i j}^{(k + 1)} : 1 \leq i \leq c, 1 \leq j \leq n]}_{c n \times 1} \in ℜ^{c n}

.

Let η > 0

, k = 0, ε_{τ}^{(0)} = 1 / C_{c - 1}^{n - 1}

,

for each possible τ

, and q_{_{i j}}^{(0)} = 1

, i = 1, \dots, c

, j = 1, \dots, n

.

Step 1. Compute p^{(0)}

with Equation (6) using ε^{(0)}

. Then, compute β_{i}^{(0)}

, {(σ_{i}^{2})}^{(0)}

,

i = 1, \dots, c

, with Equations (13) and (14), respectively.

Step 2. Compute p^{(k + 1)}

and q^{(k + 1)}

with Equations (10) and (12), respectively.

Step 3. Compute β^{(k + 1)} \equiv {(β_{1}^{(k + 1)}, \dots, β_{c}^{(k + 1)})}^{'} \in ℜ^{c (p + 1)}

with p^{(k + 1)}

and q^{(k + 1)}

by Equation (13).

Step 4. Compute σ^{(k + 1)} \equiv {(σ_{1}^{(k + 1)}, \dots, σ_{c}^{(k + 1)})}^{'}

with β^{(k + 1)}

, p^{(k + 1)}

and q^{(k + 1)}

by Equation (14).

Step 5. Update ε^{(k + 1)}

with β^{(k + 1)}

, σ^{(k + 1)}

and ε^{(k)}

by Equation (5) with the density given by Equation (8).

Step 6. If ‖ε^{(k + 1)} - ε^{(k)}‖ < η

, then stop; else, k = k + 1

and go to Step 2.
After the EMT algorithm converges, the estimates of c − 1 CPs are given by

\hat{τ} = \underset{1 \leq τ_{1} < τ_{2} < \dots < τ_{c - 1} \leq n - 1}{\arg \max} (ε_{τ})

, τ = (τ_{1}, τ_{2}, \dots, τ_{c - 1})^{'}

.

3.2. A FCT Algorithm for t CP Regression Models

Since the EM algorithm converges slowly when data involve noises (c.f. [36]), we extend the crisp model to a fuzzy class model for improving the convergence rate and the precision of estimations. Similar to Lu and Chang [10], using the fuzzy partitioning concept, we regard each probable CP combination as a partition of data with a fuzzy membership and then transform them into the pseudo memberships of data belonging to each individual segment. As in previous descriptions, consider a regression model with c − 1 CPs, as given in Equation (1). Suppose

τ = {(τ_{1}, τ_{2}, \dots, τ_{c - 1})}^{'}

,

1 \leq τ_{1} < τ_{2} < \dots < τ_{c - 1} \leq n - 1

, is a collection of CPs with a fuzzy membership

α_{τ}

under the conditions

0 \leq α_{τ} \leq 1

and

\underset{1 \leq τ_{1} < τ_{2} < \dots < τ_{c - 1} \leq n - 1}{\sum \dots \sum} α_{τ} = 1

. Now, let

z_{i j}

be the fuzzy membership of observation

(x_{j}, y_{j})

belonging to the ith piece. Then,

z_{i j}

can be expressed by

α_{τ}

as follows (c.f. [10]):

z_{i j} = \{\begin{matrix} \underset{\begin{matrix} 1 \leq τ_{1} < τ_{2} < \dots < τ_{c - 1} \leq T - 1 \\ τ_{i - 1} < j \leq τ_{i} \end{matrix}}{\sum \dots \sum} α_{τ}, & i = 1, \dots c - 1, i \leq j \leq n - c + i, \\ 0, & i = 1, \dots c - 1, 1 \leq j < i or n - c + i < j \leq n, \\ 1 - \sum_{i = 1}^{c - 1} μ_{i j}, & i = c, 1 \leq j \leq n . \end{matrix}

(15)

Now, the CP model is a fuzzy class model with a fuzzy c-partition,

\{z_{1}, \dots, z_{c}\}

,

z_{i} = {(z_{i 1}, \dots, z_{i n})}^{'}

,

i = 1, \dots, c,

and

\sum_{i = 1}^{c} z_{i j} = 1

,

j = 1, \dots, n

. The regression parameters can be estimated by maximizing the fuzzy classification maximum likelihood (FCML) function (c.f. [10,37]):

J (θ) = \sum_{i = 1}^{c} \sum_{j = 1}^{n} z_{i j}^{m} \ln f_{i} ((x_{j}, y_{j}) | θ) + w \sum_{i = 1}^{c} \sum_{j = 1}^{n} z_{i j}^{m} \ln π_{i}

(16)

subject to

\sum_{i = 1}^{c} π_{i} = 1

,

π_{i} > 0

, where m > 1,

r \geq 0

,

z_{i j} \in [0, 1]

,

\sum_{i = 1}^{c} z_{i j} = 1

,

i = 1, \dots, c

,

j = 1, \dots, n

and

θ = {(β_{1}^{'}, \dots, β_{c}^{'}, σ_{1}, \dots, σ_{c}, π_{1}, \dots, π_{c})}^{'}

. As stated, assume that the CP regression errors,

e_{j | i} = y_{j} - X_{j}^{'} β_{i}

,

j = 1, \dots, n

, have a t(r_i)-distribution. With the scale mixture expression of t(r) in Equation (9), the FCML objective function in Equation (16) is:

\begin{array}{l} T_{m, w} (z, π, β, σ) \\ = \sum_{i = 1}^{c} \sum_{j = 1}^{n} z_{i j}^{m} \ln f_{i} (y_{j} | β_{i}, σ_{i}) + w \sum_{i = 1}^{c} \sum_{j = 1}^{n} z_{i j}^{m} \ln π_{i} \\ = \sum_{i = 1}^{c} \sum_{j = 1}^{n} z_{i j}^{m} \ln (\frac{\sqrt{v_{j}}}{\sqrt{2 π} σ_{i}} \exp (- \frac{v_{i} {(y_{j} - X_{j}^{'} β_{i})}^{2}}{2 σ_{i}^{2}}) \frac{{(\frac{r_{i}}{2})}^{r_{i} / 2}}{Γ (\frac{r_{i}}{2})} v_{j}^{r_{i} / 2 - 1} \exp (- v_{j} r_{i} / 2)) + w \sum_{i = 1}^{c} \sum_{j = 1}^{n} z_{i j}^{m} \ln π_{i} \\ = - \frac{1}{2} \sum_{i = 1}^{c} \sum_{j = 1}^{n} z_{i j}^{m} \ln σ_{i}^{2} - \frac{1}{2} \sum_{i = 1}^{c} \sum_{j = 1}^{n} z_{i j}^{m} \frac{v_{j} {(y_{j} - X_{j}^{'} β_{i})}^{2}}{σ_{i}^{2}} + \sum_{i = 1}^{c} \sum_{j = 1}^{n} z_{i j}^{m} \ln (\frac{\sqrt{v_{j}}}{\sqrt{2 π_{i}}} \frac{{(\frac{r_{i}}{2})}^{r_{i} / 2}}{Γ (\frac{r_{i}}{2})} v_{j}^{r_{i} / 2 - 1}) \\ - \frac{1}{2} \sum_{i = 1}^{c} \sum_{j = 1}^{n} z_{i j}^{m} v_{j} r_{i} + w \sum_{i = 1}^{c} \sum_{j = 1}^{n} z_{i j}^{m} \ln π_{i} \end{array}

Based on Equation (12), replace

v_{j}

with

q_{i j}

in

T_{m, w} (z, π, β, σ)

. Take the partial derivatives of the Lagrange expression with

T_{m, w} (z, π, β, σ)

under constraint

\sum_{i = 1}^{c} π_{i} = 1

, then, we obtain the updated estimates:

β_{i}^{(k + 1)} = {(X U_{i}^{(k)} X^{'})}^{- 1} X U_{i}^{(k)} Y,

(17)

σ_{i}^{2^{(k + 1)}} = \frac{\sum_{j = 1}^{n} z_{i j}^{m^{(k)}} q_{i j}^{(k)} {(y_{j} - X_{j}^{'} β_{i}^{(k + 1)})}^{2}}{\sum_{j = 1}^{n} z_{i j}^{m^{(k)}}},

(18)

π_{i}^{(k + 1)} = \frac{\sum_{j = 1}^{n} z_{i j}^{m}^{^{(k)}}}{n},

(19)

where

U_{i}^{(k)}

is a diagonal matrix with the jth diagonal element equal to

z_{i j}^{m}^{^{(k)}} q_{i j}^{(k)}

. Having the regression estimates

\hat{θ}

, we update the CP memberships

α_{τ} s

as follows. Given a CP collection

τ

, define:

d_{τ}^{2} = - \sum_{i = 1}^{c} \sum_{j = τ_{i - 1} + 1}^{τ_{i}} \ln f_{i} (y_{j} | x_{j}, \hat{θ})

We then optimize the objective function:

J (α) = \underset{1 \leq τ_{1} < τ_{2} < \dots < τ_{c - 1} \leq n - 1}{\sum \dots \sum} α_{τ}^{m} \exp (d_{τ}^{2})

subject to

\underset{1 \leq τ_{1} < τ_{2} < \dots < τ_{c - 1} \leq n - 1}{\sum \dots \sum} α_{τ} = 1

.

0 \leq α_{τ} \leq 1

. The updated

α_{τ}

is:

α_{τ} = \frac{\exp {(- d_{τ}^{2})}^{- 1 / (m - 1)}}{\underset{1 \leq t_{1} \leq t_{2} \leq \dots \leq t_{c - 1} \leq n - 1}{\sum \dots \sum} \exp (- d_{t}^{2} / (m - 1))},

(20)

We propose an FCT Algorithm 2 for robustly fitting CP regression models, as follows.

Algorithm 2. FCT

Set s = C_{c - 1}^{n - 1}

, α^{(k)} = {[α_{τ}^{(k)} : τ = {(τ_{1}, τ_{2}, \dots, τ_{c - 1})}^{'}, 1 \leq τ_{1} < τ_{2} < \dots < τ_{c - 1} \leq n - 1]}_{s \times 1}

,

Let η > 0

, k = 0, α_{τ}^{(0)} = 1 / C_{c - 1}^{n - 1}

, for each possible τ

, and q_{_{i j}}^{(0)} = 1

, i = 1, \dots, c

, j = 1, \dots, n

.

Step 1. Compute z_{i j}^{(0)}

1 \leq i \leq c

, 1 \leq j \leq n

by Equation (15).

Step 2. Compute β_{i}^{(k + 1)}

, σ_{i}^{2^{(k + 1)}}

, π_{i}^{(k + 1)}

, i = 1, \dots, c

, by Equations (17)–(19), respectively.

Step 3. Update α^{(k + 1)}

by (20) using β_{i}^{(k + 1)}

, σ_{i}^{2^{(k + 1)}}

, i = 1, \dots, c

, and Equation (8).

Step 4. Compute z_{i j}^{(k + 1)}

and q_{i j}^{(k + 1)}

by Equations (12) and (15), respectively.

Step 5. If ‖α^{(k + 1)} - α^{(k)}‖ < η

, then stop; else, k = k + 1

and go to step 2.
After FCT converges, the estimates of c − 1 CPs are given by

\hat{τ} = \underset{1 \leq τ_{1} < τ_{2} < \dots < τ_{c - 1} \leq n - 1}{\arg \max} (ε_{τ})

, τ = (τ_{1}, τ_{2}, \dots, τ_{c - 1})^{'}

.

3.3. Decide the Degrees of Freedom of t-Distributions

In this section, we discuss the decision of the degrees of freedom (DF),

r = {(r_{1}, \dots, r_{c})}^{'}

. Typically, one can derive the maximum likelihood estimator (MLE) of r by maximizing the log-likelihood over

r_{i}

. Another solution is to compute the likelihood over a grid set,

G = {1, 2, \dots, r_{m a x}},

and choose DF from G which has the largest likelihood (c.f. [28]). As is known, t(r) approaches normal distributions for a large enough r. In practice, r is not necessary to be very large for achieving normality. Similar to [31], we set r_max = 15. After performing lots of simulations, we found that choosing

r

based on a grid set often fails to provide CP estimates that are precise enough. However, numerous simulation results show that both EMT and FCT with

r = 1

(EMT1 and FCT1) always produce rather precise CP estimates, no matter whether data are contaminated or not. Accurate CP estimates are critical to solving CP regression problems. To demonstrate that EMT1 and FCT1 are generally suitable to all kinds of data, we employ EMT and FCT with r = 1, 5, 10 and 15 to diversified datasets randomly generated from the models of 1 and 2 CPs, with data including 0, 5, 10 and 15 noises, respectively. The models considered for generating data are shown in Figure 1 with regression errors following a normal distribution of N(0, 0.5²) (except for the cases in Figure 1, more situations and different kinds of models were also examined, but results are similar, and hence, we only report four cases here). We evaluate the performance based on R replications in each individual circumstance with R equal to 1000 and 200 for models of 1 and 2 CPs, respectively. The evaluation criteria includes (1) the mean of CP estimates (

\bar{CP}

) along with the standard errors (SECP), and (2) the sum of the mean squared errors (MSES) of model coefficient estimates computed as follows:

MSES \equiv \sum_{i, j} MSE (β_{i j}), MSE (β_{i j}) = \frac{1}{R} \sum_{r = 1}^{R} {({\hat{β}}_{i j (r)} - β_{i j})}^{2},

(21)

where

{\hat{β}}_{i j (r)}

is the estimate of

β_{i j}

in the rth replication. The normal-based EM-BP algorithm for CP regressions presented in [10] is also employed and renamed EMN for comparisons. For illustrative purposes, the true regression coefficients and one of the simulated series are shown in Figure 1. Table 1 summarizes the simulation results.

Table 1 demonstrates that EMT1 and FCT1 generally perform better than those with different degrees of freedom (DF) by virtue of more precise CP estimates and less MSE in the estimations of regression coefficients in all cases, especially the cases of data with contaminations. For clean data such as in Figure 1a, all EMTs and FCTs with different DF provide sufficient and comparable estimates of CPs and regression coefficients. However, for data containing noisy points, such as the cases of Figure 1b,d, only EMT1 and FCT1 work well in all, but the other EMTs and FCTs mostly fail to provide sufficient CP estimates. As illustrated by Peel and McLachlan [15], the use of t components provides less extreme estimates of the posterior probabilities of segment membership of the CP model. The longer tail of t-distributions makes the fitted CP models more robust by giving reduced weight to atypical observations in the calculation of CP estimates. Furthermore, the t model-based approach embeds the normal distribution for each segment into a wider class of elliptically symmetric distributions, with an additional parameter called DF which is key for tuning robustness. This illustrates that the long tail of t(1)-distributions makes EMT1 and FCT1 more robust than those with higher DF. Even with normal data involving no contaminations, EMT1 and FCT1 still perform as well as EMN because of the similarity of symmetry between normal and t-distributions. Moreover, the best robustness of EMT1 and FCT1 among all EMTs and FCTs can be seen from the robustness properties of the proposed methods illustrated in Section 3.5. Thus, one may first fit data using EMN, EMT and FCT with several different DF, such as r = 1, 5, 10 and 15, respectively. If the estimations of EMN, EMTs and FCTs are close, one may use EMN, considering that the data are clean and normally distributed. Otherwise, EMT1 and FCT1 should be adopted.

3.4. Resistance against High Leverage Outliers

Similar to the M-estimates for linear regressions, the proposed EMT and FCT are still susceptive to high leverage outliers. Here, high leverage outliers refer to outliers in the space of independent variables, since points far away from the main body of points in the X space have high leverage. Modifications for solving this problem are needed. In regression analyses, the leverage score of an observation

(x_{j}, y_{j})

is often defined as:

h_{j j} = n^{- 1} + {(n - 1)}^{- 1} {MD}_{j},

where

{MD}_{j} = (x_{j} - \bar{x})^{'} S^{- 1} (x_{j} - \bar{x}), x_{j} = (x_{j 1}, \dots, x_{j p})^{'},

\bar{x}

and

S

are the sample mean and the sample covariance of

x_{j}

s, respectively. However,

\bar{x}

and

S

are nonresistant to outliers, and leverage points may not be recognized due to the effect of other high leverage points [34]. To lessen the mask effect, we consider replacing

\bar{x}

and

S

with the minimum variance determinant (MCD) estimators [38] among all the subsets of

{x_{1}, x_{2}, \dots, x_{n}}

having h elements,

1 \leq h \leq n

. In choosing h, we adopt the suggestion proposed by Rousseeuw and Leroy [39] and Lopuhaa and Rousseeuw [40], that h = (n + p + 1)/2 should be utilized to have the largest breakdown value for the MCD estimators. Through the fast MCD (FastMCD) algorithm of Rousseeuw and Van Driessen [41] with the small sample corrections proposed by Pison [42], the dataset is trimmed by removing the top

α * 100 %

of the data sorted in the descending order of the leverages of data. Then, EMT/FCT is applied to the trimmed data. Since the proposed EMT/FCT is only sensitive to outliers with rather high leverages, a small value of

α

such as 0.05 is usually enough in practice. To advance the ability of the proposed algorithms in withstanding outliers, for data involving doubtful high leverage outliers, we suggest employing EMT/FCT after trimming those observations with a modified MD ranking in the top

α * 100 %

of MDs arranged in declining order using a small

α

.

3.5. The Robust Properties of EMT and FCT

In this section, we employ the influence function and gross error sensitivity to illustrate the ability of the proposed algorithms in withstanding outliers. These two criteria have been frequently used to measure robustness in statistics [13]. Suppose

y_{1}, \dots, y_{n}

are observed data for estimating the parameter

θ

. An M-estimator is created by minimizing the form:

\sum_{j = 1}^{n} ρ (y_{j}; θ),

(22)

where

ρ

is a function that measures the difference between the assumed distribution and the true one. An M-estimator is derived by solving the equation:

\sum_{j = 1}^{n} φ (y_{j}; θ) = 0,

(23)

where

φ (y_{j}; θ) = (\partial / \partial θ) ρ (y_{j}; θ)

. For example, the sample mean and median are the M-estimators for the loss function

ρ (y_{j}; θ) = {(y_{j} - θ)}^{2}

and

ρ (y_{j}; θ) = |y_{j} - θ|

, respectively. Consider a random sample,

Y_{1}, \dots, Y_{n}

, from a population of density function

f_{y} (y; θ)

and let

ρ (y_{j}; θ) = - \log f_{y} (y_{j}; θ)

. Then, the MLE is also an M-estimator.

The influence function (IF) is useful for evaluating the relative influence of individual observations toward the value of an estimate. The IF of an M-estimator is proportional to its

φ

function with the form:

IF (y; θ) = - \frac{φ (y; θ)}{\int [\frac{\partial φ (y; θ)}{\partial θ}] d F_{y} (y)},

(24)

where

F_{y} (y)

denotes the distribution function of Y. An outlier might bring about issues when the influence function of an estimator is unbounded. Many useful robustness measures can be investigated through IF. For example, the gross error sensitivity (g*) denoted by

g^{*} = \sup_{y} |IC (y; F, θ)|

(25)

is one important robustness measure.

g^{*}

can be thought of as the approximate magnitude of the worst influence that the addition of an infinitesimal point mass may have on the value of the associated estimator.

For the proposed EMT, the considered estimator is the MLE associated with the t density having r degrees of freedom:

f_{y} (y; θ) = \frac{Γ (\frac{r + 1}{2})}{{(π r)}^{1 / 2} Γ (\frac{r}{2}) {\{1 + \frac{{(y - θ)}^{2}}{r}\}}^{\frac{1}{2} (r + 1)}} .

Maximizing

\log f_{y} (y; θ)

is equivalent to minimize

\ln \{1 + \frac{{(y - θ)}^{2}}{r}\} .

The loss function associated with the MLE obtained by EMTr (r denotes DF) is equivalent to:

ρ (y; θ) = \ln \{1 + \frac{{(y - θ)}^{2}}{r}\}

(26)

and hence the

φ (y; θ)

is:

φ (y; θ) = \frac{- 2 (y - θ)}{r + {(y - θ)}^{2}} .

(27)

We only need to analyze

φ (y; θ)

because IF is proportional to

φ (y; θ)

according to Equation (24). Note that:

\lim_{y \to \infty} φ (y; θ) = \lim_{y \to - \infty} φ (y; θ) = 0 .

(28)

Equation (28) indicates that IF approaches to 0 as y approaches to positive or negative infinity. The maximum and minimum values of

φ (y; θ)

can also be derived by solving

(\partial / \partial y) φ (y; θ) = 0 .

(29)

Therefore, the

φ (y; θ)

in Equation (27) is bounded and continuous, as exhibited in Figure 2a. Furthermore, Figure 2a shows that for all y values, the

|φ (y, θ)|

of EMT1 is closest to 0 among all EMTr, which means that EMT1 is the most robust among all EMTr. Besides, Figure 2a demonstrates that for any given observation y, the

|φ (y, θ)|

of EMTr increases with r, and

|φ (y, θ)|

tends to be unbounded as

r \to \infty

and

|y| \to \infty

. This investigation reflects that the IF associated with EMN is unbounded because t(r) approaches to a normal distribution, as

r \to \infty

. In fact, the loss function corresponding to normal regressions is equivalent to

ρ (y; θ) = {(y - θ)}^{2}

with

φ (y; θ) = - 2 (y - θ)

. Thus, the IF with EMN is unbounded and the gross error sensitivity

g^{*} = \infty

. This is the reason why the normal-based MLE is not robust to outliers.

The plot of IF for the median is also displayed in Figure 2b. The influence of an excessively large or small y on the median is constant. However, the influence of an extremely large or small y on the estimator derived by EMT1 is tiny according to Equation (28) and Figure 2a. In fact, Equation (28) implies that an extremely large or small y can be viewed as an observation which may come from a different but unknown distribution and have no influence (i.e.,

IC (y; F, θ) = 0

) on our estimator. The maximum influence of observations, y, on our estimators can be derived by solving Equation (29). Hence, the proposed estimator has a bounded and continuous IF, and also a finite gross error sensitivity. Thus, the t CP model-based approach is robust from the point of view of robust statistics. EMT1 and FCT1 are more robust than those with DF greater than 1.

4. Experiments and Examples

4.1. Simulation Studies

We used numerical examples to demonstrate the effectiveness and the superiority of the proposed method. In all simulations, the stop criterion and initial values are set as

η = 0.000005

and

ε_{τ}^{(0)} = 1 / C_{c - 1}^{n - 1}

, for each possible combination of c − 1 CPs,

τ = (τ_{1}, \dots, τ_{c - 1})^{'}

,

1 \leq τ_{1} < τ_{2} < \dots < τ_{c - 1} \leq n - 1

. A sample of size n is randomly generated by selecting n values of the partitioning variable uniformly distributed over the interval [0, 10]. Then, divide the interval [0, 10] evenly into c nonoverlapping subintervals,

(d_{i - 1}, d_{i}]

,

i = 1, \dots, c

, with

d_{0} = 0

and

d_{c} = 10

. The values of the response variable y are computed according to respective regression settings. Seven methods used for comparisons include:

Using EM with errors having normal distributions [10] (EMN)
Rank-based segmented regression method [43] (RSE)
EMro-t in [44], re-denoted as EM-Tk
The proposed EMT with DF equal to 1 (EMT1)
The trimmed version of EMT1 with trimmed proportion, $α = 5 %$ (EMT1-tm)
The proposed FCT with DF equal to 1 (FCT1)
The trimmed version of FCT1 with trimmed proportion, $α = 5 %$ (FCT1-tm)

The comparisons are based on three models under four respective scenarios using sample sizes of 50 and 75 with 1000 and 200 replicates for 1 and 3 CPs models in each simulation run, respectively. For illustrative purposes, Figure 3 displays the true coefficients of the models used, and one of the simulated series including generated data, fitted lines and noises. Four respective scenarios considered for three models are similar and include the cases of data having (1) 0 atypical observations, (2) several background noises, (3) high leverage outliers and (4) heavy-tailed distributions. The evaluation measures for comparisons are (1) the MSE of CP estimates (MSE-CP) described as

MSE (\hat{τ}) = \frac{1}{R} \sum_{k = 1}^{R} {({\hat{τ}}_{k} - τ)}^{2}

(30)

where

{\hat{τ}}_{k}

is the CP estimate at kth replication, and (2) the sum of MSE of the regression coefficient estimates (MSES-reg) as given by Equation (21). Table 2 reports the results.

Table 2 shows that EMT1 and FCT1 perform better or at least comparable to the other methods in all cases of all three models, except the case of data contaminated by leverage outliers. However, for the cases with high leverage outliers, the trimmed versions, EMT1-tm and FCT1-tm, work well. Particularly, FCT1 usually performs better than EMT1 because the influence of outliers can be greatly reduced by moderately increasing the power of the fuzzy index m used in FCT1. More findings are stated as follows:

For datasets including no atypical observations, all EM-type and FCT methods (including EMN, EM-Tk, EM-T, EMT1, EMT-tm, FCT1 and FCT1-tm) work well for all three models, with FCT1 performing better than the others for less biased CP estimates and less MSES-reg with regression coefficient estimates.
RSE works well for the 1 CP continuous model (model b). However, RSE is not workable in cases of models with discontinuity (model a) or multiple CPs (model c). However, these situations frequently occur in practice.
For data with background noises such as cases a.2, b.2 and c.2, only the proposed t-based approach works generally well. Particularly, FCT1 provides more precise estimates.
For datasets including extremely high leverage outliers such as cases a.3, b.3 and c.3, all methods cannot work well, except EMT1-tm and FCT1-tm. In such situations, both trimmed versions provide effective estimates of CPs and regression coefficients for each model, and FCT1-tm performs better for producing less biased estimates on the whole.
From extensive simulations, the iteration time respectively taken by EMT1 and FCT1 in all cases is always less than other EM-based methods. This evidence indicates that the proposed method is most computationally economic. Especially, it is even more timesaving than the normality-based EMN.

Next, we show that the proposed algorithms are more tolerant of extreme outliers in comparison with the existing methods based on model b in Figure 3. Four cases of extreme contaminations occurring at different locations are considered, with a simulated sample shown in Figure 4.

Cases 1 and 4: 5 (9.1%) identical outliers occur at (0, 7) and (10, 12), respectively. The leverage of outliers in these two cases is high.

Cases 2 and 3: 8 (13.8%) identical outliers occur at (3, 0) and (8, 4), respectively. The leverage of outliers in these two cases is moderate.

Note that in all four cases, the distance of the observed y-value from the predicted y- value is larger than 6 deviations, and the percentage of extreme outliers is more than 9.1% in each case. The estimation results are shown in Table 3.

Similar to previous comparisons, Table 3 shows that FCT1 performs best in all four cases of extreme contaminations. Particularly, FCT1 and FCT1-tm are much more resistant against radical outliers than other methods for providing estimates as precise as those obtained from clean data. Specifically, in cases 1 and 4, where the drastic outliers (0, 7) and (10, 12) have rather high leverages and their responses are distant from the predicted values for 12 standard deviations, FCT1 methods are still hardly affected by outliers and produce adequate estimates even if extreme outliers account for almost 10% of the whole dataset. Moreover, as explained earlier, FCT1 is more robust than EMT1 in that the fuzzy partitioning and the fuzziness index, m, are helpful for reducing the influence of outliers by giving tiny weights with a large m to the exceptional outliers. Table 3 demonstrates again that the normal-based method, EMN, cannot work well in the presence of extreme outliers. Additionally, while RSE is quite robust to outliers with moderate leverages, it cannot provide satisfied regression estimates for data involving drastic outliers with fairly high leverages.

In summary, the proposed FCT1 method is very robust to outliers, even though the dataset includes extreme outliers with the observed responses far away from the predicted values, by more than 6 standard deviations. Furthermore, FCT1 can still produce sufficient estimates for datasets in the presence of radical outliers with quite high leverages. More importantly, FCT1 is tolerant of the extreme outliers, which account for about 10% of the total data in all four cases of the experiment. In practice, outliers far away from the predicted regression line, by more than 6 standard deviations, are unusual and very likely less than 10% of the whole dataset. Thus, the proposed FCT1 is attainable in real applications, generally.

4.2. Real Data Applications

Example 1. We applied the proposed methods to a dataset involving 107 terrestrial animals gathered by Garland [45] for analyzing the relationship between the maximal running speed (MARS) with the body size. It is understandable that animals of different sizes can reach different maximal running speeds (MARS). McMahon [46] commented that animals are built for elastic similarity and the MARS of animals is proportional to the fourth root of body mass, (mass)^1/4, among elastically similar animals. The plot in Figure 5 shows that a slope change occurs in the relationship between (MARS)^1/4 and (mass)^1/4. Additionally, several exceptionally low points appear near the place where the regression slope converts. It is appreciable that these suspicious points very likely pull down the regression line and hence have a substantial influence on the regression slope.

We respectively apply EMN, EMT and FCT to the MARS data by setting DF = 1. The fitting results in Figure 5 show that EMT produces a much shorter CP at 25 and the line on the right hand side becomes flatter due to the down pull exerted by those extremely low points (see Figure 5, the blue fitted line by EMT and

{\hat{β}}_{21}^{EMT} = - 0.0456

, flatter than those by EMN and FCT1,

{\hat{β}}_{21}^{EMN} = - 0.0892

and

{\hat{β}}_{21}^{FCT} = - 0.0820

). The CPs produced by EMR (

{\hat{τ}}^{EMN} = 46

) and FCT (

{\hat{τ}}^{FCT} = 41

) are very close, but the slope of the left line (

{\hat{β}}_{11}^{EMN} = 0.4973

) obtained by EMN is much lower than that by FCT1 (

{\hat{β}}_{11}^{FCT} = 0.9118

). This evidence indicates that FCT1 is much more resistant against those extraordinarily low points than EMN. Based on previous simulation results showing that FCT1 usually provides less biased estimates than other models, we believe that the estimated model provided by FCT1 is reasonable and best suited to MARS data. The MARS data can be obtained from [45].

Example 2. We analyzed a bedload transport dataset (downloaded from https://doi.org/10.2737/RDS-2007-0004 (accessed on 26 May 2017)) to study the relationship between the discharge rate (m³/s) and the rate of bedload transport (kg/s). Bedload transport measures the transportation of particles in flowing fluids along a bed. The bedload transportation is often described as occurring in two phases, starting from a relatively slow stage (Phase I) to a faster stage (Phase II). The transition point is the main interest of researchers in studying bedload transportation. Ryan and Porth [47] had fitted the dataset with a piecewise regression model, and they found it difficult because few data were collected at higher flows. The scatter plots of data in Figure 6 show that the transportation rate tends to increase with the water discharge faster after the discharge rate goes beyond 1.5 m³/s. Furthermore, the two points on the upper right having extraordinarily high transportation rates are leverage points.

We implemented EMN, EMT1, EMT1-tm, FCT1 and FCT1-tm with DF = 1 and trimming percentage

α = 0.05

to the data. Figure 6 shows that EMN and EMT1 are heavily affected by two high leverage points and both give an identical CP at 74 (discharge rate, DR = 1.81354). This is irrational for only two points included in Phase II. By contrast, the other three methods are much more robust, resulting in the same CP at 62 (DR = 1.5786) and much lower slope estimates for the second line, especially FCT1 and FCT1-tm. Particularly, the slope of the second line,

{\hat{β}}_{21}^{FCT 1 - tm} = 0.0282

, obtained by FCT1-tm is much smaller than those obtained by other methods. These facts illustrate that FCT1 and FCT1-tm are more resistant against unusual observations; especially, FCT1-tm is capable of withstanding high leverage outliers. Our resulting CP estimate, 62, looks reasonable and it is close to the finding of Zhang and Li [48]. As for deriving more accurate estimates for Phase II, just as suggested by Ryan and Porth [47], more data at faster flows are required.

Example 3. We applied the proposed algorithms to the Dow Jones Index closing price ranged from 22 October 2008 to 22 October 2009 (downloaded from https://finance.yahoo.com/quote/%5EDJI/history?ltr=1 (accessed on 20 July 2021)). Due to the financial crisis of 2008 with a bankruptcy of Lehman Brothers, U.S. stock market price series suffered a huge fluctuation from late 2008 to 2009. Thus, the U.S. stock market price series during this period is worthy of investigations to illustrate variance changes in real situations. Adopting an approach common in finance, we consider the stock returns,

R_{t} = (P_{t + 1} - P_{t}) / P_{t}

, where

P_{t}

is the closing price on day t. Since the values of

\{R_{t}\}

are too small, we applied our methods to the series

\{R_{t}\}

,

R_{t}^{e} = 100 \exp (R_{t})

. The descriptive statistics of

\{R_{t}^{e}\}

in Table 4 and the box plots of the whole data and the 4 sub-segments derived by EMN in Figure 7 reveal that the series

\{R_{t}^{e}\}

follows a fat-tailed distribution and includes several extreme outliers. Therefore, t-regression models are better-suited to the series

\{R_{t}^{e}\}

.

Lin et al. [35] studied the same data and concluded three CPs at 35, 133 and 182. We set c = 4 and applied EMN, EMT1 and FCT1 to

\{R_{t}^{e}\}

, separately. EMN produces 3 CPs at 32, 122 and 182, which are close to those of Lin et al. [35]. However, the box plots in Figure 7b show that the last three segments obtained by EMN contain extreme outliers. Hence, the CPs estimated by EMN are suspicious due to its sensitivity to outliers and heavy-tails. Moreover, the estimated CPs of Lin et al. are obtained by the binary hierarchic segmentation, which does not generally provide the true optimum (cf. [11,49]). On the other hand, EMT1 and FCT1 produced 3 CPs at 21, 32 and 108 and 21, 32 and 151, in which the second segment contains only 11 observations and the difference of variations between the first two segmentations is not statistically significant. Furthermore, the scree test for choosing the number of CPs (see the discussion in Section 5) suggests that the optimal number of segments is 3. The box plots in Figure 8 show that several outliers appear in the last two segments. According to previous experiments, FCT1 is powerful in resisting atypical observations and it is more robust than EMT1. Thus, the CP estimates, 33 and 151, provided by FCT1 should be the optimal times of variance changes in the return series.

5. Conclusions and Discussion

5.1. Discussion

Computation cost is usually a concern in CP detection methods. The capacity of a robust method in resisting outliers for different sizes of datasets is important. For most statistical methods, it is harder to achieve robustness with a small dataset because estimations are easily affected by anomalous observations for small amounts of data. In order to show the high robustness of the proposed method even for small samples, we used datasets consisting of tens or hundreds of data points in the experiments. When fitting a regression model with c − 1 CPs to a dataset of size n by EMT or FCT, we consider

C_{c - 1}^{n - 1}

potential collections of CPs, simultaneously. Therefore, the computational efficiency can be impeded as the size of the dataset or the number of CPs increases. In fact, except for the sample size and the number of CPs, other factors may influence the computational cost, such as the model variations, the model structures, such as continuity or discontinuity at CPs, and the discrepancies of models between two contiguous segments. According to our simulations executed on a computer of an Intel Core i7-7700HQ CPU with 2.80 GHz and 8 Gb of Ram, the proposed methods are suitable for datasets of sizes no more than 2000, 1000 and 500 for locating one, two and three CPs, respectively. For large datasets, a possible approach is to cut the sample size to the favorable size and then derive preliminary estimates by using the reduced data. One possible way for reducing the data size is to remove the data of even rank in the ordered dataset arranged in the increasing order of the partitioning variable,

x_{d}^{*}

, then repeatedly reduce the data until the favorable size is reached. After deriving the rudimentary estimates, we can further estimate CPs and regression parameters by employing EMT to the nearby data of the preceding CPs again.

The determination of the number of CPs is important in CP detection. However, it is difficult when lacking likelihood-based tests for the number of CPs because the regularity conditions fail and there is little knowledge about the asymptotic distribution of the test statistics even for Gaussian cases. Most CP studies usually adopted some information criteria or informal tests for deciding the number of segments. Here, we employed the idea of Hawkins’s scree test [49] to investigate the progressive changes in the residual sum of squared errors (

R S S (c - 1) - R S S (c)

, c is the number of segments fitted to the data), with the changes in the MSE described as:

MSE (c) = \frac{RSS (c)}{[n - c (p + 1)]} = \frac{\sum {(y_{i} - {\hat{y}}_{i})}^{2}}{[n - c (p + 1)]},

and the changes in the maximized log-likelihood multiplied by −2 (−2lnL) (i.e.,

F (c - 1) - F (c)

,

F (c)

is −2 multiplying the optimized log-likelihood obtained from a model of c pieces fitted to the data). Additionally, Ciuperca’s [50] generalized Bayesian Information Criterion (BIC) was also utilized.

The suggested indexes were employed to the four models, shown in Figure 9, with the data contaminated by background noises, which includs outliers with a quite high leverage. For data contaminated by high leverage outliers, the trimmed version with

α = 0.05

was used and the results were similar to those of cases including no outliers. Hence, we do not present the results of cases with high leverage outliers here. We report the results of both EMT1 and FCT in Table 5. For models including zero or one single CP, all four suggested indexes (RSS, MSE, −2lnL and BIC) provide a correct c for each respective case. For the models of multiple CPs, the BIC and −2lnL index fails to obtain the correct c. The RSS and MSE indexes work well in all circumstances. It can be easily seen from Table 5 that both the indexes RSS and MSE decreased progressively for c less than the true value, and then the decrease becomes less again once c goes beyond the true value. These two indexes provide a clear indication of the optimal c. Therefore, the two suggested indexes, RSS and MSE, are useful for determining the number of segments. However, in some applications, the cut-point may not be so clearly indicated by these indices. In such cirsumstances, one may fit the data with several possible c’s based on RSS and MSE and then choose the best fitted model on the basis of some criteria in regression analysis. Although there is no solid inferential base for the suggested method, it may suffice as a rough and practical tool.

On the other hand, the penalized likelihood approach has been utilized in recent studies, such as [7,8], to infer both the number of CPs and their locations. For large datasets, one may consider the penalized approaches with a penalty added into the objective likelihood to estimate the number and the locations of CPs simultaneously. However, the selection of a proper penalty is not easy because the penalty can have a substantial impact on the estimations for the number of CPs [28]. One may refer to Haynes et al. [51] for information on workable penalties.

In using the trimmed EMT, a small value of

α

was selected to remove leverage points where the leverage scores rank in the top

α \times 100 %

. Apparently, the choice of

α

is connected with the results of EMT1-tm and FCT1-tm. A formal statistical test is required for objectively judging whether a datum is a high leverage point based on a cut-point, which is the critical point resulted from a statistical test under a given significance level. Accordingly, high leverage points can be deleted adaptively but not fixed at a subjective value. Moreover, high leverage points may present small residuals and hence provide important information for model fitting. Therefore, more research on incorporating the information of high leverage points with small residuals is also needed to judge high leverage points better. One possible method is to borrow the idea of monitoring introduced by Cerioli et al. [52]. First, the trimmed version with a small

α

is employed to derive robust estimates of CPs and regression parameters. Then, the technique of monitoring in [52] may be considered for judging the trimmed points to make the most of the dataset. The test statistic for outliers in CP regressions would be better based on the difference in the estimates of CPs and model parameters between different samples. Further research is needed, and it will be covered in our future work.

5.2. Conclusions

This paper proposed a new robust method for CP regressions using t-distributions. The t-distribution provides a longer tailed substitute to the normal distribution and has an additional parameter called DF, which is critical for tuning robustness. These favorable properties make the t CP regression model more tolerant of atypical observations. We considered CPs as random variables and transfered them into latent class variables so that we could implement the EM algorithm to estimate the CPs and regression parameters simultaneously. Moreover, we rewrote the t-distribution as a scale mixture of normal distributions to simplify the computation of t-log-likelihood. We further extended the latent class model to a fuzzy class model. Incorporating fuzzy clustering methods, we created the FCT algorithm to make more robust estimations for CP regressions by FCT than by EMT. The proposed EMT and FCT are computationally easy and efficient compared to the other existing EM-type methods. The proposed method has no problem with the non-differentiability and requires no restrictions on the models, such as the continuity between two adjoining segments needed, only a single CP allowed, or an unchanged regression variance demanded. Some of these restraints are often required in existing methods. To advance the ability of the proposed methods in resisting high leverage outliers, we proposed the modified versions, called EMT-tm and FCT-tm, by first trimming high leverage points moderately and then employing the proposed method to the reduced data. As for the decision of the degrees of freedom, r, the EMT1 and FCT1 (DF r = 1) worked well in all experiments, no matter whether the data were contaminated or not. The analysis of robust properties with the proposed t-based methods also showed that EMT1 and FCT1 were most tolerant of atypical observations. In managing CP regression problems, we suggest respectively fitting data with EMN, EMTs and FCTs using values such as r = 1, 5, 10 and 15. If the estimations of the three methods are close, one may consider that the data are normally distributed, having no abnormal observations, so that EMN can be used for obtaining less varied estimators. Otherwise, EMT1 or FCT1 should be empolyed. In practice, FCT1 can be given priority for its better ability of withstanding atypical observations. For the concern about the trimmed proportions of data,

α

, normally, outliers with extremely high leverage are seldom more than 5%, thus

α

can be set as 0.05 at the start and increased in case a higher value is needed (cf. [28,47]). Extensive experiments with numerical and real examples showed the effectiveness and the practicability of EMT1 and FCT1. In particular, only the trimmed versions were workable for data containing extremely high leverage outliers, such as the real data of bedload transportations. This evidence shows the effectiveness and the need of the trimmed versions for real applications. Overall, the proposed EMT and FCT approaches are robust to atypical observations and heavy-tailed distributions. Especially, FCT1 generally works well and performs better than EMT1 for providing less biased estimates of CPs and model parameters.

Author Contributions

Conceptualization and methodology, K.-P.L. and S.-T.C.; software, S.-T.C.; validation, K.-P.L.; formal analysis, S.-T.C.; investigation, K.-P.L.; resources, K.-P.L.; data curation, K.-P.L.; writing—original draft preparation, K.-P.L.; writing—review and editing, K.-P.L. and S.-T.C.; visualization, K.-P.L.; supervision, K.-P.L.; project administration, K.-P.L.; funding acquisition, K.-P.L. and S.-T.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Ministry of Science and Technology, Taiwan (Grant numbers MOST 108-2118-M-025-001 and MOST 108-2118-M-003-003).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The sources of datasets are respectively stated in each example they are considered (pages 21–23).

Conflicts of Interest

The authors declare no conflict of interest.

References

Muggeo, V.M.R. Segmented: An R package to fit regression models with broken-line relationships. News R Proj. 2008, 8, 20–25. [Google Scholar]
Yang, P.; Dumont, G.; Ansermino, J.M. Adaptive change detection in heart rate trend monitoring in anesthetized children. IEEE Trans. Biomed. Eng. 2006, 53, 2211–2219. [Google Scholar] [CrossRef]
Schröder, A.L.; Ombao, H. FreSpeD: Frequency-specific change-point detection in Epileptic seizure multi-channel EEG data. J. Am. Stat. Assoc. 2019, 114, 115–128. [Google Scholar] [CrossRef]
Loschi, R.H.; Pontel, J.G.; Cruz, F.R.B. Multiple change-point analysis for linear regression models. Chil. J. Stat. 2010, 1, 93–112. [Google Scholar]
Werner, R.; Valev, D.; Danov, D.; Guineva, V. Study of structural break points in global and hemispheric temperature series by piecewise regression. Adv. Space Res. 2015, 56, 2323–2334. [Google Scholar] [CrossRef]
Fearnhead, P.; Rigaill, G. Changepoint Detection in the Presence of Outliers. J. Am. Stat. Assoc. 2019, 114, 169–183. [Google Scholar] [CrossRef] [Green Version]
Frick, K.; Munk, A.; Sieling, H. Multiscale change point inference. J. R. Stat. Soc. Ser. B Stat. Methodol. 2014, 76, 495–580. [Google Scholar] [CrossRef]
Pein, F.; Sieling, H.; Munk, A. Heterogeneuous change point inference. J. R. Stat. Soc. Ser. B Stat. Methodol. 2017, 79, 1207–1227. [Google Scholar] [CrossRef] [Green Version]
Zarandi, M.H.F.; Alaeddini, A. A general fuzzy-statistical clustering approach for estimating the time of change in variable sampling control charts. Inf. Sci. 2010, 180, 3033–3044. [Google Scholar] [CrossRef]
Lu, K.P.; Chang, S.T. A fuzzy classification approach to piecewise regression models. Appl. Soft Comput. 2018, 69, 671–688. [Google Scholar] [CrossRef]
Fryzlewicz, P. Wild binary segmentation for multiple change-point detection. Ann. Stat. 2014, 42, 2243–2281. [Google Scholar] [CrossRef]
Maronna, R.A.; Martin, R.D.; Yohai, V.J. Robust Statistics: Theory and Methods; Wiley: New York, NY, USA, 2006. [Google Scholar]
Huber, P.J. Robust Statistics; Wiley: New York, NY, USA, 1981. [Google Scholar]
Lange, K.L.; Little, R.J.A.; Taylor, J.M.G. Robust statistical modelling using the t distribution. J. Am. Stat. Assoc. 1989, 84, 881–889. [Google Scholar]
Peel, D.; McLachlan, G.J. Robust mixture modelling using the t distribution. Stat. Comput. 2000, 10, 339–348. [Google Scholar] [CrossRef]
Muggeo, V.M.R. Estimating regression models with unknown breakpoints. Stat. Med. 2003, 22, 3055–3071. [Google Scholar] [CrossRef]
Chakar, S.; Lebarbier, E.; Lévy-Leduc, C.; Robin, S. A robust approach for estimating change-points in the mean of an AR(1) process. Bernoulli 2017, 23, 1408–1447. [Google Scholar] [CrossRef]
Ko, S.I.M.; Chong, T.T.L.; Ghosh, P. Dirichlet process hidden Markov multiple change-point model. Bayesian Anal. 2015, 10, 275–296. [Google Scholar] [CrossRef]
Bardwell, L.; Fearnhead, P. Bayesian detection of abnormal segments in multiple time series. Bayesian Anal. 2015, 12, 193–218. [Google Scholar] [CrossRef] [Green Version]
Zou, C.; Yin, G.; Feng, L.; Wang, Z. Nonparametric maximum likelihood approach to multiple change-point problems. Ann. Stat. 2014, 42, 970–1002. [Google Scholar] [CrossRef] [Green Version]
Haynes, K.; Fearnhead, P.; Eckley, I.A. A computationally efficient nonparametric approach for changepoint detection. Stat. Comput. 2017, 27, 1293–1305. [Google Scholar] [CrossRef] [Green Version]
Rigaill, G. A pruned dynamic programming algorithm to recover the best segmentations with 1 to K_max change-points. J. Soc. Fr. Stat. 2015, 156, 180–205. [Google Scholar]
Maidstone, R.; Hocking, T.; Rigaill, G.; Fearnhead, P. On optimal multiple changepoint algorithms for large data. Stat. Comput. 2017, 27, 519–533. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Aminikhanghahi, S.; Cook, D.J. A survey of methods for time series change point detection. Knowl. Inf. Syst. 2017, 51, 339–367. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Truong, C.; Oudre, L.; Vayatis, N. A review of change point detection methods. arXiv 2018, arXiv:1801.00718v1. [Google Scholar]
Ciuperca, G. Estimating nonlinear regression with and without change-points by the LAD method. Ann. Inst. Stat. Math. 2011, 63, 717–743. [Google Scholar] [CrossRef]
Ciuperca, G. Penalized least absolute deviations estimation for nonlinear model with change-points. Stat. Pap. 2011, 52, 371–390. [Google Scholar] [CrossRef]
Yang, F. Robust Mean Change-Point Detecting through Laplace Linear Regression Using EM Algorithm. J. Appl. Math. 2014, 2014, 856350. [Google Scholar] [CrossRef] [Green Version]
Jafari1, A.; Yarmohammadil, M.; Rasekhi, A. A Bayesian analysis to detect change-point in two-phase Laplace model. Sci. Res. Essays 2016, 11, 187–193. [Google Scholar]
Gerstenberger, C. Robust Wilcoxon-type estimation of change-point location under short range dependence. J. Time Ser. Anal. 2018, 39, 90–104. [Google Scholar] [CrossRef]
Yao, W.; Wei, Y.; Yu, C. Robust mixture regression using the t-distribution. Comput. Stat. Data Anal. 2014, 71, 116–127. [Google Scholar] [CrossRef]
Lin, J.G.; Zhu, L.X.; Xie, F.C. Heteroscedasticity diagnostics for t linear regression models. Metrika 2009, 70, 59–77. [Google Scholar] [CrossRef]
Lin, J.G.; Xie, F.C.; Wei, B.C. Statistical Diagnostics for Skew-t-normal Nonlinear Models. Commun. Stat. Simul. Comput. 2009, 38, 2096–2110. [Google Scholar] [CrossRef]
Osorio, F.; Galea, M. Detection of a change-point in student-t linear regression models. Stat. Pap. 2005, 45, 31–48. [Google Scholar] [CrossRef]
Lin, J.G.; Chen, J.; Li, Y. Bayesian Analysis of Student t Linear Regression with Unknown Change-Point and Application to Stock Data Analysis. Comput. Econ. 2012, 40, 203–217. [Google Scholar] [CrossRef]
Petersen, K.B.; Winther, O.; Hansen, L.K. On the slow Convergence of EM and VBEM in low-noise linear Models. Neural Comput. 2005, 17, 1921–1926. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Yang, M.S. A survey of fuzzy clustering. Math. Comput. Model. 1993, 18, 1–16. [Google Scholar] [CrossRef]
Rousseeuw, P.J. Least median of squares regression. J. Am. Stat. Assoc. 1984, 79, 871–880. [Google Scholar] [CrossRef]
Rousseeuw, P.J.; Leroy, A.M. Robust Regression and Outlier Detection; Wiley-Interscience: New York, NY, USA, 1987. [Google Scholar]
Lopuhaa, H.P.; Rousseeuw, P.J. Breakdown Points of Affine Equivariant Estimators of Multivariate Location and Covariance Matrices. Ann. Stat. 1991, 19, 229–248. [Google Scholar] [CrossRef]
Rousseeuw, P.J.; Van Driessen, K. A fast algorithm for the minimum covariance determinant estimator. Technometrics 1999, 41, 212–223. [Google Scholar] [CrossRef]
Pison, G.; Van Aelst, S.; Willems, G. Small sample corrections for LTS and MCD. Metrika 2002, 55, 111–123. [Google Scholar] [CrossRef] [Green Version]
Shi, S.; Li, Y.; Wan, C. Robust continuous piecewise linear regression model with multiple change points. J. Supercomput. 2020, 76, 3623–3645. [Google Scholar] [CrossRef]
Lu, K.P.; Chang, S.T. Robust algorithms for multiphase regression models. Appl. Math. Model. 2020, 77, 1643–1661. [Google Scholar] [CrossRef]
Garland, T. The relation between maximal running speed and body mass in terrestrial mammals. J. Zool. 1983, 199, 157–170. [Google Scholar] [CrossRef]
McMahon, T.A. Using body size to understand the structural design of animals: Quadrupedal locomotion. J. Appl. Physiol. 1975, 39, 619–627. [Google Scholar] [CrossRef] [PubMed]
Ryan, S.; Porth, L. A Tutorial on the Piecewise Regression Approach Applied to Bedload Transport Data; General Technic Report RMRS-GTR-189; U.S. Department of Agriculture, Forest Service, Rocky Mountain Research Station: Fort Collins, CO, USA, 2007; p. 41.
Zhang, F.; Li, Q. Robust bent line regression. J. Stat. Plan. Inference 2017, 185, 41–55. [Google Scholar] [CrossRef]
Hawkins, D.M. Fitting multiple change-point models to data. Comput. Stat. Data Anal. 2001, 37, 323–341. [Google Scholar] [CrossRef]
Ciuperca, G. A general criterion to determine the number of change-points. Stat. Probab. Lett. 2011, 81, 1267–1275. [Google Scholar] [CrossRef] [Green Version]
Haynes, K.; Eckley, I.A.; Fearnhead, P. Computationally Efficient Changepoint Detection for a Range of Penalties. J. Comput. Graph. Stat. 2017, 26, 134–143. [Google Scholar] [CrossRef]
Cerioli, A.; Riani, M.; Atkinson, A.C.; Corbellini, A. The power of monitoring: How to make the most of a contaminated multivariate sample. Stat. Methods Appl. 2018, 27, 641–649. [Google Scholar] [CrossRef]

Figure 1. Models used to show EMT1 performs better,

β

: *: (1, 0.5), +: (6, −0.5) show the coefficients of the model from which data were generated; here,

β

: *: (1, 0.5), +: (6, −0.5) indicate that the data marked * and + in the figure were generated from models

E (y) = 1 + 0.5 x

and

E (y) = 6 - 0.5 x

, respectively. These notations are also used for the figures below.

Figure 1. Models used to show EMT1 performs better,

β

: *: (1, 0.5), +: (6, −0.5) show the coefficients of the model from which data were generated; here,

β

: *: (1, 0.5), +: (6, −0.5) indicate that the data marked * and + in the figure were generated from models

E (y) = 1 + 0.5 x

and

E (y) = 6 - 0.5 x

, respectively. These notations are also used for the figures below.

Figure 2. (a)

φ

function of EMTr and (b)

φ

function of the median.

Figure 2. (a)

φ

function of EMTr and (b)

φ

function of the median.

Figure 3. Plots of models used for comparing the proposed approaches with existing methods.

Figure 4. Four cases of extreme outliers considered to show the robustness of the proposed methods.

Figure 5. Fitting results of EMN, EMT and FCT to MARS data.

Figure 6. Fitting results of EMN, EMT1, EMT1-tm, FCT1 and FCT1-tm to bedload transport data.

Figure 7. (a) Box plots of

R_{t}^{e}

with the whole data, and (b) the four sub-segments derived by EMN.

Figure 7. (a) Box plots of

R_{t}^{e}

with the whole data, and (b) the four sub-segments derived by EMN.

Figure 8. Fitting results of EMN, EMT1 and FCT1 to the 2008 DJIA return series.

Figure 9. Models used for deciding the number of segments based on the suggested indices.

Table 1. The performance of EMT and FCT with different degrees of freedom.

(I) 1 CP Cases
(a) 0 outliers	CP = 25
Method	EMN	EMT1		EMT5	EMT10	EMT15	FCT1	FCT5	FCT10	FCT15
CPhat	25.18	25.15		25.26	25.20	25.19	25.21	25.10	25.11	25.13
SECP	0.168	0.186		0.169	0.166	0.166	0.120	0.109	0.109	0.109
MSES	0.76	1.21		0.80	0.78	0.78	0.98	0.58	0.56	0.55
(b) 15 noises	CP = 33
Method	EMN	EMT1		EMT5	EMT10	EMT15	FCT1	FCT5	FCT10	FCT15
CPhat	18.80	33.74		21.75	24.41	21.75	33.51	31.86	29.65	28.70
SECP	0.137	0.154		0.237	0.272	0.237	0.111	0.130	0.123	0.109
MSES	14.17	0.98		10.68	8.24	10.68	0.77	0.82	1.65	2.38
(c) 1 extreme outlier (2, 8)	CP = 26
Method	EMN	EMT1		EMT5	EMT10	EMT15	FCT1	FCT5	FCT10	FCT15
CPhat	18.55	25.79		25.56	24.70	23.34	26.06	26.06	25.94	25.85
SECP	0.112	0.167		0.147	0.149	0.152	0.064	0.063	0.063	0.063
MSES	1.73	1.02		0.70	0.69	0.82	0.75	0.49	0.47	0.47
(II) 2 CPs Case, CP1 = 28, CP2 = 56
Method	EMN	EMT1	EMT5	EMT10	EMT15	FCT1	FCT5	FCT10	FCT15	Method
CP1	15.79	27.79	21.43	17.64	16.45	28.39	26.14	21.73	19.27	CP1
SECP1	0.252	0.249	0.350	0.324	0.306	0.137	0.216	0.297	0.314	SECP1
CP2	68.39	57.86	62.07	64.85	65.22	56.46	54.42	57.44	60.43	CP2
SECP2	0.064	0.279	0.274	0.352	0.412	0.187	0.167	0.255	0.314	SECP2
MSES	7.97	7.81	7.09	7.48	7.79	3.33	2.62	6.15	10.31	MSES

Table 2. Comparisons of 7 methods with models a–c under 4 different situations, respectively.

Model a: 1 CP Discontinuous Model
(a.1) 0 outliers	EMN	RSE	EM-Tk	EMT1	EMT1-tm	FCT1	FCT1-tm
MSE-CP	0.91	37.75	0.50	2.84	2.68	0.43	0.40
MSES-reg	0.57	x¹	0.52	0.67	0.71	0.76	0.79
(a.2) 8 (14%) noises	EMN	RSE	EM-Tk	EMT1	EMT1-tm	FCT1	FCT1-tm
MSE-CP	66.98	111.16	135.25	0.87	1.19	0.98	0.92
MSES-reg	2.26	x	0.82	0.65	0.82	0.74	0.98
(a.3) leverage outliers (10, 10)	EMN	RSE	EM-Tk	EMT1	EMT1-tm	FCT1	FCT1-tm
MSE-CP	569.16	x	54.68	573.70	2.80	574.47	0.39
MSES-reg	11,556.91	x	x	10,900.66	0.71	x	0.75
(a.4) errors~t (2)	EMN	RSE	EM-Tk	EMT1	EMT1-tm	FCT1	FCT1-tm
MSE-CP	6.79	44.96	1.17	2.69	2.95	0.41	0.41
MSES-reg	12,603.61	x	2.98	5066.66	3395.77	1.54	0.75
Model b: 1 CP Continuous Model
(b.1) 0 outliers	EMN	RSE	EM-Tk	EMT1	EMT1-tm	FCT1	FCT1-tm
MSE-CP	0.024	0.004	0.014	0.009	0.022	0.004	0.008
MSES-reg	0.600	0.620	0.574	0.831	1.020	0.784	0.875
(b.2)10 (17%) noises	EMN	RSE	EM-Tk	EMT1	EMT1-tm	FCT1	FCT1-tm
MSE-CP	201.574	0.535	208.182	0.573	1.245	0.277	0.143
MSES-reg	14.174	2.939	9.714	0.978	1.745	0.771	1.215
(b.3) leverage outliers (10, 15)	EMN	RSE	EM-Tk	EMT1	EMT1-tm	FCT1	FCT1-tm
MSE-CP	335.479	x	2.428	477.948	0.024	347.620	0.004
MSES-reg	7451.722	x	1.518	9694.875	0.986	7884.283	0.853
(b.4) errors~t (2)	EMN	RSE	EM-Tk	EMT1	EMT1-tm	FCT1	FCT1-tm
MSE-CP	0.038	0.002	0.041	0.017	0.019	0.013	0.011
MSES-reg	1346.945	1.742	2.834	1.640	1.825	1.072	1.218
Model c: 2 CPs Model
(c.1) 0 outlierss	EMN	RSE	EM-Tk	EMT1	EMT1-tm	FCT1	FCT1-tm
MSE-CP1	0.223	0.170	0.490	0.490	0.228	0.271	0.457
MSE-CP2	5.745	0.041	6.142	6.142	5.361	3.218	5.194
MSES-reg	4.798	x	6.019	6.019	8.255	7.580	8.587
(c.2) 10 (12%) noises	EMN	RSE	EM-Tk	EMT1	EMT1-tm	FCT1	FCT1-tm
MSE-CP1	149.402	210.211	208.989	0.355	0.289	0.347	0.464
MSE-CP2	153.532	871.555	69.751	3.830	5.782	6.503	5.972
MSES-reg	7.969	x	7.640	7.809	8.444	5.359	5.901
(c.3) leverage outliers (10, 10)	EMN	RSE	EM-Tk	EMT1	EMT1-tm	FCT1	FCT1-tm
MSE-CP1	1.594	x	268.806	11.273	3.858	3.978	4.012
MSE-CP2	306.429	x	768.323	318.802	13.971	318.088	13.400
MSES-reg	7902.387	x	x	8939.276	9.410	8412.589	8.816
(c.4) errors~t (2)	EMN	RSE	EM-Tk	EMT1	EMT1-tm	FCT1	FCT1-tm
MSE-CP1	3.655	x	2.269	2.969	2.194	0.302	0.472
MSE-CP2	1.137	x	1.037	0.242	0.890	1.603	1.033
MSES-reg	67.314	x	x	4.727	4.914	3.704	4.486

x denotes that MSE is extremely large.

Table 3. Comparing the robustness of 7 methods to extreme outliers using model b under four situations.

Case	5 (9.1%) Outliers, CP = 30		6 (10.7%) Outliers, CP = 30		8 (13.8%) Outliers, CP = 33		8 (13.8%) Outliers, CP = 25		5 (9.1%) Outliers, CP = 25		6 (10.7%) Outliers, CP = 25
	MSECP ¹	MSEreg ¹	MSE-CP	MSEreg	MSECP	MSEreg	MSE-CP	MSEreg	MSECP	MSEreg	MSECP	MSEreg
EMN	25.19	13.15	32.16	16.40	36.84	5.74	14.47	8.42	25.9	85.9	32.0	113
RSE	1.35	x	2.30	x	1.42	0.46	0.07	0.41	1.27	x	2.19	x
EM-Tk	33.44	13.53	46.31	18.82	44.85	7.34	27.11	22.77	34.6	96.0	46.1	146
EMT1	0.01	1.02	3.32	4.40	4.14	1.74	4.10	34.40	0.06	0.96	3.78	30.1
FCT1	0.04	0.78	1.93	8.64	0.03	1.07	0.00	0.59	0.02	0.72	2.40	96.5
EMT1-tm	0.11	1.59	16.38	13.88	1.79	1.78	1.83	22.05	0.16	2.37	17.3	140
FCT1-tm	0.04	1.02	5.22	13.32	0.05	1.43	0.01	3.25	0.01	2.02	5.59	143

¹: MSECP and MSESreg are the MSE of the estimates of CP and regression parameters given by Equations (21) and (30), respectively.

Table 4. Descriptive statistics of

\{R_{t}^{e}\}

.

Table 4. Descriptive statistics of

\{R_{t}^{e}\}

.

Sample Size	Mean	SD	Minmum	Median	Maximum	Skewness	Kurtosis
252	101.11	2.1451	92.59	100.06	111.49	0.573	7.074

Table 5. Deciding the number of segments for the models in Figure 9.

(i) Using EMT1
try	0 CP c = 1, 15 noises				1 CP c = 2, 15 noises				2 CPs c = 3, 10 noises				3 CPs c = 4, 10 noises
c	RSS ¹	–2lnL ²	MSE ³	BIC	RSS	–2lnL	MSE	BIC	RSS	–2lnL	MSE	BIC	RSS	–2lnL	MSE	BIC
1	x ⁴	x	x	115	x	x	x	92	x	x	x	−5	x	x	x	90
2	−9	−59	−1	133	98	997	2	79	18	37	0.13	3	190	430	3.88	33
3	* ⁵	*	−891	467	11	−172	0	108	8	−220	0.37	11	35	−50	0.72	30
4	0	1	−43	517	−3	−217	0	159	3	17	−0.02	62	4	−39	0.04	74
(ii) Using FCMT1
try	0 CP c = 1, 15 noises				1 CP c = 2, 15 noises				2 CPs c = 3, 10 noises				3 CPs c = 4, 10 noises
c	RSS ¹	–2lnL ²	MSE ³	BIC	RSS	–2lnL	MSE	BIC	RSS	–2lnL	MSE	BIC	RSS	–2lnL	MSE	BIC
1	x	x	x	115	x	x	x	88	x	x	x	−10	x	x	x	90
2	−8	−59	−0.52	133	88	311	1.69	77	7	290	0.08	0	185	1885	3.78	36
3	−7	*	−0.54	167	0	−263	−0.17	111	14	39	0.26	15	36	−45	0.73	36
4	8	810	−0.24	216	8	−630	0.01	158	1	51	−0.03	67	7	29	0.12	75

¹. RSS: the change of RSS, i.e., RSS(c − 1) − RSS(c). ². –2lnL: the change of −2 multiplying maximized log-likelihood,

F (c - 1) - F (c)

. ³. MSE: the change of MSE, i.e., MSE(c − 1) − MSE(c). ⁴. Optimal c denoted by fluorescent yellow color. ⁵. Extremely small number (unreasonable results due to using a wrong c).

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Lu, K.-P.; Chang, S.-T. Robust Algorithms for Change-Point Regressions Using the t-Distribution. Mathematics 2021, 9, 2394. https://doi.org/10.3390/math9192394

AMA Style

Lu K-P, Chang S-T. Robust Algorithms for Change-Point Regressions Using the t-Distribution. Mathematics. 2021; 9(19):2394. https://doi.org/10.3390/math9192394

Chicago/Turabian Style

Lu, Kang-Ping, and Shao-Tung Chang. 2021. "Robust Algorithms for Change-Point Regressions Using the t-Distribution" Mathematics 9, no. 19: 2394. https://doi.org/10.3390/math9192394

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Robust Algorithms for Change-Point Regressions Using the t-Distribution

Abstract

1. Introduction

2. Related Work

3. Robust Change-Point Regression Using t-Distributions

3.1. An EMT Algorithm for Change-Point Regression Models

3.2. A FCT Algorithm for t CP Regression Models

3.3. Decide the Degrees of Freedom of t-Distributions

3.4. Resistance against High Leverage Outliers

3.5. The Robust Properties of EMT and FCT

4. Experiments and Examples

4.1. Simulation Studies

4.2. Real Data Applications

5. Conclusions and Discussion

5.1. Discussion

5.2. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI