3.2. Detection and Estimation
To illustrate the advantages of the new method, we consider three time-series models in simulation and compare the new method with the least-squares estimator (LSE) for sample sizes T of 80, 100, and 120 and panel sizes N of 100 and 120. The number of each group . Here, we take . In each case, 1000 replications are carried out, the evaluation indexes are averaged, and the final simulation results are obtained.
Following [13,34], if there are s common change-points, the statistic can be defined as
where
for .
First, we consider the AR(1) model, and define
and
where , for , and the true change-point , , and .
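Because the parameter values of this design were given in equations omitted here, the setup can only be sketched. The following is a minimal, hypothetical version of an AR(1) panel with a single common mean shift; the values phi = 0.3, delta = 1.0, and tau = 0.5 are illustrative placeholders, not the paper's settings.

```python
import numpy as np

def simulate_ar1_panel(N=100, T=100, phi=0.3, delta=1.0, tau=0.5, seed=0):
    """Simulate a panel of AR(1) series with one common mean shift.

    y[i, t] = mu_t + e[i, t],  e[i, t] = phi * e[i, t-1] + eps[i, t],
    where mu_t jumps from 0 to delta at k = floor(tau * T).
    phi, delta, and tau are illustrative, not the paper's values.
    """
    rng = np.random.default_rng(seed)
    k = int(tau * T)                      # common change-point location
    eps = rng.standard_normal((N, T))
    e = np.zeros((N, T))
    e[:, 0] = eps[:, 0]
    for t in range(1, T):                 # AR(1) recursion across time
        e[:, t] = phi * e[:, t - 1] + eps[:, t]
    mu = np.zeros(T)
    mu[k:] = delta                        # mean shift after the change-point
    return mu + e, k

y, k = simulate_ar1_panel()
```

Averaging an evaluation index over 1000 such draws reproduces the replication protocol described above.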
Table 1 presents the empirical probability of the number of groups G. The true number of groups can be correctly estimated using MDL. As N increases, the empirical probability of correct judgment increases. To illustrate this pattern, we plot the empirical probability of correct judgment in Figure 1 with T fixed.
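The MDL criterion itself appears in an equation not reproduced here, so the following sketch only illustrates the general recipe: score each candidate number of groups G by a goodness-of-fit term plus a complexity penalty and choose the minimizer. The quantile-based clustering rule and the penalty below are toy stand-ins, not the paper's criterion.

```python
import numpy as np

def toy_mdl(y, G):
    """Toy MDL-type score for G groups: split series into G quantile
    groups by their sample means, then add a fit term (log average
    residual) and a complexity penalty that grows with G.  This is a
    caricature of the recipe, not the paper's criterion."""
    N, T = y.shape
    order = np.argsort(y.mean(axis=1))
    labels = np.empty(N, dtype=int)
    for g, chunk in enumerate(np.array_split(order, G)):
        labels[chunk] = g                 # contiguous quantile groups
    rss = sum(((y[labels == g] - y[labels == g].mean(axis=0)) ** 2).sum()
              for g in range(G))
    return N * T * np.log(rss / (N * T)) + 0.5 * G * T * np.log(N * T)

# Two well-separated groups of series; the minimizer recovers G = 2.
rng = np.random.default_rng(1)
y = rng.standard_normal((100, 80))
y[50:] += 3.0
G_hat = min(range(1, 5), key=lambda G: toy_mdl(y, G))
```

The fit term rewards small within-group residuals, while the penalty discourages splitting the panel into more groups than the data support.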
For a given G, Table 2 shows the estimation results of the new method and LSE. The new method consistently outperforms LSE.
Figure 2 presents the curves of D and RMSE against the panel size N and the time-series length T. In Figure 2, for fixed N, D decreases as T increases, indicating that the grouping improves. Likewise, for fixed T, the RMSE becomes smaller as N increases, indicating that the estimates become more accurate.
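The evaluation indexes can be computed along the following lines. Interpreting D as the smallest misgrouping fraction over relabelings, and scaling the change-point RMSE by T, are assumptions on our part, since the exact definitions are in the omitted formulas.

```python
from itertools import permutations
import numpy as np

def cp_rmse(est, true, T):
    """RMSE of estimated change-point locations, scaled by T (an
    assumed convention; the paper's exact scaling is omitted)."""
    est, true = np.asarray(est, float), np.asarray(true, float)
    return np.sqrt(np.mean((est - true) ** 2)) / T

def group_discrepancy(est_labels, true_labels, G):
    """Smallest fraction of misgrouped series over all relabelings of
    the G estimated groups (group labels are only identified up to a
    permutation).  Reading the paper's D this way is an assumption."""
    est = np.asarray(est_labels)
    true = np.asarray(true_labels)
    return min(np.mean(np.asarray(perm)[est] != true)
               for perm in permutations(range(G)))
```

For example, `cp_rmse([48, 51, 50], [50, 50, 50], T=100)` averages the squared location errors of three replications before scaling.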
However, Table 1 shows that there is a small probability that is greater than G. When this happens, say , the three change-points can still be accurately estimated, and the fourth group consists of individual elements from the , and . Table 3 shows the RMSE of the estimates in this case, where we only consider the RMSE of the three change-points. The change-points can still be estimated well when the group number G is misestimated, although the results are worse than when G is estimated correctly.
Then, we consider the MA(2) model
and
where , for , and the true change-point and .
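As with the AR(1) case, the MA(2) design can only be sketched because the coefficient values were given in the omitted equations; theta = (0.4, 0.2), delta = 1.0, and tau = 0.5 below are illustrative placeholders.

```python
import numpy as np

def simulate_ma2_panel(N=100, T=100, theta=(0.4, 0.2), delta=1.0,
                       tau=0.5, seed=0):
    """Panel of MA(2) series with one common mean shift:
    y[i, t] = mu_t + eps[i, t] + th1*eps[i, t-1] + th2*eps[i, t-2],
    with mu_t jumping from 0 to delta at k = floor(tau * T).
    theta, delta, and tau are illustrative, not the paper's values."""
    rng = np.random.default_rng(seed)
    k = int(tau * T)
    th1, th2 = theta
    eps = rng.standard_normal((N, T + 2))  # two extra columns for the lags
    e = eps[:, 2:] + th1 * eps[:, 1:-1] + th2 * eps[:, :-2]
    mu = np.zeros(T)
    mu[k:] = delta
    return mu + e, k

y, k = simulate_ma2_panel()
```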
Table 4 shows the empirical probability of G taking and 5 in 1000 replications. It demonstrates that G is estimated correctly with more than 90 percent probability. Figure 3 shows that the empirical probability of correct group selection increases as N increases.
From Table 4, G is chosen as 3. Given , we use the SCIP solver to estimate the change-points, as shown in Table 5. The RMSE of the new method is smaller than that of LSE, which means that the new method performs better. Furthermore, the D of the new method is small, which means that the grouping is accurate.
In Figure 4, we show how D and RMSE change with T and N: D decreases as T increases, and RMSE decreases as N increases.
Finally, we consider a time-series model with a trend term, and define
where , , , and the true change-point , , and . We define the square loss function as
Table 6 shows the empirical probability of the estimated number of groups. The results indicate that MDL estimates the number of groups with a high empirical probability. Figure 5 shows the change in the empirical probability; notably, it approaches 1 as N increases.
In Table 7, we display the D and RMSE of the new estimator. The new method performs well on the time-series model with a trend. Figure 6 shows that D decreases as T increases, meaning that the grouping becomes more and more accurate. Further, RMSE decreases as N increases, meaning that the estimates improve.
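For the trend model, a plain squared-loss change-point search fits a separate intercept and slope on each side of every candidate split and keeps the split with the smallest total squared loss. The sketch below follows this generic recipe only, since the paper's loss function is in an omitted equation; the slope 0.05, jump 2.0, and noise level are illustrative values.

```python
import numpy as np

def estimate_cp_trend(y, min_seg=5):
    """Single change-point search under squared loss for a series with
    a linear trend: fit intercept + slope separately on each side of
    every candidate split and return the split minimising the total
    squared loss."""
    T = len(y)
    t = np.arange(T, dtype=float)

    def sse(seg_t, seg_y):
        # Ordinary least squares on one segment: y ~ a + b * t.
        X = np.column_stack([np.ones_like(seg_t), seg_t])
        beta, *_ = np.linalg.lstsq(X, seg_y, rcond=None)
        r = seg_y - X @ beta
        return r @ r

    best_k, best = min_seg, np.inf
    for k in range(min_seg, T - min_seg):
        total = sse(t[:k], y[:k]) + sse(t[k:], y[k:])
        if total < best:
            best_k, best = k, total
    return best_k

# Linear trend with a level shift of 2.0 at t = 60 (illustrative).
rng = np.random.default_rng(2)
tt = np.arange(120)
y = 0.05 * tt + 2.0 * (tt >= 60) + 0.3 * rng.standard_normal(120)
k_hat = estimate_cp_trend(y)
```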
In the case of one change-point, the results can be summarized as follows: the new method performs better than LSE. With fixed T, as N increases, the empirical probability of choosing the right number of groups approaches 1 and the RMSE becomes smaller. With fixed N, the set coverage becomes smaller as T increases.
For multiple change-points, consider the AR(1) model, and define
and
where , and changes with T: when , ; when , ; and when , .
Table 8 shows the empirical probability of the estimated number of groups. An accurate estimate of the group number can be obtained using the MDL criterion. To illustrate the convergence rate of Algorithm 1, we show the curve of coverage (D) versus s in Figure 7. The algorithm converges after five iterations. Following Ref. [13], when the number of change-points is unknown, we use LSE combined with an AIC or BIC penalty for detection. The statistic is defined as
where the number of change-points s is unknown; for the AIC penalty and for the BIC penalty.
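Since the penalized statistic itself is in an omitted equation, the following sketch only illustrates the generic approach: grow change-points greedily by least squares and select the number s minimizing a log-likelihood term plus a penalty. The penalty forms 2s (AIC-type) and s log T (BIC-type) are placeholders for the paper's definitions, and the greedy search stands in for the exact optimization.

```python
import numpy as np

def seg_sse(y, cps):
    """Total squared loss around segment means, given sorted interior
    change-points cps."""
    bounds = [0, *cps, len(y)]
    return sum(((y[a:b] - y[a:b].mean()) ** 2).sum()
               for a, b in zip(bounds, bounds[1:]))

def greedy_cps(y, s_max):
    """Add change-points one at a time (binary-segmentation style),
    recording the best configuration for each s = 0, ..., s_max."""
    cps, path = [], {0: ([], seg_sse(y, []))}
    for s in range(1, s_max + 1):
        best = None
        for k in range(2, len(y) - 1):
            if k in cps:
                continue
            sse = seg_sse(y, sorted(cps + [k]))
            if best is None or sse < best[1]:
                best = (k, sse)
        cps = sorted(cps + [best[0]])
        path[s] = (cps, best[1])
    return path

def detect_s(y, s_max=4, pen="bic"):
    """Choose s by minimising T*log(SSE/T) + penalty(s); 2s (AIC-type)
    and s*log(T) (BIC-type) are generic placeholder penalties."""
    T, path = len(y), greedy_cps(y, s_max)
    p = (lambda s: 2 * s) if pen == "aic" else (lambda s: s * np.log(T))
    s_hat = min(path, key=lambda s: T * np.log(path[s][1] / T) + p(s))
    return s_hat, path

# One series with two mean shifts at t = 40 and t = 80 (illustrative).
rng = np.random.default_rng(3)
y = np.concatenate([np.zeros(40), 2.0 * np.ones(40), np.zeros(40)])
y = y + 0.3 * rng.standard_normal(120)
s_hat, path = detect_s(y)
```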
Table 9 presents the
D,
F, and
of the new method and LSE. Clearly, the new method divides the groups accurately, and accurately obtains the number and position of change-points in each group. Using AIC penalty, the number of change-points can be obtained accurately, while the BIC penalty is less than the real number of change-points.
Although the method of Ref. [29] cannot be applied to the above model, we can set the means within a group to be the same and define the following model:
This model is equivalent to taking all the regression variables in Ref. [29] as 1, and
Then, we compare the new method with Ref. [29] under this model. The tuning parameter in Ref. [29] is selected by searching the interval [1, 10,000] with 100 evenly spaced logarithmic grid points. We present the results of the new method and Ref. [29] in Table 10 (in this case, given , the new method splits the panel into two groups with probability less than 1%; such replications are excluded from the results). The grouping of Ref. [29] is much better than that of the new method, possibly because Ref. [29] requires the same model parameters within a group and exploits this information. For the estimation of the number and position of the change-points, the two methods perform similarly. However, when the means within a group differ, the new method can still be applied, whereas Ref. [29] cannot.
Last, we implement the method of Ref. [29] in Python and report the computation times in Table 11 (each is the average over 100 replications; the CPU is an 11th Gen Intel Core i5-1135G7). The new method is faster than Ref. [29], possibly because the objective function of Ref. [29] is more complex and its parameters need to be tuned.
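A minimal harness for the average-of-100-replications timing protocol might look like this; `time.perf_counter` is one reasonable clock choice, though the text does not say which was used.

```python
import time

def average_runtime(fn, n_rep=100):
    """Average wall-clock time of fn() over n_rep replications, as in
    an average-of-100 timing protocol."""
    start = time.perf_counter()
    for _ in range(n_rep):
        fn()
    return (time.perf_counter() - start) / n_rep

# Example: time a cheap stand-in workload over 10 replications.
t_bar = average_runtime(lambda: sorted(range(10000)), n_rep=10)
```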