Article

Model Selection for Exponential Power Mixture Regression Models

Department of Statistics and Data Science, College of Economics, Jinan University, Guangzhou 510632, China
*
Author to whom correspondence should be addressed.
Entropy 2024, 26(5), 422; https://doi.org/10.3390/e26050422
Submission received: 27 February 2024 / Revised: 24 April 2024 / Accepted: 14 May 2024 / Published: 15 May 2024
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

Finite mixture of linear regression (FMLR) models are among the most widely used statistical tools for analyzing heterogeneous data. In this paper, we introduce a new procedure that simultaneously determines the number of components and performs variable selection within each regression component of FMLR models via an exponential power error distribution, which includes normal distributions and Laplace distributions as special cases. Under some regularity conditions, the consistency of order selection and of variable selection is established, and the asymptotic normality of the estimators of the non-zero parameters is investigated. In addition, an efficient modified expectation-maximization (EM) algorithm and a majorization-maximization (MM) algorithm are proposed to solve the resulting optimization problem. Furthermore, numerical simulations are used to demonstrate the finite-sample performance of the proposed methodology. Finally, we apply the proposed approach to analyze a baseball salary data set. The results indicate that our proposed method attains a smaller BIC value than the existing method.

1. Introduction

FMLR models are among the most exemplary statistical tools to deal with various heterogeneous data. Since FMLR models were first introduced by [1,2], they are widely applied in many research fields, e.g., machine learning [3], social sciences [4], and business [5]. For more references to FMLR models, see [6,7,8].
There are two important statistical problems in FMLR models: order selection and variable selection within the component regressions. Order selection is usually the first issue to address, and a large body of literature deals with it. For example, Ref. [9] introduced a penalized likelihood method for mixtures of univariate location distributions, and Ref. [10] proposed a penalized likelihood method to select the number of mixing components in finite multivariate Gaussian mixture models. For variable selection within each regression component, Ref. [11] applied subset selection approaches such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) to perform variable selection for each component in a finite mixture of Poisson regression models. To avoid the drawbacks of subset selection, Ref. [12] introduced a penalized likelihood method for variable selection in FMLR models, and Ref. [13] proposed a robust variable selection procedure to estimate and select relevant covariates for FMLR models.
The methods above do not jointly perform order selection and variable selection in FMLR models. This joint problem is challenging, although some work on it exists. Ref. [14] introduced the MR-Lasso for FMLR models to simultaneously identify the number of components and the significant variables, but did not study the large-sample properties of the procedure. Ref. [15] proposed a robust mixture regression estimator via an asymmetric exponential power distribution, and Ref. [16] studied component selection for exponential power mixture models, but neither considered the variable selection problem. Ref. [17] applied penalties to both the number of components and the regression coefficients to conduct model selection for FMLR models, but the error was assumed to follow a normal distribution, so the resulting method is very sensitive to heavy-tailed errors.
In this paper, motivated by [10,18], we propose a new model selection procedure for the FMLR models via an exponential power distribution, which includes normal distributions and Laplace distributions as special cases. Under some regularity conditions, we investigate the asymptotic properties of the proposed method. In addition, we introduce an expectation-maximization (EM) algorithm [19] and a majorization-maximization (MM) algorithm [20] to solve the proposed optimization problem. The finite sample performance of the proposed method is illustrated via some numerical simulations. Results indicate that the proposed method is more robust to the heavy-tailed distributions than the existing method.
The rest of this paper is organized as follows. In Section 2, we present the finite mixture of regression models with an exponential power distribution and a penalized likelihood-based model selection approach, and we investigate the asymptotic properties of the resulting estimates. In Section 3, a modified EM algorithm and an MM algorithm are developed to maximize the penalized likelihood. In Section 4, we propose a data-driven procedure to select the tuning parameters. In Section 5, simulation studies are conducted to evaluate the finite-sample performance of the proposed method. In Section 6, a real data set is analyzed to compare the proposed method with some existing methods. We conclude with some remarks in Section 7. Technical conditions and proofs are given in Appendix A.

2. Methodology

The density function of an exponential power (EP) distribution is defined as follows:
f_p(x; 0, \sigma) = \frac{p}{\Gamma(1/p)\, 2^{1+1/p}\, \sigma} \exp\left\{ -\frac{1}{2} \left| \frac{x}{\sigma} \right|^{p} \right\},
where p > 0 is a shape parameter, σ > 0 is the scale parameter, and Γ(·) is the Gamma function. When 0 < p < 2, the EP distribution is heavy-tailed, which indicates that it can provide protection against outliers. The EP density is a flexible and general class that includes some important statistical densities as special cases, e.g., the Gaussian density (p = 2) and the Laplace density (p = 1). Meanwhile, the EP distribution has a wide range of applications, particularly in the area of business applications [21].
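As a minimal illustration of this density, the following NumPy/SciPy sketch evaluates f_p(x; 0, σ); the function name ep_pdf and its vectorized signature are our own choices for this example, not notation from the paper.

```python
import numpy as np
from scipy.special import gammaln

def ep_pdf(x, p, sigma):
    """Density of the exponential power distribution EP(0, sigma, p).

    f_p(x; 0, sigma) = p / (Gamma(1/p) * 2^(1 + 1/p) * sigma) * exp(-0.5 * |x/sigma|^p)
    """
    x = np.asarray(x, dtype=float)
    log_norm = np.log(p) - gammaln(1.0 / p) - (1.0 + 1.0 / p) * np.log(2.0) - np.log(sigma)
    return np.exp(log_norm - 0.5 * np.abs(x / sigma) ** p)

# Sanity checks: p = 2 recovers the N(0, sigma^2) density, and p = 1 gives a
# Laplace density with scale 2*sigma.
```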
Based on the EP density function, we study the FMLR models. Let Z be a latent class variable with P(Z = j | X) = π_j for j = 1, 2, …, m, where X is a p-dimensional covariate vector. Given Z = j, suppose that the response Y depends on X in a linear way
Y = X^T \beta_j + \epsilon_j,
where β_j is a p-dimensional vector of regression coefficients and ε_j is a random error with EP density f_{p_j}(·; 0, σ_j). Then the conditional density of Y given X can be written as
f(y \mid x) = \sum_{j=1}^{m} \pi_j f_{p_j}( y - x^T \beta_j; 0, \sigma_j ). (1)
Let {(X_1, Y_1), …, (X_n, Y_n)} be a random sample from (1). Then, the log-likelihood function for the observations {(X_1, Y_1), …, (X_n, Y_n)} is given by
Q_n(\theta) = \sum_{i=1}^{n} \log \sum_{j=1}^{m} \pi_j f_{p_j}( Y_i - X_i^T \beta_j; 0, \sigma_j ),
where θ = (β_{11}, …, β_{1p}, …, β_{m1}, …, β_{mp}, σ_1, …, σ_m, p_1, …, p_m, π_1, …, π_{m−1})^T.
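For concreteness, here is a small sketch (assuming NumPy/SciPy; the helper names ep_logpdf and mixture_loglik are ours) that evaluates Q_n(θ) on the log scale, with a log-sum-exp step for numerical stability.

```python
import numpy as np
from scipy.special import gammaln

def ep_logpdf(r, p, sigma):
    """Log-density of EP(0, sigma, p) evaluated at residuals r."""
    return (np.log(p) - gammaln(1.0 / p) - (1.0 + 1.0 / p) * np.log(2.0)
            - np.log(sigma) - 0.5 * np.abs(r / sigma) ** p)

def mixture_loglik(Y, X, pi, beta, sigma, p):
    """Observed-data log-likelihood Q_n(theta); beta has one row per component."""
    logdens = np.column_stack([np.log(pi[j]) + ep_logpdf(Y - X @ beta[j], p[j], sigma[j])
                               for j in range(len(pi))])
    mx = logdens.max(axis=1, keepdims=True)          # log-sum-exp for numerical stability
    return float(np.sum(mx.ravel() + np.log(np.exp(logdens - mx).sum(axis=1))))
```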
To deal with the model selection problem, according to [10], we consider the following objective function,
\tilde{Q}_n(\theta) = Q_n(\theta) - P_{n1}(\theta) - P_{n2}(\theta) (2)
with the penalty functions
P_{n1}(\theta) = n \sum_{j=1}^{m} \sum_{t=1}^{p} p_{\lambda_1}( |\beta_{jt}| ),
P_{n2}(\theta) = n \lambda_2 \sum_{j=1}^{m} \left[ \log\left( \epsilon + p_{\lambda_2}(\pi_j) \right) - \log(\epsilon) \right],
where p_λ(·) is a non-negative and non-decreasing penalty function, ε > 0 is a small constant, and λ_1 > 0 and λ_2 > 0 are two tuning parameters. Thus, we obtain the estimator θ̂_n of θ as follows:
\hat{\theta}_n = \arg\max_{\theta} \tilde{Q}_n(\theta). (3)
To derive some theoretical properties of the estimators θ ^ n , we first define
a_n = \max_{j,t} \left\{ p'_{\lambda_1}( |\beta_{jt0}| ),\; p'_{\lambda_2}( \pi_{j0} ) : \beta_{jt0} \neq 0,\; \pi_{j0} \neq 0 \right\},
b_n = \max_{j,t} \left\{ p''_{\lambda_1}( |\beta_{jt0}| ),\; p''_{\lambda_2}( \pi_{j0} ) : \beta_{jt0} \neq 0,\; \pi_{j0} \neq 0 \right\},
where p'_λ(h) and p''_λ(h) denote the first and second derivatives of the function p_λ(h) with respect to h. To establish the asymptotic properties of the proposed estimators, we assume the following regularity conditions:
(C1)
For any λ, p_λ(0) = 0, and p_λ(·) is non-negative and symmetric. Furthermore, it is non-decreasing and twice differentiable on (0, ∞) with at most a few exceptions.
(C2)
As n → ∞, b_n = o(1).
(C3)
\lim_{n \to \infty} \inf_{0 < h \le n^{-1/2} \log n} \sqrt{n}\, p'_{\lambda}(h) = \infty.
(C4)
The joint density f(z; θ) of Z = (X, Y) has third-order partial derivatives with respect to θ for almost all z.
(C5)
For each θ_0, there exist functions R_1(z) and R_2(z) such that, for θ in a neighborhood N(θ_0) of θ_0,
\left| \frac{\partial f(z; \theta)}{\partial \theta_i} \right| \le R_1(z), \quad \left| \frac{\partial^2 f(z; \theta)}{\partial \theta_i \partial \theta_j} \right| \le R_1(z), \quad \left| \frac{\partial^3 f(z; \theta)}{\partial \theta_i \partial \theta_j \partial \theta_k} \right| \le R_2(z),
where θ_0 is the true parameter, and R_1(z) and R_2(z) satisfy ∫ R_1(z) dz < ∞ and ∫ R_2(z) f(z; θ) dz < ∞.
(C6)
The Fisher information matrix Q ( θ ) is finite and positive definite at θ = θ 0 , where Q ( θ ) is defined as follows,
Q(\theta) = E_{\theta}\!\left[ \frac{\partial \log f(Z; \theta)}{\partial \theta} \left( \frac{\partial \log f(Z; \theta)}{\partial \theta} \right)^{T} \right].
(C7)
p_j > 1, j = 1, …, m.
(C8)
0 < c_1 ≤ σ_j² ≤ c_2 and ||β_j|| ≤ c_3 for j = 1, …, m, where c_1 is a positive constant and c_2, c_3 are large constants.
Remark 1.
Conditions (C1)–(C3) are assumptions on the penalty function and ensure the consistency of the variable selection of the proposed estimators; similar conditions are used in [22]. Condition (C5) ensures that the main term dominates the remainder in the Taylor expansion. Conditions (C4)–(C6) are also used in [17]. Condition (C7) ensures the concavity of the likelihood function, since the log-likelihood of a random sample from the EP distribution is concave when p > 1. Condition (C8) ensures the compactness of the parameter space. Conditions (C7) and (C8) are similarly applied in Wang and Feng [16].
In the following, we state two theorems whose proofs are given in Appendix A.
Theorem 1.
Under conditions (C1), (C2), and (C4)–(C8), if √n min{λ_1, λ_2} → ∞ and min{λ_1, λ_2} → 0, then there exists a local maximizer θ̂_n of the penalized log-likelihood function (2) such that
\| \hat{\theta}_n - \theta_0 \| = O_p( n^{-1/2} ).
Theorem 2.
Under conditions (C1)–(C8), if √n min{λ_1, λ_2} → ∞ and min{λ_1, λ_2} → 0, then for any √n-consistent estimator θ̂_n of θ, we have:
(a)
Sparsity: P{π̂_k = 0} → 1 as n → ∞ for k = m_0 + 1, …, m.
(b)
Sparsity: P{β̂_{kj} = 0} → 1 as n → ∞ for k = 1, …, m_0 and j = t_k + 1, …, p, where t_k is the number of true non-zero regression coefficients in the k-th component.
(c)
Asymptotic normality:
\sqrt{n} \left( Q_1(\theta_{01}) + \frac{ P''_{n1}(\theta_{01}) }{ n } + \frac{ P''_{n2}(\theta_{01}) }{ n } \right) ( \hat{\theta}_{n1} - \theta_{01} ) + \frac{ P'_{n1}(\theta_{01}) + P'_{n2}(\theta_{01}) }{ \sqrt{n} } \xrightarrow{D} N( 0, Q_1(\theta_{01}) ),
where m 0 is the number of true non-zero mixing weights, θ 01 and Q 1 ( θ 01 ) are the true parameter and the corresponding Fisher information when all zero effects are removed, respectively.

3. Algorithm

In this section, we apply a modified EM algorithm and an MM algorithm to solve the optimization problem (3). Let z_{ij} be the unobserved indicator variable that equals one if the i-th observation arises from the j-th component (treated as missing data), and let p_{ij} be the posterior probability that the i-th observation belongs to the j-th component. The expected complete-data log-likelihood function is then given as follows:
\sum_{i=1}^{n} \sum_{j=1}^{m} z_{ij} \log\left[ \pi_j f_{p_j}( Y_i - X_i^T \beta_j; 0, \sigma_j ) \right].
Then, the objective function (2) is rewritten as
\sum_{i=1}^{n} \sum_{j=1}^{m} p_{ij} \log\left[ \pi_j f_{p_j}( Y_i - X_i^T \beta_j; 0, \sigma_j ) \right] - P_{n1}(\theta) - P_{n2}(\theta). (4)
Next, we apply a modified EM algorithm to maximize the objective function (4). The detailed procedure is given as follows:
Step 1 Given the l-th approximation
\hat{\theta}^{(l)} = \left( \hat{\beta}_{11}^{(l)}, \ldots, \hat{\beta}_{1p}^{(l)}, \ldots, \hat{\beta}_{m1}^{(l)}, \ldots, \hat{\beta}_{mp}^{(l)}, \hat{\sigma}_1^{(l)}, \ldots, \hat{\sigma}_m^{(l)}, \hat{p}_1^{(l)}, \ldots, \hat{p}_m^{(l)}, \hat{\pi}_1^{(l)}, \ldots, \hat{\pi}_{m-1}^{(l)} \right),
we can calculate the classification probabilities:
\hat{p}_{ij}^{(l+1)} = \frac{ \hat{\pi}_j^{(l)} f_{\hat{p}_j^{(l)}}( Y_i - X_i^T \hat{\beta}_j^{(l)}; 0, \hat{\sigma}_j^{(l)} ) }{ \sum_{k=1}^{m} \hat{\pi}_k^{(l)} f_{\hat{p}_k^{(l)}}( Y_i - X_i^T \hat{\beta}_k^{(l)}; 0, \hat{\sigma}_k^{(l)} ) }.
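A minimal sketch of this E-step (assuming NumPy/SciPy; e_step and ep_logpdf are illustrative names of ours), computed on the log scale so that the normalization is numerically stable:

```python
import numpy as np
from scipy.special import gammaln

def e_step(Y, X, pi, beta, sigma, p):
    """E-step: posterior probability that observation i belongs to component j."""
    def ep_logpdf(r, pj, sj):
        return (np.log(pj) - gammaln(1.0 / pj) - (1.0 + 1.0 / pj) * np.log(2.0)
                - np.log(sj) - 0.5 * np.abs(r / sj) ** pj)

    logw = np.column_stack([np.log(pi[j]) + ep_logpdf(Y - X @ beta[j], p[j], sigma[j])
                            for j in range(len(pi))])
    logw -= logw.max(axis=1, keepdims=True)       # stabilize before exponentiating
    w = np.exp(logw)
    return w / w.sum(axis=1, keepdims=True)
```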
Step 2 We first update {π_1, …, π_m}. We use a Lagrange multiplier δ to account for the constraint \sum_{j=1}^{m} \pi_j = 1; that is, we solve
\frac{\partial}{\partial \pi_j} \left[ \sum_{i=1}^{n} \sum_{j=1}^{m} \hat{p}_{ij}^{(l+1)} \log(\pi_j) - n \lambda_2 \sum_{j=1}^{m} \log\left( \epsilon + p_{\lambda_2}(\pi_j) \right) - \delta \left( \sum_{j=1}^{m} \pi_j - 1 \right) \right] = 0. (5)
In (5), we apply the local linear approximation [23] to log(ε + p_{λ_2}(π_j)):
\log\left( \epsilon + p_{\lambda_2}(\pi_j) \right) \approx \log\left( \epsilon + p_{\lambda_2}( \hat{\pi}_j^{(l)} ) \right) + \frac{ p'_{\lambda_2}( \hat{\pi}_j^{(l)} ) }{ \epsilon + p_{\lambda_2}( \hat{\pi}_j^{(l)} ) } \left( \pi_j - \hat{\pi}_j^{(l)} \right).
Then, π j can be updated by straightforward calculations,
\hat{\pi}_j^{(l+1)} = \frac{1}{D_j} \sum_{i=1}^{n} \hat{p}_{ij}^{(l+1)},
where
D_j = n \left[ 1 - \lambda_2 \sum_{k=1}^{m} \hat{\pi}_k^{(l)} \frac{ p'_{\lambda_2}( \hat{\pi}_k^{(l)} ) }{ \epsilon + p_{\lambda_2}( \hat{\pi}_k^{(l)} ) } + \lambda_2 \frac{ p'_{\lambda_2}( \hat{\pi}_j^{(l)} ) }{ \epsilon + p_{\lambda_2}( \hat{\pi}_j^{(l)} ) } \right].
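The following sketch implements this weight update; pen and pen_deriv stand for vectorized evaluations of p_{λ_2} and p'_{λ_2} (e.g., a SCAD penalty), and the final renormalization is only a numerical safeguard of ours, not part of the derivation.

```python
import numpy as np

def update_pi(post, pi_old, lam2, eps, pen, pen_deriv):
    """M-step update of the mixing weights.

    post:           (n, m) posterior probabilities from the E-step
    pi_old:         (m,) current mixing weights
    pen, pen_deriv: vectorized callables evaluating p_lambda2 and its derivative
    """
    n = post.shape[0]
    ratio = pen_deriv(pi_old) / (eps + pen(pi_old))       # p'(pi_j) / (eps + p(pi_j))
    D = n * (1.0 - lam2 * np.sum(pi_old * ratio) + lam2 * ratio)
    pi_new = post.sum(axis=0) / D
    return pi_new / pi_new.sum()                          # numerical safeguard only
```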
Next, we update {β_11, …, β_1p, …, β_m1, …, β_mp, σ_1, …, σ_m, p_1, …, p_m} by maximizing the following objective function,
\sum_{i=1}^{n} \sum_{j=1}^{m} \hat{p}_{ij}^{(l+1)} \log\left[ \hat{\pi}_j^{(l+1)} f_{p_j}( Y_i - X_i^T \beta_j; 0, \sigma_j ) \right] - n \sum_{j=1}^{m} \sum_{t=1}^{p} p_{\lambda_1}( |\beta_{jt}| ).
We first update σ_1, …, σ_m. For each σ_j, j = 1, 2, …, m, we only need to maximize
\sum_{i=1}^{n} \hat{p}_{ij}^{(l+1)} \left[ - \log( \sigma_j ) - \frac{1}{2} \left| \frac{ Y_i - X_i^T \hat{\beta}_j^{(l)} }{ \sigma_j } \right|^{ \hat{p}_j^{(l)} } \right].
Then, the resulting estimator is given as follows:
\hat{\sigma}_j^{(l+1)} = \left[ \frac{ \hat{p}_j^{(l)} \sum_{i=1}^{n} \hat{p}_{ij}^{(l+1)} \left| Y_i - X_i^T \hat{\beta}_j^{(l)} \right|^{ \hat{p}_j^{(l)} } }{ 2 \sum_{i=1}^{n} \hat{p}_{ij}^{(l+1)} } \right]^{ 1 / \hat{p}_j^{(l)} }.
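A corresponding sketch of this closed-form update (the factor p̂_j and the 1/p̂_j root come from setting the derivative of the weighted EP log-likelihood with respect to σ_j to zero; the helper name is ours):

```python
import numpy as np

def update_sigma(Y, X, post_j, beta_j, p_j):
    """M-step update of sigma_j for one component.

    post_j: (n,) posterior probabilities of component j from the E-step
    """
    resid = np.abs(Y - X @ beta_j)
    num = p_j * np.sum(post_j * resid ** p_j)
    den = 2.0 * np.sum(post_j)
    return (num / den) ** (1.0 / p_j)
```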
Next, we update p_1, …, p_m. For each p_j, j = 1, 2, …, m, according to condition (C7), we have
\hat{p}_j^{(l+1)} = \arg\max_{p_j > 1} \sum_{i=1}^{n} \hat{p}_{ij}^{(l+1)} \left[ \log( p_j ) - \log \Gamma\!\left( \tfrac{1}{p_j} \right) - \left( 1 + \tfrac{1}{p_j} \right) \log(2) - \frac{1}{2} \left| \frac{ Y_i - X_i^T \hat{\beta}_j^{(l)} }{ \hat{\sigma}_j^{(l+1)} } \right|^{ p_j } \right].
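This one-dimensional problem can be solved numerically; the sketch below uses a bounded scalar search (the upper bound of 4 is an arbitrary safeguard we introduce for the illustration, not a value from the paper).

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize_scalar

def update_p(Y, X, post_j, beta_j, sigma_j, upper=4.0):
    """Bounded 1-D search for p_j > 1 maximizing the weighted EP log-likelihood."""
    resid = np.abs(Y - X @ beta_j) / sigma_j

    def neg_obj(p):
        return -np.sum(post_j * (np.log(p) - gammaln(1.0 / p)
                                 - (1.0 + 1.0 / p) * np.log(2.0)
                                 - 0.5 * resid ** p))

    res = minimize_scalar(neg_obj, bounds=(1.0 + 1e-6, upper), method="bounded")
    return res.x
```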
Finally, we update β_1, …, β_m. Ignoring terms that do not involve β_j, we have
L(\beta_j) = - \sum_{i=1}^{n} \hat{p}_{ij}^{(l+1)} \frac{ 1 }{ 2 ( \hat{\sigma}_j^{(l+1)} )^{ \hat{p}_j^{(l+1)} } } \left| Y_i - X_i^T \beta_j \right|^{ \hat{p}_j^{(l+1)} } - n \sum_{t=1}^{p} p_{\lambda_1}( |\beta_{jt}| ).
Applying the MM algorithm to the first term of L(β_j), we majorize the power of the squared residual as
\left[ ( Y_i - X_i^T \beta_j )^T ( Y_i - X_i^T \beta_j ) \right]^{ \hat{p}_j^{(l+1)} / 2 } \le \left[ ( Y_i - X_i^T \hat{\beta}_j^{(l)} )^T ( Y_i - X_i^T \hat{\beta}_j^{(l)} ) \right]^{ \hat{p}_j^{(l+1)} / 2 } + \frac{ \hat{p}_j^{(l+1)} }{ 2 } \left[ ( Y_i - X_i^T \hat{\beta}_j^{(l)} )^T ( Y_i - X_i^T \hat{\beta}_j^{(l)} ) \right]^{ \hat{p}_j^{(l+1)} / 2 - 1 } \left[ ( Y_i - X_i^T \beta_j )^T ( Y_i - X_i^T \beta_j ) - ( Y_i - X_i^T \hat{\beta}_j^{(l)} )^T ( Y_i - X_i^T \hat{\beta}_j^{(l)} ) \right].
For p_{λ_1}(|β_{jt}|), we apply the local quadratic approximation [22]:
p_{\lambda_1}( |\beta_{jt}| ) \approx p_{\lambda_1}( |\hat{\beta}_{jt}^{(l)}| ) + \frac{ p'_{\lambda_1}( |\hat{\beta}_{jt}^{(l)}| ) }{ 2 |\hat{\beta}_{jt}^{(l)}| } \left( \beta_{jt}^2 - ( \hat{\beta}_{jt}^{(l)} )^2 \right).
Thus, for each β_j, j = 1, 2, …, m, we only need to solve the following minimization problem
\hat{\beta}_j^{(l+1)} = \arg\min_{\beta_j} \sum_{i=1}^{n} \hat{p}_{ij}^{(l+1)} \frac{ \hat{w}_{ij}^{(l)} }{ 2 ( \hat{\sigma}_j^{(l+1)} )^{ \hat{p}_j^{(l+1)} } } ( Y_i - X_i^T \beta_j )^T ( Y_i - X_i^T \beta_j ) + n \sum_{t=1}^{p} \beta_{jt}^2 \frac{ p'_{\lambda_1}( |\hat{\beta}_{jt}^{(l)}| ) }{ 2 |\hat{\beta}_{jt}^{(l)}| },
where \hat{w}_{ij}^{(l)} = \frac{ \hat{p}_j^{(l+1)} }{ 2 } \left[ ( Y_i - X_i^T \hat{\beta}_j^{(l)} )^T ( Y_i - X_i^T \hat{\beta}_j^{(l)} ) \right]^{ \hat{p}_j^{(l+1)} / 2 - 1 }.
Thus, we can update β j as follows
\hat{\beta}_j^{(l+1)} = ( X B X^T + A )^{-1} X B Y,
where
A = n\, \mathrm{diag}\left( \frac{ p'_{\lambda_1}( |\hat{\beta}_{j1}^{(l)}| ) }{ 2 |\hat{\beta}_{j1}^{(l)}| }, \ldots, \frac{ p'_{\lambda_1}( |\hat{\beta}_{jp}^{(l)}| ) }{ 2 |\hat{\beta}_{jp}^{(l)}| } \right), \quad B = \mathrm{diag}\left( \frac{ \hat{p}_{1j}^{(l+1)} \hat{w}_{1j}^{(l)} }{ 2 ( \hat{\sigma}_j^{(l+1)} )^{ \hat{p}_j^{(l+1)} } }, \ldots, \frac{ \hat{p}_{nj}^{(l+1)} \hat{w}_{nj}^{(l)} }{ 2 ( \hat{\sigma}_j^{(l+1)} )^{ \hat{p}_j^{(l+1)} } } \right),
with X = (X_1, …, X_n) the p × n matrix of covariates and Y = (Y_1, …, Y_n)^T.
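A sketch of this weighted ridge-type update for one component, written with the more common (n, d) design-matrix convention so the solve becomes (XᵀBX + A)⁻¹XᵀBY; the SCAD derivative helper and the small stabilizing constants are our own additions for the illustration.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """First derivative of the SCAD penalty of Fan and Li (2001)."""
    t = np.abs(t)
    return lam * ((t <= lam) + np.maximum(a * lam - t, 0.0) / ((a - 1.0) * lam) * (t > lam))

def update_beta(Y, X, post_j, beta_old, sigma_j, p_j, lam1, delta=1e-8):
    """MM/LQA update of beta_j via a single weighted ridge-type solve."""
    resid2 = (Y - X @ beta_old) ** 2 + 1e-12               # squared residuals at current beta
    w = 0.5 * p_j * resid2 ** (0.5 * p_j - 1.0)             # MM weights w_ij
    b = post_j * w / (2.0 * sigma_j ** p_j)                 # diagonal of B
    n = len(Y)
    a_diag = n * scad_deriv(beta_old, lam1) / (2.0 * np.abs(beta_old) + delta)  # diagonal of A
    XtB = X.T * b                                           # (d, n) = X^T B
    return np.linalg.solve(XtB @ X + np.diag(a_diag), XtB @ Y)
```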
Step 3 Repeat Step 1 and Step 2 until convergence.

4. Choice of the Tuning Parameters

The selection of the tuning parameters is a vital part of the order selection and variable selection procedure. To guarantee that the true model can be identified, proper tuning parameters λ_1 and λ_2 must be chosen in practice. There are many methods for selecting λ_1 and λ_2, such as cross-validation (CV), generalized cross-validation (GCV), AIC, and BIC.
As suggested in [24], we introduce a data-driven procedure to choose the tuning parameters λ 1 and λ 2 by minimizing the following modified Bayesian information criterion,
MBIC( \lambda_1, \lambda_2 ) = -2 \sum_{i=1}^{n} \log \sum_{j=1}^{\hat{m}} \hat{\pi}_j f_{\hat{p}_j}( Y_i - X_i^T \hat{\beta}_j; 0, \hat{\sigma}_j ) + \log(n) \cdot df,
where m̂ denotes the estimated number of components, df = 3 m̂ − 1 + M̂_β, and
\hat{M}_{\beta} = \#\left\{ |\hat{\beta}_{jt}| > 10^{-3},\; j = 1, \ldots, \hat{m},\; t = 1, \ldots, p \right\}.
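A small sketch of this criterion (function and argument names are ours; the maximized log-likelihood would be supplied externally, e.g., from the mixture_loglik sketch in Section 2):

```python
import numpy as np

def mbic(loglik, pi_hat, beta_hat, n, tol=1e-3):
    """Modified BIC: loglik is the maximized log-likelihood, pi_hat the estimated
    mixing weights, beta_hat the (m, p) matrix of estimated coefficients."""
    keep = pi_hat > 0                                # components kept by the penalty
    m_hat = int(keep.sum())
    M_beta = int(np.sum(np.abs(beta_hat[keep]) > tol))
    df = 3 * m_hat - 1 + M_beta
    return -2.0 * loglik + np.log(n) * df
```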

5. Simulation

In this section, we use some numerical simulations to illustrate the finite sample performance of the proposed method. For the penalty function, we use the SCAD penalty [22], which is given as follows:
p_{\lambda}(t; a) = \begin{cases} \lambda |t|, & \text{if } |t| \le \lambda, \\ -\dfrac{ t^2 - 2 a \lambda |t| + \lambda^2 }{ 2 (a - 1) }, & \text{if } \lambda < |t| \le a \lambda, \\ (a + 1) \lambda^2 / 2, & \text{otherwise}, \end{cases}
where λ is a tuning parameter and a > 2. Following the suggestion in Fan and Li [22], we set a = 3.7, which approximately minimizes the Bayes risk. The datasets are generated from a three-component FMLR model
f( y \mid x ) = \sum_{j=1}^{3} \pi_j f_{p_j}( y - x^T \beta_j; 0, \sigma_j ), (7)
where x is generated from the 7-dimensional standard normal distribution (i.e., its components are independent N(0, 1) variables). In detail, we generate the random sample of each component from the linear model
Y = X^T \beta + \epsilon.
We simulate 100 datasets from the FMLR model (7) with sample sizes n = 200, 600, 800, and 1000. The datasets are generated under the following four scenarios (a data-generation sketch is given after the list):
Scenario 1. β_1 = (1, 1, 1, 1, 0, 0, 0)^T, β_2 = (1, 2, 3, 4, 0, 0, 0)^T, β_3 = (5, 6, 7, 8, 0, 0, 0)^T, π_1 = 0.4, π_2 = 0.3, π_3 = 0.3, and the random error ε ~ N(0, 1);
Scenario 2. We use the same setting as in Scenario 1, except that the error term follows a t-distribution with 2 degrees of freedom;
Scenario 3. We use the same setting as in Scenario 1, except that the error term follows a mixture of t distributions: ε ~ 0.5 t(1) + 0.5 t(3);
Scenario 4. We use the same setting as in Scenario 1, except that the error term follows a mixture normal distribution: ε ~ 0.95 N(0, 1) + 0.05 N(0, 5²).
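As referenced above, a minimal sketch of the data-generating mechanism for Scenario 1 (Scenarios 2–4 only change the error draw); the function name and the fixed seed are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_scenario1(n):
    """Generate one dataset from the three-component FMLR model of Scenario 1."""
    beta = np.array([[1, 1, 1, 1, 0, 0, 0],
                     [1, 2, 3, 4, 0, 0, 0],
                     [5, 6, 7, 8, 0, 0, 0]], dtype=float)
    pi = np.array([0.4, 0.3, 0.3])
    X = rng.standard_normal((n, 7))
    z = rng.choice(3, size=n, p=pi)                 # latent component labels
    eps = rng.standard_normal(n)                    # N(0, 1) errors in Scenario 1
    Y = np.einsum("ij,ij->i", X, beta[z]) + eps     # row-wise X_i^T beta_{z_i}
    return X, Y, z
```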
We compare our proposed method with the method proposed by [17]. To assess the finite-sample performance, we consider four different measures (a small sketch of how they are computed follows the list):
(1)
RMSE_{π_j}: the root mean square error of π̂_j when the order is correctly estimated, which is defined by
RMSE_{\pi_j} = \sqrt{ \frac{1}{M^*} \sum_{m=1}^{M^*} ( \hat{\pi}_j^{(m)} - \pi_j )^T ( \hat{\pi}_j^{(m)} - \pi_j ) },
where M* is the number of simulations in which the order is correctly estimated.
(2)
RMSE_{β_c}: the root mean square error of β̂_j, calculated analogously to RMSE_{π_j}.
(3)
NCZ (the number of correct zeros): the number of parameters whose true value is zero and that are correctly estimated as zero, calculated by
NCZ = \#\left\{ t : \beta_t = 0 \text{ and } \hat{\beta}_t = 0 \right\},
where # { A } denotes the number of elements within A.
(4)
NIZ (the number of incorrect zeros): the number of parameters whose true value is non-zero but that are incorrectly estimated as zero, given by
NIZ = \#\left\{ t : \beta_t \neq 0 \text{ and } \hat{\beta}_t = 0 \right\}.
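As referenced above, a short sketch of these selection measures for a single replication (the 10⁻³ threshold mirrors the one used for M̂_β; names are ours):

```python
import numpy as np

def selection_measures(beta_hat, beta_true, tol=1e-3):
    """NCZ and NIZ for one replication; coefficients below tol count as zero."""
    est_zero = np.abs(beta_hat) <= tol
    true_zero = beta_true == 0
    ncz = int(np.sum(true_zero & est_zero))     # correctly estimated zeros
    niz = int(np.sum(~true_zero & est_zero))    # non-zeros wrongly shrunk to zero
    return ncz, niz

def rmse(estimates, truth):
    """Root mean square error over the replications with correctly selected order."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((estimates - truth) ** 2)))
```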
In the simulation studies, we assume that the data come from a mixture regression model with at most five components, so the true number of components must be estimated. For each scenario, the simulation is repeated 100 times. The corresponding results are shown in Table 1, Table 2, Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8, in which M_1 and M_2 denote the method of [17] and our proposed method, respectively.
Table 1 shows the simulation results for order selection. Columns labeled “Underfitted” report the proportion of fitted models with fewer than three components over the 100 simulations; “Correctly fitted” and “Overfitted” are interpreted analogously. From Table 1, we find that the two methods behave very similarly, and the accuracy of order selection reaches more than 98% for both M_1 and M_2 when n is larger than or equal to 600. Table 2 presents the results of variable selection and parameter estimation for each component. From Table 2, we observe that the finite-sample performances of the two methods are very similar for n ≥ 600. Therefore, when the error term follows a normal distribution, the two methods perform similarly once the sample size is sufficiently large.
Table 3 and Table 4 present the results for Scenario 2, a heavy-tailed setting. We can observe from Table 3 that M_1 still underfits or overfits the number of components in roughly 20% of replications even for large n, while our method remains robust and maintains at least 98% accuracy when n ≥ 600. In Table 4, M_1 performs poorly in variable selection: it has many non-zero NIZ values, while the NIZ of our method is zero for all n ≥ 600. Meanwhile, the NCZ of our proposed method increases as n increases. In addition, our proposed method has a smaller RMSE than M_1.
Table 5 shows that the order selection performance of M_1 is worse than that of M_2. The proportion of correctly fitted models remains at least 98% with our method for n ≥ 600, while M_1 tends to overfit the number of components. In Table 6, the NCZ value of M_1 is slightly better than that of M_2. Comparing RMSE_{β_c}, we find that our method is consistently better than M_1.
Table 7 and Table 8 present the results for Scenario 4. M_1 rarely identifies the right number of components, whereas our method selects the correct number of components with at least 98% accuracy for n ≥ 600. In Table 8, M_1 is better than M_2 in NCZ, but it is unstable in NIZ. Comparing RMSE_{β_c}, we find that M_1 is larger than M_2. In general, our method outperforms M_1 in order selection, variable selection, and parameter estimation.

6. Real Data Analysis

In this section, we apply the proposed methodology to analyze baseball salary data, which consists of information about major league baseball players. The response variable is their 1992 salaries (measured in thousands of dollars). In addition, there are 16 performance measures for 337 MLB players who participated in at least one game in both the 1991 and 1992 seasons. This data set has been analyzed by others, such as [12,17]. We want to study how the performance measures affect salaries using our method.
The performance measures are batting average (x_1), on-base percentage (x_2), runs (x_3), hits (x_4), doubles (x_5), triples (x_6), home runs (x_7), runs batted in (x_8), walks (x_9), strikeouts (x_10), stolen bases (x_11), and errors (x_12); and indicators of free agency eligibility (x_13), free agent in 1991/2 (x_14), arbitration eligibility (x_15), and arbitration in 1991/2 (x_16). The four dummy variables x_13–x_16 indicate how free each player was to move to another team. As suggested in [25], the interaction effects between the dummy variables x_13–x_16 and the quantitative variables x_1, x_3, x_7, and x_8 should also be considered. Therefore, we obtain a set of 32 potential covariates affecting each player's salary. Ref. [12] fitted a mixture of linear regression models with two or three components to depict the overlaid shape of the histogram of log(salary) and concluded that a two-component mixture regression model (labeled MIXSCAD) fitted the data well. Ref. [17] used an FMLR model based on the normal distribution and also selected two components.
As advocated by [12], we use log(salary) as the response variable. We first fit a linear model via stepwise regression; the results are shown in Table 9 and denoted by β̂_ols. Following [17], we consider the following four-component mixture model,
Y \mid X \sim \sum_{j=1}^{4} \pi_j f_{p_j}( Y - X^T \beta_j; 0, \sigma_j ),
where Y = log(salary) and X is a 33 × 1 vector containing 32 covariates plus an intercept term. In order to implement the proposed modified EM algorithm, we set the initial values as follows
\pi_0 = (0.4, 0.2, 0.2, 0.2)^T, \quad \sigma_0 = (10, 10, 10, 10)^T, \quad p_0 = (1, 1, 1, 1)^T, \quad \beta_{j0} = \hat{\beta}_{ols} + \epsilon_j,
where ε_j ~ N(0, I), j = 1, 2, 3, 4. The results are reported in Table 9 and Table 10. From Table 9, we find that both M_1 and M_2 choose two components. Furthermore, we observe from Table 10 that M_2 has a smaller BIC value than M_1, which indicates that our proposed method fits this dataset better than M_1.
It is of interest to explain how the performance measures affect salaries by interpreting the fitted model, although such interpretation can be a source of controversy. Intuitively, many performance measures should be positively correlated with a player's salary. M_1 and M_2 give coefficients of the same sign and similar magnitude for x_0, x_13, x_15, and the interaction of x_8 and x_14. Recall that x_1 and x_7 measure individual performance, while x_13, x_15, and x_16 are dummy variables indicating how freely players can change teams. For example, the effect of x_1 x_16 suggests that, for players who went through arbitration in 1991/2, individual ability (x_1) is associated with a lower salary, whereas the value of their team contribution (x_8) is not.
The main differences between the two models lie in the interaction effects x_1 x_14 and x_1 x_15. M_1 disregards both effects, whereas M_2 picks them up in different components and, in particular, attaches great importance to the interaction effect x_1 x_14.

7. Discussion

In this paper, we introduced FMLR models with an exponential power error distribution. Under some regularity conditions, the asymptotic properties of the proposed estimators were established. Meanwhile, a modified EM algorithm and an MM algorithm were applied to solve the proposed optimization problem. Furthermore, the merits of the proposed methodology were illustrated through numerical simulations and a real data analysis. The simulation studies showed that the proposed method performs better than the existing method under different error distributions. In the analysis of the baseball salary dataset, our proposed method attained a smaller BIC value than the method proposed in [17].

Author Contributions

Methodology, Y.J. and J.L.; Formal analysis, H.Z. and X.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research is partially supported by NSFC (12171203), the Fundamental Research Funds for the Central Universities (23JNQMX21) and the Natural Science Foundation of Guangdong (2022A1515010045).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data used in this study are publicly available. Code is available on request from the second author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Proof of Theorem 1.
For any given ε > 0, let ||u|| = M_ε. Denote
\Gamma_n(u) = \tilde{Q}_n( \theta_0 + u/\sqrt{n} ) - \tilde{Q}_n( \theta_0 ).
According to (2), we have
\Gamma_n(u) = \left[ Q_n( \theta_0 + u/\sqrt{n} ) - Q_n(\theta_0) \right] - \left[ P_{n1}( \theta_0 + u/\sqrt{n} ) - P_{n1}(\theta_0) \right] - \left[ P_{n2}( \theta_0 + u/\sqrt{n} ) - P_{n2}(\theta_0) \right].
Under condition (C1), we have p_λ(0) = 0 for any λ. Therefore, P_{n1}(θ_0) = P_{n1}(θ_{01}) and P_{n2}(θ_0) = P_{n2}(θ_{01}). Since P_{n1}(θ_0 + u/√n) and P_{n2}(θ_0 + u/√n) are sums of non-negative terms, we then have
\Gamma_n(u) \le \left[ Q_n( \theta_0 + u/\sqrt{n} ) - Q_n(\theta_0) \right] - \left[ P_{n1}( \theta_{01} + u_1/\sqrt{n} ) - P_{n1}(\theta_{01}) \right] - \left[ P_{n2}( \theta_{01} + u_1/\sqrt{n} ) - P_{n2}(\theta_{01}) \right],
where u 1 is a subvector of u with the corresponding non-zero coefficients.
By conditions (C4), (C5), (C7) and (C8), and a Taylor expansion, we have
Q_n( \theta_0 + u/\sqrt{n} ) - Q_n(\theta_0) = n^{-1/2} Q'_n(\theta_0)^T u - \frac{1}{2} u^T Q(\theta_0) u \,(1 + o_p(1)).
By condition (C1), a Taylor expansion, the triangle inequality, and the Cauchy–Schwarz inequality, we have
P_{n1}( \theta_{01} + u_1/\sqrt{n} ) - P_{n1}(\theta_{01}) = n \sum_{k=1}^{m_0} \sum_{j=1}^{t_k} \left[ p_{\lambda_1}( | \beta_{kj} + u_{kj}/\sqrt{n} | ) - p_{\lambda_1}( | \beta_{kj} | ) \right] \le m_0 t \left[ \sqrt{n}\, a_n \| u \| + \frac{ b_n }{ 2 } \| u \|^2 \right] (1 + o(1)),
where m_0 is the number of true non-zero mixing weights, t = max_k t_k, and t_k is the number of true non-zero regression coefficients in the k-th component.
Since √n λ_2 → ∞ and λ_2 → 0, we have, for sufficiently large n,
\left| P_{n2}( \theta_{01} + u_1/\sqrt{n} ) - P_{n2}(\theta_{01}) \right| = 0.
Regularity condition (C6) implies that n^{-1/2} Q'_n(θ_0) = O_p(1). Since √n min{λ_1, λ_2} → ∞ and min{λ_1, λ_2} → 0, we have a_n = 0 for sufficiently large n. By conditions (C2) and (C6), for any given ε > 0, there exists a sufficiently large M_ε such that
\lim_{n \to \infty} P\left( \sup_{ \| u \| = M_\epsilon } \Gamma_n(u) < 0 \right) \ge 1 - \epsilon.
Therefore, with probability at least 1 − ε, there is a local maximizer in { θ_0 + u/√n : ||u|| ≤ M_ε }. That is, this local maximizer θ̂_n satisfies ||θ̂_n − θ_0|| = O_p(1/√n). This completes the proof of Theorem 1. □
Proof of Theorem 2.
We first show that π̂_k = 0 for k = m_0 + 1, …, m. Since ||θ̂_n − θ_0|| = O_p(n^{-1/2}), we have π̂_k = O_p(1/√n) for k = m_0 + 1, …, m. To prove (a), it is sufficient to show that, with probability tending to 1 as n → ∞, for any π_k satisfying π̂_k − π_k = O_p(1/√n) and k = m_0 + 1, …, m,
\frac{ \partial Q^*(\theta) }{ \partial \pi_k } \Big|_{ \pi_k = \hat{\pi}_k } < 0 \quad \text{for } 0 < \hat{\pi}_k < C/\sqrt{n}, (A1)
where C is a positive constant,
Q^*(\theta) = \tilde{Q}_n(\theta) - \delta \left( \sum_{k=1}^{m} \pi_k - 1 \right),
and δ is a Lagrange multiplier. Therefore, π̂_k, k = 1, …, m, should satisfy
\frac{ \partial Q^*(\theta) }{ \partial \pi_k } = \sum_{i=1}^{n} \frac{ f_{p_k}( Y_i - X_i^T \beta_k; 0, \sigma_k ) }{ \sum_{j=1}^{m} \hat{\pi}_j f_{p_j}( Y_i - X_i^T \beta_j; 0, \sigma_j ) } - n \lambda_2 \frac{ p'_{\lambda_2}( \hat{\pi}_k ) }{ C_0 + p_{\lambda_2}( \hat{\pi}_k ) } - \delta = 0. (A2)
We first consider k ≤ m_0. By the law of large numbers, we have
\sum_{i=1}^{n} \frac{ f_{p_k}( Y_i - X_i^T \beta_k; 0, \sigma_k ) }{ \sum_{j=1}^{m} \hat{\pi}_j f_{p_j}( Y_i - X_i^T \beta_j; 0, \sigma_j ) } = O_p(n). (A3)
For k ≤ m_0, we have π̂_k = π_k^0 + O_p(1/√n) > ½ min{π_1^0, …, π_{m_0}^0} with probability tending to one. Since λ_2 → 0, p'_{λ_2}(π̂_k) = o_p(1) and p_{λ_2}(π̂_k) = o_p(1), so
n \lambda_2 \frac{ p'_{\lambda_2}( \hat{\pi}_k ) }{ C_0 + p_{\lambda_2}( \hat{\pi}_k ) } = o_p(n). (A4)
By (A2)–(A4), we have δ = O_p(n). For k ≥ m_0 + 1 and π̂_k < C/√n, we have π̂_k = O_p(1/√n). Since √n λ_2 → ∞, C_0 is sufficiently small, and p_λ(·) is the SCAD penalty, we have
\frac{1}{n} \cdot n \lambda_2 \frac{ p'_{\lambda_2}( \hat{\pi}_k ) }{ C_0 + p_{\lambda_2}( \hat{\pi}_k ) } = \frac{ \lambda_2^2 }{ C_0 + \lambda_2 \hat{\pi}_k },
which is of order at least √n λ_2 → ∞ in probability. Therefore, the first and third terms in Equation (A2) are dominated by the second term, which proves Equation (A1). This completes the proof of (a).
To prove (b), for any θ with m_0 components, we write θ_{m_0} = (θ_{m_0,1}, θ_{m_0,2}) for any θ_{m_0} in the neighborhood ||θ_{m_0} − θ_{m_0,0}|| = O_p(1/√n), where θ_{m_0,2} collects all zero effects, i.e., β_{kj} = 0 for k = 1, …, m_0 and j = t_k + 1, …, p. By (2), we have
\tilde{Q}_n\{ (\theta_{m_0,1}, \theta_{m_0,2}) \} - \tilde{Q}_n\{ (\theta_{m_0,1}, 0) \} = \left[ Q_n\{ (\theta_{m_0,1}, \theta_{m_0,2}) \} - Q_n\{ (\theta_{m_0,1}, 0) \} \right] - \left[ P_{n1}\{ (\theta_{m_0,1}, \theta_{m_0,2}) \} - P_{n1}\{ (\theta_{m_0,1}, 0) \} \right] = \left[ Q_n\{ (\theta_{m_0,1}, \theta_{m_0,2}) \} - Q_n\{ (\theta_{m_0,1}, 0) \} \right] - n \sum_{k=1}^{m_0} \sum_{j=t_k+1}^{p} p_{\lambda_1}( |\beta_{kj}| ).
According to the mean value theorem, we have
Q_n\{ (\theta_{m_0,1}, \theta_{m_0,2}) \} - Q_n\{ (\theta_{m_0,1}, 0) \} = \left( \frac{ \partial Q_n\{ (\theta_{m_0,1}, \gamma) \} }{ \partial \theta_{m_0,2} } \right)^{T} \theta_{m_0,2}, (A5)
where ||γ|| ≤ ||θ_{m_0,2}|| = O(n^{-1/2}). Since
\left\| \frac{ \partial Q_n\{ (\theta_{m_0,1}, \gamma) \} }{ \partial \theta_{m_0,2} } - \frac{ \partial Q_n\{ (\theta_{m_0,01}, 0) \} }{ \partial \theta_{m_0,2} } \right\| \le \left\| \frac{ \partial Q_n\{ (\theta_{m_0,1}, \gamma) \} }{ \partial \theta_{m_0,2} } - \frac{ \partial Q_n\{ (\theta_{m_0,1}, 0) \} }{ \partial \theta_{m_0,2} } \right\| + \left\| \frac{ \partial Q_n\{ (\theta_{m_0,1}, 0) \} }{ \partial \theta_{m_0,2} } - \frac{ \partial Q_n\{ (\theta_{m_0,01}, 0) \} }{ \partial \theta_{m_0,2} } \right\| \le \sum_{i=1}^{n} R_1(z_i) \| \gamma \| + \sum_{i=1}^{n} R_1(z_i) \| \theta_{m_0,1} - \theta_{m_0,01} \| = \left( \| \gamma \| + \| \theta_{m_0,1} - \theta_{m_0,01} \| \right) O_p(n) = O_p(n^{1/2}),
and
\frac{ \partial Q_n\{ (\theta_{m_0,01}, 0) \} }{ \partial \theta_{m_0,2} } = O_p(n^{1/2}),
we have
\frac{ \partial Q_n\{ (\theta_{m_0,1}, \gamma) \} }{ \partial \theta_{m_0,2} } = O_p(n^{1/2}). (A6)
By (A5) and (A6), we have
Q_n\{ (\theta_{m_0,1}, \theta_{m_0,2}) \} - Q_n\{ (\theta_{m_0,1}, 0) \} = O_p(n^{1/2}) \sum_{k=1}^{m_0} \sum_{j=t_k+1}^{p} | \beta_{kj} |.
Thus, we have
\tilde{Q}_n\{ (\theta_{m_0,1}, \theta_{m_0,2}) \} - \tilde{Q}_n\{ (\theta_{m_0,1}, 0) \} = \sum_{k=1}^{m_0} \sum_{j=t_k+1}^{p} \left[ O_p(n^{1/2}) | \beta_{kj} | - n\, p_{\lambda_1}( |\beta_{kj}| ) \right].
By condition (C3), for 0 < |t| ≤ n^{-1/2} log n, we have O_p(n^{1/2}) |t| < n p_{λ_1}(|t|) with probability tending to one. Therefore, we obtain
\tilde{Q}_n\{ (\theta_{m_0,1}, \theta_{m_0,2}) \} - \tilde{Q}_n\{ (\theta_{m_0,1}, 0) \} < 0. (A7)
By (A7), with probability tending to 1 as n → ∞, we have
\tilde{Q}_n\{ (\theta_{m_0,1}, \theta_{m_0,2}) \} - \tilde{Q}_n\{ (\hat{\theta}_{m_0,1}, 0) \} = \left[ \tilde{Q}_n\{ (\theta_{m_0,1}, \theta_{m_0,2}) \} - \tilde{Q}_n\{ (\theta_{m_0,1}, 0) \} \right] + \left[ \tilde{Q}_n\{ (\theta_{m_0,1}, 0) \} - \tilde{Q}_n\{ (\hat{\theta}_{m_0,1}, 0) \} \right] < 0.
Thus, this completes the proof of part (b).
By the result of Theorem 1, there exists a √n-consistent local maximizer θ̂_{n1} of Q̃_n{(θ_1, 0)} such that θ̂_n = (θ̂_{n1}, 0) satisfies
\frac{ \partial \tilde{Q}_n( \hat{\theta}_n ) }{ \partial \theta_1 } = \left[ \frac{ \partial Q_n(\theta) }{ \partial \theta_1 } - \frac{ \partial P_{n1}(\theta) }{ \partial \theta_1 } - \frac{ \partial P_{n2}(\theta) }{ \partial \theta_1 } \right]_{ \theta = \hat{\theta}_n } = 0. (A8)
By the Taylor’s expansion, we have
Q n ( θ ) θ 1 θ = θ ^ n = Q n ( θ 01 ) θ 1 + 2 Q n ( θ 01 ) θ 1 θ 1 T + o p ( n ) ( θ ^ n 1 θ 01 ) ,
P n 1 ( θ ) θ 1 θ = θ ^ n = P n 1 ( θ 01 ) + P n 1 ( θ 01 ) + o p ( n ) ( θ ^ n 1 θ 01 ) ,
P n 2 ( θ ) θ 1 θ = θ ^ n = P n 2 ( θ 01 ) + P n 2 ( θ 01 ) + o p ( n ) ( θ ^ n 1 θ 01 ) .
By substituting Equations (A9)–(A11) into (A8), we have
\left[ \frac{ \partial^2 Q_n(\theta_{01}) }{ \partial \theta_1 \partial \theta_1^T } - P''_{n1}(\theta_{01}) - P''_{n2}(\theta_{01}) + o_p(n) \right] ( \hat{\theta}_{n1} - \theta_{01} ) = - \frac{ \partial Q_n(\theta_{01}) }{ \partial \theta_1 } + P'_{n1}(\theta_{01}) + P'_{n2}(\theta_{01}).
By the conditions (C4), (C5), and (C6), we have
- \frac{1}{n} \frac{ \partial^2 Q_n(\theta_{01}) }{ \partial \theta_1 \partial \theta_1^T } = Q_1(\theta_{01}) + o_p(1),
\frac{1}{ \sqrt{n} } \frac{ \partial Q_n(\theta_{01}) }{ \partial \theta_1 } \xrightarrow{D} N( 0, Q_1(\theta_{01}) ).
By Slutsky’s theorem, we have
\sqrt{n} \left( Q_1(\theta_{01}) + \frac{ P''_{n1}(\theta_{01}) }{ n } + \frac{ P''_{n2}(\theta_{01}) }{ n } \right) ( \hat{\theta}_{n1} - \theta_{01} ) + \frac{ P'_{n1}(\theta_{01}) + P'_{n2}(\theta_{01}) }{ \sqrt{n} } \xrightarrow{D} N( 0, Q_1(\theta_{01}) ).
This completes the proof of part (c). □

References

  1. Quandt, R.E. A new approach to estimating switching regressions. J. Am. Stat. Assoc. 1972, 67, 306–310. [Google Scholar] [CrossRef]
  2. Goldfeld, S.M.; Quandt, R.E. A Markov model for switching regressions. J. Econom. 1973, 1, 3–15. [Google Scholar] [CrossRef]
  3. Jacobs, R.A.; Jordan, M.I.; Nowlan, S.J.; Hinton, G.E. Adaptive mixtures of local experts. Neural Comput. 1991, 3, 79–87. [Google Scholar] [CrossRef] [PubMed]
  4. Wedel, M.; Kamakura, W.A. Market Segmentation: Conceptual and Methodological Foundations; Springer Science & Business Media: Berlin, Germany, 2000. [Google Scholar]
  5. Skrondal, A.; Rabe-Hesketh, S. Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models; Chapman and Hall/CRC: Boca Raton, FL, USA, 2004. [Google Scholar]
  6. Peel, D.; McLachlan, G. Finite Mixture Models; John Wiley & Sons: Toronto, ON, Canada, 2000. [Google Scholar]
  7. McLachlan, G.J.; Lee, S.X.; Rathnayake, S.I. Finite mixture models. Annu. Rev. Stat. Appl. 2019, 6, 355–378. [Google Scholar] [CrossRef]
  8. Yu, C.; Yao, W.; Yang, G. A selective overview and comparison of robust mixture regression estimators. Int. Stat. Rev. 2020, 88, 176–202. [Google Scholar] [CrossRef]
  9. Chen, J.; Khalili, A. Order selection in finite mixture models with a nonsmooth penalty. J. Am. Stat. Assoc. 2009, 104, 187–196. [Google Scholar] [CrossRef]
  10. Peng, H.; Huang, T.; Zhang, K. Model Selection for Gaussian Mixture Models. Stat. Sin. 2017, 27, 147–169. [Google Scholar]
  11. Wang, P.; Puterman, M.L.; Cockburn, I.; Le, N. Mixed Poisson regression models with covariate dependent rates. Biometrics 1996, 52, 381–400. [Google Scholar] [CrossRef]
  12. Khalili, A.; Chen, J. Variable selection in finite mixture of regression models. J. Am. Stat. Assoc. 2007, 102, 1025–1038. [Google Scholar] [CrossRef]
  13. Jiang, Y. Robust variable selection for mixture linear regression models. Hacet. J. Math. Stat. 2016, 45, 549–559. [Google Scholar] [CrossRef]
  14. Luo, R.; Wang, H.; Tsai, C.L. On mixture regression shrinkage and selection via the MR-Lasso. Int. J. Pure Appl. Math. 2008, 46, 403–414. [Google Scholar]
  15. Jiang, Y.; Huang, M.; Wei, X.; Tonghua, H.; Hang, Z. Robust mixture regression via an asymmetric exponential power distribution. Commun. Stat.-Simul. Comput. 2022, 1–12. [Google Scholar] [CrossRef]
  16. Wang, X.; Feng, Z. Component selection for exponential power mixture models. J. Appl. Stat. 2023, 50, 291–314. [Google Scholar] [CrossRef] [PubMed]
  17. Yu, C.; Wang, X. A new model selection procedure for finite mixture regression models. Commun. Stat.-Theory Methods 2020, 49, 4347–4366. [Google Scholar] [CrossRef]
  18. Chen, X. Robust mixture regression with Exponential Power distribution. arXiv 2020, arXiv:2012.10637. [Google Scholar]
  19. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1977, 39, 1–22. [Google Scholar] [CrossRef]
  20. Hunter, D.R.; Lange, K. A tutorial on MM algorithms. Am. Stat. 2004, 58, 30–37. [Google Scholar] [CrossRef]
  21. Kobayashi, G. Skew exponential power stochastic volatility model for analysis of skewness, non-normal tails, quantiles and expectiles. Comput. Stat. 2016, 31, 49–88. [Google Scholar] [CrossRef]
  22. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
  23. Zou, H.; Li, R. One-step sparse estimates in nonconcave penalized likelihood models. Ann. Stat. 2008, 36, 1509–1533. [Google Scholar]
  24. Wang, H.; Li, R.; Tsai, C.L. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika 2007, 94, 553–568. [Google Scholar] [CrossRef]
  25. Watnik, M.R. Pay for play: Are baseball salaries based on performance? J. Stat. Educ. 1998, 6, 1–5. [Google Scholar] [CrossRef]
Table 1. Order selection results in Scenario 1.

n       M1                                              M2
        Underfitted   Correctly Fitted   Overfitted     Underfitted   Correctly Fitted   Overfitted
200     0.00          0.99               0.01           0.40          0.60               0.00
600     0.00          0.99               0.01           0.00          0.99               0.01
800     0.00          0.99               0.01           0.00          0.99               0.01
1000    0.00          1.00               0.00           0.00          0.99               0.01
Table 2. Variable selection and parameter estimation results in Scenario 1.

n       M1                                          M2
        RMSE_πc   RMSE_βc   NCZ     NIZ             RMSE_πc   RMSE_βc   NCZ     NIZ
200     0.092     0.535     2.900   0.200           0.143     0.352     2.667   0.000
        0.048     0.596     2.700   0.000           0.141     0.207     2.333   0.000
        0.073     0.703     2.670   0.000           0.048     0.427     2.833   0.000
600     0.024     0.154     2.990   0.000           0.023     0.156     2.990   0.000
        0.025     0.153     2.980   0.000           0.025     0.151     2.990   0.000
        0.021     0.154     2.990   0.000           0.022     0.156     2.980   0.000
800     0.022     0.142     2.990   0.000           0.020     0.145     3.000   0.000
        0.020     0.138     2.980   0.000           0.021     0.153     3.000   0.000
        0.019     0.141     2.990   0.000           0.020     0.138     3.000   0.000
1000    0.014     0.123     2.990   0.000           0.015     0.130     3.000   0.000
        0.016     0.122     3.000   0.000           0.014     0.121     3.000   0.000
        0.014     0.121     3.000   0.000           0.014     0.122     3.000   0.000
Table 3. Order selection results in Scenario 2.

n       M1                                              M2
        Underfitted   Correctly Fitted   Overfitted     Underfitted   Correctly Fitted   Overfitted
200     0.50          0.20               0.30           0.16          0.64               0.20
600     0.07          0.81               0.12           0.00          0.99               0.01
800     0.03          0.75               0.22           0.01          0.98               0.01
1000    0.11          0.84               0.05           0.00          0.99               0.01
Table 4. Variable selection and parameter estimation results in Scenario 2.

n       M1                                          M2
        RMSE_πc   RMSE_βc   NCZ     NIZ             RMSE_πc   RMSE_βc   NCZ     NIZ
200     0.285     8.381     2.500   0.000           0.088     0.964     2.722   0.056
        0.126     0.457     1.500   0.000           0.058     1.957     2.778   0.076
        0.223     2.876     2.000   2.000           0.090     0.852     2.772   0.000
600     0.057     0.671     2.893   0.000           0.048     0.261     2.963   0.000
        0.062     1.119     2.844   0.011           0.040     0.240     2.876   0.000
        0.055     1.264     2.872   0.034           0.044     0.264     2.896   0.000
800     0.065     0.715     2.897   0.013           0.034     0.223     2.845   0.000
        0.053     0.874     2.892   0.012           0.032     0.228     2.957   0.000
        0.047     1.241     2.887   0.000           0.033     0.193     2.929   0.000
1000    0.063     0.905     2.912   0.012           0.029     0.198     2.906   0.000
        0.056     0.926     2.923   0.011           0.031     0.191     2.946   0.000
        0.047     0.837     2.921   0.012           0.033     0.188     2.979   0.000
Table 5. Order selection results in Scenario 3.

n       M1                                              M2
        Underfitted   Correctly Fitted   Overfitted     Underfitted   Correctly Fitted   Overfitted
200     0.60          0.25               0.15           0.32          0.66               0.02
600     0.00          0.74               0.26           0.00          0.98               0.02
800     0.03          0.73               0.24           0.00          0.99               0.01
1000    0.05          0.79               0.16           0.00          0.99               0.01
Table 6. Variable selection and parameter estimation results in Scenario 3.

n       M1                                          M2
        RMSE_πc   RMSE_βc   NCZ     NIZ             RMSE_πc   RMSE_βc   NCZ     NIZ
200     0.236     2.312     2.000   0.000           0.039     0.766     3.000   0.500
        0.165     3.903     2.400   0.200           0.054     0.969     3.000   0.167
        0.060     1.732     2.600   1.400           0.049     0.929     3.000   0.667
600     0.025     0.164     2.887   0.000           0.023     0.122     2.874   0.000
        0.025     0.156     2.889   0.000           0.024     0.127     2.869   0.000
        0.027     0.162     2.896   0.000           0.025     0.124     2.877   0.000
800     0.024     0.154     2.893   0.000           0.018     0.114     2.878   0.000
        0.023     0.137     2.886   0.000           0.017     0.123     2.931   0.000
        0.019     0.134     2.897   0.000           0.017     0.123     2.931   0.000
1000    0.021     0.132     2.894   0.000           0.017     0.097     2.924   0.000
        0.020     0.138     2.924   0.000           0.017     0.097     2.924   0.000
        0.019     0.122     2.891   0.000           0.016     0.114     2.971   0.000
Table 7. Order selection results in Scenario 4.

n       M1                                              M2
        Underfitted   Correctly Fitted   Overfitted     Underfitted   Correctly Fitted   Overfitted
200     0.49          0.41               0.10           0.07          0.72               0.21
600     0.13          0.28               0.59           0.02          0.98               0.00
800     0.19          0.31               0.50           0.00          1.00               0.00
1000    0.10          0.39               0.51           0.01          0.99               0.00
Table 8. Variable selection and parameter estimation results in Scenario 4.

n       M1                                          M2
        RMSE_πc   RMSE_βc   NCZ     NIZ             RMSE_πc   RMSE_βc   NCZ     NIZ
200     0.014     5.455     2.889   0.444           0.218     0.971     2.500   0.000
        0.180     1.337     2.889   0.222           0.228     1.669     2.000   0.000
        0.079     2.956     2.889   0.000           0.320     2.284     2.250   0.000
600     0.050     2.672     2.832   0.000           0.054     0.372     2.776   0.000
        0.053     1.256     2.717   0.000           0.055     0.273     2.724   0.000
        0.054     2.136     2.846   0.038           0.061     0.271     2.878   0.000
800     0.051     1.134     2.811   0.000           0.035     0.398     2.600   0.000
        0.045     1.535     2.623   0.000           0.039     0.163     2.600   0.000
        0.047     2.724     2.747   0.000           0.022     0.183     3.000   0.000
1000    0.074     1.217     2.942   0.000           0.035     0.347     2.973   0.000
        0.073     1.736     2.974   0.103           0.031     0.220     2.697   0.000
        0.047     3.734     2.772   0.000           0.025     0.129     2.949   0.000
Table 9. Parameter estimates for baseball salary data.

Covariates   Linear Model   M1 Comp1   M1 Comp2   M2 Comp1   M2 Comp2
x_0          5.48           4.81       5.66       4.70       4.67
x_1          -              -          -          -          -
x_2          -1.54          -          -          -          -
x_3          -              -          -          -          -
x_4          -              -          0.01       0.03       0.02
x_5          -              -          -          -          0.01
x_6          -              -          -          -          -
x_7          -              -          -          -          -
x_8          0.01           0.01       0.02       0.01       -
x_9          0.01           -          -          0.03       0.01
x_10         -0.01          -          -          -          -
x_11         -              0.03       -          -          0.01
x_12         -              -          -          -          -
x_13         1.52           2.04       -          3.13       2.16
x_14         -0.48          -          -          -          -
x_15         1.35           1.60       -          2.73       1.28
x_16         -              -          -          0.01       1.40
x_1 x_13     -              -          -          -          -
x_1 x_14     -              -          -          -          10.05
x_1 x_15     -              -          -          0.01       -
x_1 x_16     -4.38          -          -          -          -
x_3 x_13     -              -          -          -          -
x_3 x_14     -              -          -          -0.01      -0.02
x_3 x_15     -              -          -          -          -
x_3 x_16     -              -          -          0.01       -
x_7 x_13     0.01           -          -          0.03       -
x_7 x_14     0.03           -          -          -          0.02
x_7 x_15     -              -          -          -          -
x_7 x_16     -              -          -          -          -
x_8 x_13     -              -          0.01       -          0.01
x_8 x_14     -              0.01       -          0.01       -
x_8 x_15     -              -          0.02       -          -
x_8 x_16     0.02           -          -          -          0.02
Table 10. Model parameter estimates for baseball salary data.

Parameter   M1 (Comp1, Comp2)   M2 (Comp1, Comp2)
π̂           0.69, 0.31          0.84, 0.16
p̂           2, 2                1.05, 1.49
σ̂           -, -                2.27, 7.13
λ_1         0.300               0.220
λ_2         0.040               0.016
MBIC        569.64              547.25