1. Introduction
Finite mixture models are in high demand in machine-learning analysis because of their useful properties, computational tractability, and ability to approximate continuous densities well [1]. They are also an important statistical tool for many applications in clustering, discriminant analysis, image processing and satellite imaging [2]. Beyond the well-known results for the finite mixture of normal distributions (FM-NOR) model in the literature [1], recent developments cover symmetric/asymmetric and light/heavy-tailed distributions. One of these is the novel class of finite mixtures of multivariate skew-normal (FM-SN) models [3,4], which offers some advantages over normal mixtures: although normal components can model any distribution arbitrarily closely as the number of components increases, in the context of supervised learning, groups of observations with asymmetrically distributed data can lead to wrong classification. The components of skew-normal mixture models, however, capture skewness due to their flexibility [1]. In addition, a robust extension of the FM-SN model to the finite mixture of skew-t (FM-ST) model has been developed in the influential works of [3,5,6,7]. The FM-ST components capture both skewness and extreme observations due to their flexibility [8].
The scale mixtures of skew-normal (SMSN) family is a rich and highly flexible class of distributions that covers light- and heavy-tailed distributions, e.g., the skew-normal (SN), skew-t (ST), skew-slash (SSL) and skew contaminated-normal (SCN) distributions, and it has been widely considered in many statistical models, especially FM models (see e.g., [5,9,10,11,12,13,14,15]). The SMSN family is a skewed extension of the well-known symmetric scale mixtures of normal (SMN) family, which contains the light/heavy-tailed members: the normal (N), t (T), slash (SL) and contaminated-normal (CN) distributions [16]. Lange et al. [17], Lange and Sinsheimer [18], and Maleki and Nematollahi [19] used the SMN family in applications of robust statistical modeling. Two-piece distributions, built from symmetric distributions with different scales, are an alternative approach to modeling atypical data (see e.g., [10,20,21,22,23,24]). In our approach, we use the two-piece distributions based on the SMN family. This family, called the two-piece scale mixtures of normal (TP-SMN) family, is an analog of the SMSN family and contains the light/heavy-tailed members: the two-piece normal (TP-N), two-piece t (TP-T), two-piece slash (TP-SL) and two-piece contaminated-normal (TP-CN) distributions.
In this paper, we consider the TP-SMN family of distributions as a two-component mixture of truncated SMN distributions on a special two-part partition of the real domain ($\mathbb{R}$), and then propose the finite mixture of this family, called FM-TP-SMN models. It represents an alternative to the well-known scale mixtures of skew-normal (SMSN) family studied by [25]. We also use a hierarchical representation of the FM-TP-SMN model and implement an expectation-maximization (EM)-type algorithm for finding the maximum likelihood (ML) estimates of the proposed model. Studies by [21,23] show that truncating the distribution over two partitions makes it possible to obtain a better fit to the empirical distribution, because the underlying process of the complete likelihood is modeled. In this way, the “two-piece” modeling is a direct competitor of the FM-SMSN family of distributions [21].
The rest of this paper is organized as follows. In Section 2, we review some main properties of the TP-SMN family and represent this family as a two-component mixture of truncated SMN distributions. In Section 3, the FM-TP-SMN model is introduced and the ML estimates of its parameters are obtained via an EM-type algorithm. In Section 4, numerical studies and applications of the proposed models and estimates are considered. Some conclusions and ideas for future research are offered in Section 5.
2. The Two-Piece Scale Mixtures of Normal Distributions
In this section, we analyze some necessary properties of the TP-SMN family of distributions for our proposed FM model.
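The well-known SMN family introduced by [16] (the basis of the robust asymmetric TP-SMN family) has the following probability density function (PDF) and stochastic representation. Let $X \sim SMN(\mu, \sigma, \boldsymbol{\nu})$; then its PDF is
$$f_{SMN}(x \mid \mu, \sigma, \boldsymbol{\nu}) = \int_{0}^{\infty} \phi(x \mid \mu, u^{-1}\sigma^{2}) \, dH(u \mid \boldsymbol{\nu}), \quad x \in \mathbb{R}, \qquad (1)$$
and its stochastic representation is
$$X = \mu + \sigma U^{-1/2} W, \qquad (2)$$
where $\phi(\cdot \mid \mu, \sigma^{2})$ represents the density of the $N(\mu, \sigma^{2})$ distribution, $H(\cdot \mid \boldsymbol{\nu})$ is the cumulative distribution function (CDF) of the scale mixing random variable $U$, which can be indexed by a scalar or vector of parameters $\boldsymbol{\nu}$, and $W$ is a standard normal random variable that is independent of $U$.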
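A stochastic representation of this kind translates directly into a sampler: draw the mixing variable U, draw an independent standard normal W, and combine them. The following is a minimal Python sketch using only the standard library. It assumes the usual location-scale form X = mu + sigma * U^(-1/2) * W, and the mixing laws it uses (U = 1 for the normal, a gamma law with shape and rate nu/2 for the t, Beta(nu, 1) for the slash, and a two-point law for the contaminated normal) are the ones common in the SMN literature rather than something stated in this section. The function names are illustrative.

```python
import math
import random


def draw_mixing_variable(family, nu, rng):
    """Draw the scale mixing variable U for an assumed SMN member."""
    if family == "normal":
        return 1.0                                   # U = 1 with probability one
    if family == "t":
        # Gamma with shape nu/2 and rate nu/2; gammavariate expects a scale.
        return rng.gammavariate(nu / 2.0, 2.0 / nu)
    if family == "slash":
        return rng.betavariate(nu, 1.0)              # Beta(nu, 1)
    if family == "cn":
        prop, factor = nu                            # nu = (proportion, scale factor)
        return factor if rng.random() < prop else 1.0
    raise ValueError(f"unknown family: {family}")


def sample_smn(mu, sigma, family, nu, rng):
    """One draw of X = mu + sigma * U**(-1/2) * W, with W ~ N(0, 1) independent of U."""
    u = draw_mixing_variable(family, nu, rng)
    w = rng.gauss(0.0, 1.0)
    return mu + sigma * w / math.sqrt(u)


rng = random.Random(2019)
draws = [sample_smn(1.0, 2.0, "t", 5.0, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
```

Small values of U inflate the scale of a draw, which is how the heavy-tailed members generate extreme observations.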
The TP-SMN is a rich family of distributions that covers the asymmetric light-tailed TP-N (also called the epsilon-skew-normal [26]), the asymmetric heavy-tailed TP-T, TP-SL and TP-CN distributions, and their corresponding symmetric members. Note that the symmetric members of the TP-SMN and SMSN classes form the SMN family. In terms of density, for
this family can be represented as
where
is the slant parameter,
is given by (
1) and is denoted by
with
and
, for which
,
,
, and
U is the scale mixing random variable in (
2).
Different members of the TP-SMN family in (3) are obtained from different distributions for the scale mixing random variable U in (2), as follows:
Two-piece normal (TP-N): $U = 1$ with probability one,
Two-piece t (TP-T): , i.e., ,
Two-piece slash (TP-SL): , i.e., ,
Two-piece contaminated normal (TP-CN): , i.e., .
For more details and statistical properties of the TP-SMN family, see [
20,
23].
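For orientation, one parameterization that matches the description above (a slant parameter in $(0,1)$, with the epsilon-skew-normal as the TP-N case) is sketched below. The symbols $g$, $\mu$, $\sigma$, $\gamma$ and $\boldsymbol{\nu}$, as well as the exact form, are assumptions based on the usual two-piece construction and should be checked against [20,23].

```latex
g(y \mid \mu, \sigma, \gamma, \boldsymbol{\nu}) =
\begin{cases}
2(1-\gamma)\, f_{SMN}\bigl(y \mid \mu, \sigma(1-\gamma), \boldsymbol{\nu}\bigr), & y \le \mu, \\[4pt]
2\gamma\, f_{SMN}\bigl(y \mid \mu, \sigma\gamma, \boldsymbol{\nu}\bigr), & y > \mu,
\end{cases}
\qquad 0 < \gamma < 1 .
```

Under this form each half of a symmetric SMN density integrates to 1/2, so the left piece carries mass $1-\gamma$ and the right piece carries mass $\gamma$. The two pieces also meet continuously at $y = \mu$, because the weight on each side cancels the corresponding scale factor.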
Further, two-piece distributions can be represented as two-component mixtures with separated supports, i.e., mixtures of the left and right halves of the basic distribution [20] (Equation (4)). In particular, when
with PDF given in (3), it is a two-component mixture of left and right half SMN distributions with special component probabilities, as follows:
Note that, in Equation (4), the scale parameter
and slant parameter
in Equation (
3) are recovered in the form of
and
.
By using auxiliary (latent) variables
,
; in terms of the components of the mixture in Equation (
4), the TP-SMN random variable can have the following stochastic representation
where
and
denotes the truncated SMN distribution on the interval
A, and
has a multinomial distribution with the following probability mass function (PMF):
and is denoted by
. Note that each component-label is a Bernoulli random variable
;
, such that
.
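The latent-label representation suggests a simple two-step sampler: first pick the side of the location parameter, then draw from the corresponding half distribution. The sketch below does this for the two-piece normal case only. It assumes component probabilities sigma1/(sigma1 + sigma2) and sigma2/(sigma1 + sigma2) for the left and right halves, and it uses the fact that a normal distribution truncated at its own mean is a half-normal. The function and argument names are illustrative.

```python
import random


def sample_tp_normal(mu, sigma1, sigma2, rng):
    """Draw from a two-piece normal via a latent side label (assumed weights)."""
    p_left = sigma1 / (sigma1 + sigma2)        # assumed probability of the left half
    half = abs(rng.gauss(0.0, 1.0))            # standard half-normal draw
    if rng.random() < p_left:
        return mu - sigma1 * half              # N(mu, sigma1^2) truncated to (-inf, mu]
    return mu + sigma2 * half                  # N(mu, sigma2^2) truncated to (mu, inf)


rng = random.Random(7)
draws = [sample_tp_normal(0.0, 1.0, 3.0, rng) for _ in range(20000)]
share_left = sum(d <= 0.0 for d in draws) / len(draws)
```

With sigma1 = 1 and sigma2 = 3, about a quarter of the draws should fall at or below the location parameter, which gives a quick sanity check on the assumed weights.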
3. Finite Mixtures TP-SMN
In this section, we introduce the finite mixture of TP-SMN (FM-TP-SMN) model and obtain the ML estimates of this model’s parameters.
3.1. FM-TP-SMN Model
Here, we consider a g-component mixture of TP-SMN distributions, characterized by the following density:
where
, for which
ß with
,
,
, and, for
,
is an
-component density as defined in (
1). Also, we write
to say that a random variable
Y has an FM-TP-SMN distribution as defined by (
7).
Concerning the parameter
of the mixing distribution
, for
, it is worth noting that it can be a vector of parameters, e.g., for the contaminated normal distribution. Thus, for computational convenience, we assume that $\boldsymbol{\nu}_1 = \cdots = \boldsymbol{\nu}_g = \boldsymbol{\nu}$ (see also [5]).
In terms of the components of the mixtures, Equation (
7) can be equivalently obtained by
where
is a multinomial (component-label) vector with probability mass function
,
;
,
.
Since only one component of
can be equal to one (remaining ones are zero), events
and
are equivalent, thus indicating that the distribution of
Y corresponds to the
i-th component of the mixture; for further details, see e.g., [
1].
Remark 1. Let , then the mean and variance of Y are, respectively, given by and where , , and (see e.g., [2]). The FM-TP-SMN densities in (7) are an extremely flexible class that includes the finite mixtures of SMN densities as a special case, when
,
.
For each i.i.d. sample in the form of
, by considering the PDF (
7), the log-likelihood function is
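As a concrete illustration of this log-likelihood, the sketch below evaluates it for a mixture of two-piece normal components, the simplest member of the family. The two-piece density used here, with left scale sigma * (1 - gamma) and right scale sigma * gamma, is an assumed parameterization, and the tuple layout (weight, mu, sigma, gamma) is purely illustrative.

```python
import math


def tp_normal_pdf(y, mu, sigma, gamma):
    """Two-piece normal density under the assumed slant parameterization."""
    scale = sigma * (1.0 - gamma) if y <= mu else sigma * gamma
    z = (y - mu) / scale
    # Both pieces share the normalizing constant 2 / (sigma * sqrt(2 * pi)).
    return 2.0 / (sigma * math.sqrt(2.0 * math.pi)) * math.exp(-0.5 * z * z)


def mixture_loglik(data, components):
    """Sum over observations of log(sum_j weight_j * density_j(y))."""
    total = 0.0
    for y in data:
        density = sum(w * tp_normal_pdf(y, mu, s, g) for w, mu, s, g in components)
        total += math.log(density)
    return total


# Hypothetical two-component setting: (weight, mu, sigma, gamma) per component.
components = [(0.4, -2.0, 1.0, 0.3), (0.6, 3.0, 2.0, 0.7)]
loglik = mixture_loglik([-2.5, 0.1, 2.8, 4.0], components)
```

For observations far in the tails the mixture density can underflow to zero, so a production version would typically work with log-densities and a log-sum-exp step instead.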
3.2. ML Estimates of Model Parameters
We can use latent indicator (allocation) variables
,
, to assign observations belonging to different components of the mixture (
), so in terms of
, we can conclude that
and so using Equations (
2) and (
5) with
,
, we have that
for
;
;
,
and
denotes the truncated normal distribution on the interval
A.
The above hierarchical representation of the FM-TP-SMN model is used to obtain the ML estimates via an ECME (expectation/conditional maximization either) algorithm. This algorithm is a generalization of the ECM-algorithm introduced by [27], which is itself an extension of the EM-algorithm [28]. It is obtained by replacing some CM-steps, which maximize the constrained expected complete-data log-likelihood function, with steps that maximize the corresponding constrained actual likelihood function. As indicated by [27,29], computing the joint ML estimates with ECME-algorithms is much more efficient than with other EM-type algorithms.
Let
denote the complete data, where
is the observed sample
and
are the latent or unobserved variables from the FM-TP-SMN model with vector of parameters
. Considering the hierarchical representation (
10), the complete-data (augmented) likelihood function is given by
where
. After ignoring constants and using the auxiliary (latent) variables, the complete-data log-likelihood function takes the form:
Quantities
,
and
, must be defined, and using known properties of conditional expectation and PDF in (
4), we obtain
, where
,
,
, and
where
is the TP-SMN PDF defined in Equation (
4), and
, and the conditional expectation
for the TP-SMN distribution members are given by:
Two-piece normal (TP-N): ,
Two-piece t (TP-T): ,
Two-piece slash (TP-SL): ,
Two-piece contaminated normal (TP-CN): ,
where and denotes the distribution function of the distribution evaluated at x.
Now, the expectation step (E-step) at the th iteration of the ECME-algorithm requires the calculation of . So,
For the conditionally maximizing steps (CM-steps) at the -th iteration of the ECME-algorithm we have:
CM-steps.
Update
,
, as:
Update
,
, as:
where
.
Update
;
;
, by solving the following stressed cubic equations
where
, for which
. Note that
, so this cubic equation has exactly one root in the
interval.
CML-step of the ECME-algorithm.
where
is the log-likelihood function given in (
9) and
denotes the
-th update of
except
.
The ECME-algorithm iterates until a suitable convergence rule is satisfied, e.g., if , under the predetermined tolerance .
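The overall shape of such an iteration (an E-step computing posterior component probabilities, an update of the mixing proportions, and a stopping rule) can be sketched in a deliberately reduced setting. The code below is not the ECME-algorithm of this section: it holds the component densities fixed, uses plain normal components, and updates only the mixing proportions by averaging the posterior probabilities. The stopping rule based on the relative change of the log-likelihood is an assumption, and all names are illustrative.

```python
import math


def normal_pdf(y, mu, sd):
    return math.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def loglik(data, weights, dens):
    return sum(math.log(sum(w * f(y) for w, f in zip(weights, dens))) for y in data)


def em_mixing_weights(data, dens, weights, tol=1e-8, max_iter=1000):
    """EM updates of the mixing proportions only; component densities stay fixed."""
    old = loglik(data, weights, dens)
    new = old
    for _ in range(max_iter):
        # E-step: posterior probability that observation i came from component j.
        post = []
        for y in data:
            joint = [w * f(y) for w, f in zip(weights, dens)]
            total = sum(joint)
            post.append([v / total for v in joint])
        # Weight update: average posterior probability per component.
        weights = [sum(row[j] for row in post) / len(data) for j in range(len(dens))]
        new = loglik(data, weights, dens)
        if abs(new / old - 1.0) < tol:           # assumed relative-change stopping rule
            break
        old = new
    return weights, new


dens = [lambda y: normal_pdf(y, 0.0, 1.0), lambda y: normal_pdf(y, 5.0, 1.0)]
data = [-0.3, 0.2, 0.9, -1.1, 4.8, 5.4, 0.4, 5.1, -0.6, 0.1]
weights, final_loglik = em_mixing_weights(data, dens, [0.5, 0.5])
```

Seven of the ten toy observations lie near the first component, so the estimated proportions settle close to 0.7 and 0.3. In the full algorithm the location, scale and mixing-distribution parameters would be updated in further conditional steps between the E-step and the convergence check.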
4. Numerical Studies
In this section, we assess the performance of the proposed FM model using simulated and real datasets. The algorithms were implemented in R [30], version 3.5.1, on a Core i7 760 processor at 2.8 GHz, and a relative tolerance of
was used for convergence of the ECME-algorithms. A sample copy of the R code is available upon request from the authors and will be made available in an R package dedicated to the proposed model.
4.1. Simulations
This section presents three simulation studies. The first shows the robustness of the FM-TP-SMN models in classifying heterogeneous data; the second examines the behavior of the proposed FM-TP-SMN models under misspecification; and the third examines the asymptotic properties of the proposed model estimates.
4.1.1. Clustering
The FM models are useful for clustering observations by allocating them to groups that are similar in some sense. In fact, by considering the estimated (posterior) probabilities, we can assign the observations to given groups. However, atypical data can have an undesirable effect on clustering (see e.g., [1,2,8]). Our models account for skewness, and we use clustering based on them to show their robustness when clustering atypical data within components. We generated 1000 samples from the FM-TP-SMN model with two components and, for each sample, also considered k-means clustering, ignoring the true classification when forming the clusters.
We simulated 1000 samples with sample sizes
, from the FM-TP-SMN models with parameters
,
,
,
,
,
,
and
(FM-Normal model), for which
for TP-T and TP-SL, and
for TP-CN. According to the FM-TP-SMN estimated (posterior) probabilities given in (
12) and the threshold value 0.5, we allocated each observation to a specific component. For each sample
, the mean rate of correct allocations is given in
Table 1, which shows that clustering based on the FM-TP-T, FM-TP-SL and FM-TP-CN models is more reasonable than clustering based on the ordinary FM-Normal model in the presence of atypical data. Also note that when FM-TP-T is the true model, the FM-TP-CN model also outperforms the other models.
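The allocation rule described above, which assigns each observation to a component by thresholding its posterior probability at 0.5, and the resulting rate of correct allocations can be sketched as follows. The two-component case is assumed, the function names are illustrative, and the posterior probabilities and true labels are made-up inputs rather than simulation output.

```python
def allocate(posterior_first, threshold=0.5):
    """Assign component 1 when its posterior probability exceeds the threshold, else 2."""
    return [1 if p > threshold else 2 for p in posterior_first]


def correct_allocation_rate(assigned, truth):
    """Share of observations whose assigned component matches the true one."""
    hits = sum(a == t for a, t in zip(assigned, truth))
    return hits / len(truth)


posterior_first = [0.91, 0.62, 0.48, 0.05, 0.77, 0.30]   # hypothetical posteriors
true_labels = [1, 1, 1, 2, 1, 2]
labels = allocate(posterior_first)
rate = correct_allocation_rate(labels, true_labels)
```

Averaging this rate over all simulated samples gives the kind of summary reported per model in a table of correct allocations.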
4.1.2. Misspecification
For this section, we simulated 2000 samples with sample sizes
from the FM-SN (asymmetric and light-tailed components) and FM-ST (asymmetric and heavy-tailed components) models separately, with the parameters of the previous simulation design and with
. Then, we fitted various proposed FM-TP-SMN models to these data. In
Table 2, the FM-TP-SMN models are first compared with the ordinary FM-NOR model (symmetric and light-tailed components) and then with each other (asymmetric components). The results in the first of four rows of
Table 2 demonstrate that the preferred models mostly belong to the class of FM-TP-SMN models rather than being the FM-NOR model. Also, the model most often preferred for FM-SN data is FM-TP-N, and in this case the other preferred models are those similar to it (for example, FM-TP-T with large values of the degrees of freedom
), i.e., the preferred models for FM-SN data with asymmetric and light-tailed components are the FM-TP-SMN models with light-tailed components. For FM-ST data with asymmetric but heavy-tailed components, the FM-TP-SMN models with heavy-tailed components were likewise preferred. Here and in the real-data applications, the model selection criteria used to choose the best model are the logarithm of the maximized likelihood function (log-like), $\ell(\hat{\boldsymbol{\Theta}} \mid \mathbf{y})$, the Akaike information criterion (AIC) [31] and the Bayesian information criterion (BIC) [32], in the form of
$$\mathrm{AIC} = 2k - 2\ell(\hat{\boldsymbol{\Theta}} \mid \mathbf{y}), \qquad \mathrm{BIC} = k \log n - 2\ell(\hat{\boldsymbol{\Theta}} \mid \mathbf{y}),$$
respectively, where k is the number of model parameters and n is the sample size.
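Both criteria are simple functions of the maximized log-likelihood, the number of parameters k and, for BIC, the sample size. The sketch below uses the standard definitions AIC = 2k - 2 log-like and BIC = k log(n) - 2 log-like, with smaller values preferred. The model names, log-likelihood values and sample size are hypothetical and are not taken from any table in this paper.

```python
import math


def aic(loglik, k):
    """Akaike information criterion; smaller is better."""
    return 2.0 * k - 2.0 * loglik


def bic(loglik, k, n):
    """Bayesian information criterion; smaller is better."""
    return k * math.log(n) - 2.0 * loglik


# Hypothetical fitted models: name -> (maximized log-likelihood, parameter count).
fits = {"model_a": (-512.4, 5), "model_b": (-509.9, 8)}
n = 200
best_by_aic = min(fits, key=lambda name: aic(*fits[name]))
best_by_bic = min(fits, key=lambda name: bic(*fits[name], n))
```

In this toy comparison the richer model has the higher log-likelihood, but its three extra parameters are penalized enough that both criteria prefer the smaller model.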
4.1.3. Asymptotical results
For this section, we simulated 400 samples for each of the sample sizes
, 600, 1000, 2000, 4000, from FM-TP-T models with two components that are weakly separated (WS), moderately separated (MS) and strongly separated (SS), i.e., with large, medium and little overlap of components, respectively (see
Figure 1), for which
,
,
,
,
,
,
,
and
for WS data;
for MS data; and
for SS data.
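Using the proposed ECME algorithm to find the ML estimates, we evaluate the Monte Carlo average bias (MC-bias) and the mean squared error (MSE) of the ML estimates over the 400 samples. These quantities, reported in Table 3, Table 4 and Table 5, are given respectively by
$$\text{MC-bias}(\hat{\theta}) = \frac{1}{400} \sum_{i=1}^{400} \bigl(\hat{\theta}_i - \theta\bigr), \qquad \text{MSE}(\hat{\theta}) = \frac{1}{400} \sum_{i=1}^{400} \bigl(\hat{\theta}_i - \theta\bigr)^{2},$$
where $\hat{\theta}_i$ is the ML estimate of the parameter $\theta$ in the i-th sample, $i = 1, \ldots, 400$.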
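These two Monte Carlo summaries are averages over replications, as the short sketch below shows. It assumes the usual definitions (mean of the estimation errors for the bias, mean of the squared errors for the MSE), and the four estimates are made-up numbers standing in for the 400 replications.

```python
def mc_bias(estimates, true_value):
    """Average estimation error across Monte Carlo replications."""
    return sum(e - true_value for e in estimates) / len(estimates)


def mc_mse(estimates, true_value):
    """Average squared estimation error across Monte Carlo replications."""
    return sum((e - true_value) ** 2 for e in estimates) / len(estimates)


estimates = [1.9, 2.2, 2.1, 1.8]     # hypothetical ML estimates of a parameter equal to 2
bias = mc_bias(estimates, 2.0)
mse = mc_mse(estimates, 2.0)
```

An estimator can have a bias near zero while its MSE stays clearly positive, which is why both quantities are tracked as the sample size grows.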
The results in Table 3, Table 4 and Table 5 are obtained from the different fitted FM-TP-SMN models and show the performance of the proposed models and of their parameter estimates. As expected, the MC-bias and MSE of the ML estimates tend toward zero as the sample size increases.
4.2. Applications
In this section, we apply the FM-TP-SMN models to real data sets to show the performance of the proposed models and estimates in applications.
BMI Data
We considered the body mass index (BMI) data set collected for men aged between 18 and 80 years. The BMI data set was gathered in the National Health and Nutrition Examination Survey by the US National Center for Health Statistics (NCHS) of the Center for Disease Control (CDC). The strong relationship between obesity and many chronic diseases has attracted attention in recent years; that is, most people with obesity will have chronic diseases. BMI, the ratio of body weight in kilograms to the square of height in meters, is a measure used to classify overweight and obesity. A person with BMI > 25 is considered overweight, while a person with BMI > 30 is considered obese.
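The BMI formula and the two cut-offs quoted above can be written out in a few lines. The function names and the label for values of 25 or less are illustrative choices, and the example weight and height are made up.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by squared height in meters."""
    return weight_kg / height_m ** 2


def bmi_category(value):
    """Classify a BMI value with the thresholds 25 and 30 quoted in the text."""
    if value > 30.0:
        return "obese"
    if value > 25.0:
        return "overweight"
    return "not overweight"


example = bmi(85.0, 1.75)
category = bmi_category(example)
```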
This dataset has 4579 participants with BMI records; for modeling with finite mixture models, however, only participants with weights within 39.50–70.00 kg (1069 participants) and 95.01–196.80 kg (1054 participants) were considered, forming the first and second subgroups, respectively. Lin et al. [7] first analyzed this dataset using the 1999–2000 and 2001–2002 reports, fitting the FM-normal, FM-T, FM-SN and FM-ST models, always with two components; subsequently, [5,13] fitted the FM-SMSN models to this dataset. The results obtained by [13] are more general and include those of [5,7]. We therefore fitted the proposed FM-TP-SMN models to this dataset and compared our results with those in [13].
Table 6 contains the ML estimates of the FM-TP-SMN models with two components; the log-likelihood, AIC and BIC values of the proposed FM-TP-SMN models and of the FM-SMSN models, the latter taken from Table 1 of [13], appear in Table 7.
As noted by Lin et al. [4] and Prates et al. [13], the criteria values in Table 7 indicate that the heavy-tailed FM-SMSN models (FM-ST, FM-SCN and FM-SSL) fit better than the ordinary FM-NOR and FM-SN models, and that the FM-SSL and FM-ST are the best fitted models in that class. The same pattern holds among the FM-TP-SMN models (which have corresponding FM-SMSN counterparts), and the FM-TP-SMN models were more reasonable than the FM-SMSN models. Overall, the FM-TP-SL and FM-TP-T were the best models. In Figure 2, we plot the fitted FM-TP-T and FM-ST density curves over the histogram of the BMI data.
4.3. UScrime Data
As a further application of the FM-TP-SMN models and the proposed methodology, we consider the effect of punishment regimes on crime rates [33,34], which is of high interest to criminologists. This has been studied using aggregate data on 47 US states for 1960. We consider the 13th column of this data frame, which measures income inequality. The data are available as the UScrime data frame in the MASS R package.
Table 8 contains the ML estimates of the FM-TP-SMN models with two components, and the log-likelihood, AIC and BIC values of the proposed FM-TP-SMN and FM-SMSN models.
The log-likelihood values in Table 8 indicate that the FM-TP-N and FM-TP-T are the best models within the FM-TP-SMN class, while the FM-SCN is the best model in the FM-SMSN class. The AIC and BIC criteria chose the FM-TP-N model (asymmetric components) within the FM-TP-SMN class, while they chose the FM-NOR model (symmetric components) within the FM-SMSN class. Among all competitors, the criteria chose the FM-TP-N model, which belongs to the proposed FM-TP-SMN class and is more reasonable. Figure 3 shows the histogram of the US crime data with the fitted curves of the FM-TP-SMN and FM-SMSN models. These graphical visualizations show the suitability of asymmetric components and of the proposed FM-TP-SMN models.