1. Introduction
In many practical problems, real data are unimodal but asymmetric, so their distribution is skewed; examples include dental plaque index data [1], freeway speed data [2] and polarizer manufacturing process data [
3]. For this reason, Azzalini [4, 5] introduced the skew-normal distribution and gave its density function. A random variable $X$ follows a skew-normal distribution with location parameter $\xi$, scale parameter $\eta^2$ and skewness parameter $\lambda$, denoted by $X \sim SN(\xi, \eta^2, \lambda)$, if its density function is:
$$f(x; \xi, \eta^2, \lambda) = 2\,\phi(x; \xi, \eta^2)\,\Phi\!\left(\lambda\,\frac{x - \xi}{\eta}\right), \qquad (1)$$
where $\phi(x; \xi, \eta^2)$ is the normal probability density function with mean $\xi$ and variance $\eta^2$, and $\Phi(\cdot)$ is the standard normal cumulative distribution function. When $\lambda = 0$, Equation (1) reduces to the normal distribution with mean $\xi$ and variance $\eta^2$.
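As a minimal illustration of the density and of how skew-normal data can be simulated, the following Python sketch uses only the standard library. The function names `sn_pdf` and `sn_sample` and the argument names `xi`, `eta` and `lam` are hypothetical labels chosen here. The sampler relies on the well-known stochastic representation $Z = \delta |U_0| + \sqrt{1-\delta^2}\,U_1$ with independent standard normal $U_0, U_1$ and $\delta = \lambda/\sqrt{1+\lambda^2}$, which is an assumption of this sketch rather than a construction stated in the text.

```python
import math
import random


def sn_pdf(x, xi=0.0, eta=1.0, lam=0.0):
    """Skew-normal density: 2 * phi(x; xi, eta^2) * Phi(lam * (x - xi) / eta)."""
    z = (x - xi) / eta
    normal_pdf = math.exp(-0.5 * z * z) / (eta * math.sqrt(2.0 * math.pi))
    normal_cdf = 0.5 * (1.0 + math.erf(lam * z / math.sqrt(2.0)))
    return 2.0 * normal_pdf * normal_cdf


def sn_sample(n, xi=0.0, eta=1.0, lam=0.0, rng=None):
    """Draw n values using Z = delta*|U0| + sqrt(1 - delta^2)*U1, X = xi + eta*Z."""
    rng = rng or random.Random()
    delta = lam / math.sqrt(1.0 + lam * lam)
    root = math.sqrt(1.0 - delta * delta)
    return [
        xi + eta * (delta * abs(rng.gauss(0.0, 1.0)) + root * rng.gauss(0.0, 1.0))
        for _ in range(n)
    ]


# Example: five draws with location 1, scale 2 and skewness parameter 3.
draws = sn_sample(5, xi=1.0, eta=2.0, lam=3.0, rng=random.Random(0))
```

With `lam=0.0` the density returned by `sn_pdf` coincides with the normal density, which mirrors the degenerate case mentioned above.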
In view of the wide applications of the skew-normal distribution, many scholars further explored its statistical properties. Some recent studies include: characterizations of distribution [
6,
7], characteristic functions [
8], sampling distributions [
9], distribution of quadratic forms [
10,
11,
12], measures of skewness and divergence [
13,
14], asymptotic expansions for moments of the extremes [
15], rates of convergence of the extremes [
16], exact density of the sum of independent random variables [
17], identifiability of finite mixtures of the skew-normal distributions [
18], etc. On this basis, we can use the skew-normal distribution as the fitted distribution of real data and establish a statistical model to solve the practical problem. Some recent applications include: modelling of air pollution data [
19], modelling of psychiatric measures [
20], modelling of bounded health scores [
21], modelling of insurance claims [
22], asset pricing [
23], individual loss reserving [
24], robust portfolio estimation [
25], growth estimates of cardinalfish [
26], age-specific fertility rates [
27], reliability studies [
28], statistical process control [
29], analysis of student satisfaction towards university courses [
30], detecting differential expression to microRNA data [
31], etc.
Because the skew-normal distribution has a complex structure, traditional parameter estimation methods are difficult to apply directly. To this end, Pewsey [
32] studied the weaknesses of the direct parameterization in parameter estimation and proposed the centered parameterization method. Pewsey [
33,
34] applied this method to the wrapped skew-normal population and gave the methods of moment estimation and maximum likelihood (ML) estimation. Arellano-Valle and Azzalini [
35] extended the centered parameterization method to the multivariate skew-normal distribution and studied its information matrix. Furthermore, because the location parameter is widely used in econometrics, medicinal chemistry and life testing, inference on the location parameter of the skew-normal distribution has attracted considerable attention. For example, Wang et al. [36] discussed interval estimation of the location parameter when the coefficient of variation and the skewness parameter are known. Thiuthad and Pal [37] considered hypothesis testing for the location parameter and constructed three test statistics when the scale parameter and skewness parameter are known. Ma et al. [38] studied interval estimation and hypothesis testing for the location parameter with known scale and skewness parameters. Based on approximate likelihood equations, Gui and Guo [39] derived explicit estimators of the scale and location parameters. In practical applications, however, having to draw inferences on the location parameter with unknown scale and skewness parameters is the rule rather than the exception. Therefore, this paper studies statistical inference on the location parameter for one and two skew-normal populations when the scale and skewness parameters are unknown.
This paper is organized as follows. In Section 2, for a single skew-normal population, the centered parameterization and Bootstrap approaches are used for hypothesis testing and interval estimation of the location parameter with unknown scale and skewness parameters. In Section 3, for two skew-normal populations, the Behrens-Fisher type problem and interval estimation for the location parameters are discussed when the scale and skewness parameters are unknown. In Section 4, Monte Carlo simulation results for the above approaches are presented and compared. In Section 5, the proposed approaches are applied to real data on leaf area index (LAI), carbon fibers’ strength and red blood cell (RBC) count in athletes. Section 6 summarizes the paper.
2. Inference on the Location Parameter of Single Skew-Normal Population
In this section, we first consider estimation of the unknown parameters of a single skew-normal population. Suppose that
are random samples from the skew-normal distribution
and all the samples are mutually independent. Let
denote the sample mean, the second and third central moments of the sample, respectively. Namely:
Theorem 1. Let . If , then the moment estimators of are:
where .
Proof. Let
be the observed values of
.
are the standardized samples where
from
,
. Note that:
The moment generating function (MGF) of
Y is
By Equation (
5), we have:
According to Equations (
4) and (
6), the moment estimates of
can be expressed as:
where
. By Equation (
3), the moment estimators of
are given, then the proof of Theorem 1 is completed. □
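The expressions of Theorem 1 are not reproduced here, so the following Python sketch should be read as an illustration under stated assumptions rather than as the theorem itself. It inverts the standard skew-normal moment relations (mean $\xi + \eta b\delta$, variance $\eta^2(1-b^2\delta^2)$, skewness coefficient $\frac{4-\pi}{2}(b\delta)^3/(1-b^2\delta^2)^{3/2}$ with $b=\sqrt{2/\pi}$ and $\delta=\lambda/\sqrt{1+\lambda^2}$). The function names, the divisor $n$ in the central moments and the clipping of the sample skewness at ±0.99 (the theoretical bound is about 0.9953) are assumptions of this sketch.

```python
import math

B = math.sqrt(2.0 / math.pi)
SKEW_CLIP = 0.99  # assumption: keep the sample skewness inside the admissible range


def sample_moments(x):
    """Sample mean and second/third central moments (divisor n is an assumption)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    return mean, m2, m3


def direct_from_moments(mean, m2, m3):
    """Solve the three moment equations for (xi, eta, lam)."""
    g = max(-SKEW_CLIP, min(SKEW_CLIP, m3 / m2 ** 1.5))
    # r = b*delta / sqrt(1 - b^2 delta^2) follows from the skewness equation.
    r = math.copysign((2.0 * abs(g) / (4.0 - math.pi)) ** (1.0 / 3.0), g)
    delta = r / (B * math.sqrt(1.0 + r * r))
    eta = math.sqrt(m2 / (1.0 - (B * delta) ** 2))
    xi = mean - eta * B * delta
    lam = delta / math.sqrt(1.0 - delta * delta)
    return xi, eta, lam


# Usage: estimates from a small illustrative data vector.
xi_hat, eta_hat, lam_hat = direct_from_moments(*sample_moments([0.2, 0.5, 0.9, 1.4, 2.8]))
```

Feeding the exact population moments of a skew-normal distribution into `direct_from_moments` returns its direct parameters, which is a convenient sanity check for the inversion.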
Further, the ML estimators of
are considered. Pewsey [
32] showed that numerically maximizing the log-likelihood with respect to the direct parameters
may be highly misleading, because no unique solution exists in this case. Therefore, we derive the ML estimators of the unknown parameters using the centered parameterization of References [
4,
32,
34,
35,
40]. Firstly, we give the following definition.
Definition 1. Suppose . Let , then:
where denotes the skew-normal distribution with mean , variance and skewness coefficient γ.
The centered parameterization removes the singularity of the expected Fisher information matrix at
. Furthermore, the components of centered parameters are less correlated than those of direct parameters. By Definition 1, the relationship between the direct parameters
and centered ones
is as follows (see [
34]).
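For readers who want to experiment with the mapping, the Python sketch below encodes the standard relations between direct and centered parameters found in the skew-normal literature: mean $\mu = \xi + \eta b\delta$, variance $\sigma^2 = \eta^2(1-b^2\delta^2)$ and skewness coefficient $\gamma = \frac{4-\pi}{2}(b\delta)^3/(1-b^2\delta^2)^{3/2}$, where $b=\sqrt{2/\pi}$ and $\delta=\lambda/\sqrt{1+\lambda^2}$. The function name `centered_from_direct` and the argument names are hypothetical, and the symbols are assumed to match those of the relationship referred to above.

```python
import math

B = math.sqrt(2.0 / math.pi)


def centered_from_direct(xi, eta, lam):
    """Map direct parameters (xi, eta, lam) to centered ones (mu, sigma^2, gamma)."""
    delta = lam / math.sqrt(1.0 + lam * lam)
    shrink = 1.0 - (B * delta) ** 2          # variance reduction factor
    mu = xi + eta * B * delta
    sigma2 = eta * eta * shrink
    gamma = 0.5 * (4.0 - math.pi) * (B * delta) ** 3 / shrink ** 1.5
    return mu, sigma2, gamma


# Usage: the skewness coefficient stays bounded however large lam becomes.
mu, sigma2, gamma = centered_from_direct(0.0, 1.0, 5.0)
```

Because $|\delta| < 1$, the skewness coefficient produced by this mapping is bounded in absolute value by roughly 0.9953, which is why sample skewness values outside that range need special handling.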
are assumed to be random samples from the skew-normal distribution
. The sample mean, the second and third central moments of the sample can be written respectively as:
Theorem 2. Suppose . Let and , then .
Proof. The first three derivatives of MGF
of
X can be obtained as:
By the above three equations, the skewness coefficient
γ of
X has the form of:
From Equations (
7) and (
8), we have:
Because
, we obtain that
and
. Then,
Hence, the proof of Theorem 2 is completed. □
Remark 1. By Theorem 1, if , then . Furthermore, we have by (8). For more details, see Pewsey [32].
Next, we consider the ML estimators of the centered parameters . The observed values of are denoted by . Similarly, let
,
, where
are the standardized samples from
with
and
. So the density function of
is:
By Equation (
10), the logarithmic likelihood function (without constant terms) of
can be represented as:
In addition, the ML estimators of
and
satisfy the constraint (see [
4]):
By Theorem 2,
in Equation (
12) can also be expressed as:
From Equations (
7) and (
13), the ML estimators
of
have the following relationship:
Substituting Equation (
14) into Equation (
11), we have:
Therefore, we define
as the ML estimates of
with default starting values given by the moment estimates of
from Equation (
15). Namely:
Further, the ML estimates of
and
are obtained as follows:
By Equation (
7), the ML estimates of the direct parameters
are:
Then the ML estimate of is . Hence, we have the following result.
Theorem 3. Suppose that are the ML estimators corresponding to in Equation (16), then the ML estimators of direct parameters are:
where and are the ML estimators corresponding to and in Equation (17), respectively. Furthermore, the ML estimator of δ is .
It is well known that when
, the location parameter of the skew-normal population is a generalization of the mean of the normal population. It is therefore especially important to study statistical inference on the location parameter of a single skew-normal distribution. To this end, a Bootstrap approach for the hypothesis testing problem of the location parameter in the single skew-normal population
is proposed. Specifically, the hypothesis of interest is:
where
is a specified value. Based on the central limit theorem, under
in (
19) we have:
If
and
are known,
T can be the test statistic for hypothesis testing problem (
19). Since
and
are often unknown, the test statistics might be developed by replacing
and
with their moment and ML estimators in Equation (
20), respectively. Therefore, the test statistics have the form of:
As the exact distributions of
and
are often unknown, exact tests cannot be established, so an approximate test is constructed based on the central limit theorem. However, the Monte Carlo simulation results indicate that the Type I error probabilities of the approximate approach exceed the nominal significance level in most cases; that is, the approach is liberal. This result may be attributed to the approximate nature of its distribution. In view of this, we propose Bootstrap test statistics for hypothesis testing problem (
19) in this paper.
Under
in (
19), we define
as the Bootstrap samples from
, where
denote the sample mean, the second and third central moments of the sample and
are their observed values. By Theorem 1, the moment estimators of
have the form of:
Let
be the moment estimates corresponding to
. Let
be the Bootstrap samples from
with the sample mean
. Then the ML estimators
can be obtained by Theorem 3. Similar to
and
, the Bootstrap test statistics can be expressed as:
Then the Bootstrap
p-values for hypothesis testing problem (
19) are defined as:
where
and
are the observed values of
and
, respectively. The null hypothesis
in (
19) is rejected whenever the above
p-values are less than the nominal significance level of
, which means that the difference between
and
is significant.
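The Bootstrap test based on the moment estimators can be sketched in Python as follows. This is an illustration under explicit assumptions, not the authors' implementation: the statistic is taken to be $\sqrt{n}\,(\bar{X} - \xi_0 - \hat\eta b\hat\delta)/(\hat\eta\sqrt{1-b^2\hat\delta^2})$ with $b=\sqrt{2/\pi}$, the p-value is taken to be the two-sided form $2\min\{P(T^* > t), P(T^* < t)\}$, the sample skewness is clipped at ±0.99, and all function names (`draw`, `fit_moments`, `t_stat`, `bootstrap_pvalue`) are hypothetical.

```python
import math
import random

B = math.sqrt(2.0 / math.pi)


def draw(n, xi, eta, lam, rng):
    """Sample from the skew-normal law via delta*|U0| + sqrt(1 - delta^2)*U1."""
    d = lam / math.sqrt(1.0 + lam * lam)
    root = math.sqrt(1.0 - d * d)
    return [xi + eta * (d * abs(rng.gauss(0.0, 1.0)) + root * rng.gauss(0.0, 1.0))
            for _ in range(n)]


def fit_moments(x):
    """Return (sample mean, eta_hat, lam_hat); skewness clipped to +/-0.99 (assumption)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    g = max(-0.99, min(0.99, m3 / m2 ** 1.5))
    r = math.copysign((2.0 * abs(g) / (4.0 - math.pi)) ** (1.0 / 3.0), g)
    d = r / (B * math.sqrt(1.0 + r * r))
    eta = math.sqrt(m2 / (1.0 - (B * d) ** 2))
    return mean, eta, d / math.sqrt(1.0 - d * d)


def t_stat(x, xi0):
    """CLT-type statistic with eta and delta replaced by their moment estimates."""
    mean, eta, lam = fit_moments(x)
    d = lam / math.sqrt(1.0 + lam * lam)
    scale = eta * math.sqrt(1.0 - (B * d) ** 2)
    return math.sqrt(len(x)) * (mean - xi0 - eta * B * d) / scale


def bootstrap_pvalue(x, xi0, n_boot=500, seed=0):
    """Parametric Bootstrap p-value for H0: xi = xi0 (two-sided form assumed)."""
    rng = random.Random(seed)
    t_obs = t_stat(x, xi0)
    _, eta_hat, lam_hat = fit_moments(x)
    # Bootstrap samples are generated under H0, i.e. with location xi0.
    t_star = [t_stat(draw(len(x), xi0, eta_hat, lam_hat, rng), xi0)
              for _ in range(n_boot)]
    upper = sum(t > t_obs for t in t_star) / n_boot
    return 2.0 * min(upper, 1.0 - upper)
```

A call such as `bootstrap_pvalue(data, xi0=2.0)` returns a value in [0, 1]; the null hypothesis is rejected when it falls below the nominal significance level.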
Remark 2. According to [41], similar to and , the Bootstrap pivotal quantities of ξ can be constructed as and based on the moment estimator and ML estimator, respectively. Suppose that is the empirical percentile of . Then the Bootstrap confidence interval for ξ is given by:
Similarly, a Bootstrap confidence interval for ξ based on is also obtained.
3. Inference on the Location Parameters of Two Skew-Normal Populations
Let
be random samples from
and all of them are mutually independent,
. The sample mean, the second and third central moments of the sample can be expressed respectively as:
Firstly, we consider the estimation problems of
and
of two skew-normal populations in this section. By Theorem 1, the moment estimators of
and
can be given by:
where
. By Theorem 3, the ML estimators of
can be written as
,
. Thus, let
be the moment estimates corresponding to
and
be the ML estimates corresponding to
,
.
Next, the problem of interest here is to test:
where
is a specified value. By the central limit theorem, under
in (
27) we have:
If
and
are known,
, then
is a natural statistic for hypothesis testing problem (
27). For
, since
and
are often unknown in practice, the test statistics might be obtained by replacing
and
by their moment estimators and ML estimators, respectively. They are given by:
The exact distributions of
and
are also unknown like
and
. For this, the Bootstrap approach will be used to construct test statistics for hypothesis testing problem (
27).
Under
in (
27), let
denote the Bootstrap samples from
with the sample mean
,
. By Theorem 1, the moment estimators of
are
,
. Likewise, let
denote the Bootstrap samples from
with the sample mean
and the ML estimators of
be
by Theorem 3,
. Based on
and
, the Bootstrap test statistics are defined as:
Then the Bootstrap
p-values for hypothesis testing problem (
27) are:
where
and
are the observed values of
and
, respectively. The null hypothesis
in (
27) is rejected whenever the above
p-values are less than the nominal significance level of
, which means that the difference between
and
is significant.
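A two-sample analogue can be sketched along the same lines. The Python code below is a hedged illustration only: it assumes the statistic $(\hat\xi_1 - \hat\xi_2 - d_0)/\sqrt{m_{2,1}/n_1 + m_{2,2}/n_2}$ built from moment estimates, imposes the null hypothesis by drawing the first Bootstrap sample with location $d_0$ and the second with location 0 (the statistic depends on the locations only through their difference), and uses a two-sided p-value. The names `fit`, `t_stat2`, `bootstrap_pvalue2` and `d0` are hypothetical.

```python
import math
import random

B = math.sqrt(2.0 / math.pi)


def draw(n, xi, eta, lam, rng):
    """Sample from the skew-normal law via delta*|U0| + sqrt(1 - delta^2)*U1."""
    d = lam / math.sqrt(1.0 + lam * lam)
    root = math.sqrt(1.0 - d * d)
    return [xi + eta * (d * abs(rng.gauss(0.0, 1.0)) + root * rng.gauss(0.0, 1.0))
            for _ in range(n)]


def fit(x):
    """Moment estimates (xi, eta, lam) and the estimated variance of the sample mean."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    g = max(-0.99, min(0.99, m3 / m2 ** 1.5))  # clipping is an assumption
    r = math.copysign((2.0 * abs(g) / (4.0 - math.pi)) ** (1.0 / 3.0), g)
    d = r / (B * math.sqrt(1.0 + r * r))
    eta = math.sqrt(m2 / (1.0 - (B * d) ** 2))
    return mean - eta * B * d, eta, d / math.sqrt(1.0 - d * d), m2 / n


def t_stat2(x1, x2, d0):
    """Behrens-Fisher type statistic for H0: xi1 - xi2 = d0."""
    xi1, _, _, v1 = fit(x1)
    xi2, _, _, v2 = fit(x2)
    return (xi1 - xi2 - d0) / math.sqrt(v1 + v2)


def bootstrap_pvalue2(x1, x2, d0=0.0, n_boot=500, seed=0):
    """Parametric Bootstrap p-value; scale and skewness are estimated per group."""
    rng = random.Random(seed)
    t_obs = t_stat2(x1, x2, d0)
    _, e1, l1, _ = fit(x1)
    _, e2, l2, _ = fit(x2)
    t_star = []
    for _ in range(n_boot):
        b1 = draw(len(x1), d0, e1, l1, rng)   # location difference fixed at d0
        b2 = draw(len(x2), 0.0, e2, l2, rng)
        t_star.append(t_stat2(b1, b2, d0))
    upper = sum(t > t_obs for t in t_star) / n_boot
    return 2.0 * min(upper, 1.0 - upper)
```

Estimating the scale and skewness separately in each group is what makes this a Behrens-Fisher type procedure: no equality of the nuisance parameters across populations is assumed.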
Remark 3. Similar to Remark 2, the Bootstrap pivotal quantities of are constructed as and based on the moment estimators and ML estimators, respectively. Let be the empirical percentile of . The Bootstrap confidence interval for is defined as:
Similarly, a Bootstrap confidence interval for based on is also obtained.
4. Simulation Results and Discussion
In this section, Monte Carlo simulation is used to investigate numerically the Type I error rates and powers of the above hypothesis testing approaches. A Type I error is the rejection of a null hypothesis that is actually true; the Type I error rate indicates whether a testing approach is liberal or conservative. For convenience, we only provide the steps of the Bootstrap approach based on the moment estimators for hypothesis testing problem (
19).
Step 1: For a given
, generate a group of random samples
from the skew-normal distribution. Then
are computed by Equation (
2).
Step 2: By Theorem 1, the moment estimates of
are computed and denoted by
. Then the observed value
of
is obtained by Equation (
21).
Step 3: Under
in (
19), generate the Bootstrap samples
),
. Then
are computed.
Step 4: By Theorem 1, the moment estimates of
from Bootstrap samples are computed and denoted by
. Then
is obtained by Equation (
23).
Step 5: Repeat Steps 3–4
times and compute
by Equation (
25). If
, then
; otherwise,
.
Step 6: Repeat Steps 1–5 times and we get . Then the Type I error probability is .
Based on the above steps, the power of hypothesis testing problem (
19) can be obtained similarly.
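Steps 1 to 6 can be sketched as a nested simulation loop. The Python code below is a small-scale illustration under the same assumptions as before (moment estimates with the sample skewness clipped at ±0.99, a two-sided Bootstrap p-value, and hypothetical function names); the default loop sizes are far smaller than the 2500 inner and outer loops used in the simulation study, purely to keep the run time short.

```python
import math
import random

B = math.sqrt(2.0 / math.pi)


def draw(n, xi, eta, lam, rng):
    """Sample from the skew-normal law via delta*|U0| + sqrt(1 - delta^2)*U1."""
    d = lam / math.sqrt(1.0 + lam * lam)
    root = math.sqrt(1.0 - d * d)
    return [xi + eta * (d * abs(rng.gauss(0.0, 1.0)) + root * rng.gauss(0.0, 1.0))
            for _ in range(n)]


def stat_and_fit(x, xi0):
    """Return the test statistic and the moment estimates (eta_hat, lam_hat)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    g = max(-0.99, min(0.99, m3 / m2 ** 1.5))  # clipping is an assumption
    r = math.copysign((2.0 * abs(g) / (4.0 - math.pi)) ** (1.0 / 3.0), g)
    d = r / (B * math.sqrt(1.0 + r * r))
    eta = math.sqrt(m2 / (1.0 - (B * d) ** 2))
    xi_hat = mean - eta * B * d
    # With moment estimates, eta * sqrt(1 - b^2 delta^2) equals sqrt(m2).
    t = math.sqrt(n) * (xi_hat - xi0) / math.sqrt(m2)
    return t, eta, d / math.sqrt(1.0 - d * d)


def rejection_rate(n, xi_true, xi0, eta, lam,
                   n_outer=200, n_boot=200, alpha=0.05, seed=0):
    """Type I error rate when xi_true == xi0, power otherwise."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_outer):                               # Step 6: outer loop
        x = draw(n, xi_true, eta, lam, rng)                # Step 1
        t_obs, eta_hat, lam_hat = stat_and_fit(x, xi0)     # Step 2
        t_star = [stat_and_fit(draw(n, xi0, eta_hat, lam_hat, rng), xi0)[0]
                  for _ in range(n_boot)]                  # Steps 3-4, repeated
        upper = sum(t > t_obs for t in t_star) / n_boot
        p_value = 2.0 * min(upper, 1.0 - upper)            # Step 5
        rejections += p_value < alpha
    return rejections / n_outer
```

For example, `rejection_rate(40, 0.0, 0.0, 1.0, 4.0)` approximates a Type I error rate, while choosing `xi0` different from `xi_true` approximates the power at that alternative.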
In this simulation, the parameters and sample sizes are set as follows. Firstly, let the nominal significance level be 5%, and the number of inner loops
and number of outer loops
both be 2500. Secondly, for hypothesis testing problem (
19), we set
,
,
(−8, −7.5, −7, −6.5, −6, −5.5, −5, −4.5, −4, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8), and
(40, 45, 50, 55, 60, 70, 100, 150, 200). Finally, for hypothesis testing problem (
27), we suppose
,
, and
.
For hypothesis testing problem (
19),
Table A1,
Table A2 and
Table A3 in
Appendix A present the simulated Type I error probabilities and powers of the proposed approaches. Since the simulated results are similar in the case of positive and negative skewness, only the positive situation is analyzed below. From
Table A1 in
Appendix A, the Type I error probabilities based on
are close to those based on
in most parameter settings. Specifically, for small sample sizes and small skewness parameters, these two approaches are slightly liberal; for large sample sizes, both approaches control the Type I error probabilities well. Furthermore, as the sample size increases, the actual levels of the two approaches approach the nominal significance level of 5%. From
Table A2 and
Table A3 in
Appendix A, it is clear that the powers of these two approaches based on
and
both increase with the sample size, but the former approach always performs better than the latter.
For hypothesis testing problem (
27),
Table A4,
Table A5 and
Table A6 in
Appendix A give the simulated Type I error probabilities and powers of the proposed approaches. From
Table A4 in
Appendix A, the approach based on
is slightly liberal when the sample size and skewness parameter are small, but it can effectively control the Type I error probabilities under other parameter settings. As the sample size increases, the actual level of this approach is close to the nominal significance level of 5%. The approach based on
is conservative under most parameter settings. From
Table A5 and
Table A6 in
Appendix A, the powers of the approach based on
are obviously better than those based on
in most cases.
In summary, for hypothesis testing problems (19) and (27), the proposed Bootstrap approaches perform satisfactorily in terms of Type I error probability and power in most cases, whether they are based on the moment estimator or the ML estimator. It is well known that the ML estimator depends on the choice of initial value, which may influence its estimation accuracy. Hence, the Bootstrap test based on the moment estimator is better than that based on the ML estimator in most situations, and it provides a useful approach for inference on the location parameter in real data examples.
Remark 4. For hypothesis testing problem (27), we only provide the simulation results in the case of positive skewness. When the skewness parameter is negative, the results are similar to those for a positive skewness parameter, so we omit them.
5. Illustrative Examples
In order to verify the rationality and validity of the proposed approaches, we apply them in this section to the examples of LAI, carbon fibers’ strength and RBC count in athletes.
Example 1. The above approaches are applied to the LAI data of a Robinia pseudoacacia plantation in Huaiping Forest Farm, Yongshou County, Shaanxi Province (see Ye et al. [42]). Figure 1 and Figure 2 show that the LAI does not follow the normal distribution but has a right-skewed distribution. To confirm this, we first test the normality of the data. The p-values from the R output of the Shapiro–Wilk test, Anderson–Darling test and Lilliefors test are 0.0007, 0.0014 and 0.0458, respectively. Hence, normality of the LAI is rejected at the nominal significance level of 5%. Further, we check whether the distribution of LAI is skew-normal using the Chi-square goodness-of-fit test. By calculation, we have . Therefore, the hypothesis that the LAI follows the skew-normal distribution is not rejected at the nominal significance level of 5%. Based on the method of moment estimation, the LAI is approximately distributed as , and its density curve is given in Figure 2. To illustrate the proposed approach for hypothesis testing problem (
19), we suppose
as a value near the moment estimate of
. Namely, consider the hypothesis testing problem:
Based on the moment and ML estimators, the p-values of the Bootstrap tests are 0.02584 and 0.00097, respectively. Hence, the null hypothesis is rejected at the nominal significance level of 5%; that is, the location parameter of LAI differs significantly from 2.
Example 2. Kundu and Gupta [43] presented a data set of the strength, measured in GPa, of single carbon fibers. The Shapiro-Wilk test, Anderson-Darling test and Lilliefors test are used to test the normality of the data; the p-values are 0.0108, 0.0109 and 0.0254, respectively. The P-P plot and the histogram of the data are given in Figure 3 and Figure 4. Furthermore, the chi-square goodness-of-fit test is used to check whether the distribution of the data is skew-normal, namely we set : the carbon fibers’ strength data is skew-normally distributed. We obtain that , so the null hypothesis is not rejected at the nominal significance level of 5%. Similar to Example 1, the carbon fibers’ strength data is considered to follow the skew-normal distribution SN (2.0917, 0.9230, 2.9668). Consider the hypothesis testing problem:
Based on the moment and ML estimators, the p-values of the Bootstrap tests are 0.9433 and 0.0814, respectively. Therefore, the null hypothesis is not rejected at the nominal significance level of 5%.
Example 3. The RBC counts of 102 male and 100 female athletes, collected by the Australian Institute of Sport, are analyzed in this example (see Cook and Weisberg [44]). Similar to Example 2, the Shapiro-Wilk test, Anderson-Darling test and Lilliefors test are used to test the normality of RBC count in male and female athletes. The p-values for male athletes are 0.0000, 0.0019 and 0.0025, respectively, while those for female athletes are 0.0065, 0.0131 and 0.0181, respectively. The corresponding plots are shown in Figure 5 and Figure 6. Therefore, at the nominal significance level of 5%, all of the above tests reject the null hypothesis that RBC counts in male and female athletes follow normal distributions. Furthermore, to verify the skew-normality of RBC count, we test the null hypothesis : the RBC count is skew-normally distributed. It can be obtained by calculation that and , which means that the hypotheses that the RBC counts in male and female athletes follow the skew-normal distributions and , respectively, are not rejected at the nominal significance level of 5%. Consider the hypothesis testing problem:
The p-values of the Bootstrap tests based on the moment estimators and ML estimators are 0.00071 and 0.00153, respectively. Therefore, the null hypothesis is rejected at the nominal significance level of 5%; that is, the location parameters of RBC counts in male and female athletes differ significantly.