Article

Information–Theoretic Aspects of Location Parameter Estimation under Skew–Normal Settings

by
Javier E. Contreras-Reyes
Instituto de Estadística, Facultad de Ciencias, Universidad de Valparaíso, Valparaíso 2360102, Chile
Entropy 2022, 24(3), 399; https://doi.org/10.3390/e24030399
Submission received: 14 February 2022 / Revised: 9 March 2022 / Accepted: 11 March 2022 / Published: 13 March 2022
(This article belongs to the Special Issue Distance in Information and Statistical Physics III)

Abstract
In several applications, the assumption of normality is violated because the data exhibit some level of skewness, which affects the estimation of the mean. The class of skew–normal distributions is considered, given its flexibility for modeling data through an asymmetry parameter. In this paper, two estimation methods for the location parameter ($\mu$) are considered in the skew–normal setting, where the coefficient of variation and the skewness parameter are known: the least square estimator (LSE) and the best unbiased estimator (BUE). The properties of the BUE (which dominates the LSE) are explored using classic theorems of information theory, which provide a way to measure the uncertainty of location parameter estimates. Specifically, inequalities based on the convexity property yield lower and upper bounds for the differential entropy and the Fisher information. Some simulations illustrate the behavior of these bounds.

1. Introduction

A typical problem in statistical inference is estimating the parameters from a data sample [1], especially when the data have some level of skewness; the estimation of these parameters is therefore affected by the asymmetry. Recent research has addressed data asymmetry with the class of skew–normal distributions, given their flexibility for modeling data through a skewness (asymmetry/symmetry) parameter [2]. In particular, Ref. [3] focused on estimating the location parameter ($\mu$), assuming that the coefficient of variation and the skewness parameter are known; specifically, they presented the least square estimator (LSE) and the best unbiased estimator (BUE) of $\mu$. The precision of the location parameter estimation is directly influenced by the skewness [4] and, hence, affects confidence intervals and sample size determination [5,6].
Given that complex parametric distributions with several parameters are often considered [2], information measures (entropies and/or divergences) play an important role in quantifying the uncertainty provided by a random process about itself, and they suffice to study the reproduction of a marginal process through a noiseless system. One main application is related to model selection and detection of the number of clusters [7], or the interpretation of physical phenomena [8,9]. Moreover, entropies and/or divergences are widely used to compare estimations [1]. For example, Ref. [10] considered the Kullback–Leibler (KL) divergence as a method to compare sample correlation matrices, with an application to financial markets, assuming two multivariate normal densities. Using parameters estimated by maximum likelihood, Refs. [11,12,13] considered the KL divergence in an asymptotic test to evaluate the skewness and/or bimodality of the data.
Given that precision was evaluated with confidence intervals in [5], the quantification of the uncertainty of location parameter estimation under skew–normal settings motivated this study. The properties of the LSE and, especially, the BUE are explored using classic theorems and properties of information theory, which enable measuring the uncertainty of location parameter estimates through differential entropy and Fisher information [1]. The Cramér–Rao inequality [14] links the Fisher information with the variance of an unbiased estimator and is used here to find a lower bound for the Fisher information. In addition, considering a stochastic representation [15] of a skew–normal random variable, the convexity property of the Fisher information is used to find an upper bound.
This paper is organized as follows: some properties and inferential aspects based on information theory are presented in Section 2. In Section 3, the computation and description of information–theoretic theorems related to location parameter estimation of skew–normal distribution are presented. In Section 4, some simulations illustrate the usefulness of the results. Final remarks conclude the paper in Section 5.

2. Information-Theoretic Aspects

In this section, some main theorems and properties of information theory are described. Specifically, these properties are based on differential entropy and Fisher information.
Definition 1.
Let X be a random variable with support in $\mathbb{R}$ and continuous probability density function (pdf) $f(x;\theta)$, which depends on a parameter $\theta$. The differential entropy of X [1] is defined by
$$H(X) = -E[\log f(X;\theta)] = -\int_{\mathbb{R}} f(x;\theta)\log f(x;\theta)\,dx,$$
where the notation $E[g(X)] = \int_{\mathbb{R}} g(x)\,f(x;\theta)\,dx$ is used.
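As a quick numerical illustration of Definition 1 (not taken from the paper), the integral can be evaluated with base R's integrate and checked against a known closed form, here the Gaussian entropy that reappears in Theorem 3 below.

```r
# Numerical differential entropy H(X) = -E[log f(X)] for a density f on R.
# Illustrative sketch only.
diff_entropy <- function(f, lower = -Inf, upper = Inf) {
  integrand <- function(x) {
    fx <- f(x)
    ifelse(fx > 0, -fx * log(fx), 0)  # convention 0*log(0) = 0
  }
  integrate(integrand, lower, upper)$value
}

sigma <- 2
diff_entropy(function(x) dnorm(x, sd = sigma))  # numerical value
0.5 * log(2 * pi * exp(1) * sigma^2)            # closed form, approx. 2.112
```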
Differential entropy depends only on the pdf of the random variable. In the following theorem, the scaling property of differential entropy is presented.
Theorem 1.
For any real constant $a\neq 0$, the differential entropy of $aX$ (Theorem 8.6.4 of [1]) is given by
$$H(aX) = H(X) + \log|a|.$$
In particular, for two random variables, the following differential entropy bounds hold.
Theorem 2.
Let X and Y be two independent random variables, and suppose $Z \stackrel{d}{=} X+Y$, where "$\stackrel{d}{=}$" denotes equality in distribution. Then:
(i) 
$$\frac{H(X)+H(Y)+\log 2}{2} \le H(Z) \le H(X)+H(Y).$$
(ii) 
For any constant $\rho$ such that $0\le\rho\le 1$,
$$\rho\,H(X) + (1-\rho)\,H(Y) \le H\!\left(\sqrt{\rho}\,X+\sqrt{1-\rho}\,Y\right).$$
Proof. 
For part (i), consider first the general case of $X_1,X_2,\dots,X_n$ independent and identically distributed (i.i.d.) random variables (see Equations (5) and (6) of [16]); then
$$H(X_1) \le H\!\left(\frac{X_1+X_2}{\sqrt{2}}\right) \le \cdots \le H\!\left(\frac{1}{\sqrt{n-1}}\sum_{i=1}^{n-1}X_i\right) \le H\!\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}X_i\right).$$
Considering the latter inequality for two variables, X and Y, and the scaling property of Theorem 1, we obtain $2H(X+Y)-\log 2 \ge H(X)+H(Y)$, yielding the left side of the inequality. For the right side, see [17]. The inequality of part (ii) is proved in Theorem 7 of [14].    □
The inequality of Theorem 2(ii) is based on the convexity property and allows obtaining lower bounds for differential entropy. The following result (Theorem 8.6.5 of [1]) provides an upper bound.
Theorem 3.
Let X be a random variable with zero mean and finite variance $\sigma^{2}$; then
$$H(X) \le \frac{1}{2}\log\left(2\pi e\,\sigma^{2}\right),$$
and equality is achieved if, and only if, $X\sim N(0,\sigma^{2})$.
Theorem 3, also known as the maximum entropy principle, implies that the Gaussian distribution maximizes the differential entropy over all distributions with the same variance. This theorem has several implications in information theory, mainly when the differential entropy of an unknown distribution is hard to obtain; in such cases, this upper bound is a good alternative. Another consequence is the relationship between estimation error and differential entropy, which includes the Cramér–Rao bound, as described next. First, the Fisher information for continuous densities needs to be defined.
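For instance, the bound of Theorem 3 can be checked numerically for a skew–normal density (Definition 3 in Section 3): its differential entropy never exceeds that of a normal distribution with the same variance. The sketch below uses base R only; the variance formula $\sigma^2(1-b^2\delta^2)$, with $b=\sqrt{2/\pi}$ and $\delta=\lambda/\sqrt{1+\lambda^2}$, is the standard skew–normal result.

```r
# Check H(X) <= (1/2) log(2*pi*e*Var[X]) for X ~ SN_1(0, 1, lambda) (Theorem 3).
# Sketch only; the skew-normal pdf of Definition 3 is built from base R.
dsn_manual <- function(x, mu = 0, sigma = 1, lambda = 0)
  2 / sigma * dnorm((x - mu) / sigma) * pnorm(lambda * (x - mu) / sigma)

lambda <- 3
H_sn <- integrate(function(x) {
  fx <- dsn_manual(x, lambda = lambda)
  ifelse(fx > 0, -fx * log(fx), 0)
}, -Inf, Inf)$value

b <- sqrt(2 / pi); delta <- lambda / sqrt(1 + lambda^2)
v_sn <- 1 - b^2 * delta^2                  # Var[X] for SN_1(0, 1, lambda)
H_sn <= 0.5 * log(2 * pi * exp(1) * v_sn)  # TRUE: the Gaussian bound dominates
```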
Definition 2.
Let X be a random variable with support in $\mathbb{R}$ and continuous density function $f(x;\theta)$, which depends on a parameter $\theta$, so that $\int_{\mathbb{R}}f(x;\theta)\,dx=1$. The Fisher information of X [1] is defined by
$$J(X) = E\!\left[\left(\frac{\partial}{\partial x}\log f(x;\theta)\right)^{2}\right] = \int_{\mathbb{R}}\left[\frac{\partial}{\partial x}f(x;\theta)\right]^{2}\frac{1}{f(x;\theta)}\,dx. \tag{1}$$
The Fisher information is a measure of the minimum error in estimating a parameter $\theta$ of a distribution. Classical definitions of the Fisher information consider differentiation with respect to $\theta$ to define $J(\theta)$; however, for a location family of the form $f(x-\theta)$, differentiation with respect to x is equivalent to differentiation with respect to $\theta$, as in Equation (1) [1]. The following inequality links the Fisher information and the variance.
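As a hedged numerical illustration of Equation (1) (not from the paper), the location Fisher information of a Gaussian density is $1/\sigma^2$, which a simple quadrature with a finite-difference derivative reproduces.

```r
# Numerical Fisher information J(X) = integral of (f'(x))^2 / f(x) (Equation (1)),
# using a central finite difference for f'(x). Illustrative sketch only.
fisher_info <- function(f, eps = 1e-5) {
  integrand <- function(x) {
    fx  <- f(x)
    dfx <- (f(x + eps) - f(x - eps)) / (2 * eps)
    ifelse(fx > 0, dfx^2 / fx, 0)
  }
  integrate(integrand, -Inf, Inf)$value
}

sigma <- 2
fisher_info(function(x) dnorm(x, sd = sigma))  # close to 1/sigma^2 = 0.25
```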
Theorem 4.
Let $\mathbf{X}=(X_1,X_2,\dots,X_n)$ be a sample of n random variables drawn i.i.d. from $f(x;\theta)$; the mean-squared error of an unbiased estimator $T(\mathbf{X})$ of the parameter $\theta$ is lower bounded by the reciprocal of the Fisher information (Theorem 11.10.1 of [1]):
$$\mathrm{Var}[T(\mathbf{X})] \ge \frac{1}{J(X)},$$
where $J(X)$ is defined in Equation (1); if the inequality is attained, $T(\mathbf{X})$ is said to be efficient.
Theorem 4, also known as the Cramér–Rao inequality, allows determining the best estimator of $\theta$ and obtaining a lower bound for the Fisher information. The Cramér–Rao inequality was first stated for any estimator $T(\mathbf{X})$ (not necessarily unbiased) of $\theta$ in terms of the mean-squared error; in this case,
$$E\!\left[\{T(\mathbf{X})-\theta\}^{2}\right] \ge \frac{\left[1+\frac{\partial}{\partial\theta}\mathrm{Bias}(\theta)\right]^{2}}{J(X)} + \mathrm{Bias}(\theta)^{2}, \tag{2}$$
where $\mathrm{Bias}(\theta) = E[T(\mathbf{X})-\theta]$; see Equation (11.290) of [1]. Clearly, if $T(\mathbf{X})$ is an unbiased estimator of $\theta$, Theorem 4 is a particular case of the latter inequality. Inequality (2) is obtained through the Cauchy–Schwarz inequality applied to the variance of unbiased estimators. The following inequality, also known as the Fisher information inequality, is based on the convexity property and is useful to obtain an upper bound for the Fisher information.
Theorem 5.
For any two independent random variables X and Y, and any constant $\rho$ such that $0\le\rho\le 1$,
$$J\!\left(\sqrt{\rho}\,X+\sqrt{1-\rho}\,Y\right) \le \rho\,J(X) + (1-\rho)\,J(Y).$$
Proof. 
See proof of Theorem 13 in [14].    □

3. Location Parameter Estimation

The skew–normal distribution is an extension of the normal one, allowing for the presence of skewness.
Definition 3.
X is called a skew–normal random variable [15], denoted $X\sim SN_1(\mu,\sigma^{2},\lambda)$, if it has pdf
$$f(x;\theta) = \frac{2}{\sigma}\,\phi\!\left(\frac{x-\mu}{\sigma}\right)\Phi\!\left(\lambda\,\frac{x-\mu}{\sigma}\right), \qquad x\in\mathbb{R},\quad \theta=(\mu,\sigma^{2},\lambda),$$
with location $\mu\in\mathbb{R}$, scale $\sigma^{2}\in\mathbb{R}^{+}$, and shape $\lambda\in\mathbb{R}$ parameters. In addition, $\phi(x)$ is the pdf of the standard normal distribution with mean 0 and variance 1, denoted $N(0,1)$, and $\Phi(x)$ is the corresponding cumulative distribution function (cdf).
Random variable X admits the following stochastic representation:
$$X \stackrel{d}{=} \mu + \sigma\left(\delta\,|U_0| + \sqrt{1-\delta^{2}}\,U\right), \tag{3}$$
where $\delta = \dfrac{\lambda}{\sqrt{1+\lambda^{2}}}$, and $U_0$ and $U\sim N(0,1)$ are independently distributed; see Equation (2.14) of [15].
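The representation in Equation (3) gives a direct way to simulate skew–normal variates with base R alone (the simulations of Section 4 use the sn package's rsn function for the same task). The sketch below also checks the simulated mean against $\mu+\sigma b\delta$, $b=\sqrt{2/\pi}$, the standard skew–normal mean.

```r
# Draw skew-normal variates via the stochastic representation (3):
# X = mu + sigma*(delta*|U0| + sqrt(1 - delta^2)*U), with U0, U ~ N(0,1) independent.
rsn_rep <- function(m, mu = 0, sigma = 1, lambda = 0) {
  delta <- lambda / sqrt(1 + lambda^2)
  u0 <- rnorm(m); u <- rnorm(m)
  mu + sigma * (delta * abs(u0) + sqrt(1 - delta^2) * u)
}

set.seed(1)
x <- rsn_rep(1e5, mu = 1, sigma = 2, lambda = 5)
mean(x)                                   # Monte Carlo mean
1 + 2 * sqrt(2 / pi) * 5 / sqrt(1 + 5^2)  # theoretical mean: mu + sigma*b*delta
```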
Additionally, X admits a representation based on a link between differential entropy and Fisher information, due to de Bruijn's identity. By matching the stochastic representation (3) with Equation (20) of [16], it is possible to set $Y = \mu+\sigma\delta|U_0|$ with fixed $\delta$. Then,
$$H(Y) = \frac{1}{2}\log(2\pi e) + \int\frac{\lambda}{1+\lambda^{2}}\,J(X)^{-1}\,d\lambda,$$
where an approximation for the Fisher information $J(X)$ appears in the proof of Proposition 5 below (with n = 1 observation).
Definition 4.
$\mathbf{X}$ is called a multivariate skew–normal random vector [18], denoted $\mathbf{X}\sim SN_n(\boldsymbol{\mu},\Sigma,\boldsymbol{\lambda})$, if it has pdf
$$f_n(\mathbf{x};\theta) = 2\,\phi_n(\mathbf{x};\boldsymbol{\mu},\Sigma)\,\Phi\!\left[\boldsymbol{\lambda}^{\top}\Sigma^{-1/2}(\mathbf{x}-\boldsymbol{\mu})\right], \qquad \mathbf{x}\in\mathbb{R}^{n},\quad \theta=(\boldsymbol{\mu},\Sigma,\boldsymbol{\lambda}),$$
with location vector $\boldsymbol{\mu}\in\mathbb{R}^{n}$, scale matrix $\Sigma\in\mathbb{R}^{n\times n}$, and skewness vector $\boldsymbol{\lambda}\in\mathbb{R}^{n}$ parameters. In addition, $\phi_n(\mathbf{x};\boldsymbol{\mu},\Sigma)$ is the n-dimensional normal pdf with location parameter $\boldsymbol{\mu}$ and scale matrix $\Sigma$.
Let $\mathbf{X}=(X_1,\dots,X_n)^{\top}\sim SN_n(\boldsymbol{\mu},\Sigma,\boldsymbol{\lambda})$, with $\boldsymbol{\mu}=\mathbf{1}_n\mu$, $\Sigma=\sigma^{2}I_n$, and $\boldsymbol{\lambda}=\mathbf{1}_n\lambda$, where $\mathbf{1}_n=(1,\dots,1)^{\top}\in\mathbb{R}^{n}$ and $I_n$ denotes the $n\times n$ identity matrix. Following [3] and Corollary 2.2 of [5], the following properties hold.
Property 1.
$X_i \sim SN_1(\mu,\sigma^{2},\lambda^{*})$, $i=1,\dots,n$, with $\lambda^{*} = \dfrac{\lambda}{\sqrt{1+(n-1)\lambda^{2}}}$.
Property 1 indicates that $X_1,\dots,X_n$ is a random sample of identically distributed, but not independent, random variables from a univariate skew–normal population with location $\mu$, scale $\sigma^{2}$, and shape $\lambda^{*}$ parameters.
Property 2.
$E[X_i] = \mu + \sigma b\delta^{*}$ and $\mathrm{Var}[X_i] = \sigma^{2}\left(1 - b^{2}\delta^{*2}\right)$, with $b=\sqrt{2/\pi}$ and $\delta^{*} = \dfrac{\lambda}{\sqrt{1+n\lambda^{2}}}$.
Property 3.
$\bar{X} = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n}X_i \sim SN_1\!\left(\mu,\ \dfrac{\sigma^{2}}{n},\ \sqrt{n}\,\lambda\right)$.
Property 4.
$\dfrac{(n-1)}{\sigma^{2}}\,S^{2} \sim \chi^{2}_{n-1}$, with $S^{2} = \dfrac{1}{n-1}\displaystyle\sum_{i=1}^{n}(X_i-\bar{X})^{2}$, where $\chi^{2}_{n-1}$ denotes the chi-square distribution with $n-1$ degrees of freedom; moreover, the sample mean $\bar{X}$ and the sample variance $S^{2}$ are independent.
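Properties 3 and 4 can be checked by simulation. The sketch below assumes the rmsn function of the sn package for drawing from $SN_n$ (only rsn is mentioned in the paper, so the rmsn interface is an assumption here).

```r
# Monte Carlo check of Properties 3 and 4 for X ~ SN_n(mu*1_n, sigma^2*I_n, lambda*1_n).
# Assumes the sn package and its rmsn() interface; illustrative sketch only.
library(sn)

n <- 5; mu <- 1; sigma <- 2; lambda <- 3; B <- 2e4
X <- rmsn(B, xi = rep(mu, n), Omega = sigma^2 * diag(n), alpha = rep(lambda, n))

xbar <- rowMeans(X)
s2   <- apply(X, 1, var)
b <- sqrt(2 / pi); delta_star <- lambda / sqrt(1 + n * lambda^2)

c(mean(xbar), mu + sigma * b * delta_star)  # Property 3: E[Xbar] = mu + sigma*b*delta*
c(mean((n - 1) * s2 / sigma^2), n - 1)      # Property 4: mean of a chi-square_{n-1}
```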

3.1. Least Square Estimator

Assuming that the coefficient of variation $\tau=|\sigma/\mu|$ and the shape parameter $\lambda$ are known, Theorem 4.1 of [3] provides the least square estimator of $\mu$ and its variance:
$$T_{LSE}(\mathbf{X}) = \omega\,\bar{X}, \tag{4}$$
$$\mathrm{Var}[T_{LSE}(\mathbf{X})] = \omega^{2}\,(1-n\delta_n^{2})\,\frac{\sigma^{2}}{n}, \qquad \omega = \frac{n\,(1+\delta_n\tau)}{n+\tau(\tau+2n\delta_n)}, \tag{5}$$
where $\delta_n = b\,\delta^{*}$ and $\delta^{*}$ is defined in Property 2. The least square estimator of $\mu$ is obtained by minimizing the MSE of $c\bar{X}$ with respect to the constant c. The MSE of $T_{LSE}(\mathbf{X})$ is
$$MSE[T_{LSE}(\mathbf{X})] = \left(\frac{\sigma^{2}}{n}+\mu^{2}+2\mu\delta_n\sigma\right)\omega^{2} - 2\mu\,\omega\,(\mu+\sigma\delta_n) + \mu^{2}. \tag{6}$$
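As a cross-check of the expressions above (written here as reconstructed, so treat the closed form of $\omega$ as such), $\omega$ can also be obtained by numerically minimizing the MSE in Equation (6) over the constant multiplying $\bar{X}$; both routes should coincide.

```r
# Least square estimator T_LSE = omega * Xbar: compare the closed form of omega
# with a direct numerical minimization of the MSE in Equation (6). Sketch only,
# under the expressions as reconstructed above.
mse_lse <- function(w, mu, sigma, n, delta_n) {
  (sigma^2 / n + mu^2 + 2 * mu * delta_n * sigma) * w^2 -
    2 * mu * w * (mu + sigma * delta_n) + mu^2
}

mu <- 1; tau <- 0.5; sigma <- tau * abs(mu); n <- 10; lambda <- 2
b <- sqrt(2 / pi); delta_n <- b * lambda / sqrt(1 + n * lambda^2)

omega_closed  <- n * (1 + delta_n * tau) / (n + tau * (tau + 2 * n * delta_n))
omega_numeric <- optimize(mse_lse, c(0, 2), mu = mu, sigma = sigma,
                          n = n, delta_n = delta_n)$minimum
c(omega_closed, omega_numeric)  # should agree up to optimize() tolerance
```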
Proposition 1.
Let $\mathbf{X}=(X_1,\dots,X_n)^{\top}\sim SN_n(\mu\mathbf{1}_n,\sigma^{2}I_n,\mathbf{1}_n\lambda)$, with known $\tau$ and $\lambda$. Thus,
(i) 
$$H(T_{LSE}(\mathbf{X})) = \frac{1}{2}\log\!\left(2\pi e\,\frac{\sigma^{2}}{n}\,\omega^{2}\right) - H_N(\eta), \quad H_N(\eta)=E[\log\{2\Phi(\eta W)\}], \quad W\sim SN_1(0,1,\eta), \quad \eta=\sigma\lambda.$$
(ii) 
$$H(T_{LSE}(\mathbf{X})) \le \frac{1}{2}\log\!\left(2\pi e\,\frac{\sigma^{2}}{n}\,\omega^{2}\,(1-n\delta_n^{2})\right).$$
Proof. 
Part (i) follows straightforwardly from Theorem 1, Property 3, Equation (4), and Proposition 2.1 of [19] (for the univariate case). Part (ii) is straightforward from Theorem 3 and Equation (5).    □
The differential entropy of $T_{LSE}(\mathbf{X})$ is the difference between the normal differential entropy and a term called negentropy, $H_N(\eta)$, which depends on the $\sigma$ and $\lambda$ parameters. Additionally, note that part (ii) yields a lower bound for the negentropy of part (i), $H_N(\eta) \ge -\frac{1}{2}\log(1-n\delta_n^{2})$.
As a particular case of Proposition 1, the differential entropy of the sample mean $\bar{X}$ is obtained by choosing $\omega=1$:
$$H(\bar{X}) = \frac{1}{2}\log\!\left(2\pi e\,\frac{\sigma^{2}}{n}\right) - E[\log\{2\Phi(\eta W)\}];$$
its respective upper bound is
$$H(\bar{X}) \le \frac{1}{2}\log\!\left(2\pi e\,\frac{\sigma^{2}}{n}\,(1-n\delta_n^{2})\right);$$
and, from Equation (6), its respective MSE is
$$MSE[\bar{X}] = \frac{\sigma^{2}}{n}.$$

3.2. Best Unbiased Estimator

Assuming that the coefficient of variation $\tau=|\sigma/\mu|$ and the shape parameter $\lambda$ are known, Theorem 5.1 of [3] provides the best unbiased estimator (BUE) of $\mu$, given by
$$T_{BUE}(\mathbf{X}) = (1-\alpha)\,d_1(\mathbf{X}) + \alpha\,d_2(\mathbf{X}), \qquad d_1(\mathbf{X}) = \frac{\bar{X}}{1+\delta_n\tau}, \qquad d_2(\mathbf{X}) = c_n\sqrt{n-1}\,S, \tag{9}$$
$$c_n = \frac{1}{\sqrt{2\tau^{2}}}\,\frac{\Gamma\!\left(\frac{n-1}{2}\right)}{\Gamma\!\left(\frac{n}{2}\right)}, \tag{10}$$
$$\alpha = \frac{1}{(1+\delta_n\tau)\left[(n-1)\,c_n\right]^{2}}, \tag{11}$$
where $\Gamma(x)$ denotes the usual gamma function and S is defined in Property 4.
Remark 1.
Equation (10) can be approximated using an asymptotic expression for the gamma function, $\Gamma(x+a)\approx\sqrt{2\pi}\,x^{x+a-1/2}e^{-x}$ for $a<\infty$ as $|x|\to\infty$ [19]. Then,
$$c_n \approx \frac{1}{\sqrt{n\tau^{2}}} \tag{12}$$
as $n\to\infty$. Since the exact form (10) can be numerically undefined for large samples ($n>200$), approximation (12) is very useful in these cases. Note that, from (11) and (12), $\delta_n,\alpha\to 0$ as $n\to\infty$, which implies that the estimator $T_{BUE}(\mathbf{X})$ is only influenced by $d_1(\mathbf{X})$ for large samples.
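A short numerical companion to Remark 1 (with $c_n$ written as reconstructed in Equation (10)): for large n the direct ratio of gamma functions overflows in floating point, whereas the approximation (12), or an lgamma-based evaluation, remains stable.

```r
# Exact c_n of Equation (10) (as reconstructed) versus the approximation (12).
# For large n, gamma() overflows and the direct ratio becomes NaN; lgamma() or
# the approximation avoids this. Illustrative sketch only.
c_n_exact  <- function(n, tau) gamma((n - 1) / 2) / (sqrt(2 * tau^2) * gamma(n / 2))
c_n_lgamma <- function(n, tau) exp(lgamma((n - 1) / 2) - lgamma(n / 2)) / sqrt(2 * tau^2)
c_n_approx <- function(n, tau) 1 / sqrt(n * tau^2)

tau <- 0.5
sapply(c(10, 50, 200, 1000), function(n)
  c(exact = c_n_exact(n, tau), lgamma = c_n_lgamma(n, tau), approx = c_n_approx(n, tau)))
```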
From Properties 3 and 4, Ref. [3] also proved that
$$d_1(\mathbf{X}) \sim SN_1\!\left(\frac{\mu}{1+\delta_n\tau},\ \frac{\sigma^{2}}{n\,(1+\delta_n\tau)^{2}},\ \sqrt{n}\,\lambda\right), \tag{13}$$
$$\mathrm{Var}[d_1(\mathbf{X})] = \frac{(\mu\tau)^{2}\,(1-n\delta_n^{2})}{n\,(1+\delta_n\tau)^{2}}, \tag{14}$$
$$\mathrm{Var}[d_2(\mathbf{X})] = 2(\mu\tau)^{2}\left[1-\frac{1}{2}(n-1)(\tau c_n)^{2}\right]. \tag{15}$$
Given that $\mathrm{Cov}(d_1,d_2)=0$, from Equations (9), (14) and (15) we obtain
$$\mathrm{Var}[T_{BUE}(\mathbf{X})] = (1-\alpha)^{2}\,\mathrm{Var}[d_1(\mathbf{X})] + \alpha^{2}\,\mathrm{Var}[d_2(\mathbf{X})]. \tag{16}$$
The following proposition provides two upper bounds of differential entropy for T B U E ( X ) based on Theorem 3.
Proposition 2.
Let $\mathbf{X}=(X_1,\dots,X_n)^{\top}\sim SN_n(\mu\mathbf{1}_n,\sigma^{2}I_n,\mathbf{1}_n\lambda)$, with known $\tau$ and $\lambda$. Thus,
(i) 
$$H(T_{BUE}(\mathbf{X})) \le \frac{1}{2}\log\left\{\frac{8\,(\pi e\sigma^{2})^{2}\,\alpha^{2}(1-\alpha)^{2}}{n\,(1+\delta_n\tau)^{2}}\left[1-\frac{1}{2}(n-1)(\tau c_n)^{2}\right]\right\},$$
(ii) 
$$H(T_{BUE}(\mathbf{X})) \le \frac{1}{2}\log\left\{2\pi e\,\mathrm{Var}[T_{BUE}(\mathbf{X})]\right\}.$$
Proof. 
From Theorem 3 and Equations (14) and (15), the differential entropies of $d_1(\mathbf{X})$ and $d_2(\mathbf{X})$ are, respectively, upper bounded by
$$H(d_1(\mathbf{X})) \le \frac{1}{2}\log\!\left(2\pi e\,\frac{\sigma^{2}}{n\,(1+\delta_n\tau)^{2}}\right), \tag{17}$$
$$H(d_2(\mathbf{X})) \le \frac{1}{2}\log\!\left(2\pi e\cdot 2\sigma^{2}\left[1-\frac{1}{2}(n-1)(\tau c_n)^{2}\right]\right). \tag{18}$$
Considering the right side of the inequality of Theorem 2(i), with $Z=X+Y$, $X=(1-\alpha)d_1(\mathbf{X})$ and $Y=\alpha d_2(\mathbf{X})$ (thus $\mathrm{Cov}(X,Y)=0$), we obtain
$$H(T_{BUE}(\mathbf{X})) \le H\big((1-\alpha)d_1(\mathbf{X})\big) + H\big(\alpha d_2(\mathbf{X})\big) = H(d_1(\mathbf{X})) + H(d_2(\mathbf{X})) + \log|(1-\alpha)\alpha|,$$
where Theorem 1 is applied in the last equality. Then, Equations (17) and (18) yield part (i). On the other hand, applying Theorem 3 directly to $T_{BUE}(\mathbf{X})$, Equation (16) implies part (ii).    □
The following proposition provides two lower bounds of differential entropy for T B U E ( X ) .
Proposition 3.
Let $\mathbf{X}=(X_1,\dots,X_n)^{\top}\sim SN_n(\mu\mathbf{1}_n,\sigma^{2}I_n,\mathbf{1}_n\lambda)$, with known $\tau$ ($0<\tau<1$) and $\lambda$. Thus,
(i) 
$$H(d_1(\mathbf{X})) + H(d_2(\mathbf{X})) + \log\!\left(2\,\alpha^{2}(1-\alpha)^{2}\right) \le 2\,H(T_{BUE}(\mathbf{X})),$$
(ii) 
$$(1-\alpha^{2})\,H(d_1(\mathbf{X})) + \alpha^{2}\,H(d_2(\mathbf{X})) \le H(T_{BUE}(\mathbf{X}));$$
where
$$H(d_1(\mathbf{X})) = \frac{1}{2}\log\!\left(2\pi e\,\frac{\sigma^{2}}{n\,(1+\delta_n\tau)^{2}}\right) - H_N(\eta_1), \quad H_N(\eta_1)=E[\log\{2\Phi(\eta_1 W_1)\}], \quad W_1\sim SN_1(0,1,\eta_1), \quad \eta_1=\left|\frac{\lambda\sigma}{1+\delta_n\tau}\right|;$$
and
$$H(d_2(\mathbf{X})) = \log\left\{\frac{|\alpha|\,\Gamma\!\left(\frac{n-1}{2}\right)^{n}}{2\left[\tau\,\Gamma\!\left(\frac{n}{2}\right)\right]^{n-1}}\right\} - \frac{n-2}{2}\left[\psi\!\left(\frac{n-1}{2}\right)+\log\!\left(2c_n^{2}\right)\right] + \frac{n-1}{2}.$$
Proof. 
The differential entropy of $d_1(\mathbf{X})$ follows directly from evaluating (13) in Proposition 2.1 of [19] (for the univariate case). Given that the distribution of $d_2(\mathbf{X})$ is not of a standard form, Ref. [3] provided its pdf,
$$f_{d_2}(x;\mu,\sigma) = \frac{2\left[\tau\,\Gamma\!\left(\frac{n}{2}\right)\right]^{n-1}}{\Gamma\!\left(\frac{n-1}{2}\right)^{n}}\;x^{\,n-2}\,e^{-\frac{x^{2}}{2c_n^{2}}}, \qquad x>0. \tag{19}$$
Through Equations (19) and (3.381.4) of [20], the moments of $d_2$ are given by
$$E_{d_2}[X^{m}] = \frac{\Gamma\!\left(\frac{m+n-1}{2}\right)}{\Gamma\!\left(\frac{n-1}{2}\right)}\left(2c_n^{2}\right)^{m/2}, \qquad m=0,1,\dots; \tag{20}$$
and, using Equation (4.352.1) of [20], the moment of $\log X$ is
$$E_{d_2}[\log X] = \int_{0}^{\infty} f_{d_2}(x;\mu,\sigma)\log x\,dx = \frac{2\left[\tau\,\Gamma\!\left(\frac{n}{2}\right)\right]^{n-1}}{\Gamma\!\left(\frac{n-1}{2}\right)^{n}}\int_{0}^{\infty} x^{\,n-2}\,e^{-\frac{x^{2}}{2c_n^{2}}}\log x\,dx = \frac{1}{2}\left[\psi\!\left(\frac{n-1}{2}\right)+\log\!\left(2c_n^{2}\right)\right], \tag{21}$$
where $\psi(x)=\frac{d}{dx}\log\Gamma(x)$ is the digamma function. Therefore, by Definition 1, the differential entropy of $d_2(\mathbf{X})$ is computed as
$$H(d_2(\mathbf{X})) = -\int_{0}^{\infty} f_{d_2}(x;\mu,\sigma)\log f_{d_2}(x;\mu,\sigma)\,dx = \log\left\{\frac{|\alpha|\,\Gamma\!\left(\frac{n-1}{2}\right)^{n}}{2\left[\tau\,\Gamma\!\left(\frac{n}{2}\right)\right]^{n-1}}\right\} - (n-2)\underbrace{\int_{0}^{\infty} f_{d_2}(x;\mu,\sigma)\log x\,dx}_{E_{d_2}[\log X]} + \frac{1}{2c_n^{2}}\underbrace{\int_{0}^{\infty} x^{2}\,f_{d_2}(x;\mu,\sigma)\,dx}_{E_{d_2}[X^{2}]}.$$
Thus, Equations (20) and (21) are evaluated in the latter expression to obtain $H(d_2(\mathbf{X}))$. Taking $Z=X+Y$ in Theorem 2(i), with $X=(1-\alpha)d_1(\mathbf{X})$ and $Y=\alpha d_2(\mathbf{X})$ (thus $\mathrm{Cov}(X,Y)=0$), the inequality of part (i) is obtained.
Taking $Z=X+Y$ in Theorem 2(ii), with $X=d_2(\mathbf{X})$, $Y=d_1(\mathbf{X})$ (thus $\mathrm{Cov}(X,Y)=0$) and $\rho=\alpha^{2}$, and since $d_1(\mathbf{X})$ and $d_2(\mathbf{X})$ are two unbiased estimators of $\mu$ [3], the inequality of part (ii) is obtained.    □
The following proposition provides a lower bound for Fisher information of parameter μ based on T B U E ( X ) .
Proposition 4.
Let $\mathbf{X}=(X_1,\dots,X_n)^{\top}\sim SN_n(\mu\mathbf{1}_n,\sigma^{2}I_n,\mathbf{1}_n\lambda)$, with known $\tau$ and $\lambda$. Thus,
$$J(\mu) \ge \left\{\frac{(1-\alpha)^{2}\sigma^{2}(1-n\delta_n^{2})}{n\,(1+\delta_n\tau)^{2}} + 2\,\alpha^{2}\sigma^{2}\left[1-\frac{1}{2}(n-1)(\tau c_n)^{2}\right]\right\}^{-1}.$$
Proof. 
Considering that $T_{BUE}(\mathbf{X})$ is an unbiased estimator of $\mu$, the Cramér–Rao inequality of Theorem 4 and Equations (14)–(16) give
$$J(\mu) \ge \frac{1}{(1-\alpha)^{2}\,\mathrm{Var}[d_1(\mathbf{X})] + \alpha^{2}\,\mathrm{Var}[d_2(\mathbf{X})]},$$
yielding the result.    □
The following Proposition provides an upper bound of Fisher information for parameter μ based on T B U E ( X ) and the convexity property.
Proposition 5.
Let $\mathbf{X}=(X_1,\dots,X_n)^{\top}\sim SN_n(\mu\mathbf{1}_n,\sigma^{2}I_n,\mathbf{1}_n\lambda)$, with known $\tau$ ($0<\tau<1$) and $\lambda$. Thus,
$$J(\mu) \le (1-\alpha^{2})\,J(d_1(\mathbf{X})) + \alpha^{2}\,J(d_2(\mathbf{X})),$$
where
$$J(d_1(\mathbf{X})) \approx 1 + \frac{n\,(b\lambda)^{2}}{\sqrt{1+2nb^{4}\lambda^{2}}}, \qquad J(d_2(\mathbf{X})) = \frac{2n-5}{(n-3)\,c_n^{2}}, \quad n>3.$$
Proof. 
Taking $Z=X+Y$ in Theorem 5, with $X=d_2(\mathbf{X})$, $Y=d_1(\mathbf{X})$ (thus $\mathrm{Cov}(X,Y)=0$) and $\rho=\alpha^{2}$, and since $d_1(\mathbf{X})$ and $d_2(\mathbf{X})$ are two unbiased estimators of $\mu$ [3], we obtain $J(\mu)\le\alpha^{2}J(d_2(\mathbf{X}))+(1-\alpha^{2})J(d_1(\mathbf{X}))$. Note that the condition $0<\tau<1$ ensures that $0\le\alpha^{2}\le 1$.
For $J(d_1(\mathbf{X}))$, the steps of Section 3.2 of [9] are followed. By Equations (1) and (13), and the change of variable $z=(x-\mu^{*})/\sigma^{*}$, with $\mu^{*}=\mu/(1+\delta_n\tau)$, $\sigma^{*}=\sigma/[\sqrt{n}(1+\delta_n\tau)]$ and $\lambda^{*}=\sqrt{n}\,\lambda$, $J(d_1(\mathbf{X}))$ can be computed as
$$J(d_1(\mathbf{X})) = \int_{-\infty}^{\infty}\left[\frac{\partial}{\partial x}f(x;\theta)\right]^{2}\frac{1}{f(x;\theta)}\,dx = \int_{-\infty}^{\infty} f(z;\lambda^{*})\left[z-\lambda^{*}\zeta(\lambda^{*}z)\right]^{2}dz = \int_{-\infty}^{\infty} z^{2}f(z;\lambda^{*})\,dz - 2\lambda^{*}\!\int_{-\infty}^{\infty} z\,\zeta(\lambda^{*}z)\,f(z;\lambda^{*})\,dz + [\lambda^{*}]^{2}\!\int_{-\infty}^{\infty}\zeta(\lambda^{*}z)^{2}f(z;\lambda^{*})\,dz = \int_{-\infty}^{\infty} z^{2}f(z;\lambda^{*})\,dz - 4\lambda^{*}\!\int_{-\infty}^{\infty} z\,\phi(\lambda^{*}z)\,\phi(z)\,dz + 2[\lambda^{*}]^{2}\!\int_{-\infty}^{\infty}\frac{\phi(z)\,\phi(\lambda^{*}z)^{2}}{\Phi(\lambda^{*}z)}\,dz, \tag{22}$$
where $\zeta(x)=\phi(x)/\Phi(x)$ is the zeta function. In Equation (22), the first and second terms are the second moment of a standardized skew–normal random variable ($E[Z^{2}]=1$) and (proportional to) the first moment of a standardized normal random variable ($E[R]=0$, $R\sim N(0,1)$), respectively. The third term is, using the symmetry $z\to -z$ for the integral over $(-\infty,0)$,
$$\int_{-\infty}^{\infty}\frac{\phi(z)\,\phi(\lambda^{*}z)^{2}}{\Phi(\lambda^{*}z)}\,dz = \int_{0}^{\infty}\frac{\phi(z)\,\phi(\lambda^{*}z)^{2}}{\Phi(\lambda^{*}z)}\,dz + \int_{0}^{\infty}\frac{\phi(z)\,\phi(\lambda^{*}z)^{2}}{1-\Phi(\lambda^{*}z)}\,dz = \int_{0}^{\infty}\frac{\phi(z)\,\phi(\lambda^{*}z)^{2}}{\Phi(\lambda^{*}z)\,[1-\Phi(\lambda^{*}z)]}\,dz.$$
The following approximation for normal densities (see p. 83 of [15]),
$$\frac{\phi(y)^{2}}{\Phi(y)\,[1-\Phi(y)]} \approx 2\pi b^{2}\,\phi(b^{2}y)^{2}, \qquad y\in\mathbb{R},$$
together with some basic algebraic operations on normal densities, is useful to approximate the third term of Equation (22) as
$$\int_{0}^{\infty}\frac{\phi(z)\,\phi(\lambda^{*}z)^{2}}{\Phi(\lambda^{*}z)\,[1-\Phi(\lambda^{*}z)]}\,dz \approx 2\pi b^{2}\int_{0}^{\infty}\phi(z)\,\phi(b^{2}\lambda^{*}z)^{2}\,dz = \frac{2\pi b^{2}}{2\pi\sqrt{1+2b^{4}[\lambda^{*}]^{2}}}\int_{0}^{\infty}\phi\!\left(z;0,\{1+2b^{4}[\lambda^{*}]^{2}\}^{-1}\right)dz = \frac{b^{2}}{2\sqrt{1+2b^{4}[\lambda^{*}]^{2}}}.$$
Given that $\lambda^{*}=\sqrt{n}\,\lambda$, we obtain
$$J(d_1(\mathbf{X})) \approx 1 + \frac{n\,(b\lambda)^{2}}{\sqrt{1+2nb^{4}\lambda^{2}}}.$$
Using Equation (19), $J(d_2(\mathbf{X}))$ is computed as
$$J(d_2(\mathbf{X})) = \int_{0}^{\infty}\left[\frac{\partial}{\partial x}f_{d_2}(x;\mu,\sigma)\right]^{2}\frac{1}{f_{d_2}(x;\mu,\sigma)}\,dx = \int_{0}^{\infty}f_{d_2}(x;\mu,\sigma)\left[\frac{n-2}{x}-\frac{x}{c_n^{2}}\right]^{2}dx = (n-2)^{2}\underbrace{\int_{0}^{\infty}x^{-2}f_{d_2}(x;\mu,\sigma)\,dx}_{M_1} - \frac{2(n-2)}{c_n^{2}} + \frac{1}{c_n^{4}}\underbrace{\int_{0}^{\infty}x^{2}f_{d_2}(x;\mu,\sigma)\,dx}_{M_2} = \frac{(n-2)^{2}}{(n-3)\,c_n^{2}} - \frac{2(n-2)}{c_n^{2}} + \frac{n-1}{c_n^{2}} = \frac{2n-5}{(n-3)\,c_n^{2}},$$
where Equation (3.381.4) of [20] is applied to solve the integrals $M_1$ and $M_2$ (which requires $n>3$).    □
Remark 2.
Considering the same argument as in Remark 1, it can be noted that, for large samples, the inequalities of Propositions 4 and 5 are only affected by $d_1(\mathbf{X})$; that is, we obtain
$$\frac{1}{\mathrm{Var}[d_1(\mathbf{X})]} \le J(\mu) \le J(d_1(\mathbf{X})).$$
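Before turning to the simulations, the closed-form approximation for $J(d_1(\mathbf{X}))$ in Proposition 5 can be compared with a direct quadrature of the standardized skew–normal Fisher information; the sketch below (base R only, not from the paper) performs that comparison.

```r
# Compare J(d_1) ~ 1 + n*(b*lambda)^2 / sqrt(1 + 2*n*b^4*lambda^2) (Proposition 5)
# with a numerical evaluation of the Fisher information of the standardized
# skew-normal density f(z) = 2*phi(z)*Phi(lambda_star*z), lambda_star = sqrt(n)*lambda.
J_sn_numeric <- function(lstar) {
  f  <- function(z) 2 * dnorm(z) * pnorm(lstar * z)
  df <- function(z) 2 * (-z * dnorm(z) * pnorm(lstar * z) +
                           lstar * dnorm(z) * dnorm(lstar * z))
  integrate(function(z) { fz <- f(z); ifelse(fz > 0, df(z)^2 / fz, 0) },
            -Inf, Inf)$value
}

b <- sqrt(2 / pi)
J_approx <- function(n, lambda) 1 + n * (b * lambda)^2 / sqrt(1 + 2 * n * b^4 * lambda^2)

n <- 10; lambda <- 1
c(numeric = J_sn_numeric(sqrt(n) * lambda), approx = J_approx(n, lambda))
```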

4. Simulations

All location parameter estimators, variances, Fisher information values and differential entropies were calculated with the R software [21]. Skew–normal samples were drawn using the stochastic representation (3) and the rsn function of the sn package. All R code used in this paper is available upon request from the corresponding author.
In general, $\tau$ takes a value between 0 and 1. If $\tau$ is close to 0, the sample has low variability; if it is close to 1, the sample has high variability and the mean loses reliability (for example, if $\tau>0.3$, the mean is less representative of the sample). Sometimes, when $\mu$ is close to zero, $\tau$ takes high values (high variability) and could exceed unity. Therefore, for illustrative purposes, all simulations consider a set of coefficients of variation $\tau=0.1,\dots,1$; positive asymmetry parameters $\lambda=0.1,\dots,5$; sample sizes $n=10$ and 250; and theoretical location parameters $\mu=0.1$, 0.5 and 1. For the computation of the information measures, $\sigma$ is replaced by $\tau|\mu|$ and the location parameter $\mu$ is evaluated at its respective estimators.
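A minimal sketch of the simulation grid just described (the paper's full R code is available from the author on request); the step sizes of the $\tau$ and $\lambda$ sequences are assumptions for illustration.

```r
# Parameter grid for the simulations, with sigma replaced by tau*|mu|.
# Illustrative sketch of the setup only; step sizes are assumed.
grid <- expand.grid(tau    = seq(0.1, 1, by = 0.1),
                    lambda = seq(0.1, 5, by = 0.1),
                    n      = c(10, 250),
                    mu     = c(0.1, 0.5, 1))
grid$sigma <- grid$tau * abs(grid$mu)
head(grid)
```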
The MSE of $T_{LSE}(\mathbf{X})$ is given in Equation (6), and the MSE of $T_{BUE}(\mathbf{X})$ is the variance of this (unbiased) estimator (see Equation (16)). Without loss of generality, $\tau=1$ is considered in Figure 1 because the same pattern is repeated for values of $\tau$ between 0 and 1. Comparing the MSEs of both estimators, Figure 1 shows that, in all cases, the differences between MSEs tend to increase for large values of $\mu$, and the MSEs stabilize around a specific value when the sample size increases. Moreover, the MSEs of the unbiased estimator are smaller than those obtained by the LSE, i.e., the BUE dominates the LSE (Equation (11.263) of [1]). Therefore, the analysis focuses on the BUE in what follows.
The behavior of the differential entropy bounds given in Propositions 2 and 3 is illustrated in Figure 2 as 3D plots. Without loss of generality, $\tau\in(0,1]$ is considered in Figure 2 because the same pattern is repeated for values of $\tau>1$, i.e., the entropies keep increasing. The upper bound corresponds to the minimum of the bounds given in Proposition 2(i) and (ii), which is the one given in (ii); thus, the upper bound of $H(T_{BUE}(\mathbf{X}))$ is determined by the variance (or MSE) of the estimator. In contrast, the lower bound corresponds to the maximum of the bounds given in Proposition 3(i) and (ii) [17].
Large sample sizes ($n=250$) imply that $\alpha\approx 0$ and the lower bounds only depend on $H(d_1(\mathbf{X}))$. For small sample sizes ($n=10$), $\alpha$ can take an intermediate value in the interval (0,1); thus, the lower bounds depend on both $H(d_1(\mathbf{X}))$ and $H(d_2(\mathbf{X}))$. For $n=10$, the surfaces are rough, given the randomness of the bounds produced by the small sample. When $\lambda\to 0$ (symmetry condition), the bounds decay to negative values. This is analogous to considering the skew–normal density as a non-stationary process [15]: when $\lambda$ is near zero, the Hurst exponent decreases abruptly [8]. On the other hand, for $n=250$, the surfaces are smooth and the bounds increase slightly for large $\lambda$. In all cases, the information increases when $\tau$ tends to 1 because this produces more variability in the samples.
For practical purposes, the average of the two bounds can be taken as an approximation of the differential entropy [7], in a similar way to the average length of a confidence interval [3]. Given that all lower bounds of the differential entropy depend on the entropy of $d_1(\mathbf{X})$, which in turn depends on the variance and the sample size, they can take negative values and tend to zero when $\tau$ tends to 1. Therefore, the difference between the lower and upper bounds could increase, yielding an inadequate approximation when the lower bound is negative. For this reason, attention turns next to the Fisher information, which only takes positive values.
The Fisher information bounds given in Propositions 4 and 5 are illustrated in Figure 3 as 3D and 2D plots, respectively. As in the differential entropy case, and without loss of generality, $\tau\in(0,1]$ is considered in Figure 3 because the same pattern is repeated for values of $\tau>1$, i.e., the bounds keep decreasing. Following the Cramér–Rao theorem, the lower bound corresponds to the reciprocal of the variance of the BUE. In contrast, the upper bound corresponds to a combination of the Fisher information of $d_1(\mathbf{X})$ and $d_2(\mathbf{X})$.
As in the differential entropy case, large sample sizes ($n=250$) imply that $\alpha\approx 0$ and the bounds only depend on $d_1(\mathbf{X})$, as mentioned in Remark 2. For small sample sizes ($n=10$), $\alpha$ can take an intermediate value in the interval (0,1); thus, the bounds depend on both $d_1(\mathbf{X})$ and $d_2(\mathbf{X})$. When $\tau\to 0$ (low variability condition), the lower bounds take the highest values. This reciprocal relationship is determined by the Cramér–Rao theorem: more variability, less Fisher information. In addition, the 2D plot shows that the smallest upper bounds of $J(T_{BUE}(\mathbf{X}))$ are produced when $\lambda\to 0$ [9]. Given that the upper bounds do not depend on $\tau$ and $\mu$ (because the skew–normal densities are standardized), these measures are illustrated with respect to n and $\lambda$. When $\lambda$ and n increase simultaneously, the upper bounds of the Fisher information take the largest values.

5. Concluding Remarks

In this paper, some properties of the best unbiased estimator proposed by [3] were presented using classic theorems of information theory, which provide a way to measure the uncertainty of location parameter estimates. Given that the BUE dominates the LSE, this paper focused on the former estimator. Inequalities based on differential entropy and Fisher information allowed lower and upper bounds for these measures to be obtained. Some simulations illustrated the behavior of the differential entropy and Fisher information bounds.
Classical theorems of information theory were considered to obtain additional properties of unbiased location parameter estimators. However, these theorems could be applied to other estimators, such as Bayesian [22] (as long as the prior density is known), shrinkage [23], or bootstrap-based [24] ones. The assumption that the sample comes from a multivariate skew–normal distribution is strong and not always applicable in the real world, so the properties revised here could be extended to more complex densities, for example, those that accommodate bimodality and heavy tails in the data [7,11,13,19]. On the other hand, given that Fisher information bounds under skew–normal settings were considered in this study, further work could focus on developing a time-dependent Fisher information for the skew–normal density [25], which could be applied to real data in survival analysis.

Funding

This research was funded by FONDECYT (Chile) grant number 11190116.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The author thanks the editor and three anonymous referees for their helpful comments and suggestions.

Conflicts of Interest

The author declares that there is no conflict of interest in the publication of this paper.

References

  1. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley & Sons, Inc.: New York, NY, USA, 2006.
  2. Adcock, C.; Azzalini, A. A selective overview of skew-elliptical and related distributions and of their applications. Symmetry 2020, 12, 118.
  3. Wang, Z.Y.; Wang, C.; Wang, T.H. Estimation of location parameter in the skew normal setting with known coefficient of variation and skewness. Int. J. Intel. Technol. Appl. Stat. 2016, 9, 191–208.
  4. Trafimow, D.; Wang, T.; Wang, C. From a sampling precision perspective, skewness is a friend and not an enemy! Educ. Psychol. Meas. 2019, 79, 129–150.
  5. Wang, C.; Wang, T.; Trafimow, D.; Myüz, H.A. Necessary sample sizes for specified closeness and confidence of matched data under the skew normal setting. Comm. Stat. Simul. Comput. 2019, in press.
  6. Wang, C.; Wang, T.; Trafimow, D.; Talordphop, K. Estimating the location parameter under skew normal settings: Is violating the independence assumption good or bad? Soft Comput. 2021, 25, 7795–7802.
  7. Abid, S.H.; Quaez, U.J.; Contreras-Reyes, J.E. An information-theoretic approach for multivariate skew-t distributions and applications. Mathematics 2021, 9, 146.
  8. Contreras-Reyes, J.E. Analyzing fish condition factor index through skew-gaussian information theory quantifiers. Fluct. Noise Lett. 2016, 15, 1650013.
  9. Contreras-Reyes, J.E. Fisher information and uncertainty principle for skew-gaussian random variables. Fluct. Noise Lett. 2021, 20, 21500395.
  10. Tumminello, M.; Lillo, F.; Mantegna, R.N. Correlation, hierarchies, and networks in financial markets. J. Econ. Behav. Organ. 2010, 75, 40–58.
  11. Contreras-Reyes, J.E. An asymptotic test for bimodality using the Kullback–Leibler divergence. Symmetry 2020, 12, 1013.
  12. Contreras-Reyes, J.E.; Kahrari, F.; Cortés, D.D. On the modified skew-normal-Cauchy distribution: Properties, inference and applications. Comm. Stat. Theor. Meth. 2021, 50, 3615–3631.
  13. Contreras-Reyes, J.E.; Maleki, M.; Cortés, D.D. Skew-Reflected-Gompertz information quantifiers with application to sea surface temperature records. Mathematics 2019, 7, 403.
  14. Dembo, A.; Cover, T.M.; Thomas, J.A. Information theoretic inequalities. IEEE Trans. Inform. Theory 1991, 37, 1501–1518.
  15. Azzalini, A. The Skew-Normal and Related Families; Cambridge University Press: Cambridge, UK, 2013; Volume 3.
  16. Madiman, M.; Barron, A. Generalized entropy power inequalities and monotonicity properties of information. IEEE Trans. Inform. Theory 2007, 53, 2317–2329.
  17. Xie, Y. Sum of Two Independent Random Variables. ECE587, Information Theory, 2012. Available online: https://www2.isye.gatech.edu/~yxie77/ece587/SumRV.pdf (accessed on 15 January 2022).
  18. Azzalini, A.; Dalla-Valle, A. The multivariate skew-normal distribution. Biometrika 1996, 83, 715–726.
  19. Contreras-Reyes, J.E. Asymptotic form of the Kullback–Leibler divergence for multivariate asymmetric heavy-tailed distributions. Phys. A Stat. Mech. Appl. 2014, 395, 200–208.
  20. Gradshteyn, I.S.; Ryzhik, I.M. Table of Integrals, Series, and Products, 7th ed.; Academic Press/Elsevier: London, UK, 2007.
  21. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2021. Available online: http://www.R-project.org (accessed on 15 January 2022).
  22. Bayes, C.L.; Branco, M.D. Bayesian inference for the skewness parameter of the scalar skew-normal distribution. Braz. J. Prob. Stat. 2007, 21, 141–163.
  23. Kubokawa, T.; Strawderman, W.E.; Yuasa, R. Shrinkage estimation of location parameters in a multivariate skew-normal distribution. Comm. Stat. Theor. Meth. 2020, 49, 2008–2024.
  24. Ye, R.; Fang, B.; Wang, Z.; Luo, K.; Ge, W. Bootstrap inference on the Behrens–Fisher-type problem for the skew-normal population under dependent samples. Comm. Stat. Theor. Meth. 2021, in press.
  25. Kharazmi, O.; Asadi, M. On the time-dependent Fisher information of a density function. Braz. J. Prob. Stat. 2018, 32, 795–814.
Figure 1. Mean square errors (MSE) for $T_{LSE}(\mathbf{X})$ [blue dots] and $T_{BUE}(\mathbf{X})$ [red dots], considering $\tau=1$ and several skewness ($\lambda$) and location ($\mu$) parameters in the simulations.
Figure 2. Differential entropy bounds for $T_{BUE}(\mathbf{X})$, considering $n=10$ and 250; $\mu=0.1$, 0.5 and 1; and several skewness ($\lambda$) and coefficient of variation ($\tau$) parameters in the simulations.
Figure 3. Fisher information lower bounds for $T_{BUE}(\mathbf{X})$, considering $n=250$; $\mu=1$, 2.5 and 5; and several skewness ($\lambda$) and coefficient of variation ($\tau$) parameters in the simulations. The fourth panel shows the upper bounds for $T_{BUE}(\mathbf{X})$ considering $n=100,\dots,1000$ and several skewness parameters $\lambda$.