1. Introduction and Basic Model
The sample selection problem appears often in empirical studies of labor supply, individuals' wages, and other topics. For small sample sizes, the existing parametric [1] and semi-parametric estimators [2,3,4,5] have difficulties. Recently, [6], henceforth GMP, developed a semi-parametric, Information-Theoretic (IT) estimator for the sample-selection problem that performs well when the sample is small. This estimator is based on the IT generalized maximum entropy (GME) approach of [7] and [8]. GMP used a large number of sampling experiments to investigate and compare the small-sample behavior of their estimator relative to other estimators. GMP concluded that their IT estimator is the most stable estimator, while the likelihood estimators predicted better within the sample for large enough samples. Their IT estimator outperformed the AP estimator (the semi-parametric estimator of [5]) in most cases and in all small samples. Another set of experiments, within the nonlinear framework, appears in [9].
GMP specified their IT-GME model with bounds on the parameters and with finite and discrete support. Although their IT-GME estimator performs relatively well, it retains some of the basic shortcomings of the GME approach: it imposes finite and bounded supports for both signal and noise, it is not flexible enough to incorporate infinitely large bounds or continuous support spaces, and it is constructed as a constrained optimization estimator. The objective here is to extend the estimator discussed in GMP in three directions. First, we allow unbounded support spaces for all parameters. Second, we accommodate a whole class of (discrete and continuous) priors. Third, we construct our estimator as an unconstrained concentrated model.
1.1. The Basic Sample Selection Model
For simplicity, we follow a common labor model discussed in [10]. Suppose individual $h$ ($h=1,\dots,N$) values staying (working) at home at wage $w^*_{1h}$ and can earn $w^*_{2h}$ in the marketplace. If $w^*_{2h} > w^*_{1h}$, the individual works in the marketplace, $y_{1h} = 1$, and we observe the market value, $y_{2h} = w^*_{2h}$. Otherwise, $y_{1h} = 0$ and $y_{2h} = 0$.
The individual's value at home or in the marketplace depends (linearly) on demographic characteristics ($x$):
$$w^*_{1h} = x^t_{1h}\beta_1 + \varepsilon_{1h}, \qquad (1)$$
$$w^*_{2h} = x^t_{2h}\beta_2 + \varepsilon_{2h}, \qquad (2)$$
where $x_{1h}$ and $x_{2h}$ are $K_1$- and $K_2$-dimensional vectors, $\beta_1$ and $\beta_2$ are $K_1$- and $K_2$-dimensional vectors of unknowns, $\varepsilon_{1h}$ and $\varepsilon_{2h}$ are unobserved errors, and "t" stands for "transpose". This model can be expressed as
$$y_{1h} = \begin{cases} 1 & \text{if } w^*_{2h} > w^*_{1h} \\ 0 & \text{otherwise,} \end{cases} \qquad (3)$$
$$y_{2h} = y_{1h}\left(x^t_{2h}\beta_2 + \varepsilon_{2h}\right). \qquad (4)$$
Our objective is to estimate β1 and β2. Typically the researcher is interested primarily in β2.
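To fix ideas, the following minimal simulation generates data from the censored system (1)-(4); the normal errors, coefficient values, and sample size are illustrative assumptions rather than part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K1, K2 = 200, 3, 3

# Illustrative covariates and coefficients (assumed values).
x1 = rng.normal(size=(N, K1))
x2 = rng.normal(size=(N, K2))
beta1 = np.array([0.5, -0.2, 0.1])
beta2 = np.array([1.0, 0.3, -0.4])

# Latent home and market values, eqs. (1)-(2), with assumed normal errors.
w1 = x1 @ beta1 + rng.normal(scale=1.0, size=N)
w2 = x2 @ beta2 + rng.normal(scale=1.0, size=N)

# Censoring rule, eqs. (3)-(4): the market value is observed only for workers.
y1 = (w2 > w1).astype(int)
y2 = y1 * w2
```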
Unlike the more traditional models, GMP constructed their model as the solution to a constrained optimization problem in which the information represented by the set of censored equations (3)-(4) enters the estimation as inequality constraints. In our formulation, we use inequalities as well to represent all available information in the set of censored equations.
2. The Information-Theoretic Estimator
Rewrite equations (1)-(2) as the problem of finding $\gamma_1$ and $\gamma_2$ in the stacked systems $w^*_i = A_i\gamma_i$, where $A_i = [X_i \;\; I_N]$, $\gamma_i = (\beta_i^t, \varepsilon_i^t)^t$, and the dependent variable is censored. We formulate the censored model (5)-(7) in the following way.
Let the constraint sets be $C_{i,s}$ and $C_{i,n}$, $i=1,2$. For each $i$, $C_{i,s}$ is an auxiliary closed, convex set used to model the a priori constraints on the β's. Similarly, the closed convex set $C_{i,n}$ is part of the specification of the "physical" nature of the noise and contains all possible realizations of ε. We view the coordinates of the signal and noise vectors as values of random variables, distributed according to some probability measure on $C_{i,s}\times C_{i,n}$, such that their expectations (E) equal the unknown $\beta_i$ and $\varepsilon_i$.
We note that the qualifier "prior" assigned to the Q probability measures is not used in the traditional Bayesian sense. Rather, $Q_s$ is just a mathematical construct that transforms the estimation problem into a variational problem. $Q_n$, however, could be viewed as the probability measure describing the statistical nature of the noise; the process of estimating the noise involves a tilting of this prior measure.
Given some (any) prior measures, we search for post-data densities whose implied expectations satisfy the system (3)-(4). This yields the parameter estimates $\hat\beta_i$ and the estimated residuals $\hat\varepsilon_i$.
Next, let
$$D(P_i\|Q_i) = \int p_i(z)\,\log\frac{p_i(z)}{q_i(z)}\,dz$$
denote the differential entropy divergence measure between the priors, $Q_i$, and the post-data (posterior) measures $P_i$, with densities $q_i$ and $p_i$. This is just the continuous version of the Kullback-Leibler information divergence measure, also known as relative entropy (see [11,12,13]). Since the data are naturally divided into observed and unobserved parts, we divide the indices into two subsets, $J$ and $J^c$, of $\{1,2,\dots,N\}$. Next, rewrite the data (3)-(4) in terms of the stacked systems $A_1\gamma_1$ and $A_2\gamma_2$, keeping the rows $B_1\gamma_1$ and $B_2\gamma_2$, where the matrices $B_1$ and $B_2$ correspond to the rows of the matrices $A_i$ ($i=1,2$) labeled by the indices for which observations are available. For the indices in $J$ the values $y_2$ are observed and $B_2\gamma_2 = y_2$, whereas for the indices in $J^c$ all we know is that $w^*_{2h} \le w^*_{1h}$.
Our "Basic (Primal) Problem" is the solution to the minimization of the total divergence $D(P_1\|Q_1) + D(P_2\|Q_2)$ subject to the observed-data equalities and the selection inequalities described above, where the inequalities between vectors are taken to be componentwise.
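For concreteness, a stylized rendering of this primal under the stacked notation is given below; the exact grouping of the equality and inequality constraints is our reading of the censored system, not necessarily GMP's exact display.
$$
\min_{P_1,\,P_2}\; D(P_1\|Q_1)+D(P_2\|Q_2)
\quad\text{s.t.}\quad
\begin{aligned}
& B_2\,E_{P_2}[\gamma_2]=y_2^{J},\\
& B_2\,E_{P_2}[\gamma_2]\;\ge\; B_1\,E_{P_1}[\gamma_1],\\
& A_2^{J^c}\,E_{P_2}[\gamma_2]\;\le\; A_1^{J^c}\,E_{P_1}[\gamma_1],
\end{aligned}
$$
where $A_i^{J^c}$ collects the rows of $A_i$ indexed by $J^c$.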
Next, we formulate the problem as a concentrated (unconstrained) entropy problem. To do so, we view the basic primal problem as a two-stage problem, which we call an "equivalent primal problem." In the equivalent model, the first stage consists of the standard generalized entropy problem (the equality portion of the model), for which a dual can easily be formulated. The equivalent primal problem, (11), is the solution to this two-stage optimization problem.
Theorem 2.1. The equivalent primal problem (11) is equivalent to a concentrated (dual) minimization problem, (12), posed over the Lagrange multipliers of (11). Here $\langle a,b\rangle$ denotes the Euclidean scalar (inner) product of the vectors $a$ and $b$, and the four sets of Lagrange multipliers are those associated with the equality and inequality constraints of (11). To carry out the procedure specified in (12), first substitute the required multiplier values and then carry out the minimization.
To confirm the uniqueness of the solution to problem (12), observe that the objective function of (12) is strictly convex on its domain $\Psi$; if, in addition, it diverges to $+\infty$ as its argument approaches $\partial\Psi$, the boundary of the set $\Psi$, then problem (12) has a unique solution. This is always true in the cases we consider here. A simple example in which the condition fails is a strictly convex function whose infimum is approached only as the argument diverges; such a function has no minimum for a positive $y$.
Solving (12) yields the optimal Lagrange multipliers, which in turn yield the optimal maximum entropy (posterior) density. This density factors naturally into a product of the maximum entropy densities of the two sets of equations. Therefore, $\gamma_1$ and $\gamma_2$ are independent with respect to the reconstructed density, just as they are with respect to the original priors. Once (12) is solved, we follow (8), or (9), to obtain the estimated parameters and residuals.
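In practice, (12) is a smooth optimization over the Lagrange multipliers and can be handed to a standard numerical routine. The sketch below is schematic rather than the exact procedure of (12): the objective, the placeholder data, and the normal log-partition are assumptions used only to illustrate the mechanics; bounds on the multipliers of the inequality constraints would be added in the same call.

```python
import numpy as np
from scipy.optimize import minimize

def concentrated_objective(lam, y_obs, B, log_partition):
    """Schematic concentrated (dual) entropy objective: the prior's
    log-partition evaluated at B' lam, minus the data term <lam, y>."""
    t = B.T @ lam
    return log_partition(t) - lam @ y_obs

def normal_log_partition(t, sigma2=1.0):
    """Log-partition of an assumed iid zero-mean normal prior."""
    return 0.5 * sigma2 * np.sum(t ** 2)

# Illustrative placeholder data.
rng = np.random.default_rng(1)
B = rng.normal(size=(50, 10))       # rows = observed constraints
y_obs = rng.normal(size=50)

res = minimize(
    concentrated_objective,
    x0=np.zeros(50),
    args=(y_obs, B, normal_log_partition),
    method="L-BFGS-B",              # bounds=... would restrict inequality multipliers
)
lam_hat = res.x                     # optimal multipliers; estimates then follow
```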
With that generic formulation, we show below three analytic examples that cover a wide range of possible priors and support spaces for β and ε.
3. Large Sample Properties
Denote by $\hat\beta_{i,N}$ the estimator of the true $\beta_i$ when the sample size is $N$. Throughout this section we add a subscript $N$ to all quantities introduced in Section 2 to remind us that the size of the data set is $N$. We show that the estimator is consistent and asymptotically normal as $N$ grows in an appropriate way. The proof is similar in logic to that of Proposition 3.2 in [14]. We assume:
Assumption 3.1. For every sample size $N$, the minimizers of (12) all lie in the interior ("int") of their domains.
Assumption 3.2. As $N$ grows, the appropriately scaled data moments converge: (i) there exist finite limits for the scaled cross-products of the covariates, and (ii) there exist two limiting matrices to which the corresponding sample matrices converge.
Proposition 3.1. (Convergence in distribution.) Under Assumptions 3.1 and 3.2,
- a) the estimators converge to their true values as $N\to\infty$, for $i=1, 2$;
- b) the appropriately centered and scaled estimator converges in distribution to a normal limit as $N\to\infty$,

where the limiting covariance matrix is the covariance of the underlying random variables with respect to the reconstructed (posterior) measure.
The approximate finite-sample variance, for $i = 1, 2$, is obtained from the corresponding posterior covariance, as shown in (8) or, similarly, in (12).
4. Analytic Examples
We discuss three examples, corresponding to assuming that the β's are either unbounded (normal priors), bounded below (gamma priors), or bounded both below and above (Bernoulli priors). Under the normal priors, the minimum described in (12) can be computed explicitly. In the other cases, a numerical computation is necessary.
4.1. Normal Priors
Let the constraint space be the whole Euclidean space, so that no bounds are imposed on the β's or the ε's. Using the traditional view and centering the support spaces at zero, the prior is a product of two normal distributions, one for the signal and one for the noise. The covariance has two diagonal blocks, one for the signal and one for the noise. Without loss of generality, we assume that these two blocks are $\sigma^2_\beta I_K$ and $\sigma^2_\varepsilon I_N$, respectively. Our basic model holds for a general covariance structure.
Formulating these priors within our model yields a prior covariance $\Sigma$ that is a diagonal $(K + N) \times (K + N)$ matrix, the first block being a $K \times K$ matrix with diagonal entries equal to $\sigma^2_\beta$ and the second block an $N \times N$ matrix with diagonal entries equal to $\sigma^2_\varepsilon$. That is, the priors on the signal and noise spaces are iid normal random variables. Thus, problem (12) consists of finding the minimum of the resulting quadratic concentrated objective over the set described in (12).
To verify that the minimizer of (12) occurs in the interior of the constraint set, we examine the first-order conditions. A feasible solution that satisfies them lies inside the domain of the constraints and provides the solution.
Once the system is solved for the optimal multipliers, the estimated densities follow and, as expected, are normally distributed. Using (13), and recalling the equality constraints, the estimates for $i = 1, 2$ are obtained as the means of these densities, expressed in terms of the data matrices and the prior variances.
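Because the normal-prior objective is quadratic in the multipliers, the equality (observed) block has a closed form. The sketch below is illustrative only: it treats just the equality block, with assumed prior variances `sigma2_beta` and `sigma2_eps`, and ignores the selection inequalities, which would require a constrained minimization.

```python
import numpy as np

def normal_prior_equality_block(X, y, sigma2_beta=1.0, sigma2_eps=1.0):
    """Closed-form solution of the quadratic dual for the observed (equality)
    block under iid zero-mean normal priors on beta and epsilon.

    First-order condition: y = (sigma2_beta * X X' + sigma2_eps * I) lam,
    then beta_hat = sigma2_beta * X' lam and eps_hat = sigma2_eps * lam.
    """
    n = X.shape[0]
    M = sigma2_beta * (X @ X.T) + sigma2_eps * np.eye(n)
    lam = np.linalg.solve(M, y)
    beta_hat = sigma2_beta * (X.T @ lam)
    eps_hat = sigma2_eps * lam
    return beta_hat, eps_hat, lam

# Illustrative use with simulated observed data.
rng = np.random.default_rng(2)
X = rng.normal(size=(65, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=65)
beta_hat, eps_hat, lam_hat = normal_prior_equality_block(X, y)
```

The resulting estimator is a ridge-type shrinkage rule, with the ratio `sigma2_eps / sigma2_beta` acting as the penalty.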
4.2. Gamma Priors
Let the β's be bounded below by 0. This can easily be generalized by an appropriate shifting of the support of the distributions. To show the generality of our model, we let the prior on the noise be normal, thereby showing that one can use different priors for the signal and the noise.
The signal and noise constraint spaces are, respectively, the nonnegative orthant for the β's and the whole Euclidean space for the noise. The prior is a product of gamma densities on the signal coordinates and normal densities on the noise coordinates.
Before specifying the concentrated entropy function, we study the matrix $A_1$, defined as $A_1 = [X_1 \;\; I_N]$. Note that the partition of the rows into $J$ and $J^c$ splits $X_1$ and splits the $N \times N$ identity matrix to match the splitting of the data. The concentrated entropy function combines a gamma (log-partition) term for the signal coordinates with a normal (quadratic) term for the noise coordinates; a similar expression exists for the second equation.
The problem (12) then consists of minimizing this concentrated objective over the admissible multipliers. Once the optimal multipliers are found, the optimal density is the correspondingly tilted gamma-normal product, and the estimated parameters and realized residuals are obtained as its means.
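For reference, the building block behind the gamma part of such a concentrated objective is the log-moment-generating function of a gamma prior. Writing the prior on a signal coordinate as a gamma distribution with shape $\alpha$ and scale $\theta$ (our notation, introduced for illustration), it is
$$\log E\!\left[e^{tZ}\right] = -\alpha \log\!\left(1 - \theta t\right), \qquad t < \tfrac{1}{\theta},$$
so each gamma term restricts the admissible multipliers to $t < 1/\theta$, one reason the minimization here must be carried out numerically.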
4.3. Bernoulli Priors
This example represents another extreme case in which it is assumed that the β's are bounded. For simplicity, assume that we know that all β's lie in the interval [a, b], which makes this interval the natural choice for all of the constraints on the signal space. For the noise component, we follow the previous formulation of normal priors. With this background, the prior measure used is a product of Bernoulli measures supported on the endpoints $\{a, b\}$ for the signal coordinates and normal measures for the noise coordinates.
The concentrated entropies combine the Bernoulli (log-partition) terms for the signal coordinates with the normal (quadratic) terms for the noise coordinates. In this case, the function to be minimized is the resulting concentrated objective, minimized over the region described in (12). Again, the optimal solution is to be found numerically. The estimated post-data probabilities follow for $i = 1, 2$, from which the estimated parameters and residuals are obtained as posterior means.
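For reference, the Bernoulli building block has a simple closed form. With prior weights $q_a$ and $q_b = 1 - q_a$ on the endpoints $a$ and $b$ (the weights are our illustrative notation), the log-partition and the implied post-data mean of a signal coordinate are
$$\log E\!\left[e^{tZ}\right] = \log\!\left(q_a e^{at} + q_b e^{bt}\right),
\qquad
E_P[Z] = \frac{a\,q_a e^{at} + b\,q_b e^{bt}}{q_a e^{at} + q_b e^{bt}},$$
so the post-data mean always lies inside $[a, b]$, the bounded-support counterpart of the discrete-uniform supports used in IT-GME.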
5. Empirical Example
We illustrate the applicability of our approach using an empirical application consisting of a small data set. The objective here is to demonstrate that our IT estimator is easy to apply and can be used with many different priors. The small-sample performance of the IT-GME version of this estimator (uniform discrete priors) and detailed comparisons with other competing estimators are already shown in GMP and fall outside the objectives of this note. The empirical example is based on one of the examples analyzed in GMP, with data drawn from the March 1996 Current Population Survey. We estimated the wage-participation model for the subset of respondents in the labor market. Workers who are self-employed are excluded from the sample. Since the normal maximum likelihood estimator did not converge for these data [15], only results for the OLS, the Heckman two-step, a semi-parametric estimator with a nonparametric selection mechanism due to [5] (AP), and the different IT models developed here are reported [16]. To make our results comparable across the IT estimators, we use the empirical standard deviations in all three cases and use supports between –100 and 100 for the IT-GME (uniform discrete priors) and the IT-Bernoulli cases. In both the IT-Normal and IT-Bernoulli cases the priors used for the noise components are normal (as shown in Section 4). Under these very similar specifications, we would expect all three IT examples to yield comparable estimates. Naturally, there are many other priors to choose from, but the objective here is just to show the flexibility and applicability of our approach.
We analyze a sample of 151 Native American females, of whom 65 are in the labor force. The wage equation covariates include years of education, a dummy for being currently enrolled in school, potential experience (age - education - 6) and potential experience squared, a dummy for rural location, and a dummy for central city location. The covariates in the selection equation include all the variables in the wage equation plus the amount of welfare payments received in the previous year, a dummy equal to one for married, and the number of children. We use these three exclusion restrictions to identify the wage equation in the parametric and nonparametric two-step approaches.
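As a small illustration of the covariate construction just described, the snippet below builds the wage-equation regressors; the data frame and column names (`age`, `educ`, and so on) are hypothetical placeholders rather than the CPS extract used here.

```python
import pandas as pd

def build_wage_covariates(df: pd.DataFrame) -> pd.DataFrame:
    """Construct the wage-equation covariates described in the text.
    Column names are hypothetical placeholders."""
    out = pd.DataFrame(index=df.index)
    out["education"] = df["educ"]
    out["enrolled"] = df["enrolled"].astype(int)        # currently in school
    out["experience"] = df["age"] - df["educ"] - 6      # potential experience
    out["experience_sq"] = out["experience"] ** 2
    out["rural"] = df["rural"].astype(int)
    out["central_city"] = df["central_city"].astype(int)
    return out
```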
Table 1.
Estimates of the Native American wage equation (151 individuals; 65 in labor force).
| | OLS | 2-Step | AP | IT-GME | IT-Normal | IT-Bernoulli |
|---|---|---|---|---|---|---|
| Constant | 1.073 | 1.771 | NA | 1.038 | 1.049 | 1.068 |
| Education | 0.055 | 0.043 | 0.044 | 0.054 | 0.056 | 0.055 |
| Experience | 0.038 | 0.023 | 0.038 | 0.038 | 0.038 | 0.038 |
| Experience Squared | –0.001 | –0.0005 | –0.001 | –0.001 | –0.001 | –0.001 |
| Rural | 0.214 | 0.268 | 0.332 | 0.210 | 0.215 | 0.214 |
| Central City | –0.170 | –0.091 | –0.171 | –0.186 | –0.166 | –0.169 |
| Enrolled in School | –0.290 | –0.471 | –0.190 | –0.301 | –0.283 | –0.288 |
| λ | | –0.461 | | | | |
| R² | 0.355 | 0.376 | NA | 0.343 | 0.355 | 0.354 |
| MSPE | 0.157 | 0.135 | NA | 0.147 | 0.144 | 0.144 |
Table 1 presents the estimated coefficients for the wage equation. The R² and the Mean Squared Prediction Error (MSPE) for each model are presented as well. All IT estimators outperform the other estimators in terms of predicting selection [17]. The estimated return to education is about 5% across all estimation methods, but it is statistically significantly different from 0 only for the OLS and the IT estimators. Although all estimators yield estimated parameters of the same magnitude and sign, only the OLS and the three reported IT estimates are statistically significantly different from zero in most cases.
6. Conclusion
In this short paper we develop a simple-to-apply, information-theoretic method for analyzing nonlinear data with a sample selection problem. Rather than using a likelihood or a semi-parametric approach, we further generalized the IT-GME model of Golan, Moretti and Perloff (2004). Our model (i) allows for bounded and unbounded supports on all the unknown parameters, (ii) allows us to use a whole class of priors (continuous or discrete), (iii) is specified as a nonlinear concentrated entropy model, and (iv) is easy to apply. Like GMP, our model works well even with small data sets, as our empirical example shows. The extensions developed here mark a significant improvement over the GMP model and other IT, generalized entropy models.
A detailed set of sampling experiments comparing our IT method with all other competitors, under different data-generating processes, is left for future work.