#### **1. Introduction**

Ecological inference (EI) is the process of drawing conclusions about individual-level behavior from aggregate (historically called "ecological") data, when no individual data are available. Situations where the only available data are aggregated at a level other than the level of interest are common in many fields of application. This is the typical setting for Ecological Inference [1–3], Cross-level Inference [4,5], Small Area Estimation [6], and disaggregation methods [7]. The basic idea is that, although studying the behavior of individuals (or sub-groups of individuals) would ideally rest on a microeconomic analysis of fairly localized individual data, data aggregated by areal units can be used to investigate the behavior of the individuals comprising those units. In this paper, we specifically refer to the process of drawing conclusions about individual-level behavior from aggregate data, when individual data are unavailable or incomplete. In this inferential context, one problem is that many different possible relationships at the individual (or subgroup) level can generate the same observations at the aggregate (or group) level [8]. In the absence of individual (or subgroup) level measurements (in the form of survey data), such information needs to be inferred: estimates of the disaggregated values of the variable of interest can be obtained from aggregate data by using appropriate statistical techniques. However, in many situations, given that the micro-data of interest are not available, the accuracy of any predicted value cannot be verified. This research focuses on the estimation of disaggregated indicators by subclasses. Assume that we have an indicator, *yi*·, observable across the different areas *i* = 1, ... , *T*. Our objective is to disaggregate it into an indicator *yij* for the *j* = 1, ... , *K* different sub-categories (or sub-areas) that make up each class (or area) *i*. The information available for this inference exercise, together with the indicator *yi*·, is another disaggregated indicator *xij* that is related to the target indicator *yij*. This paper approaches this estimation problem in an attempt to unify two estimation strategies and is organized as follows. Section 2 explains the main features of the matrix adjustment following the ideas of the Generalized Cross Entropy (GCE) estimation introduced in [9], together with the basis of the Distributionally Weighted Regression (DWR) estimation. Section 3 studies these two strategies under a common approach and proposes a composite prior estimator in line with the Data Weighted Prior (DWP) proposed in [10,11]. The comparative performance of the three techniques is evaluated by means of a numerical experiment in Section 4. Finally, Section 5 presents the main conclusions of the paper.

#### **2. Matrix-Adjustment and Distributionally Weighted Regression Problems**

Within the family of IT estimators, [9] proposed a general solution to the estimation problem described in the introduction, based on the minimization of the divergence between the target variable and some prior information. Following this approach, each indicator *yij* is treated as a discrete random variable that can take *M* different values. Defining a support vector (for the sake of simplicity assumed to be common for all the *yij*) $z = [z\_1, z\_2, \ldots, z\_M]$ that contains the *M* possible realizations of the targets with unknown probabilities $p\_{ij} = [p\_{ij1}, p\_{ij2}, \ldots, p\_{ijM}]$, each *yij* can be written as:

$$y\_{ij} = \sum\_{m=1}^{M} p\_{ijm} z\_m \tag{1}$$
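For intuition, with made-up values *M* = 3, support $z = (0, 0.5, 1)$ and probabilities $p\_{ij} = (0.2, 0.5, 0.3)$, Equation (1) gives:

```latex
y_{ij} = 0.2 \cdot 0 + 0.5 \cdot 0.5 + 0.3 \cdot 1 = 0.55
```

so each *yij* is a convex combination of the support points and always lies between $z\_1$ and $z\_M$.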

Alternatively, this idea can be generalized in order to include an error term and define each *yij* as:

$$y\_{ij} = \sum\_{m=1}^{M} p\_{ijm} z\_m + \varepsilon\_{ij} \tag{2}$$

In such a case, we assume that the *yij* elements arise from two sources: a signal that keeps the resemblance to the priors *xij*, plus a noise term (ε*ij*). The noise components can be included in order to account for potential spatial heterogeneity and our uncertainty about the target variable. Basically, we represent uncertainty about the realizations of the errors by treating each element ε*ij* as a discrete random variable with *L* ≥ 2 possible outcomes contained in a convex set $v = \{v\_1, \ldots, v\_L\}$, which for the sake of simplicity will be assumed to be common for all the ε*ij*. We also assume that these possible realizations are symmetric around zero ($-v\_1 = v\_L$). The traditional way of fixing the upper and lower limits of this set is to apply the three-sigma rule [12]. Under these conditions, each ε*ij* can be defined as:

$$\varepsilon\_{ij} = \sum\_{l=1}^{L} w\_{ijl} v\_l; \ i = 1, \ldots, T; \ j = 1, \ldots, K \tag{3}$$

where *wijl* is the unknown probability of the outcome *vl* for the cell *ij*. Now, the *yij* elements can be written as:

$$y\_{ij} = \sum\_{m=1}^{M} p\_{ijm} z\_m + \sum\_{l=1}^{L} w\_{ijl} v\_l \tag{4}$$
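As a side note, fixing the bounds of the error support $v$ by the three-sigma rule [12] can be sketched as follows (a minimal illustration; the text only names the rule, so applying it to the standard deviation of the observed aggregate indicator is an assumption here):

```python
import numpy as np

def error_support(series, L=3):
    """Symmetric error support v = (v_1, ..., v_L) via the three-sigma rule.

    Bounds are set at +/- 3 standard deviations of the supplied series
    (assumed here to be the observed aggregate indicator).
    """
    sigma = np.std(series)
    return np.linspace(-3.0 * sigma, 3.0 * sigma, L)

v = error_support(np.array([1.2, 0.8, 1.5, 1.0]), L=3)
# v is symmetric around zero: v[0] == -v[-1], and v[1] == 0 for odd L
```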

The solution to the estimation problem is given by the minimization of the Kullback–Leibler (KL) divergence between the posterior distributions *p* and the a priori probabilities $q\_{ij} = [q\_{ij1}, q\_{ij2}, \ldots, q\_{ijM}]$. The *q* distributions reflect the information we have on the indicators *xij*, which are somehow related to our target *yij*, and are defined by the expression:

$$x\_{ij} = \sum\_{m=1}^{M} q\_{ijm} z\_m \tag{5}$$

If we do not have an informative prior, the a priori distributions are specified as uniform ($q\_{ijm} = \frac{1}{M}; \ \forall m = 1, \ldots, M$), which leads to the GME solution. The uniform distribution is also usually set as the natural prior $W^0$ for the error terms. Specifically, the constrained minimization problem can be written as:

$$\underset{p,W}{Min}\, D\left(p, W \| q, W^{0}\right) = \sum\_{m=1}^{M}\sum\_{i=1}^{T}\sum\_{j=1}^{K} p\_{ijm}\ln\left(\frac{p\_{ijm}}{q\_{ijm}}\right) + \sum\_{l=1}^{L}\sum\_{i=1}^{T}\sum\_{j=1}^{K} w\_{ijl}\ln\left(\frac{w\_{ijl}}{w\_{ijl}^{0}}\right) \tag{6}$$

subject to:

$$y\_{i\cdot} = \sum\_{j=1}^{K} \left( \sum\_{m=1}^{M} p\_{ijm} z\_m + \sum\_{l=1}^{L} w\_{ijl} v\_l \right) C\_{\cdot j}; \ i = 1, \ldots, T \tag{7}$$

$$\sum\_{m=1}^{M} p\_{ijm} = \sum\_{l=1}^{L} w\_{ijl} = 1; \ \forall i = 1, \dots, T; \ j = 1, \dots, K \tag{8}$$

Restrictions (8) are just normalization constraints, whereas Equation (7) reflects the observable information that we have on the relationship between the aggregates *yi*· and the indicators *yij* through the observable *K*-dimensional vector *C*·*j*. Denoting by $\hat{y}^{0}\_{ij}$ the solution in the absence of this information, it is given by the indicator *xij*; i.e., $\hat{y}^{0}\_{ij} = x\_{ij} = \sum\_{m=1}^{M} q\_{ijm} z\_m$.
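For concreteness, the program (6)–(8) without the noise term reduces, via its first-order conditions, to one scalar Lagrange multiplier per area. The following is a minimal sketch under that simplification (illustrative function and variable names; not the authors' implementation):

```python
import numpy as np

def gme_disaggregate(y, c, z, q, n_bisect=80):
    """GME/GCE estimates for the signal-only problem (Eqs. (1), (7) and (8)).

    y : (T,)    observed aggregates y_i.
    c : (K,)    positive aggregation weights C_j.
    z : (M,)    common support vector for the y_ij.
    q : (T,K,M) a priori probabilities (uniform q gives the GME solution).

    The first-order conditions of min sum p*ln(p/q) subject to
    y_i = sum_j C_j sum_m p_ijm z_m give p_ijm proportional to
    q_ijm * exp(lam_i * C_j * z_m), so each area i only requires a
    one-dimensional (bisection) search for its multiplier lam_i.
    """
    def probs(i, lam):
        a = lam * np.outer(c, z)              # (K, M) exponents
        a -= a.max(axis=1, keepdims=True)     # stabilise the exponentials
        p = q[i] * np.exp(a)
        return p / p.sum(axis=1, keepdims=True)

    y_hat = np.empty(q.shape[:2])
    for i in range(q.shape[0]):
        gap = lambda lam: c @ (probs(i, lam) @ z) - y[i]  # increasing in lam
        lo, hi = -50.0, 50.0                  # bracket for lam_i
        for _ in range(n_bisect):
            mid = 0.5 * (lo + hi)
            if gap(mid) < 0.0:
                lo = mid
            else:
                hi = mid
        y_hat[i] = probs(i, 0.5 * (lo + hi)) @ z
    return y_hat

# Toy example: T = 2 areas, K = 3 sub-categories, M = 2 support points
y = np.array([0.10, -0.20])
c = np.array([0.5, 0.3, 0.2])
z = np.array([-1.0, 1.0])
q = np.full((2, 3, 2), 0.5)                   # uniform priors -> GME solution
y_hat = gme_disaggregate(y, c, z, q)
```

Each recovered row satisfies the aggregation constraint, i.e., `c @ y_hat[i]` reproduces `y[i]`, while every cell stays inside the support bounds.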

Following Golan et al. (1994), the aggregate vectors *yi*· and *C*·*j* are, respectively, the row and column margins of a matrix of inter-industry flows. However, the availability of sample (observable) and out-of-sample (unobservable) information in our estimation problem could be different: in the inter-industry problem it is natural to observe the *K* + *T* margins, but in other estimation problems we only have aggregate information across the *T* dimension through *yi*·. This is the case, for example, if we want to disaggregate the income per capita of each area *i* (*yi*·) into the income per capita of its sub-populations (men and women, population classified by education levels, etc.) when the weight of each sub-population in the total population is observable, but the income per capita of each sub-group is not.

Sometimes the aggregate *C*·*j* is not observable and is replaced by the observed weights (θ*ij*) given to the sub-category *j* in each area *i*, which define the indicator *yi*· as the weighted sum:

$$y\_{i\cdot} = \sum\_{j=1}^{K} y\_{ij} \theta\_{ij}; i = 1, \dots, T \tag{9}$$

Additionally, the relation between the target indicators *yij* and the prior information *xij* will be made explicit by means of a functional relationship like:

$$y\_{ij} = \alpha\_i + \beta\_{ij} x\_{ij} + \varepsilon\_{ij} \tag{10}$$

and, consequently:

$$y\_{i\cdot} = \sum\_{j=1}^{K} \left( \alpha\_i + \beta\_{ij} x\_{ij} + \varepsilon\_{ij} \right) \theta\_{ij}; \ i = 1, \dots, T \tag{11}$$

Equations (10) and (11) contain the starting point of the traditional approach to spatial disaggregation based on a Distributionally Weighted Regression (DWR) of the type proposed in [13,14]. In Equation (10), the unobservable *yij* are defined as a linear function of *xij*, allowing for slope heterogeneity (note that the β*ij* can differ across areas and sub-classes) and a specific area indicator α*i*, plus an error term ε*ij*. For the estimation of model (10), the same IT-based strategy is followed, defining for the *M* possible realizations of each parameter the support vector $b = [b\_1, b\_2, \ldots, b\_M]$ (again common for the parameters α*i* and β*ij*) with unknown probabilities $p^{\alpha}, p^{\beta}$ to be recovered. The noise components ε*ij* are treated in the same way as in Equation (3).

Once the respective supporting vectors and the a priori probability distributions are set, the DWR estimation can be carried out in terms of the following GCE program:

$$\begin{aligned} \underset{p^{\alpha}, p^{\beta}, W}{Min}\, &D\left(p^{\alpha}, p^{\beta}, W \| q^{\alpha}, q^{\beta}, W^{0}\right) = \\ &\sum\_{m=1}^{M}\sum\_{i=1}^{T} p\_{mi}^{\alpha}\ln\left(\frac{p\_{mi}^{\alpha}}{q\_{mi}^{\alpha}}\right) + \sum\_{m=1}^{M}\sum\_{i=1}^{T}\sum\_{j=1}^{K} p\_{mij}^{\beta}\ln\left(\frac{p\_{mij}^{\beta}}{q\_{mij}^{\beta}}\right) + \sum\_{l=1}^{L}\sum\_{i=1}^{T}\sum\_{j=1}^{K} w\_{ijl}\ln\left(\frac{w\_{ijl}}{w\_{ijl}^{0}}\right) \end{aligned} \tag{12}$$

subject to:

$$y\_{i\cdot} = \sum\_{j=1}^{K} \left( \sum\_{m=1}^{M} p\_{mi}^{\alpha} b\_m^{\alpha} + \sum\_{m=1}^{M} p\_{mij}^{\beta} b\_m^{\beta} x\_{ij} + \sum\_{l=1}^{L} w\_{ijl} v\_l \right) \theta\_{ij}; \ i = 1, \dots, T \tag{13}$$

$$\sum\_{m=1}^{M} p\_{mi}^{\alpha} = \sum\_{m=1}^{M} p\_{mij}^{\beta} = \sum\_{l=1}^{L} w\_{ijl} = 1; \ \forall i = 1, \dots, T; \ j = 1, \dots, K \tag{14}$$

Both for the parameters and the errors, the supporting vectors usually contain values symmetrically centered on zero. If all the a priori distributions (*q*α, *q*β, *W*0) are specified as uniform, then the GCE solution reduces to the GME one.

#### **3. Unifying the Two Approaches: A Composite Prior Estimator**

In this section, we unify the two previous approaches under a common framework, showing that the matrix-adjustment problem introduced in [9] is simply a case of a DWR equation (if the available observable information is the same) with not necessarily uniform distributions for $q^{\alpha}$ and $q^{\beta}$. We leave the a priori distribution of the errors $W^{0}$ out of the discussion because the uniform specification is the most intuitive one. We will base our explanation on the most common case of supporting vectors with *M* ≥ 2 values distributed symmetrically around zero.

Note that the GME solution to the DWR problem starts from the specification of a priori distributions that assume the parameters can take any value as long as it remains within the bounds set by the supports. In contrast, in the solution offered in [9] for the inter-industry flows estimation, no area-specific (row-specific, in terms of the problem discussed there) effect was considered, and the prior expectation of *yij* is given by the corresponding cell *xij*. These assumptions can be formulated in terms of the a priori distributions used in the DWR approach, which means that both approaches can be treated as particular cases of a general estimation problem.

The a priori distribution $q^{\alpha}$ can be defined so as to incorporate the assumption of excluding any area-specific parameter α*i* from Equation (10). As opposed to the GME solution to the DWR estimation, where it is specified as uniform ($q^{\alpha u}$), we now specify an alternative non-uniform distribution ($q^{\alpha n}$) with a point mass at $b\_m^{\alpha} = 0$. Similarly, the a priori distribution $q^{\beta}$ should reflect that the uninformative estimate of *yij* is the regressor *xij*. This non-uniform distribution ($q^{\beta n}$), consequently, should be specified so as to fulfill the condition $\hat{y}^{0}\_{ij} = x\_{ij}$, or alternatively:

$$\sum\_{m=1}^{M} b\_{m}^{\beta} q\_{ijm}^{\beta n} = 1; \ i = 1, \ldots, T; \ j = 1, \ldots, K \tag{15}$$
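For instance, under the assumption of a symmetric two-point support $b^{\beta} = (-b, b)$ (a simple case; Appendix A develops this in detail), condition (15) together with the normalization constraint gives:

```latex
-b\,q_{ij1}^{\beta n} + b\,q_{ij2}^{\beta n} = 1, \qquad
q_{ij1}^{\beta n} + q_{ij2}^{\beta n} = 1
\quad\Longrightarrow\quad
q_{ij1}^{\beta n} = \frac{b-1}{2b}, \qquad
q_{ij2}^{\beta n} = \frac{b+1}{2b}
```

which defines a valid probability distribution whenever $b \ge 1$.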

Appendix A illustrates how to specify such an a priori distribution for the simplest case with *M* = 2 values in the supporting vectors. Having made explicit that, under the same information availability, the two approaches differ only in the a priori distributions specified, it is possible to apply a composite prior estimator that considers both possibilities, in the same fashion as in [10,11]. This estimator is very flexible in the assumptions made on the a priori distributions, given that it allows for including both uniform and non-uniform priors. The estimator is called Data Weighted Prior (DWP) because it is the observed information that weighs the two alternative priors considered. Furthermore, the authors of [10] prove that its estimates present a relatively lower variance than those estimated from a GCE program.

Specifically, the DWP program can be written for our problem as:

$$\begin{aligned} \underset{p^{\alpha}, p^{\beta}, p^{\gamma}, W}{Min}\, &D\left(p^{\alpha}, p^{\beta}, p^{\gamma}, W \| q^{\alpha}, q^{\beta}, q^{\gamma}, W^{0}\right) = \\ &\left(1 - \gamma\_{i}^{\alpha}\right)\sum\_{m=1}^{M}\sum\_{i=1}^{T} p\_{mi}^{\alpha u}\ln\left(\frac{p\_{mi}^{\alpha u}}{q\_{mi}^{\alpha u}}\right) + \left(1 - \gamma\_{ij}^{\beta}\right)\sum\_{m=1}^{M}\sum\_{i=1}^{T}\sum\_{j=1}^{K} p\_{mij}^{\beta u}\ln\left(\frac{p\_{mij}^{\beta u}}{q\_{mij}^{\beta u}}\right) + \\ &\gamma\_{i}^{\alpha}\sum\_{m=1}^{M}\sum\_{i=1}^{T} p\_{mi}^{\alpha n}\ln\left(\frac{p\_{mi}^{\alpha n}}{q\_{mi}^{\alpha n}}\right) + \gamma\_{ij}^{\beta}\sum\_{m=1}^{M}\sum\_{i=1}^{T}\sum\_{j=1}^{K} p\_{mij}^{\beta n}\ln\left(\frac{p\_{mij}^{\beta n}}{q\_{mij}^{\beta n}}\right) + \\ &\sum\_{h=1}^{H}\sum\_{i=1}^{T} p\_{hi}^{\gamma\alpha}\ln\left(\frac{p\_{hi}^{\gamma\alpha}}{q\_{hi}^{\gamma\alpha}}\right) + \sum\_{h=1}^{H}\sum\_{i=1}^{T}\sum\_{j=1}^{K} p\_{hij}^{\gamma\beta}\ln\left(\frac{p\_{hij}^{\gamma\beta}}{q\_{hij}^{\gamma\beta}}\right) + \sum\_{l=1}^{L}\sum\_{i=1}^{T}\sum\_{j=1}^{K} w\_{ijl}\ln\left(\frac{w\_{ijl}}{w\_{ijl}^{0}}\right) \end{aligned} \tag{16}$$

subject to:

$$y\_{i\cdot} = \sum\_{j=1}^{K} \left( \sum\_{m=1}^{M} p\_{mi}^{\alpha} b\_m^{\alpha} + \sum\_{m=1}^{M} p\_{mij}^{\beta} b\_m^{\beta} x\_{ij} + \sum\_{l=1}^{L} w\_{ijl} v\_l \right) \theta\_{ij}; \ i = 1, \dots, T \tag{17}$$

$$\sum\_{m=1}^{M} p\_{mi}^{\alpha} = \sum\_{m=1}^{M} p\_{mij}^{\beta} = \sum\_{h=1}^{H} p\_{hi}^{\gamma\alpha} = \sum\_{h=1}^{H} p\_{hij}^{\gamma\beta} = \sum\_{l=1}^{L} w\_{ijl} = 1; \ \forall i = 1, \dots, T; \ j = 1, \dots, K \tag{18}$$

The γ parameters are estimated simultaneously with the rest of the coefficients of the model. Each γ measures the weight given to the uniform prior $q^{u}$ for each parameter and is defined as $\gamma = \sum\_{h=1}^{H} b\_h^{\gamma} p\_h^{\gamma}$, where $b\_1^{\gamma} = 0$ and $b\_H^{\gamma} = 1$ are, respectively, the lower and upper bounds defined in the supporting vectors with *H* values for these parameters ($b^{\gamma} = (0, \ldots, 1) \rightarrow 0 \le \gamma \le 1$). The a priori probability distributions are always uniform ($q\_h^{\gamma} = \frac{1}{H}$), and the same applies for the errors ($w\_{ijl}^{0} = \frac{1}{L}$).
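The definition of γ as the support-weighted sum of recovered probabilities can be sketched as follows (illustrative values; *H* = 5 and the recovered probabilities are arbitrary choices, not taken from the paper):

```python
import numpy as np

# Support for the weighting parameter: b_1 = 0, ..., b_H = 1 (H = 5 here)
b_gamma = np.linspace(0.0, 1.0, 5)
# Uniform a priori probabilities q_h = 1/H
q_gamma = np.full(5, 1.0 / 5)
# Prior expectation of gamma: both priors weighted equally a priori
gamma_prior = b_gamma @ q_gamma               # = 0.5
# Any recovered probabilities p_h on the simplex keep gamma within [0, 1]
p_gamma = np.array([0.6, 0.2, 0.1, 0.05, 0.05])
gamma = b_gamma @ p_gamma                     # here most weight goes to b_1 = 0
```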

To understand the logic of this estimator, an explanation of the objective function of the previous minimization program is required. Note that Equation (16) is divided into four terms. The last term measures the Kullback–Leibler divergence between the posterior and the prior probabilities for the noise component of the model. The first term quantifies this divergence between the recovered probabilities and the uniform priors for each coefficient, this divergence being weighted by the corresponding (1 − γ). Next, the second element of (16) measures the divergence with respect to the non-uniform priors and is weighted by γ. The third element in (16) relates to the Kullback–Leibler divergence of the weighting parameters γ. Equation (16) is minimized subject to the constraints in Equations (17) and (18). Again, the restriction in (17) ensures that the posterior probability distributions of the estimates and the errors are compatible with the observations, whereas Equation (18) contains just normalization constraints.
