1. Introduction
Nonprobability samples are increasingly common in the empirical sciences. The rise of online and smartphone surveys, along with declining response rates in traditional survey modes, has contributed to the popularization of volunteer surveys where sampling is non-probabilistic. Moreover, the development of Big Data involves the analysis of large-scale datasets whose collection is conditioned by data availability rather than by a probabilistic selection; they can therefore be considered large nonprobability samples of a population [1].
The lack of a probability sampling scheme can introduce selection bias. Following the description in [1,2], we can distinguish the target population, $U$; the subpopulation that a given selection method can potentially cover, $U_{pc}$; and the fraction of that subpopulation which is finally covered, $U_{fc}$, whose individuals might participate in the survey. Selection bias occurs when the characteristics of the individuals in $U_{fc}$ differ significantly from those in $U$ in a way that could affect the final estimates. Typically, differences between individuals in $U$ and individuals in $U_{pc}$ are caused by a lack of coverage induced by the survey administration mode (for example, an online questionnaire cannot be administered to the population without internet access), while differences between $U_{pc}$ and $U_{fc}$ are caused by variability in the propensity to participate across socio-demographic groups (for example, an online questionnaire accessible on a thematic website might only be completed by visitors of the website, who have specific interests that could influence the results).
Following the rise of nonprobability samples, a class of methods for reducing selection bias has been proposed over the last few decades. These methods were developed from different perspectives according to the availability of auxiliary information. Calibration, Propensity Score Adjustment (PSA), Statistical Matching and superpopulation modelling can be mentioned as the most relevant techniques to mitigate the selection bias produced by coverage and self-selection errors.
Calibration weighting was originally developed in [3] as a method to correct representation issues in samples with coverage or non-response errors. It only requires a vector of auxiliary variables available for each individual of the sample and the population totals of those variables. Calibration is able to remove selection bias in nonprobability samples if the selection mechanism is ignorable [4], and although it was originally developed for parametric estimation, further work [5,6,7] has extended calibration to the estimation of distribution functions, quantiles and poverty measures.
Propensity Score Adjustment (PSA) and Statistical Matching require, apart from the nonprobability sample, a probability sample to perform the adjustments. PSA was originally developed for balancing groups in non-randomized clinical trials [8] and was adapted for non-response adjustments shortly after [9,10]. The application of PSA for removing bias in nonprobability surveys was theoretically developed in [11,12]. Statistical Matching was first proposed in [13] and extended in [14] for non-response adjustments. The difference between the two methods lies in the sample used in the estimators: PSA estimates the propensity of each individual of the nonprobability sample to participate in the survey and uses this propensity to construct the weights of the estimators, while Statistical Matching fits a prediction model on data from the nonprobability sample and applies it to the probability sample to predict the values of the target variable $y$, which are then used in the parametric estimators. To the best of our knowledge, PSA and Statistical Matching have not been developed for nonparametric estimation.
Superpopulation modelling requires data on the covariates used in the adjustment for the complete census of the target population, which is assumed to be a realization (sample) of a superpopulation where the (unknown) target values follow a model. It is based on the works [15,16], where the main idea is to fit a regression model for the target variable with data from the nonprobability sample and to use the model to predict the values of the target variable for each individual in the population. The predictions can be used for estimation in a model-based approach or in alternative versions such as model-assisted and model-calibrated estimation. LASSO models [17] and Machine Learning predictors [18,19] have been studied as alternatives to ordinary least squares regression in superpopulation modelling.
Societal interest in poverty and inequality has increased over the last few decades, driven by successive economic cycles and crises. In this context, official poverty rates and the percentage of people in poverty (or below a poverty threshold) are important measures of a country's wealth. A common characteristic of many poverty measures is their complexity. The survey sampling literature usually focuses on the estimation of linear parameters. However, the variable of interest in poverty studies is typically a measure of wages or income, for which the distribution function becomes a relevant tool: it is required to calculate the proportion of people with low income, the poverty gap and other measures. Estimators for the cumulative distribution function, quantiles [20,21] and poverty measures [22] can be found in the literature on probability samples, but there is hardly any work on the estimation of these parameters when the samples are obtained from volunteers.
In this paper, we aim to develop a framework for statistical inference on a general parameter with nonprobability survey samples when a reference probability sample is available. After introducing the problem of mean estimation for volunteer samples in Section 2, we consider in Section 3 the estimation of a general parameter through general estimating equations. Section 4 presents a new estimator for a general parameter that uses PSA to estimate the propensity score of each individual in the survey-weighted estimating equation, and the main theoretical results are given there. Results from simulation studies are reported in Section 5, and Section 6 presents the concluding remarks.
2. Approaches to Estimation of a Mean for Volunteer Online Samples
Let $U$ be the target population with $N$ elements and $s_v$ a nonprobability sample drawn from a subset of $U$, $U_v \subseteq U$, with a size of $n_v$. Let $y$ be the target variable of the survey, whose mean in the population $U$ is denoted as $\bar{Y}$. The sample estimate of $\bar{Y}$, $\hat{\bar{Y}}$, is obtained with the Horvitz-Thompson-type estimator:
\[
\hat{\bar{Y}} = \frac{1}{N} \sum_{i \in s_v} w_i y_i,
\]
where $w$ is a vector of weights that accounts for the lack of representativity of $s_v$ caused by selection bias. If no auxiliary information is given, the weight is the same for every unit, $w_i = N/n_v$, which requires assuming that the sample was drawn under a simple random sampling scheme. This is a naïve assumption, given that $s_v$ is not probabilistic; that is, the probability of being in the sample is unknown and/or null for the units in $U$.
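As an illustration, the naïve version of this estimator can be sketched as follows (the population and the sample are simulated for the example; with $w_i = N/n_v$ the estimator reduces to the unweighted sample mean):

```python
import numpy as np

rng = np.random.default_rng(42)

N = 10_000                      # population size
y_pop = rng.normal(50, 10, N)   # hypothetical target variable

# Volunteer sample: for illustration, 500 units taken from the population
s_v = rng.choice(N, size=500, replace=False)
y = y_pop[s_v]

# Naive uniform weights w_i = N / n_v (implicitly assumes simple random sampling)
w = np.full(len(y), N / len(y))

# Horvitz-Thompson-type estimator of the mean
y_mean_ht = np.sum(w * y) / N
```

With uniform weights, `y_mean_ht` equals the plain sample mean, which is biased whenever the selection mechanism is related to $y$.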
Let $\mathbf{x}$ be a matrix of covariates measured in $s_v$ along with $y$. If the population totals of the covariates, $\mathbf{t_x} = \sum_{i \in U} \mathbf{x}_i$, are available, it is possible to estimate the mean using a vector of weights obtained by calibration, $w^c$. The calibration weights minimize a distance $G(w^c, w)$ between the original and the new weights while respecting the calibration equations
\[
\sum_{i \in s_v} w_i^c \mathbf{x}_i = \mathbf{t_x}.
\]
Some choices for the distance $G$ were listed in [3], along with the resulting estimators. Calibration weighting for selection bias treatment was studied in [4], where post-stratification, a special case of calibration [23], was used to mitigate the bias caused by different selection mechanisms, showing its efficacy when the selection of the units of $s_v$ is Missing At Random (MAR).
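For the chi-square (linear) distance, the calibration problem has a closed-form solution; a minimal sketch with simulated covariates and assumed known population totals `t_x`:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 200, 5_000

x = rng.uniform(0, 1, (n, 2))        # covariates observed in the sample
d = np.full(n, N / n)                # starting weights
t_x = np.array([0.55 * N, 0.48 * N]) # assumed known population totals

# Linear (chi-square distance) calibration has the closed form
# w_c = d * (1 + x @ lam), with lam chosen so that sum(w_c * x) = t_x
M = (x * d[:, None]).T @ x           # sum_i d_i x_i x_i'
lam = np.linalg.solve(M, t_x - d @ x)
w_c = d * (1 + x @ lam)
```

By construction the calibration equations `w_c @ x == t_x` hold exactly; other distances $G$ require iterative solvers but follow the same pattern.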
If a reference sample, $s_r$, drawn from the population $U$ is available and a set of covariates $\mathbf{x}$ has been measured both in $s_v$ and $s_r$, two procedures can be applied to reduce the selection bias present in $s_v$. Let $\delta_i$ be an indicator variable of an element being in $s_v$; that is,
\[
\delta_i = \begin{cases} 1 & i \in s_v, \\ 0 & i \notin s_v. \end{cases}
\]
Propensity Score Adjustment (PSA) assumes that each element of $U$ has a probability (propensity) of being selected for $s_v$, which can be formulated as
\[
\pi_i = \Pr(\delta_i = 1 \mid \mathbf{x}_i, y_i),
\]
where $\pi_i$ is the propensity of the $i$-th individual to participate in $s_v$. The random mechanism behind this probability is the selection mechanism that governs the nonprobability sample. If the selection is Missing Completely At Random (MCAR), then $\pi_i = \Pr(\delta_i = 1)$ and the selection bias is null, while if the selection is MAR then $\pi_i = \Pr(\delta_i = 1 \mid \mathbf{x}_i)$ and the selection mechanism is considered ignorable. This does not mean that the selection bias should be ignored, but rather that it can be treated with the right techniques.
In PSA, we consider the situation where the true propensities are not known and therefore have to be estimated; we do so by combining $s_v$ and $s_r$ into a single sample. The probability that $\delta_i = 1$ is then estimated using a prediction model, traditionally a logistic regression:
\[
\hat{\pi}_i = \frac{1}{1 + \exp(-\mathbf{x}_i^T \hat{\beta})}.
\]
Alternative models, such as non-linear regression and Machine Learning classification algorithms, have been studied in the literature as substitutes for logistic regression (see [24] for a review). The resulting propensities can be used to construct new weights, $w_i$, with different alternatives. A simple inverse probability weighting is proposed in [25],
\[
w_i = \frac{1}{\hat{\pi}_i},
\]
which is a similar approach to the formula used in [26],
\[
w_i = \frac{1 - \hat{\pi}_i}{\hat{\pi}_i}.
\]
Alternatively, the individuals of the combined sample ($s_v \cup s_r$) can be grouped into $g$ equally sized strata of similar propensity scores, from which an average propensity is calculated for each group. Let $\bar{\pi}_g$ be the mean propensity in the $g$-th stratum. [2] use these means to calculate the new weights:
\[
w_i = \frac{1}{\bar{\pi}_{g(i)}},
\]
where $g(i)$ refers to the stratum to which the $i$-th individual of $s_v$ belongs. A similar approach can be found in [12], but instead of using the means, a correction factor is calculated for each stratum:
\[
f_g = \frac{\sum_{i \in s_r^{(g)}} d_i \Big/ \sum_{i \in s_r} d_i}{\sum_{i \in s_v^{(g)}} d_i \Big/ \sum_{i \in s_v} d_i},
\]
where $s_r^{(g)}$ and $s_v^{(g)}$ are, respectively, the individuals from the probability and the nonprobability sample that belong to the $g$-th stratum, and $d$ is the vector of design weights of the reference sample. The final weights are obtained by multiplying the original weights by the correction factor:
\[
w_i = d_i \, f_{g(i)}.
\]
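The inverse-probability and stratified weighting schemes described above can be sketched as follows, with propensities fitted by a hand-rolled Newton-Raphson logistic regression on a simulated combined sample (all data, and the choice of g = 5 strata, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Combined sample: delta = 1 for volunteer (s_v) units, 0 for reference (s_r) units
n_v, n_r = 400, 600
x = np.concatenate([rng.normal(0.5, 1, n_v), rng.normal(-0.5, 1, n_r)])
delta = np.concatenate([np.ones(n_v), np.zeros(n_r)])

# Logistic regression pi(x) = P(delta = 1 | x), fitted by Newton-Raphson
X = np.column_stack([np.ones(x.size), x])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (delta - p)                      # score vector
    hess = -(X * (p * (1 - p))[:, None]).T @ X    # Hessian of the log-likelihood
    beta -= np.linalg.solve(hess, grad)

pi_all = 1 / (1 + np.exp(-X @ beta))
pi_hat = pi_all[delta == 1]        # estimated propensities of the volunteer units

# (a) Inverse probability weighting: w_i = 1 / pi_hat_i
w_ipw = 1 / pi_hat

# (b) Propensity stratification: g strata, weight by the stratum mean propensity
g = 5
edges = np.quantile(pi_all, np.linspace(0, 1, g + 1))
strata = np.clip(np.searchsorted(edges, pi_all, side="right") - 1, 0, g - 1)
stratum_mean = np.array([pi_all[strata == k].mean() for k in range(g)])
w_strat = 1 / stratum_mean[strata[delta == 1]]
```

Stratified weights are coarser than inverse-probability weights but are less sensitive to extreme estimated propensities near 0 or 1.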
PSA has been proven to successfully remove selection bias when prognostic covariates are chosen [11] and when further adjustments, such as calibration, are applied in the estimation [2,12,27]. A recent paper [28] shows a real application of PSA in web panel surveys where the reductions in bias, although present, were not large enough to consider the estimates unbiased.
As an alternative to PSA, Statistical Matching is another method to mitigate selection bias when a reference sample is available. A prediction model for $y$ with $\mathbf{x}$ as the explanatory variables is built using data from $s_v$. The model is subsequently applied to the reference sample, and the estimate is obtained from the predicted values $\hat{y}_i$ in $s_r$:
\[
\hat{\bar{Y}} = \frac{1}{N} \sum_{i \in s_r} d_i \hat{y}_i.
\]
The choice of prediction model has been studied in the literature; the usual method is linear regression, but other approaches such as donor imputation [13] or Machine Learning algorithms [19,29] have been listed as alternatives. Under certain conditions, Statistical Matching can reduce the bias and the mean square error to a greater extent than PSA [29].
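A minimal sketch of the matching estimator (simulated data; linear regression as the prediction model and, for simplicity, equal design weights in the reference sample):

```python
import numpy as np

rng = np.random.default_rng(2)

# Volunteer sample: y observed alongside covariate x
n_v, n_r, N = 300, 500, 10_000
x_v = rng.uniform(0, 10, n_v)
y_v = 2.0 + 1.5 * x_v + rng.normal(0, 1, n_v)

# Reference probability sample: x and design weights d, but no y
x_r = rng.uniform(0, 10, n_r)
d_r = np.full(n_r, N / n_r)      # illustrative equal design weights

# Fit the prediction model on the volunteer sample (linear regression here)
A = np.column_stack([np.ones(n_v), x_v])
coef, *_ = np.linalg.lstsq(A, y_v, rcond=None)

# Predict y for the reference sample and plug into the weighted estimator
y_pred = coef[0] + coef[1] * x_r
y_mean_match = np.sum(d_r * y_pred) / np.sum(d_r)
```

Note that the selection bias correction now rests entirely on the prediction model: if the model for $y$ given $\mathbf{x}$ is misspecified, the matching estimator remains biased.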
When a complete census of the target population is available with information on the covariates in $\mathbf{x}$, superpopulation modelling can be applied to remove selection bias [19]. In this paper, we consider the case where auxiliary information is available only from a reference probability survey.
3. Estimation of a General Parameter by Using PSA
Let $y$ be the variable of interest in a survey and $y_i$ the value of the $i$-th unit for that variable, $i = 1, \ldots, N$. Suppose we want to estimate a finite population parameter $\theta$ of dimension $p$ defined as the solution of the census estimating equations:
\[
\sum_{i \in U} u(y_i, \theta) = 0,
\]
where $u(y_i, \theta)$ is a function of $\theta$. Some unidimensional parameters of interest are:
the population total, for $u(y_i, \theta) = y_i - \theta/N$;
the population mean, for $u(y_i, \theta) = y_i - \theta$;
the population distribution function at a point $t$, for $u(y_i, \theta) = I(y_i \leq t) - \theta$, with $I(\cdot)$ being the indicator function;
the finite population quantile of order $j$, for $u(y_i, \theta) = I(y_i \leq \theta) - j$, where $0 < j < 1$.
We denote by $\hat{\theta}$ the solution of the equation:
\[
\sum_{i \in s_v} \frac{u(y_i, \theta)}{\pi_i} = 0.
\]
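When the propensities $\pi_i$ are known, solving the weighted estimating equation is straightforward; a sketch for the mean and the median under simulated Poisson sampling (the population and the inclusion probabilities are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

N = 20_000
y_pop = rng.lognormal(3.0, 0.5, N)

# Poisson sampling with known inclusion probabilities, unrelated to y here
pi = rng.uniform(0.05, 0.30, N)
s_v = rng.random(N) < pi
y, p = y_pop[s_v], pi[s_v]

# Mean: u(y, theta) = y - theta has the closed-form root
# theta = sum(y / pi) / sum(1 / pi) (a Hajek-type estimator)
theta_mean = np.sum(y / p) / np.sum(1 / p)

# Quantile of order j: u(y, theta) = I(y <= theta) - j is solved by the
# weighted quantile of the sample with weights 1 / pi
def weighted_quantile(values, weights, j):
    order = np.argsort(values)
    cum = np.cumsum(weights[order]) / np.sum(weights)
    return values[order][np.searchsorted(cum, j)]

theta_median = weighted_quantile(y, 1 / p, 0.5)
```

For non-smooth estimating functions such as the quantile case, the "solution" is taken as the smallest $\theta$ at which the weighted equation changes sign, which is what the weighted quantile computes.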
It is clear that
\[
E_r\left[\sum_{i \in s_v} \frac{u(y_i, \theta)}{\pi_i}\right] = \sum_{i \in U} u(y_i, \theta),
\]
where $r$ stands for the model of the selection mechanism for the sample $s_v$, that is, the true model that generates the propensity scores. If the $\pi_i$ are known, we can obtain a consistent estimator of $\theta$ by solving the equation above. For the study of the properties of this estimator we consider a quasi-probability approach or pseudo-design-based inference ([19]) and we treat the volunteer sample as a realization of Poisson sampling with probabilities $\pi_i$.

For any sampling design that verifies certain regularity conditions, the solution to the weighted estimating equation provides a consistent estimator of the parameter $\theta$ (see [30]). Poisson sampling verifies these conditions, so the consistency of the estimator follows immediately from the result of [30]. The normality of the estimator is demonstrated in [31], which also gives its asymptotic variance. From that expression, and taking into account that in Poisson sampling the selections are independent and therefore the second-order inclusion probabilities are given by $\pi_{ij} = \pi_i \pi_j$, $i \neq j$, we can obtain the variance of $\hat{\theta}$:
\[
V(\hat{\theta}) = H(\theta)^{-1} \, \Sigma \, \left(H(\theta)^{-1}\right)^T,
\]
being
\[
H(\theta) = \sum_{i \in U} \frac{\partial u(y_i, \theta)}{\partial \theta}
\]
and
\[
\Sigma = \sum_{i \in U} \frac{1 - \pi_i}{\pi_i}\, u(y_i, \theta)\, u(y_i, \theta)^T.
\]
4. Estimation of a General Parameter with Estimated Propensities
The propensity scores are not known and are impossible to estimate using the nonprobability sample alone, so additional information must be included. Let $s_r$ be a reference probability sample, of size $n_r$, selected from $U$ under a sampling design where the first-order inclusion probabilities, $\pi_i^r$, are known and non-null. The covariates $\mathbf{x}$ of the propensity model have been measured both in $s_v$ and $s_r$, while the variable of interest $y$ is only available for the individuals in $s_v$.

Suppose that the propensity scores can be modelled parametrically as
\[
\pi_i = \pi(\mathbf{x}_i, \lambda),
\]
for some known function $\pi(\cdot, \lambda)$ with continuous second derivatives with respect to an unknown parameter $\lambda$. We estimate the propensity scores by using data from both the volunteer and the probability sample. The maximum likelihood estimator (MLE) of $\pi_i$ is
\[
\hat{\pi}_i = \pi(\mathbf{x}_i, \hat{\lambda}),
\]
where $\hat{\lambda}$ corresponds to the value of $\lambda$ that maximizes the log-likelihood function:
\[
\ell(\lambda) = \sum_{i \in s_v} \log \pi(\mathbf{x}_i, \lambda) + \sum_{i \in U \setminus s_v} \log\left(1 - \pi(\mathbf{x}_i, \lambda)\right).
\]
As is usual in survey sampling, given that some units of the population have not been sampled, we consider the pseudo-log-likelihood:
\[
\ell^*(\lambda) = \sum_{i \in s_v} \log\left(\frac{\pi(\mathbf{x}_i, \lambda)}{1 - \pi(\mathbf{x}_i, \lambda)}\right) + \sum_{i \in s_r} d_i \log\left(1 - \pi(\mathbf{x}_i, \lambda)\right),
\]
where $d_i = 1/\pi_i^r$ are the design weights of the reference sample. We thus propose a two-step procedure:

Step 1: Calculate $\hat{\lambda}$ by solving the score equations:
\[
U(\lambda) = \frac{\partial \ell^*(\lambda)}{\partial \lambda} = 0.
\]
Step 2: Calculate $\hat{\theta}_{PSA}$ as the solution of the estimating equations:
\[
\sum_{i \in s_v} \frac{u(y_i, \theta)}{\pi(\mathbf{x}_i, \hat{\lambda})} = 0.
\]
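A sketch of the two-step procedure under a logistic propensity model (the population, sample sizes and outcome model are simulated assumptions). For the logistic model the pseudo-score reduces to $\sum_{s_v} \mathbf{x}_i - \sum_{s_r} d_i \pi_i \mathbf{x}_i = 0$, which Step 1 solves by Newton-Raphson; Step 2 then solves the weighted estimating equation for the mean:

```python
import numpy as np

rng = np.random.default_rng(4)

# Population with one covariate; y depends on x, and so does participation
N = 20_000
x_pop = rng.normal(0, 1, N)
y_pop = 10 + 3 * x_pop + rng.normal(0, 1, N)
lam_true = np.array([-2.0, 1.0])

# Volunteer sample: Bernoulli draws with propensity pi(x, lam_true)
X_pop = np.column_stack([np.ones(N), x_pop])
pi_pop = 1 / (1 + np.exp(-X_pop @ lam_true))
in_sv = rng.random(N) < pi_pop
y_v, X_v = y_pop[in_sv], X_pop[in_sv]

# Reference sample: SRSWOR of size n_r, design weights d = N / n_r
n_r = 1_000
idx_r = rng.choice(N, n_r, replace=False)
X_r, d = X_pop[idx_r], np.full(n_r, N / n_r)

# Step 1: solve the pseudo-score equations by Newton-Raphson
lam = np.zeros(2)
for _ in range(50):
    pi_r = 1 / (1 + np.exp(-X_r @ lam))
    score = X_v.sum(axis=0) - (d * pi_r) @ X_r
    fisher = (X_r * (d * pi_r * (1 - pi_r))[:, None]).T @ X_r
    lam += np.linalg.solve(fisher, score)

# Step 2: solve the estimating equation for the mean with estimated propensities
pi_v = 1 / (1 + np.exp(-X_v @ lam))
theta_psa = np.sum(y_v / pi_v) / np.sum(1 / pi_v)

naive = y_v.mean()   # unadjusted volunteer mean, biased upwards in this setup
```

Here participation favours large $x$, so the naïve mean overestimates the population mean, while the PSA-adjusted estimate recovers it approximately.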
We consider the following asymptotic framework for the theoretical development, which is equivalent to the framework in [32]. Let $\{U_N\}$ be a sequence of finite populations of size $N$. Each $U_N$ has an associated nonprobability sample $s_{v,N}$ of size $n_{v,N}$ and an associated probability sample $s_{r,N}$ of size $n_{r,N}$. We consider that the population size $N \to \infty$, the nonprobability sample size $n_v \to \infty$ and the probability sample size $n_r \to \infty$ as $N \to \infty$. For notational simplicity the index $N$ is suppressed for the rest of the paper. The properties of the estimator $\hat{\theta}_{PSA}$ are developed under both the model for the propensity scores and the survey design for the probability sample.

We make the following assumptions:

A.1. The estimating function $u(y, \theta)$ is twice differentiable with respect to $\theta$, and the propensity function $\pi(\mathbf{x}, \lambda)$ is twice differentiable with respect to $\lambda$.

A.2. The propensities and the sampling design ensure that $N^{-1} \sum_{i \in s_v} u(y_i, \theta)/\pi_i - N^{-1} \sum_{i \in U} u(y_i, \theta) = O_p(n_v^{-1/2})$ for any $\theta$.

A.3. The propensities and the sampling design ensure that $N^{-1} \sum_{i \in s_v} u(y_i, \theta)/\pi_i$ is asymptotically Normal, with mean $N^{-1} \sum_{i \in U} u(y_i, \theta)$ and entries of the variance of order $O(n_v^{-1})$, for any fixed $\theta$.
Theorem 1. Under the conditions A.1, A.2 and A.3, $\hat{\theta}_{PSA}$ is a consistent and asymptotically normal estimator for $\theta$.

Proof. Under the assumed conditions, $\hat{\lambda} - \lambda = O_p(n_v^{-1/2})$; thus, by using the mean value theorem, $\hat{\theta}_{PSA}$ has the same asymptotic behaviour as $\hat{\theta}$, which is consistent for $\theta$ and asymptotically normally distributed (see Section 3). □
Variance estimation for $\hat{\theta}_{PSA}$ can be handled by combining the two estimating equations, the score equations for $\lambda$ and the estimating equations for $\theta$, into a single system, as is done in [33]. The MLE of $\lambda$, $\hat{\lambda}$, is the solution to the equations:
\[
U_1(\lambda) = \frac{\partial \ell^*(\lambda)}{\partial \lambda} = 0,
\]
and the PSA estimator of $\theta$ is the solution to the estimating equations
\[
U_2(\theta, \lambda) = \sum_{i \in s_v} \frac{u(y_i, \theta)}{\pi(\mathbf{x}_i, \lambda)} = 0.
\]
Let $\Phi(\theta, \lambda) = (U_2(\theta, \lambda)^T, U_1(\lambda)^T)^T$. Let $(\theta_0, \lambda_0)$ be the true parameter values defined through the census estimating equations, and $(\hat{\theta}, \hat{\lambda})$ the solutions to $\Phi(\theta, \lambda) = 0$.
We need an additional assumption:

A.4. The matrix of partial derivatives $\partial \Phi(\theta, \lambda)/\partial(\theta, \lambda)$ is non-singular in a neighbourhood of $(\theta_0, \lambda_0)$.
Theorem 2. Under the conditions A.1, A.2, A.3 and A.4, the asymptotic variance-covariance matrix of $(\hat{\theta}, \hat{\lambda})$ is given by the expression:
\[
V(\hat{\theta}, \hat{\lambda}) = \left[E\left(\frac{\partial \Phi(\theta_0, \lambda_0)}{\partial (\theta, \lambda)}\right)\right]^{-1} V\left(\Phi(\theta_0, \lambda_0)\right) \left[E\left(\frac{\partial \Phi(\theta_0, \lambda_0)}{\partial (\theta, \lambda)}\right)^{-1}\right]^{T},
\]
with
\[
V\left(\Phi(\theta_0, \lambda_0)\right) = V_r\left[E_p\left(\Phi(\theta_0, \lambda_0)\right)\right] + E_r\left[V_p\left(\Phi(\theta_0, \lambda_0)\right)\right].
\]

Proof. Since $\hat{\theta}$ and $\hat{\lambda}$ are consistent estimators of the respective parameters, we can write $\Phi(\hat{\theta}, \hat{\lambda}) = 0$, and the Taylor series expansion gives:
\[
0 = \Phi(\hat{\theta}, \hat{\lambda}) \approx \Phi(\theta_0, \lambda_0) + \frac{\partial \Phi(\theta_0, \lambda_0)}{\partial (\theta, \lambda)} \left((\hat{\theta}, \hat{\lambda}) - (\theta_0, \lambda_0)\right).
\]
Thus the asymptotic variance of $(\hat{\theta}, \hat{\lambda})$ is given by the sandwich expression above. Taking into account the two random mechanisms and conditioning appropriately, the variance of $\Phi$ decomposes as $V_r[E_p(\Phi)] + E_r[V_p(\Phi)]$, where $r$ stands for the model of the selection mechanism for the sample $s_v$ and $p$ refers to the probability sampling design for $s_r$. □
The asymptotic variance of $(\hat{\theta}, \hat{\lambda})$ depends on the probability of selecting the sample under the given sampling design and on the selection mechanism described by the propensity model. Plug-in estimators can be used to construct variance estimators for all the required components, but this is not a simple issue. In practice, as described in [7], the use of jackknife [34] and bootstrap techniques [35] for variance estimation of nonlinear parameters should be more advantageous because of their wide applicability under different cases and conditions. Direct application of bootstrap methods for estimating the variance-covariance matrix of $(\hat{\theta}, \hat{\lambda})$ involves solving the equation $\Phi(\theta, \lambda) = 0$ repeatedly for each bootstrap sample. A Multiplier Bootstrap with Estimating Functions was proposed in [36].
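As a simplified illustration, a resampling bootstrap for the variance of the mean estimator can be sketched as below (simulated data; for brevity the estimated propensities are kept fixed across replicates, so the variability of $\hat{\lambda}$ is ignored):

```python
import numpy as np

rng = np.random.default_rng(5)

# Assume y and estimated propensities pi_hat are available for s_v (simulated here)
n_v = 800
y = rng.lognormal(3.0, 0.5, n_v)
pi_hat = rng.uniform(0.05, 0.30, n_v)

def solve_mean(y, pi):
    # Root of sum_i u(y_i, theta) / pi_i = 0 for u(y, theta) = y - theta
    return np.sum(y / pi) / np.sum(1 / pi)

theta_full = solve_mean(y, pi_hat)

# Bootstrap: resample volunteer units with replacement, re-solve each time
B = 500
reps = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n_v, n_v)
    reps[b] = solve_mean(y[idx], pi_hat[idx])

se_boot = reps.std(ddof=1)
```

A full bootstrap would re-estimate $\hat{\lambda}$ (i.e., re-solve $\Phi = 0$) within each replicate, which is exactly the computational burden the multiplier bootstrap of [36] is designed to avoid.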
5. Simulation Study
5.1. Data
Data for the simulation study come from a wave of the Spanish Living Conditions Survey collected between 2011 and 2012 [37], which contains an annual thematic module that, in 2012, was dedicated to household conditions. The survey follows a two-stage cluster sampling design, where the primary units are the households and the secondary units are their members. In 2012, the final sample included 33,573 individuals. For this study, the dataset was filtered to rule out individuals and variables with large amounts of missing data. After this procedure, the dataset employed as the pseudopopulation of the study had a size of N = 28,210 individuals and p = 60 available variables.
From this pseudopopulation, two probability samples of size $n_r$ were drawn according to the following sampling strategies:

The first sample, $s_{r1}$, was drawn with stratified cluster sampling, where the strata were defined by the Autonomous Communities (NUTS2 regions) and the clusters were the households, which were drawn with probabilities proportional to household size. The number of households to be selected, $m$, was estimated by dividing $n_r$ by the mean household size in order to reach the aforementioned sample size.

The second sample, $s_{r2}$, was drawn with unequal probability sampling, where the probabilities were proportional to the minimum income needed by the individual's household to make ends meet (variable HS130 in [37]).
The extraction of the nonprobability sample, $s_v$, was done with unequal probability sampling from the full pseudopopulation, where the probability of selection for the $i$-th individual, $\pi_i$, was given by a formula depending on the following covariates:

$x_{1i} = 1$ when the $i$-th sampled individual has a computer at home, and $x_{1i} = 0$ otherwise.
$x_{2i} = 1$ when the $i$-th sampled individual is a man, and $x_{2i} = 0$ otherwise.
$x_{3i}$ is the age (in years) of the $i$-th sampled individual.
$x_{4i} = 1$ when the $i$-th sampled individual lives in a medium population density area, and $x_{4i} = 0$ otherwise.
$x_{5i} = 1$ when the $i$-th sampled individual lives in a low population density area, and $x_{5i} = 0$ otherwise.
The reasoning behind this sampling procedure is to reproduce selection mechanisms similar to the self-selection processes that take place in real nonprobability surveys. We considered three different sample sizes, $n_v$ = 2000, 4000, 6000, and 1000 simulation runs were performed for each procedure and sample size, drawing a new sample in each run.
5.2. Simulation
In each simulation, the parameters to be estimated were the following:

The Gini coefficient [38], which measures income inequality.
The proportion of individuals with a disposable income below the at-risk-of-poverty threshold. This measure can be referred to as poverty incidence, poverty proportion, poverty risk or Head Count Index (HCI) [39].
The interquartile range, estimated as $\widehat{IQR} = \hat{Q}_{0.75} - \hat{Q}_{0.25}$.
The interdecile range, estimated as $\widehat{IDR} = \hat{Q}_{0.9} - \hat{Q}_{0.1}$.

Every parameter was estimated with and without applying PSA so that its performance could be evaluated. In order to estimate the propensities, a logistic regression model was chosen. 1000 simulations were executed for each scenario. The resulting mean bias, standard deviation and Root Mean Square Error were measured in relative terms to make them comparable across different scenarios:
\[
\text{RBias} = \frac{\frac{1}{1000}\sum_{i=1}^{1000} (\hat{\theta}_i - \theta)}{\theta}, \qquad
\text{RSD} = \frac{\sqrt{\frac{1}{1000}\sum_{i=1}^{1000} (\hat{\theta}_i - \bar{\hat{\theta}})^2}}{\theta}, \qquad
\text{RRMSE} = \frac{\sqrt{\frac{1}{1000}\sum_{i=1}^{1000} (\hat{\theta}_i - \theta)^2}}{\theta},
\]
with $\hat{\theta}_i$ the estimate in the $i$-th simulation and $\bar{\hat{\theta}}$ the mean of the 1000 estimates.
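Assuming all three measures are expressed relative to the true value θ, they can be computed from the simulated estimates as follows (the true value and the estimates are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

theta = 0.25                                   # true parameter (e.g. an HCI value)
est = theta + rng.normal(0.01, 0.02, 1000)     # 1000 simulated estimates

rbias = np.mean(est - theta) / theta                   # relative mean bias
rsd = np.std(est) / theta                              # relative standard deviation
rrmse = np.sqrt(np.mean((est - theta) ** 2)) / theta   # relative RMSE
```

With these definitions, RRMSE² = RBias² + RSD² holds exactly, which is a useful sanity check when tabulating results.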
5.3. Results
The relative mean bias of the estimates can be observed in Table 1, Table 2 and Table 3. PSA reduces the bias in all situations, especially in the estimation of the HCI. PSA using the reference sample drawn with probabilities proportional to income, $s_{r2}$, provided much less biased estimates overall.
The relative standard deviation of the estimates can be observed in Table 4, Table 5 and Table 6. The standard deviation remained stable across the estimates of the Gini coefficient, IQR and IDR, with even small gains for the latter when using the reference sample with probabilities proportional to the minimum income to make ends meet, $s_{r2}$, but it increased after applying PSA in the estimation of the HCI.
The relative Root Mean Square Error of the estimates can be observed in Table 7, Table 8 and Table 9. As a result of the stability of the standard deviation and the reduction in bias, the RMSE of the estimates of the four parameters follows a pattern similar to that observed for the bias. Although the RMSE is reduced after applying PSA in all cases, PSA was more efficient when the reference sample was drawn with probabilities proportional to the minimum income to make ends meet, $s_{r2}$.
PSA performance can be deeply affected by the selection mechanism, which may lead to model misspecification in the propensity estimation. To test the limitations and robustness of the proposed approach, we repeated the simulation with different non-response patterns. The selection procedures can be described as follows:
- NP1 Simple Random Sampling Without Replacement (SRSWOR) from the population fraction of individuals with a computer at home.
- NP2 The probability of selection for the $i$-th individual is given by a first unequal-probability selection formula.
- NP3 The probability of selection for the $i$-th individual is given by a second formula, a cubic function of age.
- NP4 The probability of selection for the $i$-th individual is given by a third formula with a dichotomous and a cosine-shaped component.
Procedure NP1 is a typical case of coverage error (which is a type of selection bias in itself [1]). The third scheme represents a cubic relationship between age and the probability of selection, with young people having the highest probabilities, which decrease as age increases. The last scheme has two components: one dichotomous and the other cosine-shaped.
Table 10 and Table 11 show the relative bias and relative RMSE for the HCI parameter, for which the selection bias of the unweighted estimator is large. The results show a large decrease in bias and MSE for all response patterns and for both PSA methods, which demonstrates the robustness of the adjustment method. The reduction in bias and MSE differs across patterns. PSA with the reference sample drawn under the stratified design, $s_{r1}$, provided a lower RMSE when the convenience sample was drawn using NP1. On the other hand, PSA using the reference sample drawn with probabilities proportional to income, $s_{r2}$, provided much less biased estimates overall when the selection mechanism was NP2, NP3 or NP4.
6. Conclusions
Technological development has made large amounts of inexpensive data (commonly known as Big Data) available for researchers to use for inference. New survey administration methods have also favoured the rise of data from nonprobability samples. Inferences from Big Data and nonprobability surveys are subject to important sources of error ([4,24,28], among others). Given the characteristics of these data collection procedures, selection bias is particularly relevant.
Despite the growing interest raised by nonprobability data (whether coming from Big Data or from nonprobability surveys), there is still a lack of rigorous theory for making statistical inferences on general parameters through estimating equations. The current paper aims to fill this gap by establishing a theoretical framework for the estimation of general parameters with nonprobability samples.
The results of our simulation study provide strong evidence of the efficiency of methods based on estimating equations with estimated propensities. However, it must be noted that this efficiency depends on the selection mechanism of the nonprobability sample and on the availability of covariates for propensity estimation. In our simulations, Propensity Score Adjustment was more efficient when the propensity of being in the nonprobability sample was less related to the variable of interest. This behaviour has been observed in the literature on PSA for parametric estimation [11,24].
We used parametric methods to obtain the estimated propensities, but machine learning techniques such as regression trees, spline regression or random forests could also be used. Recently, [24,29] presented simulation studies where decision trees, k-nearest neighbours, Naive Bayes, Random Forest, Gradient Boosting Machine and Model Averaged Neural Networks were used for propensity score estimation. These studies compare the empirical efficiency of linear models and Machine Learning prediction algorithms in the estimation of linear parameters, but the corresponding theory is more complex and has not yet been developed. Another way to reduce the bias of the PSA estimates is to combine PSA with other techniques such as Statistical Matching or calibration. [27] apply a combination of propensity score adjustment and calibration on auxiliary variables in a real volunteer survey aimed at a population for which a complete census was available. [32] propose a doubly robust estimator for population mean estimation by incorporating the model-based estimation framework into PSA methods, improving their efficiency and making them robust to model misspecification. Further research should focus on extending these methods to general parameter estimation.