1. Introduction
The enterprise population is constantly changing. Its elements are economic units, and the indicators characterizing them may vary: main activity, composition (constitutive establishments), number of employees, and revenue. These changes are fairly smooth in periods of economic stability; however, they may become more drastic in periods of economic growth or crisis.
Table 1 shows the changes in GDP and employment during the COVID-19 pandemic in the 19 countries of the euro area (Eurostat [1, 2]). It shows that the relative percentage change in the gross domestic product on the previous period in the 1st quarter of 2020 equals −7.8, while it was positive in the 4th quarter of 2019 and earlier. Correspondingly, the number of employed persons in the 1st quarter of 2020 decreased by 1.9 million since 2019. Rapid changes in both directions are observed in later periods: an increase in the number of employed persons and in GDP in the 3rd and 4th quarters of 2020, and a repeated decrease in the 1st quarter of 2021. All of this points to changes in the enterprise population.
Population changes influence the choice of statistical methods by the survey statisticians aiming to obtain survey results which would be as accurate as possible. The errors of statistical estimates in the sample surveys arise from many different sources: quality of the sampling frame, sampling itself, record linkage, data editing, adjustment for nonresponse, imputation, quality of auxiliary information, estimators used, etc.
Errors from different sources constitute the total survey error (TSE). The history of this concept is narrated by Lyberg et al. [3]. In the early days of sample surveys, the sampling design and, among the non-sampling sources, its implementation (questionnaire effect, interviewer effect, nonresponse) were considered the only sources of survey error. In some offices, even industrial quality control methods were applied to reduce their influence on the survey results. Around 1970, it was still believed that all error sources could be taken into account when designing a survey. At the end of the 20th century, other aspects of survey errors appeared on the statisticians’ agenda. TSE is a statistical instrument aiming to cover the influence of all error sources on the quality of statistical results. Demographic events, such as enterprise birth, reactivation, death, merger, survival and many others, are defined for enterprises. They are classified in detail and included in the Guidelines on the Use of Statistical Business Registers [4]. This shows the importance of the enterprise survival events for statistical purposes.
In connection with TSE, the participants of the European Establishment Statistics Workshop [5] introduced the concept of the “unit problem”, which addresses the situation in which the unit from which the data are obtained differs from the unit from which the data should be obtained, i.e., a unit error. The main aspects of the unit error and the corresponding unit problem are summarised by Delden et al. in [6]. The authors propose to include unit errors in the total survey error framework. The unit problem thus refers to a unit error that affects the quality of statistical results.
A question was raised in [7]: should all statistics produced include an estimate of the unit model error and show how the statistical output depends on various unit error models? This would be a large and complicated task, which could be performed only for a small number of case studies because of the effort required to define the units and to distribute the collected data among them. It would not be possible to create a single measure of the unit error because of the diversity of errors. Sometimes, even the structure of the unit is not clear. Nevertheless, a case study estimating the unit error helps to understand the contribution of the unit error to the uncertainty of statistical results.
The current paper shares the opinion that only specific case studies can present situations in which the influence of the shortcomings of the statistical input on the quality of statistical results can be measured. We present some case studies of business surveys conducted by various statisticians in Section 2. Section 3 includes an introduction to survey sampling and the stratified sampling design, followed by our own case study. It is devoted to measuring the impact of unit errors on the bias and variance of the estimator of the total in a business survey when enterprises may join, split or change stratum between sample selection and data collection. First- and second-order inclusion probabilities are recalculated for the enterprises which have merged, assuming a joining model. Expressions for the estimator of the stratum total and its variance are obtained for the case when enterprises change stratum, assuming a stratum change model. The statistical simulation results in Section 4 provide a numerical illustration and show large changes in the variance estimates for the estimators of the totals due to enterprise population changes. The article ends with Results, Discussion and Appendix. This study overlaps in part with the results included in the project report [8]. The results of the whole project on quality for multisource statistics are briefly described in [9].
2. Overview of the Case Studies
The overview starts with the measurement errors and data editing, continues with the errors due to the inaccurate usage of the terms, and finally deals with the changes in the enterprise composition.
Knowledge of the effect of data errors on the accuracy of the statistics produced is also useful for the editing process, which is aimed at improving data quality. It is not possible to edit all data units manually because of time limitations, costs, and quality requirements. De Waal et al. [10] developed selective editing methods. These methods aim at limiting manual editing by focusing on the units with a high risk of errors that substantially affect the accuracy of statistical results.
In order to reduce the TSE, selective editing is considered as an algorithm acting in two steps: finding the erroneous observations and minimizing the number of records going through manual editing. This means that some of the observations fail the editing checks and still remain unadjusted. A Heckman model for these remaining observations is proposed by Laitila et al. [11]. According to this model, an observation may be without an error or may have an error satisfying the linear model. After the parameters of this model have been estimated, the expected values of the measurement errors can be estimated.
In order to clarify the following discussion, we include the definitions of some of the concepts used in the article (OECD, [12]). An enterprise is the smallest combination of legal units that forms an organisational unit producing goods or offering services. Legal units are economic units acting in their own right. An enterprise carries out its activities in one or several locations: it consists of local units if there is more than one of them; otherwise, it is itself a local unit. An establishment is a local unit dealing with one economic activity. The importance of the definitions of statistical units in business surveys is discussed in [13]. They are connected with the statistical business register. Confusion of terms leads to an ambiguous classification of the population and to inaccurate and incomparable statistical results.
Business surveys in which enterprises consist of one or several local units are discussed in [14]. There, the survey estimates are needed by geographical location, which may include only some of the local units, or only part of some enterprises. This situation may be considered as splitting the sampling unit (the enterprise). Local geographical areas can be considered as unplanned estimation domains. Consequently, different methodological approaches stemming from the small area estimation framework can be applied in order to estimate the value added at the local geographical area level. The geographical location of the local units of enterprises is exploited in order to express the possible link between enterprises and territory. A statistical method is used to allocate an economic indicator, value added, to each local unit. Three proportional methods are investigated for the value-added allocation:
- (a) each local unit is assigned an equal value added;
- (b) each local unit is assigned a value added in proportion to its number of employees;
- (c) each local unit is assigned a value added in proportion to its labour cost.
The empirical study shows that, according to a preliminary evaluation criterion based on the preservation of the totals, method (c) seems to be preferable.
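The three allocation rules can be illustrated with a minimal Python sketch. The function name, the dictionary keys `employees` and `labour_cost`, and the numbers are assumptions made for this illustration only; they are not taken from [14].

```python
def allocate_value_added(total_value_added, local_units, method):
    """Split an enterprise's value added among its local units.

    local_units: list of dicts with the hypothetical keys
                 'employees' and 'labour_cost'.
    method: 'equal' (a), 'employees' (b) or 'labour_cost' (c).
    """
    if not local_units:
        raise ValueError("enterprise has no local units")
    if method == "equal":
        weights = [1.0] * len(local_units)
    elif method in ("employees", "labour_cost"):
        weights = [float(unit[method]) for unit in local_units]
    else:
        raise ValueError(f"unknown method: {method}")
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("allocation key sums to zero")
    # Proportional allocation: the shares always add up to the enterprise total.
    return [total_value_added * w / weight_sum for w in weights]


# Illustrative enterprise with two local units (made-up figures).
units = [
    {"employees": 10, "labour_cost": 300.0},
    {"employees": 30, "labour_cost": 500.0},
]
shares_equal = allocate_value_added(1000.0, units, "equal")
shares_emp = allocate_value_added(1000.0, units, "employees")
shares_cost = allocate_value_added(1000.0, units, "labour_cost")
```

All three rules preserve the enterprise total by construction; they differ only in how the total is distributed over the territory.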
Various types of change in the business population, along with a discussion of whether they are deemed to have an impact on statistical results, are presented in [15].
Attention was paid to enterprise births and deaths already in 1989 by Holt and Smith [16]; they studied the change of the population means and totals through time for overlapping sampling designs and changing sample composition. For a non-stratified sampling design, the difference between the means was studied by decomposing it into three terms (components):
- (a) The partial effect of changing domain-specific means assuming no change in domain composition;
- (b) The partial effect of changing the domain composition assuming no change in domain-specific means;
- (c) The interaction between (a) and (b).
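A decomposition of this kind can be sketched numerically. The Python example below uses a standard algebraic identity for the change of an overall mean written as a weighted sum of domain means; the notation (domain proportions `p1`, `p2` and domain means `m1`, `m2` on two occasions) is an assumption of this sketch and is not necessarily the exact form used in [16].

```python
def decompose_mean_change(p1, m1, p2, m2):
    """Split the change in the overall mean between two occasions.

    p1, p2: domain proportions on occasions 1 and 2 (each sums to 1).
    m1, m2: domain-specific means on occasions 1 and 2.
    Returns the three components (a), (b), (c).
    """
    # (a) change of domain means, domain composition held at occasion 1
    means_effect = sum(a * (d - c) for a, c, d in zip(p1, m1, m2))
    # (b) change of domain composition, domain means held at occasion 1
    composition_effect = sum(c * (b - a) for a, b, c in zip(p1, p2, m1))
    # (c) interaction of the two changes
    interaction = sum((b - a) * (d - c) for a, b, c, d in zip(p1, p2, m1, m2))
    return means_effect, composition_effect, interaction


p1, m1 = [0.5, 0.5], [10.0, 20.0]
p2, m2 = [0.4, 0.6], [12.0, 21.0]
effect_a, effect_b, effect_c = decompose_mean_change(p1, m1, p2, m2)

overall_change = (sum(p * m for p, m in zip(p2, m2))
                  - sum(p * m for p, m in zip(p1, m1)))
```

The three components add up exactly to `overall_change`, which is what makes the decomposition useful for attributing a change in the mean to its sources.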
A panel of companies is studied by Knottnerus in [17], where the same units of the population are observed in multiple periods in order to track the trend over time. A discussion of the variances of different estimators, taking into account the migration of companies between strata, replacements for companies that drop out, and the impact of company mergers on the panel, is presented. Expressions for the variances of the estimated monthly revenue totals in such panels are obtained.
Further, the authors of [18] consider establishment surveys providing monthly turnover estimates by economic activity classification code. Monthly surveys use overlapping samples, which include establishment deaths, births and stratum changes. Estimates of the turnover change over a 12-month period should also be obtained. The variance estimator for such a change is complicated because of the correlation between the estimates for consecutive months. The article describes a general variance estimation procedure. The procedure allows for yearly stratum corrections when establishments move into other strata according to their actual sizes. It also takes into account sample refreshments, births and deaths, etc. In deriving formulas for the variance of an estimated change in a population with dynamic strata, one has to pay attention to three complicating factors. Firstly, the change in a level is the result of two components. One component is due to the change in the population mean of units that remain in the same stratum on both occasions. The other component is caused by the change in the stratum composition between two occasions, resulting from births and deaths in the population and from population units that migrate between strata. Secondly, due to the migration of population units between strata, the estimated mean of stratum $h$ at occasion $t$ may be correlated with the mean of stratum $l$ at occasion $t + 1$. Thirdly, the population is repeatedly sampled, resulting in partially overlapping samples between two occasions. The variance of the yearly growth rate includes the covariance between the total turnover of all establishments in the population in month $t$ and in month $t − 12$. The estimation of this covariance is the most challenging element of the study.
An enterprise population of size $N$, stratified by the code of economic activity available in a sampling frame based on a business register (BR), is considered by Delden et al. in [19]. Each population unit $i$ ($i = 1, \dots, N$) has an unknown true industry code and an observed industry code. These industry codes do not necessarily coincide: the observed codes may contain errors, which are assumed to be independent across units. These classification errors affect the estimates of industry totals. By introducing some additional assumptions, the authors define the transition matrix of classification-error probabilities. An audit sample, the dynamics in the business register, and expert knowledge are used to estimate this transition matrix, so that a classification error probability model is obtained. The bias and variance of the turnover estimates arising from the classification errors are estimated using the bootstrap. In addition, the extent to which manual selective editing at the micro level can improve the accuracy is studied.
The population register is considered as a frame for an enterprise sample survey by Burgard et al. in [20]. A sampling design stratified by region, size and economic activity is applied. If an enterprise’s activity or size changes between sample selection and data collection, then, at the estimation stage, the enterprise should belong to a stratum different from the one in which it was selected. The authors call such enterprises “stratum jumpers”. In this situation, the original sampling design weights can no longer be applied. The problem is solved by applying various reweighting methods. Two approaches to reweighting are proposed: case number-based and model-based reweighting methods. The case number-based methods are as follows: new weights are constructed by estimating the actual population stratum size at the estimation stage and dividing it by the actual stratum sample size obtained as an average calculated for the original design weights in the actual sample stratum at the estimation stage. The model-based reweighting method implements weight smoothing of the newly obtained strata at the estimation stage by applying an inverse exponential model and a penalised spline model to the design weights. The simulation study shows the preference of the first proposed reweighting method with respect to bias and relative root mean squared error.
A problem is raised by Fizzala in [21]: how should we deal with the changes in the composition of enterprises? The solution for the French structural business statistics survey is presented. Since 2016, the survey population has consisted of enterprises. When an enterprise is selected for the sample, all its constituent legal units are included in the legal unit sample, and data should be collected from the legal units. At this stage, the cluster sample is available. However, this is not the end. “For reference year
t, the samples are drawn in November
t with links between legal units and enterprises referring to year
, the most recent available at this date. A few months later, new links referring to year
, thus being more up-to-date, are available, and it is natural to try to use them to produce the results at the enterprise level concerning year
t”. It happens that after updating the links between enterprises and legal units, the updated composition of the enterprise may include new local units which were not included at the sample selection stage, or it may lack some previously selected local units. The author considers that the inclusion probabilities of the enterprises comprising the up-to-date sample, or estimation weights, “are hard to obtain”. Therefore, the generalised weight share method (GWSM) is used. This method is introduced and developed in the book [22], where it is explained on p. 13 that “this estimation weight basically constitutes an average of the sampling weights” of the enterprise population from which the sample is selected. Estimates using different GWSM versions are produced; the evaluation of the accuracy of the GWSM estimator remains in the future plans of the author. The problem of using enterprises and their local units for statistical purposes at INSEE is also studied in [23, 24].
The unit problem has been extensively discussed at international conferences, including several presentations at the Fifth International Conference on Establishment Surveys [25, 26]. It is confirmed that there are “big differences between the population surveys and business surveys. The same unit could be classified entirely differently after data collection, etc. and this distort parameter estimates and variance estimates”.
3. Design-Based Estimators in the Case of the Enterprise Structure Changes
3.1. Finite Population Surveys, Stratification
A finite population $U$, or simply a population, is the object of study in survey sampling. A study variable $y$ is defined for all elements of the population. Its parameters, such as the mean, total and quantiles, are of interest to surveyors but are unknown. A probability sample is selected from the population in order to obtain the values of the study variable and to estimate the parameter. The population is considered to be fixed, and randomness arises due to the random selection of the sample. A probability distribution describing all possible samples which can be selected according to the sampling plan, together with their selection probabilities, is called a sampling design. Various sampling designs are used [27]. One of the simplest sampling designs is fixed-size equal-probability sampling without replacement, called simple random sampling. Depending on the structure and properties of the population, the list of population elements available for sample selection and the aim of the survey, unequal probability sampling of elements, cluster sampling of element groups (clusters), cluster sampling in two or more stages, designs with two or more phases (subsampling), etc. can be used. In this article, we use the stratified sampling design [28, 29]. In this case, the population is divided into non-intersecting groups of elements, called strata, and samples are selected in each stratum independently. This sampling design was introduced by Neyman [30] in 1934. Not only did his article lay the basis for design-based survey statistics, it also provided a theoretical foundation for survey statistics as a field of science in general.
The reasons for stratification can be various: organisational convenience, a natural division of the population, or the need to decrease the variance of the estimator of a parameter of a study variable. Often, various parameters are expressed exactly or approximately through a mean or a total. The population total of a study variable under a stratified sampling design is expressed as the sum of the stratum totals, and its estimator equals the sum of the estimators in the strata. Since the sample is selected independently in each stratum, the stratum totals can be estimated separately and added, and the variance of the estimator of the total equals the sum of the variances of the estimators of the stratum totals. Hence, if the values of the study variable are homogeneous within the strata and differ significantly between strata, the estimator of the total will have a small variance in comparison with simple random sampling. This may happen for a population with an asymmetric distribution of the study variable. Examples are a population of enterprises of various sizes with a study variable expressing the amount of production, or agricultural farms with the study variable “harvest of a certain crop”.
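The additivity of the stratum estimators and of their variances can be sketched in Python as follows. The sketch assumes simple random sampling without replacement in each stratum and uses the standard expansion estimator with its usual variance estimator; the function name, the stratum labels and the data are illustrative assumptions.

```python
from statistics import variance


def stratified_total(samples, stratum_sizes):
    """Estimate a population total and its variance under stratified SRS.

    samples: dict mapping stratum label -> list of observed values.
    stratum_sizes: dict mapping stratum label -> stratum size N_h.
    """
    total = 0.0
    var_est = 0.0
    for h, y in samples.items():
        N_h, n_h = stratum_sizes[h], len(y)
        # Stratum total estimator: N_h times the sample mean.
        total += N_h * sum(y) / n_h
        # A completely enumerated stratum contributes no sampling variance;
        # otherwise at least two observations are needed for variance().
        if n_h < N_h:
            var_est += N_h ** 2 * (1 - n_h / N_h) * variance(y) / n_h
    return total, var_est


samples = {"small": [1.0, 3.0], "large": [10.0, 14.0, 12.0]}
sizes = {"small": 10, "large": 3}  # the "large" stratum is a census
t_hat, v_hat = stratified_total(samples, sizes)
```

In this made-up example only the sampled stratum contributes to the variance, which reflects the common practice of enumerating the largest enterprises completely.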
In order to efficiently stratify the survey population, i.e., to decrease the variance of the estimator of the total/mean, the following decisions should be made: a stratification variable (one or more) should be chosen, stratification boundaries should be defined, the sample size should be allocated to the strata, and the sampling design should be specified in each stratum. Let us look into the recent works devoted to the stratification problem.
An overview of the methods used to define stratification boundaries is presented in [31]. The authors propose their own method using a moving average for two lagged values of one stratification variable to determine the strata boundaries. They present an empirical comparison of their method with the widely known square root method [32] for fixing stratum boundaries and with geometric stratification [33] in the case of optimal allocation and proportional sample size allocation [32]. Two stratification variables are used in [34, 35]. The distributions of the stratification variables are assumed, and the variance minimization problem for the estimator of the population mean of the subsidiary variable is solved, giving optimal strata boundaries. Such a method is very sensitive to the assumptions and to the distribution of the study variable. An important aspect of stratification arises in medicine, where the response probability for individuals should be predicted [36]. Optimal stratification for a medical data set has two aims: to satisfy the stratification requirements in order to minimize the variance of the estimator for the response probability, and to obtain medically meaningful strata. The methods for the choice of the stratification boundaries mentioned above use one or two stratification variables and aim at minimising the variance for just one study variable. They become inefficient in the case of a data set with many variables. An example of classification/stratification of patients using machine learning methods is presented for a data set with 85 variables [37]. A large number of variables makes it problematic to use traditional methods; in this situation, machine learning proves to be helpful. Another example of using machine learning methods for classification is found in a survey of forestry resources [38], where they are used for the estimation of totals and proportions. Here, land stratum boundaries are defined by the square root method [32] and by the $k$-nearest neighbour method. A post-stratified estimator [28, 32] is used as the estimation method, and the estimation results show higher accuracy in the case of traditional square root stratification. Classification trees, regression trees and random trees methods are proposed for determining the stratum boundaries in [28]. Stratification may also be used as an element of a complex sampling design. A two-phase stratified sampling design is considered in [39], and nine exponential ratio estimators are proposed to estimate the population proportion possessing a certain attribute.
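For orientation, the square root method mentioned above (the cumulative square root of frequencies rule) can be sketched in a few lines of Python. This is a simplified illustration, not the procedure of any of the cited papers: the number of histogram bins, the function name and the test data are assumptions of the sketch.

```python
import math


def cum_sqrt_f_boundaries(values, n_strata, n_bins=50):
    """Stratum boundaries by the cumulative square root of frequencies rule.

    values: stratification variable for all frame units.
    Returns n_strata - 1 upper boundaries of the first strata.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("stratification variable is constant")
    width = (hi - lo) / n_bins
    # Histogram of the stratification variable with equal-width bins.
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    # Cumulate the square roots of the bin frequencies.
    cum, running = [], 0.0
    for c in counts:
        running += math.sqrt(c)
        cum.append(running)
    # Cut the cumulated scale into n_strata equal intervals.
    step = running / n_strata
    boundaries = []
    for j in range(1, n_strata):
        target = j * step
        idx = next(i for i, c in enumerate(cum) if c >= target)
        boundaries.append(lo + (idx + 1) * width)
    return boundaries


# Skewed illustrative size variable (made-up data).
sizes = [x * x for x in range(1, 201)]
bounds = cum_sqrt_f_boundaries(sizes, n_strata=4)
```

For a right-skewed variable such as enterprise size, the rule places the boundaries closer together among the small units, where most of the population is concentrated.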
There are many kinds of statistical studies involving random selection, and not all of them are finite population surveys. For example, a randomised controlled trial [40] is a planned experiment rather than a finite population sample survey. In randomised controlled trials, the trial designer randomly assigns treatments to experimental subjects in order to precisely estimate the effects of all treatments.
The construction of the stratum boundaries presented here (except for the machine learning methods) aims at minimising the variance of the estimator for one subsidiary variable, which should be highly correlated with the study variable and is known before the survey for all population elements. After data collection, it may turn out that the real study variable is not so well correlated with the subsidiary variable used for stratification, and the variance of its estimator is higher than expected. There may be further study variables which have an even lower correlation with the subsidiary variable; some sampling units may not respond to the questionnaire; some changes may appear in the structure of the sampling units between sample selection and data collection. Estimates are usually needed not only for the whole population but also for its domains. A stratum is a special case of a domain. The aim of the current article is to show how the changes in the population between sample selection and data collection influence the accuracy of the estimator of the stratum total.
3.2. Changes of the Enterprise Population over Study
Usually, all enterprises are registered in a business register (BR). It includes such important variables as the statistical classification code of the economic activity and the number of employees. The sampling frame for enterprise surveys is constructed from this BR. The kind of economic activity and the employee size groups are used as stratification variables to design a stratified simple random sample. Let us assume that the sampling frame has perfect population coverage and includes all the elements of the survey population. Now we assume that changes affecting the stratification of the units occurred between sample selection and data collection. Let us call a sample drawn from the sampling frame a selected sample. After the survey data of the sampled enterprises have been obtained, the observed sample reveals that some of the sampling units have changed their values of the stratification variables. Three types of changes are considered:
- (a) Joining of the sampling unit with other enterprises, which may be from the same or from a different stratum;
- (b) Splitting of the sampling unit into new enterprises belonging to the same or to a different stratum;
- (c) Change in the value of the stratification variable (economic activity code or enterprise size group).
The changes may arise because the errors of classification emerged after sample selection and need to be corrected, or because the enterprises changed their characteristics between the sample selection and data collection.
3.3. Notations
Let the survey population $U$ consist of the elements $u_k$, $k = 1, \dots, N$. We consider that $u_k$ is the $k$th unit in the selected sample and $u_k^*$ is its update in the observed sample. The initial unit $u_k$ may differ in composition from $u_k^*$, but it is the same unit which survived changes, and it still has the responsibility of the unit $u_k$ to report the data observed. The changes of types (a), (b) and/or (c) might have occurred between units $u_k$ and $u_k^*$.
The population $U$ is stratified into $H$ nonintersecting strata $U_h$ of size $N_h$, $h = 1, \dots, H$. A simple random sample $\omega_h$ of size $n_h$ is selected from each stratum $U_h$ independently, giving the joint stratified sample $\omega = \omega_1 \cup \dots \cup \omega_H$ of size $n = n_1 + \dots + n_H$.
As mentioned, the observed units may differ from the selected units, and we denote the observed sample by $\omega^*$, which need not coincide with the selected sample $\omega$.
Let us denote by $y$ a study variable defined for the survey population. The aim of the survey is to estimate the population total of the variable $y$. The total for the study population without any errors is expressed as
$$t_y = \sum_{k=1}^{N} y_k,$$
where $y_k$ denotes the value of the variable $y$ for enterprise $u_k$ in the sampling frame at the time of sampling. Let $y_k^*$ denote the observation of the variable $y$ for enterprise $u_k^*$ at the time of observation. For the changed population $U^*$ of size $N^*$, we define the total
$$t_y^* = \sum_{u_k^* \in U^*} y_k^*. \qquad (1)$$
The aim is to estimate this total. Certain steps are taken to solve this problem in [41].
3.4. Change Type (a)
The unit $u_k^*$ from the observed sample $\omega^*$ is composed of a union of some $m$, $m \geq 1$, units from the sampling frame, $u_k^* = u_{k_1} \cup \dots \cup u_{k_m}$, with at least one of the components belonging to the sample $\omega$. In this case, all the sampled units are identified; however, the values of the study variable are not known for these units. Instead, we observe one value $y_k^*$ of the study variable for the unit $u_k^*$.
Because the values of the study variable are not known for all the elements of the sample $\omega$, we cannot use the Horvitz–Thompson estimator [32]:
$$\hat{t}_y = \sum_{u_k \in \omega} \frac{y_k}{\pi_k}. \qquad (2)$$
The population has changed since the sample selection, and it does not make sense to estimate the total $t_y$. So, we are interested in estimating
$$t_y^* = \sum_{u_k^* \in U^*} y_k^*,$$
which is the total $t_y$ corrected for the change type (a). Here, $\pi_k$ is the inclusion probability of the element $u_k$ into the sample $\omega$. In the case of a stratified simple random sample, $\pi_k = n_h / N_h$ if unit $u_k \in U_h$. An alternative may be to use the estimator
$$\hat{t}_y^* = \sum_{u_k^* \in \omega^*} \frac{y_k^*}{\pi_k^*} \qquad (3)$$
with the inclusion probabilities $\pi_k^*$ for the elements of the observed sample $\omega^*$, given the population change type (a). This provides us with an unbiased estimate of the population total $t_y^*$ because it accounts for the adjusted inclusion probabilities of the units in the sample.
3.4.1. Estimation of the Inclusion Probabilities and Estimator of the Total
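The unit $u_k^*$ is included in the sample $\omega^*$ if at least one of its component parts $u_{k_1}, \dots, u_{k_m}$ is selected in the sample $\omega$. The number $m$ can be decomposed as $m = m_1 + \dots + m_H$, where $m_h$ is the number of elements from the stratum $U_h$. These are integers, where $m_h \geq 0$ and at least one of $m_h > 0$. Then, the inclusion probability is expressed as
$$\pi_k^* = 1 - \prod_{h=1}^{H} \frac{\binom{N_h - m_h}{n_h}}{\binom{N_h}{n_h}} \qquad (4)$$
with $\binom{a}{b} = 0$ for $b > a$.
Note that for $m = 1$ ($m_h = 1$), Formula (4) gives $\pi_k^* = n_h / N_h$, for $u_k \in U_h$.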
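The inclusion probability of a merged unit can be checked numerically. The Python sketch below assumes stratified simple random sampling without replacement and computes one minus the probability that none of the components of the merged unit is drawn; the function and argument names are introduced here for illustration only.

```python
from math import comb


def merged_inclusion_probability(m, N, n):
    """Inclusion probability of a unit formed by merging frame units.

    m[h]: number of components of the merged unit in stratum h,
    N[h]: size of stratum h, n[h]: sample size in stratum h.
    """
    prob_none_selected = 1.0
    for m_h, N_h, n_h in zip(m, N, n):
        # Probability that an SRS of size n_h misses all m_h components;
        # comb() returns 0 when n_h exceeds N_h - m_h.
        prob_none_selected *= comb(N_h - m_h, n_h) / comb(N_h, n_h)
    return 1.0 - prob_none_selected


# Two strata with sizes 10 and 20 and sample sizes 2 and 5 (made-up values).
pi_single = merged_inclusion_probability([1, 0], [10, 20], [2, 5])
pi_merged = merged_inclusion_probability([1, 1], [10, 20], [2, 5])
```

With a single component the result reduces to the usual $n_h / N_h$ (here 0.2), while a unit merged from one enterprise in each stratum is included with probability 0.4, which is larger than either of the original inclusion probabilities.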
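Now let us calculate the second-order inclusion probabilities $\pi_{kl}^*$. Let the element $u_k^*$, $u_k^* \in \omega^*$, be defined as the composite unit above, and the number $m$ be decomposed as above. Assume that the second element $u_l^*$ is decomposed: $m' = m'_1 + \dots + m'_H$, $m'_h \geq 0$. Let us denote by $r_h$, $h = 1, \dots, H$, the number of elements from the set $u_k^* \cup u_l^*$ that belong to the stratum $U_h$. The number $r$ of elements in this set may also be decomposed: $r = r_1 + \dots + r_H$, $r_h \geq 0$, with at least one of the $r_h > 0$. Thus, we have
$$\pi_{kl}^* = \pi_k^* + \pi_l^* - \left(1 - \prod_{h=1}^{H} \frac{\binom{N_h - r_h}{n_h}}{\binom{N_h}{n_h}}\right) \qquad (5)$$
for $k \neq l$; and $\pi_{kk}^* = \pi_k^*$ otherwise. Based on the Horvitz–Thompson estimator of the total [32] and the inclusion probabilities (4), (5), we obtain
Result 1. After enterprises joining between sample selection and data collection, the estimator (3) with inclusion probabilities (4), (5) is unbiased for the changed population total $t_y^*$ with variance
$$\mathrm{Var}(\hat{t}_y^*) = \sum_{u_k^* \in U^*} \sum_{u_l^* \in U^*} \left(\pi_{kl}^* - \pi_k^* \pi_l^*\right) \frac{y_k^*}{\pi_k^*} \frac{y_l^*}{\pi_l^*},$$
and the variance estimator
$$\widehat{\mathrm{Var}}(\hat{t}_y^*) = \sum_{u_k^* \in \omega^*} \sum_{u_l^* \in \omega^*} \frac{\pi_{kl}^* - \pi_k^* \pi_l^*}{\pi_{kl}^*} \frac{y_k^*}{\pi_k^*} \frac{y_l^*}{\pi_l^*} \qquad (6)$$
is unbiased.
3.4.2. Comparison of the Estimators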
The total $t_y^*$ in (1) is estimated by the estimator (3) with the inclusion probabilities (4); its variance is estimated by (6) using (5). The estimates obtained are compared with the estimates obtained using the Horvitz–Thompson estimator (2) by constructing relative measures of difference between the estimators of the total and between their variance estimators. The expectation of the relative difference is taken with respect to the sampling design.
In case a statistician does not account for the change of type (a), one would use the unadjusted inclusion probabilities $\pi_k$ to compute the estimate
$$\hat{t}_y = \sum_{u_k \in \omega} \frac{y_k^*}{\pi_k}$$
of the total $t_y^*$. Note that the observed values $y_k^*$ are included in the estimator, because no other values are available, and they are obtained for the units belonging to $\omega$ by modelling and imputation. The estimator $\hat{t}_y$ is in general biased for $t_y^*$, whereas the estimator (3) is unbiased for $t_y^*$. The relative difference between the estimators $\hat{t}_y$ and $\hat{t}_y^*$ is defined as
$$\frac{\hat{t}_y - \hat{t}_y^*}{\hat{t}_y^*}.$$
At the same time, the expectation of this indicator may be considered as an approximate relative bias of $\hat{t}_y$:
$$\mathrm{E}\left(\frac{\hat{t}_y - \hat{t}_y^*}{\hat{t}_y^*}\right) \approx \frac{\mathrm{E}(\hat{t}_y) - t_y^*}{t_y^*}.$$
These accuracy measures are used in the statistical simulation study in Section 4.
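The effect of ignoring a merger can be illustrated on a toy population by enumerating all possible samples. The Python sketch below is not the simulation of Section 4: it assumes a single stratum, simple random sampling without replacement, a merged unit that reports the sum of the values of its components, and a naive estimator that weights every distinct observed unit by $N/n$. All names and numbers are illustrative assumptions.

```python
from itertools import combinations
from math import comb


def expected_estimates(y, groups, n):
    """Design expectations of the adjusted and the naive estimator.

    y: values of the frame units at selection time.
    groups: tuples of frame indices; each tuple is one observed unit
            (a tuple with several indices is a merged enterprise).
    n: sample size of the simple random sample drawn from the frame.
    """
    N = len(y)
    pi_naive = n / N
    samples = list(combinations(range(N), n))
    adjusted_sum = naive_sum = 0.0
    for s in samples:
        chosen = set(s)
        for g in groups:
            if chosen.intersection(g):  # observed if any component is drawn
                y_star = sum(y[i] for i in g)  # assumed reported value
                pi_star = 1 - comb(N - len(g), n) / comb(N, n)
                adjusted_sum += y_star / pi_star
                naive_sum += y_star / pi_naive
    return adjusted_sum / len(samples), naive_sum / len(samples)


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
groups = [(0, 1), (2,), (3,), (4, 5)]  # two mergers, two unchanged units
true_total = sum(y)
adjusted, naive = expected_estimates(y, groups, n=2)
relative_bias_naive = (naive - true_total) / true_total
```

In this toy case the adjusted estimator reproduces the total on average, whereas the naive one overestimates it, because merged units enter the observed sample more often than the original design weights suggest.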
3.5. Change Type (b)
Besides the sample units that merge, it is possible to observe other elements that have split since the selected sample was drawn. Information about the study variable is not received from all of the split parts, and the value of the variable is not what it would be if the unit had not split. If the value is observed only for some of the split parts of the unit $u_k$, then splitting can be viewed as second-stage sampling with the second-stage inclusion probability $\pi_{l|k}$ and the final inclusion probability $\pi_k^* = \pi_k \pi_{l|k}$. The probability $\pi_k^*$ is not greater than $\pi_k$, in contrast to the change type (a), where $\pi_k^* \geq \pi_k$. An increase in the variance of the estimator due to the second-stage sampling should be taken into account. It shows the increase in variance due to enterprise splitting.
Another solution to the problem would be to consider as missing those values of the study variable of the split parts which have not provided data. Any imputation method can be used to fill in the missing values of the study variable, and then the population total can be estimated. An increase in the variance of the estimator due to imputation should be taken into account. It also shows an increase in variance due to enterprise splitting.
3.6. Change Type (c)
According to the information received from the observed units, the values of the classification variables (for example, the economic activity code or the size group) may have changed since the sample selection. Below, we study the influence of these changes on the accuracy of the estimator of the total and on its variance. Let the population of one economic activity be divided into neighbouring strata by size at the time of sample selection: .
Let mean the enterprise population in the stratum at the time of sample selection, where is its size. A simple random sample of size is selected from this stratum. Let , be a subsample of the enterprises reporting other stratum codes than selected, their number is . Further, let be a set of enterprises remaining in the stratum , their number is . Let , be the subset of enterprises from the stratum , which reported belonging to the stratum and their number is . Additionally, let , , be a subset of enterprises from the stratum , which reported belonging to the stratum , and their number is . This notation is reasonable in the case of size class changes: sample elements might move one stratum left or right (due to the decrease or increase in the number of employees). In the case of changes in economic activity codes, and can be viewed as any other two economic activity codes containing elements which belong to the kind of activity code .
Let denote the domain at the observation time after enterprises report their stratum codes. It is an evolution of the stratum .
Denote by
—a sample subset that belongs to the stratum
h at the observation time, its size is
. Let
be a study variable,
denote the stratum population total at the observation time. Taking the population changes into account, we estimate this total (in effect, a domain total) as follows:
Because the samples in the strata are independent, the following equality for the variances holds:
Remark 1. If the stratum is h = 1 or h = H, then it lacks one of the neighbouring strata, h − 1 or h + 1, respectively; the sample does not have the corresponding input, and Equations (9)–(11) have only two terms on the right-hand side.
In order to study changes in the variance of the estimator for the stratum total due to the change in the values for the classification variable, certain assumptions for the enterprise stratum change mechanism are made.
Assumption 1. Let us assume that any enterprise changes the stratum independently with the same probability, . With the probability , an enterprise changes stratum to the stratum , stratum h to the stratum and vice versa, independently of each other. We consider that the stratum h has two neighbouring strata, and the probability for the enterprise to leave it equals , while the probability to remain is . We consider that the situation can be described by a two-phase sampling design with Bernoulli second-phase sampling, in which enterprises from the strata and come to the stratum with the probability , and enterprises independently remain in the stratum with the probability . In the case , an enterprise remains in the stratum h with the probability .
This assumption is one more feature that distinguishes our study from the study [
17], which is dedicated to business panels.
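The migration mechanism of Assumption 1 can be sketched as a small simulation. The sketch reads the assumption as follows: an enterprise in an interior stratum moves to each of its two neighbours with probability p (so it leaves with probability 2p), while an enterprise in the first or last stratum moves to its only neighbour with probability p. This reading, the function name `migrate` and all parameter names are assumptions made for illustration, not the article's notation.

```python
import random


def migrate(strata, H, p, rng):
    """Return reported stratum codes after independent moves to neighbours.

    strata: original stratum codes, each in 1..H (H >= 2).
    p: probability of moving to one given neighbouring stratum.
    rng: a random.Random instance, passed in for reproducibility.
    """
    if H < 2 or not 0 <= p <= 0.5:
        raise ValueError("need H >= 2 and 0 <= p <= 0.5")
    reported = []
    for h in strata:
        u = rng.random()
        if h == 1:                      # first stratum: only neighbour is 2
            new_h = 2 if u < p else 1
        elif h == H:                    # last stratum: only neighbour is H - 1
            new_h = H - 1 if u < p else H
        elif u < p:                     # interior stratum: move left
            new_h = h - 1
        elif u < 2 * p:                 # interior stratum: move right
            new_h = h + 1
        else:                           # interior stratum: remain
            new_h = h
        reported.append(new_h)
    return reported


rng = random.Random(2024)
original = [1] * 5 + [2] * 5 + [3] * 5
reported = migrate(original, H=3, p=0.1, rng=rng)
```

Each enterprise draws its own uniform number, so the moves are independent of each other, which matches the Bernoulli second-phase description.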
Expressions for (10) and (11) are constructed in the following way. Let
denote an indicator for element
from
to belong to the sample
:
The variance of the estimator
,
, is
Here, means the estimator for the variance of which is the estimator of obtained from the sample of the size .
Let
denote the estimators for sums
in the strata
,
,
.
denotes the indicator for the unit
from
to belong to the sample
. The expression for the variance of the estimator
, 1 <
h <
H, in a similar way is as follows:
Let
denote the indicator for the unit
from
to belong to the sample
. Then, correspondingly, the variance and its estimator for 1 <
h <
H:
Below are the variances of the estimators for the first (
) and for the last (
) strata and their estimators:
The following variance estimators,
, are for the samples of fixed sizes
, and they underestimate the variances to some extent.
Result 2. Let us take stratum , , and its population total at the time of observation. It has newcomers from strata and , and it lacks some enterprises which moved to the same strata. The estimator of this total at the time of observation is given in (9). The expression for is obtained by inserting (12), (14) and (16) into (10). The expression for is obtained by inserting (13), (15), (17) and (22)–(24) into (11). Corresponding expressions for h = 1 and h = H in (10) and (11) are given in (18), (19) and (20), (21) with (22)–(24).
The estimator of the population total of one economic activity at the time of observation is simple:
Its value is the same as in the case when enterprises do not change the strata because their design weights are preserved. However, the summands in this estimator
,
, are dependent because each of them may depend on the units belonging to the neighbouring strata. The variance of
is complicated:
The expression for the covariance is as follows:
with notations
The derivation of a formula for the covariance is presented in
Appendix A.
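The remark that the estimate of the activity total does not change when enterprises change strata can be checked with a small sketch. It assumes that each sampled enterprise keeps the design weight of its selection stratum and that stratum (domain) totals are formed from the reported stratum codes. The data and the names `domain_totals` and `sample` are illustrative, not taken from the article.

```python
def domain_totals(sample, H):
    """Weighted totals by reported stratum.

    sample: list of (y, design_weight, reported_stratum) tuples, where the
    design weight comes from the stratum in which the unit was selected.
    """
    totals = {h: 0.0 for h in range(1, H + 1)}
    for y, weight, reported in sample:
        totals[reported] += weight * y
    return totals


# Made-up sample: weights 5, 3 and 1 belong to selection strata 1, 2 and 3.
no_moves = [(3, 5, 1), (4, 5, 1), (10, 3, 2), (12, 3, 2), (50, 1, 3), (60, 1, 3)]
# Same units, but one unit reports stratum 2 instead of 1 and one reports 1
# instead of 2; the design weights are unchanged.
with_moves = [(3, 5, 1), (4, 5, 2), (10, 3, 1), (12, 3, 2), (50, 1, 3), (60, 1, 3)]

before = domain_totals(no_moves, H=3)
after = domain_totals(with_moves, H=3)
total_before = sum(before.values())
total_after = sum(after.values())
```

The stratum-level estimates differ between the two cases, while their sum is the same. The dependence between neighbouring stratum estimates arises because a single selection-stratum sample now contributes to more than one summand.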
Result 3. Let us take a population total of one economic activity at the time of observation. Its estimator (25) has an approximate variance of the form: The expression for and , , comes from
Appendix A. The estimator used for the variance of the estimator of the population total is as follows: The estimator for the sample variance in the expression of the covariance estimator is given in (22).
Remark 2. It may happen that no enterprises move from a neighbouring stratum to the stratum h. Then, the corresponding term in (27) equals 0.
Remark 3. As we see in (22)–(24), the sample variances are estimated from the data that remain in the stratum h or from the data that come to the stratum h. If the number of elements or coming to the stratum h is small, then the variance component estimators (23) and (24) based on these elements are large. It is possible that only one element comes from a neighbouring stratum, so that the calculation of the corresponding sample variance is impossible. In that case, the value of the study variable for such an element is merged with the values of the elements remaining in the stratum h, and the variance input of the newcomer to the receiving stratum h is ignored in the further simulation.
4. Simulation Study
The modified data on enterprises’ expenses from the environment protection survey from the Vilnius Gediminas Technical University Repository [
42] are used for a simulation study (
Figure 1). Enterprises are involved in several activities, all of which are used in
Section 4.2.
4.1. Simulation Study for Change Type (a)
In this study, the population consists of 60 enterprises belonging to the 16th economic activity; enterprise income is used as the study variable
y. The population is stratified by the number of employees (5–17, 18–100, >100), without seeking any optimization, because enterprise income is highly correlated with the number of employees (see
Table 2).
A stratified simple random sample of size
with the proportional allocation of the sample size is selected. At the time of observation, there are
enterprises, because some of them are merged and four non-sampled enterprises are joined (joining of enterprises is simulated inside the stratum with the probability
). For the values of
when
(non-sampled enterprises), the ratio imputation for income with respect to the number of employees is used. The relative measures of accuracy (5) and (6) due to joining of the enterprises are presented in
Table 3. The same measures of accuracy are calculated for the enterprise joining probability
. In this case,
and five non-sampled enterprises are joined.
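Ratio imputation of income with respect to the number of employees can be sketched as follows. The article does not spell out the imputation formula, so the sketch assumes the standard form: the ratio of total observed income to total observed employment is multiplied by the number of employees of each unit with a missing value. All names and numbers are illustrative.

```python
def ratio_impute(y_observed, x_observed, x_missing):
    """Impute y for units with known x using the ratio sum(y) / sum(x)."""
    if len(y_observed) != len(x_observed) or not x_observed:
        raise ValueError("observed y and x must be non-empty and equally long")
    total_x = sum(x_observed)
    if total_x == 0:
        raise ValueError("sum of observed x must be non-zero")
    ratio = sum(y_observed) / total_x
    return [ratio * x for x in x_missing]


# Made-up example: income (y) and employees (x) for two observed enterprises,
# and employee counts for two non-sampled enterprises that joined them.
income_observed = [100.0, 300.0]
employees_observed = [10, 30]
employees_missing = [5, 20]
imputed_income = ratio_impute(income_observed, employees_observed, employees_missing)
```

This choice relies on the strong correlation between income and number of employees noted above. With a weak correlation, the imputed values would add noise rather than information.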
Simulation results show a non-significant relative bias for the estimator of total with recalculated inclusion probabilities
. The relative bias for
(the estimator with initial inclusion probabilities and imputed values for the study variable of non-sampled but merged enterprises) increases with increasing joining probability
. Estimates for relative variances
and
of both estimators,
and
, increase with the increasing joining probability
p. Simulation results in
Table 3 show that adjusting the inclusion probabilities for enterprise joining is worthwhile.
4.2. Simulation Study for Change Type (c)
A simulated population consists of 2968 enterprises belonging to six economic activities (
Figure 1). It is built from 379 enterprise records duplicated eight times (3032 records), from which the 64 enterprises with the largest number of employees are removed. As in
Section 4.1, each activity is stratified by the number of employees into three strata without any optimization. An
-size stratified simple random sample with proportional allocation is used for the study. Enterprise income is estimated. Enterprise size changes are simulated with probabilities
, 0.1, 0.15. The totals for each activity are estimated, and the relative stratum biases and relative stratum variance changes for the estimator of the total are presented in
Table 4 and
Table 5. The following expressions are used for relative biases and relative variance changes for the estimator of the total in the strata:
Here, is the Horvitz–Thompson estimator of the variance for the estimator of the total in the case of simple random sampling in the strata, if there are no changes.
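The accuracy measures used in the tables can be sketched in code. The expressions themselves are not legible here, so the definitions below are assumptions: the relative bias is the mean of the simulated estimates minus the true total, divided by the true total, and the relative variance change is the difference between a variance estimate and its no-change baseline, divided by the baseline. The baseline uses the standard variance estimator for a total under simple random sampling without replacement in a stratum. Function names and data are illustrative.

```python
from statistics import mean, variance


def srs_total_and_variance(y_sample, N):
    """Estimated stratum total and its variance estimate under SRS."""
    n = len(y_sample)
    if n < 2 or n > N:
        raise ValueError("need 2 <= n <= N")
    total_hat = N * mean(y_sample)
    # N^2 * (1 - n/N) * s^2 / n, with s^2 the sample variance (divisor n - 1)
    var_hat = N ** 2 * (1 - n / N) * variance(y_sample) / n
    return total_hat, var_hat


def relative_bias(estimates, true_total):
    """(mean of simulated estimates - true total) / true total."""
    return (mean(estimates) - true_total) / true_total


def relative_change(value, baseline):
    """(value - baseline) / baseline, e.g. for variance estimates."""
    return (value - baseline) / baseline


total_hat, var_hat = srs_total_and_variance([2.0, 4.0, 6.0], N=10)
bias = relative_bias([95.0, 105.0, 110.0], true_total=100.0)
change = relative_change(140.0, baseline=var_hat)
```

In a full simulation, `estimates` would hold one estimate per simulated sample, and `value` would be the variance estimate obtained after enterprises have changed strata.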
The bottom row of
Table 4 gives the relative bias of the total for Stratum 1, Stratum 2 and Stratum 3 over all economic activities together.
Except for Stratum 3, the relative bias often increases with the probability
p to move to the neighbouring strata. Populations in each economic activity are very skewed (
Figure 1). Stratum 3 of the largest enterprises receives comparatively small newcomers, and they do not cause a big bias in this stratum. The stratum of the smallest enterprises receives middle-sized newcomers, which have a considerable influence on the stratum income, and the bias increases with the increase in the probability
p to change the stratum. Stratum 2 of medium-sized enterprises receives both small and large newcomers and loses enterprises in both directions, to the stratum of smaller and to the stratum of larger enterprises. It is the most mobile stratum, and its relative bias slightly increases with the increase in the probability
p to change the stratum. The summary relative biases in the bottom row show the same tendency.
Simulation results in
Table 5 show relative changes in stratum variances for the estimator of the total when enterprises migrate between strata, and the “change of total” shows the relative change in the variance for the whole estimator of the total in economic activity. All values of the relative changes in the variance for all the estimators of the total are very high, which is due to the very skewed population.
Medium-sized strata have the lowest relative variance, while the other strata have higher relative variances, which increase with the probability to move from the strata, except for the highest-income strata. These highest-income strata also influence the relative change in the stratum total variance, which decreases insignificantly with increasing probability p.
Enterprises coming to the small-size enterprise stratum are only middle-size, but their income is, on average, higher than the income in the receiving stratum. This causes a high relative bias for the estimator of the total and high relative variance estimates for the estimator of the total, with respect to the estimator of variance when there are no enterprise stratum changes in the receiving stratum.
The large-size enterprise stratum receives only smaller-size enterprises from the neighbouring stratum, and its estimate of total income has a negative relative bias.
The stratum of the medium-size enterprises receives both small-size and large-size enterprises; at the same time, some medium-size enterprises leave it. Despite some relative bias of the estimator, its relative variance estimate for the estimator of the total is the lowest.
As the probability that an enterprise changes stratum increases, the relative accuracy measures for the small-size and medium-size enterprise strata increase, but no effect is observed for the large-size enterprise strata.
Estimates of the total for the whole study population remain almost unbiased. Relative estimates of the variance of the total do not differ significantly as the migration probability p increases.
5. Results
A stratified simple random sample is assumed. Three cases of changes in the sampling unit (SU) structure between sample selection and data collection are studied.
- (a)
A sampling unit has joined another sampling unit, possibly from a different stratum. First- and second-order inclusion probabilities for the observed sample elements are recalculated. Consequently, a newly adjusted unbiased Horvitz–Thompson estimator of the total, its variance and an unbiased estimator of the variance are applied. The simulation results show that the relative bias and the relative change in the variance estimator, both measured against the case with no changes, increase with the probability of changes.
- (b)
Sampling units split into multiple parts, and data are received only from those parts that are successors of the selected units. Imputation or an estimator for two-phase sampling is proposed.
- (c)
Another value of the classification variable is reported, i.e., the sampling unit migrated to a neighbouring stratum before data collection. A model for the change of the classification variable is assumed. A new estimator of the total, its variance and a variance estimator are obtained. The simulation results show that the relative bias and the relative variance, both measured against the case with no changes, increase with the probability of migration.
In Case (c), an assumption is made that enterprises change the stratum with equal probabilities and only two neighbouring strata participate in the exchange process with some fixed stratum. This assumption can be relaxed, allowing enterprises to have different probabilities to leave and enter the stratum, and more strata may participate in the exchange.
In order to apply the results in practice, the probability of enterprises changing stratum should be estimated. The probability of enterprises changing the stratum between sample selection and data collection may change over time. Careful study of a specific enterprise population may suggest other models. The relevance of this problem has been demonstrated by real-life surveys.
The methods presented in the article are based on the assumption that there is no nonresponse and that the joining of units within the sampling frame (Case (a)) or the changing of strata (Case (c)) is completely random, i.e., it is related neither to the stratification variable nor to the study variable. In reality, several kinds of unit errors may appear at the same time; however, as was pointed out in the Introduction, only one kind of error at a time is studied in this article.
Changes in the composition of the sampling unit between sample selection and data collection are only one of the problems in enterprise surveys. Many more problems were discussed at the 6th International Conference on the Establishment Surveys [
43] and other conferences.
6. Discussion
As we see, the Horvitz–Thompson estimator is inefficient in the case of sampling units changing the stratum. Some smoothing estimator is needed.
Units coming from the neighbouring stratum may significantly change an estimator of the total and its variance in the target stratum for several reasons: the values of the study variable of the incoming units may differ significantly from the values of the units in the target stratum; the values of the study variable may be similar in both strata, but the weights may differ significantly; or both may hold. If the value of an incoming unit differs from the values of the units in the target stratum, it can be treated as an outlier. If the weights of the incoming units differ significantly from the weights in the target stratum, they can be smoothed as proposed by Beaumont [
44,
45]. Smoothed weights are obtained by applying a suitable model for the design weights. One of the proposed methods gives smoothed calibration weights. The author proposes estimating the mean squared error of the smoothed estimator by bootstrap.
Direct estimators based on sampling design and assumptions on enterprise migration are presented in the article. Due to migration of the sampling units between strata, design weights become unequal, and the population size changes and becomes unknown. No nonresponse is assumed, and no auxiliary variables are used.
Real sample surveys usually have more imperfections; in particular, some of the sampling units do not report their data, which results in nonresponse. To adjust the estimator for frame imperfections and nonresponse, a calibration estimator using auxiliary data is used [
46]. In order to reduce the variance of the estimator, auxiliary variables should be well correlated with the study variable.
Nowadays, there are many databases and unstructured data on the internet. It may therefore appear that taking a large amount of data from various sources should be enough to solve the estimation problem. Unfortunately, some of these data sources are characterised by undercoverage of the survey population and by selection bias; therefore, to integrate such data sources, probability survey data are needed as one of the components, providing a supporting frame. The latter is currently a popular topic among survey statisticians [
47,
48]. For this reason, further studies into the quality of sample surveys will be necessary.