Survey sampling is a strategy for gathering data about a specific trait of a population from a subset of that population, used when time or cost constraints make it impractical to observe every member. It involves drawing samples from the population, analyzing them, and using the results to draw conclusions about the whole population. Survey sampling is widely used in many areas of data collection. The primary goal in data collection is to obtain precise, predictable, and reliable outcomes.
1.1. Survey Sampling and Randomized Response
Surveys often need to gather data on sensitive or stigmatized issues. The findings may be misleading when the characteristic under study is sensitive, such as scams, the use of illegal drugs or intoxicating beverages, prohibited or unlawful earnings, income tax evasion, reserves held in the form of prize bonds, the number of induced abortions, or the exploitation of finances.
The main aim in survey sampling is to obtain accurate and reliable data, and this aim is often not met for sensitive issues. When a survey on a sensitive issue uses direct questioning or interviewing, ambiguous or false responses are to be expected. Let us first discuss some real-life examples in which the variable of interest is sensitive or socially stigmatized.
Racism is a major workplace sensitivity issue that refers to making colleagues feel uncomfortable regarding their background or skin tone. If we want to know about people who perpetrate racism, direct questioning methods will fail to elicit authentic responses. Racism is therefore a highly sensitive issue, especially in business research, as discussed in detail by Geurts [1]. As another example, if our interest is to assess the proportion of users of illicit drugs and alcohol in a local area where such substances are completely prohibited, respondents may hesitate to reveal their sensitive traits. Thus, if an interviewer asks directly, “Do you smoke or take alcoholic drinks?”, the respondents will hesitate over whether to give the interviewer an accurate answer to this delicate question.
Harassment is another sensitive issue on which it is difficult to collect data. Harassment can be on the basis of race, religion, gender, or national origin. Workplace harassment, in all its forms, is unacceptable. Mirhosseini et al. [2] presented a descriptive analysis of sexual harassment and its coping strategies. Although connections between coworkers might develop over time, they should always adhere to company policy. A coworker with unprofessional motives should never make an employee feel pressured or uneasy. Beyond interpersonal interactions, harassing, intimidating, or bullying another employee needs to be addressed swiftly and firmly. Therefore, if a person is interested in collecting data about harassment in their workplace, the respondents may not provide them with accurate information.
Sensitivity issues also arise when we want to know the proportion of individuals who evade income tax by making payments through contacts or nepotism. A detailed study of the factors that persuade taxpayers to engage in tax evasion was carried out by Kassa [3]. If an income tax officer conducts a survey for this purpose, the respondents will likely lie about non-payment of income tax because they fear punishment and penalties imposed by the authorities. Similarly, if we wish to know about the living standards and comforts of individuals in a specific local area, we have to know about their income. If we know their regular income and it does not match their lavish standard of living, then we need to know the proportion of individuals who have unlawful income. For the most part, respondents conceal their income and do not want to be questioned about unlawful earnings or additional sources of revenue. Therefore, they may under-report their illegal means of obtaining income to an interviewer who is a stranger.
Respondents who are asked such sensitive questions directly frequently hesitate over whether to answer honestly. They fear that a truthful answer would be a cause of embarrassment or that they would be ridiculed in public. Sometimes they feel that a truthful answer might draw punishment or that their privacy might be violated. The fear of legal consequences results in either refusal to answer or evasive answers. Such a scenario gives rise to social desirability bias (SDB). We face these conditions whenever the study variable is sensitive, for example the use of drugs, non-payment of tax, or involvement in illegal activities.
Warner [4] introduced a randomized response (RR) survey model to counter respondents’ reservations about sensitive or socially stigmatized questions; such a technique is much needed when we want to obtain reliable and authentic data. This method is very effective at reducing SDB. To build the confidence of respondents and protect their confidentiality, the unrelated question model was recommended by Greenberg et al. [5]. Notable work on the RR model has been done by a variety of researchers, and we briefly discuss some of it here. Moors [6] modified the model of Greenberg et al. [5] for the case of unknown parameters of the population’s characteristics. He also calculated the values of the probabilities p_1 and p_2. For situations in which the model of Moors [6] fails, three straightforward alternative RR models were suggested by Mahmood et al. [7]. A comparison of their proposed estimators with those of Greenberg et al. [5] shows that their estimators are more efficient. The work of Christofides [8] advances the groundbreaking work of Warner [4] by presenting an alternative randomized response technique (RRT), which includes Warner’s [4] approach as a special case. Huang [9] proved that his proposed approach is more efficient than a number of widely used RR approaches. His technique is applicable to direct response surveys as well as to RR surveys that seek genuine responses on sensitive issues.
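To make the mechanism behind Warner’s [4] design concrete, the following is a minimal Python sketch; it is not taken from any of the cited papers. It assumes the standard form of the model: a randomization device directs each respondent, with a known probability P, to answer whether they have the sensitive trait, and otherwise to answer whether they do not have it, so the interviewer never learns which statement was answered. The function names, the value P = 0.7, and the simulated prevalence are illustrative assumptions.

```python
import random


def warner_responses(n, pi_true, p, seed=0):
    """Simulate n "yes"/"no" answers under Warner's design (assumed setup).

    pi_true: true proportion with the sensitive trait (unknown in practice).
    p:       probability that the device selects the statement
             "I have the trait"; otherwise "I do not have the trait".
    """
    rng = random.Random(seed)
    answers = []
    for _ in range(n):
        has_trait = rng.random() < pi_true
        direct_statement = rng.random() < p
        # A truthful respondent says "yes" when the selected statement is true.
        answers.append(has_trait if direct_statement else not has_trait)
    return answers


def warner_estimate(answers, p):
    """Estimate the sensitive proportion from the share of "yes" answers.

    P(yes) = p * pi + (1 - p) * (1 - pi), so
    pi_hat = (share_yes - (1 - p)) / (2 * p - 1), defined only for p != 0.5.
    """
    if p == 0.5:
        raise ValueError("p = 0.5 makes the proportion unidentifiable")
    share_yes = sum(answers) / len(answers)
    return (share_yes - (1 - p)) / (2 * p - 1)


if __name__ == "__main__":
    sample = warner_responses(n=50_000, pi_true=0.2, p=0.7, seed=1)
    print(round(warner_estimate(sample, p=0.7), 3))
```

The estimate is recoverable only when P differs from 0.5, and the randomization inflates its variance relative to truthful direct questioning; that loss of efficiency is what the later RR models discussed here try to reduce.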
The concept of RR introduced by Warner [4] has also been extended by many researchers, who identified deficiencies in his model and proposed solutions. Kim and Warde [10] offered some new findings on the RR model in which the response variables are assumed to follow a multinomial distribution. They used Hopkins’s test with a randomization device to produce estimates, both with and without the assumption of truthful responses. Many scholars have offered alternatives to the Moors [6] RR model to address the privacy issue. However, their models might result in a significant loss of information and incur considerable costs in maintaining secrecy. Compared to earlier RR models, Kim and Warde [11] suggested a simpler model that still maintains confidentiality, using stratified sampling. The stratified Warner’s [4] RR approach and the unrelated question RR model were combined by Kim and Elam [12] to create a new RRT. A three-stage stratified RR approach using optimal allocation was proposed by Kim and Chae [13], which expands upon the two-stage stratified RRT developed by Kim and Elam [12]. They demonstrated that their RR estimator is more efficient than Kim and Elam’s [12] estimator, although it provides less privacy protection. A new, more efficient RR procedure using two randomization devices was proposed by Mangat and Singh [14]. Their model was confusing to respondents, who had to handle two randomizing devices while responding. Therefore, Mangat [15] presented a simpler technique that was more efficient.
The work of Narjis and Shabbir [16] and Hsieh et al. [17] on two-stage RR models for estimating the prevalence of a sensitive characteristic is also worth noting. Singh and Singh [18] offered a technique for estimating the population proportion of a stigmatized characteristic using the well-known negative binomial distribution. Singh et al. [19] proposed a three-stage randomized response model using the Poisson distribution. Halim et al. [20] derived the transition matrices of the conditional misclassification probabilities for several of the above-mentioned models and also studied the association of variables while taking RR into account. Jaiswal et al. [21] proposed a calibrated estimator of the population mean under a unit response condition using inverse linear, logistic, and exponential integrated models.
1.3. Logit Estimation in Randomized Response
We need a procedure that provides complete privacy to the respondents so that they have no fear of being stigmatized, as stated by Corstange [27]: “If the problem is that people have incentives to hide their true opinions or behavior from the interviewer, then our science suffers unless we can develop means to nullify these incentives. Survey respondents may not be willing to reveal their true answers to sensitive questions without foolproof guarantees of anonymity—not only from outside observers such as law enforcement or friends and family, but even from the interviewers themselves.” Ordinary logit models are not suitable when the dichotomous response variable relates to a sensitive issue. Corstange [27, 28] proposed a method known as the hidden logit to deal with such problems. The hidden logit model is a modified form of the standard logit that accounts for the outcome of the device used for randomization. This technique models the true probability of a “yes” reply as a function of a predictor variable X. The odds ratio is given by:
Taking π as the probability of a “yes” answer, we solve any RR model for π and substitute it into the standard logit to obtain the hidden logit model. Under the same condition, the logit estimates can be obtained by the maximum likelihood (ML) method. We have to express the logit model in terms of X and β. According to Corstange [27], the RR model works as follows: a coin flip is the randomizing device; if the coin lands on heads, the respondent is asked to say “yes” without clarification, whereas if it lands on tails, they are to give a “yes” or “no” response according to their actual status. Let π be the probability of an absolute “yes” and p the real fraction of participants who truly answer “yes”; then the probability of a “yes” answer derived by Corstange [27] is given by:
Solving Equation (2) for π and substituting its value into Equation (1) to solve for θ, we obtain:
Let y_i be a dichotomous variable for which “1” represents a “yes” response and “0” represents a “no” response. The likelihood function of β is then given by:
The first derivative of Equation (4) is:
Setting Equation (5) equal to zero gives the maximizer of the likelihood, but the resulting equation cannot be solved analytically. Therefore, the parameters are estimated by solving this equation numerically.
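The numerical step can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the authors’ or Corstange’s implementation: it assumes the coin design described above with a fair coin (probability c = 0.5 of a forced “yes”), a single predictor with an intercept, and Fisher scoring with step-halving as the numerical optimizer. Under these assumptions the probability of an observed “yes” is c + (1 − c)·F(b0 + b1·x), where F is the logistic function. All function and variable names are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def simulate_forced_response(n, b0, b1, c=0.5, seed=0):
    """Return (x, y) pairs; y is the randomized answer, not the true status."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        truth = rng.random() < sigmoid(b0 + b1 * x)  # true sensitive status
        forced = rng.random() < c                    # coin lands on heads
        data.append((x, 1 if (forced or truth) else 0))
    return data


def log_likelihood(beta, data, c=0.5):
    total = 0.0
    for x, y in data:
        obs_p = c + (1.0 - c) * sigmoid(beta[0] + beta[1] * x)
        obs_p = min(max(obs_p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += math.log(obs_p) if y == 1 else math.log(1.0 - obs_p)
    return total


def fit_hidden_logit(data, c=0.5, max_iter=50, tol=1e-8):
    """Maximize the log-likelihood by Fisher scoring with step-halving."""
    beta = [0.0, 0.0]
    ll = log_likelihood(beta, data, c)
    for _ in range(max_iter):
        s0 = s1 = i00 = i01 = i11 = 0.0
        for x, y in data:
            true_p = sigmoid(beta[0] + beta[1] * x)
            obs_p = c + (1.0 - c) * true_p
            # Score and expected-information weights, simplified using
            # 1 - obs_p = (1 - c) * (1 - true_p).
            r = (y - obs_p) * true_p / obs_p
            w = (1.0 - c) * true_p * true_p * (1.0 - true_p) / obs_p
            s0 += r
            s1 += r * x
            i00 += w
            i01 += w * x
            i11 += w * x * x
        det = i00 * i11 - i01 * i01
        step0 = (i11 * s0 - i01 * s1) / det
        step1 = (i00 * s1 - i01 * s0) / det
        t = 1.0
        cand = [beta[0] + step0, beta[1] + step1]
        new_ll = log_likelihood(cand, data, c)
        while new_ll < ll and t > 1e-6:  # halve the step until it improves
            t /= 2.0
            cand = [beta[0] + t * step0, beta[1] + t * step1]
            new_ll = log_likelihood(cand, data, c)
        if new_ll < ll:
            break
        converged = new_ll - ll < tol
        beta, ll = cand, new_ll
        if converged:
            break
    return beta


if __name__ == "__main__":
    sample = simulate_forced_response(20_000, b0=-1.0, b1=1.5, seed=1)
    print([round(b, 2) for b in fit_hidden_logit(sample)])
```

Fisher scoring is used here only because it needs no external library; any general-purpose optimizer applied to the same log-likelihood would serve. Fitting an ordinary logit to the randomized answers instead would model the observed “yes” probability rather than the true one, which is the distortion the hidden logit is meant to remove.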
Hussain and Shabbir [29] and Hussain et al. [30] also employed different RRTs to calculate the hidden logits. The same has been done by Halim et al. [31] for the RRT of Mangat and Singh [14]. Cruff et al. [32] and Chang et al. [33] also worked on logistic regression in different capacities. Hsieh and Perri [34] developed more advanced approaches for examining the factors faced by researchers dealing with two stigmatized variables under RRT. Our study helps to find the estimates of logistic models for sensitive issues when the RRT is complex, as in the case of Huang [9]. This study advances the use and evaluation of logistic models in situations where it is difficult to obtain truthful responses.
The rest of this article is organized as follows. In Section 2, the proposed methodology of hidden logit is discussed using three RRTs. Section 3 presents the results and discussion based on simulation. The last section gives some concluding remarks.