Article
Peer-Review Record

On the Sampling Size for Inverse Sampling

Stats 2022, 5(4), 1130-1144; https://doi.org/10.3390/stats5040067
by Daniele Cuntrera †, Vincenzo Falco *,† and Ornella Giambalvo †
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 31 March 2022 / Revised: 31 October 2022 / Accepted: 8 November 2022 / Published: 15 November 2022
(This article belongs to the Special Issue Multivariate Statistics and Applications)

Round 1

Reviewer 1 Report

The paper is well written, with a good introduction and a reasonable description of the Inverse sampling method.

General comments:

1) The title suggests to the reader that a new re-sampling method is being proposed, which is not the case: the paper is strongly based on the formulation given by Kim and Wang (2019) and does not seem to propose a new formulation. I recommend that the authors choose a title that reflects the manuscript's objective more accurately.

2) Results and conclusions: 
It is not straightforward to assess the impact of selection bias on coverage rates, particularly in Figures 5 and 7. In Figure 7, for fixed $n_{IS}$, the coverage rate curves are very similar for $\theta=0.2, 0.5, 0.7$, and $n_{IS}=500$ provides better rates than $n_{IS}=250$. This may suggest that the main driver of the coverage rate is the ratio $n_{IS} / N_2$ rather than $N_2$. To resolve this doubt and allow a better analysis of the influence of selection bias, I recommend that the authors include results from simulations performed in the absence of selection bias.
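As a rough illustration of the kind of no-selection-bias baseline requested in comment 2, the Python sketch below estimates the Monte Carlo coverage rate of a nominal 95% normal-approximation interval for a population mean. It is not the authors' simulation design. It replaces the inverse sampling step with plain simple random sampling without replacement from a synthetic population, and the function name, population model, and sizes are all hypothetical choices made for illustration.

```python
import numpy as np


def coverage_rate(population, n, reps=2000, z=1.96, seed=0):
    """Monte Carlo coverage of a normal-approximation interval for the mean.

    Samples are simple random samples without replacement, so every unit has
    the same inclusion probability (no selection bias). This is an assumed
    baseline, not the procedure used in the manuscript.
    """
    rng = np.random.default_rng(seed)
    true_mean = population.mean()
    N = population.size
    hits = 0
    for _ in range(reps):
        sample = rng.choice(population, size=n, replace=False)
        # Standard error of the sample mean with finite population correction.
        se = sample.std(ddof=1) / np.sqrt(n) * np.sqrt(1 - n / N)
        hits += abs(sample.mean() - true_mean) <= z * se
    return hits / reps


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Hypothetical source sizes; the sample sizes 250 and 500 mirror the
    # n_IS values mentioned in the comment above.
    for N2 in (2_000, 10_000):
        population = rng.normal(loc=10.0, scale=2.0, size=N2)
        for n in (250, 500):
            rate = coverage_rate(population, n)
            print(f"N2={N2:>6}  n={n:>3}  n/N2={n / N2:.3f}  coverage={rate:.3f}")
```

Comparing such a baseline against the biased scenarios for the same values of $n_{IS} / N_2$ would help show whether changes in coverage come from selection bias or from the sampling fraction alone.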

Minor comments:

Line 120: there seems to be a missing division operator in the equation $E(\delta) = N_B / N = f_B$.

$N_B$ is not defined (although its meaning is implicit).

Line 157: check the inequality $n \leq 1max...$ (I think the correct form is $n \leq 1/max...$)

(Check all equations)

Line 164: missing citation

Line 193: "Then, we *then* define..."

Equation between lines 194-195: the assumption for $p_i$ is odd, since $\phi=0$ (no selection bias) leads to $p_i=0$. I recommend that the authors adopt the assumption given by Kim and Wang (2019) or provide a motivation for the current choice.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

General comments:
The manuscript claims to explore the limits of the method called Inverse Sampling by empirically investigating the impact of threshold values on the classification of datasets between the categories of “Big” (Big datasets) and “Small” (smaller datasets). Such a description, in my opinion, is quite vague, as the Introduction does not offer a better contextualization of the problem. Note that, at the very beginning of the reading, there is no clue to the reader that the inverse sampling method considered is the one based on Hinkins et al. (1997) and Kim and Wang (2018). Such inverse sampling corresponds to a two-phase sample design with the first phase corresponding to the BigData source of information, as properly described by the authors in the Methodology section. However, the distinction between the sets ? and ? ?, as well as their relationship with a BigData source, needs to be emphasized to avoid misleading ideas.

One common problem related to discussions about BigData is the misconception that BigData is a matter of the size of datasets. Although the manuscript states, right at the beginning, the correct notion that

Big Data represents the information assets characterized by high volume, velocity and variety to require specific technology and analytical methods for its transformation into value;

in several parts of the paper the terms Big data, Big datasets, and further, Large datasets are used as synonyms, reducing the BigData concept to only a matter of volume. In lines 33 to 35, for example, one can read:

From a practical point of view, nowadays, there are many application areas where Big Data is currently being used with excellent prospects and potential without, however, considering that Big Datasets are non-probabilistic samples and are affected by selection bias.

In lines 53 to 54, one can read:

Recently, Big Datasets have been treated with sampling-related approaches, i.e., samples of the Big Datasets have been analyzed that could allow inferential analyses to be conducted.

In lines 91 to 93, one reads:

This paper aims to explore the limitations of Inverse sampling for “large” datasets that are increasingly smaller than Big Datasets, thus providing empirical threshold values that allow us to distinguish Big Datasets from large datasets.

In my opinion, the authors must improve the Introduction so as to avoid such confusion. The following questions need to be clearly addressed in the Introduction:

 

1. When discussing the size of a data set, exactly which data set is being considered?
2. To what extent does one have control over the size of such a data set?
3. Please briefly discuss an example that illustrates the importance of considering whether the size of the dataset is adequate for using the inverse sampling technique.

Minor problems:
The citation style is mixed. Please adjust all the citations to the journal style.
The equation in line 120 needs correction: $E(\delta) = N_B / N$.

Author Response

Please see the attachment.

Author Response File: Author Response.docx
