Stats, Volume 7, Issue 4 (December 2024) – 10 articles

  • Issues are regarded as officially published after their release is announced to the table of contents alert mailing list.
  • You may sign up for e-mail alerts to receive table of contents of newly released issues.
  • PDF is the official format for papers published in both HTML and PDF forms. To view a paper in PDF format, click the "PDF Full-text" link and open it with the free Adobe Reader.
15 pages, 382 KiB  
Article
Empirical Inferences Under Bayesian Framework to Identify Cellwise Outliers
by Luca Sartore, Lu Chen and Valbona Bejleri
Stats 2024, 7(4), 1244-1258; https://doi.org/10.3390/stats7040073 (registering DOI) - 19 Oct 2024
Abstract
Outliers are typically identified using frequentist methods. The data are classified as “outliers” or “not outliers” based on a test statistic that measures the magnitude of the difference between a value and the majority part of the data. The threshold for a data value to be an outlier is typically defined by the user. However, a subjective choice of the threshold increases the uncertainty associated with outlier status for each data value. A cellwise outlier detection algorithm named FuzzyHRT is used to automate the editing process in repeated surveys. This algorithm uses Bienaymé–Chebyshev’s inequality and fuzzy logic to detect four different types of outliers resulting from format inconsistencies, historical, tail, and relational anomalies. However, fuzzy logic is not suited for probabilistic reasoning behind the identification of anomalous cells. Bayesian methods are well suited for quantifying the uncertainty associated with the identification of outliers. Although, as suggested by the literature, there exist well-developed Bayesian methods for record-level outlier detection, Bayesian methods for identifying outliers within individual records (i.e., at the cell level) remain unexplored. This paper presents two approaches from the Bayesian perspective to study the uncertainty associated with identifying outliers. A Bayesian bootstrap approach is explored to study the uncertainty associated with the output scores from the FuzzyHRT algorithm. Empirical likelihoods in a Bayesian setting are also considered for probabilistic reasoning behind the identification of anomalous cells. NASS survey data for livestock and major crop yield (such as corn) are considered for comparing the performances of the two proposed approaches with recent cellwise outlier methods.
(This article belongs to the Special Issue Bayes and Empirical Bayes Inference)
19 pages, 2811 KiB  
Article
Prepivoted Augmented Dickey-Fuller Test with Bootstrap-Assisted Lag Length Selection
by Somak Maitra and Dimitris N. Politis
Stats 2024, 7(4), 1226-1243; https://doi.org/10.3390/stats7040072 - 17 Oct 2024
Viewed by 278
Abstract
We investigate the application of prepivoting in conjunction with lag length selection to correct the size and power performance of the Augmented Dickey-Fuller test for a unit root. The bootstrap methodology used to perform the prepivoting is a residual-based AR bootstrap that ensures that bootstrap replicate time series are created under the null irrespective of whether the originally observed series obeys the null hypothesis or not. Simulation studies wherein we examine the performance of our proposed method are given; we evaluate our method’s performance on ARMA(1,1) models with varying configurations for size and power performance. We also propose a novel data-dependent lag selection technique that uses bootstrap data under the null to select an optimal lag length; the performance of our method is compared to existing lag length selection criteria.
(This article belongs to the Special Issue Modern Time Series Analysis II)
17 pages, 494 KiB  
Article
Levels of Confidence and Utility for Binary Classifiers
by Zhiyi Zhang
Stats 2024, 7(4), 1209-1225; https://doi.org/10.3390/stats7040071 - 17 Oct 2024
Viewed by 220
Abstract
Two performance measures for binary tree classifiers are introduced: the level of confidence and the level of utility. Both measures are probabilities of desirable events in the construction process of a classifier and hence are easily and intuitively interpretable. The statistical estimation of these measures is discussed. The usual maximum likelihood estimators are shown to have upward biases, and an entropy-based bias-reducing methodology is proposed. Along the way, the basic question of appropriate sample sizes at tree nodes is considered.
(This article belongs to the Section Data Science)
20 pages, 3076 KiB  
Article
Is Anonymization Through Discretization Reliable? Modeling Latent Probability Distributions for Ordinal Data as a Solution to the Small Sample Size Problem
by Stefan Michael Stroka and Christian Heumann
Stats 2024, 7(4), 1189-1208; https://doi.org/10.3390/stats7040070 - 17 Oct 2024
Viewed by 262
Abstract
The growing interest in data privacy and anonymization presents challenges, as traditional methods such as ordinal discretization often result in information loss by coarsening metric data. Current research suggests that modeling the latent distributions of ordinal classes can reduce the effectiveness of anonymization and increase traceability. In fact, combining probability distributions with a small training sample can effectively infer true metric values from discrete information, depending on the model and data complexity. Our method uses metric values and ordinal classes to model latent normal distributions for each discrete class. This approach, applied with both linear and Bayesian linear regression, aims to enhance supervised learning models. Evaluated with synthetic datasets and real-world datasets from UCI and Kaggle, our method shows improved mean point estimation and narrower prediction intervals compared to the baseline. With 5–10% training data randomly split from each dataset population, it achieves an average 10% reduction in MSE and a ~5–10% increase in R² on out-of-sample test data overall.
17 pages, 4221 KiB  
Article
Forecasting Mortality Trends: Advanced Techniques and the Impact of COVID-19
by Asmik Nalmpatian, Christian Heumann and Stefan Pilz
Stats 2024, 7(4), 1172-1188; https://doi.org/10.3390/stats7040069 - 16 Oct 2024
Viewed by 229
Abstract
The objective of this research is to evaluate four distinct models for multi-population mortality projection in order to ascertain the most effective approach for forecasting the impact of the COVID-19 pandemic on mortality. Utilizing data from the Human Mortality Database for five countries—Finland, Germany, Italy, the Netherlands, and the United States—the study identifies the generalized additive model (GAM) within the age–period–cohort (APC) analytical framework as the most promising for precise mortality forecasts. Consequently, this model serves as the basis for projecting the impact of the COVID-19 pandemic on future mortality rates. By examining various pandemic scenarios, ranging from mild to severe, the study concludes that projections assuming a diminishing impact of the pandemic over time are most consistent, especially for middle-aged and elderly populations. Projections derived from the superior GAM-APC model offer guidance for strategic planning and decision-making within sectors facing the challenges posed by extreme historical mortality events and uncertain future mortality trajectories.
(This article belongs to the Section Survival Analysis)
13 pages, 317 KiB  
Article
A Bayesian Hierarchical Model for 2-by-2 Tables with Structural Zeros
by James Stamey and Will Stamey
Stats 2024, 7(4), 1159-1171; https://doi.org/10.3390/stats7040068 - 16 Oct 2024
Viewed by 310
Abstract
Correlated binary data in 2 × 2 tables have been analyzed from both the frequentist and Bayesian perspectives, but a fully Bayesian hierarchical model has not yet been proposed. This is a commonly used model for correlated proportions when considering, for example, a diagnostic test performance where subjects with negative results are tested a second time. We consider a new hierarchical Bayesian model for the parameters resulting from a 2 × 2 table with a structural zero. We investigate the performance of the hierarchical model via simulation. We then illustrate the usefulness of the model by showing how a set of historical studies can be used to build a predictive distribution for a new study that can be used as a prior distribution for both the risk ratio and marginal probability of a positive test. We then show how the prior based on historical 2 × 2 tables can be used to power a future study that accounts for pre-experimental uncertainty. High-quality prior information can lead to better decision-making by improving precision in estimation and by providing realistic numbers to power studies.
18 pages, 765 KiB  
Article
Preliminary Test Estimation for Parallel 2-Sampling in Autoregressive Model
by Syed Ejaz Ahmed, Arsalane Chouaib Guidoum and Sara Bendjeddou
Stats 2024, 7(4), 1141-1158; https://doi.org/10.3390/stats7040067 - 14 Oct 2024
Viewed by 277
Abstract
The purpose of this paper is to discuss the problem of estimation and testing the equality of two autoregressive parameters of two first-order autoregressive processes AR(1), where for each process, the observations are made at different time points. The primary interest is to propose the testing procedures for the homogeneity of autocorrelation parameters ρ1 and ρ2. Furthermore, we are interested in estimating ρ1 under uncertain and weak prior information about the possible equality of ρ1 and ρ2, though we may not have full confidence in the tenacity of this information. A large sample test for the homogeneity of the parameters is developed. Pooled “P” (or restricted estimator) and preliminary test “PT” estimators are proposed, and their properties are investigated and compared with the unrestricted estimator “UE” of ρ1.
(This article belongs to the Section Computational Statistics)
13 pages, 877 KiB  
Article
Mixed Poisson Processes with Dropout for Consumer Studies
by Andrey Pepelyshev, Irina Scherbakova and Yuri Staroselskiy
Stats 2024, 7(4), 1128-1140; https://doi.org/10.3390/stats7040066 - 13 Oct 2024
Viewed by 265
Abstract
We adapt the classical mixed Poisson process models for investigation of consumer behaviour in a situation where after a random time we can no longer identify a customer despite the customer remaining in the panel and continuing to perform buying actions. We derive explicit expressions for the distribution of the number of purchases by a random customer observed at a random subinterval for a given interval. For the estimation of parameters in the gamma–Poisson scheme, we use the estimator minimizing the Hellinger distance between the sampling and model distributions, and demonstrate that this method is almost as efficient as maximum likelihood while being much simpler. The results can be used for modelling internet user behaviour where cookies and other user identifiers naturally expire after a random time.
(This article belongs to the Section Statistical Methods)
29 pages, 11052 KiB  
Article
A County-Level Analysis of the Economic Performance in Romania and Bulgaria Using Hierarchical Algorithms
by Alexandra-Nicoleta Ciucu (Durnoi), Camelia Delcea and Kosyo Stoychev
Stats 2024, 7(4), 1099-1127; https://doi.org/10.3390/stats7040065 - 11 Oct 2024
Viewed by 402
Abstract
The EU Regional Competitiveness Index 2.0 measures a region’s ability to provide an attractive environment for businesses and residents to work and live. According to this indicator, countries in the southern and eastern regions of the European Union are reported to have the lowest values. As it measures the performance of NUTS-2 regions, it was desired to study the problem in more detail, reaching the NUTS-3 level. Thus, within the current research, Romania and Bulgaria are studied by means of a county-level analysis of the economies of the two states established through the prism of the labor market, the field of health, transport, enterprises, tourism, education, and research. Through eight indicators, a series of maps designed to present the situation of the two states was illustrated, and the investigation continued with a cluster analysis carried out by the implementation of hierarchical algorithms. During the course of the current study, a classification and a ranking of the counties of the two countries were performed to determine the areas with the best or, in contrast, the poorest performance.
15 pages, 1311 KiB  
Article
Cross-Country Assessment of Socio-Ecological Drivers of COVID-19 Dynamics in Africa: A Spatial Modelling Approach
by Kolawole Valère Salako, Akoeugnigan Idelphonse Sode, Aliou Dicko, Eustache Ayédèguè Alaye, Martin Wolkewitz and Romain Glèlè Kakaï
Stats 2024, 7(4), 1084-1098; https://doi.org/10.3390/stats7040064 - 11 Oct 2024
Viewed by 498
Abstract
Understanding how countries’ socio-economic, environmental, health status, and climate factors have influenced the dynamics of COVID-19 is essential for public health, particularly in Africa. This study explored the relationships between African countries’ COVID-19 cases and deaths and their socio-economic, environmental, health, clinical, and climate variables. It compared the performance of Ordinary Least Squares (OLS) regression, the spatial lag model (SLM), the spatial error model (SEM), and the conditional autoregressive model (CAR) using statistics such as the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Root Mean Square Error (RMSE), and coefficient of determination (R²). Results showed that the SEM with the 10-nearest neighbours matrix weights performed better for the number of cases, while the SEM with the maximum distance matrix weights performed better for the number of deaths. For the cases, the number of tests followed by the adjusted savings, Gross Domestic Product (GDP) per capita, dependence ratio, and annual temperature were the strongest covariates. For deaths, the number of tests followed by malaria prevalence, prevalence of communicable diseases, adjusted savings, GDP, dependence ratio, Human Immunodeficiency Virus (HIV) prevalence, and moisture index of the moistest quarter play a critical role in explaining disparities across countries. This study illustrates the importance of accounting for spatial autocorrelation in modelling the dynamics of the disease while highlighting the role of countries’ specific factors in driving its dynamics.
(This article belongs to the Section Regression Models)