Stats, Volume 9, Issue 2 (April 2026) – 18 articles

  • Issues are regarded as officially published after their release is announced to the table of contents alert mailing list.
  • You may sign up for e-mail alerts to receive the tables of contents of newly released issues.
  • PDF is the official format for papers published in both HTML and PDF forms. To view the papers in PDF format, click on the "PDF Full-text" link and use the free Adobe Reader to open them.
22 pages, 1839 KB  
Article
A New Depth-Based Test for Multivariate Two-Sample Problems
by My Luu, Yuejiao Fu, Augustine Wong and Xiaoping Shi
Stats 2026, 9(2), 39; https://doi.org/10.3390/stats9020039 - 3 Apr 2026
Viewed by 186
Abstract
Statistical depth provides a center–outward ordering of multivariate observations and is widely used in nonparametric inference. We study depth-based tests for multivariate two-sample problems and examine the behaviour of different depth notions using the DD plot (data-depth plot) across a variety of distributional settings. The DD plot illustrates that depth functions differ in their sensitivity to distributional differences, emphasizing the importance of depth selection in two-sample testing. We propose a new two-sample test statistic, log DDR, constructed from ratios of numerical depth values rather than depth-induced ranks. Simulation studies under multiple scenarios and for three representative depth functions indicate that log DDR achieves improved power relative to several competing depth-based nonparametric tests. The results further demonstrate that the performance of log DDR and existing methods depends strongly on the chosen depth function, consistent with insights from the DD plot. These findings support a two-stage testing approach in which the DD plot is used to guide the choice of depth notion before applying log DDR for homogeneity testing. Full article
(This article belongs to the Section Data Science)
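As an illustration of the depth-ratio idea, here is a minimal sketch assuming Mahalanobis depth as the working depth notion; the log-ratio statistic and its permutation calibration below are an illustrative reading of log DDR, not the authors' exact definition.

```python
import numpy as np

def mahalanobis_depth(points, sample):
    """Depth of each row of `points` relative to `sample`: 1 / (1 + d_M^2)."""
    mu = sample.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(sample, rowvar=False))
    diff = points - mu
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return 1.0 / (1.0 + d2)

def log_ddr(x, y, eps=1e-12):
    """Sum of absolute log depth ratios over the pooled sample (illustrative)."""
    pooled = np.vstack([x, y])
    dx = mahalanobis_depth(pooled, x)  # depth with respect to sample X
    dy = mahalanobis_depth(pooled, y)  # depth with respect to sample Y
    return np.abs(np.log((dx + eps) / (dy + eps))).sum()

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
y = rng.normal(loc=0.5, size=(100, 3))  # location-shifted alternative
obs = log_ddr(x, y)
# Permutation null: reshuffle pooled labels to calibrate the test.
pooled = np.vstack([x, y])
null = [log_ddr(*np.split(rng.permutation(pooled), [len(x)])) for _ in range(200)]
print("p-value ~", np.mean([t >= obs for t in null]))
```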
18 pages, 10397 KB  
Article
Multiple Imputation of a Continuous Outcome with Fully Observed Predictors Using TabPFN
by Jerome Sepin
Stats 2026, 9(2), 38; https://doi.org/10.3390/stats9020038 - 1 Apr 2026
Viewed by 208
Abstract
Handling missing data is a central challenge in quantitative research, particularly when datasets exhibit complex dependency structures, such as nonlinear relationships and interactions. Multiple imputation (MI) via fully conditional specification (FCS), as implemented in the MICE R package, is widely used but relies on user-specified models that may fail to capture complex dependency structures, especially in high-dimensional settings, or on more sophisticated algorithms that are considered data-hungry. This paper investigates the performance of TabPFN, a transformer-based, pretrained foundation model developed for tabular prediction tasks, for MI. TabPFN is pretrained on millions of synthetic datasets and approximates posterior predictive distributions without dataset-specific retraining, offering a compelling solution for imputing complex missing data in small to moderately sized samples. We conduct a simulation study focusing on univariate missingness in a continuous outcome with complete predictors, comparing TabPFN with standard MI methods. Performance is evaluated using bias, standard error, and coverage of the marginal mean estimand across a range of data-generating and missingness mechanisms. Our results show that TabPFN yields competitive or superior performance relative to Classification and Regression Trees and Predictive Mean Matching. These findings highlight TabPFN as a promising tool for missing data imputation, with particular relevance to health research. Full article
(This article belongs to the Special Issue Statistical Methods for Hypothesis Testing)
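To make the imputation workflow concrete, here is a minimal sketch assuming the tabpfn package's TabPFNRegressor. Since drawing from the exact posterior predictive is package-specific, the sketch adds resampled training residuals to point predictions, a crude stand-in for the paper's procedure, with pooling by Rubin's rules left to the reader.

```python
import numpy as np
from tabpfn import TabPFNRegressor  # assumed dependency: pip install tabpfn

def impute_tabpfn(X, y, m=5, rng=None):
    """Return m completed copies of y; np.nan marks missing outcome entries."""
    rng = rng or np.random.default_rng(0)
    obs = ~np.isnan(y)
    model = TabPFNRegressor()
    model.fit(X[obs], y[obs])
    resid = y[obs] - model.predict(X[obs])  # empirical residuals
    imputations = []
    for _ in range(m):
        y_full = y.copy()
        # Crude stand-in for posterior-predictive draws: mean + bootstrap residual.
        noise = rng.choice(resid, size=(~obs).sum(), replace=True)
        y_full[~obs] = model.predict(X[~obs]) + noise
        imputations.append(y_full)
    return imputations  # analyze each copy, then pool with Rubin's rules

X = np.random.default_rng(1).normal(size=(200, 4))
y = X[:, 0] + np.sin(X[:, 1])
y[:40] = np.nan  # univariate missingness in the continuous outcome
completed = impute_tabpfn(X, y)
```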
16 pages, 1425 KB  
Article
On the Classification–Causal Tradeoff in Neural Network Propensity Score Estimation
by Seungman Kim, Jaehoon Lee and Kwanghee Jung
Stats 2026, 9(2), 37; https://doi.org/10.3390/stats9020037 - 31 Mar 2026
Viewed by 229
Abstract
Observational studies serve as a vital alternative to randomized experiments but are highly susceptible to selection bias. Propensity score (PS) methods address this by balancing covariates between groups. Although including all relevant covariates is theoretically ideal, high dimensionality often destabilizes traditional estimation models. This study evaluates the efficacy of deep neural networks (DNN) and convolutional neural networks (CNN) for PS estimation compared to traditional logistic regression (LR), leveraging their capacity to handle complex nonlinear relationships and interactions. Using a Monte Carlo simulation across 36 conditions, model performance was evaluated based on bias and imbalance reduction. Results indicate that DNNs and CNNs significantly outperform LR. Specifically, while LR increased outcome bias by 17% and reduced covariate imbalance by only 5%, DNNs and CNNs reduced outcome bias by 13% and 16%, respectively, while decreasing covariate imbalance by 18% and 21%. We conclude that despite requiring specialized computational resources, neural networks offer substantial advantages for high-dimensional PS estimation. However, their reliable application necessitates stability-aware training and proper error rate thresholds to prevent probability degeneracy. Full article
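A minimal sketch of the comparison, with scikit-learn's LogisticRegression and MLPClassifier standing in for the paper's LR and DNN/CNN estimators; the clipping bounds are illustrative guards against the probability degeneracy the authors warn about.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 20))
# Nonlinear treatment assignment that a plain main-effects LR mis-specifies.
logit = X[:, 0] * X[:, 1] + np.sin(X[:, 2])
t = rng.binomial(1, 1 / (1 + np.exp(-logit)))

for model in (LogisticRegression(max_iter=1000),
              MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)):
    ps = model.fit(X, t).predict_proba(X)[:, 1]
    ps = np.clip(ps, 0.01, 0.99)     # guard against degenerate probabilities
    w = t / ps + (1 - t) / (1 - ps)  # inverse-probability weights for the ATE
    print(type(model).__name__, "weight range:", w.min().round(2), w.max().round(2))
```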
26 pages, 427 KB  
Article
On Dimension-Free Stochastic Surrogates and Estimators of Cross-Partial Derivatives and the Hessian Matrix
by Matieyendou Lamboni
Stats 2026, 9(2), 36; https://doi.org/10.3390/stats9020036 - 29 Mar 2026
Viewed by 193
Abstract
This study introduces stochastic surrogates of all the cross-partial derivatives of functions using L evaluations of functions at randomized points. Such randomized points are constructed using the class of l_p-spherical distributions or equivalent distributions. For the cross-partial derivatives of a given order |u| ∈ {2, …, d}, the proposed surrogates and the corresponding estimators of cross-partial derivatives enjoy the parametric rate of convergence and dimension-free mean squared errors when d ≤ p, breaking the curse of dimensionality. Imposing p ≤ d allows one to break the curse of dimensionality only for the cross-partial derivatives of orders |u| ≤ 1 + d/(2 log(d)). Also, the L-point-based Hessian surrogate and estimator are proposed, including the convergence analysis. A particular choice of p allows one to achieve dimension-free mean squared errors. Analytical examples and simulations are provided to show the efficiency of such surrogates and estimators. Full article
(This article belongs to the Section Computational Statistics)
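For intuition, here is a minimal Monte Carlo Hessian surrogate built from L function evaluations at Gaussian-randomized points; the Gaussian second-order Stein identity used below stands in for the paper's more general l_p-spherical construction.

```python
import numpy as np

def hessian_surrogate(f, x, L=20000, sigma=0.05, rng=None):
    """Estimate Hess f(x) via E[f(x + sigma*V)(VV' - I)] / sigma^2, V ~ N(0, I)."""
    rng = rng or np.random.default_rng(0)
    d = len(x)
    V = rng.normal(size=(L, d))
    fv = np.array([f(x + sigma * v) for v in V])
    outer = np.einsum("li,lj->lij", V, V) - np.eye(d)  # VV' - I per draw
    return (fv[:, None, None] * outer).mean(axis=0) / sigma**2

f = lambda z: z[0] * z[1] + z[2] ** 2  # true Hessian: [[0,1,0],[1,0,0],[0,0,2]]
print(np.round(hessian_surrogate(f, np.zeros(3)), 2))  # Monte Carlo approximation
```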
11 pages, 1490 KB  
Communication
Analyzing Complex Non-Linear Fascia-Muscle Interactions Using Cross-Recurrence Quantification Analysis
by Andreas Brandl, Marcus Müller and Robert Schleip
Stats 2026, 9(2), 35; https://doi.org/10.3390/stats9020035 - 25 Mar 2026
Viewed by 269
Abstract
Biophysical, neurophysiological, psychological and social processes along with their interactions are complex, often non-linear and inherently time-dependent. However, time series analysis of such measurements usually requires extensive data processing and is therefore potentially associated with structural biases. This exploratory secondary analysis introduces cross-recurrence quantification analysis (CRQA), which is explicitly suited to time series with complicated non-stationary properties. We illustrate and validate CRQA using a previous study that investigated the dynamic relationship between thoracolumbar fascia deformation and back extensor muscle activity in patients with low back pain. CRQA revealed significant differences in the relationships between fascia and muscles in low back pain patients compared to healthy individuals. The analysis also uncovered more specific aspects of fascia-muscle coupling than traditional analytical approaches, suggesting that CRQA is a useful additional tool for investigating time-dependent interactions with dynamic complex nonlinear patterns. Full article
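A minimal sketch of the cross-recurrence matrix underlying CRQA, assuming a simple time-delay embedding and a fixed distance threshold; full CRQA toolboxes compute determinism, laminarity, and related measures on top of this matrix.

```python
import numpy as np

def embed(x, dim=3, tau=2):
    """Time-delay embedding: rows are [x_t, x_{t+tau}, ..., x_{t+(dim-1)tau}]."""
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

def cross_recurrence(x, y, eps=0.5, dim=3, tau=2):
    X, Y = embed(x, dim, tau), embed(y, dim, tau)
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    return (d < eps).astype(int)  # 1 where the two systems co-recur

t = np.linspace(0, 20, 400)
crp = cross_recurrence(np.sin(t), np.sin(t + 0.4))  # phase-shifted signals
print("recurrence rate:", crp.mean().round(3))      # density of co-recurrences
```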
16 pages, 2520 KB  
Article
Multidimensional Correlates of Childhood Stunting in India: A Spatial Machine Learning and Explainable AI Approach
by Bhagyajyothi Rao, Md Gulzarull Hasan, Bandhavya Putturaya, Asha Kamath, Mohammad Aatif and Yousif M. Elmosaad
Stats 2026, 9(2), 34; https://doi.org/10.3390/stats9020034 - 24 Mar 2026
Viewed by 236
Abstract
Childhood stunting remains a major public health challenge in India and is influenced by multiple socioeconomic and environmental factors. This ecological study examined district-level correlates of childhood stunting, including Crimes Against Women (CAW), the Multidimensional Poverty Index (MPI), and drought severity, using data from NFHS-5, the National Crime Records Bureau, NITI Aayog’s MPI reports, and the Drought Atlas of India. Spatial autocorrelation and spatial regression models were applied alongside machine learning approaches and SHAP-based Explainable AI (XAI) interpretation. Childhood stunting exhibited significant spatial clustering (Moran’s I = 0.520, p < 0.001), with hotspots in northern, central, and eastern India. Higher stunting was associated with higher birth order, low maternal BMI, child anaemia, and MPI, and negatively associated with iodised salt usage, electricity access, and timely postnatal care. A significant spatial lag parameter (ρ = 0.348) indicated substantial spillover effects. Machine learning models consistently identified MPI, drought severity, and CAW as key predictors. The integrated spatial and machine learning framework identifies key correlates and spatial dependencies of childhood stunting, highlighting the need for region-specific, multisectoral interventions. Full article
(This article belongs to the Section Applied Statistics and Machine Learning Methods)
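A minimal sketch of the spatial-autocorrelation step, computing global Moran's I with a row-standardized weight matrix on synthetic "district" data; the paper's analysis adds spatial-lag regression, machine learning, and SHAP on top of this diagnostic.

```python
import numpy as np

def morans_i(values, W):
    """Global Moran's I; W is an (n, n) spatial weight matrix."""
    z = values - values.mean()
    return (len(z) / W.sum()) * (z @ W @ z) / (z @ z)

rng = np.random.default_rng(0)
n = 50
coords = rng.uniform(size=(n, 2))                     # district centroids
dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
W = ((dist < 0.25) & (dist > 0)).astype(float)        # neighbors within a radius
W /= np.maximum(W.sum(axis=1, keepdims=True), 1)      # row-standardize
values = coords[:, 0] + rng.normal(scale=0.1, size=n) # spatially structured outcome
print("Moran's I:", round(morans_i(values, W), 3))    # positive => clustering
```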
17 pages, 833 KB  
Article
An Adaptive Method to Identify Outliers in Skewed Observations: Application to Assess NAACCR Cancer Registry Data Usage
by Xiaowen Yang, Amjila Bam, Nubaira Rizvi, Xiao-Cheng Wu, Donald Mercante and Qingzhao Yu
Stats 2026, 9(2), 33; https://doi.org/10.3390/stats9020033 - 23 Mar 2026
Viewed by 256
Abstract
Outlier detection is a fundamental component of data preprocessing and quality monitoring across diverse scientific domains, including engineering, biomedical sciences, and finance. While many variables in controlled environments approximate a normal distribution, real-world data, particularly biological, environmental, and epidemiological measures, are frequently characterized by pronounced right-skewness. To address the shortcomings of conventional methods, this study introduces the Dynamic Threshold for Outlier Detection (DTOD), which reframes outlier detection as a concrete operational workflow. The DTOD framework dynamically adjusts detection thresholds based on a functional relationship between skewness and tail morphology. Validation through large-scale simulation experiments across light-, medium-, and high-skewness levels confirms the method’s versatility. The DTOD proves particularly effective at two ends of the spectrum: enhancing sensitivity for detecting subtle anomalies in light-skewed data while serving as a conservative, high-confidence screening tool that controls false positives in high-skewness environments. In real-world application to North American Association of Central Cancer Registries (NAACCR) data, the method successfully identified outliers with abnormally high unknown tumor size rates in colorectal cancer and maintained a low misclassification rate in highly skewed lung cancer data. Ultimately, the DTOD provides a promising, interpretable solution for improving data quality in skewed scenarios. Full article
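A minimal sketch of a dynamic, skewness-adjusted fence in the spirit of DTOD; the widening function below is illustrative, not the authors' fitted skewness–tail relationship.

```python
import numpy as np
from scipy.stats import skew

def dynamic_upper_fence(x):
    """Classic 1.5*IQR fence, widened as a function of sample skewness."""
    q1, q3 = np.percentile(x, [25, 75])
    g = max(skew(x), 0.0)      # only right-skew widens the fence
    k = 1.5 * np.exp(g)        # illustrative widening rule (not the paper's)
    return q3 + k * (q3 - q1)

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0, sigma=1, size=1000)  # right-skewed data
fence = dynamic_upper_fence(x)
print("upper fence:", round(fence, 2), "| flagged:", int((x > fence).sum()))
```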
23 pages, 1511 KB  
Article
Estimator Statistics from Simulation-Free Dirichlet Block-Bootstrap Resampling
by Tillmann Rosenow
Stats 2026, 9(2), 32; https://doi.org/10.3390/stats9020032 - 20 Mar 2026
Viewed by 287
Abstract
Since the initiation of two variants of the bootstrap method by Efron and Rubin in the late 1970s, a variety of advancements has emerged in the literature. The subsampling of blocks enabled the estimation of the actual variance of the sample mean. The equivalence of the data-level and the estimator-level resampling is easily established for the sample mean and similar estimators. For Rubin’s variant of the bootstrap we apply an algorithm by Diniz et al. which allows for the numerically stable computation of the sample-based cumulative distribution function of the estimator under investigation. No actual Monte Carlo resampling is necessary in this setting, and we demonstrate how we gain access to the very small probabilities in the tails and, moreover, to confidence intervals. We do this using the example of a well-known test model that exhibits geometrically decaying spatial correlations. The analysis applies naturally to temporally correlated systems or to the correlations occurring in Markov chains as well. Full article
(This article belongs to the Section Time Series Analysis)
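A minimal sketch of Rubin-style (Bayesian bootstrap) resampling at the block level with Dirichlet weights; the paper's contribution is to evaluate the resulting weighted-mean distribution exactly (simulation-free) via the Diniz et al. algorithm, whereas this sketch still draws Monte Carlo weights.

```python
import numpy as np

def dirichlet_block_bootstrap(x, block_len=10, n_draws=5000, rng=None):
    """Bayesian-bootstrap draws of the mean over non-overlapping blocks."""
    rng = rng or np.random.default_rng(0)
    n_blocks = len(x) // block_len
    block_means = x[: n_blocks * block_len].reshape(n_blocks, block_len).mean(axis=1)
    w = rng.dirichlet(np.ones(n_blocks), size=n_draws)  # flat Dirichlet weights
    return w @ block_means                              # resampled weighted means

rng = np.random.default_rng(1)
x = np.convolve(rng.normal(size=2000), np.ones(5) / 5, mode="valid")  # correlated series
draws = dirichlet_block_bootstrap(x, block_len=50)
print("95% CI for the mean:", np.percentile(draws, [2.5, 97.5]).round(3))
```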
25 pages, 1131 KB  
Article
A Bayesian Approach for Clustering Constant-Wise Change-Point Data
by Ana Carolina da Cruz and Camila P. E. de Souza
Stats 2026, 9(2), 31; https://doi.org/10.3390/stats9020031 - 17 Mar 2026
Viewed by 344
Abstract
Change-point models deal with ordered data sequences. Their primary goal is to infer the locations where an aspect of the data sequence changes. In this paper, we propose and implement a nonparametric Bayesian model for clustering observations based on their constant-wise change-point profiles via a Gibbs sampler. Our model incorporates a Dirichlet process on the constant-wise change-point structures to cluster observations while simultaneously performing multiple change-point estimation. Additionally, our approach controls the number of clusters in the model without requiring the number of clusters to be specified a priori. Satisfactory clustering and estimation results were obtained when evaluating our method under various simulated scenarios and on a real dataset from single-cell genomic sequencing. Our proposed methodology is implemented as an R package called BayesCPclust and is available from the Comprehensive R Archive Network. Full article
(This article belongs to the Section Bayesian Methods)
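For intuition about not fixing the number of clusters a priori, here is a minimal Chinese restaurant process draw, the prior representation underlying the Dirichlet process; the full method couples this prior with constant-wise change-point likelihoods in a Gibbs sampler (see the BayesCPclust R package).

```python
import numpy as np

def crp_assignments(n, alpha=1.0, rng=None):
    """Sample cluster labels for n observations from a CRP(alpha) prior."""
    rng = rng or np.random.default_rng(0)
    labels, counts = [0], [1]
    for i in range(1, n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= i + alpha                 # join ∝ cluster size, new ∝ alpha
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(0)               # open a new cluster
        counts[k] += 1
        labels.append(k)
    return np.array(labels)

labels = crp_assignments(200, alpha=2.0)
print("clusters drawn a priori:", labels.max() + 1)  # grows with the data
```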
2 pages, 2315 KB  
Correction
Correction: Risca et al. Archimedean Copulas: A Useful Approach in Biomedical Data—A Review with an Application in Pediatrics. Stats 2025, 8, 69
by Giulia Risca, Stefania Galimberti, Paola Rebora, Alessandro Cattoni, Maria Grazia Valsecchi and Giulia Capitoli
Stats 2026, 9(2), 30; https://doi.org/10.3390/stats9020030 - 17 Mar 2026
Viewed by 156
Abstract
In the original publication [...] Full article
13 pages, 251 KB  
Communication
Comparison of Minimal Circular Balanced RMDs Constructed Through Rule I and II of Cyclic Shifts Method
by Muhammad Ejaz Malik, Muhammad Ameeq, Muhammad Riaz and Rashid Ahmed
Stats 2026, 9(2), 29; https://doi.org/10.3390/stats9020029 - 13 Mar 2026
Viewed by 216
Abstract
The repeated measurement design (RMD) is a cost-effective research design commonly used in various fields. RMDs have several advantages; however, the carryover effect is a fundamental issue. Carryover effects typically serve as the primary source of bias in the evaluation of treatment efficacy. To reduce this bias, minimal circular balanced RMDs (MCBRMDs) are utilized. Rule I of the cyclic shift method produces MCBRMDs only for odd v (the number of treatments to be compared). Rule II produces these designs for both odd and even v. This article contributes to the literature by providing a systematic comparison of the two cyclic shift rules for constructing MCBRMDs for odd v. The study provides useful guidance to experimenters in choosing effective designs under practical experimental restrictions by comparing these designs in terms of carryover-effect efficiency and separability. Full article
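To illustrate the construction device, here is a minimal sketch of developing a design from a set of cyclic shifts; the shift set below (all ones, giving a cyclic Latin square) is purely illustrative, and Rules I and II in the paper prescribe which shift sets actually yield MCBRMDs.

```python
import numpy as np

def develop_design(v, shifts):
    """Rows are subjects 0..v-1, columns are periods; entries are treatments."""
    gen = np.cumsum([0] + list(shifts)) % v      # generator sequence via cumulative shifts
    return np.array([(gen + u) % v for u in range(v)])

# Each subject receives every treatment once here; circular carryover balance
# depends on the shift set chosen by Rule I or Rule II.
print(develop_design(v=5, shifts=[1, 1, 1, 1]))
```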
28 pages, 1825 KB  
Article
Combinatorial Game Theory and Reinforcement Learning in Cumulative Tic-Tac-Toe via Evaluation Functions
by Kai Li and Wei Zhu
Stats 2026, 9(2), 28; https://doi.org/10.3390/stats9020028 - 10 Mar 2026
Viewed by 492
Abstract
We introduce cumulative tic-tac-toe, a novel variant of the classic 3×3 tic-tac-toe game in which play continues until the board is completely filled. Each player’s final score is determined by the total number of three-in-a-row sequences they form. Using combinatorial game theory (CGT), we establish that under optimal play, the game is a draw, and we characterize its theoretical properties. To empirically validate and optimize practical play, we develop a reinforcement learning (RL) framework based on temporal-difference (TD) learning, which is enhanced with a domain-informed evaluation function to accelerate convergence. The experimental results show that our triplet-coverage difference (TCD) evaluation function reduces the average number of training episodes by approximately 23.1% compared with a random-initialization baseline, a statistically significant improvement at the 5% significance level. These results demonstrate the efficiency of our CGT–RL approach for cumulative tic-tac-toe and suggest that similar methods may be useful for analyzing related combinatorial games. We also discuss potential analogies in domains such as competitive resource allocation and coalition formation, illustrating how cumulative-scoring games connect abstract game-theoretic ideas to practical sequential decision problems. Full article
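One plausible reading of a triplet-coverage difference (TCD) evaluation is sketched below: each of the eight lines not yet blocked by the opponent counts toward a player's coverage, and the evaluation is the signed difference; the paper uses such a domain-informed function to shape TD-learning values.

```python
TRIPLETS = [(0, 1, 2), (3, 4, 5), (6, 7, 8),    # rows
            (0, 3, 6), (1, 4, 7), (2, 5, 8),    # columns
            (0, 4, 8), (2, 4, 6)]               # diagonals

def tcd(board):
    """board: list of 9 entries in {'X', 'O', ' '} (row-major)."""
    score = 0
    for a, b, c in TRIPLETS:
        cells = {board[a], board[b], board[c]}
        if "O" not in cells:   # line still convertible into an X triplet
            score += 1
        if "X" not in cells:   # line still convertible into an O triplet
            score -= 1
    return score               # positive favors X

print(tcd(["X", " ", " ", " ", "O", " ", " ", " ", " "]))  # prints 0 (balanced position)
```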
36 pages, 1910 KB  
Article
Meta-Analysis of Paired Binary Data with Unobserved Dependence: Insights from Laterality and Bilateralism in Anatomy
by Vasileios Papadopoulos and Aliki Fiska
Stats 2026, 9(2), 27; https://doi.org/10.3390/stats9020027 - 7 Mar 2026
Viewed by 277
Abstract
Anatomical variants are observed on paired body sides, yet many prevalence studies—particularly those based on osteological collections—report only right- and left-side frequencies without specifying whether findings occur bilaterally in the same individual. In such cases, the individual-level left–right structure is unobserved. Consequently, inference on laterality and bilateralism cannot be based on the reported data alone and must rely on explicit assumptions about within-individual dependence. We study this problem in the context of anatomic prevalence data, although the framework applies more broadly to paired binary outcomes. We parameterize the admissible joint distributions using a feasibility-based dependence index λ, spanning the full range from independence to maximal feasible concordance implied by the marginal prevalences. Within this framework, we examine two complementary estimands: the paired odds ratio for laterality and bilateral prevalence. Analytic results and Monte Carlo simulations show that bilateral prevalence varies linearly and remains stable across the admissible dependence range, whereas the paired odds ratio exhibits intrinsic boundary instability as dependence approaches its feasible maximum due to vanishing discordant counts. Uncertainty-propagation analyses further indicate that laterality inference is robust to moderate misspecification of the dependence assumption. These results demonstrate that unobserved within-subject dependence is a structural inferential issue in paired binary meta-analysis and motivate feasibility-based sensitivity analysis when only marginal data are available. Full article
(This article belongs to the Section Biostatistics)
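A minimal sketch of the feasibility-based parameterization: λ = 0 gives independence, λ = 1 the maximal feasible concordance (Fréchet upper bound), and the bilateral prevalence and paired odds ratio follow from the implied 2×2 table. The linear interpolation is an illustrative reading of the abstract, and the output shows the boundary instability of the paired OR.

```python
def joint_table(p_right, p_left, lam):
    """Joint left/right probabilities under dependence index lam in [0, 1]."""
    p11_min = p_right * p_left                  # independence (lam = 0)
    p11_max = min(p_right, p_left)              # maximal feasible concordance (lam = 1)
    p11 = p11_min + lam * (p11_max - p11_min)   # bilateral prevalence
    p10 = p_right - p11                         # right side only
    p01 = p_left - p11                          # left side only
    p00 = 1.0 - p11 - p10 - p01                 # neither side
    return p11, p10, p01, p00

for lam in (0.0, 0.5, 0.99):
    p11, p10, p01, p00 = joint_table(0.30, 0.25, lam)
    paired_or = (p11 * p00) / (p10 * p01)       # blows up as lam -> 1
    print(f"lambda={lam:.2f}  bilateral={p11:.3f}  paired OR={paired_or:.1f}")
```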
17 pages, 431 KB  
Article
The Gamma Power Generalized Weibull Distribution: Modeling Bibliometric Data
by Arioane Primon Soares, Ryan Novaes Pereira, Fernando A. Peña-Ramírez, Luz Milena Zea Fernández and Renata Rojas Guerra
Stats 2026, 9(2), 26; https://doi.org/10.3390/stats9020026 - 5 Mar 2026
Viewed by 477
Abstract
In this study, we introduce the gamma power generalized Weibull (GPGW) distribution and investigate several of its main mathematical properties. The performance of the maximum likelihood estimators is evaluated through Monte Carlo simulations. The practical relevance of the proposed distribution is illustrated through an application to real bibliometric data, where the GPGW is used to model SCImago Journal Rank (SJR) indicators. In comparison with alternative models commonly employed for lifetime and positive data, the GPGW distribution exhibits strong competitive performance. In particular, in the real data application, it outperforms eleven competing distributions in terms of goodness-of-fit criteria, including the power generalized Weibull (PGW), the gamma-Nadarajah–Haghighi (GNH), and the exponentiated power generalized Weibull (EPGW) distributions. While inheriting several mathematical features of the EPGW distribution, such as expressions for moments, skewness, and kurtosis, the GPGW offers enhanced flexibility, making it a valuable modeling tool for lifetime data and heavy-tailed positive measurements. Full article
(This article belongs to the Section Statistical Methods)
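Since the GPGW density is not reproduced in the abstract, here is a minimal maximum-likelihood fit of the PGW baseline it extends, showing the workflow (negative log-likelihood plus numerical optimization) under the standard PGW form F(x) = 1 − exp(1 − (1 + (x/σ)^ν)^γ).

```python
import numpy as np
from scipy.optimize import minimize

def pgw_negloglik(params, x):
    """Negative log-likelihood of the PGW; params are (log g, log v, log s)."""
    g, v, s = np.exp(params)              # enforce positivity via log scale
    z = 1.0 + (x / s) ** v
    logf = (np.log(g * v / s) + (v - 1) * np.log(x / s)
            + (g - 1) * np.log(z) + 1 - z ** g)
    return -logf.sum()

rng = np.random.default_rng(0)
x = rng.weibull(1.5, size=500) + 1e-9     # Weibull is the PGW special case g = 1
fit = minimize(pgw_negloglik, x0=np.zeros(3), args=(x,), method="Nelder-Mead")
print("gamma, nu, sigma ~", np.exp(fit.x).round(3))  # expect roughly (1, 1.5, 1)
```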
21 pages, 518 KB  
Communication
Ordering and Quantifying Textual Cohesion via Semantic, Geometric and Statistical Structure
by Stelios Arvanitis
Stats 2026, 9(2), 25; https://doi.org/10.3390/stats9020025 - 3 Mar 2026
Viewed by 283
Abstract
We propose a semantic, geometric, and statistical framework for quantifying and ordering textual cohesion in long-form discourse. Sentences are embedded into a semantic similarity graph and Ollivier–Ricci curvature is used to extract sentence- and document-level structural profiles, represented as step functions on a normalized rhetorical-time axis. On this functional space we define the Weighted Utopia Index (wUI), a corpus-relative measure of weighted shortfall from an upper-envelope profile under a dominance-type ordering. The rhetorical-time weighting function is learned in a self-supervised manner: we generate controlled sentence-order perturbations with known ordinal coherence degradation and estimate the weight parameters via an ordered probit model on a training split. We evaluate ordering recovery on held-out State of the Union speeches using rank correlations, pairwise and adjacent ordering accuracy, and violation-localization diagnostics with bootstrap uncertainty. Across these criteria, wUI systematically outperforms embedding-only adjacent-similarity baselines, while a Nash-type aggregation provides an interpretable semantic–structural trade-off score. An application to later-period speeches illustrates how the method yields interpretable cohesion rankings and curvature-profile diagnostics without requiring external annotations. Full article
(This article belongs to the Section Applied Statistics and Machine Learning Methods)
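A minimal sketch of the pipeline's first stage, building a sentence-similarity graph and computing Ollivier–Ricci curvature per edge; the random vectors below stand in for real sentence embeddings, and the GraphRicciCurvature package (with its OllivierRicci class) is an assumed dependency.

```python
import numpy as np
import networkx as nx
from GraphRicciCurvature.OllivierRicci import OllivierRicci  # assumed API

rng = np.random.default_rng(0)
emb = rng.normal(size=(12, 50))                   # stand-in sentence embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
sim = emb @ emb.T                                 # cosine similarities

G = nx.Graph()
for i in range(len(emb)):
    for j in range(i + 1, len(emb)):
        if sim[i, j] > 0.0:                       # illustrative similarity threshold
            G.add_edge(i, j, weight=float(1 - sim[i, j]))

orc = OllivierRicci(G, alpha=0.5)
orc.compute_ricci_curvature()
curv = [d["ricciCurvature"] for _, _, d in orc.G.edges(data=True)]
print("mean edge curvature:", round(float(np.mean(curv)), 3))
```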
14 pages, 305 KB  
Article
Asymptotic Properties of Error Density Estimators in the Two-Phase Linear Regression Model
by Fuxia Cheng and Lixia Wang
Stats 2026, 9(2), 24; https://doi.org/10.3390/stats9020024 - 1 Mar 2026
Viewed by 323
Abstract
This paper investigates kernel estimation of the error density function for the two-phase linear regression model. We derive the asymptotic distributions of residual-based kernel density estimators. First, we demonstrate that the asymptotic distribution of the maximum deviation (suitably normalized) between the residual-based kernel density estimator and the expected kernel density (based on the true errors) coincides with the result for an independent and identically distributed (i.i.d.) sample. We then prove that the residual-based kernel density estimator is asymptotically normal at a fixed point. Full article
(This article belongs to the Section Applied Statistics and Machine Learning Methods)
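A minimal sketch of the object under study: fit a two-phase linear regression by scanning candidate change points, then kernel-smooth the residuals; scipy's gaussian_kde stands in for the paper's kernel estimator and bandwidth choices.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 300))
y = np.where(x < 4, 1 + 2 * x, 9 - 0.5 * (x - 4)) + rng.normal(0, 0.7, 300)

def fit_two_phase(x, y):
    """Least-squares line on each side of the best candidate change point."""
    best = None
    for tau in x[10:-10]:                  # keep enough points in each phase
        left, right = x < tau, x >= tau
        r, coefs = 0.0, []
        for m in (left, right):
            c = np.polyfit(x[m], y[m], 1)
            r += ((y[m] - np.polyval(c, x[m])) ** 2).sum()
            coefs.append(c)
        if best is None or r < best[0]:
            best = (r, tau, coefs)
    return best[1], best[2]

tau, (c_left, c_right) = fit_two_phase(x, y)
resid = np.where(x < tau, y - np.polyval(c_left, x), y - np.polyval(c_right, x))
kde = gaussian_kde(resid)                  # residual-based error density estimate
print("change point ~", round(tau, 2), "| f_hat(0):", round(kde(0.0)[0], 3))
```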
48 pages, 6582 KB  
Review
The Path from PCA to Autoencoders to Variational Autoencoders: Building Intuition for Deep Generative Modeling
by Alaa Tharwat and Mahmoud M. Eid
Stats 2026, 9(2), 23; https://doi.org/10.3390/stats9020023 - 28 Feb 2026
Viewed by 840
Abstract
This tutorial provides a comprehensive and intuitive journey through the evolution of deep generative models, tracing a clear path from the foundations of Principal Component Analysis (PCA) to modern Variational Autoencoders (VAEs), showing how each method solves the limitations of the previous one. We begin with PCA, a linear tool for reducing data dimensions. Its inability to model non-linear patterns motivates the use of Autoencoders (AEs), which use neural networks to learn flexible, compressed representations. However, AEs lack a probabilistic framework, preventing them from generating new data. VAEs address this by treating the latent space as a probability distribution, enabling data generation. We compare the three methods through theoretical analysis, experiments, and step-by-step numerical examples that show exactly how each model compresses data—a detail often missing elsewhere. Unlike resources that treat these topics separately, we connect them into a single narrative, building intuition progressively from linear to probabilistic deep generative models. Full article
(This article belongs to the Section Applied Statistics and Machine Learning Methods)
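A minimal sketch of the tutorial's starting point, PCA as a linear encoder/decoder pair obtained from the SVD, against which autoencoders read as the non-linear generalization and VAEs as the probabilistic one.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))  # correlated data
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)       # principal directions

k = 3
encode = lambda x: (x - mu) @ Vt[:k].T     # project onto k latent dimensions
decode = lambda z: z @ Vt[:k] + mu         # linear reconstruction

Z = encode(X)
err = np.mean((X - decode(Z)) ** 2)        # what an AE minimizes non-linearly
print("k=3 reconstruction MSE:", round(float(err), 3))
```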
26 pages, 436 KB  
Article
Abundance Estimation Using Minimum Order Set Distances in Line Transect Sampling
by Mohammad Ali Al Kadiri and Mariam H. Al-Husari
Stats 2026, 9(2), 22; https://doi.org/10.3390/stats9020022 - 26 Feb 2026
Viewed by 333
Abstract
Line transect sampling is widely used for estimating population abundance, but existing nonparametric estimators of detection density at the transect line often suffer from boundary bias and tuning sensitivity. In this paper, we propose two simple tuning-light estimators based on minimum order statistics of perpendicular distances, requiring measurement of only the judged-closest object within each set. Under mild regularity conditions, the proposed estimators are consistent and asymptotically normal, with low bias and variance demonstrated through simulation studies under exponential and half-normal detection models. An application to a wooden stakes transect survey illustrates the practical advantages of the proposed approach for low-effort ecological surveys. Full article
(This article belongs to the Section Ecological Statistics)
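For context, here is the classic half-normal line-transect density estimator D = n·f(0)/(2L); the paper's minimum-order-statistic estimators replace exactly this f(0) step, shown here only as the standard baseline.

```python
import numpy as np

rng = np.random.default_rng(0)
distances = np.abs(rng.normal(0, 1.2, 80))  # perpendicular distances (km)
L = 40.0                                    # total transect length (km)

sigma2 = np.mean(distances ** 2)            # half-normal MLE of sigma^2
f0 = np.sqrt(2 / (np.pi * sigma2))          # density of |N(0, sigma^2)| at 0
D_hat = len(distances) * f0 / (2 * L)       # objects per unit area
print("estimated density:", round(D_hat, 3), "objects per km^2")
```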