1. Introduction
Analysis of count data is a fundamental challenge in statistical modeling, with applications pervasive across numerous disciplines, including epidemiology [1], genomics [2], insurance analytics [3], and environmental science [4]. Such data, which enumerate the occurrence of events within a fixed domain, are intrinsically discrete and non-negative, and frequently exhibit overdispersion [5]. These characteristics necessitate specialized regression frameworks that extend beyond the capabilities of standard linear models [6]. The complexity of the problem is amplified in high-dimensional settings, where the number of potential predictors, p, vastly exceeds the sample size, n [7]. In these contexts, traditional methodologies are plagued by multicollinearity, overfitting, and difficulties in variable selection, particularly when the feature set comprises high-throughput data such as genomic markers or textual features [8]. While the Poisson distribution has served as the foundational model for count data because of its equidispersion property, this assumption is routinely violated in practice [9]. This has led to the widespread adoption of the Negative Binomial model, which incorporates a dispersion parameter to account for extra-Poisson variation [10]. Consequently, high-dimensional count regression presents a unique set of challenges, including robust estimation amidst sparse signals, managing correlated predictor structures, and accommodating potential nonlinearities that standard parametric forms may fail to capture [11,12].
These challenges are particularly pronounced in fields such as biomedical research, where modeling high-throughput omics or RNA-seq count data with thousands of covariates demands methods that overcome the curse of dimensionality while preserving interpretability and accuracy; the task is further complicated by zero inflation, which necessitates hybrid models such as Zero-Inflated or hurdle regressions [13]. This need for advanced, scalable modeling strategies is equally critical in insurance analytics, for predicting claim counts from vast sets of policyholder variables, and in ecology, for monitoring species abundance with spatially and temporally dependent remote sensing data, where the high-dimensional reality of p ≫ n introduces estimation instability and spurious associations. While the Negative Binomial model handles overdispersion, the core challenge extends to managing complex interactions and nonlinearities beyond the scope of standard GLMs, requiring methods grounded in probabilistic principles for reliable inference [14]. To address these issues, a variety of dimension-reduction techniques are employed: from unsupervised Principal Component Analysis (PCA), which risks ignoring the response, to supervised alternatives such as Partial Least Squares (PLS) that leverage covariance with the response; and from simple but limited univariate screening to more comprehensive, iterative approaches such as recursive feature elimination. All of these methods strive to balance computational tractability with model performance in high-dimensional settings [15].
Embedded feature selection methods integrate variable selection into model estimation, with approaches such as Sure Independence Screening (SIS) shown to be effective in ultrahigh-dimensional count settings such as GWAS [16,17]. In contrast, wrapper methods are computationally demanding, and filter methods (e.g., Relief) are robust to interactions but not tailored to count data [18], motivating hybrid approaches that balance efficiency with distributional considerations.
Among penalized likelihood methods, the Lasso [19] performs simultaneous selection and estimation, but suffers from coefficient bias and instability with correlated predictors. Variants such as the adaptive Lasso address these limitations and achieve oracle properties in Poisson and Negative Binomial models [20,21]. Ridge regression [22] mitigates multicollinearity through L2 shrinkage but lacks sparsity, while the Elastic-Net combines L1 and L2 penalties to encourage both sparsity and grouping, benefiting applications with correlated predictors such as genes in biological pathways [23]. Penalization strategies have also been extended to Zero-Inflated models, enabling simultaneous selection in both count and inflation components [24].
Recent advances have produced specialized penalized methods for structured data. The group Lasso [25] applies a penalty to predefined groups of coefficients (e.g., all dummy variables for a categorical factor or all genes in a pathway), enabling group-wise selection. This is invaluable in functional genomics. The fused Lasso [26], another variant, promotes smoothness by penalizing the differences between coefficients of ordered predictors (e.g., in time series or spatial data), facilitating the detection of trends. For Negative Binomial responses, these penalized approaches are adapted to jointly estimate the dispersion parameter alongside the regression coefficients within a regularized likelihood framework. Supported by strong theoretical guarantees on consistency and selection accuracy, these techniques constitute a robust toolkit for high-dimensional count regression. A fundamental limitation they share, however, is the assumption of a linear predictor in the log-mean, which can restrict their flexibility in capturing complex nonlinear relationships [27].
Decision trees offer a nonparametric alternative by recursively partitioning the predictor space to create homogeneous subgroups. The splitting criteria can be tailored for count distributions, such as by minimizing Poisson deviance, allowing trees to inherently capture interactions and nonlinearities without assuming a specific additive form. This flexibility is highly beneficial in applications such as predicting hospital readmission counts from diverse patient records. A critical weakness of single trees is their propensity to overfit, especially in noisy, high-dimensional data, which necessitates post-pruning strategies to balance bias and variance [28].
To overcome the limitations of single trees, ensemble methods are employed. Random forests [29,30,31] aggregate a multitude of decision trees, each built on a bootstrapped sample with a random subset of predictors considered at each split. This process de-correlates the trees, mitigates overfitting, and performs implicit feature selection, making it exceptionally powerful for high-dimensional count regression where p ≫ n, such as in metagenomics. For count data, the algorithm can use distribution-specific splitting rules (e.g., Poisson deviance reduction) and provide variable importance measures. Extensions to Negative Binomial responses can incorporate dispersion estimation to handle overdispersion effectively. Similarly, boosting ensembles like gradient boosting machines iteratively construct trees focused on the residuals of previous models, optimizing the Poisson or Negative Binomial log-likelihood. Modern implementations like XGBoost incorporate additional regularization and sophisticated handling of missing data, making them highly effective for tasks like insurance claim forecasting. Although these nonparametric ensembles often surpass traditional methods in predictive performance by capturing complex patterns, they require careful hyperparameter tuning via cross-validation to achieve optimal results [32,33].
The motivation for this work arises from the widespread challenges of analyzing high-dimensional count data, where traditional generalized linear models often struggle with instability, overfitting, and inadequate capture of nonlinear dependencies [9]. In domains such as genomics and epidemiology, datasets frequently contain thousands of features but limited samples, requiring approaches that are both scalable and flexible enough to capture complex associations without restrictive parametric assumptions [12]. Penalized regressions address sparsity but impose linearity on the log-mean and may overlook higher-order interactions, while dimension-reduction techniques risk discarding informative variation. Nonparametric alternatives such as decision trees provide flexibility, though single trees suffer from instability. Ensemble methods like Random Forests offer improved predictive performance and robustness [29], yet standard implementations are not naturally suited to discrete, overdispersed responses, highlighting the need for adaptations that respect the distributional characteristics of count data.
The primary objective of this study is to develop and rigorously evaluate a novel Random Forest framework specifically designed for high-dimensional count regression. This framework will adapt the splitting and aggregation mechanisms to be optimal for both Poisson and Negative Binomial responses. Through comprehensive simulations and real-world data applications, we demonstrate its superiority in overcoming the limitations inherent in both existing parametric penalized methods and standard nonparametric ensemble techniques. This study contributes by introducing a Random Forest algorithm tailored to Poisson and Negative Binomial outcomes, incorporating deviance-based splitting criteria and fixed-dispersion estimation to enhance efficiency in high-dimensional settings. We establish theoretical properties through consistency results and benchmark performance against penalized regression and boosting methods, demonstrating improved accuracy and variable selection. Finally, via extensive simulations and applications to bioinformatics datasets, we show practical utility and provide an open-source R implementation to facilitate adoption in research and applied settings.
3. Simulation Study
We conduct a comprehensive simulation study to evaluate the finite-sample performance of the proposed Poisson and Negative Binomial Random Forest methods. The study is designed to assess performance across a range of challenging data-generating processes (DGPs) common in high-dimensional settings, including overdispersion, nonlinearity, zero-inflation, measurement error, and ultra-high dimensionality. We compare our methods against established benchmarks: ℓ1-penalized regression (Lasso), the Elastic-Net, gradient boosting (XGBoost), and a standard Random Forest applied to log(1+Y) as a pseudo-response.
3.1. Simulation Design and Data Generation
The core structure of our simulations is as follows. For each replication, we generate a training set and a test set from a specified DGP. Unless otherwise noted, the baseline design uses a fixed sample size n and ambient dimension p. Predictors are drawn from a block-correlated multivariate normal distribution, X ~ N_p(0, Σ), where Σ is a block-diagonal matrix. Each block has a size of 50, and the within-block correlation is set to a common value ρ.
The true regression vector β is sparse, with only s nonzero coefficients. The active set S is chosen uniformly at random. The values of the nonzero coefficients β_j are drawn from a uniform distribution for j ∈ S and are zero otherwise. An intercept is included in all models.
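To make the simulation design concrete, the following R sketch illustrates one way to generate block-correlated predictors and a sparse coefficient vector of the kind described above; the numeric settings (n, p, block size, correlation, sparsity level, and coefficient range) are placeholders for illustration, not the exact values used in the study.

```r
## Illustrative data-generating sketch (placeholder settings, not the study's exact values)
library(MASS)    # mvrnorm()
library(Matrix)  # bdiag()

n <- 100; p <- 1000            # sample size and ambient dimension (placeholders)
block_size <- 50; rho <- 0.5   # block size and within-block correlation (placeholders)
s <- 10                        # number of nonzero coefficients (placeholder)

# Block-diagonal covariance with equicorrelated blocks of size 50
one_block <- matrix(rho, block_size, block_size); diag(one_block) <- 1
Sigma <- as.matrix(bdiag(replicate(p / block_size, one_block, simplify = FALSE)))

X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)

# Sparse coefficient vector: active set chosen uniformly at random
active <- sample(p, s)
beta <- numeric(p)
beta[active] <- runif(s, 0.3, 0.7)  # placeholder coefficient range
beta0 <- 0.5                        # intercept (placeholder)
```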
We define two broad classes of DGPs based on the structure of the predictor function f(x), which determines the conditional mean through μ(x) = exp(f(x)).
3.1.1. Example 1: Linear Mean Structure
In this example, the predictor is a linear combination of the true signals:
\[
f(\mathbf{x}) = \beta_0 + \sum_{j \in \mathcal{S}} \beta_j x_j .
\]
We consider several distributional scenarios built upon this linear structure:
Poisson (Equi-dispersed): Y | x ~ Poisson(μ(x)), where μ(x) = exp(f(x)).
Negative Binomial (Over-dispersed): Y | x ~ NB(μ(x), θ), where μ(x) = exp(f(x)) and the dispersion parameter θ is held fixed, giving Var(Y | x) = μ(x) + μ(x)²/θ.
Zero-Inflated Poisson (ZIP): Data are generated from a mixture distribution, Y ~ π(x) δ₀ + (1 − π(x)) Poisson(μ(x)), where μ(x) = exp(f(x)). The zero-inflation probability π(x) is modeled as logit(π(x)) = γ₀ + γ₁ x_{j1} + γ₂ x_{j2} + γ₃ x_{j3}, with x_{j1}, x_{j2}, x_{j3} being the first three active predictors, resulting in approximately 20% excess zeros.
Measurement Error: The linear predictor is used, but the features available to the models are corrupted versions X̃ = X + E, where the entries of E are independent mean-zero Gaussian noise with a fixed noise level.
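Continuing the sketch above, the count responses for these linear-mean scenarios could be generated along the following lines; the dispersion, zero-inflation coefficients, and measurement-error noise level are again placeholders rather than the study's settings.

```r
## Count-response generators for the linear-mean scenarios (placeholder settings)
f_lin <- beta0 + drop(X %*% beta)   # linear predictor f(x)
mu    <- exp(f_lin)                 # conditional mean

y_pois <- rpois(n, lambda = mu)                 # Poisson (equi-dispersed)
theta  <- 2                                     # NB dispersion (placeholder)
y_nb   <- rnbinom(n, mu = mu, size = theta)     # Var = mu + mu^2 / theta

# Zero-inflated Poisson: logistic zero-inflation driven by the first three active predictors
gamma <- c(-1.5, 0.5, 0.5, 0.5)                 # placeholder coefficients
pi_zi <- plogis(gamma[1] + drop(X[, active[1:3]] %*% gamma[2:4]))
y_zip <- ifelse(runif(n) < pi_zi, 0L, rpois(n, mu))

# Measurement error: models only see corrupted features
X_err <- X + matrix(rnorm(n * p, sd = 0.3), n, p)   # placeholder noise level
```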
3.1.2. Example 2: Nonlinear Mean Structure
In this example, the predictor function f(x) includes nonlinear effects of x_{j1} and x_{j2}, the first two features in the active set S, to assess the ability of methods to capture complex signal relationships.
Nonlinear Poisson: Y | x ~ Poisson(μ(x)), where μ(x) = exp(f(x)) with the nonlinear predictor f(x).
Nonlinear Negative Binomial: Y | x ~ NB(μ(x), θ), where μ(x) = exp(f(x)) and θ is the same fixed dispersion parameter as in the linear case.
Nonlinear Zero-Inflated Poisson (ZIP): The nonlinear predictor defines the Poisson mean in the ZIP model described above.
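For concreteness, a nonlinear predictor of this kind could be assembled as in the sketch below, continuing the earlier sketch; the particular sine and interaction terms are purely hypothetical illustrations and are not the functional form used in the study.

```r
## Hypothetical nonlinear predictor (illustrative only; NOT the study's actual form)
x1 <- X[, active[1]]; x2 <- X[, active[2]]      # first two active features
f_nl  <- beta0 + drop(X %*% beta) + sin(pi * x1) + 0.5 * x1 * x2   # assumed terms
mu_nl <- exp(f_nl)

y_nl_pois <- rpois(n, mu_nl)                        # nonlinear Poisson
y_nl_nb   <- rnbinom(n, mu = mu_nl, size = theta)   # nonlinear Negative Binomial
```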
3.1.3. Ultra-High-Dimensional Regime
To stress-test the algorithms, we include an ultra-high-dimensional scenario in which p is far larger than n, preserving the same level of sparsity. Data are generated using the linear Poisson DGP.
3.2. Performance Metrics and Evaluation
Across all scenarios, we evaluate predictive and variable selection performance. Prediction accuracy is measured via the root mean squared error,
\[
\mathrm{RMSE} = \sqrt{\frac{1}{n_{\mathrm{test}}}\sum_{i=1}^{n_{\mathrm{test}}}\big(y_i - \hat{\mu}(\mathbf{x}_i)\big)^2},
\]
and the Poisson or NB deviance; for the Poisson case,
\[
D = 2\sum_{i=1}^{n_{\mathrm{test}}}\left[y_i\log\frac{y_i}{\hat{\mu}(\mathbf{x}_i)} - \big(y_i - \hat{\mu}(\mathbf{x}_i)\big)\right],
\]
with the usual convention that terms with y_i = 0 contribute 0·log 0 = 0. Variable recovery is quantified using the support overlap
\[
\frac{|\hat{\mathcal{S}} \cap \mathcal{S}|}{|\mathcal{S}|},
\]
where \(\hat{\mathcal{S}}\) denotes the set of variables identified as important by each method and \(\mathcal{S}\) is the true active set.
We repeat each simulation 100 times, generating fresh training and test data in each replication. Competing methods include ℓ1-penalized regression (Lasso), the Elastic-Net with mixing parameter α, gradient boosting with decision trees (XGBoost), and standard Random Forests applied to log(1+Y) as a pseudo-response.
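The three metrics can be computed with a few lines of R, as in the following sketch, where mu_hat stands for a method's predicted test-set means and S_hat for its selected variable set; the NB deviance is analogous, with the dispersion entering the likelihood.

```r
## Evaluation metrics (Poisson deviance shown; NB deviance is analogous)
rmse <- function(y, mu_hat) sqrt(mean((y - mu_hat)^2))

pois_deviance <- function(y, mu_hat) {
  term <- ifelse(y == 0, 0, y * log(y / mu_hat))   # convention: 0 * log(0) = 0
  2 * sum(term - (y - mu_hat))
}

# Support overlap: fraction of true signals recovered by the selected set S_hat
support_overlap <- function(S_hat, active) length(intersect(S_hat, active)) / length(active)
```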
3.3. Estimation of the Unknown Function, f(x)
The core objective of each method in this comparison is to accurately estimate the unknown conditional mean function μ(x) = E[Y | X = x], which implies estimating the predictor function f(x) = log μ(x) on the log scale. The approaches differ fundamentally in their assumptions about the structure of f(x). In the simulation setup, the estimation step encapsulates the following procedures for a given training set:
Penalized GLMs (Lasso, Ridge, Elastic-Net): These methods assume a linear form for the predictor function, f(x) = β₀ + xᵀβ. They estimate the coefficients by maximizing a penalized likelihood. For a given penalty parameter λ (selected via k-fold cross-validation within the training set), the optimization problem for the Poisson case is
\[
(\hat{\beta}_0, \hat{\boldsymbol{\beta}}) = \arg\min_{\beta_0,\,\boldsymbol{\beta}} \left\{ \frac{1}{n}\sum_{i=1}^{n} \left[ \exp(\beta_0 + \mathbf{x}_i^{\top}\boldsymbol{\beta}) - y_i\,(\beta_0 + \mathbf{x}_i^{\top}\boldsymbol{\beta}) \right] + \lambda\, P(\boldsymbol{\beta}) \right\},
\]
where P(β) is the ℓ1-norm for the Lasso or a combination of ℓ1 and ℓ2 norms for the Elastic-Net. The estimated function is then the fitted linear predictor \( \hat{f}(\mathbf{x}) = \hat{\beta}_0 + \mathbf{x}^{\top}\hat{\boldsymbol{\beta}} \). This linearity assumption is misspecified for the nonlinear data-generating processes (DGPs) in our study.
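In practice, the penalized Poisson benchmarks can be fitted with the glmnet package roughly as follows; the alpha values and the use of lambda.min are illustrative tuning choices that may differ from those used in the study.

```r
library(glmnet)

# Poisson Lasso and Elastic-Net with cross-validated lambda (illustrative tuning)
fit_lasso <- cv.glmnet(X, y_pois, family = "poisson", alpha = 1)
fit_enet  <- cv.glmnet(X, y_pois, family = "poisson", alpha = 0.5)  # placeholder alpha

# Predicted mean mu_hat = exp(f_hat) at the selected lambda
mu_hat_lasso <- predict(fit_lasso, newx = X, s = "lambda.min", type = "response")

# Selected variables: nonzero coefficients, excluding the intercept
beta_hat <- as.matrix(coef(fit_lasso, s = "lambda.min"))[-1, 1]
S_hat    <- which(beta_hat != 0)
```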
Gradient Boosting (XGBoost): This method assumes an additive form of f(x) constructed from M simple regression trees h_m,
\[
\hat{f}(\mathbf{x}) = \sum_{m=1}^{M} \nu\, h_m(\mathbf{x}; \Theta_m),
\]
where ν is a learning rate and Θ_m parameterizes each tree. The algorithm iteratively fits trees to the negative gradient (pseudo-residuals) of the Poisson or Negative Binomial loss function, greedily improving the model’s fit to the data. This allows it to capture complex nonlinearities and interactions without an explicit specification, making it more flexible than the penalized GLMs.
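A rough outline of this benchmark using the xgboost R package, which provides a built-in Poisson objective, is shown below; the hyperparameter values are placeholders.

```r
library(xgboost)

dtrain <- xgb.DMatrix(data = X, label = y_pois)

# Boosted trees with a Poisson log-likelihood objective (placeholder hyperparameters)
fit_xgb <- xgb.train(
  params  = list(objective = "count:poisson", eta = 0.1, max_depth = 3),
  data    = dtrain,
  nrounds = 200
)
mu_hat_xgb <- predict(fit_xgb, X)   # predictions are returned on the count (mean) scale
```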
Standard Random Forest (on log(1+Y)): This method sidesteps the generalized linear model framework entirely. It first applies a variance-stabilizing transformation to the count response, Z = log(1+Y), treating it as a continuous outcome. It then estimates E[Z | X = x] nonparametrically by averaging predictions from a collection of regression trees, each trained on a bootstrapped sample. The final prediction is made on the transformed (log) scale and is subsequently exponentiated to estimate the mean. This approach can capture complex structure in f(x) but may be inefficient, as it ignores the count nature and heteroskedasticity of the data during model fitting.
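This baseline amounts to an ordinary regression forest on the transformed response, for example with the randomForest package as sketched below; the back-transformation shown (expm1 of the predicted log-scale value) is one common convention and may not match the study's exact choice.

```r
library(randomForest)

z <- log1p(y_pois)                                   # variance-stabilizing pseudo-response
fit_rf <- randomForest(x = X, y = z, ntree = 500)    # Gaussian splitting on the transformed scale

g_hat     <- predict(fit_rf, X)                      # estimate of E[log(1 + Y) | x]
mu_hat_rf <- expm1(g_hat)                            # naive back-transformation to the count mean
```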
Proposed Random Forest (Poisson/NB deviance): Our method directly and explicitly models the count nature of the data by using the appropriate deviance as the splitting criterion for each tree within the ensemble. It nonparametrically estimates μ(x) by finding partitions of the predictor space that minimize the in-sample Poisson or Negative Binomial deviance (equivalently, maximize the corresponding likelihood). The algorithm’s output is an ensemble of trees whose combined predictions directly yield the estimated mean μ̂(x), from which the predictor function can be inferred as f̂(x) = log μ̂(x). This approach is designed to be both flexible enough to capture the true f(x) and efficient in respecting the distributional properties of the data.
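To illustrate the idea behind deviance-based splitting, without reproducing the paper's actual implementation, the following sketch scores a candidate split of a node by the reduction in Poisson deviance when each child is fitted with its own mean.

```r
## Conceptual illustration of a Poisson-deviance split criterion (not the paper's code)
pois_dev <- function(y, mu) {
  term <- ifelse(y == 0, 0, y * log(y / mu))
  2 * sum(term - (y - mu))
}

# Deviance reduction from splitting variable x_j at threshold t within a node
split_gain <- function(y, x_j, t) {
  left <- x_j <= t
  if (all(left) || !any(left)) return(0)   # degenerate split: no gain
  parent   <- pois_dev(y, mean(y))
  children <- pois_dev(y[left], mean(y[left])) + pois_dev(y[!left], mean(y[!left]))
  parent - children                        # larger gain indicates a better split
}
```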
4. Simulation Results
The proposed Random Forest models, RF–Poisson and RF–NB, demonstrated a decisive and substantial superiority in predictive accuracy across all standard data-generating scenarios, as evidenced by their significantly lower root mean squared error (RMSE) values (see Table 1). In the true Poisson setting, the proposed methods (Mean RMSE 0.84–0.86) outperformed their nearest competitors, the glmnet Lasso and Elastic-Net models (Mean RMSE 2.1–2.6), by a factor of nearly three, and vastly surpassed all other benchmarks including the transformed RF on log1p (3.60), XGBoost Poisson (3.96), and the Zero-Inflated model (4.13). This performance advantage was not merely a product of the data matching the model’s assumptions, as both RF proposals maintained their lead under Negative Binomial, Zero-Inflated Poisson, and measurement error conditions, with the RF–NB model showing a slight edge in the NB and ZIP scenarios. The abysmal performance of the classical GLM: NB and GLM: Poisson, which failed to converge (resulting in infinite error), starkly highlights the necessity of either regularization or flexible machine learning approaches for these types of problems. Crucially, all penalized regression methods (glmnet), while better than the naive GLMs, were consistently and significantly outperformed by the proposed RF approaches, suggesting that the ability of Random Forests to capture complex, nonlinear relationships and interactions without manual specification provided a critical advantage even in these ostensibly linear simulations.
When data generation processes incorporated nonlinearity and extreme complexity, the performance gap between the proposed Random Forest methods and the field of benchmarks became even more pronounced, solidifying their value in realistic settings where linear assumptions are frequently violated. In the Poisson nonlinear and ZIP nonlinear scenarios, the RF models (Mean RMSE 0.94–1.10) again outperformed all other methods by a wide margin (see Table 2), with the error of the next best glmnet models being approximately 2.5 to 3 times higher. The most telling results emerged from the highly challenging “Nonlinear NB” and “Ultra” settings. In the Nonlinear NB scenario, while all methods struggled with high variance and absolute error, the proposed RF–NB model achieved the lowest mean error (39.72), demonstrating a relative but clear robustness compared to the dramatically higher errors of the glmnet variants (102–106) and the catastrophic failure of the ZIP model (220.55). In the “Ultra” complex setting, the proposed methods again secured the lowest prediction errors (RF–NB: 1.39), significantly outperforming all other approaches, including the otherwise stable but mediocre Ridge regression models (3.38–3.40). The consistent failure of the classical GLMs and the poor performance of the linear-model-based glmnet suite across these complex environments underscore a critical limitation of linear predictors. In contrast, the proposed RF–Poisson and RF–NB models exhibited remarkable resilience and predictive accuracy, establishing them as the most robust and reliable methods for regression analysis of count data across a diverse spectrum of challenging, real-world conditions.
The evaluation of model fit through deviance in Table 3 reveals a stark and consistent hierarchy of performance, with the proposed Random Forest models, RF–Poisson and RF–NB, demonstrating a profoundly superior ability to describe the observed data across all linear data-generating mechanisms. The deviance values for the proposed methods are not just marginally but substantially lower than all competing approaches, often by an order of magnitude. For instance, in the Poisson scenario, the mean deviance for RF–Poisson (31.49) is less than half that of the best glmnet Lasso model (78.72) and nearly five times lower than the RF on log1p transform (155.22), unequivocally demonstrating that modeling the counts directly with a tailored Random Forest is far more effective than applying standard Random Forest or linear models to transformed data. This pattern holds true in the more complex NB and ZIP settings, where the RF–NB model achieves the lowest deviance, indicating its particular utility in handling overdispersion and zero-inflation. The classical GLMs completely fail to provide a meaningful fit, resulting in infinite deviance, while the Zero-Inflated Poisson model (ZIP) performs catastrophically poorly, with deviance values soaring into the thousands, suggesting severe model misspecification or convergence issues in these simulated conditions. The generally poor performance of the XGBoost Poisson implementation further highlights that the success of the proposed methods is not merely due to being a tree-based algorithm but is specifically attributable to their careful design for count data distributions.
The results under nonlinear and ultra-complex data-generating processes in Table 4 powerfully reinforce the conclusion that the proposed Random Forest methods are uniquely robust and provide the best model fit in highly challenging environments. In the Poisson nonlinear and ZIP nonlinear scenarios, the deviance of the proposed methods (approximately 27) is again less than half that of their nearest competitors, the glmnet Lasso and Elastic-Net models (67–74), showcasing their innate capacity to capture complex nonlinear relationships that linear predictors cannot. The most revealing results are found in the “Nonlinear NB” and “Ultra” settings. In the chaotic Nonlinear NB scenario, the proposed RF–NB model achieves a mean deviance of 99.55, which, while high, is dramatically lower and more stable than all other methods, including the glmnet variants whose deviance explodes into the hundreds and thousands, and the ZIP and GLM models, which fail completely with deviance values in the thousands and infinity, respectively. This demonstrates a critical robustness to extreme data complexity. Finally, in the “Ultra” complex setting, the proposed methods again deliver the best possible fit with the lowest deviance (RF–NB: 18.17), outperforming even the simple Ridge regression models. The consistent and dramatic failure of the classical GLMs and the ZIP model across these nonlinear settings, coupled with the middling and highly variable performance of the other flexible methods like XGBoost and standard RF, solidifies the position of the proposed RF–Poisson and RF–NB frameworks as the most reliable and best-fitting methodologies for regression analysis of count data across a vast spectrum of potential real-world data challenges.
Figure 1 and Figure 2 corroborate the findings in Table 1, Table 2, Table 3 and Table 4. The proposed RF–Poisson and RF–NB demonstrated greater stability across the 100 simulation repetitions than the other competing methods.
The evaluation of variable selection performance across linear data-generating mechanisms in Table 5 reveals a critical trade-off between power (the ability to detect true signals) and false discovery rate (FDR; the proportion of selected variables that are false positives), with the proposed Random Forest models, RF–Poisson and RF–NB, consistently demonstrating the most favorable and balanced overall performance. While the glmnet-P Lasso method achieved power comparable to that of the proposed methods in the Poisson scenario (0.728 vs. 0.767 for RF–Poisson), it did so at an untenable cost, exhibiting catastrophically high FDR values between 0.82 and 0.86 across all conditions, meaning over 80% of its selected variables were spurious. In stark contrast, the proposed RF methods maintained strong power, often the highest or second highest, while simultaneously controlling FDR at substantially lower levels, typically between 0.32 and 0.43, effectively identifying true signals without being overwhelmed by noise. This pattern highlights a fundamental weakness of the penalized regression approaches in these settings: they lack the specificity to distinguish true signals from false ones under these data structures. The other benchmarks, including the standard RF on log1p, XGBoost, and the ZIP model, all suffered from critically high FDR (≥0.87) alongside lower power. The complete failure of the Ridge regression and the classical GLM approaches, which exhibited zero power or 100% FDR, underscores their total inadequacy for variable selection tasks. Thus, the proposed RF frameworks uniquely provide a robust and practical solution, successfully navigating the power–FDR trade-off to deliver reliable and interpretable variable selection.
When the data-generating processes incorporate nonlinearity and extreme sparsity, the variable selection problem becomes profoundly more difficult, and the performance gap between the proposed methods and the alternatives widens considerably, solidifying the superiority of the RF-based approaches in realistic, complex settings (see Table 6). In the highly challenging “Nonlinear NB” and “Ultra” sparse scenarios, all methods experienced a sharp decline in power; however, the proposed RF–Poisson and RF–NB models maintained a critical advantage by achieving the highest power amongst all methods while still exercising the most effective control over the false discovery rate. For instance, in the “Ultra” sparse setting, the RF–NB model achieved a power of 0.200, more than double that of the best glmnet Lasso model (0.080), coupled with a substantially lower FDR (0.277 vs. ≥0.592 for all glmnet variants). This trend is consistent across the other nonlinear scenarios: the proposed methods consistently rank as the top performers by effectively balancing the two metrics. The penalized regression methods (glmnet), while reasonably powerful in some nonlinear cases, completely failed to control FDR, with values frequently exceeding 0.80, rendering their selected variable sets practically useless. The standard RF on log1p and XGBoost performed dismally, with exceptionally high FDR often approaching 0.96 or even 0.98, indicating an almost random selection of variables. The catastrophic performance of the classical GLMs and the poor showing of the specialized ZIP model further emphasize that the proposed RF–Poisson and RF–NB frameworks are uniquely equipped to handle the complexities of variable selection in high-dimensional, nonlinear count data, offering a combination of sensitivity and specificity that is unmatched by any other method in the comparison.
The analysis of computational efficiency in Table 7 reveals a clear hierarchy of runtime performance, which must be interpreted in the crucial context of the previously established superior predictive accuracy and variable selection performance of the proposed methods. The glmnet-based methods, particularly the “glmnet-NB Lasso (log1p)”, together with the standard “RF on log1p”, are consistently the fastest algorithms, with mean runtimes often below 0.25 s, leveraging the computational efficiency of regularized linear models and a simple Gaussian regression forest. The proposed RF–Poisson and RF–NB methods occupy a middle ground in terms of computational cost, with runtimes typically between 0.6 and 0.7 s; while they are approximately 3–5 times slower than the fastest glmnet variants, they are still decidedly faster than the XGBoost Poisson implementation, which is the slowest method at over 1.2 s, and are comparable to the ZIP model. The moderate computational overhead of the proposed methods is directly attributed to the cost of their tailored likelihood-based splitting, which is the very mechanism that grants them their significant advantage in model fit and selection accuracy. The classical GLM: NB and GLM: Poisson models are relatively fast, but this metric is meaningless given their previously demonstrated complete failure to provide useful predictions or variable selection. Therefore, the computational cost of the proposed RF methods is not only reasonable but also a worthwhile investment, representing an efficient trade-off for their substantial gains in statistical performance.
The runtime performance under nonlinear and ultra-complex data-generating processes, presented in Table 8, reinforces the patterns observed in the linear scenarios and further contextualizes the value proposition of the proposed methods. The efficiency ranking remains largely consistent: the “glmnet-NB Lasso/EN” variants and the standard “RF on log1p” are the fastest algorithms, often completing in well under 0.2 s. The proposed RF–Poisson and RF–NB methods again demonstrate moderate and stable computational requirements, with runtimes clustering around 0.65–0.85 s for standard nonlinear problems and dropping to a very efficient 0.24 s in the “Ultra” sparse high-dimensional setting, which is likely due to the early stopping mechanisms in tree building being triggered more quickly when few true signals are present. This stability in diverse and complex scenarios is a key strength, indicating their robustness. In stark contrast, the “ZIP (zeroinfl + SIS)” model shows severe computational degradation in the “Ultra” setting, with its runtime soaring to over 2 s, suggesting that it struggles significantly with model fitting and variable screening in high-dimensional complexity. The XGBoost Poisson implementation remains the slowest consistent performer. When this computational profile is viewed alongside the previously demonstrated best-in-class predictive and variable selection performance, the conclusion is unequivocal: the proposed Random Forest methods provide a computationally feasible and efficient pathway to achieving state-of-the-art results, offering a massively superior statistical return on a relatively modest and manageable investment of computational resources.
5. Application to Norwegian Mother and Child Cohort Study
The Norwegian Mother and Child Cohort Study (MoBa) [34] was a large population-based pregnancy study conducted between 1999 and 2005. As a sub-study, gene expression was measured from the umbilical cord blood of 200 newborns. After standard quality control and data processing steps, gene expression data from 111 of these samples were successfully profiled using Agilent microarrays. For an even smaller subset of 29 children, a specific biomarker of genotoxicity called micronucleus frequency (MN) was also measured. Both the gene expression (GSE31836) and MN data are publicly available from the NCBI Gene Expression Omnibus (GEO) database.
Figure 3 presents the histogram of the empirical MN count distribution, while the colored curves show the fitted model expectations for comparison of fit under different assumptions.
The data were preprocessed by extracting the GSE31836 expression dataset from GEO, originally collected by [35], which contains 41,000 gene expression profiles across 111 samples, although micronucleus count data (MN) were only available for about 29 samples. To ensure reliable inputs, genes with missing values were removed, resulting in a dataset of 6797 complete genes. The dependent variable was defined as the MN counts, while the predictor matrix was restricted to the most variable genes, since high-variance genes are more likely to capture meaningful biological signals. Specifically, variance filtering was applied, and the top 2000 most variable genes were retained to balance information content with computational feasibility. Finally, the filtered expression matrix was transposed and matched to the subset of samples with MN counts. The final dataset used for the comparative benchmark analysis comprises 29 samples and 2000 genes.
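The preprocessing steps described above can be expressed roughly as the following R sketch; the use of GEOquery and the object names are illustrative assumptions rather than the authors' actual pipeline.

```r
library(GEOquery)

# Illustrative preprocessing sketch (object handling is an assumption, not the authors' code)
gse  <- getGEO("GSE31836", GSEMatrix = TRUE)[[1]]
expr <- Biobase::exprs(gse)                          # genes x samples expression matrix

expr <- expr[complete.cases(expr), ]                 # drop genes with missing values
gene_var <- apply(expr, 1, var)
expr_top <- expr[order(gene_var, decreasing = TRUE)[1:2000], ]  # top 2000 most variable genes

X_mn <- t(expr_top)                                  # transpose to samples x genes
# X_mn is then matched to the 29 samples for which MN counts are available
```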
To evaluate the predictive performance of each model on this real-world dataset, we used a repeated k-fold cross-validation scheme with k = 5. The entire process was repeated 20 times to ensure stable estimates. For each repeat, the data were randomly partitioned into 5 folds. Each method was trained on 4 folds, and its predictions were generated for the held-out test fold. The root mean squared error (RMSE) and model-specific deviance were calculated on these out-of-sample predictions. This procedure was iterated until every fold had served as the test set once, yielding 5 performance estimates per repeat. The final reported metrics for each model in Table 9 represent the mean of these 100 individual estimates (5 folds × 20 repeats), providing a robust and unbiased assessment of their generalizable predictive accuracy.
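The repeated cross-validation scheme can be sketched as follows; fit_method and predict_method are hypothetical placeholders standing in for each competing method's fitting and prediction calls, and y_mn denotes the vector of MN counts.

```r
set.seed(1)
n_rep <- 20; K <- 5
rmse_cv <- matrix(NA, n_rep, K)

for (r in seq_len(n_rep)) {
  folds <- sample(rep(1:K, length.out = nrow(X_mn)))     # random 5-fold partition
  for (k in seq_len(K)) {
    train <- folds != k
    fit   <- fit_method(X_mn[train, ], y_mn[train])      # placeholder: any competing method
    pred  <- predict_method(fit, X_mn[!train, ])         # placeholder held-out predictions
    rmse_cv[r, k] <- sqrt(mean((y_mn[!train] - pred)^2))
  }
}
mean(rmse_cv)   # reported metric: mean over 5 folds x 20 repeats = 100 estimates
```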
Table 9 provides a critical validation of the proposed methods, demonstrating their practical utility and superior performance in a challenging n ≪ p setting with 29 samples and 2000 genes. The proposed RF–Poisson and RF–NB models consistently outperform all competing methods on the most important statistical metrics, achieving the lowest root mean squared error (RMSE of 2.10) and the lowest deviance (17.0), which indicates that they provide the best predictive accuracy and the closest model fit to the observed micronuclei count data. This superior performance is particularly notable compared to the standard ‘RF on log1p’, which, while computationally faster (0.096 s), suffers from a substantially higher error (RMSE 2.15) and a critical lack of interpretability by selecting an unmanageably high number of variables (1143 genes), effectively failing to perform any meaningful variable selection. The penalized regression models (glmnet) present a mixed picture; while the Lasso and Elastic-Net variants produce very sparse models (selecting between 6 and 19 genes), this parsimony comes at a severe cost to predictive accuracy, resulting in the highest RMSE values in the study (≥2.38). In contrast, the Ridge regression models, which retain all variables, perform moderately but are still outperformed by the proposed methods. The XGBoost Poisson implementation performs poorly, with a high level of error and the second-highest deviance. The classical GLM and ZIP models failed to produce valid results for RMSE and deviance, with the ZIP model also being the slowest computationally. Therefore, the proposed RF–Poisson and RF–NB frameworks strike an optimal balance in this real data application: they deliver the best predictive performance, achieve an excellent model fit, and produce a highly interpretable and biologically plausible model by selecting a manageable subset of 35–36 potential genetic predictors, all within a computationally feasible time frame of less than 0.8 s, establishing them as the most robust and effective choice for analyzing high-dimensional count data.
Biological Insights
Figure 4 provides a critical evaluation of the stability and biological reproducibility of the variables (genes) selected by each method, which is a paramount concern in high-dimensional genomic analysis where the goal is to identify robust biomarkers rather than artifacts of statistical noise.
The most striking result is the exceptional performance of the proposed RF–NB and RF–Poisson methods. Their bars tower over all others, with RF–NB identifying nearly 600 overlapping genes and RF–Poisson identifying over 100. This indicates that, when the analysis is repeated on different subsets of the data (e.g., via cross-validation), these two methods consistently prioritize the same set of top genes. This high degree of stability is a hallmark of a robust method and suggests that the biological signals these genes represent are strongly reproducible.
In stark contrast, nearly all other methods demonstrate profoundly unstable feature selection. The standard RF on log1p and XGBoost Poisson (while they are powerful predictive tools) show almost no overlap (a maximum of 23 genes), meaning their selected features change drastically with slight changes in the input data. This makes their results nearly impossible to interpret biologically, as there is no consistent genetic signature to investigate. Similarly, the various glmnet models (Lasso, EN, Ridge) exhibit low to moderate overlap, despite being designed for variable selection. Their instability in this high-dimensional, real-world setting suggests that they are highly sensitive to correlation and noise in genomic data, leading to inconsistent results. The ZIP model and classical GLMs fail entirely, identifying zero or negligible overlapping features, rendering their selections biologically uninterpretable.
6. Discussion of Results
The empirical results of this study consistently demonstrate that the proposed Random Forest frameworks for Poisson and Negative Binomial responses effectively bridge a critical methodological gap in high-dimensional count regression identified in the previous literature. Where traditional penalized GLMs (e.g., Lasso, Elastic-Net) [19,23] offer interpretability through sparsity but falter due to their rigid linearity assumption and instability in feature selection under correlation, as noted by [8], and where standard nonparametric ensembles (e.g., RF on log1p, XGBoost) [29,33] offer predictive flexibility at the cost of biological interpretability and feature stability, our proposed methods synthesize the strengths of both paradigms. The superior and stable predictive accuracy, quantified by lower RMSE and deviance across both simulated and real data, underscores their ability to capture the complex, nonlinear dependencies inherent in modern high-dimensional datasets, such as the gene-expression interactions that linear predictors invariably miss, as discussed by [11,34]. Furthermore, their exceptional stability in variable selection provides a level of reproducibility that addresses the critical limitation of instability in high-dimensional feature selection raised by [7], contrasting sharply with the extreme instability of other ensembles and the often overly sparse or inconsistent selections from penalized regressions.
This performance achievement aligns with, but substantially extends, the call for specialized methods that respect the probabilistic nature of count data made by [5,9]. The integration of likelihood-based splitting criteria tailored to the respective mean–variance structures allows the forests to overcome the equidispersion limitation of Poisson models noted by [1] while avoiding the pitfalls of naïve transformations that distort error structures. In the context of the existing literature, our work thus moves beyond simply applying machine learning to count data, instead providing a rigorously validated tool that addresses the dual challenges of predictive performance and reliable inference in high-dimensional settings, as envisioned by [18], particularly for biological applications where stable feature selection is paramount for generating testable hypotheses, as demonstrated in our analysis of the MoBa study.
7. Conclusions
In conclusion, this study establishes that the proposed Random Forest frameworks represent a significant and practical advancement for high-dimensional count regression, directly addressing a critical gap in the methodological literature. Existing approaches force a trade-off: penalized regressions offer interpretability through sparsity but fail to capture the complex nonlinearities and interactions inherent in modern datasets, while standard machine learning ensembles offer predictive flexibility at the cost of ignoring the fundamental distributional characteristics of count data, such as overdispersion, leading to unstable and biologically uninterpretable results. Our work bridges this divide by harmonizing the predictive power of nonparametric ensembles with the model-specific rigor of parametric methods, thus responding directly to the call for methods that can handle complex dependencies while maintaining probabilistic coherence, as noted by [14]. The results confirm that these tailored algorithms provide an essential tool for researchers in fields like genomics and epidemiology, where the fundamental challenge of accurately modeling complex count data with thousands of correlated predictors, as described by Robinson and Smyth [2], necessitates a method that is both statistically robust and practically interpretable.
The practical importance and applicability of our introduced method are demonstrated by its superior performance across all evaluated metrics. In simulation studies, the proposed frameworks decisively achieved a significantly lower prediction error (RMSE) and a better model fit (deviance) than all benchmarks, including penalized regressions and alternative ensembles. Crucially, they also provided exceptionally stable and interpretable feature selection, maintaining a high power to detect true signals while effectively controlling the false discovery rate, a balance that other methods consistently failed to achieve. This combination of high predictive accuracy and reliable variable identification makes the method immediately applicable to real-world decision making, such as biomarker discovery and risk assessment, where generating reliable and actionable insights from high-dimensional data is crucial. The successful application to the Norwegian Mother and Child Cohort Study further underscores the utility of our method, which not only delivered the most accurate predictions but also identified a stable, biologically plausible set of genetic features associated with micronuclei frequency.
Despite these strengths, we acknowledge certain limitations that also chart a course for future work. The computational overhead of the proposed forests, though reasonable, is higher than that of ultra-fast penalized regressions [19] and could hinder application to extremely large-scale problems. Furthermore, the current implementation lacks native handling of zero inflation, which represents a limitation for applications in ecology and microbiology where such models are crucial [13]. Future research will therefore focus on developing an integrated Zero-Inflated Random Forest model and on extending the framework to accommodate structured penalties for grouped features, as proposed by [25], thus enhancing its utility and applicability to an even broader range of scientific questions. Ultimately, by providing a versatile and rigorously validated tool that respects the unique characteristics of count data, this research contributes a fundamental component to the analytical toolkit for modern high-dimensional statistical analysis. Finally, while the proposed Random Forest is presented as a powerful standalone method, its predictions could serve as a valuable base learner in a stacked ensemble with other algorithms, such as XGBoost, to potentially yield further gains in predictive accuracy for specific applications; this remains another exciting avenue for future research.