1. Introduction
Missing data are one of the critical problems encountered in scientific research, directly affecting the availability of the complete datasets needed for drawing statistical inferences. Furthermore, many statistical analysis techniques require complete datasets, so researchers must either impute the missing values or drop the cases with missing entries. Dropping incomplete cases, however, is inefficient and often unacceptable because it discards original information that could support more reliable conclusions [1]. Imputation is therefore the only viable option when one does not wish to remove original cases from the dataset before analysis.
The issue of missing data has been extensively explored in the classical statistical literature [
2]. Several established methods exist to address this challenge. The simplest approach, case-wise deletion (CWD), involves removing samples with missing values entirely. However, ref. [
2] highlighted limitations of CWD, particularly in linear regression contexts, citing notable drawbacks such as bias and inefficiency. An alternative to CWD is the mean substitution (MS) method, where missing values are replaced by the mean of observed data points. While straightforward and often effective, this technique may oversimplify complex datasets.
A more sophisticated strategy, multiple imputation (MI), was formalized in [
3]. This approach estimates missing values using conditional relationships derived from the observed data. The imputation is repeated t times (typically 3–5 imputations), generating t complete datasets. Statistical analyses are performed on each imputed dataset, and the final parameter estimates and standard errors are obtained by combining the results across the imputed datasets.
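As a concrete illustration of this impute–analyse–pool workflow, the following R sketch uses the mice package; the data frame dat and the analysis model y ~ x1 + x2 are hypothetical placeholders rather than the authors' actual code.

```r
# Illustrative multiple-imputation workflow (hypothetical data and model).
library(mice)

imp    <- mice(dat, m = 5, method = "pmm", seed = 101)  # generate 5 imputed datasets
fits   <- with(imp, lm(y ~ x1 + x2))                    # analyse each completed dataset
pooled <- pool(fits)                                    # combine estimates across imputations
summary(pooled)
```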
Another prominent method, the Expectation-Maximization (EM) algorithm [
4], shares similarities with MI but operates through iterative cycles. During the expectation (E) step, the algorithm computes the expected log-likelihood, treating the missing data as latent variables. The subsequent maximization (M) step updates the parameter estimates (e.g., mean vectors and covariance matrices) to maximize this expected likelihood. Despite being classified as a single imputation method (as it produces one set of imputed values), the EM algorithm remains widely adopted due to its probabilistic rigour and computational efficiency.
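To make the alternating E and M steps concrete, here is a minimal, hand-rolled R sketch for a single Gaussian variable with missing entries; the function name em_normal and the convergence settings are illustrative assumptions, not part of the original study.

```r
# Minimal EM sketch for a univariate Gaussian with missing values (illustrative only).
em_normal <- function(x, tol = 1e-8, max_iter = 100) {
  obs <- !is.na(x)
  mu  <- mean(x[obs]); s2 <- var(x[obs])          # initialize from observed data
  for (i in seq_len(max_iter)) {
    x_fill <- ifelse(obs, x, mu)                  # E-step: expected values for missing entries
    mu_new <- mean(x_fill)
    # M-step: update variance, adding the conditional variance of the missing entries
    s2_new <- (sum((x_fill - mu_new)^2) + sum(!obs) * s2) / length(x)
    if (abs(mu_new - mu) < tol && abs(s2_new - s2) < tol) break
    mu <- mu_new; s2 <- s2_new
  }
  list(mean = mu, var = s2)
}
```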
A robust methodology referred to as Multiple Imputation by Chained Equations (MICE), also known as Fully Conditional Specification (FCS), has gained popularity in recent years for addressing missing data problems [
5]. Several researchers have suggested that MICE is a powerful tool for imputing quantitative variables with missing values in a multivariate data setting, and the method performs better compared to the ad hoc and single imputation methods [
6]. Despite its robustness and wide application, other alternatives exist that handle missing data within their own algorithms. For instance, proximity imputation, widely adopted in random forest algorithms, begins with a rough initial imputation, fits a random forest, and then updates the imputed values using the proximities of the data [7]. The process is repeated over several iterations until the imputations stabilize.
A method proposed in [
8] utilizes proximity matrices measuring pairwise observation similarity within random forests to estimate missing values. Subsequent work in [
9] expanded random forest-based imputation, including applications to unsupervised classification [
10]. Additionally, ref. [
11] introduced an adaptive imputation technique for Random Survival Forests, demonstrating its superiority over traditional proximity-based methods in survival analysis.
A comparative study [
12] evaluated classification methods (k-nearest neighbours [kNN], C4.5, and support vector machines [SVMs]) alongside the MICE algorithm on datasets with missing values. The results highlighted MICE’s effectiveness, whereas C4.5 often exhibited increased misclassification errors when paired with imputation. Separately, ref. [
13] analysed kNN imputation within random forests through extensive simulations.
Within the Bayesian framework, ref. [
14] developed BARTm, an extension of Bayesian Additive Regression Trees (BART) that addresses covariate missingness. Unlike conventional methods requiring imputation or censoring, BARTm modifies decision tree splitting criteria to incorporate missing values directly, treating missingness as a valid partitioning factor. This approach captures potential signals in non-random missingness, accommodates continuous and categorical data, and integrates imputation into the model construction. BARTm also provides Bayesian credible intervals that inherently account for imputation uncertainty. Notably, its computational efficiency enables the seamless prediction of future data with missing entries.
Many traditional missing data methods, as discussed above, were designed for low-to-moderate dimensional datasets. These approaches often prove inadequate in high-dimensional contexts (e.g., genomics, neuroimaging), where imputing all variables risks overparameterization and non-convex optimization challenges, as seen in maximum likelihood and EM-based methods [
15]. While [
16] advocates imputing all variables to ensure unbiased correlations, this becomes impractical when variables vastly outnumber samples.
Furthermore, most existing methods focus on continuous data (e.g., gene expression [
17,
18]), overlooking complex variable interactions and nonlinearities. Conventional MI struggles with such complexities, producing biased estimates [
19]. Emerging techniques like Fully Conditional Specification (FCS) [
20] show promise but remain challenging to implement, particularly for models involving intricate interactions.
Current methods for handling missing data in high-dimensional settings, including standalone Multiple Imputation by Chained Equations (MICE) and Bayesian Random Forest (BRF), exhibit critical limitations. MICE, while flexible, often struggles with complex feature interactions in nonlinear or non-ignorable missingness (MNAR) scenarios, leading to biased imputations [
21]. BRF frameworks [
22,
23], though robust for prediction, lack systematic integration with iterative imputation, limiting their utility in incomplete datasets. To bridge these gaps, we propose a novel hybrid framework that synergizes MICE with BRF, enhancing both imputation accuracy and predictive performance. Our method extends BRF by preprocessing data with MICE to iteratively refine imputations while preserving uncertainty quantification through Bayesian tree ensembles. This addresses two key shortcomings: (1) MICE’s reliance on parametric assumptions that are easily misspecified in high dimensions and (2) BRF’s inability to jointly model missingness mechanisms and response variables.
3. Multivariate Imputation by Chained Equations (MICE)
Rubin [
29] highlighted the difficulties in making valid statistical inferences from incomplete datasets. A fundamental requirement is to comprehend the mechanisms leading to missing data, which has inspired the development of specialized inference frameworks and more precise definitions [
2]. Missingness mechanisms are classified into three categories:
Missing Completely at Random (MCAR): The occurrence of missing data is independent of both the observed and unobserved variables:
$$P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(R).$$
Missing at Random (MAR): The missingness is contingent solely on the observed data:
$$P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(R \mid Y_{\mathrm{obs}}).$$
Missing Not at Random (MNAR): Missingness depends on the unobserved data or on the missing values themselves:
$$P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}).$$
In this context, $R$ signifies the missingness indicator, $Y = (Y_{\mathrm{obs}}, Y_{\mathrm{mis}})$ represents the complete data, and $P(\cdot)$ denotes the probability distribution associated with the missingness.
Multiple Imputation by Chained Equations (MICE) is particularly useful for datasets with missing values across different variables. Rather than defining a joint distribution, it iteratively samples from the conditional density of each incomplete variable given all the others, $P(X_j \mid X_{-j}, \theta_j)$, accommodating various models (e.g., log-linear, multivariate normal) [
25,
30,
31]. While it has theoretical limitations, simulations demonstrate its practical effectiveness [
32].
As detailed in [
33], MICE estimates the posterior distribution of the missing values through a series of chained equations.
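The chained-equations cycle can be sketched in a few lines of R. The simplified loop below regresses each incomplete variable on the others and redraws its missing entries with added noise; it is a didactic sketch (a hypothetical data frame X of numeric columns is assumed), not the algorithm implemented in the mice package, which uses proper Bayesian draws or predictive mean matching.

```r
# Simplified fully-conditional-specification (chained equations) cycle (illustrative).
fcs_cycle <- function(X, n_iter = 10) {
  miss <- is.na(X)
  for (j in seq_len(ncol(X))) {                     # start from simple mean imputation
    X[miss[, j], j] <- mean(X[, j], na.rm = TRUE)
  }
  for (it in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      if (!any(miss[, j])) next
      fit  <- lm(X[, j] ~ ., data = X[, -j, drop = FALSE], subset = !miss[, j])
      pred <- predict(fit, newdata = X[miss[, j], -j, drop = FALSE])
      X[miss[, j], j] <- pred + rnorm(sum(miss[, j]), 0, sigma(fit))  # stochastic redraw
    }
  }
  X
}
```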
3.1. Missing Data Imputation with MICE
Let the incomplete dataset be denoted as $D = (y, X)$, where $y$ is the $n \times 1$ response vector and $X$ is an $n \times p$ covariate matrix. Let $R$ be the $n \times p$ missingness indicator matrix, where $R_{ij} = 1$ if $X_{ij}$ is observed and 0 otherwise. The goal is to impute the missing values in $X$ under three mechanisms: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR). MICE iteratively imputes the missing values using conditional models. For each covariate $j$, the algorithm cycles through the following steps until convergence:
3.1.1. MICE Imputation for Missing Completely at Random (MCAR)
If the mechanism behind the missingness is classified as Missing Completely at Random (MCAR), it is expressed as follows:
$$P(R_{ij} = 0 \mid X, y) = P(R_{ij} = 0).$$
This signifies that the occurrence of missing data is independent of both observed and unobserved factors. When employing Multiple Imputation by Chained Equations (MICE) to address this missingness, the imputation model for a continuous variable $X_j$ can be represented by the following equation:
$$X_{ij}^{\mathrm{mis}} = \beta_0 + X_{i,-j}\beta + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0, \sigma^2).$$
In this model, the parameters $\beta_0$ and $\beta$ are estimated using Ordinary Least Squares (OLS) regression. On the other hand, for categorical variables $X_j$, the imputation model is structured as follows:
$$\log \frac{P(X_{ij} = k \mid X_{i,-j})}{P(X_{ij} = K \mid X_{i,-j})} = \beta_{0k} + X_{i,-j}\beta_k, \qquad k = 1, \dots, K - 1.$$
In this case, the equation models the log odds of each category of the outcome against a reference category, with parameters estimated for each category. This approach allows for the effective handling of missing data across different types of variables.
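Under MCAR, these conditional models can be fitted directly on the observed rows. The R sketch below assumes a hypothetical data frame dat with complete covariates x1 and x2, an incomplete continuous variable x_cont, and an incomplete binary variable x_cat coded 0/1; it illustrates the idea rather than reproducing the MICE implementation.

```r
# OLS imputation for a continuous variable (illustrative, hypothetical columns).
obs_c <- !is.na(dat$x_cont)
fit_c <- lm(x_cont ~ x1 + x2, data = dat[obs_c, ])
dat$x_cont[!obs_c] <- predict(fit_c, newdata = dat[!obs_c, ]) +
  rnorm(sum(!obs_c), 0, sigma(fit_c))                 # add residual noise to the draw

# Logistic (log-odds) imputation for a binary categorical variable.
obs_k <- !is.na(dat$x_cat)
fit_k <- glm(x_cat ~ x1 + x2, data = dat[obs_k, ], family = binomial)
p_hat <- predict(fit_k, newdata = dat[!obs_k, ], type = "response")
dat$x_cat[!obs_k] <- rbinom(sum(!obs_k), 1, p_hat)    # draw the imputed class
```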
3.1.2. MICE Imputation for Missing at Random (MAR)
When the missingness mechanism is classified as MAR (Missing at Random), the underlying model is as follows:
$$P(R_{ij} = 0 \mid X, y) = P(R_{ij} = 0 \mid X^{\mathrm{obs}}, y^{\mathrm{obs}}).$$
This implies that the probability of the missingness indicator being zero (indicating that the data point is missing) can be explained solely by the observed covariates $X^{\mathrm{obs}}$ and the observed outcomes $y^{\mathrm{obs}}$. In other words, the missingness does not depend on the unobserved values themselves but rather on other observed variables.
For example, if $X_j$ is MAR, the imputation can be obtained using the following:
$$X_{ij}^{\mathrm{mis}} = \beta_0 + X_{i,-j}^{\mathrm{obs}}\beta_1 + y_i^{\mathrm{obs}}\beta_2 + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0, \sigma^2).$$
In this context, the imputation model estimates the missing value using the observed covariates $X_{-j}^{\mathrm{obs}}$ and the observed outcomes $y^{\mathrm{obs}}$. The parameters $\beta_0$, $\beta_1$, and $\beta_2$ represent the estimated coefficients from a regression model, while $\varepsilon_{ij}$ captures the error term, assumed to follow a normal distribution with mean zero and variance $\sigma^2$.
The diagnosis of the missingness is performed using the following:
$$\log\!\left(\frac{P(R_{ij} = 0)}{1 - P(R_{ij} = 0)}\right) = \gamma_0 + X_{i,-j}^{\mathrm{obs}}\gamma_1 + y_i^{\mathrm{obs}}\gamma_2.$$
This diagnostic model helps assess the adequacy of the imputation strategy by modelling the log odds of $R_{ij} = 0$ as a linear combination of the observed predictors $X_{-j}^{\mathrm{obs}}$ and $y^{\mathrm{obs}}$. The parameters $\gamma_0$, $\gamma_1$, and $\gamma_2$ are the coefficients to be estimated from the data. Analysing these coefficients can provide insights into how well the selected observed variables explain the mechanism of missingness, thus highlighting any potential biases in the imputation process.
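The diagnostic regression above is straightforward to run in R: the missingness indicator is modelled as a logistic function of the observed predictors. The column names dat$x_j, x_obs, and y below are hypothetical placeholders.

```r
# MAR diagnostic (illustrative): how well do observed variables explain missingness?
R_j      <- as.integer(is.na(dat$x_j))                 # 1 = missing, 0 = observed
diag_fit <- glm(R_j ~ x_obs + y, data = dat, family = binomial)
summary(diag_fit)   # significant coefficients suggest missingness depends on these predictors
```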
3.1.3. MICE Imputation for Missing Not at Random (MNAR)
If the missingness mechanism is MNAR, the relationship governing the missingness can be expressed as follows:
$$P(R_{ij} = 0 \mid X, y) = P(R_{ij} = 0 \mid X^{\mathrm{obs}}, X^{\mathrm{mis}}, y),$$
where $R_{ij}$ indicates whether the $j$-th variable is observed (1) or missing (0) for case $i$. This equation highlights that the probability of data being missing relies not only on the observed covariates $X^{\mathrm{obs}}$ but also on the missing covariates $X^{\mathrm{mis}}$.
In MNAR, the imputation can be achieved using either a pattern-mixture model or a selection model. In the pattern-mixture framework, the imputation of missing values is based on the observed data and on the pattern of missingness. The imputation equation can be stated as follows:
$$\hat{X}_{ij}^{\mathrm{mis}} = \beta_0 + X_{i,-j}^{\mathrm{imp}}\beta_1 + \delta R_{ij} + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0, \sigma^2).$$
Here, $\hat{X}_{ij}^{\mathrm{mis}}$ refers to the estimated value for the missing $j$-th variable. The term $\beta_0$ is the intercept, while $X_{i,-j}^{\mathrm{imp}}$ contains the imputed values of the covariates excluding the $j$-th variable. The coefficient $\beta_1$ captures the effects of these imputed covariates. The term $\delta R_{ij}$ accounts for the missingness indicator, and $\varepsilon_{ij}$ represents the normally distributed error term with mean zero and variance $\sigma^2$.
On the other hand, the selection model distinguishes between the outcome-generation process and the mechanism that causes the data to be missing. The outcome and missingness models can be articulated as follows:
$$X_{ij} \sim N\!\left(X_{i,-j}\beta, \sigma^2\right), \qquad \log\!\left(\frac{P(R_{ij} = 0)}{1 - P(R_{ij} = 0)}\right) = \gamma_0 + \gamma_1 X_{ij}.$$
In this context, $X_{ij}$ represents the $j$-th variable, which follows a normal distribution with a mean determined by the regression parameters $\beta$ applied to the covariates $X_{i,-j}$, along with a variance $\sigma^2$. The missingness model uses a logistic regression framework in which the log odds of the missingness indicator being zero are expressed as a linear function of the $j$-th variable itself, with parameters $\gamma_0$ and $\gamma_1$ capturing how the responses influence the likelihood of data being missing.
It is worth noting that, when data are MNAR, the uncertainty introduced by imputation can lead to an increase in the estimated variance. To account for this, we model the error term as follows:
$$\varepsilon_{ij} \sim N\!\left(0, \lambda \hat{\sigma}^2_{\mathrm{obs}}\right), \qquad \lambda > 1.$$
Here, $\varepsilon_{ij}$ represents the error associated with the observation, which is assumed to follow a normal distribution with a mean of zero. The term $\hat{\sigma}^2_{\mathrm{obs}}$ denotes the variance estimated from the observed data, while $\lambda$ is a factor greater than one that indicates the extent of variance inflation due to the missing data mechanism. This adjustment is critical, as it ensures that the variability of the estimates reflects the increased uncertainty stemming from the MNAR assumption.
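A simple way to operationalize this in R is a delta-adjusted, variance-inflated draw for the missing entries: fit the imputation model on the observed rows, shift the predictions by a sensitivity parameter, and widen the residual noise. The shift delta and inflation factor lambda below are assumed illustrative values, not quantities estimated in the paper.

```r
# MNAR-style imputation sketch: delta adjustment plus variance inflation (illustrative).
fit_obs <- lm(x_j ~ x_obs, data = dat, subset = !is.na(x_j))   # model fitted on observed rows
delta   <- -0.5        # hypothetical MNAR shift (sensitivity parameter)
lambda  <- 1.5         # hypothetical variance-inflation factor (> 1)

mis          <- is.na(dat$x_j)
mu_mis       <- predict(fit_obs, newdata = dat[mis, ]) + delta
dat$x_j[mis] <- mu_mis + rnorm(sum(mis), 0, sqrt(lambda) * sigma(fit_obs))
```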
Subsequently, we applied Rubin’s rules, which provide a systematic approach for combining estimates and their uncertainties from multiple imputed datasets. Let $\hat{\theta}_i$ denote the estimate derived from the $i$-th imputed dataset. The combined estimate is calculated as follows:
$$\bar{\theta} = \frac{1}{m}\sum_{i=1}^{m} \hat{\theta}_i,$$
where $m$ is the total number of imputed datasets. This average provides a single point estimate that optimally reflects the information gleaned from all datasets.
The total variance of the combined estimate is derived from both the within-imputation variances and the between-imputation variance, calculated as follows:
$$T = \bar{U} + \left(1 + \frac{1}{m}\right) V.$$
In this formula, $\bar{U}$ represents the average within-imputation variance (i.e., the variance of the estimates within each dataset), while $V$ signifies the between-imputation variance (i.e., the variance of the estimates across the different datasets). The term $\left(1 + \frac{1}{m}\right)$ adjusts the between-imputation variance to account for the finite number of imputations, providing a comprehensive measure of the overall uncertainty in the final estimate. This framework is essential for properly estimating uncertainty in the presence of missing data, allowing for more accurate statistical inference.
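The pooling rule above is easy to express as a small helper. The following R sketch (with made-up example numbers purely for illustration) combines m point estimates and their standard errors into a pooled estimate and total variance.

```r
# Rubin's rules: pool estimates from m imputed-data analyses (illustrative helper).
rubin_pool <- function(est, se) {
  m     <- length(est)
  q_bar <- mean(est)                  # combined point estimate
  u_bar <- mean(se^2)                 # average within-imputation variance
  v     <- var(est)                   # between-imputation variance
  total <- u_bar + (1 + 1 / m) * v    # total variance of the pooled estimate
  c(estimate = q_bar, total_variance = total)
}

# Usage with hypothetical estimates and standard errors from m = 5 imputations:
rubin_pool(est = c(1.02, 0.97, 1.05, 1.01, 0.99),
           se  = c(0.11, 0.12, 0.10, 0.11, 0.12))
```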
6. Simulation and Results
The three missingness mechanisms (MCAR, MAR, and MNAR) were simulated for both regression and classification cases. For the regression case, we adopted the simulation strategies of [23] for simulating the high-dimensional Friedman nonlinear Gaussian response model and of [7] for the different missingness injection mechanisms. The sample size $n$ and the number of covariates $p$ were chosen so that the model is high-dimensional: only a small subset of the covariates is relevant to the response, and the remaining covariates are noise. For MCAR, the relevant covariates are set to missing at random positions according to independent Bernoulli draws, a case being missing whenever the corresponding Bernoulli variable equals 1. For MAR, the relevant covariates are set to missing according to a probability of missingness defined through a probit link function of the observed data. Similarly, for MNAR, the relevant covariates are set to missing according to a probit-link probability of missingness that depends on the incompletely observed variables themselves. The proportions of missingness considered (25%, 50%, and 75%) were adapted from the studies [7,34,35,36], which reported comparable fractions of missing entries in the simulation and real-life datasets they used. Two other methods (RF [7] and BART2: bartMachine [14]) were compared with BRF using the root mean square error,
$$\mathrm{RMSE} = \sqrt{\frac{1}{n_{\mathrm{test}}}\sum_{i=1}^{n_{\mathrm{test}}} \left(y_i - \hat{y}_i\right)^2},$$
and its average over folds (ARMSE) as performance measures under 10-fold cross-validation. All simulations and analyses were carried out in R version 4.3.1.
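As a concrete illustration of this design, the R sketch below generates the standard Friedman benchmark response with additional noise covariates and injects MCAR missingness into the relevant covariates via Bernoulli draws. The dimensions n and p, the seed, and the exact coefficients are assumptions for illustration and are not claimed to match the paper's settings.

```r
# Illustrative Friedman-type simulation with MCAR missingness injection.
set.seed(2024)
n <- 100; p <- 1000                       # assumed dimensions (p >> n)
X <- matrix(runif(n * p), n, p)
y <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
     10 * X[, 4] + 5 * X[, 5] + rnorm(n)  # only the first five covariates are relevant

prop_miss <- 0.25                         # 25% MCAR missingness in the relevant covariates
for (j in 1:5) {
  drop <- rbinom(n, 1, prop_miss) == 1    # Bernoulli indicator: 1 means the entry is dropped
  X[drop, j] <- NA
}
```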
6.1. Convergence Diagnostic of BRF with MICE
Before conducting the comparative analysis, we assessed the convergence behaviour of the Bayesian Random Forest (BRF) with Multiple Imputation by Chained Equations (MICE) under the Missing Completely at Random (MCAR) assumption for responses that are Gaussian distributed.
Figure 1 and Table 1 display the trace plots and convergence diagnostics for the BRF with MICE posterior samples, represented as average predictions, under the MCAR mechanism for datasets with Gaussian-distributed response variables. The BRF posterior samples are based on the multiple imputations combined across the ensemble of trees, resulting in a total of 5000 samples.
At 25% missingness, the trace plots for BRF with MICE demonstrate rapid convergence and stable mixing. The MCMC chains stabilize within 200–300 iterations, forming a dense band of parameter estimates with minimal outliers. This aligns with near-ideal R-hat values (1.000) and a high effective sample size (ESS = 4879), which confirms that the chains converge to a shared stationary distribution and efficiently explore the posterior. The tight overlap of chains after burn-in, paired with R-hat’s proximity to 1.0, underscores robust convergence, while the large ESS reflects minimal autocorrelation and reliable uncertainty quantification.
At 50% missingness, convergence is marginally delayed (around 300 iterations), with trace plots showing increased initial variability and sporadic outliers. Despite these challenges, the chains stabilize into a coherent band, supported by R-hat values (1.001) that remain near 1.0 and an even higher ESS (5130). The slight increase in R-hat compared to 25% missingness signals minor variability between chains, likely due to the greater imputation uncertainty, but the stability of the ESS indicates sustained sampling efficiency. This balance highlights the adaptability of BRF with MICE to moderate missingness, where iterative imputation preserves mixing quality despite the increased complexity.
At 75% missingness, the trace plots reveal pronounced initial instability, with extreme parameter fluctuations and delayed convergence (~300–400 iterations). However, the chains eventually stabilize, as evidenced by acceptable R-hat values (1.001) below the 1.01 threshold and a robust ESS (5118). Although the elevated R-hat reflects increased variability between chains, consistent with the sparse observed data, the ESS remains high, demonstrating that the method retains sampling efficiency even under extreme missingness. The eventual coherence of the trace plots, despite the wider spread, confirms the capacity of BRF with MICE to manage the propagation of uncertainty in high-missingness regimes, where traditional methods often degrade.
Across all missingness levels, BRF + MICE maintains convergence (R-hat ≤ 1.0013) and strong mixing (ESS > 4800), validating its robustness. The gradual rise in R-hat with missingness highlights growing imputation uncertainty, while stable ESS values confirm consistent sampling efficacy. These metrics, combined with trace plot trends, affirm the method’s reliability in high-dimensional missing data settings.
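For readers wishing to reproduce such diagnostics, R-hat, effective sample size, and trace plots can be obtained from the coda package given a matrix of posterior draws; the object draws below (one column per chain) is a hypothetical placeholder for the BRF + MICE posterior samples.

```r
# Convergence diagnostics for posterior samples (illustrative; 'draws' is hypothetical).
library(coda)

chains <- mcmc.list(lapply(seq_len(ncol(draws)), function(k) mcmc(draws[, k])))
gelman.diag(chains)      # R-hat (potential scale reduction factor)
effectiveSize(chains)    # effective sample size (ESS)
traceplot(chains)        # visual check of mixing and stationarity
```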
6.2. Simulation Results for Gaussian Response
Table 2 presents the average test root mean square error (ARMSE), calculated via 10-fold cross-validation for Gaussian response data under the three missing data mechanisms, with missingness proportions of 25%, 50%, and 75%. The first three rows give the results when there are no missing observations, and they serve as the benchmark for comparing the performance of the methods. A method is termed robust if there is no significant increase in RMSE when missing observations are omitted or imputed. The RMSE results with no missing cases are constant, as expected for BRF and RF. The RMSE results for BART2 exhibit small changes across simulation runs and across the three missingness mechanisms, which arise from the MCMC simulation involved in the estimation technique of BART2. The second compartment of the table shows the results when the missing data are imputed using the missing data strategy of each method: proximity imputation for RF, MICE for BRF, and BARTm for BART2. For MCAR with imputed missing observations, BRF maintains the same RMSE as in the case with no missing values for missingness proportions of 0.25 and 0.5; a slight increase is observed when the proportion of missingness approaches 0.75. A similar pattern is observed for RF, except with larger RMSE than BRF. The unstable behaviour of BART2 is also observed when the data are imputed using the BARTm strategy. On average, BRF maintains the lowest RMSE for MCAR at the various levels of missingness. Similar behaviour is found for MAR and MNAR at different levels of missingness. The detrimental effect of deleting the missing entries before estimation can be observed in the third compartment of the table: the RMSE of all three methods deviates significantly from the results obtained with no missing entries. However, on average, the effect on BRF is minimal compared to RF and BART2. Therefore, for high-dimensional data with missing entries of up to 75% arising from different missingness mechanisms, BRF is the best of the three methods considered here.
Figure 2, Figure 3 and Figure 4 show the visual behaviour over the folds. The median RMSE in Figure 2, Figure 3 and Figure 4 confirms that BRF with the MICE imputation technique is the best of the three methods for analysing high-dimensional data with missing values.
6.3. Simulation Results for Binary Classification
For classification analysis, we replicated the high-dimensional simulation framework proposed by [
22], maintaining consistent dimensionality parameters with the regression case. Missingness was injected under three mechanisms (MCAR, MAR, MNAR) using the methodology of [
7]. Three methods were compared: random forest (RF) [
7], BART2 (BartMachine) [
14], and the proposed Bayesian Random Forest (BRF). The performance was evaluated using the Misclassification Error Rate (MER) and Average Misclassification Error Rate (AMER).
The $K \times K$ confusion matrix $A = [a_{ij}]$ [37] is structured such that entry $a_{ij}$ counts the test samples whose true class is $i$ and whose predicted class is $j$, where
Rows: true classes;
Columns: predicted classes;
Diagonal elements ($a_{ii}$): correctly classified samples;
Off-diagonal elements ($a_{ij}$, $i \neq j$): misclassified samples.
Classification accuracy is calculated as follows:
$$\mathrm{Accuracy} = \frac{\sum_{i=1}^{K} a_{ii}}{N},$$
where $N$ is the total number of test samples. The MER and AMER are then derived as follows:
$$\mathrm{MER} = 1 - \mathrm{Accuracy}, \qquad \mathrm{AMER} = \frac{1}{10}\sum_{k=1}^{10} \mathrm{MER}_k.$$
The AMER aggregates performance across all 10 cross-validation folds, providing a robust measure of model generalizability.
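These quantities are straightforward to compute from predicted and true labels. The R sketch below builds the confusion matrix with table() and averages the per-fold error rates; fold_truth and fold_pred are hypothetical lists of labels from the 10 cross-validation folds.

```r
# Misclassification error rate from a confusion matrix (illustrative helpers).
mer <- function(truth, pred) {
  A <- table(truth, pred)          # rows: true classes, columns: predicted classes
  1 - sum(diag(A)) / sum(A)        # MER = 1 - accuracy
}

# AMER: average MER over the 10 cross-validation folds.
amer <- mean(mapply(mer, fold_truth, fold_pred))
```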
Table 3 presents the Average Misclassification Error Rate (AMER) computed via 10-fold cross-validation to evaluate classifier performance under the three missing data mechanisms (MCAR, MAR, MNAR), with missingness proportions of 25%, 50%, and 75%, for datasets with a binary categorical response variable. The first three rows give the results when there are no missing observations, and they serve as the benchmark for comparing the performance of the methods. A method is termed robust if no significant increase in MER occurs when missing observations are omitted or imputed. When there are no missing cases, the MER results are constant, as expected for BRF and RF. The MER results for BART2 exhibit small changes across simulation runs and across the three missingness mechanisms, which arise from the MCMC simulation involved in the estimation technique of BART2.
The second compartment of
Table 3 shows the results when the missing data have been imputed using the missing data strategies of the respective methods. For MCAR with imputed missing observations, the MER of BRF increases with the proportion of missing observations and differs from the case with no missing values; for RF, the performance with imputation remains the same as in the case with no missing values. The unstable behaviour of BART2 is again observed when the data are imputed using the BARTm strategy. On average, RF maintains the lowest MER for MCAR at two of the three proportions of missingness considered. Similar behaviour is found for MAR and MNAR at different levels of missingness. The detrimental effect of deleting the missing entries before estimation can be observed in the third compartment of the table: the MER of all three methods deviates significantly from the results obtained with no missing entries, and the effect is minimal on RF compared to BRF and BART2. Therefore, RF is the best of the three methods considered here for high-dimensional data with missing entries of up to 75% arising from different missingness mechanisms.
Figure 5,
Figure 6 and
Figure 7 show the visual behaviour over the folds. The median MER in
Figure 5,
Figure 6 and
Figure 7 confirms that BRF with the MICE imputation technique is better than BART2 for analysing missing data, while RF is the best of the three methods.
Alternatively, the F1 score is widely used as a measure of accuracy in classification problems, particularly when the data are unbalanced. In such cases, traditional accuracy may be misleading because it can be dominated by the majority class. The F1 score, defined as the harmonic mean of precision and recall, offers a more balanced evaluation by accounting for both false positives and false negatives. The F1 score is given by the following:
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
where
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}.$$
Here, TP represents true positives, FP false positives, and FN false negatives. By balancing precision and recall, the F1 score provides a single metric that better reflects the performance of a classifier in scenarios where one class significantly outnumbers the other.
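A compact R helper for this metric, computed directly from 0/1 label vectors, is sketched below; truth and pred are hypothetical vectors of true and predicted binary labels.

```r
# F1 score for a binary classifier (illustrative helper).
f1_score <- function(truth, pred) {
  tp <- sum(pred == 1 & truth == 1)   # true positives
  fp <- sum(pred == 1 & truth == 0)   # false positives
  fn <- sum(pred == 0 & truth == 1)   # false negatives
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}
```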
Table 4 shows the F1 scores under the varying missing data mechanisms (MCAR, MAR, MNAR) and missingness proportions (25%, 50%, 75%). The No Missing Cases scenario achieves near-perfect F1 scores (~0.95) across all settings, while the Impute Missing Cases scenario retains strong performance (mostly >0.90). The Delete Missing Cases scenario performs worst, especially at 75% missingness, where the F1 score drops to approximately 0.68–0.78. RF consistently outperforms BRF and BART2 in the deletion scenarios, and the MAR/MNAR mechanisms show sharper declines than MCAR, highlighting the challenge of non-random missingness.
6.4. Simulation Results for Multiclass Classification
Table 5 presents the Average Misclassification Error Rate (AMER) computed via 10-fold cross-validation to evaluate classifier performance under the three missing data mechanisms (MCAR, MAR, MNAR), with missingness proportions of 25%, 50%, and 75%, for datasets with a multiclass categorical response variable. The methods compared here are RF and BRF, as BART2 has not been implemented for more than two classes. Again, the first three rows give the results when there are no missing observations, and they serve as the benchmark for comparing the methods’ performance. The second compartment of the table shows the results when the missing data have been imputed. For MCAR with imputed missing observations, the MER of BRF increases with the proportion of missing observations and differs from the case with no missing values. The MER of RF also increases with the proportion of missingness, but its performance is better than that of BRF at proportions of 0.5 and 0.75. Similar behaviour is found for MAR and MNAR at different levels of missingness. The detrimental effect of deleting the missing entries before estimation can be observed in the third compartment of the table: the MER of both methods deviates significantly from the results obtained with no missing entries. Overall, on average, the effect of missing values is minimal on BRF compared to RF.
Figure 8,
Figure 9 and
Figure 10 show the visual behaviour over the folds.
Table 6 reports F1 scores under varying missing data mechanisms (MCAR, MAR, MNAR) and missingness proportions (25%, 50%, 75%). No Missing Cases yields the highest F1 scores (0.90–0.93), with BRF slightly outperforming RF. Impute Missing Cases shows moderate declines (e.g., BRF drops to 0.867 at 50–75% MAR/MNAR), while Delete Missing Cases suffers severe degradation at high missingness (e.g., F1 ≈ 0.536 at 75% MAR/MNAR). RF marginally outperforms BRF in MCAR deletion scenarios, but both struggle with MNAR at 75% missingness.
6.5. Computational Time Comparison
Table 7 highlights the computational efficiency of BRF, RF, and BART2 under varying missingness. RF consistently outperforms the others in speed, maintaining near-constant times (approximately 0.41–0.44 s) regardless of missing data, reflecting its lightweight nonparametric design. BRF, while slower (1.17–2.40 s), scales predictably with missingness: times rise from the 25% to the 75% setting, likely due to increased Bayesian uncertainty propagation, but remain far lower than those of BART2. BART2 is computationally intensive (5.92–6.51 s), with runtime gradually increasing with missingness, underscoring the overhead of its additive tree structure. BRF’s moderate computational cost balances Bayesian rigour with practicality, whereas BART2’s inefficiency may limit scalability. RF’s speed advantage comes at the cost of probabilistic uncertainty modelling, emphasizing a trade-off between computational efficiency and statistical robustness.
7. Discussion of Results
This study advances methodologies for handling missing data in high-dimensional settings by demonstrating the superiority of Bayesian Random Forest (BRF) coupled with Multiple Imputation by Chained Equations (MICE) in regression tasks. Unlike traditional approaches such as random forest (RF) or Bayesian Additive Regression Trees (BART2), BRF maintains robust predictive accuracy evidenced by stable RMSE even at extreme missingness levels (e.g., 75%), particularly under Missing Not at Random (MNAR) mechanisms. These findings align with recent work by Sportisse et al. [
38] and Albu et al. [
39], who emphasized the necessity of model-based imputation to preserve feature interactions and the advantages of Bayesian ensembles in high-dimensional nonlinear models. BRF’s integration of probabilistic uncertainty quantification during imputation, as advocated by Rubin [
40] and Little [
41], mitigates bias caused by dependency on unobserved variables, outperforming frequentist counterparts like RF and BART2, which exhibit higher RMSE variability due to single imputation chains or proximity-based methods [
42,
43]. The stark performance degradation of deletion strategies across all methods reinforces Enders’ [
1] warnings about listwise deletion in high-dimensional settings, where feature interactions amplify bias, a trend corroborated by Hapfelmeier et al. [
13] and Gomez et al. [
44].
The theoretical implications of this work extend the applicability of Bayesian nonparametric models to missing data challenges. By embedding MICE within a tree ensemble framework, BRF addresses the ‘double uncertainty’ of missingness and model estimation, a limitation noted in frequentist forests [
14]. This aligns with Gelman et al. [
27] and van Buuren [
5], who argue that Bayesian methods naturally propagate imputation uncertainty through posterior predictive checks, reducing overconfidence in high-dimensional predictions. BRF’s stability under extreme missingness further supports statistical learning principles where regularization via priors mitigates variance inflation in sparse settings [
45]. These results contrast with BART2’s instability, which arises from its reliance on single imputation chains, failing to account for between-imputation variability, a critical factor highlighted in Murray [
28] and Little [
41]. The success of BRF underscores the growing consensus that Bayesian models, with their explicit uncertainty quantification, outperform deterministic approaches in MNAR and MAR scenarios [
46,
47].
However, RF’s proximity-based imputation proved more reliable for classification tasks, particularly for binary outcomes. This divergence echoes Tang et al. [
7] and Loh [
48], who found that entropy-driven splitting in RF better preserves categorical decision boundaries during imputation. BRF’s Dirichlet–Multinomial prior, while effective for multiclass problems, introduced slight over-regularization in binary settings, a trade-off consistent with Murray’s [
28] observations on Bayesian classifiers. These findings emphasize that method selection must align with data type: BRF’s model-based rigour suits continuous responses requiring precise uncertainty integration, while RF’s nonparametric flexibility excels in categorical contexts where computational efficiency and interpretability are prioritized [
49,
50]. Collectively, the study reinforces the need for tailored missing data strategies, advocating BRF-MICE for regression and RF for classification while cautioning against deletion-based approaches in high-dimensional settings.
8. Conclusions
This study presents a new approach for handling missing data in high-dimensional contexts by rigorously evaluating the performance of Bayesian Random Forest (BRF) against established methods like random forest (RF) and Bayesian Additive Regression Trees (BART2) via simulation. The results demonstrate BRF’s consistent superiority in both regression and classification tasks, evidenced by lower root mean squared error (RMSE) and Misclassification Error Rates (MERs) across diverse missing data scenarios. Notably, BRF’s advantage is most pronounced when missing values are imputed rather than excluded, particularly under Missing Not At Random (MNAR) mechanisms where traditional deletion methods or single imputation approaches falter. These findings highlight the critical role of multiple imputations in preserving data integrity, as BRF’s ability to iteratively model uncertainty during imputation reduces bias and enhances predictive accuracy. The empirical validation of BRF’s performance in MNAR settings, a common yet challenging scenario in real-world data, provides practitioners with a compelling rationale for adopting probabilistic imputation over simpler, less robust alternatives.
From a theoretical perspective, this work bridges critical gaps in Bayesian nonparametric modelling by demonstrating how ensemble methods within a Bayesian framework can effectively address the complexities of missing data. Traditional Bayesian imputation methods often rely on parametric assumptions that may not hold in high-dimensional or heterogeneous datasets, limiting their flexibility in capturing complex dependencies. Similarly, frequentist approaches typically rely on point estimates, which fail to account for the full range of uncertainty associated with missing values. In contrast, BRF integrates probabilistic tree ensembles with Markov Chain Monte Carlo (MCMC) sampling, directly incorporating uncertainty quantification into the imputation process. This approach ensures that variability across missing data patterns is captured in posterior distributions, leading to more reliable parameter estimates and robust error propagation. Furthermore, this study extends Bayesian nonparametric modelling by demonstrating that tree-based ensemble methods can serve as flexible priors that adapt to complex data structures without requiring explicit distributional assumptions. This advancement aligns with the principles of statistical learning theory, illustrating that models incorporating uncertainty estimates such as the probabilistic predictions of BRF outperform deterministic algorithms in high-dimensional settings prone to overfitting and noise amplification. By unifying Bayesian principles with ensemble learning, this research addresses the challenge of managing the inherent unpredictability of incomplete datasets and provides a theoretical foundation for future innovations in adaptive imputation strategies. These contributions highlight the broader impact of Bayesian ensemble learning in statistical modelling, particularly in fields where missing data are pervasive, such as genomics, healthcare, and finance.
The practical implications of this research are underscored by the distinct advantages of BRF and RF in different contexts. For high-dimensional regression tasks, BRF coupled with Multiple Imputation by Chained Equations (MICE) emerges as a robust solution, maintaining stable parameter estimates even at extreme missingness levels (e.g., >30%). Its iterative imputation process, which refines posterior distributions through cycles of prediction and updating, ensures resilience against biased missingness mechanisms. In contrast, RF’s simplicity and proximity-based imputation, where missing values are inferred from similar observations in tree structures, prove more reliable for classification tasks, particularly with binary outcomes. This divergence underscores the importance of tailoring missing data strategies to problem-specific requirements: BRF’s model-based Bayesian approach excels in continuous response scenarios demanding precise uncertainty quantification, while RF’s nonparametric flexibility is better suited for categorical outcomes where interpretability and computational efficiency are prioritized. These insights equip researchers with a fundamental framework for selecting imputation methods based on data type, missingness mechanism, and analytical goals.
While the simulation-based validation rigorously demonstrates BRF’s advantages under controlled missingness mechanisms (MCAR, MAR, MNAR), these results are inherently constrained by the assumptions underlying synthetic data. A key limitation is the potential mismatch between simulated missingness structures and real-world complexity, where unmodelled factors such as nonlinear feature dependencies, unobserved confounders, or time-varying missingness patterns may systematically bias imputation or prediction. For instance, clinical and genomic datasets often exhibit latent subgroup heterogeneity (e.g., disease subtypes influencing missingness), which simulations may oversimplify or omit, potentially leading to an overestimation of BRF’s robustness. To mitigate this threat to external validity, future work should benchmark BRF-MICE on real-world high-dimensional datasets, such as electronic health records and omics cohorts, where partially known missingness mechanisms allow for sensitivity analyses to quantify robustness to unverifiable MNAR assumptions. Additionally, incorporating adversarial missingness scenarios reflecting domain-specific biases such as sensor dropout in wearables or assay failures in genomics would stress-test the framework under more realistic conditions. Collaboration with domain experts is also critical for curating datasets where ground-truth values can be partially recovered (e.g., via replicate measurements), enabling direct imputation error quantification. While simulations provide essential proof-of-concept validation, their idealized missingness mechanisms and often linear covariate relationships may obscure practical challenges, such as feedback loops between imputation and model training in real-world data. Therefore, parallel validation in applied contexts, including EHR phenotyping and drug response prediction, remains essential to confirm BRF’s practical utility. Also, while BRF exhibits a strong performance, the computational feasibility of scaling the method to extremely large datasets remains an open question. Future studies should explore algorithmic optimizations to enhance scalability while preserving robustness. Finally, the current study focused on three missing data mechanisms (MCAR, MAR, MNAR) and did not consider more complex forms of missingness, such as informative missingness or mixed mechanisms. Future research should investigate how BRF performs under such conditions and explore hybrid imputation strategies that integrate domain knowledge for improved inference.