1. Introduction
Factorial invariance refers to the extent to which the relationships between the measured variables and the underlying latent constructs are equivalent across groups or time points (McDonald, 1989; Meredith, 1993; Millsap & Kwok, 2004). Factorial invariance is an important prerequisite for making meaningful comparisons of statistical properties across groups in social science research that employs factor analysis models. A prevalent approach to assessing factorial invariance is multigroup confirmatory factor analysis (CFA; Jöreskog, 1971), which involves fitting a series of increasingly restrictive invariance models, in either a forward or backward sequence. Once factorial invariance is established, researchers can confidently compare latent factor means and variances across different groups or time points.
Various factors, such as sample size, data type and distribution, and model complexity, can affect the detection of noninvariance (Cao & Liang, 2022; Sass et al., 2014; Yoon & Lai, 2018). One issue that has received little attention is the distribution of the latent factor scores, including the latent factor mean and variance (Borsboom et al., 2008). Past studies have generally relied on the assumption of equal latent factor distributions across groups. While a few studies have explored variation in latent distributions between groups, the differences in latent means and variances examined have typically been small (Liang & Luo, 2020). In practice, however, unequal latent group distributions are commonly encountered, for example, when comparing cognitive constructs between gifted and general students, or between persons with disabilities and the general population. Assessing factorial invariance becomes more challenging when varying degrees of heterogeneity are present in the distribution of latent factors.
Factorial invariance testing can be conducted through multigroup CFA in a frequentist (e.g., Wu et al., 2007) or Bayesian framework (e.g., Shi et al., 2017). In the frequentist framework, maximum likelihood (ML) estimation is commonly used to derive model parameters by identifying a set of parameter values that maximize the likelihood of obtaining the observed data under the analysis model. In contrast, the Bayesian framework estimates parameters by updating the prior distributions with the data likelihood to generate posterior distributions. This updating process is typically implemented using Markov chain Monte Carlo (MCMC) algorithms, which iteratively sample from the posterior distribution.
In both frameworks, various fit measures are available to evaluate the comparative fit between two invariance models, though these measures are formulated differently in the frequentist and Bayesian contexts. In the frequentist approach, the computation of fit measures is typically based on point estimates (e.g., likelihood-based statistics). Goodness-of-fit indices, such as the chi-square test, root mean square error of approximation (RMSEA), comparative fit index (CFI), Tucker–Lewis index (TLI), Gamma hat (GH; Steiger, 1989), and McDonald fit index (MFI; McDonald, 1989), provide a quantitative measure of how well the model fits the data. The difference in their values between invariance models can be used to evaluate factorial invariance. Information criteria (ICs), such as the Akaike information criterion (AIC), Bayesian information criterion (BIC), and their variants, are often used in model selection. These criteria combine a measure of model deviance and a penalty for model complexity, thus balancing model fit and parsimony. This trade-off is designed to prevent overfitting and enhance the overall generalizability of the selected models.
In the Bayesian framework, recent advancements have introduced Bayesian analogs of traditional frequentist fit indices, including Bayesian versions of the CFI, TLI, RMSEA, GH, and MFI (Garnier-Villarreal & Jorgensen, 2020; Hoofs et al., 2018). These Bayesian fit indices leverage the entire posterior distribution to evaluate model discrepancy and complexity, employing computational formulas analogous to their frequentist counterparts. Specifically, these measures are calculated at each iteration of the MCMC process to generate a posterior distribution of the fit indices, which can then be summarized using central tendency and variability metrics. In addition, ICs are also available within Bayesian CFA, including the BIC and the deviance information criterion (DIC; Spiegelhalter et al., 2002). While the DIC uses the full posterior distribution for the penalty term, it employs only posterior point estimates for the deviance term. The widely applicable information criterion (WAIC; Watanabe, 2010) and leave-one-out cross-validation (LOO; Geisser & Eddy, 1979; Vehtari et al., 2017) serve as more robust, fully Bayesian selection methods, in which the deviance is represented by the log pointwise predictive density (lppd; Gelman et al., 2013) computed at each sample draw from the entire posterior distribution. These Bayesian selection methods sample from the posterior distribution encompassing the full parameter space and provide flexible approaches to select models with effective incorporation of prior information.
Prior studies on factorial invariance have typically followed either a frequentist or Bayesian framework and focused on comparing fit measures within each respective framework (Shi et al., 2017; Liang & Luo, 2020). Only a few studies have compared frequentist and Bayesian methods in assessing factorial invariance (Liang & Luo, 2020; Lu et al., 2017), though they did not include the recently developed Bayesian goodness-of-fit indices (Garnier-Villarreal & Jorgensen, 2020; Hoofs et al., 2018). Since many Bayesian fit measures are adaptations of their frequentist counterparts, we aim to compare corresponding ML and Bayesian versions of fit measures to understand how they perform across different estimation frameworks in factorial invariance testing. In addition, the impact of heterogeneous latent distributions on model fit indices has not received sufficient attention in the literature, despite its prevalence in empirical research.
Therefore, the purpose of our study is twofold: first, to compare Bayesian and ML fit measures in factorial invariance testing; and second, to investigate the impact of latent distribution heterogeneity on the sensitivity of these fit measures under various simulation conditions. This comprehensive comparison aims to provide insights into the effectiveness and reliability of both estimation frameworks in addressing complex data structures and ensuring measurement fairness.
1.1. Factorial Invariance
A general multigroup CFA can be expressed as follows:
$$\mathbf{y}_{jg} = \boldsymbol{\nu}_g + \boldsymbol{\Lambda}_g \boldsymbol{\eta}_{jg} + \boldsymbol{\varepsilon}_{jg},$$
where for subject $j$ in group $g$, $\mathbf{y}_{jg}$ is the $p \times 1$ vector of observed scores ($p$ is the number of items), $\boldsymbol{\eta}_{jg}$ is the $q \times 1$ vector of latent factor scores ($q$ is the number of factors) assuming $\boldsymbol{\eta}_{jg} \sim N(\boldsymbol{\alpha}_g, \boldsymbol{\Psi}_g)$, in which $\boldsymbol{\alpha}_g$ is the latent mean vector and $\boldsymbol{\Psi}_g$ is the covariance matrix of the latent factors in group $g$; $\boldsymbol{\nu}_g$ is a $p \times 1$ vector of item intercepts in group $g$; $\boldsymbol{\Lambda}_g$ is a $p \times q$ matrix of factor loadings in group $g$; and $\boldsymbol{\varepsilon}_{jg}$ is a $p \times 1$ vector of error scores associated with person $j$ in group $g$, following $\boldsymbol{\varepsilon}_{jg} \sim N(\mathbf{0}, \boldsymbol{\Theta}_g)$, in which the error covariance matrix $\boldsymbol{\Theta}_g$ is typically diagonal. The mean structure of the model is defined as follows:
$$\boldsymbol{\mu}_g = \boldsymbol{\nu}_g + \boldsymbol{\Lambda}_g \boldsymbol{\alpha}_g,$$
where $\boldsymbol{\mu}_g$ is the mean vector of the observed variables $\mathbf{y}$ in group $g$. The variance and covariance matrix $\boldsymbol{\Sigma}_g$ of the observed variables $\mathbf{y}$ in group $g$ is delineated as follows:
$$\boldsymbol{\Sigma}_g = \boldsymbol{\Lambda}_g \boldsymbol{\Psi}_g \boldsymbol{\Lambda}_g^{\top} + \boldsymbol{\Theta}_g.$$
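To make these mean and covariance structures concrete, the following minimal sketch (hypothetical one-factor parameter values for a single group, not taken from the study) computes the model-implied moments in Python.

```python
import numpy as np

# Hypothetical one-factor, four-item example for a single group g
lam   = np.array([[0.8], [0.7], [0.6], [0.75]])   # Lambda_g: p x q factor loadings
nu    = np.array([1.0, 0.9, 1.1, 1.0])            # nu_g: item intercepts
theta = np.diag([0.36, 0.51, 0.64, 0.44])         # Theta_g: diagonal error covariance
alpha = np.array([0.0])                            # alpha_g: latent mean vector
psi   = np.array([[1.0]])                          # Psi_g: latent covariance matrix

# Model-implied mean structure: mu_g = nu_g + Lambda_g alpha_g
mu = nu + (lam @ alpha)

# Model-implied covariance structure: Sigma_g = Lambda_g Psi_g Lambda_g' + Theta_g
sigma = lam @ psi @ lam.T + theta

print(mu)
print(sigma)
```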
The common process of multigroup CFA for testing factorial invariance involves using a forward approach to compare the fit of a series of increasingly restrictive invariance models, including configural invariance (equal factor structure), metric invariance (equal factor loadings: $\boldsymbol{\Lambda}_1 = \boldsymbol{\Lambda}_2 = \cdots = \boldsymbol{\Lambda}_G$), scalar invariance (equal item intercepts: $\boldsymbol{\Lambda}_1 = \cdots = \boldsymbol{\Lambda}_G$ and $\boldsymbol{\nu}_1 = \cdots = \boldsymbol{\nu}_G$), and residual invariance (equal item residual variances: $\boldsymbol{\Lambda}_1 = \cdots = \boldsymbol{\Lambda}_G$, $\boldsymbol{\nu}_1 = \cdots = \boldsymbol{\nu}_G$, and $\boldsymbol{\Theta}_1 = \cdots = \boldsymbol{\Theta}_G$) models (Millsap, 2011). The selection of the invariance model depends on comparing the fit of two models with different levels of parameter constraints. Factorial invariance is established if the two invariance models fit the data comparably; otherwise, the less restrictive model is selected. In the present study, we focus on metric and scalar invariance testing because these levels are commonly regarded as adequate for cross-group comparisons of latent means and variances and have received the most attention in methodological research (Cao & Liang, 2022; Liang & Luo, 2020; Little, 1997; Widaman & Reise, 1997).
1.2. Heterogeneous Latent Distribution
The process of measurement invariance testing seeks to separate two possible sources of group differences. The first is true latent distribution differences, in which the mean and variance of a latent variable differ between groups; the second is item differences unrelated to the latent factor, in which a characteristic of the item relates to group membership, also called item bias. An example of the first source would be testing math ability in mathematics majors and psychology majors in college, where we would expect different latent (math) levels. An example of the second source would be an analogy item mentioning Joe Montana when comparing men and women: more women might answer the item incorrectly simply because they do not know who Joe Montana is, which has nothing to do with their latent (verbal) ability. In the first example, we have true group differences; in the second, we have a biased item that can distort our interpretation of the latent differences.
From a more substantive point of reference, the expression of the psychological trait of extraversion could differ between individualistic and collectivistic cultures, leading to differences in latent factor means and variances. Extraversion may be more valued and rewarded in individualistic cultures, resulting in a higher mean and variance in this group compared to collectivistic cultures, where social harmony and modesty may be more valued. This difference in latent distributions has implications for the observed scores collected in practice. Consider a scenario in which the two groups differ in their latent factor variances (ratio $\psi_2 : \psi_1$) and latent factor means (ratio $\alpha_2 : \alpha_1$). Suppose the latent factor distribution for group 1 is $N(\alpha_1, \psi_1)$ and, accordingly, for group 2 is $N(\alpha_2, \psi_2)$. The variance of the $p$th indicator can be expressed as $\lambda_p^2 \psi_1 + \theta_p$ for group 1 and $\lambda_p^2 \psi_2 + \theta_p$ for group 2, and the mean of the $p$th indicator can be expressed as $\nu_p + \lambda_p \alpha_1$ for group 1 and $\nu_p + \lambda_p \alpha_2$ for group 2. An increase in the latent mean and variance can lead to a corresponding increase in the mean and variance of the observed items, assuming the items measure the same construct in both groups. The effect size for the latent mean difference between groups becomes smaller as the latent variance in group 2 increases: $d = (\alpha_2 - \alpha_1)/\sqrt{(\psi_1 + \psi_2)/2}$. Increasing heterogeneity in latent factor distributions thus has a direct impact on the measurement of the observed items.
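To illustrate, the short sketch below (hypothetical loading, intercept, and residual variance; a latent variance ratio of 1:4 and a 0.8 latent mean difference are assumed for the example) computes the implied item means and variances and the standardized latent mean difference.

```python
import numpy as np

# Hypothetical single indicator p with loading, intercept, and residual variance
lam_p, nu_p, theta_p = 0.8, 1.0, 0.36

# Heterogeneous latent distributions: group 1 standardized, group 2 with larger mean/variance
alpha1, psi1 = 0.0, 1.0
alpha2, psi2 = 0.8, 4.0   # variance ratio 1:4, latent mean difference 0.8

# Implied indicator mean and variance in each group
mean1, var1 = nu_p + lam_p * alpha1, lam_p**2 * psi1 + theta_p
mean2, var2 = nu_p + lam_p * alpha2, lam_p**2 * psi2 + theta_p

# Standardized latent mean difference (effect size) shrinks as psi2 grows
d = (alpha2 - alpha1) / np.sqrt((psi1 + psi2) / 2)
print(mean1, var1, mean2, var2, round(d, 2))   # d is about 0.51 here
```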
In the least restrictive configural model, all model parameters, including loadings and intercepts, are freely estimated in every group $g$. For model identification purposes, the latent factor distribution is typically constrained to have mean zero and variance one by standardizing the latent factor scores as $\eta^{*} = (\eta - \alpha_g)/\sqrt{\psi_g}$. Accordingly, for group $g$ with a latent distribution $N(\alpha_g, \psi_g)$, the factor loadings can be re-expressed as $\lambda_g^{*} = \lambda_g \sqrt{\psi_g}$ and the item intercepts as $\nu_g^{*} = \nu_g + \lambda_g \alpha_g$ in the configural model. If the latent variance is high, it can result in large unstandardized loading estimates; similarly, a higher latent mean usually leads to a larger intercept in the configural model. In the metric invariance model, factor loadings are constrained to be equal across groups even when noninvariant items are present. Typically, one group's latent distribution is standardized, while the other group's latent distribution is freely estimated. Misfit resulting from constraining the factor loadings is then transferred into the estimation of the latent factor variances as well as the model fit evaluation. The scalar invariance model builds upon the metric invariance model by additionally constraining the intercepts to be equal across groups, further translating misfit into the mean structure.
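A brief derivation of this reparameterization, a sketch using the single-factor notation above, is as follows:

```latex
% Substituting \eta = \alpha_g + \sqrt{\psi_g}\,\eta^{*}, with \eta^{*} \sim N(0,1),
% into y = \nu_g + \lambda_g \eta + \varepsilon gives
\[
y = \nu_g + \lambda_g\bigl(\alpha_g + \sqrt{\psi_g}\,\eta^{*}\bigr) + \varepsilon
  = \underbrace{(\nu_g + \lambda_g \alpha_g)}_{\nu_g^{*}}
  + \underbrace{\lambda_g \sqrt{\psi_g}}_{\lambda_g^{*}}\,\eta^{*} + \varepsilon ,
\]
% so a larger \psi_g inflates the unstandardized loading and a larger \alpha_g inflates the intercept.
```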
By constraining the item parameters (factor loadings and intercepts), the model asserts that all group differences are due to "true" latent factor differences. If an item is in fact biased, the measurement invariance constraints incorrectly assume that this bias does not exist, and the estimated latent factor differences become biased as a result. Measurement invariance testing, properly applied, should separate true factor differences from item differences unrelated to the latent factor (bias).
The presence of heterogeneity in the latent distribution adds complexity to the test of measurement invariance and affects fit measures used to detect measurement noninvariance to varying degrees. Testing measurement invariance involves comparing fit indices obtained from two invariance models. The impact of heterogeneity on fit indices can vary, and it is crucial to explore how different measures of fit are influenced by the complexity introduced by heterogeneous latent distributions. The sensitivity of fit indices to the heterogeneity in latent distributions as well as the type and magnitude of noninvariance is not yet fully understood and warrants comprehensive investigation.
1.3. Frequentist Fit Measures
1.3.1. Likelihood Ratio Test
The likelihood ratio test (LRT) is a chi-square-based statistical test used in factorial invariance testing to compare the fit of two nested models. Define the deviance as minus two times the log-likelihood:
$$D = -2\log p(\mathbf{y} \mid \hat{\boldsymbol{\theta}}),$$
where $p(\mathbf{y} \mid \hat{\boldsymbol{\theta}})$ is the data likelihood given the ML estimates $\hat{\boldsymbol{\theta}}$. When comparing two invariance models, the resulting difference between the two deviance values ($\Delta D$) conforms to a chi-square distribution, where the degrees of freedom ($df$) are equal to the difference in the number of parameters between the two models. A statistically significant LRT indicates that the less constrained model fits better (e.g., configural over metric) and that the model is less invariant across the groups. As sample size increases, even minor parameter differences may be regarded as significant by the LRT (Brannick, 1995; Cheung & Rensvold, 2002), and hence, multiple fit measures have been developed to better control the rate of false positives.
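A minimal sketch of the LRT computation (hypothetical deviances and parameter counts standing in for fitted configural and metric models) is shown below.

```python
from scipy.stats import chi2

# Hypothetical deviances (-2 log-likelihood) and free-parameter counts
dev_configural, k_configural = 10250.4, 38   # less restrictive model
dev_metric,     k_metric     = 10262.9, 33   # more restrictive model

# LRT statistic and degrees of freedom
lrt_stat = dev_metric - dev_configural
df = k_configural - k_metric
p_value = chi2.sf(lrt_stat, df)

# A significant result favors the less constrained (configural) model
print(round(lrt_stat, 1), df, round(p_value, 4))
```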
1.3.2. Goodness-of-Fit Index
Goodness-of-fit indices evaluate how well a model fits the observed data, and most were developed based on the $T$ statistic, expressed as follows:
$$T = (n-1)\,F_{ML}(\hat{\boldsymbol{\theta}}),$$
where $n$ is the sample size, $\hat{\boldsymbol{\theta}}$ is the parameter vector, and the ML fit function, $F_{ML}$, considers the model-implied covariance matrix $\boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}})$, the sample covariance matrix $S$, and the number of observed variables $p$. Goodness-of-fit indices fall into two categories: incremental fit indices, which measure improvement over a baseline model assuming no covariances, and absolute fit indices, which assess how well the model fits the data without referencing a baseline. Given that $T_T$ and $T_B$ are the fit statistics for the target and baseline models, and $df_T$ and $df_B$ are their respective degrees of freedom, popular incremental fit indices include the comparative fit index (CFI; Bentler, 1990):
$$\mathrm{CFI} = 1 - \frac{\max(T_T - df_T,\, 0)}{\max(T_T - df_T,\, T_B - df_B,\, 0)},$$
and the Tucker–Lewis index (TLI; Tucker & Lewis, 1973):
$$\mathrm{TLI} = \frac{T_B/df_B - T_T/df_T}{T_B/df_B - 1}.$$
Common absolute fit indices include the root mean square error of approximation (RMSEA; Steiger & Lind, 1980):
$$\mathrm{RMSEA} = \sqrt{\max\!\left(\frac{T_T - df_T}{df_T\,(n-1)},\, 0\right)}.$$
Although CFI, TLI, and RMSEA are commonly reported, we also included MFI and GH in our investigation as they have been recommended for the test of measurement invariance (Cheung & Rensvold, 2002; Meade et al., 2008).
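For illustration, the sketch below computes these indices from hypothetical target- and baseline-model statistics; the GH and MFI expressions follow the forms commonly attributed to Steiger (1989) and McDonald (1989) and are included here as assumptions, since the text does not reproduce them.

```python
import numpy as np

def fit_indices(T_t, df_t, T_b, df_b, n, p):
    """Common goodness-of-fit indices from target/baseline test statistics."""
    cfi = 1 - max(T_t - df_t, 0) / max(T_t - df_t, T_b - df_b, 0)
    tli = (T_b / df_b - T_t / df_t) / (T_b / df_b - 1)
    rmsea = np.sqrt(max((T_t - df_t) / (df_t * (n - 1)), 0))
    gh = p / (p + 2 * (T_t - df_t) / (n - 1))     # gamma hat (assumed form)
    mfi = np.exp(-0.5 * (T_t - df_t) / (n - 1))   # McDonald fit index (assumed form)
    return cfi, tli, rmsea, gh, mfi

# Hypothetical values: target model vs. baseline (no-covariance) model
print(fit_indices(T_t=61.3, df_t=48, T_b=1840.7, df_b=66, n=500, p=12))
```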
1.3.3. Information Criterion
Information criteria are used for model selection and are defined as a function of the deviance defined above plus a penalty represented by the number of parameters:
$$\mathrm{AIC} = D + 2k$$
and
$$\mathrm{BIC} = D + k\ln(n),$$
where $k$ is the number of model parameters. Lower values of ICs indicate better model fit. These ML-based ICs rely on the ML point estimate, and thus their variability and distribution are difficult to quantify (Lu et al., 2016). Among the ICs above, the BIC imposes the most severe penalty and tends to select a simpler model (Vrieze, 2012). The AIC usually yields power close to that of the LRT for factorial invariance. The SaBIC adjusts the penalty based on the sample size, replacing $\ln(n)$ with $\ln((n+2)/24)$. The AIC and SaBIC have been shown to perform relatively well in invariance model selection (Burnham & Anderson, 2002; Cao & Liang, 2022; Liang & Luo, 2020).
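A minimal sketch of these criteria (hypothetical deviances and parameter counts for two competing invariance models):

```python
import numpy as np

def information_criteria(deviance, k, n):
    """AIC, BIC, and sample-size-adjusted BIC from a deviance (-2 log-likelihood)."""
    aic = deviance + 2 * k
    bic = deviance + k * np.log(n)
    sabic = deviance + k * np.log((n + 2) / 24)   # Sclove's sample-size adjustment
    return aic, bic, sabic

# Hypothetical configural vs. metric models fit to n = 500 cases; lower values are preferred
print(information_criteria(deviance=10250.4, k=38, n=500))
print(information_criteria(deviance=10262.9, k=33, n=500))
```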
1.4. Bayesian Fit Measures
1.4.1. Bayesian Fit Indices
Unlike frequentist estimation that relies on point estimates of the model’s deviance, Bayesian estimation computes the deviance and effective number of parameters at each MCMC iteration. These values are then substituted into the formulas for frequentist fit indices, enabling the computation of fit indices at each iteration. The fit indices from all iterations are aggregated, typically by averaging their posterior distributions, to provide overall measures of model fit. This approach incorporates parameter uncertainty, offering a more comprehensive evaluation of the model fit to the observed data.
Garnier-Villarreal and Jorgensen (2020) proposed an approach to adapt fit indices to Bayesian structural equation modeling (SEM). Specifically, at iteration $i$, the ML fit function $T$ in the frequentist fit indices above is replaced by $D(\mathbf{y} \mid \boldsymbol{\theta}_i) - p_D$, where $D(\mathbf{y} \mid \boldsymbol{\theta}_i)$ is the discrepancy function based on the observed data, and $p_D$ represents the effective number of parameters, calculated as follows:
$$p_D = \overline{D(\mathbf{y} \mid \boldsymbol{\theta})} - D(\mathbf{y} \mid \bar{\boldsymbol{\theta}}).$$
Here, $\overline{D(\mathbf{y} \mid \boldsymbol{\theta})}$ is the expected posterior deviance, and $D(\mathbf{y} \mid \bar{\boldsymbol{\theta}})$ is the deviance at the posterior means of the parameters $\bar{\boldsymbol{\theta}}$. The $df$ in the frequentist fit indices is substituted by $p^{*} - p_D$ to quantify model complexity, where $p^{*}$ denotes the number of unique sample moments. With these replacements, the distribution of realized values for various fit indices, including the RMSEA, CFI, TLI, GH, and MFI, becomes available and can be summarized using central tendency measures such as the mean (expected a posteriori; EAP), mode (modal a posteriori; MAP), or median, as well as percentile measures like the 2.5th and 97.5th percentiles to construct a 95% credible interval. For the detailed formulas of each index, refer to Garnier-Villarreal and Jorgensen (2020). Research has shown that Bayesian fit indices yield results comparable to their frequentist counterparts when non-informative priors are assigned (Edeh et al., 2025; Garnier-Villarreal & Jorgensen, 2020; Hoofs et al., 2018). Since these Bayesian fit indices are developed based on frequentist formulas, the same cutoff criteria are applied for comparing invariance models in this study.
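The sketch below illustrates this per-iteration logic schematically (simulated discrepancy draws stand in for MCMC output; the exact implementation in Garnier-Villarreal and Jorgensen (2020) may differ in detail):

```python
import numpy as np

def bayesian_fit_indices(dev_t, dev_t_at_mean, dev_b, dev_b_at_mean, p_star, n):
    """Posterior distributions of RMSEA and CFI, following the general logic of
    replacing T with D_i - pD and df with p* - pD. Here 'dev' denotes the
    chi-square discrepancy relative to the saturated model at each iteration."""
    pD_t = dev_t.mean() - dev_t_at_mean        # effective number of parameters, target model
    pD_b = dev_b.mean() - dev_b_at_mean        # effective number of parameters, baseline model
    T_t, T_b = dev_t - pD_t, dev_b - pD_b      # realized fit statistics at each iteration
    df_t, df_b = p_star - pD_t, p_star - pD_b  # Bayesian analogs of degrees of freedom

    rmsea = np.sqrt(np.clip((T_t - df_t) / (df_t * (n - 1)), 0, None))
    cfi = 1 - np.maximum(T_t - df_t, 0) / np.maximum.reduce(
        [T_t - df_t, T_b - df_b, np.zeros_like(T_t)])
    return rmsea, cfi   # summarize with posterior means (EAP) and 95% credible intervals

# Simulated discrepancy draws standing in for 2000 MCMC iterations
rng = np.random.default_rng(1)
rmsea, cfi = bayesian_fit_indices(
    dev_t=rng.normal(80, 8, 2000), dev_t_at_mean=62,
    dev_b=rng.normal(1900, 15, 2000), dev_b_at_mean=1882,
    p_star=90, n=500)
print(rmsea.mean(), np.quantile(rmsea, [0.025, 0.975]))
```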
1.4.2. Bayesian Model Selection Methods
In the Bayesian LRT for measurement invariance testing, the deviance, $D(\mathbf{y} \mid \boldsymbol{\theta}_i)$, is computed at each iteration for the two invariance models, forming an empirical distribution of the LRT statistic that can be summarized by its central tendency. The difference in posterior mean deviances, $\Delta \overline{D}$, between the two models serves as a Bayesian analog to the frequentist LRT for assessing measurement invariance.
The DIC is often regarded as the Bayesian counterpart to the AIC. The DIC replaces the ML estimate $\hat{\boldsymbol{\theta}}$ in the deviance term of the AIC with the posterior mean estimate $\bar{\boldsymbol{\theta}}$, and replaces the parameter count $k$ with the effective number of parameters $p_D$, as follows:
$$\mathrm{DIC} = D(\mathbf{y} \mid \bar{\boldsymbol{\theta}}) + 2 p_D.$$
The DIC is often described as partially Bayesian because its deviance term is computed using EAP point estimates, whereas the WAIC and LOO are viewed as fully Bayesian and capture uncertainty more comprehensively.
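A minimal sketch of the DIC computation from posterior deviance draws (illustrative values only):

```python
import numpy as np

def dic(deviance_draws, deviance_at_posterior_mean):
    """Deviance information criterion from posterior deviance samples."""
    p_d = deviance_draws.mean() - deviance_at_posterior_mean   # effective number of parameters
    return deviance_at_posterior_mean + 2 * p_d, p_d

rng = np.random.default_rng(7)
dev_draws = rng.normal(10290, 8, size=4000)   # stand-in for MCMC deviance samples
print(dic(dev_draws, deviance_at_posterior_mean=10270))
```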
The WAIC is asymptotically equivalent to LOO but is computationally more efficient (Watanabe, 2010). Both aim at accurate prediction by utilizing information from the entire posterior distribution of the model parameters while appropriately penalizing model complexity. The WAIC is conceptualized as follows:
$$\mathrm{WAIC} = -2\,\mathrm{lppd} + 2\,p_{\mathrm{WAIC}},$$
where the log pointwise predictive density (lppd) approximates the deviance (Gelman et al., 2013) and $p_{\mathrm{WAIC}}$ indicates the effective number of parameters. The lppd is defined as follows:
$$\mathrm{lppd} = \sum_{j=1}^{n} \log \left( \frac{1}{I} \sum_{i=1}^{I} p(\mathbf{y}_j \mid \boldsymbol{\theta}_i) \right),$$
where $\boldsymbol{\theta}_i$ indicates the parameter estimates at the $i$th iteration. The lppd first computes the expected pointwise predictive density (elpd) over $I$ iterations for each subject $j$, and the log elpd is then summed across all subjects to derive the lppd of the data. The approximation of the lppd improves as the MCMC chain becomes longer. The effective number of parameters $p_{\mathrm{WAIC}}$ is commonly estimated using the variance of the log pointwise predictive density over all subjects as follows:
$$p_{\mathrm{WAIC}} = \sum_{j=1}^{n} \mathrm{Var}_{i=1}^{I}\!\left( \log p(\mathbf{y}_j \mid \boldsymbol{\theta}_i) \right).$$
LOO estimates the expected log predictive density by leaving out one data point at a time:
$$\mathrm{LOO} = -2\,\mathrm{lppd}_{\mathrm{LOO}},$$
where $\mathrm{lppd}_{\mathrm{LOO}}$ is defined as
$$\mathrm{lppd}_{\mathrm{LOO}} = \sum_{j=1}^{n} \log \left( \frac{1}{I} \sum_{i=1}^{I} p\!\left(\mathbf{y}_j \mid \boldsymbol{\theta}_i^{(-j)}\right) \right).$$
Here, $p\!\left(\mathbf{y}_j \mid \boldsymbol{\theta}_i^{(-j)}\right)$ is the marginal likelihood from the $i$th iteration, excluding the $j$th subject.
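The sketch below computes the lppd, the effective number of parameters, and the WAIC from a matrix of pointwise log-likelihoods (iterations by subjects), the typical input these measures require; in practice, LOO is usually obtained via Pareto-smoothed importance sampling (e.g., with the arviz package) rather than by refitting the model for each left-out subject.

```python
import numpy as np
from scipy.special import logsumexp

def waic(log_lik):
    """WAIC on the deviance scale from an (iterations x subjects) log-likelihood matrix."""
    n_iter = log_lik.shape[0]
    # lppd: log of the posterior-mean likelihood, summed over subjects
    lppd = np.sum(logsumexp(log_lik, axis=0) - np.log(n_iter))
    # p_WAIC: sum over subjects of the posterior variance of the log-likelihood
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
    return -2 * (lppd - p_waic), lppd, p_waic

# Stand-in log-likelihood draws for 2000 iterations and 300 subjects
rng = np.random.default_rng(3)
log_lik = rng.normal(-1.3, 0.05, size=(2000, 300))
print(waic(log_lik))
```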
The WAIC and LOO offer theoretical benefits over the DIC by leveraging full posterior distributions instead of point estimates. This extension allows the estimation of a measure of variability in the comparison between models (the standard error of the difference), whereas methods based exclusively on point estimates (such as the BIC and DIC) report only a difference without any reference to its variability. LOO excels with intricate data structures such as network data and hierarchical data (Gelman et al., 2014). Additionally, accurate informative priors can enhance the sensitivity of the DIC, WAIC, and LOO in factorial invariance testing, particularly with small sample sizes (Liang & Luo, 2020).
4. Discussion
Heterogeneity in latent factor distributions is common when assessing factorial invariance within the CFA framework. This study examines how the change in latent means and variances between groups affects the sensitivity of model selection methods to detect noninvariance in factorial invariance testing across various conditions. By comparing frequentist and Bayesian model selection methods, it highlights their respective strengths and limitations in identifying heterogeneous latent structures. The findings enhance understanding of the complexities in assessing factorial invariance and offer guidance on selecting appropriate fit measures under such conditions.
The heterogeneity in latent variances between groups had a more pronounced effect on scalar invariance testing than on metric invariance testing across all fit measures. With a latent variance ratio of 1:1 and the latent factor in the reference group standardized, a 0.8 difference in latent means yielded an effect size of 0.8, considered large according to Cohen's guidelines (Cohen, 1988). Conversely, with a latent variance ratio of 1:4, the effect size for a 0.8 mean difference was calculated as $0.8/\sqrt{(1+4)/2} \approx 0.51$, indicating a small to moderate effect size. Larger latent variances lead to greater variability in observed scores and smaller effect sizes between latent means, potentially diminishing the power to detect differences in group intercepts. In contrast, factor loadings (analogous to slopes in regression analyses) remain comparatively less affected. The latent mean difference, on the other hand, did not influence the selection of invariance models.
Moreover, we believe that the factor variances have a stronger effect on measurement invariance than the factor means because of the structure of the $\chi^2$ statistic, which reflects the distance between the observed covariance matrix and the model-implied covariance matrix (Bollen, 1989; Kline, 2016). The factor variances contribute to reproducing a large number of elements of the implied covariance matrix, whereas the factor means contribute only to reproducing the model-implied means, which enter the $\chi^2$ calculation as a difference from the observed means. For this reason, and given the results of the present simulation, latent variance differences can have a stronger impact on measurement invariance when it is tested with fit indices based on the $\chi^2$ test, as shown in the following implementation of the maximum likelihood discrepancy function:
$$F_{ML} = \log|\boldsymbol{\Sigma}(\boldsymbol{\theta})| + \mathrm{tr}\!\left(S\,\boldsymbol{\Sigma}(\boldsymbol{\theta})^{-1}\right) - \log|S| - p + \left(\bar{\mathbf{y}} - \boldsymbol{\mu}(\boldsymbol{\theta})\right)^{\top} \boldsymbol{\Sigma}(\boldsymbol{\theta})^{-1} \left(\bar{\mathbf{y}} - \boldsymbol{\mu}(\boldsymbol{\theta})\right),$$
which compares the sample covariance matrix $S$ to the model-implied covariance matrix $\boldsymbol{\Sigma}(\boldsymbol{\theta})$ (or simply $\boldsymbol{\Sigma}$), where $p$ is the number of variables in the model, $\bar{\mathbf{y}}$ is the vector of sample means, and $\boldsymbol{\mu}(\boldsymbol{\theta})$ is the vector of model-implied means. The corresponding test statistic is calculated as $T = (n-1)F_{ML}$.
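A small numerical sketch of this discrepancy function (hypothetical two-variable matrices; it assumes the model-implied moments have already been computed):

```python
import numpy as np

def f_ml(S, ybar, sigma, mu):
    """Maximum likelihood discrepancy with mean structure."""
    p = S.shape[0]
    sigma_inv = np.linalg.inv(sigma)
    resid = ybar - mu
    return (np.log(np.linalg.det(sigma)) + np.trace(S @ sigma_inv)
            - np.log(np.linalg.det(S)) - p + resid @ sigma_inv @ resid)

# Hypothetical two-variable example; the test statistic is T = (n - 1) * F_ML
S     = np.array([[1.00, 0.45], [0.45, 1.10]])
ybar  = np.array([0.05, -0.02])
sigma = np.array([[1.02, 0.40], [0.40, 1.05]])
mu    = np.array([0.00, 0.00])
n = 500
print((n - 1) * f_ml(S, ybar, sigma, mu))
```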
In scalar invariance testing, when inspecting the individual model fitting, the fit values for metric models remained stable across various latent variances, whereas scalar models generally showed improved fit with higher latent variances. Consequently, as the latent variance ratio increased, the difference in fit values narrowed, reducing the power to detect intercept noninvariance. With a sufficiently large sample size and considerable noninvariance, all fit measures achieved approximately 0.80 power for a latent variance ratio of 1. Goodness-of-fit indices and the BIC experienced dramatic power reductions when the latent variance ratio increased to 4. In metric invariance testing, latent variances had minimal influence on the fit of both the configural and metric models, implying that the power to detect loading noninvariance was largely unaffected by the latent variance ratio.
As several Bayesian fit measures were developed using frequentist formulas while retaining Bayesian properties through the full posterior space, the overall performance of ML and Bayesian methods in selecting invariance models was comparable. Bayesian selection methods in general yielded slightly lower power than their ML counterparts, except for the TLI, where Bayesian methods demonstrated comparable or slightly higher power. In addition, an interaction between sample size and the degree of noninvariance was observed in the goodness-of-fit indices: smaller sample sizes led to higher power when noninvariance was small, but lower power when noninvariance was relatively large. Although this may seem counterintuitive, it partially aligns with the existing literature (Cao & Liang, 2022).
Among model selection methods, goodness-of-fit indices generally exhibited lower power to detect both metric and scalar noninvariance than the LRT, ICs (except the BIC), and LOO, which provided the best balance between false and true positive rates. The ML-based LRT and AIC performed similarly to the Bayesian-based DIC, WAIC, and LOO, likely due to the relatively simple one-factor structure and continuous normally distributed data, which allowed point estimates to effectively summarize the parameters and yield comparable power between ML and Bayesian methods. Under the current thresholds, most goodness-of-fit indices exhibited low power. However, GH, particularly within the ML framework, demonstrated good power, although it was associated with elevated false positive rates. In practice, optimal thresholds to balance Type I and Type II errors often depend on various factors such as sample size, model complexity, and the underlying distribution of the test statistic. These results suggest that adaptive thresholding approaches that are tailored to the specific context and methodological framework could be considered in future research.
Implications and Future Research
The implications and recommendations from this study can be summarized as follows. First, for researchers seeking fit measures that balance false and true positive rates, the LRT (ML or Bayesian), AIC, DIC, WAIC, and LOO are advisable. The ML-based LRT and AIC are relatively time-efficient, while the Bayesian LRT, DIC, WAIC, and LOO require more computational time because they use all posterior samples. Nonetheless, the Bayesian approach offers greater flexibility through the incorporation of prior knowledge and uncertainty quantification in fit measures, warranting further research. When employing Bayesian fit measures, it is recommended to conduct a sensitivity analysis of priors in the context of factorial invariance testing (Depaoli & van de Schoot, 2017). Second, if the research objective is to detect noninvariance, the GH may serve as a complementary measure to supplement the LRT, AIC, DIC, WAIC, and LOO due to its high power, although it may also yield inflated false positives under the conventional cutoff. Using multiple selection methods to determine model invariance is advised. Third, unequal latent variances between groups have minimal impact on metric invariance testing but can affect scalar invariance testing, especially at high latent variance ratios for specific fit measures.
In practice, researchers should assess latent variances across the configural, metric, and scalar invariance models. When substantial group differences in latent variances are detected (e.g., gifted versus general student populations), it is advisable to avoid relying on goodness-of-fit indices for scalar invariance testing, and for metric invariance testing when the number of noninvariant items is small, unless a sufficiently large sample size is available and a high degree of noninvariance is present. The choice between frequentist and Bayesian methods depends on factors such as sample size, model complexity, and data characteristics. If the assumptions of frequentist methods are generally met, frequentist selection methods are often preferred for their computational efficiency. In situations where researchers have substantive prior knowledge they wish to incorporate, or encounter issues such as nonconvergence, small sample sizes, complex model structures, or nonnormal data, Bayesian estimation may be more robust. Bayesian methods provide full posterior distributions of the fit measures, facilitating a more informative assessment of model fit than the point estimates from frequentist methods.
Further research could aim to characterize the distribution of fit indices from a Bayesian perspective and to investigate the impact of priors on model selection, guiding the development of optimal prior selection strategies that balance flexibility with robustness. Indices based on the chi-square discrepancy function appear generally more sensitive to latent variance differences than to latent mean differences, and a simulation specifically designed to test this conjecture is needed. Additionally, expanding these investigations to encompass more complex models, such as those involving multiple latent factors or hierarchical structures, and diverse data types, including nonnormal and ordered categorical data, would provide valuable insights into the generalizability of the findings. In ML estimation, violation of the normality assumption could affect the point estimates of fit measures to varying extents, while Bayesian approaches provide posterior distributions of fit indices that may be less sensitive to nonnormality, particularly when robust priors are employed. Both frameworks require careful sensitivity analyses to ensure that the invariance conclusions are not driven by distributional anomalies. Further, follow-up work could apply these methods to real datasets in educational or psychological settings where group comparisons are common, to evaluate how well the current simulation results align with empirical findings.