1. Introduction
Clustered data, commonly found in fields like neuroscience and epidemiology, are characterized by a higher degree of similarity among observations within the same group than between different groups. They comprise interconnected samples, each encapsulating a collective of individuals with correlated characteristics. When these samples are combined, they form a complex clustered dataset with multiple clusters, which may vary in the number of constituent samples. Such data accentuate intra-group correlations, a focus distinct from longitudinal data, which prioritize the temporal sequence of observations for individual subjects. The hierarchical structure of clustered data introduces analytical challenges, particularly when the internal group structure is overlooked in statistical analysis. Galbraith and Daniel [1] have demonstrated that neglecting the internal relationships within clustered data can result in substantial biases in parameter estimation, underscoring the need to account for these internal associations. The field of clustered data analysis has seen significant advances in recent years with the emergence of diverse analytical models, such as linear models [2,3,4], random effects models [5], semiparametric models [6], nonparametric models [7], logistic regression models [8], and multilevel models [9,10]. Each of these methodologies has contributed to parameter estimation for clustered data, converging on a pivotal insight: disregarding the potential correlations among observations within groups during modeling can lead to an underestimation of standard errors, thereby compromising the robustness of inferential conclusions.
The factor analysis model is an effective statistical tool for describing the dependence between multidimensional variables. Spearman [11] first proposed factor analysis when studying intelligence tests. It not only reduces the dimension but also reflects the internal correlation among variables, handling intra-cluster correlation effectively. Experiments in cytology performed by Julian [12] demonstrated that ignoring the nested structure leads to serious deviations in the estimates of unknown parameters when the intra-cluster correlation is significant, which shows that the nested structure of clustered data cannot be ignored. Huang et al. [9] proposed a variety of alternative schemes for dealing with the nested structure of clustered data; however, methods applicable to the clustered factor model were not addressed. Averaging over clusters is a simple and effective method for processing clustered data and has been successfully applied by many scholars to handle nested structures. Okech [13] used cluster means to process the nested structure of data on low-income parents' parenting stress and economic pressure in a factor model. Galbraith et al. [1] likewise stated that using cluster means to process clustered data is an effective way to obtain unknown parameters. Recently, with the rise of Bayesian networks, estimating factor analysis models using Bayesian methods has become a hot topic. The Bayesian factor analysis model, first proposed by Press [14], obtains implicit solutions for the unknown parameters of the factor model when the sample covariance follows a Wishart distribution. Wirth et al. [15] demonstrated that the MCMC method can achieve better fitting results when estimating the parameters of factor analysis models for non-clustered data. Zhang et al. [16] proposed a novel Bayesian factor analysis model that achieves dimension reduction and feature extraction for high-dimensional multi-omics data by introducing biological mapping knowledge, improving the model's robustness to noise and sparsity. Hansen et al. [17] proposed a new variational inference algorithm that uses a multiplicative gamma process shrinkage prior to approximate the posterior distribution of a Bayesian factor model, addressing the poor scalability of factor analytic models for high-dimensional data.
Since Cook [18] introduced the case-deletion model for identifying influential points, statistical diagnostics of models has become a research hotspot, with various diagnostic methods being developed. Among these, Bayesian statistical diagnostics is a particularly active branch, widely employed by statisticians both domestically and internationally. De Finetti [19] explored how to deal with data outliers through Bayesian subjective probability theory and analyzed the treatment under independent, exchangeable, and partially exchangeable error distributions. Jackson et al. [20] proposed two approximations capable of computing diagnostic statistics for the Bayesian case-deletion model. Zhu et al. [21] examined three Bayesian case-deletion influence measures in the context of complex models such as longitudinal data. Nevertheless, there is a paucity of research on Bayesian case-deletion diagnostics for factor models, and the statistical diagnosis of factor models for clustered data has not been addressed in the literature. Factor analysis models excel at identifying correlations among observable variables or individuals within clustered datasets, where large volumes of data across various fields share unobservable underlying common factors. However, applying these models for parameter estimation and influence analysis within clustered data remains largely unexplored. The specific contributions of this manuscript are as follows:
We used results averaged across clusters to handle the nested structure of the clustered data, where the associations between clusters are characterized by a common diagonal error matrix;
We performed a comprehensive analysis of factor analysis models for clustered data, employing both Bayesian and principal component analysis methods, and compared the efficacy of these two methodologies;
We utilized Cook’s posterior mean distance for statistical diagnostics of factor analysis models within clustered data and identified groups with anomalies through diagnostic statistics.
The remaining sections are structured as follows: Section 2 introduces the construction of the factor analysis model for clustered data; Section 3 introduces the basic steps of modeling the factor analysis model of clustered data using the Bayesian method; Section 4 focuses on estimating factor analysis models for clustered data using principal component analysis; Section 5 introduces how to diagnose anomalous groups using Cook's posterior mean distance; Section 6 validates the feasibility of the Bayesian method through numerical simulation and compares its estimation performance with that of the principal component method; Section 7 presents an example analysis illustrating that the estimation method in this paper can be effectively applied in practice; Section 8 summarizes the findings and concludes the paper.
2. Model Framework of Factor Analysis for Clustered Data
Suppose the observed variable of the $i$th individual in the $g$th cluster is $\mathbf{y}_{gi} \in \mathbb{R}^p$. For $g = 1, \ldots, G$ and $i = 1, \ldots, n_g$, the factor analysis model is expressed as follows:

$$\mathbf{y}_{gi} = \boldsymbol{\mu}_g + \Lambda_g \mathbf{f}_{gi} + \boldsymbol{\varepsilon}_{gi}, \qquad (1)$$

where $\boldsymbol{\mu}_g$ is the $p \times 1$ dimensional mean vector, $\Lambda_g$ is the $p \times m$ factor loading matrix of the $g$th cluster, $\mathbf{f}_{gi}$ denotes the $m$-dimensional common factor of the $i$th individual in the $g$th group, and $\boldsymbol{\varepsilon}_{gi}$ is the special factor for the $i$th individual in the $g$th cluster, where $\boldsymbol{\varepsilon}_{gi} \sim N_p(\mathbf{0}, \Psi)$ with $\Psi$ diagonal and common to all clusters. At the same time, we suppose that $\mathbf{f}_{gi}$ is uncorrelated with $\boldsymbol{\varepsilon}_{gi}$ for all $g$ and $i$ values. Aggregating all the individuals in the $g$th cluster in matrix form gives

$$Y_g = \boldsymbol{\mu}_g \mathbf{1}_{n_g}^\top + \Lambda_g F_g + E_g, \qquad (2)$$

where $Y_g = (\mathbf{y}_{g1}, \ldots, \mathbf{y}_{g n_g})$, $F_g = (\mathbf{f}_{g1}, \ldots, \mathbf{f}_{g n_g})$, and $E_g = (\boldsymbol{\varepsilon}_{g1}, \ldots, \boldsymbol{\varepsilon}_{g n_g})$. The factor analysis model of all clusters is expressed in block-diagonal matrix form, as shown in Equation (3):

$$Y = M + \operatorname{diag}(\Lambda_1, \ldots, \Lambda_G)\, F + E, \qquad (3)$$

where $Y$, $M$, $F$, and $E$ stack the cluster-level matrices $Y_g$, $\boldsymbol{\mu}_g \mathbf{1}_{n_g}^\top$, $F_g$, and $E_g$, respectively.
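To make the data-generating mechanism concrete, the following sketch simulates observations from Equation (1). It is illustrative only: the dimensions ($p = 6$, $m = 2$) echo the simulation settings of Section 6, but the specific parameter values, variable names, and cluster-size range are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_cluster(n_g, mu_g, Lambda_g, psi, f_mean, f_var):
    """Draw n_g observations y_gi = mu_g + Lambda_g f_gi + eps_gi (Equation (1)).

    psi holds the diagonal of the error covariance Psi, shared by all clusters.
    """
    p, m = Lambda_g.shape
    # Common factors: normal with a cluster-specific mean and variance.
    F = rng.normal(f_mean, np.sqrt(f_var), size=(n_g, m))
    # Special factors: N(0, Psi) with diagonal Psi.
    E = rng.normal(0.0, np.sqrt(psi), size=(n_g, p))
    return mu_g + F @ Lambda_g.T + E, F

p, m, G = 6, 2, 10                          # dimensions and number of clusters
psi = np.full(p, 0.25)                      # common diagonal error variances
clusters = []
for g in range(G):
    n_g = int(rng.integers(20, 51))         # cluster sizes may differ
    Lambda_g = rng.normal(0.8, 0.1, size=(p, m))   # illustrative loadings
    Y_g, F_g = simulate_cluster(n_g, np.zeros(p), Lambda_g, psi,
                                f_mean=rng.uniform(0.0, 1.0),
                                f_var=rng.uniform(0.0, 0.5))
    clusters.append((Y_g, F_g, Lambda_g))
```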
4. Principal Component Method for Factor Analysis Models with Clustered Data
In the initial phase of our investigation, we focus on the factor analysis model for a single cluster, with the matrix representation presented in Equation (2). The eigenvalues of the covariance matrix $\Sigma_g$ are labeled $\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \cdots \ge \hat{\lambda}_p \ge 0$, arranged in descending order of magnitude, where $\hat{\lambda}_1$ is the largest, and the corresponding orthonormal eigenvectors are denoted $\hat{\mathbf{e}}_1, \ldots, \hat{\mathbf{e}}_p$. If the contribution of the first $m$ eigenvalues is sufficiently large and the variance of the specific factors is considered negligible, $\Sigma_g$ can be approximately replaced by $\Lambda_g \Lambda_g^\top$. Using spectral decomposition, we decompose the covariance matrix in the specific form shown below:

$$\Sigma_g = \sum_{j=1}^{p} \hat{\lambda}_j \hat{\mathbf{e}}_j \hat{\mathbf{e}}_j^\top \approx \sum_{j=1}^{m} \hat{\lambda}_j \hat{\mathbf{e}}_j \hat{\mathbf{e}}_j^\top,$$

where $\hat{\lambda}_j$ is the $j$th eigenvalue of the covariance matrix and $\hat{\mathbf{e}}_j$ is the corresponding orthonormal eigenvector. Since the observed variable obeys a normal distribution with a non-zero mean in this paper, the estimator of the factor loading matrix derived from the principal component method is as follows:

$$\hat{\Lambda}_g = \left(\sqrt{\hat{\lambda}_1}\,\hat{\mathbf{e}}_1, \ldots, \sqrt{\hat{\lambda}_m}\,\hat{\mathbf{e}}_m\right).$$

The common factor $\mathbf{f}_{gi}$ can be estimated using the least-squares algorithm, as shown in Equation (10):

$$\hat{\mathbf{f}}_{gi} = \left(\hat{\Lambda}_g^\top \hat{\Lambda}_g\right)^{-1} \hat{\Lambda}_g^\top \left(\mathbf{y}_{gi} - \bar{\mathbf{y}}_g\right). \qquad (10)$$
Since all clusters share a common error matrix, the parameter estimates for the factor model in clustered data can be derived by averaging the estimation results from each cluster.
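A minimal sketch of this per-cluster estimation followed by pooling, using our own toy data and naming (the eigendecomposition, loading estimator, and least-squares scores follow Equation (10) and the formulas above):

```python
import numpy as np

def pc_factor_estimates(Y, m):
    """Principal component estimates for one cluster (rows = individuals)."""
    ybar = Y.mean(axis=0)
    S = np.cov(Y, rowvar=False)                 # sample covariance matrix
    eigval, eigvec = np.linalg.eigh(S)          # ascending eigenvalues
    idx = np.argsort(eigval)[::-1][:m]          # indices of the m largest
    lam, e = eigval[idx], eigvec[:, idx]
    Lambda_hat = e * np.sqrt(lam)               # columns sqrt(lambda_j) * e_j
    # Least-squares common-factor scores, as in Equation (10).
    F_hat = (Y - ybar) @ Lambda_hat @ np.linalg.inv(Lambda_hat.T @ Lambda_hat)
    return Lambda_hat, F_hat

# Toy clustered data (rows = individuals, columns = the p observed variables).
rng = np.random.default_rng(0)
clusters = [rng.normal(size=(int(rng.integers(20, 51)), 6)) for _ in range(10)]

# Since all clusters share a common error matrix, pool by averaging the
# per-cluster loading estimates.
Lambda_pooled = np.mean([pc_factor_estimates(Y_g, m=2)[0] for Y_g in clusters],
                        axis=0)
```

One caveat: eigenvector signs are arbitrary, so in practice the columns of the per-cluster loading estimates should be sign-aligned before averaging.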
6. Simulation Study
6.1. Random Parameter Generation for the Markov Chain Monte Carlo Algorithm
A clustered dataset is simulated using Monte Carlo methods. The construction of a single-cluster factor analysis model is considered first, with the caveat that the number of samples in each cluster is not necessarily identical. To this end, the number of samples in each cluster is drawn from a uniform distribution, and samples are then generated sequentially and iteratively from the corresponding conditional distributions. The factor loading matrix is of order 6 × 2, and the initial parameter settings are as follows:
The initial coefficients of the factor loading matrix are fixed in advance. In order to ensure that the common factors follow different normal distributions in different clusters, the common factor is assumed to follow a normal distribution with a cluster-specific mean and variance; the means are assumed to follow the uniform distribution on (0, 1) and the variances the uniform distribution on (0, 0.5), the same settings restated in Section 6.3. The random error term is also assumed to follow a normal distribution whose variance obeys an inverse gamma prior distribution. In the prior distribution, the hyperparameter is selected as d = 4. This paper assesses the convergence and stability of the Gibbs sampling by inspecting the ergodic mean plot, as shown in Figure 1.
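The ergodic (running) mean used for this check can be computed as below; a chain whose running mean flattens out is taken as evidence of convergence. The chain shown is illustrative, not the paper's actual draws.

```python
import numpy as np
import matplotlib.pyplot as plt

def ergodic_mean(chain):
    """Running mean of an MCMC chain: the mean of the first t draws, for each t."""
    chain = np.asarray(chain, dtype=float)
    return np.cumsum(chain) / np.arange(1, chain.size + 1)

# Illustrative chain of Gibbs draws for one scalar parameter; a flat ergodic
# mean curve indicates the sampler has stabilized.
draws = np.random.default_rng(1).normal(0.8, 0.1, size=5000)
plt.plot(ergodic_mean(draws))
plt.xlabel("iteration")
plt.ylabel("ergodic mean")
plt.show()
```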
6.2. Simulation Results of Gibbs Sampling
Gibbs sampling is leveraged to estimate the unknown parameters through a process executed over 50 iterations. The initial parameter estimates are derived from the mean of 1000 samples drawn from the corresponding prior distributions. For the generation of the simulated clustered dataset, the number of clusters is set to 10, 30, and 50 in turn. A comprehensive presentation of the simulation outcomes can be found in Table 1. In this paper, Bias denotes the deviation between the true value and the estimated mean, RMSE the root mean square error, and SD the standard deviation.
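For definiteness, a standard reading of these three criteria, stated here as our assumption about how Table 1 was computed over $R$ replications with true value $\theta_0$ and estimates $\hat{\theta}^{(1)}, \ldots, \hat{\theta}^{(R)}$:

```latex
\bar{\hat{\theta}} = \frac{1}{R}\sum_{r=1}^{R}\hat{\theta}^{(r)}, \qquad
\mathrm{Bias} = \bar{\hat{\theta}} - \theta_0, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(\hat{\theta}^{(r)} - \theta_0\right)^{2}}, \qquad
\mathrm{SD} = \sqrt{\frac{1}{R-1}\sum_{r=1}^{R}\left(\hat{\theta}^{(r)} - \bar{\hat{\theta}}\right)^{2}}.
```

Under this reading, $\mathrm{RMSE}^2 \approx \mathrm{Bias}^2 + \mathrm{SD}^2$, which is why RMSE and SD being close to each other signals a small bias, as noted below.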
Table 1 shows that, after many iterations, the bias and root mean square error of all parameter estimates are confined within a tight range, and the RMSE and SD are relatively close to each other, indicating that the estimates are close to the true values and revealing the effectiveness of this method for factor models with clustered data. As a classical estimation method for factor analysis models, the principal component method is compared with the Bayesian approach under the same settings as the simulated dataset above; the estimation results of the principal component method are shown in the tables below.
6.3. Random Parameter Generation for the Principal Component Algorithm
Since the principal component method ignores the effect of variance, its parameters are set to be approximately the same as those of the Bayesian method for comparison purposes. The coefficients of the factor loading matrix are set as follows:
The initial coefficients of the factor loading matrix are set as in the Bayesian simulation. We assume that the common factor follows normal distributions with different means and variances across clusters: the means are uniformly distributed between 0 and 1, and the variances are uniformly distributed between 0 and 0.5. The random error follows a normal distribution, and its initial values are set to be the same as those used under the Gibbs sampling algorithm. The intra-group correlation is also set to the same value as in the Bayesian simulation.
6.4. Simulation Results of the Principal Component Algorithm
The parameters of interest, the factor loadings and the common factors, can be estimated using principal component analysis. To gauge the accuracy of these estimates, the bias, standard deviation, and root mean square error between the estimated and true values of each parameter are computed. This process is repeated 1000 times to ensure a robust assessment of how closely the estimated values approximate their true values. The number of clusters is likewise set to 10, 30, and 50, and the estimation results are reported in Table 2.
As can be seen from Table 1 and Table 2, the estimation results of the MCMC method are better than those of the principal component method on every item, indicating that the MCMC method is more effective than the principal component method. The subsequent influence diagnostics are reliable only when the estimation results are accurate.
6.5. Statistical Diagnosis of Clustered Data
The dataset under consideration is subjected to statistical diagnosis using the case-deletion method. Cook's distance is utilized to evaluate the influence of individual clusters by comparing the changes in CM(g), a measure of leverage and influence, following the removal of different clusters. Within the simulation study, the common factor for the 8th, 15th, 21st, and 42nd clusters is incremented by 2 to appraise the validity of the method, and the common factor is designated as the diagnostic parameter of interest. The outcomes of this analysis are presented in Figure 2.
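As a computational sketch: assuming CM(g), as used here, is a quadratic form in the difference between the full-data posterior mean and the posterior mean after deleting cluster g (the quantity introduced in Section 5), the diagnostic can be evaluated from two MCMC runs as below. The weight-matrix choice and all names are our illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cook_posterior_mean_distance(full_draws, deleted_draws, W=None):
    """Quadratic distance between posterior means with and without one cluster.

    full_draws, deleted_draws: MCMC draws of shape (n_draws, n_params) from the
    full-data posterior and the cluster-g-deleted posterior, respectively.
    W: positive-definite weight matrix; we default to the inverse of the
    full-data posterior covariance (our assumption, not the paper's choice).
    """
    theta_full = full_draws.mean(axis=0)
    theta_del = deleted_draws.mean(axis=0)
    if W is None:
        W = np.linalg.inv(np.cov(full_draws, rowvar=False))
    d = theta_full - theta_del
    return float(d @ W @ d)

# Usage: rerun the Gibbs sampler once per deleted cluster, then flag clusters
# whose CM(g) stands out:
# CM = [cook_posterior_mean_distance(draws_full, draws_minus[g]) for g in range(G)]
```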
Figure 2 depicts Cook's distance for datasets consisting of 10, 30, and 50 clusters, generated through Monte Carlo simulation, with clusters deleted sequentially. Analysis of Figure 2 reveals that, within the 10-cluster dataset, the 8th cluster exerts the most significant influence. In the dataset comprising 30 clusters, the 8th, 15th, and 21st clusters are identified as strong influence points, with the 21st cluster demonstrating the most pronounced impact. Similarly, in the dataset with 50 clusters, the 8th, 15th, 21st, and 42nd clusters are recognized as having substantial influence, with the 21st cluster again being the most influential. These findings are congruent with the intentional perturbations introduced to the 8th, 15th, 21st, and 42nd clusters, as previously described. Consequently, the diagnostic methodology employed can be deemed valid.
7. Example Analysis
In this section, we detail the experimental analysis of cigarette sales. Our study is based on data from a survey of cigarette sales conducted across the United States from 1963 to 1992. This dataset is accessible through the Wiley Online Library at https://www.wiley.com/legacy/vkhbwileychi/baltagi/ (accessed on 14 February 2024). The survey, conducted by Baltagi [25], investigated cigarette consumption patterns across 46 states. A comprehensive description of the variables included in the dataset is presented in Table 3. The dataset, which is inherently clustered by state, is modeled as such for analysis. Before conducting factor analysis, the data were standardized. Subsequently, the suitability of the data for factor analysis was assessed using the KMO test and Bartlett's test of sphericity. The outcomes are detailed in Table 4.
As detailed in Table 4, the KMO measure of sampling adequacy is 0.78, a value exceeding the threshold commonly accepted for proceeding with factor analysis. Moreover, Bartlett's test of sphericity yields a p-value of less than 0.05, which is statistically significant and further corroborates the appropriateness of the data for this analytical approach.
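These two checks can be reproduced with the Python factor_analyzer package; the sketch below assumes the eight indicators sit in a pandas DataFrame loaded from a hypothetical file (the file name and variable names are ours).

```python
import pandas as pd
from factor_analyzer.factor_analyzer import calculate_kmo, calculate_bartlett_sphericity

df = pd.read_csv("cigarette.csv")         # hypothetical file of the 8 indicators
df = (df - df.mean()) / df.std()          # standardize before factor analysis

kmo_per_item, kmo_total = calculate_kmo(df)
chi2, p_value = calculate_bartlett_sphericity(df)
print(f"KMO = {kmo_total:.2f}, Bartlett chi2 = {chi2:.1f}, p = {p_value:.4f}")
```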
Figure 3 indicates that the extraction of three common factors is justified, thus leading to the development of a factor analysis model inclusive of these three factors.
Figure 4 illustrates the classification of the eight indicators into three separate categories based on their factor loading coefficients. Specifically, the indicators labeled as Price, Cpi, Pimin, Ndi, and Year are categorized together as representing economic factors. The indicators Pop and Pop16 are united to form the demographic factor. Finally, the Sales indicator is identified as the sole representative of the demand factor.
Examination of Figure 5 reveals that the running mean of the trace for each variable stabilizes after 3000 iterations, suggesting that the Gibbs sampling algorithm has converged. To ensure reliability, the analysis was conducted with a total of 5000 iterations, with the initial 3000 iterations serving as a burn-in period; the subsequent 2000 iterations were retained, yielding a set of valid samples on which statistical analysis was performed. Anderson and Herman [26] delineated that the core condition for the identifiability of the factor loading matrix is that $\Lambda^\top \Psi^{-1} \Lambda$ be diagonal; under this condition, $\Lambda$ can be identified. The resulting estimates of the factor loading matrix coefficients are presented in Table 5.
According to the results shown in Table 5, the difference between the mean and the median of each loading coefficient is small, indicating that the MCMC method fits the data well. Baltagi [25] studied the variation in cigarette price elasticity across states in this dataset using a heterogeneous coefficient model but did not consider commonality between states. The clustered factor analysis model can not only capture the internal relationships among these states but also describe the common factors shared across states, yielding richer results than the heterogeneous coefficient model. To highlight the advantage of modeling the clustered structure, the quality of the model can be judged by two criteria: the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Akaike [27] proposed a method for calculating the AIC of a factor analysis model, and the AIC and BIC values obtained with this method are shown in Table 6.
As can be seen from Table 6, the AIC and BIC values of the model that accounts for the internal relationships within clusters are much lower than those of the model that ignores them. This indicates that considering the internal relationships of clustered data is better than ignoring them and, at the same time, that the MCMC method estimates more effectively than the principal component method.
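As a point of reference, the generic definitions AIC = 2k − 2 log L̂ and BIC = k log n − 2 log L̂ can be evaluated from the Gaussian factor-model likelihood. The sketch below is our own reading, with the parameter count k taken as the standard one for an m-factor model; Akaike's factor-model-specific calculation cited above may count parameters differently.

```python
import numpy as np

def factor_model_ic(Y, Lambda_hat, psi_hat):
    """Generic AIC/BIC for a Gaussian factor model with Sigma = L L^T + Psi.

    k counts free parameters as p*m loadings + p uniquenesses minus m(m-1)/2
    rotational constraints -- a standard count, assumed here.
    """
    n, p = Y.shape
    m = Lambda_hat.shape[1]
    Sigma = Lambda_hat @ Lambda_hat.T + np.diag(psi_hat)
    resid = Y - Y.mean(axis=0)
    # Gaussian log-likelihood of the centered data under Sigma.
    sign, logdet = np.linalg.slogdet(Sigma)
    quad = np.einsum("ij,jk,ik->", resid, np.linalg.inv(Sigma), resid)
    loglik = -0.5 * (n * p * np.log(2.0 * np.pi) + n * logdet + quad)
    k = p * m + p - m * (m - 1) // 2
    return 2 * k - 2 * loglik, k * np.log(n) - 2 * loglik   # (AIC, BIC)
```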
In addition, the Bayesian case-deletion model was applied to these data to identify anomalous or highly influential clusters. When the common factor and the factor loading matrix are taken in turn as the parameters of interest, with the remaining parameters treated as nuisance parameters, the influence diagnostic results are as shown in Figure 6 and Figure 7, respectively.
Figure 6 presents visual evidence suggesting that, based on Cook's distance, the 17th cluster can be classified as an outlier or a cluster exerting significant influence on the model, regardless of whether the common factor or the factor loading matrix is the parameter of interest. A comparison of the arithmetic means of the indicators within the original dataset reveals that the 17th state's cigarette price is substantially lower than the national average and that its population is significantly smaller than that of other states, while its consumer price index and per capita sales are higher. These observations support the inference that the 17th state wields a more substantial influence. To investigate a potential masking effect among influential clusters, we performed a sequential exclusion analysis, omitting the 17th cluster before repeating the Bayesian statistical diagnosis. The results of these diagnostic assessments are depicted in Figure 7.
Figure 7 clearly shows that the Cook's distance values for the parameters of the remaining clusters exhibit minimal variation once the 17th cluster is removed, suggesting that there is no significant masking effect among these clusters. Concurrently, this confirms that the 17th cluster exerts a more substantial influence on the estimation of the parameters.
8. Discussion
Clustered data, one of the more complex data types, are characterized by internal correlation within groups and a lack of correlation between groups. Ignoring this nested structure during inference can lead to biased conclusions, and the factor analysis model is an effective approach to addressing it. This paper presents a factor model specifically tailored to clustered data. Employing the MCMC method in conjunction with Gibbs sampling, we first estimate the unknown parameters of an individual cluster; the inter-cluster association is captured through a shared diagonal error matrix, and the factor model for clustered data is then constructed by integrating multiple clusters. Numerical simulations substantiate the validity of the model and methodology, and empirical evidence indicates that the clustered data model significantly outperforms models that disregard internal relationships. Furthermore, this study employs Bayesian case-deletion influence to compare diagnostic statistics before and after cluster deletion and utilizes Cook's posterior mean distance to identify potential outliers among clusters. Both the simulation study and the example analysis yield promising results.
The research concepts presented are extendable to structural equation models with clustered data. As structural equation models are an advanced form of the factor analysis model, the modeling approaches discussed are not only applicable to them but also offer a foundation for future statistical diagnostic research on clustered data across varying models.