Article

Bayesian Statistical Inference for Factor Analysis Models with Clustered Data

Bowen Chen, Na He and Xingping Li
1 College of Mathematics, Yunnan Normal University, Kunming 650500, China
2 College of Life Sciences, Yunnan Normal University, Kunming 650500, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(13), 1949; https://doi.org/10.3390/math12131949
Submission received: 4 May 2024 / Revised: 14 June 2024 / Accepted: 20 June 2024 / Published: 23 June 2024

Abstract

Clustered data are a complex and frequently encountered type of data. Traditional factor analysis methods are effective for non-clustered data, but they do not adequately capture correlations between multiple observed individuals or variables in clustered data. This paper proposes a Bayesian approach utilizing MCMC and Gibbs sampling algorithms to accurately estimate the parameters of interest within the clustered factor analysis model. Ergodic mean plots of the parameters are used to verify that the Markov chain converges, and a Bayesian case-deletion model with Cook's posterior mean distance is used to analyze influence and identify outliers in clustered data. The applicability and validity of the factor analysis model for clustered data are demonstrated by comparing the parameter estimates of the Bayesian method with those of the principal component method; an example analysis compares clustered data with and without internal relationships, and anomalous groups are identified via Cook's posterior mean distance.

1. Introduction

Clustered data, commonly found in fields such as neuroscience and epidemiology, are characterized by a degree of similarity among observations within the same group that surpasses the similarity between different groups. They comprise interconnected samples, each encapsulating a collective of individuals with correlated characteristics. When these samples are amalgamated, they form a complex clustered dataset with multiple clusters, potentially varying in the number of constituent samples. Such datasets accentuate intra-group correlations, a focus distinct from longitudinal data, which prioritize the temporal sequence of observations for individual subjects. The hierarchical structure of clustered data introduces analytical challenges, particularly in statistical analysis, where the internal group structure is easily overlooked. Galbraith and Daniel [1] have demonstrated that neglecting the internal relationships within clustered data can result in substantial biases in parameter estimation, underscoring the imperative to consider these internal associations. The field of clustered data analysis has seen significant advancements in recent years with the emergence of diverse analytical models, such as linear models [2,3,4], random effects models [5], semiparametric models [6], nonparametric models [7], logistic regression models [8], and multilevel models [9,10]. Each of these methodologies has contributed to parameter estimation for clustered data, converging on a pivotal insight: disregarding the potential correlations among observations within groups during modeling can lead to an underestimation of standard errors, thereby compromising the robustness of inferential conclusions.
The factor analysis model is an effective statistical tool for describing the dependence between multidimensional variables. Spearman [11] first proposed factor analysis when studying intelligence tests. It not only reduces dimensionality but also reflects the internal correlation among variables and effectively handles intra-cluster correlation. Experiments in cytology performed by Julian [12] demonstrated that ignoring the nested structure leads to serious deviations in the estimates of unknown parameters when the intra-cluster correlation is significant, proving that the nested structure of clustered data cannot be ignored. Huang et al. [9] proposed a variety of alternative schemes for dealing with the nested structure of clustered data, but did not address processing methods applicable to the clustered factor model. The cluster-mean method is a simple and effective way to process clustered data, and many scholars have successfully applied it to deal with nested structures. Okech [13] used cluster means to process the nested structure of the data when analyzing characteristics of low-income parents' parenting pressure and economic pressure through a factor model. Galbraith et al. [1] likewise stated that using cluster means to process clustered data is an effective way to obtain unknown parameters. Recently, with the rise of Bayesian networks, estimating factor analysis models using Bayesian methods has become a hot topic. The Bayesian factor analysis model, first proposed by Press [14], is capable of obtaining implicit solutions for the unknown parameters of the factor model when the sample covariance obeys a Wishart distribution. Wirth et al. [15] demonstrated that the MCMC method can obtain better fitting results when estimating the parameters of the factor analysis model for non-clustered data. Zhang et al. [16] proposed a novel Bayesian factor analysis model that achieves dimension reduction and feature extraction for high-dimensional multi-omics data by introducing knowledge of biological mappings, improving the model's robustness to noise and sparsity. Hansen et al. [17] proposed a new variational inference algorithm that uses a multiplicative gamma process shrinkage prior to approximate the posterior distribution of a Bayesian factor model, addressing the poor scalability of factor analytic models for high-dimensional data.
Since Cook [18] introduced the case-deletion model for identifying influential points, statistical diagnostics of models has become a research hotspot, with various diagnostic methods being developed. Among these, Bayesian statistical diagnostics represents a particularly active branch, widely utilized by statisticians at home and abroad. De Finetti [19] explored how to deal with data outliers through Bayesian subjective probability theory and analyzed the treatment in the cases of independent, exchangeable, and partially exchangeable error distributions. Jackson et al. [20] proposed two approximations capable of computing diagnostic statistics for the Bayesian case-deletion model. Zhu et al. [21] examined three Bayesian case-deletion influence measures in the context of complex models such as longitudinal data. Nevertheless, there is a paucity of research on Bayesian case-deletion diagnostics for factor models, and the statistical diagnosis of factor models for clustered data has not been addressed in the literature. Factor analysis models excel at identifying correlations among observable variables or individuals within clustered datasets, where large volumes of data across various fields share unobservable underlying common factors. However, applying these models for parameter estimation and influence analysis within clustered data remains largely unexplored. The specific contributions of this manuscript are as follows:
  • We used results averaged across clusters to deal with the nested structure of the clustered data, where the associations between clusters are described by a common diagonal error matrix;
  • We performed a comprehensive analysis of factor analysis models for clustered data, employing both Bayesian and principal component analysis methods, and compared the efficacy of these two methodologies;
  • We utilized Cook’s posterior mean distance for statistical diagnostics of factor analysis models within clustered data and identified groups with anomalies through diagnostic statistics.
The remaining sections are structured as follows: Section 2 introduces the construction of the factor analysis model for clustered data; Section 3 introduces the basic steps of modeling the factor analysis model of clustered data using the Bayesian method; Section 4 focuses on how to estimate factor analysis models for clustered data using principal component analysis; Section 5 introduces how to diagnose anomalous groups using Cook's posterior mean distance; Section 6 validates the feasibility of the Bayesian method using numerical simulation and compares its estimation performance with that of the principal component method; Section 7 is an example analysis illustrating that the estimation method in this paper can be effectively applied in practice; Section 8 encapsulates the findings and concludes the paper.

2. Model Framework of Factor Analysis for Clustered Data

Suppose the observed variable of the ith individual in the gth cluster is $x_{gi} = (x_{gi1}, x_{gi2}, \ldots, x_{gip})^T$. For $g = 1, 2, \ldots, G$ and $i = 1, 2, \ldots, n_g$, the factor analysis model is expressed as follows:
$$x_{gi} = \mu_g + \Lambda_g F_{gi} + \varepsilon_{gi}, \tag{1}$$
where $\mu_g$ is the $p \times 1$ mean vector, $\Lambda_g$ is the factor loading matrix of the gth cluster, $F_{gi} = (F_{gi1}, F_{gi2}, \ldots, F_{gim})^T$ denotes the $m$-dimensional common factor of the ith individual in the gth cluster, and $\varepsilon_{gi}$ is the special factor of the ith individual in the gth cluster, with $\varepsilon_{gi} \sim N_p(0, \Psi_g)$. At the same time, we suppose that $\varepsilon_{gi}$ is uncorrelated with $F_{gi}$ for all $g$ and $i$. Aggregating all individuals in the gth cluster in matrix form gives
$$X_g = \mathbf{1}_{n_g \times 1}\,\mu_g^T + F_g^T \Lambda_g^T + \varepsilon_g, \tag{2}$$
where
$$X_g = \begin{pmatrix} x_{g1}^T \\ \vdots \\ x_{gn_g}^T \end{pmatrix} = \big(x_{g1}, \ldots, x_{gn_g}\big)^T, \qquad \mathbf{1}_{n_g \times 1}\,\mu_g^T = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}\big(\mu_{g1}, \ldots, \mu_{gp}\big),$$
$$F_g^T = \begin{pmatrix} F_{g1}^T \\ \vdots \\ F_{gn_g}^T \end{pmatrix} = \big(F_{g1}, \ldots, F_{gn_g}\big)^T, \qquad \varepsilon_g = \begin{pmatrix} \varepsilon_{g1}^T \\ \vdots \\ \varepsilon_{gn_g}^T \end{pmatrix} = \big(\varepsilon_{g1}, \ldots, \varepsilon_{gn_g}\big)^T.$$
The factor analysis model of all clusters is expressed in block-diagonal form, as shown in Equation (3):
$$X = \mu + \Lambda F + \varepsilon, \tag{3}$$
where
$$X = \mathrm{diag}\big(X_1^T, X_2^T, \ldots, X_G^T\big), \qquad \mu = \mathrm{diag}\big(\mu_1 \mathbf{1}_{1 \times n_1}, \mu_2 \mathbf{1}_{1 \times n_2}, \ldots, \mu_G \mathbf{1}_{1 \times n_G}\big),$$
$$F = \mathrm{diag}\big(F_1, F_2, \ldots, F_G\big), \qquad \varepsilon = \mathrm{diag}\big(\varepsilon_1^T, \varepsilon_2^T, \ldots, \varepsilon_G^T\big).$$
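To make the data-generating mechanism concrete, the following sketch (our illustration in Python/NumPy, not the authors' code; dimensions and parameter values are arbitrary) simulates one cluster according to Equations (1) and (2):

```python
# Minimal sketch: simulate one cluster of the factor model in Equation (1),
# x_gi = mu_g + Lambda_g F_gi + eps_gi, stacked in matrix form as Equation (2).
import numpy as np

rng = np.random.default_rng(0)
p, m, n_g = 6, 2, 45                        # observed variables, factors, cluster size

mu_g = rng.normal(size=p)                   # cluster mean vector (p x 1)
Lambda_g = rng.normal(size=(p, m))          # factor loading matrix (p x m)
Psi_g = np.diag(rng.uniform(0.5, 1.0, p))   # diagonal special-factor covariance

F_g = rng.normal(size=(n_g, m))             # rows are the common factors F_gi
eps_g = rng.multivariate_normal(np.zeros(p), Psi_g, size=n_g)
X_g = mu_g + F_g @ Lambda_g.T + eps_g       # n_g x p data matrix X_g
```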

3. Bayesian Estimations of Factor Analysis Model for Clustered Data

3.1. Analysis of Prior Distribution

The factor analysis model of a single cluster is shown in Equation (1). Let θ represent a vector comprising all unknown parameters. For Bayesian analysis, consider that θ satisfies the following joint priors:
$$\pi(\theta) = \pi\big(\mu_g, \Lambda_g, F_{gi}, \Psi_g\big) = \pi(\mu_g)\,\pi(\Lambda_g)\,\pi(F_{gi})\,\pi(\Psi_g). \tag{4}$$
Ansari et al. [22] mentioned the advantages of using a conjugate prior when dealing with multilevel factor analysis models. This approach simplifies the computation of the posterior distribution, improves computational efficiency, and provides flexibility and stability to the parameters. We follow this research and assume that the variance of the common factors follows an inverse gamma distribution. The prior distributions for the remaining unknown parameters are delineated below:
$$\mu_g \sim N_p(\alpha_0, \Sigma_0), \qquad \pi(\Lambda_g) = \prod_{j=1}^{p}\pi(\Lambda_{gj}) \stackrel{D}{=} \prod_{j=1}^{p} N_m(\alpha_{1j}, \Sigma_{1j}), \qquad F_{gi} \sim N_m(\alpha_1, \Sigma_1),$$
$$\pi(\Psi_g) = \prod_{j=1}^{p}\pi(\Psi_{gj}) \stackrel{D}{=} \prod_{j=1}^{p} IG(c, d), \qquad \Psi_g = \mathrm{diag}\big(\Psi_{g1}, \ldots, \Psi_{gp}\big),$$
where $IG(\cdot)$ denotes the inverse gamma distribution. Aware of the potential variability among different subgroups, we adjust the means and variances of the common factors to generate subgroup datasets with distinct means and variances, thereby enabling a more precise simulation and analysis of the specific characteristics of these subgroups. In addition, we set the random error term to follow the same normal distribution across clusters, a setting designed to guarantee the feasibility and validity of the method of group means. $\alpha_{1j}$, $\Sigma_{1j}$, $c$, and $d$ are fixed hyperparameters; the remaining quantities $\alpha_0$, $\Sigma_0$, $\alpha_1$, and $\Sigma_1$ are variable parameters.

3.2. Gibbs Sampling Method

The posterior distribution of the parameters is the starting point of Bayesian statistical inference. Since the posterior distribution of $\theta$ involves multiple integrals, we use the Gibbs sampler [23] to draw samples successively from the conditional distributions, forming a Markov chain. Assuming the parameters are independent, the probability density function of $x_{gi}$ is $L(X \mid \theta) = \prod_{g=1}^{G} L(X_g \mid \theta) = \prod_{g=1}^{G}\prod_{i=1}^{n_g} L(x_{gi} \mid \theta)$, with the specific form
$$L\big(x_{gi} \mid \mu_g, \Lambda_g, F_{gi}, \Psi_g\big) = \prod_{g=1}^{G}\prod_{i=1}^{n_g} (2\pi)^{-\frac{p}{2}}\,\big|\Psi_g\big|^{-\frac{1}{2}} \exp\Big\{-\frac{1}{2}\big(x_{gi} - \hat\mu_g\big)^T \Psi_g^{-1} \big(x_{gi} - \hat\mu_g\big)\Big\}, \tag{5}$$
where $\hat\mu_g = \mu_g + \Lambda_g F_{gi}$. Since $\theta$ denotes all the unknown parameters, the estimate of each component is computed from the conditional distribution satisfied by $\theta$ in order to generate $x_{gi}^{\mathrm{new}}$. As the number of sampling iterations increases, the distribution of the sampled random numbers approaches the true distribution. $X_g = (x_{g1}^T, x_{g2}^T, \ldots, x_{gn_g}^T)^T$ is the set of observed variables, and $\pi(\mu_g, \Lambda_g, F_{gi}, \Psi_g)$ is the prior distribution of the unknown parameters. According to Equations (4) and (5), the joint posterior distribution $h(x_{gi}, \theta)$ is as follows:
$$h\big(x_{gi}, \theta\big) = L\big(x_{gi} \mid \mu_g, \Lambda_g, \Psi_g, F_{gi}\big)\,\pi\big(\mu_g, \Lambda_g, \Psi_g, F_{gi}\big). \tag{6}$$
The posterior distribution of the unknown parameters, $p(\theta \mid X_g)$, is inferred from Equation (6); its specific form is shown in (7):
$$p\big(\theta \mid x_{gi}\big) = \frac{L\big(x_{gi} \mid \theta\big)\,\pi(\theta)}{\int L\big(x_{gi} \mid \theta\big)\,\pi(\theta)\,d\theta} \propto L\big(x_{gi} \mid \mu_g, \Lambda_g, F_{gi}, \Psi_g\big)\,\pi\big(\mu_g, \Lambda_g, F_{gi}, \Psi_g\big). \tag{7}$$
First, the initial values $\{x_{gi}^{(0)}, \mu_g^{(0)}, \Lambda_g^{(0)}, F_{gi}^{(0)}, \Psi_g^{(0)}\}$ are given; $\{x_{gi}^{(n)}, \mu_g^{(n)}, \Lambda_g^{(n)}, F_{gi}^{(n)}, \Psi_g^{(n)}\}$ is obtained after $n$ steps of Gibbs sampling, whose update steps are as follows:
(1) Draw $x_{gi}^{(1)}$ from $p\big(x_{gi} \mid \mu_g^{(0)}, \Lambda_g^{(0)}, F_{gi}^{(0)}, \Psi_g^{(0)}\big)$;
(2) Draw $\mu_g^{(1)}$ from $p\big(\mu_g \mid x_{gi}^{(1)}, \Lambda_g^{(0)}, F_{gi}^{(0)}, \Psi_g^{(0)}\big)$;
(3) Draw $F_{gi}^{(1)}$ from $p\big(F_{gi} \mid x_{gi}^{(1)}, \mu_g^{(1)}, \Lambda_g^{(0)}, \Psi_g^{(0)}\big)$;
(4) Draw $\Lambda_g^{(1)}$ from $p\big(\Lambda_g \mid x_{gi}^{(1)}, \mu_g^{(1)}, F_{gi}^{(1)}, \Psi_g^{(0)}\big)$;
(5) Draw $\Psi_g^{(1)}$ from $p\big(\Psi_g \mid x_{gi}^{(1)}, \mu_g^{(1)}, \Lambda_g^{(1)}, F_{gi}^{(1)}\big)$.
The preceding observations lead to the conclusion that, as the number of cycles approaches infinity, the distribution of $\{x_{gi}^{(n)}, \mu_g^{(n)}, \Lambda_g^{(n)}, F_{gi}^{(n)}, \Psi_g^{(n)}\}$ gradually approaches the target distribution $p(\mu_g, \Lambda_g, F_{gi}, \Psi_g \mid X_g)$ [24]. This provides sufficient grounds to treat the samples drawn above as representative of the posterior distribution. The full conditional distributions of the unknown parameters are as follows:
$$x_{gi} \mid \mu_g, \Lambda_g, F_{gi}, \Psi_g \sim N_p\big(\mu_g + \Lambda_g F_{gi},\; \Psi_g\big),$$
$$\mu_g \mid x_{gi}, \Lambda_g, F_{gi}, \Psi_g \sim N_p\Big(\big(n_g \Psi_g^{-1} + \Sigma_0^{-1}\big)^{-1}\big(n_g \Psi_g^{-1} \bar{M} + \Sigma_0^{-1}\alpha_0\big),\; \big(n_g \Psi_g^{-1} + \Sigma_0^{-1}\big)^{-1}\Big),$$
$$F_{gi} \mid x_{gi}, \mu_g, \Lambda_g, \Psi_g \sim N_m\Big(\big(\Lambda_g^T \Psi_g^{-1} \Lambda_g + \Sigma_1^{-1}\big)^{-1}\big(\Lambda_g^T \Psi_g^{-1}(x_{gi} - \mu_g) + \Sigma_1^{-1}\alpha_1\big),\; \big(\Lambda_g^T \Psi_g^{-1} \Lambda_g + \Sigma_1^{-1}\big)^{-1}\Big),$$
$$\Lambda_{gj}^T \mid x_{gij}, \mu_{gj}, F_{gi}, \Psi_{gj} \sim N_m\Big(\big(F_{gi}\Psi_{gj}^{-1}F_{gi}^T + \hat\Sigma_{1j}^{-1}\big)^{-1}\big(F_{gi}\Psi_{gj}^{-1}(x_{gij} - \mu_{gj})^T + \hat\Sigma_{1j}^{-1}\alpha_{1j}\big),\; \big(F_{gi}\Psi_{gj}^{-1}F_{gi}^T + \hat\Sigma_{1j}^{-1}\big)^{-1}\Big),$$
$$\Psi_{gj} \mid x_{gi}, \mu_g, \Lambda_g, F_{gi} \sim IG\Big(\frac{n}{2} + c,\; \frac{1}{2}\sum_{i=1}^{n}\big(x_{gij} - \mu_{gj} - \Lambda_{gj}^T F_{gi}\big)^2 + d\Big),$$
where $\bar{M} = \frac{1}{n_g}\sum_{i=1}^{n_g}\big(x_{gi} - \Lambda_g F_{gi}\big)$. The estimated results for the unknown parameters of the factor analysis model for clustered data can be obtained by employing Gibbs sampling at the cluster level. However, it is necessary to ascertain whether the effect of each group within the clustered data is consistent and in line with the assumptions of the established model. Furthermore, it is crucial to determine whether there are clusters that deviate significantly from the original model. Statistical diagnosis of the above models represents an indispensable component of research involving clustered data.
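As a computational illustration, the following sketch (ours, not the authors' code; the function and argument names are our own, with `Sig0_inv`, `Sig1_inv`, and `Sig1j_inv` denoting the prior precision matrices of Section 3.1) performs one Gibbs sweep over the full conditionals listed above for a single cluster:

```python
# Minimal sketch of one Gibbs sweep for one cluster, following the full
# conditionals above; X is n x p, F is n x m, psi holds the diagonal of Psi_g.
import numpy as np

def gibbs_sweep(X, mu, Lam, F, psi, alpha0, Sig0_inv, alpha1, Sig1_inv,
                alpha1j, Sig1j_inv, c, d, rng):
    n, p = X.shape
    Psi_inv = np.diag(1.0 / psi)

    # mu | rest: normal with precision n * Psi^{-1} + Sigma_0^{-1}
    A = n * Psi_inv + Sig0_inv
    M_bar = (X - F @ Lam.T).mean(axis=0)
    mu = rng.multivariate_normal(
        np.linalg.solve(A, n * Psi_inv @ M_bar + Sig0_inv @ alpha0),
        np.linalg.inv(A))

    # F_i | rest: one multivariate normal draw per individual
    B_cov = np.linalg.inv(Lam.T @ Psi_inv @ Lam + Sig1_inv)
    for i in range(n):
        b = Lam.T @ Psi_inv @ (X[i] - mu) + Sig1_inv @ alpha1
        F[i] = rng.multivariate_normal(B_cov @ b, B_cov)

    # Row Lambda_gj | rest, then Psi_gj | rest ~ IG(c + n/2, d + RSS_j/2)
    for j in range(p):
        Cj = (F.T @ F) / psi[j] + Sig1j_inv
        bj = F.T @ (X[:, j] - mu[j]) / psi[j] + Sig1j_inv @ alpha1j
        Lam[j] = rng.multivariate_normal(np.linalg.solve(Cj, bj), np.linalg.inv(Cj))
        resid = X[:, j] - mu[j] - F @ Lam[j]
        psi[j] = 1.0 / rng.gamma(c + n / 2.0, 1.0 / (d + 0.5 * resid @ resid))
    return mu, Lam, F, psi
```

Iterating this sweep and averaging the retained draws after burn-in gives the posterior mean estimates used later in Section 6.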

4. Principal Component Method for Factor Analysis Models with Clustered Data

In the initial phase of our investigation, we focus on the factor analysis model for a single cluster, with the matrix representation presented in Equation (2). The eigenvalues of the covariance matrix are labeled $\lambda_1, \lambda_2, \ldots, \lambda_p$, arranged in descending order of magnitude, where $\lambda_1$ is the largest; the corresponding orthogonal eigenvector is denoted by $\xi_1$. If $\Lambda_g \Lambda_g^T$ is sufficiently large and the variance of the specific factors is considered negligible, $\Sigma_g$ can be approximately replaced by $\Lambda_g \Lambda_g^T$, that is, $\Sigma_g \approx \Lambda_g \Lambda_g^T$. Using spectral decomposition, we decompose the covariance matrix in the specific form shown below:
$$\Sigma_g \approx \Lambda_g \Lambda_g^T = \sum_{i=1}^{p} \lambda_i\, \hat{e}_{gi}\, \hat{e}_{gi}^T, \tag{8}$$
where $\lambda_i$ is an eigenvalue of the covariance matrix, collected as $\Lambda_g = \mathrm{diag}\big(\hat\lambda_{g1}, \hat\lambda_{g2}, \ldots, \hat\lambda_{gm}\big) \in \mathbb{R}^{m \times m}$, and $\hat{e}_{gi}$ is the corresponding orthogonal eigenvector, collected as $\hat{E}_g = \big(\hat{e}_{g1}, \hat{e}_{g2}, \ldots, \hat{e}_{gm}\big) \in \mathbb{R}^{p \times m}$. Since $F_{gi}$ obeys a normal distribution with a non-zero mean in this paper, the estimator of the factor loading matrix derived from the principal component method is delineated as follows:
$$\hat{\Lambda}_g = \hat{E}_g\, \Lambda_g^{\frac{1}{2}}\, b^{-\frac{1}{2}} I_m. \tag{9}$$
The common factor $F_g$ can be estimated using the least-squares algorithm, as shown in Equation (10):
$$\hat{F}_g = \Lambda_g^{-\frac{1}{2}}\, \hat{E}_g^T \big(X_g^T - \mu_g \mathbf{1}_{1 \times n_g}\big). \tag{10}$$
Since all clusters share a common error matrix, the parameter estimates for the factor model in clustered data can be derived by averaging the estimation results from each cluster.
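As an illustration, the following sketch (ours, not the authors' code; the helper name `pca_factor_estimates` is our own, and the scale constant $b$ of Equation (9) is taken to be 1 for simplicity) carries out the eigendecomposition steps of Equations (8)-(10) for one cluster:

```python
# Minimal sketch of the principal-component estimates for one cluster:
# eigendecompose the sample covariance, keep the m leading eigenpairs,
# and form loading and least-squares factor estimates (b = 1 assumed).
import numpy as np

def pca_factor_estimates(X_g, mu_g, m):
    """X_g: n_g x p data for one cluster; returns (Lambda_hat, F_hat)."""
    S = np.cov(X_g, rowvar=False)            # p x p sample covariance
    vals, vecs = np.linalg.eigh(S)           # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:m]         # indices of the m largest
    lam, E = vals[idx], vecs[:, idx]         # diag(Lambda_g) and E_hat (p x m)
    Lambda_hat = E * np.sqrt(lam)            # E_hat Lambda_g^{1/2}, Equation (9)
    F_hat = (E / np.sqrt(lam)).T @ (X_g - mu_g).T   # Equation (10), m x n_g
    return Lambda_hat, F_hat
```

Averaging these per-cluster estimates across the $G$ clusters then yields the principal-component estimates for the clustered factor model, as described above.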

5. Statistical Diagnosis of Factor Model for Clustered Data

5.1. Case-Deletion Model

The case-deletion model is a diagnostic tool used to assess whether observed data points are influential points. It involves comparing the differences in corresponding statistics or parameters after deleting one point to determine whether it is an influential point. The model after deleting the mth group can be expressed as Equation (11):
$$X_{(m)} = \mathbf{1}_{n_g \times 1}\,\mu_{(m)}^T + F_{(m)}^T \Lambda_{(m)}^T + \varepsilon_{(m)}. \tag{11}$$
Here, $x_{(m)}, \mu_{(m)}, F_{(m)}, \Lambda_{(m)}, \varepsilon_{(m)}$ represent the data and parameters after deleting the mth group. The MCMC method is used to estimate the parameters after deletion, and Cook's posterior mean distance is used to measure the influence between clusters.

5.2. Cook’s Posterior Mean Distance

To study the posterior mean influence of the gth group on the unknown parameters $\theta = (\mu_g, \Lambda_g, F_g, \Psi_g)$, the Bayesian case-deletion influence measure for deleting the gth cluster is defined by Cook's posterior mean distance, as shown in Equation (12):
$$CM_g = \big(\hat\theta - \hat\theta_{(g)}\big)^T\, \Psi_\theta\, \big(\hat\theta - \hat\theta_{(g)}\big), \tag{12}$$
where $\hat\theta = \int \theta\, p(\theta \mid X)\, d\theta$ and $\hat\theta_{(g)} = \int \theta\, p(\theta \mid X_{(g)})\, d\theta$ respectively represent the posterior mean before and after deletion of the gth cluster, and $\Psi_\theta$, a positive definite matrix, is the inverse of the covariance matrix of $\theta$. The larger the value of $CM_g$, the greater the difference between $\hat\theta$ and $\hat\theta_{(g)}$, and the more strongly the gth cluster is indicated as a strong-influence or anomalous cluster.
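Operationally, Equation (12) calls for one re-estimation per deleted cluster. A minimal sketch follows (ours; `posterior_mean` is a hypothetical placeholder for any estimator of the stacked parameter vector, such as the Gibbs sampler of Section 3, and `Psi_theta` is the precision matrix of Equation (12)):

```python
# Minimal sketch of Cook's posterior mean distance, Equation (12):
# compare the full-data posterior mean with the mean after deleting cluster g.
import numpy as np

def cooks_posterior_mean_distance(clusters, posterior_mean, Psi_theta):
    """clusters: list of per-cluster data arrays; returns CM_g for every g."""
    theta_full = posterior_mean(clusters)            # hat(theta) from all clusters
    CM = []
    for g in range(len(clusters)):
        reduced = clusters[:g] + clusters[g + 1:]    # case-deletion of cluster g
        diff = theta_full - posterior_mean(reduced)  # hat(theta) - hat(theta)_(g)
        CM.append(diff @ Psi_theta @ diff)           # quadratic form in Eq. (12)
    return np.array(CM)                              # large CM_g flags cluster g
```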

6. Simulation Study

6.1. Random Parameter Generation for the Markov Chain Monte Carlo Algorithm

A clustered dataset is simulated using Monte Carlo methods. The construction of a single-cluster factor analysis model is initially considered, with the caveat that the number of samples in each cluster is not necessarily identical. To this end, the number of samples is allowed to follow a uniform distribution $U[40, 50]$, with samples generated sequentially and iteratively from the corresponding conditional distributions. The factor loading matrix is of order $6 \times 2$, and the initial parameter settings are as follows:
$$\Lambda_g = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} & \Lambda_{13} & \Lambda_{14} & \Lambda_{15} & \Lambda_{16} \\ \Lambda_{21} & \Lambda_{22} & \Lambda_{23} & \Lambda_{24} & \Lambda_{25} & \Lambda_{26} \end{pmatrix}^T.$$
The initial coefficients of the factor loading matrix are set as $\Lambda_{11} = \Lambda_{12} = \Lambda_{13} = \Lambda_{14} = \Lambda_{15} = \Lambda_{16} = 0.9$ and $\Lambda_{21} = \Lambda_{22} = \Lambda_{23} = \Lambda_{24} = \Lambda_{25} = \Lambda_{26} = 0.1$, with $\Lambda_{gj}^T \sim N_2(\alpha_{1j}, \Sigma_{1j})$, $j = 1, \ldots, 6$. To ensure that the common factors satisfy different normal distributions in different clusters, the means are assumed to follow the uniform distribution $U[0, 1]$ and the variances the uniform distribution $U[0, 0.5]$; that is, $F_{gi} \sim N_2(\alpha_1, \Sigma_1)$ with $\alpha_1 \sim U[0, 1]$ and $\Sigma_1 \sim U[0, 0.5]$. The random error term is likewise assumed to follow a normal distribution, $\varepsilon_{gi} \sim N_6(0, \Psi_g)$, where $\Psi_g$ obeys the inverse gamma distribution $\Psi_g \sim IG_6(c, d)$. In the prior distribution, the hyperparameters are selected as $\alpha_{1j} = (0.9, 0.1)^T$, $\Sigma_{1j} = \mathrm{diag}(0.1, 0.1)$, $c = 3$, and $d = 4$. This paper determines the convergence and stability of the Gibbs sampling from the ergodic mean plots shown in Figure 1. After discarding the first $l$ of $n$ iterations as burn-in, the posterior mean estimates are
$$\hat\mu_g = \frac{1}{n-l}\sum_{m=l+1}^{n}\mu_g^{(m)}, \qquad \hat\Lambda_{gj}^T = \frac{1}{n-l}\sum_{m=l+1}^{n}\Lambda_{gj}^{(m)}, \qquad \hat F_{gi} = \frac{1}{n-l}\sum_{m=l+1}^{n}F_{gi}^{(m)}, \qquad \hat\Psi_g = \frac{1}{n-l}\sum_{m=l+1}^{n}\Psi_g^{(m)}.$$

6.2. Simulation Results of Gibbs Sampling

Gibbs sampling is leveraged to estimate the parameters $\hat\Lambda_g$, $\hat F_{gi}$, and $\hat\Psi_g$ through a process executed over 50 iterations. The initial estimates for the parameters $F_{gi}$ and $\varepsilon_{gi}$ are derived from the mean of 1000 samples, each drawn from the corresponding prior distribution. For the generation of the simulated clustered dataset, the number of clusters is variably set to 10, 30, and 50. A comprehensive presentation of the simulation outcomes can be found in Table 1. In this paper, we use Bias to denote the deviation between the true value and the estimated mean, RMSE to denote the root mean square error, and SD to denote the standard deviation:
$$\mathrm{Bias} = \frac{1}{n}\sum_{k=1}^{n}\big(\hat\theta_k - \theta\big), \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\big(\hat\theta_k - \theta\big)^2}, \qquad \mathrm{SD} = \sqrt{\frac{1}{n-1}\sum_{k=1}^{n}\big(\hat\theta_k - E(\hat\theta_k)\big)^2}.$$
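A small sketch of these three criteria (ours; the helper name `bias_rmse_sd` is hypothetical, and `theta_hat` collects the $n$ replicate estimates of one scalar parameter) is:

```python
# Minimal sketch of the Bias, RMSE, and SD criteria over n replications.
import numpy as np

def bias_rmse_sd(theta_hat, theta_true):
    """theta_hat: length-n array of estimates of a scalar parameter."""
    bias = np.mean(theta_hat - theta_true)
    rmse = np.sqrt(np.mean((theta_hat - theta_true) ** 2))
    sd = np.std(theta_hat, ddof=1)           # 1/(n-1) divisor, as in the text
    return bias, rmse, sd
```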
Table 1 shows that, after many iterations, the bias and root mean square error of all parameter estimates are controlled within a tightly constrained range, and the root mean square error and standard deviation are relatively close to each other, indicating that the estimation results are close to the true values and revealing the effectiveness of this method for factor models with clustered data. The principal component method is a classical estimation method for factor analysis models; under the same settings as the simulated dataset above, we compare parameter estimation based on the two methods, with the estimation results of the principal component method shown in the table below.

6.3. Random Parameter Generation for the Principal Component Algorithm

Since the principal component method ignores the effect of variance, its parameters are set approximately the same as for the Bayesian method for comparison purposes. The coefficients of the factor loading matrix are set as follows:
$$\Lambda_g = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} & \Lambda_{13} & \Lambda_{14} & \Lambda_{15} & \Lambda_{16} \\ \Lambda_{21} & \Lambda_{22} & \Lambda_{23} & \Lambda_{24} & \Lambda_{25} & \Lambda_{26} \end{pmatrix}^T.$$
The initial coefficients of the factor loading matrix are set as $\Lambda_{11} = \Lambda_{12} = \Lambda_{13} = \Lambda_{14} = \Lambda_{15} = \Lambda_{16} = 0.9$ and $\Lambda_{21} = \Lambda_{22} = \Lambda_{23} = \Lambda_{24} = \Lambda_{25} = \Lambda_{26} = 0.1$. We assume that the common factors satisfy normal distributions with different means and variances: the mean of $F_g$ is uniformly distributed between 0 and 1 and the variance uniformly distributed between 0 and 0.5, i.e., $F_{gi} \sim N_2(\alpha_1, \Sigma_1)$ with $\alpha_1 \sim U[0, 1]$ and $\Sigma_1 \sim U[0, 0.5]$; the random error $\varepsilon_{gi}$ is distributed as $N_p(0, \Psi_g)$. The initial values of $\Psi_g$ are set to be the same as under the Gibbs sampling algorithm. The intra-group correlation is set as $x_{gi} = 0.8\, x_{g,i-1} + \delta_i$, $\delta_i \sim N_6(0, \Psi_g)$.

6.4. Simulation Results of Principal Component Algorithms

The parameters of interest, $\hat\Lambda_g$ and $\hat F_g$, can be estimated using principal component analysis. To gauge the accuracy of these estimates, the bias, standard deviation, and root mean square error for $\hat\Lambda_g$ versus $\Lambda_g$, as well as $\hat F_g$ versus $F_g$, are computed. This process is repeated 1000 times to ensure a robust assessment of how closely the estimated parameters approximate their true values. The number of clusters is again set to 10, 30, and 50, and the estimated results are reported in Table 2.
As can be seen from Table 1 and Table 2, the estimation results of the MCMC method are better than those of the principal component method on every item, indicating that the MCMC method is more effective than the principal component method. Reliable influence diagnostics, in turn, depend on accurate estimation results.

6.5. Statistical Diagnosis of Clustered Data

The dataset under consideration is subjected to statistical diagnosis using the case-deletion method. Cook's distance is utilized to evaluate the influence of individual clusters by comparing the changes in $CM_g$ following the removal of different clusters. Within the simulation study, the common factor variable $F_g$ for the 8th, 15th, 21st, and 42nd clusters is incremented by 2 to appraise the validity of the applied method, and $F_g$ is designated as the diagnostic parameter of interest. The outcomes of this analysis are presented in Figure 2.
Figure 2 depicts Cook's distance for datasets of 10, 30, and 50 clusters, generated through Monte Carlo simulation, with clusters deleted sequentially. Analysis of Figure 2 reveals that, in the 10-cluster dataset, the 8th cluster exerts the most significant influence. In the dataset comprising 30 clusters, the 8th, 15th, and 21st clusters are identified as strong influence points, with the 21st cluster demonstrating the most pronounced impact. Similarly, in the dataset with 50 clusters, the 8th, 15th, 21st, and 42nd clusters are recognized as having substantial influence, with the 21st cluster again being the most influential. These findings are congruent with the intentional perturbations introduced to the 8th, 15th, 21st, and 42nd clusters, as previously discussed. Consequently, the diagnostic methodology employed can be deemed valid.

7. Example Analysis

In this section, we detail the experimental analysis of cigarette sales. Our study is based on data derived from a survey of cigarette sales conducted across the United States from 1963 to 1992. This dataset is accessible through the Wiley Online Library at https://www.wiley.com/legacy/vkhbwileychi/baltagi/ (accessed on 14 February 2024). The survey, conducted by Baltagi [25], aimed to investigate cigarette consumption patterns across 46 states within the United States. A comprehensive description of the variables included in the dataset is presented in Table 3. The dataset, which is inherently clustered by state, has been modeled as such for analysis. Before conducting factor analysis, the data underwent standardization. Subsequently, the suitability of the data for factor analysis was assessed using the KMO test and Bartlett's sphericity test. The outcomes are detailed in Table 4.
As detailed in Table 4, the KMO test for sampling adequacy is calculated at 0.78, a value that exceeds the threshold commonly accepted for proceeding with factor analysis. Moreover, Bartlett’s test for sphericity yields a p-value of less than 0.05, which is statistically significant and further corroborates the appropriateness of the data for this analytical approach. Figure 3 indicates that the extraction of three common factors is justified, thus leading to the development of a factor analysis model inclusive of these three factors.
Figure 4 illustrates the classification of the eight indicators into three separate categories based on their factor loading coefficients. Specifically, the indicators labeled as Price, Cpi, Pimin, Ndi, and Year are categorized together as representing economic factors. The indicators Pop and Pop16 are united to form the demographic factor. Finally, the Sales indicator is identified as the sole representative of the demand factor.
Examination of Figure 5 reveals that the mean of the trace for each variable stabilizes upon completion of 3000 iterations, suggesting that the Gibbs sampling algorithm has achieved convergence. To ensure reliability, the analysis was conducted with a total of 5000 iterations, with the initial 3000 iterations serving as a burn-in period; the subsequent 2000 iterations were retained as valid samples, on which the statistical analysis was performed. Anderson and Rubin [26] delineated that the core condition for the identifiability of the factor loading matrix is the requirement that $\Lambda_g^T \Psi_g^{-1} \Lambda_g$ be diagonal, under which $\Lambda_g$ can be identified. The resulting estimates for the factor loading matrix coefficients are presented in Table 5.
According to the results shown in Table 5, the difference between the mean and median of the loading coefficients is small, indicating that the MCMC method fits well. Baltagi [25] studied the variation in cigarette price elasticities across states in this dataset using a heterogeneous coefficient model but did not consider the issue of commonality between states. The clustered factor analysis model can not only capture the internal relationships among these states but also describe the common factors among different states, yielding richer results than the heterogeneous coefficient model. To highlight the advantages of modeling the clustered structure, the quality of the model can be judged by two criteria: the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Akaike [27] proposed a method for calculating the AIC value in a factor analysis model; the AIC and BIC values calculated through this method are shown in Table 6.
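For reference, the generic forms of the two criteria are the standard ones (with $\hat L$ the maximized likelihood, $q$ the number of free parameters, and $N$ the sample size); Akaike [27] specializes the parameter count $q$ to the factor model:
$$\mathrm{AIC} = -2\ln\hat{L} + 2q, \qquad \mathrm{BIC} = -2\ln\hat{L} + q\ln N.$$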
As can be seen from Table 6, the AIC and BIC values of the model that considers the internal relationships of clusters are much lower than those of the model that does not, which indicates that accounting for the internal relationships of clustered data is better than ignoring them; at the same time, it illustrates that the estimation effect of the MCMC method is better than that of the principal component method.
In addition, the Bayesian case-deletion model was applied to these data to explore anomalous or high-influence clusters. With the common factor $F_{gi}$ and the factor loading matrix $\Lambda_g$ taken in turn as the parameters of interest and the remaining parameters treated as nuisance parameters, the influence diagnostic results are shown in Figure 6 and Figure 7, respectively.
Figure 6 presents visual evidence suggesting that, based on Cook’s distance, the 17th cluster can be classified as an outlier or a cluster exerting a significant influence on the model regardless of whether parameter F g or Λ g is the focus of interest. A comparison of the arithmetic means of indicators within the original dataset reveals that the 17th state’s cigarette price is substantially lower than the national average. At the same time, this state has a significantly smaller local population compared to other states. However, the local consumer price index and per capita sales are observed to be higher. These observations lead to the inference that the 17th state may wield a more substantial influence. To investigate the potential masking effect among influential clusters, we performed a sequential exclusion analysis, initially omitting the 17th cluster, before conducting the Bayesian statistical diagnosis. The results of these diagnostic assessments are depicted in Figure 7.
Figure 7 provides a clear depiction revealing that the Cook’s distance values for the parameters of the non-17th clusters exhibit minimal variation. This observation suggests that there is no significant masking effect among these clusters. Concurrently, the figure also indicates that the 17th cluster exerts a more substantial influence on the estimation of the parameters.

8. Discussion

Clustered data, one of the complex data types, are characterized by internal correlation within the same group and a lack of correlation between different groups. Ignoring this nested structure during inference can lead to biased conclusions. Utilizing the factor analysis model is an effective approach to address the nested structure inherent in clustered data. This paper presents a factor model specifically tailored for clustered data. Employing the MCMC method in conjunction with Gibbs sampling, we initially estimate the unknown parameters of an individual cluster; the inter-cluster association is depicted through a shared diagonal error matrix, and the factor model with clustered data is then constructed by integrating multiple clusters. Numerical simulations substantiate the validity of the model and methodology. Empirical evidence indicates that the clustered data model significantly outperforms models that disregard internal relationships. Furthermore, this study employs Bayesian case-deletion influence measures to compare diagnostic statistics before and after cluster deletion and utilizes Cook's posterior mean distance to identify potential outliers among clusters. Both the simulation study and the example analysis yield promising results.
The research concepts presented are extendable to structural equation models with clustered data. As structural equation models are an advanced form of the factor analysis model, the modeling approaches discussed are not only applicable to them but also offer a foundation for future statistical diagnostic research on clustered data across varying models.

Author Contributions

Conceptualization, B.C. and X.L.; methodology, B.C. and X.L.; software, B.C.; dataset curation, N.H.; writing—original draft, B.C.; writing—review and editing, B.C.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (23BTJ051), the Regional Science Foundation of China (11261108), and Yunnan Key Laboratory of Modern Analytical Mathematics and Applications, China (202302AN360007).

Data Availability Statement

The data presented in this study are available on request from the corresponding author; the dataset was jointly completed by the team, so the data are not publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Galbraith, S.; Daniel, J.A.; Vissel, B. A study of clustered data and approaches to its analysis. J. Neurosci. 2010, 30, 10601–10608. [Google Scholar] [CrossRef]
  2. Scott, A.J.; Holt, D. The effect of two-stage sampling on ordinary least squares methods. J. Am. Stat. Assoc. 1982, 77, 848–854. [Google Scholar] [CrossRef]
  3. Greenwald, B.C. A general analysis of bias in the estimated standard errors of least squares coefficients. J. Econom. 1983, 22, 323–338. [Google Scholar] [CrossRef]
  4. Chen, K.; Jin, Z. Partial linear regression models for clustered data. J. Am. Stat. Assoc. 2006, 101, 195–204. [Google Scholar] [CrossRef]
  5. Moulton, B.R. Random group effects and the precision of regression estimates. J. Econom. 1986, 32, 385–397. [Google Scholar] [CrossRef]
  6. Lin, X.; Carroll, R.J. Semiparametric regression for clustered data. Biometrika 2001, 88, 1179–1185. [Google Scholar] [CrossRef]
  7. Lin, X.; Carroll, R.J. Nonparametric function estimation for clustered data when the predictor is measured without/with error. J. Am. Stat. Assoc. 2000, 95, 520–534. [Google Scholar] [CrossRef]
  8. George, Y.Q.; Williams, G.W.; Beck, G.J. A generalized model of logistic regression for clustered data. Commun. Stat. Theory Methods 1987, 16, 3447–3476. [Google Scholar] [CrossRef]
  9. Huang, F.L. Alternatives to multilevel modeling for the analysis of clustered data. J. Exp. Educ. 2016, 84, 175–196. [Google Scholar] [CrossRef]
  10. Huang, F.L. Analyzing Group Level Effects with Clustered Data Using Taylor Series Linearization. Pract. Assess. Res. Eval. 2014, 19, 13. [Google Scholar] [CrossRef]
  11. Spearman, C. General intelligence, objectively determined and measured. Am. J. Psychol. 1904, 15, 201–292. [Google Scholar] [CrossRef]
  12. Julian, M.W. The consequences of ignoring multilevel data structures in nonhierarchical covariance modeling. Struct. Equ. Model. 2001, 8, 325–352. [Google Scholar] [CrossRef]
  13. Okech, D. Reporting multiple-group mean and covariance structure across occasions with Structural Equation Modeling. Res. Soc. Work. Pract. 2012, 22, 567–577. [Google Scholar] [CrossRef]
  14. Press, S.J. Applied multivariate analysis. Biometrics 1972, 45, 833–834. [Google Scholar] [CrossRef]
  15. Wirth, R.J.; Edwards, M.C. Item factor analysis: Current approaches and future directions. Psychol. Methods 2007, 12, 58–79. [Google Scholar] [CrossRef]
  16. Zhang, Q.; Chang, C.; Shen, L.; Long, Q. Incorporating graph information in Bayesian factor analysis with robust and adaptive shrinkage priors. Biometrics 2024, 80, ujad014. [Google Scholar] [CrossRef]
  17. Hansen, B.; Avalos-Pacheco, A.; Russo, M.; De Vito, R. Fast variational inference for Bayesian factor analysis in single and multi-study settings. J. Comput. Graph. Stat. 2024, 1–42. [Google Scholar] [CrossRef]
  18. Cook, R.D. Detection of influential observations in linear regression. Technometrics 1977, 19, 15–18. [Google Scholar] [CrossRef]
  19. De Finetti, B. The Bayesian approach to the rejection of outliers. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 1 January 1961; pp. 199–210. Available online: https://digitalassets.lib.berkeley.edu/math/ucb/text/math_s4_v1_article-13.pdf (accessed on 14 January 2024).
  20. Jackson, D.; White, I.R.; Carpenter, J. Identifying influential observations in Bayesian models by using Markov chain Monte Carlo. Stat. Med. 2012, 31, 1238–1248. [Google Scholar] [CrossRef]
  21. Zhu, H.; Ibrahim, J.G.; Cho, H.; Tang, N. Bayesian case influence measures for statistical models with missing data. J. Comput. Graph. Stat. 2012, 21, 253–271. [Google Scholar] [CrossRef]
  22. Ansari, A.; Jedidi, K.; Dube, L. Heterogeneous factor analysis models: A Bayesian approach. Psychometrika 2002, 67, 49–77. [Google Scholar] [CrossRef]
  23. Geman, S.; Geman, D. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 1984, 6, 721–741. [Google Scholar] [CrossRef]
  24. Gelfand, A.E.; Smith, A.F.M. Sampling-based approaches to calculating marginal densities. J. Am. Stat. Assoc. 1990, 85, 398–409. [Google Scholar] [CrossRef]
  25. Baltagi, B.H.; Griffin, J.M.; Xiong, W. To pool or not to pool: Homogeneous versus heterogeneous estimators applied to cigarette demand. Rev. Econ. Stat. 2000, 82, 117–126. [Google Scholar] [CrossRef]
  26. Anderson, T.W.; Rubin, H. Statistical Inference in Factor Analysis. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Berkeley, CA, USA, 1956. Available online: https://digitalassets.lib.berkeley.edu/math/ucb/text/math_s3_v5_article-08.pdf (accessed on 16 January 2024).
  27. Akaike, H. Factor Analysis and AIC. Psychometrika 1987, 52, 317–332. [Google Scholar] [CrossRef]
Figure 1. Ergodic mean plots. (a) Ergodic mean plot of common factor. (b) Ergodic mean plot of manifest variables.
Figure 2. Cook's posterior mean distance for 10, 30, and 50 clusters. (a) CM(g) for 10 clusters. (b) CM(g) for 30 clusters. (c) CM(g) for 50 clusters.
Figure 3. Scree plot.
Figure 4. Classification structure chart.
Figure 5. Ergodic mean plots. (a) Ergodic mean plot of common factor. (b) Ergodic mean plot of manifest variables.
Figure 6. Cook's posterior mean distances. (a) Cook's posterior mean distance for $F_g$. (b) Cook's posterior mean distance for $\Lambda_g$.
Figure 7. The Cook posterior mean distance after group deletion. (a) Cook's posterior mean distance for $F_g$. (b) Cook's posterior mean distance for $\Lambda_g$.
Table 1. Estimated results for the Gibbs sampling algorithm for clustered data (columns grouped by number of clusters: 10, 30, 50).

| Parameters | Bias (10) | RMSE (10) | SD (10) | Bias (30) | RMSE (30) | SD (30) | Bias (50) | RMSE (50) | SD (50) |
|---|---|---|---|---|---|---|---|---|---|
| F1 | −0.0054 | 0.05185 | 0.04919 | −0.0045 | 0.04508 | 0.03047 | 0.0067 | 0.05240 | 0.04949 |
| F2 | −0.0045 | 0.07089 | 0.04995 | −0.0028 | 0.04621 | 0.04405 | −0.0022 | 0.04212 | 0.03937 |
| Λ11 | 0.0077 | 0.03157 | 0.03156 | 0.0063 | 0.03155 | 0.03153 | −0.0054 | 0.02291 | 0.02291 |
| Λ21 | 0.0035 | 0.03170 | 0.03168 | 0.0020 | 0.03147 | 0.03145 | −0.0018 | 0.02295 | 0.02294 |
| Λ31 | 0.0058 | 0.03161 | 0.03159 | 0.0033 | 0.03139 | 0.03138 | 0.0029 | 0.02323 | 0.02321 |
| Λ41 | −0.0019 | 0.03162 | 0.03159 | 0.0014 | 0.03171 | 0.03169 | 0.0013 | 0.02288 | 0.02288 |
| Λ51 | −0.0098 | 0.03178 | 0.03176 | 0.0071 | 0.03155 | 0.03152 | 0.0068 | 0.02291 | 0.02290 |
| Λ61 | 0.0043 | 0.03162 | 0.03159 | 0.0081 | 0.03165 | 0.03163 | −0.0037 | 0.02305 | 0.02302 |
| Λ12 | 0.0041 | 0.03162 | 0.03158 | 0.0038 | 0.03156 | 0.03154 | 0.0031 | 0.02412 | 0.02411 |
| Λ22 | 0.0039 | 0.03149 | 0.03146 | 0.0075 | 0.03153 | 0.03150 | −0.0070 | 0.02400 | 0.02400 |
| Λ32 | −0.0092 | 0.03172 | 0.03170 | 0.0061 | 0.03161 | 0.03160 | −0.0054 | 0.02440 | 0.02438 |
| Λ42 | −0.0023 | 0.03183 | 0.03180 | 0.0022 | 0.03167 | 0.03165 | 0.0020 | 0.02411 | 0.02410 |
| Λ52 | 0.0063 | 0.03144 | 0.03142 | 0.0070 | 0.03166 | 0.03163 | 0.0069 | 0.02415 | 0.02414 |
| Λ62 | 0.0066 | 0.03173 | 0.03170 | 0.0032 | 0.03163 | 0.03161 | −0.0022 | 0.02420 | 0.02419 |
| Ψ11 | 0.0643 | 0.13252 | 0.09980 | 0.0509 | 0.12967 | 0.09969 | −0.0487 | 0.12818 | 0.09865 |
| Ψ22 | −0.0662 | 0.13391 | 0.10050 | −0.0663 | 0.13188 | 0.10016 | 0.0106 | 0.13039 | 0.09804 |
| Ψ33 | −0.0647 | 0.13144 | 0.09782 | −0.0681 | 0.13391 | 0.10050 | −0.0637 | 0.13075 | 0.10042 |
| Ψ44 | 0.0863 | 0.13012 | 0.09901 | 0.0828 | 0.12700 | 0.09974 | 0.0696 | 0.12622 | 0.09833 |
| Ψ55 | 0.0412 | 0.12839 | 0.09846 | 0.0387 | 0.12775 | 0.09789 | 0.0652 | 0.12815 | 0.09924 |
| Ψ66 | 0.0650 | 0.13257 | 0.09719 | −0.0612 | 0.12995 | 0.09647 | 0.0593 | 0.12607 | 0.09580 |
Table 2. Estimated results for the principal component algorithm for clustered data (columns grouped by number of clusters: 10, 30, 50).

| Parameters | Bias (10) | RMSE (10) | SD (10) | Bias (30) | RMSE (30) | SD (30) | Bias (50) | RMSE (50) | SD (50) |
|---|---|---|---|---|---|---|---|---|---|
| F1 | −0.54662 | 0.5250 | 0.1785 | −0.52588 | 0.5234 | 0.1797 | −0.52138 | 0.4692 | 0.1792 |
| F2 | −0.52657 | 0.5393 | 0.2127 | −0.52966 | 0.5402 | 0.2156 | −0.52526 | 0.4868 | 0.2139 |
| Λ11 | 0.62061 | 0.8119 | 0.5276 | 0.60196 | 0.7891 | 0.5168 | 0.61905 | 0.7911 | 0.5129 |
| Λ21 | 0.61813 | 0.7897 | 0.5115 | 0.61369 | 0.7987 | 0.5179 | 0.60584 | 0.7908 | 0.5120 |
| Λ31 | 0.61785 | 0.7916 | 0.5147 | 0.61234 | 0.7989 | 0.5197 | 0.60830 | 0.7978 | 0.5201 |
| Λ41 | 0.61547 | 0.8018 | 0.5208 | 0.60478 | 0.7793 | 0.5119 | 0.61607 | 0.8057 | 0.5234 |
| Λ51 | 0.61554 | 0.7893 | 0.5145 | 0.61900 | 0.8053 | 0.5218 | 0.60848 | 0.7975 | 0.5195 |
| Λ61 | 0.61820 | 0.7898 | 0.5108 | 0.61384 | 0.7997 | 0.5191 | 0.60794 | 0.7952 | 0.5167 |
| Λ12 | −0.08237 | 0.4492 | 0.4398 | −0.07056 | 0.4552 | 0.4493 | −0.08871 | 0.4579 | 0.4488 |
| Λ22 | −0.05652 | 0.4555 | 0.4503 | −0.07448 | 0.4587 | 0.4523 | −0.07114 | 0.4588 | 0.4521 |
| Λ32 | −0.08293 | 0.4546 | 0.4459 | −0.08121 | 0.4578 | 0.4502 | −0.07911 | 0.4555 | 0.4496 |
| Λ42 | −0.09258 | 0.4531 | 0.4424 | −0.08322 | 0.4550 | 0.4481 | −0.07380 | 0.4593 | 0.4522 |
| Λ52 | −0.08433 | 0.4562 | 0.4449 | −0.07415 | 0.4575 | 0.4516 | −0.08288 | 0.4601 | 0.4523 |
| Λ62 | −0.07983 | 0.4568 | 0.4499 | −0.07234 | 0.4553 | 0.4450 | −0.07302 | 0.4583 | 0.4524 |
Table 3. Description of data variables.

| Data Variable | Definition |
|---|---|
| State | The state surveyed for cigarette data |
| Year | Time of data investigation |
| Price | Retail price of cigarettes |
| Pop | The local population |
| Pop16 | The number of local people older than 16 |
| Cpi | Local consumer price index |
| Ndi | National disposable income |
| Sales | Per capita sales of cigarettes |
| Pimin | Minimum price per pack of cigarettes |
Table 4. KMO test and Bartlett's sphericity test.

| Index | | Value |
|---|---|---|
| KMO test | | 0.78 |
| Bartlett's sphericity test | K-squared | 72,607 |
| | p-value | < 2.2 × 10⁻¹⁶ |
Table 5. Estimated effects of the Gibbs sampling (46 clusters; common factors: Economic, Population, Demand).

| Parameters | Mean (Econ.) | Median (Econ.) | SD (Econ.) | Mean (Pop.) | Median (Pop.) | SD (Pop.) | Mean (Dem.) | Median (Dem.) | SD (Dem.) |
|---|---|---|---|---|---|---|---|---|---|
| Year | 0.9897 | 0.9893 | 0.0206 | 0.0511 | 0.0510 | 0.0214 | −0.0418 | −0.0412 | 0.0213 |
| Price | 0.9608 | 0.9613 | 0.0144 | −0.2603 | −0.2604 | 0.0152 | 0.0907 | 0.0906 | 0.0164 |
| Pop | 0.9911 | 0.9908 | 0.0360 | 0.0083 | 0.0084 | 0.0390 | −0.0980 | −0.0979 | 0.0392 |
| Pop16 | 0.9908 | 0.9910 | 0.0103 | 0.1074 | 0.1073 | 0.0104 | −0.0807 | −0.0800 | 0.0110 |
| Cpi | 0.9959 | 0.9959 | 0.0138 | −0.0473 | −0.0472 | 0.0140 | −0.0459 | −0.0455 | 0.0147 |
| Ndi | 0.9917 | 0.9920 | 0.0120 | −0.1120 | 0.1197 | 0.0129 | 0.0347 | 0.0348 | 0.0123 |
| Sales | 0.6296 | 0.6295 | 0.0152 | 0.7726 | 0.7725 | 0.0153 | 0.0767 | 0.0765 | 0.0153 |
| Pimin | 0.9650 | 0.9647 | 0.0271 | −0.2448 | −0.2461 | 0.0291 | 0.0878 | 0.0889 | 0.0305 |
Table 6. Comparison of AIC and BIC values with Gibbs sampling and principal component analysis.

| | MCMC: AIC | MCMC: BIC | Principal Component: AIC | Principal Component: BIC |
|---|---|---|---|---|
| Within-cluster correlation considered | 398.4 | 359.2 | 739.9 | 688.1 |
| Within-cluster correlation not considered | 4272.9 | 4192.1 | 14,129.3 | 14,077.5 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
