The simulation study showed that the Ht and PB approaches most accurately identified the number of factors to retain across all conditions. To keep the presentation clear and parsimonious, the following discussion therefore focuses on the simulation results for these combinations of outlier and factor determination methods: PB/PA, PB/MAP, PB/EGA, Ht/PA, Ht/MAP, and Ht/EGA. Results for EFA extraction using the standard covariance matrix (S) are also included because it is the default in most statistical software packages. The results for all of the extraction and covariance estimation methods are presented in Table 1, Table 2, Table 3, Table 4, Table 5, Table 6, Table 7, and Table 8.
3.1. Accuracy Rate for Correctly Identifying Number of Factors
The ANOVA identified the following terms to be statistically significant with respect to the proportion of cases for which the number of factors was correctly identified: interaction of the number of contaminated variables (V) by the proportion of the sample that is contaminated with outliers (C) by method (), V by standard deviation shift (S) by method (), V by mean shift (M) by method (), and sample size (N) by method ().
Table 1 includes the proportion of the number of factors correctly retained by the number of contaminated variables (V), the proportion of sample contaminated (C), and the method of identification/method of handling outliers. Note that for each combination of V and C, accuracy rates of 0.90 or above are in bold.
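The accuracy rate reported in the tables is the proportion of replications in which the retained number of factors equals the true number. A minimal sketch of that tabulation follows, assuming each replication is stored as a record with hypothetical keys `V`, `C`, `method`, and `retained`; the records and the value of `true_k` below are invented for illustration and are not results from the study.

```python
from collections import defaultdict


def accuracy_rates(replications, true_k):
    """Proportion of replications retaining exactly true_k factors, per condition."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rep in replications:
        key = (rep["V"], rep["C"], rep["method"])  # one cell of the table
        totals[key] += 1
        hits[key] += int(rep["retained"] == true_k)
    return {key: hits[key] / totals[key] for key in totals}


def flag_accurate(rates, threshold=0.90):
    """Conditions whose rate meets the threshold used for bold type."""
    return sorted(key for key, rate in rates.items() if rate >= threshold)


# Invented toy records (assumption): four replications, true_k chosen arbitrarily.
toy = [
    {"V": 1, "C": 0.01, "method": "PB/PA", "retained": 4},
    {"V": 1, "C": 0.01, "method": "PB/PA", "retained": 4},
    {"V": 1, "C": 0.01, "method": "S/PA", "retained": 5},
    {"V": 1, "C": 0.01, "method": "S/PA", "retained": 4},
]
rates = accuracy_rates(toy, true_k=4)
bold_cells = flag_accurate(rates)
```

Grouping on a different key tuple (for example `V`, the standard deviation shift, and `method`) would give the layout of the other accuracy tables.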
PA and EGA generally yielded the highest accuracy rates across conditions when coupled with covariance matrices estimated using either PB or Ht. The standard covariance matrix (S) was consistently associated with the lowest accuracy rates, except when the number of contaminated variables was one (regardless of the proportion contaminated) or six with a contamination proportion of 0.01. In addition, MAP was somewhat less accurate than either EGA or PA for most simulated conditions.
The accuracy rates by method, number of contaminated variables, and standard deviation shift (S) appear in Table 2. As with Table 1, accuracy rates of 0.90 or higher are in bold.
As was the case in Table 1, the Ht/PA, Ht/EGA, PB/PA, and PB/EGA combinations exhibited the highest accuracy rates across conditions. The approach based on the standard covariance matrix yielded the lowest accuracy rates, except with one contaminated variable or with six contaminated variables accompanied by a standard deviation shift of 1 or 1.5. Likewise, MAP yielded lower accuracy rates than either EGA or PA for each of the outlier methods.
Table 3 includes the accuracy rates by number of contaminated variables, mean shift (M), and combination of outlier handling and factor determination methods.
These results are very similar to those in Table 2, with Ht/PA, Ht/EGA, PB/PA, and PB/EGA yielding the most accurate results and S being the least accurate, except at the lowest levels of contamination. In addition, PB and Ht were less sensitive than S to the number of contaminated variables and the degree of mean shift; S identified the correct number of factors less often as the number of contaminated variables and/or the degree of mean shift increased. With respect to methods for determining the number of factors, MAP was less accurate than either PA or EGA.
The accuracy rates by sample size (N), factor loading value (L), interfactor correlations (C), and method (Table 4) show that Ht/PA, Ht/EGA, PB/PA, and PB/EGA had the highest accuracy rates across sample sizes. Accuracy rates of 0.90 or higher are in bold.
For all of the methods, accuracy improved concomitantly with increases in sample size. Similarly, the methods included in the study yielded higher accuracy rates when the factor loadings were larger. Conversely, across factor extraction and outlier methods, accuracy rates were lower when the interfactor correlations were larger. Finally, the lowest accuracy rates were associated with S across the methods for determining the number of factors, sample sizes, factor loading values, and interfactor correlations.
3.2. Mean Number of Factors Retained
ANOVA was used to identify the manipulated factors and interactions that were associated with the mean number of factors retained. This analysis identified two terms as statistically significant: the interaction of the number of contaminated variables (V) by the proportion of the sample that is contaminated (C) by method (), and the standard deviation shift (S) by method (). The mean number of factors identified as optimal, by number of contaminated variables, proportion of the sample contaminated, and combination of methods for dealing with outliers and determining the number of factors to retain, appears in Figure 3.
The impact of the standard deviation shift on the performance of the extraction methods in terms of the number of factors to be retained appears in Table 5.
When the standard deviation shift was three, the standard covariance matrix coupled with either EGA or PA yielded an inflated number of factors to retain. Inflation in the number of factors retained was also present for EGA with PB and Ht, with the least such effect for Ht/EGA and Ht/PA. In contrast, when the standard deviation shift was three, MAP was associated with underfactored solutions, particularly for S/MAP.
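Over- and underfactoring can be read directly from the mean number of factors retained relative to the true number. The following sketch shows one way to summarize that comparison; the function name `factoring_bias`, the labels it returns, and the input counts are hypothetical and are not taken from the study.

```python
from statistics import mean


def factoring_bias(retained_counts, true_k):
    """Return the mean number of factors retained and the direction of any bias."""
    avg = mean(retained_counts)
    if avg > true_k:
        direction = "overfactored"   # more factors retained than exist
    elif avg < true_k:
        direction = "underfactored"  # fewer factors retained than exist
    else:
        direction = "unbiased"
    return avg, direction


# Invented counts (assumption) for two hypothetical method combinations.
inflated = factoring_bias([4, 5, 6], true_k=4)
deflated = factoring_bias([3, 3, 4], true_k=4)
```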
3.3. Empirical Example
To demonstrate how the methods included in this study can be used in practice, analyses of two empirical datasets are presented. For both examples, the data consist of scores for 24 subscales from the Wechsler test of cognitive ability. Based on prior research, subscales 1–6 belong to factor 1, subscales 7–12 to factor 2, subscales 13–18 to factor 3, and subscales 19–24 to factor 4. Each subscale is normed to have a mean of 100 and a standard deviation of 15. The sample for the first example consists of 260 college undergraduates who completed the assessment, whereas the second sample consists of 107 undergraduates who completed the assessment during a different semester from the first group. Given the results of the simulation study described above, results for EGA, PA, and MAP using the standard correlation matrix are provided, as well as results using correlation matrices based on the PB and Ht techniques.
Table 6 includes the mean, standard deviation, minimum, and maximum correlation values for the off-diagonal elements of the full correlation and each subscale block for the standard, percentage bend, and heavy-tailed correlation matrices.
The blocks represent the sets of subscales that theoretically belong to the same latent trait in the population (i.e., subscales 1–6, 7–12, 13–18, and 19–24). Because visualizing the full correlation matrices is very difficult, the descriptive statistics are presented here as a way of characterizing them. Perhaps the clearest trend in these results is that, for both example datasets, the standard deviations of the PB and Ht correlations are slightly smaller than those of the standard correlation matrix. This result would be anticipated, given that both PB and Ht are designed to remove and/or down-weight outliers. In addition, the mean correlation values for PB and Ht are either comparable to or slightly smaller (but never larger) than those of the standard correlation matrix for both example datasets. None of these differences is large, however.
The number of factors returned for the example data by each combination of extraction and outlier methods appears in Table 7.
EGA yielded the correct number (four) for each approach to dealing with the outliers (PB and Ht) for both example datasets. In contrast, PA was correct for both PB and Ht for the larger sample, but PA with the standard correlation matrix did not yield accurate results. For the smaller sample, PA yielded the correct number of factors for Ht but not for PB. MAP consistently underestimated the number of factors to retain. These results mirror those from the simulation study, where PA and EGA yielded more accurate results than MAP, as did PB and Ht in combination with PA.
Table 8 includes the factor analysis results for each outlier method by subscale.
A cut-off of loadings greater than 0.3 was used to determine whether a subscale belonged to a factor. Specifically, the factors on which each subscale loaded, based on a maximum likelihood factor extraction with promax rotation for four factors, are included in the table. In general, the results are quite accurate, regardless of the correlation matrix used for the analysis. The subscales that belong to the same latent trait in the population were generally also found to belong to the same factor in the factor analysis. The only exceptions were subscale 13 for ML standard and subscale 17 for ML PB, neither of which loaded on any factor. None of the subscales loaded on more than one factor. With respect to EGA, the results were completely accurate for both the PB and Ht outlier methods. On the other hand, for the EGA standard combination, subscales 5 and 23 were not assigned to the network cluster containing the other subscales with which they are associated in the population.
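The cut-off rule described above can be applied mechanically to a rotated loading matrix. The following is a minimal sketch, assuming the loadings are available as rows of per-factor values; the function name `assign_factors` and the loading values are hypothetical, and the comparison is made on the raw loading exactly as the rule is worded (comparing absolute values would be a variant).

```python
def assign_factors(loadings, cutoff=0.3):
    """For each subscale (row), list the factors whose loading exceeds the cutoff.

    Factors are numbered from 1. An empty list means the subscale loaded on
    no factor; more than one entry means a cross-loading.
    """
    return [
        [factor for factor, value in enumerate(row, start=1) if value > cutoff]
        for row in loadings
    ]


# Invented loadings (assumption) for three subscales on four factors.
toy_loadings = [
    [0.70, 0.10, 0.00, 0.05],  # loads on factor 1 only
    [0.20, 0.25, 0.10, 0.00],  # loads on no factor
    [0.35, 0.10, 0.60, 0.00],  # cross-loads on factors 1 and 3
]
assignments = assign_factors(toy_loadings)
```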