Article
Peer-Review Record

How to Estimate Absolute-Error Components in Structural Equation Models of Generalizability Theory

Psych 2021, 3(2), 113-133; https://doi.org/10.3390/psych3020011
by Terrence D. Jorgensen
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 10 April 2021 / Revised: 25 May 2021 / Accepted: 26 May 2021 / Published: 29 May 2021

Round 1

Reviewer 1 Report

Traditionally, generalizability theory (G-theory) designs have been analyzed with variance components models (multilevel models or ANOVA equations). There is a literature that shows how G-theory designs can also be estimated with structural equation models (SEM). The present paper extends this literature by showing (1) how variance components (e.g., main-effect variances for items and occasions) for estimating absolute error can be calculated and (2) how ordinal data can be analyzed. The paper is well written and provides a nice extension of the SEM literature on estimating G-theory designs. I only have a few comments for the author.

There is a long tradition of using G-theory (especially in educational research) in which the ordinal nature of the outcome measure (in many applications, items) is completely ignored—and the outcome is treated as continuous. From an SEM perspective, it could be argued that these models were misspecified because they did not take into account the measurement level of the observations. However, most proponents of G-theory (e.g., Brennan, Shavelson) were not concerned about this issue, and I think this should also be emphasized in the paper. Probably, the main reason for ignoring the ordinal nature of the data in G-theory applications was that this allowed researchers to draw conclusions on the original metric of the measures. It seems that similar thinking is present in the current paper when IFA model results are provided on the ordinal response scale.

The first part of the paper provides a very good (and also didactic) explanation of how different G-theory designs can be estimated in SEM. It is also shown how previous SEM approaches were limited because they did not estimate all variance components of the G-theory design. However, I found the section on Discretized Data (Section 2.4) to be very short and less clear. It would be helpful if this section were better connected to the different G-theory designs and to the challenges of dealing with ordinal data in G-theory designs.

One final aspect that I think needs clarification is the practical relevance of the SEM approach for estimating G-theory designs. I am thinking about possible applications in which one would prefer the SEM approach for estimating G-theory designs (and not multilevel models or Brennan’s software). Given the availability of multilevel modeling software that can handle crossed random effects (e.g., lme4), I would assume that most researchers would use multilevel models to obtain the variance components estimates. It would be interesting to hear more about potential benefits of the SEM approach to analyzing G-theory designs.
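For concreteness, the multilevel route mentioned here might look as follows in lme4. This is a minimal sketch, assuming a long-format data frame dat with hypothetical columns person, item, and score:

```r
# Crossed random effects for a p x i design (one observation per cell).
# 'dat', 'person', 'item', and 'score' are hypothetical names.
library(lme4)

fit <- lmer(score ~ 1 + (1 | person) + (1 | item), data = dat)
print(VarCorr(fit))
# With one observation per person-item cell, the residual variance
# absorbs the p x i interaction plus random error, because the
# interaction cannot be separated from error in this design.
```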

Author Response

There is a long tradition of using G-theory (especially in educational research) in which the ordinal nature of the outcome measure (in many applications, items) is completely ignored—and the outcome is treated as continuous. From an SEM perspective, it could be argued that these models were misspecified because they did not take into account the measurement level of the observations. However, most proponents of G-theory (e.g., Brennan, Shavelson) were not concerned about this issue, and I think this should also be emphasized in the paper. Probably, the main reason for ignoring the ordinal nature of the data in G-theory applications was that this allowed researchers to draw conclusions on the original metric of the measures. It seems that similar thinking is present in the current paper when IFA model results are provided on the ordinal response scale.

The reviewer points out that G-theorists have long treated Likert-scale measures as continuous, most likely for the link to the observed response scale. Also in response to Reviewer 3, the text now devotes more attention to the relative strengths/limitations of the LRV perspective vs. treating the numerically weighted ordinal categories as numeric values (see added text in blue font in Section 2.4.2). G&Y's ordinal omega is now discussed solely in the Discussion, given that I did not actually propose this existing method. Rather, the revised Discussion (Section 5.3) explains its potential value, especially for defining analogous D-coefs to include absolute error.

The first part of the paper provides a very good (and also didactic) explanation of how different G-theory designs can be estimated in SEM. It is also shown how previous SEM approaches were limited because they did not estimate all variance components of the G-theory design. However, I found the section on Discretized Data (Section 2.4) to be very short and less clear. It would be helpful if this section were better connected to the different G-theory designs and to the challenges of dealing with ordinal data in G-theory designs.

The reviewer praised the didactic value of the paper's unique contribution (estimating absolute-error components in SEM), but criticized the scope of discussion about discrete measurements and how this affects the proposed method for estimating absolute-error components.  Because all the GT fundamentals were already discussed in the previous subsections, the primary purpose of this subsection was merely to explain how the threshold model links observed ordinal measurements to continuous LRVs (to which the previously discussed GT-SEMs still apply). I have clarified this at the bottom of Section 2.4.1 (see added text in blue font): "Thus, once the threshold model is appropriately constrained, the remaining IFA parameters can be specified using the same principles discussed for CFA parameters above."  Additionally, the new Section 2.4.2 includes more information about the different ways of analyzing ordinal variables (including their relative challenges), as noted above.
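As a rough illustration of the threshold model described above (a sketch under assumed variable names, not the paper's exact specification), lavaan fits a model to the continuous LRVs simply by declaring the indicators as ordered:

```r
# Ordinal indicators y1-y3 (hypothetical names) are linked to continuous
# latent response variables (LRVs) via estimated thresholds.
library(lavaan)

mod <- ' P =~ 1*y1 + 1*y2 + 1*y3 '   # unit loadings, as in the CFA case
fit <- cfa(mod, data = dat, ordered = c("y1", "y2", "y3"),
           estimator = "WLSMV", parameterization = "theta")
summary(fit)
```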

One final aspect that I think needs clarification is the practical relevance of the SEM approach for estimating G-theory designs. I am thinking about possible applications in which one would prefer the SEM approach for estimating G-theory designs (and not multilevel models or Brennan’s software). Given the availability of multilevel modeling software that can handle crossed random effects (e.g., lme4), I would assume that most researchers would use multilevel models to obtain the variance components estimates. It would be interesting to hear more about potential benefits of the SEM approach to analyzing G-theory designs.

I thank the reviewer for suggesting this addition. The Discussion now includes Section 5.1 (Advantages of SEM for GT) to elucidate the advantages of the SEM approach relative to mixed models, as well as disadvantages of the SEM approach due to missing data in Section 5.2.

Reviewer 2 Report

This paper is an effort to extend and critique the structural equation models of generalizability theory.

Though I believe this is an honest effort on the part of the author, the entire paper is not well written or completely comprehensible.

All academic papers must follow specific rules of writing and editing, and authors should prepare their manuscripts accordingly before seeking publication in scientific journals. I am afraid this paper "breaks" quite a few of these rules (e.g., "I show", "I demonstrate"; the use of abbreviations in the abstract is forbidden, etc.). A similarly difficult-to-follow syntax runs through the entire manuscript.

Last but not least, Psych is a psychology-themed journal, and this paper does not clarify how its content can be useful for related research subjects.

Therefore, I am afraid I have to advise against this paper's publication.

Author Response

Though I believe this is an honest effort on the part of the author, the entire paper is not well written or completely comprehensible.

The reviewer is thanked by the author for motivating him to pay careful attention to his use of language.

All academic papers must follow specific rules of writing and editing, and authors should prepare their manuscripts accordingly before seeking publication in scientific journals. I am afraid this paper "breaks" quite a few of these rules (e.g., "I show", "I demonstrate"; the use of abbreviations in the abstract is forbidden, etc.). A similarly difficult-to-follow syntax runs through the entire manuscript.

In accordance with the editor's recommendations, the author has removed the use of first-person pronouns from the Abstract.  However, they remain in the text because the use of first-person narrative does not break grammar or syntax rules. Regarding its appropriateness for publication in psychology-themed scientific journals, each recent edition of the Publication Manual of the American Psychological Association explicitly recommends writing in first-person active voice, rather than third-person passive voice.

https://blog.apastyle.org/apastyle/2009/09/use-of-first-person-in-apa-style.html

Although the Psych journal does not explicitly require APA style, its instructions for authors merely request that authors choose a style (free-format submission) and use it consistently. Given the name of the journal, the author thought APA style would be most appropriate. Abbreviations in the Abstract are also acceptable in APA style:

https://blog.apastyle.org/apastyle/2015/10/an-abbreviations-faq.html#Q7

Again in accordance with the editor's advice, the author only used abbreviations in the manuscript that are (a) commonly understood even without being explicitly defined and (b) used at least 3 times. In this case, only GT and SEM (the cornerstones of the paper) were used in the Abstract. Even disregarding APA style, the Psych journal's instructions for authors explicitly state that "Abbreviations should be defined in parentheses the first time they appear in the abstract, main text, and in figure or table captions and used consistently thereafter."

https://www.mdpi.com/journal/psych/instructions 

Last but not least, Psych is a psychology-themed journal, and this paper does not clarify how its content can be useful for related research subjects.

In the first paragraph of the introduction, the author added several references to the use of GT in various domains of psychology, as well as a sentence clarifying that "GT can be used to quantify several types of reliability that are of common interest in psychological disciplines, including scale reliability, test-retest reliability, and interrater reliability (IRR)." The revision points out in Section 2.1 the equivalence of the G-coef for a p × i design to coefficient α, which is frequently used by psychologists to quantify scale reliability. The author also added a real-data example using GT to calculate IRR coefficients (see Section 4), pointing out their equivalence to ICCs defined in McGraw & Wong's (1996) highly cited paper in Psychological Methods.
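For readers outside the GT literature, the equivalence invoked here is the classical identity that, for a p × i design with n_i items and mean-squares (ANOVA) estimates of the variance components, the G-coefficient equals coefficient α:

$$
E\hat{\rho}^{2} = \frac{\hat{\sigma}^{2}_{p}}{\hat{\sigma}^{2}_{p} + \hat{\sigma}^{2}_{pi,e}/n_{i}}
= \hat{\alpha} = \frac{n_{i}}{n_{i}-1}\left(1 - \frac{\sum_{j=1}^{n_{i}} \hat{\sigma}^{2}_{Y_{j}}}{\hat{\sigma}^{2}_{X}}\right),
$$

where σ²_X is the variance of the composite of the n_i items.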

 

Reviewer 3 Report

I found the material addressed in this manuscript interesting and a solid potential contribution to the research literature on G-theory. However, I would recommend simplifying and omitting some material. I also believe some statements mischaracterize information from several cited articles.

  1. Title. I recommend shortening the title to something like “Extensions of Structural Equation Modeling in Generalizability theory to Estimate Absolute Error.” I do this because I believe that the criticisms you make apply mostly to the work of Zumbo, Ark, and colleagues (e.g., ordinal alpha, ordinal G-theory) rather than the other studies cited about using SEM in G-theory contexts. Zumbo et al. do seem to deemphasize the imaginary nature of reliability indices based on the LRV scale. However, to avoid misrepresenting them, I recommend that you review Zumbo and Kroc (2019)’s response to Chalmers (2018). For other authors cited, who applied SEM to G-theory applications, I believe it would be more accurate to say that their studies have limitations rather than implying that they are flawed. I say more about this later.

 

Zumbo, B. D., & Kroc, E. (2019). Measurement is a choice and Stevens' scales of measurement do not help make it: A response to Chalmers. Educational and Psychological Measurement, 79(6), 1184–1197.

 

  2. Abstract. First, I do not think it is accurate to say that previous studies give the impression that SEMs are limited to producing G-coefficients. For example, Raykov and Marcoulides (2006) say the following in footnotes 1 and 2.

 

Footnote 1: "Marcoulides (2000b) also demonstrated how the SEM approach can be used to estimate absolute generalizability coefficients for various one- and multifaceted designs."

 

Footnote 2: "Marcoulides (2000b) illustrated that instead of analyzing the relations among variables, for which only variance components for persons and any interactions with persons can be estimated, analyzing the matrix of correlations among persons leads to the variance components for the other facets. As such, all potential sources of measurement error in a design can be estimated."

 

Marcoulides, G. A. (2000b, March). Generalizability theory: Advancements and implementations. Invited colloquium presented at the 22nd Language Testing Research Colloquium, Vancouver, Canada.

Second, I recommend deleting that part of the sentence above and following it with something like: "In this study, I expanded SEM models to allow for calculation of dependability coefficients in addition to generalizability coefficients by placing constraints on the mean structure of models using the R package lavaan."

Third, I also would delete the claim that you argue that coefficients from LRV reflect hypothetical reliability because Chalmers (2018), Vispoel et al. (2019), and others have made the same point. That is, you are giving the impression that you are saying something new when this is not the case.

Finally, I was pleased to see you mention creation of reliability coefficients for ordinal scale composites, but disappointed that such composites were not used to derive cut-score specific D-coefficients and may not represent real scales that could be used in practice.

  3. Line 17. Cronbach did not develop G-theory soon after discussing coefficient alpha. His article about alpha was published in 1951 (more than 20 years earlier). In addition, Cronbach himself did not take credit for creating alpha and felt embarrassed that people referred to it as Cronbach's alpha (see Cronbach & Shavelson, 2004, p. 397).

 

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of a test. Psychometrika, 16, 297–334.

Cronbach, L. J., & Shavelson, R. J. (Ed.). (2004). My current thoughts on coefficient alpha and successor procedures. Educational and Psychological Measurement, 64(3), 391–418. 

 

  4. Line 22. It is incorrect to say that an arbitrary number of measurement facets are identified in G-theory studies. Selection of the number of facets is based on careful consideration of the primary sources of measurement error that would affect scores in particular contexts. This is especially true when designing G-studies.

 

  5. Line 27. I would delete the word “unnecessary” as it has a pejorative connotation.

 

  6. Lines 28-30. As I noted before, others including Raykov and Marcoulides (2006) and Ark (2015) have pointed out that SEMs can be used to derive variance components for calculating absolute error. An appealing part of your study is that you actually demonstrate how to do it.

 

Ark, T. K. (2015). Ordinal generalizability theory using an underlying latent variable framework [Unpublished doctoral dissertation, University of British Columbia]. Retrieved from https://open.library.ubc.ca/collections/ubctheses/24/items/1.0166304.

 

  7. Lines 31-33. The main reason for G-coefficients is to estimate reliability for making decisions based on relative rather than absolute differences in scores.

 

  8. Paragraph starting on line 34. If you plan to keep global D-coefficients in the paper, then you need to make a more compelling case for them. I personally have not found them to be very useful because classification decisions are based on cut-scores. They also represent the lowest possible value for a D-coefficient across the scale continuum. Providing cut-score specific D-coefficients on a tangible ordinal scale metric would be a real contribution.

 

  9. Paragraph starting on Line 44. Vispoel et al. (2019) emphasize that G-coefficients for LRVs do not represent reliability estimates for observed or ordinal scores. They note that LRV indices represent estimates of reliability if we could eliminate scaling irregularities and further emphasize that LRV analyses are most appropriately used to evaluate effects of scale coarseness and disattenuating correlation coefficients for multiple sources of measurement error. In other words, your previous point about such analyses representing hypothetical reliability is clearly made in that article and in others mentioned earlier.

 

  10. Lines 283-286. The modified omega coefficient from Green and Yang (2009) would not be much more useful than an LRV omega or alpha if it cannot be linked to an actual scale that could be used in practice.

 

  11. Table 2. I would recommend reducing the jargon with labeling of models here and elsewhere. I believe that the gtheory package in R is a standard G-theory package like those developed by Brennan (GENOVA, etc.), except that it uses restricted maximum likelihood rather than least-squares estimation to derive variance components. Do you need to say linear mixed model (LMM) or even normal? Wouldn’t the label LMM, as you use it here, describe standard ANOVAs that include both within- and between-subject factors as well? Also, is it accurate to use the label ordinal for some of these models? If you do, you are continuing to use a label you tacitly criticize. Vispoel et al. (2019) recommend avoiding the label ordinal altogether for analyses conducted on LRV metrics. They emphasize that LRV analyses primarily represent a vehicle for addressing scale coarseness that encompasses corrections for both limited numbers of response options and unequal intervals between those options. Using the label ordinal for some of the procedures you cover here perpetuates the confusion originally created by Zumbo and colleagues’ use of that label. It seems to me that your models could be more clearly distinguished in terms of the scale metric (raw score, LRV, etc.) and estimation procedure. Finally, why even include the Green and Yang procedure in this paper if you cannot use it to derive D-coefficients, given that doing so is the main purpose of your study?

 

  12. Line 336. Again, you are using the word critique when limitation is more appropriate. The cited authors noted the same limitations you mention.

 

  13. Paragraph starting on line 339. In your limitations section, points made about raters seem much too long in relation to everything else. They also seem out of place given that none of your designs included raters. Why not confine your interpretations to designs you did model? Although I have no strong feelings about this, I am not sure whether including the p by (i:o) design adds much to the paper given that variance components from the p by i by o design can be used to do partitioning for the p by (i:o) design.

Overall impression. In my view, the most noteworthy contribution of this paper is describing how to use SEM to get variance components related to absolute error. Researchers have said it is possible to do so but typically have not made the procedure explicit. This all works fine when we do G-theory analyses using raw score, interval-level data. You could revise the paper to just illustrate this using G-coefficients, D-coefficients, and cut-score specific D-coefficients and contrast the variance components you obtain from SEM to those derived from the gtheory package in R and/or other packages/estimation methods.

The problem with going beyond this to cover alternative D-coefficients is that such coefficients are most useful in decision making when referenced to specific cut-scores on a real scale that you can use operationally. Reporting D-coefficients for an ordinal composite only seems meaningful if you can create an ordinal composite scale that you could use in decision making. Otherwise, the coefficients you report seem nearly as hypothetical as ones on the LRV metric. If the new coefficients are telling us something important, then describe exactly what that is. How would that information affect decisions made from scores? Might the ordinal composite coefficients represent lower bounds in the sense that LRV coefficients might represent upper bounds? In other words, you need to build a stronger case for importance of these aspects of your study if you choose to retain them.

If you cannot build a more compelling case for the value of global D-coefficients and your ordinal versions of them, then I think you should limit the material to using SEMs to produce G-coefficients, global D-coefficients, and cut-score specific D-coefficients for observed scores and compare those results to ones obtained using standard G-theory procedures (ANOVA modeling, restricted maximum likelihood procedures from the gtheory package, etc.). Plots of cut-score specific D-coefficients might also be included. You could further extend results for your models to include partitioning for different types of measurement error within G-coefficients (see, e.g., Vispoel et al., 2018) and D-coefficients (see, e.g., Vispoel & Tao, 2013) to further diagnose reasons for differences in coefficients. That would seem enough for a publishable article.

Vispoel, W. P., Morris, C. A., & Kilinc, M. (2018). Applications of generalizability theory and their relations to classical test theory and structural equation modeling. Psychological Methods, 23, 1–26.

Vispoel, W. P., & Tao, S. (2013). A generalizability analysis of score consistency for the Balanced Inventory of Desirable Responding. Psychological Assessment, 25, 94-104.

 

Another thing to point out is that G-theory reliability coefficients are conservative in the sense that they reflect random rather than classical parallelism. I am sure you know that random parallelism is operationalized in an SEM by modeling essential tau-equivalent relationships. Another possible way to expand the analyses here would be to model both congeneric and essential tau-equivalent relationships (see, e.g., Vispoel et al., 2020). Also, ULS estimates would mimic results that would be derived from an ANOVA model. I would not expect meaningful differences here among ULS, ML, REML, and MLM estimates, but they are probably worth investigating. If they do not differ, you could mention that in a footnote.

Vispoel, W. P., Xu, G., & Kilinc, M. (2020). Expanding G-theory models to incorporate congeneric relationships: Illustrations using the Big Five Inventory. Journal of Personality Assessment. Advance online publication. https://doi.org/10.1080/00223891.2020.1808474

 

I hope you find my comments helpful in revising the paper. It could make a nice contribution to the research literature.

Comments for author File: Comments.pdf

Author Response

Last sentence first:

I hope you find my comments helpful in revising the paper. It could make a nice contribution to the research literature.

Thank you for the support and constructive criticism. I do find the revision to be an improvement over my original submission, and I hope you concur.

  1. Title. I recommend shortening the title to something like “Extensions of Structural Equation Modeling in Generalizability theory to Estimate Absolute Error.” I do this because I believe that the criticisms you make apply mostly to the work of Zumbo, Ark, and colleagues (e.g., ordinal alpha, ordinal G-theory) rather than the other studies cited about using SEM in G-theory contexts. Zumbo et al. do seem to deemphasize the imaginary nature of reliability indices based on the LRV scale. However, to avoid misrepresenting them, I recommend that you review Zumbo and Kroc (2019)’s response to Chalmers (2018). For other authors cited, who applied SEM to G-theory applications, I believe it would be more accurate to say that their studies have limitations rather than implying that they are flawed. I say more about this later.

Thank you for pointing me to Zumbo & Kroc (2019). I have changed the title to "How to Estimate Absolute-Error Components in Structural Equation Models of Generalizability Theory". I address the other points as they come up below.

  2. Abstract. First, I do not think it is accurate to say that previous studies give the impression that SEMs are limited to producing G-coefficients. Second, ... 

Indeed, I cited this information in my own footnote, but I revised the text of the Abstract accordingly: "Proposals for estimating absolute-error components have given the impression that a separate SEM must be fitted to a transposed data matrix... a single SEM can be specified to estimate absolute error (and thus dependability) by placing appropriate constraints on the mean structure, as well as thresholds (when used for ordinal measures)."
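As a sketch of what constraints on the mean structure can buy (illustrative only, not necessarily the constraint scheme used in the revised paper), lavaan's labeled intercepts and defined parameters allow an item-facet variance component to be expressed within a single fitted model; the variable names and the divisor below are assumptions:

```r
# Labeled item intercepts carry the item means, so a between-item
# variance component can be defined inside the same model.
library(lavaan)

mod <- '
  P  =~ 1*y1 + 1*y2 + 1*y3   # unit loadings (essential tau-equivalence)
  y1 ~ t1*1
  y2 ~ t2*1
  y3 ~ t3*1
  # naive between-item variance of the intercepts (divisor is an assumption):
  var_i := ((t1-(t1+t2+t3)/3)^2 + (t2-(t1+t2+t3)/3)^2 + (t3-(t1+t2+t3)/3)^2) / 3
'
fit <- cfa(mod, data = dat, meanstructure = TRUE)
parameterEstimates(fit)   # var_i appears among the defined parameters
```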

Third, I also would delete the claim that you argue that coefficients from LRV reflect hypothetical reliability because Chalmers (2018), Vispoel et al. (2019), and others have made the same point. That is, you are giving the impression that you are saying something new when this is not the case.

Finally, I was pleased to see you mention creation of reliability coefficients for ordinal scale composites, but disappointed that such composites were not used to derive cut-score specific D-coefficients and may not represent real scales that could be used in practice.

Given your recommendation to focus the paper on estimating absolute-error components, I have moved the discussion of Green & Yang's method to the Discussion (leading up to analogous D-coefs as an area for future development), so the updated Abstract no longer refers to the hypothetical-reliability issue.

  3. Line 17. Cronbach did not develop G-theory soon after discussing coefficient alpha. His article about alpha was published in 1951 (more than 20 years earlier). In addition, Cronbach himself did not take credit for creating alpha and felt embarrassed that people referred to it as Cronbach's alpha (see Cronbach & Shavelson, 2004, p. 397).

I was careful not to claim he proposed it, but merely that he discussed (and named) it in 1951 before publishing about GT in 1963.  Admittedly, a 12-year publication gap is not necessarily brief, so I deleted the word "soon".  The rest of the sentence remains because it conveys exactly what the reviewer communicated: that even Cronbach wanted to move beyond coefficient alpha, which is a useful rhetorical point in favor of his subsequent development of GT. 

  4. Line 22. It is incorrect to say that an arbitrary number of measurement facets are identified in G-theory studies. Selection of the number of facets is based on careful consideration of the primary sources of measurement error that would affect scores in particular contexts. This is especially true when designing G-studies.

"Abitrary" in this context merely meant that it doesn't matter how many facets there are. I have changed "arbitrary" to "(theoretically) unlimited".

  5. Line 27. I would delete the word “unnecessary” as it has a pejorative connotation.

The revised text does not contain that word.

  6. Lines 28-30. As I noted before, others including Raykov and Marcoulides (2006) and Ark (2015) have pointed out that SEMs can be used to derive variance components for calculating absolute error. An appealing part of your study is that you actually demonstrate how to do it.

Indeed, this was in a footnote of my original submission, although I cited only R&M (2006).  The penultimate paragraph of the Introduction now cites the "Q method" and explains how this paper contributes a single-model solution (see added text in blue font).

  7. Lines 31-33. The main reason for G-coefficients is to estimate reliability for making decisions based on relative rather than absolute differences in scores.

I have added the sentence: "Thus, a G-coef quantifies reliability of decisions based on relative rather than absolute differences in scores."
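For context, these are the standard definitions for a p × i design in Brennan's notation, with relative error entering the G-coefficient and absolute error entering the dependability coefficient:

$$
E\rho^{2} = \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \sigma^{2}_{pi,e}/n_{i}},
\qquad
\Phi = \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \left(\sigma^{2}_{i} + \sigma^{2}_{pi,e}\right)/n_{i}}.
$$

The item main-effect variance σ²_i appears only in the denominator of Φ, which is why estimating absolute error requires information from the mean structure.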

  8. Paragraph starting on line 34. If you plan to keep global D-coefficients in the paper, then you need to make a more compelling case for them. I personally have not found them to be very useful because classification decisions are based on cut-scores. They also represent the lowest possible value for a D-coefficient across the scale continuum. Providing cut-score specific D-coefficients on a tangible ordinal scale metric would be a real contribution.

I agree that the value of D-coefs is essentially lost without incorporating the criterion against which the scores are judged. I kept only global D-coefs in the original submission because they were simpler to calculate and cut-scores were (a) described adequately in the cited literature but (b) not a focus of the paper. However, I agree it is pedagogically useful to illustrate the use of cut-scores, however contrived they must necessarily be for these artificial data. Following Vispoel et al. (2019), I have included the relevant formulas (and syntax on OSF) to demonstrate calculating D-coefs for a hypothetical cut-score 2 SDs above the mean, regardless of the scale (observed normal, observed discrete, or LRV). I do not propose an extension of G&Y's method for D-coefs, but I do show how a D-coef for a cut-score 2 SDs above the mean on the LRV scale can be obtained in a less restrictive way than shown by Vispoel et al. (2019); see Section 2.4.2.
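For reference, the cut-score-specific dependability coefficient from the GT literature (e.g., Brennan), for cut-score λ, grand mean μ, and absolute-error variance σ²_Δ, is

$$
\Phi(\lambda) = \frac{\sigma^{2}_{p} + (\mu - \lambda)^{2}}{\sigma^{2}_{p} + (\mu - \lambda)^{2} + \sigma^{2}_{\Delta}},
$$

so the case above sets λ = μ + 2σ_p, and the global D-coef is the special (minimum) case λ = μ.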

  9. Paragraph starting on Line 44. Vispoel et al. (2019) emphasize that G-coefficients for LRVs do not represent reliability estimates for observed or ordinal scores. They note that LRV indices represent estimates of reliability if we could eliminate scaling irregularities and further emphasize that LRV analyses are most appropriately used to evaluate effects of scale coarseness and disattenuating correlation coefficients for multiple sources of measurement error. In other words, your previous point about such analyses representing hypothetical reliability is clearly made in that article and in others mentioned earlier.

This paragraph was removed, given the change in focus. This discussion (amply cited and expanded by the points you raised) is now mostly contained in Section 2.4.2 (see added text in blue font).

  10. Lines 283-286. The modified omega coefficient from Green and Yang (2009) would not be much more useful than an LRV omega or alpha if it cannot be linked to an actual scale that could be used in practice.

Correct, the link to the observed response scale is the added value of G&Y's modification. 

  11. Table 2. I would recommend reducing the jargon with labeling of models here and elsewhere. I believe that the gtheory package in R is a standard G-theory package like those developed by Brennan (GENOVA, etc.), except that it uses restricted maximum likelihood rather than least-squares estimation to derive variance components. Do you need to say linear mixed model (LMM) or even normal? Wouldn’t the label LMM, as you use it here, describe standard ANOVAs that include both within- and between-subject factors as well? Also, is it accurate to use the label ordinal for some of these models? If you do, you are continuing to use a label you tacitly criticize. Vispoel et al. (2019) recommend avoiding the label ordinal altogether for analyses conducted on LRV metrics. They emphasize that LRV analyses primarily represent a vehicle for addressing scale coarseness that encompasses corrections for both limited numbers of response options and unequal intervals between those options. Using the label ordinal for some of the procedures you cover here perpetuates the confusion originally created by Zumbo and colleagues’ use of that label. It seems to me that your models could be more clearly distinguished in terms of the scale metric (raw score, LRV, etc.) and estimation procedure. Finally, why even include the Green and Yang procedure in this paper if you cannot use it to derive D-coefficients, given that doing so is the main purpose of your study?

Calculating D-coefs was one of the stated goals of this paper, but given that the G&Y modification is more about capitalizing on existing software capabilities than a new development, I agree it is less distracting to focus on yet-unresolved issues about the value of coefficients on an LRV metric. I have moved all mention of the G&Y modification to the Discussion (Section 5.3), leading up to pointing out the potential value of its extension to D-coefs in future research. I have also updated Table 2 according to your other suggestions: the entries are organized by data properties (normal vs. discretized), response scale (observed vs. latent), and estimator. I added both D-coefs using a cut-score and mean-squares estimation.

  12. Line 336. Again, you are using the word critique when limitation is more appropriate. The cited authors noted the same limitations you mention.

This text no longer appears in the revised Discussion. Instead, the limitations with sparse data are illustrated in a more in-depth application to real multirater data (Section 4), following the Results but prior to the Discussion. The limitation is noted again in the Discussion (Section 5.2).

  13. Paragraph starting on line 339 (author: This must have been line 349). In your limitations section, points made about raters seem much too long in relation to everything else. They also seem out of place given that none of your designs included raters. Why not confine your interpretations to designs you did model? Although I have no strong feelings about this, I am not sure whether including the p by (i:o) design adds much to the paper given that variance components from the p by i by o design can be used to do partitioning for the p by (i:o) design.

I felt it was more complete to discuss both crossed and nested factors, to demonstrate the implications of nesting on disaggregating different sources of error. But I do point out the possibility of deriving nested-design coefs from crossed-design data (see added text in blue font on p. 9), referring the reader to Vispoel et al. (2019, pp. 161 and 168–169) for details.
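The derivation referred to follows from the confounding of effects under nesting: components estimated from the crossed p × i × o design sum to those of the nested p × (i:o) design,

$$
\sigma^{2}_{i:o} = \sigma^{2}_{i} + \sigma^{2}_{io},
\qquad
\sigma^{2}_{p(i:o)} = \sigma^{2}_{pi} + \sigma^{2}_{pio,e},
$$

with σ²_p, σ²_o, and σ²_po unchanged, so no second model needs to be fitted.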

Rather than attempting a brief discussion of this limitation, the revision now extends the multirater-data example in Section 4 as a fully worked example prior to the Discussion, to more concretely demonstrate the limitations of SEM using real data from a planned missing-data design that yielded extremely sparse data.

Overall impression. In my view, the most noteworthy contribution of this paper is describing how to use SEM to get variance components related to absolute error. Researchers have said it is possible to do so but typically have not made the procedure explicit. This all works fine when we do G-theory analyses using raw score, interval-level data. You could revise the paper to just illustrate this using G-coefficients, D-coefficients, and cut-score specific D-coefficients and contrast the variance components you obtain from SEM to those derived from the gtheory package in R and/or other packages/estimation methods.

The revised title reflects this focus, and I have added mean-squares (GENOVA) estimates to Table 2, as well as cut-score-specific D-coefs.

The problem with going beyond this to cover alternative D-coefficients is that such coefficients are most useful in decision making when referenced to specific cut-scores on a real scale that you can use operationally. Reporting D-coefficients for an ordinal composite only seems meaningful if you can create an ordinal composite scale that you could use in decision making. Otherwise, the coefficients you report seem nearly as hypothetical as ones on the LRV metric. If the new coefficients are telling us something important, then describe exactly what that is. How would that information affect decisions made from scores? Might the ordinal composite coefficients represent lower bounds in the sense that LRV coefficients might represent upper bounds? In other words, you need to build a stronger case for importance of these aspects of your study if you choose to retain them.

I did not propose any "new coefficients"; rather, I only proposed how to obtain long-existing ones in a single SEM.  I agree it is not obvious how the LRV interpretation could lend itself to comparing (hypothetical, unobserved) scores to an absolute criterion (as, ideally, a modification analogous to G&Y's would do).  The revision continues to address the limitations you mentioned, especially in the new Section 2.4.2.

I do, however, show how to calculate D-coefs for cut-scores 2 SDs from the mean (following Vispoel et al., 2019) across the combinations in Table 2, and the Discussion (Section 5.3) stresses the value of extending G&Y's method beyond a measure of consistency.

If you cannot build a more compelling case for the value of global D-coefficients and your ordinal versions of them, then I think you should limit the material to using SEMs to produce G-coefficients, global D-coefficients, and cut-score specific D-coefficients for observed scores and compare those results to ones obtained using standard G-theory procedures (ANOVA modeling, restricted maximum likelihood procedures from the gtheory package, etc.). Plots of cut-score specific D-coefficients might also be included. You could further extend results for your models to include partitioning for different types of measurement error within G-coefficients (see, e.g., Vispoel et al., 2018) and D-coefficients (see, e.g., Vispoel & Tao, 2013) to further diagnose reasons for differences in coefficients. That would seem enough for a publishable article.

Cut-score-specific D-coefs have been incorporated into the revision, and a plot of D-coefs across cut-scores has been added for the extended multirater-data (p × r design) example in Section 4. Thank you for the suggestions about illustrating the versatility shown by Vispoel et al. (2019), whom I now cite for details about how the nested-design coefs can be derived from the crossed design. Several sections of the revision now use terms for the different sources of error (e.g., specific factor, transient, rater, and random response), citing Vispoel et al. (2019) for further details about those calculations. 

Another thing to point out is that G-theory reliability coefficients are conservative in the sense that they reflect random rather than classical parallelism. I am sure you know that random parallelism is operationalized in an SEM by modeling essential tau-equivalent relationships. Another possible way to expand the analyses here would be to model both congeneric and essential tau-equivalent relationships (see, e.g., Vispoel et al., 2020).

Thank you for pointing out this recent study.  I have added this point/citation to the Discussion (Section 5.1; see also Footnote 4) as an additional advantage of the SEM framework for GT, although I did not pursue this in the examples because it is not the focus of the paper.

Also, ULS estimates would mimic results that would be derived from an ANOVA model. I would not expect meaningful differences here among ULS, ML, REML, and MLM estimates, but they are probably worth investigating. If they do not differ, you could mention that in a footnote.

In fact, the modeling framework turns out to make a bigger difference than the estimator within each framework. As I now point out in a footnote in the Results (p. 12): "It is noteworthy that the mixed-modeling framework yields identical (to the 5th decimal place) estimates using either (ordinary/unweighted) least-squares or REML estimation. Likewise, the SEM framework yields identical estimates using least-squares or REML estimation. However, the modeling frameworks do differ in the second or third decimal place because the discrepancy functions differ. The sum of squares or negative log-likelihood is minimized with respect to each row of data (i.e., observed vs. predicted casewise scores) in mixed models but with respect to summary statistics (i.e., observed vs. predicted means and covariance matrix) in SEM."
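To make the quoted footnote's contrast concrete, the SEM maximum-likelihood discrepancy is a function of summary statistics (sample mean vector x̄ and covariance matrix S of the p indicators) rather than of casewise residuals:

$$
F_{\mathrm{ML}}(\theta) = \log\lvert\Sigma(\theta)\rvert - \log\lvert S \rvert
+ \operatorname{tr}\!\left[S\,\Sigma(\theta)^{-1}\right] - p
+ \left(\bar{x} - \mu(\theta)\right)^{\top} \Sigma(\theta)^{-1} \left(\bar{x} - \mu(\theta)\right).
$$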

 

Round 2

Reviewer 2 Report

Additions made in the Introduction have now clarified why this paper could be useful and in which fields.

Additions in Methodology and Results sections have substantially improved the entire paper.

Changes made with respect to Advantages of SEM for GT and the limitations subsection in the Discussion section are well put.

The style of writing throughout the entire paper, which in its previous form did not read like an academic article, has been impressively improved.

Therefore, I suggest the publication of this work.

 

Author Response

Thank you for your positive feedback.  I am glad the revision meets your expectations.

Reviewer 3 Report

I believe that the author has provided a well-written, thoughtful, and responsive revision to the original paper. I am not sure whether the material about raters is needed but have no objection to it being included. I found it interesting, but readers unfamiliar with G-theory and SEM might find that material a bit challenging given the various convergence issues encountered. However, these problems do serve as examples of what others might encounter when they attempt to apply these techniques under similar conditions.

Overall, I support acceptance of the paper and need not review it again. I only have a few additional minor editorial suggestions.

Line 44. I would replace “but includes” with “plus”.

Line 53. I would replace “the misconception” with “a possible misconception”.

Line 63. If this is the first time you used the acronym SEs, I would spell it out, i.e., standard errors (SEs).

Line 121. I would replace “can be used” with “is”.

 

I congratulate the author on a fine piece of research that will make a nice contribution to the research literature on G-theory. I look forward to applying these techniques myself in future studies.

Author Response

I believe that the author has provided a well-written, thoughtful, and responsive revision to the original paper. I am not sure whether the material about raters is needed but have no objection to it being included. I found it interesting, but readers unfamiliar with G-theory and SEM might find that material a bit challenging given the various convergence issues encountered. However, these problems do serve as examples of what others might encounter when they attempt to apply these techniques under similar conditions.

Thank you for the positive feedback on the revision.  I agree about the real-world value of the example with rater data.  In fact, all 4 of the studies that I cited in that section (i.e., planned missing-data designs with sparse data) were projects that I was requested to consult on within a few short years, so I suspect this is more common than I would have originally guessed.  That was my motivation for including the example of real-world challenges to the SEM approach.

Overall, I support acceptance of the paper and need not review it again. I only have a few additional minor editorial suggestions.

Line 44. I would replace “but includes” with “plus”.

Done.

Line 53. I would replace “the misconception” with “a possible misconception”.

Done.

Line 63. If this is the first time you used the acronym SEs, I would spell it out, i.e., standard errors (SEs).

Done.

Line 121. I would replace “can be used” with “is”.

The text now reads "the more dependable the scale would be when making criterion-based decisions".

I congratulate the author on a fine piece of research that will make a nice contribution to the research literature on G-theory. I look forward to applying these techniques myself in future studies.

Thank you for helping me improve this paper, and good luck with your research.
