Article
Peer-Review Record

A Flexible Inventory of Survey Items for Environmental Concepts Generated via Special Attention to Content Validity and Item Response Theory

Sustainability 2024, 16(5), 1916; https://doi.org/10.3390/su16051916
by John A. Vucetich 1,*, Jeremy T. Bruskotter 2, Benjamin Ghasemi 3, Claire E. Rapp 4, Michael Paul Nelson 4 and Kristina M. Slagle 2
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 15 October 2023 / Revised: 6 February 2024 / Accepted: 14 February 2024 / Published: 26 February 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Review of “A flexible inventory of survey items for environmental concepts generated via special attention to content validity in item response theory”

This manuscript proposes several sets of items to measure environmental concepts following the authors' critique of previously published indexes. They also review IRT methods, which is helpful for readers who may have limited exposure to the approach, followed by item analyses, a second-order exploratory factor analysis, and regressions of belief, attitude, behavioral intention, and behavior indexes on concept indexes. This is a lengthy paper, and I found the level of detail to be uneven, with too much in some sections and too little in others. While I found many of the authors' critiques reasonably justified, some of the same critiques can be made of the concepts and analyses in this study. For example, their critique of Dunlap and colleagues' approach to identifying concepts within the revised NEP scale and gaining input from experts in the field is little different, in my view, from the "modified Delphi" approach that the authors used. I also suggest that the assertions in lines 203-205 are unfair and pejorative unless the authors have contacted Dunlap and colleagues to confirm that "two young scholars" overinterpreted "an idiosyncratic set of environmental writings".

The text in section 2.4 (lines 218-335) adds little to the central argument of the paper. In essence, it is a long-winded critique of a few of the scales reviewed by Cruz and Manata (2020). This section should be removed or edited down to the essential point made in its final paragraph (with the addition of the relevant citations).

The set of concepts identified by the authors was, in some instances, well-defined and justified with explanation and citations, while other concepts had only a one-sentence definition. This leads me to question the clarity and meaning of several concepts. In addition, one concern is whether adding subject experts would lead to additional concepts being measured or to a different set of concepts being defined. I often found myself disagreeing with the labels used for the concepts relative to their descriptions or disagreeing with the scope of the content of concepts. Issues with the definition of concepts subsequently led to issues with item wording and alignment with the concepts in the results section. I think the authors should acknowledge the limited set of authors and subject-matter consultants as a limitation of the paper.

Section 2.5, which discusses IRT, serves as a methods section. In the methods section, there is no report on testing of the assumption of unidimensionality or of any other assumptions of item response methods. The authors should also cite references for the selection of their approach to the analysis (in this case it appears to be the two-parameter IRT model, i.e., the graded response model for polytomous items). Although the choice of model was discussed briefly, some consideration of whether respondents have formed an opinion on the various constructs, or whether they have an informed basis for doing so, should also be addressed, as this might influence the amount of guessing on some items and adversely affect parameter estimates. There was no description of the statistical software or the procedures used to estimate the IRT models (one was provided for the second-order factor analysis) and, in my view, this should be included. In addition, I would encourage the authors to include more item-specific information in the supplemental section, such as category response curves and item information charts.
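For concreteness, here is a minimal sketch of how category probabilities are computed under the graded response model the authors appear to have used; the parameter values are illustrative only, not estimates from the paper:

```python
import numpy as np

def grm_category_probs(theta, a, thresholds):
    """Category probabilities under Samejima's graded response model.

    theta      : latent trait value(s)
    a          : item discrimination
    thresholds : ordered boundary parameters b_1 < ... < b_{K-1}
    Returns an array of shape (len(theta), K) giving P(X = k | theta).
    """
    theta = np.atleast_1d(theta)
    b = np.asarray(thresholds)
    # Boundary curves: P(X >= k) = logistic(a * (theta - b_k))
    p_star = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    # Pad with P(X >= lowest category) = 1 and P(X > highest) = 0, then
    # difference adjacent boundaries to get per-category probabilities.
    bounds = np.hstack([np.ones((theta.size, 1)), p_star,
                        np.zeros((theta.size, 1))])
    return bounds[:, :-1] - bounds[:, 1:]

# Hypothetical 4-category item: discrimination 2.2, thresholds -1.5, 0.0, 1.2.
probs = grm_category_probs(np.linspace(-3, 3, 7), a=2.2,
                           thresholds=[-1.5, 0.0, 1.2])
print(probs.round(3))  # each row sums to 1 across the four categories
```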

The authors describe criteria for assessing each item and for selecting 3-4 items for inclusion in the final measurement of each concept in section 5.1. I note that these criteria required judgment at times, that is, judgments based on "grammatical simplicity" or "post hoc evaluation" (lines 558-562). Given these criteria, I was unable to understand why item ST3 was selected over ST8. The latter had more discrimination and a higher threshold at the upper end of the scale, while the former was simply shorter in the number of words. This raises the question of how similar the parameters must be before the grammatical simplicity criterion applies. I suggest that category response curves and item information charts also be used in this decision-making process.
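To illustrate the suggestion, item information curves under the graded response model can be computed and compared directly; the parameter values below are hypothetical stand-ins for ST3 and ST8, not the paper's estimates:

```python
import numpy as np

def grm_item_information(theta, a, thresholds):
    """Fisher information of one graded-response item (Samejima's formula)."""
    theta = np.atleast_1d(theta)
    b = np.asarray(thresholds)
    p_star = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    bounds = np.hstack([np.ones((theta.size, 1)), p_star,
                        np.zeros((theta.size, 1))])
    probs = bounds[:, :-1] - bounds[:, 1:]   # category probabilities
    w = bounds * (1.0 - bounds)              # P*(1 - P*) at each boundary
    dprobs = a * (w[:, :-1] - w[:, 1:])      # dP_k / d(theta)
    return np.sum(dprobs**2 / np.clip(probs, 1e-12, None), axis=1)

theta = np.linspace(-3, 3, 121)
info_st3 = grm_item_information(theta, a=1.6, thresholds=[-1.8, -0.4, 0.9])
info_st8 = grm_item_information(theta, a=2.1, thresholds=[-1.5, 0.0, 1.6])
print("peak information at theta =",
      theta[info_st3.argmax()], "vs", theta[info_st8.argmax()])
```

Plotting the two curves would show where on the latent scale each item is most informative, which is precisely the evidence needed to adjudicate between two candidates with similar parameters.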

It also occurs to me that IRT methods can be affected by methodological artifacts. See Zhu and Lu (2017), who found evidence of differences between positively and negatively worded items in a set. Likewise, I suspect that different question types (e.g., Likert-type items, semantic differential items, etc.) could induce response effects within a set of items for a concept. Have the authors considered these issues, and what steps have been taken to address them?

In section 4, the authors report having a representative sample of the US population, but in the supplemental section they indicate that rural residents are overrepresented. I think the latter should be acknowledged in the main paper. In addition, the authors should report the extent of item nonresponse across the data set. If there was item nonresponse, then the authors should report how they addressed this issue.

I also noted that none of the revised NEP scale items were included in any of the item sets tested with IRT methods (one near exception was an item in the Fragility scale that mimicked item #13). This was surprising given that these items served as a key point in the rationale of the paper and have been widely used in the literature. The authors may want to review a study that applies IRT methods to the revised NEP scale items (Zhu and Lu 2017, https://doi.org/10.1016/j.jenvp.2017.10.005).

The items used to measure the concept of non-anthropocentrism in Table 11 were concerning in my view. It was surprising to me that the authors failed to recognize the possibility of respondents reacting to the items in multiple ways: general aspects of nature versus specific aspects; human benefits versus human needs. These dimensions confound the results and raise questions about whether the authors had clearly defined the concept prior to drafting the items for testing. There is evidence here that the authors violated the unidimensionality assumption of IRT. I also found the items used for the Animism concept to be limited in scope and, hence, not very informative. If the purpose is to distinguish living from non-living things, I would think a wider scope of items might prove more useful (i.e., things where there might be more debate over whether they are alive, or to what degree, and where this might also depend on scale, such as a puddle of water, a pond, a river in motion, or an ocean). The choice of items for some concepts is a limitation in my view. As mentioned above, the definition of concepts is problematic in my view and raises questions about which items are well-aligned with their respective construct and which are less well-aligned. This is most apparent for the items for Holism, which seem to address aspects of animism.

Section 5.2 displays results of the second-order factor analysis of the calculated concept scales. My concern with this analysis is two-fold: first, their earlier critique of Dunlap and colleagues' revised NEP scale included the observation that it was based on post hoc dimensions, and the authors do the same thing in their study. What were their a priori expectations about the relationships among the constructs? Second, a confirmatory factor model using both the items and the concepts would provide a much more rigorous test of the measures (and address my concern above about the alignment of items to concepts). The exploratory factor analysis results were not very informative or useful in my view. The authors should consider dropping this section in a future revision or switching to a CFA.

In section 5.3, I found the presentation of the results in the tables to be verbose. In addition, the best subsets regression approach is both atheoretical and less common than one that includes the entire set of concepts as predictors in a "net effects" model (based on type III sums of squares). I would prefer to see a single table with a full model (parameter estimates for all concepts) for each dependent variable in a column (issue interest, how humans treat nature, behavioral intent, and behaviors). This would facilitate comparison of the effects of the various concepts among the four outcomes, make it easier to discern patterns, if any, and reduce the total number of tables. I also think a list of the behaviors in the index should be included in the body of the paper.
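As a sketch of what that recommendation amounts to in practice, the four outcomes could each be regressed on the full set of concept indexes and the coefficients assembled into one table; the variable names and file below are hypothetical placeholders:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical concept and outcome labels; the actual names would come
# from the paper's thirteen concepts and four dependent variables.
concepts = ["hope", "fragility", "anthropocentrism",
            "animism", "stability", "holism"]          # ...and the rest
outcomes = ["issue_interest", "treatment_of_nature",
            "behavioral_intent", "behaviors"]

df = pd.read_csv("survey_scores.csv")                  # hypothetical file

# Fit one full ("net effects") model per outcome, all concepts entered at once.
fits = {y: sm.OLS(df[y], sm.add_constant(df[concepts]),
                  missing="drop").fit() for y in outcomes}

# Single coefficient table, outcomes as columns.
print(pd.DataFrame({y: fits[y].params for y in outcomes}).round(3))
```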

The authors admit that some of the concepts are measured by items that poorly distinguish respondents at the higher ends of the scale. While duplication of measures is to be avoided and limiting respondent fatigue by using shorter surveys is desirable, I can envision situations where additional items could improve the discrimination and reliability of a scale. The authors have chosen to create measures for a wide swath of concepts and, in doing so, have forgone testing a larger number of items per concept. Without further evidence, I am not convinced that the proposed measures represent an improvement and should be adopted by other researchers.

Additional comments: There were numerous typos and unclear statements in the text. Regarding the latter, line 80 mentions numbers assigned by Dunlap et al. (2000) without explaining that these refer to the item numbers in the set as opposed to some other value or meaning. Also, Table 7 lists five items but the text in line 603 mentions seven. Line 639 states that there was higher discrimination for item "D6 (3.0)" but Table 10 shows this to be item D9. I believe line 899 should be "insight" rather than "inside".

Lines 84-87 should include citations for the definitions unless these are the authors' own (this was unclear to me). Lines 120-121 should include citations for the "some published studies".

Regarding the discussion of non-anthropocentrism in lines 98-120, it appears to me that the authors are interjecting additional meaning beyond the actual words in the items being discussed. This strikes me as over-interpretation and, without empirical evidence that the general public applies these different meanings, it is not justified.

Information in table titles should be moved to notes at the bottom of the table (this applies to all the tables).

Comments for author File: Comments.pdf

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Please see attached file for full comments

Comments for author File: Comments.pdf

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The research contribution of this paper is the improvement of environmental belief measurement, which enhances its content validity and enriches its conceptual dimensions. The authors propose a list of 13 environmental concepts and develop 85 trial survey items. Building on theoretical research, these were tested through a survey of 449 residents. This paper is innovative in theory, improving the evaluation system; in practice, it is meaningful for promoting residents' pro-environmental behavior.

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

1. The researchers of the present study surveyed a panel of 449 adults (>18 years) residing in the United States. Appendix S2 is not found in the text; alternatively, the authors could provide a brief introduction or a summary table showing the research participants' demographic information.

2. The researchers developed a list of concepts with a panel of scholars, which formed the basis of the research variables in the present study. These concepts had expert validity, but how were their respective trial survey items developed? What sources were they based on?     

3. The discrimination parameter typically ranges from 0.5 to 2.0, and the item difficulty/threshold parameter typically ranges from −3.0 to +3.0. As the authors stated in the text, there is no hard rule, but it is still hard to understand the rules the researchers adopted to remove or retain the items for each variable. It seems the researchers were not consistent in their considerations.
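One way the authors could make their retention rule explicit is to state and apply a screening function; the cutoffs below simply restate the typical ranges mentioned above, and the item estimates shown are hypothetical:

```python
def flag_items(estimates, a_range=(0.5, 2.0), b_range=(-3.0, 3.0)):
    """estimates: {item: (a, [b1, b2, ...])} -> list of (item, reason)."""
    flags = []
    for item, (a, thresholds) in estimates.items():
        if not a_range[0] <= a <= a_range[1]:
            flags.append((item, f"discrimination {a} outside {a_range}"))
        for b in thresholds:
            if not b_range[0] <= b <= b_range[1]:
                flags.append((item, f"threshold {b} outside {b_range}"))
    return flags

# Hypothetical estimates; the unusually high discrimination gets flagged.
print(flag_items({"H2": (2.2, [-1.4, 0.1, 1.0]),
                  "Ani2": (4.8, [-0.9, 0.3, 3.6])}))
```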

4. Lines 577-581: "Hope. The three items with the highest DPs were H2 (DP=2.2), H3 (DP=2.2), and H4 (DP=1.8) (Table 3)." In Table 3, the DP of H3 is 1.5, not 2.2. Is this a typo? Besides, H5 has a higher DP, so why was it not adopted? (And so forth.) The DPs of Ani2, Stb3, and Hol2 are dramatically high, yet they were adopted?

5. All table titles have too many words. It is suggested that the authors separate each table title from its explanation. Explanations could be placed under the table as a "Note."

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

There are several aspects of the revised manuscript that continue to trouble me. First, the authors argue that previously published scales fail to demonstrate alignment between concepts (or sub-concepts) and the items used in their measurement based on fitting structural equation CFAs. At the same time, the authors resist the notion that their concepts and items should be subject to the same standard. That is, do the original 85 items or the final set of 42 items align with their respective concepts as the authors assert? Given the poor performance of a number of items in the IRT models, including one of the 3 used to measure Holism (Hol1), there is reason to believe there is a misalignment of items to concepts. In short, I didn't find the authors' response for not conducting a CFA to be compelling.

Second, the authors argue that their development of items occurred with specific reference to the related concept. However, the assumption of unidimensionality among the items should have been tested and reported. This could be done with an EFA for each set of a concept's items. I saw no evidence that this assumption was addressed prior to performing the IRT analysis. Furthermore, the authors' decision to split the anthropocentrism items into two sets is evidence of non-unidimensionality. Finally, the authors' addition of section 6.8 does not address this issue from the standpoint of meeting the assumptions of the method.
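The check described here is inexpensive to run before the IRT analysis; for instance, an eigenvalue screen on each concept's inter-item correlation matrix. A sketch with simulated responses standing in for the real data (polychoric correlations would be preferable for ordinal items):

```python
import numpy as np

def unidimensionality_screen(item_scores):
    """Return the first-to-second eigenvalue ratio of the item correlations.

    item_scores: (n_respondents, n_items) array of ordinal responses.
    A large ratio is commonly read as consistent with one dominant factor.
    """
    corr = np.corrcoef(item_scores, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
    return eigvals[0] / eigvals[1], eigvals

rng = np.random.default_rng(0)
fake = rng.integers(1, 6, size=(449, 5))   # hypothetical 5-item concept, n=449
ratio, eigvals = unidimensionality_screen(fake)
print(f"eigenvalue ratio: {ratio:.2f}", eigvals.round(2))
```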

Third, although the authors say they want to develop a set of items that are "good" measures of environmental concepts, the limitations of the study work to undermine its usefulness. First and foremost, the authors fail to follow a best practice in survey development, which is to use cognitive interviews to ascertain how the general public thinks about, and what their frame of reference is for, the concepts identified as being important. This is critical for a study of this type. Cognitive interviews and pre-testing would have identified problems with items that were confusing, not well understood, or too complicated to answer. In the authors' response to my earlier review, they detail conversations about the difficulty in developing items for Animism, and I think this response supports my assertion that the study should have included more qualitative procedures prior to the survey. I also disagree with the authors' response about why items from the NEP scale weren't included, given the widespread use of this scale for measuring key environmental concepts (I note the 1978 and 2000 papers are cited in Google Scholar over 5,100 and 7,700 times, respectively), which suggests to me that many scholars have found these items useful for their research. I also note that Fragility item F1 (Table 9), which is very similar to NEP item 13 (Table 1), was retained as a measure of that concept. I think this omission weakens the manuscript and is a limitation. In addition, I was astonished to see that the authors report they could develop only 3 items each for the concepts of Animism, Stability, and Holism. In my view, item development might have benefitted from talking with more people in the target population to get the full range of their thinking.

Fourth, the second-order factor analysis in section 5.2 assumes that the concepts are adequately differentiated and that the items measure them well. Measurement error will attenuate (and can distort) relationships. The IRT analyses indicated that several concepts relied on items that had limited ability to measure where respondents were at the upper end of the scale. With this said, I think the CFA that I advocated in comment 1 above would provide stronger grounds for examining item-to-concept and concept-to-concept associations, even when one treats this as an exploration of associations among the concepts (rather than as a formal test of the concepts' relationships). More importantly, the authors treat the results of the factor analysis in subsequent sections as a substantive finding (as discussed in section 6.3, lines 941-951). One additional note: Table S5, which contains correlations of the composite measures, should also include descriptive statistics for the composite measures.
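If the authors adopted this suggestion, the model could be specified along the following lines; this is a sketch assuming the semopy package, with hypothetical item and factor names (a full model would include all thirteen concepts):

```python
import pandas as pd
from semopy import Model, calc_stats

# Lavaan-style specification: first-order factors for each concept,
# one second-order factor over the concepts (names are placeholders).
desc = """
Hope      =~ H2 + H3 + H4
Fragility =~ F1 + F2 + F3
Holism    =~ Hol1 + Hol2 + Hol3
General   =~ Hope + Fragility + Holism
"""

df = pd.read_csv("item_responses.csv")   # hypothetical item-level data
model = Model(desc)
model.fit(df)
print(calc_stats(model))                 # fit indices such as CFI and RMSEA
```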

Fifth, I disagree with the authors' presentation of the regression results and their response to the reviewer in section 5.3. Tables 17-20 present many numbers, are difficult to navigate, and are incompletely discussed in the text. Using a combination of symbols and shading to denote significance levels adds unnecessary complexity; the standard symbols (+ for p<.1, * for p<.05, ** for p<.01, and *** for p<.001) should be used instead. More importantly, there is no text summarizing changes in the parameter estimates for a concept as other concepts are added to the model. As far as I can tell, the qualitative summary on page 25, lines 815-841, is based on the 8th model reported for each dependent variable. I stand by my earlier recommendation to present a single table with a full model (parameter estimates for all concepts) for each dependent variable in a column (issue interest, how humans treat nature, behavioral intent, and behaviors). This would facilitate comparison of the effects of the various concepts among the four outcomes, make it easier to discern patterns, if any, and reduce the total number of tables. One additional note: there is an inconsistency in the question stem in Table 17, which refers to "interests" (and not to "importance") while the text uses "importance"; this should be resolved for greater clarity.


Other comments:

Lines 229-233: I note that journals differ in the amount and content of the articles they publish, due to editorial policies, publication type (print versus online), time period, etc. I think it is presumptuous to say there was not "due attention"; it would be better to simply say that there is limited documentation available.

Line 271: How many items were included in the Weigel and Weigel scale? By the way, I stand by my earlier comment that section 2.4 could be considerably shortened, and I am not swayed by the authors' response on that point in the earlier review. Furthermore, three of the four scales reviewed were from the 1970s, which raises questions about a strawman critique.

Lines 481-497: The label "Doubting Others" is a misnomer in my view, and I encourage the authors to consider an alternative. The definition in line 481 does not use the term "doubt", and it indicates that human-nature relationships depend on the actions of others. To my mind, this refers more to beliefs about locus of control, power, and influence. On the other hand, the items in Table 4 refer to trust (or doubt) and to the efficacy of actors. There appears to be some misalignment between the definition of the concept and the items intended to measure it.

Line 574: The authors assert, without evidence, that the convenience sample does not compromise the study's conclusions. At the very least, citations are needed to support the argument that the relationships being studied are not affected by the unique attributes of the sample. Furthermore, there is considerable evidence that survey panel respondents can yield biased estimates of population parameters, and panels might include bogus respondents (there is a helpful Pew Research Center report on the latter).

Line 749: The authors state there was no cross-loading in the results in Table 15. I disagree. Nature’s Breadth loads .42 on Factor A and -.40 on Factor B. The strength of the loading is nearly equal for this concept on the two factors.

Lines 1128-1133: The recommendation assumes respondents are highly motivated, and many are not. Research on respondent heuristics and satisficing shows that many do not optimize their responses to questions (so researchers must create better items).

Lines 1139-1145: The authors are overstating the usefulness of the study, given the issues that, in my view, still need to be addressed.


There are a number of editorial issues:

Table 2 notes: Tables 16-19 should be 17-20. Also, I don't think it's clear what information in the note applies to subsequent tables (many direct the reader back to the note in this table).


Line 642: The error ("seven" should be "five") was not corrected.

Table 7, item Cm3: Is the word "content" correct? Should it be "contented"? And if not, might "content" have been misinterpreted by some respondents?

Table 14 note: Contains text duplicated from lines 720-723.

Line 772: The revision suggests that individual items are being used in the regressions rather than the composite measures. The original version was clearer.

Line 979: “inside” should be “insight” and was not corrected.

Author Response

Please see attachment.

Author Response File: Author Response.pdf

Round 3

Reviewer 1 Report

Comments and Suggestions for Authors

In reviewing the authors’ responses to my comments on the revision and the revised manuscript, I find few changes have been made and all of these are minor. The authors’ central argument is that the other three reviewers didn’t raise the same issues that I did, so it is more a matter of judgement (or editorial style), and they chose not to make more substantive changes that I recommended. I cannot know if the other reviewers considered the issues that I raised, or if they agreed/disagreed with them, or if they simply acquiesced to the authors’ assertions. At the end of the day, I do not support publication of the manuscript in its present form.

Author Response

We acknowledge the reviewer's judgment. From our perspective, there are no further revisions to make to the manuscript in light of the reviewer's comments.

We refer the editors of Sustainability to our response to the reviewer's second round of comments for our most recent substantive response to the comments of this reviewer.

We thank the reviewer for the time they invested in reviewing the manuscript.

Sincerely,

John Vucetich, on behalf of the co-authors
