Article
Peer-Review Record

EMC-PK2: An Experimental Observation Tool for Capturing the Instructional Coherence and Quality in Early Math Classrooms

by Luke Rainey *, Dale Clark Farran and Kelley Durkin
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Educ. Sci. 2024, 14(10), 1039; https://doi.org/10.3390/educsci14101039
Submission received: 18 June 2024 / Revised: 29 August 2024 / Accepted: 18 September 2024 / Published: 24 September 2024
(This article belongs to the Special Issue Teaching Quality, Teaching Effectiveness, and Teacher Assessment)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The authors report in this paper on a very much-needed tool for assessing the quality and coherence of instruction in the early elementary grades. The paper should definitely be published, although I have some suggestions—mostly minor.

1.       The introduction focuses on coherence, but the tool assessed the quality of math instruction as much as coherence. I suggest making it clear that the paper is about both quality and coherence, especially since coherence, in some respects, can be achieved through coherently bad instruction. I’m also not completely sure how the authors define coherence, although they explain it. For example, why is the direct instruction they describe on page 2 not coherent? The contrast between direct instruction and “effective mathematics teaching” described in the next paragraph seems to be related to quality. How are quality and coherence related? Is “instructional coherence” defined at the bottom of page 3 the same as “effective instruction”?

Given the focus on coherence in the introduction, it is surprising that the authors don’t return to the topic in the discussion section.

 

2.       I also suggest making a more convincing argument for the teaching approach the tool privileges. The authors have a lot of “shoulds” in their explanation of effective (in contrast to direct) teaching, but they don’t explain the basis for the shoulds. They could explain that teaching rules for solving problems, as is common in direct instruction, doesn’t build deep understanding or the conceptual knowledge that later mathematics learning relies on. Children learn that when they add 26 to 38 they need to carry the one. But they don’t necessarily understand that the “1” is actually 10 or that the “2” and “3” are actually 20 and 30. I suggest they also mention early on that the tool was created to align with NCTM standards (which they mention later).
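To spell out the place-value contrast in this example, an illustrative decomposition (not taken from the manuscript) is:

```latex
% "Carry the one" vs. an explicit place-value decomposition of the same sum
\begin{align*}
  26 + 38 &= (20 + 6) + (30 + 8) \\
          &= (20 + 30) + (6 + 8) \\
          &= 50 + 14 \\
          &= 64 \quad \text{(the carried ``1'' is the 10 inside 14; the ``2'' and ``3'' are 20 and 30)}
\end{align*}
```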

 

3.       I highly recommend coming up with a more memorable name for the measure than EMC-PK2; maybe just PK2-Math.

 

4.       The description of the tool on pages 6 and 7 is very clear, but it is complicated – lots of moving pieces. It might be helpful to have a table with a list of the items for each type of data (content, IMAs, Post ratings…). At the very least, the authors could refer in the description of the tool to Table 1 where the IMA and Post ratings are listed.

 

5.       The correlations and factor analyses were apparently done on individual IMA ratings, even though they were nested within teachers/classrooms. It would be useful to provide an explanation for that approach.

 

6.       The paper provides information on reliability, but it could do more to demonstrate the value of the tool. I suggest adding descriptive data – for example mean scores/ratings and SD by grade. This would give information on which practices were observed a lot versus a little and how the ratings varied across preschool and the early elementary grades.

 

Minor suggestions

1.       In addition to the math gap related to economic status (mentioned as a rationale for attention to math in the early grades on page 1), students in the U.S., overall, perform poorly in math relative to most other economically advanced countries.

 

2.       I suggest describing early on in the paper (maybe at the end of the “Introduction”) what the paper includes: 1) a review of extant measures of math teaching, and 2) a description and preliminary analyses related to a new measure.

 

3.       At the top of page 2 the authors explain the common approach to creating coherence – using the same math curriculum across grades and using a shared set of instructional strategies. They then explain that curricular coherence may not result in changes in instructional practices. But wouldn’t using a “shared set of instructional strategies” result in changes in instructional practices?

 

4.       I don’t understand the sentence in the first paragraph on page 2 “The systems [what systems?] that underpin instruction can be deeply embedded [embedded in what?] and interdependent making instructional change difficulty.”

 

5.       At the bottom of page 2: “…Blazar and Pollard [35] concluded:”  But there is nothing that follows.

 

6.       The beginning of page 3 refers to “These analyses” and “These activities” – but I don’t know what analyses or activities are being referred to.

 

7.       The authors claim that they focused on whether content was appropriate for grade level using the Common Core State Standards. But the measure includes preschool, which is not included in the Common Core standards.

 

8.       At the top of page 8 the authors claim that “some” of the observers had knowledge of early math. Do they mean specialized expertise? I would assume that since all of the observers had experience in early elementary classrooms they would all have some knowledge of early math. The sentence suggests that you don’t need knowledge of early math to use the tool, and I doubt that is the case.

 

9.       The section labeled “Data collection” at the bottom of page 7 is really more about training observers. I thought it was odd to talk about the observations without any information on where the study was done. That information is provided later. I suggest moving the label “Data Collection” to 6.3 (replacing “COHERE observation”).

 

10.   A more accurate explanation of TK eligibility is turning four by Sept 1. I would also say that it is the year before kindergarten, so readers don’t think it might be some kind of replacement of kindergarten.

 

11.   I’m not sure of the value of the first paragraph after “7. Understanding the Measure” on page 9. I guess there might be some value in sharing the process of getting the tool to its current form, but it could just be confusing.

 

12.   What was the rationale for developing subscales (bottom of page 9)?

 

13.   In the bottom paragraph on page 10 the authors refer to “instructions being clear and sufficient.” I don’t see that item in any of the lists of IMA ratings.

 

14.   I’m not sure “teacher responsiveness” is the best label for the four items that load on the factor in the factor analysis, but I’m not able to come up with a better one. Only one of the items involves responding; they seem more to represent the teacher actively engaging in mathematical thinking with students.

 

15.   In the section on Reliability (page 11) the authors introduce another measure – practices students engaged in during each IMA. It seems that this set of items should be brought up in the description of the tool. They don’t include them in any analyses, so it’s not clear what to make of these codes.

 

16.   The discussion section could benefit from two subheadings: “Challenges” (the first few paragraphs that summarize some of the challenges encountered in creating the observation tool) and “Benefits” (the paragraphs delineating the unique and valuable qualities of the tool).

 

17.   The sentence close to the bottom of page 14 “…few researchers may want to be able to commit the resources necessary to provide more accurate descriptions of classroom mathematics processes” is a bit of a downer. I suggest eliminating it and stressing the value of a tool that provides accurate descriptions of mathematics teaching and learning.

 

 

 

Author Response

  1. The introduction focuses on coherence, but the tool assessed the quality of math instruction as much as coherence. I suggest making it clear that the paper is about both quality and coherence, especially since coherence, in some respects, can be achieved through coherently bad instruction. I’m also not completely sure how the authors define coherence, although they explain it. For example, why is the direct instruction they describe on page 2 not coherent? The contrast between direct instruction and “effective mathematics teaching” described in the next paragraph seems to be related to quality. How are quality and coherence related? Is “instructional coherence” defined at the bottom of page 3 the same as “effective instruction”?

Given the focus on coherence in the introduction, it is surprising that the authors don’t return to the topic in the discussion section.

We appreciate this suggestion and recognize that our previous way of describing quality and coherence was unclear. We have rewritten the introduction by focusing on instructional quality, which includes coherence as an element, and have consolidated elements from sections 2 and 3 to clarify how these ideas are connected. We have moved the definitions of “Coherence-3” into section 2 to further clarify the working definition of coherence that guided our thinking during this project and its importance to high-quality mathematics instruction.

 

  2. I also suggest making a more convincing argument for the teaching approach the tool privileges. The authors have a lot of “shoulds” in their explanation of effective (in contrast to direct) teaching, but they don’t explain the basis for the shoulds. They could explain that teaching rules for solving problems, as is common in direct instruction, doesn’t build deep understanding or the conceptual knowledge that later mathematics learning relies on. Children learn that when they add 26 to 38 they need to carry the one. But they don’t necessarily understand that the “1” is actually 10 or that the “2” and “3” are actually 20 and 30. I suggest they also mention early on that the tool was created to align with NCTM standards (which they mention later).

 

We appreciate this suggestion and have revised lines 102-11 to now state, “Effective mathematics teaching focuses on problem solving and reasoning rather than heavily focusing on computational skills [25, 26, 31]. Students are encouraged to think of themselves as mathematical thinkers with tasks promoting high-level student thinking and discussions about students’ reasoning and strategies [32, 25, 33]. Teachers ask purposeful, open-ended questions to extend students’ thinking, explore their strategies, and help them make sense of important mathematical concepts and procedures [25, 34]. Teachers help facilitate mathematical discourse between students [25]. Students have shown improved mathematics achievement when they have the opportunity to engage in detailed ways with another student’s ideas or to have another student engage with their ideas [35, 36].” We have also revised section 2 to further describe why these aspects of high-quality mathematics teaching are important.

 

  3. I highly recommend coming up with a more memorable name for the measure than EMC-PK2; maybe just PK2-Math.

 

This name is already associated with previous dissemination efforts of the tool, including presentations and other documentation, so it would be difficult to change it at this point. But we acknowledge that it is not the most memorable name.

 

  4. The description of the tool on pages 6 and 7 is very clear, but it is complicated – lots of moving pieces. It might be helpful to have a table with a list of the items for each type of data (content, IMAs, Post ratings…). At the very least, the authors could refer in the description of the tool to Table 1 where the IMA and Post ratings are listed.

 

We agree that there are a lot of moving pieces to describe with this tool. We have added a table with the list of item categories at the beginning of section 4 as suggested (now Table 1). We have also tried to reorganize section 4 with subheadings to more clearly describe the different parts of the tool.

 

  5. The correlations and factor analyses were apparently done on individual IMA ratings, even though they were nested within teachers/classrooms. It would be useful to provide an explanation for that approach.

We took this approach to make the presentation of the findings more understandable, and we followed a similar approach to Agodini et al. (2010) in conducting the exploratory factor analyses at the IMA level without nesting. We now justify this in the manuscript. We also found the same factor structure when looking at the data at the IMA level, observation level, and teacher level, suggesting that these factors were stable across units of analysis.
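As a rough illustration of the stability check described here, the sketch below fits an exploratory factor analysis at the IMA level and again on data aggregated to the observation and teacher levels. The file name, column names, and four-factor assumption are hypothetical placeholders, not the authors' actual analysis code.

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer  # third-party package: factor_analyzer

# Hypothetical long-format file: one row per IMA, one column per rated item.
ima = pd.read_csv("ima_ratings.csv")
item_cols = [c for c in ima.columns if c.startswith("item_")]

def efa_loadings(frame: pd.DataFrame, n_factors: int = 4) -> pd.DataFrame:
    """Fit an exploratory factor analysis and return the item-by-factor loadings."""
    fa = FactorAnalyzer(n_factors=n_factors, rotation="oblimin")
    fa.fit(frame[item_cols])
    return pd.DataFrame(fa.loadings_, index=item_cols)

# Fit the EFA at three units of analysis and compare the loading patterns.
loadings_ima = efa_loadings(ima)
loadings_obs = efa_loadings(ima.groupby("observation_id")[item_cols].mean())
loadings_teacher = efa_loadings(ima.groupby("teacher_id")[item_cols].mean())
```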

 

  6. The paper provides information on reliability, but it could do more to demonstrate the value of the tool. I suggest adding descriptive data – for example mean scores/ratings and SD by grade. This would give information on which practices were observed a lot versus a little and how the ratings varied across preschool and the early elementary grades.

 

We agree that the descriptive data are very informative to demonstrate the value of the tool, and we have added section 6.3.1 to use descriptive data to summarize the indicators of high-quality instruction across the grade levels with a brief discussion about what these data suggest.
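A minimal sketch of how such grade-level descriptives could be tabulated, again assuming a hypothetical long-format ratings file with a grade column:

```python
import pandas as pd

# Hypothetical long-format file: one row per IMA, with grade level and item ratings.
ima = pd.read_csv("ima_ratings.csv")
item_cols = [c for c in ima.columns if c.startswith("item_")]

# Mean and SD of each rated item by grade (e.g., Pre-K, TK, K, 1st, 2nd).
desc_by_grade = ima.groupby("grade")[item_cols].agg(["mean", "std"]).round(2)
print(desc_by_grade)
```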

 

Minor suggestions

  1. In addition to the math gap related to economic status (mentioned as a rationale for attention to math in the early grades on page 1), students in the U.S., overall, perform poorly in math relative to most other economically advanced countries.

 

Thank you for this suggestion. We have added to lines 30-32 that “Overall, students in the U.S. have consistently performed significantly below the international average in mathematics achievement tests” as an additional rationale.

 

  2. I suggest describing early on in the paper (maybe at the end of the “Introduction”) what the paper includes: 1) a review of extant measures of math teaching, and 2) a description and preliminary analyses related to a new measure.

 

We agree that this would make the paper easier to follow, and we have added a clearer description of the aims of the paper and its sections in lines 73-81: “In this paper, we are guided by the broad questions: What classroom practices define high quality early math instruction in the early grades? How can these practices be captured through a valid observation tool? Using this observation tool, how does the quality of math instruction change across the early grades? We will share our review of extant measures of early mathematics teaching and learning which led us to develop a new framework for observing early math across Pre-K through 2nd grade. We will then describe and offer preliminary analyses of this new measure using an argument-based approach to validity. Finally, we will discuss the implications of this work and the future questions that it suggests for the field.”

 

  3. At the top of page 2 the authors explain the common approach to creating coherence – using the same math curriculum across grades and using a shared set of instructional strategies. They then explain that curricular coherence may not result in changes in instructional practices. But wouldn’t using a “shared set of instructional strategies” result in changes in instructional practices?

 

We have tried to clarify that just introducing aligned instructional strategies alone is unlikely to be sufficient and have added the following to lines 52-56: “However, curricular coherence alone may not result in changes in instructional practices, and children may still experience incoherent learning experiences between different classrooms, grades, and schools. Teachers may still need the professional support to adjust their well-established ways of teaching and time to understand and incorporate the new instructional strategies into their practice.”

  4. I don’t understand the sentence in the first paragraph on page 2 “The systems [what systems?] that underpin instruction can be deeply embedded [embedded in what?] and interdependent making instructional change difficulty.”

 

We agree this needed to be clarified, and we have revised the sentence to now read “The systems that underpin instruction can be deeply rooted in local norms and regulations and held in place by interdependent structural mechanisms, making instructional change difficult [22].”

 

  5. At the bottom of page 2: “…Blazar and Pollard [35] concluded:” But there is nothing that follows.

Thank you for catching this issue. It may be due to formatting, and we have attempted to fix it so that the indented quote in lines 116-120 reads, “These analyses point to benefits of teaching practices in two key areas. The first is active mathematics, in which teachers provide opportunities for hands-on participation, physical movement, or peer interaction. These activities overlap with ambitious teaching techniques that often make use of manipulatives and tactile activities in the service of building conceptual understanding (p. 3).”

 

  6. The beginning of page 3 refers to “These analyses” and “These activities” – but I don’t know what analyses or activities are being referred to.

 

Please see above; this text was part of the indented quote, following “Analyzing the teaching practices of 55 4th and 5th grade teachers, Blazar and Pollard [35] concluded…” We hope the formatting issue has now been resolved.

 

  7. The authors claim that they focused on whether content was appropriate for grade level using the Common Core State Standards. But the measure includes preschool, which is not included in the Common Core standards.

In what are now lines 132 and 307-308, we referenced the California Pre-K Foundations as the preschool standards we used. Previously, the reference appeared in one place but not the other, so we have now added it in both places the standards are mentioned.

  8. At the top of page 8 the authors claim that “some” of the observers had knowledge of early math. Do they mean specialized expertise? I would assume that since all of the observers had experience in early elementary classrooms they would all have some knowledge of early math. The sentence suggests that you don’t need knowledge of early math to use the tool, and I doubt that is the case.

 

Thank you for noticing that this sentence needed clarification. In lines 346-348, we revised the description to state that “Data collectors were recruited from a pool of graduate students and former classroom teachers, all with significant experience in elementary classrooms, and some with specialized expertise in early math.” 

  9. The section labeled “Data collection” at the bottom of page 7 is really more about training observers. I thought it was odd to talk about the observations without any information on where the study was done. That information is provided later. I suggest moving the label “Data Collection” to 6.3 (replacing “COHERE observation”).

 

Following the thoughtful suggestions of the reviewers, we have greatly restructured the paper. This is now part of section 5.1 Training Observers. 

  10. A more accurate explanation of TK eligibility is turning four by Sept 1. I would also say that it is the year before kindergarten, so readers don’t think it might be some kind of replacement of kindergarten.

 

Our understanding is that the TK enrollment criteria have changed since our data collection took place, and we have simplified this description to indicate that it precedes kindergarten. We have revised what are now lines 398-402 to state, “Transitional kindergarten, or TK, is a California publicly-funded program for young learners before kindergarten. Since our data collection, California has adopted a universal TK approach that effectively replaces Pre-K, but during our data collection, TK covered a smaller age range and existed alongside Pre-K.”

  11. I’m not sure of the value of the first paragraph after “7. Understanding the Measure” on page 9. I guess there might be some value in sharing the process of getting the tool to its current form, but it could just be confusing.

 

Following reviewer suggestions, we have greatly reorganized and condensed this information.

  12. What was the rationale for developing subscales (bottom of page 9)?

 

Following the argument-based validity framework suggested by reviewers, we now frame the development of the subscales as part of the generalization step, in which the items in the EMC-PK2 are used to generate overall scores that accurately (and stably) represent the quality of the mathematics instruction observed.

           

  13. In the bottom paragraph on page 10 the authors refer to “instructions being clear and sufficient.” I don’t see that item in any of the lists of IMA ratings.

 

Thank you for catching this error. It has been revised to be correctly labeled as “student participation.”

 

  14. I’m not sure “teacher responsiveness” is the best label for the four items that load on the factor in the factor analysis, but I’m not able to come up with a better one. Only one of the items involves responding; they seem more to represent the teacher actively engaging in mathematical thinking with students.

     

We agree that a revised label would better represent what this subscale measures, and we have changed the label to “Teacher Facilitation.”

 

  15. In the section on Reliability (page 11) the authors introduce another measure – practices students engaged in during each IMA. It seems that this set of items should be brought up in the description of the tool. They don’t include them in any analyses, so it’s not clear what to make of these codes.

 

We have reorganized the section describing the pieces of the tool, including the addition of Table 1, and have restructured the validity argument to make the content of these codes clearer, as the previously provided descriptions of the practices were easy to miss.

 

  16. The discussion section could benefit from two subheadings: “Challenges” (the first few paragraphs that summarize some of the challenges encountered in creating the observation tool) and “Benefits” (the paragraphs delineating the unique and valuable qualities of the tool).

 

We appreciate this suggestion and have subdivided the discussion section to include those two headings. 

 

  17. The sentence close to the bottom of page 14 “…few researchers may want to be able to commit the resources necessary to provide more accurate descriptions of classroom mathematics processes” is a bit of a downer. I suggest eliminating it and stressing the value of a tool that provides accurate descriptions of mathematics teaching and learning.

 

We are thankful for this suggestion and have revised the discussion section to eliminate this sentence and stress the value of the tool in the new Benefits subsection.

Reviewer 2 Report

Comments and Suggestions for Authors

Thank you for the opportunity to review “EMC-PK2: An experimental observation tool for capturing the instructional coherence and quality in early math classrooms.” In this manuscript, the authors state their purpose as exploring the development of a classroom observation tool for coherent mathematics instruction in early childhood classrooms (PK-2) developed as part of a large-scale longitudinal study. The authors rightly state that most available classroom observation instruments focus on elementary grades and are not specific to early childhood mathematics. They describe the development of the instrument (EMC-PK2) and provide some psychometrics to demonstrate its constructs and their reliability. Specifically, they draw on correlations and exploratory factor analyses to describe subscales and make the case for the utility of the instrument. I appreciated the focus on the development process and the focus on early childhood settings. Observation instruments have proliferated in mathematics education in recent years, and there is often not sufficient evidence presented regarding their validity and reliability.

 

While I appreciate the careful work the authors have done here, my main concern is that I am not sure that, in its current form, the manuscript does enough to motivate the need for the instrument, its potential uses, or the implications of the validity argument. I encourage the authors to consider a framing for this manuscript that places the validity evidence in a broader context and/or makes a broader argument, using this evidence as illustrative of that argument. One way to do that would be to have clear research questions guiding the manuscript, something that the current version does not have. Another way to do that would be to frame the manuscript as a fuller validity argument for the EMC-PK2 (see, for example, Bell et al., 2012 for a discussion of this). Regardless of approach, I encourage the authors to consider what the contribution of this analysis is beyond describing the tool and presenting limited psychometric evidence about subscales. In order to do this, I think the authors will need to engage more directly with both classroom observation literature and what high quality mathematics teaching means in early childhood settings.

 

I provide some additional questions/suggestions for consideration:

The introduction focuses on instructional coherence broadly and from a policy perspective. While instructional coherence is critical and policy approaches should consider it, the authors are clear that what they are describing in EMC-PK2 is a research tool that measures instructional practice in early childhood mathematics. I think the authors need to do more to connect the instructional coherence framing to the instrument itself. To do this would require a deeper engagement with literature around mathematics teaching and learning in early childhood specifically. What are the instructional practices and learning opportunities that make early childhood mathematics distinct? The authors describe math instructional coherence quite broadly on lines 83-96, acknowledging that this definition holds across grade levels. A robust discussion of mathematics instruction in early childhood settings is missing from the literature review. Section 3 claims the need for a measure of high quality teaching in early grade mathematics, but the authors do not make the case for why this might look different in early childhood - what are the practices that go unmeasured? This matters because there has been work done in this area and in fact observation instruments such as MQI have been validated for use in Kindergarten classrooms (see for example Mantzicopoulos et al., 2018). 

 

While the authors describe their review of classroom observation instruments (section 4) as a way to motivate and describe the creation of EMC-PK2, they focus mostly on their own evaluations of these tools for their needs. They only minimally engage with literature on classroom observation instruments in mathematics, of which there is a good deal in recent years, including multiple special issues of journals such as ZDM and SEE. I encourage the authors to consider how their work fits into this broader conversation. They might begin with some of the following: Bostic et al., 2021; Bell & Gitomer, 2023 (and the related special issue); Charalambous & Praetorius, 2018 (and articles from the related special issue), 2020; Praetorius & Charalambous, 2018; Schlesinger & Jentsch, 2016; among others.

 

Another area that needs more attention in a revision of this manuscript is the description of the sample and methods. The authors describe data collection but give minimal attention to the sample. Perhaps because there were not explicit research questions stated, I was left with questions about both the sample and methods used. If the goal is to introduce the instrument, I suggest a more concrete and organized description of its components and scoring, perhaps in a table. For example, while the authors state multiple times that PK classrooms organize instruction differently than elementary grade classrooms and as such the scoring is done differently, they do not describe how (I wondered how raters code and score 3 hours of teaching, for example). The authors also state that data was analyzed from the COHERE project but give only minimal information regarding the schools, classrooms and teachers (how were they recruited? To what extent do they represent classrooms in their respective schools? etc.). The authors allude to more information in other papers in progress, but some baseline information is needed in this paper as well.

 

Other comments/suggestions:

  • The authors cite subject-matter coherence as a key component of early childhood mathematics instruction. They use Common Core standards as an indicator of this (line 131) but these standards do not focus on pre-K so I wanted a bit more information about how to think about this in PreK settings.

  • The statement on lines 169-171 is incorrect. The authors state that the MET project developed the Framework for Teaching measure. This is not accurate. FFT was developed by Danielson (2007). It was not developed by or for the MET project, though it was one of multiple measures used in the MET project.

  • I found the descriptions of the aspects of focus in sections 4 and 5 difficult to follow. In particular, it seems that COEMET and Advanced Narrative were key to the development of EMC-PK2 but also met much of the criteria. Why were they not simply used and adapted? How were they used? I also found the amount of acronyms in these sections challenging to the readability of the manuscript. I suggest explaining more about the substance of the EMC-PK2, perhaps by providing examples. This relates to the broader goal of the manuscript, but I also wondered about how much design detail was needed. I wanted to better understand what was being measured (e.g., through IMA and POSTS).

  • The four subscales presented (teacher responsiveness, student engagement, differentiation, and classroom atmosphere) are ultimately not all that different from scales in existing measures and it was not clear to me how they are unique to or specific to early childhood settings. If these are the main findings of the analysis, I suggest the authors do more to connect these claims to a broader argument and the literature. Indeed, the main claim that calm, well-organized classrooms may not have mathematical depth is one that underlies a number of hybrid observation instruments, which is why they measure these constructs together or why researchers layer mathematics-focused instruments together with more general instruments.

  • I wondered how the authors dealt with items that did not load well onto factors. Do they substantively measure something important? (Big math idea connected activities, for example, seem important for early childhood mathematics, where other research has shown that classrooms can contain time in mathematics lessons spent on non-mathematical activities such as cutting or coloring).

 

References:

Bell, C. A., & Gitomer, D. H. (2023). Building the field’s knowledge of teaching and learning: Centering the socio-cultural contexts of observation systems to ensure valid score interpretation. Studies in Educational Evaluation, 78, 101278.

 

Bell, C. A., Gitomer, D. H., McCaffrey, D. F., Hamre, B. K., Pianta, R. C., & Qi, Y. (2012). An argument approach to observation protocol validity. Educational Assessment, 17(2-3), 62-87.

 

Bostic, J., Lesseig, K., Sherman, M., & Boston, M. (2021). Classroom observation and mathematics education research. Journal of Mathematics Teacher Education, 24, 5-31.

 

Charalambous, C. Y., & Praetorius, A. K. (2018). Studying mathematics instruction through different lenses: Setting the ground for understanding instructional quality more comprehensively. ZDM, 50, 355-366.

 

Charalambous, C. Y., & Praetorius, A. K. (2020). Creating a forum for researching teaching and its quality more synergistically. Studies in Educational Evaluation, 67, 100894.

 

Danielson, C. (2007). Enhancing professional practice: A framework for teaching. ASCD.

 

Mantzicopoulos, P., French, B. F., & Patrick, H. (2018). The mathematical quality of instruction (MQI) in kindergarten: An evaluation of the stability of the MQI using generalizability theory. Early Education and Development, 29(6), 893-908.

 

Praetorius, A. K., & Charalambous, C. Y. (2018). Classroom observation frameworks for studying instructional quality: looking back and looking forward. ZDM, 50, 535-553.

Schlesinger, L., & Jentsch, A. (2016). Theoretical and methodological challenges in measuring instructional quality in mathematics education using classroom observations. ZDM, 48, 29-40.

Author Response

Thank you for the opportunity to review “EMC-PK2: An experimental observation tool for capturing the instructional coherence and quality in early math classrooms.” In this manuscript, the authors state their purpose as exploring the development of a classroom observation tool for coherent mathematics instruction in early childhood classrooms (PK-2) developed as part of a large-scale longitudinal study. The authors rightly state that most available classroom observation instruments focus on elementary grades and are not specific to early childhood mathematics. They describe the development of the instrument (EMC-PK2) and provide some psychometrics to demonstrate its constructs and their reliability. Specifically, they draw on correlations and exploratory factor analyses to describe subscales and make the case for the utility of the instrument. I appreciated the focus on the development process and the focus on early childhood settings. Observation instruments have proliferated in mathematics education in recent years, and there is often not sufficient evidence presented regarding their validity and reliability.

 

While I appreciate the careful work the authors have done here, my main concern is that I am not sure that, in its current form, the manuscript does enough to motivate the need for the instrument, its potential uses, or the implications of the validity argument. I encourage the authors to consider a framing for this manuscript that places the validity evidence in a broader context and/or makes a broader argument, using this evidence as illustrative of that argument. One way to do that would be to have clear research questions guiding the manuscript, something that the current version does not have. Another way to do that would be to frame the manuscript as a fuller validity argument for the EMC-PK2 (see, for example, Bell et al., 2012 for a discussion of this). Regardless of approach, I encourage the authors to consider what the contribution of this analysis is beyond describing the tool and presenting limited psychometric evidence about subscales. In order to do this, I think the authors will need to engage more directly with both classroom observation literature and what high quality mathematics teaching means in early childhood settings.

 

Thank you for your constructive and insightful feedback. We have done significant editing to frame the manuscript as a fuller validity argument for the observation system as suggested. Using the Bell reference, as well as Kane (2013), section 6 now includes subheadings to present a more formal validity argument.

 

We have also tried to clarify our overall paper goals and research questions in the Introduction in lines 73-81: “In this paper, we are guided by the broad questions: What classroom practices define high quality early math instruction in the early grades? How can these practices be captured through a valid observation tool? Using this observation tool, how does the quality of math instruction change across the early grades? We will share our review of extant measures of early mathematics teaching and learning which led us to develop a new framework for observing early math across Pre-K through 2nd grade. We will then describe and offer preliminary analyses of this new measure using an argument-based approach to validity. Finally, we will discuss the implications of this work and the future questions that it suggests for the field.”

 

Sections 2-3 have also been reorganized to clarify our conception of high-quality mathematics as including coherent mathematics along with other dimensions and to connect to some of the other suggested work in this space (e.g., Bostic et al.).

 

I provide some additional questions/suggestions for consideration:

The introduction focuses on instructional coherence broadly and from a policy perspective. While instructional coherence is critical and policy approaches should consider it, the authors are clear that what they are describing in EMC-PK2 is a research tool that measures instructional practice in early childhood mathematics. I think the authors need to do more to connect the instructional coherence framing to the instrument itself. To do this would require a deeper engagement with literature around mathematics teaching and learning in early childhood specifically. What are the instructional practices and learning opportunities that make early childhood mathematics distinct? The authors describe math instructional coherence quite broadly on lines 83-96, acknowledging that this definition holds across grade levels. A robust discussion of mathematics instruction in early childhood settings is missing from the literature review. Section 3 claims the need for a measure of high quality teaching in early grade mathematics, but the authors do not make the case for why this might look different in early childhood - what are the practices that go unmeasured? This matters because there has been work done in this area and in fact observation instruments such as MQI have been validated for use in Kindergarten classrooms (see for example Mantzicopoulos et al., 2018). 

 

We have restructured sections 2 and 3 to more clearly define instructional quality, what instructional quality means for the early grades, and to reference some of the suggested literature. Based on reviewer suggestions, we have also greatly condensed our review of other observation instruments and have emphasized the unique features of the EMC-PK2. We added clarification that our review of current observation systems did not yield a single measure that captured all the aspects of Coherence-3, was math-specific, and was designed to be used in pre-k through elementary grade classrooms.

 

While the authors describe their review of classroom observation instruments (section 4) as a way to motivate and describe the creation of EMC-PK2, they focus mostly on their own evaluations of these tools for their needs. They only minimally engage with literature on classroom observation instruments in mathematics, of which there is a good deal in recent years, including multiple special issues of journals such as ZDM and SEE. I encourage the authors to consider how their work fits into this broader conversation. They might begin with some of the following: Bostic et al., 2021; Bell & Gitomer, 2023 (and the related special issue); Charalambous & Praetorius, 2018 (and articles from the related special issue), 2020; Praetorius & Charalambous, 2018; Schlesinger & Jentsch, 2016; among others.

 

Thank you for these great suggestions. We have tried to reframe the evaluation of the tool not only as just for our project needs, but also as a conceptual question relating to the dimensions of coherent math instruction. Several of these suggested references have been incorporated into section 3 to connect to this broader conversation as suggested. We also now include in lines 155-160, “However, Bell & Gitomer (2023) suggest that there is room for context-specific observation systems and that no unified framework can exist that will be able to fully provide relevant insights about teaching for every research study. And in fact, they argue that the field of teaching research benefits from multiple observation measures which each expand our understanding of the complex nature of classroom instruction and how it might be improved.”

 

Another area that needs more attention in a revision of this manuscript is the description of the sample and methods. The authors describe data collection but give minimal attention to the sample. Perhaps because there were not explicit research questions stated, I was left with questions about both the sample and methods used. If the goal is to introduce the instrument, I suggest a more concrete and organized description of its components and scoring, perhaps in a table. For example, while the authors state multiple times that PK classrooms organize instruction differently than elementary grade classrooms and as such the scoring is done differently, they do not describe how (I wondered how raters code and score 3 hours of teaching, for example). The authors also state that data was analyzed from the COHERE project but give only minimal information regarding the schools, classrooms and teachers (how were they recruited? To what extent do they represent classrooms in their respective schools? etc.). The authors allude to more information in other papers in progress, but some baseline information is needed in this paper as well.

 

We have added explicit research questions in the introduction to clarify the goals of the paper and have included Table 1 to better illustrate the different pieces of the EMC-PK2. We now include more details regarding the sample and data collection in section 5. We have also elaborated on scoring in section 6.1.

 

Other comments/suggestions:

  • The authors cite subject-matter coherence as a key component of early childhood mathematics instruction. They use Common Core standards as an indicator of this (line 131) but these standards do not focus on pre-K so I wanted a bit more information about how to think about this in PreK settings.

In what are now lines 132 and 307-308, we referenced the California Pre-K Foundations as the preschool standards we used. Previously, the reference appeared in one place but not the other, so we have now added it in both places the standards are mentioned.

  • The statement on lines 169-171 is incorrect. The authors state that the MET project developed the Framework for Teaching measure. This is not accurate. FFT was developed by Danielson (2007). It was not developed by or for the MET project, though it was one of multiple measures used in the MET project.

Thank you for this correction. We have removed this specific reference and revised the section based on reviewer feedback to spend less space on each instrument reviewed (except the COEMET and Narrative, which were the most closely related to the EMC-PK2) to focus more on discussing the EMC-PK2.

  • I found the descriptions of the aspects of focus in sections 4 and 5 difficult to follow. In particular, it seems that COEMET and Advanced Narrative were key to the development of EMC-PK2 but also met much of the criteria. Why were they not simply used and adapted? How were they used? I also found the amount of acronyms in these sections challenging to the readability of the manuscript. I suggest explaining more about the substance of the EMC-PK2, perhaps by providing examples. This relates to the broader goal of the manuscript, but I also wondered about how much design detail was needed. I wanted to better understand what was being measured (e.g., through IMA and POSTS).

 

We added an explanation of why neither tool was simply adopted or adapted to lines 199-210: “However, neither instrument on its own provided all the dimensions we believed were important to capturing coherence and high-quality mathematics instruction. For instance, we determined that to measure “subject matter coherence,” it was important to capture precise math content being observed that could be tied to grade level. We also found that the items related to “psychological connections” could be expanded to better capture child level math engagement. We also believed that a measure would benefit from more “moderators” of things that could prevent or facilitate the Coherence-3 domains. We also tried to avoid Likert-like scaling so typical of many observation measures (e.g., “None, Some, A Lot” Or “Disagree, Somewhat Agree, Agree”). Instead, we developed criterion referenced items with clear behavioral guidelines for each score point. Similar to the work of Agodini and colleagues [54], we developed several post-observation ratings of how smoothly the classroom ran and teacher tone.” We have also followed the reviewer suggestions to streamline some of the design details about the tool in section 4 and added guiding subheadings for clarity.

 

  • The four subscales presented (teacher responsiveness, student engagement, differentiation, and classroom atmosphere) are ultimately not all that different from scales in existing measures and it was not clear to me how they are unique to or specific to early childhood settings. If these are the main findings of the analysis, I suggest the authors do more to connect these claims to a broader argument and the literature. Indeed, the main claim that calm, well-organized classrooms may not have mathematical depth is one that underlies a number of hybrid observation instruments, which is why they measure these constructs together or why researchers layer mathematics-focused instruments together with more general instruments.

 

We now frame the manuscript with a stronger validity argument about the instrument and emphasize its benefits of being able to capture mathematics instructional quality in unique ways across pre-k through second grade classrooms in the discussion section. We have also added suggested descriptive analyses across grade levels to illustrate the value of the tool in showing differences in instructional quality in detailed ways.

 

  • I wondered how the authors dealt with items that did not load well onto factors. Do they substantively measure something important? (Big math idea connected activities, for example, seem important for early childhood mathematics, where other research has shown that classrooms can contain time in mathematics lessons spent on non-mathematical activities such as cutting or coloring).

 

We now add some descriptive statistics about these items that did not load well onto factors because we do believe they may be measuring something important that is distinct from the other instructional subscales captured by the tool.

Reviewer 3 Report

Comments and Suggestions for Authors

This paper presents a classroom observational tool, the EMC-PK2, focused on capturing “coherent mathematics teaching and learning practices in preschool through second grade classrooms.”  The authors present sources of evidence of validity.  I appreciate the focus of the observational instrument on early childhood classrooms, a critical time for young children’s mathematical development, and there are some noted strengths of the article.  In the paragraphs that follow, I organize my review highlighting the strengths and suggesting areas for improvement.  While I number my comments, the comments are not numbered by order of priority to address.

(1)    The authors do a nice job articulating the importance of early childhood mathematics education and drawing on literature in mathematics education to articulate how they envision quality. 

(2)    I encourage the authors to explicitly state the purpose of this paper (ideally with some research questions to follow) early in the paper.  I was on page 4, and I began to wonder about the main goals of the paper.

(3)    The section entitled “Review of common classroom observation measures” raised many questions for me.  I was surprised that the details were not accompanied by citations about which measures were reviewed (paragraph at bottom of page 4).  It really felt like too much detail that raised more questions for me as the reader.  I suggest the authors combine and shorten sections 3 and 4 of this paper to highlight how they used existing literature and existing measures to build the EMC-PK2.  I would limit details on other measures (e.g., CLASS) and focus on the two measures that mostly informed the EMC-PK2 (i.e., COEMET and Narrative measures). Also, move the first two paragraphs of section 5 to the review section that highlights the key components of COEMET and Narrative that informed your work.  Then, the next section can focus on EMC-PK2.

(4)    In general, sections 3, 4, and 5 were difficult to follow and to keep track of all of the information.  In addition to combining/shortening sections 3 and 4, I suggest the authors include a table that shows an overview of their instrument.  This type of organizational tool would help the reader follow the arguments.  The table could show the main sections of the instrument, how they are scored, etc.  In section 5, the comparisons to COEMET and Narrative became distracting, rather than allowing the reader to truly understand the EMC-PK2, its sections, and how it is utilized.

(5)    It seems like the purpose of this article is to present evidence of validity based on claims in the later part of the paper.  I encourage the authors to state that up front with a purpose statement and research questions.  That being said, I encourage the authors to consult and consider the use of Michael Kane’s framework for developing an interpretation-use argument (IUA) for an instrument as well as the Standards for Educational and Psychological Testing (AERA, NCME, APA, 2014).  Validity is a complex multi-faceted phenomenon, where instrument developers collect a variety of evidence to support that valid inferences can be drawn from the resulting scores (one of my mentors likened it to building a portfolio of evidence like an attorney might do in preparation for a court case).  What are the assumptions you are making about your measure? What types of evidence are you collecting to test that assumption and to support claims about the validity of the measure?  You can utilize your data to build your arguments, but clearly connect it to claim its validity more explicitly.  There are two readings I highly recommend for the presentation of argument-based approaches to validity:

Bell, C. A., Gitomer, D. H., McCaffrey, D. F., Hamre, B. K., Pianta, R. C., & Qi, Y. (2012). An argument approach to observation protocol validity. Educational Assessment, 17(2-3), 62-87. DOI: 10.1080/10627197.2012.715014

Walkowiak, T. A., Adams, E. L., & Berry, R. Q. (2019). Validity arguments for instruments that measure mathematics teaching practices: Comparing the M-Scan and IPL-M. In J. Bostic, E. Krupa, & J. Shih (Eds.). Assessment in mathematics education contexts: Theoretical frameworks and new directions (pp. 90-119). New York, NY: Routledge.

Also, here are some Kane citations to review:

Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational Measurement (pp. 17-64). Westport, CT: American Council on Education and Prager Publishers.

Kane, M. T. (2013). The argument-based approach to validation. School Psychology Review, 42(4), 448-457. DOI: 10.1037/0033-2909.112.3.527

(6)    For the ICCs about rater bias, include why ICCs are the appropriate metric for supporting claims about rater bias.  I noted a sentence that said “Low ICCs indicate a low likelihood of rater bias.”  But, why is that?  Elaborate.  And, what type of rater bias are you referring to?  I think you are referring to bias because of the expertise of the rater – that is, regardless of their background/expertise, there is low variation?  This is never stated, so elaborating on the “bias” is necessary.  It might be worth mentioning in the discussion section that bias could be further detected with a generalizability study as a next step to determine how much of the variation in scores is explained by the raters.
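One way to make this reading of rater bias concrete is a variance-decomposition sketch like the one below, which treats “low ICC” as a small share of score variance attributable to raters; the data file and column names are hypothetical and not drawn from the study.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical double-scored data: one row per scored segment, with the rater's ID.
scores = pd.read_csv("double_scored_segments.csv")

# Random-intercept model with rater as the grouping factor.
fit = smf.mixedlm("rating ~ 1", scores, groups=scores["rater_id"]).fit()
var_rater = fit.cov_re.iloc[0, 0]  # between-rater variance component
var_resid = fit.scale              # residual (within-rater) variance
icc_rater = var_rater / (var_rater + var_resid)

# A low value means raters (e.g., with vs. without early-math expertise)
# account for little of the variation in scores.
print(f"Proportion of score variance attributable to raters: {icc_rater:.2f}")
```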

(7)    I really like the role of the “checker” of scores (page 9) to add rigor to the data collection processes.  I’m wondering if the checker and rater together decided if a change needed to be made?  The last sentence of the paragraph at the top of page 9 made it sound like the authority/power was given more to the checker.

(8)    I got a little lost between the mention of the Common Core standards (practices) and the types of coherences referenced from the DREME network, such that when I got to the discussion section, I was unclear on the claims being made about instruction, particularly around the section on the math practices from the Common Core.  You might consider showcasing a mapping of the items on the EMC-PK2 to the main theoretical frameworks more explicitly such that you can make a clearer claim about the content of the instrument as it relates to validity. 

(9)    The authors mention the complexities of early childhood classrooms in the discussion section (movement, grouping, etc.), but it was not completely clear how this was accounted for in the observational processes and system.  I know the IMAs were a big focus and were described on page 6, but how did the raters account for more than one activity taking place in a classroom at a time (e.g., group with the teacher, other children at center activities)?   I’m assuming the teacher was always the focus?

(10) I really like the focus on practice on page 15 – mentioning that the tool could be used to identify actionable next steps for teachers – this is important!

(11) In general, I appreciate the components of this paper, and I think it can make a contribution to the field with major restructuring and clear claims about the assumptions being made about the validity of the instrument, and how the pieces of evidence support the assumption so one can conclude that valid inferences can be drawn from the resulting scores.

Comments on the Quality of English Language

There are no concerns with the quality of the English Language.  Concerns are more about improving structure/organization and adding signposts for the reader.

Author Response

This paper presents a classroom observational tool, the EMC-PK2, focused on capturing “coherent mathematics teaching and learning practices in preschool through second grade classrooms.”  The authors present sources of evidence of validity.  I appreciate the focus of the observational instrument on early childhood classrooms, a critical time for young children’s mathematical development, and there are some noted strengths of the article.  In the paragraphs that follow, I organize my review highlighting the strengths and suggesting areas for improvement.  While I number my comments, the comments are not numbered by order of priority to address.

 

(1)    The authors do a nice job articulating the importance of early childhood mathematics education and drawing on literature in mathematics education to articulate how they envision quality.

 

We appreciate this and have further strengthened our description of high-quality math instruction in the early grades based on reviewer feedback.

 

(2)    I encourage the authors to explicitly state the purpose of this paper (ideally with some research questions to follow) early in the paper. I was on page 4, and I began to wonder about the main goals of the paper.

 

Thank you for your suggestion. We have tried to clarify our overall paper goals and research questions in the Introduction in lines 73-81: “In this paper, we are guided by the broad questions: What classroom practices define high quality early math instruction in the early grades? How can these practices be captured through a valid observation tool? Using this observation tool, how does the quality of math instruction change across the early grades? We will share our review of extant measures of early mathematics teaching and learning which led us to develop a new framework for observing early math across Pre-K through 2nd grade. We will then describe and offer preliminary analyses of this new measure using an argument-based approach to validity. Finally, we will discuss the implications of this work and the future questions that it suggests for the field.”

(3)    The section entitled “Review of common classroom observation measures” raised many questions for me.  I was surprised that the details were not accompanied by citations about which measures were reviewed (paragraph at bottom of page 4).  It really felt like too much detail that raised more questions for me as the reader.  I suggest the authors combine and shorten sections 3 and 4 of this paper to highlight how they used existing literature and existing measures to build the EMC-PK2.  I would limit details on other measures (e.g., CLASS) and focus on the two measures that mostly informed the EMC-PK2 (i.e., COEMET and Narrative measures). Also, move the first two paragraphs of section 5 to the review section that highlights the key components of COEMET and Narrative that informed your work.  Then, the next section can focus on EMC-PK2.

We agree and have greatly revised these sections as suggested. We now briefly mention our broader review and then focus specifically on describing the Narrative and COEMET measures before describing the EMC-PK2 in more detail.

(4)    In general, sections 3, 4, and 5 were difficult to follow, and it was hard to keep track of all of the information.  In addition to combining/shortening sections 3 and 4, I suggest the authors include a table that gives an overview of their instrument.  This type of organizational tool would help the reader follow the arguments.  The table could show the main sections of the instrument, how they are scored, etc.  In section 5, the comparisons to the COEMET and Narrative measures became distracting, rather than allowing the reader to truly understand the EMC-PK2, its sections, and how it is utilized.

We have attempted to shorten these sections and remove some detail that could be read as extraneous. We have added an organizational table to outline the major item categories in the EMC-PK2, in section 4, Table 1, as suggested. In addition, in what is now section 6 we have restructured the argument around validity to more clearly describe the items and their scoring.

(5)    It seems that the purpose of this article is to present evidence of validity to support the claims in the later part of the paper.  I encourage the authors to state that up front with a purpose statement and research questions.  That being said, I encourage the authors to consult and consider using Michael Kane’s framework for developing an interpretation-use argument (IUA) for an instrument, as well as the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014).  Validity is a complex, multi-faceted phenomenon, where instrument developers collect a variety of evidence to support the claim that valid inferences can be drawn from the resulting scores (one of my mentors likened it to building a portfolio of evidence, as an attorney might do in preparation for a court case).  What assumptions are you making about your measure? What types of evidence are you collecting to test those assumptions and to support claims about the validity of the measure?  You can use your data to build your arguments, but connect it clearly and explicitly to your validity claims.  There are two readings I highly recommend for the presentation of argument-based approaches to validity:

Bell, C. A., Gitomer, D. H., McCaffrey, D. F., Hamre, B. K., Pianta, R. C., & Qi, Y. (2012). An argument approach to observation protocol validity. Educational Assessment, 17(2-3), 62-87. DOI: 10.1080/10627197.2012.715014

Walkowiak, T. A., Adams, E. L., & Berry, R. Q. (2019). Validity arguments for instruments that measure mathematics teaching practices: Comparing the M-Scan and IPL-M. In J. Bostic, E. Krupa, & J. Shih (Eds.), Assessment in mathematics education contexts: Theoretical frameworks and new directions (pp. 90-119). New York, NY: Routledge.

Also, here are some Kane citations to review:

Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational Measurement (pp. 17-64). Westport, CT: American Council on Education and Praeger Publishers.

Kane, M. T. (2013). The argument-based approach to validation. School Psychology Review, 42(4), 448-457.

We sincerely appreciate these references and suggestions. As previously mentioned, we have revised the paper to more clearly articulate the study goals. We have also restructured the current section 6 around an argument-based validity framework based on these recommendations. We believe this greatly strengthens the paper.

(6)    For the ICCs about rater bias, explain why ICCs are the appropriate metric for assessing rater bias.  I noted a sentence that said “Low ICCs indicate a low likelihood of rater bias.”  But why is that?  Elaborate.  And what type of rater bias are you referring to?  I think you are referring to bias arising from the expertise of the rater – that is, regardless of raters’ background/expertise, there is low variation?  This is never stated, so elaborating on the “bias” is necessary.  It might be worth mentioning in the discussion section that bias could be further examined with a generalizability study as a next step, to determine how much of the variation in scores is explained by the raters.  

We have tried to clarify what we mean by bias in this section by revising the text to state, “Low ICCs indicate a low likelihood of rater bias as they suggest a rater was not consistently rating that item in the same way across all their observations and rather was attending to the variation expected in classrooms.” We also have added the need for broader data collection with the tool in future studies in the Implications section.
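For readers less familiar with this use of ICCs, the underlying logic is that when raters are treated as the grouping factor, the ICC estimates the share of score variance attributable to who did the rating; a value near zero means scores track differences among classrooms rather than rater identity. The sketch below illustrates one generic way to estimate such a rater-level ICC from a random-intercept model; it is not the authors' analysis, and the data, rating scale, and variable names are hypothetical.

```python
# Minimal sketch (hypothetical data): estimating how much score variance is
# attributable to raters via a random-intercept model. A rater-level ICC near 0
# suggests ratings vary with classrooms rather than with rater identity.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_raters, n_obs_per_rater = 6, 40
df = pd.DataFrame({
    "rater": np.repeat([f"R{i}" for i in range(n_raters)], n_obs_per_rater),
    "score": rng.integers(1, 6, size=n_raters * n_obs_per_rater).astype(float),
})

# score ~ 1 + (1 | rater): the only systematic source of variance allowed is the rater
fit = smf.mixedlm("score ~ 1", data=df, groups=df["rater"]).fit()

rater_var = float(fit.cov_re.iloc[0, 0])   # between-rater variance component
resid_var = float(fit.scale)               # residual (within-rater) variance
icc = rater_var / (rater_var + resid_var)
print(f"Rater-level ICC: {icc:.3f}")
```

A generalizability study of the kind the reviewer suggests would extend this idea by partitioning score variance across raters, classrooms, items, and occasions simultaneously.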

(7)    I really like the role of the “checker” of scores (page 9) in adding rigor to the data collection processes.  I’m wondering whether the checker and rater together decided if a change needed to be made.  The last sentence of the paragraph at the top of page 9 made it sound like the authority/power was given more to the checker. 

We agree that the way this was previously phrased was unclear, and this text was rewritten to clarify the role of the checker in lines 371-374: “Corrections would be made by the research team if the observer made an accidental mistake or if they could not justify their rating through their exchange with the data checker. The notes from those exchanges were added to the official data record, as were any decisions to correct codes.”

(8)    I got a little lost between the mention of the Common Core standards (practices) and the types of coherence referenced from the DREME network, such that when I got to the discussion section, I was unclear on the claims being made about instruction, particularly around the section on the math practices from the Common Core.  You might consider showcasing a mapping of the items on the EMC-PK2 onto the main theoretical frameworks more explicitly, so that you can make a clearer claim about the content of the instrument as it relates to validity. 

We are grateful for this suggestion and have made substantial revisions to the introduction to focus less on coherence and more on high-quality mathematics instruction, of which coherence is a part. We hope that this, along with the argument-based validity framework now used, helps improve the clarity of the paper.

(9)    The authors mention the complexities of early childhood classrooms in the discussion section (movement, grouping, etc.), but it was not completely clear how this was accounted for in the observational processes and system.  I know the IMAs were a big focus and were described on page 6, but how did the raters account for more than one activity taking place in a classroom at a time (e.g., group with the teacher, other children at center activities)?   I’m assuming the teacher was always the focus?

We appreciate that this was not clear before and have added more information about how concurrent activities were captured in lines 253-266: “IMAs were mostly consecutive but were sometimes concurrent, as in the case of multiple simultaneous center or rotation activities. Observers created a new IMA record for each concurrent activity with a separate content objective. For instance, an activity block with three rotating centers involving flashcard addition, geometry puzzles, and counting collections would result in three IMAs. The tool allowed observers to quickly create blank records with duplicate start times. Observers visited each concurrent center to record notes, rotating among them and spending the most time in the activities with teacher involvement. Activities with significant teacher involvement became “Full” and required additional codes, so it was important that observers stayed long enough to capture sufficient notes about the types of interactions occurring between the teacher and students. This aspect of our system also made it distinct in that the focus was enlarged from what the teacher was doing with students to what students might be doing in a math lesson independently or working together, important characteristics of early grades instruction.”
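To make the bookkeeping for concurrent activities concrete, here is a minimal, purely illustrative sketch of how such records might be structured; the field names and values are hypothetical and are not the EMC-PK2’s actual schema.

```python
# Illustrative only: a hypothetical record structure for concurrent activities
# observed within one activity block. Field names are not the EMC-PK2 schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class IMARecord:
    start_time: str                       # concurrent IMAs can share the same start time
    content_objective: str                # each concurrent activity has its own objective
    teacher_involved: bool = False        # substantial involvement makes an IMA "Full"
    notes: List[str] = field(default_factory=list)
    end_time: Optional[str] = None

# An activity block with three rotating centers yields three IMA records
block = [
    IMARecord("09:15", "flashcard addition"),
    IMARecord("09:15", "geometry puzzles", teacher_involved=True),
    IMARecord("09:15", "counting collections"),
]
full_imas = [ima for ima in block if ima.teacher_involved]  # "Full" IMAs get additional codes
```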

(10) I really like the focus on practice on page 15 – mentioning that the tool could be used to identify actionable next steps for teachers – this is important!

(11) In general, I appreciate the components of this paper, and I think it can make a contribution to the field with major restructuring and clear claims about the assumptions being made about the validity of the instrument, and how the pieces of evidence support the assumption so one can conclude that valid inferences can be drawn from the resulting scores.

We appreciate this comment and the one above.
