*Article* **Patterns of Scientific Reasoning Skills among Pre-Service Science Teachers: A Latent Class Analysis**

**Samia Khan 1,\* and Moritz Krell <sup>2</sup>**


**Abstract:** We investigated the scientific reasoning competencies of pre-service science teachers (PSTs) using a multiple-choice assessment. This assessment targeted seven reasoning skills commonly associated with scientific investigation and scientific modeling. The sample consisted of 56 PSTs enrolled in a secondary teacher education program, whose pre- and post-assessment responses (*n* = 112) were analyzed. A latent class (LC) analysis was conducted to evaluate whether there are subgroups with distinct patterns of reasoning skills. The analysis revealed two subgroups, in which LC1 (73% of the PSTs) had a statistically higher probability of solving reasoning tasks than LC2. Specific patterns of reasoning emerged within each subgroup. Within LC1, tasks involving analyzing data and drawing conclusions were answered correctly more often than tasks involving formulating research questions and generating hypotheses. Related to modeling, tasks on testing models were solved more often than those requiring judgment on the purpose of models. This study illustrates the benefits of applying person-centered statistical analyses, such as LC analysis, to identify subgroups with distinct patterns of scientific reasoning skills in a larger sample. The findings also suggest that highlighting specific skills in teacher education, such as formulating research questions, generating hypotheses, and judging the purposes of models, would better enhance the full complement of PSTs' scientific reasoning competencies.

**Keywords:** scientific reasoning; science teacher education; pre-service teachers; person-centered statistical analyses; latent class analysis

## **1. Introduction**

Scientific reasoning has been a subject of study in the field of science education for some time [1]. Assessing this reasoning, however, remains a 21st-century challenge for science educators [2]. The present study focuses on the scientific reasoning of future science teachers themselves. We assessed reasoning among this group because they will need to teach and demonstrate reasoning to their future students in science, and because activities in science teacher education can be designed to enhance their competency in this field.

Scientific reasoning is a competency that encompasses the abilities needed for scientific problem-solving, as well as the capacity to reflect on problem-solving [3,4]. In the sciences, reasoning has previously been distinguished from related constructs such as problem-solving, critical thinking, and scientific thinking. Descriptions of thinking, problem-solving, and reasoning are often conflated. For example, scientific reasoning has been suggested to be a kind of problem-solving; however, it has also been argued that reasoning can be distinguished from problem-solving in that direct retrieval of a solution from memory is not possible with reasoning [5]. Ford [6] further emphasizes that reasoning does not mean following a series of rules either but rather encompasses ongoing evaluation and critique, as suggested by the reflective component of the above definition. Reasoning in the sciences requires cognitive processes that can contribute to, or allow for, inquiring and answering questions about the world and the nature of phenomena. These cognitive processes include formulating and evaluating hypotheses, two of several processes regularly invoked in scientific domains [7,8].

**Citation:** Khan, S.; Krell, M. Patterns of Scientific Reasoning Skills among Pre-Service Science Teachers: A Latent Class Analysis. *Educ. Sci.* **2021**, *11*, 647. https://doi.org/10.3390/educsci11100647

Academic Editor: Ian Hay

Received: 17 August 2021; Accepted: 13 October 2021; Published: 15 October 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The multiple cognitive processes that have been investigated in research on reasoning in science and science education have been variously described as formal logic, non-formal reasoning, creativity, model-based reasoning, abductive reasoning, analogical reasoning, and probabilistic reasoning [9–12]. These processes may or may not be used in the wider category of critical thinking [13]. Scholars have provided evidence that the ability to use these processes for reasoning is transferable across domains [14], while others such as Kind and Osborne [15] suggest that reasoning is highly variable by the content and the procedural and epistemic knowledge of the reasoner. Scholars have also shown that the ability to reason in science does not necessarily improve with age [16] but that it can be taught and enhanced in both the early years and at university levels [17–19].

Our focus in the present study is on the reasoning competencies of pre-service science teachers (PSTs) enrolled in a university teacher education program. Most studies on pre-service science teachers' scientific reasoning competencies adopt variable-centered approaches and report, for example, average scores for sample groups or populations. For example, one study [20] reported that a group of 66 Australian pre-service science teachers performed significantly better on tasks that required skills of 'planning investigations' compared to tasks related to skills of 'formulating research questions' and 'generating hypotheses'. Such insights are valuable but can be too coarse-grained, depending on the research questions, as different subgroups with distinct patterns of scientific reasoning skills may exist within a sample. In order to identify such subgroups, person-centered analyses are necessary, which, statistically speaking, aim to "[R]educe the 'noise' in the data by splitting the total variability into 'between-group' variability and 'within-group' variability" [21] (p. 2). Hence, person-centered analyses, like latent class analyses (LCA), are finer-grained in the sense that they are case-based and identify individuals with similar patterns of scientific reasoning skills (e.g., [22]). Person-centered analyses are also referred to as 'typological' approaches [23]. Such approaches can be specifically valuable for educators as they move beyond the 'average' and follow, methodologically, "[M]odern developmental theory, in which individuals are regarded as the organising unit of human development" [23] (p. 502). In the present study, we seek to establish whether subgroups of reasoners can be ascertained among PSTs using an LCA.
The seven reasoning skills examined are: *formulating research questions*, *generating hypotheses*, *planning investigations*, *analyzing data and drawing conclusions*, *judging the purpose of models*, *testing models*, and *changing models*. While historical examination of scientific work has revealed that practices such as thought experiments, analogies, and imagistic simulation are important to scientists' development of new concepts [24], the seven skills under investigation were identified as key empirical areas of inquiry in science education [25–29] and are likely to have been taught in undergraduate science programs [3].

## **2. Materials and Methods**

#### *2.1. Sample*

A full cohort of 56 PSTs from a university in North America participated in this study. Their mean age was 27 years (*SD* = 6.34; mode = 23). Data were collected in their secondary science methods course within a Bachelor of Education after-degree program. To enroll in the secondary program, all students had at least one prior degree (usually four years of Science or more). The instrument described below (Section 2.2) was administered to the PSTs in their methods course at the beginning and at the end of the semester (pre–post-assessment). For the purpose of identifying groups with distinct patterns of scientific reasoning, we analyzed the pre- and post-assessment data of the 56 PSTs taken together. The total response sample for each item was thus *n* = *n*pre + *n*post = 112. Only PSTs without missing responses were included in the analysis, resulting in a sample of *n* = 101 for the statistical analysis. The numbers of PSTs by primary major were: Biology (*n* = 30), Chemistry (*n* = 11), Physics (*n* = 8), Biomedicine (*n* = 1), Earth Sciences (*n* = 1), Mathematics (*n* = 1), and n/a (*n* = 4). Most of the PSTs' prior degrees were within the field of Biology (*n* = 60; e.g., general Biology, Applied Biology, or Evolutionary Biology), followed by Chemistry (*n* = 25) and Physics (*n* = 6).

#### *2.2. Data Collection*

An established multiple-choice instrument was administered to assess the PSTs' scientific reasoning competencies. The instrument was originally developed in the German language [27] and was later adapted into English, with thorough evaluations [30]. The instrument includes 21 multiple-choice items that were developed to assess the seven reasoning skills of *formulating research questions*, *generating hypotheses*, *planning investigations*, *analyzing data and drawing conclusions*, *judging the purpose of models*, *testing models*, and *changing models*. The items are set in authentic scientific contexts, mostly related to general science and Biology. As suggested in the organizing device that was used for test development (see Table 1), these seven skills are related to two sub-competencies: conducting scientific investigations and using scientific models [31]. To correctly solve the multiple-choice items, PSTs have to apply their procedural and epistemic knowledge related to the respective skills [32–34]. Table 1 lists the two sub-competencies, their associated skills, and the specific knowledge necessary to correctly answer the items.
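To make the scoring concrete, the sketch below aggregates dichotomous item responses into skill-level proportions correct. The response matrix is simulated, and the even three-items-per-skill grouping is an illustrative assumption, not a description of the actual instrument layout.

```python
import numpy as np

# The seven reasoning skills, in the order listed above.
skills = [
    "formulating research questions", "generating hypotheses",
    "planning investigations", "analyzing data and drawing conclusions",
    "judging the purpose of models", "testing models", "changing models",
]

# Hypothetical data: 112 responses (rows) to 21 dichotomous items
# (columns; 1 = correct), matching the total response sample size.
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(112, 21))

# Assume items are grouped in blocks of three per skill and average
# within each block to get a per-respondent, per-skill score in [0, 1].
skill_scores = responses.reshape(112, 7, 3).mean(axis=2)

# Mean proportion correct per skill across all responses.
mean_by_skill = dict(zip(skills, skill_scores.mean(axis=0).round(2)))
```

Such per-skill proportions are the quantities compared between latent classes in the Results section.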

**Table 1.** Sub-competencies of scientific reasoning and associated skills with necessary procedural and epistemic knowledge, as described by Mathesius et al. [34].


#### *2.3. Data Analysis: Latent Class Analysis*

A latent class analysis (LCA) was utilized to identify patterns of scientific reasoning skills among PSTs. The R package poLCA was employed [35]. All further (classical) statistical analyses, such as *t*-tests and descriptive analyses, were carried out with IBM SPSS Statistics, version 26. In an LCA, PSTs' responses are analyzed on the latent level, all variables are assumed to be (at least) on a nominal level, and there are no restrictions on the kind of relation between the (manifest) variables [33,36,37]. LCA was selected for data analysis because it permits the identification and computation of different groups (i.e., latent classes) of PSTs, with each group consisting of individuals with a response pattern that is as homogeneous as possible (low within-group variability) but different from the response patterns of the other groups (high between-group variability). LCA is therefore considered to belong to the person-centered approaches of data analysis [21,23].
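The grouping logic of LCA can be illustrated with a minimal expectation–maximization (EM) sketch on simulated dichotomous responses. This is a toy illustration of the kind of model poLCA fits, not the authors' analysis: the two class sizes (70/30) and solving probabilities (0.8 vs. 0.4) are invented for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate binary responses for two hypothetical latent classes:
# 70 "proficient" subjects (P(correct) = 0.8) and 30 subjects
# with P(correct) = 0.4, across 7 items.
n1, n2, n_items = 70, 30, 7
X = np.vstack([
    (rng.random((n1, n_items)) < 0.8).astype(int),
    (rng.random((n2, n_items)) < 0.4).astype(int),
])

def lca_em(X, n_classes=2, n_iter=200, seed=1):
    """Minimal EM for a latent class model with dichotomous items."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    pi = np.full(n_classes, 1.0 / n_classes)       # class proportions
    theta = rng.uniform(0.3, 0.7, (n_classes, m))  # P(correct | class, item)
    for _ in range(n_iter):
        # E-step: posterior class-membership probabilities (Bayes' rule).
        log_lik = (X[:, None, :] * np.log(theta)
                   + (1 - X[:, None, :]) * np.log(1 - theta)).sum(axis=2)
        log_post = np.log(pi) + log_lik
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: update class proportions and item-solving probabilities.
        pi = post.mean(axis=0)
        theta = ((post.T @ X) / post.sum(axis=0)[:, None]).clip(1e-6, 1 - 1e-6)
    return pi, theta, post

pi, theta, post = lca_em(X)
```

With well-separated simulated classes, the recovered class proportions and per-class solving probabilities approximate the generating values, which is the within-group-homogeneity idea described above.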

A core question in LCA is deciding on the appropriate number of latent classes [36]. To compare different LCA models, indices such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and the sample-size-adjusted Bayesian information criterion (ssaBIC) are typically employed. These indices factor in the parsimony, the sample size, and the likelihood of the LCA models, each in a different manner [38]. When comparing different LCA models with these information indices, the smallest value of each index indicates the comparatively best LCA model; the BIC and the ssaBIC have been identified as superior indicators compared to the AIC [39] (p. 557), which is why these indicators are used in the present study. However, the BIC and the ssaBIC often do not identify the same LCA model as optimal [38]. Therefore, one has to combine different insights to decide how many latent classes represent the data set best [38].
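For illustration, these indices can be computed directly from a fitted model's log-likelihood. In the sketch below, the log-likelihood values are invented so that, as in the study, the BIC and the ssaBIC favor different numbers of classes; the ssaBIC uses Sclove's (n + 2)/24 sample-size adjustment in the penalty term.

```python
import math

def information_criteria(log_lik, n_params, n_obs):
    """Return (AIC, BIC, ssaBIC) for a fitted model."""
    aic = -2 * log_lik + 2 * n_params
    bic = -2 * log_lik + n_params * math.log(n_obs)
    ssabic = -2 * log_lik + n_params * math.log((n_obs + 2) / 24)
    return aic, bic, ssabic

# Hypothetical log-likelihoods for 1-4 class solutions on n = 101 subjects.
# A k-class LCA with 21 dichotomous items has 21k + (k - 1) free parameters
# (21 item probabilities per class plus k - 1 class proportions).
candidates = {
    1: (-1400.0, 21),
    2: (-1320.0, 43),
    3: (-1295.0, 65),
    4: (-1275.0, 87),
}
results = {k: information_criteria(ll, p, 101) for k, (ll, p) in candidates.items()}
best_by_bic = min(results, key=lambda k: results[k][1])
best_by_ssabic = min(results, key=lambda k: results[k][2])
```

Because the ssaBIC penalizes extra parameters far more mildly than the BIC at this sample size, it can prefer a richer model, which is why additional indicators are needed to break the tie.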

An important characteristic of LCA is that subjects are not assigned to the latent classes in a deterministic manner but rather in a probabilistic one. For diagnostic purposes, it is common to assign each subject to the latent class with the highest probability of membership. Accordingly, an "Additional indicator [of model-goodness] is the average membership probability within each [latent] class" [40] (p. 52); the higher this probability, the better the LCA model. Furthermore, one should inspect the item parameters for extreme values, i.e., estimated probabilities of 0% or 100% of solving a task; the fewer extreme values, the better the LCA model [40].
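Modal assignment and the average membership probability can be sketched as follows; the posterior matrix for five subjects in a two-class solution is invented for illustration.

```python
import numpy as np

# Hypothetical posterior class-membership probabilities for five
# subjects in a two-class LCA solution (each row sums to 1).
posterior = np.array([
    [0.95, 0.05],
    [0.88, 0.12],
    [0.30, 0.70],
    [0.15, 0.85],
    [0.60, 0.40],
])

# Modal assignment: each subject goes to the class with the highest
# posterior probability.
assignment = posterior.argmax(axis=1)

# Average membership probability within each assigned class; values
# close to 1 indicate a clean separation between the latent classes.
avg_prob = [float(posterior[assignment == c, c].mean()) for c in range(2)]
```

In this toy example, the fifth subject (0.60 vs. 0.40) is assigned to class 1 only weakly, which drags down that class's average membership probability.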

## **3. Results**

Table 2 provides the fit indices for the LCA models compared in this study. Because the BIC (two latent classes) and the ssaBIC (four latent classes) suggest selecting different LCA models, the number of extreme values and the probability of assignment were used as additional indicators. Based on these indicators, it can be assumed that the response pattern of the PSTs is best represented by two latent classes. These two latent classes consist of about 73% (74 PSTs; latent class 1) and about 27% (27 PSTs; latent class 2) of the sample, respectively.


**Table 2.** Fit-indices of the different LCA models compared. Note that models with more than four latent classes did not fit the data.

Figure 1 illustrates the response profiles for the two latent classes across the seven skills of scientific reasoning covered in the multiple-choice instrument. Generally, PSTs in latent class 1 show a higher mean probability of correct answers within all seven skills. Comparing the mean probability of correct answers between the two latent classes with independent *t*-tests resulted in significant differences for the skills *planning investigations* (*p* = 0.04; *d* = 0.48, small to medium effect size measure), *analyzing data and drawing conclusions* (*p* < 0.001; *d* = 1.25, large effect size measure), *judging the purpose of models* (*p* < 0.001; *d* = 1.25, large effect size measure), *testing models* (*p* < 0.001; *d* = 1.49, large effect size measure), and *changing models* (*p* < 0.001; *d* = 0.88, large effect size measure).

**Figure 1.** Response profiles for the two latent classes across the seven skills of scientific reasoning (mean score ± 2 × standard error).

For latent class 1 and considering skills related to conducting scientific investigations (Table 1), response probabilities for the skills *formulating research questions* and *generating hypotheses* on the one hand, and *planning investigations* and *analyzing data and drawing conclusions*, on the other hand, are quite similar, even though significant differences with large effect size measures could be found between these two groups of skills. For the skills related to using scientific models (Table 1), correct responses were found significantly more often for the skill *testing models* than for *judging the purpose of models* (*p* = 0.02; *d* = 0.36, small effect size measure).

For latent class 2 and considering skills related to conducting scientific investigations (Table 1), items related to the skill *planning investigations* have been answered correctly significantly more often than the tasks related to the other three skills (*p* < 0.001; *d* > 1.00, large effect size measures). For using scientific models (Table 1), no significant differences between the skills could be found.

In order to better understand the characteristics of the PSTs assigned to latent class 1 and latent class 2, we compared their age, primary majors, and the sum of previous degrees. Independent *t*-tests (Table 3) revealed that there are significantly more PSTs with the primary major of Biology in latent class 1 (about 65%) than in latent class 2 (about 33%). For the primary major of Chemistry, it is quite the reverse (about 15% in latent class 1 and about 33% in latent class 2); also, the number of PSTs with more than one previous degree is significantly higher in latent class 1 (*n* = 11) than in latent class 2 (*n* = 1). These findings illustrate that the study of Biology as a primary major and a higher number of previous degrees made it more likely to belong to the more proficient latent class 1, whereas the study of Chemistry as a primary major made it more likely to belong to latent class 2.

**Table 3.** Comparison of the PSTs assigned to latent class (LC) 1 and LC 2 along the variables age, primary major of Biology, Chemistry or Physics, and the sum of previous degrees (the latter as a dichotomized variable with 1 = one previous degree and 2 = more than one previous degree).


\* Adjusted *t*-statistic and *df* because of violated assumption of variance homogeneity.

## **4. Discussion**

Using LCA, we revealed that two groups of reasoners emerged amongst the PSTs. One subgroup (latent class 1) had a statistically higher probability of solving scientific reasoning tasks than the other subgroup (latent class 2). Overall, the groups were significantly different on the following five skills out of seven investigated: *planning investigations*, *analyzing data and drawing conclusions*, *judging the purpose of models*, *testing models*, and *changing models*. They were not significantly different from each other on *formulating research questions* and *generating hypotheses*.

Within the latent class 1 subgroup, response probabilities differed significantly between the skills *planning investigations* and *analyzing data and drawing conclusions* on the one hand and the skills *formulating research questions* and *generating hypotheses* on the other. Tasks about *testing models* were solved more often than those requiring *judging the purpose of models* within this subgroup. Within the latent class 2 subgroup, responses to *planning investigations* differed significantly from those to the other skills. For using scientific models, no significant differences could be found within this subgroup on the skills related to modeling (*judging the purpose of models*, *testing models*, and *changing models*).

The two subgroups also differed in several other key characteristics. In latent class 1, a significantly larger proportion had a major in Biology than in latent class 2, whereas Chemistry majors were proportionally fewer in latent class 1. Moreover, there were significantly more PSTs with more than one previous degree in latent class 1 than in latent class 2. This finding is noteworthy for science teacher education because it suggests that Biology majors were significantly better at *planning investigations*, *analyzing data and drawing conclusions*, *judging the purpose of models*, *testing models*, and *changing models* than Chemistry majors. These findings might have been caused by the dominance of Biology-related items in the instrument; however, as the items require PSTs to apply procedural and epistemic knowledge as shown in Table 1 (and less so content knowledge), the findings lead us towards a renewed emphasis on reasoning tasks for Chemistry teacher education. Nevertheless, future studies could investigate the importance of science content knowledge from specific subjects (such as Biology) for solving the items, for instance, by applying think-aloud studies [25] or statistically investigating difficulty-generating task characteristics [41].

As a 'person-centered' statistical approach, the LCA was particularly powerful in ascertaining subgroups within a science teacher education cohort. This statistical approach is a departure from traditional variable-centered approaches in education that tend to report on average scores for sample groups [21,23]. The LCA permits statistical cases to emerge from within samples or classrooms and is a recommended approach to generate case studies for further inquiry in science teacher education research.

In combination with relevant epistemic, procedural, and content knowledge, greater attention to *formulating research questions* and *generating hypotheses* would be helpful within science teacher education. Furthermore, reasoning tasks involving *judging the purpose of models* and *changing models* could be a high priority for modeling investigations in pre-service science teacher education. Possible science teacher education activities to support such tasks include the three-phased generating, evaluating, and modifying (GEM) models approach [10]. This approach emphasizes generating hypotheses in the first phase and testing and changing models in the second and third phases [42]. In science teacher education courses, Biology majors or those with additional degrees could, in general, be purposefully included within heterogeneous groups for cooperative learning tasks. It was interesting to the authors that Biology majors outperformed other majors in this study, although this might be caused by the dominance of Biology-related items in the instrument; insights into the differences in performance among majors would be a helpful avenue for the design of science teacher education courses and group work in the ways suggested above. By participating in reasoning tasks with such recommendations in mind, future teachers might be able to better support their own students to develop competencies in these areas.

The significance of this study is that it identifies, using person-centered statistics, two subgroups of PSTs with different propensities to reason in science. Normally, the classroom would be treated as a single group; with this statistical approach, however, the researchers are able to show that subgroups of PSTs emerged as competent at very different reasoning tasks. One subgroup is significantly more competent at *planning investigations*, *analyzing data and drawing conclusions*, *judging the purpose of models*, *testing models*, and *changing models* than the other. The subgroups had approximately equivalent competencies at *formulating research questions* and *generating hypotheses*, showing for the first time that different subgroups with specific patterns of scientific reasoning skills exist among PSTs. This finding can have an impact on the science students of these future teachers, who presumably will draw upon their own competencies to demonstrate how to reason in the classroom. Future research could target investigation and model-based reasoning competencies among PSTs and their relationships to student reasoning. *Judging the purpose of models*, *formulating research questions*, and *generating hypotheses* were areas in which PSTs were less competent; researching interventions related to these aspects of modeling and investigation would be worthwhile.

**Author Contributions:** Conceptualization, M.K., S.K.; methodology, M.K.; investigation, M.K., S.K.; resources, M.K., S.K.; writing—original draft preparation, S.K.; writing—review and editing, M.K., S.K.; visualization, M.K.; funding acquisition, M.K., S.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the 2018 UBC-FUB Joint Funding Scheme, grant number FSP-2018-401.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Institutional Review Board (or Ethics Committee) of the University of British Columbia (ID H18-01801, approved 23 July 2018).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** Data is available upon request to the second author.

**Acknowledgments:** The authors wish to thank Alexis Gonzalez for support in data collection and tabulation.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study, in the collection, analyses, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.

## **References**


*Article* **Analyzing Cognitive Demands of a Scientific Reasoning Test Using the Linear Logistic Test Model (LLTM)**

**Moritz Krell 1,\* , Samia Khan <sup>2</sup> and Jan van Driel <sup>3</sup>**


**Abstract:** The development and evaluation of valid assessments of scientific reasoning are an integral part of research in science education. In the present study, we used the linear logistic test model (LLTM) to analyze how item features related to text complexity and the presence of visual representations influence the overall item difficulty of an established, multiple-choice, scientific reasoning competencies assessment instrument. This study used data from *n* = 243 pre-service science teachers from Australia, Canada, and the UK. The findings revealed that text complexity and the presence of visual representations increased item difficulty and, in total, contributed to 32% of the variance in item difficulty. These findings suggest that the multiple-choice items contain the following cognitive demands: encoding, processing, and combining of textually presented information from different parts of the items and encoding, processing, and combining information that is presented in both the text and images. The present study adds to our knowledge of which cognitive demands are imposed by multiple-choice assessment instruments and whether these demands are relevant for the construct under investigation—in this case, scientific reasoning competencies. The findings are discussed and related to the relevant science education literature.

**Keywords:** scientific reasoning; cognition; assessment; item features; item difficulty

## **1. Introduction**

An understanding of science and its procedures, capabilities, and limitations is crucial for a society facing complex problems. This significance was recently highlighted during the COVID-19 crisis, where misinformation through traditional and social forms of media appeared to be highly influential in shaping people's opinions and actions about the crisis [1]. Science education can respond to these issues in part by supporting the development of scientific reasoning competencies (SRC) among students of science. Additionally, science teachers would benefit from strong SRC themselves to model and promote SRC among their students [2–4]. SRC are defined as the dispositions to be able to solve a scientific problem in a certain situation by applying a set of scientific skills and knowledge, and by reflecting on the process of scientific problem-solving at a meta-level [5–8]. SRC are also seen as a core element of 21st-century skills in science curricula, as they are assumed to help enable civic participation in socio-scientific issues facing societies and have been said to be indicative of a society's future economic power [9,10]. Hence, SRC, such as developing scientific questions and hypotheses, modeling, generating evidence through experimentation, and evaluating claims, are addressed in science education policy papers and curriculum documents as a key outcome of science education in various countries around the world (e.g., [11–13]). SRC are also emphasized as part of science teachers' professional competencies that should be developed during initial teacher education [14].

**Citation:** Krell, M.; Khan, S.; van Driel, J. Analyzing Cognitive Demands of a Scientific Reasoning Test Using the Linear Logistic Test Model (LLTM). *Educ. Sci.* **2021**, *11*, 472. https://doi.org/10.3390/ educsci11090472

Academic Editor: Silvija Markic

Received: 15 July 2021 Accepted: 23 August 2021 Published: 27 August 2021


Existing studies suggest that pre-service science teachers typically have basic SRC, with pre-service secondary teachers outperforming pre-service primary or early childhood teachers [5]. For the specific skill of scientific modeling, it was shown that pre-service science teachers apply strategies and experience challenges similar to secondary school students [15]. Furthermore, longitudinal studies revealed that SRC slightly develop during science teacher education at university [16] and that specific teacher education programs can contribute to competence development in this field [17].

The development and evaluation of assessments that are capable of providing valid measures of respondents' SRC have become an integral part of research in science education [8,18]; however, several authors have recently questioned the quality of many existing instruments to assess SRC. For example, Ding et al. [19] identified poor definitions of the underlying constructs to be measured and criticized that most scientific reasoning instruments, "[A]re in fact intended to target a broader construct of scientific literacy" (p. 623) rather than specific competencies needed for reasoning in science. In a review study, it was found that the psychometric quality of most published instruments to assess SRC was not evaluated satisfactorily [18]. Furthermore, Osborne [8] criticized a general lack of validity evidence for these available instruments and referred to the valid assessment of SRC, as, "[T]he 21st century challenge for science education."

Arguably, an exception to these criticisms regarding the quality of instruments to assess SRC is a German multiple-choice instrument that has recently been developed to assess pre-service science teachers' SRC during their course of studies at university [16,20]. English and Spanish adaptations of this instrument have also been developed and evaluated [5,21]. For the original German instrument, comprehensive sources of validity evidence have been considered following the recommendations in the Standards for Educational and Psychological Testing [22]. For example, the instrument has been developed based on a clear theoretical framework, distinguishing between two sub-competencies of scientific reasoning—*conducting scientific investigations* and *using scientific models*—and seven related skills of *formulating research questions*, *generating hypotheses*, *planning investigations*, *analyzing data and drawing conclusions*, *judging the purpose of models*, *testing models*, and *changing models*. Furthermore, standardized construction guidelines for item development were used based on this framework [23], and the whole process of item development was guided by a critical examination of various sources of validity evidence (e.g., [23,24]), as summarized in [16]. In this process, one validation study [24] analyzed the influence of item features on item difficulty. The authors found that item length (word count) and the use of visual images, tables, formulas, abstract concepts, and specialized terms in the items significantly contributed to item difficulty. Taken together, these features contributed to 32% of the variance in item difficulty. The authors argued that these findings still provide evidence for the valid interpretation of the test scores as measures of SRC because the identified effects of item features on item difficulty were in accordance with the theoretical background of item development, and they showed a plausible pattern of cognitive demands [24].

In general, the analysis of item features and their influence on item difficulty is a common approach in psychological and educational assessment research [25–28]. The basic assumption in this context is that assessments should represent the construct under investigation and that test items should stimulate the cognitive processes that constitute the target construct (construct validity or construct representation, respectively [29,30]). For example, items intended to assess the competency of "analyzing evidence" might provide an experimental design and corresponding findings and ask students to interpret the evidence appropriately [28]. The development of test items has to account for item features and underlying cognitive processes so that the instrument allows for valid interpretations of the obtained test scores [27]. Related to this, legitimate and illegitimate sources of item difficulty have been distinguished [24]. While legitimate sources of item difficulty are those that are intentionally implemented to assess skills or knowledge representative of the respective competency, illegitimate sources of item difficulty are not directly related to the target construct, such as reading capabilities in science or mathematics tests, and can negatively impact valid test score interpretation [24]. Identifying threats to validity, such as construct-irrelevant sources of item difficulty, however, has the potential to inform item development and thus improve the validity of assessments. Furthermore, construct-relevant sources of item difficulty can guide item development [27,31]. Nonetheless, "[W]hat constitutes construct-irrelevant variance is a tricky and contentious issue" [30] (p. 743) and depends on the definition of the respective construct. As a result, exploratory studies investigating the influence of item features on the item difficulty of an existing assessment instrument can contribute to a better understanding of the cognitive demands of the instrument [26,28].

This study adds to this body of research by investigating the influence of item features on the item difficulty of the above-mentioned German multiple-choice instrument. It thereby contributes to the construct validation of this internationally employed testing instrument [16,21]. Furthermore, and independent of the specific instrument, this study provides insights into the influence of item features on item difficulty, which scholars might use as direction for systematically developing testing instruments that account for such features [27].

The focus of this study is on formal item features related to text complexity and the presence of external visual representations. Some studies have already investigated the influence of formal item features on item difficulty in science education. For example, text length has been identified as a feature that tends to increase item difficulty [24,32]. In contrast to internal (i.e., mental) representations, external representations are defined as externalizations or materializations of more or less abstract thoughts in the form of gestures, objects, pictures, and signs [33]. Taxonomies of (external) representations distinguish between descriptions and depictions, with descriptions including text, mathematical expressions, and formulas, and depictions including photographs, maps, and diagrams [34]. Many representations are combinations of different forms; for example, diagrams include textual (descriptive) and graphical (depictive) elements [35]. Formal item features, such as text length or task format, have been described as part of the surface structure of test items; that is, such item features are often not directly related to the construct to be assessed [32,36]. On the other hand, the existence of formal item features is an inevitable part of item development, and hence, knowledge about how such features influence item difficulty is of significance for scholars interested in developing testing instruments.

#### **2. Aims of the Study and Hypotheses**

This study investigates the effect of item features on item difficulty for the English adaptation of the multiple-choice SRC assessment instrument described above. Item features related to text complexity and the presence of visual representations will be tested for their influence on item difficulty. This study complements existing evaluation studies on the English adaptation of the instrument, which have not yet analyzed item features [5,21]. Furthermore, the present study adds to our knowledge of which cognitive demands appear to be imposed by multiple-choice assessment instruments and whether these demands are relevant for the construct under investigation—in this case, SRC [24,28,31].

The following assumptions undergird the study: (1) item difficulty increases with the complexity of the text included in the item, because complex text makes it more difficult to encode and process the information needed to identify the attractor (i.e., the correct answer option) [24,32]; (2) item difficulty increases for items that contain visual representations next to textual information, because this addition requires respondents to simultaneously encode and process information presented in text and image, which, in turn, increases cognitive load [37].

#### **3. Materials and Methods**

#### *3.1. Sample and Data Collection*

Data of *N* = 243 pre-service science teachers from Australia (*n* = 103; mean age = 28), Canada (*n* = 112; mean age = 27), and the UK (*n* = 26; mean age = 31) were analyzed in this study. The data partly originate from existing studies [2,3,5,21] and were secondarily analyzed for the purpose of this study; the UK sub-sample contains new data that have neither been analyzed nor published before. Hence, this study made use of available data sets in order to test the above hypotheses. Having an international sample with participants from three countries allowed the hypotheses to be tested independently of the specific context and, thus, potentially provides more generalizable findings. SRC are an important goal of science teacher education in all three countries [2,3].

In each case, the participating pre-service science teachers voluntarily agreed to participate in this study and anonymously completed the instrument, which is why the sample sizes are relatively small (e.g., *n* = 26 from the UK). The study information was shared with participants digitally (i.e., via email) or in person, in science methods courses of the respective pre-service teacher education programs. Completing the instrument, however, occurred outside of courses and was not an obligatory part of the pre-service science teachers' curriculum. Ethics approval was obtained from the local ethics committees. To ensure equivalence of testing conditions, the same standardized test instructions were used in all three subsamples, covering background information about the study, the assessed competencies, and the voluntary nature of participation.

In all three subsamples, the above-mentioned English adaptation of the German SRC assessment instrument was administered. As described in [5,21], the English adaptation was systematically translated and evaluated based on the German original instrument [16]. For each of the seven skills of *formulating research questions*, *generating hypotheses*, *planning investigations*, *analyzing data and drawing conclusions*, *judging the purpose of models*, *testing models*, and *changing models*, the English instrument includes three multiple-choice items (i.e., 21 items in total). Each item is contextualized within an authentic scientific context, and the respondents have to apply their procedural and epistemic knowledge within this context to identify the attractor. (For sample items, see [21]; the full instrument is available upon request to the first author).

#### *3.2. Item Analysis*

The aim of this study was to analyze the influence of item features related to text complexity and the presence of visual representations on item difficulty. For this purpose, the 21 items were analyzed by a trained student assistant and the first author to obtain information about text complexity and the presence of visual representations (i.e., figures or diagrams) in each item. The latter was scored as yes (=1) or no (=0), consistent with the scoring in earlier studies (e.g., [24,32]). For text complexity, three different readability measures were calculated, as described in [38]: the 4. Wiener Sachtextformel (WSTF), local substantival textual cohesion (LSTC), and global substantival textual cohesion (GSTC). These readability measures provide a sound statistical estimation of text complexity in science education [38].

The 4. Wiener Sachtextformel (WSTF) calculates a readability measure based on the percentage of words with more than two syllables (SYLL) and the average length (i.e., word count) of sentences (SENT) as follows [39]:

$$\text{WSTF} = 0.2656 \cdot \text{SENT} + 0.2744 \cdot \text{SYLL} - 1.693. \tag{1}$$

Substantival textual cohesion indicates text coherence based on substantives, either locally (i.e., in consecutive sentences) or globally (i.e., in the whole text) [40]. Global substantival textual cohesion (GSTC) is calculated by dividing the number of substantives that appear more than once in a text (SUB2) by the number of substantives that appear only once (SUB). Local substantival textual cohesion (LSTC) is calculated by dividing the number of substantially connected sentences (LSCS, i.e., consecutive sentences with the same substantive) by the total number of sentences (S) as follows:

$$\text{GSTC} = \frac{\text{SUB}_2}{\text{SUB}} \cdot 100\%, \tag{2}$$

$$\text{LSTC} = \frac{\text{LSCS}}{\text{S}} \cdot 100\%. \tag{3}$$

Higher numbers of WSTF and lower numbers of LSTC and GSTC indicate more complex texts; 5.4 < WSTF < 8.4, 0.41 < LSTC < 0.65, and 0.70 < GSTC < 0.89 have been suggested as indicating appropriately understandable texts for science education [38].
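For illustration, the three readability measures can be computed directly from simple text counts. The following Python sketch implements Equations (1)–(3); the function and variable names are our own, and the range-checking helper assumes the proportion (rather than percentage) scaling in which the values are reported in Table 1.

```python
def wstf(sent, syll):
    """4. Wiener Sachtextformel (Equation (1)): readability from the
    average sentence length in words (SENT) and the percentage of
    words with more than two syllables (SYLL)."""
    return 0.2656 * sent + 0.2744 * syll - 1.693

def gstc(sub2, sub):
    """Global substantival textual cohesion (Equation (2)): substantives
    appearing more than once (SUB_2) divided by substantives appearing
    only once (SUB); returned as a proportion."""
    return sub2 / sub

def lstc(lscs, s):
    """Local substantival textual cohesion (Equation (3)): substantially
    connected consecutive sentences (LSCS) divided by the total number
    of sentences (S); returned as a proportion."""
    return lscs / s

def is_appropriate(w, l, g):
    """Suggested ranges for appropriately understandable texts in
    science education [38]; w = WSTF, l = LSTC, g = GSTC."""
    return 5.4 < w < 8.4 and 0.41 < l < 0.65 and 0.70 < g < 0.89
```

Note that a higher WSTF but lower LSTC and GSTC indicate more complex text, so the three measures point in opposite directions.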

#### *3.3. Data Analysis: Linear Logistic Test Model*

To estimate the influence of the different item features on an item's difficulty, the linear logistic test model (LLTM) was applied [41,42], as this model has been applied in several similar studies analyzing item features (e.g., [28,43]). The LLTM belongs to the family of Rasch models, established psychometric models utilized in psychological and educational research [44]. The family of Rasch models includes descriptive and explanatory psychometric models [45,46]. For example, the one-parameter logistic model (1PLM) is a descriptive psychometric model that allows for the estimation of individual person ability (*θ<sup>s</sup>*) and item difficulty (*β<sup>i</sup>*) parameters. In the 1PLM, it is assumed that the probability of a correct item response depends only on *θ<sup>s</sup>* and *β<sup>i</sup>* [44]:

$$P(X_{is}) = \frac{\exp(\theta_s - \beta_i)}{1 + \exp(\theta_s - \beta_i)} \tag{4}$$
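As a minimal illustration, Equation (4) can be implemented in a few lines; the function name is our own and not part of any particular psychometrics package.

```python
import math

def p_correct(theta, beta):
    """1PLM (Equation (4)): probability of a correct response for a
    person with ability theta on an item with difficulty beta,
    both expressed in logits."""
    return math.exp(theta - beta) / (1 + math.exp(theta - beta))
```

When ability equals difficulty the probability is 0.5; higher ability (or an easier item) pushes it toward 1.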

In contrast to descriptive models such as 1PLM, explanatory models consider item or person features to further explain the item difficulty or person ability parameters, respectively [46]. The LLTM is an item explanatory model because it assumes that item difficulty is a linear (additive) combination of basic parameters *α<sup>k</sup>* [43]. Formally, the *β<sup>i</sup>* parameter of 1PLM is replaced with a linear combination of these basic parameters [41] as follows:

$$\beta'_i = \sum_{k=1}^{K} \alpha_k X_{ik} \tag{5}$$

where *α<sup>k</sup>* is the regression coefficient for item feature *k* (i.e., the estimated difficulty of the item feature *k*), and *X<sup>ik</sup>* is the given weight of item feature *k* on item *i* (i.e., the extent to which the respective item feature applies to item *i*). Hence, *α<sup>k</sup>* illustrates the contribution of item feature *k* to item difficulty [43]. If an LLTM can be shown to fit the given data, the estimated parameters *α<sup>k</sup>* provide measures of the item features' contribution to item difficulty. More specifically, it is assumed that item difficulty can be completely explained by the parameters specified in the LLTM [42]. Therefore, the LLTM can be considered more restrictive and more parsimonious than the 1PLM [47].
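The additive decomposition in Equation (5) can be sketched as follows; the feature labels and the numeric values in the example are invented for illustration and are not estimates from this study.

```python
def lltm_beta(alphas, weights):
    """Equation (5): item difficulty as the weighted sum of basic
    parameters alpha_k, with design-matrix weights X_ik for one item."""
    return sum(a * x for a, x in zip(alphas, weights))

# Hypothetical example with two item features, e.g., a skill dummy
# and a visual-representation indicator (values are made up):
alphas = [0.50, -0.80]   # estimated feature difficulties (logits)
x_item = [1, 1]          # both features apply to this item
beta = lltm_beta(alphas, x_item)   # approximately -0.30
```

Because the combination is purely additive, each feature's contribution to an item's difficulty can be read off directly from its *α* parameter.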

To evaluate the model fit of an LLTM, a two-step procedure is proposed: first, the 1PLM has to fit "at least approximately" [42] (p. 509) to the data. For testing the fit of a Rasch model to given data, fit indices such as mean squared standardized residuals (MNSQs) are proposed; MNSQs provide a measure of the discrepancy between the assumptions of the Rasch model and the observed data [48]. Second, the decomposition of *β<sup>i</sup>* (Formula (5)) needs to be checked for empirical validity. For this reason, the item difficulty parameters estimated in the 1PLM and the corresponding LLTM can be compared (e.g., graphically or by calculating the Pearson correlation coefficient [25]). High associations between both sets of parameters indicate that the decomposition of *β<sup>i</sup>* might be valid [42]. Furthermore, information criteria, such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), and the log-likelihood difference test can be applied to compare the fit of both models and of different LLTMs [42]. In the present study, the R package eRm [49] was used for model specification and parameter estimation.
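The information criteria and the log-likelihood difference test used for model comparison follow standard definitions, sketched below in Python; this is an illustration of the formulas, not the eRm implementation used in the study.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion; smaller values indicate better
    relative fit. k = number of estimated parameters."""
    return -2 * log_lik + 2 * k

def bic(log_lik, k, n):
    """Bayesian information criterion; penalizes parameters more
    strongly for larger samples (n = number of observations)."""
    return -2 * log_lik + k * math.log(n)

def lr_statistic(log_lik_restricted, log_lik_general):
    """Log-likelihood difference (likelihood ratio) statistic comparing
    a restricted model (e.g., an LLTM) with a more general one (e.g.,
    the 1PLM); chi-square distributed under the null hypothesis with
    df equal to the difference in the number of parameters."""
    return -2 * (log_lik_restricted - log_lik_general)
```

Because the LLTM is nested in the 1PLM, its log-likelihood can never exceed that of the 1PLM; the test asks whether the loss of fit from the restriction is statistically significant.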

#### *3.4. Model Specification*

In this study, two LLTMs with the following variables were specified to estimate the parameters *α<sup>k</sup>*. In the first LLTM, called LLTMbaseline, each item was coded according to which of the seven skills it belongs (i.e., dummy coding). This procedure mirrors the assumption that there are specific cognitive demands involved in solving the items associated with each skill [23,50]. Hence, the assignment to the respective skills is assumed to completely explain the item difficulty in the LLTMbaseline.

The second LLTM—called LLTMextended—additionally included parameters for the readability measures WSTF, LSTC, and GSTC, and for the presence of visual representations described above. Hence, the LLTMextended assumes that, next to the scientific reasoning skills, the readability of the text and the presence of visual representations also impose specific cognitive demands on processing and encoding the information provided in the items and on answering correctly [24,32,37,38].

#### **4. Results**

The Results Section is subdivided into three subsections: Basic Statistics, Descriptive Modeling, and Explanatory Modeling. The latter two sections refer to the two-step procedure of LLTM model evaluation, as described in Section 3.3.

#### *4.1. Basic Statistics*

Table 1 provides basic descriptive statistics and Pearson correlations for item difficulty and the variables considered in this study. Item difficulty was calculated as the proportion of correct responses (i.e., 1.0 = 100% correct responses). It is evident that the multiple-choice items had appropriate difficulty for the present sample, as about 47% of them were answered correctly (*M*ItemDiff = 0.47). About 43% of the items contain a visual representation. Based on the WSTF and LSTC, the items would be considered rather easy to read. The LSTC is even higher than expected, indicating a very high local substantival textual cohesion. Only the average GSTC (*M*GSTC = 0.63) indicates low global substantival textual cohesion of the items. A statistically significant correlation (i.e., *p* < 0.05) was only found between LSTC and GSTC (*r* = 0.48; medium effect size). Given the medium effect size of this correlation, no serious problem of multicollinearity arises for the further analyses. Notably, no statistically significant correlations were found between item difficulty (ItemDiff) and the variables WSTF, LSTC, GSTC, and VisRep.
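Item difficulty as used in Table 1 is simply the proportion of correct responses per item; a minimal sketch:

```python
def item_difficulty(scores):
    """Proportion of correct responses for one item, where scores is a
    list of 0/1 values across respondents. Under this coding, higher
    values mean easier items (1.0 = everyone answered correctly)."""
    return sum(scores) / len(scores)
```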

**Table 1.** Mean score (*M*), standard deviation (*SD*), and Pearson correlation coefficient (r) with related *p*-value for the respective variables. Expectance = values indicating appropriately understandable texts as suggested in [38]. ItemDiff = item difficulty; WSTF = 4. Wiener Sachtextformel; LSTC = local substantival textual cohesion; GSTC = global substantival textual cohesion; VisRep = item contains a visual representation (0 = no; 1 = yes).


For further illustration, sample items can be found in Appendix A. These items represent the median score of WSTF (*M* = 6.51), LSTC (*M* = 0.85), and GSTC (*M* = 0.61), respectively.

Figure 1 below illustrates how the variables shown in Table 1 differ between the tasks for the seven skills of scientific reasoning. Kruskal–Wallis tests indicate significant differences between the skills for the variables GSTC (*H* = 13.19, *p* = 0.040) and VisRep (*H* = 12.22, *p* = 0.045). For GSTC, items related to the skills *planning investigations* (*M* = 0.73) and *analyzing data and drawing conclusions* (*M* = 0.78) show rather high values, compared to lower values for the skills *formulating research questions* (*M* = 0.53), *generating hypotheses* (*M* = 0.55), *judging the purpose of models* (*M* = 0.66), *testing models* (*M* = 0.54), and *changing models* (*M* = 0.64). These five skills fall below the range of 0.70 < GSTC < 0.89 suggested as indicating appropriately understandable texts in science education [38]. For VisRep, it is evident that items related to *formulating research questions*, *generating hypotheses,* and *planning investigations* do not contain visual representations, while most items related to the other skills do.

*Educ. Sci.* **2021**, *11*, x FOR PEER REVIEW 7 of 16

**Figure 1.** Boxplots for the variables ItemDiff (**top left**), WSTF (**top right**), LSTC (**middle left**), GSTC (**middle right**), and VisRep (**bottom left**) separated for the items assessing the seven skills *formulating research questions* (Que), *generating hypotheses* (Hyp), *planning investigations* (Pla), *analyzing data and drawing conclusions* (Ana), *judging the purpose of models* (Pur), *testing models* (Tes), and *changing models* (Cha).

#### *4.2. Descriptive Rasch Modeling: One-Parameter Logistic Model (1PLM)*

The fit between data and 1PLM has been evaluated and documented in detail in previous studies [2,5,16,21]. Here, MNSQs are reported, which indicate the discrepancy between the assumptions of the Rasch model and the data. MNSQ values are always positive because, statistically, they are chi-square statistics divided by their degrees of freedom [51]. MNSQ values should lie in the range of 0.5–1.5 ("productive for measurement") or 1.5–2.0 ("unproductive for construction of measurement but not degrading"), respectively, but not be >2.0 ("distorts or degrades the measurement system") [48]. MNSQs can be calculated in two different versions: the outfit and the infit MNSQ. As the outfit MNSQ is more sensitive to outliers than the infit MNSQ, both statistics should be considered [51].

The MNSQ values in this study range between 0.7 and 1.2 (outfit MNSQ) and between 0.9 and 1.1 (infit MNSQ), respectively. Furthermore, the Andersen likelihood ratio test with the external split criterion "country" (i.e., Australia, Canada, UK) is not significant (*LR*(40) = 46.22, *p* = 0.23), thus indicating item homogeneity [49]. Person separation reliability is rel. = 0.52, similar to previous reliability estimates for this instrument (e.g., [5]: EAP/PV reliability = 0.55; [16]: Cronbach's Alpha = 0.60).
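For binary responses, the outfit MNSQ described above can be sketched as the mean of squared standardized residuals. This is the textbook definition rather than the code used in this study, and the information-weighted infit variant is omitted for brevity.

```python
def outfit_mnsq(observed, expected):
    """Outfit mean-square for one item: average of squared standardized
    residuals z^2 = (x - p)^2 / (p * (1 - p)), where x is the observed
    0/1 response and p the model-implied probability of success."""
    z2 = [(x - p) ** 2 / (p * (1 - p)) for x, p in zip(observed, expected)]
    return sum(z2) / len(z2)
```

Data that conform to the Rasch model yield values near 1.0, which is why the ranges quoted above are centered on 1.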

#### *4.3. Explanatory Rasch Modeling: Linear Logistic Test Model (LLTM)*

MNSQ values for both LLTMs indicate a reasonable fit between data and model (LLTMbaseline: 0.7 < outfit MNSQ < 1.6; 0.7 < infit MNSQ < 1.5; LLTMextended: 0.5 < outfit MNSQ < 1.7; 0.7 < infit MNSQ < 1.6). Person separation reliability is rel. = 0.46 and 0.50, respectively. Pearson correlations between the item parameters estimated in the LLTMs and the 1PLM are large for both the LLTMbaseline (*r* = 0.65, *p* = 0.002; i.e., *R*<sup>2</sup> = 0.42) and the LLTMextended (*r* = 0.86, *p* < 0.001; i.e., *R*<sup>2</sup> = 0.75). The graphical model tests of the LLTMs and the 1PLM show that the item parameters scatter around the 45° line rather well for the LLTMextended, while less so for the LLTMbaseline (Figure 2). This is also indicated by the empirical regression line (blue lines in Figure 2), which is closer to the 45° diagonal when comparing the item difficulty parameters of the 1PLM and the LLTMextended than when comparing those of the 1PLM and the LLTMbaseline. In sum, the findings indicate that the item parameters estimated in the LLTMextended were closer to the parameters estimated in the 1PLM than those estimated in the LLTMbaseline.
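The association between the two sets of item parameters can be quantified with the Pearson correlation, and squaring *r* gives the share of variance explained. A self-contained sketch:

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists,
    e.g., item difficulty parameters estimated under the 1PLM and
    under an LLTM."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```

A correlation of *r* = 0.86 between the 1PLM and LLTM parameter sets thus corresponds to *R*² ≈ 0.75, i.e., about 75% shared variance.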



**Figure 2.** Graphical model tests comparing the 1PLM (x-axis) and the LLTM (y-axis) by the estimated item parameters (logits) for the LLTMbaseline (**top**) and the LLTMextended (**bottom**). Each dot represents one item, with ellipses indicating 2 × the standard error of the estimated item parameter. The blue line is the empirical regression, with a 95% confidence interval in grey.

Table 2 provides the information criteria AIC and BIC and the log-likelihood difference test for the model comparison between the 1PLM and the two LLTMs. AIC and BIC assess relative model fit, with smaller values indicating the better fitting model. These values, therefore, indicate that the 1PLM fits the data better than both LLTMs. The log-likelihood difference test also indicates a significantly better fit of the 1PLM compared to both LLTMs. Comparing both LLTMs, AIC and BIC indicate that the LLTMextended fits the data better than the LLTMbaseline.


**Table 2.** Model comparison between the 1PLM and both LLTMs (LogLik: marginal log-likelihood; AIC: Akaike information criterion; BIC: Bayesian information criterion; LD test: *p*-value of the log-likelihood difference test comparing the respective LLTM with the 1PLM).

Table 3 provides the *α<sup>k</sup>* parameters as estimated in the two LLTMs. Positive *α<sup>k</sup>* parameters indicate that the respective variable decreases item difficulty, while negative *α<sup>k</sup>* parameters indicate an increase in item difficulty. For the dummy coded variables representing the seven skills of scientific reasoning, *planning investigations* was chosen as the baseline because the related items turned out to be rather easy (Figure 1). As the confidence intervals of most parameters in Table 3 do not include zero, these parameters can be assumed to be significantly different from zero at the 5% level; exceptions are WSTF, Pur, Tes, and Cha in the LLTMextended. Comparing the parameters in both LLTMs, it is evident that the additional consideration of the variables WSTF, LSTC, GSTC, and VisRep reduces the effect of most of the dummy coded skills.


**Table 3.** Parameters estimated in the two LLTMs (SE = standard error; 95% CI = 95% confidence interval); lines with 95% CI including zero are formatted in grey.

In the LLTMextended, the presence of visual representations (*α<sup>k</sup>* = −0.79) makes items harder to solve. Similarly, items related to the skills *formulating research questions*, *generating hypotheses*, and *analyzing data and drawing conclusions* are harder to solve than items related to the skill *planning investigations* (i.e., the baseline); this is also evident in Figure 1. As lower values of LSTC and GSTC are indicative of more complex texts, the *α<sup>k</sup>* parameter of GSTC is in line with what was expected: the lower the GSTC, the more difficult the items are to solve. Contrary to expectations, lower LSTC values decreased item difficulty (*α<sup>k</sup>* = −1.89).

As described above (Formula (5)), each item's difficulty is calculated in an LLTM as a linear (additive) combination of the item features' difficulties, with *α<sup>k</sup>* as the estimated difficulty of item feature *k*. Based on the *α<sup>k</sup>* values in Table 3, this means for the LLTMextended that, for example, GSTC impacts item difficulty about seven times more strongly than VisRep (5.61/0.79 = 7.1). It is important to note that the *α<sup>k</sup>* values are unstandardized and do not take the different scales of the item features into account (e.g., the binary variable VisRep vs. the continuous variable GSTC).

#### **5. Discussion**

The purpose of this study was to investigate the effect of item features on item difficulty for a multiple-choice SRC assessment instrument established in science education [5,16,21]. More specifically, item features related to text complexity (4. Wiener Sachtextformel: WSTF; local and global substantival textual cohesion: LSTC and GSTC) and the presence of visual representations as figures or diagrams (i.e., VisRep) were investigated for their influence on item difficulty. The findings revealed that LSTC and GSTC, as well as VisRep, significantly impacted item difficulty in the multiple-choice assessment instrument, while WSTF did not. These findings are discussed below while acknowledging the limitations of this study.

In this study, the item features considered in the LLTMextended explain about 75% of the variance in item difficulty estimated in the 1PLM, well above the threshold for a large effect (*R*<sup>2</sup> ≥ 0.26; [27]) and also higher than what has been found in similar studies (e.g., [28]: *R*<sup>2</sup> = 0.43; [24]: *R*<sup>2</sup> = 0.32). Conversely, a variance explanation of 75% means that 25% of the variance in item difficulty estimated in the 1PLM cannot be explained with the parameters specified in the LLTMextended and might be attributable to individual differences. For example, general cognitive abilities such as verbal intelligence and problem-solving skills have been shown to significantly predict students' SRC [52].

The difference in variance explanation between the two LLTMs specified in this study suggests that 33% of the variance in item difficulty can be explained with the additional parameters related to text complexity and the presence of visual representations included in the LLTMextended, that is, WSTF, LSTC, GSTC, and VisRep. This value of 33% is very similar to the 32% found in an earlier study [24] on item features affecting item difficulty in the German version of the instrument. This similarity in the effect of item features on item difficulty in both language versions of the instrument (English and German) is another indicator of test equivalence between the two versions [21].

A comparison of the parameters estimated in the LLTMbaseline and the LLTMextended (Table 3) reveals that, with the additional consideration of parameters related to text complexity and the presence of visual representations, the significant effects of *judging the purpose of models* (PUR), *testing models* (TES), and *changing models* (CHA), which were found in the LLTMbaseline, disappeared. This finding indicates that the significant effects of PUR, TES, and CHA, identified in the LLTMbaseline, might be artifacts caused by the effect of item features that were not considered in the LLTMbaseline and are confounded with PUR, TES, and CHA. For example, all items related to PUR contain visual representations (Figure 1), while, on average, this applies to only 43% of the items (Table 1). Hence, the effect of PUR identified in the LLTMbaseline might have been caused by the presence of visual representations as figures or diagrams in the items related to PUR.

While the correlation analysis (Table 1) revealed no significant association between item difficulty and the item parameters WSTF, LSTC, GSTC, and VisRep, these associations were found for most of the parameters in the LLTMextended. This difference in findings is most likely caused by the fact that the correlation analysis was carried out on the level of items (i.e., *N* = 21), a relatively small number for detecting associations at a statistically significant level [26]. In contrast, the parameter estimation in the LLTM was performed based on the larger sample of individuals, i.e., *N* = 243 in this study.
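Whether a correlation of a given size reaches significance depends strongly on *N*. As a rough illustration (standard *t*-test for a Pearson correlation; the value *r* = 0.40 is hypothetical), a moderate correlation that is non-significant with 21 items would be clearly significant with 243 persons:

```python
import math

def t_statistic(r, n):
    """t statistic for testing a Pearson correlation r against zero (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = 0.40  # hypothetical moderate correlation

t_items = t_statistic(r, 21)     # ~1.90, below the two-tailed 5% critical value of ~2.09 (df = 19)
t_persons = t_statistic(r, 243)  # ~6.78, far above the critical value of ~1.97 (df = 241)
```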

An examination of the individual parameters estimated in the LLTMextended (Table 3) shows that items containing visual representations tended to be harder to solve. This finding was also reported in [24], where it was described as unexpected and potentially caused by the fact that visual representations in the items "were often used to show complex scientific models and, hence, may increase the difficulty" (p. 8). Another explanation might be that the simultaneous encoding and processing of information provided in text and image can increase cognitive load and, hence, item difficulty [37]. As expected, lower global substantival textual cohesion increased item difficulty, with GSTC calculated as the proportion of substantives that appear more than once in a text (Formula (2)); however, unexpectedly, lower local substantival textual cohesion decreased item difficulty, with LSTC calculated as the proportion of sentences sharing a substantive with the preceding or subsequent sentence (Formula (3)). Both GSTC and LSTC are established indicators of text complexity and readability, with lower values indicating more difficult text [38]. The effect of GSTC on item difficulty most likely indicates that solving the items requires the encoding and processing of complex textual information provided globally in the item text, a task that is even more difficult with text that is challenging to read [24,32]. For the present multiple-choice items, this processing might involve respondents having to encode, process, and combine information that is textually presented in different parts of the item, such as the item stem and the answering options [50]. Hence, if information in the item stem and the answering options is presented more coherently (in terms of substantives), an item becomes easier to solve. For example, signal words occurring both in the item stem and the attractor can ease item difficulty [28].
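Following the definitions above, both cohesion measures can be approximated on preprocessed text. The sketch below assumes the substantives (nouns) of each sentence have already been extracted; whether Formula (2) uses noun types or tokens in the denominator is an assumption here.

```python
def gstc(sentences):
    """Global substantival textual cohesion (cf. Formula (2)): proportion of
    distinct substantives that appear more than once in the whole text."""
    counts = {}
    for nouns in sentences:
        for noun in nouns:
            counts[noun] = counts.get(noun, 0) + 1
    return sum(1 for c in counts.values() if c > 1) / len(counts)

def lstc(sentences):
    """Local substantival textual cohesion (cf. Formula (3)): proportion of
    sentences sharing a substantive with the preceding or subsequent sentence."""
    n = len(sentences)
    linked = 0
    for i, nouns in enumerate(sentences):
        neighbours = set()
        if i > 0:
            neighbours |= set(sentences[i - 1])
        if i < n - 1:
            neighbours |= set(sentences[i + 1])
        if set(nouns) & neighbours:
            linked += 1
    return linked / n

# Illustrative input: the substantives of a three-sentence item text.
item = [{"model", "language"}, {"model", "sound"}, {"perception"}]
# gstc(item) -> 0.25 (1 of 4 substantives repeats); lstc(item) -> 2/3
```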
The unexpected finding related to the effect of LSTC on item difficulty should be investigated further, for example, qualitatively, using cognitive interviews. One plausible reason for this unexpected finding is that both GSTC and LSTC are typically used to analyze the readability of texts longer than those included in the items of the present multiple-choice instrument [38]. Finally, the significant effects of some of the dummy-coded skills (i.e., QUE, HYP, ANA; Table 3) illustrate that the items developed to assess the different skills of scientific reasoning require the application of specific procedural and epistemic knowledge to be solved [23].

The multiple-choice instrument under consideration in this study is already employed by scholars internationally in three language versions [2,16,21]. The findings of the present study shed light on the specific cognitive demands that are necessary to correctly answer the items. These findings should be considered by scholars when interpreting test scores. Independent of the specific instrument, the study provides important insights into the influence of item features on item difficulty. These insights can inform the systematic development of testing instruments that account for such features [27].

Naturally, this study has some limitations. The LLTM is well established for analyzing item features and their influence on item difficulty within the approach of evaluating construct representation (e.g., [25,26]). Nevertheless, the assumption of an additive combination of the single features' difficulties, as described in Formula (5), has also been criticized [43]. For example, a multiplicative combination of each item feature's influence on item difficulty might also be possible. Furthermore, only main effects were considered in the LLTMs in this study, not interaction effects between the specified variables. The variables considered in this study were also analyzed post hoc and were not systematically considered during item development; hence, the item features were not equally distributed across the items for the seven skills of SRC (e.g., items related to *formulating research questions*, *generating hypotheses*, and *planning investigations* do not contain visual representations at all; Figure 1).

Finally, LLTMs assume that the specified item features completely (i.e., 100%) explain item difficulty [42], which was not the case in the present study. Despite a good explanation of item difficulty in the LLTMextended, there is a significantly better model fit for the 1PLM (Table 2). The comparatively poor model fit of an LLTM is a common finding (e.g., [25,43]), which is explained by the strict assumption of a complete explanation of item difficulty by the specified item features [41]. The model comparison based on information criteria, on the other hand, does not allow any statement about the absolute fit of the models considered [53]. Since a relatively worse model fit does not necessarily indicate an absolutely bad model fit, a check of the difficulty parameters estimated in the LLTM, in the sense of a prognostic validation by replication studies, has been proposed [27,41]. This approach could be employed in the present context by developing additional items with systematically varied item features and then testing these features' influence on item difficulty again. Notwithstanding this issue of model fit, the comparison of the item difficulty parameters estimated in the 1PLM and both LLTMs allowed for an estimation of the amount of variance in item difficulty explained by the item features specified in the respective LLTM.
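The information criteria behind such model comparisons penalize the log-likelihood by the number of estimated parameters, so a more parsimonious LLTM can still lose to a 1PLM if its likelihood is sufficiently worse. A generic sketch with purely illustrative values (the log-likelihoods and parameter counts below are not from this study):

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion; lower is better."""
    return -2 * log_lik + 2 * n_params

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion; lower is better."""
    return -2 * log_lik + n_params * math.log(n_obs)

# Illustrative values only: a 1PLM estimating 21 item difficulties vs. an
# LLTM estimating far fewer feature weights at the cost of a worse likelihood.
aic_1plm = aic(-2500.0, 21)  # 5042.0
aic_lltm = aic(-2540.0, 11)  # 5102.0 -> the 1PLM is preferred despite more parameters
```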

#### **6. Conclusions**

In this study, we investigated the effect of the item features WSTF, LSTC, GSTC, and VisRep on the difficulty of the items of a multiple-choice instrument to assess SRC in science education [5,21]. This analysis was based on the assumptions that the readability of text and the presence of visual representations impose specific cognitive demands to process and encode information provided in the items [24,32,37,38]. Furthermore, dummy-coded variables representing the specific skills of scientific reasoning were also considered in the analysis, assuming that specific cognitive demands (i.e., application of specific procedural and epistemic knowledge) are associated with each skill [23,50]. The findings illustrate that these variables, in sum, explain about 75% of the variance in item difficulty.

From a validity perspective, the similarity between the present findings and the previous study on the German version of the multiple-choice instrument [24] provides further evidence for the test equivalence of both language versions [21]. From a cognitive point of view [25], the findings of the present study suggest that specific cognitive demands are imposed by the readability of text and the presence of visual representations in multiple-choice assessment instruments. Specifically, the multiple-choice items analyzed in the present study appear to demand the encoding, processing, and combining of textually presented information from different parts of the items—such as the item stem and the answering options—while simultaneously encoding and processing information that is presented in both the text and visual representations. It has been shown that solving the multiple-choice items used in this study requires the application of procedural and epistemic knowledge [23,50]. The findings of this study illustrate that the multiple-choice items on this assessment impose additional cognitive demands due to the necessity of processing text and visual representations.

**Author Contributions:** Conceptualization, M.K.; methodology, M.K.; investigation, M.K., S.K., and J.v.D.; resources, M.K., S.K., and J.v.D.; writing—original draft preparation, M.K.; writing—review and editing, M.K., S.K., and J.v.D.; visualization, M.K.; funding acquisition, M.K., S.K., and J.v.D. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the FUB Center for International Cooperation, Grant Number FMEx2-2016-104, and the 2018 UBC-FUB Joint Funding Scheme, Grant Number FSP-2018-401.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Institutional Review Boards (or Ethics Committees) of The University of Melbourne (ID 11530, approved 3 January 2018), of the University of British Columbia (ID H18-01801, approved 23 July 2018), and of the University of Dundee (ID E2018-94, approved 15 July 2019).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** Data are available upon request from the first author.

**Acknowledgments:** The authors wish to thank Alexis Gonzalez and Song Xue for their assistance with data collection and tabulation for the Canadian and UK samples, respectively; Christine Redman for her help with data collection for the Australian sample; and Jonna Kirchhof for her support with the item analysis described in Section 3.2.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Appendix A**

The items below represent the median scores of WSTF (*M* = 6.51), LSTC (*M* = 0.85), and GSTC (*M* = 0.61), respectively. Note that the items are presented in a tabular format for better readability and not in the same way as they appeared in the testing instrument. The attractor of each item is highlighted in italics.


#### **Item stem**

Fraud with organic grocery bags?

Under the influence of oxygen, bacteria and fungi transform organic material mainly into carbon dioxide and water. This process of transformation is called composting. A part of the resulting substances is transformed into humus (dead organic soil matter). The following report was published in a newspaper: "The Deutsche Umwelthilfe (German Environmental Relief) launch accusations against two supermarket chains: The allegedly 100 % compostable grocery bags are not biodegradable at all; therefore they are just as ecologically harmful as common plastic bags."

A team of experts has been asked to conduct a scientific investigation into how compostable these organic grocery bags really are.

#### **Answering options**

Which scientific question might underlie this investigation? Tick one of the boxes below.

- What impact do the biological decomposition products from organic grocery bags have on the environment?
- What biological decomposition products are formed in the process of composting organic grocery bags?
- What materials comprise organic grocery bags?

#### **Item "changing models 03" (MLSTC = 0.85)**

*Educ. Sci.* **2021**, *11*, x FOR PEER REVIEW 14 of 16

#### **Item stem** In physical reality, there is a variety of continuous transitions between different sounds, such as [ra] and [la]. While

Language Acquisition

Language Acquisition

In physical reality, there is a variety of continuous transitions between different sounds, such as [ra] and [la]. While infants are aurally capable of perceiving all of these different transitions of sound, an imprint toward a specific language can be observed after the first year of life. Vocal expressions within different languages are then no longer perceived in their entirety but rather through a specific filter. infants are aurally capable of perceiving all of these different transitions of sound, an imprint toward a specific language can be observed after the first year of life. Vocal expressions within different languages are then no longer perceived in their entirety but rather through a specific filter.

For this phenomenon of language acquisition, the following model was developed: For this phenomenon of language acquisition, the following model was developed:

*Figure.* Model of language acquisition by sound perception. *Figure.* Model of language acquisition by sound perception.

The model predicts that Australians and Japanese acquire their language in different ways and the subjective perception of sounds develops differently. The model predicts that Australians and Japanese acquire their language in different ways and the subjective perception of sounds develops differently.

#### **Answering options Answering options**

What reason would make it necessary to change the model? What reason would make it necessary to change the model? Tick one of the boxes below.

tion, harm the human body in the long run.

months—a considerably longer duration.

Tick one of the boxes below. The model has to be changed . . .


#### • …if the subjective perception of [ra] and [la] cannot be applied to languages other than English and Japanese. **Item "generating hypotheses 02" (***M***GSTC = 0.61)**

#### **Item stem**

*and [la].*

#### • …if there are Australian adults who do not have a distinct subjective perception of [ra] and [la]. In Outer Space

**Answering options** 

Tick one of the boxes below.

**Item "generating hypotheses 02" (***M***GSTC = 0.61) Item stem**  After many years of space missions, we know that existing conditions in space, such as zero gravity and cosmic radiation, harm the human body in the long run.

Previous stays in outer space were limited to a few months, whereas the scheduled flights to Mars will span many

In Outer Space Previous stays in outer space were limited to a few months, whereas the scheduled flights to Mars will span many months—a considerably longer duration.

After many years of space missions, we know that existing conditions in space, such as zero gravity and cosmic radia-In a study, the health impacts of such long-lasting stays in outer space are to be investigated.

In a study, the health impacts of such long-lasting stays in outer space are to be investigated.

#### **Answering options**

Which scientific hypothesis might underlie this investigation? Tick one of the boxes below.


### **References**

