Article

Data Collection and Monitoring in an Educational RCT of a Postsecondary Access Program: Assessing Internal and External Validity

by Brent Joseph Evans 1,*, Eric Perry Bettinger 2 and Anthony Lising Antonio 2
1 Department of Leadership, Policy, and Organizations, Vanderbilt University, Nashville, TN 37203, USA
2 School of Education, Stanford University, Stanford, CA 94305, USA
* Author to whom correspondence should be addressed.
Educ. Sci. 2025, 15(3), 363; https://doi.org/10.3390/educsci15030363
Submission received: 15 October 2024 / Revised: 4 March 2025 / Accepted: 11 March 2025 / Published: 14 March 2025
(This article belongs to the Special Issue Assessment for Learning: The Added Value of Educational Monitoring)

Abstract
The objective of this article is to discuss the advantages of effective educational monitoring in the context of a longitudinal RCT. Intentional data collection and monitoring enable the important assessment of issues of both internal and external validity. We discuss how we used mixed methods data collection to reveal important changing contextual factors in an evaluation of a postsecondary access program in the U.S. state of Texas. Specifically, we employed quantitative analysis of the RCT to compare the college enrollment rates of high schools that were randomly assigned a college adviser with schools that were not assigned a college adviser. We employed survey data collection, qualitative interviews, and site visits to monitor the fidelity of treatment implementation and compliance to treatment assignment over time. In the absence of monitoring treatment fidelity and compliance over time in both treatment and control schools, we would have missed critical changes that explain the observed attenuation of treatment effect estimates. We also discuss how monitoring can inform defenses of the stable unit treatment value assumption and suggest how effective the program will be when applied more widely or to other contexts.

1. Introduction

Monitoring is essential not only for successfully administering education programs but also for conducting efficacy evaluations of those programs (Scheerens et al., 2003). This paper discusses the use of monitoring over a multi-year, mixed methods impact evaluation of a college access program, Advise TX, in a large American state. The program placed recent postsecondary graduates as full-time college counselors and advisers in high schools across the state with the goal of increasing college going in those high schools. We conducted a large-scale randomized controlled trial (RCT) of the program over several years that incorporated numerous quantitative and qualitative data collection efforts to inform the estimates of program impacts, to understand mechanisms of effects, to improve program operation, and to analyze how the program and its impacts changed over time. Only by designing and implementing these multi-modal monitoring techniques could we effectively link the goals of educational research with the goals of improved practice.
According to Scheerens et al. (2003), monitoring is typically viewed as long-term information gathering to help make data-informed and data-driven decisions about programs and policies for multiple purposes. One primary purpose is for program learning and improvement, and the high level of cooperation in data gathering we received from the Advise TX program that we studied is indicative of their desire to learn about their program in order to improve the provision of their services. Another primary purpose of data collection and monitoring is accountability. Indeed, this purpose motivated the Institute of Education Sciences, a division of the U.S. Department of Education, which funded our evaluation, and the state of Texas, which was deciding whether to continue investing in and expanding the program throughout the state.
These two priorities demonstrate the value of monitoring at multiple levels. At the school level, our data collection and analysis allowed local program and school staff to effectively direct resources to achieve their goals of enhancing postsecondary enrollment outcomes at the schools they served. At the program-wide level, our longitudinal evaluation enabled program leaders to adjust across schools and over time to improve their efficiency and efficacy. Finally, at the state and national levels, this work helps policymakers and funders decide whether and how much to invest in the expansion of the program by revealing whether the program’s impact aligns with state and national goals, which in this case meant deciding whether the observed increases in college enrollment for some populations justified the cost of the program.
The treatment effect estimates we provided in our RCT analysis of the program (see Bettinger & Evans 2019) serve as an important indicator in what Richards (1988) calls performance monitoring, which focuses on school outputs. Texas uses college enrollment indicators at the high school level to monitor high school success in preparation for postsecondary access across schools in the state. Texas believes direct high school-to-college enrollment increases the likelihood of degree attainment and sets quantitative targets for the percentage of high school graduates enrolling in state postsecondary institutions in the fall after high school completion (e.g., 58% for 2020) (Texas Higher Education Coordinating Board, 2021). Our evaluation was able to demonstrate that the program improves this important indicator.
The literature has many examples of the value of educational monitoring across an array of contexts (e.g., Alexander, 2000; Jansen et al., 2020; Kilpatrick et al., 1994; Sälzer & Prenzel, 2019; Singer-Brodowski et al., 2019) and even directly connecting it to program evaluation (e.g., Scheerens et al., 2003). The extant literature also considers the advantages of employing mixed methods in RCTs (e.g., Spillane et al., 2010) and how to plan data collection in program evaluation to assess variation in treatment effects (e.g., Weiss et al., 2014). However, there is a notable absence of work connecting the idea of monitoring to specific components of an RCT evaluation. This paper begins to fill that gap by discussing several general issues of concern in conducting an effectiveness study using an RCT and addressing how data collection efforts can help researchers resolve, or at least think through, those issues. It uses the Advise TX evaluation as a case study to demonstrate how different types of monitoring and data collection activities can be synergistic and can inform conclusions drawn from the evaluation.
We discuss the importance of several crucial principles. First, combining multiple sources of quantitative and qualitative data is essential to understanding whether and why a program or policy works as intended. Second, constructing data systems to provide longitudinal data is vital for capturing how a program or policy changes over time. The long-term nature of data collection is fundamental to monitoring but also critical to effective program evaluation through a multi-year RCT. Because evaluation studies in education are in environments that are characterized by change, not stability, many opportunities for threats to validity emerge. Therefore, it is essential to consider the changing environmental context when conducting educational evaluations. Third, it is essential to not only observe what happens within the treatment group but also monitor the control group within an RCT (or within any comparative evaluation).
The paper demonstrates these ideas by describing the application of data collection and monitoring to a series of important considerations in conducting RCT efficacy studies in education. We discuss the internal validity issues of the fidelity of treatment implementation, compliance to treatment assignment, and the Stable Unit Treatment Value Assumption (SUTVA), as well as issues of external validity. Before proceeding with that discussion, we provide an overview of the program and salient details of the RCT design followed by a brief description of the various components of data we collected. For a traditional explication of the program and analysis of the RCT, see Bettinger and Evans (2019). This article complements that prior work by focusing on the application of monitoring to a narrow set of elements within a longitudinal RCT.

2. Background on the Advise TX Program and the RCT Evaluation

In the aftermath of the Great Recession, the U.S. state of Texas saw a rapid increase in the demand for postsecondary education. In particular, enrollment rates for students of color and first-generation college students soared. Texas saw both an opportunity and a narrow window where there was broad support for improving access to and the affordability of college. The Texas Higher Education Coordinating Board (THECB), its foundational arm Texas for All, and activists throughout the state began searching for ways to strengthen students’ access to postsecondary education. As we discuss below, the policy-based activism led to shifts in state policy throughout the experiment.
One of the solutions pursued by policymakers was to improve college advising within secondary schools. Texas had a student-to-high school counselor ratio of 462:1 in 2013, which was similar to the national average (American Counseling Association, 2014). To supplement the availability of high school counselors, the state relied on external programs. A number of large-scale programs, such as Upward Bound, have existed over time, but there were also several upstart programs throughout the United States that were expanding rapidly. One of these was the College Advising Corps (CAC). CAC had garnered national attention after receiving large grants from the Jack Kent Cooke and Lumina Foundations, identifying it as a promising strategy in college advising. In 2010, Texas provided funding to establish a chapter of CAC. They called the program Advise TX. After an initial pilot in 10 schools, Texas aimed to place CAC in about 115 schools in the 2011–2012 school year. This expansion provided the opportunity to design and conduct an RCT evaluation of the program.
CAC recruits, trains, and places recently graduated college students into disadvantaged high schools to serve as college advisers. It is a “whole-school” model in that one adviser serves all students instead of exclusively targeting a specific subset of students. CAC envisions itself as a data-driven organization (Horng et al., 2013). While the program was still relatively nascent in 2010, the national and regional Texas directors were eager to identify new data on how to improve their model. This eagerness led CAC to modify its model as it learned more about its efficacy.
There are several other important actors and constituents that provide context for the experiment. Texas did not provide complete funding for CAC, so donations from both national and regional philanthropists and foundations were necessary to provide financial stability. High schools, which were facing increased national and state scrutiny over the college outcomes of students, were searching for ways to improve access. Many of these schools, particularly the most disadvantaged ones, were experiencing rapid turnover in both staff and administration.1
In order to efficiently allocate resources to achieve policy goals, the THECB developed a plan to target advisers to schools with high proportions of disadvantaged students, resulting in a sampling frame of 418 high schools in the state, which met the following three benchmark criteria: at least 35% of students on free or reduced price lunch (a metric of low income status), less than 70% of students enrolling in college after high school, and less than 55% of students in a college preparatory curriculum. As an indicator of high school interest, nearly 250 of these invited high schools applied to have CAC enter their school in 2011; therefore, all of the schools in the study serve a higher proportion of low-income students and have lower college enrollment rates relative to the state averages.
Given the enormous demand and limited availability of advisers, Advise TX decided to allocate a portion of advisers to high schools using random assignment. Given the heterogeneity in schools’ backgrounds, they opted for a two-stage process. In the first stage, they automatically committed to 78 high schools that were the most disadvantaged in the state and had historically low college enrollment rates. Advise TX also disqualified 50 high schools that had applied because of their historically high college enrollment rates. The remaining 112 schools were vying for 36 openings. In the second stage, they allocated these 36 slots using randomization. Advise TX created regional blocks of schools and then randomly selected schools from those blocks to be offered the program in the 2011–2012 school year.
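This second-stage lottery is an instance of blocked (stratified) random assignment. The short Python sketch below illustrates the logic of drawing a fixed number of treatment slots within each regional block; the data frame, column names, and slot counts are hypothetical placeholders rather than the actual assignment procedure or code used by Advise TX.

```python
# Illustrative block randomization: offer the program to a fixed number of
# schools within each regional block. All data here are fabricated placeholders.
import numpy as np
import pandas as pd

def block_randomize(schools: pd.DataFrame, slots_per_region: dict, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    schools = schools.copy()
    schools["offered"] = 0
    for region, n_slots in slots_per_region.items():
        eligible = schools.index[schools["region"] == region]
        chosen = rng.choice(eligible, size=n_slots, replace=False)
        schools.loc[chosen, "offered"] = 1  # these schools are offered an adviser
    return schools

# Hypothetical example: 12 eligible schools in two regions, 2 slots per region
frame = pd.DataFrame({"school_id": range(12), "region": ["A"] * 6 + ["B"] * 6}).set_index("school_id")
assignment = block_randomize(frame, {"A": 2, "B": 2})
print(assignment)
```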
Subsequent outcomes, such as college enrollment and persistence, were observed for all treatment and control schools (those that participated in the lottery but were not offered the program). Outcome data are student level administrative data provided by the THECB, identifying which students from treatment and control high schools entered any public institution of higher education in the state.
Because only 36 schools randomly received treatment in this cluster-randomized trial, statistical power was limited. To improve statistical power, Advise TX agreed to maintain the treatment and control contrast for three successive high school graduating cohorts. Hence, the experiment ran for the graduating classes of 2012, 2013, and 2014.
We present a comprehensive explanation of the study design, data, and results in Bettinger and Evans (2019). To summarize the findings for context in this paper, we found no overall effects of the program on college enrollment rates when pooling results across all students and across all three years of the study. However, subgroup analysis revealed that the groups most targeted by the intervention, Hispanic and low-income students, had increases of two to three percentage points in the likelihood of enrolling in college. These effects increased enrollments primarily at two-year colleges, but they attenuated over time, which we discuss in more detail below. Table 1 summarizes the intent to treat programmatic effects on fall college enrollment for the full sample, for Hispanic students, and for low-income students for the first three years of the program’s implementation.
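As a concrete illustration of the intent-to-treat comparison summarized above, the following sketch regresses a student-level enrollment indicator on the school’s random assignment, clustering standard errors at the school level (the unit of randomization). The records and effect size are simulated placeholders, and the specification is a simplified stand-in for, not a reproduction of, the published analysis in Bettinger and Evans (2019).

```python
# Minimal intent-to-treat sketch with school-clustered standard errors.
# All records below are simulated placeholders, not study data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
students = pd.DataFrame({
    "school_id": rng.integers(0, 112, n),              # schools in the lottery pool
    "cohort": rng.choice([2012, 2013, 2014], n),
})
students["treated_school"] = (students["school_id"] < 36).astype(int)  # 36 schools offered an adviser
students["enrolled"] = rng.binomial(1, 0.45 + 0.02 * students["treated_school"])

# ITT: difference in enrollment rates by assignment, with cohort fixed effects
itt = smf.ols("enrolled ~ treated_school + C(cohort)", data=students).fit(
    cov_type="cluster", cov_kwds={"groups": students["school_id"]}
)
print(itt.summary().tables[1])
```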

3. Data Collection

While a comparison of the administrative records on postsecondary enrollment outcomes between treated and control high schools would suffice to provide an estimate of the program’s effect on college enrollment, we engaged in a host of additional data collection and monitoring efforts. Each college adviser was asked to record every contact with a student, including one-on-one meetings, group meetings, and attendance at college access events. These data allowed for detailed tracking of the interactions between advisers and students, which both facilitated program improvement and helped us assess several issues in the RCT, such as mechanisms by which the treatment might have an effect and dilution or concentration effects.
These anonymous quantitative administrative and tracking data were further supplemented by a series of surveys. We administered different survey instruments to the advisers, students, and guidance counselors at the treatment schools (we include an example survey to the advisers in the Supplementary Materials to this article). To monitor what was happening at the control schools, we administered guidance counselor surveys to assess the broader context and the level of college advising available in control schools.
We also employed several qualitative data collection techniques to enhance our knowledge of what was happening at these schools. Through a series of observational site visits, during which we conducted interviews (all participants provided informed consent, which was approved by the Stanford University IRB), we were able to assess the process by which the college adviser could influence college going outcomes. These interviews revealed whether the treatment was implemented with fidelity at the treated schools and allowed us to understand changes that occurred at the control schools over time.
Employing a mixed methods strategy improves the ability to understand contextual factors that are important in any program or policy evaluation (Burch & Heinrich, 2015). The quantitative results we observe have greater meaning when we can explain the pattern of results across people and time, an explanation enabled by the survey and interview data. The longitudinal mixed methods approach consistently improved the program and our understanding of the results of the RCT, as we describe in more detail below. In the subsequent sections, we address specific internal and external validity concerns encountered in RCT research and how those concerns can be ameliorated using monitoring and auxiliary data collection.

4. Fidelity of Treatment Implementation

We begin by focusing on how mixed methods monitoring can provide important insight into measuring treatment fidelity. Treatment fidelity refers to the degree to which the planned treatment is the same as the treatment actually offered to the units randomly assigned to treatment (Evans, 2021). This concept is distinct from treatment compliance, which is a measure of whether the treatment offered was actually received. At the forefront of treatment fidelity is implementation. If the process of implementing the treatment is different from the planned treatment, then fidelity is lost.
The loss of fidelity can complicate the interpretation of experimental results, and we consider two related concerns. First, in the absence of measuring and accounting for fidelity to treatment, one might reach the wrong conclusions about treatment effects. In the extreme, imagine a program implementation that completely fails, and essentially no treatment is offered to the participants. In the absence of data on fidelity, researchers may naively assume the program was implemented and conclude that the observed null effects mean the program had no effect. In fact, they would have no idea of the program’s effect, because the program was simply never implemented, so no treatment contrast existed between treatment and control units.
Second, variation in treatment fidelity can result in heterogeneous treatment effects, especially when implemented in a multisite RCT. Under the assumption that the program as designed has some positive effect, sites with high-quality implementation and, therefore, high levels of treatment fidelity will experience larger effects than sites in which the treatment offered is weaker than the intended treatment design. This is often referred to as an issue of dosage or varying treatment intensity (Angrist & Imbens, 1995).2
To the extent that variation in fidelity can be anticipated, it can be incorporated into the design (e.g., randomizing the degree of fidelity). However, an ex ante understanding of variation in fidelity is unlikely in many applications. It is challenging to implement a program consistently across any multisite randomized trial and even more so across an entire large state, as in our Texas example. Furthermore, it is difficult to predict which schools or classrooms will implement with greater fidelity in educational contexts. Hence, there is a need for measuring treatment fidelity across treatment units during the provision of treatment in order to better understand both the level of fidelity and the variation across units. This requires some a priori understanding of the context and the treatment such that the researcher can design data collection to appropriately capture the ways fidelity can be compromised and the factors that drive it. Monitoring the fidelity of treatment implementation across sites is, therefore, highly useful.

Application to the Advise TX Study

In the case of our CAC evaluation, the general features of the intervention were defined, but many parameters of the implementation were unspecified, such as the number of students served within a school, the number of hours spent on a variety of different tasks, and even the exact manner in which any given task was to be implemented. Measuring these variables was necessary to fully understand implementation. Furthermore, the design of the broad treatment (the assignment of a CAC adviser to a high school) was intended to be consistent over time, but the implementation details could vary from year to year.
In designing our data collection for the CAC evaluation, we considered several opportunities to collect various forms of data to measure implementation so that we could consider heterogeneity due to variation in treatment fidelity. Central to implementation in this context is the behavior of the college adviser placed in the school. Based upon previous implementation studies of the program and consultation with CAC, we understood that this behavior could be affected by the current organization of college services in schools, the support of school leadership, and the autonomy provided to the CAC adviser. Accordingly, we conducted site visits to observe the adviser’s operation and to interview students, administrators, and the adviser. We also separately surveyed the advisers, students, and high school guidance counselors. We used student tracker data, which recorded the contacts between the adviser and each student with whom he or she met. We were able to triangulate data from these disparate sources to build an understanding of the context that leads to treatment fidelity. We discuss the advantages of each of these forms of data collection below.
Through the tracker data, we observed variations in the number of meetings the adviser had with individuals and groups of students across sites. This enabled us to consider whether the “dosage” of adviser contact each student received could explain the heterogeneity in the treatment effects we observed across sites. Specifically, it led us to explore whether smaller schools, which on average had a higher number of contacts between the adviser and each student, experienced an increased treatment effect.
Qualitative data gathered from the site visits and supported by survey data of school counselors and advisers revealed that the role of the adviser differed across schools based on the structure of implementation chosen by the individual school. We identified two typically employed models that we call hub versus spoke. At schools that implemented a hub model, the adviser served as the central point person for all college-related inquiries, services, and programs. Professional counselors may have provided assistance, but the control and responsibility for college advising in the school, and thus the specific implementation of the program, lay with the CAC adviser. In contrast, advisers in some schools were one of a team of advising providers whose services were organized by a school staff member effectively acting as their direct supervisor. Advisers in such schools were one “spoke” of a wheel of college advising activity provided to students. In the spoke model, advisers contributed to established practices in the school without the autonomy to direct college advising efforts.
The Advise TX program armed advisers with a set of skills and recommended practices but did not dictate how they were to integrate the program into existing school advising structures. Our data suggest that the two different models lead to different student experiences, interactions with the adviser, and results. An RCT relying purely on administrative outcome data would not observe these differences. Because we intentionally monitored implementation by collecting and analyzing these additional forms of data, we could consider whether any heterogeneous treatment effects existed across schools that employ these different models and better inform the program about which structure appears to provide larger effects.
We also combined the surveys of school leaders, guidance counselors, and advisers with observations and interviews during site visits to assess the level of leadership support for the Advise TX program in each school. These data sources allowed us to assess whether the school principal’s broader objectives for the school, such as prioritizing college preparation and enrollment, aligned with the program’s objectives or whether this college focus was overshadowed by other priorities, such as school safety, discipline, and state testing. Responses from the adviser survey were used to corroborate our assessments of the schools’ priorities. This alignment with priorities could manifest itself in multiple ways, such as devoting additional resources to the adviser beyond the program minimum or even the physical location of the adviser in the school. Such alignment likely plays a role in the success of program implementation. If students were presented with the adviser as a central resource for achieving success at a school that prioritizes college access, the adviser would likely be more successful. Again, without data on these contextual differences across sites, experimenters would be at risk of drawing incorrect conclusions about the causes of site heterogeneity.
Monitoring over time is just as essential as monitoring across sites. Because we studied the program’s effects over three years, the longitudinal nature of the intervention and its outcomes creates additional concerns about consistent treatment implementation over time within schools.
The design of the program places advisers in a high school for one year with the eligibility to renew for a second year. Therefore, adviser turnover across school years is a normal component of the program design, and any year-to-year adviser turnover would not be viewed as a threat to treatment fidelity. However, if an adviser were to leave before the end of the school year, then the program would not be faithfully implemented at that location. We use administrative data from Advise TX to observe whether advisers left their position early in the school year and conclude that this potential issue was unlikely to impact the treatment effect estimates and conclusions we draw from the evaluation.
Administrative data on adviser placement can inform possible effects of adviser turnover but do not reveal the shifting context of schools over time. Our surveys inform how the school environment shifts by revealing important changes, such as principal turnover and the availability of additional college related resources over time, which may further explain the observed heterogeneity of effects.
As we conclude our section on treatment fidelity, we want to reiterate several points. First, measuring treatment fidelity is important because it affects the interpretation of results, especially with regard to the heterogeneity of treatment effects. This is especially important in cluster RCTs in which multiple sites are likely to have different contexts affecting treatment implementation. Second, because it is difficult to determine what may affect implementation in advance, it is important to assess fidelity in the field during the study with observational methods and interviews and surveys of key site actors. Third, fidelity is likely a more important issue in educational research than in many other domains. The complexity of multiple agents working together in the education production function makes it extremely unlikely that the implementation of any treatment intervention will be perfectly consistent across teachers, schools, districts, states, and nations. The complexities in education make interpreting treatment impact estimates challenging, but a concerted effort to design and integrate a mixed methods approach can greatly assist in understanding the factors that contribute to differences in implementation and help us draw more accurate conclusions about the effects of educational interventions. An additional benefit of an increasing number of scholars paying attention to these issues is that future scholars will be better able to anticipate the heterogeneity of implementation and either guard against it or intentionally design it into their study.

5. Compliance to Treatment Assignment

Compliance to treatment assignment is a slightly different concept than fidelity of treatment implementation. It refers to whether a unit that was assigned treatment actually received the treatment. In an RCT evaluation using individual level random assignment, compliance is often conceived of as the take-up of the offered treatment. For example, a student could be assigned to the intervention group in a tutoring study but decide not to attend any tutoring sessions. Conversely, a student assigned to a control group might find a way to attend tutoring. Both students would be non-compliers. The distinction between treatment assignment and treatment received has implications for the treatment effect estimates (Schochet & Chiang, 2011). Experimental studies often report an intent to treat effect, which ignores any compliance issues and simply reports the outcome difference between the treatment-assigned and control-assigned groups. Measuring compliance enables an additional estimate: the local average treatment effect for compliers, which inflates the intent to treat effect by accounting for the fact that some individuals receive a different treatment status than they were assigned. Monitoring offers a way to assess compliance by tracking students to determine whether they received the offered treatment.
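In standard instrumental variables notation (a textbook formulation rather than the article’s own), with \(Z\) denoting random assignment, \(D\) actual receipt of treatment, and \(Y\) the outcome, the two estimands relate through a simple ratio:
\[
\text{ITT} = \mathbb{E}[Y \mid Z=1] - \mathbb{E}[Y \mid Z=0], \qquad
\text{LATE} = \frac{\mathbb{E}[Y \mid Z=1] - \mathbb{E}[Y \mid Z=0]}{\mathbb{E}[D \mid Z=1] - \mathbb{E}[D \mid Z=0]}.
\]
The denominator is the compliance differential between the assigned groups, so when take-up is imperfect, the LATE rescales (inflates) the ITT estimate for the subpopulation of compliers.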
Compliance is assessed at the level of randomization, so it is measured at the cluster level in a cluster RCT. In an education context, this often means at the classroom or school level. In such cases, whether a specific individual student receives the treatment is irrelevant; compliance is concerned with whether a school assigned to treatment receives the treatment.
In order to assess changes over time, examining compliance longitudinally is critical. It is possible that compliance to treatment may improve or degrade. This is true for both treatment and control units. Treated units, for example, may initially receive treatment only to lose access to it at a later time. Conversely, control units may initially be in compliance with treatment assignment but subsequently receive the treatment. Such changes could result in changing treatment effect estimates, even if the efficacy of the treatment is unchanged. In the absence of monitoring compliance in a long-term RCT, the correct interpretation of changing treatment effects over time would be at risk.

Application to the Advise TX Study

The RCT study ran for three years in which we measured the actual receipt of treatment in each year for each school originally assigned treatment or control. Advise TX was able to consistently provide advisers for most schools assigned treatment throughout the study, such that the treatment schools experienced little change in access to an adviser and, hence, had high levels of compliance. The control schools, however, desired the treatment. These schools had submitted an application for an adviser and agreed to be part of the experiment, and Advise TX had determined that the schools would benefit from having an adviser. They were mostly unsuccessful in petitioning the program to provide them with an adviser, thereby ostensibly preserving the validity of the RCT. If compliance monitoring had been accomplished solely through administrative records of the program, which tracked adviser assignment to each high school, we would have concluded that compliance was fairly high in the first year (we measured it to be about 75% across both treated and control schools). However, compliance declined over time, as fewer treatment-assigned schools received an adviser and more control-assigned schools successfully secured an adviser. By 2015 (the most recent year of data we observe), 17 of the initial control schools had an adviser, and only 16 of the 26 initial treated schools had an adviser.
Even in control schools that ostensibly complied by not receiving an Advise TX adviser, conditions were changing over time in ways that undermined the initially high levels of compliance among the control-assigned schools. Because these schools desired a program to which they had been denied access, they sought out alternative college advising resources. While most control schools did not receive the exact Advise TX program, many of them received college advising support from similar postsecondary access programs after the first year of the study. Specifically, the average number of college advising programs operating in control schools in the first year of treatment was 2.27, but that figure grew to 2.79 (a 23% increase) by the third year of treatment.
The adoption of similar programs in control schools reduced the treatment contrast between treated and control units. With several control schools receiving a similar intervention, the average outcome of control schools improved, and the treatment effect estimate attenuated from the first year of the program evaluation to the third year of the RCT.
We can only assert that this explanation is the cause of the attenuated treatment effect over time because of the supplementary data collection efforts we undertook. By monitoring what was happening in control schools through annual surveys of guidance counselors, we could observe the increased take-up rate of similar programs in the control schools.
The central idea is that conditions within units can change over time, thereby changing the treatment estimates. Cluster RCTs are more susceptible to this problem, as it is often harder to monitor all of the contextual factors that can change over time within units composed of many parts. A multitude of changes that would plausibly affect treatment estimates could occur over time within schools. Even if compliance does not change, the context within the control condition might change. What if, instead of adding additional college advising resources, like a college adviser, guidance counselors at control schools became more efficient at their jobs or adopted more effective practices? The treatment effect estimates would also attenuate, even though compliance remained consistent and the treatment within the treated schools remained effective.
This is not a novel idea. Another example is articulated by Lemons et al. (2014). Through a series of experimental evaluations that took place across nearly a decade, the estimates for a kindergarten reading program, the K-PALS program, attenuated. Upon close inspection, thanks to monitoring activities that revealed what happened in the classroom, they determined that the standard of teaching and resources in schools improved over time because of the adoption of Reading First, which mandated improved reading instructional methods in elementary schools and provided financial resources to ensure implementation. This improvement reduced the treatment contrast between treated and control schools and caused the efficacy estimates of the program to fall from the initial studies to the more recent estimates. Through qualitative interviews with personnel in the district, they were able to discern that the counterfactual changed rather than concluding that the program stopped working.

6. Stable Unit Treatment Value Assumption (SUTVA)

The stable unit treatment value assumption is made in nearly all causal efficacy studies. The assumption stems from the potential outcomes framework of estimating treatment effects, which forces us to have a single, well-defined potential outcome under the treatment condition and a single, well-defined potential outcome under the control condition (Rubin, 1986). The assumption states that one unit’s potential outcomes cannot be affected by another unit’s treatment assignment. SUTVA violations undermine having singular, well-defined outcomes in each condition by generating the possibility of multiple potential outcomes under each condition.
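In the potential outcomes notation of Rubin (1986), the assumption can be stated compactly. Writing \(Y_i(z_1, \ldots, z_N)\) for unit \(i\)’s outcome under the full vector of treatment assignments, SUTVA requires
\[
Y_i(z_1, \ldots, z_N) = Y_i(z_i) \quad \text{for every assignment vector } (z_1, \ldots, z_N),
\]
so that each unit has exactly two well-defined potential outcomes, \(Y_i(1)\) and \(Y_i(0)\), along with a single, stable version of the treatment. This is a standard textbook formalization rather than notation drawn from our original analysis.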
SUTVA violations take different forms. A common one is called interference. Imagine an individual level random assignment study in which there are two friends. It is possible the first friend’s potential outcome under the treatment condition would be affected by whether the second friend is also assigned treatment. For example, if both friends were assigned to the treatment group, they might distract each other such that the treatment is less effective. The first friend, who receives treatment, would have one potential outcome if the second friend were also in the treatment group and a different potential outcome if the second friend were in the control group. This violates SUTVA because a unit has multiple potential outcomes under the same treatment condition. This problem generalizes to any type of interaction between units when those interactions change as a result of treatment assignment and affect the outcome.
One particular concern of interference within an RCT is the concept of spillover or contamination in which the treatment spreads from treated units to control units (VanderWeele et al., 2013). This concept is different from compliance, because the control units are not receiving the treatment directly from the providers of the treatment but rather indirectly from others who took up the treatment formally. Spillover effects are an internal validity problem because they reduce the treatment contrast between treated and control units within an RCT and attenuate the treatment effect estimates (Murnane & Willett, 2011). Monitoring efforts that measure either the interaction between treated and control units or directly assess whether the control students are receiving the treatment can mitigate this SUTVA concern.
A different form of SUTVA violation is known as dilution or concentration effects (Morgan & Winship, 2015). This violation is driven by having multiple potential outcomes under an experimental condition because the outcomes vary with the percentage of units in the treatment or control condition. Vaccine studies are a canonical example; the potential outcome of an individual differs depending on the proportion of the population provided with the vaccine. The potential outcome under the control condition might go from a high probability of becoming ill when the proportion of the population provided treatment is low to a low probability of becoming ill when the proportion of the population provided treatment is high due to herd immunity.
Monitoring with supplementary data sources can potentially account for some of these situations. Qualitative observational data may reveal evidence of these types of interactions or, alternatively, may provide support of the assumption if these types of interactions do not appear. In the case of the friendship example above, social network analysis could be used to illuminate the relationship among a set of units and possibly leveraged to explain the likelihood of SUTVA violations.
Understanding dilution or concentration effects begins with measuring the extent of the provision of treatment. In many contexts, the researchers may be able to determine what proportion of the population is provided with treatment and, therefore, control how widespread the treatment is. However, in other contexts, an intentional effort to assess the penetration of the treatment into the population under study is necessary to put the measured treatment effect into context. That effort is important because the treatment effect estimate might only apply to a context in which a similar percentage of the population receives the treatment. Both qualitative and quantitative data sources can be leveraged here. The provision of the treatment could be observed and recorded by whoever has contact with the treated units, or researchers might make qualitative observations of the population to determine how widespread the treatment becomes.

Application to the Advise TX Study

In general, cluster RCTs are less prone to the interference type of SUTVA violation. It is easier to believe the assumption that the treatment assignment of one randomized unit does not affect the outcome of another unit when higher level clusters are randomly assigned. Students are most likely to interact with each other within schools, not across schools, so when entire schools are randomized, it is easier to believe one school’s assignment to treatment is unlikely to affect the college going outcomes of a different school.
However, a manifestation of this concern could arise over time because of additional resources that are allocated by the state or school district subsequent to random assignment. If a district has two schools in the experiment, school A’s potential outcome under the treatment condition could depend on school B’s assignment to treatment if the district would shift financial or human capital between schools A and B depending on whether school B was assigned treatment. Imagine a district moving a more college-minded guidance counselor from school A to school B because school A received the program but school B did not. Even if such a diversion of resources did not happen initially, it seems plausible that such shifts could happen over the three-year period of the study.
The only way to determine if this happens is to either directly measure such inputs into both schools over time or talk with knowledgeable stakeholders to assess whether such shifts in resource allocation occurred. We attempted this latter form of data collection through school leader surveys and interviews. Crucially, we gathered data at both treatment and control schools, which revealed that control schools were finding additional college advising resources, as discussed above.
Related to spillover effects, the cluster RCT design again mitigates this concern. It is less likely that students in the control schools received college advising information and support from treated friends in our RCT than if students had been individually assigned to treatment and control within the same school. However, we cannot rule out that interference occurred, as we did not attempt to directly observe such interactions. We did not collect social network data or attempt to gauge interactions of individuals across treated and control schools, although such data collection efforts may be possible. We believe this problem is small in our context and, as noted earlier, any such interference would work to attenuate the effect if it does exist.
Although they usefully minimize interference issues, cluster RCTs are potentially more susceptible to dilution or concentration concerns because of variation in contextual factors across clusters. In the Advise TX case, school size is a notable factor. Consider the difference in impact a college adviser might have when working in a school with 100 graduating seniors relative to being in a school with 500 graduating seniors. In the latter case, the same amount of effort and effectiveness of the adviser might be spread too thinly to have a noticeable impact on the high school’s college going rate. The treatment is essentially diluted. Because we were aware of this issue, we gathered data on high school size to enable a heterogeneity test of program effectiveness across school size to determine whether the program appeared more effective in small schools.
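The school-size heterogeneity test described here can be sketched as an interaction model at the school level. The sketch below is illustrative only: the outcome, class sizes, and arm counts are simulated placeholders, and the published analysis may have specified the test differently.

```python
# Illustrative dilution test: does the assignment effect shrink with senior class size?
# All values are simulated placeholders, not study data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
schools = pd.DataFrame({
    "offered": np.repeat([1, 0], [36, 76]),       # lottery arms in the 112-school pool
    "class_size": rng.integers(80, 600, 112),     # fabricated senior class sizes
})
# Fabricated outcome in which the program only helps smaller schools
schools["enroll_rate"] = rng.normal(
    0.45 + 0.03 * schools["offered"] * (schools["class_size"] < 200), 0.05
)

het = smf.ols("enroll_rate ~ offered * class_size", data=schools).fit(cov_type="HC1")
print(het.params["offered:class_size"])  # a negative interaction is consistent with dilution
```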
Furthermore, we assessed the penetration of the program into each school by monitoring the number of students within each school who had direct contact with the adviser. By asking advisers to record each student contact, we gathered quantitative, longitudinal tracking information of student–adviser contacts. Beyond helping to measure concentration of the treatment to assess SUTVA concerns for the evaluation, this type of monitoring also facilitated programmatic improvement by measuring the effort of each adviser so that targeted supervision could be applied to improve adviser performance.
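For completeness, a small sketch of how penetration might be computed from adviser tracker logs: count the unique students contacted at each school and divide by the size of the senior class. The column names and values are hypothetical; the actual tracker system may record contacts differently.

```python
# Hypothetical tracker log: one row per adviser-student contact (fabricated rows).
import pandas as pd

contacts = pd.DataFrame({
    "school_id": [1, 1, 1, 2, 2],
    "student_id": ["a", "a", "b", "c", "d"],
})
seniors = pd.Series({1: 120, 2: 400}, name="senior_class_size")  # fabricated class sizes

reached = contacts.groupby("school_id")["student_id"].nunique()
penetration = (reached / seniors).rename("share_of_seniors_contacted")
print(penetration)
```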

7. External Validity

The prior three sections have focused predominantly on one of the strengths of RCTs, internal validity, which assesses whether we are correctly estimating the causal effect of the treatment on the population under study in the experimental analysis. However, external validity, whether the findings can generalize to other people, places, and contexts, is also important. Data collection and monitoring efforts can also help address the external validity limitations of a study by enabling measurement of the context in which the RCT takes place. After all, how will we understand whether the program would be effective in a different school, district, state, or nation if we do not measure the context in which it is studied?
It is rare to observe broad program evaluations in the literature across a national or multi-national context. Therefore, understanding the differences between the context in which the RCT is run and other contexts to which the program might be applied is critical. This is especially complex in education interventions, because the contexts for education differ so greatly across schools and institutions, districts or local education agencies, systems, states, and nations.
Being attentive to changing contexts over time is a fundamental element of education monitoring, and it has important implications when assessing the external validity of a program evaluation. It is important to consider how various aspects of the education production function may change over time. These aspects include students, schools, and the broader policy landscape. For example, it is important to note whether the type and preparation of students changes over time by observing demographic characteristics and student test scores. Related to the quality of schools, various quantitative and qualitative methods can assess changes in the characteristics of teachers, administrators, financial resources, technology, and facilities. Finally, policy at local, regional, or national levels frequently changes in ways that may alter the efficacy of a program. In the absence of monitoring such changes, programs run the risk of not meeting expectations in their efficacy due to the interaction with a changing policy landscape.
An additional external validity concern is generated by scaling up a program. Treatment effects measured in an RCT of a program operating in a few schools may change as the program becomes widespread, for several reasons. For example, limitations in hiring and supervision might make the program effective when small but less so as it expands.
It is also important to think through general equilibrium effects. The interactions between individuals and institutions in a system imply that effects measured among a small number of units, as in a typical RCT, might not hold when a program is scaled to the system level. In other words, a program effect estimated via an RCT for a small number of schools might not hold if the program were implemented in all schools.
Although monitoring cannot directly prevent these problems, it can reveal whether they are happening. The continued assessment of programmatic effects over time allows for the consideration of how changing conditions affect the program’s efficacy.

Application to the Advise TX Study

A major contextual factor we dealt with because of the longitudinal nature of the experiment was changing state policy. Due to activist and advocacy efforts to improve Texas high schools, the state passed a new law, House Bill 5, in 2013 (the second year of the intervention). That law changed the high school graduation requirements and plausibly impacted college going rates. Because this change affected the entire state, it was not a concern for the internal validity of the RCT because both treated and control schools were equally affected. However, such a change could alter how the results generalize. If the program is more effective under the earlier graduation requirement regime, it might prove less effective moving forward. In the absence of monitoring such contextual policy changes, we would have less information about how well the program works under different policies, which could reveal whether the program would be effective in other states with different policy contexts.
Several scaling concerns exist for the Advise TX program. Are they able to find enough high-quality advisers as the program expands? Can the training and management components of the program scale? Although no amount of data collection could have answered those questions directly within the RCT, interviews with program staff provide opportunities to assess the likelihood of these issues becoming problematic.
Because of the competition between students in admissions to colleges, especially selective colleges, there are several potential general equilibrium effect concerns with the program. If the program was effective at placing students into selective colleges when operating in a small number of schools, it might be less effective at reaching those gains if operating in every school, because there are not enough seats at selective four-year colleges to accept the larger number of students applying from every high school in the state. Alternatively, if the program is effective at placing low-income students into scholarship programs when operating at a small set of schools, it might be unlikely to achieve that level of success when operating at all schools because of competition for limited scholarship resources. By tracking not just the college enrollment outcomes but also the intermediate outcomes, such as contact hours with the adviser, the number of college applications submitted, and the amount of financial aid received, we can anticipate some of these challenges.

8. Conclusions

This paper has discussed several issues related to internal and external validity when conducting RCTs of educational interventions. We have documented how a combination of quantitative and qualitative data collection allows researchers to assess issues, such as the fidelity of treatment implementation, compliance, SUTVA, and external validity. When evaluations are longitudinal in nature, monitoring is even more critical because of the possibility of changing context over time in both the treatment and control groups. Generating a treatment contrast is a strength of the RCT design that leads to a causal interpretation. However, as we have demonstrated in several sections, the interpretation of that treatment contrast and the causal question that may be answered may differ over time.
These monitoring efforts should be incorporated into more educational evaluations. During the design phase of an RCT, researchers should anticipate possible threats to internal and external validity based on their contextual knowledge of the intervention and population under study. Then, supplementary data collection efforts can be planned. If the experimenters are not experts in qualitative data, they can partner with experts in different methodologies to accomplish a comprehensive mixed methods data collection and monitoring effort.
Programs and funders can also encourage these efforts when making decisions to undertake or finance evaluations. One example is the Institute of Education Sciences, which funds millions of dollars of educational evaluations every year. For experimental evaluation studies, they require an assessment of treatment fidelity. They could extend those requirements to include additional monitoring, especially for longitudinal RCTs.
A possible alternative to the extensive efforts we advocate for here is study replication. Concerns about the heterogeneity of treatment effects over populations or over time, as well as issues of external validity, can be resolved by repeating the RCT at different points in time and in different contexts. However, this is often costly. Thinking critically about experimental design, data collection, and monitoring within an initial RCT can obviate some replication efforts.
We acknowledge several challenges and limitations of the approach we advocate. First, the specific tools we employed (survey data collection, interviewing advisers and college counselors at treatment and control schools, etc.) were suitable for our research in the context of studying a college advising program, but the tools must be modified to each individual context. Relatedly, it is expensive and time-consuming to develop and implement multiple data collection techniques. It likely increases the cost of designing and running an impact evaluation. This challenge can be mitigated by building in these data collection components when applying for funding for a program evaluation. Finally, our discussion has focused on randomized controlled trials and will require adaptation to apply to quasi-experimental analyses.
This last idea is a useful avenue for future research. We have focused on how educational monitoring can contribute to understanding a randomized control trial with its relatively simple identifying assumption and straightforward analytic approach. Future studies should consider how monitoring could be incorporated into other causal research designs, such as a regression discontinuity analysis or a matching study.
The point of both RCT studies and monitoring over time is to inform program managers and policy makers, and Pelgrum (2009) explicitly links educational monitoring to policy goals. The Advise TX evaluation serves as an excellent example of monitoring serving a policy purpose, as the program was expressly aligned with the state’s postsecondary access goals and provided policymakers with information about the program’s cost and efficacy. The incorporation of numerous mixed methods data collection efforts bolstered policymakers’ confidence in the findings of the RCT and provided deeper explanations for why the results changed over time by accounting for the context in which the study took place. We hope that our consideration and integration of monitoring and data collection with elements of RCT design, implementation, and analysis will improve future RCT evaluations of programs and policies.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/educsci15030363/s1. Adviser Survey Example.

Author Contributions

Conceptualization, B.J.E., E.P.B. and A.L.A.; Formal analysis, B.J.E., E.P.B. and A.L.A.; Writing—original draft, B.J.E., E.P.B. and A.L.A.; Funding acquisition, B.J.E., E.P.B. and A.L.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Institute of Education Sciences grant number R305B130009.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of Stanford University (protocol code 21439 and date of most recent approval 28 August 2020).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the qualitative data collection in this study.

Data Availability Statement

The datasets presented in this article are not readily available because they use administrative records from the Texas Higher Education Coordinating Board and Advise TX, both of which restrict access to their respective data. Requests to access the datasets should be directed to the Texas Higher Education Coordinating Board and/or Advise TX.

Conflicts of Interest

The authors declare no conflict of interest.

Notes

1
An additional layer of complexity is that CAC partners with institutions of higher education in the state to recruit, train, and manage their advisers. Although those institutions serve as critical partners to Advise TX, they were not the focus of the program evaluation.
2
This could also occur when fidelity is endogenous to treatment impact, as is often the case in medical research on optimal drug dosage. Individuals who recognize an improvement in their health might be more faithful to the intended dosage than those who have not seen such benefits.

References

1. Alexander, F. (2000). The changing face of accountability: Monitoring and assessing institutional performance in higher education. Journal of Higher Education, 71, 411–431.
2. American Counseling Association. (2014). United States student-to-counselor ratios for elementary and secondary schools. Available online: https://www.counseling.org/docs/default-source/public-policy-faqs-and-documents/2013-counselor-to-student-ratio-chart.pdf?sfvrsn=2 (accessed on 4 March 2025).
3. Angrist, J., & Imbens, G. (1995). Average causal response with varying treatment intensity. Journal of the American Statistical Association, 90, 431–442.
4. Bettinger, E. P., & Evans, B. J. (2019). College guidance for all: A randomized experiment in pre-college advising. Journal of Policy Analysis & Management, 38, 579–599.
5. Burch, P., & Heinrich, C. (2015). Mixed methods for policy research and program evaluation. Sage Publications, Inc.
6. Evans, B. J. (2021). Understanding the complexities of experimental analysis in the context of higher education. In L. W. Perna (Ed.), Higher education: Handbook of theory and research (Vol. 36, pp. 611–661). Springer.
7. Horng, E. L., Evans, B. J., Foster, J. D., Kalamkarian, H. S., Hurd, N. F., & Bettinger, E. P. (2013). Lessons learned from a data-driven college access program: The National College Advising Corps. New Directions for Youth Development, 140, 55–75.
8. Jansen, R., van Leeuwen, A., Janssen, J., & Kester, L. (2020). A mixed method approach to studying self-regulated learning in MOOCs: Combining trace data with interviews. Frontline Learning Research, 8, 35–64.
9. Kilpatrick, A., Turner, J., & Holland, T. (1994). Quality control in field education: Monitoring students’ performance. Journal of Teaching in Social Work, 9, 107–120.
10. Lemons, C., Fuchs, D., Gilbert, J., & Fuchs, L. (2014). Evidence-based practices in a changing world: Reconsidering the counterfactual in education research. Educational Researcher, 43, 242–252.
11. Morgan, S., & Winship, C. (2015). Counterfactuals and causal inference: Methods and principles for social research (2nd ed.). Cambridge University Press.
12. Murnane, R. J., & Willett, J. B. (2011). Methods matter: Improving causal inference in educational and social science research. Oxford University Press.
13. Pelgrum, W. (2009). Monitoring in education: An overview. In F. Scheuermann, & F. Pedró (Eds.), Assessing the effects of ICT in education: Indicators, criteria and benchmarks for international comparisons (pp. 41–61). European Union and OECD.
14. Richards, C. (1988). A typology of educational monitoring systems. Educational Evaluation and Policy Analysis, 10, 106–116.
15. Rubin, D. (1986). Statistics and causal inference: Comment: Which ifs have causal answers. Journal of the American Statistical Association, 81(396), 961–962.
16. Sälzer, C., & Prenzel, M. (2019). Examining change over time in international large-scale assessments: Lessons learned from PISA. In L. Suter, B. Denman, & E. Smith (Eds.), The SAGE handbook of comparative studies in education (pp. 243–257). SAGE Publications.
17. Scheerens, J., Glas, C., & Thomas, S. (2003). Educational evaluation, assessment and monitoring. Taylor & Francis.
18. Schochet, P. Z., & Chiang, H. S. (2011). Estimation and identification of the complier average causal effect parameter in education RCTs. Journal of Educational and Behavioral Statistics, 36(3), 307–345.
19. Singer-Brodowski, M., Brock, A., Etzkorn, N., & Otte, I. (2019). Monitoring of education for sustainable development in Germany—Insights from early childhood education, school and higher education. Environmental Education Research, 25, 492–507.
20. Spillane, J. P., Pareja, A. S., Dorner, L., Barnes, C., May, H., Huff, J., & Camburn, E. (2010). Mixing methods in randomized controlled trials (RCTs): Validation, contextualization, triangulation, and control. Educational Assessment, Evaluation and Accountability, 22, 5–28.
21. Texas Higher Education Coordinating Board. (2021). 60x30TX progress report. Available online: https://reportcenter.highered.texas.gov/reports/data/60x30tx-progress-report-july-2021/ (accessed on 4 March 2025).
22. VanderWeele, T. J., Hong, G., Jones, S. M., & Brown, J. (2013). Mediation and spillover effects in group-randomized trials: A case study of the 4Rs educational intervention. Journal of the American Statistical Association, 108(502), 469–482.
23. Weiss, M. J., Bloom, H. S., & Brock, T. (2014). A conceptual framework for studying the sources of variation in program effects. Journal of Policy Analysis & Management, 33, 778–808.
Table 1. Intent-to-treat effect estimates on first-year college fall enrollments.

                              Year 1 (2011–2012)   Year 2 (2012–2013)   Year 3 (2013–2014)
Panel A. Full Sample
  Coefficient                 0.022 *              0.013                −0.014
  Standard Error              0.011                0.010                0.011
Panel B. Hispanic Students
  Coefficient                 0.022 +              0.009                −0.014
  Standard Error              0.012                0.013                0.013
Panel C. Low-Income Students
  Coefficient                 0.038 **             0.046 **             0.005
  Standard Error              0.014                0.013                0.012

Note: + p < 0.10; * p < 0.05; ** p < 0.01. Each coefficient cell reports the coefficient on treatment assignment for the indicated year and sample from a linear probability model. Standard errors are clustered at the school level. R² and sample size vary by year and sample.
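To make the estimation approach described in the table note concrete, the following sketch shows how an intent-to-treat coefficient of this kind could be estimated as a linear probability model with standard errors clustered at the school level. It is a minimal illustration using the statsmodels library and simulated data; the variable names (enrolled, treated, school_id) are hypothetical placeholders and do not reflect the study's restricted administrative data.

```python
# Minimal sketch: intent-to-treat (ITT) estimate via a linear probability model
# with cluster-robust standard errors at the school level (assumed variable names).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulated student-level data; schools are the unit of random assignment.
n_schools, students_per_school = 200, 50
school_id = np.repeat(np.arange(n_schools), students_per_school)
treated = np.repeat(rng.integers(0, 2, n_schools), students_per_school)
# Binary outcome: enrolled in college the fall after high school (simulated).
enrolled = rng.binomial(1, 0.45 + 0.02 * treated)

df = pd.DataFrame({"enrolled": enrolled, "treated": treated, "school_id": school_id})

# Regress the enrollment indicator on treatment assignment (ITT estimand),
# clustering standard errors by school.
itt = smf.ols("enrolled ~ treated", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["school_id"]}
)
print(itt.summary().tables[1])
```

In a setup like this, the coefficient on treated plays the role of the ITT estimates reported in each panel of the table; separate models would be fit by cohort year and for the Hispanic and low-income subsamples.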