
Monday, December 8, 2008

Focusing Evaluations on Achievement Gaps

The standard design for experimental program evaluations in educational settings may not be doing justice to the questions that matter most to district decision makers. In many sites where we have worked, the most important question had to do with a gap between two populations within the district. For example, one district’s improvement plan specifically targeted the gap in science achievement between black students and white students. In another, there was a specific concern with the performance of new, and often uncertified, teachers compared to experienced teachers. NCLB, with its requirement for disaggregating the performance of specific subgroups, has reinforced this perspective. A new science curriculum that has a modest positive impact on performance across the district could be rejected if it had the effect of increasing the gap between the two populations of concern.

When a new program favors one kind of student or teacher over another, we call it an interaction, that is, an interaction between the experimental “treatment” and some pre-existing “trait” of the population involved. In experimental design, we call these characteristics of the people or the setting moderators because they are seen as moderating the impact of the new program. Moderators are often considered secondary or even exploratory outcomes in experimental program evaluations, which are designed primarily to find out whether the new program makes an overall difference for the study population as a whole. Who gets and doesn’t get the program can be manipulated experimentally. By contrast, the moderator is a pre-existing characteristic that (usually) can’t be manipulated. While the experiment focuses on a specific program (treatment), any number of moderators can be examined after the fact.
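To make the distinction concrete, here is a minimal sketch of how such an interaction is typically estimated in a regression model, using Python's statsmodels. The data and column names (post_score, pre_score, treatment, minority) are purely illustrative, not from any particular study:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Purely illustrative data: pre/post test scores, a treatment indicator,
# and a pre-existing trait (the potential moderator).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, n),   # 1 = new program, 0 = control
    "minority": rng.integers(0, 2, n),    # the pre-existing trait of interest
    "pre_score": rng.normal(50, 10, n),
})
df["post_score"] = (df.pre_score + 2 * df.treatment
                    + 3 * df.treatment * df.minority   # a built-in interaction
                    + rng.normal(0, 5, n))

# 'treatment:minority' is the moderating effect: how much the program's impact
# differs between the two subgroups. 'treatment' alone is the impact for the
# reference group (minority = 0).
model = smf.ols("post_score ~ pre_score + treatment * minority", data=df).fit()
print(model.params[["treatment", "treatment:minority"]])
```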

Many of our experiments in school systems are aimed at answering a question of local interest. In these cases, we often find that the most important question concerns an interaction rather than the average impact of the experimental intervention itself. The potential moderator of interest, such as minority status, under-achievement, or teacher certification, can be specified in advance, based on the identified gap in performance the new program was intended to address in the first place. When the interaction is the primary outcome of interest, its status goes beyond even the emphasis that many experts put on interactions as a means of getting a fuller picture of the effectiveness of an intervention (Cook, 2002; Shadish, Cook, & Campbell, 2002). But because investigations of interactions are usually exploratory and not the primary question (except perhaps for the specific setting in which the experiment took place), it is difficult to look across studies of the same intervention and come to any generalization about the moderating effects of particular variables. Research reviews that synthesize multiple studies of the same intervention, such as those found on the What Works Clearinghouse and the Best Evidence Encyclopedia, are not concerned with interactions, even if an individual study finds one to be quite substantial. This is unfortunate because, in many studies that find no overall impact for a program, we may discover that it is differentially effective for an important subgroup. It would therefore be useful, for example, to examine whether the moderating effect of a particular variable varies more than would be expected by chance across experimental settings. This would indicate whether the moderating effect is robust or whether it depends on local circumstances.
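As an illustration of what such a cross-study check could look like, the sketch below applies Cochran's Q test to a handful of made-up interaction estimates; the numbers are purely hypothetical and are only meant to show the calculation:

```python
import numpy as np
from scipy import stats

# Hypothetical interaction estimates (treatment x subgroup) and their
# standard errors from several experiments of the same intervention.
estimates = np.array([0.15, 0.02, 0.22, -0.05])
std_errors = np.array([0.08, 0.10, 0.09, 0.12])

# Cochran's Q: does the moderating effect vary across sites more than
# sampling error alone would predict?
weights = 1.0 / std_errors**2
pooled = np.sum(weights * estimates) / np.sum(weights)
Q = np.sum(weights * (estimates - pooled) ** 2)
p_value = stats.chi2.sf(Q, len(estimates) - 1)

print(f"pooled interaction = {pooled:.3f}, Q = {Q:.2f}, p = {p_value:.3f}")
# A small p-value suggests the moderating effect depends on local circumstances;
# a large one is consistent with a robust, generalizable interaction.
```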

This situation points to the importance of conducting local program evaluations that can focus on the achievement gap of greatest concern. Fortunately, recent theoretical work by Howard Bloom (Bloom, 2005) of MDRC indicates that the statistical power for detecting a difference in an intervention's impact between subgroups of students (that is, the interaction) can be greater than the power for detecting a net impact of the same size. This means that a local experiment primarily interested in an interaction can be smaller, and less expensive, than a traditional experiment looking for an overall average effect. The need for information about gaps, together with the potentially greater efficiency of studying them, supports a strategy of conducting relatively small experiments to answer questions of local interest to a school district (Newman, 2008). Small, less expensive program evaluations focused on moderating effects can provide more valuable information to decision makers than large-scale experiments intended for broad generalization, which cannot provide useful evidence for every interaction of interest to schools.
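As a rough illustration of the intuition behind Bloom's observation (this sketch is our own illustration, not Bloom's derivation), the simulation below randomizes schools while letting the subgroup of interest vary within schools. Because the subgroup gap in impact is estimated within schools, it escapes most of the school-level noise that limits power for the net impact. All parameter values are made up:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

def simulate_power(n_schools=20, students_per_school=40, effect=0.25,
                   icc=0.15, n_sims=300, term="treatment"):
    """Estimate power to detect either the net impact ('treatment') or the
    subgroup gap in impact ('treatment:subgroup'), both of size `effect`."""
    hits = 0
    for _ in range(n_sims):
        school = np.repeat(np.arange(n_schools), students_per_school)
        # Randomize half of the schools to the program.
        treated = (rng.permutation(n_schools) < n_schools // 2)[school].astype(int)
        subgroup = rng.integers(0, 2, school.size)   # moderator varies within schools
        school_noise = rng.normal(0, np.sqrt(icc), n_schools)[school]
        y = (effect * treated + effect * treated * subgroup
             + school_noise + rng.normal(0, np.sqrt(1 - icc), school.size))
        data = pd.DataFrame(dict(y=y, treatment=treated, subgroup=subgroup,
                                 school=school))
        fit = smf.ols("y ~ treatment * subgroup", data=data).fit(
            cov_type="cluster", cov_kwds={"groups": data["school"]})
        hits += fit.pvalues[term] < 0.05
    return hits / n_sims

print("power to detect the net impact:  ", simulate_power(term="treatment"))
print("power to detect the interaction: ", simulate_power(term="treatment:subgroup"))
```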

Empirical Education is now engaged in research to empirically verify Bloom’s observation about statistical power; we expect to be reporting the results next spring. —DN

Bloom, H. S. (2005). Randomizing groups to evaluate place-based programs. In H. S. Bloom (Ed.), Learning More From Social Experiments. New York, NY: Russell Sage Foundation.

Cook, T. D. (2002). Randomized experiments in educational policy research: A critical examination of the reasons the education evaluation community has offered for not doing them. Educational Evaluation and Policy Analysis, 24, 175-199.

Newman, D. (2008). Toward School Districts Conducting Their Own Rigorous Program Evaluations: Final Report on the “Low Cost Experiments to Support Local School District Decisions” Project. Empirical Education Research Reports. Palo Alto, CA: Empirical Education Inc.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.

Wednesday, May 14, 2008

What Makes Randomization Hard to Do?

The question came up at the recent workshop held in Washington, DC for school district researchers to learn more about rigorous program evaluation: “Why is the strongest research design often the hardest to make happen?” There are very good theoretical reasons to use randomized control when trying to evaluate whether a school district’s instructional or professional development program works. What we want to know is whether involving students and teachers in some program will result in outcomes that are better than if those same students and teachers were not involved in the program. The workshop presenter, Mark Lipsey of Vanderbilt University, pointed out that if we had a time machine we could observe how well the students and teachers achieved with the program, then go back in time, withhold the program (creating the science-fiction alternate universe), and watch how they did without it. We can’t do that, so the next best thing is to find a group that is just like the one with the program and see how they do. By choosing who gets the program and who doesn’t from a pool of volunteer teachers (or schools) using a coin toss (or another random method), we can be sure that self-selection had nothing to do with group assignment and that, at least on average, the only difference between the two groups is that one won the coin toss and the other didn’t. Most other methods introduce potential bias that can change the results.
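For readers who want to see how simple the mechanics are, here is a minimal sketch of random assignment from a pool of volunteers. The teacher names are placeholders, and the seed is recorded only so the assignment can be reproduced and audited later:

```python
import random

# Hypothetical pool of volunteer teachers; the names are placeholders.
volunteers = ["Teacher A", "Teacher B", "Teacher C",
              "Teacher D", "Teacher E", "Teacher F"]

random.seed(20080514)       # record the seed so the assignment is auditable
random.shuffle(volunteers)  # the electronic equivalent of the coin toss

half = len(volunteers) // 2
program_group = volunteers[:half]      # receive the new program
comparison_group = volunteers[half:]   # business as usual

print("Program:   ", program_group)
print("Comparison:", comparison_group)
```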

Randomized control can work where the district is doing a small pilot and has only enough materials for some of the teachers, where resources call for a phased implementation starting with a small number of schools, or where slots in a program are going to be allocated by lottery anyway. To many people, the coin toss (or other lottery method) just doesn’t seem right. Any number of other criteria could be suggested as a better rationale for assigning the program: some students are needier, some teachers may be better able to take advantage of it, and so on. But the whole point is to avoid exactly those kinds of criteria and make the choice entirely random. The coin toss itself highlights the decision process, creating a concern that it will be hard to justify, for example, to a parent who wants to know why his kid’s school didn’t get the program.

Our own experience with random assignment has not been so negative. Most districts will agree to it, although some do refuse on principle. When we begin working with the teachers face-to-face, there is usually camaraderie about tossing the coin, especially when it is between two teachers paired up because of their similarity on characteristics they themselves identify as important (we’ve also found that this pairing method gives us more precise estimates of the impact). The main problem we find with randomization, when it is used as part of a district’s own local program evaluation, is the pre-planning it requires. Typically, decisions about which schools get the program first or which teachers will pilot it are made before any consideration is given to conducting a rigorous evaluation. In most cases, the program is already in motion or the pilot is coming to a conclusion before the evaluation is designed. At that point, the best available method is to find a comparison group from among the teachers or schools that were not chosen or did not volunteer for the program (or to look outside the district for comparison cases). The prior choices introduce selection bias that we can attempt to compensate for statistically; still, we can never be sure our adjustments eliminate the bias. In other words, in our experience the primary reason randomization is harder than weaker methods is that it requires the evaluation design and the program implementation plan to be coordinated from the start. —DN
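As an illustration of the pairing method mentioned above, the following sketch sorts a hypothetical set of volunteer teachers on a matching characteristic (years of experience, standing in for whatever the teachers identify as important), forms pairs of neighbors, and tosses the coin within each pair. The names and values are made up:

```python
import random

# Hypothetical volunteers with a matching characteristic (years of experience).
# An even number of volunteers is assumed so every teacher has a partner.
teachers = [("Teacher A", 2), ("Teacher B", 3), ("Teacher C", 11),
            ("Teacher D", 12), ("Teacher E", 20), ("Teacher F", 22)]

random.seed(2008)
teachers.sort(key=lambda t: t[1])                        # sort on the matching variable
pairs = [teachers[i:i + 2] for i in range(0, len(teachers), 2)]

assignments = []
for first, second in pairs:
    if random.random() < 0.5:                            # the coin toss within the pair
        assignments += [(first[0], "program"), (second[0], "comparison")]
    else:
        assignments += [(first[0], "comparison"), (second[0], "program")]

for name, group in assignments:
    print(name, "->", group)
```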