Monday, December 8, 2008

Focusing Evaluations on Achievement Gaps

The standard design for experimental program evaluations in educational settings may not be doing justice to the questions that matter most to district decision makers. In many sites where we have worked, the most important question had to do with a gap between two populations within the district. For example, one district’s improvement plan specifically targeted the gap in science achievement between black students and white students. In another, there was a specific concern with the performance of new, and often uncertified, teachers compared to experienced teachers. NCLB, with its requirement for disaggregating the performance of specific subgroups, has reinforced this perspective. A new science curriculum that has a modest positive impact on performance across the district could be rejected if it had the effect of increasing the gap between the two populations of concern.

When a new program favors one kind of student or teacher over another, we call it an interaction, that is, an interaction between the experimental “treatment” and some pre-existing “trait” of the population involved. In experimental design, we call these characteristics of the people or the setting moderators because they are seen as moderating the impact of the new program. Moderators are often considered secondary or even exploratory outcomes in experimental program evaluations, which are designed primarily to find out whether the new program makes an overall difference for the study population as a whole. Who gets and doesn’t get the program can be manipulated experimentally. By contrast, the moderator is a pre-existing characteristic that (usually) can’t be manipulated. While the experiment focuses on a specific program (treatment), any number of moderators can be examined after the fact.
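To make this concrete, here is a minimal sketch, with simulated data and hypothetical variable names (treat, minority, pretest, posttest), of how a treatment-by-moderator interaction is typically estimated in a regression model. It illustrates the general technique, not the analysis from any particular study of ours.

```python
# A minimal sketch of estimating a treatment-by-moderator interaction.
# Data are simulated; variable names and effect sizes are made up.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "treat": rng.integers(0, 2, n),      # 1 = received the new program (manipulated)
    "minority": rng.integers(0, 2, n),   # pre-existing trait of the student (the moderator)
    "pretest": rng.normal(50, 10, n),
})
# Simulate a posttest in which the program helps everyone a little but helps
# minority students more, i.e., it narrows the gap.
df["posttest"] = (0.8 * df["pretest"] + 2 * df["treat"] - 5 * df["minority"]
                  + 4 * df["treat"] * df["minority"] + rng.normal(0, 8, n))

# The coefficient on treat:minority is the interaction: the difference in the
# program's impact between the two subgroups, which is the gap question itself.
model = smf.ols("posttest ~ pretest + treat * minority", data=df).fit()
print(model.params[["treat", "minority", "treat:minority"]])
```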

Many of our experiments in school systems are aimed at answering a question of local interest. In these cases, we often find that the most important question concerns an interaction rather than the average impact of the experimental intervention itself. The potential moderator of interest, such as minority status, under-achievement, or certification, can be specified in advance, based on the identified performance gap the new program was intended to address in the first place. When the interaction is the primary outcome of interest, its status goes beyond even the emphasis that many experts put on interactions as a means of getting a fuller picture of the effectiveness of an intervention (Cook, 2002; Shadish, Cook, & Campbell, 2002). But because investigations of interactions are usually exploratory and not the primary question (except perhaps for the specific setting in which the experiment took place), it is difficult to look across studies of the same intervention and come to any generalization about the moderating effects of certain variables. Research reviews that synthesize multiple studies of the same intervention, such as those found on the What Works Clearinghouse and the Best Evidence Encyclopedia, are not concerned with interactions, even if an individual study finds one to be quite substantial. This is unfortunate because, in many studies that find no overall impact for a program, we may discover that it is differentially effective for an important subgroup. It would therefore be useful, for example, to examine whether the moderating effect of a certain variable varies more than would be expected by chance across experimental settings. This would indicate whether the moderating effect is robust or whether it depends on local circumstances.
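If interaction estimates were routinely reported, a reviewer could ask exactly that question. As a rough sketch, using made-up interaction estimates and standard errors from four hypothetical sites, Cochran's Q is one standard way to test whether the site-to-site variation exceeds what chance alone would produce:

```python
# A rough sketch of checking whether a moderating effect varies across studies
# more than chance would predict, using Cochran's Q on hypothetical estimates.
import numpy as np
from scipy.stats import chi2

est = np.array([0.25, 0.10, 0.30, -0.05])   # interaction estimate from each site (made up)
se  = np.array([0.12, 0.15, 0.10, 0.20])    # standard error of each estimate (made up)

w = 1.0 / se**2                              # inverse-variance weights
pooled = np.sum(w * est) / np.sum(w)         # fixed-effect pooled interaction
Q = np.sum(w * (est - pooled)**2)            # Cochran's Q statistic
dof = len(est) - 1
p = chi2.sf(Q, dof)                          # large Q / small p: more variation than chance

print(f"pooled interaction = {pooled:.3f}, Q = {Q:.2f}, p = {p:.3f}")
```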

This situation points to the importance of conducting local program evaluations that can focus on the achievement gap of greatest concern. Fortunately, recent theoretical work by Howard Bloom (Bloom, 2005) of MDRC provides an indication that statistical power for detecting differences among subgroups of students in the impact of an intervention (that is, the interaction) can be larger than for detecting a net impact of the same size for that program. This means that a local experiment primarily interested in an interaction can be smaller, and less expensive, than a traditional experiment looking for an overall average effect. The need for information about gaps, as well as the possible greater efficiency of studying gaps, provides support for a strategy of conducting relatively small experiments to answer questions of local interest to a school district (Newman, 2008). Small, and less expensive, experimental program evaluations focused on moderating effects can provide more valuable information to decision makers than large-scale experiments intended for broad generalization, which cannot provide useful evidence for all interactions of interest to schools.
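Our reading of the intuition behind Bloom's observation, for the common case where schools are randomized but the moderator (say, minority status) varies within schools, is that the interaction is estimated within schools, so between-school variance drops out of its standard error. The back-of-the-envelope sketch below uses our own simplified formulas (equal numbers of treatment and control schools, a 50/50 within-school split on the moderator, no random school-by-subgroup effect); it is meant only to illustrate the possibility, not to reproduce Bloom's derivation.

```python
# A back-of-the-envelope sketch of why a treatment-by-subgroup interaction can
# sometimes be estimated more precisely than the overall program effect in a
# school-randomized design. Formulas are our own simplification, not Bloom's.

def var_main_effect(J, n, tau2, sigma2):
    # Each school mean has variance tau2 + sigma2/n; the impact estimate is a
    # difference between the average of J/2 treatment and J/2 control school means.
    return 4 * (tau2 + sigma2 / n) / J

def var_interaction(J, n, tau2, sigma2):
    # Within a school, the subgroup gap is a difference of two means of n/2
    # students; the shared school effect (tau2) cancels out of that difference.
    within_school_gap_var = 4 * sigma2 / n
    return 4 * within_school_gap_var / J

J, n = 20, 60              # 20 schools, 60 tested students each (made-up numbers)
tau2, sigma2 = 0.15, 0.85  # between- vs. within-school variance (ICC = 0.15, assumed)

print("SE of overall effect:", round(var_main_effect(J, n, tau2, sigma2) ** 0.5, 3))
print("SE of interaction   :", round(var_interaction(J, n, tau2, sigma2) ** 0.5, 3))
```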

Empirical Education is now engaged in research to empirically verify Bloom’s observation about statistical power; we expect to be reporting the results next spring. —DN

Bloom, H. S. (2005). Randomizing groups to evaluate place-based programs. In H. S. Bloom (Ed.), Learning More From Social Experiments. New York, NY: Russell Sage Foundation.

Cook, T. D. (2002). Randomized experiments in educational policy research: A critical examination of the reasons the education evaluation community has offered for not doing them. Educational Evaluation and Policy Analysis, 24, 175-199.

Newman, D. (2008). Toward School Districts Conducting Their Own Rigorous Program Evaluations: Final Report on the “Low Cost Experiments to Support Local School District Decisions” Project. Empirical Education Research Reports. Palo Alto, CA: Empirical Education Inc.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.

Wednesday, November 5, 2008

Climate Change: Innovation

Congratulations to Barack Obama on his sweeping victory. We can expect a change of policy climate as a new administration brings new players and new policy ideas to the table. The appointment of a new director of the Institute of Education Sciences will provide an early opportunity to set direction for research and development. Reauthorization of NCLB and related legislation — including negotiating the definition and usage of “scientific research” — will be another, although the pundit consensus is that this change will take two more years, given the urgency of fixing the economy and resolving the war in Iraq. But already change is in the air, with proposals for dramatic shifts in priorities. Here we raise a question about the big new idea that is getting a lot of play: innovation.

The educational innovation being called for includes funding for research and development (R&D, with a capital D for a focus on new ideas), acquisition of school technology, and funding for dissemination of new charter school models. The Brookings Institution recently published a policy paper, Changing the Game: The Federal Role in Supporting 21st Century Educational Innovation, by Sara Mead and Andy Rotherham. The paper imagines a new part of the US Department of Education called the Office of Educational Entrepreneurship and Innovation (OEEI) that would be charged with implementing “a game-changing strategy [that] requires the federal government to make new types of investments, form new partnerships with philanthropy and the nonprofit sector, and act in new ways to support the growth of entrepreneurship and innovation within the public education system” (p. 34). The authors see this as complementary to standards-based reform, which they argue is yielding diminishing returns. “To reach the lofty goals that standards-based reform has set, we need more than just pressure. We need new models of organizing schooling and new tools to support student learning that are dramatically more effective or efficient than what schools [are] doing today” (p. 35).

As an entrepreneurial education business, we applaud the idea behind the envisioned OEEI. The question for us arises when we think about how OEEI would know whether a game-changing model is “dramatically more effective or efficient.” How will the OEEI decide which entrepreneurs should receive or continue to receive funds? Although the authors call for a “relentless focus on results,” they do not say how results would be measured. The venture capital (VC) model bases success on return on investment. Many VC investments fail but, if a good percentage succeeds, the overall monetary return to the VC is positive. While venture philanthropies often work the same way, the profits go back into supporting more entrepreneurs instead of back to the investors. Scaling up profitably is a sufficient sign of success. Perhaps we can assume that parents, communities, and school systems would not choose to adopt new products if they were ineffective or inefficient. If this were true, then scaling up would be an indirect indication of educational effectiveness. Will positive results for innovations in the marketplace be sufficient, or should there perhaps be a role for research to determine their effectiveness?

The authors suggest a $300 million per year “Grow What Works” fund of which less than 5% would be set aside for “rigorous independent evaluations of the results achieved by the entrepreneurs” (p. 48). Similarly, their suggestion for a program like the Defense Advanced Research Projects Agency (DARPA) would allow only up to 10%. Budgeting research at this level is unlikely to have much influence over what is likely to be an overwhelming imperative for market success. Moreover, what will be the role of independent evaluations if they fail to show the innovation to be dramatically more effective or efficient? Funding research as a set-aside from a funded program is always an uphill battle because it appears to take money away from the core activity. So let’s be innovative and call this R&D, with the intention of empowering both the R and the D. Rather than offer a token concession to the research community, build ongoing formative research and impact evaluations into the development and scale-up processes themselves. This may more closely resemble the “design-engineering-development” activities that Tony Bryk describes.

Integrating the R with the D will have two benefits. First, it will provide information to federal and private funding agencies on the progress toward whatever measurable goal is set for an innovation. Second, it will help parents, communities, and school systems make informed decisions about whether the innovation will work locally. The important partner here is the school district, which can take an active role in evaluation as well as development. Districts are the entities that ultimately have to decide whether the innovations are more effective and efficient than what they already do. They are also the ones with all the student, teacher, financial, and other data needed to conduct quasi-experiments or interrupted time series studies. If an agency like OEEI is created, it should insist that school districts become partners in the R&D for innovations they consider introducing. —DN

Saturday, October 11, 2008

Needed: A Developmental Approach to District Data Use

The recent publication of Data Driven School Improvement: Linking Data and Learning, edited by Ellen Mandinach and Margaret Honey, takes a useful step toward documenting innovative practices at the classroom, school, district, and state levels. The book’s 14 chapters (and 30 authors) avoid the advocacy orientation frequently found in discussions of data-driven decision making (D3M). Case studies provide rich detail that is often missing in other discussions. It is useful to get a sense of some of the actual questions that were addressed and that motivated the setup of the technology of data warehouses, assessment tools, and dashboards.

The book provides a good introduction to a complicated field that is currently attracting much attention from practitioners and researchers, as well as from technology vendors. In some ways, however, it does not go deep enough in providing a framework for understanding the topic. While one of the key chapters provides a conceptual framework in terms of a set of processes and related skills, such as collecting, analyzing, and prioritizing data, the framework is static in the sense that there is no account of, or theory as to, how teachers, principals, or district administrators might acquire these skills or come to be interested in using them. Without a developmental theory, we can’t predict which processes or skills are likely to be prerequisites for others or how processes can be scaffolded, for example, by using some of the useful technologies described in several of the chapters. Many of the examples of how data are used can be loosely described as data mining aimed at identifying needs, gaps, or problems. Situations where statistical analysis (beyond averages of descriptive data) is called for are mentioned only occasionally. Such a question might ask how outcomes under a new program compare both to what would have happened without the program and to the level of identified need for which the program was considered a solution. The chapters for the most part keep the discussion at a level that does not call for a statistical test or an examination of a correlation. This may be reasonable when considering decisions within a classroom, but it is an oversimplification when it comes to decisions considered at the district central office.

It is reasonable to posit stages in a developmental sequence where descriptive needs assessment would be a logical first step before moving on to more complex analyses that would, for example, introduce statistical controls. On a technical level, it is reasonable to consider data on a single school year to be both more readily available to school district administrators and also suited to more straightforward questions than multi-year longitudinal data. For example, a question about mean differences among ethnic groups calls for simpler analytic tools than a question about changes over time in the size of the gap between groups. Both may feed into a needs analysis, but the latter calls for statistical calculations that go beyond a simple comparison. Similarly, a question about whether a new program had an impact not only calls for statistical machinery but requires the introduction of experimental design in setting up an appropriate comparison. Again, it is reasonable to posit that incorporating research design into “data-driven” decisions is a more advanced stage that builds upon the tools and processes that explore correlations to identify potential areas of need. A developmental theory of data-driven school improvement may provide a basis for tools, supports, and professional development for school district personnel that can accelerate adoption of these valuable processes. Such a theory would provide a guide for starting where districts are and for providing the scaffold to a next level that builds incrementally on what is already in place. —DN

Mandinach, E., & Honey, M. (Eds.). (2008). Data Driven School Improvement: Linking Data and Learning. New York: Teachers College Press.

Friday, September 5, 2008

Where Advocating Accountability Falls Short

Calling for greater accountability continues to be a theme in American education policy. Recently, Senator Barack Obama made this proposition: “I’ll recruit an army of new teachers, and pay them higher salaries, and give them more support. And in exchange, I’ll ask for higher standards and more accountability” (August 28, 2008).

Although the details of policy positions are not generally provided in political speeches, this one is worth pulling apart to see what might be the issues in implementing such a policy.

First, what is the accountability that there will be more of? In this case we may presume that, since accountability is linked specifically to teacher salaries, teachers will be held accountable. Is this appropriate—or even possible—as federal policy? The educational enterprise can be held accountable at many levels. While teachers have face-to-face contact with students who may do well or poorly, the team of teachers working at a grade level or in a small school could be collectively accountable. Moving up a level, a principal could be held accountable for the school’s results. And the district superintendent and the state schools chief can also be accountable for results in their jurisdictions. From purely an accountability point of view, teachers are not necessarily the best focus for federal policy. Certainly, recruiting and incentive efforts can be federally funded, but it seems at best awkward to legislate sanctions for individual teachers based on holding them accountable for their individual performance in raising their students’ scores.

It is now technically possible to hold teachers accountable. Database and statistical technologies are available to link teacher identities to student records. District data systems routinely provide links between teachers and their rosters of students and, in many cases, these are extended longitudinally. Many state data systems have also begun providing unique teacher IDs so that linkages to the achievement of individual students can be tracked. And drawing on these longitudinal linkages, “value-added” analyses are being used to quantify the contribution over time of individual teachers to students in their classes.

However, as an approach to federal policy, taking a top-down tack—making superintendents and principals accountable—may be better than promoting technologies that attempt to measure individual teachers. (The technical controversies about the statistics used in some versions of value-added analysis are a noteworthy topic that we’ll save for another day.) A more productive approach may be to focus on the disincentives for teachers to collaborate or help one another when accountability is at the individual level. We find that many teachers report teaching students other than those officially registered in their classes. Frequently these are informal arrangements that can increase aggregate achievement for the students involved but muddy the district or state records for individual teachers. A school may be a more appropriate unit of accountability and, in that local context, data on individual teachers can more accurately be evaluated. School principals will know their schools’ standing among other schools in the district and will have school-level data to help in making staffing decisions. The central office staff, in turn, will have a broader view of the progress of the individual schools on which to base decisions about allocation of resources.

For any achievement-based accountability approach—whether at the district, school, or teacher level—it is important to understand achievement in relation to challenges related to, for example, the economic status of the district or neighborhood or the prior preparation of the students in the classroom. We must also consider the growth of the students, not just their proficiency status at the end of the year. These considerations require statistical calculations, not just counting up percentages of proficient students. And once we begin looking at analyses such as trajectories of some schools compared to others facing similar challenges, we can take the next step and begin tracking the success of interventions, professional development programs, and other local policies aimed at addressing areas of weakness or supporting teachers who are not helping their students make the kind of progress the school is looking for. The data systems and the analytic tools needed to track a teacher’s or a school’s progress over time can also be turned to guiding resource allocations and interventions and, as a next logical step, providing the capability of tracking whether the additional interventions, support, or professional development are having the desired impact.

Given the capacities of current data systems, how might a policy involving greater support and greater accountability for teachers be implemented? Here is one example. Federal funding to districts for principal leadership training could be tied to district-level labor contracts giving the building leaders greater control over personnel decisions. The leadership training would include the interpretation and use of longitudinal data, providing tools for comparing the principal’s school to others facing similar challenges. Professional learning communities for the principals could be part of this leadership program, assisting district teams to work through ideas for interventions. While the achievement-based accountability measures of teacher performance could be used as one of the factors in building-level decisions, the leadership training would include how to use the data systems for tracking both teacher job performance and the impact of support and training on that performance. —DN

Sunday, June 1, 2008

How Do Districts Use Evidence?

The research journal Educational Policy published an article this month that is important for understanding how data and evidence are used at the school district level: “Evidence-Based Decision Making in School District Central Offices” by Meredith Honig and Cynthia Coburn, both alumnae of Stanford’s Graduate School of Education (Honig & Coburn, 2008). Keep in mind that most data-driven decision-making research (and most decision making based on data) occurs at the classroom level, where teachers get immediate and actionable information about individual students. But Honig and Coburn are talking about central office administrators. Data at the district level are more complicated and, as the authors document, infused with political complications. When district leaders are making decisions about products or programs to adopt, evidence of the scientific sort is at best one element among many.

Honig and Coburn review three decades of research and, after eliminating purely anecdotal and obviously advocacy pieces, they found 52 books and articles of substantial value. What they document parallels our own experience at Empirical Education in many respects. That is, rigorous evidence, once it is gathered through either reading scientific reviews or conducting local program evaluations, is never used “directly.” It is not a matter of the evidence dictating the decision. They document that scientific evidence is incorporated into a wide range of other kinds of information and evidence. These may include teacher feedback, implementation issues, past experience, or what the neighboring district superintendent said about it—all of which are legitimate sources of information and need to be incorporated into the thinking about what to do. This “working knowledge” is practical and “mediates” between information sources and decisions.

The other aspect of decision-making that Honig and Coburn address involves the organizational or political context of evidence use. In many cases the decision to move forward has been made before the evaluation is complete or even started; thus the evidence from it is used (or ignored) to support that decision or to maintain enthusiasm. As in any policy organization or administrative agency, there is a strong element of advocacy in how evidence is filtered and used. The authors suggest that this filtering for advocacy can be useful in helping administrators make the case for programs that could be beneficial.

In other words, there is a cognitive/organizational reality that “mediates” between evidence and policy decisions. The authors contrast this reality with the position they attribute to federal policy makers and the authors of NCLB that scientific evidence ought to be used “directly” or instrumentally to make decisions. In fact, they see the federal policy as arguing that “these other forms of evidence are inappropriate or less valuable than social science research evidence and that reliance on these other forms is precisely the pattern that federal policy makers should aim to break” (p. 601). This is where their argument is weakest. The contrast they set up between the idea of practical knowledge mediating between evidence and decisions and the idea that evidence should be used directly is a false dichotomy. The “advocate for direct use of evidence” is a straw man. There are certainly researchers and research methodologists who do not study and are not familiar with how evidence is used in district decisions. But not being experts in decision processes does not make them advocates for a particular process called “direct.” The federal policy is not aimed at decision processes. Instead, it aims to raise the standards of evidence in formal research that claims to measure the impact of programs so that, when such evidence is integrated into decision processes and weighed against practical concerns of local resources, local conditions, local constraints, and local goals, the information value is positive. Federal policy is not trying to remove decision processes; it is trying to remove research reports that purport to provide research evidence but actually come to unwarranted conclusions because of poor research design, incorrect statistical calculations, or bias.

We should also not mistake Honig and Coburn’s descriptions of decision processes for descriptions of deep, underlying, and unchangeable human cognitive tendencies. It is certainly possible for district decision-makers to learn to be better consumers of research, to distinguish weak advocacy studies from stronger designs, and to identify whether a particular report can be usefully generalized to their local conditions. We can also anticipate an improvement in the level of the conversation among districts’ evaluation departments, curriculum departments, and IT staff so that local evaluations are conducted to answer critical questions and to provide useful information that can be integrated with other local considerations into a decision. —DN

Honig, M. I. & Coburn, C. (2008). Evidence-Based Decision Making in School District Central Offices. Educational Policy, 22(4), 578-608.

Wednesday, May 14, 2008

What Makes Randomization Hard to Do?

The question came up at the recent workshop held in Washington DC for school district researchers to learn more about rigorous program evaluation: “Why is the strongest research design often the hardest to make happen?” There are very good theoretical reasons to use randomized control when trying to evaluate whether a school district’s instructional or professional development program works. What we want to know is whether involving students and teachers in some program will result in outcomes that are better than if those same students and teachers were not involved in the program. The workshop presenter, Mark Lipsey of Vanderbilt University, pointed out that if we had a time machine we could observe how well the students and teachers achieved with the program, then go back in time, not give them the program — thus creating the science-fiction alternate universe — and watch how they did without the program. We can’t do that, so the next best thing is to find a group that is just like the one with the program and see how they do. By choosing who gets a program and who doesn’t from a pool of volunteer teachers (or schools) using a coin toss (or another random method), we can be sure that self-selection had nothing to do with group assignment and that, at least on average, the only difference between members of the two groups is that one group won the coin toss and the other didn’t. Most other methods introduce potential bias that can change the results.
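As a minimal sketch of how simple the mechanics can be, the following assigns a pool of volunteer teachers to program and comparison groups by a virtual coin toss, after pairing them on a similarity measure (a refinement we describe below). The names and similarity scores are invented.

```python
# A minimal sketch of pairing volunteer teachers on a similarity score and then
# tossing a coin within each pair. Names and scores are invented.
import random

random.seed(42)

teachers = [("Allen", 3.1), ("Baker", 3.2), ("Chen", 4.5),
            ("Diaz", 4.6), ("Evans", 2.0), ("Fox", 2.1)]
teachers.sort(key=lambda t: t[1])              # adjacent teachers are most similar
pairs = [teachers[i:i + 2] for i in range(0, len(teachers), 2)]

assignments = {}
for a, b in pairs:
    # The coin toss: within each pair, one teacher gets the program, one does not.
    if random.random() < 0.5:
        assignments[a[0]], assignments[b[0]] = "program", "comparison"
    else:
        assignments[a[0]], assignments[b[0]] = "comparison", "program"

print(assignments)
```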

Randomized control can work where the district is doing a small pilot and has only enough materials for some of the teachers, where resources call for a phased implementation starting with a small number of schools, or where slots in a program are going to be allocated by lottery anyway. To many people, the coin toss (or other lottery method) just doesn’t seem right. Any number of other criteria could be suggested as a better rationale for assigning the program: some students are needier, some teachers may be better able to take advantage of it, and so on. But the whole point is to avoid exactly those kinds of criteria and make the choice entirely random. The coin toss itself highlights the decision process, creating a concern that it will be hard to justify, for example, to a parent who wants to know why his kid’s school didn’t get the program.

Our own experience with random assignment has not been so negative. Most districts will agree to it, although some do refuse on principle. When we begin working with the teachers face-to-face, there is usually camaraderie about tossing the coin, especially when it is between two teachers paired up because of their similarity on characteristics they themselves identify as important (we’ve also found this pairing method helps give us more precise estimates of the impact). The main problem we find with randomization, if it is being used as part of a district’s own local program evaluation, is the pre-planning that is required. Typically, decisions as to which schools get the program first or which teachers will be selected to pilot the program are made before consideration is given to doing a rigorous evaluation. In most cases, the program is already in motion or the pilot is coming to a conclusion before the evaluation is designed. At that point in the process, the best method will be to find a comparison group from among the teachers or schools that were not chosen or did not volunteer for the program (or to look outside the district for comparison cases). The prior choices introduce selection bias that we can attempt to compensate for statistically; still, we can never be sure our adjustments eliminate the bias. In other words, in our experience the primary reason that randomization is harder than weaker methods is that it requires the evaluation design and the program implementation plan to be coordinated from the start. —DN

Monday, April 14, 2008

Data-Driven Decision Making—Applications at the District Level

Data warehouses and data-driven decision making were major topics of discussion at the Consortium for School Networking conference March 9-11 in Washington DC that Empirical Education staff attended. This conference has a sizable representation of Chief Information Officers from school districts as well as a long tradition of supporting instructional applications of technology. Clearly, with the onset of the accountability provisions of NCLB, the growing focus has been on organizing and integrating such school district data as test scores, class rosters, and attendance. While the initial motivation may have been to provide the required reports to the next level up, there continues to be a lively discussion of functionality within the district. The notion behind data-driven decision making (D3M) is that educators can make more productive decisions if they base them on this growing source of knowledge. Most of the attention has focused on teachers using data on students to make instructional decisions for individuals. At the CoSN conference, one speaker claimed that teachers’ use of data for classroom decisions was the true meaning of D3M; uses at the district level to inform decisions were at best of secondary importance. We would like to argue that the applications at the district level should not be minimized.

To start with, we should note that there is little evidence that giving teachers access to warehoused testing data is effective in improving achievement. We are involved in two experimental studies on this topic, but more should be undertaken if we are going to understand the conditions for success with this technology. We are intrigued by the possibility that, with several waves of data during the year, teachers become action researchers, working through the following steps: 1) seeing where specific students are having trouble, 2) trying out intervention techniques with these children or groups, and 3) examining the results within a few months (or weeks). The technique would thus be based not just on teacher impressions but on assessments that provide a measurement of student growth relative to standards and to the other students in the class. If a technique isn’t working, the teacher will move to another. And the cycle continues.
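As a sketch of what the third step might look like once benchmark data are in hand, the following compares the growth of students who received an intervention with the rest of the class. The student names, scores, and proficiency cut point are invented.

```python
# A sketch of the teacher's third step: after trying an intervention with a small
# group, check each student's growth between benchmark waves relative to the class.
import pandas as pd

df = pd.DataFrame({
    "student":      ["Ana", "Ben", "Cora", "Dev", "Eli", "Fay"],
    "wave1":        [310, 295, 340, 288, 352, 301],   # fall benchmark score (invented)
    "wave2":        [322, 318, 349, 296, 357, 330],   # winter benchmark score (invented)
    "intervention": [True, True, False, True, False, True],
})

df["growth"] = df["wave2"] - df["wave1"]
df["growth_vs_class"] = df["growth"] - df["growth"].mean()   # above/below classmates
df["proficient_wave2"] = df["wave2"] >= 320                  # invented cut score

print(df[["student", "intervention", "growth", "growth_vs_class", "proficient_wave2"]])
print("mean growth, intervention group:", round(df[df.intervention]["growth"].mean(), 1))
print("mean growth, rest of the class :", round(df[~df.intervention]["growth"].mean(), 1))
```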

D3M can be used in a similar three-step process at the district level, but this is much rarer. At the district level, D3M is most often used diagnostically to identify areas of weakness, for example, to identify schools that are doing worse than they should or to identify achievement gaps between categories of students. This is like the first step in the teacher’s D3M. District planners may then make decisions about acquiring new instructional programs, providing PD to certain teachers, replacing particular staff, and so on. This is like the teacher’s second step. What we see far less frequently at the district level is the teacher’s third step: looking at the results so as to measure whether the new program is having the desired effect. In the district decision context, this step requires a certain amount of planning and research design. Experimental control is not as important in the classroom because the teacher will likely be aware of any other plausible explanations for a student’s change. On the scale of a district pilot program or new intervention, research design elements are needed to distinguish any difference from what might have happened anyway and to exclude selection bias. Also, where the decision potentially affects a large number of schools, teachers, and students, statistical calculations are needed to determine the size of the difference and the level of confidence the decision makers can have that the result is not just a matter of chance. We encourage the proponents of D3M to consider the importance of its application at the district level to take advantage, on a larger scale, of processes that happen in the classroom every day. —DN

Friday, March 14, 2008

Making Way for Innovation: An Open Email to Two Congressional Staffers Working on NCLB

Roberto and Brad, it was a pleasure hearing your commentary at the February 20 Policy Forum “Using Evidence for a Change” and having a chance to meet you afterward. Roberto, we promised you a note summarizing the views expressed by several on the panel and raised in the question period.

We can contrast two views of research evident at the policy forum:

The first view holds that, because research is so expensive and difficult, only the federal government can afford it and only highly qualified professional researchers can be entrusted with it. The goal of such research activities is to obtain highly precise and generalizable evidence. In this view, practitioners (at the state, district, or school level) are put in the role of consumers of the evidence.

The second view holds that research should be made a routine activity within any school district contemplating a significant investment in an instructional or professional development program. Since all the necessary data are readily at hand (and without FERPA restrictions), it is straightforward for district personnel to conduct their own simple comparison group study. The result would be reasonably accurate local information on the program’s impact in that setting. In this view, practitioners are producers of the evidence.

The approach suggested by the second view is far more cost effective than the first, as well as more timely. It is also driven directly by the immediate needs of districts. While each individual study would pertain only to a local implementation, in combination, hundreds of such studies can be collected and published by organizations like the What Works Clearinghouse or by consortia of states or districts. Turning practitioners into producers of evidence also removes the brakes on innovation identified in the policy forum. With practitioners as evidence producers, schools can adopt “unproven” programs as long as they do so as a pilot that can be evaluated for its impact on student achievement.

A few tweaks to NCLB will be necessary to turn practitioners into producers of evidence:

1. Currently NCLB implicitly takes the “practitioners as consumers of evidence” view in requiring that the scientifically based research be conducted prior to a district’s acquisition of a program. We have already published a blog entry analyzing the changes to the SBR language in the Miller-McKeon and Lugar-Bingaman proposals and how minor modifications could remove the implicit “consumers” view. These are tweaks such as, for example, changing a phrase that calls for:
“including integrating reliable teaching methods based on scientifically valid research”
to a call for
“including integrating reliable teaching methods based on, or evaluated by, scientifically valid research.”

2. Make clear that a portion of the program funds are to be used in piloting new programs so they can be evaluated for their impact on student achievement. Consider a provision similar to the “priority” idea that Nina Rees persuaded ED to use in awarding its competitive programs.

3. Build in a waiver provision such as that proposed by the Education Sciences Board that would remove some of the risk to a failing district in piloting a new promising program. This “pilot program waiver” should cover consequences of failure for the participating schools for the period of the pilot. The waiver should also remove requirements that NCLB program funds be used only for the lowest scoring students, since this would preclude having the control group needed for a rigorous study.

The view of “practitioners as consumers of evidence” is widely unpopular. It is viewed by decision-makers as inviting the inappropriate construction of an approved list, as was revealed in the Reading First program. It is seen as restricting local innovation by requiring compliance with the proclamations of federal agencies. In the end, science is reduced to a check box on the district requisition form. If education is to become an evidence-based practice, we have to start with the practitioners. —DN

Thursday, February 14, 2008

Outcomes—Who Cares About Them?

This should be obvious about research in education: If teachers or administrators don’t care about the outcomes we measure, then no matter how elegantly we design and analyze experiments and present their findings, they won’t mean much.

A straightforward—simplistic, perhaps—approach to making an experiment meaningful is to measure whether the program we are testing has an impact on the same test scores to which the educators are held accountable. If the instructional or professional development program won’t help the school move more students into the proficient category, then it is not worth the investment of time and money.

Well, maybe. Suppose the high-stakes test is a poor assessment of the state standards for skills like problem-solving or communication? As researchers, we’ve found ourselves in this quandary.

At the other end of the continuum, many experimental studies use outcome measures explicitly designed to measure the program being studied. One strategy is to test both the program group and the comparison group on material that was taught only to the program group. Although this may seem like an unfair bias, it can be a reasonable approach for what we would call an “efficacy” study—an experiment that is trying to determine whether the program has any effect at all under the best of circumstances (similar to using a placebo in medicine). Still, it is certainly important for consumers of research not to mistake the impact measured in such studies for the impact they can expect to see on their high-stakes test.

Currently, growth models are being discussed as better ways to measure achievement. It is important to keep in mind that these techniques do not solve the problem of mismatch between standards and tests. If the test doesn’t measure what is important, then the growth model just becomes a way to measure progress on a scale that educators don’t believe in. Insofar as growth models extend high-stakes testing into measuring the amount of student growth for which each individual teacher is responsible, the disconnect just grows.

One technique that experimental studies can take advantage of without waiting for improvements in testing is the measurement of outcomes that consist of changes in classroom processes. We call these “mediators” because the process changes result from the experimental manipulation, they happen over time before the final outcome is measured, and in theory they represent a possible mechanism by which the program has an impact on the final outcome. For example, in testing an inquiry-based math program, we can measure—through surveys or observations—the extent to which classroom processes such as inquiry and hands-on activities appear more (or less) among the program or comparison teachers. This is best done where teachers (or schools) have been assigned randomly to program or comparison groups. And it is essential that we are measuring some factor that could be observed in both conditions. Differences in the presence of a mediator can often be measured long before the results of outcome tests are available, giving school administrators an initial indication of the new program’s success. The relationship of the program’s impact on the mediator and its impact on the test outcome can also tell us something about how the test impact came about.
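As a sketch of how early a mediator can be examined, the following compares a hypothetical survey measure of classroom inquiry between randomly assigned program and comparison teachers, well before outcome test scores arrive. The data are simulated, and the 1-to-5 inquiry scale is an assumption for illustration.

```python
# A minimal sketch of checking a mediator early in a study: compare a survey-based
# measure of classroom inquiry (hypothetical 1-5 scale) between randomly assigned
# program and comparison teachers. Data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
inquiry_program = rng.normal(3.6, 0.7, 30)      # simulated scores, 30 program teachers
inquiry_comparison = rng.normal(3.1, 0.7, 30)   # simulated scores, 30 comparison teachers

t, p = stats.ttest_ind(inquiry_program, inquiry_comparison)
diff = inquiry_program.mean() - inquiry_comparison.mean()
print(f"difference on inquiry measure = {diff:.2f} (t = {t:.2f}, p = {p:.3f})")
```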

Testing is far from perfect. Improvements in the content of what is tested, combined with technical improvements that can lower the cost of delivery and speed the turn-around of results to the students and teachers, will benefit both school accountability and research on the effectiveness of instructional and professional development programs. In the meantime, consumers of research have to consider whether an outcome measure is something they care about. —DN

Monday, January 14, 2008

What’s Unfair about a Margin of Error?

We think that TV newsman John Merrow is mistaken when, in an Education Week opinion piece (“Learning Without Loopholes”, December 4, 2007), he says it is inappropriate for states to use a “margin of error” in calculating whether schools have cleared an AYP hurdle. To the contrary, we would argue that schools don’t use this statistical technique as much as they should.

Merrow documents a number of cynical methods districts and states use for gaming the AYP system so as to avoid having their schools fall into “in need of improvement” status. One alleged method is the statistical technique familiar in reporting opinion surveys where a candidate’s lead is reported to be within the margin of error. Even though there may be a 3-point gap, statistically speaking, with a plus-or-minus 5-point margin of error, the difference between the candidates may actually be zero. In the case of a school, the same idea may be applied to AYP. Let’s say that the amount of improvement needed to meet AYP for the 4th grade population were 50 points (on the scale of the state test) over last year’s 4th grade scores. But let’s imagine that the 4th grade scores averaged only 35 points higher. In this case, the school appears to have missed the AYP goal by 15 points. However, if the margin of error were set at plus-or-minus 20 points, we would not have the confidence to conclude that there’s a difference between the goal and the measured value.

[Figure: Margin of Error bar graph]

What is a margin of error or “confidence interval”? First of all, we assume there is a real value that we are estimating using the sample. Because we don’t have perfect knowledge, we try to make a fair estimate with some specified level of confidence. We want to know how far the average score that we got from the sample (e.g., of voters or of our 4th grade students) could possibly be from the real average. If we were, hypothetically, to go back and take lots of new samples, we assume they would be spread out around the real value. But because we have only one sample to work with, we do a statistical calculation based on the size of the sample, the nature of the variability among scores, and our desired level of confidence to establish an interval around our estimated average score. With the 80% confidence interval that we illustrated, we are saying that there’s a 4-in-5 chance that the true value we’re trying to estimate is within that interval. If we need greater confidence (for example, if we need to be sure that the real score is within the interval 95 out of 100 times), we have to make the interval wider.
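For readers who want to see the arithmetic, here is a sketch of the calculation using made-up summary numbers in the spirit of the example above (an AYP goal of a 50-point gain and an observed mean gain of 35 points); the standard deviation and number of tested students are assumptions.

```python
# A sketch of the confidence interval calculation with made-up summary numbers.
from scipy import stats

goal = 50           # gain needed to make AYP (from the example above)
mean_gain = 35      # observed mean gain for this year's 4th graders
sd = 110            # standard deviation of individual gains (assumed)
n = 80              # number of tested 4th graders (assumed)
se = sd / n ** 0.5  # standard error of the mean gain

for conf in (0.80, 0.95):
    # A higher confidence level means a larger multiplier and a wider interval.
    t_mult = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)
    lo, hi = mean_gain - t_mult * se, mean_gain + t_mult * se
    print(f"{int(conf * 100)}% CI: ({lo:.1f}, {hi:.1f})  "
          f"includes the goal of {goal}? {lo <= goal <= hi}")
```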

Merrow argues that, while using this statistical technique to get an estimated range is appropriate for opinion polls, where a sample of 1,000 voters from a much larger pool is used and we are figuring by how much the result may change if we had a different sample of 1,000 voters, the technique is not appropriate for a school, where we are getting a score for all the students. After all, we don’t use a margin of error in the actual election; we just count all the ballots. In other words, there is no “real” score that we are estimating. The school’s score is the real score.

We disagree. An important difference between an election and a school’s mean achievement score is that the achievement score, in the AYP context, implies a causal process: Being in need of improvement implies that the teachers, the leadership, or other conditions at the school need to be improved and that doing so will result in higher student achievement. While ultimately it is the student test scores that need to improve, the actions to be taken under NCLB pertain to the staff and other conditions at the school. If the staff is to blame for the poor conditions, we can’t blame them for a range of variations at the student level. This is where we see the uncertainty coming in.

First consider the way we calculate AYP. With the current “status model” method, we are actually comparing an old sample (last year’s 4th graders) with a new sample (this year’s 4th graders) drawn from the same neighborhood. Do we want to conclude that the building staff would perform the same with a different sample of students? Consider also that the results may have been different if the 4th graders were assigned to different teachers in the school. Moreover, with student mobility and testing differences that occur depending on the day the test is given, additional variations must be considered. But more generally, if we are predicting that “improvements” in the building staff will change the result, we are trying to characterize these teachers in general, in relation to any set of students. To be fair to those who are expected to make change happen, we want to represent fairly the variation in the result that is outside the administrator’s and teachers’ control, and not penalize them if the difference between what is observed and what is expected can be accounted for by this variation.

The statistical methods for calculating a confidence interval (CI) around such an estimate, while not trivial, are well established. The CI helps us to avoid concluding there is a difference (e.g., between the AYP goal and the school’s achievement) when it is reasonably possible that no difference exists. The same technique applies if a district research director is asked whether a professional development program made a difference. The average score for students of the teachers who took the program may be higher than the average scores of students of (otherwise equivalent) teachers who didn’t. But is the difference large enough to be clearly distinct from zero? Did the size of the difference escape the margin of error? Without properly doing this statistical calculation, the district may conclude that the program had some value when the differences were actually just in the noise.
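Here is a sketch of that district calculation with made-up summary numbers. A real analysis would also need to account for the clustering of students within teachers, which this simple version ignores.

```python
# A sketch of asking whether the difference between students of PD-trained and
# untrained teachers is clearly distinct from zero. Summary numbers are made up,
# and the clustering of students within teachers is ignored for simplicity.
from scipy import stats

mean_pd, sd_pd, n_pd = 652.0, 38.0, 240   # students of teachers who took the PD
mean_no, sd_no, n_no = 647.0, 40.0, 250   # students of comparable teachers who did not

diff = mean_pd - mean_no
se_diff = (sd_pd**2 / n_pd + sd_no**2 / n_no) ** 0.5   # SE of the difference in means
t_mult = stats.t.ppf(0.975, df=n_pd + n_no - 2)        # 95% two-sided multiplier
lo, hi = diff - t_mult * se_diff, diff + t_mult * se_diff

print(f"difference = {diff:.1f}, 95% CI = ({lo:.1f}, {hi:.1f})")
print("clearly distinct from zero?", not (lo <= 0 <= hi))
```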

While the U.S. Department of Education is correct to approve the use of CIs, there is still an issue of using CIs that are far wider than justified. The width of a CI is a matter of choice and depends on the activity. Most social science research uses a 95% CI. This is the threshold for the so-called “statistical significance,” and it means that the likelihood is less than 5% that a difference as large or larger than the one observed would have occurred if the real difference (between the two candidates, between the AYP goal and the school’s achievement, or between classes taught by teachers with or without professional development) were actually zero. In scientific work, there is a concern to avoid declaring there is evidence for a difference when there is actually no difference. Should schools be more or less stringent than the world of science?

Merrow points out that many states have set their CI at a much more stringent 99%. This makes the CI so wide that the observed difference between the AYP goal and the measured scores would have to be very large before we say there is a difference. In fact, we’d expect such a difference to occur by chance alone only 1% of the time. In other words, the measured score would have to be very far below the AYP goal before we’d be willing to conclude that the difference we’re seeing isn’t due to chance. As Merrow points out, this is a good idea if the education agency considers NCLB to be unjust and punitive and wants to avoid schools being declared in need of improvement. But imagine what the “right” CI would be if NCLB gave schools additional assistance when identified as below target. It is still reasonable to include a CI in the calculation, but perhaps 80% would be more appropriate.

The concept of a confidence interval is essential as schools move to data-driven decision making. Statistical calculations are often entirely missing from data-mining tools, and chance differences end up being treated as important. Statistical methods, such as including pretest scores in the model, can make the calculations more precise and narrow the CI. Growth modeling, for example, allows us to use student-level (as opposed to grade-average) pretest scores to increase precision. School district decisions should be based on good measurement and a reasonable allowance for chance differences. —DN/AJ