Tuesday, November 30, 2010

Recognizing Success

When the Obama-Duncan administration approaches teacher evaluation, the emphasis is on recognizing success. We heard that clearly in Arne Duncan’s comments on the release of teacher value-added modeling (VAM) data for LA Unified by the LA Times. He’s quoted as saying, "What's there to hide? In education, we've been scared to talk about success." Since VAM is often thought of as a method for weeding out low performing teachers, Duncan’s statement referencing success casts the use of VAM in a more positive light. Therefore we want to raise the issue here: how do you know when you’ve found success? The general belief is that you’ll recognize it when you see it. But sorting through a multitude of variables is not a straightforward process, and that’s where research methods and statistical techniques can be useful. Below we illustrate how this plays out in teacher and in program evaluation.

As we report in our news story, Empirical is participating in the Gates Foundation project called Measures of Effective Teaching (MET). This project is known for its focus on value-added modeling (VAM) of teacher effectiveness. It is also known for having collected 13,000 hours of video from 3,000 teachers’ classrooms—an astounding accomplishment. Research partners from many top institutions hope to be able to identify the observable correlates for teachers whose students perform at high levels as well as for teachers whose students do not. (The MET project tested all the students with an “alternative assessment” in addition to using the conventional state achievement tests.) With this massive sample that includes both data about the students and videos of teachers, researchers can identify classroom practices that are consistently associated with student success. Empirical’s role in MET is to build a web-based tool that enables school system decision-makers to make use of the data to improve their own teacher evaluation processes. Thus they will be able to build on what’s been learned when conducting their own mini-studies aimed at improving their local observational evaluation methods.

When the MET project recently had its “leads” meeting in Washington DC., the assembled group of researchers, developers, school administrators, and union leaders were treated to an after-dinner speech and Q&A by Joanne Weiss. Joanne is now Arne Duncan’s chief of staff, after having directed the Race to the Top program (and before that was involved in many Silicon Valley educational innovations). The approach of the current administration to teacher evaluation–emphasizing that it is about recognizing success—carries over into program evaluation. This attitude was clear in Joanne’s presentation, in which she declared an intention to “shine a light on what is working.” The approach is part of their thinking about the reauthorization of ESEA, where more flexibility is given to local decision-makers to develop solutions, while the federal legislation is more about establishing achievement goals such as being the leader in college graduation.

Hand in hand with providing flexibility to find solutions, Joanne also spoke of the need to build “local capacity to identify and scale up effective programs.” We welcome the idea that school districts will be free to try out good ideas and identify those that work. This kind of cycle of continuous improvement is very different from the idea, incorporated in NCLB, that researchers will determine what works and disseminate these facts to the practitioners. Joanne spoke about continuous improvement in the context of teachers and principals, where on a small scale it may be possible to recognize successful teachers and programs without research methodologies. While a teacher’s perception of student progress in the classroom may be aided by regular assessments, the determination of success seldom calls for research design. We advocate for a broader scope, and maintain that a cycle of continuous improvement is just as much needed at the district and state levels. At those levels, we are talking about identifying successful schools or successful programs where research and statistical techniques are needed to direct the light onto what is working. Building research capacity at the district and state level will be a necessary accompaniment to any plan to highlight successes. And, of course, research can’t be motivated purely by the desire to document the success of a program. We have to be equally willing to recognize failure. The administration will have to take seriously the local capacity building to achieve the hoped-for identification and scaling up of successful programs.

Wednesday, September 8, 2010

2010-2011: The Year of the VAM

If you haven’t heard about Value-Added Modeling (VAM) in relation to the controversial teacher ratings in Los Angeles and subsequent brouhaha in the world of education, chances are that you’ll hear about it in the coming year.

VAM is a family of statistical techniques for estimating the contribution of a teacher or of a school to the academic growth of students. Recently, the LA Times obtained the longitudinal test score records for all the elementary school teachers and students in LA Unified and had a RAND economist (working as an independent consultant) run the calculations. The result was a “score” for all LAUSD elementary school teachers. Note that the economist who did the calculations wrote up a technical report on how it was done and the specific questions his research was aimed at answering.

Reactions to the idea that a teacher could be evaluated using a set of test scores—in this case from the California Standards Test—were swift and divisive. The concept was denounced by the teachers’ union, with the local leader calling for a boycott. Meanwhile, the US Secretary of Education, Arne Duncan, made headlines by commenting favorably on the idea. The LA Times quotes him as saying “What’s there to hide? In education, we’ve been scared to talk about success.”

There is a tangle of issues here, along with exaggerations, misunderstandings, and confusion between research techniques and policy decisions. This column will address some of the issues over the coming year. We also plan to announce some of our own contributions to the VAM field in the form of project news.

The major hot-button issues include appropriate usage (e.g., for part or all of the input to merit pay decisions) and technical failings (e.g., biases in the calculations). Of course, these two issues are often linked; for example, many argue that biases may make VAM unfair for individual merit pay. The recent Brief from the Economic Policy Institute, authored by an impressive team of researchers (several our friends/mentors from neighboring Stanford), makes a well reasoned case for not using VAM as the only input to high-stakes decisions. While their arguments are persuasive with respect to VAM as the lone criterion for awarding merit pay or firing individual teachers, we still see a broad range of uses for the technique, along with the considerable challenges.

For today, let’s look at one issue that we find particularly interesting: How to handle teacher collaboration in a VAM framework. In a recent Education Week commentary, Kim Marshall argues that any use of test scores for merit pay is a losing proposition. One of the many reasons he cites is its potentially negative impact on collaboration.

A problem with an exercise like that conducted by the LA Times is that there are organizational arrangements that do not come into the calculations. For example, we find that team teaching within a grade at a school is very common. A teacher with an aptitude for teaching math may take another teacher’s students for a math period, while sending her own kids to the other teacher for reading. These informal arrangements are not part of the official school district roster. They can be recorded (with some effort) during the current year but are lost for prior years. Mentoring is a similar situation, wherein the value provided to the kids is distributed among members of their team of teachers. We don’t know how much difference collaborative or mentoring arrangements make to individual VAM scores, but one fear in using VAM in setting teacher salaries is that it will militate against productive collaborations and reduce overall achievement.

Some argue that, because VAM calculations do not properly measure or include important elements, VAM should be disqualified from playing any role in evaluation. We would argue that, although they are imperfect, VAM calculations can still be used as a component of an evaluation process. Moreover, continued improvements can be made in testing, in professional development, and in the VAM calculations themselves. In the case of collaboration, what is needed are ways that a principal can record and evaluate the collaborations and mentoring so that the information can be worked into the overall evaluation and even into the VAM calculation. In such an instance, it would be the principal at the school, not an administrator at the district central office, who can make the most productive use of the VAM calculations. With knowledge of the local conditions and potential for bias, the building leader may be in the best position to make personnel decisions.

VAM can also be an important research tool—using consistently high and/or low scores as a guide for observing classroom practices that are likely to be worth promoting through professional development or program implementations. We’ve seen VAM used this way, for example, by the research team at Wake County Public Schools in North Carolina in identifying strong and weak practices in several content areas. This is clearly a rich area for continued research.

The LA Times has helped to catapult the issue of VAM onto the national radar. It has also sparked a discussion of how school data can be used to support local decisions—which can’t be a bad thing.
— DN

Thursday, June 3, 2010

Making Vendor Research More Credible

The latest evidence that research can be both rigorous and relevant was the subject of an announcement that the Software and Information Industry Association (SIIA) made last month about their new guidelines for conducting effectiveness research. The document is aimed at SIIA members, most of whom are executives of education software and technology companies and not necessarily schooled in research methodology. The main goal in publishing the guidelines is to improve the quality—and therefore the credibility—of research sponsored by the industry. The document provides SIIA members with things to keep in mind when contracting for research or using research in marketing materials. The document also has value for educators, especially those responsible for purchasing decisions. That’s an important point that I’ll get back to.

One thing to make clear in this blog entry is that while your humble blogger (DN) is given credit as the author, the Guidelines actually came from a working group of SIIA members who put in many months of brainstorming, discussion, and review. DN’s primary contribution was just to organize the ideas, ensure they were technically accurate, and put them into easy to understand language.

Here’s a taste of some of the ideas contained in the 22 guidelines:

• With a few exceptions, all research should be reported regardless of the result. Cherry picking just the studies with strong positive results distorts the facts and in the long run hurts credibility. One lesson that might be taken from this is that conducting several small studies may be preferable to trying to prove a product effective (or not) in a single study.
• Always provide a link to the full report. Too often in marketing materials (including those of advocacy groups, not just publishers) a fact such as “8th grade math achievement increased from 31% in 2004 to 63% in 2005,” is offered with no citation. In this specific case, the fact was widely cited but after considerable digging could be traced back to a report described by the project director as “anecdotal”.
• Be sure to take implementation into account. In education, all instructional programs require setting up complex systems of teacher-student interaction, which can vary in numerous ways. Issues of how research can support the process and what to do with inadequate or outright failed implementation must be understood by researchers and consumers of research.
• Watch out for the control condition. In education there are no placebos. In almost all cases we are comparing a new program to whatever is in place. Depending on how well the existing program works, the program being evaluated may appear to have an impact or not. This calls for careful consideration of where to test a product and understandable concern by educators as to how well a particular product tested in another district will perform against what is already in place in their district.

The Guidelines are not just aimed at industry. SIIA believes that as decision-makers at schools begin to see a commitment to providing stronger research, their trust in the results will increase. It is also in the educators’ interest to review the guidelines because they provide a reference point for what actionable research should look like. Ultimately, the Guidelines provide educators with help in conducting their own research, whether it is on their own or in partnership with the education technology providers. -DN

Monday, March 29, 2010

Research: From NCLB to Obama’s Blueprint for ESEA

We can finally put “Scientifically Based Research” to rest. The term that appeared more than 100 times in NCLB appears zero times in the Obama administration’s Blueprint for Reform, which is the document outlining its approach to the reauthorization of ESEA. The term was always an awkward neologism, coined presumably to avoid simply saying “scientific research.” It also allowed NCLB to contain an explicit definition to be enforced—a definition stipulating not just any scientific activities, but research aimed at coming to causal conclusions about the effectiveness of some product, policy, or laboratory procedure.

A side effect of the SBR focus has been the growth of a compliance mentality among both school systems and publishers. Schools needed some assurance that a product was backed by SBR before they would spend money, while textbooks were ranked in terms of the number of SBR-proven elements they contained.

Some have wondered if the scarcity of the word “research” in the new Blueprint might signal a retreat from scientific rigor and the use of research in educational decisions (see, for example, Debra Viadero’s blog). Although the approach is indeed different, the new focus makes a stronger case for research and extends its scope into decisions at all levels.

The Blueprint shifts the focus to effectiveness. The terms “effective” or “effectiveness” appear about 95 times in the document. “Evidence” appears 18 times. And the compliance mentality is specifically called out as something to eliminate.

“We will ask policymakers and educators at all levels to carefully analyze the impact of their policies, practices, and systems on student outcomes. ... And across programs, we will focus less on compliance and more on enabling effective local strategies to flourish.” (p. 35)

Instead of the stiff definition of SBR, we now have a call to “policymakers and educators at all levels to carefully analyze the impact of their policies, practices, and systems on student outcomes.” Thus we have a new definition for what’s expected: carefully analyzing impact. The call does not go out to researchers per se, but to policymakers and educators at all levels. This is not a directive from the federal government to comply with the conclusions of scientists funded to conduct SBR. Instead, scientific research is everybody’s business now.

Carefully analyzing the impact of practices on student outcomes is scientific research. For example, conducting research carefully requires making sure the right comparisons are made. A study that is biased by comparing two groups with very different motivations or resources is not a careful analysis of impact. A study that simply compares the averages of two groups without any statistical calculations can mistakenly identify a difference when there is none, or vice versa. A study that takes no measure of how schools or teachers used a new practice—or that uses tests of student outcomes that don’t measure what is important—can’t be considered a careful analysis of impact. Building the capacity to use adequate study design and statistical analysis will have to be on the agenda of the ESEA if the Blueprint is followed.

Far from reducing the role of research in the U.S. education system, the Blueprint for ESEA actually advocates a radical expansion. The word “research” is used only a few times, and “science” is used only in the context of STEM education. Nonetheless, the call for widespread careful analysis of the evidence of effective practices that impact student achievement broadens the scope of research, turning all policymakers and educators into practitioners of science. — DN

Tuesday, February 23, 2010

Stimulating Innovation and Evidence

After a massive infusion of stimulus money into K-12 technology through the Title IID “Enhancing Education Through Technology” (EETT) grants, known also as “ed-tech” grants, the administration is planning to cut funding for the program in future budgets.

Well, they’re not exactly “cutting” funding for technology, but consolidating the dedicated technology funding stream into a larger enterprise, awkwardly named the “Effective Teaching and Learning for a Complete Education” program. For advocates of educational technology, here’s why this may not be so much a blow as a challenge and an opportunity.

Consider the approach stated at the White House “fact sheet”:

“The Department of Education funds dozens of programs that narrowly limit what states, districts, and schools can do with funds. Some of these programs have little evidence of success, while others are demonstrably failing to improve student achievement. The President’s Budget eliminates six discretionary programs and consolidates 38 K-12 programs into 11 new programs that emphasize using competition to allocate funds, giving communities more choices around activities, and using rigorous evidence to fund what works...Finally, the Budget dedicates funds for the rigorous evaluation of education programs so that we can scale up what works and eliminate what does not.”

From this, technology advocates might worry that policy is being guided by the findings of “no discernable impact” from a number of federally funded technology evaluations (including the evaluation mandated by the EETT legislation itself).

But this is not the case. The White House declares, “The President strongly believes that technology, when used creatively and effectively, can transform education and training in the same way that it has transformed the private sector.”

The administration is not moving away from the use of computers, electronic whiteboards, data systems, Internet connections, web resources, instructional software, and so on in education. Rather, the intention is that these tools are integrated, where appropriate and effective, into all of the other programs.

This does put technology funding on a very different footing. It is no longer in its own category. Where school administrators are considering funding from the “Effective Teaching and Learning for a Complete Education” program, they may place a technology option up against an approach to lower class size, a professional development program, or other innovations that may integrate technologies as a small piece of an overall intervention. Districts would no longer write proposals to EETT to obtain financial support to invest in technology solutions. Technology vendors will increasingly be competing for the attention of school district decision-makers on the basis of the comparative effectiveness of their solution—not just in comparison to other technologies but in comparison to other innovative solutions. The administration has clearly signaled that innovative and effective technologies will be looked upon favorably. It has also signaled that effectiveness is the key criterion.

As an Empirical Education team prepares for a visit to Washington DC for the conference of the Consortium for School Networking and the Software and Information Industry Association’s EdTech Government Forum, (we are active members in both organizations), we have to consider our message to the education technology vendors and school system technology advocates. (Coincidentally, we will also be presenting research at the annual conference of the Society for Research on Educational Effectiveness, also held in DC that week). As a research company we are constrained from taking an advocacy role—in principle we have to maintain that the effectiveness of any intervention is an empirical issue. But we do see the infusion of short term stimulus funding into educational technology through the EETT program as an opportunity for schools and publishers. Working jointly to gather the evidence from the technologies put in place this year and next will put schools and publishers in a strong position to advocate for continued investment in the technologies that prove effective.

While it may have seemed so in 1993 when the U.S. Department of Education’s Office of Educational Technology was first established, technology can no longer be considered inherently innovative. The proposed federal budget is asking educators and developers to innovate to find effective technology applications. The stimulus package is giving the short term impetus to get the evidence in place. — DN

Friday, January 8, 2010

Rigor AND Relevance

One of the conversations at the Institute of Education Sciences (the federal research agency) in 2010 is about rigor. How do we adhere to strict rules about what is accepted as scientific evidence while making the work sponsored by the agency more relevant to educators, as the director, John Easton, wants to do?

The conflict between rigor and relevance arises for a number of reasons that we will illustrate in this entry. The basic problem arises when rigor is defined in terms of specific methodologies such as randomized experiments or a specific criterion such as a 95% confidence interval. Defining rigor by such procedural rules restricts the body of evidence to a small number of studies and to a narrow range of questions that can be answered with the methods that would be considered acceptable. Our position is not that the education sciences have to become less rigorous in order to become more relevant. Instead, our position is that the concept of scientific rigor is being misunderstood.

Rigor, in ordinary English, is used to suggest rigidly following rules and procedures. However, because blind adherence to procedures is inappropriate in any area of science, the usage within the education sciences needs clarification and realignment. Our suggestion to IES is to focus on the underlying scientific principles rather than the procedures and criteria derived from the principles. Here are some examples.

The standard rules of research assume that a positive outcome identified in a study is very unlikely to be an artifact of a particular sample. There is a very important principle behind this that must be rigorously understood by researchers. And the appropriate statistical calculations must be applied. The rigor, however, is in understanding the trade-off between mistaking a false positive result for a real result or erroneously rejecting a positive effect of a new program as statistically insignificant when, in fact, there is a real difference. Scientific practice favors protecting against the first kind of mistake and conventionally sets the bar high. But changing the trade-off to favor avoiding the mistake of considering a program ineffective when it is really effective would not constitute less rigor. Faced with a very serious problem, a policy maker may prefer the risk of spending money on something that might not work rather than rejecting a promising program that narrowly missed the conventional threshold for statistical significance.

Randomization provides another example. The fundamental principle that has to be understood is how results of quantitative studies can be biased by confounding and how controlling for the effects of confounders produces a more accurate estimate of the treatment effect. While randomizing units (e.g., teachers, grade-level teams, schools) into treatment and control groups is recognized as the gold standard for controlling for the effects of potential confounding variables so as to isolate the impact of treatment, rigor is not accomplished by restricting education science to randomized experiments. A relevant study can often benefit from the use of observational data stored in school district information systems. Rigor would then consist of understanding how other designs and statistical controls can be appropriately applied to reduce potential bias (and when statistics can’t fix a bad design). There is nothing rigorous in discarding a dataset outright because it has not been created in a fully controlled experimental setting or because it is not free of measurement error.

While controlling selection bias through experimental designs and statistical adjustments must be understood by education scientists, it is also essential to attend to the context of the study and the range of its generalizability—what we can usefully conclude from the research. The experiment itself may have interfered with usual processes (a situation called ecological invalidity) such as when teacher-level randomization breaks up the existing team teaching within a grade-level team. We need a record of differences in program implementation that shows the relationship between quality of implementation and student performance and also prevents us from mistaking attributes of better-implementing teachers for attributes of the program. While the world of schools can be a messy place to conduct research, taking implementation issues seriously in the study design does not equal less rigor.

Ultimately, it comes down to knowing what we can say to the stakeholders, whether they are educators, publishers, or government agencies. What can be said derives from rigorous application of research principles and, to some extent, calls upon the art of careful audience-sensitive communication. It is not more rigorous to leave out the results of post-hoc explorations. Rigor in education science includes framing the results with appropriate cautions about preliminary findings, limitations on generalization, and results that are interesting and warrant continued tracking or more targeted investigations. Making progress in education science calls for rigor, and rigor includes clear communication and the participation of stakeholders in interpreting results.