Congratulations to Barack Obama on his sweeping victory. We can expect a change of policy climate with a new administration bringing new players and new policy ideas to the table. The appointment of a new director of the Institute of Education Sciences will provide an early opportunity to set direction for research and development. Reauthorization of NCLB and related legislation — including negotiating the definition and usage of “scientific research” — will be another, although the pundit consensus is that this change will take two more years, given the urgency of fixing the economy and resolving the war in Iraq. But already change is in the air with proposals for dramatic shifts in priorities. Here we raise a question about the big new idea that is getting a lot of play: innovation.
The educational innovation being called for includes funding for research and development (R&D, with a capital D for a focus on new ideas), acquisition of school technology, and funding for dissemination of new charter school models. The Brookings Institution recently published a policy paper, Changing the Game: The Federal Role in Supporting 21st Century Educational Innovation, by Sara Mead and Andy Rotherham. The paper imagines a new part of the U.S. Department of Education called the Office of Educational Entrepreneurship and Innovation (OEEI) that would be charged with implementing “a game-changing strategy [that] requires the federal government to make new types of investments, form new partnerships with philanthropy and the nonprofit sector, and act in new ways to support the growth of entrepreneurship and innovation within the public education system” (p. 34). The authors see this as complementary to standards-based reform, which is yielding diminishing returns. “To reach the lofty goals that standards-based reform has set, we need more than just pressure. We need new models of organizing schooling and new tools to support student learning that are dramatically more effective or efficient than what schools [are] doing today” (p. 35).
As an entrepreneurial education business, we applaud the idea behind the envisioned OEEI. The question for us arises when we think about how the OEEI would know whether a game-changing model is “dramatically more effective or efficient.” How will the OEEI decide which entrepreneurs should receive, or continue to receive, funds? Although the authors call for a “relentless focus on results,” they do not say how results would be measured. The venture capital (VC) model bases success on return on investment. Many VC investments fail, but if a good percentage succeeds, the overall monetary return to the VC is positive. Venture philanthropies often work the same way, except that the profits go back into supporting more entrepreneurs instead of back to the investors. Scaling up profitably is a sufficient sign of success. Perhaps we can assume that parents, communities, and school systems would not choose to adopt new products if they were ineffective or inefficient. If this were true, then scaling up would be an indirect indication of educational effectiveness. Will positive results for innovations in the marketplace be sufficient, or should there perhaps be a role for research to determine their effectiveness?
The authors suggest a $300 million per year “Grow What Works” fund, of which less than 5% would be set aside for “rigorous independent evaluations of the results achieved by the entrepreneurs” (p. 48). Similarly, their suggestion for a program like the Defense Advanced Research Projects Agency (DARPA) would allow only up to 10%. Research budgeted at this level is unlikely to have much influence against what is likely to be an overwhelming imperative for market success. Moreover, what will be the role of independent evaluations if they fail to show an innovation to be dramatically more effective or efficient? Funding research as a set-aside from a funded program is always an uphill battle because it appears to take money away from the core activity. So let’s be innovative and call this R&D, with the intention of empowering both the R and the D. Rather than offer a token concession to the research community, build ongoing formative research and impact evaluations into the development and scale-up processes themselves. This may more closely resemble the “design-engineering-development” activities that Tony Bryk describes.
Integrating the R with the D will have two benefits. First, it will provide information to federal and private funding agencies on progress toward whatever measurable goal is set for an innovation. Second, it will help parents, communities, and school systems make informed decisions about whether the innovation will work locally. The important partner here is the school district, which can take an active role in evaluation as well as development. Districts are the entities that ultimately have to decide whether the innovations are more effective and efficient than what they already do. They are also the ones with all the student, teacher, financial, and other data needed to conduct quasi-experiments or interrupted time series studies. If an agency like the OEEI is created, it should insist that school districts become partners in the R&D for innovations they consider introducing. —DN
Wednesday, November 5, 2008
Monday, January 14, 2008
What’s Unfair about a Margin of Error?
We think that TV newsman John Merrow is mistaken when, in an Education Week opinion piece (“Learning Without Loopholes”, December 4, 2007), he says it is inappropriate for states to use a “margin of error” in calculating whether schools have cleared an AYP hurdle. To the contrary, we would argue that schools don’t use this statistical technique as much as they should.
Merrow documents a number of cynical methods districts and states use for gaming the AYP system so as to avoid having their schools fall into “in need of improvement” status. One alleged method is the statistical technique familiar in reporting opinion surveys where a candidate’s lead is reported to be within the margin of error. Even though there may be a 3-point gap, statistically speaking, with a plus-or-minus 5-point margin of error, the difference between the candidates may actually be zero. In the case of a school, the same idea may be applied to AYP. Let’s say that the amount of improvement needed to meet AYP for the 4th grade population were 50 points (on the scale of the state test) over last year’s 4th grade scores. But let’s imagine that the 4th grade scores averaged only 35 points higher. In this case, the school appears to have missed the AYP goal by 15 points. However, if the margin of error were set at plus-or-minus 20 points, we would not have the confidence to conclude that there’s a difference between the goal and the measured value.
(Margin of Error bar graph) What is a margin of error, or “confidence interval”? First of all, we assume there is a real value that we are estimating using the sample. Because we don’t have perfect knowledge, we try to make a fair estimate with some specified level of confidence. We want to know how far the average score that we got from the sample (e.g., of voters or of our 4th grade students) could possibly be from the real average. If we were, hypothetically, to go back and take lots of new samples, we assume they would be spread out around the real value. But because we have only one sample to work with, we do a statistical calculation based on the size of the sample, the nature of the variability among scores, and our desired level of confidence to establish an interval around our estimated average score. With the 80% confidence interval that we illustrated, we are saying that there’s a 4-in-5 chance that the true value we’re trying to estimate is within that interval. If we need greater confidence (for example, if we need to be sure that the real score is within the interval 95 out of 100 times), we have to make the interval wider.
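To make this concrete, here is a minimal sketch of the calculation in Python. The gain scores are made up for illustration (they are not real AYP data), and we use a simple normal approximation rather than the more careful methods a state would use:

```python
from statistics import NormalDist, mean, stdev
from math import sqrt

def margin_of_error(scores, confidence=0.80):
    """Half-width of a confidence interval for the mean score
    (normal approximation; hypothetical data)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return z * stdev(scores) / sqrt(len(scores))

# Hypothetical 4th-grade gain scores, averaging 35 points
gains = [20, 55, 10, 48, 33, 62, 15, 41, 28, 38]
m = mean(gains)
moe = margin_of_error(gains, confidence=0.80)

# A school "misses" a 50-point AYP goal with confidence only if the
# whole interval (m - moe, m + moe) falls below 50.
print(f"mean = {m:.1f}, 80% CI = ({m - moe:.1f}, {m + moe:.1f})")
```

Note that asking for more confidence (say 95% instead of 80%) makes `moe` larger and the interval wider, exactly as described above.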
Merrow argues that, while using this statistical technique to get an estimated range is appropriate for opinion polls, where a sample of 1,000 voters from a much larger pool is used and we are figuring by how much the result may change if we had a different sample of 1,000 voters, the technique is not appropriate for a school, where we are getting a score for all the students. After all, we don’t use a margin of error in the actual election; we just count all the ballots. In other words, there is no “real” score that we are estimating. The school’s score is the real score.
We disagree. An important difference between an election and a school’s mean achievement score is that the achievement score, in the AYP context, implies a causal process: Being in need of improvement implies that the teachers, the leadership, or other conditions at the school need to be improved and that doing so will result in higher student achievement. While ultimately it is the student test scores that need to improve, the actions to be taken under NCLB pertain to the staff and other conditions at the school. If the staff is to blame for the poor conditions, we can’t blame them for a range of variations at the student level. This is where we see the uncertainty coming in.
First consider the way we calculate AYP. With the current “status model” method, we are actually comparing an old sample (last year’s 4th graders) with a new sample (this year’s 4th graders) drawn from the same neighborhood. Do we want to conclude that the building staff would perform the same with a different sample of students? Consider also that the results may have been different if the 4th graders were assigned to different teachers in the school. Moreover, with student mobility and testing differences that occur depending on the day the test is given, additional variations must be considered. But more generally, if we are predicting that “improvements” in the building staff will change the result, we are trying to characterize these teachers in general, in relation to any set of students. To be fair to those who are expected to make change happen, we want to represent fairly the variation in the result that is outside the administrator’s and teachers’ control, and not penalize them if the difference between what is observed and what is expected can be accounted for by this variation.
The statistical methods for calculating a confidence interval (CI) around such an estimate, while not trivial, are well established. The CI helps us to avoid concluding there is a difference (e.g., between the AYP goal and the school’s achievement) when it is reasonably possible that no difference exists. The same technique applies if a district research director is asked whether a professional development program made a difference. The average score for students of the teachers who took the program may be higher than the average scores of students of (otherwise equivalent) teachers who didn’t. But is the difference large enough to be clearly distinct from zero? Did the size of the difference escape the margin of error? Without properly doing this statistical calculation, the district may conclude that the program had some value when the differences were actually just in the noise.
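The district research director’s question can be sketched the same way. The scores below are invented, and again we use a normal approximation for the standard error of the difference; the point is simply that if the interval includes zero, the difference may be just in the noise:

```python
from statistics import NormalDist, mean, stdev
from math import sqrt

def diff_ci(a, b, confidence=0.95):
    """Normal-approximation CI for mean(a) - mean(b); a sketch, not a full analysis."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    d = mean(a) - mean(b)
    return d - z * se, d + z * se

pd_scores = [72, 68, 81, 75, 70, 77, 74, 69]  # students of PD teachers (hypothetical)
ctrl      = [70, 65, 74, 71, 66, 73, 69, 68]  # otherwise-equivalent comparison group

lo, hi = diff_ci(pd_scores, ctrl)
# If the interval (lo, hi) includes 0, we cannot rule out "no difference."
print(f"95% CI for the difference: ({lo:.1f}, {hi:.1f})")
```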
While the U.S. Department of Education is correct to approve the use of CIs, there is still an issue of using CIs that are far wider than justified. The width of a CI is a matter of choice and depends on the activity. Most social science research uses a 95% CI. This is the threshold for the so-called “statistical significance,” and it means that the likelihood is less than 5% that a difference as large or larger than the one observed would have occurred if the real difference (between the two candidates, between the AYP goal and the school’s achievement, or between classes taught by teachers with or without professional development) were actually zero. In scientific work, there is a concern to avoid declaring there is evidence for a difference when there is actually no difference. Should schools be more or less stringent than the world of science?
Merrow points out that many states have set their CI at a much more stringent 99%. This makes the CI so wide that the observed difference between the AYP goal and the measured scores would have to be very large before we say there is a difference. In fact, we’d expect such a difference to occur by chance alone only 1% of the time. In other words, the measured score would have to be very far below the AYP goal before we’d be willing to conclude that the difference we’re seeing isn’t due to chance. As Merrow points out, this is a good idea if the education agency considers NCLB to be unjust and punitive and wants to avoid schools being declared in need of improvement. But imagine what the “right” CI would be if NCLB gave schools additional assistance when identified as below target. It is still reasonable to include a CI in the calculation, but perhaps 80% would be more appropriate.
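The effect of the confidence level on the width of the interval is easy to see directly. The multiplier on the standard error grows with the confidence demanded:

```python
from statistics import NormalDist

# Two-sided critical values: the multiplier applied to the standard error
for conf in (0.80, 0.95, 0.99):
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    print(f"{conf:.0%} CI is about ±{z:.2f} standard errors")
```

A 99% interval is roughly twice as wide as an 80% interval (about ±2.58 versus ±1.28 standard errors), so a school’s score must fall much further below the goal before the state declares a real difference.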
The concept of a confidence interval is essential as schools move to data-driven decision making. Statistical calculations are often entirely missing from data-mining tools, and chance differences end up being treated as important. There are statistical methods such as including pretest scores in the statistical equation for making calculations more precise and for narrowing the CI. Growth modeling, for example, allows us to use student-level (as opposed to grade-average) pretest scores to increase precision. School district decisions should be based on good measurement and a reasonable allowance for chance differences. —DN/AJ