The nomination of John Q. Easton as the new head of IES highlights a long-standing debate. As Donald Campbell noted in the early 1970s in his classic paper “The Experimenting Society” (updated in 1988), “The U.S. Congress is apt to mandate an immediate, nationwide evaluation of a new program to be done by a single evaluator, once and for all, subsequent implementations to go without evaluation.” In contrast, he describes a “contagious cross-validation model for local programs” and recommends a much more distributed approach that would “support adoptions that included locally designed cross-validating evaluations, including funds for appropriate comparison groups not receiving the treatment.” Using such a model, he predicts that “After five years we might have 100 locally interpretable experiments” (p. 303). The work of the Consortium on Chicago School Research, which Easton has led, has a local focus on Chicago schools consistent with the idea that experiments should be locally interpretable. Elsewhere, we have argued that local experiments can also be vastly less expensive; thus having 100 of them is quite feasible. These experiments can also be completed in a more timely manner—it need not take five years to accumulate a wealth of evidence. We welcome a change in orientation at IES from organizing single large national experiments to the more useful, efficient, and practical model of supporting many local rigorous experiments. —DN
Campbell, D. T. (1988). The experimenting society. In E. S. Overman (Ed.), Methodology and epistemology for social science: Selected papers (p. 303). Chicago: University of Chicago Press.
Thursday, January 1, 2009
Role of Technology as Infrastructure for Schools
We are excited and optimistic about the New Year. It will be a time of great challenges as well as critical transitions and important debates about the future of education in this country. The emerging proposal for a massive stimulus package gives reason both for optimism and caution. Thus far the package has included repairing school buildings, improving their broadband connections, and bringing in more technology.
The Consortium for School Networking (CoSN), of which Empirical Education is a member and longtime supporter, advocates for technology for schools. In an article titled “Why Obama Can’t Ignore Ed Tech,” Jim Goodnight, founder and CEO of SAS, and Keith Krueger, CEO of CoSN, argue for investing in education technology as a way to support “21st century learning” while creating jobs in the technology and telecommunications sectors. They also suggest that the investment will lead school districts to hire staff members specializing in technical and technology curricula, a function they note is currently “vastly understaffed.”
As a research organization, we have to maintain a cautious attitude about claims, such as those in the CoSN article, that technology products will reduce discipline problems and dropout rates generally. We do agree that an investment in school technology will call for increased staffing—that is, creating jobs—which is the primary goal of the stimulus package.
But we believe there is a better argument for an investment in technical infrastructure. Network and data warehouse technologies inherently provide the mechanisms for measuring whether the investments are making a difference. Combined with online formative testing, automatic generation of usage data, and analytic tools, these technologies will put schools in a position to keep technology accountable for promised results. Using technology as a tool for tracking results of the stimulus package will, of course, create jobs. It will call for the creation of additional positions for data coaches, data analysts, trainers, and staff to handle the test administration, data cleaning, and communication functions.
The fear that a stimulus package will just throw money at the problem is justified. Yes, it will provide jobs and benefits to certain industries in the short term, but any lasting improvement may prove elusive. While building a new bridge employs construction workers and its lasting benefit can be measured, for example, by improved traffic, the lasting benefits of school technology may seem more subtle. We would argue, to the contrary, that a technology infrastructure for schools contains its own mechanism for accountability. The argument for school technology should drive home the notion that schools are capable of determining whether the stimulus investment is having an impact on learning, discipline, graduation rates, and other measurable outcomes. Policy makers will not have to depend on promises of new forms of learning when they can put in place a technology infrastructure that provides school decision-makers with information about whether the investment is making a difference. —DN
Wednesday, May 14, 2008
What Makes Randomization Hard to Do?
The question came up at the recent workshop held in Washington, DC for school district researchers to learn more about rigorous program evaluation: “Why is the strongest research design often the hardest to make happen?” There are very good theoretical reasons to use randomized control when trying to evaluate whether a school district’s instructional or professional development program works. What we want to know is whether involving students and teachers in some program will result in outcomes that are better than if those same students and teachers were not involved in the program. The workshop presenter, Mark Lipsey of Vanderbilt University, pointed out that if we had a time machine we could observe how well the students and teachers achieved with the program, then go back in time, withhold the program (thus creating the science-fiction alternate universe), and watch how they did without it. We can’t do that, so the next best thing is to find a group that is just like the one with the program and see how they do. By choosing who gets the program and who doesn’t from a pool of volunteer teachers (or schools) using a coin toss (or another random method), we can be sure that self-selection had nothing to do with group assignment and that, at least on average, the only difference between members of the two groups is that one group won the coin toss and the other didn’t. Most other methods introduce potential bias that can change the results.
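As a concrete illustration, here is a minimal sketch (in Python, with hypothetical teacher names) of the kind of lottery described above. Nothing here is from the workshop itself; the point is simply that group membership is decided by a random draw rather than by any judgment about the teachers. In practice, as described later in this post, we often pair teachers on characteristics they identify as important and toss the coin within each pair, a refinement this sketch omits.

```python
import random

def assign_by_coin_toss(volunteers, seed=None):
    """Randomly split a pool of volunteer teachers into program and comparison groups.

    Shuffling the pool and cutting it in half is equivalent to a fair lottery:
    no characteristic of a teacher influences which group that teacher lands in.
    """
    rng = random.Random(seed)  # a fixed seed makes the lottery reproducible and auditable
    pool = list(volunteers)
    rng.shuffle(pool)
    half = len(pool) // 2
    return {"program": pool[:half], "comparison": pool[half:]}

# Hypothetical pool of volunteers
teachers = ["Alvarez", "Baker", "Chen", "Dixon", "Evans", "Foster", "Garcia", "Hughes"]
groups = assign_by_coin_toss(teachers, seed=2008)
print("program group:   ", groups["program"])
print("comparison group:", groups["comparison"])
```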
Randomized control can work where the district is doing a small pilot and has only enough materials for some of the teachers, where resources call for a phased implementation starting with a small number of schools, or where slots in a program are going to be allocated by lottery anyway. To many people, the coin toss (or other lottery method) just doesn’t seem right. Any number of other criteria could be suggested as a better rationale for assigning the program: some students are needier, some teachers may be better able to take advantage of it, and so on. But the whole point is to avoid exactly those kinds of criteria and make the choice entirely random. The coin toss itself highlights the decision process, creating a concern that it will be hard to justify, for example, to a parent who wants to know why his kid’s school didn’t get the program.
Our own experience with random assignment has not been so negative. Most districts will agree to it, although some do refuse on principle. When we begin working with the teachers face-to-face, there is usually camaraderie about tossing the coin, especially when it is between two teachers paired up because of their similarity on characteristics they themselves identify as important (we’ve also found this pairing method helps give us more precise estimates of the impact). The main problem we find with randomization, if it is being used as part of a district’s own local program evaluation, is the pre-planning that is required. Typically, decisions as to which schools get the program first or which teachers will be selected to pilot the program are made before consideration is given to doing a rigorous evaluation. In most cases, the program is already in motion or the pilot is coming to a conclusion before the evaluation is designed. At that point in the process, the best method is to find a comparison group from among the teachers or schools that were not chosen or did not volunteer for the program (or to look outside the district for comparison cases). The prior choices introduce selection bias that we can attempt to compensate for statistically; still, we can never be sure our adjustments eliminate the bias. In other words, in our experience the primary reason that randomization is harder than weaker methods is that it requires the evaluation design and the program implementation plan to be coordinated from the start. —DN
Thursday, February 14, 2008
Outcomes—Who Cares About Them?
This should be obvious about research in education: If teachers or administrators don’t care about the outcomes we measure, then no matter how elegantly we design and analyze experiments and present their findings, they won’t mean much.
A straightforward—simplistic, perhaps—approach to making an experiment meaningful is to measure whether the program we are testing has an impact on the same test scores to which the educators are held accountable. If the instructional or professional development program won’t help the school move more students into the proficient category, then it is not worth the investment of time and money.
Well, maybe. But what if the high-stakes test is a poor assessment of the state standards for skills like problem-solving or communication? As researchers, we’ve found ourselves in this quandary.
At the other end of the continuum, many experimental studies use outcome measures explicitly designed to measure the program being studied. One strategy is to test both the program and the comparison group on material that was taught only to the program group. Although this may seem like an unfair bias, it can be a reasonable approach for what we would call an “efficacy” study—an experiment that is trying to determine whether the program has any effect at all under the best of circumstances (similar to using a placebo in medicine). Still, it is certainly important for the consumers of research not to mistake the impact measured in such studies with the impact they can expect to see on their high-stakes test.
Currently, growth models are being discussed as better ways to measure achievement. It is important to keep in mind that these techniques do not solve the problem of mismatch between standards and tests. If the test doesn’t measure what is important, then the growth model just becomes a way to measure progress on a scale that educators don’t believe in. Insofar as growth models extend high-stakes testing into measuring the amount of student growth for which each individual teacher is responsible, the disconnect just grows.
One technique that experimental studies can take advantage of without waiting for improvements in testing is the measurement of outcomes that consist of changes in classroom processes. We call these “mediators” because the process changes result from the experimental manipulation, they happen over time before the final outcome is measured, and in theory they represent a possible mechanism by which the program has an impact on the final outcome. For example, in testing an inquiry-based math program, we can measure—through surveys or observations—the extent to which classroom processes such as inquiry and hands-on activities appear more (or less) among the program or comparison teachers. This is best done where teachers (or schools) have been assigned randomly to program or comparison groups. And it is essential that we are measuring some factor that could be observed in both conditions. Differences in the presence of a mediator can often be measured long before the results of outcome tests are available, giving school administrators an initial indication of the new program’s success. The relationship of the program’s impact on the mediator and its impact on the test outcome can also tell us something about how the test impact came about.
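As a minimal sketch of what such a mediator measure might look like in practice (the observation counts below are invented for illustration), one could tally a classroom process, say hands-on activities per observed lesson, separately for program and comparison teachers long before any test scores are available:

```python
from statistics import mean

# Hypothetical observation records: (condition, hands-on activities observed in one lesson)
observations = [
    ("program", 3), ("program", 2), ("program", 4), ("program", 1),
    ("comparison", 1), ("comparison", 0), ("comparison", 2), ("comparison", 1),
]

def mediator_rate(records, condition):
    """Average count of the observed classroom process per lesson for one condition."""
    return mean(count for cond, count in records if cond == condition)

program_rate = mediator_rate(observations, "program")
comparison_rate = mediator_rate(observations, "comparison")
print(f"hands-on activities per lesson: program {program_rate:.2f}, comparison {comparison_rate:.2f}")
print(f"difference (program - comparison): {program_rate - comparison_rate:.2f}")
```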
Testing is far from perfect. Improvements in the content of what is tested, combined with technical improvements that can lower the cost of delivery and speed the turn-around of results to the students and teachers, will benefit both school accountability and research on the effectiveness of instructional and professional development programs. In the meantime, consumers of research have to consider whether an outcome measure is something they care about. —DN
Monday, January 14, 2008
What’s Unfair about a Margin of Error?
We think that TV newsman John Merrow is mistaken when, in an Education Week opinion piece (“Learning Without Loopholes”, December 4, 2007), he says it is inappropriate for states to use a “margin of error” in calculating whether schools have cleared an AYP hurdle. To the contrary, we would argue that schools don’t use this statistical technique as much as they should.
Merrow documents a number of cynical methods districts and states use for gaming the AYP system so as to avoid having their schools fall into “in need of improvement” status. One alleged method is the statistical technique familiar in reporting opinion surveys where a candidate’s lead is reported to be within the margin of error. Even though there may be a 3-point gap, statistically speaking, with a plus-or-minus 5-point margin of error, the difference between the candidates may actually be zero. In the case of a school, the same idea may be applied to AYP. Let’s say that the amount of improvement needed to meet AYP for the 4th grade population were 50 points (on the scale of the state test) over last year’s 4th grade scores. But let’s imagine that the 4th grade scores averaged only 35 points higher. In this case, the school appears to have missed the AYP goal by 15 points. However, if the margin of error were set at plus-or-minus 20 points, we would not have the confidence to conclude that there’s a difference between the goal and the measured value.
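Using the hypothetical numbers above (a 50-point AYP target gain, an observed 35-point gain, and a plus-or-minus 20-point margin of error), the decision rule reduces to a one-line check: is the shortfall larger than the margin? A sketch:

```python
goal_gain = 50        # improvement required to meet AYP (state-test scale points)
observed_gain = 35    # improvement actually observed this year
margin_of_error = 20  # plus-or-minus margin attached to the observed gain

shortfall = goal_gain - observed_gain  # 15 points
if shortfall > margin_of_error:
    print(f"Shortfall of {shortfall} points exceeds the margin; conclude the goal was missed.")
else:
    print(f"Shortfall of {shortfall} points is within the +/-{margin_of_error} margin; "
          "we cannot conclude the school missed the goal.")
```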
(Margin of Error bar graph) What is a margin of error, or “confidence interval”? First of all, we assume there is a real value that we are estimating using the sample. Because we don’t have perfect knowledge, we try to make a fair estimate with some specified level of confidence. We want to know how far the average score that we got from the sample (e.g., of voters or of our 4th grade students) could possibly be from the real average. If we were, hypothetically, to go back and take lots of new samples, we assume they would be spread out around the real value. But because we have only one sample to work with, we do a statistical calculation based on the size of the sample, the nature of the variability among scores, and our desired level of confidence to establish an interval around our estimated average score. With the 80% confidence interval that we illustrated, we are saying that there’s a 4-in-5 chance that the true value we’re trying to estimate is within that interval. If we need greater confidence (for example, if we need to be sure that the real score is within the interval 95 out of 100 times), we have to make the interval wider.
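For readers who want to see the arithmetic, here is a minimal sketch of an 80% confidence interval for a mean, using invented 4th-grade gain scores and a simple normal approximation. Real AYP calculations are more involved, but the logic is the same: make an estimate, compute its standard error, and widen the interval according to the confidence level you choose.

```python
from statistics import mean, stdev, NormalDist

scores = [28, 41, 35, 30, 52, 22, 47, 38, 33, 44]  # hypothetical 4th-grade gain scores

def confidence_interval(sample, confidence=0.80):
    """Normal-approximation CI for the mean of a sample.

    Fine for illustration; a t-based interval would be slightly wider
    for a sample this small.
    """
    m = mean(sample)
    se = stdev(sample) / len(sample) ** 0.5          # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # roughly 1.28 for 80% confidence
    return m - z * se, m + z * se

low, high = confidence_interval(scores, confidence=0.80)
print(f"80% CI for the mean gain: {low:.1f} to {high:.1f}")
```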
Merrow argues that, while using this statistical technique to get an estimated range is appropriate for opinion polls, where a sample of 1,000 voters from a much larger pool is used and we are figuring by how much the result may change if we had a different sample of 1,000 voters, the technique is not appropriate for a school, where we are getting a score for all the students. After all, we don’t use a margin of error in the actual election; we just count all the ballots. In other words, there is no “real” score that we are estimating. The school’s score is the real score.
We disagree. An important difference between an election and a school’s mean achievement score is that the achievement score, in the AYP context, implies a causal process: Being in need of improvement implies that the teachers, the leadership, or other conditions at the school need to be improved and that doing so will result in higher student achievement. While ultimately it is the student test scores that need to improve, the actions to be taken under NCLB pertain to the staff and other conditions at the school. Even if the staff is responsible for conditions at the school, we can’t hold them responsible for chance variation at the student level. This is where we see the uncertainty coming in.
First consider the way we calculate AYP. With the current “status model” method, we are actually comparing an old sample (last year’s 4th graders) with a new sample (this year’s 4th graders) drawn from the same neighborhood. Do we want to conclude that the building staff would perform the same with a different sample of students? Consider also that the results may have been different if the 4th graders were assigned to different teachers in the school. Moreover, with student mobility and testing differences that occur depending on the day the test is given, additional variations must be considered. But more generally, if we are predicting that “improvements” in the building staff will change the result, we are trying to characterize these teachers in general, in relation to any set of students. To be fair to those who are expected to make change happen, we want to represent fairly the variation in the result that is outside the administrator’s and teachers’ control, and not penalize them if the difference between what is observed and what is expected can be accounted for by this variation.
The statistical methods for calculating a confidence interval (CI) around such an estimate, while not trivial, are well established. The CI helps us to avoid concluding there is a difference (e.g., between the AYP goal and the school’s achievement) when it is reasonably possible that no difference exists. The same technique applies if a district research director is asked whether a professional development program made a difference. The average score for students of the teachers who took the program may be higher than the average score for students of (otherwise equivalent) teachers who didn’t. But is the difference large enough to be clearly distinct from zero? Did the size of the difference escape the margin of error? Without properly doing this statistical calculation, the district may conclude that the program had some value when the differences were actually just in the noise.
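Here is a sketch of that district-research scenario with invented class averages and a plain normal approximation (a real analysis would account for the nesting of students within classes and would likely use a t-based interval). The question is whether zero falls inside the interval around the difference:

```python
from statistics import mean, stdev, NormalDist

# Hypothetical class-average scores for teachers who took the PD program vs. those who didn't
pd_classes = [652, 648, 661, 655, 659, 644, 650, 657]
comparison_classes = [649, 641, 653, 646, 650, 638, 645, 651]

def difference_ci(a, b, confidence=0.95):
    """Normal-approximation CI for the difference in means between two independent groups."""
    diff = mean(a) - mean(b)
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se

diff, low, high = difference_ci(pd_classes, comparison_classes)
print(f"estimated difference: {diff:.1f} points, 95% CI {low:.1f} to {high:.1f}")
if low <= 0 <= high:
    print("Zero is inside the interval: the difference may just be noise.")
else:
    print("Zero is outside the interval: the difference is distinguishable from zero.")
```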
While the U.S. Department of Education is correct to approve the use of CIs, there is still an issue of using CIs that are far wider than justified. The width of a CI is a matter of choice and depends on the activity. Most social science research uses a 95% CI. This is the threshold for so-called “statistical significance,” and it means that the likelihood is less than 5% that a difference as large as or larger than the one observed would have occurred if the real difference (between the two candidates, between the AYP goal and the school’s achievement, or between classes taught by teachers with or without professional development) were actually zero. In scientific work, there is a concern to avoid declaring there is evidence for a difference when there is actually no difference. Should schools be more or less stringent than the world of science?
Merrow points out that many states have set their CI at a much more stringent 99%. This makes the CI so wide that the observed difference between the AYP goal and the measured scores would have to be very large before we say there is a difference. In fact, we’d expect such a difference to occur by chance alone only 1% of the time. In other words, the measured score would have to be very far below the AYP goal before we’d be willing to conclude that the difference we’re seeing isn’t due to chance. As Merrow points out, this is a good idea if the education agency considers NCLB to be unjust and punitive and wants to avoid schools being declared in need of improvement. But imagine what the “right” CI would be if NCLB gave schools additional assistance when identified as below target. It is still reasonable to include a CI in the calculation, but perhaps 80% would be more appropriate.
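The trade-off described here, wider intervals at 99% and narrower ones at 80%, comes down to the multiplier applied to the standard error. A small sketch with an assumed standard error of 10 scale points makes the comparison explicit:

```python
from statistics import NormalDist

standard_error = 10  # assumed standard error of the school's measured gain, in scale points

for confidence in (0.80, 0.95, 0.99):
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * standard_error
    print(f"{confidence:.0%} CI: plus-or-minus {half_width:.1f} points (z = {z:.2f})")

# Roughly: 80% -> +/-12.8, 95% -> +/-19.6, 99% -> +/-25.8 points.
# The wider the interval, the larger a shortfall must be before a school
# is declared to have missed its target.
```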
The concept of a confidence interval is essential as schools move to data-driven decision making. Statistical calculations are often entirely missing from data-mining tools, and chance differences end up being treated as important. There are statistical methods, such as including pretest scores in the statistical model, for making calculations more precise and narrowing the CI. Growth modeling, for example, allows us to use student-level (as opposed to grade-average) pretest scores to increase precision. School district decisions should be based on good measurement and a reasonable allowance for chance differences. —DN/AJ
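As a rough sketch of why pretest scores narrow the CI: adding a covariate that correlates r with the outcome leaves about (1 - r^2) of the variance unexplained, so the standard error, and with it the interval, shrinks by roughly the square root of (1 - r^2). The numbers below are illustrative assumptions, not results from any study.

```python
unadjusted_half_width = 20.0       # assumed +/- margin with no covariates, in scale points
pretest_outcome_correlation = 0.8  # assumed correlation between pretest and outcome

# A covariate correlating r with the outcome leaves about (1 - r^2) of the outcome
# variance unexplained, so the standard error (and the CI half-width) shrinks by
# roughly sqrt(1 - r^2). This ignores the small cost of estimating the extra parameter.
shrinkage = (1 - pretest_outcome_correlation ** 2) ** 0.5
adjusted_half_width = unadjusted_half_width * shrinkage
print(f"half-width without pretest adjustment: +/-{unadjusted_half_width:.1f} points")
print(f"half-width with pretest adjustment:    +/-{adjusted_half_width:.1f} points")
```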
Saturday, December 15, 2007
What Happens When a Publisher Doesn’t Have Scientific Evidence?
A letter from Citizens for Responsibility and Ethics in Washington (CREW) to the Inspector General of the U.S. Department of Education raises important issues. Although the letter is written in a very careful, thorough, and lawyerly manner, no doubt most readers will notice right away that the subject of the letter is the business practices of Ignite!, the company run by the president’s brother Neil.
CREW documents that Ignite! has sold quite a few units of Curriculum on Wheels (COW) to schools in Texas and elsewhere and that these were purchased with NCLB funds. They also document that there is no accessible scientific evidence that COWs are effective. Given the NCLB requirement that funds be used for programs that have scientifically-based evidence of effectiveness, there appears to be a problem. The question we want to raise is: whose problem is this?
The media report that Mr. Bush has responded to the issues. For example, this explanation appears in eSchool News (Nov. 17, 2007):
* In his interview with eSchool News, Bush said the watchdog group has misinterpreted the federal statute.
* “We’re proud we have a product that has the science of learning built into its design, with tons of anecdotal evidence,” the Ignite! founder said. “But we don’t yet have efficacy studies that meet the What Works Clearinghouse standards–in fact, I challenge you to find any educational curriculum that has met that standard.”
Mr. Bush appears to suggest that NCLB requires only that products incorporate scientific principles. This suggestion is doubtful, outside Reading First, which had its own rules. With respect to actually showing scientifically valid evidence of effectiveness, he concedes that none exists for COWs, but points to the fact that his company’s competitors also lack that kind of evidence.
We came to two conclusions about CREW’s contentions: First, their letter suggests that Ignite! did something wrong in selling its product without scientific evidence. A perspective we want to suggest is that nothing in NCLB calls for vendors to base their products on the “science of learning,” let alone conduct WWC-qualified experimental evidence of effectiveness. Nowhere is it stated that vendors are not allowed to sell ineffective products. Education is not like the market for medical products, in which the producers have to prove effectiveness to get FDA approval to begin marketing. NCLB rules apply to school systems that are using federal funds to purchase programs like COW. The IG investigation has to be directed to the state and local agencies that allow this to happen. We think that the investigators will quickly discover that these agencies have not been given much guidance as to how to interpret the requirements. (Of course with Reading First, the Department took a hands-on approach to approving only particular products whose effectiveness was judged to be scientifically based, but this approach was exceptional.)
Our second conclusion is that the current law is unenforceable because there is insufficient scientific evidence about the effectiveness of the products and services for which agencies want to use their NCLB funds. The law needs to be modified. But the solution is not to water down the provisions (e.g., by allowing anecdotal evidence if that’s all that is available) or remove them altogether as some suggest. The idea behind having evidence that an instructional program works is a good one. The law has to address how the evidence can be produced while supporting local innovation and choice. State and local agencies will need the funds to conduct proper evaluations. Most importantly, the law has to allow agencies to adopt “unproven” programs under the condition that they assist in producing the evidence to support their continued usage.
CREW’s letter misses the mark. But an investigation by the IG may help to ignite a reconsideration of how schools can get the evidence they need. —DN
Monday, October 15, 2007
Congress Grapples with the Meaning of “Scientific Research”
Good news and bad news. As reported recently in Education Week (Viadero, 2007, October 17), pieces of legislation currently being put forward contain competing definitions of scientific research. The good news is that we may finally be getting rid of the obtuse and cumbersome term “Scientifically Based Research.” Instead we find some of the legislation using the ordinary English phrase “scientific research” (without the legalese capitalization). So far, the various proposals for NCLB reauthorization are sticking with the idea that school districts will find scientific evidence useful in selecting effective instructional programs and are mostly just tweaking the definition.
So why is the definition of scientific research important? This gets to the bad news. It is important because the definition—whatever it turns out to be—will determine which programs are, in effect, on an approved list for purchase with NCLB funds.
Let’s take a look at two candidate definitions, just focusing on the more controversial provisions.
* The Education Sciences Reform Act of 2002 says that research meeting its “scientifically based research standards” makes “claims of causal relationships only in random assignment experiments or other designs (to the extent such designs substantially eliminate plausible competing explanations for the obtained results)”.
* However, the current House proposal (the Miller-McKeon Draft) defines “principles of scientific research” as guiding research that (among other things) makes “strong claims of causal relationships only in research designs that eliminate plausible competing explanation for observed results, which may include but shall not be limited to random assignment experiments.”
Both say essentially the same thing, but the new wording takes the primacy off random assignment and puts it on eliminating plausible competing explanations. We see the change as a concession to researchers who find random assignment too difficult to pull off. These researchers are not, however, relieved of the requirement to eliminate competing explanations (for which randomized control remains the most effective method). Meanwhile, another bill, introduced recently by Senators Lugar and Bingaman, takes a radically different approach to a definition.
* This bill defines what it means for a reading program to be “research-proven” and proposes the requirements for the actual studies that would “prove” that the program is effective. Among the minimum criteria described in the proposal are:
* The program must be evaluated in not less than two studies in which:
* The study duration was not less than 12 weeks.
* The sample size of each study is not less than five classes or 125 students per treatment (10 classes or 250 students overall). Multiple smaller studies may be combined to reach this sample size collectively.
* The median difference between program and control group students across all qualifying studies is not less than 20 percent of student-level standard deviation, in favor of the program students.
As soon as legislation tries to be this specific, counterexamples immediately leap to mind. For example, we are currently conducting a study of a reading program that fits the last two points but, because the program is designed as a 10-week intervention, it can never become research-proven under this definition. Another oddity is that the size of the impact and the size of the sample are specified, but not the level of confidence required—it is unlikely we would have any confidence in a finding of a 0.2 effect size with only 10 classrooms in the study. Perhaps the most unacceptable part of this definition is the term “research-proven.” This is far too strong and absolute. It suggests that as soon as two small studies are completed, the program gets a perpetual green light for district purchases under NCLB.
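To see why 10 classrooms provide so little precision, here is a back-of-the-envelope sketch under assumptions we are inventing for illustration: classes are the unit of assignment, the intraclass correlation is 0.20, and each class has 25 students. Under those assumptions, the 95% interval around an estimated effect size is several times wider than the 0.2 criterion in the bill.

```python
# Back-of-the-envelope precision for a cluster-randomized study with 5 classes per group.
# All quantities are in student-level standard-deviation units (effect-size units).
icc = 0.20              # assumed intraclass correlation (share of variance between classes)
students_per_class = 25  # assumed class size
classes_per_group = 5

# Variance of a single class mean, in student-SD^2 units
var_class_mean = icc + (1 - icc) / students_per_class
# Standard error of the difference between two groups of class means
se_effect = (2 * var_class_mean / classes_per_group) ** 0.5
# Critical t for 95% confidence with (2 * 5 - 2) = 8 degrees of freedom (tabled value)
t_crit = 2.306
half_width = t_crit * se_effect
print(f"SE of the effect size: {se_effect:.2f}")
print(f"95% CI half-width:     +/-{half_width:.2f} effect-size units")
# Roughly +/-0.70 under these assumptions, far too wide to distinguish an
# effect of 0.2 from zero.
```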
As odd as this definition may be, we can understand why it was introduced. The most prevalent interpretation of the requirement for “Scientifically Based Research” in NCLB has been that the program under consideration should have been written and developed based on findings derived from scientific research. It was not required that the program itself have any scientific evidence of effectiveness. The Lugar-Bingaman proposal calls for scientific tests of the program itself. In Reading First, programs that had actual evidence of effectiveness were famously left off the approved list, while programs that simply claimed to be designed based on prior scientific research were put on. This proposal will help to level the playing field. To avoid the traps that open up when specific designs are legislated, perhaps the law could call for the convening of a broadly representative panel to hash out the differences between competing sets of criteria rather than enshrine one abbreviated set in federal law.
But even with consensus on the review criteria for acceptable research (and for explaining the trade-offs to the consumers of the research reviews at the state and local level), we are still left with an approved list—a set of programs with sufficient scientific evidence of effectiveness to be purchased. Meanwhile new programs (books, software, professional development, interventions, etc.) are becoming available every day that have not yet been “proven.”
There is a relatively simple fix that would help democratize the process for states and districts that want to try something because it looks promising but has not yet been “proven” in a sufficient number of other districts. Wherever the law says that a program must have scientific research behind it, also allow the state or district to conduct the necessary scientific research as part of the federal funding. So for example, where the Miller-McKeon Draft calls for
“a description of how the activities to be carried out by the eligible partnership will be based on a review of scientifically valid research,”
simply change that to
“a description of how the activities to be carried out by the eligible partnership will be based on a review of, or evaluation using, scientifically valid research.”
Similarly, a call for
“including integrating reliable teaching methods based on scientifically valid research”
can instead be a call for
“including integrating reliable teaching methods based on, or evaluated by, scientifically valid research.”
This opens the way for districts to try things they think should work for them while helping to increase the total amount of research available for evaluating the effectiveness of new promising programs. Most importantly, it turns the static approved list into a process for continuous research and improvement. —DN
So why is the definition of scientific research important? This gets to the bad news. It is important because the definition—whatever it turns out to be—will determine which programs are, in effect, on an approved list for purchase with NCLB funds.
Let’s take a look at two candidate definitions, just focusing on the more controversial provisions.
* The Education Sciences Reform Act of 2002 says that research meeting its “scientifically based research standards” makes “claims of causal relationships only in random assignment experiments or other designs (to the extent such designs substantially eliminate plausible competing explanations for the obtained results) ”
* However, the current House proposal (the Miller-McKeon Draft) defines “principles of scientific research” as guiding research that (among other things) makes “strong claims of causal relationships only in research designs that eliminate plausible competing explanation for observed results, which may include but shall not be limited to random assignment experiments.”
Both say essentially the same thing, but the new wording takes the primacy off random assignment and puts it on eliminating plausible competing explanations. We see the change as a concession to researchers who find random assignment too difficult to pull off. These researchers are not, however, relieved of the requirement to eliminate competing explanations (for which randomized control remains the most effective method). Meanwhile, another bill, introduced recently by Senators Lugar and Bingaman takes a radically different approach to a definition.
* This bill defines what it means for a reading program to be “research–proven” and proposes the requirements for the actual studies that would “prove” that the program is effective. Among the minimum criteria described in the proposal are:
* The program must be evaluated in not less than two studies in which:
* The study duration was not less than 12 weeks.
* The sample size of each study is not less than five classes or 125 students per treatment (10 classes or 250 students overall). Multiple smaller studies may be combined to reach this sample size collectively.
* The median difference between program and control group students across all qualifying studies is not less than 20 percent of student-level standard deviation, in favor of the program students.
As soon as legislation tries to be this specific, counter examples immediately leap to mind. For example, we are currently conducting a study of a reading program that fits the last two points but, because the program is designed as a 10-week intervention, it can never become research-proven under this definition. Another oddity is that the size of the impact and the size of the sample are specified, but not the level of confidence required—it is unlikely we would have any confidence in a finding of a 0.2 effect size with only 10 classrooms in the study. Perhaps the most unacceptable part of this definition is the term “research-proven.” This is far too strong and absolute. It suggests that as soon as two small studies are completed, the program gets a perpetual green light for district purchases under NCLB.
As odd as this definition may be, we can understand why it was introduced. The most prevalent interpretation of the requirement for “Scientifically Based Research” in NCLB has been that the program under consideration should have been written and developed based on findings derived from scientific research. It was not required that the program itself have any scientific evidence of effectiveness. The Lugar-Bingaman proposal calls for scientific tests of the program itself. In Reading First, programs that had actual evidence of effectiveness were famously left off the approved list, while programs that simply claimed to be designed based on prior scientific research were put on. This proposal will help to level the playing field. To avoid the traps that open up when specific designs are legislated, perhaps the law could call for the convening of a broadly representative panel to hash out the differences between competing sets of criteria rather than enshrine one abbreviated set in federal law.
But even with consensus on the review criteria for acceptable research (and for explaining the trade–offs to the consumers of the research reviews at the state and local level), we are still left with an approved list—a set of programs with sufficient scientific evidence of effectiveness to be purchased. Meanwhile new programs (books, software, professional development, interventions, etc.) are becoming available every day that have not yet been “proven.”
There is a relatively simple fix that would help democratize the process for states and districts that want to try something because it looks promising but has not yet been “proven” in a sufficient number of other districts. Wherever the law says that a program must have scientific research behind it, also allow the state or district to conduct the necessary scientific research as part of the federal funding. So for example, where the Miller–McKeon Draft calls for
“a description of how the activities to be carried out by the eligible partnership will be based on a review of scientifically valid research,”
simply change that to
“a description of how the activities to be carried out by the eligible partnership will be based on a review of, or evaluation using, scientifically valid research.”
Similarly, a call for
“including integrating reliable teaching methods based on scientifically valid research”
can instead be a call for
“including integrating reliable teaching methods based on, or evaluated by, scientifically valid research.”
This opens the way for districts to try things they think should work for them while helping to increase the total amount of research available for evaluating the effectiveness of promising new programs. Most importantly, it turns the static approved list into a process for continuous research and improvement. —DN
Saturday, September 15, 2007
Ed Week: “Federal Reading Review Overlooks Popular Texts”
The August 29, 2007 issue of Education Week reports the release of the What Works Clearinghouse’s review of beginning reading programs. Of the nearly 900 studies reviewed, only 51 met the WWC standards, an average of about two studies for each reading program included in the review. (Another 120 reading programs were examined only in the 850 studies deemed methodologically unacceptable.) The article, written by Kathleen Kennedy Manzo, notes that the major textbook offerings, on which districts spend hundreds of millions of dollars, had no acceptable research available. Bob Slavin, an accomplished researcher and founder of the Success for All program (which received a middling rating on the WWC scale), also noted that the programs reviewed were mostly supplementary and smaller intervention programs rather than more comprehensive school-wide programs.
Why is there this apparent bias in what is covered in WWC reviews? Is it in the research base or in the approach that the WWC takes to reviews? It is a bit of both. First, it is easier to find an impact of a program when it is supplemental and being compared to classrooms that do not have that supplement. This is especially true where the intervention is intense and targeted to a subset of the students. In contrast, consider trying to test a basal reading program. What does the control group have? Probably the prior version of the same basal or some other basal. Both programs may be good tools for helping teachers teach students to read, but the difference between the two is very hard to measure. In such an experiment, the “treatment” program would likely show “no discernible effect” (the WWC category for no measurable impact). Unlike a medical experiment where the control group gets a placebo, we can’t find a control group that has no reading program at all. Probably the major reason there is so little rigorous research on textbook programs is that districts usually have no choice: they have to buy one or another. Research on supplementary programs, in contrast, can inform a discretionary decision and so has more value to the decision-maker.
While it may be hard to answer whether one textbook program is more effective than another, a more useful question may be whether one works better for specific populations, such as inexperienced teachers or English learners. That question matters if you are deciding on a text for your particular district, but it is not one that WWC reviews address.
Another characteristic of WWC reviews is that the metric of impact is the same whether it is a small experiment on a highly defined intervention or a very large experiment on a comprehensive intervention. As researchers, we know that it is easier to show a large impact in a small targeted experiment. It is difficult to test something like Success for All that requires school-wide commitment. At Empirical Education, we suggest to educators that the WWC is a good starting point to find out what research has been conducted on interventions of interest. But the WWC reviews are not a substitute for trying out the intervention in your own district. In a local experimental pilot, the control group is your current program. Your research question is whether the intervention is sufficiently more effective than your current program for the teachers or students of interest to make it worth the investment. The sketch below shows what the analysis of such a pilot might look like.
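As an illustration only, here is a minimal sketch in Python of how a district might analyze such a pilot: posttest scores are modeled on pretest and a treatment indicator, with a random intercept for classroom to respect the clustered design. The data are simulated, and the variable names, effect sizes, and sample sizes are ours, not drawn from any actual district study.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a small pilot: 20 classrooms, half piloting the new program.
rng = np.random.default_rng(0)
rows = []
for c in range(20):
    treated = int(c < 10)                  # first 10 classrooms pilot the program
    class_effect = rng.normal(0, 0.3)      # classroom-level variation
    for _ in range(25):                    # 25 students per classroom
        pretest = rng.normal(0, 1)
        posttest = 0.7 * pretest + 0.2 * treated + class_effect + rng.normal(0, 0.6)
        rows.append({"classroom": c, "treatment": treated,
                     "pretest": pretest, "posttest": posttest})
data = pd.DataFrame(rows)

# Mixed model: is the piloted program more effective than the current one,
# adjusting for pretest and allowing for classroom-level clustering?
model = smf.mixedlm("posttest ~ pretest + treatment", data=data, groups=data["classroom"])
print(model.fit().summary())

The coefficient on the treatment term estimates the local impact of the piloted program relative to the current program; whether that estimate is large enough to justify the investment remains a local judgment. —DN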
Friday, June 15, 2007
National Study of Educational Software a Disappointment
The recent report on the effectiveness of reading and mathematics software products provides strong evidence that, on average, teachers who are willing to pilot a software product and try it out in their classroom for most of a year are not likely to see much benefit in terms of student reading or math achievement. What does this tell us about whether schools should continue purchasing instructional software systems such as those tested? Unfortunately, not as much as it could have. The study was conducted under the constraint of having to report to Congress, which appropriates funds for national programs, rather than to the school district decision-makers, who make local decisions based on a constellation of school performance, resource, and implementation issues. Consequently we are left with no evidence either way as to the impact of software when purchased and supported by a district and implemented systematically.
By many methodological standards, the study, which cost more than $10 million, is quite strong. The use of random assignment of teachers to take up the software or to continue with their regular methods, for example, ensures that bias from self-selection did not play a role, as it does in many other technology studies. In our opinion, the main weakness of the study was that it spread the participating teachers out over a large number of districts and schools and tested each product in only one grade. This approach encompasses a broad sample of schools but often leaves an individual teacher as the lone implementer in the school and one of only a few in the district. This potentially reduces the support that would normally be provided by school leadership and district resources, as well as the mutual support of a team of teachers in the building.
We believe that a more appropriate and informative experiment would focus on the implementation in one district or a small number of districts and in a limited number of schools. In this way, we can observe an implementation, measuring characteristics such as how professional development is organized and how teachers are helped (or not helped) to integrate the software with district goals and standards. While this approach allows us to observe only a limited number of settings, it provides a richer picture that can be evaluated as a small set of coherent implementations. The measures of impact can then be associated with a realistic context.
Advocates for school technology have pointed out limitations of the national study. Often the suggestion is that a different approach or focus would have demonstrated the value of educational technology. For example, a joint statement from CoSN, ISTE, and SETDA released April 5, 2007 quotes Dr. Chris Dede, Wirth Professor in Learning Technologies at Harvard University: “In the past five years, emerging interactive media have provided ways to bring new, more powerful pedagogies and content to classrooms. This study misestimates the value of information and communication technologies by focusing exclusively on older approaches that do not take advantage of current technologies and leading edge educational methods.” While Chris is correct that the research did not address cutting edge technologies, it did test software that has been and, in most cases, continues to be successful in the marketplace. It is unlikely that technology advocates would call for taking the older approaches off the market. (Note that Empirical Education is a member of and active participant in CoSN.)
Decision-makers need some basis for evaluating the software that is commercially available. We can’t expect federally funded research to provide sufficiently targeted or timely evidence. This is why we advocate for school districts getting into the routine of piloting products on a small scale before a district-wide implementation. If the pilots are done systematically, they can be turned into small-scale experiments that inform the local decision. Hundreds of such experiments can be conducted quite cost effectively as vendor-district collaborations and will have the advantage of testing exactly the product, professional development, and support for implementation under exactly the conditions that the decision-maker cares about. —DN