I have been so angry about the headlong rush into untested and poorly-thought-out value-added accountability models of schools and teachers in various states all around the country that I haven’t found a calm mental space in which to get words on paper. Let me now try. Forgive me if I sputter.
Here’s the problem in a nutshell. Value-added Models (VAM) of accountability are now the rage. And it is understandable why this is so. They involve predictions about “appropriate” student gains of performance. If results – almost always measured via state standardized test scores – fall within or above the “expected” gains, then you are a “good” school or teacher. If the gains fall below the expected gains, then you are a “bad” school or teacher. Such a system has been in place in Tennessee for over a decade. You may be aware that from that system interesting claims have been made about effective vs. ineffective teachers adding a whole extra year of gain. So, in the last few years, as accountability pressures have been ratcheted up in all states, more and more of such systems have been put in place, most recently in New York State where a truly byzantine formula is being used starting next year to hold principals and teachers accountable.
It will surely fail (and be litigated). Let me try to explain why.
VAM: fair in theory. Let me make a critical distinction. I am in favor of the idea of VAM – and you should be, too. Because in theory looking at gains is the only fair way to deal with the extraordinary diversity in what families, students, teachers and schools bring to the table; thus, a fairer way to judge schools and teachers. In theory, VAM overturns the current invidious comparisons based on SES where if you happen to have a demographic of upper-middle-class families in your school or district then you look really good compared to an inner-city school.
I have seen this sham first-hand over many years: lots of so-called good NJ and NY suburban districts are truly awful when you look firsthand (as I have for 3 decades) at the pedagogy, assignments, and local assessments; but those kids outscore the kids from Trenton and New York City, even though both city systems have a number of outstanding schools and teachers. And as I and others have reported many times, by the College Board’s own data, the wealth of the family is by far the greatest of all predictors of SAT scores – far better than the grade point average. For every $20,000 increase in family income SAT scores rise approximately 10 points.
So, point #1: value-added accountability is needed to sort this out in a more fair and helpful way. We need to know the answer to the question: given what you started with, what progress was made? Fairness and better insight into what works and what doesn’t pedagogically demands the inquiry.
2nd point: the truth hurts, and methinks that some teachers, schools, and districts doth protest too much. Numerous value-added studies reveal some vital hidden truths. It actually IS true that the VA models accurately predict teacher performance vis-à-vis tests over a 3-year period, at the extremes. Thus, the really effective teachers stay so and the really ineffective ones remain so. But the data is murky in the middle, making rankings foolish; and all the impartial studies say: no adequate precision exists for ratings and rankings year by year.
Yet, when you start to do pre- and post- analysis in “good” schools and colleges, the results are sometimes shockingly bad. I know of one outstanding prep school that commissioned ETS to develop a test of critical thinking; they gave it to 9th and 12th graders – and there was zero gain. Nada. Similarly, recent findings were just reported for a number of colleges that take the CLA (Collegiate Learning Assessment). The reaction at many of these schools? Kill the messenger! The results can’t be right since we know we are good! Hmmph. So, again, I am not inherently sympathetic to these kinds of laments. I am in favor of this kind of longitudinal accountability for progress. We know that it works in video games, sports, business, the arts: we judge you on the rate of growth, and you keep striving for your personal best.
VAM in practice: harmful. But the devil is in the details. And the details are damning for what state ed. now plans to do in many states. So, Point #3 has to be squarely and honestly faced – especially by those of us who favor accountability: the ugly truth is that current and proposed uses for the approach are not ready for prime time on psychometric grounds. Worse, policy-makers (and, yes, some enemies of public education) are foisting these flawed approaches on us with seeming disregard for margin of error and the invalidity of shifting the purpose of the test – an old story in education.
It becomes like a sick game of Telephone: what starts out as a reasonable idea, when whispered down the line to people who don’t really get the details – or don’t want to get them – becomes an abomination. The same thing happened when the SATs started to be used as accountability measures instead of predictors of student freshman college grades 30 years ago, thanks to Bill Bennett and others in the Reagan administration with axes to grind. (Did you know, by the way, that the margin of error on an individual SAT score is plus or minus 33 points? Yep: Suzy’s 580 is really the midpoint of a confidence interval running from roughly 550 to 610. As true today as it was in 1982.)
By looking at individual teachers, over only 1 year (instead of the minimum of three years, as the psychometricians and VAM designers stress), we now demand more from the tests than can be obtained with sufficient precision. The margin of error from year to year can now be as large as the gain predicted, in small samples. Thus, by posting yearly scores, by failing to publish and acknowledge the margin of error, by failing to note that “good” teachers in Year One are often “bad” in Year Two (what measurement people call ‘score instability’), we are again living through the same hell as when SAT scores were wrongly used and abused 30 years ago to rate schools.
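For readers who want to see the arithmetic behind “the margin of error can be as large as the gain,” here is a minimal sketch in Python. Every number in it is made up for illustration (the spread of student gains, the spread of true teacher effects, the class size); none comes from any state’s actual formula, and the function names are my own. It simply shows how a one-year class average behaves when classes are small, and how often a “top” teacher in Year One can look below average in Year Two even though nothing about the teaching has changed.

```python
import math
import random
import statistics


def standard_error(student_sd: float, class_size: int) -> float:
    """Standard error of a class's average gain, given the spread of student gains."""
    return student_sd / math.sqrt(class_size)


def margin_of_error(student_sd: float, class_size: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error on one class's average gain."""
    return z * standard_error(student_sd, class_size)


def simulate_flip_rate(n_teachers: int = 1000, class_size: int = 25,
                       teacher_sd: float = 2.0, student_sd: float = 10.0,
                       seed: int = 0) -> float:
    """Share of Year One top-fifth teachers who fall in the bottom half in Year Two.

    Each teacher's true effect is held fixed across both years; only the
    sampling noise from a small class changes. All parameters are hypothetical.
    """
    rng = random.Random(seed)
    true_effects = [rng.gauss(0.0, teacher_sd) for _ in range(n_teachers)]
    noise_sd = standard_error(student_sd, class_size)

    def observed_year() -> list:
        # Observed score = fixed true effect + this year's sampling noise.
        return [effect + rng.gauss(0.0, noise_sd) for effect in true_effects]

    year_one = observed_year()
    year_two = observed_year()

    top_cutoff = sorted(year_one)[int(0.8 * n_teachers)]
    year_two_median = statistics.median(year_two)
    top = [i for i in range(n_teachers) if year_one[i] >= top_cutoff]
    flipped = sum(1 for i in top if year_two[i] < year_two_median)
    return flipped / len(top)


print("One-year margin of error, class of 25:", round(margin_of_error(10.0, 25), 2))
print("Flip rate, class of 25:", simulate_flip_rate(class_size=25))
print("Flip rate, class of 2500:", simulate_flip_rate(class_size=2500))
```

With these invented numbers, the one-year margin of error for a class of 25 is about 3.9 points, nearly twice the assumed 2-point spread in true teacher effects. Averaging over more students, which is roughly what pooling several years does, shrinks that margin with the square root of the sample size.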
The VAM systems out there now (and the further absurdity of publishing yearly results in the paper) also violate two principles at the heart of quality control as formulated by W. Edwards Deming: Drive out fear and end quotas – because both undermine the right motivation needed to do a quality job and take pride in it. Since when did public shaming ever work as a motivator? What the thoughtless policy-makers in Albany and other places where VAM is underway or about to be rolled out have really done is re-invent the Russian wheat quotas of the 1950s. It didn’t work then and it won’t work now.
I am not going to bore you with the arcane psychometrics of the model. In fact, another reason the actual VAM system is so harmful is that it has zero transparency: no person held accountable can understand or challenge their score because the math is so complex. So, I will just quote from numerous recent papers, by reputable researchers, on their doubts about the current VAM systems and claims for their soundness; you can follow up, if interested and strong in measurement:
From “Mathematical Intimidation: Driven by the Data” by the President of Math for America, John Ewing: When value-added models were first conceived, even their most ardent supporters cautioned about their use… Over the past decade, such cautions about VAM slowly evaporated, especially in the popular press… Even people who point out the limitations of VAM appear to be willing to use “student achievement” in the form of value-added scores to make such judgments. People recognize that tests are an imperfect measure of educational success, but when sophisticated mathematics is applied, they believe the imperfections go away by some mathematical magic. But this is not magic. What really happens is that the mathematics is used to disguise the problems and intimidate people into ignoring them—a modern, mathematical version of the Emperor’s New Clothes.
As the popular press promoted value-added models with ever-increasing zeal, there was a parallel, much less visible scholarly conversation about the limitations of value-added models. In 2003 a book with the title Evaluating Value-Added Models for Teacher Accountability laid out some of the problems and concluded: The research base is currently insufficient to support the use of VAM for high-stakes decisions. We have identified numerous possible sources of error in teacher effects and any attempt to use VAM estimates for high-stakes decisions must be informed by an understanding of these potential errors.
From a technical study of VAM models: In general, obtaining sufficiently precise estimates of teacher effects to support ranking is likely to be a challenge. Student test score data tend to be far from ideal, with relatively small classes and substantial numbers of missing values. In addition, models require numerous assumptions that contribute to model uncertainty. Methods to improve precision might include pooling data across years to estimate multi-year average teacher effects.
From a RAND report on “the promise and perils of VAM”:
Variations in Teachers Affect Student Performance, but Size of Effect Is Uncertain. The recent literature on VAM suggests that teacher effects on student learning are large, accounting for a significant portion of the variability in growth, and that they persist for at least three to four years into the future. RAND researchers critically evaluated the methods used in these studies and the validity of the resulting claims. They concluded that teachers do, indeed, have discernible effects on student achievement and that these teacher effects appear to persist across years. The shortcomings of existing studies, however, make it difficult to determine the size of teacher effects. Nonetheless, it appears that the magnitude of some of the effects reported in these studies is overstated….
Sampling error is another potential source of error in VAM estimates. Estimates of teacher effects have larger sampling errors than estimates of school effects because of the smaller numbers of students used in the estimation of individual teacher effects. Thus, some estimates of interest will be too unreliable to use. Even so, for some purposes, such as identifying teachers who are extremely effective or ineffective, the estimates might be sufficiently precise. However, for other purposes, such as ranking teachers, the uncertainty in the estimates is likely to be too large to allow anything to be said with any degree of confidence….The Bottom Line: The current research base is insufficient to support the use of VAM for high-stakes decisions, and applications of VAM must be informed by an understanding of the potential sources of errors in teacher effects.
From a policy paper from the Educational Testing Service: VAM results should not serve as the sole or principal basis for making consequential decisions about teachers. There are many pitfalls to making causal attributions of teacher effectiveness on the basis of the kinds of data available from typical school districts. We still lack sufficient understanding of how seriously the different technical problems threaten the validity of such interpretations.
From an EPI briefing paper: VAM’s instability can result from differences in the characteristics of students assigned to particular teachers in a particular year, from small samples of students (made even less representative in schools serving disadvantaged students by high rates of student mobility), from other influences on student learning both inside and outside school, and from tests that are poorly lined up with the curriculum teachers are expected to cover, or that do not measure the full range of achievement of students in the class. For these and other reasons, the research community has cautioned against the heavy reliance on test scores, even when sophisticated VAM methods are used, for high stakes decisions such as pay, evaluation, or tenure. For instance, the Board on Testing and Assessment of the National Research Council of the National Academy of Sciences stated, …VAM estimates of teacher effectiveness should not be used to make operational decisions because such estimates are far too unstable to be considered fair or reliable.
Back to the early 1900s and the cult of efficiency. All of this is painfully familiar. It all sounds EXACTLY like what Raymond Callahan wrote about the cult of efficiency in his book of that title – in 1962!! – on how schools were held accountable in the early 1900s in response to the great waves of urban migration and the newly-developed Taylor system of accountability in manufacturing. Callahan’s book is a brilliant and detailed account of how policy people put similarly heavy-handed, invalid, and unreliable accountability on public schools, under the guise of modernity, and nearly destroyed them. I implore you to read it, if you care about the gap between the vision and reality of school accountability. And if chills don’t go up your spine when reading it I will be very surprised.
Here are a few quotes from the book:
“In the years between 1911 and 1925 educational administrators responded in a variety of ways to the demands for more efficient operation of the schools. Most of these actions were connected in some way by educators to the magic words ‘scientific management.’
“Almost immediately after the country became acquainted with scientific management principles, pressure began to apply them to the classroom…a school board member from Allegheny PA told the NEA that ‘if teachers did not voluntarily take steps to increase their efficiency the business world would force them to do so.’…most attention became directed to the development of ‘objective’ achievement tests [as a result].
“Although some educators undoubtedly believed that the education of children would be improved through the introduction of various efficiency measures, the primary motivation for their adoption by administrators was self-defense…efficiency measures ‘had to be reported simply and in a language businessmen could understand.’
“Some 8th graders, [Bobbitt] said, did addition ‘at the rate of 35 combinations per minute’ while another ‘will add at an average rate of 105 combinations per minute.’ So, Bobbitt claimed, educators ‘had come to see that it is possible to set up definite standards for the various educational products. The ability to add at a speed of 65 combinations per minute, with an accuracy of 94% is as definite a specification as can be set up for any aspect of the work of the steel plant.’ But in most schools, he said, the teacher ‘if asked whether his 8th grade pupils could add at the rate of 65 per minute with an accuracy of 94%, could not answer the question; nor would he know how to go about finding out. He needs a measuring scale…a teacher who fell short of the standard can now know herself to be a poor teacher.’
“[Supt Taylor of NYC] worked out charts on which teachers would be rated, and in his attention to detail (e.g., teachers were rated on the time spent passing out and collecting papers) would have done justice to any efficiency engineer. And despite his words of ‘teacher as artist’, he stated that if her work was ‘inefficient’ the supervisor had the right to say ‘take my way or find a better one.’
“There were other unfortunate effects of the [school] surveys. In many instances, the [outside] experts tended to be extremely critical and when these criticisms were exploited by the newspapers some poor schoolman was in for trouble.
“It is impossible to determine how many huge schools were built for reasons of economy in American cities, but the number was and continues to be considerable. And the same logic was used in deciding which subjects were of most value: since they could not agree on the value e.g. of Greek or since there apparently was no difference in educational achievement in large and small classes, then the economic factor was the decisive one.
“It seems in retrospect that, regardless of the motivation, the consequences for American education [of applying business and industrial values and practices to education] were tragic…it is clear that the essence of the tragedy was in adopting values and practices indiscriminately and applying them with little or no consideration of educational values or purposes…. It is possible that if educators had sought ‘the finest product at the lowest cost’ the results would have been less unfortunate. But the record shows that the emphasis was not on ‘the finest product’ but ‘the lowest cost.’
“The tragedy itself was fourfold: that educational questions were subordinated to business considerations; that administrators were produced who were not in any true sense educators; that a scientific label was put on some very unscientific and dubious methods and practices; and that an anti-intellectual climate, already prevalent, was strengthened.”
What Callahan reveals over and over again is that we tend to measure what is easy to measure, we tend to run completely roughshod over nuance of margin of error in the measuring, and lurking behind most of the schemes is a mindless commitment to buzzwords rather than attempts to genuinely improve learning and achievement. Plus ça change?
In a later post I will propose the outline of a modest solution to this mess. The solution requires us to learn from athletics: utterly transparent and valid measures, timely and frequent results, the ability to challenge judgments made, many diverse measurements over time, teacher-coach ‘ownership’ of the rules and systems, and tiered leagues (e.g. Division I, II, III) in which we have reasonable expectations and good incentives to make genuine improvement over time. A central feature would be having teachers involved for 6 days a year scoring student work in regional gatherings, much as they do in ‘moderation’ meetings in the UK and in AP and IB scoring.
Why it makes me mad. All this makes me deeply angry. These policies will drive good people out of education and undercut the accountability movement. But this kind of policy-making is more than stupid and Kafkaesque; it is immoral. It is immoral to demand of others what we are unwilling to do to ourselves, whether we cite Kant’s Categorical Imperative or the Golden Rule. And no one, absolutely no one, promulgating this policy is willing to have themselves be held similarly accountable. Who would? Who would willingly hold themselves accountable for measures that involve arcane math, no formative feedback en route, unreliable data, and admit no counter-evidence? Shame on the hypocrites proposing this; shame on the policy wonks who cheerfully overlook the flaws in order to grind their political axes; shame on all of us for not rising up in protest.
PS: the references for the above quotes -
Ewing, John, “Mathematical Intimidation: Driven by the Data,” Notices of the American Mathematical Society, May 2011, Vol. 58 #5. http://www.ams.org/notices/201105/rtx110500667p.pdf
Audrey Amrein-Beardsley, Methodological concerns about the education value-added assessment system, Educational Researcher 37 (2008), 65–75. http://dx.doi.org/10.3102/0013189X08316420
Eva L. Baker, Paul E. Barton, Linda Darling-Hammond, Edward Haertel, Helen F. Ladd, Robert L. Linn, Diane Ravitch, Richard Rothstein, Richard J. Shavelson, and Lorrie A. Shepard, Problems with the Use of Student Test Scores to Evaluate Teachers, Economic Policy Institute (EPI) Briefing Paper #278, August 29, 2010, Washington, DC. http://www.epi.org/publications/entry/bp278
Henry Braun, Using Student Progress to Evaluate Teachers: A Primer on Value-Added Models, Educational Testing Service Policy Perspective, Princeton, NJ, 2005. http://www.ets.org/Media/Research/pdf/
Henry Braun, Naomi Chudowsky, and Judith Koenig, eds., Getting Value Out of Value-Added: Report of a Workshop, Committee on Value-Added Methodology for Instructional Improvement, Program Evaluation, and Accountability; National Research Council, Washington, DC, 2010. http://www.nap.edu/catalog/12820.html
Daniel F. McCaffrey, Daniel Koretz, J. R. Lockwood, and Laura S. Hamilton, Evaluating Value-Added Models for Teacher Accountability, RAND Corporation, Santa Monica, CA, 2003. http://www.rand.org/pubs/monographs/2004/RAND_MG158.pdf
Daniel F. McCaffrey, J. R. Lockwood, Daniel Koretz, Thomas A. Louis, and Laura Hamilton, Models for value-added modeling of teacher effects, Journal of Educational and Behavioral Statistics 29(1), Spring 2004, 67-101. http://www.rand.org/pubs/reprints/2005/RAND_RP1165.pdf
RAND Research Brief, The Promise and Peril of Using Value-Added Modeling to Measure Teacher Effectiveness, Santa Monica, CA, 2004. http://www.rand.org/pubs/research_briefs/RB9050/RAND_RB9050.pdf
The original research on value-added by its founders in Tennessee:
W. Sanders, A. Saxton, and B. Horn, The Tennessee value-added assessment system: A quantitative outcomes-based approach to educational assessment, in Grading Teachers, Grading Schools: Is Student Achievement a Valid Evaluational Measure? (J. Millman, ed.), Corwin Press, Inc., Thousand Oaks, CA, 1997, pp 137–162.
PPS: Just picked up May 2012 issue (V93 no. 8) of Phi Delta Kappan. Good summary of cautions of VAM by Jimmy Scherer of U of Pittsburgh in an article entitled What’s the value of VAM?: “VAM has the potential to improve such an accountability system by isolating teacher effects. However, it fails to eliminate some common concerns associated with high-stakes testing and cannot be used to rank teachers.”
PPPS: Jay Mathews covers this blog entry in his Washington Post entry of May 14 and offers a helpful link back to an earlier post of his on VAM.