The blog gods are very fickle. I had almost put the finishing touches on Validity Part 3 – in which I offered a lengthy discussion of the SAT – and then today the College Board announced an overhaul of the test. And interestingly, the chief rationale is validity:

But beyond the particulars, Mr. Coleman emphasized that the three-hour exam — 3 hours and 50 minutes with the essay — had been redesigned with an eye to reinforce the skills and evidence-based thinking students should be learning in high school, and move away from a need for test-taking tricks and strategies. Sometimes, students will be asked not just to select the right answer, but to justify it by choosing the quote from a text that provides the best supporting evidence for their answer…. Instead of arcane “SAT words” (“depreciatory,” “membranous”), the vocabulary words on the new exam will be ones commonly used in college courses, such as “synthesis” and “empirical.”

So, let’s consider reliability, the partner of validity. It’s very easy to confuse validity and reliability, and many articles on testing conflate the two.

Recall we said that the validity question is: does the test measure what it claims to measure? That is, do results on specific questions that sample the vast domain predict or correlate with results on the larger goal or total domain?

What is reliability? The reliability question is different. Whether or not the test is valid, are the scores stable and consistent, with error minimized? Or was this particular score an outlier or anomaly (should the test be taken repeatedly)? In other words, what is the “true” score? It’s the same question we worry about in national polls: what’s the “true” % for and against, mindful of margin of error?
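The poll analogy can be made concrete. The familiar "margin of error" comes from the standard error of a sample proportion; here is a minimal sketch using the textbook formula (the 1,000-respondent poll is a made-up example, not a reference to any real survey):

```python
import math

def poll_margin_of_error(p, n, z=1.96):
    """95% margin of error for a sample proportion p with n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

# A hypothetical 1,000-person poll split 50/50:
moe = poll_margin_of_error(0.5, 1000)
print(round(100 * moe, 1))  # ~3.1 percentage points
```

Note that quadrupling the sample only halves the margin of error – shrinking measurement error is expensive, which is part of why test-makers care so much about it.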

Speaking of the SAT, a great bar bet is to ask people the margin of error on an individual score on the test. If we leave out the writing, you may be very surprised to learn that the margin of error on the SAT is plus or minus 32 points out of 1200. In other words, statistically speaking, if you take the SAT three times, and you get a 560, a 580, and a 600 for your three scores on the Verbal section, these are basically all the SAME score when you factor in margin of error.
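To see why those three scores count as "the same," here's a quick sketch: two scores are statistically indistinguishable, on this rough view, when their error bands overlap – that is, when they differ by no more than twice the margin of error (the ±32 figure is the one quoted above):

```python
from itertools import combinations

SEM = 32  # the plus-or-minus figure quoted above

def indistinguishable(a, b, sem=SEM):
    """True if the error bands (score - sem, score + sem) around
    two scores overlap."""
    return abs(a - b) <= 2 * sem

scores = [560, 580, 600]
print(all(indistinguishable(a, b) for a, b in combinations(scores, 2)))  # True
```

Even the 560 and the 600 overlap, since they differ by 40 points and the bands span 64.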

As I hinted last time, reliability is a big problem on complex authentic performance. Even people who are reasonably competent at a complex performance – e.g. writing essays – may have scores that vary considerably over time, as any English teacher knows.

This is true even in professional sports where we expect great consistency in performance over time. Here are the National League West baseball standings from last April 30th:

[Image: NL West standings as of April 30, with the Rockies in first place]

Here are the final standings on September 30th:

[Image: NL West final standings, September 30, with the Rockies in last place]

Oops: the Rockies went from first to worst. Their first month of baseball was not a reliable “score.”

Note, therefore, that the “test” of a baseball game is as valid as any test can be: the goal is winning baseball, at the major league level. By definition, that means playing the game of baseball against other major league teams. So validity is near perfect. But there is great unreliability in any single game or even a few weeks of games, as last year’s data reveal. Reliability isn’t really established until well into the season of many games where the “true” score of ability of a team reveals itself.

And the same is true of individual hitters. A batter may go 0-4 in one game. Then, a week later he goes 3-4. Is either “score” reliable in the end? No. The batter will likely end up hitting around .250 for the year (i.e. more like 1 for 4, on average).
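A quick simulation shows how much game-to-game "scores" swing even when true ability is fixed. The sketch below treats each at-bat of a true .250 hitter as an independent coin flip (illustrative numbers only – real at-bats aren't independent coin flips):

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

def game_hits(true_avg=0.250, at_bats=4):
    """Simulate one game for a hitter whose 'true' average is .250."""
    return sum(random.random() < true_avg for _ in range(at_bats))

games = [game_hits() for _ in range(162)]  # a full season of games
season_avg = sum(games) / (162 * 4)
# Individual games swing anywhere from 0-for-4 to 4-for-4, but the
# season-long average settles near the true .250.
```

Any single game is a terribly unreliable measure; the season-long aggregate is a reliable one.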

The fear over reliability among large-scale test-makers. Well, this is a potentially HUGE problem for test-makers. If Johnny’s “true” score – like that of the hitter – can vary wildly from day to day and test to test, then we cannot have much confidence at all in the results of a single test, can we?

The ugly fact is that the tests are MADE reliable by using redundant and simple questions – and fancy psychometrics – that beg the deeper question: how do we know that academic achievement is a stable thing, especially in a novice? Maybe it’s more like baseball in which, even at high levels, a single performance result is sufficiently unreliable that it is unwise to use it to make a big judgment. Fine, maybe a question like 2 + 5 = ? yields reliable answers, but it doesn’t follow from such simple and unambiguous questions and their answers that a student’s genuine level of (complex performance of) mastery of arithmetic is stable from day to day. Especially if we were to start using multistep open-ended problems that demand transfer.

Indeed, this idea about margin of error, and thus humility about the meaning of results, is enshrined in the AERA/APA/NCME Standards of Measurement, and has been for decades: never make a huge decision on the basis of a single test score because of both reliability and validity issues. It violates the ethics of measurement to do so. (Alas, the Standards are being revised; only the older editions are available, and no longer for free. I searched but found no free copy.)

That’s what makes the current one-shot high-stakes test situation in education so untenable. Judgments about students and teachers are being made on the basis of a single result. It’s wrong, and people who know better in the measurement community know it’s wrong, and ought to be up in arms about it.

The irony is that critics of testing typically claim that the tests are not valid. But that’s where they probably go wrong. The better argument concerns the questionable reliability of a single high-stakes score.

It’s no accident that the World Series is best four out of seven. Can’t kids and teachers get a similar shake?

But before we go with pitchforks to state education buildings, ACT and ETS, the same argument applies to YOUR tests and quizzes. What is the margin of error on a 20-question quiz in arithmetic? Most likely the answer is around plus or minus 3 points. So a 14, a 17, and a 20 are essentially the same score – just as on the SAT.
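That ±3 figure can be sanity-checked with the classic standard error of measurement formula, SEM = SD × √(1 − reliability). The standard deviation and reliability below are hypothetical, but plausible for a short classroom quiz:

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classic SEM: the typical spread of observed scores around
    a student's 'true' score."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical values for a 20-item arithmetic quiz:
sd = 4.0           # standard deviation of class scores, in points
reliability = 0.5  # internal consistency typical of a short quiz
print(round(standard_error_of_measurement(sd, reliability), 1))  # ~2.8
```

With those assumptions, a single quiz score comes with roughly a 3-point band on either side.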

In short, before the pot calls the kettle black, let’s look at local assessments carefully for validity AND reliability.

What you can do as a teacher. Reliability is about confidence in scores/grades, where error is minimized. And error is best minimized by using multiple measures, at different times. Although I find the Saxon math books pretty dull and dreary, their quizzes and tests are built upon this idea as well as the research on spaced vs. massed practice. In other words, you don’t just test the content once or twice right after having taught it. That is probably going to lead to very unreliable results in your grade book.

In addition, it’s best to use multiple measures, for both reliability and validity. Vary the format: use multiple-choice, short-answer, oral questioning, and projects, with redundancy on what is assessed. Include student self-assessments. And for every complex performance, use a complementary quiz on the same content.
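Why do multiple measures help so much? The Spearman-Brown prophecy formula predicts how reliability grows as parallel measures are combined; here is a minimal sketch (the 0.5 starting reliability is an assumed, illustrative value):

```python
def spearman_brown(r, k):
    """Predicted reliability when k parallel measures of
    reliability r are combined into one score."""
    return k * r / (1 + (k - 1) * r)

# One measure with assumed reliability 0.5 vs. four of them combined:
print(spearman_brown(0.5, 1))  # 0.5
print(spearman_brown(0.5, 4))  # 0.8
```

Four modest quizzes, averaged, can be a far more reliable measure than any one of them alone.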

Curve results on tests, as most college professors do (and should), when the results are out of whack with patterns that have been established.

Avoid scoring systems that only seem precise. Giving a student a 72 on a history paper is poor measurement; better to give a 3 out of 4 or something similar. If you know statistics, report the score in box-and-whisker or confidence-interval form.
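If you do want to report an interval, a sketch of the idea: report the score plus-or-minus one SEM rather than a single falsely precise number (the 5-point SEM here is an assumed value, for illustration only):

```python
def score_band(score, sem, z=1):
    """A score reported as a band (score ± z·SEM) rather than a
    single point estimate."""
    return (score - z * sem, score + z * sem)

# A '72' on an essay, with an assumed 5-point SEM:
print(score_band(72, 5))  # (67, 77)
```

Reporting "somewhere between 67 and 77" is more honest than reporting "72."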

PS: The article in the NY Times on the history of the proposed changes in the SAT is MUST reading!