[ Note: I have made a number of line edits for clarity since this was first posted. Please refer to this version rather than the original version. Part 2 and Part 3 in this series on assessing can be found by clicking on the links.]

I have found from doing assessment work with educators over the decades that surprisingly few people understand what validity means in assessment and how validity is determined. This confusion leads to various unhappy and important consequences: people bash test questions without understanding how they work, supervisors give grossly inaccurate advice about how to prepare kids for external tests, and teachers end up designing invalid tests without realizing it.

What is validity? Let’s start with a simple definition. Validity is about whether the test measures what it is supposed to measure. Given a goal, I construct a test. A question is “valid” if the question accurately measures what it is supposed to measure, i.e. the goal I have. A test is valid if the questions (and the results; see below) on the test align with all the goals being assessed.

Note, therefore, that – technically speaking – a question is not, itself, valid or invalid. Rather, validity is about inference. What can we and can’t we infer from the results on the question? Does this specific question and the test results on it permit me to draw conclusions about some more general goal(s)? Do the answers to this specific question predict/correlate with performance at a more general goal? That’s what validity is about.

Simple example. If I give a writing prompt and say “write me an essay on whether or not my homework policy is fair” my goal is more general than the prompt. I want to know how well you write essays. My goal has little to do with your understanding of grading systems. That goal and the validity issue become clearer when I give the next prompt: I use a different specific prompt (on vegetarianism), but it’s still supposed to be an essay.

Another useful example: 2 + 5 = ? is a test question; what is it measuring? Without knowing, we cannot yet say for sure whether it is a valid question, as noted above; it depends upon the goal.

Here, then, below, are some possible goals; decide for yourself how valid the question is (2 + 5 = ?) for addressing each of the following proposed goals:

The question can be used to determine whether or not students –

  1. know the answer to 2 + 5.
  2. know the meaning of the + and = symbols.
  3. can add 1-digit numbers that add up to less than 10.
  4. can add 1-digit numbers.

Obviously, it’s valid for #1. But #1 is not typically our goal in asking a test question; our aim is usually more general, as found in goals #2, #3, and #4 above. That is, test questions are meant to be representative samples of a large and varied subject (or “domain,” as measurement people say).

Representative questions: thinking like a test-maker, not a teacher. The generalization-from-a-sample issue is where it gets interesting and sticky. The question 2 + 5 = ? might well be valid for drawing conclusions about goals #2 and #3, but is probably not valid for goal #4. Why? Because 2 + 5 = ? is relatively easy, and thus not representative of all the 1-digit problems. Thus, if our goal IS #4, we need to generalize from the specific test question asked. In general, what must a student know to be deemed good at 1-digit problems? We would thus want to be sure to use the question 5 + 9 = ? Do you see why? We know it to be a harder question – it involves carrying while the other one doesn’t. So, if we want our test to be a valid predictor of “can add 1-digit numbers,” we have to use such a question.
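
To make “representative of the domain” concrete, here is a minimal Python sketch. It rests on two assumptions of mine: that “1-digit numbers” means 0 through 9, and that a problem “involves carrying” when its sum is 10 or more. The name needs_carry is made up for illustration.

```python
from itertools import product

def needs_carry(a: int, b: int) -> bool:
    """Assumption: a 1-digit sum 'involves carrying' when it reaches 10 or more."""
    return a + b >= 10

# Assumption: the domain is every ordered pair of digits 0..9.
domain = list(product(range(10), repeat=2))
carry_items = [(a, b) for a, b in domain if needs_carry(a, b)]

print(f"{len(carry_items)} of {len(domain)} problems in the domain involve carrying")
print("2 + 5 involves carrying:", needs_carry(2, 5))
print("5 + 9 involves carrying:", needs_carry(5, 9))
```

Under those assumptions, a sizable share of the domain involves carrying, and 2 + 5 cannot speak for any of it.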

It is likely, of course, that fewer students would get 5 + 9 correct in 1st grade vs. those who got the 2 + 5 question correct. This is a critical fact, and it changes how we must think about validity as educators.  We must learn to think like the test-maker!

Now suppose we can only choose 1 question in the interest of time. If the goal being measured is #4 – students can add 1-digit numbers – then, measurement folks would much rather ask just 5 + 9 than 2 + 5. (They would rather NOT use just 1 question for “reliability” reasons, to be discussed in the next post). Why would they choose the harder question? Because they know – from past results, as well as conceptually – that it is a more accurate predictor for our goal than the easier question of 2 + 5. Yes, they realize that fewer students may get the answer right if they ask only it. However, unlike the teacher, the psychometrician is interested in getting the measurement right, not in finding questions that all students can get right.

As teachers, we want kids to get everything right! But that is not the point here. We should also want to get the validity right. It’s up to teachers to ensure that all possible types of addition problems – including the most challenging – are covered well in instruction and local assessment. Then students will be ready for the test. (Think: what are the most common errors and misconceptions? We would want those tested; more on this in a later post concerning distractors.)

Immediate implication for teachers: your tests need to be as rigorous if not more rigorous in this way than the external test. You can’t just mimic their format. In fact, it might be wise to NOT mimic the format and use only constructed response questions or have students at least explain why they chose the answer they did. (More next time on this point).

So, it is an invalid inference to say that all students who get 2 + 5 = ? correct “can add 1-digit numbers.” You cannot confidently draw that conclusion from the results, because more students get this one correct than get the more ‘telling’ question right. In other words, the results on the 5 + 9 question provide a more accurate gauge of what % of students can be predicted to meet the goal than the results on the 2 + 5 question.

A fair test as a valid sample. We noted above that a quiz has to worry about “representative” problems from the general “domain” that the goal reflects. A too-easy question, then, is not, by itself, “representative” of all the 1-digit problems in the domain of such problems. Similarly, results on a quirky, esoteric gotcha question that few students get correct likely hide the true level of understanding of the more general topic.

This is easier to see when we compare quizzes. Which quiz below, A or B, is likely to give more valid results as to whether or not students “can add 1-digit numbers accurately”?

Test A

2 + 3 =     2 + 5 =     3 + 3 =     4 + 5 =

Test B

2 + 3 =     2 + 5 =     7 + 8 =     6 + 9 =

Clearly, we would expect more telling results from Test B than Test A. It addresses some of the harder questions in the domain, not just the easier ones as in Test A. Test B anticipates errors of carrying and gets beyond just adding on your fingers by counting – important indicators of the “ability to add ALL 1-digit numbers”.
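
The comparison between the two quizzes can be checked mechanically. This is only a sketch: it assumes that “involves carrying” means a sum of 10 or more, and the helper name carry_share is hypothetical.

```python
def carry_share(items):
    """Fraction of (a, b) addition items whose sum is 10 or more (assumed to mean 'carrying')."""
    return sum(1 for a, b in items if a + b >= 10) / len(items)

# The two quizzes from the text, written as pairs of addends.
test_a = [(2, 3), (2, 5), (3, 3), (4, 5)]
test_b = [(2, 3), (2, 5), (7, 8), (6, 9)]

print("Test A share of carrying items:", carry_share(test_a))
print("Test B share of carrying items:", carry_share(test_b))
```

Test A samples none of the harder part of the domain; half of Test B does.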

The role of statistics in validity: the results need to be analyzed. But now notice a second new idea implied here and noted at the outset: validity can only be fully established based on patterns of current and past results. You cannot just judge the question itself for validity; you have to judge whether the pattern of results in using the question is what we would predict/expect/experience, based on our understanding of the goal, and based on other valid assessment results over time. (This is why tests must be piloted! And in the absence of piloting, it’s why tests are often ‘curved’ in HS and college.)
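
Here is a rough sketch of what looking at the pattern of results might involve. Everything in it is hypothetical: the student responses are invented for illustration (1 = correct, 0 = incorrect), the cut-offs of 0.2 and 0.9 are arbitrary, and the names proportion_correct and flag are mine. Real test-makers use far more data and more refined statistics.

```python
# Hypothetical results for six students on four questions (1 = correct, 0 = incorrect).
responses = {
    "2 + 5": [1, 1, 1, 1, 1, 1],
    "5 + 9": [1, 0, 1, 1, 0, 1],
    "7 + 8": [1, 0, 1, 0, 0, 1],
    "gotcha question": [0, 0, 0, 0, 0, 1],
}

def proportion_correct(scores):
    """Share of students who answered the question correctly."""
    return sum(scores) / len(scores)

def flag(p, low=0.2, high=0.9):
    """Arbitrary illustrative cut-offs: mark questions almost no one or almost everyone gets right."""
    if p < low:
        return "review: possibly too hard or off-target"
    if p > high:
        return "review: possibly too easy"
    return "ok"

for question, scores in responses.items():
    p = proportion_correct(scores)
    print(f"{question}: {p:.2f} correct -> {flag(p)}")
```

A flag is only a prompt to look again at the question against the goal; it does not settle validity by itself.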

Consider our essay prompt again. Suppose my prompt to 4th graders was: write me an essay on the wisdom of the Fed intervening in the economy by buying up bonds to act as a stimulus. Huh? The results would be terrible: no 4th grader knows much about the Fed (except 1-2 geeky budding entrepreneurs). So, the pattern of results – atypically poor – suggests that the prompt yielded invalid results. The problem was most likely the prompt, not the kids and the teaching, in other words.

But the reverse can also be true: I might ask a really lightweight question like “Write an essay on the wisdom of eating dessert first instead of last” and get far more revealing and accurate results as to who can write essays than if I use highly academic prompts based on big ideas or hard texts. Indeed, one reason why writing prompts are often so lame on state and national tests is to make sure that content knowledge is NOT the determining factor in judging the writing. We simply want to know: can you write? If the essay is highly dependent upon ‘insider’ knowledge that many kids cannot be expected to have, then the results will not yield a valid indication of ‘who can write essays’.

Efficiency is desirable, even at the expense of authenticity, in testing. Ah, but that opens up a can of worms that most teachers don’t understand, in my experience. What follows from this notion is that a question can seem trivial or odd but provide valid inferences against the goal (just as there can be questions that seem profound and illuminating but are invalid for use with the goal). In other words, there are often in testing highly revealing questions that may strike the naïve person as “dumb,” “trivial,” or “invalid” questions.

A great example is a test of vocabulary words and analogies for assessing reading and thinking skills. For decades, testers have happily used vocabulary test items as a way to get at reading ability. Huh, how? Because the test-maker knows 1) from the research, that extremely rich vocabularies come from reading rich text, and 2) from many results, that vocabulary tests correlate with the ability to handle text difficulty. Same with analogy questions, so favored in the SATs, LSATs, and GREs: they are efficient and valid proxies to get at analytical and critical thinking. Many studies show that they highly correlate with more direct assessment of those complex skills.

This efficient proxy is a key thing for test-makers. They need the test to be as quick as possible, given the issues of cost, logistics, and the problem of exhausting the student. They will happily glom onto questions that serve as efficient proxies for the real thing. Testing a person’s vocabulary is quick and historically highly predictive of reading ability (which takes much longer to assess directly via reading passages and writing). No need for authentic assessment of reading, then, in terms of psychometrics: vocabulary testing gives the needed results for far less time and money. There may be pedagogical reasons for authenticity in assessment – and I have argued strongly that there are – but the test-maker is not concerned with that need, alas (unless told to be so by the person writing the test specs). They only seek efficient validity, given the cards typically dealt them.

Authenticity isn’t needed for validity. Worse, the reverse is true: many “authentic assessments” lack validity in the sense discussed above – drawing inference to goals from the results. More on this in the next post.

Goals related to facts.  As so much of this discussion suggests, validity is a troublesome issue because the goals are usually broader and deeper than any test question and so judgment is required in fitting the just-right specific question to a more general and hard to measure goal. There is no simple formula for validity. We need to analyze the information carefully, just as scientists do, to determine whether the hypothesis – this question aligns with the goal – fits the data.

Because the goals are more general than the particular question, it is easy for teachers to be misled by the meaning of factual questions on a standardized test.

The Standards (be they state or national) rarely identify specific facts that one has to know. Obviously there are exceptions: key dates, people, and events in history; key terms in math and science, etc. But look closely at your Standards documents (and released tests) and you’ll see that most goal statements are broader than any specific fact, and few questions seek factoids.

Consider this example, widely used in state tests:

In which decade was the Civil War fought in the United States?

  1. the 1770s
  2. the 1790s
  3. the 1860s
  4. the 1890s

Assessing a fact, right? No. There is almost never a Standard that says “know the start and end years of the Civil War.”  In this case, the goal is much more general: can the student place this event (and other “key” events) in a reasonably accurate time-line, to show a proper sense of time and chronology in US history? In fact, the student could remember a few factoids about the Civil War (Grant vs. Lee, Appomattox) but select answer number 1, above. Now what should we conclude about their ‘understanding of the Civil War’? Surely, the timeline question is a more revealing indicator (and why something like it is so often asked on history tests).

That’s why it is very unwise to just look at the content of last year’s questions, when you have access to released items. It’s not the question that matters; it’s what goal the question was testing for that matters! The question next year will be different while the goal stays the same. That’s why I say: pay attention to the standards, not the tests. Think about it: that’s how the test-maker thinks about the standards, too.

What follows? Some – not all – of the hue and cry about “bogus” test questions is based on a complete misunderstanding. (For a great discussion of the value of well-designed multiple-choice questions, go here.) In a later post, I will explain why I think some famous test questions that many people said were bogus are not.

Hint: pineapple in New York State.

Here, then, are 3 practical take-aways from our first look at validity:

  1. On your own copy of any test you use, you should always state the goal for each question. See below for a nice example from the old Florida FCAT. Once you start to worry more self-consciously about validity, you will soon see that sometimes your questions are not the best ones. In external tests, care less about the specific question than about the Standard it is assessing.
  2. Make sure you have worried about “representative” questions that sample from the entire domain of challenges related to that goal. Don’t just ask the easy, obvious, or familiar ones. Nor should you ask gotcha trivial ones (unless you are confident that they are proxies for real understanding of the subject). This is often why local results are lower on state tests than local tests: too many local tests are not rigorous enough in sampling fairly from the entire range of possible questions.
  3. You should look closely at the pattern of results to determine whether the question was a “fair test” of the goal. Part of why college and HS teachers “grade on a curve” is to allow for the fact that the question may have been too hard or easy, as reflected in the results.
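
For those who keep their quizzes electronically, the first take-away might look something like the sketch below. The goals are taken from the list earlier in this post; the quiz items, the structure, and the name uncovered_goals are just one hypothetical way to do it.

```python
# Goals from the list earlier in the post.
goals = [
    "know the meaning of the + and = symbols",
    "can add 1-digit numbers that add up to less than 10",
    "can add 1-digit numbers",
]

# A hypothetical quiz, with each question labeled by the goal it is meant to assess.
quiz = [
    {"question": "2 + 5 = ?", "goal": "can add 1-digit numbers that add up to less than 10"},
    {"question": "5 + 9 = ?", "goal": "can add 1-digit numbers"},
    {"question": "7 + 8 = ?", "goal": "can add 1-digit numbers"},
]

def uncovered_goals(goals, quiz):
    """Return the goals that no question on the quiz has been labeled as assessing."""
    assessed = {item["goal"] for item in quiz}
    return [goal for goal in goals if goal not in assessed]

print("Goals with no question:", uncovered_goals(goals, quiz))
```

Labeling does not make a question valid, but it does make gaps and mismatches easier to spot.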

PS: A Florida FCAT example to emulate, about how to label each question for yourself:

[Image: Cherry Blossom Main Idea FCAT]