In my previous post on standards, I pointed out that though we talk of “a” standard, it would be more accurate to talk about the three different kinds of standards involved in any standard: content, process, and performance.

The key point about performance standards is that rigor must be established in two different ways: by the challenge of the task and by the anchoring of the scoring using valid models that “set the bar” high. I have already discussed the need to worry about the rigor of the task; let’s look at rigor in scoring, i.e., how rigorous the samples we use to anchor the system should be. How high should we set the bar?

The Goldilocks problem

Which samples of performance should we choose, in other words? With what samples or scores will we anchor the system? When we are on the receiving end of standards-based performance assessment – as in Advanced Placement or state-wide writing – there is no difficulty; we have no choice. We merely note what the test makers have chosen as samples of student work that reflect quality along the scoring continuum. We study the published samples of work and commentary; then, we know to anchor our local assessment in those same examples if we want to align internal with external standards. (By the way: how often do your ELA and English teachers anchor local grades against those state-wide samples? We’ll return to this shortly.)

But what if we are on the anchor-selection or anchor-creation side of things, especially in subjects or grade levels where there is no precedent or model like state-wide writing or AP to refer to? How can we be sure, when we select or create anchors, that the demands on students are appropriate? (What does “appropriate” mean here, anyway?)

This is what I call the Goldilocks problem – we need to anchor a standards-based system in samples of work that are not too hard, not too easy, just right.

An “appropriate” anchoring cannot just be what we think is OK. It surely must relate to both what the standards writers had in mind and (therefore) what colleges and employers think is “up to standard” in terms of student work.

In other words, a key question for all faculty to ask is: how can we be sure that what we think of as quality work is really seen as quality work in the outside world? If we don’t know, how can we find out? And can one excellent work sample do the job of anchoring our scoring system? (Hint: my answer is going to be “no.”) Any truly standards-based system must depend upon what is deemed quality work beyond our walls – especially at the colleges and workplaces where we want our students to end up.

A model case: track and field

It is easy to see how challenging this problem is, even under the best circumstances, by shifting the question to an easy-to-measure area of performance: track and field.

My daughter is a senior in high school and does “really well” in the 1600-meter run. How do I know she does “really well”? Because she is the number one runner on her high school team. She has only been beaten once in three years. And she gets all A’s in track.

Uh, but Grant: maybe the team as a whole is ‘not great’ and her performance is ‘really’ just average in the grand scheme of things.

Good point! (Note that we easily and non-judgmentally yield to data-driven reality-therapy in sports). OK, I can go further: she won her league championship in the mile last year! So, she was ranked #1 in the league, as the national track and field site shows:

So? Maybe the league is not very good as a whole; maybe it is like a bunch of really bad urban schools in the same bad district.

Ouch! That’s a tough but reasonable comparison. Yes, it’s true: the Friends League is composed of small, similar schools. Similarly, there are students in Trenton and Newark currently getting A’s in their local ‘league’ who will soon find out – when they enter the ‘real’ world – that they are not doing “excellent” work.

After all, being ranked #1 in the Friends League is just a norm-referenced rank. Norms are not standards. She just happens to be better than 50+ other runners. But how good is she, really? (And what does that ‘really’ really mean?)

Fortunately, in track we have a precise and uncontroversial measure of performance for making this clearer – the essence of an objective work sample: her times in the 1600. The recorded times take this argument beyond subjectivity, anecdotes, and crude norms. So, we note that Cilla ran a 5:12 in the championship and that it was her best time of the season. Now we want to know: is 5:12 a “good” time? That is, does it meet, exceed, or fall short of “the standard”? This turns out to be a more difficult question than we might imagine.

Good, compared to…

So, we head back to the national track website and look at all the results for last year. What do we find?

Let’s first compare her performance across a larger population of leagues to check out the skeptic’s argument. We can look statewide and have a rich and clarifying (though still norm-referenced) assessment. In the state of Pennsylvania we now see she ranks fairly high:

53rd statewide is intuitively “pretty good” for a performance in an entire state – but a far cry from #1. And it is still only a norm-referenced judgment.

[Aside: Here’s a shocking idea: publicly ranking everyone in the state. Unthinkable on tests, yet we do it without flinching in track. Why are we so loath to do it in academics?]

So if the #53 runner runs the 1600 in 5:12, what does the #1 runner run it in? The #1 runner in PA can run 4:53. Wow – under 5 minutes. Even laypersons can sense that this is pretty fast. So, we might say that Cilla’s time now doesn’t look quite as excellent as it once did; her performance is “good” but it may or may not be “up to standard.”

Out of curiosity, let’s continue the norm-referenced analysis. Let’s look at her times in a national comparison to find out her national rank:

A very far cry from #1!! But, again, is that ranking “good” or “bad”? Or is rank beside the point? The #1 runner in the country in high school can run 4:40:

In addition, we note that the times of all the runners are very close together – only fractions of a second separate dozens of runners and only a few seconds separate hundreds of runners. (A good reason never to trust rankings by themselves, by the way, such as in the US News & World Report school and college rankings.)

We now need to face the challenge squarely. We know the norms; what should be the standards? What is a valid cut score for determining who is “really” fast? A valid cut score would be one about which we can confidently say: if your work is above it, your work meets standards, and if your work is below it, your work does not.

This is what we should all do when deciding whether to pass or fail students. Right now, however, our cut score in local grading is totally arbitrary, an artifact of subtraction: a 59 fails and a 60 passes, yet no attempt is made to ensure that 60 is “acceptable” work and 59 is “sub-standard” work. We just count errors up, subtract from 100, figure the score, turn it into a letter grade, and call it a day. (That won’t cut it moving forward, as I shall argue in the third and final installment.)

We get a critical piece of information at the top and bottom of each page: the site notes whether a runner is in the “first tier” or the “second tier” of “elite” runners.

So, Cilla is in the 2nd tier of “elite” runners – a standards-based evaluation made by expert judges who translate “norms” into “standards” based on – what? Presumably, experience in track and field (but no details are provided about the source of the judgment). First tier accords with our hunch about who is “really fast”: 5:00 is the cut score for determining 1st tier elite status.
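To make the difference between a norm and a standard concrete, here is a minimal sketch in Python. The 5:12, 4:53, and 4:40 times and the 5:00 first-tier cut come from the discussion above; the function names, the small league field, and the choice to treat exactly 5:00 as missing the cut are my own assumptions for illustration, not anything the track site publishes.

```python
def to_seconds(mark: str) -> int:
    """Convert a 'm:ss' race time such as '5:12' into whole seconds."""
    minutes, seconds = mark.split(":")
    return int(minutes) * 60 + int(seconds)


# Cut score quoted in the text for first-tier "elite" status.
FIRST_TIER_CUT = to_seconds("5:00")


def rank_in_field(time_s: int, field: list[int]) -> int:
    """Norm-referenced judgment: 1 + the number of strictly faster runners."""
    return 1 + sum(1 for other in field if other < time_s)


def meets_first_tier(time_s: int, cut_s: int = FIRST_TIER_CUT) -> bool:
    """Standards-referenced judgment: is the time under the cut score?

    Assumption: a time of exactly 5:00 does not make the cut.
    """
    return time_s < cut_s


cilla = to_seconds("5:12")

# Hypothetical league field (invented times in the 5:20s, plus Cilla).
league = [cilla, to_seconds("5:21"), to_seconds("5:25"), to_seconds("5:28")]

print(rank_in_field(cilla, league))          # rank within the local field
print(meets_first_tier(cilla))               # 5:12 against the 5:00 cut
print(meets_first_tier(to_seconds("4:53")))  # the #1 time in PA
```

The same 5:12 is first in the invented league yet misses the first-tier cut: the rank changes with whoever happens to be in the field, while the cut score does not.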

But why those cut scores? What does it mean to say she is in the 2nd tier of elite runners? Are we then saying that all the other thousands of runners around the country are “not running to standard”? That seems pretty harsh. All of them won many meets! Couldn’t we say that if you run under 5:30 you are running the 1600 to a good “standard”? (Priscilla beat the top runners on other teams, many of whom came in with times in the 5:20s). Where should we draw these and other lines and why? At the very least, we must not confuse “standards” with “minimum competency” – a perpetual problem in education. Nor should we confuse standards with norms, because perhaps the norm is below standard; maybe 5:20 is just too slow, even if you won most of your meets.

Do you see the implications? To ask “is performance up to standard?” requires us to answer: well, it depends… It depends upon what you mean by “standard” and it depends upon context. And in this case context refers to goals. Clearly, Cilla is “competent.” But it also doesn’t matter that Cilla is the “top” runner in her league and gets straight A’s if she has higher aspirations related to college performance.

Linking our standards to wider-world standards

So, let’s look at college results. (Minor problem: most colleges run the 1500 meters, so we have to translate her times. I won’t bore you with the complexities of the calculation, but for argument’s sake we can say that if you run a 5:12 in the 1600 you can run the 1500m in around 5:01.) Cilla is applying to two colleges in the Division I Patriot League. Here are the league results from spring 2010:

If her aim is to run Division I track, then she looks pretty strong: her time (which translates to 5:01) would have put her 31st in the league championship meet. But if she wants to go to Bucknell or Lehigh? Lots of people are ahead of her; she is not likely to get a scholarship or make the varsity, at least in the near term.

If she applies to a college in the Southeastern Conference, her chances of running the 1500 are even slimmer:

She would rank #72 overall and have no chance at Tennessee or Miss. State.

If her aim is to run Division III, however, she is a highly desirable candidate: her current time would have given her 25th place in the Division III national championships last spring.

So, what should we conclude about anchors and rigor? Even when we have an utterly transparent, valid, objective, and precise “work sample” (as in track) we still don’t know if she “really” meets standards until we know her aspirations, compare her times to results in the wider world (especially where she wants to head), and have someone make an expert judgment about what her times mean. There are many “standards” based on many destinations and aspirations.

The “just right” single standard and anchor is an illusion.

It is, of course, far harder to figure all this out in school, where the assessments are often opaque, rarely valid, often subjective, and fairly imprecise, and where we as teachers have very little knowledge of precisely what performance level colleges and employers are looking for. Thus, unless they take AP or IB exams, students rarely know where they really stand, and discrepancies between local grades and test scores will likely – and in fact routinely do – yield rude shocks. Students typically only know where they stand with reference to local norms, leading to misunderstanding – just as when we looked at Cilla’s local results and straight A’s and thought she was a great runner.

In my third and final post later this week I will suggest some vital practical implications. The key move is to develop a completely different approach to scoring and reporting student work. It will require that we use multiple anchors to better communicate where students stand. In other words, until and unless we fix our grading system we cannot claim that we are standards-based.