How high should we set the bar? How good is good enough – especially in a world where 40% of HS graduates need remediation in college? That leads directly to a practical question of great (and current) significance that is overlooked at present: What is a valid and helpful local grading system in terms of its links to national and state standards?

We are now knee-deep in this question, thanks to the recent release of the Common-Core-calibrated New York State test results (and some of the hysterical reactions to them).

I want to remind everyone that this is hardly a new challenge. Setting levels or cut scores has been a problem ever since K-12 schooling first had to mesh with college entrance standards almost 90 years ago. The SAT, ACT, AP and IB programs were created to help with this issue. Benjamin Bloom saw the need to consider it squarely and always addressed it when discussing his Mastery Learning system over 45 years ago. Every teacher either faces the issue or finesses it in one way or another: how should I assess and give feedback to learners to fully prepare them for a world and its standards beyond my classroom?

Here is my standard for addressing this problem. Regardless of what solution we come up with, it must pass the following test: No surprises and complete transparency as to where the student stands must be our motto and criterion as assessors locally. We owe each student the facts as to where he or she might place in terms of wider-world standards. Ideally, students know where they stand BEFORE they take an external test. In such a system, the test confirms what they and we already know – as now often happens pretty well in sports and performance arts.

All you have to do is think of a young and locally-schooled teacher giving a B+ to one of her “better” students in the worst school in the state to see that this ends up unfair to kids, once the state test is taken and the disappointing scores are returned.

A model case: track and field

We can see the challenge and the possible solution more clearly by shifting to what is arguably a clear and worthy model for study: track and field.

My daughter ran track in high school and did “really well” in the 1600 meter run. How do I know she did “really well”? Because she was the number one runner on her high school team. She was only beaten once in three years. And she received all A’s in track.

Uh, but Grant: maybe the team as a whole is ‘not great’ and her performance is ‘really’ not so great in the grand scheme of things – like the B student in the bad school.


OK, then, I can go further: she won her league championship in the mile! So, she was ranked #1 in her league, as the national school and college track and field website shows:

So? Maybe the league is not very good as a whole; maybe she is a great runner in a bush league. Maybe she is like that B+ student in a city of poor schools. After all, being #1 in the League is still a norm-referenced rank. Norms are not standards. She just happens to be better than a small sample of other runners in small schools. We still don’t have a big enough sample or, better yet, a criterion-referenced way to validly evaluate her performance level.

Ouch! Indeed, her league was composed of small and similar schools. And, yes, not far away in Camden NJ, there are students currently getting B’s in their local ‘league’ of district schools who just found out on NJ tests that they are not doing “excellent” work. So, OK, I accept the challenge: let’s consider how good she really is.

Fortunately, in track we have a precise and uncontroversial criterion of performance: her times. The times she ran take this argument beyond simplistic norms, subjectivity, and parental anecdotes. So, we note that Cilla ran a 5:12 in the league championship and that it was her best time of the season. So, now we want to know: is 5:12 a “really good” time? A “great” time? Or a “so-so” time? That is, does it meet, exceed, or fall short of “a standard”? Is she deserving of a college scholarship or not? Could she get into a top-tier program in running or not?

This is precisely the issue now before us in New York. The levels of performance have been made stiffer. The state has proposed that it should now take a much better score for someone to be considered “good” in learning. And, in theory, the change is meant to do a better job of linking to wider-world standards (since the remediation rate of students with good grades in the state was way too high).

So, how does this level setting work in running?

Good, compared to…

When we head back to the track website and look at all the results for the year, what do we find?

Let’s first compare her performance across a larger population of leagues to take our first cut at the skeptic’s argument. We can look statewide and find a rich and clarifying (though still norm-referenced) assessment. In the state of Pennsylvania we now see she ranks fairly high:

53rd statewide is intuitively “pretty good” among thousands of runners in a populous state – but a far cry from #1 in her league. And it is still only a norm-referenced ranking. But this is like a state test: seeing how you did statewide. I presume that being 53rd of all test takers in all the state’s schools would earn her a highly successful score.

So if the #53 runner runs the 1600 in 5:12, what does the #1 runner run it in? The #1 runner in PA ran 4:53. Wow – under 5 minutes. Even laypersons can sense that this is pretty fast. So, we might say that Cilla’s time now doesn’t look quite as excellent as it once did. Her performance is surely “pretty good” but it may or may not be “up to the highest standard.”

Let’s look at my daughter’s times and rank in the nation as a whole:

765: A very far cry from #1!! But, again, this is just a ranking. We want to establish criteria for evaluating how “good” or “bad” such a performance is.

The #1 HS female runner in the country ran 4:40 in the event:

As for the dangers of mere ranks, note that the times of the runners are very close together – only fractions of a second separate dozens of runners and only a few seconds separate hundreds of runners. This is a good example of why you should NEVER trust mere rankings by themselves, such as in the US News & World Report school and college rankings, and why you should never grade students on an informal or formal curve locally.
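To see how much a rank can exaggerate a difference, the times quoted in this post can be compared directly. What follows is only a minimal sketch: the function names are hypothetical, and the only data used are the three 1600 meter times mentioned above.

```python
def to_seconds(mark):
    """Convert a 'minutes:seconds' race time such as '5:12' to seconds."""
    minutes, seconds = mark.split(":")
    return int(minutes) * 60 + float(seconds)


def percent_slower(mark, reference):
    """How much slower `mark` is than `reference`, as a percentage."""
    gap = to_seconds(mark) - to_seconds(reference)
    return gap / to_seconds(reference) * 100


# Times quoted in this post for the 1600 meters.
cilla, pa_best, us_best = "5:12", "4:53", "4:40"

print(round(percent_slower(cilla, pa_best), 1))  # 53rd vs. 1st in the state
print(round(percent_slower(cilla, us_best), 1))  # 765th vs. 1st in the nation
```

On these numbers, a drop from 1st to 53rd in the state reflects a time about 6.5% slower, and a drop from 1st to 765th in the nation reflects a time about 11.4% slower. The rank changes by hundreds of places while the performance changes by a few percent.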

We now need to face our cut-score challenge squarely. We know the norms; what should be the standards? What is a valid cut score for determining who is “really” fast? Cut scores – such as the 4 levels in New York – should be set in ways that make us confident in saying: regardless of your rank or your raw score, each level corresponds to a valid level of performance. If the test says you are “proficient” then you should genuinely be proficient in the wider world – that’s what the Common Core is all about, making and keeping that promise in its assessments. UPDATE: here is a first-person account of a teacher helping to set the level in NYS.

Ensuring that standards are reasonable and transparent

This is what we should all be doing as teachers, too, when deciding all student grades. Right now, however, the cut score for passing locally is totally arbitrary, an artifact of arithmetic, not assessment validity: a 59 fails and a 60 passes just ’cause of the math. No attempt is made to ensure that 60 is in fact “acceptable” work and thus a valid level, and that 59 really does, therefore, reflect “sub-standard” work. Rather, what we do now in our grading is just count errors on a test that may or may not be valid, subtract from 100, calculate the score, turn it into a letter grade, and call it a day. This is arguably a far bigger problem, and one of very long standing, than New York changing its cut scores (levels) to align with college readiness.
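The arithmetic just described can be sketched in a few lines. This is only an illustration under stated assumptions: the function name is hypothetical, each error is assumed to cost one point, and the 90/80/70/60 letter scale is a common convention rather than any particular school's policy.

```python
def arithmetic_grade(errors, points_per_error=1):
    """Grade by subtraction: count errors, subtract from 100, map to a letter."""
    score = max(0, 100 - errors * points_per_error)
    # Assumed conventional letter scale; these cutoffs are not from the post.
    for cutoff, letter in [(90, "A"), (80, "B"), (70, "C"), (60, "D")]:
        if score >= cutoff:
            return score, letter
    return score, "F"


# One more error flips the result from passing to failing.
print(arithmetic_grade(40))  # (60, 'D')
print(arithmetic_grade(41))  # (59, 'F')
```

Nothing in this procedure checks whether the test is valid or whether a 60 represents acceptable work. The pass/fail line falls wherever the subtraction happens to put it.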

In track, there is national agreement on the cut scores for excellence. On the web site we get a critical piece of information at the top and bottom of each page: the site notes whether a runner is in the “first tier” or the “second tier” of “elite” runners.

So, Cilla is in the 2nd team of “elite” runners – a standards-based evaluation made by expert judges who translate “norms” into “standards” somehow. Presumably, it is based on their experience in all kinds of track and field programs (but no details are provided about the source of the judgment on the site). And our hunch is confirmed: US First Team Elite “really fast” runners can run the mile in under 5 minutes.

Counting both first and second team elite runners, by the way, about 25% of all HS runners are elite – which is pretty close to the share of New York students now scoring at levels 3 and 4 (31%). So, in theory – leaving aside issues of the implementation of new tests, funding, teacher training, leadership in schools, etc. – what the new tests are doing in terms of leveling is reasonable (though we need to know more about the validity studies being used in NY and in the two Consortia).

What does it mean to say she is in the 2nd team of elite runners? Clearly, Cilla is “proficient.” But is that “good enough”? Well, it depends – upon her aspirations.

Linking our standards to appropriate wider-world standards

Let’s look at college results. Cilla wished to apply to 2 colleges in the Division I Patriot League. Minor problem: most colleges run the 1500 meters, so we have to translate her times. If you run a 5:12 in the 1600 you can run the 1500m in around 5:01. (There are calculators on the site for doing the translations.) Here are the league results from that year for the 1500:

If her aim was to run Division One track, in the Patriot League or its equivalent, then she looked pretty strong: her time (which translates to 5:01, recall) would have put her 31st in the league championship. But if she wanted to go to Bucknell or Lehigh specifically? Lots of people were ahead of her; she was not likely to get a scholarship or be on the Varsity.

If she wished to apply to a college in the Southeastern Conference, her chances of running the 1500 were even slimmer:

She would rank #72 overall and have had no chance at Tennessee or Miss. State.

But if her aim was to run Division III, she would have been a highly desirable candidate: her time would have given her 25th place in the Division III national championships. (Indeed, numerous D3 coaches wanted her to come to their schools.)

So, what should we conclude?

Three things: One, what the new tests are doing is reasonable, even if there is ugliness in the moment. Two, there is no such thing as ONE standard, even in a national test. There are as many standards as there are different kinds of destinations and aspirations (e.g. Division One, Two, and Three). And, three, local grading systems are a scandal: they provide no frame of reference that is consistent across teachers and schools, is intellectually defensible, and is transparent. If the mantra is No surprises and complete transparency as to where you stand, local grading is (and has long been) a far bigger problem than any new test or cut score, in my view.

I will follow this post next time with a look at a sensible approach to local grading that would do a far better job of linking local grades to wider-world standards – without getting rid of current letter grades (since that battle is just not worth fighting in many places – and need not be fought, as I will show).

[If this sounds vaguely familiar, it is a revised and updated version of a post from 2 years ago.]