To continue our look at rubrics and models, I offer below a revised version of a dialogue I wrote 20 years ago to attempt to clarify what a rubric is and what differentiates good from not-so-good rubrics. You can read more in MOD J in Advanced Topics in Unit Design. [I further revised the last third of the dialogue to clear up some fuzziness pointed out to me by a reader.]
Just what is a rubric? And why do we call it that?
A rubric is a set of written guidelines for distinguishing between performances or products of different quality. (We would use a checklist if we were looking only for the presence or absence of something, e.g. yes, there is a bibliography.) A rubric is composed of descriptors for criteria at each level of performance, typically on a four- or six-point scale. Sometimes bulleted indicators are used under each general descriptor to provide concrete examples or tell-tale signs about what to look for under each descriptor. A good rubric makes possible valid and reliable criterion-referenced judgment about performance.
The word “rubric” derives from the Latin word for “red.” In olden times, a rubric was the set of instructions or gloss on a law or liturgical service — and typically written in red. Thus, a rubric instructs people — in this case on how to proceed in judging a performance “lawfully.”
You said that rubrics are built out of criteria. But some rubrics use words like “traits” or “dimensions.” Is a trait the same as a criterion?
Strictly speaking they are different. Consider writing: “coherence” is a trait; “coherent” is the criterion for that trait. Here’s another pair: we look through the lens of “organization” to determine if the paper is “organized and logically developed.” Do you see the difference? A trait is a place to look; the criterion is what we look for, what we need to see to judge the work successful (or not) at that trait.
Why should I worry about different traits of performance or criteria for them? Why not just use a simple holistic rubric and be done with it?
Because fairness and feedback may be compromised in the name of efficiency. In complex performance the criteria are often independent of one another: the taste of the meal has little connection to its appearance, and the appearance has little relationship to its nutritional value. What this means in practice is that you could easily imagine giving a high score for taste and a low score for appearance in one meal and vice versa in another. Yet in a holistic scheme you would have to give the two (different) performances the same score, and it isn’t helpful to say that both meals are of the same general quality.
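The meal example can be put in numbers. The sketch below is illustrative only: the 1-to-4 scale, the trait scores, and the idea of modeling a holistic score as a simple average are all assumptions made for the example, not a claim about how holistic judges actually arrive at a score.

```python
# Hypothetical trait scores on a 1-4 scale; the numbers are made up for illustration.
meal_a = {"taste": 4, "appearance": 2, "nutrition": 3}
meal_b = {"taste": 2, "appearance": 4, "nutrition": 3}


def holistic_score(traits):
    """Collapse a trait profile into one number (assumed here: a simple mean)."""
    return sum(traits.values()) / len(traits)


def differing_traits(first, second):
    """Return, in alphabetical order, the traits on which two profiles disagree."""
    return sorted(t for t in first if first[t] != second[t])


# Both meals collapse to the same single score...
print(holistic_score(meal_a), holistic_score(meal_b))  # 3.0 3.0
# ...even though the analytic profiles differ on two of the three traits.
print(differing_traits(meal_a, meal_b))  # ['appearance', 'taste']
```

The single number hides exactly the information the cook would need in order to improve.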
Another reason to score separate dimensions of performance separately is the problem of landing on one holistic score when the indicators vary. Consider the oral assessment rubric below. What should we do if the student makes great eye contact but fails to make a clear case for the importance of their subject? Can we not easily imagine that, on the separate performance dimensions of “contact with audience” and “argued-for importance of topic,” a student might be good at one and poor at the other? The rubric would have us believe that these sub-achievements would always go together. But logic and experience suggest otherwise.
Oral Assessment Rubric
- 5 – Excellent: The student clearly describes the question studied and provides strong reasons for its importance. Specific information is given to support the conclusions that are drawn and described. The delivery is engaging and sentence structure is consistently correct. Eye contact is made and sustained throughout the presentation. There is strong evidence of preparation, organization, and enthusiasm for the topic. The visual aid is used to make the presentation more effective. Questions from the audience are clearly answered with specific and appropriate information.
- 4 – Very Good: The student described the question studied and provides reasons for its importance. An adequate amount of information is given to support the conclusions that are drawn and described. The delivery and sentence structure are generally correct. There is evidence of preparation, organization, and enthusiasm for the topic. The visual aid is mentioned and used. Questions from the audience are answered clearly.
- 3 – Good: The student describes the question studied and conclusions are stated, but supporting information is not as strong as a 4 or 5. The delivery and sentence structure are generally correct. There is some indication of preparation and organization. The visual aid is mentioned. Questions from the audience are answered.
- 2 – Limited: The student states the question studied, but fails to fully describe it. No conclusions are given to answer the question. The delivery and sentence structure is understandable, but with some errors. Evidence of preparation and organization is lacking. The visual aid may or may not be mentioned. Questions from the audience are answered with only the most basic response.
- 1 – Poor: The student makes a presentation without stating the question or its importance. The topic is unclear and no adequate conclusions are stated. The delivery is difficult to follow. There is no indication of preparation or organization. Questions from the audience receive only the most basic, or no, response.
- 0 – No oral presentation is attempted.
Couldn’t you just circle the relevant sentences from each level to make the feedback more precise?
Sure, but then you have made it into an analytic-trait rubric, since each sentence refers to a different criterion across all the levels. (Trace each sentence in the top paragraph into the lower levels to see its parallel version, to see how each paragraph is really made up out of separate traits.) It doesn’t matter how you format it – into 1 rubric or many – as long as you keep genuinely different criteria separate.
Given that kind of useful breaking down of performance into independent dimensions, why do teachers and state testers so often do holistic scoring with one rubric?
Because holistic scoring is quicker, easier, and often reliable enough when we are quickly assessing a generic skill, such as writing on a state test (as opposed, for example, to assessing control of specific genres of writing). It’s a trade-off, a dilemma of efficiency and effectiveness.
What did you mean when you said above that rubrics could affect validity? Why isn’t that a function of the task or question only?
Validity concerns permissible inferences from scores. Tests or tasks are not valid or invalid; inferences about general ability based on specific results are valid or invalid. In other words, from this specific writing prompt I am trying to infer, generally, to your ability as a writer.
Suppose, then, a rubric for judging story-writing places exclusive emphasis on spelling and grammatical accuracy. The scores would likely be highly reliable, since it is easy to count those kinds of errors, but they would likely yield invalid inferences about who can truly write wonderful stories. It isn’t likely, in other words, that spelling accuracy correlates with the ability to write in an engaging, vivid, and coherent way about a story (the elements presumably at the heart of story writing). Many fine spellers can’t construct engaging narratives, and many wonderful story-tellers did poorly on school grammar and spelling tests.
You should consider, therefore, not just the appropriateness of a performance task but of a rubric and its criteria. On many rubrics, for example, the student need only produce “organized” and “mechanically sound” writing. Surely that is not a sufficient description of good writing. (More on this below.)
It’s all about the purpose of the performance: what’s the goal – of writing? of inquiry? of speaking? of science fair projects? Given the goals being assessed, are we then focusing on the most telling criteria? Have we identified the most important and revealing dimensions of performance, given the criteria most appropriate for such an outcome? Does the rubric provide an authentic and effective way of discriminating between performances? Are the descriptors for each level of performance sufficiently grounded in actual samples of performance of different quality? These and other questions lie at the heart of rubric construction.
How do you properly address such design questions?
By focusing on the purpose of performance, i.e. the sought-after impact, not just the most obvious features of performers or performances. Too many rubrics focus on surface features that may be incidental to whether the overall result or purpose was achieved. Judges of math problem-solving, for example, tend to focus too much on obvious computational errors; judges of writing tend to focus too much on syntactical or mechanical errors. We should highlight criteria that relate most directly to the desired impact based on the purpose of the task.
I need an example.
Consider joke-telling. The joke could have involved content relevant to the audience, it could have been told with good diction and pace, and the timing of the punch-line could have been solid. But those are just surface features. The bottom-line question relates to purpose: was the joke funny? i.e. did people really laugh?
But how does this relate to academics?
Consider the following impact-focused questions:
- The math solution may have been accurate and thorough, but was the problem solved?
- The history paper may have been well-documented and clearly written, with no mechanical errors, but was the argument convincing? Were the counter-arguments and counter-evidence effectively addressed?
- The poem may have rhymed, but did it conjure up vibrant images and feelings?
- The experiment may have been thoroughly written up, but was the conclusion valid?
It is crucial that students learn that the point of performance is effective/successful results, not just good-faith effort and/or mimicry of format and examples.
So, it’s helpful to consider four different kinds of criteria: impact, process, content, polish. Impact criteria should be primary. Process refers to methods or techniques. Content refers to appropriateness and accuracy of content. Polish refers to how well crafted the product is. Take speaking: many good speakers make eye contact and vary their pitch, in polished ways, as they talk about the right content. But those are not the bottom-line criteria of good speaking; they are merely useful techniques in trying to achieve one’s desired impact (e.g. keeping an audience engaged). Impact criteria relate to the purpose of the speaking, namely the desired effects of my speech: was I understood? Was I engaging? Was I persuasive? Moving? In short, whatever my intent, was it realized?
That seems hard on the kid and developmentally suspect!
Not at all. You need to learn early and often that there is a purpose and an audience in all genuine performance. The sooner you learn to think about the key purpose-and-audience questions (What’s my goal? What counts as success here? What does this audience and situation demand? What am I trying to cause in the end?), the more effective and self-directed you’ll be as a learner. It’s not an accident in Hattie’s research that this kind of metacognitive work yields some of the greatest educational gains.
Are there any simple rules for better distinguishing between valid and invalid criteria?
One simple test is negative: can you imagine someone meeting all the proposed criteria in your draft rubric, but not being able to perform well at the task, given its true purpose or nature? Then you have the wrong criteria. For example, many writing rubrics assess organization, mechanics, accuracy, and appropriateness to topic in judging analytic essays. These are necessary but not sufficient; they don’t get to the heart of the purpose of writing — achieving some effect or impact on the reader. These more surface-related criteria can be met but still yield bland and uninteresting writing. So they cannot be the best basis for a rubric.
But surely formal and mechanical aspects of performance matter!
Of course they do. But they don’t get at the point of writing, merely the means of achieving the purpose — and not necessarily the only means. What is the writer’s intent? What is the purpose of any writing? It should “work” or yield a certain effect on the reader. Huck Finn “works” even though the written speech of the characters is ungrammatical. The writing aims at some result; writers aim to accomplish some response — that’s what we must better assess for. If we are assessing analytic writing we should presumably be assessing something like the insightfulness, novelty, clarity and compelling nature of the analysis. The real criteria will be found from an analysis of the answers to questions about the purpose of the performance.
Notice that these last four dimensions implicitly contain the more formal mechanical dimensions that concern you: a paper is not likely to be compelling and thorough if it lacks organization and clarity. We would in fact expect to see the descriptor for the lower levels of performance addressing those matters in terms of the deficiencies that impede clarity or persuasiveness. So, we don’t want learners to fixate on surface features or specific behaviors; rather, we want them to fixate on good outcomes related to purpose.
Huh? What do you mean by distinguishing between specific behaviors and criteria?
Most current rubrics tend to over-value polish, content, and process while under-valuing the impact of the result, as noted above. That amounts to making the student fixate on surface features rather than purpose. It unwittingly tells the student that obeying instructions is more important than succeeding (and leads some people to wrongly think that all rubrics inhibit creativity and genuine excellence).
Take the issue of eye contact, mentioned above. We can easily imagine or find examples of good speaking in which eye contact wasn’t made: think of the radio! Watch some of the TED talks. And we can find examples of dreary speaking with lots of eye contact being made. Any techniques are best used as “indicators” under the main descriptor in a rubric, i.e. there are a few different examples or techniques that MAY be used that tend to help with “delivery” – but they shouldn’t be mandatory because they are not infallible criteria or the only way to do it well.
Is this why some people think rubrics kill creativity?
Exactly right. BAD rubrics kill creativity because they demand formulaic response. Good rubrics demand great results, and give students the freedom to cause them. Bottom line: if you signal in your rubrics that a powerful result is the goal you FREE up creativity and initiative. If you mandate format, content, and process and ignore the impact, you inhibit creativity and reward safe uncreative work.
But it’s so subjective to judge impact!
Not at all. “Organization” is actually far more subjective and intangible a quality in a presentation than “kept me engaged the whole time” if you think about it. And when you go to a bookstore, what are you looking for in a book? Not primarily “organization” or “mechanics” but some desired impact on you. In fact, I think we do students a grave injustice by allowing them to continually submit boring, dreary papers, presentations, and projects (and get high grades on them!). It teaches a bad lesson: as long as you put the right facts in, I don’t care how well you communicated.
The best teacher I ever saw was a teacher at Portland HS in Portland, Maine, who got his kids to make the most fascinating student oral presentations I have ever heard. How did you do it? I asked. Simple, he said. You got 1 of 2 grades: YES = kept us on the edge of our seats. NO = we lost interest or were bored by it.
Should we not assess techniques, forms, or useful behaviors at all, then?
I didn’t mean to suggest that assessing them is a mistake. Giving feedback on ALL the types of criteria is helpful. For example, in archery one might aptly desire to score stance, technique with the bow, and accuracy. Stance matters. On the other hand, the ultimate value of the performance surely relates to its accuracy. In practice that means we can justifiably score for a process or approach, but we should not over-value it so that it appears that results really don’t matter much.
What should you do, then, when using different types of criteria, to signal to the learner what to attend to and why?
You should weight the criteria validly and not arbitrarily. We often, for example, give equal weight to the varied criteria we are using (say, persuasiveness, organization, idea development, mechanics) – 25% each. Why? Habit or laziness. Validity demands that we ask: given the purpose and audience, how should the criteria be weighted? A well-written paper with little that is interesting or illuminating should not get really high marks – yet using many current writing rubrics, the paper would because the criteria are weighted equally and impact is not typically scored.
Beyond this basic point about assigning valid weights to the varied criteria, the weighting can vary over time, to signal that your expectations as a teacher properly change once kids get that writing, speaking, or problem solving is about purposeful effects. E.g. accuracy in archery may be appropriately worth only 25% when scoring a novice, but 100% when scoring archery performance in competition.
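To make the arithmetic of weighting concrete, here is a minimal sketch. Everything in it beyond the four writing criteria and the 25% and 100% figures for archery accuracy is an assumption for illustration: the 1-to-4 scale, the sample scores, the impact-heavy weights, and the way the novice's remaining 75% is split between stance and technique are all hypothetical.

```python
def weighted_score(scores, weights):
    """Combine criterion scores using percentage weights that must total 100."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100 percent")
    return sum(weights[c] * scores[c] for c in weights) / 100


# A polished but uninteresting paper, scored on a hypothetical 1-4 scale.
paper = {"persuasiveness": 1, "idea_development": 2, "organization": 4, "mechanics": 4}

equal_weights = {"persuasiveness": 25, "idea_development": 25,
                 "organization": 25, "mechanics": 25}
# Hypothetical weights that put impact-related criteria first.
impact_weights = {"persuasiveness": 40, "idea_development": 30,
                  "organization": 15, "mechanics": 15}

print(weighted_score(paper, equal_weights))   # 2.75
print(weighted_score(paper, impact_weights))  # 2.2

# Archery: the same performance, weighted differently as expectations change.
archer = {"stance": 4, "technique": 4, "accuracy": 1}
novice_weights = {"stance": 40, "technique": 35, "accuracy": 25}    # 40/35 split is assumed
competition_weights = {"stance": 0, "technique": 0, "accuracy": 100}

print(weighted_score(archer, novice_weights))       # 3.25
print(weighted_score(archer, competition_weights))  # 1.0
```

The same scores yield quite different totals depending on the weights, which is why the choice of weights needs a justification tied to purpose rather than a default of equal shares.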
Given how complex this is, why not just say that the difference between the levels of performance is that if a 6 is thorough or clear or accurate, etc. then a 5 is less thorough, less clear or less accurate than a 6? Most rubrics seem to do that: they rely on a lot of comparative (and evaluative) language.
Alas, you’re right. This is a cop-out – utterly unhelpful to learners. It’s ultimately lazy to just use comparative language; it stems from a failure to provide a clear and precise description of the unique features of performance at each level. And the student is left with pretty weak feedback when rubrics rely heavily on words like “less than a 5” or “a fairly complete performance” — not much different than getting a paper back with a letter grade.
Ideally, a rubric focuses on discernible and useful empirical differences in performance; that way the assessment is educative, not just measurement. Too many such rubrics end up being norm-referenced tests in disguise, in other words, where judges fail to look closely at the more subtle but vital features of performance. Mere reliability is not enough: we want a system that can improve performance through feedback.
Compare the following excerpt from the ACTFL guidelines with a social studies rubric below it to see the point: the ACTFL rubric is rich in descriptive language which provides insight into each level and its uniqueness. The social studies rubric never gets much beyond comparative language in reference to the dimensions to be assessed (note how the only difference between each score point is a change in one adjective or a comparative):
- Novice-High: Able to satisfy immediate needs using learned utterances… can ask questions or make statements with reasonable accuracy only where this involves short memorized utterances or formulae. Most utterances are telegraphic, and errors often occur when word endings and verbs are omitted or confused… Speech is characterized by enumeration, rather than by sentences. There is some concept of the present tense forms of regular verbs, particularly -ar verbs, and some common irregular verbs… There is some use of articles, indicating a concept of gender, although mistakes are constant and numerous…
- Intermediate-High: Able to satisfy most survival needs and limited social demands. Developing flexibility in language production although fluency is still uneven. Can initiate and sustain a general conversation on factual topics beyond basic survival needs. Can give autobiographical information… Can provide sporadically, although not consistently, simple directions and narration of present, past, and future events, although limited vocabulary range and insufficient control of grammar lead to much hesitation and inaccuracy…. Has basic knowledge of the differences between ser and estar, although errors are frequent…. Can control the present tense of most regular and irregular verbs…. Comprehensible to native speakers used to dealing with foreigners, but still has to repeat utterances frequently to be understood by general public.
Compare those rich descriptors and their specificity to this vagueness in the social studies rubric from a Canadian provincial exam:
|The examples or case studies selected are relevant, accurate, and comprehensively developed, revealing a mature and insightful understanding of social studies content.|
|The examples or case studies selected are relevant, accurate, and clearly developed, revealing a solid understanding of social studies content.|
|The examples or case studies selected are relevant and adequately developed but may contain some factual errors. The development of the case studies/examples reveals an adequate understanding of social studies content.|
|The examples or cases selected, while relevant, are vaguely or incompletely developed, and/or they contain inaccuracies. A restricted understanding of social studies is revealed.|
|The examples are relevant, but a minimal attempt has been made to develop them, and/or the examples contain major errors revealing a lack of understanding of content.|
What’s the difference between insightful, solid, and adequate understanding? We have no idea from the rubric (which harkens back to the previous post: the only way to find out is to look at the sample papers that anchor the rubrics.)
Even worse, though, is when rubrics turn qualitative differences into arbitrary quantitative differences.
What do you mean?
A “less clear” paper is obviously less desirable than a “clear” paper (even though that doesn’t tell us much about what clarity or its absence look like), but it is almost never valid to say that a good paper has more facts or more footnotes or more arguments than a worse paper. A paper is never worse because it has fewer footnotes; it is worse because the sources cited are somehow less appropriate or illuminating. A paper is not good because it is long but because it has something to say. There is a bad temptation to construct descriptors based on easy to count quantities instead of valid qualities.
The rubric should thus always describe “better” and “worse” in tangible qualitative terms in each descriptor: what specifically makes this argument or proof better than another one? So, when using comparative language to differentiate quality, make sure at least that what is being compared is relative quality, not relative arbitrary quantity.