In maths we tend to be rather untroubled by questions of assessment, at least compared with our colleagues in most other disciplines. This is largely because we’ve convinced ourselves that there are no issues of subjectivity involved: given a purported answer to a given question, we can generally say whether it’s right or wrong with very little scope for debate. A quagmire of subjectivity appears, of course, the moment we start awarding more than a single mark per question and get into the debatable lands known as “partial credit”, but we tend to make our way round this quagmire by constructing detailed and atomistic marking schemes, in which the award of individual marks is closely tied to specific solution steps or pieces of knowledge demonstrated when answering questions. We then tend to relax, having convinced ourselves that our assessments are now objective and impregnable — as they will have to appear when challenged by stroppy students or external examiners.
What we don’t like to talk about — although anyone who’s been involved in setting maths assessments must be aware of it — is that all we achieve by setting up these marking schemes is to shift the subjectivity a step back, into the choice of questions and the design of the schemes themselves. I’ve felt increasingly over the last few years that this is a large and underexamined problem, despite the sensible comments provided in some quarters (e.g. Cox 2011, section 5.6; Challis, Houston & Stirling 2001) and I’d like to look at one of the reasons for this here. (I should note that throughout I’ll refer only to exams rather than other forms of assessments. There’s a sensible debate to be had about the merits and demerits of various forms of assessment, but anything that assigns a numerical mark to a student’s performance tends to suffer from the problem I’ll discuss here.)
In particular, I’ll assume that like most teachers in tertiary education, we’re working in a modularised system, in which a student passes a module by scoring at least some minimum mark (typically 40%) in the exam. Passing a minimum number of modules is necessary for a student to progress to subsequent years and ultimately to be awarded their qualification; passing specific modules is necessary as the prerequisite for students to take specific modules in subsequent years. The minimum mark is non-negotiable and set at an institutional level, generally by some tumshie in a suit who hasn’t sullied a classroom for some years, and has seen the inside of a maths exam approximately never.
Note that this immediately makes the sensible advice of Challis et al. impossible to follow:
When “licence to proceed” is a key concern, then assessment tasks should concentrate on essential, core knowledge and skills, and the pass threshold should be set at a high level. The results of this assessment should not be aggregated with other results, but should be reported separately, and there should be a requirement to pass. Other assessment tasks, like written examinations, should be used to discriminate between students in order to place them in rank order.
The problem, then, is that in a modular system the exam mark is trying to do at least two jobs, and these don’t in general make the same demands on it. The marking scheme requires that the mark represents an accumulation of individual “abilities” or “pieces of information” that the student has learned. The module system requires that the award of the module means that the student has mastered this topic well enough to proceed to later modules that depend on it. It is clearly going to require a lot of skill or luck to ensure that these requirements coincide — and more so when we remember that although the exam is going to be taken by a large number of students, marks and modules are awarded to individual students, so simply being “statistically fair” isn’t good enough.
Let’s start by taking the most absurdly reductive description of the exam process. Say a module consists of 100 discrete “facts”, and an exam question tests the student’s knowledge of one of these facts. The only real choice the examiner then has is how many questions to set: she can set a few questions — say five, each worth twenty marks — or lots — say a hundred, each worth one mark. It’s presumed that the students know in advance roughly how many questions will be set. Our interest will be in “strategically minded” students, who are aiming to pass the module with the minimum effort they can get away with; we will ignore the small minority of students who can and will reliably learn either everything or nothing in the module, as we can do nothing that will make or mar them further. (In some blessed institutions, there may be such a culture of overachievement that most students are trying to learn as much as they humanly can. Those working in such institutions also tend not to be plagued by modularisation; they should count themselves lucky and sacrifice a hecatomb of quality assurance agents every Michaelmas in gratitude for the mercy of Apollo.)
Suppose the examiner sets 100 questions, so a student will pass the module by correctly recalling 40 facts. Because it’s entirely predictable that every fact in the module will be examined, a student can rely on passing the exam by learning exactly 40 of them, generally the 40 that the student finds easiest to learn. If it happens that knowing an essentially arbitrary 40% of the module content is sufficient for a student to successfully complete subsequent modules, there is no problem. I have, however, yet to encounter any maths course in which a student can prosper with such a small smattering of the prerequisite material. The consequence is that students are being officially assured that they have learned this subject when in fact they haven’t. This is, I’d argue, unfair to the students in question; it’s certainly unfair to the poor sod who has to teach the subsequent modules that build on this one.
Suppose, on the other hand, that the examiner sets five questions, so a student will pass the module by correctly recalling two facts out of the five sampled. In one sense this should have a desirable effect: to have a good chance of passing this module, a student can’t rely on learning exactly 40% of the material. Any strategic student who wants to give himself a good chance must therefore learn rather more than that minimum — the fewer questions are set, the more of the material a student must learn in order to pass with a specified probability. Problem solved? Unfortunately, the new problem is that the outcome of the exam for individual students is considerably more random. Some students will fail while others with less knowledge pass: they will certainly see this as unfair. Worse still, some students who know less than 40% of the material will sneak through the exam, and again be officially assured that they have learned a subject they haven’t.
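The trade-off is easy to quantify in the reductive model above. The sketch below (a hypothetical Python calculation, invented purely for illustration) treats the exam as sampling its questions uniformly without replacement from the 100 facts, so the number of sampled facts a student knows is hypergeometric, and asks how likely a student is to reach the 40% threshold as a function of how many facts they have learned.

```python
from math import ceil, comb

TOTAL = 100  # discrete "facts" in the module, as in the reductive model


def pass_probability(known, questions, threshold=0.4):
    """Probability of passing when `questions` facts are sampled uniformly
    without replacement from TOTAL and the student knows `known` of them.
    Passing means correctly answering at least threshold * questions."""
    need = ceil(threshold * questions)
    # Hypergeometric tail: P(at least `need` sampled facts are known).
    return sum(
        comb(known, i) * comb(TOTAL - known, questions - i)
        for i in range(need, min(known, questions) + 1)
    ) / comb(TOTAL, questions)


# 100-question paper: deterministic at the 40-fact mark.
print(pass_probability(39, 100))  # 0.0
print(pass_probability(40, 100))  # 1.0

# Five-question paper: outcomes are a lottery.
print(pass_probability(30, 5))    # ≈ 0.47
print(pass_probability(50, 5))    # ≈ 0.82
```

Under the 100-question paper the outcome turns precisely on the 40th fact; under the five-question paper a student who knows only 30 facts passes nearly half the time, while one who knows 50 still fails almost one time in five — exactly the randomness and unfairness complained of above.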
It’s clear that we’ve skewered ourselves on a ceratinous little dilemma, which is another manifestation of the general tension between implicit and explicit instruction. We can be very fair in the sense of reducing randomness of outcome, but in that case we’re unfair in setting standards which are too low for the students’ later good. Alternatively, we can be fair in the sense of raising the standards required, but in that case the outcomes are considerably determined by blind chance.
Let’s now see whether marking schemes, partial credit and a more realistic picture of what questions involve can help us out of this pickle. Assuming that the module is at all coherent, it should be possible to set longer questions, each of which tests multiple pieces of knowledge. It should also be possible, for example by giving intermediate results or targets to work backwards from, to design these questions so that they use the students’ knowledge in either a more or a less cumulative fashion. (For example, a simple optimisation question might consist of two parts: (i) find the derivative of f(x); (ii) use this information to locate the minimum of f(x) over a given domain. If the student is instead told in part (i) to show that df/dx = whatever it is, they can complete (ii) without having completed (i). This sort of thing is often described as “testing higher-order skills”.)
What this means is that an exam and marking scheme can — in principle at least — be designed to give a somewhat non-linear relationship between the proportion of the knowledge that a student has learned and the mark they receive. This design does not depend on testing only a fraction of the module content, so we should be able to eliminate the randomness problem. (Or, at least, to reduce it to the inevitable randomness introduced by students under- and over-performing on the day. This is a non-trivial but separate problem with assessment design…)
A bit of idiosyncratic terminology before I go on: if the graph of marks against knowledge is convex-upward, I’ll describe the marking scheme as front-loaded; if the graph is concave-upward, I’ll describe the scheme as end-loaded. A front-loaded scheme makes it easy for students to gather the first few marks and progressively harder to gain more; typically the 40% threshold can be passed while knowing less than 40% of the material. In an end-loaded scheme, students have to work hard for the early marks but can then gain more easily once they’ve broken the back of a question.
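To see how much difference the loading makes, here is a toy calculation (Python, with an arbitrary power-law marking curve that I am inventing purely for illustration): model the mark fraction as m = k^p, where k is the fraction of the material a student knows, and ask what k is needed to reach the 40% pass mark.

```python
def knowledge_to_pass(exponent, threshold=0.4):
    """Knowledge fraction k needed so that the mark m = k**exponent
    reaches the pass threshold: k = threshold**(1/exponent)."""
    return threshold ** (1 / exponent)


# Front-loaded scheme (exponent < 1): the early marks come cheap,
# so the threshold is met knowing only 16% of the material.
print(knowledge_to_pass(0.5))

# Linear scheme: marks track knowledge exactly, pass at k = 0.4.
print(knowledge_to_pass(1.0))

# End-loaded scheme (exponent > 1): early marks are dear, so a pass
# requires knowing about 63% of the material.
print(knowledge_to_pass(2.0))
```

The power law is of course a caricature, but the asymmetry it exposes is real: a modestly front-loaded scheme lets the 40% threshold be met with a small fraction of the material, while a modestly end-loaded one pushes the effective standard well above 40%.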
You can see where this is going. If we can design an exam that covers the entire module and that is sufficiently end-loaded, we should be able to ensure that the 40% threshold corresponds to knowing not 40% of the material, but some rather higher proportion which is actually adequate for the students’ later needs. Further, when the “strategic” students attempt past papers, they should discover this fact and thus be forced to learn the required amount. (Students who are so thick they aren’t capable of being strategic, like the ones who turn up only for the first four weeks of a ten week course in the happy belief that this will give them the magical 40% required to pass, probably deserve to fail.)
The difficulty is that designing an end-loaded assessment is a lot harder than it sounds — even assuming that there isn’t an institutional edict requiring that, say, 75% of your class have to pass as well as meeting the required standard. (When criterion-referencing and norm-referencing are muddled up in this way by the institutional tumshies there is no hope left: students and staff will rapidly enter a spiral of diminished expectations and diminished collective effort and there’s nothing to do except set fire to the building on your way out.) Let’s look at some enemies of end-loading.
(i) The “pick N questions out of M” exam format. Presumably this works perfectly well in essay-based subjects where the same essential abilities will be tested by an essay on any vaguely relevant topic, but it’s death to maths exams. The cherry-picking that it allows immediately means that the students get a disproportionate amount of credit for whatever they find easiest to do, and that even for the most ambitious students there is little sense in staying awake until the end of term.
(ii) Giving intermediate results to help students through a question. This is a very, very strong temptation, beloved of referees and external examiners: “it’s not fair that if they can’t do X then they won’t be able to get credit for doing Y”. Regrettably, it also makes it easier for students to cherry-pick only the easiest bits of questions — again, a front-loaded scheme results.
(iii) Opening a question with a predictable bit of bookwork: “(i) State and prove the Milne-Thomson Circle Theorem. (ii) State, but do not attempt to prove, Blasius’s Theorem. (iii) Using Milne-Thomson’s and Blasius’s theorems, determine…” If these bits have to be included, it’s essential that the bookwork carries as little credit as the referees and tumshies will let you get away with.
As this makes fairly clear, it seems to me that the only way to make exams fair is to do several things that students — and a few of my colleagues — don’t like in the least. Set questions with a “steep start” so that students need to make an appreciable effort to pick up the first handful of marks. Set, where possible, “open-book” exams, to remove the temptation to start every question with bookwork. (My first attempt to set an open-book exam resulted in a 100% failure rate: despite warnings to the contrary, the students had failed to grasp the fact that open-book exams are much, much harder to pass than the closed-book variety.) Be stingy with partial credit: getting all the way through to the correct answer should be worth a good deal more than getting three quarters of the way there. (I like the “alphas” system for achieving this, though of course it’s merry hell mapping a combination of marks and alphas onto a tumshie-oriented institutional scale.) Don’t give many intermediate results. Set the kind of codas to questions that students really hate: “Explain this result in terms of…”; “Without detailed further calculations, indicate what would happen if…”
By this stage you will presumably have realised why I feature regularly on my department’s roll-call of examiners held up for public remonstrance by exam boards and muttered about darkly by disaffected students. In the last year or so I’ve retreated, in practice, from my own principles by setting exams designed to be very, very hard to fail, and more in the direction of “1. Write your name on, or near, the paper. If you cannot recall your name, write someone else’s name for partial credit. (20 marks)”. But I really ought to know better and keep on stating the bleeding obvious:
1. Examining is neither unproblematic nor objective, even in mathematics.
2. Exams that are designed to be “easy” to pass are rarely fair to students.
As they say: discuss…