At a recent department meeting, I was apparently singled out for having achieved what my line manager thought were excellent results in the end-of-term student feedback questionnaires. I don’t believe in prizes for teaching, and I’m putting my line manager’s opinion on record solely because I want to have a go at feedback questionnaires and the way they’re used, and — rightly or wrongly — such criticisms tend to sound more plausible if they can’t be dismissed as sour grapes.
Like many such questionnaires, including the egregious NSS, ours consists of a set of multiple-choice items, each of which asks the student to express their degree of agreement with a statement such as “The lecturer presented the course in a well-organised way” or “I have received prompt feedback on my work”. (There is also space on the back of the questionnaire for students to write any comments or explanations: this space is rarely used unless the lecturer makes a big effort to encourage it.) Having been collected, the questionnaires are scanned and passed through electronic marking software which returns a mean score for each item; written comments are not recorded, though we are free to transcribe them for our own purposes if we like. All the items are phrased such that 5 is “good” and 1 is “bad”, so a high mean on any item is regarded as evidence of good performance; my mention in dispatches was on the basis that all my means were higher than 4.5.
It’s hard to know where to start cataloguing the problems with this process, but let’s try anyway. I’ll leave to one side the question of whether it’s valuable to ask students’ opinions of a course immediately afterwards rather than, say, a couple of years later when they can see how much they really learned from it; and I’ll also leave to one side the question of whether one can readily identify “good” student responses with “good” teaching. Assuming that we’re trying to assess what students think about a current or recent class, what are the problems with my department’s approach?
The first problem is the business of computing means for individual items. Whatever one thinks of the general validity of Likert-type instruments, they are subtle and easily abused tools, and this has to be an abuse. First, as Susan Jamieson [Medical Ed. 38: 1217-1218, 2004] points out, labelling a qualitative response with a number does not turn it into a quantitative datum to which quantitative tools can necessarily be applied. (Dr Jamieson’s sane little paper has received a certain amount of stick in some quarters, essentially on the basis that her terminology is imprecise, but as far as I can tell her objections in this context are perfectly valid.) Second, it’s fairly clear that, for example, the net conclusion to draw from an equal number of “strongly agree” and “strongly disagree” responses is not “on average, the respondents neither agreed nor disagreed” but “the question was strongly divisive”. Oh, and there’s this problem too, of course.
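The "strongly divisive" point can be made concrete with a few lines of Python. This is a minimal sketch with invented response data: two hypothetical sets of answers to the same item, one deeply split and one uniformly lukewarm, which the mean cannot distinguish but the spread and the full distribution can.

```python
# Two invented response sets on a 1-5 Likert item with the same mean
# but opposite meanings: the mean alone cannot tell them apart.
from statistics import mean, stdev
from collections import Counter

divisive = [5, 5, 5, 1, 1, 1]   # equal "strongly agree" / "strongly disagree"
lukewarm = [3, 3, 3, 3, 3, 3]   # everyone "neither agrees nor disagrees"

print(mean(divisive), mean(lukewarm))    # both 3: identical means
print(stdev(divisive), stdev(lukewarm))  # ~2.19 vs 0.0: the spread differs
print(Counter(divisive))                 # the full distribution is clearer still
```

Anyone reporting only the item means would file these two classes under the same verdict, which is precisely the abuse at issue.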
A second problem is that for some questions the agree–disagree scale is seriously misleading. Suppose that a student has never handed in any work to be marked: how should he respond to “I have received prompt feedback on my work”? Logically, he should “strongly disagree”, since he has never received such feedback; mercifully, few such students are logical enough to respond like this. More commonly, he “neither agrees nor disagrees”, and thus anchors the responses to this item firmly to the middle of the scale — incidentally contributing to the widespread panic that we don’t provide our students with enough feedback. (You may recall that my suspicions lie in the other direction.) Occasionally, he has a pang of guilt and “strongly agrees” in order not to penalise the lecturer. Very occasionally, he has enough sense to leave this item blank. The net effect is that the class’s response to this item is a complicated mixture of illogic and guilt, with any sensible signal buried deep within the noise.
A third problem is that there is little reason for students to take the exercise seriously. In the days when I had some freedom to run feedback questionnaires, I tended to do so in the middle of the semester, and to follow them up by explaining to the class what — if anything — was going to change in response; in other words, I treated them as formative rather than summative assessment. These days I’m under orders to distribute my forms at the end of the semester, when the students well know that anything they say or do will make no difference to them. Some are altruistic enough to try to give considered feedback in any case, but for many I suspect that the whole thing is an exercise in either revenging themselves upon a lecturer they dislike or patting on the head a lecturer they like. Essentially, what these questionnaires give a snapshot of is the emotions of tired students who have no incentive to examine why these emotions arise.
A fourth problem is to do with sampling: these results are horribly subject both to the law of small numbers and to selection bias. When I’ve had very “good” results from a questionnaire, it has inevitably been from a relatively small class into which the law of large numbers couldn’t properly get its teeth; I don’t know, but I suspect that the same can be said for most of my conspicuously “bad” results. Selection bias is an additional effect, though one which works to the advantage of the lecturer: when students have had a couple of months to get fed up with your teaching and stop turning up, of course the students present in class when you hand out the forms will, on the whole, be the ones who like you. (Online questionnaires, by the way, aren’t the answer to this: in the days when I was allowed to use them, I found the extra effort required to get oneself online and complete the form cut down the number of respondents significantly and biased the responses strongly toward the most committed students.)
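The small-class effect is easy to demonstrate with a toy simulation. The sketch below — class sizes, the "true" opinion profile, and the number of simulated classes are all invented for illustration — draws questionnaire scores for many classes from the same underlying distribution and shows how much further the class means wander when the class is small.

```python
# Toy simulation: every simulated class answers from the SAME underlying
# distribution of opinion, yet small classes produce far more extreme
# mean scores than large ones. All parameters are invented.
import random

SCALE = [1, 2, 3, 4, 5]
WEIGHTS = [0.05, 0.10, 0.25, 0.35, 0.25]  # one fixed "true" opinion profile

def class_means(class_size, n_classes=1000):
    """Mean questionnaire score for each of n_classes simulated classes."""
    means = []
    for _ in range(n_classes):
        scores = random.choices(SCALE, WEIGHTS, k=class_size)
        means.append(sum(scores) / class_size)
    return means

random.seed(0)
for size in (8, 80):
    ms = class_means(size)
    print(f"class of {size:3d}: means range from {min(ms):.2f} to {max(ms):.2f}")
```

A lecturer teaching the small classes will, through no virtue or vice of their own, collect both the most flattering and the most damning means in the department.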
A fifth problem is that the focus on multiple-choice scores ignores the difference between a poll and a consultation. The former is intended to assess opinion; the latter is intended to collect opinions. What matters, from the point of view of improving teaching and learning, is the latter: to know why students respond the way they do; what they think is or ought to be happening in class; what specific issues are rubbing them up the wrong way. This is what the written responses provide, and it’s remarkable how different some of the multiple-choice results look when read in their light. A set of strongly positive responses looks rather less credible when accompanied by a paean to the lecturer’s desirability — and no, I don’t get these, but I have colleagues who do. (Fans of Indiana Jones will know the sort of thing I mean.) A set of strongly negative responses looks rather less credible when the crayon-scrawls on the back make it clear that the student signed up for the class not having read the syllabus and believing it to be about something else. (Any fans of Amazon one-star reviews out there?) A set of strongly positive responses coupled with strongly negative comments, or vice versa, is evidence that the student has read the scale the wrong way round — and yes, this really does happen. A run of identical comments is a useful reminder that the individual questionnaires are not independent random variables, and that one dissatisfied and articulate student can readily mobilise supporters. And, of course, one thoughtful response that identifies specific weaknesses or praises specific elements of a course can outweigh in its practical effects an entire bucketful of praise or whinging.
If we were serious about wanting to know what our students thought, we’d ask them to tell us, in their own words — perhaps drawing their attention to particular points of interest, but perhaps just leaving them to set their own priorities and deducing what we could from those priorities as expressed in the responses. (I have an example of this that I’d like to return to in another post.) Even if logistically we couldn’t do this and had to rely on multiple-choice questions, we’d design these carefully and interrogate them thoroughly for hidden assumptions (such as “every student has at some point handed in homework”). We’d decide in advance whether we wanted to sample all the students on the class list or just the most committed, then design our procedure and interpret our results accordingly. We would only ever compare the results from one class with those from classes that were comparable in size and level, and we certainly wouldn’t fetishise the mean — indeed, I suspect we’d be most interested in items that polarised the responses. And we’d give the students some incentive to respond thoughtfully by treating the results thoughtfully, rather than as telephone votes in a kind of pedagogical talent show.
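If one did want a quick numerical flag for the polarising items mentioned above, a crude alternative to the mean is to rank items by how much of the response sits at the extremes of the scale. The sketch below is one possible such measure, with invented item names and response data; it is not anything my department computes.

```python
# Hypothetical "polarisation" measure: the fraction of responses at the
# extreme ends of a 1-5 scale. Item names and scores are invented.
def polarisation(scores):
    """Fraction of responses that are 1 or 5 on a 1-5 scale."""
    return sum(1 for s in scores if s in (1, 5)) / len(scores)

items = {
    "well-organised":  [4, 4, 5, 4, 3, 4, 5, 4],
    "prompt feedback": [1, 5, 1, 5, 3, 5, 1, 5],
}
for name, scores in sorted(items.items(), key=lambda kv: -polarisation(kv[1])):
    print(f"{name}: mean {sum(scores) / len(scores):.2f}, "
          f"polarisation {polarisation(scores):.2f}")
```

On this invented data the "prompt feedback" item has the less alarming-looking mean but by far the higher polarisation — exactly the sort of item the mean-fetishist would overlook and the thoughtful reader would investigate.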
Our students, diligently as they may sometimes conceal it, are not generally stupid. They can tell when we don’t take an exercise seriously, and if we don’t then they certainly won’t. Ironically, the other reason that I get “good” results on my multiple-choice responses is that I explain to my students why I don’t believe in them, and direct their attention to the back of the form instead. It’s amazing how much more kindly people will speak about you if they think you’re listening.