In a precedent-setting case, an Ontario arbitrator has directed Ryerson University to ensure that student evaluations of teaching, or SETs, “are not used to measure teaching effectiveness for promotion or tenure.” The SET issue has been discussed in Ryerson collective bargaining sessions since 2003, and a formal grievance was filed in 2009.
The long-running case has been followed, and the ruling applauded, by academics throughout Canada and internationally, who for years have complained that universities rely too heavily on student surveys as a means of evaluating professors’ teaching effectiveness.
“We were delighted,” said Sophie Quigley, professor of computer science at Ryerson, and the grievance officer who filed the case back in 2009. “These are statistically correct arguments we’ve been making over the years, and it’s wonderful that reason has prevailed.”
While acknowledging that SETs are relevant in “capturing student experience” of a course and its instructor, arbitrator William Kaplan stated in his ruling that expert evidence presented by the faculty association “establishes, with little ambiguity, that a key tool in assessing teaching effectiveness is flawed.”
It’s a position faculty have argued for years, particularly as SETs migrated online and the number of students participating plummeted, even as university administrations relied more heavily on what seemed, on the surface, a legitimate data-driven tool.
Mr. Kaplan’s conclusion that SETs are in fact deeply problematic will “unleash debate at universities across the country,” said David Robinson, executive director of the Canadian Association of University Teachers. “The ruling really confirms the concerns members have raised.” While student evaluations have a place, Mr. Robinson argued, “they are not a clear metric. It’s disconcerting for faculty to find themselves judged on the basis of data that is totally unreliable.”
As Dr. Quigley pointed out, studies about SETs didn’t exist 15 years ago, and it was perhaps easier for universities to see the surveys as an effective means of assessment. “Psychologically, there is an air of authority in using all this data, making it seem official and sound,” she noted.
Now, however, there is much research to back up the argument against SETs as a reliable measure of teaching effectiveness, particularly when the data is used to plot averages on charts and compare faculty results. The Ontario Confederation of University Faculty Associations (OCUFA) commissioned two reports on the issue, one by Richard Freishtat, director of the Center for Teaching and Learning at the University of California, Berkeley, and another by statistician Philip B. Stark, also at Berkeley.
The findings in those two reports were accepted by Mr. Kaplan, who cited flaws in methodology and ethical concerns around confidentiality and informed consent. He also cited serious human-rights issues, with studies showing that biases around gender, ethnicity, accent, age, even “attractiveness,” may factor into students’ ratings of professors, making SETs deeply discriminatory against numerous “vulnerable” faculty.
“We expect this ruling will be used by other faculty associations,” said Dr. Quigley, who said she has received numerous requests for further information about the case from faculty across Canada.
OCUFA representatives agreed about the wider significance of Mr. Kaplan’s decision. “The ruling gives a strong signal of the direction the thinking is going on this,” said Jeff Tennant, a professor of French studies at Western University, chair of the OCUFA collective bargaining committee, and faculty representative on the OCUFA working group on SETs that commissioned the two reports submitted to the arbitrator.
“I think university administrations need to recognize that if they’re committed to quality teaching, if they want to monitor and evaluate performance, they have to use instruments that actually do measure teaching effectiveness in a way that these student surveys do not,” said Dr. Tennant. Peer evaluations and teaching dossiers, for instance, have been shown to be more reliable as indicators of teaching effectiveness than SETs, he said.
A report documenting all the research OCUFA gathered in support of the Ryerson case will be published in October. “There’s a real opportunity for Canadian universities to take leadership here, to say, ‘We recognize the evidence that’s been marshalled here from dozens and dozens of studies.’ We can continue to survey students to get information about their experience, that information is valuable to us, but we’re going to have to find more reliable means to evaluate faculty teaching.”
In the end, Mr. Kaplan agreed with the OCUFA reports: “Extremely comprehensive teaching dossiers – as is also already anticipated by the collective agreement – containing diverse pedagogical information drawn from the instructor and other sources should provide the necessary information to evaluate the actual teaching as an ongoing process of inquiry, experimentation and reflection. Together with peer evaluation, they help paint the most accurate picture of teaching effectiveness.”
This arbitration, OCUFA, and CAUT do a great disservice to students in dismissing the evidence that student evaluations are reliable and valid indicators of current teaching effectiveness, and can play an important role in improving instruction. Dr. Quigley is wrong that studies of course evaluations did not exist 15 years ago; they have been researched for over 50 years, and empirical summaries of that substantial research are generally positive, although certainly not without debate. Dr. Tennant claims that peer evaluations and dossiers have been shown to be better indicators of effectiveness than course evaluations. I would very much like to see references providing empirical evidence that peer evaluations (mutual ones at that) or self-selected teaching dossiers are more valid than student ratings.
How can anyone think that a few visits to a few classes by peers or administrators are comparable to ratings by students who have attended our classes through an entire term? Especially given that results for individual faculty would be reported for multiple classes over multiple years. Students can tell if instructors deliver organized lectures, return work in a timely fashion, generate student interest in the material, and other qualities associated with effective teaching.
Sure, factors correlate with evaluations, such as class size, quantitative vs non-quantitative courses, and new preparations. But no measure of anything is completely free of extraneous influences and intelligent people (like academics?) can appreciate the role of such factors in making their determinations.
One irony in all this is that course evaluations actually show how effective the vast majority of faculty are, something we should be proud of rather than undermining that message.
Jim: Do you not recognize that these decades of studies have shown that students rank non-white males worse than others, and that using them as a metric of evaluation is therefore a way of discriminating against non-white males in the academy? Not to mention, there are so many ways of “gaming” the system when it comes to course evaluations that they are essentially meaningless.
I’ve actually gotten comments on my “nice rack” on course evaluations. Do you think this is as valuable to me as a teacher as an experienced peer coming in and providing me with feedback?
I applaud the decision. We should all do away with these “customer satisfaction” surveys. They are meaningless and biased.
I agree with much of what Jim Clark has written. Students are the only people who see the full effects of teaching in any course. A peer evaluation once or twice a year offers a very limited view and there’s lots of evidence that the evaluator in the classroom has an impact that limits effectiveness.
My (former) university mandates that department chairs and directors evaluate every faculty member every year, but provides absolutely no guidelines on what constitutes good teaching or how to evaluate it. In my experience as department chair, it was apparent that senior faculty sent to observe their younger colleagues had no idea how or what to evaluate. Peer evaluations only work effectively if the observers have such guidelines and are trained in what to look for. I doubt most universities have such systems in place. Without them observers offer such useless comments as “Dr. X is very knowledgeable in this subject”, which is a comment on the university’s hiring practice rather than the instructor’s teaching. More worrisome is the fact that untrained observers often evaluate according to their own teaching, which may put young faculty at risk.
I also agree that heavy reliance on student evaluations is problematic. I’ve seen both highly thoughtful and flippant comments. I think many students don’t believe that their input has any effect and thus either abstain from the process or give it little thought. The solution is not to simply discard student evaluations, but to find ways to make them work. Give students reasons to carefully consider what they write and ask them specific, useful questions rather than the usual “What is your overall evaluation of this instructor?” Discarding them eliminates our only means of evaluating delivery skills in the classroom, which is what really matters to students.
What a lively conversation. The point of the decision is the concern that course evaluations are used as high-stakes instruments that determine employment. I believe that educational professionals (credentialed, trained, and licensed) should make those sorts of decisions. As I look through my course evaluations, they read like social media posts. I often find myself asking, how can a student judge my teaching when I am teaching them how to teach? In addition, we oftentimes find ourselves victimized by student interest and their precious experience. Many expect the “social contract,” which consists of two exams and a paper; anything that forces them to study the material they view as unorganized or busy work.

Student surveys are merely opinions that are not reliable, particularly when two students can sit in the same class for a full semester and have totally different experiences. How is it that one student experienced a 5 and the other experienced a 1? In sum, course evaluations are simply surveys of opinion; student opinions should be heard but not fixed as the determinant of faculty employment. Ironically, student opinion didn’t much factor into the hiring of the faculty member, yet it has played pivotal roles in faculty dismissals. The end game is, if students like you, they have a higher opinion of you, whether they are learning or not.

I know of a faculty member who was granted tenure on the strength of her course evaluation scores, even though the committee neglected to see that students in her courses failed the state exam at levels well below the state average, while another professor was denied tenure even though her students had amassed a 100% passing rate on their state exams. The committee valued student opinions, not the state outcome data, as its measure of effectiveness and rationale for continued and discontinued employment. So, in this case, I’d say no to student surveys being reliable.
Kelly, biased is not the same as meaningless. And Jim correctly points out that Dr. Quigley made a statement that is simply incorrect (that studies of SETs did not exist 15 years ago).
Much of this debate suffers from people taking the studies to say more than they actually do. You say “students rank non-white males worse than others,” when what the recent studies actually show is that “students rank non-white males, controlling for other factors, worse than others.” (Substitute “on average” for “controlling for other factors,” if you prefer.) They do not show that lousy white males are ranked higher than excellent non-white non-males. The effect size shown in the recent studies is variable, and sometimes significant, but not enormous. So this is a real bias, which must be considered, but it does not mean the SETs do not also show other differences – perhaps differences in effectiveness we should be considering. Maybe instead of tossing out SETs we should get better at designing and interpreting them, with these biases in mind.
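To make that group-average-versus-individual distinction concrete, here is a toy simulation with purely made-up numbers (the 0.3-point bias, the noise level, and the quality range are illustrative assumptions, not figures from any of the studies). Two groups of instructors have identical true teaching quality, but one group’s ratings carry an average downward bias; the group means differ, yet individual scores overlap heavily.

```python
# Hypothetical sketch: a constant average bias shifts a group's mean SET
# score without making every biased-group instructor score below every
# unbiased-group instructor.
import random

random.seed(1)

def simulate_mean_scores(n_instructors, n_ratings, bias):
    """Each instructor's mean SET score: true quality + noise + group bias,
    clipped to a 1-5 rating scale."""
    means = []
    for _ in range(n_instructors):
        quality = random.uniform(3.0, 5.0)  # assumed true teaching quality
        ratings = [min(5.0, max(1.0, quality + bias + random.gauss(0, 0.7)))
                   for _ in range(n_ratings)]
        means.append(sum(ratings) / n_ratings)
    return means

group_a = simulate_mean_scores(200, 30, bias=0.0)   # no rating bias
group_b = simulate_mean_scores(200, 30, bias=-0.3)  # 0.3-point average penalty

mean_a = sum(group_a) / len(group_a)
gap = mean_a - sum(group_b) / len(group_b)

# Share of biased-group instructors who still outscore the other group's mean.
frac_above = sum(s > mean_a for s in group_b) / len(group_b)

print(f"group-level gap: {gap:.2f}")
print(f"biased-group instructors above the unbiased group's mean: {frac_above:.2f}")
```

Under these assumptions the group-level gap is visible, yet a substantial share of the biased group still scores above the unbiased group’s average – which is the sense in which an average bias is real without determining any individual comparison.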
In particular, Jim makes the good point that the proposed alternatives suffer from a lack of evidence that they are more effective, or for that matter, less biased. So let’s keep our heads, think carefully and fairly about assessments, and not jump to (currently) unsupported conclusions because they feel “right”.
Note that I’m a white male, so although you might be inclined to give my thoughts extra weight, please resist that bias. And likewise for the reverse.