Arbitration decision on student evaluations of teaching applauded by faculty

Such evaluations can’t be used for tenure and promotion decisions, arbitrator rules in case involving Ryerson University.

Moira Farr

August 28, 2018

In a precedent-setting case, an Ontario arbitrator has directed Ryerson University to ensure that student evaluations of teaching, or SETs, “are not used to measure teaching effectiveness for promotion or tenure.” The SET issue has been discussed in Ryerson collective bargaining sessions since 2003, and a formal grievance was filed in 2009.

The long-running case has been followed, and the ruling applauded, by academics throughout Canada and internationally, who for years have complained that universities rely too heavily on student surveys as a means of evaluating professors’ teaching effectiveness.

“We were delighted,” said Sophie Quigley, professor of computer science at Ryerson, and the grievance officer who filed the case back in 2009. “These are statistically correct arguments we’ve been making over the years, and it’s wonderful that reason has prevailed.”

While acknowledging that SETs are relevant in “capturing student experience” of a course and its instructor, arbitrator William Kaplan stated in his ruling that expert evidence presented by the faculty association “establishes, with little ambiguity, that a key tool in assessing teaching effectiveness is flawed.”

It’s a position faculty have argued for years, particularly as SETs migrated online and the numbers of students participating plummeted, while at the same time university administrations relied more heavily on what on the surface seemed to them a legitimate data-driven tool.

Mr. Kaplan’s conclusion that SETs are in fact deeply problematic will “unleash debate at universities across the country,” said David Robinson, executive director of the Canadian Association of University Teachers. “The ruling really confirms the concerns members have raised.” While student evaluations have a place, Mr. Robinson argued, “they are not a clear metric. It’s disconcerting for faculty to find themselves judged on the basis of data that is totally unreliable.”

As Dr. Quigley pointed out, studies about SETs didn’t exist 15 years ago, and it was perhaps easier for universities to see the surveys as an effective means of assessment. “Psychologically, there is an air of authority in using all this data, making it seem official and sound,” she noted.

Now, however, there is much research to back up the argument against SETs as a reliable measure of teaching effectiveness, particularly when the data is used to plot averages on charts and compare faculty results. The Ontario Confederation of University Faculty Associations (OCUFA) commissioned two reports on the issue, one by Richard Freishtat, director of the Center for Teaching and Learning at the University of California, Berkeley, and another by statistician Philip B. Stark, also at Berkeley.

The findings in those two reports were accepted by Mr. Kaplan, who cited flaws in methodology and ethical concerns around confidentiality and informed consent. He also cited serious human-rights issues, with studies showing that biases around gender, ethnicity, accent, age, even “attractiveness,” may factor into students’ ratings of professors, making SETs deeply discriminatory against numerous “vulnerable” faculty.

“We expect this ruling will be used by other faculty associations,” said Dr. Quigley, who added that she has received numerous requests for further information about the case from faculty across Canada.

OCUFA representatives agreed about the wider significance of Mr. Kaplan’s decision. “The ruling gives a strong signal of the direction the thinking is going on this,” said Jeff Tennant, a professor of French studies at Western University, chair of the OCUFA collective bargaining committee, and faculty representative on the OCUFA working group on SETs that commissioned the two reports submitted to the arbitrator.

“I think university administrations need to recognize that if they’re committed to quality teaching, if they want to monitor and evaluate performance, they have to use instruments that actually do measure teaching effectiveness in a way that these student surveys do not,” said Dr. Tennant. Peer evaluations and teaching dossiers, for instance, have been shown to be more reliable as indicators of teaching effectiveness than SETs, he said.

A report documenting all the research OCUFA gathered in support of the Ryerson case will be published in October. “There’s a real opportunity for Canadian universities to take leadership here, to say, ‘We recognize the evidence that’s been marshalled here from dozens and dozens of studies.’ We can continue to survey students to get information about their experience, that information is valuable to us, but we’re going to have to find more reliable means to evaluate faculty teaching.”

In the end, Mr. Kaplan agreed with the OCUFA reports: “Extremely comprehensive teaching dossiers – as is also already anticipated by the collective agreement – containing diverse pedagogical information drawn from the instructor and other sources should provide the necessary information to evaluate the actual teaching as an ongoing process of inquiry, experimentation and reflection. Together with peer evaluation, they help paint the most accurate picture of teaching effectiveness.”

Moira Farr is a contract instructor at Carleton University as well as a freelance writer and editor.

12 Comments

This arbitration, OCUFA, and CAUT do a great disservice to students in dismissing the evidence that student evaluations are reliable and valid indicators of current teaching effectiveness, and can play an important role in improving instruction. Dr. Quigley is wrong that studies of course evaluations did not exist 15 years ago; they have been researched for over 50 years and empirical summaries of that substantial research are generally positive, although certainly not without debate. Dr. Tennant claims that peer evaluations and dossiers have been shown to be better indicators of effectiveness than course evaluations. I would much like to see references providing empirical evidence that peer evaluations (mutual ones at that) or self-selected teaching dossiers are more valid than student ratings.

How can anyone think that a few visits to a few classes by peers or administrators are comparable to ratings by students who have attended our classes through an entire term, especially given that results for individual faculty would be reported for multiple classes over multiple years? Students can tell if instructors deliver organized lectures, return work in a timely fashion, generate student interest in the material, and other qualities associated with effective teaching.

Sure, factors correlate with evaluations, such as class size, quantitative vs non-quantitative courses, and new preparations. But no measure of anything is completely free of extraneous influences and intelligent people (like academics?) can appreciate the role of such factors in making their determinations.

One irony in all this is that course evaluations actually show how effective the vast majority of faculty are, a message we should be proud of rather than undermine.

Jim: Do you not recognize that these decades of studies have shown that students rank non-white males worse than others, and that using them as a metric of evaluation is therefore a way of discriminating against non-white males in the academy? Not to mention, there are so many ways of “gaming” the system when it comes to course evaluations that they are essentially meaningless.

I’ve actually gotten comments on my “nice rack” on course evaluations. Do you think this is as valuable to me as a teacher as a peer with lots of experience coming in and providing me with feedback?

I applaud the decision. We should all do away with these “customer satisfaction” surveys. They are meaningless and biased.

I agree with much of what Jim Clark has written. Students are the only people who see the full effects of teaching in any course. A peer evaluation once or twice a year offers a very limited view and there’s lots of evidence that the evaluator in the classroom has an impact that limits effectiveness.
My (former) university mandates that department chairs and directors evaluate every faculty member every year, but provides absolutely no guidelines on what constitutes good teaching or how to evaluate it. In my experience as department chair, it was apparent that senior faculty sent to observe their younger colleagues had no idea how or what to evaluate. Peer evaluations only work effectively if the observers have such guidelines and are trained in what to look for. I doubt most universities have such systems in place. Without them observers offer such useless comments as “Dr. X is very knowledgeable in this subject”, which is a comment on the university’s hiring practice rather than the instructor’s teaching. More worrisome is the fact that untrained observers often evaluate according to their own teaching, which may put young faculty at risk.
I also agree that heavy reliance on student evaluations is problematic. I’ve seen both highly thoughtful and flippant comments. I think many students don’t believe that their input has any effect and thus either abstain from the process or give it little thought. The solution is not to simply discard student evaluations, but to find ways to make them work. Give students reasons to carefully consider what they write and ask them specific, useful questions rather than the usual “What is your overall evaluation of this instructor?” Discarding them eliminates our only means of evaluating delivery skills in the classroom, which is what really matters to students.

What a lively conversation. The point of the decision is the concern that course evaluations are used as high-stakes instruments that determine employment. I’m of the belief that educational professionals (credentialed, trained, and licensed) should make those sorts of decisions. As I look through my course evaluations, they read like social media posts. I often find myself asking, how can a student judge my teaching when I am teaching them how to teach? In addition, we oftentimes find ourselves victimized by student interest and their precious experience. Many expect the “social contract,” which consists of two exams and a paper… anything that forces them to study the material they view as unorganized or busy work. Student surveys are merely opinions that are not reliable, particularly when two students can sit in the same class for a full semester and have totally different experiences. How is it that one student experienced a 5 and the other student experienced a 1? In sum, course evaluations are simply surveys of opinions, and student opinions should be heard but not fixed as the determinant of faculty employment. Ironically, student opinion didn’t much factor into the hiring of the faculty member, yet it has played pivotal roles in faculty dismissals. The end game is, if students like you, they have a higher opinion of you, whether they are learning or not. I know of a faculty member who was granted tenure because of the conversation regarding her course evaluation scores; however, the committee neglected to see that students in her courses pass the state exam at rates well below the state average, while another professor was denied tenure even though her students had amassed a 100% passing rate on their state exams. The committee valued student opinions and not the state outcome data as its measure of effectiveness and rationale for continued and discontinued employment. So, in this case, I’d say no to student surveys being reliable.

Kelly, biased is not the same as meaningless. And Jim correctly points out that Dr. Quigley made a statement that is simply incorrect (that studies of SETs did not exist 15 years ago).

Much of this debate suffers from people taking the studies to say more than they actually do. You say “students rank non-white males worse than others”, when what the recent studies actually show is that “students rank non-white males, controlling for other factors, worse than others”. (Substitute “on average” for “controlling…”, if you prefer). They do not show that lousy white males are ranked higher than excellent non-white non-males. The effect size shown in the recent studies is variable, and sometimes significant, but not enormous. So this is a real bias, which must be considered, but it does not mean the SETs do not also show other differences – perhaps differences in effectiveness we should be considering. Maybe instead of tossing out SETs we should get better at designing and interpreting them, with these biases in mind.

In particular, Jim makes the good point that the proposed alternatives suffer from a lack of evidence that they are more effective, or for that matter, less biased. So let’s keep our heads, think carefully and fairly about assessments, and not jump to (currently) unsupported conclusions because they feel “right”.

Note that I’m a white male, so although you might be inclined to give my thoughts extra weight, please resist that bias. And likewise for the reverse.

All these attacks on teaching evaluations just underscore the fact that teaching in universities always takes a back seat. The damning implication of teaching evaluations not being used as part of evaluation is that student experiences in the classroom do not matter to the promotion of a professor. Instead of looking at evaluations with bias in mind, or redesigning the evaluations to better account for potential biases, this campaign tries to undermine evaluations in general. Can we have a proper conversation about improving teaching evaluation? For example, if response rates have been an issue since the system went online, why not make it mandatory so that every student must fill it out in order to receive the final mark?

What is very obvious from a student perspective in all of this debate is that those teaching in universities do not want to be evaluated. It’s already an open secret that teaching is secondary to an academic’s career. It is perhaps not so surprising that a sizable group of academics are trying to attack the importance of teaching evaluations. I’m happy to be proven otherwise. Generally speaking, there is little recourse available for students when a professor, for example, shows up to lectures disorganized, is unprofessional toward students, or simply reads his PowerPoint slides word for word. What am I supposed to do as a student aside from filling out that student evaluation?

Jim, Keith and Gary

This article actually does provide you links to the expert reports that have evaluated the evidence on SETs with respect to performance of students on teaching outcomes. Indeed, Stark is such an expert and has done a number of analyses of available data. As Kelly clearly pointed out, the research has quite consistently shown little or no evidence of a positive association between student evaluations and teaching outcomes, but does show extensive evidence of bias against women, visible minorities and faculty with “accents”. No analysis (or observational data set) is perfect (my work is in statistical and quantitative genetics, but I am familiar with the statistical methodologies used), but my (admittedly very casual) reading of the literature over the past decade is consistent with what was found in the report.

Not sure if this is super relevant, but I am a white(“ish”) male, and generally do very favourably on my student evaluations (at least for someone who teaches quantitative materials to biologists), so this whole system has played in my favour, but the evidence is clear that it probably had little to do with my teaching effectiveness!

Cheers

First, my apologies if my little smiley in the previous comment shows up for some readers as an enormous graphic – as it does for me, quite alarming.

Second, although the arbitrator’s decision refers specifically to using SETs to measure “teaching effectiveness” for purposes of promotion and tenure, the comment I was responding to (Kelly’s) makes a more blanket statement against SETs having value, period, which I’ll continue, for now, to argue is going beyond the evidence, even the most favourable evidence on the anti-SET side.

As to teaching effectiveness itself, Ian, I believe you are still overstating the case a little. Freishtat’s report indicates that there is “some debate in the literature” regarding SETs’ ability to measure teaching effectiveness, and adds “The consensus is that a teaching dossier is the ideal tool for assessing teaching effectiveness, incorporating SETs as part of a larger composite of one’s teaching.” Freishtat does extensively cover the many areas of concern with SETs, but does not go so far as to show evidence that they have no connection to teaching effectiveness… though this report does make note of Uttl et al.’s 2016 meta-analysis, which does report that. I haven’t read that meta-analysis myself. There’s some critique of it available, but of course, one study is one study. It may hold up, and then the case for ignoring SETs on teaching effectiveness will be very strong.

Stark goes much further, relying particularly on Boring et al. 2016 to say that the biases dwarf any intended measure, and with such variability as to make it impossible to control for the biases. Yet even Stark suggests retaining SETs for a very limited purpose (definitely not for evaluating teaching effectiveness), and with caution. So even this stronger report does not argue that SETs are necessarily worthless or meaningless.

Personally, I lean to the position that SETs are no good as measures of “teaching effectiveness”. But I find some of the comments in the original article off-putting, as they seem eager to replace one measure of debatable effectiveness with others (classroom observation) that come with their own set of caveats and that may be just as ineffective in practice. And I do suspect, despite the clear biases, that careful use of SETs can still say something of value, and serve a function besides protecting white male privilege.

I’m having a hard time finding the expert testimony of the opposing side; does anybody know of a link to a similar research summary in favor of SETs for high-stakes evaluations?

I just want to make a slight correction to this article. I did not make the comment that studies of course evaluations did not exist 15 years ago. What I did comment on was the undeniable fact that there was not as much research on course evaluations 15 years ago as there is now, in particular as it relates to bias. This comment was made in the context of how our long-standing dispute had evolved by the time it was actually arbitrated. When the grievance was initiated, our main focus was on the misuse of newly computed averages, but by the time it was arbitrated it had become much broader.

The key bit is:

“particularly as SETs migrated online and the numbers of students participating plummeted”

I saw the quantity and quality of student evaluations of my colleagues plummet when this happened. Powerful self-selection effects were manifest – students either loved or hated the professor, sometimes equal numbers of each. The 80% in the middle did not feel strongly enough to take the time and trouble to write an evaluation – only the extremes did – exactly the opposite of what you want for statistical significance.

Please read the full letter to see my surprise addition.
I am writing to express my full support for the removal of student evaluations of teaching (SETs). As a black professor with a commendable track record of student evaluations and numerous teaching awards over the past 50 years, I believe it is essential to share my perspective on this matter.
While student evaluations have been in place for a considerable time, I have observed that students often do not take SETs seriously. This was seen clearly when our students were told to complete the SET evaluations on their own over the next 5 days: the percentage of students who did the evaluation survey was pathetic, something like 5-10%. By contrast, when the administration previously came to the class on a day coordinated with the professor, while the students were captive, and asked for an evaluation, the response rate was almost 100%. Many students tend to view the evaluations as a means of retaliating against teachers for personal reasons, regardless of the quality of teaching provided. Allow me to illustrate this with an example from my own experience.
In one of my laboratory classes, while waiting for the students to arrive from lectures given by other professors teaching different subjects, I would inquire about their experience. To my surprise, over the past 15 years since the introduction of SETs, I began receiving peculiar responses. I distinctly recall a female student informing me that they intended to penalize a professor in the evaluation. When I asked for the reason, she explained that they had asked the professor a question about reviewing part of the previous week’s lecture. The professor responded that they only had one hour to cover the new material and suggested the student re-read the notes. Many students felt unjustly treated by this response, and this student swore to give the professor a zero in the evaluation. I was taken aback and remained silent, unable to intervene.
Furthermore, as a former senator at the University of Western Ontario, I attended a conference on student teaching evaluations organized by the vice president. The lecture hall was filled with educators, and I was disheartened to hear accounts from a distressed, distinguished female professor. She shared the terrible comments she and others received in evaluations, which were unrelated to teaching abilities. Instead, these comments focused on their dress, hair, and weight. Later, when I asked the vice president about measures being taken to address this harassment, the vice president provided no concrete response, merely evading the issue. Regrettably, I never received any report or update regarding the matter.
In many departments, SETs are utilized by chairpersons as a means of punishment during promotions or annual evaluations. I have also discovered that even if you are a crummy teacher, if you have published papers and obtained grants, your negative SET evaluations are overlooked, which is inherently unfair. I recall a dean who, before assuming the deanship, was widely criticized by students for his ineffective lecturing and very poor SET evaluations. However, this flaw did not hinder his appointment as a dean, and he continued his subpar teaching after assuming the position, without any consequences. Such occurrences undermine the credibility of SETs and make it clear that they do not consistently reflect the true competence of instructors.
Conversely, even if you receive excellent student evaluations, they may not significantly impact your promotion if you have not published. The message conveyed is that other factors are considered alongside SETs, thus rendering them less influential. This implies a lack of consistency and fairness in the evaluation process, leading me to believe that SETs should be abolished altogether.
However, I must acknowledge that there is a contradictory aspect to my argument. Having been educated in a different country as an undergraduate, I witnessed a system in which professors were indifferent to whether students understood the lectures or not. It simply did not matter to them; regardless of how poorly they performed, the school had no policy on good teaching. SETs have rectified this by encouraging professors to strive for better evaluations and listen to their students’ feedback, which is undoubtedly a positive development. Perhaps SET evaluations should be decoupled from promotion entirely and instead be linked to a monetary category. Based on the evaluation received, teachers could be assigned a certain percentage of additional funding unrelated to promotion.
In conclusion, as an experienced professor who has garnered accolades and recognition for teaching excellence, I wholeheartedly endorse the removal of student evaluations of teaching. The flaws and inconsistencies I have witnessed throughout my career and personal experiences highlight the urgent need for a more effective and fair assessment system. While SETs initially brought positive changes, their current implementation and influence on promotion and evaluations have become problematic. It is my sincere hope that we can reevaluate and reshape the evaluation process to better serve both students and educators.
Yours sincerely,
Thank you
Dr. K.A. Galil, Professor of Medicine & Professor of Dentistry (emeritus)
DDS, D. Oral & Maxillofacial Surgery, PhD, FAGD, FADI, Cert. Periodontist (University of Michigan) (Royal College (RCDSO))
Departments of Periodontics, Orthodontics and Clinical Anatomy
Schulich School of Medicine and Dentistry
University of Western Ontario
London, Ontario
http://www.drgalil.ca