
Do universities put too much weight on student evaluations of teaching?

Research suggests that student evaluations of teaching are often badly designed and used inappropriately. But change is underway.

Diane Peters

March 06, 2019


Bob Uttl’s first academic job seemed to be going well. As an assistant professor in psychology at Oregon State University, hired in 1999, he was publishing regularly and developing a good rapport with students, even those taking his challenging courses in research methods and psychometrics. “Some students were doing great and some of them not so great,” he recalls.

Then it all fell apart. Dr. Uttl was denied tenure. A student who had been removed from his course for academic dishonesty had written a letter saying she could not understand his accent – Dr. Uttl comes from the former Czechoslovakia. A colleague’s peer review claimed the font size used in a handout showed a “disdain” for students, and he found out later that this same colleague didn’t want to hire him in the first place. The tenure and promotion committee cited his student evaluations of teaching (SETs) scores as grounds for dismissal.

“Your career goes up in smoke,” says Dr. Uttl. He fought the dismissal and a student petitioned to have him reinstated. A federal court ruled in his favour in 2005 for several reasons, including the fact that the dean and others could not explain why his SETs scores, which hovered around the department average, were deemed not good enough.

Dr. Uttl was retroactively granted tenure and promotion, but the damage was done. No one would hire him and his U.S. work visa got cancelled. He landed an academic job in Japan and then sessional work in Red Deer, Alberta, a long commute from his new home in Calgary, where his Canadian wife had a job. But things worked out eventually: he is now a professor of psychology at Mount Royal University and, as an ongoing research interest, studies SETs.

He and others have been saying for some years that Canadian universities put too much weight on SETs (which go by a range of other names, including student surveys, student ratings of instruction or, simply, course evaluations). They’re being used to size up departments, rehire or fire part-time instructors, and inform tenure and promotion decisions.

“The literature has always said student evaluations should be used in the context of peer evaluation and self-evaluation, they should not have too much weight put on them,” says Brad Wuetherick, executive director of the Centre for Learning and Teaching at Dalhousie University. “But they’re being used on their own, unfortunately, because it’s a number. It’s an easy thing to understand.”

What’s more, research over the past decade or so has shown serious flaws with SETs: “They don’t measure teaching effectiveness and they are subject to all sorts of biases,” says Dr. Uttl. However, this narrow approach of equating questionnaire scores with teaching skill may be coming to an end at Canadian universities. An arbitration decision handed down last June in a dispute between Ryerson University and the Ryerson Faculty Association backs up Dr. Uttl’s assertions.

The arbitrator, William Kaplan, concluded that while SETs may be good at “capturing the student experience” of a course and its instructor, they are “imperfect” and “unreliable” as a tool for assessing teaching effectiveness. He declared that the university’s faculty course surveys could no longer be used to measure teaching effectiveness for promotion or tenure, the crux of the dispute.

As a result of the ruling, universities across the country are now looking at whether they need to revamp their student survey practices and connected human resources policies. It’s hard to fail a survey, but Canadian universities pretty much just did.

In the ivory tower past, university professors taught and students learned – or did not learn, as the case may be. Bad teachers sometimes got flagged and were nudged out, but many got to keep droning away at the lectern until retirement. As the idea of student-centred learning emerged, students increasingly saw themselves as active participants in their education. They wanted an official say and SETs offered a sanctioned means to share insights on the classroom experience.

In the 1980s, professors and university departments started experimenting with student surveys. By the early 1990s, university administrators began taking them over, standardizing questions and running them for all faculty. Those that resisted adopting this approach were often met with a student response. As an undergraduate, Mr. Wuetherick was part of the push for SETs at the University of Alberta in 1994. “We were advocating for the student voice to be taken seriously,” he recalls.

“The original intent was admirable,” says Gavan Watson, director of the Centre for Innovation in Teaching and Learning and the associate vice-president, academic, teaching and learning at Memorial University. “There was a sense that it was important to understand what the student experience of being in the course was like.”

As governments pushed universities to be more transparent as a condition for funding, student evaluations “satisfied the growing administrative emphasis on accountability,” says Jeff Tennant, associate professor in the department of French studies at Western University. Dr. Tennant is also the chair of the Ontario Confederation of University Faculty Associations’ collective bargaining committee and a member of its working group on student questionnaires.

The widespread integration of SETs was bolstered by a 1981 meta-analysis by Peter Cohen of Dartmouth College that found “strong support for the validity of student ratings [of instruction] as a measure of teaching effectiveness.” Subsequent research in the following years backed up Dr. Cohen’s findings, says Dr. Uttl. Then, as universities increased their computing power and interest in data, they began producing spreadsheets and charts with easy-to-digest numbers – and raised some early warning flags.

Ryerson computer science professor Sophie Quigley, who was the grievance officer for the faculty association when it launched its complaint against the university in 2009, says concerns about the university’s faculty course surveys were long-standing but worsened in 2007. “It got moved to an online format, and this is when it started getting used in a very different way,” she says. “The university introduced a bunch of averages that were not in place before.”

Ryerson uses a Likert scale, with questions such as, “The instructor is knowledgeable about the course material,” allowing a response of 1 for “agree” to 5 for “disagree.” The university would then use those numeric responses to create average scores. “The numbers from one to five, they’re just labels. They should not be made into averages,” says Dr. Quigley.

Ryerson also produced more complex datasets from their student surveys, but they were harder to interpret, says Dr. Quigley. “Averages are a single number so people naturally liked that better.” She and her colleagues started seeing these bad-math averages used in reports to rank university departments and to decide if people got tenure. “These averages got a life of their own.”
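Dr. Quigley's objection to averaging can be illustrated with a small sketch. The responses below are invented for illustration and are not Ryerson data; the function name `summarize` is likewise hypothetical. Two classes with very different response patterns end up with the same average, which is one reason a single number can mislead.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical responses on a 1-5 scale (1 = agree, 5 = disagree).
# These values are made up for illustration only.
polarized = [1] * 10 + [5] * 10   # half the class agrees, half disagrees
lukewarm = [3] * 20               # every student picks the middle option

def summarize(responses):
    """Return the mean, the median and the full distribution of responses."""
    counts = Counter(responses)
    return {
        "mean": mean(responses),
        "median": median(responses),
        "distribution": {level: counts.get(level, 0) for level in range(1, 6)},
    }

polarized_summary = summarize(polarized)
lukewarm_summary = summarize(lukewarm)

print(polarized_summary)
print(lukewarm_summary)
```

Both classes report a mean and a median of 3, yet only the distributions show that one class was split and the other was uniformly neutral.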

At many universities, a teacher’s skill set was being boiled down to a few numbers and compared to a standard that was often arbitrary and seldom standardized. Instructors falling below a certain average were declared bad teachers. “Your effectiveness as a teacher, your teacher score, became your SET score,” says Dr. Uttl. “People have been fired over this, their entire careers demolished.”

Beyond simplistic SETs scores, there’s more: newer research shows that these questionnaires don’t measure teaching effectiveness all that effectively. Many earlier studies were conducted by researchers affiliated with companies selling surveys to universities, says Dr. Uttl. As well, the sample sizes were often small for studies showing statistically significant correlations between SETs scores and student success. In a 2017 report, Dr. Uttl does the math again on a range of old studies, taking into account study size, plus students’ prior knowledge and ability, and finds “no significant correlations between the SET[s] ratings and learning.”

Also, in terms of SETs scores, large classes at inconvenient times rank poorly, as do lower-year undergraduate courses compared to upper-level and graduate ones. Dr. Uttl conducted a 2017 study that linked poor scores with quantitative courses. Other studies have shown that compulsory courses fare worse than optional ones. “Often, with difficult courses, it’s only with time that you get to reflect and understand what the value of a different learning experience was,” says Dr. Watson at Memorial.

An instructor’s gender, age (either too close to students in age, or much older), attractiveness, ethnicity and accent also influence scores. However, Dalhousie’s Mr. Wuetherick warns that the effect is not overly strong for these factors. “Class size and class level explain more of the variance than a whole bunch of the others combined,” he says.

Nevertheless, those who dig deep into SETs suspect there are even more factors influencing which boxes students tick. For instance, Dr. Uttl says how well a student learned the material in a prerequisite course will inevitably impact their perceptions of the follow-up. “If you nearly fail the first statistics course, you will do poorly in the next course,” he says.

A recent study out of Germany concluded that the availability of chocolate cookies during an academic course session affects the evaluation of teaching. The authors write, seemingly without irony, that the findings “question the validity of SETs and their use in making widespread decisions within a faculty.”

Meanwhile, many of these student evaluations don’t use best practices in survey design and include ill-phrased or improper questions. For instance, instructors at Ryerson requested that the university omit the question which asked students if the teacher was “effective,” a nebulous concept. Many surveys ask if the instructor is knowledgeable about the subject matter, a patently inappropriate question, particularly for undergraduates, says Dr. Uttl. “Students cannot possibly tell you whether you are knowledgeable. They don’t know the field,” he says.

Too few students are filling out the surveys as well. Research suggests that, while as many as 80 percent of students will fill out paper surveys, this drops to 60 percent or less online. The University of Toronto reports that the response rate for its online surveys is about 40 percent. Dr. Quigley says the response rates on Ryerson’s faculty course surveys used to be around 60 percent but gradually declined to 20 percent after going online. “They become more and more meaningless,” she says.

The Ryerson arbitration decision has already caused ripples. Late last fall, the Ontario Undergraduate Student Alliance issued a report that, among other things, calls for the Higher Education Quality Council of Ontario to set standards for SETs, and for universities to balance scores with peer review for hiring and promotion. Student ideas shared at the organization’s general assembly helped provide the content for the report, which was written by Kathryn Kettle, vice-president of policy and advocacy with the Students’ General Association at Laurentian University. “Students really care about this issue, and the fact that biases are inherent in surveys,” she says.

More recently, Dr. Tennant and OCUFA’s working group released a much-anticipated report in February, just prior to University Affairs going to press. It concludes that student questionnaires on courses and teaching – its preferred term – “fail to accurately reflect teaching quality” and should be used for formative purposes only, not for summative uses or for stand-alone performance evaluation.

Even before the Ryerson decision, many universities had begun changing their student evaluation systems to better address their inherent flaws. “If you place value on the student voice about their experience, it behooves you to collect that voice in a way that’s as effective as possible,” says Susan McCahan, vice-provost, academic programs, and vice-provost, innovations in undergraduate education, at U of T. The university has been rolling out newly designed course evaluations to include six standard questions – all written using survey design best practices – plus two open-ended qualitative comment questions. Faculties and departments can add in other questions, too.

Similarly, Western University’s new system lets individual instructors add two supplementary questions for its student questionnaires of courses and teaching. They can choose from 45 questions in nine categories, allowing for specialized feedback on technology use, online classes, tutorials and labs. “Then they’re getting the information that, in theory, they want,” says Dr. Watson, who helped develop the new system in his previous role at Western before moving to Memorial in 2018.

Western also created a web portal called Your Feedback to help students, instructors and university staff to better understand the questionnaires. “We took it as an opportunity to make explicit some of the implicit assumptions around this data,” says Dr. Watson.

Many universities are also getting smarter about how they use their SETs data. U of T has been turning its survey results into a huge database since 2012 and now has hundreds of thousands of data points. It used some of these data to run a validation study, which it published in 2018, to track how things like class size, gender and year of study impacted results. The study found that class size had the most effect on scores. U of T expects to use this information as an educational tool for various groups, including tenure and promotion committees. “We want to help people interpret the [student survey] data in a more nuanced way,” says Dr. McCahan.

Mr. Wuetherick, as well, is analyzing SETs data at Dalhousie to understand what factors impact scores. He says hidden factors inevitably influence results and it may be impossible to entirely tease them all out. His team at Dalhousie is working to develop a framework by this summer to guide tenure and promotion committees around the use of SETs data, and how to integrate it with peer evaluation and other information in a teaching dossier. “We’ve gotten away with being collectively lazy as a system in how we evaluate teaching,” he says.

But guiding instructors on how to put together an effective and comprehensive dossier and then, in turn, supporting administrators in how to interpret this information is a big ask. The University of Calgary’s Taylor Institute for Teaching and Learning has created a Teaching Philosophies and Teaching Dossiers Guide that can help. It’s 50 pages long.

“It’s going to take more time to collect the evidence and it’s going to take more time to understand what effectiveness looks like,” says Dr. Watson. Meanwhile, as universities dig into more sophisticated means of measuring and showing teaching effectiveness, this raises additional questions. “The definition of effective teaching varies widely,” notes Dr. Uttl.

Setting a standard around good teaching, accepting what can and cannot be measured, and understanding the biases of students and faculties all give universities much to examine. And they need to do it now, before additional legal or even human rights challenges come down – Dr. Uttl says he would not be surprised to see a class-action lawsuit sometime in the near future. Adds Dr. Watson, “Given the complexity of what teaching is like and what learning is like, there ought to be a variety of data collected, a variety of evidence that’s presented that helps describe what the experience of being in a class is like.”

Diane Peters is a Toronto-based writer and editor.

18 Comments

Any professor worth his or her salt who has taught for any period of time knows that student evaluations, in terms of both written comments and scores, are, in most cases, an accurate reflection of one’s ability to convey material, provide feedback, and work hard for the benefit of students. Use any system of scoring you want, and professors who cannot teach, provide little feedback, and are unavailable for students will still score poorly. Those who can teach well and take the job seriously will score significantly higher. If professors score poorly, they need to stop blaming their evaluation scores on the grades they give, gender differences, and other lame excuses. The evaluations are telling the professor point blank what is wrong, and in most instances, it’s their inability to teach, their own lack of effort, and their unapproachable nature that students remember. As for a drop in electronic evals, professors are not doing them at the start of class as they would with paper evals. Some are doing them at the end of class as students leave, or they are telling students to do them on their own time. That will definitely result in a drop in evals. Think about it. If you do them at the start of class for 15-20 minutes like paper evals, the attendance rate should be the same.

Did you actually read the article? And see how it cites actual evidence about how these evals don’t really do what they are claimed to do?

As one of those professors worth my salt I know that I am an effective and very good teacher and have been for 20 years. I also know that my evaluations are of dubious value. And I know, for example, that my female colleagues, also worth their salt, get lower evals because of bias and that my international colleagues with accents get lower evals and that my colleagues of colour suffer similar fates. And I know that because I assign a lot of reading and the class is hard I get complaints from my students.

I just know it.

I am pleased to see these successful challenges. Course “opinionaires” as I have sometimes called them can occasionally convey some useful information (for example, as cited in the earlier comment on this post by Stuart Chambers), notably with repeated and consistent extreme scores. The stated purpose (e.g., improvement versus career decisions or course choice) can also impact ratings. CAUT has been aware of these issues since at least the 1970s. In 1986 it published a revised edition of an alternative approach to SETs, namely the Teaching Dossier.

In addition to the instructor’s actions and efforts, however, there are many institutional contributors to the quality of instruction, ranging from staffing and admissions (hence class size), availability of teaching-support services, support for and training of teaching assistants, quality-location-number of classroom spaces, bookstore operations, student services in support of quality learning and student health, and more. The total responsibility for the quality of instruction and learning does not lie solely with instructors, whereas the SET approach risks pretending that it does.

The Teaching Dossier reference is:

Shore, B. M., Foster, S. F., Knapper, C. K., Nadeau, G. G., Neill, N., & Sim, V. (1986). Guide to the teaching dossier, its presentation and use (rev. ed.). CAUT Bulletin, 33(2) Suppl. (4 pp. of 4 col.); also in French in the same issue.

Student input is valuable in the evaluation of instruction, but there are many better ways to include it.

Bruce M. Shore, Emeritus Professor, McGill University
Former Chair (1973-1985), CAUT Teaching Effectiveness Committee

Systemic bias exists everywhere; universities are not somehow magically exempt from the problems that plague the rest of society. Not only have I read the research, but I have seen with my own eyes comments about weight, fashion sense, racialization, ethnicity, gender identity, sexual orientation, religious identity. None of these have anything to do with pedagogy or a person’s ability to be an effective educator. It’s why McGill worked so hard to create a process for identifying and addressing bias and discrimination in course evaluations. It is not a perfect solution, but it is a step in the right direction.

Students are just as imperfect as the rest of us. If there are “Jordan Petersons” standing at the front of the classroom, you can be sure some of them are sitting in the seats.

So if professor A receives a failing evaluation 10 times in 10 different courses of various sizes and undergraduate levels, but professor B receives 10 straight stellar evaluations, professor A may want to learn from the comments in those evaluations and do some soul-searching before ranting about being a victim of this or that student bias.

See also Youmans, R. J., & Jee, B. D. (2007). Fudging the numbers: Distributing chocolate influences student evaluations of an undergraduate course. Teaching of Psychology, 34(4), 245-247.

‘Student evaluations provide important information about teaching effectiveness. Research has shown that student evaluations can be mediated by unintended aspects of a course. In this study, we examined whether an event unrelated to a course would increase student evaluations. Six discussion sections completed course evaluations administered by an independent experimenter. The experimenter offered chocolate to 3 sections before they completed the evaluations. Overall, students offered chocolate gave more positive evaluations than students not offered chocolate. This result highlights the need to standardize evaluation procedures to control for the influence of external factors on student evaluations.’

Gavin Moodie, does the type of chocolate offered matter? Do you get lower evaluation scores if you offer students Laura Secord instead of Lindor? 🙂

Well hopefully this sort of evidence-based research will put an end to these sorts of outraged reports based on nothing but anecdotes.

http://www.activehistory.ca/2017/03/shes-hot-female-sessional-instructors-gender-bias-and-student-evaluations/

I routinely see it stated as obvious that students cannot evaluate the instructor’s knowledgeability about the course material, and in searching for evidence now, I find only unsupported assertions in the sources used. I wonder about this, for a couple of reasons:
1. I dimly recall a study, perhaps 20 years ago, supporting the opposite position – but of course I can’t find it, and for all I know it was equally weak, evidentially.
2. Lacking data, I fall back on anecdote; it seemed pretty obvious to student-me that some instructors were barely keeping on top of the material and fell apart if any questions were asked, while others knew the material backwards, forwards, and then some.

While I’m on the subject of anecdote, I’ll also note, despite the discomfort around the whole issue, that an instructor’s accent can of course be a factor in effectiveness of instruction. So can mumbling, talking to the board, or distracting mannerisms, and I’d take consistent student comments on these things to heart (and I have).

There are clear, demonstrated problems with SETs, although the usually-suggested alternatives seem to lack good supporting evidence as well. But personally I’m not convinced that the evidence gets us from “having serious problems” to “having no value at all”, and yes, this is considering the demonstrated biases as well.

The problem with any study is simple: It cannot tell you the percentage of emphasis students place on any given characteristic. If a professor hands out chocolate and another one does not, does this account for a 1% difference in the student evaluation mark or a 10% difference? Is the other 90-95% of a professor’s evaluation score decided by his/her mastery of the course? What exact percentage does race or gender play (if any) if the professor is the best lecturer in the department? Perhaps skill transcends superficial traits. If one professor provides a C+ average in the same course as another professor who gives a B average, how is it that sometimes the professor giving the lower grade scores higher on evaluations? Maybe, just maybe, students are more savvy than we think.

Stuart, I stopped reading my course evals after my second year of teaching when I got the comment that I had a “nice rack”.

What a worthless way to treat educators.

‘But personally I’m not convinced that the evidence gets us from “having serious problems” to “having no value at all”, and yes, this is considering the demonstrated biases as well.’

So well said Gray! Too many voices in this conversation seem to want to take an extreme opinion or blow things all out of proportion.

Casey, you had one insensitive comment; that’s anecdotal. Surely, you learned something about other comments (possibly hundreds over the years) and your overall scores.

Thanks for being a voice of reason Stuart. If people wanted more than Stuart’s commonsense response here (e.g., that students can observe teachers multiple times over a long period of time and draw some conclusions about whether they are good teachers, especially when such judgments are aggregated across multiple groups of students and multiple offerings), they can look at a number of reviews of the course evaluation literature by people who actually have studied teaching and the use of course evaluations. The critical thing for people to keep in mind is that it is always possible in areas of complex human behavior to find studies discrepant with the overall research literature.

It is important to note as well that the Ryerson decision was based, for some inexplicable reason, on reviews of the literature carried out by people who are strongly opposed to their use. Apparently, no review was requested from researchers who have studied the area and come to quite different conclusions. Anyway, here are some general summaries of the literature by people from the other side.

https://www.stlhe.ca/wp-content/uploads/2011/07/Student-Evaluation-of-Teaching1.pdf

https://www.ideaedu.org/Portals/0/Uploads/Documents/IDEA%20Papers/IDEA%20Papers/PaperIDEA_58.pdf

https://www.ideaedu.org/Portals/0/Uploads/Documents/IDEA%20Papers/IDEA%20Papers/PaperIDEA_50.pdf

https://www.tandfonline.com/doi/pdf/10.1080/02602938.2011.563279?needAccess=true

Upon being promoted to full professor in 1992, entirely on success in research, I decided to go to work on teaching, which I had neglected relative to research. I used a professionally developed instrument from UMass Amherst, appropriate to courses in statistics. This was a new experience for the students, who had never before been asked to evaluate a course. I found that I was good at organization and at communicating what was expected (ca 4 on Likert scale), but poor (3 or 2 on Likert scale) at motivating the material and at delivery. I set to work and within 2 years raised the scores to 4s and 5s in all categories on the UMass instrument.
Subsequently my university institutionalized a one size fits all course evaluation questionnaire (CEQ). I went along with the program, even though I considered it useless because it was delivered at the end of term. Which means no opportunity for action by the prof, to the benefit of students. Instead I began using a 2 question format at the end of class a month into the term: What was the best thing about this course? What needs to be improved? I found this quite useful if I tabulated the responses, reported them back to the class, and took immediate action on items arising repeatedly.
The CEQ at my university wandered into irrelevance when my university began compiling quantile scores that were reported to the Department Head and the Dean’s office. I noticed that my scores remained the same (4s and 5s) but my courses dropped to the median. Why? Well, because I teach statistics, I could see why. The courses I was delivering began accounting for more than half the responses used in computing the median as the enrolments increased. Which brings to mind the usual Lies, Damn Lies sequence (attributed incorrectly to Mark Twain or Benjamin Disraeli).
I also noticed that students were growing tired of being asked for assessment of every course every year. I saw no reason for assessment of every course every year at the end of the term so I ceased to encourage students to do the end of term CEQ.
Here are my thoughts.
1. Students can evaluate delivery as well as anyone. By definition they cannot evaluate content.
I think the same applies to colleagues. How many can evaluate content?
Are colleagues any better than students at evaluating delivery? I doubt it.
2. Student evaluations at the end of the term are a waste of time. There’s nothing in it for the student.
3. If you want to improve your teaching, try the 2 question anonymous evaluation one month into the term, then act on it immediately.
~David Schneider
Memorial University
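The median effect described in the comment above can be sketched with a few invented numbers. The comment does not say how the quantile scores were computed, so this assumes, hypothetically, that an instructor's standing is judged against the median of all pooled responses; the data and the function name `relative_standing` are illustrative only.

```python
from statistics import median

# Hypothetical Likert responses (5 = best), invented for illustration.
other_courses = [2, 3, 3, 3, 4]            # responses for everyone else
instructor_small = [4, 4, 5]               # instructor's responses, small enrolment
instructor_large = [4, 4, 4, 4, 5, 5, 5]   # same 4s and 5s, larger enrolment

def relative_standing(own, others):
    """Instructor's median minus the median of all pooled responses."""
    pooled_median = median(own + others)
    return median(own) - pooled_median

small = relative_standing(instructor_small, other_courses)
large = relative_standing(instructor_large, other_courses)

print(small, large)
```

With a small enrolment the instructor sits half a point above the pooled median. Once the same 4s and 5s make up more than half of the pool, the pooled median falls inside the instructor's own responses, so the gap shrinks to zero even though the scores themselves have not changed.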

@Stuart Chambers – I think you are making a HUGE number of assumptions here.

First assumption is that you’re not considering diversity of educational contexts; in China/HK/Malaysia/South Korea, teachers are often teaching classes that students are forced to take rather than wanting to take; this seriously impacts student evaluations of both the course and the teacher.

Second assumption is that feedback is even remotely relevant; having struggled to get students to provide legitimate feedback within my own teaching, I invariably wind up with irrelevancy in commentary. Students who put in literally zero effort will complain the class was boring.

Third assumption is that students can (and will) recognize and gauge the teacher’s efforts. I know teachers who routinely put in 10+ hours a week outside of class to meet with students and give them additional and excellent feedback, and this never makes it into their evaluations. Some students simply don’t dwell on the positive, but focus on the negative.

Fourth assumption is that the evaluations are telling the professor what is wrong with them and NOT what is wrong with the students – again, the evaluations are as biased as can be, and do not necessarily reflect (with any degree of accuracy) what the prof is like. I defy anyone to step into my classroom and actually observe me in action, then compare it to my course and teacher evaluations – and I have no doubt any strict teacher would prefer an observation from the department head over student commentary.

Fifth assumption is that students won’t use this as a weapon against a teacher they don’t like. Case in point, a colleague of mine had a student show up 18 minutes late for a group assessment which he had no choice but to start if he was going to complete it; the student had no excuse, no reason to be late, so he refused to give her a make-up examination in keeping with University policy. How do you think she’s going to mark him when she does her evaluation? You’re fooling yourself if you say that she’ll be impartial.

Until course and teacher evaluations meet the same criteria as teacher evaluations of students, they are bollocks. The minute students have to put their name on it, the minute students can be called in to answer for their assessments in the same way a teacher has to, I have literally zero use for them. If I want an evaluation of my teaching, I will look at the student’s marks and development over the term and how they fared at the end. THAT is MY evaluation.

Robin Dahling, you are making a HUGE number of excuses!

First, students at any university take courses that are compulsory. The professor’s job is to make them interesting and to engage students. Any prof worth his/her salt can take any course and make students want to attend.
Second, in 30 years of teaching, with rare exceptions, all I have seen are relevant, intelligent comments. I guess you had bad luck.
Third, students can gauge a prof’s efforts when those efforts translate into outcomes. I have seen profs work hard, but they cannot translate that knowledge and effort into the classroom. Basically, no matter how hard they try, students are not understanding them, or the class is boring. Effort does not always equal results.
Fourth, the students are the best indicator of how a prof performed in a given semester. The prof judges the students with an overall grade, but the evaluations are part of the job. A department head is only one person, but when 100 students judge you, that’s a real test of your mettle.
Fifth, a tiny minority of disgruntled students won’t skew the mark. If 100 students give the prof a failing grade, the last thing you will want is the department head seeing you teach.
Sixth, students do not receive a “safe space” for their ideas; likewise, profs do not receive a “safe space” from judgement by students. Welcome to university!

Just going to leave these references here. Specific to medical faculty, but relevant nonetheless.

Rannelli L, Coderre S, Paget M, Woloschuk W, Wright B, McLaughlin K. How do medical students form impressions of the effectiveness of classroom teachers? Med Educ. 2014;48:831-37.

Scheepers RA, Lombarts KMJMH, van Aken MAG, Heineman MJ, Arah OA. Personality traits affect teaching performance of attending physicians: Results of a multi-centre observational study. PLOS ONE. 2014;9:e98107.

Hessler M, Popping DM, Hollstein H, Ohlenburg H, Arnemann PH, et al. Availability of cookies during an academic course session affects evaluation of teaching. Med Educ. 2018;52:1064-72.

Ladha M, Bharwani A, McLaughlin K, Stelfox HT, Bass A. The effects of white coats and gender on medical students’ perceptions of physicians. BMC Med Educ. 2017;17:93.

Morgan HK, Purkiss JA, Porter AC, Lypson ML, Santen SA, Christner JG, Grum CM, Hammoud MM. Student evaluations of faculty physicians: Gender differences in teaching evaluations. Journal of Women’s Health. 2016;25:453-56.

Fassiotto M, Li J, Maldonado Y, Kothary N. Female surgeons as counter stereotype: The impact of gender perceptions on trainee evaluations of physician faculty. Journal of Surgical Education. 2018;75:1140-48.
