Peer review appears to be a poor predictor of impact

David Kent breaks down an eLife article that suggests peer review scores cannot distinguish very good grants from excellent grants. In fact, at a certain point in the process, it is pretty much a random lottery.

This past week I came across an excellent article in the open-access juggernaut eLife (here and here) that made me worry a little. In “NIH peer review percentile scores are poorly predictive of grant productivity,” Fang et al. describe a retrospective analysis of more than 100,000 grants and their subsequent impact. Their findings seem to suggest that peer review scores cannot distinguish very good grants from excellent grants; the authors go so far as to suggest that at a certain point in the process, it is pretty much a random lottery. The lottery prize in this case is hundreds of thousands of dollars and many people’s research careers.

For non-aficionados of grant review, the basic process is for a researcher to put their ideas together in an application package that gets sent across the world for peer review by relevant experts, who each score it. Generally, these scores are collated by a central panel that then discusses the applications and decides which grants to fund. The scores of the original peer reviewers matter – those applications that get unanimously scored highly will likely get funded and those with unanimously low scores will not. The interesting portion is the upper middle section, where a grant falls just on one side or the other of the funding line.

Typically for the National Institutes of Health, this line in the sand has been drawn around the 20 percent mark; however, more recently it has regularly dipped below this rate. Historically it has been as high as 30 percent (pre-2004), but increased application numbers have made this success rate impossible to continue. This puts an obvious pressure on those grant applications that typically land between the top 10 percent and top 30 percent, and the final decision-making process is often shrouded in smoke.

This eLife article by Fang and colleagues focuses its efforts on the current pinch point (grants scored in the top 20 percent) and asks whether or not the reviewer score (i.e., which percentile the grant fell into) actually predicted more or less “successful” subsequent research. Obviously, the authors have selected a set of productivity (number of publications) and impact (number of citations) benchmarks, and some people might take issue with these (e.g., citation numbers vary by the size of a field, the number of publications is not necessarily a measure of the impact of each publication, etc.), but it is a very interesting exercise nonetheless. (See an explanation of the below figures here.)

The analysis of over 100,000 funded grants produced two major findings. The first is that the very top marks (i.e., grants falling in the top three percentile points) resulted in statistically measurable increases in the number of publications and citations of that work (~10 publications, ~500 citations over a five-year period, Fig. 1 re-posted from eLife). It would be difficult to argue that these grants (on average) did not deserve funding. The second, and the reason for this post, is that grants scoring between the top three percent and the top 20 percent showed effectively no difference in either the number of publications produced or the number of citations (again see Fig. 1). A further mathematical analysis of these data prompted the following shocker:

“only ~1 percent of the variance in productivity could be accounted for by percentile ranking, suggesting that all of the effort currently spent in peer review has a minimal impact in stratifying meritorious applications relative to what would be expected from a random ranking”

Makes you feel all warm and fuzzy inside as a peer reviewer, doesn’t it?
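To get a feel for what "~1 percent of the variance" looks like, here is a minimal sketch using entirely synthetic numbers. Nothing in it comes from the eLife dataset: the slope, the noise level and the helper names `r_squared` and `simulate` are all made up for illustration. It simulates grants in the top 20 percentiles whose output depends only weakly on their score, then computes the share of variance a straight-line fit can explain.

```python
import random


def r_squared(xs, ys):
    """Fraction of the variance in ys explained by a straight-line fit on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)


def simulate(n_grants=20000, slope=-0.1, noise_sd=5.0, seed=0):
    """Toy data (assumed values, not the study's): output falls slightly as
    the percentile score gets worse, but grant-to-grant noise dominates.
    Simulated counts can go negative; this is only a toy model."""
    rng = random.Random(seed)
    percentiles = [rng.uniform(0, 20) for _ in range(n_grants)]
    publications = [10 + slope * p + rng.gauss(0, noise_sd) for p in percentiles]
    return percentiles, publications


percentiles, publications = simulate()
r2 = r_squared(percentiles, publications)
print(f"Share of variance explained by percentile: {r2:.1%}")
```

With these made-up settings the share comes out at roughly one percent. The trend is real, in that better scores do produce slightly more on average, yet knowing a grant's percentile tells you almost nothing about how that individual grant will perform.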

At the end of the day, I guess this just adds some numbers to what my colleagues have been saying for years: once you’re in the top 20 percent bracket, the grants are all good, and the outcome then depends on things beyond your control (hard vs. easy external reviewers, a “champion” on the panel who sways the decision-making, etc.). It does make one wonder whether there might be a more intelligent way of deciding which grants in the top 20 percent to fund.

Again, Fang and colleagues have some worthwhile suggestions for how to take this information forward. They suggest that reviewer scores be used to identify the top 20 percent of grants, with the funding then distributed either randomly or based on the strategic priorities of the NIH. An extension of their data would suggest that the top three percent might be given funding without entering into such a “lottery,” while the rest (grants between the top three percent and the top 20 percent) would be decided in that manner. An interesting further layer that will probably remain buried in the depths of panel filing cabinets is an assessment of the more subjective role of a panel. For example, what happens to the grants that are “fought upwards” (or downwards) by a panel member? Perhaps this is the actual role for such panels: to distinguish “high quality” reviews from poorly prepared or unjustified reviews.
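For the curious, the modified lottery described above (fund the top three percent outright, then draw the remaining awards at random from the rest of the top 20 percent) might look something like the sketch below. This is my own illustration, not a procedure from the paper or from the NIH. The function name `modified_lottery`, the grant IDs and the number of awards are hypothetical, and I am assuming that a lower percentile means a better score.

```python
import random


def modified_lottery(applications, n_awards, auto_cutoff=3.0,
                     pool_cutoff=20.0, seed=None):
    """Pick grants to fund from (grant_id, percentile) pairs.

    Assumption: a lower percentile is a better score. Grants at or below
    auto_cutoff are funded directly; any awards left over are drawn at
    random from grants scoring between auto_cutoff and pool_cutoff.
    """
    rng = random.Random(seed)
    ranked = sorted(applications, key=lambda app: app[1])
    auto = [gid for gid, pct in ranked if pct <= auto_cutoff][:n_awards]
    pool = [gid for gid, pct in ranked if auto_cutoff < pct <= pool_cutoff]
    remaining = n_awards - len(auto)
    drawn = rng.sample(pool, min(remaining, len(pool)))
    return auto, drawn


# Hypothetical round: 100 applications, one per percentile, 10 awards.
applications = [(f"G{i:03d}", float(i)) for i in range(1, 101)]
auto_funded, lottery_funded = modified_lottery(applications, n_awards=10, seed=42)
print("Funded on score:", auto_funded)
print("Funded by lottery:", sorted(lottery_funded))
```

Swapping the random draw for a ranking by strategic priority would give the other variant the authors mention. Either way, the reviewer scores are only asked to do what the data suggest they can do: separate the very top from the rest of the fundable pool.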

Either way, this study should prompt some hard questions for organizations that invest a lot of time in organizing grant-review processes. Is it really worth taking days of researchers’ time for such panels? Depending on the scale and cost of organizing them, one might save enough resources to fund one or two more projects in each round.
