Inside NIA: A Blog for Researchers

Are impact ratings random?

Posted on January 29, 2014 by Robin Barr, Director of the Division of Extramural Activities.

With paylines being what they are at NIA and NIH, we tend to hear that discriminating among applications in the narrow range of scores represented by the top 20 percent of applications reviewed by the Center for Scientific Review is a lottery. Surely, it is too much to ask of peer review to discriminate reasonably among applications of such quality. So what sense, then, in drawing a payline? Why not hold a lottery instead? Or at least allow program and senior Institute staff considerable discretion in selecting priorities for award among this set.

Now most Institutes allow some discretionary zone in which staff apply Institute priorities to help determine funding. A few Institutes also have a slightly wobbly funding line, described as “most applications up to...” will be paid, or in some similar language. Perhaps we should encourage more such interpretation when funding lines tighten and review no longer discriminates effectively among these applications? But first, let’s try to find out whether review really is losing the power to discriminate among these top-scoring applications.

Does the close tie between certain criterion scores and impact rating still hold when considering only the very best applications?

We can’t judge absolutely whether reviewers are fairly assigning scores to applications where all are seen as reasonably strong. But I looked at data that allow us to see whether reviewers are consistently applying a standard within this set.

Here’s how I did it. Several authors (and an earlier post on this blog) have commented on the relationship between criterion scores and the impact rating given to an application. The Approach criterion usually shows a correlation of about 0.8 with the final impact rating. Significance and Innovation show substantial correlations, too, with Investigators and Environment bringing up the rear.  I asked whether these correlations would hold up as the sample range narrowed. Would the Approach, Significance, and Innovation criteria still show strong correlations with the final impact rating if only applications scoring at the 30th percentile or better are considered? How about the 20th percentile or better? Or even the 10th percentile or better? Keep in mind that a lower percentile score is better, of course, so applications at the 10th percentile scored even better than those at the 20th percentile.
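The band-by-band analysis described above can be sketched with synthetic data. The real NIA review data are not public, so the criterion weights, scales, and noise levels below are illustrative assumptions, not the actual relationships:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Hypothetical stand-ins for reviewer data: each criterion contributes
# to the final impact rating, with Approach weighted most heavily.
weights = {"Approach": 0.8, "Significance": 0.5, "Innovation": 0.4,
           "Investigators": 0.3, "Environment": 0.2}
criteria = {name: rng.normal(5, 1.5, n) for name in weights}
impact = sum(w * criteria[c] for c, w in weights.items()) + rng.normal(0, 2, n)

def band_correlations(pct):
    """Correlation of each criterion with the impact rating among
    applications at the given percentile or better (lower is better)."""
    keep = impact <= np.percentile(impact, pct)
    return {c: np.corrcoef(scores[keep], impact[keep])[0, 1]
            for c, scores in criteria.items()}

for pct in (100, 50, 20, 10):
    print(pct, band_correlations(pct))
```

Running this shows the same qualitative pattern as the figure below: correlations shrink as the band narrows, with Approach staying on top throughout.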

We should expect the criterion score-impact rating correlation to weaken when examining a smaller group of applications.

We all know that when discriminations become finer, correlations from the same test or standard grow weaker. So we expect correlations to fall as the range narrows. But do they disappear? Suppose our original supposition is true: namely, that the ranking by scientific quality among the narrow set of applications scoring better than the 20th or 10th percentile is random, or close to it. It then makes sense to expect the indicators of relative scientific quality—the criterion correlations—to drop to zero as the range narrows. Non-criterion factors would begin to crowd out the scientific quality differences—represented by the criterion scores—because the quality differences are now small to non-existent.
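The distinction between correlations that weaken and correlations that disappear can be illustrated with a toy model, assuming each impact rating is an underlying quality signal plus reviewer noise (all numbers here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical model: impact rating = underlying quality + reviewer noise.
quality = rng.normal(0, 1, n)
rating = quality + rng.normal(0, 0.6, n)

top10 = rating <= np.percentile(rating, 10)   # lower ratings are better

r_full = np.corrcoef(quality, rating)[0, 1]
r_band = np.corrcoef(quality[top10], rating[top10])[0, 1]

# Null comparison: if ordering within the band were a lottery, the
# quality-rating correlation there would hover near zero.
shuffled = rng.permutation(rating[top10])
r_lottery = np.corrcoef(quality[top10], shuffled)[0, 1]

print(f"full set: {r_full:.2f}, top-10% band: {r_band:.2f}, "
      f"lottery null: {r_lottery:.2f}")
```

Range restriction shrinks the correlation in the top band, but as long as real quality differences remain, it stays well above the near-zero value a lottery would produce.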

Non-criterion factors might be reviewer beliefs like these:

  • A bias against a particular school
  • “This investigator has too much money.”
  • “It is an A1 on his only grant. He really needs the money!”
  • “Early-stage investigators get special treatment!”

What do the data show? Reviewers’ criterion scores continue to show substantial relationships to impact rating, even at the very top of the range of possible scores.

The figure on this page shows the results: the correlation between mean reviewer rating on each criterion score and the final impact rating, computed within percentile bands. Each band (“40th,” for example) is the set of all applications at the 40th percentile and better; “whole set” is all reviewed applications. These are applications with NIA assignments reviewed at the Center for Scientific Review in fiscal year 2011 to fiscal year 2013. All figures are approximations.

  Criterion        Whole set   50th band   10th band
  Approach            0.85        0.80        0.63
  Significance        0.68        0.66        0.56
  Innovation          0.62        0.57        0.40
  Investigators       0.51        0.45        0.32
  Environment         0.43        0.38        0.31

What I found remarkable—OK, I live a very sheltered life!—is that the Approach criterion correlates 0.62 with the impact rating even among applications ranked within the 10th percentile. Significance is not far behind at 0.56. The other three criteria show much poorer correlations. In brief, on average, reviewers are meaningfully discriminating even among applications at the very top of the range on the Approach and Significance criteria. Their assessment is not random within this set. Judgments of scientific quality are not being crowded out.

A meaningful judgment is not necessarily a scientifically reasonable judgment. Similarly, if a second group of reviewers were to assess the applications, they might judge them differently, reflecting the possibility that individual preference, rather than a shared standard of scientific quality, determines ratings at this level. Still, the result is some evidence that meaningful discrimination happens even when all the science is strong.

Do you have thoughts or questions about this post? Let me hear them by commenting below.



Posted by Thomas Gill on Jan 29, 2014 - 2:09 pm

This is nicely done, but I am not persuaded. I'm not sure that the criterion scores are the best gold standard for judging the impact score. Reviewers may very well titrate their criterion scores so that they are concordant with their impact score, i.e. they are not independent of one another. I don't think that the high correlation at the low percentiles necessarily means that reviewers can distinguish among applications in the top 20%. This would require an independent set of reviews, as suggested in the next to last paragraph above.

Posted by Michael Kyba on Jan 29, 2014 - 2:15 pm

I think the only way to answer the question posed at the beginning of the blog (namely, are the differences between scores in the tight band of excellent grants in the upper 20% of scored applications meaningful) is to have a second set of peer reviewers score the same set of grants. I assume that with the billions the NIH allocates based on this system, there must have been some quality control assessments of this nature. I would love to hear the outcome. It would also be good to hear about what the NIH has done to track the actual impact of funded grants with regard to the citations that papers from this work ultimately received. How strong is the correlation between the percentile rank of funded applications and the number of citations to the papers produced by this funded work in following years?

Posted by Larry on Jan 29, 2014 - 3:37 pm

I have no strong objection to having a lottery for some applications, which merely amounts to an admission that the system is imperfect. But I do object to giving program staff an ever-larger role in deciding who gets funded. If reviewers, who have spent at least several hours reviewing each application (at least I do), cannot discriminate among the top 20% of applications, how can administrators who have spent much less time evaluating them? The more program staff overrule peer reviewers, the more inclined potential reviewers will be to simply say "No" or "Why bother?" We've all heard stories of grants rated significantly below the payline being awarded, and almost invariably it's some prominent, well-connected guy who called up the PO and made his case. Giving administrators a larger role can only make the system more politicized than it already is, or worse, result in simply giving grants to whoever can be the biggest bully.

Posted by Leigh on Feb 02, 2014 - 10:50 am

Right on!!

Posted by Vince Mor on Jan 29, 2014 - 7:45 pm

I'm actually encouraged by these analyses, since they lend support to the following recommendation: all grants should be reviewed independently by between 7 and 10 reviewers (depending on what we can afford), the top and bottom reviewers' scores dropped, and the meetings or even conference calls done away with altogether. The data suggest strong correlations, and this is based on only 3 reviewers' averages. The entire score structure would be more stable with more reviewers, and only a VERY small number of grants are actually fully realigned during discussion; with many more reviewers, the chance that a fatal flaw would be missed also declines. It's not that I don't want to see my colleagues at study section, it is just that it has become an increasingly unpleasant experience.

Posted by Charles DeCarli on Jan 30, 2014 - 9:12 pm

Dear Robin: These results are both refreshing and helpful. I have been on study section for years and have always worried that our precision has declined as the payline criteria became more restrictive. It is refreshing to see that your data continue to support the notion that strong applications continue to be identified even at the highest levels of payline restrictions. I also found it helpful to see that the approach remains the single most important (correlated) aspect of the grant despite emphasis on impact and innovation. Again, this suggests that good study design is the most critical piece to good science. Please keep up the good work. It helps all of us to see how different factors can influence successful applications.