Are impact ratings random?


NIA Blog Team


With paylines being what they are at NIA and NIH, we often hear that discriminating among applications in the narrow range of scores represented by the top 20 percent of applications reviewed by the Center for Scientific Review is a lottery. Surely it is too much to ask of peer review to discriminate reasonably among applications of such uniformly high quality. What sense, then, is there in drawing a payline? Why not hold a lottery instead? Or at least allow program and senior Institute staff considerable discretion in selecting priorities for award within this set?

Now, most Institutes allow some discretionary zone in which staff apply Institute priorities to help determine funding. A few Institutes also use a slightly wobbly funding line, described in language such as “most applications up to... will be paid.” Perhaps we should encourage more such flexibility when funding lines tighten and review no longer discriminates effectively among these applications? But first, let’s try to find out whether review really is losing the power to discriminate among these top-scoring applications.

Does the close tie between certain criterion scores and impact rating still hold when considering only the very best applications?

We can’t judge in any absolute sense whether reviewers are fairly assigning scores when all applications are seen as reasonably strong. But I looked at data that let us see whether reviewers are applying a consistent standard within this set.

Here’s how I did it. Several authors (and an earlier post on this blog) have commented on the relationship between criterion scores and the impact rating given to an application. The Approach criterion usually shows a correlation of about 0.8 with the final impact rating. Significance and Innovation show substantial correlations, too, with Investigators and Environment bringing up the rear.  I asked whether these correlations would hold up as the sample range narrowed. Would the Approach, Significance, and Innovation criteria still show strong correlations with the final impact rating if only applications scoring at the 30th percentile or better are considered? How about the 20th percentile or better? Or even the 10th percentile or better? Keep in mind that a lower percentile score is better, of course, so applications at the 10th percentile scored even better than those at the 20th percentile.
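For those who like to see the mechanics, here is a minimal sketch of that band-wise calculation. It assumes a hypothetical table with one row per application: a percentile column (lower is better), the final impact rating, and a mean score for each criterion. The column names, cutoffs, and use of pandas are my illustrative choices, not the actual analysis file or code.

```python
import pandas as pd

CRITERIA = ["approach", "significance", "innovation", "investigators", "environment"]

def band_correlations(df: pd.DataFrame, cutoffs=(50, 40, 30, 20, 10)) -> pd.DataFrame:
    """Pearson correlation of each mean criterion score with the final
    impact rating, within each percentile-or-better band.

    Expects hypothetical columns: 'percentile' (lower is better),
    'impact', and one column per criterion in CRITERIA.
    """
    out = {"whole_set": df[CRITERIA].corrwith(df["impact"])}
    for cut in cutoffs:
        band = df[df["percentile"] <= cut]  # e.g. 10 keeps only the top decile
        out[f"top_{cut}"] = band[CRITERIA].corrwith(band["impact"])
    return pd.DataFrame(out).round(2)
```

Each column of the result corresponds to one band, so a criterion whose correlation stays high from "whole_set" through "top_10" is still discriminating among the very best applications.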

We should expect the criterion score-impact rating correlation to weaken when examining a smaller group of applications.

We all know that when discriminations become finer, correlations from the same test or standard grow weaker. So we expect the correlations to fall as the range narrows. But do they disappear? Suppose our original supposition is true: namely, that the ranking in terms of scientific quality among the narrow set of applications scoring better than the 20th or 10th percentile is random, or close to it. It then makes sense to expect the indicators of relative scientific quality, the criterion correlations, to drop to zero as the range narrows. Non-criterion factors would begin to crowd out the scientific quality differences represented by the criterion scores, because those quality differences are now small to non-existent. (A toy simulation after the list below makes this expectation concrete.)

Non-criterion factors might include reviewer beliefs like these:

  • A bias against a particular school
  • “This investigator has too much money.”
  • “It is an A1 on his only grant. He really needs the money!”
  • “Early-stage investigators get special treatment!”
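As promised, here is that toy simulation of the range-restriction effect, entirely my own construction rather than anything from the NIA analysis. A latent quality signal drives both a criterion score and the impact rating, each observed with independent reviewer noise; restricting attention to the top-scoring applications attenuates the correlation, but it falls toward zero only if real quality differences themselves vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Latent scientific quality drives both scores; each is observed with
# independent reviewer noise. Lower scores are better, as at NIH.
quality = rng.normal(size=n)
criterion = quality + rng.normal(scale=0.5, size=n)
impact = quality + rng.normal(scale=0.5, size=n)

def corr_among_top(frac: float) -> float:
    """Criterion-impact correlation among the best `frac` of applications."""
    keep = impact <= np.quantile(impact, frac)
    return float(np.corrcoef(criterion[keep], impact[keep])[0, 1])

print(f"whole set: {np.corrcoef(criterion, impact)[0, 1]:.2f}")  # 0.80 by construction
print(f"top half : {corr_among_top(0.5):.2f}")   # attenuated, but well above zero
print(f"top tenth: {corr_among_top(0.1):.2f}")   # attenuated further, still not zero
```

If you instead make quality effectively identical within the top band, the within-band correlation collapses toward zero, and that collapse is exactly the signature the real data would show if rankings at the top were random.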

What do the data show? Reviewers’ criterion scores continue to show substantial relationships to the impact rating, even at the very top of the range of possible scores.

The figure on this page shows the results: the correlation between mean reviewer criterion score and final impact rating, calculated within each percentile band of applications. A band such as “40th” is the set of all applications at the 40th percentile or better; “whole set” is all reviewed applications. These are applications with NIA assignments reviewed at the Center for Scientific Review in fiscal year 2011 through fiscal year 2013. The approximate values:

[Figure: Correlation between mean reviewer criterion score and final impact rating, by percentile band.]

  Criterion        Whole set   50th band   10th band
  Approach             0.85        0.80        0.63
  Significance         0.68        0.66        0.56
  Innovation           0.62        0.57        0.40
  Investigators        0.51        0.45        0.32
  Environment          0.43        0.38        0.31

What I found remarkable (OK, I live a very sheltered life!) is that the Approach criterion correlates about 0.62 with the impact rating even among applications ranked within the 10th percentile. Significance is not far behind at 0.56. The other three criteria show much weaker correlations. In brief, on average, reviewers are meaningfully discriminating even among applications at the very top of the range on the Approach and Significance criteria. Their assessment is not random within this set. Judgments of scientific quality are not being crowded out.

A meaningful judgment is not necessarily a scientifically reasonable judgment. Similarly, if a second group of reviewers were to assess the same applications, they might judge them differently, reflecting the possibility that individual preference, rather than a shared standard of scientific quality, determines ratings at this level. Still, the result is some evidence that meaningful discrimination happens even when all the science is strong.

Do you have thoughts or questions about this post? Let me hear them by commenting below.

 

Read Next:

The Approach criterion: why does it matter so much in peer review?
