Are impact ratings random?
With paylines being what they are at NIA and NIH, we tend to hear that discriminating among applications in the narrow range of scores represented by the top 20 percent of applications reviewed by the Center for Scientific Review is a lottery. Surely, it is too much to ask of peer review to discriminate reasonably among applications of such quality. So what sense, then, in drawing a payline? Why not hold a lottery instead? Or at least allow program and senior Institute staff considerable discretion in selecting priorities for award among this set.
Now most Institutes allow some discretionary zone where staff apply Institute priorities to help determine funding. A few Institutes also have a slightly wobbly funding line which is described as “most applications up to...” will be paid or some similar language. Perhaps we should encourage more such interpretation when funding lines tighten and review no longer discriminates effectively among these applications? But first let’s try to find out whether review really is losing the power to discriminate among these top-scoring applications.
Does the close tie between certain criterion scores and impact rating still hold when considering only the very best applications?
We can’t judge absolutely whether reviewers are fairly assigning scores to applications where all are seen as reasonably strong. But I looked at data that allow us to see whether reviewers are consistently applying a standard within this set.
Here’s how I did it. Several authors (and an earlier post on this blog) have commented on the relationship between criterion scores and the impact rating given to an application. The Approach criterion usually shows a correlation of about 0.8 with the final impact rating. Significance and Innovation show substantial correlations, too, with Investigators and Environment bringing up the rear. I asked whether these correlations would hold up as the sample range narrowed. Would the Approach, Significance, and Innovation criteria still show strong correlations with the final impact rating if only applications scoring at the 30th percentile or better are considered? How about the 20th percentile or better? Or even the 10th percentile or better? Keep in mind that a lower percentile score is better, of course, so applications at the 10th percentile scored even better than those at the 20th percentile.
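For readers who want to try this kind of check themselves, here is a minimal sketch of the procedure in Python. It does not use the NIA data. It simulates criterion and impact scores with an assumed overall correlation of 0.8 (chosen only to echo the Approach figure above) and assumes both are normally distributed. The function names `simulate_scores` and `correlation_within_top` are hypothetical and were made up for this illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_scores(n, rho, seed=0):
    """Return n simulated (criterion, impact) pairs.

    Assumption for illustration only: both scores are standard normal
    with correlation rho. Real review scores are not distributed this way.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        impact = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        criterion = rho * impact + math.sqrt(1.0 - rho ** 2) * noise
        pairs.append((criterion, impact))
    return pairs


def correlation_within_top(pairs, top_fraction):
    """Correlation among only the best-scoring fraction of applications.

    Lower impact scores are treated as better, as with percentiles.
    """
    ranked = sorted(pairs, key=lambda pair: pair[1])
    keep = ranked[: max(2, int(len(ranked) * top_fraction))]
    criterion = [c for c, _ in keep]
    impact = [i for _, i in keep]
    return pearson(criterion, impact)


if __name__ == "__main__":
    simulated = simulate_scores(20000, rho=0.8)
    for fraction in (1.0, 0.3, 0.2, 0.1):
        r = correlation_within_top(simulated, fraction)
        print(f"top {fraction:.0%} of applications: r = {r:.2f}")
```

Running the script prints one correlation for the full simulated set and one for each narrower band, which is the same comparison made for the real applications in this post.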
We should expect the criterion score-impact rating correlation to weaken when examining a smaller group of applications.
We all know that when discriminations become finer, correlations from the same test or standard grow weaker. So, we expect correlations to fall as the range narrows. But do they disappear? Suppose our original supposition is true: that the ranking in terms of scientific quality among the narrow set of applications scoring better than the 20th or 10th percentile is random or close to random. It makes sense, then, to expect that the indicators of relative scientific quality (the criterion correlations) would drop to zero when the range narrows. Non-criterion factors would begin to crowd out the scientific quality differences represented by the criterion scores, because those quality differences are now small to non-existent.
Non-criterion factors might be reviewer beliefs like this:
- A bias against a particular school
- “This investigator has too much money.”
- “It is an A1 on his only grant. He really needs the money!”
- “Early-stage investigators get special treatment!”
What do the data show? Reviewers’ criterion scores continue to show substantial relationships to impact rating, even at the very top of the range of possible scores.
The figure on this page shows the results. The Y-axis shows the percentile band of applications from which correlations were calculated. The X-axis shows the correlation for each band. These are applications with NIA assignments reviewed at the Center for Scientific Review in fiscal year 2011 to fiscal year 2013.
What I found remarkable—OK, I live a very sheltered life!—is that the Approach criterion correlates 0.62 with the impact rating even among applications ranked within the 10th percentile. Significance is not far behind at 0.56. The other three criteria show much poorer correlations. In brief, on average, reviewers are meaningfully discriminating even among applications at the very top of the range on the Approach and Significance criteria. Their assessment is not random within this set. Judgments of scientific quality are not being crowded out.
A meaningful judgment is not necessarily a scientifically reasonable judgment. Similarly, if a second group of reviewers were to assess the same applications, they might judge them differently, which would suggest that individual preference, rather than a shared standard of scientific quality, determines ratings at this level. Still, the result is some evidence that meaningful discrimination happens even when all the science is strong.
Do you have thoughts or questions about this post? Let me hear them by commenting below.