When I review papers for journals, I often find myself facing something of a tension between two competing motives. On the one hand, I’d like to evaluate each manuscript as an independent contribution to the scientific literature–i.e., without having to worry about how the manuscript stacks up against other potential manuscripts I could be reading. The rationale being that the plausibility of the findings reported in a manuscript shouldn’t really depend on what else is being published in the same journal, or in the field as a whole: if there are methodological problems that threaten the conclusions, they shouldn’t become magically more or less problematic just because some other manuscript has (or doesn’t have) gaping holes. Reviewing should simply be a matter of documenting one’s major concerns and suggestions and sending them back to the Editor for infallible judgment.
The trouble with this idea is that if you’re of a fairly critical bent, you probably don’t believe the majority of the findings reported in the manuscripts sent to you to review. Empirically, this actually appears to be the right attitude to hold, because as a good deal of careful work by biostatisticians like John Ioannidis shows, most published research findings are false, and most true associations are inflated. So, in some ideal world, where the job of a reviewer is simply to assess the likelihood that the findings reported in a paper provide an accurate representation of reality, and/or to identify ways of bringing those findings closer in line with reality, skepticism is the appropriate default attitude. Meaning, if you keep the question “why don’t I believe these results?” firmly in mind as you read through a paper and write your review, you probably aren’t going to go wrong all that often.
The problem is that, for better or worse, one’s job as a reviewer isn’t really–or at least, solely–to evaluate the plausibility of other people’s findings. In large part, it’s to evaluate the plausibility of reported findings in relation to the other stuff that routinely gets published in the same journal. For instance, if you regularly review papers for a very low-tier journal, the editor is probably not going to be very thrilled to hear you say “well, Ms. Editor, none of the last 15 papers you’ve sent me are very good, so you should probably just shut down the journal.” So a tension arises between writing a comprehensive review that accurately captures what the reviewer really thinks about the results–which is often (at least in my case) something along the lines of “pffft, there’s no fucking way this is true”–and writing a review that weighs the merits of the reviewed manuscript relative to the other candidates for publication in the same journal.
To illustrate, suppose I review a paper and decide that, in my estimation, there’s only a 20% chance the key results reported in the paper would successfully replicate (for the sake of argument, we’ll pretend I’m capable of this level of precision). Should I recommend outright rejection? Maybe, since 1 in 5 odds of long-term replication don’t seem very good. But then again, what if 20% is actually better than average? What if I think the average article I’m sent to review only has a 10% chance of holding up over time? In that case, if I recommend rejection of the 20% article, and the editor follows my recommendation, most of the time I’ll actually be contributing to the journal publishing poorer quality articles than if I’d recommended accepting the manuscript, even if I’m pretty sure the findings reported in the manuscript are false.
Lest this sound like I’m needlessly overanalyzing the review process instead of buckling down and writing my own overdue reviews (okay, you’re right, now stop being a jerk), consider what happens when you scale the problem up. When journal editors send reviewers manuscripts to look over, the question they really want an answer to is, “how good is this paper compared to everything else that crosses my desk?” But most reviewers are naturally inclined to answer a somewhat different–and easier–question, namely, “in the grand scheme of life, the universe, and everything, how good is this paper?” The problem, then, is that if the variance in curmudgeonliness between reviewers exceeds the (reliable) variance within reviewers, then arguably the biggest factor in determining whether or not a given paper gets rejected is simply who happens to review it. Not how much expertise the reviewer has, or even how ‘good’ they are (in the sense that some reviewers are presumably better than others at identifying serious problems and overlooking trivial ones), but simply how critical they are on average. Which is to say, if I’m Reviewer 2 on your manuscript, you’ll probably have a higher chance of rejection than if Reviewer 2 is someone who characteristically writes one-paragraph reviews that begin with the words “this is an outstanding and important piece of work…”
Anyway, on some level this is a pretty trivial observation; after all, we all know that the outcome of the peer review process is, to a large extent, tantamount to a roll of the dice. We know that there are cranky reviewers and friendly reviewers, and we often even have a sense of who they are, which is why we often suggest people to include or exclude as reviewers in our cover letters. The practical question though–and the reason for bringing this up here–is this: given that we have this obvious and ubiquitous problem of reviewers having different standards for what’s publishable, and that this undeniably impacts the outcome of peer review, are there any simple steps we could take to improve the reliability of the review process?
The way I’ve personally made peace between my desire to provide the most comprehensive and accurate review I can and the pragmatic need to evaluate each manuscript in relation to other manuscripts is to use the “comments to the Editor” box to provide some additional comments about my review. Usually what I end up doing is writing my review with little or no thought for practical considerations such as “how prestigious is this journal” or “am I a particularly harsh reviewer” or “is this a better or worse paper than most others in this journal”. Instead, I just write my review, and then when I’m done, I use the comments to the editor to say things like “I’m usually a pretty critical reviewer, so don’t take the length of my review as an indication I don’t like the manuscript, because I do,” or, “this may seem like a negative review, but it’s actually more positive than most of my reviews, because I’m a huge jerk.” That way I can appease my conscience by writing the review I want to while still giving the editor some indication as to where I fit in the distribution of reviewers they’re likely to encounter.
I don’t know if this approach makes any difference at all, and maybe editors just routinely ignore this kind of thing; it’s just the best solution I’ve come up with that I can implement all by myself, without asking anyone else to change their behavior. But if we allow ourselves to contemplate alternative approaches that include changes to the review process itself (while still adhering to the standard pre-publication review model, which, like many other people, I’ve argued is fundamentally dysfunctional), then there are many other possibilities.
One idea, for instance, would be to include calibration questions that could be used to estimate (and correct for) individual differences in curmudgeonliness. In addition to questions about the merit of the manuscript itself, the review form could include a question like “what proportion of articles you review do you estimate end up being rejected?” or “do you consider yourself a more critical or less critical reviewer than most of your peers?”
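To make the calibration idea a little more concrete, here is a minimal sketch of how an editor might use the answer to such a question to adjust a rating. Everything in it (the 1-10 scale, the baseline rejection rate, the linear correction) is an assumption I’m inventing purely for illustration:

```python
# Purely illustrative sketch: nudge a reviewer's 1-10 rating using their
# answer to a calibration question ("what proportion of the papers you
# review do you think end up rejected?"). The baseline, the scale factor,
# and the linear form of the correction are all invented for illustration.

def calibrated_score(raw_score, self_reported_rejection_rate,
                     baseline_rejection_rate=0.7, scale=2.0):
    """Shift a raw score up or down according to how much harsher (or more
    lenient) the reviewer claims to be than a typical reviewer."""
    harshness = self_reported_rejection_rate - baseline_rejection_rate
    return raw_score + scale * harshness

# A self-described hanging judge (rejects 90% of what they see) gives a 6/10:
print(calibrated_score(6, 0.90))   # 6.4 -- the score gets a small boost
# A self-described pushover (rejects 40%) gives the same 6/10:
print(calibrated_score(6, 0.40))   # 5.4 -- the score gets discounted
```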
Another, logistically more difficult, idea would be to develop a centralized database of review outcomes, so that editors could see what proportion of each reviewer’s assignments ultimately end up being rejected (though they couldn’t see the actual content of the reviews). I don’t know if this type of approach would improve matters at all; it’s quite possible that the review process is fundamentally so inefficient and slow that editors just don’t have the time to spend worrying about this kind of thing. But it’s hard to believe that there aren’t some simple calibration steps we could take to bring reviewers into closer alignment with one another–even if we’re confined to working within the standard pre-publication model of peer review. And given the abysmally low reliability of peer review, even small improvements could potentially produce large benefits in the aggregate.
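If such a database did exist, computing each reviewer’s base rate of rejection would be trivial. Here is a toy sketch with completely made-up reviewers, journals, and recommendations, since as far as I know no cross-publisher database of this kind actually exists:

```python
from collections import defaultdict

# Hypothetical records from an imagined cross-journal review-outcomes
# database: (reviewer_id, journal, recommendation). All names are made up.
reviews = [
    ("reviewer_1", "journal_A", "reject"),
    ("reviewer_1", "journal_B", "reject"),
    ("reviewer_1", "journal_B", "accept"),
    ("reviewer_2", "journal_A", "accept"),
    ("reviewer_2", "journal_C", "accept"),
    ("reviewer_2", "journal_B", "reject"),
]

# Tally how often each reviewer recommends rejection.
tallies = defaultdict(lambda: {"reject": 0, "total": 0})
for reviewer, _journal, recommendation in reviews:
    tallies[reviewer]["total"] += 1
    tallies[reviewer]["reject"] += (recommendation == "reject")

for reviewer, t in sorted(tallies.items()):
    print(reviewer, round(t["reject"] / t["total"], 2))
# reviewer_1 0.67  (relatively harsh)
# reviewer_2 0.33  (relatively lenient)
```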
What if reviewers were forced to provide a score instead of a yes/no recommendation? Then, once a few reviews had come in from a given reviewer, the editor could standardize each new score against the distribution of scores that reviewer had already produced, define a percentile cutoff, and apply it to the standardized scores. Call it the “Reviewer Z” or “Reviewer Quantile” approach.
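Something along these lines, say (the score histories and the cutoff below are just numbers I’m making up to show the mechanics):

```python
import statistics

def reviewer_z(new_score, past_scores):
    """Standardize a new 1-10 rating against the reviewer's own history."""
    mean = statistics.mean(past_scores)
    sd = statistics.stdev(past_scores)
    return (new_score - mean) / sd

# A harsh reviewer whose ratings usually hover around 3/10:
harsh_history = [2, 3, 3, 4, 2, 3]
# A generous reviewer whose ratings hover around 8/10:
generous_history = [8, 7, 9, 8, 8, 9]

print(reviewer_z(5, harsh_history))     # ~ +2.9: glowing, coming from this reviewer
print(reviewer_z(5, generous_history))  # ~ -4.2: a thumbs-down in disguise

CUTOFF = 1.0  # e.g., only accept papers rated ~1 SD above a reviewer's own norm
```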
An interesting interpretation of the basic idea is that it’s kind of like building an item response theory model, except where the reviewers are the items! You have high-difficulty reviewers and low-difficulty reviewers, and you use their responses to estimate an “ability” score for each paper, etc. The idea would be to get a good spread of difficulty in the reviewers in order to obtain a sensitive estimate of the paper’s ability. Or if the editor thought that a paper was particularly high-ability, he/she could choose to assign it to more high-difficulty reviewers. You get the idea. Assuming there are some databases of which reviewers said yes/no to which papers, this might not actually be too hard to implement.
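Just to spell out what I’m imagining, a toy version might look like the following: a Rasch-style model in which reviewer difficulties are treated as known (in reality they would have to be estimated from that database of past yes/no recommendations), with every number invented for illustration.

```python
import math

def p_accept(paper_ability, reviewer_difficulty):
    """Rasch-style acceptance probability: logistic(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(paper_ability - reviewer_difficulty)))

def estimate_ability(responses, n_steps=500, lr=0.1):
    """Crude maximum-likelihood estimate of a paper's 'ability' from
    (reviewer_difficulty, accepted) pairs, via gradient ascent on the
    log-likelihood."""
    ability = 0.0
    for _ in range(n_steps):
        gradient = sum(accepted - p_accept(ability, difficulty)
                       for difficulty, accepted in responses)
        ability += lr * gradient
    return ability

# Two harsh reviewers (difficulty +1.5) say accept; one lenient one (-1.0) says reject.
responses = [(1.5, 1), (1.5, 1), (-1.0, 0)]
print(estimate_ability(responses))  # roughly +1.6: accepts from harsh reviewers count for a lot
```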
One problem is that reviewers are often selected based on having substantive knowledge in the relevant area (which I think is good), and it seems difficult to fit that into the model. Although, interestingly, this is sort of analogous to the IRT issue of differential item functioning. Maybe the solution is to select reviewers based on substantive relevance, but to still consider their responses in the light of their difficulty scores, simply living with the fact that the estimates may be less than ideally precise. I guess I’m just thinking into my keyboard now. But I’m curious what you think about this interpretation.
CM, yeah, this is similar to what I was getting at. I think the issue there is that very few people review more than a handful of times for a particular editor over any meaningful period of time. So it would require a centralized system that spans multiple journals–and then the tricky part, like most of these things, is getting people (well, publishers) to buy in.
Jake, that’s a great point. I think given enough reviews (or, more likely, numerical ratings) you probably could use IRT to model the quality of a paper, and it might work quite nicely. But I’m not sure it would be practical under the current pre-publication evaluation system (for the same reason I suggested above–there wouldn’t be enough reviews per person to estimate reviewers’ curmudgeonliness). This is where an open science/post-publication evaluation model would shine–in terms of reliability, we’re almost certainly much better off with 20 people each providing a brief review and a 10-point rating than with the current system, where 2 or 3 semi-randomly selected people are charged with writing an exhaustive review. And under that kind of system, a user might easily review (or comment on) hundreds of articles, which would make it easier to figure out what kind of reviewer they were.
I thought editors already had information about previous review recommendations from reviewers who have reviewed for them before? I mean, there is almost no doubt that they *have* this information, but don’t at least some editors also have it tabulated in a way that they can easily access? Maybe we could solicit information about this from some editors.
Rick, I’m sure they have it for their particular journal, but with the exception of some very high-volume journals who keep relying on the same reviewers, most people aren’t going to be asked to review more than a handful of times for a given journal over a meaningful period of time. I imagine some of the bigger publishers–e.g., Elsevier or Wiley–could probably maintain databases that span all of their journals, though I’d be surprised if they do that (any editors want to weigh in?). But there’s definitely no centralized system that spans publishers, which is what would be ideal.
At this point, I work off of two main criteria for reviewing:
1) Is it methodologically sound? (I have seen scary things submitted to good journals.)
2) Is the interpretation sensible? For example, if the authors set out to prove theory X and do not find support for it, do they still claim that theory X is correct and infallible or do they examine other options?
After that, I just work off of the assumption that they’re telling the truth about their data and run with it. I’d ask questions if it were completely implausible, but that’s about it.
Interestingly, the CS and Robotics conferences that I’ve reviewed for use 1-5 scores and subscores–you end up assigning a paper multiple scores on different dimensions. However, I don’t think anyone tracks these to see who is an easy vs. difficult reviewer.
Liz, I think that’s totally reasonable, and I do essentially the same thing. But I think reviewers have very different ideas of what “methodologically sound” and “sensible” mean; some are much more critical than others, hence the difficulty.
It does seem like the CS folks are in an ideal position to track this type of thing, since (if I understand it right) each reviewer has to score a whole bunch of submissions (unlike journal reviewing, or conferences in most of the social and biomedical sciences, where almost everything submitted gets accepted). If anyone else knows of any systematic effort to study this issue, please chime in!