Too much p = .048? Towards partial automation of scientific evaluation

Distinguishing good science from bad science isn’t an easy thing to do. One big problem is that what constitutes ‘good’ work is, to a large extent, subjective; I might love a paper you hate, or vice versa. Another problem is that science is a cumulative enterprise, and the value of each discovery is, in some sense, determined by how much of an impact that discovery has on subsequent work–something that often only becomes apparent years or even decades after the fact. So, to an uncomfortable extent, evaluating scientific work involves a good deal of guesswork and personal preference, which is probably why scientists tend to fall back on things like citation counts and journal impact factors as tools for assessing the quality of someone’s work. We know it’s not a great way to do things, but it’s not always clear how else we could do better.

Fortunately, there are many aspects of scientific research that don’t depend on subjective preferences or require us to suspend judgment for ten or fifteen years. In particular, methodological aspects of a paper can often be evaluated in a (relatively) objective way, and strengths or weaknesses of particular experimental designs are often readily discernible. For instance, in psychology, pretty much everyone agrees that large samples are generally better than small samples, reliable measures are better than unreliable measures, representative samples are better than WEIRD ones, and so on. The trouble when it comes to evaluating the methodological quality of most work isn’t so much that there’s rampant disagreement between reviewers (though it does happen), it’s that research articles are complicated products, and the odds of any individual reviewer having the expertise, motivation, and attention span to catch every major methodological concern in a paper are exceedingly small. Since only two or three people typically review a paper pre-publication, it’s not surprising that in many cases, whether or not a paper makes it through the review process depends as much on who happened to review it as on the paper itself.

A nice example of this is the Bem paper on ESP I discussed here a few weeks ago. I think most people would agree that things like data peeking, lumping and splitting studies, and post-hoc hypothesis testing–all of which are apparent in Bem’s paper–are generally not good research practices. And no doubt many potential reviewers would have noted these and other problems with Bem’s paper had they been asked to review it. But as it happens, the actual reviewers didn’t note those problems (or at least, not enough of them), so the paper was accepted for publication.

I’m not saying this to criticize Bem’s reviewers, who I’m sure all had a million other things to do besides pore over the minutiae of a paper on ESP (and for all we know, they could have already caught many other problems with the paper that were subsequently addressed before publication). The problem is a much more general one: the pre-publication peer review process in psychology, and many other areas of science, is pretty inefficient and unreliable, in the sense that it draws on the intense efforts of a very few, semi-randomly selected, individuals, as opposed to relying on a much broader evaluation by the community of researchers at large.

In the long term, the best solution to this problem may be to fundamentally rethink the way we evaluate scientific papers–e.g., by designing new platforms for post-publication review of papers (e.g., see this post for more on efforts towards that end). I think that’s far and away the most important thing the scientific community could do to improve the quality of scientific assessment, and I hope we ultimately will collectively move towards alternative models of review that look a lot more like the collaborative filtering systems found on, say, reddit or Stack Overflow than like peer review as we now know it. But that’s a process that’s likely to take a long time, and I don’t profess to have much of an idea as to how one would go about kickstarting it.

What I want to focus on here is something much less ambitious, but potentially still useful–namely, the possibility of automating the assessment of at least some aspects of research methodology. As I alluded to above, many of the factors that help us determine how believable a particular scientific finding is are readily quantifiable. In fact, in many cases, they’re already quantified for us. Sample sizes, p values, effect sizes, coefficient alphas… all of these things are, in one sense or another, indices of the quality of a paper (however indirect), and are easy to capture and code. And many other things we care about can be captured with only slightly more work. For instance, if we want to know whether the authors of a paper corrected for multiple comparisons, we could search for strings like “multiple comparisons”, “uncorrected”, “Bonferroni”, and “FDR”, and probably come away with a pretty decent idea of what the authors did or didn’t do to correct for multiple comparisons. It might require a small dose of technical wizardry to do this kind of thing in a sensible and reasonably accurate way, but it’s clearly feasible–at least for some types of variables.
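As a rough illustration of the string-search idea, here is a minimal Python sketch. The term list is just the four strings mentioned above, and the function name find_correction_terms is made up for this example. It only counts occurrences; it makes no attempt to tell "Bonferroni-corrected" apart from "we did not apply a Bonferroni correction", which a real tool would have to handle.

```python
import re

# Terms taken from the paragraph above; illustrative, not exhaustive.
CORRECTION_TERMS = ["multiple comparisons", "uncorrected", "Bonferroni", "FDR"]


def find_correction_terms(text):
    """Return a dict mapping each term to how many times it appears in text."""
    counts = {}
    for term in CORRECTION_TERMS:
        # Case-insensitive, whole-word match.
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


sample = ("Results were Bonferroni corrected for multiple comparisons; "
          "uncorrected maps are in the supplement.")
print(find_correction_terms(sample))
```

A paper where every count is zero isn't necessarily a bad paper, but it might be one where a reviewer wants to look a little harder.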

Once we extracted a bunch of data about the distribution of p values and sample sizes from many different papers, we could then start to do some interesting (and potentially useful) things, like generating automated metrics of research quality. For instance:

  • In multi-study articles, the variance in sample size across studies could tell us something useful about the likelihood that data peeking is going on (for an explanation as to why, see this). Other things being equal, an article with 9 studies with identical sample sizes is less likely to be capitalizing on chance than one containing 9 studies that range in sample size between 50 and 200 subjects (as the Bem paper does), so high variance in sample size could be used as a rough index for proclivity to peek at the data.
  • Quantifying the distribution of p values found in an individual article or an author’s entire body of work might be a reasonable first-pass measure of the amount of fudging (usually inadvertent) going on. As I pointed out in my earlier post, it’s interesting to note that with only one or two exceptions, virtually all of Bem’s statistically significant results come very close to p = .05. That’s not what you expect to see when hypothesis testing is done in a really principled way, because it’s exceedingly unlikely that a researcher would be so lucky as to always just barely obtain the expected result. But a bunch of p = .03 and p = .048 results are exactly what you expect to find when researchers test multiple hypotheses and report only the ones that produce significant results.
  • The presence or absence of certain terms or phrases is probably at least slightly predictive of the rigorousness of the article as a whole. For instance, the frequent use of phrases like “cross-validated”, “statistical power”, “corrected for multiple comparisons”, and “unbiased” is probably a good sign (though not necessarily a strong one); conversely, terms like “exploratory”, “marginal”, and “small sample” might provide at least some indication that the reported findings are, well, exploratory.
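The first two ideas in this list can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the regular expression only catches p values written inline as "p = .048" or "p=0.03", the .03 to .05 band for "close to .05" is arbitrary, and the coefficient of variation is just one of several ways to summarize spread in sample sizes.

```python
import re
import statistics

# Matches "p = .048", "p=0.03", "p < .05", and similar inline forms.
P_VALUE_RE = re.compile(r"\bp\s*([<>=])\s*(0?\.\d+)", flags=re.IGNORECASE)


def extract_p_values(text):
    """Return exactly reported p values (those written with '=') as floats."""
    return [float(num) for op, num in P_VALUE_RE.findall(text) if op == "="]


def near_threshold_share(p_values, low=0.03, alpha=0.05):
    """Share of significant p values (p < alpha) that fall at or above `low`."""
    significant = [p for p in p_values if p < alpha]
    if not significant:
        return 0.0
    return sum(1 for p in significant if p >= low) / len(significant)


def sample_size_cv(sample_sizes):
    """Coefficient of variation (SD / mean) of per-study sample sizes."""
    if len(sample_sizes) < 2:
        return 0.0
    return statistics.stdev(sample_sizes) / statistics.mean(sample_sizes)


text = ("Study 1 showed an effect, p = .048. Study 2, p=0.03; "
        "Study 3, p < .05; Study 4, p = .002.")
p_values = extract_p_values(text)
print(p_values, near_threshold_share(p_values))
print(sample_size_cv([100] * 9), sample_size_cv([50, 200]))
```

Inequalities like "p < .05" are deliberately skipped, since they say nothing about how close the result came to the threshold. A value of exactly .05 is also excluded from the significant set, which conveniently drops most statements of the threshold itself.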

These are just the first examples that come to mind; you can probably think of other better ones. Of course, these would all be pretty weak indicators of paper (or researcher) quality, and none of them are in any sense unambiguous measures. There are all sorts of situations in which such numbers wouldn’t mean much of anything. For instance, high variance in sample sizes would be perfectly justifiable in a case where researchers were testing for effects expected to have very different sizes, or conducting different kinds of statistical tests (e.g., detecting interactions is much harder than detecting main effects, and so necessitates larger samples). Similarly, p values close to .05 aren’t necessarily a marker of data snooping and fishing expeditions; it’s conceivable that some researchers might be so good at what they do that they can consistently design experiments that just barely manage to show what they’re intended to (though it’s not very plausible). And a failure to use terms like “corrected”, “power”, and “cross-validated” in a paper doesn’t necessarily mean the authors failed to consider important methodological issues, since such issues aren’t necessarily relevant to every single paper. So there’s no question that you’d want to take these kinds of metrics with a giant lump of salt.

Still, there are several good reasons to think that even relatively flawed automated quality metrics could serve an important purpose. First, many of the problems could be overcome to some extent through aggregation. You might not want to conclude that a particular study was poorly done simply because most of the reported p values were very close to .05; but if you were to look at a researcher’s entire body of, say, thirty or forty published articles, and noticed the same trend relative to other researchers, you might start to wonder. Similarly, we could think about composite metrics that combine many different first-order metrics to generate a summary estimate of a paper’s quality that may not be so susceptible to contextual factors or noise. For instance, in the case of the Bem ESP article, a measure that took into account the variance in sample size across studies, the closeness of the reported p values to .05, the mention of terms like ‘one-tailed test’, and so on, would likely not have assigned Bem’s article a glowing score, even if each individual component of the measure was not very reliable.

Second, I’m not suggesting that crude automated metrics would replace current evaluation practices; rather, they’d be used strictly as a complement. Essentially, you’d have some additional numbers to look at, and you could choose to use them or not, as you saw fit, when evaluating a paper. If nothing else, they could help flag potential issues that reviewers might not be spontaneously attuned to. For instance, a report might note the fact that the term “interaction” was used several times in a paper in the absence of “main effect,” which might then cue a reviewer to ask, hey, why you no report main effects? — but only if they deemed it a relevant concern after looking at the issue more closely.
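The "interaction" without "main effect" flag is simple enough to sketch. This is a toy version in Python; the function names and the threshold of three mentions are invented for the example, and the output is meant as a prompt for a human reviewer, not a verdict.

```python
import re


def count_phrase(text, phrase):
    """Count case-insensitive occurrences of phrase, allowing a plural 's'."""
    pattern = r"\b" + re.escape(phrase) + r"s?\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def flag_missing_main_effects(text, min_mentions=3):
    """Return a note if 'interaction' is frequent but 'main effect' never appears."""
    interactions = count_phrase(text, "interaction")
    main_effects = count_phrase(text, "main effect")
    if interactions >= min_mentions and main_effects == 0:
        return (f"'interaction' appears {interactions} times; "
                "'main effect' never appears.")
    return None


report = ("The interaction was significant. A second interaction emerged. "
          "The three-way interaction was not.")
print(flag_missing_main_effects(report))
```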

Third, automated metrics could be continually updated and improved using machine learning techniques. Given some criterion measure of research quality, one could systematically train and refine an algorithm capable of doing a decent job recapturing that criterion. Of course, it’s not clear that we really have any unobjectionable standard to use as a criterion in this kind of training exercise (which only underscores why it’s important to come up with better ways to evaluate scientific research). But a reasonable starting point might be to try to predict replication likelihood for a small set of well-studied effects based on the features of the original report. Could you for instance show, in an automated way, that initial effects reported in studies that failed to correct for multiple comparisons or reported p values closer to .05 were less likely to be subsequently replicated?

Of course, as always with this kind of stuff, the rub is that it’s easy to talk the talk and not so easy to walk the walk. In principle, we can make up all sorts of clever metrics, but in practice, it’s not trivial to automatically extract even a piece of information as seemingly simple as sample size from many papers (consider the difference between “Undergraduates (N = 15) participated…” and “Forty-two individuals diagnosed with depression and an equal number of healthy controls took part…”), let alone build sophisticated composite measures that could reasonably well approximate human judgments. It’s all well and good to write long blog posts about how fancy automated metrics could help separate good research from bad, but I’m pretty sure I don’t want to actually do any work to develop them, and you probably don’t either. Still, the potential benefits are clear, and it’s not like this is science fiction–it’s clearly viable on at least a modest scale. So someone should do it… Maybe Elsevier? Jorge Hirsch? Anyone? Bueller? Bueller?
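To make the sample size problem concrete, here is what a naive extractor looks like in Python. The pattern is an assumption about how sample sizes tend to be written, and it handles the first of the two example sentences above while missing the second entirely.

```python
import re

# Naive pattern: only catches sample sizes written as "N = 15" or "n=15".
N_RE = re.compile(r"\bN\s*=\s*(\d+)", flags=re.IGNORECASE)


def extract_sample_sizes(text):
    """Return every integer that follows an 'N =' style marker."""
    return [int(n) for n in N_RE.findall(text)]


easy = "Undergraduates (N = 15) participated"
hard = ("Forty-two individuals diagnosed with depression and an equal "
        "number of healthy controls took part")

print(extract_sample_sizes(easy))  # [15]
print(extract_sample_sizes(hard))  # []
```

Getting 84 out of the second sentence means parsing number words and understanding "an equal number of", which is a very different kind of problem from pattern matching.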

6 thoughts on “Too much p = .048? Towards partial automation of scientific evaluation”

  1. How much does this system really need to be automated to extract from the article? What if, as a part of the review process, the authors filled out a form that stated sample sizes, correction for multiple testing, etc.

  2. All good points and I agree with most of this. Actually you’ve put into words a few things I’ve been turning over in my mind too. So in no particular order:

    * I think extracting p values from text ought to be fairly easy, computationally speaking. Search for “p=*.N” where N is a sequence of numbers and * is either 0 or nothing (cos some people write it p=0.05 and some do p=.05). N is your p value. Would be easy to knock up a script to do that in any scripting language. It wouldn’t catch all the p values by any means but it would get most of them.

    My suspicion, unfortunately, is that if you did this on a huge bunch of papers and plotted the data as a histogram you’d get a massive spike just below 0.05, a smaller one just below 0.01, etc.

    You’d have to ignore the p values that were *exactly* 0.05 or whatever because they were probably not data but methods (“we thresholded at p=0.05” or whatever). However luckily, you wouldn’t lose too much real data there because it’s rare to find a p value of exactly that.

    * I’ve also dreamed about forcing people to include a kind of a diagram which shows all of their p values plotted on a graph which shows the chance of finding at least one p value that low by chance alone.

    basically you start out with a piece of graph paper which has log(p value) on the y axis and number-of-tests on the x axis.

    You draw a line representing the chance of finding at least one p value at a certain level. So for 0.05 this would start at 0.05 with one test, go up to 0.1 with 2 tests, and so on. bearing in mind that the y axis is a log scale this would be a curve.

    Then you get them to plot every p value they found on this piece of paper with the best p-values further along the x axis.

    If none of the dots are above the line, all results would be expected by chance.

    It’s basically a Bonferroni correction on a piece of paper but it has the advantage of not actually altering the p values…

  3. Wait, I don’t mean log(p), I mean log(1/p)…I think.

    Anyway the other advantage of this is that it would allow you to see whether the results *would* have been significant if you’d done fewer comparisons.

    Say you do 100 comparisons and you find several results at p=0.01. Under Bonferroni correction this is complete junk. But plot them on my graph and you’d see that it would be significant if you’d done fewer comparisons. Now suppose also that all of the “good” results came from a certain part of the study (Gene X) and all the crap ones came from another (Gene Y). You’d be able to see immediately that Gene X results were better (further to the right) and that had you not included Gene Y it would have been significant.

    This would, hopefully, encourage you to do a follow up study on Gene X and forget Gene Y.

  4. Great ideas. The only comment I have is that I think searching for certain keywords is probably not a really good idea… because then all that will happen is that authors will make sure to use the good words and exclude the bad ones. I can just imagine an advisor telling a grad student, “This paper is good, but please try and work in the words ‘multiple comparisons’. Trust me, it’s part of the review process.”

    I can think of at least two different forms at work (the early paperwork you do to get the ball rolling on the patent process, and employee evaluations) where there are places that those in the know just kind of work in certain key words. It’s not necessarily a bad thing I guess (in the latter case, it’s actually a way of communicating to the employee what specific rating they were given, even though you’re not supposed to state it directly) but don’t think people won’t learn the keywords and work them in artificially.

  5. Jeff, yes, I think that’s also an excellent approach, and will probably write a post about that at some point too. The limiting factor there is that you don’t want to mandate that authors fill out a really long questionnaire, because they won’t do it. My preferred approach is to have a standard statement that’s publicly available (the “gold standard”) that authors are asked to simply certify in their methods section–or, if they deviate from it, to indicate how. That way there’s an incentive for people to adhere to high standards without mandating much more work or scaring people away.

    Neuroskeptic, yeah, p values are the easiest thing to extract… you could probably capture > 90% of occurrences with a single regular expression. Other things are more difficult. But the real work isn’t necessarily just the text parsing, it’s also locating and standardizing the full article texts, saving the data to a relational DB in a sensible and usable way, etc. It can be done (actually, I’ve been working on something along those lines for quite a while now–will be able to talk more about it in a few weeks), but I don’t want to be the one to do it. 🙂

    The plot you’re talking about is exactly the way FDR correction works–line up all the ordered p values and set the cut-off where the last one dips below some criterion. So I think you’re basically arguing for an FDR approach, which works well in many cases (and not just imaging studies; e.g., I used it in this paper on bloggers’ personalities). The only issue is that it’s not really applicable in cases where people are doing a bunch of different analyses, and the p values are coming from different places and may or may not be independent. Of course, that non-independence is also going to be a problem for a blind automated approach.

    James, yeah, I agree, many keywords can be easy to game, I was just giving examples. But I think with enough work one could come up with a tool that’s clever enough to separate irrelevant uses from relevant ones. Plus cheating would be a risky strategy because it’d be pretty easy to spot (e.g., all you’d need to do is have the parser spit out a few words on either side of each target word, and you’d very quickly realize that John Doe was using the term ‘cross-validated’ only in the context of sentences like ‘although these results are not cross-validated’…) So I think this is really an implementation issue and not a principled barrier to text-based metrics.
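The "few words on either side" check is essentially a keyword-in-context listing. A minimal Python sketch follows; the function name and the default window size are assumptions made for illustration.

```python
def keyword_in_context(text, term, window=4):
    """Return each occurrence of term with up to `window` words on either side."""
    words = text.split()
    term_words = term.lower().split()
    n = len(term_words)
    hits = []
    for i in range(len(words) - n + 1):
        # Strip surrounding punctuation before comparing against the term.
        candidate = [w.strip(".,;:()'\"").lower() for w in words[i:i + n]]
        if candidate == term_words:
            start = max(0, i - window)
            hits.append(" ".join(words[start:i + n + window]))
    return hits


sentence = "Although these results are not cross-validated, they are suggestive."
for hit in keyword_in_context(sentence, "cross-validated", window=3):
    print(hit)
```

Skimming the snippets is enough to see that the term shows up only after "not", which is exactly the kind of gaming a bare keyword count would miss.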
