scientists aren’t dumb; statistics is hard

There’s a feature article in the new issue of Science News on the failure of science “to face the shortcomings of statistics”. The author, Tom Siegfried, argues that many scientific results shouldn’t be believed because they depend on faulty statistical practices:

Even when performed correctly, statistical tests are widely misunderstood and frequently misinterpreted. As a result, countless conclusions in the scientific literature are erroneous, and tests of medical dangers or treatments are often contradictory and confusing.

I have mixed feelings about the article. It’s hard to disagree with the basic idea that many scientific results are the product of statistical malpractice and/or misfortune. And Siegfried generally provides lucid explanations of some common statistical pitfalls when he sticks to the descriptive side of things. For instance, he gives nice accounts of Bayesian inference, of the multiple comparisons problem, and of the distinction between statistical significance and clinical/practical significance. And he nicely articulates what’s wrong with one of the most common (mis)interpretations of p values:

Correctly phrased, experimental data yielding a P value of .05 means that there is only a 5 percent chance of obtaining the observed (or more extreme) result if no real effect exists (that is, if the no-difference hypothesis is correct). But many explanations mangle the subtleties in that definition. A recent popular book on issues involving science, for example, states a commonly held misperception about the meaning of statistical significance at the .05 level: “This means that it is 95 percent certain that the observed difference between groups, or sets of samples, is real and could not have arisen by chance.”

So as a laundry list of common statistical pitfalls, it works quite nicely.
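To make the misreading quoted above concrete, here is a minimal sketch of why p < .05 is not the same thing as “95 percent certain the effect is real”. It applies Bayes’ rule to a hypothetical field; the power of 0.8 and the three priors (the share of tested effects that are actually real) are made-up numbers for illustration, not figures from Siegfried’s article.

```python
def prob_effect_is_real(alpha, power, prior):
    """P(effect is real | result is significant), by Bayes' rule.

    alpha: false positive rate when there is no effect
    power: chance of a significant result when the effect is real
    prior: assumed share of tested effects that are real (hypothetical)
    """
    true_pos = power * prior
    false_pos = alpha * (1 - prior)
    return true_pos / (true_pos + false_pos)

# Same threshold and power throughout; only the assumed prior changes.
for prior in (0.5, 0.1, 0.01):
    share = prob_effect_is_real(alpha=0.05, power=0.8, prior=prior)
    print(f"prior={prior:<5} P(real | p < .05) = {share:.2f}")
```

With these assumed inputs, a significant result is real about 64 percent of the time when one in ten tested effects is real, and far less often when real effects are rarer. The point is only that the answer depends on quantities the p value knows nothing about.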

What I don’t really like about the article is that it seems to lay the blame squarely on the use of statistics to do science, rather than on the way statistical analysis tends to be performed. That’s to say, a lay person reading the article could well come away with the impression that the very problem with science is that it relies on statistics, as opposed to the much more reasonable conclusion that science is hard, and statistics is hard, and ensuring that your work sits at the intersection of good science and good statistical practice is even harder. Siegfried all but implies that scientists are silly to base their conclusions on statistical inference. For instance:

It’s science’s dirtiest secret: The “scientific method” of testing hypotheses by statistical analysis stands on a flimsy foundation. Statistical tests are supposed to guide scientists in judging whether an experimental result reflects some real effect or is merely a random fluke, but the standard methods mix mutually inconsistent philosophies and offer no meaningful basis for making such decisions.


Experts in the math of probability and statistics are well aware of these problems and have for decades expressed concern about them in major journals. Over the years, hundreds of published papers have warned that science’s love affair with statistics has spawned countless illegitimate findings. In fact, if you believe what you read in the scientific literature, you shouldn’t believe what you read in the scientific literature.

The problem is that there isn’t really any viable alternative to the “love affair with statistics”. Presumably Siegfried doesn’t think (most) scientists ought to be doing qualitative research; so the choice isn’t between statistics and no statistics, it’s between good and bad statistics.

In that sense, the tone of a lot of the article is pretty condescending: it comes off more like Siegfried saying “boy, scientists sure are dumb” and less like the more accurate observation that doing statistics is really hard, and it’s not surprising that even very smart people mess up frequently.

What makes it worse is that Siegfried slips up on a couple of basic points himself, and says some demonstrably false things in a couple of places. For instance, he explains failures to replicate genetic findings this way:

Nowhere are the problems with statistics more blatant than in studies of genetic influences on disease. In 2007, for instance, researchers combing the medical literature found numerous studies linking a total of 85 genetic variants in 70 different genes to acute coronary syndrome, a cluster of heart problems. When the researchers compared genetic tests of 811 patients that had the syndrome with a group of 650 (matched for sex and age) that didn’t, only one of the suspect gene variants turned up substantially more often in those with the syndrome — a number to be expected by chance.

“Our null results provide no support for the hypothesis that any of the 85 genetic variants tested is a susceptibility factor” for the syndrome, the researchers reported in the Journal of the American Medical Association.

How could so many studies be wrong? Because their conclusions relied on “statistical significance,” a concept at the heart of the mathematical analysis of modern scientific experiments.

This is wrong for at least two reasons. One is that, to believe the JAMA study Siegfried is referring to, and disbelieve the results of all 85 previously reported findings, you have to accept the null hypothesis, which is one of the very same errors Siegfried is supposed to be warning us against. In fact, you have to accept the null hypothesis 85 times. In the JAMA paper, the authors are careful to note that it’s possible the actual effects were simply overstated in the original studies, and that at least some of the original findings might still hold under more restrictive conditions. The conclusion that there really is no effect whatsoever is almost never warranted, because you rarely have enough power to rule out even very small effects. But Siegfried offers no such qualifiers; instead, he happily accepts 85 null hypotheses in support of his own argument.
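As a rough illustration of the power point, here’s a sketch that approximates the power of a two-sided comparison of two proportions using a normal approximation. The sample sizes (811 cases, 650 controls) are the ones in the quote above; the variant frequencies are hypothetical values I’ve picked for illustration, and the JAMA authors’ actual analysis may well have differed.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_power(p_cases, p_controls, n_cases, n_controls, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions."""
    std_normal = NormalDist()
    # Standard error of the difference (unpooled normal approximation).
    se = sqrt(p_cases * (1 - p_cases) / n_cases
              + p_controls * (1 - p_controls) / n_controls)
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = abs(p_cases - p_controls) / se
    # Probability that the test statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Sample sizes come from the quoted study; the variant frequencies are made up.
small = two_proportion_power(0.33, 0.30, n_cases=811, n_controls=650)
large = two_proportion_power(0.45, 0.30, n_cases=811, n_controls=650)
print(f"power for a 3-point difference:  {small:.2f}")
print(f"power for a 15-point difference: {large:.2f}")
```

Under these assumptions, a study of this size would catch a 3-point difference in variant frequency only about a quarter of the time, so a null result says little about whether a small effect exists.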

The other issue is that it isn’t really the reliance on statistical significance that causes replication failures; it’s usually the use of excessively liberal statistical criteria. The problem has very little to do with the hypothesis testing framework per se. To see this, consider that if researchers always used a criterion of p < .0000001 instead of the conventional p < .05, there would almost never be any replication failures (because there would almost never be any statistically significant findings, period). So the problem is not so much with the classical hypothesis testing framework as with the choices many researchers make about how to set thresholds within that framework. (That’s not to say that there aren’t any problems associated with frequentist statistics, just that this isn’t really a fair one.)
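Here’s a small simulation sketch of the threshold argument. It invents a world in which 10 percent of tested effects are real, gives each real effect an expected z-score of 2.5, and counts how many “discoveries” fail to reach the same threshold in an identical replication. All of these parameters (`prior_true`, `effect_z`, the number of studies) are assumptions chosen for illustration, not estimates from any real literature.

```python
import random
from statistics import NormalDist

def simulate(alpha, n_studies=20000, prior_true=0.1, effect_z=2.5, seed=0):
    """Return (discoveries, replication failures) for a given threshold."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided cutoff
    discoveries = failures = 0
    for _ in range(n_studies):
        # Hypothetical world: only a minority of tested effects are real.
        mean_z = effect_z if rng.random() < prior_true else 0.0
        original = rng.gauss(mean_z, 1.0)
        if abs(original) > z_crit:
            discoveries += 1
            # Rerun the same design and apply the same threshold.
            replication = rng.gauss(mean_z, 1.0)
            if abs(replication) <= z_crit:
                failures += 1
    return discoveries, failures

for alpha in (0.05, 0.0000001):
    found, failed = simulate(alpha)
    print(f"alpha={alpha:g}: {found} discoveries, {failed} failed replications")
```

In this toy setup the conventional threshold produces thousands of discoveries and a large number of failed replications, while the very strict threshold produces only a handful of either. The framework is the same in both runs; only the cutoff changes.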

Anyway, Siegfried’s explanation of the pitfalls of statistical significance then leads him to make what has to be hands-down the silliest statement in the article:

But in fact, there’s no logical basis for using a P value from a single study to draw any conclusion. If the chance of a fluke is less than 5 percent, two possible conclusions remain: There is a real effect, or the result is an improbable fluke. Fisher’s method offers no way to know which is which. On the other hand, if a study finds no statistically significant effect, that doesn’t prove anything, either. Perhaps the effect doesn’t exist, or maybe the statistical test wasn’t powerful enough to detect a small but real effect.

If you take this statement at face value, you should conclude there’s no point in doing statistical analysis, period. No matter what statistical procedure you use, you’re never going to know for cross-your-heart-hope-to-die sure that your conclusions are warranted. After all, you’re always going to have the same two possibilities: either the effect is real, or it’s not (or, if you prefer to frame the problem in terms of magnitude, either the effect is about as big as you think it is, or it’s very different in size). The same exact conclusion goes through if you take a threshold of p < .001 instead of one of p < .05: the effect can still be a spurious and improbable fluke. And it also goes through if you have twelve replications instead of just one positive finding: you could still be wrong (and people have been wrong). So saying that “two possible conclusions remain” isn’t offering any deep insight; it’s utterly vacuous.

The reason scientists use a conventional threshold of p < .05 when evaluating results isn’t that we think it gives us some magical certainty about whether a finding is “real” or not; it’s that it feels like a reasonable level of confidence to shoot for when making claims about whether the null hypothesis of no effect is likely to hold or not. Now there certainly are many problems associated with the hypothesis testing framework (some of them very serious), but if you really believe that “there’s no logical basis for using a P value from a single study to draw any conclusion,” your beef isn’t actually with p values, it’s with the very underpinnings of the scientific enterprise.

Anyway, the bottom line is that Siegfried’s article is not so much bad as irresponsible. As an accessible description of some serious problems with common statistical practices, it’s actually quite good. But the sense I got in reading the article was that at some point Siegfried became more interested in writing a contrarian piece about how scientists are falling down on the job than in explaining how doing statistics well is just really hard for almost all of us (I certainly fail at it all the time!). And ironically, in the process of trying to explain just why “science fails to face the shortcomings of statistics”, Siegfried commits some of the very same errors he’s taking scientists to task for.

[UPDATE: Steve Novella says much the same thing here.]

[UPDATE 2: Andrew Gelman has a nice roundup of other comments on Siegfried’s article throughout the blogosphere.]