Tangentially related to the last post, Games With Words has a post up soliciting opinions about the merit of effect sizes. The impetus is a discussion we had in the comments on his last post about Jonah Lehrer’s New Yorker article. It started with an obnoxious comment (mine, of course) and then rapidly turned into a civil debate about the importance (or lack thereof) of effect sizes in psychology. What I argued is that consideration of effect sizes is absolutely central to most everything psychologists do, even if that consideration is usually implicit rather than explicit. GWW thinks effect sizes aren’t that important, or at least, don’t have to be.
The basic observation in support of thinking in terms of effect sizes rather than (or in addition to) p values is simply that the null hypothesis is nearly always false. (I think I said “always” in the comments, but I can live with “nearly always”). There are exceedingly few testable associations between two or more variables that could plausibly have an effect size of exactly zero. Which means that if all you care about is rejecting the null hypothesis by reaching p < .05, all you really need to do is keep collecting data–you will get there eventually.
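To put rough numbers on the "just keep collecting data" point, here's a minimal sketch. It isn't a simulation; it just computes the two-sided p value you'd get at the expected z statistic for a two-group comparison, assuming equal group sizes, a known common SD, and a normal approximation. The function name and the effect size of d = 0.01 are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def expected_p_value(d, n_per_group):
    """Two-sided p value at the expected z statistic for a two-group comparison.

    Assumptions (mine, for illustration): equal group sizes, known common SD,
    normal approximation. d is the true standardized mean difference.
    """
    z = abs(d) * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(z))


# A trivially small true effect (d = 0.01) at increasing sample sizes.
for n in (100, 10_000, 100_000, 1_000_000):
    print(f"n per group = {n:>9,}: expected p = {expected_p_value(0.01, n):.4f}")
```

Under these assumptions the expected p value shrinks steadily as n grows, and it falls below .05 somewhere between 10,000 and 100,000 subjects per group, even though the effect itself never changes.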
I don’t think this is a controversial point, and my sense is that it’s the received wisdom among (most) statisticians. That doesn’t mean that the hypothesis testing framework isn’t useful, just that it’s fundamentally rooted in ideas that turn out to be kind of silly upon examination. (For the record, I use significance tests all the time in my own work, and do all sorts of other things I know on some level to be silly, so I’m not saying that we should abandon hypothesis testing wholesale).
Anyway, GWW’s argument is that, at least in some areas of psychology, people don’t really care about effect sizes, and simply want to know if there’s a real effect or not. I disagree for at least two reasons. First, when people say they don’t care about effect sizes, I think what they really mean is that they don’t feel a need to explicitly think about effect sizes, because they can just rely on a decision criterion of p < .05 to determine whether or not an effect is ‘real’. The problem is that, since the null hypothesis is nearly always false (i.e., effects are almost never exactly zero in the population), if we just keep collecting data, eventually nearly all effects become statistically significant, rendering the decision criterion completely useless. At that point, we’d presumably have to rely on effect sizes to decide what’s important. So it may look like you can get away without considering effect sizes, but that’s only because, for the kind of sample sizes we usually work with, p values basically end up being (poor) proxies for effect sizes.
Second, I think it’s simply not true that we care about any effect at all. GWW makes a seemingly reasonable suggestion that even if it’s not sensible to care about a null of exactly zero, it’s quite sensible to care about nothing but the direction of an effect. But I don’t think that really works either. The problem is that, in practice, we don’t really just care about the direction of the effect; we also want to know that it’s meaningfully large (where ‘meaningfully’ is intentionally vague, and can vary from person to person or question to question). GWW gives a priming example: if a theoretical model predicts the presence of a priming effect, isn’t it enough just to demonstrate a statistically significant priming effect in the predicted direction? Does it really matter how big the effect is?
Yes. To see this, suppose that I go out and collect priming data online from 100,000 subjects, and happily reject the null at p < .05 based on a priming effect of a quarter of a millisecond (where the mean response time is, say, on the order of a second). Does that result really provide any useful support for my theory, just because I was able to reject the null? Surely not. For one thing, a quarter of a millisecond is so tiny that any reviewer worth his or her salt is going to point out that any number of confounding factors could be responsible for that tiny association. An effect that small is essentially uninterpretable. But there is, presumably, some minimum size for every putative effect which would lead us to say: “okay, that’s interesting. It’s a pretty small effect, but I can’t just dismiss it out of hand, because it’s big enough that it can’t be attributed to utterly trivial confounds.” So yes, we do care about effect sizes.
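Here's a rough sketch of that hypothetical priming study as a one-sample z test on per-subject priming effects. The 0.25 ms effect and the 100,000 subjects come from the example above; the 40 ms standard deviation of the per-subject effects is purely my assumption, chosen so the result lands just under .05, and the function name is made up.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(mean_diff, sd_diff, n):
    """Return (z, two-sided p, standardized effect d) for a mean difference.

    Treats sd_diff as known, so this is a z test, not a t test.
    """
    se = sd_diff / sqrt(n)           # standard error of the mean difference
    z = mean_diff / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    d = mean_diff / sd_diff          # effect in SD units
    return z, p, d


# 0.25 ms priming effect, assumed SD of 40 ms, 100,000 subjects.
z, p, d = one_sample_z_test(mean_diff=0.25, sd_diff=40.0, n=100_000)
print(f"z = {z:.3f}, p = {p:.4f}, d = {d:.5f}")
```

With those numbers the test comes out significant at p < .05, while the standardized effect is well under a hundredth of a standard deviation. Pick a larger SD and the same quarter millisecond stops being significant; either way the p value alone tells you very little about whether the effect is worth taking seriously.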
The problem, of course, is that what constitutes a ‘meaningful’ effect is largely subjective. No doubt that’s why null hypothesis testing holds such an appeal for most of us (myself included)–it may be silly, but it’s at least objectively silly. It doesn’t require you to put your subjective beliefs down on paper. Still, at the end of the day, that apprehensiveness we feel about it doesn’t change the fact that you can’t get away from consideration of effect sizes, whether explicitly or implicitly. Saying that you don’t care about effect sizes doesn’t actually make it so; it just means that you’re implicitly saying that you literally care about any effect that isn’t exactly zero–which is, on its face, absurd. Had you picked any other null to test against (e.g., a range of standardized effect sizes between -0.1 and 0.1), you wouldn’t have that problem.
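One simple (and conservative) way to test against a range rather than a point is to ask whether the confidence interval for the standardized effect falls entirely outside that range. The sketch below does this for two equal-sized groups; the function name is hypothetical, the ±0.1 bound is the example range from above, and the standard error uses a large-sample approximation that ignores a small term growing with d.

```python
from math import sqrt
from statistics import NormalDist


def clears_negligible_range(d_hat, n_per_group, bound=0.1, alpha=0.05):
    """Return True if the CI for d lies entirely outside [-bound, bound].

    Assumes two equal-sized groups and SE(d) ~ sqrt(2 / n_per_group),
    a large-sample approximation (my simplification).
    """
    se = sqrt(2 / n_per_group)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    lower = d_hat - z_crit * se
    upper = d_hat + z_crit * se
    return lower > bound or upper < -bound


# A tiny effect in a huge sample: the CI excludes zero but sits inside +/-0.1.
print(clears_negligible_range(d_hat=0.02, n_per_group=100_000))
# A moderate effect in a modest sample: the CI sits above 0.1.
print(clears_negligible_range(d_hat=0.5, n_per_group=200))
```

The point of the design is that piling on more data no longer rescues a trivial effect: a larger sample just pins the estimate down more tightly inside the range you've already said you don't care about.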
To reiterate, I’m emphatically not saying that anyone who doesn’t explicitly report, or even think about, effect sizes when running a study is doing something terribly wrong. I think it’s a very good idea to (a) run power calculations before starting a study, (b) frequently pause to reflect on what kinds of effects one considers big enough to be worth pursuing, and (c) report effect size measures and confidence intervals for all key tests in one’s papers. But I’m certainly not suggesting that if you don’t do these things, you’re a bad person, or even a bad researcher. All I’m saying is that the importance of effect sizes doesn’t go away just because you’re not thinking about them. A decision about what constitutes a meaningful effect size is made every single time you test your data against the null hypothesis; so you may as well be the one making that decision explicitly, instead of having it done for you implicitly in a silly way. No one really cares about anything-but-zero.
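For what it's worth, the power calculation in (a) is itself an explicit statement about effect sizes, since you can't run one without naming the smallest effect you care about. A minimal sketch, assuming a two-sided comparison of two equal-sized groups and a normal approximation (a proper t-based calculation gives slightly larger numbers); the function name and defaults are mine:

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate subjects per group needed to detect standardized effect d.

    Normal-approximation formula: n = 2 * ((z_{1 - alpha/2} + z_{power}) / d) ** 2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Required n depends heavily on the effect size you decide is worth detecting.
for d in (0.8, 0.5, 0.2, 0.1):
    print(f"d = {d}: about {n_per_group(d)} per group")
```

Halving the effect size you care about roughly quadruples the sample you need, which is exactly why the choice has to be made deliberately.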