Tangentially related to the last post, Games With Words has a post up soliciting opinions about the merit of effect sizes. The impetus is a discussion we had in the comments on his last post about Jonah Lehrer’s New Yorker article. It started with an obnoxious comment (mine, of course) and then rapidly devolved into a murderous duel (er, civil debate) about the importance (or lack thereof) of effect sizes in psychology. What I argued is that consideration of effect sizes is absolutely central to most everything psychologists do, even if that consideration is usually implicit rather than explicit. GWW thinks effect sizes aren’t that important, or at least, don’t have to be.
The basic observation in support of thinking in terms of effect sizes rather than (or in addition to) p values is simply that the null hypothesis is nearly always false. (I think I said “always” in the comments, but I can live with “nearly always”). There are exceedingly few testable associations between two or more variables that could plausibly have an effect size of exactly zero. Which means that if all you care about is rejecting the null hypothesis by reaching p < .05, all you really need to do is keep collecting data–you will get there eventually.
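To make that concrete, here’s a quick simulation sketch. The true effect of d = 0.02 and the sample sizes are arbitrary numbers I’ve picked for illustration, not estimates from any real literature; the point is just that the p value for a tiny but nonzero effect drifts below .05 once N gets large enough.

```python
# A minimal sketch of the "just keep collecting data" point: with a tiny but
# nonzero true effect, a two-group comparison eventually reaches p < .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d = 0.02  # a "trivial" standardized effect, assumed purely for illustration

for n_per_group in [100, 1_000, 10_000, 100_000, 1_000_000]:
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(true_d, 1.0, n_per_group)
    t, p = stats.ttest_ind(a, b)
    print(f"n per group = {n_per_group:>9,}: p = {p:.4f}")
```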
I don’t think this is a controversial point, and my sense is that it’s the received wisdom among (most) statisticians. That doesn’t mean that the hypothesis testing framework isn’t useful, just that it’s fundamentally rooted in ideas that turn out to be kind of silly upon examination. (For the record, I use significance tests all the time in my own work, and do all sorts of other things I know on some level to be silly, so I’m not saying that we should abandon hypothesis testing wholesale).
Anyway, GWW’s argument is that, at least in some areas of psychology, people don’t really care about effect sizes, and simply want to know if there’s a real effect or not. I disagree for at least two reasons. First, when people say they don’t care about effect sizes, I think what they really mean is that they don’t feel a need to explicitly think about effect sizes, because they can just rely on a decision criterion of p < .05 to determine whether or not an effect is ‘real’. The problem is that, since the null hypothesis is always false (i.e., effects are never exactly zero in the population), if we just keep collecting data, eventually all effects become statistically significant, rendering the decision criterion completely useless. At that point, we’d presumably have to rely on effect sizes to decide what’s important. So it may look like you can get away without considering effect sizes, but that’s only because, for the kind of sample sizes we usually work with, p values basically end up being (poor) proxies for effect sizes.
Second, I think it’s simply not true that we care about any effect at all, no matter how small. GWW makes a seemingly reasonable suggestion that even if it’s not sensible to care about a null of exactly zero, it’s quite sensible to care about nothing but the direction of an effect. But I don’t think that really works either. The problem is that, in practice, we don’t really just care about the direction of the effect; we also want to know that it’s meaningfully large (where ‘meaningfully’ is intentionally vague, and can vary from person to person or question to question). GWW gives a priming example: if a theoretical model predicts the presence of a priming effect, isn’t it enough just to demonstrate a statistically significant priming effect in the predicted direction? Does it really matter how big the effect is?
Yes. To see this, suppose that I go out and collect priming data online from 100,000 subjects, and happily reject the null at p < .05 based on a priming effect of a quarter of a millisecond (where the mean response time is, say, on the order of a second). Does that result really provide any useful support for my theory, just because I was able to reject the null? Surely not. For one thing, a quarter of a millisecond is so tiny that any reviewer worth his or her salt is going to point out that any number of confounding factors could be responsible for that tiny association. An effect that small is essentially uninterpretable. But there is, presumably, some minimum size for every putative effect which would lead us to say: “okay, that’s interesting. It’s a pretty small effect, but I can’t just dismiss it out of hand, because it’s big enough that it can’t be attributed to utterly trivial confounds.” So yes, we do care about effect sizes.
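For what it’s worth, here’s a back-of-the-envelope version of that hypothetical. I’m assuming, purely for illustration, that the standard deviation of the within-subject RT difference is 20 ms; under that assumption, a 0.25 ms priming effect corresponds to d = 0.0125, and at N = 100,000 the null gets rejected nearly every time:

```python
# Hedged sketch of the hypothetical priming example: a 0.25 ms effect with an
# assumed 20 ms SD of the within-subject difference scores (made-up numbers).
import numpy as np
from scipy import stats

n = 100_000          # subjects, as in the example above
effect_ms = 0.25     # hypothetical priming effect
sd_diff_ms = 20.0    # assumed SD of the RT difference scores

d = effect_ms / sd_diff_ms          # standardized (paired) effect size
ncp = d * np.sqrt(n)                # noncentrality of the paired t statistic
crit = stats.norm.ppf(0.975)        # two-tailed .05 cutoff (normal approximation)
power = stats.norm.sf(crit - ncp) + stats.norm.cdf(-crit - ncp)

print(f"d = {d:.4f}, power to reject the null at p < .05 = {power:.2f}")
```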
The problem, of course, is that what constitutes a ‘meaningful’ effect is largely subjective. No doubt that’s why null hypothesis testing holds such an appeal for most of us (myself included)–it may be silly, but it’s at least objectively silly. It doesn’t require you to put your subjective beliefs down on paper. Still, at the end of the day, that apprehensiveness we feel about it doesn’t change the fact that you can’t get away from consideration of effect sizes, whether explicitly or implicitly. Saying that you don’t care about effect sizes doesn’t actually make it so; it just means that you’re implicitly saying that you literally care about any effect that isn’t exactly zero–which is, on its face, absurd. Had you picked any other null to test against (e.g., a range of standardized effect sizes between -0.1 and 0.1), you wouldn’t have that problem.
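For the curious, here’s roughly what testing against a range null might look like in practice. This is just a sketch of a minimum-effect test on a correlation; the function name, the Fisher-z approach, and the |r| = .10 cutoff are all my own illustrative choices, not a standard anyone is obliged to adopt.

```python
# A sketch of a "minimum-effect" test: instead of the point null rho = 0, ask
# whether the observed correlation is reliably larger in magnitude than a
# smallest effect size of interest (here |r| = .10, an arbitrary choice).
import numpy as np
from scipy import stats

def min_effect_test(r_obs, n, r_min=0.10):
    """One-sided test of H0: |rho| <= r_min via Fisher's z transformation."""
    z_obs = np.arctanh(abs(r_obs))
    z_min = np.arctanh(r_min)
    se = 1.0 / np.sqrt(n - 3)
    z = (z_obs - z_min) / se
    return stats.norm.sf(z)  # p value for exceeding the minimum effect

print(min_effect_test(r_obs=0.25, n=300))  # clearly beyond |r| = .10
print(min_effect_test(r_obs=0.12, n=300))  # significant against a zero null at
                                           # this N, but not reliably larger
                                           # than a trivially small effect
```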
To reiterate, I’m emphatically not saying that anyone who doesn’t explicitly report, or even think about, effect sizes when running a study should be lined up against a wall and fired upon at will (er, is doing something terribly wrong). I think it’s a very good idea to (a) run power calculations before starting a study, (b) frequently pause to reflect on what kinds of effects one considers big enough to be worth pursuing, and (c) report effect size measures and confidence intervals for all key tests in one’s papers. But I’m certainly not suggesting that if you don’t do these things, you’re a bad person, or even a bad researcher. All I’m saying is that the importance of effect sizes doesn’t go away just because you’re not thinking about them. A decision about what constitutes a meaningful effect size is made every single time you test your data against the null hypothesis, so you may as well be the one making that decision explicitly, instead of having it done for you implicitly in a silly way. No one really cares about anything-but-zero.
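As an aside, point (a) doesn’t require anything fancy; a rough normal-approximation power calculation is a few lines of code. The effect sizes below are illustrative, not prescriptive.

```python
# Rough a-priori power sketch (normal approximation) for a two-group design:
# how many subjects per group to detect a given standardized effect with 80%
# power at alpha = .05?
import numpy as np
from scipy import stats

def n_per_group(d, alpha=0.05, power=0.80):
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    return int(np.ceil(2 * ((z_a + z_b) / d) ** 2))

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: ~{n_per_group(d)} subjects per group")
```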
The problem is that, since the null hypothesis is always false
Is that an a priori truth, or an empirical claim? If the former, what’s the proof? If the latter, what’s the evidence? I’m willing to believe that you’re right, but so far your arguments have been only thought experiments, and I have my own thought experiments that lead to other conclusions.
In any case, it seems like you’re proposing dividing the world into two types of effects: those large enough to be detected, and those so small they’ll never be detected outside of thought experiments. I’m not convinced that the latter are as common as you think, but for present purposes this is a distinction without a difference, as we’re unlikely to ever know who is right since the relevant data will never be collected (and for reasons I think we’re both happy with).
Given that effect size matters, what should we make of the 0.1-0.5% signal changes we typically see in fMRI? I have heard people dismissing fMRI research because of that.
Is that an a priori truth, or an empirical claim? If the former, what’s the proof?
I’d classify it as the former. The reasoning is simple: if the null is true, that means that if you were to sample the entire population, there would literally be absolutely no statistical association between your variables. I don’t see how that’s possible, and I’d argue you pretty much have to believe in intelligent design to make that plausible, because the only way you’d ever get exactly zero is if some omnipotent being were arranging it that way–otherwise a million different factors would contrive to push the effect one way or the other, for completely trivial reasons. But if you don’t buy that reasoning, I’m happy to agree to disagree.
In any case, it seems like you’re proposing dividing the world into two types of effects: those large enough to be detected, and those so small they’ll never be detected outside of thought experiments.
I was just illustrating the logical conclusion of the belief that the null is viable. In practice, you don’t have to talk about infinitesimally small effects; the null can become quite silly even with pretty reasonable sample sizes. E.g., in my blogging paper, which had > 300 subjects for most analyses, about 25% of all possible correlations were significant. Do I really think there’s a meaningful difference between those 25% and most of the other 75%? Of course not; if I’d collected another 100 subjects, the null would have been rejected for another large chunk of effects, and I’m pretty sure if I’d collected 1,000 subjects, pretty much every effect would be significant. So should I just conclude that it’s meaningless to say anything other than “personality is related to word use in pretty much every way?” No, obviously. The hypothesis testing framework is useless at that point, and I have to instead think about relative effect sizes.
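(If you want the flavor of this without my actual data, here’s a toy simulation: assume a few hundred true correlations that are all small but nonzero, and watch the share that clears p < .05 grow with N. The 0.02-0.15 range of true effects is an assumption for illustration only, not an estimate from the blogging paper.)

```python
# Illustrative simulation: many small but nonzero true correlations, and the
# proportion reaching p < .05 as sample size grows.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_rs = rng.uniform(0.02, 0.15, size=500)  # assumed distribution of small effects

def prop_significant(n):
    sig = 0
    for rho in true_rs:
        x = rng.normal(size=n)
        y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)
        r, p = stats.pearsonr(x, y)
        if p < .05:
            sig += 1
    return sig / len(true_rs)

for n in (300, 1_000, 5_000):
    print(f"N = {n:>5}: {prop_significant(n):.0%} of effects significant")
```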
Given that effect size matters, what should we make of the 0.1-0.5% signal changes we typically see in fMRI? I have heard people dismissing fMRI research because of that.
It’s not the raw effect size that matters, it’s the standardized effect size. Changes of 1% BOLD signal can be huge, because the standard deviation around that change is often tiny.
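(Toy arithmetic, with made-up numbers, just to illustrate the raw-versus-standardized distinction:)

```python
# A "small" raw change can be a large standardized effect if variability is
# small. Both numbers below are hypothetical.
raw_change = 0.3       # % BOLD signal change (hypothetical)
sd_of_change = 0.15    # SD of that change across subjects (hypothetical)

cohens_d = raw_change / sd_of_change
print(f"Cohen's d = {cohens_d:.1f}")  # 2.0: a very large standardized effect
```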
if the null is true, that means that if you were to sample the entire population, there would literally be absolutely no statistical association between your variables
Sure, if you’re dealing with fixed effects, but I don’t think that’s very common outside of census reports. The existing population itself is just a sample drawn from some underlying distribution, and it’s that underlying distribution we are trying to generalize to.
So should I just conclude that it’s meaningless to say anything other than “personality is related to word use in pretty much every way?”
If you got those results, sure! You’d want to distinguish between big effects and small effects, but you’d still have to explain all the effects.
But not all things are related. If I clap my hands, the mountains in the Rockies don’t move, despite the fact that they’re all part of the same causal system. If I send a letter to someone and it never arrives, they aren’t affected by the contents of the letter. Raise ice from -50 degrees to -40 degrees and it doesn’t melt; that much heat isn’t sufficient. If I vote for Al Gore and my vote isn’t counted (true story), his vote tally doesn’t rise. If it had been counted, he still wouldn’t have become president (one vote is insufficient). There’s no reason to believe the brain is any different from the rest of the world.
Sure, if you’re dealing with fixed effects, but I don’t think that’s very common outside of census reports. The existing population itself is just a sample drawn from some underlying distribution, and it’s that underlying distribution we are trying to generalize to.
I think there’s actually a much deeper question here as to whether or not it makes sense to think of the extant population as the population. Personally I don’t think it does (except in unusual cases, like when you have 200 remaining members of an endangered species), but I’m happy to disagree. I don’t think it changes anything, because you have exactly the same problem: why would you think that if you could examine the underlying, nonexistent distribution (whatever that means) you would approach an effect size of exactly zero at the limit? If it’s not plausible that there would be an effect of exactly zero in a sample of 7 billion humans, do you really think adding another 7 billion is going to change anything? The causal system is not going to become any less densely interconnected because of it.
If you got those results, sure! You’d want to distinguish between big effects and small effects, but you’d still have to explain all the effects.
I have a matrix of 35 traits by 5,000 English words. What’s there to explain? What’s at all interesting about the fact that the relationship between any given personality trait and any word you care to name isn’t exactly zero? That’s trivially true; I would never doubt it. But does that mean I shouldn’t have done the study? Do you really believe that any of the effects you’ve tested for in your work would have any reasonable shot at being exactly zero if you could collect a billion subjects?
But not all things are related. If I clap my hands, the mountains in the Rockies don’t move, despite the fact that they’re all part of the same causal system. If I send a letter to someone and it never arrives, they aren’t affected by the contents of the letter.
I’m not making a principled, dogmatic statement here; I’m making a practical one. You can trivially define hypotheses that are completely uninteresting where the null is going to be true. For instance, if you want to test the hypothesis that every time you clap your hands, a mountain moves fifty miles, I’m certainly not going to stop you from accepting the null. It’s not a meaningful scientific question. What I’m saying is that for pretty much any meaningful question, up to and including ESP, the null is false. If your hypothesis were that clapping human hands causes the molecules in the Rockies to move, then of course we can reject the null hypothesis out of hand.
What I haven’t seen you give yet is an example of a research question you would actually care about personally where it was even remotely plausible that the null was true. That’s the central issue. Not whether either of us can come up with contrived existence proofs, but whether, in practice, when you set out to test hypotheses you actually care about in your own studies, you find plausible the idea that if you collected all the data in the world, you would get an effect of exactly zero for any of the things you care about. I would argue that that’s absurd on its face, but I’m happy to disagree if you really do believe that. But that really is what you have to commit yourself to.
Anyway, you get the last word–I think I’ve run out of steam on this debate. But I’ve enjoyed it and found it useful!
What I haven’t seen you give yet is an example of a research question you would actually care about personally where it was even remotely plausible that the null was true.
Every one I’ve ever studied. I actually gave multiple examples in the relevant blog post.
If it’s not plausible that there would be an effect of exactly zero in a sample of 7 billion humans, do you really think adding another 7 billion is going to change anything?
I think you’re confusing measured effect with expected effect. Hypothesis testing is about expected effects — at least, it’s supposed to be, since that’s all anyone cares about. That is, I’m pretty sure for most things we study nobody cares what the exact effect is in any given sample, whether of 7 billion humans or more or fewer. We care about the expected effect. And that may be 0.