Induction is not optional (if you’re using inferential statistics): reply to Lakens

A few months ago, I posted an online preprint titled The Generalizability Crisis. Here’s the abstract:

Most theories and hypotheses in psychology are verbal in nature, yet their evaluation overwhelmingly relies on inferential statistical procedures. The validity of the move from qualitative to quantitative analysis depends on the verbal and statistical expressions of a hypothesis being closely aligned—that is, that the two must refer to roughly the same set of hypothetical observations. Here I argue that most inferential statistical tests in psychology fail to meet this basic condition. I demonstrate how foundational assumptions of the “random effects” model used pervasively in psychology impose far stronger constraints on the generalizability of results than most researchers appreciate. Ignoring these constraints dramatically inflates false positive rates and routinely leads researchers to draw sweeping verbal generalizations that lack any meaningful connection to the statistical quantities they’re putatively based on. I argue that failure to consider generalizability from a statistical perspective lies at the root of many of psychology’s ongoing problems (e.g., the replication crisis), and conclude with a discussion of several potential avenues for improvement.

I submitted the paper to Behavioral and Brain Sciences, and recently received 6 (!) generally positive reviews. I’m currently in the process of revising the manuscript in response to a lot of helpful feedback (both from the BBS reviewers and a number of other people). In the interim, however, I’ve decided to post a response to one of the reviews that I felt was not helpful, and instead has had the rather unfortunate effect of derailing some of the conversation surrounding my paper.

The review in question is by Daniel Lakens, who, in addition to being one of the BBS reviewers, also posted his review publicly on his blog. While I take issue with the content of Lakens’s review, I’m a fan of open, unfiltered, commentary, so I appreciate Daniel taking the time to share his thoughts, and I’ve done the same here. In the rather long piece that follows, I argue that Lakens’s criticisms of my paper stem from an incoherent philosophy of science, and that once we amend that view to achieve coherence, it becomes very clear that his position doesn’t contradict the argument laid out in my paper in any meaningful way—in fact, if anything, the former is readily seen to depend on the latter.

Lakens makes five main points in his review. My response also has five sections, but I’ve moved some arguments around to give the post a better flow. I’ve divided things up into two main criticisms (mapping roughly onto Lakens’s points 1, 4, and 5), followed by three smaller ones you should probably read only if you’re entertained by petty, small-stakes academic arguments.

Bad philosophy

Lakens’s first and probably most central point can be summarized as a concern with (what he sees as) a lack of philosophical grounding, resulting in some problematic assumptions. Lakens argues that my paper fails to respect a critical distinction between deduction and induction, and consequently runs aground by assuming that scientists (or at least, psychologists) are doing induction when (according to Lakens) they’re doing deduction. He suggests that my core argument—namely, that verbal and statistical hypotheses have to closely align in order to support sensible inference—assumes a scientific project quite different from what most psychologists take themselves to be engaged in.

In particular, Lakens doesn’t think that scientists are really in the business of deriving general statements about the world on the basis of specific observations (i.e., induction). He thinks science is better characterized as a deductive enterprise, where scientists start by positing a particular theory, and then attempt to test the predictions they wring out of that theory. This view, according to Lakens, does not require one to care about statistical arguments of the kind laid out in my paper. He writes:

Yarkoni incorrectly suggests that “upon observing that a particular set of subjects rated a particular set of vignettes as more morally objectionable when primed with a particular set of cleanliness-related words than with a particular set of neutral words, one might draw the extremely broad conclusion that ‘cleanliness reduces the severity of moral judgments'”. This reverses the scientific process as proposed by Popper, which is (as several people have argued, see below) the dominant approach to knowledge generation in psychology. The authors are not concluding that “cleanliness reduces the severity of moral judgments” from their data. This would be induction. Instead, they are positing that “cleanliness reduces the severity of moral judgments”, they collected data and performed and empirical test, and found their hypothesis was corroborated. In other words, the hypothesis came first. It is not derived from the data – the hypothesis is what led them to collect the data.

Lakens’s position is that theoretical hypotheses are not inferred from the data in a bottom-up, post-hoc way—i.e., by generalizing from finite observations to a general regularity—rather, they’re formulated in advance of the data, which is then only used to evaluate the tenability of the theoretical hypothesis. This, in his view, is how we should think about what psychologists are doing—and he credits this supposedly deductivist view to philosophers of science like Popper and Lakatos:

Yarkoni deviates from what is arguably the common approach in psychological science, and suggests induction might actually work: “Eventually, if the effect is shown to hold when systematically varying a large number of other experimental factors, one may even earn the right to summarize the results of a few hundred studies by stating that “cleanliness reduces the severity of moral judgments””. This approach to science flies right in the face of Popper (1959/2002, p. 10), who says: “I never assume that we can argue from the truth of singular statements to the truth of theories. I never assume that by force of ‘verified’ conclusions, theories can be established as ‘true’, or even as merely ‘probable’.”

Similarly, Lakatos (1978, p. 2) writes: “One can today easily demonstrate that there can be no valid derivation of a law of nature from any finite number of facts; but we still keep reading about scientific theories being proved from facts. Why this stubborn resistance to elementary logic?” I am personally on the side of Popper and Lakatos, but regardless of my preferences, Yarkoni needs to provide some argument his inductive approach to science has any possibility of being a success, preferably by embedding his views in some philosophy of science. I would also greatly welcome learning why Popper and Lakatos are wrong. Such an argument, which would overthrow the dominant model of knowledge generation in psychology, could be impactful, although a-priori I doubt it will be very successful.

For reasons that will become clear shortly, I think Lakens’s appeal to Popper and Lakatos here is misguided—those philosophers’ views actually have very little resemblance to the position Lakens stakes out for himself. But let’s start with the distinction Lakens draws between induction and deduction, and the claim that the latter provides an alternative to the former—i.e., that psychologists can avoid making inductive claims if they simply construe what they’re doing as a form of deduction. While this may seem like an intuitive claim at first blush, closer inspection quickly reveals that, far from psychologists having a choice between construing the world in deductive versus inductive terms, they’re actually forced to embrace both forms of reasoning, working in tandem.

There are several ways to demonstrate this, but since Lakens holds deductivism in high esteem, we’ll start out from a strictly deductive position, and then show why our putatively deductive argument eventually requires us to introduce a critical inductive step in order to make any sense out of how contemporary psychology operates.

Let’s start with the following premise:

P1: If theory T is true, we should confirm prediction P

Suppose we want to build a deductively valid argument that starts from the above premise, which seems pretty foundational to hypothesis-testing in psychology. How can we embed P1 into a valid syllogism, so that we can make empirical observations (by testing P) and then update our belief in theory T? Here’s the most obvious deductively valid way to complete the syllogism:

P1: If theory T is true, we should confirm prediction P
P2: We fail to confirm prediction P
C: Theory T is false

So stated, this modus tollens captures the essence of “naive” Popperian falsificationism: what scientists do (or ought to do) is attempt to disprove their hypotheses. On this view, if a theory T legitimately entails P, then disconfirming P is sufficient to falsify T. Once that’s done, a scientist can just pack it up and happily move on to the next theory.

Unfortunately, this account, while intuitive and elegant, fails miserably on the reality front. It simply isn’t how scientists actually operate. The problem, as Lakatos famously pointed out, is that the “core” of a theory T never strictly entails a prediction P by itself. There are invariably other auxiliary assumptions and theories that need to hold true in order for the T → P conditional to apply. For example, observing that people walk more slowly out of a testing room after being primed with old age-related words than with youth-related words doesn’t provide any meaningful support for a theory of social priming unless one is willing to make a large number of auxiliary assumptions: that experimenter knowledge doesn’t inadvertently bias participants; that researcher degrees of freedom have been fully controlled in the analysis; that the stimuli used in the two conditions don’t differ in some irrelevant dimension that can explain the subsequent behavioral change; and so on.

This “sophisticated falsificationism”, as Lakatos dubbed it, is the viewpoint that I gather Lakens thinks most psychologists implicitly subscribe to. And Lakens believes that the deductive nature of the reasoning articulated above is what saves psychologists from having to worry about statistical notions of generalizability.

Unfortunately, this is wrong. To see why, we need only observe that the Popperian and Lakatosian views frame their central deductive argument in terms of falsificationism: researchers can disprove scientific theories by failing to confirm predictions, but, as the Popper statement Lakens approvingly quotes suggests, they can’t affirmatively prove them. This constraint isn’t terribly problematic in heavily quantitative scientific disciplines where theories often generate extremely specific quantitative predictions whose failure would be difficult to reconcile with those theories’ core postulates. For example, Einstein predicted the gravitational redshift of light in 1907 on the basis of his equivalence principle, yet it took nearly 50 years to definitively confirm that prediction via experiment. At the time it was formulated, Einstein’s prediction would have made no sense except in light of the equivalence principle, so the later confirmation of the prediction provided very strong corroboration of the theory (and, by the same token, a failure to experimentally confirm the existence of redshift would have dealt general relativity a very serious blow). Thus, at least in those areas of science where it’s possible to extract extremely “risky” predictions from one’s theories (more on that later), it seems perfectly reasonable to proceed as if critical experiments can indeed affirmatively corroborate theories, even if such a conclusion isn’t strictly deductively valid.

This, however, is not how almost any psychologists actually operate. As Paul Meehl pointed out in his seminal contrast of standard operating procedures in physics and psychology (Meehl, 1967), psychologists almost never make predictions whose disconfirmation would plausibly invalidate theories. Rather, they typically behave like confirmationists, concluding, on the basis of empirical confirmation of predictions, that their theories are supported (or corroborated). But this latter approach has a logic quite different from the (valid) falsificationist syllogism we saw above. The confirmationist logic that pervades psychology is better represented as follows:

P1: If theory T is true, we should confirm prediction P
P2: We confirm prediction P
C: Theory T is true

C would be a really nice conclusion to draw, if we were entitled to it, because, just as Lakens suggests, we would then have arrived at a way to deduce general theoretical statements from finite observations. Quite a trick indeed. But it doesn’t work; the argument is deductively invalid. If it’s not immediately clear to you why, consider the following argument, which has exactly the same logical structure:

Argument 1
P1: If God loves us all, the sky should be blue
P2: The sky is blue
C: God loves us all

We are not concerned here with the truth of the two premises, but only with the validity of the argument as a whole. And the argument is clearly invalid. Even if we were to assume P1 and P2, C still wouldn’t follow. Observing that the sky is blue (clearly true) doesn’t entail that God loves us all, even if P1 happens to be true, because there could be many other reasons the sky is blue that don’t involve God in any capacity (including, say, differential atmospheric scattering of different wavelengths of light), none of which are precluded by the stated premises.

Now you might want to say, well, sure, but Argument 1 is patently absurd, whereas the arguments Lakens attributes to psychologists are not nearly so silly. But from a strictly deductive standpoint, the typical logic of hypothesis testing in psychology is exactly as silly. Compare the above argument with a running example Lakens (following my paper) uses in his review:

Argument 2
P1: If the theory that cleanliness reduces the severity of moral judgments is true, we should observe condition A > condition B, p < .05
P2: We observe condition A > condition B, p < .05
C: Cleanliness reduces the severity of moral judgments

Subjectively, you probably find this argument much more compelling than the God-makes-the-sky-blue version in Argument 1. But that’s because you’re thinking about the relative plausibility of P1 in the two cases, rather than about the logical structure of the argument. As a purportedly deductive argument, Argument 2 is exactly as bad as Argument 1, and for exactly the same reason: it affirms the consequent. C doesn’t logically follow from P1 and P2, because there could be any number of other potential premises (P3…Pk) that reflect completely different theories yet allow us to derive exactly the same prediction P.
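The difference between the valid falsificationist syllogism and the invalid confirmationist one can be checked mechanically. The following is a minimal sketch, not anything from Lakens's review or my paper: it treats "theory T is true" and "prediction P is confirmed" as two propositional variables, and the helper name `is_valid` is purely illustrative. An argument form counts as valid only if no truth assignment makes every premise true while the conclusion is false.

```python
from itertools import product

def is_valid(premises, conclusion, n_vars=2):
    """Valid iff no assignment makes all premises true and the conclusion false."""
    for values in product([True, False], repeat=n_vars):
        if all(p(*values) for p in premises) and not conclusion(*values):
            return False  # found a counterexample
    return True

def implies(a, b):
    """Material conditional: 'if a then b'."""
    return (not a) or b

# t = "theory T is true", p = "prediction P is confirmed"
modus_tollens = is_valid(
    premises=[lambda t, p: implies(t, p), lambda t, p: not p],
    conclusion=lambda t, p: not t,
)
affirming_the_consequent = is_valid(
    premises=[lambda t, p: implies(t, p), lambda t, p: p],
    conclusion=lambda t, p: t,
)

print(modus_tollens)             # True
print(affirming_the_consequent)  # False
```

The counterexample the second check finds is the case where T is false but P is confirmed anyway, which is exactly the situation in which some other theory generates the same prediction.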

This propensity to pass off deductively nonsensical reasoning as good science is endemic to psychology (and, to be fair, many other sciences). The fact that the confirmation of most empirical predictions in psychology typically provides almost no support for the theories those predictions are meant to test does not seem to deter researchers from behaving as if affirmation of the consequent is a deductively sound move. As Meehl rather colorfully wrote all the way back in 1967:

In this fashion a zealous and clever investigator can slowly wend his way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of “an integrated research program,” without ever once refuting or corroborating so much as a single strand of the network.

Meehl was hardly alone in taking a dim view of the kind of argument we find in Argument 2, and which Lakens defends as a perfectly respectable “deductive” way to do psychology. Lakatos (the very same Lakatos that Lakens claims he “is on the side of”) was no fan of it either. Lakatos generally had very little to say about psychology, and it seems pretty clear (at least to me) that his views about how science works were rooted primarily in consideration of natural sciences like physics. But on the few occasions that he did venture an opinion about the “soft” sciences, he made it abundantly clear that he was not a fan. From Lakatos (1970):

This requirement of continuous growth … hits patched-up, unimaginative series of pedestrian ‘empirical’ adjustments which are so frequent, for instance, in modern social psychology. Such adjustments may, with the help of so-called ‘statistical techniques’, make some ‘novel’ predictions and may even conjure up some irrelevant grains of truth in them. But this theorizing has no unifying idea, no heuristic power, no continuity. They do not add up to a genuine research programme and are, on the whole, worthless¹.

If we follow that footnote 1 after “worthless”, we find this:

After reading Meehl (1967) and Lykken (1968) one wonders whether the function of statistical techniques in the social sciences is not primarily to provide a machinery for producing phoney corroborations and thereby a semblance of “scientific progress” where, in fact, there is nothing but an increase in pseudo-intellectual garbage. … It seems to me that most theorizing condemned by Meehl and Lykken may be ad hoc₃. Thus the methodology of research programmes might help us in devising laws for stemming this intellectual pollution …

By ad hoc₃, Lakatos means that social scientists regularly explain anomalous findings by concocting new post-hoc explanations that may generate novel empirical predictions, but don’t follow in any sensible way from the “positive heuristic” of a theory (i.e., the set of rules and practices that describe in advance how a researcher ought to interpret and respond to discrepancies). Again, here’s Lakatos:

In fact, I define a research programme as degenerating even if it anticipates novel facts but does so in a patched-up development rather than by a coherent, pre-planned positive heuristic. I distinguish three types of ad hoc auxiliary hypotheses: those which have no excess empirical content over their predecessor (‘ad hoc₁’), those which do have such excess content but none of it is corroborated (‘ad hoc₂’) and finally those which are not ad hoc in these two senses but do not form an integral part of the positive heuristic (‘ad hoc₃’). … Some of the cancerous growth in contemporary social ‘sciences’ consists of a cobweb of such ad hoc₃ hypotheses, as shown by Meehl and Lykken.

The above quotes are more or less the extent of what Lakatos had to say about psychology and the social sciences in his published work.

Now, I don’t claim to be able to read the minds of deceased philosophers, but in view of the above, I think it’s safe to say that Lakatos probably wouldn’t have appreciated Lakens claiming to be “on his side”. If Lakens wants to call the kind of view that considers Argument 2 a good way to do empirical science “deductivism”, fine; but I’m going to refer to it as Lakensian deductivism from here on out, because it’s not deductivism in any sense that approximates the normal meaning of the word “deductive” (I mean, it’s actually deductively invalid!), and I suspect Popper, Lakatos, and Meehl might have politely (or maybe not so politely) asked Lakens to cease and desist from implying that they approve of, or share, his views.

Induction to the rescue

So far, things are not looking so good for a strictly deductive approach to psychology. If we follow Lakens in construing deduction and induction as competing philosophical worldviews, and insist on banishing any kind of inductive reasoning from our inferential procedures, then we’re stuck facing up to the fact that virtually all hypothesis testing done by psychologists is actually deductively invalid, because it almost invariably has the logical form captured in Argument 2. I think this is a rather unfortunate outcome, if you happen to be a proponent of a view that you’re trying to convince people merits the label “deduction”.

Fortunately, all is not lost. It turns out that there is a way to turn Argument 2 into a perfectly reasonable basis for doing empirical science of the psychological variety. Unfortunately for Lakens, it runs directly through the kinds of arguments laid out in my paper. To see that, let’s first observe that we can turn the logically invalid Argument 2 into a valid syllogism by slightly changing the wording of P1:

Argument 3
P1: If, and only if, cleanliness reduces the severity of moral judgments, we should find that condition A > condition B, p < .05
P2: We find that condition A > condition B, p < .05
C: Cleanliness reduces the severity of moral judgments

Notice the newly-added words “and only if” in P1. They make all the difference! If we know that the prediction P can only be true if theory T is correct, then observing P does in fact allow us to deductively conclude that T is correct. Hooray!

Well, except that this little modification, which looks so lovely on paper, doesn’t survive contact with reality, because in psychology, it’s almost never the case that a given prediction could only have plausibly resulted from one’s favorite theory. Even if you think P1 is true in Argument 2 (i.e., the theory really does make that prediction), it’s clearly false in our updated Argument 3. There are lots of other reasons why we might observe the predicted result, p < .05, even if the theoretical hypothesis is false (i.e., if cleanliness doesn’t reduce the severity of moral judgment). For example, maybe the stimuli in condition A differ on some important but theoretically irrelevant dimension from those in B. Or maybe there are demand characteristics that seep through to the participants despite the investigators’ best efforts. Or maybe the participants interpret the instructions in some unexpected way, leading to strange results. And so on.
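Setting the empirical problem aside for a moment, the purely logical difference between Argument 2 and Argument 3 can be seen by listing the states of the world that remain possible once P1 and P2 are both granted. This is only an illustrative sketch in propositional terms; the function name `consistent_worlds` is hypothetical and each "world" is just a pair (theory is true, result is observed).

```python
from itertools import product

def consistent_worlds(p1):
    """List (theory_true, result_observed) pairs where P1 holds and the result is observed (P2)."""
    return [
        (theory, result)
        for theory, result in product([True, False], repeat=2)
        if p1(theory, result) and result
    ]

def conditional(theory, result):
    """Argument 2's P1: if the theory is true, the result is observed."""
    return (not theory) or result

def biconditional(theory, result):
    """Argument 3's P1: the result is observed if and only if the theory is true."""
    return theory == result

worlds_arg2 = consistent_worlds(conditional)
worlds_arg3 = consistent_worlds(biconditional)

print(worlds_arg2)  # [(True, True), (False, True)]
print(worlds_arg3)  # [(True, True)]
```

Under the plain conditional, a world in which the theory is false and the result shows up anyway survives both premises. The biconditional eliminates that world by stipulation, which is why Argument 3 is valid and also why its P1 is so hard to defend.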

Still, we’re on the right track. And we can tighten things up even further by making one last modification: we replace our biconditional P1 above with the following probabilistic version:

Argument 4
P1: It’s unlikely that we would observe A > B, p < .05, unless cleanliness reduces the severity of moral judgments
P2: We observe A > B, p < .05
C1: It’s probably true that cleanliness reduces the severity of moral judgments

Some logicians might quibble with Argument 4, because replacing words like “all” and “only” with words like “probably” and “unlikely” requires some careful thinking about the relationship between logical and probabilistic inference. But we’ll ignore that here. Whatever modifications you need to make to enable your logic to handle probabilistic statements, I think the above is at least a sensible way for psychologists to proceed when testing hypotheses. If it’s true that the predicted result is unlikely unless the theory is true, and we confirm the prediction, then it seems reasonable to assert (with full recognition that one might be wrong) that the theory is probably true.
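One way to see how much work the word “unlikely” does in P1 is to read Argument 4 through Bayes' rule. That reading is an assumption made here for illustration only; nothing above commits to a Bayesian framework, and the prior and likelihood values below are invented numbers, not estimates from any study.

```python
def posterior_theory_true(prior, p_result_if_true, p_result_if_false):
    """P(theory | result) via Bayes' rule for a binary true/false theory."""
    evidence = prior * p_result_if_true + (1 - prior) * p_result_if_false
    return prior * p_result_if_true / evidence

# Hypothetical inputs: 50/50 prior, and the result appears 80% of the time if the theory is true.
# P1 holds: the result is rare (5%) when the theory is false.
strong_p1 = posterior_theory_true(0.5, 0.8, 0.05)
# P1 fails: rival explanations produce the same result half the time.
weak_p1 = posterior_theory_true(0.5, 0.8, 0.5)

print(round(strong_p1, 3))  # 0.941
print(round(weak_p1, 3))    # 0.615
```

The same observed result moves us a long way when the result would be rare without the theory, and hardly at all when competing explanations generate it just as readily. Everything therefore hinges on the middle quantity, the probability of the result if the theory is false.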

But now the other shoe drops. Because even if we accept that Argument 4 is (for at least some logical frameworks) valid, we still need to show that it’s sound. And soundness requires the updated P1 to be true. If P1 isn’t true, then the whole enterprise falls apart again; nobody is terribly interested in scientific arguments that are logically valid but empirically false. We saw that P1 in Argument 2 was uncontroversial, but was embedded in a logically invalid argument. And conversely, P1 in Argument 3 was embedded in a logically valid argument, but was clearly indefensible. Now we’re suggesting that P1 in Argument 4, which sits somewhere in between Argument 2 and Argument 3, manages to capture the strengths of both of the previous arguments, while avoiding their weaknesses. But we can’t just assert this by fiat; it needs to be demonstrated somehow. So how do we do that?

The banal answer is that, at this point, we have to start thinking about the meanings of the words contained in P1, and not just about the logical form of the entire argument. Basically, we need to ask ourselves: is it really true that all other explanations for the predicted statistical result are, in the aggregate, unlikely?

Notice that, whether we like it or not, we are now compelled to think about the meaning of the statistical prediction itself. To evaluate the claim that the result A > B (p < .05) would be unlikely unless the theoretical hypothesis is true, we need to understand the statistical model that generated the p-values in question. And that, in turn, forces us to reason inductively, because inferential statistics is, by definition, about induction. The point of deploying inferential statistics, rather than constraining oneself to only describing the sampled measurements, is to generalize beyond the observed sample to a broader population. If you want to know whether the predicted p-value follows from your theory, you need to know whether the population your verbal hypothesis applies to is well approximated by the population your statistical model affords generalization to. If it isn’t, then there’s no basis for positing a premise like P1.

Once we’ve accepted this much—and to be perfectly blunt about it, if you don’t accept this much, you probably shouldn’t be using inferential statistics in the first place—then we have no choice but to think carefully about the alignment between our verbal and statistical hypotheses. Is P1 in Argument 4 true? Is it really the case that observing A > B, p < .05, would be unlikely unless cleanliness reduces the severity of moral judgments? Well that depends. What population of hypothetical observations does the model that generates the p-value refer to? Does it align with the population implied by the verbal hypothesis?

This is the critical question one must answer, and there’s no way around it. One cannot claim, as Lakens tries to, that psychologists don’t need to worry about inductive inference, because they’re actually doing deduction. Induction and deduction are not in opposition here; they’re actually working in tandem! Even if you agree with Lakens and think that the overarching logic guiding psychological hypothesis testing is of the deductive form expressed in Argument 4 (as opposed to the logically invalid form in Argument 2, as Meehl suggested), you still can’t avoid the embedded inductive step captured by P1, unless you want to give up the use of inferential statistics entirely.
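To make the alignment question concrete, here is a small simulation sketch of the situation described in the abstract, where a model generalizes over subjects but the verbal claim generalizes over stimuli too. It is not the analysis from my paper: every number (30 subjects, 10 stimuli per condition, the standard deviations, the hardcoded critical value) is an assumption chosen for illustration. There is no true condition effect; the stimuli in A and B are simply fresh random draws from one population in each experiment, and the test is a paired t-test across subjects that implicitly treats those stimuli as fixed.

```python
import math
import random
import statistics

N_SUBJECTS = 30
T_CRIT = 2.045  # two-sided .05 critical value of t with 29 df (assumed, for N_SUBJECTS = 30)

def false_positive_rate(sd_stimulus, n_sims=1000, n_stimuli=10, sd_noise=1.0, seed=1):
    """Share of null experiments reaching p < .05 when stimuli are treated as fixed."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # New stimuli per experiment, drawn from ONE population for both conditions,
        # so any A/B gap is stimulus sampling error rather than a condition effect.
        stim_a = [rng.gauss(0, sd_stimulus) for _ in range(n_stimuli)]
        stim_b = [rng.gauss(0, sd_stimulus) for _ in range(n_stimuli)]
        diffs = []
        for _ in range(N_SUBJECTS):
            # Subject intercepts are omitted because they cancel in the A - B difference.
            mean_a = sum(s + rng.gauss(0, sd_noise) for s in stim_a) / n_stimuli
            mean_b = sum(s + rng.gauss(0, sd_noise) for s in stim_b) / n_stimuli
            diffs.append(mean_a - mean_b)
        # Paired t-test over subjects only.
        t = statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(N_SUBJECTS))
        hits += abs(t) > T_CRIT
    return hits / n_sims

rate_no_stimulus_variance = false_positive_rate(sd_stimulus=0.0)
rate_with_stimulus_variance = false_positive_rate(sd_stimulus=0.5)
print(rate_no_stimulus_variance, rate_with_stimulus_variance)
```

Under these assumptions the first rate should sit near the nominal 5%, while the second should be several times larger. In the second case P1 of Argument 4 is simply false as a claim about cleanliness in general: a significant A > B difference is not at all unlikely when the theory is false, because the model only licenses generalization to new subjects rating these particular stimuli.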

The bottom line is that Lakens (and anyone else who finds the flavor of so-called deductivism he advocates appealing) faces a two-horned dilemma. One way to deal with the fact that Lakensian deductivism is in fact deductively invalid is to lean into it and assert that, logic notwithstanding, this is just how psychologists operate, and the important thing is not whether the logic makes deductive sense if you scrutinize it closely, but whether it allows people to get on with their research in a way they’re satisfied with.

The upside of such a position is that it allows you to forever deflect just about any criticism of what you’re doing simply by saying “well, the theory seems to me to follow from the prediction I made”. The downside (and it’s a big one, in my opinion) is that science becomes a kind of rhetorical game, because at that point there’s pretty much nothing anybody else can say to disabuse you of the belief that you’ve confirmed your theory. The only thing that’s required is that the prediction make sense to you (or, if you prefer, to you plus two or three reviewers). A secondary consequence is that it also becomes impossible to distinguish the kind of allegedly scientific activity psychologists engage in from, say, postmodern scholarship, so a rather unwelcome conclusion of taking Lakens’s view seriously is that we may as well extend the label “science” to the kind of thing that goes on in journals like Social Text. Maybe Lakens is okay with this, but I very much doubt that this is the kind of worldview most psychologists want to commit themselves to.

The more sensible alternative is to accept that the words and statistics we use do actually need to make contact with a common understanding of reality if we’re to be able to make progress. This means that when we say things like “it’s unlikely that we would observe a statistically significant effect here unless our theory is true”, evaluation of such a statement requires that one be able to explain, and defend, the relationship between the verbal claims and the statistical quantities on which the empirical support is allegedly founded.

The latter, rather weak, assumption (essentially, that scientists should be able to justify the premises that underlie their conclusions) is all my paper depends on. Once you make that assumption, nothing more depends on your philosophy of science. You could be a Popperian, a Lakatosian, an inductivist, a Lakensian, or an anarchist… It really doesn’t matter, because, unless you want to embrace the collapse of science into postmodernism, there’s no viable philosophy of science under which scientists get to use words and statistics in whatever way they like, without having to worry about the connection between them. If you expect to be taken seriously as a scientist who uses inferential statistics to draw conclusions from empirical data, you’re committed to caring about the relationship between the statistical models that generate your p-values and the verbal hypotheses you claim to be testing. If you find that too difficult or unpleasant, that’s fine (I often do too!); you can just drop the statistics from your arguments, and then it’s at least clear to people that your argument is purely qualitative, and shouldn’t be accorded the kind of reception we normally reserve (fairly or unfairly) for quantitative science. But you don’t get to claim the prestige and precision that quantitation seems to confer on researchers while doing none of the associated work. And you certainly can’t avoid doing that work simply by insisting that you’re doing a weird, logically fallacious, kind of “deduction”.

Unfair to severity

Lakens’s second major criticism is that I’m too hard on the notion of severity. He argues that I don’t give the Popper/Meehl/Mayo risky prediction/severe testing school of thought sufficient credit, and that it provides a viable alternative to the kind of position he takes me to be arguing for. Lakens makes two main points, which I’ll dub Severity I and Severity II.

Severity I

First, Lakens argues that my dismissal of risky or severe tests as a viable approach in most of psychology is unwarranted. I’ll quote him at length here, because the core of his argument is embedded in some other stuff, and I don’t want to be accused of quoting out of context (note that I did excise one part of the quote, because I deal with it separately below):

Yarkoni’s criticism on the possibility of severe tests is regrettably weak. Yarkoni says that “Unfortunately, in most domains of psychology, there are pervasive and typically very plausible competing explanations for almost every finding.” From his references (Cohen, Lykken, Meehl) we can see he refers to the crud factor, or the idea that the null hypothesis is always false. As we recently pointed out in a review paper on crud (Orben & Lakens, 2019), Meehl and Lykken disagreed about the definition of the crud factor, the evidence of crud in some datasets can not be generalized to all studies in pychology, and “The lack of conceptual debate and empirical research about the crud factor has been noted by critics who disagree with how some scientists treat the crud factor as an “axiom that needs no testing” (Mulaik, Raju, & Harshman, 1997).”. Altogether, I am very unconvinced by this cursory reference to crud makes a convincing point that “there are pervasive and typically very plausible competing explanations for almost every finding”. Risky predictions seem possible, to me, and demonstrating the generalizability of findings is actually one way to perform a severe test.

When Yarkoni discusses risky predictions, he sticks to risky quantitative predictions. As explained in Lakens (2020), “Making very narrow range predictions is a way to make it statistically likely to falsify your prediction if it is wrong. But the severity of a test is determined by all characteristics of a study that increases the capability of a prediction to be wrong, if it is wrong. For example, by predicting you will only observe a statistically significant difference from zero in a hypothesis test if a very specific set of experimental conditions is met that all follow from a single theory, it is possible to make theoretically risky predictions.” … It is unclear to me why Yarkoni does not think that approaches such as triangulation (Munafò & Smith, 2018) are severe tests. I think these approaches are the driving force between many of the more successful theories in social psychology (e.g., social identity theory), and it works fine.

There are several relatively superficial claims Lakens makes in these paragraphs that are either wrong or irrelevant. I’ll take them up below, but let me first address the central claim, which is that, contrary to the argument I make in my paper, risky prediction in the Popper/Meehl/Mayo sense is actually a viable strategy in psychology.

It’s instructive to note that Lakens doesn’t actually provide any support for this assertion; his argument is entirely negative. That is, he argues that I haven’t shown severity to be impossible. This is a puzzling way to proceed, because the most obvious way to refute an argument of the form “it’s almost impossible to do X“ is to just point to a few garden variety examples where people have, in fact, successfully done X. Yet at no point in Lakens’s lengthy review does he provide any actual examples of severe tests in psychology—i.e., of cases where the observed result would be extremely implausible if the favored theory were false. This omission is hard to square with his insistence that severe testing is a perfectly sensible approach that many psychologists already use successfully. Hundreds of thousands of papers have been published in psychology over the past century; if an advocate of a particular methodological approach can’t identify even a tiny fraction of the literature that has successfully applied that approach, how seriously should that view be taken by other people?

As background, I should note that Lakens’s inability to give concrete examples of severe testing isn’t peculiar to his review of my paper; in various interactions we’ve had over the last few years, I’ve repeatedly asked him to provide such examples. He’s obliged exactly once, suggesting this paper, titled Ego Depletion Is Not Just Fatigue: Evidence From a Total Sleep Deprivation Experiment by Vohs and colleagues.

In the sole experiment Vohs et al. report, they purport to test the hypothesis that ego depletion is not just fatigue (one might reasonably question whether there’s any non-vacuous content to this hypothesis to begin with, but that’s a separate issue). They proceed by directing participants who either have or have not been deprived of sleep to suppress their emotions while viewing disgusting video clips. In a subsequent game, they then ask the same participants to decide (seemingly incidentally) how loud a noise to blast an opponent with—a putative measure of aggression. The results show that participants who suppressed emotion selected louder volumes than those who did not, whereas the sleep deprivation manipulation had no effect.

I leave it as an exercise to the reader to decide for themselves whether the above example is a severe test of the theoretical hypothesis. To my mind, at least, it clearly isn’t; it fits very comfortably into the category of things that Meehl and Lakatos had in mind when discussing the near-total disconnect between verbal theories and purported statistical evidence. There are dozens, if not hundreds, of ways one might obtain the predicted result even if the theoretical hypothesis Vohs et al. articulate were utterly false (starting from the trivial observation that one could obtain the pattern the authors reported even if the two manipulations tapped exactly the same construct but were measured with different amounts of error). There is nothing severe about the test, and to treat it as such is to realize Meehl and Lakatos’s worst fears about the quality of hypothesis-testing in much of psychology.
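One way to read the parenthetical remark above is sketched in the following small Python simulation. It is an illustration under made-up assumptions, not a reanalysis of Vohs et al.: both hypothetical manipulations shift the same latent construct by 1 unit on average, and they differ only in how reliably they do so (`manipulation_sd` of 0.1 versus 4.0). The group size of 25, the cutoff of 2.0, and all function names are invented for the example.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def sample_group(rng, n, manipulated, manipulation_sd):
    """Outcome scores for one group of n participants.

    A manipulated participant's latent construct shifts by 1 unit on average;
    manipulation_sd controls how reliably the manipulation produces that shift.
    """
    scores = []
    for _ in range(n):
        shift = rng.gauss(1.0, manipulation_sd) if manipulated else 0.0
        scores.append(shift + rng.gauss(0.0, 1.0))  # add person-level noise
    return scores


def pattern_rate(n_sims=2000, n=25, crit=2.0, seed=1):
    """Share of simulated studies in which the reliable manipulation is
    'significant' and the noisy one is not, despite identical mean effects."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        t_reliable = welch_t(sample_group(rng, n, True, 0.1),
                             sample_group(rng, n, False, 0.1))
        t_noisy = welch_t(sample_group(rng, n, True, 4.0),
                          sample_group(rng, n, False, 4.0))
        # crit = 2.0 is a rough stand-in for the two-sided .05 cutoff
        if abs(t_reliable) > crit and abs(t_noisy) <= crit:
            hits += 1
    return hits / n_sims


rate = pattern_rate()
print(f"reliable manipulation significant, noisy one not: {rate:.0%} of studies")
```

Under these invented settings, the "one manipulation works, the other doesn't" pattern shows up in a majority of simulated studies even though, by construction, both manipulations act on a single construct. The specific numbers carry no weight; the point is only that the observed pattern does not by itself discriminate between the two accounts.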

To be clear, I did not suggest in my paper (nor am I here) that severe tests are impossible to construct in psychology. I simply observed that they’re not a realistic goal in most domains, particularly in “soft” areas (e.g., social psychology). I think I make it abundantly clear in the paper that I don’t see this as a failing of psychologists, or of their favored philosophy of science; rather, it’s intrinsic to the domain itself. If you choose to study extremely complex phenomena, where any given behavior is liable to be a product of an enormous variety of causal factors interacting in complicated ways, you probably shouldn’t expect to be able to formulate clear law-like predictions capable of unambiguously elevating one explanation above others. Social psychology is not physics, and there’s no reason to think that methodological approaches that work well when one is studying electrons and quarks should also work well when one is studying ego depletion and cognitive dissonance.

As for the problematic minor claims in the paragraphs I quoted above (you can skip down to the “Severity II” section if you’re bored or short on time)… First, the citations to Cohen, Lykken, and Meehl contain well-developed arguments to the same effect as my claim that “there are pervasive and typically very plausible competing explanations for almost every finding”. These arguments do not depend on what one means by “crud”, which is the subject of Orben & Lakens (2019). The only point relevant to my argument is that outcomes in psychology are overwhelmingly determined by many factors, so that it’s rare for a hypothesized effect in psychology to have no plausible explanation other than the authors’ preferred theoretical hypothesis. I think this is self-evidently true, and needs no further justification. But if you think it does require justification, I invite you to convince yourself of it in the following easy steps: (1) Write down 10 or 20 effects, chosen at random, that you feel are a reasonably representative sample of your field. (2) For each one, spend 5 minutes trying to identify alternative explanations for the predicted result that would be plausible even if the researcher’s theoretical hypothesis were false. (3) Observe that you were able to identify plausible confounds for all of the effects you wrote down. There, that was easy, right?

Second, it isn’t true that I stick to risky quantitative predictions. I explicitly note that risky predictions can be non-quantitative:

The canonical way to accomplish this is to derive from one’s theory some series of predictions—typically, but not necessarily, quantitative in nature—sufficiently specific to that theory that they are inconsistent with, or at least extremely implausible under, other accounts.

I go on to describe several potential non-quantitative approaches (I even cite Lakens!):

This does not mean, however, that vague directional predictions are the best we can expect from psychologists. There are a number of strategies that researchers in such fields could adopt that would still represent at least a modest improvement over the status quo (for discussion, see Meehl, 1990). For example, researchers could use equivalence tests (Lakens, 2017); predict specific orderings of discrete observations; test against compound nulls that require the conjunctive rejection of many independent directional predictions; and develop formal mathematical models that posit non-trivial functional forms between the input and output (Marewski & Olsson, 2009; Smaldino, 2017).

Third, what Lakens refers to as “triangulation” is, as far as I can tell, conceptually akin to the logical conjunction of effects suggested above, so again, it’s unfair to say that I oppose this idea. I support it, in principle. However, two points are worth noting. First, the practical barrier to treating conjunctive rejections as severe tests is that it requires researchers to actually hold their own feet to the fire by committing ahead of time to the specific conjunction that they deem a severe test. It’s not good enough to state ahead of time that the theory makes 6 predictions, and then, when the results confirm only 4 of them, to generate some post-hoc explanation for the 2 failed predictions while still claiming that the theory managed to survive a critical test.

Second, as we’ve already seen, the mere fact that a researcher believes a test is severe does not make it so, and there are good reasons to worry that many researchers grossly overestimate the degree of support a particular statistical procedure (or conjunction of procedures) actually confers on a theory. For example, you might naively suppose that if your theory makes 6 independent directional predictions (implying a probability of 2^-6, or about 1.6%, of getting all 6 right purely by chance), then joint corroboration of all your predictions provides strong support for your theory. But this isn’t generally the case, because many plausible competing accounts in psychology will tend to generate similarly-signed predictions. As a trivial example, when demand characteristics are present, they will typically tend to push in the direction of the researcher’s favored hypotheses.
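To make the arithmetic concrete, here is a minimal Python sketch. The chance baseline for six independent directional predictions is (1/2)^6 = 1/64. The second calculation assumes, purely for illustration, that some rival influence (demand characteristics, say) pushes each outcome in the favored direction with probability `p_shared` even though the theory is false; the value 0.8 is invented for this example and is not an estimate from any dataset.

```python
def joint_success_probability(p_each: float, n_predictions: int) -> float:
    """Probability that n independent directional predictions all come out as predicted."""
    return p_each ** n_predictions


# Baseline: each directional prediction is a coin flip if nothing is going on.
chance = joint_success_probability(0.5, 6)  # 1/64

# Hypothetical rival account: a shared nuisance influence makes each
# prediction come out "right" with probability 0.8 (an assumed value).
p_shared = 0.8
rival = joint_success_probability(p_shared, 6)

print(f"all 6 correct by pure chance:      {chance:.4f}")
print(f"all 6 correct under rival account: {rival:.4f}")
print(f"ratio: {rival / chance:.1f}")
```

With these assumed numbers, six-for-six is roughly 17 times more likely under the rival account than under the coin-flip baseline. Ruling out the coin-flip baseline therefore says little about whether the favored theory, rather than the rival account, produced the pattern.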

The bottom line is that, while triangulation is a perfectly sensible strategy in principle, deploying it in a way that legitimately produces severe tests of psychological theories does not seem any easier than the other approaches I mention—nor, again, does Lakens seem able to provide any concrete examples.

Severity II

Lakens’s second argument regarding severity (or my alleged lack of respect for it) is that I put the cart before the horse: whereas I focus largely on the generalizability of claims made on the basis of statistical evidence, Lakens argues that generalizability is purely an instrumental goal, and that the overarching objective is severity. He writes:

I think the reason most psychologists perform studies that demonstrate the generalizability of their findings has nothing to do with their desire to inductively build a theory from all these single observations. They show the findings generalize, because it increases the severity of their tests. In other words, according to this deductive approach, generalizability is not a goal in itself, but a it follows from the goal to perform severe tests.

And:

Generalization as a means to severely test a prediction is common, and one of the goals of direct replications (generalizing to new samples) and conceptual replications (generalizing to different procedures). Yarkoni might disagree with me that generalization serves severity, not vice versa. But then what is missing from the paper is a solid argument why people would want to generalize to begin with, assuming at least a decent number of them do not believe in induction. The inherent conflict between the deductive approaches and induction is also not explained in a satisfactory manner.

As a purported criticism of my paper, I find this an unusual line of argument, because not only does it not contradict anything I say in my paper, it actually directly affirms it. In effect, Lakens is saying yes, of course it matters whether the statistical model you use maps onto your verbal hypothesis; how else would you be able to formulate a severe test of the hypothesis using inferential statistics? Well, I agree with him! My only objection is that he doesn’t follow his own argument far enough. He writes that “generalization as a means to severely test a prediction is common”, but he’s being too modest. It isn’t just common; for studies that use inferential statistics, it’s universal. If you claim to be using statistical results to test your theoretical hypotheses, you’re obligated to care about the alignment between the universes of observations respectively defined by your verbal and statistical hypotheses. As I’ve pointed out at length above, this isn’t a matter of philosophical disagreement (i.e., of some imaginary “inherent conflict between the deductive approaches and induction”); it’s definitional. Inferential statistics is about generalizing from samples to populations. How could you possibly assert that a statistical test of a hypothesis is severe if you have no idea whether the population defined by your statistical model aligns with the one defined by your verbal hypothesis? Can Lakens provide an example of a severe statistical test that doesn’t require one to think about what population of observations a model applies to? I very much doubt it.

For what it’s worth, I don’t think the severity of hypothesis testing is the only reason to worry about the generalizability of one’s statistical results. We can see this trivially, inasmuch as severity only makes sense in a hypothesis testing context, whereas generalizability matters any time inferential statistics (which make reference to some idealized population) are invoked. If you report a p-value from a linear regression model, I don’t need to know what hypothesis motivated the analysis in order to interpret the results, but I do need to understand what universe of hypothetical observations the statistical model you specified refers to. If Lakens wants to argue that statistical results are uninterpretable unless they’re presented as confirmatory tests of an a priori hypothesis, that’s his prerogative (though I doubt he’ll find many takers for that view). At the very least, though, it should be clear that his own reasoning gives one more, and not less, reason to take the arguments in my paper seriously.

Hopelessly impractical

[Attention conservation notice: the above two criticisms are the big ones; you can safely stop reading here without missing much. The stuff below is frankly more a reflection of my irritation at some of Lakens’s rhetorical flourishes than about core conceptual issues.]

A third theme that shows up repeatedly in Lakens’s review is the idea that the arguments I make, while perhaps reasonable from a technical standpoint, are far too onerous to expect real researchers to implement. There are two main strands of argument here. Both of them, in my view, are quite wrong. But one of them is wrong and benign, whereas the other is wrong and possibly malignant.

Impractical I

The first (benign) strand is summarized by Lakens’s Point 3, which he titles theories and tests are not perfectly aligned in deductive approaches. As we’ll see momentarily, “perfectly” is a bit of a weasel word that’s doing a lot of work for Lakens here. But his general argument is that you only need to care about the alignment between statistical and verbal specifications of a hypothesis if you’re an inductivist:

To generalize from a single observation to a general theory through induction, the sample and the test should represent the general theory. This is why Yarkoni is arguing that there has to be a direct correspondence between the theoretical model, and the statistical test. This is true in induction.

I’ve already spent several thousand words above explaining why this is simply false. To recap (I know I keep repeating myself, but this really is the crux of the whole issue): if you’re going to report inferential statistics and claim that they provide support for your verbal hypotheses, then you’re obligated to care about the correspondence between the test and the theory. This doesn’t require some overarching inductivist philosophy of science (which is fortunate, because I don’t hold one myself); it only requires you to believe that when you make statements of the form “statistic X provides evidence for verbal claim Y“, you should be able to explain why that’s true. If you can’t explain why the p-value (or Bayes Factor, etc.) from that particular statistical specification supports your verbal hypothesis, but a different specification that produces a radically different p-value wouldn’t, it’s not clear why anybody else should take your claims seriously. After all, inferential statistics aren’t (or at least, shouldn’t be) just a kind of arbitrary numerical magic we sprinkle on top of our words to get people to respect us. They mean things. So the alternative to caring about the relationship between inferential statistics and verbal claims is not, as Lakens seems to think, deductivism—it’s ritualism.

The tacit recognition of this point is presumably why Lakens is careful to write that “theories and tests are not perfectly aligned in deductive approaches” (my emphasis). If he hadn’t included the word “perfectly”, the claim would seem patently silly, since theories and tests obviously need to be aligned to some degree no matter what philosophical view one adopts (save perhaps for outright postmodernism). Lakens’s argument here only makes any sense if the reader can be persuaded that my view, unlike Lakens’s, demands perfection. But it doesn’t (more on that below).

Lakens then goes on to address one of the central planks of my argument, namely, the distinction between fixed and random factors (which typically has massive implications for the p-values one observes). He suggests that while the distinction is real, it’s wildly unrealistic to expect anybody to actually be able to respect it:

If I want to generalize beyond my direct observations, which are rarely sampled randomly from all possible factors that might impact my estimate, I need to account for uncertainty in the things I have not observed. As Yarkoni clearly explains, one does this by adding random factors to a model. He writes (p. 7) “Each additional random factor one adds to a model licenses generalization over a corresponding population of potential measurements, expanding the scope of inference beyond only those measurements that were actually obtained. However, adding random factors to one’s model also typically increases the uncertainty with which the fixed effects of interest are estimated”. You don’t need to read Popper to see the problem here – if you want to generalize to all possible random factors, there are so many of them, you will never be able to overcome the uncertainty and learn anything. This is why inductive approaches to science have largely been abandoned.

You don’t need to read Paul Meehl’s Big Book of Logical Fallacies to see that Lakens is equivocating. He equates wanting to generalize beyond one’s sample with wanting to generalize “to all possible random factors“—as if the only two possible interpretations of an effect are that it either generalizes to all conceivable scenarios, or that it can’t be generalized beyond the sample at all. But this just isn’t true; saying that researchers should build statistical models that reflect their generalization intentions is not the same as saying that every mixed-effects model needs to include all variance components that could conceivably have any influence, however tiny, on the measured outcomes. Lakens presents my argument as a statistically pedantic, technically-correct-but-hopelessly-ineffectual kind of view—at which point it’s supposed to become clear to the reader that it’s just crazy to expect psychologists to proceed in the way I recommend. And I agree that it would be crazy—if that was actually what I was arguing. But it isn’t. I make it abundantly clear in my paper that aligning verbal and statistical hypotheses needn’t entail massive expansion of the latter; it can also (and indeed, much more feasibly) entail contraction of the former. There’s an entire section in the paper titled Draw more conservative inferences that begins with this:

Perhaps the most obvious solution to the generalizability problem is for authors to draw much more conservative inferences in their manuscripts—and in particular, to replace the hasty generalizations pervasive in contemporary psychology with slower, more cautious conclusions that hew much more closely to the available data. Concretely, researchers should avoid extrapolating beyond the universe of observations implied by their experimental designs and statistical models. Potentially relevant design factors that are impractical to measure or manipulate, but that conceptual considerations suggest are likely to have non-trivial effects (e.g., effects of stimuli, experimenter, research site, culture, etc.), should be identified and disclosed to the best of authors’ ability.

Contra Lakens, this is hardly an impractical suggestion; if anything, it offers to reduce many authors’ workload, because Introduction and Discussion sections are typically full of theoretical speculations that go well beyond the actual support of the statistical results. My prescription, if taken seriously, would probably shorten the lengths of a good many psychology papers. That seems pretty practical to me.

Moreover—and again contrary to Lakens’s claim—following my prescription would also dramatically reduce uncertainty rather than increasing it. Uncertainty arises when one lacks data to inform one’s claims or beliefs. If maximal certainty is what researchers want, there are few better ways to achieve that than to make sure their verbal claims cleave as closely as possible to the boundaries implicitly defined by their experimental procedures and statistical models, and hence depend on fewer unmodeled (and possibly unknown) variables.

Impractical II

The other half of Lakens’s objection from impracticality is to suggest that, even if the arguments I lay out have some merit from a principled standpoint, they’re of little practical use to most researchers, because I don’t do enough work to show readers how they can actually use those principles in their own research. Lakens writes:

The issues about including random factors is discussed in a more complete, and importantly, applicable, manner in Barr et al (2013). Yarkoni remains vague on which random factors should be included and which not, and just recommends ‘more expansive’ models. I have no idea when this is done satisfactory. This is a problem with extreme arguments like the one Yarkoni puts forward. It is fine in theory to argue your test should align with whatever you want to generalize to, but in practice, it is impossible. And in the end, statistics is just a reasonably limited toolset that tries to steer people somewhat in the right direction. The discussion in Barr et al (2013), which includes trade-offs between converging models (which Yarkoni too easily dismisses as solved by modern computational power – it is not solved) and including all possible factors, and interactions between all possible factors, is a bit more pragmatic.

And:

As always, it is easy to argue for extremes in theory, but this is generally uninteresting for an applied researcher. It would be great if Yarkoni could provide something a bit more pragmatic about what to do in practice than his current recommendation about fitting “more expansive models” – and provides some indication where to stop, or at least suggestions what an empirical research program would look like that tells us where to stop, and why.

And:

Previous authors have made many of the same points, but in a more pragmatic manner (e.g., Barr et al., 2013m Clark, 1974,). Yarkoni fails to provide any insights into where the balance between generalizing to everything, and generalizing to factors that matter, should lie, nor does he provide an evaluation of how far off this balance research areas are. It is easy to argue any specific approach to science will not work in theory – but it is much more difficult to convincingly argue it does not work in practice.

There are many statements in Lakens’s review that made me shake my head, but the argument advanced in the above quotes is the only one that filled me (briefly) with rage. In part that’s because parts of what Lakens says here blatantly misrepresent my paper. For example, he writes that “Yarkoni just recommends ‘more expansive models’”, which is frankly a bit insulting given that I spend a full third of my paper talking about various ways to address the problem (e.g., by designing studies that manipulate many factors at once; by conducting meta-analyses over variance components; etc.).

Similarly, Lakens implies that Barr et al. (2013) gives better versions of my arguments, when actually the two papers are doing completely different things. Barr et al. (2013) is a fantastic paper, but it focuses almost entirely on the question of how one should specify and estimate mixed-effects models, and says essentially nothing about why researchers should think more carefully about random factors, or which ones researchers ought to include in their model. One way to think about it is that Barr et al. (2013) is the paper you should read after my paper has convinced you that it actually matters a lot how you specify your random-effects structure. Of course, if you’re already convinced of the latter (which many people are, though Lakens himself doesn’t seem to be), then yeah, you should maybe skip my paper; you’re not the intended audience.

In any case, the primary reason I found this part of Lakens’s review upsetting is that the above quotes capture a very damaging, but unfortunately also very common, sentiment in psychology, which is the apparent belief that somebody—and perhaps even nature itself—owes researchers easy solutions to extremely complex problems.

Lakens writes that “Yarkoni remains vague on which random factors should be included and which not”, and that “It would be great if Yarkoni could provide something a bit more pragmatic about what to do in practice than his current recommendation about fitting ‘more expansive models’”. Well, on a superficial level, I agree with Lakens: I do remain vague on which factors should be included, and it would be lovely if I were able to say something like “here, Daniel, I’ve helpfully identified for you the five variance components that you need to care about in all your studies”. But I can’t say something like that, because it would be a lie. There isn’t any such one-size-fits-all prescription, and trying to pretend there is would, in my view, be deeply counterproductive. Psychology is an enormous field full of people trying to study a very wide range of complex phenomena. There is no good reason to suppose that the same sources of variance will assume even approximately the same degree of importance across broad domains, let alone individual research questions. Should psychophysicists studying low-level visual perception worry about the role of stimulus, experimenter, or site effects? What about developmental psychologists studying language acquisition? Or social psychologists studying cognitive dissonance? I simply don’t know.

One reason I don’t know, as I explain in my paper, is that the answer depends heavily on what conclusions one intends to draw from one’s analyses (i.e., on one’s generalization intentions). I hope Lakens would agree with me that it’s not my place to tell other people what their goal should be in doing their research. Whether or not a researcher needs to model stimuli, sites, tasks, etc. as random factors depends on what claim they intend to make. If a researcher intends to behave as if their results apply to a population of stimuli like the ones used in their study, and not just to the exact sampled stimuli, then they should use a statistical model that reflects that intention. But if they don’t care to make that generalization, and are comfortable drawing no conclusions beyond the confines of the tested stimuli, then maybe they don’t need to worry about explicitly modeling stimulus effects at all. Either way, what determines whether a statistical model is appropriate is whether that model adequately captures what a researcher claims it’s capturing, not whether Tal Yarkoni has data suggesting that, on average, site effects are large in one area of social psychology but not large in another area of psychophysics.
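The difference between the two generalization intentions can be sketched with a small simulation. The Python code below is an illustration under invented assumptions, not an analysis from the paper: condition has no effect across the population of stimuli, each condition is represented by its own sample of 8 stimuli, and every stimulus has an idiosyncratic offset (`stim_sd = 0.5`). Because the standard library cannot fit mixed-effects models, a by-stimulus t-test stands in, loosely, for an analysis that treats stimuli as random; all names, parameter values, and critical values are assumptions made for the example.

```python
import math
import random
import statistics


def mean(xs):
    return sum(xs) / len(xs)


def one_sample_t(xs):
    """t statistic for H0: the population mean of xs is 0."""
    return mean(xs) / (statistics.stdev(xs) / math.sqrt(len(xs)))


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def simulate_null_experiment(rng, n_subjects=30, n_stimuli=8, stim_sd=0.5):
    """One experiment in which condition has NO effect across the stimulus population."""
    # Each condition gets its own random sample of stimuli with idiosyncratic offsets.
    stim_a = [rng.gauss(0.0, stim_sd) for _ in range(n_stimuli)]
    stim_b = [rng.gauss(0.0, stim_sd) for _ in range(n_stimuli)]
    subject_diffs = []
    item_totals_a = [0.0] * n_stimuli
    item_totals_b = [0.0] * n_stimuli
    for _ in range(n_subjects):
        intercept = rng.gauss(0.0, 1.0)  # subject-level random intercept
        resp_a = [intercept + s + rng.gauss(0.0, 1.0) for s in stim_a]
        resp_b = [intercept + s + rng.gauss(0.0, 1.0) for s in stim_b]
        subject_diffs.append(mean(resp_a) - mean(resp_b))
        for i in range(n_stimuli):
            item_totals_a[i] += resp_a[i]
            item_totals_b[i] += resp_b[i]
    item_means_a = [t / n_subjects for t in item_totals_a]
    item_means_b = [t / n_subjects for t in item_totals_b]
    t_fixed = one_sample_t(subject_diffs)            # stimuli ignored (treated as fixed)
    t_random = welch_t(item_means_a, item_means_b)   # stimuli are the sampled units
    return t_fixed, t_random


def false_positive_rates(n_sims=1000, seed=7):
    rng = random.Random(seed)
    fixed_hits = random_hits = 0
    for _ in range(n_sims):
        t_fixed, t_random = simulate_null_experiment(rng)
        fixed_hits += abs(t_fixed) > 2.045    # approx. two-sided .05 cutoff, df = 29
        random_hits += abs(t_random) > 2.145  # approx. two-sided .05 cutoff, df = 14
    return fixed_hits / n_sims, random_hits / n_sims


fixed_rate, random_rate = false_positive_rates()
print(f"false positives, stimuli treated as fixed:  {fixed_rate:.1%}")
print(f"false positives, stimuli treated as random: {random_rate:.1%}")
```

Under these assumptions, the subjects-only analysis rejects the null far more often than the nominal 5%, while the by-stimulus analysis stays close to it. Neither analysis is "wrong" in the abstract: the first is a reasonable test of whether these particular 16 stimuli differ, and it only becomes misleading when its result is reported as a claim about stimuli in general.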

The other reason I can’t provide concrete guidance about what factors psychologists ought to model as random is that attempting to establish even very rough generalizations of this sort would involve an enormous amount of work—and the utility of that work would be quite unclear, given how contextually specific the answers are likely to be. Lakens himself seems to recognize this; at one point in his review, he suggests that the topic I address “probably needs a book length treatment to do it justice.“ Well, that’s great, but what are working researchers supposed to do in the meantime? Is the implication that psychologists should feel free to include whatever random effects they do or don’t feel like in their models until such time as someone shows up with a compendium of variance component estimates that apply to different areas of psychology? Does Lakens also dismiss papers seeking to convince people that it’s important to consider statistical power when designing studies, unless those papers also happen to provide ready-baked recommendations for what an appropriate sample size is for different research areas within psychology? Would he also conclude that there’s no point in encouraging researchers to define “smallest effect sizes of interest“, as he himself has done in the past, unless one can provide concrete recommendations for what those numbers should be?

I hope not. Such a position would amount to shooting the messenger. The argument in my paper is that model specification matters, and that researchers need to think about that carefully. I think I make that argument reasonably clearly and carefully. Beyond that, I don’t think it’s my responsibility to spend the next N years of my own life trying to determine what factors matter most in social, developmental, or cognitive psychology, just so that researchers in those fields can say, “thanks, your crummy domain-general estimates are going to save me from having to think deeply about what influences matter in my own particular research domain”. I think it’s every individual researcher’s job to think that through for themselves, if they expect to be taken seriously.

Lastly, and at the risk of being a bit petty (sorry), I can’t resist pointing out what strikes me as a rather serious internal contradiction between Lakens’s claim that my arguments are unhelpful unless they come with pre-baked variance estimates, and his own stated views about severity. On the one hand, Lakens claims that psychologists ought to proceed by designing studies that subject their theoretical hypotheses to severe tests. On the other hand, he seems to have no problem with researchers mindlessly following field-wide norms when specifying their statistical models (e.g., modeling only subjects as random effects, because those are the current norms). I find these two strands of thought difficult to reconcile. As we’ve already seen, the severity of a statistical procedure as a test of a theoretical hypothesis depends on the relationship between the verbal hypothesis and the corresponding statistical specification. How, then, could a researcher possibly feel confident that their statistical procedure constitutes a severe test of their theoretical hypothesis, if they’re using an off-the-shelf model specification and have no idea whether they would have obtained radically different results if they had randomly sampled a different set of stimuli, participants, experimenters, or task operationalizations?

Obviously, they can’t. Having to think carefully about what the terms in one’s statistical model mean, how they relate to one’s theoretical hypothesis, and whether those assumptions are defensible, isn’t at all “impractical”; it’s necessary. If you can’t explain clearly why a model specification that includes only subjects as random effects constitutes a severe test of your hypothesis, why would you expect other people to take your conclusions at face value?

Trouble with titles

There’s one last criticism Lakens raises in his review of my paper. It concerns claims I make about the titles of psychology papers:

This is a minor point, but I think a good illustration of the weakness of some of the main arguments that are made in the paper. On the second page, Yarkoni argues that “the vast majority of psychological scientists have long operated under a regime of (extremely) fast generalization”. I don’t know about the vast majority of scientists, but Yarkoni himself is definitely using fast generalization. He looked through a single journal, and found 3 titles that made general statements (e.g., “Inspiration Encourages Belief in God”). When I downloaded and read this article, I noticed the discussion contains a ‘constraint on generalizability’ in the discussion, following (Simons et al., 2017). The authors wrote: “We identify two possible constraints on generality. First, we tested our ideas only in American and Korean samples. Second, we found that inspiring events that encourage feelings of personal insignificance may undermine these effects.”. Is Yarkoni not happy with these two sentence clearly limiting the generalizability in the discussion?

I was initially going to respond to this in detail, but ultimately decided against it, because (a) by Lakens’s own admission, it’s a minor concern; (b) this is already very long as-is; and (c) while it’s a minor point in the context of my paper, I think this issue has some interesting and much more general implications for how we think about titles. So I’ve decided I won’t address it here, but will eventually take it up in a separate piece that gives it a more general treatment, and that includes a kind of litmus test one can use to draw reasonable conclusions about whether or not a title is appropriate. But, for what it’s worth, I did do a sweep through the paper in the process of revision, and have moderated some of the language.

Conclusion

Daniel Lakens argues that psychologists don’t need to care much if at all about the relationship between their statistical model specifications and their verbal hypotheses, because hypothesis testing in psychology proceeds deductively: researchers generate predictions from their theories, and then update their confidence in their theories on the basis of whether or not those predictions are confirmed. This all sounds great until you realize that those predictions are almost invariably evaluated using inferential statistical methods that are inductive by definition. So long as psychologists are relying on inferential statistics as decision aids, there can be no escape from induction. Deduction and induction are not competing philosophies or approaches; the standard operating procedure in psychology is essentially a hybrid of the two.

If you don’t like the idea that the ability to appraise a verbal hypothesis using statistics depends critically on the ability to understand and articulate how the statistical terms map onto the verbal ideas, that’s fine; an easy way to solve that problem is to just not use inferential statistics. That’s a perfectly reasonable position, in my view (and one I discuss at length in my paper). But once you commit yourself to relying on things like p-values and Bayes Factors to help you decide what you believe about the world, you’re obligated to think about, justify, and defend your statistical assumptions. They aren’t, or shouldn’t be, just a kind of pedantic technical magic you can push-button sprinkle on top of your favorite verbal hypotheses to make them really stick.

The Great Minds Journal Club discusses Westfall & Yarkoni (2016)

[Editorial note: The people and events described here are fictional. But the paper in question is quite real.]

“Dearly Beloved,” The Graduate Student began. “We are gathered here to–”

“Again?” Samantha interrupted. “Again with the Dearly Beloved speech? Can’t we just start a meeting like a normal journal club for once? We’re discussing papers here, not holding a funeral.”

“We will discuss papers,” said The Graduate Student indignantly. “In good time. But first, we have to follow the rules of Great Minds Journal Club. There’s a protocol, you know.”

Samantha was about to point out that she didn’t know, because The Graduate Student was the sole author of the alleged rules, and the alleged rules had a habit of changing every week. But she was interrupted by the sound of the double doors at the back of the room swinging violently inwards.

“Sorry I’m late,” said Jin, strolling into the room, one hand holding what looked like a large bucket of coffee with a lid on top. “What are we reading today?”

“Nothing,” said Lionel. “The reading has already happened. What we’re doing now is discussing the paper that everyone’s already read.”

“Right, right,” said Jin. “What I meant to ask was: what paper that we’ve all already read are we discussing today?”

“Statistically controlling for confounding constructs is harder than you think,” said The Graduate Student.

“I doubt it,” said Jin. “I think almost everything is intolerably difficult.”

“No, that’s the title of the paper,” Lionel chimed in. “Statistically controlling for confounding constructs is harder than you think. By Westfall and Yarkoni. In PLOS ONE. It’s what we picked to read for this week. Remember? Are you on the mailing list? Do you even work here?”

“Do I work here… Hah. Funny man. Remember, Lionel… I’ll be on your tenure committee in the Fall.”

“Why don’t we get started,” said The Graduate Student, eager to prevent a full-out sarcastathon. “I guess we can do our standard thing where Samantha and I describe the basic ideas and findings, talk about how great the paper is, and suggest some possible extensions… and then Jin and Lionel tear it to shreds.”

“Sounds good,” said Jin and Lionel in concert.

“The basic problem the authors highlight is pretty simple,” said Samantha. “It’s easy to illustrate with an example. Say you want to know if eating more bacon is associated with a higher incidence of colorectal cancer–like that paper that came out a while ago suggested. In theory, you could just ask people how often they eat bacon and how often they get cancer, and then correlate the two. But suppose you find a positive correlation–what can you conclude?”

“Not much,” said Pablo–apparently in a talkative mood. It was the first thing he’d said to anyone all day–and it was only 3 pm.

“Right. It’s correlational data,” Samantha continued. “Nothing is being experimentally manipulated here, so we have no idea if the bacon-cancer correlation reflects the effect of bacon itself, or if there’s some other confounding variable that explains the association away.”

“Like, people who exercise less tend to eat more bacon, and exercise also prevents cancer,” The Graduate Student offered.

“Or it could be a general dietary thing, and have nothing to do with bacon per se,” said Jin. “People who eat a lot of bacon also have all kinds of other terrible dietary habits, and it’s really the gestalt of all the bad effects that causes cancer, not any one thing in particular.”

“Or maybe,” suggested Pablo, “a sneaky parasite unknown to science invades the brain and the gut. It makes you want to eat bacon all the time. Because bacon is its intermediate host. And then it also gives you cancer. Just to spite you.”

“Right, it could be any of those things,” Samantha said. “Except for maybe that last one. The point is, there are many potential confounds. If we want to establish that there’s a ‘real’ association between bacon and cancer, we need to somehow remove the effect of other variables that could be correlated with both bacon-eating and cancer-having. The traditional way to do this is to statistically ‘control for’ or ‘hold constant’ the effects of confounding variables. The idea is that you adjust the variables in your regression equation so that you’re essentially asking what would the relationship between bacon and cancer look like if we could eliminate the confounding influence of things like exercise, diet, alcohol, and brain-and-gut-eating parasites? It’s a very common move, and the logic of statistical control is used to justify a huge number of claims all over the social and biological sciences.”

“I just published a paper showing that brain activation in frontoparietal regions predicts people’s economic preferences even after controlling for self-reported product preferences,” said Jin. “Please tell me you’re not going to shit all over my paper. Is that where this is going?”

“It is,” said Lionel gleefully. “That’s exactly where this is going.”

“It’s true,” Samantha said apologetically. “But if it’s any consolation, we’re also going to shit on Lionel’s finding that implicit prejudice is associated with voting behavior after controlling for explicit attitudes.”

“That’s actually pretty consoling,” said Jin, smiling at Lionel.

“So anyway, statistical control is pervasive,” Samantha went on. “But there’s a problem: statistical control–at least the way people typically do it–is a measurement-level technique. Meaning, when you control for the rate of alcohol use in a regression of cancer on bacon, you’re not really controlling for alcohol use. What you’re actually controlling for is just one particular operationalization of alcohol use–which probably doesn’t cover the entire construct, and is also usually measured with some error.”

“Could you maybe give an example,” asked Pablo. He was the youngest in the group, being only a second-year graduate student. (The Graduate Student, by contrast, had been in the club for so long that his real name had long ago been forgotten by the other members of the GMJC.)

“Sure,” said The Graduate Student. “Suppose your survey includes an item like ‘how often do you consume alcoholic beverages’, and the response options include things like never, less than once a month, I’m never not consuming alcoholic beverages, and so on. Now, people are not that great at remembering exactly how often they have a drink–especially the ones who tend to have a lot of drinks. On top of that, there’s a stigma against drinking a lot, so there’s probably going to be some degree of systematic underreporting. All of this contrives to give you a measure that’s less than perfectly reliable–meaning, it won’t give you the same values that you would get if you could actually track people for an extended period of time and accurately measure exactly how much ethanol they consume, by volume. In many, many cases, measured covariates of this kind are pretty mediocre.”

“I see,” said Pablo. “That makes sense. So why is that a problem?”

“Because you can’t control for that which you aren’t measuring,” Samantha said. “Meaning, if your alleged measure of alcohol consumption–or any other variable you care about–isn’t measuring the thing you care about with perfect accuracy, then you can’t remove its influence on other things. It’s easiest to see this if you think about the limiting case where your measurements are completely unreliable. Say you think you’re measuring weekly hours of exercise, but actually your disgruntled research assistant secretly switched out the true exercise measure for randomly generated values. When you then control for the alleged ‘exercise’ variable in your model, how much of the true influence of exercise are you removing?”

“None,” said Pablo.

“Right. Your alleged measure of exercise doesn’t actually reflect anything about exercise, so you’re accomplishing nothing by controlling for it. The same exact point holds–to varying degrees–when your measure is somewhat reliable, but not perfect. Which is to say, pretty much always.”
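
[Editorial aside: Samantha’s limiting-case argument is easy to check with a toy simulation. The sketch below is not from the paper; every number in it (the 0.5 effect sizes, the sample size, the variable names) is an assumption made up for illustration. Exercise lowers both bacon-eating and cancer risk, bacon has no effect of its own, and we control for a measure of exercise whose reliability we can dial from perfect down to pure noise.]

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def residuals(y, x):
    """Residuals of y after a simple least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [(b - my) - slope * (a - mx) for a, b in zip(x, y)]


def correlation(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_bacon_cancer(reliability, n=20000, seed=1):
    """Bacon-cancer correlation after 'controlling for' measured exercise.

    Hypothetical setup: true exercise drives both variables, and bacon has
    no effect on cancer at all. `reliability` is the share of variance in
    the measured covariate that reflects true exercise (0 = random numbers).
    """
    rng = random.Random(seed)
    exercise = [rng.gauss(0, 1) for _ in range(n)]
    bacon = [-0.5 * e + rng.gauss(0, 1) for e in exercise]
    cancer = [-0.5 * e + rng.gauss(0, 1) for e in exercise]
    if reliability == 0:
        # The disgruntled research assistant's version: pure noise.
        measured = [rng.gauss(0, 1) for _ in range(n)]
    else:
        noise_sd = math.sqrt((1 - reliability) / reliability)
        measured = [e + rng.gauss(0, noise_sd) for e in exercise]
    # Partial correlation: correlate what is left of each variable
    # after regressing it on the measured covariate.
    return correlation(residuals(bacon, measured), residuals(cancer, measured))


for rel in (1.0, 0.5, 0.0):
    print(rel, round(partial_bacon_cancer(rel), 3))
```

[Under these assumptions the population values work out to 0 with a perfect measure, about 0.11 at a reliability of 0.5, and 0.2 (the full uncorrected correlation) when the measure is pure noise. The simulated values should land close to those: the less reliable the covariate, the more of the confound survives the “control”.]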

“You could also think about the same general issue in terms of construct validity,” The Graduate Student chimed in. “What you’re typically trying to do by controlling for something is account for a latent construct or concept you care about–not a specific measure. For example, the latent construct of a ‘healthy diet’ could be measured in many ways. You could ask people how much broccoli they eat, how much sugar or trans fat they consume, how often they eat until they can’t move, and so on. If you surveyed people with a lot of different items like this, and then extracted the latent variance common to all of them, then you might get a component that could be interpreted as something like ‘healthy diet’. But if you only use one or two items, they’re going to be very noisy indicators of the construct you care about. Which means you’re not really controlling for how healthy people’s diet is in your model relating bacon to cancer. At best, you’re controlling for, say, self-reported number of vegetables eaten. But there’s a very powerful temptation for authors to forget that caveat, and to instead think that their measurement-level conclusions automatically apply at the construct level. The result is that you end up with a huge number of papers saying things like ‘we show that fish oil promotes heart health even after controlling for a range of dietary and lifestyle factors’. When in fact the measurement-level variables they’ve controlled for can’t help but capture only a tiny fraction of all of the dietary and lifestyle factors that could potentially confound the association you care about.”

“I see,” said Pablo. “But this seems like a pretty basic point, doesn’t it?”

“Yes,” said Lionel. “It’s a problem as old as time itself. It might even be older than Jin.”

Jin smiled at Lionel and tipped her coffee cup-slash-bucket towards him slightly in salute.

“In fairness to the authors,” said The Graduate Student, “they do acknowledge that essentially the same problem has been discussed in many literatures over the past few decades. And they cite some pretty old papers. Oldest one is from… 1965. Kahneman, 1965.”

An uncharacteristic silence fell over the room.

“That Kahneman?” Jin finally probed.

“The one and only.”

“Fucking Kahneman,” said Lionel. “That guy could really stand to leave a thing or two for the rest of us to discover.”

“So, wait,” said Jin, evidently coming around to Lionel’s point of view. “These guys cite a 50-year old paper that makes essentially the same argument, and still have the temerity to publish this thing?”

“Yes,” said Samantha and The Graduate Student in unison.

“But to be fair, their presentation is very clear,” Samantha said. “They lay out the problem really nicely–which is more than you can say for many of the older papers. Plus there’s some neat stuff in here that hasn’t been done before, as far as I know.”

“Like what?” asked Lionel.

“There’s a nice framework for analytically computing error rates for any set of simple or partial correlations between two predictors and a DV. And, to save you the trouble of having to write your own code, there’s a Shiny web app.”

“In my day, you couldn’t just write a web app and publish it as a paper,” Jin grumbled. “Shiny or otherwise.”

“That’s because in your day, the internet didn’t exist,” Lionel helpfully offered.

“No internet?” The Graduate Student shrieked in horror. “How old are you, Jin?”

“Old enough to become very wise,” said Jin. “Very, very wise… and very corpulent with federal grant money. Money that I could, theoretically, use to fund–or not fund–a graduate student of my choosing next semester. At my complete discretion, of course.” She shot The Graduate Student a pointed look.

“There’s more,” Samantha went on. “They give some nice examples that draw on real data. Then they show how you can solve the problem with SEM–although admittedly that stuff all builds directly on textbook SEM work as well. And then at the end they go on to do some power calculations based on SEM instead of the standard multiple regression approach. I think that’s new. And the results are… not pretty.”

“How so,” asked Lionel.

“Well. Westfall and Yarkoni suggest that for fairly typical parameter regimes, researchers who want to make incremental validity claims at the latent-variable level–using SEM rather than multiple regression–might be looking at a bare minimum of several hundred participants, and often many thousands, in order to adequately power the desired inference.”

“Ouchie,” said Jin.

“What happens if there’s more than one potential confound?” asked Lionel. “Do they handle the more general multiple regression case, or only two predictors?”

“No, only two predictors,” said The Graduate Student. “Not sure why. Maybe they were worried they were already breaking enough bad news for one day.”

“Could be,” said Lionel. “You have to figure that in an SEM, when unreliability in the predictors is present, the uncertainty is only going to compound as you pile on more covariates–because it’s going to become increasingly unclear how the model should attribute any common variance that the predictor of interest shares with both the DV and at least one other covariate. So whatever power estimates they come up with in the paper for the single-covariate case are probably upper bounds on the ability to detect incremental contributions in the presence of multiple covariates. If you have a lot of covariates–like the epidemiology or nutrition types usually do–and at least some of your covariates are fairly unreliable, things could get ugly really quickly. Who knows what kind of sample sizes you’d need in order to make incremental validity claims about small effects in epi studies where people start controlling for the sun, moon, and stars. Hundreds of thousands? Millions? I have no idea.”

“Jesus,” said The Graduate Student. “That would make it almost impossible to isolate incremental contributions in large observational datasets.”

“Correct,” said Lionel.

“The thing I don’t get,” said Samantha, “is that the epidemiologists clearly already know about this problem. Or at least, some of them do. They’ve written dozens of papers about ‘residual confounding’, which is another name for the same problem Westfall and Yarkoni discuss. And yet there are literally thousands of large-sample, observational papers published in prestigious epidemiology, nutrition, or political science journals that never even mention this problem. If it’s such a big deal, why does almost nobody actually take any steps to address it?”

“Ah…” said Jin. “As the senior member of our group, I can probably answer that question best for you. You see, it turns out it’s quite difficult to publish a paper titled After an extensive series of SEM analyses of a massive observational dataset that cost the taxpayer three million dollars to assemble, we still have no idea if bacon causes cancer. Nobody wants to read that paper. You know what paper people do want to read? The one called Look at me, I eat so much bacon I’m guaranteed to get cancer according to the new results in this paper–but I don’t even care, because bacon is so delicious. That’s the paper people will read, and publish, and fund. So that’s the paper many scientists are going to write.”

A second uncharacteristic silence fell over the room.

“Bit of a downer today, aren’t you,” Lionel finally said. “I guess you’re playing the role of me? I mean, that’s cool. It’s a good look for you.”

“Yes,” Jin agreed. “I’m playing you. Or at least, a smarter, more eloquent, and better-dressed version of you.”

“Why don’t we move on,” Samantha interjected before Lionel could re-arm and respond. “Now that we’ve laid out the basic argument, should we try to work through the details and see what we find?”

“Yes,” said Lionel and Jin in unison–and proceeded to tear the paper to shreds.

the mysterious inefficacy of weather

I like to think of myself as a data-respecting guy–by which I mean that I try to follow the data wherever it leads, and work hard to suppress my intuitions in cases where those intuitions are convincingly refuted by the empirical evidence. Over the years, I’ve managed to argue myself into believing many things that I would have once found ludicrous–for instance, that parents have very little influence on their children’s personalities, or that in many fields, the judgments of acclaimed experts with decades of training are only marginally better than those of people selected at random, and often considerably worse than simple actuarial models. I believe these things not because I want to or like to, but because I think a dispassionate reading of the available evidence suggests that that’s just how the world works, whether I like it or not.

Still, for all of my efforts, there are times when I find myself unable to set aside my intuitions in the face of what would otherwise be pretty compelling evidence. A case in point is the putative relationship between weather and mood. I think most people–including me–take it as a self-evident fact that weather exerts a strong effect on mood. Climate is one of the first things people bring up when discussing places they’ve lived or visited. When I visit other cities and talk to people about what Austin, Texas (my current home) is like, my description usually amounts to something like it’s an amazing place to live so long as you don’t mind the heat. When people talk about Seattle, they bitch about the rain and the clouds; when people rave about living in California, they’re often thinking in no small part about the constant sunshine that pervades most of the state. When someone comments on the absurdly high rate of death metal bands in Finland, our first reaction is to chuckle and think well, what the hell else is there to do that far up north in the winter?–a reaction promptly followed by a twinge of guilt, because Seasonal Affective Disorder is no laughing matter.

And yet… and yet, the empirical evidence linking variations in the weather to variations in human mood is surprisingly scant. There are a few published reports of very large effects of weather on mood going back several decades, but these are invariably from very small samples–and we know that big correlations tend to occur in little studies. By contrast, large-scale studies with hundreds or thousands of subjects have found very little evidence of a relationship between mood and weather–and the effects identified are not necessarily consistent across studies.
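
The claim that big correlations tend to turn up in little studies is easy to see in simulation. Here's a minimal sketch, assuming (purely for illustration) a true weather-mood correlation of 0.05 and counting a sample correlation of 0.3 or more in absolute value as "big"; none of these numbers come from the studies discussed here.

```python
import math
import random


def sample_correlation(n, rho, rng):
    """Pearson r from one simulated study of n subjects with true correlation rho."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        xs.append(x)
        ys.append(y)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


def share_of_big_correlations(n, rho=0.05, studies=300, threshold=0.3, seed=0):
    """Fraction of simulated studies whose |r| reaches the threshold."""
    rng = random.Random(seed)
    rs = [sample_correlation(n, rho, rng) for _ in range(studies)]
    return sum(abs(r) >= threshold for r in rs) / studies


print("n = 20: ", share_of_big_correlations(20))
print("n = 500:", share_of_big_correlations(500))
```

With only 20 subjects per study, a sizable minority of studies should clear the 0.3 bar through sampling error alone, whereas with 500 subjects essentially none should. If the small studies that happen to clear the bar are the ones that get written up, you end up with exactly the pattern in the literature: a handful of large effects from tiny samples that large samples can't reproduce.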

For example, Denissen and colleagues (2008) fit a series of multilevel models of the relationship between objective weather parameters and self-reported mood in 1,233 German subjects, and found only very small associations between weather variables and negative (but not positive) affect. Klimstra et al (2011) found similarly negligible main effects in another sample of ~500 subjects. The state of the empirical literature on weather and mood was nicely summed up by Denissen et al in their Discussion:

As indicated by the relatively small regression weights, weather fluctuations accounted for very little variance in people’s day-to-day mood. This result may be unexpected given the existence of commonly held conceptions that weather exerts a strong influence on mood (Watson, 2000), though it replicates findings by Watson (2000) and Keller et al. (2005), who also failed to report main effects. –Denissen et al (2008)

With the advent of social media and that whole Big Data thing, we can now conduct analyses on a scale that makes the Denissen or Klimstra studies look almost like case studies. In particular, the availability of hundreds of millions of tweets and Facebook posts, coupled with comprehensive weather records from every part of the planet, means that we can now investigate the effects of almost every kind of weather pattern (cloud cover, temperature, humidity, barometric pressure, etc.) on many different indices of mood. And yet, here again, the evidence is not very kind to our intuitive notion of a strong association between weather and mood.

For example, in a study of 10 million Facebook users in 100 US cities, Coviello et al (2014) found that the incidence of positive posts decreased by approximately 1%, and that of negative posts increased by 1%, on days when rain fell compared to days without rain. While that finding is certainly informative (and served as a starting point for other much more impressive analyses of network contagion), it’s not a terribly impressive demonstration of weather’s supposedly robust impact on mood. I mean, a 1% increase in rain-induced negative affect is probably not what’s really keeping anyone from moving to Seattle. Yet if anyone’s managed to detect a much bigger effect of weather on mood in a large-sample study, I’m not aware of it.

I’ve also had the pleasure of experiencing the mysterious absence of weather effects firsthand: as a graduate student, I once spent nearly two weeks trying to find effects of weather on mood in a large dataset (thousands of users from over twenty cities worldwide) culled from LiveJournal, taking advantage of users’ ability to indicate their mood in a status field via an emoticon (a feat of modern technology that’s now become nearly universal thanks to the introduction of those 4-byte UTF-8 emoji monstrosities 🙀👻🍧😻). I stratified my data eleventy different ways; I tried kneading it into infinity-hundred pleasant geometric shapes; I sang to it in the shower and brought it ice cream in bed. But nothing worked. And I’m pretty sure it wasn’t that my analysis pipeline was fundamentally broken, because I did manage (as a sanity check) to successfully establish that LiveJournal users are more likely to report feeling “cold” when the temperature outside is lower (❄️😢). So it’s not like physical conditions have no effect on people’s internal states. It’s just that the obvious weather variables (temperature, rain, humidity, etc.) don’t seem to shift our mood very much, despite our persistent convictions.

Needless to say, that project is currently languishing quite comfortably in the seventh level of file drawer hell (i.e., that bottom drawer that I locked then somehow lost the key to).

Anyway, the question I’ve been mulling over on and off for several years now–though, two-week data-mining binge aside, never for long enough to actually arrive at a satisfactory answer–is why empirical studies have been largely unable to detect an effect of weather on mood. Here are some of the potential answers I’ve come up with:

  • There really isn’t a strong effect of weather on mood, and the intuition that there is one stems from a perverse kind of cultural belief or confirmation bias that leads us all to behave in very strange, and often life-changing, ways–for example, to insist on moving to Miami instead of Seattle (which, climate aside, would be a crazy move, right?). This certainly allows for the possibility that there are weak effects on mood–which plenty of data already support–but then, that’s not so exciting, and doesn’t explain why so many people are so eager to move to Hawaii or California for the great weather.

  • Weather does exert a big effect on mood, but it does so in a highly idiosyncratic way that largely averages out across individuals. On this view, while most people’s mood might be sensitive to weather to some degree, the precise manifestation differs across individuals, so that some people would rather shoot themselves in the face than spend a week in an Edmonton winter, while others will swear up and down that it really is possible (no, literally!) to melt in the heat of a Texas summer. From a modeling standpoint, if the effects of weather on mood are reliable but extremely idiosyncratic, identifying consistent patterns could be a very difficult proposition, as it would potentially require us to model some pretty complex higher-order interactions. And the difficulty is further compounded by strong geographic selection biases: since people tend to move to places where they like the climate, the variance in mood attributable to weather changes is probably much smaller than it would be under random dispersal.

  • People’s mood is heavily influenced by the weather when they first spend time somewhere new, but then they get used to it. We habituate to almost everything else, so why not weather? Maybe people who live in California don’t really benefit from living in constant sunshine. Maybe they only enjoyed the sun for their first two weeks in California, and the problem is that now, whenever they travel somewhere else, the rain/snow/heat of other places makes them feel worse than their baseline (habituated) state. In other words, maybe Californians have been snorting sunshine for so long that they now need a hit of clarified sunbeams three times a day just to feel normal.

  • The relationship between objective weather variables and subjective emotional states is highly non-linear. Maybe we can’t consistently detect a relationship between high temperatures and anger because the perception of temperature is highly dependent on a range of other variables (e.g., 30 degrees Celsius can feel quite pleasant on a cloudy day in a dry climate, but intolerable if it’s humid and the sun is out). This would make the modeling challenge more difficult, but certainly not insurmountable.

  • Our measures of mood are not very reliable, and since reliability limits validity, it’s no surprise if we can’t detect consistent effects of weather on mood. Personally I’m actually very skeptical about this one, since there’s plenty of evidence that self-reports of emotion are more than adequate in any number of other situations (e.g., it’s not at all hard to detect strong trait effects of personality on reported mood states). But it’s still not entirely crazy to suggest that maybe what we’re looking at is at least partly a measurement problem—especially once we start talking about algorithmically extracting sentiment from Twitter or Facebook posts, which is a notoriously difficult problem.

  • The effects of weather on mood are strong, but very transient, and we’re simply not very good at computing mental integrals over all of our moment-by-moment experiences. That is, we tend to overestimate the impact of weather on our mood because we find it easy to remember instances when the weather affected our mood, and not so easy to track all of the other background factors that might influence our mood more deeply but less perceptibly. There are many heuristics and biases you could attribute this to (e.g., the peak-end rule, the availability heuristic, etc.), but the basic point is that, on this view, the belief that the weather robustly influences our mood is a kind of mnemonic illusion attributable to well-known bugs in (or, more charitably, features of) our cognitive architecture.
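
To make the second possibility in the list above (strong but idiosyncratic effects) a bit more concrete, here's a toy simulation. It's only a sketch under invented assumptions: every simulated person gets their own temperature-mood slope, drawn from a distribution centered on zero, so heat lifts some people's mood about as much as it sinks others'. The function names and all the numbers are hypothetical.

```python
import random


def ols_slope(x, y):
    """Least-squares slope from regressing y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def pooled_vs_individual(people=200, days=50, seed=0):
    """Return (pooled slope, mean absolute per-person slope)."""
    rng = random.Random(seed)
    all_temp, all_mood, person_slopes = [], [], []
    for _ in range(people):
        true_slope = rng.gauss(0, 1)  # each person's own sensitivity to temperature
        temp = [rng.gauss(0, 1) for _ in range(days)]
        mood = [true_slope * t + rng.gauss(0, 1) for t in temp]
        person_slopes.append(ols_slope(temp, mood))
        all_temp.extend(temp)
        all_mood.extend(mood)
    pooled = ols_slope(all_temp, all_mood)  # one slope fit to everyone at once
    typical = sum(abs(s) for s in person_slopes) / people
    return pooled, typical


pooled, typical = pooled_vs_individual()
print("pooled slope:", round(pooled, 2))
print("typical |individual slope|:", round(typical, 2))
```

In this setup the pooled slope should hover near zero even though the typical person's slope is large in absolute terms. A main-effects analysis of data like these would report next to nothing, which is at least consistent with the idea that the interesting variance lives in person-by-weather interactions rather than in the average effect.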

Anyway, as far as I can tell, none of the above explanations fully account for the available data. And, to be fair, there’s no reason to think any of them should: if I had to guess, I would put money on the true explanation being a convoluted mosaic of some or all of the above factors (plus others I haven’t considered, no doubt). But the proximal problem is that there just doesn’t seem to be much data to speak to the question one way or the other. And this annoys me more than I would like. I won’t go so far as to say I spend a lot of time thinking about the problem, because I don’t. But I think about it often enough that writing a 2,000-word blog post in the hopes that other folks will provide some compelling input seems like a very reasonable time investment.

And so, having read this far—which must mean you’re at least vaguely entertained, right?—it’s your turn to help me out. Please tell me: Why is it so damn hard to detect the effects of weather on mood? Make it rain comments! It will probably cheer me up. Slightly.

☀️🌞😎😅

the weeble distribution: a love story

“I’m a statistician,” she wrote. “By day, I work for the census bureau. By night, I use my statistical skills to build the perfect profile. I’ve mastered the mysterious headline, the alluring photo, and the humorous description that comes off as playful but with a hint of an edge. I’m pretty much irresistible at this point.”

“Really?” I wrote back. “That sounds pretty amazing. The stuff about building the perfect profile, I mean. Not the stuff about working at the census bureau. Working at the census bureau sounds decent, I guess, but not amazing. How do you build the perfect profile? What kind of statistical analysis do you do? I have a bit of programming experience, but I don’t know any statistics. Maybe we can meet some time and you can teach me a bit of statistics.”

I am, as you can tell, a smooth operator.

A reply arrived in my inbox a day later:

No, of course I don’t really spend all my time constructing the perfect profile. What are you, some kind of idiot?

And so was born our brief relationship; it was love at first insult.


“This probably isn’t going to work out,” she told me within five minutes of meeting me in person for the first time. We were sitting in the lobby of the Chateau Laurier downtown. Her choice of venue. It’s an excellent place to meet an internet date; if you don’t like the way they look across the lobby, you just back out quietly and then email the other person to say sorry, something unexpected came up.

“That fast?” I asked. “You can already tell you don’t like me? I’ve barely introduced myself.”

“Oh, no, no. It’s not that. So far I like you okay. I’m just going by the numbers here. It probably isn’t going to work out. It rarely does.”

“That’s a reasonable statement,” I said, “but a terrible thing to say on a first date. How do you ever get a second date with anyone, making that kind of conversation?”

“It helps to be smoking hot,” she said. “Did I offend you terribly?”

“Not really, no. But I’m not a very sentimental kind of guy.”

“Well, that’s good.”


Later, in bed, I awoke to a shooting pain in my leg. It felt like I’d been kicked in the shin.

“Did you just kick me in the shin,” I asked.

“Yes.”

“Any particular reason?”

“You were a little bit on my side of the bed. I don’t like that.”

“Oh. Okay. Sorry.”

“I still don’t think this will work,” she said, then rolled over and went back to sleep.


She was right. We dated for several months, but it never really worked. We had terrific fights, and reasonable make-up sex, but our interactions never had very much substance. We related to one another like two people who were pretty sure something better was going to come along any day now, but in the meantime, why not keep what we had going, because it was better than eating dinner alone.

I never really learned what she liked; I did learn that she disliked most things. Mostly our conversations revolved around statistics and food. I’ll give you some examples.


“Beer is the reason for statistics,” she informed me one night while we were sitting at Cicero’s and sharing a lasagna.

“I imagine beer might be the reason for a lot of bad statistics,” I said.

“No, no. Not just bad statistics. All statistics. The discipline of statistics as we know it exists in large part because of beer.”

“Pray, do go on,” I said, knowing it would have been futile to ask her to shut up.

“Well,” she said, “there once was a man named Student…”

I won’t bore you with all the details; the gist of it is that there once was a man by the name of William Gosset, who worked for Guinness as a brewer in the early 1900s. Like a lot of other people, Gosset was interested in figuring out how to make Guinness taste better, so he invented a bunch of statistical tests to help him quantify the differences in quality between different batches of beer. Guinness didn’t want Gosset to publish his statistical work under his real name, for fear he might somehow give away their trade secrets, so they made him use the pseudonym “Student”. As a result, modern-day statisticians often work with something called Student’s t distribution, which is apparently kind of a big deal. And all because of beer.

“That’s a nice story,” I said. “But clearly, if Student—or Gosset or whatever his real name was—hadn’t been working for Guinness, someone else would have invented the same tests shortly afterwards, right? It’s not like he was so brilliant no one else would have ever thought of the same thing. I mean, if Edison hadn’t invented the light bulb, someone else would have. I take it you’re not really saying that without beer, there would be no statistics.”

“No, that is what I’m saying. No beer, no stats. Simple.”

“Yeah, okay. I don’t believe you.”

“Oh no?”

“No. What’s that thing about lies, damned lies, and stat—”

“Statistics?”

“No. Statisticians.”

“No idea,” she said. “Never heard that saying.”

“It’s that they lie. The saying is that statisticians lie. Repeatedly and often. About anything at all. It’s that they have no moral compass.”

“Sounds about right.”


“I don’t get this whole accurate to within 3 percent 19 times out of 20 business,” I whispered into her ear late one night after we’d had sex all over her apartment. “I mean, either you’re accurate or you’re not, right? If you’re accurate, you’re accurate. And if you’re not accurate, I guess maybe then you could be within 3 percent or 7 percent or whatever. But what the hell does it mean to be accurate X times out of Y? And how would you even know how many times you’re accurate? And why is it always 19 out of 20?”

She turned on the lamp on the nightstand and rolled over to face me. Her hair covered half of her face; the other half was staring at me with those pale blue eyes that always looked like they wanted to either jump you or murder you, and you never knew which.

“You really want me to explain confidence intervals to you at 11:30 pm on a Thursday night?”

“Absolutely.”

“How much time do you have?”

“All, Night, Long,” I said, channeling Lionel Richie.

“Wonderful. Let me put my spectacles on.”

She fumbled around on the nightstand looking for them.

“What do you need your glasses for,” I asked. “We’re just talking.”

“Well, I need to be able to see you clearly. I use the amount of confusion on your face to gauge how much I need to dumb down my explanations.”


Frankly, most of the time she was as cold as ice. The only time she really came alive—other than in the bedroom—was when she talked about statistics. Then she was a different person: excited and exciting, full of energy. She looked like a giant Tesla coil, mid-discharge.

“Why do you like statistics so much,” I asked her over a bento box at ZuNama one day.

“Because,” she said, “without statistics, you don’t really know anything.”

“I thought you said statistics was all about uncertainty.”

“Right. Without statistics, you don’t know anything… and with statistics, you still don’t know anything. But with statistics, we can at least get a sense of how much we know or don’t know.”

“Sounds very… Rumsfeldian,” I said. “Known knowns… unknown unknowns… is that right?”

“It’s kind of right,” she said. “But the error bars are pretty huge.”

“I’m going to pretend I know what that means. If I admit I have no idea, you’ll think I wasn’t listening to you in bed the other night.”

“No,” she said. “I know you were listening. You were listening very well. It’s just that you were understanding very poorly.”


Uncertainty was a big theme for her. Once, to make a point, she asked me how many nostrils a person breathes through at any given time. And then, after I experimented on myself and discovered that the answer was one and not two, she pushed me on it:

“Well, how do you know you’re not the only freak in the world who breathes through one nostril?”

“Easily demonstrated,” I said, and stuck my hand right in front of her face, practically covering her nose.

“Breathe out!”

She did.

“And now breathe in! And then repeat several times!”

She did.

“You see,” I said, retracting my hand once I was satisfied. “It’s not just me. You also breathe through one nostril at a time. Right now it’s your left.”

“That proves nothing,” she said. “We’re not independent observations; I live with you. You probably just gave me your terrible mononarial disease. All you’ve shown is that we’re both sick.”

I realized then that I wasn’t going to win this round—or any other round.

“Try the unagi,” I said, waving at the sushi in a heroic effort to change the topic.

“You know I don’t like to try new things. It’s bad enough I’m eating sushi.”

“Try the unagi,” I suggested again.

So she did.

“It’s not bad,” she said after chewing on it very carefully for a very long time. “But it could use some ketchup.”

“Don’t you dare ask them for ketchup,” I said. “I will get up and leave if you ask them for ketchup.”

She waved her hand at the server.


“There once was a gentleman named Bayes,” she said over coffee at Starbucks one morning. I was running late for work, but so what? Who’s going to pass up the chance to hear about a gentleman named Bayes when the alternative is spending the morning refactoring enterprise code and filing progress reports?

“Oh yes, I’ve heard about him,” I said. “He’s the guy who came up with Bayes’ theorem.” I’d heard of Bayes’ theorem in some distant class somewhere, and knew it had something to do with statistics, though I had not one clue what it actually referred to.

“No, the Bayes I’m talking about is John Bayes—my mechanic. He’s working on my car right now.”

“Really?”

“No, not really, you idiot. Yes, Bayes as in Bayes’ theorem.”

“Thought so. Well, go ahead and tell me all about him. What is John Bayes famous for?”

“Bayes’ theorem.”

“Huh. How about that.”

She launched into a very dry explanation of conditional probabilities and prior distributions and a bunch of other terms I’d never heard of before and haven’t remembered since. I stopped her about three minutes in.

“You know none of this helps me, right? I mean, really, I’m going to forget anything you tell me. You know what might help, is maybe if instead of giving me these long, dry explanations, you could put things in a way I can remember. Like, if you, I don’t know, made up a limerick. I bet I could remember your explanations that way.”

“Oh, a limerick. You want a Bayesian limerick. Okay.”

She scrunched up her forehead like she was thinking very deeply. Held the pose for a few seconds.

“There once was a man named John Bayes,” she began, and then stopped.

“Yes,” I said. “Go on.”

“Who spent most of his days… calculating the posterior probability of go fuck yourself.”

“Very memorable,” I said, waving for the check.


“Suppose I wanted to estimate how much I love you,” I said over asparagus and leek salad at home one night. “How would I do that?”

“You love me?” she arched an eyebrow.

“Good lord no,” I laughed hysterically. “It’s a completely and utterly hypothetical question. But answer it anyway. How would I do it?”

She shrugged.

“That’s a measurement problem. I’m a statistician, not a psychometrician. I develop and test statistical models. I don’t build psychological instruments. I haven’t the faintest idea how you’d measure love. As I’m sure you’ve observed, it’s something I don’t know or care very much about.”

I nodded. I had observed that.

“You act like there’s a difference between all these things there’s really no difference between,” I said. “Models, measures… what the hell do I care? I asked a simple question, and I want a simple answer.”

“Well, my friend, in that case, the answer is that you must look deep into your own heart and say, heart, how much do I love this woman, and then your heart will surely whisper the answer delicately into your oversized ear.”

“That’s the dumbest thing I’ve ever heard,” I said, tugging self-consciously at my left earlobe. It wasn’t that big.

“Right?” she said. “You said you wanted a simple answer. I gave you a simple answer. It also happens to be a very dumb answer. Well, great, now you know one of the fundamental principles of statistical analysis.”

“That simple answers tend to be bad answers?”

“No,” she said. “That when you’re asking a statistician for help, you need to operationalize your question very carefully, or the statistician is going to give you a sensible answer to a completely different question than the one you actually care about.”


“How come you never ask me about my work,” I asked her one night as we were eating dinner at Chez Margarite. She was devouring lemon-infused pork chops; I was eating a green papaya salad with mint chutney and mango salsa dressing.

“Because I don’t really care about your work,” she said.

“Oh. That’s… kind of blunt.”

“Sorry. I figured I should be honest. That’s what you say you want in a relationship, right? Honesty?”

“Sure,” I said, as the server refilled our water glasses.

“Well,” I offered. “Maybe not that much honesty.”

“Would you like me to feign interest?”

“Maybe just for a bit. That might be nice.”

“Okay,” she sighed, giving me the green light with a hand wave. “Tell me about your work.”

It was a new experience for me; I didn’t want to waste the opportunity, so I tried to choose my words carefully.

“Well, for the last month or so, I’ve been working on re-architecting our site’s database back-end. We’ve never had to worry about scaling before. Our DB can handle a few dozen queries per second, even with some pretty complicated joins. But then someone posts a product page to reddit because of a funny typo, and suddenly we’re getting hundreds of requests a second, and all hell breaks loose.”

I went on to tell her about normal forms and multivalued dependencies and different ways of modeling inheritance in databases. She listened along, nodding intermittently and at roughly appropriate intervals. But I could tell her heart wasn’t in it. She kept looking over with curiosity at the group of middle-aged Japanese businessmen seated at the next table over from us. Or out the window at the homeless man trying to sell rhododendrons to passers-by. Really, she looked everywhere but at me. Finally, I gave up.

“Look,” I said, “I know you’re not into this. I guess I don’t really need to tell you about what I do. Do you want to tell me more about the Weeble distribution?”

Her face lit up with excitement; for a moment, she looked like the moon. A cold, heartless, beautiful moon, full of numbers and error bars and mascara.

“Weibull,” she said.

“Fine,” I said. “You tell me about the Weibull distribution, and I’ll feign interest. Then we’ll have crème brûlée for dessert, and then I’ll buy you a rhododendron from that guy out there on the way out.”

“Rhododendrons,” she snorted. “What a ridiculous choice of flower.”


“How long do you think this relationship is going to last,” I asked her one brisk evening as we stood outside Gordon’s Gourmets with oversized hot dogs in hand.

I was fully aware our relationship was a transient thing, like two people hanging out on a ferry for a couple of hours, both perfectly willing to have a reasonably good time together until the boat hits the far side of the lake, but neither having any real interest in trading numbers or full names.

I was in it for—let’s be honest—the sex and the conversation. As for her, I’m not really sure what she got out of it; I’m not very good at either of those things. I suppose she probably had a hard time finding anyone willing to tolerate her for more than a couple of days.

“About another month,” she said. “We should take a trip to Europe and break up there. That way it won’t be messy when we come back. You book your plane ticket, I’ll book mine. We’ll go together, but come back separately. I’ve always wanted to end a relationship that way—in a planned fashion where there are no weird expectations and no hurt feelings.”

“You think planning to break up in Europe a month from now is a good way to avoid hurt feelings?”

“Correct.”

“Okay, I guess I can see that.”


And that’s pretty much how it went. About a month later, we were sitting in a graveyard in a small village in southern France, winding our relationship down. Wine was involved, and had been involved for most of the day; we were both quite drunk.

We’d gone to see this documentary film about homeless magicians who made their living doing card tricks for tourists on the beaches of the French Riviera, and then we stumbled around town until we came across the graveyard, and then, having had a lot of wine, we decided, why not sit on the graves and talk. And so we sat on graves and talked for a while until we finally ran out of steam and affection for each other.

“How do you want to end it,” I asked her when we were completely out of meaningful words, which took less time than you might imagine.

“You sound so sinister,” she said. “Like we’re talking about a suicide pact. When really we’re just two people sitting on graves in a quiet cemetery in France, about to break up forever.”

“Yeah, that. How do you want to end it.”

“Well, I like endings like in Sex, Lies, and Videotape, you know? Endings that don’t really mean anything.”

“You like endings that don’t mean anything.”

“They don’t have to literally mean nothing. I just mean they don’t have to have any deep meaning. I don’t like movies that end on some fake bullshit dramatic note just to further the plot line or provide a sense of closure. I like the ending of Sex, Lies, and Videotape because it doesn’t follow from anything; it just happens.”

“Remind me how it ends?”

“They’re sitting on the steps outside, and Ann, Andie MacDowell’s character, says, ‘I think it’s going to rain.’ Then Graham says, ‘It is raining.’ And that’s it. Fade to black.”

“So that’s what you like.”

“Yes.”

“And you want to end our relationship like that.”

“Yes.”

“Okay,” I said. “I guess I can do that.”

I looked around. It was almost dark, and the bottle of wine was empty. Well, why not.

“I think it’s going to rain,” I said.

“Jesus,” she said incredulously, leaning back against a headstone belonging to some guy named Jean-Francois. “I meant we should end it like that. That kind of thing. Not that actual thing. What are you, some kind of moron?”

“Oh. Okay. And yes.”

I thought about it for a while.

“I think I got this,” I finally said.

“Ok, go,” she smiled. One of the last—and only—times I saw her smile. It was devastating.

“Okay. I’m going to say: I have some unfinished business to attend to at home. I should really get back to my life. And then you should say something equally tangential and vacuous. Something like: ‘yes, you really should get back there. Your life must be lonely without you.’”

“Your life must be lonely without you…” she tried the words out.

“That’s perfect,” she smiled. “That’s exactly what I wanted.”


There is no ceiling effect in Johnson, Cheung, & Donnellan (2014)

This is not a blog post about bullying, negative psychology or replication studies in general. Those are important issues, and a lot of ink has been spilled over them in the past week or two. But this post isn’t about those issues (at least, not directly). This post is about ceiling effects. Specifically, the ceiling effect purportedly present in a paper in Social Psychology, in which Johnson, Cheung, and Donnellan report the results of two experiments that failed to replicate an earlier pair of experiments by Schnall, Benton, and Harvey.

If you’re not up to date on recent events, I recommend reading Vasudevan Mukunth’s post, which provides a nice summary. If you still want to know more after that, you should probably take a gander at the original paper by Schnall, Benton, & Harvey and the replication paper. Still want more? Go read Schnall’s rebuttal. Then read the rejoinder to the rebuttal. Then read Schnall’s first and second blog posts. And maybe a number of other blog posts (here, here, here, and here). Oh, and then, if you still haven’t had enough, you might want to skim the collected email communications between most of the parties in question, which Brian Nosek has been kind enough to curate.

I’m pointing you to all those other sources primarily so that I don’t have to wade very deeply into the overarching issues myself–because (a) they’re complicated, (b) they’re delicate, and (c) I’m still not entirely sure exactly how I feel about them. However, I do have a fairly well-formed opinion about the substantive issue at the center of Schnall’s published rebuttal–namely, the purported ceiling effect that invalidates Johnson et al’s conclusions. So I thought I’d lay that out here in excruciating detail. I’ll warn you right now that if your interests lie somewhere other than the intersection of psychology and statistics (which they probably should), you probably won’t enjoy this post very much. (If your interests do lie at the intersection of psychology and statistics, you’ll probably give this post a solid “meh”.)

Okay, with all the self-handicapping out of the way, let’s get to it. Here’s what I take to be…

Schnall’s argument

The crux of Schnall’s criticism of the Johnson et al replication is a purported ceiling effect. What, you ask, is a ceiling effect? Here’s Schnall’s definition:

A ceiling effect means that responses on a scale are truncated toward the top end of the scale. For example, if the scale had a range from 1-7, but most people selected “7”, this suggests that they might have given a higher response (e.g., “8” or “9”) had the scale allowed them to do so. Importantly, a ceiling effect compromises the ability to detect the hypothesized influence of an experimental manipulation. Simply put: With a ceiling effect it will look like the manipulation has no effect, when in reality it was unable to test for such an effects in the first place. When a ceiling effect is present no conclusions can be drawn regarding possible group differences.

This definition has some subtle-but-important problems we’ll come back to, but it’s reasonable as a first approximation. With this definition in mind, here’s how Schnall describes her core analysis, which she uses to argue that Johnson et al’s results are invalid:

Because a ceiling effect on a dependent variable can wash out potential effects of an independent variable (Hessling, Traxel & Schmidt, 2004), the relationship between the percentage of extreme responses and the effect of the cleanliness manipulation was examined. First, using all 24 item means from original and replication studies, the effect of the manipulation on each item was quantified. … Second, for each dilemma the percentage of extreme responses averaged across neutral and clean conditions was computed. This takes into account the extremity of both conditions, and therefore provides an unbiased indicator of ceiling per dilemma. … Ceiling for each dilemma was then plotted relative to the effect of the cleanliness manipulation (Figure 1).

We can (and will) quibble with these analysis choices, but the net result of the analysis is this:

[Figure: Schnall’s Figure 1, showing normalized effect size (y-axis) against extremity of item response (x-axis)]

Here, we see normalized effect size (y-axis) plotted against extremity of item response (x-axis). Schnall’s basic argument is that there’s a strong inverse relationship between the extremity of responses to an item and the size of the experimental effect on that item. In other words, items with extreme responses don’t show an effect, whereas items with non-extreme responses do show an effect. She goes on to note that this pattern is fully accounted for by her own original experiments, and that there is no such relationship in Johnson et al’s data. On the basis of this finding, Schnall concludes that:

Scores are compressed toward the top end of the scale and therefore show limited determinate variance near ceiling. Because a significance test compares variance due to a manipulation to variance due to error, an observed lack of effect can result merely from a lack in variance that would normally be associated with a manipulation. Given the observed ceiling effect, a statistical artefact, the analyses reported by Johnson et al. (2014a) are invalid and allow no conclusions about the reproducibility of the original findings.

Problems with the argument

One can certainly debate what the implications would be even if Schnall’s argument were correct; for instance, it’s debatable whether the presence of a ceiling effect would actually invalidate Johnson et al’s conclusion that they had failed to replicate Schnall et al. An alternative and reasonable interpretation is that Johnson et al would have simply identified important boundary conditions under which the original effect doesn’t work (e.g., that it doesn’t hold in Michigan residents), since they were using Schnall’s original measures. But we don’t have to worry about that in any case, because there are several serious problems with Schnall’s argument. Some of them have to do with the statistical analysis she performs to make her point; some of them have to do with subtle mischaracterizations of what ceiling effects are and where they come from; and some of them have to do with the fact that Schnall’s data actually directly contradict her own argument. Let’s take each of these in turn.

Problems with the analysis

A first problem with Schnall’s analysis is that the normalization procedure she uses to make her point is biased. Schnall computes the normalized effect size for each item as:

(M1 – M2)/(M1 + M2)

Where M1 and M2 are the means for each item in the two experimental conditions (neutral and clean). This transformation is supposed to account for the fact that scores are compressed at the upper end of the scale, near the ceiling.

What Schnall fails to note, however, is that compression should also occur at the bottom of the scale, near the floor. For example, suppose an individual item has means of 1.2 and 1.4. Then Schnall’s normalized effect size estimate would be 0.2/2.6 ≈ 0.08. But if the means had been 4.0 and 4.2–the same absolute difference–then the adjusted estimate would actually be much smaller (around 0.02). So Schnall’s analysis is actually biased in favor of detecting the negative correlation she takes as evidence of a ceiling effect, because she’s not accounting for floor effects simultaneously. A true “clipping” or compression of scores shouldn’t occur at only one extreme of the scale; what should matter is how far from the midpoint a response happens to be. What should happen, if Schnall were to recompute the scores in Figure 1 using a modified criterion (e.g., relative deviation from the scale’s midpoint, rather than absolute score), is that the points at the top left of the figure should pull towards the y-axis to some degree, effectively reducing the slope she takes as evidence of a problem. If there’s any pattern that would suggest a measurement problem, it’s actually an inverted U-shape, where normalized effects are greatest for items with means nearest the midpoint, and smallest for items at both extremes, not just near ceiling. But that’s not what we’re shown.
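
For readers who want to check the arithmetic, here is a minimal Python sketch of the normalization as given in the formula above. The function name `normalized_effect` and the particular means are my own illustrative choices, not anything taken from Schnall’s materials; the only thing borrowed from her analysis is the (M1 – M2)/(M1 + M2) formula itself.

```python
def normalized_effect(m1, m2):
    """Normalized effect size as described in the text: (M1 - M2) / (M1 + M2)."""
    return (m1 - m2) / (m1 + m2)


# Hold the absolute difference between conditions fixed at 0.2 and slide
# the pair of means up the scale. The means are hypothetical examples.
for low in (1.2, 4.0, 8.0):
    high = low + 0.2
    effect = normalized_effect(high, low)
    print(f"means {low:.1f} and {high:.1f}: normalized effect = {effect:.3f}")
```

The same 0.2-point difference yields a smaller and smaller normalized value as the means move up the scale, purely because the denominator grows. So some negative relationship between this index and how high the item means are is built into the formula, before any ceiling comes into play.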

A second problem is that Schnall’s data actually contradict her own conclusion. She writes:

Across the 24 dilemmas from all 4 experiments, dilemmas with a greater percentage of extreme responses were associated with lower effect sizes (r = -.50, p = .01, two-tailed). This negative correlation was entirely driven by the 12 original items, indicating that the closer responses were to ceiling, the smaller was the effect of the manipulation (r = -.49, p = .10). In contrast, across the 12 replication items there was no correlation (r = .11, p = .74).

But if anything, these results provide evidence of a ceiling effect only in Schnall’s original study, and not in the Johnson et al replications. Recall that Schnall’s argument rests on two claims: (a) effects are harder to detect the more extreme responding on an item gets, and (b) responding is so extreme on the items in the Johnson et al experiments that nothing can be detected. But the results she presents blatantly contradict the second claim. Had there been no variability in item means in the Johnson et al studies, Schnall could have perhaps argued that restriction of range is so extreme that it is impossible to detect any kind of effect. In practice, however, that’s not the case. There is considerable variability along the x-axis, and in particular, one can clearly see that there are two items in Johnson et al that are nowhere near ceiling and yet show no discernible normalized effect of experimental condition at all. Note that these are the very same items that show some of the strongest effects in Schnall’s original study. In other words, the data Schnall presents in support of her argument actually directly contradict her argument. If one is to believe that a ceiling effect is preventing Schnall’s effect from emerging in Johnson et al’s replication studies, then there is no reasonable explanation for the fact that those two leftmost red squares in the figure above are close to the y = 0 line. They should be behaving exactly like they did in Schnall’s study–which is to say, they should be showing very large normalized effects–even if items at the very far right show no effects at all.

Third, Schnall’s argument that a ceiling effect completely invalidates Johnson et al’s conclusions is a gross exaggeration. Ceiling effects are not all-or-none; the degree of score compression into the upper end of a measure will vary continuously (unless there is literally no variance at all in the responses, which is clearly not the case here). Even if we took at face value Schnall’s finding that there’s an inverse relationship between effect size and extremity in her original data (r = -0.5), all this would tell us is that there’s some compression of scores. Schnall’s suggestion that “given the observed ceiling effect, a statistical artifact, the analyses reported in Johnson et al (2014a) are invalid and allow no conclusions about the reproducibility of the original findings” is simply false. Even in the very best case scenario (which this obviously isn’t), the very strongest claim Schnall could comfortably make is that there may be some compression of scores, with unknown impact on the detectable effect size. It is simply not credible for Schnall to suggest that the mere presence of something that looks vaguely like a ceiling effect is sufficient to completely rule out detection of group differences in the Johnson et al experiments. And we know this with 100% certainty, because…

There are robust group differences in the replication experiments

Perhaps the clearest refutation of Schnall’s argument for a ceiling effect is that, as Johnson et al noted in their rejoinder, the Johnson et al experiments did in fact successfully identify some very clear group differences (and, ironically, ones that were also present in Schnall’s original experiments). Specifically, Johnson et al showed a robust effect of gender on vignette ratings. Here’s what the results look like:

We can see clearly that, in both replication experiments, there’s a large effect of gender but no discernible effect of experimental condition. This pattern directly refutes Schnall’s argument. She cannot have it both ways: if a ceiling effect precludes the presence of group differences, then there cannot be a ceiling effect in the replication studies, or else the gender effect could not have emerged repeatedly. Conversely, if ceiling effects don’t preclude detection of effects, then there is no principled reason why Johnson et al would fail to detect Schnall’s original effect.

Interestingly, it’s not just the overall means that tell the story quite clearly. Here’s what happens if we plot the gender effects in Johnson et al’s experiments in the same way as Schnall’s Figure 1 above:

[Figure: gender effects in Johnson et al’s experiments, plotted by item extremity as in Schnall’s Figure 1]

Notice that we see here the same negative relationship between effect size and extremity that Schnall observed in her own data, and whose absence in Johnson et al’s data she (erroneously) took as evidence of a ceiling effect.
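
If it helps to see why a ceiling is a matter of degree rather than an on/off switch, here is a small simulation sketch in Python. Everything in it is an assumption made for illustration: the normal latent scores, the 0–9 scale bounds, the one-point latent group difference, the standard deviation of 2, and the function name `observed_difference`. None of it is estimated from either set of experiments.

```python
import random


def observed_difference(latent_mean, shift=1.0, sd=2.0,
                        floor=0.0, ceiling=9.0, n=20000, seed=0):
    """Group difference in observed means after clipping scores to the scale.

    Latent scores are drawn from a normal distribution and then clipped to
    [floor, ceiling]. All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)

    def clipped_mean(mu):
        # Draw n latent scores and force each one onto the response scale.
        scores = (min(max(rng.gauss(mu, sd), floor), ceiling) for _ in range(n))
        return sum(scores) / n

    control = clipped_mean(latent_mean)
    treated = clipped_mean(latent_mean + shift)
    return treated - control


# Move the control group's latent mean closer and closer to the ceiling.
for mu in (4.5, 7.0, 8.0, 9.0):
    diff = observed_difference(mu)
    print(f"latent control mean {mu:.1f}: observed difference = {diff:.2f}")
```

In this toy setup the observed difference shrinks as the groups are pushed toward the top of the scale, but it does not drop to zero. That is the sense in which compression attenuates an effect without automatically making it undetectable.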

There’s a ceiling effect in Schnall’s own data

Yet another flaw in Schnall’s argument is that taking the ceiling effect charge seriously would actually invalidate at least one of her own experiments. Consider that the only vignette in Schnall et al’s original Experiment 1 that showed a statistically significant effect also had the highest rate of extreme responding in that study (mean rating of 8.25 / 9). Even more strikingly, the proportion of participants who gave the most extreme response possible on that vignette (70%) was higher than for any of the vignettes in either of Johnson et al’s experiments. In other words, Schnall’s core argument is that her effect could not possibly be replicated in Johnson et al’s experiments because of the presence of a ceiling effect, yet the only vignette to show a significant effect in Schnall’s original Experiment 1 had an even more pronounced ceiling effect. Once again, she cannot have it both ways. Either ceiling effects don’t preclude detection of effects, or, by Schnall’s own logic, the original Study 1 effect was probably a false positive.

When pressed on this point by Daniel Lakens in the email thread, Schnall gave the following response:

Note for the original studies we reported that the effect was seen on aggregate data, not necessarily for individual dilemmas. Such results will always show statistical fluctuations at the item level, hence it is important to not focus on any individual dilemma but on the overall pattern.

I confess that I’m not entirely clear on what Schnall means here. One way to read this is that she is conceding that the significant effect in the vignette in question (the “kitten” dilemma) was simply due to random fluctuations. Note that since the effect in Schnall’s Experiment 1 was only barely significant when averaging across all vignettes (in fact, it wasn’t quite significant even so), eliminating this vignette from consideration would actually have produced a null result. But suppose we overlook that and instead agree with Schnall that strange things can happen to individual items, and that what we should focus on is the aggregate moral judgment, averaged across vignettes. That would be perfectly reasonable, except that it’s directly at odds with Schnall’s more general argument. To see this, we need only look at the aggregate distribution of scores in Johnson et al’s Experiments 1 and 2:

[Figure: johnson_distributions]

There’s clearly no ceiling effect here; the mode in both experiments is nowhere near the maximum. So once again, Schnall can’t have it both ways. If her argument is that what matters is the aggregate measure (which seems right to me, since many reputable measures have multiple individual items with skewed distributions, and this can even be a desirable property in certain cases), then there’s nothing objectionable about the scores in the Johnson et al experiments. Conversely, if Schnall’s argument is that it’s fair to pick on individual items, then there is effectively no reason to believe Schnall’s own original Experiment 1 (and for all I know, her experiment 2 as well–I haven’t looked).

What should we conclude?

What can we conclude from all this? A couple of things. First, Schnall has no basis for arguing that there was a fundamental statistical flaw that completely invalidates Johnson et al’s conclusions. From where I’m sitting, there doesn’t seem to be any meaningful ceiling effect in Johnson et al’s data, and that’s attested to by the fact that Johnson et al had no trouble detecting gender differences in both experiments (successfully replicating Schnall’s earlier findings). Moreover, the arguments Schnall makes in support of the postulated ceiling effects suffer from serious flaws. At best, what Schnall could reasonably argue is that there might be some restriction of range in the ratings, which would artificially reduce the effect size. However, given that Johnson et al’s sample sizes were 3 – 5 times larger than Schnall’s, it is highly implausible to suppose that effects as big as Schnall’s completely disappeared–especially given that robust gender effects were detected. Moreover, given that the skew in Johnson et al’s aggregate distributions is not very extreme at all, and that many individual items on many questionnaire measures show ceiling or floor effects (e.g., go look at individual Big Five item distributions some time), taking Schnall’s claims seriously one would in effect invalidate not just Johnson et al’s results, but also a huge proportion of the more general psychology literature.

Second, while Schnall has raised a number of legitimate and serious concerns about the tone of the debate and comments surrounding Johnson et al’s replication, she’s also made a number of serious charges of her own that depend on the validity of her argument about ceiling effects, and not on the civility (or lack thereof) of commentators on various sides of the debate. Schnall has (incorrectly) argued that Johnson et al have committed a basic statistical error that most peer reviewers would have caught–effectively accusing them of incompetence. She has argued that Johnson et al’s claim of replication failure is unwarranted, and constitutes defamation of her scientific reputation. And she has suggested that the editors of the special issue (Daniel Lakens and Brian Nosek) behaved unethically by first not seeking independent peer review of the replication paper, and then actively trying to suppress her own penetrating criticisms. In my view, none of these accusations are warranted, because they depend largely on Schnall’s presumption of a critical flaw in Johnson et al’s work that is in fact nonexistent. I understand that Schnall has been under a lot of stress recently, and I sympathize with her concerns over unfair comments made by various people (most of whom have now issued formal apologies). But given the acrimonious tone of the more general ongoing debate over replication, it’s essential that we distinguish the legitimate issues from the illegitimate ones so that we can focus exclusively on the former, and don’t end up needlessly generating more hostility on both sides.

Lastly, there is the question of what conclusions we should draw from the Johnson et al replication studies. Personally, I see no reason to question Johnson et al’s conclusions, which are actually very modest:

In short, the current results suggest that the underlying effect size estimates from these replication experiments are substantially smaller than the estimates generated from the original SBH studies. One possibility is that there are unknown moderators that account for these apparent discrepancies. Perhaps the most salient difference between the current studies and the original SBH studies is the student population. Our participants were undergraduates in the United States whereas participants in SBH’s studies were undergraduates in the United Kingdom. It is possible that cultural differences in moral judgments or in the meaning and importance of cleanliness may explain any differences.

Note that Johnson et al did not assert or intimate in any way that Schnall et al’s effects were “not real”. They did not suggest that Schnall et al had committed any errors in their original study. They explicitly acknowledged that unknown moderators might explain the difference in results (though they also noted that this was unlikely considering the magnitude of the differences). Effectively, Johnson et al stuck very close to their data and refrained from any kind of unfounded speculation.

In sum, unless Schnall has other concerns about Johnson’s data besides the purported ceiling effect (and she hasn’t raised any that I’ve seen), I think Johnson et al’s paper should enter the record exactly as its authors intended. Johnson, Cheung, & Donnellan (2014) is, quite simply, a direct preregistered replication of Schnall, Benton, & Harvey (2008) that failed to detect the effects reported in the original study, and there should be nothing at all controversial about this. There are certainly worthwhile discussions to be had about why the replication failed, and what that means for the original effect, but this doesn’t change the fundamental fact that the replication did fail, and we shouldn’t pretend otherwise.

what exactly is it that 53% of neuroscience articles fail to do?

[UPDATE: Jake Westfall points out in the comments that the paper discussed here appears to have made a pretty fundamental mistake that I then carried over to my post. I’ve updated the post accordingly.]

[UPDATE 2: the lead author has now responded and answered my initial question and some follow-up concerns.]

A new paper in Nature Neuroscience by Emmeke Aarts and colleagues argues that neuroscientists should start using hierarchical (or multilevel) models in their work in order to account for the nested structure of their data. From the abstract:

In neuroscience, experimental designs in which multiple observations are collected from a single research object (for example, multiple neurons from one animal) are common: 53% of 314 reviewed papers from five renowned journals included this type of data. These so-called ‘nested designs’ yield data that cannot be considered to be independent, and so violate the independency assumption of conventional statistical methods such as the t test. Ignoring this dependency results in a probability of incorrectly concluding that an effect is statistically significant that is far higher (up to 80%) than the nominal α level (usually set at 5%). We discuss the factors affecting the type I error rate and the statistical power in nested data, methods that accommodate dependency between observations and ways to determine the optimal study design when data are nested. Notably, optimization of experimental designs nearly always concerns collection of more truly independent observations, rather than more observations from one research object.

I don’t have any objection to the advocacy for hierarchical models; that much seems perfectly reasonable. If you have nested data, where each subject (or petri dish or animal or whatever) provides multiple samples, it’s sensible to try to account for as many systematic sources of variance as you can. That point may have been made many times before, but it never hurts to make it again.

What I do find surprising though–and frankly, have a hard time believing–is the idea that 53% of neuroscience articles are at serious risk of Type I error inflation because they fail to account for nesting. This seems to me to be what the abstract implies, yet it’s a much stronger claim that doesn’t actually follow just from the observation that virtually no studies that have reported nested data have used hierarchical models for analysis. What it also requires is for all of those studies that use “conventional” (i.e., non-hierarchical) analyses to have actively ignored the nesting structure and treated repeated measurements as if they in fact came from entirely different subjects or clusters.

To make this concrete, suppose we have a dataset made up of 400 observations, consisting of 20 subjects who each provided 10 trials in 2 different experimental conditions (i.e., 20 x 2 x 10 = 400). And suppose the thing we ultimately want to know is whether or not there’s a statistical difference in outcome between the two conditions. There are at least three ways we could set up our comparison:

  1. Ignore the grouping variable (i.e., subject) entirely, effectively giving us 200 observations in each condition. We then conduct the test as if we have 200 independent observations in each condition.
  2. Average the 10 trials in each condition within each subject first, then conduct the test on the subject means. In this case, we effectively have 20 observations in each condition (1 per subject).
  3. Explicitly include the effects of both subject and trial in our model. In this case we have 400 observations, but we’re explicitly accounting for the correlation between trials within a given subject, so that the statistical comparison of conditions effectively has somewhere between 20 and 400 “observations” (or degrees of freedom).

Now, none of these approaches is strictly “wrong”, in that there could be specific situations in which any one of them would be called for. But as a general rule, the first approach is almost never appropriate. The reason is that we typically want to draw conclusions that generalize across the cases in the higher level of the hierarchy, and don’t have any intrinsic interest in the individual trials themselves. In the above example, we’re asking whether people, on average, behave differently in the two conditions. If we treat our data as if we had 200 subjects in each condition, effectively concatenating trials across all subjects, we’re ignoring the fact that the responses acquired from each subject will tend to be correlated (i.e., Jane Doe’s behavior on Trial 2 will tend to be more similar to her own behavior on Trial 1 than to another subject’s behavior on Trial 1). So we’re pretending that we know something about 200 different individuals sampled at random from the population, when in fact we only know something about 20 different individuals. The upshot, if we use approach (1), is that we’re going to end up answering a question quite different from the one we think we’re answering. [Update: Jake Westfall points out in the comments below that we won’t necessarily inflate Type I error rate. Rather, the net effect of failing to model the nesting structure properly will depend on the relative amount of within-cluster vs. between-cluster variance. The answer we get will, however, usually deviate considerably from the answer we would get using approaches (2) or (3).]
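To see how differently approaches (1) and (2) treat the same 400 observations, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the simulated data (a random per-subject baseline plus trial noise, with no true condition effect), the variance values, and the choice of a paired t statistic on subject means as one possible way to carry out approach (2). Approach (3) would need a mixed-model library and is not shown.

```python
import random
import statistics
from math import sqrt

random.seed(0)

N_SUBJECTS, N_TRIALS = 20, 10

# Simulated data (hypothetical): each subject has a personal baseline,
# every trial adds noise, and no condition effect is built in.
data = []  # rows of (subject, condition, value)
for subject in range(N_SUBJECTS):
    baseline = random.gauss(0, 1.0)
    for condition in (0, 1):
        for _ in range(N_TRIALS):
            data.append((subject, condition, baseline + random.gauss(0, 1.0)))


def pooled_t(a, b):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / df
    t = (statistics.mean(a) - statistics.mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


def paired_t(diffs):
    """One-sample t statistic on per-subject differences, with its df."""
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / sqrt(n))
    return t, n - 1


# Approach (1): ignore subjects, treat all 200 trials per condition as independent.
cond0 = [v for _, c, v in data if c == 0]
cond1 = [v for _, c, v in data if c == 1]
t1, df1 = pooled_t(cond1, cond0)


# Approach (2): average within subject and condition first, then test the means.
def subject_mean(subject, condition):
    return statistics.mean([v for s, c, v in data if s == subject and c == condition])


diffs = [subject_mean(s, 1) - subject_mean(s, 0) for s in range(N_SUBJECTS)]
t2, df2 = paired_t(diffs)

print(f"approach 1: t = {t1:.2f} on {df1} degrees of freedom")
print(f"approach 2: t = {t2:.2f} on {df2} degrees of freedom")
```

The two analyses use 398 and 19 degrees of freedom respectively, which is the "200 individuals versus 20 individuals" point in numerical form. How far apart the two t statistics land depends on the assumed variances, in line with the update above.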

By contrast, approaches (2) and (3) will, in most cases, produce pretty similar results. It’s true that the hierarchical approach is generally a more sensible thing to do, and will tend to provide a better estimate of the true population difference between the two conditions. However, it’s probably better to describe approach (2) as suboptimal, and not as wrong. So long as the subjects in our toy example above are in fact sampled at random, it’s pretty reasonable to assume that we have exactly 20 independent observations, and analyze our data accordingly. Our resulting estimates might not be quite as good as they could have been, but we’re unlikely to miss the mark by much.

To return to the Aarts et al paper, the key question is what exactly the authors mean when they say in their abstract that:

In neuroscience, experimental designs in which multiple observations are collected from a single research object (for example, multiple neurons from one animal) are common: 53% of 314 reviewed papers from five renowned journals included this type of data. These so-called ‘nested designs’ yield data that cannot be considered to be independent, and so violate the independency assumption of conventional statistical methods such as the t test. Ignoring this dependency results in a probability of incorrectly concluding that an effect is statistically significant that is far higher (up to 80%) than the nominal α level (usually set at 5%).

I’ve underlined the key phrases here. It seems to me that the implication the reader is supposed to draw from this is that roughly 53% of the neuroscience literature is at high risk of reporting spurious results. But in reality this depends entirely on whether the authors mean that 53% of studies are modeling trial-level data but ignoring the nesting structure (as in approach 1 above), or that 53% of studies in the literature aren’t using hierarchical models, even though they may be doing nothing terribly wrong otherwise (e.g., because they’re using approach (2) above).

Unfortunately, the rest of the manuscript doesn’t really clarify the matter. Here’s the section in which the authors report how they obtained that 53% number:

To assess the prevalence of nested data and the ensuing problem of inflated type I error rate in neuroscience, we scrutinized all molecular, cellular and developmental neuroscience research articles published in five renowned journals (Science, Nature, Cell, Nature Neuroscience and every month’s first issue of Neuron) in 2012 and the first six months of 2013. Unfortunately, precise evaluation of the prevalence of nesting in the literature is hampered by incomplete reporting: not all studies report whether multiple measurements were taken from each research object and, if so, how many. Still, at least 53% of the 314 examined articles clearly concerned nested data, of which 44% specifically reported the number of observations per cluster with a minimum of five observations per cluster (that is, for robust multilevel analysis a minimum of five observations per cluster is required11, 12). The median number of observations per cluster, as reported in literature, was 13 (Fig. 1a), yet conventional analysis methods were used in all of these reports.

This is, as far as I can see, still ambiguous. The only additional information provided here is that 44% of studies specifically reported the number of observations per cluster. Unfortunately this still doesn’t tell us whether the effective degrees of freedom used in the statistical tests in those papers included nested observations, or instead averaged over nested observations within each group or subject prior to analysis.

Lest this seem like a rather pedantic statistical point, I hasten to emphasize that a lot hangs on it. The potential implications for the neuroscience literature are very different under each of these two scenarios. If it is in fact true that 53% of studies are inappropriately using a “fixed-effects” model (approach 1)–which seems to me to be what the Aarts et al abstract implies–the upshot is that a good deal of neuroscience research is in very bad statistical shape, and the authors will have done the community a great service by drawing attention to the problem. On the other hand, if the vast majority of the studies in that 53% are actually doing their analyses in a perfectly reasonable–if perhaps suboptimal–way, then the Aarts et al article seems rather alarmist. It would, of course, still be true that hierarchical models should be used more widely, but the cost of failing to switch would be much lower than seems to be implied.

I’ve emailed the corresponding author to ask for a clarification. I’ll update this post if I get a reply. In the meantime, I’m interested in others’ thoughts as to the likelihood that around half of the neuroscience literature involves inappropriate reporting of fixed-effects analyses. I guess personally I would be very surprised if this were the case, though it wouldn’t be unprecedented–e.g., I gather that in the early days of neuroimaging, the SPM analysis package used a fixed-effects model by default, resulting in quite a few publications reporting grossly inflated t/z/F statistics. But that was many years ago, and in the literatures I read regularly (in psychology and cognitive neuroscience), this problem rarely arises any more. A priori, I would have expected the same to be true in cellular and molecular neuroscience.


UPDATE 04/01 (no, not an April Fool’s joke)

The lead author, Emmeke Aarts, responded to my email. Here’s her reply in full:

Thank you for your interest in our paper. As the first author of the paper, I will answer the question you send to Sophie van der Sluis. Indeed we report that 53% of the papers include nested data using conventional statistics, meaning that they did not use multilevel analysis but an analysis method that assumes independent observations like a students t-test or ANOVA.

As you also note, the data can be analyzed at two levels, at the level of the individual observations, or at the subject/animal level. Unfortunately, with the information the papers provided us, we could not extract this information for all papers. However, as described in the section ‘The prevalence of nesting in neuroscience studies’, 44% of these 53% of papers including nested data, used conventional statistics on the individual observations, with at least a mean of 5 observations per subject/animal. Another 7% of these 53% of papers including nested data used conventional statistics at the subject/animal level. So this leaves 49% unknown. Of this 49%, there is a small percentage of papers which analyzed their data at the level of individual observations, but had a mean less than 5 observations per subject/animal (I would say 10 to 20% out of the top of my head), the remaining percentage is truly unknown. Note that with a high level of dependency, using conventional statistics on nested data with 2 observations per subject/animal is already undesirable. Also note that not only analyzing nested data at the individual level is undesirable, analyzing nested data at the subject/animal level is unattractive as well, as it reduces the statistical power to detect the experimental effect of interest (see fig. 1b in the paper), in a field in which a decent level of power is already hard to achieve (e.g., Button 2013).

I think this definitively answers my original question: according to Aarts, of the 53% of studies that used nested data, at least 44% performed conventional (i.e., non-hierarchical) statistical analyses on the individual observations. (I would dispute the suggestion that this was already stated in the paper; the key phrase is “on the individual observations”, and the wording in the manuscript was much more ambiguous.) Aarts suggests that ~50% of the studies couldn’t be readily classified, so in reality that proportion could be much higher. But we can say that at least 23% of the literature surveyed committed what would, in most domains, constitute a fairly serious statistical error.
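For anyone who wants to check where the 23% comes from, the arithmetic can be sketched in a few lines. The proportions are the ones quoted from the paper and from Aarts's email; the upper bound is my own illustrative assumption (it supposes that every unclassified paper also analyzed individual observations), not a figure anyone has reported.

```python
# Proportions as quoted from the paper and from Aarts's email.
nested = 0.53             # share of the 314 papers that clearly had nested data
observation_level = 0.44  # of those: conventional tests on individual observations
subject_level = 0.07      # of those: conventional tests at the subject/animal level
unknown = 0.49            # of those: could not be classified

# Lower bound: only the papers known to analyze individual observations.
lower_bound = nested * observation_level

# Hypothetical upper bound: assume all unclassified papers did the same.
upper_bound = nested * (observation_level + unknown)

print(f"at least {lower_bound:.1%} of all surveyed papers")  # at least 23.3%
print(f"at most {upper_bound:.1%} under the assumption above")  # at most 49.3%
```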

I then sent Aarts another email following up on Jake Westfall’s comment (i.e., how nested vs. crossed designs were handled). She replied:

As Jake Westfall points out, it indeed depends on the design if ignoring intercept variance (so variance in the mean observation per subject/animal) leads to an inflated type I error. There are two types of designs we need to distinguish here, design type I, where the experimental variable (for example control or experimental group) does not vary within the subjects/animals but only over the subjects/animals, and design Type II, where the experimental variable does vary within the subject/animal. Only in design type I, the type I error is increased by intercept variance. As pointed out in the discussion section of the paper, the paper only focuses on design Type I (“Here we focused on the most common design, that is, data that span two levels (for example, cells in mice) and an experimental variable that does not vary within clusters (for example, in comparing cell characteristic X between mutants and wild types, all cells from one mouse have the same genotype)”), to keep this already complicated matter accessible to a broad readership. Moreover, design type I is what is most frequently seen in biological neuroscience, taking multiple observations from one animal and subsequently comparing genotypes automatically results in a type I research design.

When dealing with a research design II, it is actually the variation in effect within subject/animals that increases the type I error rate (the so-called slope variance), but I will not elaborate too much on this since it is outside the scope of this paper and a completely different story.
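The design type I case described in the email can be illustrated with a small simulation. This is only a sketch under assumptions of my own choosing: 5 mice per genotype, 10 cells per mouse, equal between-mouse (intercept) and within-mouse variance, and no true genotype effect. The critical values are the standard two-sided 5% cutoffs of Student's t for 98 and 8 degrees of freedom. None of these numbers come from the Aarts et al paper.

```python
import random
import statistics
from math import sqrt

random.seed(1)

N_MICE_PER_GROUP = 5     # hypothetical
N_CELLS_PER_MOUSE = 10   # hypothetical
N_SIMULATIONS = 1000

# Two-sided 5% critical values of Student's t.
T_CRIT_CELLS = 1.984  # df = 98 (100 cells - 2)
T_CRIT_MICE = 2.306   # df = 8  (10 mice - 2)


def two_sample_t(a, b):
    """Equal-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))


def simulate_group():
    """One list of cell values per mouse; no genotype effect is simulated."""
    mice = []
    for _ in range(N_MICE_PER_GROUP):
        mouse_intercept = random.gauss(0, 1.0)  # between-mouse (intercept) variance
        mice.append([mouse_intercept + random.gauss(0, 1.0)  # within-mouse noise
                     for _ in range(N_CELLS_PER_MOUSE)])
    return mice


false_pos_cells = 0
false_pos_mice = 0
for _ in range(N_SIMULATIONS):
    mutant, wild_type = simulate_group(), simulate_group()

    # Conventional test on individual cells, ignoring which mouse they came from.
    cells_m = [x for mouse in mutant for x in mouse]
    cells_w = [x for mouse in wild_type for x in mouse]
    if abs(two_sample_t(cells_m, cells_w)) > T_CRIT_CELLS:
        false_pos_cells += 1

    # Conventional test on one mean per mouse.
    means_m = [statistics.mean(mouse) for mouse in mutant]
    means_w = [statistics.mean(mouse) for mouse in wild_type]
    if abs(two_sample_t(means_m, means_w)) > T_CRIT_MICE:
        false_pos_mice += 1

rate_cells = false_pos_cells / N_SIMULATIONS
rate_mice = false_pos_mice / N_SIMULATIONS
print(f"false positive rate, cell-level test:  {rate_cells:.3f}")
print(f"false positive rate, mouse-level test: {rate_mice:.3f}")
```

Under these assumptions the cell-level test rejects the (true) null far more often than the nominal 5%, while the test on mouse means stays close to it. Shrinking the between-mouse variance toward zero should bring the two rates together, which is the dependence on intercept variance that the email describes.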

Again, this all sounds very straightforward and sound to me. So after both of these emails, here’s my (hopefully?) final take on the paper:

  • Work in molecular, cellular, and developmental neuroscience–or at least, the parts of those fields well-represented in five prominent journals–does indeed appear to suffer from some systemic statistical problems. While the proportion of studies at high risk of Type I error is smaller than the number Aarts et al’s abstract suggests (53%), the more accurate estimate (at least 23% of the literature) is still shockingly high. This doesn’t mean that a quarter or more of the literature can’t be trusted–as some of the commenters point out below, most conclusions aren’t based on just a single p value from a single analysis–but it does raise some very serious concerns. The Aarts et al paper is an important piece of work that will help improve statistical practice going forward.
  • The comments on this post, and on Twitter, have been interesting to read. There appear to be two broad camps of people who were sympathetic to my original concern about the paper. One camp consists of people who were similarly concerned about technical aspects of the paper, and in most cases were tripped up by the same confusion surrounding what the authors meant when they said 53% of studies used “conventional statistical analyses”. That point has now been addressed. The other camp consists of people who appear to work in the areas of neuroscience Aarts et al focused on, and were reacting not so much to the specific statistical concern raised by Aarts et al as to the broader suggestion that something might be deeply wrong with the neuroscience literature because of this. I confess that my initial knee-jerk reaction to the Aarts et al paper was driven in large part by the intuition that surely it wasn’t possible for so large a fraction of the literature to be routinely modeling subjects/clusters/groups as fixed effects. But since it appears that that is in fact the case, I’m not sure what to say with respect to the broader question of whether it is or isn’t appropriate to ignore nesting in animal studies. I will say that in the domains I personally work in, it seems very clear that collapsing across all subjects for analysis purposes is nearly always (if not always) a bad idea. Beyond that, I don’t really have any further opinion other than what I said in this response to a comment below.
  • While the claims made in the paper appear to be fundamentally sound, the presentation leaves something to be desired. It’s unclear to me why the authors relegated some of the most important technical points to the Discussion, or didn’t explicitly state them at all. The abstract also seems to me to be overly sensational–though, in hindsight, not nearly as much as I initially suspected. And it also seems questionable to tar all of neuroscience with a single brush when the analyses reported only applied to a few specific domains (and we know for a fact that in, say, neuroimaging, this problem is almost nonexistent). I guess to be charitable, one could pick the same bone with a very large proportion of published work, and this kind of thing is hardly unique to this study. Then again, the fact that a practice is widespread surely isn’t sufficient to justify that practice–or else there would be little point in Aarts et al criticizing a practice that so many people clearly engage in routinely.
  • Given my last post, I can’t help pointing out that this is a nice example of how mandatory data sharing (or failing that, a culture of strong expectations of preemptive sharing) could have made evaluation of scientific claims far easier. If the authors had attached the data file coding the 314 studies they reviewed as a supplement, I (and others) would have been able to clarify the ambiguity I originally raised much more quickly. I did send a follow-up email to Aarts to ask if she and her colleagues would consider putting the data online, but haven’t heard back yet.

The homogenization of scientific computing, or why Python is steadily eating other languages’ lunch

Over the past two years, my scientific computing toolbox has been steadily homogenizing. Around 2010 or 2011, my toolbox looked something like this:

  • Ruby for text processing and miscellaneous scripting;
  • Ruby on Rails/JavaScript for web development;
  • Python/NumPy (mostly) and MATLAB (occasionally) for numerical computing;
  • MATLAB for neuroimaging data analysis;
  • R for statistical analysis;
  • R for plotting and visualization;
  • Occasional excursions into other languages/environments for other stuff.

In 2013, my toolbox looks like this:

  • Python for text processing and miscellaneous scripting;
  • Ruby on Rails/JavaScript for web development, except for an occasional date with Django or Flask (Python frameworks);
  • Python (NumPy/SciPy) for numerical computing;
  • Python (Neurosynth, NiPy etc.) for neuroimaging data analysis;
  • Python (NumPy/SciPy/pandas/statsmodels) for statistical analysis;
  • Python (MatPlotLib) for plotting and visualization, except for web-based visualizations (JavaScript/d3.js);
  • Python (scikit-learn) for machine learning;
  • Excursions into other languages have dropped markedly.

You may notice a theme here.

The increasing homogenization (Pythonification?) of the tools I use on a regular basis primarily reflects the spectacular recent growth of the Python ecosystem. A few years ago, you couldn’t really do statistics in Python unless you wanted to spend most of your time pulling your hair out and wishing Python were more like R (which is a pretty remarkable confession considering what R is like). Neuroimaging data could be analyzed in SPM (MATLAB-based), FSL, or a variety of other packages, but there was no viable full-featured, free, open-source Python alternative. Packages for machine learning, natural language processing, and web application development were only just starting to emerge.

These days, tools for almost every aspect of scientific computing are readily available in Python. And in a growing number of cases, they’re eating the competition’s lunch.

Take R, for example. R’s out-of-the-box performance with out-of-memory datasets has long been recognized as its Achilles heel (yes, I’m aware you can get around that if you’re willing to invest the time–but not many scientists have the time). But even people who hated the way R chokes on large datasets, and its general clunkiness as a language, often couldn’t help running back to R as soon as any kind of serious data manipulation was required. You could always laboriously write code in Python or some other high-level language to pivot, aggregate, reshape, and otherwise pulverize your data, but why would you want to? The beauty of packages like plyr in R was that you could, in a matter of 2 – 3 lines of code, perform enormously powerful operations that could take hours to duplicate in other languages. The downside was the steep learning curve associated with each package’s often quite complicated API (e.g., ggplot2 is incredibly expressive, but every time I stop using ggplot2 for 3 months, I have to completely re-learn it), and having to contend with R’s general awkwardness. But still, on the whole, it was clearly worth it.

Flash forward to The Now. Last week, someone asked me for some simulation code I’d written in R a couple of years ago. As I was firing up RStudio to dig around for it, I realized that I hadn’t actually fired up RStudio for a very long time prior to that moment–probably not in about 6 months. The combination of NumPy/SciPy, MatPlotLib, pandas and statsmodels had effectively replaced R for me, and I hadn’t even noticed. At some point I just stopped dropping out of Python and into R whenever I had to do the “real” data analysis. Instead, I just started importing pandas and statsmodels into my code. The same goes for machine learning (scikit-learn), natural language processing (nltk), document parsing (BeautifulSoup), and many other things I used to do outside Python.

It turns out that the benefits of doing all of your development and analysis in one language are quite substantial. For one thing, when you can do everything in the same language, you don’t have to suffer the constant cognitive switch costs of reminding yourself, say, that Ruby uses blocks instead of comprehensions, or that you need to call len(array) instead of array.length to get the size of an array in Python; you can just keep solving the problem you’re trying to solve with as little cognitive overhead as possible. Also, you no longer need to worry about interfacing between different languages used for different parts of a project. Nothing is more annoying than parsing some text data in Python, finally getting it into the format you want internally, and then realizing you have to write it out to disk in a different format so that you can hand it off to R or MATLAB for some other set of analyses*. In isolation, this kind of thing is not a big deal. It doesn’t take very long to write out a CSV or JSON file from Python and then read it into R. But it does add up. It makes integrated development more complicated, because you end up with more code scattered around your drive in more locations (well, at least if you have my organizational skills). It means you spend a non-negligible portion of your “analysis” time writing trivial little wrappers for all that interface stuff, instead of thinking deeply about how to actually transform and manipulate your data. And it means that your beautiful analytics code is marred by all sorts of ugly open() and read() I/O calls. All of this overhead vanishes as soon as you move to a single language.
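For a sense of what that interface stuff looks like, here is a small sketch of the kind of hand-off wrapper I have in mind, written with the standard library's csv module. The records, field names, file name, and both helper functions are hypothetical, and the "other tool" is simulated by reading the file straight back into Python.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical parsed records, standing in for the output of a text-processing step.
records = [
    {"subject": "s01", "condition": "a", "rt": 512.0},
    {"subject": "s01", "condition": "b", "rt": 478.5},
    {"subject": "s02", "condition": "a", "rt": 601.2},
]


def export_for_other_tool(rows, path):
    """Write rows to a CSV file that R or MATLAB could read."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["subject", "condition", "rt"])
        writer.writeheader()
        writer.writerows(rows)


def import_from_other_tool(path):
    """Read the CSV back; numeric types are lost on disk and restored by hand."""
    with open(path, newline="") as f:
        return [{**row, "rt": float(row["rt"])} for row in csv.DictReader(f)]


with tempfile.TemporaryDirectory() as tmp:
    handoff = Path(tmp) / "handoff.csv"
    export_for_other_tool(records, handoff)
    round_tripped = import_from_other_tool(handoff)

print(round_tripped == records)
```

None of this code does any analysis. It exists only to move data across a language boundary, and the manual float() conversion is a reminder that type information does not survive the trip unless you put it back yourself.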

Convenience aside, another thing that’s impressive about the Python scientific computing ecosystem is that a surprising number of Python-based tools are now best-in-class (or close to it) in terms of scope and ease of use–and, in virtue of C bindings, often even in terms of performance. It’s hard to imagine an easier-to-use machine learning package than scikit-learn, even before you factor in the breadth of implemented algorithms, excellent documentation, and outstanding performance. Similarly, I haven’t missed any of the data manipulation functionality in R since I switched to pandas. Actually, I’ve discovered many new tricks in pandas I didn’t know in R (some of which I’ll describe in an upcoming post). Considering that pandas considerably outperforms R for many common operations, the reasons for me to switch back to R or other tools–even occasionally–have dwindled.

Mind you, I don’t mean to imply that Python can now do everything anyone could ever do in other languages. That’s obviously not true. For instance, there are currently no viable replacements for many of the thousands of statistical packages users have contributed to R (if there’s a good analog for lme4 in Python, I’d love to know about it). In signal processing, I gather that many people are wedded to various MATLAB toolboxes and packages that don’t have good analogs within the Python ecosystem. And for people who need serious performance and work with very, very large datasets, there’s often still no substitute for writing highly optimized code in a low-level compiled language. So, clearly, what I’m saying here won’t apply to everyone. But I suspect it applies to the majority of scientists.

Speaking only for myself, I’ve now arrived at the point where around 90–95% of what I do can be done comfortably in Python. So the major consideration for me, when determining what language to use for a new project, has shifted from “what’s the best tool for the job that I’m willing to learn and/or tolerate using?” to “is there really no way to do this in Python?” By and large, this mentality is a good thing, though I won’t deny that it occasionally has its downsides. For example, back when I did most of my data analysis in R, I would frequently play around with random statistics packages just to see what they did. I don’t do that much any more, because the pain of having to refresh my R knowledge and deal with that thing again usually outweighs the perceived benefits of aimless statistical exploration. Conversely, sometimes I end up using Python packages that I don’t like quite as much as comparable packages in other languages, simply for the sake of preserving language purity. For example, I prefer Rails’ ActiveRecord ORM to the much more explicit SQLAlchemy ORM for Python–but I don’t prefer it enough to justify mixing Ruby and Python objects in the same application. So, clearly, there are costs. But they’re pretty small costs, and for me personally, the scales have now clearly tipped in favor of using Python for almost everything. I know many other researchers who’ve had the same experience, and I don’t think it’s entirely unfair to suggest that, at this point, Python has become the de facto language of scientific computing in many domains. If you’re reading this and haven’t had much prior exposure to Python, now’s a great time to come on board!

Postscript: In the period of time between starting this post and finishing it (two sessions spread about two weeks apart), I discovered not one but two new Python-based packages for data visualization: Michael Waskom’s seaborn package–which provides very high-level wrappers for complex plots, with a beautiful ggplot2-like aesthetic–and Continuum Analytics’ bokeh, which looks like a potential game-changer for web-based visualization**. At the rate the Python ecosystem is moving, there’s a non-zero chance that by the time you read this, I’ll be using some new Python package that directly transliterates my thoughts into analytics code.

 

* I’m aware that there are various interfaces between Python, R, etc. that allow you to internally pass objects between these languages. My experience with these has not been overwhelmingly positive, and in any case they still introduce all the overhead of writing extra lines of code and having to deal with multiple languages.

** Yes, you heard right: web-based visualization in Python. Bokeh generates static JavaScript and JSON for you from Python code, so your users are magically able to interact with your plots on a webpage without you having to write a single line of native JS code.

R, the master troll of statistical languages

Warning: what follows is a somewhat technical discussion of my love-hate relationship with the R statistical language, in which I somehow manage to waste 2,400 words talking about a single line of code. Reader discretion is advised.

I’ve been using R to do most of my statistical analysis for about 7 or 8 years now–ever since I was a newbie grad student and one of the senior grad students in my lab introduced me to it. Despite having spent hundreds (thousands?) of hours in R, I have to confess that I’ve never set aside much time to really learn it very well; what basic competence I’ve developed has been acquired almost entirely by reading the inline help and consulting the Oracle of Bacon (er, Google) when I run into problems. I’m not very good at setting aside time for reading articles or books or working my way through other people’s code (probably the best way to learn), so the net result is that I don’t know R nearly as well as I should.

That said, if I’ve learned one thing about R, it’s that R is all about flexibility: almost any task can be accomplished in a dozen different ways. I don’t mean that in the trivial sense that pretty much any substantive programming problem can be solved in any number of ways in just about any language; I mean that for even very simple and well-defined tasks involving just one or two lines of code there are often many different approaches.

To illustrate, consider the simple task of selecting a column from a data frame (data frames in R are basically just fancy tables). Suppose you have a dataset that looks like this:

In most languages, there would be one standard way of pulling columns out of this table. Just one unambiguous way: if you don’t know it, you won’t be able to work with data at all, so odds are you’re going to learn it pretty quickly. R doesn’t work that way. In R there are many ways to do almost everything, including selecting a column from a data frame (one of the most basic operations imaginable!). Here are four of them:

 

I won’t bother to explain all of these; the point is that, as you can see, they all return the same result (namely, the first column of the ice.cream data frame, named ‘flavor’).

This type of flexibility enables incredibly powerful, terse code once you know R reasonably well; unfortunately, it also makes for an extremely steep learning curve. You might wonder why that would be–after all, at its core, R still lets you do things the way most other languages do them. In the above example, you don’t have to use anything other than the simple index-based approach (i.e., data[,1]), which is the way most other languages that have some kind of data table or matrix object (e.g., MATLAB, Python/NumPy, etc.) would prefer you to do it. So why should the extra flexibility present any problems?

The answer is that when you’re trying to learn a new programming language, you typically do it in large part by reading other people’s code–and nothing is more frustrating to a newbie when learning a language than trying to figure out why sometimes people select columns in a data frame by index and other times they select them by name, or why sometimes people refer to named properties with a dollar sign and other times they wrap them in a vector or double square brackets. There are good reasons to have all of these different idioms, but you wouldn’t know that if you’re new to R and your expectation, quite reasonably, is that if two expressions look very different, they should do very different things. The flexibility that experienced R users love is very confusing to a newcomer. Most other languages don’t have that problem, because there’s only one way to do everything (or at least, far fewer ways than in R).

Thankfully, I’m long past the point where R syntax is perpetually confusing. I’m now well into the phase where it’s only frequently confusing, and I even have high hopes of one day making it to the point where it barely confuses me at all. But I was reminded of the steepness of that initial learning curve the other day while helping my wife use R to do some regression analyses for her thesis. Rather than explaining what she was doing, suffice it to say that she needed to write a function that, among other things, takes a data frame as input and retains only the numeric columns for subsequent analysis. Data frames in R are actually lists under the hood, so they can have mixed types (i.e., you can have string columns and numeric columns and factors all in the same data frame; R lists basically work like hashes or dictionaries in other loosely-typed languages like Python or Ruby). So you can run into problems if you haphazardly try to perform numerical computations on non-numerical columns (e.g., good luck computing the mean of ‘cat’, ‘dog’, and ‘giraffe’), and hence, pre-emptive selection of only the valid numeric columns is required.

Now, in most languages (including R), you can solve this problem very easily using a loop. In fact, in many languages, you would have to use an explicit for-loop; there wouldn’t be any other way to do it. In R, you might do it like this*:

numeric_cols = rep(FALSE, ncol(ice.cream))
for (i in 1:ncol(ice.cream)) numeric_cols[i] = is.numeric(ice.cream[,i])

We allocate memory for the result, then loop over each column and check whether or not it’s numeric, saving the result. Once we’ve done that, we can select only the numeric columns from our data frame with ice.cream[,numeric_cols].

This is a perfectly sensible way to solve the problem, and as you can see, it’s not particularly onerous to write out. But of course, no self-respecting R user would write an explicit loop that way, because R provides you with any number of other tools to do the job more efficiently. So instead of saying “just loop over the columns and check if is.numeric() is true for each one,” when my wife asked me how to solve her problem, I cleverly said “use apply(), of course!”

apply() is an incredibly useful built-in function that implicitly loops over one or more margins of a matrix; in theory, you should be able to do the same work as the above two lines of code with just the following one line:

apply(ice.cream, 2, is.numeric)

Here the first argument is the data we’re passing in, the third argument is the function we want to apply to the data (is.numeric()), and the second argument is the margin over which we want to apply that function (1 = rows, 2 = columns, etc.). And just like that, we’ve cut the length of our code in half!

Unfortunately, when my wife tried to use apply(), her script broke. It didn’t break in any obvious way, mind you (i.e., with a crash and an error message); instead, the apply() call returned a perfectly good vector. It’s just that all of the values in that vector were FALSE. Meaning, R had decided that none of the columns in my wife’s data frame were numeric–which was most certainly incorrect. And because the code wasn’t throwing an error, and the apply() call was embedded within a longer function, it wasn’t obvious to my wife–as an R newbie and a novice programmer–what had gone wrong. From her perspective, the regression analyses she was trying to run with lm() were breaking with strange messages. So she spent a couple of hours trying to debug her code before asking me for help.

Anyway, I took a look at the help documentation, and the source of the problem turned out to be the following: apply() only operates over matrices or vectors, and not on data frames. So when you pass a data frame to apply() as the input, it’s implicitly converted to a matrix. Unfortunately, because matrices can only contain values of one data type, any data frame that has at least one string column will end up being converted to a string (or, in R’s nomenclature, character) matrix. And so now when we apply the is.numeric() function to each column of the matrix, the answer is always going to be FALSE, because all of the columns have been converted to character vectors. So apply() is actually doing exactly what it’s supposed to; it’s just that it doesn’t deign to tell you that it’s implicitly casting your data frame to a matrix before doing anything else. The upshot is that unless you carefully read the apply() documentation and have a basic understanding of data types (which, if you’ve just started dabbling in R, you may well not), you’re hosed.

At this point I could have–and probably should have–thrown in the towel and just suggested to my wife that she use an explicit loop. But that would have dealt a mortal blow to my pride as an experienced-if-not-yet-guru-level R user. So of course I did what any self-respecting programmer does: I went and googled it. And the first thing I came across was the all.is.numeric() function in the Hmisc package which has the following description:

Tests, without issuing warnings, whether all elements of a character vector are legal numeric values.

Perfect! So now the solution to my wife’s problem became this:

library(Hmisc)
apply(ice.cream, 2, all.is.numeric)

…which had the desirable property of actually working. But it still wasn’t very satisfactory, because it requires loading a pretty large library (Hmisc) with a bunch of dependencies just to do something very simple that should really be doable in the base R distribution. So I googled some more. And came across a relevant Stack Exchange answer, which had the following simple solution to my wife’s exact problem:

sapply(ice.cream, is.numeric)

You’ll notice that this is virtually identical to the apply() approach that failed. That’s no coincidence; it turns out that sapply() is just a variant of apply() that works on lists. And since data frames are actually lists, there’s no problem passing in a data frame and iterating over its columns. So just like that, we have an elegant one-line solution to the original problem that doesn’t invoke any loops or third-party packages.
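For readers who think in Python rather than R, the difference between the two calls can be sketched with pandas. This is only a rough analogy, not a description of what R does internally: the ice_cream table and its columns are invented for illustration, and astype(str) is my stand-in for the implicit conversion of a mixed-type data frame to a character matrix.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for the ice.cream data frame (columns are invented).
ice_cream = pd.DataFrame({
    "flavor": ["vanilla", "chocolate", "mint"],
    "rating": [4.5, 3.0, 5.0],
    "calories": [200, 250, 180],
})

# Column-by-column check, roughly what sapply(ice.cream, is.numeric) does:
# each column keeps its own type, so the numeric ones are detected.
per_column = [is_numeric_dtype(ice_cream[name]) for name in ice_cream.columns]

# Mimic the conversion to a single-type (character) matrix:
# every value becomes a string before any check is run.
as_matrix = ice_cream.astype(str).to_numpy()
after_coercion = [
    is_numeric_dtype(as_matrix[:, j]) for j in range(as_matrix.shape[1])
]

print(per_column)      # numeric columns are flagged individually
print(after_coercion)  # nothing looks numeric once everything is a string
```

The point of the sketch is just that the check itself is fine in both cases; what changes the answer is whether the table has been flattened into a single type before the check runs.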

Now, having used apply() a million times, I probably should have known about sapply(). And actually, it turns out I did know about sapply()–in 2009. A Spotlight search reveals that I used it in some code I wrote for my dissertation analyses. But that was 2009, back when I was smart. In 2012, I’m the kind of person who uses apply() a dozen times a day, and is vaguely aware that R has a million related built-in functions like sapply(), tapply(), lapply(), and vapply(), yet still has absolutely no idea what all of those actually do. In other words, in 2012, I’m the kind of experienced R user that you might generously call “not very good at R”, and, less generously, “dumb”.

On the plus side, the end product is undeniably cool, right? There are very few languages in which you could achieve so much functionality so compactly right out of the box. And this isn’t an isolated case; base R includes a zillion high-level functions to do similarly complex things with data in a fraction of the code you’d need to write in most other languages. Once you throw in the thousands of high-quality user-contributed packages, there’s nothing else like it in the world of statistical computing.

Anyway, this inordinately long story does have a point to it, I promise, so let me sum up:

  • If I had just ignored the desire to be efficient and clever, and had told my wife to solve the problem the way she’d solve it in most other languages–with a simple for-loop–it would have taken her a couple of minutes to figure out, and she’d probably never have run into any problems.
  • If I’d known R slightly better, I would have told my wife to use sapply(). This would have taken her 10 seconds and she’d definitely never have run into any problems.
  • BUT: because I knew enough R to be clever but not enough R to avoid being stupid, I created an entirely avoidable problem that consumed a couple of hours of my wife’s time. Of course, now she knows about both apply() and sapply(), so you could argue that in the long run, I’ve probably still saved her time. (I’d say she also learned something about her husband’s stubborn insistence on pretending he knows what he’s doing, but she’s already the world-leading expert on that topic.)

Anyway, this anecdote is basically a microcosm of my entire experience with R. I suspect many other people will relate. Basically what it boils down to is that R gives you a certain amount of rope to work with. If you don’t know what you’re doing at all, you will most likely end up accidentally hanging yourself with that rope. If, on the other hand, you’re a veritable R guru, you will most likely use that rope to tie some really fancy knots, scale tall buildings, fashion yourself a space tuxedo, and, eventually, colonize brave new statistical worlds. For everyone in between novice and guru (e.g., me), using R on a regular basis is a continual exercise in alternately thinking “this is fucking awesome” and banging your head against the wall in frustration at the sheer stupidity (either your own, or that of the people who designed this awful language). But the good news is that the longer you use R, the more of the former and the fewer of the latter experiences you have. And at the end of the day, it’s totally worth it: the language is powerful enough to make you forget all of the weird syntax, strange naming conventions, choking on large datasets, and issues with data type conversions.

Oh, except when your wife is yelling at (er, gently reprimanding) you for wasting several hours of her time on a problem she could have solved herself in 5 minutes if you hadn’t insisted that she do it the idiomatic R way. Then you remember exactly why R is the master troll of statistical languages.

 

 

* R users will probably notice that I use the = operator for assignment instead of the <- operator even though the latter is the officially prescribed way to do it in R (i.e., a <- 2 is favored over a = 2). That’s because these two idioms are interchangeable in all but one (rare) use case, and personally I prefer to avoid extra keystrokes whenever possible. But the fact that you can do even basic assignment in two completely different ways in R drives home the point about how pathologically flexible–and, to a new user, confusing–the language is.

Sixteen is not magic: Comment on Friston (2012)

UPDATE: I’ve posted a very classy email response from Friston here.

In a “comments and controversies” piece published in NeuroImage last week, Karl Friston describes “Ten ironic rules for non-statistical reviewers”. As the title suggests, the piece is presented ironically; Friston frames it as a series of guidelines reviewers can follow in order to ensure successful rejection of any neuroimaging paper. But of course, Friston’s real goal is to convince you that the practices described in the commentary are bad ones, and that reviewers should stop picking on papers for such things as having too little power, not cross-validating results, and not being important enough to warrant publication.

Friston’s piece is, simultaneously, an entertaining satire of some lamentable reviewer practices, and—in my view, at least—a frustratingly misplaced commentary on the relationship between sample size, effect size, and inference in neuroimaging. While it’s easy to laugh at some of the examples Friston gives, many of the positions Friston presents and then skewers aren’t just humorous portrayals of common criticisms; they’re simply bad caricatures of comments that I suspect only a small fraction of reviewers ever make. Moreover, the cures Friston proposes—most notably, the recommendation that sample sizes on the order of 16 to 32 are just fine for neuroimaging studies—are, I’ll argue, much worse than the diseases he diagnoses.

Before taking up the objectionable parts of Friston’s commentary, I’ll just touch on the parts I don’t think are particularly problematic. Of the ten rules Friston discusses, seven seem palatable, if not always helpful:

  • Rule 6 seems reasonable; there does seem to be excessive concern about the violation of assumptions of standard parametric tests. It’s not that this type of thing isn’t worth worrying about at some point, just that there are usually much more egregious things to worry about, and it’s been demonstrated that the most common parametric tests are (relatively) insensitive to violations of normality under realistic conditions.
  • Rule 10 is also on point; given that we know the reliability of peer review is very low, it’s problematic when reviewers make the subjective assertion that a paper just isn’t important enough to be published in such-and-such journal, even as they accept that it’s technically sound. Subjective judgments about importance and innovation should be left to the community to decide. That’s the philosophy espoused by open-access venues like PLoS ONE and Frontiers, and I think it’s a good one.
  • Rules 7 and 9 (criticizing a lack of validation or a failure to run certain procedures) aren’t wrong, but seem to me much too broad to support blanket pronouncements. Surely much of the time when reviewers highlight missing procedures, or complain about a lack of validation, there are perfectly good reasons for doing so. I don’t imagine Friston is really suggesting that reviewers should stop asking authors for more information or for additional controls when they think it’s appropriate, so it’s not clear what the point of including this here is. The example Friston gives in Rule 9 (of requesting retinotopic mapping in an olfactory study), while humorous, is so absurd as to be worthless as an indictment of actual reviewer practices. In fact, I suspect it’s so absurd precisely because anything less extreme Friston could have come up with would have caused readers to think, “but wait, that could actually be a reasonable concern…”
  • Rules 1, 2, and 3 seem reasonable as far as they go; it’s just common sense to avoid overconfidence, arguments from emotion, and tardiness. Still, I’m not sure what’s really accomplished by pointing this out; I doubt there are very many reviewers who will read Friston’s commentary and say “you know what, I’m an overconfident, emotional jerk, and I’m always late with my reviews–I never realized this before.” I suspect the people who fit that description (and for all I know, I may be one of them) will be nodding and chuckling along with everyone else.

This leaves Rules 4, 5, and 8, which, conveniently, all focus on a set of interrelated issues surrounding low power, effect size estimation, and sample size. Because Friston’s treatment of these issues strikes me as dangerously wrong, and liable to send a very bad message to the neuroimaging community, I’ve laid out some of these issues in considerably more detail than you might be interested in. If you just want the direct rebuttal, skip to the “Reprising the rules” section below; otherwise the next two sections sketch Friston’s argument for using small sample sizes in fMRI studies, and then describe some of the things wrong with it.

Friston’s argument

Friston’s argument is based on three central claims:

  1. Classical inference (i.e., the null hypothesis testing framework) suffers from a critical flaw, which is that the null is always false: no effects (at least in psychology) are ever truly zero. Collect enough data and you will always end up rejecting the null hypothesis with probability of 1.
  2. Researchers care more about large effects than about small ones. In particular, there is some size of effect that any given researcher will call ‘trivial’, below which that researcher is uninterested in the effect.
  3. If the null hypothesis is always false, and if some effects are not worth caring about in practical terms, then researchers who collect very large samples will invariably end up identifying many effects that are statistically significant but completely uninteresting.

I think it would be hard to dispute any of these claims. The first one is the source of persistent statistical criticism of the null hypothesis testing framework, and the second one is self-evidently true (if you doubt it, ask yourself whether you would really care to continue your research if you knew with 100% confidence that all of your effects would never be any larger than one one-thousandth of a standard deviation). The third one follows directly from the first two.
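To make the first and third claims concrete, here is a minimal sketch of how the probability of rejecting the null changes with sample size when the true effect is tiny. It is my own illustration, not anything from Friston’s paper, and it makes simplifying assumptions: a two-sided one-sample test at p < .05, a normal approximation with the standard deviation treated as known, and a true effect of one one-thousandth of a standard deviation. The function name power_two_sided_z is made up for the example.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sided_z(d, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test for a
    standardized effect d with n observations (SD treated as known)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n)  # expected value of the test statistic
    # Probability of landing beyond either critical value.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


tiny_effect = 0.001  # one one-thousandth of a standard deviation
for n in (100, 10_000, 1_000_000, 100_000_000):
    print(f"n = {n:>11,}: power = {power_two_sided_z(tiny_effect, n):.3f}")
```

At ordinary sample sizes the rejection rate barely moves off the nominal alpha level; it only approaches 1 once n gets absurdly large. So the premise is true, but how quickly it bites depends entirely on how small the effect is.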

Where Friston’s commentary starts to depart from conventional wisdom is in the implications he thinks these premises have for the sample sizes researchers should use in neuroimaging studies. Specifically, he argues that since large samples will invariably end up identifying trivial effects, whereas small samples will generally only have power to detect large effects, it’s actually in neuroimaging researchers’ best interest not to collect a lot of data. In other words, Friston turns what most commentators have long considered a weakness of fMRI studies—their small sample size—into a virtue.

Here’s how he characterizes an imaginary reviewer’s misguided concern about low power:

Reviewer: Unfortunately, this paper cannot be accepted due to the small number of subjects. The significant results reported by the authors are unsafe because the small sample size renders their design insufficiently powered. It may be appropriate to reconsider this work if the authors recruit more subjects.

Friston suggests that the appropriate response from a clever author would be something like the following:

Response: We would like to thank the reviewer for his or her comments on sample size; however, his or her conclusions are statistically misplaced. This is because a significant result (properly controlled for false positives), based on a small sample indicates the treatment effect is actually larger than the equivalent result with a large sample. In short, not only is our result statistically valid. It is quantitatively more significant than the same result with a larger number of subjects.

This is supported by an extensive appendix (written non-ironically), where Friston presents a series of nice sensitivity and classification analyses intended to give the reader an intuitive sense of what different standardized effect sizes mean, and what the implications are for the detection of statistically significant effects using a classical inference (i.e., hypothesis testing) approach. The centerpiece of the appendix is a loss-function analysis where Friston pits the benefit of successfully detecting a large effect (which he defines as a Cohen’s d of 1, i.e., an effect of one standard deviation) against the cost of rejecting the null when the effect is actually trivial (defined as a d of 0.125 or less). Friston notes that the loss function is minimized (i.e., the difference between the hit rate for large effects and the miss rate for trivial effects is maximized) when n = 16, which is where the number he repeatedly quotes as a reasonable sample size for fMRI studies comes from. (Actually, as I discuss in my Appendix I below, I think Friston’s power calculations are off, and the right number, even given his assumptions, is more like 22. But the point is, it’s a small number either way.)

It’s important to note that Friston is not shy about asserting his conclusion that small samples are just fine for neuroimaging studies—especially in the Appendices, which are not intended to be ironic. He makes claims like the following:

The first appendix presents an analysis of effect size in classical inference that suggests the optimum sample size for a study is between 16 and 32 subjects. Crucially, this analysis suggests significant results from small samples should be taken more seriously than the equivalent results in oversized studies.

And:

In short, if we wanted to optimise the sensitivity to large effects but not expose ourselves to trivial effects, sixteen subjects would be the optimum number.

And:

In short, if you cannot demonstrate a significant effect with sixteen subjects, it is probably not worth demonstrating.

These are very strong claims delivered with minimal qualification, and given Friston’s influence, could potentially lead many reviewers to discount their own prior concerns about small sample size and low power—which would be disastrous for the field. So I think it’s important to explain exactly why Friston is wrong and why his recommendations regarding sample size shouldn’t be taken seriously.

What’s wrong with the argument

Broadly speaking, there are three problems with Friston’s argument. The first one is that Friston presents the absolute best-case scenario as if it were typical. Specifically, the recommendation that a sample of 16–32 subjects is generally adequate for fMRI studies assumes that fMRI researchers are conducting single-sample t-tests at an uncorrected threshold of p < .05; that they only care about effects on the order of 1 sd in size; and that any effect smaller than d = .125 is trivially small and is to be avoided. If all of this were true, an n of 16 (or rather, 22; see Appendix I below) might be reasonable. But it doesn’t really matter, because if you make even slightly less optimistic assumptions, you end up in a very different place. For example, for a two-sample t-test at p < .001 (a very common scenario in group difference studies), the optimal sample size, according to Friston’s own loss-function analysis, turns out to be 87 per group, or 174 subjects in total.

I discuss the problems with the loss-function analysis in much more detail in Appendix I below; the main point here is that even if you take Friston’s argument at face value, his own numbers put the lie to the notion that a sample size of 16–32 is sufficient for the majority of cases. It flatly isn’t. There’s nothing magic about 16, and it’s very bad advice to suggest that authors should routinely shoot for sample sizes this small when conducting their studies given that Friston’s own analysis would seem to demand a much larger sample size the vast majority of the time.

What about uncertainty?

The second problem is that Friston’s argument entirely ignores the role of uncertainty in drawing inferences about effect sizes. The notion that an effect that comes from a small study is likely to be bigger than one that comes from a larger study may be strictly true in the sense that, for any fixed p value, the observed effect size necessarily varies inversely with sample size. It’s true, but it’s also not very helpful. The reason it’s not helpful is that while the point estimate of statistically significant effects obtained from a small study will tend to be larger, the uncertainty around that estimate is also greater—and with sample sizes in the neighborhood of 16 – 20, will typically be so large as to be nearly worthless. For example, a correlation of r = .75 sounds huge, right? But when that correlation is detected at a threshold of p < .001 in a sample of 16 subjects, the corresponding 99.9% confidence interval is .06 – .95—a range so wide as to be almost completely uninformative.
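For anyone who wants to check an interval like this themselves, here is a minimal sketch using the standard Fisher z-transformation. That approximation is my assumption about how to compute the interval, and the helper name correlation_ci is made up for the example.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def correlation_ci(r, n, level):
    """Approximate confidence interval for a Pearson correlation,
    computed via the Fisher z-transformation."""
    z = atanh(r)                 # transform r to the z scale
    se = 1 / sqrt(n - 3)         # standard error on the z scale
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # Build the interval on the z scale, then transform back to r.
    return tanh(z - z_crit * se), tanh(z + z_crit * se)


low, high = correlation_ci(r=0.75, n=16, level=0.999)
print(f"99.9% CI: {low:.2f} to {high:.2f}")
```

Under this approximation the interval runs from just above zero to roughly .95, in line with the range quoted above.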

Fortunately, what Friston argues small samples can do for us indirectly—namely, establish that effect sizes are big enough to care about—can be done much more directly, simply by looking at the uncertainty associated with our estimates. That’s exactly what confidence intervals are for. If our goal is to ensure that we only end up talking about results big enough to care about, it’s surely better to answer the question “how big is the effect?” by saying, “d = 1.1, with a 95% confidence interval of 0.2 – 2.1” than by saying “well it’s statistically significant at p < .001 in a sample of 16 subjects, so it’s probably pretty big”. In fact, if you take the latter approach, you’ll be wrong quite often, for the simple reason that p values will generally be closer to the statistical threshold with small samples than with big ones. Remember that, by definition, the point at which one is allowed to reject the null hypothesis is also the point at which the relevant confidence interval borders on zero. So it doesn’t really matter whether your sample is small or large; if you only just barely managed to reject the null hypothesis, you cannot possibly be in a good position to conclude that the effect is likely to be a big one.

As far as I can tell, Friston completely ignores the role of uncertainty in his commentary. For example, he gives the following example, which is supposed to convince you that you don’t really need large samples:

Imagine we compared the intelligence quotient (IQ) between the pupils of two schools. When comparing two groups of 800 pupils, we found mean IQs of 107.1 and 108.2, with a difference of 1.1. Given that the standard deviation of IQ is 15, this would be a trivial effect size … In short, although the differential IQ may be extremely significant, it is scientifically uninteresting … Now imagine that your research assistant had the bright idea of comparing the IQ of students who had and had not recently changed schools. On selecting 16 students who had changed schools within the past five years and 16 matched pupils who had not, she found an IQ difference of 11.6, where this medium effect size just reached significance. This example highlights the difference between an uninformed overpowered hypothesis test that gives very significant, but uninformative results and a more mechanistically grounded hypothesis that can only be significant with a meaningful effect size.

But the example highlights no such thing. One is not entitled to conclude, in the latter case, that the true effect must be medium-sized just because it came from a small sample. If the effect only just reached significance, the confidence interval by definition just barely excludes zero, and we can’t say anything meaningful about the size of the effect, but only about its sign (i.e., that it was in the expected direction)—which is (in most cases) not nearly as useful.
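To make this concrete, here is a rough sketch of the interval implied by the small-sample half of Friston's example. It assumes (my assumption, not Friston's) that the population sd of 15 can be treated as known in both groups, so a normal-theory interval applies; `mean_difference_ci` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def mean_difference_ci(diff, sd, n_per_group, level=0.95):
    """CI for a difference between two group means, treating sd as known."""
    se = sd * sqrt(2.0 / n_per_group)                 # SE of the difference
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)  # two-sided critical value
    return diff - crit * se, diff + crit * se


SD_IQ = 15.0  # assumed known population sd of IQ
lo, hi = mean_difference_ci(diff=11.6, sd=SD_IQ, n_per_group=16)
print(f"IQ points: {lo:.1f} to {hi:.1f}")
print(f"In sd units (d): {lo / SD_IQ:.2f} to {hi / SD_IQ:.2f}")
```

Under these assumptions the interval runs from about 1 IQ point to about 22, which covers everything from a trivial effect to a very large one.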

In fact, we will generally be in a much worse position with a small sample than a large one, because with a large sample we at least stand a chance of being able to distinguish small effects from large ones. Recall that Friston advises against collecting very large samples for the very reason that they are likely to produce a wealth of statistically-significant-but-trivially-small effects. Well, maybe so, but so what? Why would it be a bad thing to detect trivial effects so long as we were also in an excellent position to know that those effects were trivial? Nothing about the hypothesis-testing framework commits us to treating all of our statistically significant results like they’re equally important. If we have a very large sample, and some of our effects have confidence intervals from 0.02 to 0.15 while others have CIs from 0.42 to 0.52, we would be wise to focus most of our attention on the latter rather than the former. At the very least this seems like a more reasonable approach than deliberately collecting samples so small that they will rarely be able to tell us anything meaningful about the size of our effects.

What about the prior?

The third, and arguably biggest, problem with Friston’s argument is that it completely ignores the prior, i.e., the expected distribution of effect sizes across the brain. Friston’s commentary assumes a uniform prior everywhere; for the analysis to go through, one has to believe that trivial effects and very large effects are equally likely to occur. But this is patently absurd; while that might be true in select situations, by and large, we should expect small effects to be much more common than large ones. In a previous commentary (on the Vul et al “voodoo correlations” paper), I discussed several reasons for this; rather than go into detail here, I’ll just summarize them:

  • It’s frankly just not plausible to suppose that effects are really as big as they would have to be in order to support adequately powered analyses with small samples. For example, a correlational analysis with 20 subjects at p < .001 would require a population effect size of r = .77 to have 80% power. If you think it’s plausible that focal activation in a single brain region can explain 60% of the variance in a complex trait like fluid intelligence or extraversion, I have some property under a bridge I’d like you to come by and look at.
  • The low-hanging fruit get picked off first. Back when fMRI was in its infancy in the mid-1990s, people could indeed publish findings based on samples of 4 or 5 subjects. I’m not knocking those studies; they taught us a huge amount about brain function. In fact, it’s precisely because they taught us so much about the brain that researchers can no longer stick 5 people in a scanner and report that doing a working memory task robustly activates the frontal cortex. Nowadays, identifying an interesting effect is more difficult—and if that effect were really enormous, odds are someone would have found it years ago. But this shouldn’t surprise us; neuroimaging is now a relatively mature discipline, and effects on the order of 1 sd or more are extremely rare in most mature fields (for a nice review, see Meyer et al (2001)).
  • fMRI studies with very large samples invariably seem to report much smaller effects than fMRI studies with small samples. This can only mean one of two things: (a) large studies are done much more poorly than small studies (implausible—if anything, the opposite should be true); or (b) the true effects are actually quite small in both small and large fMRI studies, but they’re inflated by selection bias in small studies, whereas large studies give an accurate estimate of their magnitude (very plausible).
  • Individual differences or between-group analyses, which have much less power than within-subject analyses, tend to report much more sparing activations. Again, this is consistent with the true population effects being on the small side.
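The power figure in the first bullet can be sanity-checked with a back-of-the-envelope calculation. This sketch uses the Fisher z normal approximation, which is my assumption here rather than the method behind the number in the text, so expect it to land close to, but not exactly on, r = .77. The helper name `required_correlation` is hypothetical.

```python
from math import sqrt, tanh
from statistics import NormalDist


def required_correlation(n, alpha, power):
    """Smallest population r detectable with the given power (Fisher z approx.)."""
    nd = NormalDist()
    # Required effect on the Fisher z scale for a two-sided test
    z_needed = (nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)) / sqrt(n - 3)
    return tanh(z_needed)  # convert back to a correlation


r_needed = required_correlation(n=20, alpha=0.001, power=0.80)
print(f"required r ~ {r_needed:.2f}, variance explained ~ {r_needed ** 2:.0%}")
```

The approximation gives a required correlation in the mid-.70s, i.e., a single region explaining well over half of the variance in the trait.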

To be clear, I’m not saying there are never any large effects in fMRI studies. Under the right circumstances, there certainly will be. What I’m saying is that, in the absence of very good reasons to suppose that a particular experimental manipulation is going to produce a large effect, our default assumption should be that the vast majority of (interesting) experimental contrasts are going to produce diffuse and relatively weak effects.

Note that Friston’s assertion that “if one finds a significant effect with a small sample size, it is likely to have been caused by a large effect size” depends entirely on the prior effect size distribution. If the brain maps we look at are actually dominated by truly small effects, then it’s simply not true that a statistically significant effect obtained from a small sample is likely to have been caused by a large effect size. We can see this easily by thinking of a situation in which an experiment has a weak but very diffuse effect on brain activity. Suppose that the entire brain showed ‘trivial’ effects of d = 0.125 in the population, and that there were actually no large effects at all. A one-sample t-test at p < .001 has less than 1% power to detect this effect, so you might suppose, as Friston does, that we could discount the possibility that a significant effect would have come from a trivial effect size. And yet, because a whole-brain analysis typically involves tens of thousands of tests, there’s a very good chance such an analysis will end up identifying statistically significant effects somewhere in the brain. Unfortunately, because the only way to identify a trivial effect with a small sample is to capitalize on chance (Friston discusses this point in his Appendix II, and additional treatments can be found in Ioannidis (2008), or in my 2009 commentary), that tiny effect won’t look tiny when we examine it; it will in all likelihood look enormous.
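Here is a small simulation sketch of that scenario. The specifics are my own assumptions for illustration: 20 subjects, 20,000 independent "voxels", normally distributed data with unit variance, and a hard-coded critical t of 3.883 (the standard table value for two-sided p < .001 with 19 degrees of freedom). Real voxels are spatially correlated, so treat this as a cartoon rather than a model of fMRI data.

```python
import random
from math import sqrt

random.seed(0)

N_SUBJECTS = 20
N_VOXELS = 20_000   # assumed number of independent tests
TRUE_D = 0.125      # the same 'trivial' effect at every voxel
T_CRIT = 3.883      # two-sided p < .001 critical t for df = 19 (table value)

significant_d = []
for _ in range(N_VOXELS):
    sample = [random.gauss(TRUE_D, 1.0) for _ in range(N_SUBJECTS)]
    mean = sum(sample) / N_SUBJECTS
    var = sum((x - mean) ** 2 for x in sample) / (N_SUBJECTS - 1)
    sd = sqrt(var)
    t = mean / (sd / sqrt(N_SUBJECTS))      # one-sample t statistic
    if abs(t) > T_CRIT:
        significant_d.append(mean / sd)     # observed Cohen's d at this voxel

print(f"{len(significant_d)} of {N_VOXELS} voxels significant")
if significant_d:
    smallest = min(abs(d) for d in significant_d)
    print(f"smallest significant |d|: {smallest:.2f} (true d = {TRUE_D})")
```

Because the observed d equals t divided by the square root of n, every voxel that survives the threshold necessarily shows an observed |d| above about 0.87, roughly seven times the true effect.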

Since they say a picture is worth a thousand words, here’s one (from an unpublished paper in progress):

The top panel shows you a hypothetical distribution of effects (Pearson’s r) in a 2-dimensional ‘brain’ in the population. Note that there aren’t any astronomically strong effects (though the white circles indicate correlations of .5 or greater, which are certainly very large). The bottom panel shows what happens when you draw random samples of various sizes from the population and use different correction thresholds/approaches. You can see that the conclusion you’d draw if you followed Friston’s advice (i.e., that any effect you observe with n = 20 must be pretty robust to survive correction) is wrong; the isolated region that survives correction at FDR = .05, while ‘real’ in a trivial sense, is not in fact very strong in the true map; it just happens to be grossly inflated by sampling error. This is to be expected; when power is very low but the number of tests you’re performing is very large, the odds are good that you’ll end up identifying some real effect somewhere in the brain, and the estimated effect size within that region will be grossly distorted because of the selection process.

Encouraging people to use small samples is a sure way to ensure that researchers continue to publish highly biased findings that lead other researchers down garden paths trying unsuccessfully to replicate ‘huge’ effects. It may make for an interesting, more publishable story (who wouldn’t rather talk about the single cluster that supports human intelligence than about the complex, highly distributed pattern of relatively weak effects?), but it’s bad science. It’s exactly the same problem geneticists confronted ten or fifteen years ago when the first candidate gene and genome-wide association studies (GWAS) seemed to reveal remarkably strong effects of single genetic variants that subsequently failed to replicate. And it’s the same reason geneticists now run association studies with 10,000+ subjects and not 300.

Unfortunately, the costs of fMRI scanning haven’t come down the same way the costs of genotyping have, so there’s tremendous resistance at present to the idea that we really do need to routinely acquire much larger samples if we want to get a clear picture of how big effects really are. Be that as it may, we shouldn’t indulge in wishful thinking just because of logistical constraints. The fact that it’s difficult to get good estimates doesn’t mean we should pretend our bad estimates are actually good ones.

What’s right with the argument

Having criticized much of Friston’s commentary, I should note that there’s one part I like a lot, and that’s the section on protected inference in Appendix I. The point Friston makes here is that you can still use a standard hypothesis testing approach fruitfully—i.e., without falling prey to the problem of classical inference—so long as you explicitly protect against the possibility of identifying trivial effects. Friston’s treatment is mathematical, but all he’s really saying here is that it makes sense to use non-zero ranges instead of true null hypotheses. I’ve advocated the same approach before (e.g., here), as I’m sure many other people have. The point is simple: if you think an effect of, say, 1/8th of a standard deviation is too small to care about, then you should define a ‘pseudonull’ hypothesis of d = -.125 to .125 instead of a null of exactly zero.

Once you do that, any time you reject the null, you’re now entitled to conclude with reasonable certainty that your effects are in fact non-trivial in size. So I completely agree with Friston when he observes in the conclusion to the Appendix I that:

…the adage ‘you can never have enough data’ is also true, provided one takes care to protect against inference on trivial effect sizes – for example using protected inference as described above.

Of course, the reason I agree with it is precisely because it directly contradicts Friston’s dominant recommendation to use small samples. In fact, since rejecting non-zero values is more difficult than rejecting a null of zero, when you actually perform power calculations based on protected inference, it becomes immediately apparent just how inadequate samples on the order of 16 – 32 subjects will be most of the time (e.g., rejecting a null of zero when detecting an effect of d = 0.5 with 80% power using a one-sample t-test at p < .05 requires 33 subjects, but if you want to reject a ‘trivial’ effect size of |d| ≤ .125, that n is now upwards of 50).
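A quick way to see where numbers like these come from is the textbook normal approximation for sample size. This is only a sketch: it assumes protected inference can be treated as shifting the null boundary from 0 to 0.125, and it uses normal rather than t quantiles, so it slightly understates the exact t-based figures. `approx_n_one_sample` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist


def approx_n_one_sample(d_true, d_null, alpha=0.05, power=0.80):
    """Approximate n for a two-sided one-sample test of d_true against d_null."""
    nd = NormalDist()
    z_sum = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)
    # Only the distance between the true effect and the null boundary matters
    return ceil((z_sum / (d_true - d_null)) ** 2)


n_zero_null = approx_n_one_sample(d_true=0.5, d_null=0.0)
n_protected = approx_n_one_sample(d_true=0.5, d_null=0.125)
print(f"null of zero: n ~ {n_zero_null}; null of |d| <= .125: n ~ {n_protected}")
```

Even with the approximation's optimism, moving the null boundary from zero to .125 pushes the required sample from the low 30s to the mid 50s.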

Reprising the rules

With the above considerations in mind, we can now turn back to Friston’s rules 4, 5, and 8, and see why his admonitions to reviewers are uncharitable at best and insensible at worst. First, Rule 4 (the under-sampled study). Here’s the kind of comment Friston (ironically) argues reviewers should avoid:

 Reviewer: Unfortunately, this paper cannot be accepted due to the small number of subjects. The significant results reported by the authors are unsafe because the small sample size renders their design insufficiently powered. It may be appropriate to reconsider this work if the authors recruit more subjects.

Perhaps many reviewers make exactly this argument; I haven’t been an editor, so I don’t know (though I can say that I’ve read many reviews of papers I’ve co-reviewed and have never actually seen this particular variant). But even if we give Friston the benefit of the doubt and accept that one shouldn’t question the validity of a finding on the basis of small samples (i.e., we accept that p values mean the same thing in large and small samples), that doesn’t mean the more general critique from low power is itself a bad one. To the contrary, a much better form of the same criticism–and one that I’ve raised frequently myself in my own reviews–is the following:

 Reviewer: the authors draw some very strong conclusions in their Discussion about the implications of their main finding. But their finding issues from a sample of only 16 subjects, and the confidence interval around the effect is consequently very large, and nearly includes zero. In other words, the authors’ findings are entirely consistent with the effect they report actually being very small–quite possibly too small to care about. The authors should either weaken their assertions considerably, or provide additional evidence for the importance of the effect.

Or another closely related one, which I’ve also raised frequently:

 Reviewer: the authors tout their results as evidence that region R is ‘selectively’ activated by task T. However, this claim is based entirely on the fact that region R was the only part of the brain to survive correction for multiple comparisons. Given that the sample size in question is very small, and power to detect all but the very largest effects is consequently very low, the authors are in no position to conclude that the absence of significant effects elsewhere in the brain suggests selectivity in region R. With this small a sample, the authors’ data are entirely consistent with the possibility that many other brain regions are just as strongly activated by task T, but failed to attain significance due to sampling error. The authors should either avoid making any claim that the activity they observed is selective, or provide direct statistical support for their assertion of selectivity.

Neither of these criticisms can be defused by suggesting that effect sizes from smaller samples are likely to be larger than effect sizes from large studies. And it would be disastrous for the field of neuroimaging if Friston’s commentary succeeded in convincing reviewers to stop criticizing studies on the basis of low power. If anything, we collectively need to focus far greater attention on issues surrounding statistical power.

Next, Rule 5 (the over-sampled study):

Reviewer: I would like to commend the authors for studying such a large number of subjects; however, I suspect they have not heard of the fallacy of classical inference. Put simply, when a study is overpowered (with too many subjects), even the smallest treatment effect will appear significant. In this case, although I am sure the population effects reported by the authors are significant; they are probably trivial in quantitative terms. It would have been much more compelling had the authors been able to show a significant effect without resorting to large sample sizes. However, this was not the case and I cannot recommend publication.

I’ve already addressed this above; the problem with this line of reasoning is that nothing says you have to care equally about every statistically significant effect you detect. If you ever run into a reviewer who insists that your sample is overpowered and has consequently produced too many statistically significant effects, you can simply respond like this:

 Response: we appreciate the reviewer’s concern that our sample is potentially overpowered. However, this strikes us as a limitation of classical inference rather than a problem with our study. To the contrary, the benefit of having a large sample is that we are able to focus on effect sizes rather than on rejecting a null hypothesis that we would argue is meaningless to begin with. To this end, we now display a second, more conservative, brain activation map alongside our original one that raises the statistical threshold to the point where the confidence intervals around all surviving voxels exclude effects smaller than d = .125. The reviewer can now rest assured that our results protect against trivial effects. We would also note that this stronger inference would not have been possible if our study had had a much smaller sample.

There is rarely if ever a good reason to criticize authors for having a large sample after it’s already collected. You can always raise the statistical threshold to protect against trivial effects if you need to; what you can’t easily do is magic more data into existence in order to shrink your confidence intervals.

Lastly, Rule 8 (exploiting ‘superstitious’ thinking about effect sizes):

 Reviewer: It appears that the authors are unaware of the dangers of voodoo correlations and double dipping. For example, they report effect sizes based upon data (regions of interest) previously identified as significant in their whole brain analysis. This is not valid and represents a pernicious form of double dipping (biased sampling or non-independence problem). I would urge the authors to read Vul et al. (2009) and Kriegeskorte et al. (2009) and present unbiased estimates of their effect size using independent data or some form of cross validation.

Friston’s recommended response is to point out that concerns about double-dipping are misplaced, because the authors are typically not making any claims that the reported effect size is an accurate representation of the population value, but only following standard best-practice guidelines to include effect size measures alongside p values. This would be a fair recommendation if it were true that reviewers frequently object to the mere act of reporting effect sizes based on the specter of double-dipping; but I simply don’t think this is an accurate characterization. In my experience, the impetus for bringing up double-dipping is almost always one of two things: (a) authors getting overly excited about the magnitude of the effects they have obtained, or (b) authors conducting non-independent tests and treating them as though they were independent (e.g., when identifying an ROI based on a comparison of conditions A and B, and then reporting a comparison of A and C without considering the bias inherent in this second test). Both of these concerns are valid and important, and it’s a very good thing that reviewers bring them up.

The right way to determine sample size

If we can’t rely on blanket recommendations to guide our choice of sample size, then what? Simple: perform a power calculation. There’s no mystery to this; both brief and extended treatises on statistical power are all over the place, and power calculators for most standard statistical tests are available online as well as in most off-line statistical packages (e.g., I use the pwr package for R). For more complicated statistical tests for which analytical solutions aren’t readily available (e.g., fancy interactions involving multiple within- and between-subject variables), you can get reasonably good power estimates through simulation.
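For the simulation route, here is a minimal sketch using only the Python standard library. It estimates power for a one-sample t-test by simulating the test statistic under the null (to get a Monte Carlo critical value) and then under the assumed effect. The helper names are hypothetical, and the normal data-generating model is an assumption; for a complicated design you would swap in your own data generator and test statistic.

```python
import random
from math import sqrt

random.seed(1)


def t_statistic(sample):
    """One-sample t statistic against a null mean of zero."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return mean / sqrt(var / n)


def simulate_abs_t(n, d, reps):
    """|t| values from `reps` simulated studies with true effect d (sd = 1)."""
    return [abs(t_statistic([random.gauss(d, 1.0) for _ in range(n)]))
            for _ in range(reps)]


def simulated_power(n, d, alpha, reps=10_000):
    null_t = sorted(simulate_abs_t(n, 0.0, reps))
    crit = null_t[int((1 - alpha) * reps)]   # Monte Carlo critical value
    alt_t = simulate_abs_t(n, d, reps)
    return sum(t > crit for t in alt_t) / reps


power = simulated_power(n=16, d=0.5, alpha=0.05)
print(f"estimated power for n = 16, d = 0.5, p < .05: {power:.2f}")
```

With these settings the estimate should land somewhere in the mid-40% range, give or take simulation noise.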

Of course, there’s no guarantee you’ll like the answers you get. Actually, in most cases, if you’re honest about the numbers you plug in, you probably won’t like the answer you get. But that’s life; nature doesn’t care about making things convenient for us. If it turns out that it takes 80 subjects to have adequate power to detect the effects we care about and expect, we can (a) suck it up and go for n = 80, (b) decide not to run the study, or (c) accept that logistical constraints mean our study will have less power than we’d like (which implies that any results we obtain will offer only a fractional view of what’s really going on). What we don’t get to do is look the other way and pretend that it’s just fine to go with 16 subjects simply because the last time we did that, we got this amazingly strong, highly selective activation that successfully made it into a good journal. That’s the same logic that repeatedly produced unreplicable candidate gene findings in the 1990s, and, if it continues to go unchecked in fMRI research, risks turning the field into a laughing stock among other scientific disciplines.

Conclusion

The point of all this is not to convince you that it’s impossible to do good fMRI research with just 16 subjects, or that reviewers don’t sometimes say silly things. There are many questions that can be answered with 16 or even fewer subjects, and reviewers most certainly do say silly things (I sometimes cringe when re-reading my own older reviews). The point is that blanket pronouncements, particularly when made ironically and with minimal qualification, are not helpful in advancing the field, and can be very damaging. It simply isn’t true that there’s some magic sample size range like 16 to 32 that researchers can bank on reflexively. If there’s any generalization that we can allow ourselves, it’s probably that, under reasonable assumptions, Friston’s recommended sample sizes are much too small. Typical effect sizes and analysis procedures will generally require much larger samples than neuroimaging researchers are used to collecting. But again, there’s no substitute for careful case-by-case consideration.

In the natural course of things, there will be cases where n = 4 is enough to detect an effect, and others where the effort is questionable even with 100 subjects; unfortunately, we won’t know which situation we’re in unless we take the time to think carefully and dispassionately about what we’re doing. It would be nice to believe otherwise; certainly, it would make life easier for the neuroimaging community in the short term. But since the point of doing science is to discover what’s true about the world, and not to publish an endless series of findings that sound exciting but don’t replicate, I think we have an obligation to both ourselves and to the taxpayers that fund our research to take the exercise more seriously.

 

 

Appendix I: Evaluating Friston’s loss-function analysis

In this appendix I review a number of weaknesses in Friston’s loss-function analysis, and show that under realistic assumptions, the recommendation to use sample sizes of 16 – 32 subjects is far too optimistic.

First, the numbers don’t seem to be right. I say this with a good deal of hesitation, because I have very poor mathematical skills, and I’m sure Friston is much smarter than I am. That said, I’ve tried several different power packages in R and finally resorted to empirically estimating power with simulated draws, and all approaches converge on numbers quite different from Friston’s. Even the sensitivity plots seem off by a good deal (for instance, Friston’s Figure 3 suggests around 30% sensitivity with n = 80 and d = 0.125, whereas all the sources I’ve consulted produce a value around 20%). In my analysis, the loss function is minimized at n = 22 rather than n = 16. I suspect the problem is with Friston’s approximation, but I’m open to the possibility that I’ve done something very wrong, and confirmations or disconfirmations are welcome in the comments below. In what follows, I’ll report the numbers I get rather than Friston’s (mine are somewhat more pessimistic, but the overarching point doesn’t change either way).

Second, there’s the statistical threshold. Friston’s analysis assumes that all of our tests are conducted without correction for multiple comparisons (i.e., at p < .05), but this clearly doesn’t apply to the vast majority of neuroimaging studies, which are either conducting massive univariate (whole-brain) analyses, or testing at least a few different ROIs or networks. As soon as you make the threshold more stringent, the optimal sample size returned by the loss-function analysis increases dramatically. If the threshold is a still-relatively-liberal (for whole-brain analysis) p < .001, the loss function is now minimized at 48 subjects–hardly a welcome conclusion, and a far cry from 16 subjects. Since this is probably still the modal fMRI threshold, one could argue Friston should have been trumpeting a sample size of 48 all along, which is not exactly a ‘small’ sample size given the associated costs.

Third, the n = 16 (or 22) figure only holds for the simplest of within-subject tests (e.g., a one-sample t-test)–again, a best-case scenario (though certainly a common one). It doesn’t apply to many other kinds of tests that are the primary focus of a huge proportion of neuroimaging studies–for instance, two-sample t-tests, or interactions between multiple within-subject factors. In fact, if you apply the same analysis to a two-sample t-test (or equivalently, a correlation test), the optimal sample size turns out to be 82 (41 per group) at a threshold of p < .05, and a whopping 174 (87 per group) at a threshold of p < .001. In other words, if we were to follow Friston’s own guidelines, the typical fMRI researcher who aims to conduct a (liberal) whole-brain individual differences analysis should be collecting 174 subjects a pop. For other kinds of tests (e.g., 3-way interactions), even larger samples might be required.

Fourth, the claim that only large effects–i.e., those that can be readily detected with a sample size of 16–are worth worrying about is likely to annoy and perhaps offend any number of researchers who have perfectly good reasons for caring about effects much smaller than half a standard deviation. A cursory look at most literatures suggests that effects of 1 sd are not the norm; they’re actually highly unusual in mature fields. For perspective, the standardized difference in height between genders is about 1.5 sd; the validity of job interviews for predicting success is about .4 sd; and the effect of gender on risk-taking (men take more risks) is about .2 sd—what Friston would call a very small effect (for other examples, see Meyer et al., 2001). Against this backdrop, suggesting that only effects greater than 1 sd (about the strength of the association between height and weight in adults) are of interest would seem to preclude many, and perhaps most, questions that researchers currently use fMRI to address. Imaging genetics studies are immediately out of the picture; so too, in all likelihood, are cognitive training studies, most investigations of individual differences, and pretty much any experimental contrast that claims to very carefully isolate a relatively subtle cognitive difference. Put simply, if the field were to take Friston’s analysis seriously, the majority of its practitioners would have to pack up their bags and go home. Entire domains of inquiry would shutter overnight.

To be fair, Friston briefly considers the possibility that small effect sizes could be important. But he doesn’t seem to take it very seriously:

Can true but trivial effect sizes can ever be interesting? It could be that a very small effect size may have important implications for understanding the mechanisms behind a treatment effect – and that one should maximise sensitivity by using large numbers of subjects. The argument against this is that reporting a significant but trivial effect size is equivalent to saying that one can be fairly confident the treatment effect exists but its contribution to the outcome measure is trivial in relation to other unknown effects…

The problem with the latter argument is that the real world is a complicated place, and most interesting phenomena have many causes. A priori, it is reasonable to expect that the vast majority of effects will be small. We probably shouldn’t expect any single genetic variant to account for more than a small fraction of the variation in brain activity, but that doesn’t mean we should give up entirely on imaging genetics. And of course, it’s worth remembering that, in the context of fMRI studies, when Friston talks about ‘very small effect sizes,’ that’s a bit misleading; even medium-sized effects that Friston presumably allows are interesting could be almost impossible to detect at the sample sizes he recommends. For example, a one-sample t-test with n = 16 subjects detects an effect of d = 0.5 only 46% or 5% of the time at p < .05 and p < .001, respectively. Applying Friston’s own loss function analysis to detection of d = 0.5 returns an optimal sample size of n = 63 at p < .05 and n = 139 at p < .001, a message not entirely consistent with the recommendations elsewhere in his commentary.

Friston, K. (2012). Ten ironic rules for non-statistical reviewers. NeuroImage. DOI: 10.1016/j.neuroimage.2012.04.018

large-scale data exploration, MIC-style

UPDATE 2/8/2012: Simon & Tibshirani posted a critical commentary on this paper here. See additional thoughts here.

Real-world data are messy. Relationships between two variables can take on an infinite number of forms, and while one doesn’t see, say, umbrella-shaped data very often, strange things can happen. When scientists talk about correlations or associations between variables, they’re usually referring to one very specific form of relationship–namely, a linear one. The assumption is that most associations between pairs of variables are reasonably well captured by positing that one variable increases in proportion to the other, with some added noise. In reality, of course, many associations aren’t linear, or even approximately so. For instance, many associations are cyclical (e.g., hours at work versus day of week), or curvilinear (e.g., heart attacks become precipitously more frequent past middle age), and so on.

Detecting a non-linear association is potentially just as easy as detecting a linear relationship if we know the form of that association up front. But there, of course, lies the rub: we generally don’t have strong intuitions about how most variables are likely to be non-linearly related. A more typical situation in many ‘big data’ scientific disciplines is that we have a giant dataset full of thousands or millions of observations and hundreds or thousands of variables, and we want to determine which of the many associations between different variables are potentially important–without knowing anything about their potential shape. The problem, then, is that traditional measures of association don’t work very well; they’re only likely to detect associations to the extent that those associations approximate a linear fit.

A new paper in Science by David Reshef and colleagues (and as a friend pointed out, it’s a feat in and of itself just to get a statistics paper into Science) directly targets this data mining problem by introducing an elegant new measure of association called the Maximal Information Coefficient (MIC; see also the authors’ project website).  The clever insight at the core of the paper is that one can detect a systematic (i.e., non-random) relationship between two variables by quantifying and normalizing their maximal mutual information. Mutual information (MI) is an information theory measure of how much information you have about one variable given knowledge of the other. You have high MI when you can accurately predict the level of one variable given knowledge of the other, and low MI when knowledge of one variable is unhelpful in predicting the other. Importantly, unlike other measures (e.g., the correlation coefficient), MI makes no assumptions about the form of the relationship between the variables; one can have high mutual information for non-linear associations as well as linear ones.

MI and various derivative measures have been around for a long time now; what’s innovative about the Reshef et al paper is that the authors figured out a way to efficiently estimate and normalize the maximal MI one can obtain for any two variables. The very clever approach the authors use is to overlay a series of grids on top of the data, and to keep altering the resolution of the grid and moving its lines around until one obtains the maximum possible MI. In essence, it’s like dropping a wire mesh on top of a scatterplot and playing with it until you’ve boxed in all of the data points in the most informative way possible. And the neat thing is, you can apply the technique to any kind of data at all, and capture a very broad range of systematic relationships, not just linear ones.

To give you an intuitive sense of how this works, consider this Figure from the supplemental material:

The underlying function here is sinusoidal. This is a potentially common type of association in many domains–e.g., it might explain the cyclical relationship between, say, coffee intake and hour of day (more coffee in the early morning and afternoon; less in between). But the linear correlation is essentially zero, so a typical analysis wouldn’t pick it up at all. On the other hand, the relationship itself is perfectly deterministic; if we can correctly identify the generative function in this case, we would have perfect information about Y given X. The question is how to capture this intuition algorithmically–especially given that real data are noisy.

This is where Reshef et al’s grid-based approach comes in. In the left panel above, you have a 2 x 8 grid overlaid on a sinusoidal function (the use of a 2 x 8 resolution here is just illustrative; the algorithm actually produces estimates for a wide range of grid resolutions). Even though it’s the optimal grid of that particular resolution, it still isn’t very good: knowing which row a particular point along the line falls into doesn’t tell you a whole lot about which column it falls into, and vice versa. In other words, mutual information is low. By contrast, the optimal 8 x 2 grid on the right side of the figure has a (perfect) MIC of 1: if you know which row in the grid a point on the line falls into, you can also determine which column it falls into with perfect accuracy. So the MIC approach will detect that there’s a perfectly systematic relationship between these two variables without any trouble, whereas the standard Pearson correlation would be 0 (i.e., no relation at all). There are a couple of other steps involved (e.g., one needs to normalize the mutual information to account for differences in grid resolution), but that’s the gist of it.
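
The grid idea can be sketched in a few lines of Python. To be clear, this is a deliberately simplified stand-in and not the authors’ algorithm: it only tries equal-frequency grids at each resolution (the real method also moves the grid lines around), and the function names, the cosine test function, and the `max_cells` cap are all my own assumptions. For each resolution it bins the data, computes MI, divides by the log of the smaller grid dimension, and keeps the maximum.

```python
import numpy as np

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding (roughly) equal numbers of points."""
    ranks = np.argsort(np.argsort(values))
    return ranks * k // len(values)

def binned_mi(x_bins, y_bins, nx, ny):
    """Mutual information (bits) of the nx-by-ny table of bin co-occurrences."""
    joint = np.zeros((nx, ny))
    np.add.at(joint, (x_bins, y_bins), 1)        # count the points in each cell
    joint /= joint.sum()
    expected = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log2(joint[nonzero] / expected[nonzero])))

def grid_score(x, y, max_cells):
    """Largest normalized MI over all nx-by-ny grids with nx * ny <= max_cells."""
    best = 0.0
    for nx in range(2, max_cells // 2 + 1):
        for ny in range(2, max_cells // nx + 1):
            mi = binned_mi(equal_frequency_bins(x, nx),
                           equal_frequency_bins(y, ny), nx, ny)
            # Dividing by log2(min(nx, ny)) puts every resolution on a 0-1 scale.
            best = max(best, mi / np.log2(min(nx, ny)))
    return best

# A noiseless periodic relationship: two full cycles of a cosine.
n = 1000
x = (np.arange(n) + 0.5) / n
y = np.cos(4 * np.pi * x)

pearson_r = np.corrcoef(x, y)[0, 1]
score = grid_score(x, y, max_cells=int(n ** 0.6))
print(f"Pearson r = {pearson_r:.3f}, grid score = {score:.3f}")
```

In this toy case a grid with 8 bins along x and 2 along y already separates the points perfectly, so the score reaches its ceiling even though the linear correlation is essentially zero. With noisy data, or with relationships that don’t line up with equal-frequency bins, this simplified version would fall short of what a proper search over grid placements can find.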

If the idea seems surprisingly simple, it is. But as with many very good ideas, hindsight is 20/20; it’s an idea that seems obvious once you hear it, but clearly wasn’t trivial to come up with (or someone would have done it a long time ago!). And of course, the simplicity of the core idea also shouldn’t blind us to the fact that there was undoubtedly a lot of very sophisticated work involved in figuring out how to normalize and bound the measure, proving that the approach works, and implementing a dynamic algorithm capable of computing good MIC estimates in a reasonable amount of time (this Harvard Gazette article suggests Reshef and colleagues worked on the various problems for three years).

The utility of MIC and its improvement over existing measures is probably best captured in Figure 2 from the paper:

Panel A shows the values one obtains with different measures when trying to capture different kinds of noiseless relationships (e.g., linear, exponential, and sinusoidal ones). The key point is that MIC assigns a value of 1 (the maximum) to every kind of association, whereas no other measure is capable of detecting the same range of associations with the same degree of sensitivity (and most fail horribly). By contrast, when given random data, MIC produces a value that tends towards zero (though it’s still not quite zero, a point I’ll come back to later). So what you effectively have is a measure that, with some caveats, can capture a very broad range of associations and place them on the same metric. The latter aspect is nicely captured in Panel G, which gives one a sense of what real (i.e., noisy) data corresponding to different MIC levels would look like. The main point is that, unlike other measures, a given value can correspond to very different types of associations. Admittedly, this may be a mixed blessing, since the flip side is that knowing the MIC value tells you almost nothing about what the association actually looks like (though Anscombe’s Quartet famously demonstrates that even a linear correlation can be misleading in this respect). But on the whole, I think it represents a potentially big advance in our ability to detect novel associations in a data-driven way.

Having introduced and explained the method, Reshef et al then go on to apply it to 4 very different datasets. I’ll just focus on one here–a set of global indicators from the World Health Organization (WHO). The dataset contains 357 variables, or 63,546 variable pairs. When plotting MIC against the Pearson correlation coefficient, the data look like this (panel A):

The main point to note is that while MIC detects most strong linear effects (e.g., panel D), it also detects quite a few associations that have low linear correlations (e.g., E, F, and G). Reshef et al note that many of these effects have sensible interpretations (e.g., they argue that the left trend line in panel F reflects predominantly Pacific Island nations where obesity is culturally valued, and hence increases with income), but would be completely overlooked by an automated data mining approach that focuses only on linear correlations. They go on to report a number of other interesting examples ranging from analyses of gut bacteria to baseball statistics. All in all, it’s a compelling demonstration of a new metric that could potentially play an important role in large-scale data mining analyses going forward.

That said, while the paper clearly represents an important advance for large-scale data mining efforts, it’s also quite light on caveats and limitations (even for a length-constrained Science paper). Some potential concerns that come to mind:

  • Reshef et al are understandably going to put their best foot forward, so we can expect that the ‘representative’ examples they display (e.g., the WHO scatter plots above) are among the cleanest effects in the data, and aren’t necessarily typical. There’s nothing wrong with this, but it’s worth keeping in mind that much (and perhaps most) of the time, the associations MIC identifies aren’t going to be quite so clear-cut. Reshef et al’s approach can help identify potentially interesting associations, but once they’re identified, it’s still up to the investigator to figure out how to characterize them.
  • MIC is a (potentially quite heavily) biased measure. While it’s true, as the authors suggest, that it will “tend to 0 for statistically independent variables”, in most situations, the observed value will be substantially larger than 0 even when variables are completely independent. This falls directly out of the ‘M’ in MIC, because when you take the maximal value from some larger search space as your estimate, you’re almost invariably going to end up capitalizing on chance to some degree. MIC will only tend to 0 when the sample size is very large; as this figure (from the supplemental material) shows, even with a sample size of n = 204, the MIC for independent variables will tend to hover somewhere around .15 for the parameterization used throughout the paper (the red line):
    This isn’t a huge deal, but it does mean that interpretation of small MIC values is going to be very difficult in practice, since the lower end of the distribution is going to depend heavily on sample size. And it’s quite unpleasant to have a putatively standardized metric of effect size whose interpretation depends to some extent on sample parameters.
  • Reshef et al don’t report any analyses quantifying the sensitivity of MIC compared to conventional metrics like Pearson’s correlation coefficient. Obviously, MIC can pick up on effects Pearson can’t; but a crucial question is whether MIC shows comparable sensitivity when effects are linear. Similarly, we don’t know how well MIC performs when sample sizes are substantially smaller than those Reshef et al use in their simulations and empirical analyses. If it breaks down with n’s on the order of, say, 50 – 100, that would be important to know. So it would be great to see follow-up work characterizing performance under such circumstances–preferably before a flood of papers is published that all use MIC to do data mining in relatively small data sets.
  • As Andrew Gelman points out here, it’s not entirely clear that one wants a measure that gives a high r-square-like value for pretty much any non-random association between variables. For instance, a perfect circle would get an MIC of 1 at the limit, which is potentially weird given that you can never deterministically predict y from x. I don’t have a strong feeling about this one way or the other, but can see why this might bother someone.
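
The bias concern in the second bullet is easy to see in a small simulation. The sketch below is my own illustration under simplifying assumptions, not the authors’ estimator: it takes the maximum normalized MI over equal-frequency grids only, and the function names and the cap of n^0.6 cells are hypothetical choices. Since the real algorithm also optimizes where the grid lines go, the exact values here shouldn’t be compared with the ones in the supplemental figure. The point is just the pattern: for independent data the maximum stays above zero and shrinks only as the sample grows.

```python
import numpy as np

def normalized_mi(x, y, nx, ny):
    """Normalized MI (0-1 scale) for an equal-frequency nx-by-ny grid."""
    n = len(x)
    x_bins = np.argsort(np.argsort(x)) * nx // n
    y_bins = np.argsort(np.argsort(y)) * ny // n
    joint = np.zeros((nx, ny))
    np.add.at(joint, (x_bins, y_bins), 1)        # count the points in each cell
    joint /= n
    expected = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0
    mi = np.sum(joint[nonzero] * np.log2(joint[nonzero] / expected[nonzero]))
    return float(mi / np.log2(min(nx, ny)))

def max_over_grids(x, y):
    """Maximum normalized MI over grids with at most n**0.6 cells (assumed cap)."""
    max_cells = int(len(x) ** 0.6)
    return max(normalized_mi(x, y, nx, ny)
               for nx in range(2, max_cells // 2 + 1)
               for ny in range(2, max_cells // nx + 1))

rng = np.random.default_rng(0)
mean_scores = {}
for n in (50, 200, 2000):
    # x and y are drawn independently, so the true association is zero.
    scores = [max_over_grids(rng.normal(size=n), rng.normal(size=n))
              for _ in range(20)]
    mean_scores[n] = float(np.mean(scores))
    print(n, round(mean_scores[n], 3))
```

Taking a maximum over many grids guarantees that some grid will fit the noise a little, and the smaller the sample, the more room there is for that to happen.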

Caveats aside though, from my perspective–as someone who likes to play with very large datasets but isn’t terribly statistically savvy–the Reshef et al paper seems like a really impressive piece of work that could have a big impact on at least some kinds of data mining analyses. I’d be curious to hear what more quantitatively sophisticated folks have to say.

Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, & Sabeti PC (2011). Detecting novel associations in large data sets. Science (New York, N.Y.), 334 (6062), 1518-24 PMID: 22174245